Crawler software
Crawler software is commonly used to collect large amounts of information from the web; a crawler that exploits vulnerabilities to harvest data is called a malicious crawler. A web crawler is a program that automatically fetches web pages, downloading them from the World Wide Web on behalf of a search engine, of which it is a core component.

A traditional crawler starts from the URLs of one or more seed pages and extracts the URLs found on them. As it crawls, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition defined by the system is met.

The workflow of a focused crawler is more complex. Using a web-page analysis algorithm, it filters out links that are irrelevant to its topic, keeps the useful ones, and places them in the URL queue of pages waiting to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a system-defined condition is reached. All pages fetched by the crawler are stored, analyzed, filtered, and indexed by the system for later querying and retrieval; for a focused crawler, the results of this analysis can also feed back into and guide subsequent crawling. A minimal sketch of this queue-based workflow is shown below.
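The sketch below illustrates the queue-based workflow described above, using only the Python standard library. The seed URL, the page limit, and the optional is_relevant predicate (standing in for the focused crawler's page-analysis and link-filtering step) are illustrative assumptions, not details taken from any particular crawler implementation.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50, is_relevant=None):
    """Breadth-first crawl over a URL queue.

    A URL is taken from the queue, the page is fetched and stored, its
    links are extracted, and unseen URLs are enqueued. The crawl stops
    once max_pages pages have been fetched (the stop condition).
    If is_relevant is given, only links that pass it are enqueued,
    which is the filtering step a focused crawler performs.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that cannot be fetched

        pages[url] = html  # store the page for later analysis and indexing

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if not absolute.startswith("http") or absolute in seen:
                continue
            if is_relevant is not None and not is_relevant(absolute):
                continue  # focused crawler: drop links judged off-topic
            seen.add(absolute)
            queue.append(absolute)

    return pages


if __name__ == "__main__":
    # Hypothetical seed URL and keyword filter, used purely for illustration.
    result = crawl(
        ["https://example.com/"],
        max_pages=5,
        is_relevant=lambda u: "crawler" in u or "example.com" in u,
    )
    print(f"Fetched {len(result)} pages")
```

A real crawler would add politeness controls (robots.txt handling, rate limiting) and a more capable page-analysis step, but the queue of pending URLs, the seen-set, and the stop condition correspond directly to the workflow the paragraph describes.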