HTML Downloader
The HTML Downloader downloads web pages from the internet using the HTTP protocol. Before discussing the HTML Downloader, let us first look at the Robots Exclusion Protocol.
Robots.txt, known as the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download. Before attempting to crawl a website, a crawler should first check the site's robots.txt and follow its rules.
To avoid repeated downloads of the robots.txt file, we cache its results. The file is downloaded and saved to the cache periodically. Here is an excerpt of the robots.txt file taken from https://www.amazon.com/robots.txt. Some of the directories, such as creatorhub, are disallowed for Googlebot.
User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
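The robots.txt check described above can be sketched with Python's standard urllib.robotparser module. The following is only a minimal illustration: the in-memory cache, the one-day refresh interval, and the "MyCrawler" user agent are assumptions made for the example, not part of the design above.

```python
# Minimal sketch: check robots.txt before crawling, with a simple in-memory
# cache so each site's file is downloaded at most once per TTL.
# The TTL value and the "MyCrawler" user agent are assumptions.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 60 * 60          # assumed refresh period: one day
_robots_cache = {}                          # host -> (parser, fetch_time)

def is_allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    host = urlparse(url).netloc
    entry = _robots_cache.get(host)
    if entry is None or time.time() - entry[1] > ROBOTS_TTL_SECONDS:
        parser = RobotFileParser(f"https://{host}/robots.txt")
        parser.read()                       # download and parse robots.txt
        entry = (parser, time.time())
        _robots_cache[host] = entry
    return entry[0].can_fetch(user_agent, url)

# Example: per the excerpt above, Googlebot may not fetch /creatorhub/*
# is_allowed("https://www.amazon.com/creatorhub/foo", "Googlebot")  # -> False
```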
Besides robots.txt, performance optimization is another important concept we will cover for the HTML downloader.
Below is a list of performance optimizations for HTML downloaders.
To achieve high performance, crawl jobs are distributed across multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces, so each downloader is responsible only for a subset of the URLs. The picture below shows an example of a distributed crawl.
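As a complement to that example, here is a minimal sketch of partitioning the URL space by hashing the hostname, so that every URL from the same site lands on the same downloader. The server names and count are assumptions for illustration only.

```python
# Minimal sketch: partition the URL space across crawl servers by hashing the
# hostname. Keeping all URLs of a site on one server also simplifies
# politeness. Server names and count are hypothetical.
import hashlib
from urllib.parse import urlparse

CRAWL_SERVERS = ["crawler-1", "crawler-2", "crawler-3", "crawler-4"]

def assign_server(url: str) -> str:
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()
    return CRAWL_SERVERS[int(digest, 16) % len(CRAWL_SERVERS)]

# All URLs from the same host map to the same downloader:
# assign_server("https://example.com/a") == assign_server("https://example.com/b")
```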
The DNS Resolver is a bottleneck for crawlers because DNS requests can take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10 ms to 200 ms. Once a DNS request is carried out by a crawler thread, other threads are blocked until that request is completed. Maintaining our own DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. The DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
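The DNS cache described above can be sketched as a small in-memory mapping that is refreshed periodically. In the sketch below, refresh_cache() stands in for the cron job; the function and variable names are assumptions.

```python
# Minimal sketch of a DNS cache that keeps the domain name to IP address
# mapping. refresh_cache() is meant to be invoked periodically (e.g. by a
# cron job). Names are hypothetical.
import socket

_dns_cache = {}   # domain name -> IP address

def resolve(domain: str) -> str:
    """Return a cached IP if available, else fall back to a blocking lookup."""
    ip = _dns_cache.get(domain)
    if ip is None:
        ip = socket.gethostbyname(domain)   # synchronous call, ~10-200 ms
        _dns_cache[domain] = ip
    return ip

def refresh_cache() -> None:
    """Re-resolve every cached domain; run this periodically via cron."""
    for domain in list(_dns_cache):
        try:
            _dns_cache[domain] = socket.gethostbyname(domain)
        except socket.gaierror:
            pass   # keep the stale entry if the lookup fails
```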
Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download times. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.
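To make the idea concrete, here is a rough sketch of routing a URL to the crawl region presumed closest to its host. Real deployments would more likely use IP geolocation or latency measurements; the TLD-to-region table and region names below are assumptions purely for illustration.

```python
# Rough sketch: pick a crawl region for a URL based on its top-level domain.
# The region names and TLD mapping are hypothetical; production systems would
# rely on IP geolocation or measured latency instead.
from urllib.parse import urlparse

REGION_BY_TLD = {
    "de": "eu-crawlers", "fr": "eu-crawlers", "uk": "eu-crawlers",
    "jp": "apac-crawlers", "cn": "apac-crawlers", "in": "apac-crawlers",
}
DEFAULT_REGION = "us-crawlers"

def pick_region(url: str) -> str:
    tld = urlparse(url).netloc.rsplit(".", 1)[-1]
    return REGION_BY_TLD.get(tld, DEFAULT_REGION)

# pick_region("https://www.example.de/page")  # -> "eu-crawlers"
```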
Some web servers respond slowly or may not respond at all. To avoid long wait times, a maximal wait time is specified. If a host does not respond within a predefined time, the crawler stops the job and moves on to other pages.
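Such a maximal wait time can be enforced at the HTTP-client level. The sketch below uses the third-party requests library; the 5-second limit is an assumed value, not one given in the text.

```python
# Minimal sketch: give up on slow or unresponsive hosts after a fixed timeout.
# The 5-second limit is an assumption; tune it to the crawl's requirements.
import requests

DOWNLOAD_TIMEOUT_SECONDS = 5

def download(url: str):
    try:
        response = requests.get(url, timeout=DOWNLOAD_TIMEOUT_SECONDS)
        return response.text
    except requests.Timeout:
        return None        # host too slow: skip it and crawl other pages
    except requests.RequestException:
        return None        # other network errors are skipped as well
```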