URL frontier
A URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded; it is an important component for ensuring politeness, URL prioritization, and freshness. A few noteworthy papers on the URL frontier are mentioned in the reference materials [5] [9]. The findings from these papers are as follows:
Politeness

Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered "impolite" and may even be treated as a denial-of-service (DoS) attack. For example, without any constraint, the crawler could send thousands of requests per second to the same website, overwhelming its web servers.
The general idea of enforcing politeness is to download one page at a time from the same host, with a delay added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. The picture below shows the design that manages politeness; a short code sketch follows the component list.
Queue router: It ensures that each queue (b1, b2, ... bn) only contains URLs from the same host.
Mapping table: It maps each host to a queue.
FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
Queue selector: Each worker thread is mapped to a FIFO queue and downloads URLs only from that queue; the queue selector handles this mapping.
Worker threads 1 to N: Each worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks.
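To make the mapping concrete, here is a minimal Python sketch of the politeness design, assuming hash-based routing from hostname to queue and a fixed delay between downloads. The class and method names are hypothetical, not taken from the reference papers.

```python
import threading
import time
from collections import deque
from urllib.parse import urlparse

class PolitenessFrontier:
    """Back-queue sketch: each host maps to exactly one FIFO queue,
    and each worker thread drains exactly one queue."""

    def __init__(self, num_queues, delay_seconds=1.0):
        self.queues = [deque() for _ in range(num_queues)]
        self.host_to_queue = {}          # mapping table: hostname -> queue index
        self.delay_seconds = delay_seconds
        self.lock = threading.Lock()

    def enqueue(self, url):
        host = urlparse(url).hostname
        with self.lock:
            if host not in self.host_to_queue:
                # Queue router: assign the host to a queue (hash-based here).
                self.host_to_queue[host] = hash(host) % len(self.queues)
            self.queues[self.host_to_queue[host]].append(url)

    def worker_loop(self, queue_id, download):
        # Worker thread: downloads pages one by one from its own queue,
        # so requests to the same host are naturally serialized.
        while True:
            with self.lock:
                queue = self.queues[queue_id]
                url = queue.popleft() if queue else None
            if url is not None:
                download(url)
            time.sleep(self.delay_seconds)  # politeness delay between downloads
```

Each worker thread would run `worker_loop` with its own queue id, for example `threading.Thread(target=frontier.worker_loop, args=(i, download_fn))`, so no two threads ever hit the same host concurrently.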
Priority

A random post from a discussion forum about Apple products carries a very different weight than a post on the Apple home page. Even though both contain the "Apple" keyword, it is sensible for a crawler to crawl the Apple home page first.
We prioritize URLs based on usefulness, which can be measured by PageRank, website traffic, update frequency, etc. “Prioritizer” is the component that handles URL prioritization. Refer to the reference materials for in-depth information about this concept.
The picture below shows the design that manages URL priority; a short code sketch follows the component list.
Prioritizer: It takes URLs as input and computes the priorities.
Queue f1 to fn: Each queue has an assigned priority. Queues with high priority are selected with higher probability.
Queue selector: It randomly chooses a queue, with a bias toward queues with higher priority.
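The following Python sketch illustrates the biased queue selection, assuming the prioritizer has already mapped each URL to a priority index in [0, n). The linear weighting scheme is an assumption for illustration, not prescribed by the design.

```python
import random
from collections import deque

class PriorityFrontier:
    """Front-queue sketch: queue 0 holds the highest-priority URLs."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        # Higher-priority queues get larger selection weights (a simple
        # linear scheme; real systems would tune these weights).
        self.weights = [num_queues - i for i in range(num_queues)]

    def enqueue(self, url, priority):
        # `priority` is assumed to be an index in [0, num_queues),
        # already computed by the prioritizer (e.g., from PageRank).
        self.queues[priority].append(url)

    def dequeue(self):
        # Biased random selection: any non-empty queue may be chosen,
        # but high-priority queues are chosen more often.
        candidates = [i for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        chosen = random.choices(candidates,
                                weights=[self.weights[i] for i in candidates])[0]
        return self.queues[chosen].popleft()
```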
The picture below presents the URL frontier design, which contains two modules (a short sketch wiring them together follows the list):
Front queues: manage prioritization
Back queues: manage politeness
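As a rough illustration of how the two modules compose, the snippet below wires together the two sketch classes defined above (it assumes PriorityFrontier and PolitenessFrontier are in scope). A real frontier would refill the back queues continuously rather than one URL at a time.

```python
# Hypothetical glue code reusing the two sketch classes defined above.
front = PriorityFrontier(num_queues=4)                       # front queues: prioritization
back = PolitenessFrontier(num_queues=8, delay_seconds=1.0)   # back queues: politeness

def add_url(url, priority):
    # New and recrawled URLs enter through the prioritized front queues.
    front.enqueue(url, priority)

def refill_back_queues():
    # Move a URL from the front module to the host-partitioned back queues.
    url = front.dequeue()
    if url is not None:
        back.enqueue(url)
```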
Freshness

Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed below (a small scheduling sketch follows the list):
Recrawl based on web pages’ update history.
Prioritize URLs and recrawl important pages first and more frequently.
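One common way to implement the first strategy is an adaptive recrawl interval that shortens when a page is observed to change and lengthens when it is stable. The sketch below is a simple exponential heuristic; the bounds and the factor of 2 are assumptions for illustration, not values from the reference papers.

```python
from datetime import datetime, timedelta

def next_recrawl(last_crawl, page_changed, interval,
                 min_interval=timedelta(hours=1),
                 max_interval=timedelta(days=30)):
    """Adapt a page's recrawl interval from its observed update history."""
    if page_changed:
        # The page changed since the last crawl: visit it sooner next time.
        interval = max(min_interval, interval / 2)
    else:
        # The page was unchanged: back off and spend the budget elsewhere.
        interval = min(max_interval, interval * 2)
    return last_crawl + interval, interval

# Example: a page just crawled, found changed, on a 1-day schedule.
next_time, new_interval = next_recrawl(datetime.utcnow(), True, timedelta(days=1))
```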
Storage for URL frontier

In a real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions [4]. Putting everything in memory is neither durable nor scalable. Keeping everything on disk is undesirable as well, because disk is slow and can easily become a bottleneck for the crawl.
We adopt a hybrid approach. The majority of URLs are stored on disk, so storage space is not a problem. To reduce the cost of reading from and writing to disk, we maintain buffers in memory for enqueue/dequeue operations; data in the buffer is periodically written to disk. A minimal sketch of this buffering scheme follows.
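The sketch below assumes a single append-only file as the disk store; the names, the pickle-based format, and the batch sizes are illustrative only, and a production frontier would use segment files with compaction rather than one growing file.

```python
import pickle
from collections import deque

class BufferedFrontierStorage:
    """Hybrid storage sketch: most URLs live on disk, while small
    in-memory buffers absorb enqueue/dequeue traffic in batches."""

    def __init__(self, disk_path, buffer_size=10_000):
        self.disk_path = disk_path
        self.buffer_size = buffer_size
        self.enqueue_buffer = deque()   # absorbs writes before they hit disk
        self.dequeue_buffer = deque()   # serves reads loaded from disk
        self._read_offset = 0           # position of the next unread batch

    def enqueue(self, url):
        self.enqueue_buffer.append(url)
        if len(self.enqueue_buffer) >= self.buffer_size:
            self._flush_to_disk()       # periodic batch write

    def dequeue(self):
        if not self.dequeue_buffer:
            self._load_from_disk()      # batch read refills the buffer
        return self.dequeue_buffer.popleft() if self.dequeue_buffer else None

    def _flush_to_disk(self):
        # One sequential write per batch keeps disk I/O cheap.
        with open(self.disk_path, "ab") as f:
            pickle.dump(list(self.enqueue_buffer), f)
        self.enqueue_buffer.clear()

    def _load_from_disk(self):
        # Read the next pickled batch from the saved offset.
        try:
            with open(self.disk_path, "rb") as f:
                f.seek(self._read_offset)
                self.dequeue_buffer.extend(pickle.load(f))
                self._read_offset = f.tell()
        except (FileNotFoundError, EOFError):
            pass
```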