Robustness

Besides performance, robustness is an important consideration. Here are a few approaches to improve the system's robustness:

  • Consistent hashing: This helps distribute load among downloaders. With consistent hashing, a downloader server can be added or removed while remapping only a small share of the load. Refer to Chapter 5: Design consistent hashing for more details.

  • Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data.

  • Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.

  • Data validation: Validating data before it is processed or stored is an important measure to prevent system errors.
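
To make the consistent hashing point concrete, here is a minimal Python sketch of a hash ring that assigns hostnames to downloader servers. The class name `ConsistentHashRing`, the server names, the use of MD5, and the figure of 100 virtual nodes per server are all assumptions chosen for illustration; they are not taken from the design above.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys (e.g. hostnames) to downloader servers on a hash ring."""

    def __init__(self, vnodes=100):
        self.vnodes = vnodes  # virtual nodes per server (assumed value)
        self._ring = []       # sorted list of (position, server)

    @staticmethod
    def _hash(key):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Place several virtual nodes per server to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("no downloader servers on the ring")
        # First virtual node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical servers and hostnames, for demonstration only.
ring = ConsistentHashRing()
for name in ("downloader-1", "downloader-2", "downloader-3"):
    ring.add(name)

hosts = [f"site{i}.example.com" for i in range(1000)]
before = {h: ring.lookup(h) for h in hosts}

ring.add("downloader-4")
after = {h: ring.lookup(h) for h in hosts}

# Only hosts claimed by the new server change owner; the rest stay put.
moved = [h for h in hosts if before[h] != after[h]]
```

In this sketch, adding `downloader-4` reassigns only the hosts that the new server takes over, and removing it again restores the earlier assignment. A naive `hash(host) % server_count` scheme would instead reshuffle most hosts whenever the server count changes.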

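Saving crawl state and handling exceptions can be sketched together. The example below is one possible approach, not the chapter's implementation: the function names, the JSON file format, the `fetch` callback, and the `checkpoint_every` parameter are hypothetical. It also assumes that a URL whose fetch fails is recorded and skipped rather than retried.

```python
import json
import os
import tempfile
from collections import deque


def save_state(path, frontier, visited):
    """Write crawl state atomically so a crash never leaves a half-written file."""
    state = {"frontier": list(frontier), "visited": sorted(visited)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename within the same directory


def load_state(path):
    """Return (frontier, visited); both are empty if no checkpoint exists."""
    if not os.path.exists(path):
        return deque(), set()
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["frontier"]), set(state["visited"])


def crawl(frontier, visited, fetch, state_path, checkpoint_every=100):
    """Process the frontier, checkpointing periodically and surviving bad pages."""
    failed = []
    processed = 0
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            links = fetch(url)  # fetch(url) returns the links found on the page
        except Exception as exc:
            # One bad page must not stop the whole crawl: record it and move on.
            failed.append((url, str(exc)))
            links = []
        visited.add(url)
        frontier.extend(link for link in links if link not in visited)
        processed += 1
        if processed % checkpoint_every == 0:
            save_state(state_path, frontier, visited)
    save_state(state_path, frontier, visited)  # final checkpoint
    return failed
```

A restarted crawler would call `load_state` first and pass the returned frontier and visited set to `crawl`, so work completed before the disruption is not repeated. Catching the broad `Exception` keeps the sketch short; a production crawler would more likely catch specific network and parsing errors and retry transient ones.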