The Science of Crawl (Part 2): Content Freshness — URX Blog

Here at URX we are building the world's first mobile app search API. Backing this API is a search engine containing a large corpus of web documents, meticulously maintained and carefully grown by crawling internet content. We've discovered that a functional crawler can be built relatively cheaply, but a robust crawler requires overcoming several technical challenges. In this series of blog posts, we walk through a few of them, including content deduplication, link prioritization, feature extraction, and re-crawl estimation.

