The paper "RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee" (Cho and Schonfeld, University of California) tackles two important questions: how many pages must be collected to cover most of the web, and how can we ensure that important documents are not left out when the crawl is halted? The authors answer these questions by showing that a crawler which targets the most important pages, or the most important part of the web, can bound how much it needs to download. This matters because a large corpus is expensive to store and costly to process.
They argue that comparing search engines by the number of pages they index is misleading because of the sheer amount of data available on the web. For example, calendar pages generated by dynamic sites link to a "next day" page, and so on indefinitely, meaning that potentially useless information gets collected. Moreover, no search engine is capable of downloading the entire web; we don't even know how big it is. So when should a crawl stop? At 8 billion pages, like Google?
Their metric, RankMass, compares the quality of search engine indexes. It is a variant of the Personalised PageRank metric, which assumes that users gravitate towards important pages. The twist is that RankMass measures how much of the web's total PageRank a downloaded set of pages covers, and this coverage can be bounded even though the pages outside the subset are unknown. Their crawler focuses on the pages users actually visit: it prioritises the crawl by downloading pages with high personalised PageRank first, so that when the crawl is halted, the highest possible RankMass has been achieved.
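The core idea is easy to state in code. A minimal sketch, assuming we already have personalised PageRank scores for every page (the scores and page names below are illustrative, not from the paper): the RankMass of a downloaded subset is simply the sum of the scores of the pages in it.

```python
def rankmass(downloaded, pagerank):
    """Fraction of total personalised PageRank covered by `downloaded`."""
    total = sum(pagerank.values())
    covered = sum(pagerank[p] for p in downloaded if p in pagerank)
    return covered / total

# Illustrative scores (hypothetical, not from the paper):
scores = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(rankmass({"a", "b"}, scores))  # a and b hold 80% of the mass
```

The paper's contribution is that this coverage can be guaranteed without knowing the scores of undownloaded pages; the sketch above only shows what is being measured.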
How are pages deemed important?
This could, as they say, be based on relevance to a set of queries, but that would mean having a set of queries to start with, which isn't ideal. They note that PageRank, with its random surfer model, is very effective, but as we have seen it can easily be spammed (by webmasters and SEO people, I presume!). Personalised PageRank instead assumes that when the surfer jumps, it goes to a trusted page rather than to any page with equal probability.
The RankMass metric is defined over the link structure of the whole web rather than just the graph structure of the subset, yet it can be bounded without downloading a huge amount. The crawler downloads the pages reachable in the neighbourhood of a trusted page and calculates RankMass from those, though the authors note that users are unlikely to trust only a single page. This basic method is greedy, however, so they adapted it to form the "Windowed-RankMass algorithm".
"The Windowed-RankMass algorithm is an adaptation of the RankMass algorithm and is designed to allow us to reduce the overhead by batching together sets of probability calculations and downloading sets of pages at a time."
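The batching idea can be sketched loosely as follows. This is not the paper's algorithm: the real crawler maintains PageRank lower bounds per page, and `estimate_priority` below is a hypothetical stand-in for that calculation. The point is only that priorities are recomputed once per window of downloads rather than once per page.

```python
def windowed_crawl(seeds, fetch, estimate_priority, window=100, budget=1000):
    """Download pages in priority-ordered batches of size `window`."""
    downloaded = set()
    frontier = set(seeds)
    while frontier and len(downloaded) < budget:
        # Recompute priorities once per batch, not once per page.
        batch = sorted(frontier, key=estimate_priority, reverse=True)[:window]
        for url in batch:
            frontier.discard(url)
            downloaded.add(url)
            for link in fetch(url):  # fetch(url) returns the page's out-links
                if link not in downloaded:
                    frontier.add(link)
    return downloaded

# Toy graph in place of real HTTP fetches:
graph = {"s": ["x", "y"], "x": [], "y": ["z"], "z": []}
crawled = windowed_crawl(["s"], lambda u: graph.get(u, []),
                         lambda u: 0, window=2, budget=10)
```

A larger window means less recomputation overhead but a coarser approximation of the greedy order; the paper explores exactly this trade-off.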
The starting point of the crawl is a set of seeds, a number of documents from which the crawl begins:
"Deciding on the number of seeds is influenced by many factors such as: the connectivity of the web, spammability of the search engine, efficiency of the crawl, and many other factors".
Their evaluation and experiments showed that the RankMass crawler, which greedily accumulates PageRank mass, is very effective. It allows search engines to specify an end condition for the crawl based on coverage: the crawl runs until the required percentage of the web's PageRank has been collected.
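That stopping rule is worth making concrete. A toy illustration, assuming each downloaded page contributes a known PageRank lower bound (the pages and bounds below are hypothetical): the crawl halts as soon as the accumulated mass reaches the requested fraction.

```python
def crawl_until(pages_with_bounds, target=0.95):
    """Crawl in order, stopping once guaranteed coverage reaches `target`."""
    covered, crawled = 0.0, []
    for page, bound in pages_with_bounds:
        crawled.append(page)
        covered += bound
        if covered >= target:
            break  # coverage guarantee met; stop the crawl here
    return crawled, covered

# Hypothetical pages with their PageRank lower bounds:
pages = [("a", 0.6), ("b", 0.25), ("c", 0.1), ("d", 0.05)]
crawled, covered = crawl_until(pages, target=0.9)
```

Here the crawl stops after three pages because their combined mass already guarantees 90% coverage, and the fourth page is never fetched.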
Why should you care?
This interesting paper shows that search engines can have smaller indices and still prove very effective. This should improve both precision and recall, since fewer unimportant documents are considered in the computation stages. The constant talk of, and wow-factor associated with, huge index sizes turn out to be rather irrelevant once you consider the actual quality of those indices. The bigger the index, the harder it is to manage.
You should care because the quality of your sites becomes crucial: not only the way they are built, but also their ability to attract users based on how important they are perceived to be.