
October 09, 2008

The Random surfer becomes the Cautious surfer


An interesting paper: "Incorporating Trust into Web Search", written by Lan Nie, Baoning Wu, and Brian D. Davison of Lehigh University.

This paper deals with the issue of spam: pages engineered to deceive the search engines.  The authors say that ranking systems should take the trustworthiness of a source into consideration.  TrustRank seeks to solve this issue, but they propose to:

"incorporate a given trust estimate into the process of calculating authority for a cautious surfer."

First, some very brief background, in case you need a refresher:

The "Random surfer" is part of the PageRank algorithm.  It represents a user clicking at random, with no real goal.  The probability that s/he clicks on a link is determined by the number of links on that page.  This explains why PageRank is not entirely passed on to the page it links to but is dependant on the number of links on that page.  It's all based on probablilities.  

The "damping factor" is the probability of the random surfer not stopping to  click on links.  It's always set at a value between 0 and 1.   The closer to 1 the score is, the more likely s/he is going to click on links.  Google sets this to 0.85 to start with.  Not only does it allow for a score to be assigned to a page but it speeds up computations as well. 

Now for the goods:

They note that, with a little knowledge of search engines, it's easy to add keywords to content or generate inbound links (something we are all familiar with).  They rightly call this "spam".  It does affect the results, and that has been the role of SEO for some time, but today I believe that SEOs work far closer with the search engines than they ever did before, so better practices are at work.

They mention how PageRank calculates an authority score based on the number and quality of inbound links, and how HITS looks at hubs that link to important pages.  The issue with these methods, they state, is that they assume the content and links can be trusted.
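
For comparison, here is an equally rough sketch of the HITS loop: authorities are pages pointed to by good hubs, and hubs are pages pointing to good authorities. Again, the toy graph and the normalisation step are illustrative assumptions, not the tuned algorithm.

    # A rough sketch of HITS. Toy graph and L2 normalisation are assumptions.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": [],
        "D": ["B", "C"],
    }

    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(20):
        # a page's authority: sum of the hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # a page's hub score: sum of the authority scores of pages it links to
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalise so the scores don't grow without bound
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    print(auth)  # "C" scores highest as an authority; "A" and "D" as hubs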

They say that PageRank and TrustRank can't be used to calculate authority effectively:

"The main reason is that algorithms based on propagation of trust depend critically on large, representative starting seed sets to propagate trust (and possibly distrust) across the remaining pages.

In practice, selecting (and labeling) such a set optimally is not likely to be feasible, and so labeled seed sets are expected to be only a tiny portion of the whole web. As a result, many pages may not have any trust or distrust value just because there is no path from the seed pages. Thus, we argue that estimates of trust are better used as hints to guide the calculation of authority, not replace such calculations."

Basically, it's not easy to select and label a large set of pages deemed trustworthy, so any seed set you create won't be big enough to be effective: pages with no path from a seed end up with no trust value at all.
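
A quick sketch of the problem they describe: propagate trust from a tiny hand-labelled seed set, biased-PageRank style, and watch pages with no path from a seed end up with no trust at all. The graph and seed choice below are made up for illustration; this is the flavour of TrustRank-style propagation, not the paper's exact formulation.

    # The seed-set problem, sketched: trust flows outward from a tiny seed set.
    # Graph, seed choice and damping are assumptions for illustration.
    links = {
        "seed":    ["A"],
        "A":       ["B"],
        "B":       [],
        "island":  ["island2"],  # nothing connects the seed to these two
        "island2": [],
    }

    d = 0.85
    pages = list(links)
    seeds = {"seed"}
    trust = {p: (1.0 if p in seeds else 0.0) for p in pages}

    for _ in range(50):
        # teleportation goes back to the seed set, not to random pages
        new = {p: ((1 - d) / len(seeds) if p in seeds else 0.0) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new[target] += d * trust[page] / len(outlinks)
        trust = new

    print(trust)  # "island" and "island2" stay at 0.0: no seed path, no trust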

In their method, they penalize spam pages and leave good ones untouched. 

The "Cautious surfer" attempts to stay away from spam pages.  They altered the "Random surfer" damping factor, which is usually set to 0.85.  This damping factor is altered based on the trustworthiness of a page.  This causes PageRank however to treat all links as a potential next page for the random surfer, and these may not all be trustworthy.  They dynamically changed the damping factor to address this issue.

They found that their method could improve PageRank precision at 10 by 11-26% and improve the top 10 result quality by 53-81%.
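
For anyone unfamiliar with the metric, "precision at 10" is simply the fraction of the top ten results a judge would mark relevant; the toy judgements below are invented purely to show the computation.

    # "Precision at 10": relevant results in the top ten, divided by ten.
    def precision_at_k(judged_relevant, k=10):
        """judged_relevant: booleans for the top-k results, in rank order."""
        return sum(judged_relevant[:k]) / k

    print(precision_at_k([True, True, False, True, False,
                          True, False, False, True, False]))  # -> 0.5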

Basically, the idea is that applying the "Cautious surfer" model to existing ranking algorithms will significantly improve their performance.

This would mean that SEO doesn't change that much, seeing as most of us are striving to deliver reliable, rich content to users.  I think it would, however, crack down more effectively on some widely used techniques, like keyword stuffing and link buying.  In fact, getting links from rubbish places may mean that a site incurs a penalty.

For a lot more detailed information on this, read the paper.
 
