
December 05, 2008


Today I thought I'd share "Countering Web Spam with Credibility-Based Link Analysis" by James Caverlee (Texas A&M University) and Ling Liu (Georgia Institute of Technology), presented at PODC'07.

PageRank, TrustRank and HITS all couple link credibility with page quality, which isn't ideal: having good links doesn't necessarily mean that you have a quality page.  I think page authority and quality are very important areas of research right now.

So, these guys used a credibility-based link analysis and called it "CredibleRank".  The credibility of information is directly used in the quality assessment of each page.  It proves to be far more spam-resilient than both PageRank and TrustRank.  Those two algorithms rely on the assumption that the quality of a page and the quality of that page’s links correlate, which unfortunately leaves them open to spam.

CredibleRank incorporates credibility information directly into the quality assessment of each page on the Web.  

They found that a page’s link quality should depend on its own outlinks and on the quality of the outlinks of its neighbours.  So they use the local characteristics of pages and their place in the Web graph, as opposed to the global properties of the entire Web that the other algorithms use.

Relying on a whitelist (a set of known good pages) isn't very useful, because spammers can camouflage their low-quality outlinks to spam pages by also linking to known whitelist pages.  They advocate the use of a blacklist (known spam pages) instead, where a page's credibility is judged by its proximity to spam pages, so pages are penalised for low-quality outlinks.
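
To make the blacklist idea a bit more concrete, here is a rough sketch of how a credibility score could be derived from proximity to known spam.  This is my own simplified illustration, not the formula from the paper: the function name `credibility`, the step count `k`, the toy graph and the rule that a page with no outlinks gets full credibility are all assumptions.  The score is the fraction of random k-step walks from a page that never land on a blacklisted page.

```python
def credibility(links, blacklist, page, k=2):
    """Fraction of random k-step walks from `page` that avoid every blacklisted page.

    links: dict mapping each page to the list of pages it links to.
    blacklist: set of known spam pages.
    Illustrative scoring rule only, not the paper's exact definition.
    """
    if k == 0:
        return 1.0
    outs = links.get(page, [])
    if not outs:
        # Assumption: a page with no outlinks has nothing to be penalised for.
        return 1.0
    total = 0.0
    for target in outs:
        if target in blacklist:
            continue  # this walk hit spam, so it contributes 0
        total += credibility(links, blacklist, target, k - 1)
    return total / len(outs)


# Hypothetical toy graph: "farm" links to one good page to camouflage itself.
links = {
    "news": ["wiki", "blog"],
    "blog": ["wiki", "spam1"],
    "wiki": ["news"],
    "farm": ["spam1", "spam2", "wiki"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
blacklist = {"spam1", "spam2"}

scores = {p: credibility(links, blacklist, p) for p in links if p not in blacklist}
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.2f}")
```

In this toy graph "farm" still scores poorly even though it links to a good page, because the score looks at where its outlinks lead rather than at who it claims to be associated with.  "news" also loses a little, since one of its neighbours ("blog") links to spam.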

"First, the initial score distribution for the iterative PageRank calculation (which is typically taken to be a uniform distribution) can be seeded to favor high credibility pages. While this modification may impact the convergence rate of PageRank, it has no impact on ranking quality since the iterative calculation will converge to a single final PageRank vector regardless of the initial score distribution." 

They found that CredibleRank does not negatively impact good sites.  They compared the ranking of each whitelist site under PageRank against its ranking under CredibleRank, and the fluctuation was only 26 spots, so it isn't unfairly treating clean sites.

So far it proves to be spam-resilient and efficient, and it outperforms TrustRank and PageRank.  Excellent stuff.


theGypsy said...

Nice stuff once more... reading it did remind me of the whole 'trustrank' approach... One more for your list would be Yahoo's HarmonicRank;

Have a good weekend and shall cya on the trails!!


Experiments in Cyberspace said...

A very good approach, indeed. One can typically assume that backlinks relate to popularity rather than quality, hence more subject to spam. Whereas, quality content is more likely to link to other quality content, albeit less popular.

CJ said...

Hey Dave,

thanks for that, always a pleasure when our paths cross :)

EIC, you're quite right, not all quality content has a lot of links to it. Evaluating page authority is difficult.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at