
October 30, 2008

Corpus for nasty web spam

Researchers who study web spam are limited by the lack of available corpora. One collection that gets used quite often is "WEBSPAM-UK2007", released by Yahoo; there's also a 2006 version. It's really useful, but as the creators themselves say, it was generated to aid their own research, so it's biased towards their needs. Also, you can't compare results unless they're tested on the same collection.

The University of Milan downloaded the documents for the collection, starting from a set of hosts listed in DMOZ for the .uk domain and following links recursively in breadth-first order. A large group of volunteers then labelled the hosts.
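That crawl strategy boils down to a breadth-first traversal of the link graph. Here's a minimal sketch of the idea, using a toy in-memory link graph with made-up hostnames in place of real HTTP fetching and the actual DMOZ seed list:

```python
from collections import deque

# Toy link graph standing in for fetched pages (hostnames are hypothetical).
LINKS = {
    "seed.example.uk": ["a.example.uk", "b.example.uk"],
    "a.example.uk": ["c.example.uk"],
    "b.example.uk": ["c.example.uk", "d.example.uk"],
    "c.example.uk": [],
    "d.example.uk": [],
}

def bfs_crawl(seeds, max_hosts=1000):
    """Visit hosts breadth-first from a seed set, the way the collection was built."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_hosts:
        host = queue.popleft()          # FIFO queue gives breadth-first order
        order.append(host)
        for nxt in LINKS.get(host, []):
            if nxt not in seen:         # never re-enqueue a visited host
                seen.add(nxt)
                queue.append(nxt)
    return order

print(bfs_crawl(["seed.example.uk"]))
# → ['seed.example.uk', 'a.example.uk', 'b.example.uk', 'c.example.uk', 'd.example.uk']
```

A real crawler would replace the `LINKS` lookup with an HTTP fetch and link extraction, but the queue discipline is the same.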

Features they found to identify a spam host included the number of keywords in the URL, the anchor text of links, sponsored links, and content copied from search engine results.
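The first of those cues, keywords in the URL, is simple enough to sketch. The keyword list below is made up for illustration and is not the one the annotators actually used:

```python
import re

# Illustrative spam keyword list (hypothetical, not from the corpus guidelines).
SPAM_KEYWORDS = {"cheap", "free", "casino", "viagra", "loans"}

def url_keyword_count(url):
    """Count spammy keywords appearing as tokens in a URL."""
    tokens = re.split(r"[\W_]+", url.lower())   # split on non-word characters
    return sum(t in SPAM_KEYWORDS for t in tokens)

print(url_keyword_count("http://cheap-loans-free.example.uk/casino"))  # → 4
print(url_keyword_count("http://example.uk/about"))                    # → 0
```

On its own a count like this is a weak signal; it only becomes useful combined with the other cues listed above.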

The labels break down as:
  • 8123 hosts tagged as "normal"
  • 2113 tagged as "spam"
  • 426 tagged as "undecided"
Yahoo does a lot of work on web spam; check out the results of their evaluations at AIRWeb and the Web Spam Challenge.

This is also a good resource, listing the characteristics of spam pages and hosts.

Web spam is a really interesting research area because, at the end of the day, it's one of the most crippling things for a search engine. It ruins quality, takes up valuable resources in the index, and spoils the experience for users, spreading a lot of pain through our information-seeking community. It's by no means an easy problem to solve. Current approaches mostly examine link features, classified using methods such as SVMs. Maybe it's time to look beyond links?
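To make the SVM idea concrete, here's a toy linear classifier trained with hinge-loss sub-gradient updates, a simplified, Pegasos-style stand-in for a real SVM solver. The feature names and numbers are invented for illustration; real experiments use far richer link features and a proper SVM package:

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.001):
    """Toy linear SVM: sub-gradient descent on the regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # inside the margin: push the boundary away
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:               # safely classified: only apply regularisation
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical hand-made link features: [in-link density, sponsored-link ratio]
X = [[1.0, 0.9], [0.9, 0.8], [0.1, 0.2], [0.0, 0.1]]
y = [1, 1, -1, -1]  # 1 = spam host, -1 = normal host

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # separates the toy training set
```

The point of the sketch is just the shape of the approach: turn each host into a feature vector, learn a separating boundary, and flag what lands on the spam side.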


Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.