Science for SEO: Corpus for nasty web spam

October 30, 2008

Corpus for nasty web spam

Researchers who study web spam are limited by the lack of available corpora. One that gets used quite often is "WEBSPAM-UK2007", released by Yahoo; there's also a 2006 version. It's really useful but, as they say, it was generated to aid the researchers, so it's biased towards their needs. Also, you can't compare results unless they're tested on the same collection.

The University of Milan downloaded loads of documents for the collection, starting from a set of hosts listed in DMOZ for the .uk domain. They followed links recursively in breadth-first mode. Then lots of volunteers tagged the results.
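To make "breadth-first mode" concrete, here is a minimal sketch of the visiting order such a crawl produces. It runs over a small in-memory link graph rather than the real web, and the host names, the `toy_graph` data and the `bfs_crawl_order` function are all made up for illustration; this is not the crawler that was actually used for the collection.

```python
from collections import deque


def bfs_crawl_order(link_graph, seeds, max_pages=None):
    """Return hosts in the order a breadth-first crawl would visit them.

    link_graph maps a host to the list of hosts it links to (hypothetical data).
    seeds is the starting set, e.g. hosts taken from a directory listing.
    """
    visited = set()
    queue = deque()
    for seed in seeds:
        if seed not in visited:
            visited.add(seed)
            queue.append(seed)

    order = []
    while queue:
        if max_pages is not None and len(order) >= max_pages:
            break
        host = queue.popleft()  # FIFO queue: nearest hosts are visited first
        order.append(host)
        for target in link_graph.get(host, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


# Toy link graph with invented host names.
toy_graph = {
    "a.example.co.uk": ["b.example.co.uk", "c.example.co.uk"],
    "b.example.co.uk": ["d.example.co.uk"],
    "c.example.co.uk": ["d.example.co.uk", "a.example.co.uk"],
    "d.example.co.uk": [],
}

order = bfs_crawl_order(toy_graph, ["a.example.co.uk"])
print(order)
```

The queue means everything one link away from the seeds is fetched before anything two links away, so a crawl that stops early still covers the neighbourhood of the seed hosts fairly evenly.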

The things they found that identified a spam host were the number of keywords in the URL, the anchor text in links, sponsored links, and content copied from search engine results.
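As a rough illustration of the first of those signals, the sketch below counts how many tokens in a URL's host name and path match a keyword list. The keyword set, the example URLs and the `url_keyword_count` helper are assumptions made up for this example; the labelling guidelines for the collection do not prescribe any particular list or threshold.

```python
import re
from urllib.parse import urlsplit


def url_keyword_count(url, keywords):
    """Count tokens in the host name and path that appear in `keywords`.

    Repeated keywords are counted each time, since repetition is itself
    a hint of keyword stuffing.
    """
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    # Split on anything that is not a letter or digit (dots, dashes, slashes).
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    return sum(1 for t in tokens if t in keywords)


# Hypothetical keyword list and URLs, for illustration only.
commercial_keywords = {"cheap", "loans", "insurance", "best"}

stuffed = url_keyword_count(
    "http://cheap-loans-cheap-insurance.example.co.uk/best-cheap-loans.html",
    commercial_keywords,
)
plain = url_keyword_count("http://www.example.co.uk/about", commercial_keywords)
print(stuffed, plain)
```

On its own a count like this proves nothing, because plenty of honest sites have descriptive URLs. It would normally be one feature among several fed to a classifier.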

The tagged hosts break down as follows:

8123 tagged as "normal"
2113 tagged as "spam"
426 tagged as "undecided"
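The share of each label is worth working out, because a classifier trained on unbalanced classes can look accurate just by predicting the majority label. A quick calculation from the three counts above (the variable names are only for illustration):

```python
# Label counts as listed above.
label_counts = {"normal": 8123, "spam": 2113, "undecided": 426}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

So roughly one tagged item in five is spam, and a model that always answers "normal" would already be right about three times out of four.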

Yahoo do loads of work on web spam. Check out the results of their tests at AIRWeb and "the web spam challenge".

This is also a good resource for you, listing the characteristics of nasty spam things.

It's really interesting to research web spam because, at the end of the day, it's one of the most crippling things for a search engine. It ruins quality and is highly unwelcome in the index, where it takes up valuable resources. It also ruins the experience for users, and basically spreads a lot of pain in our information seeking community. It's by no means an easy problem to solve. Links are mostly analysed using machine learning methods such as SVMs (support vector machines). Maybe it's time to look beyond links?
