The University of Milan crawled the documents for the collection, starting from a set of hosts listed in DMOZ for the .uk domain and following links recursively in breadth-first order. A group of volunteers then tagged the hosts by hand.
Signals they found useful for identifying a spam host included the number of keywords in the URL, the anchor text of links, sponsored links, and content copied from search-engine results pages. The tagging produced:
- 8123 tagged as "normal"
- 2113 tagged as "spam"
- 426 tagged as "undecided"
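The first of those signals, keyword stuffing in the URL, can be approximated with a simple token count. A minimal sketch (the stop-token set and length threshold here are illustrative guesses, not values from the dataset):

```python
import re

# Illustrative stop tokens: the scheme plus common subdomain/TLD pieces.
STOP_TOKENS = {"http", "https", "www", "com", "uk", "co"}

def url_keyword_count(url):
    """Count keyword-like tokens in a URL; spam hosts often
    stuff many keywords into the hostname and path."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return sum(1 for t in tokens if len(t) > 2 and t not in STOP_TOKENS)

print(url_keyword_count("http://cheap-loans-credit-cards-insurance.example.co.uk/"))  # → 6
print(url_keyword_count("http://www.example.co.uk/"))                                 # → 1
```

A real system would combine this count with the other signals above rather than thresholding it on its own.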
Yahoo! does a lot of work on web spam; check out the results of their experiments at AIRWeb and the Web Spam Challenge.
This is also a good resource, listing the characteristic features of spam hosts.
It's really interesting to research web spam because, at the end of the day, it's one of the most crippling problems for a search engine. It ruins result quality, takes up valuable resources in the index, and spoils the experience for users, spreading pain throughout the information-seeking community. It's by no means an easy problem to solve. Link features are mostly classified using methods such as SVMs; maybe it's time to look beyond links?
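To make that classification step concrete, here is a toy linear classifier trained on per-host features. Note the hedges: this uses the simpler perceptron algorithm rather than a true SVM (to stay dependency-free), and the feature values and labels are invented for illustration, not taken from the dataset:

```python
def train_perceptron(data, labels, epochs=1000):
    """Classic perceptron: loop until a full pass makes no mistakes
    (guaranteed to happen on linearly separable data)."""
    w = [0.0] * len(data[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score <= 0:            # misclassified or on the boundary
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                 # converged: a clean pass over the data
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Invented per-host features: [keywords in URL, fraction of sponsored links]
hosts = [[12, 0.9], [9, 0.7], [10, 0.8], [1, 0.0], [2, 0.1], [0, 0.05]]
labels = [1, 1, 1, -1, -1, -1]            # +1 = spam, -1 = normal
w, b = train_perceptron(hosts, labels)
print([predict(w, b, x) for x in hosts])  # → [1, 1, 1, -1, -1, -1]
```

An SVM would additionally maximise the margin of the separating boundary (e.g. via a library such as scikit-learn), which matters far more on real, noisy host features than on this clean toy data.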