We are all familiar with web spam, but few people realize that detecting it is a classification problem: it's hard to choose the right features, and hard to build an efficient classifier.
There's an interesting paper called "Improving Web Spam Classifiers Using Link Structure" by Qingqing Gan and Torsten Suel from the Polytechnic University in Brooklyn, NY.
The usual content features used in spam detection include (the paper provides a comprehensive list):
- fraction of the page's words that are drawn from a list of globally popular words.
- fraction of globally popular words used in the page, measured as the number of unique popular words appearing in the page divided by the number of words in the popular-word list.
- fraction of visible content, calculated as the aggregate length (in bytes) of all non-markup words on a page divided by the total size (in bytes) of the page.
- number of words in the page title.
- amount of anchor text in a page.
- compression rate of the page, using gzip.
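A couple of these content features are easy to sketch in code. The helper below is hypothetical (its name and inputs are mine, not the paper's); it computes the visible-content fraction and the gzip compression rate for a single page, given the raw HTML and its already-extracted visible text:

```python
import gzip

# Hypothetical helper: computes two of the content features described
# above for one page. The feature names are illustrative, not the
# paper's exact definitions.
def content_features(html: str, visible_text: str) -> dict:
    page_bytes = html.encode("utf-8")
    visible_bytes = visible_text.encode("utf-8")
    compressed = gzip.compress(page_bytes)
    return {
        # aggregate length (in bytes) of non-markup text divided by
        # the total size (in bytes) of the page
        "visible_fraction": len(visible_bytes) / max(len(page_bytes), 1),
        # gzip compression rate: highly repetitive (spammy) pages
        # compress much better than ordinary prose
        "compression_rate": len(page_bytes) / max(len(compressed), 1),
    }

feats = content_features(
    "<html><body>buy cheap cheap cheap pills</body></html>",
    "buy cheap cheap cheap pills",
)
```

The intuition behind the compression feature is that keyword-stuffed pages repeat themselves, so a high compression rate is a weak spam signal.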
They also calculated the following link features for each site:
- percentage of pages in most populated level
- top level page expansion ratio
- in-links per page
- out-links per page
- out-links per in-link
- top-level in-link portion
- out-links per leaf page
- average level of in-links
- average level of out-links
- percentage of in-links to most popular level
- percentage of out-links from most emitting level
- cross-links per page
- top-level internal in-links per page on this site
- average level of page in this site
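To make a few of the link features concrete, here is a small sketch (my own, not from the paper) that computes in-links per page, out-links per page, and out-links per in-link from a toy site-level graph, where the graph maps each site to the set of sites it links to:

```python
from collections import defaultdict

# Sketch of a few per-site link features, assuming a site-level graph
# {site: set of sites it links to} and a page count per site. The
# graph and counts below are invented toy data.
def link_features(out_links: dict, page_counts: dict) -> dict:
    # invert the graph to count in-links per site
    in_counts = defaultdict(int)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_counts[dst] += 1
    feats = {}
    for site, pages in page_counts.items():
        outs = len(out_links.get(site, ()))
        ins = in_counts[site]
        feats[site] = {
            "in_links_per_page": ins / pages,
            "out_links_per_page": outs / pages,
            "out_links_per_in_link": outs / ins if ins else float("inf"),
        }
    return feats

graph = {"a.com": {"b.com", "c.com"}, "b.com": {"a.com"}, "c.com": {"a.com"}}
pages = {"a.com": 10, "b.com": 2, "c.com": 4}
f = link_features(graph, pages)
```

The level-based features in the list above would additionally need each page's depth within its site, but the per-site aggregation pattern is the same.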
In addition, they added the following:
- number of hosts in the domain. The authors observed that domains with many hosts have a higher probability of being spam.
- ratio of pages in this host to pages in this domain.
- number of hosts on the same IP address. Spammers often register many domain names, all resolving to one server, to hold their spam pages.
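These host- and domain-level features amount to simple grouping counts. A toy sketch, with entirely invented hostnames and IPs:

```python
from collections import Counter

# Toy data: (host, domain, ip) tuples as a crawler might record them.
# All names and addresses here are made up for illustration.
records = [
    ("www.shop.example", "example", "1.2.3.4"),
    ("mail.shop.example", "example", "1.2.3.4"),
    ("a.spamfarm.test", "spamfarm", "5.6.7.8"),
    ("b.spamfarm.test", "spamfarm", "5.6.7.8"),
    ("c.spamfarm.test", "spamfarm", "5.6.7.8"),
]

# hosts per domain and hosts per IP address, as in the features above
hosts_per_domain = Counter(domain for _, domain, _ in records)
hosts_per_ip = Counter(ip for _, _, ip in records)
```

A domain or IP with an unusually high count is the kind of signal these features feed into the classifier.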
They used the C4.5 classifier, a statistical classifier that builds decision trees (or, more readably, rule sets) from pre-labeled training sets. I'll add that its successor, C5.0, is considerably faster, has lower error rates, and offers more features. They then applied a second classifier on top of the baseline and found the results were far better; it "uses the baseline classification results for neighboring sites in order to flip the labels of certain sites."
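The second-pass idea can be sketched in a few lines. This is my own simplified reading, not the paper's actual algorithm: given boolean baseline labels per site and each site's linking neighbors, flip a site's label when a large majority of its neighbors disagree with it. The threshold here is invented for illustration; the paper evaluates several such neighborhood heuristics.

```python
# Sketch of neighborhood-based relabeling, assuming a baseline
# classifier has already assigned each site a boolean spam label.
# The 0.7 threshold is an arbitrary illustrative choice.
def relabel(baseline: dict, neighbors: dict, flip_threshold: float = 0.7) -> dict:
    labels = {}
    for site, is_spam in baseline.items():
        nbrs = neighbors.get(site, [])
        if nbrs:
            spam_frac = sum(baseline[n] for n in nbrs) / len(nbrs)
            # flip a non-spam label when most neighbors are spam
            if not is_spam and spam_frac >= flip_threshold:
                labels[site] = True
                continue
            # flip a spam label when most neighbors are non-spam
            if is_spam and spam_frac <= 1 - flip_threshold:
                labels[site] = False
                continue
        labels[site] = is_spam
    return labels

baseline = {"a": False, "s1": True, "s2": True, "s3": True}
neighbors = {"a": ["s1", "s2", "s3"]}
out = relabel(baseline, neighbors)
```

Here site "a" was classified non-spam by the baseline, but all three of its neighbors are spam, so the second pass flips it.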
Until a really robust and fast method is found, web spam will remain a problem: it pollutes search engines and annoys users no end. I hope to see a lot more work in this area in the future. It's not my area of expertise, although the classification methods are similar to ones I use, and I find it really interesting and worthwhile research.