We are all familiar with web spam, but few people realize that detecting it is a classification problem: it's hard to choose the right features, and hard to build an efficient classifier.
There's an interesting paper called "Improving Web Spam Classifiers Using Link Structure" by Qingqing Gan and Torsten Suel from the Polytechnic University in Brooklyn, NY.
The usual content features used in spam detection include (the paper provides a comprehensive list):
- fraction of the page's words that are drawn from a list of globally popular words.
- fraction of globally popular words used in the page, measured as the number of unique popular words appearing in the page divided by the number of words in the popular-word list.
- fraction of visible content, calculated as the aggregate length (in bytes) of all non-markup words on a page divided by the total size (in bytes) of the page.
- number of words in the page title.
- amount of anchor text in a page.
- compression rate of the page, using gzip.
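A couple of these content features are easy to sketch in code. The helper below is hypothetical (its name and inputs are mine, not the paper's); it computes the visible-content fraction and the gzip compression rate for a single page, given the raw HTML and its already-extracted visible text:

```python
import gzip

# Hypothetical helper: computes two of the content features described
# above for one page. The feature names are illustrative, not the
# paper's exact definitions.
def content_features(html: str, visible_text: str) -> dict:
    page_bytes = html.encode("utf-8")
    visible_bytes = visible_text.encode("utf-8")
    compressed = gzip.compress(page_bytes)
    return {
        # aggregate length (in bytes) of non-markup text divided by
        # the total size (in bytes) of the page
        "visible_fraction": len(visible_bytes) / max(len(page_bytes), 1),
        # gzip compression rate: highly repetitive (spammy) pages
        # compress much better than ordinary prose
        "compression_rate": len(page_bytes) / max(len(compressed), 1),
    }

feats = content_features(
    "<html><body>buy cheap cheap cheap pills</body></html>",
    "buy cheap cheap cheap pills",
)
```

The intuition behind the compression feature is that keyword-stuffed pages repeat themselves, so a high compression rate is a weak spam signal.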
They also calculated the following link features for each site:
- percentage of pages in most populated level
- top level page expansion ratio
- in-links per page
- out-links per page
- out-links per in-link
- top-level in-link portion
- out-links per leaf page
- average level of in-links
- average level of out-links
- percentage of in-links to most popular level
- percentage of out-links from most emitting level
- cross-links per page
- top-level internal in-links per page on this site
- average level of page in this site
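To make a few of the link features concrete, here is a small sketch (my own, not from the paper) that computes in-links per page, out-links per page, and out-links per in-link from a toy site-level graph, where the graph maps each site to the set of sites it links to:

```python
from collections import defaultdict

# Sketch of a few per-site link features, assuming a site-level graph
# {site: set of sites it links to} and a page count per site. The
# graph and counts below are invented toy data.
def link_features(out_links: dict, page_counts: dict) -> dict:
    # invert the graph to count in-links per site
    in_counts = defaultdict(int)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_counts[dst] += 1
    feats = {}
    for site, pages in page_counts.items():
        outs = len(out_links.get(site, ()))
        ins = in_counts[site]
        feats[site] = {
            "in_links_per_page": ins / pages,
            "out_links_per_page": outs / pages,
            "out_links_per_in_link": outs / ins if ins else float("inf"),
        }
    return feats

graph = {"a.com": {"b.com", "c.com"}, "b.com": {"a.com"}, "c.com": {"a.com"}}
pages = {"a.com": 10, "b.com": 2, "c.com": 4}
f = link_features(graph, pages)
```

The level-based features in the list above would additionally need each page's depth within its site, but the per-site aggregation pattern is the same.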
In addition, they added the following:
- number of hosts in the domain. The authors observed that domains with many hosts have a higher probability of being spam.
- ratio of pages in this host to pages in this domain.
- number of hosts on the same IP address. Spammers often register many domain names, all resolving to one server, to hold their spam pages.
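These host- and domain-level features amount to simple grouping counts. A toy sketch, with entirely invented hostnames and IPs:

```python
from collections import Counter

# Toy data: (host, domain, ip) tuples as a crawler might record them.
# All names and addresses here are made up for illustration.
records = [
    ("www.shop.example", "example", "1.2.3.4"),
    ("mail.shop.example", "example", "1.2.3.4"),
    ("a.spamfarm.test", "spamfarm", "5.6.7.8"),
    ("b.spamfarm.test", "spamfarm", "5.6.7.8"),
    ("c.spamfarm.test", "spamfarm", "5.6.7.8"),
]

# hosts per domain and hosts per IP address, as in the features above
hosts_per_domain = Counter(domain for _, domain, _ in records)
hosts_per_ip = Counter(ip for _, _, ip in records)
```

A domain or IP with an unusually high count is the kind of signal these features feed into the classifier.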
They used the C4.5 classifier, a statistical classifier that builds decision trees (or, more readably, rule sets) from pre-labeled training sets. I'll add that its successor, C5.0, is considerably faster, has lower error rates, and offers more features. They then applied a second classifier on top of the baseline and found the results were far better; it "uses the baseline classification results for neighboring sites in order to flip the labels of certain sites."
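The second-pass idea can be sketched in a few lines. This is my own simplified reading, not the paper's actual algorithm: given boolean baseline labels per site and each site's linking neighbors, flip a site's label when a large majority of its neighbors disagree with it. The threshold here is invented for illustration; the paper evaluates several such neighborhood heuristics.

```python
# Sketch of neighborhood-based relabeling, assuming a baseline
# classifier has already assigned each site a boolean spam label.
# The 0.7 threshold is an arbitrary illustrative choice.
def relabel(baseline: dict, neighbors: dict, flip_threshold: float = 0.7) -> dict:
    labels = {}
    for site, is_spam in baseline.items():
        nbrs = neighbors.get(site, [])
        if nbrs:
            spam_frac = sum(baseline[n] for n in nbrs) / len(nbrs)
            # flip a non-spam label when most neighbors are spam
            if not is_spam and spam_frac >= flip_threshold:
                labels[site] = True
                continue
            # flip a spam label when most neighbors are non-spam
            if is_spam and spam_frac <= 1 - flip_threshold:
                labels[site] = False
                continue
        labels[site] = is_spam
    return labels

baseline = {"a": False, "s1": True, "s2": True, "s3": True}
neighbors = {"a": ["s1", "s2", "s3"]}
out = relabel(baseline, neighbors)
```

Here site "a" was classified non-spam by the baseline, but all three of its neighbors are spam, so the second pass flips it.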
Until a really robust and fast method is found, web spam will remain a problem: it pollutes search engines and annoys users no end. I hope to see a lot more work in this area in the future. It's not my area of expertise, although the classification methods are similar to ones I use, and I find it really interesting and worthwhile research.