Science for SEO: relevance rank

January 15, 2009

Search Engine Result Evaluation

Search engines are often evaluated using information retrieval measures such as precision and recall. These metrics work very well for traditional retrieval systems but less well for search engines. The reason is that high precision isn't necessarily a good measure of user satisfaction. The quality of the resources is of course a factor, but what users class as authoritative may vary.
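As a quick illustration of the two measures, here is a minimal Python sketch using the standard set-based definitions. The function name and the document ids are made up for the example. Note that both result lists get exactly the same scores even though one is the reverse of the other, which is a good hint at why these measures alone say little about ranking.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query, ignoring result order."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of returned documents that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document ids, purely for illustration.
results_a = ["d1", "d2", "d3", "d4"]
results_b = ["d4", "d3", "d2", "d1"]  # same documents, reversed order
relevant = ["d1", "d2", "d5"]

print(precision_recall(results_a, relevant))
print(precision_recall(results_b, relevant))
```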

This really does show that results are personal to each user. We're not looking for the same things every time, and if we are, maybe not for the same reasons. This is why personalisation is a good solution, but that's a topic for another day.

Usually you can classify queries as either navigational or informational. This also affects the evaluation of the search engine. Informational queries are the hardest because you're looking for a bunch of relevant documents, but the query usually isn't rich enough to establish exactly what is needed. Navigational queries, such as looking for the Sofitel in Bangkok, are much easier because they're more exact.

You can use human evaluators or automated methods to check how good the results are. Human evaluators are of course biased towards their own motivations, and in the past this has been shown to make their results vary widely. Automated testing isn't biased, the machine doesn't care, but it isn't always very representative of how humans search. Google use human evaluators and also live traffic experiments.

Here I'll introduce a few papers you might find interesting on the subject. I've chosen a bit of a mixture but of course there are many more ways to do this.

"Search Engine Ranking Efficiency Evaluation Tool" by Alhalabi, Kubat and Tapia from the University of Miami.

They also note that "precision" and "recall" don't take ranking quality into consideration. They propose using SEREET (Search Engine Ranking Efficiency Evaluation Tool).

They compare a known, correctly ordered list to a search engine's one. The method is to start at 100 points and then deduct from those each time a relevant document isn't present in the search engine rankings, and also each time an irrelevant document is returned. The deduction is basically (the number of misses/RankLength) x 100, where RankLength is the number of links in the rank list. They found it was more sensitive to change and efficient in space and time.
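To make the arithmetic concrete, here is a rough Python sketch of that kind of scoring, based only on the description above and not on the paper's actual implementation. Several things are my own assumptions: the function name, taking RankLength as the length of the known-good list, counting a miss for each missing relevant document plus each irrelevant one returned, and clamping the score at 0. It also ignores ordering, which the real tool may well penalise.

```python
def sereet_style_score(reference, engine_results):
    """Score an engine's list against a known-good list; 100 means no misses."""
    if not reference:
        raise ValueError("reference list must not be empty")
    # Assumption: RankLength is the length of the known-good list.
    rank_length = len(reference)
    reference_set = set(reference)
    returned_set = set(engine_results)
    # A miss is a relevant document that is absent...
    missing_relevant = len(reference_set - returned_set)
    # ...or an irrelevant document that was returned.
    irrelevant_returned = len(returned_set - reference_set)
    misses = missing_relevant + irrelevant_returned
    deduction = 100.0 * misses / rank_length
    # Assumption: never go below zero.
    return max(0.0, 100.0 - deduction)


# Hypothetical lists: "c" is missing and "x" is irrelevant, so two misses.
reference = ["a", "b", "c", "d", "e"]
engine = ["a", "b", "x", "d", "e"]
print(sereet_style_score(reference, engine))
```

With five links and two misses, the deduction is (2/5) x 100 = 40 points.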

"Automatic Search Engine Performance Evaluation with Click-through Data Analysis" by Liu, Fu, Zhang, Ru from Tsinghua University.

They note that human evaluation is too time consuming to be an efficient method of evaluation. Their click-through data analysis method allows them to evaluate automatically. Navigational-type queries, query topics and answers are generated by the system based on user query and click behaviour. They found that their results were similar to those of human evaluators.

"Evaluation of Web-Based Search Engines Using User-Effort Measures" by Tang and Sun from Rutgers University.

They looked at "user-effort-sensitive evaluation measures", namely search length, rank correlation and first 20 full precision. They say this is better because it focuses on the quality of the ranking. They found overall that the 3 measures were consistent. "Search length" is the number of non-relevant documents the user has to sift through, "Rank correlation" compares the user ranking to the search engine ranking, and "First 20 Full Precision" is the ratio of relevant documents within the total set of documents returned.
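Here is a small Python sketch of the three measures, written from the descriptions above rather than from the paper. The function names are mine, and a few details are assumptions: search length stops once `wanted` relevant documents have been found, rank correlation uses Spearman's coefficient for rankings without ties (the paper may use a different one), and the precision cutoff of 20 is taken from the measure's name.

```python
def search_length(results, relevant, wanted=1):
    """Count non-relevant results seen before `wanted` relevant ones are found."""
    found = 0
    skipped = 0
    for doc in results:
        if doc in relevant:
            found += 1
            if found == wanted:
                return skipped
        else:
            skipped += 1
    # Fewer than `wanted` relevant results: the user sifted through them all.
    return skipped


def spearman_rho(user_ranking, engine_ranking):
    """Spearman rank correlation for two orderings of the same items (no ties)."""
    n = len(user_ranking)
    same_items = set(user_ranking) == set(engine_ranking)
    no_repeats = len(set(user_ranking)) == n == len(engine_ranking)
    if n < 2 or not same_items or not no_repeats:
        raise ValueError("need two orderings of the same 2+ distinct items")
    engine_pos = {doc: i for i, doc in enumerate(engine_ranking)}
    d_squared = sum((i - engine_pos[doc]) ** 2
                    for i, doc in enumerate(user_ranking))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def precision_at(results, relevant, cutoff=20):
    """Share of relevant documents among the first `cutoff` results."""
    top = results[:cutoff]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


# Hypothetical data: "r" documents are relevant, "n" documents are not.
results = ["n1", "r1", "n2", "r2"]
relevant = {"r1", "r2"}
print(search_length(results, relevant))
print(spearman_rho(["a", "b", "c", "d"], ["d", "c", "b", "a"]))
print(precision_at(results, relevant))
```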

Sphinn spam - some solutions

*Before you read, a clarification - I'm aware that Sphinn is working on a newer version, and also that they use editors, mods and user interaction for spam fighting - this post is about those techniques and their limitations, and also introduces some new ideas*

Sphinn has been swamped with spam recently. I've seen a lot of it myself and it's been reported by other users, including Zigojacko. What's up?

Although Sphinn is small in comparison to Digg, it uses the same kind of system. People submit stories and they get voted on by other members. The posts with the most votes go to the "Hot topics" page, which is also the content that you'll get in your feed. Basically, it's the same thing, and so is the spam problem: people post advertising for their products and services instead of information-rich resources that can be shared with the community. It's also a drain on resources. All in all, a nasty thing that needs to be dealt with.

Ways that this problem can be solved include:

Having a human spam editor
Getting users to flag spam
Moderation
Captcha (not effective for human submissions)
Relevance rank
and finally...personalisation.

Having a human spam editor isn't ideal in a very dynamic environment like Sphinn. It works for Wikipedia, but Wikipedia moves at a much slower pace. Captcha is only useful for deterring bots (although some can break captcha now). Moderation uses human resources and is time consuming, as Tamar and Danny at Sphinn make clear in the Zigojacko thread. Moderators should not have to clear out the spam anyway. That leaves...

Personalisation:

Digg already announced at Web 2.0 Expo that they were working on a personalised front page. This means that, yes, you might still get spam on your front page, but it's not really going to be worthwhile for the spammers, seeing as their audience becomes very small all of a sudden. You get to moderate your own "front page", and in this sense, I guess something like Twine is worth a look (I really like Twine btw).

There is a way for spammers to use this to their advantage though: they could use social network monitoring to detect where their interest group is, and then target it in some way, like with paid-for ads. It is still tricky though.

Relevance rank:

Most people will be aware of this. It's basically ranking results by relevance, but first you have to decide what's relevant.

On Sphinn, new submissions come in on the "what's new" section as they are submitted, which I like. Sometimes stuff I'm interested in doesn't get many votes, and it would be buried before I could come across it (not "find" - I'm never looking on Sphinn, I'm browsing). This section is easily spammed, although to be honest it's not as bad as I've seen it elsewhere.

There has to be a filter as stories come in to minimise spam at this level. One way to do it would be to use a topic detection algorithm and train it on a clean, already existing Sphinn corpus. The system can draw patterns from the training data which help it label a submission as "Sphinn" or "Foe". The patterns will be numerous! A cool by-product is a way to visualize the community.
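To give a feel for what such a filter could look like, here is a toy Python sketch. It is not how Sphinn works, and it swaps in a plain multinomial naive Bayes text classifier as a simple stand-in for a topic detection algorithm. The class name `TitleFilter` and the six training titles are invented for the example; a real filter would be trained on the actual clean Sphinn corpus and on known spam.

```python
import math
import re
from collections import Counter


class TitleFilter:
    """Multinomial naive Bayes over title words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training titles
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        words = self.tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most likely label, or None if nothing was trained."""
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior: how common this label is in the training data.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training titles, purely for illustration.
spam_filter = TitleFilter()
for title in ["Google updates its ranking algorithm",
              "How to build links with great content",
              "Keyword research tips for search marketers"]:
    spam_filter.train(title, "Sphinn")
for title in ["Buy cheap watches online now",
              "Cheap pills best price buy now",
              "Win free money online casino"]:
    spam_filter.train(title, "Foe")

print(spam_filter.classify("New link building tips for Google"))
print(spam_filter.classify("Buy cheap casino money now"))
```

The smoothing is what gives the filter some tolerance: a title with a few unusual words is not rejected outright, it just scores a little lower for every label.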

This type of method needs to be flexible as well though. Otherwise, if you used an unconventional title for example, or weird words, your submission would be chucked out. The more you train it the better it gets, and I would define Sphinn as a closed environment, which makes the problem easier to deal with. There are only so many categories. It's not as difficult as tracking spam in a global engine. On top of that, you could take user interaction into consideration to solidify your method.

Or failing all of that, we could beg: "Spammers, please please stop peeing in the beer".