*Before you read, a clarification - I'm aware that Sphinn is working on a newer version, and also that they use editors, mods and user interaction for spam fighting - this post is about those techniques and their limitations, and also introduces some new ideas*
Although Sphinn is small in comparison to Digg, it uses the same kind of system. People submit stories and other members vote on them. The posts with the most votes go to the "Hot topics" page, which is also the content that you'll get in your feed. Basically, it's the same thing, and it has the same spam problem. People post advertising for their products and services instead of information-rich resources that can be shared with the community. It's also a drain on resources. All in all, a nasty thing that needs to be dealt with.
Ways that this problem can be solved include:
- Having a human spam editor
- Getting users to flag spam
- Moderation
- Captcha (not effective for human submissions)
- Relevance rank
- and finally...personalisation.
Having a human spam editor isn't ideal in a very dynamic environment like Sphinn. It works for Wikipedia, but Wikipedia moves at a much slower pace. Captcha is only useful for deterring bots (although some can break captcha now). Moderation uses human resources and is time-consuming, as Tamar and Danny at Sphinn make clear in the Zigojacko thread. Moderators should not have to clear out the spam anyway. That leaves...
Personalization:
Digg already announced at the Web 2.0 Expo that they were working on a personalised front page. This means that, yes, you might still get spam on your front page, but it's not really going to be worthwhile for the spammers, seeing as their audience becomes very small all of a sudden. You get to moderate your own "front page", and in this sense, I guess something like Twine is worth a look (I really like Twine btw).
There is a way for spammers to use this to their advantage though: social network monitoring to detect where their interest group is, and then targeting that group in some way, such as with paid-for ads. It is still tricky though.
Relevance rank:
Most people will be aware of this: it's basically ranking results by relevance, but first you have to decide what's relevant.
On Sphinn, new submissions come in on the "what's new" section as they are submitted, which I like. Sometimes stuff I'm interested in doesn't get many votes, and it would be buried before I could come across it (not "find" - I'm never looking on Sphinn, I'm browsing). This section is easily spammed, although to be honest it's not as bad as I've seen elsewhere.
There has to be a filter as stories come in to minimise spam at this level. One way to do it would be to use a topic detection algorithm and train it on a clean, already existing Sphinn corpus. The system can draw patterns from the training data which help it label a submission as "Sphinn" or "Foe". The patterns will be numerous! A cool by-product is a way to visualize the community.
This type of method needs to be flexible as well though. Otherwise, if you used an unconventional title or weird words, your submission would be chucked out. The more you train it the better it gets, and I would define Sphinn as a closed environment, which makes the problem easier to deal with. There are only so many categories. It's not as difficult as tracking spam in a global engine. On top of that, you could take user interaction into consideration to solidify your method.
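To make the "Sphinn" or "Foe" idea a bit more concrete, here is a minimal sketch of one way such a filter could work: a small multinomial naive Bayes classifier over submission titles. This is an assumption on my part, not a description of anything Sphinn runs. The class name, the tokenizer and the eight training titles are all made up for illustration; a real filter would be trained on the actual Sphinn corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training titles
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, text):
        """Return the log of prior * likelihood for each label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokenize(text):
                # Add-one smoothing: an unseen or weird word lowers the
                # score a little instead of zeroing it out.
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training titles, invented for this sketch.
TRAINING = [
    ("How to write better title tags for SEO", "Sphinn"),
    ("Link building tips for new websites", "Sphinn"),
    ("Google updates its search ranking algorithm", "Sphinn"),
    ("Paid search keyword research guide", "Sphinn"),
    ("Cheap holidays to India book now", "Foe"),
    ("Buy cheap watches online discount", "Foe"),
    ("Best online casino bonus offers", "Foe"),
    ("Discount holiday packages book cheap flights", "Foe"),
]

clf = NaiveBayes()
for title, label in TRAINING:
    clf.train(title, label)

print(clf.classify("Keyword research tips for Google search"))
print(clf.classify("Cheap holiday flights discount"))
```

The smoothing is what gives it some of the flexibility mentioned above: an unconventional title is not thrown out just because it contains a word the classifier has never seen.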
Or failing all of that, we could beg: "Spammers, please please stop peeing in the beer".
3 comments:
I tried training a Bayesian classifier to detect Sphinn spam a couple of weeks ago. I got a reasonable start - 90% of spam correctly identified, 10% false negative, 1-3% false positive.
These figures are good but not good enough for a production website. I couldn't improve them further due to a couple of problems:
1. There is no publicly available Sphinn spam data. Grabbing the upcoming and hot feeds and assuming that anything that gets dumped within 3 hours is spam is not accurate enough. I'd need an RSS feed of known spam.
2. Sphinn has a surprisingly wide variety of posts. You'd think that stories about gambling should be off topic for Sphinn, but no, this went hot.
There are a lot of other good factors that could help, such as the age of the poster's account, number of repeated words in the comment, URL depth etc.
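A rough sketch of how those extra factors might be pulled out of a submission, so they could be fed to a classifier alongside the text. The function names, the dictionary keys and the exact definition of "repeated words" (every occurrence of a word after its first) are my own assumptions for illustration.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlparse


def url_depth(url):
    """Number of non-empty path segments in the URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def repeated_word_count(text):
    """Count every occurrence of a word beyond its first (assumed definition)."""
    counts = Counter(text.lower().split())
    return sum(n - 1 for n in counts.values() if n > 1)


def account_age_days(created, today):
    """Age of the poster's account in whole days."""
    return (today - created).days


def extract_features(url, comment, account_created, today):
    """Bundle the three factors into one feature dictionary."""
    return {
        "url_depth": url_depth(url),
        "repeated_words": repeated_word_count(comment),
        "account_age_days": account_age_days(account_created, today),
    }


# Hypothetical submission, invented for this sketch.
features = extract_features(
    "http://example.com/deals/india/cheap-holidays.html",
    "cheap cheap cheap flights flights",
    date(2008, 1, 1),
    date(2008, 1, 31),
)
print(features)
```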
I guess I should have another go, or at least write up my findings.
I think this is an in-house issue, because they have all of the data, and without it you can't reliably train and test a classifier.
Have you tried using an SVM with the Naive Bayes?
Def write something up and have another go :)
I didn't use an SVM but I should definitely look into it.
I tried to get around the lack of a spam feed by classifying posts by topic and grouping topics into "good" and "bad". A lot of Sphinn's spam submissions are well-written web pages, but they're just not relevant, like posts about holidays to India.
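For anyone curious, the two-step idea looks roughly like this. It is a simplified sketch, not the code I actually ran: the keyword lists, the topic names and the good/bad grouping are invented for illustration, and a real version would learn the topics from data rather than from hand-picked keywords.

```python
import re

# Hypothetical topics and keywords, invented for this sketch.
TOPIC_KEYWORDS = {
    "seo": {"seo", "ranking", "links", "google", "keywords"},
    "social media": {"digg", "twitter", "sphinn", "social", "voting"},
    "travel": {"holidays", "flights", "hotel", "india", "travel"},
}

# Assumed grouping: which topics count as relevant to Sphinn.
GOOD_TOPICS = {"seo", "social media"}


def detect_topic(text):
    """Pick the topic whose keywords overlap most with the text, or None."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_topic, best_overlap = None, 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


def label_post(text):
    """Step 1: detect the topic. Step 2: map the topic to good or bad."""
    topic = detect_topic(text)
    if topic is None:
        return "unknown"
    return "good" if topic in GOOD_TOPICS else "bad"


print(label_post("Cheap holidays to India"))
print(label_post("Google ranking factors explained"))
```

The point of the split is that a well-written but irrelevant page is caught by its topic, not by how spammy the writing looks.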
It was a fun exercise and I'll pick it up again.