A cool paper caught my attention today: "A few bad votes too many?: towards robust ranking in social media" - it's written by researchers from Emory University and the Georgia Institute of Technology (ACM SIGIR '08).
People vote all the time in social networks, be it Digg, Sphinn, LinkedIn, or any of the others we visit frequently. These votes are used to rank, filter and retrieve high-quality content. There is a lot of "noise", though, from people voting for their friends and gaming the system, and this degrades the quality and reliability of that data.
Their solution is a machine-learning-based ranking framework for social media that integrates user interactions and content relevance, and is trained to withstand vote spam attacks. Filtering spam "post factum" isn't practical because it would be far too slow: Answers already deals with some obvious vote spam this way, and while the content awaits moderation, the user experience is degraded. As social networks grow, vote spam becomes more sophisticated, and it can change significantly with the varying popularity of the content. Vote spam includes not only malicious voting but also non-expert voting: if you know nothing about VB.NET and you vote on a post about it, your vote is not as valuable as an expert's.
They pulled all sorts of information out of the topic threads, such as the posting date, the number of responses, ... and then they extracted textual features from the relationships between threads, user responses and queries. They also extracted the usual data about the users, such as the number of topics posted, votes given, etc.
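The kind of feature extraction they describe can be sketched roughly like this. This is my own toy illustration, not the paper's code; all the field names (`posted`, `responses`, `topics_posted`, and so on) are invented for the example:

```python
import datetime

def thread_features(thread, users):
    """Toy sketch of extracting thread-, user- and text-level features.
    `thread` and `users` are hypothetical dicts, not the paper's format."""
    author = users[thread["author"]]
    return {
        # thread-level features: posting date and activity
        "age_days": (datetime.date(2008, 1, 1) - thread["posted"]).days,
        "num_responses": len(thread["responses"]),
        "num_votes": thread["votes"],
        # user-history features: how active is the author?
        "author_topics_posted": author["topics_posted"],
        "author_votes_given": author["votes_given"],
        # a crude textual feature: query-term overlap with the thread title
        "query_overlap": len(set(thread["query"].lower().split())
                             & set(thread["title"].lower().split())),
    }

users = {"alice": {"topics_posted": 12, "votes_given": 40}}
thread = {
    "author": "alice",
    "posted": datetime.date(2007, 12, 1),
    "responses": ["r1", "r2", "r3"],
    "votes": 7,
    "query": "vb.net arrays",
    "title": "vb.net arrays explained",
}
feats = thread_features(thread, users)
```

A real system would of course use far richer textual similarity than token overlap, but the shape is the same: one flat feature vector per thread, fed to the ranker.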
They used their own ranking algorithm (GBrank) and added some noise:
"Our experiments demonstrate that user vote information provides much contribution to the high accuracy of our GBrank, when there is no vote spam. However, if user votes in CQA have been polluted by spam from malicious users and we continue using GBrank trained by clear data without vote spam, GBrank will still put much reliance on user vote information which however is supplying inaccurate information due to the spam".
"In order to create a robust ranking method, we enhance our GBrank by using polluted training data during learning process. We apply the general vote spam model, described in Section 4, to generate vote spam into unpolluted QA data. Then, we train the ranking function based on new polluted data".
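A back-of-the-envelope version of that spam-injection idea looks like this. The vote model, rates and numbers here are my own simplification for illustration, not the paper's actual spam model or GBrank setup:

```python
import random

def inject_vote_spam(training_data, spam_rate=0.2, spam_votes=10, rng=None):
    """Pollute clean training data with simulated vote spam:
    a random fraction of answers gets a burst of fake promotional votes.
    Training the ranker on the polluted copy (rather than the clean one)
    teaches it not to over-trust raw vote counts."""
    rng = rng or random.Random(42)  # fixed seed for a reproducible sketch
    polluted = []
    for example in training_data:
        example = dict(example)  # don't mutate the clean copy
        if rng.random() < spam_rate:
            example["votes"] += spam_votes  # a spammer promotes this answer
            example["spammed"] = True
        polluted.append(example)
    return polluted

clean = [{"votes": v, "relevant": v > 5, "spammed": False} for v in range(10)]
polluted = inject_vote_spam(clean, spam_rate=0.3)
# the ranking function is then trained on `polluted` instead of `clean`
```

The point of the exercise is that some now-spammed examples have high vote counts but low relevance labels, so vote count alone stops being a reliable signal during learning.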
They proved that it works:
"We have presented a robust, effective method which incorporates social and content information for retrieving information from social media. In particular, we focused on the robustness of ranking in the presence of malicious feedback (vote spam), analyzing general models for common vote spam strategies and developing a training method that improves the robustness of ranking by injecting simulated spam into the training data."
Nicely done. Social networks, like the search engines, have trouble with spam. Comment spam is another matter entirely, but it is definitely being looked at now. It's interesting to see how the methods used by the search engines can be applied in social media as well. For comment spam the similarity of method is obvious, since we are dealing with natural language; but with the information gathered on users in social networks, vote spam is being dealt with in a new way. Is there anything that can be learnt from sponsored search?