ReadWriteWeb posted an article about blog search and the best tools out there at the moment. I happened to come across a paper by Marti Hearst, Susan Dumais and Matthew Hurst called "What Should Blog Search Look Like?". It was presented at SSM 08 and is particularly interesting, and not just because of who wrote it.
They acknowledge that blog search isn't very good right now and propose a "faceted navigation interface" as a good place to start. They say that blog search needs to be integrated with search over other forms of social media, so that particular topics can be analysed.
They note that some of the problems with blog search stem from the lack of academic work on search interfaces, and from interfaces that don't make good use of the available data. They mention Mishne & de Rijke, who carried out a query log analysis and found that:
52% chose ad hoc queries with named entities
25% (of the rest) high level topics
23% (remaining) navigational and adult queries and so on
20% of the most popular queries were related to breaking news
So blog search was mainly used to find thoughts on topics and discussion of current events.
They note that blogs differ from other web documents in their language and structure, and that recency matters more. The data is people-centric and subjective.
Their methods involve sentiment analysis of particular topics over time, finding quality authors, and finding useful information published in the past. They rightly say that current blog search engines try to do this but aren't very good at it: Google doesn't list enough, and others don't list ones that are current. They also highlight the need for sentiment analysis (which we have seen in many papers now) but say that, as well as product review sites, it should cover microblogs, academic journals and other publications.
They say that blog search should:
Organise and aggregate the results more effectively, focusing on comments and on who else has blogged about the topic. They say Blogpulse is very simplistic, but Blogrunner, for example, was better.
Assess the quality of blogs properly, using good metrics such as original content vs complementary content, amount of relevant content covered, style and tone.
Identify subtopics.
Use information relating to the authors (comment authors, links in and what kinds of things link in, number of authors, quality of comments, variety of viewpoints).
They propose to use these variables in a PageRank-type algorithm.
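To make the idea a bit more concrete, here is a minimal sketch of one way quality variables could feed a PageRank-type computation. This is my own illustration and not the algorithm from the paper: I assume each blog already has a single non-negative quality score (however it was derived from the variables above), and I use it to bias the random jump, as in personalised PageRank. The function name, the blog names and the scores are all made up.

```python
def quality_weighted_pagerank(links, quality, damping=0.85, iterations=50):
    """Illustrative PageRank variant (an assumption, not the paper's method).

    links:   dict mapping a blog to the list of blogs it links to
    quality: dict mapping every blog to a non-negative quality score
    """
    blogs = list(quality)
    total_quality = sum(quality.values())
    # The random jump lands on a blog with probability proportional to its quality.
    teleport = {b: quality[b] / total_quality for b in blogs}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {b: (1 - damping) * teleport[b] for b in blogs}
        for source in blogs:
            targets = [t for t in links.get(source, []) if t in quality]
            if targets:
                # Split this blog's rank evenly among the blogs it links to.
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A blog with no outgoing links spreads its rank via the jump distribution.
                for b in blogs:
                    new_rank[b] += damping * rank[source] * teleport[b]
        rank = new_rank
    return rank


# Hypothetical data: three blogs, with "a" judged higher quality than the others.
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
quality = {"a": 3.0, "b": 1.0, "c": 1.0}
ranks = quality_weighted_pagerank(links, quality)
for blog in sorted(ranks, key=ranks.get, reverse=True):
    print(blog, round(ranks[blog], 3))
```

Biasing the jump is only one option; the quality score could equally be used to weight individual links, and the paper does not commit to either.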
They propose a faceted interface, which they believe is efficient for navigating information collections. Their facets are related to the variables listed above, plus a few others. They also say that people search is highly important; the idea of content claiming by Ramakrishnan & Tomkins addresses this. They also think that people should have individual profiles. Another idea from the authors is to include matching on blog style and personality. It would also be useful to use the usual text classification (using "links typed by opinion polarity"), relevance feedback, collaborative filtering, and implicit selection.
They do note that these have not proved very successful in the past, but think they should work for blogs, and that descriptive queries would also help.
This is just a shortish summary of their paper. I suggest reading the whole thing for the full picture. Unsurprisingly it is a very good paper, and a very good starting point for more elaborate research and discussion on the topic. You'll need ACM access.
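For anyone who hasn't met faceted navigation before, here is a small sketch of the basic mechanism: the user narrows the results by picking facet values, and the interface shows how many results sit under each remaining value. The facets used here (author, topic, sentiment) and the sample posts are my own assumptions for illustration, not the facet set from the paper.

```python
from collections import Counter

# Hypothetical blog posts with a few hypothetical facets.
posts = [
    {"title": "Faceted search rocks", "author": "alice", "topic": "search", "sentiment": "positive"},
    {"title": "Why blog search fails", "author": "bob", "topic": "search", "sentiment": "negative"},
    {"title": "My new phone", "author": "alice", "topic": "gadgets", "sentiment": "positive"},
    {"title": "Search interfaces revisited", "author": "carol", "topic": "search", "sentiment": "positive"},
]


def filter_posts(posts, selected):
    """Keep only the posts that match every selected facet value."""
    return [p for p in posts if all(p.get(f) == v for f, v in selected.items())]


def facet_counts(posts, facets):
    """Count how many posts fall under each value of each facet."""
    return {f: Counter(p[f] for p in posts if f in p) for f in facets}


# The user picks topic = search, then sees the counts for the remaining facets.
results = filter_posts(posts, {"topic": "search"})
counts = facet_counts(results, ["author", "sentiment"])
print(len(results), "posts about search")
print(dict(counts["sentiment"]))
```

Each further selection just adds another entry to the selected dictionary, so the user can drill down (or back out) one facet at a time.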