
February 02, 2009

How to build blog search

ReadWriteWeb posted an article about blog search and the best tools out there at the moment.  I happened to come across a paper by Marti Hearst, Susan Dumais and Matthew Hurst called "What should blog search look like?".  It was presented at SSM 2008 and is particularly interesting, and not just because of who wrote it.

They acknowledge that blog search isn't very good right now and propose a "faceted navigation interface" as being a good place to start.  They say that blog search needs to be integrated with search of other forms of social media, so that particular topics can be analysed.  

They note that some of the problems surrounding blog search have been the lack of academic work on search interfaces, and interfaces that don't make good use of the available data. They mention Mishne & de Rijke, who analysed query logs to inform interface design and found that:

52% were ad hoc queries containing named entities
25% (of the rest) were high-level topics
23% (of the remainder) were navigational and adult queries, and so on
20% of the most popular queries were related to breaking news
So blog search was used for finding thoughts on topics and discussion of current events.

They note that blogs differ from other web documents in the language used and their structure, and that recency matters more.  The data is people-centric and subjective.

Their method involves sentiment analysis on particular topics over time, finding quality authors, and surfacing useful information published in the past.  They rightly say that current blog search engines try to do this but aren't very good at it: Google doesn't list enough results, and others don't list current ones.  They also highlight the need for sentiment analysis (which we have seen in many papers now), but say that as well as product review sites we should include microblogs, academic journals and other publications.

They say that blog search should:

Organise and aggregate the results more effectively, focusing on comments, who else has blogged about the topic, and so on.  Blogpulse, they say, is very simplistic, while Blogrunner for example does better.

The quality of blogs needs to be properly assessed using good metrics such as original vs. complementary content, the amount of relevant content covered, and style and tone.

Subtopics need to be identified.

Information relating to the authors needs to be gathered (comment authors, links in and what kinds of things link in, number of authors, quality of comments, variety of viewpoints).

They propose to use these variables in a PageRank-type algorithm.
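To make that concrete, here is a toy sketch (mine, not the authors') of how quality facets like originality, coverage and style could feed a PageRank-type computation as a teleport prior.  The facet names and weights are entirely invented for illustration:

```python
def quality_prior(metrics):
    # metrics: blog -> dict of facet scores in [0, 1] (invented facets)
    raw = {b: 0.5 * m["originality"] + 0.3 * m["coverage"] + 0.2 * m["style"]
           for b, m in metrics.items()}
    total = sum(raw.values())
    return {b: v / total for b, v in raw.items()}  # normalise to sum to 1

def quality_pagerank(links, prior, d=0.85, iters=50):
    # links: blog -> list of blogs it links to.
    # Quality biases the teleport step, so high-quality blogs get a head
    # start and link structure does the rest.
    rank = dict(prior)
    for _ in range(iters):
        new = {b: (1 - d) * prior[b] for b in rank}
        for b, outs in links.items():
            if not outs:
                for t in new:  # dangling node: spread its mass by the prior
                    new[t] += d * rank[b] * prior[t]
            else:
                share = d * rank[b] / len(outs)
                for t in outs:
                    new[t] += share
        rank = new
    return rank
```

A well-linked, high-quality blog then outranks an unlinked low-quality one, which is roughly the behaviour the paper argues for.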

The interface:

They propose a faceted interface, which they believe to be efficient for navigating information collections.  Their facets relate to the variables listed above, plus a few others.  They also say that people search is highly important; the idea of content claiming by Ramakrishnan & Tomkins discusses this.  They also think that people should have individual profiles.  Another idea from the authors is to match on blog style and personality.  It would also be useful to use the usual text classification (using "links typed by opinion polarity"), relevance feedback, collaborative filtering, and implicit selection.

They do note that these techniques have not proved very successful in the past, but should work for blogs, and that descriptive queries would additionally help.

This is just a shortish summary of their paper.  I suggest reading the whole thing for the full picture, and must say that it is, unsurprisingly, a very good paper and a very good starting point for elaborate research and discussion on the topic - you'll need ACM access.

January 30, 2009

TGIF - wicked!

Hi all, welcome to another TGIF post.  I hope that you have had a chilled out week and that you have enjoyed working and got time to have a bit of fun, if your work isn't also your source of fun.  This week I have been experiencing Sydney for the first time - mostly working, unfortunately, but I also made it to the beach and swam in the pool enough times to balance all that out :)

It was pointed out to me that this post is basically "what I wasted time on this week" - fair point.

Without further ado...

Stuff I liked this week:

The list of top 10 complaint letters courtesy of The Telegraph.

Levitated have some super cool pictures of abstract things which I really like (fractals and stuff).

The 100m long picture depicting people walking on Warschauer strasse railroad bridge - shot over 20 days.


"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind". (Lord Kelvin) [but I don't agree :)]

"Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration". (S. Kelly-Bootle)

"It's hard enough to find an error in your code when you're looking for it; it's even harder when you've assumed your code is error-free". (S. McConnell)

"As a rule, software systems do not work well until they have been used, and have failed repeatedly, in real applications". (D. Parnas)


Every second around 100 lightning bolts strike the Earth.

Every year lightning kills 1000 people.

If you could drive your car straight up you would arrive in space in just over an hour.

On the day that Alexander Graham Bell was buried the entire US telephone system was shut down for 1 minute in tribute.

The only letter not appearing on the Periodic Table is the letter “J”.

The microwave was invented after a researcher walked by a radar tube and a chocolate bar melted in his pocket.

Presenting the "Round orbita mouse" - it's all round, it's wireless, it spins around...

Effective Query Log Anonymization

Check out this very good Google tech talk about using query logs:

"User search query logs have proven to be very useful, but have vast potential for misuse. Several incidents have shown that simple removal of identifiers is insufficient to protect the identity of users. Publishing such inadequately anonymized data can cause severe breach of privacy. While significant effort has been expended on coming up with anonymity models and techniques for microdata/relational data, there is little corresponding work for query log data -- which is different in several important aspects. In this work, we take a first cut at tackling this problem. Our main contribution is to define effective anonymization models for query log data, along with techniques to achieve such anonymization. "
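The talk's anonymization models are more sophisticated than this, but a minimal sketch of the two basic ideas - replacing user IDs with keyed hashes, and dropping rare, identifying queries - might look like the following (the threshold and function names are my own):

```python
import hmac
import hashlib

def anonymize_log(log, secret, min_freq=3):
    # log: list of (user_id, query) pairs.
    # Count how many distinct users issued each query: rare queries
    # (vanity searches, home addresses) are the main re-identification risk.
    users_per_query = {}
    for user, query in log:
        users_per_query.setdefault(query, set()).add(user)
    out = []
    for user, query in log:
        if len(users_per_query[query]) < min_freq:
            continue  # drop queries too rare to hide in a crowd
        # Keyed hash: sessions stay linkable for research purposes, but
        # identities can't be recovered without the (discardable) secret.
        token = hmac.new(secret, user.encode(), hashlib.sha256).hexdigest()[:16]
        out.append((token, query))
    return out
```

The real contribution of the work is defining proper anonymity models for query logs; this sketch only illustrates why naive identifier removal isn't enough on its own.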

Heat diffusion for Social net marketing

The paper we look at here is called "Mining Social Networks Using Heat Diffusion Processes for Marketing Candidates Selection" and is by Yang, Liu and King from The Chinese University of Hong Kong.

Companies have increasingly turned to social networks for WOM (word-of-mouth) promotion, increasing brand awareness, attracting potential clients and so on.  This is openly apparent in Facebook pages, Twitter, collaborative filtering, blogs and many others.  This paper presents a model enabling marketers to make the best use of these networks using the "Heat Diffusion Process" (an idea borrowed from physics).  They present three models and three algorithms that allow marketing samples to be selected.

In physics the heat diffusion model states that heat flows from a position of high temperature to one of low temperature.  

They show that these methods allow us to select the best marketing candidates using the clustering properties of social networks and the planning of a marketing strategy sequentially in time, and they construct a model that can diffuse negative and positive comments about products and brands to simulate the complex discussions within social networks.  They want to use their work to help marketers and companies defend themselves against negative comments.  The idea is to get a subset of individuals to adopt a new product or service based on a potential network of customers.

The heat diffusion model has a time-dependent property, which means it can simulate product adoption step by step.  The selection algorithms can reproduce the clustering coefficient of real social networks.  All users of social networks can diffuse comments that influence other users.  Based on this, they say that nodes 1 and 2 represent adopters and the heat reaches nodes 3, 4 and 5 as time elapses.  Users in the trust circle of other users have greater influence.  Not all of the people in the trust circle, however, will be reached by the heat source.  Some users are also more active than others in diffusing information.  They observe that bad news and negative comments diffuse much faster than other news.

Individuals are selected as seeds for heat diffusion.  The influence of individuals is based on the number of individuals they influence.  The heat vector they construct decides the amount of heat needed at each source: in order to diffuse properly, sources need a lot of heat.  Thermal conductivity, which sets the heat diffusion rate, is then calculated.  Finally the adoption threshold is set: if a consumer's heat value rises above it, they are likely to adopt the product.
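A toy simulation makes the mechanics clearer.  This is my own sketch rather than the authors' exact model: alpha plays the role of thermal conductivity, and the threshold is the adoption threshold.

```python
def diffuse(adj, heat, alpha=0.2, steps=10):
    # adj: node -> list of neighbours (the trust links in the network)
    # heat: node -> initial heat; seed adopters get positive heat, and
    # detractors can be given negative heat to model spreading complaints.
    h = dict(heat)
    for _ in range(steps):
        nxt = dict(h)
        for i, nbrs in adj.items():
            for j in nbrs:
                # heat flows from hotter to cooler neighbours,
                # scaled by alpha (the "thermal conductivity")
                nxt[i] += alpha * (h[j] - h[i]) / len(nbrs)
        h = nxt
    return h

def adopters(h, threshold=0.1):
    # consumers whose heat exceeds the adoption threshold
    return {n for n, v in h.items() if v >= threshold}
```

On a simple chain of users, heat decays with distance from the seed, which is the time-elapsing behaviour the authors describe for nodes 3, 4 and 5.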

If a user doesn't like the product, s/he is allocated negative heat, as they will diffuse negative comments.  At some point someone in the network will provide different information, which might be positive.  If that user adopts the product anyway, they will diffuse positive comments.  Two defense candidates are then selected, and the negative impact is alleviated.

They conclude:

"So far, our work considers social network as a static network only, and ignores newcomers, new relationships between existing members and the growth of the network’s size. In the future, we plan to consider the evolution property of social networks, and permit our social network to grow at a certain rate"   

Why should you care?

This paper shows a new and very different way of analyzing social networks.  It gives you a nice opening to discussions that look at this in a different light.  The current methods used in business are not foolproof, and such research shows us how simplistic they are and how they can actually be misleading.

The RankMass crawler

The paper entitled "RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee" (Cho, Schonfeld, University of California) deals with the important questions of how many pages must be collected to cover most of the web, and how to ensure that important documents are not left out when the crawl is halted.  They answer these questions by showing that crawling the most important pages, or the most important part of the web, limits the number of pages needed.  This matters because a large corpus is expensive to house and the computing cost is high.

They state that comparing search engines by the number of pages they index is misleading because of the huge amount of data available on the web.  They say that, for example, calendar pages generated by dynamic sites have links to "next day" and so on, meaning that potentially useless information is collected.  Additionally, no search engine is capable of downloading the entire web; we don't even know how big it is.  So when should they stop the crawl?  At 8 billion pages, like Google?

Their RankMass metric compares the quality of search engine indexes and is a variant of the personalised PageRank metric, which assumes that users go only to important pages.  They change this a little: they look at how much of the important documents a set covers.  Measuring coverage involves reasoning about pages in the subset that are not yet known.  Their crawler focuses on the pages users go to.  It prioritizes the crawl, downloading pages with high personalised PageRank first, so the highest RankMass is achieved by the time the crawl is over.

How are pages deemed important?

This could, as they say, be based on relevance to queries, but that would mean having a set of queries to start with, which isn't ideal.  They say that PageRank, with its random surfer model, is very effective, but as we have seen it can be easily spammed (by webmasters and SEO people, I presume!).  Personalised PageRank assumes that a user eventually goes to a trusted site rather than to any page with equal probability.

The RankMass metric is based on the link structure of the whole web rather than just the graph structure of the subset, meaning that they don't have to download a huge amount.  They download all the pages reachable from a page's neighbouring sites and calculate RankMass, noting however that users are unlikely to visit only one trusted page.  This method is greedy, so they adapted it into the "Windowed-RankMass algorithm".

"The Windowed-RankMass algorithm is an adaptation of the RankMass algorithm and is designed to allow us to reduce the overhead by batching together sets of probability calculations and downloading sets of pages at a time."  

The starting point is a set of seeds: the documents from which the crawl begins:

"Deciding on the number of seeds is influenced by many factors such as: the connectivity of the web, spammability of the search engine, efficiency of the crawl, and many other factors".

Their evaluation and experiments showed that the RankMass approach, which maximizes the guaranteed PageRank coverage, is very effective.  It allows search engines to specify the end of the crawl based on specific conditions, so the crawl runs until the required percentage of the web's PageRank has been collected.
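As a rough illustration of the greedy idea (this is my simplification, not the paper's actual algorithm or its coverage guarantee), a crawler can propagate damped "credit" out from trusted seeds and always fetch the page with the most accumulated credit next:

```python
def prioritized_crawl(outlinks, seeds, d=0.85, budget=4):
    # outlinks: dict standing in for the web (page -> pages it links to).
    # credit[p] stands in for a lower bound on p's personalised PageRank;
    # probability mass starts at the trusted seeds and is damped by d at
    # each hop, so far-away, weakly-linked pages accumulate little.
    credit = {s: 1.0 / len(seeds) for s in seeds}
    order = []
    while credit and len(order) < budget:
        page = max(credit, key=credit.get)   # fetch best-guaranteed page first
        c = credit.pop(page)
        order.append(page)
        outs = [p for p in outlinks.get(page, []) if p not in order]
        for nxt in outs:
            credit[nxt] = credit.get(nxt, 0.0) + d * c / len(outs)
    return order
```

When the budget runs out, the pages skipped are exactly the ones with the weakest PageRank lower bounds, which is the intuition behind stopping the crawl once the required mass is covered.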

Why should you care?

This interesting paper shows that search engines can have smaller indices and still prove very effective.  This should improve both precision and recall, because fewer unimportant documents are considered in the computation stages.  The constant talk and wow-factor associated with huge index sizes turn out to be rather irrelevant when you consider the actual quality of those indices.  The bigger the index, the harder it is to manage.

You should care because the quality of your sites becomes crucial.  Not only the way that they are built but also their ability to attract users based on how important they are seen to be.

January 29, 2009

"I won't adopt the semantic web!"

I've heard variants of this for quite a long time, in fact ever since the semantic web became mainstream.  It's never easy to introduce something new, as designers of novel applications will know, and users don't want to learn something new, however easy it is to pick up.  The semantic web is suffering from this too: many webmasters don't want to adopt it, saying it's too much work or that they can't see why it's important, etc...

Well, lo and behold (you know me by now), I found a paper which addresses this issue in a very intelligent way and answers quite a few questions.  It's called "What is an analogue for the semantic web and why is having one important?" by mc schraefel from the University of Southampton (which regularly produces cool papers).

First off, an "analogue" in this sense is something that has a direct resemblance to something else.  For example print (books and things) is analogous to the web because a web page is like a page from a book or newspaper, manual and so on.  The central topic of this paper is about finding something analogous to help people understand the semantic web so it's not so foreign and scary or whatever.

The author says that the web has been represented by Pages + Links and proposes that the semantic web be represented by Notebook + Memex.

The Memex was conceived by Vannevar Bush in 1945.  It's the concept of a personal, interlinked library: an interconnected knowledge base.

He points out that the semantic web offers a very powerful way to interact with information, and to build new interactions on top of that information.  He believes, as I do, that the entire issue concerning the semantic web and its acceptance comes down to the research community not communicating it properly.

"It is important to note that the motivation for this question of analogue is not a marketing/packaging question to help sell the Semantic Web, but is simply a matter of fundamental importance in any research space: it is critical to have both a shared and sharable understanding of a (potentially new) paradigm. If we do not have such a shared understanding, we cannot interrogate the paradigm for either its technical or, perhaps especially, its social goals."

He notes that all the web 2.0 stuff has been based on highly familiar models: RSS for example is still text, and tag clouds and tagging are still displayed just like a catalogue.  The understanding of the web when it was 5 years old is very different from the understanding of the semantic web, which has just turned 5.  He also very nicely points out that the Wikipedia entry for the semantic web is a bit rubbish: "All that description tells anyone about the semantic Web is that it is for Machines."  The emphasis, as far as semantic web researchers are concerned, is the end user.  It is all about creating powerful links in information so that question answering and a whole host of other knowledge discovery methods become possible.

"But how do we describe this potential? For a community steeped in rich link models, Hypertext is an obvious conceptualization. But beyond this community, Hypertext equals “a page with links” – it equals the current Web, not the rich possibility of what we might call Real Hypertext, which was modeled in Note Cards and Microcosm."

The hyperlink in the semantic web can be thought of as "meaning", which is one of the harder concepts to grasp.  "That is, the way meaning is communicated that is not via the explicit prose page or catalogue page, but is via the exposure of the ways in which data is associated, and can be discovered, by direct semantic association, for the reader/interactor/explorer to make meaning."

The author says that the notepad is a good way of introducing the semantic web because it is a page, but an unstructured one.  Calendars and the like can be shared, and tools like Twitter let users post snippets.  "One item can act as a way of redefining another".  Google Docs and various snippet keepers on the web (like the now defunct Google Notebook) mean that data is generated, shared and linked in too.  The Memex is good too because it is designed to retrieve data without "denaturing it" - that is, without taking it out of context.

Semantic web languages like RDF mean that everything can be connected and retrieved intelligently and effectively.  With "automatic structure extraction" you can look at all of your information in the context it was created or saved in.  The data can also be associated with other relevant information: for example, you could find other people talking about the same topic, or working on the same kind of project, or events you might be interested in...
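A handful of RDF-style triples is enough to show the idea.  All the URIs and predicates below are invented for illustration, and the wildcard matcher mimics how a SPARQL pattern would answer "who else wrote about this topic?":

```python
# A tiny in-memory triple store; every identifier here is made up.
triples = [
    ("note:42", "dc:creator", "person:alice"),
    ("note:42", "dc:subject", "topic:semantic-web"),
    ("note:99", "dc:creator", "person:bob"),
    ("note:99", "dc:subject", "topic:semantic-web"),
    ("person:alice", "foaf:knows", "person:bob"),
]

def match(triples, s=None, p=None, o=None):
    # None acts as a wildcard, like a variable in a SPARQL triple pattern
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Find other notes on the same topic as note:42
topic = match(triples, s="note:42", p="dc:subject")[0][2]
related = [ts for ts, _, _ in match(triples, p="dc:subject", o=topic)
           if ts != "note:42"]
```

The point is that the association between the two notes was never written anywhere explicitly; it falls out of the links, which is exactly the "meaning via semantic association" idea above.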

The author observes that the notepad model has limitations, such as its structure, and that "viewing page 6 next to page 36" isn't easy.  He suggests the note card model: a stack of cards with ideas, people and data about stuff written on them, interlinked to external data.  "The relevance of the note card model to the concept of the Semantic Web as personal work space with associated public data is in the integration of personal ideas with external sources: the idea cards are backed up with/informed by the quotations from external sources."

He concludes by asking whether we are ready for a system to support such creativity, meaning that computers would no longer be used simply for productivity.  He believes that we are.  I also believe that we are.  More importantly, I believe that the semantic web is a foundation for far more intelligent systems.  We need to experiment with this model and develop it properly in order to advance.

Obviously this kind of system can't be implemented entirely by humans and needs to be automated, which it largely is: there are all sorts of programs out there that will generate the markup for you.

Why should you care?

If you don't embrace the semantic web it is likely that you will be left behind because it is a necessary step in web evolution.  It will become widespread and it is important to be prepped for it.  It should have your support because it is an ingenious and very powerful way of dealing with the ever growing mass of data available.

January 28, 2009

Information credibility analysis

I wanted to draw a little attention to a Japanese project called the "Information Credibility Criteria Project".  The NICT (National Institute of Information and Communications Technology) started it in 2006. 

This project looks at how information sources are not all equal, in that they are written by different people who... are also not equal!  If you write a post about banana skins and their use in cancer treatment, then unless you're a researcher in this area, your post isn't 100% credible.  If you are a researcher in that area, it is more credible, because you have the proven expertise to write about such a thing.  They don't look exclusively at writers and their authority, but also at other criteria that are not easy to determine automatically:

Credibility of information contents
They use predicate argument structures rather than words for this analysis, and use "automatic synonymous expression acquisition" to deal with synonymous expressions.  The sentences in the documents are classified into opinions, events and facts.  Opinions are classified into positive and negative ones.  An ontology is produced dynamically for each given topic, which helps the user interface with the data.

There are a lot of different variables at play when we look at the credibility of a document.  The grammar, syntax and accuracy of the data presented are all strong signals when I look at a blog post or a website.

Credibility of information sender
They classify writers into individuals or organisations, celebrities or intellectuals, real name or alias, and many more groupings.  This information is gleaned from meta-information, but they also use NLP techniques for it.  The credibility evaluation is based on the quantity and quality of the information the sender has produced so far.

Credibility estimated from document style and superficial characteristics
They take into account whether a document is written in a formal or informal way, what kind of language is being used, how sophisticated the layout is and other such criteria.

Credibility based on social evaluation of information contents/sender
This is based on how the sender is viewed by others.  They use opinion mining from the web based on NLP or using existing rankings or evaluations available.
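The project doesn't publish a single formula, but conceptually the four criteria could be folded into one score along these lines (the weights are entirely invented, and each feature stands in for the output of an upstream classifier):

```python
def credibility_score(features):
    # features: each criterion scored in [0, 1] by upstream analysis
    weights = {
        "content": 0.40,  # credibility of the information contents
        "sender": 0.25,   # credibility of the information sender
        "style": 0.15,    # document style / superficial characteristics
        "social": 0.20,   # social evaluation of contents/sender
    }
    return sum(w * features[name] for name, w in weights.items())
```

With a scheme like this, the banana-skin post from a proven researcher outscores an otherwise identical post from an unknown sender, which is exactly the behaviour described above.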

The research can be applied to all areas of electronic information access, such as email, web documents, desktop documents, and so on.  The idea is not to replace the human but to support the human in his/her judgment of an information source.

Document credibility is an area that I believe is very important for the future of the web.  We can rank documents in a sequence, as Google does for example, based on their relevance to the initial user query.  Google looks at authority as well as content, and at other factors too.  The problem, though, is that without a thorough analysis like the one being devised by NICT, documents that are perhaps not as important can find themselves at the top of the rankings.

Looking at author authority rather than simply document authority is obviously useful, but if this isn't flexible enough then good, relevant documents could be omitted.  Someone who has never written anything before will not, I assume, be considered very authoritative, and someone who has written a lot of bad content shoots themselves in the foot for all their future work!  It therefore becomes important to have a certain standing on the web, or rather in the information community.  If you are not considered very influential, then your work might not be considered influential either.

Obviously there is a lot more research to be done here and I really look forward to reading a lot more about it.  You can check the publications page if you want to read more about the work that NICT has been doing since 2007.  

Why should you care?

If this type of method works perfectly, you will need not only to provide good content but also to have a good reputation.  This applies both to companies and individuals.  By finding out about the author in particular and taking that into account in overall document scoring, an engine could wipe out a good deal of spam, and the standard for "good content" would also be set.  It all reminds me of FOAF and the other methods which exist for tagging up individuals and their connections online.  This is a fundamental part of the semantic web after all, and it could easily be exploited in this way.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at