This blog has moved to http://www.scienceforseo.com. Please update your bookmarks.

Showing posts with label Google.

January 23, 2009

G patent: identifying similar passages in text

The patent entitled "Identifying and Linking Similar Passages in a Digital Text Corpus" was published on the 22nd of January and filed on the 20th of July 2007.

It's a really interesting one, not just because it covers a topic I'm particularly interested in, but because it describes a very useful method for digital libraries in particular. Digital libraries are different to web documents because they don't have loads of functional links in them. The patent notes that using the references and citations listed in the documents isn't helpful, because these aren't used outside academia and related activities.

Basically they're saying that it's hard to browse a large collection of documents in a digital library efficiently. You can't navigate the corpus the way you would navigate the web, because of the nature of its structure.

"As a result, browsing the documents in the corpus can be less stimulating than traditional web browsing because one can not browse by related concept or by other characteristics."

They're saying that finding papers in a digital library is boring because everything is classified either by the keywords the conferences ask for in that particular section of the paper, or by author, title, year, subject... It would be far more useful to browse by related concept, for example. And I agree.

The claim:

"A computer-implemented method of identifying similar passages in a plurality of documents stored in a corpus, comprising: building a shingle table describing shingles found in the corpus, the one or more documents in which the shingles appear, and locations in the documents where the shingles occur; identifying a sequence of multiple contiguous shingles that appears in a source document in the corpus and in at least one other document in the corpus; generating a similar passage in the source document based at least in part on the sequence of multiple contiguous shingles; and storing data describing the similar passage." ("shingles" are simply fragments)

Documents are processed and similar passages amongst them are identified.  Data describing the similarities is stored and the "passage mining engine"  then groups similar passages into further groups which are based on the degree of similarity amongst other things, so we have a ranking algorithm too.  They also describe an interface which shows the user the hyperlinks that are associated with these passages so they can easily navigate them.

Their method basically identifies all shingles, gathers as much data as is available on them (location, documents they appear in, etc...) and then groups them together into clusters based on similarity.
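
The claimed method can be sketched in a few lines of Python. This is a toy reconstruction of the idea, not Google's implementation: the shingle size, the table layout and the `min_run` threshold are my own assumptions.

```python
from collections import defaultdict

def shingles(text, k=4):
    """Split text into overlapping k-word fragments ("shingles")."""
    words = text.lower().split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def build_shingle_table(corpus):
    """Map each shingle to the (document id, position) pairs where it occurs."""
    table = defaultdict(list)
    for doc_id, text in corpus.items():
        for pos, sh in enumerate(shingles(text)):
            table[sh].append((doc_id, pos))
    return table

def shared_passages(corpus, source_id, min_run=2):
    """Find runs of contiguous shingles in the source that also appear elsewhere."""
    table = build_shingle_table(corpus)
    src = shingles(corpus[source_id])
    shared = [any(d != source_id for d, _ in table[sh]) for sh in src]
    runs, start = [], None
    for i, flag in enumerate(shared + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i))  # shingle index range of a similar passage
            start = None
    return runs
```

A run of shared shingles marks a candidate "similar passage", which could then be stored, clustered and hyperlinked as the patent describes.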

Users could navigate passages that are relevant to them, rather than the whole document, which may not be relevant in its entirety. Being able to browse all this data by related features like that would help us find far more relevant papers for our information needs.

This is a different approach to the one where an entire document is analysed (like in LSA) and classified and defined in terms of its overall features.  Using passages instead means that the entire exercise is far more granular.  Here we take into account that a document may be about a topic in a broad sense but actually about several particular subtopics.  We can also tell that perhaps part of a document is useful to a user in response to a query but not the whole thing.  

Search engines for digital libraries containing scientific papers, for example, do not perform half as well as the search engines we're used to using on the web. Google Scholar can sometimes yield much better results than CiteSeer, for example, but then they work very differently. The documents are usually in PDF format or something similar, so as Google note, you need to be able to make them machine readable for starters.

This conveniently, as far as I'm concerned, brings us to the elusive and wonderful exercise of summarization. I say this because if you have a number of fragments from different documents and you can identify how similar they are, you can discard any duplicate information and create a complete summary from the data retrieved for your user, also offering up access to each individual document if the user wants to read the whole thing or the original passages. This is not ground-breaking in summarization, but the model described in the patent fits.
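
A hedged sketch of that duplicate-discarding step, assuming a simple Jaccard word-overlap measure and a similarity threshold of my own choosing (the patent doesn't prescribe either):

```python
def jaccard(a, b):
    """Word-overlap similarity between two text fragments (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def dedup_fragments(fragments, threshold=0.8):
    """Keep each fragment only if it isn't a near-duplicate of one already kept."""
    kept = []
    for frag in fragments:
        if all(jaccard(frag, k) < threshold for k in kept):
            kept.append(frag)
    return kept
```

The surviving fragments would then feed the summary, with links back to the source documents.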

I really like that idea.

December 15, 2008

Designing for conversation

This is a very cool and light-hearted conversation led by Heather Gold at Google. It's really funny and really interesting too, not just for social media people but for anyone in any other community situation.

She also says that she won't use "the words leverage or synergize unless it's for a very important lifesaving purpose".

"Innovative comedian Heather Gold explains basic differences between presentation and conversation and the assumptions underneath each. More entertainingly (and usefully) she demonstrates these ideas by creating a great conversation in the room so that all can feel the difference."

December 09, 2008

LSI - No more!

With the help of two very cool Tweeters, @dpn and @Mendicott, I found some interesting facts about LSI and SEO.

For a simple idea of what LSI/A is, please read the Wikipedia entry on it. The original paper is here.

LSI was patented in 1988 by Scott Deerwester (doing humanitarian work now), Susan Dumais (HCI/IR @ Microsoft), George Furnas (HCI @ Uni Michigan), Richard Harshman (Psychologist @ Uni Western Ontario), Thomas Landauer (Psychologist @ Uni Colorado/Pearson), Karen Lochbaum (where did she go?) and Lynn Streeter (Knowledge technologist @ Pearson).

We will look at Susan Dumais here because she's still actively publishing:

Unsurprisingly, all her recent research is in HCI and personalisation, just like Google, and Microsoft, and... well, everyone:

"The Web changes everything: Understanding the dynamics of Web content". (WSDM 2009)

"The Influence of Caption Features on Clickthrough Patterns in Web Search" (SIGIR 08)

"To Personalize or Not to Personalize:Modeling Queries with Variation in User Intent" (SIGIR 08)

"Supporting searchers in searching". (ACL keynote 08)

"Large scale analysis of Web revisitation patterns" (CHI 08)

"Here or There: Preference judgments for relevance". (ECIR 08)

"The potential value of personalizing search". (SIGIR 07)

"Information Retrieval In Context" (IUI 07)

Humm...No LSI here.

LSI papers since its introduction:

"Adaptive Label-Driven Scaling for Latent Semantic Indexing" - Quan/Chen/Luo/Xiong (USTC/Rutgers) => exploiting category labels to extend LSI (SIGIR 08)

"Model-Averaged Latent Semantic Indexing"- Efron => Extended with Akaike information criterion (SIGIR 07)

"MultiLabel Informed Latent Semantic Indexing"- Yu/Tresp => using the multi-label informed latent semantic indexing (MLSI) algorithm (SIGIR 05)

"Polynomial Filtering in Latent Semantic Indexing for Information Retrieval" - Kokiopoulou/Saad => LSI based on polynomial filtering (SIGIR 04)

"Unitary Operators for Fast Latent Semantic Indexing (FLSI)" - Hoenkamp => introduces alternatives to SVD that use far fewer resources, yet preserve the advantages of LSI (SIGIR 01)

"A Similarity-based Probability Model for Latent Semantic Indexing" - Ding => checks the statistical significance of the semantic dimensions (SIGIR 99)

"Probabilistic Latent Semantic Indexing" - Hofmann => "In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model" - (SIGIR 99)

"A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval" - Kolda/O'Leary => replacing the truncated-SVD low-rank approximation with a semidiscrete decomposition (ACM 1998)

Well...

The initial theory of LSI and its methodology has been extended a great deal over the years. The basic LSI method is important as it's a great way to introduce topic detection and such things. There is a lot more to build on from there though.
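
To make the basic method concrete, here is a minimal LSI sketch in Python with NumPy: a toy term-document matrix, a truncated SVD, and a query folded into the latent space. The vocabulary, matrix and k=2 are made-up illustration values, nothing more.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
docs = np.array([
    [1.0, 0.0, 1.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 0.0, 1.0],  # flower
    [0.0, 0.0, 1.0, 1.0],  # petal
])

# Truncated SVD: keep only the k strongest latent "topic" dimensions.
k = 2
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
doc_vecs = Vt[:k].T                      # each row: a document in latent space

def fold_in(query_terms):
    """Project a query into the same k-dimensional latent space."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return (q @ U[:, :k]) / s[:k]

qv = fold_in({"car"})
sims = (doc_vecs @ qv) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
# Document 0 (car + engine) should now score higher than document 3
# (flower + petal), because they sit in different latent topics.
```

The extensions listed above (pLSI, MLSI, polynomial filtering, SDD...) all start from essentially this decomposition and change what replaces or constrains the SVD step.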

There are many more; some other methods are the Generalized Hebbian Algorithm, partial least squares analysis, Latent Dirichlet Allocation...

@Mendicott reports that "SEO" first appeared in Google in 1998. "Search engine optimisation + latent semantic indexing" appeared in 2005.

@dpn quite rightly says that "SVD on huge datasets is BS".

It appears to me that the LSI the SEO community refers to is in fact the base model, which has been extended, changed and improved quite a bit since 1988. This is to be expected, and so when you say "Oh, I'm using LSI", you should be asked which method, or whether you've extended it yourself, etc...

The current focus on keywords, which is what LSI uses, isn't quite right anymore. I've seen a lot of recent research (and so have many of you) talking about semantics. There is a lot of work on using semantic units, which are not always keywords anyway.

The questions should be "What multitude of methods is Google using?" and "I wonder which LSI method is being used, although I know it is just one factor in a very, very large system", not "How should I optimise my site for LSI?" - I'd ask you which type. I believe Matt Cutts was being very generic when he said Google used LSI :)

December 07, 2008

The Googleplex: serious issues?

Piotr Cofta (BT Plc) wrote a very interesting paper for the 10th Int. Conf. on Electronic Commerce (ICEC) ’08 Innsbruck, Austria.

It questions the Googleplex as a whole rather than just "Google", and honestly it raises some serious issues. Google has an awful lot of power and we have placed an awful lot of trust in it. To an extent, it relies on this trust to be successful and function.

Here are a few main points:

Google try and monitor everything they can on the Internet to gain as much user data as possible, thus monitoring behaviour.  It's important for them to have a stake in every possible interaction mode, not just search for example.

Google focuses on innovation, so it frantically chases top researchers to develop trends and obviously gives out free tools that are used to test and develop thousands of new ideas at the same time.

PageRank and query logs enable Google to identify trends that are likely to stay.  It's cheap to run as well.  The author reckons that more of this data will be available than is already presented via AdSense.

The trends unsurprisingly are used to fuel the advertising market.

People aren't identified during personalisation, but computers are. As we all know, the fact that you can log in to one tool means you're signed in to everything. This way an enormous amount of user-behaviour data can be captured.

There are endless opportunities for new applications, but people search is super important. Google are tracking people who share similarities, habits and customs. The author qualifies the task as "mathematically trivial", but says it is hugely important for us all.

The author also says that "the Googleplex is not malicious in itself." It's a business. They have a huge amount of power and we need to see whether they eventually abuse it or not, despite the "Do no evil" strap line. Interestingly, he asks if the Googleplex will be compromised by others. He asks three questions: whether the Googleplex can be harmful to individuals, to society and to social values. People like to develop trust and confidence in organisations like this, which is dangerous to a certain extent.

He says that the "crude PageRank value" can be used (as we all know) as the strength score of a site and also as its reputation score. In a way, I think, assigning a numerical value to something gives the users something, so that they can feel more satisfied and confident in the organisation. Even if the numbers don't really add up on purpose, the score still fulfils its function as far as the users are concerned. SEO peeps did once use it as a measure of success, and held it as very important. Evidence of this can be seen in the mountains of blogs talking about it.

He says, in a totally different way than I'll put it here, that we put trust in the results because we don't know how the whole thing works. I've already said that we cannot know exactly whether the results we are served are the exact right ones for us; they are rather the best ones the engine can come up with. Maybe the perfect documents for your needs don't show. People have such confidence that they usually defend Google passionately when this issue is raised. From a scientific perspective, though, it is very natural to consider that the results might not be the best. One research project did show that Google didn't perform well at all in comparison to an actual human expert ranking, for example.

Try a very, very simplistic test: choose your expert area (a very specific one) and rank the most important documents you would give someone on this topic. Then check Google and see what you find. More complex tests will of course yield more exact results.

He makes a good point about the "Do no evil" thing by saying that the system can't possibly do evil or good, because it's fully automated. When there have been mistakes, and when people have sporadically written on blogs or in blog comments about how this can be questioned, the idea dies down quite quickly, because we love and trust Google. It's not their fault that nasty ads get served up or that an update goes painfully wrong; it's the system.

He's right in suggesting that maybe all the free stuff we get does come at quite a high price. 

I urge you to read the paper in its entirety; to do that you'll need ACM Digital Library access, which is well worth the purchase. In my opinion it is a very necessary professional tool for marketeers and computer scientists alike.

December 01, 2008

Google tech talk on the semantic web

This is by Professor Abraham Bernstein. It's very interesting: it briefly covers what the semantic web is, but mostly the various techniques used, such as SPARQL, Querix, Ginseng, OWL DL... these are, however, mostly rubbish for ordinary humans. He explores how to make the semantic web accessible to the general public.

One of the solutions involves natural language queries (yey!) - but it's a "complete mess" at the moment, being ambiguous, domain specific, etc... BUT he did find, as I did, and as Jimmy Lin did, that users prefer natural language querying.

During the Google Ninja challenge I set you all, I found that the majority of people are indeed using natural language in Google, and I think this is because of the complexity of the search context. Watch this space.

November 26, 2008

Google, my backend system

Right now, we're the web equivalent of the horse and cart. We have invented the wheel and domesticated animals, and this has revolutionised our existence, especially the way in which we do business, but... I don't see Ferraris, E-type Jags or anything like that right now in web world.

Are we heading that way?  Yes, for sure.

Search is not supposed to be something independent of the rest of your web experience, or actually your digital experience. You are going to be able to access search from any device or environment without actually having to go to a search engine.

Imagine you're typing away at an article for your photography blog. The intelligent environment you're in is already aware that you are writing something for your blog, because it has seen patterns and features develop over time. You highlight something and summon Google. It does something pretty cool: it goes out with the highlighted words as a query, but it already has the context of the query, because it knows all about your writing for your photography blog.

It rushes out and visits all the top results. These results are dependent not just on the keyword phrase but also on the other variables gathered from your intelligent environment. Then it pulls out all the key concepts and information and writes you a summary to answer your question. You can "repair" the results by typing something like "No, I meant..." or "Perfect! Tell me more about the Canon". And off it goes again. Or "Let me see the top 5 documents", or "Show me related information"...

From a mobile device, you could summon Google during a conversation with someone, for example. Imagine you're trying to figure out the closest restaurant to you both. You summon Google and ask it, "Where's the best place for us to meet? She's vegetarian." You can request an answer to a question like "Did Angelina Jolie really bungee jump yesterday?" and get a response such as "Yes she did. She jumped off a bridge in New Zealand".

I look forward to summoning Google and saying "Remember when I was writing that paper for that conference?  There was a quote by x about y in it, what was it?"..."That's right, who else said something about that?"...

Google becomes a backend system. Gasp! But there is nothing more natural than for the engine to be in the background. I think that conversational systems removed from search are fun toys, but their real use is in information retrieval. Once you get used to having all your information at your fingertips as and when you ask, you are also going to get used to conversing with the system pretty quickly.

For this kind of thing to work you will need strong summarization systems, natural language generation and understanding, machine translation, personalization and machine learning, not to mention all the other supporting technologies without which it could never happen. Luckily these are all under development right now.

One of the most interesting questions I believe is "Does your behaviour change now that the search engine is conversational in nature?" - Does it become your friend, do you get attached to it because it shows human qualities, or do you treat it like a tool?  Does the way that you search change now that you no longer actually go to a search engine web page?  Are you more focused, more specific, more vague?  

Hummmm... so how are businesses going to take advantage of the search market then? Clearly ads are still going to be served up, but how do you make sure the clever agent likes your content most of all?

Google discuss the more immediate future of search here.

November 25, 2008

SearchWiki according to me

I don't usually post about already well-covered news, but in the case of the Google SearchWiki I will make a small exception. SearchWiki allows you to manipulate the search engine results and leave comments for others about a result.

There is an awful lot more information on the actual Google blog; Danny Sullivan wrote a nice guide as well, and there's a Q&A with Google about it too.

I've asked around and most general users don't seem to have even noticed it was there. My mum definitely has no idea what the whole thing is about, and because she doesn't want to break anything, she isn't going to press any of the buttons. The more savvy users, right down to the programmers, said they weren't bothered with it either.

I keep forgetting it's there and so I haven't used it very much. I think that we would all begin to use it if we started to see the benefit. Sadly, in order to see the benefit, you have to start using it!

Google said “It’s a new way to empower users. You can remember answers to repeat queries. It lets you add your personal touch to our algorithms” (See the Q&A doc).

I genuinely think it is indeed a tool to help you alter the results to suit your particular slant on a particular query.  I also think it's a pretty cool way to collect a huge amount of user data and also human edited results provide more information on the authority of the resource.

Remember how we looked at social media sites like Digg and said that the voting was warped because it is so easy to manipulate? Well, seeing as this is a "closed" environment, meaning that nobody but you gets to see it, there is no reason to manipulate the results. The issue with the weighting of each vote also disappears, because each vote is specific to a single user.

1-800-GOOG-411 was all about collecting phonemes to feed into a machine to make voice search possible today.  I think SearchWiki is along the same lines.

November 21, 2008

User Experience at Google

At CHI 2008, Google presented a paper called "User Experience at Google – Focus on the user and all else will follow".  It's an overview of how the UX team at Google operate and how Google gets that super important job done.

Here they discuss their bottom-up 'ideas' culture, their data-driven engineering approach, their fast, highly iterative web development cycle, and their global product perspective of designing for multiple countries.  Google's core products are search, applications and commerce.  The UX team is located all over the world.

Here are some highlights:

Lots of cool stuff comes out of the 20% scheme, but the UX team has to make sure projects are not just technically feasible and fascinating but also useful to the user. There are also so many projects that the UX team has to be super organised to cover them all.

The UX teams educate and inform all of the teams on good user-experience practice and work hard to make sure it is ingrained in their minds. In fact they word it very nicely: they say "UX aims to get user empathy, and design principles into every Google engineer's head". This is what they call "entering the corporate DNA".

All nooglers (new Googlers) are sent on a "Life of the user" training.  The UX team also hosts "Field Fridays", "any Googler can attend field studies to connect them with the everyday problems and “delighters” of our users."   There are "Office hours" sessions for each product area where Googlers can get involved hands-on.  20% projects get some help in these sessions.  

They don't do usability tests for each feature, instead they bundle up testing into "Regular testing programs" for any product area.  They streamline the recruitment process and spare 5-10 minute "Piggy-back" slots are made available for smaller projects.  They have a "User research knowledgebase", to make information accessible to teams by product area.  

As for the whole of Google, they use a data-driven approach. Absolutely everything is tracked at Google, which is really sensible. (Computing people do have an unnatural passion for data, I might add.) Some UX experts work on usage data, where they gather things like page views, product growth and the number of "active" users (they mention that defining these isn't straightforward). For Blogger, for example, they use a variable-length time window based on what is typical for each blogger, because this product isn't like the others. They also use A/B testing, but of course it doesn't stop there: there's a load of qualitative and quantitative data as well.

Updates and changes to products including new things coming out means that they have to use: "a number of agile techniques such as guerilla usability testing (e.g. limited numbers of users hijacked from the Google cafeteria at short notice), prototyping on the fly, and online experimentation." They'll use live instant messaging also.

On a global scale, they have to make sure that the cultural, regulatory and structural differences between locations are addressed correctly. They use Global Payment as an example, which impacts Google Ads and Checkout, as well as involving financial regulations and tax issues. Geotargeting also comes under this. How can they predict the location of a user, or their language? This is why the team is global, and they carry out global projects.

I think it all sounds really exciting and well structured.  I would love to see that data :) 

November 14, 2008

Google Tech talk

On the Google channel on YouTube you'll find a tech talk called "Knowledge-based Information Retrieval with Wikipedia" from October 31st 2008.

It covers the limitations of search engines today.  Documents and queries aren't really understood at all, because they're still viewed as tokens.

They tested a method where they consult Wikipedia for knowledge.  It hasn't worked so far but there has been a lot of research on it.  Wikipedia is useful for semantic relatedness (see Wikirelate).  

Wikipedia is treated like an ontology here. Wikipedia, however, is not a formal structure, so this isn't easy. The belief is that it can be used in this way, using HCI rather than AI or NLP.

Koru is introduced for exploratory search.  It works well, although improvements are necessary.  "Wikiminer" is also demoed.  

For an awful lot more detail and interesting information, take an hour, sit back and enjoy the talk.

November 11, 2008

Google ninja challenge-some results

The Google Ninja challenge was launched on the 23rd of October.  Volunteers were asked to fill in a preliminary questionnaire, and then were given 8 things to query in Google.  Then they did a de-brief survey.

In the preliminary survey they were asked how confident they were that they would find all the information.  Most were 80% confident.  They did however struggle, or not find it quite as easy as they had thought.  Internet professionals were no exception to the rule.

What is hard about those queries?  That's a question I'll be asking you.

It is hard, although some people managed quite easily to find the answers. Some gave off-topic answers that were as close as they could get; some just couldn't find the information.

I'm still collecting data, if you think you can handle the challenge, give it a shot and see how you go.

October 21, 2008

Internet progress fast - CS slow

I wrote a rather long post at High Rankings and decided that it deserved a place on the blog.  A very good point was made by Randy about how fast the Internet moves.  There are new developments almost daily, new systems, new ways of doing things emerge, and we all keep up with the new trends and algorithms.  Computer science research that is 4 years old (or even older!) isn't as current, it is true.  

This is because it takes ages for a lot of methods to be evaluated properly so that they can safely be used in public systems like search engines or social networks for example.  Some systems aren't designed to use some methods, and only when they have gone through many iterations, they suddenly see the need to incorporate a certain method or even a few.

Stemming, for example, is quite old: it goes back to 1968, when the Lovins stemmer was published. Google, I believe (though I'm not totally sure of the exact date), applied stemming to queries in 2003. That's 35 years! I think they were already using it in the internal system though; it's a pretty standard method in IR after all. I wrote a stemmer in 2005 and it only started being used in 2007; not a lot of people saw any use for a stemmer that stemmed to exact words, but now that's pretty standard too. That took 2 years.
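
For readers who haven't met stemming before, the core idea fits in a few lines of Python. This is a crude suffix stripper with my own toy rules, far simpler than the Lovins or Porter algorithms, but the same basic principle: conflate word variants onto a shared stem.

```python
def simple_stem(word):
    """A crude suffix-stripping stemmer (illustration only)."""
    for suffix in ("ational", "ization", "fulness", "ingly",
                   "ing", "edly", "ed", "es", "s"):
        # Strip the first matching suffix, but keep a stem of at least 3 letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("jumping"), simple_stem("cats"))  # jump cat
```

Real stemmers add ordered rule sets, recoding steps and exceptions; this sketch just shows why the technique is cheap enough to have been standard in IR for decades.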

PageRank came about in 1995 and was implemented when Google was publicly released in 1998; that's 3 years.

I work in conversational systems and it has taken a while for the science community and the industry to see why they could be useful. Now there's a lot of research in this area, yet the first chatbot was invented in 1966 (ELIZA). It's only recently that companies have started using chatbots on their websites (Ikea, for example), and suddenly the potential for such systems in IR is being realised. Long wait! We don't even have all the technology needed yet to make something really good.

I think it's really important for the SEO community to keep track of papers released by IR researchers, and also by NLP/AI researchers when the work relates to search engines particularly. It's useful to learn about the methods being developed, and it gives some insight into how they might be implemented (although this could take some time!). You can use CiteSeer to find them, or DBLP, and checking the references can be useful too. That's where my massive reading list comes from!

Of course some methods do get implemented quite quickly, and I think this happens when they are built specifically for a system already under development. The big search engines have people working solely on this, as do companies like IBM. What I mean is that you shouldn't discount methods that were published a few years ago. A lot of social media stuff was published quite some years ago too.

Happy reading :)

October 07, 2008

What is semantic search?

There's been a lot of talk recently about "semantic search", which is also referred to as the "read/write web". Powerset, Cognition, Ask, Hakia and many others are "semantic search engines". It's not a new concept; research has been available in academia for at least 10 years. In fact, a lot of the people involved in it in the early years are involved in the newly released semantic search engines today. Not a big surprise!

So, what are semantics, for a start?

Semantics refers to meaning in language (or code or anything else). Semantic analysis uses syntax and pragmatics as well as contextual information to derive the meaning of a text, or even of an audio stream if you want to use that. It's not just about finding similarities or context between two words, but rather taking the entire text or query into account to establish meaning.

What is the semantic web?

It's a common framework allowing information to be shared and reused.  Information is stored in machine readable formats.  

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." - Tim Berners-Lee

That's probably the best definition, seeing as he invented it. The semantic web is relevant to semantic search: it uses many of the same techniques and is based on the same idea. It's not a new version of the web, but rather an extension. There are a lot of conferences about this worldwide, such as the Semantic Web Technology Conference, for example.

Technologies used obviously include ontologies (which are like big storage boxes full of information on how words and concepts link to each other), built mostly in OWL; natural language processing tools, for named entity extraction for example; data interchange formats (like RDF/XML or Turtle); schemas like the RDF Schema; XML to provide syntax for content structure; and SPARQL, which is a web query language for semantic resources.

And what is semantic search?

Google uses PageRank to identify relevancy, whereas semantic search engines use meaning to return highly relevant results. Google returns keyword/keyphrase results; the semantic solution returns information.
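
For contrast with the meaning-based approach, here is the textbook power-iteration form of PageRank in Python. It's a minimal sketch of the published algorithm, not Google's production system; the damping factor and iteration count are conventional defaults, not Google's actual settings.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict of page -> list of outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                      # dangling page: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                             # otherwise split rank over out-links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank
```

Note that nothing in this computation looks at what a page means; it's purely a link-structure score, which is exactly the gap the semantic engines are trying to fill.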

The data has to be really structured in ontologies, just like in the semantic web. A semantic network is created which links all of the concepts and words together. Semantic search uses word sense disambiguation (WSD) to decipher what a word may relate to. WordNet, which you can download for free, is a machine-readable dictionary that a lot of scientists have used for this task, although it's far from foolproof. Here is a very comprehensive list of which semantic search engines use what kind of procedure, in pretty plain English.
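
To show what WSD looks like in practice, here is the classic simplified Lesk heuristic in Python: pick the sense whose dictionary gloss overlaps most with the surrounding context. The miniature sense inventory is hypothetical; a real system would pull glosses from WordNet.

```python
def simplified_lesk(word, context, sense_inventory):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_inventory[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical miniature sense inventory (illustration only).
inventory = {
    "bank": {
        "finance": "an institution for deposits loans and money",
        "river": "the sloping land beside a body of water",
    }
}

sense = simplified_lesk("bank", "the boat drifted to the land beside the water",
                        inventory)
# "river" wins: its gloss shares "the", "land", "beside" and "water" with the context.
```

Real WSD systems weight the overlap, use sense frequencies and look at glosses of neighbouring words too, which is part of why it remains far from foolproof.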

Google does respond to natural language queries, such as "Where was Marilyn Monroe born?". Hakia doesn't understand the query and tells me what Marilyn's real name was. Powerset (which only searches Wikipedia and Freebase) comes up with the goods: "LA". Working in a "closed domain", Powerset has an easier job than Hakia, which searches the whole web just like Google. Google, however, delivered where Hakia didn't this time round.

Then I tried "Is chili bad for you?" in all three.  Hakia came up with book reviews for a book called "Bad Chili", Google came up with a forum article with that exact question in it, and Powerset delivered an in-depth article on the effects of chili on humans.  The following results from Powerset are all off though.  Hakia's results continue with the book, but Google gives me loads of results all about the effect of chili on the body.

Have a go yourself and see what happens.  

This little test definitely shows that Google can come up with the goods, whereas the semantic engines struggle.  More work needed there.  I would be very surprised if Google were shunning semantic web technology and natural language queries.  I would leave that open for discussion actually.

The future?  Natural language queries, and natural language generation for a straight answer to a question, and a summary of all of the most relevant resources in one text, and the option to read the individual documents.  That's not an easy feat!

October 02, 2008

Nexplore search engine

The Nexplore search engine has been released in beta.  It's fast, it's busy, it's fun and it's pretty accurate.  You can search the web, news, video, images, blogs and podcasts.  It's clearly aimed at social networking, it's all very web 2.0.  

They say:

“Our Web 2.0 application model uses innovative cloud-computing techniques to create a highly effective distributed search engine that easily scales to meet volume demands without compromising performance. We’ve combined this backend with a social overlay to fine tune and share results and a user interface built with advanced RIA technologies to create a compelling, highly productive user experience. NeXplore Search is poised for growth as users seek more effective and enjoyable ways to find the information they need.”

So they've made it fast using cloud computing, and they use RIAs (Rich Internet Applications) to enhance interactivity and expressiveness.  It's quite different to the Google model, which is very centred on providing information rather than creating a networked, rich environment like Nexplore's.

I tried a couple of searches (to have a clear picture I'd have to do loads more, to be fair).  I searched for two of my favourite things:

The Breeders:

Google -> wikipedia, myspace, a random blog, random blog, random blog
Nexplore -> random blog, myspace, wikipedia, and the same random blogs as Google

Ashtanga yoga:

Google -> wikipedia, wikipedia, yoga school, BBC, Amazon
Nexplore -> ashtanga.com (A very authoritative site), a yoga school, wikipedia, a yoga school, and ashtanga.com again.

So clearly in my searches Google has preferred wikipedia results, which is cool, because I might well want to know the definition of "Ashtanga yoga" or know who "The Breeders" are.  Nexplore also provides that but starts with a random blog.  Really the results are pretty much the same for some searches, just in a different order.  Nexplore does better on "Ashtanga yoga" I think because it gives me a well known authoritative site rather than wikipedia first.  Here the results weren't the same as in Google.

Google gives me 4 video options in the results for "The Breeders".  Nexplore doesn't do that.  You have to search under "video", so no universal search in the results.

I searched for myself and the results contained fewer instances of the person of the same name who works for NASA, so that's fine.

Nexplore has a super busy interface; do not be duped by the very minimalistic landing search page.  Under every result there's a social media sharing option, a massive preview window pops up whenever you hover over a result, and there's a wiki search box constantly on the right of the page...you see what I mean...it's very busy.  You can view results in a line, which makes them a bit hard to navigate through because there are 25 results on a page.  You can also just view the site previews, which look a bit like a music library.  

Nexplore does personalise results for you as you can bin them, preview them or save them.  Google is looking into this too with the thumbs up or down thing.  It also gives you a "popular" searches library which I like. 

I'm not too sure about ad relevance, as I get one for puppies for sale for "The Breeders" search when the engine has already established that it's a rock band.  For "Ashtanga yoga", two ads are relevant but one is for a hotel which has a yoga class.  That hotel site isn't really about ashtanga yoga, though.  Maybe it was the best ad to display from the collection available.

So.  I think Nexplore has good results, as good as Google's in my short experiment, and way better than Cuil's.  I think it's too busy, there's too much going on, and I don't like the preview popup.  The other two options, gallery view and list view, just aren't clear enough in my view because they don't give enough info, which you get from the description.  I like the personalisation.  I like that it gave me authoritative sites at the top.  

The social media buttons to share the site are on every result, which further crowds the interface.  It's a great idea though; maybe it could just be done a bit differently to relieve the interface of clutter.  I hate all this popping-up stuff!  Also, I don't like how the pictures display in the image search.

The verdict:  I will use it for a while and see how it grows on me; after all, every change takes a while to adjust to.  I don't want to write it off just because I don't like the clutter so early on.  I might really like it all once I use it often.  The results are good.  

I'm really not ready to dump my classic, simple Google interface, and its results are fine for my purposes.  I think that it's very hard for a new search engine to come along with new cool ideas, and this one has really tried hard.  I'll definitely persevere and see how I go.

Social media sharing in this kind of engine is done for the sites automatically, because of the buttons under every result.  It gives them more chance at visibility than they get in Google.

The CTO, Dion Hinchcliffe runs a blog called "Musings and ruminations on building great systems".  It's really wordy but quite interesting so read it if you have the time.  You can follow him on Twitter as well.

You can also read about Nexplore at "Beyond Search", where Stephen Arnold gives his take on it.

If we did all migrate to Nexplore, I think we'd still Google things in it anyway.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.