October 31, 2008

TGIF - woooo!

Welcome back to Friday, I hope that you're all looking forward to either a fun packed weekend or a relaxed one chilling on the sofa.  I hope none of you are moving house like I was last week - not fun.

Anyway without much ado, here are the gems for today:

A project manager, hardware engineer and software engineer were in a car heading down a hill when the brakes failed. The driver managed to get it stopped by using the gears and a convenient dirt track. 

All three jumped out and after peering under the car the hardware engineer said, "I see what the problem is and with this handy roll of duct tape I think I can fix it good enough to get us to the next town". The project manager quickly interrupted, "No, no, no. Before we do anything we need to decide on a vision for our future, figure out a plan and assign individual deliverables". At which point the software engineer said, "You know what, I think we should push the car back up to the top of the hill and see if it happens again".

Next:

“The cloning of humans is on most of the lists of things to worry about from Science, along with behavior control, genetic engineering, transplanted heads, computer poetry and the unrestrained growth of plastic flowers” (Lewis Thomas)

"Spam will be a thing of the past in two years' time." (2004)

...aw.....Bill....

A small surprise - I'm adding a TGIF cheesy tune too - can you guess what it is?  Enjoy :)



Multilingual SEO

We often forget that English is not the only language online, and that many other nations who speak a multitude of different languages also have SEO needs.

According to Internet World Stats, the top 5 languages among internet users break down like so: 29.4% of users are English speakers, 18.9% Chinese, 8.5% Spanish, 6.4% Japanese and 4.7% French.

Translating content directly from English, for example, will not be good enough, because each language has its own nuances.  There may also not be enough keywords in a particular language to allow for full optimisation.

The local culture may not be the same as in English-speaking countries, so the copy will need to be adapted to what local readers are used to, in the right style and tone.

We sometimes forget that different English-speaking countries have different spellings and words for things too, so these also need to be taken into account, even though it is essentially the same language.

It is better imho to re-write the entire copy from scratch rather than translate it.

Google is the most used search engine in most countries, but some countries prefer other engines: Yandex has around 45% of the market in Russia, and China favours Baidu.

You might want to look into registering the domain in the country that you're targeting, and keep the local extension in the URL.  The competitors will be different to those for the English version of the site for example.  The link building will also be localised.

Social media is also affected as you'll have to target the sites that people are using in that country.  For example in the US Facebook and MySpace dominate.  In Brazil it's Orkut, in Russia V Kontakte, Thailand Hi-5... 

I suggest using a fluent speaker of the language who is great at content writing, with a really good knowledge of the local culture and SEO.  

There's a free ranking tool that will let you know how your site ranks in multiple languages here. 

October 30, 2008

Corpus for nasty web spam

Researchers who study web spam are limited by the lack of available corpora.  One that gets used quite often is "WEBSPAM-UK2007", released by Yahoo (there's also a 2006 version).  It's really useful, but as they say, it was generated to aid the researchers, so it's biased towards their needs.  Also, you can't compare results unless they're tested on the same collection.

The University of Milan downloaded loads of documents for the collection, starting from a set of hosts listed in DMOZ for the .uk domain.  They followed links recursively in breadth-first order.  Then lots of volunteers tagged the hosts.

Things they found that identified a spam host were the number of keywords in the URL, the anchor text in links, sponsored links, and content copied from search engine results.

There are:
  • 8123 tagged as "normal"
  • 2113 tagged as "Spam"
  • 426 tagged as "undecided"
Yahoo do loads of work on web spam, check out the results of their tests at AIRWeb and "the web spam challenge".

This is also a good resource for you, listing the characteristics of nasty spam techniques.

It's really interesting to research web spam, because at the end of the day it's one of the most crippling things for a search engine.  It ruins quality and is highly unwelcome in the index, taking up valuable resources.  It also ruins the experience for users, and basically spreads a lot of pain in our information-seeking community.  It's by no means an easy problem to solve.  Link-based features are mostly analysed using classifiers such as SVMs.  Maybe it's time to look beyond links?
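For the curious, here's roughly what that kind of host classification looks like in practice - a minimal sketch assuming scikit-learn, with completely made-up feature values (keyword count in the URL, share of anchor-text links, sponsored-link count) standing in for the real corpus features:

# Minimal sketch: classifying hosts as spam vs normal with an SVM, in the
# spirit of the link- and URL-based features mentioned above. The feature
# values and labels below are invented purely for illustration.
from sklearn import svm

# Each row: [keywords in URL, fraction of links with anchor text, sponsored links]
X_train = [
    [1, 0.10, 0],   # looks normal
    [2, 0.15, 1],   # looks normal
    [9, 0.80, 12],  # looks spammy
    [7, 0.95, 20],  # looks spammy
]
y_train = [0, 0, 1, 1]  # 0 = normal, 1 = spam

clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

print(clf.predict([[8, 0.70, 10]]))  # -> [1], i.e. predicted spam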


October 28, 2008

Free IR newsletter by BCS-IRSG

The British Computer Society's Information Retrieval Specialist Group (BCS-IRSG) releases a newsletter every so often, packed with brilliant information, called "Informer".  Recent research, new IR book reviews and conference coverage, for example, feature often.

It's a great way to stay on top of what's going on in IR, and find new resources as well.

In this issue you will find:

* Product Review: “Aduna Autofocus 5.0” by Bob Bater

* Book Review: “Social Computing, Behavioural Modeling and Prediction” Reviewed by Paul Matthews

* Workshop Review: "FDIA & Search Solutions 2008” Reviewed by Udo Kruschwitz and Alvaro Huertas

* Book Review: “Visualization for Information Retrieval” Reviewed by Andrew Neill

Enjoy.

AI and its implications for SEO

Headup is a new plugin for Firefox that uses semantic web methods to produce personalised data related to the current webpage from all over the web.  It's been defined as a "personal discovery agent to the web", an intelligent search agent basically.  It uses Silverlight 2 to store information locally so that privacy isn't an issue.

It's in beta right now, but you can request an invite, it won't be launched until early next year.

It supports Digg, Gmail, Google, Wikipedia and some others.  It also has geolocation capabilities through Yahoo.

This is the kind of thing that I have been seeing increasingly in the last 2 years, with systems that can replace the usual search results browsing by delivering a wealth of information directly to the user.  

These intelligent agents are AI entities which act upon their environment and observe changes and patterns. They learn from their environment and gather knowledge. They use this information to fulfill their duty, in this case providing relevant information to users.  

Nicholas Negroponte (one laptop per child project, and head of MIT Media labs) wrote a book in the early 90's about creating a more human environment.  This idea has been around for a while now, and there is research being actively carried out in this area.

It changes the SEO landscape quite significantly, because information pertinent to the user's information needs comes from all over the place, rather than from the search engine exclusively.  What would happen if this became widespread?  Visibility on the web becomes a very important issue for companies.  The social media marketing trend would probably become set in concrete as it were.  Getting people through to sites through many different places than simply search engines becomes important.  And what becomes of the search engine?  Do they become back-end systems?

Check out Headup and see what you think.


October 27, 2008

Cleaning up misconceptions

I asked around for common myths and misconceptions about the web or the Internet that people found the most annoying and gave them the chance to put them right :)

1)  The Internet is not the same as the world-wide-web:
  • The Internet was invented by DARPA with ARPANET in 1969 (it is a global system of interconnected computer networks that interchange data by packet switching)
  • Tim Berners-Lee invented the world-wide-web at CERN in 1989 (it is a system of hyperlinked documents accessible via the Internet) - it was released in 1992.
2) Very little of the world's information is actually available online - that's why Google launched the Google Books Library Project.

3) No one owns the Internet (thankfully - at least not for now).  It is a collaborative information space.

4) Because Google is the most used search engine, it could be said that it dominates the www - Google certainly has a lot of clout.  Nobody owns the www either though.

5) You write a "post" and not a "blog" - The post is on your blog

6) Cookies are not spyware

7) Internet accelerators do not speed up your connection, they optimise how your system works with it.

8) The Internet won't die in 2012.  It was predicted that the Internet would become unusable in 2007. 

9) Yes, Yahoo could have acquired Google for $5 billion back in 2001

10) Setting up a database is not like excel

11) There is such a thing in computing as web 1.0, 2.0 and 3.0

12) Web 3.0 is not the semantic web - the semantic web is an extension of web 3.0

13) No, Chrome does not have a keylogger

14) Accessibility does not impose limitations on web design 

15) Microsoft is not developing the iLoo

Feel free to add your own to the list or argue one of them down, there are many many more that could be added to this list.

October 24, 2008

bgC3

Despite the weekly "aw...Bill" bit in the TGIF posts, I like Bill.  I think he's a cool guy who, despite saying some silly things (I have too, but I'm not rich and famous so not a lot of people care), has been visionary.  Let's not forget the work of the Bill & Melinda Gates Foundation either.

Ok, his company, Microsoft, has ripped me off several times with rubbish OS, and expensive and unnecessary Office products (use OpenOffice), but I have to say that without Microsoft I may not have found so much of the free and amazing software I use now!

Anyway, that aside, he's creating a new company called bgC3 ("Bill Gates Catalyst 3" - the 3 because it's his third venture).  It's supposed to be a think tank.  He is still Microsoft's chairman and is still involved there but has actually "left".  This is him branching out and doing his own thing again.  It's obviously all computer science and technology oriented research that he has in mind.  The idea is that he'll bring together clever people to work on cutting-edge ideas.

This is what labs around the world do, get clever people to work on innovative and cutting-edge ideas and technology.  The problem in academia is funding.  This is something that bgC3 won't be struggling with I suspect.

Check out TechFlash for a lot more information.

TGIF - at long last.

Welcome...you survived yet another week, and you're now on the home run...the weekend hits soon. 

As every friday here at SFS, we'll treat ourselves to some fun stuff - here goes:

The huge printing presses of a major Chicago newspaper began malfunctioning on the Saturday before Christmas, putting all the revenue for advertising that was to appear in the Sunday paper in jeopardy. None of the technicians could track down the problem. Finally, a frantic call was made to the retired printer who had worked with these presses for over 40 years. "We'll pay anything; just come in and fix them," he was told.

When he arrived, he walked around for a few minutes, surveying the presses; then he approached one of the control panels and opened it. He removed a dime from his pocket, turned a screw 1/4 of a turn, and said, "The presses will now work correctly." After being profusely thanked, he was told to submit a bill for his work.

The bill arrived a few days later, for $10,000.00! Not wanting to pay such a huge amount for so little work, the printer was told to please itemize his charges, with the hope that he would reduce the amount once he had to identify his services. The revised bill arrived: $1.00 for turning the screw; $9,999.00 for knowing which screw to turn.

Commentary: most debugging problems are fixed easily; identifying the location of the problem is hard.

It is...it drives me mad.  Luckily I have some clever friends to help out :)

Our usual...this brings to mind the mess that is Vista:

"If you can't make it good, at least make it look good".

....aw...Bill....

No SEO, no links - tons of traffic and #1 rankings!

We're all trying to get traffic to our clients' websites, and maybe even our own, and we work so hard on SEO, social media and marketing techniques to try and achieve that.

My boyfriend started a blog and within a few weeks he had over 500 people coming along.

No PageRank, no links, no social media work, nothing.  No SEO involvement at all. How did he do it?

Easy: he posted about coding errors.  Some of them were very rare.  When he got a tough one, he searched everywhere online to find the answer to his problem, but often couldn't find a related resource.  When he had finally worked it out, he posted about how to fix it.

Because he was the only person to provide specialist information like this, he topped the rankings easily, and people flocked to his blog :)

What did he do next?  Shut it down as he was short on time and had other interests.  Typical :)

What has he done now?  Started blogging again, and yes...it's working out. 

October 23, 2008

Think you're good with Google?


 Are you a Google ninja?  Can you master any query thrown at you?  Would you like to win a cool book all about language and computers (it'll help you understand search engines)?

If so, take part in the Google Ninja Challenge:

You're given 8 query contexts or questions, and then you're asked to find out the answer using Google.

You keep a record of your queries, from the first to the last one you used to find the right information - it's not about how few queries it takes you to find the information, it's simply about finding it.

About the experiment:

This is part of an experiment not on Google but rather on users, so it's an HCI experiment.  The KIA project (Knowledge Interaction Agent) is all about natural language generation and understanding.  We can't do any of that if we don't know how users search for things or what language they use, for example.  The first part of the experiment happened in 2006/7 and was based on an irritating chatbot system that helped us understand how accepting users were of such things.  You can read my research on that here; it's a Springer paper from HCI International.

The winner of the book is chosen by a group of researchers at the university.  The reason for that person winning will be revealed in the experiment analysis afterwards.

If you want to take part....

  1. Start here by filling out the intro survey
  2. Once you've done that - play with the 8 queries
  3. Finally fill in the de-brief form with all your answers
Have fun people and thanks :)

October 22, 2008

Google tricks and treats


 Google hosted an online session with presentations from Googlers, with question-answering time too using Google moderator.  Matt Cutts was there of course, and John Mueller, Kaspar Szymanski, and other notable Googlers.

I hate duplicate content even more than Google does, so when I can avoid it, I do!  I won't post in great detail because Google is going to make the whole thing available in a few days.  That way you can experience it all yourself.

They covered topics such as the new 404 solution, and "SEO myths", they also talked about personalisation and answered a whole bunch of interesting questions from attendees.

They repeated the figure that only 5% of the code online is actually valid.  Browsers cope with it, but valid code will serve you well, as it will be more efficient on mobile devices for example.

To stop a page being indexed, you should use a robots meta tag with a noindex directive, i.e. <meta name="robots" content="noindex">.  Nofollow, on the other hand, is just a request and doesn't mean that any bot will honour it.  They said that you shouldn't disallow the spiders - let them crawl, and they won't index the page if you've told them not to in the correct way.  The same goes for PDFs.  Of course, it was also said that you really shouldn't put anything on the web that you don't want found; it is, after all, a public space.

As far as links go, as always, cheap and spammy links from those cheap directories will get you nowhere.  You are much better off with a link from a very well-respected blog or news site than hundreds of those rubbish links.

They define links as editorial votes for your page: they tell Google more about it.  They check on-page and off-page signals, and always go for quality over quantity.  My last post was about how Google didn't do so well in expert-ranking tests, because its notion of quality relies on links.  Their definition of quality may be different from the one put forward by the people who ran the expert vs Google experiment.  Both mean "worthiness and excellence" in my opinion, just not from the same perspective.

I wanted to know how they find paid links, other than people reporting sites for using them via the spam report, but my question didn't pop up.  They've really been cracking down on this issue, and it's in their guidelines as well.  Don't buy links, and if you do, use nofollow so they don't pass PageRank and artificially inflate your rankings.  There isn't an automated method as yet that I know of.  It's not illegal to buy links; it just messes up Google's method.  Once they automate this, I think everyone had better stop buying links :)
  
Duplicate content has never been penalised, but there is a risk of one of those pages not being indexed.  They say to put your preferred URL in your sitemap.

What about DMOZ?  A few weeks ago they took the bit about submitting your site to DMOZ and the Yahoo directory out of the guidelines.  In Google Groups, John Mueller said that they weren't devaluing these links; they just don't feel that they need to recommend it.  During the Tricks and Treats event they said that DMOZ was really useful: in some south-east Asian countries, for example, it isn't easy to type queries, so it's easier to browse.

Also, if you've got a killer blog, definitely link it to your site; it will increase its value.

They talked about how Live launched U Rank, which allows you to influence your rankings and share them with friends.  Google said they weren't going to do anything like that, because this method creates too much noise, makes evaluation messy, and can be manipulated easily.  From everything they've published and said, I think Google's personalisation is more of a private affair, like iGoogle.  They're working on natural language understanding as well, and probably generation too - it would make sense, the two go together after all.  Read Greg's post about it for more information.

PageRank fails on quality - proved again

IR originally belonged to the realm of digital libraries; then the search engines arrived, and IR is now often associated with that area, which uses a lot of technology and methods from digital libraries anyway.

Some experts in digital libraries, Michael L. Nelson, Martin Klein and Manoranjan Magudamudi, did an interesting evaluation comparing expert rankings to search engine rankings.  The paper is called "Correlation of Expert and Search Engine Rankings", released on 21st October 2008.

Expert ranking means that experts contribute to the rankings, rather than it being an automated machine task.  They chose good examples to test on: lists from ARWU, IMDb, Billboard, ATP, Fortune, Money, US News and WTA.

Their question is "Does authority mean quality?" and the answer is "although authority means quality, quality does not necessarily mean authority".

"US News & World Report publishes a list of (among others) top 50 graduate business schools to answer this question we conducted 9 experiments using 8 expert rankings on a range of academic, athletic, financial and popular culture topics. We compared the expert rankings with the rankings in Google, Live Search (formerly MSN) and Yahoo (with list lengths of 10, 25, and 50). In 57 search engine vs. expert comparisons, only 1 strong and 4 moderate correlations were statistically significant. In 42 inter-search engine comparisons, only 2 strong and 4 moderate correlations were statistically significant. The correlations appeared to decrease with the size of the lists: the 3 strong correlations were for lists of 10, the 8 moderate correlations were for lists of 25, and no correlations were found for lists of 50."

Interestingly they state that if a webpage doesn't rank in the first few pages, it's as if it doesn't exist.  I think this is true of search engine rankings but I know a lot of blogs with low ranking that are popular through word of mouth and social networks.  Jill is right, rankings really aren't the be all and end all.

"We then created a program that will create an ordinal ranking of the URLs in a SE independent of any keyword query. We then used Kendall’s Tau (t ) to test for statistically significant (p < t =" 0.60)" t =" 0.80)"> moderate (0.40 < t ="0.60)" t =" 0.80)">
They found that the bigger the list, the fewer the correlations, and in fact they found very few.  They say that PageRank showed its limitations because it's a conventional hyperlink method, which doesn't take into account quality scores.  They say that Cho and Baeza-Yates found that PageRank was biased against new pages, even if they were of the highest quality.  
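To make the comparison concrete, here's a tiny sketch of the kind of test they ran, assuming SciPy's implementation of Kendall's Tau; the two rankings below are invented:

# Correlating an expert ranking with a search engine ranking using Kendall's
# Tau, as in the paper. The two orderings below are made up for illustration.
from scipy.stats import kendalltau

expert_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # positions in the expert list
engine_rank = [3, 1, 2, 6, 4, 5, 10, 7, 9, 8]   # positions in the engine's results

tau, p_value = kendalltau(expert_rank, engine_rank)
print(tau, p_value)   # tau is about 0.64 here, a "strong" correlation by the cut-offs above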

Really important papers to read from their refs:


U Rank

Microsoft has unleashed a personalisation-centred search engine called "U Rank".

They say that they want to use it to discover more about how people search, share and edit information, and how they organise their search results.  You can move your search results around, delete stuff, make notes, make it all visible to your friends, and also recommend sites to them.  I like the idea of sharing my search results with people, because I often do a search for someone and then send them the best results for their information need, so this would just make that much easier.  They also offer the possibility of mixing photos or images with video results, and I can see that happening quite easily.

ReadWriteWeb have a good post about U Rank, and they notice that you can't move results from the second page to the first page - I think this is a pretty big problem.  The dragging and dropping doesn't work so well either, they noticed.

You have to have a LIVE account to use it.
 

 

October 21, 2008

Linkscape...sigh

Linkscape is a cool tool created by SEOmoz which retrieves statistics about links - in fact, you can try it out here.

It has created a lot of noise, there are 11,200 results for it in Google if you search "Linkscape seo".  I won't go into it because you can read a wealth of posts and articles if you want to know what it is and what all the fuss is about.

In favour (to name but one):


Against (to name but one):


Balanced (not so many around):


Basically, privacy is the problem, along with worries about data being sold.

I do understand the issues that are being put across and respect everyone's view on this too.

But...

...am I seeing a trend here?  Every time something cool gets released (like Chrome), it gets knocked for privacy and security issues.  And most of this is usually fueled by hearsay and not based on many concrete facts for the most part (remember how Chrome had a keylogger?).  How is any kind of progress possible if we all kick anything new that gets released?

I don't want my bank account details, phone number, home address, medical history or email account contents listed on any search engine or system or to be accessed by any public tool.  The websites? Sure...take a look, I don't mind.

Sigh...

CAPTCHA broken!


Dr Jeff Yan and PhD student Ahmad Salah El Ahmad have revealed widespread vulnerabilities in the CAPTCHA used by Microsoft's email service.  They actually cracked it in 2007, but had to notify Microsoft first and wait for them to work on it before publishing their findings.

The CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) was where the vulnerabilities lay.  It is used to defend against automated systems that grab email accounts to deliver spam or put ads on blogs.

They showed that computers could break the CAPTCHA.  Normally the machines get confused by the letters, because they have to separate them and then put them in the right order.  The method used by the scientists took 80 milliseconds to break a CAPTCHA.  They removed the arcs in the Microsoft scheme (literally arcs drawn around the characters) that make the letters hard to decipher, and then managed to put them all in the right order.  It's their colour-filling method which was key to their success, combined with the usual vertical histogram analysis.  Using this method the CAPTCHA can be broken 60% of the time - wow.

They say about the MSN scheme:

"Security. A major problem of this scheme is that it is vulnerable to our simple segmentation
attack. The segmentation resistance built into this scheme seems to be largely about
preventing bounding-box based segmentation, and apparently its designers never realised that
a simple color filling process can be used to do segmentation effectively and that a
combination of vertical and color filling segmentation can be powerful. Moreover, it is easy
to tell arcs from characters by examining characteristics such as pixel counts, shapes,
locations, relative positions, and distances to baseline. In addition, the use of a fixed number
of characters per challenge also aids our segmentation attack."

The problem is, as Dr Yan says, once the segmentation is done, it's just a matter of using recognition techniques like neural networks (around since the 1940s).  Recognition is very easy today.
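Out of curiosity, here's a very rough sketch of the colour-filling idea as I understand it: flood-fill each connected group of dark pixels in a binarised image so that every character (or arc) gets its own label, then read the groups off left to right.  This is my own toy illustration, not the authors' code:

# Toy flood-fill segmentation on a binary image (1 = dark pixel, 0 = background).
# Each connected blob gets a label; blobs are then ordered left to right.
def flood_fill_segments(image):
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if image[y][x] == 1 and labels[y][x] == 0:
                next_label += 1
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    if 0 <= cy < h and 0 <= cx < w and image[cy][cx] == 1 and labels[cy][cx] == 0:
                        labels[cy][cx] = next_label
                        stack.extend([(cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)])
    # Order the blobs by their leftmost column, i.e. reading order
    leftmost = {}
    for y in range(h):
        for x in range(w):
            lab = labels[y][x]
            if lab and (lab not in leftmost or x < leftmost[lab]):
                leftmost[lab] = x
    return sorted(leftmost, key=leftmost.get)

The real attack also has to tell the arcs apart from the characters (by pixel count, shape and position, as the quote above says) before handing the segments to a recogniser.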

Their work was presented to the companies concerned before being made public of course, and they have contributed to making CAPTCHA more efficient and robust.  The thing is, humans must still be able to decipher it.

Read their full paper here.


Google webmaster chat event

There will be a webmaster chat event hosted by Google on the 22nd October noon (EST) or 9am (PST).

They intend to bust some webmastering myths with some presentations and then there will be a long Q&A session via Live chat.

From their post:

"Here's what you'll need
  • About an hour of free time
  • A computer with audio capabilities that is connected to the Internet and has these additional specifications (We'll be broadcasting via the Internet tubes this time rather than over the phone lines)
  • A URL for the chat, which you can only get when you register for the event (don't worry -- it's fast and painless!)
  • Costumes: optional"
You need to register for the event and post questions via Google Moderator.


New Google patent about ads

William Slawski over at SEO by the Sea brings our attention to a patent about Google's possible intentions for advertising in podcasts, television and radio.

Snippet from the patent:

"Systems and methods for delivering audio content to listeners. In general, one aspect can be a method that includes receiving a request to download a podcast, and determining a targeted advertisement to be inserted into the podcast. The method also includes inserting the targeted advertisement into the podcast dynamically at a predetermined time. Other implementations of this aspect include corresponding systems, apparatus, and computer program products." 

For loads more in depth information trek over the web to the original post.

Internet progress fast - CS slow

I wrote a rather long post at High Rankings and decided that it deserved a place on the blog.  A very good point was made by Randy about how fast the Internet moves.  There are new developments almost daily, new systems, new ways of doing things emerge, and we all keep up with the new trends and algorithms.  Computer science research that is 4 years old (or even older!) isn't as current, it is true.  

This is because it takes ages for a lot of methods to be evaluated properly so that they can safely be used in public systems like search engines or social networks, for example.  Some systems aren't designed to use certain methods, and only after they have gone through many iterations do they suddenly see the need to incorporate a particular method, or even a few.

Stemming, for example, is quite old: it goes back to 1968, when the Lovins stemmer was published.  Google, I believe (though I'm not totally sure of the exact date), applied stemming to queries in 2003.  That's 35 years!  I think they were already using it in the internal system though; it's a pretty standard method in IR after all.  I wrote a stemmer in 2005 and it only started being used in 2007 - not a lot of people saw any use for a stemmer that stemmed to exact words, but now that's pretty standard too.  That took 2 years.
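For illustration only, here's how a standard off-the-shelf stemmer behaves - NLTK's Porter implementation, not the Lovins stemmer and certainly not whatever Google uses internally:

# Stemming example with NLTK's Porter stemmer (requires the nltk package).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ranking", "ranked", "ranks"]:
    print(word, "->", stemmer.stem(word))   # all three reduce to "rank"

# So a query containing "ranking" can also match documents that only mention "ranked".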

PageRank came about in 1995 and was implemented when Google was publicly released in 1998, that's 3 years.  

I work in conversational systems and it has taken a while for the science community and also the industry to see why they could be useful.  Now there's a lot of research in this area, and the first chatbot was invented in 1966 (ELIZA).  It's not until recently that companies have started using chatbots on their websites (Ikea for example) and suddenly the potential for such systems in IR is being realised.  Long wait!  We don't even have all the technology needed yet to make something really good.

I think it's really important for the SEO community to keep track of papers released by IR researchers and also NLP/AI researchers when the work is related to search engines particularly.  It's useful to learn about the methods being developed and then it gives some insight into how they might be implemented (although this could take some time!).  You can use Citeseer to find them, or DBLP, and checking the references too can be useful.  Those are where my massive reading list comes from!

Of course, some methods do get implemented quite quickly, and I think that this happens when they are specifically built for a system already in development.  The big search engines have people working solely on this, as do companies like IBM.  What I mean is that you shouldn't discount methods that were published a few years ago.  A lot of the social media stuff was published quite some years ago too.

Happy reading :)

October 20, 2008

SISN toolkit for social media


 Kicking off the week with a technical poster from Michigan University called "SISN: A Toolkit for Augmenting Expertise Sharing via Social Networks".  

They're developing a toolkit to support expertise sharing via social networks.  The toolkit, "SISN" (Seeking Information via Social Networks), is "a general purpose toolkit for social network-based information sharing applications that combines techniques in information retrieval, social network, and peer-to-peer system".

Their toolkit requires:

• A collection of users with their expertise being represented by their profiles
• A social network that connects the users and places them along the query/referral/answer pipeline
• A collection of searching strategies to spread the query/referral efficiently across different user groups
• A coupling of the system with daily communication channels (e.g. IM and email) to provide a convenient interface between the major parties involved in the information seeking process

The system consists of:
  • Information profiling
  • An index
  • Categorizer
  • Social network module
  • Profile Promoter and Peer Profile Learner
They say that there is still a gap between how people seek information via social networks and what the web and Internet technically support in this information era:

"We believe that, by providing a general-purpose toolkit as a platform for sharing and seeking expertise via social networks, the current work helps us advance towards narrowing the gap between the social and technical perspectives of social network-based information seeking, i.e. the gap “between what we have to do socially and what computer science as a field knows how to do technically”.

I think that this would lead to a whole new rush to optimise yourself so that you show up in the experts for your chosen subject area.  

Unfortunately as far as I can tell they haven't made it freely available to us as yet which is a shame, but I'll watch that space.

October 18, 2008

MAMA by Opera Dev


 "The Metadata Analysis and Mining Application" (MAMA) engine is a search engine that works in a really different way.  It indexes based on page structure: markup style, scripting, coding,...

"Say you want to find a sampling of Web pages that have more than 100 hyperlinks or for pages that use the Font-size CSS property that also use the FONT element with a Size attribute? Many parties would be interested in such a service, even if the market would be smaller than for a "traditional" search engine."

They also say:

  • Browser manufacturers and others can use MAMA data on the popularity of widely used technologies to prioritize bugs and justify adding support for new technology to in-progress releases.
  • Standards bodies can use the data to measure the success and adoption rates of various technologies.
  • Web developers can use the same data to justify support of various technologies in their work.
  • It can provide real-world, practical samples of the Web developer's "art", for inspiration and instruction. 

You can get a load of interesting facts from it, like what the most popular element on the web is, or how popular Flash is... There's a key findings report you can have a look at.  It has a load of stats on things like which server is the most popular, how many URLs in the index passed validation, the average length of external CSS and so much more.

You'll also find the document "The 'average' web page" interesting - it's all about what pages look like today from a structural perspective.

It's a dream come true for developers, and I think that it'll be quite popular.

October 17, 2008

TGIF - long week!

So welcome once again to Friday, I hope it's not too busy at your end and that you're looking forward to the weekend ahead.  Here we go:

Holy Taco have a list of cool super geeky T-shirts.  I really like the ones that say this:
  • Video games ruined my life, good thing I have 2 extra lives
  • I'm not a geek.  I'm a level 9 warlord
  • I built microsoft and survived
So, funny geeky quotations for this week - warning, these are very, very geeky!
  • "To err is human... to really foul up requires the root password."
  • "After Perl everything else is just assembly language."
  • "Unix is user-friendly. It's just very selective about who its friends are."
  • "The software recommends using Windows Vista or better, so I installed Linux."
  • "640K ought to be enough for anybody." (Bill Gates, 1981)
.....aw Bill....not again....

Top freeware stuff

I thought I'd share a simple list of some of my favourite freeware, stuff I use all the time and really wouldn't like to live without.  This isn't the full list by any means but here goes:

Utilities:

Natural language / A.I tools:
SEO tools - now there are many many lists of these already so this is a short one:
I'll leave it there but there are so many more.  I hope you've found some useful and fun things in this list to use and enjoy!  

October 15, 2008

Geeks fight the digital divide

This blog is taking part in Blog Action Day and this is the contribution, putting the floodlights on "Free Geek":

"Free Geek" has been around since 2000.  It's a non-profit organisation that recycles computer scrap to make working machines, to distribute into the community to those who can't afford them, thus helping bridge the digital divide:

"In the eight years since its formation, Free Geek has recycled over 1,500 tons of electronic scrap and refurbished over 15,000 computer systems that are now in use by individuals and organizations in the community."

These cool new machines are loaded with OpenSource software such as Linux and OpenOffice for example, and then they're shipped out into the community.  They're not called computers anymore after that, they're called "FreekBoxes".  

You can apply to adopt a computer, you can volunteer and help build them, and you can donate computers, keyboards, mice, monitors,...They also need stuff like toilet roll, food and printer paper to keep the operation working, and also screwdrivers and such things.

They are based in Portland, OR.  You can go to the site and donate money though.  It's a brilliant initiative and maybe some of you will be inspired to do the same thing in your local community.




October 14, 2008

Free links with 404's

Matt Cutts wrote a post about the announcement on the Google Webmaster blog, where Google states that you can now see who is linking to one of your 404 pages.  This means that you can ask those sites to link to another page of your site instead, thus avoiding the loss of links.

Using Google webmaster central you can get a list of 404's and also a list of sites linking to each page.

For more info read  Matt's blog post.

October 13, 2008

Predictions from 1984

Nicholas Negroponte gave a talk in 1984 about his predictions for technology, and he was very accurate.  Negroponte founded the MIT Media Lab and is also behind the "One Laptop per Child" initiative.  I was 6 years old when he gave that talk - how old were you?

He predicted wikipedia, teleconferencing, touchscreens, and other things, and he did the whole presentation of course using videotape footage.  It's really important to be able to be visionary and look ahead.  It's even harder to do it and have others listen to you, and he does this well.



Cognition - a short interview

I've been playing with the Cognition search engine for a while now and have also sent the link on to some colleagues, among them my friend Dan, who is a proper algorithm geek like I am.  Dr Kathleen Dahlgren from Cognition answered some questions for us; here they are:

- How does cognition feel about personalised search?

Personalized search can be augmented when the search engine understands language and can automatically see relationships that are opaque to pattern-matchers.  For example, if a person is interested in rhythm and blues, they are also interested in R&B, and probably blues as well.  But not blues meaning a bad mood.  These subtleties are all handled by Cognition.

- Are there plans for a multilingual solution?

There are plans.  The semantic map is relevant in all languages; it is universal.  But linguists need to tie concepts to the words of other languages.

- How are the ontologies constructed?

Originally they were constructed by hand.  Currently Cognition adds digitized ontologies automatically.

-  Cognition claims that no other NLP processing technology comes close in breadth and depth of understanding of English... how so?

The closest semantic map, WordNet, has 2.5 times fewer word stems and 20 times less semantic information.

- What exactly is meant by the "context" of the text they are processing?

The context is the other words in a sentence.  So in “strike a match”, “strike” means “ignite” and “match” means “phosphorus-tipped stick”.  But in “striking workers”, “strike” means “walkout”.

- What metrics are used to measure the quality of the engine?

We have many different metrics and regression tests.  Our main method is to index identical content with another search engine, produce 50 typical queries, and test them for relevance using the two search engines.  Recall is measured as relative recall, lacking a gold standard in which all documents have been inspected.  In relative recall, the total of relevant search results by the two search engines is counted as full recall.  In such tests, Cognition always performs with over 90% precision and recall.  Google, for example, in 3 such tests had 20% precision and 20% recall.
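As an aside, to make that relative-recall definition concrete, here's a tiny sketch of the calculation as I understand it (the result sets below are invented):

# Relative recall: measure each engine against the pool of relevant results
# that either engine managed to find. Document IDs below are made up.
def relative_recall(relevant_a, relevant_b):
    pool = relevant_a | relevant_b
    return len(relevant_a) / len(pool), len(relevant_b) / len(pool)

engine_a = {"doc1", "doc2", "doc3", "doc4"}
engine_b = {"doc3", "doc5"}
print(relative_recall(engine_a, engine_b))   # -> (0.8, 0.4)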
  
- What exactly is meant by a "phrase" in the stat database?

A phrase is a frequently-occurring set of terms that are always juxtaposed, such as The Bill of Rights, U.S. Congress, United Airlines, or Securities and Exchange Commission.  

- Are there prebuilt macros for common phrases?

Yes – 200,000 of them.

It's really a very interesting system to use, and I reckon it'll improve in leaps and bounds in the future as well.  We will be playing with this a great deal and I'll blog about it again, so watch this space!

October 10, 2008

TGIF - fun facts

Welcome to Friday - to finish the week, here are some interesting facts from computing.

  • The QWERTY keyboard layout is 129 years old.
  • Macquariums are aquariums made from old Macintosh computers.
  • Bill Gates & Paul Allen started a company called Traf-O-Data to monitor traffic flow.
  • David Bradley wrote the code for the [Ctrl]+[Alt]+[Delete] key sequence.
  • Tetris has sold over 40 million copies worldwide since it began in 1984.
  • The first hard drive available for the Apple had a capacity of 5 megabytes.
And also:

In 1947 a computer at Harvard malfunctioned and Grace Hopper, who was working on it, investigated, found a moth in one of the circuits and removed it. Ever since, when something goes wrong with a computer, it is said to have a bug in it.

And you can play Tetris free here.


Degree needed to be an SEO?

Jane Copland over at SEO Chicks wrote a thought provoking article on whether SEO degree programs are needed, and whether they would be of any help at all.  I commented over there but I thought I'd give my take on it here.  There was a post about it at SEW.  I commented there too.  I think that coming from 10 years of university education I have a strong opinion on it and this opinion is informed because I've taught freshers and I know from experience what is limiting in degree programs and what is really useful.  I don't expect everyone to agree with me of course :)

First off, this post is really about the young ones (18-22) working on an undergrad degree.  Postgrad degrees are really different and don't serve the same purpose.  PhDs are even more different, and aren't really taught degrees anyway - you've probably got a few degrees by then.

My first point is that most young people that age are leaving home for the first time and haven't yet been pushed to work on their own initiative.  They also haven't had the opportunity to learn the skills that an undergrad degree can give you.  Being at university gives them the support they need, both emotional and practical, and it also allows them to mix with people their own age - and learn that drinking 12 pints will make you sick.

On a degree program they learn to organise their work on their own, write at length with full references (which means doing substantial research on a given topic), learn things quickly (modules can last just a term), present effectively and work on team assignments (which, seriously, is usually the biggest issue); there are workshops on interview skills too... there are lots of benefits.

You can learn these things by going straight into industry, but you learn in a different way.  You don't pick up the skills you do at uni, you pick up others; and when undergrads make their way into the industry, they also learn those skills.  Admittedly, for most of them the transition from uni to the workplace is a bit of a shock at first.  The advantage I think they have is that, coming from degrees like marketing, computing and business for example, they are able to put SEO into context, because they are familiar with the wider picture.  That is really useful.  They can also provide a different perspective, which can only be useful to an employer.

The limitations of a degree are, of course, first off the cost, and then the fact that at uni you don't learn many professional skills unless you do a degree which is vocational in nature.  Those related to SEO tend to teach you useful skills though.  I think that working 9-5 is usually a shock, and then finding out that it's actually 9-7 is a bit worse.  Working in an environment that can be noisy, while being asked to do lots of things at the same time, is hard too.

Actually, one guy said to me "I'm so tired and it's only Wednesday, I don't know if I can keep it up 5 days a week" - he did though, and he's doing really, really well now.

Having a degree in SEO, however, would just negate all the benefits of coming from a different background.  You would only learn SEO, maybe with a module in web design or something, but you wouldn't learn about information retrieval or business in any depth - or if you did, I suspect it wouldn't be a very long module.  It's a shame, in my opinion, not to take advantage of learning something else to put SEO into context.  Also, I think it's best learnt on the job.

You can also take another route which is to work in SEO and study at the same time.  Some people shudder at the thought, but it's a very efficient way to get the best of both worlds.  It is possible, I'm doing it and a lot of others are doing it too.  It's hard work, but you get used to it and it's rewarding.

I came into the SEO profession by chance, really.  I spent a lot of time building search engines, pulling Google apart, building indices and becoming proficient in NLP, and one day someone said, "Oh, could you optimise my site?"  I looked a bit puzzled and said yes.  That was my first project.  I quickly figured out that I had a lot to learn, and so I spent lots of time learning everything.  My background in linguistics and computing helped a lot, and it still does.  SEO was a bit different back then, but you move with the times and that's exciting.

Those coming into the profession at an older age can draw from their wealth of professional experience, which I guess is the same as learning all about a different subject to SEO, and bringing with them an awful lot of knowledge colleagues can draw from and enjoy.

For computing:
A master's is useful if you want a career change - I went from a degree in translation and linguistics to a master's in computing (machine translation).  A PhD is worth doing if you want to work for the big search engines, in the cool labs or in academia - and mostly, consider doing it only if someone else is paying!

Coming from such an academic background, I am obviously going to favour having a degree.  If I didn't think University was useful, I wouldn't have stayed 10 years.

October 09, 2008

The Random surfer becomes the Cautious surfer


An interesting paper: "Incorporating Trust into Web Search", written by Lan Nie, Baoning Wu and Brian D. Davison from Lehigh University.

This paper deals with the issue of spam generated by pages being engineered to deceive the search engines.  They say that ranking systems should take into consideration the trustworthiness of a source.  TrustRank seeks to solve this issue, but they propose to:

"incorporate a given trust estimate into the process of calculating authority for a cautious surfer."

First, some very brief info in case you need a refresher:

The "Random surfer" is part of the PageRank algorithm.  It represents a user clicking at random, with no real goal.  The probability that s/he clicks on a link is determined by the number of links on that page.  This explains why PageRank is not entirely passed on to the page it links to but is dependant on the number of links on that page.  It's all based on probablilities.  

The "damping factor" is the probability of the random surfer not stopping to  click on links.  It's always set at a value between 0 and 1.   The closer to 1 the score is, the more likely s/he is going to click on links.  Google sets this to 0.85 to start with.  Not only does it allow for a score to be assigned to a page but it speeds up computations as well. 

Now for the goods:

With a little knowledge about search engines, they note, it's easy to add keywords to content or generate some inbound links (something we are all familiar with).  They rightly call this "spam".  It does affect the results, and that has been the role of SEO for some time, but today I believe that SEOs work far closer with the search engines than they ever did before, so better practices are at work.

They mention how PageRank calculates an authority score based on the number and quality of inbound links, and that HITS looks at hubs that link to important pages.  They state that the issue with these methods is that they assume the content and links can be trusted.

They say that PageRank and TrustRank can't be used to calculate authority effectively:

"The main reason is that algorithms based on propagation of trust depend critically on large, representative starting seed sets to propagate trust (and possibly distrust) across the remaining pages.

In practice, selecting (and labeling) such a set optimally is not likely to be feasible, and so labeled seed sets are expected to be only a tiny portion of the whole web. As a result, many pages may not have any trust or distrust value just because there is no path from the seed pages. Thus, we argue that estimates of trust are better used as hints to guide the calculation of authority, not replace such calculations.".  

Basically, it's not easy to label and select a load of pages that are deemed trustworthy, so the seed set you could create wouldn't be big enough to be effective.

In their method, they penalize spam pages and leave good ones untouched. 

The "Cautious surfer" attempts to stay away from spam pages.  They altered the "Random surfer" damping factor, which is usually set to 0.85.  This damping factor is altered based on the trustworthiness of a page.  This causes PageRank however to treat all links as a potential next page for the random surfer, and these may not all be trustworthy.  They dynamically changed the damping factor to address this issue.

They found that their method could improve PageRank precision at 10 by 11-26% and improve the top 10 result quality by 53-81%.

Basically, the idea is that applying the "cautious surfer" model to existing ranking algorithms will significantly improve their performance.

This would mean that SEO doesn't change that much, seeing as most of us are striving to deliver reliable and rich content to users.  I think it would, however, come down harshly and more efficiently on some widely used techniques, like keyword stuffing and link buying.  In fact, getting links from rubbish places may mean that the site incurs a penalty.

For a lot more detailed information on this, read the paper.
 

October 08, 2008

What you missed at the Web 2.0 expo NY

If, like me, you missed the Web 2.0 Expo in NY, you can still watch the footage on YouTube.  Here are some of the talks I'm sad I missed:

First up, Gary Vaynerchuk with "Building Personal Brand Within the Social Media Landscape"

A few things he says in this talk:

- Don't do things you hate

- Give a shit about your users - answer emails, give a crap about your user base.

- Start with yourself, ask "what do I want to do for the rest of my life?" and monetize that.

- Keep Hustling, it's the most important word ever.

- You need a business model - make some cash along the way.

- Legacy is greater than currency.

"Ive turned down 40 television deals.  There's no reason for me to share the equity. My content is mine The people that controlled it - newspaper, television, and radio- are no longer in control, and that is a huge factor that people have not totally wrapped their head around. - you need to build brand equity".

- If you love it, you will win. Get out there and network.

- Which tool should I use, Twitter, Facebook,...? All of them! Communicate with your user base all the time. It's a massive opportunity.

- Work 9-5, then from 7pm to 2am work on your idea.

- Be patient.



Next up: Jay Adelson - "Organizing Chaos: The Growth of Collaborative Filters"

A few snippets:

- Before all you had was information on clicks and you used PageRank and BackRub, but now there is a wealth of information to be collected from users.

- The younger generation don't have the privacy issues that the older generation have. They're happy to share info online.

- We're moving from a "seek" culture to one where we're constantly connected.

- Next gen = find people like you and use that collective wisdom to find things that are specifically interesting to you.

- Future of Digg - they collect loads of data on you, and then they're going to change the front page so it's unique to you.



Tim O'Reilly also did a talk; you can watch it here.

The web 2.0 summit 2008 is in San Francisco 5-7 November if you can make it.

October 07, 2008

What is semantic search?

There's been a lot of talk recently about "semantic search", which is also referred to as part of the "read/write web".  Powerset, Cognition, Ask, Hakia and many others are "semantic search engines".  It's not a new concept: research has been around in academia for at least 10 years.  In fact, a lot of the people involved in it in the early years are involved in the newly released semantic search engines today.  Not a big surprise!

So, what are semantics, for a start?

Semantics refers to meaning in language (or code, or anything else).  Semantic analysis uses syntax and pragmatics as well as contextual information to derive the meaning of a text, or even an audio stream if you want to use one.  It's not just about finding similarities or context between two words, but rather taking the entire text or query into account to establish meaning.

What is the semantic web?

It's a common framework allowing information to be shared and reused.  Information is stored in machine readable formats.  

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." - Tim Berners-Lee

That's probably the best definition, seeing as he invented it.  It's  relevant to semantic search and uses many of the same techniques and also is based on the same idea.  It's not a new version of the web, but rather an extension.  There are a lot of conferences about this worldwide, such as "The semantic web technology conference" for example.  

The technologies used obviously include ontologies (which are like big storage boxes full of information on how words and concepts link to each other), mostly built in OWL; natural language processing tools, for named entity extraction for example; data interchange formats (like RDF/XML or Turtle); schemas like RDF Schema; XML to provide syntax for content structure; and SPARQL, which is a query language for semantic resources.
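To give a flavour of that stack, here's a tiny example using rdflib in Python: a couple of made-up RDF triples in Turtle, queried with SPARQL (purely illustrative, not tied to any particular engine):

# Two invented triples queried with SPARQL via rdflib.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:MarilynMonroe ex:bornIn ex:LosAngeles .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?place WHERE { ex:MarilynMonroe ex:bornIn ?place . }
""")
for row in results:
    print(row.place)   # -> http://example.org/LosAngeles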

And what is semantic search?

Google uses PageRank and keyword matching to identify relevant results, whereas semantic search engines use meaning to return highly relevant results.  Google returns keyword/keyphrase matches; a semantic solution returns information.

The data has to be highly structured in ontologies, just like in the semantic web.  A semantic network is created which links all of the concepts and words together.  Word sense disambiguation (WSD) is used to decipher what a word may relate to.  WordNet, which you can download for free, is a machine-readable dictionary that a lot of scientists have used for this task, although it's far from foolproof.  Here is a very comprehensive list of which semantic search engines use which kind of procedure, in pretty plain English.
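Here's a small sketch of that WSD step using WordNet via NLTK's Lesk implementation - a simple, classical algorithm, shown only as an illustration (it assumes you've downloaded the WordNet data with nltk.download('wordnet')):

# Pick a WordNet sense for "bank" given the surrounding words (simplified Lesk).
from nltk.wsd import lesk

context = "I deposited the cheque at the bank this morning".split()
sense = lesk(context, "bank")
if sense:
    print(sense.name(), "-", sense.definition())

The output is one particular WordNet sense of "bank"; simplified Lesk gets things wrong often enough, which is part of why I say WordNet-based WSD is far from foolproof.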

Google does respond to natural language queries, such as "Where was Marilyn Monroe born?".  Hakia doesn't understand the query and tells me what Marilyn's real name was.  Powerset (only searches Wikipedia and Freebase) comes up with the goods, "LA".  Working in a "closed domain", Powerset has an easier job than Hakia who searches the whole web, just like Google.  Google however delivered where Hakia didn't this time round.

Then I tried "Is chili bad for you?" in all 3.  Hakia came up with reviews of a book called "Bad Chili", Google came up with a forum thread with that exact question in it, and Powerset delivered an in-depth article on the effects of chili on humans.  The following results from Powerset are all off, though.  Hakia's results continue with the book, but Google gives me loads of results all about the effect of chili on the body.

Have a go yourself and see what happens.  

This little test definitely shows that Google can come up with the goods, whereas the semantic engines struggle.  More work needed there.  I would be very surprised if Google were shunning semantic web technology and natural language queries.  I would leave that open for discussion actually.

The future?  Natural language queries, natural language generation for a straight answer to a question, a summary of all of the most relevant resources in one text, and the option to read the individual documents.  That's not an easy feat!

October 06, 2008

SEOmoz's index


Today Rand from SEOmoz announced that his team has been building their own index of the web so that they can launch the Linkscape tool for SEO professionals to use, as well as some new pro membership tools.  They also released a Linkscape comic à la Chrome, and it's really quite funny.

In short, Linkscape has a crawler (perhaps more than one, doing different jobs), it uses a popularity measure and assigns a link popularity score to each page.  It uses MozTrust, which is apparently like TrustRank, to identify trustworthiness.  It provides you with metrics on how your site sits compared to others in the space.  It'll also give you "the quantity and distribution of the anchor text that a given site or page has received from its inlinks".
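SEOmoz haven't published the exact formula, but a link popularity score is broadly in the PageRank family, so here's a toy power-iteration sketch over an invented link graph to show the general idea - emphatically not Linkscape's or MozTrust's actual calculation:

# Toy link popularity score (plain PageRank-style power iteration)
# over an invented link graph - an illustration of the general idea,
# not Linkscape's or MozTrust's real formula.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}

damping = 0.85
pages = list(links)
score = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                       # iterate until the scores settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = score[page] / len(outlinks)
        for target in outlinks:
            new[target] += damping * share
    score = new

for page, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(page, round(s, 3))              # c.com wins - it has the most inlinks

Swap the damping factor and the graph around and you quickly see why metrics from different tools never quite agree.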

You can try it out here.  It looks good!

SEOmoz always have some really cool tools on offer, some free, some not, but all equally useful.  This announcement is quite exciting, I think it'll grow to be something really fun to play with, or to work with, depending on how geeky you are.

Hakia's new stuff


Hakia, the semantic search engine, has launched a few new features.  Here's a bit of info about Hakia in case you missed the launch...

It works by retrieving information through matching concepts and meanings.  They have their own ranking algorithm called "SemanticRank".  It can categorise results, it uses parallelism (treat = cure), it makes suggestions and it allows for user refinement.  

Their motto, I believe, should be "quality over popularity".  Google likes popularity, so this is a different take on things.   

QDEX is the system by which they analyse and store web pages.  It's different from the standard inverted index because it allows semantically rich data to be processed quickly.  It analyses the content of a web page, extracts all possible queries that could relate to it, and those queries then allow the system to find documents, paragraphs and so on for retrieval.  It's all done off-line.  This works well because it can decompose content.

SemanticRank is based on sentence analysis and on concept matching between the query and the best sentence available in each paragraph.  They say they also use syntactic and morphological analysis.  There's no keyword matching of course, because it's all semantics, and no boolean matching.
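Here's a very loose sketch of the "best sentence in each paragraph" idea, using nothing cleverer than word overlap - Hakia's real SemanticRank does syntactic, morphological and concept-level analysis, so treat this as a cartoon of the concept only:

# Cartoon version of "score a document by the best-matching sentence in
# each paragraph". Real SemanticRank uses syntactic, morphological and
# concept-level analysis; this toy version just uses word overlap.
def sentence_score(query, sentence):
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q) if q else 0.0

def document_score(query, paragraphs):
    best_per_paragraph = []
    for para in paragraphs:
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        best_per_paragraph.append(max(sentence_score(query, s) for s in sentences))
    return sum(best_per_paragraph) / len(best_per_paragraph)

doc = [
    "Chili peppers contain capsaicin. Capsaicin can irritate the stomach.",
    "In moderation chili is not bad for you. Some studies even suggest benefits.",
]
print(document_score("is chili bad for you", doc))   # 0.6 for this toy example

A real system would of course match "treat" with "cure" and so on (the parallelism mentioned above) rather than requiring the exact same words.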

Anyway, today they launched some new features.  They've added a "credible sites" tab, where you can look at results from authorities such as edu and gov sites, and they're asking librarians and "information professionals" (I'm not sure exactly who that covers) to suggest sites.  The resources must be current, peer reviewed, non-commercial and authentic (or at least fulfill most of these requirements).

For now you can only use it for the topics of the environment, health and medicine.  The sites are by experts, although anyone can submit a resource.  

There are also a few other tabs: "news", "images" and "meet others".  The "meet others" tab is interesting because they're adding social networking to search.  There are different rooms where you can post a comment or upload something for others to see, on the topic of the room of course, and then you can discuss it.  You can rate the messages.  It seems to be used, but I'm not sure how much people will get involved, seeing as they usually go to a search engine with a particular information need.  It might take off though.    

Hakia also offer a personalisation service.  You get a myHakia account, and then you can get information and news from topics that you specify.  

The results aren't quite as good as Google's just yet, as far as my very brief tests show.  It does have some interesting and possibly quite useful features though.  The interface is nice and clean and easy to use.  

Take a look and see for yourself, you might like it.

October 03, 2008

TGIF quotes and humour

To start things off for the weekend, here are a couple of quotes:

 "All parts should go together without forcing.  You must remember that the parts you are reassembling were disassembled by you.  Therefore, if you can't get them together again, there must be a reason.  By all means, do not use a hammer."
– IBM Maintenance Manual, 1925 

"If people never did silly things, nothing intelligent would ever get done."
– Ludwig Wittgenstein

"The Internet?  We are not interested in it."
– Bill Gates, 1993

...oh Bill....

New algo from Yissum


Yissum, partnered with the Hebrew University of Jerusalem, has published (IEEE) information on a new algorithm for ultra-rapid image retrieval which they say is robust and reliable.  It is obviously going to be very useful for large image databases.  They say that, basically, you could take a photo of a restaurant and then immediately get information on it from the photo analysis.  In fact it has many applications; they give the example of surveillance cameras.  

The algorithm decides which patterns and areas (windows) are non-similar, rather than establishing a full distance measure for every window.  This way unnecessary information is not computed, so it's faster.  They say it can detect "low quality patterns, rotated patterns or patterns that are partly occluded."
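To illustrate the "rule windows out early" idea, here's a little sketch of template matching that abandons a window as soon as the accumulated difference says it can't be a match - the threshold and the row-by-row scheme are my own invention for illustration, not the published algorithm:

# Sketch of early rejection in window matching: stop comparing a window
# as soon as it is clearly non-similar, instead of computing the full
# distance for every window. The threshold and row-by-row scheme are
# invented for illustration; this is not the published algorithm.
import numpy as np

def find_pattern(image, pattern, reject_threshold=0.5):
    ph, pw = pattern.shape
    ih, iw = image.shape
    matches = []
    for y in range(ih - ph + 1):
        for x in range(iw - pw + 1):
            total = 0.0
            rejected = False
            for row in range(ph):                    # compare row by row
                diff = image[y + row, x:x + pw] - pattern[row]
                total += float(np.sum(diff * diff))
                if total > reject_threshold:         # early rejection:
                    rejected = True                  # give up on this window
                    break
            if not rejected:
                matches.append((y, x, total))
    return matches

img = np.random.rand(64, 64)
pat = img[10:18, 20:28].copy()                       # plant a known pattern
print(find_pattern(img, pat))                        # finds it at (10, 20)

Most windows get thrown out after looking at just a row or two, which is where the speed comes from.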

From their website:

Our Vision
Through our support and encouragement of research, development, and education, we are dedicated to turning science into commercial products for society’s use and benefit.

Goals & Objectives
  • To protect, promote and market commercially promising inventions and know-how developed at the Hebrew University of Jerusalem
  • To find the “right fit” for each intellectual property asset in our portfolio
  • To deliver, manage, and optimize knowledge transfer to the global market through a variety of business development activities and services
This will be very interesting to image search engines in particular, because their databases are large and they often rely on tags.  There is work going on in image recognition for sure, but this is really cool because I'd quite like to pick a photo and immediately get information on it.  This would be particularly powerful in mobile search.

Goodsearch


Goodsearch is a search engine powered by Yahoo that donates around a penny to a charity of your choice every time you do a search.  The donations come from 50% of the advertising revenue.  Apparently the Dance Marathon chapter raised $900 for a children's hospital.  

Goodshop was recently launched and it donates 37% of the sale of any goods purchased.

The results are decent too.  Give it a go, especially if you run a charity.
Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.