
November 28, 2008

TGIF - let's celebrate :)

Welcome to another weekly installment of TGIF here at SFS.  I hope the week has treated you well, that you have had happy surprises along the way, and that the American readers amongst you had an excellent Thanksgiving.  Everyone else will have to wait for xmas for fun and games and good food.  Unless, like me, you're invited to a posh dinner in your favourite restaurant tonight :)

Without further ado...

Geeky xmas presents you might want to add to your list:

I love these awesome T-shirts from Nerdy T-shirts.  I'd quite like the "Reading is for awesome people" one and the Pac-Man one.

ThinkGeek (which is the ultimate geek gift site) do super strong caffeine comestibles, and I love the LED umbrella.  They have some very nice geeky baby and kids clothes as well.  

Firebox do these very cute Lego brick USB keys.  Definitely want one.  Also, feel free to send this delicious-looking box of munchies my way.

Scientifics do the amphibian solar-powered car, way cool.  And seriously, who wouldn't want a pair of night vision goggles like these?

From Amazon (who I buy too many things from) I like the "Geek Logik: 50 Foolproof Equations for Everyday Life" book, and "The Unofficial LEGO MINDSTORMS NXT Inventor's Guide" - build away.  My wish list is huge, so I'll stop there.

Here are our geeky facts of the week:

The first ever Internet marketing campaign was in 1978, when an advertising note was sent via ARPANET (the precursor to the Internet) to about 600 people on the network.  It advertised a new computer, and as a result Digital Equipment Corp. sold over 20 of them at about $1 million each.

If you open up the case of the original Macintosh, you will find 47 signatures, one for each member of Apple's Macintosh division as of 1982.

Your brain has about 100,000,000,000 (100 billion) neurons.

In 1901, some divers discovered the oldest known mechanical computer.  It is a 2,100-year-old machine that was used to calculate astronomical positions, and it was found in a shipwreck off the Greek island of Antikythera.

Pneumonoultramicroscopicsilicovolcanoconiosis is the longest word in the English dictionary.  It refers to a lung disease caused by inhaling very fine silica particles, such as volcanic ash.

Wonder Woman co-creator William Moulton Marston also created the systolic blood-pressure test, which led to the creation of the polygraph (lie detector).

I also took the geek test, and guess what...

77% Geek

Why I used Blogger

I'm often asked why I use Blogger to host my blog, being a knowledgeable SEO person and also a computer scientist.  "You should know better" is the general idea.  I advocate using WordPress instead, and I tell everyone who asks me what they should use for their business blog to do just that.

WordPress is better:
  • You can host it yourself - unless you mess up it's highly unlikely to go missing
  • There are tons of tools for SEO and templates, etc...
  • Highly customisable
  • Search engine friendly
  • Publishing is fast
  • You can extend it as far as you like
But:
  • Can be a pain to install
  • If you host yourself it requires dosh
  • Bit harder to install Adsense
Blogger is better for noobs:
  • Easy to setup
  • Login uses your Google account
  • Lots of templates available
  • Silly easy to use
  • Free
  • Easy to add little apps
  • Easy to arrange your template how you want it with drag and drop, and to customise the HTML
  • You can easily get AdSense on there
But:
  • Not so customisable
  • A .blogspot address doesn't look very professional for businesses
  • Doesn't support all the WordPress plugins
Feel free to add to the list, it's by no means exhaustive.  

Why did I do it then?  Well, it was a bit of an experiment and a way of showing my colleagues that if you engaged with your blog properly and were genuine in your intentions, then any blog platform would work.  I hosted my travel blogs with Blogger prior to SFS and they got a lot of traffic without me doing a great deal.

Hang on...I did do quite a bit.  But it honestly hasn't been hard or something I'd consider work.  For both my travel blogs and SFS, I have been involved in relevant communities.  On Twitter my followers and the people I follow are all into the same stuff; I use Sphinn for SFS and I am involved there.  I do submit my posts, but that's what it's there for and it's all about sharing with the community.  I am involved in forums and LinkedIn groups.  That's about it actually.

In short - I'm involved socially in my area of interest.  I interact, I share, I help out when I can, and I take the time with the people I connect with.  I like them, I hope they like me too, and I love what we talk about.  I write (I like to think) good content, and I take time with my posts.

How much time a week do I spend on my blog?  Hmm...maybe an hour a day.  I take Saturdays off and the odd busy day.  It's a hobby but it has served me well; I've had quite a few job offers (thank you, I'm flattered), and most importantly I've met some excellent people.

If you're running a business, use WordPress, but honestly, it won't work if you're not involved in your community.  A blog isn't an extension of your website that you call "blog" - it's a gateway to information sharing and a whole interesting community.  You can reach clients, but you can also learn too.

If you're running something personal, honestly this blog is proof that any platform can work for you as long as you're dedicated and passionate.

That's my 2 beans about it anyway :)

November 26, 2008

SEO ladies, fancy being a Syster?

I'm always trying to build a little bridge between the SEO and science communities; I think they both have a lot to learn from each other.  Once in a while I like to try and bring them together, because they share quite a few commonalities.

In the SEO world the girls over at SEO Chicks keep the girly flag nice and high, and over in computing the Anita Borg Institute does much the same but in a different way (they trademarked the word Syster).  I'm one of 3 girls doing a PhD in a computer science discipline at my university.  At conferences, there are always more men than women.  The SEO conferences on the other hand are far more girl-friendly, and the women in SEO, such as Jill Whalen and Donna Fontenot for example, are strong characters.

In computing the girl zone is in trouble, as women represent under 20% of professionals.  It was nearly 40% in the 80's (ACM report).  

Let us not forget that Ada Lovelace was the 1st ever computer programmer, 6 women were the original programmers of ENIAC, and Susan Kare designed most of your Mac interface and icons.  Karen Sparck-Jones was one of the pioneers of information retrieval...and she said "Computing is far too important to be left to the men" when she won the BCS Lovelace medal.

Here is my list of top 10 computing chicks (alive today) in no particular order:
Go through this Wikipedia list of computer scientists and have a vodka for every woman you come across.  Don't worry, you won't be getting very drunk.

Is SEO too important to be left to the men?  There are a lot more women in SEO than in computing, there's quite a comprehensive list on the blogroll of "Women of SEO".  Don't try the vodka game here, you will be in a bad way :s

Is there a way to attract more girls over to computing?  How many of the SEO ladies out there would make excellent computer scientists?  Can computing poach a few? You know, seeing we're in trouble and all...

Google, my backend system

Right now, we're the web equivalent of the horse and cart.  We have invented the wheel and domesticated animals, and this has revolutionised our existence, especially the way in which we do business, but...I don't see Ferraris, E-type Jags or anything like that right now in web world.

Are we heading that way?  Yes, for sure.

Search is not supposed to be something independent of the rest of your web experience, or actually your digital experience.  You are going to be able to access search from any device or environment without actually having to go to a search engine.

Imagine you're typing away at an article for your photography blog.  The intelligent environment you're in is already aware that you are writing something for your blog, because it has seen patterns and features develop over time.  You highlight something and summon Google.  It does something pretty cool.  It goes out with the highlighted words as a query, but already has the context of the query, because it knows all about your writing for your photography blog.

It rushes out and visits all the top results.  These results are dependent not just on the keyword phrase but also on the other variables gathered from your intelligent environment.  Then it pulls out all the key concepts and information and writes you a summary to answer your question.  You can "repair" the results by typing something like "No, I meant..." or "Perfect! Tell me more about the Canon".  And off it goes again.  Or "Let me see the top 5 documents", or "Show me related information",...

From a mobile device, you could summon Google during a conversation with someone, for example.  Imagine you're trying to figure out where the closest restaurant is to you both.  You summon Google and ask it, "Where's the best place for us to meet?  She's vegetarian."  You can request an answer to a question like "Did Angelina Jolie really bungee jump yesterday?" and get a response such as "Yes she did.  She jumped off a bridge in New Zealand".

I look forward to summoning Google and saying "Remember when I was writing that paper for that conference?  There was a quote by x about y in it, what was it?"..."That's right, who else said something about that?"...

Google becomes a backend system.  Gasp!  But really, there is nothing more natural than for the engine to be in the background.  I think that conversational systems removed from search are fun toys, but their real use is in information retrieval.  Once you start getting used to having all your information at your fingertips as and when you ask, you are also going to get used to conversing with the system pretty quickly.

For this kind of thing to work you will need strong summarization systems, natural language generation and understanding, machine translation, personalization and machine learning, not to mention all the other supporting technologies without which it could never happen.  Luckily these are all under development right now.

One of the most interesting questions I believe is "Does your behaviour change now that the search engine is conversational in nature?" - Does it become your friend, do you get attached to it because it shows human qualities, or do you treat it like a tool?  Does the way that you search change now that you no longer actually go to a search engine web page?  Are you more focused, more specific, more vague?  

Hummmm....so how are businesses going to take advantage of the search market then?  Clearly ads are still going to be served up, but how do you make sure the clever agent likes your content most of all?

Google discuss the more immediate future of search here.

November 25, 2008

SearchWiki according to me

I don't usually post about already well-covered news, but in the case of Google SearchWiki I will make a small exception.  SearchWiki allows you to manipulate the search engine results and leave comments for others about a result.

There is an awful lot more information on the actual Google blog, Danny Sullivan wrote a nice guide, and there's also a Q&A with Google about it.

I've asked around and most general users don't seem to have even noticed it was there.  My mum definitely has no idea what the whole thing is about, and because she doesn't want to break anything, she isn't going to press any of the buttons.  The more savvy users, right down to the programmers, said they weren't bothered with it either.

I keep forgetting it's there and so I haven't used it very much.  I think that we would all begin to use it if we started to see the benefit.  Sadly, in order to see the benefit, you have to start using it!

Google said “It’s a new way to empower users. You can remember answers to repeat queries. It lets you add your personal touch to our algorithms” (See the Q&A doc).

I genuinely think it is indeed a tool to help you alter the results to suit your particular slant on a particular query.  I also think it's a pretty cool way to collect a huge amount of user data, and human-edited results provide more information about the authority of a resource.

Remember how we looked at social media sites like Digg and said that the voting was warped because it was so easy to manipulate?  Well, seeing as this is a "closed" environment, meaning that nobody but you gets to see it, there is no reason to manipulate the results.  The issue of weighting each vote also disappears, because every vote belongs to a single user.

1-800-GOOG-411 was all about collecting phonemes to feed into a machine to make voice search possible today.  I think SearchWiki is along the same lines.

November 24, 2008

Evri: content recommendation

For the past week you may have noticed a little Evri button at the bottom of my posts.  If you click on it, you'll get a pop up window displaying related news, videos, topics and a visual of how this topic ties into others.  You can drill down from there and explore more content.

This cool little tool was created by a team of NLPers and AI aces, and the CEO is Neil Roseman (former VP of technology at Amazon).  The idea is to allow content to network and to help you discover new connections on your semantic ride.

"Evri's technology automates connections between Web content by applying a more human-like understanding of the words on the page. We think that there is a big opportunity to help website publishers better engage their readers and help readers discover compelling content in a new and addictive way."

It doesn't use keywords, popularity or anything other than each element available (place, person, article...) and how it connects to others.  Evri grabs content from highly rated resources and is constantly working to increase its knowledge base, building a nice data graph of the web at the same time.

I like it :)

The fact that you never actually type a query to get information is a trend I see continuing waaaay into the future.  There will be a time when there is no need to visit a search engine, as all of your information will be available to you in whichever environment you happen to be in (word doc, excel, programming IDE, game...).  Personalisation supports these applications; imagine how much a system can learn from you in 5 years of working with you.

So we'll be needing a fair few more of you to join the ranks in A.I please, and bring your HCI friends too.  
 

November 23, 2008

10 free NLP tools for the SEO

Here is a list of 10 seriously sound NLP tools that I've used or still use, both for SEO and other things.  I won't tell you what to do with them; I'm sure you'll find a use for a tool if you like it :)  There's a little code sketch after the list to give you a flavour of what these tools do.

  • FreeLing -  it's a package for language analysis containing amongst other things sentence splitters, pos-taggers, morphological analysis, flexible multiword recognition, named entity detection...
  • Assert - it's for semantic role tagging.  It annotates naturally occurring text with semantic arguments.  
  • LingPipe -  Java libraries for linguistic analysis.  It uncovers relationships within your text, and classifies text passages by language, character encoding, genre, topic, or sentiment and it can also cluster documents into sets.
  • WordSmith tools - lots of language tools in one environment, the text appears all highlighted and analysed for you to use.
  • WordNet - cool little machine readable dictionary, but not good for domain specific tasks.  There is also a Java library.
  • GATE - for general text engineering, an awesome toolkit for text-mining.
  • SenseClusters - clusters similar contexts together (using unsupervised methods)
  • Amalgram - it's a POS tagger (it uses the Brill tagger too)
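
To give you a flavour of what these tools do, here's a minimal Python sketch using NLTK (which isn't on the list above, but is free and easy to install): it does POS tagging and a WordNet lookup, two of the tasks the listed tools cover.  The sentence and word are just examples.

```python
# Minimal sketch using NLTK (not one of the tools above) to illustrate
# POS tagging and a WordNet dictionary lookup.
import nltk
from nltk.corpus import wordnet as wn

# One-off downloads of the data these calls need.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

text = "Search engines still treat documents as bags of tokens."
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging, roughly what FreeLing or a Brill-style tagger gives you.
print(nltk.pos_tag(tokens))

# WordNet lookup: list the senses of a single word with their definitions.
for synset in wn.synsets("engine"):
    print(synset.name(), "-", synset.definition())
```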

November 21, 2008

TGIF - hooray

I hope that you all had a brilliant week full of fulfilling and exciting projects and that the less interesting things eased by.  I hope you're looking forward to a nice weekend hiding from the cold or boldly facing it with vigour! 

Without further ado...some cool comics this week:

Be sure to check out "Kevin and the Googlebots", it's a brilliant comic strip by, well...Kevin.  It's one of my new favourites!

Also check out the Geeky comic, a nice site if you're looking for a smile.

And then there's Monty the inventor of course.  

I love Abstruse Goose too, simple and super funny.

I don't need to mention Dilbert, but if you haven't come across him yet...shame on you.

The usual geek facts for your pleasure:

  • TYPEWRITER is the longest word that can be made using only the letters on one row of the keyboard
  • 111,111,111 x 111,111,111 = 12,345,678,987,654,321
  • By the year 2012 there will be approximately 17 billion devices connected to the Internet.
  • E-mail has been around longer than the World Wide Web.
  • The first computer mouse was invented by Doug Engelbart in around 1964 and was made of wood.
Microsoft has no idea why the following happens:

Open Microsoft Word and type
=rand (200, 99)
And then press ENTER.

To finish...one of my favourite bits of footage ever...enjoy.

BTW I love the fact that some of you only visit on Friday for the TGIF post - clearly it's my best work :) 

User Experience at Google

At CHI 2008, Google presented a paper called "User Experience at Google – Focus on the user and all else will follow".  It's an overview of how the UX team at Google operate and how Google gets that super important job done.

Here they discuss their bottom-up 'ideas' culture, their data-driven engineering approach, their fast, highly iterative web development cycle, and their global product perspective of designing for multiple countries.  Google's core products are search, applications and commerce.  The UX team is located all over the world.

Here are some highlights:

Lots of cool stuff comes out of the 20% scheme, but the UX team have to make sure these projects are not just technically feasible and fascinating but also useful to the user.  There are also so many projects that the UX team has to be super organised to cover them all.

The UX teams educate and inform all of the teams on good user experience practice and work hard to make sure it is ingrained in their minds.  In fact they word it very nicely: they say "UX aims to get user empathy, and design principles into every Google engineer's head".  This is what they call "entering the corporate DNA".

All nooglers (new Googlers) are sent on a "Life of the user" training.  The UX team also hosts "Field Fridays", where "any Googler can attend field studies to connect them with the everyday problems and "delighters" of our users."  There are "Office hours" sessions for each product area where Googlers can get involved hands-on.  20% projects get some help in these sessions.

They don't do usability tests for each feature, instead they bundle up testing into "Regular testing programs" for any product area.  They streamline the recruitment process and spare 5-10 minute "Piggy-back" slots are made available for smaller projects.  They have a "User research knowledgebase", to make information accessible to teams by product area.  

As for the whole of Google, they use a data-driven approach.  Absolutely everything is tracked at Google, which is really sensible (computing people do have an unnatural passion for data, I might add).  Some UX experts work on usage data, gathering things like page-views as well as product growth and the number of "active" users (they mention that defining these isn't straightforward).  For Blogger, for example, they use a variable-length time window based on what is typical for each blogger, because that product isn't the same as the others.  They also use A/B testing, but of course it doesn't stop there; there's a load of qualitative and quantitative data as well.
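
The paper doesn't spell out the statistics behind the A/B tests, but a typical one boils down to comparing two conversion rates with a two-proportion z-test; here's a minimal sketch with invented numbers, just to make the idea concrete.

```python
# Two-proportion z-test for an A/B experiment (all numbers invented).
from math import sqrt, erfc

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided
    return z, p_value

# Variant A: 120 conversions out of 2,400 views; variant B: 150 out of 2,500.
print(ab_test(120, 2400, 150, 2500))
```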

Updates and changes to products including new things coming out means that they have to use: "a number of agile techniques such as guerilla usability testing (e.g. limited numbers of users hijacked from the Google cafeteria at short notice), prototyping on the fly, and online experimentation." They'll use live instant messaging also.

On a global scale, they have to make sure that the cultural, regulatory and structural differences between locations are addressed correctly.  They use Global Payment as an example, which impacts Google Ads and Checkout, as well as financial regulations and tax issues.  Geotargeting also comes under this.  How can they predict the location of a user, or their language?  This is why the team is global, and they carry out global projects.

I think it all sounds really exciting and well structured.  I would love to see that data :) 

Issues with collaborative voting

Collaborative voting is used a lot these days in news systems, where people submit articles and others vote on whether they are interesting or not.  The articles with the most votes are ranked highest, making it to the much-coveted front page.

There are some issues with these systems which cause degradation in user feedback; here are a few:

  • Not all votes carry the same weight - if an expert and a layman both vote on an article, the expert's vote is the more noteworthy (a toy example of vote weighting is sketched after this list).  There are problems with establishing who the authorities are on which topics.
  • Social voting: consistently voting for people you know and your friends.  These are not always votes based on the quality of the article.  
  • Sometimes though, users find that they appreciate articles from a certain author and track them, voting often on their submissions, but in this case, it is a genuine vote.  It's hard to tell these apart.
  • Some votes are generated without much thought.
  • Some votes are given for fun or for profit.
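
As promised above, here's a toy Python sketch of vote weighting (not taken from either of the papers cited just below): each vote is scaled by a per-user reputation, so an expert's vote counts for more than a drive-by one.

```python
# Toy vote weighting: score = sum of votes, each scaled by voter reputation.
from dataclasses import dataclass

@dataclass
class Vote:
    user: str
    value: int  # +1 (interesting) or -1 (not interesting)

# Hypothetical reputation scores, e.g. learned from past voting behaviour.
reputation = {"expert_anna": 3.0, "new_user_bob": 0.5, "spammer_eve": 0.1}

def weighted_score(votes):
    """Sum the votes, each multiplied by the voter's reputation (default 1.0)."""
    return sum(reputation.get(v.user, 1.0) * v.value for v in votes)

votes = [Vote("expert_anna", 1), Vote("new_user_bob", 1), Vote("spammer_eve", -1)]
print(weighted_score(votes))  # 3.0 + 0.5 - 0.1 = 3.4
```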
There is a lot of research going on to resolve these issues, take a look at these to start with:


"A Few Bad Votes Too Many?  Towards Robust Ranking in Social Media" (Jiang BianYandong Liu, Eugene AgichteinHongyuan Zha)

"Dynamics of Collaborative Document Rating Systems" (Kristina Lerman)


November 20, 2008

Google voice search - say wha?

Google released voice search on the pretty-looking iPhone this week.  The Google Mobile blog post on it is cool, with a video and everything, and you'll also find a low-down over at CNET.

Many of you will know 1-800-GOOG-411.  The whole point of that operation was to gather enough phonemes (units of speech) to make this possible :) 

I tested it briefly on a few queries with a couple of friends and overall I think it did really well.  I would definitely switch to voice over type from now on.  

The queries "Bushwalking", "Dog", and other straight forward ones worked well.  The term "bollocks" wasn't well received because Google returned results on politics...kinda related but not what we were after.  The Geolocation didn't work when we asked for cinema times.  G thought we were in Milwalkee.  We're in Norwich in the UK.  

How does it work? Here is a high-level summary of what's being said over at Waxy:

The sound of your voice triggers a connection to the search engine, then the chunks of audio are sent through.  It is believed that the voice is broken down into phonemes or a fingerprint of the file, so that just enough gets sent through.  Feature extraction does have to occur; we're not sure how it's done right now.

Google uses the open-source Speex codec because, amongst other things, it works really well with Internet applications.  The codec (coder-decoder) encodes the signal so that Google can understand it.  The teeny file gets sent as a POST request and then Google sends an even smaller file back.  Once Google has the voice signal, the page of results is triggered along with a GET request for the voice-to-text string.  The voice-to-text operation doesn't take place inside the iPhone, because that would mean sending substantial data, so it's more likely this is taking place on Google's own servers.  An array of search terms is then presented, ready for use.
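
To make the flow above concrete, here's a purely illustrative Python sketch of that kind of client.  The endpoint URL, headers and response format are all made up; Google's actual voice search protocol isn't public, so treat this as pseudocode in Python clothing.

```python
# Hypothetical client flow: POST a small codec-compressed audio blob,
# get back candidate search terms. Endpoint and payload format are invented.
import requests

HYPOTHETICAL_ENDPOINT = "https://speech.example.com/recognize"

def recognize(encoded_audio: bytes) -> list:
    """Send compressed audio; the server does the heavy voice-to-text work."""
    response = requests.post(
        HYPOTHETICAL_ENDPOINT,
        data=encoded_audio,
        headers={"Content-Type": "audio/x-speex"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["candidates"]

# encoded_audio would come from a Speex (or similar) encoder on the device:
# print(recognize(encoded_audio))
```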

Check out the Waxy site regularly for updates on this; secret settings have already been found, for example.

You might also want to keep an eye on Nuance (who provided Google with the OCR software in 06), because Google enticed a fair few impressive engineers to join their ranks.  Many of the others went to... Yahoo! (OneSearch technology available).

Google also encouraged the formation of the Open Handset Alliance.  

If you're interested, register for the Voice Search Conference in San Diego; it's on 2-4 March 2009.  BUT if you're over here in the UK, you can pop along to Interspeech in Brighton on 6-10 September; you might see me there (they're hosting Loebner there this year).

November 19, 2008

Deep-space internet a reality


NASA has reported the first successful tests of a deep-space Internet.  The Earth's network was used as a model, but instead of TCP/IP they used an interplanetary communication network based on DTN (Disruption-Tolerant Networking).

Naturally Vinton Cerf from Google was involved too, and it's all pretty cool, don't you think?

Read more over at engadget.


Optimal Marketing Strategies over Social Networks

"Optimal Marketing Strategies over Social Networks" (www 2008) is a paper written by Jason Hartline (North-Western Uni), Vahab S. Mirrokni (MIT), Mukund Sundararajan (Stanford).  It's interesting because it gives an idea of how businesses can use social networks in an effective way to sell their products.  

I think the issue at the moment isn't selling products via social networks, but rather going in unintelligently with the hard sell, and the spamming.  The way to sell your products and services is to find interested individuals and to approach them in a friendly, social-networking way about your stuff.  Use social networking etiquette.

They looked at influence and revenue maximization.  A buyer's decision to buy the product is influenced by other buyers in the social network and also by the price of your product.  When the buyers were completely symmetric, they could find the optimal marketing strategy in polynomial time.

They looked at approximation algorithms and used the influence-and-exploit strategy.  Basically, you give the product away for free to a select number of buyers, then you use a "greedy" pricing strategy for the buyers attracted by the influential individuals in their community.  They developed set-function maximization techniques to locate the target buyers to influence.  When a buyer is influenced by other buyers, it's called "the externality of the transaction"; when that influence helps a sale, it's a "positive externality".

We know that users with the most connections have the most influence.  However, the probability of people buying the product decreases as the marketing strategy progresses.  This is why the ultimate method is to give the product away to start with, much like TiVo did.

To start with, you approach individuals and give the item away, then you go on to the "exploit" stage.  You visit buyers in a random sequence and offer them a "myopic price" (optimal pricing for revenue based on the influence of the initial buyers and the buyers who have already bought the item).

They used a simple dynamic programming approach to identify an optimal marketing strategy.  Because it's symmetric, the order in which you approach buyers is irrelevant: "the offered prices are a function only of the number of buyers that have accepted and the number of buyers who have not, as yet, been considered."

They found that the problem of computing the optimal strategy was NP-hard, even when there was no uncertainty in the input parameters.  Using automated tools to compute strategies involves computational issues like this one; basically, you don't have to worry too much about that right now unless this stuff tickles your fancy.

The conclusion:

If a set S of buyers have previously bought, offer the next buyer i the price vi(S), where i is a buyer, S is the set of previous buyers, and vi(S) is a non-negative number giving buyer i's value in that situation.

"This price simultaneously extracts the maximum revenue possible and ensures that the buyer buys and hence exerts influence on future buyers."
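
Here's a minimal Python sketch of the influence-and-exploit idea.  The value function vi(S) is left as a toy (each previous buyer adds a fixed amount to what the next one will pay); everything apart from the overall shape of the strategy is invented.

```python
# Influence-and-exploit sketch: give the item to a seed set for free,
# then charge each remaining buyer the myopic price v_i(S).
import random

def influence_and_exploit(buyers, seed_set, value):
    """`value(i, S)` is v_i(S): what buyer i will pay given previous buyers S."""
    owners = set(seed_set)            # influence step: free copies
    revenue = 0.0
    remaining = [b for b in buyers if b not in owners]
    random.shuffle(remaining)         # exploit step: random order
    for i in remaining:
        price = value(i, owners)      # myopic price: the buyer accepts it...
        owners.add(i)                 # ...and now exerts influence on later buyers
        revenue += price
    return revenue

# Toy value function: each previous owner adds 1.0 to what a buyer will pay.
toy_value = lambda i, S: 1.0 + len(S)
print(influence_and_exploit(range(10), seed_set={0, 1}, value=toy_value))
```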

There is a lot more detail in the paper and the equations are worth a thousand words as always.  Take a look if you're interested in picking at it.

November 18, 2008

What is the Ubiquitous web?

We hear an awful lot about the semantic web, an extension of Web 3.0 - but not a great deal about the ubiquitous web.  Web 3.0 lays the foundations for the ubiquitous web.

"Ubiquitous" = everywhere at once

"Ubiquitous web" = a web that learns and reasons, based on the anytime/anywhere/anymedia paradigm.  You can access it from wherever, and all of your data and applications can be stored in it.  It's "pervasive" because all physical objects are resources accessible by URIs.  Every object is web-accessible and interlinked.  It strives to provide you with the right thing at the right time in the right way.

Semantic web stuff includes: thesaurus and taxonomy, ontologies, personal assistants, semantic search and websites...

Ubiquitous web stuff includes: semantic social networks, semantic email, context aware games,  semantic agents ecosystems, natural language...  

Ubiquitous web agents are going to be more location-aware than the standard browser.  They are user-aware as well as context-aware, and users can also annotate their resources.  It also allows for easier browsing because everything is arranged in folksonomies, so you can jump from one concept to a related one.  The social tags provide information on the proximity of the relationships.

The ubiquitous web is also "context aware" (location and user aware).  A knowledge and communication model improves the user experience in the physical context, very much like the web does in cyberspace.  "Context" allows for the rectification of the properties in the environment of the application.  This means that it can determine what to customise and when.

The "adaptation rules" are specified by UML annotations in the "context" dimension.    involves a set of generic and pre-defined operations for each model element which is part of the information, presentation, operations or navigation design.  

So...the ubiquitous web can adapt the environment to user preferences.  This creates a reactive environment.

There's a lot of work going on in mobile (see MORFEO) at the moment and also in cameras, GPS, search technology...


Predictions for 2008 - were they right?

The eLearn magazine from the ACM published an article where they had asked a number of scientists what they predicted in technology for 2008.  This is focused on e-education but it's good to see some of these predictions anyway as they concern us all really.

This was in January, and seeing that we are approaching the end of the year, I thought I'd take a look and see who was right, who was wrong, and who needs a little more time to be right.  I'll pull a few out; read the article for more.

The author, Lisa Neil, predicted: meaningful connections in social networking, less democratic processes to allow us to find quality info, and it becoming easier to locate and create quality content.

I think that social networking hasn't yet matured to the point where people are forming meaningful connections.  A lot of users are collecting followers or friends for the sake of it or for marketing in a blind way.  It is easy to create content now, but maybe it gave way to more rubbish being published.

Stephen Downes predicted that users would shun Facebook, iTunes and so on for commercial software, the filtering of social network feeds, the syncing of digital devices, and lots more academic material available online.

Yes for the digital syncing (iPhone etc...), not a chance for commercial applications; open-source and free things are almost the norm now.  I think we are better at filtering social networks, but we don't yet have very good tools to help us do this.

Jay Cross: 2.0 will be appended to everything - Yes, he was quite right!

Michael Feldstein: institutional support for Facebook and other web 2.0 tools - Universities are indeed embracing those and including them in modules and so on.

James Hendler: Semantic web will spread: Yep! (mind you he would say that)

Mine were: more semantic web, loads more going on in social networks, more ubiquitous web stuff, and more research in natural language systems and Q&A.

Verdict: right about semantic web and social networks, NL systems and Q&A are indeed in research, not so much ubiquitous web around yet but getting there.

November 17, 2008

Things I'd like for xmas from Twitter

I use Twitter daily and think it's great for businesses as well.  Customer service is vastly improved and product information can be disseminated to those interested.  I use it as a personal account and like to follow people with the same interests, because that way I can get really interesting information from them, and hopefully they can get some from me too.  Twitter is pretty basic though, which is good, but I wish there were a few more features that would make my experience and account management a bit better.

  • Putting followers into categories (SEO, Web 2.0, yogis...) - there is "GroupTweet" but you have to ask everyone you want in each to join it. Laborious.
  • Having the "search for people" tool actually work and not be out of service every time I try to use it
  • Threads for replies (I can't always remember what the reply was about)
  • Categorise users by "not following", "new", "long-timers"...
  • Recommendations for people who I might be interested in following
  • A filter for spam and fakes
There are external tools that allow you to do some of these things.  I think it would be nice to have it all in Twitter though.

For a good list of Twitter tools, see the Seoptimise list.

Free Tech Books

FreeTechBooks is a free online repository of lecture notes, textbooks, and books on computer science, engineering and programming.  It's all legal and a really good resource for everyone interested in those topics.

Some resources on there that I think might be interesting for SEO people:

PhD offer - for the academics

Just a shout out to the academic readers: the University of Geneva has a PhD position in Machine Learning for Multimedia Information Retrieval going if you want to apply.

You'd be working in:

* multimodal information access, including:
o multimodal data analysis
o interactive learning
* Large-scale multimodal information handling, including
o construction of large-scale interactive frameworks
o relevance feedback handling

You'll need C++ and Java, and a strong background in machine learning and computer science.  Because this position is in Geneva, you need to speak excellent English and also some French.  If you're not fluent in French, you must be willing to learn.

As with all PhDs, you'll have teaching duties and be expected to write papers.

Check out the announcement here.

And good luck if you're applying!

November 16, 2008

Super fast live blogging and more

At PubCon a lot of people live-blogged from the event.  This means that as the talks and presentations took place, they shared what was being said and what was going on via their blogs.  This is really important for any community because it allows for the sharing of information (which is what the web has always fundamentally been about), and opens up discussions.  Not everyone can go and attend a conference at the other end of the world for financial, practical or other reasons.  There are also loads and loads of conferences each year - it's not possible to attend all of them unless you can make a living out of it.

I like the SEO conferences, the web 2.0 conferences, and all the HCI, AI, natural language processing and information retrieval conferences.  Not to mention all those on smart agents but there aren't so many of those.  I can't go to all of them!

Having live-blogged myself, I know it's knackering: you're typing away, it's frantic and hard work!  I have come across a really cool gadget that can speed things up no end though, and I am really excited about it.

It's called "DigiScribble".  It's a pen basically.  You can write with it on paper, or even draw, and it will store all the information digitally in the little device you clip to the paper.  Then you take it to your hotel or home or something and upload it all onto your computer.  You can leave it all freehand, or have it convert it to typed text and nice smart diagrams or whatever you have. 

It can also take over from the mouse, so it's great for presenting: you point it at the screen and it moves stuff about, opens documents, all the stuff you'd do with a mouse, without having a clunky mouse about (although I love mine with a passion).

I have a ton of diagrams to do for my system design, which have to be in my PhD thesis.  PowerPoint takes ages and ages, dragging those little shapes about, and all that.  Now I just freehand draw it on paper, and convert it to "shape".  It has made my life a whole lot easier.

I'll be using it in meetings, conferences, and even on the plane so I don't have to have my laptop open and be typing away.  

Definitely give it a go; it's priced at £49.99, and I'm pretty sure you can get it in the US and elsewhere too.

There's a live Demo here.


The power of twitter - a personal experience

I began using SlideShare a few weeks ago.  It's a place to share presentations with others in the community; they can download them, favourite them, view them, and embed them in their own sites and blogs.  It's a great source of information on a wide variety of topics, and it supports OpenOffice, which Google's presentation tool doesn't.  It worked great and I began using it to share info with the uni students and colleagues and so on.

For about a week, it stopped working and I could no longer add my presentations.  I'd begun to rely on it, and suddenly it was letting me down (it is in beta, so let's not be too harsh, and it is a free service!).

I vented my frustration on Twitter saying something like "I will never use SlideShare again, it won't let me upload anything and I can't see where to contact anyone".  Within a few hours, SlideShare were following me and also asked what was wrong, how they could help and pointed me to their "contact us" link that I couldn't find.  I explained my issue via Twitter and they explained what the problem was.

What did they achieve?

Me actually bothering to go back to try the service again, and when it still didn't work, I decided to leave it for a week and see how it goes then.  Without their Twitter intervention, I was a lost cause to them and their product.  

Obviously if it never works again, I'll use a different service altogether, but the important thing is that they made me feel valued and understood.  I think a lot of companies should learn from just this one instance, and I know many like Dell already look after their customers and users this way, not to mention potential customers.

Time to get on board those of you who haven't yet!

November 14, 2008

TGIF - cool

Welcome to another round of TGIF.  I hope the week has treated you well and that you're looking forward to a fun weekend.  Make sure you make the most of it, and take time to chill out.

Without further ado:

Check out this site with a collection of geek jewellery.  I really like the Lego ring, but the USB bracelets are cool too.

Check out the awesome footage from some robot programmers, I'm not sure this is part of their research, but I'd like to be involved!

And do not miss out on the dancing Daleks.

And now for some geeky facts:

  • 32% of all Geek keyboard faults are caused by a build up of doughnut particles, nose hairs and Frito crumbs stuck between the keys.
  • There are more than 1,000 chemicals in a cup of coffee. Of these, only 26 have been tested, and half caused cancer in rats.
  • A Geek once wrote an entire piece of code using only zeros. 
  • The word "modem" is a contraction of the words "modulate, demodulate."
  • Wearing headphones for just an hour will increase the bacteria in your ear by 700 times.

Google Tech talk

On the Google channel on YouTube you'll find a tech talk called "Knowledge-based Information Retrieval with Wikipedia" from October 31st 2008.

It covers the limitations of search engines today.  Documents and queries aren't really understood at all, because they're still viewed as tokens.

They tested a method where they consult Wikipedia for knowledge.  It hasn't worked so far but there has been a lot of research on it.  Wikipedia is useful for semantic relatedness (see Wikirelate).  

The idea is to treat Wikipedia like an ontology here.  Wikipedia, however, is not a formal structure, so it's not easy.  It's believed that it can be used in this way, using HCI rather than AI or NLP.

Koru is introduced for exploratory search.  It works well, although improvements are necessary.  "Wikiminer" is also demoed.  

For an awful lot more detail and interesting information, take an hour, sit back and enjoy the talk.

November 13, 2008

Sphinn spam - some solutions

*Before you read, a clarification - I'm aware that Sphinn is working on a newer version, and also that they use editors, mods and user interaction for spam fighting - this post is about those techniques and their limitations, and also introduces some new ideas*

Sphinn has been swamped with spam recently; I've seen a lot of it myself and it's been reported by other users, including Zigojacko.  What's up?

Although Sphinn is small in comparison to Digg, it uses the same kind of system.  People submit stories and they get voted on by other members.  The posts with the most votes go to the "Hot topics" page, which is also the content that you'll get in your feed.  Basically, the spam problem is the same too.  People post advertising for their products and services instead of information-rich resources that can be shared with the community.  It's also a drain on resources.  All in all, a nasty thing that needs to be dealt with.

Ways that this problem can be solved include:
  • Having a human spam editor 
  • Getting users to flag spam 
  • Moderation 
  • Captcha (not effective for human submissions)
  • Relevance rank
  • and finally...personalisation.
Having a human spam editor isn't ideal in a very dynamic environment like Sphinn.  It works for Wikipedia, but Wikipedia moves at a much slower pace.  Captcha is only useful for deterring bots (although some can break captcha now).  Moderation uses human resources and is time-consuming, as Tamar and Danny at Sphinn make clear in the Zigojacko thread.  Moderators should not have to clear out the spam anyway.  That leaves...

Personalization:
Digg already announced at Web 2.0 expo that they were working on a personalised front page.  This means that, yes, you might still get spam on your front page, but it's not really going to be worthwhile for the spammers, seeing as their audience becomes very small all of a sudden.  You get to moderate your own "front page", and in this sense, I guess something like Twine is worth a look (I really like Twine btw).

There is a way for spammers to use this to their advantage though, and this would be through social network monitoring: detect where their interest group is, and then target them in some way, like with paid-for ads.  It is still tricky for them though.

Relevance rank:
Most people will be aware of this; it's basically ranking results by relevance, but first you have to decide what's relevant.

On Sphinn, new submissions come in on the "what's new" section as they are submitted, which I like.  Sometimes stuff I'm interested in doesn't get many votes and would be buried before I could come across it (not "find" - I'm never looking on Sphinn, I'm browsing).  This section is easily spammed, although to be honest it's not as bad as I've seen elsewhere.

There has to be a filter as stories come in to minimise spam at this level.  One way to do it would be to use a topic detection algorithm and train it on a clean, already existing Sphinn corpus.  The system can draw patterns from the training data which help it label a submission as "Sphinn" or "Foe".  The patterns will be numerous!  A cool by-product is a way to visualize the community.

This type of method needs to be flexible as well though, otherwise if you used an unconventional title, for example, or weird words, your submission would be chucked out.  The more you train it the better it gets, and I would define Sphinn as a closed environment, which makes the problem easier to deal with.  There are only so many categories.  It's not as difficult as tracking spam in a global engine.  On top of that you could take user interaction into consideration to solidify your method.
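
Here's a tiny sketch of what that "Sphinn or Foe" classifier could look like, using scikit-learn (my choice of library, nothing to do with Sphinn itself); the training titles are invented, and a real corpus would need thousands of labelled submissions.

```python
# "Sphinn or Foe": naive Bayes text classifier over tf-idf features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set of submission titles.
titles = [
    "How Google crawls AJAX content",        # legit
    "Link building lessons from PubCon",     # legit
    "Buy cheap pills online best prices",    # spam
    "Make money fast with our SEO packages", # spam
]
labels = ["sphinn", "sphinn", "foe", "foe"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(titles, labels)

print(model.predict(["Cheap pills and SEO packages, act now"]))  # likely ['foe']
```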

Or failing all of that, we could beg: "Spammers, please please stop  peeing in the beer".

New Google patent - more personalization

Google published (another) patent on the 11th of November.  This one is interesting because it deals with serving up results in a preferred language.  It means that your search is no longer limited to English sources exclusively if you're in the UK or US or somewhere, because you can specify a language you'd prefer results to be in.  This doesn't mean all your results are in your chosen language; Google will serve those as well as your English results.  You could also use dialects if you wanted, and also dead languages (Latin, Greek...) AND...Klingon.

You might want Italian results although you are French, because you can read both, so why not? It makes things much easier for people who study or who speak several different languages.  If you speak 3 languages, you could be missing out on great information available in German or Lao for example.  I'd be interested to know how translators feel about this.

The invention dynamically determines the preferred languages and ranks the search results.  The system can determine what your preferred and least preferred languages are by evaluating queries, user interface and search result characteristics.

Query terms are not a good way of determining the language preference because for example, proper nouns are for the most part language independent, so "Marlena Shaw" is always going to be the same.  It gives no clue as to what language you want your results in.  

Keyword searches are also not complete enough to determine a language preference, because there's no context, and individual words can be language-independent or language-misleading.  The example used in the patent is the "Waldorf Astoria".

Rankings....

These results are going to have to be ranked to favour the results in the preferred language whilst still allowing the other results to appear.  It's done by using a predetermined shifting factor, or by adjusting the numerical score assigned to each search result by a weighting factor and re-sorting the search results.
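
As a toy illustration of the second option (everything here is invented, not taken from the patent): boost each result's score by a per-language weighting factor and re-sort.

```python
# Re-rank results by multiplying each score by the user's language weight.
language_weight = {"en": 1.0, "it": 0.9, "fr": 0.8}  # hypothetical preferences

results = [
    {"url": "example.com/a", "score": 2.1, "lang": "fr"},
    {"url": "example.it/b",  "score": 2.0, "lang": "it"},
    {"url": "example.com/c", "score": 1.6, "lang": "en"},
]

def weighted(r):
    return r["score"] * language_weight.get(r["lang"], 0.5)

for r in sorted(results, key=weighted, reverse=True):
    print(r["url"], round(weighted(r), 2))
```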

I hope this happens soon; it would be really interesting to get multiple-language results.  This is, once again, another example of how personalisation is charging towards us at full throttle.  Cool.

November 12, 2008

Social network analysis tool

Agna (Applied Graph & Network Analysis) is a tool built for social network analysis, sociometry and sequential analysis.  With Agna you can study relationships between groups, relationships between people, and the structure of social networks.  The results are graphed nicely so that you can simply see the results of your analysis.

It uses network analysis which assumes that the way that people communicate affects important properties of that group.  Here nodes are people and edges are communication acts.  It also uses sequential analysis which identifies the constants and rules that govern the inner structure.  
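
Agna itself is a point-and-click tool, but the same kind of analysis is easy to sketch in Python with networkx (my choice of library, unrelated to Agna): people as nodes, communication acts as edges, and a couple of standard centrality measures.

```python
# Tiny social network: nodes are people, edges are communication acts.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"),
])

print(nx.degree_centrality(g))       # who communicates with the most people
print(nx.betweenness_centrality(g))  # who sits on the most shortest paths
```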

It was developed by Marius I. Benta, a PhD student at University College Cork (Ireland). 

It's a free download (Windows, Mac, Linux), and you can see more screenshots here.


Blogosphere vs Web - ranking issues

I came across a very cool paper from SIGKDD 2008 called "Blogosphere: Research Issues, Tools, and Applications" by Nitin Agarwal and Huan Liu from Arizona State University.  It's an easy but long read for the geek, but it can also be quite happily understood by the layman.  I've pulled out some things that I thought were interesting and given you a short taster here, but I urge you to read the paper, it's brilliant.

There is a model of the web, called the webgraph, where each webpage is a node and each hyperlink an edge.  It provides a visual model of the web which can be used for many things; search engines, for example, use this graph for ranking documents.

We can't map the blogosphere in the same way because the number of links is sparse, and blog posts are often dynamic and short-lived.  Also, the comment structure which provides for interaction does not exist in the webgraph model.  The webgraph assumes that sites build links over time; this isn't so in the blogosphere.  We cannot use a static graph like the webgraph.

One way to model the blogosphere is to gather data concerning link density, how often people create blog posts, burstiness and popularity, and how these blog posts are linked.  It's also possible to use the blogrolls to find similar blogs.  This is what Leskovec et al. did: they used a cascade model usually used in epidemiology:

"This way any randomly picked blog can infect its uninfected immediate neighbors probabilistically, which repeats the same process until no node remains uninfected. In the end, this gives a blog network."

Brooks and Montanez used tf-idf to find the top 3 words in every post and then computed blog similarity based on that, which means that they could cluster them.

The problem is that these methods are keyword-based clustering and therefore have high-dimensionality and sparsity issues.  You could reduce this by using LSI, but the results still aren't so good.
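
Here's a rough sketch of the Brooks and Montanez style of approach, using scikit-learn: tf-idf each post, keep only the top 3 terms, then compare posts with cosine similarity.  It follows the paper only loosely, and the posts are invented.

```python
# Top-3 tf-idf terms per post, then cosine similarity between posts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "google released voice search on the iphone this week",
    "voice search and speech recognition are improving fast",
    "my favourite lego sets and geeky christmas presents",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts).toarray()

# Zero out everything except each post's top 3 terms.
top3 = np.zeros_like(tfidf)
for i, row in enumerate(tfidf):
    keep = np.argsort(row)[-3:]
    top3[i, keep] = row[keep]

print(np.round(cosine_similarity(top3), 2))
```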

Many companies have already seen the usefulness of blogs for sentiment analysis, trend tracking and reputation management.  Some systems use manually tagged sentences with negative/positive references, then use a naive Bayes classifier until everything has been classified.

Another way of finding the edges on the graph is by taking the topic similarity between 2 blogs.  This is a good idea, but using this method is still under research and very difficult.  

iRank is a "blog epidemic analyzer", and  predicts if 2 blogs should be linked (BlogPulse uses this).  They look for "infection" (how the information is propagated), so their aim is to find the blog responsible for the epidemic.  These are the authority blogger, the influential ones in the blogosphere.  It's good news when you find these bloggers because you can use them for word-of-mouth marketing as it were.  They provide valuable information that companies may be interested in, they may employ the blogger for example because s/he gives brilliant information to people about their products.

Another method to infer this has been to predict the odds of a page being copied or read, and also to look at topic stickiness.  The most influential node is chosen with each iteration.  It apparently outperforms both PageRank and HITS for this task.

Splogs (spam blogs) are the equivalent of link spam in search engines.  On the web, algorithms include variables such as keyword frequency, tokenized URL, length of words, anchor text and more; PageRank computes a score which is used to identify splogs.  This doesn't work on blogs, unsurprisingly, because they are too dynamic for spam filters to be effective.  This issue hasn't been resolved as yet, although there is research in this area and things are improving.

Link analysis is also used to find patterns.  The text around the links is used, and based on those links hubs and authorities are found.  You could use comments as links between the blogs.  An influence score could be determined by taking into consideration inbound links, comments, length of posts, and links out.  
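
The paper only says which signals could feed such an influence score; this little formula is entirely made up, just to show the shape of the thing.

```python
# Toy influence score: reward inbound links and comments, mildly reward
# longer posts, and (as one possible choice) count outbound links against.
def influence_score(inlinks, comments, post_length, outlinks,
                    w_in=1.0, w_c=0.5, w_len=0.01, w_out=-0.2):
    return w_in * inlinks + w_c * comments + w_len * post_length + w_out * outlinks

print(influence_score(inlinks=12, comments=30, post_length=800, outlinks=5))
```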

This is a fun and really interesting area of research; keep an eye on new things emerging from this research community.

November 11, 2008

About machine translation

Machine translation (MT) is all about translating text (or even speech) from one language to another.  It's part of computational linguistics, and uses a lot of NLP methods as well as statistical methods, rule-based methods, corpus techniques, and some AI too, amongst other things.  Apparently the idea goes back to the 17th century, and in the 1950s the Georgetown experiment took place, but it didn't really work, so funding was heavily reduced, meaning that a lot of research in this area was terminated.  In the 1980s it made a comeback.

It's important today, in the age of the Internet, because a lot of data is in different languages, and when we can't understand another language, we are deprived of what may be the most relevant content for our query.

First you have to pull apart the source text to make sense of it, and then you have to re-engineer it into the target language so it makes perfect sense to a target language reader.  Not only do you have to understand all the grammatical elements, the syntax, the idioms, the semantics, and so on, you also have to have a good grasp of the culture associated with the target language.  

Different systems use different approaches, here is a brief description:

Rule-based systems:
It's basically made up of a load of rules relating to translation between the two languages.  It can use a dictionary and map to that.  You can use a parallel corpus to find those rules, which means that you map between ready-made translations and pick out the common patterns, then feed these into the machine.  I did this and it wasn't very precise.  Google used SYSTRAN for many years, and this is a rule-based system.

Statistical methods:
Google Translate now works with these.  It involves generating a load of statistics derived from a large corpus.  The problem is finding a very large corpus.  This isn't too much of a problem for Google, but not very many corpora exist; even Google used the United Nations corpus to add about 200 billion words to its system.  These are used to train the system.
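
The post doesn't give the maths, but the classic statistical formulation is the noisy-channel model: pick the target-language sentence e that is most probable given the source sentence f, which Bayes' rule splits into a translation model and a language model (both estimated from those large corpora).

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```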

The main issues:
Word-sense disambiguation is very difficult.  This is when words can have more than one meaning.  Google doesn't do so well in this area.  There are 2 methods that are known to deal with this: the shallow approach (looking at surrounding words and drawing statistical information from this), and the deep approach (providing a comprehensive definition of each word to the system).  The deep approach takes a lot of time and isn't so precise, so statistical methods tend to do better.

Consider this for example: "Cleaning fluids can be dangerous" - does it mean that the act of cleaning fluids IS dangerous, or that cleaning fluids themselves ARE dangerous?
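
As a concrete example of the shallow approach to word-sense disambiguation, here's a minimal sketch using NLTK's built-in Lesk implementation (my choice of tool, not what Google uses): it picks the WordNet sense whose dictionary definition overlaps most with the surrounding words.

```python
# Simplified Lesk word-sense disambiguation via NLTK.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet")

context = "I deposited the cheque at the bank on the high street".split()
sense = lesk(context, "bank")
print(sense, "-", sense.definition() if sense else "no sense found")
```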

There are so many difficult issues in handling language anyway, seeing as it requires natural language understanding, which is far from solved right now.  There is a lot of research going on though, and eventually machine translation will work, but I'm not so sure how soon that will be.

What does it mean for SEO?
Well, your keywords and your content are going to look a lot different in other languages, and the text may also be modified and re-written in places.  This means that you have a lot less control over how these pages rank in other languages.  The solution?  Maybe it would be worth having multilingual staff :)


Read more here, from the University of Essex.  
There's also good information at Microsoft Research (MT labs).
John Hutchins is a great source of information.
And check out Carnegie Mellon University MT labs too.

Google translate tried and tested

Google has added its translation tool to Google Reader.  This means that you can now read feeds in different languages, directly translated within the Google Reader framework.

The performance is OK from French to English.  I'm a native speaker of both, and my first degree was in translating and interpreting French and German, so you'd hope I'd be able to compare the two.  I also worked in machine translation for a bit, and it was the topic of my Masters thesis.  I can hopefully provide a bit of insight!

I think it's OK for getting the gist, but the grammar isn't great, there are words missing here and there, and French words sometimes appear instead of English words in the translation.  Here are the results from Le Monde's feed, with my alterations marked "Me:":

1- Commemorations: quelles sont les dates auqelles vous  êtes attaché? 

G: Commemorations: what are the dates you are committed?

Me: Commemorations: what are the dates that you are attached to?

2 -Mama Africa chante "Khawuleza"
- La chanteuse sud-africaine Miriam Makeba, affectueusement surnomée Mama Africa est morte lundi 10 

G: Mama Africa sings "Khawuleza" - The South African singer Miriam Makeba, affectionately surnomée Mama Africa died Monday Nov. 10

Me: Mama Africa sings "Khawuleza" - The South African singer Miriam Makeba, affectionately nicknamed Mama Africa died Monday Nov. 10

3 - "Une tragédie froide, comme la vie"
 - Le prix Goncourt attribué à Atiq Rahimi pour "Syngué sabour. Pierre de patience".

G: "A cold tragedy, as life" - The Prix Goncourt awarded to Atiq Rahimi for "Syngué Sabour. Pierre patience."

Me: "A tragedy as cold as life itself" - The Prix Goncourt was awarded to Atiq Rahimi for "Syngué Sabour. Pierre patience."

4- Sarah Palin n'exclut pas de se présenter en 2012 - La candidate malheureuse à la vice-présidence a déclaré que si Dieu voulait qu'elle conquière la Maison Blanche, elle espérait qu'il lui montrerait la voie.

G: Sarah Palin does not arise in 2012 - The unfortunate candidate for the vice-presidency said that if God wanted it fights the White House, she hoped it would show the way.

Me: Sarah Palin does not rule out presenting herself in 2012 - The unfortunate candidate for the vice-presidency said that if God wanted her to conquer the White House, she hoped that he would show her the way.

In the first one, "attacher" means to tie, to bind, to attach - "committed" wouldn't really come to mind.

In the second, it doesn't know the French for "nickname".

In the third, Google translates literally, although it gets the word order right for "cold tragedy".

In the fourth, Google uses "to arise" instead of the very obvious "to present oneself", and it gets the wrong sense of "conquière".  As for the last mistakes, it completely misses the grammar: "he" refers to God, so it could never be "it" (unless Google has some special religious beliefs), and it also omits "her", which is important in this context.

The errors are lexical errors, ambiguity issues and grammatical errors.  Cohesion and coherence are also not handled well, and transfer and re-wording fail.   

Translation as a discipline is both an art and a science.  It isn't always a literal translation that you use, but rather a re-write to convey the meaning accurately, and sometimes to fit the local culture.  For example, a "suburb" in Britain is usually synonymous with leafy streets and nice houses.  In France, however, it's synonymous with the rougher areas of a city.  A literal translation would completely change the meaning of the text.

But: machine translation is very, very hard.  Google does well to get the gist of things across, and it does do much better than other automated translation services.  It uses a statistical approach rather than a rule-based model.  

More on machine translation in my next post for those interested in the subject.

Google ninja challenge-some results

The Google Ninja challenge was launched on the 23rd of October.  Volunteers were asked to fill in a preliminary questionnaire, then were given 8 things to query in Google, and finally completed a debrief survey.

In the preliminary survey they were asked how confident they were that they would find all the information.  Most were 80% confident.  They did, however, struggle, or at least didn't find it quite as easy as they had thought.  Internet professionals were no exception to the rule.

What is hard about those queries?  That's a question I'll be asking you.

It is hard, although some people have managed to find the answers quite easily.  Some gave off-topic answers that were as close as they could get, and some just couldn't find the information at all.

I'm still collecting data, so if you think you can handle the challenge, give it a shot and see how you get on.

November 09, 2008

The profile picture

Most of us belong to at least one social network, and they all require you to have a profile picture.  How do you choose one?  If you belong to lots of social networks, do you use a different picture in each?

PhD comic suggests:

November 07, 2008

TGIF - weekend ahoy

Welcome to another Friday, time for an easy and chilled out post with geeky humour and some inspiration to celebrate the end of a long week.  I'll be working on my PhD this weekend, so I'll be in a darkened room in front of my laptop and a series of IDEs - fret not, I like it :)

Without further ado:
  • Programs for sale: Fast, Reliable, Cheap: choose two.
  • I, myself, have had many failures and I've learned that if you are not failing a lot, you are probably not being as creative as you could be -you aren't stretching your imagination. (J.Backus)
  • It is important that students bring a certain ragamuffin, barefoot irreverence to their studies; they are not here to worship what is known, but to question it. (J.Bronowski) 
  • The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time. (T.Cargill)
  • Computer Science is no more about computers than astronomy is about telescopes. (E. Dijkstra)
  • On two occasions, I have been asked [by members of Parliament], "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. (Babbage)
About the last two:
  • So stop asking me to fix your computers - I have no idea how to; I'm a Comp. Scientist
  • And this still happens all the time!
A few examples, and funny ones at that, from my teaching days:

Student 1 catastrophe (after 2 intensive weeks of Java classes): "So, is the compiler a person or some software?".

Student 2 (admittedly after only a few classes of Java): 

- "My code doesn't work" 
- "Humm...let's take a look"
...after 15 minutes... "It's very strange, everything looks perfectly OK, let's type it all out again"...
...typing... System.out.println("Hello World!");
- Oh! Is it "println" with an "L" and not a "1"?
- Yes. 

Using link structure to fight webspam

We are all familiar with webspam, but not a lot of people know that detecting it is a classification problem in computing.  It's hard to get all of the features right, and difficult to find an efficient classifier.  

There's an interesting paper called "Improving Web Spam Classifiers Using Link Structure" by Qingqing Gan and Torsten Suel from the Polytechnic University in Brooklyn, NY.

The usual content features used in spam detection are (the authors provide comprehensive lists; a quick sketch of how a couple of these could be computed follows after the lists below):
  • fraction of words drawn from globally popular words.
  • fraction of globally popular words used in page, measured as the number of unique popular words in a page divided by the number of words in the most popular word list.
  • fraction of visible content, calculated as the aggregate length (in bytes) of all non-markup words on a page divided by the total size (in bytes) of the page.
  •  number of words in the page title.
  •  amount of anchor text in a page. 
  • compression rate of the page, using gzip.
They also calculated the following link features for each site:
  •  percentage of pages in most populated level
  • top level page expansion ratio
  • in-links per page
  • out-links per page
  • out-links per in-link
  • top-level in-link portion
  • out-links per leaf page
  • average level of in-links
  • average level of out-links
  • percentage of in-links to most popular level
  • percentage of out-links from most emitting level
  • cross-links per page
  • top-level internal in-links per page on this site
  • average level of page in this site
In addition, they added the following:
  • number of hosts in the domain. We observed that domains with many hosts have a higher probability of spam.
  • ratio of pages in this host to pages in this domain.
  • number of hosts on the same IP address. Often spammers register many domain names to hold spam pages.
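As promised above, here's a quick sketch of how a couple of the content features might be computed for a raw HTML page - a simplified illustration, not the authors' implementation:

```python
import gzip
import re

# Toy computation of two of the content features listed above for a raw HTML
# page: the fraction of visible content and the gzip compression rate.
# The sample page is invented; real systems work over crawled pages.

SAMPLE_HTML = b"""<html><head><title>Cheap pills cheap pills</title></head>
<body><p>cheap pills cheap pills cheap pills cheap pills</p></body></html>"""


def fraction_visible_content(html: bytes) -> float:
    """Bytes of non-markup text divided by total page size in bytes."""
    text = re.sub(rb"<[^>]+>", b" ", html)  # strip tags (crudely)
    visible = b" ".join(text.split())       # collapse whitespace
    return len(visible) / len(html)


def compression_rate(html: bytes) -> float:
    """Original size divided by gzipped size; highly repetitive
    (spammy) pages tend to compress extremely well."""
    return len(html) / len(gzip.compress(html))


print(f"fraction of visible content: {fraction_visible_content(SAMPLE_HTML):.2f}")
print(f"gzip compression rate:       {compression_rate(SAMPLE_HTML):.2f}")
```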
They used the C4.5 classifier.  This is a statistical classifier that produces a decision tree or a rule set (the latter is easier to understand); the trees are built from training sets that have already been classified.  I'll add that C5.0 is considerably faster, has lower error rates and offers more features.  They then used a second classifier and found that the results were far better: it "uses the baseline classification results for neighboring sites in order to flip the labels of certain sites."
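Just to illustrate the classification step (this isn't the authors' code, and scikit-learn's decision trees use CART rather than C4.5, though the entropy criterion is close in spirit), here's roughly how you might train a tree on a feature table like theirs, with made-up feature values:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors per site: [fraction_visible_content,
# compression_rate, in_links_per_page, hosts_on_same_ip], plus a spam label.
# The numbers are invented; in the paper each site gets dozens of features.
X = [
    [0.62, 2.1,  14.0,   1],   # normal-looking site
    [0.15, 9.8,   0.5, 240],   # thin, repetitive content on a crowded IP
    [0.55, 2.6,   9.0,   3],
    [0.08, 12.4,  0.2, 310],
]
y = [0, 1, 0, 1]  # 0 = ham, 1 = spam

# CART with the entropy criterion, as a rough stand-in for C4.5.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(clf.predict([[0.10, 11.0, 0.3, 280]]))  # -> [1], i.e. flagged as spam
```

The paper's second-stage idea would then revisit each site's label in the light of the baseline predictions for its neighbouring sites.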

Until a really robust and fast method is found, webspam will remain a problem.  It pollutes search engines and annoys users no end.  I hope to see a lot more work in this area in the future.  It's not my area of expertise, although the classification methods are similar to those I use, and I find it really interesting and worthwhile research. 

Patent for SEO software (2008)

I came across a patent for SEO software.  The inventors are Ray Grieselhuber, Brian Bartell, Dema Zlotin, and Russ Man.  It's called "Centralized web-based software solution for search engine optimization" and it was published on the 12th of June 2008.

They have patented a piece of software for SEO:

"In one aspect, the invention provides a system and method for modifying one or more features of a website in order to optimize the website in accordance with an organic listing of the website at one or more search engines. The inventive systems and methods include using scored representations to represent different portions of data associated with a website. Such data may include, for example, data related to the construction of the website and/or data related to the traffic of one or more visitors to the website. The scored representations may be combined with each other (e.g., by way of mathematical operations, such as addition, subtraction, multiplication, division, weighting and averaging) to achieve a result that indicates a feature of the website that may be modified to optimize a ranking of the website with respect to the organic listing of the website at one or more search engines."

"... The solution 290 may make recommendations regarding improvements with respect to the site's construction. For example, the solution 290 may make recommendations based on the size of one or more webpages ("pages") belonging to a site. Alternative recommendations may pertain to whether keywords are embedded in a page's title, meta content and/or headers. The solution 290 may also make recommendations based on traffic referrals from search engines or traffic-related data from directories and media outlets with respect to the organic ranking of a site. Media outlets may include data feeds, results from an API call and imports of files received as reports offline (i.e., not over the Internet) that pertain to Internet traffic patterns and the like. One of skill in the art will appreciate alternative recommendations ."

One of the claims is:

"...acquiring data associated with the website; generating a plurality of scored representations based upon the data; and combining the plurality of scored representations to achieve a result; recommending, based on the result, a modification to a parameter of the website in order to improve an organic ranking of the website with respect to one or more search engines."

How many of us use statistical methods for SEO?  I know I collect a lot of data, but not in the same format as this.  Can this be reliable?  Every site is very different and has different needs.  A human is able to discuss this with the client and adapt the strategy accordingly.  Can this system take those parameters into account as well?  It is a recommendation system, so I would think that you could adjust the weightings depending on the site you're analysing.  I would be interested to try this out in a free beta, but I don't see myself handing over a handful of cash just yet.
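If I read the claim correctly, the core of it is just combining scored representations into one result and recommending whatever looks weakest.  A toy weighted-sum sketch (the score names, weights and thresholds are entirely my own invention, not from the patent):

```python
# Toy illustration of combining "scored representations" of a site into one
# result and turning it into a recommendation. All values are invented;
# the patent doesn't specify concrete scores or weights.

scores = {
    "title_keywords": 0.4,   # how well keywords are embedded in page titles
    "meta_content":   0.7,
    "page_size":      0.9,   # smaller/faster pages score higher
    "search_traffic": 0.3,   # organic referral traffic
}

weights = {
    "title_keywords": 0.35,
    "meta_content":   0.15,
    "page_size":      0.20,
    "search_traffic": 0.30,
}

# Combine the scored representations (here by weighted sum).
overall = sum(scores[k] * weights[k] for k in scores)
print(f"overall score: {overall:.2f}")

# Recommend working on the weakest weighted component first.
weakest = min(scores, key=lambda k: scores[k] * weights[k])
print(f"recommendation: improve '{weakest}' first")
```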

I'm all for applying data mining techniques to SEO; I've looked at this before and it is useful.


November 06, 2008

The Future of Online Social Interactions: What to Expect in 2020


This discussion between Yahoo social media gurus, industry figures and academics took place at WWW2008 in Beijing.

We all use social media, or at least most of us do, certainly internet professionals.  The younger generation communicates very freely via this medium.  It is not a fad; it is a trend that will continue to grow strong in the years to come as networks continue to grow and evolve and their users become more expert in their use.  

We can expect a better understanding of user behaviour, more data mining, and new applications and domains.  There will also be better search capabilities within these networks.  The authors foresee a complete change in the workplace, which we know has already started to take place.

Franck Nack (Uni Amsterdam):

"Current systems utilise similitude as selector of new experience. ‘If you liked that then you’ll like this’. However the more profound and hence lasting experiences are the unexpected ones that are at once accessible and confrontational. It is easy to be either, but being both is a demanding challenge. So far we have little capability in marshalling such experience for users but in 2020 this will be different."

He says we need to root technological developments in the understanding that information interest is a sensory experience, filtered by emotional and cultural memories.  He says that we can gauge this through navigation, speed, focus and other factors.  He concludes:

"Social online interaction will be mobile and immersive interaction"

David Ayman Shamma (Yahoo Inc):

He says that the techno-centric view of the web is not in line with the social world.  

"The future of online social interactions requires a conversational redux. Content semantics alone is not sufficient. How we consume media (photos and videos) will become conversation centric. Conversational semantics, found in the conversations that ensue around media, is as important as traditional content-based semantics".

He says that we have to look at how people are sharing content within their communities and understand the supporting online social context.  He believes that conversational semantics will be a central part of the experience and a primary area of research.  

Dorée Duncan Seligmann (director of Collaborative Applications Research at Avaya Labs):

She says that communications will be fully integrated and unified with social software, and that contextual communication data across media will be shared and analysed, driving a new experience.  

"Imagine a search on a keyword that returns a list of items ranked by communicative or contextual relevance as opposed to content and large scale popularity. The ranking could consider if the information or the interaction sought is best from a certain source (a person’s whose opinion is respected by the searcher, or a person with whom there is a history of successful interactions) – through a particular medium (that is more accessible, comprehensible to that searcher), in a particular context (from a particular forum or news site). 

Such a search could return: people available to chat on that subject now, a list of blogs written by people whom you have valued before on that subject, or product ratings from people with like interests and backgrounds, it could set up a forum from a group of people on-line. 

Such a search would not return a list of content, but rather content vehicles, the people, devices, media, modalities that are most valuable to you and at the same time could establish communications directly. These rankings could be accessed directly by users, but more importantly would drive the processes that automate and manage communications."

The full discussion is available in the ACM Digital Library - well worth the subscription.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.