
December 19, 2008

Off on holiday!

It has been a very cool year, hasn't it?

I am setting off for France now, and will be offline (gasp!) for the first time in 3 years for the entire duration of my 10-day holiday.  That's right, no email, blogging, reading news, Twitter, Facebook, LinkedIn, Twine, Last.fm... oh dear.  I am quite sure the web will suffer without me and feel a bit guilty leaving you all to your own devices, but please try to have a good Christmas and New Year anyway :)

A little low-down on me and my year:

Stuff I've really liked in the research arena:

- Collaborative systems
- Semantic web data and ontologies
- Human computer interaction
- Data-mining and social networks
- Information seeking behaviours and personalisation
- Novel search paradigms
- Social network trust management
- Question-answering systems and conversational ones, obviously

Stuff I've really liked in the SEO/M arena:

- Linkscape
- Video and podcast SEO
- Different ways of measuring stuff
- Measuring success on web 2.0
- Searcher behaviour
- Brand and reputation management
- Multi-lingual SEO
- All of the Google system chat

Top 5 purchases this year:

- DigiScribble pen
- My Dell laptop
- My Lumix camera
- My Asics Kinsei running shoes
- My flight to Thailand and Australia (one way)

Top 5 things I did this year:

- Graduated from Wat Po Thai massage school in Bangkok 
- Self-practiced Ashtanga yoga for a year all alone
- Went to see strange art house films with my Mum a lot
- Went to Paris for a long weekend of fun
- Designed my natural language generation/understanding system at last

5 things you didn't know about me:

- I'm allergic to peppers
- I can stand on my head
- I don't like Apples
- I'd rather design than code
- I like Batman comics and The Punisher

I look forward to reading your top 5's when I return from the real world.  Thank you so much for supporting me and my blog; I've loved your comments and input.

Happy Christmas people and enjoy all the new year celebrations :)

TGIF - holidays!

Welcome to another installment of TGIF.  I hope you are all well and that you are looking forward to the Christmas break.  Many of you finish work today for at least 2 weeks so make sure you celebrate accordingly and enjoy a beverage of your choice and a couple of sweet treats too :)

Without further ado...

The Pedantic web:

 “I see the birth of artificial intelligence leading to increasingly snotty and patronizing personalities emerging via AI on the web to enhance our experience, just as they do in real life. At the moment it is difficult to imagine but over the next few years you are going to hear more and more about the Pedantic Web. It is a natural step in the evolution of the web towards Web 9.0 which we have named the Romantic Web. Eventually the web will contain and involve all of our relationships and connections, it will be our friend, our lover and our master.” (James Johnson - found on the no longer updated Science Faction blog)

Twitter - Taking it too far?

Facts:

A 'jiffy' is an actual unit of time for 1/100th of a second.

"The quick brown fox jumps over a lazy dog." uses every letter of the alphabet.

Thomas Edison (who invented the light bulb) was afraid of the dark.

Bill Gates' house was partially designed using a Mac.

Atari sold 400,000 VCS consoles in 1979.

The Atari 2600 only has 128 bytes of RAM.

Following on from our Atari facts, an Atari commercial from 1978 (the year I was born!):



"A few Chirps about twitter"

"A few Chirps about twitter" gives valuable insight into how we use the application and why we use it. 

It is a paper written by Balachander Krishnamurthy (AT&T Labs), Phillipa Gill (Uni Calgary) and Martin Arlitt (HP Labs - Uni Calgary).

"Our goal is to characterize a novel communication network in depth, its user base and geographical spread, and compare results of different crawling techniques in the presence of constraints from a generic measurement point of view".

They gathered 3 datasets covering nearly 100,000 users.  Detailed information was gathered on each user and the list of users they were following.  They say that relationships in Twitter are directed but that there is no way of gathering the set of reverse links: information on the set of users following a user.  

1st Crawl:
They collected data at specific times of the day and extracted the users that posted at these times - they collected data from each user and a partial list of his/her followers. They gathered data for 3 weeks.

"During this process the median number of users followed by the previously crawled users, m, was tabulated. To further the crawl the first m users followed by the current user would be added to the set of users to crawl."

2nd Crawl:
It focused on currently active users who continually post a series of 20 or more updates.  Details were collected on each user.

3rd Crawl:
They used a random walk with backtracking to collect the data.  They only considered one child of each node.

Distinct classes of users were identified:

Broadcasters - They follow few but have a large number of followers

Acquaintances - they reciprocate the follow (so have an even spread of following-followers)

Miscreants - spammers or stalkers who contact everyone they can to get followers

How did users access Twitter?
61% use the web, 7.5% mobile, 7.2% IM, 1.2% Facebook, 22.4% custom applications.

Highly popular users update their status very often; generally, those with more than 250 followers updated a lot more often than others.

There is a lot more information and analysis present in the paper, give it a read, it's very accessible.

Submit a paper to SAW 2009!

The "3rd Workshop on Social Aspects of the Web (SAW 2009)" is calling for papers - you don't have to be a die-hard academic or scientist to submit a paper. In fact I encourage those from other backgrounds to do so, as it's really good for diversity at these conferences, and your insight is valuable to the community.

"The goal of the 3rd Workshop is to bring researchers and practitioners together to explore the issues and challenges related to social aspects of the Web."

The deadline for submission is the 1st February - conference is 27-29th April in Poland:

* Long papers: max. 12 pages
* Work-in-progress reports: max. 6 pages
* Demo papers: max. 4 pages

If you fancy giving it a go the areas of interest (broadly) are:

 - People on the social Web (communities, collaboration, interaction models...)

 - Data and content on the social Web (social content organisation, semantic social web, ontologies...)

 - Social software and services (specific types of social networks, architectures, technologies...)

 - Mining the social Web (the social graph, activity patterns, marketing...)

For more in-depth information and the submission procedure see their website.  Pay special attention to the format required because if it's not in that format they'll discard your paper.

High level IR book

"Information Retrieval: Algorithms and Heuristics" by David A. Grossman and Ophir Frieder is a very useful book for those wanting to understand more about how information retrieval really works: 

"This book is not yet another high level text. Instead, algorithms are thoroughly described, making this book ideally suited for both computer science students and practitioners who work on search-related applications. As stated in the foreword, this book provides a current, broad, and detailed overview of the field and is the only one that does so. Examples are used throughout to illustrate the algorithms."

It is accessible enough for newcomers to the subject or non-experts to gain further understanding, and this, as I keep saying, is really important if you work in any search-related field.

Get yourself an early Christmas present!

December 18, 2008

The web in 2018

"Welcome to web 3.0" is a very cool and relaxed piece by Laurie Rowell for the ACM digital journal which you can freely access and subscribe to.  She talks about web 3.0 having mobile devices at its center and draws some very interesting comments from the most respected in the domain.  

Forget all your web 3.0 induced sighs and give it a read, you'll like it.  I say it's important for marketing people to know about basic IR and so forth but it's important for all of us to be aware of future web developments, whether you like the label 3.0 or not :)  

Some things I really liked from it (she starts in 2018):

"Your mobile sends your two-word message to your joking friend, discards the augmented-reality layer of info, and shows you what’s really going on in the building where you work: The second elevator is still down for repairs, the cafeteria offers your favorite almond croissants this morning (see calorie count!), the patent information you requested on a competitor’s product is waiting in your inbox, and company stock is down three points. You click on a link to The Wall Street Journal for an article on this last bit of information and listen as you enter the building."

This is exactly what I want!  Bring it on.

“Although there has long been a promise of a mobile Web, we are just now getting to the cusp,” says Michael Liebhold, senior researcher at the Institute for the Future.

Google CEO Eric Schmidt didn't refer to the "mobile web" but rather said web 2.0 was about applications involving Ajax and web 3.0 brought together a whole host of things with data in a cloud.

I agree with the "mobile web is just a launchpad for the cooler stuff!" and "we won't be connecting to the web but walking around it" (Liebhold)

GeoRSS is interesting for handling the location extensions to RSS.  When this is integrated into digital map systems, information can be gathered about physical locations like never before.  For example, you can see the news for where you are currently located.  The geo-web is also enabled by KML (Keyhole Markup Language).
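
Just to show how simple the GeoRSS extension is, here's a minimal sketch of pulling a latitude/longitude out of a feed item with Python's standard library.  The feed snippet is invented for illustration, and I'm assuming the usual georss namespace URI:

```python
# Minimal sketch of reading a GeoRSS location from a feed item using only
# the standard library; the feed content below is made up for illustration.
import xml.etree.ElementTree as ET

rss = """<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <item>
      <title>Local news story</title>
      <georss:point>51.5074 -0.1278</georss:point>
    </item>
  </channel>
</rss>"""

ns = {"georss": "http://www.georss.org/georss"}
for item in ET.fromstring(rss).iter("item"):
    point = item.find("georss:point", ns)
    if point is not None:
        lat, lon = map(float, point.text.split())
        print(item.findtext("title"), "->", (lat, lon))
```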

Context-aware systems will be able to anticipate what you want to do next, since they know your preferences, intentions and so on.  This means that advertisers can target you more effectively because they know your location and, for example, the fact that you like sushi and it's lunchtime.

This is just a short summary of what's in store for the future of the web/Internet.  I think it's very exciting, and I think it's going to move relatively fast.  It always depends on what users are ready for and also how quickly applications and things can be developed.

Searching The Social Web

I love this presentation about the challenges of searching the social web. It's a hard task, and a difficult problem to solve. Here all the issues and possible solutions are nicely outlined - enjoy!

December 17, 2008

Off Topic!

I felt like writing about something mildly off topic today, the "Chicken and the egg" so called problem.

I read papers all the time where scientists have alluded to it, and it is of course all over marketing, advertising, management etc...

"Tim Berners-Lee's vision of a Semantic Web is hindered by a chicken-and-egg problem".

The thing is that I really can't see where the "problem" is and I never really have.  There. I've said it. I have no idea what you're all going on about.

In super sped-up time:
Once there was ice.  Then bacteria started to form.  A very strange creature appeared who gave birth to another one, and this one laid an egg containing a dinosaur with a beak, and this one laid another egg containing a smaller more rounded animal with a beak and scales.  This one laid an egg with...a chicken in it.  

Now in normal time:
This chicken laid an egg too, and it contained a chicken.  This chicken will also lay an egg and it should also contain a chicken (unless aliens are going to come and mess up my whole explanation) which will hatch to be another chicken.

They are evolving as are we.  Eventually the chicken will lay an egg that will hatch chicken 2.0

Until then, there is no issue.  No egg, no chicken.  No problem.

The experts concur I might add.  


Ontology in a nutshell

We're going on a lot about ontologies at the moment so it seems right to explain what they are and how they fit in with everything.

Fabien Gandon has put together a brilliant presentation for you to read through.  It will explain everything from the basics to the lower-level stuff.  Very accessible and highly informative.


A global review of the semantic web industry

On the Cusp: "A global review of the semantic web industry" by David Provost is a great 38 pager on the semantic web industry at the moment.  It's good because I don't see many buzz words in there or very technical terms, it's high level as it should be for managers, vendors, strategists and customers.

The report is all about the semantic web, surprise surprise! He focuses on what companies involved in the semantic web do, and not so much on how they do it.  Twine, Primal Fusion and the Calais initiative have of course been quite important in this area.  Bottom line: people are investing in semantic web technology.

I have a great love of Twine.  I can spend hours and hours reading content on there, creating my own information groups and discovering new things.  If you're short of something to write about, visit Twine.  (Actually I just spent 20 minutes on there when I only went to copy their URL!)

I'm also a big fan of the Calais initiative, but in a different way.  They provide tools to help connect information together - it's like Lego.

Primal Fusion are in Alpha and I haven't had access as yet, but they look like they have something interesting to share.  A way to explore and organise your thoughts.

Back to David Provost:

"Intriguing possibilities are emerging, such as the role of “linked data”, Social Network Analysis and how the Semantic Web may aid this practice, and how the convergence of Natural Language Processing, Semantically enabled search, and the traditional publishing industry will play out. Time will tell, but the potential effects could be substantial".

It's interesting that he says that now deployers have to establish credibility and show that they are more than "2 guys in a garage" - times are changing.

He says that the SW is a global industry, and that vendors are thriving.  Some companies have already started using SW technology in risk management, knowledge management and other areas.

He mentions Franz's AllegroGraph, a semantic database which I really like.  You can use a free version, and they also make the RacerPro reasoner.  It can be used for social network analysis.

NLP is "emerging as a force for taming world wide content", basically quality content and wild content which is unknown.  So basically this field continues to grow and evolve as we are well aware.

Linked data is going to have valuable and super important uses.  This is one of the primary aspects; we touched on this with FOAF earlier.

Marketing, technical and solution patterns will have a greater role in selling semantic web solutions.

The author says that we're going to move away from the very low level concepts and terminology used by researchers and move on to a much higher-level of discussion.
 
So if I have this right, the semantic web can now be marketed so we need to use higher level language and focus on what it can do for companies.  Fair nuff.

Tad over at the CogBlog has some interesting questions for you and has some interesting ideas.  Swing by and take a look.

December 15, 2008

Designing for conversation

This is a very cool and light-hearted conversation led by Heather Gold at Google.  It's really funny and really interesting too.  It's relevant for social media people, and also for any other community situation.

She also says that she won't use "the words leverage or synergize unless it's for a very important lifesaving purpose".

"Innovative comedian Heather Gold explains basic differences between presentation and conversation and the assumptions underneath each. More entertainingly (and usefully) she demonstrates these ideas by creating a great conversation in the room so that all can feel the difference."

Identifying the Influential Bloggers in a Community

This paper was presented at WSDM 08 by Nitin Agarwal, Huan Liu, Lei Tang (Arizona State University) & Philip S. Yu (University of Illinois at Chicago).  "Identifying the Influential Bloggers in a Community" can be read at the ACM.

They look at the very important area of research concerning how we deal with the huge amount of data generated by bloggers and how we rank these blog posts.  

I've presented you with a short summary of the main points:

Whether a blogger is active or not does not determine whether s/he is influential.  Very active bloggers can be influential, or just as easily not.  The influential ones, however, are very important because they can help companies develop new business ideas and identify key concerns, trends, competitive products...  Bloggers can become product advocates; basically, they are market movers.  The blogging around the recent US electoral campaign shows how bloggers can also have influence over social and political issues.

The researchers say that 64% of companies have identified the importance of the blogosphere for their business.  Instead of trawling through endless posts in the relevant community, the best entry points are the most influential posts.

Technorati reports a 100% increase in the size of the Blogosphere every month.  This is huge and means that methods need to be developed in order to deal with this enormous amount of data.

You can't (as we've seen before) use PageRank or HITS or whatever method applied to search engines for the Blogosphere, because blogs are sparsely linked and the Random Surfer model just doesn't work for this.  Web pages can gain authority over time, but this is not necessarily true of blogs.  As they say, a blog post's and a blogger's influence actually decreases over time, because ever more sparsely linked posts come into existence.

They say that there is research going on regarding ranking on topic similarity but this is still very much on the drawing board right now.  They say that you could use traffic information, number of comments and more of these kinds of statistics, however you'd be leaving out all of those inactive bloggers.

They identify 4 groups of bloggers:
 "active and influential, active and non-influential, inactive and influential, and inactive and non-influential".  They create an influence score based on whether the blogger has any influential posts.  

You're influential in the following circumstances (obviously you could probably add quite a few more):
  1. Recognition - An influential blog post is recognized by many.
  2. Activity Generation - A blog post’s capability of generating activity (comments, follow-up discussions...)
  3. Novelty - Novel ideas exert more influence (lots of outlinks means that the post is not novel)
  4. Length - The blog post length is positively correlated with the number of comments, which means longer blog posts attract people’s attention.
For example:

Active & influential: 
"‘Erica Sadun’ submitted 152 posts in the last 30 days, among which 9 of them are influential, attracting a large number of readers evidenced by 75 comments and 80 citations".

Inactive but influential: 
"‘Dan Lurie’ published only 16 posts (much fewer than 152 posts comparing with ‘Erica Sadun’, an active influential blogger) in the last 30 days".

This is a very good example of a paper addressing the issues we're encountering in Blog post retrieval, categorisation and so on.  It is a very very important area of research and needs imho to receive a lot more attention and budget dare I say :)


December 12, 2008

TGIF - terrific!

Hello, and welcome to another edition of TGIF.  This week has been cold, icy and snowy for a lot of us, but let's not forget all those living on the other side of the world enjoying a hot and sunny summer season.  Amongst other things I spent my time filling in customs forms and so forth so I can send my things on to Sydney, where I'll be for a while from late Feb.  I am not sad to escape winter!

Without further ado:

Facts:

Do not believe in miracles. Rely on them.
Inside every large program is a small program struggling to get out.
The solution to a problem changes the problem.
It works better if you plug it in.
Given any problem containing N equations, there will be N+1 unknowns.

Quotes:

“If people never did silly things, nothing intelligent would ever get done.” (L.Wittgenstein)
“Getting information off the Internet is like taking a drink from a fire hydrant.” (M.Kapor)
“Yes, we have a dress code. You have to dress.” (S.McNealy, co-founder of Sun Microsystems)
“Computer viruses are an urban legend.” (P.Norton, 1988)

Error messages from Hex ("A Heath Robinson/Rube Goldberg-esque, magic-powered computer"):

Mr. Jelly! Mr. Jelly! Error at Address Number 6, Treacle Mine Road.

+++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++

+++Whoops! Here comes the cheese! +++

*Blip* *Blip* *Blip* End of Cheese Error

The coolest computer program ever:  Do you want one?  I really do.

December 11, 2008

Google research beyond LSI

Google picked up Amrit Gruber, who is doing an internship with them.  He's pretty valuable because of his PhD research in statistical text analysis (which is what LSI is).  His method uses Hidden Topic Markov Models (HTMM), and a working version was released in 2007.

In this post Google mention PLSI (Probabilistic Latent Semantic Indexing) and also Latent Dirichlet Allocation as examples of variants of LSI.

It's different because instead of treating the document as a bag of words, it uses a temporal Markov structure.

Read the Google post here, and OpenHTMM is available here.  Good old Google - thanks for sharing.

This supports my post about how LSI in its very basic form, as summarized in various places including the excellent Wikipedia, is not the variety used at Google, whatever Matt Cutts says.  Yes, it is used, but he doesn't give away the important information; what he presents is a very, very basic version.  It's like saying "Yes, we use glue in our computer chips" or "Yes, here at NASA we use glue as an adhesive for our rockets".  It's unlikely to be the glue your child uses at playschool :)

LSA/LSI source code & tools

I'm often asked by students, researchers in other areas and sometimes SEO people where they can find LSI/LSA source code and tools.  My favourite beginners' tutorial on LSI is by Genevieve Gorrell from Sheffield University.  The term LSA is mostly used in computer science these days, but it doesn't matter what you call it.

There are a number of packages which will allow you to use LSA/I and also offer many other useful things regarding semantic analysis, IE and IR for example.

(LSI/A is also applicable to source code and to images.)

For coding your own, in short you'll need to (see the sketch after this list):

- Have a stopword file
- Process each file
- Compute the weights
- Normalize
- Print your data
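
Here's a minimal sketch of those steps using numpy: a tf-idf term-document matrix, a truncated SVD, and a cosine similarity in the reduced space.  It's a toy example for illustration, not a production implementation:

```python
# Minimal LSI sketch following the steps above, using only numpy (toy data).
import numpy as np
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]
stopwords = {"the", "on", "and", "are"}          # your stopword file

tokenised = [[w for w in d.split() if w not in stopwords] for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})

# Term-document matrix with tf-idf weights, columns normalised to unit length
tf = np.array([[Counter(doc)[t] for doc in tokenised] for t in vocab], float)
idf = np.log(len(docs) / (tf > 0).sum(axis=1))
A = tf * idf[:, None]
A /= np.linalg.norm(A, axis=0)

# Truncated SVD: keep k latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the latent space

# Cosine similarity between documents 0 and 1 in the reduced space
d0, d1 = doc_vectors[0], doc_vectors[1]
print(d0 @ d1 / (np.linalg.norm(d0) * np.linalg.norm(d1)))
```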

There's a MATLAB (most unis will have licences allowing you to get a free copy) toolbox called TMG which will allow for clustering, retrieval, indexing, dimensionality reduction and classification - a powerful package indeed! MATLAB can also do a whole load of other things, because there are plenty of extensions freely available, such as the SVM Toolbox.

JLSI is a Java implementation freely available.

The semantic-engine which also uses LSI/A in C++ (Google code).

The semantic vectors package is also available in Java + Lucene.

There's a working online tool at Uni Colorado LSA group.  It also does other types of classification.
 
There's gCLUTO with a nice interface for you - it gives you a graphical representation of clusters.

There's a demo here from Telcordia.

There's also a PLSI parser here, if you want to try the other variant and compare.

I think that will do for now, I hope that you have fun with these :)
  

December 10, 2008

Tips for blog writing? Really?

After writing commercial blogs and such things, I really like that my blog doesn't sell anything, or try to be anything more than what it is: A place for information about IR related topics which relate to SEO work, although not all SEO peeps will see it that way.

I follow no guidelines, I don't promote (although I submit to Sphinn, though I'm tiring a little of it now), and I write on the wrong platform... I don't care.  It's my hobby.  If you guys enjoy reading and get something from it, excellent stuff.  My work here is done :)

I'm going to list a generic number of tips given out by numerous blogs online and I'll answer each one truthfully in a no nonsense way:

Make your opinion known - Yes obviously
Link like crazy - NO!  I'll link when I feel it's appropriate and useful to my readers.
Be yourself - Yes, definitely!
Write less - No.  I'll write as much as I like, thank you v. much.
250 Words is enough - No.  (Where did that figure come from?)
Make Headlines snappy - Ok
Write with passion - Of course, what's the point of this blog otherwise?
Include Bullet point lists - Where appropriate.
Edit your post - Of course.
Make your posts easy to scan - Ok
Be consistent with your style - Obviously.
Litter the post with keywords - NO!
Write with the reader in mind - Of course but I have readers from a mix of backgrounds.
No Jargon - Well part of the blog is to introduce new things so yes Jargon, but always explained or linked to a clear explanation.
Make bold statements - When appropriate ok.
Be controversial - When I have reason to be.
Always respond to comments - Yes, always.
Write for a global audience - It's not always easy to do that, because this blog isn't aimed at my next door neighbour who is a nurse, or at the noob SEO (no offence, but I think this blog might come across a bit scary), but rather at advanced SEO experts, computing people, interested scientists, linguists etc...

Seth Godin says the best thing ever: write something that "causes the reader to look at the world differently all day long" - This is probably the entire idea of this blog, I'd love it if something went *ping* for you and you began a journey in a whole new exciting world :)

I could write less, but I don't think you'd get a complete summary of what are sometimes difficult topics; you'd get a snapshot and you'd have to do a lot of digging yourself.  The point is that you can just read my summary and understand enough to "get it", but if you want more, you can breeze on to new and more complex resources.

Advances in IE for the Web

This article was published in the Communications of the ACM and was written by Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld.  It's freely available and you can read the whole issue here.

Google usually gives you way too many documents when you're searching with a very simple query and, as the authors say, does not allow you to make very advanced searches, like listing all the people who published at a conference by geographical location.  In fact the "Advanced Search" function allows for very basic operations.  They say that the time has come for systems to sift through all the information for you and deliver an answer to your query.  I obviously agree, since this is the area I work in :)

They discuss a range of Information Extraction (IE) methods that are "open", in that the identities of the relations to be extracted are unknown, and the mountains of web documents require highly scalable processing.  (Open domains are exceptionally hard to test on, so usually you test on a "closed" domain, which is far more structured and easier to obtain good results from - the 2nd step is extending the method to work in an "open" domain.)

What's an IE system composed of?

The extractor finds entities and relationships between them, and you can use RDF (semantic web) or another formal language to represent them.  You need an enormous amount of knowledge to do this, and this can be obtained from a ready-made knowledge base built through supervised or unsupervised machine learning methods.

IE methods:

- Knowledge-based methods:
These rely on pattern matching: human-made rules constructed for each domain.  Semantic classes are applied and relationships identified between concepts; however, this is obviously not scalable (I can guarantee you that, as my own system is KB-based).

- Supervised methods:
These learn an extractor from a training set which has been tagged by humans.  The system uses a domain-independent architecture and sentence analyser.  Patterns are automatically learned this way and the machine can find facts in texts.  Getting training data is the problem.  Snowball and similar systems addressed this issue by reducing the manual labour necessary to create relation-specific extraction rules.  There is recent work with Markov models, for example.

- Unsupervised methods:
These label their own training data using a small set of domain-independent extraction patterns.  KnowItAll was the 1st system to do this: extracting from web pages, unsupervised, large-scale and domain-independent.  It bootstraps its learning process.  Very basically (see the paper for loads more detail), the rules were applied to web pages found via search engine queries and the extractions were assigned a probability; later, frequency statistics were added.  It uses the labeled data to build classifiers.

Yep - next is Wikipedia, which I think we all take a bit for granted.  The Intelligence in Wikipedia Project (IWP) also uses unsupervised training to train its extractors; IWP then bootstraps from the Wikipedia corpus.  The cool thing about using Wikipedia as a corpus, as many have figured out, is that it's nicely structured.  It's used to complement Wikipedia with additional content.

Open IE (web extraction):
The problem is that the web is huge and very unstructured - I think it's the hardest corpus ever to be tackled.  These systems can, for example, learn a model of how relations are expressed based on features like part-of-speech tags, domain-independent regular expressions and so on.

The new method for IE:
The authors analysed 500 randomly selected sentences from a training corpus.
They found that most relationships could be characterized by a set of relation-independent patterns. 

TextRunner extracts high-quality information from sentences and learns the relations (you can actually test it), classes and entities from the corpus using its relation-independent extraction model.  You will find more references to Markov models here, and also find out how it trains a conditional random field.  The sentences are processed linearly, and it extracts the triples that it thinks are important.  The language on the web is very ambiguous, though, which makes it notoriously difficult to deal with.  I think it's important to say that TextRunner uses Lucene (a very good open-source search library - many of us owe a lot to it).
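
To make the idea of extracting (arg1, relation, arg2) triples concrete, here's a deliberately naive sketch using a single relation-independent pattern.  Real systems like TextRunner use part-of-speech features and a trained model rather than a regex, so treat this purely as an illustration:

```python
# A deliberately naive sketch of extracting (arg1, relation, arg2) triples
# with one relation-independent pattern; real Open IE systems such as
# TextRunner use POS features and a trained model (a CRF), not a regex.
import re

# Capitalised phrase, a short lowercase span, another capitalised phrase.
PATTERN = re.compile(
    r"([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*)\s+"   # arg1
    r"([a-z]+(?: [a-z]+){0,2})\s+"               # relation phrase
    r"([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*)"      # arg2
)

def extract_triples(sentence):
    return [m.groups() for m in PATTERN.finditer(sentence)]

print(extract_triples("Google acquired YouTube in 2006."))
# [('Google', 'acquired', 'YouTube')]
print(extract_triples("Tim Berners Lee invented the World Wide Web"))
# [('Tim Berners Lee', 'invented the', 'World Wide Web')]
```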

They tested Open IE in collaboration with Google and found that it greatly increased precision and recall.

It can be used for IR tasks of course, but also opinion mining, product feature extraction, Q&A, fact checking, and loads of others.

Further research is already being carried out in which the system is able to reason based on facts and generalizations.  They will use ontologies and knowledge bases like WordNet (good 'ol WN), Cyc, Freebase and OpenMind.

See - web 3.0, the web of machine reasoning and information extraction, is very, very real.

December 09, 2008

LSI - No more!

With the help of some very cool Tweeters, I found some interesting facts about LSI and SEO.  They are @dpn and @Mendicott.

For a simple idea of what LSI/A is, please read the Wikipedia entry on it.  The original paper is here.

LSI was patented in 1988 by Scott Deerwester (doing humanitarian work now), Susan Dumais (HCI/IR @ Microsoft), George Furnas (HCI @ Uni Michigan), Richard Harshman (Psychologist @ Uni Western Ontario), Thomas Landauer (Psychologist @ Uni Colorado/Pearson), Karen Lochbaum (where did she go?) and Lynn Streeter (Knowledge technologist @ Pearson).

We will look at Susan Dumais here because she's actively publishing:

Unsurprisingly, all her recent research is in HCI and personalisation, just like Google, and Microsoft and... well, everyone:

"The Web changes everything: Understanding the dynamics of Web content". (WSDM 2009)

"The Influence of Caption Features on Clickthrough Patterns in Web Search" (SIGIR 08)

"To Personalize or Not to Personalize:Modeling Queries with Variation in User Intent" (SIGIR 08)

"Supporting searchers in searching". (ACL keynote 08)

"Large scale analysis of Web revisitation patterns" (CHI 08)

"Here or There: Preference judgments for relevance". (ECIR 08)

"The potential value of personalizing search". (SIGIR 07)

"Information Retrieval In Context" (IUI 07)

Hmm... no LSI here.

LSI papers since its introduction:

"Adaptive Label-Driven Scaling for Latent Semantic Indexing" -Quan/Chen/Luo/Xiong (USTC/Reutgers) => exploiting category labels to extend LSI (SIGIR 08)

"Model-Averaged Latent Semantic Indexing"- Efron => Extended with Akaike information criterion (SIGIR 07)

"MultiLabel Informed Latent Semantic Indexing"- Yu/Tresp => using the multi-label informed latent semantic indexing (MLSI) algorithm (SIGIR 05)

"Polynomial Filtering in Latent Semantic Indexing for Information Retrieval"- Kokiopouplou/Saad => LSI based on polynomial filtering (SIGIR 04)

"Unitary Operators for Fast Latent Semantic Indexing (FLSI)" - Hoenkamp => introduces alternatives to SVD that use far fewer resources, yet preserve the advantages of LSI.(SIGIR o1)

"A Similarity-based Probability Model for Latent Semantic Indexing" - Ding => checks the statistical significance of the semantic dimensions (SIGIR 99)

"Probabilistic Latent Semantic Indexing" - Hofmann => "In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model" - (SIGIR 99)

"A Semidiscrete Matrix Decomposition for Latent Semantic Indexing in Information Retrieval" Kolda/O'Leary => Replacing low-Rank approximation with truncated SVD approximation (ACM 1998)

Well...

The initial theory of LSI and its methodology have been extended a great deal over the years.  The basic LSI method is important as it's a great way to introduce topic detection and such things.  There is a lot more to build on from there though.

There are so many more; some other methods are the Generalized Hebbian Algorithm, partial least squares analysis, Latent Dirichlet Allocation...

@Mendicott reports that "SEO" first appeared in Google in 1998.  "Search engine optimisation + latent semantic indexing" appeared in 2005.

@dpn quite rightly says that "SVD on huge datasets is BS".

It appears to me that the LSI that the SEO community refers to is in fact the base model which has been extended and changed and improved quite a bit since 1988.  This is quite expected, and therefore when you say "Oh I'm using LSI", you would be asked which method or if you've extended it yourself etc...

Currently the focus on keywords, which is what LSI uses, isn't quite right anymore.  I've seen a lot of recent research (and so have many of you) talking about semantics.  There is a lot of work on using semantic units, which are not always keywords anyway.

The question should be "What multitudes of methods is Google using?" and "I wonder which LSI method is being used, although I know it is just one factor in a very very large system".  Not "How should I optimise my site for LSI" - I'd ask you which type.  I believe that Matt Cutts said something very generic when he said Google used LSI :)

The importance of Datamining

Data mining is also called knowledge discovery and data mining (KDD). 
 
Data mining is the extraction of useful patterns and relationships from data sources, such as databases, texts, the web… It has nothing to do however with SQL, OLAP, data warehousing or any of that kind of thing.  It uses statistical and pattern matching techniques.  Data mining does borrow from statistics, machine learning, databases, information retrieval, data visualization and other fields.
 
Many areas of science, business, and other environments deal with a vast amount of data, which needs to be turned into something meaningful, knowledge.  Many website owners and SEO professionals use different statistical packages to make sense of their data, as do many other professionals.  Data mining is often overlooked when in fact it can provide very interesting information that statistical methods are unable to produce or produce properly.  These data mining methods give you a lot more control.
 
The data we have is often vast, and noisy, meaning that it’s imprecise and the data structure is complex.   This is where a purely statistical technique would not succeed, so data mining is a solution. 
 
The issues in data mining are noisy data, missing values, static data, sparse data, dynamic data, relevance, interestingness, heterogeneity, algorithm efficiency, size and complexity of data.  These types of problems often occur in large amounts of data.
 
The process for datamining is the following:
  1. Identify data sources and select target data
  2. Pre-process: cleaning, attribute selection
  3. Data mining to extract patterns or models
  4. Post-process: identifying interesting or useful patterns
 
Patterns must be: valid, novel, potentially useful, and understandable. 
 
A number of different rules are used:
  • Association rules: these identify collections of attributes that are statistically related in the data, for example X => Y where X and Y are disjoint conjunctions of attribute-value pairs (a small sketch follows this list).
  • Classification is where we classify future data into known classes.
  • Clustering is where we identify similarity groups in the data.
  • Sequential pattern mining is where we analyze collections of related records and detect frequently occurring patterns over a period of time.  A tool called SPAM is available for this.
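
As promised, here is a small sketch of the association-rule idea: support and confidence for a rule X => Y over records of attribute-value pairs, computed on invented toy data:

```python
# Tiny sketch of association rules: support and confidence for X => Y over
# records of attribute-value pairs (the data below is made up).
records = [
    {("browser", "firefox"), ("bought", "yes")},
    {("browser", "firefox"), ("bought", "yes")},
    {("browser", "ie"), ("bought", "no")},
    {("browser", "firefox"), ("bought", "no")},
]

def support(itemset, records):
    return sum(itemset <= r for r in records) / len(records)

def confidence(x, y, records):
    return support(x | y, records) / support(x, records)

x = {("browser", "firefox")}
y = {("bought", "yes")}
print(f"support(X u Y) = {support(x | y, records):.2f}")       # 0.50
print(f"confidence(X => Y) = {confidence(x, y, records):.2f}")  # 0.67
```
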
Models are used for datamining, such as:
  • Decision trees are collections of rules mapped out in the form of tree branches leading to target values or classes.  A commonly used algorithm for building decision trees is C4.5.  These are simple, but they’re limited to one attribute per output.
  • Rule induction is where rules about the data are induced.  This method gives values in the dataset so it is possible to see where there is a concentration of association factors.
  • Regression models are a number of mathematical equations which show the potential associations between things.
  • Neural networks are statistical programs which classify data sets by grouping things together in a way similar to the brain.
The hardest to understand are the neural networks, the easiest the decision trees.
 
Many interesting things you want to find cannot be found using database queries, such as finding out at what time of the day most of your stock is sold, or finding out what people thought about your new product.
 
Datamining is widely used in marketing, bioinformatics, fraud detection, text analysis, fault detection, market segmentation, interactive marketing, trend analysis…
 
A few resources:

There’s a Microsoft tutorial about data mining which you can use.
KDnuggets has a wealth of information.
 
Tools:
Himalaya DM tools (SourceForge project)
Gnome data mining package
Weka data mining tool in Java
DevShed data mining with Perl
 
Commercial packages:
For a full list of commercial tools, check out the KDnuggets site.

December 08, 2008

Semantic method for keyword research

The paper "Keyword Generation for Search Engine Advertising using Semantic Similarity between Terms" by Vibhanshu Abhishek, Kartik Hosanagar (The Wharton School Philadelphia), was presented at ICEC’07.  That conference will be of particular interest to online marketing professionals.

"This paper mathematically formulates the problem of using many keywords in place of a few.A method is proposed that can be used by an advertiser to generate relevant keywords given his website. In order to find relevant terms for a query term semantic similarity between terms in this dictionary is established. A kernel based method developed by Shami and Heilman is used to calculate this relevance score. The similarity graph thus generated is traversed by a watershed algorithm that explores the neighborhood and generates suggestions for a seed keyword."

Their initial equations show a trade-off between the number of terms and the total cost.  Relevant keywords are important because conversion rates will be higher.

They focus on a new technique for generating a large number of keywords that might be relatively cheaper compared to the seed keyword.  There's not been much work done in keyword generation, but a related area of interest is query expansion.

Different ways to generate keywords are: query log (used by search engines) and advertiser log mining, proximity searches and meta-tag crawlers (WordTracker).

Search engines work by finding the co-occurrence relationships between terms; similar terms are then suggested.  The AdWords tool also uses past queries that contain the search terms.  Advertisers' logs are also taken into account.

Most 3rd party tools use proximity, and this does produce a lot of keywords; however, relevant keywords that don't contain the original terms won't appear.

These tools and methods don't consider semantic relationships.  They address this issue with their new system "Wordy":

"We make an assumption that the cost of a keyword is a function of its frequency, i.e., commonly occurring terms are more expensive than in frequent ones. Keeping this assumption in mind a novel watershed algorithm is proposed. This helps in generating keywords that are less frequent than the query keyword and possibly cheaper."

You can easily add new terms to the system, and it automatically establishes links between them and the others.

It generates keywords starting from a website, establishes semantic similarity between them, and suggests a large set that might be cheaper than the query word.  The dictionary they use is generated from the set of documents (the corpus): tf-idf is computed for all the words in the corpus and the top tf-idf-weighted keywords are chosen.  Each word in this dictionary is then sent as a query to a search engine, and the top documents retrieved for each query (already pre-processed) are added to the corpus as well.  A final dictionary is eventually created, and this is the finished list of suggested keywords.

They use the Sahami/Heilman technique for semantic distance computation, where each snippet is used to retrieve documents.  These are then used to form a context vector listing the terms occurring in the documents.  The vectors are compared using a dot product to find similarities between the snippets - they used the method to find semantic similarity (Sahami/Heilman used it to suggest additional queries).

"Cheaper keywords can be found by finding terms that are semantically similar but have lower frequency. A watershed algorithm is run from the keyword k to and such keywords. The search starts from the node representing k and does a breadth first search on all its neighbors such that only nodes that have a lower frequency are visited. The search proceeds till t suggestions have been generated. It is also assumed that similarity has a transitive relationship."

You can obviously choose to ignore the cheaper keyword results and just see similar ones.

They found that a bigger corpus improves the quality of the suggestions, and that relevance is improved by increasing the number of documents retrieved while creating the dictionary as well as while computing the context vector.  Basically, it worked.

If you want to see a working system let me know and I'll see what I can do.

Test it against the Google keyword suggestion tool.  Wordy found:

Pedicure: 
manicure-leg-feet-nails-treatment-skincare-tool-smilesbaltimore-massage-facial

Skin:
skincare-facial-treatment-face-care-ocitane-product-exfoliator-dermal-body

What do you reckon?  Good or bad?


The impact of SEO on the online advertising market

This paper, written by Bo Xing and Zhangxi Lin from Texas Tech University in 2006, discusses the impact of SEO online. The study is conducted in an analytical way, using a number of good resources, but has at times a simplistic view of the SEO effort. SEOs are considered to be of a "parasitic nature", hindering the good functioning of search engines and cheating the user. These are not new accusations; the community has faced them on a regular basis. Nonetheless the paper is interesting and opens up discussion on this little-researched topic (academically). Algorithm robustness is briefly discussed, basically saying that the better the search engine, the harder SEO becomes.

It's a shame that there was no follow-up to this for 2008 really, but their paper is still interesting.  I'd blogged about this on my last blog, so it's old work brought back to the forefront, if you like.

Here are a few excerpts:

"This study aims to analyze the condition under which SEO exist and further, its impact on the advertising market. With an analytical model, several interesting insights are generated. The results of the study fill the gap of SEO in academic research and help managers in online advertising make informed advertising decisions".

"Recently, SEO is gaining momentum primarily for two reasons. First, CPC has increased tremendously over years. According to a Fathom Online report, keyword cost has risen 19% in one year since September 2004[8]. Second, it has been realized that organic results are more appealing to searchers because these results are considered more objective and unbiased than sponsored results. According to an online survey by Georgia Tech University[10], over 70% of the search engine users prefer clicking organic results to sponsored results. The SEMPO survey[17] concurs with this finding, showing that organic listings are chosen first by 70% of the people viewing search results, while sponsored listings receive about 24.6% of clicks".

"No SEO firm knows the ranking algorithm of the search engine, and therefore, SEO practice only improves the chance of ranking improvement, rather than guarantees top ranking. Given an advertiser and advertising requirement, algorithm robustness denotes the effectiveness of SEO with the search engine".

"The net payoff for higher type advertisers using paid placement decreases because the marginal cost from CPC does not keep up with the marginal benefit from advertising. On the contrary, in the case of SEO, the marginal benefit increases due to the constant SEO fee. The practical implication is that search engines could increase its profit by adopting period-based pricing policy, rather than CPC, for higher-type advertisers".

"The sustainability of SEO firms also depends on s, the proportion of sponsored results returned, and h, the algorithm robustness. Intuitively, decreasing the proportion of organic results could pose threat to SEO firms".

"More importantly, a search engine is potentially subject to “freeriding” effect from SEO firms, because of the parasitic nature of these firms. As the search engine invest in algorithm effectiveness improvement, SEO firms may also benefit from this investment. In order to reap a fuller benefit from investment, the search engine has the incentive to improve its algorithm robustness at the same time".

"First, a search engine could optimize its pricing policies for higher-type advertisers to reap higher profit. Second, investment in algorithm robustness has the effect of protecting the investment in algorithm effectiveness. Third, the second market position endows the follower additional benefits due to low sustainability of SEO firms".

There are a number of juicy equations in this study for you to ponder over if you can get hold of it. Overall I find it to be quite right in some respects, but in others I think the view of SEO is fairly limited; the authors don't seem to have a totally realistic grasp of the industry, although they are quite thorough in the way they advance their views.

December 07, 2008

Hot topics in comp sci vs SEO

To see if there was a correlation between hot topics in SEO and hot topics in IR, I've listed the top 10 in each, in no particular order.  I may have forgotten some in SEO because that space is not as ordered as the comp sci one.

SEO popular topics:

 - How to get more traffic to your blog
 - How to use LinkedIn/Twitter/etc to increase your traffic
 - NoIndex /No Follow
 - SearchWiki
 - Guides and ebooks / tutorials on all manner of marketing things
 - x number of easy seo strategies
 - Wordpress themes
 - Link building strategies
 - Social media evolution etc
 - Tools you might be missing etc
 
 Computer science (IR, NLP) popular topics:
 
 - Classifier systems
 - Recommender systems
 - Personalisation
 - Ranking both in SE and SN
 - Information retrieval & browsing
 - Q&A systems
 - System evaluation
 - Information extraction
 - Digital libraries
 - Interfaces and HCI

Not much correlation there, sadly.  There's evidence of attention to links in both though, and also to personalisation.  I guess this means that neither is really picking things up from the other.  It's a gap that needs to be bridged.
 

10 free papers: semantic relatedness of words

There's a lot of buzz about keywords and their semantic relatedness recently, so I thought I'd volunteer some good papers, freely available via CiteSeer, to widen or extend the conversation.  The list is obviously by no means exhaustive.

Non-computer scientists, don't be afraid of the big complicated equations and stuff, if you get the gist of it that's perfectly cool - you can pick these up with practice, and remember, it's not rocket science, it's computer science :)

Here they are:



A WordNet based rule generalization engine for meaning extraction   by Joyce Yue Chai, Alan W. Biermann — 1997 — Tenth International Symposium On Methodologies For Intelligent Systems

Finding Semantically Related Words in Large Corpora by Pavel Smrž, Pavel Rychlý (FI MU)

Syntactic contexts for finding semantically related words by Lonneke Van Der Plas, Gosse Bouma — In CLIN

 Making senses: Bootstrapping sense-tagged lists of semantically related words by Nancy Ide — 2006 — Computational Linguistics and Intelligent Text Processing. Lecture notes in Computer Science 3878

Mapping syntactic dependencies onto semantic relations  by Pablo Gamallo, Marco Gonzalez, Alexandre Agustini, Gabriel Lopes, Vera S. De Lima — 2002 — ECAI Workshop on Machine Learning and Natural Language Processing for Ontology Engineering

Contextual word similarity and estimation from sparse data (1993) by Ido Dagan In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics

Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing (2005) by L Shi, R Mihalcea in Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics

Exploring the Potential of Semantic Relatedness in Information Retrieval (2006) by Christof Müller, Iryna Gurevych — In Proc. of LWA 2006 Lernen - Wissensentdeckung - Adaptivität: Information Retrieval

The Googleplex: serious issues?

Piotr Cofta (BT Plc) wrote a very interesting paper for the 10th Int. Conf. on Electronic Commerce (ICEC) ’08 Innsbruck, Austria.

It questions the Googleplex as a whole rather than just "Google", and honestly raises some serious issues with it.  It has an awful lot of power and we have placed an awful lot of trust in it.  To an extent, it relies on this trust to be successful and function.

Here are a few main points:

Google try to monitor everything they can on the Internet to gain as much user data as possible, thus monitoring behaviour.  It's important for them to have a stake in every possible interaction mode, not just search for example.

Google focuses on innovation, so it frantically chases top researchers to develop trends and obviously gives out free tools that are used to test and develop thousands of new ideas at the same time.

PageRank and query logs enable Google to identify trends that are likely to stay.  It's cheap to run as well.  The author reckons that more of this data will be available than is already presented via AdSense.

The trends, unsurprisingly, are used to fuel the advertising market.

People aren't identified during personalisation but computers are.  The fact, as we all know, that you can log in to one tool means that you're signed in to everything.  This way an enormous amount of user behaviour data can be captured.

There are endless opportunities for new applications, but people search is super important.  Google are tracking people who share similarities, habits and customs.  The author qualifies the task as "mathematically trivial", but says that it is hugely important for us all.

The author also says that "the Googleplex is not malicious in itself."  It's a business.  They have a huge amount of power and we need to see if they eventually abuse it of not, despite the "Do no evil" strap line.  Interestingly he asks if the Googleplex will be compromised by others.  He asks 3 questions: whether the Googleplex can be harmful to individuals, society and social values.  People like to develop trust and confidence for organisations like this, this is dangerous to a caertain extent.

He says the the "Crude PageRank value" can be used (as well all know) as the strength socre of a site and also it's reputation score.  In a way it must be said, I think, that assigning a numerical value to something is indeed giving something away to the users so that they can feel more satisfied and confident in the organisation.  Even if the numbers don't really add up on purpose, it still fulfills its function, as far as the score delivered to the users is concerned.  SEO peeps did use it as a measure of success once, and held it as very important.  Evidence of this can be seen in the mountains of blogs talking about it.

He says, in a rather different way than I'll put it here, that we put trust in the results because we don't know how the whole thing works.  I've already said that we cannot know exactly whether the results we are being served up are the exact right ones for us, or just the best ones they can come up with.  Maybe the perfect documents for your needs don't show.  People have such confidence that they usually defend Google passionately when this issue is raised.  From a scientific perspective though, it is very natural to consider that the results might not be the best.  One research project did show, for example, that Google didn't perform well at all in comparison to an actual human expert ranking.

Do a very, very simplistic test: choose your expert area (a very specific one) and rank the most important documents you would give someone on this topic.  Then check Google and see what you find.  More complex tests will of course yield more exact results.

He makes a good point about the "Do no evil thing" by saying that it can't possibly do evil or good because it's fully automated.  When there have been mistakes and when people have sporadically written on blog or blog comments about how this can be questioned, the idea dies down quite quickly, because we love and trust Google. It's not their fault that nasty ads get served up or that an update goes painfully wrong, it's the system.

He's right in suggesting that maybe all the free stuff we get does come at quite a high price. 

I urge you to read the paper in its entirety, and to do that you'll have to get ACM Digital library access, which is well worth the purchase.  It is in my opinion for both marketeers and computer scientists alike a very necessary professional tool.

December 05, 2008

TGIF - Rock on!

Welcome to yet another installment of TGIF.  I hope you all got some fresh air this week and didn't stay cooped up with your laptops...like I mostly did.  I bet you're all looking forward to xmas, and have the dreaded/much anticipated office xmas party to look forward to.  Enjoy yourselves this weekend, and don't get too caught up in shopping :)

Without further ado...

I have enjoyed wasting valuable time playing Pac-Man (I hope my advisor doesn't read this).

Pac-Man was released as an arcade game in 1980 in Japan (Pakku-man).  "Perfect Pac-Man" occurs when you get the maximum points for each of the first 255 levels.  Billy Mitchell did this in 1999  with a score of 3,333,360 points and it took him 6hrs.  

"I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won't last out the year." 
– The editor in charge of business books for Prentice Hall, 1957. 

Who on earth did he speak to?  Maybe he should have gone further afield too.

"Computers in the future may weigh no more than 1.5 tons." 
– Popular Mechanics, forecasting the relentless march of science, 1949.

I actually think this is right if you add "near" to "future".

Facts:

A chip of silicon a quarter-inch square has the capacity of the original 1946 ENIAC computer, which occupied a city block.

The first VCR, made in 1956, was the size of a piano.

Limelight was how we lit the stage before electric lighting was invented. Basically, illumination was produced by heating blocks of lime until they glowed.

Enjoy this brilliant video and have an excellent weekend.



CredibleRank

I thought I'd share "Countering Web Spam with Credibility-Based Link Analysis" by James Caverlee (Texas A&M University) and Ling Liu (Georgia Institute of Technology) at PODC'07 today.

PageRank, TrustRank and HITS all couple link credibility and page quality, which isn't ideal because good links don't necessarily mean that a page is a quality one.  I think page authority and quality are very important areas of research right now.

So, these guys used credibility-based link analysis and called it "CredibleRank".  The credibility of information is directly used in the quality assessment of each page.  It proves to be far more spam-resilient than both PageRank and TrustRank.  These two algorithms rely on the assumption that the quality of a page and the quality of a page’s links correlate, which unfortunately leaves them open to spam.

CredibleRank incorporates credibility information directly into the quality assessment of each page on the Web.  

They found that a page’s link quality should depend on its own outlinks and that it is related to the quality of the outlinks of its neighbours.  So they use the local characteristics of pages and their place in the Web graph, as opposed to the global properties of the entire Web that the other algorithms use.

Relying on a whitelist (a set of known good pages) isn't very useful because spammers can camouflage their rubbish outlinks to spam pages by also linking to known whitelist pages.  They advocate the use of a blacklist (known spam pages) instead, where a page's credibility is reduced according to its proximity to spam pages, so pages are penalised for low-quality outlinks.

"First, the initial score distribution for the iterative PageRank calculation (which is typically taken to be a uniform distribution) can be seeded to favor high credibility pages. While this modification may impact the convergence rate of PageRank, it has no impact on ranking quality since the iterative calculation will converge to a single final PageRank vector regardless of the initial score distribution." 

They found that CredibleRank does not negatively impact good sites: they compared the ranking of each whitelist site under PageRank against its ranking under CredibleRank, and the fluctuation was only 26 spots, so clean sites aren't being unfairly treated.

So far it proves to be spam-resilient and efficient, and it outperforms both TrustRank and PageRank.  Excellent stuff.
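
For the curious, here is a toy sketch of the general idea in Python.  It is not the authors' exact formulation (their credibility function uses proximity to blacklisted pages across the whole graph, and they handle normalisation properly); this just shows credibility derived from a blacklist scaling the votes a page passes on in a PageRank-style iteration, over a made-up five-page web:

    # Toy credibility-weighted ranking - illustration only, not the paper's algorithm.
    graph = {                       # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "spam1": ["spam2", "a"],
        "spam2": ["spam1", "a"],
    }
    blacklist = {"spam1", "spam2"}  # known spam pages

    def credibility(page):
        """1.0 if no outlinks hit the blacklist, dropping towards 0 as more do."""
        if page in blacklist:
            return 0.0              # spam pages get no credibility at all
        out = graph.get(page, [])
        if not out:
            return 1.0
        bad = sum(1 for q in out if q in blacklist)
        return 1.0 - bad / float(len(out))

    def credible_rank(damping=0.85, iterations=50):
        pages = list(graph)
        n = len(pages)
        rank = dict((p, 1.0 / n) for p in pages)
        cred = dict((p, credibility(p)) for p in pages)
        for _ in range(iterations):
            new = dict((p, (1.0 - damping) / n) for p in pages)
            for p in pages:
                out = graph.get(p, [])
                if not out:
                    continue
                # a page's vote is scaled by its credibility before being shared out
                share = damping * rank[p] * cred[p] / len(out)
                for q in out:
                    new[q] += share
            rank = new
        return rank

    for page, score in sorted(credible_rank().items(), key=lambda kv: -kv[1]):
        print("%-6s %.4f" % (page, score))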

December 04, 2008

Doing a PhD - ouch and wow

This won't be relevant to everyone but I know some computing students read this blog so for their sake I'm embedding this very good presentation on "The art of doing a PhD".

My synonyms for PhD: hard, lonely, exciting, humbling, exalting, confusing, interesting.  And most of the time you have no idea what you're doing, but Einstein said it wasn't research if you did, so I guess that's ok.

I'm glad to be at the end of it now, and I have to say that any work experience you can get during your PhD is very valuable.  Coming out of University at 30 with no work experience won't impress anyone.  Show you can walk the talk.

Internet 2 - in research since 1996

"Internet Evolution" (sponsored by IBM) have a very cool and information (and thorough) article on Internet 2.  We talk a lot about web 2.0/3.0 but rarely about Internet 1.0/2.0.  We need the Internet to work properly otherwise there is no www.  The Internet is as Roger Smith puts it "running out of gas", because of all the new devices that need to be supported.  

It's a long article so as usual in these cases, I'm going to give you the main points:

  • The Internet 2 project was founded in 1996 by 34 researchers in a hotel
  • It's run by the University Corporation for Advanced Internet Development
  • It uses protocols like IPv6 to access a vastly larger address space, along with middleware and security capabilities, and supports things like high-definition videoconferencing.
  • It doesn't just mean a faster web; it allows networking that hasn't been possible so far, supporting things such as digital libraries, virtual laboratories, distance and independent learning, and health applications.
  • CERN's Large Hadron Collider, for example, will benefit through more efficient testing.  It will produce about 2 terabytes of data every 4 hours, every 2 weeks, and lots of Internet 2 researchers will participate in this research.
  • Live video is another area that will benefit: "Live video is encoded into DVD-quality MPEG-2 and sent at an average rate of 6 Mbit/s to the University of California, Santa Cruz. It then travels over Internet2 networks to the University of Connecticut and to the Mystic Aquarium."
  • The Internet 2 K20 Initiative extends Internet2 to all levels of education and has led to Muse, a social network for librarians, students and researchers to connect on.
  • Internet2’s DCN initiative, which provides loads of bandwidth on demand, and cloud computing go hand in hand.
  • "Although these initiatives would most likely affect commercial services such as television delivery, not academic research, Internet2 opposes tiered schemes because they would allow network operators to restrict Internet users or applications in order to give an advantage to their own services."
  • It could spur on economic growth more than the current Internet ever has.
We should all be keeping track of Internet 2 and taking notice, whether we're Internet professionals, researchers, scientists or the general public.  Read more about it at Internet 2, and check out Wikipedia for a low-down.

Systers Microsoft meet

If you fancy popping along to the Microsoft professional developer conferences, they have added a women's networking event called WomenBuild.  Hilary Pike will be hosting it.

You can also go to the Lego SeriousPlay WomenBuild workshop at the MSDN Developer Conferences!  It combines personal interaction and networking with the use of LEGO(R) Bricks as a conceptual modeling tool, part of the LEGO(R) Serious Play Program (LSP).

Register for WomenBuild using this code: MDCWIT

You can also join the WomensBuild Facebook group.

There's a video by Laura Foy that tells you all about WomensBuild.

Houston: 12/9
Orlando: 12/11
Atlanta: 12/16
Chicago: 1/13
Minneapolis: 1/13
Washington DC: 1/16
New York: 1/20
Boston: 1/22
Detroit: 1/22
Dallas: 1/26
San Fran: 2/19

Now now, hop along and register if you're fortunate enough to be in the region.

December 03, 2008

The semantic web is not research as usual

Frank van Harmelen (Vrije Universiteit Amsterdam) gave a nice lecture called "Where Does It Break? Or: Why the Semantic Web is Not Just 'Research as Usual'".  I think there is a lot of confusion about what the semantic web is and how complicated the whole thing is.  One sign of that is people still referring to it as Web 3.0, which it isn't.  Here this rather clued-up researcher tells us a bit more about it.

Lots of people talk about how the semantic web affects search engines, personalisation, data structures and so on.  This suggests that all the components are readily available, but this is not so.  It actually forces us to question technologies we already use.  His work is in "knowledge representation", and the semantic web has forced researchers in that field to re-evaluate things.  He says that other fields are seeing this too.

Some high level main points in brief:

  • Semantic web 1 is the web of data
  • Semantic web 2 is the web of "Enrichment of the current web"
  • Both use different techniques and target different users
  • The semantic web means better search and browsing, better personalisation and interlinking, and links created on the fly from the visitor's profile... these are the technological aspects.
  • Decidability, undecidability and complexity measures, for example, are the scientific aspects.
  • And context as always is an issue.
  • We need to combine logic and statistics, which are 2 different fields of computing really.  We need to talk to physicists, AI buffs, loads of people who don't necessarily feel involved right now.
For loads more please view the 58min lecture, which we are fortunate to have access to via Videolectures.net.  They have a huge collection of lectures not only about every research area in computing but also in other fields.

Warning: big equations :)


Making Twitter bots

As some companies have figured out, there is a lot of information to be gleaned from Twitter that can benefit their business.  They can manage reputation and customer service, for example.  There are, however, other reasons to collect information that have nothing to do with marketing; I do it for research, for example, studying the language used, behavioural patterns and that sort of thing.

You may also want to collect information for your own purposes and for this you can write your own specialised Twitter bot from scratch or use a library provided by kind souls.

Google provide some help in the form of a ready-made infrastructure.  Using this, you only need to code the logic rather than building the whole thing yourself.  It interacts with the Twitter API while respecting the terms of service, and manages the enrollment of followers.  The Twitter account acts as the bot.  It's built on the .NET 2.0 framework.

There are other bot code resources available in Google code, here are some nice ones:

Python-Twitter - python wrapper for the Twitter API: allows people to connect via the web, IM, and SMS.

TwitterBroadcastBot - in Ruby - broadcasts a message whenever a friend update contains a certain word.  Good for info flagging.

ContactsNearby - ASP.NET & InSTEDD - allows you to find out the geographical location of your friends on Twitter.  Plugs into Facebook too.

MadCow - python - has a whole host of features for you to use like tracking bookmarks, ip lookup, and silly things like ASCII art and loads more.  

Nathan at Flowing data has a nice tutorial on how to build your own Twitter bot, take a look - he also provides some code to get you started.  

You can automate a lot of things using cron.  You can use UNWIN if you're on Windows rather than Unix.
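
As a rough idea of how little code the library route takes, here's a minimal sketch using the Python-Twitter wrapper mentioned above, in the spirit of TwitterBroadcastBot's keyword flagging.  The account details and keyword are placeholders, and the basic-auth constructor shown is what the library used at the time of writing, so check the current docs before copying it:

    # Minimal keyword-flagging bot using the python-twitter wrapper (illustrative only).
    import twitter

    api = twitter.Api(username="your_account", password="your_password")  # placeholders

    KEYWORD = "semantic web"   # flag any friend update containing this phrase

    for status in api.GetFriendsTimeline():
        if KEYWORD.lower() in status.text.lower():
            print("%s: %s" % (status.user.screen_name, status.text))

    # To automate it, add a cron entry (every 15 minutes in this example):
    #   */15 * * * * /usr/bin/python /home/me/twitter_bot.py >> /home/me/bot.log 2>&1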

December 02, 2008

Friend-of-a-friend (FOAF)

The FOAF project is all about building a machine-readable web that describes people, the links between them, their interests, the things they create and do, and much more.  It means that you can share and interconnect information from lots of different sources.  It is an experimental project.

It is a highly descriptive language that uses RDF/XML.  It has been dubbed "the first social semantic web application", and was started in 2000 by Libby Miller and Dan Brickley.  It's automatically generated by some social networks and blog platforms.  OpenID is quite an opportunity for FOAF.

"FOAF defines an open, decentralized technology for connecting social Web sites, and the people they describe."

By using FOAF you can let machines know about your website, and through this they can learn about connections between data and people, amongst other things.  FOAF means that we can find documents based on properties and interrelationships, find people based on different variables and features, and share annotations, ratings and bookmarks, for example.  The main idea is to treat the web like a database, keeping it "neutral, decentralized and content-neutral".

You could for example ask for information from anyone working at Adobe about the recent software update, or ask for a list of documents related to the ones you've used, and so on...

Have a play with the FOAF Explorer where you can explore neighbourhoods.  

You can get involved and use FOAF-a-matic to create a FOAF file about yourself.  Once you've generated the code you can chuck it in a publicly accessible file on your website (foaf.rdf).  Then Google can find your FOAF file.
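
If you'd rather script it than use the form, here's a hedged sketch using Python and the rdflib library; the names and URLs are placeholders, and FOAF-a-matic produces much the same RDF/XML for you:

    # Generate a tiny FOAF file with rdflib - all names and URLs are placeholders.
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    g.bind("foaf", FOAF)

    me = URIRef("http://www.example.com/#me")
    g.add((me, RDF.type, FOAF.Person))
    g.add((me, FOAF.name, Literal("Jane Example")))
    g.add((me, FOAF.homepage, URIRef("http://www.example.com/")))

    friend = URIRef("http://friend.example.org/#them")
    g.add((friend, RDF.type, FOAF.Person))
    g.add((friend, FOAF.name, Literal("A. Friend")))
    g.add((me, FOAF.knows, friend))

    # Write it out as RDF/XML, ready to upload to your site as foaf.rdf
    g.serialize(destination="foaf.rdf", format="xml")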

For those of you who want more in depth involvement, check out the full FOAF specs here, and you can download the full dataset here.  

Get involved in the semantic web! Don't just talk about it or ignore it.

December 01, 2008

Google tech talk on the semantic web

This one is by Professor Abraham Bernstein.  It's very interesting: it covers briefly what the semantic web is, but is mostly about the various techniques used, such as SPARQL, Querix, Ginseng, OWL DL... which are, for the most part, rubbish for humans.  He explores how to make the semantic web accessible to the general public.

One of the solutions involves natural language queries (yey!) - but it's a "complete mess" at the moment, being ambiguous, domain-specific, etc... BUT he did find, as I did and as Jimmy Lin did, that users prefer natural language querying.

In the Google Ninja challenge I set you all, I found that the majority of people are indeed using natural language in Google, and I think this is because of the complexity of the search context.  Watch this space.
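
To see why raw SPARQL is "rubbish for humans", here's a hedged sketch of what a simple question looks like as a query, run against the public DBpedia endpoint using the SPARQLWrapper library (the endpoint, property names and the query itself are illustrative); compare it with just typing "films directed by Ridley Scott" into a natural language interface:

    # Ask DBpedia for films directed by Ridley Scott - illustrative query only.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?film ?label WHERE {
            ?film dbo:director <http://dbpedia.org/resource/Ridley_Scott> .
            ?film rdfs:label ?label .
            FILTER (lang(?label) = "en")
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for result in sparql.query().convert()["results"]["bindings"]:
        print(result["label"]["value"])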





Why writing a search engine is hard

Anna Patterson, research associate in the Formal Reasoning Group at Stanford, ex-Googler, and head lady at the Cuil search engine, explains why writing a search engine is hard over at ACM Queue.

Some main points:
  • Good search engines have never been built by a big group, but by teams of 1 to 4.
  • You need a lot of disks.  The indices are so big that you have to merge them and they will never fit on a single machine.
  • You need to design a ranking algorithm
  • CPU doesn't matter - you need as much bandwidth as you can afford
  • The bugs you write will slow you down more than the cheap CPUs
  • SCSI is faster, but IDE is bigger and cheaper
  • For indexing, use one big file to minimize disk seeks, which will otherwise slow you down no end - you cannot afford the time to seek to a separate file for every Web page you process.
  • Use real distributed systems, not a Network file system (NFS)
  • Write a very simple crawler.  "For instance, (dolist (y list of URLs) GET y) is essentially all you need."  Use sort | uniq on Linux to find duplicates.  This is of course a very simplistic way of handling the crawler and the duplicates issue, but it means you can get up and running quickly (see the sketch after this list).  The other option is to use an open-source crawler.
  • One false step in the indexing and processing and it will take too long.  To keep it simple, just index on words.  Indexing is a really complex area of information retrieval research.
  • Keep a disk-based index architecture - you're not getting lots of traffic right now
  • Don't use PageRank - "Use the source, Luke—the HTML source, that is."
  • "At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them. If it sounds easy, then you haven't written a search engine".
  • "The fastest thing to do at runtime is pre-rank and then sort according to the pre-rank part of your indexing structure."
  • Leave the little indices where they were deposited initially.  This makes the whole thing faster - then gather these little lists into a big list and sort that list for relevancy.  Or gather all results for a particular word into one big list beforehand.
  • Loads and loads of things can go wrong, and you have no room for error or you will be sunk.  
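
Here's the sketch promised above: a toy crawler-and-index in Python in the spirit of the "keep it simple" advice - fetch a few seed URLs, index on words, answer queries from an in-memory dictionary.  A real engine would need disk-based indices, politeness delays, robots.txt handling, duplicate detection and much more, and the seed URLs are placeholders:

    # Toy crawler + inverted index - an illustration of the "keep it simple" advice only.
    import re

    try:                                  # Python 3
        from urllib.request import urlopen
    except ImportError:                   # Python 2
        from urllib2 import urlopen

    seeds = [
        "http://example.com/",
        "http://example.org/",
    ]

    index = {}  # word -> set of URLs containing it

    for url in seeds:
        try:
            html = urlopen(url).read().decode("utf-8", "ignore")
        except Exception:
            continue                      # a real crawler would log and retry
        text = re.sub(r"<[^>]+>", " ", html)              # crude tag stripping
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(url)

    def search(word):
        """Return every crawled URL containing the word - no ranking at all."""
        return sorted(index.get(word.lower(), set()))

    print(search("example"))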

Have fun!

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.