The paper "Keyword Generation for Search Engine Advertising using Semantic Similarity between Terms" by Vibhanshu Abhishek and Kartik Hosanagar (The Wharton School, Philadelphia) was presented at ICEC’07. That conference will be of particular interest to online marketing professionals.
"This paper mathematically formulates the problem of using many keywords in place of a few. A method is proposed that can be used by an advertiser to generate relevant keywords given his website. In order to find relevant terms for a query term semantic similarity between terms in this dictionary is established. A kernel based method developed by Sahami and Heilman is used to calculate this relevance score. The similarity graph thus generated is traversed by a watershed algorithm that explores the neighborhood and generates suggestions for a seed keyword."
Their initial equations show a trade off between the number of terms and the total cost. Relevant keywords are important because conversion rates will be higher.
They focus on a new technique for generating a large number of keywords that might be cheaper than the seed keyword. There hasn't been much work done on keyword generation, but a related area of interest is query expansion.
The existing ways to generate keywords are query log mining (used by search engines), advertiser log mining, proximity searches and meta-tag crawlers (WordTracker).
Search engines look for co-occurrence relationships between terms and then suggest similar terms. The AdWords tool also uses past queries that contain the search terms. Advertiser logs are taken into account as well.
Most 3rd party tools use proximity. This does produce a lot of keywords, but relevant keywords that don't contain the original terms don't appear.
These tools and methods don't consider semantic relationships. The authors address this issue with their new system "Wordy":
"We make an assumption that the cost of a keyword is a function of its frequency, i.e., commonly occurring terms are more expensive than infrequent ones. Keeping this assumption in mind a novel watershed algorithm is proposed. This helps in generating keywords that are less frequent than the query keyword and possibly cheaper."
You can easily add new terms to the system, and it automatically establishes links between them and the existing ones.
It generates keywords starting from a website, establishes semantic similarity between them, and suggests a large set that might be cheaper than the query word. The dictionary they use is generated from a set of documents (the corpus). TF-IDF is computed for all the words in the corpus, and the top TF-IDF weighted keywords are chosen. Each word in that dictionary is then sent to a search engine as a query, and the top documents retrieved for each query (already pre-processed) are added to the corpus as well. A final dictionary is eventually created from this, and it is the finished list of suggested keywords.
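To make the dictionary step a bit more concrete, here's a rough Python sketch of picking the top TF-IDF weighted terms from a small corpus. This is only an illustration and not the authors' code: the function name `build_dictionary`, the `top_n` parameter, the whitespace tokenizer, the sample documents and the exact TF-IDF variant (term frequency times log of inverse document frequency, keeping each term's best score across documents) are all my assumptions. The search engine expansion step is left out.

```python
import math
from collections import Counter

def build_dictionary(docs, top_n=5):
    """Return the top_n terms ranked by their best TF-IDF weight in any document.

    Hypothetical helper: the paper's exact weighting scheme is not given here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    best = {}
    for tokens in tokenized:
        if not tokens:
            continue
        tf = Counter(tokens)
        for term, count in tf.items():
            weight = (count / len(tokens)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), weight)
    # Highest weight first; ties broken alphabetically for stable output.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:top_n]

# Made-up corpus, purely for illustration.
docs = [
    "the cheap flights to paris",
    "the cheap hotels in rome",
    "the flights the flights",
]
dictionary = build_dictionary(docs, top_n=7)
print(dictionary)
```

A word like "the" that appears in every document gets a weight of zero, so it ends up at the bottom of the ranking and drops out of the dictionary.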
They use the Sahami/Heilman technique for semantic distance computation, where each snippet is used as a query to retrieve matching documents. These are then used to form a context vector listing the terms that occur in the documents. The context vectors are compared using a dot product to find similarities between the snippets. They used the method to find semantic similarity (Sahami/Heilman used it to suggest additional queries).
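Here's a minimal sketch of the context vector idea in Python. It is a simplification under my own assumptions: the retrieved documents are hard-coded instead of coming from a search engine, the vectors use raw term counts normalised to unit length, and the names `context_vector`, `similarity` and `retrieved` are made up for the example.

```python
import math
from collections import Counter

def context_vector(documents):
    """Build a unit-length term-count vector from the documents retrieved for one term."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}

def similarity(vec_a, vec_b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    return sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())

# Hypothetical search results for three terms (made-up text).
retrieved = {
    "flights": ["cheap flights to paris", "paris flights deals"],
    "airfare": ["cheap airfare deals to paris"],
    "pizza": ["wood fired pizza oven"],
}
vectors = {term: context_vector(docs) for term, docs in retrieved.items()}
print(similarity(vectors["flights"], vectors["airfare"]))
print(similarity(vectors["flights"], vectors["pizza"]))
```

Because the vectors are normalised, the dot product behaves like a cosine similarity: terms whose retrieved documents share vocabulary score closer to 1, and terms with nothing in common score 0.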
"Cheaper keywords can be found by finding terms that are semantically similar but have lower frequency. A watershed algorithm is run from the keyword k to and such keywords. The search starts from the node representing k and does a breadth first search on all its neighbors such that only nodes that have a lower frequency are visited. The search proceeds till t suggestions have been generated. It is also assumed that similarity has a transitive relationship."
You can obviously choose to ignore the cheaper keyword results and just see similar ones.
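The watershed search described above can be sketched roughly like this in Python. Treat it as my reading of the description and not the paper's implementation: I'm assuming "lower frequency" means lower than the node the search arrived from (so the search only flows downhill), that neighbours are tried in order of similarity, and that the graph is a dict of dicts. The names `watershed_suggestions`, `graph` and `frequency` and all the numbers are invented for the example.

```python
from collections import deque

def watershed_suggestions(graph, frequency, seed, t):
    """Breadth-first search from seed that only moves to lower-frequency neighbours.

    graph: {term: {neighbour: similarity}}, frequency: {term: count}.
    Stops once t suggestions have been collected.
    """
    suggestions = []
    visited = {seed}
    queue = deque([seed])
    while queue and len(suggestions) < t:
        current = queue.popleft()
        # Try the most similar neighbours first (an assumption, not from the paper).
        neighbours = sorted(graph.get(current, {}).items(), key=lambda kv: -kv[1])
        for neighbour, _sim in neighbours:
            if neighbour in visited:
                continue
            if frequency[neighbour] >= frequency[current]:
                continue  # only flow "downhill" to less frequent terms
            visited.add(neighbour)
            suggestions.append(neighbour)
            queue.append(neighbour)
            if len(suggestions) == t:
                break
    return suggestions

# Made-up similarity graph and term frequencies.
graph = {
    "flights": {"airfare": 0.9, "travel": 0.8, "cheap flights": 0.7},
    "airfare": {"discount airfare": 0.6},
}
frequency = {
    "flights": 100, "airfare": 60, "travel": 500,
    "cheap flights": 40, "discount airfare": 10,
}
print(watershed_suggestions(graph, frequency, "flights", t=3))
# -> ['airfare', 'cheap flights', 'discount airfare']
```

Note that "travel" is skipped because it is more frequent (and so presumably more expensive) than the seed, while "discount airfare" is only reached through "airfare", which is where the transitivity assumption comes in.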
They found that a bigger corpus improves the quality of the suggestions. The relevance of the suggested keywords also improves when more documents are retrieved, both while creating the dictionary and while computing the context vector. Basically it worked.
If you want to see a working system let me know and I'll see what I can do.
Test it against the Google keyword suggestion tool. Wordy found:
What do you reckon? Good or bad?