December 09, 2008

The importance of Datamining

Data mining is also called knowledge discovery and data mining (KDD). 
 
Data mining is the extraction of useful patterns and relationships from data sources such as databases, texts and the web. It should not, however, be confused with SQL querying, OLAP, data warehousing or anything of that kind. It uses statistical and pattern-matching techniques, borrowing from statistics, machine learning, databases, information retrieval, data visualization and other fields.
 
Many areas of science, business and other fields deal with vast amounts of data that need to be turned into something meaningful: knowledge. Many website owners and SEO professionals use statistical packages to make sense of their data, as do many other professionals. Data mining is often overlooked, when in fact it can provide very interesting information that statistical methods cannot produce, or cannot produce properly. Data mining methods also give you a lot more control.
 
The data we have is often vast and noisy, meaning that it is imprecise and its structure is complex. This is where a purely statistical technique would not succeed, so data mining is a solution.
 
The main issues in data mining are noisy data, missing values, static data, sparse data, dynamic data, relevance, interestingness, heterogeneity, algorithm efficiency, and the size and complexity of the data. These problems typically occur in large amounts of data.
 
The process for data mining is the following (a minimal Python sketch follows the list):
  1. Identify data sources and select target data
  2. Pre-process: cleaning, attribute selection
  3. Data mining to extract patterns or models
  4. Post-process: identifying interesting or useful patterns
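
To make these four steps concrete, here is a minimal sketch in Python using pandas. The file name, column names and the frequency threshold are all hypothetical, chosen only to illustrate the flow; this is a sketch, not a full data mining system.

  import pandas as pd

  # 1. Identify data sources and select target data (hypothetical file/columns)
  visits = pd.read_csv("site_visits.csv")
  target = visits[["page", "referrer", "time_on_page"]]

  # 2. Pre-process: cleaning and attribute selection
  target = target.dropna()                        # drop rows with missing values
  target = target[target["time_on_page"] > 0]     # discard obviously bad records

  # 3. Data mining: extract a simple pattern, e.g. which referrer/page
  #    pairs occur together most often
  pattern_counts = (
      target.groupby(["referrer", "page"])
            .size()
            .sort_values(ascending=False)
  )

  # 4. Post-process: keep only patterns frequent enough to be interesting
  interesting = pattern_counts[pattern_counts >= 50]
  print(interesting.head(10))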
 
Patterns must be: valid, novel, potentially useful, and understandable. 
 
A number of different rules are used:
  • Association rules: these identify collections of attributes that are statistically related in the data, for example X => Y, where X and Y are disjoint conjunctions of attribute-value pairs (a small worked example follows this list).
  • Classification is where we assign future data to known classes.
  • Clustering is where we identify groups of similar items in the data.
  • Sequential pattern mining is where we analyze collections of related records and detect frequently occurring patterns over a period of time. An algorithm called SPAM is available for this.
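
As a small illustration of the association-rule idea, here is how you might measure how often a rule X => Y holds, in Python. The "transactions" (pages viewed per visit) are made up, purely to show the support and confidence calculations.

  # Hypothetical transactions: the set of pages viewed in each visit
  transactions = [
      {"home", "pricing", "contact"},
      {"home", "blog"},
      {"home", "pricing", "signup"},
      {"blog", "contact"},
      {"home", "pricing", "contact", "signup"},
  ]

  def support(itemset):
      # Fraction of transactions containing every item in itemset
      return sum(itemset <= t for t in transactions) / len(transactions)

  def confidence(x, y):
      # Of the transactions containing X, how many also contain Y
      return support(x | y) / support(x)

  x, y = {"home", "pricing"}, {"signup"}
  print("support:", support(x | y))        # how common the rule is overall
  print("confidence:", confidence(x, y))   # how reliable X => Y is

A real association-rule miner (Apriori, FP-Growth) automates this search over all candidate rules and keeps only those above chosen support and confidence thresholds.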
A number of models are used in data mining, such as:
  • Decision trees are collections of rules mapped out in the form of tree branches leading to output values or classes. A commonly used algorithm for building decision trees is C4.5. They are simple, but limited to one output attribute (a small sketch follows below).
  • Rule induction is where rules about the data are induced. This method assigns values across the dataset, so it is possible to see where association factors are concentrated.
  • Regression models are mathematical equations which capture the potential associations between variables.
  • Neural networks are statistical programs which classify data sets by grouping things together in a way loosely similar to the brain.
Neural networks are the hardest to understand; decision trees are the easiest.
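
As a sketch of the decision-tree idea, here is a tiny example using scikit-learn's DecisionTreeClassifier (which builds a CART-style tree rather than C4.5). The feature names and the training data are invented for illustration.

  from sklearn.tree import DecisionTreeClassifier, export_text

  # Hypothetical visitor features: [pages_viewed, seconds_on_site]
  # and a label: did the visitor convert (1) or not (0)?
  X = [[1, 20], [2, 45], [8, 300], [6, 250], [1, 10], [7, 400]]
  y = [0, 0, 1, 1, 0, 1]

  tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

  # The learned tree is a readable collection of if/then rules
  print(export_text(tree, feature_names=["pages_viewed", "seconds_on_site"]))

  # Classify a new visitor
  print(tree.predict([[5, 200]]))

The printed rules show why decision trees are considered the easiest model to interpret: each branch is an explicit condition on one attribute.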
 
Many interesting things you might want to find cannot be found using database queries, such as finding out at what time of the day most of your stock is sold, or finding out what people thought about your new product.
 
Data mining is widely used in marketing, bioinformatics, fraud detection, text analysis, fault detection, market segmentation, interactive marketing, trend analysis and more.
 
A few resources:

There is a Microsoft tutorial about data mining which you can use
KDnuggets has a wealth of information
 
Tools:
Himalaya DM tools (SourceForge project)
Gnome data mining package 
Weka data mining tool in Java
DevShed: data mining with Perl
 
Commercial packages:
For a full list of commercial tools, check out the KDnuggets site.

4 comments:

Arturo Servin said...

Where do Classifiers such as naive Bayes or Fisher fall (rule induction)?

CJ said...

Arturo,

I think both are indeed rule induction methods, as they extract rules from patterns in test data. But there is also predictive induction (induces rule sets - supervised learning) and descriptive induction (discovers individual rules - unsupervised learning). It depends how you use them.

Sandro Saitta said...

Hello,

KDD = Knowledge discovery in databases

Regards.
Sandro.

CJ said...

Yep! My mistake, thank you Sandro.
