December 09, 2008

The importance of Datamining

Data mining is also called knowledge discovery in databases (KDD).
Data mining is the extraction of useful patterns and relationships from data sources such as databases, texts and the web. It is not the same thing as SQL querying, OLAP or data warehousing, however: it relies on statistical and pattern-matching techniques, and borrows from statistics, machine learning, databases, information retrieval, data visualization and other fields.
Many areas of science, business and other environments deal with vast amounts of data that need to be turned into something meaningful: knowledge. Many website owners and SEO professionals, like many other professionals, use statistical packages to make sense of their data. Data mining is often overlooked, when in fact it can provide very interesting information that statistical methods alone cannot produce, or cannot produce properly. Data mining methods also give you a lot more control.
The data we have is often vast and noisy, meaning that it is imprecise, and its structure is complex. This is where a purely statistical technique tends to fall short, and where data mining offers a solution.
The main issues in data mining are noisy data, missing values, static data, sparse data, dynamic data, relevance, interestingness, heterogeneity, algorithm efficiency, and the size and complexity of the data. These problems often occur in large data sets.
The data mining process is as follows:
  1. Identify data sources and select target data
  2. Pre-process: cleaning, attribute selection
  3. Data mining to extract patterns or models
  4. Post-process: identifying interesting or useful patterns
Patterns must be: valid, novel, potentially useful, and understandable. 
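As a rough illustration, the four steps above can be sketched in a few lines of Python. The records, attribute names and the `min_count` threshold are all made up for the example, and "mining" here is only a frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"page": "/home", "referrer": "search", "visits": 12},
    {"page": "/blog", "referrer": None, "visits": 7},
    {"page": "/blog", "referrer": "search", "visits": 3},
    {"page": "/shop", "referrer": "email", "visits": None},
    {"page": "/home", "referrer": "search", "visits": 9},
]

def select_target(records, attributes):
    """Steps 1-2: keep only the attributes we want to analyze."""
    return [{a: r.get(a) for a in attributes} for r in records]

def clean(records):
    """Step 2: drop records that still contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def mine(records):
    """Step 3: count how often each (page, referrer) combination occurs."""
    return Counter((r["page"], r["referrer"]) for r in records)

def post_process(counts, min_count):
    """Step 4: keep only the patterns seen at least min_count times."""
    return {pattern: n for pattern, n in counts.items() if n >= min_count}

target = select_target(raw_records, ["page", "referrer"])
cleaned = clean(target)
patterns = post_process(mine(cleaned), min_count=2)
print(patterns)
```

Dropping incomplete records is the simplest cleaning strategy; filling in missing values is another option and may suit sparse data better.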
A number of different techniques are used:
  • Association rules: these identify collections of attributes that are statistically related in the data. For example X => Y, where X and Y are disjoint conjunctions of attribute-value pairs.
  • Classification is where we assign new, unseen data to known classes.
  • Clustering is where we identify groups of similar items in the data.
  • Sequential pattern mining is where we analyze collections of related records and detect frequently occurring patterns over a period of time. A tool called SPAM is available for this.
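To make the association rule X => Y more concrete, here is a minimal sketch that scores a rule by its support and confidence, the two measures most commonly used for this. The shopping-basket transactions are invented for the example, and the function names are my own rather than taken from any particular tool.

```python
# Hypothetical transactions: each one is a set of items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate of P(Y | X): support(X and Y together) / support(X).

    Assumes X occurs at least once, otherwise the division fails.
    """
    x, y = set(x), set(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    return support(x | y, transactions) / support(x, transactions)

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"bread => butter: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in 3 of the 5 baskets, and 3 of the 4 baskets with bread also contain butter. Real association rule miners search over all candidate itemsets rather than scoring a single hand-picked rule.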
Models are used for data mining, such as:
  • Decision trees are collections of rules mapped out as tree branches that lead to values or classes. A well-known algorithm for building decision trees is C4.5. These are simple, but they are limited to one attribute per output.
  • Rule induction is where rules about the data are induced. This method gives values in the dataset, so it is possible to see where there is a concentration of association factors.
  • Regression models are mathematical equations that describe the potential relationships between variables.
  • Neural networks are statistical programs that classify data sets by grouping things together in a way loosely modeled on the brain.
Neural networks are the hardest of these to understand, and decision trees the easiest.
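Part of why decision trees are easy to follow is that each branch comes from a simple question: which attribute best separates the classes? The sketch below computes information gain, the entropy-based measure behind that choice. It is not a full C4.5 implementation: C4.5 normalizes this quantity into a gain ratio and also handles continuous attributes and pruning, none of which is shown. The visitor data and attribute names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Drop in entropy of target after splitting rows on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Hypothetical visitor data: did the visitor buy something?
rows = [
    {"source": "search", "returning": "yes", "bought": "yes"},
    {"source": "search", "returning": "no", "bought": "yes"},
    {"source": "email", "returning": "yes", "bought": "no"},
    {"source": "email", "returning": "no", "bought": "no"},
]

gains = {a: information_gain(rows, a, "bought") for a in ("source", "returning")}
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy data the traffic source separates buyers from non-buyers perfectly, so it would become the root of the tree, while whether the visitor is returning tells us nothing.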
Many interesting things you want to find are hard to get at using database queries alone, such as finding out at what time of the day most of your stock is sold, or finding out what people thought about your new product.
Data mining is widely used in marketing, bioinformatics, fraud detection, text analysis, fault detection, market segmentation, interactive marketing and trend analysis.
A few resources:

There’s a Microsoft tutorial about data mining which you can use
KDnuggets has a wealth of information
Himalaya DM tools (SourceForge project)
Gnome data mining package
Weka data mining tool in Java
DevShed data mining with Perl
Commercial packages:
For a full list of commercial tools, check out the KDnuggets site.


Arturo Servin said...

Where do Classifiers such as naive Bayes or Fisher fall (rule induction)?

CJ said...


I think both are indeed rule induction methods, as they extract rules from patterns in test data. But there is also predictive induction (induces rule sets - supervised learning) and descriptive induction (discovers individual rules - unsupervised learning). It depends how you use them.

Sandro Saitta said...


KDD = Knowledge discovery in databases


CJ said...

Yep! My mistake, thank you Sandro.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.