Data Mining

Data mining refers loosely to the process of semi-automatically analyzing large databases to find useful patterns. It attempts to discover rules and patterns in data.

  • deals with ‘knowledge discovery in databases’
  • There are a number of applications of data mining, such as prediction of values based on past examples, finding associations between purchases, and automatic clustering of people and movies.

Different views on data mining:
Data mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. – William Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database. – Marcel Holshemier and Arno Siebes (1994)
The analogy with the mining process is described as:
Data mining refers to “using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. The data is often voluminous but, as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is useful.”
Stages of Data Mining
Selection: selecting or segmenting the data according to some criteria, e.g. all those people who own a car; in this way, subsets of the data can be determined.
Preprocessing: this is the data cleansing stage, where information deemed unnecessary is removed because it may slow down queries; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, since inconsistent formats are likely when the data is drawn from several sources: sex may be recorded as f or m in one source and as 1 or 0 in another (a small normalization sketch follows).
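For instance, the f/m versus 1/0 inconsistency above might be resolved with a small normalization pass. The Python sketch below is purely illustrative: the field names and the 0 = female, 1 = male convention are assumptions, not something fixed by the text.

    raw_records = [
        {"patient_id": 1, "sex": "f"},
        {"patient_id": 2, "sex": "M"},
        {"patient_id": 3, "sex": 0},   # a second source encodes sex numerically
        {"patient_id": 4, "sex": 1},
    ]

    # One canonical encoding; 0 = female, 1 = male is an assumed convention.
    CANONICAL_SEX = {"f": "F", "m": "M", 0: "F", 1: "M"}

    def normalize(record):
        value = record["sex"]
        key = value.lower() if isinstance(value, str) else value
        return {**record, "sex": CANONICAL_SEX[key]}

    cleaned = [normalize(r) for r in raw_records]
    print(cleaned)   # every record now uses the same F/M format
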
Transformation: the data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.
Data Mining: this stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.
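Restated compactly in LaTeX (a hedged formalization of the definition above; the certainty measure C and the notion of “simpler” are left abstract, as they are in the prose):

    S \in L, \qquad F_s \subseteq F, \qquad C(S, F_s) = c, \qquad
    \mathrm{complexity}(S) < \mathrm{complexity}(\mathrm{enumeration\ of\ } F_s)
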
Interpretation and evaluation: the patterns identified by the system are interpreted into knowledge which can then be used to support human decision-making, e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena.
There are a number of data mining methods available to businesses today. Among the most common are data reduction methods, classification systems, and other predictive modelling tools.
Methods

  • Association rule learning
  • Cluster analysis
  • Structured data analysis
  • Data analysis
  • Predictive analysis
  • Knowledge discovery

Data reduction tools provide a systematic way to cut down the number of variables you consider in your decision making. With classification methods, you can build rules that allow you to classify your new customers. After using these data mining methods, you can employ statistical modelling to predict future outcomes such as cross-buying, detection, and sales. A combined sketch of a data reduction step and a classification step is shown below.
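The following Python sketch pairs a data reduction method (principal component analysis) with a classification rule learner (a decision tree). It assumes the scikit-learn library is available, and the customer variables and values are made up for illustration.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical customer data: four measured variables per customer
    # (age, income, visits, purchases) and a known label (1 = bought).
    X = np.array([[25, 40000, 2, 1],
                  [47, 82000, 5, 3],
                  [35, 60000, 3, 2],
                  [52, 91000, 6, 4],
                  [23, 38000, 1, 1],
                  [44, 78000, 5, 3]], dtype=float)
    y = np.array([0, 1, 0, 1, 0, 1])

    # Data reduction: project the four variables onto two components.
    pca = PCA(n_components=2).fit(X)
    X_reduced = pca.transform(X)

    # Classification: learn rules on the reduced data, then classify
    # a new customer the same way.
    clf = DecisionTreeClassifier(random_state=0).fit(X_reduced, y)
    print(clf.predict(pca.transform([[30, 52000, 2, 1]])))
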
Predictive Analysis

  • is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict future outcomes (see the sketch after this list).
  • is used in actuarial science, financial services, insurance, telecommunications, retail, travel and other fields.
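
A minimal sketch of this idea, assuming numpy and made-up data: a least-squares fit captures the relationship between an explanatory variable and the predicted variable from past occurrences, and is then used to predict a future outcome. The variable names (ad spend, sales) are hypothetical.

    import numpy as np

    # Hypothetical past occurrences: an explanatory variable (ad spend)
    # and the variable to be predicted (sales).
    ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    sales = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

    # Capture the relationship from past data with a least-squares fit,
    # then exploit it to predict a future outcome.
    slope, intercept = np.polyfit(ad_spend, sales, deg=1)
    print(slope * 6.0 + intercept)   # predicted sales at an ad spend of 6.0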

Data Analysis

  • analysis of data is a process of inspecting, cleaning, transforming and modeling data with the goal of highlighting useful information, suggesting conclusions and supporting decision making (a small sketch of these steps follows this list).
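
The four steps can be shown in miniature. This sketch assumes the pandas library; the column names and values are invented for illustration.

    import pandas as pd

    # Hypothetical sales records with one missing value.
    df = pd.DataFrame({"region": ["N", "N", "S", "S"],
                       "revenue": [10.0, None, 7.5, 8.2]})

    print(df.describe())                               # inspecting
    df = df.dropna()                                   # cleaning
    df["share"] = df["revenue"] / df["revenue"].sum()  # transforming
    print(df.groupby("region")["revenue"].mean())      # simple modeling/summary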

Cluster Analysis or Clustering

  • is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense (see the sketch after this list)
  • clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining and pattern recognition.
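
A short clustering sketch, assuming scikit-learn and made-up observations; k-means is one common clustering algorithm, used here purely as an example of the idea.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical observations forming two loose groups in the plane.
    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                  [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

    # Assign each observation to one of two clusters; observations in the
    # same cluster end up close to the same centroid, i.e. "similar".
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g. [0 0 0 1 1 1]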

Association rule learning

  • In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases; a small worked example of support and confidence follows this list.
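
The sketch below computes the two standard interestingness measures, support and confidence, for one candidate rule over hypothetical market-basket transactions (the items and transactions are made up).

    # Hypothetical market-basket transactions.
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]

    def support(itemset):
        """Fraction of transactions that contain every item in itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # The rule {bread} -> {milk} is judged by its support and confidence.
    X, Y = {"bread"}, {"milk"}
    print("support:", support(X | Y))                  # 0.5
    print("confidence:", support(X | Y) / support(X))  # ~0.67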

Data Mining Problems/Issues
Data mining systems rely on databases to supply the raw data for input, and this raises problems because databases tend to be dynamic, incomplete, noisy and large. Other problems arise as a result of the adequacy and relevance of the information stored.

  • Limited information: A database is often designed for purposes different from data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about that domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patients' red blood cell counts.
  • Noise and missing values: Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes that rely on subjective or measurement judgements can give rise to errors, such that some examples may even be misclassified. Errors in either the values of attributes or in class information are known as noise. Obviously, where possible, it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules.

Missing data can be treated by discovery systems in a number of ways (several of which are sketched after this list), such as:

– simply disregard missing values

– omit the corresponding records

– infer missing values from known values

– treat missing data as a special value to be included additionally in the attribute domain

– or average over the missing values using Bayesian techniques.
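
Three of these strategies can be shown concretely. The sketch assumes the pandas library and made-up patient records; Bayesian averaging over missing values is not shown.

    import pandas as pd

    # Hypothetical patient records with a missing blood-pressure value.
    df = pd.DataFrame({"age": [34, 51, 29],
                       "bp": [120.0, None, 115.0]})

    complete = df.dropna()                         # omit the corresponding record
    inferred = df.fillna({"bp": df["bp"].mean()})  # infer from known values
    flagged = df.fillna({"bp": -1.0})              # treat missing as a special value
    print(inferred)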

  • Uncertainty: Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.
  • Size, updates, and irrelevant fields: Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this, from a data mining perspective, is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the ‘timeliness’ of the data.
