13.4 Data Mining Tasks (2024)


13.4.1 Description and Summarization
13.4.2 Descriptive Modeling
13.4.3 Predictive Modeling
13.4.4 Discovering Patterns and Rules
13.4.5 Retrieving Similar Objects

The cycle of data and knowledge mining comprises various analysis steps, each step focusing on a different aspect or task. [13] propose the following categorization of data mining tasks.

13.4.1 Description and Summarization

At the beginning of every data analysis is the wish and the need to get an overview of the data and to see general trends as well as extreme values rather quickly. It is important to become familiar with the data, to get an idea of what the data might be able to tell you, where the limitations will be, and which further analysis steps might be suitable. Typically, getting the overview will at the same time point the analyst towards particular features, data quality problems, and additional required background information. Summary tables, simple univariate descriptive statistics, and simple graphics are extremely valuable tools for this task.
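As a rough illustration of such a first overview, the following Python sketch computes a few univariate descriptive statistics. The function name `summarize` and the sample values are assumptions made for this example only; they do not come from the data discussed in this chapter.

```python
import statistics

def summarize(values):
    """Return simple univariate descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
    }

# Hypothetical measurements; the last value is a deliberate extreme value.
sample = [4.1, 4.8, 5.0, 5.2, 5.3, 5.9, 6.1, 19.5]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this made-up sample the mean lies well above the median, which is the kind of quick hint at an extreme value that the overview step is meant to give.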

13.4.2 Descriptive Modeling

General descriptions and summaries are an important starting point, but more exploration of the data is usually desired. While the tasks in the previous section have been guided by the goal of summary and data reduction, descriptive modeling tries to find models for the data. In contrast to the subsequent section, the aim of these models is to describe, not to predict. As a consequence, descriptive models are used in the setting of unsupervised learning. Typical methods of descriptive modeling are density estimation, smoothing, data segmentation, and clustering. There are by now some classics in the literature on density estimation ([27]) and smoothing ([14]).

Clustering is a well-studied and well-known technique in statistics. Many different approaches and algorithms, distance measures, and clustering schemes have been proposed. With large data sets, all hierarchical methods have severe performance problems. The most widely used method is k-means clustering. Although k-means is not particularly tailored to a large number of observations, it is currently the only clustering scheme that has gained a positive reputation in both the computer science and the statistics community. The reasoning behind cluster analysis is the assumption that the data set contains natural clusters which, when discovered, can be characterized and labeled. While for some cases it might be difficult to decide to which group they belong, we assume that the resulting groups are clear-cut and carry an intrinsic meaning. In segmentation analysis, in contrast, the user typically sets the number of groups in advance and tries to partition all cases into homogeneous subgroups.
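To make the k-means idea concrete, here is a minimal sketch of the standard alternating scheme (often called Lloyd's algorithm): assign every case to its nearest center, then move each center to the mean of its cases. The function name `kmeans`, its parameters, and the six example points are assumptions for illustration; a production analysis would normally rely on a tested library implementation.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm for k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers would not move
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
centers, labels = kmeans(points, k=2)
print(sorted(centers))
print(labels)
```

Note that the number of groups `k` has to be chosen by the user, and that the result can depend on the initial centers when the groups are less clearly separated than in this toy example.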

13.4.3 Predictive Modeling

Predictive modeling falls into the category of supervised learning; hence, one variable is clearly labeled as the target variable and will be explained as a function of the other variables. The nature of the target variable determines the type of model: a classification model if it is a discrete variable, or a regression model if it is a continuous one. Many models are typically built to predict the behavior of new cases and to extend the knowledge to objects that are new or not yet as widely understood. Examples include predicting the value of the stock market, the outcome of the next governmental election, or the health status of a person. Banks use classification schemes to group their customers into different categories of risk.

Classification models follow one of three different approaches: the discriminative approach, the regression approach, or the class-conditional approach. The discriminative approach aims at directly mapping the explanatory variables to one of the possible target categories. The input space is hence partitioned into different regions, each of which has a unique class label assigned. Neural networks and support vector machines are examples of this. The regression approach (e.g. logistic regression) calculates the posterior class distribution for each case, that is, the probability of each class given the case's explanatory values, and chooses the class for which the maximum probability is reached. Decision trees (CART, C5.0, CHAID) fit both the discriminative approach and the regression approach, because typically the posterior class probabilities at each leaf are calculated as well as the predicted class.

The class-conditional approach starts with specifying the class-conditional distributions explicitly. After estimating the marginal distribution, Bayes' rule is used to derive the conditional distribution of the class given the explanatory variables. The name Bayesian classifiers is widely used for this approach, misleadingly suggesting a Bayesian as opposed to a frequentist approach. Mostly, plug-in estimates are derived via maximum likelihood. The class-conditional approach is particularly attractive because it allows for general forms of the class-conditional distributions. Parametric, semi-parametric, and non-parametric methods can be used to estimate the class-conditional distribution.

The class-conditional approach is the most complex modeling technique for classification. The regression approach requires fewer parameters to fit, but still more than a discriminative model. There is no general rule as to which approach works best. It is mainly a question of the researcher's goal whether posterior probabilities are useful, e.g. to see how likely the ''second best'' class would be.
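The class-conditional approach can be sketched in a few lines under strong simplifying assumptions: a single explanatory variable, a Gaussian class-conditional distribution per class, and plug-in maximum likelihood estimates. The function names, the Gaussian choice, and the data below are all assumptions made for illustration and are not prescribed by the text.

```python
import math
from statistics import fmean, pstdev

def fit_class_conditional(xs, ys):
    """Plug-in ML fit: one Gaussian per class plus the class priors."""
    model = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        model[label] = {
            "prior": len(values) / len(xs),  # relative class frequency
            "mean": fmean(values),           # ML estimate of the class mean
            "sd": pstdev(values),            # ML estimate (divides by n)
        }
    return model

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(model, x):
    """Bayes' rule: P(y | x) = P(x | y) P(y) / sum_y' P(x | y') P(y')."""
    joint = {
        label: p["prior"] * gaussian_pdf(x, p["mean"], p["sd"])
        for label, p in model.items()
    }
    evidence = sum(joint.values())  # marginal density of x
    return {label: value / evidence for label, value in joint.items()}

# Hypothetical training data with two classes.
xs = [1.0, 1.2, 0.8, 1.1, 3.0, 3.3, 2.9, 3.1]
ys = ["low"] * 4 + ["high"] * 4
model = fit_class_conditional(xs, ys)
post = posterior(model, 2.0)
predicted = max(post, key=post.get)
print(predicted, post)
```

Because the full posterior is returned rather than only the predicted label, the probability of the ''second best'' class can be read off directly.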

13.4.4 Discovering Patterns and Rules

The previous tasks fall largely within the statistical tradition of describing functional relationships between explanatory variables and target variables. There are situations where such a functional relationship is either not appropriate or too hard to achieve in a meaningful way. Nevertheless, there might be a pattern in the sense that certain items, values, or measurements occur frequently together. Association rules are a method originating from market basket analysis to elicit patterns of common behavior. Let us consider an example based on data that is available as one of the example data files for the SAS Enterprise Miner. For this data (in the following referred to as the SAS Assocs Data), the output of an association query with a minimum support and a minimum confidence, limited by a maximum number of items per rule, generated by the SAS Enterprise Miner consists of lines of the form shown in Table 13.1.

**Table 13.1:** Examples of association rules found in the SAS Assocs Data by the SAS Enterprise Miner software. The generated rules include rules with 2, 3, and 4 items.

| # items | conf (%) | supp (%) | Rule |
|---|---|---|---|
| 2 | 82.62 | 25.17 | artichok ⇒ heineken |
| 2 | 78.93 | 25.07 | soda ⇒ cracker |
| 2 | 78.09 | 22.08 | turkey ⇒ olives |
| 3 | 95.16 | 5.89 | soda & artichok ⇒ heineken |
| 3 | 94.31 | 19.88 | avocado & artichok ⇒ heineken |
| 3 | 93.23 | 23.38 | soda & cracker ⇒ heineken |
| 4 | 100.00 | 3.1 | ham & corned beef & apples ⇒ olives |
| 4 | 100.00 | 3.1 | ham & corned beef & apples ⇒ hering |
| 4 | 100.00 | 3.8 | steak & soda & heineken ⇒ cracker |

The practical use of association rules is not restricted to finding the general trend and normal behavior; association rules have also been used successfully to detect unusual behavior in fraud detection.
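The support and confidence of a rule can be computed directly from a list of transactions. The sketch below uses the usual definitions: support is the share of all transactions that contain every item of the rule, and confidence is the share of transactions containing the left-hand side that also contain the right-hand side. The function name `rule_metrics` and the five baskets are assumptions for illustration; they are not taken from the SAS Assocs Data, and the SAS software may differ in details.

```python
def rule_metrics(transactions, body, head):
    """Support and confidence (in percent) of the rule body => head."""
    body, head = set(body), set(head)
    # Transactions containing the left-hand side of the rule.
    n_body = sum(1 for t in transactions if body <= t)
    # Transactions containing every item of the rule.
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = 100.0 * n_both / len(transactions)
    confidence = 100.0 * n_both / n_body if n_body else 0.0
    return support, confidence

# Hypothetical market baskets, each a set of purchased items.
baskets = [
    {"soda", "cracker", "heineken"},
    {"soda", "cracker"},
    {"soda", "olives"},
    {"cracker", "heineken"},
    {"soda", "cracker", "heineken", "olives"},
]
supp, conf = rule_metrics(baskets, {"soda", "cracker"}, {"heineken"})
print(f"supp={supp:.2f}%  conf={conf:.2f}%")
```

In this toy example two of the five baskets contain all three items, and three baskets contain the left-hand side, so the rule has a support of 40% and a confidence of about 66.67%.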

13.4.5 Retrieving Similar Objects

The World Wide Web contains an enormous amount of information in electronic journal articles, electronic catalogs, and private and commercial homepages. Having found an interesting article or picture, one often wants to find similar objects quickly. Search engines provide this information based on keywords and indexed meta-information. They can work not only on text documents but, to a certain extent, also on images. Semi-automated picture retrieval combines the ability of the human visual system with the search capacities of the computer to find similar images in a database.
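One simple way to rank text documents by keyword similarity is the cosine similarity between their word-count vectors. The sketch below is only an illustration of that idea; the function names, the whitespace tokenization, and the three tiny documents are assumptions, and real search engines use far more elaborate indexing and ranking.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two keyword-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, documents, top=2):
    """Rank documents (name -> text) by keyword similarity to the query."""
    q = Counter(query.lower().split())
    scored = [
        (cosine_similarity(q, Counter(text.lower().split())), name)
        for name, text in documents.items()
    ]
    return sorted(scored, reverse=True)[:top]

# Hypothetical document collection.
documents = {
    "a": "clustering large data sets with k-means",
    "b": "density estimation and smoothing",
    "c": "association rules for market basket data",
}
ranking = most_similar("k-means clustering of data", documents)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

Here document `a` shares three keywords with the query and ranks first, while document `c` shares only the word "data" and ranks second.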
