13.4 Data Mining Tasks (2024)


Subsections

  • 13.4.1 Description and Summarization
  • 13.4.2 Descriptive Modeling
  • 13.4.3 Predictive Modeling
  • 13.4.4 Discovering Patterns and Rules
  • 13.4.5 Retrieving Similar Objects

The cycle of data and knowledge mining comprises various analysis steps, each step focusing on a different aspect or task. [13] propose the following categorization of data mining tasks.


13.4.1 Description and Summarization

At the beginning of each data analysis is the wish and the need to get an overview of the data, to see general trends as well as extreme values rather quickly. It is important to become familiar with the data, to get an idea of what the data might be able to tell you, where the limitations will be, and which further analysis steps might be suitable. Typically, getting the overview will at the same time point the analyst towards particular features, data quality problems, and additional required background information. Summary tables, simple univariate descriptive statistics, and simple graphics are extremely valuable tools for achieving this task.

[33] report on a study of […] car insurance policies during which the following difficulties emerged, amongst others (see Fig. 13.1).

(a)
Barcharts of the categorical variables revealed that several had too many categories. Sex had seven, of which four were so rare as to presumably be unknowns or errors of some kind. The third large category turned out to be very reasonable: if a car was insured by a firm, the variable sex was coded as "firm". This had not been explained in advance and was obviously useful for a better grasp of the data.
(b)
A histogram of date of birth showed missing values, a fairly large number (though small percentage) of underage insured persons, and a largish number born in 1900, who had perhaps been originally coded as "0" or "00" for unknown. Any analytic method using such a variable could have given misleading results.
(c)
Linking the barchart of gender from (a) and the histogram of age from (b) showed quite plausibly that many firms had date of birth coded as missing, but not all. This led to further informative discussions with the data set owners.
Figure 13.1: Linked highlighting reveals structure in the data and explains unusual results of one variable quite reasonably. Barchart of Sex of car insurance policy holders on the left, histogram of year of birth of policy holders on the right. Highlighted are cases with Sex = […] (firm). The lines under some of the bins in the histogram indicate small counts of highlighted cases that cannot be displayed proportionally.

Checking data quality is by no means a negative part of the process. It leads to a deeper understanding of the data and to more discussions with the data set owners. Discussions lead to more information about the data and the goals of the study.
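The checks described in (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 1% rarity threshold, the reference year, and the toy data are all made up here and do not come from the study in [33].

```python
from collections import Counter

def rare_categories(values, min_share=0.01):
    """Return the categories whose relative frequency is below min_share."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: n for cat, n in counts.items() if n / total < min_share}

def suspicious_birth_years(years, sentinel=1900, min_age=18, reference_year=2000):
    """Count missing values, sentinel years, and underage policy holders."""
    missing = sum(1 for y in years if y is None)
    known = [y for y in years if y is not None]
    at_sentinel = sum(1 for y in known if y == sentinel)
    underage = sum(1 for y in known if reference_year - y < min_age)
    return {"missing": missing, "sentinel": at_sentinel, "underage": underage}

# Made-up toy data, not the insurance data from the study.
sex = ["m"] * 600 + ["f"] * 300 + ["firm"] * 95 + ["x"] * 3 + ["?"] * 2
birth_year = [1960] * 900 + [1900] * 60 + [1995] * 30 + [None] * 10

print(rare_categories(sex))
print(suspicious_birth_years(birth_year))
```

Such counts only flag candidates for discussion with the data set owners; whether a rare category or a sentinel year is an error cannot be decided from the data alone.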

Speed of the data processing is an important issue at this step. For simple tasks (and data summary and description are typically considered to be simple tasks, although this is generally not true), users are not willing to spend much time. A frequency table or a scatterplot must be visible in a fraction of a second, even when it comprises a million observations. Only some computer programs are able to achieve this. Another point is a fast scan through all the variables: if a program requires an explicit and lengthy specification of the graph or table to be created, a user will typically end this tedious endeavor after a few instances. Generic functions with context-sensitive and variable-type-dependent responses provide a viable solution to this task. On the level of standard statistical data sets this is provided by software like XploRe, S-Plus and R with their generic functions summary and plot. Generic functions of this kind can be enhanced by a flexible and interactive user environment which allows the user to navigate through the mass of data and to extract the variables that show interesting information at first glance and that call for further investigation. Currently, no system comes close to meeting these demands; future systems hopefully will.
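The idea of a generic function with a variable-type-dependent response can be sketched as follows. This is a minimal Python illustration, not the summary function of XploRe, S-Plus, or R; the names summarize and summarize_all and the toy table are assumptions made for this example.

```python
from collections import Counter
from numbers import Number
from statistics import mean, median

def summarize(column):
    """Type-dependent summary: location measures for numbers, counts otherwise."""
    present = [v for v in column if v is not None]
    n_missing = len(column) - len(present)
    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        return {"type": "numeric", "min": min(present), "median": median(present),
                "mean": mean(present), "max": max(present), "missing": n_missing}
    return {"type": "categorical", "counts": dict(Counter(present)),
            "missing": n_missing}

def summarize_all(table):
    """Scan every variable without asking the user for a specification."""
    return {name: summarize(column) for name, column in table.items()}

# Made-up toy table: one numeric and one categorical variable.
table = {"age": [23, 41, None, 35, 58], "sex": ["f", "m", "m", "firm", "f"]}
for name, result in summarize_all(table).items():
    print(name, result)
```

A single call covers all variables, which is the point made above: the user does not have to specify each table or graph explicitly.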


13.4.2 Descriptive Modeling

General descriptions and summaries are an important starting point, but more exploration of the data is usually desired. While the tasks in the previous section have been guided by the goal of summary and data reduction, descriptive modeling tries to find models for the data. In contrast to the subsequent section, the aim of these models is to describe, not to predict. As a consequence, descriptive models are used in the setting of unsupervised learning. Typical methods of descriptive modeling are density estimation, smoothing, data segmentation, and clustering. There are by now some classics in the literature on density estimation ([27]) and smoothing ([14]).

Clustering is a well-studied and well-known technique in statistics. Many different approaches and algorithms, distance measures, and clustering schemes have been proposed. With large data sets, all hierarchical methods have extreme difficulties with performance. The most widely used method is $k$-means clustering. Although $k$-means is not particularly tailored for a large number of observations, it is currently the only clustering scheme that has gained a positive reputation in both the computer science and the statistics community. The reasoning behind cluster analysis is the assumption that the data set contains natural clusters which, when discovered, can be characterized and labeled. While for some cases it might be difficult to decide to which group they belong, we assume that the resulting groups are clear-cut and carry an intrinsic meaning. In segmentation analysis, in contrast, the user typically sets the number of groups in advance and tries to partition all cases into homogeneous subgroups.
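As an illustration of the clustering idea, the following is a minimal sketch of k-means in the form usually called Lloyd's algorithm, written in plain Python. The starting centers are passed in explicitly (an assumption made here to keep the example deterministic), and the six two-dimensional points are made up.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center update until the labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Made-up toy data with two visually separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, [points[0], points[3]])
print(labels)
print(centers)
```

Each pass touches every point once per center, so the cost per iteration grows linearly with the number of observations; the result still depends on the starting centers.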


13.4.3 Predictive Modeling

Predictive modeling falls into the category of supervised learning; hence, one variable is clearly labeled as the target variable $Y$ and will be explained as a function of the other variables $X$. The nature of the target variable determines the type of model: a classification model if $Y$ is a discrete variable, or a regression model if it is a continuous one. Many models are typically built to predict the behavior of new cases and to extend the knowledge to objects that are new or not yet as widely understood. Examples include predicting the value of the stock market, the outcome of the next governmental election, or the health status of a person. Banks use classification schemes to group their customers into different categories of risk.

Classification models follow one of three different approaches: the discriminative approach, the regression approach, or the class-conditional approach. The discriminative approach aims at directly mapping the explanatory variables $X$ to one of the $k$ possible target categories $y_1, \ldots, y_k$. The input space $\mathcal{X}$ is hence partitioned into different regions, each of which has a unique class label assigned. Neural networks and support vector machines are examples of this. The regression approach (e.g. logistic regression) calculates the posterior class distribution $P(Y \mid x)$ for each case and chooses the class for which the maximum probability is reached. Decision trees (CART, C5.0, CHAID) fall under both the discriminative approach and the regression approach, because typically the posterior class probabilities at each leaf are calculated as well as the predicted class. The class-conditional approach starts by specifying the class-conditional distributions $P(X \mid y_i, \theta_i)$ explicitly. After estimating the marginal distribution $P(Y)$, Bayes' rule is used to derive the conditional distribution $P(Y \mid x)$. The name Bayesian classifiers is widely used for this approach, erroneously suggesting a Bayesian as opposed to a frequentist approach. Mostly, plug-in estimates $\hat{\theta}_i$ are derived via maximum likelihood. The class-conditional approach is particularly attractive because it allows for general forms of the class-conditional distributions. Parametric, semi-parametric, and non-parametric methods can be used to estimate the class-conditional distribution. The class-conditional approach is the most complex modeling technique for classification. The regression approach requires fewer parameters to fit, but still more than a discriminative model. There is no general rule as to which approach works best; it is mainly a question of the researcher's goal and whether posterior probabilities are useful, e.g. to see how likely the "second best" class would be.
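The class-conditional approach can be made concrete with a small sketch. The example below assumes a single explanatory variable and a normal distribution within each class; this parametric choice, the function names, and the toy data are assumptions made for illustration only. The class prior, mean, and variance are maximum likelihood plug-in estimates, and Bayes' rule turns them into posterior class probabilities.

```python
import math
from collections import defaultdict

def fit_class_conditional(xs, ys):
    """ML plug-in estimates: class prior, mean, and variance per class."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    model = {}
    for label, values in groups.items():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)  # ML: divide by n
        model[label] = {"prior": len(values) / len(xs), "mean": mu, "var": var}
    return model

def gaussian_density(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(model, x):
    """Bayes' rule: posterior is proportional to class density times prior."""
    joint = {label: p["prior"] * gaussian_density(x, p["mean"], p["var"])
             for label, p in model.items()}
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}

def classify(model, x):
    """Choose the class with the maximum posterior probability."""
    post = posterior(model, x)
    return max(post, key=post.get)

# Made-up toy data: one explanatory variable, two classes.
xs = [1.0, 1.2, 0.8, 1.1, 3.0, 3.3, 2.9, 3.2]
ys = ["low"] * 4 + ["high"] * 4
model = fit_class_conditional(xs, ys)
print(classify(model, 1.0), classify(model, 3.1))
print(posterior(model, 2.0))
```

Because the full posterior is returned, the probability of the second most likely class is available as well, which a purely discriminative rule would not provide.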


13.4.4 Discovering Patterns and Rules

The previous tasks have been very much within the statistical tradition of describing functional relationships between explanatory variables and target variables. There are situations where such a functional relationship is either not appropriate or too hard to achieve in a meaningful way. Nevertheless, there might be a pattern in the sense that certain items, values or measurements frequently occur together. Association rules are a method originating from market basket analysis to elicit patterns of common behavior. Let us consider an example originating from data that is available as one of the example data files for the SAS Enterprise Miner. For this data (in the following referred to as the SAS Assocs Data), the output generated by the SAS Enterprise Miner for an association query with […] and […], limited to a maximum of […] items per rule, consists of […] lines of the form shown in Table 13.1.

Table 13.1: Examples of association rules as found in the SAS Assocs Data by the SAS Enterprise Miner software. […] rules have been generated: […] including 2 items, […] with 3 items, and […] with 4 items

# items   conf     supp    count   Rule
2         82.62    25.17   […]     artichok => heineken
2         78.93    25.07   […]     soda => cracker
2         78.09    22.08   […]     turkey => olives
...
3         95.16     5.89   […]     soda & artichok => heineken
3         94.31    19.88   […]     avocado & artichok => heineken
3         93.23    23.38   […]     soda & cracker => heineken
...
4        100.00     3.1    […]     ham & corned beef & apples => olives
4        100.00     3.1    […]     ham & corned beef & apples => hering
4        100.00     3.8    […]     steak & soda & heineken => cracker
...
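The conf and supp columns can be read with the usual definitions, assumed here: the support of a rule is the percentage of transactions that contain all of its items, and the confidence is the percentage of transactions containing the left-hand side that also contain the right-hand side. The sketch below computes both on a handful of made-up baskets; it is not the SAS Enterprise Miner procedure and does not use the SAS Assocs Data.

```python
def support(transactions, items):
    """Percentage of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return 100.0 * hits / len(transactions)

def confidence(transactions, lhs, rhs):
    """Percentage of transactions containing `lhs` that also contain `rhs`."""
    both = support(transactions, set(lhs) | set(rhs))
    return 100.0 * both / support(transactions, lhs)

# Made-up baskets for illustration only.
baskets = [
    {"soda", "cracker", "heineken"},
    {"soda", "cracker"},
    {"soda", "heineken"},
    {"cracker", "heineken"},
    {"soda", "cracker", "heineken", "olives"},
]

print(support(baskets, {"soda", "cracker"}))
print(confidence(baskets, {"soda"}, {"cracker"}))
print(confidence(baskets, {"soda", "cracker"}, {"heineken"}))
```

A rule can have high confidence and still rest on few transactions, which is why the two measures are reported side by side.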

The practical use of association rules is not restricted to finding the general trend and the normal behavior; association rules have also been used successfully to detect unusual behavior in fraud detection.


13.4.5 Retrieving Similar Objects

The World Wide Web contains an enormous amount of information in electronic journal articles, electronic catalogs, and private and commercial homepages. Having found an interesting article or picture, it is a common desire to find similar objects quickly. Search engines provide us with this information based on keywords and indexed meta-information. They can work not only on text documents but, to a certain extent, also on images. Semi-automated picture retrieval combines the ability of the human visual system with the search capacities of the computer to find similar images in a database.
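One simple way to rank text documents by similarity to a query is to compare keyword counts. The sketch below uses cosine similarity between word-count vectors; this measure is a choice made for this example and is not named in the text, and the function names and the three short documents are made up.

```python
import math
from collections import Counter

def keyword_vector(text):
    """Count how often each lower-cased word occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, documents, top=2):
    """Return the names of the `top` documents closest to the query."""
    q = keyword_vector(query)
    scored = [(cosine_similarity(q, keyword_vector(text)), name)
              for name, text in documents.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top]]

# Made-up document collection.
documents = {
    "doc_a": "clustering large data sets with k-means",
    "doc_b": "association rules for market basket analysis",
    "doc_c": "fast clustering of image data",
}
print(most_similar("clustering large data", documents))
```

Real search engines work on indexed collections rather than scanning every document, so this only illustrates the similarity measure, not the retrieval machinery.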


Author: Duane Harber
