Coursera: Pattern Discovery in Data Mining

Atsushi Takayama

Jan 20, 2019

## 1.1: What Is Pattern Discovery? Why Is It Important?

Patterns:

  • a set of items, subsequences, or substructures that occur frequently together (or strongly correlated) in a data set
  • represent intrinsic and important properties of datasets

Pattern discovery: Uncovering patterns from massive data sets

## 1.2: Frequent Patterns and Association Rules

(absolute) support (count): the number of transactions in which the itemset appears

minsup: minimum support threshold; an itemset is frequent if its (relative) support is at least minsup

Association rules: X → Y (s, c)

Support s: the probability that a transaction contains X ∪ Y

Confidence c: the conditional probability that a transaction containing X also contains Y

c = sup(X ∪ Y) / sup(X)

Association rule mining: find all rules X → Y whose support and confidence exceed the given thresholds
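To make the definitions concrete, here is a minimal Python sketch (toy transactions, illustrative item names) that computes s and c for one candidate rule:

```python
# Toy transaction database; items are illustrative only.
transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "milk"},
    {"diaper", "milk"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Confidence of X -> Y: sup(X u Y) / sup(X)."""
    return support(x | y, db) / support(x, db)

x, y = {"beer"}, {"diaper"}
print(f"s = {support(x | y, transactions):.2f}")   # 0.50
print(f"c = {confidence(x, y, transactions):.2f}") # 0.67
```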

## 1.3: Compressed Representation: Closed Patterns and Max-Patterns


There are far too many patterns for any computer to compute or store all the patterns!

Closed pattern (closed frequent pattern): A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X

Max-pattern (max frequent pattern): A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

Closed patterns are a lossless compression: the support of every frequent pattern can be recovered from them. Max-patterns are a lossy compression: we only know whether a pattern is frequent, not its exact support.
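A brute-force sketch on a toy database (illustrative only, not how real miners work) showing that the closed patterns retain exact supports while the max-patterns do not:

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"a", "c"}]
minsup = 2  # absolute support count

items = sorted(set().union(*transactions))

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# All frequent itemsets (brute force is fine on a toy example).
frequent = {frozenset(c): sup(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(c) >= minsup}

# Closed: no frequent super-pattern with the same support.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
# Max: no frequent super-pattern at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print({tuple(sorted(s)): n for s, n in frequent.items()})
print("closed:", sorted(tuple(sorted(s)) for s in closed))   # ('a',), ('a','b'), ('a','c')
print("max:   ", sorted(tuple(sorted(s)) for s in maximal))  # ('a','b'), ('a','c')
```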

## 2.1: The Downward Closure Property of Frequent Patterns

The downward closure (also called “Apriori”) property of frequent patterns

Any subset of a frequent itemset must be frequent. (if {beer, diaper, nuts} is frequent, so is {beer, diaper})

Apriori pruning principle: if any subset of an itemset S is infrequent, then there is no chance for S to be frequent.

## 2.2: The Apriori Algorithm

A level-wise, candidate-generation-and-test approach:

Initially, scan the DB once to get the frequent 1-itemsets; set k := 1

Repeat

  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Test the candidates against DB to find frequent (k+1)-itemsets
  • Set k := k+1

Until no frequent or candidate set can be generated

Return all the frequent itemsets derived
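A minimal Python sketch of this level-wise loop, using naive set containment for counting and a simple join for candidate generation (illustrative data; not an optimized implementation):

```python
from itertools import combinations

def apriori(db, minsup):
    """db: list of item sets; minsup: absolute support count."""
    def count(cands):
        return {c: sum(1 for t in db if c <= t) for c in cands}

    # Scan the DB once for the frequent 1-itemsets.
    freq = {c: n for c, n in count({frozenset([i]) for t in db for i in t}).items()
            if n >= minsup}
    all_freq, k = dict(freq), 1
    while freq:
        # Generate length-(k+1) candidates from the length-k frequent itemsets,
        # pruning any candidate with an infrequent k-subset (Apriori pruning).
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        freq = {c: n for c, n in count(cands).items() if n >= minsup}
        all_freq.update(freq)
        k += 1
    return all_freq

db = [{"beer", "diaper", "nuts"}, {"beer", "diaper"}, {"beer", "milk"}]
print({tuple(sorted(c)): n for c, n in apriori(db, minsup=2).items()})
# {('beer',): 3, ('diaper',): 2, ('beer', 'diaper'): 2}
```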


## 2.3: Extensions or Improvements of Apriori

Scanning the entire database repeatedly is costly. Partition it into k sub-databases: any globally frequent pattern must be frequent in at least one of the sub-databases (otherwise its relative support could not reach minsup overall).


How to partition the database? So that each sub-database fits into main memory.
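A sketch of the two-pass partition idea under the assumption that each partition fits in memory: the union of locally frequent itemsets is a candidate superset of the answer, which one final scan then verifies (the brute-force local miner is only for illustration):

```python
from itertools import combinations

def local_frequent(part, minsup_rel):
    """Brute-force frequent itemsets inside one (small, in-memory) partition."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, k)):
            if sum(1 for t in part if c <= t) >= minsup_rel * len(part):
                found.add(c)
    return found

db = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a"}]
minsup_rel = 0.5
parts = [db[:3], db[3:]]   # each partition is assumed to fit in memory

# Scan 1: union of locally frequent itemsets is a superset of the answer.
candidates = set().union(*(local_frequent(p, minsup_rel) for p in parts))
# Scan 2: one full pass over the DB to count the candidates globally.
global_freq = {c for c in candidates
               if sum(1 for t in db if c <= t) >= minsup_rel * len(db)}
print(sorted(tuple(sorted(c)) for c in global_freq))
# [('a',), ('a', 'b'), ('b',), ('c',)]
```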


## 2.4: Mining Frequent Patterns by Exploring Vertical Data Format

ECLAT: Equivalence Class Transformation

  • 10: a, b, d, e
  • 20: a, b, e
  • 30: b, c, e

Transform this into

  • a: 10, 20
  • b: 10, 20, 30
  • c: 30
  • d: 10
  • e: 10, 20, 30

Then the tid-list of “ae” is the intersection of the tid-lists of “a” and “e” (here {10, 20}), so its support is 2.

Using diffsets (storing differences of tid-lists instead of full tid-lists) further accelerates mining.
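A minimal sketch of the vertical transformation and the tid-list intersection, using the example above:

```python
# Horizontal format: transaction id -> items (the example above).
db = {10: {"a", "b", "d", "e"}, 20: {"a", "b", "e"}, 30: {"b", "c", "e"}}

# Vertical format: item -> tid-list (set of transaction ids).
vertical = {}
for tid, items in db.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# The tid-list of an itemset is the intersection of its items' tid-lists;
# the support is simply the size of that intersection.
tids_ae = vertical["a"] & vertical["e"]
print(sorted(tids_ae), "support =", len(tids_ae))  # [10, 20] support = 2
```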

## 2.5: FPGrowth: A Pattern Growth Approach

FPGrowth: frequent pattern growth

  • Find frequent single items and partition the database based on each such item
  • Recursively grow frequent patterns by doing the above for each partitioned database (also called conditional database)

Uses the FP-tree (frequent-pattern tree) data structure, a compressed prefix tree of the transactions.
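A rough sketch of the pattern-growth recursion with the conditional (projected) databases materialized explicitly; the real FPGrowth stores them compactly in an FP-tree and processes items in frequency order, but on toy data the resulting pattern set is the same (data below is illustrative):

```python
from collections import Counter

def pattern_growth(db, minsup, prefix=frozenset(), out=None):
    """Grow frequent patterns by recursing into conditional databases.
    db is a list of item sets; the FP-tree compression is omitted here."""
    if out is None:
        out = {}
    counts = Counter(i for t in db for i in t)
    for item, n in counts.items():
        if n < minsup:
            continue
        pattern = prefix | {item}
        out[pattern] = n
        # Conditional DB for `pattern`: transactions containing `item`,
        # with `item` itself removed.
        cond_db = [t - {item} for t in db if item in t]
        pattern_growth(cond_db, minsup, pattern, out)
    return out

db = [{"f", "a", "c", "m", "p"}, {"f", "a", "c", "b", "m"},
      {"f", "b"}, {"c", "b", "p"}, {"f", "a", "c", "m", "p"}]
patterns = pattern_growth(db, minsup=3)
print(len(patterns), "frequent patterns; longest:", sorted(max(patterns, key=len)))
# 18 frequent patterns; longest: ['a', 'c', 'f', 'm']
```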


## 2.6: Mining Closed Patterns

CLOSET+: Mining Closed Itemsets by Pattern-Growth

## Programming Assignment 1

## 3.1: Limitation of the Support-Confidence Framework

Not all patterns mined are interesting. Interestingness measure:

  • Objective: support, confidence, correlation
  • Subjective (may vary from person to person): query-based, unexpected, freshness, timeliness, multi-dimensional visualization tools

## 3.2: Interestingness Measures: Lift and χ2

Support and confidence are not good enough as interestingness measures on their own. What else? Lift and χ².


Null transactions: Transactions that contain neither B nor C

Too many null transactions cause problems for Lift and χ²: neither measure is null-invariant.
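A small sketch that computes lift and χ² from the counts of a 2×2 contingency table; the made-up numbers show how adding null transactions alone can flip lift from "negatively correlated" to "positively correlated":

```python
def lift(n_bc, n_b, n_c, n):
    """lift(B, C) = P(B and C) / (P(B) * P(C)); 1 means independence."""
    return (n_bc / n) / ((n_b / n) * (n_c / n))

def chi_square(n_bc, n_b, n_c, n):
    """Pearson chi-square over the 2x2 contingency table for B and C."""
    total = 0.0
    for has_b in (True, False):
        for has_c in (True, False):
            observed = (n_bc if has_b and has_c
                        else n_b - n_bc if has_b
                        else n_c - n_bc if has_c
                        else n - n_b - n_c + n_bc)
            expected = ((n_b if has_b else n - n_b) *
                        (n_c if has_c else n - n_c) / n)
            total += (observed - expected) ** 2 / expected
    return total

# Made-up counts: 100 transactions contain both B and C, 1000 contain B,
# 1000 contain C. Only the number of null transactions changes below.
for n in (2000, 100_000):
    print(n, round(lift(100, 1000, 1000, n), 2),
          round(chi_square(100, 1000, 1000, n), 1))
# 2000    0.2  1280.0  -> B and C look negatively correlated
# 100000 10.0   826.4  -> the same B and C now look positively correlated
```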

## 3.3: Null Invariance Measures


## 3.4: Comparison of Null-Invariant Measures


## 4.1: Mining Multi-Level Associations

Items often form hierarchies

eg. Level 1: Milk, Level 2: 2% milk & skim milk

Shared multi-level mining: use the lowest min-support to pass down the set of candidates (mine the database in one scan)

Redundancy filtering: remove redundant rules that can be derived from higher level rules

Customized min-supports: set different min-support for different kind of items

## 4.2: Mining Multi-Dimensional Associations


## 4.3: Mining Quantitative Associations


## 4.4: Mining Negative Correlations

Rare patterns: very low support but interesting → setting customized min-support

Negative patterns (negative correlated patterns): unlikely to happen together

sup(A ∪ B) << sup(A) × sup(B)

Problem: the support-based definition is not null-invariant

A Kulczynski measure-based definition

  • If itemsets A and B are frequent but (P(A|B)+P(B|A))/2<ε, where ε is a negative pattern threshold, then A and B are negatively correlated
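A one-function sketch of this check (counts are illustrative):

```python
def kulczynski(n_ab, n_a, n_b):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, computed from support counts."""
    return 0.5 * (n_ab / n_b + n_ab / n_a)

# Made-up counts: A and B are each frequent, but rarely occur together.
n_a, n_b, n_ab, epsilon = 1000, 1000, 10, 0.05
print(kulczynski(n_ab, n_a, n_b))            # 0.01
print(kulczynski(n_ab, n_a, n_b) < epsilon)  # True -> negatively correlated
```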

## 4.5: Mining Compressed Patterns


Want to find typical patterns.

  • P1 and P2 share similar items and have similar support, so only list P2, which has more items
  • same for P4 and P5

## 5.1: Sequential Pattern and Sequential Pattern Mining

Sequential patterns: shopping sequences, medical treatments, natural disasters, weblog click streams, programming execution sequences, DNA, protein, etc.

Transaction DB, sequence DB vs time-series DB

Gapped vs non-gapped


Within a basket, the order of items doesn’t matter.

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

(a and b getting together, then c)

Algorithm requirement: efficient, scalable, finding complete set, incorporating various kinds of user-specific constraints

The Apriori property still holds: if a subsequence s_1 is infrequent, none of s_1’s super-sequences can be frequent

## 5.2: GSP: Apriori-Based Sequential Pattern Mining

  • Scan the DB for length-1 sequences, then prune those whose support is below min_sup.
  • Combine the frequent length-1 sequences to form candidate sequences of length 2.
  • Scan the DB for the length-2 candidates, then prune those whose support is below min_sup.
  • Combine the frequent length-2 sequences to form candidate sequences of length 3.
  • and so on
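A sketch of the support-counting step only: testing whether a candidate (a list of itemsets) is contained in a data sequence, then counting it over the sequence DB. The candidate generation and hash-tree pruning of full GSP are omitted, and the toy DB below is illustrative (in it, <(ab)c> has support 2):

```python
def contains(sequence, candidate):
    """True if `candidate` (a list of itemsets) is a subsequence of `sequence`:
    each candidate element must be a subset of some later basket."""
    pos = 0
    for element in candidate:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(db, candidate):
    return sum(1 for seq in db if contains(seq, candidate))

# Toy sequence database: each sequence is an ordered list of baskets.
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
print(support(db, [{"a", "b"}, {"c"}]))  # <(ab)c> has support 2
```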

## 5.3: SPADE — Sequential Pattern Mining in Vertical Data Format

SPADE: Sequential PAttern Discovery using Equivalent Class


## 5.4: PrefixSpan — Sequential Pattern Mining by Pattern-Growth


## 5.5: CloSpan — Mining Closed Sequential Patterns

A closed sequential pattern s: s is frequent and there exists no super-pattern s’ ⊃ s with the same support as s (in other words, s is not contained in a longer sequence with the same support).

Property P_1: if s ⊃ s_1 and the two projected databases have the same size, they are identical, so one of the two search branches can be pruned.


## 6.1: Mining Spatial Associations

Spatial frequent patterns and association rule: A → B [s%, c%]

  • s%: support
  • c%: confidence

A and B:

  • Topological relations: intersects, overlaps, disjoint, etc.
  • Spatial orientations: left_of, west_of, under, etc.
  • Distance information: close_to, within_distance, etc.

Example: If x is a large town and x intersects with a highway, then x is likely to be adjacent to water (lakes, rivers, etc.) with certain support and confidence.

is_a(x, large_town) ^ intersect(x, highway) → adjacent_to(x, water) [7%, 85%]

Hierarchy of spatial relationship: eg. close_to is a generalization of near_by, touch, intersect, contain

Progressive refinement: first search for rough relationship, then refine it

To save mining cost:

  • Step 1: MBR (Minimum Bounding Rectangle) or R-tree for rough estimation
  • Step 2: Apply more detailed algorithm for objects that have passed Step 1
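A sketch of the Step 1 rough filter: approximate every object by its minimum bounding rectangle (MBR) and pass only pairs whose MBRs intersect on to the detailed Step 2 test (coordinates are made up):

```python
from dataclasses import dataclass

@dataclass
class MBR:
    """Axis-aligned minimum bounding rectangle of a spatial object."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "MBR") -> bool:
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

# Made-up bounding boxes for a town, a highway, and a lake.
town = MBR(0, 0, 10, 10)
candidates = {"highway": MBR(8, -5, 9, 50), "lake": MBR(30, 30, 40, 40)}

# Step 1: cheap MBR test; only surviving pairs go to Step 2 (exact geometry).
for name, mbr in candidates.items():
    print(name, "-> Step 2" if town.intersects(mbr) else "-> pruned")
```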

## 6.2: Mining Spatial Colocation Patterns


A, B, C, D: features, e.g. regions where West Nile Virus is spreading, mosquito regions, cities, locations where birds are present, etc.

Edges: co-location (e.g., instances 6, 17, and 3 in the lecture example are co-located)

Rowset: list of places where particular features are co-located

cp: conditional probability


## 6.3: Mining and Aggregating Patterns over Multiple Trajectories


Time-annotated symbolic sequence


A flock pattern requires at least m entities moving together for at least k consecutive timestamps within a circular region of radius r. → A bit too rigid…

Convoy pattern is density-based, so it’s looser than the flock pattern.

Swarm pattern is even more relaxed.


## 6.4: Mining Semantics-Rich Movement Patterns

Frequent movement patterns (here) focus on the semantics of places, as opposed to frequent sequential patterns, which focus on the spatial locations themselves.

eg. Home→Starbucks→Office (the exact location does not matter)


## 6.5: Mining Periodic Movement Patterns

Periodic Movement Pattern in a Sparse Data: eg. bird’s nest

Find reference spots with a density-based method, then detect periods using the Fourier transform and auto-correlation.
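A minimal numpy sketch of the period-detection idea on synthetic data: visits to one reference spot become a binary sequence, and the dominant Fourier frequency suggests the period (the real method also uses auto-correlation and handles sparse, noisy observations):

```python
import numpy as np

# Synthetic daily in/out sequence for one reference spot (e.g. the nest):
# the animal visits the spot roughly every 7 days.
days = 364
t = np.arange(days)
visits = ((t % 7) == 0).astype(float)

# Fourier transform of the mean-removed sequence; the strongest
# non-zero frequency suggests the period.
spectrum = np.abs(np.fft.rfft(visits - visits.mean()))
freqs = np.fft.rfftfreq(days, d=1.0)
dominant = freqs[1:][np.argmax(spectrum[1:])]
print("estimated period:", 1.0 / dominant, "days")  # 7.0
```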


## 7.1: From Frequent Pattern Mining to Phrase Mining

Phrase: A natural, meaningful, unambiguous semantic unit (unigrams are often ambiguous)

General principle: Exploit information redundancy and data-driven criteria to determine phrase boundaries and salience

Methodology: Exploring three ideas

  • Frequent pattern mining and colocation analysis
  • Phrasal segmentation
  • Quality phrase assessment

## 7.2: Previous Phrase Mining Methods

Reducing annotation cost: Web n-grams, SotA 95% accuracy, 88% phrase-level F-score

Topic modeling: each topic is represented by a word distribution

Three strategies on phrase mining with topic modeling

  • Strategy 1: Generate bag-of-words → generate sequence of tokens
    Bigram Topic Model, Topical N-Gram (TNG), Phrase-Discovering LDA (PDLDA)
    High model complexity (tends to overfit) & high inference cost
  • Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
    TurboTopics, KERT
  • Strategy 3: Prior bag-of-words model inference, mine phrase and impose on the bag-of-words model

## 7.3: ToPMine: Phrase Mining without Training Data

The problem with Strategy 2 (topic modeling first, then phrase mining) is that the words of a phrase may be assigned to different topics

Solution: Switch order!

  • Phrase mining, document segmentation, and phrase ranking
  • Topic model inference with phrase constraint

Collocation mining: any of these will work


Phrase Candidate Generation:


eg. set α = 5 as the threshold (5 standard deviations)
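A sketch of one common collocation-significance score (an approximate z-score of observed vs. expected co-occurrence under independence); the exact statistic in ToPMine may differ in its details, and the counts below are made up:

```python
from math import sqrt

def significance(count_ab, count_a, count_b, total_words):
    """Approximate z-score of the bigram 'a b' against an independence null:
    (observed - expected) / sqrt(expected). One common collocation statistic;
    the exact score used in ToPMine may differ in its details."""
    expected = count_a * count_b / total_words
    return (count_ab - expected) / sqrt(expected)

# Made-up corpus counts: the two words co-occur far more often than chance,
# so the pair easily clears an alpha = 5 (five standard deviations) threshold.
score = significance(count_ab=120, count_a=400, count_b=300, total_words=1_000_000)
print(score > 5, round(score))  # True 346
```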

## 7.4: SegPhrase: Phrase Mining with Tiny Training Sets

ToPMine: no training data required

SegPhrase: a small set of training data may enhance the quality of phrase mining significantly


Classification → Phrasal Segmentation (SegPhrase)

Classification → Phrasal Segmentation → Classification → Phrasal Segmentation (SegPhrase+)

## Programming Assignment 2

## 8.1: Frequent Pattern Mining in Data Streams

Stream data mining tasks

  • Pattern mining in data streams (OUR INTEREST HERE!)
  • Multi-dimensional on-line analysis of streams
  • Clustering data streams
  • Classification of stream data
  • Mining outliers and anomalies in stream data

Unrealistic to mine precise patterns. Approximate answers are often sufficient. How to mine frequent patterns with good approximation?

Lossy Counting Algorithm: do not keep the items with a very low support count

  • Advantage: Guaranteed error bound
  • Disadvantage: Keeping a large set of traces
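A sketch of the lossy counting idea for single items in a stream (bucket width ⌈1/ε⌉; at each bucket boundary, entries whose count plus maximum undercount falls below the bucket id are dropped, which is what yields the guaranteed error bound):

```python
from math import ceil

def lossy_counting(stream, epsilon):
    """Approximate counts: for every retained item,
    true_count - epsilon * N <= estimate <= true_count."""
    width = ceil(1 / epsilon)           # bucket width
    counts, deltas = {}, {}             # estimated count, max possible undercount
    bucket = 1                          # current bucket id
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1
        if n % width == 0:              # bucket boundary: drop low-count entries
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
            bucket += 1
    return counts

# "a" appears in every other position; 500 distinct one-off items are noise.
stream = [x for i in range(500) for x in ("a", f"noise{i}")]
print(lossy_counting(stream, epsilon=0.01))  # {'a': 500}: the noise is pruned
```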

## 8.2: Pattern Discovery for Software Bug Mining


CP-Miner:

  • Tokenize code
  • Compute hash of the token sequence of each line to construct a sequence DB
  • Find similar patterns (allowing some gaps due to insertions) in the sequence DB
  • Find “forgot-to-change” tokens
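A sketch of the first two steps with a crude tokenizer (illustrative only; the real CP-Miner’s tokenization, identifier normalization, and frequent-subsequence mining are more involved):

```python
import re

KEYWORDS = {"if", "else", "for", "while", "return", "int"}

def tokenize(line):
    """Crude tokenizer: keep keywords, numbers, and operators, and map every
    other identifier to a single placeholder so copy-pasted-and-renamed code
    still hashes to the same value (a simplification of CP-Miner)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", line)
    return tuple(t if t in KEYWORDS or not t[0].isalpha() else "ID"
                 for t in tokens)

code = [
    "if (count > limit) return -1;",
    "total = total + count;",
    "if (size > limit) return -1;",   # copy-paste with a renamed variable
]
# Sequence DB: one hash per line; equal hashes mark candidate clone lines.
sequence_db = [hash(tokenize(line)) for line in code]
print(sequence_db[0] == sequence_db[2])  # True within a single run
```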

## 8.3: Pattern Discovery for Image Analysis


## 8.4: Advanced Topics on Pattern Discovery: Pattern Mining and Society — Privacy Issue

Three categories of privacy issues arising out of data mining

  • Input privacy (or data hiding): hide data to prevent miners from reliably extracting confidential or private information
  • Output privacy (or knowledge hiding): no disclosure of sensitive patterns or knowledge from datasets
  • Owner privacy: does not allow any party to reliably learn the data or sensitive information that other owners hold

Ensuring Input Privacy

K-anonymity privacy requirement and subsequent studies

  • k-anonymity: each equivalence class contains at least k records
  • k-anonymity alone is still not sufficient, which led to further studies such as l-diversity, t-closeness, and differential privacy

Statistical distortion: Using randomization algorithms

  • Independent attribute perturbation
  • Dependent attribute perturbation: take care of correlations across attributes

## 8.5: Advanced Topics on Pattern Discovery: Looking Forward
