Atsushi Takayama · Jan 20, 2019
## 1.1: What Is Pattern Discovery? Why Is It Important?
Patterns:
- a set of items, subsequences, or substructures that occur frequently together (or strongly correlated) in a data set
- represent intrinsic and important properties of datasets
Pattern discovery: Uncovering patterns from massive data sets
## 1.2: Frequent Patterns and Association Rules
(absolute) support (count): the number of transactions in which the itemset appears
minsup: minimum support threshold; if an itemset’s (relative) support meets or exceeds minsup, the itemset is frequent
Association rules: X → Y (s, c)
Support s: the probability that a transaction contains X ∪ Y
Confidence c: the conditional probability that a transaction containing X also contains Y
c = sup(X ∪ Y) / sup(X)
Association rule mining: find all of the rules, X → Y
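A minimal Python sketch (the toy transactions are assumed for illustration, echoing the lecture’s beer/diaper example) of computing s and c for one candidate rule:

```python
# Assumed toy transaction database.
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"beer"}, {"diaper"}
s = support(X | Y, transactions)   # probability a transaction contains X and Y
c = s / support(X, transactions)   # c = sup(X ∪ Y) / sup(X)
print(f"{X} -> {Y} (s={s:.2f}, c={c:.2f})")  # s=0.60, c=1.00
```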
## 1.3: Compressed Representation: Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, so there are far too many frequent patterns for any computer to compute or store them all!
Closed pattern (close pattern, closed frequent pattern): A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X, with the same support as X
Max-pattern (max frequent pattern): A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
A closed pattern gives lossless compression (the support of every frequent pattern can be recovered from the closed patterns), while a max-pattern gives lossy compression: we only know whether a pattern is frequent, not its exact support.
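A minimal sketch, assuming a toy table of precomputed frequent itemsets with their support counts, showing how closed and max patterns are derived:

```python
# Assumed toy input: all frequent itemsets with their support counts.
frequent = {
    frozenset("a"): 4,
    frozenset("b"): 3,
    frozenset("ab"): 3,
    frozenset("abc"): 2,
}

def is_closed(x, freq):
    # Closed: no proper super-pattern has the same support.
    return not any(x < y and freq[y] == freq[x] for y in freq)

def is_max(x, freq):
    # Max: no proper super-pattern is frequent at all.
    return not any(x < y for y in freq)

closed = [sorted(x) for x in frequent if is_closed(x, frequent)]
maxpat = [sorted(x) for x in frequent if is_max(x, frequent)]
print("closed:", closed)  # ['a'], ['a','b'], ['a','b','c'] (supports recoverable)
print("max:", maxpat)     # ['a','b','c'] only (sub-pattern supports are lost)
```

Note that sup(b) is recoverable from the closed patterns (it equals the largest support among b’s closed supersets, here sup(ab) = 3), but not from the max-pattern alone.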
## 2.1: The Downward Closure Property of Frequent Patterns
The downward closure (also called “Apriori”) property of frequent patterns
Any subset of a frequent itemset must be frequent. (if {beer, diaper, nuts} is frequent, so is {beer, diaper})
Apriori pruning principle: if any subset of an itemset S is infrequent, then there is no chance for S to be frequent.
## 2.2: The Apriori Algorithm
level-wise, candidate generation and test
Initially, scan DB once to get frequent 1-itemset
Repeat
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against DB to find frequent (k+1)-itemsets
- Set k := k+1
Until no frequent or candidate set can be generated
Return all the frequent itemsets derived
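A minimal Python sketch of this level-wise loop (toy data, no optimizations):

```python
from itertools import combinations

def apriori(db, minsup):
    """db: list of transactions (sets); minsup: absolute support count."""
    items = {i for t in db for i in t}
    # Scan DB once to get frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) >= minsup}
    all_freq, k = set(freq), 1
    while freq:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Apriori pruning: every length-k subset of a candidate must be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        freq = {c for c in cands if sum(c <= t for t in db) >= minsup}
        all_freq |= freq
        k += 1
    return all_freq

db = [{"beer", "nuts", "diaper"}, {"beer", "diaper"}, {"nuts", "diaper"}]
print(apriori(db, minsup=2))  # includes frozenset({'beer', 'diaper'})
```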
## 2.3: Extensions or Improvements of Apriori
Scanning the entire database repeatedly is costly. Partition it into k sub-databases: any globally frequent pattern must be frequent in at least one of the sub-databases (otherwise its total count could not reach minsup).
How to partition the database? So that each sub-database fits into main memory; then each partition needs only one scan.
## 2.4: Mining Frequent Patterns by Exploring Vertical Data Format
ECLAT: Equivalence Class Transformation
- 10: a, b, d, e
- 20: a, b, e
- 30: b, c, e
Transform this into
- a: 10, 20
- b: 10, 20, 30
- c: 30
- d: 10
- e: 10, 20, 30
Then the TID-list of “ae” is the intersection of the TID-lists of “a” and “e”: {10, 20}, so sup(ae) = 2.
Using diffsets (storing the difference between an itemset’s TID-list and its prefix’s, rather than the full lists) accelerates mining.
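A minimal sketch of the vertical-format idea, using the corrected TID-lists above: the support of any itemset is the size of the intersection of its items’ TID-lists.

```python
# TID-lists from the example above.
tidlists = {
    "a": {10, 20},
    "b": {10, 20, 30},
    "c": {30},
    "d": {10},
    "e": {10, 20, 30},
}

def tids(itemset):
    """TID-list of an itemset = intersection of its items' TID-lists."""
    return set.intersection(*(tidlists[i] for i in itemset))

print(tids("ae"))        # {10, 20}, so sup(ae) = 2
print(len(tids("abe")))  # 2
```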
## 2.5: FPGrowth: A Pattern Growth Approach
FPGrowth: frequent pattern growth
- Find frequent single items and partition the database based on each such item
- Recursively grow frequent patterns by doing the above for each partitioned database (also called conditional database)
Use data structure FP-tree
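A minimal sketch of the pattern-growth recursion, using explicit conditional (projected) databases instead of an actual FP-tree; the FP-tree is a compressed representation of exactly these projections, but the recursion is the same.

```python
from collections import Counter

def pattern_growth(db, minsup, suffix=frozenset()):
    """db: list of transactions (sets); yields (frequent itemset, count)."""
    counts = Counter(i for t in db for i in t)
    for item, cnt in counts.items():
        if cnt < minsup:
            continue
        pattern = suffix | {item}
        yield pattern, cnt
        # Conditional database: transactions containing `item`, with `item`
        # removed; keep only items ordered before `item` to avoid duplicates.
        cond_db = [{i for i in t if i < item} for t in db if item in t]
        yield from pattern_growth(cond_db, minsup, pattern)

db = [{"f", "a", "c", "m", "p"}, {"f", "a", "c", "b", "m"}, {"f", "b"}]
for pat, cnt in pattern_growth(db, minsup=2):
    print(sorted(pat), cnt)
```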
## 2.6: Mining Closed Patterns
CLOSET+: Mining Closed Itemsets by Pattern-Growth
## Programming Assignment 1
## 3.1: Limitation of the Support-Confidence Framework
Not all patterns mined are interesting. Interestingness measure:
- Objective: support, confidence, correlation
- Subjective (may vary from person to person): query-based, unexpected, freshness, timeliness, multi-dimensional visualization tools
## 3.2: Interestingness Measures: Lift and χ2
Support & confidence are not good enough as interesting measures, what else?: Lift and χ2
Null transactions: Transactions that contain neither B nor C
Too many null transactions causes a problem with Lift and χ2.
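A small worked example (toy counts assumed) of why Lift is not null-invariant:

```python
def lift(n_bc, n_b, n_c, n_total):
    """Lift(B, C) = P(B, C) / (P(B) * P(C)), computed from raw counts."""
    return (n_bc / n_total) / ((n_b / n_total) * (n_c / n_total))

# 100 transactions: B and C co-occur exactly as often as chance predicts.
print(lift(n_bc=10, n_b=20, n_c=50, n_total=100))   # 1.0 (independent)
# Same B/C counts, but 900 null transactions added: Lift jumps to 10.
print(lift(n_bc=10, n_b=20, n_c=50, n_total=1000))  # 10.0 (looks correlated!)
```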
## 3.3: Null Invariance Measures
## 3.4: Comparison of Null-Invariant Measures
## 4.1: Mining Multi-Level Associations
Items often form hierarchies
eg. Level 1: Milk, Level 2: 2% milk & skim milk
Shared multi-level mining: use the lowest min-support to pass down the set of candidates (mine the database in one scan)
Redundancy filtering: remove redundant rules that can be derived from higher level rules
Customized min-supports: set different min-supports for different kinds of items
## 4.2: Mining Multi-Dimensional Associations
## 4.3: Mining Quantitative Associations
## 4.4: Mining Negative Correlations
Rare patterns: very low support but interesting → handle by setting customized min-supports
Negative patterns (negative correlated patterns): unlikely to happen together
sup(A ∪ B) << sup(A) × sup(B)
Problem: the support-based definition is not null-invariant
A Kulczynski measure-based definition
- If itemsets A and B are frequent but (P(A|B)+P(B|A))/2<ε, where ε is a negative pattern threshold, then A and B are negatively correlated
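A minimal sketch of this check (the ε value is assumed for illustration); note that the total transaction count never appears, which is why Kulczynski is null-invariant:

```python
def kulczynski(n_ab, n_a, n_b):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, computed from raw counts."""
    return (n_ab / n_b + n_ab / n_a) / 2

# A and B are each frequent (100 occurrences) but almost never co-occur.
eps = 0.1  # assumed negative-pattern threshold
print(kulczynski(n_ab=2, n_a=100, n_b=100) < eps)  # True: negatively correlated
```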
## 4.5: Mining Compressed Patterns
Want to find a small set of typical patterns that compress the full result. In the lecture’s example (patterns P1–P5):
- P1 and P2 share similar items and have similar support, so list only P2, which has more items
- the same holds for P4 and P5
## 5.1: Sequential Pattern and Sequential Pattern Mining
Sequential patterns: shopping sequences, medical treatments, natural disasters, weblog click streams, programming execution sequences, DNA, protein, etc.
Transaction DB, sequence DB vs time-series DB
Gapped vs non-gapped
Within a basket (one element of the sequence), order doesn’t matter.
eg. Given support threshold min_sup = 2, <(ab)c> is a sequential pattern in the lecture’s example database
(a and b occur together in one element, then c follows)
Algorithm requirement: efficient, scalable, finding complete set, incorporating various kinds of user-specific constraints
The Apriori property still holds: if a subsequence s_1 is infrequent, none of s_1’s super-sequences can be frequent
## 5.2: GSP: Apriori-Based Sequential Pattern Mining
- Scan for length-1 sequences, then prune those whose support is below min_sup.
- Combine the frequent length-1 sequences to form candidate sequences of length 2.
- Scan for length-2 sequences, then prune those whose support is below min_sup.
- Combine the frequent length-2 sequences to form candidate sequences of length 3.
- and so on
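A minimal sketch of this loop for sequences of single items (real GSP also handles itemset elements and time constraints, which are omitted here):

```python
def is_subseq(s, seq):
    """True if s is a (gapped) subsequence of seq."""
    it = iter(seq)
    return all(x in it for x in s)

def gsp(db, minsup):
    """db: list of sequences (tuples of items); minsup: absolute count."""
    items = {i for seq in db for i in seq}
    freq = [(i,) for i in items if sum(i in seq for seq in db) >= minsup]
    result = list(freq)
    while freq:
        # Join step: s and t overlap on all but their first/last item.
        cands = {s + (t[-1],) for s in freq for t in freq if s[1:] == t[:-1]}
        # Count step: scan the DB and keep candidates meeting min_sup.
        freq = [c for c in cands
                if sum(is_subseq(c, seq) for seq in db) >= minsup]
        result += freq
    return result

db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d")]
print(gsp(db, minsup=2))  # includes ('a',), ('a', 'c'), ('a', 'b', 'c'), ...
```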
## 5.3: SPADE — Sequential Pattern Mining in Vertical Data Format
SPADE: Sequential PAttern Discovery using Equivalence classes
## 5.4: PrefixSpan — Sequential Pattern Mining by Pattern-Growth
## 5.5: CloSpan — Mining Closed Sequential Patterns
A closed sequential pattern s: there exists no super-pattern s’ ⊃ s such that s’ and s have the same support (in other words, s is the longest sequence at that support level).
Property P_1: if s ⊃ s_1 and the projected databases of s and s_1 have the same size, the projections are identical, so s and s_1 have the same support and the redundant branch of the search can be pruned.
## 6.1: Mining Spatial Associations
Spatial frequent patterns and association rule: A → B [s%, c%]
- s%: support
- c%: confidence
A and B:
- Topological relations: intersects, overlaps, disjoint, etc.
- Spatial orientations: left_of, west_of, under, etc.
- Distance information: close_to, within_distance, etc.
Example: If x is a large town and x intersects with a highway, then x is likely to be adjacent to water (lakes, rivers, etc.) with certain support and confidence.
is_a(x, large_town) ^ intersect(x, highway) → adjacent_to(x, water) [7%, 85%]
Hierarchy of spatial relationship: eg. close_to is a generalization of near_by, touch, intersect, contain
Progressive refinement: first search for rough relationship, then refine it
To save mining cost:
- Step 1: use MBRs (Minimum Bounding Rectangles) or an R-tree for a rough, cheap estimate
- Step 2: apply a more detailed (and expensive) algorithm only to the objects that passed Step 1
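A minimal sketch of the Step 1 rough filter, assuming axis-aligned MBRs stored as (xmin, ymin, xmax, ymax) tuples:

```python
def mbr_overlap(a, b):
    """Cheap test: do two minimum bounding rectangles intersect?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

town = (0, 0, 10, 10)
highway = (8, -5, 9, 50)
lake = (30, 30, 40, 40)
print(mbr_overlap(town, highway))  # True: pass on to the exact Step 2 test
print(mbr_overlap(town, lake))     # False: pruned cheaply, no exact test needed
```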
## 6.2: Mining Spatial Colocation Patterns
A, B, C, D: eg. regions where West Nile Virus is spreading, mosquito regions, cities, locations where birds are present, etc.
Edges: co-location (eg. in the lecture’s example graph, instances 6, 17 and 3 are co-located)
Rowset: list of places where particular features are co-located
cp: conditional probability
## 6.3: Mining and Aggregating Patterns over Multiple Trajectories
Time-annotated symbolic sequence
A flock pattern requires at least m entities moving together for at least k consecutive timestamps within a circular region of radius r. → A bit too rigid…
Convoy pattern is density-based, so it’s looser than the flock pattern.
Swarm pattern is even more relaxed.
## 6.4: Mining Semantics-Rich Movement Patterns
Frequent Movement Pattern focuses on the semantics of places, as opposed to Frequent Sequential Pattern that focuses on spatial pattern.
eg. Home→Starbucks→Office (the exact location does not matter)
## 6.5: Mining Periodic Movement Patterns
Periodic Movement Patterns in Sparse Data: eg. a bird periodically returning to its nest
Find reference spots with density-based method, then detect periods using Fourier Transform and auto-correlation.
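A minimal sketch of the period-detection step on synthetic data, using autocorrelation with numpy (the lecture pairs this with the Fourier transform; only autocorrelation is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(365)
signal = ((t % 7) < 2).astype(float)        # at the reference spot 2 days/week
signal += rng.normal(0, 0.1, size=t.size)   # observation noise

x = signal - signal.mean()
acf = np.correlate(x, x, mode="full")[x.size:]  # autocorrelation at lags 1..364
period = int(np.argmax(acf[:30])) + 1           # strongest lag under 30 days
print(period)  # 7
```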
## 7.1: From Frequent Pattern Mining to Phrase Mining
Phrase: A natural, meaningful, unambiguous semantic unit (unigrams are often ambiguous)
General principle: Exploit information redundancy and data-driven criteria to determine phrase boundaries and salience
Methodology: Exploring three ideas
- Frequent pattern mining and colocation analysis
- Phrasal segmentation
- Quality phrase assessment
## 7.2: Previous Phrase Mining Methods
Reducing annotation cost: use Web n-grams instead of manual annotation; state of the art ≈ 95% accuracy, 88% phrase-level F-score
Topic modeling: each topic is represented by a word distribution
Three strategies on phrase mining with topic modeling:
- Strategy 1: Generate bag-of-words → generate sequence of tokens (Bigram Topic Model, Topical N-Gram (TNG), Phrase-Discovering LDA (PDLDA)); high model complexity (tends to overfit) and high inference cost
- Strategy 2: Post bag-of-words model inference, visualize topics with n-grams (TurboTopics, KERT)
- Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
## 7.3: ToPMine: Phrase Mining without Training Data
The problem with Strategy 2 (topic modeling → phrase mining) is that the words of one phrase may be assigned to different topics, splitting the phrase
Solution: Switch order!
- Phrase mining, document segmentation, and phrase ranking
- Topic model inference with phrase constraint
Collocation mining: any standard collocation significance measure will work.
Phrase candidate generation:
eg. merge adjacent tokens whose collocation significance exceeds a threshold α = 5 (roughly 5 standard deviations above what chance would predict)
## 7.4: SegPhrase: Phrase Mining with Tiny Training Sets
ToPMine: no training data required
SegPhrase: a small set of training data may enhance the quality of phrase mining significantly
Classification → Phrasal Segmentation (SegPhrase)
Classification → Phrasal Segmentation → Classification → Phrasal Segmentation (SegPhrase+)
## Programming Assignment 2
## 8.1: Frequent Pattern Mining in Data Streams
Stream data mining tasks
- Pattern mining in data streams (OUR INTEREST HERE!)
- Multi-dimensional on-line analysis of streams
- Clustering data streams
- Classification of stream data
- Mining outliers and anomalies in stream data
Unrealistic to mine precise patterns. Approximate answers are often sufficient. How to mine frequent patterns with good approximation?
Lossy Counting algorithm: do not keep items with very low support count
- Advantage: guaranteed error bound
- Disadvantage: may still need to keep a large set of counters (traces)
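A minimal sketch of Lossy Counting, with bucket width w = ⌈1/ε⌉; every kept count underestimates the true count by at most εN:

```python
from math import ceil

def lossy_count(stream, epsilon):
    w = ceil(1 / epsilon)      # bucket width
    counts, deltas = {}, {}    # item -> count, maximum possible undercount
    for n, item in enumerate(stream, 1):
        bucket = ceil(n / w)
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % w == 0:  # at each bucket boundary, drop low-count traces
            for i in [i for i in counts if counts[i] + deltas[i] <= bucket]:
                del counts[i], deltas[i]
    # counts[i] <= true count of i <= counts[i] + deltas[i] <= counts[i] + eps*n
    return counts

stream = ["a", "b", "a", "c", "a", "a", "b", "d"] * 50
print(lossy_count(stream, epsilon=0.05))  # 'a' keeps the largest count
```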
## 8.2: Pattern Discovery for Software Bug Mining
CP-Miner:
- Tokenize code
- Compute hash of the token sequence of each line to construct a sequence DB
- Find similar patterns (allowing some gaps due to insertions) in the sequence DB
- Find “forgot-to-change” tokens
## 8.3: Pattern Discovery for Image Analysis
## 8.4: Advanced Topics on Pattern Discovery: Pattern Mining and Society — Privacy Issue
Three categories of privacy issues arising out of data mining
- Input privacy (or data hiding): hide data to prevent miners from reliably extracting confidential or private information
- Output privacy (or knowledge hiding): no disclosure of sensitive patterns or knowledge from datasets
- Owner privacy: does not allow any party to reliably learn the data or sensitive information that other owners hold
Ensuring Input Privacy
K-anonymity privacy requirement and subsequent studies
- k-anonymity: each equivalence class contains at least k records
- It is still not sufficient, thus leads to further studies, such as l-diversity, t-closeness, and differential privacy
Statistical distortion: Using randomization algorithms
- Independent attribute perturbation
- Dependent attribute perturbation: take care of correlations across attributes
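A minimal sketch of independent attribute perturbation on synthetic data (the noise scale is chosen arbitrarily for illustration): noise is added to each attribute independently, so aggregate statistics survive while individual records are masked.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(20, 60, size=10_000).astype(float)  # sensitive attribute
released = ages + rng.normal(0, 15, size=ages.size)     # perturbed copy

# Aggregates stay close, so miners can still estimate distributions...
print(round(ages.mean(), 2), round(released.mean(), 2))
# ...but any single released value says little about that person's true age.
```

Dependent attribute perturbation would instead draw the noise jointly (eg. preserving the covariance across attributes) so that cross-attribute correlations are not destroyed.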