A Granular Classifier for PAKDD 2015 Data Mining Competition
Wojtek Świeboda, University of Warsaw
May 20, 2015
Overview
- Exploratory Data Analysis
- Observation-level classifier
- Granular classifier
- Summary
Exploratory Data Analysis
Doing detective work, exploring the dataset. There seem to be patterns/artifacts left over from data processing.
Figure: Objects in training and test files by session begin timestamp (left); autocorrelation of decisions by lag (right).
Exploratory Data Analysis
Working hypothesis: consecutive observations in the training and test files correspond to the same entity (a user? an IP address?).
Upon seeing leaderboard entries with outstanding scores, I decided to investigate further: can such entities in the training and test files be matched?
Define a block, group, or granule as a maximal set of consecutive observations with non-decreasing timestamps and (in the training file) a consistent decision.
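The block definition above can be sketched as a simple segmentation pass: start a new block whenever the timestamp decreases or, in the training file, the decision label changes. The function and the toy data are illustrative, not the competition data.

```python
# Sketch of the block/granule segmentation: a new block starts whenever
# the timestamp decreases or (training file only) the decision changes.

def split_into_blocks(timestamps, decisions=None):
    """Assign a block id to each observation.

    A block is a maximal run of consecutive observations with
    non-decreasing timestamps and (if decisions are given) a
    constant decision.
    """
    block_ids = []
    current = 0
    for i, t in enumerate(timestamps):
        if i > 0:
            ts_drop = t < timestamps[i - 1]
            dec_change = decisions is not None and decisions[i] != decisions[i - 1]
            if ts_drop or dec_change:
                current += 1
        block_ids.append(current)
    return block_ids

# Toy example: timestamp drops at index 3, decision flips at index 5.
ts = [1, 2, 5, 3, 4, 6, 7]
dec = ['M', 'M', 'M', 'M', 'M', 'F', 'F']
print(split_into_blocks(ts, dec))  # -> [0, 0, 0, 1, 1, 2, 2]
```

For the test file, calling the function without the `decisions` argument segments on timestamps alone, which is why test-file blocks are less robust.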
Exploratory Data Analysis: Granule identification
Figure: Block lengths in training file (left) and test file (right).
At the end of the training and test files, detected blocks are smaller and more likely to be merged by mere chance. Blocks in the training file are slightly more robust to this thanks to the information about decisions.
Problem Statement
Not a typical Data Mining competition! Regularities in the dataset pose a unique problem:
- Try to recover the original data structure (match corresponding blocks in the training and test files).
- If that is not possible, apply Data Mining methods...
- ... or apply a combination of both.
Observation-level classifier
Input features:
1. indicators of A-, B-, C-, D-level identifiers of observed items,
2. the total number of observed items,
3. the length of the session,
4. predicted fraction of male observations based on hour alone,
5. predicted fraction of male observations based on date alone.
Classifiers used: bagging with decision trees and a random forest. Since the results were very robust with respect to parameters, I did very little parameter tuning.
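A minimal sketch of the observation-level classifier under the feature list above. The feature layout and the randomly generated data are assumptions for illustration only; the competition data is not reproduced here.

```python
# Illustrative random-forest sketch; features mirror the list above,
# but all values here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),       # indicator of an A-level identifier
    rng.integers(1, 50, n),      # total number of observed items
    rng.uniform(0, 3600, n),     # session length (seconds; unit assumed)
    rng.uniform(0, 1, n),        # predicted male fraction from hour alone
    rng.uniform(0, 1, n),        # predicted male fraction from date alone
])
y = rng.integers(0, 2, n)        # 0 = male, 1 = female (encoding assumed)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
raw_scores = clf.predict_proba(X)[:, 1]  # the raw output f(x) used later
```

The raw per-observation scores (`predict_proba` output) are what the granular classifier later averages over blocks.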
Exploratory Data Analysis
Figure: Fraction of male visitors by hour (left) and by date (right). These figures illustrate some of the features included in the observation-level classifier. The figure on the left shows, e.g., that visitors late at night are more likely to be male; such visits are nevertheless very rare, as indicated by point sizes. The figure is imperfect, as the smoothing used to produce it does not account for the clock being cyclic, i.e. the leftmost and rightmost endpoints do not meet. The figure on the right highlights likely local trends in the male/female ratio.
Back to identified granules...
Figure: Blocks (granules) identified in training and test files.
Partial matching of blocks
Figure: A diagram illustrating the matching of blocks identified in the test data (bottom) to blocks in the training data (top). Colors correspond to decisions; heights of rectangles correspond to the number of observations in each block. Not all blocks are identified correctly, and not all of them were matched. Matching is based on block lengths and on comparing the decision (training file) with the individual-level classifier's raw output averaged over a granule (test file).
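The matching criterion can be sketched as follows: a test granule is matched to a candidate training block when the block lengths agree and the classifier's average raw score over the granule points to the same decision as the training block. The threshold value and the decision encoding (score above threshold means "female") are assumptions for this sketch.

```python
# Illustrative matching criterion; threshold and labels are assumptions.

def match_block(test_len, test_avg_score, train_len, train_decision,
                threshold=0.5):
    """Return True if the test granule is matched to the training block."""
    if test_len != train_len:
        return False
    predicted = 'female' if test_avg_score > threshold else 'male'
    return predicted == train_decision

print(match_block(37, 0.71, 37, 'female'))  # True: lengths and decisions agree
print(match_block(37, 0.71, 36, 'female'))  # False: lengths differ
print(match_block(37, 0.71, 37, 'male'))    # False: averaged score disagrees
```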
Classification
Figure: Whenever possible, decision classes are assigned based on the matching between blocks. The remaining objects are classified either as whole granules (white rectangles) or as separate objects (squares marked in grey).
Classification
$U_{tr}$, $U_{te}$: training and test files. $I_{tr}$ and $I_{te}$ are relations on $U_{tr}$ and $U_{te}$: are two objects from the same block?
$m: U_{te}/I_{te} \to U_{tr}/I_{tr} \cup \{\bot\}$ is the partial matching (where $\bot$ corresponds to "unknown"). For an observation $x$, $f(x)$ is the raw score (raw output) from the individual-level classifier.
\[
\Theta_\theta(s) = \begin{cases} \text{female} & \text{if } s > \theta \\ \text{male} & \text{otherwise} \end{cases}
\]
The classifier $\mathrm{dec}: U_{te} \to \{\text{male}, \text{female}\}$ used in the submission assigns decisions as follows:
\[
\mathrm{dec}(x) = \begin{cases}
\mathrm{dec}(y) \text{ for } y \in m([x]_{I_{te}}) & \text{if } m([x]_{I_{te}}) \neq \bot, \\
\Theta_\theta\!\Bigl(\tfrac{1}{|[x]_{I_{te}}|}\sum_{y \in [x]_{I_{te}}} f(y)\Bigr) & \text{if } m([x]_{I_{te}}) = \bot \text{ and } r([x]_{I_{te}}) = 1, \\
\Theta_\theta(f(x)) & \text{otherwise,}
\end{cases}
\]
where $r([x]_{I_{te}}) = 1$ indicates a granule classified as a whole.
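The three-tier decision rule can be sketched as a lookup cascade: use the matched training block's decision when available, otherwise classify the whole granule by its averaged raw score, otherwise fall back to the per-observation score. The dictionaries `matching` and `reliable` and all score values are illustrative placeholders.

```python
# Sketch of the three-tier decision rule; all data here is hypothetical.

def theta(score, thr=0.5):
    """Threshold rule: 'female' if the raw score exceeds theta."""
    return 'female' if score > thr else 'male'

def decide(block_id, scores_in_block, x_score, matching, reliable):
    matched = matching.get(block_id)       # m([x]_I_te); None = unknown
    if matched is not None:
        return matched                     # decision of the matched block
    if reliable.get(block_id, False):      # r([x]_I_te) = 1
        avg = sum(scores_in_block) / len(scores_in_block)
        return theta(avg)                  # classify the granule as a whole
    return theta(x_score)                  # per-observation fallback

matching = {0: 'male'}    # block 0 matched to a 'male' training block
reliable = {1: True}      # block 1 classified as a whole granule
print(decide(0, [0.2, 0.3], 0.2, matching, reliable))  # 'male' (matched)
print(decide(1, [0.8, 0.9], 0.8, matching, reliable))  # 'female' (avg > 0.5)
print(decide(2, [0.4], 0.4, matching, reliable))       # 'male' (fallback)
```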
Plans for the article
Figure: Fraction of male visitors by hour. Plan: derive a smoothing method with constraints (for a cyclic domain).
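One way to realize the planned cyclic-domain smoothing is a kernel smoother whose distance wraps around the 24-hour clock, so the leftmost and rightmost endpoints meet. This Nadaraya-Watson-style sketch, the bandwidth, and the toy data are assumptions, not the method the article will derive.

```python
# Sketch of cyclic-aware kernel smoothing: distances wrap modulo the
# period, so hour 0 and hour 24 receive (near-)identical estimates.
import math

def circular_smooth(hours, values, grid, period=24.0, bandwidth=2.0):
    smoothed = []
    for g in grid:
        num = den = 0.0
        for h, v in zip(hours, values):
            d = abs(g - h) % period
            d = min(d, period - d)                 # wrap-around distance
            w = math.exp(-0.5 * (d / bandwidth) ** 2)
            num += w * v
            den += w
        smoothed.append(num / den)
    return smoothed

hours = [0, 6, 12, 18, 23]            # hypothetical hourly summary
frac_male = [0.45, 0.25, 0.30, 0.35, 0.50]
fit = circular_smooth(hours, frac_male, grid=[0.0, 23.999])
# The estimates at hour 0 and hour ~24 nearly coincide: the curve closes.
```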
Thank you