A Granular Classifier for PAKDD 2015 Data Mining Competition




A Granular Classifier for PAKDD 2015 Data Mining Competition
Wojtek Świeboda, University of Warsaw
May 20, 2015

Overview
1. Exploratory Data Analysis
2. Observation-level classifier
3. Granular classifier
4. Summary

Exploratory Data Analysis
Doing detective work, exploring the dataset. There seem to be patterns or artifacts left over from data processing.
Figure: Objects in the training and test files by session begin timestamp (left); autocorrelation of decisions by lag (right).
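The autocorrelation plot mentioned on this slide can be reproduced in spirit with a few lines of code. The following is a minimal sketch, assuming decisions are encoded as 1 (male) and 0 (female) in file order; the function name `autocorrelation` and the toy data are illustrative and do not come from the talk.

```python
from typing import Sequence


def autocorrelation(values: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of `values` at a non-negative lag."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        raise ValueError("a constant sequence has undefined autocorrelation")
    # Covariance between the sequence and a copy of itself shifted by `lag`.
    cov = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Hypothetical toy data: decisions arranged in runs, as the slide suggests.
decisions = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
acf = [autocorrelation(decisions, k) for k in range(4)]
```

A clearly positive value at small lags is what one would expect if consecutive rows tend to share a decision.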

Exploratory Data Analysis
Working hypothesis: consecutive observations in the training and test files correspond to the same entity (a user? an IP address?). After seeing leaderboard entries with outstanding scores, I decided to investigate further: can such entities in the training and test files be matched?
Define a block, group, or granule as a set of consecutive observations with non-decreasing timestamps and, for the training file, a consistent decision.
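The granule definition above translates directly into a single pass over the file. This is a minimal sketch, assuming observations are given in file order as parallel lists; the name `identify_granules` and the argument layout are hypothetical. Passing `decisions=None` mimics the test file, where only timestamps are available.

```python
from typing import Hashable, List, Optional, Sequence


def identify_granules(
    timestamps: Sequence[float],
    decisions: Optional[Sequence[Hashable]] = None,
) -> List[List[int]]:
    """Split row indices into runs with non-decreasing timestamps
    and, when decisions are known, a single consistent decision."""
    granules: List[List[int]] = []
    current: List[int] = []
    for i, ts in enumerate(timestamps):
        if current:
            prev = current[-1]
            time_break = ts < timestamps[prev]
            decision_break = decisions is not None and decisions[i] != decisions[prev]
            if time_break or decision_break:
                granules.append(current)
                current = []
        current.append(i)
    if current:
        granules.append(current)
    return granules


# Hypothetical toy data.
ts = [10, 12, 12, 5, 6, 7, 3]
dec_tr = ["m", "m", "f", "f", "f", "f", "f"]
train_blocks = identify_granules(ts, dec_tr)
test_blocks = identify_granules(ts)
```

With decisions available, the run of rows 0 to 2 is split where the decision changes; without them it stays merged, which illustrates why training blocks are somewhat more robust.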

Exploratory Data Analysis: Granule identification
Figure: Block lengths in the training file (left) and the test file (right).
At the end of the training and test files, detected blocks are smaller and are more likely to be merged by mere chance. Blocks in the training file are slightly more robust to this thanks to the information about decisions.

Problem Statement
Not a typical Data Mining competition! Regularities in the dataset pose a unique problem:
1. Try to recover the original data structure (match corresponding blocks in the training and test files).
2. If that is not possible, apply Data Mining methods...
3. ...or apply a combination of both.

Observation-level classifier
Input features:
1. indicators of A-, B-, C-, and D-level identifiers of observed items,
2. the total number of observed items,
3. the length of the session,
4. predicted fraction of male observations based on hour alone,
5. predicted fraction of male observations based on date alone.
Classifiers used: bagging with decision trees and a random forest. Since the results were very robust with respect to parameters, I did very little parameter tuning.

Exploratory Data Analysis
Figure: Fraction of male visitors by hour of day (left) and by date (right).
These figures illustrate some of the features included in the observation-level classifier. The figure on the left shows, for example, that visitors late at night are more likely to be male. Such visits are nevertheless very rare, as indicated by point sizes. The figure is imperfect because the smoothing used to produce it does not account for the clock being cyclic, i.e. the leftmost and rightmost endpoints do not meet. The figure on the right highlights likely local trends in the male/female ratio.
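One way to make the hour-of-day curve respect the cyclic clock is to measure distances around the circle before applying a kernel. The sketch below is not the smoothing used in the talk; it is an assumed alternative (a count-weighted Gaussian kernel estimate with wrap-around distance), and the names `circular_smooth` and `bandwidth` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def circular_smooth(
    counts: Dict[int, Tuple[int, int]],
    bandwidth: float = 2.0,
    period: int = 24,
) -> List[float]:
    """Smooth the male fraction over a cyclic domain.

    `counts` maps hour -> (male_count, total_count). Hours near midnight
    on either side are treated as neighbours.
    """
    smoothed = []
    for h in range(period):
        num = 0.0
        den = 0.0
        for g, (male, total) in counts.items():
            d = abs(h - g) % period
            d = min(d, period - d)  # distance around the clock
            w = math.exp(-0.5 * (d / bandwidth) ** 2)
            num += w * male
            den += w * total
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# Hypothetical toy counts: observations only at 23:00 and 01:00.
toy = {23: (3, 4), 1: (1, 4)}
curve = circular_smooth(toy)
```

Because hour 0 is equally far from 23:00 and 01:00 on the circle, the estimate there is the pooled fraction, and the curve joins up at the endpoints instead of diverging.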

Back to identified granules...
Figure: Blocks (granules) identified in the training and test files.

Partial matching of blocks
Figure: A diagram illustrating the matching of blocks identified in the test data (bottom) to blocks in the training data (top). Colors correspond to decisions. Heights of rectangles correspond to the number of observations in each block.
Not all blocks are identified correctly, and not all of them were matched. Matching is based on block lengths and on a comparison of the decision (training file) with the raw output of the observation-level classifier averaged over a granule (test file).

Classification
Figure: Whenever possible, decision classes are assigned based on the matching between blocks. The remaining objects are classified either as whole granules (white rectangles) or as separate objects (squares marked in grey).

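Classification
$U_{tr}$, $U_{te}$: training and test files.
$I_{tr}$ and $I_{te}$ are relations on $U_{tr}$ and $U_{te}$: are two objects from the same block?
$m : U_{te}/I_{te} \to U_{tr}/I_{tr} \cup \{\bot\}$ is the partial matching (where $\bot$ corresponds to unknown).
For an observation $x$, $f(x)$ is the raw score (raw output) from the observation-level classifier.

\[
\Theta_\theta(x) =
\begin{cases}
\text{female} & \text{if } x > \theta \\
\text{male} & \text{otherwise}
\end{cases}
\]

The classifier $dec : U_{te} \to \{\text{male}, \text{female}\}$ used in the submission assigns decisions as follows:

\[
dec(x) =
\begin{cases}
dec(y) & \text{if } y \in m([x]_{I_{te}}) \\
\Theta_\theta\!\left( \dfrac{\sum_{y \in [x]_{I_{te}}} f(y)}{\left| [x]_{I_{te}} \right|} \right) & \text{if } r([x]_{I_{te}}) = 1 \text{ and } m([x]_{I_{te}}) = \bot \\
\Theta_\theta(f(x)) & \text{otherwise}
\end{cases}
\]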
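The three-way decision rule can be written out as a short function. This is a minimal sketch under assumed data structures: dictionaries keyed by observation or granule identifiers, with `None` standing in for an unknown match and a boolean flag standing in for the condition r = 1. All names here are hypothetical and not taken from the submission code.

```python
from statistics import mean
from typing import Dict, Hashable, List, Optional


def threshold(score: float, theta: float) -> str:
    """The thresholding function: female if score > theta, else male."""
    return "female" if score > theta else "male"


def dec(
    x: Hashable,
    granule_of: Dict[Hashable, int],
    members: Dict[int, List[Hashable]],
    matched_decision: Dict[int, Optional[str]],
    reliable: Dict[int, bool],
    scores: Dict[Hashable, float],
    theta: float,
) -> str:
    g = granule_of[x]
    decision = matched_decision.get(g)
    if decision is not None:
        # Case 1: the granule was matched to a training block.
        return decision
    if reliable.get(g, False):
        # Case 2: unmatched granule with r = 1, classified as a whole
        # from the mean raw score of its members.
        return threshold(mean([scores[y] for y in members[g]]), theta)
    # Case 3: fall back to the observation's own raw score.
    return threshold(scores[x], theta)


# Hypothetical toy data.
granule_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
members = {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
matched_decision = {0: "male"}
reliable = {1: True}
scores = {"a": 0.9, "b": 0.9, "c": 0.8, "d": 0.1, "e": 0.7}
labels = {
    x: dec(x, granule_of, members, matched_decision, reliable, scores, 0.5)
    for x in granule_of
}
```

In the toy data, observation "c" would be labelled female on its own score, but the granule mean of 0.45 falls below the threshold, so the whole granule is labelled male.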

Plans for the article
Figure: Fraction of male visitors by hour. Plan: derive a smoothing method with constraints (for a cyclic domain).

Thank you