Supervised Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Regression Example: Boston Housing




Transcript:

HT2015: SC4 Statistical Data Mining and Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford
http://www.stats.ox.ac.uk/~sejdinov/sdmml.html

Unsupervised learning: extract structure and postulate hypotheses about the data generating process from unlabelled observations x_1, ..., x_n. Visualize, summarize and compress data.

Supervised learning: in addition to the observations of X, we have access to their response variables / labels Y taking values in a response space Y: we observe {(x_i, y_i)}_{i=1}^n.

Types of supervised learning:
Classification: discrete responses, e.g. Y = {+1, -1} or {1, ..., K}.
Regression: a numerical value is observed and Y = R.
The goal is to accurately predict the response Y on new observations of X, i.e., to learn a function f : R^p → Y such that f(X) will be close to the true response Y.

Regression Example: Boston Housing

The original data are 506 observations on 13 variables X; medv is the response variable Y.

crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq.ft.
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centers
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,000
ptratio  pupil-teacher ratio by town
black    1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000's

> str(X)
'data.frame': 506 obs. of 13 variables:
 $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num 6.58 6.42 7.18 7.00 7.15 ...
 $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num 397 397 393 395 397 ...
 $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
> str(Y)
 num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Goal: predict the median house price Y of a new district, given its 13 predictor variables X.
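To make the regression goal concrete, here is a minimal R sketch, not part of the lecture, that fits an ordinary linear model for medv and measures its squared-error risk on held-out districts. It assumes the data are loaded as the Boston data frame from the MASS package rather than as the X and Y objects shown above.

## A minimal sketch, not from the lecture: linear regression on the Boston data,
## assuming the MASS package's Boston data frame (13 predictors plus medv).
library(MASS)
data(Boston)

set.seed(1)
train <- sample(nrow(Boston), 400)             # hold out the remaining districts

fit  <- lm(medv ~ ., data = Boston[train, ])   # f learned by least squares
pred <- predict(fit, newdata = Boston[-train, ])
mean((Boston$medv[-train] - pred)^2)           # empirical risk under squared loss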

Classification Example: Lymphoma

We have gene expression measurements X of n = 62 patients for p = 4026 genes. For each patient, Y ∈ {0, 1} denotes one of two subtypes of cancer.
Goal: predict the cancer subtype of a new patient, given the patient's gene expressions.

> str(X)
'data.frame': 62 obs. of 4026 variables:
 $ Gene 1 : num -0.344 -1.188 0.520 -0.748 -0.868 ...
 $ Gene 2 : num -0.953 -1.286 0.657 -1.328 -1.330 ...
 $ Gene 3 : num -0.776 -0.588 0.409 -0.991 -1.517 ...
 $ Gene 4 : num -0.474 -1.588 0.219 0.978 -1.604 ...
 $ Gene 5 : num -1.896 -1.960 -1.695 -0.348 -0.595 ...
 $ Gene 6 : num -2.075 -2.117 0.121 -0.800 0.651 ...
 $ Gene 7 : num -1.875 -1.818 0.317 0.387 0.041 ...
 $ Gene 8 : num -1.539 -2.433 -0.337 -0.522 -0.668 ...
 $ Gene 9 : num -0.604 -0.710 -1.269 -0.832 0.458 ...
 $ Gene 10: num -0.218 -0.487 -1.203 -0.919 -0.848 ...
 $ Gene 11: num -0.340 1.164 1.023 1.133 -0.541 ...
 $ Gene 12: num -0.531 0.488 -0.335 0.496 -0.358 ...
> str(Y)
 num [1:62] 0 0 0 1 0 0 1 0 0 0 ...

Loss function

Suppose we made a prediction Ŷ = f(X) ∈ Y based on an observation of X. How good is the prediction? We can use a loss function L : Y × Y → R_+ to formalize the quality of the prediction.
Typical loss functions:
Misclassification loss (or 0-1 loss) for classification: L(Y, f(X)) = 0 if f(X) = Y, and 1 if f(X) ≠ Y.
Squared loss for regression: L(Y, f(X)) = (f(X) - Y)^2.
Many other choices are possible, e.g., weighted misclassification loss. In classification, if estimated probabilities p̂(k) for each class k ∈ Y are returned, the log-likelihood loss (or log loss) L(Y, p̂) = -log p̂(Y) is often used.

Risk

Paired observations {(x_i, y_i)}_{i=1}^n are viewed as i.i.d. realizations of a random variable (X, Y) on X × Y with joint distribution P_XY.
For a given loss function L, the risk R of a learned function f is given by the expected loss
R(f) = E_{P_XY}[L(Y, f(X))],
where the expectation is with respect to the true (unknown) joint distribution of (X, Y).
The risk is unknown, but we can compute the empirical risk:
R_n(f) = (1/n) Σ_{i=1}^n L(y_i, f(x_i)).

The Bayes Classifier

What is the optimal classifier if the joint distribution of (X, Y) were known?
The density g of X can be written as a mixture of K components (corresponding to each of the classes):
g(x) = Σ_{k=1}^K π_k g_k(x),
where, for k = 1, ..., K, P(Y = k) = π_k are the class probabilities and g_k(x) is the conditional density of X given Y = k.
The Bayes classifier f_Bayes : X → {1, ..., K} is the one with minimum risk:
R(f) = E[L(Y, f(X))] = E_X[ E_{Y|X}[L(Y, f(X)) | X] ] = ∫_X E[L(Y, f(X)) | X = x] g(x) dx.
The minimum risk attained by the Bayes classifier is called the Bayes risk.
Minimizing E[L(Y, f(X)) | X = x] separately for each x suffices.
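The loss functions and the empirical risk R_n(f) above translate directly into R. A small sketch, not from the lecture: y, yhat and phat are hypothetical objects holding observed responses, predictions f(x_i), and an n x K matrix of estimated class probabilities, with classes coded 1, ..., K.

## Empirical risk R_n(f) = (1/n) sum_i L(y_i, f(x_i)) for the losses above.
## y = observed responses, yhat = predictions f(x_i),
## phat = n x K matrix of estimated class probabilities (classes coded 1..K).
misclassification_risk <- function(y, yhat) mean(y != yhat)     # 0-1 loss
squared_error_risk     <- function(y, yhat) mean((y - yhat)^2)  # squared loss
log_loss               <- function(y, phat) -mean(log(phat[cbind(seq_along(y), y)]))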

The Bayes Classifier

Consider the 0-1 loss. The risk simplifies to:
E[L(Y, f(X)) | X = x] = Σ_{k=1}^K L(k, f(x)) P(Y = k | X = x) = 1 - P(Y = f(x) | X = x).
The risk is minimized by choosing the class with the greatest posterior probability:
f_Bayes(x) = argmax_{k=1,...,K} P(Y = k | X = x) = argmax_k π_k g_k(x) / Σ_{j=1}^K π_j g_j(x) = argmax_k π_k g_k(x).
The functions π_k g_k(x) are called discriminant functions. The discriminant function with maximum value determines the predicted class of x.

The Bayes Classifier: Example

A simple two-Gaussian example: suppose X | Y = k ~ N(µ_k, 1), where µ_1 = -1 and µ_2 = 1, and assume equal priors π_1 = π_2 = 1/2. Then
g_1(x) = (1/√(2π)) exp(-(x + 1)^2 / 2)  and  g_2(x) = (1/√(2π)) exp(-(x - 1)^2 / 2).
[Figure: the two class-conditional densities and the marginal density of X.]
Optimal classification is f_Bayes(x) = argmax_k π_k g_k(x) = 1 if x < 0, and 2 if x ≥ 0.

The Bayes Classifier: Example

How do you classify a new observation x if the standard deviation is still 1 for class 1 but 1/3 for class 2?
[Figure: the two class-conditional densities on a log scale.]
Looking at the densities on a log scale, the optimal classification is to select class 2 if and only if x ∈ [0.34, 2.16].

Plug-in Classification

The Bayes classifier chooses the class with the greatest posterior probability
f_Bayes(x) = argmax_k π_k g_k(x).
But we know neither the conditional densities g_k nor the class probabilities π_k!
The plug-in classifier chooses the class
f(x) = argmax_k π̂_k ĝ_k(x),
where we plug in estimates π̂_k of π_k and ĝ_k(x) of g_k(x), for k = 1, ..., K. Linear discriminant analysis, discussed next, is an example of plug-in classification.
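The boundary points quoted in the unequal-variance example can be checked numerically. A small sketch, not from the lecture, assuming class 1 is N(-1, 1), class 2 is N(1, (1/3)^2) and the priors are equal, so the boundary lies where the two equally weighted class-conditional densities cross.

## With equal priors, the Bayes rule switches class where pi_1 g_1(x) = pi_2 g_2(x),
## i.e. where the two normal densities are equal.
diff_density <- function(x) dnorm(x, mean = 1, sd = 1/3) - dnorm(x, mean = -1, sd = 1)
lower <- uniroot(diff_density, c(0, 1))$root   # approx 0.34
upper <- uniroot(diff_density, c(2, 3))$root   # approx 2.16
c(lower, upper)                                # class 2 is optimal exactly on [lower, upper]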

Linear Discriminant Analysis

LDA is the most well-known and simplest example of plug-in classification. Assume a multivariate normal conditional density g_k(x) for each class k:
X | Y = k ~ N(µ_k, Σ),
g_k(x) = (2π)^{-p/2} |Σ|^{-1/2} exp( -(1/2) (x - µ_k)^T Σ^{-1} (x - µ_k) ),
where each class can have a different mean µ_k but all classes share the same covariance matrix Σ.
For an observation x, the k-th log-discriminant function is
log π_k g_k(x) = c + log π_k - (1/2) (x - µ_k)^T Σ^{-1} (x - µ_k).
The quantity (x - µ_k)^T Σ^{-1} (x - µ_k) is the squared Mahalanobis distance between x and µ_k.
If Σ = I_p and π_k = 1/K, LDA simply chooses the class k with the nearest (in the Euclidean sense) class mean.

Expanding the term (x - µ_k)^T Σ^{-1} (x - µ_k), and absorbing x^T Σ^{-1} x, which does not depend on k, into the constant:
log π_k g_k(x) = c + log π_k - (1/2) (µ_k^T Σ^{-1} µ_k - 2 µ_k^T Σ^{-1} x + x^T Σ^{-1} x)
               = c' + log π_k - (1/2) µ_k^T Σ^{-1} µ_k + µ_k^T Σ^{-1} x.
Setting a_k = log π_k - (1/2) µ_k^T Σ^{-1} µ_k and b_k = Σ^{-1} µ_k, we obtain
log π_k g_k(x) = c' + a_k + b_k^T x,
i.e. a linear discriminant function in x.
Consider choosing class k over k':
a_k + b_k^T x > a_{k'} + b_{k'}^T x, i.e. a* + b*^T x > 0,
where a* = a_k - a_{k'} and b* = b_k - b_{k'}. The Bayes classifier thus partitions X into regions with the same class predictions via separating hyperplanes. The Bayes classifier under these assumptions is more commonly known as the LDA classifier.

Parameter Estimation

How do we estimate the parameters of the LDA model? We can achieve this by maximum likelihood (the EM algorithm is not needed here, since the class variables y_i are observed!).
Let n_k = #{j : y_j = k} be the number of observations in class k. The log-likelihood is
l(π, (µ_k)_{k=1}^K, Σ) = Σ_{i=1}^n log p(x_i, y_i | π, (µ_k)_{k=1}^K, Σ) = Σ_{i=1}^n log π_{y_i} g_{y_i}(x_i)
                       = c + Σ_{k=1}^K Σ_{j: y_j = k} [ log π_k - (1/2) ( log |Σ| + (x_j - µ_k)^T Σ^{-1} (x_j - µ_k) ) ].
ML estimates:
π̂_k = n_k / n,
µ̂_k = (1/n_k) Σ_{j: y_j = k} x_j,
Σ̂ = (1/n) Σ_{k=1}^K Σ_{j: y_j = k} (x_j - µ̂_k)(x_j - µ̂_k)^T.
Note: the ML estimate of Σ is biased. For an unbiased estimate we need to divide by n - K instead of n.

Example in R: the iris data.

library(MASS)
data(iris)
## save class labels
ct <- unclass(iris$Species)
## pairwise plot
pairs(iris[,1:4], col=ct)

[Figure: pairwise scatter plots of the four iris measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), coloured by species.]
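The ML estimates and the linear discriminant functions a_k + b_k^T x can also be computed by hand. A sketch, not from the lecture, using the iris data loaded above with the classes coded 1, 2, 3; the lda() function used in the lecture code fits this same model.

## Hand-computed ML estimates for the LDA model on the iris data.
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species)                 # classes 1, 2, 3
K <- 3; n <- nrow(X); p <- ncol(X)

pi_hat <- as.numeric(table(y)) / n            # pi_k = n_k / n
mu_hat <- t(sapply(1:K, function(k) colMeans(X[y == k, , drop = FALSE])))  # K x p
Sigma_hat <- matrix(0, p, p)
for (k in 1:K) {
  Xc <- sweep(X[y == k, , drop = FALSE], 2, mu_hat[k, ])  # centre class k at mu_k
  Sigma_hat <- Sigma_hat + t(Xc) %*% Xc
}
Sigma_hat <- Sigma_hat / n                    # ML estimate (divide by n - K for unbiased)

## Linear discriminant functions a_k + b_k^T x and the resulting predictions.
Sigma_inv <- solve(Sigma_hat)
b <- Sigma_inv %*% t(mu_hat)                            # column k is b_k = Sigma^{-1} mu_k
a <- log(pi_hat) - 0.5 * colSums(t(mu_hat) * b)         # a_k = log pi_k - mu_k^T Sigma^{-1} mu_k / 2
scores <- sweep(X %*% b, 2, a, "+")                     # entry (i, k) is a_k + b_k^T x_i
pred <- max.col(scores)                                 # plug-in class predictions
mean(pred == y)                                         # training accuracy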

Just focus on two predictor variables.

iris.data <- iris[,3:4]
plot(iris.data, col=ct, pch=20, cex=1.5, cex.lab=1.4)

Computing and plotting the LDA boundaries.

## fit LDA
iris.lda <- lda(x=iris.data, grouping=ct)
## create a grid for our plotting surface
x <- seq(0, 8, 0.02)
y <- seq(0, 3, 0.02)
m <- length(x)
n <- length(y)
z <- as.matrix(expand.grid(x, y))
## classes are 1, 2 and 3, so set contours at 1.5 and 2.5
iris.ldp <- predict(iris.lda, z)$class
contour(x, y, matrix(as.numeric(iris.ldp), m, n),
        levels=c(1.5, 2.5), add=TRUE, drawlabels=FALSE, lty=2)
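As a quick sanity check, not from the lecture, the empirical 0-1 risk of the fitted plug-in classifier on the training data can be computed from the iris.lda, iris.data and ct objects created above.

## Training misclassification rate of the fitted LDA classifier.
train_pred <- predict(iris.lda, iris.data)$class
mean(as.integer(train_pred) != ct)            # empirical risk under 0-1 loss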