ABSTRACT

Title of dissertation: Estimating Common Odds Ratio with Missing Data
Te-Ching Chen, Doctor of Philosophy, 2005
Dissertation directed by: Professor Paul J. Smith, Statistics Program, Department of Mathematics

We derive estimates of expected cell counts for I × J × K contingency tables where the stratum variable C is always observed but the column variable B and the row variable A might be missing. In particular, we investigate cases where only the row variable A might be missing, either randomly or informatively. For 2 × 2 × K tables, we use Taylor expansions to study the biases and variances of the Mantel-Haenszel estimator of the common odds ratio and of modified Mantel-Haenszel estimators that add one pair of pseudotables, both for data without missing values and for data with missing values, based either on the completely observed subsample or on estimated cell means when both the stratum and column variables are always observed. We examine both large-table and sparse-table asymptotics. Analytic studies and simulation results show that the Mantel-Haenszel estimators overestimate the common odds ratio, but adding one pair of pseudotables reduces both bias and variance. Jackknifed Mantel-Haenszel estimators also have reduced biases and variances. Estimates using only the complete subsample appear to have larger bias than those based on the full data, but the bias diminishes as the total number of observations grows. With randomly missing data, estimators based on estimated cell means appear to have larger biases and variances than those based only on the complete subsample. With informative missingness, estimators based on the estimated cell means do not converge to the correct common odds ratio under sparse asymptotics, and converge slowly under large-table asymptotics.
The Mantel-Haenszel estimators based on incorrectly estimated cell means when variable A is informatively missing behave similarly to those based only on the complete subsample. Variance estimates from the asymptotic variance formula for ratio estimators had smaller biases and variances than those based on jackknifing or bootstrapping. Bootstrapping may produce zero divisors and unstable estimates, but adding one pair of pseudotables eliminates these problems and reduces the variability.
Estimating Common Odds Ratio With Missing Data

by Te-Ching Chen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2005

Advisory Committee:
Dr. Paul J. Smith, Chair/Advisor
Dr. Benjamin Kedem
Dr. Ryszard Syski
Dr. Grace L. Yang
Dr. Chan Mitchell Dayton
© Copyright by Te-Ching Chen, 2005
To my families.
ACKNOWLEDGMENTS

With sincere gratitude, I wish to acknowledge the many individuals who have guided me in the process of completing my doctoral program, especially my advisor, Dr. Paul J. Smith, whose guidance, encouragement, and extraordinary patience throughout my graduate study have contributed to my professional growth. I cannot thank him enough for the support and help he provided in the process of writing this dissertation. I would also like to express my appreciation and gratitude to Professors Ryszard Syski, Grace L. Yang, Chan Mitchell Dayton, and Benjamin Kedem for their helpful suggestions in reviewing this study and for serving on my committee. I would like to express my sincere appreciation to Dr. Hwai-Chiuan Wang, whose encouragement to study in the United States broadened both my academic and life views, to Dr. Anne Chao, who introduced me to the joy of statistics, and to Ms. Mei-Gui Wang, who inspired my interest in mathematics. I would like to extend my thanks to Ms. Juei-Hsien Luan, Ms. Shu-Yuan Chuang, Dr. Ronald Zeigler, Joel Smith, and Carolina Rojas-Bahr for their valuable suggestions and encouragement during difficult times. Special thanks to Dr. Yan Xin, Yan Xin Life Science and Technology (YXLST), and friends in the YXLST. Acknowledgment also goes to Dr. Cordell Black and the Office of Multi-Ethnic Student Education for their support of my dissertation studies. Lastly, thank you to all who have ever helped me.
TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
2 Literature Review
  2.1 Missing Data
  2.2 Closed Form Estimates
  2.3 Common Odds Ratio Estimators
    2.3.1 Mantel-Haenszel Estimator
    2.3.2 Mantel-Haenszel with Pseudotable Methods
    2.3.3 Jackknifing Mantel-Haenszel Method
  2.4 Variance Estimation Formula
3 Closed-Form Estimators for I × J × K Tables
  3.1 Variable C always observed
  3.2 Variables B and C always observed
  3.3 Negative α̂_i boundary problems
4 Common Odds Ratio Estimation
  4.1 Methods
    4.1.1 Mantel-Haenszel Estimator
    4.1.2 Mantel-Haenszel Estimator with pseudo-data
  4.2 Missing Data
    4.2.1 Complete Data Only (MAR Model)
    4.2.2 Complete Data Only (Informative Missingness)
    4.2.3 Closed Form Estimated Data for MAR Model
    4.2.4 Closed Form Estimated Data for Informative Missingness Model
  4.3 Independent binomial rows
5 Simulation Results
  5.1 Simulation
  5.2 Estimating the common odds ratio when A is MAR(B, C)
  5.3 Estimating the common odds ratio when A is missing Informative(A, C)
  5.4 Variance Estimation When A is MAR(B, C)
  5.5 Estimating Variance When A is Missing Informative(A, C)
  5.6 Multinomial Data
6 Summary and Future Research
  6.1 Summary
  6.2 Simulation Findings
  6.3 Conclusions
  6.4 Future Research
A Simulation Result Tables (MAR Model)
B Simulation Result Tables (Informative Model)
C Simulation Result Tables (Variance Estimation for MAR(B, C) Model)
D Simulation Result Tables (Variance Estimation for Informative Model)
E Simulation Result Tables (MAR Model, Multinomial)
F Simulation Result Tables (Informative Model, Multinomial)
G Simulation Result Tables (Variance Estimation for MAR Model, Multinomial)
H Simulation Result Tables (Variance Estimation for Informative Model, Multinomial)
Bibliography
LIST OF TABLES

2.1 Two-way contingency table with missing data
2.2 One Pair of Pseudotables
3.1 The kth table
3.2 Closed Form Estimates for I × J × K Tables
3.3 Closed Form Estimates for I × J × K Tables (Continued)
3.4 Closed Form Estimates for I × J × K Tables (Continued)
3.5 Closed Form Estimator (B and C are Always Observed)
3.6 Boundary Solution for Informative(A, C) when both B and C are always observed
4.1 Closed form estimated cell means for Informative(A, C)
A.1 Mean of Estimated Common Odds Ratio (MAR Model) for θ = 2
A.2 Variance of Estimated Common Odds Ratio (MAR Model) for θ = 2
A.3 T value for Estimated Common Odds Ratio (MAR Model) for θ = 2
A.4 Mean of Estimated Common Odds Ratio (MAR Model) for θ = 3
A.5 Variance of Estimated Common Odds Ratio (MAR Model) for θ = 3
A.6 T value for Estimated Common Odds Ratio (MAR Model) for θ = 3
B.1 Mean of Estimated Common Odds Ratio (Informative(A, C)) for θ = 2
B.2 Variance of Estimated Common Odds Ratio (Informative(A, C)) for θ = 2
B.3 T value for Estimated Common Odds Ratio (Informative(A, C)) for θ = 2
B.4 Simulation Results (Informative(A, C)): Estimated Common Odds Ratio for θ = 3
B.5 Variance of Estimated Common Odds Ratio (Informative(A, C)) for θ = 3
B.6 T value for Estimated Common Odds Ratio (Informative(A, C)) for θ = 3
C.1 Variance Estimation (MAR Model) for θ = 2, Full Data vs. Complete Only
C.2 Variance Estimation (MAR Model) for θ = 2, Full Data vs. Estimated Data
C.3 Variance Estimation (MAR Model) for θ = 2, Full Data vs. Complete Only With Pseudo-Tables
C.4 Variance Estimation (MAR Model) for θ = 2, Full Data vs. Estimated Data With Pseudo-Tables
C.5 Variance of Variance Estimation (MAR Model) for θ = 2
C.6 Variance of Variance Estimation (MAR Model) for θ = 2 With Pseudo-Tables
C.7 Variance Estimation (MAR Model) for θ = 3, Full Data vs. Complete Only
C.8 Variance Estimation (MAR Model) for θ = 3, Full Data vs. Estimated Data
C.9 Variance Estimation (MAR Model) for θ = 3, Full Data vs. Complete Data Only With Pseudo-Tables
C.10 Variance Estimation (MAR Model) for θ = 3, Full Data vs. Estimated Data With Pseudo-Tables
C.11 Variance of Variance Estimation (MAR Model) for θ = 3
C.12 Variance of Variance Estimation (MAR Model) for θ = 3 With Pseudo-Tables
D.1 Variance Estimation (Informative(A, C)) for θ = 2, Full Data vs. Complete Only
D.2 Variance Estimation (Informative(A, C)) for θ = 2, Full Data vs. Estimated Data
D.3 Variance Estimation (Informative(A, C)) for θ = 2, Full Data vs. Complete Only With Pseudo-Tables
D.4 Variance Estimation (Informative(A, C)) for θ = 2, Full Data vs. Estimated Data With Pseudo-Tables
D.5 Variance of Variance Estimation (Informative(A, C)) for θ = 2
D.6 Variance of Variance Estimation (Informative(A, C)) for θ = 2 With Pseudo-Tables
D.7 Variance Estimation (Informative(A, C)) for θ = 3, Full Data vs. Complete Only
D.8 Variance Estimation (Informative(A, C)) for θ = 3, Full Data vs. Estimated Data
D.9 Variance Estimation (Informative(A, C)) for θ = 3, Full Data vs. Complete Only With Pseudo-Tables
D.10 Variance Estimation (Informative(A, C)) for θ = 3, Full Data vs. Estimated Data With Pseudo-Tables
D.11 Variance of Variance Estimation (Informative(A, C)) for θ = 3
D.12 Variance of Variance Estimation (Informative(A, C)) for θ = 3 With Pseudo-Tables
E.1 Mean of Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 2
E.2 Variance of Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 2
E.3 T value for Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 2
E.4 Mean of Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 3
E.5 Variance of Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 3
E.6 T value for Estimated Common Odds Ratio (Multinomial MAR Model) for θ = 3
F.1 Mean of Estimated Common Odds Ratio (Multinomial Informative(A, C)) for θ = 2
F.2 Variance of Estimated Common Odds Ratio (Multinomial Informative(A, C) Model) for θ = 2
F.3 T value for Estimated Common Odds Ratio (Multinomial Informative(A, C) Model) for θ = 2
F.4 Mean of Estimated Common Odds Ratio (Multinomial Informative(A, C)) for θ = 3
F.5 Variance of Estimated Common Odds Ratio (Multinomial Informative(A, C) Model) for θ = 3
F.6 T value for Estimated Common Odds Ratio (Multinomial Informative(A, C) Model) for θ = 3
G.1 Variance Estimation (Multinomial MAR Model) for θ = 2, Full Data vs. Complete Only
G.2 Variance Estimation (Multinomial MAR Model) for θ = 2, Full Data vs. Estimated Data
G.3 Variance Estimation (Multinomial MAR Model) for θ = 2, Full Data vs. Complete Only With Pseudo-Tables
G.4 Variance Estimation (Multinomial MAR Model) for θ = 2, Full Data vs. Estimated Data With Pseudo-Tables
G.5 Variance of Variance Estimation (Multinomial MAR Model) for θ = 2
G.6 Variance of Variance Estimation (Multinomial MAR Model) Estimate for θ = 2 With Pseudo-Tables
G.7 Variance Estimation (Multinomial MAR Model) for θ = 3, Full Data vs. Complete Only
G.8 Variance Estimation (Multinomial MAR Model) for θ = 3, Full Data vs. Estimated Data
G.9 Variance Estimation (Multinomial MAR Model) for θ = 3, Full Data vs. Complete Only With Pseudo-Tables
G.10 Variance Estimation (Multinomial MAR Model) for θ = 3, Full Data vs. Estimated Data With Pseudo-Tables
G.11 Variance of Variance Estimation (Multinomial MAR Model) for θ = 3
G.12 Variance of Variance Estimation (Multinomial MAR Model) Estimate for θ = 3 With Pseudo-Tables
H.1 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2, Full Data vs. Complete Only
H.2 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2, Full Data vs. Estimated Data
H.3 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2, Full Data vs. Complete Only With Pseudo-Tables
H.4 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2, Full Data vs. Estimated Data With Pseudo-Tables
H.5 Variance of Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2
H.6 Variance of Variance Estimation (Multinomial Informative(A, C) Model) for θ = 2 With Pseudo-Tables
H.7 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 3, Full Data vs. Complete Only
H.8 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 3, Full Data vs. Estimated Data
H.9 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 3, Full Data vs. Complete Only With Pseudo-Tables
H.10 Variance Estimation (Multinomial Informative(A, C) Model) for θ = 3, Full Data vs. Estimated Data With Pseudo-Tables
H.11 Variance of Variance Estimation (Multinomial Informative(A, C) Model) for θ = 3
H.12 Variance Estimation (Informative(A, C) Model, Multinomial): Variance of Variance Estimate for θ = 3 With Pseudo-Tables
LIST OF FIGURES

5.1 Mean: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.2 Variance: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.3 t: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.4 Mean: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.5 Variance: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.6 t: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.7 Mean: θ = 2, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.8 Variance: θ = 2, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.9 t: θ = 2, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.10 Mean: θ = 3, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.11 Variance: θ = 3, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.12 t: θ = 3, Missing Informative, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.13 Estimating Variance: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.14 Estimating Variance: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2, With Pseudo-Tables
5.15 Variance of Estimating Variance: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.16 Variance of Estimating Variance: θ = 2, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2, With Pseudo-Tables
5.17 Estimating Variance: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.18 Estimating Variance: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2, With Pseudo-Tables
5.19 Variance of Estimating Variance: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2
5.20 Variance of Estimating Variance: θ = 3, MAR, P(missing) = 0.5, P(missing) = 0.4 for k ≤ K/2; P(missing) = 0.4, P(missing) = 0.5 for k > K/2, With Pseudo-Tables
Chapter 1

Introduction

In real-world studies where sequences of contingency table data are collected, some tables are frequently incomplete. Often investigators simply analyze the subsample of complete observations, ignoring possible effects of missing data. While this may be reasonable when only a small proportion of observations are incomplete, it can lead to incorrect estimates and misstated variability when large amounts of data are missing. Another approach, imputation, is not always valid. When a large proportion of data is missing, incorrect estimation of variability is possible even if the imputation scheme does not produce bias. Statistical studies of the effects of missing data are summarized by Little and Rubin [30]. They show that, in multivariate data with missing observations, estimates using only complete observations do lead to biased estimates of means and variances. They formulate mathematical models for missing data and suggest better techniques for dealing with missing data. In this dissertation, we study the effects on estimation of the common odds ratio when the row variable is missing at random with missingness probability depending on the stratum and column variables, or when it is missing informatively with missingness probability depending on the variable itself and the stratum variable. We assume the column variable and the stratum variable are always observed, as in case-control studies. We investigate estimators based only on the complete observations and on the closed-form imputation introduced by Baker, Rosenberger and DerSimonian [5]. We use both mathematical proofs and simulations to study the effects of ignoring missing data, imputing missing data, and misspecifying the missing data mechanism.
The dissertation is organized as follows. In Chapter 2, we review the literature on missing data analysis, in particular the closed-form imputation of cell counts for contingency tables and common odds ratio estimation. In Chapter 3, we derive closed-form estimators for three-way contingency tables. In Chapter 4, we study the biases of the Mantel-Haenszel estimator of the common odds ratio and its adjustments, both for complete and for incomplete contingency tables. The results of the simulations are in Chapter 5. Chapter 6 states our conclusions and discusses some possible directions for further research.
Chapter 2

Literature Review

2.1 Missing Data

Let V be a matrix of random variables with n independent rows and p columns. Each row represents a different observation and each column represents a different categorical variable Y = (Y_1, Y_2, ..., Y_p). One could rearrange V as a p-dimensional contingency table with W cells defined by the joint levels of the variables. The entries in the table are counts {z_ij...t}, where z_ij...t is the number of sampled cases in the cell with Y_1 = i, Y_2 = j, ..., Y_p = t. If the data matrix V has missing items, that is, if one or more column variables are not identifiable in some rows, we convert the data matrix to a contingency table with p extra variables R = (R_1, ..., R_p) which indicate whether Y_1, ..., Y_p were observed. We write R_r = 1 when the rth variable is observed and R_r = 0 otherwise. If the event that Y_i is missing does not depend on Y_i itself or on any other Y_k, then we say Y_i is Missing Completely at Random (MCAR); that is, P(R_i | Y, φ) = P(R_i | φ) = π, where φ denotes unknown parameters. If the event that Y_i is missing does not depend on Y_i itself but does depend on some other observed variables Y_k, k ≠ i, then we say Y_i is Missing at Random (MAR); that is, P(R_i | Y, φ) = P(R_i | {Y_k : R_k = 1}, φ) (Little and Rubin [30]). In this situation, we will say Y_i is MAR(Y_k, R_k = 1), or more simply MAR(Y_k). Nonignorable missingness or informative missingness means that the missingness of Y_i depends on either Y_i itself or on some unobserved Y_k; that is, Y_i is not MAR.
Standard statistical methods were not designed to analyze missing data, so when some data
are missing, we should use other methods. Four groups of methods for analyzing a data set with missing values were presented by Little and Rubin [30].

(a) Procedures based on completely recorded units: When some variables are missing for some observations, a simple method is to analyze only the units with complete data. In general, this method can lead to serious biases and it may not be very efficient. However, under MCAR, the complete observations are a random sample of the full data set, so no bias occurs.

(b) Imputation-based procedures: One fills in values for the missing data so that the data set becomes a completed data set, and then one simply uses standard methods to analyze the data. There are several imputation methods, including hot deck imputation, mean imputation and regression imputation. The idea is to find Ẑ_ij, an estimate of the expected value of a missing Z_ij given the observed data, and to substitute Ẑ_ij in the sample.

(c) Weighting procedures: In a complex survey, observations are given weights π_i^{-1} which are inversely proportional to the probability of selection. For instance, let x_i be the ith value of a variable X. Then the population mean of X is often estimated by

    Σ_i π_i^{-1} x_i / Σ_i π_i^{-1},    (2.1)

where π_i is the probability that unit i is observed and π_i^{-1} is the design weight for observation i. Weighting procedures modify the weights in an attempt to adjust for nonresponse. The estimator (2.1) is replaced by

    Σ (π_i p̂_i)^{-1} x_i / Σ (π_i p̂_i)^{-1},

where the sums are over data units where X is observed, and p̂_i is an estimate of the probability that X is observed for unit i. If the design weights are constant in subclasses of the sample, then mean imputation and weighting lead to the same estimates of population means, although not the same estimates of sampling variances unless we make adjustments to the data with mean imputation.

(d) Model-based procedures: The analyst defines a model for the missing data mechanism
Table.: Two-way contingency table with missing data B = B = R b = R b = 0 R b = R b = 0 A = R a = R a = 0 z z 0 z 0 z 00 z z 0 z 0 z 00 A = R a = R a = 0 z z 0 z 0 z 00 z z 0 z 0 z 00 and bases inferences on the lielihood under that model. The advantages of this procedure are flexibility. However, there is a possibility of introducing bias if the model for missingness is misspecified. Methods b, v, and d all involve the use of a model, whether implicit or explicit. They are therefore potentially subject to biases.. Closed Form stimates Let A and B are two categorical variables with levels each and let R a and R b be the indicator variables we discussed in section.. The contingency table with missing data is as Table. Let µ ijl denote the expected cell counts of the i, j cells in a contingency table with R a = and R b = l. The log-linear model for two partially observed categorical variables with no three- or four-way interactions is log µ ijl = µ + α A i + α B j + α AB ij + β Ra + β R b l + β RaR b l + γ ARa i + γ AR b il + γ BRa j + γ BR b jl Baer, Rosenberger and Dersimonian[5 introduced the following parameterization of the model and led to closed-form ML estimates: 5
m_ij = N P(A = i, B = j, R_a = 1, R_b = 1)
     = exp{μ + α^A_i + α^B_j + α^{AB}_ij + β^{Ra}_1 + β^{Rb}_1 + β^{RaRb}_11 + γ^{ARa}_i1 + γ^{ARb}_i1 + γ^{BRa}_j1 + γ^{BRb}_j1},

a_ij = P(R_a = 0, R_b = 1 | A = i, B = j) / P(R_a = 1, R_b = 1 | A = i, B = j)
     = exp{−2[β^{Ra} + β^{RaRb} + γ^{ARa}_i + γ^{BRa}_j]},

b_ij = P(R_a = 1, R_b = 0 | A = i, B = j) / P(R_a = 1, R_b = 1 | A = i, B = j)
     = exp{−2[β^{Rb} + β^{RaRb} + γ^{ARb}_i + γ^{BRb}_j]},

g = [P(R_a = 1, R_b = 1 | A = i, B = j) P(R_a = 0, R_b = 0 | A = i, B = j)] / [P(R_a = 1, R_b = 0 | A = i, B = j) P(R_a = 0, R_b = 1 | A = i, B = j)]
  = exp{4 β^{RaRb}},

so that

μ_ij11 = m_ij,  μ_ij10 = m_ij b_ij,  μ_ij01 = m_ij a_ij,  μ_ij00 = m_ij a_ij b_ij g,
m_ij ≥ 0,  a_ij ≥ 0,  b_ij ≥ 0,  g ≥ 0,  Σ_i Σ_j m_ij (1 + a_ij + b_ij + a_ij b_ij g) = N.

The cell probabilities are

π_ij·· = P(A = i, B = j) = m_ij (1 + a_ij + b_ij + a_ij b_ij g) / N,

and the log-likelihood is

L = Σ_{i=1}^{2} Σ_{j=1}^{2} Σ_{k=0}^{1} Σ_{l=0}^{1} { z_ij11 δ^{kl}_{11} log μ_ij11 + z_i+10 δ^{kl}_{10} log μ_i+10 + z_+j01 δ^{kl}_{01} log μ_+j01 + z_++00 δ^{kl}_{00} log μ_++00 } − μ_++++,

where δ^{kl}_{cd} = 1 if k = c and l = d, and 0 otherwise. To compute the maximum likelihood estimates, one solves the system of equations ∂L/∂θ_ijkl = 0, where θ_ijkl ranges over μ and all of the α's, β's and γ's. Closed-form solutions might be available, and boundary conditions might need to be used if any solution of the likelihood equations falls outside the parameter space.
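To make the parameterization concrete, here is a minimal numerical sketch (our own code, not part of the dissertation; the function name and the parameter values are hypothetical) that maps a set of (m_ij, a_ij, b_ij, g) values to the expected cell counts μ_ijkl of Table 2.1 and recovers N and the cell probabilities π_ij·· from them.

```python
import numpy as np

def expected_counts(m, a, b, g):
    """Map the Baker-Rosenberger-DerSimonian parameters to expected cell counts.

    m, a, b are 2x2 arrays indexed by (i, j); g is a scalar.
    Returns mu[i, j, k, l] with k = R_a and l = R_b.
    """
    mu = np.empty((2, 2, 2, 2))
    mu[:, :, 1, 1] = m              # both A and B observed
    mu[:, :, 0, 1] = m * a          # A missing
    mu[:, :, 1, 0] = m * b          # B missing
    mu[:, :, 0, 0] = m * a * b * g  # both missing
    return mu

# hypothetical parameter values, for illustration only
m = np.array([[30.0, 20.0], [15.0, 35.0]])
a = np.full((2, 2), 0.25)           # odds of "A missing" relative to fully observed
b = np.full((2, 2), 0.10)           # odds of "B missing" relative to fully observed
g = 1.0                             # no R_a-by-R_b interaction

mu = expected_counts(m, a, b, g)
N = mu.sum()                        # equals sum_ij m_ij * (1 + a_ij + b_ij + a_ij*b_ij*g)
pi = mu.sum(axis=(2, 3)) / N        # pi_ij.. = P(A = i, B = j)
print(round(N, 3), pi)
```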
Table.: One Pair of Pseudotables 0 0 0 0.3 Common Odds Ratio stimators.3. Mantel-Haenszel stimator Suppose there are K tables, let a, b, c and d be the data of the th table and let n be the sum of the data. The Mantel-Haenszel estimator [7 for the common odds ratio is ˆθ MH = K a d /n K b c /n.. Agresti [ and Santner and Duffy [4 discussed Mantel-Haenszel estimator. The Mantel-Haenszel estimator tends to overestimate the common odds ratio Breslow [7 studied for the sparse data and Hauc, Anderson and Leahy [ studied for the large strata cases. They modified the Mantel-Haenszel estimator by adding pseudocounts to each cells to reduce the bias and the common odds ratio estimator for adding s observations to each table is ˆθ P MHO = K a + s/4d s + s/4 K b + s/4c + s/4. In their study, the best choice of s is 0.5..3. Mantel-Haenszel with Pseudotable Methods To reduce the bias of the Mantel-Haenszel estimator, Wypij and Santner [48 introduced the Pseudodata methods by adding one or more pairs of pseudotables as in Table.. They justified the pseudotable method using a Bayesian argument. The common odds ratio estimator for adding r pair of the pseudotables is K ˆθ P MH = a d /n + r/ K b c /n + r/ 7
2.3.3 Jackknifing Mantel-Haenszel Method

Pigeot and Strugholtz [35] used Quenouille's [39] idea of pseudo-values, recomputing the same type of estimator on a reduced sample, and investigated two jackknife techniques applied to the Mantel-Haenszel estimator: dropping one table at a time or dropping one observation at a time. When the drop-one-table technique is used, the ith pseudo-value J^I_i is determined as

J^I_i = K θ̂_MH − (K − 1) θ̂_MH,i,

where

θ̂_MH = ( Σ_{k=1}^K a_k d_k / n_k ) / ( Σ_{k=1}^K b_k c_k / n_k ),
θ̂_MH,i = ( Σ_{k=1, k≠i}^K a_k d_k / n_k ) / ( Σ_{k=1, k≠i}^K b_k c_k / n_k ),

and the jackknifed estimator is

θ̂_JMH = Σ_{i=1}^K J^I_i / K.

2.4 Variance Estimation Formula

The Mantel-Haenszel estimator is a ratio of sums of random terms which are not identically distributed. Using a Taylor expansion on expressions of the form

θ̂_MH = Σ_{i=1}^K y_i / Σ_{i=1}^K x_i,

centered at θ = Σ_{i=1}^K E[y_i] / Σ_{i=1}^K E[x_i], one can write

θ̂_MH − θ ≈ ( Σ_{i=1}^K E[x_i] )^{-1} Σ_{i=1}^K ( y_i − θ x_i ).

This suggests the approximate variance formula

Var(θ̂_MH) ≈ E[(θ̂_MH − θ)^2] ≈ ( Σ_{i=1}^K E[x_i] )^{-2} Σ_{i=1}^K Var( y_i − θ x_i ).

This approach is detailed in Cochran for ratio estimators and was applied to the Mantel-Haenszel estimator by Breslow [7].
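A sketch of the drop-one-table jackknife described above (again our own code with hypothetical counts; the final line is the standard jackknife variance of the mean of pseudo-values, added here for illustration rather than quoted from the dissertation).

```python
import numpy as np

def mh(tables):
    """Mantel-Haenszel common odds ratio for an array of 2x2 tables, shape (K, 2, 2)."""
    t = np.asarray(tables, dtype=float)
    n = t.sum(axis=(1, 2))
    return np.sum(t[:, 0, 0] * t[:, 1, 1] / n) / np.sum(t[:, 0, 1] * t[:, 1, 0] / n)

def jackknife_mh(tables):
    """Drop-one-table jackknife of the Mantel-Haenszel estimator.

    Returns the jackknifed point estimate (mean of pseudo-values J_i) and the
    usual jackknife estimate of its variance.
    """
    t = np.asarray(tables, dtype=float)
    K = t.shape[0]
    theta_full = mh(t)
    loo = np.array([mh(np.delete(t, i, axis=0)) for i in range(K)])  # leave-one-out fits
    pseudo = K * theta_full - (K - 1) * loo                           # pseudo-values J_i
    return pseudo.mean(), pseudo.var(ddof=1) / K

tables = [[[10, 4], [3, 8]], [[6, 2], [5, 9]], [[7, 3], [2, 6]], [[9, 5], [4, 10]]]
print(jackknife_mh(tables))
```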
Chapter 3

Closed-Form Estimators for I × J × K Tables

Consider a three-way contingency table in which some cells are only partially observed. The kth table is arranged as in Table 3.1, where z_ijklmn denotes the count in the kth table with variable A = i and variable B = j. Here l = 1 indicates that variable A is observed (i is known) and l = 0 indicates that the level of variable A is unobserved (i = ·, where · denotes unknown). The meanings of m and n are similar. The goal is to estimate the cell means μ_ijk of a fully observed I × J × K table. Let R_a = 1 indicate that variable A is observed and R_a = 0 that it is not. The meanings of R_b and R_c are similar. Assume there are no four-way or higher interactions among the categorical variables A, B, C, R_a, R_b and R_c. The log-linear model is

log μ_ijklmn = μ + α^A_i + α^B_j + α^C_k + α^{AB}_ij + α^{AC}_ik + α^{BC}_jk + α^{ABC}_ijk
  + β^{Ra}_l + β^{Rb}_m + β^{Rc}_n + β^{RaRb}_lm + β^{RaRc}_ln + β^{RbRc}_mn + β^{RaRbRc}_lmn
  + γ^{ARa}_il + γ^{ARb}_im + γ^{ARc}_in + γ^{BRa}_jl + γ^{BRb}_jm + γ^{BRc}_jn + γ^{CRa}_kl + γ^{CRb}_km + γ^{CRc}_kn
  + γ^{ABRa}_ijl + γ^{ABRb}_ijm + γ^{ABRc}_ijn + γ^{ACRa}_ikl + γ^{ACRb}_ikm + γ^{ACRc}_ikn + γ^{BCRa}_jkl + γ^{BCRb}_jkm + γ^{BCRc}_jkn
  + γ^{ARaRb}_ilm + γ^{ARaRc}_iln + γ^{ARbRc}_imn + γ^{BRaRb}_jlm + γ^{BRaRc}_jln + γ^{BRbRc}_jmn + γ^{CRaRb}_klm + γ^{CRaRc}_kln + γ^{CRbRc}_kmn.

We extend Baker, Rosenberger and DerSimonian's [5] method to reparameterize as

m_ijk = N P(A = i, B = j, C = k, R_a = 1, R_b = 1, R_c = 1)
      = exp(μ + α^A_i + α^B_j + α^C_k + α^{AB}_ij + α^{AC}_ik + α^{BC}_jk + α^{ABC}_ijk
            + β^{Ra}_1 + β^{Rb}_1 + β^{Rc}_1 + β^{RaRb}_11 + β^{RaRc}_11 + β^{RbRc}_11 + β^{RaRbRc}_111),
Table 3.1: The kth table

            B = 1      B = 2    ...   B = J      B = ·
A = 1      z_11k11    z_12k11   ...   z_1Jk11    z_1+k10
A = 2      z_21k11    z_22k11   ...   z_2Jk11    z_2+k10
...        ...        ...       ...   ...        ...
A = I      z_I1k11    z_I2k11   ...   z_IJk11    z_I+k10
A = ·      z_+1k01    z_+2k01   ...   z_+Jk01    z_++k00

a_ijk = P(R_a = 0, R_b = 1, R_c = 1 | A = i, B = j, C = k) / P(R_a = 1, R_b = 1, R_c = 1 | A = i, B = j, C = k).

Expanding the numerator and the denominator under the log-linear model and cancelling the terms common to both gives

a_ijk = exp[−2(β^{Ra} + β^{RaRb} + β^{RaRc} + β^{RaRbRc} + γ^{ARa}_i + γ^{BRa}_j + γ^{CRa}_k + γ^{ABRa}_ij + γ^{ACRa}_ik + γ^{BCRa}_jk + γ^{ARaRb}_i + γ^{BRaRb}_j + γ^{CRaRb}_k)].

Similarly,

b_ijk = P(R_a = 1, R_b = 0, R_c = 1 | A = i, B = j, C = k) / P(R_a = 1, R_b = 1, R_c = 1 | A = i, B = j, C = k)
      = exp[−2(β^{Rb} + β^{RaRb} + β^{RbRc} + β^{RaRbRc} + γ^{ARb}_i + γ^{BRb}_j + γ^{CRb}_k + γ^{ABRb}_ij + γ^{ACRb}_ik + γ^{BCRb}_jk + γ^{ARaRc}_i + γ^{BRaRc}_j + γ^{CRaRc}_k)],

c_ijk = P(R_a = 1, R_b = 1, R_c = 0 | A = i, B = j, C = k) / P(R_a = 1, R_b = 1, R_c = 1 | A = i, B = j, C = k)
      = exp[−2(β^{Rc} + β^{RaRc} + β^{RbRc} + β^{RaRbRc} + γ^{ARc}_i + γ^{BRc}_j + γ^{CRc}_k + γ^{ABRc}_ij + γ^{ACRc}_ik + γ^{BCRc}_jk + γ^{ARbRc}_i + γ^{BRbRc}_j + γ^{CRbRc}_k)],

d_ijk = [P(R_a = 0, R_b = 0, R_c = 1 | ·) P(R_a = 1, R_b = 1, R_c = 1 | ·)] / [P(R_a = 0, R_b = 1, R_c = 1 | ·) P(R_a = 1, R_b = 0, R_c = 1 | ·)]
      = exp[4(β^{RaRb} + β^{RaRbRc} + γ^{ARaRb}_i + γ^{BRaRb}_j + γ^{CRaRb}_k)],

e_ijk = [P(R_a = 0, R_b = 1, R_c = 0 | ·) P(R_a = 1, R_b = 1, R_c = 1 | ·)] / [P(R_a = 0, R_b = 1, R_c = 1 | ·) P(R_a = 1, R_b = 1, R_c = 0 | ·)]
      = exp[4(β^{RaRc} + β^{RaRbRc} + γ^{ARaRc}_i + γ^{BRaRc}_j + γ^{CRaRc}_k)],

f_ijk = [P(R_a = 1, R_b = 0, R_c = 0 | ·) P(R_a = 1, R_b = 1, R_c = 1 | ·)] / [P(R_a = 1, R_b = 0, R_c = 1 | ·) P(R_a = 1, R_b = 1, R_c = 0 | ·)]
      = exp[4(β^{RbRc} + β^{RaRbRc} + γ^{ARbRc}_i + γ^{BRbRc}_j + γ^{CRbRc}_k)],

where "·" abbreviates the conditioning event A = i, B = j, C = k, and

g_ijk = [P(0,0,0 | ·) P(0,1,1 | ·) P(1,0,1 | ·) P(1,1,0 | ·)] / [P(0,0,1 | ·) P(0,1,0 | ·) P(1,0,0 | ·) P(1,1,1 | ·)]
      = exp[−8 β^{RaRbRc}] = g, independent of i, j, k,

where P(l, m, n | ·) denotes P(R_a = l, R_b = m, R_c = n | A = i, B = j, C = k). Therefore,

μ_ijk111 = m_ijk,   μ_ijk011 = m_ijk a_ijk,   μ_ijk101 = m_ijk b_ijk,   μ_ijk110 = m_ijk c_ijk,
μ_ijk001 = m_ijk a_ijk b_ijk d_ijk,   μ_ijk010 = m_ijk a_ijk c_ijk e_ijk,   μ_ijk100 = m_ijk b_ijk c_ijk f_ijk,
μ_ijk000 = m_ijk a_ijk b_ijk c_ijk d_ijk e_ijk f_ijk g,

with m_ijk ≥ 0, a_ijk ≥ 0, b_ijk ≥ 0, c_ijk ≥ 0, d_ijk ≥ 0, e_ijk ≥ 0, f_ijk ≥ 0 and g ≥ 0.

3.1 Variable C always observed

We examine the simpler case where the stratum variable C is always observed, but the row variable A and the column variable B may be missing. As assumed before, there are no four-way or higher interactions, so that, after simplifying the subscript notation,

log μ_ijklm = μ + α^A_i + α^B_j + α^C_k + α^{AB}_ij + α^{AC}_ik + α^{BC}_jk + α^{ABC}_ijk + β^{Ra}_l + β^{Rb}_m + β^{RaRb}_lm
  + γ^{ARa}_il + γ^{ARb}_im + γ^{BRa}_jl + γ^{BRb}_jm + γ^{CRa}_kl + γ^{CRb}_km
  + γ^{ABRa}_ijl + γ^{ABRb}_ijm + γ^{ACRa}_ikl + γ^{ACRb}_ikm + γ^{BCRa}_jkl + γ^{BCRb}_jkm
  + γ^{ARaRb}_ilm + γ^{BRaRb}_jlm + γ^{CRaRb}_klm.

Hence

m_ijk = N P(A = i, B = j, C = k, R_a = 1, R_b = 1),
a_ijk = P(R_a = 0, R_b = 1 | A = i, B = j, C = k) / P(R_a = 1, R_b = 1 | A = i, B = j, C = k),
b_ijk = P(R_a = 1, R_b = 0 | A = i, B = j, C = k) / P(R_a = 1, R_b = 1 | A = i, B = j, C = k),
d_ijk = [P(R_a = 0, R_b = 0 | A = i, B = j, C = k) P(R_a = 1, R_b = 1 | A = i, B = j, C = k)] / [P(R_a = 0, R_b = 1 | A = i, B = j, C = k) P(R_a = 1, R_b = 0 | A = i, B = j, C = k)],

and

μ_ijk11 = m_ijk,  μ_ijk01 = m_ijk a_ijk,  μ_ijk10 = m_ijk b_ijk,  μ_ijk00 = m_ijk a_ijk b_ijk d_ijk,
m_ijk ≥ 0,  a_ijk ≥ 0,  b_ijk ≥ 0,  d_ijk ≥ 0.

Since R_c is always 1 in this case, we drop the sixth subscript to simplify the notation. Also note that e_ijk = f_ijk = g = 0. In this case one computes separate estimates for the K subtables of order I × J. Tables 3.2, 3.3 and 3.4 list the closed-form solutions for various missingness models. Under the common odds ratio assumption, α^{ABC}_ijk = 0, and the log-linear model is the general model stated at the beginning of this chapter with the α^{ABC}_ijk term omitted.
Table 3.: Closed Form stimates for I J K Tables a Missingness of both A, B depends on C ˆm 0 ij = z ij ˆm t+ ij = [z ij + z +j0 ˆm t ij / ˆmt /z ++ + z ++0 + z ++0 +j + z i+0 ˆm t ij / ˆmt i+ â = z ++0 /z ++ ˆb = z ++0 /z ++ ˆd = z ++ z ++00 /z ++0 z ++0 b Missingness of A depends on C, Missingness of B depends on A, C ˆm ij = z ij z ++ z +j+ /z +++ z +j+ â = z ++0 /z ++ ˆbi = z i+0 / ˆm i+ ˆd = z ++ z ++00 /z ++0 z ++0 b Missingness of A depends on B, C Missingness of B depends on C ˆm ij = z ij z ++ z i++ /z +++ z i++ â j = z +j0 / ˆm +j ˆb = z ++0 /z ++ ˆd = z ++ z ++00 /z ++0 z ++0 4
Table 3.3: Closed Form stimates for I J K Tables Continued c Missingness of A depends on C Missingness of B depends on B, C ˆm ij = z ij z +j+ z ++ /z +++ z +j â = z ++0 /z ++ ˆbj such that j ˆm ijˆb j = z i+0 ˆd = z ++ z ++00 /z ++0 z ++0 c Missingness of A depends on A, C Missingness of B depends on C ˆm ij = z ij z i++ z ++ /z i+ z +++ â i such that i ˆm ijâ i = z +j0 ˆb = z ++0 /z ++ ˆd = z ++ z ++00 /z ++0 z ++0 d Missingness of A, B depends on A, C ˆm ij = z ij â i such that i z ijâ i = z +j0 ˆbi = z i+0 /z i+ ˆd = z ++00 / i z i+â iˆbi d Missingness of A, B depends on B, C ˆm ij = z ij â j = z +j0 /z +j ˆbj such that j z ijˆb j = z i+0 ˆd = z ++00 / j z +jâ jˆbj 5
Table 3.4: Closed Form stimates for I J K Tables Continued e Missingness of A depends on A, C Missingness of B depends on B, C ˆm ij = z ij â i such that i ˆm ijâ i = z +j0 ˆbj such that j z ijˆb j = z i+0 ˆd = z ++00 / i j z ijâ iˆbj f Missingness of A depends on B, C Missingness of B depends on A, C ˆm ij = z ij â j = z +j0 /z +j ˆbi = z i+0 /z i+ ˆd = z ++00 / i j z ijâ jˆbi 6
As α^{ABC}_ijk = 0 in the common odds ratio models, the closed form for m̂_ijk will be the solution of the system of equations

Σ_i m_ijk (1 + a_ijk + b_ijk) = z_+jk11 + z_+jk01 + Σ_i z_i+k10 m_ijk b_ijk / Σ_{j'} m_ij'k b_ij'k,
Σ_j m_ijk (1 + a_ijk + b_ijk) = z_i+k11 + Σ_j z_+jk01 m_ijk a_ijk / Σ_{i'} m_i'jk a_i'jk + z_i+k10,
m_ijk (1 + a_ijk + b_ijk) = z_ijk11 + z_+jk01 m_ijk a_ijk / Σ_{i'} m_i'jk a_i'jk + z_i+k10 m_ijk b_ijk / Σ_{j'} m_ij'k b_ij'k,

but the â_ijk, b̂_ijk and d̂_ijk are still the same as in Tables 3.2, 3.3 and 3.4.

3.2 Variables B and C always observed

In the special case that both variables B and C are always observed, we consider models in which the missingness of variable A depends on C and on either A or B, but not both. In these models there are no four-way or higher interactions. After simplifying the subscript notation, we use μ_ijkl to denote the expected cell counts and z_ijkl the observations, and the log-linear model for the partially observed categorical variables is

log μ_ijkl = μ + α^A_i + α^B_j + α^C_k + α^{AB}_ij + α^{AC}_ik + α^{BC}_jk + α^{ABC}_ijk + β^{Ra}_l + γ^{ARa}_il + γ^{BRa}_jl + γ^{CRa}_kl + γ^{ACRa}_ikl + γ^{BCRa}_jkl.   (3.1)

When the missingness of A depends on (A, C) (informative missingness), only γ^{ARa}_il, γ^{CRa}_kl and γ^{ACRa}_ikl are in the model, and when the missingness of A depends on (B, C) (Missing at Random, or MAR(B, C)), only γ^{BRa}_jl, γ^{CRa}_kl and γ^{BCRa}_jkl are in the model. Moreover, as both B and C are always observed, b_ijk = 0 and g = 0 in both models, so only m_ijk and a_ijk need to be estimated. Table 3.5 shows the closed forms for these two models.
Table 3.5: Closed Form Estimator (B and C are Always Observed)

Model      MAR(B, C)                        Informative(A, C)
m̂_ijk      z_ijk1                           z_ijk1
â_ijk      â_·jk = z_+jk0 / z_+jk1          â_i·k such that Σ_i m̂_ijk â_i·k = z_+jk0
                                            (closed-form solution requires I = J and non-negative estimates)

3.3 Negative α̂_i boundary problems

In models (c1), (c2), (d1) and (d2) in Table 3.3, model (e) in Table 3.4 and the informative missingness model in Table 3.5, a closed-form solution exists only when I = J and the solutions for â_i·k or b̂_·jk are nonnegative. If any solution â_i·k is negative, the ML estimate lies on the boundary of the parameter space and boundary solutions need to be investigated. In the case that I = J = 2, we can obtain closed-form ML boundary estimates by setting one of the â_i·k = 0 in the likelihood equations. According to Baker, Rosenberger and DerSimonian [5], we must evaluate both boundaries, â_1·k = 0 and â_2·k = 0, and calculate G² in (3.2) to determine which parameter falls on the boundary of the parameter space:

G² = −2 { Σ_i Σ_j z_ijk11 log( m̂_ijk / z_ijk11 ) + Σ_j z_+jk01 log( Σ_i m̂_ijk â_ijk / z_+jk01 )
        + Σ_i z_i+k10 log( Σ_j m̂_ijk b̂_ijk / z_i+k10 ) + z_++k00 log( Σ_i Σ_j m̂_ijk â_ijk b̂_ijk d̂ / z_++k00 ) }.   (3.2)

If the two boundary fits give the same value of G², then both â_1·k and â_2·k are set to zero. Also, if there are not enough independent equations to solve for â_i·k, then the boundary solution is used. In the informative missingness case,

G² = −2 { Σ_i Σ_j z_ijk1 log( m̂_ijk / z_ijk1 ) + Σ_j z_+jk0 log( Σ_i m̂_ijk â_i·k / z_+jk0 ) }.
The boundary estimates for 2 × 2 × K tables when both the stratum variable C and the column variable B are always observed are presented in Table 3.6.

Table 3.6: Boundary Solution for Informative(A, C) when both B and C are always observed

            â_1·k = 0                                           â_2·k = 0
m̂_1jk      z_1jk1                                              z_1+k1 (z_1jk1 + z_+jk0) / (z_1+k1 + z_++k0)
m̂_2jk      z_2+k1 (z_2jk1 + z_+jk0) / (z_2+k1 + z_++k0)        z_2jk1
â_1·k      0                                                   z_++k0 / z_1+k1
â_2·k      z_++k0 / z_2+k1                                     0
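To illustrate Table 3.5 and the boundary fix of Table 3.6, here is a minimal sketch for a single stratum (our own code; the function names, array layout and counts are hypothetical). For brevity the informative case simply forces a negative â_i·k to zero rather than comparing the G² values of the two boundary fits as Section 3.3 prescribes.

```python
import numpy as np

def mar_bc_estimates(z_obs, z_miss):
    """MAR(B, C): A missing at random given B within a stratum.

    z_obs:  2x2 array, z_obs[i, j] = count with A = i+1, B = j+1 and A observed.
    z_miss: length-2 array, z_miss[j] = count with B = j+1 and A missing.
    Returns (m_hat, a_hat_by_column, imputed cell means for the missing-A margin).
    """
    m_hat = np.asarray(z_obs, dtype=float)
    a_col = np.asarray(z_miss, dtype=float) / m_hat.sum(axis=0)   # z_{+jk0} / z_{+jk1}
    return m_hat, a_col, m_hat * a_col                            # mu_hat_{ijk0} = m_hat * a_hat

def informative_ac_estimates(z_obs, z_miss):
    """Informative(A, C): missingness of A depends on A within the stratum.

    Solves sum_i m_hat[i, j] * a_hat[i] = z_miss[j] (needs I = J = 2); falls back
    to the Table 3.6 boundary solution if a component of a_hat is negative.
    """
    m_hat = np.asarray(z_obs, dtype=float)
    z_miss = np.asarray(z_miss, dtype=float)
    a_hat = np.linalg.solve(m_hat.T, z_miss)        # one equation per column j
    if np.all(a_hat >= 0):
        return m_hat, a_hat
    i0 = int(np.argmin(a_hat))                      # row forced to the boundary
    i1 = 1 - i0                                     # row that absorbs the missing counts
    a_bound = np.zeros(2)
    a_bound[i1] = z_miss.sum() / m_hat[i1].sum()
    m_bound = m_hat.copy()
    m_bound[i1] = m_hat[i1].sum() * (m_hat[i1] + z_miss) / (m_hat[i1].sum() + z_miss.sum())
    return m_bound, a_bound

z_obs = np.array([[12, 7], [5, 16]])   # hypothetical observed counts, one stratum
z_miss = np.array([4, 6])              # counts with A missing, by column B
print(mar_bc_estimates(z_obs, z_miss))
print(informative_ac_estimates(z_obs, z_miss))
```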
Chapter 4

Common Odds Ratio Estimation

Consider K 2 × 2 contingency tables, where the kth table has n_k observations and Σ_{k=1}^K n_k = N. When the stratum variable C = k and no data are missing, we observe the table

           B = 1     B = 2     Total
A = 1     z_11k     z_12k     n_1k
A = 2     z_21k     z_22k     n_2k
                              n_k

for k = 1, 2, ..., K. We assume that the odds ratio π_11k π_22k / (π_12k π_21k) = θ, independent of k (the common odds ratio). We consider the cases where the stratum variable C and the column variable B are always observed but the row variable A might be missing for some of the n_k observations. Variable A is either missing at random depending on variables B and C (MAR(B, C)) or informatively missing depending on variables A and C (Informative(A, C)). We study estimation of the common odds ratio based on the hypothetical fully observed data set, on only the completely observed subset, and on the closed-form estimated data set of Chapter 3.

4.1 Methods

4.1.1 Mantel-Haenszel Estimator

The Mantel-Haenszel estimator for the common odds ratio is

θ̂_MH = ( Σ_{k=1}^K z_11k z_22k / n_k ) / ( Σ_{k=1}^K z_12k z_21k / n_k ),

so that

θ̂_MH − θ = [ Σ_{k=1}^K ( z_11k z_22k / n_k − θ z_12k z_21k / n_k ) ] / [ Σ_{k=1}^K z_12k z_21k / n_k ].
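The complete-data behavior studied in this section can be illustrated with a small Monte Carlo sketch (our own code; the cell probabilities, stratum sizes and replication count are hypothetical). Each stratum is generated as two independent binomial rows sharing a common odds ratio θ, and the plain Mantel-Haenszel estimate is compared with the one-pair pseudotable version.

```python
import numpy as np

rng = np.random.default_rng(0)

def mh(tables, r=0):
    """Mantel-Haenszel common odds ratio; r = number of pseudotable pairs added."""
    t = np.asarray(tables, dtype=float)
    n = t.sum(axis=(1, 2))
    num = np.sum(t[:, 0, 0] * t[:, 1, 1] / n) + r / 2.0
    den = np.sum(t[:, 0, 1] * t[:, 1, 0] / n) + r / 2.0
    return num / den

def simulate(theta, K=20, n_k=10, p1=0.3, reps=2000):
    """Each stratum: two independent binomial rows with common odds ratio theta.

    Row A=1 has success probability p1; row A=2 uses p2 chosen so that
    [p1/(1-p1)] / [p2/(1-p2)] = theta.  (Hypothetical design, sparse strata.)
    """
    odds2 = (p1 / (1 - p1)) / theta
    p2 = odds2 / (1 + odds2)
    est_plain, est_pseudo = [], []
    for _ in range(reps):
        x1 = rng.binomial(n_k, p1, size=K)      # B=1 counts in row A=1
        x2 = rng.binomial(n_k, p2, size=K)      # B=1 counts in row A=2
        tables = np.stack([np.stack([x1, n_k - x1], axis=1),
                           np.stack([x2, n_k - x2], axis=1)], axis=1)
        est_plain.append(mh(tables))
        est_pseudo.append(mh(tables, r=1))
    return np.mean(est_plain), np.mean(est_pseudo)

print(simulate(theta=2.0))   # the plain estimate tends to sit above 2; pseudotables pull it down
```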
The quantities z_11k, z_12k and z_21k have a multinomial distribution with parameters n_k, π_11k, π_12k and π_21k, where n_k − z_11k − z_12k − z_21k = z_22k and 1 − π_11k − π_12k − π_21k = π_22k. Under the common odds ratio assumption,

θ = π_11k π_22k / (π_12k π_21k),
θ π_12k π_21k = π_11k π_22k = π_11k (1 − π_11k − π_12k − π_21k),
π_21k (π_11k + θ π_12k) = π_11k (1 − π_11k − π_12k),
π_21k = π_11k (1 − π_11k − π_12k) / (π_11k + θ π_12k),
π_22k = 1 − π_11k − π_12k − π_21k = (1 − π_11k − π_12k) θ π_12k / (π_11k + θ π_12k).

Using properties of the multinomial distribution,

E[z_11k z_22k / n_k] = (1/n_k) E[z_11k (n_k − z_11k − z_12k − z_21k)] = (1/n_k) n_k (n_k − 1) π_11k π_22k = (n_k − 1) π_11k π_22k,

and E[z_12k z_21k / n_k] = (n_k − 1) π_12k π_21k. Therefore

E[ Σ_{k=1}^K z_12k z_21k / n_k ] = Σ_{k=1}^K (n_k − 1) π_12k π_21k = Σ_{k=1}^K (n_k − 1) π_11k π_22k / θ = μ_x.
When K and θ are fixed, µ x is ON and when n, and θ are fixed and the π ij are bounded away from 0, µ x is OK. When K and n are fixed, µ x is O/θ. Writing ˆθ MH = y/x and using a Taylor expansion, ˆθ MH θ can be written as y θx x = y θx µ x = y θx µ x [ x µ x + µ x y θx x µx µ x µ x + The expected value of the first term of 4. is [ K z z /n θz z /n /µ x = = µ x n [z z θz z y θx x µx 4. = µ x n n n π π θn n π π = µ x n π π θπ π = 0. µ x µ x We have The expected value of the second term of 4. is [ K z K z /n θz z /n z z /n µ x µ x [ [ z z θz z z z z z = µ x n n n + [ z z θz z z z z z µ x n n n = K µ A + B. x = [ [ z z θz z z z z z A = n n n [ z z θz z z z = n [ [ z z z z θz z n n [ [ = n z z z z θ n zz µ x
because [z z θz z = 0. As [ n z z z z = n [z z z n z z z = n [n z z z zz z z zz z z z = n [n z z z z z z z z z z z n [z z z z 3z z z = n n n n n π π π n n n n 3ππ π n n n n n 3π π π n n n n n 3π π π and Then n 3n n n π π π = n n n n π π π n n 3π + π + π 3 = n n n 3 π π π π n [ n zz = n [z z z z + z z z + n [z z z + z z = n n n n n 3π π + n n n π π + n n n n π π + n n π π = n n n 3 π π π π /θ n + n n π π π + π + n π π. n n [ [ n z z z z θ n zz 3
= n n n 3 π π π π n θ n n n 3 π π π π /θ n θ n n π π π + π + n π π n = θ n n π π [n π + π + = n n π π [n π π +. n This implies A = n π π n π π + n B = [ z z θz z z z z z µ x n n n = µ [z z θz z [z z [z z x n n = 0. Therefore, the leading term of the bias for the Mantel-Haenszel estimator is µ x K = A = µ x = µ x 0. K = n π π n π π + n n n π π [n π π + Because µ x = O/N and the sum is O n, µ x K A = O/N. The expected value of the third term of 4. is K z K z /n θz z /n z z /n [z z /n = µ 3 x µ x K [ z z θz z z z [z z. 4. n n n As the observations from different strata are independent, all of the expectations of cross-product terms are zeros. µ x 4
The th term of the summation in 4. is We have [ z z θz z n [ z z = z z n 3 z z [z z n n [ z 3 θ z 3 n 3 [ z z z z [z z n 3 + θ [ [ z z θz z z z +. [ z z z z n 3 n = n 3 [z z z z n [ z z [z z = n 3 [z z z z z z + z z z z z + n 3 [z z z z z + z z z z n 3 = n 3 n n n n 3n 4n 5π π π π + n 3 n n n n 3n 4π π π π π + π + n 3 n n n n 3π π π π = n n n n 3n 4n 5π π π π + n n n n 3n 4π π π π π + π + n n n n 3π π π π. Note that z 3 z 3 = z z z z z z + 3z 3 z + 3z z 3 z 3 z 9z z z z 3 + 6z z + 6z z 4z z z 3 z = z z z z z + z 3 z + 3z z 3z z z z + z z z z 3 = z z z z z + 3z z + z z 3 z z 3z z + z z 5
z 3 z = z z z z + 3z z z z z z = z z z z + z z + z z z z z z 3 = z z z z + 3z z z z z z = z z z + z z z z = z z z + z z. So zz 3 3 = z z z z z z + 3z z z z z + 3z z z z z + zz 3 + 9zz + z z 3 9zz 9z z + 8z z, = z z z z z z + 3z z z z z + 3z z z z z + z z z z + 9z z z z + z z z z + z z z + z z + z z z + z z 3z z, = z z z z z z + 3z z z z z + 3z z z z z + z z z z + 9z z z z + z z z z + z z z + z z z + z z. We have [z z z z z z n 6
= n n n n 3n 4n 5π 3 π 3 [z z z z z n = n n n n 3n 4π 3 π [z z z z z n = n n n n 3n 4π π 3 [z z z z = n n n n 3π 3 π [z z z z = n n n n 3π π [z z z z = n n n n 3π π 3 [z z z = n n n π π [z z z = n n n π π [z z = n n π π Hence [z 3 z 3 = n n n n 3n 4n 5π 3 π 3 + 3n n n n 3n 4π 3 π + 3n n n n 3n 4π π 3 + n n n n 3π 3 π + 9n n n n 3π π + n n n n 3π π 3 + n n n π π + n n n π π + n n π π. Therefore, [ z 3 z 3 n 3 = n n n n 3n 4n 5π 3 π 3 /θ 3 7
+ 3 n n n n 3n 4π π /θ π + π + n n n n 3π π π /θ + 9 n n n n 3π π /θ + n n n n 3π π π /θ + n n n π π π /θ + n n n π π π /θ + n n π π /θ. Therefore, [ z z z z [ z 3 n 3 θ z 3 n 3 = n n n n 3n 4n 5ππ 3 /θ 3 + n n n n 3n 4π π /θ π π + n n n n 3π π /θ θ n n n n 3n 4n 5π 3 π 3 /θ 3 3θ n n n n 3n 4ππ /θ π π θ n n n n 3π + π π π /θ 9θ n n n n 3ππ /θ θ n n n π + π π π /θ θ n n π π /θ = n n n n 3n 4π π /θ π π 8 n n n n 3π π /θ θ n n n n 3π + π π π /θ θ n n n π + π π π /θ θ n n π π /θ = n n n n 3n 4π π π π /θ 6 n n n n 3π π /θ 8
n n n 3 n π π π + π n n n π π π π n n π π and [ [ z z z z [z z z n 3 + θ z [z z n 3 [ = n 3 z z z z θzz [z z [ = n n π π n [ [ n z z z z θ n zz = n π π θ n n π π [n π + π + = n π n π [n π π + /θ [ [ z z θz z z z = 0. n n So [ z z θz z z z [z z n n n = n n n n 3n 4π π π π /θ 6 n n n n 3π π /θ n n n 3 n π π π π n n n π π π π n n π π + n n π n π π π /θ + n π n π /θ. We also have n n n n 3n 4 + n n n 9
= n n n n 3n 4 + n n = n n n n + 7n + n n = n n n 6n = n n n. Writing N = K n, for fixed K and θ, µ x = O n and the third term of 4. is K z K z /n θz z /n z z /n [z z /n µ x µ x = µ 3 x K n n n ππ π π = O/N 3 = O/N 6 n n n n 3π π /θ n n n n 3π π π π n n n π π π π n n π π + n π n π/θ K On. Therefore, Bias MH = µ x = µ x K = K A + Remainder = O/N, n n π π [n π π + + O/N 4.3 as N, and the dominant term is positive. When n and K are fixed, we have µ x = O/θ and A = O, then the second term of 4. is µ x K A = Oθ O = Oθ, 30
and the third term of 4. is Oθ 3 O = Oθ 3. Therefore, for fixed K and N, when θ, the bias. For a fixed θ, bounded n and the π ij are bounded, away from 0, µ x = = K n π π K n c K max n c = OK. Then Leading term of the Bias = µ x K A = O/K OK = O/K and the third term of 4. is O/K 3 OK = O/K. So if n is fixed, K and K π π, the leading term of the bias also converges to 0. We have Next we investigate the variance. We have Var ˆθ MH θ. = Var = µ x = µ x = µ x K z z /n θz z /n = µ x [ K z z θz z 0 K K n [zz θz z z z + θ zz n [zz + [zz θ[z z z z. n [z z = [z z z z + z z z 3
+ [z z z + z z = n n n n 3π π + n n n π π π + π + n n π π and [z z z z = n n n n 3π π π π [z z = n n n n 3π π + n n n π π π + π + n n π π. Under the common odds ratio assumption, π π = θπ π. Therefore the unconditional expected value of the th term of the summation in the variance formula is [z z θz z = n n n n 3π π + n n n π π π + π + n n π π θn n n n 3π π π π + θ n n n n 3π π + θ n n n π π π + π + θ n n π π = n n n n 3θπ π π π + n n n π π π + π + n n π π θn n n n 3π π π π + θn n n n 3π π π π + θn n n π π π + π + θn n π π = n n n π π π + π + θπ + π + n n π π + θ. So the variance is Var ˆθ MH θ. = µ x K n n n π π π + π + θ π π 3
+ n π π + θ. n As µ x is ON when K and θ are fixed and OK when N and θ are fixed and the π ij are bounded away from 0, Var ˆθ MH θ is O/N for fixed K and θ and O/K for fixed N and θ with all cell probabilities bounded away from 0. For fixed N and K, µ x is O/θ that implies Var ˆθ MH θ is Oθ 3. Therefore, the Mantel-Haenszel method overestimates the common odds ratio, but when either the minimum table size or the number of strata goes to infinity bounded away from 0 cell probabilities, the estimator converges to the true common odds ratio. Hauc [9 and Breslow [7 derived similar results. 4.. Mantel-Haenszel stimator with pseudo-data To reduce the bias of the Mantel-Haenszel estimator, Wypij and Santner [48 introduced the pseudotable methods by adding to the original K table one or more pairs of the following pseudotables: 0 0 0 0 Here we focus on the case where only one pair of pseudotables are added. Then ˆθ P MH = K = z z /n + / K = z z /n + / K = z z /n + θ K = z z /n + ˆθ P MH θ = z z /n + [ z z [ Denominator = + n = [ z z + n = n π π + 33
= µ x + = µ x,p MH. As in section 4.., writing ˆθ P MH = y P MH /x P MH and using Taylor expansion, ˆθ P MH θ can be written as y P MH θx P MH = y [ P MH θx P MH x P MH µ x,p MH +... x P MH µ x,p MH µ x,p MH = y P MH θx P MH yp MH θx P MH xp MH µ x,p MH µ x,p MH µ x,p MH µ x,p MH yp MH θx P MH xp MH µ x,p MH + +. 4.4 µ x,p MH The expected value of the first term of 4.4 is [ yp MH θx P MH = = = = = µ x,p MH [ µ x,p MH µ x,p MH µ x,p MH µ x,p MH θ µ x,p MH. [ K z z + K θ = n µ x,p MH = z z n K [ z z z z θ + θ n n = [ n n + π π θ n n π π + θ n n [ n π π θπ π + θ n In the cases of θ > the expected value of the first term is negative. The expected value of the second term of 4.4 is [ z z /n + / θ z z /n + / µ x,p MH [ z z [ z z z z = θ = µ + x,p MH n n n + [ θ z z µ + x,p MH n µ x,p MH [ z z θ z z z z µ x,p MH + µ x,p MH n n [ z z θ n n z z /n + / µ x,p MH µ x,p MH µ x,p MH 34