How to design estimators fitting with accuracy measures: Theories and applications in bioinformatics

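Michiaki HAMADA, Toshiyuki SATO (2010)

1

Many problems in bioinformatics ask for a discrete structure to be predicted from data: sequence alignment (Miyazawa 22), ProbCons 4)), decoding of hidden Markov models (HMMs), and RNA secondary structure prediction. This article reviews how to design an estimator that fits the accuracy measure by which its predictions are evaluated.

Problem 1 (estimation, or decoding). Given data $D$ and a prediction space $Y$ with a probability distribution $p(y \mid D)$ on $Y$, predict a point $y \in Y$.

Definition 1 (gain function). For a true value $\theta \in Y$ and a prediction $y \in Y$, a function $G : Y \times Y \to \mathbb{R}^{+}$ is called a gain function; $G(\theta, y)$ is the gain obtained by predicting $y$ when the truth is $\theta$.

Definition 2 (MEG estimator). In Problem 1, the maximum expected gain (MEG) estimator is

\[
\hat{y}^{(\mathrm{MEG})} = \operatorname*{argmax}_{y \in Y} \int_{\Theta} G(\theta, y)\, p(\theta \mid D)\, d\theta .
\]

When the space is discrete, the integral is a sum over $\theta$.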
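The definition of the MEG estimator can be illustrated on a small finite prediction space, where the integral reduces to a sum. The following is a minimal sketch under stated assumptions: the function names (`meg_estimate`, `agreement_gain`), the toy posterior, and the choice of gain (number of positions where the prediction agrees with the truth) are all hypothetical and are not taken from the article.

```python
from itertools import product

def meg_estimate(candidates, posterior, gain):
    """Return the candidate y that maximizes sum_theta G(theta, y) * p(theta | D)."""
    def expected_gain(y):
        return sum(p * gain(theta, y) for theta, p in posterior.items())
    return max(candidates, key=expected_gain)

def agreement_gain(theta, y):
    """Hypothetical gain: number of positions where y agrees with theta."""
    return sum(int(t == v) for t, v in zip(theta, y))

# Toy posterior p(theta | D) over Y = {0,1}^2 (made-up numbers).
posterior = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
candidates = list(product((0, 1), repeat=2))

y_meg = meg_estimate(candidates, posterior, agreement_gain)
y_map = max(posterior, key=posterior.get)  # most probable single point
print("MEG estimate:", y_meg)
print("MAP estimate:", y_map)
```

In this toy case the most probable point and the MEG estimate differ, which is the situation the choice of gain function is meant to address.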

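The MEG estimator is also called the maximum expected accuracy (MEA) estimator. Stated with a loss function instead of a gain function, it corresponds to minimizing the expected loss.

3 γ-centroid estimators

Assumption 1. The prediction space is a set of binary vectors, $Y \subset \{0, 1\}^n$.

Both alignment and RNA secondary structure prediction satisfy this assumption. For a pairwise alignment of sequences $x$ and $x'$, set $y_{ik} = 1$ if $x_i$ and $x'_k$ are aligned and $y_{ik} = 0$ otherwise. A secondary structure is encoded in the same way, with one binary variable per candidate base pair.

For a prediction $y \in Y$ and a true value $\theta \in Y$, the numbers of true positives, true negatives, false positives and false negatives are written $TP(\theta, y)$, $TN(\theta, y)$, $FP(\theta, y)$ and $FN(\theta, y)$. A good prediction has large TP and TN and small FP and FN, which suggests the gain function

\[
G(\theta, y) = \alpha_1 TP(\theta, y) + \alpha_2 TN(\theta, y) - \alpha_3 FP(\theta, y) - \alpha_4 FN(\theta, y), \tag{1}
\]

where $\alpha_k$ ($k = 1, 2, 3, 4$) are constant weights. Common accuracy measures such as sensitivity (SEN), positive predictive value (PPV), the Matthews correlation coefficient (MCC) and the F-score are all functions of TP, TN, FP and FN 1).

Definition 3 (γ-centroid estimator). For $\gamma \ge 0$, the MEG estimator with the gain function

\[
G(\theta, y) = \gamma\, TP(\theta, y) + TN(\theta, y) \tag{2}
\]

is called the γ-centroid estimator. The case $\gamma = 1$ is the centroid estimator 2).

Proposition 1. The MEG estimator with the gain function (1) is equivalent to the γ-centroid estimator with

\[
\gamma = \frac{\alpha_1 + \alpha_4}{\alpha_2 + \alpha_3}.
\]

This holds because $TP + FN$ and $TN + FP$ do not depend on $y$. The parameter $\gamma$ adjusts the balance between SEN and PPV.

Assumption 2. If $y = \{y_i\} \in Y$, then every $y' = \{y'_i\}$ with $y'_i \in \{y_i, 0\}$ for all $i$ also belongs to $Y$.

Define the marginal probability

\[
p_i = \sum_{\theta \in Y} I(\theta_i = 1)\, p(\theta \mid D).
\]

For RNA secondary structures, $\{p_i\}$ are the base-pairing probabilities, which can be computed efficiently 21). The expected gain of (2) is $\sum_i [(\gamma + 1) p_i - 1]\, y_i$ plus a constant that does not depend on $y$, so the γ-centroid estimator maximizes the sum of $(\gamma + 1) p_i - 1$ over the positions with $y_i = 1$. Only positions with $p_i > 1/(\gamma + 1)$ contribute positively. For $0 \le \gamma \le 1$, the γ-centroid estimator is obtained by setting $y_i = 1$ if $p_i > 1/(\gamma + 1)$ and $y_i = 0$ otherwise; for the prediction spaces considered here this vector belongs to $Y$ 9).

Theorem 1 (γ-centroid estimator for alignment). For $\gamma \in [0, 1]$, the γ-centroid alignment consists of the aligned pairs whose probability $p_{ik}$ is larger than $1/(\gamma + 1)$.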
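The threshold rule for $0 \le \gamma \le 1$ can be checked numerically against a brute-force maximization of the expected gain (2). This is a minimal sketch, assuming for simplicity that $Y$ is all of $\{0,1\}^n$ (no structural constraints); the function names and the toy posterior are hypothetical.

```python
from itertools import product

def marginals(posterior, n):
    """p_i = sum over theta of I(theta_i = 1) * p(theta | D)."""
    return [sum(p for theta, p in posterior.items() if theta[i] == 1)
            for i in range(n)]

def gamma_centroid_threshold(p, gamma):
    """Predict y_i = 1 exactly when p_i > 1 / (gamma + 1)."""
    threshold = 1.0 / (gamma + 1.0)
    return tuple(int(p_i > threshold) for p_i in p)

def gamma_gain(theta, y, gamma):
    """Gain function (2): gamma * TP + TN."""
    tp = sum(1 for t, v in zip(theta, y) if t == 1 and v == 1)
    tn = sum(1 for t, v in zip(theta, y) if t == 0 and v == 0)
    return gamma * tp + tn

def gamma_centroid_brute_force(posterior, n, gamma):
    """Maximize the expected gain over all of {0,1}^n (assumed prediction space)."""
    def expected_gain(y):
        return sum(p * gamma_gain(theta, y, gamma) for theta, p in posterior.items())
    return max(product((0, 1), repeat=n), key=expected_gain)

# Toy posterior over {0,1}^3 (made-up numbers); marginals are about 0.7, 0.45, 0.2.
posterior = {(1, 0, 0): 0.45, (1, 1, 0): 0.25, (0, 1, 1): 0.2, (0, 0, 0): 0.1}
p = marginals(posterior, 3)

for gamma in (0.25, 0.5, 1.0):
    print(gamma, gamma_centroid_threshold(p, gamma),
          gamma_centroid_brute_force(posterior, 3, gamma))
```

Smaller values of γ raise the threshold, so fewer positions are predicted as positive; this is how γ trades SEN against PPV.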

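For $\gamma > 1$, the γ-centroid alignment is computed by a Needleman-Wunsch-type dynamic program 24):

\[
M_{i,k} = \max
\begin{cases}
M_{i-1,k-1} + (\gamma + 1)\, p_{ik} - 1 \\
M_{i-1,k} \\
M_{i,k-1}
\end{cases}
\]

where $M_{i,k}$ is the optimal score for aligning the prefixes $x_1 \cdots x_i$ and $x'_1 \cdots x'_k$.

Theorem 2 (γ-centroid estimator for RNA secondary structure). For $\gamma \in [0, 1]$, the γ-centroid secondary structure consists of the base pairs whose base-pairing probability $p_{ij}$ is larger than $1/(\gamma + 1)$. For $\gamma > 1$, it is computed by a Nussinov-type dynamic program 25):

\[
M_{i,j} = \max
\begin{cases}
M_{i+1,j} \\
M_{i,j-1} \\
M_{i+1,j-1} + (\gamma + 1)\, p_{ij} - 1 \\
\max_{k} \left[ M_{i,k} + M_{k+1,j} \right]
\end{cases}
\]

where $M_{i,j}$ is the optimal score for the subsequence $x_i x_{i+1} \cdots x_j$.

The same framework applies to phylogenetic trees. For a leaf set $S$ with $n = |S|$, a tree topology can be encoded as a binary vector of dimension $2^{n-1} - n - 1$, with one entry per non-trivial split of $S$. The Hamming distance between two such vectors corresponds to the topological distance between trees 26), which relates the 1-centroid estimator to that distance; the case $\gamma > 1$ is treated separately.

4 MEG estimators for other accuracy measures

The gain function (1) does not directly cover accuracy measures such as SEN, PPV, F-score and MCC. For RNA secondary structure prediction, an estimator aimed at MCC and F-score was proposed in 12). Hamada et al. introduced the pseudo-expected accuracy 12): for an accuracy measure that is a function of TP, TN, FP and FN (for example MCC or F-score; see 1)),

\[
\mathrm{Acc} = f(TP, TN, FP, FN),
\]

the pseudo-expected accuracy of a prediction $y$ is

\[
\widehat{\mathrm{Acc}}(y) = f(\overline{TP}, \overline{TN}, \overline{FP}, \overline{FN}),
\]

where $\overline{X}$ denotes the expected value of $X$ ($= TP, FP, TN, FN$) for the prediction $y$, computed from the marginal probabilities $\{p_i\}$ (the base-pairing probabilities in the case of RNA secondary structure). In 12), the pseudo-expected MCC or F-score is used to choose among γ-centroid secondary structures obtained for different values of $\gamma$, so that the parameter balancing SEN and PPV need not be fixed by hand.

5 γ-centroid estimators using additional sequences

When a structure is predicted for sequences $x$ and $x'$, a further sequence $z$ can supply additional information: the alignments of $x$ with $z$ and of $z$ with $x'$ are marginalized out, and the γ-centroid estimator is applied to the resulting probabilities 13, 14). This construction is related to the probabilistic consistency transformation (PCT) used in multiple alignment 4, 28); see 9).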
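The Nussinov-type recursion used for $\gamma > 1$ can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code of any tool cited here: positions are 0-based, the base-pairing probabilities are supplied as a dictionary with made-up values, no minimum hairpin-loop length is enforced, and the function name `gamma_centroid_structure` is hypothetical.

```python
def gamma_centroid_structure(p, n, gamma):
    """Nussinov-type DP maximizing the sum of (gamma + 1) * p_ij - 1 over chosen pairs.

    p     : dict mapping (i, j) with i < j (0-based) to a base-pairing probability
    n     : sequence length
    gamma : weight of true positives in the gain function
    Returns (optimal score, list of predicted base pairs).
    """
    def w(i, j):
        return (gamma + 1.0) * p.get((i, j), 0.0) - 1.0

    def inner(M, i, j):
        # Score of the interval strictly inside (i, j); an empty interval scores 0.
        return M[i + 1][j - 1] if i + 1 <= j - 1 else 0.0

    M = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(M[i + 1][j], M[i][j - 1], inner(M, i, j) + w(i, j))
            for k in range(i, j):  # bifurcation into [i, k] and [k + 1, j]
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best

    # Traceback: recompute the same expressions and follow the first one that matches.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i][j] == M[i + 1][j]:
            stack.append((i + 1, j))
        elif M[i][j] == M[i][j - 1]:
            stack.append((i, j - 1))
        elif M[i][j] == inner(M, i, j) + w(i, j):
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i, j):
                if M[i][j] == M[i][k] + M[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    score = M[0][n - 1] if n > 0 else 0.0
    return score, sorted(pairs)

# Made-up base-pairing probabilities for a sequence of length 6.
bpp = {(0, 5): 0.9, (1, 4): 0.8, (2, 3): 0.3, (1, 3): 0.15}
for gamma in (2.0, 6.0):
    print(gamma, gamma_centroid_structure(bpp, 6, gamma))
```

With the larger value of γ, the low-probability pair (2, 3) is also predicted, since its weight $(\gamma + 1) p_{ij} - 1$ becomes positive.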

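6 Applications

The γ-centroid estimators of Section 3 and related estimators are used in a wide range of bioinformatics software (Table 1).

6.1 RNA secondary structure prediction

CONTRAfold 5) predicts RNA secondary structures with an estimator based on base-pairing probabilities. The γ-centroid estimator predicts secondary structures that are better with respect to SEN and PPV 10).

6.2 Alignment

Schwartz et al. proposed AMA (Alignment Metric Accuracy) together with an estimator designed for AMA 29). The γ-centroid estimator is used for genome alignment in 6), where it is evaluated by SEN and PPV. As in RNA secondary structure prediction, $\gamma$ controls the balance between SEN and PPV, whereas AMA is a different accuracy measure.

6.3 Other applications

The γ-centroid estimator has been applied to the alignment of structured RNAs (CentroidAlign) 13) and to RNA secondary structure prediction that uses homologous sequence information (CentroidHomfold) 14). Kato et al. applied a γ-centroid-type estimator to RNA-RNA interaction prediction (RactIP) 17); because joint prediction of the interaction and the secondary structures of the two RNAs is NP-hard, the estimator is computed by integer programming. Beyond RNA, Gross et al. 7) use a related estimator in gene prediction, and Käll et al. proposed an HMM posterior decoder that corresponds to an MEG estimator 16).

7 Conclusion

Estimators should be designed to fit the accuracy measures by which predictions are evaluated, such as SPS, SEN, PPV, MCC and F-score (Table 1). Estimators of this kind are used in 5, 18, 20, 30); parameter training that takes accuracy measures into account is studied in 8, 31); and γ-centroid estimators are used in 6, 32).

Acknowledgments. This work was supported by NEDO.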

Table 1. Estimators and accuracy measures in existing methods.

Reference | Software | Target | Estimator | Accuracy measure
Holmes & Durbin 15) | | sequence alignment | ∞-centroid a | SPS b
Miyazawa 22) | | sequence alignment | 1-centroid |
Schwartz et al. 29) | | alignment | AMA c | AMA
Do et al. 4) | ProbCons | multiple sequence alignment | ∞-centroid d | SPS
Roshan et al. 27) | ProbAlign | multiple sequence alignment | ∞-centroid | SPS
Sahraeian et al. 28) | PicXAA | multiple sequence alignment | ∞-centroid | SPS
Frith et al. 6) | LAST | genome alignment | γ-centroid | SEN, PPV
Hamada et al. 10) | CentroidFold | RNA secondary structure | γ-centroid | SEN, PPV
Hamada et al. 12) | CentroidFold | RNA secondary structure | MCC/F-score e | MCC, F-score
Do et al. 5) | CONTRAfold | RNA secondary structure | f |
Lu et al. 20) | MaxExpect | RNA secondary structure | |
Ding et al. 3) | Sfold | RNA secondary structure | 1-centroid g |
Hamada et al. 14) | CentroidHomfold | RNA secondary structure h | γ-centroid | SEN, PPV
Hamada et al. 11) | CentroidAlifold | secondary structure of aligned RNAs | γ-centroid | SEN, PPV
Seemann et al. 30) | PETfold | secondary structure of aligned RNAs | |
Knudsen & Hein 19) | Pfold | secondary structure of aligned RNAs | |
Kiryu et al. 18) | McCaskill-MEA | secondary structure of aligned RNAs | |
Hamada et al. 13) | CentroidAlign | RNA alignment | γ-centroid | SEN, PPV
Tabei et al. 32) | SCARNA-LM | RNA alignment | γ-centroid | SEN, PPV
Kato et al. 17) | RactIP | RNA-RNA interaction | γ-centroid | SEN, PPV
Käll et al. 16) | | i | |
Gross et al. 7) | CONTRAST | gene prediction | |
Nánási et al. 23) | | HIV recombination detection | |

a The γ-centroid estimator in the limit γ → ∞ (see Eq. (2)); b Sum-of-pairs score; c Alignment metric accuracy; d γ-centroid estimator of Section 5; e γ … γ … 2 …; f RNA (…) SEN, PPV …; g 2 …; h 2 …; i RNA …

CBRC

References

1) P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16:412-424, May 2000.
2) L. Carvalho and C. Lawrence. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. U.S.A., 105:3209-3214, 2008.
3) Y. Ding, C. Chan, and C. Lawrence. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11:1157-1166, Aug 2005.
4) C. Do, M. Mahabhashyam, M. Brudno, and S. Batzoglou. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15:330-340, Feb 2005.
5) C. Do, D. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22:e90-98, Jul 2006.
6) M. C. Frith, M. Hamada, and P. Horton. Parameters for accurate genome alignment. BMC Bioinformatics, 11:80, Feb 2010.
7) S. Gross, C. Do, M. Sirota, and S. Batzoglou. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol., 8:R269, 2007.
8) S. S. Gross, O. Russakovsky, C. B. Do, and S. Batzoglou. Training conditional random fields for maximum labelwise accuracy. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 529-536. MIT Press, Cambridge, MA, 2007.
9) M. Hamada, H. Kiryu, W. Iwasaki, and K. Asai. Generalized Centroid Estimators in Bioinformatics. PLoS ONE 6(2): e16450, 2011.
10) M. Hamada, H. Kiryu, K. Sato, T. Mituyama, and K. Asai. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25:465-473, 2009.
11) M. Hamada, K. Sato, and K. Asai. Improving the accuracy of predicting secondary structure for aligned RNA sequences. Nucleic Acids Res., 2010 (in press).
12) M. Hamada, K. Sato, and K. Asai. Prediction of RNA secondary structure by maximizing pseudo-expected accuracy. BMC Bioinformatics, 11:586, 2010.
13) M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score. Bioinformatics, 25:3236-3243, 2009.
14) M. Hamada, K. Sato, H. Kiryu, T. Mituyama, and K. Asai. Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics, 25:i330-338, 2009.
15) I. Holmes and R. Durbin. Dynamic programming alignment accuracy. J. Comput. Biol., 5:493-504, 1998.
16) L. Käll, A. Krogh, and E. L. Sonnhammer. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics, 21 Suppl 1:i251-257, 2005.
17) Y. Kato, K. Sato, M. Hamada, Y. Watanabe, K. Asai, and T. Akutsu. RactIP: fast accurate prediction of RNA-RNA interaction using integer programming. Bioinformatics, 2010 (in press).
18) H. Kiryu, T. Kin, and K. Asai. Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23:434-441, 2007.
19) B. Knudsen and J. Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res., 31(13):3423-3428, Jul 2003.
20) Z. J. Lu, J. W. Gloor, and D. H. Mathews. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15:1805-1813, Oct 2009.
21) J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105-1119, May 1990.
22) S. Miyazawa. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8:999-1009, Oct 1995.
23) M. Nánási, T. Vinar, and B. Brejová. The Highest Expected Reward Decoding for HMMs with Application to Recombination Detection. CoRR, abs/1001.4499, 2010.
24) S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453, Mar 1970.
25) R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68-82, 1978.
26) D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131-147, February 1981.
27) U. Roshan and D. Livesay. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics, 22:2715-2721, Nov 2006.
28) S. M. Sahraeian and B. J. Yoon. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res., 38:4917-4928, Aug 2010.
29) A. S. Schwartz, E. W. Myers, and L. Pachter. Alignment metric accuracy, 2005. http://arxiv.org:qbio/0510052.
30) S. Seemann, J. Gorodkin, and R. Backofen. Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res., 36:6355-6362, 2008.
31) J. Suzuki, E. McDermott, and H. Isozaki. Training conditional random fields with multivariate evaluation measures. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 217-224, Sydney, Australia, July 2006. Association for Computational Linguistics.
32) Y. Tabei and K. Asai. A local multiple alignment method for detection of non-coding RNA sequences. Bioinformatics, 25:1498-1505, Jun 2009.