BMI/CS 776 Lecture #14: Multiple Alignment - MUSCLE. Colin Dewey


BMI/CS 776 Lecture #14: Multiple Alignment - MUSCLE
Colin Dewey
2007.03.08

Importance of protein multiple alignment
- Phylogenetic tree estimation
- Prediction of protein secondary structure
- Critical residue identification

Example alignment:
AHLGHYGPEP
SHVSHYGSDS
SHVSHYGSDS
TSVSHYGAEP
PSASHYGVEH

Three cutting-edge multiple alignment methods
- MUSCLE (Edgar, 2004): progressive (profile alignment), fast tree building, refinement step
- ProbCons (Do et al., 2005): progressive alignment, PHMMs, maximum expected accuracy, consistency transformation
- AMAP (Schwartz & Pachter, 2007): sequence annealing, PHMMs, no tree, alignment metric accuracy

MUSCLE overview (Edgar, 2004)

Stage 1 - Draft progressive
1. Compute the k-mer distance between all pairs of input sequences
2. Construct an initial tree with UPGMA and the distances from step 1
3. Perform progressive profile alignment with the tree from step 2

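k-mer distance
- Much faster than performing pairwise alignment to get distances
- Uses a compressed alphabet (elements represent classes of amino acids)

$$d_{X,Y} = \frac{\sum_{\tau} \min\big(n_X(\tau),\, n_Y(\tau)\big)}{\min(l_X, l_Y) - k + 1}$$

- $X, Y$: sequences
- $\tau$: k-mer
- $n_X(\tau)$: number of occurrences of $\tau$ in $X$
- $l_X$: length of $X$

The denominator is the number of k-mers in the shorter sequence, so the ratio is the fraction of k-mers shared by the two sequences and equals 1 when $X$ and $Y$ are identical.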
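The formula above can be sketched in a few lines of Python. This is a minimal illustration, not MUSCLE's implementation: the function names and the default `k=3` are assumptions, and because the ratio is 1 for identical sequences, the sketch reports one minus the ratio as the distance (also an assumption).

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k=3):
    """Shared k-mers divided by the number of k-mers in the shorter sequence."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must contain at least k residues")
    nx, ny = kmer_counts(x, k), kmer_counts(y, k)
    # A Counter returns 0 for k-mers it has never seen.
    shared = sum(min(count, ny[kmer]) for kmer, count in nx.items())
    return shared / denom


def kmer_distance(x, y, k=3):
    """Assumed conversion: distance = 1 - shared fraction."""
    return 1.0 - kmer_similarity(x, y, k)
```

No alignment is computed at any point, which is where the speed advantage comes from: each sequence is scanned once to build its k-mer counts.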

Compressed alphabet

Table 1. Examples of compressed alphabets produced by different methods (Edgar, 2004)

Alpha(N)      Classes
SE-B(14)      A, C, D, EQ, FY, G, H, IV, KR, LM, N, P, ST, W
SE-B(10)      AST, C, DN, EQ, FY, G, HW, ILMV, KR, P
SE-V(10)      AST, C, DEN, FY, G, H, ILMV, KQR, P, W
Li-A(10)      AC, DE, FWY, G, HN, IV, KQR, LM, P, ST
Li-B(10)      AST, C, DEQ, FWY, G, HN, IV, KR, LM, P
Solis-D(10)   AM, C, DNS, EKQR, F, GP, HT, IV, LY, W
Solis-G(10)   AEFIKLMQRVW, C, D, G, H, N, P, S, T, Y
Murphy(10)    A, C, DENQ, FWY, G, H, ILMV, KR, P, ST
SE-B(8)       AST, C, DHN, EKQR, FWY, G, ILMV, P
SE-B(6)       AST, CP, DEHKNQR, FWY, G, ILMV
Dayhoff(6)    AGPST, C, DENQ, FWY, HKR, ILMV

Alphabet names are defined in the main text.

$$E(A) = \sum_{i \in A} \sum_{j \in A} p_{ij} \log\left(\frac{p_{ij}}{p_i\, p_j}\right)$$
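As an illustration of how a compressed alphabet is applied before counting k-mers, the sketch below maps each residue to a representative of its class using the Dayhoff(6) row of the table. The helper name `make_compressor` and the choice of the first letter of each class as its representative are assumptions made for this example.

```python
# Dayhoff(6) classes, copied from the table above.
DAYHOFF6 = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]


def make_compressor(classes):
    """Build a function that rewrites a sequence in the compressed alphabet.

    Each residue is replaced by the first letter of its class (an arbitrary
    choice of representative). Residues outside the 20 standard amino acids
    raise KeyError.
    """
    table = {residue: cls[0] for cls in classes for residue in cls}

    def compress(seq):
        return "".join(table[residue] for residue in seq)

    return compress


compress = make_compressor(DAYHOFF6)
```

With this mapping, the sequences AHLGHYGPEP and SHVSHYGSDS both compress to AHIAHFAADA, so they share every k-mer even though they differ at five positions in the full alphabet.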

UPGMA vs. neighbor-joining
- UPGMA is better for progressive alignment because it forces the most similar sequences to be aligned first

[Figure: two trees over leaves u, v, x, y. Left: true tree, recovered by NJ. Right: UPGMA tree.]
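A compact sketch of UPGMA shows why the most similar sequences are always joined first: each iteration merges the pair of clusters with the smallest current distance. The data layout here (distances stored in a dict keyed by `frozenset` pairs, the tree returned as nested tuples) is an assumption chosen for brevity, not the representation MUSCLE uses.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of leaves.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(sizes) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (tree,) = sizes
    return tree
```

For four leaves with d(u,v) = 2, d(x,y) = 4 and every other distance 8, the call returns `(("u", "v"), ("x", "y"))`: u and v are joined first because they are the closest pair.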

Progressive profile alignment
- Profile: alignment of two alignments by matching up corresponding columns, with scoring based on the composition of the columns
- Progressive: an alignment is computed at each node in the tree, from the leaves to the root

Profile alignment (figures from Edgar, 2004):

Profile X:   Profile Y:   Aligned:
MQTF         LTIF         MQT-F
LHTW         MTIW         LHT-W
LQSW                      LQS-W
                          L-TIF
                          M-TIW

Progressive profile alignment:

Leaves:           MQTIF, LHIW, LQSW, LSF
Internal nodes:   MQTIF     LQSW
                  LH-IW     L-SF
Root:             MQTIF
                  LH-IW
                  LQS-W
                  L-S-F
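Aligning two profiles is ordinary global alignment in which the units being matched are columns rather than single residues. The sketch below is a toy version under loud assumptions: the column score is a sum of pairs with identity scoring (`match`, `mismatch`), gaps use a linear penalty `gap`, and all names and parameter values are hypothetical. MUSCLE's actual scoring function and gap model differ.

```python
def column_freqs(profile):
    """Per column, map each symbol (including '-') to its frequency."""
    n_rows = len(profile)
    return [{c: col.count(c) / n_rows for c in set(col)} for col in zip(*profile)]


def column_score(fx, fy, match=1.0, mismatch=-1.0):
    """Toy sum-of-pairs score between two columns; gap symbols are skipped."""
    return sum(px * py * (match if a == b else mismatch)
               for a, px in fx.items() if a != "-"
               for b, py in fy.items() if b != "-")


def align_profiles(x, y, gap=-0.5):
    """Globally align profiles x and y (lists of equal-length rows)."""
    fx, fy = column_freqs(x), column_freqs(y)
    n, m = len(fx), len(fy)
    # score[i][j]: best score for the first i columns of x and first j of y.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(fx[i - 1], fy[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back, emitting whole columns (or all-gap columns) for each profile.
    out_x, out_y = [[] for _ in x], [[] for _ in y]
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + column_score(fx[i - 1], fy[j - 1])):
            take_x, take_y = True, True
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            take_x, take_y = True, False
        else:
            take_x, take_y = False, True
        for r, row in enumerate(x):
            out_x[r].append(row[i - 1] if take_x else "-")
        for r, row in enumerate(y):
            out_y[r].append(row[j - 1] if take_y else "-")
        i -= int(take_x)
        j -= int(take_y)
    return ["".join(reversed(row)) for row in out_x + out_y]
```

Calling `align_profiles(["MQTF", "LHTW", "LQSW"], ["LTIF", "MTIW"])` returns five rows of equal length. Within each input profile the columns stay intact, so any gap introduced appears in every row of that profile at once. The exact gap placement depends on the toy scoring parameters.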

Profile alignment scoring

How to score the alignment of two profile positions?

Common function (profile sum-of-pairs):

$$PSP^{xy} = \sum_i \sum_j f_i^x f_j^y S_{ij} = \sum_i \sum_j f_i^x f_j^y \log\left(\frac{p_{ij}}{p_i\, p_j}\right)$$

MUSCLE's log-expectation score:

$$LE^{xy} = (1 - f_G^x)(1 - f_G^y) \log \sum_i \sum_j f_i^x f_j^y \frac{p_{ij}}{p_i\, p_j}$$

- $f_G^x$: frequency of gaps in profile column $x$
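The two scoring functions differ only in where the logarithm sits, which is easiest to see side by side. The sketch below assumes each column is a dict from symbol to frequency over all rows, with gaps under the key `"-"`, and takes the joint probabilities `p_joint[i, j]` and background probabilities `p_bg[i]` as inputs. The function names and this data layout are assumptions for illustration.

```python
import math


def psp_score(fx, fy, p_joint, p_bg):
    """Profile sum-of-pairs: frequency-weighted average of log-odds scores."""
    return sum(
        fx[i] * fy[j] * math.log(p_joint[i, j] / (p_bg[i] * p_bg[j]))
        for i in fx if i != "-"
        for j in fy if j != "-"
    )


def le_score(fx, fy, p_joint, p_bg):
    """Log-expectation: log of the frequency-weighted average odds ratio,
    scaled down by the gap frequency of each column."""
    expectation = sum(
        fx[i] * fy[j] * p_joint[i, j] / (p_bg[i] * p_bg[j])
        for i in fx if i != "-"
        for j in fy if j != "-"
    )
    if expectation == 0.0:
        # At least one column holds only gaps; treat the score as zero.
        return 0.0
    gap_x, gap_y = fx.get("-", 0.0), fy.get("-", 0.0)
    return (1 - gap_x) * (1 - gap_y) * math.log(expectation)
```

For two fully conserved, gap-free columns the two scores coincide. For gap-free columns in general, LE is at least as large as PSP, because the logarithm of an average is never smaller than the average of the logarithms (Jensen's inequality).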

Stage 2 - Improved progressive
1. Using the multiple alignment from Stage 1, extract all implied pairwise alignments
2. Compute the Kimura distance between all pairs of sequences using the pairwise alignments
3. Compute a new tree using the Kimura distances
4. Compute a new multiple alignment with the new tree
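Steps 1 and 2 can be sketched as follows. The implied pairwise alignment of two rows is obtained by dropping columns in which both rows have a gap. The distance uses Kimura's protein correction, d = -ln(1 - D - D²/5), where D is the fraction of aligned residue pairs that differ. Returning infinity when the logarithm's argument is not positive is an assumption of this sketch, as are the function names.

```python
import math


def implied_pairwise(row_a, row_b):
    """Pairwise alignment implied by two rows of a multiple alignment."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in pairs), "".join(b for _, b in pairs)


def kimura_distance(row_a, row_b):
    """Kimura distance between two rows of a multiple alignment."""
    a, b = implied_pairwise(row_a, row_b)
    # Only columns with a residue in both sequences are compared.
    aligned = [(p, q) for p, q in zip(a, b) if p != "-" and q != "-"]
    if not aligned:
        raise ValueError("no aligned residue pairs")
    frac_diff = sum(p != q for p, q in aligned) / len(aligned)
    arg = 1.0 - frac_diff - frac_diff ** 2 / 5.0
    if arg <= 0.0:
        return math.inf  # assumption: formula undefined for very divergent pairs
    return -math.log(arg)
```

For the rows MQT-F and LHT-W, the implied pairwise alignment is MQTF over LHTW; three of the four aligned pairs differ, so D = 0.75 and d = -ln(0.1375).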

Stage 3 - Refinement
1. Choose an edge in the tree
2. Divide the sequences into two sets according to the split in the tree defined by that edge
3. Extract the multiple alignment (profile) for each set of sequences
4. Re-align the two profiles
5. Accept the new alignment if the SP score improves
6. Repeat
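Steps 2 and 3 of the refinement loop can be illustrated with a short sketch. It assumes the tree is stored as nested tuples of sequence names and the alignment as a dict from name to aligned row; both representations and all function names are hypothetical. Re-alignment and the SP-score acceptance test (steps 4 and 5) are not shown.

```python
def leaves(tree):
    """List the leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def edge_splits(tree):
    """Yield (inside, outside) leaf sets for the edge above every subtree.

    In a rooted binary tree the two edges at the root give the same
    bipartition, so that split is produced twice.
    """
    all_leaves = set(leaves(tree))

    def walk(node):
        if isinstance(node, str):
            return
        for child in node:
            inside = set(leaves(child))
            yield inside, all_leaves - inside
            yield from walk(child)

    yield from walk(tree)


def extract_profile(alignment, names):
    """Rows of `alignment` for `names` (sorted), with all-gap columns removed."""
    rows = [alignment[name] for name in sorted(names)]
    kept = [col for col in zip(*rows) if any(c != "-" for c in col)]
    return ["".join(col[r] for col in kept) for r in range(len(rows))]
```

Taking the root alignment MQTIF, LH-IW, LQS-W, L-S-F as rows s1 to s4 and the tree `(("s1", "s2"), ("s3", "s4"))`, extracting the profile for {s3, s4} gives LQSW and L-SF: the column that held gaps in both rows is dropped, which is what lets re-alignment place the gaps differently.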

Performance (Edgar, 2004)

Table 1. BAliBASE scores and times

Method      Q       TC      CPU
MUSCLE      0.896   0.747   97
MUSCLE-p    0.883   0.727   52
T-Coffee    0.882   0.731   1500
NWNSI       0.881   0.722   170
CLUSTALW    0.860   0.690   170
FFTNS1      0.844   0.646   16

Table 6. Q scores and CPU times on SABmark

Method      All     Superfamily   Twilight   CPU
MUSCLE      0.430   0.523         0.249      1886
T-Coffee    0.424   0.519         0.237      5615
MUSCLE-p    0.416   0.511         0.230      304
NWNSI       0.410   0.506         0.223      629
CLUSTALW    0.404   0.498         0.220      206
FFTNS1      0.373   0.467         0.190      75
Align-m     0.348   0.445         0.172      8902