BMI/CS 776 Lecture #14: Multiple Alignment - MUSCLE
Colin Dewey
2007.03.08

Importance of protein multiple alignment

- Phylogenetic tree estimation
- Prediction of protein secondary structure
- Critical residue identification

Example alignment:

    AHLGHYGPEP
    SHVSHYGSDS
    SHVSHYGSDS
    TSVSHYGAEP
    PSASHYGVEH

Three cutting-edge multiple alignment methods

- MUSCLE (Edgar, 2004): progressive (profile) alignment, fast tree building, refinement step
- ProbCons (Do et al., 2005): progressive alignment, pair HMMs (PHMMs), maximum expected accuracy, consistency transformation
- AMAP (Schwartz & Pachter, 2007): sequence annealing, PHMMs, no tree, alignment metric accuracy

MUSCLE overview

[Figure: overview of the MUSCLE algorithm, from Edgar, 2004]

Stage 1 - Draft progressive

1. Compute the k-mer distance between all pairs of input sequences.
2. Construct an initial tree with UPGMA, using the distances from step 1.
3. Perform progressive profile alignment, guided by the tree from step 2.

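k-mer distance

- Much faster than performing pairwise alignment to get distances
- Uses a compressed alphabet (elements represent classes of amino acids)

$$ d_{X,Y} = \frac{\sum_{\tau} \min\left(n_X(\tau),\, n_Y(\tau)\right)}{\min(l_X, l_Y) - k + 1} $$

where
- X, Y: sequences
- τ: a k-mer
- n_X(τ): number of occurrences of τ in X
- l_X: length of X

Note: as written, the right-hand side is the fraction of k-mers shared by X and Y, so it grows with similarity. A distance for tree building is obtained from it by a decreasing transform, such as one minus this fraction.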
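The k-mer formula can be sketched in a few lines of Python. This is a minimal illustration, not MUSCLE's implementation: the function names `kmer_counts` and `kmer_score` are hypothetical, and no alphabet compression is applied here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (n_X(tau) in the formula)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_score(x, y, k):
    """Sum over k-mers of min(n_X, n_Y), divided by min(l_X, l_Y) - k + 1."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must contain at least k residues")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A Counter returns 0 for k-mers it has never seen.
    shared = sum(min(n, cy[tau]) for tau, n in cx.items())
    return shared / denom


if __name__ == "__main__":
    print(kmer_score("AHLGHYGPEP", "SHVSHYGSDS", 3))
```

For the first two sequences of the example alignment and k = 3, the only shared 3-mer is HYG, so the value is 1/8 = 0.125; two identical sequences give 1.0.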

Compressed alphabet

Table 1. Examples of compressed alphabets produced by different methods

Alpha(N)      Classes
SE-B(14)      A, C, D, EQ, FY, G, H, IV, KR, LM, N, P, ST, W
SE-B(10)      AST, C, DN, EQ, FY, G, HW, ILMV, KR, P
SE-V(10)      AST, C, DEN, FY, G, H, ILMV, KQR, P, W
Li-A(10)      AC, DE, FWY, G, HN, IV, KQR, LM, P, ST
Li-B(10)      AST, C, DEQ, FWY, G, HN, IV, KR, LM, P
Solis-D(10)   AM, C, DNS, EKQR, F, GP, HT, IV, LY, W
Solis-G(10)   AEFIKLMQRVW, C, D, G, H, N, P, S, T, Y
Murphy(10)    A, C, DENQ, FWY, G, H, ILMV, KR, P, ST
SE-B(8)       AST, C, DHN, EKQR, FWY, G, ILMV, P
SE-B(6)       AST, CP, DEHKNQR, FWY, G, ILMV
Dayhoff(6)    AGPST, C, DENQ, FWY, HKR, ILMV

Alphabet names are defined in the main text.

$$ E(A) = \sum_{i \in A} \sum_{j \in A} p_{ij} \log\left(\frac{p_{ij}}{p_i\, p_j}\right) $$

(Edgar, 2004)
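To make the idea of a compressed alphabet concrete, the sketch below recodes sequences with the Dayhoff(6) classes from the table. Representing each class by its first member is an arbitrary choice made for this illustration, and `build_compressor` and `compress` are hypothetical helper names.

```python
# Dayhoff(6) classes, copied from the table above.
DAYHOFF6 = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]


def build_compressor(classes):
    """Map each amino acid to a representative letter: the first member of its class."""
    return {aa: cls[0] for cls in classes for aa in cls}


def compress(seq, table):
    """Recode a protein sequence in the compressed alphabet."""
    return "".join(table[aa] for aa in seq)


if __name__ == "__main__":
    table = build_compressor(DAYHOFF6)
    for seq in ("AHLGHYGPEP", "SHVSHYGSDS"):
        print(seq, "->", compress(seq, table))
```

Both example sequences recode to AHIAHFAADA: residues that differ but fall in the same class become identical, so related sequences share more k-mers after compression.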

UPGMA vs. neighbor joining

UPGMA is better for progressive alignment because it forces the most similar sequences to be aligned first.

[Figure: two trees on leaves u, v, x, y: the true tree (recovered by NJ) and the UPGMA tree]
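The property that UPGMA always joins the closest pair of clusters first can be seen in a small sketch. This is a plain textbook UPGMA written for illustration, not MUSCLE's tree builder; the distance values in the usage example are invented, and the nested-tuple tree representation is an assumption of this sketch.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(clusters) > 1:
        # Always merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Size-weighted average keeps distances equal to the mean over leaf pairs.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    # Hypothetical distances: u and v are closest, then x and y.
    pair = lambda a, b: frozenset((a, b))
    dist = {pair("u", "v"): 0.2, pair("x", "y"): 0.6,
            pair("u", "x"): 1.0, pair("u", "y"): 1.0,
            pair("v", "x"): 1.0, pair("v", "y"): 1.0}
    print(upgma(["u", "v", "x", "y"], dist))
```

With these toy distances, u and v are merged first, then x and y, and the two pairs are joined at the root.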

Progressive profile alignment

- Profile: alignment of two alignments by matching up corresponding columns, with scoring based on the composition of the columns
- Progressive: an alignment is computed at each node in the tree, from the leaves to the root

[Figures from Edgar, 2004: (1) profile alignment of two profiles X and Y; (2) progressive profile alignment along the tree]

Profile alignment scoring

How to score the alignment of two profile positions?

Common function (profile sum-of-pairs):

$$ \mathrm{PSP}^{xy} = \sum_i \sum_j f_i^x f_j^y S_{ij} = \sum_i \sum_j f_i^x f_j^y \log\left(\frac{p_{ij}}{p_i\, p_j}\right) $$

MUSCLE's log-expectation score:

$$ \mathrm{LE}^{xy} = (1 - f_G^x)(1 - f_G^y) \log \sum_i \sum_j f_i^x f_j^y \left(\frac{p_{ij}}{p_i\, p_j}\right) $$

where $f_G^x$ is the frequency of gaps in profile column x.
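The two scoring functions differ only in where the logarithm sits, which a small numerical sketch makes visible. The two-letter alphabet and the odds ratios p_ij / (p_i p_j) below are toy values invented for this example (joint probabilities 0.4 for a match and 0.1 for a mismatch, background 0.5 each); the function names are hypothetical.

```python
import math

# Toy odds ratios p_ij / (p_i * p_j) for a two-letter alphabet (invented values).
RATIO = {"A": {"A": 1.6, "B": 0.4},
         "B": {"A": 0.4, "B": 1.6}}


def psp(fx, fy, ratio=RATIO):
    """Profile sum-of-pairs: the frequency-weighted average of log odds."""
    return sum(fx[i] * fy[j] * math.log(ratio[i][j]) for i in fx for j in fy)


def log_expectation(fx, fy, gap_x, gap_y, ratio=RATIO):
    """Log-expectation: log of the average odds, scaled by the gap factors."""
    expected = sum(fx[i] * fy[j] * ratio[i][j] for i in fx for j in fy)
    return (1 - gap_x) * (1 - gap_y) * math.log(expected)


if __name__ == "__main__":
    conserved = {"A": 1.0, "B": 0.0}
    mixed = {"A": 0.5, "B": 0.5}
    print(psp(conserved, conserved), log_expectation(conserved, conserved, 0.0, 0.0))
    print(psp(mixed, mixed), log_expectation(mixed, mixed, 0.0, 0.0))
    print(log_expectation(conserved, conserved, 0.5, 0.0))
```

For two fully conserved columns the scores agree. For two evenly mixed columns PSP is negative while LE is approximately zero. A column that is half gaps halves the LE score through the factor (1 - f_G).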

Stage 2 - Improved progressive

1. Using the multiple alignment from Stage 1, extract all implied pairwise alignments.
2. Compute the Kimura distance between all pairs of sequences using the pairwise alignments.
3. Compute a new tree using the Kimura distances.
4. Compute a new multiple alignment with the new tree.
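Steps 1 and 2 can be sketched as follows. The sketch assumes the commonly used Kimura correction for proteins, d = -ln(1 - D - D^2/5), where D is the fraction of differing residues; the slides do not give the formula, and MUSCLE's own code may treat large D differently. The function names and the example rows are hypothetical.

```python
import math


def fractional_difference(row_a, row_b):
    """Fraction of mismatches in the pairwise alignment implied by two MSA rows.

    Columns where either row has a gap are skipped.
    """
    pairs = [(p, q) for p, q in zip(row_a, row_b) if p != "-" and q != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(p != q for p, q in pairs) / len(pairs)


def kimura_distance(row_a, row_b):
    """Kimura distance, assuming d = -ln(1 - D - D^2 / 5)."""
    d = fractional_difference(row_a, row_b)
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(arg)


if __name__ == "__main__":
    # Two hypothetical rows taken from the same multiple alignment.
    print(kimura_distance("MQTIFL", "MQSI-L"))
```

Because the correction grows faster than D itself, it compensates for multiple substitutions at the same site, which a raw mismatch fraction underestimates.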

Stage 3 - Refinement

1. Choose an edge in the tree.
2. Divide the sequences into two sets according to the split in the tree defined by that edge.
3. Extract the multiple alignment (profile) for each set of sequences.
4. Re-align the two profiles.
5. Accept the new alignment if the SP score improves.
6. Repeat.
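The refinement loop above can be sketched as a skeleton in which the two expensive pieces are supplied by the caller. Everything here is an assumption made for illustration: the tree is a nested tuple of leaf names, the alignment is a dict from name to gapped row, and `realign` and `sp_score` are hypothetical callables standing in for profile-profile alignment and sum-of-pairs scoring.

```python
def leaves(tree):
    """Leaf labels under a nested-tuple tree (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def edge_splits(tree):
    """Yield (inside, outside) leaf sets for every edge below the root."""
    everything = set(leaves(tree))
    stack = [] if isinstance(tree, str) else list(tree)
    while stack:
        node = stack.pop()
        inside = set(leaves(node))
        yield inside, everything - inside
        if not isinstance(node, str):
            stack.extend(node)


def refine(alignment, tree, realign, sp_score, rounds=1):
    """Tree-dependent refinement: keep a re-alignment only if the score improves."""
    best = sp_score(alignment)
    for _ in range(rounds):
        for inside, outside in edge_splits(tree):
            # Extract the two profiles; realign is expected to drop all-gap columns.
            left = {name: alignment[name] for name in inside}
            right = {name: alignment[name] for name in outside}
            candidate = realign(left, right)
            score = sp_score(candidate)
            if score > best:
                alignment, best = candidate, score
    return alignment
```

Because a candidate is accepted only when it scores strictly higher, the score of the returned alignment is never lower than that of the input, whatever `realign` returns.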

Performance

Table 1. BAliBASE scores and times

Method      Q       TC      CPU
MUSCLE      0.896   0.747   97
MUSCLE-p    0.883   0.727   52
T-Coffee    0.882   0.731   1500
NWNSI       0.881   0.722   170
CLUSTALW    0.860   0.690   170
FFTNS1      0.844   0.646   16

Table 6. Q scores and CPU times on SABmark

Method      All     Superfamily   Twilight   CPU
MUSCLE      0.430   0.523         0.249      1886
T-Coffee    0.424   0.519         0.237      5615
MUSCLE-p    0.416   0.511         0.230      304
NWNSI       0.410   0.506         0.223      629
CLUSTALW    0.404   0.498         0.220      206
FFTNS1      0.373   0.467         0.190      75
Align-m     0.348   0.445         0.172      8902

(Edgar, 2004)