Molecular Structure: matching and kinematics Ioannis Z. Emiris Dept. of Informatics & Telecoms, University of Athens Algs in Struc.BioInfo 17
Outline
  Structure types, amino acids, Ramachandran plot
  Structure comparison
  Databases and prediction
  Kinematics and rigid transforms
  Motion planning: configuration space
  Appendix: Ramachandran, structure matching, geometric hashing
Reading: Wikipedia on RMSD (root-mean-square deviation). Choset, Kavraki et al., Principles of Robot Motion, Chap. 3, and E.
Structure types
Structure types
Primary: ATCCGTG, FQRRTVQILQT
Secondary: α-helix, β-sheet
Super-secondary: α-hairpin, β-hairpin, β-α-β
Tertiary: atom coordinates, e.g. (3.9 Å, 2.2 Å, -45.1 Å), ...; the overall fold. E.g. N, Hα, C, Cβ form a regular tetrahedron centered at Cα. Mirror symmetry gives isomers (proteins always use one isomer).
Quaternary: several monomers (domains) associated through van der Waals interactions.
Primary/tertiary structure of the 20 amino acids. Backbone skeleton as drawn: C at top, Cα at center, N at left. Smallest: glycine; special case: proline.
α-helix
DNA: usually B-DNA in cells, radius 10 Å; Z-DNA and A-DNA are less frequent, less standard helices.
Proteins: usually a right-handed spiral (αR). Side chains point outward from the helix axis. Glycine may form a left-handed helix (αL).
Each Cα advances by 100° and 1.5 Å, i.e. 3.6 residues per turn of the helix.
Rigidity from H-bonds (C=O)_i ... (HN)_{i+4}, and from inward hydrophobic, outward hydrophilic faces.
β-sheet
Parallel xor antiparallel (twisted). Composed of 2 (almost) coplanar strands. φ, ψ angles differ by π.
Each Cα advances by 3.5 Å.
Rigidity due to H-bonds C=O ... HN between neighboring strands.
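The per-residue figures (100° and 1.5 Å) are enough to generate an idealized Cα trace. The sketch below is illustrative only: the helix radius of 2.3 Å is an assumed value that does not come from the slides, and `helix_ca_coords` is a hypothetical helper name.

```python
import math

def helix_ca_coords(n_residues, radius=2.3, turn_deg=100.0, rise=1.5):
    """C-alpha positions of an ideal right-handed helix along the z axis.

    turn_deg and rise are the per-residue values quoted on the slide;
    radius (in Angstrom) is an assumed illustrative value.
    """
    coords = []
    for i in range(n_residues):
        angle = math.radians(turn_deg * i)
        coords.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       rise * i))
    return coords

residues_per_turn = 360.0 / 100.0   # 3.6 residues per turn
pitch = residues_per_turn * 1.5     # advance along the axis per full turn, in Angstrom
ca = helix_ca_coords(19)            # residues 0..18 span exactly 5 turns
```

Residue 18 has turned by 1800°, i.e. five full turns, so it sits directly above residue 0.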
Antiparallel vs Parallel
Protein tertiary structure
N, Hα, C, Cβ form a tetrahedron with center Cα.
Rotation around the single bonds N-Cα (φ) and Cα-C (ψ); one exception: proline (only ψ).
The angle ω around the peptide bond has 2 states:
  trans: ω ≈ 180° is usual;
  cis: ω ≈ 0° is rare, mostly at proline.
Sasisekharan-Ramakrishnan-Ramachandran diagram
Describes allowed main-chain conformations. Horizontal axis φ, vertical axis ψ, typically ω = 180°.
Regions: parallel β_P, twisted β_T; right-handed α_R, left-handed α_L; 3_10 and π helices.
Exceptions: Gly (no limitation), Pro (side chain bonds back to the backbone).
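The angles φ, ψ and ω are dihedral angles defined by four consecutive backbone atoms. A minimal sketch of the usual atan2 formula follows; the helper names are hypothetical, and the atom quadruples mentioned afterwards follow the common convention rather than anything stated on the slide.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the bond p1-p2."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)   # normals of the two planes
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy geometry: central bond along z, first atom on the +x side.
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

With the common convention, φ would be computed from (C of the previous residue, N, Cα, C) and ψ from (N, Cα, C, N of the next residue).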
Ramachandran diagram: example structure types: α, β, Gly for protein 2ACY [Lesk].
Structure comparison
Measure difference of matched sets
Hypotheses: point sets of equal cardinality, given correspondence (match).
Definition. (coordinate) Root Mean Square Deviation (c-rmsd):
$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \|x_i - y_i\|^2},$$
where $x_i, y_i \in \mathbb{R}^3$ are (Cα) atom coordinates in the SAME coordinate frame.
Lemma. c-rmsd satisfies the triangle inequality. Hence it defines a distance metric.
Optimal alignment of matched sets
Problem. Find the translation and rotation minimizing c-rmsd.
1. Translate to a common origin by subtracting from all $x_i$ their centroid $x_c = \frac{1}{n}\sum_{i=1}^{n} x_i$, $x_i \in \mathbb{R}^3$, and subtracting $y_c$ from all $y_i$; overall $O(n)$.
2. Rotate to optimal alignment by a $3 \times 3$ rotation matrix $Q$: $Q^T Q = I$. Also should have $\det Q = 1$. Deterministic linear algebra (SVD) algorithm [Kabsch]: $O(n)$.
Lemma: the optimal translation can be decoupled from the rotation optimization.
Proof: for any $Q$, the optimal translation brings the center of mass to the origin.
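As a quick illustration of the definition, here is a minimal sketch of c-rmsd for two matched point lists that are already in the same frame (no alignment is attempted); `c_rmsd` is a hypothetical helper name.

```python
import math

def c_rmsd(xs, ys):
    """Coordinate RMSD of two equally long lists of matched 3D points."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("point sets must be non-empty and of equal size")
    total = sum((a - b) ** 2
                for x, y in zip(xs, ys)
                for a, b in zip(x, y))
    return math.sqrt(total / len(xs))

xs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ys = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]   # xs shifted by 1 along z
shifted = c_rmsd(xs, ys)
```

Every point is displaced by exactly 1, so the deviation equals that shift; the value would drop to 0 after the optimal translation.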
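Matrix algebra
Let $X = [x_1, \ldots, x_n]^T$, $Y = [y_1, \ldots, y_n]^T \in \mathbb{R}^{n \times 3}$; then
$$\mathrm{RMSD}(X, Y) = \frac{1}{\sqrt{n}} \|X - Y\|_F, \quad \text{where } \|M\|_F^2 = \sum_{i,j} M_{ij}^2 = \mathrm{tr}(M^T M)$$
is the Frobenius norm, and $\mathrm{tr}(A) = \sum_i A_{ii}$ is the trace of the matrix $A = [A_{ij}]$.
Recall that a row vector $v^T$ is rotated as $v^T Q$; equivalently, the column vector $v \in \mathbb{R}^3$ becomes $Q^T v$.
Assume common centroid = 0, $X, Y \in \mathbb{R}^{n \times 3}$: the optimal value is
$$\mathrm{RMSD}(X, Y) = \frac{1}{\sqrt{n}} \min_Q \|Y - XQ\|_F, \quad Q^T Q = I, \ \det Q = 1.$$
Proposition. Optimizing the rotation $Q \in \mathbb{R}^{3 \times 3}$ reduces to $\max_Q \mathrm{tr}(Q^T X^T Y)$.
Singular Value Decomposition
Recall SVD: $X^T Y = U \Sigma V^T$, $U^T U = V^T V = I$, $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)$, where $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge 0$; $U, V, \Sigma$ are $3 \times 3$ like $X^T Y$, and the singular values are $\sigma_i = \sqrt{e_i} \ge 0$, with $e_i$ the eigenvalues of $(X^T Y)^T (X^T Y)$.
We wish to find $Q$ that maximizes
$$\mathrm{tr}(Q^T X^T Y) = \mathrm{tr}(Q^T U \Sigma V^T) = \mathrm{tr}(V^T Q^T U \Sigma) \le \mathrm{tr}(\Sigma).$$
The 2nd equality is by Lem. T; the inequality holds since $M = V^T Q^T U$ is orthonormal, so $|M_{ij}| \le 1$ and $\mathrm{tr}(M \Sigma) = \sum_i M_{ii} \sigma_i \le \sum_i \sigma_i$.
Thm. The maximum occurs at $M = I \Rightarrow Q = U V^T$. If $\det Q = -1$ then $Q$ is a reflection; negate the third column of $U$, i.e. take $Q = [U_1, U_2, -U_3] V^T$, to get a rotation. Overall complexity = $O(n)$.
Algorithm
Input: point sets $X, Y \in \mathbb{R}^{n \times 3}$ of $n$ corresponding points.
Output: minimum RMSD of the translated and rotated sets.
1. $x_c \leftarrow \sum_{i=1}^{n} x_i / n$, $y_c \leftarrow \sum_{i=1}^{n} y_i / n$.
2. $X \leftarrow \{x - x_c : x \in X\}$, $Y \leftarrow \{y - y_c : y \in Y\}$.
3. SVD: $X^T Y = U \Sigma V^T$. Optional: check $\sigma_3 > 0$, where $\Sigma = \mathrm{diag}[\sigma_1, \sigma_2, \sigma_3]$.
4. $Q \leftarrow U V^T$. If $\det Q < 0$ then $Q \leftarrow [U_1, U_2, -U_3] V^T$. // $U_i$: $i$th column
5. Return $\|XQ - Y\|_F / \sqrt{n}$ // or $\sqrt{\sum_{i=1}^{n} \|Q^T x_i - y_i\|^2 / n}$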
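The proposition follows by expanding the squared Frobenius norm. A short derivation, using only $Q^T Q = Q Q^T = I$, the trace form of the norm, and $\mathrm{tr}(A) = \mathrm{tr}(A^T)$, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$:

```latex
\begin{align*}
\|Y - XQ\|_F^2
  &= \mathrm{tr}\big((Y - XQ)^T (Y - XQ)\big) \\
  &= \mathrm{tr}(Y^T Y) - \mathrm{tr}(Y^T X Q) - \mathrm{tr}(Q^T X^T Y)
     + \mathrm{tr}(Q^T X^T X Q) \\
  &= \|Y\|_F^2 + \|X\|_F^2 - 2\,\mathrm{tr}(Q^T X^T Y).
\end{align*}
```

The first two terms do not depend on $Q$, so minimizing the norm over rotations is the same as maximizing $\mathrm{tr}(Q^T X^T Y)$.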
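The 3D algorithm needs an SVD routine. As a dependency-free illustration of the same two steps (center, then pick the rotation maximizing the trace), here is a sketch of the 2D analogue, where the optimal angle has a closed form. This is a swapped-in planar version, not the Kabsch SVD algorithm itself, and `align_2d` is a hypothetical name.

```python
import math

def centroid(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def align_2d(xs, ys):
    """Return (theta, rmsd): best rotation of centered xs onto centered ys."""
    n = len(xs)
    xc, yc = centroid(xs), centroid(ys)
    X = [(p[0] - xc[0], p[1] - xc[1]) for p in xs]
    Y = [(p[0] - yc[0], p[1] - yc[1]) for p in ys]
    # The objective is a*cos(theta) + b*sin(theta), maximal at atan2(b, a).
    a = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(X, Y))  # dot products
    b = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(X, Y))  # cross products
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    sq = sum((c * x[0] - s * x[1] - y[0]) ** 2 +
             (s * x[0] + c * x[1] - y[1]) ** 2 for x, y in zip(X, Y))
    return theta, math.sqrt(sq / n)

# ys is xs rotated by 0.5 rad and translated, so the residual should vanish.
xs = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
c0, s0 = math.cos(0.5), math.sin(0.5)
ys = [(c0 * x - s0 * y + 5.0, s0 * x + c0 * y - 3.0) for x, y in xs]
theta, rmsd = align_2d(xs, ys)
```

In 2D a rotation is always obtained, so no determinant check is needed; in 3D that check is what rules out reflections.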
distance-rmsd
Assume that $r$ distances $d_i$, $i = 1, \ldots, r$, are known between point pairs in $X$, and between the corresponding pairs in $Y$, denoted $d'_i$, $i = 1, \ldots, r$.
Defn. For $r$ matched distances,
$$\text{d-rmsd}^2 = \frac{1}{r} \sum_{i=1}^{r} (d_i - d'_i)^2, \quad r \le \binom{n}{2}.$$
Drawback: computed in $O(r) = O(n^2)$.
Lem. d-rmsd is invariant under rigid transforms: translate, rotate, reflect.
d-rmsd is a metric in (Euclidean) $\mathbb{R}^r$ space; but then one point represents both a conformation and its mirror image.
Please check [Guibas?]: c-rmsd / n ≤ d-rmsd ≤ 2 c-rmsd.
Vector of distances
Equivalent formulation: let $v(X) = (d_1, \ldots, d_r)$, $v(Y) = (d'_1, \ldots, d'_r) \in \mathbb{R}^r$ be the vectors of distances in $X$, $Y$ respectively. Their Euclidean distance is $\|v(X) - v(Y)\|_2 = \sqrt{r} \cdot \text{d-rmsd}(X, Y)$.
Subset of distances: use $r < \binom{n}{2}$ distances.
  Must correspond to the same pairs of points in all conformations.
  May choose $r$ uniformly selected pairs among the $\binom{n}{2}$.
  May choose the $r$ smallest or largest distances, in one conformation.
Alternative idea: distances from a few landmark atoms.
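A minimal sketch of d-rmsd over all point pairs, with an optional explicit pair list for the subset variant. The function names are hypothetical, and the mirror-image example illustrates the reflection invariance stated in the lemma.

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_rmsd(xs, ys, pairs=None):
    """Distance RMSD over the given index pairs (default: all pairs)."""
    if pairs is None:
        pairs = list(combinations(range(len(xs)), 2))
    total = sum((dist(xs[i], xs[j]) - dist(ys[i], ys[j])) ** 2
                for i, j in pairs)
    return math.sqrt(total / len(pairs))

xs = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
mirror = [(-x, y, z) for x, y, z in xs]               # reflection of xs
moved = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 4)]  # last atom displaced
same_up_to_mirror = d_rmsd(xs, mirror)
changed = d_rmsd(xs, moved)
```

No superposition step is needed, which is the practical appeal; the price is that a conformation and its mirror image are indistinguishable.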
Databases, and prediction
Databases
Protein Data Bank (PDB) (www.rcsb.org): structure information and retrieval.
A file starts with the protein name, authors, maybe secondary structure. Omits H-atoms.
Example: hemoglobin, an arginine residue (columns: record, atom, residue, x, y, z):
ATOM  N   ARG  16.467  -2.155  -11.004
ATOM  CA  ARG  16.174  -2.970   -9.786
ATOM  C   ARG  14.696  -3.056   -9.412
ATOM  O   ARG  14.307  -3.945   -8.624
ATOM  CB  ARG  16.892  -2.495   -8.550
Protein fold classification into hierarchies:
  SCOP (Structural Classification of Proteins), cf. next slide [Murzin et al 95, Andreeva et al 04]
  CATH (domains) (Class, Architecture, Topology, Homology) [Orengo et al 97, Pearl et al 05]
  FSSP (DALI offers structural alignment) [Holm,Sander 96]
  CE (structural alignment)
SCOP Hierarchy
Lowest level: individual protein domains (from PDB).
Families of homologues: similar structure, sequence, (function) imply common evolutionary origin.
Superfamilies: families of similar structure and function, weak evolutionary relationship.
Folds: superfamilies with common folding topology.
Highest level, classes: α, β, α + β, α/β (α and β), and small proteins.
Homology of structures expresses common ancestry, either
  evolutionary: evolved from a structure in a common ancestor (wings of bats and arms of primates), or
  developmental: from the same tissue in embryonal development (ovaries of female and testicles of male humans).
SCOP example
1. Root: SCOP
2. Class: α/β, mainly parallel β-sheets (β-α-β units)
3. Fold: Flavodoxin-like: 3 layers, α/β/α; parallel β-sheet of 5 strands, order 21345
4. Superfamily: Flavoproteins
5. Family: Flavodoxin-related, binds FMN
6. Protein: Flavodoxin
7. Species: Clostridium beijerinckii
[Lesk, p.224]
SCOP size
In July 2001, SCOP contained 13,220 PDB entries, in 31,474 domains:
Class                    families  superfamilies  folds
All-α proteins                337            224    138
All-β proteins                276            171     93
α/β proteins                  374            167     97
α + β proteins                391            263    184
Multi-domain                   35             28     28
Membrane, cell-surface         28             17     11
Small proteins                116             77     54
Total                        1557            947    605
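The records on the slide are abridged; real PDB files store ATOM records in fixed columns. The sketch below reads the atom name, residue name and coordinates by column position. It assumes the standard layout (atom name in columns 13-16, residue name in 18-20, x, y, z in 31-38, 39-46, 47-54); the serial number, chain and residue number used to build the sample line are hypothetical filler.

```python
def format_atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a fixed-column ATOM record (illustrative subset of the fields)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res, chain, resseq, x, y, z)

def parse_atom_line(line):
    """Return (atom name, residue name, (x, y, z)) from an ATOM record."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    name = line[12:16].strip()      # columns 13-16
    res = line[17:20].strip()       # columns 18-20
    x = float(line[30:38])          # columns 31-38
    y = float(line[38:46])          # columns 39-46
    z = float(line[46:54])          # columns 47-54
    return name, res, (x, y, z)

# C-alpha of the arginine from the slide; serial 1, chain A, residue 1 are made up.
line = format_atom_line(1, "CA", "ARG", "A", 1, 16.174, -2.970, -9.786)
atom = parse_atom_line(line)
```

Slicing by column rather than splitting on whitespace matters because adjacent coordinate fields can touch when values are large or negative.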
Rigid-body kinematics: Motivation
Molecular kinematics
Given a rigid body with specific degrees of freedom (e.g. dihedral angles about covalent bonds), its kinematics describe the allowed motions under certain geometric constraints (distances, angles, etc.).
Modeling of the constraints as an algebraic / optimization problem.
Applications: structure determination of small (sub)molecules, dimension reduction during docking, pharmacophore matching.
There are many small molecules: out of 730,000 with rotational dof, the largest group (about 15%) has 4 dof, and fewer than 10% have more than 10 dof [Irwin-Shoichet 04].
Rigid transforms
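Rigid (Euclidean) transformations
Preserve distances and angles.
Translation $d \in \mathbb{R}^3$: $x \mapsto x + d$.
Rotation $R \in SO(3)$: $R^{-1} = R^T$, $\det R = 1$, $x \mapsto Rx$.
$R^{-1}$: rotation by the negative angle. $R_1$ by $\theta_1$, $R_2$ by $\theta_2$ $\Rightarrow$ $R_1 R_2$ by $\theta_1 + \theta_2$ (for rotations about the same axis).
Reflection $R$: $\det R = -1$ (a reflection in $\mathbb{R}^2$ takes the body out of the plane).
Scaling and shearing are NOT rigid.
2D transforms
Rotation, scaling, shearing:
$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \quad \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \ (\text{typically } s_x, s_y > 0), \quad \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix}.$$
Homogeneous transform: translation by $d$, rotation (by $\theta$):
$$T = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & d \\ 0 & 0 & 1 \end{bmatrix},$$
with $R \in SO(2)$, $R^{-1} \in SO(2)$, $R^{-1} = R^T$, $\det R = 1$.
$$\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & d \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}_{i+1} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}_{i}$$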
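A small sketch of 2D homogeneous transforms as 3×3 matrices, checking two properties of rigid maps: planar rotation angles add under composition, and distances are preserved. The general translation (dx, dy) and all function names are illustrative assumptions, not taken from the slide.

```python
import math

def rigid(theta, dx=0.0, dy=0.0):
    """Homogeneous 3x3 matrix: rotate by theta, then translate by (dx, dy)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, dx],
            [s, c, dy],
            [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(t, p):
    """Apply t to the point p = (x, y), i.e. to the column (x, y, 1)."""
    x, y = p
    return (t[0][0] * x + t[0][1] * y + t[0][2],
            t[1][0] * x + t[1][1] * y + t[1][2])

composed = matmul(rigid(0.3), rigid(0.4))   # should equal rigid(0.7)
direct = rigid(0.7)

T = rigid(1.1, 3.0, -2.0)
p, q = (1.0, 2.0), (4.0, 6.0)               # |p - q| = 5
tp, tq = apply(T, p), apply(T, q)
dist_after = math.hypot(tp[0] - tq[0], tp[1] - tq[1])
```

Packing the translation into the third column is what lets a chain of link-to-link transforms be composed by plain matrix multiplication.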
Motion planning
Introduction
Questions about motion planning for a robotic mechanism:
  How much information is needed to determine the position of every point of the robot?
  How should this information be represented?
  What are the mathematical properties of this representation?
  How do we take obstacles into account when planning motions?
[Choset, Kavraki et al. Principles of Robot Motion, Chapter 3]
Basic concepts
Configuration (robot configuration, molecule conformation): a complete specification of the position (e.g. 3 coordinates) of every point of the robot.
Configuration space (C-space): the space of all possible configurations of the robot, where each configuration corresponds to one point of the space.
Degrees of freedom: the number of parameters needed to specify a configuration. Equivalently, the dimension of the configuration space.
Workspace: the physical space accessible to the robot, typically 3D.
Caution: workspace ≠ configuration space.
Example 1: Disc robot
A disc robot of given radius $r$ moving in the plane $\mathbb{R}^2$.
Configuration: $q = (x, y)$; it suffices to specify the center of the robot, hence C-space = $\mathbb{R}^2$.
For each configuration we can compute the points occupied by the robot:
$$R(x, y) = \{(x', y') \in \mathbb{R}^2 : (x - x')^2 + (y - y')^2 \le r^2\}, \quad r = \text{radius of the robot}.$$
We can define both the configuration space and the workspace. Both are subsets of $\mathbb{R}^2$, but they are different!
Example 2: Two-joint arm
Example 2: Two-joint arm
Configuration: the position of the hand is not enough (elbow up / down, i.e. $\theta_2$, remains ambiguous): the angles of both joints are needed, $q = (\theta_1, \theta_2)$.
Each joint can rotate on a unit circle $S^1$ $\Rightarrow$ configuration space $Q = S^1 \times S^1 = T^2$, i.e. a two-dimensional torus.
Workspace = a disc $\subset \mathbb{R}^2$ (figure on the right).
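A forward-kinematics sketch makes the elbow-up / elbow-down ambiguity concrete: two different configurations place the hand at the same point. Unit link lengths and a second angle measured relative to the first link are assumptions made for the illustration; `hand_position` is a hypothetical name.

```python
import math

def hand_position(theta1, theta2, l1=1.0, l2=1.0):
    """Hand (end-point) position of a planar two-link arm rooted at the origin.

    theta1 is measured from the x axis, theta2 relative to the first link,
    both counter-clockwise (an assumed convention).
    """
    elbow = (l1 * math.cos(theta1), l1 * math.sin(theta1))
    return (elbow[0] + l2 * math.cos(theta1 + theta2),
            elbow[1] + l2 * math.sin(theta1 + theta2))

# Two distinct points of the torus S^1 x S^1 mapping to the same hand position.
elbow_down = hand_position(0.0, math.pi / 2)
elbow_up = hand_position(math.pi / 2, -math.pi / 2)
```

Both calls give the hand at (1, 1) up to rounding, so the map from configuration space to hand positions is not one-to-one; this is why the configuration must record both angles.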
Obstacles
C-space obstacles: the configurations $q$ where the robot $R(q)$ collides with obstacle $W_i$: $O_i = \{q \in Q : R(q) \cap W_i \ne \emptyset\}$.
Free C-space: $Q_{free} = Q \setminus (\bigcup_i O_i)$.
Free path: a path without collisions with obstacles, which also contains no boundary points of $Q_{free}$. Given by a parametrization $c : [0, 1] \to Q_{free}$.
Semifree path: like a free path, but it may contain boundary points of $Q_{free}$: $c : [0, 1] \to \mathrm{Closure}(Q_{free})$.
Example 1 (with obstacles)
(1) Circular robot and polygonal obstacle in $\mathbb{R}^2$.
(2) The robot slides around the workspace obstacle. We check specific points.
(3) The trajectory of the center defines the C-space obstacle, where the robot = a point. Grown polygon = Minkowski sum of the original polygon + the disc.
Example 2 (with obstacles)
For the C-space obstacles, we consider a set of configurations and for each one compute whether it causes a collision.
The arm has 2 joints: $\theta_1 = 0$ on the x axis, $\theta_2 = 0$ on x, both CCW. One point is fixed (center of the left figure). [Choset, Kavraki et al. Sec. 3.2.2]
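Membership in the grown polygon can be tested without constructing the Minkowski sum: the disc robot centered at q collides exactly when q lies inside the polygon or within distance r of its boundary. A minimal sketch follows; the helper names are hypothetical and the polygon is given as a list of vertices.

```python
import math

def seg_dist(p, a, b):
    """Distance from the point p to the segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len2
    t = max(0.0, min(1.0, t))               # clamp to the segment
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def inside(p, poly):
    """Ray-casting point-in-polygon test."""
    x, y = p
    result = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                result = not result
    return result

def in_cspace_obstacle(q, r, poly):
    """True if a disc robot of radius r centered at q hits the polygon."""
    if inside(q, poly):
        return True
    edges = [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]
    return min(seg_dist(q, a, b) for a, b in edges) <= r

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
near_edge = in_cspace_obstacle((2.4, 1.0), 0.5, square)
far_edge = in_cspace_obstacle((2.6, 1.0), 0.5, square)
near_corner = in_cspace_obstacle((2.4, 2.4), 0.5, square)
```

The corner query is free although it lies inside the square offset by 0.5 on each side: the grown polygon has rounded corners.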
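The sampling idea can be sketched as a grid over the two joint angles, marking each configuration as colliding or free. The setup is assumed for illustration only: unit link lengths, a second angle measured relative to the first link, links treated as line segments, and a single disc obstacle.

```python
import math

def seg_dist(p, a, b):
    """Distance from the point p to the segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / len2
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def collides(theta1, theta2, center, radius, l1=1.0, l2=1.0):
    """True if either link touches the disc obstacle."""
    base = (0.0, 0.0)
    elbow = (l1 * math.cos(theta1), l1 * math.sin(theta1))
    hand = (elbow[0] + l2 * math.cos(theta1 + theta2),
            elbow[1] + l2 * math.sin(theta1 + theta2))
    return (seg_dist(center, base, elbow) <= radius or
            seg_dist(center, elbow, hand) <= radius)

def cspace_grid(center, radius, steps=36):
    """grid[i][j] is True when (2*pi*i/steps, 2*pi*j/steps) collides."""
    return [[collides(2 * math.pi * i / steps, 2 * math.pi * j / steps,
                      center, radius)
             for j in range(steps)]
            for i in range(steps)]

grid = cspace_grid(center=(1.5, 0.0), radius=0.3)
blocked = sum(sum(row) for row in grid)   # number of colliding samples
```

The True cells approximate the C-space obstacle on the torus; a finer grid trades time for resolution.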
Appendix
Ramachandran diagram (stats) 20-residue average except Gly / Pro
Structure matching
Rigid Matching
Finding the best transform, i.e. the one yielding maximal / bio-favorable superposition.
Dependent on sequence order:
  Matching set [Taylor-Orengo 89] (Dynamic Programming, SSAP).
  Fragments [Vriend-Sander] follow sequence order.
  FSSP-DALI [Holm,Sander 93], CE [Bourne,Shindyalov 98].
Independent of sequence (unlabeled points, different cardinalities):
  Geometric hashing (from vision): finds translation, rotation, scaling.
  Max-clique in SSE graph (by secondary-structure elements) [Mitchel et al] [Koch et al].
Sequence independence, cons (-) and pros (+):
  - 3D task vs essentially linear task.
  - Simultaneous match of sequence / structure is better.
  + Finds non-sequential motifs, e.g. binding sites.
  + Works with partial / disconnected input.
Geometric Hashing: 2D preprocess
Preprocess each point set (model) in the database: for each pair of points (e.g. points #4, #1 in the figure), define a reference frame. Compute the coordinates (x, y) of all points in this frame, and store [model, frame] in entry Hash(x, y).
Figure: storing 3 hash entries (2 shown by arrows) in 2D.
Geometric Hashing: 2D query
Online processing of the query point set (image):
I. Pick a reference frame (defined by 2 points): compute the coordinates of all query points in this frame.
II. Hash the query points: for every data point in a query point's hash entry, cast a vote for the corresponding [model, frame].
III. [model, transform] candidates with high scores induce a potential match: optimize the transform by least squares (or RMSD on matched points).
Figure: hashed points vote for each [model, frame] pair in their hash entries (2 arrows shown).
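The preprocessing and voting steps can be sketched in a few dozen lines. This is a toy version under stated assumptions: frames place the first basis point at the origin and the second at (1, 0), which makes coordinates invariant to translation, rotation and scaling; buckets are a fixed grid of cell size 0.1; all names and the two models are hypothetical.

```python
import math
from collections import defaultdict

def frame_coords(points, i, j):
    """Coordinates of the other points in the frame defined by points i, j."""
    ox, oy = points[i]
    ux, uy = points[j][0] - ox, points[j][1] - oy
    norm2 = ux * ux + uy * uy
    out = []
    for k, (px, py) in enumerate(points):
        if k in (i, j):
            continue
        vx, vy = px - ox, py - oy
        out.append(((vx * ux + vy * uy) / norm2, (vy * ux - vx * uy) / norm2))
    return out

def bucket(c, cell=0.1):
    return (round(c[0] / cell), round(c[1] / cell))

def preprocess(models):
    """Hash table: bucket -> list of (model name, frame) entries."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i in range(len(pts)):
            for j in range(len(pts)):
                if i != j:
                    for c in frame_coords(pts, i, j):
                        table[bucket(c)].append((name, (i, j)))
    return table

def query(table, pts, i=0, j=1):
    """Votes per (model, frame) for one query frame."""
    votes = defaultdict(int)
    for c in frame_coords(pts, i, j):
        for entry in table.get(bucket(c), []):
            votes[entry] += 1
    return dict(votes)

models = {"tri": [(0, 0), (1, 0), (0, 1), (2, 2)],
          "other": [(0, 0), (3, 0), (1, 4), (5, 1)]}
table = preprocess(models)

# Query: model "tri" rotated by 40 degrees, scaled by 2, translated.
a, s = math.radians(40), 2.0
image = [(s * (math.cos(a) * x - math.sin(a) * y) + 3.0,
          s * (math.sin(a) * x + math.cos(a) * y) - 1.0)
         for x, y in models["tri"]]
votes = query(table, image)
best = max(votes, key=votes.get)
```

A full implementation would try several query frames and then refine the winning candidates by least squares on the matched points, as in step III.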
Geometric Hashing: Complexity
Parameters. M = #structures in database (models), n = #points per structure/model, c = 1 + #points needed to define a frame: c = 3 in 2D, c = 4 in 3D.
Time complexity. Preprocess = $O(M n^c)$, online query = $O(H n^c)$, where H = cost of checking one hash-table entry. Typically $H = O(1)$ when space = $O(M n^c)$ and the hashing is good; H can reach $O(M n^c)$ for small/unlucky tables.
[eclass/eggrafa/apallaktikh/wolfson-rigoutsos 99]
Geometric hashing: generalization
Idea: given two objects, each with n unlabeled points:
  Each pair of almost-congruent triangles defines a 3D rigid transform (congruent/similar: invariant under translation, rotation, scaling).
  For each candidate transform, count the superposed points.
  For the best candidates, find the RMSD on matched pairs, keep the best.
Complexity $O(n^7)$ (if we exploit backbone geometry: $n^3$) [Wolfson slides].
Against a database:
  0. For each point (residue), define a local neighborhood.
  1. Geometric hashing gives seed matches.
  2. Cluster seed matches by merging matched points.
  3. Compare RMSDs of clusters; extend better clusters until solution.
Extra: store features into [model, frame, features].
Flexible Alignment
Motivation. Mutations/docking imply conformational change. Hinge and shear motion of domains [Lesk].
Existing work:
  3D curve matching [Schwartz,Sharir 87], using splines [Wolfson et al 91].
  Dock [Leach,Kutz]. FlexX (dock), FlexS (structures) use anchors [Lengauer,Lemmen,Klebe 98].
  Small-molecule database search [Rigoutsos,Platt,Califano 96].
  Pose clustering [Verbitsky,Wolfson,Nussinov 99].
  Known hinges, hashing [Fligelman,Nussinov,Wolfson 00].
  FlexProt [Shatsky,Nussinov,Wolfson 02].