Tilapia Genome Status
Analysis of Broad assembly v.1
February, 2011
Tom Kocher, Matt Conte, Lucile Soler
University of Maryland and CIRAD

Genome Browser
We have loaded the Broad assembly into GBrowse: Bouillabase.org or http://cichlid.umd.edu/cichlidlabs/kocherlab/genomebrowsers.html
We are adding a variety of annotations, and mappings of read data, to the browser tracks.

Assembly stats (Broad v.1)
77,754 contigs (N50 = 29,493) (>1kb?)
5,900 scaffolds (>1kb)
Scaffold length including gaps: 924,023,520 (N50 = 2,800,770)
Scaffold length excluding gaps: 816,089,150 (N50 = 2,757,744)

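For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of how it is usually computed. The function name and the toy lengths are illustrative assumptions, not values from the Broad assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy lengths (hypothetical): total = 300, so half = 150.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))  # 80, because 100 + 80 = 180 >= 150
```

Applied to scaffold lengths counted with and without gap bases, the same function would give the two scaffold N50 values listed on the slide.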
Good continuity
[Figure: Tilapia opsin BAC compared to Broad assembly v.1]

Base accuracy
Average depth of coverage is high, which should give very high accuracy.
We have not yet made detailed comparisons to gold standards (Sanger BAC ends).
Expect base accuracy >>99% (Q??)

Gaps
The current 927Mb assembly contains 112Mb of gaps (12.0%).

Gaps
Current Assembly
fraction of assembly   cumulative length   # scaffolds   scaffold length
0.10                          92,772,591             8         9,361,541
0.20                         185,545,182            21         5,793,747
0.30                         278,317,774            39         4,605,129
0.40                         371,090,365            62         3,703,282
0.50                         463,862,956            90         2,801,867
0.60                         556,635,547           127         2,265,546
0.70                         649,408,138           175         1,590,368
0.80                         742,180,730           250           952,115
0.90                         834,953,321           403           372,320
0.91                         844,230,580           430           330,839
0.92                         853,507,839           459           292,362
0.93                         862,785,098           495           240,771
0.94                         872,062,357           536           209,527
0.95                         881,339,616           584           168,139
0.96                         890,616,876           645           129,997
0.97                         899,894,135           730            88,114
0.98                         909,171,394           877            43,542
0.99                         918,448,653         1,518             6,459
1.00                         927,725,912         5,900             1,000

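A table of this kind can be derived from the list of scaffold lengths alone. The sketch below is one plausible way to do it, assuming scaffolds are sorted from longest to shortest; the function name and toy data are hypothetical.

```python
def cumulative_table(lengths, fractions):
    """For each fraction, report how many of the longest scaffolds are
    needed to reach that share of the total length, and the length of
    the last scaffold added."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    rows = []
    for frac in fractions:
        target = frac * total
        running = 0
        for count, length in enumerate(ordered, start=1):
            running += length
            if running >= target:
                rows.append((frac, count, length))
                break
    return rows


# Toy lengths (hypothetical), total = 300.
for row in cumulative_table([100, 80, 60, 40, 20], [0.5, 0.9, 1.0]):
    print(row)
```

The 0.5 row of such a table corresponds to the scaffold N50.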
Constructing a golden path with RH map of Galibert et al.
A golden path is the ordered sequence of assembly scaffolds along each chromosome.
Assembly scaffolds that cannot be placed in the golden path are lumped together in the unordered chromosome.

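The idea of a golden path can be illustrated with a small sketch. This is not the procedure actually used for the tilapia assembly: it assumes each mapped marker has a linkage group and map position, assigns a scaffold to the group holding most of its markers, and orders scaffolds by median marker position. All names and data are hypothetical, and scaffold orientation is ignored.

```python
from collections import Counter, defaultdict
from statistics import median


def build_golden_path(marker_map, marker_hits, all_scaffolds):
    """marker_map:  marker -> (linkage_group, map_position)
    marker_hits: marker -> scaffold the marker sequence aligns to
    Returns (golden_path, unordered)."""
    placements = defaultdict(list)
    for marker, scaffold in marker_hits.items():
        if marker in marker_map:
            placements[scaffold].append(marker_map[marker])

    by_group = defaultdict(list)
    for scaffold, hits in placements.items():
        # Assumption: majority vote decides the linkage group.
        group = Counter(lg for lg, _ in hits).most_common(1)[0][0]
        position = median(pos for lg, pos in hits if lg == group)
        by_group[group].append((position, scaffold))

    golden_path = {lg: [s for _, s in sorted(items)]
                   for lg, items in by_group.items()}
    # Scaffolds without any mapped marker go to the unordered set.
    unordered = [s for s in all_scaffolds if s not in placements]
    return golden_path, unordered


# Toy inputs (hypothetical markers and scaffolds).
marker_map = {"m1": ("LG1", 10.0), "m2": ("LG1", 12.0),
              "m3": ("LG1", 2.0), "m4": ("LG3", 5.0)}
marker_hits = {"m1": "scaffold_A", "m2": "scaffold_A",
               "m3": "scaffold_B", "m4": "scaffold_C"}
path, unordered = build_golden_path(
    marker_map, marker_hits,
    ["scaffold_A", "scaffold_B", "scaffold_C", "scaffold_D"])
print(path, unordered)
```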
Golden path
total length of genome   total length of golden path   ratio
927,725,912              657,266,498                   0.708

total number of scaffolds   number of scaffolds in GP   ratio
5,899                       236                         0.040

Karyotype of O. niloticus
[Figure: O. niloticus FISH with repeat-containing BAC (Ferreira et al. 2010)]
Note the high density of repeats on chr4 (LG3 in the genetic map).

Golden path
Expect average of 50Mb/chr
Most have 25-30Mb
LG3 (largest chromosome) has only 17Mb
LG7 has 53Mb?

LG        total scaffold length  nb scaffolds  note
LG1                  31,194,087             8
LG2                  25,304,446             6
LG3                  17,278,939             9
LG4                  26,483,370             8  *RH
LG5                  27,331,326             8
LG6                  27,289,678            14
LG7                  53,105,870            14  RH
LG8-24               29,449,623            10
LG9                  19,809,448             4  RH
LG10                 10,773,098             5
LG11                 31,190,552            13  *
LG12                 34,678,406            14  RH
LG13                 31,740,381             8  RH
LG14                 30,266,167            16  RH
LG15                 26,979,052            10
LG16-21              28,301,266            11
LG17                 23,955,958             8
LG18                 26,197,606             8
LG19                 29,056,773            10
LG20                 31,469,886             9
LG22                 20,073,157            10  RH
LG23                 18,956,114             5
ORPHANS              56,381,295            29

Golden path
About 60% of markers are found in the golden path.
This is the expected value if the golden path contains 70% of the genome and the assembly has 10% gaps.

LG        markers  markers matching  ratio
LG1            45                30  0.667
LG2            45                28  0.622
LG3            35                16  0.457
LG4            54                34  0.630
LG5            54                42  0.778
LG6            60                37  0.617
LG7            79                45  0.570
LG8-24         60                36  0.600
LG9            36                19  0.528
LG10           23                16  0.696
LG11           48                30  0.625
LG12           79                51  0.646
LG13           48                27  0.563
LG14           53                35  0.660
LG15           47                36  0.766
LG16-21        54                34  0.630
LG17           50                26  0.520
LG18           54                37  0.685
LG19           53                35  0.660
LG20           55                37  0.673
LG22           43                25  0.581
LG23           36                19  0.528
ORPHANS       126                75  0.595

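The "expected value" reasoning on this slide can be written out as a short calculation. It assumes markers are spread uniformly over the genome and that a marker falling in a gap cannot be matched; both are simplifying assumptions made here for illustration. The three rows are copied from the table above.

```python
golden_path_fraction = 0.70  # share of the genome in the golden path (from the slide)
gap_fraction = 0.10          # share of the assembly that is gap (from the slide)

# A marker is found only if it lies in the golden path and not in a gap.
expected_marker_fraction = golden_path_fraction * (1 - gap_fraction)
print(f"expected: {expected_marker_fraction:.2f}")  # expected: 0.63

# (number of markers, number matching) for a few linkage groups in the table.
rows = {"LG1": (45, 30), "LG3": (35, 16), "LG7": (79, 45)}
observed = {lg: round(matching / total, 3)
            for lg, (total, matching) in rows.items()}
print(observed)
```

The product, 0.63, is close to the roughly 60% of markers reported as found.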
LG1
[Figure: scaffolds placed on LG1: Scaffold_94, Scaffold_222, Scaffold_287, Scaffold_4098, Scaffold_40, Scaffold_17, Scaffold_154, Scaffold_0]
Total scaffold length: 31,194,087 in 8 scaffolds
Markers: 45, of which 30 matching (ratio 0.667)
A good example/result!

LG3
[Figure: scaffolds placed on LG3: Scaffold_414, Scaffold_56, Scaffold_75, Scaffold_88, Scaffold_357, Scaffold_116, Scaffold_297, Scaffold_341]
Total scaffold length: 17,278,939 in 9 scaffolds
Markers: 35, of which 16 matching (ratio 0.457)
Not so good: relatively little of this large chromosome is represented.

LG7
[Figure: scaffolds placed on LG7: Scaffold_52, Scaffold_9, Scaffold_138, Scaffold_270, Scaffold_8, S182, S142, Scaffold_30, Scaffold_166, Scaffold_103, Scaffold_78, Scaffold_171, Scaffold_6, Scaffold_291]
Total scaffold length: 53,105,870 in 14 scaffolds (RH)
Markers: 79, of which 45 matching (ratio 0.57)
Problematic: RH map and assembly scaffolds disagree.

Fish genome assemblies
Species               Chrom #  Genome size (Mb)  Mb unordered  % unordered
Tilapia v. 1 (2011)        22               927           270         29.2
Stickleback (2006)         22               463            62         13.5
Medaka (2005)              24               869           145         16.7
Tetraodon v.8 (2007)       21               358           118         32.8
Tetraodon v.7 (2004)       21               402           185         45.9

Probable misassembly
Scaffold 21 has multiple hits to RH markers on both tilapia LG4 (stickleback chr 11) and tilapia LG11 (stickleback chr 20).

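Candidate misassemblies of this kind could be screened for automatically. The sketch below flags any scaffold whose marker hits fall on two or more linkage groups with a minimum number of hits each. The threshold, function name, and toy data are assumptions for illustration, not the analysis performed on Scaffold 21.

```python
from collections import Counter, defaultdict


def flag_chimeric_scaffolds(hits, min_hits=2):
    """hits: iterable of (scaffold, linkage_group) pairs, one per marker hit.
    Returns {scaffold: {linkage_group: hit_count}} for scaffolds supported
    by at least `min_hits` markers on two or more linkage groups."""
    per_scaffold = defaultdict(Counter)
    for scaffold, group in hits:
        per_scaffold[scaffold][group] += 1

    flagged = {}
    for scaffold, counts in per_scaffold.items():
        supported = {lg: n for lg, n in counts.items() if n >= min_hits}
        if len(supported) >= 2:
            flagged[scaffold] = supported
    return flagged


# Toy hits (hypothetical): scaf_a looks chimeric, scaf_b has one stray marker.
toy_hits = ([("scaf_a", "LG4")] * 3 + [("scaf_a", "LG11")] * 2
            + [("scaf_b", "LG1")] * 4 + [("scaf_b", "LG2")])
print(flag_chimeric_scaffolds(toy_hits))
```

Requiring more than one hit per linkage group keeps a single mis-mapped marker from flagging an otherwise consistent scaffold.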
Breakpoint in Scaffold 21
[Figure annotations: "Not much support here"; "Good support for 25kb gap"]

Closer look at Scaffold 21
Not much support in 40kb data

Detail view Scaffold 21
Not much support in U Md 5kb data

BAC scaffolding
Type 1: only 1 end available, or only 1 end mapping to the assembly after repeat masking.
Type 2: both ends map to the same scaffold, at an appropriate distance/orientation (the vast majority).
Type 3: both ends map to the same scaffold, but at the wrong distance (almost none of these).
Type 4: the two ends map to different scaffolds. These have the potential to help link scaffolds.

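The four categories above can be expressed as a small classifier. This is a sketch under stated assumptions: the insert-size bounds are placeholder values, "appropriate orientation" is simplified to the two ends lying on opposite strands, and a same-scaffold pair with the wrong orientation is counted as type 3. None of these details come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndHit:
    scaffold: str
    position: int
    strand: str  # "+" or "-"


def classify_bac(end1: Optional[EndHit], end2: Optional[EndHit],
                 min_insert: int = 100_000, max_insert: int = 250_000) -> int:
    """Return the BAC type (1-4); None means the end is missing or unmapped."""
    if end1 is None or end2 is None:
        return 1  # only one end usable
    if end1.scaffold != end2.scaffold:
        return 4  # ends on different scaffolds: potential link
    distance = abs(end1.position - end2.position)
    opposite_strands = end1.strand != end2.strand
    if opposite_strands and min_insert <= distance <= max_insert:
        return 2  # consistent with the assembly
    return 3      # same scaffold, but distance (or orientation) is off


# Toy end placements (hypothetical).
a = EndHit("scaf_a", 1_000_000, "+")
print(classify_bac(a, None))                               # 1
print(classify_bac(a, EndHit("scaf_a", 1_150_000, "-")))   # 2
print(classify_bac(a, EndHit("scaf_a", 1_900_000, "-")))   # 3
print(classify_bac(a, EndHit("scaf_b", 20_000, "-")))      # 4
```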
Scaffolding with BACs
BAC scaffolding
Approximately 1,000 type 4 BACs map within 200kb of the end of a scaffold.
These are available for the next round of scaffolding, but seem to be too few to be of much help.

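Selecting the type 4 BACs that lie near scaffold ends could look like the following sketch. It assumes both ends must fall within the 200kb window of an end of their respective scaffolds; the slide does not say whether one or both ends were required, and the names and coordinates are hypothetical.

```python
def near_end(position, scaffold_length, window=200_000):
    """True if a 0-based position lies within `window` bases of either end."""
    return position < window or position >= scaffold_length - window


def linking_candidates(type4_bacs, scaffold_lengths, window=200_000):
    """type4_bacs: iterable of (bac_name, (scaffold1, pos1), (scaffold2, pos2)).
    Keep BACs whose two ends both sit near an end of their scaffold."""
    keep = []
    for name, (s1, p1), (s2, p2) in type4_bacs:
        if (near_end(p1, scaffold_lengths[s1], window)
                and near_end(p2, scaffold_lengths[s2], window)):
            keep.append((name, s1, s2))
    return keep


# Toy data (hypothetical scaffolds and BAC end positions).
lengths = {"scaf_a": 1_000_000, "scaf_b": 3_000_000}
bacs = [("bac1", ("scaf_a", 950_000), ("scaf_b", 50_000)),
        ("bac2", ("scaf_a", 500_000), ("scaf_b", 1_500_000))]
print(linking_candidates(bacs, lengths))
```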
40kb libraries
SRR071595: low complexity
SRR071611: good complexity

Question
Would additional 40kb libraries help the scaffolding effort?

Recent duplications
Recently duplicated genes may have important roles, but are poorly assembled by the WGS approach.
We have been studying the vasa gene, in order to identify promoter sequences to create a transgenic expressing GFP in the primordial germ cells and gonad.

BAC contigs
Most vertebrates have a single copy of the vasa gene.
PCR screening of the Katagiri/UNH tilapia BAC library identified two FPC contigs of BACs containing vasa genes:
Contig992: T4-2R 71CD02*, T4-2R 52B(B02), T4-2R 53B(A09)
Contig542: T4-2R 72AB04*, T4-2R 05B(B04), T3-2R 72CG(01)
We sequenced (by 454) and assembled the two BACs marked *.

Two vasa BACs sequenced (by 454)
~1% sequence divergence

71H03 vs Broad v.1
[Figure: BAC sequence contigs (contig1) aligned to scaffold_19, with the vasa gene locus marked]
The BAC sequence contigs of 71H03 are on Scaffold_19, and the organization of the contigs was confirmed.
However, the sequence at the vasa gene locus of 71H03 was not completely supported by the Broad sequence.

72C07 vs Broad v.1
[Figure: BAC sequence contigs (contig1) aligned to scaffold_11, with the vasa gene locus marked]
The BAC sequence contigs of 72C07 are on Scaffold_11, and the organization of the contigs was confirmed.
However, the sequence at the vasa gene locus of 72C07 was not completely supported by the Broad sequence.

Scaffolds 11 & 19 are incomplete
[Figure: vasa exons of BACs 71H03 and 72C07, with "+" marking the exons present in scaffold_11 and scaffold_19]
Scaffold_11 possesses exons 12-22 of the vasa gene.
Scaffold_19 possesses exons 4-13, except for 12, of the vasa gene.

3 vasa scaffolds!
[Figure: vasa cDNA aligned to scaffold_19, scaffold_11, and scaffold_160]

Three copies in Broad v.1, each assembled scaffold incomplete
[Figure labels: Original location; Scaffold 11; Scaffold 160; Scaffold 19]
Koji Fujimura, in prep

Annotation
U Maryland Maker annotation running
Should be available after 2 more weeks of computation
Expected results on next slide

Conclusions
Assembly has high continuity and base accuracy, similar to previous, Sanger-based, fish genomes.
Spanned gaps represent 10% of the current assembly. Reasons for the gaps are not yet known. It would be desirable to fill these.
About 25% of the assembly is not yet in a golden path. Placing the top 1500 scaffolds >6kb would incorporate 99% of the assembly into the golden path.
At least one probable misassembly has been identified, and should be scrutinized for general lessons.