The Gramene Team is pleased to announce its release #51. In collaboration with Ensembl Plants, we are providing in this release:
The maize B73 RefGen_v4 assembly is an entirely new assembly of the maize genome, released with a new set of gene annotations. It was constructed from PacBio Single Molecule Real-Time (SMRT) sequencing at approximately 60-fold coverage and scaffolded with the aid of a high-resolution whole-genome restriction (optical) map. The pseudomolecules of maize B73 RefGen_v4 were assembled nearly end-to-end, representing a 52-fold improvement in average contig size relative to the previous reference (B73 RefGen_v3).
The gene set was annotated with the MAKER pipeline (Campbell et al, 2014) using about 111,000 transcripts obtained by single-molecule sequencing. These long-read Iso-Seq data (Wang et al, 2016) improved annotation of alternative splicing, more than doubling the average number of alternative transcripts per gene from 1.5 to 3.8, and thereby improved our knowledge of gene structure and transcript variation. Relative to the previous reference, the release brings substantial improvements, including resolved gaps and misassemblies, corrections to strand, consolidation of gene models, and anchoring of previously unanchored genes.
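A transcripts-per-gene average of this kind can be recomputed from an annotation file. The sketch below is one plausible way to do it, assuming a GFF3 file in which every mRNA feature carries a `Parent` attribute naming its gene. The function name is hypothetical, and the exact definition behind the 1.5 and 3.8 figures may differ.

```python
from collections import Counter
from typing import Iterable


def mean_transcripts_per_gene(gff_lines: Iterable[str]) -> float:
    """Average number of mRNA features per parent gene in GFF3 lines.

    Assumption: each mRNA row has a 'Parent=<gene id>' attribute.
    Genes with no mRNA rows are not counted in the denominator.
    """
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only well-formed mRNA rows are considered
        for field in cols[8].split(";"):
            if field.startswith("Parent="):
                counts[field[len("Parent="):]] += 1
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)
```

Passing an open file handle works as well as a list of lines, since the function only iterates over its argument.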
Gene annotation was performed in the laboratory of Doreen Ware (CSHL/USDA). Protein-coding genes were identified using MAKER-P software version 3.1 (Campbell et al, 2014) with the following transcript evidence: 111,151 PacBio Iso-Seq long reads from 6 tissues (Wang et al, 2016), 69,163 full-length cDNAs deposited in GenBank (Alexandrov et al, 2008; Soderlund et al, 2009), 1,574,442 Trinity-assembled transcripts from 94 B73 RNA-Seq experiments (Law et al, 2015), and 112,963 transcripts assembled from deep sequencing of a B73 seedling (Martin et al, 2014). Additional evidence included annotated proteins from Sorghum bicolor, Oryza sativa, Setaria italica, Brachypodium distachyon, and Arabidopsis thaliana downloaded from Ensembl Plants Release 29 (Oct-2015). Gene calling was assisted by Augustus (Keller et al, 2011) and FGENESH (Salamov & Solovyev, 2000), trained on maize and monocots, respectively. Low-confidence gene calls were filtered out on the basis of their Annotation Edit Distance (AED) score and other criteria, and are viewable as a separate track. The resulting higher-confidence set (called the filtered gene set) contains 39,324 protein-coding genes. Gene annotations from B73 RefGen_v3 were mapped to the new assembly and are also available as a separate track. In addition, 2,532 long non-coding RNA (lncRNA) genes were mapped and annotated from prior studies (Li et al, 2014; Wang et al, 2016), 2,290 tRNA genes were identified using tRNAscan-SE (Lowe & Eddy, 1997), and 154 miRNA genes were mapped from miRBase (Kozomara & Griffiths-Jones, 2014).
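As a rough illustration of AED-based filtering, the sketch below splits the mRNA features of a MAKER GFF3 file by the `_AED` attribute that MAKER writes in the ninth column. The 0.5 cutoff and the function names are assumptions made for illustration only. The filtering applied for this release combined AED with other criteria that are not reproduced here.

```python
from typing import Dict, Iterable, List, Tuple


def parse_attributes(column9: str) -> Dict[str, str]:
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs: Dict[str, str] = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def split_by_aed(
    gff_lines: Iterable[str], max_aed: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Return (kept_ids, filtered_ids) for mRNA features.

    A transcript is kept when its _AED value is <= max_aed.
    max_aed is a hypothetical cutoff, not the threshold used for this release.
    """
    kept: List[str] = []
    filtered: List[str] = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # AED is reported on mRNA features
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            continue  # no score available, so leave the feature unclassified
        target = kept if float(attrs["_AED"]) <= max_aed else filtered
        target.append(attrs.get("ID", ""))
    return kept, filtered
```

Lower AED values indicate closer agreement between a gene model and its supporting evidence, which is why the sketch keeps transcripts at or below the cutoff.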
NOTE: We continue to provide the maize B73 RefGen_v3 gene annotations at http://maizev4.gramene.org/Zea_mays/Info/Index
A complete description of the contents of this new release is available in our release notes.
Please let us know if you have questions or suggestions.
The Gramene Team