Updates to the Zea mays reference assembly

Since publishing the initial assembly of the reference strain (B73) in 2008, the maize sequencing project has released two updates. The initial assembly and RefGen_v2 are both based on a BAC (bacterial artificial chromosome) sequencing strategy. RefGen_v2 included ~2,000 updated BAC sequences and improved upon the initial assembly by resolving overlaps between BACs in the minimum tiling path to define the 10 chromosome pseudomolecules. For RefGen_v3, an effort was made to capture gene space missing within and between BACs by using Roche/454 reads produced from a whole-genome shotgun sequencing library. These reads were assembled into contigs with ABySS and aligned to the RefGen_v2 assembly to identify contigs containing novel sequence. To focus on the missing gene space, ~65,000 full-length cDNAs (FLcDNAs) were aligned to RefGen_v2 and to the novel contigs. A greedy algorithm was employed to select compatible partial alignments to reference or novel contigs so as to cover as much of each FLcDNA as possible. The 1,844 novel contigs selected by this algorithm were scaffolded according to the FLcDNA alignments to determine their order and orientation. Finally, unanchored scaffolds composed entirely of novel contigs were placed into gaps in the reference assembly according to a synteny-refined genetic map wherever possible.

Over 2,100 FLcDNAs have longer alignments to RefGen_v3 than to RefGen_v2. Of these, 300 were completely absent from v2 but have a full-length alignment to v3. Some genes that had been split in RefGen_v2 could be rebuilt completely with the addition of novel contigs covering the missing exons (see screenshot or visit Gramene). Merges like this explain the apparent reduction in the number of protein-coding gene models (from 39,656 to 39,475). The novel contigs (accession: AHID01000000) and an AGP (accessioned golden path) for the assembly were submitted to GenBank (accession: GCA_000005005.5).
Gramene has hosted RefGen_v3 since release b37 (June 2013).
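
The greedy selection of compatible partial alignments described above can be illustrated with a minimal sketch. This is not the project's actual code: the Alignment record, the greedy_cover and covered_bases functions, and the max_overlap parameter are all hypothetical names, and "compatible" is simplified here to mean "does not overlap an already chosen alignment on the cDNA". A real pipeline would presumably also check order and orientation on the target sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One partial alignment of a cDNA to a target (hypothetical record)."""
    target: str   # reference chromosome or novel contig name
    q_start: int  # start on the cDNA, 0-based, inclusive
    q_end: int    # end on the cDNA, exclusive

    @property
    def length(self) -> int:
        return self.q_end - self.q_start


def overlap(a: Alignment, b: Alignment) -> int:
    """Number of cDNA bases shared by two alignments."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start))


def greedy_cover(alignments, max_overlap=0):
    """Pick alignments longest-first, skipping any that overlap a chosen one.

    Assumption: compatibility is judged only on cDNA coordinates.
    """
    chosen = []
    for aln in sorted(alignments, key=lambda a: a.length, reverse=True):
        if all(overlap(aln, c) <= max_overlap for c in chosen):
            chosen.append(aln)
    return sorted(chosen, key=lambda a: a.q_start)


def covered_bases(chosen):
    """Count cDNA bases covered by at least one chosen alignment."""
    covered = 0
    reach = 0  # rightmost cDNA position already counted
    for aln in sorted(chosen, key=lambda a: a.q_start):
        start = max(aln.q_start, reach)
        if aln.q_end > start:
            covered += aln.q_end - start
            reach = aln.q_end
    return covered


if __name__ == "__main__":
    # Made-up example: a 1,000 bp cDNA whose middle exons are only on a novel contig.
    hits = [
        Alignment("chr1", 0, 400),
        Alignment("novel_contig_A", 400, 650),
        Alignment("chr1", 650, 1000),
        Alignment("chr5", 300, 500),  # conflicts with the longer chr1 hit
    ]
    path = greedy_cover(hits)
    print([a.target for a in path], covered_bases(path))
```

In this toy case the novel contig fills the stretch of the cDNA that the reference alone leaves uncovered, which is the situation that led to novel contigs being selected for scaffolding.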

Improved GRMZM5G891969 gene model. A de novo contig provided the two missing exons needed to merge two apparently distinct gene models (GRMZM5G823855 and GRMZM5G891969) into a single model that is conserved in sorghum (Sb01g050450).