Friday, February 23, 2018

GRCg6: Curation of the chicken reference genome assembly transfers to the GRC

The GRC announces the release of the latest chicken reference genome assembly, GRCg6.

The chicken reference assembly defines a standard upon which other avian whole-genome studies are based. Providing the best representation of the chicken genome is essential for continued progress in understanding and improving human health, as this species serves as a model organism alongside mouse, zebrafish and other vertebrates.

The chicken reference genome project began as an international research collaboration coordinated by the McDonnell Genome Institute, with past funding from the National Institutes of Health (NIH) and the U.S. Department of Agriculture, whose shared goals were to determine the sequence of the chicken chromosomes and annotate all possible chicken genes. The initial genome reference was completed and published in Nature in 2004 and has since improved in quality. The reference received a major upgrade in 2016, termed Gallus_gallus-5.0 (GCA_000002315.3), as a result of long read sequencing technology and added transcriptome data. In 2017, responsibility for managing chicken genome assembly updates transferred to the Genome Reference Consortium (GRC). Additional sequence coverage from long read data, in particular with average read lengths of 12 kb, together with improvements to de novo assembly algorithms, has resulted in another upgrade, GRCg6, which has now been released for immediate community use. Manual annotation of de novo assembled contigs that have been integrated with finished BAC clones has produced an assembly with superior quality metrics, such as an N50 contig size of 18 Mb and much lower gap counts.
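For readers unfamiliar with the N50 metric quoted above, the following minimal sketch shows how a contig N50 is conventionally computed: the contig length at which the longest contigs, taken in descending order, first cover at least half of the total assembled bases. The function name `n50` and the example lengths are illustrative assumptions, not part of any GRC tooling or the GRCg6 data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases).

    Illustrative helper, not GRC code: N50 is the length of the contig at
    which the cumulative sum of lengths, taken from longest to shortest,
    first reaches at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")

    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths, purely for illustration.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

A higher N50 means that more of the assembly sits in long, uninterrupted contigs, which is why it is commonly reported next to gap counts.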

Visit the chicken homepage at the GRC website to view assembly notifications, report assembly issues or contact us with questions.

The GRCg6 assembly will be available in all major genome browsers, and will be annotated by both the NCBI eukaryotic genome annotation pipeline and Ensembl.

* Photo courtesy of Dr. Jerry Dodgson 

Friday, February 9, 2018

New technique closes gaps in GRCm38.p6

Ongoing efforts to close gaps and to correct clone problems remaining in the GRCm38 mouse reference assembly have proved difficult. The available clone library resources have been exhausted, and the remaining gaps are recalcitrant to cloning: either no clones are available, or the gap-spanning clones carry deletions of the expected genomic sequence. The GRC has previously used contigs from publicly available whole genome shotgun assemblies to provide sequence at some of these gaps, and in some cases has been able to close gaps entirely with this approach. Nonetheless, several hundred sequence gaps, many of which are known to contain genes, remain.

With the release of 17 strain-specific genome assemblies from the Mouse Genomes Project, the GRC evaluated alignments between C57BL/6NJ, the most closely related strain, and the GRCm38 reference (C57BL/6J). This evaluation found that genes missing from the reference assembly were present in the new strain assembly. Utilising the C57BL/6J read set (PRJNA51977) deposited in GenBank by the Broad Institute, and used in the production of the C57BL/6J ALLPATHS WGS assembly GCA_000185105.2, the Genome Reference Consortium sought to generate local assemblies from these reads that could be used for curation of the GRCm38 reference. The read set was first aligned to the C57BL/6NJ assembly using bwa-mem. Once alignment was complete, reads aligning to the regions of the C57BL/6NJ assembly that correspond to GRCm38 gaps and to the locations of clone-assembly problems in the GRCm38 reference were identified and subsequently assembled using the Geneious software platform (version 10.1.3). The resulting assembly BAMs were then loaded into GAP5 for manual curation, and the assembled WGS contigs were submitted to GenBank.
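The read-selection step described above can be sketched as a simple interval-overlap filter. This is a minimal illustration under stated assumptions, not the GRC's actual pipeline: alignments are assumed to be already parsed into `(read_name, contig, start, end)` tuples with 0-based, half-open coordinates, and the function name `reads_overlapping_regions` and all coordinates below are hypothetical. In practice, this kind of selection is normally done directly on the BAM file with standard alignment tools.

```python
from collections import defaultdict


def reads_overlapping_regions(alignments, regions):
    """Return names of reads whose alignment overlaps any target region.

    alignments: iterable of (read_name, contig, start, end) tuples
    regions:    iterable of (contig, start, end) tuples
    Coordinates are assumed to be 0-based and half-open: [start, end).
    Read names are returned once each, in order of first appearance.
    """
    # Group target regions by contig so each read is only compared
    # against regions on the contig it aligned to.
    regions_by_contig = defaultdict(list)
    for contig, start, end in regions:
        regions_by_contig[contig].append((start, end))

    selected = []
    seen = set()
    for name, contig, start, end in alignments:
        if name in seen:
            continue
        for region_start, region_end in regions_by_contig.get(contig, ()):
            # Two half-open intervals overlap when each one starts
            # before the other ends.
            if start < region_end and region_start < end:
                selected.append(name)
                seen.add(name)
                break
    return selected


# Hypothetical region on the strain assembly corresponding to a reference gap.
target_regions = [("chr15", 1000, 2000)]

# Hypothetical alignments; all coordinates are made up for illustration.
example_alignments = [
    ("read1", "chr15", 900, 1050),   # overlaps the start of the region
    ("read2", "chr15", 2000, 2100),  # begins exactly where the region ends
    ("read3", "chr1", 1000, 1100),   # different contig
    ("read4", "chr15", 1500, 1600),  # fully inside the region
]

print(reads_overlapping_regions(example_alignments, target_regions))
```

The selected reads would then be passed to an assembler to build the local contigs for each gap or clone-problem region.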

The patch release GRCm38.p6 addresses 20 regions with these newly created and submitted sequences. These contigs fix and improve representation for several genes, examples of which are shown in Table 1 and Figure 1.

Table 1: Examples of issues fixed in GRCm38.p6 using assembled Illumina reads.

Figure 1. Top: Incomplete representation of the Anxa13 gene in GRCm38 due to a deletion in reference component AC152395.9. Middle: Clone error corrected in GRCm38.p6. The fix patch uses MF597750.1 and MF597749.1 to add the deleted sequence to AC152395.9 and provides a complete representation of Anxa13. Bottom: Representation of Anxa13 by reference chr. 15 and by the fix patch, highlighting the complete representation of Anxa13 (NM_027211.2).