* Survey and Classification of Human Genome Reference (Analysis) Models

Randy Harr - Mon 22 of Feb, 2021 18:02 EST

We have just concluded an exhaustive eight-month survey of the Human Genome Reference Models available in FASTA format. The results were a little surprising in some instances but mostly followed well-established industry conventions. Out of this study, we developed a new classification system for the models, one that makes it easier to match an already aligned BAM file to the reference model most likely used to build it. This matters for follow-on processing, such as extracting variants into VCF files or compressing with the CRAM format.

We initiated this study because we could not find a reliable way to determine which Human Genome Reference Model was used to align the unaligned read segments from a WGS run into a SAM / BAM / CRAM file. The same issue applies to VCF files, and the principles developed here can, for the most part, be used there as well. Key to this work is defining a nomenclature for the model content and for the sequence naming used by the provider. As we discovered, this is critical for the 1000 Genomes Project models that are so prevalent in the industry: there is much diversity among them, and this is where the most care needs to be taken. Additionally, some models with very different names are actually identical for all uses.

The results of this study are being used in the WGS Extract tool being redeveloped here, so that it can better guide users to the correct reference model (or choose it automatically). The SAM file format does not require the reference model to be recorded as a parameter. As such, the model has to be determined, if that is even possible, from the information that is generally always available. A CRAM file ends up having enough information to correctly and uniquely determine the model that was used to compress it, even though it too does not identify the model with a parameter. This is critical, because the exact same model must be used to uncompress it.
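As a rough illustration of working from what the file itself carries, the sketch below reads the `@SQ` lines of a SAM header (as printed by `samtools view -H`) and guesses the major build from the length of chromosome 1. This is not the classification algorithm from the study or from WGS Extract; the function names and the lookup table are assumptions made for this example, and it covers only three widely used builds.

```python
# Minimal sketch, NOT the study's algorithm: guess the major build from the
# chromosome 1 length recorded in the SAM header's @SQ lines.
# The table and function names are illustrative assumptions.
CHR1_LENGTH_TO_BUILD = {
    247249719: "NCBI36 / hg18",
    249250621: "GRCh37 / hg19",
    248956422: "GRCh38 / hg38",
}


def parse_sq_lines(header_text):
    """Return one dict per @SQ line, keyed by the two-letter SAM tag (SN, LN, M5, ...)."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = line.split("\t")[1:]  # drop the "@SQ" record type
        records.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return records


def guess_build(header_text):
    """Look up the length of chromosome 1 (named '1' or 'chr1') in the table."""
    for rec in parse_sq_lines(header_text):
        if rec.get("SN") in ("1", "chr1"):
            return CHR1_LENGTH_TO_BUILD.get(int(rec.get("LN", 0)), "unknown")
    return "unknown"


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.6\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:248956422\n"
        "@SQ\tSN:chrM\tLN:16569\n"
    )
    print(guess_build(header))
```

The chromosome 1 length only separates the major builds. Telling apart the many variants within a build takes further evidence, such as the sequence naming and the number of additional sequences.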

The focus here is on the primary model sequences, namely the 24 chromosomes and the mitochondrial sequence. The additional sequences, consisting of alternate contiguous regions and the like, were considered only to the extent that their count and names were incorporated into the algorithms to identify a specific model more uniquely. The content of these contigs and their MD5 signatures were not analyzed.
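To make the split between primary and additional sequences concrete, here is a small sketch that takes a list of sequence names, counts how many are primary (chromosomes 1 to 22, X, Y and the mitochondrial sequence) and how many are additional, and reports whether the primary names carry a `chr` prefix. The function name, the returned keys and the naming labels are assumptions made for this example. They are not the nomenclature defined in the study.

```python
# Minimal sketch, under stated assumptions: separate primary sequence names
# from additional ones and report the naming style of the primary names.
# PRIMARY_CORES and the result keys are illustrative, not the study's terms.
PRIMARY_CORES = {str(n) for n in range(1, 23)} | {"X", "Y", "M", "MT"}


def summarize_sequence_names(names):
    primary, additional = [], []
    for name in names:
        # Strip an optional "chr" prefix before comparing against the core names.
        core = name[3:] if name.lower().startswith("chr") else name
        if core.upper() in PRIMARY_CORES:
            primary.append(name)
        else:
            additional.append(name)

    prefixed = [n.lower().startswith("chr") for n in primary]
    if primary and all(prefixed):
        style = "chr-prefixed"
    elif not any(prefixed):
        style = "unprefixed"
    else:
        style = "mixed"

    return {
        "primary": len(primary),
        "additional": len(additional),
        "style": style,
    }


if __name__ == "__main__":
    names = ["chr1", "chrX", "chrM", "chr1_gl000191_random", "chrUn_gl000220"]
    print(summarize_sequence_names(names))
```

A full model would normally show 25 primary names. The count of additional names and the naming style then help narrow down which specific model a file was built with.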

Summary chart of classification system developed as part of the Determining your BAM Reference Model Classification Study. See https://bit.ly/34CO0vj

Classification Nomenclature for Human Genome Reference (analysis) Models

References

Determining Your BAM Reference Model: A Classification Study by Randy Harr, 15 Feb 2021
Companion Reference Spreadsheet that catalogues the models found and their studied parameters
