We initiated this study because we could not find a reliable way to determine which Human Genome Reference Model was used to align the unaligned read segments of a WGS output and create a SAM / BAM / CRAM file. The same issue applies to VCF files, and the principles developed here can, for the most part, be applied there as well. Key to this work is defining a nomenclature for the model content and for the sequence naming used by each provider. As we discovered, this is critical for the 1000 Genomes Project models that are so prevalent in the industry: there is much diversity among them, and this is where the most care needs to be taken. Additionally, some models with very different names are actually identical for all uses.

The results of this study are being used in the WGS Extract tool being redeveloped here, allowing it to better guide users to (or automatically choose) the correct reference model. The reference model is not a required, hard-coded parameter of the SAM file format. As such, it has to be determined, if that is even possible, from the information that is generally available. A CRAM file ends up containing enough information to correctly and uniquely determine the model that was used to compress it, which is critical because that exact model must be used to decompress it. This holds even though the CRAM file, too, does not identify the model with a parameter.
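As a rough illustration of what "generally available" information looks like, the sketch below reads the `@SQ` lines of a SAM-style header and derives a small signature: the sequence count, the naming style, a build-family guess from the length of chromosome 1, and whether every sequence carries an `M5` (MD5) tag. This is a minimal sketch under stated assumptions, not the method used by WGS Extract. The function names `parse_sq_lines` and `summarize_header` and the lookup table `CHR1_LENGTHS` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Optional

# Illustrative lookup only: chromosome 1 length for two common build families.
# A real classifier would need many more data points than this.
CHR1_LENGTHS = {
    249250621: "GRCh37 / hg19 family",
    248956422: "GRCh38 / hg38 family",
}


def parse_sq_lines(header_text: str) -> List[Dict[str, str]]:
    """Return one {tag: value} dict per @SQ line in a SAM-style header."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only, since values may contain colons.
            tag, _, value = field.partition(":")
            fields[tag] = value
        records.append(fields)
    return records


def summarize_header(header_text: str) -> Dict[str, object]:
    """Build a small, hypothetical signature from the @SQ lines."""
    records = parse_sq_lines(header_text)
    names = [r.get("SN", "") for r in records]

    # Find chromosome 1 under either naming convention.
    chr1 = next((r for r in records if r.get("SN") in ("1", "chr1")), None)
    length: Optional[int] = None
    if chr1 is not None and chr1.get("LN", "").isdigit():
        length = int(chr1["LN"])

    if "chr1" in names:
        naming = "chr-prefixed"
    elif "1" in names:
        naming = "numeric"
    else:
        naming = "unknown"

    return {
        "sequence_count": len(records),
        "naming": naming,
        "build_guess": CHR1_LENGTHS.get(length, "unknown"),
        # True only if every @SQ line carries an M5 checksum tag.
        "has_md5": bool(records) and all("M5" in r for r in records),
    }
```

The header text itself can be obtained with `samtools view -H` on a SAM, BAM, or CRAM file. A signature like this narrows the candidates but, as the text notes, may not be enough on its own to identify a model uniquely.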

The focus here is on the primary model sequences, namely the 24 chromosomes and the mitochondrial sequence. The additional sequences, consisting of alternate contiguous regions (contigs) and the like, were incorporated only insofar as their count and names were used in the algorithms to more uniquely identify a specific model. The content of these contigs and their MD5 signatures were not analyzed.
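The split between primary and additional sequences can be sketched as a simple partition over sequence names. The example below is an assumption-laden illustration, not the algorithm developed in this study: the function `partition_sequences`, the pattern `PRIMARY_RE`, and the set `MITO_NAMES` are hypothetical, and the list of mitochondrial names covers only a few spellings commonly seen in practice.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical pattern: chromosomes 1-22, X, Y, with an optional "chr" prefix.
PRIMARY_RE = re.compile(r"(?:chr)?(?:[1-9]|1[0-9]|2[0-2]|X|Y)", re.IGNORECASE)

# Assumed mitochondrial spellings (compared in upper case); not exhaustive.
MITO_NAMES = {"M", "MT", "CHRM", "CHRMT"}


def partition_sequences(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group sequence names into chromosomes, mitochondria, and everything else."""
    groups: Dict[str, List[str]] = {
        "chromosomes": [],
        "mitochondria": [],
        "other": [],
    }
    for name in names:
        if name.upper() in MITO_NAMES:
            groups["mitochondria"].append(name)
        elif PRIMARY_RE.fullmatch(name):
            groups["chromosomes"].append(name)
        else:
            # Alternate contigs, unplaced sequences, decoys, and similar.
            groups["other"].append(name)
    return groups


if __name__ == "__main__":
    example = ["chr1", "chr22", "chrX", "chrY", "chrM", "chrUn_GL000220v1"]
    result = partition_sequences(example)
    # Only the count and names of the "other" group would feed a classifier.
    print(len(result["chromosomes"]), len(result["mitochondria"]), len(result["other"]))
```

Under this sketch only the count and names of the sequences in the `other` group would contribute to identifying a model, mirroring the scope described above.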


Classification Nomenclature for Human Genome Reference (analysis) Models


References