The results of this study are being used in the WGS Extract tool, which is being redeveloped here, so that it can better guide users to (or automatically choose) the correct reference model. The SAM file format does not require the reference model to be recorded as a parameter. As such, it has to be determined, if that is even possible, from the information that is generally always available. A CRAM file ends up having enough information to correctly and uniquely determine the model that was used to compress it, even though it also does not identify the model with a parameter. This is critical, as the exact model must be used to uncompress it.
The focus here is on the primary model sequences, namely the 24 chromosomes and the mitochondrial sequence. The additional sequences, consisting of alternate contiguous regions and the like, were incorporated only to the extent that their count and their names were used in the algorithms to more uniquely identify a specific model. The content of these contigs and their MD5 signatures were not analyzed.
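As a rough illustration of the kind of information that is generally available, the sketch below reads the `@SQ` lines of a SAM/BAM header, which carry each sequence's name (`SN`), length (`LN`) and, optionally, its MD5 (`M5`). It assumes the header text has already been extracted (for example with `samtools view -H`). The helper names `parse_sq_lines` and `summarize` and the sample header values are hypothetical and for illustration only; this is not the algorithm used in WGS Extract or in the study.

```python
from collections import namedtuple

# One entry per @SQ header line; md5 is None when the M5 tag is absent.
SeqEntry = namedtuple("SeqEntry", ["name", "length", "md5"])

# Names of the primary sequences once any "chr" prefix is removed.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "M", "MT"}

# Illustrative header text only; not taken from any particular model.
SAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chrM\tLN:16569\n"
    "@SQ\tSN:chrUn_KI270442v1\tLN:392061\n"
)


def parse_sq_lines(header_text):
    """Collect name, length and MD5 from every @SQ line in SAM header text."""
    entries = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type are tab-separated TAG:VALUE pairs.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" not in tags:
            continue  # skip malformed lines that lack a sequence name
        entries.append(
            SeqEntry(tags["SN"], int(tags.get("LN", 0)), tags.get("M5"))
        )
    return entries


def _base_name(name):
    """Strip a leading 'chr' so both naming styles compare equal."""
    return name[3:] if name.startswith("chr") else name


def summarize(entries):
    """Separate the primary sequences from the rest and count the rest."""
    primary = [e for e in entries if _base_name(e.name) in PRIMARY]
    return {
        "total": len(entries),
        "primary_lengths": {e.name: e.length for e in primary},
        "other_count": len(entries) - len(primary),
        "uses_chr_prefix": any(e.name.startswith("chr") for e in primary),
    }


if __name__ == "__main__":
    print(summarize(parse_sq_lines(SAMPLE_HEADER)))
```

A summary like this (naming style, primary sequence lengths, number of additional sequences) is the sort of fingerprint that could then be compared against a catalogue of known models. How the comparison is actually done is left to the study itself.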
References
- Determining Your BAM Reference Model: A Classification Study by Randy Harr, 15 Feb 2021
- Companion Reference Spreadsheet that catalogues the models found and their studied parameters