The first, in Spring 2021, is based on the cell line CHM13, which contains every chromosome except the Y (as it is a female sample). This cell line is one of the most studied and gave the team a large body of legacy work to compare against while building this first gapless model. The pre-publication paper draft came out in early summer 2021 along with the public release of the data on the T2T project site. It generated talk and excitement even in the consumer WGS community and among its tool developers, such as those behind WGS Extract. (See posts from May through August in the Personal WGS Testing Facebook Group.) The excitement revolved around a model that would fill in the remaining 130 million or so "N" base-pair values and provide a gapless, complete sequence of a single human genome. About a year earlier the team had announced the first full telomere-to-telomere sequence of the X chromosome; this pre-publication added all 22 autosomes to that.
The CHM13 link also provides details about the process and techniques used to rein in emerging 2.5G and 3G sequencing technology to achieve this. The released model is exciting because it is finally a truly COMPLETE human genome sequence: no gaps, no 'N' values. The sequenced sample was then used to generate a draft human genome reference model, albeit one based on a single individual. Of course, one hassle we noticed last summer was that it was only a female sample. The NHGRI project lead told us in August that the Y would be coming soon.
This effort is comparable to the 2003 release of the final Human Genome Project (HGP) model, the first-ever model of the complete human genome. That model was subsequently refined by the International HapMap and 1000 Genomes projects to "normalize" it to be more representative of the original, most ancient human (before variations). Similarly, this new T2T model is the start of a larger Human Pangenome Project (HPP) that will seek to understand variation among populations in variants larger than simple SNVs and similar variations of 5 base pairs or fewer, and so define a basis for a "normalized" human model on a much larger scale (albeit likely no longer linear).
When that first HGP model was released, there were still hundreds of known gaps, and over 7.5% of it was filled with N values (representing any possible value). Build 37 improved on this by only 0.1%. Build 38 was a bigger leap, reducing the N's to just under 5% and thus increasing the overall effective model length by about 80 million base pairs. With the T2T model, the number of N's has gone to zero and the effective model length has increased by another 80 million base pairs. Basically, that is a 9% improvement in defining the complete human genome since the original 2003 release and 6% since the last major model, Build 38.
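For readers who want to check the share of N values in a reference model themselves, here is a minimal Python sketch. It is not how the T2T team or WGS Extract measures this. The function names `count_n_bases` and `n_fraction` are hypothetical, and the tiny in-memory FASTA is invented toy data standing in for a real reference file.

```python
import io


def count_n_bases(fasta_handle):
    """Return {sequence_name: (total_bases, n_bases)} for a FASTA text stream."""
    stats = {}
    name = None
    total = 0
    n_count = 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new sequence; store the previous one first.
            if name is not None:
                stats[name] = (total, n_count)
            name = line[1:].split()[0]
            total = 0
            n_count = 0
        else:
            total += len(line)
            # Soft-masked references use lowercase, so count both "N" and "n".
            n_count += line.upper().count("N")
    if name is not None:
        stats[name] = (total, n_count)
    return stats


def n_fraction(stats):
    """Fraction of all bases, across every sequence, that are N."""
    total = sum(t for t, _ in stats.values())
    n_bases = sum(n for _, n in stats.values())
    return n_bases / total if total else 0.0


# Invented toy data, not a real reference model.
toy_fasta = io.StringIO(">chrA\nACGTNNNNAC\nGTnnAC\n>chrB\nACGTACGT\n")
toy_stats = count_n_bases(toy_fasta)
print(toy_stats)
print(f"N fraction: {n_fraction(toy_stats):.2%}")
```

To run it on a real model, pass an open text handle for an uncompressed FASTA file in place of `toy_fasta`. A gapless model should report an N fraction of zero.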
UPDATE: In the same month that they released the HG002 NA24385 v2 version of the long-read model, they prepared a final T2T model that would include the HG002 v2.7 Y chromosome. That work was finalized, now internally as v2.0, in January of 2022. UCSC calls this the hs1 release. It is more often referred to as T2T CHM13 v2.0 with Y, or simply T2T v2.
In November of 2021, the HG002 v2 allosome sequences were submitted to the NCBI GenBank database, and some in the Y chromosome phylogenetic tree community discovered them in December. This led to a rush by many to incorporate those sequences into their technology and the tools built around it, as well as a push to discover and name new variants that may be discoverable from existing NGS (that is, 2G sequencing) results remapped to the new model. But all this work is very manual and requires deep knowledge. It is not for the average consumer WGS tester.
Unbeknownst to the deep ancestry community at the time, the T2T working group had been releasing a full HG002 model that consisted of their original CHM13 autosomes combined with the HG002 allosome sequences. It is this model that is most exciting and most useful to the general NGS WGS community. The NCBI check-in of the T2T project HG002 v2 model from August is known as CP086568/9.1. The planned final v2.7 will likely be known as CP086568/9.2. The WGS community will be using the full genome model put out by the T2T HG002 working group, known as HG002xy.
One of the researchers from a lab at the Harvard University Medical Center provided the sample for HG002 and it is from haplogroup J1. It represents a detailed sequencing of a non-R1b haplogroup male sample. An R1b haplogroup sample was used as the basis of the current reference models from the HGP result going back to 2001. So this alone was exciting for the phylogenetic tree community.
Now many are likely saying: "Why is this so groundbreaking? We had the whole genome sequenced and modeled way back in 2001." To understand why, you need to dig deeper under the covers of sequencing technology over time, and into what was actually done back in the 1990s for over $3 billion versus what was done today.
These recent releases went through a newly devised, rigorous process of merging 2.5G to 3G long-read sequencing (specifically, high-depth PacBio HiFi CCS and Oxford Nanopore (ONT)) with standard Illumina 250 base-pair NGS reads. New assembly tools were developed to align the sequencing results into a more accurate, constructed, true full sequence. This fills in the nearly 5% of N's in the current HG38 reference model, adds base pairs to the current model, and yields a first true telomere-to-telomere sequence of the complete human genome. This is all part of the Pangenome project, which has many other samples in the works and in the release process as well.
More so than with the original HGP project back in 2003, each of these recent "models" is from a single source sample. Each is therefore less a reference than a single sample from which a complete sequence has been built. The pangenome project plan is to apply the technique to many samples and thus build up a pangenome model as a much better reference model going forward. But for the time being, as happened in the early 2000s, this is simply a single sample that has been more fully sequenced.
So the innovation is in applying newly emerging long-read sequencing to get a more accurate picture of the actual human genome, reconstructing it more accurately from longer sequenced segments. The initial HGP was done solely using Sanger sequencing, which is good for segments of up to about 1,000 base pairs, and only when you know what you are looking for: you need a primer to attach to the DNA you want to sequence and start from there. Then, with more advanced tools and early next-generation sequencing, the 1000 Genomes Project came along and refined that initial model. It filled in many gaps but, more importantly, identified the common values and variants, as opposed to using that single initial sample to represent the base reference model for all humanity. So now they are using 2.5G to 3G long-read sequencing technology to fill in the more than 5% of the human genome that is still not fully known (mostly centromere and telomere regions; and some other highly replicated regions in the xxx stem of autosomes and the Y xxx region).
The human reference models used for analysis in bioinformatic tools today are specially tuned to the limitations of short-read, shotgun, massively parallel sequencing. They are also linear (or flat) and represent a single instance that may not be the best representation for most of humanity. This is illustrated by the HS38DH reference model, which was released with hundreds of variations on the HLA region of Chromosome 6. A linear, flat model cannot represent the large structural variations found in actual practice among various cultures and human samples. Hence the goal of the pangenome project is to instead define a xxxxx structure that can more accurately represent the variance seen in the general population.
A reference sequence alone is not enough, unless all you want is to compare two different people's sequencing results to each other. Additional files are needed for genealogy, especially to correlate and extract data that is already well known and based on the previous models. BED files, liftover files and SNP location files are needed.
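To show why such companion files must match the model they were built for, here is a small Python sketch of looking up SNP positions against BED regions. It is an illustration only, not the WGS Extract implementation. The function names and the toy coordinates are hypothetical. The one firm fact it relies on is that BED intervals are 0-based and half-open, while SNP positions are conventionally quoted 1-based.

```python
def parse_bed(lines):
    """Parse BED text lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open: start is included, end is not.
    """
    regions = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def position_in_regions(regions, chrom, pos_1based):
    """True if a 1-based SNP position falls inside any BED interval."""
    pos0 = pos_1based - 1  # convert to the 0-based BED convention
    return any(start <= pos0 < end for start, end in regions.get(chrom, []))


# Invented toy regions; real coordinates differ for every reference model.
toy_bed = ["# toy regions", "chrY\t100\t200", "chrY\t500\t600"]
toy_regions = parse_bed(toy_bed)
print(position_in_regions(toy_regions, "chrY", 101))  # first base of the region
print(position_in_regions(toy_regions, "chrY", 201))  # one past the region end
```

Because every coordinate is only meaningful relative to one specific model, a BED or SNP location file made for Build 37 or Build 38 cannot be used with a T2T model until its positions have been lifted over.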
What it is and is not:
- Is a true, complete telomere-to-telomere sequence of "a" human genome. The key words there are complete and "a" (one example).
- Is not a refined reference or analysis model as is used today. Known ancestral SNP alleles are not necessarily set but instead simply represent the sample used.
Why is it so difficult to determine a phylogenetic tree and order in time of variants?
- A rare allele could be rare for many reasons. It may be a recent variant found in a few closely related people. It may be an ancient variant, with the majority of the population descending from a more prolific branch that did not have it. It may be a variant from sometime in the middle, but on a branch not well tested or represented in the population due to bottlenecks. Basically, a rare allele can occur anywhere in the representative tree. So you have to look at a sample's rare and common values in comparison to others to try to put order to the data, and even to determine what is the ancestral reference and what are variants from it.
There is a third model that seems to have come up. But it is not clear whether it is truly a part of the Human Pangenome Project (HPP) or even strictly uses the T2T procedures. We mention it because it is reported that they patched a gap in the Y with the HG002 T2T Y model. This is known as the PR1 model.
Here is a table of the different names associated with the same or similar models:
Project Name | Y Accession1 | Description, other names |
T2T CHM13 v0.7 | CM020874.1 (X) | Only X, NCBI GCA_009914755.1, only mentioned because source of paper on first telomere to telomere sequence of a human chromosome (X in this case) |
T2T CHM13 v1.0 | CP068255.1 (X) | SNs 24 (sans Y), NCBI GCA_009914755.2; female so no Y |
T2T CHM13 v1.1 | CP068255.2 (X) | SNs 24 (sans Y), NCBI GCA_009914755.3; female so no Y |
T2T HG002xy v2 | CP086569.1 | NCBI GCA_020881995.1 aka ASM2088199v1, Biosample SAMN03283347, NIST HG002 NA24385, SRA: SRS817069, The assembly model has CHM13 v1.1 for the autosomes. Finalized xx Aug 2021. Only the allosomes checked in. |
T2T HG002xy v2.7 | CP086569.2 (expected) | The assembly model has CHM13 v1.1 for the autosomes; finalized 18 Jan 2022. Expected to be the final version and to be submitted as an update to GenBank |
HPP CHM13v1Y | CM000686.2 (GRCh38.p13) | A mix of the CHM13 v1 with the GRCh38 Y sequence. Not checked in as a separate assembly but used for the HPP Year 1 release. |
HPP CHM13v1.1Y | CM000686.2 (GRCh38.p13) | A mix of the CHM13 v1.1 with the GRCh38 Y sequence. Not checked in as a separate assembly but used for updates in the HPP Year 1 release. |
One also hears of CM034974.1 for the Y portion of the PR1 "Puerto Rican" sample that has been analyzed by xxxxx. According to Göran and the paper details, this model grabbed some portions from the hs38 Y model to fill in gaps that remained in the long-read sequence assembly of the original sample.
Additional Material
- Nurk, Sergey, et al., "The Complete Sequence of a Human Genome", preprint, May 2021; published in Science (vol. 376), 1 April 2022
- Wrighton, Katherine, Filling in the gaps telomere to telomere, Nature Portfolio, Feb 2021
- Interview and video on the pangenome effort
- Aganezov, Sergey, et al., "A complete reference genome improves analysis of human genetic variation", preprint in bioRxiv, 12 Jul 2021, (DOI)
- Li, Heng, On a reference pan-genome model, blog posts Part I and Part II from July 2019.
- Runström, Göran, "Announcing the new FTT Index based on T2T extracted new SNPs" on the ISOGG Facebook Group, 17 Dec 2021, with comments
- Kane, James, "T2T model Experiments at the Y DNA Warehouse", ongoing; Facebook Y NGS Discussion Forum post
- Krahn, Thomas, "YBrowse adoption of experimental models", ongoing
- Ted Kandell's early announcement of the final HG002 Y chromosome model (mixed with the chm13 like we have been using in WGS Extract already)
- PR1 model
- Classification of Reference Models document prepared by the author
- T2T Update