Loading...
 

Single Nucleotide Polymorphism (SNP)

SNP's are the main DNA genetic markers that are tested for in Genetic Genealogy. A Single Nucleotide Polymorphism (or SNP; often pronounced "Snip") is a specific nucleotide locus (or base-pair location) in a DNA strand that has exhibited a variation in some portion of the population. Most often this is a swapping of the base pair values (A to T, C to G, etc) within the DNA ladder. Any other point mutation is usually given a different name. There can be more than one SNP defined for the same location. Either because of different variances or because it is a neighboring change that spills over. The term "single" is often interpreted to mean 5 or less base-pairs.

But not all uses in the biological community are so strict. And the genetic genealogy community even less so. So sometimes they lump other variations into the SNP nomenclature to cover all variations (or markers) they are testing for. At minimum, this often includes an insertion or deletion of a base-pair (officially, a DIP/DIV or INDEL) but still more of a point mutation form, Less often, a very short sequence that is inserted or deleted is defined as an SNP as well. This because there are insertions or deletions of thousands of base-pairs in length. Hence not always a "single" nucleotide although still termed loosely as SNP to talk of all simple variations. (Biology and the science around it is messy and never absolute, it seems.)

Larger variations in DNA usually result in a cell that cannot survive and so are not usually observed. Or they structurally change the DNA so much as to make it difficult to find the variation by known testing methods. The exception is with STRs in the inter-gene (or junk) regions; the other marker tested for in genetic genealogy. Sometimes large INDEL's in non-coding or inter-gene regions can be observed as well. But for our purposes here, think of SNPs and mainly / only as point mutations of single base-pair locations. SNPs are the dominant markers of concern; both for genetic genealogy and in the medical genetics community.

SNP genetic markers are tested so as to compare and contrast peoples DNA with each other (and even to compare the paired chromosomes within a person to each other). If the SNP marker value is changed from the defined population normal (that is, the reference build usually), then this is termed a positive marker result (also known as plus, +, derived, and changed) and is often colored green. If the change in the SNP value is not observed. it is a negative result (also known as minus, -, ancestral, and not-changed) and often colored red.

The value of an SNP genetic marker in a particular person and chromosome is termed an allele. In fact, in biology, a sequence or group of base-pairs and their particular values, is also termed the allele. Here and elsewhere in genetic genealogy, we tend to simply say "SNP value" or "marker value" to mean the allele. Simply to avoid an overload of terminology. ** Technically, an allele is a variant (derived value from the reference (or ancestral) at a location. But can often can be used to simply mean the measured value at a given location.

The collective values of an SNP in a person, taken from both chromosome copies, is termed the genotype. Genetic Genealogy testing returns the genotype value set (or two alleles) when two copies of the chromosome exist in a person (often the autosomes). Key to understand in genetic genealogy is that the test process cannot generally determine which allele came from which parent; individually per SNP or collectively across all the SNP's measured. The genotype returned is an un-ordered set of values obtained from both parents. The obvious exception to this is the measured yDNA SNP which can only come from the father (for the most part). And a son's (non-recombining portion of their xDNA which can only come from their mother.

Because SNPs are inherently more stable and do not tend to change back and forth so often (as STRs may), they are used for identifying different anthropological lines of humans through tens of thousands of years. Mostly with testing on the Y chromosome and the mitochondria that pass down relatively unchanged from generation to generation. Only recently, with Sequencing testing on a wider set of samples, are we seeing more variance attributed to the nearer time frame that is allowing SNP testing to contribute in the genealogical time frame tree formation; not just ancient humans. Thus merging anthropological phylogenetic tree work with traditional genealogical time frame genealogy work.

SNP Testing

There are a number of different types of SNP testing offered in the genetic genealogy community. They are:
  • Autosomal SNP testing is the most popular and common form of genetic genealogy testing of the last decade. While many are drawn in by advertising to the generated pie charts of an ethnicity mix, the real value is in determining how many SNP values you share in common in a continuous sequence with another person. This is termed a matching segment.
  • X chromosome SNP testing is often included in Autosomal testing although the matching and reporting of it varies by company. Unique properties of X inheritance exist and can be used to glean more information than simply found in the Autosomes.
  • Y chromosome SNP testing is used to confirm, when similar or matching Haplotypes exist, that two people are in the same Patriline (as opposed to their matching Haplotype representing a (re-)converged set of STR value changes over time). Testers with the same Haplotype but different Haplogroups are not in the same Patriline.
  • Mitochondrial SNP testing has limited utility in genetic genealogy and is most used for ancient anthropological studies. But one very specific question can be answered (shared Matrilineal lines) and so some value is found.
All of these different SNP groups are tested for in the microarray tests prevalent in the last decade. Tests as offered by Ancestry, 23andMe and others. They often return all four of these types of test values. Some companies strip out some of the results before delivering the data files. FTDNA, for example, strips out all but the autosomes. They will put the xDNA back in if asked for. Because the main use of the microarray test result is the matching segment database based on primarily the autosomal results, most people erroneously call these autosomal tests.

SNP Names and Haplogroups

It is common to name the SNP with a letter or two that represents the group or person that first identified it; followed by a sequence number given by that group. NIH has a central database (dbSNP) and naming scheme for each SNP that is termed the refSNP number ID or rsID for short. In their database, each SNP begins with the letters rs followed by a sequence of numbers unique for each SNP. Not-yet-named SNPs are identified by a triplet value of:
Defining an exact locus is an issue given insertions and deletions can occur, sometimes fairly large, at different points with different people. Hence the use of a standard Genome Build and identification of the locus to that model. Ancestral always refers to the reference model value.

To help understand this, here is a screen capture of SNP U152 from the yTree block page description. Only thing not given there is the rsID which we added to the image for completeness. The "not-yet-named" method shown is without the leading chromosome indicator as this is from a yDNA site where the Y reference is implied. Note there are synonyms for U152 of S28 and PF6570. Simply, the same SNP was given names near-simultaneously by different labs. So we often see all the various synonyms of the SNP being used. In our example here, the SNP sometimes appears as S28 instead of U152 as a result. S116 is a synonym for P312 and shown there as well. Confusion abounds.

More formally, this is actually a haplogroup description and not of the SNP U152. Haplogroups are defined by one or more derived-value SNPs and named by usually one but sometimes more than one of those same SNPs used to define it. See YCC for other naming conventions of haplogroups. Different phylogenetic tree creators may use different SNPs from the haplogroup to name each haplogroup; they do not coordinate. This image is from just one such creator and may not reflect the SNP or its synonym selected by some other tree creator.

SNVs, MNVs and such

Technically, the markers checked for are termed Single-Nucleotide Variations (SNV)s. And only if they meet a certain criteria among the general population are they termed an SNP. But this distinction seems to be lost on most in the industry and outside the academic world; especially the genetic genealogy community. So we tend to use them interchangeably or stick with the original SNP term to simply mean all variations. And even that clarity in use is in question. See Heng Li, author of the key bioinformatic tools, blog post on the subject saying it is all hogwash (about SNP used only for more frequently occurring variations.)

SNVs tend to really be more inclusive and cover more than the "single" base-pair location changes. Even though single is in the definition of the term. Beside the already mentioned InDels, there are Multi-Nucleotide Variations (MNV)s and other variation types. Generally, it is accepted the number of base-pairs involved must be 5 or less to be classified as an SNV.

InDels, a term we still use here, are classified by NIH in their dbSNP database as DIP or more properly now DIV. For Deletion / Insertion Polymorphism or Variation. We tend to still use the more commonly used term of InDel here.

Frequency

Studies estimate that there is a variance in any one persons genome every 1,200 base-pairs. For the haploid of approximately 3.2 billion base-pairs, that implies any individual will have around 2.66 million variances. We tend to see a bit closer to 3.5 million variants recorded in 30x WGS tests that have human genome maps of just over 3 billion base-pairs. The NIH dbSNP database claims to have over 700 million entries as of 2020. Which would imply more like one variance every 5 base-pairs. But most of those entries have not been fully curated. And sometimes identifying overlapping and same-location variances of different types. It is reported that there are more like 100 million curated, unique variations in the dbSNP database. There are closer to only 500 thousand rsIDs in the ClinVar database (that associates variances with clinical conditions through rereferred study publications).

Typical microarray test result files have around 600,000 positions with values of which maybe only 10% are variants from the reference (or 60,000) and close to 5% not called at all (no value determined) WGS result files have close to 100% of the mappable 3 billion positions with actual values and very low no-call rates. Thus over 5,000 times the microarray test values and often deliver over 3.5 million variances alone per user or roughly 60x the number of variances than the microarray test. Some of the variants cannot be annotated with an rsID and represent either novel variants yet to be identified, studied and curated. Or possibly mis-reads posing as variants. The testing process is not fool proof.

External Resources