This is a very early, in-development tutorial on the levels and types of autosomal match analysis. The reference system is also not yet making references visible.
Any match analysis of the autosomes starts with determining the shared segments between two testers, something done by the site or test company. How that shared segment data is then summarized and further distilled determines the level of detail or analysis you are able to perform. The more distilling, the less detail, and the greater the chance of introducing over-simplified statements of matching and common ancestors.
For a test company to be a genetic genealogy test company, it needs to provide autosomal match analysis with the autosomal test results. At minimum, a company usually provides a match list with some measure of the strength of the match to each entry in the list. At best, it will provide the actual matching segments of DNA between the matches on the match list. Third Party Tool providers exist not only to compare cross-company matches but also to provide a match list capability for the few pseudo-genetic genealogy test companies that are used or considered but do not provide even the bare minimum match list capability.
The bare minimum match list is derived directly from the test process and the determination of matching segments. The next level, often required, is a shared match list. A common analysis of growing popularity that builds on shared match lists is termed clustering. There are many approaches and types of clustering tools, with varying degrees of success in use and manipulation. These tools are most heavily focused on Ancestry as it has, by far, the largest match database of testers. But Ancestry also provides the least information about the genetic match strength. In particular, it does not provide the list of matching segments nor even the simple metric of the largest matching segment. The latter is often the most helpful indicator of the match strength or closeness of a matching cousin.
Segment Matching is the detailed analysis that can lead to truly verifying common ancestors. A common analysis "result" of growing popularity with segment matching is chromosome painting, where you identify the source of segments in the autosomes (and often xDNA) of the primary tester. Just as clustering is an outgrowth of shared match lists and the Leeds method, chromosome painting is an outgrowth of visual phasing and, to a certain extent, true segment triangulation. The outgrowths tend to be more widely applicable and usable though. See our 3rd Party Tools section for more details.
During this tutorial, we will break up the discussion across three different windows of summed total matching segment lengths:
- Close Relations: 1.5 to 50% (or 120 cM and larger) (tends to be all 2nd cousins and closer; which are essentially guaranteed to DNA match)
- Distant Relations: 0.5 to 1.5% (or 40 cM to 120 cM)
- Possible Distant Relations: < 0.5% (or under 40 cM)
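As a rough illustration, the three windows above can be expressed as a small lookup on the summed total of matching segments. This is only a sketch: the function name `classify_match` and its two parameters are hypothetical names chosen for this example, and the defaults are simply the 40 cM and 120 cM cut-offs from the list.

```python
def classify_match(total_cm: float,
                   distant_floor_cm: float = 40.0,
                   close_floor_cm: float = 120.0) -> str:
    """Place a match in one of the three windows by summed total cM.

    The default cut-offs are the ones listed above. Both parameters are
    illustrative and can be adjusted per site.
    """
    if total_cm < 0:
        raise ValueError("total_cm must not be negative")
    if total_cm >= close_floor_cm:
        return "Close Relations"
    if total_cm >= distant_floor_cm:
        return "Distant Relations"
    return "Possible Distant Relations"


# Made-up totals, one per window.
for cm in (850.0, 75.0, 22.0):
    print(f"{cm:6.1f} cM -> {classify_match(cm)}")
```

Keeping the cut-offs as parameters makes it easy to try a different lower boundary for a site that reports match strength differently.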
Types of analysis (in increasing level of detail and accuracy) are:
- Summed Total of all the Matching Segment Lengths (simple)
- Total Number Count of Matching Segments
- Longest Matching Segment Size (first basic value not provided by all)
- In-Common Matches (aka Shared Matches) between two matching testers (and, by corollary, not-in-common matches) (Clustering)
- Chromosome Browser (aka Shared Segment browser) with listing of matching segments
- Fake Triangulation
- True Triangulation of Matching Segment(s)
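The first three measures in this list can all be derived from a list of matching segments, when a site provides one. The sketch below assumes a hypothetical `Segment` record (chromosome, length in cM, SNP count) and a hypothetical `summarize` helper. The 7 cM / 500 SNP floor is an adjustable assumption for this example, not a rule every site follows, and the segment data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chromosome: str   # "1" to "22" (or "X")
    length_cm: float  # genetic length in centimorgans
    snps: int         # number of SNPs in the segment


def summarize(segments, min_cm=7.0, min_snps=500):
    """Return (summed total cM, segment count, longest segment cM).

    Segments below either floor are ignored.
    """
    kept = [s for s in segments
            if s.length_cm >= min_cm and s.snps >= min_snps]
    total = sum(s.length_cm for s in kept)
    longest = max((s.length_cm for s in kept), default=0.0)
    return total, len(kept), longest


# Made-up example data for one match.
example = [
    Segment("1", 31.5, 4200),
    Segment("7", 9.0, 900),
    Segment("12", 4.5, 650),   # under the cM floor, dropped
    Segment("15", 8.0, 350),   # under the SNP floor, dropped
]
print(summarize(example))
```

The same match can therefore report quite different totals and segment counts depending on where a tool sets its floor.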
We deal with each of these types of analysis and weave in the windows of match strength given above.
Why is there not a clear, concise mathematical model for all this? Why do even statistical ranges not seem to apply well? A number of issues exist.
- Remember our clean, overriding mathematical inheritance formula given in our consanguinity glossary entry? It is based on the fact that each generation (or, more specifically, each meiosis event) gets 1/2 the DNA of the generation before. But carrying this simple "average" further assumes infinite precision and perfect mixing of the DNA passed. It extends the fact that a child always gets 1/2 (or 50%) of their atDNA from each parent. What this oversimplifies is that there are 22 discrete chromosomes of varying length, the longest being about 5 times longer than the shortest ones. Assuming no cross-overs for the moment, whichever grandparent's chromosome 1 makes it to the grandchild already has a huge lead over the other grandparent who did not pass that chromosome, as chromosome 1 is almost 8% of the haploid genome that you get from each parent. Things would be much worse and more varied if it were not for the 30 to 40 cross-over events, which tend to always split chromosome 1 in one or two places. But still, even with cross-overs, we often see grandparents be off the simple, expected average by 20% from what they are expected to share with the grandchildren. That is, anywhere in the range of 20 to 30% instead of the average 25%.
- Besides the 50% chance of some DNA being passed down from a parent to a child, we have the chance of a segment of DNA being split by the cross-overs. In fact, the very definition of a centimorgan is the rough percentage chance that a cross-over will fall within a segment of that length in one generation. So a 1 cM segment has roughly a 1% chance of being split in each generation, and a 100 cM stretch averages one cross-over per generation. With chromosome 1 being well over 200 million base-pairs, and over 200 cM, we expect to see about 2 cross-over boundaries in chromosome 1 with each generation. With chromosomes 20 and 21, there may be no cross-over event at all.
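The two points above reduce to simple arithmetic, sketched below. The function names are hypothetical, the 20% spread is the rough figure quoted above for grandparents rather than a statistical bound, and the cross-over count uses the approximation of one expected cross-over per 100 cM.

```python
def expected_share(meioses: int) -> float:
    """Average fraction of atDNA shared with a direct ancestor who is
    `meioses` generations back (1 = parent, 2 = grandparent)."""
    if meioses < 0:
        raise ValueError("meioses must not be negative")
    return 0.5 ** meioses


def share_range(meioses: int, relative_spread: float = 0.20):
    """Rough (low, high) range around the average share."""
    avg = expected_share(meioses)
    return avg * (1 - relative_spread), avg * (1 + relative_spread)


def expected_crossovers(length_cm: float) -> float:
    """Average number of cross-overs per generation in a stretch of DNA,
    assuming one expected cross-over per 100 cM."""
    return length_cm / 100.0


low, high = share_range(2)
print(f"grandparent: average {expected_share(2):.0%}, "
      f"roughly {low:.0%} to {high:.0%}")
print(f"200 cM chromosome: about {expected_crossovers(200):.0f} "
      f"cross-overs per generation")
```

The simple halving formula gives only the average; the range around it is what you actually observe in test results.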
Close Relations (1.5 to 50%; or 120 to 3600 cM or more)
Except for some cases of populations with extreme endogamy, like the Parsis, a match in this range is pretty much guaranteed to be around the expected 2nd cousin level or closer. The variance in the possible ranges is much tighter and closer to the average. In fact, the overlap of standard deviations is almost zero in many cases. So for most, the values are nearer the averages expected by the traditional, simple model given by consanguinity. And if a value is half-way between two averages, the actual relationship will be the one just above or just below and not likely further. There is not much variation.
Note that it could be a 1st cousin twice removed; the generational difference could be large. The ages of the testers do not imply the generational difference to the common ancestor. We have one tester who is a 2nd cousin to their match's grandparent but is younger than the match!
But even so, with such close matches, it can still take some effort to determine the genealogical relationship, often because of some unrecorded NPE, or because people never knew who their grandparents or great-grandparents were, especially in cases of adoption or similar.
It is with these matches that you have the greatest chance of finding a triangulated group with more distant matches taken from the other groups below. This is because you and the closer match have much more DNA in common, which increases the chance that you both inherited the same small segment that matches a third, more distant tester.
It is with these matches that you can most likely use mainly traditional genealogical methods to determine the match's relation. More sophisticated DNA analysis is often not necessary or even helpful. Again, it depends on what is known about the tester's family and possibly the match's. A possible NPE or adoption brings in a requirement for deeper analysis.
Distant Relations (0.5 to 1.5%; or 40 to 120 cM)
This is likely (hopefully) the bulk of the matches you will work with, after a small but healthy selection of Close Relations from above. It is with these matches that you will spend the most time looking for common ancestry possibilities, trying to triangulate using matching segments and the like. The issue here is that the range of possibilities for the relation can vary widely. The estimation of a relationship based on match strength offers many possibilities of roughly equal probability. The biggest problem, besides the greater distance to the common ancestor and the smaller number of matching segments to work with in possible triangulations, is that the matches often are not necessarily into genealogy. So you may have to work on their tree a significant amount to discover where the overlap of ancestors occurs.
It is with these matches that you will get the small matching segments isolated to farther-back ancestors, allowing you to use chromosome painting tools to tag areas of the main tester's chromosomes with more distant ancestors. You use tools like clustering and visual phasing to push matching segments and matches back to more and more distant ancestors. Note that even if you do not find the common ancestor, having the match more isolated to one of the tester's distant ancestors helps narrow the search and provides a better starting point when a new shared match appears and you return to analyzing this match later. Shared match analysis and its related clustering, along with a tool to manage notes and communication with matches, become critical. You can quickly become overwhelmed with data.
Possible, Distant Relations (<0.5%, or under 40 cM)
Indicative of these matches is a single, large segment that comprises most, if not all, of the matching. Some analysis companies throw in lots of small-segment matches to pump up the result to around 40 cM or larger of total matching strength, which indicates a stronger match and a larger matching segment count than may really exist or than other tools report. For this determination, we are considering only matching segments of 500 SNPs / 7 cM or greater. See our separate explanation of how the near-mythical 7 cM step is determined and why it is used, for good reason.
Why 40 cM? Once a single matching segment gets down to 20 cM or lower, the chance of it recombining drops to less than 20% in each generation. So segments tend to become "sticky". This means that if a segment passes the 50% chance of being inherited in a generation at all, it is likely the same size as what the parent had. We often see three generations having the same matching segment of the same size. This throws off the expected relation calculations based on match strength, as all three generations have the same large matching segment and resultant match strength. It should be noted that this boundary or defining point is higher in Family Finder, MyHeritage and LivingDNA, sites that tend to pad their match strengths with very small, possibly false matching segments or by imputing between very different test kits, which artificially extends the sizes of some matching segments. In our experience, 50 to 60 cM is a better boundary for these sites.
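The "sticky" behaviour can be put into rough numbers. The sketch below uses the simplification from this tutorial that a segment of N cM has about an N% chance of being split in each generation, plus a 50% chance of being passed at all. The function names are hypothetical, and real recombination is less uniform than this model assumes.

```python
def chance_split(segment_cm: float) -> float:
    """Approximate per-generation chance that a cross-over splits the
    segment (about 1% per cM, capped at 100%)."""
    return min(max(segment_cm, 0.0) / 100.0, 1.0)


def chance_unchanged_if_passed(segment_cm: float, transmissions: int) -> float:
    """Chance the segment keeps its full size over `transmissions`
    parent-to-child steps, given that it was passed each time."""
    return (1.0 - chance_split(segment_cm)) ** transmissions


def chance_passed_unchanged(segment_cm: float, transmissions: int) -> float:
    """Chance the segment is both passed (50% each step) and left unsplit
    over `transmissions` parent-to-child steps."""
    return (0.5 * (1.0 - chance_split(segment_cm))) ** transmissions


# Three generations holding the same segment means two transmissions.
for cm in (20.0, 10.0, 7.0):
    print(f"{cm:4.1f} cM: unchanged over 2 steps if passed = "
          f"{chance_unchanged_if_passed(cm, 2):.0%}")
```

Under these assumptions, a 20 cM segment that is passed down twice stays the same size about 64% of the time, which is consistent with seeing the same segment size in three generations.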
The match lists climb dramatically in size with each of the categories above. In early 2020, with Ancestry, a typical colonial-ancestry American will see a few dozen Close Relations, one to a few thousand Distant Relations, and 40 to 100 thousand Possible, Distant Relations. Using clustering tools to include some of the Possible, Distant Relations helps build larger clusters for the Distant Relations.