Blaine Bettinger started a survey of Autosomal matching values a bit over a year ago. When it first came out, it was exactly what we were looking for at the time and drew great excitement from us and others. It was the project we wanted to love, based both on the strength of the author, in our eyes, and on the citizen-scientist focus of the expected results. But in the process of trying to fill out the survey forms, we learned a lot more about the issues of comparing data from different services and even of collecting data off the GEDMatch site. At the time, we mentioned these issues, which lead to variances in results of 20% or more, but saw no real response. The data itself clearly shows these issues. With all this said, and the caveats below understood, there is still some 10,000-foot analysis that can possibly glean some value from the results. But, in general, the results are definitely not usable in any kind of genealogical proof or as a basis to build further work on.

Let's start with the Parent / Child relationship and then look at some others. We will finish with some rough, 10,000-foot view items that are anecdotal in nature and still need a more concrete basis before any firm statement can be made.

Parent / Child Relationship

By definition, all these relationships should be identical. Each parent / child match is one half because you get half of your chromosomes from each parent. This is further true because Autosomal is not supposed to include the Somal (sex) chromosomes X and Y, nor the non-nuclear DNA of the Mitochondria. So why all that variation? There are multiple reasons. The different testing companies and GEDMatch do not truly include just the Autosomal in their match number. X is often lumped in with the Autosomal results, and so it is sometimes included in the match result numbers. In fact, for some people, the longest matching segment is on X and was likely included. The instructions did not document how to account for and remove this variance.

Further exacerbating this issue is the fact that recombination and the testing process / accuracy cause the tools to measure many more segments than chromosomes, usually around 50 segments between Parent / Child. Shouldn't it be 22 segments because 22 chromosomes fully match? You have to understand the testing and matching process, in addition to recombination itself, to understand why there are so many segments. But suffice it to say, variance is introduced into the total matched segment length because the matching algorithms find some gaps and thus report multiple segments in chromosomes that should fully match.

A second big factor is that the different sources of data have different counts for what a total "half match" length or sum is. You really need to normalize the numbers between these different sources to a common count before lumping them together to compare medians and variances.
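One way to picture that normalization is to turn each reported total into a fraction of the full "half match" total for the source it came from. The sketch below is only an illustration: the function name, the source labels, and especially the totals are hypothetical placeholders, not the real figures for any company or for GEDMatch.

```python
def normalize_shared_cm(shared_cm, source_total_cm):
    """Return shared_cm as a fraction of the source's full half-match total."""
    if source_total_cm <= 0:
        raise ValueError("source_total_cm must be positive")
    return shared_cm / source_total_cm


# Hypothetical full half-match totals per source (placeholders, NOT real values).
SOURCE_TOTALS = {"source_a": 3400.0, "source_b": 3600.0}

# Hypothetical submissions: (source, reported shared cM).
reports = [("source_a", 3400.0), ("source_b", 3600.0), ("source_a", 1700.0)]

# Once normalized, a full parent / child match is 1.0 regardless of source.
fractions = [normalize_shared_cm(cm, SOURCE_TOTALS[src]) for src, cm in reports]
print(fractions)
```

With the raw numbers, the two full matches above differ by 200 cM even though they describe the same relationship. After normalizing, they are the same value, and only then does it make sense to compute a median or a variance across sources.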

All of this has been about what should be the strongest, most stable, and easiest value, one that is always considered a "constant". Whoa. But it continues on.

(Full) Siblings

Most people do not realize it, but (full) siblings can theoretically have anywhere from 0 to 100 percent matching between them across ALL their nuclear DNA (or chromosomes). The expected value is still around 50%. So why did the median show up so far off the parent / child median value, which we also expect at 50%? A number of factors need to be addressed.

First, revisiting the above, you have to realize that some of the submitted values included matching X and some did not. Two female siblings would have much more matching X, as the father's X passed to both is essentially identical. As before, this can throw the numbers off significantly, as the X chromosome is fairly large, representing just over 5% of the total base pairs of all the DNA.

Without recombination, we would expect much more variance in matching. Recombination actually smooths out what would otherwise be chunky or sharp jumps in the amount of matching, and thus makes outlying values less likely.
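The smoothing effect can be seen in a very crude simulation. The sketch below is not a genetic model: it assumes 22 equal-length chromosomes and stands in for recombination by cutting each chromosome into a fixed number of independently inherited blocks. The function name, block counts, trial count, and seed are all illustrative assumptions.

```python
import random
import statistics


def sibling_shared_fraction(rng, n_blocks):
    """Fraction of DNA two full siblings share, averaged over both parental strands.

    Each block is inherited independently; on each parent's side the two
    siblings receive the same copy with probability 1/2.
    """
    same = 0
    for _ in range(n_blocks):
        same += rng.random() < 0.5  # same paternal copy?
        same += rng.random() < 0.5  # same maternal copy?
    return same / (2 * n_blocks)


def simulate(n_blocks, trials=5000, seed=1):
    rng = random.Random(seed)
    values = [sibling_shared_fraction(rng, n_blocks) for _ in range(trials)]
    return statistics.mean(values), statistics.pstdev(values)


# No recombination: 22 whole chromosomes inherited as single blocks.
mean_none, sd_none = simulate(n_blocks=22)
# Crude stand-in for recombination: 4 independent blocks per chromosome.
mean_recomb, sd_recomb = simulate(n_blocks=22 * 4)

print(f"no recombination:   mean={mean_none:.3f} sd={sd_none:.3f}")
print(f"with recombination: mean={mean_recomb:.3f} sd={sd_recomb:.3f}")
```

Both cases center on one half, but the spread is noticeably narrower when the chromosomes are broken into more independently inherited pieces.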

The next issue is tied up in the testing process, the matching process, and how values are reported. Results are always reported only as half-matches. That means if matches on both strands of a chromosome overlap, only one of the matches is counted. So this can easily lead to less than 50% matching appearing when in reality the matching was well over 50%. So what is really being measured and reported? Simply an artifact of a matching algorithm that is not geared to handle siblings, and similar close relationships from common parents, well.

To understand this a little bit, remember that the test process cannot tell which value of each SNP comes from which of the two copies of a chromosome. Only in the case of Y, for the most part, do you have only a single value. So the results are simply reported as two values for each SNP of each chromosome. (For males, they often simply duplicate the values in the reporting of the X.) Matching algorithms simply look for the longest strings of matches where they take either value. (This is why phasing is so important. It determines the association of values back to individual chromosome strands and thus creates an ordered value pair and removes false matches that crop up.) Siblings, unlike most others, have significant matching on both chromosomes.
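A minimal sketch of the "take either value" idea follows. It is not the algorithm used by any company or by GEDMatch; real tools also work in cM, apply thresholds, and tolerate some mismatches. The function names and the toy genotypes are made up for illustration.

```python
def half_identical(g1, g2):
    """True if two unphased genotypes (pairs of alleles) share at least one allele."""
    return bool(set(g1) & set(g2))


def longest_half_identical_run(person_a, person_b):
    """Length, in SNPs, of the longest run of consecutive half-identical positions."""
    best = current = 0
    for g1, g2 in zip(person_a, person_b):
        current = current + 1 if half_identical(g1, g2) else 0
        best = max(best, current)
    return best


# Toy unphased results: one (allele, allele) pair per SNP, in chromosome order.
a = [("A", "G"), ("C", "C"), ("T", "T"), ("A", "A")]
b = [("G", "G"), ("C", "T"), ("C", "C"), ("A", "G")]

print(longest_half_identical_run(a, b))
```

Because either value is accepted at every position, the run can hop between the two strands without the algorithm noticing, which is where the false matches that phasing removes come from.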

Wait a minute. You said 0% match? Yes. Assume there is no recombination for the moment. In that case, each sibling gets one chromosome of each type from each parent. You can easily imagine two siblings getting opposite sets from each other. Note that this holds exactly for the Autosomal chromosomes. If you include the Somal chromosomes in the comparison, the siblings have to be two girls or a boy and a girl, because two boys would get the same, unaltered, single Y chromosome from their father. We said nuclear chromosomes because siblings will always have the same Mitochondria, as they always get this from the mother. So during egg and sperm formation, exact "opposite" copies with no overlap can be created. While the chances are infinitesimally small that those exact opposite sperm and egg pairs are used to form two siblings, it is possible. Throwing recombination into the mix only further reduces the chances of producing the exact opposite sperm and egg. We bring this up not because the survey got anything wrong about this specifically, but to point out the range and variance one can expect even with this close relationship.
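To put a number on "infinitesimally small" under the same no-recombination assumption, consider only the 22 Autosomal chromosomes. On each one, the siblings get different copies from the father half the time and different copies from the mother half the time, so they share nothing on that chromosome one time in four. A small sketch of the arithmetic, with variable names that are ours and purely illustrative:

```python
from fractions import Fraction

AUTOSOMES = 22

# No-recombination assumption: per chromosome, the siblings receive
# different paternal copies (1/2) AND different maternal copies (1/2).
p_zero_one_chromosome = Fraction(1, 2) * Fraction(1, 2)

# All 22 autosomes must independently come out "opposite".
p_zero_match = p_zero_one_chromosome ** AUTOSOMES

# By symmetry, matching on both strands of every autosome is equally unlikely.
p_full_match = (Fraction(1, 2) * Fraction(1, 2)) ** AUTOSOMES

print(p_zero_match, float(p_zero_match))
```

That works out to 1 in 4^22, roughly one chance in 17.6 trillion, for either extreme. Possible, but not something to expect in a survey of a few thousand pairs.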

Update: Since this post, we have expanded our Half Identical page to include the analysis of Full Siblings and show how more accurate numbers can be extracted. It is interesting to note, anecdotally, that the variance off the average for sibling matches given by the half-identical tools in our 7-sibling comparison matches the chart they collected with thousands of samples, but that this imbalanced variance does not appear in the phased-results case. A good mathematical modeling test for Jim Bartlett to go figure out why!

Other issues

The biggest issue we uncovered as part of the process was the dramatic difference between GEDMatch's 1:1 and 1:Many matching results. The data collection process did not specify which set of values to use. In fact, I had noticed this difference even before the survey. I thought the difference was maybe because 1:Many used different defaults. I could often come close to recreating the 1:Many result by using 5/500 instead of the default 7/700 values in the 1:1 comparison. But there was still variance, in both directions, even after doing this. In inquiring with the tool's developers to understand why, I determined that the 1:Many tool is a quick, rough-guess estimator, meant to generate a table quickly from the whole database of testers who submitted. The 1:1 tool gives the more accurate, specific result. This variance was incorporated into the survey because the data collection description did not specifically ask for values from one source or the other, or even say what specific parameters to use to generate the submitted results. This is likely one of the largest sources of Garbage In / Garbage Out, as it makes the reported results highly variable purely because of the collection process and not because of variations in the DNA testing or matching process.
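To see how much the threshold choice alone can move a total, here is a small sketch. It assumes that "5/500" and "7/700" mean a minimum segment length in cM and a minimum SNP count per segment. The function name and the segment list are hypothetical, and this is not how either GEDMatch tool is actually implemented.

```python
def total_shared_cm(segments, min_cm, min_snps):
    """Sum the cM of segments that meet both the length and SNP-count thresholds."""
    return sum(cm for cm, snps in segments if cm >= min_cm and snps >= min_snps)


# Hypothetical matching segments for one pair: (length in cM, SNP count).
segments = [(45.0, 9000), (7.5, 720), (6.0, 650), (5.2, 480)]

strict = total_shared_cm(segments, min_cm=7, min_snps=700)  # the 7/700 setting
loose = total_shared_cm(segments, min_cm=5, min_snps=500)   # the 5/500 setting

print(strict, loose)
```

The same pair of testers produces two different totals purely from the settings. A survey that does not pin down the tool and the parameters ends up mixing numbers like these together.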

10,000 Foot Items (the good stuff)

It was surprising that we did not see more Poisson curves versus traditional, balanced Bell curves, especially for the farther out (generation-wise) matches. But these Bell curves may be an artifact of all the factors mentioned above. More analysis is really needed to understand this. Most importantly, the Bell curve for siblings should be completely ignored.

It was nice to see the "range" of even fairly close (in generation) matches including 0% matches. While it can be surprising when we see this in our own initial search and analysis with others of a known relationship, it is statistically possible (even with Siblings!), and so it is good to see that it came through.

It was nice to see the "sticky segment" concept being shown with the fifth cousin and farther matches reported in the right-hand column of his summary chart. While the likelihood of having any match with a "known" cousin of this distance is small, when you do have a match, it tends to be with a sticky segment that does not continue to drop in length. We would have more and larger sticky segments, including whole chromosomes, if recombination did not occur. But sticky segments are a result of the few recombinations that really do occur each generation, and as such, shorter segments have a much smaller chance of being broken up by recombination. Jim Bartlett's Segment-ology Blog has been doing some nice analysis to develop more complex mathematical models to understand this whole process better.

We have more to add and will update this article as time permits. But these are just the initial, quick reactions. In most cases, we did not go into the analysis and details behind the comments, and we may add that supporting material over time, likely to the Wiki in other sections, and simply reference it here. We just felt it was more important to get these points out there to avoid some support issues. We are always getting comments in the project from newbies to the field like "it says we are 2nd to 4th cousins, so we must be close relatives" when they have only a 12 cM total match length across one or two short matching segments on a single chromosome. Whoa. Step back a minute.