Half-identical and full-identical matching in autosomal DNA test analysis can be tough concepts to understand. Once understood, though, they lead to a better grasp of the test process and the results reported, especially the results from comparing full siblings and members of endogamous populations. Understanding these matching concepts and how to use the tools is the best way to tell from DNA results whether siblings are full or half. We start with a graphic from GEDMatch to quickly visualize the difference between these terms. We then explain the two types of match regions in words and show how to distinguish them.
Full- and half-identical are terms created by the genetic genealogy community and its test and tool developers to help explain the match process they compute. They are not concepts that apply outside this community to describe what actually exists in nature; they are artificial concepts born of the testing process. Match regions and matching segments are also very different terms that should be more fully understood, especially in the context of visualizations. It is equally important to understand how tools determine matching segments (that is, where they start and, very importantly, how they end). A matching segment is not simply a string of matching SNP (allele) values from each chromosome.
As part of this work, we introduce two variants of the half-identical match term: pseudo-half-identical and true-half-identical. The distinction helps explain why what you see in the tools differs from what you would expect in reality. Virtually all analysis tools are pseudo-half-identical match tools, because this is the quickest and easiest answer to generate from the data provided by the current testing process. In reality, it is a lazy solution. We call for the abandonment of pseudo-half-identical reporting in all tools and push instead for the use of only true-half-identical reporting.
Provided here are three graphic chart examples showing the autosomal comparison using GEDMatch's Chromosome Browser.
Full-Identical (Green) Regions | Half-Identical (Yellow) Regions | No Matching (Red) Regions
GEDMatch visually indicates true-half-identical and full-identical regions in yellow and green, respectively, in the Chromosome Browser graphical option of its 1:1 comparison tool. (Red represents no match.) The solid green areas are the full-identical regions, where BOTH chromosomes of a pair match both chromosomes of the pair being compared. These are the regions that must be doubled when summing up the total matching segment length reported by a pseudo-half-identical tool, and it is these regions that are under-reported in ALL pseudo-half-identical matching tools. Below each chromosome is a bar that is either grey or blue, showing whether that region of the chromosome is considered a matching segment (blue) or a non-matching segment (grey). Note the use of the term segment here. Blue with green above is likely a full-identical region made up of two or more overlapping matching segments, drawn from each chromosome. Blue with yellow above is a true-half-identical region with a single matching segment from only one chromosome. Grey below should have mostly red above and represents a region with no matching segment. Note that we all share much of our DNA, so you will rarely see solid red areas.
Virtually all autosomal analysis tools do only pseudo-half-identical matching segment analysis and reporting. This is partly because the genetic genealogy testing process cannot tell which chromosome of a pair each SNP allele measurement came from (the same applies to the two X chromosomes in females). Each person has two copies of each autosome. The lab returns two values for each SNP but does not know which value belongs to which chromosome of the person's pair. Because we get one chromosome of each autosome from each parent, this mixes up the contributions of the two parents in a tester's autosomal DNA test result. Analysis tools thus do an aggressive pseudo-half-identical match by simply creating the longest matching segments possible using either value from the pair returned for each SNP. In reality, they may be creating matching segments composed of SNPs that alternate between each parent's contribution on two different chromosomes. More importantly, though, the pseudo-half-identical tools never go back and look for further matching segments among the remaining, unselected SNP allele values that were not used to form the initial matching segment (that is, they never go back to recognize and further process any full-identical regions). Tools that recognize and report full-identical matching segments are likely also doing true-half-identical matching, that is, differentiating between full-identical and half-identical regions across the pair of each chromosome.
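To make the per-SNP logic concrete, here is a minimal Python sketch of the comparison described above. It is an illustration under simplifying assumptions, not the algorithm of GEDMatch or any other tool: the function names are ours, runs are measured in SNP counts rather than cM, no mismatches are tolerated inside a run, and the genotype values are made up.

```python
from typing import List, Tuple

# One unphased SNP result: the two allele values the lab returns, in no particular order.
Genotype = Tuple[str, str]


def classify_snp(a: Genotype, b: Genotype) -> str:
    """Classify a single SNP position for two unphased test results.

    "full" -> both allele pairs are the same (order ignored)
    "half" -> at least one allele value is shared
    "none" -> no allele value is shared
    """
    if sorted(a) == sorted(b):
        return "full"
    if set(a) & set(b):
        return "half"
    return "none"


def pseudo_half_identical_runs(
    kit_a: List[Genotype], kit_b: List[Genotype], min_snps: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, of consecutive SNPs
    that share at least one allele. The run greedily accepts either value of
    each pair, so it may stitch together alleles from different chromosomes.
    min_snps is a hypothetical threshold chosen for this toy example.
    """
    runs = []
    start = None
    n = min(len(kit_a), len(kit_b))
    for i in range(n):
        if classify_snp(kit_a[i], kit_b[i]) != "none":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
    if start is not None and n - start >= min_snps:
        runs.append((start, n))
    return runs


# Made-up data: three half-identical SNPs, one mismatch, one full-identical SNP.
example_a = [("A", "G"), ("C", "C"), ("T", "G"), ("A", "A"), ("C", "T")]
example_b = [("G", "G"), ("C", "T"), ("G", "G"), ("G", "G"), ("C", "T")]
print(pseudo_half_identical_runs(example_a, example_b))  # [(0, 3)]
```

Note that the run function never looks at the "full" classification at all. A true-half-identical tool would, in this sketch, make a second pass over each run and separate the "full" positions from the "half" ones.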
For the most part, pseudo-half-identical match analysis is good enough when comparing two individuals' autosomal test results. It differs greatly from reality only in endogamous populations, when comparing full siblings, or in similar cases where both of one tester's parents are related to the other tester's parents. But even outside these conditions, you could have a full-identical match region with another tester IF (a) one of your parents is related to one of your match's parents and, separately, your other parent is related to the match's other parent; and (b) the matching segments carried down from each happen to overlap in the same region on the same chromosome. This is why true-half-identical and full-identical analysis is always important.
Half-identical matching, of either variety, can report longer matching segments than actually exist, because the technique extends a matching segment through areas where allele values are common. The tools often handle these errors due to common DNA specially. The total segment length match error appears to be under 10% in practice. It seems there is enough variation in the DNA of the general population, and a large enough number of SNPs measured in the autosomes (over 600,000 in most cases), to make half-identical matching practical.
Full Siblings and Half-Identical Match tools
With full siblings, it is important not to look too closely at the tabular match report values from pseudo-half-identical tools. They can be quite inaccurate for both the total matching segment length and the number of matching segments. Pseudo-half-identical match tools do not recognize and add back in the contribution of full-identical match regions (to generate a truer total matching segment length). Also, full siblings, much more than members of endogamous populations, have matching segments on each chromosome copy that often overlap each other. Half-identical matching tools cannot detect these as two separate matching segments, so they instead report one long matching segment in place of the two overlapping ones. Besides generating a lower matching segment count, this also under-reports the total matching segment amount. This is why pseudo-half-identical matching tools report around 38% ± 3% matching for full siblings as opposed to 50% ± 5% (which should be roughly the average and standard deviation that really exist in nature). The overlap area of matching segments from the two different chromosomes of a pair is termed a full-identical region. It will often appear "surrounded" by a half-identical region (see above; most green areas are surrounded by yellow ones). In half-identical match tools, full siblings will exhibit about 25% of their autosomes as true-half-identical and about 12.5% as full-identical. When it does not look for and incorporate full-identical matching segments, the tool simply measures a total pseudo-half-identical matching segment length of about 37.5%, on average, and implies that a larger-than-expected 62.5% does not match at all. In reality, full siblings share about 50% ± 5% of their autosomal DNA, and there is no real concept of half-identical matching in nature.
A pseudo-half-identical match tool will never report more than 50% matching, even though more than that often really exists with full siblings, and especially with identical twins.
If the test results of both individuals to be compared are phased, then more accurate match analysis is possible using a half-identical match tool. That is, you can look for matching segments on each chromosome individually, because the chromosomes and their sequences of SNP values have been recreated for each parent's contribution from the test results. You do not create any false, longer matching segments, because the values you compare come from the same region of each separate chromosome. By definition, there are no full-identical matching areas when comparing phased results in a pairwise fashion. To use phased results with full siblings, you compare the paternal chromosomes to each other and then the maternal chromosomes to each other, and simply sum the results (total matching segment length and number of matching segments). Note that to phase the testers, you need at least one of the parents, and preferably both, tested. The phasing process is not 100% accurate (especially with only a single parent), so the comparison with phased results will still not match nature (i.e. reality) exactly. But it is much closer to the expected reality than half-identical matching on the lab results alone.
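The paternal-to-paternal plus maternal-to-maternal procedure can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how GEDMatch implements it: each phased chromosome is represented as a plain string of allele letters, lengths are counted in SNPs rather than cM, and the names `matching_runs`, `phased_sibling_total` and `min_snps` are ours.

```python
from typing import Dict, List, Tuple


def matching_runs(hap_a: str, hap_b: str, min_snps: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, where two phased
    haplotypes carry the same allele at every position in the run."""
    runs = []
    start = None
    n = min(len(hap_a), len(hap_b))
    for i in range(n):
        if hap_a[i] == hap_b[i]:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
    if start is not None and n - start >= min_snps:
        runs.append((start, n))
    return runs


def phased_sibling_total(
    sib1: Dict[str, str], sib2: Dict[str, str], min_snps: int = 3
) -> Tuple[int, int]:
    """Compare paternal to paternal and maternal to maternal, then sum.

    Returns (number of matching segments, total matched length in SNPs).
    """
    count = 0
    total = 0
    for side in ("paternal", "maternal"):
        for start, end in matching_runs(sib1[side], sib2[side], min_snps):
            count += 1
            total += end - start
    return count, total


# Made-up phased data for one short chromosome.
sibling_1 = {"paternal": "AACGTTAG", "maternal": "CCGTAACC"}
sibling_2 = {"paternal": "AACGTCCA", "maternal": "TCGTAAGT"}
print(phased_sibling_total(sibling_1, sibling_2))  # (2, 10)
```

In the example the paternal run covers positions 0 to 4 and the maternal run covers positions 1 to 5. An unphased tool would see one merged region there, whereas the phased comparison counts two segments and both lengths.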
When phased results are not available, a chromosome browser option in the half-identical matching tool can help. If it indicates the sections of full-identical matching separately from the sections of half-identical matching (as visually shown above), then you can detect any full-identical matching (and basically confirm visually that the testers are full siblings). If the tool provides tabular output of full-identical matching segments, you can simply add that full-identical total to the reported half-identical match total to get a more accurate total matching segment length. The total number of segments will still be off, but at least you have a closer estimate of the actual total shared autosomal DNA.
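As a small worked example of that addition, the sketch below sums two segment tables. The table layout (chromosome, start cM, end cM), the function names and the numbers are all hypothetical; real tool exports will differ. It assumes, as described above, that the half-identical report already counts each full-identical region once.

```python
from typing import Iterable, Tuple

# Hypothetical row layout: (chromosome, start_cM, end_cM)
Segment = Tuple[int, float, float]


def total_length(segments: Iterable[Segment]) -> float:
    """Sum the cM length of every segment in a table."""
    return sum(end - start for _, start, end in segments)


def adjusted_total(half_report: Iterable[Segment], full_report: Iterable[Segment]) -> float:
    """Add the full-identical length on top of the half-identical total,
    so that full-identical regions end up counted twice."""
    return total_length(half_report) + total_length(full_report)


# Made-up tables for illustration only.
half_segments = [(1, 0.0, 100.0), (2, 20.0, 80.0)]
full_segments = [(1, 30.0, 50.0)]
print(adjusted_total(half_segments, full_segments))  # 180.0
```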
Using phased results of full siblings to improve accuracy
Using phased results to more accurately measure shared DNA in full siblings
Note: GEDMatch, when creating phased kits, duplicates the single SNP value to create a pair of values. This is done to make all kits appear the same to the tools. Thus phased kits will not show yellow for matched areas but only ever green where matching. This does not imply a full-identical region, as only one chromosome is really being compared.
Full siblings will have roughly 12.5% of their autosomes in full-identical match regions, meaning both chromosomes of each autosome match their respective counterparts. This contributes half of their real matching total (25%), because you have to double the full-identical match areas when calculating the total match. The other half of the resulting total match comes from half-identical match regions, which comprise about 25% of the autosomes on average. This leaves about 50% of the autosomes unmatched.
VISUALLY, in half-identical match tools, these match percentages appear nearly doubled and the non-matching regions appear halved. This is due to the mash-up (i.e. merger) of the matching on both chromosomes of a pair being displayed together. Because of the way the half-identical match tools VISUALIZE their results, full siblings will visually show about 50% of their chromosomes as half-identical and about 25% as full-identical, with only 25% not matching. (These happen to be the numbers given by ISOGG on their Full Identical Match page, which is why we try to explain them here. They are artificial, visualization-only numbers that do not represent any kind of reality.) The total match percentage in pseudo-half-identical tools will be reported as 38% (on average) instead of the visual 75% (ignoring for the moment that we expect a 50% match, on average, as that is what happens in nature). The reported, calculated shared segment amount is only half of what visually appears. This can be really confusing, especially given that the tools report 62% as not matching and yet only 25% of the chart has grey, non-matching segment bars. Using the process to compare phased test results will yield more accurate match numbers that better reflect the real biology underneath. Remember, half- and full-identical are concepts born of the test process and tools, not of the real biology underneath.
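The relationship between the visual, reported and real percentages is simple arithmetic, sketched below in Python. The function name and dictionary keys are ours; the formulas just restate the reasoning above (a half-identical area matches on one chromosome of the pair, a full-identical area on both) and use the expected averages, not measured data.

```python
def sharing_summary(full_visual: float, half_visual: float) -> dict:
    """Convert chromosome-browser fractions into the other numbers discussed.

    full_visual: fraction of the chart shown as full-identical (green)
    half_visual: fraction of the chart shown as half-identical (yellow)
    """
    none_visual = 1.0 - full_visual - half_visual
    return {
        # What the eye sees as "matching" (blue bar below).
        "visual_matching": full_visual + half_visual,
        # Pseudo-half-identical report: every matching area counted once,
        # expressed against both chromosomes of the pair.
        "pseudo_reported": (full_visual + half_visual) / 2,
        # Real sharing: full-identical areas count on both chromosomes.
        "real_shared": (2 * full_visual + half_visual) / 2,
        # Real non-sharing: red areas on both chromosomes, plus the
        # non-matching chromosome "hidden" behind each yellow area.
        "real_unshared": (2 * none_visual + half_visual) / 2,
    }


full_siblings = sharing_summary(full_visual=0.25, half_visual=0.50)
half_siblings = sharing_summary(full_visual=0.0, half_visual=0.50)
print(full_siblings)
print(half_siblings)
```

For the expected full sibling averages this gives 75% visual, 37.5% reported and 50% real sharing. For half siblings it gives 50% visual and 25% for both the reported and real figures, which is why the pseudo-half-identical number is only misleading when full-identical regions are present.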
To further demonstrate and emphasize this point, we have access to the test results of seven (7) siblings and both their parents, all with the same test company (Ancestry in this case). The top chart here compares the siblings on GEDMatch using half-identical tools (lower left half of the chart) against using phased results and summing the totals of each piece-wise comparison (upper right half). The second, bottom chart simply gives the results extracted from GEDMatch when comparing the phased maternal against maternal (lower left half) and the phased paternal against paternal (upper right half). It is the second chart's values that we sum to get the upper right half of the top chart. The average, as shown at the bottom, is 50.3% (plus or minus 5.3%) for the phased comparison result (which is about what would be expected), and 37.8% (+2.4%, -3.7%) for the raw, unaltered comparison of the siblings in the standard, pseudo-half-identical GEDMatch tool. To further indicate the variance, and thus the unsuitability, of the pseudo-half-identical report, note that the sibling pairs with the maximum (red) and minimum (blue) percent shared values do not line up between the charts. That is, the paternal-to-paternal comparison max/min falls on different siblings than the maternal max/min, which in turn differs from the standard pseudo- and true- tool run. So it does not appear that you can apply an algebraic scaling factor to the values reported by the pseudo-half-identical tool to get closer to reality. (We must point out that adding the predicted 12.5% full-identical value to the 37.8% does give the 50.3% shown here. Likely a fluke but ... "The Outer Limits" theme song rings in our head here.) The half-identical tools cannot do any better with the data they are given from the lab run. The more accurate results can only be generated from the phased sibling data, which currently requires that test results be available from one or both parents (both for highest accuracy).
The GEDMatch tool called "Are your parents related" is basically looking for areas of full-identical match to yourself within your own DNA sample, not comparing it to someone else's. This is different from half-identical and full-identical matching, which compare the results of two different people. In endogamous populations where you show full-identical matching with a non-sibling, this tool will likely show that your parents are related. (Note: there is a slight chance that you and the other tester are both related to someone on both your maternal and paternal sides while your parents do not show as being related, so that you show full-identical match segments reflecting the common ancestor from each parent individually. As we continually say, biology and statistics are messy.)
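One way to picture a self-comparison of this kind is as a search for long runs where both values returned for each SNP are the same. The Python sketch below shows that idea only. We do not know the actual algorithm or thresholds GEDMatch uses, so the function name, the `min_snps` threshold and the data are assumptions for illustration; a real tool would need much longer runs, since short stretches of identical values are common in everyone.

```python
from typing import List, Tuple

Genotype = Tuple[str, str]  # the two values returned for one SNP


def homozygous_runs(kit: List[Genotype], min_snps: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, of consecutive SNPs
    in a single test result where both allele values are identical."""
    runs = []
    start = None
    for i, (first, second) in enumerate(kit):
        if first == second:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
    if start is not None and len(kit) - start >= min_snps:
        runs.append((start, len(kit)))
    return runs


# Made-up single kit: four identical-value SNPs, then mixed values.
own_kit = [("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"), ("A", "G"), ("C", "C"), ("C", "T")]
print(homozygous_runs(own_kit))  # [(0, 4)]
```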
Notes on the GEDMatch Chromosome browser images and otherwise
- Each individual chromosome match region image is scaled to fill the image. The amount of scaling is shown below each chromosome match bar. To help remind you, the length of each chromosome in cM is shown to the right of each chromosome match region. This should help emphasize the scale difference. Perceived, visual matching areas must also be scaled accordingly. Large matching segments in chromosome 1 will dwarf any lack of matching segments in later chromosomes.
- If you look at the half siblings Chromosome Browser match chart, about 50% of the area appears to depict matching segments (a blue bar below), after accounting for the scaling described above. This is because these are half-identical match areas AND the chart shows the matching for both chromosomes mashed up together. In reality, there is only 25% total matching across both chromosomes, as in half-identical match areas the blue bar is "hiding" a grey bar behind it from the other chromosome. Similarly, in the full siblings chart, more like 75% of the area appears to have blue bars below indicating matching segments. This represents the reported 38% TOTAL matching segment amount. The 25% non-matching area (grey bar below) has to be doubled to get the total 50% non-matching in reality (which is still less than the roughly 62% non-matching reported, because those half-identical areas are hiding some non-matching on the other chromosome). This is what makes it so confusing. The ISOGG wiki chooses to give the mixed numbers for how the half-identical tools report things. We prefer to give the numbers for the actual, real DNA match and then explain why the half-identical tools visually appear to show something different.
- Note that a half-identical match tool will NEVER report more than 50% total matching. But Full Siblings will often have more than that. The only way to more accurately measure Full Siblings is by the phased mechanism described above. With this method, you will often see a greater than 50% match (as well as less than 50% at times).
- Everything described here also applies to analysis of the X chromosome, especially in females. Some tools report and compare the X; others do not.
- Chromosome browser visualizations try to cram more data onto the screen than can be depicted. So, naturally, they over-simplify the data visually and can indicate things that do not really exist. A classic example we often see is a region where chromosomes are naturally similar and the SNPs are naturally the same. Such a region may show lots of green, appearing like a full-identical match area because many SNPs in the area are the same. But there are enough gaps where SNPs are not the same (or the tools recognize these similar, pile-up areas) that the tools do not actually report it as a matching segment. It is always best to rely on the tabular form of the results for a more accurate picture of what the tool is calculating. Use the visualization for a quick, rough analysis only, such as finding full-identical regions with pseudo-half-identical tools. Or at least use the "zoomed in" form of the chromosome bar chart, which tries to use one pixel per SNP so that all SNPs are accurately portrayed. Note that the "zoomed in" form still does not show where in the chromosome each SNP really is or how many base pairs may lie between SNPs.
- See our news article about Blaine's Autosomal Comparison data collection. It is interesting to note that the bar chart of the distribution of values he collected somewhat mimics our 7-sibling data above, in that the values are not "normal" (in the statistical sense) around the average but weighted below it. Also interesting is the fact that the phased-comparison result we show is normally distributed around the average. This seems worthy of a Jim Bartlett segment mathematical model study to explain why!
- Comparing results from different companies, and even from different test versions of the same company, is not easy. One of the main reasons is that the different test company runs test for different SNPs. This was first alluded to by Felix Immanuel in a blog post while he was trying to develop his Y-STR analysis tools. Four years later, ISOGG is finally trying to do a similar analysis and document the issue, which has become a major one with the latest-generation equipment now being used by LivingDNA and 23andMe (v5).
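To illustrate the last note, the Python sketch below measures how many SNPs two kits have in common before any comparison is attempted. It assumes each kit has been loaded into a dictionary keyed by SNP identifier; the function names, the identifiers ("rs1", "rs2", ...) and the genotypes are made up and do not correspond to any real chip.

```python
from typing import Dict, List


def common_snps(kit_a: Dict[str, str], kit_b: Dict[str, str]) -> List[str]:
    """Return the sorted SNP identifiers present in both kits.
    Only these positions can be compared at all."""
    return sorted(kit_a.keys() & kit_b.keys())


def overlap_fraction(kit_a: Dict[str, str], kit_b: Dict[str, str]) -> float:
    """Fraction of the smaller kit's SNPs that the other kit also tested."""
    smaller = min(len(kit_a), len(kit_b))
    if smaller == 0:
        return 0.0
    return len(kit_a.keys() & kit_b.keys()) / smaller


# Made-up kits from two hypothetical test versions.
kit_from_company_1 = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
kit_from_company_2 = {"rs2": "CT", "rs3": "TT", "rs4": "GG", "rs5": "AA"}
print(common_snps(kit_from_company_1, kit_from_company_2))  # ['rs2', 'rs3']
```

The fewer SNPs two kits share, the fewer positions a matching tool has to work with, which is one reason cross-company comparisons are less reliable.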
Further Reading
- ISOGG Wiki Fully Identical Region explanation (shows some chromosome browser captures from tools other than GEDMatch; note that it has incorrect numbers for the percentage / amount matching between close relations, as detailed above)