Clustering

Currently a placeholder for some of the first level of atDNA match analysis. It applies primarily to clustering of matches from a primary tester's match list. Eventually this page will also cover the original form of clustering, which groups matches with common matching segments obtained from detailed segment analysis (what we may start terming segment clustering). For now, only the former, newer and more popular technique is covered here.

Clustering, in general, is the grouping of matches from a primary tester's match list into mostly disjoint sets (or clusters) of their common matches. The grouping is based on members of a cluster appearing in each other's match lists. A "strict" cluster is one where every member of the cluster appears in every other member's match list. A "merge" cluster is one where each member appears in the shared match list of at least the primary and one other match. Automated clustering tools often have settable criteria, for example that a cluster member must appear in at least 50% of the other cluster members' match lists. If you are creating clusters manually, the availability of shared match lists makes it a lot easier to determine a cluster. Otherwise, you must have access to the match lists of your matches. Match list clustering is often termed "color clustering" after Dana Leeds' introductory spreadsheet technique, which developed out of finding birth parents for adopted children. Tools that present clusters as matrices have since been released and use colored blocks as well.
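The strict definition and the 50% style criterion above can be checked mechanically. The following is a minimal sketch, assuming each person's match list is available as a set of names; the function names and the sample data are hypothetical and do not come from any particular tool.

```python
from typing import Dict, Set

# Hypothetical input: person -> set of people on that person's match list.
MatchLists = Dict[str, Set[str]]


def is_strict_cluster(members: Set[str], match_lists: MatchLists) -> bool:
    """True if every member appears on every other member's match list."""
    return all(
        other in match_lists.get(person, set())
        for person in members
        for other in members
        if other != person
    )


def meets_threshold(candidate: str, members: Set[str],
                    match_lists: MatchLists, fraction: float = 0.5) -> bool:
    """True if candidate is on at least `fraction` of the other members' lists."""
    others = members - {candidate}
    if not others:
        return True
    hits = sum(1 for m in others if candidate in match_lists.get(m, set()))
    return hits / len(others) >= fraction


# Made-up example: Dee matches Ann but not Bob or Cy.
match_lists = {
    "Ann": {"Bob", "Cy", "Dee"},
    "Bob": {"Ann", "Cy"},
    "Cy": {"Ann", "Bob"},
    "Dee": {"Ann"},
}

strict_abc = is_strict_cluster({"Ann", "Bob", "Cy"}, match_lists)
strict_abcd = is_strict_cluster({"Ann", "Bob", "Cy", "Dee"}, match_lists)
dee_ok = meets_threshold("Dee", {"Ann", "Bob", "Cy", "Dee"}, match_lists)
```

Here Ann, Bob and Cy form a strict cluster, while adding Dee breaks it; Dee also fails the 50% test because she is on only one of the three other lists.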

Clustering became a new fad in 2018 within genetic genealogy, one that everyone is jumping on, including both users and tool providers. In reality, it has been around in various forms for many years, just mostly focused on segment clustering historically. Ancestry has the largest match / test database and does not provide matching segments, so the shared match list form of clustering has become very popular.

In a perfect world, clustering simply identifies which ancestor(s) the matches have in common. Often, we start by creating two clusters to bin matches to either the paternal or maternal side. Then, maybe using the Leeds Method, we try to create four clusters that hopefully represent the four grandparents, then eight clusters to represent the eight great-grandparents, and so on. The issue is that we are not in a perfect world. We have an imbalance of matches that come from our various ancestors. We have matches that are not equally distant from each other (maybe two matches are siblings but 5th cousins to the primary). And we have variance in DNA match strength between similarly distant relatives, which can even cause them to not appear on a match list when they should. Only relatives closer than 3rd cousins are usually guaranteed to appear as a (DNA) match. If you have second cousins descended from all the different siblings of the primary tester's grandparents, then you can likely create four grandparent clusters cleanly and use them to sort more distant matches into one of those four bins. If you do not have such complete coverage (which rarely occurs unless you solicit it), it can be a frustrating process to get down to the four initial clusters.

A shared match list is not a cluster, although at first blush it appears like it could be. To make it a real cluster, you have to check that each member of the shared match list appears on every other member's match list. But it is the start of a simple, first-approach cluster, especially if you remove close matches as the Leeds Method describes.

A shared match list is a great place to start creating a merge cluster. Take the primary and the match used to create the shared match list and add them to a cluster. Then, for each member of the shared match list, look at their shared match list with the primary and add each member there to the cluster. If you start with the closest match to the primary that is no closer than a 2nd cousin, then you can likely identify this merge cluster with a grandparent of the primary tester. Then take the next, more distant match that is not yet part of a cluster and repeat the process. After creating four clusters, you may just have your list partitioned into the four grandparents. (Note: this is an even simpler process than the one described in the Leeds Method, but it works surprisingly well for most testers who have 2nd cousin matches.)
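The manual merge process above can be sketched in a few lines. This is only an illustration under stated assumptions: `ordered_matches` is a hypothetical list of match names already pruned to exclude anyone closer than a 2nd cousin and sorted strongest first, and `shared_with_primary` is a hypothetical mapping from each match to the shared match list between that match and the primary. The primary tester is implicitly a member of every cluster and is left out of the sets.

```python
from typing import Dict, List, Set


def build_merge_clusters(ordered_matches: List[str],
                         shared_with_primary: Dict[str, Set[str]],
                         max_clusters: int = 4) -> List[Set[str]]:
    """Greedy merge clustering seeded from the strongest unassigned match."""
    clusters: List[Set[str]] = []
    assigned: Set[str] = set()
    for seed in ordered_matches:
        if len(clusters) >= max_clusters:
            break
        if seed in assigned:
            continue  # already placed by an earlier, stronger seed
        cluster = {seed}
        # Add everyone on the seed's shared match list, then everyone
        # on each of those members' shared match lists with the primary.
        for member in shared_with_primary.get(seed, set()):
            cluster.add(member)
            cluster.update(shared_with_primary.get(member, set()))
        clusters.append(cluster)
        assigned |= cluster
    return clusters


# Made-up data: M1 is the strongest match, M6 the weakest.
ordered = ["M1", "M2", "M3", "M4", "M5", "M6"]
shared = {
    "M1": {"M3"},
    "M3": {"M1", "M5"},
    "M5": {"M3"},
    "M2": {"M4"},
    "M4": {"M2"},
    "M6": set(),
}
clusters = build_merge_clusters(ordered, shared)
```

With this sample data the result is three clusters: M1 with M3 and M5, M2 with M4, and M6 alone. Nothing in the sketch forces clusters to be disjoint; a match that shows up in two clusters is a hint that it is too close or that the lines are related.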

You want to set a lower bound on the match strength to use when doing manual clustering like this, say 80 cM for the total matching segment sum. This helps keep the manual process easier and avoids the pitfall of clusters formed only from more distant cousins and thus more distant common ancestors. Also remember that when we say only matches "no closer than 2nd cousins" should be included, we mean leaving out 1st cousins, 1st cousins once removed, and so on. Just setting an upper bound on the total matching segment sum is not enough, as the variance in match strength beyond 1st cousins is already pretty wide. Setting the upper bound so that no 2nd cousins are included can help generate 8 clusters, but only if you have a healthy number of tested 3rd cousins descending from many of the 2x great-grandparents. Note that more distant matches are not covered by this manual clustering technique. Leave those to the automated tools and later-stage analysis, as you tend to get into tens to a hundred disjoint clusters. This is mainly due to the much larger number of ancestors to create clusters against, the variance in whether actual relatives match at all (3rd cousins are known to sometimes have zero match strength), and the lower overlap of "strict cluster" matching across descendants of each of your 2x great-grandparents.

Tools that do clustering fall on a spectrum between strict and merge. They may have parameters to control the process, or they may simply use fixed techniques and heuristics that are possibly not even visible to the user. So the success with different tools and different parameters (if available) can vary widely, not to mention that it can vary widely with different testers and their match lists. Often, at minimum, you want to control the starting match list by pruning matches that are too strong (close) and those that are too weak (distant).

Usually you ignore matches that are closer than 2nd cousin when creating clusters. Also ignore any descendants of the primary tester that might appear on their match list (and on their matches' match lists). Otherwise, just as when endogamy or a relationship between the parents exists, you will create super large, all-inclusive clusters that do not tell you anything.

The biggest benefit of clusters is grouping weak matches (< 50 cM) with known, stronger matches (2nd and maybe 3rd cousins) so you can identify the (great) grandparents that contributed the DNA leading to the shared matches. This is especially valuable if you do not have segment triangulation to rely on, because you can at least partition distant matches by the grandparent they may be related to.

While one could develop clusters by hand, it is much easier to use an automated tool that has access to the primary tester's matches and to each match's shared match list with the primary tester. Clustering from simple match lists has been one of the only useful tools for Ancestry match lists, as Ancestry does not provide matching segment data. Only the ICW (in common with) or shared match list with each other tester is needed.

Note that endogamy destroys clustering techniques because matches appear in many clusters. Techniques to remove endogamy effects in shared match lists need to be applied first to make clustering effective in these cases. No such techniques are known by us to exist at this time although heuristics are being analyzed by some.

The key with clustering is to tune the lower and upper match strength bounds of who to include in the clustering process. While the Leeds method specifically works to include 2nd cousins, using matches this strong leads to fewer, larger clusters. Most people doing clustering are looking for smaller, finer-grained clusters that represent a more distant common ancestor. Many clustering tools allow the user to adjust the minimum and maximum shared match threshold (in cM) to include. 200 cM as the high end and 40 cM as the low end is a good starting point in most cases. The higher the thresholds are set, the fewer and larger the clusters created (to the point of possibly having just two, for maternal and paternal, or even one if you include siblings). The lower the thresholds are set, the smaller and more numerous the clusters, possibly to the point that they do not add much value beyond the ICW match lists themselves. Tools that do not allow you to tune these parameters are simply picking them for you, and their choices may not be optimal in your situation.
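Pruning the starting match list to a cM window is simple to express. The sketch below assumes a hypothetical mapping from match name to total shared cM, and treats both bounds as inclusive, which is an arbitrary choice made for the example; the 40 and 200 cM defaults are the starting points suggested above.

```python
from typing import Dict, List


def prune_matches(total_cm: Dict[str, float],
                  low_cm: float = 40.0,
                  high_cm: float = 200.0) -> List[str]:
    """Keep matches whose total shared cM is within [low_cm, high_cm],
    returned strongest first."""
    kept = [name for name, cm in total_cm.items() if low_cm <= cm <= high_cm]
    return sorted(kept, key=lambda name: total_cm[name], reverse=True)


# Made-up match strengths for illustration only.
sample = {
    "sibling": 2600.0,
    "first_cousin": 850.0,
    "second_cousin": 190.0,
    "third_cousin": 75.0,
    "distant": 22.0,
}
window = prune_matches(sample)
wider = prune_matches(sample, low_cm=20.0, high_cm=900.0)
```

Rerunning the clustering with a few different windows, as in the `wider` call, is a cheap way to see how sensitive the resulting clusters are to the bounds.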

Leeds Method

The Leeds method is a first level, basic and manual technique for clustering of shared matches on a shared match list. See references below for instructions on how to use this technique. Refinement notes follow here.

It is most useful for a quick and dirty sorting of 2nd and 3rd cousin matches into grandparent groups. It is especially useful for adoption / unknown parentage cases as a first-order grouping. It can also help sort the strongest 4th cousin matches into these groups, and closer 1st cousin matches into paternal and maternal groups.
The technique assumes:
- You have 10 to 30 matches with a strength of 70 to 300 cM. Stronger matches are OK but are not used. If most / all of the matches are below 100 cM, this technique will likely not work well.
- You have minimal endogamy, minimal matching between your matches that is not through distinct, single lines to a common ancestor, parents who are not related, and no close half-sibling relationships in the pedigree of the main test subject. Any of these throws the simple match list grouping into disarray or weakens the determination of the likely relation of a member of the match list.
You should set an upper and lower bound on the total matching segment strength to include in the main grouping / coloring technique passes:
- The upper bound should likely be at the 2nd cousin level (350 cM) or lower to get 4 groups representing the 4 grandparents. Closer matches will likely fit in multiple groups, and their common matches will muddle the grouping. Use NO upper bound if simply shooting for two groups, maternal and paternal, but still try to avoid including any descendants of the primary tester. Setting the upper bound lower to try for 8 great-grandparent groups can sometimes work. The problem there is that the match strength distributions of the various degrees of kinship start to overlap and muddle the groups.
- The lower bound should not be so small that it includes relatives that are too distant. Maybe simply include the first 100 matches when rank-ordered by match strength, or those at 40 cM or higher, representing 3rd cousins or closer. The more distant the matches, the more subgroups you will start finding / creating. Also, you include many more of each match's ancestors and will start to find cases that muddle this type of clustering. That is, person A will match B due to common ancestor X, person A will match C due to common ancestor Y, and person B will match C due to a common ancestor Z that is not in common with A, but all three will appear in the same group. The more distant the matches, the more likely it is that a shared match is related to the primary and to the match through different ancestral lines.
You can keep the close matches above the upper bound in the spreadsheet and mark them later, but do not use them to create a new group / color, and do not use their shared matches to color others. If you have 1st cousin matches, each should match in 2 of the 4 color groups and thus define the maternal and paternal grandparent groups. You can therefore use 1st cousin matches to sort into parental groups after you use the technique on 2nd cousins to create the grandparent groups.
Remember this method, because it is not based on common segments but merely on common matches, can break down if ancestors are related. Segment clustering / matching will mostly eliminate this issue.
Remember Ancestry will only show shared matches that are 20 cM or larger in total segment match strength. Some extensions and gather tools get around this but can generate very large files of these very distant, possibly false, matches.
It can be useful to use the AncestryDNA Helper Chrome extension (see the 3rd Party Tools page) and let it run for under an hour to get a downloaded spreadsheet of the first 100 matches. Then simply add the four columns to the generated spreadsheet. This allows you to update the spreadsheet later when you run the helper again. Maybe "group" most of the other columns so you can hide them while coloring. You may need to sort the spreadsheet by centiMorgans first. You may need to run the extension for a few hours to get beyond your lower limit if it is set very low. I tend to color the background of rows outside the limits grey so I know to ignore them when doing the main Leeds coloring. Similar processing can be achieved using the DNAGedcom App to capture the Ancestry matches.
If you are working primarily / only in one test kit on Ancestry, and you meet the criteria for getting 4 good, solid grandparent groupings / clusters (or fewer than 8), then you do not need a spreadsheet. You can do the match coloring using Ancestry's colored grouping mechanism to annotate your match list itself. But that coloring is not easily exported in all cases, so a spreadsheet you maintain can be a little more reliable in the long term.
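The note above about 1st cousins matching in 2 of the 4 color groups can be sketched as a small check. This assumes the four grandparent groups have already been built from 2nd cousin matches; the color names, match names, and both function names are hypothetical and used only for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def groups_touched(cousin_shared: Set[str],
                   groups: Dict[str, Set[str]]) -> List[str]:
    """Colors of the groups containing at least one of the cousin's shared matches."""
    return sorted(color for color, members in groups.items()
                  if cousin_shared & members)


def parental_split(cousin_shared: Set[str],
                   groups: Dict[str, Set[str]]
                   ) -> Optional[Tuple[List[str], List[str]]]:
    """If the 1st cousin touches exactly two groups, return those two colors as
    one parental side and the remaining colors as the other; otherwise None."""
    touched = groups_touched(cousin_shared, groups)
    if len(touched) != 2:
        return None  # ambiguous: too few or too many groups touched
    other = sorted(color for color in groups if color not in touched)
    return touched, other


# Made-up grandparent groups and a 1st cousin's shared match list.
groups = {
    "blue": {"A", "B"},
    "green": {"C"},
    "red": {"D", "E"},
    "yellow": {"F"},
}
split = parental_split({"A", "C", "X"}, groups)
```

In this example the cousin shares matches with the blue and green groups, so those two belong to one parent and red and yellow to the other. Which side is maternal and which is paternal still has to come from knowing how the cousin is related.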

Match List Matrix Clustering


[Figure: three clustering charts, titled "No Endogamy, Colonial", "Ancient Endogamy, Isolated Village", and "Extreme Endogamy"]
Various types of clustering results obtained from the Genetic Affairs tool on MyHeritage are given here. Tool settings are automatic and not adjustable. No second cousins or closer were selected with the chosen settings. The first is a colonial American tester (also Eastern Slav) with tight, nearly full / complete clusters. The second is a recent immigrant from a small Slavic country with ancient roots. The third is a highly endogamous Parsi from India. The first two indicate that the parents are not related (meaning there are no identical segments of DNA within the tester between the paired chromosomes). The third indicates that the parents are related due to the extreme endogamy. The Parsi population exhibits that strongly, even today, as they have been isolated and intermarried in India for over 1,000 years. The charts represent the extremes of what one might see when trying to cluster a match list.

Just recently, Jim Bartlett has tried to put more rigor into the process we describe above of going from 2 to 4 to 8 clusters and so on. The ability to tune the clustering tools to enable this is just not really there, but he takes a stab at describing how you can do it with a very manual, judgment-intensive process. See Walking Back The Clusters on his blog.

Tools

Genetic Affairs (for Ancestry and FamilyTreeDNA kits; also the solution now provided by MyHeritage) by Evert-Jan Blom
DNAGedcom Client (paid) CLM (works off Ancestry, 23andMe, FamilyTreeDNA, ... kits)
GEDmatch Tier1 (paid) has added a cluster tool; nicely shows segment triangulations in the blocks as well (also due to Evert-Jan Blom)
Jonathan Brecher's Shared Clustering tool

External References

Dana Leeds' Leeds Method article (from the source), Roberta Estes' intro
Kitty Cooper's blog post on automated clustering tools
Family History Fanatics YouTube video on the GEDmatch tool
Jim Bartlett's Walking Back The Clusters
