What we aim to accomplish:
- Dissuade people from chasing after small autosomal DNA segments by
  - demonstrating too-common matching in particular regions of chromosomes;
  - emphasizing the pitfalls of half-identical matching;
  - briefly reviewing linkage measurements, emphasizing chromosome region lengths (in centiMorgans) as approximations.
- Illustrate the need that AncestryDNA had for redesigning their matching system to account for population-wide shared chromosome regions. Whether or not the current AncestryDNA Timber algorithm is the best possible is not the issue here, but rather that something like it is needed, not just at AncestryDNA but at all the autosomal DNA services.
Warning: Genetics in general is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve. The following will be a bit technical.
On various forums and blogs there is much discussion over the use of autosomal DNA matching for genetic genealogy. This field is emerging as a popular approach to extend what we know as the practice of “genealogy”, but there is much confusion and even angst running through the community of consumers of genetic genealogy tests over exactly what is meant by a “match”, and why companies decide on whom to exclude from match lists.
A common belief or assertion by some who are interested in genetic genealogy is that relatively small regions (below roughly 7 centiMorgans) of a chromosome are important for genealogical research. I won’t quote from various forum posts because I do not want individuals to think that I am picking on them, but whether we look at a Facebook group (e.g., the ISOGG group), the forums on the company websites of DTC DNA testing companies, or general forums about genealogy and ancestry, there seems to be an endless stream of posts from people trying to make small shared regions mean something.
The problem is these small regions often don’t mean what the posters think they mean.
Though we all like to think of ourselves as unique, here I need to emphasize:
☞ We humans are unique assemblages of very common small bits and pieces of chromosomes (DNA).
Furthermore, there are some bloggers who are attempting to be highly visible in their attack on AncestryDNA, which late in 2014 revamped its matching algorithm, to the great dismay of some. This attack on AncestryDNA has played out on Facebook groups as well as forums. Whatever one thinks of the company itself, the need for AncestryDNA to incorporate into their matching algorithm a method to deal with too-commonly matched regions of chromosomes is one of the reasons for writing this post. Regardless of the company one uses to do matching for the purposes of genetic genealogy, that company will improve their product if they can find a way to deal with false positives, and currently AncestryDNA does that, 23andMe does that to some extent, and FTDNA does not (to the best of my knowledge).
One of the issues that drove AncestryDNA to redesign their matching algorithm is, in part, coupled to the problem of matching on small segments.
Let’s take a look at some data – all the data in this post is from a test I manage at 23andMe, and a profile I manage at FTDNA, from Family Finder, which is a transfer of the same 23andMe (V3) raw data of the same person tested at 23andMe. If you have tests at these companies you can download your own data and see for yourself where pile-ups are occurring.
23andMe customers have a tool available to them called “Countries of Ancestry” (CoA, formerly known as “Ancestry Finder”), which is a misnomer. It is a compilation of matches (with match data), composed in part of entries from the person’s DNA Relatives list and in part of others not on that list. Only 23andMe customers who have filled out the ancestry survey will find themselves in somebody else’s CoA list. Unfortunately the CoA list, like DNA Relatives, is limited in the number of people allowed on it (approximately a thousand). Nevertheless, CoA is a list of over 1000 matching segments (as some person matches will be multi-segment matches) and thus useful for our purposes.
We can use the CoA data to plot out the “segments” (technically, half-identical regions) for each chromosome, stacking the segments to show overlaps and total coverage of the chromosome. Here is a plot for chromosome 1 showing 95 matching (to the test used in this post) segments as stacked rectangles:
Fig. 1: Chromosome 1 matching segments for our test, as rectangles, from 23andMe Countries of Ancestry list.
(The red line is a count of overlaps, thus indicating shared regions for that portion of the chromosome, but since the segment blocks are offset vertically by slim white spaces to make the segments visibly distinguishable from each other the red line ends up being on a different vertical scale than the blocks.)
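The overlap count drawn as the red line can be computed with a simple sweep over segment start and end points. The following is a minimal sketch of that idea, not the code used to make the figures; the function name and the toy segment coordinates are made up for illustration.

```python
from collections import Counter

def overlap_depth(segments):
    """Return (position, depth) steps for half-open segments [start, end).

    depth is the number of segments covering the chromosome from that
    position up to the next listed position.
    """
    events = Counter()
    for start, end in segments:
        events[start] += 1   # a segment opens here
        events[end] -= 1     # a segment closes here
    depth = 0
    steps = []
    for pos in sorted(events):
        depth += events[pos]
        steps.append((pos, depth))
    return steps

# Hypothetical segments (arbitrary units, not real match data).
segments = [(10, 50), (30, 70), (40, 60), (200, 240)]
steps = overlap_depth(segments)
peak_pos, peak_depth = max(steps, key=lambda s: s[1])
print(steps)
print("deepest overlap starts at", peak_pos, "with depth", peak_depth)
```

A pile-up shows up as a stretch where the depth is far above what the rest of the chromosome shows.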
We can see from figure 1 that chromosome 1 is nearly completely covered by matches (from the 23andMe Countries of Ancestry list). We also notice that while the matching segments are not evenly distributed along the chromosome, the distribution is not so lopsided as to demonstrate a significant “cold region”, though the peak around 60 Mbp (mega base-pairs) followed by the trough is starting to look suspicious. However, around 240 Mbp there appears to be the beginning of a “pile-up” region.
Let’s look now at another chromosome, 6, which is known to have some regions that are troublesome:
Fig. 2: Chromosome 6 matching segments from 23andMe Countries of Ancestry match list.
Looking at figure 2 it is readily evident that there is an overwhelming number of chr6 matches in a single region of the chromosome, around 30 Mbp. Chromosome 6 is noted for its HLA regions, parts of the chromosome which house genes vital to the human immune system. The segment pileups in these regions are suggestive, and may demonstrate a non-random phenomenon, probably a selection event in human evolution, perhaps even during historic (i.e., since the invention of writing) times.
Also noticeable is a desert of matching around the 100 Mbp region.
This lopsided distribution of matching segments should give us pause: what does it mean for our genealogy efforts “to match” in these cases?
What we are seeing in the figures in this post is how commonly distributed small fragments of chromosomes (or more precisely, sets of alleles, or haplotypes) can be in our society. The 23andMe CoA file only has 1000 people, out of the entire 23andMe database of 800,000 customers. What if everyone in the US were tested, and the CoA list not capped? We likely would see hundreds of thousands of “matches” in these regions of chr6. And it should be noted that in figure 2 some of those segments in the major pile-up areas are over 10cM in length (according to current linkage maps – more on that below.)
In doing genealogy, trying to make sense of these kinds of matches (in overly-common regions) is an exercise in futility. Using the oft-stated standard of a minimum size of 7cM for a match that is likely to be identical-by-descent, trying to identify the most recent common ancestor (MRCA) with a match such as in these chr6 pile-up regions is not tractable, even if the segment surpasses the 7cM threshold; the MRCA could be dozens of generations ago. One could propose that if a large enough sample of our population tested, and if select buried individuals could be exhumed and tested, then we could recreate partial genotypes of the individuals of entire communities of our ancestors from centuries ago. If this could be done then we could determine how common these shared chromosome regions were in any given community of our ancestors, and perhaps trace the rapid growth of particular families or clans. Testing on such a scale is unlikely in the near future, however, and such an effort will likely face other hurdles.
Additionally, this region in chr6 in particular is demonstrating the problem of half-identical matching. The massive pileup is likely not due to a single physical 7cM to 10cM strand of chr6, but to the superpositioning of several smaller (say 0.5cM to 1cM) regions, haplotypes found on chr6 which are very common throughout the European population. By chance, these small fragments will superimpose (given the two copies of chr6 we all carry) to present larger half-identical regions (HIRs) which pass the matching threshold cutoff (say 7cM).
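To make the half-identical pitfall concrete, here is a toy sketch. The haplotypes are invented 0/1 strings (one character per SNP), and the helper names are hypothetical. Two people share no haplotype longer than 4 SNPs, yet an unphased comparison, which only asks whether the two genotypes have at least one allele in common at each SNP, reports one unbroken half-identical run of 8.

```python
from itertools import product

def genotypes(hap1, hap2):
    """Combine two haplotypes into unphased genotypes (allele order is lost)."""
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]

def half_identical(g1, g2):
    """True if two unphased genotypes share at least one allele."""
    return bool(set(g1) & set(g2))

def longest_hir(person_a, person_b):
    """Longest run of consecutive half-identical SNPs."""
    best = run = 0
    for g1, g2 in zip(person_a, person_b):
        run = run + 1 if half_identical(g1, g2) else 0
        best = max(best, run)
    return best

def longest_shared_run(hap_x, hap_y):
    """Longest run of SNPs on which two single haplotypes are identical."""
    best = run = 0
    for x, y in zip(hap_x, hap_y):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

# Hypothetical haplotypes: a1 equals b1 on the first four SNPs only,
# a2 equals b2 on the last four SNPs only.
a1, a2 = "01011100", "11000110"
b1, b2 = "01010011", "00110110"

hir = longest_hir(genotypes(a1, a2), genotypes(b1, b2))
true_share = max(longest_shared_run(x, y) for x, y in product((a1, a2), (b1, b2)))
print("half-identical run:", hir, "SNPs; longest truly shared haplotype:", true_share, "SNPs")
```

The unphased view stitches two short, unrelated shared pieces (one on each chromosome copy) into a single apparent segment twice as long.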
Given that some of these regions in chromosomes are known, at least 23andMe filters out the most common ones before a match can make it into DNA Relatives. This is an important distinction.
But what if we did not filter out these known regions? And what if we were not limited to match lists of only 1000 people?
For this we turn to FTDNA’s Family Finder, which is not limited in the number of matches, and includes HIRs as small as 1cM.
Here are the matching chromosome 1 segments from Family Tree DNA’s Family Finder, for the same person as in the 23andMe CoA test:
Fig. 3: Chromosome 1 matching segments from FTDNA Family Finder chromosome browser list.
First thing to note is that, while the FTDNA customer database for Family Finder is much smaller than 23andMe’s customer base, because FTDNA is reporting matching regions down to 1cM the Family Finder Chromosome Browser downloaded data set is quite a bit larger than the 23andMe CoA file. In figure 3, for chromosome 1, we are looking at over 1000 such matching regions.
It is quite clear that there are pile-up regions, sticking up like telephone poles in the forest of matches. These “matches” are occurring much, much more frequently than one would expect for a random distribution of chromosomes recombining in each generation.
Since the 23andMe CoA has a minimum cutoff of 5cM for a segment, we can filter the FTDNA data to include only those segments that likewise are at least 5cM in size. A plot of the FTDNA result for chromosome 1 ends up looking similar to the plot of matches from 23andMe:
Fig. 4: Chromosome 1 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.
Figure 4 is similar to figure 1, and we notice the pile-up near the 230-240Mbp region. However, what stands out in fig. 4 is the pile-up region around 180Mbp, which is not evident in fig. 1. We are starting to see regions of chromosome 1 where there is excessive “matching”.
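The 5cM filter used for figure 4 amounts to one pass over the downloaded segment table. The sketch below assumes a CSV with the hypothetical column names `CHROMOSOME`, `START`, `END` and `CENTIMORGANS`; the header names in an actual download may differ, and the rows here are invented.

```python
import csv
import io

# Invented rows standing in for a downloaded chromosome browser file.
SAMPLE = """CHROMOSOME,START,END,CENTIMORGANS
1,1000000,3500000,1.8
1,180000000,186000000,5.4
1,230000000,241000000,9.1
6,29000000,33000000,5.0
"""

def segments_at_least(rows, chromosome, min_cm=5.0):
    """Keep (start, end, cM) for one chromosome where cM >= min_cm."""
    return [
        (int(r["START"]), int(r["END"]), float(r["CENTIMORGANS"]))
        for r in rows
        if r["CHROMOSOME"] == chromosome and float(r["CENTIMORGANS"]) >= min_cm
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
chr1 = segments_at_least(rows, "1")
chr6 = segments_at_least(rows, "6")
print(chr1)
print(chr6)
```

To use it on a real file, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust the column names to match the header.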
Let’s repeat this exercise for chromosome 6, first presenting all the FTDNA FF segments on chr6:
Fig. 5: Chromosome 6 matching segments from FTDNA Family Finder chromosome browser list.
The massive pile-up around 30Mbp in figure 2 is now even more massive. The FTDNA Family Finder Chromosome Browser data includes 1467 segments for chr6, a great share of them in the pileup regions. Besides the largest such pileup we can visually identify 3 others.
As before, if we filter the FTDNA data for only those segments at least 5cM in size we get a much smaller set, and when plotted we get:
Fig. 6: Chromosome 6 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.
In the filtered data, with only segments greater than or equal to 5cM, only the major pile-up in chr6 remains, and is still imposing. There may also be a pileup near the start of chr6, but it may not be statistically significant.
In the above figures it becomes evident that there is excessive matching in particular regions of chromosomes. Furthermore, the commonality of these matches suggests that attempting to incorporate this data into family history research will lead to futility.
As humans we are all related to each other, the question being how long ago any two individuals’ MRCA lived. In the US, people of colonial descent are likely multiply related to each other within the past 20 generations (500 years), and many times at that. Indeed, those of colonial descent will have very many 10th cousins and closer living in the US; the numbers of 5th through 10th cousins are likely in the millions for the colonials. So, the existence of relatives, distant to very distant, is not the question.
☞ The key to doing genetic genealogy with autosomal DNA is not finding a match, but rather finding genealogically tractable matches.
“Genealogically tractable” here means that given a reasonably exhaustive search of existing records, two family trees are documented sufficiently to support the conclusions in the pedigree, and the name of a most recent common ancestor can be found. Thus ancestors who lived before the era of documentation trails are not genealogically tractable. To go back further in time than a document trail can allow means we are entering the territory of the “ethnicity” or “ancestry” estimates provided by some companies. For many people in the world there are no records before the 19th or 18th centuries, while in a small set of locales documentation may go back to before the era of European colonization, but in these cases the records rapidly collapse to the nobility and royalty.
This is important because,
☞ Given lax enough matching criteria, one can have a DNA match with a person with whom your shared MRCA existed before records were kept for that MRCA.
Thus our goal is to find matches with whom we can possibly find the MRCA. A “match” that is based on population-common fragments of chromosomes is unlikely to be resolvable by genealogical methods, as such chromosome fragments would have been found in a large number of the contemporaries of our own pedigree ancestors. This is all the more true as we move back in time, when people found their mates nearby and not uncommonly married their cousins.
As noted above, 23andMe filters out some common regions (though they have not published the details) before a “match” can make it onto the DNA Relatives list.
In the fall of 2014 AncestryDNA implemented their own means to do something similar, presented by them with much fanfare (and here). AncestryDNA’s new matching system has caused quite a bit of a stir, in part because customers lost some to many of their old matches. The tests I manage lost from 40% to over 90% of their previous matches. Part of the reason for this is AncestryDNA’s new “Timber” algorithm, which explicitly attempts to deal with the phenomenon of the matching of overly-common chromosome regions, the existence of which is undeniable, as we see in the plots in this post.
Without addressing the matching to overly-common DNA we end up with very large lists of matches with whom we will never find the MRCA, which was the case previously with AncestryDNA and is possibly true still at FTDNA.
It may turn out that the Timber algorithm is too aggressive, and some genealogically informative matches are being lost in the new AncestryDNA matching system. This may happen if an inherited region of a chromosome is bisected by many population-wide common segments, and, post-Timber filtering, this inherited region is broken into remaining segments too small to meet the minimum threshold to declare a “match”. However, for AncestryDNA to correct this will probably require developing even more sophisticated filtering algorithms, about which I may write later in another post.
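The failure mode just described can be illustrated with a toy model. This is not AncestryDNA's Timber algorithm, whose internals are not described in this post; it is only a sketch that assumes the simplest possible filter, one that cuts fixed "too-common" intervals out of a segment and then applies a 7cM minimum to what is left. All coordinates are hypothetical positions in cM.

```python
def subtract_masks(segment, masks):
    """Cut masked intervals out of a segment; return the remaining pieces."""
    start, end = segment
    pieces = []
    cursor = start
    for m_start, m_end in sorted(masks):
        if m_end <= cursor or m_start >= end:
            continue                      # mask does not touch what is left
        if m_start > cursor:
            pieces.append((cursor, m_start))
        cursor = max(cursor, m_end)       # skip past the masked interval
    if cursor < end:
        pieces.append((cursor, end))
    return pieces

MIN_CM = 7  # commonly cited threshold, used here only for illustration

inherited = (100, 112)                 # a hypothetical 12cM inherited region
common = [(103, 104), (108, 109)]      # two hypothetical 1cM pile-up regions

pieces = subtract_masks(inherited, common)
longest = max(end - start for start, end in pieces)
print(pieces, "longest remaining piece:", longest, "cM")
print("still a match" if longest >= MIN_CM else "match lost")
```

In this toy case a 12cM region loses only 2cM to the masks, but no remaining piece reaches 7cM, so the match would be dropped.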
Having established the requirement for filtering out pileup regions, I want to stress again that these regions are not just small, 1 to 2cM, portions of chromosomes. Some will be much larger.
In general in regards to matching of genotyped individuals, as a standard, or best practice, matches ought not be declared on small (less than 7cM) regions of unphased genotype datasets, a subject discussed at length by various authors and bloggers and which I won’t repeat here. Phased genotypes can be matched on smaller regions with greater confidence, though I would not do so for regions less than about 5cM. (And if you’re interested in current academic discussions of identical-by-descent detection try here, here, here, and here). Unfortunately FTDNA insists on reporting small segments on their unphased genotype data sets, which misleads the customer in regards to how significantly they match other people.
Given all of that, we conclude this:
☞ Even after filtering out HIRs smaller than 5cM, DNA matching services such as FTDNA or 23andMe should filter out HIRs greater than 5cM that are appearing too commonly in the population, so that the end user receives a genealogically tractable match list.
Ancillary to the above discussion and needing to be addressed especially in light of how FTDNA’s Family Finder reports matches, and since I mentioned it above, I want to touch briefly on the concept of “size” when it comes to autosomal DNA segments. This is important because I often see commenters in forums referring to a shared HIR as being ‘n.nn’ centiMorgans, sometimes to those 2 decimal places, probably because they check their FTDNA Chromosome Browser results which report matching in centiMorgans to two decimal places.
FTDNA should not give HIR lengths, individual or in total, in cMs to two decimal places!
The diagrams in this post use base pairs (or millions of base pairs) as the coordinate to place the matching HIRs over the chromosome length. In genetics, the concept of linkage disequilibrium mapping brings about the need to map the physical molecular position to a frequency space, with the frequency unit being a Morgan but in practice the centiMorgan is used.
Each human chromosome has by experimental procedure been “mapped”, associating a physical coordinate onto a scale measured in centiMorgans.
What many enthusiasts in genetic genealogy may not know is that the thought processes and concepts associated with these linkage maps have to deal with many issues; some of these may impact our understanding of the population-wide matching segments which are the topic of this post.
Referencing “cool spots” and “hot spots” can imply that coordinates from a linkage map be taken uncritically; however, the very large sets of genotyped samples being collected by 23andMe and AncestryDNA may be used in the future to create a better understanding of recombination probabilities along the chromosomes.
There is much variability in the end products of meiosis, and in regards to recombination and creating maps of chromosomes the following issues are topics of discussion:
- age (e.g., here);
- gender (if comparing a child to a parent);
- unique genetics (e.g., here).
The measurements for segment lengths given by 23andMe or FTDNA or gedmatch.com are averages, specifically gender averages.
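To see why a reported length is an averaged estimate, consider how a physical segment is converted to centiMorgans: its endpoints are looked up on a genetic map, usually by interpolating between mapped markers. The sketch below uses two invented three-point maps (labelled female and male) and hypothetical function names. Real maps have many thousands of markers, and the numbers here are for illustration only.

```python
from bisect import bisect_right

def cm_at(position, map_points):
    """Linearly interpolate the cM value at a base-pair position.

    map_points is a list of (base_pair, centimorgan) sorted by base pair.
    Positions outside the map are clamped to its ends.
    """
    positions = [bp for bp, _ in map_points]
    i = bisect_right(positions, position)
    if i == 0:
        return map_points[0][1]
    if i == len(map_points):
        return map_points[-1][1]
    (bp0, cm0), (bp1, cm1) = map_points[i - 1], map_points[i]
    return cm0 + (cm1 - cm0) * (position - bp0) / (bp1 - bp0)

def segment_cm(start, end, map_points):
    """Genetic length of a physical segment under one map."""
    return cm_at(end, map_points) - cm_at(start, map_points)

# Hypothetical maps: same physical markers, different recombination rates.
FEMALE = [(0, 0.0), (10_000_000, 16.0), (20_000_000, 40.0)]
MALE = [(0, 0.0), (10_000_000, 8.0), (20_000_000, 20.0)]

start, end = 5_000_000, 15_000_000
female_cm = segment_cm(start, end, FEMALE)
male_cm = segment_cm(start, end, MALE)
average_cm = (female_cm + male_cm) / 2
print(female_cm, male_cm, average_cm, "reported as about", round(average_cm), "cM")
```

The same 10 Mbp stretch comes out at different lengths depending on which map is applied, so a single averaged figure quoted to two decimal places suggests more precision than it has.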
For further (and very technical) reading on this subject:
Identifying recombination hotspots using population genetic data
Genetic Analysis of Variation in Human Meiotic Recombination
Enhanced genetic maps from family-based disease studies: population-specific comparisons
Variation in Human Recombination Rates and Its Genetic Determinants
Genetic Control of Hotspots
☞ The bottom line about segment lengths is this: Round the length of chromosome segments to whole numbers, and remember additionally that the size of a segment is only an estimate and is accompanied by non-trivial error bars.