Chromosome Pile-Ups in Genetic Genealogy: Examples from 23andMe and FTDNA

 

 

What we aim to accomplish:

  1. Dissuade people from chasing after small autosomal DNA segments by
    1. demonstrating too-common matching in particular regions of chromosomes;
    2. emphasizing the pitfalls of half-identical matching;
    3. briefly reviewing linkage measurements, emphasizing chromosome region lengths (in centiMorgans) as approximations.
  2. Illustrate the need that AncestryDNA had for redesigning their matching system to account for population-wide shared chromosome regions. Whether or not the current AncestryDNA Timber algorithm is the best possible is not the issue here; rather, something like it is needed, not just at AncestryDNA but at all the autosomal DNA services.

 

Warning: Genetics in general is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve.   The following will be a bit technical.

 

 

On various forums and blogs there is much discussion over the use of autosomal DNA matching for genetic genealogy. This field is emerging as a popular approach to extend what we know as the practice of “genealogy”, but there is much confusion and even angst running through the community of consumers of genetic genealogy tests over exactly what is meant by a “match”, and why companies decide on whom to exclude from match lists.

A common belief or assertion by some who are interested in genetic genealogy is that relatively small regions (below roughly 7 centiMorgans) of a chromosome are important for genealogical research.   I won’t quote from various forum posts because I do not want individuals to think that I am picking on them, but whether we look at a Facebook group (e.g., the ISOGG group), the forums on the company websites of DTC DNA testing companies, or in general forums about genealogy and ancestry, there seems to be no end of a stream of posts of people trying to make small shared regions mean something.

The problem is these small regions often don’t mean what the posters think they mean.

Though we all like to think of ourselves as unique, here I need to emphasize:

☞ We humans are unique assemblages of very common small bits and pieces of chromosomes (DNA.)

Furthermore, there are some bloggers who are attempting to be highly visible in their attack on AncestryDNA, who late in 2014 revamped their matching algorithm, to the great dismay of some. This attack on AncestryDNA has been on Facebook groups as well as forums. Whatever one thinks of the company itself, the need for AncestryDNA to incorporate into their matching algorithm a method to deal with too-commonly matched regions of chromosomes is one of the reasons for writing this post. Regardless of the company one uses to do matching for the purposes of genetic genealogy, that company will improve their product if they can find a way to deal with false positives, and currently AncestryDNA does that, 23andMe does that to some extent, and FTDNA does not (to the best of my knowledge.)

One of the issues that drove AncestryDNA to redesign their matching algorithm is closely tied to the problem of matching on small segments.

Let’s take a look at some data – all the data in this post is from a test I manage at 23andMe, and a profile I manage at FTDNA, from Family Finder, which is a transfer of the same 23andMe (V3) raw data of the same person tested at 23andMe. If you have tests at these companies you can download your own data and see for yourself where pile-ups are occurring.

23andMe customers have a tool available to them called “Countries of Ancestry” (CoA, formerly known as “Ancestry Finder”), which is a misnomer. It is a compilation of matches (with match data), composed in part from entries from the person’s DNA Relatives and others not in a person’s DNA Relatives list. Only 23andMe customers who have filled out the ancestry survey will find themselves in somebody else’s CoA list. Unfortunately the size of the CoA list is, similar to DNA Relatives, limited in the number of people allowed on the list (approximately a thousand.) Nevertheless, CoA is a list of over 1000 matching segments (as some person matches will be multi-segment matches) and thus useful for our purposes.

We can use the CoA data to plot out the “segments” (technically, half-identical regions) for each chromosome, stacking the segments to show overlaps and total coverage of the chromosome. Here is a plot for chromosome 1 showing 95 matching (to the test used in this post) segments as stacked rectangles:

 

Chr1 Segments from Countries of Ancestry

Fig. 1: Chromosome 1 matching segments for our test, as rectangles, from 23andMe Countries of Ancestry list.

 

 

(The red line is a count of overlaps, thus indicating shared regions for that portion of the chromosome, but since the segment blocks are offset vertically by slim white spaces to make the segments visibly distinguishable from each other the red line ends up being on a different vertical scale than the blocks.)
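For readers who want to compute an overlap count like the red line from their own downloaded segment list, here is a minimal sketch. It assumes each segment has already been reduced to a (start, end) pair in base pairs on a single chromosome; the function name and the sample coordinates are made up for illustration and are not taken from the 23andMe or FTDNA files.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_bp, end_bp) on one chromosome


def overlap_depth(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Return (start, end, depth) for every stretch covered by at least one segment."""
    events = []
    for start, end in segments:
        events.append((start, 1))   # a segment opens here
        events.append((end, -1))    # a segment closes here
    # Sorting tuples puts closings (-1) before openings (+1) at the same position.
    events.sort()

    intervals = []
    depth = 0
    prev_pos = None
    for pos, delta in events:
        if prev_pos is not None and pos > prev_pos and depth > 0:
            intervals.append((prev_pos, pos, depth))
        depth += delta
        prev_pos = pos
    return intervals


# Hypothetical segments, not real match data.
example = [(0, 100), (50, 150), (60, 80)]
depths = overlap_depth(example)
# depths == [(0, 50, 1), (50, 60, 2), (60, 80, 3), (80, 100, 2), (100, 150, 1)]
peak = max(depths, key=lambda interval: interval[2])  # (60, 80, 3)
```

The stretch with the highest depth is where a pile-up would show up as a peak in the plot.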

We can see from figure 1 that chromosome 1 is nearly completely covered by matches (from the 23andMe Countries of Ancestry list). We also notice that while the matching segments are not evenly distributed along the chromosome, the distribution is not so lopsided as to demonstrate a significant “cold region”, though the peak around 60 Mbp (mega base-pairs) followed by the trough is starting to look suspicious. However, around 240 Mbp there appears to be the beginning of a “pile-up” region.

Let’s look now at another chromosome, 6, which is known to have some regions that are troublesome:

 

Chr6 from Countries of Ancestry

Fig. 2: Chromosome 6 matching segments from 23andMe Countries of Ancestry match list.

 

Looking at figure 2 it is readily evident that there is an overwhelming number of chr6 matches in a single region of the chromosome, around 30 Mbp. Chromosome 6 is noted for its HLA regions, parts of the chromosome which house genes vital to the human immune system. The segment pileups in these regions are suggestive, and may demonstrate a non-random phenomenon, probably a selection event in human evolution, perhaps even during historic (i.e., since the invention of writing) times.

Also noticeable is a desert of matching around the 100 Mbp region.

This lopsided distribution of matching segments should give us pause: what does it mean for our genealogy efforts “to match” in these cases?

What we are seeing in the figures in this post is how commonly distributed small fragments of chromosomes (or more precisely, sets of alleles, or haplotypes) can be in our society. The 23andMe CoA file only has 1000 people, out of the entire 23andMe database of 800,000 customers. What if everyone in the US were tested, and the CoA list not capped? We likely would see hundreds of thousands of “matches” in this region of chr6. And it should be noted that in figure 2 some of those segments in the major pile-up areas are over 10cM in length (according to current linkage maps; more on that below.)

In doing genealogy, trying to make sense of these kinds of matches (in overly-common regions) is an exercise in futility. Using the oft-stated standard of a minimum size of 7cM for a match that is likely to be identical-by-descent, trying to identify the most recent common ancestor (MRCA) with a match such as in these chr6 pile-up regions is not tractable, even if the segment surpasses the 7cM threshold; the MRCA could be dozens of generations ago. One could propose that if a large enough sample of our population tested, and if select buried individuals could be exhumed and tested, then we could recreate partial genotypes of the individuals of entire communities of our ancestors from centuries ago. If this could be done then we could determine how common these shared chromosome regions were in any given community of our ancestors, and perhaps trace the rapid growth of particular families or clans. Testing on such a scale is unlikely in the near future, however, and such an effort will likely face other hurdles.

Additionally, this region in chr6 in particular is demonstrating the problem of half-identical matching. The massive pileup is likely not due to a single physical 7cM to 10cM strand of chr6, but to the superpositioning of several smaller (say 0.5cM to 1cM) regions, haplotypes found on chr6 which are very common throughout the European population. By chance, these small fragments will superimpose (given the two copies of chr6 we all carry) to present these larger half-identical regions (HIR) which make the matching threshold cutoffs (say 7cM.)
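To make the half-identical problem concrete, here is a toy sketch. It assumes unphased genotypes written as two-letter strings per SNP and treats two people as half-identical at a SNP when they share at least one allele; real matching services also tolerate some mismatches and no-calls, which this sketch ignores. All names and genotypes are invented for illustration.

```python
from typing import List


def shares_allele(g1: str, g2: str) -> bool:
    """True if two unphased genotypes (e.g. 'AG' and 'GG') have an allele in common."""
    return bool(set(g1) & set(g2))


def longest_half_identical_run(person1: List[str], person2: List[str]) -> int:
    """Length, in SNPs, of the longest run where every SNP shares at least one allele."""
    best = run = 0
    for g1, g2 in zip(person1, person2):
        run = run + 1 if shares_allele(g1, g2) else 0
        best = max(best, run)
    return best


def longest_identical_run(hap1: str, hap2: str) -> int:
    """Length, in SNPs, of the longest run where two phased haplotypes are identical."""
    best = run = 0
    for a, b in zip(hap1, hap2):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


# Person A carries the haplotypes AAAA and GGGG, so A is heterozygous at every SNP.
a_unphased = ["AG", "AG", "AG", "AG"]
# Person B carries two copies of the haplotype AAGG.
b_unphased = ["AA", "AA", "GG", "GG"]

unphased_run = longest_half_identical_run(a_unphased, b_unphased)  # 4 SNPs
phased_run = max(longest_identical_run("AAGG", "AAAA"),
                 longest_identical_run("AAGG", "GGGG"))             # 2 SNPs
```

The unphased comparison reports one run across all four SNPs, while no single haplotype of A is shared with B for more than two. Scaled up, this is how short common haplotypes can superimpose into an HIR that clears a threshold.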

Given that some of these regions in chromosomes are known, at least 23andMe filters out the most common ones before a match can make it into DNA Relatives.   This is an important distinction.

But what if we did not filter out these known regions?  And what if we were not limited to match lists of only 1000 people?

For this we turn to FTDNA’s Family Finder, which is not limited in the number of matches, and includes HIRs as small as 1cM.

Here are the matching chromosome 1 segments from Family Tree DNA’s Family Finder, for the same person as in the 23andMe CoA test:

FTDNA chr1 all segments

Fig. 3: Chromosome 1 matching segments from FTDNA Family Finder chromosome browser list.

 

First thing to note is that, while the FTDNA customer database for Family Finder is much smaller than 23andMe’s customer base, because FTDNA is reporting matching regions down to 1cM the Family Finder Chromosome Browser downloaded data set is quite a bit larger than the 23andMe CoA file. In figure 3, for chromosome 1, we are looking at over 1000 such matching regions.

It is quite clear that there are pile-up regions, sticking up like telephone poles in the forest of matches. These “matches” are occurring much, much more frequently than one would expect for a random distribution of chromosomes recombining in each generation.

Since the 23andMe CoA has a minimum cutoff of 5cM for a segment, we can filter the FTDNA data to include only those segments that likewise are at least 5cM in size. A plot of the FTDNA result for chromosome 1 ends up looking similar to the plot of matches from 23andMe:


Fig. 4: Chromosome 1 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

Figure 4 is similar to figure 1, and we notice the pile-up near the 230-240Mbp region. However, what stands out in fig. 4 is the pile-up region around 180Mbp, which is not evident in fig. 1. We are starting to see regions of chromosome 1 where there is excessive “matching”.

Let’s repeat this exercise for chromosome 6, first presenting all the FTDNA FF segments on chr6:

FTDNA chr6 all FTDNA segments

Fig. 5: Chromosome 6 matching segments from FTDNA Family Finder chromosome browser list.

 

The massive pile-up around 30Mbp in figure 2 is now even more massive. The FTDNA Family Finder Chromosome Browser data includes 1467 segments for chr6, a great share of them in the pileup regions. Besides the largest such pileup we can visually identify 3 others.

As before, if we filter the FTDNA data for only those segments at least 5cM in size we get a much smaller set, and when plotted we get:

FTDNA chr6 5cM floor

Fig. 6: Chromosome 6 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

In the filtered data, with only segments greater than or equal to 5cM, only the major pile-up in chr6 remains, and is still imposing. There may be a pileup near the start of chr6 too, but it may not be statistically significant.
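The 5cM floor applied for figures 4 and 6 is a simple filter. Here is a minimal sketch, assuming the downloaded chromosome browser rows have been loaded into records with a chromosome label, start and end positions, and a length in cM; the class and field names are hypothetical and do not mirror the actual FTDNA column headers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MatchSegment:
    match_name: str      # hypothetical field names, not FTDNA's own headers
    chromosome: str
    start_bp: int
    end_bp: int
    length_cm: float


def filter_segments(segments: Iterable[MatchSegment],
                    chromosome: str,
                    min_cm: float = 5.0) -> List[MatchSegment]:
    """Keep segments on one chromosome that are at least min_cm long."""
    return [s for s in segments
            if s.chromosome == chromosome and s.length_cm >= min_cm]


# Invented rows for illustration only.
rows = [
    MatchSegment("match 1", "6", 29_000_000, 33_000_000, 7.2),
    MatchSegment("match 2", "6", 30_500_000, 31_200_000, 1.4),
    MatchSegment("match 3", "1", 238_000_000, 243_000_000, 9.8),
]
chr6_at_least_5cm = filter_segments(rows, "6")  # keeps only "match 1"
```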

In the above figures it becomes evident that there is excessive matching in particular regions of chromosomes. Furthermore, the commonality of these matches suggests that attempting to incorporate this data into family history research will lead to futility.

As humans we are all related to each other, the question being how long ago any two individuals’ MRCA lived. In the US, people of colonial descent are likely multiply related to each other within the past 20 generations (500 years), and many times at that. Indeed, those of colonial descent will have very many 10th cousins and closer living in the US; the numbers of 5th through 10th cousins are likely in the millions for the colonials. So, the existence of relatives, distant to very distant, is not the question.
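The pedigree arithmetic behind this is simple doubling. The sketch below uses the same 25 years per generation implied by “20 generations (500 years)”; the function names are made up for illustration.

```python
def ancestor_slots(generations_back: int) -> int:
    """Number of positions in a pedigree at a given generation (2 parents, 4 grandparents, ...)."""
    return 2 ** generations_back


def years_back(generations_back: int, years_per_generation: int = 25) -> int:
    """Rough time depth, using the 25-year generation implied in the text."""
    return generations_back * years_per_generation


slots_10 = ancestor_slots(10)   # 1,024 positions about 250 years back
slots_20 = ancestor_slots(20)   # 1,048,576 positions about 500 years back
```

With over a million pedigree positions at 20 generations, the same ancestors necessarily fill many positions in any modest-sized founding population, which is why people of colonial descent are related to one another many times over.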

☞ The key to doing genetic genealogy with autosomal DNA is not finding a match, but rather finding genealogically tractable matches.

“Genealogically tractable” here means that given a reasonably exhaustive search of existing records, two family trees are documented sufficiently to support the conclusions in the pedigree, and the name of a most recent common ancestor can be found. Thus ancestors who lived before the era of documentation trails are not genealogically tractable. To go back further in time than a document trail can allow means we are entering the territory of the “ethnicity” or “ancestry” estimates provided by some companies. For many people in the world there are no records before the 19th or 18th centuries, while in a small set of locales documentation may go back to before the era of European colonization, but in these cases the records rapidly collapse to the nobility and royalty.

This is important because,

☞ Given lax enough matching criteria, one can have a DNA match with a person with whom your shared MRCA existed before records were kept for that MRCA.

Thus our goal is to find matches with whom we can possibly find the MRCA. A “match” that is based on population-common fragments of chromosomes is unlikely to be resolvable by genealogical methods, as such chromosome fragments would have been found in a large number of the contemporaries of our own pedigree ancestors. This is all the more true as we move back in time, when people found their mates nearby and not uncommonly married their cousins.

As noted above, 23andMe filters out some common regions (though they have not published the details) before a “match” can make it onto the DNA Relatives list.

In the fall of 2014 AncestryDNA implemented their own means to do something likewise, presented by them with much fanfare (and here.) AncestryDNA’s new matching system has caused quite a bit of a stir, in part because customers lost some to many of their old matches. The tests I manage lost from 40% to over 90% of their previous matches. Part of the reason for this is AncestryDNA’s new “Timber” algorithm, which explicitly attempts to deal with the phenomenon of the matching of overly-common chromosome regions, the existence of which is undeniable, as we see in the plots in this post.

Without addressing the matching to overly-common DNA we end up with very large lists of matches with whom we will never find the MRCA, which was the case previously with AncestryDNA and is possibly true still at FTDNA.

It may turn out that the Timber algorithm is too aggressive, and some genealogically informative matches are being lost in the new AncestryDNA matching system. This may happen if an inherited region of a chromosome is bisected by many population-wide common segments, and, post-Timber filtering, this inherited region is broken into remaining segments too small to make the minimum threshold to declare a “match”. However, for AncestryDNA to correct this will probably require developing even more sophisticated filtering algorithms, about which I may write later in another post.

Having established the requirement for filtering out pileup regions, I want to stress again that these regions are not just small, 1 to 2cM, portions of chromosomes. Some will be much larger.

In general in regards to matching of genotyped individuals, as a standard, or best practice, matches ought not be declared on small (less than 7cM) regions of unphased genotype datasets, a subject discussed at length by various authors and bloggers and which I won’t repeat here. Phased genotypes can be matched on smaller regions with greater confidence, though I would not do so for regions less than about 5cM. (And if you’re interested in current academic discussions of identical-by-descent detection try here, here, here, and here). Unfortunately FTDNA insists on reporting small segments on their unphased genotype data sets, which misleads the customer in regards to how significantly they match other people.

Given all of that, we conclude this:

☞ Even after filtering out HIRs smaller than 5cM, DNA matching services such as FTDNA or 23andMe should filter out HIRs greater than 5cM that are appearing too commonly in the population, so that the end user receives a genealogically tractable match list.
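As a rough illustration of what such a filter could look like on one's own match list, here is a sketch that sets aside any segment whose midpoint is covered by more than a chosen number of segments. This is a made-up heuristic for personal exploration, not AncestryDNA's Timber algorithm and not 23andMe's filter; the `max_depth` cutoff is an arbitrary assumption that would need tuning against real data.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_bp, end_bp) on one chromosome


def depth_at(position: int, segments: List[Segment]) -> int:
    """Count the segments that cover a given base-pair position."""
    return sum(1 for start, end in segments if start <= position < end)


def split_by_pileup(segments: List[Segment],
                    max_depth: int) -> Tuple[List[Segment], List[Segment]]:
    """Return (kept, flagged); a segment is flagged when the depth at its midpoint exceeds max_depth."""
    kept, flagged = [], []
    for start, end in segments:
        midpoint = (start + end) // 2
        if depth_at(midpoint, segments) > max_depth:
            flagged.append((start, end))
        else:
            kept.append((start, end))
    return kept, flagged


# Invented coordinates: four overlapping segments and one isolated segment.
example = [(0, 100), (10, 110), (20, 120), (30, 130), (500, 600)]
kept, flagged = split_by_pileup(example, max_depth=3)
# kept == [(500, 600)]; the four overlapping segments are flagged
```

Flagged segments are not necessarily worthless, but they are the ones to treat with the most suspicion before spending research time on them.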
 ❦

 

Ancillary to the above discussion and needing to be addressed especially in light of how FTDNA’s Family Finder reports matches, and since I mentioned it above, I want to touch briefly on the concept of “size” when it comes to autosomal DNA segments. This is important because I often see commenters in forums referring to a shared HIR as being ‘n.nn’ centiMorgans, sometimes to those 2 decimal places, probably because they check their FTDNA Chromosome Browser results which report matching in centiMorgans to two decimal places.

FTDNA should not give HIR lengths, individual or in total, in cMs to two decimal places!

The diagrams in this post use base pairs (or millions of base pairs) as the coordinate to place the matching HIRs over the chromosome length. In genetics, linkage mapping brings about the need to map the physical molecular position onto a recombination-frequency scale, whose unit is the Morgan, though in practice the centiMorgan is used.

Each human chromosome has by experimental procedure been “mapped”, associating a physical coordinate onto a scale measured in centiMorgans.
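As a sketch of how a physical position is converted to centiMorgans, the usual approach is to interpolate between the marker positions of a genetic map. The map values below are invented to show the effect and are not taken from any published map; the function names are likewise hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def bp_to_cm(position_bp: int, map_bp: Sequence[int], map_cm: Sequence[float]) -> float:
    """Linearly interpolate a genetic-map position (cM) for a physical position (bp)."""
    if position_bp <= map_bp[0]:
        return map_cm[0]
    if position_bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, position_bp)   # index of the first marker beyond the position
    x0, x1 = map_bp[i - 1], map_bp[i]
    y0, y1 = map_cm[i - 1], map_cm[i]
    return y0 + (y1 - y0) * (position_bp - x0) / (x1 - x0)


def segment_length_cm(start_bp: int, end_bp: int,
                      map_bp: Sequence[int], map_cm: Sequence[float]) -> float:
    """Genetic length of a segment as the difference of its interpolated end points."""
    return bp_to_cm(end_bp, map_bp, map_cm) - bp_to_cm(start_bp, map_bp, map_cm)


# Made-up map: the stretch from 10 to 20 Mbp recombines far more often than its neighbors.
MAP_BP = [0, 10_000_000, 20_000_000, 30_000_000]
MAP_CM = [0.0, 5.0, 25.0, 30.0]

quiet = segment_length_cm(0, 10_000_000, MAP_BP, MAP_CM)            # 5.0 cM
mixed = segment_length_cm(5_000_000, 15_000_000, MAP_BP, MAP_CM)    # 12.5 cM
```

Both example segments span 10 Mbp, yet their lengths in cM differ by more than a factor of two. The cM value depends entirely on the map used, which is why it should be read as an estimate.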

What many enthusiasts in genetic genealogy may not know is that the thought processes and concepts associated with these linkage maps have to deal with many issues; some of these may impact our understanding of the population-wide matching segments which are the topic of this post.

Referencing “cool spots” and “hot spots” can imply that coordinates from a linkage map be taken uncritically; however, the very large sets of genotyped samples being collected by 23andMe and AncestryDNA may be used in the future to create a better understanding of recombination probabilities along the chromosomes.

There is much variability in the end products of meiosis, and in regards to recombination and creating maps of chromosomes the following issues are topics of discussion:

  • age (e.g., here);
  • ethnicity;
  • gender (if comparing a child to a parent);
  • unique genetics (e.g., here).

The measurements for segment lengths given by 23andMe or FTDNA or gedmatch.com are averages, specifically gender averages.

For further (and very technical) reading on this subject:

Identifying recombination hotspots using population genetic data

Genetic Analysis of Variation in Human Meiotic Recombination

Enhanced genetic maps from family-based disease studies: population-specific comparisons

Variation in Human Recombination Rates and Its Genetic Determinants

Genetic Control of Hotspots

 

☞ The bottom line about segment lengths is this: Round the length of chromosome segments to whole numbers, and remember additionally that the size of a segment is only an estimate and is accompanied by non-trivial error bars.

 

 ❦ ❦

 

 


19 Comments

  1. I have a little more background about the HLA region in my JoGG column from 2010 “Up Hill and Down Dale in the Genomic Landscape: The Odd Distribution of Matching Segments” http://www.jogg.info/62/files/SatiableCuriosity.pdf. Some of the haplotypes are very common (8.7% for the single most common one in one study) and others are very rare. AncestryDNA has stated that they look at the distribution of segments on an individual basis, which would be even more refined than 23andMe’s list of regions with “excess IBD.”

  2. A cutoff of 5cM is pretty generous, isn’t it?

    How much does the issue of these “pile ups” become moot when the cutoff is raised to genealogically tractable ranges? How many genealogically tractable HIRs are lost when the cutoff is raised to 7 cM?

    What happens when you raise the cutoff to 10 cM?

    I’m not suggesting that segments below the 10 cM threshold are never useful or genealogically tractable, but I think that raising the threshold higher than 5 cM could mitigate the pile up issue significantly without loss of a significant number of genealogically tractable HIRs.

    • There will always be trade-offs we have to accept in the balance between false positives and false negatives.

      To me the most important thing is return on investment. I already have so many leads generated by autosomal matches > 20cM that I don’t have time to look at all the smaller matches. In certain cases where I am seeing an unusual number of matches on certain names or locales maybe I will look at the smaller matches.

      I agree that 10cM is a more useful bottom floor than 5cM in order to manage the lists of thousands of DNA cousins we who test at all the sites end up receiving.

      Still, a few of those pileups on chr6 even eke over the 10cM bar.

      • I’m sure that a 10 cM threshold wouldn’t be a panacea, but it’s an incredibly simple solution fraught with far fewer potential pitfalls than Ancestry’s “ad hoc” fixes. For all its technical sophistication, Timber seems to be a blunt and quirky instrument.

        Ideally, I can imagine a variable threshold that changes across the genome and varies with ethnicity, but I assume the science isn’t quite up to that speed yet.

  3. Rereading this today with coffee on board and will continue to read it until I understand it better. Thank you for sharing this fine blog and examples.

  4. I am emailing this link to the next “match” who emails me and gets upset when I try to explain there is insufficient data for determining a common ancestor! Great article!

    I wrote a piece of code to arrange my own matches along matching segments and showing the overlaps, but the graphical representation of the problem here makes it clear.

  5. Thank you for an excellent blog post which I think should be required reading for anyone who has taken an autosomal DNA test.

    In case it’s of interest there is some information on “pile-up regions” in the ISOGG Wiki article on IBD in the section on “Excess IBD sharing”:

    http://www.isogg.org/wiki/Identical_by_descent

    Your blog post has now been included in that article.

  6. Excellent post. Thank you!
    I find myself wondering if I should do something with the FTDNA matches in my spreadsheet that seem now suspicious to me? They are the ones that don’t have any 23andMe matches in the same region of a chromosome.

    • On FTDNA I never look at the segments less than 5cM if I’m looking at the chromosome browser.

      One thought on not having any matches at 23andMe in the parts of chromosomes that you apparently have at FTDNA: not everyone at FTDNA is at 23andMe, and 23andMe limits DNA Relatives to the top 1000 matches (unless you invite people to share at the bottom of your list, then you can get it to grow over 1000.) It’s quite possible that the 23andMe DNAR list size limit is keeping you from seeing distant matches which concur with the ones you have at FTDNA.

      • I have 17 consecutive FTDNA matches on CHR 6 between 78139397 and 90717040. The sizes range from 7.70 – 8.75 cM. At least they all match each other. Just so odd that there aren’t any 23andMe matches interspersed, as there are everywhere else.

  7. The process that AncestryDNA is using seems to be eliminating segments larger than 20 cMs. My mother lost a confirmed 3rd cousin after the pile-ups were eliminated. I’m trying to persuade her to compare with me at GEDmatch. I believe her mother tested and currently is an Extremely High Confidence match. A cousin of mine lost 8 95% confidence matches after the pile-ups were eliminated. I believe these matches likely share over 20 cM segments? Hopefully I’ll find out.

  8. It’s a little misleading to say that FTDNA includes HIRs down to 1 cM, given that two people must have a match of at least 7 cM before they are counted as “matching.” True, one can view smaller segments in the chromosome browser, but only for “matches” based on sharing at least one segment of at least 7 cM (plus enough total DNA in common). Like others, I leave the chromosome browser set at 5 cM, so that I only see segments on which the matches are based. What interests me is that all of my noticeable pile-ups are for people of Ashkenazi Jewish descent (I have one Jewish 2G grandparent).

  9. To me, the question of what is of genealogical relevance is a personal thing, not to be decided for me by a genetic matching company. It may be good to know where the stacking is likely to occur for most people of a certain ethnicity, but I want to be able to determine its own relevance to me.

    I have not seen in this discussion anything about the probability of false negative matches in recent generations, but elsewhere I had read that the probability of having significant matches in the 5th generation back [to 4th cousins, and 3rd great grandparents] was less than 50%.

    I trace all of my lines back to that point with fairly well documented history, so the value of using DNA to help with that is minimal. What I want to find using DNA is hints about lesser known connections in Colonial America 7 to 10 generations [or more] back, and in Denmark, because of the excellent historical records there, perhaps to as much as 16 generations back.

    Coming back to the probability of cM length matching accurately the number of generations distant our MRCA is/are, does that not vary a good deal?

  10. Does anyone have a tool or technique for creating these pile-up graphs from chromosome browser data? I think I’ve found a pile-up region in my Chr 7 matches and I’d like to make a graph of it.

    • One can use a variety of analysis tools to create graphs. Also, some software for genetic genealogy, I believe GenomeMate is one, can visually line up segments to give one a visual clue of regions of too frequent “matching”.

  11. Clearly there is something wrong with a matching algorithm that shows “3,468 centimorgans shared across 54 DNA segments” and “3,455 centimorgans shared across 60 DNA segments” for two parents and their child.

    Yet that’s exactly what Ancestry claims as what my wife and I share with our daughter. Never mind that this is over twice the number of actual segments shared (exactly 23, in each case). At least my sharing with myself (v1 versus v2) is a more reasonable “3,489 centimorgans shared across 25 DNA segments.” Still not quite right, but only off by a bit.

    However, of far more significance than whether or not any “piling up” may have occurred in Ancestry’s refusal to provide any sort of chromosome browser. In that case, it would be possible to see and judge for ourselves which matches seem most worthwhile to follow up on — and not solely based on the total amount of matching, or number of segments.

    • AncestryDNA reports shared chromosome regions that are believed to be of *genealogical value*. They are not statements of whether or not a molecule was inherited in whole.

      Regarding parent-child reported number of segments – the number of segments is probably increased by small phasing errors, which are inevitable. No-calls are a common problem.

      Even at 23andMe I have more than 23 segments with my mother, because on a couple of the chromosomes there were enough no-calls to split up the chromosome-long run of half-identical matching.

  12. (para. 4, “… [is] Ancestry’s refusal …”

  13. Yikes! I can’t count, as well as can’t write. I meant para. 3, not 4.
