Of Mirroring And Shared Ancestors: Exploiting AncestryDNA To Find Biological Families


What we aim to accomplish in this post:

  1. Provide definitions and an overview of methods for finding biological families of those who don’t know one or both biological parents, using tools at AncestryDNA.
  2. Discuss possible future methods that may be helpful in reaching the same goals.


Looking for a bio-parent may feel like looking for the horizon on a foggy day, but keep looking and you might find what you seek.

Edit 10 May 2016: The May 2016 change in the matching algorithm at AncestryDNA makes some of the details in the PDF out of date, but the basic ideas still apply.

One of the more common uses of autosomal DNA matching products such as 23andMe, AncestryDNA, and FTDNA Family Finder is by people searching for their biological families, including:

  1. Adoptees searching for their biological ancestors;
  2. Those who discover that one of their presumed biological parents turns out not to be so – often a shocking surprise, but sometimes also a deeply and long held suspicion that turns out to be true;
  3. Those who never knew one or both parents (usually but not always a missing father.)


The processes described here are neither new nor invented by me. An entire community of people dedicated to genetic genealogy and to helping others (such as adoptees) has arisen, and the methods being used are continually improving, especially as creative software developers bring forward new tools to make use of autosomal DNA matches.

However, on forums and Facebook groups the same questions about finding family using DNA seem to be asked continually, sometimes with posters presenting the same question within minutes of each other. Thus we believe that there is still a need for more educational material on this topic.

In the presentation below, as a PDF file, is an overview of how to exploit one’s DNA test at AncestryDNA, including:

  1. mirror pedigrees;
  2. shared matches;
  3. ancestor harvesting.

This presentation is not an exhaustive exploration of these topics, but hopefully will be helpful to many seeking bio-families.

Data Mining Through Mirroring

Download (PDF, 3.35MB)



I believe there are two possible tool types that AncestryDNA could give us that would help in research, both for more traditional genealogy goals (i.e., finding ancestors beyond those we knew personally) as well as for those searching for birth families:

Surname analysis

See my post “Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and ‘Enrichment’”.

Geographic analysis

Our ancestors lived in locales at specific times and much family history work depends upon locating our ancestors within their landscape. Novices to family history do not do enough of this, but analyzing locations is often key to unraveling complex families. Anyone who has spent some time reviewing family trees knows that the location fields for deaths and births and many of the events in-between are often left off, or given inaccurately.

AncestryDNA does provide a map tool for every DNA match, to see where the match’s ancestors were born, according to the tree to which the DNA kit is attached. This is very useful for my matches where surnames are of little to no use (such as my Norwegian ancestors).  AncestryDNA also provides a filter for searching matches by location.  Both are limited by the difficulties in conforming geographic names to standards, as well as the sad fact that even those of our matches who have family trees often do not have birth locations for their ancestors.

Still, it ought to be possible for an automated process to sort the ancestors of our matches into location groups. In this regard even the rudimentary Map View in the DNA Relatives tool at 23andMe offers a capability not available at AncestryDNA. For example, there are 3,144 counties (or county equivalents) in the US – why can’t we have a heat map at the county level, for the birth and death counties of the ancestors of our matches? There are analogous (to the US county system) geographical divisions in many nations that could also be heat-mapped.
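As a rough illustration of the county-level grouping suggested above, here is a minimal Python sketch. It is not a tool offered by AncestryDNA or 23andMe; the record layout (the `birth_state` and `birth_county` fields) and the sample ancestors are hypothetical, and real tree data would need far more careful place-name standardization than the simple trimming and capitalization shown here.

```python
from collections import Counter

def county_counts(ancestors):
    """Count ancestor birth events per (state, county), skipping blank locations."""
    counts = Counter()
    for person in ancestors:
        # Hypothetical field names; normalize spacing and capitalization only.
        state = (person.get("birth_state") or "").strip().upper()
        county = (person.get("birth_county") or "").strip().title()
        if state and county:
            counts[(state, county)] += 1
    return counts

# Made-up ancestors pooled from the trees of several matches.
sample = [
    {"name": "Ancestor 1", "birth_state": "va", "birth_county": "augusta"},
    {"name": "Ancestor 2", "birth_state": "VA", "birth_county": "Augusta "},
    {"name": "Ancestor 3", "birth_state": "PA", "birth_county": "Lancaster"},
    {"name": "Ancestor 4", "birth_state": "", "birth_county": ""},
]

counts = county_counts(sample)
for (state, county), n in counts.most_common():
    print(f"{county}, {state}: {n}")
```

The resulting counts per county are the numbers a heat map would color; the ancestor with no recorded birthplace simply drops out, which is the same limitation noted above for the existing map and filter tools.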

In the bigger picture we are discussing what might be called “data mining”. We want to use automated systems to find patterns in large data sets. Here we have two large data sets: DNA matches, and Family Trees.

The collected family histories available both in traditional publications (books, journals, monographs) and electronically represent our families and their culture through the collective historiography work of many thousands of people, some professional genealogists, many of them not.

Connecting this family history to that of our physical inheritance – as presented through our genotypes, created by these direct to consumer DNA testing companies – is the goal of everyone in genetic genealogy.  Of the companies involved, so far AncestryDNA has proven for me and many others to be the most able to reach this goal (though as a product it is still wanting for some basic tools, such as a chromosome browser for those wishing to do segment mapping.)   I’m hoping someday they will add tools in the categories I list above.

Genealogy and Autosomal DNA Matches: Common Errors in “Proving” An Ancestor, and the Allure of Easy Gateway Ancestors


What we aim to accomplish in this post:

  1. Illustrate a common misconception in the use of autosomal DNA in genetic genealogy; namely, assigning a single shared chromosome region to a specific common entry in two pedigrees, an ancestor, presumed to be the source of the shared DNA common to both people who tested.
  2. Define a spectrum of “proof” uses of autosomal DNA matching.

First, copying a label warning from previous posts:

Warning: Genetics is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve.   The following will be a bit technical.   If it seems hard don’t worry, it’s supposed to be.

With the rising use of direct to consumer (DTC) autosomal DNA tests (such as those from AncestryDNA, 23andMe, or Family Tree DNA) for the purpose of genealogy, we have also seen a great number of posts in blogs, Facebook, etc. on using chromosome browsers and matching chromosome segments to prove an ancestor.

And now there are services popping up on the internet that claim to be able to give you a DNA match to famous ancestors. As with many third party “ethnicity” websites, my advice to the consumer is an old one: Caveat Emptor.

To explain why, we have to review many related concepts.

The first principle with which we must come to grips is this:

☞ DNA (of any kind) directly gives us a way to construct clades (represented in graphical form by cladograms), but does not construct family trees as we know them in genealogy.

And the corollary:

☞ Family trees are constructed from a variety of evidence and must permit the DNA data as empirical evidence.


Let us first compare a cladogram to a family tree.
What is a cladogram? A cladogram is a way of showing relationships among living things, the relationships based on morphology (classically, derived characteristics) or through DNA. There are many possible ways to draw a cladogram but my favorite is the horizontal format. A classic cladogram for us apes is thus:



Fig. 1: Cladogram of Apes, annotated.



Contrast this with the classic scheme for a descendant chart, often used in genealogy:


Fig. 2: A Simple Descendant Chart, a paradigm used much in genealogy.



The family tree, in this case represented by a descendant chart, includes a parent for each child. This is in contrast to the cladogram, which does not have a parent for each (or any) living organism included.

The practical import of this difference between a cladogram and a family tree will become more clear as we place DNA data on what I will call a genetic genealogy proof spectrum, for the lack of a better phrase. DNA data can be used to prove or disprove relationships within certain boundaries depending upon the nature of the relationship:


Fig. 3: Proof Spectrum of Autosomal DNA Use In Genealogy


Practical definitions in the “proof spectrum”:

“Defined” – the relationship between two people can only fit one model, either parent/child, or full siblings (or equivalent.) This is clear in autosomal DNA testing as parent/child are half-identical across all their autosomes and for the X between son and mother, or daughter and both parents. As has been noted by others, these autosomal tests are excellent paternity tests.

“Delimited” – there are a limited (tractable) number of possible genealogical relationships, the ambiguity of which can be eliminated by a DNA test of the right third party. The relationships are easy to determine (assuming one is able to do DNA testing) because relationships like grandparents and half siblings and aunts and uncles are seen readily by two people being half identical over large regions of several chromosomes. 3rd cousins are at the weak edge of this category, as the randomness of inheritance can on rare occasions lead to genealogical 3rd cousins not sharing any half-identical chromosome regions.

“Evidenced” – the existence of a common ancestor between two people is clear because of the presence of a statistically significant shared region of a chromosome, but the genealogical relationship cannot be defined (other than the relationship will not be in the set of the closer relationships which are described above) based on the DNA tests of the two individuals. There is no guarantee that any two distant genealogical cousins will share any statistically significant chromosome regions, and the likelihood of two cousins, beyond 4th cousins, sharing these regions is quite small.

Here now we come to a dilemma faced by those who are using autosomal testing for genealogy: Nearly all your DNA matches will be of distant cousins, and as more customers are gathered by the companies the number of your distant matches will grow to be very many. For example, today on AncestryDNA it is not unusual for an American to have over 5000 matches, yet only a handful will fall into categories closer than 4th cousin.

The few matches one may have who are close relatives may be either expected or be a shocking surprise, but because close relatives will share statistically significant identical regions on several to many chromosomes the closeness of a relationship will not be in doubt, and if the other party is willing to share information one can arrive at the identity of the common ancestors. [There are exceptions of course, see the end of this blog entry.]

However, given so many distant cousin matches, we are tempted to use these distant matches to try and “prove” our family trees, but all too often the novice falls into the following algorithm, which is a trap:

  • Step 1) Person B matches you (person A.)
  • Step 2) Ancestor Z is in B’s pedigree and in your pedigree.
  • Step 3) Therefore the match proves you (A) and B are descended from Z and the DNA you share came from Z.

The above algorithm is flawed. The reason is straightforward – how do you prove that the piece of DNA, a shared chromosome region, did indeed arise from ancestor Z and not from another ancestor?

Hint – you can’t, at least not easily; we’ll cover that a bit more later.

We have a great many ancestors even from the time of the first European colonization of the Americas. For finding from which ancestor we inherit a shared chromosome region with another person we must come to terms with the many possible overlapping ancestors between any two people whose ancestors lived in the same part of a continent.
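To make the scale of the problem concrete, the following Python sketch counts the ancestor slots in a pedigree at a given generation and shows that two pedigrees can share more than the one ancestor a novice happens to notice. The function names and the single-letter ancestor labels are made up for illustration, and the slot count ignores pedigree collapse, so the number of distinct people can be smaller than the number of slots.

```python
def pedigree_slots(generations_back):
    """Number of ancestor slots exactly n generations back (2**n, ignoring pedigree collapse)."""
    return 2 ** generations_back

def shared_ancestors(pedigree_a, pedigree_b):
    """Every ancestor label that appears in both pedigrees, sorted."""
    return sorted(set(pedigree_a) & set(pedigree_b))

for n in (5, 10, 12):
    print(f"{n} generations back: {pedigree_slots(n)} ancestor slots")

# Hypothetical pedigrees: Z is the ancestor the novice spots, but Q is shared too.
pedigree_of_a = {"Z", "Q", "R", "S"}
pedigree_of_b = {"Z", "Q", "T"}
print(shared_ancestors(pedigree_of_a, pedigree_of_b))
```

Even in this toy case the shared segment could have come through Q rather than Z, and real pedigrees also contain shared ancestors who are not documented in either tree.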

Now the more aware consumer will jump up and shout CHROMOSOME BROWSER!!

Alas, it is not that simple.

Even with a chromosome browser available, DNA does not come with little labels attached saying from which person centuries ago said bit of DNA came through. A process of “triangulation” may be used, but its value as a proving-mechanism is based upon going from the known to the unknown.

Otherwise, without starting from a known ancestor (say a tested grandparent), if one has a group of people who are matches to each other for the same region of a particular chromosome one may be able to lay out the pedigrees of all the group members and discover that there is only a single person, or a couple, common to all pedigrees. In these cases the hypothesis will be that one ancestor (or couple) who is common in all the pedigrees is the source of the shared chromosome region, but if the effort to identify this common ancestor does not start from a (later, descended) known person we are still at a hypothesis and not a conclusion.  The shared DNA in these cases is simply evidence, not proof.

There are problems in the use of “triangulation” that may be overlooked by the inexperienced.   My goal in this post is not to review triangulation, which is covered in blog posts by genetic genealogists and providers of support software, and in a few books now available.   But I do want to point out two pitfalls about which to be aware: 1) the diploid problem, and 2) the existence of too common identical chromosome regions (see here) being found in the customer sets of DTC testing companies. Both of these pitfalls may come into play as we tackle one of the instigations for this blog entry: the claim that one can know deep, often “gateway”, ancestors with DNA.

As noted at the beginning of this blog entry, there are now outfits claiming to be able to tell you your “Gateway” ancestor(s) if you submit your autosomal test results to them. These claims are unsubstantiated and, more to the point, do not follow proven methods of determining relationships.

Simply because one shares a small chromosome region, or even worse only a limited set of SNP alleles, with a group of other people all claiming descent from some particular ancestor does not mean that person is your ancestor too!

Let us look at an example, where you and three others (A, B, and C, not of your immediate family) all get autosomal tests, and submit your raw data to a service claiming to give you a “gateway” ancestor based on matching:


Fig. 4: Marketing vs. Biology


On the left of the picture is what the vendor may want to sell you; on the right is the conclusion that can be based on what your raw data says.

The reason it is so hard to determine from whom a shared chromosome region may arise is because of the vast number of arrangements of possible descent. Here are but two scenarios based on 4 people sharing a chromosome segment:


Fig. 5: For any set of distant cousin matches, there exists a vast number of possible descendancy paths from the most recent common ancestor of all those involved. Here two scenarios are provided for 4 people matching each other.


In Case 1, A and B are 4th cousins 1x removed. You and C are 6th cousins to A and 6th cousins 1x removed to B. Only A and B descend from the “Gateway Ancestor”, while said ancestor is a 4th great grand uncle (or aunt) to you and C. The reason some matching algorithm might declare that both you and C descend from the Gateway Ancestor is that both of you match two people (A and B) who do in fact descend from the Gateway Ancestor. However, the DNA you all share comes from a founding couple who were the parents of the Gateway Ancestor.

Case 2 presents an even more insidious example of how descendancy over time will obscure actual DNA inheritance paths. In this case person B is indeed the 4th great grandchild of the Gateway Ancestor, an ancestor so named because one of their parents was either famous or came from a famous family line. Person A thinks he is descended from the same Gateway Ancestor, but unbeknown to him his great grandparent was only a half sibling to B’s 2nd great grandparent, because of an unrecorded parentage (benignly so or not.) Meanwhile you and C, even though you match B, are 2nd cousins 6x removed of the Gateway Ancestor, but not on the path with the famous person. And because you have some pedigree collapse going on, you are also the 2nd cousin 5x removed of the Gateway Ancestor, as well as being the 5th cousin 1x removed to C (besides also being 8th cousins with him.) Yet some matching service might declare that you are descended from the Gateway Ancestor, because you match B and probably several others who are indeed descended from the Gateway Ancestor.

By the time we get back to 4th, 5th, 6th and so on great grandparents there is an immense number of possible family connections. Even though you have only a low probability of having an autosomal match with any specific distant cousin, because you have millions of cousins since the time of colonialism you will eventually end up with a very large list of matches.

Trying to make genealogical proof arguments out of single segment DNA matches is a challenge for which I believe few are prepared. To unravel these deep connections will take much time and money, as large groups of people will need to be tested, and uniparental DNA (Y and mitochondrial) testing may be required to rule out possible lines of descent. Given large enough data sets of tested individuals some innovations may eventually be accepted as “proof”, such as AncestryDNA’s “Circles”. However, the AncestryDNA Circles, and even more so AncestryDNA’s recently introduced New Ancestor Discoveries, are still in the early stages of being user tested and for now cannot alone be used as “proof” in genealogy.

So, beware if someone wants you to send them your raw data and some money, to get a certificate or document or just an email claiming you are descended from a Somebody or a Gateway Ancestor.

End note: As mentioned above, there are problems with DNA matching even for closer cousinry, for specific people. Here we find the wacky DNA world of closed communities – there exist human communities which have arisen from a small number of founders, whose descendants only marry with each other, and in these situations even multiple segment matches can be misleading. Well known examples include Ashkenazi Jews and Pacific Islanders. For people from such populations, even multi-segment matches across several chromosomes could still be distant cousins.

Additionally, if two people match closely and both are adoptees who do not know either of their birth parents, it may not be clear how to define the tested relationship without the serendipity of having other, closer, matches appear. Adoptees are recommended to “fish in all ponds” to increase the chance that serendipitous matches appear.

Chromosome Pile-Ups in Genetic Genealogy: Examples from 23andMe and FTDNA



What we aim to accomplish:

  1. Dissuade people from chasing after small autosomal DNA segments by
    1. demonstrating too-common matching in particular regions of chromosomes;
    2. emphasizing the pitfalls of half-identical matching;
    3. briefly reviewing linkage measurements, emphasizing chromosome region lengths (in centiMorgans) as approximations.
  2. Illustrate the need that AncestryDNA had for redesigning their matching system to account for population wide shared chromosome regions. Whether or not the current AncestryDNA Timber algorithm is the best possible is not here the issue, but rather that something like it is needed, not just at AncestryDNA but at all the autosomal DNA services.


Warning: Genetics in general is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve.   The following will be a bit technical.



On various forums and blogs there is much discussion over the use of autosomal DNA matching for genetic genealogy. This field is emerging as a popular approach to extend what we know as the practice of “genealogy”, but there is much confusion and even angst running through the community of consumers of genetic genealogy tests over exactly what is meant by a “match”, and why companies decide on whom to exclude from match lists.

A common belief or assertion by some who are interested in genetic genealogy is that relatively small regions (below roughly 7 centiMorgans) of a chromosome are important for genealogical research.   I won’t quote from various forum posts because I do not want individuals to think that I am picking on them, but whether we look at a Facebook group (e.g., the ISOGG group), the forums on the company websites of DTC DNA testing companies, or in general forums about genealogy and ancestry, there seems to be no end of a stream of posts of people trying to make small shared regions mean something.

The problem is these small regions often don’t mean what the posters think they mean.

Though we all like to think of ourselves as unique, here I need to emphasize:

☞ We humans are unique assemblages of very common small bits and pieces of chromosomes (DNA.)

Furthermore, there are some bloggers who are attempting to be highly visible in their attack on AncestryDNA, who late in 2014 revamped their matching algorithm, to the great dismay of some. This attack on AncestryDNA has been on Facebook groups as well as forums. Whatever one thinks of the company itself, the need for AncestryDNA to incorporate into their matching algorithm a method to deal with too-commonly matched regions of chromosomes is one of the reasons for writing this post. Regardless of the company one uses to do matching for the purposes of genetic genealogy, that company will improve their product if they can find a way to deal with false positives, and currently AncestryDNA does that, 23andMe does that to some extent, and FTDNA does not (to the best of my knowledge.)

One of the issues that drove AncestryDNA to redesign their matching algorithm, in part, we can couple to the problem of matching on small segments.

Let’s take a look at some data – all the data in this post is from a test I manage at 23andMe, and a profile I manage at FTDNA, from Family Finder, which is a transfer of the same 23andMe (V3) raw data of the same person tested at 23andMe. If you have tests at these companies you can download your own data and see for yourself where pile-ups are occurring.

23andMe customers have a tool available to them called “Countries of Ancestry” (CoA, formerly known as “Ancestry Finder”), which is a misnomer. It is a compilation of matches (with match data), composed in part from entries from the person’s DNA Relatives and others not in a person’s DNA Relatives list. Only 23andMe customers who have filled out the ancestry survey will find themselves in somebody else’s CoA list. Unfortunately the size of the CoA list is, similar to DNA Relatives, limited in the number of people allowed on the list (approximately a thousand.) Nevertheless, CoA is a list of over 1000 matching segments (as some person matches will be multi-segment matches) and thus useful for our purposes.

We can use the CoA data to plot out the “segments” (technically, half-identical regions) for each chromosome, stacking the segments to show overlaps and total coverage of the chromosome. Here is a plot for chromosome 1 showing 95 matching (to the test used in this post) segments as stacked rectangles:



Fig. 1: Chromosome 1 matching segments for our test, as rectangles, from 23andMe Countries of Ancestry list.



(The red line is a count of overlaps, thus indicating shared regions for that portion of the chromosome, but since the segment blocks are offset vertically by slim white spaces to make the segments visibly distinguishable from each other the red line ends up being on a different vertical scale than the blocks.)
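The overlap count behind the red line can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to draw the figures: the function names are made up, segments are treated as half-open (start, end) intervals in Mbp, and the sample coordinates are invented.

```python
def overlap_depth(segments):
    """Return (position, depth) steps for half-open [start, end) segments."""
    events = []
    for start, end in segments:
        events.append((start, 1))   # a segment begins here
        events.append((end, -1))    # a segment ends here
    events.sort()  # at equal positions, ends (-1) sort before starts (+1)
    depth = 0
    steps = []
    for pos, delta in events:
        depth += delta
        if steps and steps[-1][0] == pos:
            steps[-1] = (pos, depth)  # merge events at the same position
        else:
            steps.append((pos, depth))
    return steps

def max_depth(segments):
    """Largest number of segments covering any single point."""
    return max((d for _, d in overlap_depth(segments)), default=0)

# Invented segments in Mbp: three overlap between 40 and 50.
sample_segments = [(0, 50), (30, 80), (40, 60), (200, 240)]
print(overlap_depth(sample_segments))
print(max_depth(sample_segments))
```

Plotting the depth against position gives a curve like the red line, and regions where the depth rises far above its typical level are the candidate pile-ups.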

We can see from figure 1 that chromosome 1 is nearly completely covered by matches (from the 23andMe Countries of Ancestry list). We also notice that while the matching segments are not evenly distributed along the chromosome, the distribution is not so lopsided as to demonstrate a significant “cold region”, though the peak around 60Mbp (mega base-pairs) followed by the trough is starting to look suspicious. However, around 240Mbp there appears to be the beginning of a “pile-up” region.

Let’s look now at another chromosome, 6, which is known to have some regions that are troublesome:



Fig. 2: Chromosome 6 matching segments from 23andMe Countries of Ancestry match list.


Looking at figure 2 it is readily evident that there is an overwhelming number of chr6 matches in a single region of the chromosome, around 30Mbp. Chromosome 6 is noted for its HLA regions, parts of the chromosome which house genes vital to the human immune system. The segment pileups in these regions are suggestive, and may demonstrate a non-random phenomenon, probably a selection event in human evolution, perhaps even during historic (i.e., since the invention of writing) times.

Also noticeable is a desert of matching around the 100 Mbp region.

This lopsided distribution of matching segments should give us pause: what does it mean for our genealogy efforts “to match” in these cases?

What we are seeing in the figures in this post is how commonly distributed small fragments of chromosomes (or more precisely, sets of alleles, or haplotypes) can be in our society. The 23andMe CoA file only has 1000 people, out of the entire 23andMe database of 800,000 customers. What if everyone in the US were tested, and the CoA list not capped? We likely would see hundreds of thousands of “matches” in these regions of chr6. And it should be noted that in figure 2 some of those segments in the major pile-up areas are over 10cM in length (according to current linkage maps – more on that below.)

In doing genealogy, trying to make sense of these kinds of matches (in overly-common regions) is an exercise in futility. Even using the oft stated standard of a minimum size of 7cM for a match that is likely to be identical-by-descent, identifying the most recent common ancestor (MRCA) with a match in these chr6 pile-up regions is not tractable, even if the segment surpasses the 7cM threshold; the MRCA could be dozens of generations ago. One could propose that if a large enough sample of our population tested, and if select buried individuals could be exhumed and tested, then we could recreate partial genotypes of the individuals of entire communities of our ancestors from centuries ago. If this could be done then we could determine how common today’s shared chromosome regions were in any given ancestral community, and perhaps trace the rapid growth of particular families or clans. Testing on such a scale is unlikely in the near future, however, and such an effort will likely face other hurdles.

Additionally, this region in chr6 in particular demonstrates the problem of half-identical matching. The massive pileup is likely not due to a single physical 7cM to 10cM strand of chr6, but to the superpositioning of several smaller (say 0.5cM to 1cM) regions, haplotypes found on chr6 which are very common throughout the European population. By chance, these small fragments will superimpose (given the two copies of chr6 we all carry) to present these larger half-identical regions (HIR) which meet the matching threshold cutoffs (say 7cM.)
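A toy Python example may help show how a half-identical region can be longer than any stretch actually shared on a single chromosome copy. This is a sketch under simplifying assumptions, not any company's matching code: there are only 8 SNPs with alleles A and G, the haplotypes are invented, and “half-identical” is taken to mean that the two unphased genotypes share at least one allele at a SNP.

```python
def genotypes(hap_x, hap_y):
    """Unphased genotype per SNP, built from a person's two haplotypes."""
    return [x + y for x, y in zip(hap_x, hap_y)]

def half_identical(genotype1, genotype2):
    """True if two unphased genotypes share at least one allele."""
    return bool(set(genotype1) & set(genotype2))

def hir_length(geno1, geno2):
    """Longest run of consecutive half-identical SNPs."""
    best = run = 0
    for g1, g2 in zip(geno1, geno2):
        run = run + 1 if half_identical(g1, g2) else 0
        best = max(best, run)
    return best

def longest_shared_haplotype(haps1, haps2):
    """Longest run over which one haplotype from each person is identical."""
    best = 0
    for h1 in haps1:
        for h2 in haps2:
            run = 0
            for x, y in zip(h1, h2):
                run = run + 1 if x == y else 0
                best = max(best, run)
    return best

# Invented haplotypes (two chromosome copies per person, 8 SNPs each).
person1 = ("AAAAAAAA", "GGGGAAAA")
person2 = ("AAAAGGGG", "AAGGAAAA")

hir = hir_length(genotypes(*person1), genotypes(*person2))
shared = longest_shared_haplotype(person1, person2)
print("half-identical run:", hir)
print("longest truly shared haplotype run:", shared)
```

In this example the half-identical run covers all 8 SNPs, while the longest stretch shared on any one pair of chromosome copies covers 6; the apparent match is stitched together from shorter pieces on different copies, which is the superposition effect described above.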

Given that some of these regions in chromosomes are known, at least 23andMe filters out the most common ones before a match can make it into DNA Relatives.   This is an important distinction.

But what if we did not filter out these known regions?  And what if we were not limited to match lists of only 1000 people?

For this we turn to FTDNA’s Family Finder, which is not limited in the number of matches, and includes HIRs as small as 1cM.

Here are the matching chromosome 1 segments from Family Tree DNA’s Family Finder, for the same person as in the 23andMe CoA test:


Fig. 3: Chromosome 1 matching segments from FTDNA Family Finder chromosome browser list.


First thing to note is that, while the FTDNA customer database for Family Finder is much smaller than 23andMe’s customer base, because FTDNA is reporting matching regions down to 1cM the Family Finder Chromosome Browser downloaded data set is quite a bit larger than the 23andMe CoA file. In figure 3, for chromosome 1, we are looking at over 1000 such matching regions.

It is quite clear that there are pile-up regions, sticking up like telephone poles in the forest of matches. These “matches” are occurring much, much more frequently than one would expect for a random distribution of chromosomes recombining in each generation.

Since the 23andMe CoA has a minimum cutoff of 5cM for a segment, we can filter the FTDNA data to include only those segments that likewise are at least 5cM in size. A plot of the FTDNA result for chromosome 1 ends up looking similar to the plot of matches from 23andMe:


Fig. 4: Chromosome 1 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.


Figure 4 is similar to figure 1, and we notice the pile-up near the 230-240Mbp region. However, what stands out in fig. 4 is the pile-up region around 180Mbp, which is not evident in fig. 1. We are starting to see regions of chromosome 1 where there is excessive “matching”.
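The 5cM filter used for figure 4 is easy to reproduce on a downloaded segment list. The Python sketch below assumes a CSV with a `CENTIMORGANS` column and the other column names shown; those names and the sample rows are assumptions made for illustration, so check the header of your own download before relying on them.

```python
import csv
import io

def filter_segments(rows, min_cm=5.0, cm_field="CENTIMORGANS"):
    """Keep only rows whose segment length is at least min_cm centiMorgans."""
    kept = []
    for row in rows:
        try:
            cm = float(row[cm_field])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or unparsable length
        if cm >= min_cm:
            kept.append(row)
    return kept

# Invented sample data with assumed column names.
sample_csv = """MATCHNAME,CHROMOSOME,START LOCATION,END LOCATION,CENTIMORGANS
Match A,1,1000000,9000000,12.4
Match B,1,180000000,181500000,1.3
Match C,1,236000000,243000000,5.0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = filter_segments(rows)
print([r["MATCHNAME"] for r in kept])
```

For a real file, replace the `io.StringIO` sample with an opened file handle; the filtered rows can then be fed to whatever plotting or overlap-counting step you use.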

Let’s repeat this exercise for chromosome 6, first presenting all the FTDNA FF segments on chr6:


Fig. 5: Chromosome 6 matching segments from FTDNA Family Finder chromosome browser list.


The massive pile-up around 30Mbp in figure 2 is now even more massive. The FTDNA Family Finder Chromosome Browser data includes 1467 segments for chr6, a great share of them in the pileup regions. Besides the largest such pileup we can visually identify 3 others.

As before, if we filter the FTDNA data for only those segments at least 5cM in size we get a much smaller set, and when plotted we get:


Fig. 6: Chromosome 6 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.


In the filtered data, with only segments greater than or equal to 5cM, only the major pile-up in chr6 remains, and it is still imposing. There may possibly be a pileup near the start of chr6 too, but it may not be statistically significant.

In the above figures it becomes evident that there is excessive matching in particular regions of chromosomes. Furthermore, the commonality of these matches suggests that attempting to incorporate this data into family history research will lead to futility.

As humans we are all related to each other, the question being how long ago any two individuals’ MRCA lived. In the US, people of colonial descent are likely multiply related to each other within the past 20 generations (500 years), and many times at that. Indeed, those of colonial descent will have very many 10th cousins and closer living in the US; the numbers of 5th through 10th cousins are likely in the millions for the colonials. So, the existence of relatives, distant to very distant, is not the question.

☞ The key to doing genetic genealogy with autosomal DNA is not finding a match, but rather finding genealogically tractable matches.

“Genealogically tractable” here means that given a reasonably exhaustive search of existing records, two family trees are documented sufficiently to support the conclusions in the pedigree, and the name of a most recent common ancestor can be found. Thus ancestors who lived before the era of documentation trails are not genealogically tractable. To go back further in time than a document trail can allow means we are entering the territory of the “ethnicity” or “ancestry” estimates provided by some companies. For many people in the world there are no records before the 19th or 18th centuries, while in a small set of locales documentation may go back to before the era of European colonization, but in these cases the records rapidly collapse to the nobility and royalty.

This is important because,

☞ Given lax enough matching criteria, one can have a DNA match with a person with whom your shared MRCA existed before records were kept for that MRCA.

Thus our goal is to find matches with whom we can possibly find the MRCA. A “match” that is based on population-common fragments of chromosomes is unlikely to be resolvable by genealogical methods, as such chromosome fragments would have been found in a large number of the contemporaries of our own pedigree ancestors. This is all the more true as we move back in time, when people found their mates nearby and not uncommonly married their cousins.

As noted above, 23andMe filters out some common regions (though they have not published the details) before a “match” can make it onto the DNA Relatives list.

In the fall of 2014 AncestryDNA implemented their own means to do something similar, presented by them with much fanfare (and here.) AncestryDNA’s new matching system has caused quite a stir, in part because customers lost some to many of their old matches. The tests I manage lost from 40% to over 90% of their previous matches. Part of the reason for this is AncestryDNA’s new “Timber” algorithm, which explicitly attempts to deal with the phenomenon of the matching of overly-common chromosome regions, the existence of which is undeniable, as we see in the plots in this post.

Without addressing the matching to overly-common DNA we end up with very large lists of matches with whom we will never find the MRCA, which was the case previously with AncestryDNA and is possibly true still at FTDNA.

It may turn out that the Timber algorithm is too aggressive, and some genealogically informative matches are being lost in the new AncestryDNA matching system. This may happen if an inherited region of a chromosome is broken up by many population-wide common segments and, after Timber filtering, the remaining pieces of this inherited region are too small to meet the minimum threshold to declare a “match”. However, for AncestryDNA to correct this will probably require developing even more sophisticated filtering algorithms, about which I may write later in another post.

Having established the requirement for filtering out pileup regions, I want to stress again that these regions are not just small, 1 to 2cM, portions of chromosomes. Some will be much larger.

In general, as a best practice when matching genotyped individuals, matches ought not be declared on small (less than 7cM) regions of unphased genotype datasets, a subject discussed at length by various authors and bloggers and which I won’t repeat here. Phased genotypes can be matched on smaller regions with greater confidence, though I would not do so for regions less than about 5cM. (And if you’re interested in current academic discussions of identical-by-descent detection try here, here, here, and here). Unfortunately FTDNA insists on reporting small segments on their unphased genotype data sets, which misleads the customer in regards to how significantly they match other people.

Given all of that, we conclude this:

☞ Even after filtering out HIRs smaller than 5cM, DNA matching services such as FTDNA or 23andMe should filter out HIRs greater than 5cM that appear too commonly in the population, so that the end user receives a genealogically tractable match list.
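To make the idea concrete, here is a rough sketch in Python of one way such a filter could work. It is emphatically not AncestryDNA's Timber algorithm nor 23andMe's unpublished method; it is only a toy rule that flags bins whose depth is far above the median and drops segments lying mostly inside them. The function names, the bin width, the `factor` and `max_overlap` thresholds, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median

BIN = 1_000_000  # bin width in base pairs (an arbitrary choice)

def bins_of(start, end):
    """Bins touched by a segment spanning start..end (base pairs)."""
    return range(start // BIN, (end - 1) // BIN + 1)

def pileup_bins(segments, factor=3.0):
    """Bins whose depth exceeds factor times the median depth of covered bins."""
    depth = Counter(b for _n, s, e, _cm in segments for b in bins_of(s, e))
    if not depth:
        return set()
    cutoff = factor * median(depth.values())
    return {b for b, d in depth.items() if d > cutoff}

def drop_pileup_segments(segments, min_cm=5.0, max_overlap=0.5, factor=3.0):
    """Keep segments >= min_cm that lie at most max_overlap inside pile-up bins."""
    long_enough = [s for s in segments if s[3] >= min_cm]
    crowded = pileup_bins(long_enough, factor)
    kept = []
    for seg in long_enough:
        bins = list(bins_of(seg[1], seg[2]))
        frac = sum(b in crowded for b in bins) / len(bins)
        if frac <= max_overlap:
            kept.append(seg)
    return kept

# Synthetic data: ten matches stacked on one region, two elsewhere, one short.
segments = [(f"pile {i}", 30_000_000, 33_000_000, 5.5) for i in range(10)]
segments += [
    ("cousin X", 60_000_000, 72_000_000, 12.0),
    ("cousin Y", 100_000_000, 108_000_000, 8.0),
    ("short", 5_000_000, 6_000_000, 1.2),
]
print([s[0] for s in drop_pileup_segments(segments)])
```

A real service would work from its whole customer base rather than one person's match list, and would need a principled cutoff, but the sketch shows why a segment well over 5cM can still be discarded.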


Ancillary to the above discussion and needing to be addressed especially in light of how FTDNA’s Family Finder reports matches, and since I mentioned it above, I want to touch briefly on the concept of “size” when it comes to autosomal DNA segments. This is important because I often see commenters in forums referring to a shared HIR as being ‘n.nn’ centiMorgans, sometimes to those 2 decimal places, probably because they check their FTDNA Chromosome Browser results which report matching in centiMorgans to two decimal places.

FTDNA should not give HIR lengths, individual or in total, in cMs to two decimal places!

The diagrams in this post use base pairs (or millions of base pairs) as the coordinate to place the matching HIRs over the chromosome length. In genetics, the concept of linkage disequilibrium mapping brings about the need to map the physical molecular position to a frequency space, with the frequency unit being a Morgan but in practice the centiMorgan is used.

Each human chromosome has by experimental procedure been “mapped”, associating a physical coordinate onto a scale measured in centiMorgans.

What many enthusiasts in genetic genealogy may not know is that the thought processes and concepts associated with these linkage maps have to deal with many issues; some of these may impact our understanding of the population-wide matching segments which are the topic of this post.

Referencing “cool spots” and “hot spots” can imply that coordinates from a linkage map be taken uncritically; however, the very large sets of genotyped samples being collected by 23andMe and AncestryDNA may be used in the future to create a better understanding of recombination probabilities along the chromosomes.

There is much variability in the end products of meiosis, and in regards to recombination and creating maps of chromosomes the following issues are topics of discussion:

  • age (e.g., here);
  • ethnicity;
  • gender (if comparing a child to a parent);
  • unique genetics (e.g., here).

The measurements for segment lengths given by 23andMe or FTDNA or gedmatch.com are averages, specifically gender averages.

For further (and very technical) reading on this subject:

Identifying recombination hotspots using population genetic data

Genetic Analysis of Variation in Human Meiotic Recombination

Enhanced genetic maps from family-based disease studies: population-specific comparisons

Variation in Human Recombination Rates and Its Genetic Determinants

Genetic Control of Hotspots


☞ The bottom line about segment lengths is this: Round the length of chromosome segments to whole numbers, and remember additionally that the size of a segment is only an estimate and is accompanied by non-trivial error bars.


 ❦ ❦



Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and “Enrichment”


The questions we hope to answer are these:

Can we use the surnames provided (either in lists or pedigrees) by our DNA matches to:
  1. Discover the names of ancestors about whom we have no a priori knowledge?
  2. Confirm the names of ancestors about whom we may have doubts?


This is a very complex subject, so this is a lengthy post to wade through – you are forewarned!


Making the most of the direct to consumer (DTC) DNA services for doing genealogy is currently a challenge, with no single company offering everything an aspiring (professional or dilettante) genetic genealogist requires to accomplish their goals. This quandary includes dealing with the cultural attachments to our DNA: our names.

One of the central theories of genealogy as practiced in the English speaking world is that surnames matter.   This turns out to be true enough, and is a reflection of the deep patriarchy embedded in western society.

Yet for many of us with ancestry from other nations outside the Anglosphere this surname-centric view of families will not be of much use.   I for instance have half my pedigree that is from a culture where patronymics were the custom until the time of my grandfather’s emigration; I am only the second generation born with the surname I carry.

Still, for the English genealogy world the surname ranks as one of the more important concepts.   For example – the field of one-name studies, while not absolutely wedded to the idea, centers on the name being studied as a surname.   Genealogy database programs such as Family Tree Maker are structured around a naming paradigm in which there are surnames.  Your local courthouse records are probably indexed by surname, and so on.

A Tree of Names, but Which Ones Belong?


In the DNA view of genealogy as practiced today surnames still are center stage – at FTDNA Family Finder there is a field for surnames, at AncestryDNA the match page has a section dedicated to pedigree surnames and surnames in-common, and at 23andMe a customer can enter surnames into one’s profile, which will then appear in DNA Relatives and upon which one can search or sort one’s matches.

23andMe goes one step further, though, and offers what they call “Surname View”. This is a ranking of surnames as found in a customer’s DNA Relatives list.   The surnames are ranked by an “enrichment” score.   23andMe defines this value as:

Enrichment is computed via a one-tailed binomial test. The 23andMe-wide frequency of a given surname is the reference frequency. The number of occurrences of the surname among your matches and the total number of surnames among your matches are the counts in the binomial test. This results in a p-value; we then report -1.0 * log10(p), so the bigger the number is, the more unusual it is that it was at such high frequency among your matches.
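As a sketch of what that definition amounts to, the following Python computes -log10 of the upper-tail binomial probability using only the standard library. This is my reading of the quoted description, not 23andMe's actual code; the function names and the sample numbers are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def enrichment(count, total, ref_freq):
    """-log10 of the one-tailed p-value P(X >= count), X ~ Binomial(total, ref_freq)."""
    if count > total:
        raise ValueError("count cannot exceed total")
    if count <= 0:
        return 0.0  # P(X >= 0) = 1, and -log10(1) = 0
    # Sum the upper tail in log space so tiny probabilities do not underflow.
    logs = [log_binom_pmf(k, total, ref_freq) for k in range(count, total + 1)]
    top = max(logs)
    log_p = top + math.log(sum(math.exp(v - top) for v in logs))
    return max(0.0, -log_p / math.log(10))

# Hypothetical numbers: a surname seen 12 times, then 6 times, among 5,000
# surname entries, against a reference frequency of 1 in 1,000
# (about 5 occurrences expected by chance).
print(round(enrichment(12, 5000, 0.001), 2))
print(round(enrichment(6, 5000, 0.001), 2))
```

The score grows as the observed count moves further above what the reference frequency predicts, which is why a name seen 12 times scores much higher here than one seen 6 times.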


For the purposes of this discussion I’ll use the term “enrichment” in the same sense as 23andMe. Their definition of enrichment is an interesting idea, though I have some questions about the validity of using a binomial test.  In a future post I may delve into more technical details about this issue.   There is one very important point to make:

☞ How often a name occurs (in your match list) is not sufficient to define the importance of having a match with that surname in their pedigree. Rather, we need to take into account the likelihood (based on the population one is testing) of the particular name occurring at random.


Recognizing that 23andMe is just one pond in which to fish, many of us have tested elsewhere, such as at AncestryDNA.    However, AncestryDNA does not provide any analysis regarding name frequencies in a customer’s set of matches.   But, with a little bit (or a lot) of cleverness we can collect the surname data from our matches at AncestryDNA, found on each match page in the left hand column titled “Surnames (10 generation pedigree)”.

With this data I can then calculate what 23andMe calls the “Enrichment” for the AncestryDNA set of surnames from matches, using the US Census surname frequency data in place of 23andMe’s “23andMe-wide frequency”, as I don’t have access to the entire database of AncestryDNA tests (if only!) to calculate an equivalent AncestryDNA-wide frequency table. Given that the tested person in this case has all colonial American ancestry, and because the AncestryDNA customer base to date is overwhelmingly from the US with the “colonials” being a significant share of the customer base, using US Census data for expected values of name occurrence is presumed to be a good approximation of an AncestryDNA-wide data set.
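The data preparation for such a calculation might look something like the sketch below, in Python. It tallies surnames across matches and lines them up against a reference frequency table, producing the count, total, and reference frequency that a binomial test needs. The inputs are invented, the reference frequencies are made up rather than taken from the Census file, and counting each surname once per match is my assumption. The observed-to-expected ratio used for sorting is a simpler stand-in for, not the same thing as, the enrichment score.

```python
from collections import Counter

# Hypothetical inputs, invented for illustration only.
match_surnames = {
    "match 1": ["Parke", "Johnston", "Smith"],
    "match 2": ["parke", "Cocke"],
    "match 3": ["Smith", "Stuart", "Parke", "Zzyzx"],
}
# Made-up reference frequencies (share of the reference population per name).
reference_freq = {"PARKE": 0.00002, "JOHNSTON": 0.0002, "SMITH": 0.009,
                  "COCKE": 0.00001, "STUART": 0.0001}

def tally(match_surnames):
    """Count, for each surname, how many matches list it at least once."""
    counts = Counter()
    for names in match_surnames.values():
        counts.update({n.strip().upper() for n in names if n.strip()})
    return counts

def observed_vs_expected(counts, reference_freq):
    """Rows of (surname, count, expected, ratio), highest ratio first.

    Names absent from the reference table are skipped, since no expected
    value can be computed for them.
    """
    total = sum(counts.values())
    rows = []
    for name, count in counts.items():
        freq = reference_freq.get(name)
        if freq is None:
            continue
        expected = total * freq
        rows.append((name, count, expected, count / expected))
    return sorted(rows, key=lambda r: r[3], reverse=True)

counts = tally(match_surnames)
for name, count, expected, ratio in observed_vs_expected(counts, reference_freq):
    print(f"{name:10s} count={count} expected={expected:.5f} ratio={ratio:.0f}")
```

Whether the total should include names missing from the reference table is a judgment call; here it does.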

A concrete example, from two tests I manage, of the same person, one test at 23andMe and the other at AncestryDNA:

Here are all 226 entries under “Surname View” for this particular test:


23andMe Surname View

Surname | 23andMe Count | 23andMe Enrichment
23andMe Surname View on 24 Jan 2015, for same individual as tested at AncestryDNA


Here are the top (i.e., the most “enriched”) 226 (a number selected to be the same quantity as that from 23andMe) entries from the list of all surnames in matches at AncestryDNA for this same person:


AncestryDNA "Surname View"

Surname | AncestryDNA Count | "Enrichment"
As of 24 Dec 2014, AncestryDNA matches' surnames-in-pedigrees analyzed similarly to 23andMe "Surname View". The top 226 "enriched" surnames are shown, the same number of surnames as in the total Surname View at 23andMe, but which are only 3% of all the surnames from the AncestryDNA match list.

One thing that is very clear upon comparing the two lists is that they are not the same.

They are not even close.

The surnames in the 23andMe list do show up on the larger AncestryDNA list (not just the top 226), but scattered across the nearly 10,300 names in the much larger AncestryDNA list.

What also should stand out is that the counts of matches (second column) are much larger for the AncestryDNA data set than for the 23andMe data set.    While the number of customers of each service is about equal, those of us who have tried to use 23andMe for genealogy have discovered quite quickly that many DNA Relatives there are incognito, and even if a DNA Relative has a public profile they usually list very few surnames.   Add to this that 23andMe limits the DNA Relatives list size to only the top matches (for those customers with more than 1000 matches, which would include many Americans) and we discover that there are relatively few data points from the 23andMe database in regards to surnames.

In the above example, the testee has 4x as many matches at AncestryDNA as at 23andMe, and before “autosomalgeddon” at AncestryDNA last fall the match count was twice that. However, even more significant is that the average number of people in the pedigrees of matches at AncestryDNA is over 50, while at 23andMe the average number of surnames per match is but a small fraction of that. Because of the smaller data set of matches with ancestral names at 23andMe, it is possible for a few diehard genealogists, who test multiple members of their families, to skew the Surname View results by putting long lists of surnames in their profiles. For example, the above 23andMe customer matches 3 siblings, each of whom has a surname list of approximately 215 names! This one family significantly skews the Surname View results for the example test I am discussing. In this case, this family is on the first page of matches (that is about 1/12th the total) in DNA Relatives, and accounts for 645 name occurrences on that page. Everybody else on that page (95 people) accounts for just 413 more name occurrences.
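The arithmetic behind that skew can be checked in a few lines of Python. The figures are the ones given above; the variable names are only for illustration.

```python
sibling_profiles = 3
names_per_sibling = 215            # "approximately 215" per the text
sibling_occurrences = sibling_profiles * names_per_sibling
other_people = 95
other_occurrences = 413

total_occurrences = sibling_occurrences + other_occurrences
share_of_names = sibling_occurrences / total_occurrences
share_of_people = sibling_profiles / (sibling_profiles + other_people)

print(f"siblings: {share_of_people:.1%} of the people on the page, "
      f"{share_of_names:.1%} of the name occurrences")
print(f"everyone else averages {other_occurrences / other_people:.1f} names each")
```

In other words, about 3% of the people on that page supply about 61% of its surname occurrences.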


☞ Caution: Outliers, such as extremely deep pedigrees or very long name lists will skew results on relatively small datasets, such as at FTDNA and 23andMe.


Given that we have two different lists of “enriched” surnames for the same person, which one is more informative? (Note that I did not write “accurate”.)

After a couple of years of working on developing the pedigree for the subject of this post, with plenty of surprises along the way, it turns out the AncestryDNA surname enrichment list better reflects what I know of the pedigree. This is especially so in the top 25 names, where 4 on the list of AncestryDNA enriched names are the same as pedigree surnames, while only one name (“Green”) on the top 25 of the 23andMe enrichment list is found in the developed pedigree, and that name is questioned by some researchers of that ancestral family.

Of the entire 23andMe list of 226 surnames, only 11 are found in the developed pedigree.   For the top 226 from AncestryDNA it also turns out there are 11 names found in the testee’s pedigree.   The only name in common between the two groups of 11 is “Johnston”, which is good as that is the family name of the testee’s grandfather.   It should be noted that the entire list of AncestryDNA matches’ pedigree-names that were included in this analysis total over 10,000, and names in the rest of the pedigree of the testee are found on that list but farther down than #226.

Which raises the next question – what about all those other names on the list(s), that are not ancestral names of the person tested?

Intuitively we recognize that our ancestors can have many descendants, and in the society of which we’re a part a daughter will lose her maiden name and pick up her husband’s surname.   Thus our cousins descended from these great-grand aunts and female cousins will carry surnames that are not part of our direct pedigree lines. This is both a bane and a boon. It is a bane because unless we’ve carried out very extensive descendancy research we will not recognize the names of our genealogical cousins.   On the other hand, all these names are a boon because until the late 20th century people found their mates relatively close by, and families often intermarried and migrated together, making certain names more associated with each other.   From the example test in this post, on the AncestryDNA enriched surname table the top entrant is “Parke”, a name often found in colonial pedigrees due to the 17th century founders in America (e.g., purportedly a Dr. Roger Parke of early New Jersey) having many male descendants, including quite a few who ended up in what is today West Virginia and Kentucky.  This surname is then associated with many families that come from these areas, which happens in this case with our subject’s father’s ancestors having lived in WV and KY where many Parke/Park/Parks families lived.

As we work our way to an answer to the two questions at the top of this post it is important to realize that the use of DNA in genealogy is rapidly expanding, and as a larger share of the population tests at these companies we will see larger data sets that will give us an opportunity to be more certain of whatever we find.

So, in regards to question #1 posed at the beginning of this post, the best answer currently is “probably.”

In this particular case, in the example DNA test I am discussing in this post, only through DNA testing has a secret been revealed, an NPE (or possibly secret adoption.) It turns out that 3 surnames in the top 25 of the AncestryDNA surname enrichment table are ancestors of the biological parent of this surprise.   Those 3 names are in 4th, 7th, and 13th place on that list of enrichment (out of 10,300 surnames.)    It may still prove to be a coincidence, but finding the surnames of 2nd and 3rd great grandparents high on such a list ought not be surprising.  In this case the surnames are unusual enough to stand out, but not so ultra-rare in our society as to not make it onto the US Census list (which only includes names which occur more than 100 times in the US).   Extremely rare surnames are also, by definition, unlikely to show up in a list of DNA matches.

Furthermore, the 3rd entry on that AncestryDNA enrichment list, “Cocke”, is probably the ancestral surname of “Cox”, and the pedigree of the tested person most likely (based on un-vetted trees) has two entrants with that surname.   Number 21 on the above AncestryDNA enrichment list, “Stuart”, is the maiden name of a known 4th great grandmother.

My experience in doing this example is that nearly all the surnames in the pedigree of the person I’ve tested appear in the upper quartile of the AncestryDNA enrichment list, with many of them in the upper 10%.

So, without a priori knowledge of the names of unknown ancestors, how would one make use of an “Enrichment” list?    Based on my experience, I recommend starting at the top of an enrichment list, and working down at least a couple of hundred names (given a dataset as large as AncestryDNA) and look for any connections the names may have with one’s known ancestors.   Speculative family trees are not a bad thing, as long as they are treated as such.

☞ Mining surname data is a source of leads for genealogical research, and cannot stand apart from the exhaustive search for evidence required in sound family history practices.

In an ideal case a newly discovered close DNA cousin will have their full tree available to you to study, to find the most recent common ancestor, but that is not always the case.   Sometimes all one has is a list of surnames, or maybe a bare-bones pedigree with only names and no other useful information. If you discover a not-previously-known not-too-distant (say 2nd to 4th) cousin, and that DNA cousin lists some ancestral surnames, it is worth going through each of their surnames looking at an enrichment list of your own DNA test, to see if one or more stand out as occurring disproportionately frequently in your set of matches.

Then what of question #2, where we want to validate or support a theory about an ancestor’s possible surname (in these cases we are often looking for the maiden names of our married ancestresses)?

In my opinion, one of the better uses of an enrichment list is to act as a check against the human tendency to fixate on an idea or observation.   Given the ubiquity of pareidolia I value any means to keep me from seeing what is not there.   In the field of genealogy that includes becoming overly sensitive to a particular name, whereby we exaggerate the importance of that particular name any time we come across it.

In the case of the tested person in this post, after looking at many matches at the various companies I had become fixated on certain names, but looking at the AncestryDNA enrichment list I discovered that those names are not occurring any more frequently than one would expect from the population at large.

In another application of the enrichment table: I’m researching a great grandmother (of the tested person in this post) whose maiden name is known from census records and marriage records, but whose parents have escaped identification. In this case the lack of matching surnames of the DNA matches at 23andMe and AncestryDNA, as evidenced by the low enrichment value of the great grandmother’s maiden name, and the low enrichment value of the maiden name of the only woman I can find that I hypothesized could be her mother,  supports my idea that the great grandmother was adopted (or a foster child who took the name of a couple that housed her for a while.)  This hypothesis I had previously generated based on census data where the great grandmother spends her teenage years and early 20’s living with two different families not of her own surname, and the surname data suggests that I am on the correct path.   I’ve further had tested at AncestryDNA another great grandchild of this mysterious woman, and analysis of surnames of his matches concurs with the conclusions I’m drawing from his 2nd cousin’s match data.

Which brings up a very useful strategy, whether one is doing segment chasing or looking at surnames or just trawling for unknown cousins:

☞ Testing additional known relatives, especially 2nd cousins, can provide concurring data, or prevent one from reaching a wrong conclusion based on a single test.

Before I leave question #2,  when discussing the low enrichment scores I must throw in another caveat:

☞ For very common surnames (e.g. Smith, Williams, Jones, etc.) the law of large numbers will affect the significance test such that these names will not be very high on ordered enrichment lists.

This point can best be made with a graph:

Surname Plot: Pedigree Surname Frequency vs. Census Frequency

The magenta line is a line of slope 1 intersecting the origin (0,0).   In other words, all names to the left or above the magenta line are occurring more frequently in the matches (for our test subject) than in the US population as determined from the US Census for 2000.    We can see that as we move towards the more common surnames the variance in the data decreases and the surnames move closer to the magenta line.    So if you are going to do this analysis on your own match results, remember that even if you have ancestors named “Jones” or “Smith” do not expect there to be large changes in the frequencies of those names in your match list.
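One way to see why the scatter narrows for common names is to treat each surname entry as an independent draw, which is a simplifying assumption of mine and not part of the analysis behind the plot. Under that assumption the relative spread of an observed frequency is sqrt((1 - p) / (n p)), which shrinks as the reference frequency p grows. The sketch below, in Python, uses a hypothetical total of 100,000 surname entries.

```python
import math

def relative_sd(freq, total):
    """Relative standard deviation of an observed proportion under binomial sampling."""
    return math.sqrt((1.0 - freq) / (total * freq))

total_entries = 100_000  # hypothetical size of a surname collection
for freq in (1e-5, 1e-4, 1e-3, 1e-2):
    print(f"reference frequency {freq:g}: "
          f"relative spread {relative_sd(freq, total_entries):.1%}")
```

With these numbers the relative spread falls from roughly 100% for a name at 1 in 100,000 to roughly 3% for a name at 1 in 100, so a “Smith” or “Jones” has little room to stray from the line.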

Some additional observations to make:

  • The most common surnames in America are showing up in matches (of our test subject) less than the average in the US.   The entire name collection of the matches is tilted towards the less common surnames.
  • The demographic change in the US is quite noticeable, with the matches to old colonial American names prominent, while contemporary Hispanic names are under-matched.
  • Many ancestry.com users insist on putting a woman’s given name in the surname field. For the statistical significance test I culled these (except for “Barbara”, for which it was difficult to determine which instances were not really surnames), but here I show them to illustrate how common this habit is among ancestry.com users.
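The culling of given names mentioned in the last point could be sketched as below, in Python. The list of given names is a tiny hypothetical sample, not the list actually used, and the handling of ambiguous names such as “Barbara” (kept rather than dropped) simply mirrors the choice described above.

```python
# Hypothetical, deliberately tiny list of given names seen in surname fields.
GIVEN_NAMES = {"MARY", "ELIZABETH", "BARBARA"}
# Names that may be either a given name or a genuine surname; these are kept.
AMBIGUOUS = {"BARBARA"}

def cull_given_names(raw_names, given_names=GIVEN_NAMES, ambiguous=AMBIGUOUS):
    """Split raw surname-field entries into (kept, dropped) lists.

    Entries are trimmed and upper-cased; blanks are ignored.
    """
    kept, dropped = [], []
    for raw in raw_names:
        name = raw.strip().upper()
        if not name:
            continue
        if name in given_names and name not in ambiguous:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

kept, dropped = cull_given_names(
    ["Parke", " mary ", "Barbara", "Elizabeth", "Stuart", ""])
print("kept:", kept)
print("dropped:", dropped)
```

Any real list of given names would need manual review, since many given names are also legitimate surnames.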

I suggest that anyone who is a “colonial” (that is, all of whose ancestors immigrated to America before 1776) will see similar results.

There is much more to discuss about surnames and counting their frequency, and what to make of our DNA matches. Surname and geography frequencies, and the associated statistical tests, are a supplement to other genealogy methods. Even with a more molecular approach to mining DNA cousins (i.e., matching chromosome regions) we still need to ascribe the shared DNA to a person, said person being known in their day and referred to by us through their name(s), familial, given, or fabricated.

For many of us with doubtful or absent ancestors at the ends of branches in our family trees, perhaps analyses like these can lead us in directions to dig for genealogical gold.


This already long post would have been longer had I not excluded diving into several topics, which are possibilities for further discussion, eventually:

  • Outlining the analysis pipeline;
  • Improved methods of significance testing that more faithfully apply to the surname practices than the binomial test;
  • Exploiting name frequency distributions other than the 2000 US Census;
  • The problem of spelling and other phenomena related to names.

With that said, happy adventures in DNA mining!