Of Mirroring And Shared Ancestors: Exploiting AncestryDNA To Find Biological Families

 

What we aim to accomplish in this post:

  1. Provide definitions and an overview of methods for finding biological families of those who don’t know one or both biological parents, using tools at AncestryDNA.
  2. Discuss possible future methods that may be helpful in reaching the same goals.

 

Looking for a bio-parent may feel like looking for the horizon on a foggy day – but keep looking and you might find what you seek.

Edit 10 May 2016: The May 2016 change in the matching algorithm at AncestryDNA makes some of the details in the PDF out of date, but the basic ideas still apply.

One of the more common uses of autosomal DNA matching products such as 23andMe, AncestryDNA, and FTDNA Family Finder, is for people searching for their biological families, including:

  1. Adoptees searching for their biological ancestors;
  2. Those who discover that one of their presumed biological parents turns out not to be so – often a shocking surprise, but sometimes also a deeply and long held suspicion that turns out to be true;
  3. Those who never knew one or both parents (usually but not always a missing father.)

 

The processes described here are neither new nor invented by me.  An entire community of people dedicated to genetic genealogy and helping others (such as adoptees) has arisen, and the methods being used are continually improving, especially as creative software developers bring forward new tools to make use of autosomal DNA matches.

However, on forums and Facebook groups it seems that the same questions about finding family using DNA get asked continually, sometimes with posters presenting the same question within minutes of each other. Thus we believe that there is still a need for more educational material on this topic.

In the presentation below, as a PDF file, is an overview of how to exploit one’s DNA test at AncestryDNA, including:

  1. mirror pedigrees;
  2. shared matches;
  3. ancestor harvesting.

This presentation is not an exhaustive exploration of these topics, but hopefully will be helpful to many seeking bio-families.

Data Mining Through Mirroring

Download (PDF, 3.35MB)

 

 

I believe there are two possible tool types that AncestryDNA could give us that would help in research, both for more traditional genealogy goals (i.e., finding ancestors beyond those we knew personally) as well as for those searching for birth families:

Surname analysis

See my post “Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and “Enrichment””

Geographic analysis

Our ancestors lived in locales at specific times and much family history work depends upon locating our ancestors within their landscape.  Not enough of this is done by novices to family history, but analyzing locations is often key to unraveling complex families.  Anyone who has spent some time reviewing family trees knows that the location fields for deaths and births and many of the events in-between are often left blank, or given inaccurately.

AncestryDNA does provide a map tool for every DNA match, to see where the match’s ancestors were born, according to the tree to which the DNA kit is attached. This is very useful for my matches where surnames are of little to no use (such as my Norwegian ancestors).  AncestryDNA also provides a filter for searching matches by location.  Both are limited by the difficulties in conforming geographic names to standards, as well as the sad fact that even those of our matches who have family trees often do not have birth locations for their ancestors.

Still, it ought to be possible for an automated process to sort the ancestors of our matches into location groups.  In this regard even the rudimentary Map View in the DNA Relatives tool at 23andMe offers a capability not available at AncestryDNA.  For example, there are 3,144 counties (or county equivalents) in the US – why can’t we have a heat map at the county level, for the birth and death counties of the ancestors of our matches?  There are analogous (to the US county system) geographical divisions in many nations that could also be heat-mapped.

In the bigger picture we are discussing what might be called “data mining”. We want to use automated systems to find patterns in large data sets. Here we have two large data sets: DNA matches, and Family Trees.

The collected family histories available both in traditional publications (books, journals, monographs) and electronically represent our families and their culture through the collective historiography work of many thousands of people, some professional genealogists, many of them not.

Connecting this family history to that of our physical inheritance – as presented through our genotypes, created by these direct to consumer DNA testing companies – is the goal of everyone in genetic genealogy.  Of the companies involved, so far AncestryDNA has proven for me and many others to be the most able to reach this goal (though as a product it is still wanting for some basic tools, such as a chromosome browser for those wishing to do segment mapping.)   I’m hoping someday they will add tools in the categories I list above.

Chromosome Pile-Ups in Genetic Genealogy: Examples from 23andMe and FTDNA

 

 

What we aim to accomplish:

  1. Dissuade people from chasing after small autosomal DNA segments by
    1. demonstrating too-common matching in particular regions of chromosomes;
    2. emphasizing the pitfalls of half-identical matching;
    3. briefly reviewing linkage measurements, emphasizing chromosome region lengths (in centiMorgans) as approximations.
  2. Illustrate the need that AncestryDNA had for redesigning their matching system to account for population wide shared chromosome regions. Whether or not the current AncestryDNA Timber algorithm is the best possible is not here the issue, but rather that something like it is needed, not just at AncestryDNA but at all the autosomal DNA services.

 

Warning: Genetics in general is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve.   The following will be a bit technical.

 

 

On various forums and blogs there is much discussion over the use of autosomal DNA matching for genetic genealogy. This field is emerging as a popular approach to extend what we know as the practice of “genealogy”, but there is much confusion and even angst running through the community of consumers of genetic genealogy tests over exactly what is meant by a “match”, and why companies decide on whom to exclude from match lists.

A common belief or assertion by some who are interested in genetic genealogy is that relatively small regions (below roughly 7 centiMorgans) of a chromosome are important for genealogical research.   I won’t quote from various forum posts because I do not want individuals to think that I am picking on them, but whether we look at a Facebook group (e.g., the ISOGG group), the forums on the company websites of DTC DNA testing companies, or in general forums about genealogy and ancestry, there seems to be no end of a stream of posts of people trying to make small shared regions mean something.

The problem is these small regions often don’t mean what the posters think they mean.

Though we all like to think of ourselves as unique, here I need to emphasize:

☞ We humans are unique assemblages of very common small bits and pieces of chromosomes (DNA.)

Furthermore, some bloggers have mounted highly visible attacks on AncestryDNA, which late in 2014 revamped its matching algorithm, to the great dismay of some. These attacks have appeared on Facebook groups as well as forums. Whatever one thinks of the company itself, the need for AncestryDNA to incorporate into its matching algorithm a method to deal with too-commonly matched regions of chromosomes is one of the reasons for writing this post. Regardless of the company one uses to do matching for the purposes of genetic genealogy, that company will improve its product if it can find a way to deal with false positives; currently AncestryDNA does that, 23andMe does that to some extent, and FTDNA does not (to the best of my knowledge).

One of the issues that drove AncestryDNA to redesign its matching algorithm is tied to the problem of matching on small segments.

Let’s take a look at some data – all the data in this post is from a test I manage at 23andMe, and a profile I manage at FTDNA, from Family Finder, which is a transfer of the same 23andMe (V3) raw data of the same person tested at 23andMe. If you have tests at these companies you can download your own data and see for yourself where pile-ups are occurring.

23andMe customers have a tool available to them called “Countries of Ancestry” (CoA, formerly known as “Ancestry Finder”), which is a misnomer. It is a compilation of matches (with match data), composed in part from entries from the person’s DNA Relatives and others not in a person’s DNA Relatives list. Only 23andMe customers who have filled out the ancestry survey will find themselves in somebody else’s CoA list. Unfortunately the size of the CoA list is, similar to DNA Relatives, limited in the number of people allowed on the list (approximately a thousand.) Nevertheless, CoA is a list of over 1000 matching segments (as some person matches will be multi-segment matches) and thus useful for our purposes.

We can use the CoA data to plot out the “segments” (technically, half-identical regions) for each chromosome, stacking the segments to show overlaps and total coverage of the chromosome. Here is a plot for chromosome 1 showing 95 matching (to the test used in this post) segments as stacked rectangles:

 

Chr1 Segments from Countries of Ancestry

Fig. 1: Chromosome 1 matching segments for our test, as rectangles, from 23andMe Countries of Ancestry list.

 

 

(The red line is a count of overlaps, thus indicating shared regions for that portion of the chromosome, but since the segment blocks are offset vertically by slim white spaces to make the segments visibly distinguishable from each other the red line ends up being on a different vertical scale than the blocks.)

We can see from figure 1 that chromosome 1 is nearly completely covered by matches (from the 23andMe Countries of Ancestry list). We also notice that while the matching segments are not evenly distributed along the chromosome, the distribution is not so lopsided as to demonstrate a significant “cold region”, though the peak around 60Mbp (mega base-pairs) followed by the trough is starting to look suspicious. However, around 240Mbp there appears to be the beginning of a “pile-up” region.
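The overlap count drawn as the red line in figure 1 can be computed with a simple sweep along the chromosome. What follows is only a minimal sketch of that idea, not the code used to make the figures; the function names and the example coordinates are made up for illustration, and segments are assumed to be (start, end) pairs in base pairs.

```python
from typing import List, Tuple

def overlap_depth(segments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (position, depth) steps along the chromosome.

    depth is the number of segments covering the stretch that begins at
    position and runs to the next step.
    """
    events = []
    for start, end in segments:
        events.append((start, 1))   # a segment opens here
        events.append((end, -1))    # a segment closes here
    # Tuples sort by position first; at equal positions a close (-1)
    # sorts before an open (+1), so touching segments do not count as overlapping.
    events.sort()
    steps: List[Tuple[int, int]] = []
    depth = 0
    for pos, delta in events:
        depth += delta
        if steps and steps[-1][0] == pos:
            steps[-1] = (pos, depth)  # merge events at the same position
        else:
            steps.append((pos, depth))
    return steps

def max_depth(segments: List[Tuple[int, int]]) -> int:
    """Height of the tallest pile-up among the given segments."""
    return max((d for _, d in overlap_depth(segments)), default=0)

# Hypothetical segments (start, end), in base pairs.
example = [(10, 50), (20, 60), (30, 40), (70, 80)]
print(overlap_depth(example))
print(max_depth(example))
```

Plotting depth against position gives the kind of overlap curve described above; tall, narrow peaks are the candidate pile-up regions.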

Let’s look now at another chromosome, 6, which is known to have some regions that are troublesome:

 

Chr6 from Countries of Ancestry

Fig. 2: Chromosome 6 matching segments from 23andMe Countries of Ancestry match list.

 

Looking at figure 2 it is readily evident that there is an overwhelming number of chr6 matches in a single region of the chromosome, around 30Mbp. Chromosome 6 is noted for its HLA regions, parts of the chromosome which house genes vital to the human immune system. The segment pileups in these regions are suggestive, and may demonstrate a non-random phenomenon, probably a selection event in human evolution, perhaps even during historic (i.e., since the invention of writing) times.

Also noticeable is a desert of matching around the 100 Mbp region.

This lopsided distribution of matching segments should give us pause: what does it mean for our genealogy efforts “to match” in these cases?

What we are seeing in the figures in this post is how commonly distributed small fragments of chromosomes (or more precisely, sets of alleles, or haplotypes) can be in our society. The 23andMe CoA file only has 1000 people, out of the entire 23andMe database of 800,000 customers. What if everyone in the US were tested, and the CoA list not capped? We likely would see hundreds of thousands of “matches” in these regions of chr6. And it should be noted that in figure 2 some of those segments in the major pile-up areas are over 10cM in length (according to current linkage maps – more on that below.)

In doing genealogy, trying to make sense of these kinds of matches (in overly-common regions) is an exercise in futility. Even using the oft-stated standard of a minimum size of 7cM for a match that is likely to be identical-by-descent, identifying the most recent common ancestor (MRCA) with a match in these chr6 pile-up regions is not tractable, even if the segment surpasses the 7cM threshold; the MRCA could have lived dozens of generations ago.   One could propose that if a large enough sample of our population tested, and if select buried individuals could be exhumed and tested, then we could recreate partial genotypes of the individuals of entire communities of our ancestors from centuries ago. If this could be done then we could determine how common today’s shared chromosome regions were in any given community of our ancestors, and perhaps trace the rapid growth of particular families or clans. Testing on such a scale is unlikely in the near future, however, and such an effort will likely face other hurdles.

Additionally, this region in chr6 in particular demonstrates the problem of half-identical matching. The massive pileup is likely not due to a single physical 7cM to 10cM strand of chr6, but to the superposition of several smaller (say 0.5cM to 1cM) regions, haplotypes found on chr6 which are very common throughout the European population. By chance, these small fragments will superimpose (given the two copies of chr6 we all carry) to present larger half-identical regions (HIRs) which meet the matching threshold cutoffs (say 7cM).
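To make the half-identical idea concrete, here is a toy sketch, assuming unphased genotypes written as two-letter strings (one letter per chromosome copy, in no particular order). The function names and the genotypes are invented for illustration; no company's matching code is being reproduced here. Two people are half-identical at a marker when they share at least one allele, regardless of which physical strand that allele sits on.

```python
from typing import Sequence

def half_identical(genotype_a: str, genotype_b: str) -> bool:
    """True if two unphased genotypes (e.g. 'AG' and 'GG') share at least one allele."""
    return bool(set(genotype_a) & set(genotype_b))

def longest_half_identical_run(person_a: Sequence[str], person_b: Sequence[str]) -> int:
    """Length, in markers, of the longest unbroken run of half-identical markers."""
    best = run = 0
    for g_a, g_b in zip(person_a, person_b):
        run = run + 1 if half_identical(g_a, g_b) else 0
        best = max(best, run)
    return best

# Hypothetical genotypes at five consecutive markers.
person_a = ["AG", "CT", "AA", "GG", "CT"]
person_b = ["AA", "CC", "AG", "TT", "TT"]
print(longest_half_identical_run(person_a, person_b))  # run is broken at the 4th marker
```

Because the test never asks whether the shared alleles come from the same copy of the chromosome, a run can be stitched together from short common haplotypes that alternate between the two copies, which is how several small fragments can pass for one longer segment.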

Given that some of these regions in chromosomes are known, at least 23andMe filters out the most common ones before a match can make it into DNA Relatives.   This is an important distinction.

But what if we did not filter out these known regions?  And what if we were not limited to match lists of only 1000 people?

For this we turn to FTDNA’s Family Finder, which is not limited in the number of matches, and includes HIRs as small as 1cM.

Here are the matching chromosome 1 segments from Family Tree DNA’s Family Finder, for the same person as in the 23andMe CoA test:

FTDNA chr1 all segments

Fig. 3: Chromosome 1 matching segments from FTDNA Family Finder chromosome browser list.

 

First thing to note is that, while the FTDNA customer database for Family Finder is much smaller than 23andMe’s customer base, because FTDNA is reporting matching regions down to 1cM the Family Finder Chromosome Browser downloaded data set is quite a bit larger than the 23andMe CoA file. In figure 3, for chromosome 1, we are looking at over 1000 such matching regions.

It is quite clear that there are pile-up regions, sticking up like telephone poles in the forest of matches. These “matches” are occurring much, much more frequently than one would expect for a random distribution of chromosomes recombining in each generation.

Since the 23andMe CoA has a minimum cutoff of 5cM for a segment, we can filter the FTDNA data to include only those segments that likewise are at least 5cM in size. A plot of the FTDNA result for chromosome 1 ends up looking similar to the plot of matches from 23andMe:

Fig. 4: Chromosome 1 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

Figure 4 is similar to figure 1, and we notice the pile-up near the 230-240Mbp region. However, what stands out in fig. 4 is the pile-up region around 180Mbp, which is not evident in fig. 1. We are starting to see regions of chromosome 1 where there is excessive “matching”.
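The 5cM floor applied to the FTDNA data for figure 4 amounts to a one-line filter. The sketch below assumes each segment has been loaded into a small record; the field names and the example rows are hypothetical and do not reflect the column names of any company's download file.

```python
from typing import List, NamedTuple

class Segment(NamedTuple):
    match_name: str
    chromosome: int
    start_bp: int
    end_bp: int
    centimorgans: float

def apply_cm_floor(segments: List[Segment], floor_cm: float = 5.0) -> List[Segment]:
    """Keep only the segments at least floor_cm centiMorgans long."""
    return [s for s in segments if s.centimorgans >= floor_cm]

# Hypothetical rows, for illustration only.
example = [
    Segment("match A", 1, 1_000_000, 4_000_000, 1.8),
    Segment("match B", 1, 180_000_000, 188_000_000, 5.0),
    Segment("match C", 1, 230_000_000, 242_000_000, 9.3),
]
kept = apply_cm_floor(example)
print(len(example), "segments before,", len(kept), "after the 5cM floor")
```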

Let’s repeat this exercise for chromosome 6, first presenting all the FTDNA FF segments on chr6:

FTDNA chr6 all FTDNA segments

Fig. 5: Chromosome 6 matching segments from FTDNA Family Finder chromosome browser list.

 

The massive pile-up around 30Mbp in figure 2 is now even more massive. The FTDNA Family Finder Chromosome Browser data includes 1467 segments for chr6, a great share of them in the pileup regions. Besides the largest such pileup we can visually identify 3 others.

As before, if we filter the FTDNA data for only those segments at least 5cM in size we get a much smaller set, and when plotted we get:

FTDNA chr6 5cM floor

Fig. 6: Chromosome 6 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

In the filtered data, with only segments greater than or equal to 5cM, only the major pile-up in chr6 remains, and it is still imposing. There may be a pileup near the start of chr6 too, but it may not be statistically significant.

In the above figures it becomes evident that there is excessive matching in particular regions of chromosomes. Furthermore, the commonality of these matches suggests that attempting to incorporate this data into family history research will lead to futility.

As humans we are all related to each other, the question being how long ago any two individuals’ MRCA lived. In the US, people of colonial descent are likely multiply related to each other within the past 20 generations (500 years), and many times at that. Indeed, those of colonial descent will have very many 10th cousins and closer living in the US; the numbers of 5th through 10th cousins are likely in the millions for the colonials. So, the existence of relatives, distant to very distant, is not the question.
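The arithmetic behind those figures is easy to check. The sketch below counts pedigree positions (slots), not distinct people, since cousin marriages fill several slots with the same ancestor; the 25 years per generation is simply what “20 generations (500 years)” implies, and the function name is made up for illustration.

```python
YEARS_PER_GENERATION = 25  # implied by "20 generations (500 years)"

def pedigree_slots(generations_back: int) -> int:
    """Number of ancestor positions in a pedigree at a given generation."""
    return 2 ** generations_back

# nth cousins share ancestors n + 1 generations back, so 10th cousins
# share ancestors 11 generations back.
for generations in (11, 20):
    years = generations * YEARS_PER_GENERATION
    print(generations, "generations,", years, "years:", pedigree_slots(generations), "slots")
```

With over a million pedigree slots at 20 generations, and a colonial population far smaller than that, multiple relationships between any two people of colonial descent are to be expected.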

☞ The key to doing genetic genealogy with autosomal DNA is not finding a match, but rather finding genealogically tractable matches.

“Genealogically tractable” here means that given a reasonably exhaustive search of existing records, two family trees are documented sufficiently to support the conclusions in the pedigree, and the name of a most recent common ancestor can be found. Thus ancestors who lived before the era of documentation trails are not genealogically tractable. To go back further in time than a document trail can allow means we are entering the territory of the “ethnicity” or “ancestry” estimates provided by some companies. For many people in the world there are no records before the 19th or 18th centuries, while in a small set of locales documentation may go back to before the era of European colonization, but in these cases the records rapidly collapse to the nobility and royalty.

This is important because,

☞ Given lax enough matching criteria, one can have a DNA match with a person with whom your shared MRCA existed before records were kept for that MRCA.

Thus our goal is to find matches with whom we can possibly find the MRCA. A “match” that is based on population-common fragments of chromosomes is unlikely to be resolvable by genealogical methods, as such chromosome fragments would have been found in a large number of the contemporaries of our own pedigree ancestors. This is all the more true as we move back in time, when people found their mates nearby and not uncommonly married their cousins.

As noted above, 23andMe filters out some common regions (though they have not published the details) before a “match” can make it onto the DNA Relatives list.

In the fall of 2014 AncestryDNA implemented their own means to do something similar, presented by them with much fanfare (and here.) AncestryDNA’s new matching system has caused quite a bit of a stir, in part because customers lost some to many of their old matches. The tests I manage lost from 40% to over 90% of their previous matches. Part of the reason for this is AncestryDNA’s new “Timber” algorithm, which explicitly attempts to deal with the phenomenon of the matching of overly-common chromosome regions, the existence of which is undeniable, as we see in the plots in this post.

Without addressing the matching to overly-common DNA we end up with very large lists of matches with whom we will never find the MRCA, which was the case previously with AncestryDNA and is possibly true still at FTDNA.

It may turn out that the Timber algorithm is too aggressive, and some genealogically informative matches are being lost in the new AncestryDNA matching system. This may happen if an inherited region of a chromosome is bisected by many population-wide common segments, and, post-Timber filtering, this inherited region is broken into remaining segments too small to make the minimum threshold to declare a “match”. However, for AncestryDNA to correct this will probably require developing even more sophisticated filtering algorithms, about which I may write later in another post.

Having established the requirement for filtering out pileup regions, I want to stress again that these regions are not just small, 1 to 2cM, portions of chromosomes. Some will be much larger.
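As a rough illustration of what such filtering could look like, here is a naive sketch that drops any segment lying mostly inside a known pile-up region. This is emphatically not AncestryDNA's Timber algorithm nor 23andMe's unpublished method; the function names, the 50% cutoff and the region coordinates are all assumptions made up for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp)

def overlap_fraction(segment: Interval, region: Interval) -> float:
    """Fraction of the segment's length that lies inside the region."""
    seg_start, seg_end = segment
    reg_start, reg_end = region
    overlap = max(0, min(seg_end, reg_end) - max(seg_start, reg_start))
    length = seg_end - seg_start
    return overlap / length if length > 0 else 0.0

def drop_pileup_segments(segments: List[Interval],
                         pileup_regions: List[Interval],
                         max_fraction: float = 0.5) -> List[Interval]:
    """Keep segments whose overlap with every pile-up region is at most max_fraction."""
    kept = []
    for segment in segments:
        worst = max((overlap_fraction(segment, r) for r in pileup_regions), default=0.0)
        if worst <= max_fraction:
            kept.append(segment)
    return kept

# Hypothetical pile-up region and segments on one chromosome.
pileups = [(25_000_000, 35_000_000)]
segments = [(28_000_000, 33_000_000), (90_000_000, 100_000_000), (30_000_000, 50_000_000)]
print(drop_pileup_segments(segments, pileups))
```

Note that the third example segment survives because most of its length lies outside the pile-up; a filter based on physical overlap alone does not care whether a segment is 2cM or 12cM.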

In general in regards to matching of genotyped individuals, as a standard, or best practice, matches ought not be declared on small (less than 7cM) regions of unphased genotype datasets, a subject discussed at length by various authors and bloggers and which I won’t repeat here. Phased genotypes can be matched on smaller regions with greater confidence, though I would not do so for regions less than about 5cM. (And if you’re interested in current academic discussions of identical-by-descent detection try here, here, here, and here). Unfortunately FTDNA insists on reporting small segments on their unphased genotype data sets, which misleads the customer in regards to how significantly they match other people.

Given all of that, we conclude this:

☞ Even after filtering out HIRs smaller than 5cM, DNA matching services such as FTDNA or 23andMe should filter out HIRs greater than 5cM that are appearing too commonly in the population, so that the end user receives a genealogically tractable match list.
 ❦

 

Ancillary to the above discussion and needing to be addressed especially in light of how FTDNA’s Family Finder reports matches, and since I mentioned it above, I want to touch briefly on the concept of “size” when it comes to autosomal DNA segments. This is important because I often see commenters in forums referring to a shared HIR as being ‘n.nn’ centiMorgans, sometimes to those 2 decimal places, probably because they check their FTDNA Chromosome Browser results which report matching in centiMorgans to two decimal places.

FTDNA should not give HIR lengths, individual or in total, in cMs to two decimal places!

The diagrams in this post use base pairs (or millions of base pairs) as the coordinate to place the matching HIRs over the chromosome length. In genetics, the concept of linkage disequilibrium mapping brings about the need to map the physical molecular position to a frequency space, with the frequency unit being a Morgan but in practice the centiMorgan is used.

Each human chromosome has by experimental procedure been “mapped”, associating a physical coordinate onto a scale measured in centiMorgans.

What many enthusiasts in genetic genealogy may not know is that the thought processes and concepts associated with these linkage maps have to deal with many issues; some of these may impact our understanding of the population-wide matching segments which are the topic of this post.

Referencing “cool spots” and “hot spots” can imply that coordinates from a linkage map be taken uncritically; however, the very large sets of genotyped samples being collected by 23andMe and AncestryDNA may be used in the future to create a better understanding of recombination probabilities along the chromosomes.

There is much variability in the end products of meiosis, and in regards to recombination and creating maps of chromosomes the following issues are topics of discussion:

  • age (e.g., here);
  • ethnicity;
  • gender (if comparing a child to a parent);
  • unique genetics (e.g., here).

The measurements for segment lengths given by 23andMe or FTDNA or gedmatch.com are averages, specifically gender averages.

For further (and very technical) reading on this subject:

Identifying recombination hotspots using population genetic data

Genetic Analysis of Variation in Human Meiotic Recombination

Enhanced genetic maps from family-based disease studies: population-specific comparisons

Variation in Human Recombination Rates and Its Genetic Determinants

Genetic Control of Hotspots

 

☞ The bottom line about segment lengths is this: Round the length of chromosome segments to whole numbers, and remember additionally that the size of a segment is only an estimate and is accompanied by non-trivial error bars.

 

 ❦ ❦

 

 

Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and “Enrichment”

 

The questions we hope to answer are these:

Can we use the surnames provided (either in lists or pedigrees) by our DNA matches to:
  1. Discover the names of ancestors about whom we have no a priori knowledge?
  2. Confirm the name of ancestors about which we may have doubts?

 

This is a very complex subject, so this is a lengthy post to wade through – you are forewarned!

 

Making the most of the direct to consumer (DTC) DNA services for doing genealogy is currently a challenge, with no single company offering everything an aspiring (professional or dilettante) genetic genealogist requires to accomplish many goals.   This quandary includes dealing with the cultural attachments to our DNA – our names.

One of the central theories of genealogy as practiced in the English speaking world is that surnames matter.   This turns out to be true enough, and is a reflection of the deep patriarchy embedded in western society.

Yet for many of us with ancestry from other nations outside the Anglosphere this surname-centric view of families will not be of much use.   I for instance have half my pedigree that is from a culture where patronymics were the custom until the time of my grandfather’s emigration; I am only the second generation born with the surname I carry.

Still, for the English genealogy world the surname ranks as one of the more important concepts.   For example – the field of one-name studies, while not absolutely wedded to the idea, centers on the name being studied as a surname.   Genealogy database programs such as Family Tree Maker are structured around a naming paradigm in which there are surnames.  Your local courthouse records are probably indexed by surname, and so on.

A Tree of Names, but Which Ones Belong?

 

In the DNA view of genealogy as practiced today surnames still are center stage – at FTDNA Family Finder there is a field for surnames, at AncestryDNA the match page has a section dedicated to pedigree surnames and surnames in-common, and at 23andMe a customer can enter surnames into one’s profile, which will then appear in DNA Relatives and upon which one can search or sort one’s matches.

23andMe goes one step further, though, and offers what they call “Surname View”. This is a ranking of surnames as found in a customer’s DNA Relatives list.   The surnames are ranked by an “enrichment” score.   23andMe defines this value as:

Enrichment is computed via a one-tailed binomial test. The 23andMe-wide frequency of a given surname is the reference frequency. The number of occurrences of the surname among your matches and the total number of surnames among your matches are the counts in the binomial test. This results in a p-value; we then report -1.0 * log10(p), so the bigger the number is, the more unusual it is that it was at such high frequency among your matches.

 

For the purposes of this discussion I’ll use the term “enrichment” in the same sense as 23andMe. Their definition of enrichment is an interesting idea, though I have some questions about the validity of using a binomial test.  In a future post I may delve into more technical details about this issue.   There is one very important point to make:

☞ How often a name occurs (in your match list) is not sufficient to define the importance of having a match with that surname in their pedigree.   Rather, we need to take into account the likelihood (based on the population one is testing) of the particular name occurring at random.
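The enrichment definition quoted above from 23andMe can be sketched in a few lines. This is my reading of that quoted description, not 23andMe's actual code: the function names are made up, the reference frequency in the example is invented, and the tail probability is summed in log space (using only the standard library) so that large match lists do not underflow.

```python
import math

def log_binomial_pmf(k: int, n: int, p: float) -> float:
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def enrichment(count: int, total: int, reference_freq: float) -> float:
    """-log10 of the one-tailed p-value P(X >= count).

    count          occurrences of the surname among your matches
    total          total number of surnames among your matches
    reference_freq background frequency of the surname (0 < reference_freq < 1)
    """
    if count <= 0:
        return 0.0  # P(X >= 0) is 1, so there is no enrichment
    logs = [log_binomial_pmf(k, total, reference_freq) for k in range(count, total + 1)]
    peak = max(logs)
    log_p = peak + math.log(sum(math.exp(v - peak) for v in logs))  # log-sum-exp
    return max(0.0, -log_p / math.log(10))

# Hypothetical numbers: a surname seen 5 times among 1000 surnames,
# with an invented background frequency of 1 in 10,000.
print(round(enrichment(5, 1000, 0.0001), 2))
# The same count for a far more common name (1 in 100) scores much lower.
print(round(enrichment(5, 1000, 0.01), 2))
```

The second call makes the point in the pointer above: the same raw count is unremarkable for a common name and striking for a rare one.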

 

Recognizing that 23andMe is just one pond in which to fish, many of us have tested elsewhere, such as at AncestryDNA.    However, AncestryDNA does not provide any analysis regarding name frequencies in a customer’s set of matches.   But, with a little bit (or a lot) of cleverness we can collect the surname data from our matches at AncestryDNA, found on each match page in the left hand column titled “Surnames (10 generation pedigree)”.

With this data I can then calculate what 23andMe calls the “Enrichment” for our AncestryDNA set of surnames from matches, using the US Census surname frequency data in place of 23andMe’s “23andMe-wide frequency”, as I don’t have access to the entire database of AncestryDNA tests (if only!) to calculate an equivalent AncestryDNA-wide frequency table.   Given that the tested person in this case has all colonial American ancestry, and because the AncestryDNA customer base to date is overwhelmingly from the US with the “colonials” being a significant share of the customer base, US Census data for expected values of name occurrence is presumed to be a good approximation of an AncestryDNA-wide data set.
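Before any enrichment can be calculated, the surnames collected from the match pages have to be tallied. The sketch below shows one way to do that, under two assumptions of mine that are not stated in the procedure above: each surname is counted at most once per match, and spellings are compared case-insensitively. The function name and the sample lists are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List

def tally_surnames(matches: Iterable[List[str]]) -> Counter:
    """Count, for each surname, the number of matches whose list contains it."""
    counts: Counter = Counter()
    for surnames in matches:
        # Normalise spelling and drop blanks; the set counts a name once per match.
        unique = {s.strip().casefold() for s in surnames if s.strip()}
        counts.update(unique)
    return counts

# Hypothetical surname lists copied from three match pages.
matches = [
    ["Chiles", "Bryan", "chiles "],
    ["Bryan"],
    ["Woodson", ""],
]
counts = tally_surnames(matches)
print(counts.most_common())
print(sum(counts.values()), "surname occurrences in total")
```

Each count, together with the total and a background frequency for the surname taken from a reference table such as the US Census list, supplies the inputs the binomial test needs.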

A concrete example, from two tests I manage, of the same person, one test at 23andMe and the other at AncestryDNA:

Here are all 226 entries under “Surname View” for this particular test:

 

23andMe Surname View

Surname       23andMe Count  23andMe Enrichment
Roberts       25             61
Bryan         11             50
Chiles        5              49
Webb          17             48
Adkins        8              42
Green         24             42
Hoskins       6              42
Smythe        5              41
Dowell        5              41
Peyton        5              40
Cole          14             37
Hinds         5              37
Beasley       7              37
Griffith      10             36
Forster       5              36
Wentworth     5              33
Parker        19             33
Henley        5              33
Holman        6              33
Henson        6              33
Schmid        5              33
Gibbs         8              32
Wilcox        8              31
Drake         9              30
Wyatt         7              30
Vaughn        8              29
Rowe          8              29
McDaniel      8              29
Allison       7              29
Rush          6              29
Turner        17             29
Dick          5              29
Bishop        11             29
Best          6              29
Garner        7              28
Moyer         5              28
Powell        13             28
Roe           5              28
Mackey        5              27
Bowen         8              27
Edwards       16             27
Bates         9              26
Stout         6              26
Moon          6              26
Garrison      6              26
Dudley        6              26
Sharpe        5              26
Harper        10             26
James         12             26
Chambers      8              25
Ballard       7              25
Anderson      30             24
Gibson        11             24
Christian     6              24
Valentine     5              24
Pitts         5              24
Lane          10             23
Hale          8              23
Scott         19             21
Ward          14             21
Preston       6              21
Hayden        5              21
Duke          5              21
Thompson      26             21
Wallace       10             20
Lewis         21             20
Shepherd      6              20
Davenport     6              20
McMillan      5              20
Chapman       10             20
Berry         10             20
McDonald      10             19
Abbott        6              19
Yates         6              19
Dennis        6              19
Robertson     11             19
Orr           5              19
White         26             19
Stephens      9              19
Woods         9              18
Kemp          5              18
Wilson        29             18
Stephenson    5              18
Rogers        14             17
Harrison      10             17
Booth         6              17
Ingram        5              17
Warren        9              16
Fleming       7              16
Barrett       7              16
Clark         23             16
Chandler      6              15
Collier       5              15
Dyer          5              15
Brooks        10             15
Walker        19             14
Allen         19             14
Cox           12             14
Stone         9              14
Parsons       6              14
Mullins       5              14
Campbell      17             13
Foster        10             13
Mason         9              13
Howell        7              13
Miles         5              13
Thomas        20             13
Day           7              12
Franklin      6              12
Love          5              12
Matthews      7              12
Miller        33             11
Cook          12             11
Austin        6              11
Russell       9              10
Wells         8              10
Reynolds      8              10
Lawrence      7              10
Keller        6              10
Gilbert       6              10
Sherman       5              10
Pratt         5              10
Hubbard       5              10
Young         15             9
Evans         13             9
Weaver        6              9
Fuller        6              9
Sutton        5              9
Lowe          5              9
Garrett       5              9
Fletcher      5              9
Fitzgerald    5              9
Dawson        5              9
Dunn          7              9
Hoffman       6              8
Davis         28             8
Kennedy       8              8
Rice          7              8
Owens         6              8
Hopkins       6              8
Carter        11             8
Hawkins       6              8
Bradley       6              8
Willis        5              8
Dean          5              8
Jones         34             8
Murphy        11             8
Hall          15             7
Wright        14             7
Hughes        8              7
Fisher        8              7
Sanders       7              7
Patterson     7              7
Murray        7              7
Jackson       15             7
Johnston      7              7
Griffin       7              7
Bryant        6              7
Riley         5              7
Mueller       5              7
May           5              7
Mills         7              7
Taylor        21             6
Watson        8              6
Gray          8              6
Meyer         7              6
Long          7              6
Phillips      10             6
Graham        7              6
Jenkins       6              6
Armstrong     6              6
Andrews       6              6
Greene        5              6
Hunt          7              6
Baker         13             5
Morgan        9              5
Bell          9              5
Bailey        8              5
Wheeler       6              5
Spencer       6              5
Marshall      6              5
Alexander     6              5
Payne         5              5
Henry         5              5
Elliott       5              5
Duncan        5              5
West          6              4
Smith         56             4
Robinson      12             4
Stewart       10             4
Nelson        10             4
Wood          9              4
Morris        8              4
Cooper        8              4
Reed          7              4
Martin        19             4
Price         6              4
Myers         6              4
Coleman       6              4
Tucker        5              4
Nichols       5              4
Knight        5              4
Carr          5              4
Arnold        5              4
Williams      28             4
Johnson       33             3
Adams         11             3
Butler        6              3
Schmidt       5              3
Perry         5              3
Palmer        5              3
Jordan        5              3
Hunter        5              3
Ford          5              3
Brown         28             2
Ross          5              2
Richardson    5              2
Howard        5              2
Bennett       5              2
Collins       7              2
Harris        11             1
Hill          9              1
King          8              1
Mitchell      6              1
Moore         11             0
Lee           9              0
23andMe Surname View on 24 Jan 2015, for same individual as tested at AncestryDNA

 

Here are the top (i.e., the most “enriched”) 226 (a number selected to be the same quantity as that from 23andMe) entries from the list of all surnames in matches at AncestryDNA for this same person:

 

AncestryDNA "Surname View"

Surname | AncestryDNA Count | "Enrichment"
Parke | 38 | 34.41
Batte | 23 | 33.33
Cocke | 25 | 31.83
Pridmore | 23 | 27.49
Barbara | 28 | 26.86
Woodson | 54 | 26.75
Crownover | 26 | 26.41
Reade | 21 | 23.67
Browne | 56 | 23.39
Isham | 28 | 22.69
Bartlett | 78 | 22.33
Bolling | 32 | 22.31
Prigmore | 15 | 21.53
Wood | 232 | 21.00
Eppes | 17 | 20.99
Chiles | 28 | 20.92
Fitzrandolph | 11 | 20.84
Cresson | 12 | 20.68
Poythress | 17 | 20.19
Taliaferro | 25 | 19.91
Stuart | 67 | 19.82
Mellott | 26 | 19.21
Cooke | 62 | 19.15
Andersson | 16 | 19.00
Ironmonger | 10 | 18.59
Elliot | 32 | 17.63
Pettypool | 9 | 17.26
Pleasants | 18 | 16.94
Stillwell | 29 | 16.91
Jonsson | 15 | 16.79
Demarest | 20 | 16.74
Bryan | 79 | 16.47
Brashears | 18 | 16.45
Vanderveer | 16 | 16.28
Piles | 10 | 16.06
Woodward | 61 | 15.99
Owen | 82 | 15.81
Cock | 10 | 15.61
Nalle | 10 | 15.61
Paine | 29 | 15.25
Glascock | 16 | 14.94
Hubbard | 86 | 14.82
Griffith | 93 | 14.66
Fielding | 24 | 14.62
Anna | 13 | 14.60
Hix | 24 | 14.45
Vancleef | 11 | 14.29
Olofsson | 8 | 14.24
Whitehead | 64 | 14.16
Moredock | 9 | 14.11
Neale | 20 | 14.08
Goad | 28 | 13.90
Vawter | 15 | 13.75
Doyne | 9 | 13.71
Clarke | 79 | 13.67
Nilsson | 18 | 13.57
Read | 37 | 13.41
Ball | 90 | 13.39
Moor | 16 | 13.36
Frances | 16 | 13.13
Chamberlain | 48 | 13.10
Grymes | 9 | 13.00
Mumford | 23 | 12.95
Dickenson | 20 | 12.94
Cossart | 7 | 12.81
Tuttle | 47 | 12.78
Bird | 53 | 12.76
Symons | 17 | 12.71
Tschudi | 8 | 12.62
Jennings | 95 | 12.59
Tarpley | 18 | 12.41
Blankenbaker | 11 | 12.40
Denton | 46 | 12.40
Markham | 31 | 12.39
Predmore | 13 | 12.05
Wheeler | 118 | 12.04
Meriwether | 14 | 11.98
Dudley | 52 | 11.86
Lanier | 38 | 11.68
Brereton | 12 | 11.65
Wynne | 26 | 11.64
Harrison | 152 | 11.51
Wallis | 28 | 11.49
Schenck | 21 | 11.45
Drake | 71 | 11.43
Ragsdale | 29 | 11.42
Pride | 22 | 11.36
Cawood | 11 | 11.31
Graves | 88 | 11.31
Clements | 53 | 11.29
Agnes | 10 | 11.23
Langston | 36 | 11.18
Overton | 35 | 11.13
Morton | 70 | 11.10
Worsham | 20 | 10.96
Muller | 43 | 10.92
Brevard | 11 | 10.85
Crew | 17 | 10.85
Johnston | 119 | 10.82
Larsson | 11 | 10.75
Tydings | 8 | 10.70
Tarleton | 12 | 10.70
Enyard | 6 | 10.66
Petersson | 6 | 10.66
Roscow | 6 | 10.66
Lyon | 43 | 10.63
Osborne | 74 | 10.57
Vaughn | 88 | 10.57
Brooke | 18 | 10.53
Larzelere | 8 | 10.47
Parsons | 74 | 10.45
Waggoner | 30 | 10.29
Tandy | 13 | 10.29
Clayton | 65 | 10.28
Armistead | 14 | 10.28
Oldham | 28 | 10.27
Daniel | 80 | 10.25
Cathey | 23 | 10.24
Eva | 9 | 10.20
Sharp | 81 | 10.16
Rawlings | 24 | 10.14
Roosa | 12 | 10.13
Fowke | 6 | 10.09
Vivion | 6 | 10.09
Antram | 7 | 10.05
Hogg | 21 | 9.92
Newberry | 28 | 9.91
Darnall | 12 | 9.90
Henley | 35 | 9.82
Beall | 23 | 9.81
Rachel | 13 | 9.72
Stout | 54 | 9.66
Earle | 22 | 9.65
Meacham | 19 | 9.59
Dabney | 20 | 9.58
Ballenger | 17 | 9.54
Hegeman | 9 | 9.51
Garnet | 7 | 9.50
Pledge | 7 | 9.50
Fowler | 97 | 9.50
Ely | 27 | 9.50
Harwood | 25 | 9.45
Royall | 11 | 9.39
Suddarth | 9 | 9.39
Allen | 313 | 9.39
Brouwer | 13 | 9.37
Bull | 25 | 9.37
Williamson | 103 | 9.37
Persson | 12 | 9.36
Faure | 8 | 9.33
Westcott | 18 | 9.32
Scripture | 7 | 9.27
Calvert | 31 | 9.26
Stedman | 16 | 9.23
Boone | 60 | 9.23
Mcgehee | 18 | 9.22
Cloyes | 6 | 9.22
Simcock | 6 | 9.22
Standley | 16 | 9.16
Craige | 7 | 9.05
Salling | 7 | 9.05
Pope | 67 | 9.04
Warren | 125 | 9.03
Maria | 15 | 8.98
Mendenhall | 23 | 8.97
Crow | 36 | 8.96
Poor | 14 | 8.95
Waddy | 11 | 8.88
Hertzel | 6 | 8.88
Kip | 6 | 8.88
Vannuys | 6 | 8.88
Field | 37 | 8.85
Steel | 19 | 8.83
Threlkeld | 12 | 8.81
Mershon | 11 | 8.75
Hoge | 13 | 8.75
Follansbee | 8 | 8.74
Scudder | 15 | 8.74
Pendleton | 29 | 8.73
Hawkins | 115 | 8.72
Baskett | 11 | 8.68
Crawford | 130 | 8.66
Griswold | 23 | 8.65
Alice | 7 | 8.65
Strother | 19 | 8.64
Street | 32 | 8.64
Alden | 17 | 8.60
Aldin | 5 | 8.59
Efland | 5 | 8.59
Mcilhaney | 5 | 8.59
Thomasen | 5 | 8.59
Loomis | 27 | 8.54
Vanbibber | 9 | 8.53
Brewster | 31 | 8.47
Stryker | 13 | 8.37
Hale | 80 | 8.36
Erwin | 34 | 8.33
Hobart | 13 | 8.33
Mccombe | 6 | 8.30
Bromwell | 7 | 8.30
Bushnell | 16 | 8.30
Hull | 50 | 8.28
Rogers | 210 | 8.26
Irvine | 21 | 8.25
Runyon | 21 | 8.25
Sprague | 34 | 8.25
Garland | 37 | 8.24
Tinsley | 27 | 8.23
Pridemore | 12 | 8.17
Buys | 9 | 8.17
Bennet | 14 | 8.13
Gooch | 23 | 8.11
Battaile | 5 | 8.11
Faut | 5 | 8.11
Hogshead | 5 | 8.11
Chappel | 12 | 8.08
Randolph | 48 | 8.07
Croasdale | 6 | 8.06
Fiske | 14 | 8.02
Esther | 7 | 7.99
Basye | 8 | 7.92
Straughan | 8 | 7.92
Wyatt | 56 | 7.90
Haile | 15 | 7.89
Low | 23 | 7.87
Harbour | 14 | 7.87
As of 24 Dec 2014, AncestryDNA matches' surnames-in-pedigrees analyzed similarly to 23andMe "Surname View". The top 226 "enriched" surnames are shown, the same number of surnames as in the total Surname View at 23andMe, but which are only 3% of all the surnames from the AncestryDNA match list.

One thing is very clear upon comparing the two lists: they are not the same.

They are not even close.

The surnames in the 23andMe list do show up on the larger AncestryDNA list (not just the top 226), but scattered across the nearly 10,300 names in the much larger AncestryDNA list.

What also should stand out is that the counts of matches (second column) are much larger for the AncestryDNA data set than for the 23andMe data set. While the number of customers of each service is about equal, those of us who have tried to use 23andMe for genealogy have discovered quite quickly that many DNA Relatives there are incognito, and even a DNA Relative with a public profile usually lists very few surnames. Add to this that 23andMe limits the DNA Relatives list to only the top matches (for those customers with more than 1000 matches, which would include many Americans) and we discover that the 23andMe database offers relatively few data points with regard to surnames.

In the above example, the testee has 4x as many matches at AncestryDNA as at 23andMe, and before “autosomalgeddon” at AncestryDNA last fall the match count was twice that. However, even more significant is that the average number of people in the pedigrees of matches at AncestryDNA is over 50, while at 23andMe the average number of surnames per match is but a small fraction of that. Because of the smaller data set of matches with ancestral names at 23andMe, it is possible for a few diehard genealogists, who test multiple members of their families, to skew the Surname View results by putting long lists of surnames in their profiles. For example, the above 23andMe customer matches 3 siblings, each of whom has a surname list of approximately 215 names! This one family significantly skews the Surname View results for the example test I am discussing. In this case, this family is on the first page of matches (that is about 1/12th the total) in DNA Relatives, and accounts for 645 name occurrences on that page. Everybody else on that page (95 people) accounts for just 413 more name occurrences.

 

☞ Caution: Outliers, such as extremely deep pedigrees or very long name lists, will skew results on relatively small datasets, such as those at FTDNA and 23andMe.
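
The arithmetic behind this caution can be checked directly. The short Python sketch below uses only the counts quoted above for the first page of DNA Relatives (3 siblings with roughly 215 surnames each, and 95 other matches contributing 413 name occurrences). The variable names are mine and purely illustrative.

```python
# Counts quoted in the text for the first page of DNA Relatives at 23andMe.
sibling_profiles = 3
names_per_sibling = 215          # approximate length of each sibling's surname list
other_profiles = 95
other_name_occurrences = 413

sibling_name_occurrences = sibling_profiles * names_per_sibling   # 3 * 215 = 645
total_occurrences = sibling_name_occurrences + other_name_occurrences

# Fraction of the page's surname occurrences, and of its profiles, owed to one family.
share_of_names = sibling_name_occurrences / total_occurrences
share_of_profiles = sibling_profiles / (sibling_profiles + other_profiles)

print(f"One family: {share_of_profiles:.1%} of profiles, "
      f"{share_of_names:.1%} of name occurrences")
```

In other words, about 3% of the profiles on that page supply about 61% of its surname occurrences.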

 

Given that we have two different lists of “enriched” surnames for the same person, which one is more informative? (Note that I did not write “accurate”.)

After a couple of years of working on developing the pedigree for the subject of this post, with plenty of surprises along the way, it turns out the AncestryDNA surname enrichment list better reflects what I know of the pedigree. This is especially true in the top 25 names: 4 of the AncestryDNA enriched names are pedigree surnames, while only one name (“Green”) in the top 25 of the 23andMe enrichment list is found in the developed pedigree, and that name is questioned by some researchers of that ancestral family.

Of the entire 23andMe list of 226 surnames, only 11 are found in the developed pedigree.   For the top 226 from AncestryDNA it also turns out there are 11 names found in the testee’s pedigree.   The only name in common between the two groups of 11 is “Johnston”, which is good as that is the family name of the testee’s grandfather.   It should be noted that the entire list of AncestryDNA matches’ pedigree-names that were included in this analysis total over 10,000, and names in the rest of the pedigree of the testee are found on that list but farther down than #226.

Which raises the next question – what about all those other names on the list(s), that are not ancestral names of the person tested?

Intuitively we recognize that our ancestors can have many descendants, and in the society of which we’re a part a daughter will lose her maiden name and pick up her husband’s surname. Thus our cousins descended from these great-grand aunts and female cousins will carry surnames that are not part of our direct pedigree lines. This is both a bane and a boon. It is a bane because unless we’ve carried out very extensive descendancy research we will not recognize the names of our genealogical cousins. On the other hand, all these names are a boon because until the late 20th century people found their mates relatively close by, and families often intermarried and migrated together, making certain names more associated with each other. From the example test in this post, on the AncestryDNA enriched surname table the top entrant is “Parke”, a name often found in colonial pedigrees because the 17th century founders in America (e.g., purportedly a Dr. Roger Parke of early New Jersey) had many male descendants, including quite a few who ended up in what is today West Virginia and Kentucky. The surname is thus associated with many families from these areas, which is the case here: our subject’s father’s ancestors lived in WV and KY, where many Parke/Park/Parks families lived.

As we work our way to an answer to the two questions at the top of this post it is important to realize that the use of DNA in genealogy is rapidly expanding, and as a larger share of the population tests at these companies we will see larger data sets that will give us an opportunity to be more certain of whatever we find.

So, in regard to question #1 posed at the beginning of this post, the best answer currently is “probably.”

In this particular case, in the example DNA test I am discussing in this post, only through DNA testing has a secret been revealed, an NPE (or possibly secret adoption.) It turns out that 3 surnames in the top 25 of the AncestryDNA surname enrichment table are ancestors of the biological parent of this surprise.   Those 3 names are in 4th, 7th, and 13th place on that list of enrichment (out of 10,300 surnames.)    It may still prove to be a coincidence, but finding the surnames of 2nd and 3rd great grandparents high on such a list ought not be surprising.  In this case the surnames are unusual enough to stand out, but not so ultra-rare in our society as to not make it onto the US Census list (which only includes names which occur more than 100 times in the US).   Extremely rare surnames are also, by definition, unlikely to show up in a list of DNA matches.

Furthermore, the 3rd entry on that AncestryDNA enrichment list, “Cocke”, is probably the ancestral surname of “Cox”, and the pedigree of the tested person most likely (based on un-vetted trees) has two entrants with that surname.   Number 21 on the above AncestryDNA enrichment list, “Stuart”, is the maiden name of a known 4th great grandmother.

My experience in doing this example is that nearly all the surnames in the pedigree of the person I’ve tested appear in the upper quartile of the AncestryDNA enrichment list, with many of them in the upper 10%.

So, without a priori knowledge of the names of unknown ancestors, how would one make use of an “Enrichment” list? Based on my experience, I recommend starting at the top of an enrichment list and working down at least a couple of hundred names (given a dataset as large as AncestryDNA), looking for any connections the names may have with one’s known ancestors. Speculative family trees are not a bad thing, as long as they are treated as such.

☞ Mining surname data is a source of leads for genealogical research, and cannot stand apart from the exhaustive search for evidence required in sound family history practices.

In an ideal case a newly discovered close DNA cousin will have their full tree available to you to study, to find the most recent common ancestor, but that is not always the case.   Sometimes all one has is a list of surnames, or maybe a bare-bones pedigree with only names and no other useful information. If you discover a not-previously-known not-too-distant (say 2nd to 4th) cousin, and that DNA cousin lists some ancestral surnames, it is worth going through each of their surnames looking at an enrichment list of your own DNA test, to see if one or more stand out as occurring disproportionately frequently in your set of matches.
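
As a minimal sketch of that lookup, the Python below ranks an enrichment list and reports where each of a cousin's listed surnames falls. The six rows are copied from the top of the AncestryDNA table above; the cousin's surname list and the function names are hypothetical, and a real run would load the full list of roughly 10,300 surnames.

```python
# (surname, count, enrichment) rows copied from the top of the AncestryDNA table.
enrichment_rows = [
    ("Parke", 38, 34.41),
    ("Batte", 23, 33.33),
    ("Cocke", 25, 31.83),
    ("Pridmore", 23, 27.49),
    ("Barbara", 28, 26.86),
    ("Woodson", 54, 26.75),
]

def rank_surnames(rows):
    """Map lower-cased surname -> (1-based rank, count, enrichment), most enriched first."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return {name.lower(): (rank, count, score)
            for rank, (name, count, score) in enumerate(ordered, start=1)}

def look_up(cousin_surnames, ranks):
    """Return (surname, rank or None) for each surname the cousin lists."""
    return [(name, ranks.get(name.lower(), (None,))[0]) for name in cousin_surnames]

ranks = rank_surnames(enrichment_rows)

# Hypothetical surname list taken from a newly found cousin's profile.
cousin_surnames = ["Woodson", "Cocke", "Zimmerman"]
for name, rank in look_up(cousin_surnames, ranks):
    print(name, "not in list" if rank is None else f"rank {rank} of {len(ranks)}")
```

The lookup is an exact match after lower-casing, so spelling variants (Cocke and Cox, for instance) would have to be checked by hand.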

Then what of question #2, where we are desiring to validate or support a theory about an ancestor’s possible surname? (In these cases we are often looking for the maiden names of our married ancestresses.)

In my opinion, one of the better uses of an enrichment list is to act as a check against the human tendency to fixate on an idea or observation.   Given the ubiquity of pareidolia I value any means to keep me from seeing what is not there.   In the field of genealogy that includes becoming overly sensitive to a particular name, whereby we exaggerate the importance of that particular name any time we come across it.

In the example tested person in this post, after looking at many matches at the various companies I had become fixated on certain names, but looking at the AncestryDNA enrichment list I discovered that those names are not occurring any more frequently than one would expect from the population at large.

In another application of the enrichment table: I’m researching a great grandmother (of the tested person in this post) whose maiden name is known from census records and marriage records, but whose parents have escaped identification. In this case the lack of matching surnames of the DNA matches at 23andMe and AncestryDNA, as evidenced by the low enrichment value of the great grandmother’s maiden name, and the low enrichment value of the maiden name of the only woman I can find that I hypothesized could be her mother,  supports my idea that the great grandmother was adopted (or a foster child who took the name of a couple that housed her for a while.)  This hypothesis I had previously generated based on census data where the great grandmother spends her teenage years and early 20’s living with two different families not of her own surname, and the surname data suggests that I am on the correct path.   I’ve further had tested at AncestryDNA another great grandchild of this mysterious woman, and analysis of surnames of his matches concurs with the conclusions I’m drawing from his 2nd cousin’s match data.

Which brings up a very useful strategy, whether one is doing segment chasing or looking at surnames or just trawling for unknown cousins:

☞ Testing additional known relatives, especially 2nd cousins, can provide concurring data, or prevent one from reaching a wrong conclusion based on a single test.
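
One simple way to look for concurring data, sketched here in Python, is to intersect the top of two relatives' enrichment lists. This is an illustration rather than the analysis used for this post: tester A's rows come from the table above, tester B's values are invented, and the function names and the default cutoff of 200 are assumptions.

```python
def top_surnames(rows, n):
    """Return the set of the n most enriched surnames from (surname, enrichment) rows."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return {name for name, _ in ordered[:n]}

def concurring_surnames(rows_a, rows_b, n=200):
    """Surnames that appear in the top n of both relatives' enrichment lists."""
    return top_surnames(rows_a, n) & top_surnames(rows_b, n)

# Tester A: a few rows from the table above.
tester_a = [("Parke", 34.41), ("Batte", 33.33), ("Cocke", 31.83), ("Woodson", 26.75)]
# Tester B: invented values, for illustration only.
tester_b = [("Woodson", 19.0), ("Parke", 12.5), ("Nilsson", 11.0), ("Cocke", 2.1)]

print(sorted(concurring_surnames(tester_a, tester_b, n=3)))
```

Second cousins share only one pair of great-grandparents, so only part of the two lists should be expected to overlap. A surname that is high on one list and absent from the other is a lead to check, not a contradiction.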

Before I leave question #2,  when discussing the low enrichment scores I must throw in another caveat:

☞ For very common surnames (e.g. Smith, Williams, Jones, etc.) the law of large numbers will affect the significance test such that these names will not be very high on ordered enrichment lists.

This point can best be made with a graph:

Surname Plot: Pedigree Surname Frequency vs. Census Frequency

The magenta line is a line of slope 1 passing through the origin (0,0). In other words, all names to the left of or above the magenta line are occurring more frequently in the matches (for our test subject) than in the US population as determined from the US Census for 2000. We can see that as we move towards the more common surnames the variance in the data decreases and the surnames move closer to the magenta line. So if you are going to do this analysis on your own match results, remember that even if you have ancestors named “Jones” or “Smith”, you should not expect large changes in the frequencies of those names in your match list.
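
The narrowing of the scatter toward common surnames is what a simple binomial model predicts. As a minimal sketch, assume each of the n surname occurrences in a match list is an independent draw with probability p equal to the surname's census frequency (an idealization, since real pedigrees are not independent draws). The ratio of observed to expected frequency then has standard deviation sqrt((1 - p) / (n p)), which shrinks as p grows. The values of n and p below are illustrative and are not taken from the analysis in this post.

```python
import math

def ratio_spread(n, p):
    """Std. dev. of (observed frequency / census frequency) under a binomial model."""
    return math.sqrt((1.0 - p) / (n * p))

n = 500_000  # hypothetical total number of surname occurrences across all matches
examples = [("rare surname", 1e-5), ("moderately common", 1e-4), ("very common", 1e-2)]
for label, p in examples:
    print(f"{label:18s} p={p:.0e}  expected count={n * p:8.1f}  "
          f"spread of ratio={ratio_spread(n, p):.3f}")
```

With these numbers, chance alone can move the ratio for the rare surname by about 45%, while for the very common surname it moves by only about 1.4%. A common surname therefore stays close to the slope-1 line unless the underlying excess is very large.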

Some additional observations to make:

  • The most common surnames in America are showing up in matches (of our test subject) less than the average in the US.   The entire name collection of the matches is tilted towards the less common surnames.
  • The demographic change in the US is quite noticeable, with the matches to old colonial American names prominent, while contemporary Hispanic names are under-matched.
  • Many ancestry.com users insist on putting a woman’s given name in the surname field.  For the statistical significance test I culled these (except for “Barbara”, for which it was difficult to determine which instances were not really surnames), but here I show them to illustrate how common this habit is among ancestry.com users.
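
The culling step in the last bullet could be sketched as below. This is an assumption about how such a filter might look, not the procedure actually used: the stop-list holds only entries from the table above that look like given names, a real list would be far longer, and the function name is made up. “Barbara” is set aside for manual review, following the exception noted in the bullet.

```python
# Hypothetical stop-list of given names; a real list would be much longer.
GIVEN_NAMES = {"anna", "frances", "agnes", "eva", "rachel", "maria", "alice", "esther"}
# Entries that cannot be classified automatically are kept for manual review.
AMBIGUOUS = {"barbara"}

def split_surnames(entries):
    """Partition surname-field entries into (kept, culled, needs_review)."""
    kept, culled, review = [], [], []
    for entry in entries:
        key = entry.strip().lower()
        if key in AMBIGUOUS:
            review.append(entry)
        elif key in GIVEN_NAMES:
            culled.append(entry)
        else:
            kept.append(entry)
    return kept, culled, review

kept, culled, review = split_surnames(["Parke", "Anna", "Barbara", "Cocke", "Esther"])
print(kept, culled, review)
```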

I suggest that anyone who is a “colonial” (that is, someone all of whose ancestors immigrated to America before 1776) will see similar results.

There is much more to discuss about surnames and counting their frequency, and what to make of our DNA matches. Surname and geography frequencies, and the associated statistical tests, are a supplement to other genealogy methods. Even with a more molecular approach to mining DNA cousins (i.e., matching chromosome regions) we still need to ascribe the shared DNA to a person, said person being known in their day and referred to by us through their name(s), familial, given, or fabricated.

For many of us with doubtful or absent ancestors at the ends of branches in our family trees, perhaps analyses like these can lead us in directions to dig for genealogical gold.

 ❦

This already long post would have been longer had I not excluded diving into several topics, which are possibilities for further discussion, eventually:

  • Outlining the analysis pipeline;
  • Improved methods of significance testing that more faithfully apply to the surname practices than the binomial test;
  • Exploiting name frequency distributions other than the 2000 US Census;
  • The problem of spelling and other phenomena related to names.

With that said, happy adventures in DNA mining!