Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and “Enrichment”

 

The questions we hope to answer are these:

Can we use the surnames provided (either in lists or pedigrees) by our DNA matches to:
  1. Discover the names of ancestors about whom we have no a priori knowledge?
  2. Confirm the names of ancestors about whom we may have doubts?

 

This is a very complex subject, so this is a lengthy post to wade through – you are forewarned!

 

Making the most of the direct-to-consumer (DTC) DNA services for genealogy is currently a challenge, as no single company offers everything an aspiring genetic genealogist (professional or dilettante) requires.   This quandary includes dealing with the cultural attachments to our DNA – our names.

One of the central theories of genealogy as practiced in the English-speaking world is that surnames matter.   This turns out to be true enough, and is a reflection of the deep patriarchy embedded in western society.

Yet for many of us with ancestry from nations outside the Anglosphere this surname-centric view of families will not be of much use.   For instance, half my pedigree comes from a culture where patronymics were the custom until the time of my grandfather’s emigration; I am only the second generation born with the surname I carry.

Still, for the English genealogy world the surname ranks as one of the more important concepts.   For example – the field of one-name studies, while not absolutely wedded to the idea, centers on the name being studied as a surname.   Genealogy database programs such as Family Tree Maker are structured around a naming paradigm in which there are surnames.  Your local courthouse records are probably indexed by surname, and so on.

A Tree of Names, but Which Ones Belong?


 

In the DNA view of genealogy as practiced today surnames still are center stage – at FTDNA Family Finder there is a field for surnames, at AncestryDNA the match page has a section dedicated to pedigree surnames and surnames in-common, and at 23andMe a customer can enter surnames into one’s profile, which will then appear in DNA Relatives and upon which one can search or sort one’s matches.

23andMe goes one step further, though, and offers what they call “Surname View”. This is a ranking of surnames as found in a customer’s DNA Relatives list.   The surnames are ranked by an “enrichment” score.   23andMe defines this value as:

Enrichment is computed via a one-tailed binomial test. The 23andMe-wide frequency of a given surname is the reference frequency. The number of occurrences of the surname among your matches and the total number of surnames among your matches are the counts in the binomial test. This results in a p-value; we then report -1.0 * log10(p), so the bigger the number is, the more unusual it is that it was at such high frequency among your matches.

 

For the purposes of this discussion I’ll use the term “enrichment” in the same sense as 23andMe. Their definition of enrichment is an interesting idea, though I have some questions about the validity of using a binomial test.  In a future post I may delve into more technical details about this issue.   There is one very important point to make:

☞ How often a name occurs in your match list is not, by itself, sufficient to establish the importance of a match with that surname in their pedigree.   Rather, we need to take into account how likely the particular name is to occur at random, based on the population being tested.
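
As a rough illustration of the definition quoted above, the score can be sketched in a few lines of Python. This is only my reading of the quoted description, not 23andMe's code: the function names are my own, `count` is the number of occurrences of a surname among the matches, `total` is the total number of surname entries among the matches, and `ref_freq` is the reference frequency of the surname.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def enrichment(count, total, ref_freq):
    """-log10 of the one-tailed p-value P(X >= count), X ~ Binomial(total, ref_freq).

    Assumes 0 <= count <= total and 0 < ref_freq < 1.
    """
    if count <= 0:
        return 0.0  # P(X >= 0) is 1, so the score is 0
    # Sum the upper tail in log space to avoid underflow for tiny p-values.
    logs = [log_binom_pmf(i, total, ref_freq) for i in range(count, total + 1)]
    peak = max(logs)
    log_tail = peak + math.log(sum(math.exp(x - peak) for x in logs))
    return max(0.0, -log_tail / math.log(10))

# Hypothetical numbers: a surname seen 8 times among 2,000 surname entries,
# against a reference frequency of 1 in 1,000.
print(round(enrichment(8, 2000, 0.001), 2))
```

Summing the whole upper tail is slow for very large totals, but it keeps the sketch close to the wording of the definition.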

 

Recognizing that 23andMe is just one pond in which to fish, many of us have tested elsewhere, such as at AncestryDNA.    However, AncestryDNA does not provide any analysis regarding name frequencies in a customer’s set of matches.   But, with a little bit (or a lot) of cleverness we can collect the surname data from our matches at AncestryDNA, found on each match page in the left hand column titled “Surnames (10 generation pedigree)”.

With this data I can then calculate what 23andMe calls the “Enrichment” for the AncestryDNA set of surnames from matches.   In place of 23andMe’s “23andMe-wide frequency” I use the US Census surname frequency data, as I don’t have access to the entire database of AncestryDNA tests (if only!) to calculate an equivalent AncestryDNA-wide frequency table.   The tested person in this case has all colonial American ancestry, and the AncestryDNA customer base to date is overwhelmingly from the US, with the “colonials” being a significant share of it.   I therefore presume that US Census data is a good approximation of an AncestryDNA-wide data set for the expected values of name occurrence.
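
The bookkeeping behind this step might look something like the following sketch. The match pedigrees and the census frequencies here are made-up placeholders, and the helper name `tally_surnames` is hypothetical. Each surname is counted once per match, and names missing from the reference table are skipped because no expected value can be computed for them.

```python
from collections import Counter

def tally_surnames(pedigrees, census_freq):
    """Return (surname, observed count, expected count) rows.

    pedigrees   -- one list of surnames per DNA match
    census_freq -- surname -> fraction of the reference population (> 0),
                   keyed with the same capitalization used below
    """
    counts = Counter()
    for names in pedigrees:
        # Normalize spelling case and count each surname once per match.
        counts.update({name.strip().title() for name in names})
    total = sum(counts.values())
    rows = []
    for name, observed in counts.items():
        freq = census_freq.get(name)
        if freq is None:
            continue  # not in the reference table
        rows.append((name, observed, total * freq))
    # Largest observed-to-expected ratio first, as a quick first look.
    rows.sort(key=lambda r: r[1] / r[2], reverse=True)
    return rows

# Invented data, for illustration only.
pedigrees = [["Parke", "Smith"], ["parke", "Jones"], ["Smith", "Parke", "Parke"]]
census = {"Parke": 0.0001, "Smith": 0.01, "Jones": 0.004}
for row in tally_surnames(pedigrees, census):
    print(row)
```

The observed count, the total, and the reference frequency are the three inputs the binomial test in 23andMe's definition needs; the ratio used for sorting here is only a simpler stand-in for that test.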

A concrete example, from two tests I manage, of the same person, one test at 23andMe and the other at AncestryDNA:

Here are all 226 entries under “Surname View” for this particular test:

 

23andMe Surname View

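Surname | 23andMe Count | 23andMe Enrichment
Roberts | 25 | 61
Bryan | 11 | 50
Chiles | 5 | 49
Webb | 17 | 48
Adkins | 8 | 42
Green | 24 | 42
Hoskins | 6 | 42
Smythe | 5 | 41
Dowell | 5 | 41
Peyton | 5 | 40
Cole | 14 | 37
Hinds | 5 | 37
Beasley | 7 | 37
Griffith | 10 | 36
Forster | 5 | 36
Wentworth | 5 | 33
Parker | 19 | 33
Henley | 5 | 33
Holman | 6 | 33
Henson | 6 | 33
Schmid | 5 | 33
Gibbs | 8 | 32
Wilcox | 8 | 31
Drake | 9 | 30
Wyatt | 7 | 30
Vaughn | 8 | 29
Rowe | 8 | 29
McDaniel | 8 | 29
Allison | 7 | 29
Rush | 6 | 29
Turner | 17 | 29
Dick | 5 | 29
Bishop | 11 | 29
Best | 6 | 29
Garner | 7 | 28
Moyer | 5 | 28
Powell | 13 | 28
Roe | 5 | 28
Mackey | 5 | 27
Bowen | 8 | 27
Edwards | 16 | 27
Bates | 9 | 26
Stout | 6 | 26
Moon | 6 | 26
Garrison | 6 | 26
Dudley | 6 | 26
Sharpe | 5 | 26
Harper | 10 | 26
James | 12 | 26
Chambers | 8 | 25
Ballard | 7 | 25
Anderson | 30 | 24
Gibson | 11 | 24
Christian | 6 | 24
Valentine | 5 | 24
Pitts | 5 | 24
Lane | 10 | 23
Hale | 8 | 23
Scott | 19 | 21
Ward | 14 | 21
Preston | 6 | 21
Hayden | 5 | 21
Duke | 5 | 21
Thompson | 26 | 21
Wallace | 10 | 20
Lewis | 21 | 20
Shepherd | 6 | 20
Davenport | 6 | 20
McMillan | 5 | 20
Chapman | 10 | 20
Berry | 10 | 20
McDonald | 10 | 19
Abbott | 6 | 19
Yates | 6 | 19
Dennis | 6 | 19
Robertson | 11 | 19
Orr | 5 | 19
White | 26 | 19
Stephens | 9 | 19
Woods | 9 | 18
Kemp | 5 | 18
Wilson | 29 | 18
Stephenson | 5 | 18
Rogers | 14 | 17
Harrison | 10 | 17
Booth | 6 | 17
Ingram | 5 | 17
Warren | 9 | 16
Fleming | 7 | 16
Barrett | 7 | 16
Clark | 23 | 16
Chandler | 6 | 15
Collier | 5 | 15
Dyer | 5 | 15
Brooks | 10 | 15
Walker | 19 | 14
Allen | 19 | 14
Cox | 12 | 14
Stone | 9 | 14
Parsons | 6 | 14
Mullins | 5 | 14
Campbell | 17 | 13
Foster | 10 | 13
Mason | 9 | 13
Howell | 7 | 13
Miles | 5 | 13
Thomas | 20 | 13
Day | 7 | 12
Franklin | 6 | 12
Love | 5 | 12
Matthews | 7 | 12
Miller | 33 | 11
Cook | 12 | 11
Austin | 6 | 11
Russell | 9 | 10
Wells | 8 | 10
Reynolds | 8 | 10
Lawrence | 7 | 10
Keller | 6 | 10
Gilbert | 6 | 10
Sherman | 5 | 10
Pratt | 5 | 10
Hubbard | 5 | 10
Young | 15 | 9
Evans | 13 | 9
Weaver | 6 | 9
Fuller | 6 | 9
Sutton | 5 | 9
Lowe | 5 | 9
Garrett | 5 | 9
Fletcher | 5 | 9
Fitzgerald | 5 | 9
Dawson | 5 | 9
Dunn | 7 | 9
Hoffman | 6 | 8
Davis | 28 | 8
Kennedy | 8 | 8
Rice | 7 | 8
Owens | 6 | 8
Hopkins | 6 | 8
Carter | 11 | 8
Hawkins | 6 | 8
Bradley | 6 | 8
Willis | 5 | 8
Dean | 5 | 8
Jones | 34 | 8
Murphy | 11 | 8
Hall | 15 | 7
Wright | 14 | 7
Hughes | 8 | 7
Fisher | 8 | 7
Sanders | 7 | 7
Patterson | 7 | 7
Murray | 7 | 7
Jackson | 15 | 7
Johnston | 7 | 7
Griffin | 7 | 7
Bryant | 6 | 7
Riley | 5 | 7
Mueller | 5 | 7
May | 5 | 7
Mills | 7 | 7
Taylor | 21 | 6
Watson | 8 | 6
Gray | 8 | 6
Meyer | 7 | 6
Long | 7 | 6
Phillips | 10 | 6
Graham | 7 | 6
Jenkins | 6 | 6
Armstrong | 6 | 6
Andrews | 6 | 6
Greene | 5 | 6
Hunt | 7 | 6
Baker | 13 | 5
Morgan | 9 | 5
Bell | 9 | 5
Bailey | 8 | 5
Wheeler | 6 | 5
Spencer | 6 | 5
Marshall | 6 | 5
Alexander | 6 | 5
Payne | 5 | 5
Henry | 5 | 5
Elliott | 5 | 5
Duncan | 5 | 5
West | 6 | 4
Smith | 56 | 4
Robinson | 12 | 4
Stewart | 10 | 4
Nelson | 10 | 4
Wood | 9 | 4
Morris | 8 | 4
Cooper | 8 | 4
Reed | 7 | 4
Martin | 19 | 4
Price | 6 | 4
Myers | 6 | 4
Coleman | 6 | 4
Tucker | 5 | 4
Nichols | 5 | 4
Knight | 5 | 4
Carr | 5 | 4
Arnold | 5 | 4
Williams | 28 | 4
Johnson | 33 | 3
Adams | 11 | 3
Butler | 6 | 3
Schmidt | 5 | 3
Perry | 5 | 3
Palmer | 5 | 3
Jordan | 5 | 3
Hunter | 5 | 3
Ford | 5 | 3
Brown | 28 | 2
Ross | 5 | 2
Richardson | 5 | 2
Howard | 5 | 2
Bennett | 5 | 2
Collins | 7 | 2
Harris | 11 | 1
Hill | 9 | 1
King | 8 | 1
Mitchell | 6 | 1
Moore | 11 | 0
Lee | 9 | 0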
23andMe Surname View on 24 Jan 2015, for same individual as tested at AncestryDNA

 

Here are the top (i.e., the most “enriched”) 226 (a number selected to be the same quantity as that from 23andMe) entries from the list of all surnames in matches at AncestryDNA for this same person:

 

AncestryDNA "Surname View"

Surname | AncestryDNA Count | "Enrichment"
Parke | 38 | 34.41
Batte | 23 | 33.33
Cocke | 25 | 31.83
Pridmore | 23 | 27.49
Barbara | 28 | 26.86
Woodson | 54 | 26.75
Crownover | 26 | 26.41
Reade | 21 | 23.67
Browne | 56 | 23.39
Isham | 28 | 22.69
Bartlett | 78 | 22.33
Bolling | 32 | 22.31
Prigmore | 15 | 21.53
Wood | 232 | 21.00
Eppes | 17 | 20.99
Chiles | 28 | 20.92
Fitzrandolph | 11 | 20.84
Cresson | 12 | 20.68
Poythress | 17 | 20.19
Taliaferro | 25 | 19.91
Stuart | 67 | 19.82
Mellott | 26 | 19.21
Cooke | 62 | 19.15
Andersson | 16 | 19.00
Ironmonger | 10 | 18.59
Elliot | 32 | 17.63
Pettypool | 9 | 17.26
Pleasants | 18 | 16.94
Stillwell | 29 | 16.91
Jonsson | 15 | 16.79
Demarest | 20 | 16.74
Bryan | 79 | 16.47
Brashears | 18 | 16.45
Vanderveer | 16 | 16.28
Piles | 10 | 16.06
Woodward | 61 | 15.99
Owen | 82 | 15.81
Cock | 10 | 15.61
Nalle | 10 | 15.61
Paine | 29 | 15.25
Glascock | 16 | 14.94
Hubbard | 86 | 14.82
Griffith | 93 | 14.66
Fielding | 24 | 14.62
Anna | 13 | 14.60
Hix | 24 | 14.45
Vancleef | 11 | 14.29
Olofsson | 8 | 14.24
Whitehead | 64 | 14.16
Moredock | 9 | 14.11
Neale | 20 | 14.08
Goad | 28 | 13.90
Vawter | 15 | 13.75
Doyne | 9 | 13.71
Clarke | 79 | 13.67
Nilsson | 18 | 13.57
Read | 37 | 13.41
Ball | 90 | 13.39
Moor | 16 | 13.36
Frances | 16 | 13.13
Chamberlain | 48 | 13.10
Grymes | 9 | 13.00
Mumford | 23 | 12.95
Dickenson | 20 | 12.94
Cossart | 7 | 12.81
Tuttle | 47 | 12.78
Bird | 53 | 12.76
Symons | 17 | 12.71
Tschudi | 8 | 12.62
Jennings | 95 | 12.59
Tarpley | 18 | 12.41
Blankenbaker | 11 | 12.40
Denton | 46 | 12.40
Markham | 31 | 12.39
Predmore | 13 | 12.05
Wheeler | 118 | 12.04
Meriwether | 14 | 11.98
Dudley | 52 | 11.86
Lanier | 38 | 11.68
Brereton | 12 | 11.65
Wynne | 26 | 11.64
Harrison | 152 | 11.51
Wallis | 28 | 11.49
Schenck | 21 | 11.45
Drake | 71 | 11.43
Ragsdale | 29 | 11.42
Pride | 22 | 11.36
Cawood | 11 | 11.31
Graves | 88 | 11.31
Clements | 53 | 11.29
Agnes | 10 | 11.23
Langston | 36 | 11.18
Overton | 35 | 11.13
Morton | 70 | 11.10
Worsham | 20 | 10.96
Muller | 43 | 10.92
Brevard | 11 | 10.85
Crew | 17 | 10.85
Johnston | 119 | 10.82
Larsson | 11 | 10.75
Tydings | 8 | 10.70
Tarleton | 12 | 10.70
Enyard | 6 | 10.66
Petersson | 6 | 10.66
Roscow | 6 | 10.66
Lyon | 43 | 10.63
Osborne | 74 | 10.57
Vaughn | 88 | 10.57
Brooke | 18 | 10.53
Larzelere | 8 | 10.47
Parsons | 74 | 10.45
Waggoner | 30 | 10.29
Tandy | 13 | 10.29
Clayton | 65 | 10.28
Armistead | 14 | 10.28
Oldham | 28 | 10.27
Daniel | 80 | 10.25
Cathey | 23 | 10.24
Eva | 9 | 10.20
Sharp | 81 | 10.16
Rawlings | 24 | 10.14
Roosa | 12 | 10.13
Fowke | 6 | 10.09
Vivion | 6 | 10.09
Antram | 7 | 10.05
Hogg | 21 | 9.92
Newberry | 28 | 9.91
Darnall | 12 | 9.90
Henley | 35 | 9.82
Beall | 23 | 9.81
Rachel | 13 | 9.72
Stout | 54 | 9.66
Earle | 22 | 9.65
Meacham | 19 | 9.59
Dabney | 20 | 9.58
Ballenger | 17 | 9.54
Hegeman | 9 | 9.51
Garnet | 7 | 9.50
Pledge | 7 | 9.50
Fowler | 97 | 9.50
Ely | 27 | 9.50
Harwood | 25 | 9.45
Royall | 11 | 9.39
Suddarth | 9 | 9.39
Allen | 313 | 9.39
Brouwer | 13 | 9.37
Bull | 25 | 9.37
Williamson | 103 | 9.37
Persson | 12 | 9.36
Faure | 8 | 9.33
Westcott | 18 | 9.32
Scripture | 7 | 9.27
Calvert | 31 | 9.26
Stedman | 16 | 9.23
Boone | 60 | 9.23
Mcgehee | 18 | 9.22
Cloyes | 6 | 9.22
Simcock | 6 | 9.22
Standley | 16 | 9.16
Craige | 7 | 9.05
Salling | 7 | 9.05
Pope | 67 | 9.04
Warren | 125 | 9.03
Maria | 15 | 8.98
Mendenhall | 23 | 8.97
Crow | 36 | 8.96
Poor | 14 | 8.95
Waddy | 11 | 8.88
Hertzel | 6 | 8.88
Kip | 6 | 8.88
Vannuys | 6 | 8.88
Field | 37 | 8.85
Steel | 19 | 8.83
Threlkeld | 12 | 8.81
Mershon | 11 | 8.75
Hoge | 13 | 8.75
Follansbee | 8 | 8.74
Scudder | 15 | 8.74
Pendleton | 29 | 8.73
Hawkins | 115 | 8.72
Baskett | 11 | 8.68
Crawford | 130 | 8.66
Griswold | 23 | 8.65
Alice | 7 | 8.65
Strother | 19 | 8.64
Street | 32 | 8.64
Alden | 17 | 8.60
Aldin | 5 | 8.59
Efland | 5 | 8.59
Mcilhaney | 5 | 8.59
Thomasen | 5 | 8.59
Loomis | 27 | 8.54
Vanbibber | 9 | 8.53
Brewster | 31 | 8.47
Stryker | 13 | 8.37
Hale | 80 | 8.36
Erwin | 34 | 8.33
Hobart | 13 | 8.33
Mccombe | 6 | 8.30
Bromwell | 7 | 8.30
Bushnell | 16 | 8.30
Hull | 50 | 8.28
Rogers | 210 | 8.26
Irvine | 21 | 8.25
Runyon | 21 | 8.25
Sprague | 34 | 8.25
Garland | 37 | 8.24
Tinsley | 27 | 8.23
Pridemore | 12 | 8.17
Buys | 9 | 8.17
Bennet | 14 | 8.13
Gooch | 23 | 8.11
Battaile | 5 | 8.11
Faut | 5 | 8.11
Hogshead | 5 | 8.11
Chappel | 12 | 8.08
Randolph | 48 | 8.07
Croasdale | 6 | 8.06
Fiske | 14 | 8.02
Esther | 7 | 7.99
Basye | 8 | 7.92
Straughan | 8 | 7.92
Wyatt | 56 | 7.90
Haile | 15 | 7.89
Low | 23 | 7.87
Harbour | 14 | 7.87
As of 24 Dec 2014, AncestryDNA matches' surnames-in-pedigrees analyzed similarly to 23andMe "Surname View". The top 226 "enriched" surnames are shown, the same number of surnames as in the total Surname View at 23andMe, but which are only 3% of all the surnames from the AncestryDNA match list.

One thing that is very clear upon comparing the two lists is that they are not the same.

They are not even close.

The surnames in the 23andMe list do show up on the larger AncestryDNA list (not just the top 226), but scattered across the nearly 10,300 names in the much larger AncestryDNA list.

What also should stand out is that the counts of matches (second column) are much larger for the AncestryDNA data set than for the 23andMe data set.    While the number of customers of each service is about equal, those of us who have tried to use 23andMe for genealogy have discovered quite quickly that many DNA Relatives there are incognito, and even if a DNA Relative has a public profile they usually list very few surnames.   Add to this that 23andMe limits the DNA Relatives list size to only the top matches (for those customers with more than 1000 matches, which would include many Americans) and we discover that there are relatively few data points from the 23andMe database in regards to surnames.

In the above example, the testee has 4x as many matches at AncestryDNA as at 23andMe, and before “autosomalgeddon” at AncestryDNA last fall the match count was twice that.    However, even more significant is that the average number of people in the pedigrees of matches at AncestryDNA is over 50, while at 23andMe the average number of surnames per match is but a small fraction of that.   Because of the smaller data set of matches with ancestral names at 23andMe, it is possible for a few diehard genealogists, who test multiple members of their families, to skew the Surname View results by putting long lists of surnames in their profiles.   For example, the above 23andMe customer matches 3 siblings, each of whom has a surname list of approximately 215 names!   This one family significantly skews the Surname View results for the example test I am discussing.   In this case, this family is on the first page of matches (that is about 1/12th the total) in DNA Relatives, and accounts for 645 name occurrences on that page.   Everybody else on that page (95 people) accounts for just 413 more name occurrences, so three profiles supply about 61% of the 1,058 name occurrences on the page.

 

☞ Caution: Outliers, such as extremely deep pedigrees or very long name lists, will skew results on relatively small datasets, such as at FTDNA and 23andMe.
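
One way to spot this kind of skew before trusting a ranking is to check how much of the raw name data comes from each profile. The sketch below is only an illustration: the function name and the 10% threshold are arbitrary choices of mine, and the numbers mirror the example above (three profiles of about 215 names each, and 95 others contributing 413 entries in total).

```python
def dominant_profiles(name_counts, threshold=0.10):
    """Return (profile, share) for profiles supplying more than `threshold`
    of all surname entries, largest share first."""
    total = sum(name_counts.values())
    shares = {who: n / total for who, n in name_counts.items()}
    flagged = [(who, share) for who, share in shares.items() if share > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Three siblings with 215 names each; the other 95 people's 413 entries
# are spread evenly here purely for illustration.
counts = {"sibling_1": 215, "sibling_2": 215, "sibling_3": 215}
counts.update({f"match_{i}": 413 / 95 for i in range(95)})
for who, share in dominant_profiles(counts):
    print(who, f"{share:.0%}")
```

Profiles flagged this way could be set aside, or counted as a single family, before the surname ranking is recomputed.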

 

Given that we have two different lists of “enriched” surnames for the same person, which one is more informative? (Note that I did not write “accurate”.)

After a couple of years of working on developing the pedigree for the subject of this post, with plenty of surprises along the way, it turns out the AncestryDNA surname enrichment list better reflects what I know of the pedigree.   This is especially so in the top 25 names: 4 of the AncestryDNA enriched names are pedigree surnames, while only one name (“Green”) in the top 25 of the 23andMe enrichment list is found in the developed pedigree, and that name is questioned by some researchers of that ancestral family.

Of the entire 23andMe list of 226 surnames, only 11 are found in the developed pedigree.   For the top 226 from AncestryDNA it also turns out there are 11 names found in the testee’s pedigree.   The only name in common between the two groups of 11 is “Johnston”, which is good as that is the family name of the testee’s grandfather.   It should be noted that the entire list of AncestryDNA matches’ pedigree-names that were included in this analysis total over 10,000, and names in the rest of the pedigree of the testee are found on that list but farther down than #226.

Which raises the next question – what about all those other names on the list(s), that are not ancestral names of the person tested?

Intuitively we recognize that our ancestors can have many descendants, and in the society of which we’re a part a daughter will lose her maiden name and pick up her husband’s surname.   Thus our cousins descended from these great-grand aunts and female cousins will carry surnames that are not part of our direct pedigree lines. This is both a bane and a boon. It is a bane because unless we’ve carried out very extensive descendancy research we will not recognize the names of our genealogical cousins.   On the other hand, all these names are a boon because until the late 20th century people found their mates relatively close by, and families often intermarried and migrated together, making certain names more associated with each other.   From the example test in this post, on the AncestryDNA enriched surname table the top entrant is “Parke”, a name often found in colonial pedigrees due to the 17th century founders in America (e.g., purportedly a Dr. Roger Parke of early New Jersey) having many male descendants, including quite a few who ended up in what is today West Virginia and Kentucky.  This surname is then associated with many families from these areas, and that is the case here: our subject’s paternal ancestors lived in WV and KY, where many Parke/Park/Parks families also lived.

As we work our way to an answer to the two questions at the top of this post it is important to realize that the use of DNA in genealogy is rapidly expanding, and as a larger share of the population tests at these companies we will see larger data sets that will give us an opportunity to be more certain of whatever we find.

So, in regards to question #1 posed at the beginning of this post, the best answer currently is “probably.”

In the example DNA test I am discussing in this post, only through DNA testing has a secret been revealed, an NPE (or possibly a secret adoption).   It turns out that 3 surnames in the top 25 of the AncestryDNA surname enrichment table belong to ancestors of the biological parent of this surprise.   Those 3 names are in 4th, 7th, and 13th place on that list of enrichment (out of 10,300 surnames).    It may still prove to be a coincidence, but finding the surnames of 2nd and 3rd great grandparents high on such a list ought not be surprising.  In this case the surnames are unusual enough to stand out, but not so ultra-rare in our society as to not make it onto the US Census list (which only includes names which occur more than 100 times in the US).   Extremely rare surnames are also, by definition, unlikely to show up in a list of DNA matches.

Furthermore, the 3rd entry on that AncestryDNA enrichment list, “Cocke”, is probably the ancestral surname of “Cox”, and the pedigree of the tested person most likely (based on un-vetted trees) has two entrants with that surname.   Number 21 on the above AncestryDNA enrichment list, “Stuart”, is the maiden name of a known 4th great grandmother.

My experience in doing this example is that nearly all the surnames in the pedigree of the person I’ve tested appear in the upper quartile of the AncestryDNA enrichment list, with many of them in the upper 10%.

So, without a priori knowledge of the names of unknown ancestors, how would one make use of an “Enrichment” list?    Based on my experience, I recommend starting at the top of an enrichment list and working down at least a couple of hundred names (given a dataset as large as AncestryDNA), looking for any connections the names may have with one’s known ancestors.   Speculative family trees are not a bad thing, as long as they are treated as such.

☞ Mining surname data is a source of leads for genealogical research, and cannot stand apart from the exhaustive search for evidence required in sound family history practices.

In an ideal case a newly discovered close DNA cousin will have their full tree available to you to study, to find the most recent common ancestor, but that is not always the case.   Sometimes all one has is a list of surnames, or maybe a bare-bones pedigree with only names and no other useful information. If you discover a not-previously-known not-too-distant (say 2nd to 4th) cousin, and that DNA cousin lists some ancestral surnames, it is worth going through each of their surnames looking at an enrichment list of your own DNA test, to see if one or more stand out as occurring disproportionately frequently in your set of matches.
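
That lookup is easy to script once an enrichment list exists. In the sketch below the four scores are copied from the AncestryDNA table above, the cousin's surname list is invented, and the function name `rank_cousin_surnames` is my own.

```python
def rank_cousin_surnames(cousin_surnames, enrichment):
    """Return the cousin's surnames found in the enrichment list as
    (surname, score, rank) tuples, best rank first; rank 1 is the most enriched."""
    ordered = sorted(enrichment, key=enrichment.get, reverse=True)
    rank = {name: i + 1 for i, name in enumerate(ordered)}
    hits = [(n, enrichment[n], rank[n]) for n in set(cousin_surnames) if n in rank]
    return sorted(hits, key=lambda hit: hit[2])

# A tiny excerpt of an enrichment list, and a hypothetical cousin's surnames.
my_enrichment = {"Parke": 34.41, "Stuart": 19.82, "Johnston": 10.82, "Wyatt": 7.90}
cousin = ["Stuart", "Wyatt", "Smith", "Parke"]
for surname, score, position in rank_cousin_surnames(cousin, my_enrichment):
    print(position, surname, score)
```

Surnames that do not appear in the enrichment list at all (here "Smith") are simply left out of the result, which is itself worth noting when reviewing a cousin's list.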

Then what of question #2, where we are desiring to validate or support a theory about an ancestor’s possible surname (and in these cases we are often looking for the maiden names of our married ancestresses)?

In my opinion, one of the better uses of an enrichment list is to act as a check against the human tendency to fixate on an idea or observation.   Given the ubiquity of pareidolia I value any means to keep me from seeing what is not there.   In the field of genealogy that includes becoming overly sensitive to a particular name, whereby we exaggerate the importance of that particular name any time we come across it.

For the example tested person in this post, after looking at many matches at the various companies I had become fixated on certain names, but on looking at the AncestryDNA enrichment list I discovered that those names are not occurring any more frequently than one would expect from the population at large.

In another application of the enrichment table: I’m researching a great grandmother (of the tested person in this post) whose maiden name is known from census records and marriage records, but whose parents have escaped identification.   In this case the low enrichment value of the great grandmother’s maiden name among the DNA matches at 23andMe and AncestryDNA, together with the low enrichment value of the maiden name of the only woman I had hypothesized could be her mother, supports my idea that the great grandmother was adopted (or a foster child who took the name of a couple that housed her for a while).  I had previously generated this hypothesis from census data in which the great grandmother spends her teenage years and early 20’s living with two different families not of her own surname, and the surname data suggests that I am on the correct path.   I’ve further had tested at AncestryDNA another great grandchild of this mysterious woman, and analysis of surnames of his matches concurs with the conclusions I’m drawing from his 2nd cousin’s match data.

Which brings up a very useful strategy, whether one is doing segment chasing or looking at surnames or just trawling for unknown cousins:

☞ Testing additional known relatives, especially 2nd cousins, can provide concurring data, or prevent one from reaching a wrong conclusion based on a single test.

Before I leave question #2,  when discussing the low enrichment scores I must throw in another caveat:

☞ For very common surnames (e.g. Smith, Williams, Jones, etc.) the law of large numbers will affect the significance test such that these names will not be very high on ordered enrichment lists.

This point can best be made with a graph:

Surname Plot: Pedigree Surname Frequency vs. Census Frequency

The magenta line is a line of slope 1 intersecting the origin (0,0).   In other words, all names to the left or above the magenta line are occurring more frequently in the matches (for our test subject) than in the US population as determined from the US Census for 2000.    We can see that as we move towards the more common surnames the variance in the data decreases and the surnames move closer to the magenta line.    So if you are going to do this analysis on your own match results, remember that even if you have ancestors named “Jones” or “Smith” do not expect there to be large changes in the frequencies of those names in your match list.
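
A rough way to see why the common names hug the line is to look at the spread one would expect from chance alone. Under a simple binomial model (my assumption here, not something read off the plot), a name with census frequency p among n surname entries has an expected count of n·p and a relative standard deviation of sqrt((1 − p) / (n·p)), which shrinks as p grows. The frequencies and the total below are made up for illustration.

```python
import math

def relative_spread(n, p):
    """Standard deviation of a Binomial(n, p) count divided by its mean."""
    return math.sqrt((1 - p) / (n * p))

def side_of_line(match_freq, census_freq):
    """Where a name falls relative to the slope-1 line on the plot."""
    if match_freq > census_freq:
        return "above"  # more frequent in the matches than in the census
    return "below" if match_freq < census_freq else "on"

n = 500_000  # hypothetical total of surname entries across all matches
for label, p in [("rare name", 0.00001), ("common name", 0.01)]:
    print(label, round(relative_spread(n, p), 3))
```

A rare name can therefore sit far from the line through chance alone, while a very common name needs a large shift in its count to move visibly.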

Some additional observations to make:

  • The most common surnames in America are showing up in matches (of our test subject) less than the average in the US.   The entire name collection of the matches is tilted towards the less common surnames.
  • The demographic change in the US is quite noticeable, with the matches to old colonial American names prominent, while contemporary Hispanic names are under-matched.
  • Many ancestry.com users insist on putting a woman’s given name in the surname field.  For the statistical significance test I culled these (except for “Barbara”, for which it was difficult to determine which instances were not really surnames), but here I show them to illustrate how common this habit is among ancestry.com users.

I suggest that anyone who is a “colonial” (that is, whose ancestors all immigrated to America before 1776) will see similar results.

There is much more to discuss about surnames and counting their frequency, and what to make of our DNA matches.    Surname and geographic frequencies, and the associated statistical tests, are a supplement to other genealogy methods.  Even with a more molecular approach to mining DNA cousins (i.e., matching chromosome regions) we still need to ascribe the shared DNA to a person, said person being known in their day and referred to by us through their name(s), familial, given, or fabricated.

For many of us with doubtful or absent ancestors at the ends of branches in our family trees, perhaps analyses like these can lead us in directions to dig for genealogical gold.

 ❦

This already long post would have been longer had I not excluded diving into several topics, which are possibilities for further discussion, eventually:

  • Outlining the analysis pipeline;
  • Improved methods of significance testing that more faithfully apply to the surname practices than the binomial test;
  • Exploiting name frequency distributions other than the 2000 US Census;
  • The problem of spelling and other phenomena related to names.

With that said, happy adventures in DNA mining!


4 Comments

  1. I intuitively did the same thing…divided the number of matches that have a given surname in their tree by the census frequency for that surname. I have also used in the denominator an alternative — frequency in the 1850 census (the first census to include all individuals, not just heads of households). The two approaches can give very different rankings, depending upon how a surname has expanded over time. Any opinion on which denominator is ‘better’?

    (I apologize for posting this same comment in the wrong section of the blog earlier…)

    • Whatever baseline one uses for a surname frequency, one has to take into consideration how the population has changed over time.

      I used the 2000 name frequency as it is readily available, and that so many of my matches have shallow, if any, family trees. A majority of my matches do not have trees that go before 1850 for the majority of their pedigree.

      And, as I noted in the article, we are using these surname frequency statistics from the census as a proxy for what we really want: name frequency in all the pedigrees of the DNA database, which AncestryDNA does not make available.

  2. The early 23andme.com (2013?) had more relevant names from grandparents….they were similar to my hand-written geneologist lists. The new 23andme.com (May 2016), I don’t see many names that are there that I used to see. But I do see my maternal grandparents’s R1b—which is cherish. Grandma Lynn

    • The “New Experience” at 23andMe no longer does the Surname View in DNA Relatives, unfortunately. If you have not yet been transitioned you will still have Surname View in your DNA Relatives.

      I’ve just recently been transitioned, but the last update to the Surname View in my colonial US kit had name frequencies that were in accord to what I have come to research in the background of that person.

      Given a large enough pool of people (who are from a culture which has used surnames for a long time) the Surname View can be useful. Yet 23andMe chose to get rid of it.
