Chromosome Pile-Ups in Genetic Genealogy: Examples from 23andMe and FTDNA

 

 

What we aim to accomplish:

  1. Dissuade people from chasing after small autosomal DNA segments by
    1. demonstrating too-common matching in particular regions of chromosomes;
    2. emphasizing the pitfalls of half-identical matching;
    3. briefly reviewing linkage measurements, emphasizing chromosome region lengths (in centiMorgans) as approximations.
  2. Illustrate the need that AncestryDNA had for redesigning their matching system to account for population-wide shared chromosome regions. Whether or not the current AncestryDNA Timber algorithm is the best possible is not the issue here, but rather that something like it is needed, not just at AncestryDNA but at all the autosomal DNA services.

 

Warning: Genetics in general is a field with a steep learning curve; thus Genetic Genealogy likewise has a steep learning curve.   The following will be a bit technical.

 

 

On various forums and blogs there is much discussion over the use of autosomal DNA matching for genetic genealogy. This field is emerging as a popular approach to extend what we know as the practice of “genealogy”, but there is much confusion and even angst running through the community of consumers of genetic genealogy tests over exactly what is meant by a “match”, and why companies decide on whom to exclude from match lists.

A common belief or assertion by some who are interested in genetic genealogy is that relatively small regions (below roughly 7 centiMorgans) of a chromosome are important for genealogical research.   I won’t quote from various forum posts because I do not want individuals to think that I am picking on them, but whether we look at a Facebook group (e.g., the ISOGG group), the forums on the company websites of DTC DNA testing companies, or in general forums about genealogy and ancestry, there seems to be no end of a stream of posts of people trying to make small shared regions mean something.

The problem is these small regions often don’t mean what the posters think they mean.

Though we all like to think of ourselves as unique, here I need to emphasize:

☞ We humans are unique assemblages of very common small bits and pieces of chromosomes (DNA).

Furthermore, there are some bloggers who are attempting to be highly visible in their attack on AncestryDNA, who late in 2014 revamped their matching algorithm, to the great dismay of some. This attack on AncestryDNA has been on Facebook groups as well as forums. Whatever one thinks of the company itself, the need for AncestryDNA to incorporate into their matching algorithm a method to deal with too-commonly matched regions of chromosomes is one of the reasons for writing this post. Regardless of the company one uses to do matching for the purposes of genetic genealogy, that company will improve their product if they can find a way to deal with false positives, and currently AncestryDNA does that, 23andMe does that to some extent, and FTDNA does not (to the best of my knowledge.)

One of the issues that drove AncestryDNA to redesign their matching algorithm can be tied, in part, to the problem of matching on small segments.

Let’s take a look at some data – all the data in this post is from a test I manage at 23andMe, and a profile I manage at FTDNA, from Family Finder, which is a transfer of the same 23andMe (V3) raw data of the same person tested at 23andMe. If you have tests at these companies you can download your own data and see for yourself where pile-ups are occurring.

23andMe customers have a tool available to them called “Countries of Ancestry” (CoA, formerly known as “Ancestry Finder”), which is a misnomer. It is a compilation of matches (with match data), composed in part from entries from the person’s DNA Relatives and others not in a person’s DNA Relatives list. Only 23andMe customers who have filled out the ancestry survey will find themselves in somebody else’s CoA list. Unfortunately the size of the CoA list is, similar to DNA Relatives, limited in the number of people allowed on the list (approximately a thousand.) Nevertheless, CoA is a list of over 1000 matching segments (as some person matches will be multi-segment matches) and thus useful for our purposes.

We can use the CoA data to plot out the “segments” (technically, half-identical regions) for each chromosome, stacking the segments to show overlaps and total coverage of the chromosome. Here is a plot for chromosome 1 showing 95 matching (to the test used in this post) segments as stacked rectangles:

 

Chr1 Segments from Countries of Ancestry

Fig. 1: Chromosome 1 matching segments for our test, as rectangles, from 23andMe Countries of Ancestry list.

 

 

(The red line is a count of overlaps, thus indicating shared regions for that portion of the chromosome, but since the segment blocks are offset vertically by slim white spaces to make the segments visibly distinguishable from each other the red line ends up being on a different vertical scale than the blocks.)
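For readers who want to reproduce the overlap count (the red line) from their own data, here is a minimal sketch. It assumes each matching segment has been reduced to a (start, end) pair in base pairs; the function name and the toy coordinates are mine for illustration and are not taken from the 23andMe file.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_bp, end_bp); hypothetical toy coordinates


def overlap_depth(segments: List[Segment]) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, depth) runs wherever at least one segment lies."""
    events = []
    for start, end in segments:
        events.append((start, 1))   # a segment opens
        events.append((end, -1))    # a segment closes
    # Sorting puts closes before opens at the same position, so segments
    # that merely touch end-to-start are not counted as overlapping.
    events.sort()

    runs = []
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and pos > prev and depth > 0:
            runs.append((prev, pos, depth))
        depth += delta
        prev = pos
    return runs


toy_segments = [(0, 10), (5, 15), (20, 30)]
runs = overlap_depth(toy_segments)
peak = max(runs, key=lambda run: run[2])
print(runs)  # [(0, 5, 1), (5, 10, 2), (10, 15, 1), (20, 30, 1)]
print(peak)  # (5, 10, 2)
```

On real data the run with the largest depth is the first place to look for a pile-up.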

We can see from figure 1 that chromosome 1 is nearly completely covered by matches (from the 23andMe Countries of Ancestry list). We also notice that while the matching segments are not evenly distributed along the chromosome, the distribution is not so lopsided as to demonstrate a significant “cold region”, though the peak around 60 Mbp (mega base-pairs) followed by the trough is starting to look suspicious. However, around 240 Mbp there appears to be the beginning of a “pile-up” region.

Let’s look now at another chromosome, 6, which is known to have some regions that are troublesome:

 

Chr6 from Countries of Ancestry

Fig. 2: Chromosome 6 matching segments from 23andMe Countries of Ancestry match list.

 

Looking at figure 2 it is readily evident that there is an overwhelming number of chr6 matches in a single region of the chromosome, around 30 Mbp. Chromosome 6 is noted for its HLA regions, parts of the chromosome which house genes vital to the human immune system. The segment pile-ups in these regions are suggestive, and may demonstrate a non-random phenomenon, probably a selection event in human evolution, perhaps even during historic (i.e., since the invention of writing) times.

Also noticeable is a desert of matching around the 100 Mbp region.

This lopsided distribution of matching segments should give us pause: what does it mean for our genealogy efforts “to match” in these cases?

What we are seeing in the figures in this post is how commonly distributed small fragments of chromosomes (or more precisely, sets of alleles, or haplotypes) can be in our society. The 23andMe CoA file only has 1000 people, out of the entire 23andMe database of 800,000 customers. What if everyone in the US were tested, and the CoA list not capped? We likely would see hundreds of thousands of “matches” in this region of chr6. And it should be noted that in figure 2 some of those segments in the major pile-up areas are over 10cM in length (according to current linkage maps – more on that below.)

In doing genealogy, trying to make sense of these kinds of matches (in overly-common regions) is an exercise in futility. Using the oft-stated standard of a minimum size of 7cM for a match that is likely to be identical-by-descent, trying to identify the most recent common ancestor (MRCA) with a match such as in these chr6 pile-up regions is not tractable, even if the segment surpasses the 7cM threshold; the MRCA could be dozens of generations ago.   One could propose that if a large enough sample of our population tested, and if select buried individuals could be exhumed and tested, then we could recreate partial genotypes of the individuals of entire communities of our ancestors from centuries ago. If this could be done then we could determine how common today's shared chromosome regions were in any given community of our ancestors, and perhaps trace the rapid growth of particular families or clans. Testing on such a scale is unlikely in the near future, however, and such an effort will likely face other hurdles.

Additionally, this region in chr6 in particular is demonstrating the problem of half-identical matching. The massive pile-up is likely not due to a single physical 7cM to 10cM strand of chr6, but to the superpositioning of several smaller (say 0.5cM to 1cM) regions, haplotypes found on chr6 which are very common throughout the European population. By chance, these small fragments will superimpose (given the two copies of chr6 we all carry) to present larger half-identical regions (HIRs) which make the matching threshold cutoffs (say 7cM.)
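To make the half-identical problem concrete, here is a toy sketch. It assumes eight made-up SNP sites and invented haplotypes for two people. It is not any company's matching code, and real comparisons run over hundreds of SNPs with cM thresholds. The point is only that an unphased comparison accepts a site whenever the two genotypes share at least one allele, so short shared haplotypes on different chromosome copies can chain into one long apparent segment.

```python
def half_identical(geno_a, geno_b):
    """True if two unphased genotypes share at least one allele."""
    return bool(set(geno_a) & set(geno_b))


def longest_hir(haps_a, haps_b):
    """Longest run of consecutive half-identical sites (unphased view)."""
    best = run = 0
    for geno_a, geno_b in zip(zip(*haps_a), zip(*haps_b)):
        run = run + 1 if half_identical(geno_a, geno_b) else 0
        best = max(best, run)
    return best


def longest_shared_haplotype(haps_a, haps_b):
    """Longest run where one single haplotype of A equals one single haplotype of B."""
    best = 0
    for hap_a in haps_a:
        for hap_b in haps_b:
            run = 0
            for x, y in zip(hap_a, hap_b):
                run = run + 1 if x == y else 0
                best = max(best, run)
    return best


# Invented haplotypes: A and B share a short haplotype on one chromosome copy
# in the first four sites, and a different short one on the other copy in the
# last four sites.
person_a = ("AGAGGGGG", "GGGGAGAG")
person_b = ("AGAGAAAA", "AAAAAGAG")

print(longest_shared_haplotype(person_a, person_b))  # 4
print(longest_hir(person_a, person_b))               # 8
```

No single strand is shared for more than four sites, yet the unphased comparison reports one unbroken region of eight.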

Given that some of these regions in chromosomes are known, at least 23andMe filters out the most common ones before a match can make it into DNA Relatives.   This is an important distinction.

But what if we did not filter out these known regions?  And what if we were not limited to match lists of only 1000 people?

For this we turn to FTDNA’s Family Finder, which is not limited in the number of matches, and includes HIRs as small as 1cM.

Here are the matching chromosome 1 segments from Family Tree DNA’s Family Finder, for the same person as in the 23andMe CoA test:

FTDNA chr1 all segments

Fig. 3: Chromosome 1 matching segments from FTDNA Family Finder chromosome browser list.

 

First thing to note is that, while the FTDNA customer database for Family Finder is much smaller than 23andMe’s customer base, because FTDNA is reporting matching regions down to 1cM the Family Finder Chromosome Browser downloaded data set is quite a bit larger than the 23andMe CoA file. In figure 3, for chromosome 1, we are looking at over 1000 such matching regions.

It is quite clear that there are pile-up regions, sticking up like telephone poles in the forest of matches. These “matches” are occurring much, much more frequently than one would expect for a random distribution of chromosomes recombining in each generation.

Since the 23andMe CoA has a minimum cutoff of 5cM for a segment, we can filter the FTDNA data to include only those segments that likewise are at least 5cM in size. A plot of the FTDNA result for chromosome 1 ends up looking similar to the plot of matches from 23andMe:
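If you want to apply the same 5cM floor to your own download, a sketch along these lines should do. The column names below are my assumption of what a Family Finder chromosome browser export looks like, and the three rows are invented; check the header of your own file and adjust.

```python
import csv
import io

# Invented sample rows; the header names are an assumption about the export format.
SAMPLE = """NAME,MATCHNAME,CHROMOSOME,START LOCATION,END LOCATION,CENTIMORGANS,MATCHING SNPS
Tester,Match A,1,1000000,9000000,8.42,2100
Tester,Match B,1,2000000,3000000,1.73,500
Tester,Match C,6,28000000,33000000,5.00,1400
"""


def load_segments(text, chromosome, min_cm=5.0):
    """Return (start_bp, end_bp) pairs on one chromosome at or above min_cm."""
    kept = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["CHROMOSOME"] != str(chromosome):
            continue
        if float(row["CENTIMORGANS"]) >= min_cm:
            kept.append((int(row["START LOCATION"]), int(row["END LOCATION"])))
    return kept


print(load_segments(SAMPLE, 1))  # [(1000000, 9000000)]
```

The resulting list of (start, end) pairs is all that is needed to draw a stacked plot like the ones in this post.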

Fig. 4: Chromosome 1 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

Figure 4 is similar to figure 1, and we notice the pile-up near the 230-240Mbp region. However, what stands out in fig. 4 is the pile-up region around 180Mbp, which is not evident in fig. 1. We are starting to see regions of chromosome 1 where there is excessive “matching”.

Let’s repeat this exercise for chromosome 6, first presenting all the FTDNA FF segments on chr6:

FTDNA chr6 all FTDNA segments

Fig. 5: Chromosome 6 matching segments from FTDNA Family Finder chromosome browser list.

 

The massive pile-up around 30Mbp in figure 2 is now even more massive. The FTDNA Family Finder Chromosome Browser data includes 1467 segments for chr6, a great share of them in the pileup regions. Besides the largest such pileup we can visually identify 3 others.

As before, if we filter the FTDNA data for only those segments at least 5cM in size we get a much smaller set, and when plotted we get:

FTDNA chr6 5cM floor

Fig. 6: Chromosome 6 matching segments ≧ 5cM, from FTDNA Family Finder chromosome browser list.

 

In the filtered data, with only segments greater than or equal to 5cM, only the major pile-up in chr6 remains, and it is still imposing. There may be a pile-up near the start of chr6 too, but it may not be statistically significant.

In the above figures it becomes evident that there is excessive matching in particular regions of chromosomes. Furthermore, the commonality of these matches suggests that attempting to incorporate this data into family history research will lead to futility.

As humans we are all related to each other, the question being how long ago any two individuals’ MRCA lived. In the US, people of colonial descent are likely multiply related to each other within the past 20 generations (500 years), and many times at that. Indeed, those of colonial descent will have very many 10th cousins and closer living in the US; the numbers of 5th through 10th cousins are likely in the millions for the colonials. So, the existence of relatives, distant to very distant, is not the question.

☞ The key to doing genetic genealogy with autosomal DNA is not finding a match, but rather finding genealogically tractable matches.

“Genealogically tractable” here means that given a reasonably exhaustive search of existing records, two family trees are documented sufficiently to support the conclusions in the pedigree, and the name of a most recent common ancestor can be found. Thus ancestors who lived before the era of documentation trails are not genealogically tractable. To go back further in time than a document trail can allow means we are entering the territory of the “ethnicity” or “ancestry” estimates provided by some companies. For many people in the world there are no records before the 19th or 18th centuries, while in a small set of locales documentation may go back to before the era of European colonization, but in these cases the records rapidly collapse to the nobility and royalty.

This is important because,

☞ Given lax enough matching criteria, one can have a DNA match with a person with whom your shared MRCA existed before records were kept for that MRCA.

Thus our goal is to find matches with whom we can possibly find the MRCA. A “match” that is based on population-common fragments of chromosomes is unlikely to be resolvable by genealogical methods, as such chromosome fragments would have been found in a large number of the contemporaries of our own pedigree ancestors. This is all the more true as we move back in time, when people found their mates nearby and not uncommonly married their cousins.

As noted above, 23andMe filters out some common regions (though they have not published the details) before a “match” can make it onto the DNA Relatives list.

In the fall of 2014 AncestryDNA implemented their own means to do something similar, presented by them with much fanfare (and here.) AncestryDNA’s new matching system has caused quite a bit of a stir, in part because customers lost some to many of their old matches. The tests I manage lost from 40% to over 90% of their previous matches. Part of the reason for this is AncestryDNA’s new “Timber” algorithm, which explicitly attempts to deal with the phenomenon of the matching of overly-common chromosome regions, the existence of which is undeniable, as we see in the plots in this post.

Without addressing the matching to overly-common DNA we end up with very large lists of matches with whom we will never find the MRCA, which was the case previously with AncestryDNA and is possibly true still at FTDNA.

It may turn out that the Timber algorithm is too aggressive, and some genealogically informative matches are being lost in the new AncestryDNA matching system. This may happen if an inherited region of a chromosome is bisected by many population-wide common segments, and, post-Timber filtering, the pieces that remain of this inherited region are each too small to make the minimum threshold to declare a “match”. However, for AncestryDNA to correct this will probably require developing even more sophisticated filtering algorithms, about which I may write later in another post.
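The fragmentation worry can be sketched in a few lines. To be clear, this is not AncestryDNA's Timber algorithm, whose details I am not reproducing here; it is a crude stand-in that simply cuts masked regions out of a segment. The coordinates are hypothetical and given directly in cM, and the 5cM floor is an assumed threshold.

```python
def subtract_intervals(segment, masked):
    """Return the pieces of `segment` left after cutting out `masked` intervals."""
    start, end = segment
    pieces = []
    cursor = start
    for m_start, m_end in sorted(masked):
        if m_end <= cursor or m_start >= end:
            continue  # this masked interval does not touch what is left
        if m_start > cursor:
            pieces.append((cursor, m_start))
        cursor = max(cursor, m_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


MIN_CM = 5.0                             # assumed matching floor
inherited = (0.0, 12.0)                  # a hypothetical 12 cM inherited region
pile_ups = [(3.0, 4.0), (8.0, 9.0)]      # hypothetical population-common stretches

pieces = subtract_intervals(inherited, pile_ups)
survivors = [p for p in pieces if p[1] - p[0] >= MIN_CM]
print(pieces)     # [(0.0, 3.0), (4.0, 8.0), (9.0, 12.0)]
print(survivors)  # []
```

A 12cM region that would easily have made the list is reduced to three pieces of 3, 4, and 3cM, none of which clears the floor on its own.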

Having established the requirement for filtering out pileup regions, I want to stress again that these regions are not just small, 1 to 2cM, portions of chromosomes. Some will be much larger.

In general, in regard to matching of genotyped individuals, as a standard or best practice, matches ought not be declared on small (less than 7cM) regions of unphased genotype datasets, a subject discussed at length by various authors and bloggers and which I won’t repeat here. Phased genotypes can be matched on smaller regions with greater confidence, though I would not do so for regions less than about 5cM. (And if you’re interested in current academic discussions of identical-by-descent detection try here, here, here, and here). Unfortunately FTDNA insists on reporting small segments on their unphased genotype data sets, which misleads the customer in regard to how significantly they match other people.

Given all of that, we conclude this:

☞ Even after filtering out HIRs smaller than 5cM, DNA matching services such as FTDNA or 23andMe should filter out HIRs greater than 5cM that are appearing too commonly in the population, so that the end user receives a genealogically tractable match list.
 ❦

 

Ancillary to the above discussion and needing to be addressed especially in light of how FTDNA’s Family Finder reports matches, and since I mentioned it above, I want to touch briefly on the concept of “size” when it comes to autosomal DNA segments. This is important because I often see commenters in forums referring to a shared HIR as being ‘n.nn’ centiMorgans, sometimes to those 2 decimal places, probably because they check their FTDNA Chromosome Browser results which report matching in centiMorgans to two decimal places.

FTDNA should not give HIR lengths, individual or in total, in cMs to two decimal places!

The diagrams in this post use base pairs (or millions of base pairs) as the coordinate to place the matching HIRs over the chromosome length. In genetics, the concept of linkage disequilibrium mapping brings about the need to map the physical molecular position to a frequency space, with the frequency unit being a Morgan but in practice the centiMorgan is used.

Each human chromosome has by experimental procedure been “mapped”, associating a physical coordinate onto a scale measured in centiMorgans.
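As a rough illustration of what such a map does, the sketch below converts physical positions to centiMorgans by straight-line interpolation between anchor points. The four anchors are invented for the example and are not from any published linkage map; real maps have many thousands of points and differ by sex and population.

```python
from bisect import bisect_right

# Invented anchors: (physical position in bp, cumulative genetic position in cM)
TOY_MAP = [(0, 0.0), (10_000_000, 20.0), (20_000_000, 22.0), (30_000_000, 40.0)]


def bp_to_cm(pos, gmap=TOY_MAP):
    """Linearly interpolate a cM coordinate for a base-pair position."""
    positions = [p for p, _ in gmap]
    if pos <= positions[0]:
        return gmap[0][1]
    if pos >= positions[-1]:
        return gmap[-1][1]
    i = bisect_right(positions, pos)
    (p0, c0), (p1, c1) = gmap[i - 1], gmap[i]
    return c0 + (c1 - c0) * (pos - p0) / (p1 - p0)


def segment_cm(start_bp, end_bp, gmap=TOY_MAP):
    """Genetic length of a segment, as the difference of two interpolated points."""
    return bp_to_cm(end_bp, gmap) - bp_to_cm(start_bp, gmap)


# Two segments of identical physical length (10 Mbp) in different regions:
print(round(segment_cm(0, 10_000_000)))           # 20
print(round(segment_cm(10_000_000, 20_000_000)))  # 2
```

The same 10 Mbp comes out as 20cM in one stretch of this toy map and 2cM in the next, which is why the cM figure depends entirely on the map behind it and deserves no more precision than a whole number.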

What many enthusiasts in genetic genealogy may not know is that the thought processes and concepts associated with these linkage maps have to deal with many issues; some of these may impact our understanding of the population-wide matching segments which are the topic of this post.

Referencing “cool spots” and “hot spots” can imply that coordinates from a linkage map be taken uncritically; however, the very large sets of genotyped samples being collected by 23andMe and AncestryDNA may be used in the future to create a better understanding of recombination probabilities along the chromosomes.

There is much variability in the end products of meiosis, and in regards to recombination and creating maps of chromosomes the following issues are topics of discussion:

  • age (e.g., here);
  • ethnicity;
  • gender (if comparing a child to a parent);
  • unique genetics (e.g., here).

The measurements for segment lengths given by 23andMe or FTDNA or gedmatch.com are averages, specifically gender averages.

For further (and very technical) reading on this subject:

Identifying recombination hotspots using population genetic data

Genetic Analysis of Variation in Human Meiotic Recombination

Enhanced genetic maps from family-based disease studies: population-specific comparisons

Variation in Human Recombination Rates and Its Genetic Determinants

Genetic Control of Hotspots

 

☞ The bottom line about segment lengths is this: Round the length of chromosome segments to whole numbers, and remember additionally that the size of a segment is only an estimate and is accompanied by non-trivial error bars.

 

 ❦ ❦

 

 

Surnames, 23andMe, and AncestryDNA: Making the Most of Match Counts and “Enrichment”

 

The questions we hope to answer are these:

Can we use the surnames provided (either in lists or pedigrees) by our DNA matches to:
  1. Discover the names of ancestors about whom we have no a priori knowledge?
  2. Confirm the names of ancestors about whom we may have doubts?

 

This is a very complex subject, so this is a lengthy post to wade through – you are forewarned!

 

Making the most of the direct-to-consumer (DTC) DNA services for doing genealogy is currently a challenge, with no single company offering everything an aspiring (professional or dilettante) genetic genealogist requires to accomplish many goals.   This quandary includes dealing with the cultural attachments to our DNA – our names.

One of the central theories of genealogy as practiced in the English speaking world is that surnames matter.   This turns out to be true enough, and is a reflection of the deep patriarchy embedded in western society.

Yet for many of us with ancestry from other nations outside the Anglosphere this surname-centric view of families will not be of much use.   I for instance have half my pedigree that is from a culture where patronymics were the custom until the time of my grandfather’s emigration; I am only the second generation born with the surname I carry.

Still, for the English genealogy world the surname ranks as one of the more important concepts.   For example – the field of one-name studies, while not absolutely wedded to the idea, centers on the name being studied as a surname.   Genealogy database programs such as Family Tree Maker are structured around a naming paradigm in which there are surnames.  Your local courthouse records are probably indexed by surname, and so on.

A Tree of Names, but Which Ones Belong?

 

In the DNA view of genealogy as practiced today surnames still are center stage – at FTDNA Family Finder there is a field for surnames, at AncestryDNA the match page has a section dedicated to pedigree surnames and surnames in-common, and at 23andMe a customer can enter surnames into one’s profile, which will then appear in DNA Relatives and upon which one can search or sort one’s matches.

23andMe goes one step further, though, and offers what they call “Surname View”. This is a ranking of surnames as found in a customer’s DNA Relatives list.   The surnames are ranked by an “enrichment” score.   23andMe defines this value as:

Enrichment is computed via a one-tailed binomial test. The 23andMe-wide frequency of a given surname is the reference frequency. The number of occurrences of the surname among your matches and the total number of surnames among your matches are the counts in the binomial test. This results in a p-value; we then report -1.0 * log10(p), so the bigger the number is, the more unusual it is that it was at such high frequency among your matches.

 

For the purposes of this discussion I’ll use the term “enrichment” in the same sense as 23andMe. Their definition of enrichment is an interesting idea, though I have some questions about the validity of using a binomial test.  In a future post I may delve into more technical details about this issue.   There is one very important point to make:

☞ How often a name occurs (in your match list) is not sufficient to define the importance of having a match with that surname in their pedigree.   Rather, we need to take into account the likelihood (based on the population one is testing) of the particular name occurring at random.

 

Recognizing that 23andMe is just one pond in which to fish, many of us have tested elsewhere, such as at AncestryDNA.    However, AncestryDNA does not provide any analysis regarding name frequencies in a customer’s set of matches.   But, with a little bit (or a lot) of cleverness we can collect the surname data from our matches at AncestryDNA, found on each match page in the left hand column titled “Surnames (10 generation pedigree)”.

With this data I can then calculate what 23andMe calls the “Enrichment” for our AncestryDNA set of surnames from matches, using the US Census surname frequency data in place of 23andMe’s “23andMe-wide frequency”, as I don’t have access to the entire database of AncestryDNA tests (if only!) to calculate an equivalent AncestryDNA-wide frequency table.   Given that the tested person in this case has all colonial American ancestry, and because the AncestryDNA customer base to date is overwhelmingly from the US with the “colonials” being a significant share of the customer base, US Census data for expected values of name occurrence is presumed to be a good approximation of an AncestryDNA-wide data set.
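For those who want to try this on their own data, here is a minimal sketch of the calculation as I read 23andMe's description: a one-tailed binomial test, reported as -log10 of the p-value. The function names are mine, the example numbers are made up, and whether 23andMe computes it exactly this way is an assumption. The sum is done in log space so that very small p-values do not underflow.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k hits in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def enrichment(count, total, ref_freq):
    """-log10 of P(X >= count) for X ~ Binomial(total, ref_freq), 0 < ref_freq < 1.

    count    : occurrences of the surname among your matches
    total    : total surname occurrences among your matches
    ref_freq : reference frequency of the surname (e.g., from census data)
    """
    if count <= 0:
        return 0.0  # P(X >= 0) is 1, so there is nothing unusual to report
    logs = [log_binom_pmf(i, total, ref_freq) for i in range(count, total + 1)]
    top = max(logs)
    log_tail = top + math.log(sum(math.exp(v - top) for v in logs))
    return max(0.0, -log_tail / math.log(10))


# Made-up example: a surname seen 10 times among 1000 name occurrences,
# against a reference frequency of 1 in 1000.
score = enrichment(10, 1000, 0.001)
```

Under these assumptions a name seen twice in the same made-up list scores lower than one seen ten times, and a common name needs far more occurrences than a rare one to reach the same score.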

A concrete example, from two tests I manage, of the same person, one test at 23andMe and the other at AncestryDNA:

Here are all 226 entries under “Surname View” for this particular test:

 

23andMe Surname View

Surname   23andMe Count   23andMe Enrichment
Roberts   25   61
Bryan   11   50
Chiles   5   49
Webb   17   48
Adkins   8   42
Green   24   42
Hoskins   6   42
Smythe   5   41
Dowell   5   41
Peyton   5   40
Cole   14   37
Hinds   5   37
Beasley   7   37
Griffith   10   36
Forster   5   36
Wentworth   5   33
Parker   19   33
Henley   5   33
Holman   6   33
Henson   6   33
Schmid   5   33
Gibbs   8   32
Wilcox   8   31
Drake   9   30
Wyatt   7   30
Vaughn   8   29
Rowe   8   29
McDaniel   8   29
Allison   7   29
Rush   6   29
Turner   17   29
Dick   5   29
Bishop   11   29
Best   6   29
Garner   7   28
Moyer   5   28
Powell   13   28
Roe   5   28
Mackey   5   27
Bowen   8   27
Edwards   16   27
Bates   9   26
Stout   6   26
Moon   6   26
Garrison   6   26
Dudley   6   26
Sharpe   5   26
Harper   10   26
James   12   26
Chambers   8   25
Ballard   7   25
Anderson   30   24
Gibson   11   24
Christian   6   24
Valentine   5   24
Pitts   5   24
Lane   10   23
Hale   8   23
Scott   19   21
Ward   14   21
Preston   6   21
Hayden   5   21
Duke   5   21
Thompson   26   21
Wallace   10   20
Lewis   21   20
Shepherd   6   20
Davenport   6   20
McMillan   5   20
Chapman   10   20
Berry   10   20
McDonald   10   19
Abbott   6   19
Yates   6   19
Dennis   6   19
Robertson   11   19
Orr   5   19
White   26   19
Stephens   9   19
Woods   9   18
Kemp   5   18
Wilson   29   18
Stephenson   5   18
Rogers   14   17
Harrison   10   17
Booth   6   17
Ingram   5   17
Warren   9   16
Fleming   7   16
Barrett   7   16
Clark   23   16
Chandler   6   15
Collier   5   15
Dyer   5   15
Brooks   10   15
Walker   19   14
Allen   19   14
Cox   12   14
Stone   9   14
Parsons   6   14
Mullins   5   14
Campbell   17   13
Foster   10   13
Mason   9   13
Howell   7   13
Miles   5   13
Thomas   20   13
Day   7   12
Franklin   6   12
Love   5   12
Matthews   7   12
Miller   33   11
Cook   12   11
Austin   6   11
Russell   9   10
Wells   8   10
Reynolds   8   10
Lawrence   7   10
Keller   6   10
Gilbert   6   10
Sherman   5   10
Pratt   5   10
Hubbard   5   10
Young   15   9
Evans   13   9
Weaver   6   9
Fuller   6   9
Sutton   5   9
Lowe   5   9
Garrett   5   9
Fletcher   5   9
Fitzgerald   5   9
Dawson   5   9
Dunn   7   9
Hoffman   6   8
Davis   28   8
Kennedy   8   8
Rice   7   8
Owens   6   8
Hopkins   6   8
Carter   11   8
Hawkins   6   8
Bradley   6   8
Willis   5   8
Dean   5   8
Jones   34   8
Murphy   11   8
Hall   15   7
Wright   14   7
Hughes   8   7
Fisher   8   7
Sanders   7   7
Patterson   7   7
Murray   7   7
Jackson   15   7
Johnston   7   7
Griffin   7   7
Bryant   6   7
Riley   5   7
Mueller   5   7
May   5   7
Mills   7   7
Taylor   21   6
Watson   8   6
Gray   8   6
Meyer   7   6
Long   7   6
Phillips   10   6
Graham   7   6
Jenkins   6   6
Armstrong   6   6
Andrews   6   6
Greene   5   6
Hunt   7   6
Baker   13   5
Morgan   9   5
Bell   9   5
Bailey   8   5
Wheeler   6   5
Spencer   6   5
Marshall   6   5
Alexander   6   5
Payne   5   5
Henry   5   5
Elliott   5   5
Duncan   5   5
West   6   4
Smith   56   4
Robinson   12   4
Stewart   10   4
Nelson   10   4
Wood   9   4
Morris   8   4
Cooper   8   4
Reed   7   4
Martin   19   4
Price   6   4
Myers   6   4
Coleman   6   4
Tucker   5   4
Nichols   5   4
Knight   5   4
Carr   5   4
Arnold   5   4
Williams   28   4
Johnson   33   3
Adams   11   3
Butler   6   3
Schmidt   5   3
Perry   5   3
Palmer   5   3
Jordan   5   3
Hunter   5   3
Ford   5   3
Brown   28   2
Ross   5   2
Richardson   5   2
Howard   5   2
Bennett   5   2
Collins   7   2
Harris   11   1
Hill   9   1
King   8   1
Mitchell   6   1
Moore   11   0
Lee   9   0
23andMe Surname View on 24 Jan 2015, for same individual as tested at AncestryDNA

 

Here are the top (i.e., the most “enriched”) 226 (a number selected to be the same quantity as that from 23andMe) entries from the list of all surnames in matches at AncestryDNA for this same person:

 

AncestryDNA "Surname View"

Surname   AncestryDNA Count   "Enrichment"
Parke   38   34.41
Batte   23   33.33
Cocke   25   31.83
Pridmore   23   27.49
Barbara   28   26.86
Woodson   54   26.75
Crownover   26   26.41
Reade   21   23.67
Browne   56   23.39
Isham   28   22.69
Bartlett   78   22.33
Bolling   32   22.31
Prigmore   15   21.53
Wood   232   21.00
Eppes   17   20.99
Chiles   28   20.92
Fitzrandolph   11   20.84
Cresson   12   20.68
Poythress   17   20.19
Taliaferro   25   19.91
Stuart   67   19.82
Mellott   26   19.21
Cooke   62   19.15
Andersson   16   19.00
Ironmonger   10   18.59
Elliot   32   17.63
Pettypool   9   17.26
Pleasants   18   16.94
Stillwell   29   16.91
Jonsson   15   16.79
Demarest   20   16.74
Bryan   79   16.47
Brashears   18   16.45
Vanderveer   16   16.28
Piles   10   16.06
Woodward   61   15.99
Owen   82   15.81
Cock   10   15.61
Nalle   10   15.61
Paine   29   15.25
Glascock   16   14.94
Hubbard   86   14.82
Griffith   93   14.66
Fielding   24   14.62
Anna   13   14.60
Hix   24   14.45
Vancleef   11   14.29
Olofsson   8   14.24
Whitehead   64   14.16
Moredock   9   14.11
Neale   20   14.08
Goad   28   13.90
Vawter   15   13.75
Doyne   9   13.71
Clarke   79   13.67
Nilsson   18   13.57
Read   37   13.41
Ball   90   13.39
Moor   16   13.36
Frances   16   13.13
Chamberlain   48   13.10
Grymes   9   13.00
Mumford   23   12.95
Dickenson   20   12.94
Cossart   7   12.81
Tuttle   47   12.78
Bird   53   12.76
Symons   17   12.71
Tschudi   8   12.62
Jennings   95   12.59
Tarpley   18   12.41
Blankenbaker   11   12.40
Denton   46   12.40
Markham   31   12.39
Predmore   13   12.05
Wheeler   118   12.04
Meriwether   14   11.98
Dudley   52   11.86
Lanier   38   11.68
Brereton   12   11.65
Wynne   26   11.64
Harrison   152   11.51
Wallis   28   11.49
Schenck   21   11.45
Drake   71   11.43
Ragsdale   29   11.42
Pride   22   11.36
Cawood   11   11.31
Graves   88   11.31
Clements   53   11.29
Agnes   10   11.23
Langston   36   11.18
Overton   35   11.13
Morton   70   11.10
Worsham   20   10.96
Muller   43   10.92
Brevard   11   10.85
Crew   17   10.85
Johnston   119   10.82
Larsson   11   10.75
Tydings   8   10.70
Tarleton   12   10.70
Enyard   6   10.66
Petersson   6   10.66
Roscow   6   10.66
Lyon   43   10.63
Osborne   74   10.57
Vaughn   88   10.57
Brooke   18   10.53
Larzelere   8   10.47
Parsons   74   10.45
Waggoner   30   10.29
Tandy   13   10.29
Clayton   65   10.28
Armistead   14   10.28
Oldham   28   10.27
Daniel   80   10.25
Cathey   23   10.24
Eva   9   10.20
Sharp   81   10.16
Rawlings   24   10.14
Roosa   12   10.13
Fowke   6   10.09
Vivion   6   10.09
Antram   7   10.05
Hogg   21   9.92
Newberry   28   9.91
Darnall   12   9.90
Henley   35   9.82
Beall   23   9.81
Rachel   13   9.72
Stout   54   9.66
Earle   22   9.65
Meacham   19   9.59
Dabney   20   9.58
Ballenger   17   9.54
Hegeman   9   9.51
Garnet   7   9.50
Pledge   7   9.50
Fowler   97   9.50
Ely   27   9.50
Harwood   25   9.45
Royall   11   9.39
Suddarth   9   9.39
Allen   313   9.39
Brouwer   13   9.37
Bull   25   9.37
Williamson   103   9.37
Persson   12   9.36
Faure   8   9.33
Westcott   18   9.32
Scripture   7   9.27
Calvert   31   9.26
Stedman   16   9.23
Boone   60   9.23
Mcgehee   18   9.22
Cloyes   6   9.22
Simcock   6   9.22
Standley   16   9.16
Craige   7   9.05
Salling   7   9.05
Pope   67   9.04
Warren   125   9.03
Maria   15   8.98
Mendenhall   23   8.97
Crow   36   8.96
Poor   14   8.95
Waddy   11   8.88
Hertzel   6   8.88
Kip   6   8.88
Vannuys   6   8.88
Field   37   8.85
Steel   19   8.83
Threlkeld   12   8.81
Mershon   11   8.75
Hoge   13   8.75
Follansbee   8   8.74
Scudder   15   8.74
Pendleton   29   8.73
Hawkins   115   8.72
Baskett   11   8.68
Crawford   130   8.66
Griswold   23   8.65
Alice   7   8.65
Strother   19   8.64
Street   32   8.64
Alden   17   8.60
Aldin   5   8.59
Efland   5   8.59
Mcilhaney   5   8.59
Thomasen   5   8.59
Loomis   27   8.54
Vanbibber   9   8.53
Brewster   31   8.47
Stryker   13   8.37
Hale   80   8.36
Erwin   34   8.33
Hobart   13   8.33
Mccombe   6   8.30
Bromwell   7   8.30
Bushnell   16   8.30
Hull   50   8.28
Rogers   210   8.26
Irvine   21   8.25
Runyon   21   8.25
Sprague   34   8.25
Garland   37   8.24
Tinsley   27   8.23
Pridemore   12   8.17
Buys   9   8.17
Bennet   14   8.13
Gooch   23   8.11
Battaile   5   8.11
Faut   5   8.11
Hogshead   5   8.11
Chappel   12   8.08
Randolph   48   8.07
Croasdale   6   8.06
Fiske   14   8.02
Esther   7   7.99
Basye   8   7.92
Straughan   8   7.92
Wyatt   56   7.90
Haile   15   7.89
Low   23   7.87
Harbour   14   7.87
As of 24 Dec 2014, AncestryDNA matches' surnames-in-pedigrees analyzed similarly to 23andMe "Surname View". The top 226 "enriched" surnames are shown, the same number of surnames as in the total Surname View at 23andMe, but which are only 3% of all the surnames from the AncestryDNA match list.

One thing that is very clear upon comparing the two lists is that they are not the same.

They are not even close.

The surnames in the 23andMe list do show up on the larger AncestryDNA list (not just the top 226), but scattered across the nearly 10,300 names in the much larger AncestryDNA list.

What also should stand out is that the counts of matches (second column) are much larger for the AncestryDNA data set than for the 23andMe data set.    While the number of customers of each service is about equal, those of us who have tried to use 23andMe for genealogy have discovered quite quickly that many DNA Relatives there are incognito, and even if a DNA Relative has a public profile they usually list very few surnames.   Add to this that 23andMe limits the DNA Relatives list size to only the top matches (for those customers with more than 1000 matches, which would include many Americans) and we discover that there are relatively few data points from the 23andMe database in regards to surnames.

In the above example, the testee has 4x as many matches at AncestryDNA as at 23andMe, and before “autosomalgeddon” at AncestryDNA last fall the match count was twice that.    However, even more significant is that the average number of people in the pedigrees of matches at AncestryDNA is over 50, while at 23andMe the average number of surnames per match is but a small fraction of that.   Because of the smaller data set of matches with ancestral names at 23andMe, it is possible for a few diehard genealogists, who test multiple members of their families, to skew the Surname View results by putting long lists of surnames in their profiles.   For example, the above 23andMe customer matches 3 siblings, each of whom has a surname list of approximately 215 names!   This one family significantly skews the Surname View results for the example test I am discussing.   In this case, this family is on the first page of matches (that is about 1/12th the total) in DNA Relatives, and accounts for 645 name occurrences on that page.   Everybody else on that page (95 people) accounts for just 413 more name occurrences.
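The size of that skew is easy to put a number on. The short sketch below uses only the counts just given (3 siblings with 645 name occurrences, 95 other people with 413); the function name is mine.

```python
def share_of_names(group_names, other_names):
    """Fraction of all name occurrences contributed by one group."""
    return group_names / (group_names + other_names)


family_people, family_names = 3, 645
other_people, other_names = 95, 413

family_share = share_of_names(family_names, other_names)
print(round(family_share, 2))                # 0.61
print(round(family_names / family_people))   # 215 names per sibling
print(round(other_names / other_people, 1))  # 4.3 names per other match
```

So three of the 98 people on that page supply about 61% of its surname occurrences.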

 

☞ Caution: Outliers, such as extremely deep pedigrees or very long name lists, will skew results on relatively small datasets, such as at FTDNA and 23andMe.

 

Given that we have two different lists of “enriched” surnames for the same person, which one is more informative? (Note that I did not write “accurate”.)

After a couple of years of working on developing the pedigree for the subject of this post, with plenty of surprises along the way, it turns out the AncestryDNA surname enrichment list better reflects what I know of the pedigree.   This is especially so in the top 25 names: 4 of the top 25 AncestryDNA enriched names are the same as pedigree surnames, while only one name (“Green”) in the top 25 of the 23andMe enrichment list is found in the developed pedigree, and that name is questioned by some researchers of that ancestral family.

Of the entire 23andMe list of 226 surnames, only 11 are found in the developed pedigree.   For the top 226 from AncestryDNA it also turns out there are 11 names found in the testee’s pedigree.   The only name in common between the two groups of 11 is “Johnston”, which is good as that is the family name of the testee’s grandfather.   It should be noted that the entire list of AncestryDNA matches’ pedigree-names included in this analysis totals over 10,000, and names in the rest of the pedigree of the testee are found on that list but farther down than #226.

Which raises the next question – what about all those other names on the list(s), that are not ancestral names of the person tested?

Intuitively we recognize that our ancestors can have many descendants, and in the society of which we’re a part, a daughter will lose her maiden name and pick up her husband’s surname.   Thus our cousins descended from these great-grand aunts and female cousins will carry surnames that are not part of our direct pedigree lines. This is both a bane and a boon. It is a bane because unless we’ve carried out very extensive descendancy research we will not recognize the names of our genealogical cousins.   On the other hand, all these names are a boon because until the late 20th century people found their mates relatively close by, and families often intermarried and migrated together, making certain names more associated with each other.   From the example test in this post, on the AncestryDNA enriched surname table the top entrant is “Parke”, a name often found in colonial pedigrees due to the 17th century founders in America (e.g., purportedly a Dr. Roger Parke of early New Jersey) having many male descendants, including quite a few who ended up in what is today West Virginia and Kentucky.  This surname is then associated with many families that come from these areas, and that is the case here: our subject’s father’s ancestors lived in WV and KY, where many Parke/Park/Parks families also lived.

As we work our way to an answer to the two questions at the top of this post it is important to realize that the use of DNA in genealogy is rapidly expanding, and as a larger share of the population tests at these companies we will see larger data sets that will give us an opportunity to be more certain of whatever we find.

So, with regard to question #1 posed at the beginning of this post, the best answer currently is “probably.”

In the example DNA test I am discussing in this post, only through DNA testing has a secret been revealed: an NPE (or possibly a secret adoption).   It turns out that 3 surnames in the top 25 of the AncestryDNA surname enrichment table belong to ancestors of the biological parent of this surprise.   Those 3 names are in 4th, 7th, and 13th place on that enrichment list (out of 10,300 surnames).   It may still prove to be a coincidence, but finding the surnames of 2nd and 3rd great grandparents high on such a list ought not to be surprising.   In this case the surnames are unusual enough to stand out, but not so ultra-rare in our society as to be left off the US Census list (which only includes names which occur more than 100 times in the US).   Extremely rare surnames are also, by definition, unlikely to show up in a list of DNA matches.

Furthermore, the 3rd entry on that AncestryDNA enrichment list, “Cocke”, is probably the ancestral surname of “Cox”, and the pedigree of the tested person most likely (based on un-vetted trees) has two entrants with that surname.   Number 21 on the above AncestryDNA enrichment list, “Stuart”, is the maiden name of a known 4th great grandmother.

My experience in doing this example is that nearly all the surnames in the pedigree of the person I’ve tested appear in the upper quartile of the AncestryDNA enrichment list, with many of them in the upper 10%.

So, without a priori knowledge of the names of unknown ancestors, how would one make use of an “Enrichment” list?   Based on my experience, I recommend starting at the top of an enrichment list, working down at least a couple of hundred names (given a dataset as large as AncestryDNA), and looking for any connections the names may have with one’s known ancestors.   Speculative family trees are not a bad thing, as long as they are treated as such.

☞ Mining surname data is a source of leads for genealogical research, and cannot stand apart from the exhaustive search for evidence required in sound family history practices.

In an ideal case a newly discovered close DNA cousin will have their full tree available to you to study, to find the most recent common ancestor, but that is not always the case.   Sometimes all one has is a list of surnames, or maybe a bare-bones pedigree with only names and no other useful information. If you discover a not-previously-known not-too-distant (say 2nd to 4th) cousin, and that DNA cousin lists some ancestral surnames, it is worth going through each of their surnames looking at an enrichment list of your own DNA test, to see if one or more stand out as occurring disproportionately frequently in your set of matches.
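As a rough sketch of that look-up, assuming you already have your own enrichment list as an ordered list of names (most enriched first), something like the following would report where each of a cousin’s surnames falls. The function name and the tree-themed names are made up for illustration, and spelling variants are not handled here:

```python
def rank_cousin_surnames(enrichment_order, cousin_surnames):
    """Return (surname, rank, top_fraction) for each of the cousin's surnames.

    rank is the 1-based position in the enrichment list, or None if absent;
    top_fraction is rank divided by the length of the list.
    """
    total = len(enrichment_order)
    rank_of = {}
    for position, name in enumerate(enrichment_order, start=1):
        # Case-insensitive match; keep the best rank if a name is repeated.
        rank_of.setdefault(name.casefold(), position)

    rows = []
    for surname in cousin_surnames:
        rank = rank_of.get(surname.casefold())
        fraction = None if rank is None else rank / total
        rows.append((surname, rank, fraction))

    # Best-ranked names first; names not on the list go last.
    rows.sort(key=lambda row: (row[1] is None, row[1] or 0))
    return rows


# Hypothetical, made-up data for illustration only.
my_enrichment_order = ["Alder", "Birch", "Cedar", "Dogwood",
                       "Elm", "Fir", "Gum", "Holly"]
cousin_list = ["elm", "Birch", "Zelkova"]

rows = rank_cousin_surnames(my_enrichment_order, cousin_list)
for surname, rank, fraction in rows:
    if rank is None:
        print(f"{surname}: not on the enrichment list")
    else:
        print(f"{surname}: rank {rank} (top {fraction:.0%})")
```

Names that land near the top are only leads to follow up in the records, not conclusions.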

Then what of question #2, where we desire to validate or support a theory about an ancestor’s possible surname (and in these cases we are often looking for the maiden names of our married ancestresses)?

In my opinion, one of the better uses of an enrichment list is to act as a check against the human tendency to fixate on an idea or observation.   Given the ubiquity of pareidolia I value any means to keep me from seeing what is not there.   In the field of genealogy that includes becoming overly sensitive to a particular name, whereby we exaggerate the importance of that particular name any time we come across it.

For the tested person in this post, after looking at many matches at the various companies I had become fixated on certain names, but looking at the AncestryDNA enrichment list I discover that those names occur no more frequently than one would expect from the population at large.

In another application of the enrichment table: I’m researching a great grandmother (of the tested person in this post) whose maiden name is known from census records and marriage records, but whose parents have escaped identification. In this case the lack of matching surnames of the DNA matches at 23andMe and AncestryDNA, as evidenced by the low enrichment value of the great grandmother’s maiden name, and the low enrichment value of the maiden name of the only woman I can find that I hypothesized could be her mother,  supports my idea that the great grandmother was adopted (or a foster child who took the name of a couple that housed her for a while.)  This hypothesis I had previously generated based on census data where the great grandmother spends her teenage years and early 20’s living with two different families not of her own surname, and the surname data suggests that I am on the correct path.   I’ve further had tested at AncestryDNA another great grandchild of this mysterious woman, and analysis of surnames of his matches concurs with the conclusions I’m drawing from his 2nd cousin’s match data.

Which brings up a very useful strategy, whether one is doing segment chasing or looking at surnames or just trawling for unknown cousins:

☞ Testing additional known relatives, especially 2nd cousins, can provide concurring data, or keep one from reaching a wrong conclusion based on a single test.

Before I leave question #2,  when discussing the low enrichment scores I must throw in another caveat:

☞ For very common surnames (e.g. Smith, Williams, Jones, etc.) the law of large numbers will affect the significance test such that these names will not be very high on ordered enrichment lists.

This point can best be made with a graph:

Surname Plot: Pedigree Surname Frequency vs. Census Frequency

The magenta line is a line of slope 1 intersecting the origin (0,0).   In other words, all names to the left or above the magenta line are occurring more frequently in the matches (for our test subject) than in the US population as determined from the US Census for 2000.    We can see that as we move towards the more common surnames the variance in the data decreases and the surnames move closer to the magenta line.    So if you are going to do this analysis on your own match results, remember that even if you have ancestors named “Jones” or “Smith” do not expect there to be large changes in the frequencies of those names in your match list.
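For readers who want to try this on their own match lists, here is a minimal sketch of the two quantities involved: the enrichment ratio (a ratio above 1 puts a name on the “more frequent than the census” side of the magenta line) and a one-sided binomial tail probability. This is my illustration of the general idea, not the exact pipeline behind the tables in this post, and the counts below are hypothetical numbers chosen only to show the contrast between a rare and a common surname:

```python
import math


def enrichment_ratio(observed, n, census_p):
    """Observed count divided by the count expected from the census proportion."""
    return observed / (n * census_p)


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical numbers for illustration only (not taken from the post's data).
n = 100_000                                       # surname occurrences across all matches
examples = {
    "rare name": (12, 0.00003),                   # about 3 expected, 12 observed
    "common name": (1050, 0.01),                  # about 1000 expected, 1050 observed
}

for label, (observed, census_p) in examples.items():
    ratio = enrichment_ratio(observed, n, census_p)
    tail = binom_upper_tail(observed, n, census_p)
    print(f"{label}: ratio {ratio:.2f}, P(X >= {observed}) = {tail:.2g}")
```

With numbers like these, nine extra occurrences of the rare name quadruple its ratio, while fifty extra occurrences of the common name move its ratio only from 1.00 to 1.05, so the common name stays close to the magenta line.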

Some additional observations to make:

  • The most common surnames in America are showing up in matches (of our test subject) less than the average in the US.   The entire name collection of the matches is tilted towards the less common surnames.
  • The demographic change in the US is quite noticeable, with the matches to old colonial American names prominent, while contemporary Hispanic names are under-matched.
  • Many ancestry.com users insist on putting a woman’s given name in the surname field.  For the statistical significance test I culled these (except for “Barbara”, for which it was difficult to determine which instances were not really surnames), but here I show them to illustrate how common this habit is among ancestry.com users.

I suggest that any “colonial” (that is, anyone whose ancestors all immigrated to America before 1776) will see similar results.

There is much more to discuss about surnames and counting their frequency, and what to make of our DNA matches.   Surname and geography frequencies, and the associated statistical tests, are a supplement to other genealogy methods.  Even with a more molecular approach to mining DNA cousins (i.e., matching chromosome regions) we still need to ascribe the shared DNA to a person, said person being known in their day and referred to by us through their name(s), familial, given, or fabricated.

For many of us with doubtful or absent ancestors at the ends of branches in our family trees, perhaps analyses like these can lead us in directions to dig for genealogical gold.

 ❦

This already long post would have been longer had I not excluded diving into several topics, which are possibilities for further discussion, eventually:

  • Outlining the analysis pipeline;
  • Improved methods of significance testing that more faithfully apply to the surname practices than the binomial test;
  • Exploiting name frequency distributions other than the 2000 US Census;
  • The problem of spelling and other phenomena related to names.

With that said, happy adventures in DNA mining!

MyHeritage’s Family Tree Builder – Not Ready For Primetime

Not to continue to rain on the MyHeritage parade, but I discovered something today about their home-computer-hosted front end, otherwise known as “Family Tree Builder” (FTB), (which is a not-priced download from MyHeritage), that I think is important to highlight.

On 22 Dec 2014 I downloaded the latest OSX version of FTB.  The software claims to be version “.20”.  The other day I discussed my opinions of this software in conjunction with MyHeritage as an online genealogy service and I won’t rehash those ideas here, but in that post I mentioned how displeased I was with FTB and why I won’t use it.

However, on DearMyrtle’s genealogy community it was suggested to me that I download the latest version and try it.

But… I thought I had the latest version.   Why did I think that?  Well, in FTB, accessing the Help ->Check for Updates…  menu item brings up the Update Wizard, which I had run the other day.   I ran it again, and it assured me I was up to date:

 

FTB Update Wizard in version .20

 

So everything is copacetic, no?   No.    I went back to the MyHeritage FTB download page (linked above) and decided to download the disk image today, three weeks after I last did this.    I renamed the old app (so OSX wouldn’t overwrite the file) and installed the new download.   And guess what, after installing FTB from this latest download, the newly reinstalled FTB now claims to be version “.21” , which I take to be the next increment after the version I previously had (“.20”).

So, if this is a new version, then why just a couple of hours earlier when I did the Update Wizard in version .20 was version .21 not downloaded?  And, why was I told that my old version was “fully up-to-date”???

Sure enough, the latest version (.21) also has an Update Wizard, and it tells me (within minutes after running the .20 Update Wizard):

FTB Update Wizard version .21, run at same time as the .20 Update Wizard

 

Clearly the Update Wizard did not previously update me to the latest version that was in fact available.

[Aside – One further thing I noticed which puzzled me – in OSX one can view from the operating system the creation date of a file, and it turns out that my FTB version .20 and the newly-downloaded FTB version .21 both have (in the downloaded disk image) the same (very old) creation date, suggesting to me that the FTB developers need a better way to track and manage software releases.  Version .21 has a last modified date of 12 Jan 2015, which implies my current version is only 5 days old.]

I wonder (and this is speculation) if whoever is managing the download software at MyHeritage had multiple versions of the FTB software builds available on the same date, and only released version .21 after .20 was out into the community.

If so, then that is telling me (what ought to be already obvious from using any FTB version) that MyHeritage is using their customers as the beta testers for their FTB software.

But nowhere on the FTB download splash page is it explained that FTB is an early (at least for the Mac) beta version.

BEWARE EARLY BETA SOFTWARE!

I think this is important to stress here.   My impression is that many people who are into genealogy are older, and some large set of these family historians are not computer savvy, at least not to the point of being able to navigate among various beta versions of software, and it is impractical to require these historians to continually manage the version of the software they are (perhaps unwarily) testing.

So this is just one more reason why I can’t recommend MyHeritage/FTB for those of us who have other options.

 

Reasons Why I’m Not Picking MyHeritage Even Though 23andMe Tried

Over the past year I’ve played with a few items on MyHeritage, always looking for online resources besides ancestry.com by which I can make progress on the difficult branches of my family tree.

As a customer of 23andMe, as part of a new promotion they are doing with MyHeritage I get a “free” account at MyHeritage to post a family tree, that can be linked with my 23andMe profile for my matches to view.

However, as I wrote on a 23andMe Communities thread earlier, in response to the roll out on 15 Jan of the integrated MyHeritage family tree capability for 23andMe customers:

 I find MyHeritage way too frustrating to use.   I did the transfer thing, but when I wanted to delete someone it wouldn’t let me do it.   And there are no instructions for disconnecting people.   When I wanted to delete the person I had intended, the warning box said I had to delete his relatives first, which I didn’t want to do.

And for us Mac users, MyHeritage’s downloadable tree management software is like a trip back to the bad old days of Windows, complete with “C:” in filenames?!

So I deleted the whole tree.   I will just change my public profile here to state that if anybody is interested they can visit my trees at ancestry.com .

 

My frustration at times with exploring MyHeritage stems from more than just a few issues, but I thought it worth noting a handful here:

  1. Awkward Tree Management;
  2. Lack of novel records for my ancestors;
  3. Overpriced for what I get;
  4. Mac-unfriendly computer tool Family Tree Builder and poor web tree controls.

Let’s take these one by one:

1.   I find tree management at MyHeritage to be quite awkward.   As I complained on a 23andMe community post, MyHeritage’s online software will not allow me to delete someone without deleting others whom I do not want to delete.   This is just unacceptable.   I have not been able to discover how to selectively edit relationships of individuals, though that capability might be lurking somewhere under the obscuring user interface.    I couldn’t even find a way to add a mother for someone, as the pop-up (transparent) window wouldn’t give me that option for some people, without telling me why.

Add to that no pedigree view and too-sparsely spaced nodes in the family view and one ends up with a most uncomfortable viewing and climbing experience to get around the tree.

2.  As a test for Record Matching and “Smart Matching” I uploaded a 289 person tree of a relative of mine, and let the computers do their thing overnight.   When I finally got the Record Matches and Smart Matches, the results found by the algorithms were few and not very enlightening.   There was nothing that I had not already found elsewhere.   And the few matching trees that were found were of no real use and had less information than I had.    A review of the MyHeritage records database – something not quickly obvious – revealed a very limited set of records for my (American colonial) use.

3.  Here’s the deal on pricing at MyHeritage – no one with a decent size tree can get away from paying extra.    23andMe is advertising that its customers get a free tree, but it is really no different than the usual 250 node-limited trees that MyHeritage lets anyone create.   However, one cannot practically include descendants with that limited a tree, even though a well documented family tree should include the nuclear family, at a minimum, of your direct line ancestors.   A serious researcher will need the “PremiumPlus” package, and the regular price for that is $13.27/month, or $9.95/mo if bought in annual increments.   While not exorbitantly priced, that is not significantly different than my AARP-discount price at ancestry.com, which has much more of what I need to research my American family tree.

4.   Not only is MyHeritage’s  “Family Tree Builder” a knock-off name of ancestry.com’s Family Tree Maker, but for a Mac user like myself the program Family Tree Builder (FTB) is painful to use.    It is  simply the Windows version ported with some underlying bridging software, so the program is very much not following the user paradigm of OSX that all Mac programs ought to use.   For example, the menu bar is floating with the window, and not anchored at the top of the screen.  File names and file operations are highly reminiscent of DOS  ( “C:\” does not belong on a Mac!)   Here, for example, is what the file browser looks like when one chooses to open a file using the file menu of the native program:

Native file browser in FTB

 

… which is decidedly not what one wants to see in OSX.   Even more on point than the poor aesthetics is the loss of functionality that native Mac programs provide, even in file browsers.

But wait, there’s more.

You see, there is a second menu bar after all, the standard Mac menu bar, but it only has very limited functionality, such as opening and closing a FTB tree.   But let’s look at what that file browser looks like:

 

Second file browser in FTB

 

Read carefully the text at the top of the file browser.

FTB as a family tree database management tool is woefully lacking in capability, which is clearly why it is “free”.   It is intended as a hook to get people to use MyHeritage, but the inability to view and manipulate trees, etc., means that it cannot replace full-featured software like FTM.


 

The Bottom Line:

Though 23andMe has sold their connection with MyHeritage as a new capability, it is really nothing more than a mutual marketing gimmick.   MyHeritage now adds an advertisement (featured above the FTDNA tests) for 23andMe on their DNA page.   Meanwhile, over at 23andMe, their roughly 750 thousand customers will be directed to MyHeritage to buy online family tree services.   There is no indication that there will be integrated tree-matching with the DNA matching (such as at AncestryDNA), which would require integrating the 23andMe DNA Relatives (or the Countries of Ancestry .csv file) with a MyHeritage tree.  So, a 23andMe user will have to hope that their DNA Relative (1) has a public profile, (2) has put a tree on MyHeritage, and (3) has made that link visible; only then can the user manually click on that link to be sent to the MyHeritage site.

So for all the above reasons and more, I say to MyHeritage and 23andMe … No Thanks.

It’s All About Connections, This Time To Google+

The challenge, using a self-hosted WordPress blog, is to integrate with Google+ seamlessly.

If I used a Google-owned blogger account to blog then the connections would be easier, but I am aiming for more control over the blog appearance and content.

If I’m using the right WordPress plugins then this blog entry should be available on my Google+ home.

 

… holding breath …

Does The World Really Need Another Blog?

 

In starting this blog I have considered, for some time, if the world truly needs  another blog.

Especially one about genealogy.

Yet something drives me to go forth and plant my flag on the now hoary internet.

Perhaps I’m just doing this as “cousin-bait”, as some very well known genealogy bloggers encourage.

Still, when I look at the volume of posts I have made over the past two years on other websites I realize that I could fill up my own blog.  And perhaps I should, for here I will feel a bit more free to express some opinions that may not be so popular in certain corners of the genealogy world.

My focus will be on the exploitation of genetics – or better yet, genomics – in the field of family history.  And by family history I do mean “history” – the writing of stories, the end result of historiography.

But, first things first – I have to find some nifty pictures to spruce up the place….