
Microarray File Formats (aka RAW)

Documented here are the various Microarray file formats that carry the comprehensive SNP test results commonly called "RAW data". These files are generally the result of a DNA Microarray Testing lab procedure. They are simplifications of the "RAW" VCF Sequencing File Format already familiar to many in the genetics community; in computer industry terms, they are simply TSV (or CSV) format files. The vast majority of the content (and, in FTDNA's case, sometimes all of it) covers the Autosomes, which is why some call them Autosomal tests and formats. But that is a misnomer because the laboratory test and its results include SNPs from all the DNA: the Allosomes and mtDNA as well. It is the Autosomal content, being by far the largest part, that tends to be used most, in a matching-segment mechanism between testers.

Note that the microarray file formats are somewhat independent of their content and of the company generating them. That is, they do not specify which company, or which version of that company's test, produced the file, or even necessarily which reference genome model was used. This makes the development of automated analysis tools more difficult: only with developed heuristics can a tool determine the source of a file and its content.

Although more than just autosomes are included in the file, the microarray file formats are not the method of documenting specific test results from most yDNA, mtDNA or NGS targeted tests. The latter use the Sequencing File Formats well known to geneticists. With that said, these microarray file formats include xDNA, yDNA and mtDNA SNP values as provided by most of the testing companies. Check the companion page SNP Databases to learn more about the content itself (for example, how an rsID compares to a common SNP name such as P312).

Some common features between the formats (and even the VCF standard) exist. For example, all use a columnar, delimited text format. The Tab-Separated Value (TSV) form grew out of the simple textual table and then follow-on spreadsheet processing. The extended Comma-Separated Value (CSV) form is more robust and still textual, but not really directly readable by a human. It is a superior format for capturing a variety of information but must usually be computer processed. Spreadsheets support both TSV and CSV forms in files they read and write. Another common feature of the microarray file formats is that most have one or more lines of header information. This header is free form, but each line starts with the hash ("#") symbol to distinguish its content. The hash was introduced in the 1970s by the UNIX Shell to indicate a comment line in a script file that should not be processed. These hash lines that form a header are not defined as part of the spreadsheet formats and, as such, make the microarray file formats described here a little more unique and not 100% compatible with spreadsheet programs.
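As a minimal sketch of how a tool might deal with the hash header before handing the rows to a TSV or CSV reader, the function below separates the leading "#" lines from the data rows. The function name and the sample lines are our own, chosen for illustration only.

```python
def split_header(lines):
    """Separate leading '#' header lines from the data rows."""
    header, rows = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("#") and not rows:
            header.append(line)  # hash lines before any data row
        else:
            rows.append(line)
    return header, rows


# Illustrative input: two hash lines followed by one data row.
sample = [
    "# comment line\r\n",
    "# rsid\tchromosome\tposition\tgenotype\r\n",
    "rs12564807\t1\t734462\tAA\r\n",
]
header, rows = split_header(sample)
print(len(header), "header lines;", len(rows), "data rows")
```

Note that several vendors put the column titles in the last hash line, so a reader that wants column names may need to look at the final header entry.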

Both file forms in the spreadsheet world tend to be given a .csv file extension or suffix. But in this industry, the TSV file is often given a .txt file name extension. VCF happens to also use a simple, free-form TSV design. The microarray .csv/.txt files are often compressed to save space, as they contain 600 to 700 thousand diploid SNP values and can get quite large. Standard ZIP format is most often used, so the files are commonly delivered with a .zip container suffix.

These microarray file formats have one row or text line for each Probe result, most often an SNP. They identify the Probes by rsID as well as by chromosome (sequence) name and position within it. Some yDNA-only and mtDNA-only files use SNP names in place of the rsID. The key is that at least one of three forms is needed to identify the SNP row: (1) chromosome name and position, (2) rsID, or (3) SNP name. Usually more than one identification form is present; most microarray file formats use both the rsID and the chromosome name and position. Sometimes the two forms conflict with each other, leading to more confusion. The chromosome coordinate position is usually defined for the forward / 5'-3' / positive strand. Occasionally, and without indication, a file uses the reverse / 3'-5' / negative strand (and then often a complemented value), leading to further confusion.
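When a value is reported on the opposite strand, each base has to be complemented (A with T, C with G) before it can be compared with a forward-strand value. A small sketch of that step follows; the function name is hypothetical, and leaving I, D and the no-call dash untouched is our assumption.

```python
# Complement table for the four bases; anything else (I, D, -, 0) is left as-is.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_genotype(genotype):
    """Return the genotype as it would read on the opposite strand."""
    return genotype.translate(COMPLEMENT)


print(complement_genotype("AG"))  # TC
print(complement_genotype("--"))  # --
```

Palindromic pairs (A/T and C/G SNPs) read the same after complementing, so the strand of such a row cannot be inferred from the values alone.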

The microarray file formats are a simple form of a RAW, annotated VCF. Basic, unannotated VCFs do not include the rsID. Normal VCFs only contain the derived variants whereas microarray file formats contain all tested values — whether they are derived or ancestral. A RAW VCF, before filtering, may include ancestral value SNPs. Hence likely why you often see the term "RAW Data File" associated with these microarray file formats.

There are a number of tools that read and process microarray files. See the Third Party Analysis Tools page for more details. In particular, DNA Kit Studio allows the manipulation and even merger of various microarray test result files. Felix Immanuel was the first to provide such tools early on. Even bcftools (maintained at the Wellcome Sanger Institute) can read the TSV files and generate basic VCF ones, due to the popularity of this free-form mechanism with early, simple DNA testing results.

Features and Variances

Let us first cover some of the basic features and variances of the files between vendors, and then delve into examples of the actual file formats.

Feature Comparison Quick Summary

A table below summarizes the major features / content from each company. This is taken from our original chart at the bottom of the Genetic Genealogy Testing page that we introduced back in 2014.

Feature | 23-v3 | 23-v4 | 23-v5 | Anc-v1 | Anc-v2 | FTDNA | NGG Geno2.0 | NGG Geno2.0+ (NextGen) | MyHer | LivDNA
Approx Size (MB, compressed) | 8 | 5 | 6 | 6 | 6 | 6.5 | 1 | 6.5 | 6.4 | 6
Build Type | 37 | 37 | 37 | 37 | 37 | 36/37 | 37 | 37 | 37 | 38
DNA Type(s) | Auto, X, Y, MT | Auto, X, Y, MT | Auto, X, Y, MT | Auto, X, Y | Auto, X, Y | Auto, X | Auto, X, Y, MT | Auto, X, Y, MT | Auto, X | Auto, X, Y, MT
SNP ID | RS#, i | RS#, i | RS#, i | RS#, i | RS#, i | RS#, i | RS#, SNP [2] | RS#, SNP [2] | RS#, i | RS#, i, SNP [2]
Auto Probes 1-22 | 930,281 | 577,382 | 614,007 | 682,549 | 650,647 | 690,715 | 126,306 | 698,192 | 702,442 | 603,129
X/23 | 26,007 | 19,487 | 16,530 | ? | 25,250 | 17,478 | 3,803 | 17,812 | 17,892 | 15,511
Y/24 | 1,766 | 2,329 | 3,734 | 885 | 1,668 | - | 11,978 | 13,533 | 482 | 382 [2]
MT/26 [1] | 2,459 | 3,154 | 4,273 | - | 262 | - | 44 [2] | 41 [2] | - | 21 [2]
[1] Ancestry has the X/Y PAR as chr 25. v2 has MT as chr 26. 23andMe simply includes Y PAR values in the X result (as the 2nd value in the un-ordered pair for males).
[2] LivingDNA and NGG only report positive-for-change (derived, positive, changed) SNPs for Y and MT. So it is not clear how many they are actually testing, nor how many are ancestral (negative, unchanged). They also only report these values as a list of SNP names and not by rsID or position. Each is in a separate file.

File Sizes and Versions

It is not enough to know the vendor of your test. You need to know which version of the chip microarray (CMA) they used in the lab on your sample. And even, as it turns out, which minor version of file format they have provided your data in. Note that some of these minor versions were coding errors and later fixed. You can sometimes get an updated, corrected file simply by re-downloading a new RAW file. If you have a file that does not fit into the metrics of the chart below, please let us know so we can catalog another minor version.

A quick and dirty way to figure out your particular test company and file version is to count the number of lines in your file. The number of header lines is always under two dozen and so does not really affect the rounded-to-thousands count. The count method is more reliable if you also know the test company, as files from different companies can be very similar in size for a particular version. Each data row / line contains the result for one Probe or marker from the test.

On any Unix or BASH shell, one can simply execute the command
zcat <microarray>.zip | wc -l
On Win10 PowerShell, the command is (using a 7-Zip 64-bit installation):
7z.exe e -so <microarray>.zip | Measure-Object -Line
This assumes you were given a compressed file. If it is not compressed, use "wc -l" directly on the file by placing the file name after the command, or pipe the file through "Get-Content" into "Measure-Object -Line" in PowerShell. Note that some text editors, like Notepad++, can load the files and will report the number of lines within the tool. Spreadsheet programs and most other text editors cannot load such a large file.
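The same count can be made portably in Python. The sketch below assumes the ZIP holds a single text file (it reads only the first member) and reports hash header lines separately from data lines; the function name is ours.

```python
import io
import zipfile


def count_lines(zip_path):
    """Return (header_lines, data_lines) for the first member of a ZIP file."""
    header = data = 0
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]  # assumption: one file per ZIP
        with archive.open(first_member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
            for line in text:
                if line.startswith("#"):
                    header += 1
                elif line.strip():
                    data += 1
    return header, data


# Usage (path is a placeholder):
#   header, data = count_lines("microarray.zip")
#   print(round((header + data) / 1000), "K lines")
```

Files whose column title row carries no hash (FTDNA, MyHeritage) will have that one row counted as data, which does not matter at the rounded-to-thousands level.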

Vendor  Version  Start Date  End Date  File Size (K lines)  ISOGG Table (K SNPs)  WGS Extract (K SNPs)  HGR Model [0]  Microarray Chip Used
23andMe API - Sep 2018 - 1,498 Supported API interface SNP list (now researcher access only)
23andMe v2 [7] late 2007 - 571 - NCBI36 Illumina Hap550+ (Human BeadChip)
23andMe v3 Nov 2010 Nov 2013 961 956 959 Illumina Omniexpress (Human BeadChip)
23andMe v4 Nov 2013 Aug 2017 602, 611 (599) [6] 605 602 Illumina Infinium HTS iSelect HD
23andMe v5 Aug 2017 - 639 630 638 Illumina GSA
Ancestry v1 Jan 2012 May 2016 701 [4] 700 701 Illumina Omniexpress (Genotyping BeadChip)
Ancestry v2 a-b May 2016 May 2018 669 / 650 [5] - 669 Illumina Omniexpress+ (Genotyping BeadChip)
Ancestry v2 c-d May 2018 - 664 / 678 [5] 662? - Illumina Omniexpress+ (Genotyping BeadChip)
FTDNA v1 - Feb 2011 564 (550) - 548 HG16 / NCBI34 Affymetrix Axiom xxx [1] (No Y, MT)
FTDNA v2 Feb 2011 Apr 2019 725 (708 / 716), 720 [9] 725 (v1) [8] 720 Illumina OmniExpress (Microarray Chip) (No Y, MT)
FTDNA v3 Apr 2019 - 630 (v2) [8] 614 Illumina GSA (No Y, MT)
LivingDNA v1 Sep 2016 Oct 2018 619 619 619 Illumina GSA
LivingDNA v2 Oct 2018 - 692 (660 Fem) [11] 699 699 Affymetrix [12] Axiom Sirius
MyHeritage v1 Nov 2016 Mar 2019 721 720 721 Illumina OmniExpress (Microarray Chip)
MyHeritage v2 Mar 2019 - 607 610 Illumina GSA
TellMeGen v? ? - 780 (609 / 678) - - Illumina GSA
MTHFR Genetics v? - 640 - UK (no male Y sample)
Genera v? - 640 - BR
meuDNA v1 - Dec 2021 632 - BR
meuDNA v2 Jan 2022 - 654 - BR
SelfDecode v? ? ? 687 - GRCh38 USA
Reich Lab v1??? 2015 - 598 - Affymetrix [12] Human Origins v1
Reich Lab - 1,233 - 1240K panel (Allen Ancient DNA Resource - AADR)
NGG Geno v2 [2] Oct 2012 Nov 2015 142 -
NGG Geno v2+ [2] Nov 2015 May 2019 [3] 730 - NCBI36 Illumina custom GenoChip
WGS Extract CombinedKit v2 Nov 2019 Jun 2020 2,080 - 2,080 HG19 / GRCh37 WGS Extract's "CombinedKit" [10] (Superkit on Steroids) option from WGS Results
note0: Build is 37 unless otherwise noted (most are 36 otherwise)
note1: FTDNA retested all FamilyFinder v1 samples using the new v2 Illumina chip and replaced the output files
note2: National Geographic Genographic files are separated by chromosome type and use SNP names and not rsIDs to identify the yDNA and mtDNA entries. v2 is mainly a haplogroup test. v2+ is better known as NextGen.
note3: After Nov 2016, this is only for non-North-America orders (non-Helix, still FTDNA) till the shutdown of testing in Nov 2019.
note4: During Fall 2015 (Sep-Nov), Ancestry put out their RAW files with a truncated header not giving version numbers and other information
note5: 669 is the norm that was started with. Winter 2018 (650) and Summer 2018 (664) saw smaller sizes that were often "fixed" on request; Feb 2019 began to see larger files (678). v2a and v2b differ only in minor ways, as do v2c and v2d. But v2b to v2c saw over 150K entries dropped and another ~150K different entries added. (SNPedia picked up on this and calls them variations 2c and 2d, although we see a 4th we call 2b that they do not mention.)
note6: All of 2015 and beyond saw file sizes of 611K lines for the 23andMe v4 test with a few 599K ones scattered that year (both sexes). 602K was for 2013 and first half of 2014. The variance between kit versions is on the order of 20K entries or less.
note7: 23andMe v1 used the same chip as v2. But we have found no data about its output size and characteristics. So have left it out of the table.
note8: ISOGG chose to ignore / skip the original Affymetrix FamilyFinder test and starting numbering from 1. Most others do not follow this convention.
note9: 720K is the final standard (v2d). Pre-2015 are all reprocessed to 725K entries (v2a) (and may be really v1 kits originally?). Some 716K entry sizes (v2c) are seen in 2016 (both sexes). The earliest and single occurrence we saw of 708K entries (v2b) in 2015 is still being investigated. Note these are based on build 37 model and the Auto+X download. v2a seems to be an almost exact superset of v2b-d.
note10: There are possibly as many as 10k InDels in these files that are not currently properly handled and called correctly. Genetic Genealogy sites ignore InDels so this is generally not a problem.
note11: LivingDNA supplies the yDNA and mtDNA in separate positive-for-change SNP name lists only. Main files are atxDNA only like for FamilyTreeDNA.
note12: Thermo-Fisher Scientific acquired Affymetrix and their Axiom microarray product line in 2016
Table Sources: Reference below and Randy's 80+kits covering most versions and companies.
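The line-count heuristic can be sketched as a simple lookup. The list below is only a partial, illustrative subset of the File Size (K lines) column above, limited to rows where the value is unambiguous; the function name and the 2K tolerance are our assumptions, not measured thresholds.

```python
# (vendor, version, file size in thousands of lines) -- partial subset of the table
KNOWN_SIZES = [
    ("23andMe", "v3", 961),
    ("23andMe", "v4", 602),
    ("23andMe", "v4", 611),
    ("23andMe", "v5", 639),
    ("Ancestry", "v1", 701),
    ("Ancestry", "v2", 669),
    ("FTDNA", "v2", 720),
    ("FTDNA", "v2", 725),
    ("LivingDNA", "v1", 619),
    ("MyHeritage", "v1", 721),
]


def guess_versions(line_count, tolerance_k=2):
    """Return the (vendor, version) pairs whose size is within the tolerance."""
    size_k = line_count / 1000
    found = []
    for vendor, version, known_k in KNOWN_SIZES:
        if abs(known_k - size_k) <= tolerance_k and (vendor, version) not in found:
            found.append((vendor, version))
    return found


print(guess_versions(961_000))  # [('23andMe', 'v3')]
print(guess_versions(720_500))  # [('FTDNA', 'v2'), ('MyHeritage', 'v1')]
```

The second call shows why the count alone is not enough: FTDNA v2 and MyHeritage v1 used the same chip family and produce nearly the same size, so the file format itself has to break the tie.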

Note that TellMeGen, SelfDecode, MTHFR Genetics, Genera and meuDNA are not genetic genealogy focused companies, but are expanding into that area as they expand the market for their consumer DNA test product. Their result files can be used in Third Party Analysis Tools just like the others, just as all the traditional genetic genealogy focused result files can be used on other sites that provide health, wellness and trait analysis.

Minor variations in major versions

Minor Variations in Microarray Files
The chart here provides a few more details on the variations of microarray file formats within a major version. Only Ancestry made a very significant change.

ISOGG has since created comparison tables of the various test kits. Their covered SNP counts often vary considerably from our measured values. We have not yet determined the reason. The companies vary the outputs within a version and time period, as is shown in the table above. But this does not seem to account for ISOGG's generally lower counts.

Not incorporated in the above is an article detailing some variations in the 23andMe files for mitochondria over time. This is mostly found in files downloaded before 2012. If the file is re-downloaded, it is often corrected. Similar documented and undocumented changes occurred in 23andMe, Ancestry and FTDNA file content within major versions over time.

UCSC Templates

We have discovered "templates" for many of the microarray chips on the UCSC server. It is not clear why they are there or what they are used for. They do not appear to have the vendor-introduced variations. (Illumina and others let larger customers customize around 50K entries on a microarray chip. This is how NGG was able to have around 13K Y SNPs defined.) Here is the template listing found when we visited the site in 2020.
Affy5 Affy6, Affy6SV Affy250Nsp, Affy250Sty
Illumina1M Illumina1MRaw IlluminaGDA
Illumina300 IlluminaHuman660W_Quad IlluminaHuman660W_QuadRaw
Illumina550 IlluminaHumanCytoSNP_12 IlluminaHumanCytoSNP_12Raw
Illumina650 IlluminaHumanOmni1_Quad IlluminaHumanOmni1_QuadRaw
See also SNP Genotyping Arrays, Recombination Hotspots for Genotyping Arrays, Recombination Arrays for Genotyping Arrays, and Formatting of Data (Genotyping Arrays) for more information on what these various files are used for.

Study of Available Arrays

Long after we compiled the information for this page, a study came out on the utility of the various genotyping arrays. Part of the study includes data and analysis of the various arrays, some of which we capture in the list below (sizes are in thousands of entries). It shows much more diversity than we expected, and larger counts than expected, as we thought most arrays were 1,000 x 1,000 at most (limiting the result to around 1 million entries).
Array | Size
Affymetrix [12] 6.0 | 932
Axiom AveraNTR | 671
Axiom GW ASI | 630
Axiom GW CHB2 | 658
Axiom GW EUR | 675
Axiom GW LAT | 818
Axiom GW PanAFR | 2,268
Axiom PRNA | 920
Axiom UKB WCSG | 842
CytoSNP 850k_b | 850
Drug Dev Consortium 15073507 A1 | 475
GSA 24v3 A1 | 653
GSA MD 24v1-0 20011747 A4 | 693
Human 660W quad v1 | 591
Human Core 12v1-0 a | 298
Human CytoSNP 12v2-1 H | 295
Human Omni 2.5-4v1 h | 2,434
Human Omni 5-4v1 c | 4,269
Human OmniExpress 12v1-1 b | 718
Human OmniZhongHua 8v1-0 c | 899
Infinium Exome 24v1-1 A1 | 245
Infinium Immuno Array 24v2-0 a | 252
Multi-Ethnic AMR AFR 8v1-0 A1 | 1,425
Multi-Ethnic EUR EAS SAS 8v1-0 A1 | 1,474
Multi-Ethnic Global A1 | 1,761
Onco Array 500K B | 498
PMDA hg19 | 918
Psych Array B | 570
* Names will be improved once the papers are thoroughly reviewed


Actual File Formats

So let's get on to describing the actual file formats themselves. A reminder that all files share a few common features: for example, being a TSV or CSV format file, and having a header of one or more lines that often, but not always, start with a hash ("#"). Most are delivered as Build 37 results, sorted in the expected order of chromosomes 1-22, X, Y and MT. But variations exist and are indicated below.

We start with a summary table and then introduce each of the formats.



Summary table of formats

Vendor | File Ext | File Form | File Line End | Chr Labels & Entry Order | Allele Form | Allele Values | Ref Build | IDs | Header | Notes
23andMe | .txt | TSV | \r\n | 1-22, X, Y and MT | AG | ACGT, DI, - | 37 | rsID, iNNN | ~20 # lines including last column title row | Single value in X, Y and MT (for males); dash always homozygous. Female Y is all double dash.
Ancestry | .txt | TSV | \r\n | 1-22, 23 (X), 24 (Y), 25 (PAR), 26 (M) | T C | ACGT, ID, zero ; any order | 37 | rsID | ~18 # lines followed by column title row starting with "rsid" | Always double values; zero always homozygous. Female Y is zeros but PAR is heterozygous in both. DI only in v2c and beyond
FTDNA | .csv | true CSV | \r\n | 1-22, X (if selected), (XY) | AT | ACGT, (DI,) - | 37 (36 v1) | rsID, VG, (seq-rsID. kgp, 2010-, GSA, LDLR, IDS, DY, CF, DrGene, FAM, HPS, PEX, 1SNP, indel, ...) | Single row column-definition starting RSID (only unquoted value row except v3 is quoted) | v3 has wide variety of IDs; v2 only first two. Early v2 ONLY generated separate 1-22 and X files or concatenated them so header appears in middle again; InDel only in v2b and v3; chr/pos 0 in v2a and v3; only v3 has XY
LivingDNA | .txt | TSV | \n | 1-22, X | AT | ACGT, - ; any order | 37 | rsID, AX, AFFX (, 1:, exm2, JHU, var, kgp, 1kg, SNP. gw) | ~11 # lines of header including last column title row | Y and MT in separate files listing only derived SNPs; v1 has the large variance in names; v2 has >2 allele values. Often two sets of similar sequences (two inserts?) but not always (insert and delete?); longest is 21x2
MyHeritage | .csv | true CSV | \n | 1-22, X, Y | AT | ACGT, DI, - | 37 | rsID (,VG) | ~7-12 # lines followed by column title row starting RSID (only unquoted value row) | only v1 has VG IDs; only v2 has ID alleles and only on X ; early v2 had no quotes EXCEPT on chromosome 17 where they quoted coordinates and inserted commas as thousand separator
TellMeGen | .csv | TSV | \r\n | 1, 10, 11 ... 22, 3, ... 9, MT, X, XY, Y | TA | ACGT, ID, - ; Any order | 37 | rsID, chr1, dupseq, ilmnseq_rs, GSA_rs, seq-rs, TOP, ... | Single row column-definition starting "# rsid" | Very large assortment of names including just a single dot
MTHFR Gen | .txt | TSV | \r\n | 1-22, MT, X(, Y?) | TA | ACGT, ID, - ; Any order | 37 | rsID | Single row column-definition starting with "rsid" | No male sample obtained yet; One RSnnn (cap)
meuDNA | .csv | CSV unquoted | \n | 0-22, X, Y, (XY, )MT | AT | ACGT, DI, - | 37 | rsID, 2010-, GSA, ... (similar to LivingDNA v1 but no AFF(x)) | Single row column-definition starting with RSID | 782 0,0 entries ; no quotes ; no XY in v2; diff mix of IDs between v1 and v2
Genera | .csv | CSV unquoted | \n | 0-22, X, Y, MT | AT | ACGT, - | 37 | rsID, GSA, ilmseq, MTR, 2006, ... | Single row column-definition starting with RSID | Y and MT is single value ; template only so cannot tell if InDels
Self Decode | .txt | TSV | \n | 1-22, X, Y, MT | TA | ACGT | 38 | rsID, GSA, ilmseq, exm, seq, 1:, JHU, MFN, variant, indel, BOT, chr1:, newrs, ... | 8 lines including single row of column definitions | X, Y and MT single value (in males)
Reich 1240K | .txt | TSV | \r\n | 1-22, X, Y, MT | TA | ACGT, - | 37 | rsID, snp_, Affx_, 1kg, Y SNP names | Two lines including single row of column definitions | No format defined. So utilize 23andMe one.
Reich HumOrig | .txt | TSV | \r\n | 1-22, 23 (X), 24 (Y) | TA | ACGT ; Any order | 37 | rsID, snp_, Affx_ | Two lines including single row of column definitions | No format defined. So utilize 23andMe one with minor exceptions.
NGGeno | .csv | TSV | | | | | 36 | | | Handled by FTDNA till near the end. Near identical files and formats.
*Note: unless otherwise specified, (1) heterozygous InDel alleles exist, (2) two values always exist, (3) Increasing order alleles only.
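Several of the columns above (delimiter, quoting, header length) can be detected mechanically from the first lines of a file. The following is a rough sketch of such a heuristic, not a complete vendor detector; the function name and return shape are hypothetical.

```python
def sniff_format(lines):
    """Return (delimiter, quoted, header_lines) from the start of a file.

    header_lines counts '#' lines plus an unhashed column title row
    that starts with "rsid" in any letter case.
    """
    header_lines = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#") or line.lower().startswith("rsid"):
            header_lines += 1
            continue
        # First data row: decide on the delimiter and on quoting.
        delimiter = "\t" if "\t" in line else ","
        return delimiter, line.startswith('"'), header_lines
    return None, False, header_lines


tsv = ["# intro\n", "# rsid\tchromosome\tposition\tgenotype\n", "rs1\t1\t5\tAA\n"]
csv_quoted = ["RSID,CHROMOSOME,POSITION,RESULT\n", '"rs1","1","5","AA"\n']
print(sniff_format(tsv))         # ('\t', False, 2)
print(sniff_format(csv_quoted))  # (',', True, 1)
```

A real tool would combine this with the file extension, the chromosome labels and the line count, since none of these features identifies a vendor on its own.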

Sometimes the PAR region is split out from either X or Y. The PAR1 region is at the same position in X and Y for build 38; the X is shifted by 50K in build 37. The PAR2 region starts at ~154.9 million on X in build 37 and ~155.7 million in build 38. Any alleles defined in a PAR region of X or Y cannot be reliably attributed to one source or the other. The Pseudo-Autosomal Regions for the two builds are:
Region Build Chr Start Stop Length
PAR1 37 X 60,001 2,699,520 2,639,519
PAR1 37 Y 10,001 2,649,520 2,639,519
PAR1 38 X or Y 10,001 2,781,479 2,771,478
PAR2 37 X 154,931,044 155,260,560 329,516
PAR2 37 Y 59,034,050 59,363,566 329,516
PAR2 38 X 155,701,383 156,030,895 329,512
PAR2 38 Y 56,887,903 57,217,415 329,512
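The table above translates directly into a lookup that flags whether an X or Y position falls inside a PAR. This is a minimal sketch using only the coordinates listed here; the function and dictionary names are our own.

```python
# (build, chromosome) -> list of (region, start, stop), copied from the table above
PAR_REGIONS = {
    (37, "X"): [("PAR1", 60_001, 2_699_520), ("PAR2", 154_931_044, 155_260_560)],
    (37, "Y"): [("PAR1", 10_001, 2_649_520), ("PAR2", 59_034_050, 59_363_566)],
    (38, "X"): [("PAR1", 10_001, 2_781_479), ("PAR2", 155_701_383, 156_030_895)],
    (38, "Y"): [("PAR1", 10_001, 2_781_479), ("PAR2", 56_887_903, 57_217_415)],
}


def par_region(build, chrom, position):
    """Return "PAR1", "PAR2" or None for a position on X or Y."""
    for name, start, stop in PAR_REGIONS.get((build, chrom), []):
        if start <= position <= stop:
            return name
    return None


print(par_region(37, "X", 100_000))     # PAR1
print(par_region(38, "Y", 57_000_000))  # PAR2
print(par_region(37, "X", 50_000))      # None
```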




23andMe

File formats from all versions are the same. But the SNPs reported change between versions of Microarray Testing chips used. To date, all versions use Illumina products.
  • 20 lines of header
  • Tab separated (TSV) pseudo RAW-VCF file
  • Column definition included
  • Chromosomes labeled 1-22, X, Y and MT
  • Genotype values: A, G, C, T, I, D, - (I and D are for Insert and Delete. InDels are not really SNPs but reported as such here)
  • Both genotype values together (unordered pair); always increasing alphabetic order (AG but not GA)
  • No calls: --
  • Single value in X, Y and MT but still double dash for no call (double value for X in females; Y in females is all no call)
Sample:
# This data file generated by 23andMe at: Thu Dec 17 14:11:20 2015
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been 
# individually validated for accuracy. As such, this data is suitable only for research, 
# educational, and informational use and not for medical or other use.
# 
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier 
# (an rsid or an internal id), its location on the reference human genome, and the 
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing 
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
# 
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
#
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
i3001395	MT	15530	--
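A reader for this layout can be sketched in a few lines. The version below skips the hash header and splits on any whitespace, which is our choice so that rows padded with spaces instead of tabs still parse; the Call type and function name are illustrative.

```python
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per data row, ignoring '#' header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split()
        yield Call(rsid, chromosome, int(position), genotype)


sample = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs12564807\t1\t734462\tAA\n",
    "i3001395\tMT\t15530\t--\n",
]
for call in parse_23andme(sample):
    print(call)
```

Single-character genotypes (X, Y and MT in males) come through unchanged, so callers should not assume the genotype is always two characters long.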


From their Downloads page FAQ; a change log to the format:
  • July 27, 2017: As part of our continuous efforts to improve the quality of data present in your raw data download, the number of SNPs available in your download may have changed.
  • July 22, 2015: We updated call filtering in the downloaded file so it matches filtering in the Raw Data tool. Some customers may see "--" (a "no call") as their genotype for some SNPs on the X chromosome, Y chromosome, or in their MT DNA, where their downloaded data file previously showed a "D" call.
  • July 28, 2014: Analysis of our data has allowed us to improve the interpretation of over 10,000 SNPs genome-wide on the V4 chip. In the next couple of days, V4 customers will see calls for SNPs that previously did not appear in their raw data.
  • August 9, 2012: We updated our database to report SNP positions using the NCBI Build 37 (also known as Annotation Release 104) genome assembly. Users will see changes in their raw data positions.
  • September 29, 2011: Analysis of our data has allowed us to improve the interpretation of several SNPs. In the next week, customers may see changes in their raw data.
  • January 13, 2011: We updated our database to incorporate data from a more recent build of dbSNP. Some rsids have changed location and/or flanking sequence in dbSNP such that our probes are no longer meaningful to assay them. The names of these rsids have been changed in the raw data to internal ids starting with "i499...". We have also improved the interpretation of a number of SNPs and removed others that had poor data quality. In the next couple of days, customers may see changes in calls for those SNPs.
  • March 25, 2010: Analysis of our data has allowed us to improve the interpretation of several dozen SNPs. A portion of the SNPs are on the mitochondrial chromosome. In the next couple of days, customers may see changes in calls for those SNPs.
  • October 8, 2009: Analysis of our data has allowed us to improve the interpretation of over 1500 SNPs. A portion of the SNPs are on the mitochondrial chromosome. In the next couple of days, customers may see changes in calls for those SNPs.
  • June 4, 2009: Analysis of our data has allowed us to improve the interpretation of over 500 SNPs. Most of these SNPs are on the Y chromosome. In the next couple of days, customers will see calls for SNPs that previously had a no-call or appeared not genotyped.
  • April 9, 2009: Analysis of our data has allowed us to improve the interpretation of 10 SNPs: rs4420638, rs34276300, rs3091244, rs34601266, rs2033003, rs7900194, rs9332239, rs28371685, rs1229984, and rs28399504. In the next couple of days, some customers will see calls for SNPs that previously had a no-call or appeared not genotyped.


AncestryDNA

  • 16 lines of header intro, 17th line is column headers.
  • Tab separated. NoCalls appear as '0' (zero) and always appear in pairs.
  • Alleles in separate columns (but still unordered); can be in any alphabetic order (A T and T A)
  • Chromosomes labeled 1-22, 23 for X, 24 for Y, 25 for X/Y PAR region values (not sure if position from X or Y), and 26 for M (later kits only)
Sample:
#This file was generated by AncestryDNA at: 06/27/2015 09:23:22 MDT
#Data was collected using AncestryDNA array version: V1.0
#Data is formatted using AncestryDNA converter version: V1.0
#Below is a text version of your DNA file from Ancestry.com DNA, LLC.  THIS 
#INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH 
#ONLY.  IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES.  THE EXPORTED DATA IS 
#SUBJECT TO THE AncestryDNA TERMS AND CONDITIONS, BUT PLEASE BE AWARE THAT THE 
#DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
#
#Genetic data is provided below as five TAB delimited columns.  Each line 
#corresponds to a SNP.  Column one provides the SNP identifier (rsID where 
#possible).  Columns two and three contain the chromosome and basepair position 
#of the SNP using human reference build 37.1 coordinates.  Columns four and five 
#contain the two alleles observed at this SNP (genotype).  The genotype is reported 
#on the forward (+) strand with respect to the human reference.
rsid	chromosome	position	allele1	allele2
rs4477212	1	82154	T	T
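The differences from the 23andMe layout (numeric chromosome labels, two allele columns, zero for no call) are easy to normalize. The sketch below does so under our own assumptions: chromosome 25 is labeled "PAR", a zero in either column is treated as a no call, and alleles are joined in alphabetic order.

```python
# Ancestry's numeric labels for the non-autosomes, per the list above.
ANCESTRY_CHROMOSOMES = {"23": "X", "24": "Y", "25": "PAR", "26": "MT"}


def parse_ancestry(lines):
    """Yield (rsid, chromosome, position, genotype) with 23andMe-style values."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#") or line.startswith("rsid"):
            continue  # skip hash header and the unhashed column title row
        rsid, chrom, position, allele1, allele2 = line.split("\t")
        chrom = ANCESTRY_CHROMOSOMES.get(chrom, chrom)
        if "0" in (allele1, allele2):
            genotype = "--"  # zero marks a no call
        else:
            genotype = "".join(sorted((allele1, allele2)))
        yield rsid, chrom, int(position), genotype


sample = [
    "#This file was generated by AncestryDNA\n",
    "rsid\tchromosome\tposition\tallele1\tallele2\n",
    "rs4477212\t1\t82154\tT\tT\n",
]
print(list(parse_ancestry(sample)))  # [('rs4477212', '1', 82154, 'TT')]
```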




FamilyTreeDNA

FTDNA started out using an Affymetrix microarray for its Microarray Testing but moved to an Illumina one very quickly after introduction.
  • CSV with commas and each field surrounded by double quotes (true CSV)
  • Single column-header definition header; no other header information
  • Build37 or Build36 (selected at download time; cannot tell which by header content)
  • Separate file for Auto and X (or now combined if desired)
  • Chromosomes numbered 1-22; X if X file or combined file
  • Both genotype values together (un-ordered pair)
  • No calls: "--"
Sample:
RSID,CHROMOSOME,POSITION,RESULT
"rs4477212","1","72017","AA"

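Because this is a true quoted CSV, Python's standard csv module can read it directly. The sketch below also skips any repeated column title row, since the summary table notes that early v2 files were sometimes concatenated so that the header appears again in the middle; the function name is ours.

```python
import csv


def parse_ftdna(lines):
    """Yield (rsid, chromosome, position, result) from FTDNA-style CSV lines."""
    for row in csv.reader(lines):
        if not row or row[0].upper() == "RSID":
            continue  # blank line, or a column title row (possibly repeated)
        rsid, chromosome, position, result = row[:4]
        yield rsid, chromosome, int(position), result


sample = [
    "RSID,CHROMOSOME,POSITION,RESULT\r\n",
    '"rs4477212","1","72017","AA"\r\n',
]
print(list(parse_ftdna(sample)))  # [('rs4477212', '1', 72017, 'AA')]
```

The build (36 or 37) cannot be read from the file, so it has to be supplied by the user or inferred by comparing a few known positions.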



LivingDNA

For Autosomal & X: 10 rows of header, then single row for column headers. Tab-separated (TSV) columns in pseudo RAW VCF style. RSid identifiers.
  • TSV with dual column alleles
  • Build 37
  • Separate file for Y and MT with derived, named SNPs only
  • Chromosomes labeled 1-22, X
  • Both alleles together; unordered pair
  • Identifiers are rsID, AX, or AFFX
Sample (Auto/X):
# Living DNA customer genotype data download file version: 1.0.1
# File creation date 11-29-2017
# The content of this file is subject to updates and changes depending on the time of download.
# This genotype data should be treated as personal information.
# This genotype data is not suitable for clinical/medical research or diagnosis.
# The user assumes all responsibility for the security of this file.
# Please refer to the Living DNA Terms and Conditions on our website (www.livingdna.com) for more information.
# Human Genome Reference Build 37 (GRCh37.p13).
# Genotypes are presented on the forward strand.
#
# rsid	chromosome	position	genotype
rs9283150	1	565508	AA
1:726912	1	726912	AA
rs116587930	1	727841	GG

For Y: a simple list of only the derived (positive, changed) SNP names. So it is not clear how many were tested, nor which are ancestral (negative, unchanged). The sample file has 382 entries. Each row is one SNP. Alternate names for the same variant appear to be given on the same row, separated by slashes (/).
Sample (Y):
AM00847/AMM008/B65
AM01921.2/S475.2/Z2983.2
CTS10083
CTS10085/M1250/PF5948

For MT: a simple list of only the derived (positive, changed) SNP locations. So it is not clear how many were tested, nor which are ancestral (negative, unchanged). The sample has 21 entries (which is similar to the changed value list typical in 23andMe's test). The derived value is attached to the position number.
Sample (MT):
263G
462T
482C
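Both list files are simple enough to parse with a few lines each. This sketch assumes, as the samples suggest, that slashes separate alternate names in the Y list and that each MT entry is a position followed directly by the derived base; the function names are ours.

```python
import re


def parse_y_names(lines):
    """Return one list of alternate SNP names per row of the Y file."""
    return [line.strip().split("/") for line in lines if line.strip()]


MT_ENTRY = re.compile(r"^(\d+)([A-Za-z]+)$")


def parse_mt_variants(lines):
    """Return (position, derived_value) tuples from the MT file."""
    variants = []
    for line in lines:
        match = MT_ENTRY.match(line.strip())
        if match:
            variants.append((int(match.group(1)), match.group(2)))
    return variants


print(parse_y_names(["AM00847/AMM008/B65\n", "CTS10083\n"]))
print(parse_mt_variants(["263G\n", "462T\n", "482C\n"]))
```

Rows that do not match the position-plus-base pattern are silently skipped here; a stricter tool might prefer to report them.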



MyHeritage

6 lines of header, then a single line of column headings. Comma-separated list of entries enclosed in double quotes (") (note: early v2 is not quoted, and some tools will not accept that). Chromosomes 1-22, X, Y (no MT). Double allele values. All rsID names (except v1 has some VG).

Sample:
# MyHeritage DNA raw data. 
# This file was generated on 2018-06-18 14:06:02 
# For each SNP, we provide the identifier, chromosome number, base pair position and genotype.The genotype is reported on the forward (+) strand with respect to the human reference build 37. 
# THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH 
# ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE 
# DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
RSID,CHROMOSOME,POSITION,RESULT
"rs4477212","1","82154","AA"
"rs3094315","1","752566","AG"



TellMeGen

Near identical to the 23andMe format. Uses the Illumina GSA. No header except the one-line column header. Unique in that (1) it is the only one with Unix-style line endings (\n only; not the \r\n of DOS or the \r only of MacOS), and (2) it delivers a TSV format with a .csv file extension. The line endings broke some tools.

Sample:
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
i3001395	MT	15530	--

  • Tab separated (TSV) pseudo RAW-VCF file with .csv file name extension (UNIQUE)
  • Column definition included as only header row
  • Chromosomes labeled 1-22, MT, X, and Y (in that order)
  • Genotype values: A, G, C, T, I, D, - (I and D are for Insert and Delete. InDels are not really SNPs but reported as such here)
  • Both genotype values together (unordered pair)
  • No calls: --


meuDNA



Genera



MTHFR Genetics



Self Decode

We only have a single sample to go by that was delivered in June 2023. That sample was delivered in Build 38.


NGG Geno2.0

Comma-separated list. First row is the header titles. Identifiers are rsID or "kgp" (1000 Genomes Project); no positions. 130,110 entries in the Autosomal/X file. SNP names in the Y file, with 11,978 rows of values (in one example). The Y file has DD and II values. ~45 values in the MT file, so likely only variants (but from what model?).

note: A combined ALL file is also delivered that has the three files mashed up together.

Sample (Geno2.0 Autosomal and X single file):
SNP,Chr,Allele1,Allele2
kgp10004422,12,A,G
kgp10025979,7,C,C
kgp22732377,X,A,A
kgp22734373,X,C,C
rs10000081,4,T,T
rs10000092,4,T,T
rs1000014,16,G,G

Sample (Geno2.0 Y file):
SNP,Chr,Allele1,Allele2
CTS100,Y,C,C
CTS10004,Y,G,G

Sample (Geno2.0 mt File):
SNP,Chr,Allele1,Allele2
73,Mt,A,A
195,Mt,A,A
225,Mt,A,A
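All three Geno2.0 files share the same four columns, so one reader covers them. The sketch below uses csv.DictReader keyed on the header titles shown in the samples and joins the two allele columns into a single genotype; the function name is hypothetical.

```python
import csv


def parse_geno2(lines):
    """Yield (snp, chromosome, genotype) from a Geno2.0 file."""
    for row in csv.DictReader(lines):
        yield row["SNP"], row["Chr"], row["Allele1"] + row["Allele2"]


sample = [
    "SNP,Chr,Allele1,Allele2\n",
    "kgp10004422,12,A,G\n",
    "CTS100,Y,C,C\n",
    "73,Mt,A,A\n",
]
for entry in parse_geno2(sample):
    print(entry)
```

Since there are no positions in these files, the SNP column is the only key: an rsID or kgp name for the Autosomal/X file, a SNP name for Y, and a bare position number for Mt.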



NGG Geno2.0+ (NextGen)

Comma-separated list. First row is the header titles. rsIDs and positions for the Autosomal and X files, like all the others and unlike NGG 2.0. In one sample, 698,194 rows in the Autosomal file, 17,813 in X, 13,534 in Y, xx in M (only a simple list of derived-value SNPs; not all tested). Typical pair of values: two from A, T, C or G, along with I (Insert), D (Delete) and '--' (no call). The Y file is like the older Geno2.0 and has SNP names and no coordinates. Aliases for some SNPs are given by an underscore in the name.

note: not clear if this is always the case but files we anecdotally saw are sorted by line and not specific columns. As SNP names come first, there is an alphabetic sort on them with chromosomes totally intermixed. A combined ALL file is also delivered that has the three files mashed up together.

Sample (Geno2.0+, separate Autosomal and X files with same format):
RSID,CHROMOSOME,POSITION,RESULT
"rs3748597","1","878522","TC"
"rs13303106","1","881808","AA"
"rs28415373","1","883844","--"
"rs13303010","1","884436","AG"

Sample (Geno2.0+, Y file):
"SnpName","Chromosome","Result"
"CTS6704","Y","AA"
"CTS5286","Y","GG"
"BY1786","Y","GG"
"Y5543_Z20122","Y","CC"
"M3153_S7535","Y","AA"
"M245","Y","II"

Sample (Geno2.0+, mt file with~40 entries; variants only):
"Chromosome","Position","Result"
"mt","2885","T"
"mt","16230","A"
"mt","11719","G"