* What is a "RAW" data file anyway?

Randy Harr - Thu 17 of Dec, 2020 14:20 EST

Many are confused about what the RAW Data file delivered by microarray test companies like Ancestry and 23andMe really is. And even more so by the BAM and VCF files delivered by NGS / WGS sequencing services. Here we try to explain how they are all more or less the same file format but with very different content.

All these file formats, in their basic form, are Tab-Separated Value (TSV) tables. TSV is a variant of the more widely known Comma-Separated Value (CSV) format used primarily by spreadsheet programs to exchange data. A TSV is a basic text file of columnar data that uses tab characters to separate the columns. So it is often readable just by opening it in any regular text editor or even a spreadsheet program. The latter is best, as it keeps the data truly in columns no matter how much content is in each cell. Early TSV tables came from mechanically typewritten manuscripts that were then captured into early computer text files. So a very visual, human interpretation. But depending on the column content, more than a single tab may have been needed to line the columns up. So do two successive tabs indicate a blank column entry in between? With modern computer interpretation, which no longer focuses on human readability, the answer is yes. This, along with sometimes wanting to include a tab in the data, led to CSV, where fields are separated by commas and can be double-quoted (so as to allow a comma inside a field).
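As a small illustration of the two points above (successive tabs mean an empty column, and CSV quoting allows a comma inside a field), here is a minimal sketch using Python's standard `csv` module. The marker name and field contents are made up for the example and do not come from any real vendor file.

```python
import csv
import io

# Hypothetical three-column TSV row whose middle field is empty:
# the two successive tabs mark a blank column entry.
tsv_line = "rs0000001\t\tAG\n"
tsv_fields = next(csv.reader(io.StringIO(tsv_line), delimiter="\t"))

# Hypothetical CSV row whose middle field contains a comma,
# so that field is wrapped in double quotes.
csv_line = 'rs0000001,"chr1, short arm",AG\n'
csv_fields = next(csv.reader(io.StringIO(csv_line)))

print(tsv_fields)  # ['rs0000001', '', 'AG']
print(csv_fields)  # ['rs0000001', 'chr1, short arm', 'AG']
```

Both rows come back as three fields, which is what a spreadsheet program does when it imports such a file.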

Sometimes these files have one or more "header" lines of comments before the real data starts. Often those lines follow the convention of starting with a hash mark (#), which was originally introduced by the Unix Shell program to mean a comment line. The last header line then often labels or titles the columns. We give examples of the Microarray File Formats and Sequencing File Formats in the glossary here.
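The header convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `read_table` is hypothetical, the sample content is invented, and it assumes the last hash-marked line titles the columns, which is common but not guaranteed for every vendor.

```python
import io

def read_table(handle):
    """Split a TSV into comment lines, column titles, and data rows."""
    comments, rows = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            comments.append(line)
        else:
            rows.append(line.split("\t"))
    # Assumption: the last '#' line labels the columns.
    columns = []
    if comments:
        columns = comments.pop().lstrip("#").strip().split("\t")
    return comments, columns, rows

# Invented sample file content, not real test data.
sample = (
    "# Hypothetical vendor export\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t12345\tAG\n"
    "rs0000002\tMT\t678\tC\n"
)
comments, columns, rows = read_table(io.StringIO(sample))
print(columns)    # ['rsid', 'chromosome', 'position', 'genotype']
print(len(rows))  # 2
```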

Both microarray and WGS tests return data on all the DNA that exists in a cell. That means the nuclear autosomes and allosomes (the sex chromosomes), as well as the mitochondrial DNA that is floating around in the cell body. If you ever get a file that does not have values for all 25 sequences (22 autosomes, X, Y and the mitochondrial DNA), then the vendor has purposely stripped out some portions of the test result as delivered by the lab equipment. Part of the confusion comes from the fact that most people use microarray data primarily for segment matching on the autosomes. Hence many call these autosomal tests without realizing that other parts of the DNA are being tested and returned. Some confusion also comes from FTDNA purposely stripping out data and marketing the microarray test as autosomal only, so that it does not undercut their original, main yDNA and mtDNA test services.

What really differentiates the two file formats is the volume of data. Microarray tests look for known, specific markers. Around 600,000 in most cases. Mostly SNPs but some simple InDels. They make use of "primers", like those used in targeted, early CE sequencing, to find the area of the marker in the DNA strand. If there has been a change in that primer area, then they cannot find the marker. Only when the primer finds the marker can they then measure and return a value for that location. The microarray test can and will only return values for the pre-defined, known marker locations. Think of it like a city directory or phone book of old. It helps you find what you already think should exist somewhere. But it cannot discover new streets, addresses or people not yet entered into the phone book.

Sequencing on the other hand simply reads whatever strands of DNA are thrown at it. Blindly. Without knowing where or what the source of the DNA is that it has been fed. It then needs post-processing to determine what it has seen. Kind of like being an auditor in a retail store. You are counting up everything you find on the shelves without regard to recording where you found it. You then rely on a map of the store to determine where that stock was likely found.

This is why the human genome reference model is so important to sequencing. It is the map of what you expect to find where. And thus how you can put the sequencing inventory together. Can you imagine trying to go into a large Amazon warehouse to find some product without having that map of where everything is located?

The key is that sequencing can be used to discover new variants or markers. Ones that are not yet known or studied, unlike the already-known markers traditionally measured in microarray tests.

A SAM file is simply a TSV of the 600 million plus short-read segments put out by a sequencer doing a WGS, along with the alignment mapping of each to the human genome reference. A modern WGS result has about 660 million read segments of 150 base-pair length that are arbitrarily "cut" and overlap, giving an average read depth of around 30 reads per base-pair location. A VCF is a TSV of just the variants from the human genome reference that were found, and thus considerably smaller than the SAM, as you usually only have around 6 million SNP variants in your 6 billion base pairs. Now remember the 600,000 locations mentioned for the microarray test file? Those may be variants but more likely are not (and so match the reference). So a sequencer is returning around 6 million variants versus the typical 60-100 thousand variants in a microarray test. And the sequencer is returning data on 3 billion tested locations versus the 600,000 of the microarray test. Quite a different volume of data. Do you know of any spreadsheet program that can read a file with 3 billion rows?
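The volume figures above can be checked with some quick arithmetic. The sketch below uses only the round numbers quoted in this article; the variable names are ours, and real results will vary by lab and by sample.

```python
reads = 660_000_000              # read segments in a modern WGS
read_length = 150                # base pairs per read segment
tested_locations = 3_000_000_000 # locations covered by the sequencer
microarray_sites = 600_000       # markers in a typical microarray test
wgs_variants = 6_000_000         # SNP variants typically found by WGS

# Total bases read, spread over the tested locations, gives the average depth.
average_depth = reads * read_length / tested_locations
print(average_depth)  # 33.0, in line with the roughly 30x quoted

# How many times more locations the sequencer reports than the microarray.
site_ratio = tested_locations // microarray_sites
print(site_ratio)  # 5000

# Share of tested locations that turn out to be variant.
variant_share = wgs_variants / tested_locations
print(f"{variant_share:.1%}")  # 0.2%
```

So the sequencer reports on the order of five thousand times more locations than the microarray, and only a small fraction of those locations differ from the reference.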

There is a special "RAW" VCF that is closer to an all-call gVCF in that it carries not just variants but also the other tested locations that were not variant. It is this form that is the same as the RAW Data files from microarray tests. And likely the source of the file name in use. In fact, using the microarray test file as a template, a slimmed down VCF of just those values of interest can be created from the SAM. Or, if you have a target list of sites of interest, an actual microarray test "RAW" result file can be created from the SAM file. That is, a file of known locations, whether variant or not in a particular sample. It takes little effort to simplify a TSV gVCF of SNPs into the RAW Data file format that you get from a microarray test.
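To give a feel for how little effort that simplification takes, here is a minimal sketch that turns one single-sample VCF data line into a microarray-style row. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample) and a four-column "rsid, chromosome, position, genotype" target layout like the one 23andMe uses. The function name `vcf_line_to_raw`, the sample lines and the "--" no-call marker are assumptions for illustration, and InDels are not handled.

```python
def vcf_line_to_raw(line):
    """Convert one single-sample VCF data line to an rsid/chromosome/position/genotype row."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]
    fmt, sample = fields[8], fields[9]
    # Allele 0 is REF; alleles 1..n are the ALT entries. Drop placeholder entries.
    alleles = [ref] + [a for a in alt.split(",") if a not in (".", "<NON_REF>")]
    # Locate the GT (genotype) entry in the sample column via the FORMAT column.
    gt = sample.split(":")[fmt.split(":").index("GT")]
    indexes = gt.replace("|", "/").split("/")
    if "." in indexes:
        genotype = "--"  # assumption: no-call marker, as seen in some microarray files
    else:
        genotype = "".join(alleles[int(i)] for i in indexes)
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # microarray files usually name chromosomes 1, 2, ... X, Y, MT
    return "\t".join([rsid, chrom, pos, genotype])

# Invented example lines: one variant site and one site matching the reference.
variant = "chr1\t12345\trs0000001\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30"
non_variant = "chr1\t12400\trs0000002\tC\t.\t.\tPASS\t.\tGT:DP\t0/0:28"
print(vcf_line_to_raw(variant))      # rs0000001, 1, 12345, AG (tab-separated)
print(vcf_line_to_raw(non_variant))  # rs0000002, 1, 12400, CC (tab-separated)
```

Note how the non-variant line still produces a genotype (the reference allele twice). That is exactly why an all-call file is needed as the starting point rather than a variants-only VCF.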

Check out the section on the Tab Separated Value format in the Sequencing File Formats page.
