Berkeley Drosophila Transcription Network Project


Berkeley Quantitative Genome Browser User Manual


File Formats

Overview

As anyone who has worked with genomic data knows, there is no shortage of file formats in which to store and organize data, some ad hoc and others with standardized, more or less widely accepted specs. Some formats hold specific and unique types of data, e.g. GFF versus FASTA. Others are variations on each other and contain essentially the same sort of data, e.g. GFF and BED. Still others are simply more parsimonious and efficient representations of other formats, e.g. SGR for GFF. BQGB currently supports the file formats most immediately useful within our own lab. Additionally, a mechanism has been put in place to handle many ad hoc formats; please see the section on delimited formats below. Finally, BQGB's code has been designed to make adding new file formats as easy as possible.

GFF, SGR, WIG, BED, FASTA

GFF, SGR, WIG, BED and FASTA are the file formats currently given explicit support.

Many similarities exist between file formats; for instance, each format binds its data to a biological position, most often counted off by base pair. A best effort is therefore made to read the file formats into a common data structure. Reading these various formats into a common architecture helps simplify the application's design. For instance, it is not necessary to create a new graph type every time a file format is added. That said, some formats, such as sequence data, demand their own handling. Other formats have special features worth preserving; for instance, BED files allow each record to have its own color, a format-specific feature that has been preserved. What this means to BQGB's user is that, within the browser, data is data and not explicitly tied to a file format, i.e. not GFF data or BED data. What this means to the programmer is that most of a file format's record information can usually be captured by a single data structure, and the specific source of the data is unimportant once it has been read into memory, with the caveat "mostly!"
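To make the idea of a common data structure concrete, here is a minimal sketch in Python (BQGB itself is written in C++). The Record type and both parse functions are hypothetical names invented for illustration and do not reflect BQGB's actual classes. The sketch assumes tab-delimited input and relies on two well-known format details: GFF coordinates are 1-based and inclusive, while BED coordinates are 0-based with an exclusive end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical format-neutral record; not BQGB's actual data structure."""
    seqname: str
    start: int                    # 1-based, inclusive
    end: int                      # 1-based, inclusive
    score: Optional[float] = None
    color: Optional[str] = None   # BED-only feature (itemRgb), kept when present


def parse_gff_line(line: str) -> Record:
    # GFF columns: seqname, source, feature, start, end, score, strand, frame, attributes
    f = line.rstrip("\n").split("\t")
    score = None if f[5] == "." else float(f[5])
    return Record(f[0], int(f[3]), int(f[4]), score)


def parse_bed_line(line: str) -> Record:
    # BED columns: chrom, chromStart, chromEnd, name, score, strand,
    # thickStart, thickEnd, itemRgb, ... (only the first three are required)
    f = line.rstrip("\n").split("\t")
    score = float(f[4]) if len(f) > 4 and f[4] != "." else None
    color = f[8] if len(f) > 8 else None
    # Shift BED's 0-based start so both formats share 1-based inclusive coordinates.
    return Record(f[0], int(f[1]) + 1, int(f[2]), score, color)
```

Once both lines are in Record form, downstream code no longer needs to know which format they came from, while the BED color survives as an optional field.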

Annotation data and score data that span a region can be stored in either GFF or BED. While the browser handles both formats, GFF has definitely been the dominant and better supported format within the browser. Where the browser allows exporting of data, such as highlighted regions, the export is in GFF format. Currently, the block and line notation that many browsers, and the literature more generally, use to represent the different parts of a feature, e.g. the coding and non-coding regions of a gene, can only be produced by the browser when the data has been read in from a GFF file. Please see the user manual page on the "Preference" menu for more information.

One final note on GFF. Recently, the GFF spec was amended to allow a FASTA file to be appended to the file's end. The advantage of this is that annotation and sequence data are bundled in a single file. When encountering these chimeric files, BQGB should read both the GFF and FASTA portions. However, the disadvantages of bundling GFF and FASTA in a single file are manifold. Most importantly, it makes the file much more difficult to work with using the many powerful text processing tools that are fast, efficient and easy to use but require records to be regularly defined throughout the file's length.

As for XML, it is ten steps backward as a format for data storage, particularly for large amounts of data such as found in molecular biology, where the mark-up itself adds considerable bloat (up to 600% in some cases I've measured). No plan is in place to support XML annotations beyond avoiding XML!
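For readers who would rather work with the two portions of a combined GFF and FASTA file separately, the following Python sketch splits one apart. It assumes the GFF3 convention of a line reading `##FASTA` that marks where the sequence section begins; the function name split_gff_fasta is hypothetical, and this is not how BQGB itself reads such files.

```python
from typing import Iterable, List, Tuple


def split_gff_fasta(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split a combined file into (GFF lines, FASTA lines) at the ##FASTA directive."""
    gff: List[str] = []
    fasta: List[str] = []
    in_fasta = False
    for line in lines:
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue  # the directive itself belongs to neither part
        (fasta if in_fasta else gff).append(line)
    return gff, fasta
```

After the split, the annotation portion once again has regular, line-oriented records and can be handed to ordinary text processing tools.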

Tab and other delimited formats

The explicitly handled file formats named above will provide the best performance in terms of read speed, memory consumption and access to each format's distinct features. However, it is possible to open ad hoc delimited formats. The only hard rule in opening these files is that one column must give a sequence name and one column must give a base pair start position for the record (i.e. the row).

When BQGB is asked to open a file format it doesn't explicitly recognize, the user will be presented with the following dialog box:

The delimited file parser dialog allows one to set parse rules for delimited files as long as they contain a sequence name and a start base pair position. Frequently used files are better stored in one of the supported file formats. A tool like awk can be used to transform most files to a supported format. Perl might also be used, although it is far clumsier.


The only information absolutely required by this form is, again, which column contains sequence names and which column contains the start position for the record. However, if the file contains comments, it is important to also provide the parsing engine with the character(s) introducing comments, lest a comment be mistaken for a row containing a data record. Other columns can be added to the parse plan as necessary in step #7 of the dialog above. It is not necessary to define every single column in the delimited file, only those of interest.

At this time it is not possible to save the parse plan, i.e. the form above must be filled out each time a file in a given delimited format is opened. Savable and reusable parse plans would certainly be a good feature for the future!
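The parse rules described in this section can be sketched in a few lines of Python. This also shows one way to convert a frequently used ad hoc file to a supported format once, rather than re-entering the rules each time. Everything here is illustrative: ParsePlan, parse_delimited and to_sgr are hypothetical names and not part of BQGB, and the SGR output assumes the usual three tab-separated columns (sequence name, position, score).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class ParsePlan:
    """Hypothetical stand-in for the dialog's settings; not a BQGB type."""
    seqname_col: int                      # 0-based index of the sequence name column (required)
    start_col: int                        # 0-based index of the start position column (required)
    delimiter: str = "\t"
    comment_prefix: Optional[str] = None  # character(s) introducing a comment line
    extra_cols: Dict[str, int] = field(default_factory=dict)  # optional named columns


def parse_delimited(lines: Iterable[str], plan: ParsePlan) -> Iterator[dict]:
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if plan.comment_prefix and line.startswith(plan.comment_prefix):
            continue  # skip comments so they are not mistaken for data rows
        cols = line.split(plan.delimiter)
        rec = {"seqname": cols[plan.seqname_col], "start": int(cols[plan.start_col])}
        # Only the columns named in the plan are kept; the rest are ignored.
        for name, idx in plan.extra_cols.items():
            rec[name] = cols[idx]
        yield rec


def to_sgr(rec: dict, score_key: str = "score") -> str:
    # Assumed SGR layout: sequence name, position, score, separated by tabs.
    return f"{rec['seqname']}\t{rec['start']}\t{rec[score_key]}"
```

For example, a comma-separated file with the sequence name in the first column, the start position in the third and a score in the fourth would use `ParsePlan(seqname_col=0, start_col=2, delimiter=",", comment_prefix="#", extra_cols={"score": 3})`.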

Adding support for new file formats

Adding new file formats involves writing C++ code. Fortunately, it can be as easy as pulling together one short Reader class. Please see the programming notes and the source code itself, particularly the Reader and GVec classes, for more information.
