Berkeley Drosophila Transcription Network Project


Berkeley Quantitative Genome Browser User Manual


Programming Notes

Overview

That people continue to create genome browsers underscores that they want browsers they can customize for their particular research goals. Indeed, it is incorrect to assume that any single visualization or browsing idiom could provide insight into the diverse phenomena of molecular genetics: from highly localized single nucleotide polymorphisms, to nearby and distant cis-regulatory associations, to trans relationships or comparisons of widely diverged genomes from different species.

Recognizing that browsing needs are dynamic and will continue to evolve and present new challenges, effort was made to design a flexible underlying API. The following notes aim to provide an entry point to that API and are meant to be used in conjunction with the code documentation and the code itself.

BQGB is written in C++ and was developed using gcc. It is heavily dependent on Nokia's Qt libraries (formerly Trolltech's Qt libraries).


Model-View-Controller

The idea of Model-View-Controller can be applied at many different design levels. In BQGB it means that classes that hold the genomic data ("model") are separate from classes that render glyphs of that data ("view") and separate from classes that apply transformations to that data ("controller").

This division is reflected in the source code directory structure:

anchovy@ubuntu:~/bbrowser$ ls src/
calculator data icons ui_main

The data directory holds the classes that make up the underlying structure of genomic data and the methods for accessing that data. Classes for reading and writing data files on disk (SGR, GFF, BED, etc...) are also located here. The two most important classes in this directory for understanding the code's operation and API design are GVec and Reader. While classes here do rely on QtCore, methods that require including QtGui should never be placed in this directory! A completely Qt-free library of the data directory is a goal for the future; however, the current version relies on Qt's cross-platform thread wrapper (QThread) and widely employs the QString class to avoid frequent switching between C++'s std::string and QString elsewhere in the browser code.

The GUI-related code is housed under the ui_main directory. Most significantly for the present discussion, this includes the classes that draw the graphs, all of which are built by extending the DataTrackWidget class. While classes in ui_main may see, include and make use of classes in the data and calculator directories, the reciprocal relationship should be guarded against, i.e. no ui code among the data components.

Finally, the calculator directory should hold classes that transform data, particularly if the transformation is algorithmically complex and best separated into its own class(es). The idea is that a ui class would instantiate a calculator class, passing pointers of the required data structures (GVecs, etc...) to the calculator as necessary.
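The calculator idea can be illustrated with a small sketch. Everything named here (ScoreRecord, MovingAverageCalculator, compute) is hypothetical and is not taken from the BQGB source; a plain std::vector stands in for a GVec. The point is only the shape of the relationship: the calculator is handed a pointer to data, transforms it, and knows nothing about the GUI.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a score-bearing record; not the real GVec::Record.
struct ScoreRecord {
    long start;    // 1-based base pair start position
    double score;
};

// Hypothetical calculator: smooths scores with a centered moving average.
// It only reads the data it is handed and contains no ui code.
class MovingAverageCalculator {
public:
    MovingAverageCalculator(const std::vector<ScoreRecord>* data,
                            std::size_t half_window)
        : data_(data), half_window_(half_window) {
        assert(data_ != nullptr);
    }

    // Returns one smoothed score per input record, in the same order.
    std::vector<double> compute() const {
        const std::vector<ScoreRecord>& d = *data_;
        std::vector<double> out(d.size());
        for (std::size_t i = 0; i < d.size(); ++i) {
            // Clamp the window to the ends of the vector.
            const std::size_t lo = (i >= half_window_) ? i - half_window_ : 0;
            const std::size_t hi = std::min(d.size() - 1, i + half_window_);
            double sum = 0.0;
            for (std::size_t j = lo; j <= hi; ++j) sum += d[j].score;
            out[i] = sum / static_cast<double>(hi - lo + 1);
        }
        return out;
    }

private:
    const std::vector<ScoreRecord>* data_;  // not owned by the calculator
    std::size_t half_window_;
};
```

A ui class would construct such an object with a pointer to the data it is already displaying, call compute( ), and draw the result. The calculator never includes anything from ui_main.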


The GVec data structure

At its core, the BQGB API organizes genomic data into a sorted std::vector of records, each corresponding loosely to a single record in a data file. The simplest form of this record is GVec::Record, defined in GVec.h. At bare minimum this record contains a single piece of information, the 1-based base pair start position. Since a record class may be instantiated many millions of times, it avoids the memory overhead of polymorphism.

A GVec, however, is much more than a vector of these records. At a minimum, two additional pieces of information must be stored: (1) what sequence this record is on (ex. chromosome, plasmid, RNA, etc...) and (2) what kind of feature this record is associated with (ex. gene, exon, chip-chip score data, etc...). Sequence organization is contained in a std::map within the GVec object whose keys are the sequence names and whose values are std::vectors of records. Here's the definition taken from GVec.h:

map< const QString, vector<Record*> > genome;

The above std::map, std::vector, GVec::Record combination accounts for base pair position and per record associated information such as a score, gene name, sequence, etc... and separates records belonging to different sequences, but there is myriad other information identifying a record. What's the source file? What feature do these records annotate? What's the strand? GVec objects are built into multi-branch trees to contend with these relations, which are many-to-one. Each node may contain a different level of information along with pointers to the child GVecs that contain additional segmentation information. The leaf nodes always contain the vectors of records, though there's nothing stopping one from putting additional vectors of records in higher nodes.

GVec data structure
The GVec data structure is an r-branched tree whose leaves contain location-specific records
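As a rough illustration of the tree described above, here is a minimal, Qt-free sketch. It is not the GVec class: Node, add_child and count_records are hypothetical names, std::string stands in for QString, and records are stored by value rather than as Record* to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Minimal record: only the 1-based start position.
struct Record {
    long start;
};

// Hypothetical tree node. Each level can carry a different kind of label
// (source file, feature type, strand, ...); leaves hold the records.
struct Node {
    std::string label;
    std::map<std::string, std::vector<Record>> genome;  // sequence name -> sorted records
    std::vector<std::unique_ptr<Node>> children;

    // Adds a child node and returns a non-owning pointer to it.
    Node* add_child(const std::string& child_label) {
        children.push_back(std::make_unique<Node>());
        children.back()->label = child_label;
        return children.back().get();
    }

    // Counts the records held in this node and in everything below it.
    std::size_t count_records() const {
        std::size_t n = 0;
        for (const auto& entry : genome) n += entry.second.size();
        for (const auto& child : children) n += child->count_records();
        return n;
    }
};
```

For example, a root node labelled with a file name could have a "gene" child, which in turn has "+" and "-" children whose genome maps hold the records for each sequence. Nothing prevents a higher node from holding records too, as noted above.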


Additionally, many basic metrics are best compiled at read time. A reader class will have to loop through the file once in its entirety and the trip can be an opportunity to create some descriptions of the data, ex. mins/maxes and longest feature. Although BQGB does not currently do so, variance, or at least a sum-of-squares, could also be trivially calculated at read time. Beware, however, that should one be inserting into a GVec later, such metrics need updating!
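A sketch of what compiling such metrics at read time could look like follows. TrackStats and Feature are hypothetical names, not BQGB classes, and the feature length is assumed to be end - start + 1 (inclusive 1-based coordinates). The sum-of-squares is included only to show how cheaply a variance could be derived; BQGB does not currently do this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical parsed record with inclusive 1-based coordinates.
struct Feature {
    long start;
    long end;
    double score;
};

// Running metrics, updated once per record inside the reader's loop.
struct TrackStats {
    std::size_t count = 0;
    double min_score = std::numeric_limits<double>::infinity();
    double max_score = -std::numeric_limits<double>::infinity();
    long longest_feature = 0;
    double sum = 0.0;
    double sum_sq = 0.0;

    void update(const Feature& f) {
        assert(f.end >= f.start);
        ++count;
        min_score = std::min(min_score, f.score);
        max_score = std::max(max_score, f.score);
        longest_feature = std::max(longest_feature, f.end - f.start + 1);
        sum += f.score;
        sum_sq += f.score * f.score;
    }

    // Population variance from the running sums: E[x^2] - (E[x])^2.
    double variance() const {
        if (count == 0) return 0.0;
        const double n = static_cast<double>(count);
        const double mean = sum / n;
        return sum_sq / n - mean * mean;
    }
};
```

The same update( ) call would have to be made on any later insertion into the data structure, which is the maintenance burden the warning above refers to. Note also that min, max and longest feature cannot be cheaply corrected when a record is removed.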

The public methods in GVec.h provide access to the data structure. They have been broadly defined and are usually declared virtual to spare subsequent application code from having to make too many casts (ex. from GVec to GVecGFF). Please consult the code for details, but here are three general groups of public GVec methods:

One last method will be mentioned here explicitly as it's key to BQGB's efficiency as a browser: size_t GVec::get_closest_index_by_start( long i ). A vector of records may contain an entire chromosome's worth of data, potentially millions of records! Meanwhile, the viewer is likely to be interested only in a shorter region, dozens to tens of thousands of base pairs. The viewer code, however, knows the positions of the right and left hand edges of the window. The trick is to find the records between those two positions quickly. BQGB does this with a modified binary search algorithm that always returns a record: the record closest to the function's argument. Since this search uses feature start positions, it's best combined with the get_longest_feature method, i.e. move the start of the viewing window to the left by the length of the longest feature in the vector of records.

View window drawing strategy
To limit the number of records drawn to just those necessary, a coordinate to the left of the screen is calculated by subtracting the length of the longest feature in the track from the coordinate of the leftmost edge of the view area. A modified binary search is then performed against that far-left coordinate to find the index of the first record that would need to be drawn in the worst case, namely that the view contains the longest record. Next, the records are retrieved sequentially from the sorted vector in which they reside until either the end of the vector is reached or the start coordinate of the record is beyond the right edge of the screen.
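The strategy above can be sketched as follows. This is not BQGB's implementation: closest_index_by_start and visible_indices are hypothetical free functions, the "modified binary search" is approximated with std::lower_bound plus a comparison with the preceding record, and Feature with inclusive start/end coordinates is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record with inclusive coordinates, sorted by start.
struct Feature {
    long start;
    long end;
};

// Index of the record whose start is closest to pos.
// Always returns a valid index; the vector must be sorted and non-empty.
std::size_t closest_index_by_start(const std::vector<Feature>& recs, long pos) {
    assert(!recs.empty());
    auto it = std::lower_bound(
        recs.begin(), recs.end(), pos,
        [](const Feature& f, long p) { return f.start < p; });
    if (it == recs.end()) return recs.size() - 1;
    if (it == recs.begin()) return 0;
    auto prev = it - 1;
    // Pick whichever neighbour's start is nearer to pos.
    auto best = (pos - prev->start <= it->start - pos) ? prev : it;
    return static_cast<std::size_t>(best - recs.begin());
}

// Indices of the records overlapping the window [left, right].
std::vector<std::size_t> visible_indices(const std::vector<Feature>& recs,
                                         long left, long right,
                                         long longest_feature) {
    std::vector<std::size_t> out;
    if (recs.empty()) return out;
    // Start far enough left that even the longest feature cannot be missed.
    std::size_t i = closest_index_by_start(recs, left - longest_feature);
    // Walk right until a record starts beyond the right edge of the window.
    for (; i < recs.size() && recs[i].start <= right; ++i) {
        if (recs[i].end >= left) out.push_back(i);
    }
    return out;
}
```

The search costs O(log n) and the walk only touches records near the window, so drawing time depends on what is on screen rather than on the size of the chromosome. The price is that one unusually long feature widens the scan for every window in that track.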


Finally, although GVec is not strictly an abstract class, it is rarely if ever instantiated directly. Instead it is extended by classes defining certain types of data, like scores or annotations. Classes are given names such as GVecSGR, GVecBed, GVecGFF, i.e. "GVec" + file_type_name. However, there's no strict rule that a certain file type must have its own GVec class. For instance, if all one was interested in from a GFF file was the start position and a score, GVecSGR would provide a more memory-efficient class. Of course, in this case one might simply convert the GFF files to SGR files and also save disk space!


Adding support for new file formats

Three groups of classes are important for adding file support to BQGB: GVec plus classes that inherit it, Reader plus classes that inherit it and GVecFactory. Additionally, MainWindow.cpp should be modified such that the file selection box will show the extension of the new file type.

The work of supporting a new file format is split among these classes as follows:

At a minimum, one will have to add code to GVecFactory and create a new reader class. Whether one creates a new GVec class depends on whether the new data can be comfortably squeezed into an existing GVec or not.

Please consult the GVecFactory code for details of how to modify this class for a new file type. The methods to pay close attention to are the following:

void GVecFactory::new_gvec( QString, int, FilePreferences*, int, int, GVec* parent=0 );
int GVecFactory::determine_type( QString );

Any new reader class should inherit Reader. Reader is abstract and requires the implementation of two methods, get_gvec and read_file. While get_gvec provides a common interface by which to access the GVec associated with a reader, it's left to each type of reader class to store its own GVec, thus allowing it to store the specific type of GVec associated with that file type, ex. GVecSGR, GVecGFF, GVecWIG, etc... Meanwhile, void read_file( ) contains the i/o loop and token parsing code along with a call to some sort of push_back method in its GVec by which to pass the tokens along. Normally, read_file will be called automatically when the reader's thread is started. However, if one wishes to run the reader directly from the caller's thread, there's no reason not to call read_file directly.
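To make the division of labour concrete, here is a heavily simplified, Qt-free sketch of the pattern. None of these classes are BQGB's: SimpleGVec, SimpleReader and ThreeColumnReader are hypothetical stand-ins, threading is left out, and the input is assumed to be three whitespace-separated columns (sequence name, position, score) read from a std::istream rather than from a file on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct ScoreRecord {
    long start;
    double score;
};

// Hypothetical stand-in for a GVec subclass holding score data.
class SimpleGVec {
public:
    void push_back(const std::string& seq, long start, double score) {
        genome_[seq].push_back(ScoreRecord{start, score});
    }
    std::size_t size(const std::string& seq) const {
        auto it = genome_.find(seq);
        return it == genome_.end() ? 0 : it->second.size();
    }
private:
    std::map<std::string, std::vector<ScoreRecord>> genome_;
};

// Hypothetical stand-in for the abstract Reader interface.
class SimpleReader {
public:
    virtual ~SimpleReader() = default;
    virtual SimpleGVec* get_gvec() = 0;
    virtual void read_file() = 0;
};

// Concrete reader: owns its own GVec and fills it in read_file().
class ThreeColumnReader : public SimpleReader {
public:
    explicit ThreeColumnReader(std::istream& in) : in_(in) {}

    SimpleGVec* get_gvec() override { return &gvec_; }

    void read_file() override {
        std::string line;
        while (std::getline(in_, line)) {                 // the i/o loop
            if (line.empty() || line[0] == '#') continue; // skip blanks and comments
            std::istringstream tokens(line);              // token parsing
            std::string seq;
            long start = 0;
            double score = 0.0;
            if (tokens >> seq >> start >> score) {
                gvec_.push_back(seq, start, score);       // hand tokens to the GVec
            }                                             // malformed lines are skipped
        }
    }

private:
    std::istream& in_;
    SimpleGVec gvec_;
};
```

A real reader would also need to guarantee that each vector of records ends up sorted by start position, since the search code depends on it, and would be the natural place to update any read-time metrics.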


Adding a graph type

Creating a new graph type involves building classes in the src/ui_main directory. The description below follows the design used by BQGB, where graph drawing classes inherit from the DataTrackWidget class. The classes ScoreTrackWidget and AnnotationTrackWidget provide two good models of how this is done. In fact, a new graph type class might consider inheriting ScoreTrackWidget or AnnotationTrackWidget instead of DataTrackWidget directly.

Since these graph classes are ultimately Qt QWidget classes, they must implement QWidget's void paintEvent( QPaintEvent* ) method. This is our launch point for describing the graph classes' mechanisms. Within the paintEvent method, calls are made to the methods that draw the background, axis, file name, cursors and other non-data parts of the graph image. The paintEvent method must then set two function pointers, one for drawing a data record and one for drawing it at low resolution (draw_record_pt and draw_record_pt_lr in the example below). The latter is used at distant zoom levels, when much data may be drawn to screen and complicated glyphs are unnecessary and slow. These function pointers may also be used to select other drawing nuances depending on factors such as, "what's the file source and is there special data associated with it?" Here's an example of the code setting the function pointers in AnnotationTrackWidget:

  switch ( gvec->get_type( ) ) {
    case GVec::GFF:
      draw_record_pt = &AnnotationTrackWidget::draw_record_gff;
      draw_record_pt_lr = &AnnotationTrackWidget::draw_record_gff_lr;
      break;
    case GVec::DELIMITED:
      draw_record_pt = &AnnotationTrackWidget::draw_record_delim;
      draw_record_pt_lr = &AnnotationTrackWidget::draw_record_delim_lr;
      break;
    case GVec::BED:
      draw_record_pt = &AnnotationTrackWidget::draw_record_bed;
      draw_record_pt_lr = &AnnotationTrackWidget::draw_record_bed_lr;
      break;
    default:
      draw_record_pt = &AnnotationTrackWidget::draw_record_default;
      draw_record_pt_lr = &AnnotationTrackWidget::draw_record_default_lr;
  }

This design builds in a simple polymorphism without creating new classes. It also means the answer to gvec->get_type( ) in the switch statement above is looked up once and reused for all the data points drawn in this view. After setting these function pointers, the graph class should call DataTrackWidget's void draw_data( QPainter* ) method. This method will loop through the data, selecting the appropriate resolution level and filtering or highlighting any data that meets such criteria. The draw_data method calls either:

virtual void draw_data_record ( QPainter*, size_t, vector* colors = 0 ){ };
virtual void draw_data_record_lr( QPainter*, size_t, vector* colors = 0 ){ }; // low res

Both of these methods should probably be overridden in the newly created graph class.
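The pointer-to-member dispatch described above can be tried in isolation with a small, Qt-free sketch. TrackSketch and all of its members are hypothetical; instead of painting, the "drawing" methods append a description to a vector of strings so the effect can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, Qt-free stand-in for a track widget; not a BQGB class.
class TrackSketch {
private:
    // Pointer to a member function that draws the record at one index.
    using DrawFn = void (TrackSketch::*)(std::size_t);

public:
    enum Type { GFF, OTHER };

    explicit TrackSketch(Type type) : type_(type) {}

    // Plays the role of paintEvent: choose the drawing methods once,
    // then loop over the records without switching on the type again.
    std::vector<std::string> paint(std::size_t n_records, bool low_res) {
        switch (type_) {
            case GFF:
                draw_pt_ = &TrackSketch::draw_gff;
                draw_pt_lr_ = &TrackSketch::draw_lr;
                break;
            default:
                draw_pt_ = &TrackSketch::draw_default;
                draw_pt_lr_ = &TrackSketch::draw_lr;
        }
        assert(draw_pt_ != nullptr && draw_pt_lr_ != nullptr);

        drawn_.clear();
        const DrawFn draw = low_res ? draw_pt_lr_ : draw_pt_;
        for (std::size_t i = 0; i < n_records; ++i) {
            (this->*draw)(i);  // call through the member function pointer
        }
        return drawn_;
    }

private:
    void draw_gff(std::size_t i)     { drawn_.push_back("gff glyph " + std::to_string(i)); }
    void draw_default(std::size_t i) { drawn_.push_back("box " + std::to_string(i)); }
    void draw_lr(std::size_t i)      { drawn_.push_back("tick " + std::to_string(i)); }

    Type type_;
    DrawFn draw_pt_ = nullptr;
    DrawFn draw_pt_lr_ = nullptr;
    std::vector<std::string> drawn_;
};
```

The (this->*draw)(i) syntax is the part that tends to trip people up: a pointer to a member function needs an object to be called on, so it cannot be invoked like an ordinary function pointer.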

The figure below outlines the strategy described above. Of course, always consult the code... it's likely to change faster than this documentation! The basic strategy, however, will remain the same.


Outline of track drawing steps

Creating a new central widget

BQGB's original design intended to allow easy, run-time swapping out of the "central widget", that is the widget that comprises the central viewing stack of tracks and their slider navigation controls. For instance, for some visualizations, the idiom of a vertically stacked set of tracks may be less useful than horizontal positioning or even a series of discrete adjacent windows as in pair-wise comparisons. This swapping of central widgets, however, has yet to be implemented and, indeed, the central widget itself is in painful need of refactoring!

Still, if one's interested in creating new central widgets, the two classes that can serve as launch points are MainWindow and BasicDisplayWidget, both located in the ui_main directory. The MainWindow class is where the central widget is set:

central_widget = new BasicDisplayWidget( preferences, graph_factory, this );
setCentralWidget( central_widget );

The class BasicDisplayWidget is the central widget. This class coordinates graph tracks with GVecs, mouse and other events, scrolling, etc... The future plan would be to refactor general methods from BasicDisplayWidget into something like an AbstractCentralDisplayWidget, which could then be used to facilitate the construction of new central widgets.
