GVec Class Reference

GVec is the backbone data structure for storing biological data.

#include <GVec.h>

Inherited by GVecBed, GVecDelimited, GVecFasta, GVecGFF, GVecMap, GVecMulti, GVecMultiFasta, GVecSGR, and GVecWIG.


Classes

class  Record

Public Types

enum  vtype {
  SGR, WIG, GFF, FASTA,
  DELIMITED, BED, MULTI, MAP,
  MULTIFASTA, UNKNOWN
}
enum  ftype { INT, LONG, DOUBLE, STRING }

Signals

void input_complete (GVec *)
void sequence_deleted (GVec *, QString)

Public Member Functions

 GVec (QString, QString, GVec *parent=0)
virtual ~GVec ()
QString get_source_file_name ()
void set_source_file_name (QString)
QString get_name ()
QString get_fq_name ()
virtual GVec::vtype get_type ()
vector< QString > * get_comments ()
void get_sequence_names (vector< QString > &)
void set_focus (QString)
QString get_focus ()
QStringList get_foci ()
virtual QString get_feature ()
virtual void get_strands (vector< GVec * >)
virtual GVec * get_strand (const QString)
virtual QString get_strand ()
void get_keys (list< QString > &)
GVec * get_parent ()
void get_children (vector< GVec * > &)
GVec * get_child (const QString)
GVec * get_node_at (list< QString > &)
GVec * get_node_at (const QString)
int get_child_count ()
GVec * get_root_gvec ()
void get_record_gvecs (vector< GVec * > &)
bool is_root ()
bool is_leaf ()
QString get_type_string (GVec::ftype)
size_t get_closest_index_by_start (long)
vector< size_t > contains (long)
vector< size_t > overlaps (long, long)
virtual GVec::Record * at (size_t)
virtual GVec::Record * operator[] (size_t)
virtual long get_start (size_t)
virtual long get_stop (size_t)
virtual real get_score (size_t)
virtual QColor * get_color (size_t)
virtual string get_info (size_t)
virtual QString get_attribute_value (size_t, QString)
virtual double max ()
virtual double min ()
virtual long get_longest_feature ()
long length ()
long last_bp ()
bool is_empty ()
bool has_focus (QString)
void set_has_score (bool)
void erase (long)
void erase (long, long)
virtual bool has_score ()
virtual bool has_sequence ()
virtual bool has_text ()
virtual bool has_stop_position ()
void push_back_comment (string comment)
void at_read_eof ()
virtual string to_string (int ident=0)
virtual string record_to_string (GVec::Record *)

Static Public Member Functions

static QString get_gvec_type_string (GVec::vtype)

Protected Member Functions

void check_sort_on_push_back (const QString)
void update_stats ()
void update_stats (long)
void update_stats (long start, long stop, real score=0)
bool breaks_stats (long start, long stop, real score=0)
size_t find_closest_index_by_start (long, long, long)

Protected Attributes

QString node_name
QString source_file_name
vtype type
QString focus
string strand
bool has_score_data
GVec * gvec_parent
map< const QString, vector< Record * > > genome
map< const QString, long > counts
map< const QString, double > maxes
map< const QString, double > mins
map< const QString, long > lasts
map< const QString, bool > sorted
map< const QString, long > longest_feature
vector< QString > * comments
QString feature
map< QString, GVec * > child_gvecs

Detailed Description

GVec is the backbone data structure for storing biological data.

When a data file containing biological records, such as a GFF file, is read, its records are parsed by a Reader class and organized into GVec objects. A GVec usually has child GVecs and possibly a parent GVec, and maintains pointers to these objects, i.e. the GVec is part of a GVec tree. Each level in the tree usually holds one level of description that subsequently branches out, for instance, source file->features->strands->records, where "file" is the root node. While records might be stored at multiple levels, they have traditionally been placed solely at the leaf nodes.
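
The parent and child relationships described above can be sketched in a few lines. This is only an illustrative stand-in, not the GVec implementation: the Node struct, its member names, and the use of std::string in place of QString are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a GVec tree node (not the real class).
struct Node {
    std::string name;                                   // node name: file, feature, strand, ...
    Node* parent = nullptr;                             // null for the root node
    std::map<std::string, std::unique_ptr<Node>> kids;  // children keyed by node name

    explicit Node(std::string n, Node* p = nullptr) : name(std::move(n)), parent(p) {}

    // Create a child, or return the existing child of that name.
    Node* add_child(const std::string& child_name) {
        auto& slot = kids[child_name];
        if (!slot) slot = std::make_unique<Node>(child_name, this);
        return slot.get();
    }

    bool is_root() const { return parent == nullptr; }
    bool is_leaf() const { return kids.empty(); }

    // Walk parent pointers until the root is reached.
    Node* root() {
        Node* n = this;
        while (n->parent != nullptr) n = n->parent;
        return n;
    }
};
```

Building file->feature->strand with add_child( ) gives a chain in which only the file node reports is_root( ) and only the strand node reports is_leaf( ).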

A GVec also organizes records by sequence, i.e. the biological sequence structure with which the record is associated, most commonly a chromosome. At any one time, a single sequence has the focus, i.e. the sequence on which the tree is currently operating, and that focus is shared throughout the GVec tree. For instance, if the focus is chromosome 2L and one calls GVec::length( ) or GVec::max( ), one is asking "how many records are there for chromosome 2L?" and "what's the max score in the 2L data?" respectively. The GVec::set_focus( QString ) method changes the current focus. If one calls set_focus( "2R" ) and then poses the same questions, the answers pertain to 2R, not 2L.
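
As a rough illustration of how the focus selects which record vector a query sees, consider the sketch below. FocusedStore and Rec are hypothetical names, std::string stands in for QString, and the sketch covers a single node only; it does not show how the real GVec shares the focus across a tree.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: a start position and a score.
struct Rec { long start; double score; };

// Hypothetical stand-in for per-sequence storage with a current focus.
class FocusedStore {
public:
    void push_back(const std::string& seq, Rec r) { genome_[seq].push_back(r); }
    void set_focus(const std::string& seq) { focus_ = seq; }
    const std::string& get_focus() const { return focus_; }

    // Number of records for the sequence that currently has the focus.
    long length() const {
        auto it = genome_.find(focus_);
        return it == genome_.end() ? 0 : static_cast<long>(it->second.size());
    }

    // Maximum score for the focused sequence; the focus must hold records.
    double max() const {
        auto it = genome_.find(focus_);
        assert(it != genome_.end() && !it->second.empty());
        double best = it->second.front().score;
        for (const Rec& r : it->second)
            if (r.score > best) best = r.score;
        return best;
    }

private:
    std::map<std::string, std::vector<Rec>> genome_;  // records keyed by sequence name
    std::string focus_;                               // sequence currently in focus
};
```

The same length( ) and max( ) calls return different answers after set_focus( "2R" ) than after set_focus( "2L" ), because each call reads only the vector stored under the current focus.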

Finally, a GVec maintains a number of global metrics about the data that are cheaper to calculate once at read time than to recompute by searching all the data each time they are needed: max, min, longest feature, etc.
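
One way such metrics could be kept current as records arrive is sketched below. The Stats struct and its fields are hypothetical, and measuring a feature's length as stop - start + 1 (inclusive bounds, non-negative coordinates) is an assumption rather than something stated in this reference.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record with inclusive start and stop positions.
struct Rec { long start; long stop; double score; };

// Metrics cached for one sequence, updated once per incoming record.
struct Stats {
    long count = 0;
    double max_score = 0, min_score = 0;
    long last_bp = 0;          // largest stop position seen so far
    long longest_feature = 0;  // longest record seen so far, in base pairs

    void update(const Rec& r) {
        assert(r.stop >= r.start);
        if (count == 0) {
            max_score = min_score = r.score;  // first record defines both extremes
        } else {
            max_score = std::max(max_score, r.score);
            min_score = std::min(min_score, r.score);
        }
        last_bp = std::max(last_bp, r.stop);
        longest_feature = std::max(longest_feature, r.stop - r.start + 1);
        ++count;
    }
};
```

Each update costs constant time, so later queries for the maximum or the longest feature never have to scan the records again.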


Constructor & Destructor Documentation

GVec::GVec ( QString  file_name,
QString  node_name,
GVec * parent = 0 
)

GVec constructor. It is normally called only by classes that derive from GVec, but GVec is not abstract, and nothing stops one from using it directly should the need arise.

Parameters:
file_name most likely the file from which the data was read or that it might be written to
GVec::~GVec (  )  [virtual]

GVec destructor; of course, not normally called directly, but the main task is to free memory used by records (presumably the bulk of the memory consumed by a GVec).


Member Function Documentation

GVec::Record * GVec::at ( size_t  i  )  [virtual]

This method functions somewhat like its namesake in std::vector and similar containers. However, remember that a GVec is actually a map of record vectors organized by sequence focus. So, this "at" returns the value at the index argument for the current focus.

In general, one should consider using methods like get_start( i ), get_stop( i ), get_score( i ), etc., which look up the record at that index and return the requested information without the caller having to handle GVec::Record objects.

Finally, the method does not fail safely if i is outside of the vector's range. It's up to the programmer to check the length of the vector via the GVec::length( ) method.

Parameters:
size_t i the index of the desired record for the current focus
Returns:
a GVec::Record*
See also:
GVec::get_start( size_t i ), GVec::get_stop( size_t i ), GVec::get_score( size_t i ), GVec::set_focus( QString new_focus ), GVec::get_focus( ), GVec::length( ), etc...

Reimplemented in GVecBed, GVecDelimited, GVecFasta, and GVecGFF.
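
Since at( ) leaves range checking to the caller, a small guard along the following lines may be useful. Store and Rec are hypothetical stand-ins that only mimic the at( ) and length( ) calls described above, and checked_at is an illustrative helper, not part of GVec.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rec { long start; };  // hypothetical record

// Hypothetical stand-in with an unchecked at() and a length(), as described above.
struct Store {
    std::vector<Rec*> records;                     // records for the current focus
    Rec* at(std::size_t i) { return records[i]; }  // no range check
    long length() const { return static_cast<long>(records.size()); }
};

// Caller-side guard: returns nullptr instead of reading out of range.
Rec* checked_at(Store& s, std::size_t i) {
    long n = s.length();
    if (n <= 0 || i >= static_cast<std::size_t>(n)) return nullptr;
    return s.at(i);
}
```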

void GVec::at_read_eof (  ) 

This method is intended to be called by a Reader class when it has reached eof of its target file. Please see the code for a class like GFFReader for more details. The method's most important function is, if necessary, to sort the GVec.

bool GVec::breaks_stats ( long  start,
long  stop,
real  score = 0 
) [protected]

Checks whether the parameters would break the global metrics currently kept for the current sequence focus.

Parameters:
start a base pair position in the current focus
stop a base pair position in the current focus
score 
void GVec::check_sort_on_push_back ( const QString  seqname  )  [protected]

As a file is read in, this method, if used at all, should be called after each record is pushed back to check whether the record has violated the goal of sorting by start position; many files, of course, are unsorted. At the end of the read, one can then check whether the file needs to be sorted or whether the sort step can be skipped.
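
The pattern of checking order on every push and sorting only when needed can be sketched as follows. SeqBuffer and Rec are hypothetical, the method names merely echo those in this reference, and the choice of std::stable_sort is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rec { long start; };  // hypothetical record

// Hypothetical reader-side buffer for one sequence.
struct SeqBuffer {
    std::vector<Rec> records;
    bool sorted = true;  // stays true while starts arrive in non-decreasing order

    void push_back(Rec r) {
        records.push_back(r);
        check_sort_on_push_back();
    }

    // Compare the newest record with its predecessor.
    void check_sort_on_push_back() {
        std::size_t n = records.size();
        if (n > 1 && records[n - 1].start < records[n - 2].start) sorted = false;
    }

    // At end of file, sort only if an out-of-order record was seen.
    void at_read_eof() {
        if (sorted) return;
        std::stable_sort(records.begin(), records.end(),
                         [](const Rec& a, const Rec& b) { return a.start < b.start; });
        sorted = true;
    }
};
```

A file that is already ordered by start position therefore pays only one comparison per record and skips the sort entirely.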

vector< size_t > GVec::contains ( long  bp  ) 

Given a single base pair argument, this method will find all records that contain the position, returning the indices of those records.

Parameters:
bp a position in the coordinate space of the current GVec
Returns:
a vector of indices for the records containing the argument
See also:
GVec::overlaps( long from, long to )
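
A minimal sketch of the containment test follows. It is a plain linear scan over a hypothetical Rec type with inclusive start and stop positions; the real method may well exploit the sort order instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rec { long start; long stop; };  // hypothetical record, inclusive bounds

// Indices of records whose [start, stop] range covers bp.
std::vector<std::size_t> contains(const std::vector<Rec>& recs, long bp) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < recs.size(); ++i)
        if (recs[i].start <= bp && bp <= recs[i].stop) hits.push_back(i);
    return hits;
}
```
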
void GVec::erase ( long  from,
long  to 
)

Removes the records between from and to indices for the record vector of the current sequence focus.

Parameters:
from the index from which to start removing records
to remove up to and including this index
void GVec::erase ( long  i  ) 

Removes the record at the i'th position of the record vector for the current focus.

Parameters:
i the index of the record in the current focus's vector of records
size_t GVec::find_closest_index_by_start ( long  bp,
long  from,
long  to 
) [protected]

This operates similarly to GVec::get_closest_index_by_start( long bp ), but limits its search to the range given by the parameters from and to. This method is useful, for instance, for finding what records need to be drawn across the screen.

See also:
GVec::get_closest_index_by_start( long bp )
GVec * GVec::get_child ( const QString  kid  ) 

This method returns the named child. It is not recursive, so the search is only against the immediate children and not more distant nodes.

Parameters:
kid the name of the GVec child for which to search
Returns:
GVec* the GVec child named by kid or 0 if not found
void GVec::get_children ( vector< GVec * > &  vec  ) 

This method will load the vector argument with current object's children via a push back onto whatever is the argument's end.

Parameters:
a vector to hold the children
size_t GVec::get_closest_index_by_start ( long  bp  ) 

This method performs a modified binary search of the record vector for the current sequence focus, returning the lowest index of the closest matching record, i.e. if multiple records share a start position, the record with the smallest index. The bp argument is compared to the records' start positions.

Parameters:
bp a base pair position in the current sequence coordinate system
Returns:
the smallest index of the closest matching record
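
The search described above might look roughly like this when built on std::lower_bound. The function works on a bare vector of start positions rather than on GVec records, and the tie-breaking rule (equal distances go to the smaller start) is an assumption, since the reference does not state one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lowest index whose start is closest to bp.
// starts must be non-empty and sorted in ascending order.
std::size_t closest_index_by_start(const std::vector<long>& starts, long bp) {
    assert(!starts.empty());
    auto hi = std::lower_bound(starts.begin(), starts.end(), bp);  // first start >= bp
    if (hi == starts.begin()) return 0;
    auto lo = hi - 1;                                               // last start < bp
    if (hi != starts.end() && (*hi - bp) < (bp - *lo))
        return static_cast<std::size_t>(hi - starts.begin());
    // The lower neighbour wins; step back to the first record sharing its start.
    auto first = std::lower_bound(starts.begin(), lo, *lo);
    return static_cast<std::size_t>(first - starts.begin());
}
```

With starts of 10, 20, 20, 20, 35, 50, a query for 22 returns index 1, the first of the three records that start at 20.
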
QColor * GVec::get_color ( size_t  i  )  [virtual]

Returns the color associated with the record at the argument for the current focus. In GVec, this method simply returns 0 (null pointer). However, some file sources, like BED files (see GVecBed), allow for per-record color data, in which case this method comes in handy.

Parameters:
size_t i the index in the current focus's vector of records
Returns:
QColor associated with record
See also:
GVec::at( size_t i )

Reimplemented in GVecBed.

vector< QString > * GVec::get_comments (  ) 

Returns a list of comments from the source file in the order in which they were encountered... assuming the reading class kept track of comments.

Returns:
a vector of comment lines from the source file

Reimplemented in GVecFasta.

QString GVec::get_feature (  )  [virtual]

DEPRECATED!!! At one time feature was more tightly defined, now this simply returns the node_name, which may or may not be the node one wants to use as the feature.

Perhaps in future versions, feature will be settable... it might be any level in a GVec tree, but the tree will keep track of its location.

Returns:
node_name

Reimplemented in GVecGFF.

QStringList GVec::get_foci (  ) 

Returns a list of all foci for the object from which it's called, i.e. no recursive search of the GVec tree is made.

Returns:
QStringList containing the object's foci
QString GVec::get_focus (  ) 

Returns the name of the current focus, which is a sequence name of some sort.

Returns:
focus
QString GVec::get_fq_name (  ) 

Returns a "fully qualified" name, i.e. the node name concatenated to its parent's node name, applied recursively until reaching the root node.

Returns:
fully qualified name
See also:
GVec::get_name( )
QString GVec::get_gvec_type_string ( GVec::vtype  t  )  [static]

The GVec class defines an enum of GVec types for the purpose of identifying a class type without having to fish around via dynamic casts or similar. Additionally, this method could be useful when a string of a type name is needed for the ui. Of course, this only works if the GVec subclass has properly defined the get_type( ) method!

The method here simply returns a QString representation of that type.

Parameters:
t a value from the GVec::vtype enum
Returns:
a QString representation of a GVec::vtype enum
See also:
GVec::get_type( )
string GVec::get_info ( size_t  i  )  [virtual]

Returns a formatted string representation of the record, for instance, one that might be displayed when a user clicks on the record's glyph.

Parameters:
size_t i the index in the current focus's vector of records
Returns:
string representation of the record, formatted for display
See also:
GVec::at( size_t i )

Reimplemented in GVecBed, GVecDelimited, and GVecGFF.

void GVec::get_keys ( list< QString > &  keys  ) 

Starting from this object, recursively follow parents until reaching the root node, adding the node names onto the list argument.

The nodes are added to the front, ie. the root node name will be first in the list and the name from which the method's originally called, the last.

Parameters:
a list onto which node names can be stored

Reimplemented in GVecMulti.
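
The front insertion described above can be illustrated with a hypothetical Node that carries only a name and a parent pointer. The sketch uses a loop rather than recursion and std::string rather than QString; both are simplifications, not the documented implementation.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical stand-in node with only a name and a parent pointer.
struct Node {
    std::string name;
    Node* parent;  // null for the root
};

// Push names onto the front while walking up, so the root ends up first
// and the node the walk started from ends up last.
void get_keys(const Node* n, std::list<std::string>& keys) {
    for (; n != nullptr; n = n->parent) keys.push_front(n->name);
}
```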

long GVec::get_longest_feature (  )  [virtual]

For each sequence focus, the length of the longest record is tracked. This is useful for the ui in determining the start of the left hand search point that guarantees that the end of the longest record (and of any shorter record) is drawn onto the screen (left hand px coordinate minus px length of longest feature). The value here is returned in units of base pairs; it's up to the ui to convert that to pixels.

Reimplemented in GVecFasta.

QString GVec::get_name (  ) 

Returns the name of this GVec object, known as the node name since GVecs are usually used as part of trees. The node name might be a file name, feature name, strand, etc....

Returns:
node_name
GVec * GVec::get_node_at ( const QString  name  ) 

Conducts a recursive search for the named node from the current node down through children nodes until a match is found. The first match encountered is the match returned.

Parameters:
name of the GVec for which to search
Returns:
GVec* named in argument or 0 if nothing's found
GVec * GVec::get_node_at ( list< QString > &  keys  ) 

This method follows the keys from front to back recursively through the GVec tree, starting with the object from which it's called, i.e. the first key is taken off the front and searched against the object's child GVecs, and the method is then called recursively against the matching child until keys is empty. If a match fails at any level before keys is empty, the method returns 0.

This method should return the same GVec* as get_node_at( keys.back( ) ), while providing an extra check that all intervening nodes exist.

Parameters:
a list defining the order through which to travel the tree
Returns:
GVec* at the end of the search
See also:
GVec::get_node_at( const QString name )
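
A sketch of the key-by-key descent, assuming the keys name descendants of the starting node and not the starting node itself. The Node struct with non-owning child pointers is hypothetical, and nullptr plays the role of the 0 return value.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

// Hypothetical stand-in node.
struct Node {
    std::string name;
    std::map<std::string, Node*> kids;  // non-owning child pointers keyed by name
};

// Consume keys front to back, descending one level per key.
// Returns nullptr as soon as a key has no matching child.
Node* get_node_at(Node* n, std::list<std::string>& keys) {
    assert(n != nullptr);
    if (keys.empty()) return n;
    auto it = n->kids.find(keys.front());
    if (it == n->kids.end()) return nullptr;
    keys.pop_front();
    return get_node_at(it->second, keys);
}
```

Note that this sketch empties the caller's list on success, which follows from taking keys off the front as described above.
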
GVec * GVec::get_parent (  ) 

Root node should return 0.

void GVec::get_record_gvecs ( vector< GVec * > &  vec  ) 

This method has the potential for being misleading. What it does do is provide the GVec objects, added to the vector argument, that make up the leaves of a GVec tree. Normally, this is where records were intended to be kept. However, it is conceivable that one would keep records at some other level in the tree, in which case this method might not return what one wants.

Parameters:
vec a vector in which to place the records GVec objects
GVec * GVec::get_root_gvec (  ) 

Traverses a GVec tree until reaching the root and then returns that root.

real GVec::get_score ( size_t  i  )  [virtual]

Returns the score associated with the record at the argument for the current focus. Of course, not all GVec's have scores. An assumption has been made that scored data and scoreless data (like many annotations) will not be in the same GVec; one can use the GVec::has_score( ) method to check if the GVec object is associated with scores.

Parameters:
size_t i the index in the current focus's vector of records
Returns:
the record's score
See also:
GVec::at( size_t i ), GVec::has_score( )

Reimplemented in GVecDelimited, and GVecGFF.

void GVec::get_sequence_names ( vector< QString > &  vec  ) 

Loads the argument with the names of sequences found in the object from which it is called, i.e. no search of the tree is made. The sequence names define the list of possible foci, for instance, chromosome names.

No check is made that the vector argument is empty. Results are simply pushed onto the end.

Parameters:
a vector in which to place the sequence names
See also:
GVec::get_focus( ), GVec::set_focus( QString )
QString GVec::get_source_file_name (  ) 
long GVec::get_start ( size_t  i  )  [virtual]

Returns the base pair start position for the record at the argument for the current focus.

Parameters:
size_t i the index in the current focus's vector of records
Returns:
the base pair position where the record starts
See also:
GVec::at( size_t i )

Reimplemented in GVecDelimited, and GVecGFF.

long GVec::get_stop ( size_t  i  )  [virtual]

Returns the base pair stop position for the record at the argument for the current focus. The records handled in this base GVec class are all associated with a single base pair locus; therefore, the start position and stop position are the same.

Parameters:
size_t i the index in the current focus's vector of records
Returns:
the base pair position where the record stops
See also:
GVec::at( size_t i )

Reimplemented in GVecBed, GVecDelimited, GVecFasta, GVecGFF, and GVecWIG.

QString GVec::get_strand (  )  [virtual]

This method should return one of three values: "+", "-" or ".". It operates by checking the strand of each leaf GVec against each other. If they're not all the same or are indeterminate, "." will be returned.

This method is definitely up for change since the strand is currently kept both as a class attribute and as its own node in a GVec tree, which is less than ideal.

Future versions will probably make the strand its own node at any level in the tree, with the tree keeping track which level contains strand nodes.

Returns:
QString representation of the GVec's strand(s)
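
The comparison of leaf strands could be sketched as below. The function takes the leaf strands as a plain vector of strings, which sidesteps how the real method collects them, and returning "." for an empty list is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Combine the strands of the leaf nodes: "+" or "-" when every leaf agrees,
// "." when they differ, when any leaf is indeterminate, or when there are none.
std::string consensus_strand(const std::vector<std::string>& leaf_strands) {
    if (leaf_strands.empty()) return ".";
    const std::string& first = leaf_strands.front();
    if (first != "+" && first != "-") return ".";
    for (const std::string& s : leaf_strands)
        if (s != first) return ".";
    return first;
}
```
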
GVec::vtype GVec::get_type (  )  [virtual]

Sometimes it's important to know what class type of GVec one is working with, for instance, when choosing a graph type to match. This method was included to help avoid casting (dynamic or otherwise) and returns a value of the GVec::vtype enum defined in GVec. Of course, this only works if a GVec inheriting class actually adds a new enum value and implements this method; it's not required to!

So, this is just a convenience and certainly not as reliable as explicitly checking a GVec's type, for instance, with a dynamic cast.

Here the value UNKNOWN is always returned.

Reimplemented in GVecBed, GVecDelimited, GVecFasta, GVecGFF, GVecMap, GVecMulti, GVecMultiFasta, GVecSGR, and GVecWIG.

QString GVec::get_type_string ( GVec::ftype  t  ) 

This method was added to support the parsing of arbitrarily defined delimited files. GVec defines an enum called ftype that includes the types supported in this parsing. This method converts the values of the ftype enum to a QString.

Parameters:
t the value of a GVec::ftype
Returns:
a string representation of the ftype enum values
bool GVec::has_focus ( QString  focus  ) 

Addresses the question of whether this GVec* tree has the named sequence focus, independent of what the current sequence focus is.

Parameters:
focus the name of a focus for which to search
bool GVec::has_score (  )  [virtual]

Does this GVec contain records with score data? This is set by the Reader, which should be aware of what kind of data it is encountering.

bool GVec::has_stop_position (  )  [virtual]

Does this GVec have records that contain a stop position? All records have a start position, but some (an annotation for instance) span a segment of the sequence up to and including the stop position. In the latter case, this returns true.

Reimplemented in GVecDelimited, and GVecGFF.

bool GVec::has_text (  )  [virtual]

Do the records in this GVec contain text, for instance, an annotation name? This is determined by the Reader.

Reimplemented in GVecDelimited, and GVecGFF.

bool GVec::is_empty (  ) 

What is an empty GVec? Here it is defined as a GVec that has no sequences assigned to it. It may, however, contain other important information such as a node name that gives a feature and pointers to other GVecs that are not empty.

bool GVec::is_leaf (  ) 

Is this GVec object the terminal leaf of a tree? Terminal leaves are usually where the vectors of records reside. The method is useful for writing recursive functions to traverse trees.

bool GVec::is_root (  ) 

Is this object the root of a tree, i.e. is its gvec_parent == 0? The method's purpose is mostly to facilitate the writing of recursive functions traversing GVec trees.

long GVec::last_bp (  ) 

For each sequence focus, the last base pair position (a stop position) is tracked. This method returns that value for the current focus. Note that this is not necessarily the stop position of the last record since records are sorted by start position and often of varying lengths.

long GVec::length (  ) 

Returns the size of the vector of records for the current sequence focus.

double GVec::max (  )  [virtual]

For each sequence focus, the maximum score across all records is tracked. This method returns that value for the current sequence focus.

Returns:
max score record for current sequence focus

Reimplemented in GVecDelimited.

double GVec::min (  )  [virtual]

For each sequence focus, the minimum score across all records is tracked. This method returns that value for the current sequence focus.

Returns:
min score record for current sequence focus

Reimplemented in GVecDelimited.

GVec::Record * GVec::operator[] ( size_t  i  )  [virtual]

Operator wrapping GVec::at( size_t i )

vector< size_t > GVec::overlaps ( long  from,
long  to 
)

Given a segment defined by a start and end position, find the overlapping records. Records may partially overlap or be completely contained within the argument segment. Return the indices of the matching records.

Parameters:
from a start position in the coordinate space of the current GVec
to an end position in the coordinate space of the current GVec
Returns:
a vector of indices for the records overlapping the argument segment
See also:
GVec::contains( long bp )
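
One plausible way to combine the overlap test with the tracked longest feature is sketched below: because no record is longer than the longest feature, the scan can start at the first record whose start is at least from - longest + 1. The free function, the Rec type with inclusive bounds, and the requirement that records are sorted by start are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rec { long start; long stop; };  // hypothetical record, inclusive bounds

// Indices of records overlapping [from, to]. Assumes recs is sorted by start and
// that longest is at least the length (stop - start + 1) of every record.
std::vector<std::size_t> overlaps(const std::vector<Rec>& recs, long longest,
                                  long from, long to) {
    assert(from <= to);
    std::vector<std::size_t> hits;
    // A record that starts before this position is too short to reach from.
    long first_start = from - longest + 1;
    auto it = std::lower_bound(recs.begin(), recs.end(), first_start,
                               [](const Rec& r, long bp) { return r.start < bp; });
    // Stop once records start beyond the end of the segment.
    for (; it != recs.end() && it->start <= to; ++it)
        if (it->stop >= from)
            hits.push_back(static_cast<std::size_t>(it - recs.begin()));
    return hits;
}
```
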
void GVec::push_back_comment ( string  comment  ) 

If a Reader is tracking comment lines in a file, it can use this method to store them in the GVec, although currently the storage is pretty rough... just pushed back onto a vector in the order in which they're sent.

Reimplemented in GVecDelimited.

string GVec::record_to_string ( GVec::Record * rec  )  [virtual]

Compiles a formatted string version of a GVec::Record, mostly for debugging purposes.

void GVec::set_has_score ( bool  tf  ) 

This is a method intended to be used by Reader classes to indicate whether the associated file has score data.

void GVec::set_source_file_name ( QString  new_name  ) 

When reading a data file, the source_file_name would normally be set via the constructor. However, when writing a file, one may very well want to change the value after instantiation.

See also:
GVec::get_source_file_name( )
string GVec::to_string ( int  ident = 0  )  [virtual]

Compiles a formatted string version of a GVec, mostly for debugging purposes.

void GVec::update_stats ( long  start,
long  stop,
real  score = 0 
) [protected]

Similar to GVec::update_stats( long i ), but instead of checking against a particular record, checks against the parameters.

Parameters:
start a base pair position
stop a base pair position
score 
See also:
GVec::update_stats( long i ), GVec::breaks_stats( long start, long stop, real score )
void GVec::update_stats ( long  i  )  [protected]

Checks if the record inserted at position i breaks the GVec's global metrics for the current sequence vector (min, max, etc...).

Parameters:
i index of new record
See also:
GVec::update_stats( long start, long stop, real score ), GVec::breaks_stats( long start, long stop, real score )
void GVec::update_stats (  )  [protected]

If a GVec is modified after being read in by either the insertion or removal of a record, this method should be called to ensure that the global metrics for the vector haven't been broken (min, max, etc...).

All records for the current sequence focus are checked.

See also:
GVec::update_stats( long i ), GVec::breaks_stats( long start, long stop, real score )

The documentation for this class was generated from the following files:

Generated on Thu Sep 17 15:19:42 2009 for BQGB by  doxygen 1.6.1