3D Gene Expression

VirtualEmbryo File Format

VirtualEmbryo files are text files with a '.vpc' extension that contain composited average expression data for the set of embryos in a Gene Expression Atlas. The syntax extends that of individual PointCloud files to also describe temporal data for a set of discrete time points, as well as additional metadata describing the collection of individual PointClouds that went into building the atlas.

This file format is compatible with the 'comma-separated values' (.csv) format, and should be readable in applications such as Excel. Meta-data is written to a header in lines starting with a '#' character. These lines conform to the syntax specified below. Lines starting with '##' are comments. The actual comma-separated values form a table in which each row contains data for a single nucleus, and each column contains a specific measurement, such as coordinates, average expression level for a given gene and time, local density of nuclei, etc. The meaning of each column can be derived from the meta-data contained in the header.
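As a rough illustration of the three kinds of lines described above, the following Python sketch sorts lines into comments, meta-data and data rows. The function name classify_line and the sample lines are made up for this example and are not part of the format.

```python
def classify_line(line: str) -> str:
    """Classify one line of a .vpc file as 'blank', 'comment', 'metadata' or 'data'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    # '##' must be tested before '#', since comment lines also start with '#'.
    if stripped.startswith("##"):
        return "comment"
    if stripped.startswith("#"):
        return "metadata"
    return "data"


# Hypothetical sample lines, for illustration only.
sample = [
    "## free-form comment",
    "# nuclear_count = 3",
    "1,10.5,20.0,-3.25",
]
kinds = [classify_line(line) for line in sample]
```

Testing for '##' before '#' is the only ordering that matters here, because every comment line would otherwise be mistaken for meta-data.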

The row for a given 'virtual nucleus' contains one value for each data column, followed by a variable number of columns that specify the list of neighbors for that nucleus. The first number after the set of data columns is the number of neighbors, nn, which is followed by nn more entries giving the IDs of the nuclei that are direct neighbors. This data is used in rendering the embryo and may be useful for modeling propagation of factors between nuclei.
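The row layout can be sketched in Python as follows. This is a minimal example that assumes every data value is numeric and that the number of data columns (n_data, the length of the column array in the header) is already known; the helper name split_row and the sample row are hypothetical.

```python
import csv
import io


def split_row(fields, n_data):
    """Split one parsed CSV row into (data values, neighbor IDs).

    n_data is the number of data columns, i.e. the length of the 'column'
    array in the header. Assumes every data value is numeric.
    """
    data = [float(v) for v in fields[:n_data]]
    nn = int(fields[n_data])  # the neighbor count follows the data columns
    neighbors = [int(v) for v in fields[n_data + 1 : n_data + 1 + nn]]
    if len(neighbors) != nn:
        raise ValueError(f"expected {nn} neighbor IDs, found {len(neighbors)}")
    return data, neighbors


# Hypothetical row: 4 data columns, then nn = 2 and the two neighbor IDs.
text = "1,10.5,20.0,-3.25,2,2,7\n"
fields = next(csv.reader(io.StringIO(text)))
data, neighbors = split_row(fields, n_data=4)
```

Because rows have a variable number of trailing neighbor entries, table readers that expect a fixed column count may need the rows split this way first.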

Meta-data lines are of the form:

   # key = value

where key is the name of a property, and value can be either a number, a string or an array of numbers or strings. A value can also be empty. A string is distinguished from a number by enclosing it in double-quotes ("). An array is enclosed in square brackets ([]), with the elements in a row separated by commas and the rows separated by semi-colons (see examples below).
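One possible way to read such a line in Python is sketched below. It is not a reference parser: the function names are invented for this example, an empty value is mapped to None, and an array with a single row is returned as a flat list (so a 2D array that happens to have one row cannot be told apart from a 1D array). The keys and values in the usage lines are illustrative only.

```python
def split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double-quoted strings."""
    parts, current, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_scalar(token):
    """Parse a number, a double-quoted string, or an empty value (None)."""
    token = token.strip()
    if token == "":
        return None
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1]
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_value(text):
    """Parse a scalar or a bracketed array (rows split by ';', items by ',')."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1].strip()
        if inner == "":
            return []
        rows = [
            [parse_scalar(item) for item in split_outside_quotes(row, ",")]
            for row in split_outside_quotes(inner, ";")
            if row.strip()  # tolerate a trailing ';'
        ]
        return rows[0] if len(rows) == 1 else rows
    return parse_scalar(text)


def parse_meta_line(line):
    """Turn '# key = value' into a (key, value) pair."""
    body = line.lstrip()[1:]  # drop the leading '#'
    key, _, value = body.partition("=")
    return key.strip(), parse_value(value)


# Illustrative values only; they are not taken from a real atlas file.
times = parse_meta_line("# cohort_times = [100, 120, 140]")
grid = parse_meta_line("# embryospergene = [3, 5; 4, 6]")
```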

These are the property names currently used in the VirtualEmbryo file:

cohort_names (string array)
Names of the cohorts (time points) included in this VirtualEmbryo.
cohort_times (integer array)
Approximate time after egg deposition, in minutes, for each cohort. Note that this value is only a rough approximation that has not been carefully verified, since all staging was based purely on morphological features.
srclist (string array)
Names of all the images from which the VirtualEmbryo data is derived.
embryospergene (integer array)
2D array where each row corresponds to a time point and each column corresponds to a gene. Indicates the number of embryos averaged together to yield the data for that (gene, time) pair.
column_info (string array)
2D array in which each row gives info on a group of columns whose names start with the same prefix (see below for complete description).
column (string array)
Gives the list of column names. In the VirtualEmbryo these are typically a gene name followed by the time step (cohort number). The length of the column array specifies the number of data entries in each row (excluding the neighborhood relations).
nuclear_count (integer)
Number of nuclei in the file (i.e. number of rows in the comma-separated values file).

Additional description:

Column names

Each column has a name that specifies its meaning. There is a standard set of columns whose meaning the user is expected to know, and additional columns that are described in column_info. These additional columns include gene expression level measurements and local nuclear density measurements.

The standard set of columns used in VirtualEmbryo files is as follows:

Nucleus ID.
x__t, y__t, z__t
Coordinates of the center of mass of the nucleus, in microns at time step t.
Nx__t, Ny__t, Nz__t
Direction of the surface normal at the nucleus at time step t.

Column groups

The column_info string array gives information about a group of related data columns corresponding to a particular gene expression level over time. Each gene listed in column_info has one or more corresponding columns in the data array. A gene may have measurements associated with up to K time points where K is the length of the cohort_names array. For a gene, each time point for which data was collected has a corresponding column of the form columnname_subcolumn_timepoint. Currently the subcolumn isn't used in the VirtualEmbryo (only in PointClouds) so column names typically appear with two underscores, e.g. "bcd__6" would be bcd data for the sixth time point. Note that a gene may not have been measured at all time points, in which case those columns will not appear in the file.
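The naming scheme can be taken apart with a short Python helper. This is a sketch that assumes the time point is always written as a non-negative integer; the function name split_column_name is hypothetical, and names that do not follow the three-part pattern yield None.

```python
def split_column_name(name):
    """Split 'columnname_subcolumn_timepoint' into its three parts.

    Splitting from the right keeps any underscores inside the column
    name itself intact. Returns None if the name does not fit the pattern.
    """
    parts = name.rsplit("_", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    prefix, subcolumn, timepoint = parts
    return prefix, subcolumn, int(timepoint)


bcd = split_column_name("bcd__6")   # ("bcd", "", 6)
normal = split_column_name("Nx__2")  # ("Nx", "", 2)
```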

The entries in each row of column_info specify in order,

  1. the column name, which matches the prefix in the column array
  2. the expression type, currently one of "mRNA", "protein", ""
  3. the short gene name (e.g. "eve", "bcd", or "density")
  4. a long form of the name (e.g. "Bicoid protein", "Nuclear density")
  5. the data type, in VirtualEmbryos this is either "Gene Expression Average" or "Derived Morphology"
  6. preferred subcolumn, usually "" in VirtualEmbryos since they currently don't include sub-cellular localization data.

By convention, the column name is the same as the short gene name for mRNA stains (e.g. "bcd", giving data columns such as "bcd__6") and a P is appended to generate a unique column name for protein stains (e.g. "bcdP", giving "bcdP__6"). In general, with the exception of the standard columns described above, the column name need only be an arbitrary unique string, so applications should always use the entries in column_info to determine the actual contents of a column.
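A lookup along these lines could be written in Python as shown below. This is only a sketch: the field names of ColumnInfo are labels chosen for this example (the format itself only fixes the order of the six entries), and the sample rows are invented rather than copied from a real atlas file.

```python
from typing import NamedTuple


class ColumnInfo(NamedTuple):
    """One row of column_info; field names are this example's own labels."""
    name: str
    expression_type: str
    short_name: str
    long_name: str
    data_type: str
    preferred_subcolumn: str


def index_column_info(rows):
    """Map each column-name prefix to its column_info entry."""
    return {row[0]: ColumnInfo(*row) for row in rows}


def describe(column, info):
    """Look up a data column such as 'bcdP__6' by its prefix ('bcdP')."""
    prefix = column.rsplit("_", 2)[0]
    return info.get(prefix)


# Hypothetical column_info rows, for illustration only.
rows = [
    ["bcd", "mRNA", "bcd", "Bicoid mRNA", "Gene Expression Average", ""],
    ["bcdP", "protein", "bcd", "Bicoid protein", "Gene Expression Average", ""],
]
info = index_column_info(rows)
entry = describe("bcdP__6", info)
```

Looking the column up by prefix, rather than guessing from a trailing P, follows the advice above that the column name itself should be treated as an arbitrary unique string.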

Sample excerpts from a VirtualEmbryo file