In Vivo DNA Binding

Browse ChIP/chip Data
Browse ChIP/seq Data
Data Releases

Overview:
   Summary
   Protocols

Software:
   TiMAT
   Secondary
      Analysis Scripts

   Berkeley
      Quantitative
      Genome Browser

 Download:
   Secondary Analysis Scripts

Secondary Analysis Scripts

This page describes the scripts written to perform initial analysis on ChIP chip data downstream of the TiMAT output.

  1. Introduction
  2. Base Composition Analysis
    • 2.1 base_composition_across_peaks.pl
    • 2.2 base_composition_down_rank_list.pl
    • 2.3 lower_case_coding_bases.pl
  3. Binding Site Enrichment Analysis
    • 3.1 enrichment_across_peaks.pl
    • 3.2 enrichment_down_rank_list.pl
    • 3.3 distribution_of_number_of_sites.pl
    • 3.4 pwm_cm_to_fm.pl
    • 3.5 pwm_fm_to_aln.pl
    • 3.6 seqlogo.pl
  4. GFF Processing
    • 4.1 get_seq_from_peaks_gff_all.pl
    • 4.2 gff_to_bed.pl
    • 4.3 gff_to_peaks_and_intervals.pl
  5. GO Term Enrichment
    • 5.1 go_enrichment_down_rank_list.pl
  6. Factor Overlap Tables
    • 6.1 interval_tables.pl
    • 6.2 peaks_overlap_1Percent_25Percent.pl
  7. Distance To Genes Analysis
    • 7.1 median_distance_to_closest_genes.pl
  8. PhastCons Analysis
    • 8.1 phastcons_across_peaks.pl
    • 8.2 phastcons_down_rank_list.pl
  9. Peak Score Distributions
    • 9.1 distribution_of_peak_scores.pl
  10. Drawing Figures
    • 10.1 plot_figures.pl

Introduction

These scripts are all written in Perl. Obscure module dependencies have been removed, but the scripts do require several modules such as File::Basename and optionally Smart::Comments. They are written to run with the minimum user interaction and command line options so as to produce figures easily and reproducibly. Many file locations and options are hardcoded. It is assumed that the data is in the following structure:

bcd_1_012505TestResults
cad_1_020107TestResults
data
gt_2_020107TestResults
hb_1_012505TestResults
kni_1_092706TestResults
kr_1_113005TestResults
matrices
processing_scripts

The following example are designed to run from the processing_scripts directory. Matrices in the matrices directory are named with the factor name, e.g bcd.cm or kr.cm. The data folder contains background sequence files, annotations etc. The scripts also heavily rely on the nomenclature of the file naming, from which they can determine factor, false discovery rate and false discovery test method etc. Naming of output files is automatic and is designed to provide one example for each type of figure for each data folder. As such, trying to generate the same figure for multiple files in the same directory will simply keep over writing the figure image, or in some cases add extra lines to the plots. They were written specifically for generating the images for the figures on the BDTNP web site and are distinct from scripts used for analysis presented in our published papers. Most scripts are designed to be run from shell scripts to iterate through file lists, rather than passing multiple files to the script directly, allowing easier parallel processing via a cluster. A number of the scripts act as wrappers for other programs such as seqlogo, patser and R, the locations of these programs will need to be hard coded into the scripts as needed.

The scripts are provided as is, with no support.

Base Composition Analysis

  • base_composition_across_peaks.pl

    This script calculates the average GC composition down the rank list in specified window sizes. It only returns the composition for upper case bases, so you can mask out features such as coding bases by using lower case.

    USAGE: WIN_SIZE STEP INFILE(S)

    WIN_SIZE = Number of bases to average over across the peaks e.g. 100

    STEP = How much to move the window over each step e.g. 100

    INFILE(S)= FastA format sequence file(s) ../*/*peaks.gff-10000.fas

    Runs a directory at a time so shell script to run on all at once. e.g.

    for f in ../*TestResults/*-sym-1_primary_peaks.gff-10000_lowercase_coding.fas
    do ~/scripts/qsub_this.pl fast "./base_composition_across_peaks.pl 100 100 ${f}";
    done

    You can visualize the output of the program like this

    ./plot_figures.pl Position 'Percentage GC' 'base composition across peaks' n ../*TestResults/*-basecomp_across_peaks.xls
    

  • base_composition_down_rank_list.pl

    This script calculates the average GC composition down the rank list in specified window sizes. It only return the composition for upper case bases, so you can mask out features such as coding bases.

    USAGE: RANK_WIN_SIZE INFILE(S)

    RANK_WIN_SIZE = How many sequences to have in a cohort e.g. 100

    INFILE(S) = FastA format files*.fas

    e.g.

    for f in ../*TestResults/*-sym-25_primary_peaks.gff-500_lowercase_coding.fas
    do ~/scripts/qsub_this.pl fast "./base_composition_down_rank_list.pl 100 ${f}";
    done
    

    You can visualize the output of the program like this

    ./plot_figures.pl Rank 'Percentage GC' 'base composition down ranks' n ../*TestResults/*-basecomp_down_ranks.xls
    

  • lower_case_coding_bases.pl

    This script makes bases that overlap with coding sequences lowercase. It uses a hardcoded file location, which will need to be changed to use other annotation definitions.

    USAGE: SEQ_FILE

    SEQ_FILE : e.g. ../*TestResults/*.fas

    for f in ../*TestResults/*.fas;
    do ~/scripts/qsub_this.pl fast "./lower_case_coding_bases.pl ${f}";
    done
    

Binding Site Enrichment Analysis

  • enrichment_across_peaks.pl

    This script calculates the enrichment for a matrix in windows across sequences, and is designed to calculate the enrichment across ChIP/chip peaks. It retrieves the matrix based on the file name. The background is set to all non-coding DNA. The location of the matrices are background sequences are hardcoded.

    USAGE: SAMPLE CUTOFF WIN_SIZE

    SAMPLE - Input fasta file e.g bcd_10kb.fas CUTOFF - patser p-value cutoff e.g 0.001 WIN_SIZE - window size, e.g. 100bp

    for f in ../*TestResults/*-sym-1_primary_peaks.gff-10000.fas;
    do ~/scripts/qsub_this.pl fast "./enrichment_across_peaks.pl ${f} 0.001 100";
    done
    

  • enrichment_down_rank_list.pl

    This script calculates the enrichment for a matrix in windows down a rank list, It retrieves the matrix based on the file name, the background is set to all non-coding DNA.

    USAGE: SAMPLE CUTOFF WIN_SIZE

    SAMPLE - Input fasta file e.g bcd_10kb.fas CUTOFF - patser p-value cutoff e.g 0.001 WIN_SIZE - window size, e.g. 100peaks

    for f in ../*TestResults/*-sym-25_primary_peaks.gff-500.fas;
    do ~/scripts/qsub_this.pl fast "./enrichment_down_rank_list.pl ${f} 0.001 100";
    done
    

  • distribution_of_number_of_sites.pl

    This script returns the distribution of the number of factor recognition sites per sequence for each of the infiles compared to random intergenic sequences. It returns the proportion of sequences with X number of sites in them. It assumed the INFILE sequences are all 500bp in length, as it uses a hardcoded background set of 500bp regions.

    USAGE: CUTOFF INFILE

    for f in ../*TestResults/*-sym-1_primary_peaks.gff-500.fas;
    do ./distribution_of_number_of_sites.pl 0.001 ${f};
    done 
    

  • pwm_cm_to_fm.pl

    This script converts counts matrixes to frequency matrices. It assumes a pseudocount of 1 and a prior frequency of 0.25.

    e.g. FROM:

    A|    9  11  49  51   0   1   1   4
    C|   19   3   0   0   0  45  25  16
    T|   18  36   0   0  34   5  21  10
    G|    5   1   2   0  17   0   4  21
    

    TO:

    0.1788  0.3692  0.1000  0.3519
    0.2173  0.0615  0.0231  0.6981
    0.9481  0.0038  0.0423  0.0058
    0.9865  0.0038  0.0038  0.0058
    0.0058  0.0038  0.3308  0.6596
    0.0250  0.8692  0.0038  0.1019
    0.0250  0.4846  0.0808  0.4096
    0.0827  0.3115  0.4077  0.1981
    

    USAGE: INFILE(S) e.g. *.cm

  • pwm_fm_to_aln.pl

    This script converts frequency matrices into alignment, which can then be used with seqlogo to generate motifs.

    USAGE: pwm_fm_to_aln.pl INFILE(S)

    INFILE(S) e.g. *.fm

  • seqlogo.pl

    This is a wrapper for the weblogo standalone version, seqlogo. Simply run on alignments to generate PNG format logos.

    It forms part of this pipeline.

    pwm_cm_to_fm.pl *.cm
    pwm_fm_to_aln.pl *.fm
    seqlogo.pl *.fas
    for f in *.png;do convert ${f} -quality 100 $(basename ${f} .fm.fas.png).jpg;done
    

GFF Processing

  • get_seq_from_peaks_gff_all.pl

    Gets sequence from the dmel-all-chromosome-r4.3.fasta from a peak GFF file and exports them to a fasta file, plus and minus the specified amount of sequence. The genome sequence location is hardcoded.

    USAGE: SIZE PEAK_FILES

            (bp either side) (e.g. ../*/*peaks*.gff)
    
    ./get_seq_from_peaks_gff_all.pl 5000 ../*TestResults/*-1_primary_peaks.gff
    ./get_seq_from_peaks_gff_all.pl 250 ../*TestResults/*_primary_peaks.gff
    

  • gff_to_bed.pl

    This script converts gff files into BED files, which can be loaded in to UCSC browser.

    USAGE: INFILE(S)

    ./gff_to_bed.pl ../*TestResults/*-sym-*_primary_peaks.gff
    ./gff_to_bed.pl ../*TestResults/*-sym-*_intervals.gff
    

  • gff_to_peaks_and_intervals.pl

    This scripts takes the TiMAT output GFF3 file and generates peaks, primary peaks and interval files from them.

    In the format:

        * INFILE_peaks.gff
        * INFILE_primary_peaks.gff
        * INFILE_intervals.gff 
    

    USAGE: INFILE(S)

    ./gff_to_peaks_and_intervals.pl ../*TestResults/*.gff3

GO Term Enrichment

  • go_enrichment_down_rank_list.pl

    This script takes interval tables and calculates the enrichment of specified GO terms down the rank list. It returns the -log(P) value from the hypergeometric test.

    USAGE: WIN_SIZE INFILE(S)

    Infiles should the the output of interval_tables.pl

    for f in ../*TestResults/*-sym-25_table.txt ;
    do ~/scripts/qsub_this.pl fast "./go_enrichment_down_rank_list.pl 100 ${f}";
    done
    

Factor Overlap Tables

  • interval_tables.pl

    This script is designed to return a table showing for each interval, it's location, score closest peak etc.

    USAGE: INFILE(S) - GFF3 TiMAT output file(s)

    for f in ../*TestResults/*-sym-*.gff3;
    do ~/scripts/qsub_this.pl fast "./interval_tables.pl ${f}";
    done
    

  • peaks_overlap_1Percent_25Percent.pl

    This script is designed to generate a table of the number of primary peaks and the overlap between antibody repeats.

    USAGE: Root_Data_Directory

    ./peaks_overlap_1Percent_25Percent.pl ../ >overlap_table.txt
    

Distance To Genes Analysis

  • median_distance_to_closest_genes.pl

    This script calculates the median distance from peaks to genes. It uses all genes, transcribed genes and patterned genes.

    USAGE: WIN_SIZE SAMPLE(S)

    WIN_SIZE : e.g. 250 SAMPLES : e.g bcd1*.gff

    for f in ../*TestResults/*-sym-25_primary_peaks.gff;
    do ~/scripts/qsub_this.pl fast "./median_distance_to_closest_genes.pl 100 ${f}";
    done
    

PhastCons Analysis

  • phastcons_across_peaks.pl

    This script is designed to return average Phastcons values around peaks. It uses parsed phastcons files in sgr format.

    USAGE: SEQUENCE WIN_SIZE INFILE

    SEQUENCE = bp +/- peaks to measure WIN_SIZE = size windows across peaks INFILE = ../bcd_1_012505TestResults/bcd_1_012505-sym-25_primary_peaks.gff

    for f in ../*TestResults/*-sym-1_primary_peaks.gff;
    do ~/scripts/qsub_this.pl fast "./phastcons_across_peaks.pl 5000 100 ${f}";
    done
    

  • phastcons_down_rank_list.pl

    This script is designed to return average PhastCons values around peaks, down the rank list. It uses parsed phastcons files, in sgr format.

    USAGE: SEQUENCE WIN_SIZE INFILE

    SEQUENCE = bp +/- peaks to measure WIN_SIZE = size windows down rank list INFILE = ../bcd_1_012505TestResults/bcd_1_012505-sym-25_primary_peaks.gff

    for f in ../*TestResults/*-sym-25_primary_peaks.gff;
    do ~/scripts/qsub_this.pl fast "./phastcons_down_rank_list.pl 250 100 ${f}";
    done
    

Peak Score Distributions

  • distribution_of_peak_scores.pl

    This script draws a histogram of the distribution of the peak scores. It generates images directly without having to run the plot_figures.pl script.

    ./distribution_of_peak_scores.pl ../*TestResults/*-sym-*_primary_peaks.gff
    

Drawing Figures

  • plot_figures.pl

    This script plots figures using R, and makes jpegs of 400x400px by default, but this can easily be changed to any size or image format.

    USAGE: X-Axis Y-Axis Title Legend Dotted_1%FDR INFILE(S)

    ./plot_figures.pl Position 'Percentage GC' 'base composition across peaks' n n ../*TestResults/*-basecomp_across_peaks.xls
    ./plot_figures.pl Rank 'Percentage GC' 'base composition down ranks' n Y ../*TestResults/*-basecomp_down_ranks.xls
    ./plot_figures.pl Rank 'Median distance to closest gene' 'median distance to genes' y y ../*TestResults/*distance_closest_genes.xls
    ./plot_figures.pl Position Enrichment 'PWM enrichment across peaks' n n ../*TestResults/*enrichment_across_peaks.xls
    ./plot_figures.pl Rank Enrichment 'PWM enrichment down ranks' n y ../*TestResults/*enrichment_down_ranks.xls
    ./plot_figures.pl Rank 'Mean PhastCons Score' 'mean PhastCons down ranks' n y ../*TestResults/*-phastcons_down_rank_list.xls
    ./plot_figures.pl Position 'Mean PhastCons Score' 'mean PhastCons across peaks' n n ../*TestResults/*-phastcons_across_peaks.xls
    ./plot_figures.pl Rank 'Enrichment -log(p)' 'GO term enrichment (-logP)' y y ../*TestResults/*go_enrichment_down_ranks.xls
    ./plot_figures.pl 'Number of Sites' 'Percentage of Peaks' 'distribution of number of sites' Y n ../*TestResults/*distribution_of_sites.xls
    ./plot_figures.pl Rank 'Percentage of Peaks' 'genomic location' Y y ../*TestResults/*-genomic_locations.xls