July, 2011




1. Overview

OGRe stands for Overlap Genomic Regions. OGRe is a stand-alone script that was developed to be used within SeqGI.
  • OGRe allows you to intersect two sets of genomic features (X,Y) and find overlaps between them.
  • The overlaps are recorded for all features in set X, independently of an overlap being found or not.
Fjoin algorithm
The basis of OGRe algorithm is an adaptation of the original algorithm "Fjoin" (Richardson, 2006).
The Fjoin algorithm makes use of sorted lists of the starts and ends of all the intervals and scans every interval from left to right, more or less in parallel, outputting when an overlap is found. The Fjoin algorithm allows for the rapid search of overlapping features in a remarkably simple procedure.

2. Option summary

Usage: $ python OGRe_V0.2.py -x fileX -y fileY -o outputname [OPTIONS]

Options Description
  -h, --help shows the full menu of available options
  -x <file>, --fileX=<file> Complete path to file X. The overlaps are recorded for all features in set X, independently of an overlap being found or not. Typically file X is a gene list (BED)
  -y <file>, --fileY=<file> Complete path to file Y. Typically set Y is a list of reads (BED, BedGraph)
  -o <file>, --out=<file> Complete path to output filename
  -f <string>, --fun=<string> [optional] Intersection function could be nr of  overlapping features, or depth of coverage ("nover" or "depth", respectively). By default --fun=depth
  --tx=<string>, --typeX=<string> [optional] Type fileX (bedgraph, bed, bed12). Default is --tx=bed
  --ty=<string>, --typeY=<string> [optional] Type fileY (bedgraph, bed, bed12). Default is  --ty=bedgraph
  --to=<string>, --typeO=<string> [optional] Type of output (overlap or sliding). Default is --to=overlap
  --sx, --sortX  [optional] Specify if need to sort fileX
  --sy, --sortY         [optional] Specify if need to sort fileY
  -s, --enforcestrand  [optional] Only report overlaps on the same strand. By default, strand info is ignored. Only "+" and "-" strands are valid, any other symbol in X or Y will be skipped.
  -b <string>, --bins=<string> Sliding window size. Only needed if --to=sliding. Needs to be specified together with "--btype". For example --bins=10 and --btype=nuc corresponds to a sliding window of 10 nucleotides. Default is None.
  --bt=<string>, --btype=<string> Type of bins. Only needed if --to=sliding. Needs to be specified together with "--bins". For example --bins=10 and --btype=perc corresponds to a sliding window of 10% of the total size of the genomic feature. Default is None.

3.1. Default behaviour (--fun=depth --to=overlap)

By default, OGRe assumes that the input file X is a BED file and file Y a BedGraph file. Other options assumed by default are the intersection function (--fun=depth) and  type of output (--to=overlap).

For example (--to=overlap):
$ cat X.bed
chr1    100    200      geneA
chr1    1000  2000    geneB

$ cat Y.bedgraph
chr1    150    250    3

$ python OGRe_V0.2.py -x X.bed -y Y.bedgraph -o results.txt

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#ID      Chr      Start     End   Strand   WindowLength   Counts    AvCounts
geneA    chr1    100       200    None    100                     150          1.5
geneB    chr1    1000     2000  None    1000                    0             0.0

3.2. Using --fun=nover and --fun=depth (--to=overlap --ty=bed)

Each nucleotide from the read that is overlapped contributes to one count. This type of score is also known as the "depth of coverage" or "breadth of the overlap".

--fun=nover
the overlap is quantified considering considering uniquely the start position of each feature Y that overlaps with each feature X.

--fun=depth
In the following example, we should see the same result as in the previous example. The only difference is that file Y is in the "BED" format now:
$ cat X.bed
chr1    100    200      geneA
chr1    1000  2000    geneB

$ cat Y.bed
chr1    150    250 
chr1    150    250
chr1    150    250

$ python OGRe_V0.2.py -x X.bed -y Y.bedgraph -o results.txt --ty=bed --fun=depth

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#ID      Chr      Start     End   Strand   WindowLength   Counts    AvCounts
geneA    chr1    100       200    None    100                     150          1.5
geneB    chr1    1000     2000  None    1000                    0             0.0

In the example above we have used --fun=depth to compute the depth of coverage, or the breadth of the overlap, for each gene. The option --fun=nover gives the number of reads overlapping each gene.

--fun=nover
$ cat X.bed
chr1    100    200      geneA
chr1    1000  2000    geneB

$ cat Y.bed
chr1    150    250 
chr1    150    250
chr1    150    250

$ python OGRe_V0.2.py -x X.bed -y Y.bedgraph -o results.txt --ty=bed --fun=nover

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#ID      Chr      Start     End   Strand   WindowLength   Counts    AvCounts
geneA    chr1    100       200    None    100                      3             0.03
geneB    chr1    1000     2000  None    1000                    0             0.

3.3. Difference between overlap (--to=overlap) and sliding (--to=sliding) 

The overlap between feature X and Y is calculated using the total length of a genomic interval in X.

--to=sliding
The "sliding" method, allows the user to explore the pattern of the ChIP signal at the genomic intervals in feature-set X. The distribution profile is calculated by using a sliding window that scans each genomic interval defined in X. when --to=sliding / --typeO=sliding the intersection function is always --fun=depth. The output is given in terms of the average depth signal in each sliding window.

For example (--to=sliding -b 20 --btype=perc):
$ cat X.bed
chr1    100    200     geneA
chr1    1000  2000   geneB

$ cat Y.bedgraph
chr1    150    250    3

$ python OGRe_V0.2.py -x X.bed -y Y.bedgraph -o results.txt --to=sliding -b 20 --btype=perc

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#Reporting binned windows of 20.0% each
geneA    chr1    100     200       None    0.0     0.0     1.5     3.0     3.0
geneB    chr1    1000    2000    None    0.0     0.0     0.0     0.0     0

In the example above we have defined the sliding window as  percentage (-b 20 --btype=perc). The sliding window was chosen to be 20% of the total length of each feature X.

For example (--to=sliding -b 25 --btype=nuc)
$ cat X.bed
chr1    100    200     geneA

$ cat Y.bedgraph
chr1    150    250    3

$ python OGRe_V0.2.py -x X.bed -y Y.bedgraph -o results.txt --to=sliding -b 25 --btype=nuc

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#Reporting binned windows of 25.0nuc each
geneA    chr1    100     200     None    0.0     0.0     3.0     3.0

In the example above we have used a sliding window of 25 nucleotides (-b 25 --btype=nuc). Since the first feature X (chr1,100,200) is 100nuc long  we get 4 sliding windows in the output.

3.5. Enforcing strand (-s)

By default, OGRe will report any overlaps independently of the features X,Y being on the same or opposite strands. To only allow overlaps on the same strand the "-s" option is used.
Since the BedGraph format does not carry strand information, this option is not valid when using Bedgraph files. Only "+" and "-" strands are valid, any other symbol in X or Y will be skipped (i.e. they are not considered in any calculation). 

For example (note that geneB in the negative strand does not overlap with any Y feature in the positive):
$ cat X.bed
chr1    100    200    geneA    0   +
chr1    100    200    geneB    0    -

$ cat Y.bedgraph
chr1    150    250    0    0    +
chr1    150    250    0    0    +
chr1    150    250    0    0    +

$ python OGRe_V0.2.py -x X.bed -y Y.bed -o results.txt --to=overlap --tx=bed --ty=bed --fun=nover -s
$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bed
#typeX: bed, typeY: bed
#intersection function: nover, enforcestrand: True
#ID     Chr     Start   End     Strand  WindowLength    Counts  AvCounts
geneA   chr1    100     200    +       100     3       0.03
geneB   chr1    100     200     -        100     0       0.0

4. Input file formats

4.1. BedGraph

BedGraph is tabular format developed for use with the UCSC genome browser (see http://genome.ucsc.edu/FAQ/FAQformat#format1.8). Chromosome, start, end coordinates and a coverage value are specified in the following format:
chromosome    start    end    coveragevalue1
chromosome    start    end    coveragevalue2
For ChIP-Seq and RNA-Seq datasets, the coverage value corresponds to the depth of coverage in the correspondent genomic position.

4.2. BED

BED is tabular format also developed for use with the UCSC genome browser (see http://genome.ucsc.edu/FAQ/FAQformat#format1). The first three fields are mandatory and consist of chromosome, start, end. If "BED" is specified as the file format type, SeqGI will make use of the first three fields. strand (position 6) and ID (position 4) will only be used if available. All the other fields will be ignored by SeqGI.

BED3 format: chromosome    start    end
BED4 format: chromosome    start    end    ID
BED5 format: chromosome    start    end    ID    score
BED6 format: chromosome    start    end    ID    score    strand

4.3. Important requirements of input files (sorting, headers, positive/negative values, etc)

>> Files have to be sorted by start position

It is mandatory that the genomic coordinates within the input files are sorted by start position to run OGRe script. If the files are not sorted specify --sortx and --sorty.The internal sorting is memory intensive as it depends on the loading of the complete file into the memory. One way to go around this issue is to use previously sorted files (using command line tools such as the ?sort? command in Unix).

E.g. (sorting by start position in UNIX)
sort -k2n X.bed -o X.bed
sort -k2n Y.bed -o Y.bed

To prevent mistakes in the calculation of the overlaps, SeqGI checks if the genomic intervals are in ascending order as it is running.

>> X,Y features must contain start > end and only positive positions

OGRe will stop if any feature X or Y has negative start or end positions, or if the start position is greater then the end position.

>> BedGraph features should contain only numeric values in the forth column 

BedGraph file contains count-type data for each feature which is encoded on 4th column (see http://genome.ucsc.edu/FAQ/FAQformat#format1.8 for details). OGRe will stop if non-numeric data is found in the 4th column.

>> Headers and commented lines (#) are allowed

OGRe will skip any line that starts with "track", "browser" or the comment character "#".

>> Input and output data are zero-based


OGRe, just like BEDtools and SAMtools (but unlike Genomic Ranges / IRanges), assumes that the coordinates in each genomic feature have zero-based (0-based) start positions. Although this representation can be a little bit less intuitive, it facilitates coordinate arithmetic (see http://genome.ucsc.edu/FAQ/FAQtracks#tracks1 for more details). Similarly, any output returned by OGRe will be 0-based.  

>> If "-s" is chosen, only "+" and "-" strands are valid, any other symbol in X or Y will be skipped

When using "enforce strand" (option -s), OGRe will report overlaps only if they are on the same strand. In this case, only "+" and "-" strands are valid. OGRe will skip features which contain other symbols (i.e. the corresponding coordinates will not be considered in any calculation).

5. Intersecting genomic features: calculation of sliding profiles

5.1. Overlap vs. Sliding method

The initial step in a workflow within SeqGI is the computation of overlaps between two feature sets (X,Y). The computation of overlaps is mandatory to continue analysis under SeqGI. Typically, SeqGI would be used to calculate how many reads from a given ChIP/RNA sample overlap with each gene in the genome.
  • The overlaps are recorded for all features in set X, independently of an overlap being found or not.  
  • Two features (X,Y) are "overlapping" if they have ate least one base pair in common
Overlapping features can be calculated using either the "Overlaps" or "Sliding" methods:

For example:
________________________________________________________________________________________________
File X:               ====================                            =========                          ===============

File Y:                                  ======      ====    =====              ====

Result:             ====================                             =========                          ===============
                       score ~ 2 overlapping Y features             score ~ 1 overlapping Y feature                  score = 0
__________________________________________________________________________________________________

Overlap format

In the “Overlap” type of output, the overlap between feature X and Y is calculated using the total length of a genomic interval in X.

The format of “Overlap” files contain 8 tab delimited columns:
  • ID, chromosome, start, end, strand derived from file 1 (feature set X)
  • Window length: size of the window (end – start)
  • Counts: calculated as either the number of overlaps ("nover") or depth of coverage ("depth").
  • Average counts: counts normalised to the window length
Sliding format

The “Sliding” format, allows the user to explore the pattern of the ChIP signal at the genomic intervals of feature-set X. The distribution profile is calculated by using a sliding window that scans each genomic interval defined in feature-set X. If the genomic intervals in feature-set X contain strand information, then the sliding windows are defined for all genomic intervals in the same orientation (5' to 3'), otherwise left to right would be used. See the next section 5.2. Calculation of sliding profiles for more details.

The format of “Sliding” files consist of:
  • Five initial feature descriptors (as in *.Overlap): ID, chromosome, start, end, strand derived from file 1 (feature set X)
  • Columns after the 7th column correspond to counts normalised to the binned window length.



File sorting:
  • It is mandatory that the genomic coordinates within the input files are sorted by start position to run OGRe script. If the files are not sorted an option to “sort file X/Y” needs to be selected to sort the files internally before scanning the files for overlaps. The internal sorting is memory intensive as it depends on the loading of the complete file into the memory. One way to go around this issue is to use previously sorted files (using command line tools such as the “sort” command in Unix). To prevent mistakes in the calculation of the overlaps, SeqGI checks if the genomic intervals are in ascending order as it is running.

5.2. Calculation of sliding profiles

When building sliding files, SeqGI computes a binned complete coverage vector of each feature X. The binned coverage vector is calculated using a moving sliding window which computes the average depth of coverage at each bin.
  • The length of the sliding window can be either a fixed size (in bps) or a percentage of the total size of feature X. 
  • Defining the sliding window as a percentage is particularly important when the genomic intervals are not all of the same size, thus the percentage is used to scale each interval to a common length.

Consider the following example:

$ cat X.bed
chr1    100    200
chr1    1000  2000

$ cat Y.bedgraph
chr1    150    250    3

$ python OGRe.py -x X.bed -y Y.bedgraph -o results.txt --to=sliding -b 20 --btype=perc

$cat results.txt

#Created by Python OGRe (SeqGI format)
#fileX: X.bed
#fileY: Y.bedgraph
#typeX: bed, typeY: bedgraph
#intersection function: depth, enforcestrand: False
#Reporting binned windows of 20.0% each
None    chr1    100     200       None    0.0     0.0     1.5     3.0     3.0
None    chr1    1000    2000    None     0.0     0.0     0.0     0.0     0.0

In the example above we defined the sliding window as  percentage (-b 20 --btype=perc). The sliding window was chosen to be 20% of the total length of each feature X.

The first feature X (chr1,100,200) is 100bp long, thus each sliding window has 20 nuc.
In the "result.txt" files, OGRe reports the depth of coverage in each sliding window normalised to the sliding window length:

Complete coverage vector:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0 0 0  3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Sliding file:
                                       window1       window2      window3      window4      window5
window coordinates:     (100-120)    (120-140)    (140-160)    (160-180)    (180-200)
Window length:             20                 20                20                20                20
Raw counts:                  0                    0                 30                60                60
Average Counts:           0                    0                1.5               3.0              3.0

5.3. Important concepts regarding the -b/--bins and -bt/--btype options for the calculation of sliding profiles

>> Coverage vector is reversed for negative stranded (X) features

If the genomic intervals in feature-set X contain strand information, then the sliding windows are defined for all genomic intervals in the same orientation (5' to 3'), otherwise left to right would be used. In other words, if a feature X is in the reverse orientation (strand == "-") the coverage vector is reversed.

>> If the bin size (-b) is specified in nucleotides (--btype=nuc), all features X have to be of the same length.

When the "sliding" mode is chosen, OGRe will firstly calculate the coverage vector for each feature X at the nucleotide resolution. The complete coverage vector is then subset into bins of a user-defined length (option -b) to originate an aggregate vector. Each feature X is then characterized by an aggregate vector.
To continue the analysis within SeqGI it is required that the length of the resulting aggregate vector is the same between all features X.

When the bin size (-b) is given in nucleotides (--btype=nuc), OGRe will use this value to represent the size of each bin. Therefore all features in X have to have the same length, otherwise the resulting aggregate vector would have different lengths for different windows.

>> If --btype=perc, the bin size has to be a multiple 100

When the bin size (-b) is given in percentage (--btype=perc), OGRe will use this percentage value to calculate the size of the sliding windows for each feature X. In this case, the number of sliding windows used to summarize each feature X is the same for all features, but the size of the sliding windows is calculated as a function of the percentage of the total size of each feature. Therefore, different features X can have different sizes. For example `-b 1 --btype=perc` will correspond to 100 windows of 1% each; `-b 5 --btype=perc` will correspond to 20 windows of 5% each. However, in this sense, it is required that the bin size is a multiple of 100. For example -b 3 --btype=perc is not allowed.

>> If --btype=perc OGRe will skip features X if the entire length of the feature (start to end) is less then the number of sliding windows

When the bin size (-b) is given in percentage (--btype=perc), OGRe will use this percentage value to calculate the size of the sliding windows (i.e. the bin size) for each feature X. However, since the bin size corresponds to a "number of nucleotides" it cannot be smaller then 1. For example, if `-b 1 --btype=perc` is chosen, each feature X is characterized by 100 bins (100 bins of 1% each) . The length of each bin corresponds to 1% of the total length of the feature X. Therefore in this case, OGRe would skip any feature X which is smaller then 100 bps. Similarly, if -b 5 --btype=perc` is chosen (20 windows of 5% each) OGRe would skip features X shorter then 20 nucleotides.

6. Intersection function: Depth vs Nr of features

The user can quantify the overlap between two genomic intervals (X,Y) using between two options:

  • --fun=nover: In this case the overlap is quantified considering considering uniquely the start position of each feature Y that overlaps with each feature X.
  • --fun=depth: In this case each nucleotide from the read that is overlapped contributes to one count. This type of score is also known as the "depth of coverage" or "breadth of the overlap".
For example:

Intersection function

If sequencing reads are imputed in the BedGraph format, the only intersection function available is the “depth of coverage” (--fun=depth). This is because the BedGraph format does not have single-read information, but stores count-type data.