July, 2011




1. Overview

The GenomicIntervals2BED.py script converts a text-based input file into the standard BED file format. It also lets the user define regions relative to coordinates within the file (relative to TSS, TES, peak maximum position, etc.).
  • GenomicIntervals2BED.py allows you to convert an unstructured text-based file into the standard BED format.
    • Columns can be organized in any order
    • File delimiters can be tab, space, comma or semicolon
  • Genomic regions of interest can be defined internally. Windows can be defined relative to:
    • Gene-related features (TSS, TES)
    • Any column in the input file
GenomicIntervals2BED.py is an important module of SeqGI because it gives flexibility in the type of files that can be used.

2. Option summary

Usage: $ python GenomicIntervals2BED.py -f file [OPTIONS] -o outputname

Options:
  -h, --help
      Show the full menu of available options.
  -f <file>, --fname=<file>
      Complete path to the input file: a text-based file with columns organized in any order. Mandatory fields are chromosome, start, end (in any order).
  -o <file>, --oname=<file>
      Complete path to the output file.
  -w <string>, --window=<string>
      [optional] Coordinates for a window relative to the TSS, TES, gene body or any column in the input file. Default is None.
      E.g. promoter coordinates centered at +-1kb of the TSS:
        --window=TSS:-1000:1000
      Other examples:
        --window=TES:0:2000
        --window=GeneBody:-1000:1000
      A 500bp window flanking the coordinates in column 2:
        --window=2:-500:500
  -t <string>, --sep=<string>
      [optional] Separator of the file (tab, comma, semicolon, space). Default is --sep=Tab
  -c <string>, --columns=<string>
      [optional] Positions of the chr,start,end,strand columns of the file, separated by commas. Strand is optional but needed if --wstrand is used. Default is --columns=1,2,3
  -i <string>, --cID=<string>
      [optional] Column number of the ID column (e.g. --cID=1). Default is None
  -s, --wstrand
      [optional] Define window coordinates based on strand. This option takes into account the orientation of the feature to define "upstream" and "downstream" coordinates. It is particularly important when defining asymmetrical windows. By default GenomeIntervals2BED does not take strand information into account.
  --onebased
      Specify this option if your input file has 1-based start positions. For instance, the GFF format from Ensembl uses 1-based coordinates for both the start and the end positions, while UCSC formats (such as BED files, or other files downloaded from the Table Browser) are 0-based at the start and 1-based at the end positions. By default, this script treats start positions as 0-based.
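
The effect of --onebased can be illustrated with a minimal sketch. It assumes that the option simply shifts a 1-based start to the 0-based BED convention and leaves the end position unchanged. The function name to_zero_based_start is hypothetical and is not taken from the script.

```python
def to_zero_based_start(start, end, onebased=False):
    """Return (start, end) in the BED convention: 0-based start, 1-based end.

    Assumption: a 1-based start only needs to be decreased by one,
    while the end coordinate is already in the expected form.
    """
    if onebased:
        return start - 1, end
    return start, end


# A GFF-style interval covering bases 100..200 (both 1-based, inclusive)
print(to_zero_based_start(100, 200, onebased=True))
# A UCSC-style interval is returned unchanged
print(to_zero_based_start(99, 200))
```

Both calls describe the same 101 bases, which is why mixing the two conventions without --onebased would shift every start position by one base.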

3. Example usage

3.1. Convert any (yes! ANY) text file to BED: default behaviour (--window=None)

To convert a text-based file whose columns are organized in any order into BED format, you need to specify the column positions where the coordinates (chromosome, start, end), the strand and the row IDs can be found.

The option --columns=2,3,4 specifies the column positions of the chromosome, start, end information:

For example (--columns=2,3,4):
$ head mm9_canonical_chr1.txt

#name            chrom txStart        txEnd    strand kgXref.refseq
uc007aeu.1    chr1    3204562    3661579    -    NM_001011874
uc007aex.2    chr1    4333587    4350395    -    NM_011283
uc007aez.1    chr1    4481008    4486494    -    NM_011441
uc007aff.2     chr1    4763278    4775807    -    NM_001177658
uc007afh.1    chr1    4797973    4836816    +   NM_008866
uc007afi.2     chr1    4847774    4887990    +   NM_011541
uc007afl.2     chr1    4899656    5060366    -    NM_001177795
uc007afn.1    chr1    5073253    5152630    +   NM_133826
uc007afo.1    chr1    5578573    5592947    +   NM_011011

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4 -o results.txt
GenomeIntervals2BED6.py
Message: Starting SeqGI GenomeIntervals2BED6 module
There are 0 warning(s)
Created file: results.txt
End date and time: 2014-06-27 15:17:36.072012
Elapsed time: 0:00:00.016014
Done!
$ head results.txt

chr1    3204562    3661579    None    0    *
chr1    4333587    4350395    None    0    *
chr1    4481008    4486494    None    0    *
chr1    4763278    4775807    None    0    *
chr1    4797973    4836816    None    0    *
chr1    4847774    4887990    None    0    *
chr1    4899656    5060366    None    0    *
chr1    5073253    5152630    None    0    *
chr1    5578573    5592947    None    0    *
chr1    5903787    5907479    None    0    *

The option --columns=2,3,4,5 specifies the column positions of the chromosome, start, end, strand information:

For example (--columns=2,3,4,5):
$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 -o results.txt

$ head results.txt

chr1    3204562    3661579    None    0    -
chr1    4333587    4350395    None    0    -
chr1    4481008    4486494    None    0    -
chr1    4763278    4775807    None    0    -
chr1    4797973    4836816    None    0    +
chr1    4847774    4887990    None    0    +
chr1    4899656    5060366    None    0    -
chr1    5073253    5152630    None    0    +
chr1    5578573    5592947    None    0    +
chr1    5903787    5907479    None    0    -

The option --cID=6 specifies the column position of the IDs.
When this option is used, the ID column in the output BED file (column 4) is populated with the values of column 6 of the input file:

For example (--cID=6):
$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 --cID=6 -o results.txt

$ head results.txt

chr1    3204562    3661579    NM_001011874    0    -
chr1    4333587    4350395    NM_011283    0    -
chr1    4481008    4486494    NM_011441    0    -
chr1    4763278    4775807    NM_001177658    0    -
chr1    4797973    4836816    NM_008866    0    +
chr1    4847774    4887990    NM_011541    0    +
chr1    4899656    5060366    NM_001177795    0    -
chr1    5073253    5152630    NM_133826    0    +
chr1    5578573    5592947    NM_011011    0    +
chr1    5903787    5907479    NM_010342    0    -
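
The default conversion shown above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's actual code: the function name to_bed6 is hypothetical, the input is assumed to be tab-separated, and the placeholder values (None for a missing ID, 0 for the score, * for a missing strand) are copied from the example output.

```python
def to_bed6(line, columns=(2, 3, 4), cid=None, sep="\t"):
    """Build one BED6 line from one delimited text line.

    columns: 1-based positions of chr, start, end and, optionally, strand.
    cid:     1-based position of the ID column, or None.
    """
    fields = line.rstrip("\n").split(sep)
    chrom, start, end = (fields[c - 1] for c in columns[:3])
    # Without a strand column the example output shows "*"
    strand = fields[columns[3] - 1] if len(columns) > 3 else "*"
    # Without an ID column the example output shows "None"
    name = fields[cid - 1] if cid is not None else "None"
    return "\t".join([chrom, start, end, name, "0", strand])


row = "uc007afh.1\tchr1\t4797973\t4836816\t+\tNM_008866"
print(to_bed6(row))                                # --columns=2,3,4
print(to_bed6(row, columns=(2, 3, 4, 5)))          # --columns=2,3,4,5
print(to_bed6(row, columns=(2, 3, 4, 5), cid=6))   # plus --cID=6
```

The three calls mirror the three commands of this section and differ only in the ID and strand fields of the resulting line.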

3.2. Define a window relative to the TSS (--window=TSS)

A symmetrical window centered at the TSS would be defined as:
[Figure: example of a TSS window]
An asymmetrical TSS window would be defined as:
[Figure: example TSS window]

For example, a 1 kb window flanking the TSS would be defined as (--window=TSS:-1000:1000):
$ head mm9_canonical_chr1.txt

#name            chrom txStart        txEnd    strand kgXref.refseq
uc007aeu.1    chr1    3204562    3661579    -    NM_001011874
uc007aex.2    chr1    4333587    4350395    -    NM_011283
uc007aez.1    chr1    4481008    4486494    -    NM_011441
uc007aff.2     chr1    4763278    4775807    -    NM_001177658
uc007afh.1    chr1    4797973    4836816    +   NM_008866
uc007afi.2     chr1    4847774    4887990    +   NM_011541
uc007afl.2     chr1    4899656    5060366    -    NM_001177795
uc007afn.1    chr1    5073253    5152630    +   NM_133826
uc007afo.1    chr1    5578573    5592947    +   NM_011011

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 --cID=6 --window=TSS:-1000:1000 -o results.txt
GenomeIntervals2BED6.py
Message: Starting SeqGI GenomeIntervals2BED6 module
There are 1 warning(s)

WARNING: Defining window ['TSS', -1000, 1000] relative to TSS,
TES or GeneBody has to be done based on strands.
"wstrand=True" was used
Created file: results.txt
End date and time: 2014-06-27 15:17:36.072012
Elapsed time: 0:00:00.016014
Done!
$ head results.txt

chr1    3660579    3662579    NM_001011874    0    -
chr1    4349395    4351395    NM_011283    0    -
chr1    4485494    4487494    NM_011441    0    -
chr1    4774807    4776807    NM_001177658    0    -
chr1    4796973    4798973    NM_008866    0    +
chr1    4846774    4848774    NM_011541    0    +
chr1    5059366    5061366    NM_001177795    0    -
chr1    5072253    5074253    NM_133826    0    +
chr1    5577573    5579573    NM_011011    0    +
chr1    5906479    5908479    NM_010342    0    -


Note:
Windows defined relative to TSS, TES or the gene body require that the input file contains the strand information for each feature. The strand information is necessary to specify the upstream and downstream coordinates.
If strand information is not provided in the --columns option (i.e. chr,start,end,strand), GenomeIntervals2BED.py will give an error:

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4 --cID=6 --window=TSS:-1000:1000 -o results.txt
ERROR: --columns=2,3,4. Strand is needed to define windows based on strand

Strand is needed to define TSS, TES and GeneBody windows
  • Upstream and downstream coordinates are always defined based on strand. For instance, the upstream interval lies to the right of a negative-strand feature and to the left of a positive-strand feature.
  • If the strand is anything other than "+" or "-", the upstream and downstream coordinates cannot be computed and the feature will be skipped.
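
The strand rule can be checked against the example output with a short sketch. The functions parse_window and tss_window are hypothetical names used for illustration only. The sketch assumes that the TSS is the start column for "+" features and the end column for "-" features, which is consistent with the results shown above.

```python
def parse_window(spec):
    """Split a window string such as 'TSS:-1000:1000' into its parts."""
    anchor, upstream, downstream = spec.split(":")
    return anchor, int(upstream), int(downstream)


def tss_window(start, end, strand, upstream, downstream):
    """Return (window_start, window_end) around the TSS, or None if skipped.

    upstream is a negative offset and downstream a positive one.
    """
    if strand == "+":
        tss = start
        return tss + upstream, tss + downstream
    if strand == "-":
        tss = end
        # On the reverse strand, upstream lies to the right of the TSS
        return tss - downstream, tss - upstream
    return None  # strand other than "+" or "-": feature is skipped


anchor, up, down = parse_window("TSS:-1000:1000")
print(tss_window(3204562, 3661579, "-", up, down))  # NM_001011874
print(tss_window(4797973, 4836816, "+", up, down))  # NM_008866
```

For the two features used here the sketch reproduces the first and fifth rows of the example results.txt.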

3.3. Define a window relative to the TES (--window=TES)

Windows can also be defined relative to the 3' end (transcription end site; TES).

For example, consider an asymmetrical TES window:
[Figure: example TES window]

For example, a window flanking the TES would be defined as (--window=TES:-500:1000):
$ head mm9_canonical_chr1.txt

#name            chrom txStart        txEnd    strand kgXref.refseq
uc007aeu.1    chr1    3204562    3661579    -    NM_001011874
uc007aex.2    chr1    4333587    4350395    -    NM_011283
uc007aez.1    chr1    4481008    4486494    -    NM_011441
uc007aff.2     chr1    4763278    4775807    -    NM_001177658
uc007afh.1    chr1    4797973    4836816    +   NM_008866
uc007afi.2     chr1    4847774    4887990    +   NM_011541
uc007afl.2     chr1    4899656    5060366    -    NM_001177795
uc007afn.1    chr1    5073253    5152630    +   NM_133826
uc007afo.1    chr1    5578573    5592947    +   NM_011011

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 --cID=6 --window=TES:-500:1000 --wstrand -o results.txt
GenomeIntervals2BED6.py
Message: Starting SeqGI GenomeIntervals2BED6 module
There are 0 warning(s)
Created file: results.txt
End date and time: 2014-06-27 15:56:49.069513
Elapsed time: 0:00:00.014131
Done!
$ head results.txt

chr1    3203562    3205062    NM_001011874    0    -
chr1    4332587    4334087    NM_011283    0    -
chr1    4480008    4481508    NM_011441    0    -
chr1    4762278    4763778    NM_001177658    0    -
chr1    4836316    4837816    NM_008866    0    +
chr1    4887490    4888990    NM_011541    0    +
chr1    4898656    4900156    NM_001177795    0    -
chr1    5152130    5153630    NM_133826    0    +
chr1    5592447    5593947    NM_011011    0    +
chr1    5902787    5904287    NM_010342    0    -
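
The asymmetrical TES window can be verified by hand with the following sketch. The function name tes_window is hypothetical, and the sketch assumes that the TES is the end column for "+" features and the start column for "-" features, as the example output suggests.

```python
def tes_window(start, end, strand, upstream, downstream):
    """Return (window_start, window_end) around the TES, or None if skipped."""
    if strand == "+":
        tes = end
        return tes + upstream, tes + downstream
    if strand == "-":
        tes = start
        # Mirror the offsets: downstream of a "-" feature is to the left
        return tes - downstream, tes - upstream
    return None


# --window=TES:-500:1000 applied to two rows of mm9_canonical_chr1.txt
print(tes_window(3204562, 3661579, "-", -500, 1000))  # NM_001011874
print(tes_window(4797973, 4836816, "+", -500, 1000))  # NM_008866
```

Because the window is asymmetrical, swapping the strand changes which side of the TES receives the 1000 bp extension, which is why strand information matters here.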

3.4. Define a window relative to the gene body (--window=GeneBody)

Windows can also be defined relative to the gene body.
A gene body window is defined relative to both the TSS and the TES, using the coordinates upstream of the TSS and downstream of the TES:

[Figure: example gene body window]


For example, a window flanking the gene body region would be defined as (--window=GeneBody:-500:1000):
$ head mm9_canonical_chr1.txt

#name            chrom txStart        txEnd    strand kgXref.refseq
uc007aeu.1    chr1    3204562    3661579    -    NM_001011874
uc007aex.2    chr1    4333587    4350395    -    NM_011283
uc007aez.1    chr1    4481008    4486494    -    NM_011441
uc007aff.2     chr1    4763278    4775807    -    NM_001177658
uc007afh.1    chr1    4797973    4836816    +   NM_008866
uc007afi.2     chr1    4847774    4887990    +   NM_011541
uc007afl.2     chr1    4899656    5060366    -    NM_001177795
uc007afn.1    chr1    5073253    5152630    +   NM_133826
uc007afo.1    chr1    5578573    5592947    +   NM_011011

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 --cID=6 --window=GeneBody:-500:1000 --wstrand -o results.txt
GenomeIntervals2BED6.py
Message: Starting SeqGI GenomeIntervals2BED6 module
There are 0 warning(s)
Created file: results.txt
End date and time: 2014-06-27 16:02:29.902554
Elapsed time: 0:00:00.013893
Done!
$ head results.txt

chr1    3203562    3662079    NM_001011874    0    -
chr1    4332587    4350895    NM_011283    0    -
chr1    4480008    4486994    NM_011441    0    -
chr1    4762278    4776307    NM_001177658    0    -
chr1    4797473    4837816    NM_008866    0    +
chr1    4847274    4888990    NM_011541    0    +
chr1    4898656    5060866    NM_001177795    0    -
chr1    5072753    5153630    NM_133826    0    +
chr1    5578073    5593947    NM_011011    0    +
chr1    5902787    5907979    NM_010342    0    -
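
A gene body window extends the whole feature rather than a single anchor point. The sketch below, with the hypothetical function name gene_body_window, applies the upstream offset to the TSS side and the downstream offset to the TES side. It is an illustration consistent with the example output, not the script's own code.

```python
def gene_body_window(start, end, strand, upstream, downstream):
    """Return the feature extended upstream of the TSS and downstream of the TES."""
    if strand == "+":
        # TSS is the start, TES is the end
        return start + upstream, end + downstream
    if strand == "-":
        # TSS is the end, TES is the start
        return start - downstream, end - upstream
    return None  # unknown strand: feature is skipped


# --window=GeneBody:-500:1000 applied to two rows of mm9_canonical_chr1.txt
print(gene_body_window(3204562, 3661579, "-", -500, 1000))  # NM_001011874
print(gene_body_window(4797973, 4836816, "+", -500, 1000))  # NM_008866
```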

3.5. Define a window relative to a column position (--window=<integer>)

In some cases you might be interested in investigating the pattern of read distribution in a genomic interval defined relative to a given starting coordinate. For instance, the column can hold peak maximum positions, SNP positions, etc.
In this case, GenomeIntervals2BED.py allows you to define the window of interest relative to coordinates present in a given column of the input file.

For example, a 2 kb window flanking the coordinates in column 2 and with --wstrand:
[Figure: example column-based window]

For example, a window extending 5 bp upstream and 10 bp downstream of the coordinates in column 3 would be defined as (--window=3:-5:10):
$ head mm9_canonical_chr1.txt

#name            chrom txStart        txEnd    strand kgXref.refseq
uc007aeu.1    chr1    3204562    3661579    -    NM_001011874
uc007aex.2    chr1    4333587    4350395    -    NM_011283
uc007aez.1    chr1    4481008    4486494    -    NM_011441
uc007aff.2     chr1    4763278    4775807    -    NM_001177658
uc007afh.1    chr1    4797973    4836816    +   NM_008866
uc007afi.2     chr1    4847774    4887990    +   NM_011541
uc007afl.2     chr1    4899656    5060366    -    NM_001177795
uc007afn.1    chr1    5073253    5152630    +   NM_133826
uc007afo.1    chr1    5578573    5592947    +   NM_011011

$ python GenomeIntervals2BED.py -f mm9_canonical_chr1.txt --columns=2,3,4,5 --cID=6 --window=3:-5:10 --wstrand -o results.txt
GenomeIntervals2BED6.py
Message: Starting SeqGI GenomeIntervals2BED6 module
There are 0 warning(s)
Created file: results.txt
End date and time: 2014-06-27 15:56:49.069513
Elapsed time: 0:00:00.014131
Done!
$ head results.txt

chr1    3204552    3204567    NM_001011874    0    -
chr1    4333577    4333592    NM_011283    0    -
chr1    4480998    4481013    NM_011441    0    -
chr1    4763268    4763283    NM_001177658    0    -
chr1    4797968    4797983    NM_008866    0    +
chr1    4847769    4847784    NM_011541    0    +
chr1    4899646    4899661    NM_001177795    0    -
chr1    5073248    5073263    NM_133826    0    +
chr1    5578568    5578583    NM_011011    0    +
chr1    5903777    5903792    NM_010342    0    -

Note:
Windows defined relative to the information in a particular column do not require the input file to contain strand information. Strand information is optional in this case.

Strand is optional when defining windows relative to column positions:
  • If --wstrand is provided, upstream and downstream coordinates are defined based on strand. For instance, the upstream interval lies to the right of a negative-strand feature and to the left of a positive-strand feature. Strand information is expected to be "+" or "-"; features with any other value will be skipped.
  • If --wstrand is not provided, the features are treated as "unstranded" and the (upstream, downstream) coordinates are defined as if the feature were on the forward strand (left and right, respectively).
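
The difference between the stranded and unstranded cases can be sketched as follows. The function name column_window is hypothetical. The sketch assumes the window is anchored at a single coordinate taken from the chosen column and follows the two rules listed above.

```python
def column_window(position, upstream, downstream, strand=None, wstrand=False):
    """Return (window_start, window_end) around a coordinate from any column.

    With wstrand=False the feature is treated as unstranded (forward strand).
    With wstrand=True the strand must be "+" or "-", otherwise None is returned.
    """
    if not wstrand or strand == "+":
        return position + upstream, position + downstream
    if strand == "-":
        return position - downstream, position - upstream
    return None  # wstrand requested but strand is not "+" or "-"


# --window=3:-5:10 --wstrand on a "-" feature (column 3 = 3204562)
print(column_window(3204562, -5, 10, strand="-", wstrand=True))
# Same feature without --wstrand: treated as forward strand
print(column_window(3204562, -5, 10, strand="-"))
```

The first call matches the first row of the example results.txt. The second call shows how the same asymmetrical window would shift if --wstrand were omitted.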

4. Input file format

“Genomic Intervals” is a standard text-based format which contains information that can be organized in any order. File delimiters allowed include tab, space, comma or semicolon.

  • Mandatory fields include:  chromosome, start, end
  • Optional fields include:  strand, ID
  • OGRe will skip any line that starts with "track", "browser" or the comment character "#".
  • The strand field can only be "+" or "-".
The user specifies the column positions of the coordinates of each genomic interval. This file format can also be used to define regions relative to coordinates within the file (relative to TSS, TES, peak maximum position, etc.). See the section on Example Usage for more details.
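
The reading rules listed above (allowed delimiters and skipped lines) can be sketched as a small generator. The names SEPARATORS and read_intervals are hypothetical. Treating "space" as any run of whitespace is an assumption of this sketch, not documented behaviour.

```python
# Map the separator names accepted by --sep to actual delimiters.
# None makes str.split() split on any run of whitespace (assumption).
SEPARATORS = {"tab": "\t", "space": None, "comma": ",", "semicolon": ";"}


def read_intervals(lines, sep="tab"):
    """Yield the fields of each data line, skipping header-like lines."""
    delimiter = SEPARATORS[sep.lower()]
    for line in lines:
        line = line.rstrip("\n")
        # Lines starting with "track", "browser" or "#" are skipped
        if line.startswith(("track", "browser", "#")):
            continue
        yield line.split(delimiter)


example = [
    "#name,chrom,txStart,txEnd,strand",
    "track name=test",
    "uc007afh.1,chr1,4797973,4836816,+",
]
for fields in read_intervals(example, sep="comma"):
    print(fields)
```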

5. Output file format (BED)

BED is a tabular format developed for use with the UCSC Genome Browser (see http://genome.ucsc.edu/FAQ/FAQformat#format1). The first three fields are mandatory and consist of chromosome, start, end. If "BED" is specified as the file format type, OGRe will make use of the first three fields. Strand (position 6) and ID (position 4) will only be used if available. All other fields will be ignored by OGRe.
BED3 format: chromosome    start    end
BED4 format: chromosome    start    end    ID
BED5 format: chromosome    start    end    ID    score
BED6 format: chromosome    start    end    ID    score    strand
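
How the optional fields of BED3 to BED6 lines might be read is sketched below. The function name parse_bed and the dictionary layout are hypothetical. The sketch only illustrates the rule stated above: three mandatory fields, ID taken from position 4 and strand from position 6 when present, everything else ignored.

```python
def parse_bed(line):
    """Parse one tab-separated BED line with 3 to 6 (or more) fields."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "id": None,
        "strand": None,
    }
    if len(fields) >= 4:
        record["id"] = fields[3]      # position 4
    if len(fields) >= 6:
        record["strand"] = fields[5]  # position 6
    return record


print(parse_bed("chr1\t4797973\t4836816"))
print(parse_bed("chr1\t4797973\t4836816\tNM_008866\t0\t+"))
```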