RepEx: Repeat Extractor for Biological Sequences

Downloads:

  RepEx.tar.gz

System Requirements:

         The RepEx package requires the following to run successfully. In the absence of one or more of these utilities, RepEx may fail to run correctly. Listed in parenthesis are the minimum versions required to run the RepEx package. These versions, or subsequent versions should assure the proper execution of RepEx.

          - make (GNU make 3.79.1)
          - perl (PERL 5.6.0)
          - sh (GNU sh 1.14.7)
          - csh (tcsh 6.10.00)
          - g++ (GNU gcc 2.95.3)
          - sed (GNU sed 3.02)
          - awk (GNU awk 3.0.4)
         - ar (GNU ar 2.9.5)

Repex can be ran almost on any Linux based operating systems, provided the above mentioned utilities are installed properly.

Sufficient memory and disk space are necessary, but required sizes vary with input size. Be aware of your disk and memory usage, because insufficient capacities will result in incorrect or missing output. Required resources differ depending on the input size, but in general 512 MB of RAM and 1 GB of disk space is sufficient.

It is possible to port the toolkit to any system with a C++ compiler but this has not been tested and will not be supported. In addition, you may need to alter the Makefile to direct 'make' to your native compiler and other system resources.

For Mac OSX, the Mac development kit must be downloaded and installed. This kit will include 'gcc', 'ar', and 'make' which are necessary for building RepEx. RepEx is not supported for any Mac operating system other than OSX.

For Windows users, Cygwin or other Unix-like environment and command-line interface can be installed with the above mentioned utilities.

How to run RepEx:

repex [options]

Program options
---------------
-m Type of molecular sequence(s): DNA [n] or Protein [p] {Default: [n]}
-t

Type of repeat to be extracted: Inverted [i] or Palindrome [ip] or Mirror [m] or Everted [e] (Mirror and everted repeat doesn't exist in protein sequence(s), thus -t m or -t e will not work for protein sequence(s)) {Default: [i]}
-l

Minimum length of repeats to be extracted [positive integers]. (Caution !! Any lower than around 15 can significantly increase the number of spurious matches and therefore burst up the runtime) {Default: [20]}
-s


Spacer intervals i.e., the number of bases or residues between the repeat pattern and its copy. All [a] or Local [l] (within 100 bases) or Global [g] (outside 100 bases) or Manual (For manual option, enter your length (x) of spacers preceding with appropriate letter (greater: g, lesser: l, equal: e) -s [gx or lx or ex]) {Default: [l] for DNA and [a] for proteins}
-c Class of repeat to be extracted: Identical [i] or Degenerative [d] or both [b]. {Default: [i]}
-f Input file path.
Example: repex -m n -t m -l 20 -s l50 -c b -f /home/User/Documents/fasta.seq

The above example can be use to extract both identical and degenerative (-c b) mirror repeats (-t m) from DNA sequences (-m n) with minimum length of 20 (-l 20), where the spacer intervals should be less than 50 bases(-s l50) from file named fasta.seq located at /home/User/Documents.

Contact:

Dr. K. Sekar
Associate Professor
Department of Computational and Data Sciences
#341,2nd Floor,Old CES building
Indian Institute of Science
Bangalore 560 012
INDIA

E-mail:  sekar@iisc.ac.in