GitHub - mbelmadani/motifgp: Motif discovery for DNA sequences using multiobjective optimization and genetic programming.

===============
= MotifGP 0.2 =
===============
MotifGP is a de novo motif discovery tool for discriminatory network expression identification in ChIP-seq datasets.

Original author: Manuel Belmadani
	mbelm006@uottawa.ca

The project is documented by the following publications.

Manuel Belmadani and Marcel Turcotte. MotifGP: Using multi-objective evolutionary computing for mining network expressions
in DNA sequences. In IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology
(CIBCB 2016), Chiang Mai, Thailand, October 5-7, 2016.
https://doi.org/10.1109/CIBCB.2016.7758133

Manuel Belmadani. MotifGP: DNA motif discovery using multiobjective evolution. Master of computer science, 
University of Ottawa, School of Electrical Engineering and Computer Science, 2016. 
Available from University of Ottawa Research under: http://www.ruor.uottawa.ca/handle/10393/34213

Acknowledgements:
MotifGP uses source code from the following tools:
-hypergeometric.py from the MEME Suite (license and copyright in the source file).
-altschulEriksonDinuclShuffle.py from Peter Clote - CLOTE Computational Biology LAB, http://clavius.bc.edu/~clotelab/RNAdinucleotideShuffle/
This software was also built using DEAP: Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A. G.,
Parizeau, M. & Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 13,
2171–2175 (2012).

=======================================================================================
License: (see LICENSE.txt)
=======================================================================================
Installation: (see INSTALL.txt)
=======================================================================================
Examples: (see EXAMPLES.txt)
=======================================================================================
Usage: motifgp.py [options]

Options:
  -h, --help            show this help message and exit
  -p TRAINING_PATH, --training=TRAINING_PATH
                        Fasta file to use for training (input) sequence data
  -b BACKGROUND_PATH, --background=BACKGROUND_PATH
                        [Optional] Fasta file to use for background (control)
                        sequence data. If not provided, the generated
                        control sequences will be written to runtime_tmp/
  -m MOO, --moo=MOO     Multi-objective optimization [SPEA2, NSGA2, NSGAR,
                        MOEAD]. NSGAR is the NSGA-II_R (NSGA-II Revised)
                        algorithm improvement of NSGA2.
  -f FITNESS, --fitness=FITNESS
                        Objective fitness function. Available objectives:
                        D=Discrimination, F=Fisher, I=ScipyFisher,
                        O=OddsRatio, Q=FalseDiscoveryRate, S=Support,
                        R=ScipyOddsRatio. Each single character in the
                        string represents an objective. Objectives are
                        mapped by the configuration file at
                        config/objectives. Default is 'DF' for
                        [Discrimination,Fisher] (2 objectives).
  --cxpb=CXPB           Probability [0.0 to 1.0] for a crossover during
                        variation. Requires --mutpb to be set to (1.0-cxpb).
                        Default is 0.7.
  --mutpb=MUTPB         Probability [0.0 to 1.0] for a mutation during
                        variation. Requires --cxpb to be set to (1.0-mutpb).
                        Default is 0.3.
  --short=SHORT         Stops reading in after <SHORT> input sequences.
  --popsize=POPSIZE     Size of the population.
  --revcomp             Compile regex with reverse complement
  --random-seed=RANDOM_SEED
                        Random seed value to set for execution
  -n NGEN, --num-gen=NGEN
                        Generation where runtime stops (even in the case of
                        resumed checkpoints)
  --timelimit=TIMELIMIT
                        Time limit on the GP loop execution.
  --matcher=MATCHER     Use a different matcher. Options: 'grep', 'python'.
                        'grep' is faster on large datasets, while 'python' is
                        a pure python version in case the system doesn't
                        support grep.
  -o OUTPUT_PATH, --output=OUTPUT_PATH
                        Output directory. Default is ./OUT/
  -t TAG, --tag=TAG     A tag for the output subdirectory. Use it to describe
                        the run; results are saved in the tag's subdirectory
                        of the output directory. Default is 'default'.
  -i, --inspector       Don't print any files. Can be useful with python -i
                        (interactive mode).
  --hardmask            Replace tandem repeats (nucleotides typed in lower
                        case) with N
  -g GRAMMAR, --grammar=GRAMMAR
                        Grammar for the STGP [min, iupac, full, ne]. Default
                        is iupac. 'min' only uses nucleotides. 'iupac' is a
                        network expression grammar. 'full' is a network
                        expression grammar with additional regular expression
                        tokens. 'ne' is like iupac, but built with string
                        primitives instead of booleans.
  -e ERASE, --erase=ERASE
                        Input .nef(t) file to delete from the dataset prior to
                        execution. Used for sequential coverage.
  --backpad             Pads background sequences with consecutive nucleotides
                        (i.e. AAAAAAAA, CCCCCCCC, GGGGGGGG, TTTTTTTT) of
                        length 8 for every set of 4 sequences.
  --bg-algo=BG_ALGO     Shuffling algorithm for background. Default is
                        'dinuclShuffle', if no background dataset is provided.
                        Currently, dinuclShuffle is the only implemented
                        method.
  --ncpu=NCPU           Number of CPUs to use when mapping evaluation of
                        solutions. Use an integer, or "auto" to automatically
                        determine the maximum number. Default is no
                        parallelism.
  --termination=TERMINATION
                        Use automatic termination algorithm. Use 'auto' to
                        enable the automatic termination algorithm for MOEAs.
  --hamming             [Experimental] Generates statistics on the Hamming
                        distance between a template regex and the
                        hall-of-fame (hof) candidates.
  --seeded-population   [Experimental] Use population seeds
  -c CHECKPOINT_PATH, --checkpoint=CHECKPOINT_PATH
                        [Temporarily disabled] Load a checkpoint at path.
  -q, --quiet           [Unimplemented] don't print status messages to stdout
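
The --fitness, --cxpb and --mutpb descriptions above state two constraints: each
character of the fitness string names one objective, and the crossover and
mutation probabilities must sum to 1.0. The following is a minimal Python 3 sketch
of how those constraints could be checked before launching a run. It is not taken
from the MotifGP source: the names OBJECTIVE_CODES, decode_fitness and
check_variation are hypothetical, and the code table is copied from the help text
rather than read from config/objectives.

```python
import math

# Objective codes as listed in the --fitness help text (assumption: the real
# mapping lives in config/objectives and may differ from this copy).
OBJECTIVE_CODES = {
    "D": "Discrimination",
    "F": "Fisher",
    "I": "ScipyFisher",
    "O": "OddsRatio",
    "Q": "FalseDiscoveryRate",
    "S": "Support",
    "R": "ScipyOddsRatio",
}


def decode_fitness(fitness="DF"):
    """Translate a --fitness string into objective names, one per character."""
    unknown = [code for code in fitness if code not in OBJECTIVE_CODES]
    if unknown:
        raise ValueError("Unknown objective code(s): " + ", ".join(unknown))
    return [OBJECTIVE_CODES[code] for code in fitness]


def check_variation(cxpb=0.7, mutpb=0.3):
    """Check that both values are probabilities and that they sum to 1.0."""
    for name, value in (("cxpb", cxpb), ("mutpb", mutpb)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must be within [0.0, 1.0], got %r" % (name, value))
    # Compare with a tolerance, since 0.7 + 0.3 is not exactly 1.0 in floating point.
    if not math.isclose(cxpb + mutpb, 1.0, abs_tol=1e-9):
        raise ValueError("cxpb + mutpb must equal 1.0")
    return cxpb, mutpb


if __name__ == "__main__":
    print(decode_fitness("DF"))
    print(check_variation(0.7, 0.3))
```

With the documented defaults, decode_fitness("DF") yields the two objectives
Discrimination and Fisher, and check_variation(0.7, 0.3) passes. A three-character
string such as "DFS" would describe a 3-objective run under the same reading.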


Also consider looking at EXAMPLES.txt for basic examples of MotifGP usage.
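
To make the --hardmask, --revcomp and 'iupac' grammar options more concrete, here
is a small Python 3 sketch of the underlying ideas: masking lower-case nucleotides
with N, expanding an IUPAC network expression into a regular expression, and
matching either strand by adding the reverse complement of the expression. This is
an illustration under assumptions, not MotifGP's implementation: all function
names are hypothetical, and treating --revcomp as "forward pattern OR reverse
complement pattern" is only one plausible reading of the help text.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

# Complement of each IUPAC code (for example R = A|G pairs with Y = T|C).
IUPAC_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def hardmask(sequence):
    """Replace lower-case nucleotides with N."""
    return "".join("N" if base.islower() else base for base in sequence)


def iupac_to_regex(expression):
    """Expand an IUPAC network expression into a regex pattern string."""
    return "".join(IUPAC_CLASSES[symbol] for symbol in expression.upper())


def reverse_complement(expression):
    """Reverse complement of an IUPAC expression."""
    return expression.upper().translate(IUPAC_COMPLEMENT)[::-1]


def compile_motif(expression, revcomp=False):
    """Compile a motif; with revcomp=True the opposite strand also matches."""
    pattern = iupac_to_regex(expression)
    if revcomp:
        reverse = iupac_to_regex(reverse_complement(expression))
        pattern = "(?:%s)|(?:%s)" % (pattern, reverse)
    return re.compile(pattern)


if __name__ == "__main__":
    print(hardmask("ACgtAC"))
    print(iupac_to_regex("GATWA"))
    print(reverse_complement("GATWA"))
    print(bool(compile_motif("GATWA", revcomp=True).search("CCTTATCGG")))
```

In this sketch a masked position (N in the sequence) never matches a motif, since
the N symbol in an expression expands to [ACGT] and not to the literal letter N.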

Project website: mbelmadani.github.io/motifgp/
License: LGPL-3.0 (see LICENSE.txt)
