README.md
+54 -46 (lines changed: 54 additions & 46 deletions)
Original file line number
Diff line number
Diff line change
@@ -14,16 +14,18 @@
14
14
15
15
---
16
16
- [About](#About)
17
-
- [Assets](#Assets)
18
-
- [Benchmark Results](#Benchmark-Results)
19
-
- [Benchmark Sequences](#Benchmark-Sequences)
20
-
- [ICOR Tool](#Tool)
21
-
- [Scripts](#Scripts)
22
-
- [Summaries](#Summaries)
23
-
- [Resources](#Resources)
17
+
- [Assets](#Assets)
18
+
- [Benchmark Results](#Benchmark-Results)
19
+
- [Benchmark Sequences](#Benchmark-Sequences)
20
+
- [ICOR Tool](#Tool)
21
+
- [Models](#Models)
22
+
- [Optimizers](#Optimizers)
23
+
- [Scripts](#Scripts)
24
+
- [Resources](#Resources)
25
+
- [Dependencies](#Dependencies)
24
26
---
25
27
26
-
## About
28
+
### About
27
29
Because there are 61 sense codons but only 20 standard amino acids, most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via an RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, the ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
28
30
29
31
### Assets
@@ -36,7 +38,7 @@ Assets including images and branding for the ICOR tool, hosted on the [biotools
36
38
- `naive_benchmarks` which consists of the benchmark results for the naively optimized sequences.
37
39
- `original_benchmarks` which consists of the benchmark results for the original, unoptimized sequences.
38
40
- `super_naive_benchmarks` which consists of the benchmark results for the super naively optimized sequences.
39
-
- `genscript_benchmarks` which consists of the
41
+
- `genscript_benchmarks` which consists of the benchmark results for the [Genscript Gensmart™](https://www.genscript.com/gensmart-free-gene-codon-optimization.html) optimized sequences.
40
42
41
43
### Benchmark Sequences
42
44
`benchmark_sequences` is a folder that contains sequences for benchmarking purposes, each in the FASTA format:
@@ -47,34 +49,48 @@ Assets including images and branding for the ICOR tool, hosted on the [biotools
47
49
- `icor` which consists of 40 DNA sequences optimized by the ICOR optimizer.
48
50
- `naive` which consists of 40 DNA sequences optimized by the naive optimizer.
49
51
- `super_naive` consists of 40 DNA sequences optimized by the super naive optimizer.
50
-
- `genscript` consists of 40 DNA sequences optimized by the Genscript Gensmart tool.
52
+
- `genscript` consists of 40 DNA sequences optimized by the [Genscript Gensmart™](https://www.genscript.com/gensmart-free-gene-codon-optimization.html) tool.
51
53
52
54
### Tool
53
-
The ICOR tool has been divided into four directories: models, optimizers, resources, and scripts. At the base of the directory sits the `run_icor.ipynb` file: an interactive notebook to optimize a sequence utilizing the trained ICOR model. Supporting files were used to train, evaluate, and test the ICOR model. Descriptions for these can be found below:
55
+
The ICOR tool has been divided into four directories: models, optimizers, resources, and scripts. In the `/tool/optimizers` directory sits the `icor_optimizer.py` file: an interactive script to optimize a sequence utilizing the trained ICOR model.
56
+
57
+
> Note: as of 8/24/2021, the ICOR optimizer Python script has a bug. It runs and outputs a sequence, but the sequence is not correct. The other script, `run_icor_from_mat`, does work and outputs the correct sequence given a `.mat` file as input. However, a user would normally input a FASTA file or paste in a sequence. This script currently accepts the pasted sequence, but the optimizer portion does not work as expected. Since the `run_icor_from_mat` script runs inference on the same model, I have isolated the issue to the encoding done in this script rather than the model file. I have 1-2 things that I still need to try, which I believe will solve this issue.
58
+
59
+
Supporting files were used to train, evaluate, and test the ICOR model. Descriptions for these can be found below:
54
60
55
61
#### Models
56
62
The models directory contains the trained ICOR model in the [ONNX](https://onnx.ai) (open-neural-network-exchange) format. Below is a preview of the model architecture:
57
63
58
-
<div style="text-align: right">
59
64
<img src="/assets/icor-small-visualization.png">
60
65
The ICOR model was trained in the MATLAB environment. For more details on model architecture, please review our manuscript file in the base of the repository. Upon submission, this will be changed to a DOI/biorxiv link.
61
-
</div>
62
-
63
66
64
67
`benchmark_genes.pdf`
65
68
> A document that contains all of the benchmarking genes and descriptions of them.
66
69
67
-
## Scripts
68
-
The following is a description of the purpose for each script in the repository.
69
-
70
-
`reformat_seqs.py`
71
-
> Iterate through each file in a directory and reformat the sequence uniformly.
70
+
#### Optimizers
71
+
`brute_force_optimizer.py`
72
+
> Brute force optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It generates 10,000 sequences and chooses the one with the highest CAI.
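
As a rough sketch of the generate-and-select idea described above (not the repository's `brute_force_optimizer.py` itself), the snippet below draws candidate DNA sequences and keeps the one with the highest CAI. Everything here is an assumption made for illustration: the `CODON_FREQS` values are placeholders rather than real E. coli frequencies, the table covers only three amino acids, candidates are drawn uniformly at random, and the names `relative_adaptiveness`, `cai`, and `brute_force_optimize` are hypothetical. CAI is computed as the geometric mean of each codon's frequency relative to the most frequent synonymous codon; this simplified version keeps single-codon amino acids, which CAI implementations often exclude.

```python
import math
import random

# Placeholder codon frequencies per amino acid (hypothetical values, NOT real E. coli data).
CODON_FREQS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "F": {"TTT": 0.6, "TTC": 0.4},
}

def relative_adaptiveness(freqs):
    """Weight of each codon = its frequency / frequency of the best synonymous codon."""
    weights = {}
    for options in freqs.values():
        best = max(options.values())
        for codon, freq in options.items():
            weights[codon] = freq / best
    return weights

def cai(dna, weights):
    """Geometric mean of the codon weights along the sequence."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

def brute_force_optimize(protein, n_candidates=10_000, rng=None):
    """Draw random synonymous sequences and return the (sequence, CAI) with the highest CAI."""
    rng = rng or random.Random()
    weights = relative_adaptiveness(CODON_FREQS)
    best_seq, best_score = None, -1.0
    for _ in range(n_candidates):
        candidate = "".join(rng.choice(list(CODON_FREQS[aa])) for aa in protein)
        score = cai(candidate, weights)
        if score > best_score:
            best_seq, best_score = candidate, score
    return best_seq, best_score

print(brute_force_optimize("MKF", n_candidates=200, rng=random.Random(0)))
```

For long proteins the number of possible sequences grows exponentially, so a fixed budget of random candidates only samples a tiny fraction of the search space.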
72
73
74
+
`icor_optimizer.py`
75
+
> ICOR optimizer outputs a text file given a sequence input of amino acids or DNA. It is an interactive Python command-line script that runs inference with the ICOR model.
76
+
77
+
`naive_optimizer.py`
78
+
> Naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It selects codons to match the natural frequency that occurs within E. coli. This is what many tools in the industry use as well. This tool/script is built upon the `ecoli_codon_frequencies.csv` file in the summaries directory.
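
The frequency-matching selection described above can be sketched as weighted random sampling. This is an illustration under stated assumptions, not the contents of `naive_optimizer.py`: the `CODON_FREQS` values are placeholders rather than numbers from `ecoli_codon_frequencies.csv`, only three amino acids are covered, and `naive_optimize` is a hypothetical name.

```python
import random

# Placeholder per-amino-acid codon frequencies (hypothetical values, NOT real E. coli data).
CODON_FREQS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "F": {"TTT": 0.6, "TTC": 0.4},
}

def naive_optimize(protein, rng=None):
    """Sample each codon with probability proportional to its frequency."""
    rng = rng or random.Random()
    codons = []
    for aa in protein:
        options = CODON_FREQS[aa]
        # random.choices performs weighted sampling; k=1 returns a one-element list.
        codon = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        codons.append(codon)
    return "".join(codons)

print(naive_optimize("MKFK", random.Random(0)))
```

Over many positions the sampled codon usage approaches the supplied frequencies, but each individual choice ignores the neighboring codons.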
79
+
73
80
`super_naive_optimizer.py`
74
81
> Super naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It randomly selects a codon given an amino acid, making it a very naive approach.
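
A minimal sketch of the uniform random selection described above, assuming a codon map shaped like the AA2Codons dictionary. The table here is deliberately truncated to six amino acids of the standard genetic code, and `super_naive_optimize` is a hypothetical name rather than a function from `super_naive_optimizer.py`.

```python
import random

# Truncated codon map for illustration only (standard genetic code, six amino acids).
AA2CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "E": ["GAA", "GAG"],
    "D": ["GAT", "GAC"],
}

def super_naive_optimize(protein, rng=None):
    """Pick one synonymous codon uniformly at random for each amino acid."""
    rng = rng or random.Random()
    return "".join(rng.choice(AA2CODONS[aa]) for aa in protein)

print(super_naive_optimize("MKFWED", random.Random(0)))
```

Because every synonymous codon is equally likely, the output still translates to the same protein but carries no information about host codon usage.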
75
82
76
-
`naive_optimizer.py`
77
-
> Naive optimizer creates a directory containing amino acid sequences in the FASTA format and saves these "optimized" / "generated" DNA sequences in a directory. It selects codons to match the natural frequency that occurs within E. coli. This is what many tools in the industry use as well. This tool/script is built upon the `ecoli_codon_frequencies.csv` file in the summaries directory.
83
+
#### Scripts
84
+
The following is a description of the purpose for each script in the repository.
85
+
86
+
`convert_to_cds.py`
87
+
> Takes an input of DNA sequences and fetches their CDS only from the NCBI nuccore database. Rewrites files with CDS.
88
+
89
+
`csv_to_seqs.py`
90
+
> Takes an input of a CSV from the GenScript Gensmart tool and writes them into files containing the sequences in the FASTA format.
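
One way such a CSV-to-FASTA conversion might look is sketched below. The column names `name` and `sequence` are assumptions, since the layout of the GenScript export is not described here, and `csv_to_fasta` is a hypothetical helper rather than the code in `csv_to_seqs.py`.

```python
import csv
import io

def csv_to_fasta(csv_text, name_col="name", seq_col="sequence", width=60):
    """Return {record name: FASTA text} for every row of a CSV export."""
    fasta = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[name_col].strip()
        # Drop any whitespace inside the sequence and normalize to upper case.
        seq = "".join(row[seq_col].split()).upper()
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        fasta[name] = f">{name}\n{wrapped}\n"
    return fasta

example = "name,sequence\ngeneA,atgaaatttTGA\n"
print(csv_to_fasta(example)["geneA"])
```

Each value in the returned dictionary can then be written to its own file, for example with `pathlib.Path(name + ".fasta").write_text(text)`.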
91
+
92
+
`reformat_seqs.py`
93
+
> Iterate through each file in a directory and reformat the sequence uniformly.
78
94
79
95
`run_benchmark.ipynb`
80
96
> An interactive notebook that helps benchmark a directory containing FASTA sequences across the following metrics:
@@ -83,35 +99,27 @@ The following is a description of the purpose for each script in the repository.
83
99
- CFD (known un-optimized gene that reduces efficiency)
84
100
- Negative CIS elements
85
101
- Negative repeat elements
102
+
103
+
`run_icor_from_mat.ipynb`
104
+
> A notebook that accepts a `.mat` file containing one variable called "XTrain" of the cell array type. The cell array used in the experiments had shape 42266x1.
105
+
> Note: as of 8/24/2021 this script successfully outputs the ICOR optimized sequence and it does indeed match the correct ICOR optimization.
86
106
87
-
## Summaries
88
-
The following is a description of the purpose for each summary in the summaries folder.
89
-
90
-
`benchmark_genes.csv`
91
-
> Description of each benchmark gene used. Also is above, in README file.
92
-
93
-
`codon_map.xlsx`
94
-
> Contains the codon map used for the AA2Codons dictionary.
95
-
96
-
`super_naive_benchmarks.csv`
97
-
> Contains the benchmarks for super_naively-created sequences.
98
-
99
-
`naive_benchmarks.csv`
100
-
> Contains the benchmarks for naively-created sequences.
101
-
102
-
`original_benchmarks.csv`
103
-
> Contains the benchmarks for the original sequences.
104
-
105
-
`ICOR_benchmarks.csv`
106
-
> Contains the benchmarks for the ICOR-optimized sequences.
107
+
#### Resources
108
+
The following is a description of the purpose for each resource in the resources folder.
> Contains an overview of the benchmarks, comparing each of the "tools" for each of the benchmarks. This is the sheet to look at if you would like to be able to see the metrics differences between the tools.
112
+
113
+
`benchmark_genes.pdf`
114
+
> A table for all of the benchmark genes used for validation.
115
+
116
+
`codon_map.xlsx`
117
+
> Contains the codon map used for the AA2Codons dictionary.
118
+
119
+
`ecoli_codon_frequencies.csv`
120
+
> Contains the codon frequency weights for each codon/amino acid used in the E. coli genomes. The naive tool was built upon these frequencies.
110
121
111
-
`ecoli_codon_frequencies.xlsx`
112
-
> Contains the codon frequencies found in E. coli for each amino acid. The naive tool was built upon these frequencies.
113
-
114
-
## Dependencies
122
+
### Dependencies
115
123
- Python 3.9.4
116
124
- biopython
117
125
- numpy
@@ -120,4 +128,4 @@ The following is a description of the purpose for each summary in the summaries
120
128
- re
121
129
- selenium
122
130
- Chrome (chromedriver does not seem to work with Chromium; an actual Chrome installation is required)