Skip to content

Commit 82d6cac

Browse files
author
RJain12
committed
Resolving suggestions from Nishant Code Review
1 parent 66a5acc commit 82d6cac

9 files changed

Lines changed: 107 additions & 151 deletions

README.md

Lines changed: 8 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -10,10 +10,11 @@
1010

1111
</p>
1212

13-
<h3 align="center"> ICOR: Improving Codon Optimization with Recurrent neural networks <h4>
13+
<h3 align="center"> ICOR: Improving Codon Optimization with Recurrent neural networks <h3>
1414

1515
---
1616
- [About](#About)
17+
- [Quickstart](#Quickstart)
1718
- [Assets](#Assets)
1819
- [Benchmark Results](#Benchmark-Results)
1920
- [Benchmark Sequences](#Benchmark-Sequences)
@@ -25,23 +26,20 @@
2526
- [Dependencies](#Dependencies)
2627
---
2728

29+
### About
30+
In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
31+
2832
### Quickstart
2933
I really like having a quickstart section that gives me a single command to install prereqs, a single command to run all tests (if any), and a single command to run the application. Something like:
3034

3135
```bash
3236
# Install prereqs
33-
pip install -r requriements.txt # or an install_prereqs.sh script if you have more diverse dependencies
34-
35-
# run tests (if you decided to add tests in the future)
36-
pytest
37+
pip install -r requirements.txt
3738

38-
# run models
39-
python ./tool/optimizers/brute_force_optimizer.py
39+
# Run ICOR optimizer
40+
python ./tool/optimizers/icor_optimizer.py
4041
```
4142

42-
### About
43-
In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
44-
4543
### Assets
4644
Assets including images and branding for the ICOR tool, hosted on the [biotools by Lattice Automation](https://tools.latticeautomation.com/) website.
4745

requirements.txt

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
Bio==1.0.3
2+
numpy==1.21.2
3+
onnxruntime==1.9.0

tool/optimizers/brute_force_optimizer.py

Lines changed: 11 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -3,12 +3,6 @@
33
Goal of this is to find a combination of codons to maximize CAI (achieve 1.0 CAI).
44
'''
55

6-
# Shouldn't hardcode profiling code, pass a flag to turn on profiling
7-
import timeit
8-
9-
start = timeit.default_timer()
10-
11-
126
# Import modules
137
import os
148
from Bio import SeqIO
@@ -19,11 +13,8 @@
1913
import re
2014

2115
# Set input AA sequence directory and output for writing brute sequences
22-
# Shouldn't hard code these - should be relative paths like "../../benchmark_sequences/aa"
23-
# Also, I think most scientists will be using UNIX-like OSes.
24-
# I'm not super familiar with it, but have you tried running this in WSL to check for compatibility?
25-
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
26-
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\brute"
16+
aa_dir = r"..\..\benchmark_sequences\aa"
17+
out_dir = r"..\..\benchmark_sequences\brute"
2718

2819
# Define weights for each codon
2920
weights = [0,1,0.647058823500000,0.500000000000000,0.794117647100000,0.0789473684200000,0.131578947400000,0.263157894700000,0.184210526300000,0.973684210500000,1,0.851851851900000,1,1,0.587301587300000,0.818181818200000,1,0.483870967700000,0.129032258100000,1,1,0.515151515200000,0.470588235300000,1,0.384615384600000,0.307692307700000,0.871794871800000,1,1,0.754385964900000,0.180000000000000,1,0.820000000000000,0.265306122400000,0.265306122400000,1,0.0816326530600000,0.224489795900000,0.204081632700000,0.333333333300000,1,1,1,0.754385964900000,1,0.392156862700000,0.333333333300000,0.235294117600000,0.576923076900000,1,0.576923076900000,0.500000000000000,0.615384615400000,0.576923076900000,0.619047619000000,0.357142857100000,0.428571428600000,1,1,1,0.724137931000000,1,0.444444444400000,0.750000000000000,0.583333333300000]
@@ -133,18 +124,15 @@ def aa2codons(seq : str) -> list:
133124
# Converts an amino acid to a random corresponding codon:
134125
for entry in os.scandir(aa_dir):
135126
# Read in the amino acid sequence:
136-
137-
# I'm guessing this is to strip the _aa.fasta, perhaps replace it with something like
138-
# name = entry.replace("_aa.fasta", "_dna")
139-
# to be more explicit
140-
name = entry.name[0:-9] + "_dna"
127+
name = entry.replace("_aa.fasta", "_dna")
141128
record = SeqIO.read(entry,'fasta')
129+
142130
masterlist = []
143131
bestcai = 0
144132
curcai = 0
145-
z = 0
146-
# What's the significance of 100000? Could we give it a descriptive name?
147-
while z < 100000:
133+
TOTAL_ITERATIONS = 100000
134+
135+
for curr_iteration in range(0, TOTAL_ITERATIONS):
148136
codonarr = []
149137
# Convert amino acid to codons:
150138
for i in record.seq:
@@ -154,24 +142,13 @@ def aa2codons(seq : str) -> list:
154142
# With our new codon array, calculate the CAI:
155143
cai = seq2cai(codonarr)
156144
if (cai > curcai):
157-
bestcai = z
145+
bestcai = curr_iteration
158146
curcai = cai
159147
print('new best cai ' + str(cai))
160-
z += 1
161-
print(z)
162-
163-
# ⬆
164-
# Style nit, but it would be more pythonic to write
165-
# TOTAL_ITERATIONS = 100000
166-
# for curr_iteration in range(0, TOTAL_ITERATIONS):
167-
# ...
148+
curr_iteration += 1
149+
print(curr_iteration)
168150

169151
# Write the codon array to a file:
170152
record.seq = Seq(re.sub('[^GATC]',"",str("".join(masterlist[bestcai])).upper()))
171153
complete_name = os.path.join(out_dir, name)
172-
SeqIO.write(record, complete_name + ".fasta", "fasta")
173-
174-
stop = timeit.default_timer()
175-
176-
print('Time: ', stop - start)
177-
#1:00
154+
SeqIO.write(record, complete_name + ".fasta", "fasta")

tool/optimizers/icor_optimizer.py

Lines changed: 21 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,4 @@
1-
# Define variables (must change!)
2-
# Good idea! - you can use relative paths as described in ./brute_force_optimizer.py
3-
model_path = r"C:\Users\risha\Desktop\icor-codon-optimization\tool\models\icor.onnx"
1+
model_path = r"..\..\tool\models\icor.onnx"
42

53
# Import packages
64
from Bio.Seq import Seq
@@ -10,18 +8,17 @@
108
import numpy as np
119
from typing import List
1210

13-
type = input("Welcome to ICOR! Are you optimizing an amino acid sequence (enter in 'aa' below) or a dna/codon sequence (enter in 'dna' below)?\n\n").strip().upper()
11+
sequence_type = input("Welcome to ICOR! Are you optimizing an amino acid sequence (enter in 'aa' below) or a dna/codon sequence (enter in 'dna' below)?\n\n").strip().upper()
1412
input_seq = input(
1513
"Enter the coding sequence only.\nEnter in 'demo' to use demo sequence.\n\n").strip().upper()
16-
# 'type' is a builtin function in python - I'd recommend renaming the var to sequence_type to avoid reassigning it
1714

1815
# Load demo sequence (AKT1 amino acid seq)
19-
if type == 'AA':
16+
if sequence_type == 'AA':
2017
if input_seq == 'DEMO':
2118
input_seq = "MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKKEVIVAKDEVAHTLTENRVLQNSRHPFLTALKYSFQTHDRLCFVMEYANGGELFFHLSRERVFSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENLMLDKDGHIKITDFGLCKEGIKDGATMKTFCGTPEYLAPEVLEDNDYGRAVDWWGLGVVMYEMMCGRLPFYNQDHEKLFELILMEEIRFPRTLGPEAKSLLSGLLKKDPKQRLGGGSEDAKEIMQHRFFAGIVWQHVYEKKLSPPFKPQVTSETDTRYFDEEFTAQMITITPPDQDDSMECVDSERRPHFPQFSYSASGTA*"
2219
if not input_seq.startswith('M') or not input_seq.endswith('*'):
2320
sys.exit('Invalid amino acid sequence detected.\nThe sequence must start with M and end with * because ICOR only optimizes the codon-sequence region!\nPlease try again.\nRead more: http://www.hgvs.org/mutnomen/references.html#aalist')
24-
elif type == 'DNA':
21+
elif sequence_type == 'DNA':
2522
if input_seq == 'DEMO':
2623
input_seq = "ATGAGCGACGTGGCTATTGTGAAGGAGGGTTGGCTGCACAAACGAGGGGAGTACATCAAGACCTGGCGGCCACGCTACTTCCTCCTCAAGAATGATGGCACCTTCATTGGCTACAAGGAGCGGCCGCAGGATGTGGACCAACGTGAGGCTCCCCTCAACAACTTCTCTGTGGCGCAGTGCCAGCTGATGAAGACGGAGCGGCCCCGGCCCAACACCTTCATCATCCGCTGCCTGCAGTGGACCACTGTCATCGAACGCACCTTCCATGTGGAGACTCCTGAGGAGCGGGAGGAGTGGACAACCGCCATCCAGACTGTGGCTGACGGCCTCAAGAAGCAGGAGGAGGAGGAGATGGACTTCCGGTCGGGCTCACCCAGTGACAACTCAGGGGCTGAAGAGATGGAGGTGTCCCTGGCCAAGCCCAAGCACCGCGTGACCATGAACGAGTTTGAGTACCTGAAGCTGCTGGGCAAGGGCACTTTCGGCAAGGTGATCCTGGTGAAGGAGAAGGCCACAGGCCGCTACTACGCCATGAAGATCCTCAAGAAGGAAGTCATCGTGGCCAAGGACGAGGTGGCCCACACACTCACCGAGAACCGCGTCCTGCAGAACTCCAGGCACCCCTTCCTCACAGCCCTGAAGTACTCTTTCCAGACCCACGACCGCCTCTGCTTTGTCATGGAGTACGCCAACGGGGGCGAGCTGTTCTTCCACCTGTCCCGGGAGCGTGTGTTCTCCGAGGACCGGGCCCGCTTCTATGGCGCTGAGATTGTGTCAGCCCTGGACTACCTGCACTCGGAGAAGAACGTGGTGTACCGGGACCTCAAGCTGGAGAACCTCATGCTGGACAAGGACGGGCACATTAAGATCACAGACTTCGGGCTGTGCAAGGAGGGGATCAAGGACGGTGCCACCATGAAGACCTTTTGCGGCACACCTGAGTACCTGGCCCCCGAGGTGCTGGAGGACAATGACTACGGCCGTGCAGTGGACTGGTGGGGGCTGGGCGTGGTCATGTACGAGATGATGTGCGGTCGCCTGCCCTTCTACAACCAGGACCATGAGAAGCTTTTTGAGCTCATCCTCATGGAGGAGATCCGCTTCCCGCGCACGCTTGGTCCCGAGGCCAAGTCCTTGCTTTCAGGGCTGCTCAAGAAGGACCCCAAGCAGAGGCTTGGCGGGGGCTCCGAGGACGCCAAGGAGATCATGCAGCATCGCTTCTTTGCCGGTATCGTGTGGCAGCACGTGTACGAGAAGAAGCTCAGCCCACCCTTCAAGCCCCAGGTCACGTCGGAGACTGACACCAGGTATTTTGATGAGGAGTTCACGGCCCAGATGATCACCATCACACCACCTGACCAAGATGACAGCATGGAGTGTGTGGACAGCGAGCGCAGGCCCCACTTCCCCCAGTTCTCCTACTCGGCCAGCGGCACGGCCTGA"
2724
if 'U' in input_seq:
@@ -34,17 +31,15 @@
3431
# ICOR accepts the amino acid sequence, so we translate the DNA sequence to amino acid sequence:
3532
input_seq = Seq(input_seq)
3633
input_seq = input_seq.translate()
37-
# It's good to handle all cases of your if/elif. Something like
38-
# else:
39-
# sys.exit(f"Invalid sequence type {sequence_type}. Expected 'aa' or 'dna'")
34+
else:
35+
sys.exit(f"Invalid sequence type {sequence_type}. Expected 'aa' or 'dna'")
4036

4137

4238
print(input_seq)
4339
# Define categorical labels from when model was trained.
4440
labels = ['AAA', 'AAC','AAG','AAT','ACA','ACG','ACT','AGC','ATA','ATC','ATG','ATT','CAA','CAC','CAG','CCG','CCT','CTA','CTC','CTG','CTT','GAA','GAT','GCA','GCC','GCG','GCT','GGA','GGC','GTC','GTG','GTT','TAA','TAT','TCA','TCG','TCT','TGG','TGT','TTA','TTC','TTG','TTT','ACC','CAT','CCA','CGG','CGT','GAC','GAG','GGT','AGT','GGG','GTA','TGC','CCC','CGA','CGC','TAC','TAG','TCC','AGA','AGG','TGA']
4541

4642
# Define aa to integer table
47-
# Your 'seq: str' type definition is broken by your reassignment of 'str' below
4843
def aa2int(seq: str) -> List[int]:
4944
_aa2int = {
5045
'A': 1,
@@ -84,9 +79,9 @@ def aa2int(seq: str) -> List[int]:
8479
aa_placement = aa2int(input_seq)
8580

8681
# One-hot encode the amino acid sequence:
87-
i = 0
82+
8883
# style nit: more pythonic to write for i in range(0, len(aa_placement)):
89-
while i < len(aa_placement):
84+
for i in range(0, len(aa_placement)):
9085
oh_array[aa_placement[i], i] = 1
9186
i += 1
9287

@@ -109,32 +104,22 @@ def aa2int(seq: str) -> List[int]:
109104
for pred in pred_onx[0]:
110105
pred_indices.append(np.argmax(pred))
111106

112-
# Likewise, 'str' is a bultin type in python
113-
# I'd rename to 'output_str' or the like
114107
out_str = ""
115108
for index in pred_indices:
116-
str += labels[index]
117-
print('==== OUTPUT ====\n' + str)
109+
out_str += labels[index]
110+
print('==== OUTPUT ====\n' + out_str)
118111

119112
output = input(
120113
"Would you like to write this into a file? (Y or N)\n\n").strip().upper()
121114

122-
if (output == 'Y'):
123-
with open('output.txt', 'w') as f:
124-
f.write(str)
125-
print('\nOutput written to output.txt')
126-
else:
127-
print('\nNo output written. Done!')
128-
# should catch cases explicitly
129-
# like:
130-
# while True:
131-
# if output == "Y":
132-
# with open("output.txt", "w") as f:
133-
# f.write(out_str)
134-
# print("\nOutput written to output.txt")
135-
# break
136-
# elif output == "N":
137-
# print("\nNo output written. Done!")
138-
# break
139-
# else:
140-
# print("Error! Expected Y/N")
115+
while True:
116+
if output == "Y":
117+
with open("output.txt", "w") as f:
118+
f.write(out_str)
119+
print("\nOutput written to output.txt")
120+
break
121+
elif output == "N":
122+
print("\nNo output written. Done!")
123+
break
124+
else:
125+
print("Error! Expected Y/N")

tool/optimizers/naive_optimizer.py

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -34,11 +34,11 @@
3434

3535
# Amino acid sequence dir to optimize:
3636
# hardcoded path
37-
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
37+
aa_dir = r"..\..\benchmark_sequences\aa"
3838

3939
# Output dir to store optimized seqs:
4040
# hardcoded path
41-
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\naive"
41+
out_dir = r"..\..\benchmark_sequences\naive"
4242

4343

4444
# Normalize probabilities for frequency if sum is not exactly 1.
@@ -48,10 +48,9 @@ def fix_p( p):
4848
return p
4949

5050
for entry in os.scandir(aa_dir):
51-
name = entry.name[0:-9] + "_dna"
51+
name = entry.replace("_aa.fasta", "_dna")
5252

53-
# Replace ambiguities with amino acids from IUPAC guidelines.
54-
# Might be nice to have a link to the guidelines?
53+
# Replace ambiguities with amino acids from IUPAC guidelines: https://www.bioinformatics.org/sms/iupac.html
5554
record = SeqIO.read(entry,"fasta")
5655
seq = record.seq.replace("B", random.choice(["D","N"])).replace("Z", random.choice(["E", "Q"]))
5756
seq_arr = []

tool/optimizers/super_naive_optimizer.py

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,12 +8,10 @@
88
from Bio.Seq import Seq
99

1010
# Amino acid sequence dir to optimize:
11-
# hardcoded path
12-
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
11+
aa_dir = r"..\..\benchmark_sequences\aa"
1312

1413
# Output dir to store optimized seqs:
15-
# hardcoded path
16-
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\super_naive"
14+
out_dir = r"..\..\benchmark_sequences\super_naive"
1715

1816
# Amino acid to codon table, outputs arr of codons:
1917
def aa2codons(seq : str) -> list:

0 commit comments

Comments
 (0)