Skip to content

Commit 2db1ba5

Browse files
committed
Code Review
1 parent 48af3a6 commit 2db1ba5

10 files changed

Lines changed: 73 additions & 6 deletions

README.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,20 @@
2525
- [Dependencies](#Dependencies)
2626
---
2727

28+
### Quickstart
29+
I really like having a quickstart section that gives me a single command to install prereqs, a single command to run all tests (if any), and a single command to run the application. Something like:
30+
31+
```bash
32+
# Install prereqs
33+
pip install -r requriements.txt # or an install_prereqs.sh script if you have more diverse dependencies
34+
35+
# run tests (if you decided to add tests in the future)
36+
pytest
37+
38+
# run models
39+
python ./tool/optimizers/brute_force_optimizer.py
40+
```
41+
2842
### About
2943
In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
3044

tool/optimizers/brute_force_optimizer.py

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
Goal of this is to find a combination of codons to maximize CAI (achieve 1.0 CAI).
44
'''
55

6+
# Shouldn't hardcode profiling code, pass a flag to turn on profiling
67
import timeit
78

89
start = timeit.default_timer()
@@ -18,6 +19,9 @@
1819
import re
1920

2021
# Set input AA sequence directory and output for writing brute sequences
22+
# Shouldn't hard code these - should be relative paths like "../../benchmark_sequences/aa"
23+
# Also, I think most scientists will be using UNIX-like OSes.
24+
# I'm not super familiar with it, but have you tried running this in WSL to check for compatibility?
2125
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
2226
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\brute"
2327

@@ -129,13 +133,18 @@ def aa2codons(seq : str) -> list:
129133
# Converts an amino acid to a random corresponding codon:
130134
for entry in os.scandir(aa_dir):
131135
# Read in the amino acid sequence:
136+
137+
# I'm guessing this is to strip the _aa.fasta, perhaps replace it with something like
138+
# name = entry.replace("_aa.fasta", "_dna")
139+
# to be more explicit
132140
name = entry.name[0:-9] + "_dna"
133141
record = SeqIO.read(entry,'fasta')
134142
masterlist = []
135143
bestcai = 0
136144
curcai = 0
137145
z = 0
138-
while (z < 100000):
146+
# What's the significance of 100000? Could we give it a descriptive name?
147+
while z < 100000:
139148
codonarr = []
140149
# Convert amino acid to codons:
141150
for i in record.seq:
@@ -150,6 +159,13 @@ def aa2codons(seq : str) -> list:
150159
print('new best cai ' + str(cai))
151160
z += 1
152161
print(z)
162+
163+
# ⬆
164+
# Style nit, but it would be more pythonic to write
165+
# TOTAL_ITERATIONS = 100000
166+
# for curr_iteration in range(0, TOTAL_ITERATIONS):
167+
# ...
168+
153169
# Write the codon array to a file:
154170
record.seq = Seq(re.sub('[^GATC]',"",str("".join(masterlist[bestcai])).upper()))
155171
complete_name = os.path.join(out_dir, name)

tool/optimizers/icor_optimizer.py

Lines changed: 26 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
# Define variables (must change!)
2+
# Good idea! - you can use relative paths as described in ./brute_force_optimizer.py
23
model_path = r"C:\Users\risha\Desktop\icor-codon-optimization\tool\models\icor.onnx"
34

45
# Import packages
@@ -7,10 +8,12 @@
78
import sys
89
import onnxruntime as rt
910
import numpy as np
11+
from typing import List
1012

1113
type = input("Welcome to ICOR! Are you optimizing an amino acid sequence (enter in 'aa' below) or a dna/codon sequence (enter in 'dna' below)?\n\n").strip().upper()
1214
input_seq = input(
1315
"Enter the coding sequence only.\nEnter in 'demo' to use demo sequence.\n\n").strip().upper()
16+
# 'type' is a builtin function in python - I'd recommend renaming the var to sequence_type to avoid reassigning it
1417

1518
# Load demo sequence (AKT1 amino acid seq)
1619
if type == 'AA':
@@ -31,13 +34,18 @@
3134
# ICOR accepts the amino acid sequence, so we translate the DNA sequence to amino acid sequence:
3235
input_seq = Seq(input_seq)
3336
input_seq = input_seq.translate()
37+
# It's good to handle all cases of your if/elif. Something like
38+
# else:
39+
# sys.exit(f"Invalid sequence type {sequence_type}. Expected 'aa' or 'dna'")
40+
3441

3542
print(input_seq)
3643
# Define categorical labels from when model was trained.
3744
labels = ['AAA', 'AAC','AAG','AAT','ACA','ACG','ACT','AGC','ATA','ATC','ATG','ATT','CAA','CAC','CAG','CCG','CCT','CTA','CTC','CTG','CTT','GAA','GAT','GCA','GCC','GCG','GCT','GGA','GGC','GTC','GTG','GTT','TAA','TAT','TCA','TCG','TCT','TGG','TGT','TTA','TTC','TTG','TTT','ACC','CAT','CCA','CGG','CGT','GAC','GAG','GGT','AGT','GGG','GTA','TGC','CCC','CGA','CGC','TAC','TAG','TCC','AGA','AGG','TGA']
3845

3946
# Define aa to integer table
40-
def aa2int(seq: str) -> list:
47+
# Your 'seq: str' type definition is broken by your reassignment of 'str' below
48+
def aa2int(seq: str) -> List[int]:
4149
_aa2int = {
4250
'A': 1,
4351
'R': 2,
@@ -77,6 +85,7 @@ def aa2int(seq: str) -> list:
7785

7886
# One-hot encode the amino acid sequence:
7987
i = 0
88+
# style nit: more pythonic to write for i in range(0, len(aa_placement)):
8089
while i < len(aa_placement):
8190
oh_array[aa_placement[i]-1, i] = 1
8291
i += 1
@@ -100,7 +109,9 @@ def aa2int(seq: str) -> list:
100109
for pred in pred_onx[0]:
101110
pred_indices.append(np.argmax(pred))
102111

103-
str = ""
112+
# Likewise, 'str' is a bultin type in python
113+
# I'd rename to 'output_str' or the like
114+
out_str = ""
104115
for index in pred_indices:
105116
str += labels[index]
106117
print('==== OUTPUT ====\n' + str)
@@ -114,3 +125,16 @@ def aa2int(seq: str) -> list:
114125
print('\nOutput written to output.txt')
115126
else:
116127
print('\nNo output written. Done!')
128+
# should catch cases explicitly
129+
# like:
130+
# while True:
131+
# if output == "Y":
132+
# with open("output.txt", "w") as f:
133+
# f.write(out_str)
134+
# print("\nOutput written to output.txt")
135+
# break
136+
# elif output == "N":
137+
# print("\nNo output written. Done!")
138+
# break
139+
# else:
140+
# print("Error! Expected Y/N")

tool/optimizers/naive_optimizer.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,9 +33,11 @@
3333
}
3434

3535
# Amino acid sequence dir to optimize:
36+
# hardcoded path
3637
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
3738

3839
# Output dir to store optimized seqs:
40+
# hardcoded path
3941
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\naive"
4042

4143

@@ -49,6 +51,7 @@ def fix_p( p):
4951
name = entry.name[0:-9] + "_dna"
5052

5153
# Replace ambiguities with amino acids from IUPAC guidelines.
54+
# Might be nice to have a link to the guidelines?
5255
record = SeqIO.read(entry,"fasta")
5356
seq = record.seq.replace("B", random.choice(["D","N"])).replace("Z", random.choice(["E", "Q"]))
5457
seq_arr = []

tool/optimizers/super_naive_optimizer.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,9 +8,11 @@
88
from Bio.Seq import Seq
99

1010
# Amino acid sequence dir to optimize:
11+
# hardcoded path
1112
aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
1213

1314
# Output dir to store optimized seqs:
15+
# hardcoded path
1416
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\super_naive"
1517

1618
# Amino acid to codon table, outputs arr of codons:
@@ -58,4 +60,4 @@ def aa2codons(seq : str) -> list:
5860
complete_name = os.path.join(out_dir, name)
5961

6062
# Save the super naively optimized DNA sequence:
61-
SeqIO.write(record, complete_name + ".fasta", "fasta")
63+
SeqIO.write(record, complete_name + ".fasta", "fasta")

tool/scripts/convert_to_cds.ipynb

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,8 @@
7373
"for entry in os.scandir(dir):\r\n",
7474
" record = SeqIO.read(entry,'fasta')\r\n",
7575
" if record.name.startswith(\"NM\"):\r\n",
76+
" # this is fine, but recent python versions support f-strings\n",
77+
" # https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python\n",
7678
" browser.get(\"https://www.ncbi.nlm.nih.gov/nuccore/%s\" % record.name)\r\n",
7779
" time.sleep(1)\r\n",
7880
" cds = browser.find_elements_by_class_name('feature')\r\n",
@@ -141,6 +143,7 @@
141143
"from Bio.Seq import Seq\r\n",
142144
"import os\r\n",
143145
"\r\n",
146+
"# hardcoded paths\n",
144147
"dna_dir = r\"C:\\Users\\risha\\Desktop\\icor-codon-optimization\\benchmark_sequences\\dna\"\r\n",
145148
"aa_dir = r\"C:\\Users\\risha\\Desktop\\icor-codon-optimization\\benchmark_sequences\\aa\"\r\n",
146149
"\r\n",

tool/scripts/csv_to_seqs.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@
1717
from Bio.Seq import Seq
1818

1919
#output directory to write sequences
20+
# hardcoded path
2021
out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\genscript"
2122

2223
#iterate through the csv file and write sequences to the output directory
@@ -31,4 +32,4 @@
3132
description="blank",
3233
)
3334
print(record)
34-
SeqIO.write(record,os.path.join(out_dir, file_name),'fasta')
35+
SeqIO.write(record,os.path.join(out_dir, file_name),'fasta')

tool/scripts/reformat_seqs.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
import random
1010

1111
# Change this to the directory where your files are stored.
12+
# hardcoded path
1213
aa_directory = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
1314
dna_directory = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\dna"
1415

@@ -32,5 +33,6 @@
3233
#if there are sequences that are not divisible by three, then truncate them:
3334
num = len(record.seq) % 3
3435
print("Warning: truncated" + entry.name + num)
36+
# I'd suggest checking for this explicitly in your code and showing the user this warning/error
3537
#warning: if sequences are being truncated, they are likely not formatted correctly.
36-
#all CDS should be divisible by three because they are all in frame.
38+
#all CDS should be divisible by three because they are all in frame.

tool/scripts/run_benchmark.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -113,6 +113,7 @@
113113
"#benchmark_sequences\\naive contains the naive sequences so it was used for those.\r\n",
114114
"#benchmark_sequences\\ICOR contains the ICOR sequences so it was used for those.\r\n",
115115
"\r\n",
116+
"# hardcoded paths\n",
116117
"dir = r\"C:\\Users\\risha\\Desktop\\icor-codon-optimization\\benchmark_sequences\\brute\"\r\n",
117118
"summary_name = \"brute\"\r\n",
118119
"\r\n",

tool/scripts/run_icor_from_mat.ipynb

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,7 @@
3333
"import numpy\r\n",
3434
"import onnxruntime as rt\r\n",
3535
"import mat73\r\n",
36+
"# hardcoded paths\n",
3637
"train_data = mat73.loadmat(r'C:\\Users\\risha\\Desktop\\ICOR MATLAB\\allseqsworkspace71921.mat')\r\n",
3738
"\r\n",
3839
"# Define categorical labels from when model was trained.\r\n",

0 commit comments

Comments
 (0)