Resolving suggestions from Nishant Code Review

RJain12 · RJain12 · commit 82d6cac92550 · 2021-09-27T01:14:14.000Z
diff --git a/README.md b/README.md
@@ -10,10 +10,11 @@
 
 </p>
 
-<h3 align="center"> ICOR: Improving Codon Optimization with Recurrent neural networks <h4>
+<h3 align="center"> ICOR: Improving Codon Optimization with Recurrent neural networks </h3>
 
 ---
 - [About](#About)
+- [Quickstart](#Quickstart)
 - [Assets](#Assets)
 - [Benchmark Results](#Benchmark-Results)
 - [Benchmark Sequences](#Benchmark-Sequences)
@@ -25,23 +26,20 @@
 - [Dependencies](#Dependencies)
 ---
 
+### About
+In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
+
 ### Quickstart
 I really like having a quickstart section that gives me a single command to install prereqs, a single command to run all tests (if any), and a single command to run the application. Something like:
 
 ```bash
 # Install prereqs
-pip install -r requriements.txt # or an install_prereqs.sh script if you have more diverse dependencies
-
-# run tests (if you decided to add tests in the future)
-pytest
+pip install -r requirements.txt
 
-# run models
-python ./tool/optimizers/brute_force_optimizer.py
+# Run ICOR optimizer
+python ./tool/optimizers/icor_optimizer.py
 ```
 
-### About
-In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the production of the resulting protein. Codon optimization of synthetic DNA sequences for maximum expression is an important segment of heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network (RNN) based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 42,000 non-redundant, robust genes that are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential information of genes to be learnt. Our tool can predict synonymous codons for synthetic genes towards optimal expression in E. coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. On a benchmark set of over 40 select DNA sequences, ICOR tool improved the codon adaptation index by 41.69% compared to the original sequence. Our resulting algorithm is provided as an open-source software package along with the benchmark set of sequences.
-
 ### Assets
 Assets including images and branding for the ICOR tool, hosted on the [biotools by Lattice Automation](https://tools.latticeautomation.com/) website.
 
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,3 @@
+Bio==1.0.3
+numpy==1.21.2
+onnxruntime==1.9.0
diff --git a/tool/optimizers/brute_force_optimizer.py b/tool/optimizers/brute_force_optimizer.py
@@ -3,12 +3,6 @@
 Goal of this is to find a combination of codons to maximize CAI (achieve 1.0 CAI).
 '''
 
-# Shouldn't hardcode profiling code, pass a flag to turn on profiling
-import timeit
-
-start = timeit.default_timer()
-
-
 # Import modules
 import os
 from Bio import SeqIO
@@ -19,11 +13,8 @@
 import re
 
 # Set input AA sequence directory and output for writing brute sequences
-# Shouldn't hard code these - should be relative paths like "../../benchmark_sequences/aa"
-# Also, I think most scientists will be using UNIX-like OSes.
-# I'm not super familiar with it, but have you tried running this in WSL to check for compatibility?
-aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
-out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\brute"
+aa_dir = "../../benchmark_sequences/aa"
+out_dir = "../../benchmark_sequences/brute"
 
 # Define weights for each codon
 weights = [0,1,0.647058823500000,0.500000000000000,0.794117647100000,0.0789473684200000,0.131578947400000,0.263157894700000,0.184210526300000,0.973684210500000,1,0.851851851900000,1,1,0.587301587300000,0.818181818200000,1,0.483870967700000,0.129032258100000,1,1,0.515151515200000,0.470588235300000,1,0.384615384600000,0.307692307700000,0.871794871800000,1,1,0.754385964900000,0.180000000000000,1,0.820000000000000,0.265306122400000,0.265306122400000,1,0.0816326530600000,0.224489795900000,0.204081632700000,0.333333333300000,1,1,1,0.754385964900000,1,0.392156862700000,0.333333333300000,0.235294117600000,0.576923076900000,1,0.576923076900000,0.500000000000000,0.615384615400000,0.576923076900000,0.619047619000000,0.357142857100000,0.428571428600000,1,1,1,0.724137931000000,1,0.444444444400000,0.750000000000000,0.583333333300000]
@@ -133,18 +124,15 @@ def aa2codons(seq : str) -> list:
 # Converts an amino acid to a random corresponding codon:
 for entry in os.scandir(aa_dir):
     # Read in the amino acid sequence:
-
-    # I'm guessing this is to strip the _aa.fasta, perhaps replace it with something like
-    # name = entry.replace("_aa.fasta", "_dna")
-    # to be more explicit
-    name = entry.name[0:-9] + "_dna"
+    name = entry.name.replace("_aa.fasta", "_dna")
     record = SeqIO.read(entry,'fasta')
+    
     masterlist = []
     bestcai = 0
     curcai = 0
-    z = 0
-    # What's the significance of 100000? Could we give it a descriptive name?
-    while z < 100000:
+    TOTAL_ITERATIONS = 100000
+
+    for curr_iteration in range(0, TOTAL_ITERATIONS):
         codonarr = []
         # Convert amino acid to codons:
         for i in record.seq:
@@ -154,24 +142,13 @@ def aa2codons(seq : str) -> list:
         # With our new codon array, calculate the CAI:
         cai = seq2cai(codonarr)
         if (cai > curcai):
-            bestcai = z
+            bestcai = curr_iteration
             curcai = cai
             print('new best cai ' + str(cai))
-        z += 1
-        print(z)
-
-    # ⬆
-    # Style nit, but it would be more pythonic to write
-    # TOTAL_ITERATIONS = 100000
-    # for curr_iteration in range(0, TOTAL_ITERATIONS):
-    #    ...
+        # Report progress as a 1-based count; range() already advances curr_iteration
+        print(curr_iteration + 1)
 
     # Write the codon array to a file:
     record.seq = Seq(re.sub('[^GATC]',"",str("".join(masterlist[bestcai])).upper()))
     complete_name = os.path.join(out_dir, name)
-    SeqIO.write(record, complete_name + ".fasta", "fasta")
-
-stop = timeit.default_timer()
-
-print('Time: ', stop - start)  
-#1:00
+    SeqIO.write(record, complete_name + ".fasta", "fasta")
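The timing code removed from `brute_force_optimizer.py` could instead sit behind a command-line flag, as the review comment asked. A minimal sketch, assuming a hypothetical `--profile` flag and a hypothetical wrapper named `run_with_optional_timing` (neither exists in the repository):

```python
import argparse
import timeit


def run_with_optional_timing(task, argv=None):
    """Run task() and print the elapsed time only when --profile is passed."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, chosen for illustration only
    parser.add_argument("--profile", action="store_true",
                        help="print elapsed wall-clock time")
    args = parser.parse_args(argv)  # argv=None falls back to sys.argv[1:]

    start = timeit.default_timer()
    result = task()
    if args.profile:
        print("Time: ", timeit.default_timer() - start)
    return result
```

Wrapping the optimizer's main loop in a function and passing it as `task` would keep the default run free of profiling output.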
diff --git a/tool/optimizers/icor_optimizer.py b/tool/optimizers/icor_optimizer.py
@@ -1,6 +1,4 @@
-# Define variables (must change!)
-# Good idea! - you can use relative paths as described in ./brute_force_optimizer.py
-model_path = r"C:\Users\risha\Desktop\icor-codon-optimization\tool\models\icor.onnx"
+model_path = "../../tool/models/icor.onnx"
 
 # Import packages
 from Bio.Seq import Seq
@@ -10,18 +8,17 @@
 import numpy as np
 from typing import List
 
-type = input("Welcome to ICOR! Are you optimizing an amino acid sequence (enter in 'aa' below) or a dna/codon sequence (enter in 'dna' below)?\n\n").strip().upper()
+sequence_type = input("Welcome to ICOR! Are you optimizing an amino acid sequence (enter in 'aa' below) or a dna/codon sequence (enter in 'dna' below)?\n\n").strip().upper()
 input_seq = input(
     "Enter the coding sequence only.\nEnter in 'demo' to use demo sequence.\n\n").strip().upper()
-# 'type' is a builtin function in python - I'd recommend renaming the var to sequence_type to avoid reassigning it
 
 # Load demo sequence (AKT1 amino acid seq)
-if type == 'AA':
+if sequence_type == 'AA':
     if input_seq == 'DEMO':
         input_seq = "MSDVAIVKEGWLHKRGEYIKTWRPRYFLLKNDGTFIGYKERPQDVDQREAPLNNFSVAQCQLMKTERPRPNTFIIRCLQWTTVIERTFHVETPEEREEWTTAIQTVADGLKKQEEEEMDFRSGSPSDNSGAEEMEVSLAKPKHRVTMNEFEYLKLLGKGTFGKVILVKEKATGRYYAMKILKKEVIVAKDEVAHTLTENRVLQNSRHPFLTALKYSFQTHDRLCFVMEYANGGELFFHLSRERVFSEDRARFYGAEIVSALDYLHSEKNVVYRDLKLENLMLDKDGHIKITDFGLCKEGIKDGATMKTFCGTPEYLAPEVLEDNDYGRAVDWWGLGVVMYEMMCGRLPFYNQDHEKLFELILMEEIRFPRTLGPEAKSLLSGLLKKDPKQRLGGGSEDAKEIMQHRFFAGIVWQHVYEKKLSPPFKPQVTSETDTRYFDEEFTAQMITITPPDQDDSMECVDSERRPHFPQFSYSASGTA*"
     if not input_seq.startswith('M') or not input_seq.endswith('*'):
         sys.exit('Invalid amino acid sequence detected.\nThe sequence must start with M and end with * because ICOR only optimizes the codon-sequence region!\nPlease try again.\nRead more: http://www.hgvs.org/mutnomen/references.html#aalist')
-elif type == 'DNA':
+elif sequence_type == 'DNA':
     if input_seq == 'DEMO':
         input_seq = "ATGAGCGACGTGGCTATTGTGAAGGAGGGTTGGCTGCACAAACGAGGGGAGTACATCAAGACCTGGCGGCCACGCTACTTCCTCCTCAAGAATGATGGCACCTTCATTGGCTACAAGGAGCGGCCGCAGGATGTGGACCAACGTGAGGCTCCCCTCAACAACTTCTCTGTGGCGCAGTGCCAGCTGATGAAGACGGAGCGGCCCCGGCCCAACACCTTCATCATCCGCTGCCTGCAGTGGACCACTGTCATCGAACGCACCTTCCATGTGGAGACTCCTGAGGAGCGGGAGGAGTGGACAACCGCCATCCAGACTGTGGCTGACGGCCTCAAGAAGCAGGAGGAGGAGGAGATGGACTTCCGGTCGGGCTCACCCAGTGACAACTCAGGGGCTGAAGAGATGGAGGTGTCCCTGGCCAAGCCCAAGCACCGCGTGACCATGAACGAGTTTGAGTACCTGAAGCTGCTGGGCAAGGGCACTTTCGGCAAGGTGATCCTGGTGAAGGAGAAGGCCACAGGCCGCTACTACGCCATGAAGATCCTCAAGAAGGAAGTCATCGTGGCCAAGGACGAGGTGGCCCACACACTCACCGAGAACCGCGTCCTGCAGAACTCCAGGCACCCCTTCCTCACAGCCCTGAAGTACTCTTTCCAGACCCACGACCGCCTCTGCTTTGTCATGGAGTACGCCAACGGGGGCGAGCTGTTCTTCCACCTGTCCCGGGAGCGTGTGTTCTCCGAGGACCGGGCCCGCTTCTATGGCGCTGAGATTGTGTCAGCCCTGGACTACCTGCACTCGGAGAAGAACGTGGTGTACCGGGACCTCAAGCTGGAGAACCTCATGCTGGACAAGGACGGGCACATTAAGATCACAGACTTCGGGCTGTGCAAGGAGGGGATCAAGGACGGTGCCACCATGAAGACCTTTTGCGGCACACCTGAGTACCTGGCCCCCGAGGTGCTGGAGGACAATGACTACGGCCGTGCAGTGGACTGGTGGGGGCTGGGCGTGGTCATGTACGAGATGATGTGCGGTCGCCTGCCCTTCTACAACCAGGACCATGAGAAGCTTTTTGAGCTCATCCTCATGGAGGAGATCCGCTTCCCGCGCACGCTTGGTCCCGAGGCCAAGTCCTTGCTTTCAGGGCTGCTCAAGAAGGACCCCAAGCAGAGGCTTGGCGGGGGCTCCGAGGACGCCAAGGAGATCATGCAGCATCGCTTCTTTGCCGGTATCGTGTGGCAGCACGTGTACGAGAAGAAGCTCAGCCCACCCTTCAAGCCCCAGGTCACGTCGGAGACTGACACCAGGTATTTTGATGAGGAGTTCACGGCCCAGATGATCACCATCACACCACCTGACCAAGATGACAGCATGGAGTGTGTGGACAGCGAGCGCAGGCCCCACTTCCCCCAGTTCTCCTACTCGGCCAGCGGCACGGCCTGA"
     if 'U' in input_seq:
@@ -34,17 +31,15 @@
     # ICOR accepts the amino acid sequence, so we translate the DNA sequence to amino acid sequence:
     input_seq = Seq(input_seq)
     input_seq = input_seq.translate()
-# It's good to handle all cases of your if/elif. Something like
-# else:
-#     sys.exit(f"Invalid sequence type {sequence_type}. Expected 'aa' or 'dna'")
+else:
+    sys.exit(f"Invalid sequence type {sequence_type}. Expected 'aa' or 'dna'")
 
 
 print(input_seq)
 # Define categorical labels from when model was trained.
 labels = ['AAA', 'AAC','AAG','AAT','ACA','ACG','ACT','AGC','ATA','ATC','ATG','ATT','CAA','CAC','CAG','CCG','CCT','CTA','CTC','CTG','CTT','GAA','GAT','GCA','GCC','GCG','GCT','GGA','GGC','GTC','GTG','GTT','TAA','TAT','TCA','TCG','TCT','TGG','TGT','TTA','TTC','TTG','TTT','ACC','CAT','CCA','CGG','CGT','GAC','GAG','GGT','AGT','GGG','GTA','TGC','CCC','CGA','CGC','TAC','TAG','TCC','AGA','AGG','TGA']
 
 # Define aa to integer table
-# Your 'seq: str' type definition is broken by your reassignment of 'str' below
 def aa2int(seq: str) -> List[int]:
     _aa2int = {
         'A': 1,
@@ -84,9 +79,9 @@ def aa2int(seq: str) -> List[int]:
 aa_placement = aa2int(input_seq)
 
 # One-hot encode the amino acid sequence:
-i = 0
+
 # style nit: more pythonic to write for i in range(0, len(aa_placement)):
-while i < len(aa_placement):
+for i in range(0, len(aa_placement)):
     oh_array[aa_placement[i], i] = 1
     i += 1
 
@@ -109,32 +104,22 @@ def aa2int(seq: str) -> List[int]:
 for pred in pred_onx[0]:
     pred_indices.append(np.argmax(pred))
 
-# Likewise, 'str' is a bultin type in python
-# I'd rename to 'output_str' or the like
 out_str = ""
 for index in pred_indices:
-    str += labels[index]
-print('==== OUTPUT ====\n' + str)
+    out_str += labels[index]
+print('==== OUTPUT ====\n' + out_str)
 
 output = input(
     "Would you like to write this into a file? (Y or N)\n\n").strip().upper()
 
-if (output == 'Y'):
-    with open('output.txt', 'w') as f:
-        f.write(str)
-    print('\nOutput written to output.txt')
-else:
-    print('\nNo output written. Done!')
-# should catch cases explicitly
-# like:
-# while True:
-#     if output == "Y":
-#         with open("output.txt", "w") as f:
-#         f.write(out_str)
-#         print("\nOutput written to output.txt")
-#         break
-#     elif output == "N":
-#         print("\nNo output written. Done!")
-#         break
-#     else:
-#         print("Error! Expected Y/N")
+while True:
+    if output == "Y":
+        with open("output.txt", "w") as f:
+            f.write(out_str)
+        print("\nOutput written to output.txt")
+        break
+    elif output == "N":
+        print("\nNo output written. Done!")
+        break
+    else:
+        output = input("Error! Expected Y/N\n\n").strip().upper()
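A Y/N confirmation loop like the one in the hunk above only terminates if the answer is read again after an invalid reply. A minimal sketch of that pattern as a reusable helper; `ask_yes_no` and its injectable `read` parameter are hypothetical names for illustration, not part of ICOR:

```python
from typing import Callable


def ask_yes_no(prompt: str, read: Callable[[str], str] = input) -> bool:
    """Keep asking until the reply is Y or N (case-insensitive)."""
    while True:
        # Re-read on every pass so an invalid reply cannot loop forever
        answer = read(prompt).strip().upper()
        if answer == "Y":
            return True
        if answer == "N":
            return False
        print("Error! Expected Y/N")
```

Passing `read` explicitly makes the loop testable without a terminal; in the script it would default to the built-in `input`.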
diff --git a/tool/optimizers/naive_optimizer.py b/tool/optimizers/naive_optimizer.py
@@ -34,11 +34,11 @@
 
 # Amino acid sequence dir to optimize:
 # hardcoded path
-aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
+aa_dir = "../../benchmark_sequences/aa"
 
 # Output dir to store optimized seqs:
 # hardcoded path
-out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\naive"
+out_dir = "../../benchmark_sequences/naive"
 
 
 # Normalize probabilities for frequency if sum is not exactly 1.
@@ -48,10 +48,9 @@ def fix_p( p):
     return p
 
 for entry in os.scandir(aa_dir):
-    name = entry.name[0:-9] + "_dna"
+    name = entry.name.replace("_aa.fasta", "_dna")
 
-    # Replace ambiguities with amino acids from IUPAC guidelines.
-    # Might be nice to have a link to the guidelines?
+    # Replace ambiguities with amino acids from IUPAC guidelines: https://www.bioinformatics.org/sms/iupac.html
     record = SeqIO.read(entry,"fasta")
     seq = record.seq.replace("B", random.choice(["D","N"])).replace("Z", random.choice(["E", "Q"]))
     seq_arr = []
diff --git a/tool/optimizers/super_naive_optimizer.py b/tool/optimizers/super_naive_optimizer.py
@@ -8,12 +8,10 @@
 from Bio.Seq import Seq
 
 # Amino acid sequence dir to optimize:
-# hardcoded path
-aa_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\aa"
+aa_dir = "../../benchmark_sequences/aa"
 
 # Output dir to store optimized seqs:
-# hardcoded path
-out_dir = r"C:\Users\risha\Desktop\icor-codon-optimization\benchmark_sequences\super_naive"
+out_dir = "../../benchmark_sequences/super_naive"
 
 # Amino acid to codon table, outputs arr of codons:
 def aa2codons(seq : str) -> list:
diff --git a/tool/scripts/convert_to_cds.ipynb b/tool/scripts/convert_to_cds.ipynb
diff --git a/tool/scripts/csv_to_seqs.py b/tool/scripts/csv_to_seqs.py
diff --git a/tool/scripts/reformat_seqs.py b/tool/scripts/reformat_seqs.py
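
The relative paths introduced in this commit resolve against the current working directory, so they depend on where the script is launched from (the Quickstart command runs from the repository root, for example). One way to avoid that is to anchor paths at the script's own location with `pathlib`. This sketch assumes the optimizers stay in `tool/optimizers/`, two levels below the repository root; `repo_paths` is a hypothetical helper, not code from the repository:

```python
from pathlib import Path


def repo_paths(script_file: str) -> dict:
    """Build data paths relative to the script file, not the working directory."""
    # tool/optimizers/<script>.py -> parents[2] is the repository root (assumed layout)
    repo_root = Path(script_file).resolve().parents[2]
    return {
        "aa_dir": repo_root / "benchmark_sequences" / "aa",
        "out_dir": repo_root / "benchmark_sequences" / "brute",
        "model_path": repo_root / "tool" / "models" / "icor.onnx",
    }
```

Inside an optimizer script this would be called as `repo_paths(__file__)`, and the resulting `Path` objects work on both Windows and UNIX-like systems.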
