Commit 673c912

committed
code untar in directory
to keep track of code changes
1 parent c1510f9 commit 673c912

12 files changed

Lines changed: 2842 additions & 0 deletions

File tree

RAILS_v1.1/RAILS

Lines changed: 1046 additions & 0 deletions
Large diffs are not rendered by default.

RAILS_v1.1/cobbler.pl

Lines changed: 530 additions & 0 deletions
Large diffs are not rendered by default.

RAILS_v1.1/readme.md

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
1+
## RAILS v1.1 and Cobbler v0.2 Rene L. Warren, 2014-2016
2+
email: rwarren at bcgsc.ca
3+
4+
### Name
5+
6+
RAILS: Radial Assembly Improvement by Long Sequence Scaffolding
7+
Cobbler: Gap-filling with long sequences
8+
9+
10+
### Description
11+
12+
RAILS and Cobbler are genomics applications for scaffolding and automated finishing of genome assemblies with long DNA sequences.
13+
They can be used to scaffold & finish high-quality draft genome assemblies with any long, preferably high-quality, sequences such as scaftigs/contigs from another genome draft.
14+
15+
They both rely on accurate, long DNA sequences to patch gaps in existing genome assembly drafts.
16+
17+
Cobbler is a utility to automatically patch gaps (ambiguous regions in a draft assembly, represented by N's).
18+
It does so by first aligning the long sequences to the assembly, tallying the alignments and replacing N's with the sequences from these long DNA sequences.
19+
20+
RAILS is an all-in-one scaffolder and gap-filler. Its process is similar to that of Cobbler. It scaffolds your genome draft with the help of long DNA sequences (contig sequences are ordered/oriented using alignment information). The newly created gaps are automatically filled with bases from the provided long DNA sequences.
21+
22+
You can test the software by executing "runme.sh" in the test folder. A simulated SARS genome assembly is provided to test the software.
23+
24+
### Implementation and requirements
25+
26+
RAILS and Cobbler are implemented in Perl and run on any OS where Perl is installed.
27+
28+
29+
### Community guidelines:
30+
31+
I encourage the community to contribute to the development of this software, by providing suggestions for improving the code and/or directly contributing to the open source code for these tools. Users and developers may report software issues, bug fix requests, comments, etc, at <https://github.com/warrenlr/RAILS>
32+
33+
34+
### Install
35+
36+
Download the tarball, then gunzip and extract the files on your system using:
37+
38+
gunzip rails_v1-1.tar.gz
39+
tar -xvf rails_v1-1.tar
40+
41+
Alternatively, individual tools are available within the GitHub repository.
42+
43+
44+
### Dependencies
45+
46+
Make sure you have installed bwa (Version: 0.7.15-r1140) and that it is in your PATH.
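
A quick way to confirm the dependency before running the pipeline is sketched below. The helper name `require_tool` is hypothetical and not part of RAILS. It only checks that an executable can be found on the PATH and does not verify the bwa version.

```shell
#!/bin/bash
# Hypothetical helper (not part of RAILS): return 0 if the named
# executable can be found on PATH, 1 otherwise.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "Error: '$1' not found in PATH" >&2
  return 1
}

# Report where bwa was found, if it was found at all.
if require_tool bwa; then
  echo "bwa found at: $(command -v bwa)"
fi
```

Note that runRAILS.sh prepends its own hard-coded directory to PATH. That line should be edited to point to the location of your bwa executables.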
47+
48+
49+
### Test data
50+
51+
Go to ./test
52+
(cd test)
53+
54+
1. SARS:
55+
execute runme.sh
56+
(./runme.sh)
57+
58+
2. Human:
59+
execute runmeHuman.sh (will take a while to run)
60+
(./runmeHuman.sh)
61+
62+
63+
### Usage
64+
65+
./runRAILS.sh
66+
Usage: runRAILS.sh <FASTA assembly .fa> <FASTA long sequences .fa> <anchoring sequence length eg. 250> <min sequence identity 0.95>
67+
68+
This pipeline will:
69+
1. reformat the assembly file $1
70+
2. reformat the long sequence file $2 (sequences are renamed)
71+
3. Build a database index with bwa
72+
4. Align the reformatted long sequences to your re-formatted baseline assembly
73+
5. Run Cobbler to gap-fill regions of ambiguity
74+
6. Reformat Cobbler's .fa file
75+
7. Build a database index of it with bwa
76+
8. Align the reformatted long sequences to your re-formatted cobbler assembly
77+
9. Run RAILS to generate a newly scaffolded assembly draft
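
The steps above leave several intermediate files next to the inputs. The sketch below lists the names that runRAILS.sh derives from its first three arguments. The naming patterns are copied from the script, while the function name `rails_file_names` and the example file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the file names runRAILS.sh derives from
# the assembly ($1), the long sequences ($2) and the anchor length ($3).
rails_file_names() {
  local asm="$1" long="$2" anchor="$3"
  echo "${asm}-formatted.fa"                                # assembly split at Ns (steps 1, 3)
  echo "${long}-formatted.fa"                               # renamed long sequences (step 2)
  echo "${long}_vs_${asm}_gapfilling.sam"                   # alignments for Cobbler (step 4)
  echo "${long}_vs_${asm}_${anchor}_gapsFill.fa"            # Cobbler output read by the script (step 5)
  echo "${long}_vs_${asm}_${anchor}_gapsFill-formatted.fa"  # renamed Cobbler output (steps 6, 7)
  echo "${long}_vs_${asm}_scaffolding.sam"                  # alignments for RAILS (step 8)
  echo "${long}_vs_${asm}_${anchor}_rails"                  # base name passed to RAILS with -b (step 9)
}

rails_file_names assembly.fa longseqs.fa 250
```

Because the names embed both input file names, running the pipeline on inputs in the current directory keeps all outputs together.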
78+
79+
Usage: ./cobbler.pl [v0.2]
80+
-f Assembled Sequences to further scaffold (Multi-Fasta format, required)
81+
-q Long Sequences queried (Multi-Fasta format, required)
82+
-s SAM file
83+
-d Anchoring bases on contig edges (ie. minimum required alignment size on contigs, default -d 1000, optional)
84+
-i Minimum sequence identity, default -i 0.9, optional
85+
-t LIST of names/header, long sequences to avoid using for merging/gap-filling scaffolds (optional)
86+
-b Base name for your output files (optional)
87+
-v Runs in verbose mode (-v 1 = yes, default = no, optional)
88+
89+
Usage: ./RAILS [v1.1]
90+
-f Assembled Sequences to further scaffold (Multi-Fasta format, required)
91+
-q Long Sequences queried (Multi-Fasta format, required)
92+
-s SAM file
93+
-d Anchoring bases on contig edges (ie. minimum required alignment size on contigs, default -d 1000, optional)
94+
-i Minimum sequence identity, default -i 0.9, optional
95+
-t LIST of names/header, long sequences to avoid using for merging/gap-filling scaffolds (optional)
96+
-b Base name for your output files (optional)
97+
-v Runs in verbose mode (-v 1 = yes, default = no, optional)
98+
99+
100+
### How it works
101+
102+
The pipeline is detailed in the provided script runRAILS.sh.
103+
104+
Cobbler's process:
105+
106+
The assembly draft sequence supplied to Cobbler is first broken up at the ambiguous regions of the assembly (Ns) to create scaftigs.
107+
In runRAILS.sh, these scaftigs are renamed, tracking their scaffold of origin (renumbered incrementally) and their position within it (also numbered incrementally).
108+
A bwa index is created and the long sequence file, also re-numbered, is aligned to the scaftigs.
109+
Cobbler is supplied with the alignment file (-s SAM file) and the long sequence file (-q option), along with the minimum length of anchoring bases (-d) that must align at the edge of scaftigs and the minimum sequence identity of the alignment (-i). When one or more long sequences align unambiguously to the 3' end of a scaftig and the 5' end of its neighbour, the gap is patched with the sequence of that long sequence. If no long sequences are suitable, or the -d and -i conditions are not met, the original Ns are placed back between those scaftigs.
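
The first step of this process, breaking a scaffold at its runs of Ns and naming the pieces by scaffold and position, can be sketched as follows. This is an illustration rather than the code used by runRAILS.sh: the function name `split_at_gaps` is hypothetical, empty pieces are skipped, and the reported length counts bases only.

```shell
#!/bin/bash
# Hypothetical helper: split one scaffold sequence at runs of N/n and
# print each piece as a FASTA record named >wga<scaffold>.<piece>,<length>.
split_at_gaps() {
  local scaffold_num="$1" sequence="$2"
  printf '%s\n' "$sequence" |
    tr -s 'Nn' '\n\n' |                     # each run of Ns becomes one line break
    awk -v s="$scaffold_num" 'length($0) > 0 {
      n++                                   # position of the scaftig in its scaffold
      printf ">wga%s.%d,%d\n%s\n", s, n, length($0), $0
    }'
}

split_at_gaps 1 ACGTNNNNGGCCA
```

For the input above, the two scaftigs are printed as `>wga1.1,4` (ACGT) and `>wga1.2,5` (GGCCA).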
110+
111+
RAILS process:
112+
113+
In RAILS, the process is similar to that of Cobbler, except that the draft assembly is not broken up at Ns, since the goal is to merge distinct sequences into larger ones. Long sequences are aligned to the draft assembly sequences, orienting and ordering them and simultaneously filling the gaps between them with DNA bases from the long sequences.
114+
115+
Scaffolding in RAILS is done using the LINKS scaffolder code (Warren et al. 2015), the unpublished scaffolding engine in the widely used SSAKE assembler (Warren et al. 2007), and the foundation of the SSPACE-LongRead scaffolder (Boetzer and Pirovano, 2014).
116+
117+
Output: For both Cobbler and RAILS, a summary of the gaps closed and their lengths is provided (.tsv) as a text file.
118+
A fasta file (.fa) of the finished and/or scaffolded draft is generated for both along with a log file reporting basic success statistics.
119+
120+
121+
Boetzer M, Pirovano W. 2014. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics.15:211. DOI: 10.1186/1471-2105-15-211
122+
123+
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, Birol I. 2015. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience 4:35. DOI: 10.1186/s13742-015-0076-3
124+
125+
Warren RL, Sutton GG, Jones SJM, Holt RA. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 23(4):500-501. DOI: 10.1093/bioinformatics/btl629
126+
127+
128+
### Runs on the human genome
129+
130+
On a human draft assembly, Cobbler patched over 65% of the gaps using 1, 2.5, 5 and 15 kb long DNA sequences simulated from the human genome reference. The Pearson correlation between the predicted gap sizes and the sizes of the patched gaps is R=0.8150.
131+
132+
133+
**Table 1.** Patching gaps with Cobbler using 1, 2.5, 5 and 15 kbp long sequences simulated from human genome reference GRCh38.
134+
135+
Metric | Value
136+
---- | ----
137+
Total gaps | 148,091
138+
Number of gaps patched | 95,523
139+
Proportion of gaps patched | 65.1%
140+
Average patched gap length (bp) | 343.39
141+
Patched gap length st.dev +/- | 931.12
142+
Total bases added | 32,801,755
143+
Largest gap resolved (bp) | 13,662
144+
Shortest gap resolved (bp) | 1
145+
146+
RAILS was used to further contiguate the human baseline assembly draft and automatically close gaps within it:
147+
148+
**Table 2.** RAILS scaffolding and gap-filling summary on a human assembly baseline, using 1, 2.5, 5 and 15 kbp long sequences simulated from human genome reference GRCh38.
149+
150+
Metric | Value
151+
---- | ----
152+
Number of merges induced | 6,029
153+
Average closed gap length (bp) | 1,136.71
154+
Closed gap length st.dev +/- | 2,511.69
155+
Total bases added | 6,853,222
156+
Largest gap resolved (bp) | 14,471
157+
Shortest gap resolved (bp) | 1
158+
159+
6,029 merges resulted from RAILS scaffolding of the baseline human assembly draft (1,695 >= 500 bp).
160+
The scaffold N50 length increased from 5.6 to 7.3 Mbp, a 30% increase in N50 length.
161+
162+
**Table 3.** Assembly statistics for human genome scaffolding and finishing after Cobbler and RAILS.
163+
164+
n:500 | n:N50 | n:NG50 | NG50 | N50 | E-size | max | sum | name
165+
------ | ----- | ----- | --------- | --------- | --------- | ------- | ------- | ---------
166+
65,905 | 145 | 164 | 5,144,025 | 5,597,244 | 7,101,538 | 26.41e6 | 2.794e9 | baseline
167+
65,905 | 145 | 161 | 5,312,196 | 5,658,133 | 7,175,808 | 26.66e6 | 2.827e9 | cobbler
168+
64,210 | 113 | 125 | 6,935,685 | 7,266,542 | 9,007,414 | 32.14e6 | 2.836e9 | RAILS
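
For readers unfamiliar with the N50 column, the sketch below computes it under a common definition: the sequence length at which the running total of lengths, taken from longest to shortest, first reaches half of the summed length. The readme does not name the tool used to produce Table 3, so this is an illustration and the function name `n50` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: compute N50 from sequence lengths, one per line
# on stdin. Lengths are sorted in decreasing order and the length at
# which the running total first reaches half of the sum is printed.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Five sequences totalling 20 bp: 6 + 5 = 11 reaches half, so N50 is 5.
printf '%s\n' 2 3 4 5 6 | n50
```

With the N50 values in Table 3, the ratio 7,266,542 / 5,597,244 is about 1.30, which matches the 30% increase reported for RAILS over the baseline.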
169+
170+
171+
### License Preamble
172+
173+
RAILS and Cobbler Copyright (c) 2014-2016 British Columbia Cancer Agency Branch. All rights reserved.
174+
175+
RAILS and Cobbler are released under the GNU General Public License v3
176+
177+
This program is free software: you can redistribute it and/or modify
178+
it under the terms of the GNU General Public License as published by
179+
the Free Software Foundation, version 3.
180+
181+
This program is distributed in the hope that it will be useful,
182+
but WITHOUT ANY WARRANTY; without even the implied warranty of
183+
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
184+
GNU General Public License for more details.
185+
186+
You should have received a copy of the GNU General Public License
187+
along with this program. If not, see <http://www.gnu.org/licenses/>.
188+

RAILS_v1.1/runRAILS.sh

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
#!/bin/bash
2+
#RLW 2016
3+
if [ $# -ne 4 ]; then
4+
echo "Usage: $(basename $0) <FASTA assembly .fa> <FASTA long sequences .fa> <anchoring sequence length eg. 250> <min sequence identity 0.95>"
5+
exit 1
6+
fi
7+
###Change line below to point to path of bwa executables
8+
export PATH=/gsc/btl/linuxbrew/bin:$PATH
9+
echo Resolving ambiguous bases -Ns- in $1 assembly using long sequences $2
10+
echo reformatting file $1
11+
cat $1 | perl -ne 'if(/^\>/){$scafnum++;}else{my $len=length($_);my @scaftigs=split(/N+/i,$_);my $scaftignum=0;foreach my $scaftig(@scaftigs){ my $len=length($scaftig);$scaftignum++; print ">wga$scafnum";print "."; print "$scaftignum,$len\n$scaftig\n";}}' > $1-formatted.fa
12+
echo reformatting file $2
13+
cat $2 | perl -ne 'if(/^\>/){$ct++;}else{my $len=length($_);print ">seq$ct,$len\n$_";}' > $2-formatted.fa
14+
echo Building sequence database index out of your $1-formatted.fa assembly contigs..
15+
bwa index $1-formatted.fa
16+
echo Aligning long sequences $2-formatted.fa to your contigs..
17+
bwa mem -a -t4 $1-formatted.fa $2-formatted.fa > $2_vs_$1_gapfilling.sam
18+
echo Scaffolding $1-formatted.fa using $2-formatted.fa and filling gaps with sequences in $2-formatted.fa
19+
./cobbler.pl -f $1 -s $2_vs_$1_gapfilling.sam -d $3 -i $4 -b $2_vs_$1_$3_gapsFill -q $2-formatted.fa
20+
echo Process terminated.
21+
echo RAILS scaffolding $2_vs_$1_$3_gapsFill.fa sequences using long seqs $2 -- anchoring sequence threshold $3 bp
22+
echo reformatting file $2_vs_$1_$3_gapsFill.fa
23+
cat $2_vs_$1_$3_gapsFill.fa | perl -ne 'if(/^\>/){$ct++;}else{my $len=length($_);print ">wga$ct,$len\n$_";}' > $2_vs_$1_$3_gapsFill-formatted.fa
24+
echo Building sequence database index out of your $2_vs_$1_$3_gapsFill-formatted.fa assembly contigs..
25+
bwa index $2_vs_$1_$3_gapsFill-formatted.fa
26+
echo Aligning long sequences $2-formatted.fa to your contigs..
27+
bwa mem -a -t4 $2_vs_$1_$3_gapsFill-formatted.fa $2-formatted.fa > $2_vs_$1_scaffolding.sam
28+
echo Scaffolding $2_vs_$1_$3_gapsFill-formatted.fa using $2-formatted.fa and filling new gaps with sequences in $2-formatted.fa
29+
./RAILS -f $2_vs_$1_$3_gapsFill-formatted.fa -s $2_vs_$1_scaffolding.sam -d $3 -i $4 -b $2_vs_$1_$3_rails -q $2-formatted.fa
30+
echo RAILS process terminated.

RAILS_v1.1/test/SARSassembly.fa

Lines changed: 6 additions & 0 deletions
Large diffs are not rendered by default.

RAILS_v1.1/test/SARSgenome.fa

Lines changed: 2 additions & 0 deletions
Large diffs are not rendered by default.
