
Commit 6bb4dda

Adding the reproducibility scripts for RawHash2 and some more optimizations in default parameters
1 parent a09971c commit 6bb4dda

66 files changed

Lines changed: 1183 additions & 1088 deletions


.gitignore

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ test/evaluation/read_mapping/*/comparison/comparison.out
 test/evaluation/read_mapping/*/comparison/comparison.err

 test/evaluation/*/*/slurm*
-test/evaluation/*/slurm*
+test/evaluation/*/*slurm*
 test/evaluation/*/*/test_*
 test/evaluation/*/*/comparison/*/
 test/scripts/run_rawhash_test.sh
@@ -52,7 +52,7 @@ test/scripts/run_rawhash_test*.sh

 test/evaluation/read_mapping/*/parameters.txt
 test/evaluation/read_mapping/*/results.txt
-test/evaluation/read_mapping/parameters*.txt
+test/evaluation/read_mapping/*parameters*.txt

 test/*.paf
 test/eval/*.paf
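
Both `.gitignore` changes above add a leading `*` to the final path component, which widens what the pattern ignores. A minimal sketch of the effect follows, using POSIX `fnmatch` with `FNM_PATHNAME` as an approximation of how gitignore treats a single `*` (it does not cross `/`). The helper name `glob_matches` and the file name used in the note after the block are hypothetical, and `fnmatch` is not a full gitignore implementation.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when `path` matches `pattern`.
 * FNM_PATHNAME keeps '*' from matching '/', which approximates the
 * behavior of a single '*' in a .gitignore pattern. */
bool glob_matches(const char *pattern, const char *path)
{
    assert(pattern != NULL && path != NULL);
    return fnmatch(pattern, path, FNM_PATHNAME) == 0;
}
```

Under these assumptions, a hypothetical file `test/evaluation/read_mapping/job_slurm-1.out` is matched by the new pattern `test/evaluation/*/*slurm*` but not by the old `test/evaluation/*/slurm*`, because the old pattern requires the file name to begin with `slurm`.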

README.md

Lines changed: 31 additions & 30 deletions
@@ -4,7 +4,7 @@

 # Overview

-RawHash is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., human genome.
+RawHash (and RawHash2) is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., human genome.

 Below figure shows the overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.

@@ -16,11 +16,11 @@ To efficiently identify similarities between a reference genome and reads, RawHa

 RawHash can be used to map reads from **FAST5, POD5, SLOW5, or BLOW5** files to a reference genome in sequence format.

-RawHash performs real-time mapping of nanopore raw signals. When the prefix of reads in FAST5 or POD5 file can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow the similar PAF template used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.
+RawHash performs real-time mapping of nanopore raw signals. When the prefix of reads can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow the similar PAF template used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.

 # Recent changes

-* RawHash now supports **POD5** files. RawHash will automatically detect the POD5 files from the file prefix (i.e., ".pod5"). Note: This feature is tested only on the Linux systems.
+* We have released RawHash2, a more sensitive and faster raw signal mapping mechanism with substantial improvements over RawHash. RawHash2 is available within this repository. You can still use the earlier version, RawHash v1, from [this release](https://github.com/CMU-SAFARI/RawHash/releases/tag/v1.0).

 * It is now possible to disable compiling HDF5, SLOW5, and POD5. Please check the `Compiling with HDF5, SLOW5, and POD5` section below for details.

@@ -29,22 +29,22 @@ RawHash performs real-time mapping of nanopore raw signals. When the prefix of r
 * Clone the code from its GitHub repository (`--recursive` must be used):

 ```bash
-git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash
+git clone --recursive https://github.com/CMU-SAFARI/RawHash.git rawhash2
 ```

 * Compile (Make sure you have a C++ compiler and GNU make):

 ```bash
-cd rawhash && make
+cd rawhash2 && make
 ```

-If the compilation is successful, the binary will be in `bin/rawhash`.
+If the compilation is successful, the path to the binary will be `bin/rawhash2`.

 ## Compiling with HDF5, SLOW5, and POD5

-We are aware that some of the pre-compiled libraries (e.g., POD5) may not work in your system and you may need to compile these libraries from scratch. Additionally, it may be possible that you may not want to compile any of the HDF5, SLOW5, or POD5 libraries if you are not going to use them. RawHash provides a flexible Makefile to enable custom compilation of these libraries.
+We are aware that some of the pre-compiled libraries (e.g., POD5) may not work in your system and you may need to compile these libraries from scratch. Additionally, it may be possible that you may not want to compile any of the HDF5, SLOW5, or POD5 libraries if you are not going to use them. RawHash2 provides a flexible Makefile to enable custom compilation of these libraries.

-* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries, if you do not want to use the source code or the pre-compiled binaries that come with RawHash. To use your own include and lib directories you should pass them to `make` when compiling as follows:
+* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries, if you do not want to use the source code or the pre-compiled binaries that come with RawHash2. To use your own include and lib directories you should pass them to `make` when compiling as follows:

 ```bash
 #Provide the path to all of the HDF5/SLOW5/POD5 include and lib directories during compilation
@@ -70,10 +70,10 @@ make NOSLOW5=1 NOPOD5=1

 ## Getting help

-You can print the help message to learn how to use `rawhash`:
+You can print the help message to learn how to use `rawhash2`:

 ```bash
-rawhash
+rawhash2
 ```

 ## Indexing
@@ -82,65 +82,66 @@ Indexing is similar to minimap2's usage. We additionally include the pore models
 Below is an example that generates an index file `ref.ind` for the reference genome `ref.fasta` using a certain k-mer model located under `extern` and `32` threads.

 ```bash
-rawhash -d ref.ind -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta
+rawhash2 -d ref.ind -p extern/kmer_models/r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 ref.fasta
 ```

-Note that you can directly jump to mapping without creating the index because RawHash is able to generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still prefer generating the indexing before the mapping step. Thus, we suggest creating the index before the mapping step.
+Note that you can directly jump to mapping without creating the index because RawHash2 is able to generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still prefer generating the indexing before the mapping step. Thus, we suggest creating the index before the mapping step.

 ## Mapping

 It is possible to provide inputs as FAST5 files from multiple directories. It is also possible to provide a list of files matching a certain pattern such as `test/data/contamination/fast5_files/Min*.fast5`

-* Example usage where multiple files matching a certain the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are inputted to rawhash using `32` threads and the previously generated `ref.ind` index:
+* Example usage where multiple files matching a certain the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are inputted to rawhash2 using `32` threads and the previously generated `ref.ind` index:

 ```bash
-rawhash -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 ref.ind test/data/contamination/fast5_files/Min*.fast5 test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
 ```

 * Another example usage where 1) we only input a directory including FAST5 files as set of raw signals and 2) the output is directly saved in a file.

 ```bash
-rawhash -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files
+rawhash2 -t 32 -o mapping.paf ref.ind test/data/d1_sars-cov-2_r94/fast5_files
 ```

-**IMPORTANT** if there are many fast5 files that rawhash needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files
+**IMPORTANT** if there are many fast5 files that rawhash2 needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files

-RawHash also provides a set of default parameters that can be preset automatically.
+RawHash2 also provides a set of default parameters that can be preset automatically.

-* Mapping reads to a viral reference genome using its corresponding preset:
+* Mapping reads to a viral reference genome using its corresponding preset with the high precision goal (as set by --depletion):

 ```
-rawhash -t 32 -x viral ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 -x viral --depletion ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
 ```

-* Mapping reads to small reference genomes (<50M bases) using its corresponding preset:
+* Mapping reads to small reference genomes (<500M bases) using its corresponding preset:

 ```
-rawhash -t 32 -x sensitive ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 -x sensitive ref.ind test/data/d4_green_algae_r94/fast5_files > mapping.paf
 ```

-* Mapping reads to large reference genomes (>50M bases) using its corresponding preset:
+* Mapping reads to large reference genomes (>500M bases) using its corresponding preset:

 ```
-rawhash -t 32 -x fast ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 -x fast ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
 ```

-* Although we have not thoroguhly evaluated, RawHash also provides another set of default parameters that can be used for very large metagenomic samples (>10G). To achieve efficient search, it uses the minimizer seeding in this parameter setting. This setting is not evaluated in our manuscript.
+RawHash2 provides another set of default parameters that can be used for very large metagenomic samples (>10G). To achieve efficient search, it uses the minimizer seeding in this parameter setting, which is slightly less accurate than the non-minimizer mode but much faster (around 3X).

 ```
-rawhash -t 32 -x faster ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 -x faster ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
 ```

 The output will be saved to `mapping.paf` in a modified PAF format used by [Uncalled](https://github.com/skovaka/UNCALLED).

 ## Potential issues you may encounter during mapping

 It is possible that your reads in fast5 files are compressed with the [VBZ compression](https://github.com/nanoporetech/vbz_compression) from Nanopore. Then you have to download the proper HDF5 plugin from [here](https://github.com/nanoporetech/vbz_compression/releases) and make sure it can be found by your HDF5 library:
+
 ```bash
 export HDF5_PLUGIN_PATH=/path/to/hdf5/plugins/lib
 ```

-If you have conda you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash while the environment is active:
+If you have conda you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash2 while the environment is active:

 ```bash
 conda install ont_vbz_hdf_plugin
@@ -153,11 +154,11 @@ Please follow the instructions in the [README](test/README.md) file in [test](./

 * Direct integration with MinKNOW.
 * Ability to specify even/odd channels to eject the reads only from these specified channels.
-* Please create issues if you want to see more features that can make RawHash easily integratable with nanopore sequencers for any use case.
+* Please create issues if you want to see more features that can make RawHash2 easily integratable with nanopore sequencers for any use case.

-# Citing RawHash
+# Citing RawHash and RawHash2

-To cite RawHash, you can use the following BibTeX:
+To cite RawHash and RawHash2, you can use the following BibTeX:

 ```bibtex
 @article{firtina_rawhash_2023,
@@ -177,6 +178,6 @@ To cite RawHash, you can use the following BibTeX:

 # Acknowledgement

-RawHash uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and [Sigmap](https://github.com/haowenz/sigmap) (e.g., R9.4 segmentation parameters).
+RawHash2 uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and the R9.4 segmentation parameters from [Sigmap](https://github.com/haowenz/sigmap).

 We thank [Melina Soysal](https://github.com/melina2200) and [Marie-Louise Dugua](https://github.com/MarieSarahLouise) for their feedback to improve the RawHash implementation and test scripts.

src/Makefile

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ endif

 LIBS+=-lm -lz -ldl

-PROG=rawhash
+PROG=rawhash2

 ifneq ($(aarch64),)
 arm_neon=1

src/hit.c

Lines changed: 6 additions & 6 deletions
@@ -10,22 +10,22 @@
 static inline void mm_cal_fuzzy_len(mm_reg1_t *r,
                                     const mm128_t *a)
 {
-    // uint32_t span_mask = (1U<<RI_HASH_SHIFT)-1;
+    uint32_t span_mask = (1U<<RI_HASH_SHIFT)-1;

     int i;
     r->mlen = r->blen = 0;
     if (r->cnt <= 0) return;
-    // r->mlen = r->blen = (a[r->as].y>>RI_ID_SHIFT)&span_mask;
+    r->mlen = r->blen = (a[r->as].y>>RI_ID_SHIFT)&span_mask;
     for (i = r->as + 1; i < r->as + r->cnt; ++i) {
-        // int span = (a[i].y>>RI_ID_SHIFT)&span_mask;
+        int span = (a[i].y>>RI_ID_SHIFT)&span_mask;
         int tl = (int32_t)a[i].x - (int32_t)a[i-1].x;
         int ql = (int32_t)a[i].y - (int32_t)a[i-1].y;
         r->blen += tl > ql? tl : ql;
-        // r->mlen += tl > span && ql > span? span : tl < ql? tl : ql;
-        // r->mlen += tl < ql? tl : ql;
+        r->mlen += tl > span && ql > span? span : tl < ql? tl : ql;
+        r->mlen += tl < ql? tl : ql;
     }

-    r->mlen = r->blen;
+    // r->mlen = r->blen;
 }

 /*
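
The hunk above re-enables per-anchor accumulation of the matching length (`mlen`) alongside the block length (`blen`). As committed, the loop contains two `r->mlen +=` statements, so each anchor contributes to `mlen` twice. The sketch below is a simplified, self-contained illustration that keeps only the span-capped update; that choice is an assumption about the intended behavior, not something the commit states. The types `anchor_t` and `fuzzy_len_t` and the function `fuzzy_len` are hypothetical stand-ins for `mm128_t`, `mm_reg1_t`, and `mm_cal_fuzzy_len`, with the bit-packed span replaced by a plain field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, unpacked stand-in for an anchor:
 * reference position, query position, and seed span. */
typedef struct { int32_t ref_pos, qry_pos, span; } anchor_t;

/* Hypothetical result pair: matching length and block length. */
typedef struct { int32_t mlen, blen; } fuzzy_len_t;

/* Accumulates approximate matching and block lengths over a chain of
 * n anchors sorted by position. Each gap adds max(tl, ql) to blen and
 * at most `span` to mlen (exactly once per anchor). */
fuzzy_len_t fuzzy_len(const anchor_t *a, int n)
{
    fuzzy_len_t r = {0, 0};
    assert(a != NULL || n <= 0);
    if (n <= 0) return r;
    r.mlen = r.blen = a[0].span;
    for (int i = 1; i < n; ++i) {
        int32_t span = a[i].span;
        int32_t tl = a[i].ref_pos - a[i - 1].ref_pos; /* reference distance */
        int32_t ql = a[i].qry_pos - a[i - 1].qry_pos; /* query distance */
        r.blen += tl > ql ? tl : ql;
        r.mlen += (tl > span && ql > span) ? span : (tl < ql ? tl : ql);
    }
    return r;
}
```

With the second `mlen` update from the hunk added back, each loop iteration would additionally add `min(tl, ql)`, so `mlen` could exceed `blen`.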

src/lchain.c

Lines changed: 5 additions & 4 deletions
@@ -110,6 +110,7 @@ uint64_t *mg_chain_backtrack(void *km,
     // n_z = # of anchors with acceptable scores.
     for (i = 0, n_z = 0; i < n; ++i) // precompute n_z
         if(f[i] >= min_sc) ++n_z;
+
     if(n_z == 0) return 0;

     KMALLOC(km, z, n_z);
@@ -314,9 +315,9 @@ static inline int32_t compute_score(const mm128_t *ai,
     // TODO: currently the span is only determined by "e" (number of events concatanated in a seed)
     q_span = (aj->y>>RI_ID_SHIFT)&span_mask;

-    // Calculate the chaining score. Consider the the gap (dg) if it is larger than the span (q_span).
+    // Matching bases. Consider the the gap (dg) if it is smaller than the span (q_span).
     sc = q_span < dg? q_span : dg;
-
+
     // Integrating penalties to the score
     if(dd || dg > q_span){
         float lin_pen, log_pen;
@@ -542,7 +543,7 @@ static inline int32_t comput_sc_simple(const mm128_t *ai,
         float lin_pen, log_pen;
         lin_pen = chn_pen_gap * (float)dd + chn_pen_skip * (float)dg;
         log_pen = dd >= 1? mg_log2(dd + 1) : 0.0f; // mg_log2() only works for dd>=2
-        // sc -= (int)(lin_pen + .5f * log_pen);
+        sc -= (int)(lin_pen + .5f * log_pen);
     }
     return sc;
 }
@@ -692,7 +693,7 @@ mm128_t *mg_lchain_rmq(int max_dist,
         }
     }
     // set max
-    assert(max_j < 0 || (a[max_j].x < a[i].x && (int32_t)a[max_j].y < (int32_t)a[i].y));
+    // assert(max_j < 0 || (a[max_j].x <= a[i].x && (int32_t)a[max_j].y <= (int32_t)a[i].y));

     // Update the score and the predecessor of anchor i based on the found best predecessor max_j
     f[i] = max_f, p[i] = max_j;
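
The `lchain.c` hunks above re-enable the gap penalty in `comput_sc_simple` and reword the comment on the base score `sc = min(q_span, dg)`. A minimal, self-contained sketch of that scoring rule follows. The function names `chain_score` and `floor_log2` are hypothetical; `floor_log2` is a simple integer stand-in for `mg_log2()`, not its actual implementation. The reading of `dd` as the difference between the reference and query distances of two anchors follows minimap2 and is an assumption here; the source comment calls `dg` the gap.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mg_log2(): floor(log2(x)) for x >= 1, returned as float. */
float floor_log2(int32_t x)
{
    float r = 0.0f;
    while (x > 1) { x >>= 1; r += 1.0f; }
    return r;
}

/* Score for chaining two anchors.
 * dg: gap between the anchors; dd: assumed difference between the
 * reference and query distances; q_span: span of the seed.
 * pen_gap and pen_skip play the role of chn_pen_gap and chn_pen_skip. */
int32_t chain_score(int32_t dg, int32_t dd, int32_t q_span,
                    float pen_gap, float pen_skip)
{
    assert(dg >= 0 && dd >= 0 && q_span > 0);
    /* Base score: the smaller of the span and the gap. */
    int32_t sc = q_span < dg ? q_span : dg;
    /* Penalize only when the anchors are off-diagonal or far apart. */
    if (dd || dg > q_span) {
        float lin_pen = pen_gap * (float)dd + pen_skip * (float)dg;
        float log_pen = dd >= 1 ? floor_log2(dd + 1) : 0.0f;
        sc -= (int32_t)(lin_pen + 0.5f * log_pen);
    }
    return sc;
}
```

For example, with `dg = 8`, `dd = 3`, `q_span = 5`, `pen_gap = 1.0`, and `pen_skip = 0.0` (illustrative values, not the tool's defaults), the base score is 5, the linear penalty is 3, the logarithmic term contributes 1, and the final score is 1.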
