README.md (31 additions & 30 deletions)
@@ -4,7 +4,7 @@
 # Overview

-RawHash is a hash-based mechanism to map raw nanopore signals to a reference genome in real-time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., human genome.
+RawHash (and RawHash2) is a hash-based mechanism to map raw nanopore signals to a reference genome in real time. To achieve this, it 1) generates an index from the reference genome and 2) efficiently and accurately maps the raw signals to the reference genome such that it can match the throughput of nanopore sequencing even when analyzing large genomes (e.g., the human genome).

 The figure below shows an overview of the steps that RawHash takes to find matching regions between a reference genome and a raw nanopore signal.
@@ -16,11 +16,11 @@ To efficiently identify similarities between a reference genome and reads, RawHa
 RawHash can be used to map reads from **FAST5, POD5, SLOW5, or BLOW5** files to a reference genome in sequence format.

-RawHash performs real-time mapping of nanopore raw signals. When the prefix of reads in FAST5 or POD5 file can be mapped to a reference genome, RawHash will stop mapping and provide the mapping information in PAF format. We follow the similar PAF template used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.
+RawHash performs real-time mapping of nanopore raw signals. When the prefix of a read can be mapped to a reference genome, RawHash stops mapping and reports the mapping information in PAF format. We follow a PAF template similar to the one used in [UNCALLED](https://github.com/skovaka/UNCALLED) and [Sigmap](https://github.com/haowenz/sigmap) to report the mapping information.

 # Recent changes

-* RawHash now supports **POD5** files. RawHash will automatically detect the POD5 files from the file prefix (i.e., ".pod5"). Note: This feature is tested only on the Linux systems.
+* We have released RawHash2, a more sensitive and faster raw signal mapping mechanism with substantial improvements over RawHash. RawHash2 is available within this repository. You can still use the earlier version, RawHash v1, from [this release](https://github.com/CMU-SAFARI/RawHash/releases/tag/v1.0).

 * It is now possible to disable compiling HDF5, SLOW5, and POD5. Please check the `Compiling with HDF5, SLOW5, and POD5` section below for details.
@@ -29,22 +29,22 @@ RawHash performs real-time mapping of nanopore raw signals. When the prefix of r
 * Clone the code from its GitHub repository (`--recursive` must be used):
 * Compile (Make sure you have a C++ compiler and GNU make):

 ```bash
-cd rawhash && make
+cd rawhash2 && make
 ```

-If the compilation is successful, the binary will be in `bin/rawhash`.
+If the compilation is successful, the path to the binary will be `bin/rawhash2`.

 ## Compiling with HDF5, SLOW5, and POD5

-We are aware that some of the pre-compiled libraries (e.g., POD5) may not work in your system and you may need to compile these libraries from scratch. Additionally, it may be possible that you may not want to compile any of the HDF5, SLOW5, or POD5 libraries if you are not going to use them. RawHash provides a flexible Makefile to enable custom compilation of these libraries.
+We are aware that some of the pre-compiled libraries (e.g., POD5) may not work on your system and that you may need to compile these libraries from scratch. Additionally, you may not want to compile any of the HDF5, SLOW5, or POD5 libraries if you are not going to use them. RawHash2 provides a flexible Makefile to enable custom compilation of these libraries.

-* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries, if you do not want to use the source code or the pre-compiled binaries that come with RawHash. To use your own include and lib directories you should pass them to `make` when compiling as follows:
+* It is possible to provide your own include and lib directories for *any* of the HDF5, SLOW5, and POD5 libraries if you do not want to use the source code or the pre-compiled binaries that come with RawHash2. To use your own include and lib directories, pass them to `make` when compiling as follows:

 ```bash
 #Provide the path to all of the HDF5/SLOW5/POD5 include and lib directories during compilation
@@ -70,10 +70,10 @@ make NOSLOW5=1 NOPOD5=1

 ## Getting help

-You can print the help message to learn how to use `rawhash`:
+You can print the help message to learn how to use `rawhash2`:

 ```bash
-rawhash
+rawhash2
 ```

 ## Indexing
@@ -82,65 +82,66 @@ Indexing is similar to minimap2's usage. We additionally include the pore models
 Below is an example that generates an index file `ref.ind` for the reference genome `ref.fasta` using a certain k-mer model located under `extern` and `32` threads.
-Note that you can directly jump to mapping without creating the index because RawHash is able to generate the index relatively quickly on-the-fly within the mapping step. However, a real-time genome analysis application may still prefer generating the indexing before the mapping step. Thus, we suggest creating the index before the mapping step.
+Note that you can jump directly to mapping without creating the index because RawHash2 can generate the index relatively quickly on the fly within the mapping step. However, a real-time genome analysis application may still prefer generating the index before the mapping step. Thus, we suggest creating the index before the mapping step.

 ## Mapping

 It is possible to provide inputs as FAST5 files from multiple directories. It is also possible to provide a list of files matching a certain pattern such as `test/data/contamination/fast5_files/Min*.fast5`

-* Example usage where multiple files matching a certain the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are inputted to rawhash using `32` threads and the previously generated `ref.ind` index:
+* Example usage where multiple files matching the pattern `test/data/contamination/fast5_files/Min*.fast5` and fast5 files inside the `test/data/d1_sars-cov-2_r94/fast5_files` directory are passed to rawhash2 using `32` threads and the previously generated `ref.ind` index:
-**IMPORTANT** if there are many fast5 files that rawhash needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files
+**IMPORTANT** If there are many fast5 files that rawhash2 needs to process (e.g., thousands of them), we suggest that you specify **only** the directories that contain these fast5 files.
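
When a workflow already holds a long list of individual fast5 paths, the advice above amounts to collapsing that list into its unique parent directories before calling the mapper. The following is a minimal sketch of that step; `fast5_dirs` is a hypothetical helper name and is not part of RawHash2.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of RawHash2): print each parent directory
# of the given fast5 files exactly once, one directory per line.
fast5_dirs() {
    local f
    for f in "$@"; do
        dirname -- "$f"
    done | sort -u
}

# Example use (commented out because it needs real data and the binary):
# mapfile -t dirs < <(fast5_dirs test/data/*/fast5_files/*.fast5)
# rawhash2 -t 32 ref.ind "${dirs[@]}" > mapping.paf
```

Passing a handful of directories instead of thousands of file names also keeps the command line well below the shell's argument-length limit.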

-RawHash also provides a set of default parameters that can be preset automatically.
+RawHash2 also provides a set of default parameters that can be preset automatically.

-* Mapping reads to a viral reference genome using its corresponding preset:
+* Mapping reads to a viral reference genome using its corresponding preset with the high precision goal (as set by `--depletion`):
-* Mapping reads to large reference genomes (>50M bases) using its corresponding preset:
+* Mapping reads to large reference genomes (>500M bases) using its corresponding preset:

 ```
-rawhash -t 32 -x fast ref.ind test/data/d1_sars-cov-2_r94/fast5_files > mapping.paf
+rawhash2 -t 32 -x fast ref.ind test/data/d5_human_na12878_r94/fast5_files > mapping.paf
 ```

-* Although we have not thoroguhly evaluated, RawHash also provides another set of default parameters that can be used for very large metagenomic samples (>10G). To achieve efficient search, it uses the minimizer seeding in this parameter setting. This setting is not evaluated in our manuscript.
+RawHash2 provides another set of default parameters that can be used for very large metagenomic samples (>10G). To achieve an efficient search, it uses minimizer seeding in this parameter setting, which is slightly less accurate than the non-minimizer mode but much faster (around 3X).
 The output will be saved to `mapping.paf` in a modified PAF format used by [Uncalled](https://github.com/skovaka/UNCALLED).
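
As a quick sanity check on `mapping.paf`, the records can be tallied with `awk`. This is only a sketch under a stated assumption: it treats a record as unmapped when its target-name column (column 6 in PAF) is `*`, which should be verified against the actual output. `paf_summary` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of RawHash2): count mapped and unmapped
# records in a tab-separated PAF file.
# Assumption: an unmapped record carries "*" in the target-name column (6).
paf_summary() {
    awk -F'\t' '
        $6 == "*" { unmapped++; next }   # no target reported for this read
        { mapped++ }                     # every other record counts as mapped
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

# Example use:
# paf_summary mapping.paf
```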

 ## Potential issues you may encounter during mapping

 It is possible that your reads in fast5 files are compressed with the [VBZ compression](https://github.com/nanoporetech/vbz_compression) from Nanopore. In that case, you have to download the proper HDF5 plugin from [here](https://github.com/nanoporetech/vbz_compression/releases) and make sure it can be found by your HDF5 library:
+
 ```bash
 export HDF5_PLUGIN_PATH=/path/to/hdf5/plugins/lib
 ```
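
Before starting a long mapping run, it can help to confirm that the exported path really points at a plugin. The sketch below is a hypothetical pre-flight check, not part of RawHash2. It assumes `HDF5_PLUGIN_PATH` holds a single directory and that the plugin file name contains `vbz`; both are assumptions to adjust for your installation.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of RawHash2): succeed only if
# HDF5_PLUGIN_PATH names an existing directory that holds at least one file
# whose name contains "vbz". The "*vbz*" pattern is an assumption about how
# the plugin file is named.
check_vbz_plugin() {
    local dir="${HDF5_PLUGIN_PATH:-}"
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    local f
    for f in "$dir"/*vbz*; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example use:
# check_vbz_plugin || echo "VBZ plugin not found in HDF5_PLUGIN_PATH" >&2
```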

-If you have conda you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash while the environment is active:
+If you have conda, you can simply install the following package (`ont_vbz_hdf_plugin`) in your environment and use rawhash2 while the environment is active:

 ```bash
 conda install ont_vbz_hdf_plugin
@@ -153,11 +154,11 @@ Please follow the instructions in the [README](test/README.md) file in [test](./

 * Direct integration with MinKNOW.
 * Ability to specify even/odd channels to eject the reads only from these specified channels.
-* Please create issues if you want to see more features that can make RawHash easily integratable with nanopore sequencers for any use case.
+* Please create issues if you want to see more features that can make RawHash2 easier to integrate with nanopore sequencers for any use case.

-# Citing RawHash
+# Citing RawHash and RawHash2

-To cite RawHash, you can use the following BibTeX:
+To cite RawHash and RawHash2, you can use the following BibTeX:

 ```bibtex
 @article{firtina_rawhash_2023,
@@ -177,6 +178,6 @@ To cite RawHash, you can use the following BibTeX:

 # Acknowledgement

-RawHash uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining) and [Sigmap](https://github.com/haowenz/sigmap) (e.g., R9.4 segmentation parameters).
+RawHash2 uses [klib](https://github.com/attractivechaos/klib), some code snippets from [Minimap2](https://github.com/lh3/minimap2) (e.g., pipelining, hash table usage, DP and RMQ-based chaining), and the R9.4 segmentation parameters from [Sigmap](https://github.com/haowenz/sigmap).

 We thank [Melina Soysal](https://github.com/melina2200) and [Marie-Louise Dugua](https://github.com/MarieSarahLouise) for their feedback to improve the RawHash implementation and test scripts.