Commit 31f4e18

docs(batch): reframe combined zarrs as final products, per-pulse-mode as intermediates
1 parent d86ab09 commit 31f4e18

1 file changed

Lines changed: 55 additions & 43 deletions

scripts/batch_processing/PROCESSING_REPORT.md

@@ -90,39 +90,39 @@ Stage 14: Combined daily products + per-day echograms (NEW)
9090

9191
### Stage 4: Compute Sv
9292

93-
- **277 raw Sv zarrs** (137 short_pulse + 140 long_pulse)
93+
- **277 per-pulse-mode Sv zarrs** (137 short_pulse + 140 long_pulse) — intermediates
9494
- Typical shape: `(channels=2, ping_time=~15000-32000, range_sample=~3600-7200)`
95-
- Stored as e.g. `2023-07-15/2023-07-15--short_pulse.zarr`
95+
- **Final per-day products**: 141 combined Sv zarrs (both pulse modes merged, channels: `38kHz`, `200kHz`)
9696

9797
### Stage 5–6: Calibrate + Enrich + Denoise
9898

99-
- **268 denoised zarrs** (132 short_pulse + 136 long_pulse)
99+
- **268 per-pulse-mode denoised zarrs** (132 short_pulse + 136 long_pulse) — intermediates
100100
- 9 raw zarrs had no matching GPS or failed calibration → skipped
101101
- 4-stage denoising: background noise removal → impulse noise → attenuation correction → transient removal
102102
- GPS (latitude/longitude) merged from `gpsdata` container into denoised datasets
103-
- Stored as e.g. `2023-07-15/2023-07-15--short_pulse--denoised.zarr`
103+
- **Final per-day products**: 140 combined denoised zarrs (both pulse modes merged)
104104

105105
**GPS coverage issue**: 34 denoised zarrs have all-NaN GPS coordinates. On each affected day only one pulse mode is affected: the GPS merge succeeded for one mode but not the other (likely a timing mismatch between GPS timestamps and sonar ping times for the other pulse mode).
106106

107107
### Stage 7: Per-day MVBS
108108

109-
- **261 MVBS zarrs** (+ 261 NetCDF copies)
109+
- **261 per-pulse-mode MVBS zarrs** (+ 261 NetCDF copies) — intermediates
110110
- Bins: `range_bin=1m`, `ping_time_bin=10s`
111111
- Computed with `echopype.commongrid.compute_MVBS()`
112-
- Stored as e.g. `2023-07-15/2023-07-15--short_pulse--mvbs.zarr` and `.nc`
112+
- **Final per-day products**: 137 combined MVBS zarrs (both pulse modes merged)
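
The binning above can be illustrated with a minimal NumPy sketch. This is not echopype's implementation of `compute_MVBS()`; the function name `mean_sv_db`, its arguments, and the assumption of a single 1-D depth axis shared by all pings are hypothetical. It only shows the one point that matters when averaging Sv: the mean is taken in the linear domain and converted back to dB.

```python
import numpy as np

def mean_sv_db(sv_db, ping_time_s, depth_m, time_bin_s=10.0, depth_bin_m=1.0):
    """Average Sv (dB) onto a regular (time, depth) grid in the linear domain.

    sv_db has shape (n_ping, n_depth); ping_time_s is in seconds and depth_m
    in metres (hypothetical inputs, assumed 1-D and shared by all pings).
    """
    sv_db = np.asarray(sv_db, dtype=float)
    # Integer bin index of every ping and every depth sample.
    t_idx = np.floor((ping_time_s - ping_time_s.min()) / time_bin_s).astype(int)
    d_idx = np.floor(depth_m / depth_bin_m).astype(int)
    n_t, n_d = t_idx.max() + 1, d_idx.max() + 1
    # Flat cell index for each (ping, depth) sample.
    flat = (t_idx[:, None] * n_d + d_idx[None, :]).ravel()
    linear = 10.0 ** (sv_db.ravel() / 10.0)
    valid = np.isfinite(linear)  # NaN samples are left out of the mean
    sums = np.bincount(flat[valid], weights=linear[valid], minlength=n_t * n_d)
    counts = np.bincount(flat[valid], minlength=n_t * n_d)
    with np.errstate(divide="ignore", invalid="ignore"):
        mvbs = 10.0 * np.log10(sums / counts)  # empty cells become NaN
    return mvbs.reshape(n_t, n_d)
```

Averaging -60 dB and -70 dB this way gives about -62.6 dB rather than -65 dB, which is why the conversion to linear units comes first.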
113113

114114
### Stage 7 (NASC): Per-day NASC — Fast Vectorized
115115

116116
**Original approach** (echopype `compute_NASC`): ~90 GB RAM, 15–60 min per zarr. Only 5 zarrs completed before the pipeline was killed due to stalled computation.
117117

118118
**Replacement** (`run_nasc_parallel.py`): Pure numpy + haversine + `np.bincount`. ~7 GB per worker, 1–17 seconds per zarr. **~600× faster.**
119119

120-
- **229 NASC zarrs** (+ 229 NetCDF copies)
120+
- **229 per-pulse-mode NASC zarrs** (+ 229 NetCDF copies) — intermediates
121121
- 109 short_pulse + 120 long_pulse
122122
- Bins: `range_bin=10m`, `dist_bin=0.5nmi`
123123
- **222 computed in 2 minutes** (10 parallel workers)
124124
- 34 skipped (all-NaN GPS), 5 failed (see §4)
125-
- Stored as e.g. `2023-07-15/2023-07-15--short_pulse--nasc.zarr` and `.nc`
125+
- **Final per-day products**: 216 combined NASC zarrs (per-frequency: 38kHz + 200kHz)
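
The "numpy + haversine + `np.bincount`" approach can be sketched as follows. This is an illustration under stated assumptions, not the contents of `run_nasc_parallel.py`: the function names, the Earth-radius constant, and the choice to average a per-ping value within each distance bin are all hypothetical, and the depth integration that NASC also requires is not shown.

```python
import numpy as np

# Assumed constant: mean Earth radius (6371.0088 km) expressed in nautical miles.
EARTH_RADIUS_NMI = 6371.0088 / 1.852

def cumulative_distance_nmi(lat_deg, lon_deg):
    """Cumulative along-track distance from per-ping GPS fixes (vectorized haversine)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = np.diff(lat)
    dlon = np.diff(lon)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2) ** 2
    step = 2 * EARTH_RADIUS_NMI * np.arcsin(np.sqrt(a))
    return np.concatenate([[0.0], np.cumsum(step)])

def bin_mean_by_distance(values, dist_nmi, dist_bin_nmi=0.5):
    """Mean of a per-ping quantity in fixed-width distance bins, with no Python loop."""
    values = np.asarray(values, dtype=float)
    idx = np.floor(np.asarray(dist_nmi) / dist_bin_nmi).astype(int)
    valid = np.isfinite(values)
    n_bins = idx.max() + 1
    sums = np.bincount(idx[valid], weights=values[valid], minlength=n_bins)
    counts = np.bincount(idx[valid], minlength=n_bins)
    with np.errstate(divide="ignore", invalid="ignore"):
        return sums / counts  # bins without valid pings become NaN
```

Both steps are single array passes, which is consistent with the seconds-per-zarr timings reported above. Note that `np.diff` and `np.cumsum` propagate NaN, so an all-NaN GPS track yields no usable distances.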
126126

127127
### Stage 8: Per-day Echograms — SKIPPED
128128

@@ -172,24 +172,27 @@ Skipped with `--skip-perday-echograms` to prioritise campaign-level products. Ca
172172
- Grid: 0.5° resolution, scipy griddata interpolation, cKDTree search radius 0.5°
173173
- Stored in `/mnt/data/output/heatmaps/`
174174

175-
### Stage 14: Combined Daily Products + Per-day Echograms (NEW)
175+
### Stage 14: Pulse-Mode Merge + Per-day Echograms
176176

177-
Merges short_pulse + long_pulse into single per-day combined zarrs. Channels renamed from instrument IDs (`EKA 266972-07 ES38-18|200-18C`) to frequency labels (`38kHz`, `200kHz`). Each dataset includes a `pulse_mode` variable (0=long, 1=short) for provenance.
177+
The raw pipeline (stages 4–7) processes each pulse mode separately, producing per-pulse-mode intermediate zarrs. Stage 14 merges these into the **final per-day products** — one zarr per day per product level, with both pulse modes combined.
178178

179-
**Products combined:**
179+
Channels renamed from instrument IDs (`EKA 266972-07 ES38-18|200-18C`) to frequency labels (`38kHz`, `200kHz`). Each dataset includes a `pulse_mode` variable (0=long, 1=short) for provenance.
180180

181-
| Product | Count | Method |
182-
|---------|-------|--------|
183-
| Combined MVBS | 137 | Concat along `ping_time` (depth aligned at 1m) |
184-
| Combined denoised Sv | 140 | Interpolated to 0.5m common depth grid, concat along `ping_time` |
185-
| Combined raw Sv | 141 | Same interpolation as denoised |
186-
| Combined NASC | 216 | Per-frequency files, concat along `distance` (offset to avoid overlap) |
181+
**Final per-day products:**
187182

188-
Stored as e.g. `2023-07-15/2023-07-15--combined--mvbs.zarr` (NASC: `2023-07-15--combined--nasc--38kHz.zarr`).
183+
| Product | Count | Merge method | Example filename |
184+
|---------|-------|-------------|------------------|
185+
| Sv (raw) | 141 | Interpolated to 0.5m common depth grid, concat along `ping_time` | `2023-07-15--combined--sv.zarr` |
186+
| Denoised Sv | 140 | Same interpolation as raw Sv | `2023-07-15--combined--denoised.zarr` |
187+
| MVBS | 137 | Concat along `ping_time` (depth already aligned at 1m) | `2023-07-15--combined--mvbs.zarr` |
188+
| NASC (per-freq) | 216 | Concat along `distance` (offset to avoid overlap) | `2023-07-15--combined--nasc--38kHz.zarr` |
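
The Sv merge in the table (interpolate to a common 0.5 m depth grid, then concatenate along `ping_time`) might look roughly like the NumPy sketch below. It assumes each pulse-mode segment is a `(ping_time, depth_m, sv)` tuple with increasing `depth_m`, and it interpolates the stored values directly; whether the pipeline interpolates in dB or in linear units is not stated in the report. The helper names are hypothetical.

```python
import numpy as np

def regrid_depth(sv, depth_m, grid_m):
    """Linearly interpolate each ping (row) onto grid_m; NaN outside the source range."""
    return np.vstack(
        [np.interp(grid_m, depth_m, ping, left=np.nan, right=np.nan) for ping in sv]
    )

def merge_pulse_modes(long_seg, short_seg, step_m=0.5):
    """Merge two (ping_time, depth_m, sv) segments onto one depth grid, sorted by time."""
    max_depth = max(long_seg[1].max(), short_seg[1].max())
    grid = np.arange(0.0, max_depth + step_m, step_m)
    times = np.concatenate([long_seg[0], short_seg[0]])
    sv = np.vstack(
        [
            regrid_depth(long_seg[2], long_seg[1], grid),
            regrid_depth(short_seg[2], short_seg[1], grid),
        ]
    )
    # Provenance flag matching the report's convention: 0 = long, 1 = short.
    mode = np.concatenate(
        [
            np.zeros(len(long_seg[0]), dtype=np.int8),
            np.ones(len(short_seg[0]), dtype=np.int8),
        ]
    )
    order = np.argsort(times, kind="stable")
    return times[order], grid, sv[order], mode[order]
```

Keeping the `pulse_mode` flag alongside the merged array lets downstream code split the data by mode again without returning to the intermediates.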
189+
190+
The per-pulse-mode zarrs (`*--short_pulse--*.zarr`, `*--long_pulse--*.zarr`) remain on disk as intermediates but are **not the deliverable products**.
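
Because final and intermediate stores share the same day directories, a consumer may want to separate them by name. The sketch below is a hypothetical helper, not part of the pipeline; the regular expression is inferred only from the example filenames in this report and may not cover every file on disk.

```python
import re

# Pattern inferred from the report's example filenames (an assumption).
_NAME = re.compile(
    r"^(?P<day>\d{4}-\d{2}-\d{2})--(?P<mode>combined|short_pulse|long_pulse)"
    r"(?:--(?P<product>[a-z]+))?(?:--(?P<freq>\d+kHz))?\.(?P<ext>zarr|nc)$"
)

def classify(filename):
    """Return a dict describing a product file, or None if the name does not match."""
    m = _NAME.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    # A bare per-pulse-mode name such as `2023-07-15--short_pulse.zarr` is raw Sv.
    info["product"] = info["product"] or "sv"
    # Only the combined stores are deliverables.
    info["final"] = info["mode"] == "combined"
    return info
```

For example, `classify("2023-07-15--combined--nasc--38kHz.zarr")` reports `final=True` with `freq="38kHz"`, while `classify("2023-07-15--short_pulse--mvbs.nc")` reports `final=False`.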
189191

190192
**Per-day echograms:**
191193

192194
- **1,610 PNG files** (3.3 GB total)
195+
- Generated from the combined zarrs (not per-pulse-mode)
193196
- 3 products (MVBS, denoised, raw Sv) × 2 frequencies (38kHz, 200kHz) × 2 colormaps (`ocean_r`, `EK500`)
194197
- Each echogram has a **pulse-mode colour bar** at the bottom: orange = Short pulse, blue = Long pulse
195198
- Time axis labelled with hourly ticks (UTC)
@@ -279,22 +282,19 @@ Stored as e.g. `2023-07-15/2023-07-15--combined--mvbs.zarr` (NASC: `2023-07-15--
279282

280283
```
281284
/mnt/data/output/
282-
├── sd-tpos2023-full-v01/ # 300 GB — per-day products
285+
├── sd-tpos2023-full-v01/ # ~380 GB — per-day products
283286
│ ├── 2023-05-30/
284-
│ │ ├── 2023-05-30--short_pulse.zarr # raw Sv
285-
│ │ ├── 2023-05-30--short_pulse--denoised.zarr # denoised Sv
286-
│ │ ├── 2023-05-30--short_pulse--mvbs.zarr # MVBS
287-
│ │ ├── 2023-05-30--short_pulse--mvbs.nc # MVBS (NetCDF)
288-
│ │ ├── 2023-05-30--short_pulse--nasc.zarr # NASC
289-
│ │ ├── 2023-05-30--short_pulse--nasc.nc # NASC (NetCDF)
290-
│ │ ├── 2023-05-30--long_pulse.zarr
291-
│ │ ├── 2023-05-30--long_pulse--denoised.zarr
292-
│ │ ├── ... (same pattern for long_pulse)
293-
│ │ ├── 2023-05-30--combined--mvbs.zarr # ← NEW: combined daily
294-
│ │ ├── 2023-05-30--combined--denoised.zarr
295-
│ │ ├── 2023-05-30--combined--sv.zarr
296-
│ │ ├── 2023-05-30--combined--nasc--38kHz.zarr
297-
│ │ ├── 2023-05-30--combined--nasc--200kHz.zarr
287+
│ │ ├── 2023-05-30--combined--sv.zarr # ← FINAL: raw Sv (both pulse modes)
288+
│ │ ├── 2023-05-30--combined--denoised.zarr # ← FINAL: denoised Sv
289+
│ │ ├── 2023-05-30--combined--mvbs.zarr # ← FINAL: MVBS
290+
│ │ ├── 2023-05-30--combined--nasc--38kHz.zarr # ← FINAL: NASC 38 kHz
291+
│ │ ├── 2023-05-30--combined--nasc--200kHz.zarr # ← FINAL: NASC 200 kHz
292+
│ │ ├── 2023-05-30--short_pulse.zarr # intermediate
293+
│ │ ├── 2023-05-30--short_pulse--denoised.zarr # intermediate
294+
│ │ ├── 2023-05-30--short_pulse--mvbs.zarr # intermediate
295+
│ │ ├── 2023-05-30--long_pulse.zarr # intermediate
296+
│ │ ├── 2023-05-30--long_pulse--denoised.zarr # intermediate
297+
│ │ └── ... (+ .nc copies, long_pulse mvbs/nasc)
298298
│ ├── 2023-05-31/
299299
│ ├── ... (141 day directories)
300300
│ └── 2023-11-05/
@@ -321,7 +321,7 @@ ls /mnt/data/output/sd-tpos2023-full-v01/
321321
source ~/workspace/venv/bin/activate
322322
python3 -c "
323323
import xarray as xr
324-
ds = xr.open_zarr('/mnt/data/output/sd-tpos2023-full-v01/2023-07-15/2023-07-15--short_pulse--nasc.zarr')
324+
ds = xr.open_zarr('/mnt/data/output/sd-tpos2023-full-v01/2023-07-15/2023-07-15--combined--mvbs.zarr')
325325
print(ds)
326326
"
327327
```
@@ -349,22 +349,34 @@ azcopy sync "/mnt/data/output/sd-tpos2023-full-v01" \
349349

350350
## 6. Data Products Summary
351351

352+
**Final per-day products** (combined pulse modes — the deliverables):
353+
354+
| Product | Count | Size | Format | Filename pattern |
355+
|---------|-------|------|--------|------------------|
356+
| Sv (raw) | 141 | ~40 GB | zarr | `*--combined--sv.zarr` |
357+
| Denoised Sv | 140 | ~30 GB | zarr | `*--combined--denoised.zarr` |
358+
| MVBS | 137 | ~9 GB | zarr | `*--combined--mvbs.zarr` |
359+
| NASC (per-freq) | 216 | ~3 MB | zarr | `*--combined--nasc--{38kHz,200kHz}.zarr` |
360+
| Per-day echograms | 1,610 | 3.3 GB | PNG | `perday_echograms/` |
361+
362+
**Campaign-level products:**
363+
352364
| Product | Count | Size | Format | Location |
353365
|---------|-------|------|--------|----------|
354-
| Raw Sv (per-day) | 277 | 193 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
355-
| Denoised Sv | 268 | 91 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
356-
| MVBS (per-day) | 261 | 9 GB | zarr + nc | `sd-tpos2023-full-v01/{day}/` |
357-
| NASC (per-day) | 229 | 44 MB | zarr + nc | `sd-tpos2023-full-v01/{day}/` |
358366
| Campaign MVBS (38 kHz) | 1 | 8.9 GB | zarr | `campaign_mvbs_combined_38kHz.zarr` |
359367
| Campaign echograms | 12 | 593 MB | PNG | `campaign_echograms/` |
360368
| Echodata track tiles | 1 | 1.3 MB | PMTiles | `tiles/` |
361369
| NASC biomass points | 6,135 | 1.5 MB | GeoJSON | `nasc_biomass/` |
362370
| NASC heatmaps | 3+3 | 656 KB | COG + PNG | `heatmaps/` |
363-
| Combined MVBS (per-day) | 137 | ~9 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
364-
| Combined denoised (per-day) | 140 | ~30 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
365-
| Combined raw Sv (per-day) | 141 | ~40 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
366-
| Combined NASC (per-day) | 216 | ~3 MB | zarr | `sd-tpos2023-full-v01/{day}/` |
367-
| Per-day echograms | 1,610 | 3.3 GB | PNG | `perday_echograms/` |
371+
372+
**Intermediate per-pulse-mode products** (on disk but not deliverables):
373+
374+
| Product | Count | Size | Format |
375+
|---------|-------|------|--------|
376+
| Raw Sv | 277 | 193 GB | zarr |
377+
| Denoised Sv | 268 | 91 GB | zarr |
378+
| MVBS | 261 | 9 GB | zarr + nc |
379+
| NASC | 229 | 44 MB | zarr + nc |
368380

369381
---
370382
