Commit fd193e7

docs(batch): update report with stage 14 — combined daily products + echograms
1 parent 213047b commit fd193e7

1 file changed

Lines changed: 47 additions & 3 deletions

scripts/batch_processing/PROCESSING_REPORT.md

@@ -32,10 +32,12 @@
 | MVBS zarrs | 9 GB |
 | Campaign MVBS combined | 8.9 GB |
 | NASC zarrs | 44 MB |
+| Combined daily zarrs (new) | ~82 GB |
+| Per-day echograms (new) | 3.3 GB |
 | Converted echodata | 5 GB |
 | Campaign echograms | 593 MB |
 | Tiles + GeoJSON + Heatmaps | 4 MB |
-| **Total used** | **314 GB / 1 TB** |
+| **Total used** | **396 GB / 1 TB** |
 
 ---
 
@@ -57,6 +59,7 @@ Stage 10: Campaign echograms (4 segments × 3 colormaps)
 Stage 11: Echodata PMTiles (vector tiles for map viz)
 Stage 12: NASC Biomass GeoJSON (depth-frequency merged points)
 Stage 13: NASC Heatmap COGs (raster overlays + PNG previews)
+Stage 14: Combined daily products + per-day echograms (NEW)
 ```
 
 ### Key Scripts
@@ -66,6 +69,7 @@ Stage 13: NASC Heatmap COGs (raster overlays + PNG previews)
 | `build_full_survey.py` | Main 13-stage pipeline (~2900 lines) |
 | `run_nasc_parallel.py` | Fast parallel NASC via numpy (replaced stage 7 NASC) |
 | `run_stages_9_to_13.py` | Standalone post-processing (stages 9–13 without re-running 1–8) |
+| `run_combine_daily.py` | Merge pulse modes per day + generate echograms with pulse markings |
 | `local_storage.py` | Monkey-patches Azure storage calls to local disk I/O |
 
 ---
@@ -168,6 +172,31 @@ Skipped with `--skip-perday-echograms` to prioritise campaign-level products. Ca
 - Grid: 0.5° resolution, scipy griddata interpolation, cKDTree search radius 0.5°
 - Stored in `/mnt/data/output/heatmaps/`
 
+### Stage 14: Combined Daily Products + Per-day Echograms (NEW)
+
+Merges short_pulse + long_pulse into single per-day combined zarrs. Channels renamed from instrument IDs (`EKA 266972-07 ES38-18|200-18C`) to frequency labels (`38kHz`, `200kHz`). Each dataset includes a `pulse_mode` variable (0=long, 1=short) for provenance.
+
+**Products combined:**
+
+| Product | Count | Method |
+|---------|-------|--------|
+| Combined MVBS | 137 | Concat along `ping_time` (depth aligned at 1m) |
+| Combined denoised Sv | 140 | Interpolated to 0.5m common depth grid, concat along `ping_time` |
+| Combined raw Sv | 141 | Same interpolation as denoised |
+| Combined NASC | 216 | Per-frequency files, concat along `distance` (offset to avoid overlap) |
+
+Stored as `{day}/{day}--combined--{product}.zarr` (NASC: `{day}--combined--nasc--{freq}.zarr`).
+
+**Per-day echograms:**
+
+- **1,610 PNG files** (3.3 GB total)
+- 3 products (MVBS, denoised, raw Sv) × 2 frequencies (38kHz, 200kHz) × 2 colormaps (`ocean_r`, `EK500`)
+- Each echogram has a **pulse-mode colour bar** at the bottom: orange = Short pulse, blue = Long pulse
+- Time axis labelled with hourly ticks (UTC)
+- Stored in `/mnt/data/output/perday_echograms/`
+
+**Processing**: 141 days × 4 workers = **~62 minutes** (`run_combine_daily.py`)
+
 ---
 
 ## 4. Issues Found and Fixed
@@ -260,7 +289,12 @@ Skipped with `--skip-perday-echograms` to prioritise campaign-level products. Ca
 │ │ ├── 2023-05-30--short_pulse--nasc.nc # NASC (NetCDF)
 │ │ ├── 2023-05-30--long_pulse.zarr
 │ │ ├── 2023-05-30--long_pulse--denoised.zarr
-│ │ ├── ... (same pattern)
+│ │ ├── ... (same pattern for long_pulse)
+│ │ ├── 2023-05-30--combined--mvbs.zarr # ← NEW: combined daily
+│ │ ├── 2023-05-30--combined--denoised.zarr
+│ │ ├── 2023-05-30--combined--sv.zarr
+│ │ ├── 2023-05-30--combined--nasc--38kHz.zarr
+│ │ ├── 2023-05-30--combined--nasc--200kHz.zarr
 │ ├── 2023-05-31/
 │ ├── ... (141 day directories)
 │ └── 2023-11-05/
@@ -269,6 +303,7 @@ Skipped with `--skip-perday-echograms` to prioritise campaign-level products. Ca
 ├── campaign_echograms/ # 593 MB — 12 PNG echograms
 ├── tiles/ # 1.9 MB — PMTiles + source GeoJSON
 ├── nasc_biomass/ # 1.5 MB — NASC points GeoJSON
+├── perday_echograms/ # 3.3 GB — 1,610 daily echogram PNGs (NEW)
 ├── heatmaps/ # 656 KB — COGs + PNGs + manifest
 ├── raw_downloads/ # empty (cleaned up)
 └── *.log # pipeline logs
@@ -325,6 +360,11 @@ azcopy sync "/mnt/data/output/sd-tpos2023-full-v01" \
 | Echodata track tiles | 1 | 1.3 MB | PMTiles | `tiles/` |
 | NASC biomass points | 6,135 | 1.5 MB | GeoJSON | `nasc_biomass/` |
 | NASC heatmaps | 3+3 | 656 KB | COG + PNG | `heatmaps/` |
+| Combined MVBS (per-day) | 137 | ~9 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
+| Combined denoised (per-day) | 140 | ~30 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
+| Combined raw Sv (per-day) | 141 | ~40 GB | zarr | `sd-tpos2023-full-v01/{day}/` |
+| Combined NASC (per-day) | 216 | ~3 MB | zarr | `sd-tpos2023-full-v01/{day}/` |
+| Per-day echograms | 1,610 | 3.3 GB | PNG | `perday_echograms/` |
 
 ---
 
@@ -354,7 +394,8 @@ azcopy sync "/mnt/data/output/sd-tpos2023-full-v01" \
 | Stage 11 PMTiles | ~10 sec | 141 tracks |
 | Stage 12 NASC GeoJSON | ~5 sec | 6,135 points |
 | Stage 13 NASC heatmaps | ~2 sec | 3 COGs + 3 PNGs |
-| **Total wall clock** | **~14 hours** | Including disk resize downtime |
+| **Stage 14 combined daily** | **~62 min** | 141 days, 4 workers, 661 zarrs + 1,610 PNGs |
+| **Total wall clock** | **~15 hours** | Including disk resize downtime |
 
 ---
 
@@ -370,4 +411,7 @@ dbb588f fix(batch): stages 11-12 — look for lat/lon in data_vars, prefer denoi
 519c302 fix: normalize_string_dtypes — handle numpy 2.x StringDType
 0c18a89 feat(batch): add stages 11-13 — echodata PMTiles, NASC biomass GeoJSON, NASC heatmap COGs
 a1e577e fix: list_denoised_zarrs scans local disk when local_storage is patched
+213047b fix(batch): deduplicate ping_time in MVBS/NASC combine too
+550c68d fix(batch): deduplicate ping_time + error handling for resilient parallel processing
+c8613b5 feat(batch): per-day pulse-mode merge + daily echograms with pulse markings
 ```
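The Stage 14 description in this diff (merge long and short pulse along `ping_time`, tag each ping with `pulse_mode`, relabel channels by frequency, deduplicate `ping_time`) can be illustrated with a minimal standard-library sketch. This is not the code in `run_combine_daily.py`, which is not shown here. The function names, the `(ping_time, value)` record layout, and the rule that the long-pulse ping wins when two pings share a timestamp are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# pulse_mode encoding as stated in the report: 0 = long, 1 = short
LONG, SHORT = 0, 1


def frequency_label(nominal_hz):
    """Turn a nominal frequency in Hz into a label such as '38kHz' (hypothetical helper)."""
    return f"{round(nominal_hz / 1000)}kHz"


def merge_pulse_modes(long_pings, short_pings):
    """Merge two lists of (ping_time, value) into one time-ordered list.

    Returns (ping_time, value, pulse_mode) tuples. If a ping_time occurs
    more than once, only the first record after a stable sort is kept,
    so the long-pulse ping wins a tie (an assumption, not the report's rule).
    """
    tagged = [(t, v, LONG) for t, v in long_pings]
    tagged += [(t, v, SHORT) for t, v in short_pings]
    tagged.sort(key=lambda rec: rec[0])  # stable sort on ping_time only

    merged, seen = [], set()
    for rec in tagged:
        if rec[0] in seen:
            continue  # drop duplicate ping_time
        seen.add(rec[0])
        merged.append(rec)
    return merged


# Toy data: the two modes overlap at t0 + 4 s.
t0 = datetime(2023, 5, 30)
long_pings = [(t0 + timedelta(seconds=s), -70.0) for s in (0, 2, 4)]
short_pings = [(t0 + timedelta(seconds=s), -65.0) for s in (4, 6)]
merged = merge_pulse_modes(long_pings, short_pings)
```

In an xarray-based pipeline the same idea would presumably be expressed with a concatenation along `ping_time` followed by dropping duplicate index values. The sketch only shows the ordering and deduplication logic.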

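The "Interpolated to 0.5m common depth grid" step for the combined Sv products can be sketched in the same spirit. The report does not say which interpolation method is used or whether it is applied in the dB or the linear domain. The example below assumes plain linear interpolation of the stored values and leaves grid points outside the measured range empty. `make_depth_grid` and `interp_to_grid` are hypothetical names.

```python
from bisect import bisect_left


def make_depth_grid(max_depth_m, step_m=0.5):
    """Regular depth grid from 0 to max_depth_m inclusive (0.5 m step as in the report)."""
    n = int(max_depth_m / step_m) + 1
    return [i * step_m for i in range(n)]


def interp_to_grid(depths, values, grid):
    """Linearly interpolate one ping onto `grid`.

    `depths` must be sorted ascending. Grid points outside the measured
    range are returned as None rather than extrapolated.
    """
    out = []
    for d in grid:
        if d < depths[0] or d > depths[-1]:
            out.append(None)
            continue
        i = bisect_left(depths, d)  # first index with depths[i] >= d
        if depths[i] == d:
            out.append(values[i])
        else:
            d0, d1 = depths[i - 1], depths[i]
            w = (d - d0) / (d1 - d0)
            out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out


# One ping sampled at 1 m spacing, regridded to 0.5 m.
depths = [0.0, 1.0, 2.0]
sv = [-60.0, -70.0, -80.0]
grid = make_depth_grid(2.0)
regridded = interp_to_grid(depths, sv, grid)
```

Once every ping from both pulse modes sits on the same depth grid, the per-mode arrays share a depth axis and can be concatenated along `ping_time` as described for the combined MVBS product.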