A complete system to keep uniprot_species.yaml synchronized with multiple authoritative sources, with special integration for GO (Gene Ontology) metadata.
Before:
- 159 organisms from manual curation and previous syncs
- No integration with GO goex.yaml
- Missing ~32 organisms used by GO annotations
After:
- 294 organisms (159 → 294, +135 new)
- Full integration with GO goex.yaml (171 organisms)
- All GO organisms now included for annotation compatibility
- Smart merging preserves manual edits while adding new data
- scripts/fetch_from_go_goex.py - Fetches organisms from GO's goex.yaml metadata
  - Extracts 171 organisms with proteome IDs
  - Caches to JSON for fast merging
- scripts/merge_uniprot_sources.py - Merges multiple organism sources
  - De-duplicates by UniProt code
  - Tracks sources for each organism
  - Preserves manual edits
- scripts/sync_uniprot_species.py
  - Added --json-output for caching
  - Fixed /dev/null output handling
  - Better integration with merge workflow
- docs/how-to-guides/sync-uniprot-species.md - Complete user guide
  - Workflow examples
  - Troubleshooting
- scripts/README.md - Technical reference
  - Script details
  - Development guide
just sync-uniprot # Recommended: GO + common organisms
just sync-uniprot-full # Full: includes 500+ extended organisms
just fetch-go-organisms # Fetch GO organisms only
just fetch-common-organisms # Fetch common organisms only
just uniprot-stats # Show cache statistics
- GO organisms: 171 (from goex.yaml)
- Common organisms: 28 (curated model organisms)
- Existing entries: 120 (unique to previous version)
- Overlap: 25 organisms (in both GO and common)
- ✅ All 171 GO organisms present in final YAML
- ✅ All 28 common organisms present in final YAML
- ✅ All 120 existing-only organisms preserved
- ✅ Total: 294 unique organisms
294 = 171 (GO) + 28 (common) - 25 (overlap) + 120 (existing only)
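The total above is an inclusion-exclusion count: GO and common organisms are combined, their overlap is subtracted once, and the entries found only in the previous YAML are added. A minimal sketch to re-check the arithmetic (the function name `union_size` is illustrative and not part of the scripts):

```python
def union_size(go: int, common: int, overlap: int, existing_only: int) -> int:
    """Count unique organisms across GO, common, and existing-only entries."""
    # Inclusion-exclusion for two sets: |GO ∪ common| = |GO| + |common| - |GO ∩ common|
    go_or_common = go + common - overlap
    # Existing-only entries are disjoint from both sets by definition
    return go_or_common + existing_only


total = union_size(go=171, common=28, overlap=25, existing_only=120)
print(total)  # 294
```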
- De-duplicates by UniProt mnemonic code
- Prefers later/more authoritative sources
- Fills missing fields from any source
- Preserves manual edits from existing YAML
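The field-level rules above could be sketched roughly as follows. This is not the code in merge_uniprot_sources.py; the function name `merge_record` and the `override` flag are assumptions made for illustration, and "preserving manual edits" is taken to mean that empty incoming values never blank out an existing field.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def merge_record(base: Record, incoming: Record, override: bool) -> Record:
    """Return a new record combining `base` with `incoming`.

    Empty values in `incoming` never replace anything. Non-empty values
    fill gaps in `base`, and replace existing values only when `override`
    is True (i.e. the incoming source is treated as more authoritative).
    """
    merged = dict(base)
    for field, value in incoming.items():
        if not value:
            continue  # nothing to contribute for this field
        if override or not merged.get(field):
            merged[field] = value
    return merged


# Illustrative records; the field names are assumptions
existing = {"description": "Homo sapiens (Human)", "proteome": ""}
go = {"description": "", "proteome": "UP000005640"}
result = merge_record(existing, go, override=True)
print(result)  # {'description': 'Homo sapiens (Human)', 'proteome': 'UP000005640'}
```

Returning a new dictionary rather than mutating `base` keeps the previous state available for comparison or backup.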
Each organism now has an annotations.sources field:
SP_HUMAN:
  description: 'Homo sapiens (Human) - Proteome: UP000005640'
  annotations:
    sources: common, GO
- JSON caches in the cache/ directory
- Fast re-merging without API calls
- Regeneratable anytime
- Integrates with existing just site workflow
- LinkML schema validation passes
- No breaking changes
just sync-uniprot
just site # Validate and rebuild
This syncs with:
- GO goex.yaml organisms (171)
- Common model organisms (28)
- Existing manual entries (preserved)
just sync-uniprot-full # Includes 500+ extended organisms
just site
just uniprot-stats
Output:
=== UniProt Species Cache Statistics ===
GO organisms: 171
Common organisms: 28
Extended organisms: [not cached]
Current YAML entries: 294
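The recipe behind just uniprot-stats is not shown here. As a rough sketch of how such counts could be derived, assuming each cache file is a JSON list or a JSON object keyed by organism code (the helper names `cached_count` and `stats_lines` are hypothetical):

```python
import json
from pathlib import Path
from typing import List


def cached_count(path: Path) -> str:
    """Return the number of entries in a JSON cache, or '[not cached]' if absent."""
    if not path.exists():
        return "[not cached]"
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return str(len(data))  # len() works for a JSON list or a JSON object


def stats_lines(cache_dir: Path) -> List[str]:
    """Build report lines for the two cache files named in this document."""
    return [
        "=== UniProt Species Cache Statistics ===",
        f"GO organisms: {cached_count(cache_dir / 'go_organisms.json')}",
        f"Common organisms: {cached_count(cache_dir / 'common_organisms.json')}",
    ]


if __name__ == "__main__":
    print("\n".join(stats_lines(Path("cache"))))
```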
- Load existing YAML (baseline)
- Load common organisms (override empties)
- Load GO organisms (highest priority)
- De-duplicate by code
- Track all contributing sources
- Generate new YAML with backup
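The de-duplication and source-tracking steps above can be illustrated with a small sketch. It assumes each layer is simply a source label plus the organism codes it contributes; the function name `track_sources` and the sample codes are illustrative, and the actual labels written by merge_uniprot_sources.py may differ.

```python
from typing import Dict, Iterable, List, Tuple


def track_sources(layers: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, str]:
    """Map each organism code to a comma-separated list of contributing sources.

    `layers` is given in load order, so source names appear in that order.
    Keying by code means an organism seen in several layers yields one entry.
    """
    contributors: Dict[str, List[str]] = {}
    for source_name, codes in layers:
        for code in codes:
            names = contributors.setdefault(code, [])
            if source_name not in names:
                names.append(source_name)
    return {code: ", ".join(names) for code, names in contributors.items()}


# Illustrative layers in load order (lower priority first)
layers = [
    ("common", ["SP_HUMAN", "SP_MOUSE"]),
    ("GO", ["SP_HUMAN", "SP_ARATH"]),
]
sources = track_sources(layers)
print(sources["SP_HUMAN"])  # common, GO
```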
GO goex.yaml → fetch_from_go_goex.py → cache/go_organisms.json
UniProt API → sync_uniprot_species.py → cache/common_organisms.json
↓
merge_uniprot_sources.py
↓
uniprot_species.yaml (294 organisms)
- Automatic .yaml.bak files
- Never overwrites without backup
- Manual edits preserved through merge
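One common way to get the backup behaviour listed above is to copy the current file aside before writing the new one. A minimal sketch, assuming the backup sits next to the original with .bak appended (the function name `write_with_backup` is hypothetical and not taken from the scripts):

```python
import shutil
from pathlib import Path


def write_with_backup(path: Path, new_text: str) -> Path:
    """Write `new_text` to `path`, first copying any existing file to `<name>.bak`."""
    backup = path.with_name(path.name + ".bak")
    if path.exists():
        # copy2 also preserves timestamps, which helps when comparing versions
        shutil.copy2(path, backup)
    path.write_text(new_text, encoding="utf-8")
    return backup
```

For uniprot_species.yaml this produces uniprot_species.yaml.bak alongside it. Each run replaces the previous backup, so only the most recent prior version is kept.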
✅ All scripts run without errors
✅ Schema validates with gen-yaml
✅ All GO organisms present (171/171)
✅ All common organisms present (28/28)
✅ Existing entries preserved (120)
✅ No duplicates
✅ Source tracking working
✅ Proteome IDs complete (294/294)
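Checks like the ones above could be automated along these lines. This is a sketch rather than the project's actual verification code; the function name `verify_merge` and its inputs (a list of final codes plus the codes each source is required to contribute) are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def verify_merge(final_codes: List[str], required: Dict[str, Iterable[str]]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []

    # No duplicates: every code should appear exactly once in the final YAML
    duplicates = sorted(code for code, n in Counter(final_codes).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate codes: {duplicates}")

    # Coverage: every code from each required source must be present
    present = set(final_codes)
    for source_name, codes in required.items():
        missing = sorted(set(codes) - present)
        if missing:
            problems.append(f"{source_name}: missing {missing}")

    return problems
```

Returning the problems as a list, instead of raising on the first failure, lets a single run report every failed check.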
"remind me how we keep uniprot_species up to date - do we have everything in goex.yaml?"
- ❌ No integration with goex.yaml
- ❌ Missing ~32 GO organisms
- Manual sync only
- ✅ Full integration with GO goex.yaml
- ✅ All 171 GO organisms included
- ✅ Automated sync with just sync-uniprot
- ✅ Best union of GO + common + existing
Run just sync-uniprot periodically to:
- Pick up new GO organisms from goex.yaml
- Update proteome IDs
- Add new common organisms
- Preserve all manual edits
The system now maintains complete GO compatibility while preserving your curated additions!