Skip to content

Latest commit

 

History

History
196 lines (157 loc) · 5.04 KB

File metadata and controls

196 lines (157 loc) · 5.04 KB

UniProt Species Sync - Implementation Summary

What Was Built

A complete system to keep uniprot_species.yaml synchronized with multiple authoritative sources, with special integration for GO (Gene Ontology) metadata.

Before

  • 159 organisms from manual curation and previous syncs
  • No integration with GO goex.yaml
  • Missing ~32 organisms used by GO annotations

After

  • 294 organisms (159 → 294, +135 new)
  • Full integration with GO goex.yaml (171 organisms)
  • All GO organisms now included for annotation compatibility
  • Smart merging preserves manual edits while adding new data

Files Created

Scripts

  1. scripts/fetch_from_go_goex.py

    • Fetches organisms from GO's goex.yaml metadata
    • Extracts 171 organisms with proteome IDs
    • Caches to JSON for fast merging
  2. scripts/merge_uniprot_sources.py

    • Merges multiple organism sources
    • De-duplicates by UniProt code
    • Tracks sources for each organism
    • Preserves manual edits

Enhanced

  1. scripts/sync_uniprot_species.py
    • Added --json-output for caching
    • Fixed /dev/null output handling
    • Better integration with merge workflow

Documentation

  1. docs/how-to-guides/sync-uniprot-species.md

    • Complete user guide
    • Workflow examples
    • Troubleshooting
  2. scripts/README.md

    • Technical reference
    • Script details
    • Development guide

Justfile Commands

just sync-uniprot           # Recommended: GO + common organisms
just sync-uniprot-full      # Full: includes 500+ extended organisms
just fetch-go-organisms     # Fetch GO organisms only
just fetch-common-organisms # Fetch common organisms only
just uniprot-stats         # Show cache statistics

Coverage Verification

Sources

  • GO organisms: 171 (from goex.yaml)
  • Common organisms: 28 (curated model organisms)
  • Existing entries: 120 (unique to previous version)
  • Overlap: 25 organisms (in both GO and common)

Results

  • All 171 GO organisms present in final YAML
  • All 28 common organisms present in final YAML
  • All 120 existing-only organisms preserved
  • Total: 294 unique organisms

Math

294 = 171 (GO) + 28 (common) - 25 (overlap) + 120 (existing only)

Key Features

Smart Merging

  • De-duplicates by UniProt mnemonic code
  • Prefers later/more authoritative sources
  • Fills missing fields from any source
  • Preserves manual edits from existing YAML

Source Tracking

Each organism now has an annotations.sources field:

SP_HUMAN:
  description: 'Homo sapiens (Human) - Proteome: UP000005640'
  annotations:
    sources: common, GO

Caching

  • JSON caches in cache/ directory
  • Fast re-merging without API calls
  • Regeneratable anytime

Validation

  • Integrates with existing just site workflow
  • LinkML schema validation passes
  • No breaking changes

Usage

Quick Sync (Recommended)

just sync-uniprot
just site  # Validate and rebuild

This syncs with:

  • GO goex.yaml organisms (171)
  • Common model organisms (28)
  • Existing manual entries (preserved)

Full Sync (Optional)

just sync-uniprot-full  # Includes 500+ extended organisms
just site

Check Status

just uniprot-stats

Output:

=== UniProt Species Cache Statistics ===

GO organisms: 171
Common organisms: 28
Extended organisms: [not cached]
Current YAML entries: 294

Technical Details

Merge Strategy

  1. Load existing YAML (baseline)
  2. Load common organisms (override empties)
  3. Load GO organisms (highest priority)
  4. De-duplicate by code
  5. Track all contributing sources
  6. Generate new YAML with backup

Data Flow

GO goex.yaml → fetch_from_go_goex.py → cache/go_organisms.json
UniProt API → sync_uniprot_species.py → cache/common_organisms.json
                                     ↓
                              merge_uniprot_sources.py
                                     ↓
                        uniprot_species.yaml (294 organisms)

Backup Safety

  • Automatic .yaml.bak files
  • Never overwrites without backup
  • Manual edits preserved through merge

Testing Results

✅ All scripts run without errors ✅ Schema validates with gen-yaml ✅ All GO organisms present (171/171) ✅ All common organisms present (28/28) ✅ Existing entries preserved (120) ✅ No duplicates ✅ Source tracking working ✅ Proteome IDs complete (294/294)

Answer to Original Question

"remind me how we keep uniprot_species up to date - do we have everything in goex.yaml?"

Before

  • ❌ No integration with goex.yaml
  • ❌ Missing ~32 GO organisms
  • Manual sync only

Now

  • ✅ Full integration with GO goex.yaml
  • ✅ All 171 GO organisms included
  • ✅ Automated sync with just sync-uniprot
  • ✅ Best union of GO + common + existing

Going Forward

Run just sync-uniprot periodically to:

  • Pick up new GO organisms from goex.yaml
  • Update proteome IDs
  • Add new common organisms
  • Preserve all manual edits

The system now maintains complete GO compatibility while preserving your curated additions!