Use MusicBrainz release aliases in Spotify + Apple Music matchers #176

@dprodger

Background

The MB importer doesn't fetch release aliases (backend/integrations/musicbrainz/client.py:954 requests inc=artist-credits+recordings+artist-rels — no aliases), and the releases table has no analog to songs.alt_titles. So MB-side alias edits don't feed our Spotify or Apple Music matchers.

Concrete failure case: MB release 705bd412-637a-47a5-91c5-05d519f4ef03 is titled "Sweet Georgia Brown" / Johnny Mercer With Paul Weston and His Orchestra (1995, single track). The same content lives on Apple Music + Spotify under "By the River Sainte Marie" / Johnny Mercer (2014). Our matcher's search ladder never produces "By the River Sainte Marie" as a query, and even if Spotify's fuzzy search returned it, validation against "Sweet Georgia Brown" would reject it. An alias has been added on MB; once we read it, both the search-side and validation-side gates would clear.

Internal release ID: 1ec447b1-b8ab-4514-95d6-046f618cf582.

Plan

Add MB release aliases to our import pipeline as releases.alt_titles, then teach both matchers to search and validate against them, mirroring the existing songs.alt_titles pattern.

Phase 1 — Schema migration

  • File: sql/migrations/016_add_release_alt_titles.sql (next sequential number; 015_research_jobs.sql is the most recent numbered one).
  • Change: ALTER TABLE releases ADD COLUMN IF NOT EXISTS alt_titles TEXT[] NOT NULL DEFAULT '{}';. Skip the GIN index for now — alt-title search isn't a current need.
  • Risks: on PostgreSQL 11 and later, adding a NOT NULL column with a constant default such as '{}' is a metadata-only change with no table rewrite, but verify on the target version. Picking flat text[] instead of jsonb gives up locale/type metadata (see Open questions).
  • Done when: column exists with default {}, migration is idempotent.

Phase 2 — Importer: fetch + parse aliases

  • Files: backend/integrations/musicbrainz/client.py (get_release_details, line ~954), backend/integrations/musicbrainz/parsing.py (parse_release_data, line 195), backend/integrations/musicbrainz/release_importer.py (_create_release, line 605).
  • Changes:
    • client.py: change inc to 'aliases+artist-credits+recordings+artist-rels'.
    • parsing.py: extract mb_release.get('aliases') or [], take each entry's name, drop entries where type == 'Search hint', dedupe case-insensitively against title and against each other (preserve first-seen casing), return as alt_titles.
    • release_importer.py: add alt_titles to the INSERT column list, and — critically — change the ON CONFLICT (musicbrainz_release_id) DO UPDATE SET musicbrainz_release_id = EXCLUDED.musicbrainz_release_id no-op into DO UPDATE SET alt_titles = EXCLUDED.alt_titles so re-imports refresh aliases on existing rows.
  • Risks: existing 30-day on-disk MB JSON cache won't have an aliases key — newly imported releases get aliases, but rows whose details came from cache won't until cache expires or force_refresh=True. The Phase 3 backfill covers existing rows. Also: flipping the ON CONFLICT branch to actually UPDATE means any legitimate re-import path overwrites alt_titles — fine if MB is the source of truth.
  • Done when: a fresh get_release_details(force_refresh=True) of 705bd412-… returns an aliases array, parse_release_data produces an alt_titles list including "By the River Sainte Marie", and re-importing populates the column.
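
A minimal sketch of the alias extraction described for parsing.py, assuming the MB JSON payload carries `aliases` as a list of objects with `name` and `type` keys. The helper name `extract_alt_titles` is hypothetical; the real change would live inside parse_release_data.

```python
from typing import Any, Dict, List


def extract_alt_titles(mb_release: Dict[str, Any]) -> List[str]:
    """Return alias names from an MB release payload, filtered and deduplicated.

    Hypothetical helper: drops 'Search hint' aliases, then dedupes
    case-insensitively against the release title and against earlier aliases,
    keeping the first-seen casing.
    """
    title = (mb_release.get("title") or "").strip()
    seen = {title.casefold()} if title else set()
    alt_titles: List[str] = []

    # `aliases` is missing in cached pre-change payloads, so default to [].
    for alias in mb_release.get("aliases") or []:
        if alias.get("type") == "Search hint":
            continue
        name = (alias.get("name") or "").strip()
        key = name.casefold()
        if not name or key in seen:
            continue
        seen.add(key)
        alt_titles.append(name)

    return alt_titles


release = {
    "title": "Sweet Georgia Brown",
    "aliases": [
        {"name": "By the River Sainte Marie", "type": "Release name"},
        {"name": "by the river sainte marie", "type": None},
        {"name": "sweet georgia brown", "type": None},
        {"name": "swt georgia brwn", "type": "Search hint"},
    ],
}
print(extract_alt_titles(release))  # ['By the River Sainte Marie']
```

Returning an empty list when the key is absent keeps cached payloads parseable, which matters because of the 30-day cache risk noted above.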

Phase 3 — One-time backfill

  • File: backend/scripts/backfill_release_alt_titles.py (model: backend/scripts/backfill_mb_track_titles.py — same ScriptBase / MusicBrainzSearcher pattern).
  • Logic: SELECT id, musicbrainz_release_id FROM releases WHERE musicbrainz_release_id IS NOT NULL AND alt_titles = '{}' for natural resumability. Note that rows MB has no aliases for stay at '{}' and are re-fetched on every rerun. For each row, client.get_release_details(mbid, force_refresh=True) to bypass stale cache, parse, UPDATE releases SET alt_titles = %s WHERE id = %s. Standard --limit, --dry-run, --debug. Commit in batches of ~500.
  • Risks: MB rate limit is 1 req/sec; runtime ≈ N seconds where N is the count of releases with MBIDs. force_refresh=True rewrites every cache file — bigger disk churn but correct.
  • Done when: script runs to completion with stats (processed, updated, errors); the only remaining alt_titles = '{}' rows are ones MB genuinely has no aliases for.
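
The backfill loop could be sketched roughly as below. This is not the ScriptBase / MusicBrainzSearcher wiring: the fetch, update, and commit steps are injected as plain callables (all hypothetical names) so the batching and stats logic can be read on its own. The explicit sleep is an assumption; the real client may already throttle to 1 req/sec.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def backfill_alt_titles(
    rows: Iterable[Tuple[str, str]],                 # (release id, MBID) pairs
    fetch_alt_titles: Callable[[str], List[str]],    # MBID -> parsed alt_titles
    update_row: Callable[[str, List[str]], None],    # runs the UPDATE
    commit: Callable[[], None],
    batch_size: int = 500,
    min_interval: float = 1.0,                       # assumed MB pacing, seconds
    dry_run: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Process rows one at a time, committing every `batch_size` updates."""
    stats = {"processed": 0, "updated": 0, "errors": 0}
    pending = 0

    for release_id, mbid in rows:
        stats["processed"] += 1
        try:
            alt_titles = fetch_alt_titles(mbid)
        except Exception:
            # Count the failure and keep going so one bad MBID does not
            # abort the whole run.
            stats["errors"] += 1
            sleep(min_interval)
            continue

        # Only write when MB actually returned aliases and this is a real run.
        if alt_titles and not dry_run:
            update_row(release_id, alt_titles)
            stats["updated"] += 1
            pending += 1
            if pending >= batch_size:
                commit()
                pending = 0

        sleep(min_interval)

    if pending:
        commit()  # flush the final partial batch
    return stats
```

Injecting `sleep` keeps the loop testable without waiting on the rate limit.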

Phase 4 — Spotify matcher consumption

  • Files: backend/integrations/spotify/db.py (release SELECTs around lines 80, 189, 255, 329 — pull rel.alt_titles), backend/integrations/spotify/matcher.py (line 410 call site), backend/integrations/spotify/search.py (search_spotify_album, line 235), backend/integrations/spotify/matching.py (validate_album_match, line 827).
  • Changes:
    • Hydrate release['alt_titles'] in every SELECT that feeds the matcher.
    • matcher.py: pass release.get('alt_titles') or [] into search_spotify_album as a new album_alt_titles kwarg.
    • search.py: after the existing strip-strategy block but before the album-only fallback (line ~357), iterate album_alt_titles and append album:"<alias>" artist:"<artist>" and "<alias>" "<artist>" for each. Skip aliases equal to search_album after normalize_for_search.
    • matching.py: validate_album_match takes a new expected_album_alt_titles: Optional[List[str]] = None; compute album_similarity against [expected_album, *alt_titles] and take the max for both fuzzy score AND substring check. Add a which_album_title_matched field to scores for log clarity.
  • Risks: with N aliases the search ladder grows by 2N strategies per release — cost is one extra Spotify API call per strategy that doesn't short-circuit. Existing ladder already has 8+ strategies. Validation-side: a vandalized alias like "Greatest Hits" could let unrelated Spotify candidates pass the album check — but the artist similarity gate still applies, so the blast radius is limited to same-artist false positives. Keep a per-strategy log line attributing matches to aliases.
  • Done when: a release with alt_titles=['By the River Sainte Marie'] produces alias-bearing search strategies in debug logs, and a candidate whose Spotify name is "By the River Sainte Marie" passes validate_album_match even when expected_album is "Sweet Georgia Brown".
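
To illustrate the search.py change, here is a sketch of how the alias-derived query strings might be generated. `alias_search_queries` is a hypothetical helper, and the `normalize_for_search` below is only a stand-in for the real function (assumed here to lowercase and strip punctuation). Skipping aliases that duplicate each other is an extra not specified in the plan.

```python
import re
from typing import List


def normalize_for_search(text: str) -> str:
    """Stand-in normalizer (assumption): lowercase, drop punctuation, collapse spaces."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def alias_search_queries(
    search_album: str, artist: str, album_alt_titles: List[str]
) -> List[str]:
    """Build two Spotify queries per usable alias: field-filtered, then plain quoted."""
    seen = {normalize_for_search(search_album)}
    queries: List[str] = []

    for alias in album_alt_titles:
        key = normalize_for_search(alias)
        # Skip aliases that normalize to the primary title or to an earlier alias.
        if not key or key in seen:
            continue
        seen.add(key)
        queries.append(f'album:"{alias}" artist:"{artist}"')
        queries.append(f'"{alias}" "{artist}"')

    return queries


for q in alias_search_queries(
    "Sweet Georgia Brown",
    "Johnny Mercer",
    ["By the River Sainte Marie", "sweet georgia brown!"],
):
    print(q)
```

In the real ladder these strings would be appended after the strip-strategy block and before the album-only fallback, so the 2N growth only costs API calls when earlier strategies fail.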

Phase 5 — Apple Music matcher consumption

  • Files: backend/integrations/apple_music/db.py (release SELECTs around lines 108, 171), backend/integrations/apple_music/matcher.py (lines 233, 277 call sites), backend/integrations/apple_music/search.py (search_local_catalog line 49 and search_api line 129), backend/integrations/apple_music/matching.py (validate_album_match line 22).
  • Changes: same shape as Phase 4. In both search ladders, after the strip strategies and before the album-only / punctuation-stripped fallbacks, append (artist_name, alias) and (None, alias) strategies per alias. In validate_album_match accept expected_album_alt_titles, compute album_similarity = max(calculate_similarity(t, norm_am_album) for t in [expected_album, *alts]), same for the substring check.
  • Risks: search_local_catalog runs a SQLite FTS query per strategy, so cost scales linearly with alias count — local is cheap, fine. iTunes API calls are not — consider capping at the first ~3 aliases if MB releases ever sprout pathological alias counts.
  • Done when: same verification as Phase 4 against Apple's catalog and API paths.
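
The max-over-titles validation shared by Phases 4 and 5 can be sketched as follows. `calculate_similarity` here is a difflib-based stand-in, not the project's implementation, and `best_album_title_match` is a hypothetical helper; the real validate_album_match would apply its own thresholds to the returned score.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def calculate_similarity(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (assumption, not the real scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_album_title_match(
    expected_album: str,
    candidate_album: str,
    expected_album_alt_titles: Optional[List[str]] = None,
) -> Tuple[float, str, bool]:
    """Return (best score, title that produced it, substring hit across any title)."""
    titles = [expected_album, *(expected_album_alt_titles or [])]

    # max() keeps the first maximal entry, so ties favour the primary title.
    scored = [(calculate_similarity(t, candidate_album), t) for t in titles]
    best_score, best_title = max(scored, key=lambda pair: pair[0])

    cand = candidate_album.lower().strip()
    substring_hit = bool(cand) and any(
        t.lower() in cand or cand in t.lower() for t in titles if t
    )
    return best_score, best_title, substring_hit


score, matched, substr = best_album_title_match(
    "Sweet Georgia Brown",
    "By the River Sainte Marie",
    ["By the River Sainte Marie"],
)
print(score, matched, substr)  # 100.0 By the River Sainte Marie True
```

Returning the winning title alongside the score is what would feed the which_album_title_matched log field.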

Phase 6 — Tests

  • Files: backend/tests/test_spotify_matcher.py (model on existing test_alt_titles_fall_back style cases), backend/tests/test_apple_music_matcher.py, plus a small parser test for parse_release_data alias extraction (verify dedup, "Search hint" filter, name-only output).
  • Cases: validator passes when alias matches but primary doesn't; validator still fails when neither matches; search ladder includes alias-derived strategies in the right slot order; importer's ON CONFLICT branch updates alt_titles on a re-import.
  • Done when: all new tests pass; existing matcher tests remain green.

Phase 7 — End-to-end verification

  • Run the backfill against MBID 705bd412-637a-47a5-91c5-05d519f4ef03 so alt_titles is populated, then re-run the matcher via the admin "Run matcher & update DB" buttons on /admin/releases/1ec447b1-b8ab-4514-95d6-046f618cf582.
  • Done when: Spotify and Apple links resolve to the "By the River Sainte Marie" album for the "Sweet Georgia Brown" release, with debug logs showing the alias was the title that cleared validation.

Open questions

  1. Flat text[] vs jsonb with locale/type/primary? Flat is enough for the matcher and matches songs.alt_titles. Jsonb would let future i18n / UI differentiate "the official translation" from "search hint" but adds complexity now. Recommendation: flat; revisit if i18n lands.
  2. Include MB disambiguation as an alias? It's prose ("with bonus tracks", "live"), not a title. Recommendation: no — folding it into similarity would mostly add noise.
  3. Filter alias type? Recommendation: drop Search hint (low signal, often gibberish), keep the rest.
  4. Cache invalidation strategy? Backfill uses force_refresh=True, which rewrites cache entries — preferred, since it both fixes the data and warms cache for next time.
  5. Vandalism risk? Alias-validation widens the album gate, but artist gate is unchanged — so a malicious alias like "Greatest Hits" can only cause same-artist mismatches (wrong compilation by the right artist). A log field tagging which title matched makes post-hoc detection easy.

Verification target

After all phases land, this should pass:

  1. Run backfill on release 1ec447b1-b8ab-4514-95d6-046f618cf582; confirm alt_titles includes "By the River Sainte Marie".
  2. On /admin/releases/1ec447b1-…, click "Run matcher & update DB" for both Spotify and Apple Music.
  3. Confirm both produce a match diff — Spotify links the album, Apple links the album + adds artwork.
  4. Diagnose log shows the alias was the title that cleared validate_album_match.

Resolves the issue surfaced in #175.
