This repo owns the curated YAML config set for html2rss.
Primary goal: add or repair configs that are stable, shippable, and easy to verify. Prefer a narrow, clean surface over a broad noisy one.
- Source of truth here: `lib/html2rss/configs/`.
- Do not hand-edit generated schema output.
- Keep config work separate from downstream docs, web, or example changes unless the task explicitly includes them.
- Use the registrable domain folder, not a subdomain folder, unless there is a strong existing reason.
- Start from the cleanest article list the site offers, not the marketing homepage by default.
- Prefer stable list/detail extraction over extracting every possible field.
- If the site only becomes reliable on a narrower path, use that narrower path.
- Omit brittle fields. If dates or descriptions are low quality, leave them out.
- Set `enhance: false` when enhancement pulls in page chrome, duplicate cards, or unrelated links.
Prefer these surfaces first:
- dedicated newsroom or blog archive pages
- category pages with one repeated card structure
- stable subpaths like `/blog/latest` or `/blog/everything/`
Avoid these unless they are the only workable option:
- homepages with hero content mixed with promos
- pages that combine multiple unrelated card systems
- infinite-scroll surfaces unless Browserless is already clearly required
- localized or geo-redirecting entry pages when a stable non-localized path exists
Start with the smallest useful selector set: `items`, `title`, `url`.
Add fields only when they are clean: `description`, `published_at`, `author`, `categories`.
Useful patterns:
- Prefer the repeated article card itself as `items`, especially when it is a single anchor.
- Anchor on article URLs or stable path fragments instead of generic headings.
- Keep selectors item-local when possible.
- Do not add complexity to recover weak optional fields.
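The minimal set above, in config form, might look like the sketch below. The URL and selector values are hypothetical, and the schema modeline and exact field layout should follow the existing configs in `lib/html2rss/configs/`:

```yaml
# Hypothetical sketch; selector values depend on the target site's markup.
channel:
  url: https://example.com/blog
selectors:
  items:
    selector: 'article.card'
  title:
    selector: 'h3'
  url:
    selector: 'a'
    extractor: href
  # Add description/published_at/author/categories only after items/title/url are clean.
```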
Use Chrome MCP when the static HTML is unclear, the page is hydrated, or Faraday fetch returns zero items while the browser shows a valid list.
Recommended sequence:
- Open the target URL.
- Take an accessibility snapshot.
- Identify the exact repeated item boundary.
- Confirm the title and URL live inside that boundary.
- Record the final URL if the page redirects by locale or renders a different surface than expected.
If Chrome MCP is unavailable ("Transport closed" or page-lock errors), run this recovery sequence:
- Kill stale Chrome MCP processes (`pkill -9 -f 'chrome-devtools-mcp|Chrome for Testing'`).
- Retry Chrome MCP once before continuing.
- If still unavailable, continue with `curl -I -L`, runtime `feed`, and HTML inspection in a temporary file.
- Explicitly report the Chrome MCP outage in the final handoff.
Use Browserless when:
- the page is JS-rendered
- Faraday fetch returns zero items but Chrome shows a valid repeated list
- the site is bot-sensitive enough that static fetch is unreliable
Local Browserless notes:
- `html2rss-web` exposes a local endpoint at `ws://127.0.0.1:4002`
- Browserless fetch tests require `BROWSERLESS_IO_WEBSOCKET_URL`
- custom websocket endpoints also require `BROWSERLESS_IO_API_TOKEN`
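For a local run, both variables can be exported up front; the token value below is a placeholder, not a real credential:

```shell
# Point fetch specs at the local Browserless endpoint exposed by html2rss-web.
export BROWSERLESS_IO_WEBSOCKET_URL='ws://127.0.0.1:4002'
# Placeholder value; a real token is only needed for custom websocket endpoints.
export BROWSERLESS_IO_API_TOKEN='placeholder-token'
echo "$BROWSERLESS_IO_WEBSOCKET_URL"
```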
Do not default the whole repo to Browserless. Use it only for configs that need it.
Assume the html2rss CLI is available on PATH when working against the sibling core repo.
- Use `html2rss ...` in examples and one-off validation commands.
- If the CLI is not installed globally in the current environment, run the equivalent command from the sibling `html2rss/` checkout, typically `bundle exec exe/html2rss ...`.
- In this repo, keep using `make ...` and `bundle exec rspec ...` because those are the implemented entrypoints.
- Find the cleanest stable candidate URL.
- Inspect the DOM in Chrome MCP before writing selectors.
- Create the YAML with the schema modeline and minimal selectors.
- Validate the single file with the core CLI.
- Generate a live feed with the core CLI.
- Tighten selectors until the feed output is clean.
- Run repo validation and non-fetch tests.
- Run the appropriate fetch lane:
- plain fetch for static or Faraday-backed configs
- Browserless fetch for JS-heavy or Browserless-backed configs
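The lane choice at the end of this sequence can be sketched as a small helper. `fetch_lane` is hypothetical, not part of this repo; it only prints the focused fetch command for the chosen lane, based on whether a Browserless endpoint is configured:

```shell
# Hypothetical helper: print the focused fetch command for one config,
# picking the Browserless lane when its websocket endpoint is configured.
fetch_lane() {
  example="$1"
  if [ -n "$BROWSERLESS_IO_WEBSOCKET_URL" ]; then
    printf 'BROWSERLESS_IO_WEBSOCKET_URL=%s bundle exec rspec --tag fetch --example %s spec/html2rss/configs_dynamic_spec.rb\n' \
      "$BROWSERLESS_IO_WEBSOCKET_URL" "$example"
  else
    printf 'bundle exec rspec --tag fetch --example %s spec/html2rss/configs_dynamic_spec.rb\n' "$example"
  fi
}

unset BROWSERLESS_IO_WEBSOCKET_URL
fetch_lane 'example.com/feed.yml'
```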
For every new or changed config, verify in this order:
- Single-file runtime validation in the core repo: `cd ../html2rss && html2rss validate /abs/path/to/config.yml`
- Single-file live feed generation in the core repo: `cd ../html2rss && html2rss feed /abs/path/to/config.yml`
- Repo-wide validation in this repo: `make validate`
- Repo non-fetch tests in this repo: `make test`
- Focused fetch verification:
  - Faraday-backed candidate: `bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb`
  - Browserless-backed candidate: `BROWSERLESS_IO_WEBSOCKET_URL=ws://127.0.0.1:4002 BROWSERLESS_IO_API_TOKEN=... bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb`
- If fetch still fails, decide explicitly whether:
- selectors are wrong
- the page needs Browserless
- the chosen surface is too noisy or too dynamic
- the candidate should be downgraded or dropped
- Cross-runtime mismatch check (required when core feed works but fetch specs fail):
  - confirm the canonical URL with redirect tracing: `curl -I -L -s https://example.com | sed -n '1,20p'`
  - compare behavior in both runtimes:
    - core repo (`../html2rss`) via `html2rss feed`
    - configs repo fetch lane (`bundle exec rspec --tag fetch --example ...`)
  - if selectors are valid in core but the fetch lane still returns zero items, treat this as a request-strategy/runtime mismatch, not selector success.
  - in that case: prefer Browserless-backed verification if available; otherwise mark the config as downgraded/deferred with evidence.
Use the core CLI as the authority for single-config debugging. The quickest loop is:
- `validate`
- `feed`
- inspect the RSS for zero items, nav/footer leakage, duplicates, relative URLs, or noisy descriptions
- adjust selectors
- rerun
If Browserless works but Faraday does not, keep the config narrow and classify it as Browserless-backed instead of trying to rescue it with brittle tweaks.
Additional high-value checks:
- Always normalize `channel.url` to the final canonical host/path (`www` vs non-`www`, retired legacy paths).
- Prefer selectors anchored to content links (`h3 a`, `a[href*='/article/']`) over container-only selectors.
- Remove optional fields first when quality drops (`categories`, synthetic IDs, weak descriptions) before adding selector complexity.
- Set `enhance: false` early if enhancement starts pulling nav/hero/market widgets.
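Applied together, those checks might look like this hypothetical fragment; the site, path, and selectors are made up for illustration:

```yaml
# Hypothetical fragment: canonical host, content-anchored selectors, no weak fields.
channel:
  url: https://www.example.com/newsroom  # final host/path confirmed via `curl -I -L`
selectors:
  items:
    selector: 'li.news-card'
  title:
    selector: 'h3 a'                     # anchored to the content link
  url:
    selector: "a[href*='/article/']"
    extractor: href
  # categories and weak descriptions removed rather than propped up with complexity
```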
Use `auto` for reconnaissance, not as proof that a config is ready:
`cd ../html2rss && html2rss auto 'https://example.com'`
Use it to:
- discover likely repeated item selectors
- compare Faraday and Browserless behavior quickly
- decide whether a site belongs in the curated set at all
Do not ship raw auto-sourced output without manual tightening.
Drop or defer when:
- the page stays noisy after reasonable selector tightening
- the site already offers first-party RSS and this config adds little curated value
- the page depends on unstable interaction flows that are not worth encoding
Downgrade when:
- a narrower subpath is much cleaner than the flagship page
- the config is acceptable without descriptions or dates
- month-level dates are the best the source offers
When finishing config work, report:
- files changed
- accepted configs
- downgraded configs and why
- dropped or deferred candidates and why
- commands actually run
- residual risks, especially selector drift, localization dependence, or Browserless dependence
- whether Chrome MCP was available during validation
- whether focused fetch specs matched core runtime behavior