ProductNormaliser.Worker is the write-side runtime host for the solution. It continuously runs both deterministic discovery and product crawling: it expands source entry points and sitemaps into bounded discovery queues, promotes confirmed product URLs into crawl targets, fetches source pages, extracts and normalises product evidence, merges that evidence into canonical products, records logs and conflicts, and reschedules future work.
If ProductNormaliser.AdminApi is the observability surface, ProductNormaliser.Worker is the engine that keeps the database alive and current.
Its responsibilities are to:

- host the background crawl loop
- host the background discovery loop
- compose the crawl pipeline through dependency injection
- process queued discovery and crawl targets one at a time per worker loop
- mark items completed, skipped, or failed
- ensure future attempts are rescheduled rather than treating crawl as a one-off event
At startup the worker registers:
- MongoDB stores and infrastructure services
- structured-data extraction
- attribute normalisation
- identity resolution
- merge weighting and canonical merge
- conflict detection
- HTTP fetcher and robots policy services
- discovery services and application-layer discovery coordination
- the hosted discovery and crawl worker services
The main entry point is intentionally thin. Most behaviour lives in the orchestrators and infrastructure services.
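The composition root can be pictured as a minimal Program.cs of roughly the following shape. This is a sketch only: the registration extension methods shown here are illustrative stand-ins, not the project's actual API.

```csharp
var builder = Host.CreateApplicationBuilder(args);

// Illustrative registrations; the real project groups these behind
// its own extension methods for stores, extraction, normalisation,
// identity resolution, merging, conflicts, fetching, and discovery.
builder.Services.AddSingleton<IMongoClient>(_ =>
    new MongoClient(builder.Configuration.GetConnectionString("Mongo")));
builder.Services.AddHttpClient();

// Hosted background loops.
builder.Services.AddHostedService<DiscoveryWorker>();
builder.Services.AddHostedService<CrawlWorker>();

builder.Build().Run();
```

The point is that Program.cs only composes services; it contains no crawl or discovery logic of its own.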
- Program: DI composition root
- DiscoveryWorker: background loop that dequeues and processes discovery work
- DiscoveryOrchestrator: orchestration logic for sitemap traversal, listing traversal, URL classification, and promotion into crawl targets
- CrawlWorker: background loop that dequeues and processes crawl work
- CrawlOrchestrator: orchestration logic for fetch, extract, normalise, merge, persistence, intelligence updates, and related-link expansion after successful product fetches
- CrawlProcessResult: result contract used to classify queue outcomes
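CrawlProcessResult is described as a contract for classifying queue outcomes; a plausible shape (an assumption for illustration, not the project's actual definition) is:

```csharp
// Hypothetical sketch of the outcome contract: one value per
// terminal queue state, plus an optional reason for diagnostics.
public enum CrawlOutcome { Completed, Skipped, Failed }

public sealed record CrawlProcessResult(
    CrawlOutcome Outcome,
    string? Reason = null);
```

Whatever its exact shape, the contract lets the worker loop map an orchestrator result onto a queue-state transition without inspecting pipeline internals.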
The worker reads configuration from appsettings.json and the environment.
Current default settings include:
- MongoDB connection string: mongodb://127.0.0.1:27017
- MongoDB database: ProductNormaliser
- crawl user agent: ProductNormaliserBot/1.0
- default host delay: 1000 ms
- transient retry count: 2
- idle delay when the queue is empty: 1500 ms
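The defaults above correspond to a settings file of roughly this shape. The section and key names here are assumptions for illustration; check appsettings.json for the authoritative names.

```json
{
  "Mongo": {
    "ConnectionString": "mongodb://127.0.0.1:27017",
    "Database": "ProductNormaliser"
  },
  "Crawl": {
    "UserAgent": "ProductNormaliserBot/1.0",
    "DefaultHostDelayMs": 1000,
    "TransientRetryCount": 2,
    "IdleDelayMs": 1500
  }
}
```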
The optional local classification-layer settings currently matter more to source discovery and source-probe workflows than to the steady-state crawl loop. They are still shared through the same infrastructure registration so the runtime can evaluate them consistently where needed.
From the repository root:
```
dotnet run --project ProductNormaliser.Worker
```

Or from this project folder:

```
dotnet run
```

For each discovery lease, the worker:
- dequeues the next eligible discovery item
- evaluates robots and throttling rules for the source
- fetches sitemaps or listing pages within source depth and budget limits
- classifies and persists discovered URLs
- promotes confirmed product URLs into the crawl queue
For each crawl lease, the worker:
- dequeues the next eligible crawl target
- calls the crawl orchestrator
- marks the item as completed, skipped, or failed
- allows the queue service to determine future scheduling
If the queue is empty, the worker sleeps for the configured idle delay and tries again.
If processing throws an unexpected exception, the worker logs the failure and marks the item failed so queue-state history is preserved.
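The crawl loop described above amounts to a lease, process, classify cycle. Sketched in C# with hypothetical service interfaces (`queue` and `orchestrator` here are illustrative names, not the project's actual signatures):

```csharp
while (!stoppingToken.IsCancellationRequested)
{
    var item = await queue.DequeueNextAsync(stoppingToken);
    if (item is null)
    {
        // Queue empty: sleep for the configured idle delay
        // (1500 ms by default) and try again.
        await Task.Delay(idleDelay, stoppingToken);
        continue;
    }

    try
    {
        var result = await orchestrator.ProcessAsync(item, stoppingToken);
        // Marks the item completed or skipped; the queue service
        // decides future scheduling.
        await queue.CompleteAsync(item, result, stoppingToken);
    }
    catch (Exception ex)
    {
        // Unexpected failure: log and mark failed so
        // queue-state history is preserved.
        logger.LogError(ex, "Crawl item failed");
        await queue.FailAsync(item, stoppingToken);
    }
}
```

The discovery loop follows the same lease-and-classify pattern over discovery items rather than crawl targets.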
- This service can bootstrap category crawls from managed source discovery profiles, so the crawl queue no longer needs to be fully pre-seeded with product URLs.
- It shares its MongoDB database with the admin API.
- It is designed as a long-running background process.
- Its value increases over time because trust history, stability, and disagreement data accumulate across runs.
- Start MongoDB.
- Ensure the Mongo configuration section points at the correct instance.
- Register or enable crawl sources with category coverage and discovery profiles through the Admin API or Web UI.
- Run the worker.
- Optionally run the admin API and Web UI to inspect discovery and crawl progress.
```
dotnet build ProductNormaliser.Worker/ProductNormaliser.Worker.csproj
```

- it is not the place for business rules that should live in Domain or Application
- it is not the read-side API surface
- it is not a scheduler UI or queue-management console
It is intentionally a thin runtime host over the reusable platform services.