R2 raw/prod two-stage model

Document Version: 1.0

Purpose: Replace the existing 3-stage layout (datasets / processing / prod) with a 2-stage layout (raw / prod). This model is finalized and implemented.


| Category | Existing (3-stage) | After Change (2-stage) |
| --- | --- | --- |
| Stage 1 | datasets/ (Bronze, immutable) | raw/ (immutable source) |
| Stage 2 | processing/ (Silver) + prod/ (Gold) | prod/ (all pipeline outputs) |

  • raw: Original data collected/uploaded externally. Do not overwrite.
  • prod: All results generated/updated by the pipeline (including intermediate and final outputs). Merges the existing processing + prod into a single prefix.

| Item | Content |
| --- | --- |
| Meaning | Unprocessed raw data. As collected/uploaded. |
| Rule | Immutable. Overwriting existing files is prohibited. |
| Path Format | raw/country={cc}/category={cat}/date=YYYY-MM-DD/ |
| File Example | raw_0001.json, raw_0002.json, raw_metadata.json |
| Correspondence | Same role as the existing datasets/ directory. |

| Item | Description |
| --- | --- |
| Meaning | All outputs (intermediate and final) generated by the pipeline. |
| Rule | Can be created/updated. Does not overwrite raw files. |
| Path Format | prod/country={cc}/category={cat}/date=YYYY-MM-DD/... |
| Includes | Domain metadata, robots, sitemap, checkpoints (.success), final aggregation/distribution files, etc. |
| Mapping | Merge existing processing/ + prod/ into a single prod/. |

  • If detailed distinctions are needed within prod, subpaths can be used (e.g., prod/.../domain_metadata.json, prod/.../final/, etc.).
  • Only use the top-level prefixes raw / prod.
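
The path formats above can be sketched as a small path builder. This is a minimal illustration, not the project's actual Path Builder: the names `Partition`, `partitionPrefix`, `rawKey`, and `prodKey` are hypothetical, and placing the domain directory directly under the date partition is an assumption, since the source elides that part of the path.

```typescript
// Hypothetical path builder for the two top-level prefixes (raw / prod).
type Stage = "raw" | "prod";

interface Partition {
  country: string;  // e.g. "us"
  category: string; // e.g. "news"
  date: string;     // YYYY-MM-DD
}

const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

// Builds "<stage>/country={cc}/category={cat}/date=YYYY-MM-DD/".
function partitionPrefix(stage: Stage, p: Partition): string {
  if (!DATE_PATTERN.test(p.date)) {
    throw new Error(`date must be YYYY-MM-DD, got "${p.date}"`);
  }
  return `${stage}/country=${p.country}/category=${p.category}/date=${p.date}/`;
}

// Key of an immutable source file under raw/.
function rawKey(p: Partition, fileName: string): string {
  return partitionPrefix("raw", p) + fileName;
}

// Key of a pipeline output under prod/. Subpaths are optional and are
// joined with "/" (assumed layout; the source only says subpaths may be used).
function prodKey(p: Partition, ...subpath: string[]): string {
  return partitionPrefix("prod", p) + subpath.join("/");
}

const partition: Partition = { country: "us", category: "news", date: "2026-01-28" };
console.log(rawKey(partition, "raw_0001.json"));
console.log(prodKey(partition, "example.com", "domain_metadata.json"));
```

Restricting `Stage` to the two literals keeps any third top-level prefix from being introduced by accident at compile time.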

| Original Path (3-stage) | Changed Path (2-stage) |
| --- | --- |
| datasets/country=us/category=news/date=2026-01-28/raw_0001.json | raw/country=us/category=news/date=2026-01-28/raw_0001.json |
| datasets/.../raw_0001.json.success | raw/.../raw_0001.json.success or prod/.../ (depending on policy) |
| processing/country=us/.../example.com/domain_metadata.json | prod/country=us/.../example.com/domain_metadata.json |
| processing/.../robots.txt, sitemap.xml | prod/.../robots.txt, prod/.../sitemap.xml |
| Final outputs under prod/ | Under prod/ (same as existing) |
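
The mapping above amounts to a top-level prefix rewrite. The following is a minimal sketch under stated assumptions: `toTwoStageKey` and `PREFIX_MAP` are hypothetical names, and `.success` markers are kept next to their raw file, which is only one of the two policies the table allows.

```typescript
// Hypothetical helper: rewrites a legacy 3-stage key to its 2-stage equivalent.
// Each entry is [legacy prefix, new prefix].
const PREFIX_MAP: ReadonlyArray<readonly [string, string]> = [
  ["datasets/", "raw/"],    // Stage 1: immutable source data
  ["processing/", "prod/"], // former Silver outputs merge into prod/
  ["prod/", "prod/"],       // former Gold outputs stay where they are
];

function toTwoStageKey(legacyKey: string): string {
  for (const [from, to] of PREFIX_MAP) {
    if (legacyKey.startsWith(from)) {
      // Only the top-level prefix changes; the rest of the key is preserved.
      return to + legacyKey.slice(from.length);
    }
  }
  throw new Error(`unknown top-level prefix: ${legacyKey}`);
}

console.log(toTwoStageKey("datasets/country=us/category=news/date=2026-01-28/raw_0001.json"));
```

Throwing on an unknown prefix, rather than passing the key through, makes stray top-level directories visible before any data is moved.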

4. Scope of Impact (Modifications Required During Implementation)

| Area | Change Points |
| --- | --- |
| Path Builder | datasets/ → raw/, processing/ → prod/. Modify prefix constants and function return values. |
| Seed Orchestrator | Raw file list prefix: raw/country=.../category=.../date=.../. |
| Seed Queue Consumer | Raw file get path and .success path are based on the raw/ prefix. |
| Domain Queue Consumer | Storage paths for domain_metadata, robots, sitemap, and .success are based on the prod/ prefix. |
| Upload Script | R2 key prefix: raw/ + existing relative path. |
| Existing R2 Data | Migrate to raw/ using the migration script (dev/migrate-r2-to-raw-prefix.ts). Use S3 CopyObject (server-side, parallel) + DeleteObjects (batch). |
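
The copy-then-delete migration in the last row can be illustrated with a planning step that runs before any request is sent. This sketch is not the contents of dev/migrate-r2-to-raw-prefix.ts: `planMigration`, `CopyOp`, and `MigrationPlan` are hypothetical names, the key mapping is passed in as a parameter, and the actual CopyObject and DeleteObjects calls are left out. The batch size reflects the S3 DeleteObjects limit of 1000 keys per request.

```typescript
// S3 DeleteObjects accepts at most 1000 keys per request.
const MAX_DELETE_BATCH = 1000;

interface CopyOp {
  from: string; // existing key
  to: string;   // destination key under the new prefix
}

interface MigrationPlan {
  copies: CopyOp[];          // one server-side CopyObject per entry
  deleteBatches: string[][]; // one DeleteObjects request per batch
}

// Pure planning step: no network calls, so the result can double as a dry run.
function planMigration(
  existingKeys: string[],
  mapKey: (key: string) => string, // hypothetical mapping, supplied by the caller
  deleteAfterCopy: boolean,
): MigrationPlan {
  const copies: CopyOp[] = [];
  for (const from of existingKeys) {
    const to = mapKey(from);
    if (to !== from) copies.push({ from, to }); // skip keys already in place
  }

  const deleteBatches: string[][] = [];
  if (deleteAfterCopy) {
    for (let i = 0; i < copies.length; i += MAX_DELETE_BATCH) {
      deleteBatches.push(copies.slice(i, i + MAX_DELETE_BATCH).map((op) => op.from));
    }
  }
  return { copies, deleteBatches };
}

// Example: prefix every key with raw/ (an assumed mapping for illustration).
const plan = planMigration(["a.json", "b.json"], (key) => `raw/${key}`, true);
console.log(plan.copies.length, plan.deleteBatches.length);
```

Passing `deleteAfterCopy = false` yields a copy-only plan, and printing the plan without executing it gives a dry run of the destinations.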

  • Raw Immutability: Files under raw/ are not overwritten.
  • No Stage Mixing: Do not write derived results to raw; derived results must only be written to prod/.
  • raw_metadata.json: Recommended per partition. Can be placed in the raw/ partition root.
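
The first two rules above could be enforced with a guard that runs before every write. This is a sketch only: `checkWrite` and `WriteKind` are hypothetical names, and the caller is assumed to know whether the target key already exists (for example from a prior existence check) and whether the payload is source data or a derived result.

```typescript
type WriteKind = "source" | "derived";

// Throws if a write would violate the raw/prod rules; returns normally otherwise.
function checkWrite(key: string, kind: WriteKind, alreadyExists: boolean): void {
  const isRaw = key.startsWith("raw/");
  const isProd = key.startsWith("prod/");

  // Only the two top-level prefixes are allowed.
  if (!isRaw && !isProd) {
    throw new Error(`key must start with raw/ or prod/: ${key}`);
  }
  // No stage mixing: derived results go to prod/ only.
  if (kind === "derived" && !isProd) {
    throw new Error(`derived results must be written under prod/: ${key}`);
  }
  // Raw immutability: never overwrite an existing file under raw/.
  if (isRaw && alreadyExists) {
    throw new Error(`raw/ is immutable, refusing to overwrite: ${key}`);
  }
}

checkWrite("raw/country=us/category=news/date=2026-01-28/raw_0001.json", "source", false);
checkWrite("prod/country=us/category=news/date=2026-01-28/final/out.json", "derived", true);
```

Whether a `.success` checkpoint counts as source or derived depends on the policy chosen for its location, so the sketch leaves that classification to the caller.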

| Item | Description |
| --- | --- |
| prefix | raw (Stage 1), prod (Stage 2). Remove the existing 3-stage structure (datasets/processing/prod). |
| raw | Raw collected/uploaded data. Immutable. raw/country=.../category=.../date=.../. |
| prod | All pipeline outputs. prod/country=.../.... |
| migration | Verify the destination with pnpm migrate:r2:raw-prefix:dry-run, then run pnpm migrate:r2:raw-prefix (copy + delete) or pnpm migrate:r2:raw-prefix:no-delete (copy only). Uses S3 CopyObject + DeleteObjects. |