R2 raw/prod two-stage model
R2 prefix 2-stage model: raw / prod
Document Version: 1.0
Purpose: Change model that replaces the existing 3-stage layout (datasets / processing / prod) with a 2-stage layout (raw / prod). This change is finalized and implemented.
1. Change Summary
| Category | Existing (3-stage) | After Change (2-stage) |
|---|---|---|
| Stage 1 | datasets/ (Bronze, immutable) | raw/ (immutable source) |
| Stage 2 | processing/ (Silver) + prod/ (Gold) | prod/ (all pipeline outputs) |
- raw: Original data collected/uploaded externally. Do not overwrite.
- prod: All results generated/updated by the pipeline (including intermediate and final outputs). Merges the existing processing + prod into a single prefix.
2. Two-Stage Definition
2.1 raw (Stage 1)
| Item | Content |
|---|---|
| Meaning | Unprocessed raw data. As collected/uploaded. |
| Rule | Immutable. Overwriting existing files is prohibited. |
| Path Format | raw/country={cc}/category={cat}/date=YYYY-MM-DD/ |
| File Example | raw_0001.json, raw_0002.json, raw_metadata.json |
| Correspondence | Same role as the existing datasets/ directory. |
2.2 prod (Stage 2)
| Item | Description |
|---|---|
| Meaning | All outputs (intermediate and final) generated by the pipeline. |
| Rule | Can be created/updated. Does not overwrite raw files. |
| Path Format | prod/country={cc}/category={cat}/date=YYYY-MM-DD/... |
| Includes | Domain metadata, robots, sitemap, checkpoints (.success), final aggregation/distribution files, etc. |
| Mapping | Merge existing processing/ + prod/ into a single prod/. |
- If finer distinctions are needed within prod, use subpaths (e.g., prod/.../domain_metadata.json, prod/.../final/).
- Only use the top-level prefixes raw / prod.
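The two path formats above can be sketched as a small key builder. This is a minimal illustration, not the project's actual Path Builder: the names `Partition`, `partitionPrefix`, `rawKey`, and `prodKey` are hypothetical, and the layout below the date partition in prod/ (here a domain directory followed by a file name) is an assumption, since the document elides that part of the path.

```typescript
// Hypothetical key builder for the two top-level prefixes (raw / prod).
type Stage = "raw" | "prod";

interface Partition {
  country: string;  // e.g. "us"
  category: string; // e.g. "news"
  date: string;     // YYYY-MM-DD
}

const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

// Builds "<stage>/country=<cc>/category=<cat>/date=<YYYY-MM-DD>/".
function partitionPrefix(stage: Stage, p: Partition): string {
  if (!DATE_RE.test(p.date)) {
    throw new Error(`date must be YYYY-MM-DD, got "${p.date}"`);
  }
  return `${stage}/country=${p.country}/category=${p.category}/date=${p.date}/`;
}

// Key for an immutable source file under raw/.
function rawKey(p: Partition, fileName: string): string {
  return partitionPrefix("raw", p) + fileName;
}

// Key for a pipeline output under prod/; subpath segments are joined with "/".
function prodKey(p: Partition, ...subpath: string[]): string {
  return partitionPrefix("prod", p) + subpath.join("/");
}
```

For example, `rawKey({ country: "us", category: "news", date: "2026-01-28" }, "raw_0001.json")` yields `raw/country=us/category=news/date=2026-01-28/raw_0001.json`. Keeping the stage as a closed union type means a third top-level prefix cannot be introduced by accident.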
3. Path Mapping (Before → After)
| Original Path (3-stage) | Changed Path (2-stage) |
|---|---|
| datasets/country=us/category=news/date=2026-01-28/raw_0001.json | raw/country=us/category=news/date=2026-01-28/raw_0001.json |
| datasets/.../raw_0001.json.success | raw/.../raw_0001.json.success or prod/.../ (depending on policy) |
| processing/country=us/.../example.com/domain_metadata.json | prod/country=us/.../example.com/domain_metadata.json |
| processing/.../robots.txt, sitemap.xml | prod/.../robots.txt, prod/.../sitemap.xml |
| Final outputs under prod/ | Under prod/ (same as existing) |
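The mapping table can be expressed as a single key-rewriting function. The sketch below is illustrative only: `mapLegacyKey` and `SuccessPolicy` are hypothetical names, and it assumes that when .success markers are moved to prod/ they keep the same relative path, which the document leaves open ("depending on policy").

```typescript
// Hypothetical policy switch for where .success markers of raw files live.
type SuccessPolicy = "raw" | "prod";

// Rewrites a 3-stage key (datasets/ | processing/ | prod/) to its 2-stage key.
function mapLegacyKey(key: string, successPolicy: SuccessPolicy = "raw"): string {
  if (key.startsWith("datasets/")) {
    const rest = key.slice("datasets/".length);
    // Assumption: a marker moved to prod/ keeps its relative path.
    if (key.endsWith(".success") && successPolicy === "prod") {
      return "prod/" + rest;
    }
    return "raw/" + rest;
  }
  if (key.startsWith("processing/")) {
    return "prod/" + key.slice("processing/".length);
  }
  if (key.startsWith("prod/")) {
    return key; // final outputs stay where they are
  }
  throw new Error(`unrecognized legacy prefix: ${key}`);
}
```

Throwing on an unknown prefix, rather than passing the key through, makes unexpected objects visible during a dry run instead of silently leaving them outside the two-stage layout.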
4. Scope of Impact (Modifications Required During Implementation)
| Area | Change Points |
|---|---|
| Path Builder | datasets/ → raw/, processing/ → prod/. Modify prefix constants and function return values. |
| Seed Orchestrator | Raw file listing prefix: raw/country=.../category=.../date=.../. |
| Seed Queue Consumer | Raw file read path and .success path are based on the raw/ prefix. |
| Domain Queue Consumer | Storage paths for domain_metadata, robots, sitemap, and .success are based on the prod/ prefix. |
| Upload Script | R2 key prefix: raw/ + existing relative path. |
| Existing R2 Data | Migrate to raw/ using migration script (dev/migrate-r2-to-raw-prefix.ts). Use S3 CopyObject (server-side, parallel) + DeleteObjects (batch). |
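The copy-then-delete flow in the last row can be sketched as follows. This is not the contents of dev/migrate-r2-to-raw-prefix.ts: `ObjectStore`, `planRawMigration`, and `runMigration` are hypothetical names, and the store interface stands in for real S3 CopyObject and DeleteObjects calls so the sketch stays self-contained. The batch size of 1000 reflects the per-request key limit of S3 DeleteObjects.

```typescript
// Hypothetical store interface; a real script would implement these with
// S3 CopyObject (server-side copy) and DeleteObjects (batch delete).
interface ObjectStore {
  copy(sourceKey: string, destKey: string): Promise<void>;
  deleteMany(keys: string[]): Promise<void>;
}

// S3 DeleteObjects accepts at most 1000 keys per request.
const DELETE_BATCH_SIZE = 1000;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

interface MigrationPlan {
  copies: Array<{ from: string; to: string }>;
  deleteBatches: string[][];
}

// Pure planning step: useful as a dry run because it touches nothing.
function planRawMigration(keys: string[], legacyPrefix = "datasets/"): MigrationPlan {
  const legacy = keys.filter((k) => k.startsWith(legacyPrefix));
  return {
    copies: legacy.map((k) => ({ from: k, to: "raw/" + k.slice(legacyPrefix.length) })),
    deleteBatches: chunk(legacy, DELETE_BATCH_SIZE),
  };
}

// Copies in parallel groups, then deletes sources only if every copy succeeded.
async function runMigration(
  store: ObjectStore,
  plan: MigrationPlan,
  opts: { concurrency?: number; deleteSource?: boolean } = {},
): Promise<void> {
  const { concurrency = 16, deleteSource = true } = opts;
  for (const group of chunk(plan.copies, concurrency)) {
    await Promise.all(group.map((c) => store.copy(c.from, c.to)));
  }
  if (!deleteSource) return; // copy-only mode
  for (const batch of plan.deleteBatches) {
    await store.deleteMany(batch);
  }
}
```

Because `Promise.all` rejects as soon as any copy fails, the delete loop is never reached after a failed copy, so a source object is not removed unless the whole copy phase completed. The concurrency default of 16 is an arbitrary illustrative value.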
5. Lifecycle Consistency
- Raw Immutability: Files under raw/ are not overwritten.
- No Stage Mixing: Do not write derived results to raw/; derived results must only be written to prod/.
- raw_metadata.json: Recommended per partition. Can be placed in the raw/ partition root.
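The first two rules above lend themselves to a guard that runs before any write. The following is a minimal sketch under stated assumptions: `WriteKind` and `assertWriteAllowed` are hypothetical names, the caller is assumed to know whether the key already exists (for example from a prior HEAD request), and any .success markers kept under raw/ are assumed to be written once as create-only "source" writes.

```typescript
// Hypothetical classification of a write: collected/uploaded source data
// versus anything the pipeline derives from it.
type WriteKind = "source" | "derived";

// Throws if the write would violate raw immutability or mix stages.
function assertWriteAllowed(key: string, kind: WriteKind, alreadyExists: boolean): void {
  const isRaw = key.startsWith("raw/");
  const isProd = key.startsWith("prod/");
  if (!isRaw && !isProd) {
    throw new Error(`unknown top-level prefix: ${key}`);
  }
  // No Stage Mixing: derived results go to prod/ only.
  if (kind === "derived" && !isProd) {
    throw new Error(`derived results must be written under prod/: ${key}`);
  }
  // Raw Immutability: existing raw/ objects are never overwritten.
  if (isRaw && alreadyExists) {
    throw new Error(`raw/ objects are immutable, refusing to overwrite: ${key}`);
  }
}
```

A check like this only narrows the window for mistakes; two concurrent writers could still both observe "does not exist", so it complements rather than replaces whatever conditional-write support the storage layer offers.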
6. Summary
| Item | Description |
|---|---|
| prefix | raw (Stage 1), prod (Stage 2). Remove existing 3-stage structure (datasets/processing/prod). |
| raw | Raw collection/upload data. Immutable. raw/country=.../category=.../date=.../. |
| prod | Full pipeline output. prod/country=.../.... |
| migration | Verify the destination with pnpm migrate:r2:raw-prefix:dry-run, then run pnpm migrate:r2:raw-prefix (copy + delete) or pnpm migrate:r2:raw-prefix:no-delete (copy only). Uses S3 CopyObject + DeleteObjects. |