
Raw / Prod Stage-by-Stage Input/Output Datasets

Document Version: 1.0
Purpose: Define the input and output datasets (deliverables), storage locations, and data formats for each stage of raw and prod. This document is the finalized baseline for implementation.


For each stage of the two-stage prefix model (raw / prod), this document describes what comes in (Input), what goes out (Output), and where and in what format the output is stored.


| Item | Description |
|------|-------------|
| Source | Original research data generated locally or by external pipelines. Uploaded to R2 via dev/upload-to-r2. |
| Storage Location | Under the raw/ prefix. Example: raw/country={cc}/category={cat}/date=YYYY-MM-DD/ |
| File Name | raw_0001.json, raw_0002.json, … (sequential numbering). One raw_metadata.json per partition is recommended. |
| Data Format | JSON. Schema: EnhancedResearchDataset (Zod standard) |
| Role | The pipeline's sole source input. Immutable (overwriting prohibited). |
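
As an illustration of the naming convention in the table above, the following minimal sketch builds a raw object key from a partition and a sequence number. The names `PartitionInfo`, `rawPartitionPrefix`, and `rawFileKey` are hypothetical and not taken from the codebase, and the 1..9999 range is an assumption derived from the four-digit file names.

```typescript
// Hypothetical helper types and functions; names are illustrative only.
interface PartitionInfo {
  country: string;  // the {cc} segment
  category: string; // the {cat} segment
  date: string;     // YYYY-MM-DD
}

// Builds the partition prefix: raw/country={cc}/category={cat}/date=YYYY-MM-DD/
function rawPartitionPrefix(p: PartitionInfo): string {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(p.date)) {
    throw new Error('date must be YYYY-MM-DD, got "' + p.date + '"');
  }
  return "raw/country=" + p.country + "/category=" + p.category + "/date=" + p.date + "/";
}

// Builds the key of the n-th raw file (1-based, zero-padded to four digits).
// Assumption: sequence numbers stay within 1..9999.
function rawFileKey(p: PartitionInfo, sequence: number): string {
  if (Math.floor(sequence) !== sequence || sequence < 1 || sequence > 9999) {
    throw new Error("sequence must be an integer in 1..9999, got " + sequence);
  }
  const padded = ("0000" + sequence).slice(-4);
  return rawPartitionPrefix(p) + "raw_" + padded + ".json";
}
```

For example, `rawFileKey({ country: "us", category: "news", date: "2024-01-15" }, 2)` yields `raw/country=us/category=news/date=2024-01-15/raw_0002.json`. Because raw files are immutable, an uploader would be expected to pick the next unused sequence number rather than reuse a key.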

The raw stage does not modify the raw file itself. The pipeline only creates and records the following in the raw stage.

| Item | Content |
|------|---------|
| Checkpoint | File: raw/.../raw_NNNN.json.success (same partition, within the raw prefix). Format: empty object or empty text. Meaning: processing of this raw file is complete. |
| Downstream Input | Reads one raw file to extract a domain list, then sends messages to DOMAIN_QUEUE (one message per domain). Each message becomes the input for the prod stage. |
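
The raw-stage behavior above (one queue message per domain, then a `.success` checkpoint) could be sketched roughly as follows. This is a simplified, synchronous, in-memory illustration: `ObjectStore`, `DomainEntry`, `QueuedDomain`, and `processRawFile` are hypothetical names, the real R2 and queue APIs are asynchronous, and skipping a file whose checkpoint already exists is an assumption about how the marker is used.

```typescript
// Hypothetical stand-in for the R2 bucket (synchronous for brevity).
interface ObjectStore {
  has(key: string): boolean;
  put(key: string, body: string): void;
}

// Assumed minimal shape of a domain entry read from a raw file.
interface DomainEntry {
  domain_id: string;
  domain_url: string;
}

// Assumed subset of the message sent to DOMAIN_QUEUE.
interface QueuedDomain {
  domain_id: string;
  domain_url: string;
  source_file_path: string;
}

// The checkpoint lives next to the raw file: raw/.../raw_NNNN.json.success
function checkpointKey(rawKey: string): string {
  return rawKey + ".success";
}

// Sends one message per domain, then writes the checkpoint.
// Returns the number of messages sent (0 if the file was already processed).
function processRawFile(
  rawKey: string,
  domains: DomainEntry[],
  store: ObjectStore,
  send: (message: QueuedDomain) => void
): number {
  const marker = checkpointKey(rawKey);
  if (store.has(marker)) {
    return 0; // assumption: an existing checkpoint means "skip this file"
  }
  for (const d of domains) {
    send({ domain_id: d.domain_id, domain_url: d.domain_url, source_file_path: rawKey });
  }
  // Written last, so a failure while sending leaves no checkpoint behind.
  // The raw file itself is never modified.
  store.put(marker, "");
  return domains.length;
}
```

Writing the marker only after all messages are sent means a crash mid-file leads to reprocessing rather than silent data loss, at the cost of possible duplicate messages that the prod stage would need to tolerate.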

| Item | Content |
|------|---------|
| Source | DOMAIN_QUEUE message. One message per domain extracted from raw files by the Seed Queue Consumer. |
| Delivery Format | Queue message body (JSON). Schema: DomainQueueMessage. |
| Field Summary | domain_id, domain_url, registrable_domain, authority, partition_info (country, category, date), source_file_path (optional). |
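
To make the field summary concrete, here is a minimal sketch of the message shape with a plain runtime check. The authoritative schema is the Zod definition of DomainQueueMessage in the codebase; the field types below are assumptions (in particular, `authority` is assumed to be a number), and `isDomainQueueMessage` is a hypothetical helper.

```typescript
// Assumed shape of a DOMAIN_QUEUE message body; field types are guesses.
interface DomainQueueMessage {
  domain_id: string;
  domain_url: string;
  registrable_domain: string;
  authority: number; // assumption: a numeric score; the real type may differ
  partition_info: { country: string; category: string; date: string };
  source_file_path?: string; // optional
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Hypothetical type guard for a parsed queue message body.
function isDomainQueueMessage(value: unknown): value is DomainQueueMessage {
  if (!isRecord(value)) return false;
  const p = value.partition_info;
  return (
    typeof value.domain_id === "string" &&
    typeof value.domain_url === "string" &&
    typeof value.registrable_domain === "string" &&
    typeof value.authority === "number" &&
    isRecord(p) &&
    typeof p.country === "string" &&
    typeof p.category === "string" &&
    typeof p.date === "string" &&
    (value.source_file_path === undefined || typeof value.source_file_path === "string")
  );
}
```

A consumer could call `isDomainQueueMessage(JSON.parse(body))` before processing and reject or dead-letter anything that fails the check.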

3.2 prod - Output and Storage Location


| Output | Download/Generation Method | Storage Location (prod/) | Data Format |
|--------|--------------------|--------------------------------|-------------|
| robots.txt | HTTP GET https://{registrable_domain}/robots.txt | prod/country={cc}/category={cat}/date=YYYY-MM-DD/{sanitized_domain}/robots.txt | Plain text (as-is) |
| sitemap.xml | Downloaded after extracting the Sitemap URL from robots.txt, or by attempting the default URL | prod/.../{sanitized_domain}/sitemap.xml | XML (as-is) |
| domain_metadata.json | Summary metadata of the robots.txt and sitemap fetch results | prod/.../{sanitized_domain}/domain_metadata.json | JSON. Schema: DomainMetadata (Zod). |
| domain_metadata.json.success | Indicates that processing of the domain is complete | prod/.../{domain}/domain_metadata.json.success | Empty text. |
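
The sitemap row above relies on finding `Sitemap:` lines in robots.txt and falling back to a default URL. A minimal sketch of that discovery step is shown below; `extractSitemapUrls` and `sitemapCandidates` are hypothetical names, and the fallback `https://{registrable_domain}/sitemap.xml` is an assumption, since the source does not spell out the default URL.

```typescript
// Collects the URLs of all "Sitemap:" lines in a robots.txt body.
// The field name is matched case-insensitively; "#" starts a comment.
function extractSitemapUrls(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const withoutComment = line.replace(/#.*$/, "");
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(withoutComment);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}

// Returns the sitemap URLs to try, in order.
// Assumption: the default URL is /sitemap.xml on the registrable domain.
function sitemapCandidates(registrableDomain: string, robotsTxt: string | null): string[] {
  const fromRobots = robotsTxt === null ? [] : extractSitemapUrls(robotsTxt);
  if (fromRobots.length > 0) {
    return fromRobots;
  }
  return ["https://" + registrableDomain + "/sitemap.xml"];
}
```

Passing `null` for `robotsTxt` models a failed or missing robots.txt fetch. Note that stripping everything after `#` would also truncate a sitemap URL that contains a fragment, which this sketch accepts as a simplification.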


| Step | Input | Output (Stored Data) | Notes |
|------|-------|----------------------|-------|
| raw | Uploaded raw_NNNN.json (EnhancedResearchDataset). Location: raw/.../ | raw/.../raw_NNNN.json.success (checkpoint), plus the domain list passed to the queue (prod input). | The raw file is immutable. |
| prod | DOMAIN_QUEUE message (DomainQueueMessage). | prod/.../{domain}/robots.txt, sitemap.xml, domain_metadata.json, domain_metadata.json.success | robots.txt and the sitemap are downloaded from external URLs and stored in prod. |
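
The prod storage layout summarized above can be expressed as a small key-building helper. This is only a sketch: `ProdOutputKeys` and `prodOutputKeys` are hypothetical names, the sanitized domain is taken as an already-computed input because the sanitization rule is not described here, and placing the `.success` marker in the same directory as domain_metadata.json is an assumption.

```typescript
// Hypothetical container for the four per-domain object keys in prod/.
interface ProdOutputKeys {
  robots: string;
  sitemap: string;
  metadata: string;
  success: string;
}

// Builds prod/country={cc}/category={cat}/date=YYYY-MM-DD/{sanitized_domain}/... keys.
function prodOutputKeys(
  country: string,
  category: string,
  date: string,
  sanitizedDomain: string // assumed to be sanitized by the caller
): ProdOutputKeys {
  const dir =
    "prod/country=" + country +
    "/category=" + category +
    "/date=" + date +
    "/" + sanitizedDomain + "/";
  return {
    robots: dir + "robots.txt",
    sitemap: dir + "sitemap.xml",
    metadata: dir + "domain_metadata.json",
    // Assumption: the marker sits next to domain_metadata.json.
    success: dir + "domain_metadata.json.success"
  };
}
```

Deriving all four keys from one directory string keeps the outputs of a domain together, so the presence of the `.success` key can be checked before any of the other three objects is read.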

| Data | Schema/Format | Reference (Codebase) |
|------|---------------|----------------------|
| raw file | EnhancedResearchDataset (JSON) | src/schemas/research.ts |
| Seed Queue message | SeedQueueMessage | src/schemas/seed-engine.ts |
| Domain Queue message | DomainQueueMessage | src/schemas/seed-engine.ts |
| domain_metadata.json | DomainMetadata | src/schemas/seed-engine.ts |
| robots.txt | Plain text (original) | |
| sitemap.xml | XML (original) | |