
Project Overview

This document was automatically generated from the README in the project root folder.

Research discovery and seed contract management for news sources.

NewsFork Seeds is a comprehensive system for discovering, validating, and managing news source contracts. It follows a two-phase architecture:

  1. Research Phase: Discover WHERE to look (URL discovery)
  2. Seed Phase: Define HOW to fetch (content contracts)

The system is built as a Distributed Event Processing Platform using Cloudflare Workers, Queues, R2, D1, and KV, with GitHub as the audit trail.

┌─────────────────────────────────────────────────────────────────┐
│                       Cloudflare Workers                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │  HTTP API   │  │  Scheduled  │  │    Queue    │              │
│  │   Handler   │  │   (Cron)    │  │  Consumers  │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         └────────────────┼────────────────┘                     │
│                          ▼                                      │
│  ┌───────────────────────────────────────────────┐              │
│  │               Business Services               │              │
│  │ Research │ Seed │ Dataset │ Metadata │ Queue  │              │
│  └───────────────────────┬───────────────────────┘              │
│                          ▼                                      │
│  ┌───────────────────────────────────────────────┐              │
│  │               Storage Services                │              │
│  │   HybridStorage │ R2Storage │ GitHubStorage   │              │
│  └───────────────────────┬───────────────────────┘              │
└──────────────────────────┼──────────────────────────────────────┘
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│      R2       │  │      D1       │  │    GitHub     │
│  (Raw Data)   │  │  (Metadata)   │  │ (Audit Trail) │
└───────────────┘  └───────────────┘  └───────────────┘
HTTP Request / Queue Message / Cron Event
┌─────────────┐
│    Apps     │ ← Entry Points (Workers) - thin entry points only
└─────────────┘
      ├──→ Domain (pure business logic, Cloudflare-free)
      └──→ Infra (Cloudflare adapters)
            ├── R2 (Raw Data: raw, prod)
            ├── D1 (Metadata: task state, domain cache)
            ├── KV (Fast Lookups: domain registry)
            ├── Queue (Task Processing)
            └── GitHub (Audit Trail)
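
The layering above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `ObjectStore`, `InMemoryStore`, `saveDataset`, and `handleSave` are hypothetical names. The domain function depends only on an interface, the infra adapter implements it (an in-memory stand-in where a real adapter would wrap an R2 binding), and the app entry point only wires the two together.

```typescript
// Hypothetical port: the only thing the domain layer knows about storage.
interface ObjectStore {
  put(key: string, body: string): Promise<void>;
  get(key: string): Promise<string | null>;
}

// Domain: pure business logic with no Cloudflare imports.
async function saveDataset(
  store: ObjectStore,
  key: string,
  records: unknown[],
): Promise<number> {
  if (records.length === 0) throw new Error("dataset must contain records");
  await store.put(key, JSON.stringify({ records }));
  return records.length;
}

// Infra: an adapter implementing the port.
// This in-memory version is for illustration; a real one would wrap an R2 bucket.
class InMemoryStore implements ObjectStore {
  private readonly objects = new Map<string, string>();

  async put(key: string, body: string): Promise<void> {
    this.objects.set(key, body);
  }

  async get(key: string): Promise<string | null> {
    return this.objects.get(key) ?? null;
  }
}

// App: thin entry point that only wires domain and infra together.
async function handleSave(store: ObjectStore, key: string, records: unknown[]) {
  const count = await saveDataset(store, key, records);
  return { status: 201, body: { key, count } };
}
```

Because `saveDataset` never touches a Cloudflare binding, it can be tested with the in-memory store alone, which is the practical benefit of keeping the domain layer Cloudflare-free.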

Newsfork v6.0 introduces an engine-based architecture: each capability is a separate engine with consistent naming (nf-{engine}-{resource}-{qualifier}) and interfaces. Engines communicate only via Queue or API; external exposure is through Zuplo API Gateway.
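
As a small illustration of the naming pattern, the helper below assembles a name of the form nf-{engine}-{resource}-{qualifier}. The function name, the character rule, and the example values are hypothetical; the source specifies only the pattern itself.

```typescript
// Assumption: each part is lowercase alphanumeric with no hyphens,
// so a generated name can be split back into its parts unambiguously.
const NAME_PART = /^[a-z0-9]+$/;

// Hypothetical helper: builds a name following nf-{engine}-{resource}-{qualifier}.
function engineResourceName(
  engine: string,
  resource: string,
  qualifier: string,
): string {
  for (const part of [engine, resource, qualifier]) {
    if (!NAME_PART.test(part)) {
      throw new Error(`invalid name part: "${part}"`);
    }
  }
  return `nf-${engine}-${resource}-${qualifier}`;
}

// Example with made-up values:
// engineResourceName("collection", "queue", "research")
```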

8 Core Engines: E1 Collection, E2 Diaspora, E3 RAG, E4 Knowledge Agent, E5 Journalist, E6 Advertising, E7 Publishing, E8 Distribution.

The current system (Research → Seed → Dataset, single Worker, R2/D1/KV/Queues) corresponds to E1 (Collection Engine) and shared infra; existing architecture docs above remain valid. For the v6.0 engine model and phased rollout, see Engine-Based Architecture (v6.0).

  • Runtime: Cloudflare Workers (Edge Computing)
  • Framework: Hono (Fast HTTP framework)
  • Validation: Zod (Schema validation)
  • Database: D1 (SQLite-compatible, metadata storage)
  • Storage: R2 (Raw datasets), GitHub (Audit trail)
  • Cache: KV (Domain registry, fast lookups)
  • Queue: Cloudflare Queues (Reliable task processing)
  • CI/CD: GitHub Actions
  • Language: TypeScript
# Install dependencies
pnpm install
# Start development server (local)
pnpm dev:local
# Start development server (remote)
pnpm dev:remote
# Type check
pnpm typecheck
# Run tests
pnpm test
# Complete local validation (recommended before pushing)
pnpm run validate:local
# Deploy to staging
pnpm deploy:staging
# Deploy to production
pnpm deploy:production

Before pushing to CI, run complete validation locally to save CI time:

# Complete validation (TypeScript + Tests + Build)
pnpm run validate:local
# Or run individually:
pnpm typecheck # TypeScript validation
pnpm test # Run all tests
pnpm test:local # Run tests with Cloudflare Workers environment

Why validate locally?

  • ✅ Faster feedback (immediate results)
  • ✅ CI time savings (50-60% reduction)
  • ✅ Better developer experience
  • ✅ CI focuses on deployment only

Local Development Server Testing:

# Start local server
pnpm dev:local
# In another terminal, test API endpoints
curl http://localhost:8787/health
curl http://localhost:8787/api/v1/research
curl http://localhost:8787/api/v1/seeds
# Test Orchestrator
curl -X POST http://localhost:8787/api/v1/seeds/orchestrate \
-H "Content-Type: application/json" \
-d '{"country": "sg", "category": "news", "date": "2026-01-28"}'
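
The orchestrate request body from the curl example can be described and checked in TypeScript. The project lists Zod for validation; the sketch below is a dependency-free stand-in, and its format rules (two-letter lowercase country code, YYYY-MM-DD date) are assumptions inferred from the example values rather than a documented contract. `parseOrchestrateRequest` is a hypothetical name.

```typescript
// Shape of the orchestrate request body, taken from the curl example.
interface OrchestrateRequest {
  country: string;
  category: string;
  date: string;
}

// Assumed format rules (not confirmed by the source):
// two-letter lowercase country, non-empty category, YYYY-MM-DD date.
function parseOrchestrateRequest(input: unknown): OrchestrateRequest {
  if (typeof input !== "object" || input === null) {
    throw new Error("body must be a JSON object");
  }
  const { country, category, date } = input as Record<string, unknown>;
  if (typeof country !== "string" || !/^[a-z]{2}$/.test(country)) {
    throw new Error("country must be a two-letter lowercase code");
  }
  if (typeof category !== "string" || category.length === 0) {
    throw new Error("category must be a non-empty string");
  }
  if (typeof date !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error("date must be formatted as YYYY-MM-DD");
  }
  return { country, category, date };
}
```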

Build Validation (Dry-run):

# Validate build before deployment
pnpm exec wrangler deploy --dry-run --env staging
# With Cloudflare credentials
CLOUDFLARE_API_TOKEN=your-token \
CLOUDFLARE_ACCOUNT_ID=your-id \
pnpm exec wrangler deploy --dry-run --env staging

Set these secrets in Cloudflare Workers:

# Development
wrangler secret put GH_TOKEN
wrangler secret put GH_OWNER
wrangler secret put GH_REPO
# Staging
wrangler secret put GH_TOKEN --env staging
wrangler secret put GH_OWNER --env staging
wrangler secret put GH_REPO --env staging
# Production
wrangler secret put GH_TOKEN --env production
wrangler secret put GH_OWNER --env production
wrangler secret put GH_REPO --env production

Logpush (production only): Worker logs are pushed to R2 via Cloudflare Logpush. Create a single R2 bucket named logpush-r2 in Cloudflare, then set the GitHub Secrets LOGPUSH_R2_ACCESS_KEY and LOGPUSH_R2_SECRET_KEY. The provision and verify steps run only when deploying to production. The Cloudflare API token must have the Logs Write permission. See the Logpush R2 Secrets Guide.

newsfork-seeds/
├── src/
│   ├── apps/  # Entry Points (Workers)
│   │   └── api/
│   │       ├── index.ts  # HTTP Worker entry point
│   │       ├── queue-handler.ts  # Queue consumer handler
│   │       └── scheduled-handler.ts  # Cron handler
│   │
│   ├── domain/  # Pure business logic (Cloudflare-free)
│   │   ├── research/  # Research domain
│   │   │   ├── discoverUrlsFromSource.ts
│   │   │   ├── createResearchOutput.ts
│   │   │   ├── generateDatasetId.ts
│   │   │   └── updateDatasetWithLiveness.ts
│   │   └── seed/  # Seed domain
│   │       ├── createSeedContract.ts
│   │       ├── validateSeedContract.ts
│   │       └── promoteSeedToActive.ts
│   │
│   ├── infra/  # Cloudflare adapters
│   │   └── cloudflare/
│   │       ├── r2/  # R2 Storage adapter
│   │       ├── github/  # GitHub Storage adapter
│   │       └── hybrid/  # Hybrid Storage (R2 + GitHub)
│   │
│   ├── services/  # Service layer (combines Domain + Infra)
│   │   ├── research.service.ts
│   │   ├── seed.service.ts
│   │   ├── dataset.service.ts
│   │   ├── metadata.service.ts
│   │   ├── queue.service.ts
│   │   └── storage.service.ts
│   │
│   ├── routes/  # API route handlers
│   │   ├── health.ts
│   │   ├── research.ts
│   │   ├── seeds.ts
│   │   ├── datasets.ts
│   │   ├── metadata.ts
│   │   └── queues.ts
│   │
│   ├── schemas/  # Zod schemas (contract definitions)
│   │   ├── research.ts
│   │   ├── seed/
│   │   ├── queue.ts
│   │   └── common.ts
│   │
│   └── lib/  # Cross-domain utilities
│       ├── d1/  # D1 utilities
│       ├── kv/  # KV utilities
│       ├── queue/  # Queue utilities
│       ├── path/  # Path builders/parsers
│       └── errors.ts
├── research/  # Research data (R2)
│   ├── datasets/  # Research datasets
│   ├── liveness/  # Liveness check results
│   ├── blocked/  # Blocked domains
│   └── dead/  # Dead domains
├── seeds/  # Seed contracts (GitHub)
│   ├── drafts/  # Pending review
│   ├── active/  # Production contracts
│   └── archived/  # Historical
├── migrations/  # D1 Database migrations
│   └── 001_init.sql
├── .github/
│   ├── workflows/  # GitHub Actions
│   │   ├── deploy.yml  # Deployment workflow
│   │   └── metadata-sync.yml  # Metadata sync workflow
│   └── scripts/  # CI/CD scripts
│       ├── ci.sh
│       ├── setup.sh
│       └── steps/
├── docs/  # Documentation
│   ├── README.md
│   ├── ENVIRONMENT_GUIDE.md
│   └── CLOUDFLARE_MIGRATION_PLAN.md
└── wrangler.jsonc  # Cloudflare Workers config

### Health

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /health | Health check |
| GET | /health/ready | Readiness probe |
| GET | /health/live | Liveness probe |

### Research

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/v1/research | List research outputs |
| GET | /api/v1/research/index | Get research index |
| GET | /api/v1/research/:country/:category/:date | Get specific research |
| GET | /api/v1/research/:country/:category/today | Get today’s research |
| POST | /api/v1/research | Create research output |

### Seeds

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/v1/seeds | List seeds with filters |
| GET | /api/v1/seeds/:id | Get seed by ID |
| POST | /api/v1/seeds | Create draft seed |
| PATCH | /api/v1/seeds/:id | Update seed |
| POST | /api/v1/seeds/:id/promote | Promote to active |
| POST | /api/v1/seeds/:id/archive | Archive seed |

### Datasets

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/v1/datasets | List datasets (R2) |
| GET | /api/v1/datasets/:country/:category/:date/:chunk | Get specific dataset |
| POST | /api/v1/datasets | Save dataset to R2 |

### Metadata

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/v1/metadata/snapshot | Get metadata snapshot |
| POST | /api/v1/metadata/sync | Sync metadata to GitHub |

### Queues

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/v1/queues/research | Create research batch |
| POST | /api/v1/queues/contract | Create contract batch |
| POST | /api/v1/queues/liveness | Create liveness batch |
| GET | /api/v1/queues/batch/:batchId | Get batch status |

## Data Flow

### Research Pipeline

1. Research Request
2. Queue Batch Creation (POST /api/v1/queues/research)
3. Queue Consumer Processing
4. URL Discovery (Domain functions)
5. Dataset Creation & Storage (R2)
6. Metadata Update (D1)
7. GitHub Sync (Audit trail)

### Seed Pipeline

1. Research Dataset (R2)
2. Seed Candidate Analysis
3. Draft Seed Creation (POST /api/v1/seeds)
4. Human Review
5. Promotion (POST /api/v1/seeds/:id/promote)
6. Active Seed (GitHub)

The system uses Cloudflare Queues for reliable task processing:

  • Research Queue: Processes URL discovery batches
  • Contract Queue: Processes seed contract generation
  • Liveness Queue: Processes domain health checks

Each queue has:

  • Batch processing (10-100 items per batch)
  • Automatic retries (max 3 attempts)
  • Dead Letter Queue (DLQ) for failed tasks
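
A queue consumer following this model might look like the sketch below. The interfaces are local stand-ins that mirror the `body`, `attempts`, `ack()`, and `retry()` members of Cloudflare Queues messages, so the example runs without the Workers runtime. `LivenessTask`, `consumeLivenessBatch`, and the `check` callback are hypothetical. The sketch assumes the retry limit and DLQ routing are handled by the queue's consumer configuration rather than in code.

```typescript
// Local stand-ins for the Cloudflare Queues consumer types.
interface QueueMessage<T> {
  readonly body: T;
  readonly attempts: number;
  ack(): void;
  retry(): void;
}

interface QueueBatch<T> {
  readonly messages: readonly QueueMessage<T>[];
}

// Hypothetical task shape for the liveness queue.
interface LivenessTask {
  domain: string;
}

// Acknowledge messages whose check completes and request redelivery
// for those whose check throws. Retry limits and DLQ routing are assumed
// to be enforced by the queue configuration, not by this function.
async function consumeLivenessBatch(
  batch: QueueBatch<LivenessTask>,
  check: (domain: string) => Promise<boolean>,
): Promise<{ acked: number; retried: number }> {
  let acked = 0;
  let retried = 0;
  for (const message of batch.messages) {
    try {
      await check(message.body.domain);
      message.ack();
      acked += 1;
    } catch {
      message.retry();
      retried += 1;
    }
  }
  return { acked, retried };
}
```

Acknowledging per message rather than per batch means one failing domain does not force redelivery of the tasks that already succeeded.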
research/datasets/country=sg/category=news/2026-01-23_0001.json
seeds/drafts/country=sg/domain=mom.gov.sg/content=news/v1.json

These paths use Hive-style key=value partition segments, which keeps them compatible with BigQuery, Delta Lake, AWS Athena, and Cloudflare R2.
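
A path builder and parser for the dataset layout could be sketched like this. The function names are hypothetical (the project keeps its own helpers under src/lib/path/), and reading the `_0001` suffix as a zero-padded chunk number is an assumption based on the `:chunk` parameter of the datasets endpoint.

```typescript
interface DatasetPathParts {
  country: string;
  category: string;
  date: string; // YYYY-MM-DD
  chunk: number; // assumed to render as a zero-padded 4-digit suffix
}

// Builds research/datasets/country=<c>/category=<cat>/<date>_<chunk>.json
function buildDatasetPath(parts: DatasetPathParts): string {
  const chunk = String(parts.chunk).padStart(4, "0");
  return (
    `research/datasets/country=${parts.country}` +
    `/category=${parts.category}/${parts.date}_${chunk}.json`
  );
}

const DATASET_PATH =
  /^research\/datasets\/country=([^/]+)\/category=([^/]+)\/(\d{4}-\d{2}-\d{2})_(\d{4})\.json$/;

// Returns the parts of a dataset path, or null when the path does not match.
function parseDatasetPath(path: string): DatasetPathParts | null {
  const match = DATASET_PATH.exec(path);
  if (match === null) return null;
  return {
    country: match[1],
    category: match[2],
    date: match[3],
    chunk: Number(match[4]),
  };
}
```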

{
  "meta": {
    "dataset_id": "sg-news-2026-01-25-0001",
    "country": "SG",
    "category": "news",
    "discovered_at": "2026-01-25T03:12:00Z",
    "research_methods": ["google_search", "crtsh"],
    "record_count": 8
  },
  "records": [
    {
      "raw_url": "https://www.mom.gov.sg/newsroom",
      "normalized_domain": "mom.gov.sg",
      "domain_id": "gov:sg:mom.gov.sg",
      "source_type": "gov",
      "confidence": 0.95
    }
  ]
}

{
  "seed_id": "sg-mom-001",
  "source": {
    "domain": "mom.gov.sg",
    "type": "government",
    "name": "Ministry of Manpower",
    "country": "SG"
  },
  "contents": [{
    "nature": "news",
    "source_url": "https://www.mom.gov.sg/newsroom",
    "fetch_type": "html",
    "confidence": 0.92
  }],
  "status": "active",
  "version": 1
}
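
For reference, the seed contract example above maps onto TypeScript types roughly as follows. The field list is inferred from that single example, so the project's actual Zod schemas under src/schemas/seed/ may be stricter or carry more fields. `checkSeedContract` and its rules (non-empty contents, confidence between 0 and 1, positive integer version) are illustrative assumptions.

```typescript
type SeedStatus = "draft" | "active" | "archived" | "suspended";

interface SeedContract {
  seed_id: string;
  source: { domain: string; type: string; name: string; country: string };
  contents: Array<{
    nature: string;
    source_url: string;
    fetch_type: string;
    confidence: number; // the examples use values between 0 and 1
  }>;
  status: SeedStatus;
  version: number;
}

// Returns a list of problems; an empty list means the contract
// passed these illustrative checks.
function checkSeedContract(seed: SeedContract): string[] {
  const problems: string[] = [];
  if (seed.contents.length === 0) {
    problems.push("contents must not be empty");
  }
  for (const content of seed.contents) {
    if (content.confidence < 0 || content.confidence > 1) {
      problems.push(`confidence out of range for ${content.source_url}`);
    }
  }
  if (!Number.isInteger(seed.version) || seed.version < 1) {
    problems.push("version must be a positive integer");
  }
  return problems;
}
```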

| Concept | Description | Examples |
|---------|-------------|----------|
| Source Type | WHO produces content | government, media, company |
| Content Category | WHAT the content is | news, policy, guide |
| Medium | HOW it’s delivered | web, social, video |
draft → active → archived
suspended
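
The lifecycle can be modeled as a small transition table. Only draft → active → archived is stated above; how a seed enters or leaves `suspended` is not specified in the source, so the suspended edges below are assumptions made purely for illustration, and `canTransition` is a hypothetical helper.

```typescript
type SeedStatus = "draft" | "active" | "archived" | "suspended";

// draft → active → archived comes from the lifecycle line above.
// ASSUMPTION: suspended is entered from active and can return to active
// or be archived. The source does not confirm these edges.
const TRANSITIONS: Record<SeedStatus, readonly SeedStatus[]> = {
  draft: ["active"],
  active: ["archived", "suspended"],
  suspended: ["active", "archived"],
  archived: [],
};

// True when moving a seed from one status to another is allowed.
function canTransition(from: SeedStatus, to: SeedStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Keeping the table as data makes it straightforward to adjust once the real rules for `suspended` are known.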
  • R2: Raw data (datasets, liveness checks) - Large files
  • D1: Metadata (task state, domain cache) - Queryable, small data
  • KV: Fast lookups (domain registry) - Cache layer
  • GitHub: Audit trail (seed contracts, metadata snapshots) - Version control

The system supports three environments with completely isolated resources:

  • Development (dev): Safe experimentation
  • Staging (staging): Pre-production testing
  • Production (production): Legal compliance records

See Environment Guide for details.

| Workflow | Schedule | Description |
|----------|----------|-------------|
| deploy.yml | On push to main | Deploy to Cloudflare Workers |
| metadata-sync.yml | Every 6 hours | Sync metadata to GitHub |
# Run all tests
pnpm test
# Run tests with Cloudflare Workers environment
pnpm test:local
# Type check
pnpm exec tsc --noEmit
# Apply migrations to dev
pnpm db:migrate
# Apply migrations to staging
pnpm db:migrate:staging
# Apply migrations to production
pnpm db:migrate:production

## Documentation

- Architecture Guidelines - Detailed architecture and coding standards
- Environment Guide - Environment setup and configuration
- /ko/v1/guides/research/ - Research data structure
- /ko/v1/guides/seeds/ - Seed contract structure
- /ko/v1/guides/docs/ - All documentation

## License

MIT
