Cloudflare Migration
Overview
The migration is complete: the execution environment has moved from GitHub Actions to Cloudflare Workers, and data storage has moved from GitHub to R2.
Completed Changes
1. Execution Environment: GitHub Actions → Cloudflare Workers
- All data processing logic now runs on Workers
- Stable task distribution via Cloudflare Queues
- Scheduled task execution using Cron triggers
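The Cron-triggered path can be sketched as a scheduled handler that enqueues work instead of doing it inline. This is a minimal sketch rather than the project's actual handler: the local interfaces stand in for the Workers runtime types, and the cron expression `0 */6 * * *` and the message shape are assumptions. Only the `RESEARCH_QUEUE` binding and the `metadata_sync` message type are taken from this document.

```typescript
// Minimal local stand-ins for the Workers runtime types (assumption:
// a real project would use the types shipped for the Workers runtime).
interface QueueProducer<T> {
  send(message: T): Promise<void>;
}

interface ScheduledEvent {
  cron: string;          // the cron expression that fired
  scheduledTime: number; // epoch milliseconds
}

// Hypothetical message shape; only the "metadata_sync" type name is from the document.
interface SyncMessage {
  type: string;
  requestedAt: string;
}

interface Env {
  RESEARCH_QUEUE: QueueProducer<SyncMessage>;
}

// Assumed expression for a six-hourly schedule.
const METADATA_SYNC_CRON = "0 */6 * * *";

const worker = {
  // Enqueue a metadata_sync message when the six-hourly trigger fires;
  // ignore any other schedule.
  async scheduled(event: ScheduledEvent, env: Env): Promise<void> {
    if (event.cron !== METADATA_SYNC_CRON) {
      return;
    }
    await env.RESEARCH_QUEUE.send({
      type: "metadata_sync",
      requestedAt: new Date(event.scheduledTime).toISOString(),
    });
  },
};
```

In a deployed Worker, an object like `worker` would be the module's default export, next to the HTTP and queue handlers.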
2. Data Storage: GitHub → R2
- Raw JSON data stored in R2
- Only metadata synchronized to GitHub (audit trail)
- State and metadata managed via D1 database
3. Adding Cloudflare Queues
- Task splitting and stable processing
- Supports automatic retries and DLQ (Dead Letter Queue)
- Protects external services via rate limiting
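Task splitting can be illustrated with a small helper that cuts a URL list into fixed-size chunks and wraps each chunk in a queue message. This is a sketch under stated assumptions: the chunk size of 100 is the example figure used in the data flow description, while `splitIntoChunks`, `buildChunkMessages`, and every message field other than the `research_batch` type name are hypothetical.

```typescript
// Hypothetical message shape for one chunk of a research batch.
interface ResearchChunkMessage {
  type: "research_batch";
  batchId: string;
  chunkIndex: number;
  totalChunks: number;
  urls: string[];
}

// Split a list into consecutive chunks of at most chunkSize items.
function splitIntoChunks<T>(items: readonly T[], chunkSize: number): T[][] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let start = 0; start < items.length; start += chunkSize) {
    chunks.push(items.slice(start, start + chunkSize));
  }
  return chunks;
}

// Build one queue message per chunk so each message stays small
// enough to process within a single consumer invocation.
function buildChunkMessages(
  batchId: string,
  urls: readonly string[],
  chunkSize: number = 100,
): ResearchChunkMessage[] {
  const chunks = splitIntoChunks(urls, chunkSize);
  return chunks.map((chunk, chunkIndex): ResearchChunkMessage => ({
    type: "research_batch",
    batchId,
    chunkIndex,
    totalChunks: chunks.length,
    urls: chunk,
  }));
}
```

Each resulting message could then be sent to the research queue, with the batch's chunk count recorded in D1 so progress can be tracked.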
New Architecture
┌─────────────────────────────────────────────────────────────────┐
│                       Cloudflare Workers                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │  HTTP API   │  │  Scheduled  │  │    Queue    │              │
│  │   Handler   │  │   (Cron)    │  │  Consumers  │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│  ┌───────────────────────┴───────────────────────┐              │
│  │               Business Services               │              │
│  │     Research │ Dataset │ Metadata │ Queue     │              │
│  └───────────────────────┬───────────────────────┘              │
│                          │                                      │
│  ┌───────────────────────┴───────────────────────┐              │
│  │               Storage Services                │              │
│  │   HybridStorage │ R2Storage │ GitHubStorage   │              │
│  └───────────────────────┬───────────────────────┘              │
└──────────────────────────┼──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│      R2       │  │      D1       │  │    GitHub     │
│  (Raw Data)   │  │  (Metadata)   │  │ (Audit Trail) │
└───────────────┘  └───────────────┘  └───────────────┘

File Structure
src/
├── index.ts                        # App entry point (HTTP, Cron, Queue handlers)
├── routes/
│   ├── index.ts                    # Route exports
│   ├── health.ts                   # Health check
│   ├── research.ts                 # Research API
│   ├── seeds.ts                    # Seed API
│   ├── datasets.ts                 # Dataset API (NEW)
│   ├── metadata.ts                 # Metadata API (NEW)
│   └── queues.ts                   # Queue management API (NEW)
├── services/
│   ├── index.ts                    # Service exports
│   ├── storage.service.ts          # GitHub Storage
│   ├── r2-storage.service.ts       # R2 Storage (NEW)
│   ├── hybrid-storage.service.ts   # Hybrid Storage (NEW)
│   ├── queue.service.ts            # Queue Service (NEW)
│   ├── dataset.service.ts          # Dataset Service (NEW)
│   ├── metadata.service.ts         # Metadata Service (NEW)
│   ├── research.service.ts
│   └── seed.service.ts
├── schemas/
│   ├── index.ts
│   ├── common.ts
│   ├── research.ts
│   ├── seed.ts
│   ├── checkpoint.ts
│   ├── metadata.ts                 # Metadata Schemas (NEW)
│   └── queue.ts                    # Queue Schemas (NEW)
└── lib/
    ├── index.ts
    ├── path.ts
    ├── errors.ts
    ├── domain.ts
    ├── kv.ts
    ├── liveness.ts
    ├── dataset-utils.ts
    ├── checkpoint.ts
    ├── r2.ts                       # R2 Utilities (NEW)
    ├── d1.ts                       # D1 Utilities (NEW)
    └── queue.ts                    # Queue Processors (NEW)

migrations/
└── 001_init.sql                    # D1 Database Schema (NEW)

.github/workflows/
├── deploy.yml                      # Workers deployment
└── metadata-sync.yml               # Metadata sync to GitHub (NEW)

API Endpoints
Existing Endpoints (Maintained)
- GET /health - Health Check
- GET /api/v1/research/:country/:category/:date - Research Query
- POST /api/v1/research - Research Creation
- GET /api/v1/seeds - Seed List
- POST /api/v1/seeds - Seed Creation
New Endpoints
- GET /api/v1/datasets - Dataset List
- GET /api/v1/datasets/stats - Dataset Statistics
- GET /api/v1/datasets/:country/:category/:date - List Chunks for a Specific Date
- GET /api/v1/datasets/:country/:category/:date/:chunk - Query a Specific Chunk
- POST /api/v1/datasets/:country/:category/:date/:chunk - Save Chunk
- DELETE /api/v1/datasets/:country/:category/:date/:chunk - Delete Chunk
- GET /api/v1/metadata/snapshot - Snapshot Metadata
- POST /api/v1/metadata/sync - Trigger GitHub Sync
- GET /api/v1/metadata/by-country/:country - Country-Specific Metadata
- GET /api/v1/metadata/by-category/:category - Category-Specific Metadata
- GET /api/v1/metadata/by-date-range - Date Range Metadata
- GET /api/v1/metadata/tasks - Task Statistics
- POST /api/v1/queues/research - Create Research Batch
- POST /api/v1/queues/liveness - Create Liveness Batch
- POST /api/v1/queues/contract - Create Contract Batch
- GET /api/v1/queues/batch/:batchId - Query Batch Status
- POST /api/v1/queues/trigger/metadata-sync - Trigger Metadata Synchronization
- POST /api/v1/queues/trigger/research-discovery - Trigger Research Discovery
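As an illustration of how a client might call the queue endpoints, the sketch below builds the request for `POST /api/v1/queues/research` and the URL for `GET /api/v1/queues/batch/:batchId`. The request body fields (`country`, `category`, `urls`) match the curl example in the API Test section. The helper names are hypothetical, and the response format is not specified in this document, so no response parsing is shown.

```typescript
// Request body for creating a research batch (fields as used in the
// document's curl example).
interface ResearchBatchRequest {
  country: string;
  category: string;
  urls: string[];
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Join a base URL and a path without producing a double slash.
function joinUrl(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + path;
}

// Hypothetical helper: prepare the POST request for a research batch.
function buildResearchBatchRequest(
  baseUrl: string,
  body: ResearchBatchRequest,
): PreparedRequest {
  if (body.urls.length === 0) {
    throw new Error("urls must contain at least one URL");
  }
  return {
    url: joinUrl(baseUrl, "/api/v1/queues/research"),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Hypothetical helper: URL for checking the status of a batch.
function batchStatusUrl(baseUrl: string, batchId: string): string {
  return joinUrl(baseUrl, "/api/v1/queues/batch/" + encodeURIComponent(batchId));
}
```

The prepared `url` and `init` can be passed directly to `fetch(url, init)`.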
Cloudflare Queues Configuration
Queue List
| Queue Name | Purpose | Batch Size | Timeout | Retries |
|---|---|---|---|---|
| newsfork-research-{env} | Research tasks | 10 | 30s | 3 |
| newsfork-contract-{env} | Contract Creation | 10 | 30s | 3 |
| newsfork-liveness-{env} | Liveness Check | 50-100 | 10s | 3 |
| newsfork-dlq-{env} | Dead Letter Queue | - | - | - |
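The retry column above interacts with how a consumer acknowledges messages. The sketch below shows one possible per-message pattern: acknowledge on success, request a retry on failure, and leave the move to the dead letter queue to the platform once the configured retry limit is exhausted. The local interfaces mirror the `ack()`, `retry()`, and `attempts` members of Cloudflare's queue message objects. The `consumeBatch` helper and the linear backoff of 5 seconds per attempt are assumptions, not the project's actual consumer.

```typescript
// Local stand-ins mirroring the members of Cloudflare Queues message objects
// that this sketch uses.
interface QueueMessage<T> {
  readonly id: string;
  readonly body: T;
  readonly attempts: number; // delivery attempts so far, starting at 1
  ack(): void;
  retry(options?: { delaySeconds?: number }): void;
}

interface QueueBatch<T> {
  readonly queue: string;
  readonly messages: readonly QueueMessage<T>[];
}

interface BatchSummary {
  acked: number;
  retried: number;
}

// Hypothetical helper: process each message independently so that one
// failure does not force the whole batch to be redelivered.
async function consumeBatch<T>(
  batch: QueueBatch<T>,
  handle: (body: T) => Promise<void>,
): Promise<BatchSummary> {
  const summary: BatchSummary = { acked: 0, retried: 0 };
  for (const message of batch.messages) {
    try {
      await handle(message.body);
      message.ack();
      summary.acked += 1;
    } catch (_error) {
      // Assumed backoff: wait longer after each failed attempt.
      message.retry({ delaySeconds: 5 * message.attempts });
      summary.retried += 1;
    }
  }
  return summary;
}
```

Acknowledging per message means a batch of 10 with one bad item redelivers only that item, which keeps retries from repeating work already written to R2.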
Message Types
- research_discovery: Research Discovery Task
- research_batch: Research Batch Processing
- metadata_sync: Metadata Synchronization
- contract_generate: Contract Creation
- contract_validate: Contract Validation
- contract_batch: Contract Batch Processing
- liveness_check: Single Liveness Check
- liveness_batch: Liveness Batch Check
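The message types lend themselves to a string-literal union plus a small routing function. The sketch below is illustrative only: routing by name prefix is an inference from the queue names, `CONTRACT_QUEUE` and `LIVENESS_QUEUE` are hypothetical binding names, and only `RESEARCH_QUEUE` (which also carries `metadata_sync`, per the data flow description) appears in this document.

```typescript
// All message types listed in the document.
const MESSAGE_TYPES = [
  "research_discovery",
  "research_batch",
  "metadata_sync",
  "contract_generate",
  "contract_validate",
  "contract_batch",
  "liveness_check",
  "liveness_batch",
] as const;

type MessageType = (typeof MESSAGE_TYPES)[number];

// RESEARCH_QUEUE is named in the document; the other two are assumed names.
type QueueBinding = "RESEARCH_QUEUE" | "CONTRACT_QUEUE" | "LIVENESS_QUEUE";

// Runtime guard for untrusted input such as a parsed request body.
function isMessageType(value: string): value is MessageType {
  return (MESSAGE_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Assumed routing rule: contract_* and liveness_* go to their own queues,
// everything else (research_* and metadata_sync) goes to the research queue.
function queueFor(type: MessageType): QueueBinding {
  if (type.indexOf("contract_") === 0) {
    return "CONTRACT_QUEUE";
  }
  if (type.indexOf("liveness_") === 0) {
    return "LIVENESS_QUEUE";
  }
  return "RESEARCH_QUEUE";
}
```

Typing the message type as a union lets the compiler flag a misspelled type at the call site instead of at runtime in the consumer.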
Configuration and Deployment
1. Create Cloudflare Resources
# Create R2 buckets
wrangler r2 bucket create newsfork-datasets-dev
wrangler r2 bucket create newsfork-metadata-dev

# Create D1 database
wrangler d1 create newsfork-metadata-dev

# Create queues
wrangler queues create newsfork-research-dev
wrangler queues create newsfork-contract-dev
wrangler queues create newsfork-liveness-dev
wrangler queues create newsfork-dlq-dev

# Create KV namespace
# (newer Wrangler releases use the space-separated form: wrangler kv namespace create DOMAIN_KV)
wrangler kv:namespace create DOMAIN_KV

2. Update wrangler.jsonc
Update the generated resource IDs in wrangler.jsonc:
- R2 bucket_name
- D1 database_id
- KV namespace id
- Queue names
3. Apply D1 Schema
wrangler d1 execute METADATA_DB --file=./migrations/001_init.sql

4. Configure Secrets
wrangler secret put GH_TOKEN
wrangler secret put GH_OWNER
wrangler secret put GH_REPO

5. Deploy
# Development environment
pnpm run dev

# Deploy to staging
pnpm run deploy:staging

# Deploy to production
pnpm run deploy:production

Data Flow
Research Batch Processing
1. API call: POST /api/v1/queues/research
   └─> QueueService.createResearchBatch()
       └─> Split the URL list into chunks (e.g., 100 URLs each)
       └─> Send each chunk to RESEARCH_QUEUE
       └─> Store batch metadata in D1

2. Queue Consumer
   └─> Calls processResearchQueue()
       └─> Process the chunk (URL normalization, domain extraction)
       └─> Save results to R2
       └─> Update batch progress in D1

3. After completion
   └─> metadata_sync task is triggered automatically
       └─> Sync R2 metadata to GitHub

Metadata Synchronization
1. Cron trigger (every 6 hours) or API call
   └─> Send a metadata_sync message to RESEARCH_QUEUE

2. Queue Consumer
   └─> Calls handleMetadataSync()
       └─> Aggregate all dataset metadata from R2
       └─> Commit metadata/snapshot.json to GitHub

3. GitHub Actions (optional)
   └─> metadata-sync.yml workflow
       └─> Fetch the snapshot from the Worker API
       └─> Commit to the GitHub repository

Deleted Files
CLI Tools (Replaced by Workers)
- cli/research-engine.ts
- cli/contract-engine.ts

GitHub Actions Workflows (Replaced by Workers)
- .github/workflows/research-dev.yml
- .github/workflows/research-staging.yml
- .github/workflows/research-prod.yml
- .github/workflows/research-pipeline.yml
- .github/workflows/contract-pipeline.yml
- .github/workflows/seed-dev.yml
- .github/workflows/seed-staging.yml
- .github/workflows/seed-prod.yml
- .github/workflows/seed-promotion.yml
Local Test
# Run Workers locally (Miniflare)
pnpm run dev:local

# Run tests
pnpm test

# Workers-only tests
pnpm run test:local

API Test
# Health check
curl http://localhost:8787/health

# Dataset list
curl http://localhost:8787/api/v1/datasets

# Metadata snapshot
curl http://localhost:8787/api/v1/metadata/snapshot

# Create a research batch
curl -X POST http://localhost:8787/api/v1/queues/research \
  -H "Content-Type: application/json" \
  -d '{"country":"SG","category":"news","urls":["https://example.com"]}'

Monitoring
Cloudflare Dashboard
- Workers Analytics: Request Count, Latency, Error Rate
- Queues: Message Throughput, Retry Count, DLQ Status
- R2: Storage Usage, Request Count
- D1: Query Performance, Storage Usage
Metadata Synchronization
- Verify metadata/snapshot.json in the GitHub repository
- Check the last sync time, total dataset count, and record count
- Worker Execution Time Limit: Each task is designed to complete within 30 seconds
- Queue Retries: Failed tasks are retried up to 3 times before moving to the DLQ
- Rate Limiting: A 100 ms delay is applied per domain during liveness checks
- Memory Limit: Keep per-invocation memory use within the Workers 128 MB limit
- R2 Costs: Monitor storage and request costs
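The per-domain delay for liveness checks could look like the following. This is a simplified sketch, not the project's liveness code: it processes URLs sequentially and waits a fixed 100 ms before every repeat request to a domain it has already contacted. The names `checkLiveness`, `Sleep`, and `realSleep` are hypothetical, and the sleep function is injectable so the behavior can be tested without real waiting.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Default sleep based on a timer.
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Check each URL in order, pausing before any request to a domain
// that has already been contacted in this run.
async function checkLiveness(
  urls: readonly string[],
  check: (url: string) => Promise<boolean>,
  delayMs: number = 100,
  sleep: Sleep = realSleep,
): Promise<Map<string, boolean>> {
  const seenDomains = new Set<string>();
  const results = new Map<string, boolean>();
  for (const url of urls) {
    const domain = new URL(url).hostname;
    if (seenDomains.has(domain)) {
      // Repeat request to the same domain: back off first.
      await sleep(delayMs);
    }
    seenDomains.add(domain);
    results.set(url, await check(url));
  }
  return results;
}
```

With a batch of 50 to 100 URLs concentrated on one domain, the accumulated delay alone can reach several seconds, which is worth weighing against the 30-second task budget noted above.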