Research Engine: Detailed Description

🎯 Core Philosophy

“Research Engine discovers WHERE to look.”

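The Research engine is a system specialized in discovering news sources, and it focuses solely on URL discovery. It makes no judgment about the quality or validity of content; it simply discovers URLs that “may be relevant” and stores them as immutable datasets.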

📁 Data Structure and Storage

🗂️ Directory Structure

research/
├── datasets/           # Discovered URL datasets (immutable snapshots)
│   └── country=sg/
│       └── category=news/
│           ├── 2026-01-24_0001.json
│           ├── 2026-01-25_0001.json
│           └── 2026-01-25_summary.json
├── liveness/           # Domain liveness check results
│   └── country=sg/
│       ├── 2026-01-24.json
│       └── 2026-01-25.json
├── blocked/            # Blocked domains (403, captcha, rate limit)
│   └── country=sg/
│       └── 2026-01-24.json
├── dead/               # Dead domains (DNS failure, unreachable)
│   └── country=sg/
│       └── 2026-01-24.json
└── processing/         # Checkpoints and intermediate processing data
    └── checkpoints/
        └── country=sg/category=news/
            └── research/checkpoint.json

🏪 Storage Tiers

Primary Storage: Cloudflare R2 (DATASETS_BUCKET)
Metadata Storage: Cloudflare D1 (batch state, statistics)
Audit Trail: GitHub (metadata sync)

📊 Path Conventions (Hive-style Partitioning)

research/datasets/country={cc}/category={cat}/{date}_{chunk}.json
research/liveness/country={cc}/{date}.json
research/blocked/country={cc}/{date}.json
research/dead/country={cc}/{date}.json

This key=value directory layout follows the Hive partitioning convention and is compatible with the following tools:

BigQuery
Delta Lake
AWS Athena
Cloudflare R2
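
The path conventions above can be sketched as a pair of key builders. This is a minimal sketch, not the engine's actual code: the function names `datasetKey` and `dailyKey` are hypothetical, and the lowercase country code and four-digit zero padding are assumptions inferred from example keys such as `country=sg` and `2026-01-25_0001.json`.

```typescript
// Hypothetical helpers; names and signatures are assumptions for illustration.
type DailyPartition = "liveness" | "blocked" | "dead";

// Builds research/datasets/country={cc}/category={cat}/{date}_{chunk}.json.
// The chunk index is zero-padded to four digits (assumed from the examples).
function datasetKey(country: string, category: string, date: string, chunk: number): string {
  const cc = country.toLowerCase();
  const cat = category.toLowerCase();
  const chunkId = String(chunk).padStart(4, "0");
  return `research/datasets/country=${cc}/category=${cat}/${date}_${chunkId}.json`;
}

// Builds the per-country daily key used by liveness/, blocked/, and dead/.
function dailyKey(partition: DailyPartition, country: string, date: string): string {
  return `research/${partition}/country=${country.toLowerCase()}/${date}.json`;
}

console.log(datasetKey("SG", "news", "2026-01-25", 1));
console.log(dailyKey("liveness", "SG", "2026-01-25"));
```

Keeping key construction in one place makes it harder for writers and readers of the bucket to drift apart on casing or padding.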

📋 Data Schemas

🎯 Research Dataset Schema

{
  "meta": {
    "dataset_id": "sg-news-2026-01-25-0001",
    "country": "SG",
    "category": "news",
    "discovered_at": "2026-01-25T03:12:00Z",
    "research_methods": ["google_search", "crtsh", "wayback_machine"],
    "queries": [
      "Singapore government news site:.gov.sg",
      "Singapore ministry news 2026"
    ],
    "engine": {
      "name": "research-engine",
      "version": "1.0.0"
    },
    "record_count": 8,
    "chunk_info": {
      "chunk_index": 1,
      "total_chunks": 3,
      "chunk_size": 100
    }
  },
  "records": [
    {
      "raw_url": "https://www.mom.gov.sg/newsroom",
      "normalized_domain": "mom.gov.sg",
      "domain_id": "gov:sg:mom.gov.sg",
      "registrable_domain": "mom.gov.sg",
      "subdomain": "www",
      "source_type": "gov",
      "discovery_method": "google_search",
      "discovery_query": "Singapore government news site:.gov.sg",
      "confidence": 0.95,
      "content_hints": ["news", "government_content"],
      "discovered_at": "2026-01-25T03:12:15Z",
      "metadata": {
        "title": "Newsroom - Ministry of Manpower",
        "description": "Latest news and updates from MOM",
        "language": "en"
      }
    }
  ]
}

🔍 Liveness Check Schema

{
  "meta": {
    "country": "SG",
    "check_date": "2026-01-25",
    "total_domains": 150,
    "alive_count": 142,
    "dead_count": 5,
    "blocked_count": 3
  },
  "results": [
    {
      "domain_id": "gov:sg:mom.gov.sg",
      "domain": "mom.gov.sg",
      "status": "alive",
      "http_status": 200,
      "response_time_ms": 245,
      "last_check": "2026-01-25T10:30:00Z",
      "ssl_valid": true,
      "redirect_chain": [
        "https://mom.gov.sg",
        "https://www.mom.gov.sg"
      ]
    }
  ]
}
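
The `meta` counters in the liveness schema can be derived from the `results` array. The sketch below shows one way to do that; the type and function names are hypothetical, and it assumes every result carries one of the three statuses that `meta` counts, so that the three counters add up to `total_domains` as they do in the example (142 + 5 + 3 = 150).

```typescript
// Hypothetical types modelled on the Liveness Check Schema above.
type CountedStatus = "alive" | "dead" | "blocked";

interface LivenessEntry {
  domain_id: string;
  domain: string;
  status: CountedStatus;
}

interface LivenessMeta {
  country: string;
  check_date: string;
  total_domains: number;
  alive_count: number;
  dead_count: number;
  blocked_count: number;
}

// Derives the meta block by counting results per status.
function summarizeLiveness(country: string, checkDate: string, results: LivenessEntry[]): LivenessMeta {
  const count = (status: CountedStatus): number =>
    results.filter((r) => r.status === status).length;
  return {
    country,
    check_date: checkDate,
    total_domains: results.length,
    alive_count: count("alive"),
    dead_count: count("dead"),
    blocked_count: count("blocked"),
  };
}
```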

🔧 Key Features and Processing Logic

✅ Responsibilities of the Research Engine

1. URL Discovery

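Google Search API: runs structured search queries
Certificate Transparency (crt.sh): discovers domains from issued SSL/TLS certificates
Wayback Machine: extracts URLs from historical snapshots
DNS enumeration: brute-forces subdomains (optional)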

2. Domain Normalization

// Example normalization
"https://www.mom.gov.sg/newsroom/press-releases"
  ↓
{
  raw_url: "https://www.mom.gov.sg/newsroom/press-releases",
  normalized_domain: "mom.gov.sg",
  registrable_domain: "mom.gov.sg",
  subdomain: "www",
  domain_id: "gov:sg:mom.gov.sg"
}
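
The normalization step above can be sketched with the standard `URL` class. This is a minimal sketch under stated assumptions: the function name `normalizeUrl` is hypothetical, `normalized_domain` is assumed to be the hostname with a leading `www.` removed, and the small hard-coded suffix set stands in for the Public Suffix List that a real implementation would need to find the registrable domain reliably.

```typescript
interface NormalizedUrl {
  raw_url: string;
  normalized_domain: string;
  registrable_domain: string;
  subdomain: string;
}

// Assumption: a tiny stand-in for the Public Suffix List, covering only
// the two-label Singapore suffixes that appear in this document.
const TWO_LABEL_SUFFIXES = new Set(["gov.sg", "com.sg", "org.sg", "edu.sg", "net.sg"]);

function normalizeUrl(rawUrl: string): NormalizedUrl {
  const host = new URL(rawUrl).hostname.toLowerCase();
  const labels = host.split(".");

  // Decide whether the public suffix is one label ("sg") or two ("gov.sg").
  const lastTwo = labels.slice(-2).join(".");
  const suffixLabels = TWO_LABEL_SUFFIXES.has(lastTwo) ? 2 : 1;

  // The registrable domain is the suffix plus one more label.
  const registrableLabels = Math.min(labels.length, suffixLabels + 1);
  const registrable = labels.slice(-registrableLabels).join(".");
  const subdomain = labels.slice(0, labels.length - registrableLabels).join(".");

  return {
    raw_url: rawUrl,
    normalized_domain: host.replace(/^www\./, ""),
    registrable_domain: registrable,
    subdomain,
  };
}

console.log(normalizeUrl("https://www.mom.gov.sg/newsroom/press-releases"));
```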

3. Domain ID Generation Rules

// Government domains: gov:{country}:{domain}
"mom.gov.sg" → "gov:sg:mom.gov.sg"

// General organizations: org:{country}:{domain}
"redcross.org.sg" → "org:sg:redcross.org.sg"

// Companies: com:{country}:{domain}
"dbs.com.sg" → "com:sg:dbs.com.sg"

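The three rules above can be sketched as a single function. This is a hedged illustration only: the name `generateDomainId` is hypothetical, and reading the prefix from the second-level label is an assumption drawn from the three examples. The document does not say how domains outside these patterns are handled, so the sketch returns `null` for them rather than guessing.

```typescript
// Prefixes named in the rules above.
const SOURCE_PREFIXES = ["gov", "org", "com"];

// Returns "{prefix}:{country}:{domain}" or null when no rule applies.
function generateDomainId(registrableDomain: string, country: string): string | null {
  const domain = registrableDomain.toLowerCase();
  const cc = country.toLowerCase();
  const labels = domain.split(".");

  // Expect at least name + category + country code, e.g. "mom.gov.sg".
  if (labels.length < 3 || labels[labels.length - 1] !== cc) {
    return null;
  }

  // Assumption: the second-level label ("gov", "org", "com") selects the prefix.
  const category = labels[labels.length - 2];
  const prefix = SOURCE_PREFIXES.find((p) => p === category);
  return prefix ? `${prefix}:${cc}:${domain}` : null;
}

console.log(generateDomainId("mom.gov.sg", "SG"));
```
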
4. Liveness Check

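Phase 1-A: checks for a basic HTTP response
SSL certificate validation: confirms a valid HTTPS configuration
Redirect chain tracking: records the final destination URL
Response time measurement: collects performance metrics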

5. Immutable Dataset Creation

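Chunking: splits large result sets into chunks of 100 to 1,000 records
Metadata generation: attaches metadata to each chunk
Checksums: guard data integrity
Timestamps: record the exact time of discovery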
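
The chunking and per-chunk metadata steps can be sketched as follows. The names `chunkRecords` and `DatasetChunk` are hypothetical, and the 1-based `chunk_index` is an assumption taken from the schema example; checksum and timestamp generation are left out of this sketch.

```typescript
interface ChunkInfo {
  chunk_index: number; // 1-based, as in the schema example (assumption)
  total_chunks: number;
  chunk_size: number;
}

interface DatasetChunk<T> {
  record_count: number;
  chunk_info: ChunkInfo;
  records: T[];
}

// Splits records into fixed-size chunks and attaches chunk metadata to each.
function chunkRecords<T>(records: T[], chunkSize: number): DatasetChunk<T>[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const totalChunks = Math.ceil(records.length / chunkSize);
  const chunks: DatasetChunk<T>[] = [];
  for (let i = 0; i < totalChunks; i++) {
    const slice = records.slice(i * chunkSize, (i + 1) * chunkSize);
    chunks.push({
      record_count: slice.length,
      chunk_info: { chunk_index: i + 1, total_chunks: totalChunks, chunk_size: chunkSize },
      records: slice,
    });
  }
  return chunks;
}

// 250 records with a chunk size of 100 yield chunks of 100, 100, and 50.
const demoChunks = chunkRecords(Array.from({ length: 250 }, (_, i) => i), 100);
console.log(demoChunks.map((c) => c.record_count));
```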

❌ What the Research Engine Does Not Do

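Content type analysis: does not distinguish RSS, HTML, or API sources
Content quality assessment: does not judge news quality or credibility
Metadata extraction: does not collect detailed page metadata
Seed contract generation: does not define how content is collected
Actual content collection: does not download page content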

🚀 Workflow and Processing

📊 End-to-End Processing Flow

[API Request] POST /api/v1/queues/research
     │
     ├── Validate the request (country, category, urls)
     ├── Generate a batch ID (batch_uuid)
     └── Store batch metadata in D1
     │
     ▼
[Queue Message Creation]
     │
     ├── Split the URL group into chunks
     ├── Create one queue message per chunk
     └── Send the messages to RESEARCH_QUEUE
     │
     ▼
[Queue Consumer Processing]
     │
     ├── Process messages in batches (max_batch_size: 10)
     ├── Run the discovery logic for each URL
     └── Process in parallel (with concurrency control)
     │
     ▼
[Domain Functions Execution]
     │
     ├── discoverUrlsFromSource(input)
     ├── normalizeDiscoveredUrls(urls)
     ├── generateDomainIds(domains)
     └── createResearchOutput(results)
     │
     ▼
[Storage Operations]
     │
     ├── Store datasets in R2
     ├── Update metadata in D1
     ├── Update batch status
     └── Sync to GitHub (optional)

🔄 Batch Processing Details

1. Batch Creation

// POST /api/v1/queues/research
{
  "country": "SG",
  "category": "news",
  "urls": ["https://example.com", "https://test.com"],
  "chunk_size": 100,
  "research_methods": ["google_search", "crtsh"]
}

// Batch metadata created for a request (illustrative values)
{
  batch_id: "batch_2026-01-25_sg-news_001",
  country: "SG",
  category: "news",
  total_urls: 250,
  chunk_size: 100,
  total_chunks: 3,
  status: "queued",
  created_at: "2026-01-25T10:00:00Z"
}

2. Queue Message Structure

// RESEARCH_QUEUE message
{
  batch_id: "batch_2026-01-25_sg-news_001",
  chunk_index: 1,
  urls: ["url1", "url2", ...], // up to 100
  research_config: {
    methods: ["google_search", "crtsh"],
    country: "SG",
    category: "news"
  }
}

3. Consumer Processing Logic

// ResearchQueueMessage is an assumed type name for the RESEARCH_QUEUE message shape shown above.
export async function handleResearchQueue(
  batch: MessageBatch<ResearchQueueMessage>,
  env: Env
): Promise<void> {
  for (const message of batch.messages) {
    const { batch_id, chunk_index, urls, research_config } = message.body;

    try {
      // 1. Run URL discovery
      const discoveredUrls = await discoverUrlsFromSource({
        urls,
        methods: research_config.methods,
        country: research_config.country
      });

      // 2. Normalize domains
      const normalizedResults = await normalizeDiscoveredUrls(discoveredUrls);

      // 3. Build the Research output
      const researchOutput = await createResearchOutput({
        batch_id,
        chunk_index,
        results: normalizedResults,
        config: research_config
      });

      // 4. Store in R2
      const datasetPath = buildR2DatasetPath(
        research_config.country,
        research_config.category,
        getCurrentDate(),
        chunk_index
      );

      await env.DATASETS_BUCKET.put(
        datasetPath,
        JSON.stringify(researchOutput)
      );

      // 5. Update batch metadata in D1
      await updateBatchProgress(env.METADATA_DB, batch_id, chunk_index);

      // Acknowledge this message so a later failure in the batch does not redeliver it
      message.ack();
    } catch (error) {
      // Log the failure and retry only this message; once max_retries is
      // exhausted, the queue moves it to the dead letter queue (DLQ)
      const reason = error instanceof Error ? error.message : String(error);
      console.error(`Research processing failed: ${reason}`);
      message.retry();
    }
  }
}

🔌 API Endpoints

📋 Research Operations API

Method	Endpoint	Description	Parameters
GET	`/api/v1/research`	List Research outputs	`country`, `category`, `limit`, `offset`
GET	`/api/v1/research/index`	Get the Research index	-
GET	`/api/v1/research/:country/:category/:date`	Get a specific Research output	`country`, `category`, `date`
GET	`/api/v1/research/:country/:category/today`	Get today's Research output	`country`, `category`
POST	`/api/v1/research`	Create a Research output	Request Body

🔄 Queue Operations API

Method	Endpoint	Description	Parameters
POST	`/api/v1/queues/research`	Create a Research batch	Request Body
GET	`/api/v1/queues/batch/:batchId`	Get batch status	`batchId`
POST	`/api/v1/queues/liveness`	Create a liveness check batch	Request Body

📊 API Usage Examples

Create a Research batch

curl -X POST https://api.newsfork.com/api/v1/queues/research \
  -H "Content-Type: application/json" \
  -d '{
    "country": "SG",
    "category": "news",
    "urls": [
      "https://www.gov.sg",
      "https://www.moh.gov.sg"
    ],
    "chunk_size": 100,
    "research_methods": ["google_search", "crtsh"]
  }'

# Response
{
  "success": true,
  "data": {
    "batch_id": "batch_2026-01-25_sg-news_001",
    "total_chunks": 1,
    "estimated_completion": "2026-01-25T10:15:00Z"
  }
}

Retrieve Research results

curl "https://api.newsfork.com/api/v1/research/SG/news/2026-01-25"

# Response
{
  "success": true,
  "data": {
    "datasets": [
      {
        "dataset_id": "sg-news-2026-01-25-0001",
        "path": "research/datasets/country=sg/category=news/2026-01-25_0001.json",
        "record_count": 85,
        "size_bytes": 45120,
        "created_at": "2026-01-25T10:12:00Z"
      }
    ],
    "summary": {
      "total_records": 85,
      "unique_domains": 23,
      "discovery_methods": ["google_search", "crtsh"]
    }
  }
}

⚙️ Configuration and Environment

🔧 Queue Configuration (wrangler.jsonc)

{
  "queues": {
    "consumers": [
      {
        "queue": "newsfork-research-staging",
        "max_batch_size": 10,
        "max_batch_timeout": 30,
        "max_retries": 3,
        "dead_letter_queue": "newsfork-dlq-staging"
      }
    ]
  }
}

🌍 Per-Environment Path Separation

Development:   dev/research/datasets/...
Staging:       staging/research/datasets/...
Production:    prod/research/datasets/...

📊 Monitoring Metrics

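Batch success rate: completed batches / total batches
Discovered URL count: unique URLs discovered per hour
Domain liveness rate: live domains / total domains
Processing latency: average time to process a queue message
Error rate: failed messages / total messages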
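
The three ratio metrics above can be computed from raw counters as in the sketch below. The counter names and the `computeRates` function are hypothetical, and returning 0 when a denominator is 0 is an assumption made to avoid `NaN` in dashboards.

```typescript
// Hypothetical counters; where they come from (D1, logs) is not specified here.
interface PipelineCounters {
  batches_total: number;
  batches_completed: number;
  domains_total: number;
  domains_alive: number;
  messages_total: number;
  messages_failed: number;
}

interface PipelineRates {
  batch_success_rate: number;
  domain_liveness_rate: number;
  error_rate: number;
}

// Safe division: an empty denominator yields 0 instead of NaN (assumption).
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeRates(c: PipelineCounters): PipelineRates {
  return {
    batch_success_rate: ratio(c.batches_completed, c.batches_total),
    domain_liveness_rate: ratio(c.domains_alive, c.domains_total),
    error_rate: ratio(c.messages_failed, c.messages_total),
  };
}
```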

🔍 Liveness Check

🎯 Phase 1-A Liveness Check

The Research engine performs only a basic liveness check:

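async function checkDomainLiveness(domain: string): Promise<LivenessResult> {
  // Record the start time so response_time_ms can be computed below
  const startTime = Date.now();

  try {
    const response = await fetch(`https://${domain}`, {
      method: 'HEAD',
      // fetch has no `timeout` option; abort the request after 10 seconds instead
      signal: AbortSignal.timeout(10000),
      redirect: 'follow'
    });

    return {
      domain,
      status: response.ok ? 'alive' : 'error',
      http_status: response.status,
      response_time_ms: Date.now() - startTime,
      // The TLS handshake succeeded if the final URL was served over HTTPS
      ssl_valid: response.url.startsWith('https://'),
      // getRedirectChain is a helper that is not shown in this document
      redirect_chain: getRedirectChain(response),
      last_check: new Date().toISOString()
    };
  } catch (error) {
    // DNS failure, refused connection, or timeout
    return {
      domain,
      status: 'dead',
      error: error instanceof Error ? error.message : String(error),
      last_check: new Date().toISOString()
    };
  }
}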

📊 Liveness Result Classification

alive: HTTP 200-299 response
dead: DNS failure, unreachable, timeout
blocked: 403, 429, captcha detected
redirect: permanent redirect (301, 308)
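
The classification rules above can be sketched as a pure function over the outcome of a probe. The names `ProbeOutcome` and `classifyLiveness` are hypothetical, captcha detection is represented by a flag whose source is not specified here, and the `error` catch-all for other status codes is an assumption that mirrors the `'error'` status used in `checkDomainLiveness`.

```typescript
type LivenessClass = "alive" | "dead" | "blocked" | "redirect" | "error";

// Hypothetical probe result; http_status is undefined when no HTTP response
// was received (DNS failure, unreachable host, or timeout).
interface ProbeOutcome {
  http_status?: number;
  captcha_detected?: boolean;
}

function classifyLiveness(outcome: ProbeOutcome): LivenessClass {
  const status = outcome.http_status;
  if (status === undefined) return "dead";
  if (outcome.captcha_detected || status === 403 || status === 429) return "blocked";
  if (status === 301 || status === 308) return "redirect";
  if (status >= 200 && status <= 299) return "alive";
  // Anything else (for example 404 or 500) is not covered by the list above.
  return "error";
}

console.log(classifyLiveness({ http_status: 200 }));
```

Note that a permanent redirect is only observable as a final status when the probe does not follow redirects automatically.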

🔗 Service Layer Structure

🏗️ Research Service Architecture

// Domain Layer (pure business logic)
export function discoverUrlsFromSource(input: DiscoverUrlsInput): DiscoverUrlsOutput
export function createResearchOutput(...): ResearchOutput
export function generateDatasetId(...): string
export function createDatasetPath(...): string

// Service Layer (orchestrates domain and infrastructure)
export class ResearchService {
  async list(params: ResearchListParams): Promise<ResearchListResult>
  async get(country: string, category: string, date: string): Promise<ResearchOutput>
  async create(request: CreateResearchRequest): Promise<ResearchOutput>
  async createBatch(request: CreateBatchRequest): Promise<BatchResult>
}

// Infrastructure Layer (Cloudflare adapters)
export class R2StorageAdapter {
  async storeDataset(path: string, data: ResearchOutput): Promise<void>
  async getDataset(path: string): Promise<ResearchOutput>
  async listDatasets(prefix: string): Promise<DatasetInfo[]>
}

🔄 Dependency Injection

// Inject the infrastructure adapters when constructing the service
const researchService = new ResearchService({
  r2Storage: new R2StorageAdapter(env.DATASETS_BUCKET),
  d1Database: new D1Adapter(env.METADATA_DB),
  githubStorage: new GitHubStorageAdapter(env.GITHUB_TOKEN)
});

📈 Performance and Scalability

⚡ Parallel Processing Strategy

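Batch level: multiple batches are processed concurrently
Chunk level: chunks within a batch are processed in parallel
URL level: URLs within a chunk are discovered concurrently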

📊 Throughput Optimization

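// Adaptive concurrency control
// `errors` and `totalRequests` are running counters maintained by the caller
let concurrency = 10;
const errorRate = errors / totalRequests;

if (errorRate > 0.05) {
  // Back off, but never below 5; round so concurrency stays an integer
  concurrency = Math.max(5, Math.floor(concurrency * 0.8));
} else if (errorRate < 0.01) {
  // Ramp up, but never above 50
  concurrency = Math.min(50, Math.ceil(concurrency * 1.2));
}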

🔄 Retries and Error Handling

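Exponential backoff: 1s → 2s → 4s → 8s
Circuit breaker: pauses processing after consecutive failures
DLQ handling: moves messages to a manual review queue after the maximum number of retries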
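
The backoff schedule and circuit breaker above can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: the names `backoffDelayMs` and `CircuitBreaker` are hypothetical, and the failure threshold and cooldown are parameters because the document does not specify their values.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ...
function backoffDelayMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attempt);
}

// Opens after `threshold` consecutive failures and stays open for `cooldownMs`.
class CircuitBreaker {
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // A success resets the failure streak and closes the breaker.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // `now` is a millisecond timestamp, passed in to keep the class testable.
  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAt = now;
    }
  }

  // While open, callers should skip work instead of issuing requests.
  isOpen(now: number): boolean {
    return this.openedAt !== null && now - this.openedAt < this.cooldownMs;
  }
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

Passing the clock in as an argument, rather than calling `Date.now()` inside the class, keeps the breaker deterministic in tests.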

🎉 Conclusion

The Research engine is the first stage of the Newsfork pipeline and has one clearly defined responsibility: URL discovery.

🎯 Core Values

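Simplicity: focuses on discovery only and makes no judgments
Scalability: scales independently per country and category
Reliability: immutable datasets and a checkpoint system
Traceability: a complete audit trail and metadata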

This gives the Seed engine a stable input from which it can generate collection contracts.