🧠 Intelligent URL Discovery System: Planning Document

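🎯 Purpose and Background

Current State

Metadata-only collection: only robots.txt and sitemap.xml are collected.
Limited URL classification: only basic path pattern matching is performed (/news, /press, etc.).
Passive discovery: URL discovery depends entirely on sitemap.xml.
Weak content hints: hints are derived from the URL path alone.

Currently Implemented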

const pathHints = [
  { pattern: '/news', hint: 'news' },
  { pattern: '/newsroom', hint: 'news' },
  { pattern: '/press', hint: 'press_release' },
  { pattern: '/media', hint: 'press_release' },
  { pattern: '/faq', hint: 'faq' },
  { pattern: '/questions', hint: 'faq' },
  { pattern: '/policy', hint: 'policy' },
  { pattern: '/regulation', hint: 'policy' },
  { pattern: '/guide', hint: 'guide' },
  { pattern: '/help', hint: 'guide' },
  { pattern: '/announcement', hint: 'announcements' },
  { pattern: '/notice', hint: 'announcements' }
];
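
As a rough illustration of how a table like this is presumably applied, the sketch below does substring matching on the URL path and de-duplicates the resulting hints. The helper name hintsForUrl and the shortened table are assumptions made for the example, not the existing implementation.

```typescript
interface PathHint {
  pattern: string;
  hint: string;
}

// A subset of the table above, repeated so the example is self-contained.
const samplePathHints: PathHint[] = [
  { pattern: '/news', hint: 'news' },
  { pattern: '/newsroom', hint: 'news' },
  { pattern: '/press', hint: 'press_release' },
  { pattern: '/faq', hint: 'faq' },
];

// Hypothetical helper: collects the distinct hints whose pattern occurs in the URL path.
function hintsForUrl(url: string, hints: PathHint[]): string[] {
  const path = new URL(url).pathname.toLowerCase();
  const found = new Set<string>();
  for (const { pattern, hint } of hints) {
    if (path.includes(pattern)) found.add(hint);
  }
  return Array.from(found);
}
```

Because the match is a plain substring test, a path such as /newsletter/signup also receives the news hint. That kind of false positive is one of the limitations the goals below are meant to address.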

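Goals

Intelligent URL discovery: automatically discover meaningful URLs by analyzing the domain structure.
Advanced classification: classify URLs by content type and importance.
Dynamic pattern learning: tailor URL discovery to the characteristics learned for each domain.
Quality assessment: evaluate the news value and reliability of each discovered URL.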

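🏗️ System Architecture

📊 Overall Workflow

[Domain Input] mom.gov.sg
    │
    ▼
[Stage 1] Basic metadata collection
    ├── robots.txt analysis
    ├── sitemap.xml parsing
    └── Server information collection
    │
    ▼
[Stage 2] Intelligent URL discovery
    ├── Sitemap URL analysis and classification
    ├── Domain structure exploration (breadth-first)
    ├── Pattern-based URL generation
    └── Dynamic crawling (limited)
    │
    ▼
[Stage 3] URL quality assessment and classification
    ├── Content type classification (ML-based)
    ├── News value assessment
    ├── Update frequency estimation
    └── Priority score calculation
    │
    ▼
[Output] Structured URL metadata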

🔍 Stage 1: Basic Metadata Collection (already implemented)

Current features

robots.txt: extract crawl policies and sitemap URLs
sitemap.xml: collect the URL list and its metadata
Server information: response headers and performance metrics

Planned improvements

// Extended metadata collection
interface EnhancedDomainMetadata {
  // Existing fields
  robots_txt: RobotsMetadata;
  sitemap_xml: SitemapMetadata;
  server_info: ServerInfo;

  // New fields
  discovered_urls: DiscoveredUrl[];
  url_patterns: UrlPattern[];
  content_categories: ContentCategory[];
  crawl_strategy: CrawlStrategy;
}

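🕷️ Stage 2: Intelligent URL Discovery

🎯 2.1 In-depth Sitemap URL Analysis

Current implementation

// Only counts the <url> entries; no per-URL data is parsed
function parseSitemapUrlCount(content: string): number {
  return (content.match(/<url>/g) || []).length;
}

Planned improvements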

interface SitemapUrl {
  loc: string;                    // URL
  lastmod?: string;              // last modification date
  changefreq?: string;           // declared change frequency
  priority?: number;             // declared priority

  // Derived analysis results
  path_segments: string[];       // URL path split into segments
  inferred_type: ContentType;    // inferred content type
  depth_level: number;           // depth from the domain root
  parent_category?: string;      // top-level category
}

async function analyzeSitemapUrls(content: string): Promise<SitemapUrl[]> {
  const urls: SitemapUrl[] = [];

  // 1. Parse the XML.
  // DOMParser is a browser API; confirm that the target runtime provides it
  // (Workers-style runtimes generally do not) or substitute an XML parser.
  const parser = new DOMParser();
  const doc = parser.parseFromString(content, 'text/xml');
  const urlElements = doc.querySelectorAll('url');

  // 2. Analyze each URL
  for (const urlElement of urlElements) {
    const loc = urlElement.querySelector('loc')?.textContent?.trim();
    if (!loc) continue;

    let url: URL;
    try {
      url = new URL(loc);
    } catch {
      continue; // skip a malformed <loc> instead of aborting the whole sitemap
    }
    const pathSegments = url.pathname.split('/').filter(Boolean);

    urls.push({
      loc,
      lastmod: urlElement.querySelector('lastmod')?.textContent || undefined,
      changefreq: urlElement.querySelector('changefreq')?.textContent || undefined,
      // 0.5 is the default priority defined by the sitemap protocol
      priority: parseFloat(urlElement.querySelector('priority')?.textContent || '0.5'),

      path_segments: pathSegments,
      inferred_type: inferContentTypeFromPath(url.pathname),
      depth_level: pathSegments.length,
      parent_category: pathSegments[0] || 'root'
    });
  }

  return urls;
}
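
For runtimes without DOMParser, the same fields can be pulled out with regular expressions. The following is a minimal sketch under stated assumptions: the sitemap is a flat urlset with no CDATA sections or namespace prefixes, and the names RawSitemapEntry, tagText, and parseSitemapEntries are hypothetical. A production version would more likely use a proper XML parser.

```typescript
interface RawSitemapEntry {
  loc: string;
  lastmod?: string;
  changefreq?: string;
  priority?: number;
}

// Returns the trimmed text of the first <tag>...</tag> inside a block, if any.
function tagText(block: string, tag: string): string | undefined {
  const m = block.match(new RegExp(`<${tag}>\\s*([^<]*?)\\s*</${tag}>`, 'i'));
  return m ? m[1] : undefined;
}

// Extracts one entry per <url> element; entries without a <loc> are skipped.
function parseSitemapEntries(xml: string): RawSitemapEntry[] {
  const entries: RawSitemapEntry[] = [];
  const blocks = xml.match(/<url>[\s\S]*?<\/url>/gi) ?? [];
  for (const block of blocks) {
    const loc = tagText(block, 'loc');
    if (!loc) continue;
    const priorityText = tagText(block, 'priority');
    const priority = priorityText !== undefined ? Number.parseFloat(priorityText) : NaN;
    entries.push({
      loc: loc.replace(/&amp;/g, '&'), // sitemaps entity-escape ampersands in URLs
      lastmod: tagText(block, 'lastmod'),
      changefreq: tagText(block, 'changefreq'),
      priority: Number.isNaN(priority) ? undefined : priority,
    });
  }
  return entries;
}
```

Unlike analyzeSitemapUrls, this sketch leaves priority undefined when the sitemap omits it, so later stages can tell a declared 0.5 from a missing value.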

🗂️ 2.2 Domain Structure Exploration

Pattern-based URL generation

interface UrlPattern {
  pattern: string;               // '/news/{year}/{month}'
  content_type: ContentType;     // 'news'
  confidence: number;            // 0.0 - 1.0
  examples: string[];            // URLs actually observed for this pattern
}

async function discoverUrlPatterns(
  domain: string,
  sitemapUrls: SitemapUrl[]
): Promise<UrlPattern[]> {
  const patterns: Map<string, UrlPattern> = new Map();

  // 1. Extract patterns from the known URLs
  for (const url of sitemapUrls) {
    const pattern = extractPattern(url.path_segments);
    if (pattern) {
      if (!patterns.has(pattern.pattern)) {
        patterns.set(pattern.pattern, {
          pattern: pattern.pattern,
          content_type: url.inferred_type,
          confidence: 0.1,
          examples: []
        });
      }

      // Each additional example raises confidence by 0.1, capped at 0.9
      const existing = patterns.get(pattern.pattern)!;
      existing.examples.push(url.loc);
      existing.confidence = Math.min(0.9, existing.confidence + 0.1);
    }
  }

  // 2. Identify common patterns (not yet merged into the result in this draft)
  const commonPatterns = identifyCommonPatterns(sitemapUrls);

  // 3. Add government/agency-specific patterns.
  // '.gov.' only matches country-code style hosts such as mom.gov.sg, not hosts ending in '.gov'.
  // Patterns already observed in the sitemap are kept rather than overwritten.
  if (domain.includes('.gov.')) {
    if (!patterns.has('/press-releases/{year}')) {
      patterns.set('/press-releases/{year}', {
        pattern: '/press-releases/{year}',
        content_type: ContentType.PRESS_RELEASE,
        confidence: 0.8,
        examples: []
      });
    }

    if (!patterns.has('/policies/{category}')) {
      patterns.set('/policies/{category}', {
        pattern: '/policies/{category}',
        content_type: ContentType.POLICY,
        confidence: 0.7,
        examples: []
      });
    }
  }

  return Array.from(patterns.values());
}
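
The plan does not define extractPattern, so the sketch below shows one way it might work: variable-looking path segments are replaced with placeholders. The name extractPatternSketch, the {id} placeholder, and the year range 1900 to 2099 are assumptions made for illustration.

```typescript
// Hypothetical stand-in for extractPattern: turns path segments into a pattern string.
function extractPatternSketch(segments: string[]): { pattern: string } | null {
  if (segments.length === 0) return null;

  const parts: string[] = [];
  for (const seg of segments) {
    const prev = parts[parts.length - 1];
    if (/^(19|20)\d{2}$/.test(seg)) {
      parts.push('{year}');
    } else if (prev === '{year}' && /^(0?[1-9]|1[0-2])$/.test(seg)) {
      // A 1-12 value counts as a month only when it directly follows a year.
      parts.push('{month}');
    } else if (/^\d+$/.test(seg)) {
      parts.push('{id}');
    } else {
      parts.push(seg.toLowerCase());
    }
  }
  return { pattern: '/' + parts.join('/') };
}
```

With this rule, /newsroom/press-releases/2026 and /newsroom/press-releases/2025 collapse into the single pattern /newsroom/press-releases/{year}, which is what lets the confidence counter in discoverUrlPatterns accumulate.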

Limited dynamic crawling

interface CrawlStrategy {
  max_depth: number;             // maximum crawl depth
  max_urls_per_category: number; // maximum number of URLs per category
  allowed_patterns: string[];    // allowed URL patterns
  respect_robots: boolean;       // whether to honor robots.txt
}

async function performLimitedCrawl(
  domain: string,
  patterns: UrlPattern[],
  strategy: CrawlStrategy
): Promise<DiscoveredUrl[]> {
  const discovered: DiscoveredUrl[] = [];
  const visited = new Set<string>();
  const queue: string[] = [`https://${domain}`];

  while (queue.length > 0 && discovered.length < 100) {
    const currentUrl = queue.shift()!;

    if (visited.has(currentUrl)) continue;
    visited.add(currentUrl);

    try {
      // 1. Check robots.txt
      if (strategy.respect_robots && !isAllowedByRobots(currentUrl)) {
        continue;
      }

      // 2. Request the page (HEAD only).
      // fetch has no `timeout` option; an abort signal enforces the 5-second limit.
      const response = await fetch(currentUrl, {
        method: 'HEAD',
        signal: AbortSignal.timeout(5000)
      });

      if (!response.ok) continue;

      // 3. Check the content type
      const contentType = response.headers.get('content-type');
      if (!contentType?.includes('text/html')) continue;

      // 4. Analyze and classify the URL
      const discoveredUrl = analyzeDiscoveredUrl(currentUrl, response);
      if (discoveredUrl.relevance_score > 0.3) {
        discovered.push(discoveredUrl);
      }

      // 5. Generate additional URLs from patterns (limited)
      const generatedUrls = generateUrlsFromPatterns(currentUrl, patterns);
      queue.push(...generatedUrls.slice(0, 5)); // at most 5 per page

    } catch (error) {
      console.warn(`Crawl failed for ${currentUrl}:`, error);
    }
  }

  return discovered;
}
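
generateUrlsFromPatterns is referenced above but not specified. One possible reading, sketched here, expands only the {year} placeholder with a caller-supplied list of years and skips patterns whose placeholders cannot be filled without site-specific knowledge (such as {category}). The names PatternLike and expandYearPatterns are hypothetical.

```typescript
interface PatternLike {
  pattern: string;
  confidence: number;
}

// Hypothetical expansion step: builds candidate URLs, highest-confidence patterns first.
function expandYearPatterns(origin: string, patterns: PatternLike[], years: number[]): string[] {
  const urls = new Set<string>();
  const ordered = [...patterns].sort((a, b) => b.confidence - a.confidence);

  for (const { pattern } of ordered) {
    const placeholders = pattern.match(/\{[^}]+\}/g) ?? [];
    if (placeholders.length === 0) {
      urls.add(origin + pattern); // literal path, nothing to expand
      continue;
    }
    // Skip patterns containing anything other than {year}; their values are unknown.
    if (placeholders.some((p) => p !== '{year}')) continue;
    for (const year of years) {
      urls.add(origin + pattern.replace(/\{year\}/g, String(year)));
    }
  }
  return Array.from(urls);
}
```

Generated URLs are guesses, so each one still has to pass the robots.txt check and the HEAD request in performLimitedCrawl before it counts as discovered.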

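🤖 Stage 3: URL Quality Assessment and Classification

🏷️ 3.1 Advanced Content Classification

Current classification (simple pattern matching)

// Existing: simple path-based matching
if (urlLower.includes('/news')) {
  hints.push('news');
}

Improved classification system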

interface ContentClassification {
  primary_type: ContentType;      // primary content type
  secondary_types: ContentType[]; // secondary content types
  confidence: number;             // classification confidence
  evidence: ClassificationEvidence; // evidence behind the classification
}

interface ClassificationEvidence {
  url_patterns: string[];         // matched URL patterns
  keyword_matches: string[];      // matched keywords
  structural_hints: string[];     // structural hints
  domain_context: string[];       // domain context
}

async function classifyUrlContent(url: string, context: DomainContext): Promise<ContentClassification> {
  const evidence: ClassificationEvidence = {
    url_patterns: [],
    keyword_matches: [],
    structural_hints: [],
    domain_context: []
  };

  // 1. URL pattern analysis (existing + extended)
  const urlPatterns = analyzeUrlPatterns(url);
  evidence.url_patterns = urlPatterns;

  // 2. Keyword analysis (multilingual)
  const keywords = extractKeywords(url);
  evidence.keyword_matches = keywords;

  // 3. Structural hints (path depth, file extension, etc.)
  const structuralHints = analyzeUrlStructure(url);
  evidence.structural_hints = structuralHints;

  // 4. Domain context (government agency, educational institution, etc.)
  const domainHints = analyzeDomainContext(url, context);
  evidence.domain_context = domainHints;

  // 5. ML-based classification (to be implemented later)
  const mlClassification = await classifyWithML(url, evidence);

  return {
    primary_type: mlClassification.primary_type,
    secondary_types: mlClassification.secondary_types,
    confidence: mlClassification.confidence,
    evidence
  };
}
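
Since classifyWithML is marked for later implementation, a rule-based stand-in is needed in the meantime. The sketch below is one possible fallback, not the planned classifier: it tokenizes the URL path, counts keyword hits per type, and reports the share of hits won by the top type as the confidence. The keyword lists, the reduced type set, and the name classifyByKeywords are all assumptions.

```typescript
type SketchType = 'news' | 'press_release' | 'faq' | 'policy' | 'guide' | 'unknown';
type KnownType = Exclude<SketchType, 'unknown'>;

// Illustrative keyword lists; a real table would be larger and multilingual.
const KEYWORDS: Record<KnownType, string[]> = {
  news: ['news', 'newsroom'],
  press_release: ['press', 'media'],
  faq: ['faq', 'questions'],
  policy: ['policy', 'policies', 'regulation'],
  guide: ['guide', 'help'],
};

interface SketchClassification {
  primary_type: SketchType;
  secondary_types: SketchType[];
  confidence: number;
}

function classifyByKeywords(url: string): SketchClassification {
  // Split the path on anything that is not a letter or digit: '/press-releases' -> ['press', 'releases'].
  const tokens = new URL(url).pathname.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

  const scores: Array<[KnownType, number]> = [];
  for (const type of Object.keys(KEYWORDS) as KnownType[]) {
    const hits = tokens.filter((t) => KEYWORDS[type].includes(t)).length;
    if (hits > 0) scores.push([type, hits]);
  }
  if (scores.length === 0) {
    return { primary_type: 'unknown', secondary_types: [], confidence: 0 };
  }

  scores.sort((a, b) => b[1] - a[1]);
  const total = scores.reduce((sum, [, n]) => sum + n, 0);
  return {
    primary_type: scores[0][0],
    secondary_types: scores.slice(1).map(([t]) => t),
    confidence: scores[0][1] / total, // share of all keyword hits won by the top type
  };
}
```

Ties between types are resolved by list order here, which is arbitrary. Breaking them properly is the kind of decision the structural and domain-context evidence would support.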

Extended content types

enum ContentType {
  // Existing types
  NEWS = 'news',
  PRESS_RELEASE = 'press_release',
  FAQ = 'faq',
  POLICY = 'policy',
  GUIDE = 'guide',
  ANNOUNCEMENT = 'announcement',

  // New types
  LEGISLATION = 'legislation',        // laws and statutes
  REGULATION = 'regulation',          // regulations
  STATISTICS = 'statistics',          // statistics
  REPORT = 'report',                 // reports
  PUBLICATION = 'publication',        // publications
  EVENT = 'event',                   // events
  SERVICE = 'service',               // service information
  CONTACT = 'contact',               // contact details
  ABOUT = 'about',                   // about pages
  ARCHIVE = 'archive',               // archives
  SEARCH = 'search',                 // search
  MULTIMEDIA = 'multimedia',         // multimedia
  DOWNLOAD = 'download',             // downloads
  FORM = 'form',                     // forms
  CALENDAR = 'calendar',             // calendars and schedules
  DIRECTORY = 'directory',           // directories
  UNKNOWN = 'unknown'                // unclassified
}

📊 3.2 News Value Assessment

interface NewsValueAssessment {
  news_relevance: number;        // 0.0 - 1.0 news relevance
  timeliness: number;           // 0.0 - 1.0 timeliness
  authority: number;            // 0.0 - 1.0 authority
  accessibility: number;        // 0.0 - 1.0 accessibility
  update_frequency: UpdateFrequency; // update frequency
  overall_score: number;        // weighted overall score
}

enum UpdateFrequency {
  REAL_TIME = 'real_time',      // real time
  DAILY = 'daily',              // daily
  WEEKLY = 'weekly',            // weekly
  MONTHLY = 'monthly',          // monthly
  QUARTERLY = 'quarterly',      // quarterly
  YEARLY = 'yearly',            // yearly
  IRREGULAR = 'irregular',      // irregular
  STATIC = 'static'             // static
}

function assessNewsValue(
  url: string,
  classification: ContentClassification,
  sitemapData?: SitemapUrl
): NewsValueAssessment {
  let newsRelevance = 0;
  let timeliness = 0;
  let authority = 0;
  let accessibility = 0;

  // 1. News relevance (by primary content type)
  if (classification.primary_type === ContentType.NEWS) {
    newsRelevance = 0.9;
  } else if (classification.primary_type === ContentType.PRESS_RELEASE) {
    newsRelevance = 0.8;
  } else if (classification.primary_type === ContentType.ANNOUNCEMENT) {
    newsRelevance = 0.6;
  } else if (classification.primary_type === ContentType.POLICY) {
    newsRelevance = 0.5;
  }

  // 2. Timeliness (based on the sitemap lastmod)
  if (sitemapData?.lastmod) {
    const lastModified = new Date(sitemapData.lastmod);
    const daysSinceUpdate = (Date.now() - lastModified.getTime()) / (1000 * 60 * 60 * 24);

    // An unparseable lastmod yields NaN; leave timeliness at 0 rather than scoring it as stale
    if (Number.isNaN(daysSinceUpdate)) timeliness = 0;
    else if (daysSinceUpdate < 1) timeliness = 1.0;
    else if (daysSinceUpdate < 7) timeliness = 0.8;
    else if (daysSinceUpdate < 30) timeliness = 0.6;
    else if (daysSinceUpdate < 90) timeliness = 0.4;
    else timeliness = 0.2;
  }

  // 3. Authority (based on the domain).
  // Matching on host labels covers both 'mom.gov.sg' and hosts ending in '.gov'.
  const parsedUrl = new URL(url);
  const labels = parsedUrl.hostname.split('.');
  if (labels.includes('gov')) authority = 1.0;
  else if (labels.includes('edu')) authority = 0.8;
  else if (labels.includes('org')) authority = 0.6;
  else authority = 0.4;

  // 4. Accessibility (based on URL structure).
  // Counting pathname segments ignores the query string and trailing slashes.
  const pathDepth = parsedUrl.pathname.split('/').filter(Boolean).length;
  if (pathDepth <= 2) accessibility = 1.0;
  else if (pathDepth <= 4) accessibility = 0.8;
  else if (pathDepth <= 6) accessibility = 0.6;
  else accessibility = 0.4;

  // 5. Estimate the update frequency
  const updateFreq = estimateUpdateFrequency(classification, sitemapData);

  // 6. Overall score (the weights sum to 1.0)
  const overallScore = (
    newsRelevance * 0.4 +
    timeliness * 0.3 +
    authority * 0.2 +
    accessibility * 0.1
  );

  return {
    news_relevance: newsRelevance,
    timeliness,
    authority,
    accessibility,
    update_frequency: updateFreq,
    overall_score: overallScore
  };
}
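
estimateUpdateFrequency is called above but left undefined. A minimal sketch, assuming the sitemap's declared changefreq is trusted first and the content type serves only as a fallback, could look like the following. The fallback choices and the name estimateFrequencySketch are assumptions; the changefreq values are the seven defined by the sitemap protocol.

```typescript
type Freq =
  | 'real_time' | 'daily' | 'weekly' | 'monthly'
  | 'quarterly' | 'yearly' | 'irregular' | 'static';

// Sitemap protocol changefreq values mapped onto the plan's UpdateFrequency values.
// The protocol has no quarterly value, so 'quarterly' can only come from other signals.
const CHANGEFREQ_TO_FREQ = new Map<string, Freq>([
  ['always', 'real_time'],
  ['hourly', 'real_time'],
  ['daily', 'daily'],
  ['weekly', 'weekly'],
  ['monthly', 'monthly'],
  ['yearly', 'yearly'],
  ['never', 'static'],
]);

function estimateFrequencySketch(primaryType: string, changefreq?: string): Freq {
  // 1. Prefer what the site itself declares.
  const declared = changefreq ? CHANGEFREQ_TO_FREQ.get(changefreq.trim().toLowerCase()) : undefined;
  if (declared) return declared;

  // 2. Otherwise fall back to an assumed default per content type.
  if (primaryType === 'news' || primaryType === 'press_release') return 'daily';
  if (primaryType === 'about' || primaryType === 'contact') return 'static';
  return 'irregular';
}
```

Declared changefreq values are only hints from the site owner, so a later version might cross-check them against observed lastmod intervals.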

🎯 3.3 Priority-based URL Selection

interface PrioritizedUrl {
  url: string;
  classification: ContentClassification;
  news_value: NewsValueAssessment;
  priority_score: number;        // final priority score
  recommended_action: RecommendedAction;
}

enum RecommendedAction {
  IMMEDIATE_CRAWL = 'immediate_crawl',    // crawl immediately
  SCHEDULED_CRAWL = 'scheduled_crawl',    // crawl on a schedule
  PERIODIC_CHECK = 'periodic_check',      // check periodically
  LOW_PRIORITY = 'low_priority',          // low priority
  IGNORE = 'ignore'                       // ignore
}

function prioritizeUrls(discoveredUrls: DiscoveredUrl[]): PrioritizedUrl[] {
  return discoveredUrls
    .map(url => {
      // Compute the priority score
      const priorityScore = calculatePriorityScore(url);

      // Decide the recommended action
      const recommendedAction = determineRecommendedAction(priorityScore, url.classification);

      return {
        url: url.url,
        classification: url.classification,
        news_value: url.news_value,
        priority_score: priorityScore,
        recommended_action: recommendedAction
      };
    })
    .sort((a, b) => b.priority_score - a.priority_score) // highest score first
    .slice(0, 50); // keep only the top 50
}

function calculatePriorityScore(url: DiscoveredUrl): number {
  // The weights sum to 1.0
  const weights = {
    news_value: 0.5,
    classification_confidence: 0.2,
    structural_quality: 0.2,
    discovery_method: 0.1
  };

  return (
    url.news_value.overall_score * weights.news_value +
    url.classification.confidence * weights.classification_confidence +
    assessStructuralQuality(url.url) * weights.structural_quality +
    getDiscoveryMethodScore(url.discovery_method) * weights.discovery_method
  );
}
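
The mapping from score to action (determineRecommendedAction) is not spelled out. The sketch below shows one plausible shape using fixed thresholds; the cut-off values 0.8, 0.6, 0.4, and 0.2, the rule that utility pages are always ignored, and the name recommendActionSketch are assumptions that would need tuning against real data.

```typescript
type Action =
  | 'immediate_crawl' | 'scheduled_crawl' | 'periodic_check'
  | 'low_priority' | 'ignore';

// Hypothetical threshold table, ordered from highest to lowest score.
const ACTION_THRESHOLDS: Array<[number, Action]> = [
  [0.8, 'immediate_crawl'],
  [0.6, 'scheduled_crawl'],
  [0.4, 'periodic_check'],
  [0.2, 'low_priority'],
];

function recommendActionSketch(priorityScore: number, primaryType: string): Action {
  // Assumption: search and contact pages never carry news, whatever their score.
  if (primaryType === 'search' || primaryType === 'contact') return 'ignore';

  for (const [threshold, action] of ACTION_THRESHOLDS) {
    if (priorityScore >= threshold) return action;
  }
  return 'ignore';
}
```

Keeping the thresholds in a table rather than in nested conditionals makes them easy to adjust per domain once classification accuracy has been measured.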

📊 Output Data Structure

🎯 Extended domain metadata

interface EnhancedDomainMetadata {
  // Existing fields
  domain: string;
  domain_id: string;
  registrable_domain: string;

  // Existing technical metadata
  technical_metadata: {
    robots_txt: RobotsMetadata;
    sitemap_xml: SitemapMetadata;
    server_info: ServerInfo;
  };

  // New intelligent discovery results
  intelligent_discovery: {
    discovered_urls: PrioritizedUrl[];
    url_patterns: UrlPattern[];
    // Optional keys: only the content types actually found are present
    content_categories: {
      [key in ContentType]?: {
        count: number;
        examples: string[];
        avg_priority: number;
      }
    };
    crawl_strategy: CrawlStrategy;
    discovery_stats: {
      total_urls_analyzed: number;
      high_priority_urls: number;
      medium_priority_urls: number;
      low_priority_urls: number;
      processing_time_ms: number;
    };
  };

  // Processing metadata
  processing_metadata: {
    processed_at: string;
    processing_duration_ms: number;
    total_urls: number;
    partition_info: PartitionInfo;
    version: string; // algorithm version
  };
}
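
To make the content_categories field concrete, the following sketch aggregates a list of scored URLs into per-type counts, examples, and average priorities. The input shape ScoredUrl, the cap of three examples per type, and the function name summarizeCategories are assumptions for illustration.

```typescript
interface CategorySummary {
  count: number;
  examples: string[];
  avg_priority: number;
}

interface ScoredUrl {
  url: string;
  primary_type: string;
  priority_score: number;
}

// Groups URLs by primary type and keeps a running average of their priority scores.
function summarizeCategories(urls: ScoredUrl[], maxExamples = 3): Map<string, CategorySummary> {
  const totals = new Map<string, number>();
  const result = new Map<string, CategorySummary>();

  for (const u of urls) {
    const entry = result.get(u.primary_type) ?? { count: 0, examples: [], avg_priority: 0 };
    entry.count += 1;
    if (entry.examples.length < maxExamples) entry.examples.push(u.url);

    const total = (totals.get(u.primary_type) ?? 0) + u.priority_score;
    totals.set(u.primary_type, total);
    entry.avg_priority = total / entry.count;

    result.set(u.primary_type, entry);
  }
  return result;
}
```

Object.fromEntries(summarizeCategories(urls)) would convert the result into the plain object form used in the JSON output.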

📋 Example Output

{
  "domain": "mom.gov.sg",
  "domain_id": "gov:sg:mom.gov.sg",
  "registrable_domain": "mom.gov.sg",
  "intelligent_discovery": {
    "discovered_urls": [
      {
        "url": "https://mom.gov.sg/newsroom/press-releases/2026",
        "classification": {
          "primary_type": "press_release",
          "secondary_types": ["news"],
          "confidence": 0.92,
          "evidence": {
            "url_patterns": ["/newsroom", "/press-releases", "/{year}"],
            "keyword_matches": ["press", "releases", "newsroom"],
            "structural_hints": ["depth_2", "year_pattern"],
            "domain_context": ["government", "ministry"]
          }
        },
        "news_value": {
          "news_relevance": 0.8,
          "timeliness": 0.8,
          "authority": 1.0,
          "accessibility": 1.0,
          "update_frequency": "daily",
          "overall_score": 0.86
        },
        "priority_score": 0.89,
        "recommended_action": "immediate_crawl"
      }
    ],
    "url_patterns": [
      {
        "pattern": "/newsroom/press-releases/{year}",
        "content_type": "press_release",
        "confidence": 0.9,
        "examples": [
          "https://mom.gov.sg/newsroom/press-releases/2026",
          "https://mom.gov.sg/newsroom/press-releases/2025"
        ]
      }
    ],
    "content_categories": {
      "press_release": {
        "count": 15,
        "examples": ["https://mom.gov.sg/newsroom/press-releases/2026"],
        "avg_priority": 0.85
      },
      "policy": {
        "count": 8,
        "examples": ["https://mom.gov.sg/employment-practices/employment-act"],
        "avg_priority": 0.72
      }
    },
    "discovery_stats": {
      "total_urls_analyzed": 156,
      "high_priority_urls": 23,
      "medium_priority_urls": 45,
      "low_priority_urls": 88,
      "processing_time_ms": 12500
    }
  }
}

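🚀 Implementation Plan

📅 Phase 1: Strengthen Basic URL Analysis (2 weeks)

Implement in-depth sitemap URL analysis
Extended content classification system
Basic news value assessment logic
Priority-based URL selection

📅 Phase 2: Pattern Discovery and Generation (2 weeks)

URL pattern extraction algorithm
Pattern-based URL generation
Library of domain-specific patterns
Limited dynamic crawling

📅 Phase 3: Intelligent Classification System (3 weeks)

ML-based content classification (optional)
Multilingual keyword analysis
Domain context analysis
Update frequency estimation algorithm

📅 Phase 4: Integration and Optimization (1 week)

Integration with the existing system
Performance optimization
Stronger error handling
Complete the documentation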

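📊 Performance and Constraints

⚡ Performance Targets

Processing time: 30 seconds or less per domain on average
Memory usage: at most 64 MB per Worker
API calls: at most 20 HTTP requests per domain
Accuracy: content classification accuracy of 85% or higher

🚫 Constraints

Cloudflare Workers limits: 30 seconds of CPU time, 128 MB of memory
External request limit: at most 50 URLs analyzed per domain
robots.txt compliance: crawl policies are strictly honored
Rate limiting: a 1-second interval between requests to the same domain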
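
The request cap and the 1-second spacing interact: 20 requests spaced 1 second apart need at least 19 seconds of waiting, which uses up most of the 30-second processing target. The sketch below tracks both limits in one place. The class name RequestBudget and its reserve method are hypothetical; the clock is passed in as an argument so the logic can be tested without real delays.

```typescript
// Tracks a per-domain request cap and a minimum spacing between requests.
class RequestBudget {
  private used = 0;
  private nextAllowedAt = 0;
  private readonly maxRequests: number;
  private readonly minIntervalMs: number;

  constructor(maxRequests: number, minIntervalMs: number) {
    this.maxRequests = maxRequests;
    this.minIntervalMs = minIntervalMs;
  }

  // Returns how many milliseconds the caller must wait before sending the next
  // request, or null when the budget for this domain is already spent.
  reserve(nowMs: number): number | null {
    if (this.used >= this.maxRequests) return null;
    const startAt = Math.max(nowMs, this.nextAllowedAt);
    this.used += 1;
    this.nextAllowedAt = startAt + this.minIntervalMs;
    return startAt - nowMs;
  }
}
```

A crawl loop would call reserve(Date.now()) before each fetch, sleep for the returned number of milliseconds, and stop as soon as it receives null.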

🔄 Error Handling

interface ProcessingError {
  type: 'network' | 'parsing' | 'classification' | 'timeout';
  url: string;
  message: string;
  retry_count: number;
}

async function handleDiscoveryError(
  error: ProcessingError,
  context: DiscoveryContext
): Promise<void> {
  // 1. Log the error
  logger.error('url_discovery_error', error);

  // 2. Retry if the error is retryable (at most 3 attempts)
  if (isRetryableError(error) && error.retry_count < 3) {
    await retryDiscovery(error.url, context);
  }

  // 3. Save whatever partial results exist
  await savePartialResults(context);
}
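
The retry policy itself (isRetryableError and the delay between attempts) is not defined in this plan. The sketch below assumes that only network and timeout errors are worth retrying and that the delay doubles with each attempt. The names isRetryableSketch, backoffDelayMs, and nextAttempt, the 1-second base delay, and the 8-second cap are all assumptions.

```typescript
type ErrorType = 'network' | 'parsing' | 'classification' | 'timeout';

interface ProcessingErrorLike {
  type: ErrorType;
  url: string;
  message: string;
  retry_count: number;
}

// Assumption: parsing and classification failures are deterministic, so retrying them is pointless.
function isRetryableSketch(error: ProcessingErrorLike): boolean {
  return error.type === 'network' || error.type === 'timeout';
}

// Exponential backoff: 1 s, 2 s, 4 s, ... capped at capMs.
function backoffDelayMs(retryCount: number, baseMs = 1000, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** retryCount);
}

// Returns the updated error record and the delay before the next attempt,
// or null when no further attempt should be made.
function nextAttempt(
  error: ProcessingErrorLike,
  maxRetries = 3
): { error: ProcessingErrorLike; delayMs: number } | null {
  if (!isRetryableSketch(error) || error.retry_count >= maxRetries) return null;
  return {
    error: { ...error, retry_count: error.retry_count + 1 },
    delayMs: backoffDelayMs(error.retry_count),
  };
}
```

Returning a new record with an incremented retry_count makes the retry limit enforceable across attempts, which the retry_count < 3 check in handleDiscoveryError relies on.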
