
Source Domain to Source URL Discovery Strategy

No single method works for every domain. The recommended approach is stepwise exploration combined with confidence-based prioritization.

Design it as a multi-stage pipeline with fallback.

source_domain
[1] Well-known Feeds
↓ (if none)
[2] Sitemap Analysis
↓ (if none)
[3] Robots.txt Hints
↓ (if none)
[4] HTML Pattern Exploration
↓ (if none)
[5] Search Engine Assisted Exploration

The first stage that succeeds determines the source_url.
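
The fallback behaviour described above can be sketched as a list of stage functions tried in priority order. This is a minimal sketch, not a definitive implementation: the `Discovery` type, the `discover` function, the stub stages, and the 0.8 confidence value are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Discovery:
    source_url: str
    discovery_method: str
    confidence: float


# A stage takes a domain and returns a Discovery, or None if it finds nothing.
Stage = Callable[[str], Optional[Discovery]]


def discover(domain: str, stages: list[Stage]) -> Optional[Discovery]:
    """Run the stages in priority order and return the first hit."""
    for stage in stages:
        result = stage(domain)
        if result is not None:
            return result
    return None


# Stub stages for illustration only; real stages would make HTTP requests.
def well_known_feeds(domain: str) -> Optional[Discovery]:
    return None  # pretend no well-known feed path responded


def sitemap_analysis(domain: str) -> Optional[Discovery]:
    # 0.8 is an assumed confidence value, not one taken from the document.
    return Discovery(f"https://{domain}/news", "sitemap", 0.8)


hit = discover("example.com", [well_known_feeds, sitemap_analysis])
print(hit)
```

Because each stage shares one signature, stages can be reordered or dropped without touching the loop.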

1. Well-known Feed Path (Highest Priority, Resolves 80%)

https://{domain}/rss
https://{domain}/rss.xml
https://{domain}/feed
https://{domain}/feeds/news.xml
https://{domain}/news/rss
https://{domain}/press/rss
https://{domain}/rss/news
https://{domain}/newsroom/rss
https://{domain}/press-releases/rss
https://{domain}/media-centre/rss
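
A candidate is accepted as a feed when:
  • The response status is HTTP 200
  • Content-Type is application/xml or application/rss+xml
  • The body contains at least one <item> or <entry> element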
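
The candidate paths and the three acceptance checks can be sketched as below. This is a rough sketch under stated assumptions: `candidate_urls` and `looks_like_feed` are hypothetical helper names, the HTTP request itself is left to the caller, and `application/atom+xml` is added as an assumption because `<entry>` is an Atom element.

```python
import xml.etree.ElementTree as ET

WELL_KNOWN_PATHS = [
    "/rss", "/rss.xml", "/feed", "/feeds/news.xml", "/news/rss",
    "/press/rss", "/rss/news", "/newsroom/rss", "/press-releases/rss",
    "/media-centre/rss",
]

# The first two types come from the text; atom+xml is an added assumption.
FEED_TYPES = {"application/xml", "application/rss+xml", "application/atom+xml"}


def candidate_urls(domain: str) -> list[str]:
    """Expand the well-known paths into full URLs for one domain."""
    return [f"https://{domain}{path}" for path in WELL_KNOWN_PATHS]


def looks_like_feed(status: int, content_type: str, body: bytes) -> bool:
    """Apply the three checks: status, Content-Type, <item>/<entry> presence."""
    if status != 200:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in FEED_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Compare local names so namespaced Atom <entry> elements also match.
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] in ("item", "entry"):
            return True
    return False


sample = b"<rss version='2.0'><channel><item><title>x</title></item></channel></rss>"
print(looks_like_feed(200, "application/rss+xml; charset=utf-8", sample))
```

Keeping the validator free of network calls makes it easy to test against saved responses.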

2. Sitemap Analysis

  • Fetch https://{domain}/sitemap.xml
  • Prioritize sitemap URLs that contain /news, /press, /media, or /announcements
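
The sitemap filter could look like the sketch below, assuming the sitemap body has already been downloaded. `prioritized_sitemap_urls` is a hypothetical name, and the match is a plain substring test on the URL path, so a path such as /mediation would also match /media.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

PRIORITY_SEGMENTS = ("/news", "/press", "/media", "/announcements")


def prioritized_sitemap_urls(sitemap_xml: bytes) -> list[str]:
    """Return <loc> URLs whose path contains one of the priority segments."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        element.text.strip()
        for element in root.iter()
        # Sitemap elements are namespaced, so compare local names only.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]
    return [
        url for url in urls
        if any(segment in urlparse(url).path for segment in PRIORITY_SEGMENTS)
    ]


sample_sitemap = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/news/2026/launch</loc></url>
  <url><loc>https://example.com/press</loc></url>
</urlset>"""

print(prioritized_sitemap_urls(sample_sitemap))
```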

3. Robots.txt Hints

  • Read the Sitemap: entries
  • Treat Allow: paths such as /news and /press as candidates
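
A small parser for these two hints might look like this. It is a sketch that assumes the robots.txt text is already in hand; `robots_hints` and `HINT_PREFIXES` are hypothetical names, and user-agent grouping is deliberately ignored because only hints are being collected.

```python
HINT_PREFIXES = ("/news", "/press")


def robots_hints(robots_txt: str) -> tuple[list[str], list[str]]:
    """Collect Sitemap: URLs and Allow: paths that start with a hint prefix."""
    sitemaps: list[str] = []
    allowed: list[str] = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        # Split on the first colon only, so "https://..." values stay intact.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)
        elif field == "allow" and value.startswith(HINT_PREFIXES):
            allowed.append(value)
    return sitemaps, allowed


sample_robots = """User-agent: *
Disallow: /admin
Allow: /news/
Allow: /static/
Sitemap: https://example.com/sitemap.xml
"""

print(robots_hints(sample_robots))
```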

4. HTML Pattern Exploration (Final Automated Step)

  • Fetch https://{domain}/ and collect the links inside <nav> and <footer>
  • Filter the links by anchor text (News, Press, Media, etc.)
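
Link collection and anchor-text filtering can be sketched with the standard-library HTML parser. This is an illustrative sketch only: `NavLinkParser`, `candidate_links`, and the keyword tuple are hypothetical, and the homepage HTML is assumed to be fetched elsewhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("news", "press", "media")


class NavLinkParser(HTMLParser):
    """Collect (absolute_url, anchor_text) for links inside <nav>/<footer>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.depth = 0      # nesting depth inside <nav> or <footer>
        self.href = None    # href of the <a> currently open, if any
        self.text = []      # text fragments of the current anchor
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("nav", "footer"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("nav", "footer"):
            self.depth = max(0, self.depth - 1)
        elif tag == "a" and self.href is not None:
            label = " ".join("".join(self.text).split())
            self.links.append((urljoin(self.base_url, self.href), label))
            self.href = None


def candidate_links(html: str, base_url: str) -> list[str]:
    """Return nav/footer link URLs whose anchor text contains a keyword."""
    parser = NavLinkParser(base_url)
    parser.feed(html)
    return [
        url for url, label in parser.links
        if any(keyword in label.lower() for keyword in KEYWORDS)
    ]


sample_html = """<html><body>
<nav><a href="/newsroom">News</a><a href="/about">About</a></nav>
<main><a href="/press-kit">Press kit</a></main>
<footer><a href="https://example.com/media">Media Centre</a></footer>
</body></html>"""

print(candidate_links(sample_html, "https://example.com/"))
```

Links outside `<nav>` and `<footer>` are ignored on purpose, which keeps this stage from drifting into whole-page crawling.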

5. Search Engine Assistance (Semi-Automated)

  • Query: site:{domain} (news OR press OR announcement)
  • Use the results only to suggest seed candidates; do not register them automatically ❌
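
One way to keep this step semi-automated is to build the query in code but mark every result as a suggestion awaiting review. The sketch below assumes a hypothetical `status` field and a `"search"` method label; neither appears in the original, and the call to a search engine is left out.

```python
def build_site_query(domain: str, terms=("news", "press", "announcement")) -> str:
    """Build the site-restricted query string for one domain."""
    return f"site:{domain} ({' OR '.join(terms)})"


def to_suggestions(result_urls: list[str]) -> list[dict]:
    """Wrap raw result URLs as suggestions; nothing is registered here."""
    return [
        {
            "source_url": url,
            "discovery_method": "search",   # hypothetical label
            "status": "pending_review",     # hypothetical field
        }
        for url in result_urls
    ]


print(build_site_query("mom.gov.sg"))
```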

Example discovery result:

{
  "source_domain": "mom.gov.sg",
  "source_url": "https://www.mom.gov.sg/rss/news",
  "fetch_type": "rss",
  "discovery_method": "well_known",
  "discovered_at": "2026-01-21T02:00:00Z",
  "confidence": 0.95
}
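
A record with these fields could be modelled as a small dataclass. This is a sketch under assumptions: `SeedRecord` is a hypothetical name, and the checks that source_url is non-empty and that confidence lies between 0 and 1 are assumed constraints rather than rules stated for this format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SeedRecord:
    source_domain: str
    source_url: str
    fetch_type: str
    discovery_method: str
    discovered_at: str
    confidence: float

    def __post_init__(self) -> None:
        # Assumed constraints: a seed needs a URL and a 0..1 confidence.
        if not self.source_url:
            raise ValueError("source_url is required")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


record = SeedRecord(
    source_domain="mom.gov.sg",
    source_url="https://www.mom.gov.sg/rss/news",
    fetch_type="rss",
    discovery_method="well_known",
    discovered_at="2026-01-21T02:00:00Z",
    confidence=0.95,
)
print(record.to_json())
```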

Practices to avoid:
  • Crawling entire domains
  • Registering raw Google results automatically
  • Making HTML parsing the default fetch method
  • Creating seeds without a source_url