Source Domain to Source URL Discovery Strategy
Conclusion
There is no single method. “Stepwise exploration + confidence-based prioritization” is the correct approach.
Design it as a multi-stage pipeline with fallback.
Discovery Pipeline (Recommended Order)
source_domain
  ↓
[1] Well-known Feeds
  ↓ (if none)
[2] Sitemap analysis
  ↓ (if none)
[3] Robots.txt hints
  ↓ (if none)
[4] HTML pattern exploration
  ↓ (if none)
[5] Search-engine-assisted exploration

→ The first stage that succeeds provides the source_url.
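The fallback order can be sketched as a loop over stage functions. This is a minimal sketch, not a definitive implementation: the names `Stage` and `discover_source_url` are hypothetical, and each stage is assumed to be a function that takes a domain and returns a URL or `None`.

```python
from typing import Callable, Optional

# Assumption: every stage takes a domain and returns a candidate URL,
# or None when it finds nothing.
Stage = Callable[[str], Optional[str]]


def discover_source_url(
    domain: str, stages: list[tuple[str, Stage]]
) -> Optional[tuple[str, str]]:
    """Run stages in order and return (method_name, url) from the first hit."""
    for name, stage in stages:
        try:
            url = stage(domain)
        except Exception:
            # A failing stage should not abort the pipeline; fall through.
            url = None
        if url:
            return name, url
    return None


# Usage with stub stages (placeholders for the real checks):
stages = [
    ("well_known", lambda d: None),
    ("sitemap", lambda d: f"https://{d}/news"),
]
result = discover_source_url("example.org", stages)
```

Returning the method name along with the URL makes it straightforward to record how each seed was discovered.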
1. Well-known Feed Path (Highest Priority, Resolves 80%)
Standard Path
https://{domain}/rss
https://{domain}/rss.xml
https://{domain}/feed
https://{domain}/feeds/news.xml
https://{domain}/news/rss
https://{domain}/press/rss

Government/Public Institution Specialized
/rss/news
/newsroom/rss
/press-releases/rss
/media-centre/rss

Judgment Criteria
- HTTP 200
- Content-Type: application/xml or rss+xml
- <item> or <entry> exists
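The candidate paths and the three judgment criteria could be expressed roughly as follows. This sketch assumes the caller performs the HTTP request and passes in the status, Content-Type header, and body; `candidate_feed_urls` and `looks_like_feed` are hypothetical names. The check mirrors the listed criteria literally, so feeds served with other content types would be rejected.

```python
import re

# Standard paths taken from the list above.
WELL_KNOWN_PATHS = [
    "/rss", "/rss.xml", "/feed", "/feeds/news.xml", "/news/rss", "/press/rss",
]

# Matches an opening <item> or <entry> tag, with or without attributes.
FEED_ENTRY_RE = re.compile(r"<(?:item|entry)[\s>/]", re.IGNORECASE)


def candidate_feed_urls(domain: str) -> list[str]:
    """Build the well-known feed URLs to probe, in priority order."""
    return [f"https://{domain}{path}" for path in WELL_KNOWN_PATHS]


def looks_like_feed(status: int, content_type: str, body: str) -> bool:
    """Apply the three criteria: HTTP 200, XML/RSS content type, item or entry."""
    if status != 200:
        return False
    ctype = content_type.lower()
    if "application/xml" not in ctype and "rss+xml" not in ctype:
        return False
    return FEED_ENTRY_RE.search(body) is not None
```

Keeping the check separate from the fetch makes it easy to test against saved responses.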
2. Sitemap.xml Analysis
https://{domain}/sitemap.xml
- URLs in the sitemap containing /news, /press, /media, or /announcements are prioritized
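One way to apply this prioritization, assuming the sitemap XML has already been downloaded as a string, is sketched below. `prioritized_sitemap_urls` is a hypothetical helper. It matches the keywords against the URL path only, so a host name such as `news.example.org` does not count as a hit by itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

PRIORITY_SEGMENTS = ("/news", "/press", "/media", "/announcements")


def prioritized_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return all <loc> URLs, with news-like paths first (original order kept)."""
    root = ET.fromstring(sitemap_xml)
    # Strip the XML namespace so <loc> matches whatever xmlns is declared.
    urls = [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]

    def is_priority(url: str) -> bool:
        path = urlparse(url).path
        return any(segment in path for segment in PRIORITY_SEGMENTS)

    return [u for u in urls if is_priority(u)] + [u for u in urls if not is_priority(u)]
```

A sitemap index file also uses `<loc>` elements, so the same function returns the child sitemap URLs when given an index; following them is left to the caller.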
3. Robots.txt Hints
- Sitemap: entries
- Allow: paths such as /news and /press are considered
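Extracting these two kinds of hints from an already-fetched robots.txt body might look like the sketch below. `robots_hints` and its `keywords` parameter are hypothetical, and the parsing is deliberately simple: it ignores User-agent grouping.

```python
def robots_hints(
    robots_txt: str, keywords: tuple[str, ...] = ("/news", "/press")
) -> dict[str, list[str]]:
    """Collect Sitemap: URLs and Allow: paths that start with a news-like prefix."""
    sitemaps: list[str] = []
    allow_paths: list[str] = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)
        elif field == "allow" and any(value.startswith(k) for k in keywords):
            allow_paths.append(value)
    return {"sitemaps": sitemaps, "allow_paths": allow_paths}
```

Any sitemap URLs found here can be handed to the sitemap analysis stage, and the Allow paths serve only as candidate hints.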
4. HTML Pattern Exploration (Final Automated Step)
- Collect links from the <nav> and <footer> of https://{domain}/
- Filter by anchor text (News, Press, Media, etc.)
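A rough sketch of this step using only the standard library's `html.parser` follows. `NavLinkCollector`, `news_like_links`, and the keyword tuple are hypothetical, and the homepage HTML is assumed to be fetched already. Real pages with malformed markup may need a more tolerant parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ANCHOR_KEYWORDS = ("news", "press", "media")


class NavLinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) for links inside <nav> or <footer>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.depth = 0      # > 0 while inside <nav> or <footer>
        self.href = None    # href of the <a> currently being read
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("nav", "footer"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("nav", "footer"):
            self.depth = max(0, self.depth - 1)
        elif tag == "a" and self.href is not None:
            label = " ".join("".join(self.text).split())  # normalize whitespace
            self.links.append((urljoin(self.base_url, self.href), label))
            self.href = None


def news_like_links(html: str, base_url: str) -> list[tuple[str, str]]:
    """Keep only links whose anchor text mentions a news-like keyword."""
    parser = NavLinkCollector(base_url)
    parser.feed(html)
    return [
        (url, text)
        for url, text in parser.links
        if any(keyword in text.lower() for keyword in ANCHOR_KEYWORDS)
    ]
```

Restricting collection to `<nav>` and `<footer>` keeps this step from turning into a crawl of the page body.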
5. Search Engine Assistance (Semi-Automated)
site:{domain} (news OR press OR announcement)
- Use only to suggest seed candidates; no automatic registration ❌
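The query format and the "suggest only" rule could be captured in two small helpers. Both function names and the `needs_review` field are hypothetical. The search call itself is omitted, since the source does not name a search API.

```python
def build_site_query(
    domain: str, terms: tuple[str, ...] = ("news", "press", "announcement")
) -> str:
    """Build the site-restricted query string described above."""
    return f"site:{domain} ({' OR '.join(terms)})"


def as_review_candidates(urls: list[str]) -> list[dict]:
    """Wrap search hits as candidates that a person must approve before use."""
    return [
        {"url": url, "discovery_method": "search", "needs_review": True}
        for url in urls
    ]


query = build_site_query("mom.gov.sg")
```

Marking every search hit as needing review keeps raw search results out of the seed store until they are confirmed.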
Seed storage result example
{
  "source_domain": "mom.gov.sg",
  "source_url": "https://www.mom.gov.sg/rss/news",
  "fetch_type": "rss",
  "discovery_method": "well_known",
  "discovered_at": "2026-01-21T02:00:00Z",
  "confidence": 0.95
}

Methods to absolutely avoid ❌
- Crawling entire domains
- Automatic registration of raw Google results
- Setting HTML parsing as default
- Creating seeds without source_url
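The last rule can be enforced when a seed is constructed. The sketch below reuses the field names from the seed storage example. The `Seed` class itself and the 0 to 1 bound on `confidence` are assumptions, not part of the original design.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Seed:
    source_domain: str
    source_url: str
    fetch_type: str
    discovery_method: str
    discovered_at: str  # ISO 8601 timestamp, as in the storage example
    confidence: float

    def __post_init__(self) -> None:
        # Refuse to create a seed that has no source_url.
        if not self.source_url or not self.source_url.strip():
            raise ValueError("a seed requires a non-empty source_url")
        # Assumption: confidence is a score between 0 and 1.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        return json.dumps(asdict(self))


seed = Seed(
    source_domain="mom.gov.sg",
    source_url="https://www.mom.gov.sg/rss/news",
    fetch_type="rss",
    discovery_method="well_known",
    discovered_at="2026-01-21T02:00:00Z",
    confidence=0.95,
)
```

Failing at construction time means an incomplete seed never reaches storage.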