JSON Structure Design

  • Research Dataset: Contains only discovered facts (Evidence). Not a Seed.
  • Seed: A contract defining where, how, under what conditions, and how often to collect.
  • meta (Research Dataset): country, category, date, generated_at, generator
  • stats (Research Dataset): total_domains, alive_domains, avg_response_time_ms
  • domains (Research Dataset): domain_id, source_domain, discovered, content_nature_inference, policy_hints, semantic_profile, relationships
  • domain_id: Stable ID (e.g., gov:sg:mom.gov.sg). Used as GraphDB/VectorDB node key.
  • content_nature_inference: Explicitly marked as an inference result, with the fields nature, confidence, method, and evidence.
  • policy_hints: robots_allow_crawling, sitemap_present, suspected_license, etc.
  • relationships: Edge hints for GraphDB (GOVERNED_BY, affiliation, etc.).
  • required (Seed): meta, source, fetch, policy, lifecycle
  • meta (Seed): seed_id, version, country, category, created_at, status
  • source: domain_id, source_domain, source_urls (Map: news, faq, guide…), language, trust_tier
  • fetch: type (rss | html | api), plus options specific to the chosen type
  • policy: crawl_allowed (must be human-approved), robots_url, sitemap_url, license
  • lifecycle: source_dataset, approved_by, approved_at
  • domain_id: {authority}:{country}:{registrable_domain} (e.g., gov:sg:mom.gov.sg)
  • seed_id: {domain_id}::{content_type} (e.g., gov:sg:mom.gov.sg::news)
  • version: An integer that reflects contract changes only. Increment it by 1 whenever a URL, selector, or schedule changes.
  • Research: Facts + Inference. Seed: Decisions + Contracts.
  • crawl_allowed requires human approval.
  • Seeds must not be embedded. Perform embedding at the article stage.
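
The ID formats and Seed rules listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (make_domain_id, make_seed_id, validate_seed, next_version) are hypothetical, the sample values (category, dates, approver, URL, the selector option) are placeholders, and treating a filled-in approved_by and approved_at as proof of human approval is an assumption about how the lifecycle section is used.

```python
import copy

# Required top-level Seed sections and allowed fetch types, as listed above.
REQUIRED_SEED_SECTIONS = ("meta", "source", "fetch", "policy", "lifecycle")
FETCH_TYPES = {"rss", "html", "api"}


def make_domain_id(authority: str, country: str, registrable_domain: str) -> str:
    """Build {authority}:{country}:{registrable_domain}."""
    return f"{authority}:{country}:{registrable_domain}"


def make_seed_id(domain_id: str, content_type: str) -> str:
    """Build {domain_id}::{content_type}."""
    return f"{domain_id}::{content_type}"


def validate_seed(seed: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the Seed passes."""
    missing = [key for key in REQUIRED_SEED_SECTIONS if key not in seed]
    if missing:
        return [f"missing sections: {', '.join(missing)}"]
    errors = []
    if seed["fetch"].get("type") not in FETCH_TYPES:
        errors.append("fetch.type must be rss, html, or api")
    # Assumption: approval is recorded in lifecycle.approved_by / approved_at.
    lifecycle = seed["lifecycle"]
    approved = bool(lifecycle.get("approved_by")) and bool(lifecycle.get("approved_at"))
    if seed["policy"].get("crawl_allowed") and not approved:
        errors.append("crawl_allowed is set without human approval")
    return errors


def next_version(old_seed: dict, new_seed: dict) -> int:
    """Bump the version by 1 only when the contract changed.

    Assumption: URLs live in source.source_urls, while selectors and
    schedules live among the fetch options.
    """
    changed = (
        old_seed["source"]["source_urls"] != new_seed["source"]["source_urls"]
        or old_seed["fetch"] != new_seed["fetch"]
    )
    version = old_seed["meta"]["version"]
    return version + 1 if changed else version


# Placeholder Seed; every value except the two IDs is illustrative.
domain_id = make_domain_id("gov", "sg", "mom.gov.sg")
seed = {
    "meta": {
        "seed_id": make_seed_id(domain_id, "news"),
        "version": 1,
        "country": "sg",
        "category": "labor",
        "created_at": "2025-01-01T00:00:00Z",
        "status": "approved",
    },
    "source": {
        "domain_id": domain_id,
        "source_domain": "mom.gov.sg",
        "source_urls": {"news": "https://example.org/news"},
        "language": "en",
        "trust_tier": 1,
    },
    "fetch": {"type": "html"},
    "policy": {"crawl_allowed": True, "robots_url": None, "sitemap_url": None, "license": None},
    "lifecycle": {
        "source_dataset": "research-sg-labor",
        "approved_by": "reviewer@example.org",
        "approved_at": "2025-01-02T00:00:00Z",
    },
}

revised = copy.deepcopy(seed)
revised["fetch"]["selector"] = "article h2"  # hypothetical fetch option

print(seed["meta"]["seed_id"])      # gov:sg:mom.gov.sg::news
print(validate_seed(seed))          # []
print(next_version(seed, revised))  # 2
```

Keeping the approval check inside validation means a Seed with crawl_allowed set but an empty lifecycle fails before any crawler reads it, which matches the rule that crawl_allowed requires human approval.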