Scrape Sitemap

Discover the URLs declared in a domain’s sitemap. The endpoint first reads /robots.txt for sitemap declarations, which may point to non-standard locations, then falls back to /sitemap.xml, /sitemap_index.xml, and /sitemap-index.xml. Sitemap index files are walked recursively. Up to 1,000 URLs are returned, together with a summary grouped by first path segment so you can quickly find what you need.
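The lookup order described above can be sketched roughly as follows. This is an illustration only, not the service's actual implementation: the function names sitemap_candidates and walk_sitemap, the injected fetch callable, and the choice of https as the scheme are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional
from urllib.parse import urljoin

# Fallback paths named in the description, in the order they are listed.
FALLBACK_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]


def sitemap_candidates(domain: str, robots_txt: str = "") -> list[str]:
    """Return sitemap URLs to try: robots.txt declarations first, then fallbacks."""
    base = f"https://{domain}"  # assumption: the sketch always uses https
    candidates: list[str] = []
    for line in robots_txt.splitlines():
        # robots.txt declares sitemaps as "Sitemap: <url>"; match the field case-insensitively.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    for path in FALLBACK_PATHS:
        url = urljoin(base, path)
        if url not in candidates:
            candidates.append(url)
    return candidates


def walk_sitemap(
    url: str,
    fetch: Callable[[str], str],  # hypothetical helper: returns the XML body for a URL
    limit: int = 1000,
    seen: Optional[set] = None,
) -> list[str]:
    """Collect page URLs from a sitemap, recursing into sitemap-index files."""
    seen = set() if seen is None else seen
    if url in seen or limit <= 0:
        return []
    seen.add(url)
    root = ET.fromstring(fetch(url))
    # Simplification: accept any element whose local name is "loc", whatever its namespace.
    locs = [el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc" and el.text]
    if root.tag.endswith("sitemapindex"):
        urls: list[str] = []
        for child in locs:  # each <loc> in an index points at another sitemap
            if len(urls) >= limit:
                break
            urls.extend(walk_sitemap(child, fetch, limit - len(urls), seen))
        return urls
    return locs[:limit]
```

Passing fetch in as a parameter keeps the sketch free of network code; the seen set guards against index files that reference each other.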

Endpoint: GET /v1/web/scrape/sitemap
Credits: 1 per request

Parameters

Parameter   Type     Required   Description
domain      string   Yes        Domain to parse sitemap for (e.g., stripe.com)

Response Schema

{
  "data": {
    "domain": "stripe.com",
    "sitemap": "https://stripe.com/sitemap.xml",
    "urls": [
      "https://stripe.com",
      "https://stripe.com/pricing",
      "https://stripe.com/payments",
      "https://stripe.com/billing"
    ],
    "count": 847,
    "groups": {
      "docs": { "count": 240, "samples": ["https://stripe.com/docs/api", "https://stripe.com/docs/payments"] },
      "blog": { "count": 156, "samples": ["https://stripe.com/blog/announcing-stripe-link"] },
      "pricing": { "count": 4, "samples": ["https://stripe.com/pricing"] }
    }
  },
  "_meta": { "timing": { "total_ms": 1240 }, "cache": { "hit": false } }
}
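For typed Python code, the data object could be described along these lines. The field types are inferred from the example response and the notes on this page rather than from a published schema, and the names Group, SitemapData, and largest_groups are hypothetical.

```python
from typing import Optional, TypedDict


class Group(TypedDict):
    count: int            # number of URLs in this bucket
    samples: list[str]    # a few example URLs from the bucket


class SitemapData(TypedDict):
    domain: str
    sitemap: Optional[str]      # None (JSON null) when no sitemap was found
    urls: list[str]
    count: int
    groups: dict[str, Group]    # keyed by first path segment


def largest_groups(data: SitemapData, n: int = 3) -> list[tuple[str, int]]:
    """Return the n biggest groups as (name, count) pairs, largest first."""
    ranked = sorted(data["groups"].items(),
                    key=lambda item: item[1]["count"], reverse=True)
    return [(name, group["count"]) for name, group in ranked[:n]]
```

Applied to the example response above, largest_groups(data) would return [("docs", 240), ("blog", 156), ("pricing", 4)].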

Code Examples

cURL

curl -X GET "https://api.orsa.dev/v1/web/scrape/sitemap?domain=stripe.com" \
  -H "Authorization: Bearer YOUR_API_KEY"

TypeScript

const { data } = await client.web.scrapeSitemap({
  domain: 'stripe.com',
});

console.log(data.count);                  // 847
console.log(data.sitemap);                // which sitemap URL we resolved
console.log(data.groups.docs.count);      // 240
console.log(data.urls.slice(0, 5));

Python

res = client.web.scrape_sitemap(domain="stripe.com")
data = res["data"]

print(data["count"])               # 847
print(data["sitemap"])             # resolved sitemap URL
print(data["groups"]["docs"]["count"])

Error Codes

Code                     Status   Description
INPUT_VALIDATION_ERROR   400      Invalid or missing domain
UNAUTHORIZED             401      Missing or invalid API key
RATE_LIMITED             429      Rate limit exceeded
INTERNAL_ERROR           500      Server error during parsing
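One way to act on these status codes without the SDK is sketched below, using only the Python standard library. The URL and header come from the cURL example; everything else is an assumption for illustration, including the function names, the decision to retry 429 and 500 but not 400 or 401, and the exponential backoff schedule. The error response body is not documented here, so the sketch looks only at the HTTP status.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://api.orsa.dev/v1/web/scrape/sitemap"  # taken from the cURL example
RETRYABLE = {429, 500}  # assumption: rate limits and server errors are worth retrying


def build_request(domain: str, api_key: str) -> urllib.request.Request:
    """Build the GET request with the domain query parameter and bearer token."""
    query = urllib.parse.urlencode({"domain": domain})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def scrape_sitemap(domain, api_key, max_retries=3,
                   send=urllib.request.urlopen, sleep=time.sleep):
    """Call the endpoint, retrying retryable statuses with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            with send(build_request(domain, api_key)) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            # 400 and 401 will not succeed on retry, so re-raise them immediately.
            if err.code not in RETRYABLE or attempt == max_retries:
                raise
            sleep(2 ** attempt)  # waits 1s, 2s, 4s, ...
```

The send and sleep parameters exist only so the function can be exercised without network access or real delays.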

Notes

  • sitemap is the sitemap URL that was actually resolved (e.g. one declared in robots.txt, or /sitemap.xml). It is null if no sitemap was found.
  • groups buckets URLs by their first non-empty path segment, with up to 5 sample URLs each. This is useful for quickly answering “where are the docs?” or “is there a blog?” without iterating over the full list.
  • Returns up to 1,000 URLs per request. URLs are returned as the source sitemaps declare them; duplicates across sub-sitemaps are not removed.
  • No credits are deducted if zero URLs were found.
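To make the grouping rule concrete, here is a minimal client-side sketch of bucketing by first non-empty path segment with a cap on samples. It mirrors the behaviour described in the notes but is not the service's code. The function name group_by_first_segment is hypothetical, and filing URLs with no path under the key "/" is an assumption, since the page does not say how the root URL is bucketed.

```python
from urllib.parse import urlparse


def group_by_first_segment(urls, max_samples=5):
    """Bucket URLs by first non-empty path segment, keeping a few samples each."""
    groups = {}
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        # Assumption: URLs without any path segment are filed under "/".
        key = segments[0] if segments else "/"
        bucket = groups.setdefault(key, {"count": 0, "samples": []})
        bucket["count"] += 1
        if len(bucket["samples"]) < max_samples:
            bucket["samples"].append(url)
    return groups
```

The same function can be used to regroup data["urls"] on a different rule, for example by changing which segment is used as the key.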