Crawl Website

Start an asynchronous full-site crawl. Returns a job ID that you poll for results. Crawls follow links within the domain, respecting depth and page limits.

Endpoint: POST /v1/web/crawl
Credits: 1 per page crawled (1 charged upfront, remainder on completion)
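The split billing described above can be sketched as a small helper; the function name and the handling of edge cases are illustrative, not part of the API:

```python
def crawl_credits(pages_crawled: int) -> tuple[int, int]:
    """Split the per-page credit charge into its two parts:
    1 credit charged upfront when the job is accepted, and the
    remainder charged on completion (1 credit per page crawled).
    """
    upfront = 1
    on_completion = max(pages_crawled - upfront, 0)
    return upfront, on_completion

# A 48-page crawl costs 48 credits total: 1 upfront, 47 on completion.
print(crawl_credits(48))  # (1, 47)
```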

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Starting URL for the crawl |
| maxPages | number | No | Maximum pages to crawl (default: 100, max: 10,000) |
| maxDepth | number | No | Maximum link depth (default: 3, max: 10) |
| urlRegex | string | No | Regex pattern to filter which URLs to crawl |
| includeLinks | boolean | No | Include outgoing links per page (default: false) |
| includeImages | boolean | No | Include images per page (default: false) |
| webhookUrl | string | No | URL to receive a POST when the crawl completes |
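To illustrate how a urlRegex pattern narrows a crawl, here is a client-side sketch of that kind of filtering. The exact matching semantics the crawler applies (substring search vs. full match) are an assumption; this sketch uses Python's re.search:

```python
import re

def filter_urls(urls: list[str], url_regex: str) -> list[str]:
    """Keep only URLs matching the pattern, mimicking what a
    urlRegex like "/blog/.*" would select. Search-style matching
    is assumed here; the API may match differently."""
    pattern = re.compile(url_regex)
    return [u for u in urls if pattern.search(u)]

discovered = [
    "https://stripe.com/blog/engineering",
    "https://stripe.com/pricing",
    "https://stripe.com/blog/product-news",
]
# Only the /blog/ URLs survive the filter.
print(filter_urls(discovered, r"/blog/.*"))
```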

Request Example

{
  "url": "https://stripe.com",
  "maxPages": 50,
  "maxDepth": 2,
  "includeLinks": true,
  "webhookUrl": "https://yourapp.com/webhooks/crawl"
}

Response Schema

Start Crawl (202 Accepted)

{
  "success": true,
  "jobId": "c3d4e5f6-a7b8-9012-cdef-123456789012",
  "status": "queued",
  "url": "https://stripe.com",
  "maxPages": 50,
  "credits_used": 1,
  "request_id": "d4e5f6a7-b8c9-0123-def0-234567890123"
}

Poll Status (GET /v1/web/crawl/{jobId})

{
  "success": true,
  "data": {
    "job_id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "status": "completed",
    "pages_found": 48,
    "pages_completed": 48,
    "pages": [
      {
        "url": "https://stripe.com",
        "title": "Stripe | Payment Processing Platform",
        "status_code": 200,
        "content_type": "text/html",
        "word_count": 1250,
        "links": ["https://stripe.com/pricing", "https://stripe.com/docs"]
      }
    ],
    "started_at": "2024-12-15T10:30:00Z",
    "completed_at": "2024-12-15T10:32:45Z"
  },
  "request_id": "e5f6a7b8-c9d0-1234-ef01-345678901234"
}

Code Examples

cURL

# Start the crawl
curl -X POST "https://api.orsa.dev/v1/web/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://stripe.com",
    "maxPages": 50,
    "maxDepth": 2
  }'
 
# Poll for results
curl -X GET "https://api.orsa.dev/v1/web/crawl/c3d4e5f6-a7b8-9012-cdef-123456789012" \
  -H "Authorization: Bearer YOUR_API_KEY"

TypeScript

// Start the crawl
const job = await client.web.crawl({
  url: 'https://stripe.com',
  maxPages: 50,
  maxDepth: 2,
  includeLinks: true,
});
 
console.log(job.jobId);   // "c3d4e5f6-..."
console.log(job.status);  // "queued"
 
// Poll for results
const result = await client.web.getCrawlStatus(job.jobId);
 
if (result.status === 'completed') {
  console.log(result.pages.length);       // 48
  console.log(result.pages[0].title);     // "Stripe | ..."
}

Python

# Start the crawl
job = client.web.crawl(
    url="https://stripe.com",
    max_pages=50,
    max_depth=2,
    include_links=True,
)
 
print(job.job_id)   # "c3d4e5f6-..."
print(job.status)   # "queued"
 
# Poll for results
result = client.web.get_crawl_status(job.job_id)
 
if result.status == "completed":
    print(len(result.pages))          # 48
    print(result.pages[0].title)      # "Stripe | ..."
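The snippets above make a single status check; in practice you poll until the job reaches a terminal state (completed or failed). Below is a minimal polling loop with exponential backoff, written against a stubbed status function standing in for client.web.get_crawl_status. The loop structure and field names follow the response schema above; the interval and timeout values are illustrative:

```python
import time

TERMINAL_STATES = {"completed", "failed"}

def poll_crawl(get_status, job_id, interval=2.0, max_wait=300.0):
    """Call get_status(job_id) until the job reaches a terminal
    state, doubling the wait between checks (capped at 30s)."""
    waited = 0.0
    while waited < max_wait:
        result = get_status(job_id)
        if result["status"] in TERMINAL_STATES:
            return result
        time.sleep(interval)
        waited += interval
        interval = min(interval * 2, 30.0)
    raise TimeoutError(f"crawl {job_id} not finished after {max_wait}s")

# Stub standing in for client.web.get_crawl_status:
# reports "running" twice, then "completed".
_responses = iter(["running", "running", "completed"])
def fake_status(job_id):
    return {"job_id": job_id, "status": next(_responses)}

result = poll_crawl(fake_status, "c3d4e5f6", interval=0.01)
print(result["status"])  # completed
```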

Error Codes

| Code | Status | Description |
| --- | --- | --- |
| INPUT_VALIDATION_ERROR | 400 | Invalid URL or parameters |
| UNAUTHORIZED | 401 | Missing or invalid API key |
| USAGE_EXCEEDED | 402 | Insufficient credits |
| RATE_LIMITED | 429 | Rate limit exceeded |
| INTERNAL_ERROR | 500 | Failed to create crawl job |

Crawl Status Values

| Status | Description |
| --- | --- |
| queued | Job created, waiting to start |
| running | Actively crawling pages |
| completed | All pages crawled successfully |
| failed | Crawl encountered a fatal error |

Notes

  • Credits are charged as pages are crawled: 1 credit upfront, the rest on completion.
  • Use urlRegex to limit crawling to specific sections (e.g., /blog/.* for blog posts only).
  • The webhookUrl receives a POST with the full crawl result when the job completes.
  • Poll GET /v1/web/crawl/{jobId} to check status. Completed jobs include all page data.
  • Crawl jobs expire after 24 hours. Download results before then.
  • The crawl respects robots.txt by default.
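If you set webhookUrl, your endpoint should accept a POST whose JSON body is the crawl result and respond with a 2xx status. A minimal receiver sketch using only Python's standard library follows; the payload fields shown are taken from the response schema above, while the self-contained local round trip is purely for demonstration:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlWebhookHandler(BaseHTTPRequestHandler):
    """Accepts the crawl-completion POST and acknowledges with 200."""

    received = []  # collected payloads, for demonstration

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        CrawlWebhookHandler.received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence request logging for the demo
        pass

# Serve on an ephemeral port and simulate the completion callback locally.
server = HTTPServer(("127.0.0.1", 0), CrawlWebhookHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

payload = {"job_id": "c3d4e5f6", "status": "completed", "pages_found": 48}
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/webhooks/crawl",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200

server.shutdown()
print(CrawlWebhookHandler.received[0]["job_id"])  # c3d4e5f6
```

In production you would verify the request's origin (for example, with a shared secret) before trusting the payload; this sketch omits that.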