# Crawl Website

Start an asynchronous full-site crawl. Returns a job ID that you poll for results. Crawls follow links within the domain, respecting depth and page limits.

**Endpoint:** `POST /v1/web/crawl`

**Credits:** 1 per page crawled (1 charged upfront, remainder on completion)
## Request Body

| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | Starting URL for the crawl |
| `maxPages` | number | No | Maximum pages to crawl (default: 100, max: 10,000) |
| `maxDepth` | number | No | Maximum link depth (default: 3, max: 10) |
| `urlRegex` | string | No | Regex pattern to filter which URLs to crawl |
| `includeLinks` | boolean | No | Include outgoing links per page (default: false) |
| `includeImages` | boolean | No | Include images per page (default: false) |
| `webhookUrl` | string | No | URL to receive a POST when the crawl completes |
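To see how a `urlRegex` value narrows a crawl, the sketch below filters a list of discovered URLs with a pattern like `/blog/.*`. Whether the API matches against the full URL or only the path is not specified here, so this illustration matches anywhere in the URL (`re.search`); the example URLs are hypothetical.

```python
import re

# Hypothetical illustration of urlRegex filtering: only URLs that
# match the pattern are queued for crawling.
pattern = re.compile(r"/blog/.*")

discovered = [
    "https://example.com/blog/launch-post",
    "https://example.com/pricing",
    "https://example.com/blog/changelog/v2",
]

# re.search matches the pattern anywhere in the string.
to_crawl = [url for url in discovered if pattern.search(url)]
```

Here `to_crawl` keeps only the two `/blog/` URLs, dropping `/pricing`.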
## Request Example

```json
{
  "url": "https://stripe.com",
  "maxPages": 50,
  "maxDepth": 2,
  "includeLinks": true,
  "webhookUrl": "https://yourapp.com/webhooks/crawl"
}
```

## Response Schema
### Start Crawl (202 Accepted)

```json
{
  "success": true,
  "jobId": "c3d4e5f6-a7b8-9012-cdef-123456789012",
  "status": "queued",
  "url": "https://stripe.com",
  "maxPages": 50,
  "credits_used": 1,
  "request_id": "d4e5f6a7-b8c9-0123-def0-234567890123"
}
```

### Poll Status (`GET /v1/web/crawl/{jobId}`)
```json
{
  "success": true,
  "data": {
    "job_id": "c3d4e5f6-a7b8-9012-cdef-123456789012",
    "status": "completed",
    "pages_found": 48,
    "pages_completed": 48,
    "pages": [
      {
        "url": "https://stripe.com",
        "title": "Stripe | Payment Processing Platform",
        "status_code": 200,
        "content_type": "text/html",
        "word_count": 1250,
        "links": ["https://stripe.com/pricing", "https://stripe.com/docs"]
      }
    ],
    "started_at": "2024-12-15T10:30:00Z",
    "completed_at": "2024-12-15T10:32:45Z"
  },
  "request_id": "e5f6a7b8-c9d0-1234-ef01-345678901234"
}
```

## Code Examples
### cURL

```bash
# Start the crawl
curl -X POST "https://api.orsa.dev/v1/web/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://stripe.com",
    "maxPages": 50,
    "maxDepth": 2
  }'

# Poll for results
curl -X GET "https://api.orsa.dev/v1/web/crawl/c3d4e5f6-a7b8-9012-cdef-123456789012" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

### TypeScript
```typescript
// Start the crawl
const job = await client.web.crawl({
  url: 'https://stripe.com',
  maxPages: 50,
  maxDepth: 2,
  includeLinks: true,
});

console.log(job.jobId); // "c3d4e5f6-..."
console.log(job.status); // "queued"

// Poll for results
const result = await client.web.getCrawlStatus(job.jobId);
if (result.status === 'completed') {
  console.log(result.pages.length); // 48
  console.log(result.pages[0].title); // "Stripe | ..."
}
```

### Python
```python
# Start the crawl
job = client.web.crawl(
    url="https://stripe.com",
    max_pages=50,
    max_depth=2,
    include_links=True,
)

print(job.job_id)  # "c3d4e5f6-..."
print(job.status)  # "queued"

# Poll for results
result = client.web.get_crawl_status(job.job_id)
if result.status == "completed":
    print(len(result.pages))  # 48
    print(result.pages[0].title)  # "Stripe | ..."
```

## Error Codes
| Code | Status | Description |
|---|---|---|
| INPUT_VALIDATION_ERROR | 400 | Invalid URL or parameters |
| UNAUTHORIZED | 401 | Missing or invalid API key |
| USAGE_EXCEEDED | 402 | Insufficient credits |
| RATE_LIMITED | 429 | Rate limit exceeded |
| INTERNAL_ERROR | 500 | Failed to create crawl job |
## Crawl Status Values

| Status | Description |
|---|---|
| queued | Job created, waiting to start |
| running | Actively crawling pages |
| completed | All pages crawled successfully |
| failed | Crawl encountered a fatal error |
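The status values above suggest a simple polling loop: keep fetching until the job reaches a terminal state (`completed` or `failed`). The sketch below uses only the Python standard library and injects the fetch function, so it is independent of any particular HTTP client; the 5-second interval and 5-minute timeout are arbitrary illustration choices, not documented limits.

```python
import time


def poll_crawl(fetch_status, job_id, interval=5.0, timeout=300.0):
    """Poll until the crawl job reaches a terminal state.

    fetch_status(job_id) should return the parsed JSON body of
    GET /v1/web/crawl/{jobId} as a dict (shape as shown in the
    Response Schema above).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = fetch_status(job_id)
        status = body["data"]["status"]
        if status in ("completed", "failed"):  # terminal states per the table
            return body
        time.sleep(interval)  # still "queued" or "running"
    raise TimeoutError(f"crawl job {job_id} did not finish within {timeout}s")
```

With `requests` installed, `fetch_status` could be as simple as a function that GETs `https://api.orsa.dev/v1/web/crawl/{job_id}` with the `Authorization: Bearer` header and returns `response.json()`.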
## Notes

- Credits are charged as pages are crawled: 1 credit upfront, the rest on completion.
- Use `urlRegex` to limit crawling to specific sections (e.g., `/blog/.*` for blog posts only).
- The `webhookUrl` receives a POST with the full crawl result when the job completes.
- Poll `GET /v1/web/crawl/{jobId}` to check status. Completed jobs include all page data.
- Crawl jobs expire after 24 hours. Download results before then.
- The crawl respects `robots.txt` by default.
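On the receiving side of `webhookUrl`, your endpoint gets a POST when the job completes. The exact webhook body is not specified in this section, so the sketch below assumes it mirrors the poll-status response shown above; treat the field names as assumptions and adjust to the actual payload.

```python
import json


def handle_crawl_webhook(raw_body: bytes) -> dict:
    """Parse a crawl-completion webhook body into a small summary.

    Assumes the payload mirrors the poll-status response above
    (a top-level "data" object with job_id, status, and pages).
    """
    payload = json.loads(raw_body)
    data = payload["data"]
    return {
        "job_id": data["job_id"],
        "status": data["status"],
        "page_count": len(data.get("pages", [])),
    }
```

A real receiver would wrap this in your web framework's route handler (or `http.server`), return 200 quickly, and hand the parsed result to a background worker.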