Scrape URLs

POST /scrape

cURL

curl --request POST \
  --url https://api.olyptik.io/scrape \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ],
  "includeLinks": true,
  "excludeNonMainTags": true,
  "timeout": 60,
  "engineType": "auto",
  "useStaticIps": false,
  "deduplicateContent": true,
  "extraction": "Extract pricing information"
}
'

{
  "id": "67890abc123def456789",
  "teamId": "team_123",
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2"
  ],
  "results": [
    {
      "url": "https://example.com/page1",
      "isSuccess": true,
      "title": "Page Title",
      "markdown": "# Page Title\n\nPage content here...",
      "links": [
        "<string>"
      ],
      "duplicatesRemovedCount": 0,
      "errorCode": null,
      "errorMessage": null
    }
  ],
  "timeout": 60,
  "origin": "api",
  "projectId": "project_123",
  "createdAt": "2025-01-15T10:30:00Z",
  "updatedAt": "2025-01-15T10:31:00Z"
}

The scrape endpoint scrapes up to 30 URLs in a single request. Use it when you need content from specific pages and do not need to crawl a whole site.
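If you have more than 30 URLs, one possible approach is to split them into batches on the client and send one request per batch. The sketch below only does the splitting. The helper name `batch_urls` and the constant `MAX_URLS_PER_REQUEST` are illustrative and not part of the API.

```python
from typing import Iterator, List

MAX_URLS_PER_REQUEST = 30  # per-request limit stated in this reference


def batch_urls(urls: List[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# 65 example URLs are split into batches of 30, 30 and 5.
pages = [f"https://example.com/page{i}" for i in range(1, 66)]
batches = list(batch_urls(pages))
print([len(batch) for batch in batches])  # [30, 30, 5]
```

Each batch can then be used as the `urls` value of a separate scrape request.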

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
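As a rough illustration of the header format, the following sketch prepares an authenticated POST request with Python's standard library without sending it. The endpoint URL comes from the cURL example. The function name `build_scrape_request` is hypothetical, and `<token>` stands in for a real auth token.

```python
import json
import urllib.request

SCRAPE_URL = "https://api.olyptik.io/scrape"  # endpoint from the cURL example


def build_scrape_request(token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request with Bearer authentication."""
    if not token:
        raise ValueError("an auth token is required")
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scrape_request("<token>", {"urls": ["https://example.com/page1"]})
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Sending the request is left to the caller, for example with `urllib.request.urlopen(request)`.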

Body

application/json

Scrape request payload

urls

string<uri>[]

required

Array of URLs to scrape (max 30)

Required array length: 1 - 30 elements

Example:

[
  "https://example.com/page1",
  "https://example.com/page2"
]

includeLinks

boolean

default:true

Whether to include links in the markdown output

excludeNonMainTags

boolean

default:true

Whether to exclude non-main tags from the markdown

timeout

integer

default:60

Timeout in seconds for the scrape operation

Required range: x >= 1

engineType

enum<string>

default:auto

The engine to use for scraping

Available options: auto, cheerio, playwright

useStaticIps

boolean

default:false

Whether to use static IPs for scraping

deduplicateContent

boolean

default:true

Whether to remove duplicate content

extraction

string

default:""

AI instructions for extracting specific content

Example:

"Extract pricing information"
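The body parameters above can be assembled and checked on the client before a request is sent. This is a minimal sketch, assuming you want to mirror the documented defaults and constraints locally. The function `build_scrape_payload` and its snake_case argument names are illustrative, and only the JSON keys it returns come from this reference.

```python
from typing import List

ENGINE_TYPES = {"auto", "cheerio", "playwright"}  # documented enum values


def build_scrape_payload(
    urls: List[str],
    include_links: bool = True,
    exclude_non_main_tags: bool = True,
    timeout: int = 60,
    engine_type: str = "auto",
    use_static_ips: bool = False,
    deduplicate_content: bool = True,
    extraction: str = "",
) -> dict:
    """Return a request body after checking the documented constraints."""
    if not 1 <= len(urls) <= 30:
        raise ValueError("urls must contain between 1 and 30 entries")
    if timeout < 1:
        raise ValueError("timeout must be at least 1 second")
    if engine_type not in ENGINE_TYPES:
        raise ValueError(f"engineType must be one of {sorted(ENGINE_TYPES)}")
    return {
        "urls": list(urls),
        "includeLinks": include_links,
        "excludeNonMainTags": exclude_non_main_tags,
        "timeout": timeout,
        "engineType": engine_type,
        "useStaticIps": use_static_ips,
        "deduplicateContent": deduplicate_content,
        "extraction": extraction,
    }


payload = build_scrape_payload(
    ["https://example.com/page1", "https://example.com/page2"],
    extraction="Extract pricing information",
)
print(payload["engineType"], payload["timeout"])  # auto 60
```

Local validation of this kind only catches obvious mistakes early. The API remains the authority on what it accepts.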

Response

Scrape response with results for all URLs

id

string

Unique identifier for the scrape operation

Example:

"67890abc123def456789"

teamId

string

ID of the team that initiated the scrape

Example:

"team_123"

urls

string<uri>[]

Array of URLs that were scraped

Example:

[
  "https://example.com/page1",
  "https://example.com/page2"
]

results

object[]

Results for each URL

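The example response shows these fields on each result: url, isSuccess, title, markdown, links, duplicatesRemovedCount, errorCode, errorMessage.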

timeout

integer

Timeout used for the scrape operation in seconds

Example:

60

origin

string

Origin of the scrape request

Example:

"api"

projectId

string

Project ID associated with the scrape

Example:

"project_123"

createdAt

string<date-time>

Timestamp when the scrape was created

Example:

"2025-01-15T10:30:00Z"

updatedAt

string<date-time>

Timestamp when the scrape was last updated

Example:

"2025-01-15T10:31:00Z"
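One way to consume the response is to separate successful results from failed ones and to parse the timestamps. The sketch below works on a local dictionary shaped like the example response, so it makes no network call. The second result, including its `errorCode` value `"TIMEOUT"`, is invented for illustration, since this reference does not list the actual error codes. The helper names are also hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def split_results(response: dict) -> Tuple[List[dict], List[dict]]:
    """Partition results into (succeeded, failed) using the isSuccess flag."""
    succeeded, failed = [], []
    for result in response.get("results", []):
        (succeeded if result.get("isSuccess") else failed).append(result)
    return succeeded, failed


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-01-15T10:30:00Z."""
    # Older Python versions reject a trailing "Z", so rewrite it as an offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


response = {
    "id": "67890abc123def456789",
    "results": [
        {"url": "https://example.com/page1", "isSuccess": True,
         "title": "Page Title", "markdown": "# Page Title\n\nPage content here...",
         "errorCode": None, "errorMessage": None},
        # Hypothetical failed entry; the error values are made up.
        {"url": "https://example.com/page2", "isSuccess": False,
         "errorCode": "TIMEOUT", "errorMessage": "Page did not load in time"},
    ],
    "createdAt": "2025-01-15T10:30:00Z",
    "updatedAt": "2025-01-15T10:31:00Z",
}

succeeded, failed = split_results(response)
elapsed = parse_timestamp(response["updatedAt"]) - parse_timestamp(response["createdAt"])
print(len(succeeded), len(failed), elapsed.total_seconds())  # 1 1 60.0
```

Checking `isSuccess` per result matters because a single response covers every requested URL, and the example response carries `errorCode` and `errorMessage` on each result.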
