POST /crawls
cURL
curl --request POST \
  --url https://api.olyptik.io/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "startUrl": "https://example.com",
  "maxResults": 5000,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info about the product",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 60
}
'
Example response

{
  "startUrl": "https://example.com",
  "maxResults": 55,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 1800,
  "id": "6870e36787c81925622df818",
  "createdAt": "2023-11-07T05:31:56Z",
  "status": "timed_out",
  "completedAt": "2023-11-07T05:31:56Z",
  "durationInSeconds": 1800,
  "brandId": "<string>",
  "startUrls": [
    "https://example.com"
  ],
  "totalPages": 100,
  "origin": "web"
}
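
The cURL call above can also be issued from a script. The sketch below is a minimal TypeScript example using the global fetch API (Node 18+, ESM); the endpoint, headers, and body fields come from this page, while the error handling is an illustrative assumption.

TypeScript

// Minimal sketch: start a crawl via POST /crawls.
const token = "<token>"; // your auth token, per the Authorizations section

const response = await fetch("https://api.olyptik.io/crawls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    startUrl: "https://example.com",
    maxResults: 5000,
    maxDepth: 10,
    extraction: "Extract only pricing info about the product",
    engineType: "auto",
    timeout: 60,
  }),
});

if (!response.ok) {
  throw new Error(`Crawl request failed: ${response.status}`);
}

const crawl = await response.json(); // Crawl object, as described under Response
console.log(crawl.id, crawl.status);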

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

maxResults and maxDepth will be ignored if useSitemap or entireWebsite is true
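
For example, a sitemap-driven crawl can omit both limits entirely. A minimal sketch of such a payload (values illustrative):

TypeScript

// Sketch: with useSitemap set to true, maxResults and maxDepth are
// ignored by the API, so the payload can leave them out.
const sitemapCrawlBody = {
  startUrl: "https://example.com",
  useSitemap: true,
  timeout: 60,
};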

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true, maxResults and maxDepth are ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true, maxResults and maxDepth are ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that also appear on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info about the product"

engineType
enum<string>
default:auto

The engine to use for the crawl. auto: automatically detect the best engine (default). cheerio: fast, great for static websites. playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IPs used for the crawl. The static IP will be 154.17.150.0.

Example:

false

timeout
integer

Timeout duration in seconds

Required range: x >= 60
Example:

60

Response

Crawl object

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
Example:

10

useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true, maxResults and maxDepth are ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true, maxResults and maxDepth are ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that also appear on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info"

engineType
enum<string>
default:auto

The engine to use for the crawl. auto: automatically detect the best engine (default). cheerio: fast, great for static websites. playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IPs used for the crawl. The static IP will be 154.17.150.0.

Example:

false

timeout
integer
default:1800

Timeout duration in seconds

Required range: x >= 60
Example:

1800

id
string

Identification number of the crawl

Example:

"6870e36787c81925622df818"

createdAt
string<date-time>

Timestamp when the crawl was created

status
enum<string>

Current status of the crawl

Available options:
running,
succeeded,
failed,
aborted,
timed_out,
error
Example:

"timed_out"

completedAt
string<date-time>

Timestamp when the crawl was completed

durationInSeconds
integer

Duration of the crawl in seconds

Required range: x >= 0
Example:

1800

brandId
string

ID of the brand associated with the crawl

startUrls
string<uri>[]

Array of URLs to start crawling from

Example:

["https://example.com"]

totalPages
integer

Count of pages extracted

Required range: x >= 0
Example:

100

origin
enum<string>

Origin of the crawl request

Available options:
api,
web
Example:

"web"