POST /crawls
cURL
curl --request POST \
  --url https://api.olyptik.io/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "startUrl": "https://example.com",
  "maxResults": 5000,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info about the product",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 60
}
'
Example response

{
  "startUrl": "https://example.com",
  "maxResults": 55,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 1800,
  "id": "6870e36787c81925622df818",
  "createdAt": "2023-11-07T05:31:56Z",
  "status": "timed_out",
  "completedAt": "2023-11-07T05:31:56Z",
  "durationInSeconds": 1800,
  "brandId": "<string>",
  "startUrls": [
    "https://example.com"
  ],
  "totalPages": 100,
  "origin": "web"
}
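
The cURL call above can also be issued from a script. The sketch below is a minimal TypeScript example using the global fetch API (Node 18+, ESM); the endpoint, headers, and body fields come from this page, while the error handling is an illustrative assumption.

TypeScript

// Minimal sketch: start a crawl via POST /crawls.
const token = "<token>"; // your auth token, per the Authorizations section

const response = await fetch("https://api.olyptik.io/crawls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    startUrl: "https://example.com",
    maxResults: 5000,
    maxDepth: 10,
    extraction: "Extract only pricing info about the product",
    engineType: "auto",
    timeout: 60,
  }),
});

if (!response.ok) {
  throw new Error(`Crawl request failed: ${response.status}`);
}

const crawl = await response.json(); // Crawl object, as described under Response
console.log(crawl.id, crawl.status);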

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

maxResults and maxDepth will be ignored if useSitemap or entireWebsite is true
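
For example, a sitemap-driven crawl can omit both limits entirely. A minimal sketch of such a payload (values illustrative):

TypeScript

// Sketch: with useSitemap set to true, maxResults and maxDepth are
// ignored by the API, so the payload can leave them out.
const sitemapCrawlBody = {
  startUrl: "https://example.com",
  useSitemap: true,
  timeout: 60,
};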

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true, maxResults and maxDepth are ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true, maxResults and maxDepth are ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that also appear on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info about the product"

engineType
enum<string>
default:auto

The engine to use for the crawl. auto: automatically detect the best engine (default). cheerio: fast, great for static websites. playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IPs used for the crawl. The static IP will be 154.17.150.0.

Example:

false

timeout
integer

Timeout duration in seconds

Required range: x >= 60
Example:

60

Response

Crawl object

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
Example:

10

useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true, maxResults and maxDepth are ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true, maxResults and maxDepth are ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that also appear on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info"

engineType
enum<string>
default:auto

The engine to use for the crawl. auto: automatically detect the best engine (default). cheerio: fast, great for static websites. playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IPs used for the crawl. The static IP will be 154.17.150.0.

Example:

false

timeout
integer
default:1800

Timeout duration in seconds

Required range: x >= 60
Example:

1800

id
string

Identification number of the crawl

Example:

"6870e36787c81925622df818"

createdAt
string<date-time>

Timestamp when the crawl was created

status
enum<string>

Current status of the crawl

Available options:
running,
succeeded,
failed,
aborted,
timed_out,
error
Example:

"timed_out"

completedAt
string<date-time>

Timestamp when the crawl was completed

durationInSeconds
integer

Duration of the crawl in seconds

Required range: x >= 0
Example:

1800

brandId
string

ID of the brand associated with the crawl

startUrls
string<uri>[]

Array of URLs to start crawling from

Example:

["https://example.com"]

totalPages
integer

Count of pages extracted

Required range: x >= 0
Example:

100

origin
enum<string>

Origin of the crawl request

Available options:
api,
web
Example:

"web"