POST /crawls

cURL
curl --request POST \
  --url https://api.olyptik.io/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "startUrl": "https://example.com",
  "maxResults": 5000,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info about the product",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 60
}'
{
  "startUrl": "https://example.com",
  "maxResults": 55,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 1800,
  "id": "6870e36787c81925622df818",
  "createdAt": "2023-11-07T05:31:56Z",
  "status": "timed_out",
  "completedAt": "2023-11-07T05:31:56Z",
  "durationInSeconds": 1800,
  "brandId": "<string>",
  "startUrls": [
    "https://example.com"
  ],
  "totalPages": 100,
  "origin": "web"
}
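
The same request can be sent from TypeScript. Below is a minimal sketch using the built-in fetch API (Node 18+); the token and payload values are placeholders taken from the cURL example above:

// Minimal sketch: start a crawl (Node 18+ built-in fetch).
// Replace <token> with your API key.
async function startCrawl() {
  const res = await fetch('https://api.olyptik.io/crawls', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer <token>',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      startUrl: 'https://example.com',
      maxResults: 5000,
      maxDepth: 10,
      engineType: 'auto',
      timeout: 60,
    }),
  });
  if (!res.ok) {
    // 401: invalid or missing API key; 400: invalid payload; 500: server error
    throw new Error(`Crawl request failed with status ${res.status}`);
  }
  return res.json(); // resolves to the Crawl object shown above
}
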
Creates a new crawl job using API key authentication.

Request

Headers

  • Authorization: API key for authentication

Body

{
  "startUrls": ["https://example.com"],
  "engine": "playwright",
  "maxDepth": 5,
  "maxResults": 50,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": false
}

Schema

import { IsBoolean, IsEnum, IsNotEmpty, IsNumber, IsOptional, IsString, IsUrl, Max, Min } from 'class-validator';

export class StartCrawlPayload {
    // Array of URLs to start crawling from; each must be a valid URL
    @IsString({ each: true })
    @IsNotEmpty()
    @IsUrl({}, { each: true })
    startUrls: string[];

    // Crawling engine to use ('playwright' or 'cheerio')
    @IsOptional()
    @IsEnum(EngineType)
    engine?: EngineType;

    // Maximum depth of pages to crawl (1-50)
    @IsNumber()
    @IsNotEmpty()
    @Max(50)
    @Min(1)
    maxDepth: number;

    // Maximum number of results to collect (1-999)
    @IsNumber()
    @IsNotEmpty()
    @Max(999)
    @Min(1)
    maxResults: number;

    @IsBoolean()
    @IsOptional()
    includeExternalLinks?: boolean;

    // Technology stack to target (e.g. react)
    @IsOptional()
    @IsEnum(StackType)
    stack?: StackType;

    // Path and tag filters for the crawl
    includeOnlyPaths: string[];
    excludePaths: string[];
    includeOnlyTags: string[];
    excludeTags: string[];
    origin: Origin;

    maxConcurrentPages: number;
}

export enum EngineType {
    PLAYWRIGHT = 'playwright',
    CHEERIO = 'cheerio'
}
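
As an illustration of how this DTO might be exercised, here is a hedged sketch that validates a raw request body with class-validator and class-transformer (assuming both packages are available):

import { validate } from 'class-validator';
import { plainToInstance } from 'class-transformer';

// Sketch: turn a raw JSON body into a StartCrawlPayload and validate it.
async function parsePayload(body: object): Promise<StartCrawlPayload> {
  const payload = plainToInstance(StartCrawlPayload, body);
  const errors = await validate(payload);
  if (errors.length > 0) {
    // Corresponds to the 400 Bad Request error listed below
    throw new Error(`Invalid payload: ${errors.map(e => e.property).join(', ')}`);
  }
  return payload;
}
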
| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| startUrls | string[] | Yes | Array of URLs to start crawling from | Must be valid URLs |
| engine | string | No | Crawling engine to use | Must be either 'playwright' or 'cheerio' |
| maxDepth | number | Yes | Maximum depth of pages to crawl | Between 1-50 |
| maxResults | number | Yes | Maximum number of results to collect | Between 1-999 |
| entireWebsite | boolean | No | Whether to crawl the entire website | Default: false |
| excludeNonMainTags | boolean | No | Whether to exclude non-main tags from the crawl results' markdown | Default: true |
| stack | string | No | Technology stack to target (e.g. react) | Must be a valid StackType |
| includeOnlyPaths | string[] | No | Array of paths to include in the crawl | - |
| excludePaths | string[] | No | Array of paths to exclude from the crawl | - |
| includeOnlyTags | string[] | No | Array of HTML tags to include | - |
| excludeTags | string[] | No | Array of HTML tags to exclude | - |
| maxConcurrentPages | number | No | Maximum number of concurrent pages to crawl | - |

Response

{
  "id": "crawl-id",
  "status": "running",
  "startUrls": ["https://example.com"],
  "createdAt": "2024-03-20T12:00:00Z",
  "updatedAt": "2024-03-20T12:00:00Z"
}

Errors

  • 401 Unauthorized: Invalid or missing API key
  • 400 Bad Request: Invalid request payload
  • 500 Internal Server Error: Error starting the crawl

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

maxResults and maxDepth will be ignored if useSitemap or entireWebsite is true
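
For example, a sitemap-driven request can omit both limits entirely, since they would be ignored anyway:

{
  "startUrl": "https://example.com",
  "useSitemap": true
}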

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true - maxResults and maxDepth will be ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true - maxResults and maxDepth will be ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean
default:true

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that appeared on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info about the product"

engineType
enum<string>
default:auto

The engine to use for the crawl. Auto: auto detect the best engine (default). Cheerio: fast, great for static websites. Playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IP used for the crawl. The static IP will be 154.17.150.0

Example:

false

timeout
integer

Timeout duration in seconds

Required range: x >= 60
Example:

60

Response

Crawl object

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
Example:

10

useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true - maxResults and maxDepth will be ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true - maxResults and maxDepth will be ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean
default:true

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that appeared on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info"

engineType
enum<string>
default:auto

The engine to use for the crawl. Auto: auto detect the best engine (default). Cheerio: fast, great for static websites. Playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. This target website can whitelist the IPs to use for the crawl. The static IP will be 154.17.150.0

Example:

false

timeout
integer
default:1800

Timeout duration in seconds

Required range: x >= 60
Example:

1800

id
string

Unique identifier of the crawl

Example:

"6870e36787c81925622df818"

createdAt
string<date-time>

Timestamp when the crawl was created

status
enum<string>

Current status of the crawl

Available options:
running,
succeeded,
failed,
aborted,
timed_out,
error
Example:

"timed_out"

completedAt
string<date-time>

Timestamp when the crawl was completed

durationInSeconds
integer

Duration of the crawl in seconds

Required range: x >= 0
Example:

1800

brandId
string

ID of the brand associated with the crawl

startUrls
string<uri>[]

Array of URLs to start crawling from

Example:
["https://example.com"]
totalPages
integer

Count of pages extracted

Required range: x >= 0
Example:

100

origin
enum<string>

Origin of the crawl request

Available options:
api,
web
Example:

"web"
