POST /crawls

cURL
curl --request POST \
  --url https://api.olyptik.io/crawls \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "startUrl": "https://example.com",
  "maxResults": 5000,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info about the product",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 60
}'
{
  "startUrl": "https://example.com",
  "maxResults": 55,
  "maxDepth": 10,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": true,
  "deduplicateContent": true,
  "extraction": "Extract only pricing info",
  "engineType": "auto",
  "useStaticIps": false,
  "timeout": 1800,
  "id": "6870e36787c81925622df818",
  "createdAt": "2023-11-07T05:31:56Z",
  "status": "timed_out",
  "completedAt": "2023-11-07T05:31:56Z",
  "durationInSeconds": 1800,
  "brandId": "<string>",
  "startUrls": [
    "https://example.com"
  ],
  "totalPages": 100,
  "origin": "web"
}
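
The same request can be sent from TypeScript. Below is a minimal sketch using the built-in fetch API (Node 18+); the token and payload values are placeholders taken from the cURL example above:

// Minimal sketch: start a crawl (Node 18+ built-in fetch).
// Replace <token> with your API key.
async function startCrawl() {
  const res = await fetch('https://api.olyptik.io/crawls', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer <token>',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      startUrl: 'https://example.com',
      maxResults: 5000,
      maxDepth: 10,
      engineType: 'auto',
      timeout: 60,
    }),
  });
  if (!res.ok) {
    // 401: invalid or missing API key; 400: invalid payload; 500: server error
    throw new Error(`Crawl request failed with status ${res.status}`);
  }
  return res.json(); // resolves to the Crawl object shown above
}
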
Creates a new crawl job using API key authentication.

Request

Headers

  • Authorization: API key for authentication

Body

{
  "startUrls": ["https://example.com"],
  "engine": "playwright",
  "maxDepth": 5,
  "maxResults": 50,
  "useSitemap": false,
  "entireWebsite": false,
  "excludeNonMainTags": true,
  "includeLinks": false
}

Schema

import { IsBoolean, IsEnum, IsNotEmpty, IsNumber, IsOptional, IsString, IsUrl, Max, Min } from 'class-validator';

export class StartCrawlPayload {
    // Array of URLs to start crawling from; each must be a valid URL
    @IsString({ each: true })
    @IsNotEmpty()
    @IsUrl({}, { each: true })
    startUrls: string[];

    // Crawling engine to use ('playwright' or 'cheerio')
    @IsOptional()
    @IsEnum(EngineType)
    engine?: EngineType;

    // Maximum depth of pages to crawl (1-50)
    @IsNumber()
    @IsNotEmpty()
    @Max(50)
    @Min(1)
    maxDepth: number;

    // Maximum number of results to collect (1-999)
    @IsNumber()
    @IsNotEmpty()
    @Max(999)
    @Min(1)
    maxResults: number;

    @IsBoolean()
    @IsOptional()
    includeExternalLinks?: boolean;

    // Technology stack to target (e.g. react)
    @IsOptional()
    @IsEnum(StackType)
    stack?: StackType;

    // Path and tag filters for the crawl
    includeOnlyPaths: string[];
    excludePaths: string[];
    includeOnlyTags: string[];
    excludeTags: string[];
    origin: Origin;

    maxConcurrentPages: number;
}

export enum EngineType {
    PLAYWRIGHT = 'playwright',
    CHEERIO = 'cheerio'
}
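
As an illustration of how this DTO might be exercised, here is a hedged sketch that validates a raw request body with class-validator and class-transformer (assuming both packages are available):

import { validate } from 'class-validator';
import { plainToInstance } from 'class-transformer';

// Sketch: turn a raw JSON body into a StartCrawlPayload and validate it.
async function parsePayload(body: object): Promise<StartCrawlPayload> {
  const payload = plainToInstance(StartCrawlPayload, body);
  const errors = await validate(payload);
  if (errors.length > 0) {
    // Corresponds to the 400 Bad Request error listed below
    throw new Error(`Invalid payload: ${errors.map(e => e.property).join(', ')}`);
  }
  return payload;
}
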
| Field | Type | Required | Description | Validation |
| --- | --- | --- | --- | --- |
| startUrls | string[] | Yes | Array of URLs to start crawling from | Must be valid URLs |
| engine | string | No | Crawling engine to use | Must be either 'playwright' or 'cheerio' |
| maxDepth | number | Yes | Maximum depth of pages to crawl | Between 1-50 |
| maxResults | number | Yes | Maximum number of results to collect | Between 1-999 |
| entireWebsite | boolean | No | Whether to crawl the entire website | Default: false |
| excludeNonMainTags | boolean | No | Whether to exclude non-main tags from the crawl results' markdown | Default: true |
| stack | string | No | Technology stack to target (e.g. react) | Must be a valid StackType |
| includeOnlyPaths | string[] | No | Array of paths to include in the crawl | - |
| excludePaths | string[] | No | Array of paths to exclude from the crawl | - |
| includeOnlyTags | string[] | No | Array of HTML tags to include | - |
| excludeTags | string[] | No | Array of HTML tags to exclude | - |
| maxConcurrentPages | number | No | Maximum number of concurrent pages to crawl | - |

Response

{
  "id": "crawl-id",
  "status": "running",
  "startUrls": ["https://example.com"],
  "createdAt": "2024-03-20T12:00:00Z",
  "updatedAt": "2024-03-20T12:00:00Z"
}

Errors

  • 401 Unauthorized: Invalid or missing API key
  • 400 Bad Request: Invalid request payload
  • 500 Internal Server Error: Error starting the crawl

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

maxResults and maxDepth will be ignored if useSitemap or entireWebsite is true
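
For example, a sitemap-driven request can omit both limits entirely, since they would be ignored anyway:

{
  "startUrl": "https://example.com",
  "useSitemap": true
}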

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true - maxResults and maxDepth will be ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true - maxResults and maxDepth will be ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean
default:true

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that appeared on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info about the product"

engineType
enum<string>
default:auto

The engine to use for the crawl. Auto: auto detect the best engine (default). Cheerio: fast, great for static websites. Playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. The target website can then whitelist the IP used for the crawl. The static IP will be 154.17.150.0

Example:

false

timeout
integer

Timeout duration in seconds

Required range: x >= 60
Example:

60

Response

Crawl object

startUrl
string<uri>
required

URL to start crawling from

Example:

"https://example.com"

maxResults
integer

Maximum number of results to collect

Required range: 1 <= x <= 10000
maxDepth
integer
default:10

Maximum depth of pages to crawl

Required range: 1 <= x <= 100
Example:

10

useSitemap
boolean
default:false

Whether to use sitemap.xml to crawl the website. If true - maxResults and maxDepth will be ignored.

Example:

false

entireWebsite
boolean
default:false

Whether to crawl the entire website. If true - maxResults and maxDepth will be ignored.

Example:

false

excludeNonMainTags
boolean
default:true

Whether to exclude non-main tags from the crawl results' markdown

Example:

true

includeLinks
boolean
default:true

Whether to include links in the crawl results' markdown

Example:

true

deduplicateContent
boolean
default:true

Whether to remove duplicate text fragments that appeared on other pages.

Example:

true

extraction
string
default:""

Instructions defining how the AI should extract specific content from the crawl results

Example:

"Extract only pricing info"

engineType
enum<string>
default:auto

The engine to use for the crawl. Auto: auto detect the best engine (default). Cheerio: fast, great for static websites. Playwright: great for dynamic websites that use JavaScript frameworks.

Available options:
auto,
cheerio,
playwright
Example:

"auto"

useStaticIps
boolean
default:false

Whether to use static IPs for the crawl. This target website can whitelist the IPs to use for the crawl. The static IP will be 154.17.150.0

Example:

false

timeout
integer
default:1800

Timeout duration in seconds

Required range: x >= 60
Example:

1800

id
string

Unique identifier of the crawl

Example:

"6870e36787c81925622df818"

createdAt
string<date-time>

Timestamp when the crawl was created

status
enum<string>

Current status of the crawl

Available options:
running,
succeeded,
failed,
aborted,
timed_out,
error
Example:

"timed_out"

completedAt
string<date-time>

Timestamp when the crawl was completed

durationInSeconds
integer

Duration of the crawl in seconds

Required range: x >= 0
Example:

1800

brandId
string

ID of the brand associated with the crawl

startUrls
string<uri>[]

Array of URLs to start crawling from

Example:
["https://example.com"]
totalPages
integer

Count of pages extracted

Required range: x >= 0
Example:

100

origin
enum<string>

Origin of the crawl request

Available options:
api,
web
Example:

"web"
