n8n Firecrawl Node: Web Scraping, Crawling, and AI Extraction Guide

What is Firecrawl?

Firecrawl is a next-generation web scraping engine that handles JavaScript rendering, anti-bot bypass, and structured data extraction out of the box. The n8n Firecrawl node (n8n-nodes-firecrawl-v2) brings all 10 Firecrawl v2 API operations into n8n, working with both Firecrawl Cloud and self-hosted instances.

This guide walks through every operation, parameter, and deployment consideration. It is written for automation engineers and integrators who want to build production scraping workflows on n8n.

Cloud vs Self-Hosted

|          | Firecrawl Cloud | Self-Hosted |
| -------- | --------------- | ----------- |
| Setup    | Sign up at firecrawl.dev, get API key | Deploy via Docker on your own server |
| Base URL | https://api.firecrawl.dev/v2 | http://your-server:3002/v2 |
| Best for | Quick tests, low volume | Production, sensitive data, unlimited requests |
| Cost     | Usage-based pricing | Infrastructure cost only |

At THE NEXOVA, we run Firecrawl self-hosted alongside n8n on the same server. This gives us zero-latency API calls and full control over data residency. Our competitive intelligence workflows process hundreds of pages daily through this setup.

Installation

Install the Node

Settings > Community Nodes > Install > n8n-nodes-firecrawl-v2

Configure Credentials

Create a new credential of type Firecrawl API:

| Field    | Default | Description |
| -------- | ------- | ----------- |
| Base URL | https://api.firecrawl.dev/v2 | Change this for self-hosted instances. Must include /v2. |
| API Key  |         | Your Firecrawl API key |

Authentication uses Authorization: Bearer {apiKey}. On save, n8n tests the connection by scraping https://example.com.
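The header scheme above can be sketched in a few lines. This is a minimal illustration, not the node's internal code; the API key value is a placeholder:

```python
BASE_URL = "https://api.firecrawl.dev/v2"  # swap for a self-hosted URL, keeping /v2

def auth_headers(api_key: str) -> dict:
    """Build the headers sent with every Firecrawl API request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Any HTTP client can then attach these headers to requests against the base URL.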

Operations Reference

1. Scrape

The most commonly used operation. Scrape extracts content from a single URL with full JavaScript rendering support.

Endpoint: POST /scrape

| Parameter | Type   | Default | Description |
| --------- | ------ | ------- | ----------- |
| url       | String |         | Target URL (required) |

Scrape Options (all optional):

| Parameter         | Default  | Description |
| ----------------- | -------- | ----------- |
| formats           | markdown | Output formats: markdown, html, rawHtml, links, screenshot, json, summary, images, audio, changeTracking |
| onlyMainContent   | true     | Strip headers, navigation, and footers |
| includeTags       |          | CSS selectors to keep (e.g., article, .content) |
| excludeTags       |          | CSS selectors to remove (e.g., nav, .sidebar) |
| waitFor           | 0        | Wait for JS rendering (ms). Increase for SPA/React pages. |
| timeout           | 30000    | Request timeout (ms), max 300,000 |
| mobile            | false    | Emulate mobile device viewport |
| blockAds          | true     | Block ads and cookie consent popups |
| proxy             | auto     | Proxy mode: auto, basic, enhanced |
| locationCountry   |          | ISO country code (e.g., VN, US) |
| locationLanguages |          | Locale codes (e.g., vi-VN, en-US) |

Sample output:

{
  "markdown": "# Page Title\n\nMain content extracted...",
  "metadata": {
    "title": "Page Title",
    "description": "Meta description",
    "sourceURL": "https://example.com",
    "statusCode": 200
  }
}
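The options above translate into a JSON request body. The sketch below assembles one, assuming the node forwards its fields one-to-one to the v2 API (the parameter names match the table; the validation is illustrative):

```python
def build_scrape_payload(url, formats=("markdown",), only_main_content=True,
                         wait_for=0, timeout=30000):
    """Assemble the JSON body for POST /scrape from the options above."""
    if not 0 < timeout <= 300_000:
        raise ValueError("timeout must be between 1 and 300,000 ms")
    return {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
        "waitFor": wait_for,   # raise this for SPA/React pages
        "timeout": timeout,
    }
```

For a JavaScript-heavy page you might call `build_scrape_payload("https://example.com", wait_for=3000)` to give the renderer time before extraction.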

2. Crawl

Crawl processes an entire website by following links from a starting URL. This is an asynchronous job that can take minutes to hours depending on site size.

Endpoint: POST /crawl

| Parameter         | Default | Description |
| ----------------- | ------- | ----------- |
| crawlUrl          |         | Starting URL (required) |
| waitForCompletion | false   | Hold execution until crawl finishes |
| maxPollTime       | 300     | Max wait time in seconds |

Crawl Options:

| Parameter          | Default  | Description |
| ------------------ | -------- | ----------- |
| limit              | 100      | Maximum pages to crawl |
| maxDiscoveryDepth  | 2        | Maximum link depth |
| includePaths       |          | Regex patterns to include (e.g., /blog/*, /docs/*) |
| excludePaths       |          | Regex patterns to exclude (e.g., /admin/*, /login) |
| sitemap            | include  | Sitemap handling: include, skip, or only |
| crawlEntireDomain  | false    | Follow sibling and parent links across the domain |
| allowExternalLinks | false    | Follow links to external domains |
| allowSubdomains    | false    | Crawl subdomains |
| delay              | 0        | Seconds between requests (forces concurrency to 1) |
| formats            | markdown | Output format per page |
| onlyMainContent    | true     | Strip boilerplate from each page |

When waitForCompletion is off, the output contains only the job ID. Use the Get Crawl Status operation to retrieve results later. The internal polling interval is 2 seconds.
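The wait-for-completion behavior amounts to a deadline-bounded polling loop. A generic sketch, where `get_status` stands in for whatever callable fetches the crawl status (the status values shown are assumptions about the response shape):

```python
import time

def poll_crawl(get_status, job_id, max_poll_time=300, interval=2):
    """Poll a crawl job until it completes or the time budget runs out.

    `get_status` is any callable returning a dict such as
    {"status": "scraping"} or {"status": "completed", "data": [...]}.
    """
    deadline = time.monotonic() + max_poll_time
    while time.monotonic() < deadline:
        result = get_status(job_id)
        if result.get("status") == "completed":
            return result
        time.sleep(interval)
    raise TimeoutError(f"Crawl {job_id} did not finish within {max_poll_time}s")
```

This mirrors the node's defaults: a 2-second interval and a maxPollTime ceiling, after which the workflow must fall back to Get Crawl Status.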

3. Get Crawl Status / 4. Cancel Crawl

| Operation        | Endpoint           | Parameter |
| ---------------- | ------------------ | --------- |
| Get Crawl Status | GET /crawl/{id}    | crawlId (job ID from Crawl) |
| Cancel Crawl     | DELETE /crawl/{id} | cancelCrawlId (job ID) |

5. Map

Map discovers all URLs on a website without scraping their content. It is significantly faster than Crawl and works well as a first step before targeted scraping.

Endpoint: POST /map

| Parameter             | Default | Description |
| --------------------- | ------- | ----------- |
| mapUrl                |         | Starting URL (required) |
| search                |         | Search query to rank results by relevance |
| includeSubdomains     | true    | Include subdomain URLs |
| limit                 | 5000    | Max URLs to return (max: 100,000) |
| ignoreQueryParameters | true    | Deduplicate URLs by stripping query strings |
| ignoreCache           | false   | Bypass sitemap cache |
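A typical Map-then-scrape workflow filters the discovered URLs before handing them to a scrape step. A small sketch of that glue logic; the glob patterns and cap are illustrative, not API parameters:

```python
import fnmatch

def filter_mapped_urls(links, patterns=("*/blog/*",), limit=100):
    """Keep only Map results whose URL matches any glob pattern,
    capped at `limit` entries."""
    matched = [url for url in links
               if any(fnmatch.fnmatch(url, pat) for pat in patterns)]
    return matched[:limit]
```

In n8n this would live in a Code node between the Map and Batch Scrape steps.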

6. Search

Search performs a web search and optionally scrapes each result page. This combines search discovery and content extraction in a single step.

Endpoint: POST /search

| Parameter       | Default  | Description |
| --------------- | -------- | ----------- |
| searchQuery     |          | Search keywords, max 500 chars (required) |
| limit           | 5        | Number of results (1-100) |
| country         | US       | ISO country code for geo-targeting |
| tbs             | Any Time | Time filter: past hour, day, week, month, or year |
| formats         | markdown | Content format for scraped results |
| onlyMainContent | true     | Strip boilerplate |

7. Extract

Extract is the most powerful operation. It uses AI to pull structured data from any web page using natural language prompts. You describe what you want, optionally provide a JSON Schema, and Firecrawl returns clean structured data.

Endpoint: POST /extract

| Parameter                | Default | Description |
| ------------------------ | ------- | ----------- |
| extractUrls              |         | Comma-separated URLs (supports glob patterns like https://example.com/*) |
| extractPrompt            |         | Natural language instruction for what to extract |
| extractSchema            |         | Optional JSON Schema to enforce output structure |
| extractWaitForCompletion | true    | Wait for results (defaults ON, unlike Crawl/Batch) |
| extractMaxPollTime       | 300     | Max wait time in seconds |

Extract Options:

| Parameter       | Default | Description |
| --------------- | ------- | ----------- |
| enableWebSearch | false   | Supplement extraction with web search |
| showSources     | false   | Include source URLs in the output |

Example prompt:

Extract company name, address, phone number, email, and industry from this page.

Example schema:

{
  "type": "object",
  "properties": {
    "company_name": { "type": "string" },
    "address": { "type": "string" },
    "phone": { "type": "string" },
    "email": { "type": "string" },
    "industry": { "type": "string" }
  }
}
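Putting the pieces together, the node's comma-separated URL field, prompt, and optional schema combine into one request body. A sketch, assuming the field names map directly onto the v2 Extract API:

```python
def build_extract_payload(urls, prompt, schema=None, enable_web_search=False):
    """Assemble the JSON body for POST /extract from the node's fields.

    `urls` is the comma-separated string the node accepts; whitespace
    around commas is trimmed, matching the node's behavior.
    """
    payload = {
        "urls": [u.strip() for u in urls.split(",") if u.strip()],
        "prompt": prompt,
    }
    if schema is not None:
        payload["schema"] = schema
    if enable_web_search:
        payload["enableWebSearch"] = True
    return payload
```

With the example above, `build_extract_payload("https://example.com/*", "Extract company name, address, phone number, email, and industry from this page.", schema)` would produce a complete request body.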

8. Get Extract Status

Endpoint: GET /extract/{extractId}

9. Batch Scrape

Batch Scrape processes multiple URLs asynchronously. Feed it a list of URLs from a Map operation or an external source, and it scrapes them all in parallel.

Endpoint: POST /batch/scrape

| Parameter              | Default | Description |
| ---------------------- | ------- | ----------- |
| batchUrls              |         | Comma-separated list of URLs |
| batchWaitForCompletion | false   | Wait for all URLs to finish |
| batchMaxPollTime       | 300     | Max wait time in seconds |

Batch Options include formats, onlyMainContent, and maxConcurrency.
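When feeding Batch Scrape from a Map operation, the URL list must be joined into the comma-separated batchUrls string, and very large lists are easier to manage in chunks. A sketch of that preparation step; the 50-URL chunk size is an illustrative choice, not an API limit:

```python
def to_batch_inputs(urls, batch_size=50):
    """Join a URL list into comma-separated strings for the batchUrls
    field, split into chunks of at most `batch_size` URLs."""
    return [", ".join(urls[i:i + batch_size])
            for i in range(0, len(urls), batch_size)]
```

Each string in the result can then drive one Batch Scrape execution.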

10. Get Batch Scrape Status

Endpoint: GET /batch/scrape/{batchScrapeId}

Workflow Examples

Competitive Intelligence Pipeline

Schedule Trigger (every Monday, 8 AM)
  → Firecrawl: Map (https://competitor.com)
  → Firecrawl: Batch Scrape (URLs from Map, formats: markdown)
  → Code Node (diff against last week's data)
  → Google Sheets (log changes)
  → Slack (notify team of updates)
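The Code Node diff step in the pipeline above can be sketched like this, assuming each weekly snapshot is stored as a dict of URL to markdown content (the data shape is hypothetical):

```python
def diff_snapshots(last_week, this_week):
    """Compare two {url: markdown} snapshots and summarize changes."""
    added = sorted(set(this_week) - set(last_week))
    removed = sorted(set(last_week) - set(this_week))
    changed = sorted(url for url in set(last_week) & set(this_week)
                     if last_week[url] != this_week[url])
    return {"added": added, "removed": removed, "changed": changed}
```

The summary dict then feeds the Google Sheets and Slack steps downstream.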

We use this exact pattern at THE NEXOVA for our competitive intelligence system that monitors 6 competitor websites weekly.

AI-Powered Lead Extraction

Manual Trigger
  → Firecrawl: Extract
      URLs: https://directory.example.com/category/*
      Prompt: "Extract company name, phone, address, and email"
      Schema: { type: object, properties: { name, phone, address, email } }
  → Google Sheets: Append extracted data

Content Change Monitoring

Schedule Trigger (daily, 7 AM)
  → Firecrawl: Scrape (formats: changeTracking)
      URL: https://competitor.com/pricing
  → IF (changes detected)
    → Email: Alert the team

Technical Notes

  • Async operations (Crawl, Extract, Batch Scrape) return a job ID by default. Enable waitForCompletion to get results directly. Internal polling interval is 2 seconds.
  • Extract defaults to waitForCompletion: true, while Crawl and Batch Scrape default to false. This is by design since Extract jobs typically complete faster.
  • Format availability varies by operation. Scrape supports 10 formats (including json, summary, audio). Crawl, Search, and Batch Scrape support 5 basic formats.
  • Comma-separated inputs apply to includeTags, excludeTags, includePaths, excludePaths, extractUrls, and batchUrls. Whitespace around commas is trimmed automatically.
  • Self-hosted Base URL must include /v2 (e.g., http://firecrawl:3002/v2). A common mistake is omitting the version prefix.
  • Error handling: The node supports continueOnFail. On error, the output is { "error": "message" } instead of stopping the workflow.
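With continueOnFail enabled, downstream nodes receive a mix of successful outputs and `{ "error": "message" }` items, so a common pattern is to split them before further processing. A minimal sketch of that routing logic:

```python
def split_error_items(items):
    """Separate successful scrape outputs from continueOnFail error
    items, which carry an "error" key as described above."""
    ok = [item for item in items if "error" not in item]
    failed = [item for item in items if "error" in item]
    return ok, failed
```

In n8n the same split is usually done with an IF node checking for the error field.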

Premium and Custom Solutions

The community node covers all 10 Firecrawl v2 operations. For organizations that need deeper capabilities, THE NEXOVA offers:

  • Firecrawl self-hosted deployment on your own infrastructure (full data sovereignty)
  • Custom scraping workflows tailored to your specific data sources
  • Integration with your existing CRM, ERP, or BI systems
  • Agent-based operations (in development): AI that navigates and extracts from complex multi-page flows
  • Technical support and long-term maintenance

If you need a production-grade n8n Firecrawl setup or custom web scraping infrastructure, get in touch with our team.
