ACTION_ID: scrape_web_page_using_firecrawl
NAME: Scrape Web Page Content
CATEGORY: scrape
CREDITS: 0.1

Scrape the content of a single web page, given its URL.

INDEX:
  1. Inputs
  2. Outputs
  3. How to configure
  4. Key notes
  5. Where it fits in a workflow
  6. When to use
  7. When not to use

================================================================================
1. INPUTS
================================================================================

url (type: url, required) — URL
  URL of the page to scrape.

extract_only_main_content (type: string, required) — Extract Only Main Content
  Whether to extract only the main body of the page (stripping
  navigation, footers, sidebars, etc). Allowed values: "true" /
  "false". Standard config: "true".

extract_only_text (type: string, required) — Extract Only Text
  Whether to return plain text only (no HTML markup, no markdown).
  Allowed values: "true" / "false". Standard config: "true".

include_html_content (type: string, required) — Include HTML Content
  Whether to include the raw HTML in the output. Allowed values:
  "true" / "false". Standard config: "false".

include_page_links (type: string, required) — Include Page Links
  Whether to return all links found on the page in the output.
  Allowed values: "true" / "false". Standard config: "false".

proxy_type (type: string, optional) — Proxy Type
  Which proxy strategy to use. Default: "auto". Allowed values:
    - "basic"   — Proxies for sites with none-to-basic anti-bot.
                  Fast and usually works.
    - "stealth" — Proxies for sites with advanced anti-bot
                  solutions. Slower but more reliable.
    - "auto"    — Automatically retries with stealth if basic
                  fails. Recommended default.

================================================================================
2. OUTPUTS
================================================================================

page_title (type: string) — Page Title
  Title of the scraped page.

page_url (type: url) — Page URL
  Resolved URL of the scraped page (after redirects).

content (type: string) — Content
  Firecrawl's page-scraping result — the extracted body text of
  the page (plain text when `extract_only_text` is `"true"`,
  otherwise markdown).

html (type: string) — HTML
  Raw HTML of the page as a single string. Only populated when
  `include_html_content` is `"true"`.

links (type: raw_array) — Links
  All links found on the page. Only populated when
  `include_page_links` is `"true"`.

metadata (type: raw_array) — Metadata
  Page metadata such as meta title, meta description,
  og:title, og:description, etc.

================================================================================
3. HOW TO CONFIGURE
================================================================================

Configure Action body:

{
  "inputs": {
    "url": "{{input.url}}",
    "extract_only_main_content": "true",
    "extract_only_text": "true",
    "include_html_content": "false",
    "include_page_links": "false",
    "proxy_type": "auto"
  }
}

Standard config (recommended starting point):
  - extract_only_main_content = "true"   (extract only main body
                                            content)
  - extract_only_text         = "true"   (plain text, no HTML or
                                            markdown)
  - include_html_content      = "false"  (skip raw HTML)
  - include_page_links        = "false"  (skip page links)
  - proxy_type                = "auto"   (retry with stealth if
                                            basic fails)

Override the flags only when you specifically need the raw HTML
(e.g. you're going to parse a structured table) or the page-link
list (e.g. crawling outbound links from the page).
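
As a minimal sketch of such an override, the helper below merges
per-call overrides into the standard config and rejects any flag that
is not one of the allowed string values (for example a boolean `true`
instead of `"true"`). The names `STANDARD_CONFIG`, `PROXY_TYPES`, and
`buildScrapeBody` are hypothetical and not part of the action; only the
input names and allowed values come from section 1.

```javascript
// Hypothetical helper: builds a Configure Action body from the standard
// config plus overrides. Input names and allowed values come from
// section 1; the helper itself is illustrative only.
const STANDARD_CONFIG = {
  extract_only_main_content: "true",
  extract_only_text: "true",
  include_html_content: "false",
  include_page_links: "false",
  proxy_type: "auto",
};

const PROXY_TYPES = ["basic", "stealth", "auto"];

function buildScrapeBody(url, overrides = {}) {
  const inputs = { url, ...STANDARD_CONFIG, ...overrides };
  for (const [key, value] of Object.entries(inputs)) {
    if (key === "url") continue;
    if (!(key in STANDARD_CONFIG)) {
      throw new Error(`Unknown input: ${key}`);
    }
    // Flags are strings, not booleans; proxy_type has its own value set.
    const allowed = key === "proxy_type" ? PROXY_TYPES : ["true", "false"];
    if (!allowed.includes(value)) {
      throw new Error(`${key} must be one of ${allowed.join(", ")}`);
    }
  }
  return { inputs };
}

// Usage: keep the standard config but also request the raw HTML.
const example = buildScrapeBody("{{input.url}}", {
  include_html_content: "true",
});
console.log(JSON.stringify(example, null, 2));
```

Keeping the defaults in one place means an override only names the
flag that differs from the standard config.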

================================================================================
4. KEY NOTES
================================================================================

- Pair with `format_data_using_js_expression` to pull structured
  data out of a scraped page. `content` is a single string, and
  `metadata` / `links` are nested raw_array blobs, so use
  `format_data_using_js_expression` to write a JS expression that
  extracts specific fields (e.g. pricing tiers from a pricing page,
  the list of customer logos from a homepage, all email addresses
  on a contact page) into a raw array. Then push that raw array
  into `raw_to_structured_array` if you want one row per extracted
  item on a downstream sheet.
  See https://floqer.com/docs/action-detail/format_data_using_js_expression.txt and
  https://floqer.com/docs/action-detail/raw_to_structured_array.txt.
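
  The email-address case mentioned above could be sketched as the
  function below. It assumes `content` arrives as the plain string
  described in section 2; how a previous step's output is referenced
  inside `format_data_using_js_expression` is not covered here, so the
  value is passed in as an ordinary argument. The name `extractEmails`
  and the regular expression are illustrative, and the pattern is a
  pragmatic approximation rather than a full address validator.

```javascript
// Sketch: pull unique email addresses out of scraped page text.
// `content` stands in for the scrape action's `content` output.
function extractEmails(content) {
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const matches = (content || "").match(pattern) || [];
  // De-duplicate case-insensitively, keeping first-seen order.
  const seen = new Set();
  const result = [];
  for (const match of matches) {
    const email = match.toLowerCase();
    if (!seen.has(email)) {
      seen.add(email);
      result.push(email);
    }
  }
  return result;
}

// Usage with made-up page text:
console.log(
  extractEmails("Contact sales@example.com or Support@Example.com.")
);
```

  The returned array is the kind of raw array that
  `raw_to_structured_array` can then split into one row per address.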

================================================================================
5. WHERE IT FITS IN A WORKFLOW
================================================================================

Pattern (page scrape -> per-prospect insight): scrape a page tied to
the row (e.g. the prospect's company website, a job posting, a
news article), then run an LLM step over the scraped content to
produce a per-row insight or score for outreach.

  input (row data — typically a URL on the row)
    -> scrape_web_page_using_firecrawl (returns content as a
       string, plus metadata / links as nested raw_arrays)
    -> llm_models (prompt: extract per-prospect insights or score
       company fit from the scraped content)
    -> outreach.
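
The hand-off from the scrape step to the LLM step might look like the
sketch below, which assembles a prompt string from the `page_title`,
`page_url`, and `content` outputs. The function `buildInsightPrompt`,
the `MAX_CHARS` limit, and the prompt layout are all assumptions made
for illustration; the actual inputs of `llm_models` are not described
in this file.

```javascript
// Hypothetical glue between the scrape step and the LLM step.
// MAX_CHARS is an illustrative cap, not a documented limit.
const MAX_CHARS = 8000;

function buildInsightPrompt(scrape, question) {
  const content = (scrape.content || "").trim();
  // Clip very long pages so the prompt stays a manageable size.
  const clipped =
    content.length > MAX_CHARS ? content.slice(0, MAX_CHARS) : content;
  return [
    `Page title: ${scrape.page_title || "(unknown)"}`,
    `Page URL: ${scrape.page_url || "(unknown)"}`,
    "",
    question,
    "",
    "Scraped content:",
    clipped,
  ].join("\n");
}

// Usage with made-up scrape outputs:
const samplePrompt = buildInsightPrompt(
  {
    page_title: "Acme Pricing",
    page_url: "https://example.com/pricing",
    content: "Starter, Team, and Enterprise plans are available.",
  },
  "Score this company's fit for outreach from 1 to 5 and explain why."
);
console.log(samplePrompt);
```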

================================================================================
6. WHEN TO USE
================================================================================

Use scrape_web_page_using_firecrawl to grab the content of a single
known URL — typically tied to the row (the prospect's company
website, a specific landing page, a news article, a job posting) —
for downstream LLM processing or structured-data extraction.

================================================================================
7. WHEN NOT TO USE
================================================================================

Need step-by-step browser navigation
  -> ai_web_navigator
     (https://floqer.com/docs/action-detail/ai_web_navigator.txt)

Need open-ended web research
  -> llm_web_agents
     (https://floqer.com/docs/action-detail/llm_web_agents.txt)

================================================================================

This file is maintained manually. Last updated: 2026-04-30.
Full interactive reference: https://floqer.com/docs/reference
Action catalog: https://floqer.com/docs/action-catalog.txt