SOURCE_ID: extract_from_website
NAME: Extract from Website
CATEGORY: Signals

Scrape a public web page and extract structured rows from it using a natural-
language prompt. Floqer reads the page content, runs your extraction prompt
over it, and turns the result into table rows your workflows can consume.
This is a one-time import — it pulls the page once and stores what it found.

INDEX:
  1. Endpoints
  2. Extraction model
  3. Lifecycle (one-time only)
  4. Body shape (preview + create)
  5. Field catalogue
  6. Dynamic options
  7. How to configure end-to-end
  8. Key notes
  9. When to use

================================================================================
1. ENDPOINTS
================================================================================

Source identifier (used in every endpoint path): `extract_from_website`.

  POST /api/v1/sources/extract_from_website/preview
    Scope: sources:read
    Scrapes the page and runs your prompt, returning the extracted rows
    without creating anything. Use this to refine `url` and `prompt` and
    inspect the row shape your prompt produces. Does not consume Floqer
    credits.

  POST /api/v1/sources/extract_from_website
    Scope: sources:write
    Creates the source and starts a one-time extraction. Returns `source_instance_id`
    (the new source's UUID). The extraction consumes Floqer credits.

  GET /api/v1/sources/<source_instance_id>/data
    Scope: sources:read
    Paginated rows imported into the created source. `<source_instance_id>` is the
    UUID returned by Create. Query: `page_no` (default 1), `page_size`
    (default 20, max 200). Source-agnostic; see concepts.txt §10.

  POST /api/v1/sources/<source_instance_id>/sync
    Scope: sources:write
    Connects the created source to a workflow and (by default) backfills
    that workflow with the extracted rows. `<source_instance_id>` is the UUID
    returned by Create. Body:
    { workflow_id, field_mapping, push_existing?, run? }. Source-agnostic;
    see concepts.txt §10.

No connection is required. There is no dynamic-options endpoint for this
source today (see §6). Because this is a one-time source, there is no
ongoing-status PATCH to pause/resume.
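
As a rough illustration of the paths listed above, the following Python
sketch builds each endpoint path and the query string for the data endpoint.
The helper names are hypothetical, and clamping `page_size` on the client
side is a choice made here for convenience, not documented server behaviour.

```python
from urllib.parse import urlencode

SOURCE_ID = "extract_from_website"
BASE = "/api/v1/sources"


def preview_path() -> str:
    """Path for the preview call (scope: sources:read)."""
    return f"{BASE}/{SOURCE_ID}/preview"


def create_path() -> str:
    """Path for the create call (scope: sources:write)."""
    return f"{BASE}/{SOURCE_ID}"


def data_path(source_instance_id: str, page_no: int = 1, page_size: int = 20) -> str:
    """Path plus query string for reading imported rows.

    Defaults mirror the documented ones (page_no=1, page_size=20).
    Clamping to the documented maximum of 200 is an assumption of this
    sketch; the server may instead reject larger values.
    """
    page_no = max(1, page_no)
    page_size = max(1, min(page_size, 200))
    query = urlencode({"page_no": page_no, "page_size": page_size})
    return f"{BASE}/{source_instance_id}/data?{query}"


def sync_path(source_instance_id: str) -> str:
    """Path for connecting the created source to a workflow."""
    return f"{BASE}/{source_instance_id}/sync"
```

These helpers return paths only; the host name and authentication are left
to the caller because this section does not specify them.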

================================================================================
2. EXTRACTION MODEL
================================================================================

Every preview/create call does two things:

  1. Reads the page at `url` and converts its main content to text.
  2. Runs your natural-language `prompt` over that text to produce rows.

Core behaviour:

  - `url` is the public web page to read. It should be directly reachable
    (no login wall).

  - `prompt` describes what to pull out and how to shape it — e.g. "Extract
    every product as a row with name, price, and category." The shape of the
    output rows is driven entirely by your prompt, so the columns are not
    fixed (see §5).

  - `paginated` (optional, default false) tells the extractor to follow and
    read additional pages of a multi-page listing before extracting, so a
    paginated list can be captured in one source.

The columns in the resulting rows depend on what your prompt asks for. Use
preview to lock in the exact keys before you create.
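
Because the row shape follows the prompt, it can help to generate the prompt
from an explicit list of column names. The sketch below is one way to do
that; `build_prompt` is a hypothetical helper, and nothing here guarantees
that the extractor will use exactly these keys, so preview remains the way to
confirm them.

```python
def build_prompt(item: str, columns: list[str]) -> str:
    """Compose a prompt in the style of the examples in this document.

    `item` is the thing each row represents (e.g. "plan") and `columns`
    are the keys you would like on every row.
    """
    if not columns:
        raise ValueError("at least one column is required")
    if len(columns) == 1:
        cols = columns[0]
    elif len(columns) == 2:
        cols = f"{columns[0]} and {columns[1]}"
    else:
        cols = ", ".join(columns[:-1]) + f", and {columns[-1]}"
    return f"Extract each {item} as a row with {cols}."


prompt = build_prompt("plan", ["plan_name", "monthly_price", "included_seats"])
```

Keeping the column list in code also gives you something to compare against
the keys that preview actually returns.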

================================================================================
3. LIFECYCLE (ONE-TIME ONLY)
================================================================================

Extract from Website is a one-time (static) source. On create it reads the
page once and stores the extracted rows — there is no schedule, no
`expiration_date`, and no recurring re-check. To capture the page again
later (e.g. after it changes), create a new source.

================================================================================
4. BODY SHAPE (PREVIEW + CREATE)
================================================================================

Field names are snake_case. Unknown top-level keys are rejected with 400 —
the body is strict.

Preview accepts:
  url                  required: non-empty string (the page URL)
  prompt               required: non-empty string (what to extract)
  paginated            optional: boolean (default false)

Create accepts:
  url                  required: non-empty string
  prompt               required: non-empty string
  paginated            optional: boolean (default false)
  name                 required: display name for the new source
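
Since unknown top-level keys are rejected with 400, a client-side check
before sending can save a round trip. The following is a minimal sketch of
such a check based only on the rules listed above; `validate_body` is a
hypothetical helper, and treating whitespace-only strings as empty is an
assumption of this sketch.

```python
PREVIEW_KEYS = {"url", "prompt", "paginated"}
CREATE_KEYS = PREVIEW_KEYS | {"name"}


def validate_body(body: dict, *, create: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the body looks valid."""
    allowed = CREATE_KEYS if create else PREVIEW_KEYS
    required = ["url", "prompt"] + (["name"] if create else [])
    problems = []

    # The body is strict: any key outside the allowed set is an error.
    for key in sorted(set(body) - allowed):
        problems.append(f"unknown key: {key}")

    # Required fields must be non-empty strings.
    for key in required:
        value = body.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key} must be a non-empty string")

    # paginated is optional, but must be a boolean when present.
    if "paginated" in body and not isinstance(body["paginated"], bool):
        problems.append("paginated must be a boolean")

    return problems
```

Note that `name` is accepted only when `create=True`, so a create body reused
for preview is flagged as carrying an unknown key.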

Example preview body:

  {
    "url": "https://example.com/pricing",
    "prompt": "Extract each plan as a row with plan_name, monthly_price, and included_seats.",
    "paginated": false
  }

Example create body:

  {
    "name": "Competitor pricing tiers",
    "url": "https://example.com/pricing",
    "prompt": "Extract each plan as a row with plan_name, monthly_price, and included_seats.",
    "paginated": false
  }

Preview response envelope:

  {
    "status": 200,
    "data": {
      "data": [
        { "plan_name": "Starter", "monthly_price": "$29", "included_seats": "3" },
        { "plan_name": "Growth",  "monthly_price": "$99", "included_seats": "10" }
      ],
      "metadata": { "total_results": 2 }
    }
  }

  Rows are plain key/value objects whose keys come from your prompt.
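
To see which columns a prompt actually produced, you can collect the keys
across the preview rows. The sketch below assumes the response has already
been parsed into a Python dict with the envelope shown above; `row_keys` and
`rows_missing` are hypothetical helper names.

```python
def row_keys(preview_response: dict) -> list[str]:
    """Return every key seen on the preview rows, in first-seen order."""
    keys: list[str] = []
    for row in preview_response["data"]["data"]:
        for key in row:
            if key not in keys:
                keys.append(key)
    return keys


def rows_missing(preview_response: dict, expected: list[str]) -> list[int]:
    """Return the indexes of rows that lack at least one expected key."""
    rows = preview_response["data"]["data"]
    return [i for i, row in enumerate(rows) if any(k not in row for k in expected)]


# The preview envelope from this section, as a parsed dict.
response = {
    "status": 200,
    "data": {
        "data": [
            {"plan_name": "Starter", "monthly_price": "$29", "included_seats": "3"},
            {"plan_name": "Growth", "monthly_price": "$99", "included_seats": "10"},
        ],
        "metadata": {"total_results": 2},
    },
}
keys = row_keys(response)
```

If `rows_missing` reports any indexes, the prompt is producing an uneven row
shape and is probably worth refining before you create the source.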

Create response envelope:

  {
    "status": 201,
    "data": {
      "source_instance_id": "<uuid>",
      "name": "Competitor pricing tiers",
      "created_at": "2026-05-27T12:00:00.000Z"
    }
  }

  The extraction runs asynchronously — Create returns as soon as the source
  is created and the extraction is started, not when rows have finished
  arriving.
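
Because Create returns before rows have arrived, a client that needs the rows
straight away has to poll for them. The sketch below shows one possible
polling loop. This section does not describe a completion signal, so treating
"a non-empty first page" as "rows have started arriving" is an assumption,
and `fetch_rows` is a hypothetical callable that you supply (for example, a
function that calls the data endpoint and returns its list of rows).

```python
import time
from typing import Callable


def wait_for_rows(
    fetch_rows: Callable[[], list],
    attempts: int = 10,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call fetch_rows until it returns a non-empty list or attempts run out.

    Returns the first non-empty result, or an empty list if every attempt
    came back empty. More rows may still arrive after this returns.
    """
    for attempt in range(attempts):
        rows = fetch_rows()
        if rows:
            return rows
        # Skip the pause after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return []
```

The `sleep` parameter exists so the loop can be exercised without real
delays. A page that legitimately yields no rows will exhaust all attempts, so
keep `attempts` bounded.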

================================================================================
5. FIELD CATALOGUE
================================================================================

  url (string) — required
    The public web page to read. Should be directly reachable without login.
    Example: "https://example.com/pricing"

  prompt (string) — required
    Natural-language instruction describing what to extract and how to shape
    each row. The keys on the output rows are whatever your prompt asks for.
    Example: "Extract every job opening as a row with title, department, and
    location."

  paginated (boolean) — optional
    When true, follow and read additional pages of a multi-page listing
    before extracting. Defaults to false (single page).

  name (string) — required on create only
    Human-readable display name for the source.

Imported row fields (read-only):
  Row columns are PROMPT-DRIVEN — there is no fixed schema. Each row is a
  key/value object whose keys match the fields your `prompt` requested. Run
  preview first to discover and confirm the exact keys, then use those keys
  when building `field_mapping` for sync (§7).

================================================================================
6. DYNAMIC OPTIONS
================================================================================

None. `url` and `prompt` are free-form strings. There is no
`POST /api/v1/sources/extract_from_website/options/<field_name>` endpoint.

================================================================================
7. HOW TO CONFIGURE END-TO-END
================================================================================

  Step 1 — Preview to lock in the row shape
    POST /api/v1/sources/extract_from_website/preview
    Body: { "url": "...", "prompt": "...", "paginated": false }
    Inspect the keys on the returned `data[]` rows — these are the columns
    your prompt produces. Refine the prompt until the shape is right. Preview
    does not consume credits.

  Step 2 — Create the source
    POST /api/v1/sources/extract_from_website
    Body: the same `url`, `prompt`, and `paginated` as in Step 1, plus `name`.
    Response: { "status": 201, "data": { "source_instance_id": "<uuid>", ... } }

  Step 2b — (Optional) Poll extracted rows after create
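    GET /api/v1/sources/<source_instance_id>/data?page_no=1&page_size=20
      (<source_instance_id> = UUID from Step 2)
    The extraction runs asynchronously, so rows may still be arriving when
    you first call this endpoint.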

  Step 3 — Sync the source into a workflow
    POST /api/v1/sources/<source_instance_id>/sync   ( <source_instance_id> = UUID from Step 2 )
    Map workflow inputs to the prompt-driven row keys you saw in preview.
    See concepts.txt §10 for the full source-agnostic sync semantics.

This is a one-time extraction — once it finishes there are no further runs.
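
For Step 3, the sync body can be assembled from the keys confirmed in
preview. The sketch below makes a loud assumption: that `field_mapping` is an
object mapping each workflow input name to a row key. This document does not
define its shape, so check concepts.txt §10 before relying on it.
`build_sync_body` is a hypothetical helper, and the `push_existing` and `run`
values are passed through as-is.

```python
def build_sync_body(
    workflow_id: str,
    field_mapping: dict,
    preview_keys: list[str],
    push_existing=None,
    run=None,
) -> dict:
    """Build a sync request body, checking mapped row keys against preview.

    ASSUMPTION: field_mapping is {workflow_input_name: row_key}.
    """
    unknown = sorted(set(field_mapping.values()) - set(preview_keys))
    if unknown:
        raise ValueError(f"row keys not seen in preview: {unknown}")

    body = {"workflow_id": workflow_id, "field_mapping": dict(field_mapping)}
    # Optional keys are omitted unless the caller sets them explicitly.
    if push_existing is not None:
        body["push_existing"] = push_existing
    if run is not None:
        body["run"] = run
    return body


body = build_sync_body(
    "wf-123",  # placeholder workflow ID
    {"company_plan": "plan_name"},
    ["plan_name", "monthly_price", "included_seats"],
)
```

Checking the mapping against the preview keys catches a renamed or misspelled
column before the sync call is made.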

================================================================================
8. KEY NOTES
================================================================================

- No connection is required.
- One-time import: no schedule, no `expiration_date`, no pause/resume.
- Row columns are determined entirely by your `prompt` — there is no fixed
  schema. Always preview first to confirm the keys before syncing.
- Preview runs the full scrape and extraction without consuming credits;
  the extraction started by Create does consume credits.
- `paginated: true` reads additional listing pages before extracting; leave
  it false for a single page.
- The target page must be publicly reachable; pages that sit behind a login
  or that block bots may return no rows.

================================================================================
9. WHEN TO USE
================================================================================

- Competitor / market research: pull pricing tiers, feature tables, or
  product catalogues into rows for analysis.
- List building: extract names, roles, or companies from a public directory
  or listing page.
- Content monitoring snapshot: capture a changelog, careers page, or news
  index as structured rows at a point in time.

When you instead want to monitor new posts on a platform over time, use
`track_x_posts` or `track_linkedin_posts`. When you want to enrich a single
known URL inside a workflow, use a web-navigation workflow action.

================================================================================

Last updated: 2026-06-02. Reference: https://floqer.com/docs/reference