SOURCE_ID: extract_from_website
NAME: Extract from Website
CATEGORY: Signals

Scrape a public web page and extract structured rows from it using a
natural-language prompt. Floqer reads the page content, runs your extraction
prompt over it, and turns the result into table rows your workflows can
consume. This is a one-time import: it pulls the page once and stores what it
found.

INDEX:
  1. Endpoints
  2. Extraction model
  3. Lifecycle (one-time only)
  4. Body shape (preview + create)
  5. Field catalogue
  6. Dynamic options
  7. How to configure end-to-end
  8. Key notes
  9. When to use

================================================================================
1. ENDPOINTS
================================================================================

Source identifier (used in every endpoint path): `extract_from_website`.

POST /api/v1/sources/extract_from_website/preview
  Scope: sources:read
  Scrapes the page and runs your prompt, returning the extracted rows without
  creating anything. Use this to refine `url` and `prompt` and inspect the row
  shape your prompt produces. Does not consume Floqer credits.

POST /api/v1/sources/extract_from_website
  Scope: sources:write
  Creates the source and starts a one-time extraction. Returns
  `source_instance_id` (the new source's UUID). The extraction consumes Floqer
  credits.

GET /api/v1/sources//data
  Scope: sources:read
  Paginated rows imported into the created source. `` is the UUID returned by
  Create. Query: `page_no` (default 1), `page_size` (default 20, max 200).
  Source-agnostic; see concepts.txt §10.

POST /api/v1/sources//sync
  Scope: sources:write
  Connects the created source to a workflow and (by default) backfills it with
  the extracted rows. `` = UUID from Create.
  Body: { workflow_id, field_mapping, push_existing?, run? }.
  Source-agnostic; see concepts.txt §10.

No connection is required. There is no dynamic-options endpoint for this source
today (see §6).
Because this is a one-time source, there is no ongoing-status PATCH to
pause/resume.

================================================================================
2. EXTRACTION MODEL
================================================================================

Every preview/create call does two things:
  1. Reads the page at `url` and converts its main content to text.
  2. Runs your natural-language `prompt` over that text to produce rows.

Core behaviour:
  - `url` is the public web page to read. It should be directly reachable (no
    login wall).
  - `prompt` describes what to pull out and how to shape it, e.g. "Extract
    every product as a row with name, price, and category." The shape of the
    output rows is driven entirely by your prompt, so the columns are not
    fixed (see §5).
  - `paginated` (optional, default false) tells the extractor to follow and
    read additional pages of a multi-page listing before extracting, so a
    paginated list can be captured in one source.

The columns in the resulting rows depend on what your prompt asks for. Use
preview to lock in the exact keys before you create.

================================================================================
3. LIFECYCLE (ONE-TIME ONLY)
================================================================================

Extract from Website is a one-time (static) source. On create it reads the
page once and stores the extracted rows; there is no schedule, no
`expiration_date`, and no recurring re-check. To capture the page again later
(e.g. after it changes), create a new source.

================================================================================
4. BODY SHAPE (PREVIEW + CREATE)
================================================================================

Field names are snake_case. The body is strict: unknown top-level keys are
rejected with 400.
Preview accepts:
  url        required: non-empty string (the page URL)
  prompt     required: non-empty string (what to extract)
  paginated  optional: boolean (default false)

Create accepts:
  url        required: non-empty string
  prompt     required: non-empty string
  paginated  optional: boolean (default false)
  name       required: display name for the new source

Example preview body:
  {
    "url": "https://example.com/pricing",
    "prompt": "Extract each plan as a row with plan_name, monthly_price, and included_seats.",
    "paginated": false
  }

Example create body:
  {
    "name": "Competitor pricing tiers",
    "url": "https://example.com/pricing",
    "prompt": "Extract each plan as a row with plan_name, monthly_price, and included_seats.",
    "paginated": false
  }

Preview response envelope:
  {
    "status": 200,
    "data": {
      "data": [
        { "plan_name": "Starter", "monthly_price": "$29", "included_seats": "3" },
        { "plan_name": "Growth", "monthly_price": "$99", "included_seats": "10" }
      ],
      "metadata": { "total_results": 2 }
    }
  }

Rows are plain key/value objects whose keys come from your prompt.

Create response envelope:
  {
    "status": 201,
    "data": {
      "source_instance_id": "",
      "name": "Competitor pricing tiers",
      "created_at": "2026-05-27T12:00:00.000Z"
    }
  }

The extraction runs asynchronously: Create returns as soon as the source is
created and the extraction is started, not when rows have finished arriving.

================================================================================
5. FIELD CATALOGUE
================================================================================

url (string) - required
  The public web page to read. Should be directly reachable without login.
  Example: "https://example.com/pricing"

prompt (string) - required
  Natural-language instruction describing what to extract and how to shape
  each row. The keys on the output rows are whatever your prompt asks for.
  Example: "Extract every job opening as a row with title, department, and location."
paginated (boolean) - optional
  When true, follow and read additional pages of a multi-page listing before
  extracting. Defaults to false (single page).

name (string) - required on create only
  Human-readable display name for the source.

Imported row fields (read-only):
  Row columns are PROMPT-DRIVEN: there is no fixed schema. Each row is a
  key/value object whose keys match the fields your `prompt` requested. Run
  preview first to discover and confirm the exact keys, then use those keys
  when building `field_mapping` for sync (§7).

================================================================================
6. DYNAMIC OPTIONS
================================================================================

None. `url` and `prompt` are free-form strings. There is no
`POST /api/v1/sources/extract_from_website/options/` endpoint.

================================================================================
7. HOW TO CONFIGURE END-TO-END
================================================================================

Step 1 - Preview to lock in the row shape
  POST /api/v1/sources/extract_from_website/preview
  Body: { "url": "...", "prompt": "...", "paginated": false }
  Inspect the keys on the returned `data[]` rows; these are the columns your
  prompt produces. Refine the prompt until the shape is right. Preview does
  not consume credits.

Step 2 - Create the source
  POST /api/v1/sources/extract_from_website
  Body: + `"name"`.
  Response: { "status": 201, "data": { "source_instance_id": "", ... } }

Step 2b - (Optional) Poll extracted rows after create
  GET /api/v1/sources//data?page_no=1&page_size=20
  ( = UUID from Step 2)

Step 3 - Sync the source into a workflow
  POST /api/v1/sources//sync ( = UUID from Step 2 )
  Map workflow inputs to the prompt-driven row keys you saw in preview. See
  concepts.txt §10 for the full source-agnostic sync semantics.

This is a one-time extraction: once it finishes there are no further runs.
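
Steps 1 and 2 above can be sketched in Python. This is a minimal illustration
under stated assumptions, not an official client: the helper names
(`preview_body`, `create_body`, `row_keys`, `prepare_post`) are hypothetical,
and the API base URL and authentication headers are not given in this document,
so the caller must supply them. The sketch only prepares requests and reads a
preview envelope; it does not send anything over the network.

```python
import json
from urllib.request import Request

PREVIEW_PATH = "/api/v1/sources/extract_from_website/preview"
CREATE_PATH = "/api/v1/sources/extract_from_website"


def preview_body(url, prompt, paginated=False):
    """Build a preview body with only the keys the strict body accepts."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("url must be a non-empty string")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if not isinstance(paginated, bool):
        raise ValueError("paginated must be a boolean")
    return {"url": url, "prompt": prompt, "paginated": paginated}


def create_body(name, url, prompt, paginated=False):
    """A create body is the preview body plus the required display name."""
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    return {"name": name, **preview_body(url, prompt, paginated)}


def row_keys(preview_response):
    """Collect the prompt-driven column names from a preview envelope,
    in first-seen order across all rows."""
    keys = []
    for row in preview_response["data"]["data"]:
        for key in row:
            if key not in keys:
                keys.append(key)
    return keys


def prepare_post(base_url, path, body, headers=None):
    """Prepare (but do not send) a JSON POST request.

    base_url and headers are assumptions supplied by the caller; this
    document does not specify the API host or the auth scheme.
    """
    all_headers = {"Content-Type": "application/json", **(headers or {})}
    data = json.dumps(body).encode("utf-8")
    return Request(base_url.rstrip("/") + path, data=data,
                   headers=all_headers, method="POST")
```

A typical flow would build a preview body, send the prepared request with
`urllib.request.urlopen`, pass the decoded JSON response to `row_keys` to
confirm the columns, and only then build the create body with the same `url`
and `prompt`.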
================================================================================
8. KEY NOTES
================================================================================

- No connection is required.
- One-time import: no schedule, no `expiration_date`, no pause/resume.
- Row columns are determined entirely by your `prompt`; there is no fixed
  schema. Always preview first to confirm the keys before syncing.
- Preview runs the full scrape + extraction and does not consume credits; the
  create extraction consumes credits.
- `paginated: true` reads additional listing pages before extracting; leave it
  false for a single page.
- The target page must be publicly reachable; pages behind a login or blocking
  bots may return no rows.

================================================================================
9. WHEN TO USE
================================================================================

- Competitor / market research: pull pricing tiers, feature tables, or product
  catalogues into rows for analysis.
- List building: extract names, roles, or companies from a public directory or
  listing page.
- Content monitoring snapshot: capture a changelog, careers page, or news index
  as structured rows at a point in time.

When you instead want to monitor new posts on a platform over time, use
`track_x_posts` or `track_linkedin_posts`. When you want to enrich a single
known URL inside a workflow, use a web-navigation workflow action.

================================================================================
Last updated: 2026-06-02. Reference: https://floqer.com/docs/reference