The word "pipeline" makes it sound more complicated than it is

When I tell a client I will build them a data pipeline, their eyes sometimes glaze over. The word sounds like enterprise infrastructure: Kafka clusters, data lakes, ETL orchestration layers.

In practice, most of the pipelines I build are simpler than that. A pipeline is this: get data from here, clean it, put it there, and do it again on a schedule. That is the whole concept. The complexity is in the details of each step, not in the architecture.

Here is what a real pipeline looks like from start to finish.

Step 1: Extract

Something pulls data from a source. The source might be a website (scraping), an API (structured requests), a database (queries), a file (CSV uploads, S3 objects), or a combination.

For a county records pipeline, the source is 10 to 50 county assessor websites, each with a different layout and search interface. For a price monitoring pipeline, the source is product pages on Amazon, Walmart, and Target. For a financial research pipeline, the source is the SEC EDGAR filing system.

The extraction layer handles the source-specific complexity: authentication, pagination, rate limiting, error recovery, and data format conversion. This is usually the most code-intensive part of the pipeline and the part that breaks most often because sources change.

Step 2: Transform

Raw extracted data is messy. Different sources use different field names, different formats, different conventions. One county calls it "Assessed Value." Another calls it "Total Appraised." A third requires you to add "Land Value" and "Improvement Value" together.

The transform step normalizes everything into a consistent schema: same column names, same data types, same formats. It also handles deduplication, null detection, validation against expected ranges, and anomaly flagging.

For most pipelines, the transform step is a Python script or a SQL query. For complex transformations (normalizing 42 fields across 932 county schemas), it is a dedicated normalization engine with per-source mapping rules.
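
The per-source mapping idea can be sketched in a few lines. The source keys (`county_a`, `county_b`, `county_c`), the `parcel_id` field, and the valid range of 0 to 1 billion are assumptions for illustration; only the three field-name conventions come from the example above.

```python
from typing import Callable


def parse_money(value: str) -> float:
    """Turn strings like "$1,250,000" into a float."""
    return float(value.replace("$", "").replace(",", "").strip())


def _land_plus_improvement(row: dict) -> float:
    return parse_money(row["Land Value"]) + parse_money(row["Improvement Value"])


# One rule per source: how to derive the shared "assessed_value" field.
# The source names are hypothetical.
MAPPING_RULES: dict[str, Callable[[dict], float]] = {
    "county_a": lambda row: parse_money(row["Assessed Value"]),
    "county_b": lambda row: parse_money(row["Total Appraised"]),
    "county_c": _land_plus_improvement,
}


def transform(rows: list[tuple[str, dict]]) -> list[dict]:
    """Normalize, deduplicate on (source, parcel_id), and flag odd values."""
    seen = set()
    clean = []
    for source, row in rows:
        key = (source, row["parcel_id"])
        if key in seen:
            continue  # deduplication: keep the first copy only
        seen.add(key)
        value = MAPPING_RULES[source](row)
        clean.append({
            "source": source,
            "parcel_id": row["parcel_id"],
            "assessed_value": value,
            # Assumed valid range; a real pipeline would tune this per dataset.
            "anomaly": not (0 < value < 1_000_000_000),
        })
    return clean


raw = [
    ("county_a", {"parcel_id": "A1", "Assessed Value": "$250,000"}),
    ("county_b", {"parcel_id": "B1", "Total Appraised": "310000"}),
    ("county_c", {"parcel_id": "C1", "Land Value": "$80,000",
                  "Improvement Value": "$120,000"}),
    ("county_a", {"parcel_id": "A1", "Assessed Value": "$250,000"}),  # duplicate
    ("county_b", {"parcel_id": "B2", "Total Appraised": "0"}),  # out of range
]
clean = transform(raw)
```

Adding a new source then means adding one entry to `MAPPING_RULES` rather than touching the transform loop.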

Step 3: Load

Clean, normalized data lands in its destination. Common targets:

  • Google Sheets updated in place (the most common for small teams)
  • PostgreSQL or BigQuery for analytics and querying
  • S3 or Google Cloud Storage as a file archive (CSV, JSON, Parquet)
  • CRM write-back via API (HubSpot, Salesforce, Pipedrive)
  • Webhook push to a downstream system (Slack, n8n, custom)

The load step handles connection management, authentication, schema validation (does the target table match the data shape?), and conflict resolution (what happens when a row already exists?).
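
To make schema validation and conflict resolution concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the real target. The `prices` table and its columns are assumptions for the example. PostgreSQL supports a similar `ON CONFLICT ... DO UPDATE` clause, but the details of a production load step would differ.

```python
import sqlite3

EXPECTED_COLUMNS = ["sku", "retailer", "price", "updated_at"]


def load(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Validate the target table, then insert or update each row."""
    # Schema validation: does the target table match the data shape?
    actual = [info[1] for info in conn.execute("PRAGMA table_info(prices)")]
    if actual != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns in target table: {actual}")

    # Conflict resolution: if (sku, retailer) already exists, overwrite it.
    # The upsert syntax needs SQLite 3.24 or newer.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO prices (sku, retailer, price, updated_at)
            VALUES (:sku, :retailer, :price, :updated_at)
            ON CONFLICT (sku, retailer) DO UPDATE SET
                price = excluded.price,
                updated_at = excluded.updated_at
            """,
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prices (
        sku TEXT NOT NULL,
        retailer TEXT NOT NULL,
        price REAL NOT NULL,
        updated_at TEXT NOT NULL,
        PRIMARY KEY (sku, retailer)
    )
    """
)

# Morning run inserts one row; evening run updates it and adds another.
load(conn, [{"sku": "X1", "retailer": "amazon", "price": 19.99,
             "updated_at": "2024-01-01T06:00"}])
load(conn, [
    {"sku": "X1", "retailer": "amazon", "price": 17.99,
     "updated_at": "2024-01-01T18:00"},
    {"sku": "X1", "retailer": "walmart", "price": 18.49,
     "updated_at": "2024-01-01T18:00"},
])

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
amazon_price = conn.execute(
    "SELECT price FROM prices WHERE sku = 'X1' AND retailer = 'amazon'"
).fetchone()[0]
```

"Last write wins" is only one possible conflict policy. Some targets, a CRM in particular, may call for skipping existing rows or merging fields instead.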

Step 4: Schedule

A pipeline that runs once is a script. A pipeline that runs on a schedule is a service.

The scheduler triggers the extract-transform-load cycle at a defined interval: hourly, twice daily, daily, or weekly. It also handles retry logic (if a run fails, try again in 30 minutes), alerting (if a run fails repeatedly, notify someone), and logging (what ran, when, how many records, any errors).

For most client pipelines, the schedule is a cron job or an n8n workflow. For pipelines that need more sophisticated orchestration (dependencies between steps, conditional branches, parallel extraction from multiple sources), the scheduler might be a custom script with state management.
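
The retry, alerting, and logging behavior can be sketched as a small wrapper that cron (or any other trigger) calls. The names `run_once` and `notify` are hypothetical placeholders for the real pipeline run and the real alert channel, and three attempts is an assumed limit.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("pipeline")


def run_with_retries(
    run_once: Callable[[], int],
    notify: Callable[[str], None],
    max_attempts: int = 3,
    retry_delay_seconds: float = 30 * 60,  # try again in 30 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the pipeline, retrying on failure and alerting if every attempt fails.

    run_once() is expected to return the number of records processed.
    A cron expression such as "0 6,18 * * *" (06:00 and 18:00 every day)
    could trigger a script that calls this function.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            count = run_once()
            logger.info("run succeeded on attempt %d: %d records", attempt, count)
            return True
        except Exception as exc:
            logger.error("run failed on attempt %d: %s", attempt, exc)
            if attempt < max_attempts:
                sleep(retry_delay_seconds)
    notify(f"pipeline failed {max_attempts} times in a row")
    return False


# Usage: a job that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
waits: list[float] = []
alerts: list[str] = []


def flaky_job() -> int:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("source timed out")
    return 12000


def always_fails() -> int:
    raise RuntimeError("source is down")


ok = run_with_retries(flaky_job, alerts.append, sleep=waits.append)
failed_ok = run_with_retries(always_fails, alerts.append, sleep=lambda s: None)
```

The injected `sleep` only records the delays here, so the example finishes immediately instead of waiting an hour.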

Step 5: Monitor

This is the step most DIY pipelines skip, and it is the step that matters most for reliability.

Monitoring answers: did the pipeline run? Did it extract the expected number of records? Did any records fail validation? Is the data actually fresh, or did the source return cached results?

Monitoring is included in every recurring pipeline I build. I get alerted on the first failed run, not after a week of missing data. For clients, monitoring means they never have to think about whether the data is flowing. If something breaks, I fix it and backfill before the next delivery.
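
Two of the questions above, record count and freshness, reduce to simple checks after each run. This is a minimal sketch; the function name, the thresholds, and the shape of `last_updated` (source name mapped to the timestamp of its newest record) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_run(
    row_count: int,
    min_rows: int,
    last_updated: dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return a list of problems found; an empty list means the run looks healthy."""
    now = now or datetime.now(timezone.utc)
    problems = []
    # Did the run extract roughly the expected number of records?
    if row_count < min_rows:
        problems.append(
            f"row count {row_count} is below expected minimum {min_rows}"
        )
    # Is each source's data actually fresh?
    for source, newest in sorted(last_updated.items()):
        if now - newest > max_age:
            problems.append(
                f"{source} data is stale (last updated {newest.isoformat()})"
            )
    return problems


# Usage with illustrative numbers: one source has not updated in 30 hours.
now = datetime(2024, 1, 2, 18, 0, tzinfo=timezone.utc)
problems = check_run(
    row_count=10_500,
    min_rows=11_000,
    last_updated={
        "amazon": now - timedelta(hours=1),
        "target": now - timedelta(hours=30),
    },
    max_age=timedelta(hours=24),
    now=now,
)
```

Any non-empty result would be passed to whatever alert channel the pipeline uses.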

A real example

A price monitoring pipeline for a consumer electronics brand:

  • Extract: Twice daily, scrape 12,000 SKUs across Amazon, Walmart, Target. Residential proxy rotation, per-retailer pacing, anti-bot handling.
  • Transform: Normalize pricing so each retailer's label maps to one price field (Amazon "list price", Walmart "current price", Target "regular price"). Detect price changes greater than 5%. Flag out-of-stock transitions.
  • Load: Write to a Google Sheet dashboard with conditional formatting. Push alerts to a Slack channel when price drops or stock changes occur.
  • Schedule: Runs at 6 AM and 6 PM daily via cron. Retry on failure with 30-minute backoff.
  • Monitor: Row count validation (if SKU count drops below 11,000, something broke). Staleness check (if any retailer's data is > 24 hours old, alert).

Total pipeline: about 2,000 lines of code. Setup time: 2 weeks. Ongoing maintenance: 4 to 6 hours per month.
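
The change detection in the Transform step of this example comes down to comparing the current snapshot with the previous one. Here is a minimal sketch, assuming each snapshot is a dict keyed by (retailer, SKU) with `price` and `in_stock` fields. That data shape is an assumption for illustration, not the actual pipeline's.

```python
def detect_events(
    previous: dict, current: dict, threshold: float = 0.05
) -> list[dict]:
    """Compare two snapshots and report price changes and stock-outs.

    Each snapshot maps (retailer, sku) to {"price": float, "in_stock": bool}.
    """
    events = []
    for key, now in current.items():
        before = previous.get(key)
        if before is None:
            continue  # new listing: nothing to compare against
        if before["price"] > 0:
            change = (now["price"] - before["price"]) / before["price"]
            if abs(change) > threshold:  # price moved by more than 5%
                events.append({
                    "key": key,
                    "type": "price_change",
                    "change_pct": round(change * 100, 1),
                })
        if before["in_stock"] and not now["in_stock"]:
            events.append({"key": key, "type": "out_of_stock"})
    return events


previous = {
    ("amazon", "SKU1"): {"price": 100.0, "in_stock": True},
    ("walmart", "SKU1"): {"price": 100.0, "in_stock": True},
    ("target", "SKU1"): {"price": 50.0, "in_stock": True},
}
current = {
    ("amazon", "SKU1"): {"price": 90.0, "in_stock": True},    # 10% drop
    ("walmart", "SKU1"): {"price": 103.0, "in_stock": True},  # 3% rise, ignored
    ("target", "SKU1"): {"price": 50.0, "in_stock": False},   # went out of stock
}
events = detect_events(previous, current)
```

Each event can then drive the Sheet update and the Slack alert in the Load step.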

Where automation tools fit

For pipelines where the logic is "if this, then that" and the client wants visual control, n8n is a good fit. I use it for:

  • Webhook-triggered workflows (new CRM record arrives, trigger enrichment)
  • Conditional routing (if price drop > 10%, alert Slack AND email the merchandising team)
  • Scheduled report delivery (every Monday at 9 AM, compile the Sheet data and email a PDF)
  • Multi-step integrations (scraper outputs JSON, n8n transforms it, writes to Sheet, sends Slack)

The scraping itself still runs as code (Python with Playwright or requests). n8n handles the orchestration and delivery layer. This split gives you the power of custom extraction with the visibility of a visual workflow builder.
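
The handoff between the two layers can be as small as one HTTP POST from the scraper to a webhook that the workflow listens on. This is a minimal sketch using only the Python standard library. The URL is a hypothetical placeholder, and the payload shape (`{"events": [...]}`) is an assumption, since the workflow defines what it expects.

```python
import json
import urllib.request

# Hypothetical placeholder; a real workflow provides its own webhook URL.
WEBHOOK_URL = "https://n8n.example.com/webhook/price-events"


def build_webhook_request(url: str, events: list[dict]) -> urllib.request.Request:
    """Package scraper output as a JSON POST request."""
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Build the request without sending it, since the URL above is not real.
request = build_webhook_request(
    WEBHOOK_URL,
    [{"sku": "SKU1", "retailer": "amazon", "type": "price_change",
      "change_pct": -10.0}],
)
payload = json.loads(request.data)
```

Everything after the POST (routing, formatting, delivery) stays in the visual workflow, where the client can see and adjust it.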

What it costs

The cost of a pipeline depends on three things:

  1. How many sources and how complex they are (a single API call vs 50 county websites)
  2. How often it runs (weekly is simpler than hourly)
  3. Where the data goes (Google Sheet is simpler than CRM write-back with dedup)

A simple pipeline (one source, weekly, Sheet delivery) starts at $199 for the setup and does not need a recurring fee if it runs on your own infrastructure. A complex pipeline (multiple sources, daily, database + CRM + alerting) starts at $499/mo for the managed service including maintenance and monitoring.

The managed service means I own the reliability. The pipeline runs, the data flows, and when something breaks I fix it. You consume the output without managing the infrastructure.