From spreadsheet chaos to a Python serverless pipeline

A staged path from manual CSV to validated transforms, idempotent writes, and AWS Lambda + SQS—with examples and AWS docs.

Spreadsheets are a great exploration surface; they are a fragile system of record when multiple teams depend on weekly exports. This post outlines a staged migration to Python + serverless components, with pointers to official AWS documentation.

1. Document the workflow before code

Capture:

  • Trigger: schedule (cron), file upload (S3 event), or message (queue).
  • Inputs: CSV columns, encodings, header row, allowed nulls.
  • Outputs: database tables, files, or API calls.
  • SLA: max runtime, who gets paged on failure.

If you cannot describe it in one page, split it into phase 1 (read + validate + email report) and phase 2 (write + notify).
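
One lightweight way to keep that one-page description honest is to encode it as data next to the job. The field names below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    """A one-page workflow description, versioned alongside the code."""
    trigger: str                  # "cron", "s3_event", or "queue"
    input_columns: list[str]      # expected CSV header
    encoding: str = "utf-8-sig"   # Excel exports often carry a BOM
    outputs: list[str] = field(default_factory=list)  # tables, files, APIs
    max_runtime_s: int = 900      # SLA budget; also a sane Lambda timeout
    pager: str = "data-oncall"    # who gets paged on failure
```

Reviewing a diff to this dataclass is easier than reverse-engineering a spreadsheet's assumptions later.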

2. A thin Python core with testable transforms

Keep I/O at the edges; pure functions in the middle.

# shapes only — adapt types to your stack
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Row:
    customer_id: str
    amount: str  # parse/validate downstream

def parse_rows(raw_lines: Iterator[str]) -> Iterator[Row]:
    ...

def validate_business_rules(row: Row) -> None:
    """Raise ValueError with a stable error code for DLQ messaging."""
    ...

Unit-test validate_business_rules with table-driven tests (pytest @pytest.mark.parametrize). For messy CSVs, Python’s csv module plus explicit encoding (utf-8-sig for Excel exports) prevents silent mojibake.

3. Idempotency: design for reruns

Jobs will retry after timeouts. Strategies:

  • Natural keys: upsert on (source_system, external_id).
  • Job id: store job_id on every row written; reject duplicates.
  • S3 + DynamoDB: conditional writes or transactions for small metadata.

This mirrors webhook handling, where the same idempotency and retry strategies apply.
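
As a sketch of the natural-key strategy: SQLite and Postgres both support an ON CONFLICT upsert, shown here with SQLite so the example is self-contained (table and column names are illustrative):

```python
import sqlite3

def upsert_payment(conn: sqlite3.Connection, source_system: str,
                   external_id: str, amount: str) -> None:
    # Rerunning the same job overwrites the existing row instead of
    # inserting a duplicate; (source_system, external_id) must be UNIQUE.
    conn.execute(
        """
        INSERT INTO payments (source_system, external_id, amount)
        VALUES (?, ?, ?)
        ON CONFLICT (source_system, external_id)
        DO UPDATE SET amount = excluded.amount
        """,
        (source_system, external_id, amount),
    )
```

Because the write is an upsert, a retried batch converges to the same final state as a single successful run.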

4. When Lambda + SQS fit

AWS Lambda works well when:

  • Duration stays within Lambda's 15-minute maximum (configure the timeout explicitly; watch memory—allocated memory also scales CPU).
  • Concurrency is bursty, not 24/7 pinned at high utilization.

Pattern:

  1. An S3 ObjectCreated event triggers a Lambda that validates the file and enqueues a message (payload = S3 key + checksum).
  2. An SQS-triggered Lambda worker processes batches; failures go to a DLQ after N attempts (the redrive policy's maxReceiveCount)—see SQS dead-letter queues.
[S3 upload] --> [Lambda: validate + enqueue] --> [SQS] --> [Lambda: transform + write]
                                                      |
                                                   [DLQ on poison messages]
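
A worker sketch for step 2, assuming the event source mapping has ReportBatchItemFailures enabled so only failed records return to the queue (and eventually reach the DLQ). The process body is hypothetical; the payload shape follows step 1 above:

```python
import json

def process(body: dict) -> None:
    # hypothetical transform + write; raise to flag a poison message
    if "s3_key" not in body:
        raise ValueError("ERR_MISSING_KEY")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only these messages are retried; the rest are acknowledged.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without the partial-batch response, one bad record would force the whole batch back onto the queue and reprocess the good messages.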

For long-running ETL, consider AWS Batch or containers instead of stretching Lambda duration.

5. Cold starts and packaging

Cold start matters if the trigger is user-facing. Mitigations:

  • Provisioned concurrency (a cost tradeoff)—see Lambda scaling.
  • Slim dependencies (avoid shipping unused scientific stacks).
  • Initialize heavy clients outside the handler where possible.
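
The last point in sketch form: module scope runs once per execution environment, so expensive setup there is paid only on cold start. In a real function the cached object would be something like a boto3 client; a plain object stands in here to keep the example dependency-free:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_client():
    # Expensive construction (TLS session, config parsing) happens once
    # per execution environment, not once per invocation.
    return object()  # stand-in for e.g. boto3.client("s3")

def handler(event, context):
    client = get_client()  # warm invocations reuse the cached client
    return {"same_client": client is get_client()}
```

A lazy cached factory also keeps unit tests fast, since nothing heavy is built at import time.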

6. Observability and alerts

  • Structured JSON logs with job_id, record_count, duration_ms, outcome.
  • CloudWatch alarms on DLQ depth and error rate—see Lambda monitoring.
  • One on-call channel (PagerDuty, Slack) beats silent failure.
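
A minimal helper emitting those fields as one JSON object per line, which CloudWatch Logs ingests as structured records (field names follow the bullet list; returning the line is just to ease testing):

```python
import json
import sys
import time

def log_outcome(job_id: str, record_count: int,
                started: float, outcome: str) -> str:
    """Emit one structured JSON log line for the finished job."""
    line = json.dumps({
        "job_id": job_id,
        "record_count": record_count,
        "duration_ms": int((time.monotonic() - started) * 1000),
        "outcome": outcome,
    })
    print(line, file=sys.stdout)
    return line
```

Consistent field names make CloudWatch Logs Insights queries (and alarm filters) trivial to write later.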

References

Related posts: webhook reliability, LLM cost controls.