From spreadsheet chaos to a Python serverless pipeline
A staged path from manual CSV to validated transforms, idempotent writes, and AWS Lambda + SQS—with examples and AWS docs.
Spreadsheets are a great exploration surface, but they make a fragile system of record once multiple teams depend on weekly exports. This post outlines a staged migration to Python + serverless components, with pointers to official AWS documentation.
1. Document the workflow before code
Capture:
- Trigger: schedule (cron), file upload (S3 event), or message (queue).
- Inputs: CSV columns, encodings, header row, allowed nulls.
- Outputs: database tables, files, or API calls.
- SLA: max runtime, who gets paged on failure.
If you cannot describe it in one page, split into phase 1 (read + validate + email report) and phase 2 (write + notify).
2. A thin Python core with testable transforms
Keep I/O at the edges; pure functions in the middle.
```python
# shapes only — adapt types to your stack
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Row:
    customer_id: str
    amount: str  # parse/validate downstream

def parse_rows(raw_lines: Iterator[str]) -> Iterator[Row]:
    ...

def validate_business_rules(row: Row) -> None:
    """Raise ValueError with a stable error code for DLQ messaging."""
    ...
```
Unit-test `validate_business_rules` with table-driven tests (`@pytest.mark.parametrize`). For messy CSVs, Python’s `csv` module plus an explicit encoding (`utf-8-sig` for Excel exports) prevents silent mojibake.
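One way to fill in `parse_rows` under the assumptions above — a header row, and `utf-8-sig` so the byte-order mark Excel prepends never leaks into the first header name. The column names `customer_id` and `amount` are illustrative:

```python
# Sketch of parse_rows, assuming a header row with columns named
# customer_id and amount (illustrative). utf-8-sig strips the BOM
# that Excel-exported CSVs often carry.
import csv
import io
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Row:
    customer_id: str
    amount: str  # parse/validate downstream

def parse_rows(raw_bytes: bytes) -> Iterator[Row]:
    text = io.TextIOWrapper(io.BytesIO(raw_bytes),
                            encoding="utf-8-sig", newline="")
    for rec in csv.DictReader(text):
        yield Row(customer_id=rec["customer_id"].strip(),
                  amount=rec["amount"].strip())
```

Feeding this bytes rather than decoded text keeps the encoding decision in one place, which makes the BOM case easy to cover in a parametrized test.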
3. Idempotency: design for reruns
Jobs will retry after timeouts. Strategies:
- Natural keys: upsert on `(source_system, external_id)`.
- Job id: store `job_id` on every row written; reject duplicates.
- S3 + DynamoDB: conditional writes or transactions for small metadata.
This mirrors webhook design in idempotency and retries.
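A minimal sketch of the natural-key strategy: an upsert keyed on `(source_system, external_id)` makes a rerun overwrite instead of duplicate. Shown here with stdlib `sqlite3` for a self-contained example; the table name and columns are hypothetical, and Postgres `ON CONFLICT` works the same way:

```python
# Hypothetical sketch: upsert on the natural key so retries are no-ops.
# Table and column names are illustrative, not from the post.
import sqlite3

def write_rows(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    conn.executemany(
        """
        INSERT INTO payments (source_system, external_id, amount)
        VALUES (?, ?, ?)
        ON CONFLICT (source_system, external_id)
        DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (source_system TEXT, external_id TEXT, "
    "amount REAL, UNIQUE (source_system, external_id))"
)
write_rows(conn, [("sap", "42", 10.0)])
write_rows(conn, [("sap", "42", 12.5)])  # rerun after a timeout: no duplicate
```

The `UNIQUE` constraint is what makes the conflict clause fire; without it the rerun silently inserts a second row.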
4. When Lambda + SQS fit
AWS Lambda works well when:
- Duration stays within Lambda's 15-minute maximum (configure the function timeout; watch memory—it scales CPU).
- Concurrency is bursty, not 24/7 pinned at high utilization.
Pattern:
- S3 `ObjectCreated` → Lambda validates and enqueues a message (payload = S3 key + checksum).
- SQS → Lambda worker processes batches; failures go to a DLQ after N attempts—see SQS dead-letter queues.
```
[S3 upload] --> [Lambda: validate + enqueue] --> [SQS] --> [Lambda: transform + write]
                                                   |
                                                   v
                                       [DLQ on poison messages]
```
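The worker half of the diagram can be sketched as follows, assuming the SQS event source mapping has `ReportBatchItemFailures` enabled so only failed messages are retried (and eventually routed to the DLQ) rather than the whole batch. `process` is a placeholder for your transform:

```python
# Sketch of the SQS worker Lambda with partial batch responses.
# Assumes ReportBatchItemFailures is enabled on the event source mapping.
import json

def process(payload: dict) -> None:
    """Placeholder for your transform + idempotent write."""

def handler(event: dict, context) -> dict:
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed; SQS redrives it,
            # and after N attempts it lands in the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list acknowledges the whole batch; returning every message id effectively fails the batch.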
For long-running ETL, consider AWS Batch or containers instead of stretching Lambda duration.
5. Cold starts and packaging
Cold start matters if the trigger is user-facing. Mitigations:
- Provisioned concurrency (cost tradeoff)—Lambda scaling.
- Slim dependencies (avoid shipping unused scientific stacks).
- Initialize heavy clients outside the handler where possible.
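The last point relies on Lambda reusing the execution environment between warm invocations: module scope runs once per cold start, so expensive setup belongs there. In this sketch, `expensive_client()` is a stand-in for something like `boto3.client("s3")` or a database pool:

```python
# Sketch: module scope runs once per execution environment and survives
# warm invocations. expensive_client() stands in for real SDK setup.
import functools

@functools.lru_cache(maxsize=1)
def expensive_client() -> dict:
    # Imagine TLS handshakes and SDK initialization here.
    return {"connected": True}

CLIENT = expensive_client()  # paid once, at cold start

def handler(event, context):
    # Per-invocation work reuses CLIENT; never rebuild it here.
    return {"ok": CLIENT["connected"]}
```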
6. Observability and alerts
- Structured JSON logs with `job_id`, `record_count`, `duration_ms`, `outcome`.
- CloudWatch alarms on DLQ depth and error rate—Lambda monitoring.
- One on-call channel (PagerDuty, Slack) beats silent failure.
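The structured-log bullet above amounts to one JSON object per job, which CloudWatch Logs Insights can then filter on by field. A small sketch, with the helper name as an assumption:

```python
# Sketch: one JSON log line per job. Field names follow the list above;
# the helper name is illustrative.
import json

def job_log(job_id: str, record_count: int,
            duration_ms: int, outcome: str) -> str:
    return json.dumps({
        "job_id": job_id,
        "record_count": record_count,
        "duration_ms": duration_ms,
        "outcome": outcome,
    })
```

Anything printed from a Lambda handler lands in CloudWatch Logs automatically, so `print(job_log(...))` at the end of a run is enough to start alarming on.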
References
- Lambda developer guide
- SQS developer guide
- S3 event notifications
- Well-Architected Serverless lens — operational excellence checklist
Related posts: webhook reliability, LLM cost controls.