Al Bunch

AWS Infrastructure — Architecture Patterns for Real-World Workloads

Two deliberate architecture patterns — production SaaS on ECS/Aurora and composable EC2/Docker for SMBs — plus a 1k/min analytics pipeline. Infrastructure matched to business needs, not résumé padding.

AWS · ECS · Terraform · Docker · Aurora Serverless · CloudFront · SQS · GitHub Actions

Every project in this portfolio runs on AWS infrastructure I designed and operate. Rather than one-size-fits-all, I maintain two distinct architecture patterns and select between them based on the actual business requirements — uptime expectations, throughput, budget, and team size.

Pattern 1: Production SaaS (TC Track)

TC Track is a revenue-generating multi-tenant SaaS platform. The infrastructure reflects that:

  • Request path — CloudFront → WAF → ALB → ECS (Fargate) → Aurora Serverless / S3 / EFS
  • React SPA hosted in S3 behind CloudFront, separate from the API containers
  • VPC with NAT gateway and VPC endpoints for S3 and ECR to reduce NAT data transfer costs
  • EventBridge fires every minute to poll for pending emails, queues them into SQS, and worker containers send asynchronously — no synchronous email sending in the request cycle
  • SQS also handles inbound Stripe webhooks and outbound integration webhooks
  • Autoscaling — web containers scale on CPU/load, worker containers scale on visible queue depth
  • Aurora Serverless — scales to zero during quiet periods, no capacity planning required
  • WAF with managed rule sets in front of the ALB
  • Secrets Manager for credentials — easier to rotate and audit than burying values in ECS task definitions
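The worker autoscaling bullet above is the "backlog per task" pattern: size the fleet so each worker owns a fixed slice of the visible queue. A minimal sketch of that math — the thresholds (100 messages per task, 1–10 tasks) are illustrative assumptions, not TC Track's actual values:

```python
import math

def desired_worker_count(visible_messages: int,
                         messages_per_task: int = 100,
                         min_tasks: int = 1,
                         max_tasks: int = 10) -> int:
    """Backlog-per-task scaling: aim for each worker task to own
    roughly `messages_per_task` visible SQS messages, clamped to
    the fleet's floor and ceiling."""
    wanted = math.ceil(visible_messages / messages_per_task)
    return max(min_tasks, min(max_tasks, wanted))
```

In practice a target-tracking scaling policy on a CloudWatch metric (visible messages divided by running tasks) computes the same thing; the function just makes the arithmetic visible.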

Running cost at current scale: $250–300/month, over-provisioned for early revenue. The plan is to consolidate onto the EC2 pattern (below) until traffic justifies the full ECS stack — infrastructure should earn its complexity.

Pattern 2: Composable EC2 Platform (SMB Workloads)

For smaller clients and early-stage projects where ECS/Aurora is overkill:

  • EC2 instance running Docker — the instance is cattle, not a pet. Rebuild, don’t repair.
  • ALB for SSL termination — ACM certificates are free, renew automatically, and look more “enterprise” than Let’s Encrypt to clients who care about that. Traffic runs plain HTTP behind the ALB inside the VPC.
  • Traefik or Nginx Proxy Manager on Docker — Traefik for stable, well-defined routing; NPM for multi-service hosts with frequent changes.
  • Public and private Docker networks — apps that need internet exposure get routed through the proxy; internal services (queues, caches, admin tools) stay on the private network.
  • RDS sized to fit — small provisioned instance, Aurora Serverless, or multi-AZ depending on the workload’s actual durability and performance requirements.
  • EFS for shared storage, S3 for assets. CloudFront in front when caching makes sense to reduce EFS load.
  • Queue layer is pluggable — SQS for durability, RabbitMQ for routing features, Redis for speed. Pick based on the problem.
  • Lambda for fan-out — wide, shallow tasks that don’t justify a persistent container.
  • ECS tasks for heavy one-offs — on-demand containers for jobs like FFmpeg rendering that need compute but not a running service.
  • Tailscale + Portainer for management — SSH and container management without exposing ports to the internet.
  • Stack files and environment configs checked into GitHub — infrastructure state lives in version control, not in someone’s head.

For truly small workloads where downtime isn’t a concern, we pull back further: containerized database on the same EC2, backups to S3 on a schedule. No RDS bill at all. It’s the right call when the client’s recovery time tolerance is measured in minutes, not seconds.

Data Pipeline: Event Analytics at Scale

Built at a previous role — a real-time analytics pipeline processing a sustained 1,000 messages/minute:

  • Custom JavaScript event handlers on the frontend replaced an existing OpenTelemetry integration that was the wrong tool for the job — OTel is for distributed tracing, not business event capture.
  • API Gateway (HTTP, not REST) receives events and writes directly to SQS. No Lambda in the hot path — API Gateway’s native SQS integration is cheaper and faster.
  • SQS with DLQ for failed messages — nothing gets silently dropped.
  • Three NiFi containers read from SQS, perform data enrichment (geolocation, browser/device classification), and write to OpenSearch.
  • Migration from OpenSearch to PostgreSQL was in progress — OpenSearch is excellent for full-text search and log aggregation, but the team was stronger in SQL and the analytical queries involved joins that OpenSearch handles poorly. Postgres with good indexing handled the volume fine and at a lower cost.
  • Custom analytics dashboards with Chart.js and Apache ECharts on the read side, querying the enriched data.
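The browser/device classification step in the enrichment stage boils down to coarse User-Agent bucketing. A simplified sketch of that kind of classifier — the actual NiFi processors used their own logic; this just shows the shape of the enrichment:

```python
def classify_user_agent(ua: str) -> dict:
    """Coarse device/browser bucketing from a User-Agent string,
    the kind of enrichment applied before events hit the store."""
    ua_l = ua.lower()
    if "iphone" in ua_l or "android" in ua_l or "mobile" in ua_l:
        device = "mobile"
    elif "ipad" in ua_l or "tablet" in ua_l:
        device = "tablet"
    else:
        device = "desktop"
    # Order matters: Edge and Chrome UAs both contain "chrome",
    # and nearly every UA contains "safari".
    if "edg/" in ua_l:
        browser = "edge"
    elif "chrome" in ua_l:
        browser = "chrome"
    elif "firefox" in ua_l:
        browser = "firefox"
    elif "safari" in ua_l:
        browser = "safari"
    else:
        browser = "other"
    return {"device": device, "browser": browser}
```

Real-world UA parsing is messier than this (bots, WebViews, client hints), which is exactly why it belongs in the enrichment tier rather than in the hot ingest path.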

CI/CD and Deployment

  • Container images built by GitHub Actions, pushed to ECR (migrated from Docker Hub — cheaper and faster within the AWS network).
  • OIDC federation — GitHub Actions assumes an IAM role via trust policy. No AWS access keys stored as GitHub secrets. Same pattern for EC2 instances pulling from ECR: instance profiles with ECR read-only policies. Zero stored credentials.
  • Self-hosted runners for builds that need more power, private network access, or custom toolchains. Currently spread across a colocation facility and an EC2 instance for ARM builds.
  • Terraform manages the full AWS stack across projects — some projects share a repo where infrastructure overlaps, others have dedicated repos. Claude Code writes the HCL, I review and commit. The Terraform repo is the source of truth for infrastructure state, not anyone’s memory.
  • Ansible for provisioning when machine setup needs to be reproducible.
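The OIDC federation bullet above comes down to an IAM trust policy on the role GitHub Actions assumes. A sketch of that policy document — the account ID and `example-org/example-repo` are placeholders, while the provider URL and condition keys are the standard GitHub OIDC values:

```python
import json

# Trust policy allowing a specific GitHub repo/branch to assume the
# role via AssumeRoleWithWebIdentity -- no stored access keys.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Federated": "arn:aws:iam::123456789012:oidc-provider/"
                         "token.actions.githubusercontent.com"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            # aud must be STS; sub pins the repo and branch that
            # are allowed to assume the role.
            "StringEquals": {
                "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
            },
            "StringLike": {
                "token.actions.githubusercontent.com:sub":
                    "repo:example-org/example-repo:ref:refs/heads/main"
            }
        }
    }]
}

print(json.dumps(trust_policy, indent=2))
```

The `sub` condition is where the scoping lives: widen it (`repo:example-org/*`) and any repo in the org can assume the role, which defeats the point.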

Operations and Cost Discipline

  • Uptime Kuma for external monitoring of all services
  • ECS health checks for container-level self-healing
  • Pushover for immediate alert notifications — same integration used across TC Track’s application alerts
  • AWS budget alerts for cost anomalies
  • QA environments auto-shutdown — TC Track’s QA spins down when idle, starts on demand with a 4-hour TTL that resets on access. No reason to pay for an environment nobody’s using.
  • VPC endpoints where data transfer costs justify them — S3 and ECR traffic stays off the NAT gateway
  • ECR over Docker Hub — lower cost and faster pulls within the AWS network
  • Forward-compatible database migrations — schema changes are designed to work across 3–4 application versions so rolling ECS deployments, rollbacks, and concurrent container versions don’t break. Not the most sophisticated migration strategy, but deliberate and appropriate for the current scale.
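The QA auto-shutdown bullet above is a small amount of logic doing real cost work: a TTL that resets on every access. A minimal sketch of the shutdown decision, assuming the 4-hour TTL from the list (the function name is illustrative):

```python
from datetime import datetime, timedelta

TTL = timedelta(hours=4)

def should_shut_down(last_access: datetime, now: datetime) -> bool:
    """QA environment auto-shutdown: each access resets the clock,
    so the environment stays up only while someone is using it."""
    return now - last_access >= TTL
```

The trigger can be anything cheap — a scheduled Lambda or EventBridge rule comparing a last-access timestamp against the TTL and stopping the environment when it expires.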

Security Posture

  • OIDC federation everywhere — no long-lived AWS credentials in CI or on compute
  • Narrowly scoped IAM policies — no “god keys,” services get the minimum permissions they need
  • WAF in front of public-facing ALBs
  • VPC isolation — RDS and internal services are private-subnet only, not accessible from the internet
  • Secrets Manager for sensitive configuration rather than environment variables in task definitions
  • Dedicated VPC strategy — projects share infrastructure until they outgrow it; once a project is earning revenue and team access needs to be scoped, it gets isolated into its own VPC

The Philosophy

Most infrastructure decisions aren’t technical — they’re economic. ECS with Aurora Serverless is the right answer for a multi-tenant SaaS platform that needs to scale and stay resilient. An EC2 instance with Docker and a right-sized RDS is the right answer for an SMB client who needs reliability but not five-nines uptime. A containerized database with S3 backups is the right answer when the budget is tight and recovery time is flexible.

The engineering skill isn’t knowing how to set up ECS. It’s knowing when not to.