From Docker Compose to Production AWS: How 11 Engineers Shipped 8 Microservices in 3 Weeks - DMI Cohort 2 Final Capstone

Picture this: you have 3 weeks, 11 engineers spread across different time zones, 8 Spring Boot microservices, and a single goal — deploy a production-grade AWS platform from scratch with no console clicking, no shortcuts, and no "it works on my machine" excuses.

That was our brief for the DevOps Micro Internship (DMI) Cohort 2 Final Capstone, mentored by Pravin Mishra. And three weeks later, our team presented a fully working dual-environment AWS platform — live demos included — in a 30-minute presentation slot that every single team member contributed to, and we finished exactly on time.

This blog covers everything: what we built, how we built it, the bugs that nearly broke us, what we learned, and what we would do differently. If you are thinking about building something similar, this is the guide I wish I had before starting.

🔗 Full source code: https://github.com/petclinic-project/PETCLINIC-GROUP7-PROJECT

The Team

We were Group 7, a squad of 11 engineers with clearly defined roles:

| Name | Role |
|---|---|
| Pratyush Pahari | Co-Team Lead, CI/CD & Automation |
| Joy Ukpabi | Team Lead |
| Anumba Chiamaka Maryann | Scrum Master |
| Aarti Jadhav | Infrastructure Engineer |
| Paul Lulami Wamenya | Infrastructure Engineer |
| Ezeh Lilian Ezichi | Deployment Engineer |
| Afunogu Stephanie | Deployment Engineer |
| Issac Osei Owusu-Ansah | Documentation Engineer |
| Nwogu Nice Ihuoma | Documentation Engineer |
| Ntando Mvubu | Monitoring Engineer |
| Ibeh Victory Chimaobi | Monitoring Engineer |

Clear ownership was one of the best decisions we made early. When something breaks, you need to know immediately whose responsibility it is, not spend 20 minutes in a group chat asking "whose code is this?"

What We Built and Why

Spring Petclinic Microservices is a well-known sample application — a veterinary clinic management system built with 8 Spring Boot services. It normally runs locally with docker-compose up. Our job was to take it from that Docker Compose file to a production-grade AWS deployment with:

  • Two isolated environments (dev and prod)

  • Full GitOps — Git is the single source of truth for every deployment

  • Automated CI/CD pipeline

  • Complete observability (metrics, logs, traces, alerts)

  • Zero secrets committed to Git

  • An AI chatbot powered by OpenAI GPT-4o-mini

The reason this application makes such a good capstone project is that it has real complexity: services depend on each other in a specific startup order, three services share a single MySQL database with cross-service foreign key constraints, and the frontend is served through a Spring Cloud Gateway that routes to all backend services.

Prerequisites

Before starting a project like this, you need:

  • AWS account with an IAM user (Administrator access for initial setup)

  • Domain name (we used Cloudflare — free DNS management)

  • GitHub account with a fork of spring-petclinic-microservices

  • Local tools: AWS CLI v2, Terraform ≥ 1.10, kubectl, Helm v3, Docker Desktop, yq v4, gh CLI

The Architecture

Here is the full picture of what we deployed:

The full tech stack:

| Layer | Tool | Why we chose it |
|---|---|---|
| IaC | Terraform ≥ 1.10 | Modular, reproducible, S3 state backend |
| Cluster | Amazon EKS 1.30 ARM64 | Industry standard + Graviton free trial |
| Autoscaling | Karpenter v1.1.1 | Faster than Cluster Autoscaler, better Spot support |
| Registry | Amazon ECR | Private repos, scan-on-push, lifecycle policies |
| Database | RDS MySQL 8.0 db.t4g.micro | Free tier, shared DB for all 3 domain services |
| DNS | Cloudflare | Existing domain, free, Terraform provider available |
| Secrets | AWS Secrets Manager + ESO | No secrets in Git, industry-standard pattern |
| Packaging | Helm | Single generic chart for all 8 services |
| GitOps | ArgoCD | Auto-sync dev, manual prod; full auditability |
| CI | GitHub Actions | OIDC federation, ARM64 builds, Trivy scanning |
| Observability | Prometheus, Grafana, Loki, Alertmanager, Zipkin | Full stack: metrics, logs, traces, alerts |

The Journey — Stage by Stage

Stage 1: Understanding the Application Locally

Before touching any AWS resource, we ran the application locally with Docker Compose and understood every service:

| Service | Port | Role | MySQL? |
|---|---|---|---|
| config-server | 8888 | Git-backed config for all services | No |
| discovery-server | 8761 | Eureka service registry | No |
| api-gateway | 8080 | Routes all traffic, serves frontend | No |
| customers-service | 8081 | Owners and pets data | Yes |
| visits-service | 8082 | Visit records | Yes |
| vets-service | 8083 | Vet data + Caffeine cache | Yes |
| genai-service | 8084 | AI chatbot via Spring AI | No |
| admin-server | 9090 | Spring Boot Admin dashboard | No |

The critical insight about startup order: config-server must be healthy before any other service starts — every service fetches its configuration from it on boot. discovery-server must be healthy before api-gateway can route requests. visits-service must start after customers-service because visits.pet_id has a foreign key on pets.id — a table created by customers-service.

We enforced this startup order in Kubernetes using init containers that poll health endpoints before the main container starts:

initContainers:
  - name: wait-for-config-server
    image: busybox:1.36
    command:
      - sh
      - -c
      - until wget -qO- http://config-server:8888/actuator/health; do sleep 5; done

This means Kubernetes declaratively enforces startup order — no timing hacks, no sleep commands in application code.
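
The startup rules above can be captured as data and turned into init containers mechanically. The following is a minimal sketch, not the team's actual chart logic: `EXTRA_DEPENDENCIES`, `dependencies`, and `init_containers` are hypothetical names, and it assumes every dependency exposes `/actuator/health` on the port listed in the service table.

```python
# Ports taken from the service table above.
PORTS = {
    "config-server": 8888,
    "discovery-server": 8761,
    "customers-service": 8081,
}

# Hypothetical map of the extra ordering rules described in the text.
# Every service except config-server also waits for config-server.
EXTRA_DEPENDENCIES = {
    "api-gateway": ["discovery-server"],
    "visits-service": ["customers-service"],
}


def dependencies(service: str) -> list[str]:
    """Return the services that must be healthy before `service` starts."""
    if service == "config-server":
        return []
    return ["config-server"] + EXTRA_DEPENDENCIES.get(service, [])


def init_containers(service: str) -> list[dict]:
    """Build one polling init container per dependency."""
    containers = []
    for dep in dependencies(service):
        url = f"http://{dep}:{PORTS[dep]}/actuator/health"
        containers.append({
            "name": f"wait-for-{dep}",
            "image": "busybox:1.36",
            "command": ["sh", "-c", f"until wget -qO- {url}; do sleep 5; done"],
        })
    return containers


for container in init_containers("visits-service"):
    print(container["name"])
```

Keeping the rules in one map means a new ordering constraint is a one-line change rather than a hand-edited manifest.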

Stage 2: Infrastructure as Code with Terraform

We organised Terraform into reusable modules with a clear separation between environments:

terraform/
├── environments/
│   ├── dev/          # Dev root module
│   └── prod/         # Prod root module
└── modules/
    ├── vpc/          # VPC, subnets, security groups
    ├── eks/          # EKS cluster, node groups, add-ons, IRSA
    ├── ecr/          # ECR repos, lifecycle policies
    ├── rds/          # RDS MySQL, credentials
    ├── dns/          # ACM cert, Cloudflare DNS records
    ├── secrets/      # Secrets Manager, ESO IRSA role
    ├── karpenter/    # Karpenter IAM, SQS, EventBridge
    └── github-oidc/  # GitHub Actions OIDC federation

Key design decisions and why:

All-public subnets (no NAT Gateway): NAT Gateways cost ~$35/month per environment. We used public subnets with strict security groups as the access control boundary. The RDS security group only allows port 3306 from the EKS node security group — nothing else reaches the database. Security groups replace the private subnet boundary at zero additional cost.

Separate VPCs for dev and prod: Dev uses 10.0.0.0/16, prod uses 10.1.0.0/16. Non-overlapping CIDRs mean these environments can never interfere with each other.
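
The non-overlap claim is easy to check before any `terraform apply`. This is a small illustrative check using Python's standard `ipaddress` module; the `assert_disjoint` helper is a hypothetical name, not part of the project's scripts.

```python
import ipaddress

DEV_CIDR = ipaddress.ip_network("10.0.0.0/16")
PROD_CIDR = ipaddress.ip_network("10.1.0.0/16")


def assert_disjoint(a, b):
    """Raise if two networks share any address."""
    if a.overlaps(b):
        raise ValueError(f"{a} overlaps {b}")


assert_disjoint(DEV_CIDR, PROD_CIDR)
print(DEV_CIDR.overlaps(PROD_CIDR))
```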

S3 native state locking: Terraform ≥ 1.10 supports state locking directly in S3 without DynamoDB. We wrote bootstrap-state.sh — a one-time script that creates the bucket before any terraform init runs.

Stage 3: Kubernetes and Helm

Rather than writing 8 separate sets of Kubernetes manifests, we created a single generic Helm chart that serves all 8 services. Per-service differences live in values files:

helm-values/
├── dev/
│   ├── api-gateway.yaml        # ECR dev URL, ports, env vars, image tag
│   ├── customers-service.yaml  # RDS connection, MySQL profile
│   └── ... (one per service)
├── prod/
│   └── ... (one per service)
├── dev.yaml    # replicaCount=1, no HPA, no PDB
└── prod.yaml   # replicaCount=2, HPA enabled, PDB enabled

Why separate dev/ and prod/ subdirectories? Originally we had a flat helm-values/ directory. Running generate-config.sh prod would overwrite dev values with prod ECR URLs and prod RDS endpoints. The next dev deploy would pull images from prod ECR — which didn't exist. Separating into subdirectories eliminated this entire class of cross-environment contamination bug.
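
To illustrate why the environment-scoped layout removes the overwrite problem, here is a minimal sketch of a writer that always includes the environment in the path. It is not the real generate-config.sh; `values_path` and `write_values` are hypothetical helpers, and the YAML content is made up.

```python
import tempfile
from pathlib import Path

ENVIRONMENTS = ("dev", "prod")


def values_path(root: Path, env: str, service: str) -> Path:
    """Resolve helm-values/<env>/<service>.yaml, rejecting unknown environments."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return root / env / f"{service}.yaml"


def write_values(root: Path, env: str, service: str, content: str) -> Path:
    """Write generated values into the environment's own subdirectory."""
    path = values_path(root, env, service)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "helm-values"
    dev_file = write_values(root, "dev", "api-gateway", "image:\n  tag: dev123\n")
    write_values(root, "prod", "api-gateway", "image:\n  tag: prod456\n")
    dev_after = dev_file.read_text()  # still the dev content after the prod run

print(dev_after)
```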

Stage 4: GitOps with ArgoCD

ArgoCD runs inside the cluster and continuously reconciles cluster state to match Git. Every deployment is traceable to a specific Git commit. Rolling back means reverting a commit in Git and waiting up to 3 minutes for ArgoCD to pick up the change.

Dev:  automated: { selfHeal: true, prune: true }
      → auto-deploys within 3 minutes of any Git change

Prod: automated: disabled
      → shows OutOfSync, waits for human to click Sync in ArgoCD UI

This means no accidental prod deploys. A developer pushing bad code triggers an automatic dev deployment where it can be caught — but prod stays on the last approved version until an operator explicitly approves.

Stage 5: The CI/CD Pipeline

The pipeline is split across two repositories intentionally:

App Repo → build-push.yml:
├─ paths-filter: detect ONLY changed service directories
├─ QEMU + Buildx: build linux/arm64 Docker image
├─ Trivy: security scan (CRITICAL fails the build)
├─ ECR push: petclinic-dev/{service}:{7-char-sha}
└─ repository_dispatch → infra repo

Infra Repo → update-image-tags.yml:
├─ yq: update helm-values/dev/{service}.yaml image.tag
└─ git commit + push → ArgoCD detects change → deploys

GitHub Actions never touches the cluster. It only updates a YAML file in Git. ArgoCD handles all actual deployments. No AWS cluster credentials stored in GitHub — only an OIDC-federated role scoped to ECR push.

Why paths-filter? With 8 services, rebuilding all 8 on every push is wasteful. dorny/paths-filter detects which service directories changed and only builds those.
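
The selective-build idea can be sketched in a few lines. This is not how dorny/paths-filter is implemented; it is a simplified illustration that assumes each service lives in a top-level directory named after the service (a hypothetical layout). The `changed_services` and `image_ref` helpers are illustrative names.

```python
SERVICES = [
    "config-server", "discovery-server", "api-gateway", "customers-service",
    "visits-service", "vets-service", "genai-service", "admin-server",
]


def changed_services(changed_paths):
    """Map changed file paths to the set of services that need a rebuild."""
    changed = set()
    for path in changed_paths:
        top_level = path.split("/", 1)[0]
        if top_level in SERVICES:
            changed.add(top_level)
    return sorted(changed)


def image_ref(service, commit_sha, env="dev"):
    """Build the repository:tag string using the 7-character short SHA."""
    return f"petclinic-{env}/{service}:{commit_sha[:7]}"


paths = [
    "customers-service/src/Main.java",
    "customers-service/pom.xml",
    "vets-service/pom.xml",
    "README.md",
]
for service in changed_services(paths):
    print(image_ref(service, "0123456789abcdef"))
```

Files outside any service directory, such as the README in this example, trigger no build at all.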

Stage 6: Observability Stack

Full observability running in the monitoring namespace:

  • Prometheus — scrapes /actuator/prometheus every 15 seconds from 5 of 8 services

  • Grafana — dashboards for request rate, error rate, latency, JVM metrics

  • Loki + FluentBit — FluentBit runs as a DaemonSet on every node, forwarding all container logs to Loki

  • Alertmanager — routes ServiceDown alerts to Gmail via SMTP when up == 0 for more than 1 minute

  • Zipkin — distributed tracing with 100% sampling
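
The ServiceDown condition (up == 0 held for a minute) can be illustrated with a small simulation. This is a simplified model of a "hold for a duration" alert rule, not Prometheus's actual evaluation engine; `service_down_firing` is a hypothetical helper and the samples are made up, spaced at the 15-second scrape interval mentioned above.

```python
def service_down_firing(samples, hold_seconds=60):
    """Return True if up == 0 has held continuously for hold_seconds.

    `samples` is a time-ordered list of (timestamp_seconds, up_value) pairs.
    """
    down_since = None
    for timestamp, up in samples:
        if up == 0:
            if down_since is None:
                down_since = timestamp  # start of the current outage
        else:
            down_since = None  # any healthy scrape resets the timer
    if down_since is None:
        return False
    return samples[-1][0] - down_since >= hold_seconds


outage = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0)]
blip = [(0, 1), (15, 0), (30, 1), (45, 0), (60, 0)]

print(service_down_firing(outage))
print(service_down_firing(blip))
```

The hold period is what keeps a single failed scrape from sending an email.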

Stage 7: The AI Chatbot

We integrated Spring AI + OpenAI GPT-4o-mini into genai-service. The chatbot supports 4 operations via function calling:

  • List all owners and their pets

  • Add a new owner

  • List all vets

  • Add a pet to an existing owner

The OpenAI API key lives in AWS Secrets Manager and is synced into Kubernetes by External Secrets Operator — never committed to Git.

My Role — 13 Scripts That Run the Whole Thing

As Co-Team Lead for CI/CD and automation, I wrote every operational script. The philosophy was consistent: idempotent, self-documenting, and reproducible. Any team member can run any script, get a clear output, and re-run it safely if something goes wrong.

| Script | What it does |
|---|---|
| bootstrap-state.sh | Creates S3 state bucket (run once per account) |
| tf.sh | Terraform wrapper; handles init, plan, apply without hanging |
| setup-cluster.sh | Full cluster setup from blank EKS to production-ready |
| generate-config.sh | Injects dynamic values after terraform apply |
| build-push-images.sh | Builds ARM64 images and pushes to ECR |
| smoke-test.sh | Validates all 8 services are healthy (16 checks) |
| promote-to-prod.sh | Copies image from dev ECR to prod ECR |
| seed-prod-data.sh | Seeds prod RDS with test data (run once) |
| update-dns-and-ingress.sh | Wires Cloudflare DNS to ALBs |
| setup-github-secrets.sh | Configures CI/CD secrets via gh CLI |
| pre-apply-check.sh | Imports shared resources before apply |
| pre-destroy.sh | Cleans up before terraform destroy |
| full-cleanup.sh | Destroys everything, dev and prod |

The most complex was setup-cluster.sh — a single script that installs and configures 10 components in the correct dependency order, with health checks and retry logic at every step. Any team member could run it on a fresh cluster and get a fully operational environment.
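
The "health checks and retry logic at every step" pattern looks roughly like the sketch below. The real setup-cluster.sh is a shell script; this Python version is only an illustration of the idea, and `run_with_retry` plus the fake install and health-check functions are hypothetical.

```python
import time


def run_with_retry(step, is_healthy, attempts=5, delay_seconds=0.0):
    """Run `step`, then check `is_healthy`; repeat until healthy or out of attempts.

    `step` must be idempotent, because it may run more than once.
    Returns the number of attempts used.
    """
    for attempt in range(1, attempts + 1):
        step()
        if is_healthy():
            return attempt
        time.sleep(delay_seconds)
    raise RuntimeError(f"step still unhealthy after {attempts} attempts")


# Demo with a fake component that becomes healthy on the third install attempt.
state = {"installs": 0}


def fake_install():
    state["installs"] += 1


def fake_health_check():
    return state["installs"] >= 3


attempts_used = run_with_retry(fake_install, fake_health_check)
print(attempts_used)
```

Retrying is only safe because each step is idempotent, which is why that property sits at the top of the script philosophy.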

Beyond the scripts, I also designed the two-repo GitOps architecture — the deliberate separation between the app repo (which builds images) and the infra repo (which declares cluster state). This ensures CI never has direct cluster access and every deployment is traceable to a Git commit.

Demo Day — 30 Minutes, 11 Engineers, Zero Wasted Time

After 3 weeks of building and debugging, we had exactly 30 minutes to present to Pravin Mishra. What made our presentation stand out was not just the technical work — it was the preparation and discipline around time management.

Here is the order we followed:

| Slot | Person | Topic | Time |
|---|---|---|---|
| 1 | Joy Ukpabi | Introduction and project overview | 5 min |
| 2 | Anumba Chiamaka | Scrum process and team coordination | ~2 min |
| 3 | Issac + Nice | Documentation: ADRs, runbooks, onboarding guide | ~2 min |
| 4 | Paul + Aarti | Infrastructure: Terraform modules, VPC, EKS, RDS | ~2 min |
| 5 | Lilian + Stephanie | Deployment: Helm chart, ArgoCD, GitOps flow | ~2 min |
| 6 | Ntando | Monitoring: Prometheus, Grafana, Loki, Zipkin | ~2 min |
| 7 | Pratyush | Alerting Demo + CI/CD Overview | 10 min |
| 8 | All | Q&A | 5 min |

Every team member had already done a CI/CD end-to-end demo in an earlier session with the mentor. So on final demo day, I focused the 10-minute slot on the Chaos Engineering and Alerting demo — the part that had the most dramatic visual impact.

I disabled ArgoCD selfHeal on vets-service, verified that it was actually set to false, then scaled the deployment to zero replicas. The vets page immediately started failing. Within 3 minutes, a [CRITICAL] Petclinic Alert: ServiceDown email arrived in Gmail live on screen. I restored the service, re-enabled selfHeal, and a [Resolved] email followed automatically. Smoke test: 16/16.

The Q&A session covered questions about our architectural decisions — particularly why we chose public subnets over private, and why we split CI and CD across two repositories.

The proudest moment of the entire project: Every single team member presented their own work. Nobody was a spectator. And we finished in exactly 30 minutes — not a second over. For an 11-person distributed team presenting a system of this complexity, that is something to be genuinely proud of.

The Bugs That Nearly Broke Us

Bug 1 — Alertmanager Empty Config (The Chaos Demo Breaker)

What happened: Alertmanager crashed on every startup with the error "no route provided in config". No alert emails were sent, and the chaos demo was completely broken.

The root cause chain — three nested bugs:

First, setup-cluster.sh injected Alertmanager credentials by parsing an alertmanager.yml: | block inside a Kubernetes Secret YAML file. Weeks earlier, we had removed that Secret from alertmanager.yaml to prevent ArgoCD from overwriting real credentials on every sync. The injection function now parsed a file with no Secret section, producing an empty output. Alertmanager started with a blank config.

Second, even when the config was non-empty, shell variable substitution stripped spaces from the Gmail app password. kyxc auvf mqvy dmvs became kyxcauvfmqvydmvs — an invalid password. SMTP authentication failed silently.

Third, Alertmanager accepted the bad config without crashing. It only failed when actually trying to send an email. The bug was completely invisible until the exact moment of the live demo.

The fix: Created a dedicated monitoring/alertmanager-config.yaml as a pure Alertmanager config template with placeholders. Switched credential injection from shell to Python:

content = content.replace("ALERTMANAGER_EMAIL_PLACEHOLDER", email)
content = content.replace("ALERTMANAGER_PASSWORD_PLACEHOLDER", password)
# Python string replace preserves spaces exactly. Shell tr -d ' ' does not.
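
A slightly fuller sketch of the same idea adds a guard for the first bug in the chain: if the template is missing a placeholder, rendering fails loudly instead of producing an empty or half-filled config. `render_config` is a hypothetical helper, and the template fragment and credentials below are made up for illustration.

```python
EMAIL_PLACEHOLDER = "ALERTMANAGER_EMAIL_PLACEHOLDER"
PASSWORD_PLACEHOLDER = "ALERTMANAGER_PASSWORD_PLACEHOLDER"


def render_config(template: str, email: str, password: str) -> str:
    """Substitute credentials, refusing templates that lack a placeholder."""
    for placeholder in (EMAIL_PLACEHOLDER, PASSWORD_PLACEHOLDER):
        if placeholder not in template:
            raise ValueError(f"template is missing {placeholder}")
    rendered = template.replace(EMAIL_PLACEHOLDER, email)
    return rendered.replace(PASSWORD_PLACEHOLDER, password)


# Made-up template fragment and credentials, for illustration only.
template = (
    "smtp_auth_username: 'ALERTMANAGER_EMAIL_PLACEHOLDER'\n"
    "smtp_auth_password: 'ALERTMANAGER_PASSWORD_PLACEHOLDER'\n"
)
rendered = render_config(template, "alerts@example.com", "abcd efgh ijkl mnop")
print(rendered)
```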

The lesson: Never manipulate credentials with shell string operations. The most dangerous infrastructure bugs are the ones that succeed silently and only fail at the worst possible moment.

Bug 2 — Terraform Output Hangs Forever

What happened: Any script calling terraform output would hang indefinitely. Cluster setup was completely blocked.

Root cause: A bug in TLS Terraform provider v4.3.0 causes the provider shutdown process to deadlock on certain systems.

The fix: Read outputs directly from S3 state using Python — bypasses the Terraform CLI entirely:

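import json
import subprocess

# Placeholder values: substitute your own state bucket and object key.
bucket = "my-terraform-state-bucket"
key = "dev/terraform.tfstate"

# Stream the state file to stdout ("-") instead of calling `terraform output`.
state_json = subprocess.check_output(
    ["aws", "s3", "cp", f"s3://{bucket}/{key}", "-"]
).decode()
state = json.loads(state_json)
vpc_id = state["outputs"]["vpc_id"]["value"]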

The lesson: Pin all Terraform provider versions from day one. Add version constraints to every provider in versions.tf. Never leave provider versions unpinned.

Bug 3 — Circuit Breaker Opens on api-gateway Restart

What happened: After restarting api-gateway, the genai chatbot returned "Chat is currently unavailable" for every request, even minutes later.

Root cause: api-gateway has a Resilience4j circuit breaker with a 1-second default timeout. genai-service takes 40-65 seconds to fully start. When api-gateway restarts and immediately tries to route requests, every request times out. After enough failures the circuit breaker opens, and all subsequent requests go to the fallback without even trying the service.

The fix: Added Resilience4j configuration to api-gateway Helm values:

- name: RESILIENCE4J_TIMELIMITER_INSTANCES_GENAICIRCUITBREAKER_TIMEOUTDURATION
  value: "30s"
- name: RESILIENCE4J_CIRCUITBREAKER_INSTANCES_GENAICIRCUITBREAKER_MINIMUMNUMBEROFCALLS
  value: "10"
- name: RESILIENCE4J_CIRCUITBREAKER_INSTANCES_GENAICIRCUITBREAKER_FAILURERATETHRESHOLD
  value: "80"
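
To see why the minimum-calls and failure-rate settings matter, here is a toy count-based breaker. It is a deliberately simplified model, not Resilience4j's implementation: it ignores the half-open state and wait durations, and the `CountBasedBreaker` class and its `window_size` default are assumptions made for the example.

```python
from collections import deque


class CountBasedBreaker:
    """Opens once enough calls are recorded and the failure rate hits the threshold."""

    def __init__(self, minimum_calls=10, failure_rate_threshold=80.0, window_size=100):
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold
        self.outcomes = deque(maxlen=window_size)  # True = success, False = failure
        self.is_open = False

    def record(self, success: bool) -> bool:
        """Record one call outcome and return whether the breaker is open."""
        self.outcomes.append(success)
        if len(self.outcomes) >= self.minimum_calls:
            failures = self.outcomes.count(False)
            failure_rate = 100.0 * failures / len(self.outcomes)
            if failure_rate >= self.failure_rate_threshold:
                self.is_open = True
        return self.is_open


# Nine straight timeouts during startup: still closed, below the minimum call count.
breaker = CountBasedBreaker()
for _ in range(9):
    breaker.record(False)
open_after_nine = breaker.is_open

# Seven failures and three successes: 70% failure rate, below the 80% threshold.
mixed = CountBasedBreaker()
for outcome in [False] * 7 + [True] * 3:
    mixed.record(outcome)

print(open_after_nine, mixed.is_open)
```

With a higher minimum call count, a handful of early timeouts while genai-service is still booting is not enough to trip the breaker.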

The lesson: Always restart services in dependency order. Restart the backend services the gateway depends on first, wait for them to register in Eureka, then restart the gateway.

Bug 4 — ESO ClusterSecretStore Invalid After Rebuild

What happened: After destroying and rebuilding infrastructure, External Secrets Operator showed ValidationFailed and no secrets synced to Kubernetes.

Root cause: terraform destroy + terraform apply creates a new ESO IAM role with a new ARN. The Kubernetes ServiceAccount annotation still pointed to the deleted role. AWS rejected all sts:AssumeRoleWithWebIdentity requests with 403.

The fix:

kubectl annotate serviceaccount external-secrets-sa \
  -n external-secrets \
  "eks.amazonaws.com/role-arn=${NEW_ESO_ROLE}" \
  --overwrite
kubectl rollout restart deployment/external-secrets -n external-secrets

The lesson: Always verify all IRSA role ARN annotations after a destroy/rebuild cycle.
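
That verification can be automated. The sketch below compares the role ARN annotation on each ServiceAccount with the ARN Terraform currently reports. `stale_irsa_annotations` is a hypothetical helper, the input dictionaries stand in for data you would fetch from kubectl and Terraform state, and the account ID and role names are made up.

```python
ROLE_ARN_ANNOTATION = "eks.amazonaws.com/role-arn"


def stale_irsa_annotations(service_accounts, expected_arns):
    """Return {name: (actual, expected)} for every mismatched annotation.

    `service_accounts` maps ServiceAccount name -> its annotations dict.
    `expected_arns` maps ServiceAccount name -> the current IAM role ARN.
    """
    stale = {}
    for name, expected in expected_arns.items():
        actual = service_accounts.get(name, {}).get(ROLE_ARN_ANNOTATION)
        if actual != expected:
            stale[name] = (actual, expected)
    return stale


# Made-up example data.
service_accounts = {
    "external-secrets-sa": {
        ROLE_ARN_ANNOTATION: "arn:aws:iam::111122223333:role/eso-old",
    },
}
expected_arns = {
    "external-secrets-sa": "arn:aws:iam::111122223333:role/eso-new",
}

print(stale_irsa_annotations(service_accounts, expected_arns))
```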


What Went Well

The presentation exceeded our expectations. We had 30 minutes with Pravin Mishra and used every one of them effectively. Every team member presented their own work — nobody was a spectator. The live alerting demo produced a real Gmail notification on screen, exactly on cue. Finishing a distributed 11-person presentation of a system this complex in precisely 30 minutes, with zero over-run, is a direct reflection of how much preparation and rehearsal went into demo day.

Reproducibility was fully achieved. This was the hardest requirement and the most satisfying to meet. A complete stranger can clone the repository, fill in terraform.tfvars with their own values, run 6 commands, and have a fully operational dual-environment AWS platform. Every dynamic value is injected automatically. No hidden manual steps anywhere.

Team coordination was excellent. 11 engineers, clear ownership boundaries, zero finger-pointing. When something broke, the person who owned it fixed it. The Scrum Master kept us on track without micromanaging.

The automation layer genuinely helped the team. Once the scripts were solid, team members who had never touched Terraform or kubectl could run ./scripts/setup-cluster.sh dev and get a working cluster. That is the real measure of good automation — when it removes you as a dependency.


What I Would Do Differently

Pin all Terraform provider versions from day one. One version constraint per provider in versions.tf. This would have saved hours of debugging the hanging terraform output issue.

Create per-environment Prometheus values files from the start. A single shared prometheus-values.yaml means running generate-config.sh prod overwrites dev monitoring config. Five minutes of upfront structure prevents hours of debugging later.

Add selfHeal verification to the chaos demo script from the beginning. The race condition between disabling selfHeal and scaling to zero is subtle but consistent. A five-line verification block with an explicit exit prevents failed demo runs.
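
A verification block of that kind could look like the following sketch. It assumes the ArgoCD Application has already been fetched and parsed into a dict (for example from JSON output), with selfHeal under `spec.syncPolicy.automated`. The `self_heal_enabled` and `guard_before_scale_down` helpers are hypothetical names.

```python
def self_heal_enabled(application: dict) -> bool:
    """Read spec.syncPolicy.automated.selfHeal, treating missing keys as disabled."""
    sync_policy = application.get("spec", {}).get("syncPolicy") or {}
    automated = sync_policy.get("automated") or {}
    return bool(automated.get("selfHeal", False))


def guard_before_scale_down(application: dict) -> None:
    """Exit explicitly instead of scaling to zero while selfHeal would undo it."""
    if self_heal_enabled(application):
        raise SystemExit("selfHeal is still enabled; refusing to scale to zero")


app_with_self_heal = {
    "spec": {"syncPolicy": {"automated": {"selfHeal": True, "prune": True}}}
}
app_without_self_heal = {
    "spec": {"syncPolicy": {"automated": {"selfHeal": False, "prune": True}}}
}

guard_before_scale_down(app_without_self_heal)  # passes silently
print(self_heal_enabled(app_with_self_heal))
```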

Add a --dry-run mode to every script. Scripts that can show what they would do without doing it are invaluable for debugging and onboarding new team members safely.
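
As a rough illustration of the dry-run idea, the sketch below routes every external command through one function that either prints or executes it. The project's scripts are shell, so this Python version is only a model of the pattern; `build_parser` and `run` are hypothetical names.

```python
import argparse
import shlex
import subprocess


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Example operational script")
    parser.add_argument("--dry-run", action="store_true",
                        help="print commands instead of running them")
    return parser


def run(command, dry_run):
    """Execute `command`, or only print it when dry_run is set."""
    if dry_run:
        line = "[dry-run] " + shlex.join(command)
        print(line)
        return line
    subprocess.run(command, check=True)
    return None


args = build_parser().parse_args(["--dry-run"])
preview = run(["kubectl", "get", "pods", "-n", "monitoring"], args.dry_run)
```

Funnelling every side effect through a single `run` function is what makes the flag trustworthy: nothing can bypass it by accident.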


Cost Breakdown

| Resource | Dev | Prod | Notes |
|---|---|---|---|
| EKS Control Plane | ~$73 | ~$73 | $0.10/hr, unavoidable |
| EC2 t4g.medium nodes | $0 | $0 | Graviton free trial until Dec 2026 |
| RDS db.t4g.micro | $0 | $0 | 12-month free tier |
| ECR + Secrets Manager | ~$3 | ~$3 | Minimal |
| S3, data transfer | ~$1 | ~$1 | State bucket |
| Total per env | ~$77 | ~$77 | |
| Total both envs | ~$154/month | | |

Cost tip: EKS costs $0.10/hour. Destroy after each session with ./scripts/full-cleanup.sh. Our total AWS spend for the entire 3-week project was under $15.
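
The arithmetic behind these figures is straightforward. The sketch below reproduces it, assuming roughly 730 hours in a month; the 40-hour session figure is a hypothetical example, not our measured usage.

```python
EKS_HOURLY_USD = 0.10      # control plane price quoted above
HOURS_PER_MONTH = 730      # assumption: average hours in a month
OTHER_MONTHLY_USD = 3 + 1  # ECR + Secrets Manager, plus S3 and data transfer

eks_monthly = EKS_HOURLY_USD * HOURS_PER_MONTH
per_env_monthly = eks_monthly + OTHER_MONTHLY_USD
both_envs_monthly = 2 * per_env_monthly


def control_plane_cost(hours_up: float, clusters: int = 2) -> float:
    """EKS control plane cost when clusters only exist for `hours_up` hours."""
    return clusters * EKS_HOURLY_USD * hours_up


print(round(eks_monthly), round(per_env_monthly), round(both_envs_monthly))
print(round(control_plane_cost(40), 2))
```

Because the control plane bills by the hour, tearing clusters down between sessions is what separates the always-on monthly estimate from a spend in the tens of dollars.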


Final Thoughts

Three weeks. Eleven engineers. Eight microservices on two EKS clusters with full GitOps, observability, security, and an AI chatbot. A 30-minute presentation where every team member presented their own work and we finished exactly on time.

The biggest thing this project taught me is that automation is a communication tool, not just a technical convenience. When I was the only person who understood the scripts, every deployment routed through me. The moment I made every script self-documenting, idempotent, and reproducible, my teammates could work independently.

Solo work optimises for getting something working. Team work forces you to optimise for someone else being able to understand, run, and fix what you built — without you in the room. That constraint produces better engineering. Every single time.


This project is part of the DevOps Micro Internship (DMI) by Pravin Mishra — a hands-on program that takes engineers from DevOps theory to real production infrastructure. No tutorials. No shortcuts.

DMI Cohort 3 starts 27 June 2025. If you want to build real infrastructure instead of just reading about it, apply here 👇

🔗 https://docs.google.com/forms/d/e/1FAIpQLSel7ai7nyb0P1qLW4vEyfB_nEsD4lUF1XG88vmAaFGBOb6hPA/viewform

Join the DMI community 👉 https://lnkd.in/gHYe7Mdg

