From Localhost to Production: The Complete Guide to Deploying Spring Petclinic Microservices on AWS EKS with Terraform, Helm, ArgoCD, Observability, and AI-Assisted Development

A beginner's honest account of building a production-grade AWS platform from scratch - every architecture decision, every real error, and every fix, documented in full.

Updated June 13, 2026

97 min read

PRATYUSH PAHARI

Chapter 1: Introduction — The Real Story

Knowing Kubernetes is not the same as running Kubernetes in production.

I knew that difference existed before I started this project. During the DevOps Micro Internship (DMI), I learned Docker properly, deployed monolith apps on the cloud using Docker Compose, and studied Kubernetes well beyond the basics - Pods, Deployments, Services, Ingress, Horizontal Pod Autoscalers, the whole thing. I felt reasonably confident.

Then came the actual task:

Take 8 Spring Boot microservices. Provision the entire AWS infrastructure using Terraform - VPC, EKS cluster, RDS database, container registry, load balancer, DNS, TLS certificates. Package every service with Helm. Set up GitOps continuous deployment using ArgoCD. Build a CI/CD pipeline on GitHub Actions with zero hardcoded AWS credentials. Add a full observability stack - Prometheus, Grafana, Loki, FluentBit, Zipkin, Alertmanager. Make it production-grade.

That's when I realized there's a canyon between "I know what these tools are" and "I can make all of them work together."

Knowing what a Deployment is doesn't tell you how a GitHub Actions workflow triggers an image push to ECR, which fires an event to update a Git file, which ArgoCD detects to roll out new pods, all on a Kubernetes cluster whose nodes were provisioned by Terraform and whose secrets are pulled from AWS Secrets Manager by an operator running inside the cluster itself. That chain, and understanding how everything connects, is what this project actually teaches.

I hit real walls. A UTF-8 BOM character hiding inside a YAML file silently broke my entire ArgoCD deployment for hours. My Terraform DNS module switched between Cloudflare and Route 53 twice before I understood why Cloudflare was the right call. My CI pipeline failed in six different places before it ran clean. My destroy script left orphaned AWS resources the first three times I ran it.

Every single one of those moments is in this guide - what broke, why it broke, and exactly how to fix it.

📸 Screenshot: The fully deployed Spring Petclinic web UI running on your real domain with HTTPS

Who This Guide Is For

This guide is written for someone who already understands Docker and the fundamentals of Kubernetes. You don't need to know Terraform, Helm, ArgoCD, or AWS in depth. You don't need prior experience with GitOps or CI/CD pipelines.

What you do need is the willingness to read carefully, run commands, and not skip the explanations — because the explanations are where the actual learning is.

By the end of this guide, you will have built and deployed a real, production-grade platform on AWS from scratch. Not a toy project. Not a localhost demo. A working system with real infrastructure, real automation, and real monitoring — the kind you'd find at an actual company.

The Application: Spring Petclinic Microservices

The app you're deploying is called Spring Petclinic Microservices — an official sample application from the Spring team. It's a veterinary clinic management system: owners, pets, vets, visits. Simple enough to understand in five minutes, but split into 8 independent services that each run separately and communicate with each other over the network.

This split is what makes it a real microservices application — and what makes deploying it genuinely hard. In a monolith, you deploy one thing. Here, you have 8 services with inter-dependencies, a specific startup order, shared database constraints, and individual configuration requirements. This is exactly the pattern you'll encounter in real companies.

Here are all 8 services:

config-server (port 8888) — Every other service pulls its settings from here on startup. It must be the first thing running. If it's not ready, everything else fails to start.
discovery-server (port 8761) — Acts as a directory. Services register themselves here and find each other by name instead of hardcoded addresses. Must start right after config-server.
api-gateway (port 8080) — The only service reachable from the internet. All external traffic hits here first and gets routed to the right service behind the scenes.
customers-service (port 8081) — Handles pet owners and pets. Uses a MySQL database and creates the owners, pets, and types tables.
visits-service (port 8082) — Handles visit records. Also uses MySQL, and has a foreign key that points to the pets table created by customers-service. This service will crash if customers-service hasn't run its database setup first.
vets-service (port 8083) — Handles vet profiles and specialties. Uses MySQL.
genai-service (port 8084) — An AI-powered assistant using OpenAI. Needs an API key to work.
admin-server (port 9090) — A dashboard showing the live health and status of all other services.

The startup order is not a preference - it's a hard constraint enforced at the infrastructure level. We handle it using Kubernetes init containers, which are small containers that run before your main app container and block it from starting until a condition is met. This is covered in Chapter 8.
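
The blocking behaviour of an init container boils down to a small retry loop. Here is a minimal sketch of that idea, not the project's actual init container: the function name wait_until, the retry counts, and the /actuator/health path in the usage comment are all assumptions made for illustration.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or gives up after MAX_ATTEMPTS tries.
wait_until() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Local demonstration with a command that succeeds on the first try.
wait_until 3 0 true && echo "dependency is ready"

# Hypothetical use inside an init container (assumed URL, not run here):
# wait_until 60 2 curl -sf http://config-server:8888/actuator/health
```

Because the init container only exits successfully once the check passes, Kubernetes does not start the main application container until then.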

What You Will Have Built by the End

Cloud Infrastructure (managed by Terraform):

A private AWS network (VPC) with security rules controlling exactly what can talk to what
A Kubernetes cluster on AWS (EKS) running on ARM64 Graviton instances — free tier eligible
A MySQL database (RDS) — encrypted, with credentials auto-generated and stored safely
A private Docker image registry (ECR) with lifecycle policies and vulnerability scanning
An Application Load Balancer that handles all incoming traffic
A real domain with HTTPS, certificate provisioned and validated automatically

Kubernetes (inside the cluster):

All 8 services running in the correct order with proper health checks
Automatic scaling when load increases
Network isolation so services can only talk to what they're allowed to
Secrets synced automatically from AWS Secrets Manager — never stored in code or YAML

Automation:

Push code → Docker images build and push → cluster updates automatically — zero manual steps
A single script (up.sh) that deploys the entire platform from scratch in one run
A single script (destroy.sh) that tears everything down safely without leaving orphaned resources

Observability:

Prometheus collecting metrics from all 8 services every 15 seconds
Grafana dashboards showing request rates, error rates, and JVM memory
Loki collecting logs from every pod
Zipkin showing distributed traces across service calls
Alertmanager sending alerts when something goes wrong

A Straight Answer on Cost

AWS charges for everything. Here's exactly what this project costs:

EKS control plane: $73/month — Unavoidable. AWS charges this for every Kubernetes cluster regardless of how many nodes run on it.
EC2 compute (4 × t4g.small): $0 — Covered by the AWS Graviton free trial (ARM64 instances).
RDS MySQL: $0 — Covered by AWS free tier.
Load Balancer: $0 — Covered by AWS free tier.
ECR, Secrets Manager, EBS storage: ~$5–7/month combined.
Total: roughly $80/month per environment when running.

The guide includes scripts to scale your cluster down to zero nodes (without destroying it) when you're not actively using it, which cuts the cost to the $73 control plane charge only.
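
If you want to sanity-check the total yourself, the arithmetic is simple enough to script. This sketch only adds up the figures quoted above; the function name monthly_total is made up for the example, and the numbers will differ if your usage falls outside the free allowances.

```shell
# Adds up the monthly line items (whole US dollars) quoted in this section.
monthly_total() {
  local control_plane="$1"
  local compute="$2"
  local database="$3"
  local load_balancer="$4"
  local misc="$5"
  echo $((control_plane + compute + database + load_balancer + misc))
}

low=$(monthly_total 73 0 0 0 5)    # ECR, Secrets Manager, EBS at the low end
high=$(monthly_total 73 0 0 0 7)   # ...and at the high end
echo "Estimated monthly total: \$${low}-\$${high}"
# prints: Estimated monthly total: $78-$80
```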

How This Guide Is Organized

Every chapter is focused and self-contained. Skip what you already know, read deeply on what you don't, and use the errors section at the end as a reference whenever something breaks.

Let's build something real.

Chapter 2: What You're Building - The Full Architecture Explained Simply

Before writing a single line of code or running a single command, you need a clear picture of what the final system looks like. Most guides skip this and jump straight into tools. That's a mistake — because when something breaks, and it will, the only way to debug it is to understand how all the pieces connect.

So let's walk through the entire architecture in plain English first.

The Big Picture

Think of what you're building as four layers stacked on top of each other:

The cloud infrastructure layer - the actual AWS resources: networks, servers, databases, storage
The Kubernetes layer - the system that runs your 8 services on those servers
The deployment layer - the automation that gets your code from GitHub into the cluster
The observability layer - the tools that let you see what's happening inside the running system

Every chapter in this guide builds one or more of these layers. Let's go through each one.

Layer 1: The AWS Infrastructure

Everything lives in eu-central-1, AWS's Frankfurt region. You can use any region you prefer, but all the code in this project defaults to Frankfurt.

The VPC - Your Private Corner of the Cloud

A VPC (Virtual Private Cloud) is simply your own isolated section of AWS's network. Think of it as a building with rooms inside — you decide who gets in and what each room can talk to.

In this project, the VPC has:

A main network block of 10.0.0.0/16 (about 65,000 private IP addresses to use)
2 public subnets spread across 2 different AWS data centres (called Availability Zones) for resilience
An Internet Gateway — the front door that connects your VPC to the internet
4 Security Groups — the rules controlling exactly what traffic is allowed in and out
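
The "about 65,000 addresses" figure for the /16 block comes straight from the prefix length. A quick sketch of the calculation follows; the function name cidr_size is made up for illustration, and the /24 example is a hypothetical subnet size, not necessarily the one this project uses.

```shell
# Number of IPv4 addresses in a CIDR block with the given prefix length.
cidr_size() {
  local prefix="$1"
  echo $((1 << (32 - ${prefix})))
}

cidr_size 16   # prints 65536 (the VPC block 10.0.0.0/16)
cidr_size 24   # prints 256   (a hypothetical /24 subnet)
```

AWS reserves five addresses in every subnet, so the usable count per subnet is slightly lower than the raw number.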

Why only public subnets? Isn't that insecure?

The standard advice is to put your servers in private subnets with no direct internet access, and use a NAT Gateway to let them reach the internet. The problem: a NAT Gateway costs roughly $35–65 per month per Availability Zone. For a learning project, that's an unnecessary expense.

The alternative is to keep everything in public subnets but use Security Groups as your security perimeter instead. A Security Group is a set of firewall rules: if no rule explicitly allows the traffic, it's blocked, regardless of whether the resource is in a public or private subnet. For a learning project this is an acceptable trade-off, and it costs nothing extra.

EKS - Managed Kubernetes on AWS

EKS stands for Elastic Kubernetes Service. It's AWS's managed Kubernetes offering — meaning AWS runs and maintains the Kubernetes control plane (the brain of the cluster) for you, and you just manage the worker nodes (the servers your apps actually run on).

The cluster in this project runs:

Kubernetes version 1.34
4 worker nodes of type t4g.small — small ARM64 Graviton servers
Minimum 2 nodes, maximum 4, with autoscaling handled by Karpenter

Why ARM64 Graviton?

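AWS offers a free trial for t4g Graviton (ARM64) instances: 750 free hours per month per instance type. The allowance is shared by all instances of that type in your account, so it is used up faster when several nodes run at once. During the trial period, your compute cost can be effectively zero as long as total usage stays within that allowance. The only catch is that your Docker images must be built for the linux/arm64 platform, not the standard linux/amd64. The CI pipeline handles this automatically.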
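
When you build images locally, it helps to know whether your machine produces linux/arm64 images natively or needs emulation. The sketch below maps the output of uname -m to a Docker platform string. The function name docker_platform_for is an assumption for this example and is not part of the project's scripts.

```shell
# Maps a machine type (as reported by `uname -m`) to a Docker platform string.
docker_platform_for() {
  case "$1" in
    aarch64|arm64) echo "linux/arm64" ;;
    x86_64|amd64)  echo "linux/amd64" ;;
    *)             echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# Show the native platform of the current machine.
# `|| true` keeps the script going on architectures this sketch does not list.
docker_platform_for "$(uname -m)" || true
```

If this prints linux/amd64, a local build for the cluster needs an explicit target, for example with docker buildx build --platform linux/arm64.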

RDS - Your Managed MySQL Database

RDS (Relational Database Service) is AWS's managed database offering. You don't manage the database server yourself — AWS handles backups, patching, and availability. You just connect to it.

This project runs MySQL 8.0 on a db.t4g.micro instance (free tier). Three of the 8 services use this database: customers-service, visits-service, and vets-service. All three share a single database called petclinic but use separate tables — which is why they can share one RDS instance instead of needing three separate databases.

The database credentials (username and password) are auto-generated by Terraform and stored immediately in AWS Secrets Manager. They are never written into any file or committed to Git.

ECR - Your Private Docker Registry

ECR (Elastic Container Registry) is where your Docker images live. Instead of pushing images to Docker Hub (public), they go to a private registry in your own AWS account. This project creates 8 ECR repositories — one per service — in each environment.

Images are tagged with the short commit SHA from Git (for example, a3f4bc1). This means you can always trace exactly which version of the code is running in your cluster.
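
To make the tagging scheme concrete, here is a rough sketch of how a short SHA becomes a full image reference. The account ID 123456789012 and the repository name petclinic-dev-customers-service are placeholders chosen for illustration; the helper names short_sha and image_uri are also made up. In a real pipeline the short SHA would typically come from git rev-parse --short=7 HEAD.

```shell
# Truncates a full commit SHA to its first 7 characters.
short_sha() {
  printf '%s\n' "$1" | cut -c1-7
}

# Builds an ECR image reference from its parts.
image_uri() {
  local account_id="$1"
  local region="$2"
  local repository="$3"
  local tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' \
    "$account_id" "$region" "$repository" "$tag"
}

tag="$(short_sha a3f4bc1d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b)"
image_uri 123456789012 eu-central-1 petclinic-dev-customers-service "$tag"
# prints: 123456789012.dkr.ecr.eu-central-1.amazonaws.com/petclinic-dev-customers-service:a3f4bc1
```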

Application Load Balancer - Your Traffic Router

When someone types your domain name into a browser, the request hits an Application Load Balancer (ALB) first. The ALB holds your TLS certificate (the thing that makes HTTPS work), terminates the encryption, and forwards the plain HTTP request to the right Kubernetes service inside the cluster.

The ALB is not created directly by Terraform. Instead, it's created automatically by the AWS Load Balancer Controller - a piece of software running inside Kubernetes that watches for Ingress objects and provisions the ALB in AWS on your behalf. You write a Kubernetes Ingress YAML file, and the controller handles the rest.

📸 Screenshot: AWS Console showing the Application Load Balancer with its DNS name and a healthy target group

Layer 2: Kubernetes - What Runs Inside the Cluster

Inside the EKS cluster, there are three categories of things running:

Your 8 Application Services

These run in a Kubernetes namespace called petclinic-dev (or petclinic-prod for production). Each service is packaged as a Helm chart and deployed by ArgoCD. Every service has:

A Deployment that keeps the right number of pods running
A Service that gives the pods a stable internal network address
Health checks (readiness and liveness probes) that Kubernetes uses to know if a pod is healthy
Resource limits - each pod is capped at 512 MB of memory and 500m CPU to prevent one service from starving the others
Init containers that enforce the startup ordering described in Chapter 1

The Supporting Add-ons

Three additional components are installed into the cluster to make everything work:

AWS Load Balancer Controller - Watches for Ingress objects and provisions the ALB in AWS. Covered in Chapter 6.
External Secrets Operator (ESO) - Continuously syncs secrets from AWS Secrets Manager into Kubernetes Secrets, refreshed every hour. This means your pods always have the latest credentials without a redeployment. Covered in Chapter 7.
Karpenter - A node autoscaler. When pods can't be scheduled because there aren't enough servers, Karpenter provisions a new EC2 instance in seconds and joins it to the cluster. When the load drops, it drains and terminates the extra nodes. Covered in Chapter 10.

The Observability Stack

Running in a separate monitoring namespace:

Prometheus - Scrapes metrics from all 8 services every 15 seconds
Grafana - Dashboards built on top of those metrics
Loki - Collects and stores logs from every pod
FluentBit - A tiny agent running on every node that reads container logs and forwards them to Loki
Zipkin - Collects distributed traces: records of how a single request flows through multiple services
Alertmanager - Receives alerts from Prometheus and routes them to your email or Slack

Layer 3: The Deployment Pipeline - How Code Gets to Production

This is where most people get confused, so let's walk through it step by step.

The project uses two separate GitHub repositories:

The application repo (spring-petclinic-microservices) — Contains the Java code for all 8 services. You fork this repository. It has a CI pipeline (build-push.yml) that builds Docker images and pushes them to ECR.
The platform repo (petclinic-platform) — Contains all the infrastructure code: Terraform, Helm charts, Kubernetes manifests, and ArgoCD configuration. This is the repository this guide is based on.

Here's what happens every time code is pushed to the application repo:

Developer pushes code to the application repo
↓
GitHub Actions builds 8 Docker images for linux/arm64
↓
Trivy scans images for security vulnerabilities
↓
Images pushed to ECR with a 7-character commit SHA tag
↓
GitHub Actions sends a signal (repository_dispatch) to the platform repo
↓
Platform repo's CI updates the image tag in a YAML file and commits it to Git
↓
ArgoCD detects the Git change and rolls out the new pods in the cluster
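
The "updates the image tag in a YAML file" step is the hinge of the whole flow, so here is a minimal sketch of what such an update could look like. It assumes a values file with a tag: key. The file layout and the function name set_image_tag are hypothetical, and the real workflow in the platform repo may do this differently.

```shell
# Rewrites the value of every "tag:" key in a YAML file, keeping indentation.
set_image_tag() {
  local values_file="$1"
  local new_tag="$2"
  local tmp
  tmp="$(mktemp)"
  sed -E "s/^([[:space:]]*tag:[[:space:]]*).*/\1\"${new_tag}\"/" \
    "$values_file" > "$tmp"
  mv "$tmp" "$values_file"
}

# Demonstration on a throwaway file with an assumed layout.
demo_file="$(mktemp)"
printf 'image:\n  repository: customers-service\n  tag: "0000000"\n' > "$demo_file"
set_image_tag "$demo_file" "a3f4bc1"
cat "$demo_file"
rm -f "$demo_file"
```

After a change like this is committed, nothing else has to reach into the cluster: ArgoCD sees the new commit and reconciles the Deployment to the new tag.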

Notice that GitHub Actions never touches the cluster directly. There's no kubectl apply in CI. No cluster credentials stored in GitHub. ArgoCD watches Git and pulls changes - the cluster initiates its own updates. This pattern is called GitOps, and it's the right way to manage Kubernetes deployments because Git becomes the single source of truth for what's running in production at any moment.

📸 Screenshot: GitHub Actions showing a completed update-image-tags workflow run with the commit message showing the service name and SHA

Layer 4: Observability - Seeing Inside the Running System

Once everything is running, you need to be able to answer three questions at any time:

Is my app healthy? → Prometheus metrics + Grafana dashboards
What did my app do? → Loki logs + FluentBit
Why did this request fail? → Zipkin distributed traces

📸 Screenshot: Grafana dashboard showing the service overview — request rates, error rates, and response times for all 8 services

📸 Screenshot: Zipkin trace view showing a single API request travelling through api-gateway → customers-service, with timing for each hop

Two Environments: Dev and Prod

The entire platform is built to run in two isolated environments: dev and prod. Both use the same Terraform modules and the same Kubernetes configuration, but with different settings:

Dev - 1 replica per service, no autoscaling, ArgoCD deploys automatically on every Git change
Prod - 2+ replicas per service, autoscaling enabled, ArgoCD requires a manual approval before deploying

This separation means you can break dev without touching prod, test infrastructure changes safely, and practice real deployment workflows.

The Repository Structure

Before you start working, it helps to know where everything lives in the platform repository:

Every chapter in this guide maps directly to one or more of these folders.

Chapter 3: Prerequisites - Every Tool You Need and How to Set It Up

Nothing is more frustrating than getting halfway through a guide and discovering you're missing something you needed from the start. This chapter is your complete checklist. Set everything up here before touching any infrastructure code.

Accounts You Need to Create

1. AWS Account

Everything in this project runs on AWS. If you don't have an account, go to https://aws.amazon.com and create one. Use a personal email address, not a work one.

When creating the account, AWS will ask for a credit card. You won't be charged immediately — the resources in this project are designed to stay within free tier limits — but the card is required to activate the account.

Important: After creating your account, enable MFA (Multi-Factor Authentication) on your root account immediately. The root account has unlimited access to everything in your AWS account. Go to IAM → Security credentials → Assign MFA device. Use Google Authenticator or Authy on your phone.

Once your account is active, create an IAM user for day-to-day use instead of using the root account. In the AWS Console, go to IAM → Users → Create user. Give it AdministratorAccess for this project (you can tighten permissions later). Generate access keys for this user - you'll need them in a moment.

2. GitHub Account

You need a GitHub account to host the repositories and run the CI/CD pipelines. If you don't have one, create it at https://github.com.

3. Cloudflare Account

DNS for this project is managed through Cloudflare. You need a domain name — buy one from any registrar (Namecheap, GoDaddy, Cloudflare itself) and then either transfer it to Cloudflare or point its nameservers to Cloudflare. Cloudflare's free plan is sufficient for this project.

Once your domain is on Cloudflare, you need to generate an API token with two permissions: Zone:Read and DNS:Edit. Go to Cloudflare → My Profile → API Tokens → Create Token. Save this token; you'll need it when running Terraform.

4. OpenAI Account (Optional)

The genai-service needs an OpenAI API key to function. If you skip this, the service will still deploy, but the AI assistant feature won't work. You can get an API key at https://platform.openai.com.

Tools to Install on Your Machine

Git

Check if Git is already installed: git --version

If not, download it from https://git-scm.com. On Windows, use Git Bash as your terminal for all the commands in this guide — it behaves like a Linux terminal and avoids path conversion issues that cause problems with some scripts.

AWS CLI v2

The AWS CLI is a command-line tool that lets you interact with your AWS account directly from the terminal. Terraform and the project scripts both use it under the hood.

Download and install from https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html.

After installing, configure it with the access keys you generated for your IAM user:

aws configure

It will prompt you for four things:

AWS Access Key ID: [paste your access key]
AWS Secret Access Key: [paste your secret key]
Default region name: eu-central-1
Default output format: json

Verify it works: aws sts get-caller-identity

If you see your account ID, user ARN, and user ID printed back, you're connected.

📸 Screenshot: Terminal showing the output of aws sts get-caller-identity with account ID and user ARN visible

Terraform

Terraform is the tool that reads your infrastructure code and creates real AWS resources from it. This project requires Terraform version 1.6 or higher.

Install from https://developer.hashicorp.com/terraform/install. On Windows, the easiest way is to download the binary directly, unzip it, and add it to your PATH.

Verify: terraform --version

kubectl

kubectl is the command-line tool for interacting with Kubernetes clusters. You'll use it to check pod status, read logs, and debug issues.

Install from https://kubernetes.io/docs/tasks/tools/.

Verify: kubectl version --client

Helm

Helm is the package manager for Kubernetes. It takes a chart (a template for Kubernetes resources) and fills in values to produce ready-to-deploy manifests. This project requires Helm version 3.x.

Install from https://helm.sh/docs/intro/install/.

Verify: helm version

Docker Desktop

Docker Desktop runs Docker on your machine and is required to build container images locally (though in practice the CI pipeline builds images — Docker Desktop is mainly needed for local testing and troubleshooting).

Download from https://www.docker.com/products/docker-desktop/.

After installing, open Docker Desktop and make sure the engine is running before you proceed. Verify: docker --version

Claude Code

Claude Code is the AI-assisted development tool used throughout this project. It's a CLI that runs in your terminal and lets you interact with Claude, an AI model from Anthropic, to write, review, and debug infrastructure code.

Install it with: npm install -g @anthropic-ai/claude-code

Then start it: claude

The first time you run it, you'll be prompted to log in with your Anthropic account. Chapter 4 covers exactly how to use Claude Code to build infrastructure — it's one of the biggest levers in this entire project.

📸 Screenshot: Claude Code running in the terminal, showing the interactive prompt ready for input

Optional but Recommended: k9s

k9s is a terminal-based UI for Kubernetes. Instead of running kubectl commands one by one, k9s gives you a live view of your cluster - pods, logs, resource usage - all in one screen. It makes debugging significantly faster.

Install from https://k9scli.io/.

A Note for Windows Users

All scripts in this project are written in Bash. On Windows, use Git Bash (which comes with Git for Windows) or WSL2 (Windows Subsystem for Linux) as your terminal throughout this guide. Do not use PowerShell or Command Prompt - the scripts will not work correctly.

There is one specific Windows issue worth knowing about upfront: Git Bash on Windows sometimes automatically converts arguments that look like Unix-style paths (anything starting with /, such as an SSM parameter name) into Windows paths, which breaks certain AWS CLI commands. The fix is to prefix the problematic command with MSYS_NO_PATHCONV=1:

MSYS_NO_PATHCONV=1 aws ssm put-parameter ...

You'll see this pattern in the project scripts wherever this issue can occur. Chapter 11 covers this in more detail in the errors section.

Fork the Repositories

This project uses two GitHub repositories. You need your own copies (forks) of both.

Fork 1 — The application repo:

Go to https://github.com/spring-attic/spring-petclinic-microservices and click Fork. This is where the Java source code for all 8 services lives. You'll add the CI pipeline (build-push.yml) to this fork.

Fork 2 — The platform repo:

Go to https://github.com/paharipratyush/petclinic-platform and click Fork. This is the infrastructure repository — everything else in this guide lives here.

Clone your platform repo fork to your machine:

git clone https://github.com/YOUR_USERNAME/petclinic-platform.git

cd petclinic-platform

📸 Screenshot: GitHub showing both forked repositories in your account

Quick Verification Checklist

Before moving to the next chapter, confirm that all of these work:

git --version # Should show 2.x or higher

aws --version # Should show aws-cli/2.x

aws sts get-caller-identity # Should show your AWS account ID

terraform --version # Should show 1.6.x or higher

kubectl version --client # Should show a version number

helm version # Should show v3.x

docker --version # Should show Docker version

claude --version # Should show Claude Code version

If any of these fail, fix them before continuing. Every tool in this list is used somewhere in the chapters ahead.
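
If you'd rather not run the checks one by one, a small loop can report which tools are missing from your PATH. This is a convenience sketch, not a script from the project, and the function name missing_tools is made up. It only checks that each command exists, not that its version is new enough.

```shell
# Prints the name of every given tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools git aws terraform kubectl helm docker claude)"
if [ -n "$missing" ]; then
  echo "Missing tools:"
  echo "$missing"
else
  echo "All required tools are on PATH."
fi
```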

Chapter 4: Working with Claude Code - Your AI DevOps Partner

This chapter is different from every other chapter in this guide. Every other chapter covers infrastructure. This one covers how to build infrastructure at a speed and quality level that would be impossible alone.

Let me explain what I mean.

What Claude Code Actually Is

Claude Code is an AI assistant that lives inside your terminal. But calling it just an "assistant" undersells what it actually does.

Most AI tools work like a smarter search engine - you ask a question, and they answer it. Claude Code works differently. It can read every file in your project, write and edit code, run terminal commands, search your codebase, and connect to external tools - all autonomously, step by step, to complete a task you describe.

In DevOps terms: instead of asking Claude "how do I write a Terraform module for EKS," you tell it "write me a production-ready EKS Terraform module following the conventions in this project, using the latest AWS provider version, with OIDC enabled and ARM64 Graviton nodes." It then reads your existing code, checks the Terraform registry for the latest provider, writes the module, validates it, and tells you what to review.

This is called an agentic workflow - Claude takes a series of steps on its own to complete a goal, rather than just answering one question.

📸 Screenshot: Claude Code running in the terminal mid-task, showing it reading files, making edits, and running terraform validate autonomously

Setting Up Claude Code for This Project

You installed Claude Code in Chapter 3. Now, let's configure it properly for this project.

Navigate to your platform repository:

cd petclinic-platform

claude

Claude Code automatically reads a file called CLAUDE.md at the root of the repository when it starts. This file is the most important configuration in the whole project from an AI workflow perspective.

CLAUDE.md - Teaching Claude About Your Project

CLAUDE.md is a plain Markdown file that you write once and that Claude reads every time it starts in this project. Think of it as the briefing document you'd give a new team member on their first day, except this team member reads it perfectly, every time, without forgetting anything.

The CLAUDE.md in this project tells Claude:

The full directory structure and what each folder contains
Every naming convention: resource names, tag names, file naming
Which tools are used and what versions
Security rules that must never be broken (no secrets in code, no terraform destroy without approval)
How Terraform modules are structured
Kubernetes conventions: labels, probes, resource limits
The CI/CD pipeline design and the two-repo split

Here's a small excerpt from the actual CLAUDE.md in this project:

Terraform Conventions

Provider: AWS provider ~> 5.0, region eu-central-1
Naming: petclinic-{env}-{resource}
Tagging: Every resource MUST have tags: Project=petclinic, Environment={dev|prod}, ManagedBy=terraform
Never hardcode secrets. Use sensitive = true for secret outputs.

When Claude reads this, every piece of Terraform that it writes automatically follows these conventions - correct naming, correct tags, correct region. You don't have to repeat yourself in every prompt.
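
Conventions like these are easy to check mechanically, which is part of why they work well as instructions. As a rough illustration, here is a sketch that builds a name following the petclinic-{env}-{resource} pattern and rejects unknown environments. The function name resource_name is hypothetical and does not exist in the project.

```shell
# Builds a resource name following the petclinic-{env}-{resource} convention.
# Only the two environments named in the convention are accepted.
resource_name() {
  local env="$1"
  local resource="$2"
  case "$env" in
    dev|prod) printf 'petclinic-%s-%s\n' "$env" "$resource" ;;
    *)        echo "unknown environment: $env" >&2; return 1 ;;
  esac
}

resource_name dev vpc    # prints petclinic-dev-vpc
resource_name prod eks   # prints petclinic-prod-eks
```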

Practical tip: The quality of what Claude produces is directly tied to the quality of your CLAUDE.md. The more precise your conventions and rules are in that file, the more consistent and correct the output will be.

MCP Servers - Giving Claude Access to External Tools

MCP stands for Model Context Protocol. Think of MCP servers as plugins for Claude - they give it the ability to reach out to external services and get real, live information rather than relying only on what it was trained on.

This project has five MCP servers configured in .mcp.json:

awslabs.terraform-mcp-server - Connects Claude directly to the Terraform registry. When writing a Terraform module, Claude can check the exact latest version of the AWS provider, look up the correct resource arguments, and even run Checkov security scans on the code it just wrote, all without leaving the terminal.
aws-knowledge-mcp - Connects Claude to the official AWS documentation. Instead of Claude guessing how an AWS service works based on training data that might be outdated, it searches the live AWS docs and reads the current behaviour. This matters a lot with services like EKS, where configuration changes between Kubernetes versions.
awslabs.aws-pricing-mcp-server - Lets Claude look up real AWS pricing during design decisions. When choosing between instance types or storage options, Claude can fetch current costs and give you an accurate trade-off rather than an approximate one.
context7 - Connects Claude to up-to-date documentation for libraries and frameworks. Used heavily when writing Helm chart templates, Kubernetes manifests, and GitHub Actions workflows where exact API field names and syntax matter.
atlassian - Connects Claude to Jira. In this project, every piece of work was tracked as a Jira story (PETPLAT-xxx). Claude could read the acceptance criteria of a story and implement it directly, then update the ticket status when done.

📸 Screenshot: .mcp.json file open in the editor showing all five MCP server configurations

Safety Hooks - Preventing Expensive Mistakes

One of the most valuable things about Claude Code is that it can run commands. One of the most dangerous things about Claude Code is that it can run commands.

On cloud infrastructure, a mistyped command or an overeager AI can destroy weeks of work and generate unexpected AWS charges in seconds. The project solves this with hooks - shell scripts that run automatically before certain actions and either warn you or block the action entirely.

These hooks are configured in .claude/settings.json and run without any manual trigger:

block-destroy.sh - Completely blocks terraform destroy and any kubectl delete on production namespaces. If you or Claude try to run either, the command is intercepted and rejected.
block-dangerous-rm.sh - Blocks rm -rf on critical directories: terraform/, k8s/, helm/, .github/. Prevents accidental deletion of infrastructure code.
warn-apply-without-plan.sh - If you try to run terraform apply without a saved plan file, it warns you first. You should always review a plan before applying it.
block-secret-commit.sh - Blocks git add or git commit if any .env, .tfvars, .pem, or .key files are staged. Stops secrets from ever reaching GitHub.
suggest-validate.sh - After Claude edits any .tf, Kubernetes YAML, or Helm file, it automatically reminds you to run terraform validate or helm lint.

📸 Screenshot: Terminal showing the block-destroy hook intercepting a terraform destroy command and printing a blocked message

These hooks are not optional niceties - they are guardrails that saved real work during this project. Claude is highly capable, but it is not infallible, and cloud infrastructure has no "undo" button.
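The project's actual block-destroy.sh is not reproduced here, but the decision logic behind a hook like it can be sketched in a few lines. This is a minimal sketch under assumptions: the production namespace name (petclinic-prod) is guessed from the dev namespace used elsewhere in this guide, the function names are made up for illustration, and the "exit with code 2 to block" convention is my understanding of how Claude Code pre-tool hooks signal a rejection, so check the current hook documentation before relying on it.

```shell
#!/usr/bin/env bash
# Sketch of a destroy guard. The namespace name and the exit-code
# convention are assumptions, not the project's actual script.

PROD_NAMESPACE="petclinic-prod"   # assumed, based on petclinic-dev

# Returns 0 (blocked) when the command looks destructive, 1 otherwise.
is_blocked_command() {
  case "$1" in
    *"terraform destroy"*) return 0 ;;
    # kubectl delete is blocked only when it targets the production namespace.
    *"kubectl delete"*"$PROD_NAMESPACE"*) return 0 ;;
    *"kubectl"*"$PROD_NAMESPACE"*" delete"*) return 0 ;;
  esac
  return 1
}

# Print a reason on stderr and return 2 when the command must be rejected.
guard() {
  if is_blocked_command "$1"; then
    echo "BLOCKED: '$1' is not allowed from this session" >&2
    return 2
  fi
  return 0
}
```

In a real hook the command string would first have to be extracted from whatever input the hook receives, and the script would finish with something like guard "$cmd" || exit 2. Simple substring matching like this is easy to bypass with unusual spacing or aliases, so treat it as a guardrail against accidents rather than a security boundary.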

How to Actually Use Claude Code: Writing Good Prompts

The quality of Claude's output depends almost entirely on how clearly you describe the task. Here are the patterns that worked consistently throughout this project.

Be specific about what already exists: "Write a Terraform module for RDS MySQL 8.0. Follow the same structure as the existing VPC module in terraform/modules/vpc/. The module should output the endpoint, port, and secret ARN."

Reference the spec: "Implement the EKS module according to section 5 in docs/technical-spec.md. Use the exact Kubernetes version, node instance types, and add-on versions listed there."

Ask for review, not just writing: "Review the Helm chart in helm/petclinic-service/ against the conventions in CLAUDE.md. List anything that doesn't match."

Use multi-agent reviews for big changes: "Run a full security audit across the Terraform modules, Kubernetes manifests, and CI/CD pipeline. Use multiple agents in parallel and consolidate the findings."

That last type of prompt, with multiple agents working in parallel, is what Claude Code calls an agentic workflow. Claude spawns several sub-agents, each focusing on a different part of the codebase simultaneously, then synthesises their findings. A review that would take a human hours gets done in minutes.

The Workflow That Built This Project

Here's the actual pattern used to build every piece of infrastructure in this project:

Step 1 - Start with the Jira story. Each feature had a Jira ticket with acceptance criteria. Claude read the ticket and the relevant section of docs/technical-spec.md to understand exactly what needed to be built.

Step 2 - Plan before writing. Before writing code, Claude was asked to describe its approach. This caught misunderstandings early, before any files were changed.

Step 3 - Write + validate together. Claude wrote the Terraform module or Kubernetes manifest, then immediately ran terraform validate or helm lint to catch syntax errors before any human review.

Step 4 - Audit after completing each epic. After each major feature was done, a multi-agent security and correctness audit was run across the new code. This caught issues that only become visible when you look at the whole picture — like an IAM role that was too permissive, or a secret that was logged by mistake.

Step 5 - Commit with intention. Claude never committed automatically. Every change was reviewed by the human first, then committed with a clear, descriptive message.

What Claude Code Is Not

Claude Code is not a replacement for understanding. If you use it as a magic box where you paste instructions and copy output without reading it, you will not learn anything, and when something breaks in production at 2 AM, you will have no idea where to start.

The right way to use it is as an extremely knowledgeable pair programmer who does the mechanical work fast so you can spend your time on understanding, reviewing, and making decisions. Every piece of code Claude generates in this guide is explained in the chapter it appears in. Read the explanations.

Chapter 5: Building the AWS Infrastructure with Terraform

This is where the actual building begins. By the end of this chapter, you will have a real VPC, a running EKS cluster, a private container registry, and a MySQL database - all created from code, in minutes, repeatably.

What Terraform Actually Is

Terraform is a tool that lets you describe your cloud infrastructure in code and then creates it for you.

Instead of clicking through the AWS Console to create a VPC, then manually creating subnets, then configuring security groups — you write a file that says "I want a VPC with these subnets and these security groups" and Terraform creates all of it in the right order. More importantly, if you delete that file and run Terraform again, it destroys those resources. If you change a value in the file, Terraform figures out the minimum set of changes needed to reach the new state.

This is why it's called Infrastructure as Code - your infrastructure is version-controlled, reviewable, and repeatable.

How This Project Organises Terraform

The Terraform code in this project is split into two layers:

Modules live in terraform/modules/. Each module is a reusable blueprint for one piece of infrastructure — the VPC module knows how to create a VPC, the EKS module knows how to create a cluster, and so on. Modules don't create anything by themselves.

Environments live in terraform/environments/dev/ and terraform/environments/prod/. Each environment is a root configuration that calls the modules and wires them together. This is what you actually run.

This structure means you never duplicate code. When you improve a module, both environments benefit.

Step 1: Bootstrap the State Backend

Before running any Terraform, you need somewhere to store Terraform state.

Terraform state is a file that records what real AWS resources exist and maps them to your code. Without it, Terraform can't know what it has already created. The problem: if you store this file only on your laptop, you lose it if your laptop breaks, and you can't collaborate with others.

The solution is to store the state file in AWS S3 (a file storage service) with a DynamoDB table to prevent two people from running Terraform at the same time and corrupting the state.

The project includes a script that creates this backend automatically:

bash scripts/bootstrap-state.sh

This script:

Detects your AWS account ID automatically
Creates an S3 bucket named petclinic-terraform-state-{your-account-id}
Enables versioning on the bucket (so you can recover from accidental state corruption)
Creates a DynamoDB table for state locking
Updates backend.tf in your dev and prod environments with the correct bucket name
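If you want to see what those steps amount to in AWS CLI terms, here is a rough sketch. It is not the project's bootstrap-state.sh: the lock table name and the function names are hypothetical, the region default is taken from the eu-central-1 examples in this guide, and the final step (rewriting backend.tf) is left out.

```shell
#!/usr/bin/env bash
# Sketch of the state-backend bootstrap. The lock table name is an
# assumption; the bucket naming pattern comes from the list above.

REGION="${AWS_REGION:-eu-central-1}"
LOCK_TABLE="petclinic-terraform-locks"   # hypothetical name

# Build the bucket name from an AWS account ID.
state_bucket_name() {
  echo "petclinic-terraform-state-$1"
}

bootstrap_state() {
  account_id="$(aws sts get-caller-identity --query Account --output text)" || return 1
  bucket="$(state_bucket_name "$account_id")"

  # Buckets outside us-east-1 need an explicit LocationConstraint.
  aws s3api create-bucket --bucket "$bucket" --region "$REGION" \
    --create-bucket-configuration "LocationConstraint=$REGION" || return 1

  # Versioning lets you roll back to an earlier state file.
  aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled || return 1

  # Terraform's S3 backend expects a string hash key named LockID.
  aws dynamodb create-table --table-name "$LOCK_TABLE" --region "$REGION" \
    --attribute-definitions AttributeName=LockID,AttributeType=S \
    --key-schema AttributeName=LockID,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST || return 1
}
```

Nothing runs until you call bootstrap_state yourself, which makes it safe to read through the commands first.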

Error you might hit: If the AWS CLI output pauses and waits for you to press q (like it's showing a paginated result), run this first:

aws configure set cli_pager ""

This disables the AWS CLI pager globally. The bootstrap script won't hang again.

Step 2: Configure Your Variables

Each environment has a terraform.tfvars file where you set your personal values. This file is gitignored - it never gets committed to Git because it contains values specific to your setup.

Copy the example file:

cp terraform/environments/dev/terraform.tfvars.example terraform/environments/dev/terraform.tfvars

Open it and fill in your values:

domain_name = "yourdomain.com"

github_repo = "YOUR_USERNAME/spring-petclinic-microservices"

budget_email = "your@email.com"

Set your Cloudflare API token as an environment variable (never put it in the tfvars file):

export CLOUDFLARE_API_TOKEN="your-token-here"

Step 3: Initialise Terraform

From the dev environment directory:

cd terraform/environments/dev
terraform init

This downloads all the provider plugins Terraform needs (AWS, Cloudflare, etc.) and connects to your S3 state backend. You'll see it print Terraform has been successfully initialized at the end.

Step 4: The Four Core Modules

Here's what each module creates and why it's built the way it is.

The VPC Module

The VPC module creates your private AWS network. When you call it from the dev environment, it receives the CIDR range, availability zones, and project name as inputs and builds:

A VPC with CIDR 10.0.0.0/16
2 public subnets (10.0.1.0/24 in eu-central-1a, 10.0.2.0/24 in eu-central-1b)
An Internet Gateway
A route table directing all traffic through the Internet Gateway
4 Security Groups: one for the EKS control plane, one for EKS nodes, one for RDS, one for the ALB

The subnets are tagged with special Kubernetes-specific labels:

kubernetes.io/cluster/petclinic-dev = shared
kubernetes.io/role/elb = 1

These tags tell EKS which subnets to place nodes in, and tell the AWS Load Balancer Controller which subnets to use when creating an ALB. Without these tags, both systems fail silently — they simply don't find the subnets they need.

The EKS Module

This is the most complex module and the one that takes the longest to create (about 15 minutes for a new cluster).

It creates:

The EKS control plane running Kubernetes 1.34
A managed node group with t4g.small ARM64 instances (min 2, desired 4, max 4)
An OIDC provider — a trust relationship that allows pods inside the cluster to authenticate to AWS services using their Kubernetes identity instead of static credentials
4 managed add-ons: CoreDNS (internal DNS), kube-proxy (network routing), VPC CNI (pod networking), EBS CSI Driver (persistent volumes)

Why 4 nodes as the desired count?

You might think 2 nodes is enough to start. It isn't, once the full observability stack runs. Prometheus, Grafana, Loki, FluentBit, Alertmanager, Zipkin, and 8 application services all have resource requests. With only 2 t4g.small nodes (each with 2 vCPUs and 2 GB RAM), pods get stuck in Pending because there's not enough capacity. 4 nodes gives enough headroom for the complete platform.
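The arithmetic behind that claim is a ceiling division: total memory requested divided by what one node can actually offer to pods. The numbers below are illustrative placeholders, not measurements from this project. I am assuming 14 workloads (8 services plus 6 observability components) at an average request of 384 MiB each, and roughly 1400 MiB allocatable per t4g.small once the system reserves its share. Plug in your own values from kubectl describe node.

```shell
#!/usr/bin/env bash
# Ceiling division: nodes needed to fit a total memory request.
# Arguments: total requested MiB, allocatable MiB per node.
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical figures: 14 workloads x 384 MiB = 5376 MiB requested,
# about 1400 MiB allocatable on each node.
nodes_needed 5376 1400
```

With these placeholder figures the result is 4, and two nodes would offer only 2800 MiB against 5376 MiB requested, which is the situation where pods sit in Pending.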

Error you might hit - AMI type mismatch: The node group requires AL2023_ARM_64_STANDARD as the AMI type for ARM64 nodes on Kubernetes 1.34. If you see an error like InvalidParameterException: The following AMI types are not supported, check that your node_ami_type variable is set to AL2023_ARM_64_STANDARD and not the older AL2_ARM_64. Amazon Linux 2 is no longer supported for new node groups on recent Kubernetes versions.

Error you might hit - Access entry: After the cluster is created, Terraform itself needs permission to make further changes to it. If you see Error: Unauthorized when Terraform tries to install add-ons, it means the IAM principal running Terraform hasn't been added to the cluster's access entries. The EKS module handles this automatically by adding the caller's ARN as a cluster admin, but only if you run Terraform with the same IAM identity each time.

The ECR Module

This module creates 8 private Docker image repositories - one per service:

petclinic-dev/config-server

petclinic-dev/discovery-server

petclinic-dev/api-gateway

petclinic-dev/customers-service

petclinic-dev/visits-service

petclinic-dev/vets-service

petclinic-dev/genai-service

petclinic-dev/admin-server

Each repository has:

A lifecycle policy that keeps only the last 10 images and automatically deletes older ones — prevents storage costs from accumulating
Scan-on-push enabled — every image is scanned for known vulnerabilities when pushed
Mutable tags in dev (you can overwrite a tag), immutable tags in prod (once pushed, a tag is permanent — prevents accidentally overwriting what's running in production)
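For reference, a "keep the last 10 images" rule in ECR's lifecycle policy format looks roughly like the JSON below. This is a sketch, not the policy from the project's module: the function name is made up, and I am using tagStatus "any" so that no tag list is needed at all.

```shell
#!/usr/bin/env bash
# Print a lifecycle policy that expires everything beyond the newest 10 images.
lifecycle_policy_json() {
  cat <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only the 10 most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF
}

# Usage (not run here; repository name taken from the list above):
#   aws ecr put-lifecycle-policy \
#     --repository-name petclinic-dev/api-gateway \
#     --lifecycle-policy-text "$(lifecycle_policy_json)"
```

In Terraform the same JSON would normally be attached through a lifecycle policy resource rather than the CLI, but the document structure is identical.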

Error you might hit - Lifecycle policy syntax: ECR lifecycle policies have a specific JSON format that changed in a subtle way. If your lifecycle rule targets tagged images, you must use tagPatternList (not tagPrefixList) when matching by pattern. Using the wrong key silently creates an invalid policy that ECR accepts but never enforces.

Error you might hit - Force delete on destroy: If you run terraform destroy while images exist in ECR, Terraform will fail with RepositoryNotEmptyException. The ECR module has force_delete = true set for exactly this reason - it tells Terraform to empty the repository before deleting it. Make sure this is present in the module.

The RDS Module

This module creates your MySQL 8.0 database. It:

Creates a DB subnet group (tells RDS which subnets it can use)
Sets a custom parameter group to enforce utf8mb4 character encoding (required for emoji and non-ASCII characters in the database)
Deploys the RDS instance in a single AZ with 20 GB of storage
Generates a random secure password using Terraform's random_password resource
Immediately stores the username and password in AWS Secrets Manager as a JSON object

The database is placed in the RDS Security Group, which only accepts connections on port 3306 from the EKS node Security Group. Nothing else can reach it - not the internet, not other AWS services, not even other resources in the same VPC.

Error you might hit - Free tier backup restriction: If you set backup_retention_period to anything greater than 0 on a free-tier AWS account, RDS returns a FreeTierRestrictionError. The dev environment must have backup_retention_period = 0. The prod environment can use higher values once you're off the free tier.

Step 5: Plan and Apply

Always run a plan before applying. A plan shows you exactly what Terraform is going to create, change, or destroy — without actually doing it. Review the plan output carefully before proceeding.

terraform plan -out plan.out

Read through the output. Every resource listed with a + will be created. Once you're satisfied:

terraform apply plan.out

The first apply takes roughly 15–20 minutes. EKS is the slow part — the control plane takes about 12 minutes to fully provision. Everything else is faster.

📸 Screenshot: Terminal showing terraform apply in progress, with the EKS cluster creation step running and a timer showing elapsed time

When it finishes, Terraform prints the output values:

terraform output

Save these - you'll need the cluster name, ECR registry URL, and RDS endpoint in the chapters ahead.

The One Rule You Must Never Break

# NEVER run this without understanding what it will destroy

terraform destroy

terraform destroy deletes every resource Terraform knows about: your cluster, your database, your network, everything. The safety hooks in .claude/settings.json block this command when run through Claude Code, but they don't block it when you run it directly in your terminal.

The correct teardown procedure is to run the project's destroy.sh script, which handles dependent resources in the correct order before calling Terraform. Jumping straight to terraform destroy will leave orphaned resources in your AWS account that you'll have to clean up manually. Chapter 11 covers this in full.

Chapter 6: DNS, Load Balancer, and TLS Certificates

This chapter connects your running Kubernetes cluster to the real internet — with a proper domain name and HTTPS. It also contains one of the more interesting mistakes in this project: switching DNS providers halfway through and having to switch back.

Three Concepts You Need to Understand First

DNS - The Phone Book of the Internet

Every website has an IP address (like 18.184.92.31). DNS is the system that translates a human-readable domain name (yourdomain.com) into that IP address. When you type a domain into a browser, your computer asks a DNS server "what IP address does this domain map to?" and then connects to that IP.

In this project, Cloudflare manages your DNS records. When your Application Load Balancer gets created, you'll add a DNS record in Cloudflare pointing your domain to the ALB's address.

TLS Certificates - The Padlock in Your Browser

When a browser shows a padlock icon and https:// in the address bar, it means the connection is encrypted. TLS certificates are what make this possible — they prove that the server you're talking to is genuinely your domain, and they enable the encryption.

AWS Certificate Manager (ACM) issues free TLS certificates for use with AWS services like load balancers. The catch is that you have to prove you own the domain before ACM will issue a certificate. The easiest way is DNS validation — you add a specific CNAME record to your DNS, ACM checks it, and if it's there, it issues the certificate.

The Load Balancer - Your Cluster's Front Door

The Application Load Balancer (ALB) is a managed AWS service that receives all incoming internet traffic and forwards it to the right place inside your cluster. It holds the TLS certificate, so it handles the HTTPS encryption/decryption — your internal services only ever see plain HTTP.

The ALB is not created manually. It's created automatically by the AWS Load Balancer Controller — a piece of software running inside your cluster that watches for Kubernetes Ingress objects. When you apply an Ingress YAML file, the controller reads it and provisions a real ALB in your AWS account.

The Route 53 Detour - A Real Mistake

Before getting into how the DNS setup works, it's worth telling you what went wrong first.

The original plan was to use AWS Route 53 for DNS. Route 53 is AWS's native DNS service, and every tutorial uses it because it integrates cleanly with ACM - you can validate certificates automatically without any extra work.

The problem: the domain for this project was already registered and managed on Cloudflare. Moving DNS management to Route 53 would mean either transferring the domain (complex, risky) or delegating a subdomain to Route 53 (adds complexity and latency). And Route 53 charges $0.50/month per hosted zone.

After implementing the Route 53 version and realising it was adding unnecessary complexity and cost for a domain already sitting on Cloudflare, the DNS module was rebuilt entirely with the Cloudflare Terraform provider. Cloudflare is free, the domain was already there, and Terraform's Cloudflare provider handles everything Route 53 would — including automatic certificate validation.

The lesson: use whatever manages your domain already. Don't migrate DNS providers just because a tutorial tells you to.

Step 1: The DNS Terraform Module

The DNS module does two things: it requests a TLS certificate from ACM and it validates that certificate by adding records to Cloudflare.

Here's what happens when you run terraform apply with the DNS module:

ACM Certificate Request

Terraform requests a wildcard certificate from ACM:

*.yourdomain.com - covers all subdomains (app.yourdomain.com, grafana.yourdomain.com, etc.)
yourdomain.com - covers the apex domain itself

Wildcard certificates are powerful - one certificate covers unlimited subdomains. This means you never need to request a new certificate when you add a new service.

DNS Validation Records

ACM responds with a validation challenge: "add this specific CNAME record to your DNS to prove you own this domain."

Terraform automatically reads this challenge and creates the CNAME record in Cloudflare using the Cloudflare provider. Within a few minutes, ACM detects the record and issues the certificate.

Error you might hit - for_each plan-time error:

ACM returns the validation records as a map, but during the first terraform plan, the exact keys of that map are not yet known (they come from ACM's response). If you use the wrong key in your for_each loop, Terraform throws:

The "for_each" map includes keys derived from resource attributes that cannot be determined until apply

The fix is to use dvo.domain_name as the for_each key instead of the index. The module in this project already handles this correctly.

Error you might hit - duplicate validation records:

A wildcard certificate (*.yourdomain.com) and its apex domain (yourdomain.com) share the same validation CNAME record. If you try to create two Cloudflare records for both, you'll get a "record already exists" error. The fix is to deduplicate using Terraform's toset() function on the validation records before creating them.
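The deduplication idea is easier to see outside Terraform. The snippet below is only a shell illustration of the principle, not the module's code: two validation entries come back from ACM, they are byte-for-byte identical, and collapsing them leaves a single record to create. The record name and value are made-up placeholders.

```shell
#!/usr/bin/env bash
# Each input line is "name value" for one validation record.
# sort -u keeps exactly one copy of each identical line.
dedupe_records() {
  sort -u
}

# Placeholder records: one for the wildcard, one for the apex domain.
records='_abc123.yourdomain.com. _xyz789.acm-validations.aws.
_abc123.yourdomain.com. _xyz789.acm-validations.aws.'

printf '%s\n' "$records" | dedupe_records
```

Two lines go in and one comes out, which is exactly the number of CNAME records Cloudflare should be asked to create.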

Step 2: Install the AWS Load Balancer Controller

The AWS Load Balancer Controller is a Kubernetes operator that runs inside your cluster and watches for Ingress objects. When it sees one, it reads the annotations and creates a real ALB in your AWS account.

To do this, the controller needs permission to create and manage AWS resources; it needs to call EC2 and ELB APIs. This permission is granted through IRSA (IAM Roles for Service Accounts) — a mechanism that gives a specific Kubernetes service account an AWS IAM role without using any static credentials.

The project scripts handle this automatically:

bash scripts/install-lb-controller.sh --env dev

This script:

Creates an IAM policy with the exact permissions the controller needs
Creates an IRSA role tied to the controller's Kubernetes service account
Installs the controller via Helm into the kube-system namespace
Waits for the controller pod to become ready
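The Helm part of those steps can be sketched as follows. This is not the project's install-lb-controller.sh: the function name and its two arguments are hypothetical, the IAM policy and role are assumed to exist already, and the chart values shown are the ones I know from the public eks-charts repository.

```shell
#!/usr/bin/env bash
# Sketch: install the AWS Load Balancer Controller with an IRSA role.
# $1 = cluster name (for example petclinic-dev), $2 = IRSA role ARN.
install_lb_controller() {
  cluster_name="$1"
  role_arn="$2"

  helm repo add eks https://aws.github.io/eks-charts
  helm repo update

  # The service account annotation is what links the pod to the IAM role.
  helm upgrade --install aws-load-balancer-controller \
    eks/aws-load-balancer-controller \
    --namespace kube-system \
    --set clusterName="$cluster_name" \
    --set serviceAccount.create=true \
    --set serviceAccount.name=aws-load-balancer-controller \
    --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=$role_arn"

  # Block until the controller deployment reports ready, or time out.
  kubectl rollout status deployment/aws-load-balancer-controller \
    -n kube-system --timeout=180s
}
```

The escaped dots in the annotation key matter: without them Helm would treat each dot as a nested key.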

Error you might hit — missing IAM permissions:

The AWS Load Balancer Controller's required IAM policy is long and specific. If it's missing even one permission, the controller will create the ALB but fail silently on certain operations — like updating listener rules or describing security groups. Two permissions that were discovered missing only after testing:

- ec2:GetSecurityGroupsForVpc — needed when the controller reconciles security group rules

- elasticloadbalancing:DescribeListenerAttributes — needed when the controller checks existing listener configuration

The IAM policy in this project includes both. If you're writing your own policy from scratch, use the official policy JSON from the AWS Load Balancer Controller GitHub repository — don't try to write a minimal one from memory.

Step 3: Apply the Ingress Resource

An Ingress is a Kubernetes object that describes how external traffic should reach your services. It's essentially a routing rule: "traffic arriving at this domain should go to this internal service on this port."

The project's Ingress file is in k8s/base/ingress/ingress.yaml. The key parts look like this:

A few important things in those annotations:

kubernetes.io/ingress.class: alb — tells the Load Balancer Controller this Ingress should be handled by an ALB (not nginx or another controller)
alb.ingress.kubernetes.io/scheme: internet-facing — creates a public ALB, not a private one
alb.ingress.kubernetes.io/certificate-arn — attaches your ACM certificate so the ALB can serve HTTPS
The SSL redirect action - automatically redirects HTTP traffic to HTTPS
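The manifest itself is not reproduced in this section, so here is a minimal sketch of an Ingress carrying the annotations just described. Treat it as an approximation, not the project's ingress.yaml: the Ingress name, the backend service name and port, and the target-type annotation are assumptions, and I am using the controller's ssl-redirect annotation where the project may use a redirect action instead.

```shell
#!/usr/bin/env bash
# Print an example ALB Ingress manifest. $1 = ACM certificate ARN.
ingress_manifest() {
  cert_arn="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: petclinic-ingress            # hypothetical name
  namespace: petclinic-dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/certificate-arn: ${cert_arn}
spec:
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway      # assumed service name and port
                port:
                  number: 8080
EOF
}

# Usage (not run here):
#   ingress_manifest "arn:aws:acm:...your-certificate..." | kubectl apply -f -
```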

When you apply this file:

kubectl apply -f k8s/base/ingress/ingress.yaml

The Load Balancer Controller detects it within seconds and begins provisioning an ALB in your AWS account. This takes about 2–3 minutes.

Step 4: Point Your Domain at the ALB

Once the ALB is provisioned, you have an AWS-generated DNS name that looks like:

k8s-petclinic-abc123def456.eu-central-1.elb.amazonaws.com

You need to add a CNAME record in Cloudflare, pointing your subdomain to this address. The project's Terraform DNS module handles this automatically - it reads the ALB DNS name from a variable and creates the record:

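# Get the ALB DNS name from kubectl (it appears in the ADDRESS column)

kubectl get ingress -n petclinic-dev

# Add it to your terraform.tfvars

alb_dns_name = "k8s-petclinic-abc123def456.eu-central-1.elb.amazonaws.com"

# Create a fresh plan (a plan saved before this change is stale), then apply the Cloudflare CNAME record

terraform plan -out plan.out

terraform apply plan.out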

After a minute or two of DNS propagation, visiting https://app.yourdomain.com should load the Petclinic frontend — with a valid certificate and HTTPS.

📸 Screenshot: Browser showing the Spring Petclinic web UI loading at your real domain with the HTTPS padlock visible in the address bar

The Cloudflare 81044 Error on Destroy

This error appeared specifically when running terraform destroy after the infrastructure had been running for a while.

When you destroy and recreate infrastructure, Terraform tries to delete the Cloudflare DNS records it created. But if you've run the ACM validation step and the certificate is already issued, the validation CNAME record has already done its job. Cloudflare sometimes returns error 81044 — "record already exists" — when Terraform tries to recreate the record during a re-apply after a partial destroy.

The fix is in the DNS module: filter the validation records using distinct() before creating them, so duplicate CNAME entries are never attempted. The destroy.sh script also handles this edge case by cleaning up orphaned ACM certificates before Terraform destroy runs.

The Safe Order for Teardown

When you eventually want to destroy the infrastructure, the load balancer and DNS must be torn down in a specific order:

Delete the Ingress first - this triggers the Load Balancer Controller to delete the ALB from AWS. If you skip this and run terraform destroy, Terraform tries to delete the VPC, but the ALB is still inside it and the VPC deletion fails.
Wait for the ALB to disappear from the AWS Console
Then run destroy.sh, which handles everything else in the correct order

The project's destroy.sh script handles this sequence automatically. Never skip straight to terraform destroy.
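To make the ordering concrete, here is a sketch of the same sequence as a script. It is not the project's destroy.sh. The helper names are made up, the script path scripts/destroy.sh is an assumption based on where the other project scripts live, and the ALB check assumes controller-created load balancers keep the k8s- name prefix shown earlier.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# True when no load balancer with the k8s- prefix remains (assumed prefix).
alb_gone() {
  count="$(aws elbv2 describe-load-balancers \
    --query "length(LoadBalancers[?starts_with(LoadBalancerName, 'k8s-')])" \
    --output text)" || return 1
  [ "$count" = "0" ]
}

safe_teardown() {
  # 1. Delete the Ingress so the controller removes the ALB.
  kubectl delete -f k8s/base/ingress/ingress.yaml --ignore-not-found || return 1
  # 2. Wait up to about 5 minutes for the ALB to disappear.
  wait_until 30 10 alb_gone || { echo "ALB still present" >&2; return 1; }
  # 3. Hand over to the project's teardown script (path is an assumption).
  bash scripts/destroy.sh
}
```

The point of the wait loop is that deleting the Ingress returns immediately, while the ALB itself takes a while to go away.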

Chapter 7: Secrets Management - Never Hardcode Anything

Every application needs sensitive values to run: database passwords, API keys, and encryption keys. How you handle these values is one of the most important decisions in any infrastructure project and one of the easiest to get wrong.

This chapter covers the right way to do it.

Why You Can Never Put Secrets in Your Code

It seems obvious once you hear it, but it's worth stating clearly: any value committed to a Git repository should be treated as public, even if the repository is private.

Private repositories get made public by accident. Employees leave organisations. Access tokens get leaked. Git history is permanent: even if you delete a secret from a file, it lives in the commit history forever.

The consequences of leaked cloud credentials are immediate and expensive. Attackers scan GitHub continuously for exposed AWS access keys. Within minutes of a key being pushed to a public repository, automated bots find it and begin spinning up GPU instances for crypto mining on your bill. AWS has sent invoices in the tens of thousands of dollars for accounts where credentials were leaked this way.

The rule in this project is absolute: no secret ever touches a file that could be committed to Git. Not in .yaml files, not in .env files, not in Terraform variables, not anywhere.

Instead, every secret lives in AWS Secrets Manager and gets pulled into the cluster automatically by a component called the External Secrets Operator.

AWS Secrets Manager - Your Secure Vault

AWS Secrets Manager is a managed service that stores sensitive values securely. Think of it as a locked safe inside your AWS account. Every secret has:

A name (the path you use to reference it, like petclinic/dev/rds-credentials)
A value (the actual secret, stored encrypted)
An audit trail (every time a secret is accessed, AWS logs it)
Fine-grained IAM permissions (you control exactly which services and roles can read which secrets)

This project stores three secrets in Secrets Manager:

RDS Credentials - Stored at petclinic/dev/rds-credentials as a JSON object with username and password fields. This secret is created automatically by the RDS Terraform module when the database is provisioned. The password is generated by Terraform's random_password resource, so it's never typed by a human.
OpenAI API Key - Stored at petclinic/dev/openai-api-key as a plain string. This one you create manually by running:

aws secretsmanager create-secret \
  --name "petclinic/dev/openai-api-key" \
  --secret-string "sk-your-openai-key-here" \
  --region eu-central-1

Alertmanager SMTP Credentials - Stored at petclinic/alertmanager-smtp. Used by Alertmanager to send email alerts. This is optional - if it doesn't exist, Alertmanager runs in a mode where alerts are logged but not emailed. The project handles this gracefully.

📸 Screenshot: AWS Secrets Manager console showing the three secrets with their names and last modified dates — but not their values

The Problem: Kubernetes Doesn't Know About Secrets Manager

Kubernetes has its own concept of secrets — a resource type called Secret that stores base64-encoded values and mounts them into pods as environment variables or files. But Kubernetes Secrets have a fundamental limitation: they exist only inside the cluster. They don't automatically sync from AWS Secrets Manager.

The naive solution — manually creating Kubernetes Secrets and applying them with kubectl — has two problems:

You have to type the secret value somewhere to create the kubectl command, which risks exposure
If the secret rotates in Secrets Manager, your pods still have the old value until you manually update the Kubernetes Secret

The right solution is the External Secrets Operator.

External Secrets Operator — The Bridge

External Secrets Operator (ESO) is a Kubernetes operator — a piece of software that runs inside your cluster and continuously watches for a custom resource type called ExternalSecret. When it finds one, it reaches out to AWS Secrets Manager, fetches the value, and creates a real Kubernetes Secret from it. Every hour, it syncs again to pick up any rotations.

The flow looks like this:

AWS Secrets Manager

↑ (ESO fetches value every hour)

External Secrets Operator (running in cluster)

↓ (creates and updates)

Kubernetes Secret (petclinic-dev namespace)

↓ (mounted as env var)

Pod (customers-service, visits-service, etc.)

Your pods never talk to Secrets Manager directly. They just read a Kubernetes Secret that ESO keeps up to date. If the password rotates in Secrets Manager, ESO updates the Kubernetes Secret within an hour, and the next pod restart picks up the new value.

Step 1: Install ESO

ESO is installed via Helm into its own namespace. The project script handles this:

bash scripts/install-eso.sh --env dev

But before ESO can fetch secrets from AWS, it needs permission to do so. This is where IRSA comes in again.

An IRSA role is created for ESO with a policy that allows it to call secretsmanager:GetSecretValue on specifically the secrets this project uses — nothing else. The Terraform karpenter module (which handles all IRSA roles) creates this role. The ESO installation script passes the role ARN to the Helm chart, which annotates ESO's service account with it.

Error you might hit — ESO API version change:

ESO changed its Kubernetes custom resource API version between major releases. In ESO v1.x, the ClusterSecretStore and ExternalSecret resources used apiVersion: external-secrets.io/v1beta1. In ESO v2.x, this changed to apiVersion: external-secrets.io/v1.

If you apply old manifests against a new ESO installation, Kubernetes accepts them (it doesn't immediately reject unknown API versions) but ESO silently ignores them. Your pods start with no secrets injected and you get vague CreateContainerConfigError errors on the pods.

The fix is simple: check your ESO version with helm list -n external-secrets and make sure your manifests use v1 for ESO v2.x and above. All manifests in this project use v1.
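Two quick checks help here: one against your repository, one against the cluster. Both are sketches with made-up function names. The first needs nothing but grep, while the second needs kubectl access and assumes the CRD has its standard name.

```shell
#!/usr/bin/env bash
# List manifest files under a directory that still use the old API version.
find_v1beta1_manifests() {
  grep -rl 'external-secrets.io/v1beta1' "$1" 2>/dev/null
}

# Show the API versions defined by the installed ExternalSecret CRD.
# Requires cluster access; not run automatically.
eso_crd_versions() {
  kubectl get crd externalsecrets.external-secrets.io \
    -o jsonpath='{.spec.versions[*].name}'
}

# Usage (not run here):
#   find_v1beta1_manifests k8s/
#   eso_crd_versions
```

If the first command prints any files and the second does not list v1beta1, those manifests need updating before you apply them.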

Step 2: Apply the ClusterSecretStore

A ClusterSecretStore is a cluster-wide resource that tells ESO how to connect to the external secret store — in this case, AWS Secrets Manager.

The jwt auth method tells ESO to use IRSA - it will use the service account's JWT token to authenticate to AWS without any static credentials. This is applied once at the cluster level and used by all namespaces.
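The manifest is not shown in this section, so here is a minimal sketch of a ClusterSecretStore that uses jwt auth against AWS Secrets Manager. The store name and the service account name and namespace are assumptions, not values confirmed by the project.

```shell
#!/usr/bin/env bash
# Print an example ClusterSecretStore manifest for AWS Secrets Manager.
cluster_secret_store_manifest() {
  cat <<'EOF'
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager          # hypothetical name
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets   # assumed service account
            namespace: external-secrets
EOF
}

# Usage (not run here):
#   cluster_secret_store_manifest | kubectl apply -f -
```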

Step 3: Apply ExternalSecret Resources

An ExternalSecret is a resource you apply per secret, per namespace. It says: "fetch this specific secret from AWS Secrets Manager and create a Kubernetes Secret with this name."

Here's the ExternalSecret for RDS credentials:

When this is applied, ESO:

Calls Secrets Manager to fetch petclinic/dev/rds-credentials
Extracts the username and password fields from the JSON
Creates a Kubernetes Secret named rds-credentials in the petclinic-dev namespace
Repeats this every hour
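Since the manifest is not reproduced here, the four steps above map onto an ExternalSecret roughly like this sketch. The secret path, target name, namespace, field names and refresh interval come from the text. The store name is a hypothetical placeholder, and the real file may differ in detail.

```shell
#!/usr/bin/env bash
# Print an example ExternalSecret for the RDS credentials.
external_secret_manifest() {
  cat <<'EOF'
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: rds-credentials
  namespace: petclinic-dev
spec:
  refreshInterval: 1h                # re-sync every hour
  secretStoreRef:
    name: aws-secrets-manager        # hypothetical store name
    kind: ClusterSecretStore
  target:
    name: rds-credentials            # Kubernetes Secret that ESO creates
  data:
    - secretKey: username
      remoteRef:
        key: petclinic/dev/rds-credentials
        property: username
    - secretKey: password
      remoteRef:
        key: petclinic/dev/rds-credentials
        property: password
EOF
}

# Usage (not run here):
#   external_secret_manifest | kubectl apply -f -
```

Each entry under data picks one field out of the JSON object in Secrets Manager and writes it to one key of the Kubernetes Secret.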

Error you might hit - namespace substitution in prod:

The ExternalSecret manifests reference a specific namespace (petclinic-dev). When deploying to prod, you need the namespace to say petclinic-prod. If you forget to substitute this and apply the dev manifests to the prod namespace, ESO creates secrets in the wrong namespace. The install script handles this using sed to substitute the namespace before applying.
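The substitution itself is a one-liner. This is a sketch of the idea, not the install script's exact code, and the function name is made up.

```shell
#!/usr/bin/env bash
# Rewrite the dev namespace to another environment's namespace.
# Reads a manifest on stdin; $1 is the environment name (for example prod).
retarget_namespace() {
  sed "s/petclinic-dev/petclinic-$1/g"
}

# Usage (not run here; the file path is a placeholder):
#   retarget_namespace prod < external-secret.yaml | kubectl apply -f -
```

Note that this only touches the hyphenated namespace name. Secrets Manager paths such as petclinic/dev/rds-credentials use slashes, so they pass through unchanged and need their own handling if prod uses different paths.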

Error you might hit — ESO role missing a secret:

The IAM policy attached to the ESO IRSA role must explicitly list every Secrets Manager secret ARN it's allowed to read. If you add a new secret (like the Alertmanager SMTP credentials) and forget to add its ARN to the IAM policy, ESO silently fails to sync it. The ExternalSecret resource shows a SecretSyncError status. The fix is to update the IAM policy in Terraform and re-apply.

Step 4: How Pods Consume Secrets

Once ESO has created the Kubernetes Secret, pods reference it in their Helm values like this:

The Helm chart template turns this into a secretKeyRef environment variable in the pod spec. The pod sees the database username and password as regular environment variables — it has no idea they came from Secrets Manager. The secret never exists in any file, never goes through Git, and rotates automatically.

The .gitignore Safety Net

Even with Secrets Manager in place, you still need to make sure secrets can never accidentally reach Git. The project's .gitignore blocks the files most likely to contain secrets:

*.tfvars

*.env

*.env.*

*.pem

*.key

*.p12

errored.tfstate

And the block-secret-commit.sh hook in Claude Code adds a second layer - if any of these file types get staged for a commit, the commit is blocked before it can run. Two layers of protection, because one layer is never enough.
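The project's block-secret-commit.sh is not reproduced here. To give a rough idea of how such a check can work, here is a minimal sketch written as a plain Git pre-commit helper, using the same patterns as the .gitignore list. The function names are my own.

```shell
# Returns 0 (true) if the file name matches one of the blocked patterns.
is_blocked_file() {
  case "$(basename "$1")" in
    *.tfvars|*.env|*.env.*|*.pem|*.key|*.p12|errored.tfstate) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads file paths on stdin (one per line); fails if any path is blocked.
check_staged_files() {
  local status=0 path
  while IFS= read -r path; do
    if is_blocked_file "$path"; then
      echo "blocked: $path" >&2
      status=1
    fi
  done
  return "$status"
}

# In a pre-commit hook this could be wired up as:
# git diff --cached --name-only | check_staged_files || exit 1
```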

Chapter 8: Helm and ArgoCD - Packaging Services and GitOps Deployment

This chapter covers two things that work together to make deployments fast, safe, and automatic: Helm packages your 8 services into something Kubernetes can consume, and ArgoCD watches your Git repository and deploys whatever it finds there.

The Problem Helm Solves

Before this project used Helm, each service had its own raw Kubernetes YAML files — a Deployment, a Service, a ConfigMap, health check configuration, resource limits, and so on. Eight services meant eight sets of files, almost identical to each other but with small differences in port numbers, environment variables, and image names.

Every time you wanted to change something that applied to all services — like adding a new label, adjusting a resource limit, or modifying a probe — you had to edit eight files. Miss one and you'd have inconsistent configuration across services. The maintenance burden was already painful at 8 services. At 50, it would be unmanageable.

Helm solves this with templates. Instead of eight copies of a Deployment YAML, you write one template with placeholders, and Helm fills in the placeholders using values files specific to each service. One change to the template applies to all 8 services simultaneously.

One Chart, Eight Services

The project has a single Helm chart in helm/petclinic-service/ that is shared by all 8 services. The chart contains:

The values.yaml contains sensible defaults for all possible fields — replica count, resource limits, probe paths, autoscaling settings, port number. These defaults are intentionally conservative: 1 replica, no autoscaling, basic resource limits.

The Values Hierarchy

Here is the pattern that makes the single-chart approach work. Values flow in layers, where each layer overrides the one before it:

values.yaml (defaults)

↓ overridden by

helm-values/customers-service.yaml (per-service settings)

↓ overridden by

helm-values/dev.yaml (per-environment settings)

Default values (values.yaml) cover everything with a safe fallback: 1 replica, 100m CPU request, 128Mi memory request, standard health check paths.

Per-service values (helm-values/{service}.yaml) set what's unique to each service: port number, image name, environment variables, database secret references, and init container configuration.

Here is what the customers-service values file looks like in simplified form:

Per-environment values (helm-values/dev.yaml and helm-values/prod.yaml) set the ECR registry URL (which includes your account ID and environment name) and environment-specific replica counts and autoscaling settings.

When ArgoCD deploys a service, it merges all three layers together and passes the result to Helm. The chart produces complete, valid Kubernetes manifests from that merged configuration.

Error you might hit — wrong values key:

Helm merges maps deeply but replaces lists entirely. If you put a value under the wrong key name, Helm silently uses the default instead of your override. No error, no warning — just unexpected behaviour. For example, putting SPRING_DATASOURCE_URL in dev.yaml as a global setting instead of in each MySQL service's values file means some services never see it. Always validate your merged output with helm template before applying:

helm template customers-service helm/petclinic-service/ \
  -f helm-values/customers-service.yaml \
  -f helm-values/dev.yaml

Init Containers — Enforcing Startup Order

The startup ordering problem from Chapter 1 is solved here, inside the Helm chart, using init containers.

An init container runs before your main application container starts. It runs to completion and only then does Kubernetes start the real container. If the init container fails or keeps running, the main container never starts.

For customers-service, two init containers use netcat (nc): one checks that config-server is accepting connections on port 8888, the other that discovery-server is accepting connections on port 8761. Each loops with a 2-second sleep until its check succeeds. Only then does customers-service itself start.

This means you can apply all 8 services to the cluster simultaneously without worrying about ordering — each service patiently waits for its dependencies to become ready before trying to start.
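The wait loop inside each init container is only a few lines of shell. Here is a minimal sketch of the idea, assuming the container image ships a netcat that supports the -z flag. The function name is hypothetical.

```shell
# Blocks until host:port accepts a TCP connection, retrying every 2 seconds.
wait_for_port() {
  local host="$1" port="$2"
  # nc -z only checks that the port is open; it sends no data.
  until nc -z "$host" "$port"; do
    echo "waiting for $host:$port"
    sleep 2
  done
  echo "$host:$port is up"
}

# Checks described in the text for customers-service:
# wait_for_port config-server 8888
# wait_for_port discovery-server 8761
```

Because the loop never gives up, a dependency that stays down keeps the pod in its Init phase instead of crash-looping the main container.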

Error you might hit — rolling update capacity on small nodes:

By default, a Kubernetes rolling update starts a new pod before terminating an old one: the default maxSurge of 25% rounds up to 1 extra pod, even for a single-replica Deployment. On t4g.small nodes with limited CPU and memory, this means you briefly need capacity for one extra pod during every update. With 8 services running plus the observability stack on 4 small nodes, there's not enough free capacity for the extra pod, so the rolling update stalls with the new pod stuck in Pending.

The fix is to set maxSurge=0, maxUnavailable=1 in the deployment strategy — terminate one old pod first, then start the new one. This causes a brief single-pod unavailability per service during updates, which is acceptable in dev. The chart in this project defaults to this strategy for exactly this reason.

What ArgoCD Is

ArgoCD is a continuous deployment tool that runs inside your Kubernetes cluster and watches a Git repository. When it detects a change in the repository, it applies the updated Kubernetes manifests to the cluster automatically.

The key difference from traditional CD tools (like running kubectl apply in a GitHub Actions workflow) is the direction of the connection:

Traditional CD: Your CI server connects to the cluster and pushes changes
GitOps with ArgoCD: Your cluster connects to Git and pulls changes

This reversal matters for security. With traditional CD, your CI server needs credentials to access the cluster. Those credentials must be stored somewhere — usually in GitHub Secrets — which creates a potential attack surface. With ArgoCD, the cluster has no inbound connections from CI at all. The only connection is the cluster reaching out to GitHub to check for updates, which requires only read access to a public repository.

📸 Screenshot: ArgoCD UI showing all 8 services as green healthy Application tiles in the petclinic-dev project

Installing ArgoCD

ArgoCD is installed from its official manifest into the argocd namespace:

kubectl create namespace argocd
kubectl apply -n argocd -f k8s/argocd/install/install.yaml

Error you might hit — CRDs not ready:

ArgoCD installs its own custom resource definitions (CRDs), including the one that defines the Application resource type. If you try to apply your Application resources immediately after applying the ArgoCD install manifest, Kubernetes rejects them because the CRDs haven't been registered yet. The install script includes a kubectl wait command that blocks until ArgoCD's CRDs are fully registered before proceeding.
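The exact command in the install script is not shown, but a wait of this kind usually looks something like the sketch below. The function name and the default timeout are assumptions; the two CRD names are the ones ArgoCD registers for Applications and AppProjects.

```shell
# Blocks until ArgoCD's CRDs report the Established condition.
# $1 is an optional timeout (default here is an arbitrary 120s).
wait_for_argocd_crds() {
  kubectl wait --for=condition=Established \
    --timeout="${1:-120s}" \
    crd/applications.argoproj.io \
    crd/appprojects.argoproj.io
}

# wait_for_argocd_crds 60s && kubectl apply -f k8s/argocd/applications/
```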

ArgoCD Application CRDs — Telling ArgoCD What to Deploy

Each of the 8 services has an Application resource — a custom Kubernetes object that tells ArgoCD everything it needs to know about one deployment:

Where is the Helm chart? — the platform repository URL and the helm/petclinic-service/ path
Which values files to merge? — helm-values/{service}.yaml and helm-values/dev.yaml
Where should it deploy? — the petclinic-dev namespace
What sync policy to use? — automatic for dev, manual for prod

Here is what the Application resource for customers-service looks like:

The automated sync policy means ArgoCD will:

Automatically sync whenever it detects a Git change (every 3 minutes by default)
Prune resources that exist in the cluster but no longer exist in Git
Self-heal — if someone manually changes a resource in the cluster (breaking the GitOps principle), ArgoCD reverts it back to what Git says
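For reference, a dev Application with this automated policy could look roughly like the sketch below. The repository URL, project name, and branch are placeholders. The chart path, values files, and namespace follow the text. The values files are referenced relative to the chart directory, which ArgoCD allows as long as they live in the same repository.

```shell
# Writes a minimal ArgoCD Application manifest to the path given as $1.
write_application() {
  cat > "$1" <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: customers-service
  namespace: argocd
spec:
  project: default                     # assumed project name
  source:
    repoURL: https://github.com/your-org/petclinic-platform.git   # placeholder
    targetRevision: main               # assumed branch
    path: helm/petclinic-service
    helm:
      valueFiles:
        - ../../helm-values/customers-service.yaml
        - ../../helm-values/dev.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: petclinic-dev
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}

# write_application /tmp/customers-service-app.yaml
# kubectl apply -f /tmp/customers-service-app.yaml
```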

For production, the automated block is removed entirely. ArgoCD shows the app as OutOfSync but waits for a human to click Sync in the UI or run argocd app sync customers-service-prod.

The UTF-8 BOM Bug — The Invisible Character That Broke Everything

This is one of the most frustrating bugs in the entire project, and it's worth describing in detail because it can waste hours if you don't know what to look for.

After creating the ArgoCD Application resources and applying them, ArgoCD refused to sync several services and showed cryptic YAML parsing errors. The files looked completely correct in every editor. helm template produced valid output. kubectl apply --dry-run passed.

The problem was a UTF-8 BOM (Byte Order Mark) — a hidden three-byte sequence (EF BB BF in hex) that some text editors on Windows add to the beginning of files. It is completely invisible in most editors. When ArgoCD tried to parse the YAML files, it choked on the invisible character at the start of the file.

The fix:

# Detect which files have a BOM

grep -rl $'\xef\xbb\xbf' k8s/argocd/applications/

grep -rl $'\xef\xbb\xbf' helm-values/

# Remove the BOM from affected files

sed -i 's/^\xef\xbb\xbf//' filename.yaml

If you are on Windows and creating YAML files, make sure your editor is set to save files as UTF-8 without BOM. In VS Code, click the encoding indicator at the bottom right of the screen and select "Save with Encoding → UTF-8".
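The sed command above relies on GNU sed's handling of \x escapes and -i, which behaves differently on macOS. If you need something more portable, a byte-level check along these lines should work on most systems. This is a sketch, and the function names are my own.

```shell
# Returns 0 if the file starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Rewrites the file without its first three bytes when a BOM is present.
# Note: the file is replaced via a temporary copy, so permissions may reset.
strip_bom() {
  if has_bom "$1"; then
    tail -c +4 "$1" > "$1.nobom" && mv "$1.nobom" "$1"
  fi
}

# for f in helm-values/*.yaml; do strip_bom "$f"; done
```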

How Image Updates Flow Through to Deployment

Once ArgoCD is running and all 8 Application resources are applied, this is what happens every time you push code to the application repository:

Code pushed to spring-petclinic-microservices

↓

GitHub Actions builds new Docker images

↓

Images pushed to ECR with commit SHA tag (e.g. a3f4bc1)

↓

GitHub Actions updates helm-values/customers-service.yaml: image.tag: a3f4bc1

↓

Git commit pushed to petclinic-platform repository

↓

ArgoCD detects the changed file within ~3 minutes

↓

ArgoCD runs helm template with the new tag value

↓

Rolling update begins in petclinic-dev namespace

The only thing that changed in Git was one line in one YAML file - the image tag. ArgoCD saw that change and handled the rest. No human touched the cluster.

Chapter 9: The CI/CD Pipeline - From Code Push to Cluster Update

This chapter covers the automation that connects your code to your cluster. When you push a code change, the pipeline runs without any human involvement and ends with updated pods running in Kubernetes. Understanding how it works and where it breaks is essential.

The Two-Workflow Design

The pipeline is split across two GitHub repositories, each with one workflow:

Workflow 1 — build-push.yml lives in the application repo (your fork of spring-petclinic-microservices). It runs when code is pushed. Its job: build Docker images and push them to ECR.

Workflow 2 — update-image-tags.yml lives in the platform repo (petclinic-platform). It runs when workflow 1 signals it. Its job: update the image tag in Git so ArgoCD deploys the new image.

The two workflows are connected by a repository dispatch — an API call that one GitHub repository makes to trigger a workflow in another repository.

This separation is intentional. The application repo focuses on code: build, test, scan, push. The platform repo focuses on deployment: what version is running where. A developer working on Java code doesn't need to know or touch the infrastructure repo. An infrastructure engineer can change deployment configuration without touching application code.

OIDC Federation — No Long-Lived AWS Credentials

The most important security decision in the entire CI pipeline is how GitHub Actions authenticates to AWS.

The naive approach is to create an IAM user, generate access keys, and store them in GitHub Secrets. This works but creates a permanent credential that never expires. If GitHub's secrets storage is ever compromised, or if a workflow accidentally prints the credentials to logs, those keys give an attacker access to your AWS account until you manually rotate them.

The better approach is OIDC federation (OpenID Connect). Here is how it works in plain English:

AWS and GitHub trust each other through a pre-established relationship called an OIDC provider. When a GitHub Actions workflow runs, GitHub gives it a short-lived token that says "this workflow is running in repository X, on branch Y." The workflow presents this token to AWS, which verifies it with GitHub, and if the conditions match the trust policy, AWS issues temporary credentials that expire in one hour.

No stored AWS credentials. No long-lived secrets. Nothing to leak.

Setting this up requires two steps:

In AWS (via Terraform):

# Creates the OIDC trust relationship between AWS and GitHub

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["..."]
}

In the GitHub Actions workflow:

After this step, all subsequent AWS CLI and SDK calls in the workflow use temporary credentials scoped to exactly the permissions of that IAM role and nothing else.

Workflow 1: Building ARM64 Docker Images

Your EKS cluster runs on ARM64 Graviton nodes. Your laptop and GitHub Actions runners run on x86_64. These are different CPU architectures, and a Docker image built for one won't run on the other.

Building an ARM64 image on an x86_64 machine requires two tools:

Docker Buildx — An extended version of docker build that supports building for multiple target platforms in one command.

QEMU — Software that emulates different CPU architectures. When building an ARM64 image on x86_64, QEMU translates ARM64 instructions so they run on the x86_64 runner. It's slower than native builds, but it works.

The workflow sets this up before building:

The SHA is the first 7 characters of the Git commit hash — for example a3f4bc1. This becomes the image tag. Every image pushed to ECR is tagged with the exact commit that produced it, making it trivial to trace any running pod back to its source code.
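Deriving that tag in a workflow step takes one line of shell. A minimal sketch, assuming the full hash is available in GitHub's GITHUB_SHA variable; the function name is hypothetical.

```shell
# Prints the first 7 characters of a commit hash.
short_sha() {
  printf '%.7s\n' "$1"
}

# In a GitHub Actions step:
# IMAGE_TAG=$(short_sha "$GITHUB_SHA")
```

Truncating the string directly always gives exactly 7 characters, whereas git rev-parse --short=7 may return a longer prefix when 7 characters would be ambiguous.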

Trivy Vulnerability Scan

After building and before pushing, Trivy scans every image for known security vulnerabilities:

exit-code: 1 means the workflow fails and the image is not pushed if any CRITICAL vulnerability is found. This ensures nothing with a known critical security issue ever reaches your cluster.
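The same gate can be expressed with the Trivy CLI directly, which is handy for testing a scan locally before relying on the workflow. This is a sketch rather than the project's actual step; the function name is an assumption.

```shell
# Scans an image and pushes it only if no CRITICAL vulnerability is found.
scan_then_push() {
  local image="$1"
  # --exit-code 1 makes trivy return non-zero when findings match --severity.
  if trivy image --severity CRITICAL --exit-code 1 "$image"; then
    docker push "$image"
  else
    echo "CRITICAL vulnerabilities found, not pushing $image" >&2
    return 1
  fi
}

# scan_then_push "$ECR_REGISTRY/customers-service:$IMAGE_TAG"
```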

Firing the Repository Dispatch

After all 8 images are pushed, the workflow fires a signal to the platform repo:

This HTTP call wakes up update-image-tags.yml in the platform repo, passing along the SHA and which services were updated.
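Under the hood, a repository dispatch is a single POST to GitHub's REST API. Here is a minimal sketch with curl. The event type name, the payload field names, and the GH_DISPATCH_TOKEN variable are assumptions for illustration, and the payload builder does no JSON escaping, so it only suits simple values like SHAs and service names.

```shell
# Builds the JSON body for the dispatch call.
# $1 = short commit SHA, $2 = comma-separated service names
dispatch_payload() {
  printf '{"event_type":"update-image-tags","client_payload":{"sha":"%s","services":"%s"}}\n' "$1" "$2"
}

# Sends the dispatch. $1 = owner/repo of the platform repository.
send_dispatch() {
  curl -fsS -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $GH_DISPATCH_TOKEN" \
    "https://api.github.com/repos/$1/dispatches" \
    -d "$(dispatch_payload "$2" "$3")"
}

# send_dispatch your-org/petclinic-platform a3f4bc1 customers-service
```

The receiving workflow reads these values from github.event.client_payload.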

Workflow 2: Updating Image Tags in Git

When the platform repo receives the dispatch, update-image-tags.yml runs. Its job is simple: update the image.tag value in the relevant helm-values/{service}.yaml files and commit the change to Git.

It uses yq, a command-line YAML processor, to edit the file precisely without disturbing any surrounding content:

This is better than using sed to find and replace, because sed pattern-matches text: if your file has similar-looking values elsewhere, sed can corrupt them. yq understands YAML structure and updates exactly the field you specify.
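To make the difference concrete, here is a small sketch. The values file and its sidecar field are invented for the demonstration, and the yq call assumes yq v4 (the mikefarah implementation).

```shell
# A values file with two similar-looking "tag" fields (hypothetical content).
values_file=$(mktemp)
cat > "$values_file" <<'EOF'
image:
  repository: customers-service
  tag: 0000000
sidecar:
  tag: 0000000
EOF

# Naive text replacement rewrites every line that looks like "tag: ...".
sed_changed=$(sed 's/tag: .*/tag: a3f4bc1/' "$values_file" | grep -c 'tag: a3f4bc1')
echo "sed changed $sed_changed lines"   # prints: sed changed 2 lines

# Structure-aware update: only .image.tag is touched.
# $1 = values file, $2 = new tag
update_image_tag() {
  TAG="$2" yq -i '.image.tag = strenv(TAG)' "$1"
}

# update_image_tag helm-values/customers-service.yaml a3f4bc1
```

Passing the tag through an environment variable with strenv keeps it a string even when the SHA happens to be all digits.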

After updating the files:

📸 Screenshot: GitHub repository showing a recent commit from


The Errors That Actually Happened

This pipeline took the most debugging of any single component in the project. Here are the real failures, in the order they happened:

Error 1 — GitHub rejects reusable workflow in a subdirectory

The original design had the tag-update logic in a reusable workflow file at .github/workflows/update-tags/action.yml, inside a subdirectory. GitHub Actions only supports reusable workflows directly under .github/workflows/. Any uses: reference pointing to a subdirectory fails with a confusing error about the workflow not being found.

The fix was to inline the update logic directly into update-image-tags.yml instead of trying to make it reusable.

Error 2 — Wrong permissions for writing to the repository

GitHub Actions workflows have a permissions block that controls what the workflow's token can do. The contents: write permission is required to push commits back to the repository. Setting it at the wrong level (the workflow level instead of the specific job level) caused permission errors on push that looked like authentication failures.

Error 3 — yq checksum verification reading the wrong column

To avoid supply chain attacks (downloading a malicious yq binary), the workflow downloads yq and verifies its SHA-256 checksum against the official checksums file. Each line of that file has many whitespace-separated columns, and only one of them is the SHA-256 hash. The script was reading field 1 instead of the field that actually holds the hash, so the value it compared against was never a valid checksum.

# Wrong — reads wrong column

EXPECTED=$(grep "yq_linux_amd64" checksums | awk '{print $1}')

# Correct — field 19 in this specific checksums file format

EXPECTED=$(grep "yq_linux_amd64" checksums | awk '{print $19}')

This failed with a checksum mismatch error that looked like a corrupt download.
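One way to make this kind of mistake obvious is to validate the expected value before comparing it: a SHA-256 digest is always 64 hex characters, so a filename or separator picked up from the wrong field fails immediately with a clear message. The sketch below assumes sha256sum is available (it is on GitHub's Linux runners); the function names are hypothetical.

```shell
# Prints one field from the checksums line that names the given binary.
# $1 = checksums file, $2 = binary name, $3 = field number
expected_checksum() {
  awk -v name="$2" -v col="$3" \
    '{ for (i = 1; i <= NF; i++) if ($i == name) { print $col; exit } }' "$1"
}

# Compares a file's SHA-256 digest with an expected hex string.
verify_sha256() {
  local file="$1" expected="$2" actual
  if ! printf '%s' "$expected" | grep -Eq '^[0-9a-f]{64}$'; then
    echo "expected value is not a SHA-256 digest: $expected" >&2
    return 1
  fi
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file" >&2
    return 1
  fi
}

# verify_sha256 yq_linux_amd64 "$(expected_checksum checksums yq_linux_amd64 19)"
```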

Error 4 — Concurrent pipeline runs collide on git push

When multiple services are updated in one push, the pipeline runs multiple update-tag jobs in parallel, each trying to commit and push to the same branch at the same time. The second push fails because the first push moved the branch forward.

The fix is a git pull --rebase origin main before every push. This fetches any new commits made by concurrent runs and replays your commit on top of them before pushing.
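A single rebase followed by a push can still lose the race if another job pushes in between, so it helps to wrap the two commands in a small retry loop. This is a sketch of that idea, not the project's exact step; the function name, attempt count, and backoff are assumptions.

```shell
# Rebases onto the remote branch and pushes, retrying if the push is rejected.
# $1 = branch (default main), $2 = max attempts (default 5)
push_with_rebase() {
  local branch="${1:-main}" attempts="${2:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # Replay the local commit on top of whatever other jobs pushed.
    git pull --rebase origin "$branch" || return 1
    if git push origin "HEAD:$branch"; then
      return 0
    fi
    echo "push rejected, retrying ($i/$attempts)" >&2
    sleep "$i"
    i=$((i + 1))
  done
  return 1
}

# git commit -am "chore: update image tags" && push_with_rebase main
```

Because each job edits a different helm-values file, the rebase should apply cleanly. If two jobs touch the same file, the rebase stops with a conflict and the function returns an error instead of retrying.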

Error 5 — ECR empty after terraform destroy

After destroying and recreating the infrastructure, ECR repositories exist but contain no images. ArgoCD tries to pull the image tags referenced in helm-values/, finds nothing in ECR, and all pods fail with ImagePullBackOff.

The fix is to trigger a manual rebuild. The update-image-tags.yml workflow has a workflow_dispatch trigger — you can run it manually from the GitHub Actions UI, specifying any service name and SHA. Alternatively, make an empty commit to the application repo to trigger build-push.yml and rebuild all images.

📸 Screenshot: GitHub Actions showing the workflow_dispatch trigger UI for update-image-tags.yml with the sha and services input fields

The Complete Pipeline in One View

git push (application repo)

↓

build-push.yml triggers

↓

OIDC → temporary AWS credentials (1 hour)

↓

Maven build → Docker buildx (linux/arm64) → Trivy scan

↓

ECR push: image tagged with 7-char commit SHA

↓

repository_dispatch → platform repo

↓

update-image-tags.yml triggers

↓

yq updates helm-values/{service}.yaml

↓

git pull --rebase → git commit → git push

↓

ArgoCD detects change (~3 min)

↓

Rolling update in petclinic-dev namespace

↓

New pods running, old pods terminated

Every step is automated. Every step is auditable in Git history. No human touches the cluster.

Chapter 10: The Observability Stack - Seeing Everything Your App Does

Running 8 microservices in a cluster without observability is like flying a plane with no instruments. Everything might be fine. Or one service might be failing 30% of requests, and you'd have no idea until a user tells you. This chapter sets up the tools that make the invisible visible.

The Three Things You Need to See

Observability in modern systems comes down to three types of data, each answering a different question:

Metrics - Numbers measured over time. How many requests per second is api-gateway handling? How much memory is customers-service using? Is the error rate going up? Metrics are cheap to store and fast to query, which makes them ideal for dashboards and alerts.

Logs - Text records of events. When a request fails, what exactly happened inside the service? Logs give you the detailed story that metrics can't tell. The challenge is that 8 services each produce hundreds of log lines per minute — you need a way to collect and search them in one place.

Traces - Records of how a single request travels across multiple services. When a user clicks "add visit" on the frontend, that request touches api-gateway, then visits-service, then the database. A trace shows you every hop with timing — so when something is slow, you can see exactly where the time was spent.

This project implements all three pillars using open-source tools that run inside the cluster: Prometheus (metrics), Loki (logs), Zipkin (traces), with Grafana as the single dashboard that ties them together, and Alertmanager to notify you when something goes wrong.

Why Not Just Use CloudWatch?

AWS has its own observability service called CloudWatch that integrates directly with EKS. The question of why not to use it is worth answering directly.

CloudWatch charges per GB of log data ingested - roughly $0.50/GB in eu-central-1. Eight Spring Boot services in active development generate logs quickly. A busy dev environment can easily produce 5–10 GB of logs per day, which adds up to $75–150/month in log ingestion costs alone.

The Loki stack used in this project stores logs on an EBS volume at roughly $0.10/GB per month for storage with no per-GB ingestion charge. For a learning project, the cost difference is significant. ADR-0011 in the repository documents this decision in full.
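The arithmetic behind those figures is easy to rerun with your own numbers. A small sketch using the rough prices quoted above; the 30-day month and the 30-day, uncompressed retention for the storage figure are my assumptions.

```shell
# Monthly ingestion cost: GB per day * price per GB * days in the month.
monthly_ingest_cost() {
  awk -v gb="$1" -v price="$2" -v days="${3:-30}" \
    'BEGIN { printf "%.2f\n", gb * price * days }'
}

# Monthly storage cost: GB stored * price per GB-month.
monthly_storage_cost() {
  awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

monthly_ingest_cost 5 0.50      # 75.00  (low end, CloudWatch ingestion)
monthly_ingest_cost 10 0.50     # 150.00 (high end, CloudWatch ingestion)
monthly_storage_cost 150 0.10   # 15.00  (30 days of 5 GB/day kept on EBS)
```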

There's also a tighter integration advantage: Grafana can query both Prometheus and Loki simultaneously, allowing you to correlate a spike in error metrics with the specific log lines that caused it, all in one screen.

Component 1: Prometheus — Collecting Metrics

Prometheus is a time-series database. It works by scraping — periodically calling an HTTP endpoint on each service (/actuator/prometheus in Spring Boot's case) and storing the numbers it gets back.

Every 15 seconds, Prometheus calls each of the 8 services' metrics endpoints and stores values like:

http_server_requests_seconds_count — total request count
http_server_requests_seconds_sum — total time spent on requests
jvm_memory_used_bytes — current JVM heap usage
hikaricp_connections_active — active database connections

These numbers are stored in a format optimised for time-series queries. You can then ask questions like "show me the error rate for customers-service over the last hour" and get an instant answer.

Prometheus runs as a StatefulSet in the monitoring namespace with a persistent EBS volume (10 GB in dev, 50 GB in prod) so its data survives pod restarts.

Error you might hit — prod scrape targets in base config:

The Prometheus ConfigMap in k8s/base/observability/prometheus.yaml defines which services to scrape. The base config should only reference services that exist in the current environment. If dev's Prometheus config includes scrape targets pointing to prod namespace services, Prometheus throws scrape errors every 15 seconds for every unreachable target. The config was split so the base only scrapes the appropriate namespace for each environment.

Component 2: Grafana — Your Dashboard

Grafana is the visualisation layer. It connects to Prometheus and Loki as data sources and displays the data as graphs, tables, and heatmaps on dashboards.

This project provisions Grafana with its data sources configured automatically — you don't have to click through the UI to add Prometheus or Loki. The configuration is stored as a Kubernetes ConfigMap that Grafana reads on startup.

The project includes three pre-configured dashboards:

Service Overview — All 8 services on one screen: request rate, error rate, response time
Per-Service Dashboard — Deep dive into one service: JVM heap, GC pauses, database pool usage, HTTP status code breakdown
JVM Dashboard — Memory, threads, garbage collection across all services

Grafana's admin password is stored in Secrets Manager and synced into the cluster by ESO as a Kubernetes Secret before Grafana deploys. This is important: if the Secret doesn't exist when Grafana starts, it falls back to a default password, which is a security risk.

Error you might hit — ExternalSecret applied after Grafana:

The install order matters. If install-observability.sh applies the Grafana StatefulSet before the grafana-admin ExternalSecret has been synced by ESO, Grafana starts with no admin password Secret mounted and generates a random one. You then can't log in because you don't know the password. The fix is to apply the ExternalSecret first and wait for ESO to sync it before deploying Grafana. The install script now includes a kubectl wait step between these two actions.

Access Grafana by port-forwarding to your local machine: kubectl port-forward svc/grafana -n monitoring 3000:3000. Then open http://localhost:3000 in your browser.

Component 3: Loki + FluentBit — Log Aggregation

Loki is a log storage and query system designed to work with Grafana. Unlike traditional log systems that index every word in every log line (which is expensive), Loki only indexes metadata — which pod the log came from, which namespace, which container. The actual log text is stored compressed and searched by streaming through it. This keeps storage costs very low.

FluentBit is a lightweight log collector that runs as a DaemonSet — one pod on every node in the cluster. It tails the container log files that Kubernetes writes to each node's disk (at /var/log/containers/*.log), parses them, and ships them to Loki with metadata labels attached: namespace, pod name, container name.

The result: every log line from every pod across the entire cluster is searchable in one place through Grafana's Explore view.

Component 4: Zipkin — Distributed Tracing

When a request from the browser hits api-gateway and then calls customers-service, those are two separate network hops. Without tracing, if the overall request is slow, you don't know whether the time was spent in api-gateway, in customers-service, or waiting for the database.

Zipkin collects traces — structured records of each request as it passes through multiple services. Spring Boot services in this project use Spring Cloud Sleuth, which automatically instruments every HTTP call and database query, attaches a trace ID, and sends timing data to Zipkin at http://zipkin:9411.

Zipkin runs as a simple Deployment in the tracing namespace. You access it by port-forwarding: kubectl port-forward svc/zipkin -n tracing 9411:9411

Component 5: Alertmanager — Getting Notified When Things Break

Prometheus evaluates alert rules, conditions defined in a ConfigMap, on a regular evaluation interval. When a condition is true for longer than a defined duration, Prometheus fires an alert to Alertmanager.

This project defines five alert rules:

ServiceDown — a service has been unreachable for more than 2 minutes
HighErrorRate — more than 5% of requests are returning 5xx errors
HighLatency — p99 response time exceeds 2 seconds
PodRestartLoop — a pod has restarted more than 5 times in 10 minutes
HighMemoryUsage — a pod is using more than 80% of its memory limit

Alertmanager receives these alerts and routes them to email (via SMTP) or Slack. The SMTP credentials live in Secrets Manager.

Error you might hit — Alertmanager fails when SMTP secret is missing:

Alertmanager won't start if its configuration references an SMTP password that doesn't exist. When deploying for the first time without having set up an SMTP secret yet, Alertmanager gets stuck in CrashLoopBackOff.

The fix is to deploy Alertmanager with a null receiver — a configuration that accepts all alerts but throws them away instead of emailing them. This lets Alertmanager run successfully while you set up real alert routing later:

route:
  receiver: 'null'
receivers:
  - name: 'null'

The install-observability.sh script detects whether the SMTP secret exists and deploys either the null config or the real email config accordingly.

The Node Count Problem

This is worth calling out because it will catch you if you reduce node count to save cost.

The complete observability stack — Prometheus, Grafana, Loki, FluentBit, Alertmanager, Zipkin — alongside 8 application services adds up to significant resource requirements. On 2 t4g.small nodes (each with 2 vCPUs and 2 GB RAM), pods start getting stuck in Pending because there's no room to schedule them.

The desired node count in this project was briefly reduced from 4 to 2 to cut costs during a test, which immediately caused the observability stack pods to become unschedulable. The node count was restored to 4. This is why DEV_DESIRED_NODES=4 is a constant in the project and not something to adjust casually.

Installing the Observability Stack

The entire stack is installed with one script:

bash scripts/install-observability.sh --env dev

This applies all the manifests in k8s/base/observability/ in the correct order: namespaces first, then persistent volume claims, then the StatefulSets, then the DaemonSet, then the alert rules. Rollout timeouts are set to 300 seconds per component to account for EBS volume provisioning time and Karpenter node scale-out if capacity is needed.

Putting It All Together

Once the stack is running, open three browser tabs:

Grafana (localhost:3000) — your dashboards and log search
Zipkin (localhost:9411) — your traces
Spring Boot Admin (kubectl port-forward svc/admin-server 9090:9090) — live health of all 8 services

From the moment a request enters your cluster to the moment a response leaves, every metric, every log line, and every network hop is captured and queryable. When something breaks — and something will eventually break — you have everything you need to understand why.

Chapter 11: Every Error I Hit, Every Fix That Worked, and What I'd Do Differently

This is the chapter that no tutorial ever writes. Every guide shows you the happy path, the version where every command works and every service starts cleanly. The reality of building infrastructure on AWS is that you will spend more time debugging than you spend building. This chapter documents every significant error from this project, why it happened, and exactly how to fix it.

Consider this your debugging reference. When something breaks, and it will, come here first.

Infrastructure and Terraform Errors

The AWS CLI pager blocks your terminal

After running certain AWS CLI commands (like describing DynamoDB tables or listing EC2 instances), the output opens in a pager like less — waiting for you to press q before the script can continue. Automated scripts hang indefinitely waiting for a keypress that never comes.

Fix — run this once and never think about it again: aws configure set cli_pager ""

Terraform can't find your S3 state bucket

If you run terraform init before running bootstrap-state.sh, Terraform looks for an S3 bucket that doesn't exist yet and fails with NoSuchBucket.

Fix — always run the bootstrap script first: bash scripts/bootstrap-state.sh

If you've already run terraform init and the bucket now exists, run terraform init again; it will reinitialise and connect to the bucket.
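One way to make the ordering mistake harder to repeat is a small wrapper that checks for the bucket before initialising. This is only a sketch: state_bucket_exists and safe_init are hypothetical helper names, the bucket name is passed in by the caller, and it assumes head-bucket failing means the bucket is missing (it also fails when access is denied).

```shell
#!/usr/bin/env bash

# state_bucket_exists BUCKET: succeed if the bucket exists and is reachable.
state_bucket_exists() {
  aws s3api head-bucket --bucket "$1" >/dev/null 2>&1
}

# safe_init BUCKET: bootstrap the state bucket if needed, then run terraform init.
safe_init() {
  local bucket="$1"
  if ! state_bucket_exists "$bucket"; then
    echo "State bucket $bucket not found; running bootstrap first" >&2
    bash scripts/bootstrap-state.sh || return 1
  fi
  terraform init
}
```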

Your AWS account ID is hardcoded in backend.tf

The backend.tf file references the S3 bucket by name, which includes your account ID. If you commit backend.tf with a real account ID and someone else forks the repo, their terraform init points to your bucket. The bootstrap-state.sh script replaces a YOUR_ACCOUNT_ID placeholder with the real value automatically — but only if that placeholder is present. If someone committed the real ID directly, the script's substitution no longer works.

Fix — always keep YOUR_ACCOUNT_ID as the placeholder in committed files. Let the script do the substitution locally at runtime.
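A guard like the following could run as a pre-commit hook or an early CI step to catch a committed account ID. It is a sketch, not part of the project's scripts: check_backend_placeholder is a hypothetical name, and the check simply assumes that any standalone 12-digit number in the file is an AWS account ID.

```shell
#!/usr/bin/env bash

# check_backend_placeholder FILE: fail if FILE looks like it contains a real
# 12-digit AWS account ID instead of the YOUR_ACCOUNT_ID placeholder.
check_backend_placeholder() {
  local file="$1"
  if grep -Eq '(^|[^0-9])[0-9]{12}([^0-9]|$)' "$file"; then
    echo "$file: found a 12-digit number; restore the YOUR_ACCOUNT_ID placeholder" >&2
    return 1
  fi
  if ! grep -q 'YOUR_ACCOUNT_ID' "$file"; then
    echo "$file: YOUR_ACCOUNT_ID placeholder is missing" >&2
    return 1
  fi
}
```

Call it with the path to your backend.tf; a non-zero exit status blocks the commit or fails the job.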

EKS node group rejects the AMI type

Trying to create an ARM64 node group with AL2_ARM_64 as the AMI type on Kubernetes 1.34 fails with InvalidParameterException. Amazon Linux 2 is no longer supported for new node groups on recent Kubernetes versions.

Fix — use AL2023_ARM_64_STANDARD as the node_ami_type variable value.

Terraform can't manage the EKS cluster after creation

After the cluster is created, subsequent Terraform runs that try to configure it (installing add-ons, managing access entries) fail with Unauthorized. The IAM identity that ran Terraform doesn't have cluster admin access.

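Fix - the EKS module must create an access entry for the caller's IAM ARN as part of the same terraform apply that creates the cluster. Make sure this resource exists in your EKS module:

resource "aws_eks_access_entry" "terraform_caller" {
  cluster_name  = aws_eks_cluster.main.name
  principal_arn = data.aws_caller_identity.current.arn
  type          = "STANDARD"
}

Note that an access entry on its own carries no Kubernetes permissions. It needs an access policy association (or Kubernetes group bindings) to grant admin rights, so check that your module defines one as well.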

RDS rejects backup_retention_period > 0 on free tier

Setting backup_retention_period = 7 on an RDS instance in a free-tier AWS account returns FreeTierRestrictionError. Free tier accounts cannot enable automated RDS backups.

Fix — set backup_retention_period = 0 in dev. Take manual snapshots if needed.

ECR destroy fails with RepositoryNotEmptyException

terraform destroy fails when ECR repositories contain images. Terraform tries to delete the repository but AWS rejects deletion of non-empty repositories by default.

Fix — ensure force_delete = true is set on every aws_ecr_repository resource. This tells AWS to empty the repository before deleting it.

ECR lifecycle policy silently does nothing

An ECR lifecycle policy that uses tagPrefixList to match tagged images never triggers. The policy is accepted by AWS without error but no images are ever cleaned up.

Fix - use tagPatternList instead of tagPrefixList when matching images by tag pattern. These are different filters: tagPrefixList matches literal tag prefixes, while tagPatternList accepts * wildcards.

RDS final snapshot conflict on second destroy

Running terraform destroy on prod a second time fails with DBSnapshotAlreadyExists. Terraform tries to create a final RDS snapshot before deleting the instance, but a snapshot from the previous destroy still exists with the same auto-generated name.

Fix - delete the previous final snapshot manually before running destroy:

aws rds delete-db-snapshot \
  --db-snapshot-identifier petclinic-prod-mysql-final \
  --region eu-central-1

Or set skip_final_snapshot = true in the RDS module for environments where you don't need the snapshot.

DNS and Networking Errors

ACM certificate validation fails with for_each plan-time error

Using the ACM certificate's validation records as a for_each source fails during terraform plan because the keys are not known until apply time.

Fix — use dvo.domain_name as the for_each key instead of the record index. This gives Terraform a stable, known key at plan time.

Duplicate Cloudflare DNS validation records

A wildcard certificate (*.yourdomain.com) and the apex domain (yourdomain.com) share the same ACM validation CNAME record. Terraform tries to create it twice and fails on the second attempt.

Fix - deduplicate the validation records before creating the Cloudflare DNS records, so that the shared CNAME is created only once.

Cloudflare error 81044 on terraform destroy

When re-running Terraform after a partial destroy, Cloudflare returns error 81044 ("record already exists") when Terraform tries to recreate DNS validation records that were never deleted.

Fix — the destroy script handles this by checking for and deleting orphaned ACM certificates and their associated Cloudflare records before running terraform destroy. Never skip destroy.sh and jump directly to terraform destroy.

VPC deletion fails because ALB still exists

Running terraform destroy while an ALB exists inside the VPC fails because AWS won't delete a VPC with active resources inside it.

Fix - always delete the Kubernetes Ingress resource first. This triggers the AWS Load Balancer Controller to delete the ALB from AWS:

kubectl delete ingress petclinic-ingress -n petclinic-dev

Wait for the ALB to disappear in the AWS Console (about 2 minutes) before running destroy.

VPC ENI race condition during destroy

Even after deleting the Ingress, terraform destroy sometimes fails with an error saying the VPC cannot be deleted because network interfaces (ENIs) still exist. These are left behind by the Load Balancer Controller or VPC CNI during pod termination.

Fix — the destroy.sh script includes a step that waits for all ENIs in the VPC to be released before proceeding. If you hit this manually, wait 2–3 minutes and retry.
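The waiting step can be sketched as a polling loop with a timeout. This is an illustration rather than the actual contents of destroy.sh: count_enis and wait_for_enis are hypothetical names, and counting every ENI in the VPC is an assumption. A real script may need narrower filters, since RDS instances and NAT gateways also own ENIs until they are destroyed.

```shell
#!/usr/bin/env bash

# count_enis VPC_ID: print the number of network interfaces still in the VPC.
count_enis() {
  aws ec2 describe-network-interfaces \
    --filters "Name=vpc-id,Values=$1" \
    --query 'length(NetworkInterfaces)' \
    --output text
}

# wait_for_enis VPC_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Poll until no ENIs remain, or give up once the timeout has elapsed.
wait_for_enis() {
  local vpc_id="$1" timeout="${2:-300}" interval="${3:-10}" waited=0 remaining
  while true; do
    remaining="$(count_enis "$vpc_id")"
    if [ "$remaining" = "0" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out: $remaining ENI(s) still attached to $vpc_id" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Failing loudly on timeout is deliberate: it is better for the script to stop than to let terraform destroy run into the same dependency error.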

AWS Load Balancer Controller missing IAM permissions

The controller creates the ALB but fails on specific operations with AccessDenied. Two permissions that were missing from the initial policy:

ec2:GetSecurityGroupsForVpc
elasticloadbalancing:DescribeListenerAttributes

Fix — add these to the controller's IAM policy and re-apply via Terraform.

Kubernetes and Helm Errors

Pods stuck in Pending — not enough node capacity

After deploying the full observability stack alongside 8 application services, several pods stay in Pending with the reason Insufficient memory or Insufficient cpu.

Fix — increase the desired node count to 4. Two t4g.small nodes cannot fit 8 services plus the full observability stack. This is non-negotiable for the complete setup.

Rolling updates stall with new pod stuck in Pending

During a rolling update, the new pod starts before the old one is terminated (the default maxSurge of 25% rounds up to one extra pod). On small nodes with tight capacity, there isn't room for both the old and new pod simultaneously. The new pod gets stuck in Pending and the rollout never completes.

Fix — set the deployment update strategy to maxSurge=0, maxUnavailable=1. This terminates the old pod first, freeing capacity before starting the new one:

strategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1

ArgoCD Application manifests rejected immediately after install

Applying ArgoCD Application manifests immediately after installing ArgoCD itself fails because ArgoCD's custom resource definitions haven't been fully registered in the cluster yet.

Fix - wait for the CRDs to be registered before applying Applications:

kubectl wait --for=condition=established \
  crd/applications.argoproj.io --timeout=60s

UTF-8 BOM breaks ArgoCD YAML parsing

Several ArgoCD Application manifests and Helm values files were saved with a UTF-8 BOM, an invisible three-byte sequence added by some Windows text editors. ArgoCD's YAML parser choked on these files and showed cryptic parsing errors. The files looked perfectly valid in every editor.

Fix — detect and remove the BOM:

# Find affected files
grep -rl $'\xef\xbb\xbf' k8s/argocd/ helm-values/

# Remove the BOM from the first line of a file (GNU sed)
sed -i '1s/^\xef\xbb\xbf//' filename.yaml

Prevention — configure your editor to save files as UTF-8 without BOM. In VS Code, click the encoding indicator at the bottom right and choose "Save with Encoding → UTF-8".
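If GNU sed isn't available (macOS, for example), the same check and fix can be done with head, od, and tail. This is a sketch with hypothetical helper names has_bom and strip_bom. It inspects only the first three bytes of the file.

```shell
#!/usr/bin/env bash

# has_bom FILE: succeed if FILE starts with the bytes EF BB BF.
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom FILE: rewrite FILE without its leading BOM, if one is present.
strip_bom() {
  if has_bom "$1"; then
    tail -c +4 "$1" > "$1.nobom" && mv "$1.nobom" "$1"
  fi
}
```

Running has_bom over every YAML file in a CI step is a cheap way to stop a BOM from reaching ArgoCD in the first place.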

ESO API version mismatch

ExternalSecret and ClusterSecretStore manifests using apiVersion: external-secrets.io/v1beta1 are silently ignored by ESO v2.x, which requires apiVersion: external-secrets.io/v1. Pods start with no secrets injected and fail with CreateContainerConfigError.

Fix — update all ESO manifests to use v1. Check your ESO version first: helm list -n external-secrets

ESO role missing permission for a new secret

After adding the Alertmanager SMTP secret to Secrets Manager, ESO couldn't sync it. The ExternalSecret showed SecretSyncError. The ESO IAM role's policy didn't include the new secret's ARN.

Fix — update the IAM policy in Terraform to include every secret ARN ESO needs to read, then re-apply.

Grafana starts before its admin password Secret exists

If the grafana-admin ExternalSecret hasn't been synced by ESO before Grafana starts, Grafana generates a random admin password that you don't know.

Fix — apply the ExternalSecret and wait for ESO to create the Kubernetes Secret before deploying Grafana:

kubectl apply -f k8s/base/external-secrets/grafana-admin.yaml

kubectl wait externalsecret/grafana-admin \
  -n monitoring --for=condition=Ready --timeout=60s

Karpenter NodePool uses wrong cluster name in prod

The NodePool manifest for Karpenter contains a tag referencing petclinic-dev. When deploying prod, this tag tells Karpenter to join nodes to the dev cluster instead of prod.

Fix — substitute the cluster name before applying:

sed "s/petclinic-dev/petclinic-prod/g" \
  k8s/karpenter/nodepool.yaml | kubectl apply -f -

CI/CD Errors

GitHub rejects reusable workflow in a subdirectory

Reusable GitHub Actions workflows must live directly in .github/workflows/. Referencing a workflow in a subdirectory (e.g. .github/workflows/shared/update-tags.yml) fails with a workflow-not-found error.

Fix — inline the shared logic directly into the main workflow file.

yq checksum verification reads the wrong column

The checksums file for yq has multiple columns. Extracting the hash with awk '{print $1}' returns the correct column on some checksum file formats but the wrong one on others. The verification fails with a checksum mismatch that looks like a corrupt download.

Fix - inspect the checksums file and confirm which column holds the hash before hardcoding a field number. awk '{print $1}' is only correct when the hash is genuinely in column 1, so verify this against the checksum file format of your specific yq version.
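One way to avoid depending on a column number at all is to look up the row by file name and accept the download only if its computed SHA-256 appears somewhere in that row. The sketch below is generic, not yq's own verification procedure. verify_checksum is a hypothetical helper, and it assumes sha256sum is installed and that the checksums file stores hashes as hex strings.

```shell
#!/usr/bin/env bash

# verify_checksum FILE CHECKSUMS: succeed if FILE's SHA-256 appears in the
# row of CHECKSUMS that mentions FILE's name, whichever column it is in.
verify_checksum() {
  local file="$1" checksums="$2" name actual
  name="$(basename "$file")"
  actual="$(sha256sum "$file" | awk '{print $1}')"
  awk -v name="$name" -v hash="$actual" '
    {
      named = 0; matched = 0
      for (i = 1; i <= NF; i++) {
        if ($i == name || $i == ("*" name)) named = 1
        if ($i == hash) matched = 1
      }
      if (named && matched) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$checksums"
}
```

This works for both the common "hash  filename" layout and multi-column files where the file name comes first.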

Concurrent CI runs collide on git push

Two parallel workflow runs both commit to the same branch simultaneously. The second push fails because the first push moved the branch forward.

Fix — add git pull --rebase origin main before every git push in the workflow. This replays your commit on top of any concurrent commits.
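A single rebase narrows the race window but doesn't close it, because another run can still push between the pull and the push. A retry loop around both steps handles that case. The following is a sketch: push_with_retry is a hypothetical helper, and the attempt count and backoff are arbitrary choices.

```shell
#!/usr/bin/env bash

# push_with_retry [BRANCH] [ATTEMPTS]: rebase onto the remote branch and push,
# retrying if another run pushed in between.
push_with_retry() {
  local branch="${1:-main}" attempts="${2:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if git pull --rebase origin "$branch" && git push origin "HEAD:$branch"; then
      return 0
    fi
    echo "Push attempt $i failed; retrying" >&2
    sleep "$i"
    i=$((i + 1))
  done
  return 1
}
```

In a workflow step this would replace the bare git push, for example push_with_retry main 5.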

ECR is empty after infrastructure destroy and rebuild

After destroying and recreating infrastructure, ECR repositories exist but contain no images. All pods fail with ImagePullBackOff.

Fix — trigger a manual image rebuild using workflow_dispatch on the update-image-tags.yml workflow, specifying a valid SHA from a previous successful build. Or make an empty commit to the application repo to trigger a full rebuild.

Windows-Specific Errors

Git Bash converts Unix paths to Windows paths

On Windows, Git Bash automatically converts paths like /Zone:Edit or /dev/null in command arguments into Windows-format paths like C:/Program Files/Git/Zone:Edit. This breaks AWS CLI calls that use path-like strings as argument values.

Fix - prefix the problematic command with MSYS_NO_PATHCONV=1:

MSYS_NO_PATHCONV=1 aws ssm put-parameter \
  --name "/petclinic/config" --value "..."

The Cost Breakdown

After everything is running, here is what this project actually costs per month, per environment:

EKS control plane — $73.00 — the unavoidable floor, charged by AWS for every cluster regardless of node count
EC2 t4g.small × 4 — $0 — covered by the AWS Graviton free trial
RDS db.t4g.micro — $0 — covered by AWS free tier
Application Load Balancer — $0 — covered by AWS free tier
ECR storage — ~$0.50 — roughly $0.10/GB for stored images
Secrets Manager × 3 secrets — ~$1.20 — $0.40 per secret per month
EBS volumes for observability — ~$2.50 — Prometheus, Grafana, and Loki persistent volumes
Data transfer — ~$1.00 — within-region traffic is nearly free

Total: ~$78–80/month per environment when actively running.
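As a quick check on the arithmetic, the line items can be summed directly. The figures below are the approximate ones from the list above, not live AWS pricing, and monthly_total and control_plane_cost are hypothetical helper names.

```shell
#!/usr/bin/env bash

# monthly_total: sum the per-environment line items from the list above (USD).
# Order: EKS, EC2, RDS, ALB, ECR, Secrets Manager, EBS, data transfer.
monthly_total() {
  printf '%s\n' 73.00 0 0 0 0.50 1.20 2.50 1.00 |
    awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

# control_plane_cost CLUSTERS: control plane fees alone for N clusters (USD).
control_plane_cost() {
  echo $(( $1 * 73 ))
}
```

monthly_total prints 78.20, the low end of the range quoted above, and control_plane_cost 2 prints 146.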

To cut costs when not actively using the environment:

# Scale nodes to zero without destroying the cluster

bash scripts/stop-env.sh --env dev

# Cost drops to ~$73/month (just the control plane)

# Restore when ready to work

bash scripts/start-env.sh --env dev

The unavoidable cost: The EKS control plane at $73/month is the floor. You cannot avoid it. This is why you should never run two clusters simultaneously unless you have a specific reason — two clusters cost $146/month in control plane fees alone before a single pod runs.

What I'd Do Differently

Start with the destroy script, not the up script.

The first thing to build and test should be the teardown procedure, not the deployment. When you're learning infrastructure and rebuilding frequently, a broken destroy script means orphaned AWS resources that you pay for and can't easily clean up. Build destroy.sh first, test it, and only then build up.sh.

Write the CLAUDE.md before writing a single Terraform file.

The quality of everything Claude Code generates depends on the quality of the context you give it. Spending 30 minutes writing a thorough CLAUDE.md before starting saves hours of fixing inconsistent naming, wrong tags, and missing conventions across modules.

Pin every version from day one.

Kubernetes add-on versions, Helm chart versions, provider versions: pin all of them explicitly from the start. Unpinned versions mean an upgrade can silently change behaviour between applies. The number of bugs caused by version drift in this project was non-zero.

Use helm template obsessively before applying anything.

Every Helm values change should be verified with helm template before applying. Helm's silent override behaviour (wrong key = silently ignored) makes it easy to push a config change that does nothing. If helm template doesn't show your change in the output, your change isn't working.
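That check is easy to script. A minimal sketch, where rendered_contains is a hypothetical helper, "check" is a throwaway release name, and the chart path, values file, and pattern are whatever you are testing:

```shell
#!/usr/bin/env bash

# rendered_contains CHART VALUES_FILE PATTERN: succeed only if the rendered
# manifests contain PATTERN, i.e. the values change reached the output.
rendered_contains() {
  helm template check "$1" -f "$2" | grep -q -- "$3"
}
```

For example, after setting a replica count of 3 in a values file, rendered_contains ./chart values.yaml 'replicas: 3' should succeed. If it doesn't, the key is probably misspelled or nested at the wrong level.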

Don't rush to prod.

Dev should be completely stable, all 8 services healthy, observability working, CI/CD running cleanly, and the full round trip from code push to pod update verified — before you touch the prod environment. Prod amplifies every problem that exists in dev.

Conclusion: Lessons That Stayed, the Claude Code Difference, and What's Next

Where This Started and Where It Ended

When I started this project, I could name every tool in the stack. I knew what Terraform was. I knew what ArgoCD did. I had written Kubernetes Deployments before.

What I could not do was make all of them work together on real infrastructure, with real AWS costs, a real domain, real security constraints, and real errors that no tutorial had ever mentioned.

That gap — between knowing a tool and being able to use it under pressure — is what this project closed. Not by reading more. By building, breaking, fixing, and building again.

If you read this entire guide and followed along, you have now built something that most people with "DevOps" on their resume have never actually built from scratch:

A Terraform-managed AWS infrastructure with 9 reusable modules
A production Kubernetes cluster running real microservices
A GitOps deployment pipeline where Git is the only source of truth
A full observability stack with metrics, logs, traces, and alerts
A CI/CD pipeline with zero hardcoded credentials
Infrastructure that can be torn down and rebuilt from scratch with one script

That is not a small thing.

The Lessons That Actually Stayed

Errors are the curriculum. The happy path teaches you the steps. The errors teach you how the system actually works. Every bug in Chapter 11 forced a deeper understanding of the component it came from than any tutorial could have provided. Lean into broken things — they are the best teachers.

The destroy script is as important as the deploy script. You will rebuild your infrastructure more times than you expect while learning. A broken teardown means orphaned AWS resources, unexpected charges, and state files that lie to you. Build and test teardown first.

Documentation is not optional. The ADRs, runbook, technical spec, and CLAUDE.md in this project felt like extra work while writing them. In practice, they were the reason the project stayed coherent across weeks of work. Every time something broke at 11 PM, the runbook had the answer. Write the docs.

Observability should be built alongside the application, not after. The first time I needed to debug a failing service, I had no logs in one place, no metrics dashboard, and no traces. I was reading raw kubectl logs one pod at a time. Build the observability stack early — ideally before you deploy your first application service.

GitOps changes how you think about deployments. Once you've lived in a world where changing a YAML file in Git automatically updates your cluster, going back to manually running kubectl apply feels wrong. Git becomes your deployment history, your rollback mechanism, and your audit log simultaneously.

Cost discipline from day one. The EKS control plane charges $73/month whether you use the cluster or not. Know your costs before you build, not after the bill arrives. Use the stop-env.sh script when you're not actively working. Destroy environments you're done with.

The Claude Code Difference

I want to be honest about this because it's easy to either oversell or undersell AI-assisted development.

Claude Code genuinely changed the speed at which I could work. Writing a Terraform module that follows project conventions, has correct tags, correct outputs, and passes terraform validate — that used to take me 45 minutes of referencing docs and fixing syntax. With Claude Code and a well-written CLAUDE.md, it took 10 minutes. The multi-agent audits at the end of each epic caught IAM policies that were too permissive and security group rules I had missed. The safety hooks prevented me from accidentally running terraform destroy at least twice.

But Claude Code did not replace understanding. Every time I let it write something without reading it, I paid for it later when the thing broke and I had no idea why. The engineers who get the most out of AI tools are the ones who use them to move faster through work they already understand — not as a substitute for understanding.

The right mental model: Claude Code is a very fast, very knowledgeable pair programmer who doesn't get tired. You are still the engineer making the decisions. You review everything it produces. You are responsible for what runs in your infrastructure.

If you use it that way, it is one of the most powerful tools available to a DevOps engineer right now.

What You Can Build Next

This project is a foundation. Here are the natural next steps if you want to keep going:

Add Karpenter Spot instances — after the Graviton free trial ends in December 2026, switching to Spot instances can cut compute costs by 60–70%
Implement SOPS or Sealed Secrets — a more mature secret management approach where even the Kubernetes Secret resources are encrypted in Git
Add proper integration tests — a test suite that runs against the deployed cluster and verifies end-to-end behaviour before ArgoCD promotes to prod
Explore multi-region — deploy the prod environment in a second AWS region and understand what active-active or active-passive failover actually costs
Add a proper staging environment — a third environment between dev and prod, where changes soak for 24 hours before production promotion

A Word of Thanks

None of this would have happened without the foundation built during the DevOps Micro Internship (DMI) run by Pravin Mishra Sir. The internship gave me the base knowledge of DevOps that made it possible to even attempt a project of this scale. More importantly, it gave me the confidence that real infrastructure projects, the messy, expensive, error-filled kind, are learnable. You just have to start.

If you want to build real DevOps skills the same way, not by watching tutorials but by actually shipping infrastructure, here is where to begin:

DMI Cohort 3 starts 27 June 2025

If you want to build real DevOps skills, apply here 👇

https://docs.google.com/forms/d/e/1FAIpQLSel7ai7nyb0P1qLW4vEyfB_nEsD4lUF1XG88vmAaFGBOb6hPA/viewform

Join the Discord community here 👉 https://lnkd.in/gHYe7Mdg

The repository for everything covered in this guide is open source and ready to fork: https://github.com/paharipratyush/petclinic-platform

Every Terraform module, every Helm chart, every script, every ADR, every runbook - it's all there. Fork it, break it, rebuild it. That's where the real learning happens.

Thanks for reading.

#devops #kubernetes #microservices #docker #eks #terraform #aws #gitops
