Terraform + Terragrunt + Ansible: A Hands-On Learning Journey

I recently got interview feedback that changed how I approach learning:

"You've used these tools, but the technical depth wasn't there."

Instead of just reading documentation, I decided to build a real multi-environment infrastructure setup from scratch — dev, staging, and prod — using Terraform, Terragrunt, and Ansible. This post is a walkthrough of what I built, why each decision was made, and what I actually learned along the way.

The Problem with Single-Environment Thinking

Up until this point, my Terraform workflow looked like this:

write main.tf → terraform apply → done

That works fine for a single environment. But in a real company, code never goes directly to production. There's always a pipeline:

• Dev: developers experiment here, things can break, no real users
• Staging: production mirror, QA tests here before release
• Prod: real users, real traffic, every mistake costs something

When you try to scale your single main.tf to three environments, three problems appear immediately.

Problem 1: Code duplication. You copy main.tf into environments/dev, environments/staging, and environments/prod. Now you have three identical files. When you add a new resource to dev, you have to manually copy it to the other two. Forget once — your environments silently drift apart.

Problem 2: State file collisions. Terraform records the current state of your infrastructure in a file called terraform.tfstate. If all three environments write to the same S3 path, a dev apply can overwrite the prod state, and Terraform loses track of every prod resource it was managing.

Problem 3: No access control. Without IAM isolation, any engineer with AWS credentials can accidentally run terragrunt apply in the wrong environment.

These are the three problems this lab is designed to solve.

Project Architecture

Here's the full directory structure we're building:

terraform-ansible/
├── _base
│   ├── main.tf                  # single Terraform entry point, used by all environments
│   └── modules
│       ├── ec2
│       │   ├── main.tf
│       │   ├── outputs.tf
│       │   └── variables.tf
│       ├── sg
│       │   ├── main.tf
│       │   ├── outputs.tf
│       │   └── variables.tf
│       └── vpc
│           ├── main.tf
│           ├── outputs.tf
│           └── variables.tf
├── ansible
│   ├── ansible.cfg
│   ├── group_vars
│   │   ├── env_dev.yml
│   │   ├── env_prod.yml
│   │   └── env_staging.yml
│   ├── inventory
│   │   └── aws_ec2.yml          # dynamic inventory, AWS tag based
│   ├── playbooks
│   │   └── provision.yml
│   └── roles
│       ├── common
│       │   └── tasks
│       │       └── main.yml
│       └── webserver
│           ├── handlers
│           │   └── main.yml
│           └── tasks
│               └── main.yml
└── live
    ├── dev
    │   └── terragrunt.hcl       # dev-specific values
    ├── prod
    │   └── terragrunt.hcl       # prod-specific values
    ├── staging
    │   └── terragrunt.hcl       # staging-specific values
    └── terragrunt.hcl           # root config: S3 backend, state locking

The flow looks like this:

terragrunt apply (live/dev)
│
├── reads live/terragrunt.hcl → generates backend.tf automatically
├── reads live/dev/terragrunt.hcl → gets environment-specific inputs
├── runs _base/main.tf → provisions VPC, SG, EC2
└── triggers null_resource → runs Ansible playbook automatically

Step 1: Terraform Modules — Reusable Infrastructure Components

Modules are Terraform's way of packaging reusable infrastructure. Instead of writing the same VPC configuration in every environment, you write it once as a module and call it with different parameters.

Each module follows the same three-file pattern:

• variables.tf: what inputs the module accepts
• main.tf: what resources it creates
• outputs.tf: what values it exposes to the caller

Here's the EC2 module as an example:

modules/ec2/variables.tf

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "environment" {
  type = string
}

variable "subnet_id" {
  type = string
}

variable "sg_id" {
  type = string
}

variable "key_name" {
  description = "SSH key pair name"
  type        = string
}

modules/ec2/main.tf

# Look up the most recent Amazon Linux 2023 AMI published by Amazon
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

resource "aws_instance" "main" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = var.instance_type
  subnet_id              = var.subnet_id
  vpc_security_group_ids = [var.sg_id]
  key_name               = var.key_name

  # The Environment and ManagedBy tags are what the Ansible dynamic inventory keys on
  tags = {
    Name        = "${var.environment}-server"
    Environment = var.environment
    ManagedBy   = "terraform"
    Project     = "terraform-lab" # added later, during the Step 6 drift test
  }
}

modules/ec2/outputs.tf

output "instance_id" {
  value = aws_instance.main.id
}

output "public_ip" {
  value = aws_instance.main.public_ip
}

The VPC and Security Group modules follow the same pattern. The key insight: modules are just functions. They take inputs, create resources, and return outputs.

Step 2: _base/main.tf — The Single Entry Point

All three environments use this exact file. It calls the modules and accepts all variable values from outside — from Terragrunt:

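terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

variable "environment" { type = string }
variable "vpc_cidr" { type = string }
variable "instance_type" { type = string }

# A single-line block may hold only one argument, so these two are written out
variable "key_name" {
  type    = string
  default = "terraform-lab-key"
}

variable "region" {
  type    = string
  default = "eu-central-1"
}

# Wire the region variable into the provider so it is actually used
provider "aws" {
  region = var.region
}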

# The modules live inside _base (see the directory tree), so the paths are ./modules/...
module "vpc" {
  source      = "./modules/vpc"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}

module "sg" {
  source      = "./modules/sg"
  vpc_id      = module.vpc.vpc_id
  environment = var.environment
}

module "ec2" {
  source        = "./modules/ec2"
  instance_type = var.instance_type
  environment   = var.environment
  subnet_id     = module.vpc.subnet_id
  sg_id         = module.sg.sg_id
  key_name      = var.key_name
}

resource "null_resource" "ansible_provision" {
  depends_on = [module.ec2]

  # Re-run the provisioner whenever the instance is replaced
  triggers = {
    instance_id = module.ec2.instance_id
  }

  provisioner "local-exec" {
    command = <<-EOT
      echo "Waiting for instance to be ready..."
      sleep 30
      cd /path/to/ansible && \
      ansible-playbook playbooks/provision.yml -e "target_env=${var.environment}"
    EOT
  }
}

output "instance_id" { value = module.ec2.instance_id }
output "public_ip" { value = module.ec2.public_ip }
output "vpc_id" { value = module.vpc.vpc_id }

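Notice that _base/main.tf hardcodes nothing environment-specific: no instance type, no CIDR block, no environment name. Those values all come from outside (key_name and region only carry defaults that any environment can override). This is what makes the file reusable across environments.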

Step 3: Terragrunt — Solving the Multi-Environment Problem

Terragrunt is a thin wrapper around Terraform. It doesn't replace Terraform — it just removes the need to duplicate main.tf across environments by injecting environment-specific values at runtime.

Think of _base/main.tf as a function. Terragrunt calls that function with different arguments for each environment.

Root config

live/terragrunt.hcl is written once and inherited by all environments:

locals {
  env = basename(get_terragrunt_dir())
  # get_terragrunt_dir() returns the directory of the terragrunt.hcl being run
  # (the child config, e.g. live/dev), even though it is called from this root file
  # basename() extracts just the last segment: "dev", "staging", or "prod"
  # so env is set automatically from the folder name, no hardcoding needed
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "your-tfstate-bucket"
    key            = "${local.env}/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
    # backend.tf is generated automatically before every apply
    # you never write it manually
  }
}

The key field is the critical part. When you run from live/dev, local.env becomes "dev", so the state is saved to dev/terraform.tfstate. From live/prod, it goes to prod/terraform.tfstate. State isolation is automatic.
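
To make that mapping concrete, here is a minimal shell sketch of the same derivation. The state_key function is a hypothetical helper of mine, not something Terragrunt provides; it only mimics what basename(get_terragrunt_dir()) plus the key template produce for each folder.

```shell
# Hypothetical helper mirroring: key = "${local.env}/terraform.tfstate"
state_key() {
  # $1: path of the environment directory Terragrunt is run from
  printf '%s/terraform.tfstate\n' "$(basename "$1")"
}

# Each environment folder resolves to its own key in the bucket
for env_dir in live/dev live/staging live/prod; do
  state_key "$env_dir"
done
```

Three folders, three distinct keys: two environments cannot collide unless two folders share a name.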

Per-environment config

Each environment only contains what's different — the input values:

live/dev/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
  # inherits everything from live/terragrunt.hcl
}

terraform {
  source = "../../_base"
  # points to the shared main.tf
}

inputs = {
  environment   = "dev"
  vpc_cidr      = "10.0.0.0/16"
  instance_type = "t3.micro"
  key_name      = "terraform-lab-key"
}

live/prod/terragrunt.hcl

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../_base" # same main.tf
}

inputs = {
  environment   = "prod"
  vpc_cidr      = "10.2.0.0/16"
  instance_type = "t3.medium" # only the values differ
  key_name      = "terraform-lab-key"
}
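
As far as I understand it, Terragrunt hands the inputs block to Terraform as TF_VAR_ environment variables, which is the "calling a function with arguments" idea in practice. The sketch below prints the roughly equivalent exports for the dev values above; tf_var_exports is a hypothetical helper for illustration, not part of either tool, and it only handles simple string values.

```shell
# Hypothetical helper: print the TF_VAR_ exports that correspond to name=value inputs
tf_var_exports() {
  for pair in "$@"; do
    name=${pair%%=*}     # text before the first "="
    value=${pair#*=}     # text after the first "="
    printf "export TF_VAR_%s='%s'\n" "$name" "$value"
  done
}

# The dev inputs from live/dev/terragrunt.hcl
tf_var_exports environment=dev vpc_cidr=10.0.0.0/16 instance_type=t3.micro
```

Swapping in the prod values changes only the arguments; the Terraform code receiving them stays the same.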

To deploy:

# Deploy only dev
cd live/dev && terragrunt apply

# Plan all environments at once
cd live && terragrunt run-all plan

# Apply all environments at once
cd live && terragrunt run-all apply
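
Applying everything at once is convenient in a lab, but it skips the dev, staging, prod order described at the start. One possible alternative is a small wrapper that walks the environments in that order and stops at the first failure. This is a sketch under my own assumptions: promote and the LIVE_DIR override are hypothetical names, and the command to run is passed in as arguments.

```shell
# Hypothetical wrapper: run a command in dev, then staging, then prod
promote() {
  # promote COMMAND [ARGS...]
  # LIVE_DIR is an assumed override for the live/ folder; defaults to ./live
  for promote_env in dev staging prod; do
    echo "==> $promote_env"
    ( cd "${LIVE_DIR:-live}/$promote_env" && "$@" ) || {
      echo "stopped at $promote_env; later environments were not touched" >&2
      return 1
    }
  done
}
```

From the repository root, `promote terragrunt plan` would plan each environment in turn, and a failure in dev would keep staging and prod untouched.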

Step 4: Ansible — Post-Provisioning Configuration

Terraform answers the question: "Does this EC2 instance exist?"

Ansible answers the question: "Is nginx installed on that instance and configured correctly?"

These are two different problems. Terraform manages infrastructure state. Ansible manages configuration state. You need both.

Dynamic inventory

Instead of hardcoding IP addresses, Ansible discovers instances by their AWS tags:

ansible/inventory/aws_ec2.yml

plugin: amazon.aws.aws_ec2

regions:
  - eu-central-1

filters:
  tag:ManagedBy:
    - terraform
  instance-state-name:
    - running

keyed_groups:
  - key: tags.Environment
    prefix: env
    separator: "_"

hostnames:
  - tag:Name
  - public-ip-address

compose:
  ansible_host: public_ip_address
  # "environment" is a reserved Ansible keyword, so expose the tag under another name
  app_environment: tags.Environment

Any running instance tagged with ManagedBy: terraform is automatically discovered. Instances are grouped by their Environment tag — so dev instances land in the env_dev group, prod in env_prod, and so on. Even if the IP address changes after a destroy/apply cycle, the inventory stays correct.
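
The group names come from joining prefix, separator, and tag value. Here is a small sketch of that rule, with one caveat I believe applies by default: characters that are not valid in a group name (a hyphen, for example) get replaced with underscores. keyed_group_name is a hypothetical helper, and the sanitizing step is my approximation of the plugin's behavior, not its actual code.

```shell
# Hypothetical helper approximating how keyed_groups builds a group name
keyed_group_name() {
  # keyed_group_name PREFIX SEPARATOR VALUE
  # Approximation: anything outside [A-Za-z0-9_] becomes "_"
  printf '%s%s%s\n' "$1" "$2" "$3" | sed 's/[^A-Za-z0-9_]/_/g'
}

keyed_group_name env _ dev
keyed_group_name env _ staging
keyed_group_name env _ pre-prod   # a hyphenated tag value would not survive as-is
```

To see the real groups rather than my approximation, `ansible-inventory -i inventory/aws_ec2.yml --graph` prints what the plugin actually discovered.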

Roles

ansible/roles/common/tasks/main.yml — runs on every instance:

---
- name: Update all packages
  ansible.builtin.dnf:
    name: "*"
    state: latest

- name: Install base tools
  ansible.builtin.dnf:
    name: [git, htop, vim, wget]
    state: present

- name: Create deploy user
  ansible.builtin.user:
    name: deploy
    shell: /bin/bash
    groups: wheel
    append: true

- name: Grant deploy user sudo access
  ansible.builtin.copy:
    dest: /etc/sudoers.d/deploy
    content: "deploy ALL=(ALL) NOPASSWD:ALL\n"
    mode: "0440"
    # Refuse to install the file if visudo rejects it, so a typo cannot break sudo
    validate: /usr/sbin/visudo -cf %s

# The timezone module ships in the community.general collection, not ansible.builtin
- name: Set timezone
  community.general.timezone:
    name: Europe/Istanbul

ansible/roles/webserver/tasks/main.yml — installs and configures nginx:

---
- name: Install nginx
  ansible.builtin.dnf:
    name: nginx
    state: present

- name: Start and enable nginx
  ansible.builtin.systemd:
    name: nginx
    state: started
    enabled: true
    daemon_reload: true

- name: Create environment-specific index.html
  ansible.builtin.copy:
    dest: /usr/share/nginx/html/index.html
    content: |
      <h1>{{ app_environment }} environment</h1>
      <p>Instance: {{ ansible_facts['hostname'] }}</p>
      <p>IP: {{ ansible_facts['default_ipv4']['address'] }}</p>
    mode: "0644"
  notify: nginx restart  # must match a handler name in roles/webserver/handlers/main.yml

Playbook

---
- name: Instance provisioning
  hosts: "env_{{ target_env }}"
  become: true
  vars:
    app_environment: "{{ tags.Environment }}"

  roles:
    - common
    - webserver

Run against a specific environment:

# Only dev
ansible-playbook playbooks/provision.yml -e "target_env=dev"

# Only prod
ansible-playbook playbooks/provision.yml -e "target_env=prod"
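
One thing to watch: the host pattern is built from target_env, so a typo such as target_env=prd points at a group that does not exist. In my experience ansible-playbook then skips the play with a "no hosts matched" message rather than failing loudly. A small guard in front of the command catches that early; validate_env is a hypothetical helper name.

```shell
# Hypothetical guard: accept only the three environments this lab defines
validate_env() {
  case "$1" in
    dev|staging|prod) return 0 ;;
    *)
      echo "unknown environment '$1' (expected dev, staging, or prod)" >&2
      return 1
      ;;
  esac
}
```

It chains naturally in front of the playbook run: `validate_env "$1" && ansible-playbook playbooks/provision.yml -e "target_env=$1"`.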

Idempotency test

One of Ansible's core properties is idempotency — running the same playbook twice should produce the same result. The second run should show changed=0:

# First run
ansible-playbook playbooks/provision.yml -e "target_env=dev"
# → ok=10 changed=8 failed=0

# Second run — nothing changes
ansible-playbook playbooks/provision.yml -e "target_env=dev"
# → ok=10 changed=0 failed=0

changed=0 on the second run confirms idempotency is working.
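
Reading the recap by eye works, but the check is easy to script. The sketch below reads ansible-playbook output on stdin and succeeds only if every recap line reports changed=0. assert_idempotent is a hypothetical helper, and it assumes the default output format where each host line contains changed=N.

```shell
# Hypothetical check: succeed only if every "changed=N" in the recap is zero
assert_idempotent() {
  awk '
    match($0, /changed=[0-9]+/) {
      seen = 1
      # "changed=" is 8 characters long; the digits follow it
      if (substr($0, RSTART + 8, RLENGTH - 8) + 0 > 0) bad = 1
    }
    # Fail when a host changed something, or when no recap line was seen at all
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

On the second run you would pipe the playbook into it: `ansible-playbook playbooks/provision.yml -e "target_env=dev" | assert_idempotent`.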

Step 5: Connecting Everything — One Command to Rule Them All

With null_resource in _base/main.tf, running terragrunt apply automatically triggers Ansible after the EC2 instance is ready:

terragrunt apply
↓
VPC created
↓
Security Group created
↓
EC2 instance running
↓
null_resource triggers (depends_on = [module.ec2])
↓
sleep 30 (wait for SSH to be ready)
↓
ansible-playbook runs automatically
↓
nginx installed, configured, running

From a single command, you get a fully provisioned and configured server.
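
The fixed sleep 30 is the weakest link in that chain: too short on a slow boot, wasted time on a fast one. A retry loop that polls until the instance answers is one possible replacement. This is a sketch, not what the lab currently does; retry is a hypothetical helper, and the probe command you pass to it is up to you.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  # Keep running the command until it succeeds or the attempts run out
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Inside the local-exec command this could look like `retry 30 5 nc -z "$PUBLIC_IP" 22`, assuming a netcat that supports -z and a PUBLIC_IP variable you set from the Terraform output. Ansible's own wait_for_connection module is another way to get the same effect from inside the playbook.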

Step 6: Proving It Works — IAM Isolation & Drift Testing

IAM isolation

A dev engineer should not be able to touch prod state files. We enforce this with IAM policies:

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
  "Resource": "arn:aws:s3:::your-tfstate-bucket/dev/*"
}

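The dev IAM user can only read and write objects under dev/* in S3. (A complete backend policy also needs s3:ListBucket on the bucket and access to the DynamoDB lock table; only the object-level statement is shown here.) Attempting to write to prod/*: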

AWS_ACCESS_KEY_ID=dev-key AWS_SECRET_ACCESS_KEY=dev-secret \
aws s3 cp test.txt s3://your-tfstate-bucket/prod/test.txt

# An error occurred (AccessDenied) when calling the PutObject operation

Human error blocked at the policy level.
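
The mechanism is plain prefix matching on the object key. The sketch below imitates that decision locally so the rule is easy to see; it is not how IAM evaluates policies internally, and key_allowed is a hypothetical helper.

```shell
# Hypothetical helper imitating the Resource pattern "<bucket>/<env>/*"
key_allowed() {
  # key_allowed ENV_PREFIX OBJECT_KEY
  case "$2" in
    "$1"/*) return 0 ;;   # the key sits under the allowed prefix
    *)      return 1 ;;   # everything else is implicitly denied
  esac
}

for key in dev/terraform.tfstate prod/terraform.tfstate; do
  if key_allowed dev "$key"; then echo "allow $key"; else echo "deny  $key"; fi
done
```

Because the state key is derived from the folder name, the same prefix that isolates state files is also the one the policy can lock down.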

Drift test

Add a new tag to modules/ec2/main.tf:

tags = {
  Name        = "${var.environment}-server"
  Environment = var.environment
  ManagedBy   = "terraform"
  Project     = "terraform-lab" # new tag
}

Run terragrunt run-all plan to see the change propagate to all three environments at once:

cd live && terragrunt run-all plan

# Plan: 0 to add, 1 to change, 0 to destroy (dev)
# Plan: 0 to add, 1 to change, 0 to destroy (staging)
# Plan: 0 to add, 1 to change, 0 to destroy (prod)

One file changed. Three environments updated. No manual copying, no risk of forgetting one.

Key Takeaways

After building this from scratch, here's what actually clicked for me:

Terraform and Ansible solve different problems. Terraform manages infrastructure state — "does this resource exist in AWS?" Ansible manages configuration state — "is nginx installed and running on that server?" You need both because provisioning a server and configuring it are fundamentally different concerns.

Terragrunt's value isn't magic, it's discipline. The single _base/main.tf enforces consistency. You can't accidentally define staging's resources differently from prod's because there's only one source of truth for the code, and the only thing that can differ between environments is the input values. Drift between copies of the code becomes structurally impossible rather than just unlikely.

IAM policy is the last line of defense. Engineers make mistakes. The cd live/prod && terragrunt apply accident will happen eventually. When it does, the question is whether your IAM policy stops it before it reaches your infrastructure.

Idempotency is a property you verify, not assume. Running the playbook twice and checking for changed=0 isn't just a test — it's how you know your automation is actually reliable.

All code from this lab is available on GitHub. If you spot something that could be done better, I'd genuinely love to hear it in the comments.