InfoDive Labs

Infrastructure as Code with Terraform: A Production Guide

Learn how to structure Terraform projects for production, manage state safely, write reusable modules, and implement CI/CD pipelines for infrastructure changes.

July 3, 2025 · 6 min read


Manual infrastructure provisioning does not scale. Every click in a cloud console is an undocumented change that your team cannot review, reproduce, or roll back. Infrastructure as Code (IaC) solves this by treating infrastructure the same way you treat application code - versioned, reviewed, tested, and deployed through automated pipelines.

Terraform has become the de facto standard for multi-cloud IaC. Its declarative syntax, extensive provider ecosystem, and mature state management make it the most practical choice for teams managing cloud infrastructure at scale. This guide covers the patterns and practices that separate hobby Terraform projects from production-grade infrastructure management.

Project Structure That Scales

The single biggest mistake teams make with Terraform is putting everything in one directory. A monolithic configuration becomes impossible to manage as infrastructure grows. Instead, organize your Terraform code around environments and logical components.

A proven structure separates concerns into layers:

infrastructure/
  modules/
    networking/
    compute/
    database/
    monitoring/
  environments/
    production/
      networking/
      compute/
      database/
    staging/
      networking/
      compute/
      database/
  global/
    iam/
    dns/

Each subdirectory under an environment is an independent Terraform root module with its own state file. This isolation means a mistake in your compute configuration cannot accidentally destroy your networking layer. Changes to production networking go through a separate plan and apply cycle from changes to the database layer.

Use remote state data sources to share outputs between layers. Your compute layer references networking outputs (VPC ID, subnet IDs) through terraform_remote_state or, better yet, through SSM Parameter Store or a similar service that decouples the dependency.
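As a sketch, here is how a compute layer might consume networking outputs both ways; the bucket, key, and parameter names are placeholders for illustration:

```hcl
# Option 1: read the networking layer's state directly.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Option 2 (looser coupling): read a value the networking layer
# published to SSM Parameter Store, so this layer never touches
# another layer's state file.
data "aws_ssm_parameter" "vpc_id" {
  name = "/production/networking/vpc_id"
}

# Downstream usage, e.g. placing an instance in a published subnet:
# subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
```

The SSM approach means the compute pipeline needs read access to a few parameters rather than to the entire networking state file.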

State Management Best Practices

Terraform state is the source of truth for your infrastructure. Mismanaging it leads to resource drift, orphaned resources, and potentially catastrophic deletions.

Always use remote state. Store state in S3 with DynamoDB locking (AWS), Google Cloud Storage with locking (GCP), or Azure Blob Storage with lease locking. Never commit state files to version control - they contain sensitive values and create merge conflicts.

Enable state locking. Without locking, two engineers running terraform apply simultaneously can corrupt state. Every remote backend supports locking - enable it and never disable it.
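Assuming AWS, a minimal backend configuration covering both points might look like this; the bucket and table names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"              # versioned, encrypted bucket
    key            = "production/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                                   # encrypt state at rest
    dynamodb_table = "terraform-locks"                      # table with a LockID string key
  }
}
```

With dynamodb_table set, Terraform acquires a lock before every operation that writes state, so a second concurrent apply fails fast instead of corrupting state.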

Use workspaces cautiously. Terraform workspaces allow multiple state files per configuration, which seems ideal for managing environments. In practice, workspaces obscure which environment you are operating on and make it easy to accidentally apply production changes to staging. Separate directory structures per environment provide better isolation and clarity.

Implement state backup and recovery. Enable versioning on your S3 state bucket. If state becomes corrupted, you can restore a previous version. Regularly practice state recovery so your team is confident in the procedure before an actual incident.

Avoid manual state manipulation. Commands like terraform state mv and terraform import are sometimes necessary, but they bypass the plan-review-apply workflow. Document every manual state operation and review it with a second engineer.

Writing Reusable Modules

Modules are Terraform's mechanism for code reuse. A well-designed module encapsulates a logical infrastructure component with a clean interface.

Design principles for production modules:

  • Minimal required variables. Provide sensible defaults for everything that can have a default. A module that requires 30 variables to use is a module nobody will adopt.
  • Output everything useful. Module consumers cannot access internal resources unless you explicitly output them. Output IDs, ARNs, endpoints, and security group IDs - anything a downstream module might need.
  • Version your modules. Use Git tags and reference modules by version. This prevents upstream module changes from unexpectedly altering downstream infrastructure.
  • Validate inputs. Use validation blocks on variables to catch misconfigurations before they reach the cloud API. Check CIDR ranges, naming conventions, and enum values.
  • Document with examples. Include an examples/ directory showing common usage patterns. This is more valuable than any amount of README documentation.

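These principles can be sketched with a hypothetical networking module; the variable names, resources, and repository URL are illustrative:

```hcl
# modules/networking/variables.tf - minimal required inputs, validated early.
variable "environment" {
  type        = string
  description = "Deployment environment name."

  validation {
    condition     = contains(["production", "staging"], var.environment)
    error_message = "environment must be \"production\" or \"staging\"."
  }
}

variable "vpc_cidr" {
  type        = string
  default     = "10.0.0.0/16"   # sensible default; override per environment
  description = "CIDR block for the VPC."

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid IPv4 CIDR block."
  }
}

# modules/networking/outputs.tf - expose everything downstream layers need.
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

# Consumer side: pin the module to a Git tag so upstream changes cannot
# silently alter this environment.
module "networking" {
  source      = "git::https://example.com/org/terraform-modules.git//networking?ref=v1.2.0"
  environment = "staging"
}
```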
A module should do one thing well. A "vpc" module creates a VPC, subnets, route tables, and NAT gateways. It should not also create EC2 instances or RDS databases. Compose smaller modules together in your root module to build complete environments.

CI/CD for Infrastructure Changes

Infrastructure changes deserve the same rigor as application deployments. A CI/CD pipeline for Terraform enforces review, testing, and controlled rollout.

Pipeline stages:

  1. Format and validate. Run terraform fmt -check and terraform validate to catch syntax errors. This is fast and should block PRs immediately.
  2. Plan. Run terraform plan and post the output as a PR comment. Engineers reviewing the PR should see exactly what resources will be created, modified, or destroyed.
  3. Policy checks. Use tools like Open Policy Agent (OPA), Sentinel, or Checkov to enforce organizational policies. Examples: all S3 buckets must have encryption enabled, no security groups may allow 0.0.0.0/0 ingress on port 22, all resources must have required tags.
  4. Apply on merge. After PR approval and merge to the main branch, automatically run terraform apply with the previously generated plan file. This ensures the applied changes match what was reviewed.
  5. Drift detection. Schedule periodic terraform plan runs to detect manual changes that have drifted from the declared state. Alert on drift so it can be reconciled.

Tools like Atlantis, Spacelift, and Terraform Cloud provide managed implementations of this workflow. For teams that prefer self-hosted solutions, GitHub Actions or GitLab CI with careful state locking work well.

Security and Compliance Considerations

Terraform configurations often contain or reference sensitive values: database passwords, API keys, and service account credentials.

Never hardcode secrets in Terraform files. Use sensitive = true on variables, reference secrets from AWS Secrets Manager or HashiCorp Vault using data sources, and ensure your state backend encrypts at rest.
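A sketch of both techniques, assuming AWS Secrets Manager; the secret name is a placeholder:

```hcl
# Pull the database password at plan/apply time instead of hardcoding it.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/database/password"
}

resource "aws_db_instance" "main" {
  # ... engine, instance class, and storage settings elided ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

# Mark variables holding secrets as sensitive so their values are
# redacted from plan output and logs.
variable "api_key" {
  type      = string
  sensitive = true
}
```

Note that values read this way still end up in the state file, which is another reason the state backend must encrypt at rest and restrict access.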

Implement least-privilege IAM for Terraform execution. The service account running Terraform should have only the permissions necessary for the resources it manages. Avoid using admin credentials - scope permissions per root module so the networking pipeline cannot modify IAM policies.

Use provider version constraints. Pin provider versions to prevent unexpected behavior from upstream updates. Use the pessimistic constraint operator to block major version changes: ~> 5.0 allows any 5.x release, while ~> 5.0.0 restricts updates to patch releases only.
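In practice this lives in a required_providers block, typically shared across every root module:

```hcl
terraform {
  required_version = ">= 1.5.0"   # minimum Terraform CLI version

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # any 5.x release; blocks an accidental jump to 6.0
    }
  }
}
```

Committing the generated .terraform.lock.hcl file pins the exact provider versions actually used, so every engineer and pipeline run resolves the same builds.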

Tag everything. Consistent tagging with environment, team, cost center, and managed-by labels enables cost allocation, access control, and auditing. Use a shared default_tags block in your provider configuration to enforce baseline tags across all resources.
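With the AWS provider this looks like the following; the tag values are examples to adapt to your organization's conventions:

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates,
  # without repeating a tags argument on each resource.
  default_tags {
    tags = {
      Environment = "production"
      Team        = "platform"
      CostCenter  = "cc-1234"
      ManagedBy   = "terraform"
    }
  }
}
```

Individual resources can still add or override tags; default_tags only establishes the baseline.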

Common Pitfalls and How to Avoid Them

The blast radius problem. A single terraform apply that manages 500 resources is dangerous. If something goes wrong, the blast radius is your entire infrastructure. Break configurations into smaller, independent state files with clear boundaries.

Ignoring plan output. Engineers who run terraform apply -auto-approve without reading the plan will eventually destroy production resources. Make plan review a mandatory step in your workflow - both in CI/CD and locally.

Provider version sprawl. Different teams using different provider versions leads to inconsistent behavior and difficult debugging. Standardize provider versions across the organization and update them in coordinated cycles.

Over-engineering modules too early. Do not build a generic module for every possible use case on day one. Start with concrete implementations, then extract modules when you see repeated patterns across two or three configurations.
