Infrastructure as Code

Problem We're Solving

Configuration drift: servers are configured differently from one another, manual changes go untracked, and code that "works on my machine" fails in production.

Solution

Everything lives in Git and is applied via Ansible, backed by automated drift detection and standardized server provisioning.

Core Principle

If it's not in Git and managed by Ansible, it doesn't exist.

No manual server configuration. Ever. (Except documented emergencies, which must be codified within 24 hours.)

How It Works

Code Change → Git PR → Review → Merge → CI Test → Deploy to Staging → Manual Prod Deploy

Everything is versioned, reviewed, tested, and reproducible.
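The "CI Test" stage could start with something as small as a syntax check over every playbook. The following is a minimal sketch, not the actual CI job: the function name `syntax_check_playbooks` and the `ANSIBLE_PLAYBOOK` override variable are assumptions introduced here so the loop can be exercised without Ansible installed. `--syntax-check` is a standard `ansible-playbook` flag.

```shell
#!/bin/sh
# Sketch of a CI lint step: syntax-check every top-level playbook.
# ANSIBLE_PLAYBOOK is a hypothetical override hook; it defaults to the real CLI.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

syntax_check_playbooks() {
    playbook_dir="$1"   # e.g. playbooks
    inventory="$2"      # e.g. inventory/staging
    failed=0
    for playbook in "$playbook_dir"/*.yml; do
        # Skip the unexpanded pattern when the directory has no .yml files.
        [ -e "$playbook" ] || continue
        if ! "$ANSIBLE_PLAYBOOK" "$playbook" -i "$inventory" --syntax-check; then
            echo "syntax check failed: $playbook" >&2
            failed=1
        fi
    done
    # Non-zero when at least one playbook failed, so the CI job fails too.
    return "$failed"
}
```

In a CI job this would be called as `syntax_check_playbooks playbooks inventory/staging`. Checking every file before reporting, rather than stopping at the first failure, gives reviewers the full list in one run.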

Infrastructure Stack

| Component | Tool | Purpose |
| --- | --- | --- |
| Configuration Management | Ansible | Server provisioning, app deployment |
| Cloud Infrastructure | DigitalOcean | VMs, networking, load balancers |
| Networking | Netbird VPN | Zero-trust network access |
| Containers | Docker | Application isolation |
| Source Control | GitHub | Infrastructure code repository |
| CI/CD | GitHub Actions | Automated testing and deployment |

Repository Structure

infrastructure/
├── ansible.cfg                    # Ansible configuration
├── inventory/
│   ├── production/                # Prod environment
│   │   ├── hosts.yml              # Server inventory
│   │   └── group_vars/            # Group variables + vault
│   ├── staging/                   # Staging environment
│   └── development/               # Dev environment
├── playbooks/
│   ├── site.yml                   # Master playbook
│   ├── web_servers.yml            # Web server setup
│   ├── databases.yml              # Database setup
│   └── common.yml                 # Base configuration
├── roles/
│   ├── common/                    # Every server gets this
│   ├── docker/                    # Docker installation
│   ├── security/                  # Hardening, firewall
│   ├── monitoring/                # Metrics, logging
│   └── nginx/                     # Nginx web server
└── scripts/
    ├── provision-new-server.sh    # Bootstrap new server
    └── drift-detection.sh         # Daily drift check

Key Policies

Standard Server Lifecycle

1. Provision New Server

# Create server in DigitalOcean (via web UI or API)
# Add to inventory/production/hosts.yml

# Bootstrap with Ansible
ansible-playbook playbooks/bootstrap.yml -i inventory/production -l new-server-01

What it does:

  • Installs Python (for Ansible)
  • Creates admin users with SSH keys
  • Hardens SSH config
  • Sets up firewall baseline
  • Installs monitoring agent
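The contents of `scripts/provision-new-server.sh` are not shown here, so the following is only a guess at what such a wrapper might look like. It refuses to bootstrap a host that has not yet been added to the inventory, then runs the bootstrap command from above. The function name, the `ANSIBLE_PLAYBOOK` override variable, and the assumption that hosts appear as `name:` keys in a YAML `hosts.yml` are all illustrative.

```shell
#!/bin/sh
# Hypothetical sketch of a provisioning wrapper; not the real script.
# ANSIBLE_PLAYBOOK is an assumed override hook; it defaults to the real CLI.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

provision_new_server() {
    env_name="$1"    # production | staging | development
    host="$2"        # e.g. new-server-01
    hosts_file="inventory/$env_name/hosts.yml"

    if [ -z "$env_name" ] || [ -z "$host" ]; then
        echo "usage: provision_new_server <env> <host>" >&2
        return 2
    fi
    # Refuse to run until the host is in the inventory file.
    # Assumes a YAML inventory in which each host is a "name:" key.
    if ! grep -q "^[[:space:]]*$host:" "$hosts_file" 2>/dev/null; then
        echo "$host not found in $hosts_file; add it first" >&2
        return 1
    fi
    # Same command as the manual bootstrap step, limited to the one host.
    "$ANSIBLE_PLAYBOOK" playbooks/bootstrap.yml -i "inventory/$env_name" -l "$host"
}
```

Run from the repository root, `provision_new_server production new-server-01` would bootstrap only that host. The inventory check exists so a typo in the host name fails early with a clear message.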

2. Configure Server Role

# Apply role-specific configuration
ansible-playbook playbooks/web_servers.yml -i inventory/production -l new-server-01

What it does:

  • Installs required packages (Docker, Nginx, etc.)
  • Configures services
  • Sets up SSL certificates
  • Deploys application
  • Configures monitoring

3. Ongoing Management

All changes go through Git:

# 1. Update Ansible code
vim roles/nginx/templates/nginx.conf.j2

# 2. Test locally or in dev
ansible-playbook playbooks/web_servers.yml -i inventory/development --check

# 3. Commit, PR, review, merge

# 4. CI automatically applies to staging

# 5. Manually trigger production deploy (after verification)
ansible-playbook playbooks/site.yml -i inventory/production
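One way to keep the production step deliberate is a thin wrapper that refuses to touch production without an explicit opt-in. This is a sketch under assumptions, not an existing tool in the repository: the `deploy` function, the `CONFIRM_PROD` variable, and the `ANSIBLE_PLAYBOOK` override are hypothetical names.

```shell
#!/bin/sh
# Hypothetical deploy wrapper: staging and development run freely,
# production requires an explicit CONFIRM_PROD=yes.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

deploy() {
    env_name="$1"   # production | staging | development
    playbook="$2"   # playbook name without the .yml suffix, e.g. site
    if [ "$env_name" = "production" ] && [ "${CONFIRM_PROD:-no}" != "yes" ]; then
        echo "refusing production deploy: set CONFIRM_PROD=yes after verifying staging" >&2
        return 1
    fi
    "$ANSIBLE_PLAYBOOK" "playbooks/$playbook.yml" -i "inventory/$env_name"
}
```

With this in place, `deploy staging web_servers` runs immediately, while `deploy production site` fails until it is invoked as `CONFIRM_PROD=yes deploy production site`.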

4. Decommission Server

# Remove from inventory
# (keep in Git history for audit trail)

# Remove from monitoring, DNS, load balancer (via Ansible)

# Destroy in DigitalOcean (keep snapshots for 30 days)

Drift Detection

Problem: Someone manually edits a config file on a production server.

Solution: Daily automated drift detection.

# Runs daily via GitHub Actions
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  --check \
  --diff
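A successful `--check` run exits with status 0 even when it reports tasks that would change something, so the exit code alone cannot signal drift. One option is to read the PLAY RECAP and list hosts with a non-zero `changed=` count. The sketch below assumes the default recap format; the function name `drifted_hosts` is made up for illustration.

```shell
#!/bin/sh
# Reads ansible-playbook output on stdin and prints every host whose
# PLAY RECAP line reports changed >= 1 (would-be changes in check mode).
drifted_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^changed=[1-9]/) { print $1; break }
            }
        }
    '
}
```

A daily job could pipe the check run through it, for example `ansible-playbook playbooks/site.yml -i inventory/production --check --diff | tee drift.log | drifted_hosts`, and treat any output as drift. The full log is kept because the `--diff` output is what shows the actual change.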

If drift is detected:

  1. An alert is posted to the #devops Rocket.Chat channel
  2. The alert shows what changed
  3. Team investigates:
    • Was it an emergency fix? → Codify it in Ansible within 24 hours
    • Was it unauthorized? → Revert and investigate
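The alert step can be sketched as a small function that turns a list of drifted host names into a JSON body for a Rocket.Chat incoming webhook, which accepts a `text` field. The function name `drift_alert_payload` and the message wording are assumptions; the real alert may carry more detail, such as the diff itself.

```shell
#!/bin/sh
# Reads drifted host names on stdin (one per line) and prints a JSON body
# suitable for an incoming-webhook POST. Returns 1 when there is no drift.
drift_alert_payload() {
    # Join the host names into one space-separated line.
    hosts=$(tr '\n' ' ' | sed 's/ *$//')
    [ -n "$hosts" ] || return 1
    # Escape backslashes and double quotes so the text is a valid JSON string.
    text=$(printf 'Configuration drift detected on: %s' "$hosts" \
        | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"text": "%s"}\n' "$text"
}
```

The payload could then be sent with `curl -X POST -H 'Content-Type: application/json' -d "$payload" "$ROCKETCHAT_WEBHOOK_URL"`, where `ROCKETCHAT_WEBHOOK_URL` is a hypothetical secret holding the webhook URL configured for the #devops channel.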

Goal: Drift should be rare (fewer than one incident per month; the tracked target is zero).

Emergency Procedures

Sometimes you need to fix production RIGHT NOW:

Allowed (with caveats):

  1. Make the manual change to fix the outage
  2. Document immediately in #incidents channel
  3. Create Ansible PR within 24 hours to codify the change
  4. Verify that the Ansible run produces the same result as the manual change
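The 24-hour rule is easy to lose track of during an incident. A small helper like the one below could report how long remains before a manual change must be codified. It is a sketch only: the function name and the idea of recording the change time as a Unix timestamp are assumptions.

```shell
#!/bin/sh
# Reports time left (or overdue) on the 24-hour codification window.
# Arguments are Unix timestamps in seconds; the second defaults to now.
codify_deadline_status() {
    change_epoch="$1"               # when the manual change was made
    now_epoch="${2:-$(date +%s)}"   # current time unless given explicitly
    deadline=$((change_epoch + 24 * 3600))
    if [ "$now_epoch" -le "$deadline" ]; then
        echo "due in $(( (deadline - now_epoch) / 3600 ))h"
    else
        echo "overdue by $(( (now_epoch - deadline) / 3600 ))h"
        return 1
    fi
}
```

For a change made at timestamp 0 and checked 10 hours later, `codify_deadline_status 0 36000` prints `due in 14h`. The non-zero return on an overdue change lets a scheduled job fail loudly.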

Emergency Playbooks (Better):

Keep pre-written playbooks for common emergencies:

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=nginx host=prod-web-01"

# Scale up
ansible-playbook playbooks/emergency/add-web-server.yml

# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=v1.2.3"

Infrastructure Metrics

We track:

  • Configuration drift incidents (target: 0 per month)
  • Time from code merge to deployment (target: < 30 min for staging)
  • Infrastructure changes via PR (target: 100%)
  • Playbook test coverage (target: all playbooks tested in CI)

Dashboard: (link to infrastructure metrics)

Common Tasks

Add a new server: Server Provisioning Guide

Update application: Deployment Procedures

Change configuration: Edit Ansible code → PR → Review → Merge → Deploy

Troubleshoot server: Check Ansible logs, review last playbook run

Quick Reference

Run Ansible playbook:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env>

Check what would change:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env> --check --diff

Run on specific server:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env> -l server-name

See server variables:

ansible-inventory -i inventory/<env> --host server-name