Infrastructure as Code

Problem We're Solving

Configuration drift: servers end up configured differently, manual changes go untracked, and code that "works on my machine" fails in production.

Solution

All configuration lives in Git and is applied by Ansible, backed by automated drift detection and standardized server provisioning.

Core Principle

If it's not in Git and managed by Ansible, it doesn't exist.

No manual server configuration. Ever. (Except documented emergencies, which must be codified within 24 hours.)

How It Works

Code Change → Git PR → Review → Merge → CI Test → Deploy to Staging → Manual Prod Deploy
     ↓
Everything versioned, reviewed, tested, reproducible

Infrastructure Stack

Component	Tool	Purpose
Configuration Management	Ansible	Server provisioning, app deployment
Cloud Infrastructure	DigitalOcean	VMs, networking, load balancers
Networking	NetBird VPN	Zero-trust network access
Containers	Docker	Application isolation
Source Control	GitHub	Infrastructure code repository
CI/CD	GitHub Actions	Automated testing and deployment

Repository Structure

infrastructure/
├── ansible.cfg                    # Ansible configuration
├── inventory/
│   ├── production/                # Prod environment
│   │   ├── hosts.yml              # Server inventory
│   │   └── group_vars/            # Group variables + vault
│   ├── staging/                   # Staging environment
│   └── development/               # Dev environment
├── playbooks/
│   ├── site.yml                   # Master playbook
│   ├── web_servers.yml            # Web server setup
│   ├── databases.yml              # Database setup
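│   ├── common.yml                 # Base configuration
│   ├── bootstrap.yml              # First-run bootstrap for new servers
│   └── emergency/                 # Pre-written emergency playbooks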
├── roles/
│   ├── common/                    # Every server gets this
│   ├── docker/                    # Docker installation
│   ├── security/                  # Hardening, firewall
│   ├── monitoring/                # Metrics, logging
│   └── nginx/                     # Nginx web server
└── scripts/
    ├── provision-new-server.sh    # Bootstrap new server
    └── drift-detection.sh         # Daily drift check
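
A quick sanity check of this layout can be scripted. The sketch below is illustrative and not part of the repository: the function name check_layout is hypothetical, and it only verifies the top-level paths shown in the tree.

```shell
# Hypothetical helper: verify that a checkout has the top-level layout shown above.
# Prints each missing path and returns non-zero if anything is absent.
check_layout() {
    root="$1"
    missing=0
    for path in ansible.cfg inventory playbooks roles scripts; do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_layout . from the repository root before a CI job starts would catch a renamed or deleted directory early.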

Key Policies

Ansible Standards - Playbook structure, naming, best practices
Server Provisioning - How to create new servers
Configuration Drift Detection - Automated monitoring for manual changes
DigitalOcean Management - Cloud resource management
Docker Standards - Container image guidelines

Standard Server Lifecycle

1. Provision New Server

# Create server in DigitalOcean (via web UI or API)
# Add to inventory/production/hosts.yml

# Bootstrap with Ansible
ansible-playbook playbooks/bootstrap.yml -i inventory/production -l new-server-01

What it does:

Installs Python (for Ansible)
Creates admin users with SSH keys
Hardens SSH config
Sets up firewall baseline
Installs monitoring agent

2. Configure Server Role

# Apply role-specific configuration
ansible-playbook playbooks/web_servers.yml -i inventory/production -l new-server-01

What it does:

Installs required packages (Docker, Nginx, etc.)
Configures services
Sets up SSL certificates
Deploys application
Configures monitoring
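
Steps 1 and 2 can be chained in a small wrapper so that the role playbook only runs after a successful bootstrap. This is a minimal sketch under assumptions: the function name provision_host and the RUN dry-run variable are hypothetical, and it is not the contents of scripts/provision-new-server.sh.

```shell
# Hypothetical wrapper: bootstrap a new host, then apply its role playbook.
# Set RUN=echo to print the commands instead of executing them.
provision_host() {
    host="$1"
    role_playbook="$2"                      # e.g. web_servers
    inventory="${3:-inventory/production}"  # defaults to production

    # Step 1: baseline bootstrap; stop here if it fails.
    ${RUN:-} ansible-playbook playbooks/bootstrap.yml -i "$inventory" -l "$host" || return 1
    # Step 2: role-specific configuration.
    ${RUN:-} ansible-playbook "playbooks/${role_playbook}.yml" -i "$inventory" -l "$host"
}
```

With RUN=echo set, provision_host new-server-01 web_servers prints the two commands for review without touching the server.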

3. Ongoing Management

All changes go through Git:

# 1. Update Ansible code
vim roles/nginx/templates/nginx.conf.j2

# 2. Test locally or in dev
ansible-playbook playbooks/web_servers.yml -i inventory/development --check

# 3. Commit, PR, review, merge

# 4. CI automatically applies to staging

# 5. Manually trigger production deploy (after verification)
ansible-playbook playbooks/site.yml -i inventory/production

4. Decommission Server

# Remove from inventory
# (keep in Git history for audit trail)

# Remove from monitoring, DNS, load balancer (via Ansible)

# Destroy in DigitalOcean (keep snapshots for 30 days)

Drift Detection

Problem: Someone manually edits a config file on a production server.

Solution: Daily automated drift detection.

# Runs daily via GitHub Actions
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  --check \
  --diff
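
One way to turn that check run into a pass/fail signal is to read the PLAY RECAP that ansible-playbook prints at the end. This is a minimal sketch, not the contents of scripts/drift-detection.sh: the function name drifted_hosts is hypothetical, and it assumes the default uncolored recap format.

```shell
# List hosts whose PLAY RECAP line reports changed > 0.
# Reads ansible-playbook output on stdin. In --check mode, "changed" means the
# live host differs from what the playbook would set.
drifted_hosts() {
    awk '/^PLAY RECAP/ { in_recap = 1; next }
         in_recap && match($0, /changed=[0-9]+/) {
             # "changed=" is 8 characters; the digits follow it.
             n = substr($0, RSTART + 8, RLENGTH - 8)
             if (n + 0 > 0) print $1
         }'
}
```

Piping the command above into drifted_hosts yields one host name per line, and empty output means no drift was reported. Tasks that do not support check mode are skipped, so a clean recap is a strong signal rather than proof.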

If drift is detected:

An alert is posted to the #devops Rocket.Chat channel
The alert shows what changed
The team investigates:
- Was it an emergency fix? → Codify it in Ansible within 24 hours
- Was it unauthorized? → Revert and investigate
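
The alert step could be sketched as follows. It assumes an incoming webhook has been created in Rocket.Chat for the #devops channel and that its URL is exported as ROCKETCHAT_WEBHOOK_URL. That variable name and both function names are hypothetical. The payload uses the text field that Rocket.Chat incoming webhooks accept.

```shell
# Build a JSON body for a Rocket.Chat incoming webhook.
# Handles single-line messages only: newlines are not escaped.
alert_payload() {
    # Escape backslashes first, then double quotes, so the text is a valid JSON string.
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"text":"%s"}\n' "$escaped"
}

# Post the alert. ROCKETCHAT_WEBHOOK_URL is an assumed environment variable.
send_drift_alert() {
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "$(alert_payload "$1")" "$ROCKETCHAT_WEBHOOK_URL"
}
```

A daily job might call send_drift_alert "Configuration drift detected on prod-web-01" and attach the --diff output to the workflow run instead of the chat message.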

Goal: Drift should be rare (< 1 per month).

Emergency Procedures

Sometimes you need to fix production RIGHT NOW:

Allowed (with caveats):

Make the manual change to fix the outage
Document immediately in #incidents channel
Create Ansible PR within 24 hours to codify the change
Verify that the Ansible run produces the same result as the manual change
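
The 24-hour rule is easy to check mechanically once the time of the manual change is recorded. The sketch below is an assumption about how that might be done: the function name codify_overdue is hypothetical, and timestamps are Unix seconds.

```shell
# Hypothetical helper: succeed (exit 0) if more than 24 hours have passed
# since a manual emergency change without it being codified.
# $1 = time of the manual change, $2 = current time (defaults to now).
codify_overdue() {
    changed_at="$1"
    now="${2:-$(date +%s)}"
    [ $((now - changed_at)) -gt $((24 * 3600)) ]
}
```

A reminder job could loop over open incidents and post to #incidents for each one where codify_overdue succeeds.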

Emergency Playbooks (Better):

Keep pre-written playbooks for common emergencies:

# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=nginx host=prod-web-01"

# Scale up
ansible-playbook playbooks/emergency/add-web-server.yml

# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=v1.2.3"

Infrastructure Metrics

We track:

Configuration drift incidents (target: 0 per month)
Time from code merge to deployment (target: < 30 min for staging)
Infrastructure changes via PR (target: 100%)
Playbook test coverage (target: all playbooks tested in CI)

Dashboard: (link to infrastructure metrics)

Common Tasks

Add a new server: Server Provisioning Guide

Update application: Deployment Procedures

Change configuration: Edit Ansible code → PR → Review → Merge → Deploy

Troubleshoot server: Check Ansible logs, review last playbook run

Quick Reference

Run Ansible playbook:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env>

Check what would change:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env> --check --diff

Run on specific server:

ansible-playbook playbooks/<playbook>.yml -i inventory/<env> -l server-name

See server variables:

ansible-inventory -i inventory/<env> --host server-name
