Infrastructure as Code
Problem We're Solving
Configuration drift: servers are configured differently, manual changes go untracked, and code that "works on my machine" fails in production.
Solution
All configuration lives in Git and is applied by Ansible, backed by automated drift detection and standardized server provisioning.
Core Principle
If it's not in Git and managed by Ansible, it doesn't exist.
No manual server configuration. Ever. (Except documented emergencies, which must be codified within 24 hours.)
How It Works
Code Change → Git PR → Review → Merge → CI Test → Deploy to Staging → Manual Prod Deploy
↓
Everything versioned, reviewed, tested, reproducible
Infrastructure Stack
| Component | Tool | Purpose |
|---|---|---|
| Configuration Management | Ansible | Server provisioning, app deployment |
| Cloud Infrastructure | DigitalOcean | VMs, networking, load balancers |
| Networking | NetBird VPN | Zero-trust network access |
| Containers | Docker | Application isolation |
| Source Control | GitHub | Infrastructure code repository |
| CI/CD | GitHub Actions | Automated testing and deployment |
Repository Structure
infrastructure/
├── ansible.cfg                     # Ansible configuration
├── inventory/
│   ├── production/                 # Prod environment
│   │   ├── hosts.yml               # Server inventory
│   │   └── group_vars/             # Group variables + vault
│   ├── staging/                    # Staging environment
│   └── development/                # Dev environment
├── playbooks/
│   ├── site.yml                    # Master playbook
│   ├── bootstrap.yml               # New-server bootstrap
│   ├── web_servers.yml             # Web server setup
│   ├── databases.yml               # Database setup
│   ├── common.yml                  # Base configuration
│   └── emergency/                  # Pre-written emergency playbooks
├── roles/
│   ├── common/                     # Every server gets this
│   ├── docker/                     # Docker installation
│   ├── security/                   # Hardening, firewall
│   ├── monitoring/                 # Metrics, logging
│   └── nginx/                      # Nginx web server
└── scripts/
    ├── provision-new-server.sh     # Bootstrap new server
    └── drift-detection.sh          # Daily drift check
Key Policies
- Ansible Standards - Playbook structure, naming, best practices
- Server Provisioning - How to create new servers
- Configuration Drift Detection - Automated monitoring for manual changes
- DigitalOcean Management - Cloud resource management
- Docker Standards - Container image guidelines
Standard Server Lifecycle
1. Provision New Server
# Create server in DigitalOcean (via web UI or API)
# Add to inventory/production/hosts.yml
# Bootstrap with Ansible
ansible-playbook playbooks/bootstrap.yml -i inventory/production -l new-server-01
What it does:
- Installs Python (for Ansible)
- Creates admin users with SSH keys
- Hardens SSH config
- Sets up firewall baseline
- Installs monitoring agent
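A minimal sketch of how a wrapper such as `scripts/provision-new-server.sh` could guard the bootstrap step. This is an illustration, not the actual script: the function names `valid_env` and `provision` and the exit code 64 are assumptions. It refuses to run unless the environment is one of the three inventory directories and the host is already listed in that inventory.

```shell
#!/bin/sh
# Hypothetical sketch; function names and exit codes are assumptions.
set -eu

# Succeed only for environments that exist under inventory/.
valid_env() {
  case "$1" in
    production|staging|development) return 0 ;;
    *) return 1 ;;
  esac
}

# Bootstrap one host: validate the environment, confirm the host is in the
# inventory, then run the bootstrap playbook limited to that host.
provision() {
  target_env="$1"
  host="$2"
  if ! valid_env "$target_env"; then
    echo "unknown environment: $target_env" >&2
    return 64
  fi
  # ansible-inventory --host fails if the host is not defined in the inventory.
  if ! ansible-inventory -i "inventory/$target_env" --host "$host" >/dev/null; then
    echo "$host is not in inventory/$target_env; add it and commit first" >&2
    return 1
  fi
  ansible-playbook playbooks/bootstrap.yml -i "inventory/$target_env" -l "$host"
}
```

With these functions loaded, `provision production new-server-01` would run the same bootstrap command shown above, but only after the inventory check passes.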
2. Configure Server Role
# Apply role-specific configuration
ansible-playbook playbooks/web_servers.yml -i inventory/production -l new-server-01
What it does:
- Installs required packages (Docker, Nginx, etc.)
- Configures services
- Sets up SSL certificates
- Deploys application
- Configures monitoring
3. Ongoing Management
All changes go through Git:
# 1. Update Ansible code
vim roles/nginx/templates/nginx.conf.j2
# 2. Test locally or in dev
ansible-playbook playbooks/web_servers.yml -i inventory/development --check
# 3. Commit, PR, review, merge
# 4. CI automatically applies to staging
# 5. Manually trigger production deploy (after verification)
ansible-playbook playbooks/site.yml -i inventory/production
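For the CI step in this workflow, one inexpensive check is a syntax pass over every playbook before anything is applied to staging. The sketch below assumes a helper named `syntax_check_all` (hypothetical) and an `ANSIBLE_PLAYBOOK` environment variable that exists only so the command can be swapped out in tests. `--syntax-check` is a standard `ansible-playbook` flag; it parses the playbook without executing any tasks.

```shell
#!/bin/sh
# Hypothetical CI helper; syntax_check_all and ANSIBLE_PLAYBOOK are assumptions.
set -eu

# Syntax-check every .yml file in a directory against the given inventory.
# Returns 1 if any playbook fails, 0 otherwise.
syntax_check_all() {
  dir="$1"
  inventory="$2"
  failed=0
  for playbook in "$dir"/*.yml; do
    [ -e "$playbook" ] || continue   # directory contained no .yml files
    if ! "${ANSIBLE_PLAYBOOK:-ansible-playbook}" "$playbook" -i "$inventory" --syntax-check; then
      echo "syntax check failed: $playbook" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A CI job could call `syntax_check_all playbooks inventory/staging` and fail the build on a non-zero exit. A syntax pass does not replace the `--check` run in dev; it only catches playbooks that cannot be parsed.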
4. Decommission Server
# Remove from inventory
# (keep in Git history for audit trail)
# Remove from monitoring, DNS, load balancer (via Ansible)
# Destroy in DigitalOcean (keep snapshots for 30 days)
Drift Detection
Problem: Someone manually edits a config file on a production server.
Solution: Daily automated drift detection.
# Runs daily via GitHub Actions
ansible-playbook playbooks/site.yml \
-i inventory/production \
--check \
--diff
If drift detected:
- Alert posted to #devops Rocket.Chat channel
- Shows what changed
- Team investigates:
  - Was it an emergency fix? → Codify it in Ansible within 24 hours
  - Was it unauthorized? → Revert and investigate
Goal: Drift should be rare: fewer than one incident per month on average, with zero as the target.
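A rough sketch of how the daily check could turn the `--check --diff` output into an alert. It is not the contents of the real `scripts/drift-detection.sh`. It assumes drift is inferred from non-zero `changed=` counts in the `PLAY RECAP` section, that alerts go to a Rocket.Chat incoming webhook, and that the webhook URL is supplied in a variable named `ROCKETCHAT_WEBHOOK_URL`. The function names and that variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a drift check. Function names and
# ROCKETCHAT_WEBHOOK_URL are assumptions, not taken from the real script.
set -eu

# Read ansible-playbook output on stdin and print "host changed=N" for every
# PLAY RECAP line whose changed count is non-zero.
drifted_hosts() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^changed=[0-9]+$/ && $i != "changed=0") print $1, $i
    }'
}

# Build the JSON body for a Rocket.Chat incoming webhook from a host count.
drift_payload() {
  printf '{"text":"Configuration drift detected on %d host(s)"}' "$1"
}

# Run the site playbook in check mode. Returns 0 if nothing would change,
# 1 if drift was found and reported, 2 if the run itself failed.
run_drift_check() {
  if ! output="$(ansible-playbook playbooks/site.yml -i inventory/production --check --diff)"; then
    echo "check-mode run failed; drift status unknown" >&2
    return 2
  fi
  drift="$(printf '%s\n' "$output" | drifted_hosts)"
  if [ -z "$drift" ]; then
    echo "No drift detected"
    return 0
  fi
  printf 'Drift detected:\n%s\n' "$drift"
  count="$(printf '%s\n' "$drift" | awk 'END { print NR }')"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(drift_payload "$count")" "$ROCKETCHAT_WEBHOOK_URL"
  return 1
}
```

Check mode has limits worth keeping in mind: tasks that cannot predict their result may be skipped or may always report `changed`, so a non-zero count is a prompt to read the diff, not proof of a manual edit.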
Emergency Procedures
Sometimes you need to fix production RIGHT NOW:
Allowed (with caveats):
- Make the manual change to fix the outage
- Document immediately in #incidents channel
- Create Ansible PR within 24 hours to codify the change
- Verify Ansible produces same result
Emergency Playbooks (Better):
Keep pre-written playbooks for common emergencies:
# Restart service
ansible-playbook playbooks/emergency/restart-service.yml \
  -i inventory/production \
  -e "service=nginx host=prod-web-01"
# Scale up
ansible-playbook playbooks/emergency/add-web-server.yml -i inventory/production
# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -i inventory/production \
  -e "app=api version=v1.2.3"
Infrastructure Metrics
We track:
- Configuration drift incidents (target: 0 per month)
- Time from code merge to deployment (target: < 30 min for staging)
- Infrastructure changes via PR (target: 100%)
- Playbook test coverage (target: all playbooks tested in CI)
Dashboard: (link to infrastructure metrics)
Common Tasks
Add a new server: Server Provisioning Guide
Update application: Deployment Procedures
Change configuration: Edit Ansible code → PR → Review → Merge → Deploy
Troubleshoot server: Check Ansible logs, review last playbook run
Quick Reference
Run Ansible playbook:
ansible-playbook playbooks/<playbook>.yml -i inventory/<env>
Check what would change:
ansible-playbook playbooks/<playbook>.yml -i inventory/<env> --check --diff
Run on specific server:
ansible-playbook playbooks/<playbook>.yml -i inventory/<env> -l server-name
See server variables:
ansible-inventory -i inventory/<env> --host server-name