Configuration Drift Detection
What is Drift?
Configuration drift occurs when servers are manually modified outside of Ansible, causing them to differ from the defined infrastructure code.
Why It's a Problem
- Servers become "snowflakes" (unique, fragile)
- Changes aren't tracked or documented
- Can't reproduce servers reliably
- Troubleshooting is harder
- Security configurations may be weakened
Detection Strategy
Automated Daily Checks
# .github/workflows/drift-detection.yml
name: Daily Drift Detection
on:
  schedule:
    - cron: '0 6 * * *' # 06:00 UTC daily (GitHub evaluates cron in UTC)
  workflow_dispatch: # Manual trigger
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Check for drift
        env:
          ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD_PROD }}
        run: |
          echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass
          # Run in check mode to detect changes (nothing is modified)
          ansible-playbook playbooks/site.yml \
            -i inventory/production \
            --vault-password-file /tmp/vault_pass \
            --check \
            --diff \
            | tee drift-report.txt
          # Remove the vault password from disk once the run is over
          rm -f /tmp/vault_pass
          # Check if any changes detected
          if grep -q "changed:" drift-report.txt; then
            echo "DRIFT_DETECTED=true" >> "$GITHUB_ENV"
            echo "⚠️ Configuration drift detected!"
          else
            echo "✅ No drift detected"
          fi
      - name: Alert to Rocket.Chat
        if: env.DRIFT_DETECTED == 'true'
        run: |
          curl -X POST "${{ secrets.ROCKETCHAT_WEBHOOK_URL }}" \
            -H 'Content-Type: application/json' \
            -d '{
              "channel": "#devops",
              "text": "⚠️ Infrastructure drift detected in production!",
              "attachments": [{
                "title": "Drift Detection Report",
                "text": "Manual changes detected on production servers. Review GitHub Actions logs.",
                "color": "warning"
              }]
            }'
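The `grep -q "changed:"` test in the workflow only reports that something changed, not where. As a rough sketch of how the report could be narrowed down to host names, the function below reads the PLAY RECAP section of `drift-report.txt` and prints every host with a non-zero `changed` count. The name `drifted_hosts` is hypothetical, and the parsing assumes Ansible's default recap layout; a custom stdout callback would need different handling.

```shell
#!/usr/bin/env bash
# drifted_hosts REPORT
# Print each host whose PLAY RECAP line shows changed > 0.
# Assumes the default ansible-playbook recap layout, for example:
#   web-01 : ok=12 changed=2 unreachable=0 failed=0 ...
drifted_hosts() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && match($0, /changed=[0-9]+/) {
      # Skip the 8 characters of "changed=" to reach the number
      count = substr($0, RSTART + 8, RLENGTH - 8) + 0
      if (count > 0) print $1
    }
  ' "$1"
}
```

Running `drifted_hosts drift-report.txt` after the check would give a list of hosts that could be added to the alert text or passed to `-l` for a closer look.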
Manual Drift Check
# Check all production servers
ansible-playbook playbooks/site.yml \
-i inventory/production \
--check \
--diff
# Check specific server
ansible-playbook playbooks/site.yml \
-i inventory/production \
-l web-01 \
--check \
--diff
# Check a single playbook instead of the whole site
ansible-playbook playbooks/web_servers.yml \
-i inventory/production \
--check \
--diff
Output (default Ansible colors):
- Green ("ok"): No changes (no drift)
- Yellow ("changed"): Changes detected (drift found)
Handling Detected Drift
Step 1: Investigate
# Get detailed diff
ansible-playbook playbooks/site.yml \
-i inventory/production \
-l drifted-server \
--check \
--diff \
-vv # Verbose output
Questions to ask:
- What changed?
- When did it change?
- Who made the change?
- Was it an emergency fix?
- Is it documented anywhere?
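The "what" and "when" questions can often be narrowed down on the drifted server itself. The following is a minimal sketch, assuming GNU `find` (for `-printf`) and a hypothetical helper name `recent_changes`; it lists files under a directory that were modified within a given number of minutes, with their modification time and owner. The owner shown is the file's owner, not necessarily the person who edited it, so it is only a starting point.

```shell
#!/usr/bin/env bash
# recent_changes DIR [MINUTES]
# List regular files under DIR modified in the last MINUTES
# (default 1440 = 24 hours), with modification time and file owner.
# Requires GNU find for -printf.
recent_changes() {
  local dir=$1 minutes=${2:-1440}
  find "$dir" -type f -mmin "-${minutes}" \
    -printf '%TY-%Tm-%Td %TH:%TM  %u  %p\n' | sort
}
```

For example, `recent_changes /etc/nginx 2880` would show nginx files touched in the last two days; comparing those timestamps with login history from `last` may help answer who was on the server at the time.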
Step 2: Determine Action
Scenario A: Emergency Fix (Acceptable)
Example: Someone restarted nginx during an outage
Action:
1. Document in #incidents channel
2. Create Ansible PR to codify the change within 24 hours
3. Verify Ansible produces the same result
4. Merge and deploy
Scenario B: Unauthorized Change (Not Acceptable)
Example: Someone manually edited config without documentation
Action:
1. Revert the change by running Ansible
2. Investigate who made the change and why
3. Educate team on IaC policy
4. If needed: Restrict SSH access
Scenario C: Intentional Test (Acceptable with Process)
Example: Testing a config change before adding to Ansible
Action:
1. Verify it's documented in a ticket/PR
2. Complete the Ansible PR
3. Deploy via Ansible
4. Drift will be resolved
Step 3: Fix Drift
Option 1: Update Code to Match Reality
# If the manual change is correct, codify it
vim roles/nginx/templates/nginx.conf.j2
git add roles/nginx/templates/nginx.conf.j2
git commit -m "Update nginx config to match production fix"
# PR → Review → Merge
Option 2: Revert to Code
# If the manual change is wrong, revert it
ansible-playbook playbooks/site.yml \
-i inventory/production \
-l drifted-server
# This will restore the server to match Git
Preventing Drift
1. Make Ansible Easy
If Ansible is hard to use, people will SSH and make manual changes.
Make it easy:
- Clear documentation
- Pre-written emergency playbooks
- Fast CI/CD pipeline
- Easy testing in dev/staging
2. Emergency Playbooks
Common tasks that need to be done quickly:
# Restart service (fast, no need to SSH)
ansible-playbook playbooks/emergency/restart-service.yml \
-e "service=nginx host=prod-web-01"
# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
-e "app=api version=v1.2.3"
# Scale resources
ansible-playbook playbooks/emergency/scale-up.yml \
-e "server_group=web_servers cpu=4 memory=8G"
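One way to make these playbooks quicker to reach under pressure is a small wrapper. The sketch below is an assumption, not part of the existing tooling: the function name `emergency` and the `DRY_RUN` variable are hypothetical, and the list of allowed names simply mirrors the three playbooks shown above.

```shell
#!/usr/bin/env bash
# emergency NAME [key=value ...]
# Run playbooks/emergency/NAME.yml with the given extra vars.
# Set DRY_RUN=1 to print the command instead of running it.
emergency() {
  if [ $# -lt 1 ]; then
    echo "usage: emergency NAME [key=value ...]" >&2
    return 2
  fi
  local name=$1
  shift
  # Only accept the emergency playbooks that are known to exist
  case "$name" in
    restart-service|rollback|scale-up) ;;
    *) echo "unknown emergency playbook: $name" >&2; return 2 ;;
  esac
  local cmd=(ansible-playbook "playbooks/emergency/${name}.yml")
  if [ $# -gt 0 ]; then
    cmd+=(-e "$*")   # all key=value pairs as one extra-vars string
  fi
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

With this in place, `emergency restart-service service=nginx host=prod-web-01` would run the first command above, and prefixing it with `DRY_RUN=1` would only print it.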
3. Restrict SSH Access
- Prod SSH access requires justification
- All SSH sessions logged
- Prefer read-only access
- Use Netbird VPN with network segmentation
4. Audit Trails
# roles/common/tasks/audit.yml
- name: Install auditd for command logging
  package:
    name: auditd  # Debian/Ubuntu package name; RHEL-family systems call it "audit"
    state: present
- name: Log all config file changes
  template:
    src: audit-rules.j2
    dest: /etc/audit/rules.d/config-changes.rules
  notify: restart auditd
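The contents of `audit-rules.j2` are not shown here. As an illustration of what the rendered rules file might contain, the sketch below prints one auditd watch rule per path. The function name `audit_watch_rules`, the key `config-changes`, and the example paths are all assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# audit_watch_rules KEY PATH...
# Emit one auditd watch rule per path, in the form: -w PATH -p wa -k KEY
#   -p wa   log writes and attribute changes
#   -k KEY  tag matching events so they can be searched by key
audit_watch_rules() {
  local key=$1
  shift
  local path
  for path in "$@"; do
    # "--" stops printf from treating the leading "-w" as an option
    printf -- '-w %s -p wa -k %s\n' "$path" "$key"
  done
}
```

For example, `audit_watch_rules config-changes /etc/nginx/ /etc/ssh/sshd_config` prints two rules; once such rules are loaded, `ausearch -k config-changes` lists the recorded events, including the user who triggered them.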
Acceptable Drift Scenarios
Sometimes drift is expected and OK:
- Log files - Constantly changing, ignore them
- Temporary files - /tmp contents
- Dynamic content - User-uploaded files
- Runtime data - Process IDs, timestamps
- Package updates - OS security patches (automate these)
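Ansible only compares what a task manages, so paths that no task touches (logs, /tmp, uploads) never show up as drift. For managed files, rely on the module's own change detection:
# Use proper change detection
# template compares the rendered content with the file on the host and
# reports "changed" only when they differ, so no changed_when override is needed
- name: Copy config file
  template:
    src: app.conf.j2
    dest: /etc/app/app.conf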
Metrics
Track drift incidents:
- Number per month (target: 0)
- Time to resolution (target: < 24 hours)
- Root cause (emergency vs unauthorized)
- Repeat incidents on same servers
Dashboard: Monitor in your metrics system
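If incidents are recorded in a simple log, the metrics above can be computed with a few lines of shell before a dashboard exists. The sketch below assumes a hypothetical CSV file covering one month, with a header row and the columns `detected_epoch,resolved_epoch,cause,host`; neither the file layout nor the function name `drift_metrics` comes from existing tooling.

```shell
#!/usr/bin/env bash
# drift_metrics CSV
# CSV columns (hypothetical): detected_epoch,resolved_epoch,cause,host
# Prints incident count, mean time to resolution in hours, incidents that
# missed the 24-hour target, counts per root cause, and repeat hosts.
drift_metrics() {
  awk -F, '
    NR == 1 { next }                      # skip the header row
    {
      hours = ($2 - $1) / 3600            # time to resolution
      total++
      sum += hours
      if (hours > 24) late++
      cause[$3]++
      host[$4]++
    }
    END {
      printf "incidents=%d\n", total
      if (total > 0) printf "mean_hours=%.1f\n", sum / total
      printf "over_24h=%d\n", late
      for (c in cause) printf "cause_%s=%d\n", c, cause[c]
      for (h in host) if (host[h] > 1) printf "repeat_host_%s=%d\n", h, host[h]
    }
  ' "$1"
}
```

The `key=value` output is meant to be easy to forward to whatever metrics system is in use; the order of the per-cause and per-host lines is not guaranteed.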
Related Policies
- Ansible Standards - How to write Ansible code
- Incident Response - Emergency procedures
- Server Provisioning - Creating new servers