Configuration Drift Detection

What is Drift?

Configuration drift occurs when servers are manually modified outside of Ansible, causing them to differ from the defined infrastructure code.

Why It's a Problem

  • Servers become "snowflakes" (unique, fragile)
  • Changes aren't tracked or documented
  • Can't reproduce servers reliably
  • Troubleshooting is harder
  • Security configurations may be weakened

Detection Strategy

Automated Daily Checks

# .github/workflows/drift-detection.yml
name: Daily Drift Detection

on:
  schedule:
    - cron: '0 6 * * *' # 06:00 UTC daily
  workflow_dispatch: # Manual trigger

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Check for drift
        env:
          ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD_PROD }}
        run: |
          echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass

          # Run in check mode to detect changes
          ansible-playbook playbooks/site.yml \
            -i inventory/production \
            --vault-password-file /tmp/vault_pass \
            --check \
            --diff \
            | tee drift-report.txt

          # Remove the vault password file once the run is done
          rm -f /tmp/vault_pass

          # Check if any changes detected
          if grep -q "changed:" drift-report.txt; then
            echo "DRIFT_DETECTED=true" >> "$GITHUB_ENV"
            echo "⚠️ Configuration drift detected!"
          else
            echo "✅ No drift detected"
          fi

      - name: Alert to Rocket.Chat
        if: env.DRIFT_DETECTED == 'true'
        run: |
          curl -X POST "${{ secrets.ROCKETCHAT_WEBHOOK_URL }}" \
            -H 'Content-Type: application/json' \
            -d '{
              "channel": "#devops",
              "text": "⚠️ Infrastructure drift detected in production!",
              "attachments": [{
                "title": "Drift Detection Report",
                "text": "Manual changes detected on production servers. Review GitHub Actions logs.",
                "color": "warning"
              }]
            }'
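
The grep for "changed:" in the workflow above matches any line containing that text, including lines inside --diff output. A stricter variant, sketched below, counts only the changed=N counters on PLAY RECAP lines. The function name count_changed is hypothetical, and the sketch assumes the default Ansible recap layout ("host : ok=N changed=N ...").

```shell
# Sum the changed=N counters from the PLAY RECAP lines of a saved
# ansible-playbook report and print the total.
# Assumes recap lines look like:
#   web-01   : ok=12   changed=2   unreachable=0   failed=0
count_changed() {
  awk '/: +ok=[0-9]+ +changed=[0-9]+/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^changed=[0-9]+$/) { split($i, kv, "="); total += kv[2] }
       }
       END { print total + 0 }' "$1"
}
```

In the workflow step this could replace the grep test, for example: if [ "$(count_changed drift-report.txt)" -gt 0 ]; then ... fi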

Manual Drift Check

# Check all production servers
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  --check \
  --diff

# Check specific server
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l web-01 \
  --check \
  --diff

# Check a specific playbook (web servers only)
ansible-playbook playbooks/web_servers.yml \
  -i inventory/production \
  --check \
  --diff

Output:

  • Green ("ok"): The task would change nothing (no drift)
  • Yellow ("changed"): The task would change something (drift found)
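
Ansible normally colors output only when writing to a terminal, so the green/yellow cue is lost once a run is piped to a file. As a rough sketch, the drifted hosts can be listed from the PLAY RECAP of a saved report instead. The function name drifted_hosts is hypothetical, and the default recap layout is assumed.

```shell
# Print the name of every host whose PLAY RECAP line reports changed > 0.
# Assumes recap lines look like:
#   web-01   : ok=12   changed=2   unreachable=0   failed=0
drifted_hosts() {
  awk '/: +ok=[0-9]+ +changed=[0-9]+/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^changed=[1-9][0-9]*$/) print $1
       }' "$1"
}
```

Each printed name can be passed to -l for a focused re-check of that server.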

Handling Detected Drift

Step 1: Investigate

# Get detailed diff
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l drifted-server \
  --check \
  --diff \
  -vv # Verbose output

Questions to ask:

  • What changed?
  • When did it change?
  • Who made the change?
  • Was it an emergency fix?
  • Is it documented anywhere?
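
For the "when" question, a file's modification time is often the quickest first clue. The sketch below assumes GNU coreutils stat (standard on Linux) and is meant to run on the drifted server. The function name file_change_info is hypothetical. The owner shown is the file's owner, which is not necessarily the person who edited it.

```shell
# Print last-modified time, owner, and path for each file given.
# Requires GNU stat (the -c format option).
file_change_info() {
  stat -c '%y  owner=%U  %n' "$@"
}
```

For example, file_change_info /etc/nginx/nginx.conf prints one line for that file. Answering "who" reliably needs session or audit logs rather than file metadata.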

Step 2: Determine Action

Scenario A: Emergency Fix (Acceptable)

Example: Someone restarted nginx during an outage
Action:
1. Document in #incidents channel
2. Create Ansible PR to codify the change within 24 hours
3. Verify Ansible produces the same result
4. Merge and deploy

Scenario B: Unauthorized Change (Not Acceptable)

Example: Someone manually edited config without documentation
Action:
1. Revert the change by running Ansible
2. Investigate who made the change and why
3. Educate team on IaC policy
4. If needed: Restrict SSH access

Scenario C: Intentional Test (Acceptable with Process)

Example: Testing a config change before adding to Ansible
Action:
1. Verify it's documented in a ticket/PR
2. Complete the Ansible PR
3. Deploy via Ansible
4. Drift will be resolved

Step 3: Fix Drift

Option 1: Update Code to Match Reality

# If the manual change is correct, codify it
vim roles/nginx/templates/nginx.conf.j2
git add roles/nginx/templates/nginx.conf.j2
git commit -m "Update nginx config to match production fix"
# PR → Review → Merge

Option 2: Revert to Code

# If the manual change is wrong, revert it
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l drifted-server
# This will restore the server to match Git

Preventing Drift

1. Make Ansible Easy

If Ansible is hard to use, people will SSH and make manual changes.

Make it easy:

  • Clear documentation
  • Pre-written emergency playbooks
  • Fast CI/CD pipeline
  • Easy testing in dev/staging

2. Emergency Playbooks

Common tasks that need to be done quickly:

# Restart service (fast, no need to SSH)
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=nginx host=prod-web-01"

# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=v1.2.3"

# Scale resources
ansible-playbook playbooks/emergency/scale-up.yml \
  -e "server_group=web_servers cpu=4 memory=8G"
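
To lower the friction further, the commands above could be wrapped in a small helper that validates its arguments before calling Ansible. This is only a sketch: the function name emergency_restart and the RUN switch are hypothetical, and it assumes the restart-service.yml playbook takes service and host as extra vars, as in the example above. Because key=value extra vars are split on whitespace, the helper rejects values containing spaces or other unexpected characters.

```shell
# Wrapper around the emergency restart playbook.
# Prints the command by default; runs it only when RUN=1 is set.
emergency_restart() {
  service="$1"
  host="$2"
  # Reject empty values and anything outside a conservative character set,
  # so a value cannot smuggle in extra key=value pairs.
  for value in "$service" "$host"; do
    case "$value" in
      ''|*[!A-Za-z0-9_.-]*)
        echo "usage: emergency_restart <service> <host>" >&2
        return 2
        ;;
    esac
  done
  if [ "${RUN:-0}" = "1" ]; then
    ansible-playbook playbooks/emergency/restart-service.yml \
      -e "service=$service host=$host"
  else
    echo "ansible-playbook playbooks/emergency/restart-service.yml -e \"service=$service host=$host\""
  fi
}
```

Calling emergency_restart nginx prod-web-01 prints the command for review, and RUN=1 emergency_restart nginx prod-web-01 executes it.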

3. Restrict SSH Access

  • Prod SSH access requires justification
  • All SSH sessions logged
  • Prefer read-only access
  • Use Netbird VPN with network segmentation

4. Audit Trails

# roles/common/tasks/audit.yml
- name: Enable auditd for command logging
  package:
    name: auditd
    state: present

- name: Log all config file changes
  template:
    src: audit-rules.j2
    dest: /etc/audit/rules.d/config-changes.rules
  notify: restart auditd
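
The contents of audit-rules.j2 are not shown here. As an illustration of what such a rules file might contain, the sketch below prints one auditd watch rule per directory, using the standard -w (watch path), -p wa (writes and attribute changes), and -k (search key) syntax. The function name audit_rules and the key config-changes are assumptions.

```shell
# Print an auditd watch rule for each directory given as an argument.
#   -w <path>  watch this path
#   -p wa      log writes and attribute changes
#   -k <key>   tag events so they can be searched later
audit_rules() {
  for dir in "$@"; do
    printf '%s\n' "-w $dir -p wa -k config-changes"
  done
}
```

With rules like these loaded, ausearch -k config-changes -i lists the recorded events, including the user who made each change.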

Acceptable Drift Scenarios

Sometimes drift is expected and OK:

  1. Log files - Constantly changing, ignore them
  2. Temporary files - /tmp contents
  3. Dynamic content - User-uploaded files
  4. Runtime data - Process IDs, timestamps
  5. Package updates - OS security patches (automate these)

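Configure Ansible so that expected differences are not reported as changes:

# template already reports "changed" only when the rendered file
# differs from the one on the server, so it needs no changed_when
- name: Copy config file
  template:
    src: app.conf.j2
    dest: /etc/app/app.conf

# command/shell tasks always report "changed" unless told otherwise
# (the command below is a hypothetical example)
- name: Read app version
  command: /usr/bin/app --version
  register: app_version
  changed_when: false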

Metrics

Track drift incidents:

  • Number per month (target: 0)
  • Time to resolution (target: < 24 hours)
  • Root cause (emergency vs unauthorized)
  • Repeat incidents on same servers

Dashboard: Monitor in your metrics system
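
As a minimal sketch of how the first and last metrics could be computed, assume each incident is recorded as one line of a CSV file in the form YYYY-MM-DD,host,cause. Both the file format and the function names below are hypothetical.

```shell
# Count incidents per month from a CSV of "YYYY-MM-DD,host,cause" lines.
incidents_per_month() {
  awk -F, '{ count[substr($1, 1, 7)]++ }
           END { for (m in count) print m, count[m] }' "$1" | sort
}

# List hosts that appear in more than one incident, with their counts.
repeat_hosts() {
  awk -F, '{ n[$2]++ }
           END { for (h in n) if (n[h] > 1) print h, n[h] }' "$1" | sort
}
```

The output of either function is plain "key count" lines, which most metrics systems can ingest with little extra work.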