Configuration Drift Detection

What is Drift?

Configuration drift occurs when servers are manually modified outside of Ansible, causing them to differ from the defined infrastructure code.

Why It's a Problem

Servers become "snowflakes" (unique, fragile)
Changes aren't tracked or documented
Can't reproduce servers reliably
Troubleshooting is harder
Security configurations may be weakened

Detection Strategy

Automated Daily Checks

# .github/workflows/drift-detection.yml
name: Daily Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC daily (GitHub Actions schedules run in UTC)
  workflow_dispatch:  # Manual trigger

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check for drift
        env:
          ANSIBLE_VAULT_PASSWORD: ${{ secrets.ANSIBLE_VAULT_PASSWORD_PROD }}
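        run: |
          set -o pipefail  # Fail the step if ansible-playbook fails, so a broken run is not reported as "no drift"
          echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass
          trap 'rm -f /tmp/vault_pass' EXIT  # Do not leave the secret on disk

          # Run in check mode to detect changes
          ansible-playbook playbooks/site.yml \
            -i inventory/production \
            --vault-password-file /tmp/vault_pass \
            --check \
            --diff \
            | tee drift-report.txt

          # Check if any task reported a change (anchored so text inside diffs does not match)
          if grep -qE '^changed: \[' drift-report.txt; then
            echo "DRIFT_DETECTED=true" >> "$GITHUB_ENV"
            echo "⚠️ Configuration drift detected!"
          else
            echo "✅ No drift detected"
          fi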

      - name: Alert to Rocket.Chat
        if: env.DRIFT_DETECTED == 'true'
        run: |
          curl -X POST "${{ secrets.ROCKETCHAT_WEBHOOK_URL }}" \
            -H 'Content-Type: application/json' \
            -d '{
              "channel": "#devops",
              "text": "⚠️ Infrastructure drift detected in production!",
              "attachments": [{
                "title": "Drift Detection Report",
                "text": "Manual changes detected on production servers. Review GitHub Actions logs.",
                "color": "warning"
              }]
            }'
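
An alternative to matching per-task lines is to read the `PLAY RECAP` block at the end of the report, which prints one `changed=N` counter per host. The sketch below assumes the default Ansible output format and a report file such as `drift-report.txt`; the function name `count_drifted_hosts` is hypothetical and not part of Ansible.

```shell
#!/usr/bin/env bash
# Count hosts whose PLAY RECAP line reports changed > 0.
# count_drifted_hosts is a hypothetical helper name (an assumption, not an Ansible tool).
count_drifted_hosts() {
  local report="$1"
  # Recap lines look like: "web-01 : ok=12 changed=2 unreachable=0 failed=0 ..."
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && match($0, /changed=[0-9]+/) {
      # "changed=" is 8 characters long; the digits follow it
      n = substr($0, RSTART + 8, RLENGTH - 8)
      if (n + 0 > 0) drifted++
    }
    END { print drifted + 0 }
  ' "$report"
}
```

In the workflow this could replace the `grep` test, for example `if [ "$(count_drifted_hosts drift-report.txt)" -gt 0 ]; then ...`. Counting only inside the recap avoids false positives from file contents shown by `--diff`.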

Manual Drift Check

# Check all production servers
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  --check \
  --diff

# Check specific server
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l web-01 \
  --check \
  --diff

# Check specific role
ansible-playbook playbooks/web_servers.yml \
  -i inventory/production \
  --check \
  --diff

Output (colors appear only in an interactive terminal, not in piped or CI logs):

Green (ok): No changes (no drift)
Yellow (changed): Changes detected (drift found)

Handling Detected Drift

Step 1: Investigate

# Get detailed diff
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l drifted-server \
  --check \
  --diff \
  -vv  # Verbose output

Questions to ask:

What changed?
When did it change?
Who made the change?
Was it an emergency fix?
Is it documented anywhere?
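
For the first question, "What changed?", a saved check-mode report can be reduced to a list of hosts and the tasks that reported a change. This is a minimal sketch that assumes the default stdout format, where each task prints a `TASK [name] ***` header followed by per-host status lines; `list_changed_tasks` is a hypothetical helper name. The "when" and "who" questions cannot be answered from Ansible output and need shell history, SSH session logs, or audit logs.

```shell
#!/usr/bin/env bash
# Print "host<TAB>task name" for every task that reported "changed".
# list_changed_tasks is a hypothetical helper; it assumes default Ansible output.
list_changed_tasks() {
  awk '
    /^TASK \[/ {
      # Remember the most recent task header, stripped of "TASK [" and "] ****"
      task = $0
      sub(/^TASK \[/, "", task)
      sub(/\] [*]*$/, "", task)
      next
    }
    /^changed: \[/ {
      # Extract the host name between the brackets
      host = $0
      sub(/^changed: \[/, "", host)
      sub(/\].*$/, "", host)
      print host "\t" task
    }
  ' "$1"
}
```

Usage, assuming the report was saved with `tee` as in the scheduled workflow: `list_changed_tasks drift-report.txt`.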

Step 2: Determine Action

Scenario A: Emergency Fix (Acceptable)

Example: Someone edited the nginx config by hand to end an outage
Action:
1. Document in #incidents channel
2. Create Ansible PR to codify the change within 24 hours
3. Verify Ansible produces the same result
4. Merge and deploy

Scenario B: Unauthorized Change (Not Acceptable)

Example: Someone manually edited config without documentation
Action:
1. Revert the change by running Ansible
2. Investigate who made the change and why
3. Educate team on IaC policy
4. If needed: Restrict SSH access

Scenario C: Intentional Test (Acceptable with Process)

Example: Testing a config change before adding to Ansible
Action:
1. Verify it's documented in a ticket/PR
2. Complete the Ansible PR
3. Deploy via Ansible
4. Drift will be resolved

Step 3: Fix Drift

Option 1: Update Code to Match Reality

# If the manual change is correct, codify it
vim roles/nginx/templates/nginx.conf.j2
git add roles/nginx/templates/nginx.conf.j2
git commit -m "Update nginx config to match production fix"
# PR → Review → Merge

Option 2: Revert to Code

# If the manual change is wrong, revert it
ansible-playbook playbooks/site.yml \
  -i inventory/production \
  -l drifted-server
# This restores every Ansible-managed setting on the server to match Git;
# files and settings that no role manages are left untouched

Preventing Drift

1. Make Ansible Easy

If Ansible is hard to use, people will SSH and make manual changes.

Make it easy:

Clear documentation
Pre-written emergency playbooks
Fast CI/CD pipeline
Easy testing in dev/staging

2. Emergency Playbooks

Common tasks that need to be done quickly:

# Restart service (fast, no need to SSH)
ansible-playbook playbooks/emergency/restart-service.yml \
  -e "service=nginx host=prod-web-01"

# Rollback deployment
ansible-playbook playbooks/emergency/rollback.yml \
  -e "app=api version=v1.2.3"

# Scale resources
ansible-playbook playbooks/emergency/scale-up.yml \
  -e "server_group=web_servers cpu=4 memory=8G"
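
One way to make these playbooks quicker to reach for than SSH is a thin wrapper that checks its arguments before calling `ansible-playbook`. The sketch below is an assumption about how such a wrapper might look: `emergency_restart` and the `DRY_RUN` variable are hypothetical names, and only the playbook path and the `service`/`host` variables come from the example above.

```shell
#!/usr/bin/env bash
# emergency_restart is a hypothetical wrapper around the restart playbook.
# Set DRY_RUN=1 to print the command instead of running it.
emergency_restart() {
  local service="$1" host="$2"
  if [ -z "$service" ] || [ -z "$host" ]; then
    echo "usage: emergency_restart <service> <host>" >&2
    return 2
  fi
  # Reject spaces, '=' and other characters so input cannot inject extra -e variables
  case "$service$host" in
    *[!A-Za-z0-9._-]*) echo "invalid service or host name" >&2; return 2 ;;
  esac
  # Build the command once so the dry run prints exactly what would be executed
  set -- ansible-playbook playbooks/emergency/restart-service.yml \
    -e "service=$service host=$host"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 emergency_restart nginx prod-web-01` prints the command without touching any server, which is useful when documenting an incident.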

3. Restrict SSH Access

Prod SSH access requires justification
All SSH sessions logged
Prefer read-only access
Use Netbird VPN with network segmentation

4. Audit Trails

# roles/common/tasks/audit.yml
- name: Enable auditd for command logging
  package:
    name: auditd
    state: present

- name: Log all config file changes
  template:
    src: audit-rules.j2
    dest: /etc/audit/rules.d/config-changes.rules
  notify: restart auditd
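
The contents of `audit-rules.j2` are not shown here. As an assumption, the rendered file could contain watch rules of the form `-w <path> -p wa -k <key>`, where `-w` names the path to watch, `-p wa` records writes and attribute changes, and `-k` tags matching events with a search key. The helper name `make_watch_rules`, the key `config-changes`, and the watched paths are illustrative only.

```shell
#!/usr/bin/env bash
# Print one audit watch rule per path.
# make_watch_rules is a hypothetical helper; key and paths are illustrative.
make_watch_rules() {
  local key="$1"
  shift
  local path
  for path in "$@"; do
    # -w: path to watch, -p wa: writes and attribute changes, -k: search key
    printf '%s %s -p wa -k %s\n' '-w' "$path" "$key"
  done
}
```

Running `make_watch_rules config-changes /etc/nginx /etc/ssh/sshd_config` prints two rule lines. Once such rules are loaded, `ausearch -k config-changes -i` lists the recorded events with user names resolved, which helps answer who changed a file and when.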

Acceptable Drift Scenarios

Sometimes drift is expected and OK:

Log files - Constantly changing, ignore them
Temporary files - /tmp contents
Dynamic content - User-uploaded files
Runtime data - Process IDs, timestamps
Package updates - OS security patches (automate these)

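Configure Ansible so expected changes are not reported as drift:

# template already compares the rendered content with the file on the host,
# so it reports "changed" only on a real difference; no override is needed
- name: Copy config file
  template:
    src: app.conf.j2
    dest: /etc/app/app.conf

# Read-only commands report "changed" on every normal run unless told otherwise
- name: Read runtime data
  command: uptime
  register: uptime_result
  changed_when: false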

Metrics

Track drift incidents:

Number per month (target: 0)
Time to resolution (target: < 24 hours)
Root cause (emergency vs unauthorized)
Repeat incidents on same servers

Dashboard: Track these numbers in your existing metrics system
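
If incidents are recorded in a simple log, the first two metrics can be computed with a few lines of shell. The sketch below assumes a headerless CSV with the columns date (`YYYY-MM-DD`), hours to resolve, root cause, and host; that format and the name `drift_summary` are assumptions for illustration, not an existing convention.

```shell
#!/usr/bin/env bash
# Summarize drift incidents per month.
# drift_summary is a hypothetical helper. Assumed CSV columns (no header):
#   date (YYYY-MM-DD), hours_to_resolve, root_cause, host
drift_summary() {
  awk -F, '
    {
      month = substr($1, 1, 7)          # "YYYY-MM"
      count[month]++
      if ($2 + 0 >= 24) late[month]++   # missed the "< 24 hours" target
    }
    END {
      for (m in count)
        printf "%s incidents=%d missed_target=%d\n", m, count[m], late[m] + 0
    }
  ' "$1" | sort
}
```

Each output line gives the month, the number of incidents, and how many of them took 24 hours or longer to resolve.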

Related Policies

Ansible Standards - How to write Ansible code
Incident Response - Emergency procedures
Server Provisioning - Creating new servers
