SiamCafe.net Blog

NFS v4 Kerberos Blue-Green Canary Deploy — Deployment Strategy for NFS

2025-11-05 · อ. บอม — SiamCafe.net · 1,515 words

What Are NFS v4 and Kerberos?

NFS (Network File System) version 4 is a distributed file system protocol that lets clients mount file systems from remote servers over the network as if they were local. NFSv4 has several advantages over v3: it uses a single port (TCP 2049) with no portmapper required, has built-in security via RPCSEC_GSS, supports stateful operations, and offers compound operations that reduce network round trips.

Kerberos is a network authentication protocol that uses tickets to authenticate users and services without ever sending passwords over the network. Combining NFS v4 with Kerberos gives three security levels: krb5 (authentication only) verifies user identity; krb5i (integrity protection) prevents data from being modified in transit; krb5p (privacy protection) encrypts all data.
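As a minimal sketch of this choice in Python (the function name and decision rules are illustrative, following the guidance in this article, not part of any NFS tooling), picking a flavor for the `sec=` mount option might look like:

```python
# Illustrative helper: map data sensitivity and network trust to a
# Kerberos security flavor for the NFSv4 "sec=" mount option.
def pick_sec_flavor(sensitive_data: bool, trusted_network: bool) -> str:
    """Return the sec= mount option value for NFSv4 + Kerberos."""
    if sensitive_data or not trusted_network:
        return "krb5p"   # encrypt all traffic (highest cost, highest safety)
    return "krb5i"       # integrity checksums; a good internal-network default

print(pick_sec_flavor(True, True))    # krb5p
print(pick_sec_flavor(False, True))   # krb5i
```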

Blue-Green and Canary are deployment strategies that reduce downtime and risk when updating NFS infrastructure. Blue-Green switches between two complete environments; Canary gradually rolls out to a subset of clients first.

Installing NFS v4 with Kerberos Authentication

Setting up an NFS v4 server with Kerberos

# === NFS v4 + Kerberos Setup ===

# 1. Install Packages (Server)
sudo apt update
sudo apt install -y nfs-kernel-server krb5-user krb5-kdc krb5-admin-server

# 2. Configure Kerberos KDC
cat > /etc/krb5.conf << 'EOF'
[libdefaults]
    default_realm = EXAMPLE.COM
    dns_lookup_realm = false
    dns_lookup_kdc = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
    example.com = EXAMPLE.COM
EOF

# 3. Create Kerberos Database
sudo krb5_newrealm
# Enter master password when prompted

# 3. Create Kerberos Database
sudo krb5_newrealm
# Enter master password when prompted

# 4. Create NFS Service Principals
# (the client principal goes into a separate keytab, to be copied to the client)
sudo kadmin.local << 'KADMIN'
addprinc -randkey nfs/nfs-server.example.com@EXAMPLE.COM
addprinc -randkey nfs/nfs-client.example.com@EXAMPLE.COM
ktadd -k /etc/krb5.keytab nfs/nfs-server.example.com@EXAMPLE.COM
ktadd -k /tmp/client.keytab nfs/nfs-client.example.com@EXAMPLE.COM
KADMIN

# 5. Configure NFS Server
cat > /etc/exports << 'EOF'
/data/shared    *(rw,sync,no_subtree_check,sec=krb5p)
/data/readonly  *(ro,sync,no_subtree_check,sec=krb5i)
/data/public    *(rw,sync,no_subtree_check,sec=sys)
EOF

# Create directories
sudo mkdir -p /data/{shared,readonly,public}
sudo chown nobody:nogroup /data/shared /data/readonly /data/public

# 6. Enable and Start Services
sudo systemctl enable --now nfs-server
sudo systemctl enable --now rpc-gssd
sudo systemctl enable --now rpc-svcgssd

# Export shares
sudo exportfs -arv

# 7. Configure NFS Client
# On client machine:
sudo apt install -y nfs-common krb5-user

# Get keytab for client
# scp admin@kdc:/tmp/client.keytab /etc/krb5.keytab

# Mount with Kerberos
sudo mount -t nfs4 -o sec=krb5p nfs-server.example.com:/data/shared /mnt/shared

# Add to fstab for persistent mount
echo "nfs-server.example.com:/data/shared /mnt/shared nfs4 sec=krb5p,_netdev 0 0" | sudo tee -a /etc/fstab

# 8. Verify
mount | grep nfs4
# Should show sec=krb5p

echo "NFS v4 + Kerberos configured"

Blue-Green Deployment for NFS

A Blue-Green deployment strategy for NFS infrastructure

#!/usr/bin/env python3
# blue_green_nfs.py — Blue-Green NFS Deployment
import json
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bg_deploy")

class BlueGreenNFS:
    def __init__(self):
        self.blue = {
            "name": "blue",
            "server": "nfs-blue.example.com",
            "status": "active",
            "version": "1.0",
            "mount_point": "/data/shared",
        }
        self.green = {
            "name": "green",
            "server": "nfs-green.example.com",
            "status": "standby",
            "version": "1.1",
            "mount_point": "/data/shared",
        }
        self.active = "blue"
    
    def get_active(self):
        return self.blue if self.active == "blue" else self.green
    
    def get_standby(self):
        return self.green if self.active == "blue" else self.blue
    
    def prepare_standby(self, new_version):
        """Prepare standby environment with new version"""
        standby = self.get_standby()
        standby["version"] = new_version
        
        steps = [
            f"1. Sync data to {standby['server']}",
            f"   rsync -avz --delete /data/shared/ {standby['server']}:/data/shared/",
            f"2. Update NFS config on {standby['server']}",
            f"3. Apply security patches on {standby['server']}",
            f"4. Restart NFS services on {standby['server']}",
            f"5. Run health checks on {standby['server']}",
        ]
        
        return {
            "standby": standby["name"],
            "new_version": new_version,
            "steps": steps,
            "status": "prepared",
        }
    
    def switch_traffic(self):
        """Switch DNS/LoadBalancer to standby"""
        old_active = self.get_active()
        new_active = self.get_standby()
        
        # DNS update
        dns_update = {
            "record": "nfs.example.com",
            "old_target": old_active["server"],
            "new_target": new_active["server"],
            "ttl": 60,
        }
        
        old_active["status"] = "standby"
        new_active["status"] = "active"
        self.active = new_active["name"]
        
        return {
            "switched_from": old_active["name"],
            "switched_to": new_active["name"],
            "dns_update": dns_update,
            "timestamp": datetime.utcnow().isoformat(),
        }
    
    def rollback(self):
        """Rollback to previous active environment"""
        return self.switch_traffic()
    
    def health_check(self, server):
        """Check NFS server health"""
        checks = {
            "nfs_service": "running",
            "kerberos_service": "running",
            "export_available": True,
            "mount_test": "success",
            "read_write_test": "passed",
            "latency_ms": 2.5,
            "iops": 15000,
        }
        
        # String/bool checks must match a passing value; numeric metrics
        # (latency, iops) are informational here and always count as passing.
        all_passed = all(
            v in ("running", True, "success", "passed") or isinstance(v, (int, float))
            for v in checks.values()
        )
        
        return {
            "server": server,
            "healthy": all_passed,
            "checks": checks,
        }

bg = BlueGreenNFS()
print("Active:", json.dumps(bg.get_active(), indent=2))

prep = bg.prepare_standby("1.1")
print("Prepared:", json.dumps(prep, indent=2))

health = bg.health_check("nfs-green.example.com")
print("Health:", json.dumps(health, indent=2))

switch = bg.switch_traffic()
print("Switched:", json.dumps(switch, indent=2))

Canary Deployment Strategy

Canary deployment for NFS updates

# === Canary Deployment for NFS ===

# 1. Canary Deployment Plan
# ===================================
# Phase 1 (5% traffic):
#   - 2 canary clients mount new NFS server
#   - Monitor for 30 minutes
#   - Check latency, errors, data integrity
#
# Phase 2 (25% traffic):
#   - 10 clients switch to new server
#   - Monitor for 2 hours
#   - Run automated tests
#
# Phase 3 (50% traffic):
#   - 20 clients switch
#   - Monitor for 4 hours
#   - Full regression test
#
# Phase 4 (100% traffic):
#   - All clients switch to new server
#   - Old server becomes standby

# 2. Canary Client Configuration Script
cat > canary_switch.sh << 'SHEOF'
#!/bin/bash
set -e

NEW_SERVER="nfs-green.example.com"
OLD_SERVER="nfs-blue.example.com"
MOUNT_POINT="/mnt/shared"
SEC_TYPE="krb5p"

echo "=== NFS Canary Switch ==="
echo "Switching from $OLD_SERVER to $NEW_SERVER"

# Step 1: Check new server is reachable
# (showmount uses the legacy MOUNT protocol; NFSv4-only servers may not answer it)
showmount -e "$NEW_SERVER" || { echo "ERROR: Cannot reach $NEW_SERVER"; exit 1; }

# Step 2: Unmount old
echo "Unmounting $OLD_SERVER..."
# Lazy unmount to avoid blocking
sudo umount -l $MOUNT_POINT 2>/dev/null || true

# Step 3: Mount new (no spaces in the -o list; "intr" is a no-op on modern kernels)
echo "Mounting $NEW_SERVER..."
sudo mount -t nfs4 -o "sec=$SEC_TYPE,hard" "$NEW_SERVER:/data/shared" "$MOUNT_POINT"

# Step 4: Verify
if mountpoint -q $MOUNT_POINT; then
    echo "Mount successful"
    # Write test
    TEST_FILE="$MOUNT_POINT/.canary_test_$(hostname)"
    echo "canary test $(date)" > $TEST_FILE && rm $TEST_FILE
    echo "Read/Write test passed"
else
    echo "ERROR: Mount failed, rolling back..."
    sudo mount -t nfs4 -o "sec=$SEC_TYPE,hard" "$OLD_SERVER:/data/shared" "$MOUNT_POINT"
    exit 1
fi

echo "Canary switch complete"
SHEOF

chmod +x canary_switch.sh

# 3. Canary Monitoring
cat > canary_monitor.sh << 'SHEOF'
#!/bin/bash
# Monitor NFS canary deployment

MOUNT_POINT="/mnt/shared"
LOG_FILE="/var/log/nfs_canary_monitor.log"
THRESHOLD_LATENCY_MS=10
CHECK_INTERVAL=60

while true; do
    TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    
    # Check mount
    if ! mountpoint -q "$MOUNT_POINT"; then
        echo "$TIMESTAMP ERROR mount_lost" >> "$LOG_FILE"
        # Alert here, then wait before rechecking (avoids a busy loop)
        sleep "$CHECK_INTERVAL"
        continue
    fi
    
    # Measure latency (time to create and read small file)
    START=$(date +%s%N)
    echo "test" > $MOUNT_POINT/.latency_test 2>/dev/null
    cat $MOUNT_POINT/.latency_test > /dev/null 2>&1
    rm $MOUNT_POINT/.latency_test 2>/dev/null
    END=$(date +%s%N)
    
    LATENCY_MS=$(( (END - START) / 1000000 ))
    
    echo "$TIMESTAMP latency_ms=$LATENCY_MS" >> $LOG_FILE
    
    if [ "$LATENCY_MS" -gt "$THRESHOLD_LATENCY_MS" ]; then
        echo "$TIMESTAMP WARNING high_latency=$LATENCY_MS" >> $LOG_FILE
    fi
    
    sleep $CHECK_INTERVAL
done
SHEOF

chmod +x canary_monitor.sh
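The same latency probe that `canary_monitor.sh` performs, sketched in Python (in practice you would point it at the NFS mount point such as `/mnt/shared`; `/tmp` is used below only as a stand-in):

```python
# Sketch: time one create/fsync/read/delete cycle in a directory, in ms.
# On an NFS mount the fsync forces the write to hit the server.
import os
import time

def probe_latency_ms(directory: str) -> float:
    path = os.path.join(directory, ".latency_test")
    start = time.perf_counter()
    with open(path, "w") as f:
        f.write("test")
        f.flush()
        os.fsync(f.fileno())
    with open(path) as f:
        f.read()
    os.remove(path)
    return (time.perf_counter() - start) * 1000.0

print(f"{probe_latency_ms('/tmp'):.2f} ms")
```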

# 4. Automated Canary Rollout with Ansible
cat > canary_rollout.yml << 'EOF'
---
- name: NFS Canary Rollout
  hosts: all
  become: yes
  vars:
    new_nfs_server: "nfs-green.example.com"
    old_nfs_server: "nfs-blue.example.com"
    mount_point: "/mnt/shared"
    sec_type: "krb5p"

  tasks:
    - name: Unmount old NFS
      ansible.posix.mount:
        path: "{{ mount_point }}"
        state: unmounted
      ignore_errors: yes

    - name: Mount new NFS server
      ansible.posix.mount:
        path: "{{ mount_point }}"
        src: "{{ new_nfs_server }}:/data/shared"
        fstype: nfs4
        opts: "sec={{ sec_type }},hard,intr"
        state: mounted

    - name: Verify mount
      command: mountpoint -q {{ mount_point }}
      register: mount_check
      
    - name: Write test
      copy:
        content: "canary test {{ ansible_date_time.iso8601 }}"
        dest: "{{ mount_point }}/.canary_test"
      when: mount_check.rc == 0

    - name: Cleanup test file
      file:
        path: "{{ mount_point }}/.canary_test"
        state: absent
EOF

echo "Canary deployment configured"

Automation and Rollback

Automating deployment and rollback

#!/usr/bin/env python3
# nfs_deploy.py — NFS Deployment Automation
import json
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deploy")

class NFSCanaryDeployer:
    def __init__(self, total_clients=40):
        self.total_clients = total_clients
        self.phases = [
            {"name": "canary", "pct": 5, "duration_min": 30},
            {"name": "early_adopter", "pct": 25, "duration_min": 120},
            {"name": "half", "pct": 50, "duration_min": 240},
            {"name": "full", "pct": 100, "duration_min": 0},
        ]
        self.current_phase = 0
        self.rollback_triggered = False
        self.metrics_history = []
    
    def get_phase_clients(self, phase_idx):
        pct = self.phases[phase_idx]["pct"]
        count = max(1, int(self.total_clients * pct / 100))
        return count
    
    def execute_phase(self, phase_idx):
        phase = self.phases[phase_idx]
        client_count = self.get_phase_clients(phase_idx)
        
        result = {
            "phase": phase["name"],
            "target_pct": phase["pct"],
            "clients_switched": client_count,
            "total_clients": self.total_clients,
            "monitoring_duration_min": phase["duration_min"],
            "started_at": datetime.utcnow().isoformat(),
        }
        
        return result
    
    def check_canary_health(self, metrics):
        """Evaluate if canary is healthy enough to proceed"""
        thresholds = {
            "error_rate_pct": 1.0,
            "latency_p99_ms": 20,
            "mount_failures": 0,
            "data_corruption": 0,
        }
        
        issues = []
        for metric, threshold in thresholds.items():
            actual = metrics.get(metric, 0)
            if actual > threshold:
                issues.append({
                    "metric": metric,
                    "threshold": threshold,
                    "actual": actual,
                })
        
        return {
            "healthy": len(issues) == 0,
            "issues": issues,
            "recommendation": "proceed" if not issues else "rollback",
        }
    
    def advance_phase(self, metrics):
        """Advance to next phase if health checks pass"""
        health = self.check_canary_health(metrics)
        
        if not health["healthy"]:
            self.rollback_triggered = True
            return {
                "action": "rollback",
                "reason": health["issues"],
                "current_phase": self.phases[self.current_phase]["name"],
            }
        
        if self.current_phase < len(self.phases) - 1:
            self.current_phase += 1
            return {
                "action": "advance",
                "new_phase": self.phases[self.current_phase]["name"],
                "execution": self.execute_phase(self.current_phase),
            }
        
        return {"action": "complete", "message": "All phases completed"}
    
    def rollback_all(self):
        """Rollback all clients to old NFS server"""
        return {
            "action": "rollback",
            "clients_to_rollback": self.get_phase_clients(self.current_phase),
            "rollback_to": "nfs-blue.example.com",
            "timestamp": datetime.utcnow().isoformat(),
            "steps": [
                "1. Stop advancing to new clients",
                "2. Switch canary clients back to old server",
                "3. Verify all mounts are on old server",
                "4. Investigate root cause",
            ],
        }

deployer = NFSCanaryDeployer(total_clients=40)

# Phase 1: Canary
phase1 = deployer.execute_phase(0)
print("Phase 1:", json.dumps(phase1, indent=2))

# Check health and advance
good_metrics = {"error_rate_pct": 0.1, "latency_p99_ms": 5, "mount_failures": 0, "data_corruption": 0}
advance = deployer.advance_phase(good_metrics)
print("Advance:", json.dumps(advance, indent=2))

# Simulate bad metrics triggering rollback
bad_metrics = {"error_rate_pct": 5.0, "latency_p99_ms": 50, "mount_failures": 2, "data_corruption": 0}
rollback = deployer.advance_phase(bad_metrics)
print("Rollback:", json.dumps(rollback, indent=2))

Monitoring and Troubleshooting

Monitor NFS performance

# === NFS Monitoring ===

# 1. NFS Server Metrics
# ===================================
# Check NFS statistics
nfsstat -s  # Server stats
nfsstat -c  # Client stats

# Monitor NFS operations
watch -n 1 'nfsstat -s | head -20'

# 2. Key Metrics to Monitor
# ===================================
# - NFS operations per second (read, write, getattr, access)
# - Average response time per operation
# - Active connections count
# - Export availability
# - Kerberos ticket status
# - Network throughput
# - Disk I/O on NFS server

# 3. Prometheus Node Exporter Metrics
# ===================================
# NFS metrics available:
# node_nfs_requests_total{method="Read"}
# node_nfs_requests_total{method="Write"}
# node_nfs_requests_total{method="GetAttr"}
# node_nfsd_server_threads
# node_nfsd_server_rpcs_total

# 4. Grafana Dashboard Queries
# ===================================
# NFS Operations Rate:
# rate(node_nfs_requests_total[5m])
#
# NFS Errors:
# rate(node_nfs_rpc_retransmissions_total[5m])
#
# NFS Latency:
# rate(node_nfs_request_duration_seconds_sum[5m]) / rate(node_nfs_request_duration_seconds_count[5m])

# 5. Troubleshooting Commands
# ===================================
# Check exports
showmount -e nfs-server.example.com

# Check mount status
mount -t nfs4

# Debug Kerberos
klist -ke /etc/krb5.keytab
kinit -k -t /etc/krb5.keytab nfs/$(hostname -f)
klist

# Check RPC services
rpcinfo -p nfs-server.example.com

# NFS debug logging
rpcdebug -m nfs -s all    # Enable
rpcdebug -m nfs -c all    # Disable
dmesg | grep -i nfs

# Network issues
tcpdump -i eth0 port 2049 -c 100

# 6. Common Issues
# ===================================
# Issue: "mount.nfs4: access denied by server"
# Fix: Check /etc/exports, exportfs -arv, check Kerberos keytab

# Issue: "GSS-API error: No credentials were supplied"
# Fix: kinit -k -t /etc/krb5.keytab nfs/hostname, restart rpc-gssd

# Issue: Slow NFS performance
# Fix: Check network, increase NFS threads (RPCNFSDCOUNT),
#      enable async writes, check server disk I/O

echo "Monitoring configured"

FAQ: Frequently Asked Questions

Q: How does NFSv4 differ from NFSv3?

A: NFSv4 uses a single TCP port (2049), while v3 needs multiple ports plus the portmapper. It has built-in security via RPCSEC_GSS/Kerberos (v3 relies on IP-based trust), is a stateful protocol (v3 is stateless), offers compound operations that reduce network round trips, supports ACLs and delegations, and provides a pseudo filesystem for multiple exports. Use v4 for all new production deployments; keep v3 only for legacy compatibility.

Q: Which Kerberos security level should you use?

A: krb5 (authentication only) suits trusted networks where eavesdropping is not a concern and has the best performance. krb5i (integrity) adds checksums that prevent data from being tampered with in transit, with a small performance cost (roughly 5-10%); recommended as the minimum. krb5p (privacy) encrypts all data and is the most secure, at a 10-30% performance cost; suited to sensitive data. On internal networks use krb5i; across networks or for sensitive data use krb5p.

Q: Blue-Green or Canary — which should you choose for NFS?

A: Blue-Green fits when you want to switch everything at once (e.g., a major version upgrade); rollback is very fast (switch DNS back), but you need two full servers (higher cost). Canary fits when you want a gradual rollout that reduces risk (any problem affects only a subset) and lets the existing server keep serving during the transition. For NFS, Canary is recommended because data consistency matters: if a mount has problems, only a subset of clients is affected.

Q: How do you tune NFS performance?

A: Server side: increase NFS threads (RPCNFSDCOUNT=64), use SSDs for the exports, add RAM for the page cache, and consider async exports (beware of data loss on crashes). Client side: use rsize=1048576 and wsize=1048576 (1 MB reads/writes), use hard mounts (not soft), enable NFS client caching (the fsc option), and use actimeo=60 for data that rarely changes. Network: use jumbo frames (MTU 9000), isolate NFS traffic on a dedicated VLAN, and use bonding/LACP for bandwidth.
