What Are NFS v4 and Kerberos?
NFS (Network File System) version 4 is a distributed file system protocol that lets clients mount file systems from remote servers over the network as if they were local. NFSv4 improves on v3 in several ways: it uses a single port (TCP 2049) with no portmapper required, has built-in security via RPCSEC_GSS, supports stateful operations, and offers compound operations that reduce round trips.
Kerberos is a network authentication protocol that uses tickets to authenticate users and services without sending passwords over the network. Combining NFS v4 with Kerberos provides three security levels: krb5 (authentication only) verifies user identity; krb5i (integrity protection) prevents data from being modified in transit; krb5p (privacy protection) encrypts all data.
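The three security levels above map directly to `sec=` mount options. A minimal sketch of a helper that builds the option string from the protection level you need (the flavor names are the standard RPCSEC_GSS services; the helper itself is illustrative):

```python
# Map the protection a share needs to the NFSv4 sec= mount option.
SEC_FLAVORS = {
    "auth": "krb5",        # authentication only: verify user identity
    "integrity": "krb5i",  # auth + checksums: detect in-transit tampering
    "privacy": "krb5p",    # auth + integrity + encryption of all traffic
}

def mount_options(protection, extra=("hard",)):
    """Build an -o option string for mount -t nfs4."""
    flavor = SEC_FLAVORS[protection]
    return ",".join((f"sec={flavor}",) + tuple(extra))

print(mount_options("privacy"))  # sec=krb5p,hard
```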
Blue-Green and Canary are deployment strategies that reduce downtime and risk when updating NFS infrastructure. Blue-Green switches between two complete environments; Canary gradually rolls the change out to a subset of clients first.
Installing NFS v4 with Kerberos Authentication
Set up an NFS v4 server with Kerberos
# === NFS v4 + Kerberos Setup ===
# 1. Install Packages (Server)
sudo apt update
sudo apt install -y nfs-kernel-server krb5-user krb5-kdc krb5-admin-server
# 2. Configure Kerberos KDC
sudo tee /etc/krb5.conf > /dev/null << 'EOF'
[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
EXAMPLE.COM = {
kdc = kdc.example.com
admin_server = kdc.example.com
}
[domain_realm]
.example.com = EXAMPLE.COM
example.com = EXAMPLE.COM
EOF
# 3. Create Kerberos Database
sudo krb5_newrealm
# Enter master password when prompted
# 4. Create NFS Service Principals
sudo kadmin.local << 'KADMIN'
addprinc -randkey nfs/nfs-server.example.com@EXAMPLE.COM
addprinc -randkey nfs/nfs-client.example.com@EXAMPLE.COM
ktadd -k /etc/krb5.keytab nfs/nfs-server.example.com@EXAMPLE.COM
KADMIN
# 5. Configure NFS Server
sudo tee /etc/exports > /dev/null << 'EOF'
/data/shared *(rw,sync,no_subtree_check,sec=krb5p)
/data/readonly *(ro,sync,no_subtree_check,sec=krb5i)
/data/public *(rw,sync,no_subtree_check,sec=sys)
EOF
# Create directories
sudo mkdir -p /data/{shared,readonly,public}
sudo chown nobody:nogroup /data/shared /data/readonly /data/public
# 6. Enable and Start Services
sudo systemctl enable --now nfs-server
sudo systemctl enable --now rpc-gssd
sudo systemctl enable --now rpc-svcgssd
# Export shares
sudo exportfs -arv
# 7. Configure NFS Client
# On client machine:
sudo apt install -y nfs-common krb5-user
# Get keytab for client
# scp admin@kdc:/tmp/client.keytab /etc/krb5.keytab
# Mount with Kerberos
sudo mount -t nfs4 -o sec=krb5p nfs-server.example.com:/data/shared /mnt/shared
# Add to fstab for persistent mount
echo "nfs-server.example.com:/data/shared /mnt/shared nfs4 sec=krb5p,_netdev 0 0" | sudo tee -a /etc/fstab
# 8. Verify
mount | grep nfs4
# Should show sec=krb5p
echo "NFS v4 + Kerberos configured"
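After mounting, it is worth checking programmatically that the client actually negotiated the expected Kerberos flavor. A minimal sketch that parses `/proc/mounts`-style lines (the sample line below is illustrative, not captured output):

```python
def nfs_sec_flavor(mounts_text, mount_point):
    """Return the sec= flavor for an NFSv4 mount, or None if not mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point and fields[2] == "nfs4":
            for opt in fields[3].split(","):
                if opt.startswith("sec="):
                    return opt.split("=", 1)[1]
    return None

# On a real client, read open("/proc/mounts").read() instead of this sample.
sample = "nfs-server.example.com:/data/shared /mnt/shared nfs4 rw,relatime,sec=krb5p,hard 0 0"
print(nfs_sec_flavor(sample, "/mnt/shared"))  # krb5p
```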
Blue-Green Deployment for NFS
A Blue-Green deployment strategy for NFS infrastructure
#!/usr/bin/env python3
# blue_green_nfs.py — Blue-Green NFS Deployment
import json
import logging
import subprocess
from datetime import datetime
from typing import Dict, List
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bg_deploy")
class BlueGreenNFS:
    def __init__(self):
        self.blue = {
            "name": "blue",
            "server": "nfs-blue.example.com",
            "status": "active",
            "version": "1.0",
            "mount_point": "/data/shared",
        }
        self.green = {
            "name": "green",
            "server": "nfs-green.example.com",
            "status": "standby",
            "version": "1.1",
            "mount_point": "/data/shared",
        }
        self.active = "blue"

    def get_active(self):
        return self.blue if self.active == "blue" else self.green

    def get_standby(self):
        return self.green if self.active == "blue" else self.blue

    def prepare_standby(self, new_version):
        """Prepare the standby environment with the new version."""
        standby = self.get_standby()
        standby["version"] = new_version
        steps = [
            f"1. Sync data to {standby['server']}",
            f"   rsync -avz --delete /data/shared/ {standby['server']}:/data/shared/",
            f"2. Update NFS config on {standby['server']}",
            f"3. Apply security patches on {standby['server']}",
            f"4. Restart NFS services on {standby['server']}",
            f"5. Run health checks on {standby['server']}",
        ]
        return {
            "standby": standby["name"],
            "new_version": new_version,
            "steps": steps,
            "status": "prepared",
        }

    def switch_traffic(self):
        """Switch DNS/load balancer to the standby environment."""
        old_active = self.get_active()
        new_active = self.get_standby()
        dns_update = {
            "record": "nfs.example.com",
            "old_target": old_active["server"],
            "new_target": new_active["server"],
            "ttl": 60,
        }
        old_active["status"] = "standby"
        new_active["status"] = "active"
        self.active = new_active["name"]
        return {
            "switched_from": old_active["name"],
            "switched_to": new_active["name"],
            "dns_update": dns_update,
            "timestamp": datetime.utcnow().isoformat(),
        }

    def rollback(self):
        """Roll back by switching traffic to the previous active environment."""
        return self.switch_traffic()

    def health_check(self, server):
        """Check NFS server health (the values here are illustrative)."""
        checks = {
            "nfs_service": "running",
            "kerberos_service": "running",
            "export_available": True,
            "mount_test": "success",
            "read_write_test": "passed",
            "latency_ms": 2.5,
            "iops": 15000,
        }
        # Booleans must be True, strings must be a passing status, and
        # numeric metrics are informational; note bool is excluded from the
        # numeric branch so a False check cannot slip through as healthy.
        all_passed = all(
            v is True
            or v in ("running", "success", "passed")
            or (isinstance(v, (int, float)) and not isinstance(v, bool))
            for v in checks.values()
        )
        return {
            "server": server,
            "healthy": all_passed,
            "checks": checks,
        }
bg = BlueGreenNFS()
print("Active:", json.dumps(bg.get_active(), indent=2))
prep = bg.prepare_standby("1.1")
print("Prepared:", json.dumps(prep, indent=2))
health = bg.health_check("nfs-green.example.com")
print("Health:", json.dumps(health, indent=2))
switch = bg.switch_traffic()
print("Switched:", json.dumps(switch, indent=2))
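The `dns_update` record returned by switch_traffic still has to reach a name server. One common mechanism is dynamic DNS via nsupdate; the sketch below only builds the nsupdate batch as text and does not execute anything (the name server `ns1.example.com` is an assumption, not from the setup above):

```python
def build_nsupdate_script(record, new_target, ttl=60, dns_server="ns1.example.com"):
    """Build an nsupdate batch that repoints a CNAME at the new NFS server."""
    return "\n".join([
        f"server {dns_server}",
        f"update delete {record} CNAME",
        f"update add {record} {ttl} CNAME {new_target}",
        "send",
    ])

script = build_nsupdate_script("nfs.example.com", "nfs-green.example.com")
print(script)
# Feed the resulting text to the nsupdate command on an admin host.
```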
Canary Deployment Strategy
Canary deployment for NFS updates
# === Canary Deployment for NFS ===
# 1. Canary Deployment Plan
# ===================================
# Phase 1 (5% traffic):
# - 2 canary clients mount new NFS server
# - Monitor for 30 minutes
# - Check latency, errors, data integrity
#
# Phase 2 (25% traffic):
# - 10 clients switch to new server
# - Monitor for 2 hours
# - Run automated tests
#
# Phase 3 (50% traffic):
# - 20 clients switch
# - Monitor for 4 hours
# - Full regression test
#
# Phase 4 (100% traffic):
# - All clients switch to new server
# - Old server becomes standby
# 2. Canary Client Configuration Script
cat > canary_switch.sh << 'SHEOF'
#!/bin/bash
set -e
NEW_SERVER="nfs-green.example.com"
OLD_SERVER="nfs-blue.example.com"
MOUNT_POINT="/mnt/shared"
SEC_TYPE="krb5p"
echo "=== NFS Canary Switch ==="
echo "Switching from $OLD_SERVER to $NEW_SERVER"
# Step 1: Check new server is reachable (showmount needs rpc.mountd; on
# NFSv4-only servers, use "rpcinfo -t $NEW_SERVER nfs 4" instead)
showmount -e $NEW_SERVER || { echo "ERROR: Cannot reach $NEW_SERVER"; exit 1; }
# Step 2: Unmount old
echo "Unmounting $OLD_SERVER..."
# Lazy unmount to avoid blocking
sudo umount -l $MOUNT_POINT 2>/dev/null || true
# Step 3: Mount new
echo "Mounting $NEW_SERVER..."
sudo mount -t nfs4 -o "sec=$SEC_TYPE,hard" "$NEW_SERVER:/data/shared" "$MOUNT_POINT"
# Step 4: Verify
if mountpoint -q $MOUNT_POINT; then
echo "Mount successful"
# Write test
TEST_FILE="$MOUNT_POINT/.canary_test_$(hostname)"
echo "canary test $(date)" > $TEST_FILE && rm $TEST_FILE
echo "Read/Write test passed"
else
echo "ERROR: Mount failed, rolling back..."
sudo mount -t nfs4 -o "sec=$SEC_TYPE,hard" "$OLD_SERVER:/data/shared" "$MOUNT_POINT"
exit 1
fi
echo "Canary switch complete"
SHEOF
chmod +x canary_switch.sh
# 3. Canary Monitoring
cat > canary_monitor.sh << 'SHEOF'
#!/bin/bash
# Monitor NFS canary deployment
MOUNT_POINT="/mnt/shared"
LOG_FILE="/var/log/nfs_canary_monitor.log"
THRESHOLD_LATENCY_MS=10
CHECK_INTERVAL=60
while true; do
TIMESTAMP=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# Check mount
if ! mountpoint -q $MOUNT_POINT; then
echo "$TIMESTAMP ERROR mount_lost" >> $LOG_FILE
# Alert
continue
fi
# Measure latency (time to create and read small file)
START=$(date +%s%N)
echo "test" > $MOUNT_POINT/.latency_test 2>/dev/null
cat $MOUNT_POINT/.latency_test > /dev/null 2>&1
rm $MOUNT_POINT/.latency_test 2>/dev/null
END=$(date +%s%N)
LATENCY_MS=$(( (END - START) / 1000000 ))
echo "$TIMESTAMP latency_ms=$LATENCY_MS" >> $LOG_FILE
if [ "$LATENCY_MS" -gt "$THRESHOLD_LATENCY_MS" ]; then
echo "$TIMESTAMP WARNING high_latency=$LATENCY_MS" >> $LOG_FILE
fi
sleep $CHECK_INTERVAL
done
SHEOF
chmod +x canary_monitor.sh
# 4. Automated Canary Rollout with Ansible
cat > canary_rollout.yml << 'EOF'
---
- name: NFS Canary Rollout
hosts: all
become: yes
vars:
new_nfs_server: "nfs-green.example.com"
old_nfs_server: "nfs-blue.example.com"
mount_point: "/mnt/shared"
sec_type: "krb5p"
tasks:
- name: Unmount old NFS
ansible.posix.mount:
path: "{{ mount_point }}"
state: unmounted
ignore_errors: yes
- name: Mount new NFS server
ansible.posix.mount:
path: "{{ mount_point }}"
src: "{{ new_nfs_server }}:/data/shared"
fstype: nfs4
opts: "sec={{ sec_type }},hard"
state: mounted
- name: Verify mount
command: mountpoint -q {{ mount_point }}
register: mount_check
- name: Write test
copy:
content: "canary test {{ ansible_date_time.iso8601 }}"
dest: "{{ mount_point }}/.canary_test"
when: mount_check.rc == 0
- name: Cleanup test file
file:
path: "{{ mount_point }}/.canary_test"
state: absent
EOF
echo "Canary deployment configured"
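The phase percentages above need a stable way to decide which clients belong to each phase, so a client never flips in and out of the canary group between runs. Hashing the hostname into a fixed bucket gives a deterministic, monotonically growing rollout set (a minimal sketch; the hostnames are illustrative):

```python
import hashlib

def in_rollout(hostname, rollout_pct):
    """Deterministically place a client in the first rollout_pct% of buckets."""
    digest = hashlib.sha256(hostname.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable 0-99 bucket per hostname
    return bucket < rollout_pct

clients = [f"client{i:02d}.example.com" for i in range(40)]
canary = [c for c in clients if in_rollout(c, 5)]  # roughly 5% of clients
print(len(canary), canary[:3])
```

Because a host's bucket never changes, every client selected at 5% is still selected at 25%, 50%, and 100%, which is exactly the behavior a phased rollout needs.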
Automation and Rollback
Automating deployment and rollback
#!/usr/bin/env python3
# nfs_deploy.py — NFS Deployment Automation
import json
import logging
from datetime import datetime
from typing import Dict, List
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deploy")
class NFSCanaryDeployer:
    def __init__(self, total_clients=40):
        self.total_clients = total_clients
        self.phases = [
            {"name": "canary", "pct": 5, "duration_min": 30},
            {"name": "early_adopter", "pct": 25, "duration_min": 120},
            {"name": "half", "pct": 50, "duration_min": 240},
            {"name": "full", "pct": 100, "duration_min": 0},
        ]
        self.current_phase = 0
        self.rollback_triggered = False
        self.metrics_history = []

    def get_phase_clients(self, phase_idx):
        pct = self.phases[phase_idx]["pct"]
        return max(1, int(self.total_clients * pct / 100))

    def execute_phase(self, phase_idx):
        phase = self.phases[phase_idx]
        client_count = self.get_phase_clients(phase_idx)
        return {
            "phase": phase["name"],
            "target_pct": phase["pct"],
            "clients_switched": client_count,
            "total_clients": self.total_clients,
            "monitoring_duration_min": phase["duration_min"],
            "started_at": datetime.utcnow().isoformat(),
        }

    def check_canary_health(self, metrics):
        """Evaluate whether the canary is healthy enough to proceed."""
        thresholds = {
            "error_rate_pct": 1.0,
            "latency_p99_ms": 20,
            "mount_failures": 0,
            "data_corruption": 0,
        }
        issues = []
        for metric, threshold in thresholds.items():
            actual = metrics.get(metric, 0)
            if actual > threshold:
                issues.append({
                    "metric": metric,
                    "threshold": threshold,
                    "actual": actual,
                })
        return {
            "healthy": len(issues) == 0,
            "issues": issues,
            "recommendation": "proceed" if not issues else "rollback",
        }

    def advance_phase(self, metrics):
        """Advance to the next phase if health checks pass."""
        health = self.check_canary_health(metrics)
        if not health["healthy"]:
            self.rollback_triggered = True
            return {
                "action": "rollback",
                "reason": health["issues"],
                "current_phase": self.phases[self.current_phase]["name"],
            }
        if self.current_phase < len(self.phases) - 1:
            self.current_phase += 1
            return {
                "action": "advance",
                "new_phase": self.phases[self.current_phase]["name"],
                "execution": self.execute_phase(self.current_phase),
            }
        return {"action": "complete", "message": "All phases completed"}

    def rollback_all(self):
        """Roll back all clients to the old NFS server."""
        return {
            "action": "rollback",
            "clients_to_rollback": self.get_phase_clients(self.current_phase),
            "rollback_to": "nfs-blue.example.com",
            "timestamp": datetime.utcnow().isoformat(),
            "steps": [
                "1. Stop advancing to new clients",
                "2. Switch canary clients back to old server",
                "3. Verify all mounts are on old server",
                "4. Investigate root cause",
            ],
        }
deployer = NFSCanaryDeployer(total_clients=40)
# Phase 1: Canary
phase1 = deployer.execute_phase(0)
print("Phase 1:", json.dumps(phase1, indent=2))
# Check health and advance
good_metrics = {"error_rate_pct": 0.1, "latency_p99_ms": 5, "mount_failures": 0, "data_corruption": 0}
advance = deployer.advance_phase(good_metrics)
print("Advance:", json.dumps(advance, indent=2))
# Simulate bad metrics triggering rollback
bad_metrics = {"error_rate_pct": 5.0, "latency_p99_ms": 50, "mount_failures": 2, "data_corruption": 0}
rollback = deployer.advance_phase(bad_metrics)
print("Rollback:", json.dumps(rollback, indent=2))
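check_canary_health expects a pre-computed latency_p99_ms. If you only have raw per-request samples (for example from the canary_monitor.sh log), the percentile can be derived with the nearest-rank method; a minimal sketch using only the standard library, with illustrative sample values:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [2, 3, 3, 4, 5, 5, 6, 8, 12, 45]  # illustrative samples
metrics = {"latency_p99_ms": percentile(latencies_ms, 99)}
print(metrics)  # {'latency_p99_ms': 45}
```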
Monitoring and Troubleshooting
Monitoring NFS performance
# === NFS Monitoring ===
# 1. NFS Server Metrics
# ===================================
# Check NFS statistics
nfsstat -s # Server stats
nfsstat -c # Client stats
# Monitor NFS operations
watch -n 1 'nfsstat -s | head -20'
# 2. Key Metrics to Monitor
# ===================================
# - NFS operations per second (read, write, getattr, access)
# - Average response time per operation
# - Active connections count
# - Export availability
# - Kerberos ticket status
# - Network throughput
# - Disk I/O on NFS server
# 3. Prometheus Node Exporter Metrics
# ===================================
# NFS metrics available:
# node_nfs_requests_total{method="Read"}
# node_nfs_requests_total{method="Write"}
# node_nfs_requests_total{method="GetAttr"}
# node_nfsd_server_threads
# node_nfsd_server_rpcs_total
# 4. Grafana Dashboard Queries
# ===================================
# NFS Operations Rate:
# rate(node_nfs_requests_total[5m])
#
# NFS Errors:
# rate(node_nfs_rpc_retransmissions_total[5m])
#
# NFS Latency:
# (per-operation latency is not in the default collectors; it needs
# node_exporter's mountstats collector, --collector.mountstats. Exact
# node_mountstats_nfs_* metric names vary by version, so check your
# /metrics endpoint for the request-duration series before writing queries.)
# 5. Troubleshooting Commands
# ===================================
# Check exports
showmount -e nfs-server.example.com
# Check mount status
mount -t nfs4
# Debug Kerberos
klist -ke /etc/krb5.keytab
kinit -k -t /etc/krb5.keytab nfs/$(hostname -f)
klist
# Check RPC services
rpcinfo -p nfs-server.example.com
# NFS debug logging
rpcdebug -m nfs -s all # Enable
rpcdebug -m nfs -c all # Disable
dmesg | grep -i nfs
# Network issues
tcpdump -i eth0 port 2049 -c 100
# 6. Common Issues
# ===================================
# Issue: "mount.nfs4: access denied by server"
# Fix: Check /etc/exports, exportfs -arv, check Kerberos keytab
# Issue: "GSS-API error: No credentials were supplied"
# Fix: kinit -k -t /etc/krb5.keytab nfs/hostname, restart rpc-gssd
# Issue: Slow NFS performance
# Fix: Check network, increase NFS threads (RPCNFSDCOUNT),
# enable async writes, check server disk I/O
echo "Monitoring configured"
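Rather than eyeballing nfsstat under watch, the kernel's raw counters can be scraped into a dict for alerting. A minimal sketch that parses /proc/net/rpc/nfsd-style lines (the sample text below is illustrative, not captured from a real server; on a server you would read the file itself):

```python
def parse_nfsd_stats(text):
    """Parse /proc/net/rpc/nfsd-style lines into {label: [counters]}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            stats[parts[0]] = [int(p) for p in parts[1:] if p.lstrip("-").isdigit()]
    return stats

# Illustrative sample in the format of /proc/net/rpc/nfsd:
# rc = reply cache (hits, misses, nocache); io = bytes read/written;
# th = configured nfsd thread count.
sample = """\
rc 0 1024 52341
io 104857600 52428800
th 8 0
"""
stats = parse_nfsd_stats(sample)
print("threads:", stats["th"][0])
print("bytes read:", stats["io"][0])
```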
FAQ: Frequently Asked Questions
Q: How does NFSv4 differ from NFSv3?
A: NFSv4 uses a single port (TCP 2049), while v3 uses several ports and needs the portmapper. It has built-in security via RPCSEC_GSS/Kerberos (v3 relies on IP-based trust), is a stateful protocol (v3 is stateless), offers compound operations that cut network round trips, and supports ACLs, delegations, and a pseudo filesystem for multiple exports. Use v4 for all new production deployments; keep v3 only for legacy compatibility.
Q: Which Kerberos security level should I use?
A: krb5 (authentication only) suits trusted networks where eavesdropping is not a concern, and it has the best performance. krb5i (integrity) adds checksums that detect in-transit tampering at a small performance cost (roughly 5-10%); treat it as the minimum. krb5p (privacy) encrypts all data and is the most secure, at a 10-30% performance cost; it is the right choice for sensitive data. In short: use krb5i on internal networks, and krb5p for cross-network traffic or sensitive data.
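Those overhead ranges translate directly into effective throughput. A quick back-of-the-envelope helper, where the overhead figures are the midpoints of the rough estimates quoted above, not measurements:

```python
# Rough per-flavor throughput overhead (midpoints of the estimates above).
OVERHEAD_PCT = {"krb5": 0, "krb5i": 7.5, "krb5p": 20}

def effective_mbps(baseline_mbps, flavor):
    """Estimated throughput after Kerberos flavor overhead."""
    return baseline_mbps * (1 - OVERHEAD_PCT[flavor] / 100)

for flavor in OVERHEAD_PCT:
    print(flavor, round(effective_mbps(1000, flavor), 1), "MB/s")
```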
Q: Blue-Green or Canary for NFS?
A: Blue-Green fits when everything must switch at once (e.g., a major version upgrade): rollback is very fast (switch DNS back), but it requires two full servers, which raises cost. Canary fits a gradual rollout: it reduces risk (a problem affects only a subset of clients) and the old server keeps serving during the transition. For NFS, Canary is usually recommended because data consistency matters: if a mount misbehaves, only a subset of clients is affected.
Q: How do I tune NFS performance?
A: Server side: increase NFS threads (RPCNFSDCOUNT=64), put exports on SSDs, add RAM for the page cache, and consider async exports (beware of data loss on a crash). Client side: use rsize=1048576 and wsize=1048576 (1 MB reads/writes), prefer hard mounts over soft, enable client-side caching (the fsc option, backed by cachefilesd), and set actimeo=60 for data that rarely changes. Network: use jumbo frames (MTU 9000), isolate NFS traffic on a dedicated VLAN, and use bonding/LACP for more bandwidth.
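The client-side options in the answer above can be collected into a single fstab entry; a small helper makes the combination less error-prone (a sketch with the values from the answer as defaults; adjust to your workload):

```python
def fstab_line(server, export, mount_point, sec="krb5p",
               rsize=1048576, wsize=1048576, actimeo=None):
    """Build an /etc/fstab entry for a tuned NFSv4 mount."""
    opts = [f"sec={sec}", f"rsize={rsize}", f"wsize={wsize}", "hard", "_netdev"]
    if actimeo is not None:
        opts.append(f"actimeo={actimeo}")
    return f"{server}:{export} {mount_point} nfs4 {','.join(opts)} 0 0"

print(fstab_line("nfs-server.example.com", "/data/shared", "/mnt/shared",
                 actimeo=60))
```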
