Kubernetes Production Best Practices

Resource Management

Always Set Resource Limits

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:v1.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

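Golden rules:

requests = the resources the application needs under normal operation
limits = the maximum resources the application is allowed to use
Keep the limits / requests ratio at or below 2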
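The ratio rule above can be enforced per namespace rather than left to convention. The following is a minimal sketch using a LimitRange; the object name and the `production` namespace are hypothetical, and the default values simply mirror the Pod example above.

```yaml
# Sketch: namespace-wide defaults plus a cap on the limits/requests ratio.
# The name and namespace below are assumptions for illustration.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: "250m"
      memory: "256Mi"
    default:               # applied when a container omits limits
      cpu: "500m"
      memory: "512Mi"
    maxLimitRequestRatio:  # reject containers whose limits exceed 2x their requests
      cpu: "2"
      memory: "2"
```

With this in place, a container that declares limits more than twice its requests should be rejected at admission time in that namespace.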

QoS Classes

┌─────────────────────────────────────────────┐
│  Guaranteed (highest priority)               │
│  requests == limits                          │
├─────────────────────────────────────────────┤
│  Burstable (medium priority)                 │
│  requests < limits                           │
├─────────────────────────────────────────────┤
│  BestEffort (lowest priority)                │
│  No requests or limits                       │
└─────────────────────────────────────────────┘
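As an illustration of the Guaranteed class, here is a sketch of a Pod whose requests equal its limits for both CPU and memory. The Pod name is hypothetical, and the values are borrowed from the limits in the earlier example.

```yaml
# Sketch: requests == limits for every resource, so the Pod should be
# classified as Guaranteed. The name is an assumption for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-guaranteed
spec:
  containers:
  - name: app
    image: my-app:v1.0
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "512Mi"
        cpu: "500m"
```

The assigned class can be read back from the Pod status with `kubectl get pod my-app-guaranteed -o jsonpath='{.status.qosClass}'`. Note that every container in the Pod has to satisfy the condition for the Pod to be Guaranteed.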

Health Checks

Configuring the Three Probe Types

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

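Common mistakes:

❌ Setting initialDelaySeconds too short, so the probe fails before the application has started and the container is restarted repeatedly
❌ Not distinguishing between liveness and readiness
❌ Health check endpoints that depend on external services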
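One way to sidestep the first two mistakes is to let the startup probe absorb slow starts, so the liveness probe needs no long initial delay. The fragment below is a sketch of how the three probes might sit together in a container spec; the endpoint paths are the ones used above, and what each endpoint actually checks is an assumption about the application.

```yaml
# Sketch: fragment of a Pod spec (spec.containers). Liveness and readiness
# probes do not run until the startup probe has succeeded.
containers:
- name: app
  image: my-app:v1.0
  startupProbe:
    httpGet:
      path: /startup
      port: 8080
    periodSeconds: 10
    failureThreshold: 30   # startup budget: up to 30 x 10s = 300s
  livenessProbe:
    httpGet:
      path: /healthz       # assumed to check only the process itself
      port: 8080
    periodSeconds: 10
    failureThreshold: 3    # restart after about 30s of consecutive failures
  readinessProbe:
    httpGet:
      path: /ready         # failure removes the Pod from Service endpoints, no restart
      port: 8080
    periodSeconds: 5
```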

Pod Anti-Affinity

Avoid single points of failure by spreading Pods across nodes:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-app
      topologyKey: kubernetes.io/hostname
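The required rule above is strict: if there are more replicas than eligible nodes, the extra Pods stay Pending. Where that is not acceptable, a softer variant can be sketched with the preferred form, which asks the scheduler to spread Pods but still places them when it cannot. The weight of 100 is an arbitrary choice for illustration.

```yaml
# Sketch: soft anti-affinity. The scheduler tries to avoid co-locating
# Pods labeled app=my-app on one node, but does not refuse to schedule.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
```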

PodDisruptionBudget

Keep the service available during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
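For comparison, the same budget can be expressed with maxUnavailable instead. This is a sketch only; a value of 1 is an assumption and should be chosen relative to the replica count. A PodDisruptionBudget accepts one of the two fields, not both.

```yaml
# Sketch: allow at most one matching Pod to be evicted at a time
# during voluntary disruptions such as a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```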

Security Best Practices

Run as Non-Root

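# Container-level securityContext (spec.containers[].securityContext);
# readOnlyRootFilesystem and capabilities are not valid at the Pod level.
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL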
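To show where these settings live in a full manifest, here is a sketch of a Pod that applies them to its container. Because the root filesystem is read-only, it also mounts an emptyDir at /tmp; whether the application needs a writable /tmp at all is an assumption, and the volume name is hypothetical.

```yaml
# Sketch: securityContext in context, with a writable scratch volume
# to compensate for readOnlyRootFilesystem.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:v1.0
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
    volumeMounts:
    - name: tmp            # hypothetical volume name
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}
```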

Network Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
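A deny-all policy also blocks DNS lookups, so it is usually paired with explicit allow rules. The sketch below re-opens DNS egress; it assumes the cluster DNS Pods run in kube-system with the label `k8s-app: kube-dns` and that namespaces carry the automatic `kubernetes.io/metadata.name` label, which may not hold in every cluster.

```yaml
# Sketch: allow all Pods in this namespace to reach cluster DNS on port 53.
# The selectors are assumptions about how DNS is deployed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Policies are additive, so this rule widens what deny-all permits without replacing it. Application traffic would need further allow rules of the same shape.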

Monitoring and Alerting

Key Metrics

# Node resources
sum(rate(container_cpu_usage_seconds_total[5m])) by (node)
sum(container_memory_working_set_bytes) by (node)

# Pod status
kube_pod_status_phase{phase="Failed"}
kube_pod_container_status_restarts_total

# API Server
apiserver_request_duration_seconds_bucket
apiserver_request_total{code=~"5.."}
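These metrics become useful once they drive alerts. The following is a sketch of a Prometheus rules file built on three of them; the alert names, thresholds, and durations are assumptions for illustration and would need tuning for a real cluster.

```yaml
# Sketch: Prometheus alerting rules. All names and thresholds are hypothetical.
groups:
- name: kubernetes-basics
  rules:
  - alert: PodRestartingFrequently
    # More than 3 container restarts within 15 minutes
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
  - alert: PodFailed
    # The series is 1 for the phase a Pod is currently in
    expr: kube_pod_status_phase{phase="Failed"} > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in the Failed phase"
  - alert: ApiServerErrorRateHigh
    # Share of API server requests answered with a 5xx code
    expr: |
      sum(rate(apiserver_request_total{code=~"5.."}[5m]))
        / sum(rate(apiserver_request_total[5m])) > 0.05
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "More than 5% of API server requests are failing with 5xx"
```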

Summary Checklist

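□ Set resource requests/limits on every Pod
□ Configure liveness and readiness probes
□ Use a PodDisruptionBudget
□ Configure Pod anti-affinity
□ Restrict traffic with NetworkPolicy
□ Run containers as a non-root user
□ Set up Prometheus + Grafana monitoring
□ Define alerting rules for key metrics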

$ kubectl get pods --all-namespaces --no-headers | grep -v Running
# Goal: empty output (Pods from finished Jobs show as Completed and can be ignored)