Pod Notes - Technical Sharing by Systems Engineer Beck Yeh

Pod basics

A Pod is the smallest deployable unit in Kubernetes. A Pod corresponds to one application service, and it can run one or more containers.

Pod diagram:

stateDiagram
  state Pod {
    tomcat-->Volume
    busybox-->Volume
  }

Pod YAML example:

apiVersion: v1
kind: Pod
metadata:
  name: volume-pod
spec:
  containers:
  - name: tomcat
    image: tomcat
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: app-logs
      mountPath: /usr/local/tomcat/logs
  - name: busybox
    image: busybox
    command: ["sh", "-c", "tail -f /logs/catalina*.log"]
    volumeMounts:
    - name: app-logs
      mountPath: /logs
  volumes:
  - name: app-logs
    emptyDir: {}

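Pod lifecycle and restart policy

Pod restart policy

Pod phases:

Phase	Description
Pending	The API Server has created the Pod, but one or more of its containers have not been created yet; this includes the time spent downloading images
Running	All containers in the Pod have been created, and at least one container is running, starting, or restarting
Succeeded	All containers in the Pod exited successfully and will not be restarted
Failed	All containers in the Pod have exited, and at least one of them exited with a failure
Unknown	The state of the Pod cannot be obtained for some reason, possibly because of a network communication problem

The Pod restart policy (RestartPolicy) applies to all containers in the Pod, and only the kubelet on the Node where the Pod runs evaluates it and performs the restart.
When a container exits abnormally or fails its health check, the kubelet takes the corresponding action according to the RestartPolicy setting.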

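The Pod restart policies are:

Always: the kubelet restarts the container automatically whenever it stops
OnFailure: the kubelet restarts the container automatically when it terminates with a non-zero exit code
Never: the kubelet never restarts the container, regardless of its state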
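
As a minimal sketch of where this setting lives, the manifest below sets restartPolicy at the Pod level, so it covers every container in the Pod. The Pod name restart-demo and the deliberately failing command are hypothetical, chosen only to make the container exit with a non-zero code.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo          # hypothetical name
spec:
  restartPolicy: OnFailure    # one of Always (the default), OnFailure, Never
  containers:
  - name: flaky
    image: busybox
    # Exits with code 1 after 5 seconds, so the kubelet restarts it under OnFailure.
    command: ["sh", "-c", "sleep 5; exit 1"]
```

With Never in place of OnFailure, the same container would stay terminated and the Pod would end up in the Failed phase.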

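The RestartPolicy required by each kind of controller is as follows:

RC and DaemonSet: must be set to Always, because the Pod needs to keep running
Job: OnFailure or Never, so that containers are not restarted once they have finished
kubelet: a Pod managed directly by the kubelet is restarted automatically when it fails, whatever its RestartPolicy is set to, and the kubelet does not run health checks on it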
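
To illustrate the Job case, here is a small sketch of a Job whose Pod template uses restartPolicy: Never. The Job name and command are hypothetical, and backoffLimit is included only to show that retries are then handled by the Job controller rather than by the kubelet.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-job          # hypothetical name
spec:
  backoffLimit: 2             # retries allowed before the Job is marked as failed
  template:
    spec:
      restartPolicy: Never    # a Job accepts only OnFailure or Never
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo done"]
```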

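Pod health checks and service availability checks

Kubernetes checks the health of a Pod with two kinds of probes:

LivenessProbe: determines whether the container is alive (Running state). If the LivenessProbe finds the container unhealthy, the kubelet kills the container and handles it according to the RestartPolicy. If a container has no LivenessProbe configured, the probe result is always treated as Success.
ReadinessProbe: determines whether the container is ready to serve (Ready state). Only a Pod that has reached the Ready state can receive requests.

Both LivenessProbe and ReadinessProbe can be configured in one of three ways:

ExecAction: runs a command inside the container; an exit code of 0 means the container is healthy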

apiVersion: v1
kind: Pod
metadata:
 labels:
   test: liveness
 name: liveness-exec
spec:
 containers:
 - name: liveness
   image: gcr.io/google_containers/busybox
   args:
   - /bin/sh
   - -c
   - echo ok > /tmp/health; sleep 10; rm -rf /tmp/health; sleep 600
   livenessProbe:
     exec:
       command:
       - cat
       - /tmp/health
     initialDelaySeconds: 15
     timeoutSeconds: 1

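TCPSocketAction: performs a TCP check against the container's IP address and port; the container is considered healthy if the connection can be established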

apiVersion: v1
kind: Pod
metadata:
 name: pod-with-healthcheck
spec:
 containers:
 - name: nginx
   image: nginx
   ports:
   - containerPort: 80
   livenessProbe:
     tcpSocket:
       port: 80
     initialDelaySeconds: 30
     timeoutSeconds: 1

HTTPGetAction: sends an HTTP GET request to the container; a response status code of at least 200 and below 400 means the container is healthy

apiVersion: v1
kind: Pod
metadata:
 name: pod-with-healthcheck
spec:
 containers:
 - name: nginx
   image: nginx
   ports:
   - containerPort: 80
   livenessProbe:
     httpGet:
       path: /_status/healthz
       port: 80
     initialDelaySeconds: 30
     timeoutSeconds: 1

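The following parameters can be set for every probe type:

initialDelaySeconds: how long to wait after the container starts before running the first health check, in seconds.
timeoutSeconds: how long to wait for a response after the health check request is sent, in seconds.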
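
The examples so far only configure a livenessProbe. As a hedged sketch, the manifest below adds a readinessProbe next to it, using the same two parameters. The Pod name and the probed path / are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-readiness    # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:           # gates whether the Pod receives requests
      httpGet:
        path: /               # assumed path; use the real health endpoint of the service
        port: 80
      initialDelaySeconds: 5
      timeoutSeconds: 1
    livenessProbe:            # decides whether the kubelet kills the container
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      timeoutSeconds: 1
```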

Automatic Pod scheduling

NodeSelector: directed scheduling

NodeSelector example:

apiVersion: v1
kind: ReplicationController 
metadata:
  name: redis-master
  labels:
    name: redis-master 
spec:
  replicas: 1
  selector:
    name: redis-master
  template:
    metadata:
      labels:
        name: redis-master
    spec:
      containers:
      - name: master
        image: kubeguide/redis-master
        ports:
        - containerPort: 6379
      nodeSelector:
        zone: north

Hands-on walkthrough:

[root@master ~]# kubectl label nodes slave2.beckyeh.com zone=north
node/slave2.beckyeh.com labeled

[root@master ~]# kubectl describe node slave2.beckyeh.com
Name:               slave2.beckyeh.com
Roles:              
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=slave2.beckyeh.com
                    kubernetes.io/os=linux
                    zone=north
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 May 2020 12:14:39 +0800
Taints:             
Unschedulable:      false
Lease:
  HolderIdentity:  slave2.beckyeh.com
  AcquireTime:     
  RenewTime:       Thu, 02 Jul 2020 19:01:09 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 29 Jun 2020 21:55:36 +0800   Mon, 29 Jun 2020 21:55:36 +0800   WeaveIsUp                    Weave pod has set this
  MemoryPressure       False   Thu, 02 Jul 2020 19:00:57 +0800   Mon, 29 Jun 2020 21:54:44 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 02 Jul 2020 19:00:57 +0800   Mon, 29 Jun 2020 21:54:44 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 02 Jul 2020 19:00:57 +0800   Mon, 29 Jun 2020 21:54:44 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 02 Jul 2020 19:00:57 +0800   Mon, 29 Jun 2020 21:54:44 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.50.196
  Hostname:    slave2.beckyeh.com
Capacity:
  cpu:                4
  ephemeral-storage:  46110724Ki
  hugepages-2Mi:      0
  memory:             16266132Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  42495643169
  hugepages-2Mi:      0
  memory:             16163732Ki
  pods:               110
System Info:
  Machine ID:                 3b0ca94b6fcc4f3abcbd2c5a52804850
  System UUID:                3B0CA94B-6FCC-4F3A-BCBD-2C5A52804850
  Boot ID:                    7f3952b2-9705-4778-baa6-c6c34f0c5822
  Kernel Version:             3.10.0-1127.8.2.el7.x86_64
  OS Image:                   CentOS Linux 7 (Core)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.9
  Kubelet Version:            v1.18.3
  Kube-Proxy Version:         v1.18.3
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                 ------------  ----------  ---------------  -------------  ---
  default                     dapi-test-pod-container-vars         125m (3%)     250m (6%)   32Mi (0%)        64Mi (0%)      2d20h
  default                     nginx-deployment-559fdddb7b-jzpjw    0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  kube-system                 kube-proxy-lrpbn                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         37d
  kube-system                 weave-net-wgrrg                      20m (0%)      0 (0%)      0 (0%)           0 (0%)         37d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                145m (3%)  250m (6%)
  memory             32Mi (0%)  64Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:              

[root@master ~]# kubectl create -f redis-master-controller.yaml 
replicationcontroller/redis-master created

[root@master ~]# kubectl get pods -o wide
NAME                                READY   STATUS      RESTARTS   AGE     IP          NODE                    NOMINATED NODE   READINESS GATES
redis-master-25jq4                  1/1     Running     0          46s     10.36.0.3   slave2.beckyeh.com

Note that if a Pod specifies a nodeSelector and no Node in the cluster carries the matching label, the Pod cannot be scheduled, even if other Nodes in the cluster are available.

Default Kubernetes labels:

kubernetes.io/hostname
beta.kubernetes.io/os
beta.kubernetes.io/arch
kubernetes.io/os
kubernetes.io/arch
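
These built-in labels can be used in a nodeSelector without labeling any node by hand. A minimal sketch, assuming a hypothetical Pod name; every entry under nodeSelector has to match for a Node to qualify.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: linux-amd64-only      # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/os: linux   # both labels must be present on the Node
    kubernetes.io/arch: amd64
  containers:
  - name: nginx
    image: nginx
```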

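Affinity scheduling

NodeAffinity: Node affinity scheduling

NodeAffinity is the scheduling mechanism intended to replace NodeSelector. It currently has two forms:

requiredDuringSchedulingIgnoredDuringExecution: the rules must be satisfied before the Pod can be scheduled onto a Node (functionally similar to nodeSelector); this is a hard constraint
preferredDuringSchedulingIgnoredDuringExecution: the scheduler tries to place the Pod on a Node that satisfies the rules but does not insist on it; this is a soft constraint. Multiple preference rules can be given weights.

IgnoredDuringExecution: if the labels of the Node a Pod runs on change while the Pod is running, the Pod keeps running on that Node.

NodeAffinity supports the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt.

Keep the following in mind when writing NodeAffinity rules:

If both nodeSelector and nodeAffinity are defined, both must be satisfied for the Pod to run on the Node
If nodeAffinity specifies multiple nodeSelectorTerms, satisfying any one of them is enough
If a nodeSelectorTerm specifies multiple matchExpressions, all of them must be satisfied

Example: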

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
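
The rules about multiple nodeSelectorTerms and multiple matchExpressions listed earlier can be sketched as follows. The zone and disk-type labels are reused from this post; the Pod name and the dedicated label key are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-or-and  # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: both expressions must hold (AND)
        - matchExpressions:
          - key: zone
            operator: In
            values:
            - north
          - key: disk-type
            operator: In
            values:
            - ssd
        # Term 2: an alternative; satisfying either term is enough (OR)
        - matchExpressions:
          - key: dedicated    # hypothetical label key
            operator: Exists
  containers:
  - name: nginx
    image: nginx
```

A Node qualifies if it is labeled zone=north and disk-type=ssd, or if it carries any dedicated label at all.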

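PodAffinity: Pod affinity and anti-affinity scheduling

Note: Pod affinity and anti-affinity require a substantial amount of processing, which can slow down scheduling significantly in large clusters.
We do not recommend using them in clusters larger than several hundred nodes.
Note: Pod anti-affinity requires nodes to be labeled consistently, that is, every node in the cluster must have an appropriate label matching topologyKey.
If some or all nodes are missing the specified topologyKey label, it can lead to unintended behavior.

Create a reference Pod: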

---
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  labels:
    security: "S1"
    app: "nginx"
spec:
  containers:
  - name: nginx
    image: nginx

podAffinity example:

# affinity
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: gcr.io/google_containers/pause:2.0

podAntiAffinity example:

# anti-affinity
---
# pods/pod-with-pod-affinity.yaml

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

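The affinity section of this pod defines one pod affinity rule and one pod anti-affinity rule.

In this example, podAffinity uses requiredDuringSchedulingIgnoredDuringExecution, whereas podAntiAffinity uses preferredDuringSchedulingIgnoredDuringExecution.

The pod affinity rule says that the pod can be scheduled onto a node only if that node is in the same zone as at least one already-running pod that has a label with key "security" and value "S1". (More precisely, the pod is eligible to run on node N if node N has a label with key failure-domain.beta.kubernetes.io/zone and some value V, and at least one node in the cluster with key failure-domain.beta.kubernetes.io/zone and value V is running a pod that has a label with key "security" and value "S1".)

The pod anti-affinity rule says that the pod prefers not to be scheduled onto a node in the same zone as a pod that has a label with key "security" and value "S2". (Because topologyKey is failure-domain.beta.kubernetes.io/zone, the rule is evaluated per zone rather than per node, and because it is a preferred rule, the scheduler may still place the pod there if no better node exists.)

In addition to labels you attach yourself, nodes come pre-populated with a set of standard labels.

Note: the values of these labels are cloud-provider specific and are not guaranteed to be reliable.
For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in others.

You can inspect a node's labels with kubectl describe node node_name: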

PS C:\Users\a2316> kubectl describe node gke-awoo-gke-worker-awoo-gke-worker-p-1aef4f65-7c8m
Name:               gke-awoo-gke-worker-awoo-gke-worker-p-1aef4f65-7c8m
Roles:              
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-standard-1
                    beta.kubernetes.io/masq-agent-ds-ready=true
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-nodepool=awoo-gke-worker-pool-proxy
                    cloud.google.com/gke-os-distribution=cos
                    failure-domain.beta.kubernetes.io/region=asia-east1
                    failure-domain.beta.kubernetes.io/zone=asia-east1-c
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-awoo-gke-worker-awoo-gke-worker-p-1aef4f65-7c8m
                    kubernetes.io/os=linux
                    node.kubernetes.io/masq-agent-ds-ready=true
                    projectcalico.org/ds-ready=true
                    worknode=high

Example: never place replicas on the same node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

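The example above uses a PodAntiAffinity rule with topologyKey: "kubernetes.io/hostname" to deploy the redis cluster so that no two instances land on the same host.
See the ZooKeeper tutorial for an example of a StatefulSet configured with anti-affinity for high availability (using the same technique).

Taint and Toleration

Concepts

Node affinity is a property of pods that attracts them to a set of nodes (either as a preference or as a hard requirement). Taints are the opposite: they allow a node to repel a set of pods.
Taints and tolerations work together to keep pods from being scheduled onto inappropriate nodes. One or more taints can be applied to a node, which marks that the node should not accept any pods that do not tolerate those taints.
Tolerations are applied to pods and allow (but do not require) the pods to be scheduled onto nodes with matching taints.

You can add a taint to a node with kubectl taint. For example: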

kubectl taint nodes node1 key=value:NoSchedule

Result:

PS C:\Users\a2316> kubectl.exe describe node slave2.beckyeh.com
Name:               slave2.beckyeh.com
Roles:              
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=slave2.beckyeh.com
                    kubernetes.io/os=linux
                    zone=north
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 May 2020 12:14:39 +0800
Taints:             key=value:NoSchedule

This places a taint on node node1. The taint has key key, value value, and effect NoSchedule.
It means that no pod can be scheduled onto node1 unless it has a matching toleration.
You specify a toleration for a pod in the PodSpec.

To remove the taint added by the command above, run:

kubectl taint nodes node1 key:NoSchedule-

Both of the following tolerations match the taint created by the kubectl taint command above, so a pod with either toleration can be scheduled onto node1:

tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"

tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
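
The fragments above show only the tolerations field. As a sketch of where it sits in a complete manifest, the Pod below carries the first toleration; the Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-toleration   # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:                # sibling of containers inside the PodSpec
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
```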

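A toleration matches a taint if the keys are the same, the effects are the same, and one of the following holds:

the operator is Exists (in which case no value should be specified)
the operator is Equal and the values are equal

The operator defaults to Equal if it is not specified.

Note: there are two special cases:
An empty key with operator Exists matches every key, value, and effect, which means the toleration tolerates any taint.

tolerations:
- operator: "Exists"

An empty effect matches all effects with the given key.

tolerations:
- key: "key"
  operator: "Exists"

The example above used the effect NoSchedule. Alternatively, you can use the effect PreferNoSchedule.
This is a "preference" or "soft" version of NoSchedule: the system tries to avoid placing a pod that does not tolerate the taint on the node, but it is not required.
The third effect is NoExecute, described later.

You can put multiple taints on the same node and multiple tolerations on the same pod.
Kubernetes processes multiple taints and tolerations like a filter: start with all of a node's taints, then ignore the ones for which the pod has a matching toleration.
The remaining un-ignored taints have the indicated effects on the pod. In particular:

If there is at least one un-ignored taint with effect NoSchedule, Kubernetes will not schedule the pod onto that node.
If there is no un-ignored taint with effect NoSchedule but there is at least one un-ignored taint with effect PreferNoSchedule, Kubernetes will try not to schedule the pod onto that node.
If there is at least one un-ignored taint with effect NoExecute, the pod will be evicted from the node (if it is already running there) and will not be scheduled onto the node (if it is not yet running there).

For example, imagine you taint a node like this: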

kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule

And a pod has two tolerations:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"

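In this case, the pod will not be scheduled onto the node, because no toleration matches the third taint.
But if the pod was already running on the node when the taints were added, it can continue to run there: the third taint is the only one of the three that the pod does not tolerate, and its effect is NoSchedule, which does not affect pods that are already running.

Normally, if a taint with effect NoExecute is added to a node, any pods that do not tolerate the taint are evicted immediately, and pods that do tolerate it are never evicted.
However, a toleration with effect NoExecute can specify an optional tolerationSeconds field, which dictates how long the pod stays bound to the node after the taint is added.

For example: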

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

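This means that if the pod is running and a matching taint is added to its node, the pod stays bound to the node for 3600 seconds and is then evicted.
If the taint is removed before that time, the pod is not evicted.

Taints and tolerations are a flexible way to steer pods away from nodes or to evict pods that should not be running. A few use cases:

Dedicated nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (for example, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this is most easily done by writing a custom admission controller). Pods with the toleration are then allowed to use the dedicated nodes as well as any other nodes in the cluster. If you want the pods to use only the dedicated nodes, additionally add a label similar to the taint to the same set of nodes (for example, dedicated=groupName), and have the admission controller also add a node affinity that requires the pods to be scheduled only onto nodes labeled dedicated=groupName.
Nodes with special hardware: In a cluster where a small subset of nodes has specialized hardware (for example, GPUs), it is desirable to keep pods that do not need the hardware off those nodes, leaving room for later-arriving pods that do need it. To do this, taint the nodes that have the specialized hardware (for example, kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and add a matching toleration to pods that use the hardware. As in the dedicated nodes case, the easiest way to apply the tolerations is a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint the special-hardware nodes with the extended resource name, and run the ExtendedResourceToleration admission controller. Because the nodes are tainted, no pods without the toleration are scheduled onto them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller automatically adds the correct toleration to the pod, and the pod is scheduled onto the special-hardware nodes. This ensures that the special-hardware nodes are dedicated to pods requesting the hardware, and you do not have to add tolerations to those pods manually.
Taint-based evictions: a per-pod-configurable eviction behavior when there are node problems, described in the next section.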
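
The dedicated nodes case above can be sketched as a single Pod spec that combines the toleration with the node affinity. This assumes the nodes were tainted with dedicated=groupName:NoSchedule and labeled dedicated=groupName as described; the Pod name and image are hypothetical, and in practice an admission controller might inject these fields instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-pod         # hypothetical name
spec:
  tolerations:                # lets the Pod onto the tainted nodes
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:             # keeps the Pod off every other node
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - groupName
  containers:
  - name: app
    image: nginx
```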

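Taint-based evictions

FEATURE STATE: Kubernetes v1.18 [stable]

The NoExecute taint effect, mentioned above, affects pods that are already running on the node as follows:

pods that do not tolerate the taint are evicted immediately
pods that tolerate the taint without specifying tolerationSeconds in their toleration remain bound forever
pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time

In addition, Kubernetes 1.6 introduced alpha support for representing node problems.
In other words, the node controller automatically taints a node when certain conditions are true. The following taints are built in:

node.kubernetes.io/not-ready: the node is not ready. This corresponds to the NodeCondition Ready being "False".
node.kubernetes.io/unreachable: the node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
node.kubernetes.io/out-of-disk: the node is out of disk space.
node.kubernetes.io/memory-pressure: the node has memory pressure.
node.kubernetes.io/disk-pressure: the node has disk pressure.
node.kubernetes.io/network-unavailable: the node's network is unavailable.
node.kubernetes.io/unschedulable: the node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: when the kubelet is started with an "external" cloud provider, this taint is set on the node to mark it as unusable. After a controller from the cloud-controller-manager initializes the node, the kubelet removes this taint.

When a node is to be evicted, the node controller or the kubelet adds the relevant taints with the NoExecute effect. If the fault condition returns to normal, the kubelet or node controller can remove the relevant taints.

Note: to maintain the existing rate limiting behavior of pod evictions due to node problems, the system adds the taints in a rate-limited way. This prevents massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

Combined with tolerationSeconds, this feature lets a pod specify how long it should stay bound to a node that has one or more of these problems.

For example, an application with a lot of local state might want to stay bound to its node for a long time in the event of a network partition, hoping that the partition recovers and the pod eviction can be avoided. The toleration for such a pod might look like this: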

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

Note that Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/not-ready.
Likewise, it adds a toleration for node.kubernetes.io/unreachable with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/unreachable.

These automatically added tolerations ensure that, by default, a pod stays bound to its node for 5 minutes after one of these problems is detected.
The two default tolerations are added by the DefaultTolerationSeconds admission controller.
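
For reference, this is roughly what the two injected tolerations look like when you read back a Pod that did not define them itself (for example with kubectl get pod -o yaml). Treat it as a sketch of the expected shape rather than captured output.

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300      # 5 minutes before eviction
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```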

DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:

node.kubernetes.io/unreachable
node.kubernetes.io/not-ready

This ensures that DaemonSet pods are never evicted because of these problems.

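Taint Nodes by Condition

The node lifecycle controller automatically creates taints with the NoSchedule effect corresponding to Node conditions. Similarly, the scheduler does not check Node conditions; instead it checks taints. This ensures that Node conditions do not directly affect what is scheduled onto the Node. Users can choose to ignore some of a Node's problems (represented as Node conditions) by adding the appropriate Pod tolerations.

Starting in Kubernetes 1.8, the DaemonSet controller automatically adds the following NoSchedule tolerations to all daemons, to prevent DaemonSets from breaking: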

node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/out-of-disk (only for critical pods)
node.kubernetes.io/unschedulable (1.10 or later)
node.kubernetes.io/network-unavailable (host network only)

Adding these tolerations ensures backward compatibility. You can also add arbitrary tolerations to DaemonSets.
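
An ordinary Pod can opt in to the same behavior. The sketch below, with a hypothetical Pod name, tolerates the memory-pressure taint so that the scheduler does not rule out a Node for that condition alone; whether that is wise depends on the workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerate-memory-pressure   # hypothetical name
spec:
  tolerations:
  - key: "node.kubernetes.io/memory-pressure"
    operator: "Exists"
    effect: "NoSchedule"           # matches the condition-based taint
  containers:
  - name: nginx
    image: nginx
```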

Pod Priority and Preemption: Pod priority scheduling

How to use priority and preemption

Add a PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
 name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

Create a Pod with priorityClassName

apiVersion: v1
kind: Pod
metadata:
 name: nginx
 labels:
   env: test
spec:
 containers:
 - name: nginx
   image: nginx
   imagePullPolicy: IfNotPresent
 priorityClassName: high-priority

Note: Kubernetes already ships with two PriorityClasses: system-cluster-critical and system-node-critical.
These are common classes used to ensure that critical components are always scheduled first.

[root@master ~]# kubectl get priorityclasses
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            48d
system-node-critical      2000001000   false            48d

[root@master ~]# kubectl describe priorityclasses system-cluster-critical
Name:           system-cluster-critical
Value:          2000000000
GlobalDefault:  false
Description:    Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.
Annotations:    
Events:

Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priorities, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See limit Priority Class consumption by default for details.
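
As a hedged sketch of the ResourceQuota approach mentioned in the warning, the object below caps how many Pods in one namespace may use the high-priority class defined earlier. The quota name, the namespace, and the limit of 10 are assumptions for illustration; it only takes effect in the namespace where it is created.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high-priority    # hypothetical name
  namespace: default          # assumed namespace
spec:
  hard:
    pods: "10"                # at most 10 Pods that match the scope below
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values:
      - high-priority
```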

PriorityClass

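A PriorityClass object can have any 32-bit integer value smaller than or equal to 1 billion.
Larger numbers are reserved for critical system Pods that should not normally be preempted or evicted.
A cluster administrator should create one PriorityClass object for each mapping from a priority class name to a value that they want.

PriorityClass has two optional fields: globalDefault and description.
The globalDefault field indicates that the value of this PriorityClass should be used for Pods without a priorityClassName.
Only one PriorityClass with globalDefault set to true can exist in the system.
If no PriorityClass has globalDefault set, the priority of Pods with no priorityClassName is 0.
Adding a PriorityClass with globalDefault set to true does not change the priorities of existing Pods.
The value of such a PriorityClass is used only for Pods created after the PriorityClass is added.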
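
A minimal sketch of a cluster-wide default follows; the name default-priority and the value 1000 are hypothetical. Once it exists, newly created Pods that omit priorityClassName get priority 1000 instead of 0.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority      # hypothetical name
value: 1000                   # hypothetical value
globalDefault: true           # only one PriorityClass may set this to true
description: "Applied to Pods that do not set priorityClassName."
```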

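The description field is an arbitrary string. It is meant to tell users of the cluster when they should use this PriorityClass.

If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass remain unchanged, but you cannot create more Pods that use the name of the deleted PriorityClass.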

Non-preempting PriorityClass

FEATURE STATE: Kubernetes v1.15 [alpha]

Example Non-preempting PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."

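Users may want to submit a job that is prioritized above other workloads but do not wish to discard existing work by preempting running Pods.
A high-priority job with PreemptionPolicy: Never is scheduled ahead of other queued pods as soon as sufficient cluster resources "naturally" become free.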
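
A Pod opts in to this behavior simply by naming the class. The sketch below references the high-priority-nonpreempting class from the example above; the Pod name and command are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker          # hypothetical name
spec:
  # Jumps ahead in the scheduling queue, but never evicts running Pods to make room.
  priorityClassName: high-priority-nonpreempting
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "echo working; sleep 3600"]
```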
