
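Kubernetes Node Health Monitoring

Updated 2022-06-02 09:17

Node Health Monitoring

Node Problem Detector is a daemon that monitors and reports on a node's health. You can run Node Problem Detector as a DaemonSet or as a standalone daemon. It collects information about node problems from various daemons and reports them to the API server as NodeCondition and Event objects.

To learn how to install and use Node Problem Detector, see the Node Problem Detector project documentation.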

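Before you begin

You need a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with it. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one with Minikube, or you can use one of these Kubernetes playgrounds:

Katacoda
Play with Kubernetes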

Limitations

Node Problem Detector only supports file-based kernel logs. Log tools such as journald are not supported.
Node Problem Detector uses the kernel log format to report kernel issues.

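Enabling Node Problem Detector

Some cloud providers enable Node Problem Detector as an addon. You can also enable Node Problem Detector with kubectl or by creating an addon Pod.

Using kubectl to enable Node Problem Detector

kubectl provides the most flexible management of Node Problem Detector. You can overwrite the default configuration to fit your environment or to detect custom node problems. For example:

Create a Node Problem Detector configuration similar to node-problem-detector.yaml: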

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector  
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/

Note: You should verify that the system log directory is correct for your operating system distribution.

Start Node Problem Detector with kubectl:

kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

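Using an addon Pod to enable Node Problem Detector

If you are using a custom cluster bootstrap solution and do not need to overwrite the default configuration, you can use the addon Pod to further automate the deployment.

Create node-problem-detector.yaml, and save the configuration in the addon Pod's directory /etc/kubernetes/addons/node-problem-detector on a control plane node.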

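Overwrite the configuration

The default configuration is embedded when the Docker image of Node Problem Detector is built.

However, you can use a ConfigMap to overwrite it, as follows:

Change the configuration files in config/
Create the ConfigMap node-problem-detector-config: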

kubectl create configmap node-problem-detector-config --from-file=config/

Change node-problem-detector.yaml to use the ConfigMap:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector  
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config

Recreate Node Problem Detector with the new configuration file:

# If Node Problem Detector is already running, delete it before recreating it
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml

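Note: This approach only applies to a Node Problem Detector started with kubectl.

Overwriting the configuration is not supported if Node Problem Detector runs as a cluster addon. The addon manager does not support ConfigMap.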

Kernel Monitor

Kernel Monitor is a system log monitor daemon supported in Node Problem Detector. Kernel Monitor watches the kernel log and detects known kernel issues according to predefined rules.

Kernel Monitor matches kernel issues against a predefined list of rules in config/kernel-monitor.json. The rule list is extensible: you can expand it by overwriting the configuration.

Add new NodeConditions

To support a new NodeCondition, create a condition definition within the conditions field in config/kernel-monitor.json:

{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
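
As a rough illustration of the step above, the following Python sketch appends a condition definition to an in-memory configuration. The config dictionary is a hypothetical, minimal stand-in for config/kernel-monitor.json (only the conditions and rules fields named in this document are assumed), and add_condition is a helper invented for this example, not part of Node Problem Detector.

```python
import json


def add_condition(config, cond_type, reason, message):
    """Append a condition definition unless one with the same type exists.

    Returns True if the condition was added, False if it was already there.
    """
    conditions = config.setdefault("conditions", [])
    if any(c.get("type") == cond_type for c in conditions):
        return False
    conditions.append({"type": cond_type, "reason": reason, "message": message})
    return True


# Hypothetical stand-in for the parsed contents of config/kernel-monitor.json.
config = {"conditions": [], "rules": []}

added = add_condition(
    config,
    "NodeConditionType",
    "CamelCaseDefaultNodeConditionReason",
    "arbitrary default node condition message",
)

# Show the resulting configuration as JSON.
print(json.dumps(config, indent=2))
```

In practice you would load the real file with json.load, apply the change, and write it back before recreating the ConfigMap.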

Detect new problems

To detect new problems, you can extend the rules field in config/kernel-monitor.json with a new rule definition:

{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
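
To make the role of the message regexp more concrete, here is a small Python sketch that checks log lines against a rule. It is only an approximation: Node Problem Detector is not written in Python and its regexp engine may differ in details, and the rule, the reason TaskHung, and both log lines below are made up for illustration.

```python
import re

# Hypothetical rule, shaped like the rule definition above.
rule = {
    "type": "temporary",
    "reason": "TaskHung",
    "message": r"task \S+ blocked for more than \d+ seconds",
}


def match_rules(line, rules):
    """Return the reasons of all rules whose "message" regexp matches the line."""
    return [r["reason"] for r in rules if re.search(r["message"], line)]


# Made-up kernel log lines, for illustration only.
hung_line = "INFO: task kworker/0:1 blocked for more than 120 seconds."
normal_line = "eth0: link becomes ready"

print(match_rules(hung_line, [rule]))
print(match_rules(normal_line, [rule]))
```

Testing a candidate regexp against sample lines like this before putting it into the rules field can catch patterns that are too broad or too narrow.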

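Configure the path for the kernel log device

Check the kernel log path location in your operating system (OS) distribution. The Linux kernel log device is usually presented as /dev/kmsg. However, the log path location varies by OS distribution. The log field in config/kernel-monitor.json represents the log path inside the container. You can configure the log field to match the device path as seen by Node Problem Detector.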
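
Because the log field holds the path as seen inside the container, a host path has to be translated through the volume mount. The Python sketch below does that translation for the mount used in the DaemonSet manifest in this document (host /var/log/ mounted at /log). The function name container_log_path and the example file /var/log/kern.log are assumptions made for this example; check where your distribution actually writes its kernel log.

```python
import posixpath


def container_log_path(host_log_path, host_dir="/var/log/", mount_path="/log"):
    """Translate a host log path into the path visible inside the container.

    Raises ValueError if the host path is not under the mounted host directory.
    """
    host_dir = posixpath.normpath(host_dir)
    host_log_path = posixpath.normpath(host_log_path)
    if host_log_path != host_dir and not host_log_path.startswith(host_dir + "/"):
        raise ValueError(f"{host_log_path} is not under {host_dir}")
    relative = posixpath.relpath(host_log_path, host_dir)
    return posixpath.normpath(posixpath.join(mount_path, relative))


# Hypothetical host log file; the result is a candidate value for the "log" field.
log_field = container_log_path("/var/log/kern.log")
print(log_field)
```

A path outside the mounted directory, such as /dev/kmsg, raises ValueError here, which is a reminder that it would need its own volume mount before the container could read it.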

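Add support for another log format

Kernel Monitor uses the Translator plugin to translate the internal data structure of the kernel log. You can implement a new translator for a new log format.

Recommendations and restrictions

It is recommended to run Node Problem Detector in your cluster to monitor node health. When running Node Problem Detector, you can expect extra resource overhead on each node. Usually this is acceptable, because:

The kernel log grows relatively slowly.
A resource limit is set for Node Problem Detector.
Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector benchmark result.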
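
For a rough sense of the cluster-wide overhead, the Python sketch below multiplies the per-node requests and limits from the DaemonSet manifest in this document by a node count. The cluster size of 50 nodes is a hypothetical figure, and the calculation assumes one Node Problem Detector Pod per node.

```python
# Per-node values copied from the DaemonSet manifest in this document.
REQUESTS = {"cpu_millicores": 20, "memory_mib": 20}
LIMITS = {"cpu_millicores": 200, "memory_mib": 100}


def cluster_overhead(node_count, per_node):
    """Scale per-node resource figures to the whole cluster."""
    return {name: value * node_count for name, value in per_node.items()}


nodes = 50  # hypothetical cluster size

print("requested:", cluster_overhead(nodes, REQUESTS))
print("upper bound:", cluster_overhead(nodes, LIMITS))
```

The requests figure is what the scheduler reserves; the limits figure is the worst case the Pods are allowed to consume.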
