
Chapter 4: Cluster Monitoring

Introduction to Kubernetes Cluster Monitoring

Earlier we learned how to monitor applications running in Kubernetes, but the state of the cluster itself matters just as much, for example:

Node health: memory, CPU, disk, network, and so on.

Component health: kube-apiserver, kube-controller-manager, kube-scheduler, CoreDNS, etc.

Resource object health: Deployment, DaemonSet, Service, Ingress, etc.

For monitoring the cluster itself, there are three main options:

cAdvisor: collects resource usage from inside containers; it is already built into the kubelet
kube-state-metrics: reports the state of Kubernetes objects, e.g. Deployment and Pod replica status
metrics-server: aggregates CPU and memory usage for nodes and Pods across the cluster

Kubernetes Node Monitoring

For traditional infrastructure we have the mature Zabbix for monitoring node health; in Kubernetes we can use the official Prometheus node-exporter to collect node metrics such as CPU, memory, disk, and network.

Project:

https://github.com/prometheus/node_exporter

Since this is node-level monitoring, we deploy node-exporter as a DaemonSet so one instance runs on every node:

cat > prom-node-exporter.yaml << 'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: prom
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true     # use the host's PID namespace
      hostIPC: true     # use the host's IPC namespace
      hostNetwork: true # use the host's network namespace
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        args:
        - --web.listen-address=$(HOSTIP):9100 # address and port to expose metrics on
        - --path.procfs=/host/proc            # procfs mount point
        - --path.sysfs=/host/sys              # sysfs mount point
        - --path.rootfs=/host/root            # rootfs mount point
        - --no-collector.hwmon                # disable collectors we don't need
        - --no-collector.nfs
        - --no-collector.nfsd
        - --no-collector.nvme
        - --no-collector.dmi
        - --no-collector.arp
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/) # mount points to ignore
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$ # filesystem types to ignore
        ports:
        - containerPort: 9100
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP # read the node's own IP address
        resources:
          requests:
            cpu: 150m
            memory: 180Mi
          limits:
            cpu: 150m
            memory: 180Mi
        securityContext:      # security context: defines the container's privileges and access control
          runAsNonRoot: true  # do not run the container as root
          runAsUser: 65534    # run as the given UID
        volumeMounts:
        - name: proc
          mountPath: /host/proc
        - name: sys
          mountPath: /host/sys
        - name: root
          mountPath: /host/root
          mountPropagation: HostToContainer # how mounts propagate from the host into the container
          readOnly: true
      tolerations:
      - operator: "Exists" # acts as a wildcard for values, so the Pod tolerates all taints
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: dev
        hostPath:
          path: /dev
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /
EOF

Configuration notes:

# Exposing Pod information to containers:
https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/

env:
- name: HOSTIP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP # read the node's own IP address
--------------------------------
securityContext:      # security context: defines the container's privileges and access control
  runAsNonRoot: true  # do not run the container as root
  runAsUser: 65534    # run as the given UID
--------------------------------
tolerations:
- operator: "Exists" # acts as a wildcard for values, so the Pod tolerates all taints
--------------------------------
hostPID: true     # use the host's PID namespace
hostIPC: true     # use the host's IPC namespace
hostNetwork: true # use the host's network namespace

Apply the manifest:

[root@node1 prom]# kubectl apply -f prom-node-exporter.yaml
daemonset.apps/node-exporter created

Check that the Pods were created:

[root@node1 prom]# kubectl -n prom get pod
NAME                          READY   STATUS    RESTARTS   AGE
mysql-dp-79b48cff96-m96bz     2/2     Running   0          15m
node-exporter-7zhjf           1/1     Running   0          41s
node-exporter-kzqvz           1/1     Running   0          41s
node-exporter-zbsrl           1/1     Running   0          41s
prometheus-796566c67c-lhrns   1/1     Running   0          60m

Kubernetes Service Discovery

We now have node-exporter on every node, but how do we scrape each node's data? Hard-coding every node IP, as we did above, does not scale, and pointing at a Service would yield only a single target; neither approach is practical. This is where Prometheus's service discovery comes in.

Inside Kubernetes, Prometheus integrates with the Kubernetes API and supports five service discovery roles: Node, Service, Pod, Endpoints, and Ingress.

Official documentation:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

Node auto-discovery

Edit the Prometheus configuration:

cat > prom-cm.yml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs: # Kubernetes service discovery
      - role: node           # discover cluster nodes
EOF

Apply the configuration:

[root@node1 prom]# kubectl apply -f prom-cm.yml
configmap/prometheus-config configured

[root@node1 prom]# curl -X POST "http://10.2.2.86:9090/-/reload"

Check Prometheus:

[screenshot]

At this point the targets report errors, because Prometheus defaults to the address the node role advertises, port 10250, while node-exporter listens on 9100, so we need to rewrite 10250 to 9100.

This is what the replace action of Prometheus's relabel_configs is for: it can rewrite label values dynamically, based on the target's metadata, before the scrape happens. Here we rewrite the port in the __address__ label to 9100.

Official relabel_config documentation:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

Edit the Prometheus configuration:

cat > prom-cm.yml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:                  # rewrite labels before scraping
      - action: replace                 # action performed on a regex match
        source_labels: ['__address__']  # labels whose values are read
        regex: '(.*):10250'             # regex the extracted value must match
        replacement: '${1}:9100'        # value written on a match
        target_label: __address__       # label the result is written to
EOF

Apply the configuration:

[root@node1 prom]# kubectl apply -f prom-cm.yml
configmap/prometheus-config configured

Wait a moment, then reload:
[root@node1 prom]# curl -X POST "http://10.2.2.86:9090/-/reload"

Check Prometheus:

[screenshot]

There is still one problem: the node targets currently carry only the host name and job name as labels, which is not enough for grouping queries later, so we want to attach more labels. For the node role, Prometheus exposes the following meta labels (see the official documentation for details):

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

[screenshot]

Label reference:

__meta_kubernetes_node_name: the name of the node object.
__meta_kubernetes_node_label_<labelname>: each label of the node object.
__meta_kubernetes_node_annotation_<annotationname>: each annotation of the node object.
__meta_kubernetes_node_address_<address_type>: the first address for each node address type.

Relabel actions:

<relabel_action> determines the relabeling action to take:

  • replace: Match regex against the concatenated source_labels. Then set target_label to replacement, with match group references (${1}, ${2}, ...) in replacement substituted by their values. If regex does not match, no replacement takes place.
  • lowercase: Maps the concatenated source_labels to their lower case.
  • uppercase: Maps the concatenated source_labels to their upper case.
  • keep: Drop targets for which regex does not match the concatenated source_labels.
  • drop: Drop targets for which regex matches the concatenated source_labels.
  • keepequal: Drop targets for which the concatenated source_labels do not match target_label.
  • dropequal: Drop targets for which the concatenated source_labels do match target_label.
  • hashmod: Set target_label to the modulus of a hash of the concatenated source_labels.
  • labelmap: Match regex against all source label names, not just those specified in source_labels. Then copy the values of the matching labels to label names given by replacement, with match group references (${1}, ${2}, ...) substituted by their values.
  • labeldrop: Match regex against all label names. Any label that matches will be removed from the set of labels.
  • labelkeep: Match regex against all label names. Any label that does not match will be removed from the set of labels.

Care must be taken with labeldrop and labelkeep to ensure that metrics are still uniquely labeled once those labels are removed.
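For example, a minimal sketch combining two of these actions (the env node label is made up for illustration):

relabel_configs:
# keep only nodes whose Kubernetes label 'env' equals 'prod'
- action: keep
  source_labels: [__meta_kubernetes_node_label_env]
  regex: prod
# copy every Kubernetes node label onto the target,
# e.g. __meta_kubernetes_node_label_zone="a" becomes zone="a"
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)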

Edit the configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__

      - action: labelmap                          # match the regex against all label names
        regex: __meta_kubernetes_node_label_(.+)  # promote matching node labels to target labels

Configuration notes:

Official documentation:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

We add a labelmap action, which takes every meta label matching the regex __meta_kubernetes_node_label_(.+) and promotes it to a regular target label.

Apply the configuration:

[root@node1 prom]# kubectl apply -f prom-cm.yml
configmap/prometheus-config configured

Wait a moment, then reload:
[root@node1 prom]# curl -X POST "http://10.2.2.86:9090/-/reload"

Check Prometheus:

[screenshot]

Kubelet auto-discovery

For the kubelet, Prometheus discovers port 10250, which serves HTTPS only and requires credentials; we can reuse the cluster's own certificates. Accessing cluster resources also requires the proper permissions, so we send the ServiceAccount token we created for Prometheus earlier: the RBAC resources produce a token Secret that is mounted into the Pod as files, so presenting that token grants Prometheus the right to query cluster resources. We also skip certificate verification.
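The same request can be made by hand from inside the Prometheus Pod as a sanity check (a sketch; substitute one of your node IPs for 10.0.0.11):

# the ServiceAccount token is mounted into the Pod as a file
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# -k skips certificate verification, mirroring insecure_skip_verify: true
curl -sk -H "Authorization: Bearer ${TOKEN}" https://10.0.0.11:10250/metrics | head -5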

The configuration:

cat > prom-cm.yml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: ['__address__']
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubelet'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
EOF

Configuration notes:

- job_name: 'kubelet'
  kubernetes_sd_configs:
  - role: node
  scheme: https                 # protocol used for scrape requests
  tls_config:                   # TLS settings
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt  # cluster CA certificate
    insecure_skip_verify: true  # skip certificate verification
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token  # ServiceAccount token; the Secret is mounted into the Pod as a file by default
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

Apply the configuration:

[root@node1 prom]# kubectl apply -f prom-cm.yml
configmap/prometheus-config configured

Wait a moment, then reload:
[root@node1 prom]# curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

Container (cAdvisor) auto-discovery

From studying Docker we already know that container metrics are collected by cAdvisor, and the kubelet has cAdvisor built in, so we only need to scrape it through the kubelet endpoint we just discovered: the same address as before, with /metrics/cadvisor appended to the path.
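Before wiring it into Prometheus, you can preview that endpoint through the API server proxy (a sketch; node1 stands in for one of your node names):

# fetch a node's cAdvisor metrics via the API server proxy
kubectl get --raw "/api/v1/nodes/node1/proxy/metrics/cadvisor" | head -5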

The configuration file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.2:9153', '10.2.1.61:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc.default:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:                      # Kubernetes service discovery
      - role: node                                # discover cluster nodes
      relabel_configs:
      - action: replace
        source_labels: ['__address__']            # source label to rewrite
        regex: '(.*):10250'                       # e.g. (10.0.0.10):10250
        replacement: '${1}:9100'                  # becomes 10.0.0.10:9100
        target_label: __address__                 # overwrite the original label
      - action: labelmap                          # match the regex against all label names
        regex: __meta_kubernetes_node_label_(.+)  # promote matching node labels to target labels

    - job_name: 'kubelet'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        replacement: /metrics/cadvisor
        target_label: __metrics_path__

Apply the configuration:

[root@node1 prom]# kubectl apply -f prom-cm.yml
configmap/prometheus-config configured

[root@node1 prom]# curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

The exported metrics are documented upstream:

https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md
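For instance, two typical queries over cAdvisor series (a sketch; the namespace filter is illustrative):

# per-container CPU usage rate over the last 5 minutes
rate(container_cpu_usage_seconds_total{namespace="prom", container!=""}[5m])

# current working-set memory per container
container_memory_working_set_bytes{namespace="prom", container!=""}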

API Server auto-discovery

The API Server is the core component of Kubernetes, and we can monitor it directly through its Kubernetes Service:

[root@node1 prom]# kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.1.0.1     <none>        443/TCP   10d

Edit the configuration:

- job_name: 'apiservers'
  kubernetes_sd_configs:
  - role: endpoints

Apply the configuration:

kubectl apply -f prom-cm.yml
curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

Prometheus has now discovered every endpoint in the cluster, so which ones do we actually want? Describing the API Server's Service shows that it communicates on port 6443, so the targets on port 6443 are the ones we need.

[root@node1 prom]# kubectl describe svc kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.1.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.0.0.11:6443
Session Affinity:  None
Events:            <none>

To keep only the discovered API Server targets, we look at which labels they carry, then keep the targets that have those labels.

[screenshot]

The condition to match is: the label __meta_kubernetes_service_label_component must have the value "apiserver".

Since this port serves HTTPS, we again attach the CA certificate and token for authentication.
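For reference, the same endpoint can also be fetched through kubectl, which handles authentication for you:

# the API server serves its own metrics at /metrics
kubectl get --raw /metrics | head -5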

Edit the configuration file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    - job_name: 'coredns'
      static_configs:
      - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    - job_name: 'mysql'
      static_configs:
      - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: ['__address__']
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubelet'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.*)
        replacement: /metrics/cadvisor
        target_label: __metrics_path__

    - job_name: 'apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        action: keep
        regex: apiserver

Apply the changes:

kubectl apply -f prom-cm.yml
curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

Pod auto-discovery

Pod metrics are also collected via Endpoints auto-discovery; we just need some additional matching rules.

The configuration:

- job_name: 'pods'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

Apply the configuration:

kubectl apply -f prom-cm.yml
curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

Matching rules explained:

__meta_kubernetes_service_annotation_prometheus_io_scrape must equal true
the port in __address__ is replaced by the port from __meta_kubernetes_service_annotation_prometheus_io_port

How the discovery works: when we created the Service we added annotations for the scrape flag and the metrics port, which is what makes it discoverable by Prometheus.

[root@node1 prom]# kubectl -n kube-system get svc kube-dns -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: KubeDNS

The matching entry appears in Prometheus's service discovery data:

[screenshot]

This means that any application we release from now on becomes discoverable just by adding these two annotations. Let's update the mysql Service we created earlier accordingly:

---
kind: Service
apiVersion: v1
metadata:
  name: mysql-svc
  namespace: prom
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9104"
spec:
  selector:
    app: mysql
  ports:
  - name: mysql
    port: 3306
    targetPort: 3306
  - name: mysql-prom
    port: 9104
    targetPort: 9104
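Equivalently, the annotations can be patched onto the live Service without editing the manifest (a sketch of the imperative form):

# add the discovery annotations to the existing Service in place
kubectl -n prom annotate svc mysql-svc \
  prometheus.io/scrape="true" prometheus.io/port="9104"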

Prometheus now discovers the mysql Pod automatically:

[screenshot]

We can now delete the static mysql job we configured earlier, and likewise the static coredns job, handling both via auto-discovery instead:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: prom
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

    #- job_name: 'coredns'
    #  static_configs:
    #  - targets: ['10.2.0.16:9153','10.2.0.17:9153']

    #- job_name: 'mysql'
    #  static_configs:
    #  - targets: ['mysql-svc:9104']

    - job_name: 'nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: ['__address__']
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubelet'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
        replacement: $1
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.*)
        replacement: /metrics/cadvisor
        target_label: __metrics_path__

    - job_name: 'apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_component]
        action: keep
        regex: apiserver

    - job_name: 'pods'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

    - job_name: 'coredns'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: kube-dns

Apply the configuration:

kubectl apply -f prom-cm.yml
curl -X POST "http://10.2.2.86:9090/-/reload"

Check the result:

[screenshot]

Monitoring Kubernetes Resource Objects

Introduction

The auto-discovery above used endpoints, so what we monitored was application-level data, but Kubernetes resource objects such as Pods, Deployments, and DaemonSets need monitoring too: how many Pods exist, what state they are in, and so on.

These metrics come from another exporter: kube-state-metrics.

Project:

https://github.com/kubernetes/kube-state-metrics

Installation

Clone the repository:

git clone https://github.com/kubernetes/kube-state-metrics.git

Edit the manifests:

Here we mainly change the image reference and add the auto-discovery annotations.

cd kube-state-metrics/examples/standard
vim deployment.yaml
-------------------
        - image: bitnami/kube-state-metrics:2.1.1
-------------------

vim service.yaml
-------------------
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: 2.1.1
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: kube-state-metrics
  namespace: kube-system
......................
-------------------

Create the resources:

[root@node1 standard]# kubectl apply -f ./
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
serviceaccount/kube-state-metrics created
service/kube-state-metrics created

Check Prometheus:

[screenshot]
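To confirm the exporter itself is serving data, a quick check via port-forward (a sketch):

# forward the Service locally, then inspect a well-known metric family
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | grep '^kube_pod_status_phase' | head -3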

Example use cases (one of these is turned into an alerting rule below):

A Job has failed: kube_job_status_failed
A cluster node is in a bad state: kube_node_status_condition{condition="Ready", status!="true"} == 1
A Pod in the cluster failed to start: kube_pod_status_phase{phase=~"Failed|Unknown"} == 1
A Pod container restarted within the last 30 minutes: changes(kube_pod_container_status_restarts_total[30m]) > 0
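As a sketch of how these expressions are typically used, here is one of them wired into a standard Prometheus alerting rule (the group and alert names are our own):

groups:
- name: kube-state-metrics.rules
  rules:
  - alert: PodNotReady
    expr: kube_pod_status_phase{phase=~"Failed|Unknown"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in phase {{ $labels.phase }}"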

More usage is covered in the official documentation:

https://github.com/kubernetes/kube-state-metrics

Updated: 2024-09-21 16:11:33