
Prometheus kubernetes-cadvisor Service Discovery

In this article, we configure automatic service discovery through kubernetes_sd_configs.

As before, the Kubernetes monitoring stack was installed with kube-prometheus.

Unlike the YAML files already provided by the prometheus-operator, this time we need to create a new YAML configuration file.

The content of prometheus-additional.yaml is shown below. Note that the discovery role is set to pod, which makes Prometheus automatically discover every pod in the Kubernetes cluster.

- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_pod_node_name]
    regex: (.+)
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    target_label: __metrics_path__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
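To see what these relabel rules do to a discovered target, here is a minimal Python sketch that simulates them. The label values are made-up placeholders; Prometheus applies these rules internally, this just mirrors the logic for illustration.

```python
import re

def relabel(labels):
    """Simulate the three relabel rules in the config above on one target.

    `labels` is a dict of target labels as service discovery would
    populate them (the __meta_kubernetes_* names come from the config).
    """
    out = dict(labels)
    # Rule 1: hardcode the scrape address to the API server service.
    out["__address__"] = "kubernetes.default.svc:443"
    # Rule 2: rewrite the metrics path to the node's cadvisor proxy endpoint.
    node = labels.get("__meta_kubernetes_pod_node_name", "")
    m = re.fullmatch(r"(.+)", node)
    if m:
        out["__metrics_path__"] = f"/api/v1/nodes/{m.group(1)}/proxy/metrics/cadvisor"
    # Rule 3 (labelmap): copy each pod label to a plain label name.
    for k, v in labels.items():
        lm = re.fullmatch(r"__meta_kubernetes_pod_label_(.+)", k)
        if lm:
            out[lm.group(1)] = v
    return out

target = {
    "__meta_kubernetes_pod_node_name": "k8s-node1",
    "__meta_kubernetes_pod_label_app": "nginx",
}
print(relabel(target)["__metrics_path__"])
# -> /api/v1/nodes/k8s-node1/proxy/metrics/cadvisor
```

The net effect: every discovered pod is scraped through the API server's node proxy, so the actual scrape URL becomes https://kubernetes.default.svc:443/api/v1/nodes/NODE/proxy/metrics/cadvisor.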

Next, edit prometheus-prometheus.yaml and add an additionalScrapeConfigs entry under spec, just before the serviceAccountName field:

spec:
  ...
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  ...

Delete the previous secret, if one exists:

# kubectl delete secret additional-configs -n monitoring
secret "additional-configs" deleted

On a first run there is of course nothing to delete, so just create the secret:

# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created

Then re-apply the Prometheus manifest, since we modified it:

# kubectl apply -f prometheus/prometheus-prometheus.yaml 
prometheus.monitoring.coreos.com/k8s configured

After a short wait, the Prometheus Configuration page reflects the change: there is now an entry with job_name kubernetes-cadvisor. The relevant section looks like this:

- job_name: kubernetes-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: pod
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap

Oddly, though, the kubernetes-cadvisor target does not appear on the Targets page.

# kubectl get po -n monitoring
NAME               READY   STATUS    RESTARTS   AGE
...
prometheus-k8s-0   3/3     Running   1          77d
prometheus-k8s-1   3/3     Running   1          77d

The pods themselves look fine; nothing abnormal there.

Let's check the pod logs: kubectl logs -f prometheus-k8s-0 prometheus -n monitoring

level=error ts=2020-03-05T02:56:36.493Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-03-05T02:56:37.491Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
level=error ts=2020-03-05T02:56:37.494Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"

The log contains many errors, mainly these three messages repeating in a loop:

  1. Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "endpoints" in API group "" at the cluster scope.

  2. Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "services" in API group "" at the cluster scope.

  3. Failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "pods" in API group "" at the cluster scope.

"endpoints/services/pods is forbidden" points to an RBAC permission problem: the ServiceAccount prometheus-k8s in the monitoring namespace lacks the required permissions.

Looking at prometheus-prometheus.yaml, we can see that Prometheus is bound to a ServiceAccount named prometheus-k8s.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  ...
  serviceAccountName: prometheus-k8s
  ...

The ServiceAccount prometheus-k8s is in turn bound to a ClusterRole, also named prometheus-k8s.

The binding is defined in prometheus-clusterRoleBinding.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring

Now look at the ClusterRole itself (prometheus-clusterRole.yaml):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

The rules above grant no list permission on endpoints, services, or pods, so we modify the role as follows:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# newly added rule
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch

Apply the change:

# kubectl apply -f prometheus-clusterRole.yaml 
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured

After this, the Targets show up.

Let's examine the second endpoint: https://kubernetes.default.svc:443/api/v1/nodes/k8s-node1/proxy/metrics/cadvisor

kubernetes.default.svc cannot be resolved from a node, so we first look up the real address behind that DNS name:

kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   62d

With the token extracted from the secret, we can hit the URL directly to fetch the metrics:

# curl -k https://10.96.0.1:443/api/v1/nodes/k8s-node1/proxy/metrics/cadvisor -H "Authorization: Bearer 
eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yaW5nIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InByb21ldGhldXMtazhzLXRva2VuLXNtNmdkIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InByb21ldGhldXMtazhzIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMTk2OTkyOTItNDY2Yi00NWQ4LWJmYmYtYzkyZjIwNjczOWY3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om1vbml0b3Jpbmc6cHJvbWV0aGV1cy1rOHMifQ.iggZ4ZxmD0y04OQfDlo4P6zRgzn0ryVhcdhlgncpnBY5BJ39Xz0a2AA51ePa78R2njFDjPcecgDJRcqPv76X3o-C-G7EZvN_Ru8zSdB51YxqlLNoIW5hy6Jr27aw74lMslg1MYX_31kkRTqD9DxVn6lq6Uqf4Djebj_E-2maiwl863GCeNRfS1X6KM8idsVknLlpdVINbM8U_l1Yuw-auNzelAk1NQlBdbJqsm1CZKIg_YBsT-KbiyTsbjX2v0uL1D6-Q5Xs9NZMLEAa7dfwz_EOYMDnIGbv-eyhD-924H4_pGOIoQ0dCBP01cxFm7pLJPGouwLaEwPs5BRS0B6u-w"
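The curl call above can be sketched in Python as well. This is only an illustration: the ClusterIP, node name, and token below are placeholders, and in a real pod the token lives at /var/run/secrets/kubernetes.io/serviceaccount/token.

```python
import urllib.request

def cadvisor_url(api_server, node_name):
    # Same path the relabel rule writes into __metrics_path__.
    return f"https://{api_server}/api/v1/nodes/{node_name}/proxy/metrics/cadvisor"

def build_request(api_server, node_name, token):
    # Attach the service-account token as a Bearer header, like curl -H does.
    req = urllib.request.Request(cadvisor_url(api_server, node_name))
    req.add_header("Authorization", f"Bearer {token}")
    return req

print(cadvisor_url("10.96.0.1:443", "k8s-node1"))
# -> https://10.96.0.1:443/api/v1/nodes/k8s-node1/proxy/metrics/cadvisor
```

To actually fetch the metrics you would pass the request to urllib.request.urlopen with an SSL context that skips certificate verification, which is what curl's -k flag does.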

The metrics all start with container_, exactly the same as the kubelet/1 metrics we discussed earlier: Prometheus monitoring/kubelet metrics.

So in principle we could add this target by modifying the kubelet/1 config instead. But kubelet/1 exists to monitor the kubelet itself; other pods are not kept as targets, because of the keep rule shown below:

relabel_configs:
- source_labels: [__meta_kubernetes_service_label_k8s_app]
  separator: ;
  regex: kubelet
  replacement: $1
  action: keep
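The keep action can be simulated with a few lines of Python: any target whose source label does not fully match the regex is dropped. The label values here are illustrative placeholders.

```python
import re

def keep_targets(targets, source_label, regex):
    """Simulate Prometheus's 'keep' relabel action: retain only the
    targets whose source label fully matches the regex."""
    pat = re.compile(regex)
    return [t for t in targets if pat.fullmatch(t.get(source_label, ""))]

targets = [
    {"__meta_kubernetes_service_label_k8s_app": "kubelet"},
    {"__meta_kubernetes_service_label_k8s_app": "nginx"},
    {},  # a service without the k8s-app label at all
]
kept = keep_targets(targets, "__meta_kubernetes_service_label_k8s_app", "kubelet")
print(len(kept))  # -> 1
```

This is why only the kubelet service survives in the kubelet/1 job, and why a separate job with role: pod is needed to scrape every pod.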

I have saved the metrics to kubernetes-cadvisor-1.txt (click to view) for download and inspection.