
We built a dashboard in our Grafana test environment and now want to run it in production. This is where Grafana's export and import features come in.

Export

In the top-right corner of the dashboard, click the Share Dashboard icon.

Then switch to the third tab, Export, and click the Save to file button. This saves the dashboard to a JSON file and downloads it to your machine.

Import

Click the Import menu.

Click the Upload .json File button and upload the JSON file you just saved. Alternatively, paste the contents of the downloaded JSON into the Or paste JSON text box and click the Load button.
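For repeatable deployments, the same JSON can also be pushed to the production Grafana through its HTTP API instead of the UI. A minimal sketch, assuming a hostname grafana.example.com and an API key in GRAFANA_API_KEY (both placeholders), with jq used to wrap the exported file into the payload expected by the /api/dashboards/db endpoint and to null the dashboard id so a new dashboard is created. If the export contains __inputs placeholders (the "export for sharing externally" option), those data source inputs would need to be resolved first.

# jq '{dashboard: (. + {id: null}), overwrite: true}' mydashboard.json > payload.json
# curl -X POST https://grafana.example.com/api/dashboards/db \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d @payload.json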

In this article, we configure service auto-discovery through kubernetes_sd_configs.

As before, the Kubernetes monitoring stack was installed with kube-prometheus.

Unlike the earlier setup, where the YAML files were already provided by the Operator (kube-prometheus), this time we need to create a new YAML configuration file ourselves.

The content of prometheus-additional.yaml is shown below. Note that the discovery role is set to pod; with it, Prometheus will automatically discover every pod in the Kubernetes cluster.

- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_pod_node_name]
    regex: (.+)
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    target_label: __metrics_path__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)

Next, prometheus-prometheus.yaml also needs to be modified: under spec, add an additionalScrapeConfigs field right before serviceAccountName, as follows:

spec:
  ...
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  ...

Delete the previous secret, if one exists.

# kubectl delete secret additional-configs -n monitoring
secret "additional-configs" deleted

On the first setup there is of course nothing to delete, so simply create the secret:

# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
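To double-check that the secret really contains the scrape configuration we expect, it can be decoded back out of the cluster (a quick sanity check; note the escaped dot in the jsonpath key):

# kubectl get secret additional-configs -n monitoring -o jsonpath='{.data.prometheus-additional\.yaml}' | base64 -d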

Then apply the Prometheus configuration, since it has been modified.

# kubectl apply -f prometheus/prometheus-prometheus.yaml 
prometheus.monitoring.coreos.com/k8s configured

After a short wait, we can see that the Prometheus Configuration page has changed: there is now a job_name entry called kubernetes-cadvisor. The kubernetes-cadvisor part looks like this:

- job_name: kubernetes-cadvisor
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: pod
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap

Strangely, though, the kubernetes-cadvisor target does not show up on the Targets page.

# kubectl get po -n monitoring
NAME               READY   STATUS    RESTARTS   AGE
...
prometheus-k8s-0   3/3     Running   1          77d
prometheus-k8s-1   3/3     Running   1          77d

Checking the pods reveals nothing abnormal.

Let's look at the pod logs: kubectl logs -f prometheus-k8s-0 prometheus -n monitoring

level=error ts=2020-03-05T02:56:36.493Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:263: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" at the cluster scope"
level=error ts=2020-03-05T02:56:37.491Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:264: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" at the cluster scope"
level=error ts=2020-03-05T02:56:37.494Z caller=klog.go:94 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:265: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" at the cluster scope"

There are many error entries, mostly the three lines above repeating in a loop.

  1. Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "endpoints" in API group "" at the cluster scope.

  2. Failed to list *v1.Service: services is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "services" in API group "" at the cluster scope.

  3. Failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:monitoring:prometheus-k8s" cannot list resource "pods" in API group "" at the cluster scope.

endpoints/services/pods is forbidden points to an RBAC permission problem: the prometheus-k8s serviceaccount in the monitoring namespace lacks the required permissions.
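This can be confirmed directly with kubectl auth can-i by impersonating the service account (run as a cluster admin; until the ClusterRole is fixed below, each check should answer no):

# kubectl auth can-i list pods --as=system:serviceaccount:monitoring:prometheus-k8s
no
# kubectl auth can-i list endpoints --as=system:serviceaccount:monitoring:prometheus-k8s
no
# kubectl auth can-i list services --as=system:serviceaccount:monitoring:prometheus-k8s
no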

Looking at prometheus-prometheus.yaml, we can see that Prometheus is bound to a ServiceAccount object named prometheus-k8s.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  ...
  serviceAccountName: prometheus-k8s
  ...

prometheus-clusterRole.yaml shows that the ServiceAccount named prometheus-k8s is bound to a ClusterRole, also named prometheus-k8s.

The binding itself is defined in prometheus-clusterRoleBinding.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring

Let's look at this ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

The rules above grant no list permission on endpoints/services/pods, so we modify them as follows:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# new apiGroups
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch

Apply it:

# kubectl apply -f prometheus-clusterRole.yaml 
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured

Now the target shows up on the Targets page.
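The same can also be verified from the command line through Prometheus's HTTP API. A sketch, assuming the kube-prometheus default service name prometheus-k8s and that jq is installed locally:

# kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090 &
# curl -s http://localhost:9090/api/v1/targets | \
    jq '.data.activeTargets[] | select(.labels.job=="kubernetes-cadvisor") | .health'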

Let's take a closer look at the second endpoint: https://kubernetes.default.svc:443/api/v1/nodes/k8s-node1/proxy/metrics/cadvisor

kubernetes.default.svc cannot be resolved on a node, so we first look up the real address behind this domain name.

kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   62d

With the token obtained from the secret, we can then access the URL to fetch the metrics.
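One way to obtain that token is to read it out of the service account's token secret and base64-decode it (a sketch; this assumes a Kubernetes version where the token secret is still auto-created and referenced from the ServiceAccount, as was the case when this was written). The resulting $TOKEN is what appears literally in the Authorization header of the curl below.

# TOKEN=$(kubectl -n monitoring get secret \
    $(kubectl -n monitoring get sa prometheus-k8s -o jsonpath='{.secrets[0].name}') \
    -o jsonpath='{.data.token}' | base64 -d)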

# curl -k https://10.96.0.1:443/api/v1/nodes/k8s-node1/proxy/metrics/cadvisor \
    -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yaW5nIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InByb21ldGhldXMtazhzLXRva2VuLXNtNmdkIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InByb21ldGhldXMtazhzIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMTk2OTkyOTItNDY2Yi00NWQ4LWJmYmYtYzkyZjIwNjczOWY3Iiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om1vbml0b3Jpbmc6cHJvbWV0aGV1cy1rOHMifQ.iggZ4ZxmD0y04OQfDlo4P6zRgzn0ryVhcdhlgncpnBY5BJ39Xz0a2AA51ePa78R2njFDjPcecgDJRcqPv76X3o-C-G7EZvN_Ru8zSdB51YxqlLNoIW5hy6Jr27aw74lMslg1MYX_31kkRTqD9DxVn6lq6Uqf4Djebj_E-2maiwl863GCeNRfS1X6KM8idsVknLlpdVINbM8U_l1Yuw-auNzelAk1NQlBdbJqsm1CZKIg_YBsT-KbiyTsbjX2v0uL1D6-Q5Xs9NZMLEAa7dfwz_EOYMDnIGbv-eyhD-924H4_pGOIoQ0dCBP01cxFm7pLJPGouwLaEwPs5BRS0B6u-w"

The metrics all start with container_, exactly the same as the kubelet/1 metrics we discussed earlier in Prometheus monitoring/kubelet metrics.
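As a quick example of putting these cAdvisor metrics to use, per-pod CPU usage can be queried through the API with standard PromQL (a sketch; it reuses the port-forward started above, the image!="" filter drops the pause-container series, and on older kubelets the label is pod_name rather than pod):

# curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (namespace, pod)' | jq .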

So in principle the same target could be added by modifying the kubelet/1 job. However, kubelet/1 is mainly meant to monitor the kubelet itself; other pods are not monitored because their targets do not pass the keep action shown below:

relabel_configs:
- source_labels: [__meta_kubernetes_service_label_k8s_app]
  separator: ;
  regex: kubelet
  replacement: $1
  action: keep

I have saved the captured metrics to kubernetes-cadvisor-1.txt for download and reference.

Running df -h shows that the /var/lib/docker directory is using too much space.

# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 898M 0 898M 0% /dev
tmpfs 910M 0 910M 0% /dev/shm
tmpfs 910M 60M 851M 7% /run
tmpfs 910M 0 910M 0% /sys/fs/cgroup
/dev/mapper/centos-root 17G 5.9G 12G 35% /
/dev/sda1 1014M 150M 865M 15% /boot
tmpfs 910M 4.0K 910M 1% /var/lib/kubelet/pods/0914bac1-2c2f-4294-a036-afe6fb5d2a16/volumes/kubernetes.io~secret/config
tmpfs 910M 4.0K 910M 1% /var/lib/kubelet/pods/7975066f-117d-4523-a55e-53262ed8f131/volumes/kubernetes.io~secret/config-volume
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/7454b8be-1bd2-4c17-9784-55bf49d0e6fd/volumes/kubernetes.io~secret/node-exporter-token-hwqlc
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/7975066f-117d-4523-a55e-53262ed8f131/volumes/kubernetes.io~secret/alertmanager-main-token-z4qrq
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/71e0ccae-267d-4025-b4ec-a5ccbd1f0adc/volumes/kubernetes.io~secret/kube-proxy-token-kq9lp
tmpfs 910M 4.0K 910M 1% /var/lib/kubelet/pods/e685593b-9684-48cc-a304-040b30707088/volumes/kubernetes.io~secret/config
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/53e83a95-4db5-48f7-b166-b37df18b38e0/volumes/kubernetes.io~secret/calico-node-token-f6g2m
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/0914bac1-2c2f-4294-a036-afe6fb5d2a16/volumes/kubernetes.io~secret/prometheus-k8s-token-sm6gd
tmpfs 910M 12K 910M 1% /var/lib/kubelet/pods/e685593b-9684-48cc-a304-040b30707088/volumes/kubernetes.io~secret/prometheus-k8s-token-sm6gd
overlay 17G 5.9G 12G 35% /var/lib/docker/overlay2/70944089660c7647c7a83e1dcef9d6297e3e440d4d2d4742f5d68448bde9ad95/merged
shm 64M 0 64M 0% /var/lib/docker/containers/e59d5037dc150cf7c2ed10ea94d8549c3b98e0b847980a1a306eda8dcaf7e6de/mounts/shm
overlay 17G 5.9G 12G 35% /var/lib/docker/overlay2/bbdeb82071f8c4cbb2d59e26721735d5f3fd82f8139c0a448133156a0f12631f/merged
overlay 17G 5.9G 12G 35% /var/lib/docker/overlay2/383c5b7d30028f38eca1e4f2b3ac6144d52b5b1e8f296f2314de9ca0fd867f7d/merged
shm 64M 0 64M 0% /var/lib/docker/containers/df1ee27e97c823d18fe76ca3a7195cd081e466527b3cf9169a9a9f816b4c3391/mounts/shm

Running docker ps -a reveals a large number of stopped or exited containers, so the space needs to be cleaned up.

# docker system prune
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all build cache
Are you sure you want to continue? [y/N] y
Deleted Containers:
64d81b1dc03026478f0114b657d4cd938967868967e5e5ee9b8ea3d72814a736
c5f180a1c5c2804d45d13d11f338d241e482ada9673bff0ba8cc18759ae4fcae
aa06316b5fd638c22fc41bd8f8ea4c511566c4c537c81bf5aad5f276ac6c2fae
ce5048ebf25bd48bbe1372aaf38c68078a05c06157ffcb4a8e646470f647acb0
0c4b37413e3483366fb640b94ad33596aef149a50bc610a2d720663159280871
7575ec0a092bc9d3dd1a54ac2e3b92ce0dd8f3f570565ecc918684b0c51087ea
1e335ee77f4167c19f6e54a1728d3d49775f651d871009752b1cfb7ec9fd7867
393bd6280ecf423804c27f264cfc95e72df7580dc6881bb1a249152b4554b10e
63ad5c44a30c80645e84b16cc8e1009e46235bd7001153f8d8d3c4887ed18786
ee8c92a30092e7c30338d2a6721d13f250c2a426947a29c2e2c30f9936c0e07a
cd49e432a75a5f9719e8436af70a40e9869e02db51c4f8c08f02ad26a6bdc3ba
35435e8cd10d2dd9b26993da909aba31f4d52df3475b30ba7536d473004c7e91
6028cf4eac7090a89c05edad7e45096130ee7e3336f28d2c0e2a890111d80043
6e1d20022ecdd871a684b016c89d4d1122c9c5a00de998f921e2eacbd54da482
6417077821f3fc76da1e436401b0ed58a842b8afec6134c3895d5f3c7ae8d069
b49f44b5a26aad43d9d7b0f1ff57b659d4aab90d13477063160862018d3ba6f3

Deleted Images:
untagged: nginx@sha256:b2d89d0a210398b4d1120b3e3a7672c16a4ba09c2c4a0395f18b9f7999b768f2
deleted: sha256:f7bb5701a33c0e572ed06ca554edca1bee96cbbc1f76f3b01c985de7e19d0657
deleted: sha256:8f1984cf0de0461043fcabd6b2a2040325ac001d9897a242e0d80486fe71575e
deleted: sha256:59d91a36ba4b720eadfbf346ddb825c2faa2d66cc7a915238eaf926c4b4b40ee
deleted: sha256:556c5fb0d91b726083a8ce42e2faaed99f11bc68d3f70e2c7bbce87e7e0b3e10
untagged: nginx@sha256:ee380d57f01eac528f59d031cbeff5eba98b077d25ebdc7d6aef6734f00a41b7
deleted: sha256:231d40e811cd970168fb0c4770f2161aa30b9ba6fe8e68527504df69643aa145
deleted: sha256:dc8adf8fa0fc82a56c32efac9d0da5f84153888317c88ab55123d9e71777bc62
deleted: sha256:77fcff986d3b13762e4777046b9210a109fda20cb261bd3bbe5d7161d4e73c8e

Total reclaimed space: 183.4MB

This removes the unused containers and images.
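To see how much space images, containers, local volumes, and build cache are taking up before and after such a cleanup, docker system df gives a quick breakdown; the -v flag adds per-image and per-container detail:

# docker system df
# docker system df -v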