# Author: 任少近
Table of Contents
- 6 Ceph Resource Object Management
- 6.1 Viewing services
- 6.2 Viewing Jobs
- 6.3 Viewing deployments.apps
- 6.4 Viewing daemonsets.apps
- 6.5 Viewing configmaps
- 6.6 Viewing clusterroles.rbac.authorization.k8s.io
- 6.7 Viewing clusterrolebindings.rbac.authorization.k8s.io
- 6.8 Viewing OSD pool information via cephclusters.ceph
- 7 Accessing Ceph
- 7.1 Toolbox client
- 7.2 Accessing Ceph from K8s nodes
- 7.3 Exposing a port for web access
- 7.4 Deleting an OSD Deployment
- 7.5 Ceph pools (multi-tenancy): creating a pool and setting the PG count
- 7.6 Changing the login password
- 8 Installation Error Summary
- 8.1 ceph-common install error for the Quincy release
- 9 Troubleshooting
- 9.1 Ceph cluster reports "daemons have recently crashed", health: HEALTH_WARN
- 9.2 OSD down
6 Ceph Resource Object Management
6.1 Viewing services
[root@k8s-master ~]# kubectl -n rook-ceph get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
rook-ceph-mgr ClusterIP 10.110.141.201 <none> 9283/TCP 13h
rook-ceph-mgr-dashboard ClusterIP 10.103.197.146 <none> 8443/TCP 13h
rook-ceph-mon-a ClusterIP 10.110.163.61 <none> 6789/TCP,3300/TCP 13h
rook-ceph-mon-b ClusterIP 10.100.49.10 <none> 6789/TCP,3300/TCP 13h
rook-ceph-mon-c ClusterIP 10.96.193.162 <none> 6789/TCP,3300/TCP 13h
6.2 Viewing Jobs
[root@k8s-master]# kubectl -n rook-ceph get jobs
NAME COMPLETIONS DURATION AGE
rook-ceph-osd-prepare-k8s-master 1/1 6s 11h
rook-ceph-osd-prepare-k8s-node1 1/1 7s 11h
rook-ceph-osd-prepare-k8s-node2 1/1 7s 11h
rook-ceph-osd-prepare-k8s-node3 1/1 6s 11h
6.3 Viewing deployments.apps
[root@k8s-master]# kubectl -n rook-ceph get deployments.apps
NAME READY UP-TO-DATE AVAILABLE AGE
csi-cephfsplugin-provisioner 2/2 2 2 12h
csi-rbdplugin-provisioner 2/2 2 2 12h
rook-ceph-crashcollector-k8s-master 1/1 1 1 12h
rook-ceph-crashcollector-k8s-node1 1/1 1 1 12h
rook-ceph-crashcollector-k8s-node2 1/1 1 1 12h
rook-ceph-crashcollector-k8s-node3 1/1 1 1 12h
rook-ceph-mgr-a 1/1 1 1 12h
rook-ceph-mgr-b 1/1 1 1 12h
rook-ceph-mon-a 1/1 1 1 12h
rook-ceph-mon-b 1/1 1 1 12h
rook-ceph-mon-c 1/1 1 1 12h
rook-ceph-operator 1/1 1 1 12h
rook-ceph-osd-0 1/1 1 1 12h
rook-ceph-osd-1 1/1 1 1 12h
rook-ceph-osd-2 1/1 1 1 12h
rook-ceph-osd-3 1/1 1 1 12h
6.4 Viewing daemonsets.apps
[root@k8s-master]# kubectl -n rook-ceph get daemonsets.apps
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
csi-cephfsplugin 4 4 4 4 4 <none> 12h
csi-rbdplugin 4 4 4 4 4 <none> 12h
6.5 Viewing configmaps
[root@k8s-master]# kubectl -n rook-ceph get configmaps
NAME DATA AGE
kube-root-ca.crt 1 13h
rook-ceph-csi-config 1 12h
rook-ceph-csi-mapping-config 1 12h
rook-ceph-mon-endpoints 5 12h
rook-ceph-operator-config 33 13h
rook-config-override 1 12h
6.6 Viewing clusterroles.rbac.authorization.k8s.io
[root@k8s-master ~]# kubectl -n rook-ceph get clusterroles.rbac.authorization.k8s.io
NAME CREATED AT
cephfs-csi-nodeplugin 2023-06-13T13:56:29Z
cephfs-external-provisioner-runner 2023-06-13T13:56:29Z
rbd-csi-nodeplugin 2023-06-13T13:56:29Z
rbd-external-provisioner-runner 2023-06-13T13:56:29Z
rook-ceph-cluster-mgmt 2023-06-13T13:56:29Z
rook-ceph-global 2023-06-13T13:56:29Z
rook-ceph-mgr-cluster 2023-06-13T13:56:29Z
rook-ceph-mgr-system 2023-06-13T13:56:29Z
rook-ceph-object-bucket 2023-06-13T13:56:29Z
rook-ceph-osd 2023-06-13T13:56:29Z
rook-ceph-system 2023-06-13T13:56:29Z
6.7 Viewing clusterrolebindings.rbac.authorization.k8s.io
kubectl -n rook-ceph get clusterrolebindings.rbac.authorization.k8s.io
cephfs-csi-nodeplugin-role ClusterRole/cephfs-csi-nodeplugin
cephfs-csi-provisioner-role ClusterRole/cephfs-external-provisioner-runner
rbd-csi-nodeplugin ClusterRole/rbd-csi-nodeplugin
rbd-csi-provisioner-role ClusterRole/rbd-external-provisioner-runner
rook-ceph-global ClusterRole/rook-ceph-global
rook-ceph-mgr-cluster ClusterRole/rook-ceph-mgr-cluster
rook-ceph-object-bucket ClusterRole/rook-ceph-object-bucket
rook-ceph-osd ClusterRole/rook-ceph-osd
rook-ceph-system ClusterRole/rook-ceph-system
6.8 Viewing OSD pool information via cephclusters.ceph
If you manage the Ceph cluster with the Rook Ceph Operator, you can also inspect Rook's custom resources to get information about the OSD pools:
[root@k8s-master ~]# kubectl -n rook-ceph get cephclusters.ceph.rook.io rook-ceph -o yaml
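If you only need specific parts of that output, a couple of hedged follow-up queries (assuming the default rook-ceph namespace and the standard Rook CRD names; adjust if your cluster differs):
# Only the storage/OSD section of the CephCluster spec
kubectl -n rook-ceph get cephclusters.ceph.rook.io rook-ceph -o jsonpath='{.spec.storage}'
# Block pools managed by Rook, if any have been created as custom resources
kubectl -n rook-ceph get cephblockpools.ceph.rook.io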
7 Accessing Ceph
7.1 Toolbox client
Deploy the toolbox:
cd rook/deploy/examples/
kubectl apply -f toolbox.yaml
Connect to the Ceph cluster:
[root@k8s-master ~]# kubectl -n rook-ceph exec -it rook-ceph-tools-7857bc9568-q9fjk -- /bin/bash
bash-4.4$ ceph -s
  cluster:
    id:     e320aa6c-0057-46ad-b2bf-5c49df8eba5a
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: b(active, since 13h), standbys: a
    osd: 4 osds: 4 up (since 13h), 4 in (since 13h)
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   45 MiB used, 200 GiB / 200 GiB avail
    pgs:     1 active+clean
7.2 Accessing Ceph from K8s nodes
Add ceph.conf and a keyring on the node:
[root@k8s-master]#mkdir /etc/ceph
[root@k8s-master]#cd /etc/ceph
[root@k8s-master]#vi ceph.conf
[global]
mon_host = 10.110.163.61:6789,10.100.49.10:6789,10.96.193.162:6789
[client.admin]
keyring = /etc/ceph/keyring
[root@k8s-master]#vi keyring
[client.admin]
key = AQCGfYhkeMnEFRAAJnW4jUMwmJz2b1dPvdTOJg==
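The mon_host addresses and the client.admin key above are specific to this cluster. A hedged way to look up your own values, assuming the toolbox from 7.1 is running (the pod name below is the one used in this article and will differ in your cluster):
# mon_host: use the ClusterIPs of the rook-ceph-mon-* services from 6.1
kubectl -n rook-ceph get services | grep rook-ceph-mon
# client.admin key: read it from inside the toolbox
kubectl -n rook-ceph exec -it rook-ceph-tools-7857bc9568-q9fjk -- ceph auth get-key client.admin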
Verify connectivity (any one of the three mon service addresses above works):
telnet 10.110.163.61 6789
Add a yum repository (e.g. /etc/yum.repos.d/ceph.repo):
[ceph]
name=ceph
baseurl=https://mirrors.aliyun.com/ceph/rpm-quincy/el8/x86_64/
enabled=1
gpgcheck=0
Install ceph-common (the installation fails here; see 8.1 for details):
[root@k8s-master]#yum install -y ceph-common
If the installation succeeds, you can operate the cluster directly from the node, as shown below.
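A minimal sanity check once ceph-common is installed (assuming /etc/ceph/ceph.conf and /etc/ceph/keyring were created as in 7.2):
[root@k8s-master]# ceph -s           # should report the same cluster status as the toolbox
[root@k8s-master]# ceph osd tree     # lists the OSDs directly from the node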
7.3 Exposing a port for web access
Apply rook/deploy/examples/dashboard-external-https.yaml:
[root@k8s-master examples]#kubectl apply -f rook/deploy/examples/dashboard-external-https.yaml
rook-ceph-mgr-dashboard-external-https NodePort 10.106.127.224 <none> 8443:31555/TCP
Get the dashboard password:
kubectl -n rook-ceph get secrets rook-ceph-dashboard-password -o jsonpath='{.data.password}' | base64 --decode > rook-ceph-dashboard-password.password
# decoded password (contents of the file above):
G+LIkJwXQ/E*>/P&DbzB
Open the dashboard in a browser; the username is admin:
https://192.168.123.194:31555/
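The NodePort (31555 here) is assigned by Kubernetes and will differ per cluster; a quick way to confirm it before opening the browser (the service name comes from the example manifest applied above):
kubectl -n rook-ceph get service rook-ceph-mgr-dashboard-external-https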
7.4 Deleting an OSD Deployment
If removeOSDsIfOutAndSafeToRemove: true is set in cluster.yaml, the Rook Operator automatically removes the OSD Deployment; the default is false.
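For reference, a minimal sketch of where this field sits in cluster.yaml (only the relevant line is shown; the rest of the CephCluster spec is omitted and the values are illustrative):
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # Let the operator remove the Deployment of an OSD that is out and safe to remove
  removeOSDsIfOutAndSafeToRemove: true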
7.5 Ceph pools (multi-tenancy): creating a pool and setting the PG count
Data placement is managed at pool granularity; if you do not create or specify a pool, data goes into the default pool. When creating a pool you must set the number of PGs. As a rule of thumb, aim for about 100 PGs per OSD, or follow these guidelines:
Fewer than 5 OSDs: set pg_num to 128.
5-10 OSDs: set pg_num to 512.
10-50 OSDs: set pg_num to 4096.
More than 50 OSDs: use pgcalc to work out a value.
A pool also needs a CRUSH rule, which is the policy that decides how its data is distributed across the cluster.
In addition, for an existing pool you can adjust the replica count, delete the pool, set quotas, rename it, and view its status; see the example commands below.
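A hedged sketch of the corresponding commands, run from the toolbox (the pool name testpool and all values are illustrative; deletion also requires mon_allow_pool_delete to be enabled):
bash-4.4$ ceph osd pool create testpool 128 128                                  # create a pool with pg_num/pgp_num 128
bash-4.4$ ceph osd pool set testpool size 3                                      # adjust the replica count
bash-4.4$ ceph osd pool set-quota testpool max_bytes 10737418240                 # set a 10 GiB quota
bash-4.4$ ceph osd pool rename testpool newpool                                  # rename the pool
bash-4.4$ ceph osd pool ls detail                                                # view pool status information
bash-4.4$ ceph osd pool delete newpool newpool --yes-i-really-really-mean-it     # delete the pool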
7.6 Changing the login password
Log in to the toolbox: kubectl -n rook-ceph exec -it rook-ceph-tools-7857bc9568-q9fjk -- bash
bash-4.4$ echo -n '1qaz@WSX' > /tmp/password.txt
bash-4.4$ ceph dashboard ac-user-set-password admin --force-password -i /tmp/password.txt
Then log in with admin / 1qaz@WSX as the username and password.
8 Installation Error Summary
8.1 ceph-common install error for the Quincy release
Cause: the Aliyun mirror has no el7 packages for the Quincy release; only the el8 repository carries Quincy. Trying the Octopus release fails with the same error.
--> Finished Dependency Resolution
Error: Package: 2:libcephfs2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(GLIBCXX_3.4.21)(64bit)
Error: Package: 2:ceph-common-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(CXXABI_1.3.11)(64bit)
Error: Package: 2:libcephfs2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(GLIBCXX_3.4.20)(64bit)
Error: Package: 2:ceph-common-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(GLIBCXX_3.4.22)(64bit)
...
...        Requires: libstdc++.so.6(GLIBCXX_3.4.20)(64bit)
Error: Package: 2:librgw2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libicuuc.so.60()(64bit)
Error: Package: 2:librgw2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(GLIBCXX_3.4.21)(64bit)
Error: Package: 2:librgw2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libstdc++.so.6(CXXABI_1.3.11)(64bit)
Error: Package: 2:librgw2-17.2.6-0.el8.x86_64 (ceph)
           Requires: libthrift-0.13.0.so()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
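To confirm that this is an el7 vs el8 mismatch rather than a repository problem, you can check which GLIBCXX versions the node's libstdc++ actually provides (a hedged check; the path assumes a standard CentOS 7 x86_64 install):
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX
# CentOS 7 typically stops at GLIBCXX_3.4.19, below the GLIBCXX_3.4.20/21/22 required by the el8 packages above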
9 Troubleshooting
9.1 Ceph cluster reports "daemons have recently crashed", health: HEALTH_WARN
bash-4.4$ ceph status
  cluster:
    id:     e320aa6c-0057-46ad-b2bf-5c49df8eba5a
    health: HEALTH_WARN
            3 mgr modules have recently crashed
  services:
    mon: 3 daemons, quorum a,b,c (age 23h)
    mgr: b(active, since 23h), standbys: a
    osd: 4 osds: 4 up (since 23h), 4 in (since 23h)
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   45 MiB used, 200 GiB / 200 GiB avail
    pgs:     1 active+clean
# view the detailed health information
bash-4.4$ ceph health detail
HEALTH_WARN 3 mgr modules have recently crashed
[WRN] RECENT_MGR_MODULE_CRASH: 3 mgr modules have recently crashed
Ceph's crash module collects information about daemon crash dumps and stores it in the Ceph cluster for later analysis.
List the recorded crashes:
bash-4.4$ ceph crash ls
ID ENTITY NEW
2023-06-14T13:56:38.064890Z_75a59d8c-9c99-47af-8cef-e632d8f0a010 mgr.b *
2023-06-14T13:56:53.252095Z_bc44e5d3-67e5-4c22-a872-e9c7f9799f55 mgr.b *
2023-06-14T13:57:38.564803Z_1f132169-793b-4ac6-a3c7-af48c91f5365 mgr.b *
# Entries marked with * are new (not yet archived). The output above points at mgr/osd issues, so next check the OSDs and mgr to see whether the warning is only caused by crash reports that were never archived.
bash-4.4$ ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.19519  root default
-5         0.04880      host k8s-master
 0    ssd  0.04880          osd.0            up   1.00000  1.00000
-3         0.04880      host k8s-node1
 1    ssd  0.04880          osd.1            up   1.00000  1.00000
-9         0.04880      host k8s-node2
 3    ssd  0.04880          osd.3            up   1.00000  1.00000
-7         0.04880      host k8s-node3
 2    ssd  0.04880          osd.2            up   1.00000  1.00000
The commands above show that the cluster itself is healthy, so the warning is a false alarm caused by crash reports that were never archived. Archive them:
# Method 1: suitable when only one or two crashes are unarchived
#ceph crash ls
#ceph crash archive <id>
# Method 2: suitable when many crashes need archiving; here we simply run the command below
#ceph crash archive-all
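After archiving, a quick check that the warning has cleared (ceph crash ls-new lists only crashes that have not been archived yet):
bash-4.4$ ceph crash ls-new    # should return no entries once everything is archived
bash-4.4$ ceph status          # health should return to HEALTH_OK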
9.2 OSD down
The OSD has simply gone down; check the logs.
Running ceph osd tree shows that osd.3 is down; it lives on k8s-node2.
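A hedged sketch of follow-up commands for locating and inspecting the failed OSD (the label and deployment name follow Rook's usual naming, matching the rook-ceph-osd-3 Deployment listed in 6.3):
# Find the OSD pods and the node each one runs on
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
# Inspect the logs of the failed OSD
kubectl -n rook-ceph logs deploy/rook-ceph-osd-3
# From the toolbox, confirm the OSD state after any repair
bash-4.4$ ceph osd tree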