GlusterFS Cluster Failure Recovery

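This walkthrough of GlusterFS cluster failure recovery uses the simplest setup: a replicated volume across two hosts. Installation goes as follows: first configure the hosts entries and hostnames, then install the gluster service.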

OS: CentOS 7.4
cat >> /etc/hosts <<EOF
192.168.0.198 master
192.168.0.199 slave
EOF
# run on each host with its own name: master on 192.168.0.198, slave on 192.168.0.199
hostnamectl --static set-hostname master
exec $SHELL
yum install centos-release-gluster -y
yum install -y glusterfs glusterfs-server glusterfs-fuse glusterfs-rdma glusterfs-geo-replication glusterfs-devel
systemctl start glusterd && systemctl enable glusterd && systemctl status glusterd
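
Before probing peers it can help to confirm that every node name has a hosts entry. The following is a minimal sketch, not part of the original procedure; the function name `missing_hosts` is hypothetical.

```shell
# Print every given hostname that has no entry in the hosts file.
# Usage (hypothetical helper): missing_hosts /etc/hosts master slave
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for the name in any column after the IP.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

An empty result means all names are present. Any name printed still needs a line in /etc/hosts on that machine.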

Add the other node, then create and start the volume

gluster peer probe slave
mkdir /data
gluster peer status
gluster volume create gfs-volume replica 2 transport tcp master:/data slave:/data force
ls -a /data/
gluster volume info
gluster volume start gfs-volume
gluster volume info gfs-volume
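
In a purely replicated volume like this one, the replica count equals the number of bricks. The sketch below only prints the create command for a list of hosts so it can be reviewed before running. The helper names `brick_list` and `create_cmd` are hypothetical, and the same brick directory is assumed on every host.

```shell
# Print the brick arguments (HOST:DIR) for a list of hosts.
brick_list() {
    brick_dir=$1
    shift
    bricks=""
    for host in "$@"; do
        bricks="${bricks:+$bricks }$host:$brick_dir"
    done
    echo "$bricks"
}

# Print (do not run) the create command; replica count = number of hosts.
create_cmd() {
    volume=$1
    dir=$2
    shift 2
    echo "gluster volume create $volume replica $# transport tcp $(brick_list "$dir" "$@") force"
}

# Example: create_cmd gfs-volume /data master slave
```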

Mount gfs on other hosts

yum install centos-release-gluster -y
yum install -y glusterfs-fuse

cat /etc/hosts
192.168.0.198 master
192.168.0.199 slave

cat >> /etc/fstab <<EOF
master:/gfs-volume /opt glusterfs defaults,_netdev 0 0
EOF
mount -a
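
Because /etc/fstab already holds the system's own mounts, the entry should be appended rather than written over the file, and only once. A minimal sketch, assuming a hypothetical helper named `ensure_line`:

```shell
# Append LINE to FILE unless an identical line is already there.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   ensure_line /etc/fstab 'master:/gfs-volume /opt glusterfs defaults,_netdev 0 0'
#   mount -a
```

Running it a second time leaves the file unchanged, so the step is safe to repeat.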

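Suppose the slave host dies and cannot be started again. There are two ways to recover: 1. build a new host that keeps the failed host's hostname and IP; 2. build a new host and use it to replace the failed one.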

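Option 1: if 192.168.0.199 is still going to be the slave node, first install the glusterfs service on the new host but do not start it. Since slave is down, the first step is to find the gfs UUID of the slave host. Look it up on a healthy node, here the master node.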

# cat /var/lib/glusterd/peers/*
uuid=090f4559-e1a4-43ed-8a3d-2edd4042ce50
state=3
hostname1=slave
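
With more than one peer, the peers directory holds one file per peer, so the UUID has to be picked by hostname. A minimal sketch, assuming the file layout shown above (`uuid=`, `state=`, `hostname1=` lines); the function name `peer_uuid` is hypothetical.

```shell
# Print the uuid of the peer whose hostnameN entry equals HOST.
# Usage: peer_uuid /var/lib/glusterd/peers slave
peer_uuid() {
    peers_dir=$1
    host=$2
    for f in "$peers_dir"/*; do
        [ -f "$f" ] || continue
        # Match hostname1=, hostname2=, ... exactly against HOST.
        if awk -F= -v h="$host" '
            $1 ~ /^hostname[0-9]+$/ && $2 == h { found = 1 }
            END { exit !found }
        ' "$f"; then
            sed -n 's/^uuid=//p' "$f"
            return 0
        fi
    done
    return 1
}
```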

On the rebuilt slave host, edit the gfs configuration so that it carries this UUID, then start the gfs service

# cat /var/lib/glusterd/glusterd.info 
UUID=090f4559-e1a4-43ed-8a3d-2edd4042ce50
operating-version=1
# systemctl start glusterd
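
The edit to glusterd.info can be scripted. This is a sketch under assumptions: the helper name `set_glusterd_uuid` is hypothetical, and when the file does not exist yet it writes the `operating-version=1` line shown above.

```shell
# Write UUID into INFO_FILE, keeping any other existing lines.
# Usage (as root, glusterd stopped):
#   set_glusterd_uuid /var/lib/glusterd/glusterd.info 090f4559-e1a4-43ed-8a3d-2edd4042ce50
set_glusterd_uuid() {
    info_file=$1
    uuid=$2
    tmp_file=$(mktemp) || return 1
    printf 'UUID=%s\n' "$uuid" > "$tmp_file"
    if [ -f "$info_file" ]; then
        # Keep every existing line except the old UUID.
        grep -v '^UUID=' "$info_file" >> "$tmp_file" || true
    else
        # No file yet: fall back to the value shown in the article.
        printf 'operating-version=1\n' >> "$tmp_file"
    fi
    mv "$tmp_file" "$info_file"
}
```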

Join the cluster and check the status. If it is not OK, restart glusterd a few more times.

gluster peer status
gluster peer probe master
gluster peer status
systemctl restart glusterd
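
The "restart until it is OK" step can be written as a bounded retry loop. A minimal sketch: it counts peers that `gluster peer status` reports as `Peer in Cluster (Connected)`. The function names, the retry count, and the 5 second pause are assumptions.

```shell
# Count connected peers in `gluster peer status` output read from stdin.
count_connected_peers() {
    grep -c '^State: Peer in Cluster (Connected)' || true
}

# Restart glusterd until at least EXPECTED peers are connected, or give up.
# Usage: wait_for_peers 1
wait_for_peers() {
    expected=$1
    tries=${2:-5}
    while [ "$tries" -gt 0 ]; do
        n=$(gluster peer status | count_connected_peers)
        [ "$n" -ge "$expected" ] && return 0
        systemctl restart glusterd
        sleep 5
        tries=$((tries - 1))
    done
    return 1
}
```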

Sync the data

gluster volume info
gluster volume sync master all
cat /var/lib/glusterd/glusterd.info 
Check that the operating-version value has changed

Check the status

gluster volume heal gfs-volume info
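
Healing is finished once every brick reports `Number of entries: 0`. The sketch below sums those per-brick counts from the heal info output. The function name `pending_heal_entries` is hypothetical, and bricks that report a non-numeric count (for example a disconnected brick) are ignored.

```shell
# Sum the "Number of entries" values from heal info output on stdin.
# Usage: gluster volume heal gfs-volume info | pending_heal_entries
pending_heal_entries() {
    awk -F': ' '
        /^Number of entries:/ && $2 ~ /^[0-9]+/ { total += $2 }
        END { print total + 0 }
    '
}
```

A result of 0 with all bricks connected suggests the replicas are back in sync.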

Option 2: replace the failed host

Create a host, for example with hostname three and IP 192.168.0.200. On a healthy gfs host, add the new host and use it to replace the failed one

gluster peer probe three
gluster volume replace-brick gfs-volume slave:/data three:/data commit force

Check the status

gluster volume heal gfs-volume info

Because the gfs volume is in replica mode, the recovery ends here. For a Distributed volume, a rebalance is needed as well

gluster volume rebalance gfs-volume fix-layout start

Appendix: removing a gfs node (remove a brick)

gluster volume remove-brick gfs-volume replica 1 slave:/data start

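This removes the copy of the data on the slave host and keeps a single copy, on master. Some GlusterFS releases reject start when the replica count is being reduced, since no data migration is needed, and ask for force instead. Check the status of the removal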

gluster volume remove-brick gfs-volume replica 1 slave:/data status

Add a replica

gluster volume add-brick gfs-volume replica 2 slave:/data

References

Managing GlusterFS Volumes
GlusterFS Architecture

Published on 2018-08-06 at linux工匠