【Case Sharing】 Negligent Operation Leads to Ceph Cluster Unavailability

Fault Description

A Ceph cluster consists of five physical servers. One of them had to be shut down so that faulty memory could be replaced. After the replacement was completed, the cluster state of that node was found to be abnormal.


Failure Analysis

- All five physical servers in the cluster are mon nodes, so a problem on one of them will not cause the whole cluster to stop service for the time being.

- The previous day's change operation only replaced the memory and did not change any other components.
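The claim that one failed mon node does not stop the cluster follows from majority quorum: Ceph monitors keep working as long as a strict majority of them can talk to each other. The sketch below only illustrates that arithmetic. The function names are made up for this example and it says nothing about OSD or data availability on the failed node.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest strict majority of the monitors."""
    if monitor_count < 1:
        raise ValueError("monitor_count must be at least 1")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """How many monitors can be down while a majority is still up."""
    return monitor_count - quorum_size(monitor_count)


if __name__ == "__main__":
    mons = 5  # the cluster in this case has five mon nodes
    print("quorum needs:", quorum_size(mons))
    print("can lose:", tolerated_failures(mons))
```

With five monitors the quorum is three, so losing a single mon node leaves the monitor quorum intact.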

Log Analysis

Check the operating system logs and the Ceph logs on the failed node.



ceph logs:



1. Restart the service

Tried restarting the Ceph service, as suggested in the logs above.


2. Check the Ceph configuration

Checked the Ceph configuration files and the authentication keyring. The configuration was consistent and had not been changed.
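One way to make this kind of configuration check repeatable is to compare file checksums against digests recorded from a known-good node. The following is a minimal sketch under that assumption, not the procedure used in this case. The helper names are hypothetical, and the caller supplies the file paths (for example `/etc/ceph/ceph.conf` and the keyring file used in their deployment).

```python
import hashlib
from pathlib import Path


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_mismatches(local_paths, reference_digests):
    """Return the paths whose digest differs from, or is absent in, the reference."""
    mismatches = []
    for path in local_paths:
        expected = reference_digests.get(path)
        if expected is None or file_digest(path) != expected:
            mismatches.append(path)
    return mismatches
```

An empty result means every listed file matches its reference digest. Any returned path is a file worth inspecting by hand.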

3. Check the network configuration

A network test showed that the node could not reach the other nodes on either the Ceph public network or the cluster network. Checking the network card then showed that the network cable was not connected.
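The two checks in this step (can the node reach its peers, and does the network card see a link) can be scripted. The sketch below is an illustration under stated assumptions rather than the commands used in this case: the function names are hypothetical, the host runs Linux with sysfs mounted at `/sys`, and the peer addresses and interface names must come from the reader's own environment.

```python
import socket
from pathlib import Path


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


def link_has_carrier(interface: str, sysfs_root: str = "/sys/class/net"):
    """Read the Linux sysfs carrier flag for an interface.

    Returns True if the flag is "1", False if it is "0", and None if the
    file cannot be read (unknown interface, or interface administratively down).
    """
    try:
        value = (Path(sysfs_root) / interface / "carrier").read_text().strip()
    except OSError:
        return None
    return value == "1"


# Example use (addresses and interface name are placeholders to fill in):
#   tcp_reachable("<peer-mon-address>", 6789)   # 6789 is the default Ceph mon port
#   link_has_carrier("<interface-name>")
```

A carrier value of `False` on the interface that carries the public or cluster network points to a cabling or switch-port problem, which matches what was found here.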


Summary of Experience

It was finally determined that the failure was caused by the network cable being unplugged during the change operation and not being reconnected correctly after the change was completed.

The lesson from this failure is that operations and maintenance personnel must follow standardized procedures in routine work: make a good record before the operation, and after the operation confirm that every related step has been completed. Testing after a change is completed should also be rigorous and comprehensive.
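The "record before, verify after" advice can be partly automated. As a minimal sketch, assuming a Linux host with sysfs and using hypothetical helper names, the code below records the link state of every network interface before a change, saves it to a file, and reports what differs afterwards. It covers only link state. A real checklist would also include service status and cluster health.

```python
import json
from pathlib import Path


def snapshot_link_states(sysfs_root: str = "/sys/class/net") -> dict:
    """Map each interface name to its operstate string ("up", "down", ...)."""
    states = {}
    for entry in sorted(Path(sysfs_root).iterdir()):
        try:
            states[entry.name] = (entry / "operstate").read_text().strip()
        except OSError:
            # Entry is not an interface directory or cannot be read.
            states[entry.name] = "unknown"
    return states


def save_snapshot(path: str, snapshot: dict) -> None:
    """Persist a snapshot so it survives the shutdown during the change."""
    Path(path).write_text(json.dumps(snapshot, indent=2, sort_keys=True))


def load_snapshot(path: str) -> dict:
    """Load a snapshot previously written by save_snapshot."""
    return json.loads(Path(path).read_text())


def diff_snapshots(before: dict, after: dict) -> dict:
    """Return {name: (before_value, after_value)} for every entry that changed."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            changed[name] = (before.get(name), after.get(name))
    return changed
```

Taking one snapshot before the server is shut down and another after it is back up, then running `diff_snapshots` on the pair, would flag an interface that went from `up` to `down`, which is the kind of leftover state that caused this failure.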

