【Case Sharing】 Handling a Ceph PG Stuck in the activating State

Fault Description

After power-system maintenance in a customer's server room was completed, the physical servers were powered back on and returned to a normal state. When the virtual machines in OpenStack were started, one virtual machine could not be started. Another virtual machine started, but it could not write to its file system: its I/O utilization soared, yet no actual read or write throughput was observed.

图片1.jpg

图片2.jpg

Failure analysis

1. According to the customer, before the planned power-down all OpenStack virtual machines were shut down and the OpenStack services were stopped, and only then were the Ceph and OpenStack physical servers shut down normally.

2. After the servers were powered back on, Ceph was started first, followed by OpenStack.

3. After the OpenStack services were started, all VMs were started manually.

4. No abnormal conditions were noticed at that point. Only after feedback from the application staff was it found that one VM had failed to start and that file system reads and writes on another VM were abnormal.

Logging in to the OpenStack and Ceph management platforms to check the status showed the following state for one VM in OpenStack:

图片3.jpg

The ceph status is as follows:

图片4.jpg

Attempting to reboot the virtual machine in an abnormal state reports the following error:

图片5.jpg

Troubleshooting

The Ceph status output shows that osd.7 reports slow ops and that one PG is in the activating state.

1. Determine the OSD status

图片6.jpg

The above command shows that osd.7 resides on the ceph03 node.
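The command in the screenshot is not reproduced here. As a minimal sketch, assuming the location is looked up with `ceph osd find 7` (which prints JSON that includes the OSD's CRUSH location), the host can be extracted as follows. The sample JSON is illustrative and trimmed, not output captured from the customer's cluster, and `osd_host` is a hypothetical helper name.

```python
import json

# Illustrative sample shaped like the JSON printed by `ceph osd find 7`.
# It is NOT captured from the customer's cluster; fields are trimmed.
SAMPLE_OSD_FIND = '{"osd": 7, "crush_location": {"host": "ceph03", "root": "default"}}'


def osd_host(osd_find_json: str) -> str:
    """Return the CRUSH host bucket recorded for an OSD (hypothetical helper)."""
    info = json.loads(osd_find_json)
    return info["crush_location"]["host"]


print(osd_host(SAMPLE_OSD_FIND))
```

On a live cluster the same function could be fed the output of `ceph osd find 7`; `ceph osd tree` shows the same OSD-to-host mapping in a human-readable form.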

2. Determine the pg status

图片7.jpg

The above commands show that pg 7.1d was already in a stuck state at the time of the shutdown the previous night.

In Ceph, the activating state means that the PG has completed peering but has not yet become active.
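Because "activating" and "active" look alike, a simple substring check would misclassify the stuck PG as healthy. The following sketch shows one way to tell them apart by splitting the state string on `+`, which is how Ceph joins PG state flags. The PG IDs other than 7.1d and the helper names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def pg_is_active(state: str) -> bool:
    """True only if 'active' is one of the '+'-separated state flags.

    A substring test would wrongly accept 'activating'.
    """
    return "active" in state.split("+")


def inactive_pgs(pg_states: Dict[str, str]) -> List[str]:
    """Return the sorted IDs of PGs whose state lacks the 'active' flag."""
    return sorted(pgid for pgid, state in pg_states.items() if not pg_is_active(state))


# Illustrative data: only 7.1d comes from the case; the neighbours are made up.
SAMPLE_PG_STATES = {
    "7.1c": "active+clean",
    "7.1d": "activating",
    "7.1e": "active+clean",
}

print(inactive_pgs(SAMPLE_PG_STATES))
```

A PG that is not active cannot serve client I/O, which is consistent with the VM symptoms described above: requests are issued but never complete.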

3. Check the ceph logs

Check the OSD log on the ceph03 node, /var/log/ceph/ceph-osd.7.log, which contains the following:

图片8.jpg

Resolution

1. Restart the mon service

Restarting the ceph mon service had no effect.

图片9.jpg

2. Repair the PG

An attempt to repair the PG had no effect.

图片10.jpg

3. Restart the OSD service

Restarting the OSD service resolved the problem.

图片11.jpg

After the Ceph issue was resolved, the VM status in OpenStack returned to normal.
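The exact commands used are only visible in the screenshots. As a rough sketch, the three attempts could correspond to the commands below, assuming a package-based deployment managed by systemd (units `ceph-mon@<id>` and `ceph-osd@<id>`) and assuming the repair was issued with `ceph pg repair`. The monitor ID `ceph03` is an assumption, since the article does not say which monitor was restarted, and `remediation_steps` is a hypothetical helper that only builds the command lines without running them.

```python
from typing import List


def remediation_steps(mon_id: str, pgid: str, osd_id: int) -> List[List[str]]:
    """Build the escalation sequence described above, least disruptive first.

    Nothing is executed here; the caller decides whether to run each step.
    """
    return [
        ["systemctl", "restart", f"ceph-mon@{mon_id}"],  # attempt 1: had no effect
        ["ceph", "pg", "repair", pgid],                  # attempt 2: had no effect
        ["systemctl", "restart", f"ceph-osd@{osd_id}"],  # attempt 3: resolved it
    ]


# mon_id is assumed; pgid and osd_id come from the case.
for step in remediation_steps("ceph03", "7.1d", 7):
    print(" ".join(step))
```

Containerized deployments use different unit names, so the unit strings would need adjusting there.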

Lessons learned

1. When a Ceph change requires a shutdown, stop all applications first and only then shut down Ceph.

2. After power is restored, first confirm that the Ceph status is normal, and only then start the applications.

3. For day-to-day operation and maintenance of Ceph, set up thorough monitoring and establish a performance baseline, so that when problems appear there is something meaningful to compare against.
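The second lesson can be turned into an automated gate. The sketch below assumes the cluster state is read from `ceph status --format json`, whose output includes `health.status` and `pgmap.pgs_by_state`. It reports every reason why applications should not be started yet. The sample numbers are invented for illustration, and `blocking_reasons` is a hypothetical helper name.

```python
import json
from typing import List


def blocking_reasons(status_json: str) -> List[str]:
    """List the reasons why applications should not be started yet.

    An empty list means health is HEALTH_OK and every PG is active and clean.
    """
    status = json.loads(status_json)
    reasons = []
    health = status["health"]["status"]
    if health != "HEALTH_OK":
        reasons.append(f"health is {health}")
    for entry in status["pgmap"]["pgs_by_state"]:
        flags = set(entry["state_name"].split("+"))
        # Extra flags such as 'scrubbing' are tolerated; missing 'active' or 'clean' is not.
        if not {"active", "clean"} <= flags:
            reasons.append(f'{entry["count"]} pg(s) in state {entry["state_name"]}')
    return reasons


# Invented sample resembling the situation in this case (trimmed to the fields used).
SAMPLE_STATUS = json.dumps({
    "health": {"status": "HEALTH_WARN"},
    "pgmap": {"pgs_by_state": [
        {"state_name": "active+clean", "count": 255},
        {"state_name": "activating", "count": 1},
    ]},
})

print(blocking_reasons(SAMPLE_STATUS))
```

Had a check like this been run before the OpenStack services were started, the single activating PG would have been flagged before any VM was booted.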

For more information, please visit Antute's official website: www.antute.com.cn
