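A Nasty Pitfall Triggered by a Docker Upgrade
Yesterday a Jenkins slave that runs automated Selenium tests suddenly started failing, reporting that the docker daemon could not start.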
Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service failed.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Unit docker.service entered failed state.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 06:34:12 ecsa00400332 dockerd-current[2902]: time="2017-07-13T06:34:12.444088520+03:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devicemapper: Non existing device docker-thinpool"
There was also a slightly earlier log entry:
Jul 12 10:55:34 ecsa00400332 docker-storage-setup: ERROR: Docker has been previously configured for use with devicemapper graph driver. Not creating a new thin pool as existing docker metadata will fail to work with it. Manual cleanup is required before this will succeed.
So something was wrong with docker's storage.
I looked at the logical volumes, and sure enough docker-thinpool was shown as inactive:
local@ecsa00400332:~ $ sudo lvscan
inactive '/dev/docker/thinpool' [47.50 GiB] inherit
ACTIVE '/dev/VolGroup00/LogVol00' [37.76 GiB] inherit
ACTIVE '/dev/VolGroup00/LogVol01' [2.00 GiB] inherit
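On a host with many logical volumes, the inactive ones are easy to miss in the lvscan listing. A minimal sketch of a filter, assuming the output layout shown above (state in the first column, quoted LV path in the second); `inactive_lvs` is a hypothetical helper name, not part of LVM:

```shell
# Hypothetical helper: read `lvscan` output on stdin and print the path of
# every logical volume whose state column is "inactive".
inactive_lvs() {
    # Column 1 is the state, column 2 is the LV path wrapped in single quotes.
    awk '$1 == "inactive" { print $2 }' | tr -d "'"
}
```

It would be used as `sudo lvscan | inactive_lvs`, which for the listing above should print `/dev/docker/thinpool`.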
Since it was a storage problem, I reran docker-storage-setup:
local@ecsa00400332:~ $ sudo docker-storage-setup
ERROR: There is not enough free space in volume group VolGroup00 to create data volume of size MIN_DATA_SIZE=2G.
Not enough space?
local@ecsa00400332:~ $ sudo vgdisplay
--- Volume group ---
VG Name docker
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 6
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 50.00 GiB
PE Size 4.00 MiB
Total PE 12799
Alloc PE / Size 12413 / 48.49 GiB
Free PE / Size 386 / 1.51 GiB
VG UUID KuWsmb-5quD-2HKL-Y1G1-90uo-ojdw-LOqUcd
--- Volume group ---
VG Name VolGroup00
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 3
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 1
Act PV 1
VG Size 39.80 GiB
PE Size 4.00 MiB
Total PE 10189
Alloc PE / Size 10179 / 39.76 GiB
Free PE / Size 10 / 40.00 MiB
VG UUID 2qxBah-Q4ui-hAKh-GIb2-esYQ-AQuH-Xjk8RX
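There is space in total, and there is also a volume group named docker (50 GiB, although only 1.51 GiB of it is free), which is presumably dedicated to docker.
So why was it still trying to use the VolGroup00 VG? (Someone else set this machine up, and I took it over only recently.)
Never mind. I configured it to use the docker VG instead.
Edit /etc/sysconfig/docker-storage-setup (CentOS 7)
and add: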
VG=docker
Note: this file is used to override the settings in /lib/docker-storage-setup/docker-storage-setup.
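I reran docker-storage-setup, and it still complained that there was not enough space.
But at least this time it was looking at the docker VG.
Since this machine only runs automated web tests and holds nothing but temporary data, I decided to delete
the logical volume (LV) under the docker VG and try again.
(To be safe, I first took a snapshot of this cloud server, so that I could roll back to the original state if anything went wrong.)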
[local@ecsa00400332 ~]$ sudo lvremove docker/thinpool
Logical volume "thinpool" successfully removed
Try again:
[local@ecsa00400332 ~]$ sudo docker-storage-setup
Using default stripesize 64.00 KiB.
Rounding up size to full physical extent 52.00 MiB
Logical volume "docker-pool" created.
Logical volume docker/docker-pool changed.
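OK, the logical volume was created, so I tried to start docker again.
Still no luck. More googling suggested that the old /var/lib/docker directory had to be deleted.
Deleting /var/lib/docker loses all existing docker images, containers, and registry data.
So I renamed /var/lib/docker instead and ran docker-storage-setup once more.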
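Renaming rather than deleting keeps the old images and containers recoverable. A small sketch of that step with a timestamped name, so that repeated attempts do not overwrite each other; `set_aside` is a hypothetical helper and the `.bak.<timestamp>` suffix is an assumption, not the name used in this post:

```shell
# Hypothetical helper: move a directory out of the way instead of deleting it.
# Prints the new path on success; refuses to clobber an existing target.
set_aside() {
    src=$1
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

Because it is a shell function, it would be called from a root shell as `set_aside /var/lib/docker`, with the docker daemon stopped (which it already was here).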
I tried to start docker again, and it still would not start.
Checking the log with journalctl turned up the following:
Jul 13 10:21:24 ecsa00400332 dockerd-current[16612]: time="2017-07-13T10:21:24+03:00" level=fatal msg="unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: storage-driver: (from flag: devicemapper, from file: devicemapper), storage-opts: (from flag: [dm.fs=xfs dm.thinpooldev=/dev/mapper/docker-docker--pool dm.use_deferred_removal=true], from file: [dm.thinpooldev=/dev/mapper/docker-thinpool dm.use_deferred_removal=true])\n"
Jul 13 10:21:24 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 10:21:24 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
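So the daemon refused to start because the same directives were set in two places, as command-line flags and in /etc/docker/daemon.json. I commented out the contents of /etc/sysconfig/docker-storage.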
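dockerd rejects a directive that is given both as a flag and in daemon.json, even when the two values agree (the log above shows storage-driver as devicemapper on both sides). A rough sketch for spotting such overlaps before restarting the daemon; `duplicated_directives` is a hypothetical helper, it only checks the two directives named in this log, and it uses a plain text match rather than real JSON parsing:

```shell
# Hypothetical helper.
# $1: path to daemon.json   $2: the flag string passed to dockerd
# Prints each directive that shows up in both places.
duplicated_directives() {
    json_file=$1
    flags=$2
    for name in storage-driver storage-opt; do
        # "storage-opt" also matches the "storage-opts" key used in daemon.json.
        if printf '%s\n' "$flags" | grep -q -e "--$name" &&
           grep -q -e "\"$name" "$json_file"; then
            printf '%s\n' "$name"
        fi
    done
}
```

Assuming the flags live in a `DOCKER_STORAGE_OPTIONS` variable in /etc/sysconfig/docker-storage (an assumption about this packaging, not something the post shows), a call could look like `. /etc/sysconfig/docker-storage; duplicated_directives /etc/docker/daemon.json "$DOCKER_STORAGE_OPTIONS"`.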
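I tried again, and it still failed. Another look at the log finally showed the real problem: the thinpooldev device name in the configuration did not match the name of the logical volume that docker-storage-setup had created.
The command created docker-docker--pool, while /etc/docker/daemon.json specified /dev/mapper/docker-thinpool.
After I updated the name, docker finally started.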
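The mismatch follows from how LVM names its device-mapper nodes: the node is `<vg>-<lv>`, with any hyphen inside the VG or LV name doubled, so the new LV docker-pool in the VG docker becomes docker-docker--pool. A minimal sketch that derives the path to put in dm.thinpooldev; `dm_path` is a hypothetical helper name:

```shell
# Hypothetical helper: print the /dev/mapper path LVM uses for a given
# volume group ($1) and logical volume ($2).
dm_path() {
    # Hyphens inside either name are doubled; a single hyphen joins the two.
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '/dev/mapper/%s-%s\n' "$vg" "$lv"
}
```

`dm_path docker docker-pool` prints `/dev/mapper/docker-docker--pool`, while `dm_path docker thinpool` prints the old `/dev/mapper/docker-thinpool` that daemon.json still pointed at.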
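After more googling, I found that the cause of this problem was apparently a recent upgrade of the docker package. The new version made fairly large changes to docker's storage handling, so what had been created before was no longer usable.
I don't know whether this is true. If it is, that is a real trap: a change this big should not be something people can pick up through a casual upgrade. What actually happened still needs further investigation.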