July 14, 2017, by acheng

A Docker upgrade that turned into a nasty pitfall

Yesterday a Jenkins slave that runs automated Selenium tests suddenly started failing, reporting that the docker daemon could not start.

Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service failed.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Unit docker.service entered failed state.
Jul 13 06:34:12 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
Jul 13 06:34:12 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 06:34:12 ecsa00400332 dockerd-current[2902]: time="2017-07-13T06:34:12.444088520+03:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devicemapper: Non existing device docker-thinpool"

There is also a slightly earlier log entry:

Jul 12 10:55:34 ecsa00400332 docker-storage-setup: ERROR: Docker has been previously configured for use with devicemapper graph driver. Not creating a new thin pool as existing docker metadata will fail to work with it. Manual cleanup is required before this will succeed.

So something is wrong with docker's storage.
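On a noisy host the systemd lines tend to bury the one message that matters. A minimal sketch for pulling out only dockerd's fatal messages; the function name is made up for this example, and it simply filters whatever log text is piped into it:

```shell
# Print only the lines that carry a dockerd fatal message.
# Reads log text on stdin; docker_fatal_lines is a name invented for this sketch.
docker_fatal_lines() {
    grep 'level=fatal'
}

# Intended use on the affected host (needs permission to read the journal):
#   journalctl -u docker.service --no-pager | docker_fatal_lines
```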

I checked the logical volumes, and sure enough docker-thinpool shows up as inactive:

local@ecsa00400332:~ $ sudo lvscan
  inactive          '/dev/docker/thinpool' [47.50 GiB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [37.76 GiB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [2.00 GiB] inherit
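With more volumes than this, scanning the list by eye gets tedious. A small sketch that prints only the inactive volumes, assuming the lvscan output format shown above (state in the first column, quoted device path in the second); the function name is made up:

```shell
# List the logical volumes that lvscan reports as inactive.
# Reads lvscan output on stdin; inactive_lvs is a name invented for this sketch.
inactive_lvs() {
    # Expected line shape:  inactive  '/dev/docker/thinpool' [47.50 GiB] inherit
    # \047 is the single-quote character, stripped from the device path.
    awk '$1 == "inactive" { gsub(/\047/, "", $2); print $2 }'
}

# Intended use:
#   sudo lvscan | inactive_lvs
```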

Since it is a storage problem, I re-ran docker-storage-setup:

local@ecsa00400332:~ $ sudo docker-storage-setup 
ERROR: There is not enough free space in volume group VolGroup00 to create data volume of size MIN_DATA_SIZE=2G.

Not enough space?

local@ecsa00400332:~ $ sudo vgdisplay
  --- Volume group ---
  VG Name               docker
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  6
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               50.00 GiB
  PE Size               4.00 MiB
  Total PE              12799
  Alloc PE / Size       12413 / 48.49 GiB
  Free  PE / Size       386 / 1.51 GiB
  VG UUID               KuWsmb-5quD-2HKL-Y1G1-90uo-ojdw-LOqUcd

  --- Volume group ---
  VG Name               VolGroup00
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               39.80 GiB
  PE Size               4.00 MiB
  Total PE              10189
  Alloc PE / Size       10179 / 39.76 GiB
  Free  PE / Size       10 / 40.00 MiB
  VG UUID               2qxBah-Q4ui-hAKh-GIb2-esYQ-AQuH-Xjk8RX

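VolGroup00 really does have only 40.00 MiB free, but there is also a 50 GiB volume group named docker, which is presumably dedicated to docker.
So why is it trying to use VolGroup00? (Someone else set this machine up, and I took it over only recently.)
Never mind; I will just configure it to use the docker VG.
Edit /etc/sysconfig/docker-storage-setup (CentOS 7)
and add: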

VG=docker

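Note: this file overrides the settings in /lib/docker-storage-setup/docker-storage-setup.
I re-ran docker-storage-setup, and it still complained about insufficient space, presumably because the existing thinpool volume occupies 48.49 GiB of the docker VG and leaves only 1.51 GiB free.
But at least this time it was looking at the docker VG.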
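The repeated error is consistent with the vgdisplay figures. A quick arithmetic sketch, under the assumption (not confirmed from the script's source) that docker-storage-setup compares the VG's free space with MIN_DATA_SIZE; the function names are made up, and 2G is taken as 2048 MiB:

```shell
# Convert a free-extent count and an extent size (in MiB) into free MiB.
free_mib() {
    echo $(( $1 * $2 ))
}

# Succeeds when the free space covers the requirement.
# $1 = free extents, $2 = PE size in MiB, $3 = required MiB
has_room() {
    [ "$(free_mib "$1" "$2")" -ge "$3" ]
}

# Figures from the vgdisplay output above: docker VG, 386 free extents of 4 MiB.
if has_room 386 4 2048; then
    echo "enough free space"
else
    echo "not enough free space: $(free_mib 386 4) MiB free, 2048 MiB needed"
fi
# prints: not enough free space: 1544 MiB free, 2048 MiB needed
```

1544 MiB is the 1.51 GiB that vgdisplay reports, so freeing the old thin pool is what makes room for a new one.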
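Since this machine only runs automated web tests and holds nothing but temporary data, I decided to delete every logical volume (LV) in the docker VG and try again.
(To be safe, I first took a snapshot of this cloud server, so that I could roll back to the original state if anything went wrong.)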

[local@ecsa00400332 ~]$ sudo lvremove docker/thinpool
  Logical volume "thinpool" successfully removed

Try again:

[local@ecsa00400332 ~]$ sudo docker-storage-setup
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 52.00 MiB
  Logical volume "docker-pool" created.
  Logical volume docker/docker-pool changed.

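OK, the logical volume has been created, so I tried starting docker again.
It still failed. More googling suggested that the old /var/lib/docker directory has to be removed.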

Deleting /var/lib/docker destroys all existing docker images, containers, and registry data.
So instead I renamed /var/lib/docker and re-ran docker-storage-setup.
I tried starting docker again, and it still would not start.
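Renaming rather than deleting keeps a way back. A minimal sketch of that step, with a timestamp suffix so repeated attempts do not collide; the function name and the suffix format are made up for this example:

```shell
# Move a directory aside under a timestamped name instead of deleting it.
# Prints the new path on success; move_aside is a name invented for this sketch.
move_aside() {
    dir=${1%/}                                   # drop a trailing slash, if any
    backup="$dir.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}

# Intended use, as root and with the docker daemon stopped:
#   move_aside /var/lib/docker
```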

journalctl showed the following:

Jul 13 10:21:24 ecsa00400332 dockerd-current[16612]: time="2017-07-13T10:21:24+03:00" level=fatal msg="unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: storage-driver: (from flag: devicemapper, from file: devicemapper), storage-opts: (from flag: [dm.fs=xfs dm.thinpooldev=/dev/mapper/docker-docker--pool dm.use_deferred_removal=true], from file: [dm.thinpooldev=/dev/mapper/docker-thinpool dm.use_deferred_removal=true])\n"
Jul 13 10:21:24 ecsa00400332 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Jul 13 10:21:24 ecsa00400332 systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed

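So the daemon refuses to start because the same options are given in two places with different values. I commented out the contents of /etc/sysconfig/docker-storage.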

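I tried again, and it still failed. Looking at the log once more, I finally noticed that the configured thinpooldev device name differs from the name of the logical volume that docker-storage-setup generated.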
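That comparison can be done directly against the two files instead of reading it out of the log. A rough sketch, assuming both files contain the option as a literal dm.thinpooldev=... token (as the flag and file values in the log suggest); the function name is made up:

```shell
# Print the dm.thinpooldev value found in a config file, if any.
# Only looks for the "dm.thinpooldev=..." token, so the same function can read
# /etc/docker/daemon.json and /etc/sysconfig/docker-storage.
# thinpooldev_of is a name invented for this sketch.
thinpooldev_of() {
    grep -o 'dm\.thinpooldev=[^" ]*' "$1" | head -n 1 | cut -d= -f2
}

# Intended use:
#   a=$(thinpooldev_of /etc/docker/daemon.json)
#   b=$(thinpooldev_of /etc/sysconfig/docker-storage)
#   [ "$a" = "$b" ] || echo "thin pool mismatch: $a vs $b"
```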

The command created docker-docker--pool, whereas /etc/docker/daemon.json points to /dev/mapper/docker-thinpool.
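The doubled hyphen in the new name is not a typo. Device-mapper joins the VG and LV names with a single hyphen and doubles any hyphen that is part of either name, so the LV docker-pool in the VG docker shows up with two hyphens in the middle. A small sketch of that naming rule; the function name is made up:

```shell
# Build the /dev/mapper name for an LVM volume: hyphens inside the VG and LV
# names are doubled, and the two parts are joined with a single hyphen.
# dm_name is a name invented for this sketch.
dm_name() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '%s-%s\n' "$vg" "$lv"
}

dm_name docker thinpool      # prints docker-thinpool
dm_name docker docker-pool   # prints docker-docker--pool
```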

After I updated the name, docker finally started.
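The edit itself is a one-line substitution. A cautious sketch that writes the changed file to stdout rather than editing in place, so the result can be inspected first; the function name is made up, and the arguments are assumed to contain no '#' or regular-expression metacharacters:

```shell
# Print FILE with every occurrence of the OLD device path replaced by NEW.
# The original file is left untouched; retarget_thinpool is a name invented
# for this sketch.
# $1 = file, $2 = old device path, $3 = new device path
retarget_thinpool() {
    sed "s#$2#$3#g" "$1"
}

# Intended use:
#   retarget_thinpool /etc/docker/daemon.json \
#       /dev/mapper/docker-thinpool /dev/mapper/docker-docker--pool
```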

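After more googling, it appears the trigger was a recent upgrade of the docker package. The new version reportedly changed docker's storage handling considerably, so what had been created before was no longer usable.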

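I do not know whether that is true. If it is, that is a serious trap: a change this large should not be something people can pick up through a casual upgrade. The facts still need further verification.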
