TagLinux

docker: devicemapper fix for “device or resource busy” (EBUSY)

Audience:
This article is intended for folks familiar with docker and looking to fix particular issues encountered when using devicemapper storage/graph driver.

Overview:
While this is issue not exclusive to devicemapper, the mechanics currently involved in this driver cause it to be affected by this.

A couple of the more commons issues seen when using the ‘devicemapper’ storage driver is when trying to stop and/or remove a contianer.
In the docker daemon logs, you may see output like:

[error] deviceset.go:792 Warning: error waiting for device ac05cffda663a01cbc37879bc146fcd68d0f95b5b141f60da2b64579add1f4ef to close: Timeout while waiting for device ac05cffda663a01cbc37879bc146fcd68d0f95b5b141f60da2b64579add1f4ef to close

or more likely:

Cannot destroy container ac05cffda663: Driver devicemapper failed to remove root filesystem ac05cffda663a01cbc37879bc146fcd68d0f95b5b141f60da2b64579add1f4ef: Device is Busy
[8ad069f7] -job rm(ac05cffda663) = ERR (1)
[error] server.go:1207 Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container ac05cffda663: Driver devicemapper failed to remove root filesystem ac05cffda663a01cbc37879bc146fcd68d0f95b5b141f60da2b64579add1f4ef: Device is Busy
[error] server.go:110 HTTP Error: statusCode=500 Cannot destroy container ac05cffda663: Driver devicemapper failed to remove root filesystem ac05cffda663a01cbc37879bc146fcd68d0f95b5b141f60da2b64579add1f4ef: Device is Busy

Diagnosis:
What’s happening behind the scenes is that devicemapper has established a new thin snapshot device to mount then container on.
Sometime during the life of that container another PID on the host, unrelated to docker, has likely started and unshared some namespaces from the root namespace, namely the mount namespace (CLONE_NEWNS). In the mounts referenced in this unshared host PID, it includes the thin snapshot device and its mount for the container runtime.

When the container goes to stop and unmount, while it may unmount the device from the root namespace, that umount does not propogate to the unshared host PID.
When the container is removed, devicemapper attempts to remove the thin snapshot device, but since the unshared host PID includes a reference to the device in its mountinfo, the kernel sees the device as still busy (EBUSY). Despite the fact the mounts of the root mount namespace may no longer show this device and mount.

Reproduction:
1) start the docker daemon (with debugging and forced devicemapper): `sudo docker -d -D -g devicemapper`
2) start a container: `sudo docker run -it busybox top`
3) run a pid with unshared mount namespace:
3.a) compile a simple C application: http://pastebin.com/HfSn8udJ
3.b) use the unshare(1) utility: `sudo unshare -m top`
4) stop or kill the container from step #2
5) watch the docker daemon logs

If you kill/quit the unshared application from #3 while the docker daemon is attempting to remove the container, then the daemon can clean up the container nicely. Otherwise there is cruft left around (in /var/lib/docker) and the daemon will not have record of the container to be removed.

Investigating:
Tooling to visualize this inheritance does not really exist yet, so deriving the culprit can take a little effort.
Sometimes the `perf` tool can quickly point out the culprit. Something like:

perf probe -a clone_mnt
perf record -g -e probe:clone_mnt -aR docker -d
[...]
perf report

Another option, is that while the container is running, get the mountinfo of the container from only the docker daemon’s pid:

> sudo grep $(docker ps -q | head -1) /proc/$(cat /var/run/docker.pid)/mountinfo
173 169 253:5 / /var/lib/docker/devicemapper/mnt/7c62e78ca18f88e152debfb0b40847c1486bcef14d40300154bf0c9e9800d824 rw,relatime - ext4 /dev/mapper/docker-253:2-4980739-7c62e78ca18f88e152debfb0b40847c1486bcef14d40300154bf0c9e9800d824 rw,seclabel,discard,stripe=16,data=ordered

Kill/stop the container (`docker kill ...` or just ‘q’ out of `top`).
Now that the container and device should be unmounted everywhere, lets grep for any PIDs still holding a reference:

sudo grep -l 7c62e78ca1 /proc/*/mountinfo
/proc/5132/mountinfo

We have our culprit. Find out the command:

ps -f 5132
UID PID PPID C STIME TTY STAT TIME CMD
root 5132 5131 0 17:47 pts/6 S+ 0:01 top

This is the unshared PID we started earlier.

Notice also, we can determine for sure they are on different mount namespaces by checking the following:

sudo ls -l /proc/$(cat /var/run/docker.pid)/ns/mnt /proc/5132/ns/mnt
lrwxrwxrwx. 1 root root 0 Nov 4 18:13 /proc/10388/ns/mnt -> mnt:[4026532737]
lrwxrwxrwx. 1 root root 0 Nov 4 18:13 /proc/5132/ns/mnt -> mnt:[4026531840]

Notice they are referencing different numbers here.

Now to cleanup, we can stop/kill this unshared PID, and `docker rm ...` our test container.

Solution:
The solution to this issue is to have the docker docker itself run in an unshared mount namespace. Unfortunately, due to the threading model of the golang runtime, this can not be encapsulated inside the docker daemon itself, but will have to be done for the invocation of the docker daemon.

The two ways to go about this is either with the unshare(1) utility or editing the systemd unit file for the docker.server (where ever this applies).

For systemd unit file (likely /usr/lib/systemd/system/docker.service, but may also be /etc/systemd/system/docker.service), include “MountFlags=private” in the [service] section. e.g.:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
ExecStart=/usr/bin/docker -d -H fd:// -H unix://var/run/docker.sock
MountFlags=private
LimitNOFILE=1048576
LimitNPROC=1048576

[Install]
WantedBy=multi-user.target

For the unshare(1) approach, you’ll have to find and edit the init script for your system. But the layout is as follows (and covers manually calling docker as well):

sudo unshare -m docker -- -d

The important piece being the “-m” flag for cloning the mount namespace, and the “--” for the separation before arguments to the docker executable itself.

What this will accomplish is at the launch of the docker daemon, it will get its own private mount namespace that is a clone of the root namespace at that point. The docker daemon is free to create devices, mount them, unmount them and remove the device, because none of that mount information will be in the root mount namespace nor subject to being cloned into other PID’s mount namespace.

Links:

Good Times

This year has been a nice for socializing with nerds. Attending conferences is something that can get tiresome, if the content is something you do not find interesting or stimulating. Thankfully, I have no required conferences, so I can be choosy (schedules permitting). Naturally open source and languages would top the list of places to attend.

First off, was hanging out with the KDE folks in San Francisco, CA for CampKDE – April 4,5 2011. There were a number of good talks, and a great opportunity to shake the hands of names that I have seen around, as well as meeting many new folks. The kde-promo team has a YouTube channel, that they have published all the talks and interviews to. http://www.youtube.com/user/kdepromo
The talk I gave there is titled “Slackware: Quickly and Easily Manage Your KDE SC Hacking.” You can get the slides in [PDF] or [ODP], plus the videos posted on the kde-promo channel have the full talk (youtube.com/watch?v=Qs7vR3POHeo), as well as an interview afterwards by Wade Olson (youtube.com/watch?v=YIpUmPul1i4).

Next, was down to Spartanburg, SC for the SouthEast Linux Fest (SELF) – June 10-12, 2011.
This is the third year that I have attended SELF, and second time to speak, but what differentiated this year from any other speaking engagement (in the past, or distant future), was that it was a talk title “Slackware Demystified”, and none other than the founder of Slackware Linux, Patrick Volkerding was not only in attendance, but sitting on the front row! The slides from this presentation are available in [HTML] or [PDF]. Unfortunately, the videos have not been published yet. Hopefully they will actually get them published, unlike the previous two years…

Lastly … so far, was a local conference, that I did not speak at, only attended. JrubyConf – August 3-5, 2011. While I use MRI Ruby much more than JRuby, this conference was a great way to be around and hear from many brilliant folks (Like Wayne Seguin, Charles Nutter (headius), Nick Sieger, and Jim Weirich, to name a few), plus I felt that I needed to make up for missing out on RubyConf taking place in Baltimore, MD.

The KDE folks strongly encouraged making it to the DesktopSummit, which was hosted in Berling, Germany this year. While it was surely appetising to think of attending, it did not work out this time. It would have been nice to shake hands with some fellow contributors, like Eric Hameleers (alienBob). Better luck next year.

All good times, I look forward to next year, or event the rest of 2011.

Feel free to leave feedback on the talks.

Take care,
vb

fix pm-suspend in >= 2.6.35 linux kernel

For those with a newer laptop, in my case a Lenovo, the NEC USBv3 port is cool, unused and so far, a pain in the ease of my mobile life.

The hardware info (from lspci -nnvv) is:

0f:00.0 USB Controller [0c03]: NEC Corporation Device [1033:0194] (rev 03) (prog-if 30)
 Subsystem: Lenovo Device [17aa:219c]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 18
 Region 0: Memory at f2200000 (64-bit, non-prefetchable) [size=8K]
 Capabilities: [50] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
 Address: 0000000000000000  Data: 0000
 Capabilities: [90] MSI-X: Enable- Count=8 Masked-
 Vector table: BAR=0 offset=00001000
 PBA: BAR=0 offset=00001080
 Capabilities: [a0] Express (v2) Endpoint, MSI 00
 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
 ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
 RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
 MaxPayload 128 bytes, MaxReadReq 512 bytes
 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
 LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <4us, L1 unlimited
 ClockPM+ Surprise- LLActRep- BwNot-
 LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
 ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 DevCap2: Completion Timeout: Not Supported, TimeoutDis+
 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
 Compliance De-emphasis: -6dB
 LnkSta2: Current De-emphasis Level: -3.5dB
 Capabilities: [100] Advanced Error Reporting
 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
 AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
 Capabilities: [140] Device Serial Number ff-ff-ff-ff-ff-ff-ff-ff
 Capabilities: [150] #18
 Kernel driver in use: xhci_hcd
 Kernel modules: xhci-hcd



Up until the 2.6.35 kernel, this had been using the kernel driver “xhci”, but in >= 2.6.35, it has been renamed “xhci_hcd”.

What does this have to do with anything?  The ease of my suspending to disk. This USBv3 driver is not yet playing well with the freezing/thawing process, and will give a cryptic error in /var/log/pm-suspend.log, pointing you to line 295 of /usr/lib${LIBSUFFIX}/pm-utils/pm-functions. When it gets to the point of executing `echo -n "mem" > /sys/power/state`, it returns a permission denied. Even though the contents of /sys/power/state show something like “mem disk”.

So for pm-utils, if you

  • go to /etc/pm/config.d/
  • create a file called “unload_modules“, with contents of
  • SUSPEND_MODULES="xhci xhci_hcd"
  • it’s a space delimited value, so add others if you are having issues.

TA DA. now i can resume the comfort of a quick suspend and resume!

Take care,

vb Continue reading

kde-4.3.1 x86_64, for slackware-current

i realize i just posted 4.3.0, but now 4.3.1 is tagged.
http://cardinal.lizella.net/~vbatts/kde/kde4-packages/4.3.1/

take care,

vb