
Deep dive into

Docker storage drivers

Jérôme Petazzoni - @jpetazzo

Docker - @docker

1 / 88

Who am I?

  • @jpetazzo

  • Tamer of Unicorns and Tinkerer Extraordinaire¹

  • Grumpy French DevOps person who loves Shell scripts
    Go Away Or I Will Replace You Wiz Le Very Small Shell Script

  • Some experience with containers
    (built and operated the dotCloud PaaS)

¹ At least one of those is actually on my business card

2 / 88

Outline

  • Extremely short intro to Docker

  • Copy-on-write

  • History of Docker storage drivers

  • AUFS, BTRFS, Device Mapper, Overlayfs, VFS, ZFS

  • Conclusions

3 / 88

Extremely short intro to Docker

4 / 88

What's Docker?

5 / 88

If you've never seen Docker in action ...

This will help!

jpetazzo@tarrasque:~$ docker run -ti python bash
root@75d4bf28c8a5:/# pip install IPython
Downloading/unpacking IPython
  Downloading ipython-2.3.1-py3-none-any.whl (2.8MB): 2.8MB downloaded
Installing collected packages: IPython
Successfully installed IPython
Cleaning up...
root@75d4bf28c8a5:/# ipython
Python 3.4.2 (default, Jan 22 2015, 07:33:45) 
Type "copyright", "credits" or "license" for more information.

IPython 2.3.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:
6 / 88

What happened here?

  • We created a container (~lightweight virtual machine),
    with its own:

    • filesystem (based initially on a python image)
    • network stack
    • process space
  • We started with a bash process
    (no init, no systemd, no problem)

  • We installed IPython with pip, and ran it

7 / 88

What did not happen here?

  • We did not make a full copy of the python image

The installation was done in the container, not the image:

  • We did not modify the python image itself

  • We did not affect any other container
    (currently using this image or any other image)

8 / 88

How is this important?

  • We used a copy-on-write mechanism
    (Well, Docker took care of it for us)

  • Instead of making a full copy of the python image, Docker keeps track of the changes between this image and our container

  • Huge disk space savings (1 container = less than 1 MB)

  • Huge time savings (1 container = less than 0.1s to start)
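
To see this on a live Docker host, a quick sketch (the flags are standard, but the image choice and output will vary):

docker run -d python sleep 3600      # throwaway container from the python image
docker ps -s                         # SIZE column: the container's own writable layer is tiny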

9 / 88

Copy-on-write

10 / 88

What is copy-on-write?

It's a mechanism for sharing data.

The data appears to be a copy, but is only a link (or reference) to the original data.

The actual copy happens only when someone tries to change the shared data.

Whoever changes the shared data ends up using their own private copy instead.
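
For a hands-on feel of this outside Docker: filesystems like BTRFS (or XFS with reflinks) expose copy-on-write copies directly. A minimal sketch, assuming /mnt/btrfs is on such a filesystem:

dd if=/dev/urandom of=/mnt/btrfs/original bs=1M count=512     # a big file
cp --reflink=always /mnt/btrfs/original /mnt/btrfs/copy       # instant "copy": just references
echo changed >> /mnt/btrfs/copy                               # only now do modified blocks get their own storage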

11 / 88

A few metaphors

12 / 88

A few metaphors

  • First metaphor:
    white board and tracing paper
13 / 88

A few metaphors

  • First metaphor:
    white board and tracing paper

  • Second metaphor:
    magic books with shadowy pages

14 / 88

A few metaphors

  • First metaphor:
    white board and tracing paper

  • Second metaphor:
    magic books with shadowy pages

  • Third metaphor:
    just-in-time house building

15 / 88

How does it work?

16 / 88

How does it work?

Magic!

17 / 88

How does it work?

Magic!

(More about this in a few minutes!)

18 / 88

Copy-on-write is everywhere

  • Process creation with fork()

  • Consistent disk snapshots

  • Efficient VM provisioning

  • CONTAINERS
    (building, shipping, running thereof)

19 / 88

Process creation

  • On virtually all UNIX systems,
    the only way* to create a new process
    is by making a copy of an existing process

  • This has to be fast
    (even if the process uses many GBs of RAM)

  • See fork() and clone() system calls

* Except for the first process (PID 1)
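
You can watch it happen from a shell; a small sketch, assuming strace is installed:

strace -f -e trace=clone,fork,vfork sh -c ls    # the shell clone()s itself, the child then exec()s ls

The child starts out with copy-on-write pages shared with its parent, which is what keeps fork() fast.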

20 / 88

Memory snapshots

  • Example: Redis BGSAVE command

  • Achieves consistent snapshot

  • Implementation relies on fork()

  • One "copy" continues to serve requests,
    the other writes memory to disk and exits

21 / 88

Mapped memory

  • Load a file in memory instantaneously!
    (even if it's huge and doesn't fit in RAM)

  • Data is loaded on demand from disk

  • Writes to memory can be:

    • visible only to us (copy-on-write)

    • committed to disk

See: mmap()

22 / 88

How does it work?

  • Thanks to the MMU! (Memory Management Unit)

  • Each memory access goes through it

  • Translates memory accesses (location¹ + operation²) into:

    • actual physical location

    • or, alternatively, a page fault

¹ Location = address = pointer

² Operation = read, write, or exec

23 / 88

Memory behaviors

  • Read to private data = OK

  • Read to shared data = OK

  • Write to private data = OK

  • Write to shared data = page fault

    • the MMU notifies the OS

    • the OS makes a copy of the shared data

    • the OS informs the MMU that the copy is private

    • the write operation resumes as if nothing happened

24 / 88

Other scenarios

  • Access to non-existent memory area = SIGSEGV/SIGBUS
    (a.k.a. "Segmentation fault"/"Bus error" a.k.a. "Go and learn to use pointers")

  • Access to swapped-out memory area = load it from disk
    (a.k.a. "My program is now 1000x slower")

  • Write attempt to code area / exec attempt to non-code area:
    kill the process (sometimes)
    (prevents some attacks, but also messes with JIT compilers)

Note: one "page" is typically 4 KB.

25 / 88

Consistent disk snapshots

  • "I want to make a backup of a busy database"

    • I don't want to stop the database during the backup

    • the backup must be consistent (single point in time)

  • Initially available (I think) on external storage (NAS, SAN),
    because/therefore:

    • it's complicated

    • it's expensive
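
These days, plain LVM can do it on any Linux box; a hedged sketch with made-up volume names:

lvcreate --size 1G --snapshot --name dbsnap /dev/vg0/dbvol    # point-in-time, copy-on-write snapshot

The snapshot can then be mounted and backed up while the original volume keeps serving writes.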

26 / 88

Thin provisioning for VMs¹

  • Put system image on copy-on-write storage

  • For each machine¹, create copy-on-write instance

  • If the system image contains a lot of useful software, people will almost never need to install extra stuff

  • Each extra machine will only need disk space for data!

WIN $$$ (And performance, too, because of caching)

¹ Not only VMs, but also physical machines with netboot, and containers!
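
With QEMU, for instance, it looks roughly like this (image names are made up):

qemu-img create -f qcow2 golden.qcow2 20G                    # the shared system image
qemu-img create -f qcow2 -F qcow2 -b golden.qcow2 vm1.qcow2  # per-VM copy-on-write overlay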

27 / 88

Copy-on-write spreads everywhere

(In no specific order; non-exhaustive list)

  • LVM (Logical Volume Manager) on Linux

  • ZFS on Solaris, then FreeBSD, Linux ...

  • BTRFS on Linux

  • AUFS, UnionMount, overlayfs ...

  • Virtual disks in VM hypervisors
    (qcow, qcow2, VirtualBox/VMware equivalents, EBS...)

28 / 88

Copy-on-write and Docker: a love story

  • Without copy-on-write...

    • it would take forever to start a container

    • containers would use up a lot of space

  • Without copy-on-write "on your desktop"...

    • Docker would not be usable on your Linux machine

    • There would be no Docker at all.
      And no meet-up here tonight.
      And we would all be shaving yaks instead.

29 / 88

Thank you:

Junjiro R. Okajima (and other AUFS contributors)

Chris Mason (and other BTRFS contributors)

Jeff Bonwick, Matt Ahrens (and other ZFS contributors)

Miklos Szeredi (and other overlayfs contributors)

The many contributors to Linux device mapper, thinp target, etc.

... And all the other giants whose shoulders we're sitting on top of, basically

30 / 88

History of Docker storage drivers

31 / 88

First came AUFS

  • Docker used to be dotCloud
    (PaaS, like Heroku, Cloud Foundry, OpenShift...)

  • dotCloud started using AUFS in 2008
    (with vserver, then OpenVZ, then LXC)

  • Great fit for high density, PaaS applications
    (More on this later!)

32 / 88

AUFS is not perfect

  • Not in mainline kernel

  • Applying the patches used to be exciting

  • ... especially in combination with GRSEC

  • ... and other custom fanciness like setns()

33 / 88

But some people believe in AUFS!

  • dotCloud, obviously

  • Debian and Ubuntu use it in their default kernels,
    for Live CD and similar use cases:

    • Your root filesystem is a copy-on-write between
      - the read-only media (CD, DVD...)
      - and a read-write media (disk, USB stick...)
  • As it happens, we also love Debian and Ubuntu very much

  • First version of Docker is targeted at Ubuntu (and Debian)

34 / 88

Then, some people started to believe in Docker

  • Red Hat users demanded Docker on their favorite distro

  • Red Hat Inc. wanted to make it happen

  • ... and contributed support for the Device Mapper driver

  • ... then the BTRFS driver

  • ... then the overlayfs driver

Note: other contributors also helped tremendously!

35 / 88

Special thanks:

Alexander Larsson

Vincent Batts

+ all the other contributors and maintainers, of course

(But those two guys have played an important role in the initial support, then maintenance, of the BTRFS, Device Mapper, and overlay drivers. Thanks again!)
36 / 88

Let's see each

storage driver

in action

37 / 88

AUFS

38 / 88

In theory

  • Combine multiple branches in a specific order

  • Each branch is just a normal directory

  • You generally have:

    • at least one read-only branch (at the bottom)

    • exactly one read-write branch (at the top)

    (But other fun combinations are possible too!)
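
Outside of Docker, an AUFS union can be assembled by hand; a minimal sketch (requires an AUFS-enabled kernel; directory names are made up):

mkdir -p /tmp/ro /tmp/rw /tmp/union
mount -t aufs -o br=/tmp/rw=rw:/tmp/ro=ro none /tmp/union    # top branch writable, bottom branch read-only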

39 / 88

When opening a file...

  • With O_RDONLY - read-only access:

    • look it up in each branch, starting from the top

    • open the first one we find

  • With O_WRONLY or O_RDWR - write access:

    • look it up in the top branch;
      if it's found here, open it

    • otherwise, look it up in the other branches;
      if we find it, copy it to the read-write (top) branch,
      then open the copy

That "copy-up" operation can take a while if the file is big!

40 / 88

When deleting a file...

  • A whiteout file is created
    (if you know the concept of "tombstones", this is similar)
# docker run ubuntu rm /etc/shadow

# ls -la /var/lib/docker/aufs/diff/$(docker ps --no-trunc -lq)/etc
total 8
drwxr-xr-x 2 root root 4096 Jan 27 15:36 .
drwxr-xr-x 5 root root 4096 Jan 27 15:36 ..
-r--r--r-- 2 root root    0 Jan 27 15:36 .wh.shadow
41 / 88

In practice

  • The AUFS mountpoint for a container is
    /var/lib/docker/aufs/mnt/$CONTAINER_ID/

  • It is only mounted when the container is running

  • The AUFS branches (read-only and read-write) are in
    /var/lib/docker/aufs/diff/$CONTAINER_OR_IMAGE_ID/

  • All writes go to /var/lib/docker

dockerhost# df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb        15G  4.8G  9.5G  34% /mnt
42 / 88

Under the hood

  • To see details about an AUFS mount:

    • look for its internal ID in /proc/mounts

    • look in /sys/fs/aufs/si_.../br*

    • each branch (except the two top ones)
      translates to an image

43 / 88

Example

dockerhost# grep c7af /proc/mounts
none /mnt/.../c7af...a63d aufs rw,relatime,si=2344a8ac4c6c6e55 0 0

dockerhost# grep . /sys/fs/aufs/si_2344a8ac4c6c6e55/br[0-9]*
/sys/fs/aufs/si_2344a8ac4c6c6e55/br0:/mnt/c7af...a63d=rw
/sys/fs/aufs/si_2344a8ac4c6c6e55/br1:/mnt/c7af...a63d-init=ro+wh
/sys/fs/aufs/si_2344a8ac4c6c6e55/br2:/mnt/b39b...a462=ro+wh
/sys/fs/aufs/si_2344a8ac4c6c6e55/br3:/mnt/615c...520e=ro+wh
/sys/fs/aufs/si_2344a8ac4c6c6e55/br4:/mnt/8373...cea2=ro+wh
/sys/fs/aufs/si_2344a8ac4c6c6e55/br5:/mnt/53f8...076f=ro+wh
/sys/fs/aufs/si_2344a8ac4c6c6e55/br6:/mnt/5111...c158=ro+wh

dockerhost# docker inspect --format {{.Image}} c7af
b39b81afc8cae27d6fc7ea89584bad5e0ba792127597d02425eaee9f3aaaa462

dockerhost# docker history -q b39b 
b39b81afc8ca
615c102e2290
837339b91538
53f858aaaf03
511136ea3c5a
44 / 88

Performance, tuning

  • AUFS mount() is fast, so creation of containers is quick

  • Read/write access has native speeds

  • But initial open() is expensive in two scenarios:

    • when writing big files (log files, databases ...)

    • with many layers + many directories in PATH
      (dynamic loading, anyone?)

  • Protip: when we built dotCloud, we ended up putting all important data on volumes

  • When starting 1000 containers from the same image, the data is loaded only once from disk, and cached only once in memory (but dentries will be duplicated)
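
In Docker terms, that protip means using volumes, which bypass the copy-on-write layer entirely; for example (image and paths are just an illustration):

docker run -d -v /var/lib/postgresql/data postgres              # data directory on a plain volume, not on AUFS
docker run -d -v /srv/pg:/var/lib/postgresql/data postgres      # or bind-mount a host directory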

45 / 88

Device Mapper

46 / 88

Preamble

  • Device Mapper is a complex subsystem; it can do:

    • RAID

    • encrypted devices

    • snapshots (i.e. copy-on-write)

    • and some other niceties

  • In the context of Docker, "Device Mapper" means
    "the Device Mapper system + its thin provisioning target"
    (sometimes noted "thinp")

47 / 88

In theory

  • Copy-on-write happens on the block level
    (instead of the file level)

  • Each container and each image gets its own block device

  • At any given time, it is possible to take a snapshot:

    • of an existing container (to create a frozen image)

    • of an existing image (to create a container from it)

  • If a block has never been written to:

    • it's assumed to be all zeros

    • it's not allocated on disk
      (hence "thin" provisioning)

48 / 88

In practice

  • The mountpoint for a container is
    /var/lib/docker/devicemapper/mnt/$CONTAINER_ID/

  • It is only mounted when the container is running

  • The data is stored in two files, "data" and "metadata"
    (More on this later)

  • Since we are working on the block level, there is not much visibility on the diffs between images and containers

49 / 88

Under the hood

  • docker info will tell you about the state of the pool
    (used/available space)

  • List devices with dmsetup ls

  • Device names are prefixed with docker-MAJ:MIN-INO

    MAJ, MIN, and INO are derived from the block major, block minor, and inode number where the Docker data is located (to avoid conflict when running multiple Docker instances, e.g. with Docker-in-Docker)
  • Get more info about them with dmsetup info, dmsetup status
    (you shouldn't need this, unless the system is badly borked)

  • Snapshots have an internal numeric ID

  • /var/lib/docker/devicemapper/metadata/$CONTAINER_OR_IMAGE_ID
    is a small JSON file tracking the snapshot ID and its size
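
Putting that together on a devicemapper host (device names will differ; the last one is hypothetical):

docker info                              # pool status: data/metadata space used and available
dmsetup ls | grep docker                 # devices currently set up by Docker (the pool + active containers/images)
dmsetup status docker-202:1-1049-pool    # low-level status of the (hypothetical) pool device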

50 / 88

Extra details

  • Two storage areas are needed:
    one for data, another for metadata

  • "data" is also called the "pool"; it's just a big pool of blocks
    (Docker uses the smallest possible block size, 64 KB)

  • "metadata" contains the mappings between virtual offsets (in the snapshots) and physical offsets (in the pool)

  • Each time a new block (or a copy-on-write block) is written, a block is allocated from the pool

  • When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation aborted)

51 / 88

Performance

  • By default, Docker puts data and metadata on a loop device backed by a sparse file

  • This is great from a usability point of view
    (zero configuration needed)

  • But terrible from a performance point of view:

    • each time a container writes to a new block,
    • a block has to be allocated from the pool,
    • and when it's written to,
    • a block has to be allocated from the sparse file,
    • and sparse file performance isn't great anyway
52 / 88

Tuning

  • Do yourself a favor: if you use Device Mapper,
    put data (and metadata) on real devices!

    • stop Docker

    • change parameters

    • wipe out /var/lib/docker (important!)

    • restart Docker

docker -d --storage-opt dm.datadev=/dev/sdb1 \
          --storage-opt dm.metadatadev=/dev/sdc1
53 / 88

More tuning

  • Each container gets its own block device

    • with a real FS on it
  • So you can also adjust (with --storage-opt):

    • filesystem type

    • filesystem size

    • discard (more on this later)

  • Caveat: when you start 1000 containers,
    the files will be loaded 1000 times from disk!
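
For example, with the daemon flags of that era (values are illustrative):

docker -d --storage-opt dm.fs=xfs --storage-opt dm.basesize=20G    # XFS instead of ext4, 20 GB filesystem per container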

54 / 88

BTRFS

56 / 88

In theory

  • Do the whole "copy-on-write" thing at the filesystem level

  • Create¹ a "subvolume" (imagine mkdir with Super Powers)

  • Snapshot¹ any subvolume at any given time

  • BTRFS integrates the snapshot and block pool management features at the filesystem level, instead of the block device level

¹ This can be done with the btrfs tool.
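
By hand, it looks like this (assuming /mnt/btrfs is a BTRFS mount point):

btrfs subvolume create /mnt/btrfs/image                         # a subvolume: mkdir with Super Powers
btrfs subvolume snapshot /mnt/btrfs/image /mnt/btrfs/container  # instant copy-on-write clone of it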

57 / 88

In practice

  • /var/lib/docker has to be on a BTRFS filesystem!

  • The BTRFS mountpoint for a container or an image is
    /var/lib/docker/btrfs/subvolumes/$CONTAINER_OR_IMAGE_ID/

  • It should be present even if the container is not running

  • Data is not written directly, it goes to the journal first
    (in some circumstances¹, this will affect performance)

¹ E.g. uninterrupted streams of writes.
The performance will be half of the "native" performance.
58 / 88

Under the hood

  • BTRFS works by dividing its storage in chunks

  • A chunk can contain data or metadata

  • You can run out of chunks (and get No space left on device)
    even though df shows space available
    (because the chunks are not full)

  • Quick fix:

# btrfs filesys balance start -dusage=1 /var/lib/docker
59 / 88

Performance, tuning

  • Not much to tune

  • Keep an eye on the output of btrfs filesys show!

This filesystem is doing fine:

# btrfs filesys show
Label: none  uuid: 80b37641-4f4a-4694-968b-39b85c67b934
        Total devices 1 FS bytes used 4.20GiB
        devid    1 size 15.25GiB used 6.04GiB path /dev/xvdc

This one, however, is full (no free chunk) even though there is not that much data on it:

# btrfs filesys show
Label: none  uuid: de060d4c-99b6-4da0-90fa-fb47166db38b
        Total devices 1 FS bytes used 2.51GiB
        devid    1 size 87.50GiB used 87.50GiB path /dev/xvdc
60 / 88

Overlayfs

61 / 88

Preamble

What's with the name: overlay or overlayfs?

  • It used to be called (and have filesystem type) overlayfs

  • When it was merged in 3.18, this was changed to overlay

62 / 88

In theory

  • This is just like AUFS, with minor differences:

    • only two branches (called "layers")

    • but branches can be overlays themselves
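
A hand-rolled overlay mount looks like this on a 3.18+ kernel (directory names are made up):

mkdir -p /lower /upper /work /merged
mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged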

63 / 88

In practice
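
Roughly: you need a recent enough kernel¹, and the daemon started with the overlay driver. A sketch, using the daemon syntax of that era:

uname -r                # must report at least 3.18
docker -d -s overlay    # start the daemon with the overlay storage driver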

¹ Adaptation to other distros is left as an exercise for the reader.
64 / 88

Under the hood

  • Images and containers are materialized under
    /var/lib/docker/overlay/$ID_OF_CONTAINER_OR_IMAGE

  • Images just have a root subdirectory
    (containing the root FS)

  • Containers have:

    • lower-id → file containing the ID of the image

    • merged/ → mount point for the container (when running)

    • upper/ → read-write layer for the container

    • work/ → temporary space used for atomic copy-up
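
To poke at this on a live host ($CONTAINER_ID is a placeholder):

ls /var/lib/docker/overlay/$CONTAINER_ID/              # lower-id, merged/, upper/, work/
cat /var/lib/docker/overlay/$CONTAINER_ID/lower-id     # ID of the image the container was created from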

65 / 88

Performance, tuning

  • Implementation detail:
    identical files are hardlinked between images
    (this avoids doing composed overlays)

  • Not much to tune at this point

  • Performance should be slightly better than AUFS:

    • no stat() explosion

    • good memory use

    • slow copy-up, still (nobody's perfect)

66 / 88

VFS

67 / 88

In theory

  • No copy on write. Docker does a full copy each time!

  • Doesn't rely on those fancy-pesky kernel features

  • Good candidate when porting Docker to new platforms
    (think FreeBSD, Solaris...)

  • Space inefficient, slow

68 / 88

In practice

  • Might be useful for production setups

    (If you don't want / cannot use volumes, and don't want / cannot use any of the copy-on-write mechanisms!)

69 / 88

ZFS

70 / 88

ZFS

(Coming Soon To A Docker Release Near You)

71 / 88

ZFS

(Coming Soon To A Docker Release Near You)

(i.e. it is in 1.7-RC)

72 / 88

ZFS

(Coming Soon To A Docker Release Near You)

(i.e. it is in 1.7-RC)

WOOT!

73 / 88

ZFS on Docker by @solarce

74 / 88

Sneak peek

75 / 88

From a user POV...

  • Similar to BTRFS

  • Fast creation, fast snapshots...

  • Epic performance and reliability (YMMV)

  • ZFS has a reputation for being quite memory-hungry
    (= you probably don't want it in a tiny VM)

76 / 88

Implementation details

  • Shells out to zfs-utils

  • Why not libzfs?
    Because it's tied to a specific kernel version

  • A Tale Of Two Pull Requests

77 / 88

Conclusions

78 / 88

The nice thing about Docker storage drivers,
is that there are so many of them to choose from.

79 / 88

What do, what do?

  • If you do PaaS or other high-density environments:

    • AUFS (if available on your kernel)

    • overlayfs (otherwise)

  • If you put big writable files on the CoW filesystem:

    • BTRFS, ZFS, or Device Mapper (pick the one you know best)

    • Wait, really, you want me to pick one!?!
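
To see what you are currently running, and whether AUFS is even an option on your kernel:

docker info | grep "Storage Driver"    # driver currently in use
grep aufs /proc/filesystems            # no output usually means no AUFS support in this kernel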

80 / 88

Bottom line

81 / 88

The best storage driver to run your production
will be the one with which you and your team
have the most extensive operational experience.

82 / 88

Bonus track

discard and TRIM

83 / 88

TRIM

  • Command sent to an SSD, to tell it:
    "that block is not in use anymore"

  • Useful because on SSD, erase is very expensive (slow)

  • Allows the SSD to pre-erase cells in advance
    (rather than on-the-fly, just before a write)

  • Also meaningful on copy-on-write storage
    (if/when every snapshot has trimmed a block, it can be freed)

84 / 88

discard

  • Filesystem option meaning:
    "can I has TRIM on this pls"

  • Can be enabled/disabled at any time

  • Filesystem can also be trimmed manually with fstrim
    (even while mounted)
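
For example (the mount point is just an illustration):

mount -o remount,discard /var/lib/docker    # turn on online discard for a mounted filesystem
fstrim -v /var/lib/docker                   # or trim it manually, once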

85 / 88

The discard quandary

  • discard works on Device Mapper + loopback devices

  • ... but is particularly slow on loopback devices
    (the loopback file needs to be "re-sparsified" after container or image deletion, and this is a slow operation)

  • You can turn it on or off depending on your preference
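
With the devicemapper driver, that preference is a daemon storage option (syntax of that era; a sketch):

docker -d --storage-opt dm.blkdiscard=false    # skip discards on deletion (faster on loopback, but space is not returned to the host)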

86 / 88

That's all folks!

87 / 88

Questions?

88 / 88
