~jpetazzo/

Installing Archlinux with LUKS, SecureBoot, TPM

2024-02-23T00:00:00+00:00

Once in a while, I need to install Archlinux on a new machine. This is the procedure that I follow. It has been recently updated to include root device encryption using LUKS, with the encryption keys stored in the machine’s TPM, and uses SecureBoot so that the device can be unlocked without typing a passphrase, while retaining a good(ish) security level.

A tiny bit of context

I usually don’t need to install or reinstall Linux very often (I don’t deploy lots of physical machines). It’s mostly on new laptops or desktop machines, or when a disk fails. Since I don’t do it very often, it helps to have some notes to remember the different steps, command-line flags, etc.

I’ve considered fully automating the process (e.g. with custom ISO/USB images or PXE boot) but since each machine and install are slightly different, I’m fine with just a detailed procedure.

Recently, my Dell XPS 13 main board failed. On that model, the NVMe disk is soldered to the main board. When Dell replaces the main board, you end up with a blank system and need to reinstall. Shortly after getting the main board replaced, another failure happened (the WiFi card disappeared - no longer listed in lspci - and one of the USB ports completely died). It had to be replaced again (one more reinstall). And then the replacement board had another problem: the CPU clock wouldn’t go above 200 MHz. That last issue was a huge pain in the neck to address, because the machine’s diagnostics would still pass, so Dell initially refused to change the main board. They insisted on updating drivers, reinstalling Windows, etc. and it took almost 10 days of constant back-and-forth with them to finally get them to replace the main board. (As soon as the main board was replaced, the system was fine.)

Since I had to reinstall that system multiple times in the span of a few weeks, I decided to clean up my notes, improve the process a bit (e.g. automate partition creation, store LUKS keys in TPM, enable SecureBoot) and turn that into a blog post in case it’s useful to others. The commands are presented in such a way that if you’re connecting to the machine over SSH, you can copy-paste 90% of them without having to tweak too many things, and it then takes about 10 minutes from end-to-end to get everything up and running.

Disclaimer: there is nothing special or original about this install process. Most of the information has been gleaned from the Archlinux wiki, in particular the Installation Guide. If you’re interested in the SecureBoot + TPM2 + LUKS bits, the following resources have been very helpful:

Preparation

Make sure that the machine has free (non-partitioned) disk space. A totally blank disk is fine. If you dual boot Windows, you can shrink the Windows partition from Windows. (That’s what I do because my laptop won’t install Windows with the normal Windows ISO images; I have to build a recovery media from another machine, and the recovery process completely wipes the partition table and destroys everything that was on the disk anyway. Ain’t that just lovely!)

If you want to do the (optional) SecureBoot part, enter the BIOS and look for the SecureBoot options. We will need to enroll our own keys, so switch to “Setup Mode” (it’s called “Audit Mode” on my Dell BIOS).

Get an Archlinux ISO/USB image and copy it to a USB stick. Boot it. Get to the shell.

Connecting to internet

You can skip this step if you just want to mess around with partitions, chroot into an existing system, etc.; but to install Archlinux, we will need internet access eventually (to download packages).

Also, personally, I prefer to do the install from another machine (so that I can copy-paste commands and error messages if necessary) so I start the SSH server, add my keys to the root account, and log in from the other machine to continue from there.

ESSID="Your WiFi Network Name Here"
# Quote the variable: network names often contain spaces
iwctl station wlan0 connect "$ESSID"

Note: sometimes, it seems necessary to run iwctl station wlan0 scan before trying to connect. I don’t know why.

Partitioning disks

The general idea is that we want:

a large-ish (100+ GB) partition for the Linux system
a swap partition (not strictly mandatory)
a large enough (1+ GB) boot partition

Ideally, the boot partition will be an “ESP” or “EFI System Partition”. This is a special partition type, typically formatted using the VFAT filesystem, so that it’s readable by the machine’s UEFI firmware.

The boot partition will hold the boot loader files as well as our kernels, ramdisks, microcode files. A typical kernel + ramdisk is around 30 MB on my machines. A rescue kernel + ramdisk is around 150 MB. Multiply these figures by two if you’re keeping the previous kernel around as a fallback, or if you’re experimenting with different kernels. If you’re booting multiple OSes, they will share the same boot partition, so you might want to account for that too. Personally I like to have 1 GB here to be on the safe side.
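As a quick sanity check of that sizing, here is a back-of-the-envelope calculation in Python. The figures are the rough ones quoted above, and the variable names are made up for illustration.

```python
# Rough sizes quoted above, in MB (approximations, not measurements).
KERNEL_AND_RAMDISK_MB = 30
RESCUE_KERNEL_AND_RAMDISK_MB = 150
KERNEL_GENERATIONS = 2  # current kernel + the previous one kept as a fallback

needed_mb = (KERNEL_AND_RAMDISK_MB + RESCUE_KERNEL_AND_RAMDISK_MB) * KERNEL_GENERATIONS
boot_partition_mb = 1024  # the 1 GB partition suggested here

print(f"needed: ~{needed_mb} MB, headroom: ~{boot_partition_mb - needed_mb} MB")
```

With these numbers, a 1 GB partition leaves plenty of headroom for another OS or an experimental kernel, while a 100 MB ESP would not even hold one regular + rescue set.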

If you already have Windows installed on the machine, it is likely that you already have an EFI System Partition, and that it is fairly small (e.g. 100 MB). The Windows boot process is extremely brittle, so resizing or moving that ESP might render the Windows system unbootable. Since the Windows boot process doesn’t produce useful error messages, it is fairly difficult to figure out what’s confusing it. The recommended approach in that case is to create a separate “Linux extended boot” partition. The (relatively small) Linux boot loader will be installed on the ESP (alongside the Windows and other boot loaders), and the (relatively big) Linux kernels and initrds and other files will go to the extended boot partition.

When there is just an ESP, it is typically mounted on /boot.

When there is both an ESP and an extended boot partition, the ESP is typically mounted on /efi and the extended boot partition on /boot.

Plan 1: manual partitioning

find disks with lsblk
use cfdisk to partition disk
if there is no partition of type “EFI System”:
- create a 1G partition, type “EFI System”
- it will be mounted on /boot
- it will be the “ESP” (EFI System Partition)
if there is an “EFI System” partition of 1G or more:
- nothing to do!
- we will mount it on /boot
if there is an “EFI System” partition of less than 1G:
- create a 1G partition, type “Linux extended boot”
- it will be mounted on /boot
- the “EFI System” partition will be mounted on /efi
create the other partitions:
- e.g. 300G “Linux filesystem” for /
- e.g. 32G “Linux swap”
recommended: set the type of the partitions accordingly (i.e. “Linux swap” for the swap partition, and “Linux root (x86-64)” for the root partition - if you’re on amd64)
“recent” (2022ish?) versions of systemd boot hooks will be able to recognize these partitions, meaning that it won’t be necessary to put them in /etc/fstab, nor to pass the root device to the kernel command line
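The three cases in the list above boil down to a small decision function. This is only a sketch of that logic: the function name, the return format and the 1024 MiB threshold are assumptions made for the example, not something any tool exposes.

```python
from typing import Optional

MIN_BOOT_MIB = 1024  # the "1G or more" threshold from the list above


def plan_boot_layout(esp_size_mib: Optional[int]) -> dict:
    """Say which partition to create (if any) and what gets mounted where.

    esp_size_mib is the size of the existing "EFI System" partition,
    or None when the disk has none.
    """
    if esp_size_mib is None:
        # No ESP yet: create one, big enough to double as /boot.
        return {"create": "EFI System", "mounts": {"/boot": "EFI System"}}
    if esp_size_mib >= MIN_BOOT_MIB:
        # Existing ESP is large enough: nothing to create.
        return {"create": None, "mounts": {"/boot": "EFI System"}}
    # Existing ESP is too small: leave it alone and add an extended boot partition.
    return {
        "create": "Linux extended boot",
        "mounts": {"/boot": "Linux extended boot", "/efi": "EFI System"},
    }


# Typical dual-boot case: Windows left us a 100 MiB ESP.
print(plan_boot_layout(100))
```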

Plan 2: semi-automatic partitioning

Find disks with lsblk. Here is some example output:

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
loop0         7:0    0 757.8M  1 loop  /run/archiso/airootfs
sda           8:0    1  14.3G  0 disk
├─sda1        8:1    1   868M  0 part
└─sda2        8:2    1    15M  0 part
nvme0n1     259:0    0 953.9G  0 disk
├─nvme0n1p1 259:8    0   100M  0 part
├─nvme0n1p2 259:9    0    16M  0 part
├─nvme0n1p3 259:10   0   300G  0 part
└─nvme0n1p4 259:11   0   598M  0 part

/dev/sda is the USB stick with the Archlinux installer.

/dev/nvme0n1 is the disk where we want to install Archlinux. It’s not directly obvious from the output of the lsblk command, but there is a bit more than 600G available (not partitioned) on that disk.

DISK=/dev/nvme0n1

# Show the current partition table
sgdisk $DISK --print

# Since we already have an ESP, and it's small, let's create an XBOOTLDR partition
sgdisk $DISK --new=0:0:+1G --change-name=0:boot --typecode=0:ea00

# On a completely blank disk, or a disk with no ESP, we could create an ESP partition
#sgdisk $DISK --new=0:0:+1G --change-name=0:boot --typecode=0:ef00

# Create root partition and swap partition
sgdisk $DISK --new=0:0:+500G --change-name=0:archlinux --typecode=0:8304
sgdisk $DISK --new=0:0:+32G --change-name=0:swap --typecode=0:8200

# Check that everything is fine
sgdisk $DISK --print

Here is what the last sgdisk command shows us on my system:

...
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB   EF00  EFI system partition
   2          206848          239615   16.0 MiB    0C01  Microsoft reserved ...
   3          239616       629385215   300.0 GiB   0700  Basic data partition
   4      1999183872      2000408575   598.0 MiB   2700  Basic data partition
   5       629385216       631482367   1024.0 MiB  EA00  boot
   6       631482368      1680058367   500.0 GiB   8304  archlinux
   7      1680058368      1747167231   32.0 GiB    8200  swap
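If you want to double-check that the new partitions have the sizes we asked for, the sector numbers are enough. The sketch below assumes 512-byte logical sectors (which is consistent with the sizes printed by sgdisk here, but may differ on other disks); the dictionary and function names are just for illustration.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors
GIB = 1024 ** 3

# (start sector, end sector) of the three partitions we just created
new_partitions = {
    "boot": (629385216, 631482367),
    "archlinux": (631482368, 1680058367),
    "swap": (1680058368, 1747167231),
}


def size_gib(start: int, end: int) -> float:
    # Both sector numbers are inclusive, hence the +1.
    return (end - start + 1) * SECTOR_BYTES / GIB


for name, (start, end) in new_partitions.items():
    print(f"{name}: {size_gib(start, end):g} GiB")
```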

Note: naming the partitions isn’t strictly necessary, but it makes it possible to reference them at /dev/disk/by-partlabel/, which is pretty convenient in my humble opinion.

Alright, let’s set this env var for convenience:

ROOTDEV=/dev/disk/by-partlabel/archlinux

Encrypting the Linux partition

Encrypting the Linux partition gives you some extra security if your machine’s disk is no longer in your possession, for instance:

if the machine (or its disk) gets stolen
if you need to send back the machine or the disk for replacement or repair (and can’t wipe the disk before)

On modern machines, the performance and CPU overhead of disk encryption is negligible.

On the other hand, each time you boot your machine, you will need to provide the encryption key. It will not be possible to boot the machine without the key. Typically, the key is secured by a password (that has to be provided at each boot). This can be problematic for machines that need to be able to boot unattended. In my case, I have machines at home that are usually off, and I turn them on with wake-on-lan when I’m away. Since I’m not physically at the machine, I cannot type a password; and at this point, the machine hasn’t booted, so it’s not connected to internet or VPN etc. We’ll see later how to store the key in the machine’s TPM to solve that.

Note: if you don’t want to encrypt the Linux partition, just skip this step!

The commands below will ask you for a password. You can put a dummy password at this point (e.g. “1234”). LUKS doesn’t directly derive the encryption key from the password. Instead, it will generate a secure key, then store the key in a “key slot”, itself encrypted with the password. This means that later, we will be able to change that dummy password without having to re-encrypt the whole disk. There are multiple key slots, which means that we can have multiple passwords, as well as recovery keys, keys stored in the TPM or other hardware modules, and we can even completely remove the password if we use other key slots.

cryptsetup luksFormat --type luks2 $ROOTDEV
cryptsetup luksOpen $ROOTDEV root
ROOTDEV=/dev/mapper/root
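To picture how key slots let us swap passwords without re-encrypting anything, here is a toy model in Python. It is emphatically not how LUKS is implemented (real LUKS2 uses a proper cipher, anti-forensic splitting and a salted digest) and it is not secure; the class and function names are invented for this sketch. The only point is that the random master key stays the same while the slots wrapping it come and go.

```python
import hashlib
import secrets


def _derive(passphrase: str, salt: bytes) -> bytes:
    # Stretch the passphrase into a 32-byte key-encryption key.
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 10_000, dklen=32)


def _xor(a: bytes, b: bytes) -> bytes:
    # Toy "encryption": XOR with the derived key (illustration only).
    return bytes(x ^ y for x, y in zip(a, b))


class ToyHeader:
    """Toy model of a volume header with key slots. NOT real LUKS, NOT secure."""

    def __init__(self, passphrase: str):
        master_key = secrets.token_bytes(32)  # random, never derived from a passphrase
        self.digest = hashlib.sha256(master_key).digest()  # lets us recognise the right key
        self.slots = {}
        self.add_slot(master_key, passphrase)

    def add_slot(self, master_key: bytes, passphrase: str) -> int:
        salt = secrets.token_bytes(16)
        slot = max(self.slots, default=-1) + 1
        self.slots[slot] = (salt, _xor(master_key, _derive(passphrase, salt)))
        return slot

    def unlock(self, passphrase: str) -> bytes:
        for salt, wrapped in self.slots.values():
            candidate = _xor(wrapped, _derive(passphrase, salt))
            if hashlib.sha256(candidate).digest() == self.digest:
                return candidate
        raise ValueError("no key slot matches this passphrase")


header = ToyHeader("1234")                     # dummy password at install time
master_key = header.unlock("1234")
header.add_slot(master_key, "a much better passphrase")
del header.slots[0]                            # retire the dummy password
assert header.unlock("a much better passphrase") == master_key
```

The data encrypted with the master key never has to be touched: only the small wrapped copies in the header change.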

Making filesystems and installing Linux

This is fairly straightforward. This is mostly pulled from the Archlinux installation guide.

mkswap /dev/disk/by-partlabel/swap
swapon /dev/disk/by-partlabel/swap
mkfs -t ext4 $ROOTDEV
mount --mkdir $ROOTDEV /mnt
mkfs -t vfat /dev/disk/by-partlabel/boot
mount --mkdir /dev/disk/by-partlabel/boot /mnt/boot

# If there is a separate ESP:
mount --mkdir /dev/disk/by-partlabel/EFI* /mnt/efi

Next, we can tweak the pacman configuration of the installer. This is not strictly necessary. The parallel downloads typically make it faster to download packages; and according to an MIT study, the Color will make your install 42% more fancy [[citation needed]].

sed -i "s/^#ParallelDownloads/ParallelDownloads/" /etc/pacman.conf
sed -i "s/^#Color/Color/" /etc/pacman.conf

Now I suggest having a look at /etc/pacman.d/mirrorlist. It should have been automatically populated with the mirrors closest to your location; but if it hasn’t, then you can run this:

reflector --save /etc/pacman.d/mirrorlist \
--protocol https --latest 5 --sort age

Now install the base system and a few extra packages:

pacstrap -K /mnt base linux linux-firmware linux-headers \
less sudo git base-devel networkmanager vim man-db man-pages openssh

Then drop into the newly installed system for some finishing touches:

arch-chroot /mnt

# Adjust and run the following command if your system will
# be in a given timezone (personally I keep it in UTC and just
# set the TZ environment variable in my profile, but you do you!)
# ln -sf /usr/share/zoneinfo/Region/City /etc/localtime

MYHOSTNAME=fancyhostnameoowee
MYUSERNAME=jp
ROOTPASSWORD=securerootpassword
USERPASSWORD=secureuserpassword

hwclock --systohc
echo en_US.UTF-8 UTF-8 >> /etc/locale.gen
locale-gen

echo $MYHOSTNAME > /etc/hostname

# Optionally, set a root password
chpasswd <<< "root:$ROOTPASSWORD"

# Optionally, create a user (strongly recommended :))
useradd -m $MYUSERNAME
chpasswd <<< "$MYUSERNAME:$USERPASSWORD"
echo "$MYUSERNAME ALL=(ALL) ALL" > /etc/sudoers.d/$MYUSERNAME

# Personally I like to enable these, but that's up to you
systemctl enable NetworkManager.service
systemctl enable sshd.service

Setting up the bootloader

I use systemd-boot. Feel free to use something else, but you might have to adjust the SecureBoot part later (if you intended to use SecureBoot).

If you only have the “EFI System” partition:

bootctl install

If you have both “EFI System” and “Linux extended boot”:

bootctl install --boot-path=/boot --esp-path=/efi

Personally I like to edit $ESP/loader/loader.conf:

console-mode auto
default @saved
timeout 10

(Replace $ESP with /boot or /efi depending on where your EFI System Partition is located.)

Note: default @saved means that instead of booting systematically to Linux or Windows or whatever, systemd-boot will remember the entry that was booted last and select it again the next time. The default entry can also be set explicitly with bootctl set-default or in the boot menu itself (by selecting an entry and pressing d). Check the systemd-boot manpage for more funny keyboard shortcuts!

At this point, we’re supposed to generate /etc/fstab, but we won’t do it. Instead, we’re going to use a “fancy” initrd, based on systemd, which will automatically detect our various partitions, using GPT partition types. That’s why we had to set the partition types correctly earlier.
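What the systemd hooks look at is the GPT partition type GUID, which is what the sgdisk type codes used earlier stand for. The table below is a sketch of that mapping: the GUIDs are the ones published in the Discoverable Partitions Specification, while the dictionary, the helper and the "where it ends up" column are a simplification written for this example.

```python
# sgdisk short code -> (GPT type GUID, role, where it typically ends up)
GPT_TYPES = {
    "ef00": ("c12a7328-f81f-11d2-ba4b-00a0c93ec93b", "EFI System Partition", "/efi or /boot"),
    "ea00": ("bc13c2ff-59e6-4262-a352-b275fd6f7172", "Linux extended boot", "/boot"),
    "8304": ("4f68bce3-e8cd-4db1-96e7-fbcaf984b709", "Linux root (x86-64)", "/"),
    "8200": ("0657fd6d-a4ab-43c4-84e5-0933c84b4f4f", "Linux swap", "swap"),
}


def describe(code: str) -> str:
    """Human-readable summary for one sgdisk type code."""
    guid, role, target = GPT_TYPES[code.lower()]
    return f"{code.lower()}: {role} ({guid}) -> {target}"


for code in ("EF00", "EA00", "8304", "8200"):
    print(describe(code))
```

If a partition has the wrong type (for instance a root partition left as generic “Linux filesystem”), auto-detection has nothing to go on, and you would need the commented-out options line or an /etc/fstab instead.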

Generating the initrd

If you don’t want to use SecureBoot, you can generate an initrd, reboot, and call it a day. If you want to use SecureBoot, you can generate the initrd anyway and check that everything is fine before going on to the SecureBoot section. But you can also skip this section (and go straight to “Setting up SecureBoot”) if you want.

To generate the initrd, we need to first edit /etc/mkinitcpio.conf and update the HOOKS line to use the fancy systemd initrd mentioned previously:

HOOKS=(base systemd autodetect microcode modconf kms keyboard sd-vconsole block sd-encrypt filesystems fsck)

systemd is here for the partition detection
sd-encrypt will detect LUKS partitions and unlock them (prompting us for the password if necessary)

We can now build the initrd:

mkinitcpio --allpresets

Then we configure systemd-boot to add a Linux entry:

cat >/boot/loader/entries/arch.conf <<EOF
title Arch Linux
linux /vmlinuz-linux
#initrd /amd-ucode.img
#initrd /intel-ucode.img
initrd /initramfs-linux.img
#options root=LABEL=xxx rootfstype=ext4
EOF

I’ve commented out the microcode files; feel free to install the relevant package for your CPU and uncomment the corresponding line.

I’ve also left a commented out options line just in case you don’t want to use partition autodetection.

At that point, we can already reboot into the newly installed system if we want.

Setting up SecureBoot

Here is a really quick primer about SecureBoot and TPM, in case you wonder why we would bother with all that. Please note that I’m not a SecureBoot expert, and it’s quite possible that I’m misusing some terminology here or even got a few things completely wrong. If you’re an expert in these things, feel free to point out the mistakes so I can fix that part :)

TPM and PCRs

Most modern PCs have a TPM (Trusted Platform Module). The TPM has various features, including:

a secure random number generator (not used here but nifty anyways)
the ability to store encryption keys securely
the ability to verify that a system is “trusted”, and only give access to the encryption keys if it is

“Trusted” here means that the entire boot chain has to be signed properly. This includes the boot loader and the kernel as well as the associated files (initrd, CPU microcode) and the kernel parameters.

“Signed properly” means signed by a key enrolled in the machine’s UEFI firmware (strictly speaking, the SecureBoot keys live in firmware variables rather than in the TPM itself). By default, the firmware has at least keys from Microsoft, meaning that it’s fairly straightforward to boot Windows into SecureBoot mode. There might also be some Red Hat keys but I didn’t look much into that.

The “trusted” aspect of the system isn’t a binary thing (trusted / non-trusted). In fact, the TPM supports multiple PCRs (Platform Configuration Registers) that store hashes of various system components. For instance, on Linux, PCR11 will contain the hash of the kernel boot image (kernel, initrd, and associated options) and PCR12 will contain the hash of the kernel command line. These hashes are called “measurements”. Keys can be bound to a specific set of PCR measurements, which means that the keys will be available only when the designated set of PCRs matches specific values. In other words, it is possible to set up the TPM so that the key that unlocks our Linux partition is only available when booting a specific Linux kernel with a specific command line, and altering any of these (e.g. to put a good old init=/bin/bash) would cause the corresponding PCR hash value to change and therefore the TPM would refuse to unseal the key.
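Here is a small sketch of how such measurements accumulate. A PCR can’t be written directly; it can only be “extended”, meaning that the new value is a hash of the old value combined with the hash of whatever is being measured. The code below mimics that with SHA-256 in plain Python. The component names and byte strings are made up, and a real boot measures far more than three items; this only shows why a changed command line ends up as a different final value.

```python
import hashlib


def extend(pcr: bytes, data: bytes) -> bytes:
    """Fold one measurement into a PCR: new = SHA256(old || SHA256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def measure(*components: bytes) -> bytes:
    pcr = bytes(32)  # in this sketch, the register starts as all zeroes
    for component in components:
        pcr = extend(pcr, component)
    return pcr


# Hypothetical boot components (placeholders, not real file contents).
expected = measure(b"kernel image", b"initrd", b"root=/dev/mapper/root rw")
tampered = measure(b"kernel image", b"initrd", b"root=/dev/mapper/root rw init=/bin/bash")

print(expected.hex())
print(tampered.hex())
print("values match:", tampered == expected)
```

Because each step hashes the previous value, there is no way to “undo” a measurement or to reach the expected value by appending something after the fact.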

You might wonder: “Hey, if I get anything wrong, could that make my system unbootable?” As far as I understand, if you change something that affects the PCR measurements (and the TPM refuses to unseal the key), or even if you disable SecureBoot altogether, you will still be able to boot your system; but you will need to provide your LUKS password or recovery key to unlock your root device. (I believe that this is why my friends who use Windows BitLocker complain about having to enter their BitLocker recovery key after some software or hardware upgrades, but I have zero direct experience with BitLocker myself.)

By the way: there are two versions of TPM, and if I understand correctly, they’re very different. TPM2 is not just a superset of TPM1. We’re going to use TPM2 here.

SecureBoot with Linux

I’m a little bit unsure about the low-level implementation details here. I don’t know if the boot loader is the one loading the boot files (kernel, initrd, microcode…), verifying their signatures, and reporting the PCR measurements to the TPM; or if the file loading actually goes through some UEFI function calls that take care of the verification and update the PCR. Either way, with systemd-boot, the recommended way seems to be to build a Unified Kernel Image (UKI). A UKI is an executable file that bundles together everything that is needed to boot the system (kernel, initrd, microcode, kernel command line) and can be loaded (and executed) directly by the UEFI. As it’s a single file, it can also conveniently be signed, thus ensuring that when we validate the signature and execute it, nothing has been tampered with (nothing has been changed in the initrd or the kernel command line, for instance).

Long story short: we need to generate and sign UKIs.

Enabling it all

Install the SecureBoot packages:

pacman -S sbctl sbsigntools

Check that “Setup Mode” is “Enabled”:

sbctl status

If it’s not, make sure that you didn’t forget to set the BIOS to “Setup Mode” (or “Audit Mode” on my Dell BIOS). If you did, you’ll need to reboot, do that, then come back to the installer. Boo!

Create your own signing keys:

sbctl create-keys

For reference, sbctl places generated keys in /usr/share/secureboot.

Sign the systemd bootloader:

sbctl sign -s \
  -o /usr/lib/systemd/boot/efi/systemd-bootx64.efi.signed \
  /usr/lib/systemd/boot/efi/systemd-bootx64.efi

Enroll your custom keys:

sbctl enroll-keys --microsoft

The --microsoft is useful only if you’re going to dual boot to Windows. You can remove it otherwise.

If you get permission errors, you might have to chattr -i a couple of files, then try again.

Now we need to configure mkinitcpio so that it generates UKIs in addition to “normal” initrds.

First, if you haven’t done it already, edit /etc/mkinitcpio.conf and update the HOOKS line:

HOOKS=(base systemd autodetect microcode modconf kms keyboard sd-vconsole block sd-encrypt filesystems fsck)

Then edit /etc/mkinitcpio.d/linux.preset, and uncomment the default_uki and fallback_uki lines. Change the paths there from /efi to /boot, since if we have a separate EFI System Partition, it will generally be too small anyway.

Note: if /etc/mkinitcpio.d/linux.preset doesn’t exist, make sure that the mkinitcpio and linux packages are installed. Re-install them if necessary. (That happened to me once when I was tinkering around.)

Build the new initrd and our new Unified Kernel Images:

mkinitcpio --allpresets

We can now sign all these EFI binaries:

sbctl sign -s /boot/EFI/Linux/arch-linux.efi
sbctl sign -s /boot/EFI/Linux/arch-linux-fallback.efi
sbctl sign -s /efi/EFI/systemd/systemd-bootx64.efi
sbctl sign -s /efi/EFI/Boot/bootx64.efi
sbctl verify

The -s (or --save) flag means that sbctl will store that file’s location in its database, so that we can re-sign everything later (e.g. after a kernel upgrade) with sbctl sign-all. (We won’t have to do that ourselves; this is automatically done by mkinitcpio hooks.)

If you’re dual booting and see errors about Microsoft stuff not being signed, don’t worry, that’s normal: sbctl only verifies with our keys here.

We can now reboot using the new EFI images.

At this point we’ll still need to give our root volume password when booting, but the next step will be to use a key in the TPM instead.

Enrolling a TPM key

# Install the TPM tools
pacman -S tpm2-tools

# Check the name of the kernel module for our TPM
systemd-cryptenroll --tpm2-device=list

# Generate a recovery key (not mandatory but strongly recommended)
systemd-cryptenroll --recovery-key /dev/gpt-auto-root-luks

# Generate a key in the TPM2 and add it to a key slot in the LUKS device
systemd-cryptenroll --tpm2-device=auto /dev/gpt-auto-root-luks --tpm2-pcrs=7

# This is the command to use later, to remove the (insecure) initial password
#systemd-cryptenroll /dev/gpt-auto-root-luks --wipe-slot=password

Note: --tpm2-pcrs=7 means that the key will be available only with the current Secure Boot state. In other words, if Secure Boot is disabled, or if Secure Boot keys are altered, the key won’t be available. This means that if you turn off Secure Boot to boot a rescue ISO, the key won’t be available. On the other hand, it doesn’t measure the kernel and initrds, so if you upgrade your kernel, the key will still be available. Some folks might decide to put an even more restrictive set of PCRs here, but it will then require more work when upgrading kernels. Check the systemd-cryptenroll(1) manpage for some details.
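The trade-off between binding to PCR 7 only and binding to more PCRs can be illustrated with a toy model. This is not the TPM2 policy mechanism itself, just a simplified stand-in: the “policy” is a hash of the selected PCR values at enrollment time, and the key is released only if the same hash can be recomputed at boot. All PCR contents and function names below are invented for the example.

```python
import hashlib


def policy_digest(pcrs: dict, selected: list) -> bytes:
    """Hash the values of the selected PCRs, in index order."""
    h = hashlib.sha256()
    for index in sorted(selected):
        h.update(pcrs[index])
    return h.digest()


def would_unseal(sealed_policy: bytes, pcrs: dict, selected: list) -> bool:
    return policy_digest(pcrs, selected) == sealed_policy


# Made-up PCR contents at enrollment time.
at_enrollment = {
    7: hashlib.sha256(b"secure boot on, my own keys").digest(),
    11: hashlib.sha256(b"current kernel UKI").digest(),
}
after_kernel_upgrade = {**at_enrollment, 11: hashlib.sha256(b"upgraded kernel UKI").digest()}
secure_boot_off = {**at_enrollment, 7: hashlib.sha256(b"secure boot off").digest()}

only_7 = policy_digest(at_enrollment, [7])
both = policy_digest(at_enrollment, [7, 11])

print("PCR 7 only, after kernel upgrade:", would_unseal(only_7, after_kernel_upgrade, [7]))
print("PCR 7 only, Secure Boot off:     ", would_unseal(only_7, secure_boot_off, [7]))
print("PCR 7+11, after kernel upgrade:  ", would_unseal(both, after_kernel_upgrade, [7, 11]))
```

In this model, the stricter 7+11 binding stops working after every kernel upgrade, which is the “more work” mentioned above: the key would have to be re-enrolled (after unlocking with the password or recovery key) each time.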

Check if your TPM requires a kernel module:

lsmod | grep tpm

If your TPM requires a kernel module, edit /etc/mkinitcpio.conf one more time and edit the MODULES line to add the module used by your TPM (as identified above). For instance:

MODULES=(tpm_tis)

Run mkinitcpio --allpresets one more time, reboot, and this time you shouldn’t have to enter a password to unlock the root volume!

Wrapping up

Some steps can probably be simplified a bit. In particular, we’re running mkinitcpio a lot of times. We could check the name of the TPM module before rebooting, and add the module to mkinitcpio.conf earlier. (That’s actually what I do when installing my systems.) I kept instructions in that order because that way, things are grouped in a more logical way and (I think) it’s easier to understand if you’re new to all this.

Finally, SecureBoot is not absolutely unbreakable. There are attacks against it. If you intend to store extremely sensitive data (e.g. military) in a volume encrypted with a key stored in a TPM, you should do some research beforehand. (But I hope that in that case, you’re not following my blog for advice. That would be worrisome. :)) It’s good enough for my use case, though (making sure that my data won’t be readable by Dell technicians or second-hand hardware brokers who would end up with my dead laptop main board or its soldered-on disk).

Debugging Django Performance

2023-06-12T00:00:00+00:00

This is a story of how we identified a performance issue in EphemeraSearch. The performance issue itself isn’t very interesting (there is a very low chance that you will run into the same issue), but the methodology that we used might be useful if you need to troubleshoot performance issues in Django.

The problem

EphemeraSearch is an archive of old mail. It currently focuses on postcards. The postcards are scanned and made available online through the website. (It’s free by the way; so if you’re a genealogy or history fan, feel free to browse!) Postcards are then transcribed and indexed, meaning that you can search for the name of an ancestor, or a place where they lived, and if we have some of their postcards, they might turn up.

Recently, we’ve rolled out eBay integration, meaning that people selling old postcards on eBay can automatically add their collections to our website with very little effort. This was a big success, and resulted in thousands of new postcards being added to the site. Yay! 🎉 But at the same time, performance started to degrade. Boo! 😭

When we hit the point where just displaying a postcard would take a few seconds, it was clear that we had to find the cause of the problem and address it.

At first, it looked like an index problem. However, while we couldn’t be sure that the indexes were perfect, loading individual postcards from the database (for instance with the Django ORM) was extremely fast. So what could it be?

We had to dig deeper!

First investigations

After digging a bit in our APM, we quickly identified one request in particular that was way slower than expected: the API request to fetch a single postcard.

This should take 10-100ms maybe; not 500ms-1.5s; and it could definitely explain the slowdowns that our users were seeing.

We looked at the flame graph (showing which function calls were taking how long), but unfortunately, we were stumped pretty quickly:

A lot of time seems to be spent in that api.viewsets.get_serializer method, but it’s not clear why. Trying to drill down in that function showed a bajillion (really, hundreds of thousands!) of small individual calls but nothing that seemed obvious.

It turns out that the problem was right there; but we didn’t see it at first, unfortunately!

We had to dig even deeper.

Reproducing locally

At that point, we felt like we were hitting the limits of what we could do merely by tinkering with production or pre-production environments. We had to reproduce the issue locally!

We established a baseline (the output below has been edited for clarity):

$ hey -n 10 -c 1 "http://localhost:8000/api/ephemera/..."

Summary:
  ...
  Average:  	0.3526 secs
  ...

We definitely recommend checking out hey by the way; it’s a bit like a modern ab (Apache Benchmark), letting you quickly get neat histograms of the latency of a service.

Note that the same request that took about 800ms on our production platform (powered by Heroku) took only 350ms locally. This can be a bit surprising at first: why would a local development machine be faster than a good hosting server? There can be multiple explanations:

our local development machine isn’t very busy otherwise, and lots of Intel/AMD CPUs can boost their clock when running a single core, often resulting in nice single-thread performance compared to a server with dozens of cores but truly using all of them at the same time;
on our local development platform, the database is local, and the lower latency can yield performance improvements, especially if there are many database requests (each requiring a full round-trip to the database);
…

Later, we found out that the issue was CPU bound; which probably explains why, retrospectively, our local environment performed relatively well compared to the production platform.

Research

Our web frontend isn’t “pure” Django. It’s a React app that communicates with an API, itself implemented on top of Django using Django REST Framework (or DRF).

DRF is pretty amazing because it makes it convenient to implement Django views providing REST APIs in very customizable ways, especially when objects reference (or include) other objects and you want to control how things are serialized (to JSON in our case) and validated (when handling POST, PUT, or PATCH requests, for instance).

We did a bit of research, and found two great resources about DRF performance:

Improve Serialization Performance in Django Rest Framework by Haki Benita,
Web API performance: profiling Django REST framework by Tom Christie.

We learned a lot by reading these blog posts; but… they didn’t take us closer to solving our bug. They even left us more confused. Specifically, Tom Christie’s post showed requests where database queries took 66% of the time before optimizations, and 80% after optimizations. In our scenario, database queries take only 20% of the time.

Obviously we were doing something very wrong!

Profiling

We tried to evaluate the performance of DRF serializers (carefully importing the right modules, instantiating the right classes, etc) but in the end, we found the performance to be acceptable and in line with expectations. (We won’t reproduce the code here because it wasn’t super helpful.)

At this point, we decided on two things.

We wanted to reproduce the request execution in a Python interpreter, as closely as possible to the real thing.
Then we would use the Python profiler to find hotspots.

This is the technique that we came up with, to reproduce our request.

Note that the following code was executed in Django’s shell_plus environment.

import django.test
import django.urls

# This is our test request, broken down with URI + Query String.
uri = "/api/ephemera/13373/"
qstring = "?img=2&expand=~all&referrer=/ephemera/13373/"

# Let's build a request object...
factory = django.test.RequestFactory()
request = factory.get(uri+qstring)

# Now invoke Django's URL router to find the view
# that is supposed to handle the request...
resolvermatch = django.urls.resolve(uri)

# Then invoke that view.
resolvermatch.func(request, *resolvermatch.args, **resolvermatch.kwargs)

# We can get a performance baseline with IPython %timeit:
%timeit resolvermatch.func(request, *resolvermatch.args, **resolvermatch.kwargs)

Hopefully, the snippet above should work for most GET requests, regardless of the Django apps and frameworks that you’re using. There are two drawbacks, though:

it shortcuts most (if not all?) of Django’s middlewares,
it probably requires some adaptation if you want to benchmark or profile anything else than GET requests.

In our case, we didn’t care about the middlewares, because our APM showed that most of the time seemed to go in our app - or, to phrase things differently, that the middlewares’ overhead was hardly measurable.

The next step was to profile our request, to see where CPU time was going - hopefully with more details than in our APM tool.

This can be done like this:

import cProfile
cProfile.run(
  "for i in range(10): resolvermatch.func(request, *resolvermatch.args, **resolvermatch.kwargs)",
  sort="tottime"
)

When I was profiling Python code back in college (more than 20 years ago🫣) this is more or less what I was doing; but we now live in more modern times, and there are some pretty neat tools to show visual representations of Python profiling information; for instance SnakeViz and Tuna.

They’re both one pip install away; and after installing them, we could do:

# Write the profiling information to a file
cProfile.run(
  "for i in range(10): resolvermatch.func(request, *resolvermatch.args, **resolvermatch.kwargs)",
  "/tmp/pstats"
)

# Execute snakeviz or tuna on that file
# (this will run a web server and open a page in your browser)
!snakeviz /tmp/pstats
!tuna /tmp/pstats

(If you’re surprised by this ! syntax: we’re using IPython here, and this just means “shell out and execute snakeviz or tuna in a subshell”.)
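If a browser isn’t handy (say, when working on a remote machine over SSH), the same stats file can also be read back with the standard library’s pstats module. Here is a minimal sketch of that idea; slow_view is a made-up stand-in for our resolvermatch.func(...) call, and the temporary file path is arbitrary.

```python
import cProfile
import os
import pstats
import tempfile


def slow_view():
    """Hypothetical stand-in for the view call that we want to profile."""
    return sum(i * i for i in range(50_000))


# Write the profiling information to a file, as in the snippet above.
stats_path = os.path.join(tempfile.mkdtemp(), "pstats")
cProfile.runctx(
    "for i in range(10): slow_view()",
    globals(),
    locals(),
    filename=stats_path,
)

# Load the file again and show the 5 entries with the highest own time.
stats = pstats.Stats(stats_path)
stats.sort_stats("tottime").print_stats(5)
```

This prints the same kind of table as cProfile.run with sort="tottime", but from a file that can be kept around, copied to another machine, or compared with a later run.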

This is what SnakeViz showed us:

The thing that caught our attention was the 231,590 calls to base.py:__init__ that you can see at the bottom of the screen.

Why the heck did we create 231,590 new instances?!? And, new instances of what?!?

Retrospectively, at that point, the answer was right in front of us - but we didn’t see it quite yet. (If you see it, ~~leave a comment, subscribe and hit the bell~~ congrats! 🎊🎖️🐍)

Print to the rescue

There were certainly more elegant ways to find what was going on, but we decided to edit base.py (specifically, django/db/models/base.py located in our virtualenv lib/python-XXX/site-packages) to sprinkle some print statements at the beginning of that constructor:

class Model(AltersData, metaclass=ModelBase):
    def __init__(self, *args, **kwargs):
        print(self.__class__, args and args[0])
        ...

This would show us which Model subclass was instantiated, and its arguments.

Then, we repeated our test request.

And to our greatest surprise, we saw that when accessing a single postcard, our code was actually instantiating every single postcard in the database. The output was filled with lines like this, for every single postcard (they correspond to the Ephemeron model):

...
<class 'models.Ephemeron'> 896
<class 'models.Ephemeron'> 895
<class 'models.Ephemeron'> 894
<class 'models.Ephemeron'> 893
...

Again, hindsight is 20/20: the number of calls to base.py:__init__ corresponded very closely to the number of postcards in the database at that point; so we could probably have guessed what was being instantiated. But this one was a clear confirmation rather than (educated) guesswork!
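For the record, a similar tally can be obtained without editing files in site-packages, by wrapping the constructor at runtime. This is not what we did; it’s just a sketch of the idea, and the Model and Ephemeron classes below are plain-Python stand-ins (not the real Django classes), as is the count_instantiations helper.

```python
import functools
from collections import Counter


def count_instantiations(cls):
    """Wrap cls.__init__ so that each new instance (subclasses included) is tallied."""
    counts = Counter()
    original_init = cls.__init__

    @functools.wraps(original_init)
    def counting_init(self, *args, **kwargs):
        counts[type(self).__name__] += 1
        original_init(self, *args, **kwargs)

    cls.__init__ = counting_init
    return counts


# Hypothetical stand-ins for django.db.models.Model and one of our models.
class Model:
    def __init__(self, pk=None):
        self.pk = pk


class Ephemeron(Model):
    pass


counts = count_instantiations(Model)
for pk in range(3):
    Ephemeron(pk)
Model(99)
print(counts.most_common())
```

A per-class count is less chatty than one print per instance, which helps when a single request creates hundreds of thousands of objects.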

Breakpoint

The next step was to find out what was creating all these instances. We went for a low-tech but super effective method: good old breakpoint()!

We changed the constructor like this, so that when it hit an object with a pk of 20000, it would break to the debugger, giving us an opportunity to check the stack trace and the sequence of callers:

class Model(AltersData, metaclass=ModelBase):
    def __init__(self, *args, **kwargs):
        print(self.__class__, args and args[0])
        if args and args[0]==20000: breakpoint()
        ...

Note that we didn’t put a straight, unconditional breakpoint here, because there are many instances that get created before we hit the “problematic” ones. This was a way to make sure that we’d trigger exactly for what we were looking for, instead of having to repeatedly press c (for continue) in the debugger.

After that, we sent one more test request. Our breakpoint was hit!

We walked up and down the stack (with up and down) until we saw this:

ipdb> l
    285             logger.critical(f"unknown action requested: {self.action}")
    286
    287         #  logger.verbose(f"{self} getting serializer {self.serializer_class}")
    288         ret = super(DynSerializerModelViewSet, self).get_serializer(*args, **kwargs)
    289         try:
--> 290             if (
    291                 self.queryset
    292                 and ret.Meta.model != self.queryset.model
    293                 and settings.DJANGO_ENV == "development"
    294             ):
    295                 embed(header="wrong models")

The problem came from the if self.queryset. This tries to interpret the QuerySet as a boolean value (or, to say it differently: it casts the QuerySet to a boolean value). It does that by calling the __bool__() method on the QuerySet. If we look again at the SnakeViz screenshot above, we’ll see that __bool__() call. In other circumstances, that might have been a dead giveaway. In that case, we missed it. Also, for unknown reasons, it didn’t show up in the APM tool.
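To make the mechanism more concrete, here is a toy model of a lazily evaluated result set. It is emphatically not Django’s code: LazyRows, its attributes, and the row count are all made up for the illustration. It only mimics the behavior described above (truth-testing loads everything first), next to an exists()-style check that doesn’t need to build any object. Django’s QuerySet does offer an exists() method for that purpose.

```python
class LazyRows:
    """Toy stand-in for a lazy query result (NOT Django's implementation)."""

    def __init__(self, row_count):
        self.row_count = row_count
        self.instances_built = 0
        self._cache = None

    def _fetch_all(self):
        # Build one object per row; this is the expensive part.
        if self._cache is None:
            self._cache = []
            for pk in range(self.row_count):
                self.instances_built += 1
                self._cache.append({"pk": pk})

    def __bool__(self):
        # Truth-testing materializes every row before answering.
        self._fetch_all()
        return bool(self._cache)

    def exists(self):
        # Cheaper check: no object gets built at all.
        return self.row_count > 0


rows = LazyRows(1_000)  # arbitrary number of rows

if rows.exists():
    print("after exists():", rows.instances_built)

if rows:
    print("after bool():  ", rows.instances_built)
```

In the toy model, the first check builds nothing, while the second one builds all 1,000 objects just to answer a yes/no question.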

That bit of code was leftover debugging code that wasn’t used anymore (and as you can notice from the rest of the test condition, it only gets used in development anyway) so we removed it, and repeated our tests locally:

$ hey -n 10 -c 1 "http://localhost:8000/api/ephemera/...

Summary:
  ...
  Average:  	0.1070 secs
  ...

100ms instead of 350ms, i.e. 3.5x faster. Not bad!

That piece of leftover debugging code was in the get_serializer method, i.e. exactly what our APM flame graph was showing us in the beginning. Unfortunately, we missed that boolean QuerySet evaluation during our first inspection!

Aftermath

We deployed the “fix” to preproduction, and the request duration was divided by 5:

The fix made it to production, with similar results.

The nice thing is that we can now address other performance problems which were previously hidden by that one. Yay!

So, what did we learn?

Evaluating a QuerySet in a boolean context will evaluate the query (this is clearly mentioned in the QuerySet API reference) and can sometimes construct model instances for every row in the query result. The latter part was unexpected, and might be caused by something weird in our QuerySet. (Naively, we’d have expected the query to merely check if there was at least one result and not construct the whole thing?)

It pays to pay attention to what our tools are telling us. There were some very valuable hints early on in our investigation, but we failed to notice them. But that’s OK, and that’s also why we have multiple investigation tools and techniques: so that we can catch with one the stuff that we missed with another.

Simulating Django requests is relatively straightforward, and the method that we used here (with django.test and django.urls) is probably transposable to other Django requests - whether you use DRF or not. The general idea is also probably transposable to other frameworks and other languages.

Reproducing problems locally can be super helpful, because it allows some crude-but-effective hacks - like instrumenting the Django ORM base model constructor to see which instances were created, or even adding a conditional breakpoint there to see what’s going on.

Leaving dead code is probably not a good idea in the first place; but we’ve all been there - adding some temporary hacks when tracking a really weird bug, and forgetting to remove them later. This is why reproducing problems locally is the way: it saves us the long round-trip (commit to a branch - deploy to staging - test stuff - repeat) and once we have identified the fix, it’s easier to commit just what we need and leave out the rest.

That’s it! Again, while the original bug was very specific to our environment, we hope that the general technique and workflow that we used might be useful to others some day. Thanks for reading; and if you enjoyed this, go check some postcards on EphemeraSearch!

You Belong Here

2022-06-30T00:00:00+00:00

To all my LGBT+ friends, peers, and fellow members of the Kubernetes and Cloud Native communities: you do belong here. No matter what others think or say; privately or publicly; you played (and are still playing) a major role in the success of this community. Many of you, through your code, your docs, your talks, your workshops, your pull requests, your comments thereon, your presence on social media and even in the world in general, have positively influenced and helped me in so many ways that I wouldn’t be able to list them all.

Why does this need to be said? Because in 2022, a member of the Kubernetes Steering Committee publicly came under the spotlight for expressing openly homophobic and transphobic views. Phrases like “God created each person male or female” or “God’s bounds for sexuality are one man and one woman, in marriage”. As a cis man, most of the time, when someone says something like that, I just roll my eyes and put them in the “nutjob” bucket. I’ve also been lucky enough to never have to deal with that kind of bigotry first-hand growing up and then working for various companies and organizations. However, for folks who are transgender or gay, these phrases can be pretty hurtful, for at least a few reasons:

if they live in a conservative environment (family, country), they will hear them regularly;
if these words are said by a community leader, manager, or otherwise authority figure, they will carry more weight;
even if the person who says these sentences doesn’t actually do anything to cause material harm, they will once in a while influence someone enough to take action. (Just look at the rates of crime against gay and trans folks; or don’t, if you don’t want to ruin a good day.)

“Words of affirmation” isn’t my favorite language of love, which means that it can be difficult or awkward for me to express gratitude with a public statement. But I’m trying anyway. (In parallel, I’m also exploring other ways to act and reduce the harm caused by bigotry and intolerance in the spaces in which I participate.)

This is not an exhaustive list, just the first examples that came to mind. I’m not going to name anyone but I sure hope that some of you will recognize yourself there.

When I delivered a training that featured kubebuilder content, the most relevant and useful resources that I found were written by trans and non-binary folks.
When I brushed up my Golang skills for that same training, I reviewed a lot of content/talks written by gay folks.
When I worked at Docker, some of the kindest, most empathetic, and most skilled folks I worked with were trans - some closeted, some out.
When I was trying to learn/understand stuff like Nomad; cert-manager; KinD; Cluster API… guess what, it’s trans folks all the way down again.
When KubeCon went virtual and speakers had to record their talks, there was one talk in particular that truly took advantage of the opportunity and shipped a video production (instead of just recording themselves in Zoom), and I hate to break it to you but it wasn’t presented by cis dudes.
The most skilled experts in Kubernetes security (and perhaps security in general) that I know personally are trans. (I’m not claiming that they’re the best in the world; but they’re folks with whom I had the privilege to share an omakase, a bowl of ramen, or some coffee at multiple conferences; I can’t say that much about e.g. Bruce Schneier, although I’m sure he’s a Great Dude!)

The list could go on and on.

I’m not saying that you are required to demonstrate exceptional levels of expertise as listed above to successfully engage with the Kubernetes and Cloud Native ecosystems. You are welcome and you belong regardless of your skills and involvement.

I’m also not saying that cis/straight folks aren’t doing shit in the community. But I’m saying that if the queer (at the broadest interpretation of that term) folks leave, it’s going to be fucking noticeable. So I would love you to stay, and to keep being awesome and bring your world-class skills to this community. And if there is anything else that you’d like me to do on your behalf, let me know.

Thank you.

Anti-Patterns When Building Container Images

2021-11-30T00:00:00+00:00

This is a list of recurring anti-patterns that I see when I help folks with their container build pipelines, and suggestions to avoid them or refactor them into something better.

And since only a Sith deals in absolutes, keep in mind that these anti-patterns aren’t always bad.

Many of them are harmless when used separately. But when combined, they can easily compromise your productivity and waste time and resources, as we will see.

Big images

It’s better to have smaller images, because they will generally be faster to build, push and pull, and use less disk space and network bandwidth.

But how big is big?

For microservices with relatively few dependencies, I don’t worry about images below 100 MB. For more complex workloads (monoliths or, say, data science apps), it’s fine to have images up to 1 GB. Above that, I would start to investigate.

I wrote a series of blog posts about optimizing the size of your images (part 1, part 2, part 3), so I’m not going to repeat that here; instead, let’s focus on some exceptions to the rule.

All-in-one mega images

Sometimes you need Node, PHP, Python, Ruby, and a few database engines in your image, as well as hundreds of libraries, because your image will be used as a base for a PAAS or CI platform. This is the case on platforms that have just one available image to run all apps and all jobs; then the image needs to have everything installed, of course.

I don’t have magic solutions for this. Keep in mind that you will probably need to support multiple images anyway eventually, so when you introduce support for, say, version selection, you might want to allow selection of smaller images with a tighter focus. Just an idea!

Data sets

Some code (especially in data science) needs a data set to function. It could be a reference genome, a machine learning model, a huge graph on which we’ll do some computation…

It’s tempting to put the dataset in the image, so that the container can “just work” no matter where and how we run it. And if the dataset is small, that’s generally fine.

But if the data set is big (let’s say, more than 1 GB) it will start becoming a problem. Sure, if your Dockerfile is well organized, the model will be added before the code, and the build cache will spare you from adding it again when only the code changes; but if you add the model after the code, it will be a catastrophe. Builds will be slow and use up a lot of disk space; and if code must be tested on remote machines (as opposed to locally), the model will be pushed/pulled every time and use a lot of disk space on the remote machines too. That’s very bad.

Instead, consider mounting the data set from a volume. Assume that your code can access the data it needs on, say, /data.

When you run locally with a tool like Compose, you can use a bind-mount from a local directory (which will act as a cache) and a separate container to load the data. The Compose file would look like this:

services:
  data-loader:
    image: nixery.dev/shell/curl
    volumes:
    - ./data:/data
    command: |
      if ! [ -f /data/dataset ]; then
        curl ... -o /data/dataset
        touch /data/ready
      fi
  data-worker:
    build: worker
    volumes:
    - ./data:/data
    command: |
      while ! [ -f /data/ready ]; do sleep 1; done
      exec worker

The data-worker will wait for the data to be available before starting, and data-loader will download the data to the local directory data. It will download it only once. If you need to download the data again, just delete that directory and run again.

Now, when running e.g. on Kubernetes, we can leverage an initContainer to download the data, with a Pod spec similar to this:

spec:
  volumes:
  - name: data
  initContainers:
  - name: data-loader
    image: nixery.dev/curl
    volumeMounts:
    - name: data
      mountPath: /data
    command:
    - curl
    - ...
    - -o
    - /data/dataset
  containers:
  - name: data-worker
    image: .../worker
    volumeMounts:
    - name: data
      mountPath: /data

Note that the worker container doesn’t need to wait for the data to be loaded, since Kubernetes will start it only after the initContainer is done.

If we run multiple workers per node, we can also use a hostPath volume (instead of an ephemeral emptyDir volume) so that the data only gets loaded once.

Another option is to leverage a DaemonSet to automatically populate that data directory on every node of the cluster ahead of time.

The best option depends on your particular use case. Do you have a single, big data set? Multiple ones? How often do they change?

The big upside is that your images will be much smaller, and they will still behave identically in local environments and in remote clusters, without requiring you to add special code to download or manage the model in your app logic. Big win!

Small images

It’s also possible to have images that are too small. Wait, what’s wrong with an image that would just be 5 MB?

Nothing wrong with the size of the image, but if it’s so small, it might be missing some useful tools, and that might cost you and your colleagues a lot of time when troubleshooting the image.

Images built with distroless or with FROM scratch might be small, but if your team is regularly stumped because they can’t even get a shell in the image to e.g. check which version of a particular file is there, see running processes with ps, or network connections with netstat or ss, what’s the point?

⚠️ This is extremely context-dependent. Some teams never need to get a shell in an image. Or, if you use Docker, you can use docker cp to copy some static tools (e.g. busybox) to a running container and check what’s going on. Or, if you’re working with local images, you can easily rebuild your image and add the tools that you need. Or, if you’re running on Kubernetes, you can enable the ephemeral containers alpha feature. But on most production Kubernetes clusters, you won’t have access to the underlying container engine and you may not be able to enable alpha features, so…

Here is one way to add a very basic toolkit to an existing image. This example shows a distroless image but it should work with other images as well:

FROM gcr.io/distroless/static-debian11
COPY --from=busybox /bin/busybox /busybox
SHELL ["/busybox", "sh", "-c"]
RUN /busybox --install

If you want more tools, there is a very elegant way to leverage Nixery and install your tools without clobbering the existing image. For code deployed on Kubernetes, it’s even possible to add the tools in a volume, so that you don’t need to rebuild and redeploy a new image. If you’re interested, let me know, and I’ll write a follow-up post about that!

Overall, I personally like to build on top of Alpine images, because they’re tiny (Alpine is 5 MB) and once you have Alpine you can apk add whatever you want when you need it. Network traffic acting up? Install tcpdump and ngrep. Need to JSON stuff in and out? curl and jq to the rescue!

Bottom line: small images are generally good, and distroless is honestly some pretty awesome sauce in the right circumstances. If your circumstances are “I can’t get in my container and I’m resorting to adding print() statements to my code and pushing it all the way through CI to staging because I can’t kubectl exec ls”, you might want to reconsider. Just saying!

Zip, tar, and other archives

(Added December 15th, 2021.)

It is generally a bad idea to add an archive (zip, tar.gz or otherwise) to a container image. It is certainly a bad idea if the container unpacks that archive when it starts, because it will waste time and disk space, without providing any gain whatsoever!

It turns out that Docker images are already compressed when they are stored on a registry and when they are pushed to, or pulled from, a registry. This means two things:

storing compressed files in a container image doesn’t take less space,
storing uncompressed files in a container image doesn’t use more space.

If we include an archive (e.g. a tarball) and decompress it when the container starts:

we waste time and CPU cycles, compared to a container image where the data would already be uncompressed and ready to use;
we waste disk space, because we end up storing both the compressed and uncompressed data in the container filesystem;
if the container runs multiple times, we waste more time, CPU cycles, and disk space each time we run an additional copy of the container.

If you notice that a Dockerfile is copying an archive, it is almost always better to uncompress the archive (e.g. using a multi-stage build) and copy the uncompressed files.
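If you want to convince yourself that compressing files before adding them to an image doesn’t buy anything, here is a small experiment. It’s only a sketch under a few assumptions: a layer is modeled as a gzip-compressed tar archive holding a single file, the payload is made-up text, and make_layer is a helper written for this example (not part of any Docker tooling).

```python
import gzip
import io
import random
import tarfile


def make_layer(filename, content):
    """Pack one file into a gzip-compressed tar archive, roughly like an image layer."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=filename)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Made-up, compressible payload: random words picked from a small vocabulary.
rng = random.Random(0)
vocabulary = [f"word{n}" for n in range(50)]
payload = " ".join(rng.choice(vocabulary) for _ in range(100_000)).encode()

# Same data, added to a layer either as-is or as a pre-compressed archive.
plain_layer = make_layer("dataset.txt", payload)
archive_layer = make_layer("dataset.txt.gz", gzip.compress(payload))

print("raw payload:        ", len(payload))
print("layer, plain file:  ", len(plain_layer))
print("layer, gzipped file:", len(archive_layer))
```

The two layer sizes should come out very close to each other, and both well below the raw payload size: the layer compression already does the work, so the pre-compressed archive only adds an extra decompression step when the container starts.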

Rebuilding common bases

It’s pretty common to have a common base image shared between multiple apps, or multiple components within the same app. Especially when you have a bunch of non-trivial dependencies and they take a while to build; it sounds like a good idea to shove them in a base image, and reference that image from our other images.

If that image takes a long time to build (say, more than a few minutes), I recommend that you store that base image in a registry, and instead of building it locally, pull it from that registry.

Why?

Reason #1: pulling an image is almost always faster than building it. (Yes, there are exceptions, but trust me, they’re pretty rare.)

Reason #2: since this is the base on top of which everything else gets built, you probably want to make sure that you have a very specific set of versions in that image; otherwise we’re back to problems like “works on my machine” - exactly what we were trying to avoid by using containers! If everyone rebuilds the base image locally, we need to be extra careful about making that build process deterministic and reproducible: pinning all versions; checking the hashes of all downloads; using && or set -e in all the appropriate places to abort immediately if something fails within a list of commands in the build process. Or, we can simply store the base image in a registry, and now we’re sure that everyone is using the same one. Done.

What if we need to tweak that base image, though? Is there an easy way to do that without pushing a new version of the base image (which shouldn’t be necessary if we only need it locally), or without editing Dockerfiles?

If you’re using Compose, here is an example of a foundation image pattern. It’s a very simple pattern (I don’t think it’ll blow your mind!) but I often see it reimplemented with shell scripts, Makefiles, and other tools, so I thought it could be useful to show that it’s possible to do it with just Compose. If you build one of your apps, it will pull the base image; but if you need a custom base image, you can rebuild that specific image separately with docker-compose build.

Building from the root of a giant monorepo

I don’t have strong opinions for or against monorepos, but if your code lives in a monorepo, you probably have different subdirectories corresponding to different services and containers.

For instance:

monorepo
├── app1
│   └── source...
└── app2
    └── source...

One possibility is to put the Dockerfiles at the root of the repository (or in their own, separate subdirectory), for instance like this:

monorepo
├── app1
│   └── source...
├── app2
│   └── source...
├── Dockerfile.app1
└── Dockerfile.app2

We can then build each service with e.g. docker build . -f Dockerfile.app1. The problem with this approach is that if we use the “old” Docker builder (not BuildKit), the first thing that it does is upload the entire repo to the Docker Engine. If you have a giant 5 GB repo, Docker will copy 5 GB at the beginning of each build, even if your Dockerfile is otherwise well-designed and leverages caching perfectly.

I prefer to have Dockerfiles in each subdirectory, so that they can be built independently, in a small and isolated context:

monorepo
├── app1
│   ├── Dockerfile
│   └── source...
└── app2
    ├── Dockerfile
    └── source...

We can then go to directories app1 or app2 and run docker build ., and it will only need the content of that subdirectory.
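To get a rough idea of how much data the old builder would have to transfer in each case, we can add up the size of the files under a given context directory. This is just a sketch: context_size is a helper made up for this example, it ignores .dockerignore rules entirely, and the throwaway repository and file sizes below are arbitrary.

```python
import os
import tempfile


def context_size(path):
    """Total size in bytes of the regular files under path (ignores .dockerignore)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


# Build a throwaway "monorepo" with two apps of different sizes.
repo = tempfile.mkdtemp()
for app, size in (("app1", 1_000), ("app2", 50_000)):
    os.makedirs(os.path.join(repo, app))
    with open(os.path.join(repo, app, "source.bin"), "wb") as f:
        f.write(b"x" * size)

print("context = repo root:", context_size(repo))
print("context = app1 only:", context_size(os.path.join(repo, "app1")))
```

Building app1 from the root of the repository means sending the files of app2 as well; building it from its own subdirectory only sends what app1 needs.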

However, sometimes, the build process needs dependencies that live outside of the application directory; for instance some shared code in the lib subdirectory below:

monorepo
├── app1
│   └── source...
├── app2
│   └── source...
└── lib
    └── source...

What should we do in this situation?

Solution #1: package the dependencies in their own images. When building the images for app1 and app2, instead of copying that lib directory from the repository, copy it from a lib image or a common base image. Of course, this may or may not be relevant in your situation, because one of the main selling points of monorepos is that a particular commit can describe exactly which version of the code and its dependencies we are using; and this solution can break that.

Solution #2: use BuildKit. BuildKit doesn’t need to copy the entire build context, so it will be much more efficient in that scenario.

Let’s talk more about BuildKit in that context!

Not using BuildKit

BuildKit is a new backend for docker build. It’s a complete overhaul with a ton of new features, including parallel builds, cross-arch builds (e.g. building ARM images on Intel and vice versa), building images in Kubernetes Pods, and much more; while remaining fully compatible with the existing Dockerfile syntax. It’s like switching to a fully electric car: we still drive it with a wheel and two pedals, but internally it is completely different from the old thing.

If you are using a recent version of Docker Desktop, you are probably already using BuildKit, so that’s great. Otherwise (in particular, if you’re on Linux), set the environment variable DOCKER_BUILDKIT=1 and run your docker build or docker-compose command; for instance:

DOCKER_BUILDKIT=1 docker build . --tag test

If you end up liking the result (and I’m pretty confident that you will), you can set that variable in your shell profile.

“How do I know if I’m using BuildKit?”

Build output without BuildKit:

Sending build context to Docker daemon  529.9kB
Step 1/92 : FROM golang:alpine AS builder
 ---> cfd0f4793b46
...
Step 90/92 : RUN (     ab -V ...
 ---> Running in 645af9563c4d
Removing intermediate container 645af9563c4d
 ---> 0972a40bd5bb
Step 91/92 : CMD   if tty >/dev/null; then ...
 ---> Running in 50226973af9f
Removing intermediate container 50226973af9f
 ---> 2e963346566b
Step 92/92 : EXPOSE 22/tcp
 ---> Running in e06a628465b3
Removing intermediate container e06a628465b3
 ---> 37d860630477
Successfully built 37d860630477

starts with “Sending build context…” (in this case, more than 500 kB)
needs to transfer the entire build context at each build
text output is mostly in black and white, except the standard error output of the build stages which is in red
every line of the Dockerfile corresponds to a “step”
every line of the Dockerfile generates an intermediary image (the ---> xxx that we see in the output)
execution is linear (92 steps for this image and all its stages)
build time for this image: 3 minutes, 40 seconds

Build output for the same Dockerfile, with BuildKit:

 => [internal] load build definition from Dockerfile                                           0.0s
 => => transferring dockerfile: 8.91kB                                                         0.0s
 => [internal] load .dockerignore                                                              0.0s
 => => transferring context: 2B                                                                0.0s
 => [internal] load metadata for docker.io/library/golang:alpine                               0.0s
...
 => [stage-19 27/28] COPY setup-tailhist.sh /usr/local/bin                                     0.0s
 => [stage-19 28/28] RUN (     ab -V | head -n1 ;    bash --version | head -n1 ;    curl --ve  0.7s
 => exporting to image                                                                         2.0s
 => => exporting layers                                                                        2.0s
 => => writing image sha256:9bd0149e04b9828f9e0ab2b09222376464ee3ca00a2de0564f973e2f90e0cfdb   0.0s

starts with a few [internal] lines and only transfers what it needs from the build context
can cache parts of the build context across builds
text output is mostly dark blue
Dockerfile commands like RUN and COPY do produce new steps, but other commands (like the EXPOSE and CMD at the end) do not
each step generates a layer, but no intermediary images
execution is parallelized when possible, using a dependency graph (the final image is the 28th step of the 19th stage of that Dockerfile)
build time for this image: 1 minute, 30 seconds

So make sure that you’re using BuildKit: I can’t think of any downside. It should never be slower, and in many cases, it will make your builds much faster.

Requiring rebuilds for every single change

That’s another anti-pattern. Granted, if you use a compiled language, and want to run the code in containers, you might have to rebuild each time you make a code change.

But if you’re using an interpreted language, or if you’re working on static files or templates, it shouldn’t be necessary to rebuild images (and recreate containers) after each change.

Most of the development workflows that I see correctly use volumes, or live update with tools like Tilt; but once in a while, I see someone with e.g. generated Python code, or re-running webpack completely after each change (instead of using the webpack dev server), for instance.

(By the way, if you try to deploy your changes to a development Kubernetes cluster really fast, you should absolutely check Ellen Körbes’ Quest for the Fastest Deployment Time (video and slides). Spoilers, I have enough fingers on one hand to count the seconds between “Save my Go code in my editor” and “that code is now running on my remote Kubernetes clusters”. 💯)

Again, that anti-pattern is not always a big deal. If your build only takes a couple of seconds and the new layers are just a few megabytes, it’s probably alright if you rebuild and recreate containers all the time.

Using custom scripts instead of existing tools

We’ve all done it: the good old ./build.sh (or build.bat). More than two decades ago, when I was doing my bachelor’s degree in computer science, most of my C homework assignments were built with a crappy shell script instead of a Makefile. Not because I didn’t know about Makefiles, but because we worked on both Linux and HP/UX and I kept finding creative ways to shoot myself in the foot with subtle differences between their respective implementations of make. (This might be why I tend to stay away from bashisms today, when I can.)

There are many tools out there providing outstanding developer experience. Compose, Skaffold, Tilt, just to name a few. They have excellent documentation and tutorials, and are used by thousands of developers out there. Some of your developers already know them and know how to maintain Compose files or Tiltfiles.

If our homemade deployment script is just about 10 lines, it’s not doing anything complicated, and can be replaced by a Compose file or Tiltfile. (Keep in mind that if it’s using any external tool like Terraform or a cloud CLI, we need to make sure that this is installed, which will always be at least as much work as “git clone ; docker-compose up”.)

If our homemade deployment script is about 100 lines, it might be doing something more complex. Building an image and then pushing it and then kicking a CI job and then provisioning a staging cluster to test that image, obtaining the address of the cluster to inject it in a local client; that kind of thing; handling many variations and special cases. If it’s 100 lines, there can’t be that many variations, and we’re exactly at the point where everyone will start adding their own particular special case to the script, slowly taking us to the next stage.

If our homemade deployment script has a thousand lines or more, it probably has a lot of custom logic in it, and handles a lot of situations; that’s great! It also means that it now requires you to write documentation, tests, and maybe even run internal training for new hires. Unfortunately, in my experience, these scripts are at least 10x bigger (often more like 100x) than an equivalent Compose file or Tiltfile. They have more bugs, fewer features, and nobody outside your team or organization knows how to use them.

If you work with one of these bigger deployment scripts, my suggestion is to try to remove rather than add code to it. Move the really custom parts to independent, standalone scripts that can run equally well locally or in containers. Replace the non-custom parts with standard tooling. It’s easier to maintain many small scripts rather than a big one.

“But we want to hide the complexity of containers / Docker / Kubernetes from our developers!”

You do you; but I think the best way to empower developers is to hide that complexity behind standard tools, because when they need to dive into the tooling, they can tap into a rich ecosystem instead of having to rely on your internal tooling or platform team.

Forcing things to run in containers

I like running all my stuff in containers, but I think it’s a very bad idea to force folks to run things in containers.

Let’s say that we have a script that uses the gcloud CLI, Terraform, and a few other tools like crane and jq.

On most platforms, these tools are easy to install with your preferred package manager. The script should therefore be able to run locally.

But to make things easier for our developers (and make sure that we use up-to-date versions of these tools), we build a container image with all these tools. Instead of running the script directly, we tell our devs to use that image.

At first, it looks like this just means replacing yadda-deploy.sh with docker run yadda-image. In practice, we will need to expose some env vars, bind-mount some volumes for credentials and code. We might end up writing a new yadda-deploy.sh script (that will do the docker run behind the scenes). And that’s where we can hit trouble.

Compare these two options:

Method #1: to do this task, run the script yadda-deploy.sh. This script requires tools X, Y, and Z to be installed. If you don’t want to install these tools locally, you can run that script in a container by using image yadda/deploy (built using the Dockerfile in this subdirectory) and the following docker or docker-compose command: …

Method #2: to do this task, run the script yadda-deploy.sh. This script requires Docker to be installed.

At first, method #2 seems better, and that’s why so many teams go this route. Look, it’s shorter, and there are fewer requirements! Except it’s missing a lot of details. Method #1 manages to tell you a lot about the requirements in just a few lines. In method #2 you need to open the script to see what it’s doing. Probably an easy task if it’s a small 10-line script; harder if it’s one of these giant scripts that we were discussing in the previous section.

Before shipping this new workflow to our users, a good litmus test is to check how hard it is to make changes to the script and run it. Can we still run the script locally, or is there something that prevents us from doing so?

And this gets worse when we run the script in a remote environment, for instance in CI or on Kubernetes!

Indeed, if our script must call Docker (or Compose), what happens if we try to run that script in an environment that is already containerized? Sometimes we can use Docker-in-Docker in CI, but it’s not always an option; so if our script relies on invoking Docker or Compose, we’re in trouble.

On the other hand, if we’re sticking to “run yadda-deploy.sh in an environment that has packages X, Y, and Z” it’s way easier to do because we already know which packages we need and which image has them.
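
One way to keep those requirements explicit is to have the script check for them itself. The following is only a sketch: the tool list reuses the gcloud, Terraform, crane, and jq example from above, and the function names `missing_tools` and `check_requirements` are made up for illustration.

```shell
#!/bin/sh
# Print the name of every tool (given as arguments) that is not found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Return non-zero (and explain why on stderr) if any required tool is missing.
check_requirements() {
  missing=$(missing_tools "$@")
  if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
    return 1
  fi
}

# Hypothetical requirement list; a real script would probably exit here
# instead of just printing a hint.
check_requirements gcloud terraform crane jq ||
  echo "Install the tools above, or run this script in an image that has them." >&2
```

With a check like this, the same script runs unchanged on a laptop, in CI, or in a container, and it says exactly what it needs when something is missing.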

Using overly complex tools

After recommending that you use tools rather than shell scripts, here is the opposite advice. Don’t add a complex dependency if the problem can be solved with a few lines of script (or with a tool that is already used in the stack).

Example: let’s say that we need to generate a file (configuration or otherwise) from a template and environment variables. In many cases, a here document is sufficient.
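
As a minimal sketch of the here document approach (the variables `APP_NAME` and `APP_PORT`, their defaults, and the function name `render_config` are made up for illustration):

```shell
#!/bin/sh
# Hypothetical settings; fall back to defaults when the environment is empty.
APP_NAME="${APP_NAME:-yadda}"
APP_PORT="${APP_PORT:-8080}"

# The unquoted EOF delimiter makes the shell expand $variables in the body;
# a literal dollar sign has to be written as \$.
render_config() {
  cat <<EOF
name = $APP_NAME
port = $APP_PORT
prompt = \$
EOF
}

render_config
```

Redirecting the output (for instance `render_config > app.conf`) produces the generated file.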

If the template has many $, rather than escaping them, we could use envsubst from the gettext package.

If the variables come from a JSON file instead of the environment, we might prepare them with a tool like jq.

If some variables need to be transformed (e.g. lowercase, remove special characters or spaces, encode or decode base64, compute hashes…), we can install extra tools to do all these transformations before calling envsubst.

Perhaps we also need to support loops? At that point, we might decide to invest in a proper templating engine. That’s where things get really interesting!

If our stack includes a language like Node, Python, or Ruby, there is a good chance that we can find a small package that does what we need. (For instance, in Python, the Jinja2 package provides the j2 CLI tool.) On the other hand, if our stack doesn’t include Python, adding Python just so that we can install Jinja2 feels excessive.

If we are already using Terraform, it has a powerful templating engine that can generate local or remote files. Great! But adding Terraform just for its templating engine might also be a tad much.

(To be honest, if I’m in a very minimal environment and I need to generate fancy templates, I would probably write a script that outputs the whole file that I need, and redirect the output to the file to be generated. But each situation is different!)
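
For instance, a loop that a here document can’t express is easy to write in a script that prints the whole file. This is only a sketch: the `BACKENDS` variable, the addresses, and the nginx-style output are hypothetical.

```shell
#!/bin/sh
# Hypothetical input: a space-separated list of backend addresses.
BACKENDS="${BACKENDS:-10.0.0.1 10.0.0.2 10.0.0.3}"

# Print the whole generated file on standard output, one "server" line
# per backend.
generate_config() {
  echo "upstream yadda {"
  for backend in $BACKENDS; do
    echo "  server $backend:8080;"
  done
  echo "}"
}

generate_config
```

If this were saved as `generate-config.sh`, running `./generate-config.sh > upstream.conf` would write the file in one go.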

We also need to be careful about using tools that are difficult to learn, and/or that very few folks know how to use. Bazel is probably one of the most efficient ways to produce artifacts and run CI on huge codebases, but how many of your colleagues are sufficiently familiar with Bazel to maintain build rules? And when that one person leaves, what will you do? 😬

Conflicting names for scripts and images

Another memory from my early days in computer science: during my first year using UNIX, I kept shooting myself in the foot by calling my test scripts and programs test.

So what?

This is not a big problem in itself; but I was using DOS before. On DOS, if you want to run a program named HELLO.COM or HELLO.EXE located in your current directory, you can run hello directly; you don’t have to do ./hello like on UNIX. So I had customized my login scripts so that . was in my $PATH.

Maybe you see where this is going: instead of running ./test I was running test and ended up calling /usr/bin/test (also known as /usr/bin/[) and wondering why nothing happened (because without arguments, /usr/bin/test doesn’t display anything and just exits).

My advice: avoid naming your scripts in a way that could conflict with other popular programs. Some folks will see it and they will be careful, others might not notice and accidentally run the wrong thing.
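
A quick way to check a candidate name before adopting it is to ask the shell what it already resolves to. This is a minimal sketch; the helper `name_is_taken` and the candidate names are just examples, and the result depends on what is installed on the machine.

```shell
#!/bin/sh
# With no arguments, test prints nothing and reports failure, which is
# exactly why running it by accident is so confusing.
test || echo "test with no arguments printed nothing and failed silently"

# Succeeds if the given name already resolves to a builtin, function, alias,
# or executable found in $PATH.
name_is_taken() {
  command -v "$1" >/dev/null 2>&1
}

for candidate in test cc yadda-deploy; do
  if name_is_taken "$candidate"; then
    echo "$candidate: already taken, pick another name"
  else
    echo "$candidate: looks free on this machine"
  fi
done
```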

This is particularly true with 2-letter commands, because UNIX has so many of them! For instance:

bc and dc (“build container” and “deploy container” for some folks, but also some relatively common text-mode calculators on UNIX)
cc (“create container” but also the standard C compiler on UNIX)
go (conflicts with the Go toolchain)

Building with Dockerfiles

Finally, sometimes, using a Dockerfile to build your image isn’t the best solution. In Moving and Building Container Images, The Right Way, Jason Hall explains in particular how to build and push images containing Go programs efficiently and securely. Spoilers: it’s specific to Go (because Go has an outstanding toolchain), but even if you want to containerize other languages, it’s a good read, I promise.

Jason also mentions Buildpacks. I’m not a huge fan of Buildpacks; perhaps because they remind me of my time at dotCloud, and that after working for half a decade with similar build systems, it felt like a huge relief to work with Dockerfiles. 🤷🏻 But they have real merits, so if you feel like Dockerfiles are too much (or, depending on the perspective, not enough) you should definitely check out Buildpacks.

And more

As I said in the introduction of this series of tips: don’t treat these recommendations as absolute rules. What I’m saying is “hey, careful, if you do this, it can have unexpected consequences; look, here is what I suggest to improve the situation”.

When I deliver container training, I have a whole section about tips & tricks to build “better images” and write “better Dockerfiles”. I wrap it up with the following conclusion:

The point of containers isn’t to get smaller images. The point of containers is to help us ship code faster, more reliably, with fewer bugs, and/or at a bigger scale. Let’s say that you implement multi-stage builds, and you realize that now your tests run slower or are breaking randomly. Roll back, and try to address the main pain point instead! If you spend half of the day waiting for your code to get to staging or production because images take forever to push and pull, then, yes, maybe it’s a great idea to optimize image size. But if it’s not helping you to meet your goals, don’t do it.

Thanks for reading!

Acknowledgements: I would like to thank Dana Engebretson for our conversations, as well as her suggestions and feedback while writing this blog post. If you’re looking for a consultant to support machine learning workflows, you should absolutely reach out to her!

Selective misinformation: brief.me and nuclear power

2020-07-31T00:00:00+00:00

I subscribe to a newsletter that summarizes the daily news. I thought it was objective, but for some time now, I’ve had doubts. I’m sharing them to start a conversation.

I have been a brief.me subscriber for a few years. The principle is simple: every day, you receive by email a summary of the news, on almost every topic.

The editorial team seems to make a significant effort to stay neutral and present a diversity of viewpoints, without passing judgment. I would be unable to associate it with any political leaning. There are sometimes more clear-cut positions on some topics, but in a column titled “C’est leur avis” (“That’s their opinion”), which gives the floor to a clearly identified expert. In short, the style, frequency, and length of brief.me suit me perfectly to know what’s happening in France now that I no longer live there. (And to clarify: I think it’s also a very good fit for people who live in France and want a good overview of the news, even if it means digging deeper into some topics afterwards. I’m not saying that it’s “lightweight”, far from it.)

For most of the topics covered by brief.me, I don’t consider myself competent. Except when it’s about new technologies, or possibly some scientific fields that I’m interested in enough to do regular personal research; but that’s anecdotal. So overall, I trust them.

Recently, several articles have called that trust into question. These articles dealt with nuclear energy in France. It’s a topic I’ve been interested in for a while, in the context of the fight against global warming.

Astrid and nuclear waste

In its September 12, 2019 edition, brief.me talks about the “hidden extra cost of nuclear waste”, relaying a Greenpeace report. This came a few weeks after the shutdown of the Astrid reactor project, a prototype fast-neutron reactor. Fast-neutron reactors are harder to develop (so there are very few in service today). On the other hand, they make it possible to recycle nuclear fuel several times, and even in some cases to “burn” some nuclear waste.

(For more information on the topic, I recommend either this excellent article on sodium-cooled reactors, or this article on the reprocessing of plutonium and minor actinides, which is based on the Russian BN-800 reactor.)

I’m surprised to see a sidebar on the cost of waste that doesn’t mention that the State has just stopped a research project that could precisely have helped … reduce that waste.

I contacted the brief.me editorial team to find out more:

I was surprised to see the brief on Greenpeace’s article about nuclear waste, without it being put in perspective with the shutdown of the ASTRID project, whose goal was to reduce that very nuclear waste, and which Greenpeace opposed, precisely.

In general, I expect a fairly neutral analysis from you, and quoting an anti-nuclear organization like Greenpeace doesn’t seem that neutral to me when there is no counterpoint.

I got a detailed answer:

Thank you for your message! We could have put the information we covered about Greenpeace in perspective with the shutdown of the Astrid project if we had addressed the topic in a “Tout s’explique”, with more room, because we would indeed have had to say that Greenpeace had opposed the project, but not because it promised to reduce nuclear waste. The NGO opposed it because of the use of plutonium. It’s an interesting debate, but one that requires a full treatment. Our “Ça alors”, however, was about what should be considered nuclear waste and what can be reused. Since Greenpeace is an activist organization, we took its statements with a critical distance and made sure to check whether its claims were well-founded. We relied on the Cour des comptes, which is not an activist body. As it raised points going in the same direction as Greenpeace, we sought a counterpoint from EDF. We called the company. It asked us to send an email, but ultimately did not reply. We nevertheless looked up its responses to the Cour des comptes and its statements to AFP. They are not perfectly clear for the general public, but we made sure to report the fact that EDF says it has the situation in hand.

It reassured me to see that the information had been checked, even if this misses an important detail: Astrid would precisely have made it possible to reduce plutonium stockpiles, and should therefore be welcomed by people who worry about the proliferation risks associated with nuclear power. But I admit that all this can be a bit technical.

Feature on nuclear power in France

On February 29, 2020, brief.me ran a feature on nuclear power in France. The feature talks about the construction of the first plants in the 1960s and the various protest movements (during the construction of Fessenheim, then around the waste burial site); it also talks about various countries abandoning nuclear power after the Three Mile Island, Chernobyl, and Fukushima accidents.

There’s just one slight catch: in that feature, the words carbon or CO2 don’t appear even once, let alone global warming. That’s a problem, because nuclear energy is today the only solution we have to reduce the carbon footprint of electricity production, and therefore fight global warming.

(Careful: I’m not saying that nuclear power is the only solution to fight global warming; but that it’s the only solution available today to decarbonize electricity production. Electricity production represents only a part of our energy consumption and of our CO2 emissions.)

Let’s make a few technical digressions (which I hope are easy to follow) to explain all this.

Nuclear power and carbon

“If we produced our electricity exclusively with solar, wind, or hydroelectric power, it would produce no (or very little) CO2!” Unfortunately, that’s not possible.

Hydroelectric power requires suitable terrain, allowing for the construction of dams, for instance. That’s the case in Norway, which produces 95% of its electricity this way. However, in countries like France or Germany, we’ve already installed dams and power stations pretty much everywhere we could. We’re maxed out!

Solar and wind are intermittent energy sources. When there’s no wind at night, they produce nothing. So we need to build many more solar panels and wind turbines (that’s doable) and be able to store energy (that, we don’t yet know how to do at a sufficient scale). For now, the only large-scale energy storage system that we know how to build is pumped-storage hydroelectricity, known in French as STEP (Station de transfert d’énergie par pompage). A pumped-storage plant is a power station located next to a large water reservoir at a higher elevation. (To give an idea of the orders of magnitude: we’re not talking about a swimming pool on stilts, but rather a lake at the top of a mountain.) When there’s surplus electricity, the station pumps water uphill. Then, when we need to give the electricity back, we let the water flow back down, which spins turbines. Here too, we’re limited by the terrain (basically, you need to be able to install a gigantic pool, the size of a lake, as high up as possible). And here too, we’re (almost) maxed out: we’ve installed them nearly everywhere we could.

If you’re not convinced by this argument, I invite you to check out electricity map, a site showing in real time the carbon footprint of electricity production, country by country. We can see that the champions are:

Norway, which produces almost all of its electricity from hydropower,
Sweden, which produces about 40% of its electricity from hydropower, 40% from nuclear plants, and the rest from wind and biomass,
France, which produces about 70% of its electricity from nuclear power.

As for Germany, which has undertaken an ambitious energy transition (the famous Energiewende), it’s in the middle of the pack, despite considerable investments. Starting in 2010, Germany spent more than 100 billion euros per year on the Energiewende. That amount is constantly increasing. The cost of the Energiewende reaches more than 200 billion in 2020, half of which corresponds to subsidies for wind power production.

(To clarify: this money isn’t used to fund research to improve wind turbines or electricity storage, or even to build wind turbines, but to pay the investors who finance the construction of wind turbines.)

All that for electricity that generates 10 times more CO2 than French electricity. Why? Because when wind and solar aren’t producing, you have to run thermal power plants that burn fossil fuels. In France, in the depths of winter, we top up with 10% fossil fuels, mostly gas. In Germany, a third of the electricity comes from coal-fired plants, whose carbon footprint is catastrophic. That third is enough to spoil everything that solar and wind gain on the other side.

(To get an idea: to produce 1 kWh of electricity, a coal-fired plant emits 1000 g of CO2, a nuclear plant only 6 g. These figures take indirect emissions into account: extraction and reprocessing of nuclear fuel, for instance.)

Once you know that, you can of course have your own opinion for or against nuclear power, but in a feature on nuclear power, not talking about the low-carbon aspect of this energy seems totally dishonest to me.

What does brief.me think?

Here is what I sent to brief.me at the time:

I was surprised by your feature on nuclear power in France. I found it “subtly anti-nuclear”. Let me explain:

juxtaposing the longest lifetime (in thousands of years) and the total volume of waste (in millions of m3) suggests that there is a large volume of long-lived waste (whereas it represents only a tiny part of the total volume),

including an anti-nuclear quote without presenting an alternative viewpoint,

mentioning the countries that are phasing out nuclear power seems to imply that this is the natural direction of things,

completely ignoring the low-carbon aspect of this energy seems odd in the current context of global warming.

I say “subtly anti-nuclear” insofar as it isn’t blatant, and when you don’t know the topic, your feature can seem impartial.

Here is the answer I got:

It’s always complicated to make a topic accessible to as many people as possible while summarizing it. We don’t think we lacked neutrality, however, but we agree with you that we could at least have recalled in one sentence that nuclear energy is one of the energy sources with the lowest CO2 emissions, even though the article wasn’t about the impact of nuclear power on global warming. Regarding waste, we do mention that only “some take several hundred thousand years to lose their radioactivity”, whereas the rest of the figures are about all of the waste produced. It is then up to each reader to interpret what the exit from nuclear energy by several countries means: right direction, wrong direction? In any case, these decisions have multiplied lately, mainly because of nuclear accidents.

I wasn’t convinced, and that’s when I started thinking about writing this article.

Bis repetita

In its July 4, 2020 edition, brief.me talks about the decommissioning of nuclear power plants. And once again, it talks about the cost (very high), the time it takes (very long), the Fukushima disaster … and not a word about carbon.

Not a word either about the fact that the cost of decommissioning is entirely borne by EDF. If it costs 1 or 10 billion more, EDF foots the bill. You’re going to say, “yes, but EDF will pass that on to the price of electricity!”

First of all, EDF is required to set aside considerable financial reserves for this decommissioning. These reserves are often called into question, with arguments that are sometimes technical (if we realize that decommissioning will be more complex than expected), sometimes financial (because these reserves take into account inflation and interest rates, which can change significantly over the time scale that we’re talking about).

Next, the cost of electricity in France is rather low: 20% lower than the European average, and almost half the price in Germany. Let’s do a quick back-of-the-envelope calculation: in 2017, electricity consumption in France was 474 TWh. That’s 474 billion kWh. The kWh for residential customers is currently around 17 cents (30 cents in Germany). If we increase the price of the kWh by 1 cent (bringing it to 18 cents per kWh), we free up 4.7 billion euros per year. In other words, we have some margin.
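
Written out, the back-of-the-envelope calculation looks like this. It reuses only the figures from the paragraph above and, like the original estimate, applies the 1 cent increase to all 474 TWh consumed, which is a simplification.

```latex
\begin{align*}
474~\text{TWh} &= 474 \times 10^{9}~\text{kWh} \\
474 \times 10^{9}~\text{kWh} \times 0.01~\text{EUR/kWh} &= 4.74 \times 10^{9}~\text{EUR per year}
\end{align*}
```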

The case of Fessenheim

Since we’re talking about decommissioning, let’s talk about the circumstances of the shutdown of the Fessenheim plant. There are two possible readings:

this plant was the oldest French nuclear power plant, and it was no longer safe,
the closure of this plant was an electoral promise by François Hollande to win over the vote of EELV (Europe Écologie Les Verts), and has no technical justification.

It turns out that a nuclear power plant has no age limit. When we build it, we make sure that it will be able to operate for at least a certain time, and for as long as possible, because this kind of plant represents a considerable investment, but then has a very low operating cost. A bit like buying a very expensive car that then consumes almost nothing. We want to make sure it won’t break down right away. The engineers who designed the Fessenheim reactors did their calculations so that they would last at least 40 years. At least. When the plant was built, nobody knew how some parts would behave over time, so considerable safety margins were taken at every level. Especially for the reactor vessel, because it’s an important component for safety, and it’s one of the only components that can’t be replaced (short of completely rebuilding the reactor). Then, every 10 years, there’s the “visite décennale” (ten-year inspection), a kind of very thorough roadworthiness test, to see where things stand. It turns out that on almost all reactors, we find that the wear on that vessel is less than expected. And then we find techniques to wear it out less, for instance by arranging the nuclear fuel differently, so as to minimize the irradiation of the vessel. As a result, we can see that the plant can in fact last 50, 60 years, or even more, while remaining as safe as when it was designed (or even safer, because standards evolve).

If you want to know more about this topic, I recommend this excellent article on the lifetime of nuclear power plants, which covers both the technical and regulatory questions.

The same thing happens with, for instance, some NASA space missions. The Cassini mission (which was meant to explore Saturn, its moons, and its rings) was supposed to last from 2004 to 2008, and was extended until … 2017. The Spirit and Opportunity rovers arrived on Mars in 2004, for a mission that was supposed to last three months. They kept working for roughly 6 and 14 years respectively. As for Curiosity, which arrived on Mars in 2012, its mission was supposed to last two years. Eight years later, it’s still there and fully operational. All that because materials and components held up better than expected, or because nobody knew exactly what would be needed, so they planned generously. Once there, we realize that this margin will make it possible to extend the mission.

(And as a fun fact: since the Spirit and Opportunity rovers had several problems caused by dust settling on their solar panels, it was decided to equip Curiosity with an RTG, a nuclear electric generator.)

Nuclear safety

“Okay, but here we’re talking about a nuclear power plant. In case of disaster, the consequences are more serious than for the failure of a space mission on Mars or Saturn!”

It turns out that even counting the Chernobyl disaster, nuclear energy is one of the safest in the world. Depending on who does the math (and how), it is about as safe (or a bit more, or a bit less) as solar or wind energy.

“What?!?”

It sounds silly, but installing and maintaining solar panels and wind turbines is dangerous. Workers regularly fall off a roof (for solar) or off a wind turbine. And a wind turbine is tall.

Besides, a wind turbine or a solar panel doesn’t produce much. So you need to install a lot of them. That requires a considerable workforce, and taking a lot of risks (relative to the energy produced).

Let’s be clear: I’m not saying that we should replace solar panels and wind turbines with nuclear power plants. I’m saying that these energy sources are comparable in terms of danger.

On to coal!

In fact, if we’re afraid of power plants, we’d be better off fighting coal-fired power plants.

Let’s start with the direct impact of coal: respiratory diseases, cancers … It is estimated that pollution from coal-fired plants causes more than 10,000 premature deaths in Europe every year.

German coal-fired plants cause about 500 deaths per year in France.

While France is closing the Fessenheim plant, Germany keeps opening coal-fired plants, like Datteln 4 in May 2020.

Conclusion: the closure of the Fessenheim plant will cost human lives. That’s an established fact.

(Small clarification: Datteln 4 wasn’t built specifically to replace Fessenheim. It would certainly have opened anyway. However, Fessenheim exported a significant share of its production to Germany. To replace that production, Germany will have to rely more on its coal-fired plants.)

And beyond air pollution, there is global warming. Curbing global warming will require considerable efforts, at every level. As mentioned above, despite enormous spending, Germany gets (in terms of electricity production) very mediocre results, because of this reliance on coal.

If you’re reading these lines and are under 40, it is almost certain that within your lifetime, global warming will cause profound planetary transformations, far more serious than Chernobyl, Fukushima, or (in another domain) COVID-19 and the associated economic crisis. There, I said it.

Again, and again, and again

In the July 8, 2020 edition, brief.me talks about France’s climate action being deemed insufficient. It covers quite a few things, but without saying “nuclear” even once.

Finally, in its “Faire face au réchauffement” (“Facing global warming”) overview of July 17, brief.me talks about global warming and its consequences … and once again manages not to talk about nuclear energy.

I wondered whether I had a biased view of the situation. I ran a small poll on Twitter to ask what people thought. Take it for what it’s worth, but out of the hundred or so people who answered, two thirds think that talking about nuclear power in France without mentioning the carbon footprint amounts to delivering biased information. So I’m not the only one who thinks so!

Is the brief.me editorial team (or the journalists who worked on these articles and features) anti-nuclear? Does it consider these to be two totally independent, disconnected topics? Is it perhaps afraid of making the connection between the two and being criticized? I don’t know. In any case, from my point of view, this treatment of the news helps maintain distrust toward nuclear energy in France, at a time when it is one of the necessary (but clearly not sufficient) avenues in the titanic challenge that is the fight against global warming. And if brief.me isn’t able (or willing) to connect the dots on this topic, I don’t trust them to be able to do it on other topics (on which I may not have the same perspective to realize that I’m reading biased information).

Thanks to Tristan Kamin for his comments on some of the technical parts. All remaining errors and inaccuracies are mine.

Offsetting the carbon footprint of air travel

2020-07-24T00:00:00+00:00

I recently decided to check how much it would cost to offset the carbon footprint of my air travel. It was cheaper than I thought: for about 170 flights, it was about $1000. Here are some details and thoughts about the process.

A little bit of background

Since my move to the US in 2011, I’ve been flying a lot. Flights to Europe for vacations and holidays; domestic flights in the US when I was in a long distance relationship; and then my career evolved, as I became Docker’s first developer advocate. Between 2013 and 2018, I spent roughly 50% of my time at home, and the rest of the time traveling to conferences, meetings, and the odd vacation.

Climate change is real, and the environmental impact of aviation accounts for about 2-3% of all human-induced CO2 emissions. We won’t prevent catastrophic global warming just by cutting air travel; but it’s one of the things that has been growing significantly over the last decades.

In fact, it follows an exponential growth, except during periods of crisis. The graph below is from 2017, so it obviously doesn’t show the effects of the 2020 pandemic; but you still get the general idea:

Source: TSEconomist.

Another thing that has been growing significantly is computing. Virtually insignificant a few decades ago, it now accounts for 2-3% of our CO2 emissions as well. And just like air travel, its growth follows an exponential curve.

Source: Belkhir, Elmeligi, 2018.

So, there would also be a few things to say about “green computing,” but let’s stick to air travel for today.

What’s the idea behind carbon offsets?

When we burn fossil fuels (like coal, gas, oil), the combustion releases CO2 into the atmosphere. That CO2 is a greenhouse gas, and it’s directly responsible for the extremely rapid temperature rise that we’re witnessing.

Carbon offsets are projects to reduce CO2 (or other greenhouse gases). There are many ways to do that. One strategy is to reduce CO2 emissions, for instance by replacing a source of energy that generates a lot of CO2 with another source generating less CO2. Example: if you keep your house warm by burning fuel, I could incentivize you to install a heat pump or other efficient heating system that will give you the same temperature inside, but generate less CO2. Another strategy is to plant trees: trees absorb CO2 from the air and turn it into carbon (a tree is about 50% carbon, in mass). Estimates tell us that a tree captures about 48 pounds of CO2 per year. In 2017, worldwide CO2 emissions added up to 36 billion tonnes. So to compensate for worldwide CO2 emissions, we “just” have to plant 1.5 trillion trees. Easy!
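As a rough check of that arithmetic, here is a small sketch using only the two figures quoted above. It assumes metric tonnes and the usual pound-to-kilogram conversion; it lands a little above 1.5 trillion, which is the same order of magnitude.

```python
# Rough check of the tree arithmetic, using the figures quoted above.
POUNDS_PER_KG = 2.20462          # pounds in one kilogram

co2_per_tree_lb = 48             # pounds of CO2 captured per tree per year
emissions_tonnes = 36e9          # worldwide CO2 emissions in 2017 (metric tonnes, assumed)

# Convert the per-tree figure to tonnes, then divide total emissions by it.
co2_per_tree_tonnes = co2_per_tree_lb / POUNDS_PER_KG / 1000
trees_needed = emissions_tonnes / co2_per_tree_tonnes

print(f"{trees_needed / 1e12:.2f} trillion trees")
```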

Some folks think that we can plant a trillion trees. Other folks think that it’s actually pretty hard, and even if we do it:

we have to keep planting trees as our CO2 emissions increase year over year;
we have to make sure that these trees stay in place (don’t get cut, don’t burn because there are more forest fires, etc).

We won’t solve global warming (or even just the carbon aspect of it) with a single solution. It’s likely that we will have to fly fewer planes, drive fewer cars, plant more trees, use better sources of electricity, keep our phones, computers, and other devices longer, and many other things.

That being said, what’s the process for applying carbon offsets to air travel?

Measuring our individual impact

The International Civil Aviation Organization (ICAO) has created a calculator to estimate how much CO2 can be attributed to a single trip. The general idea is to estimate how much fuel was burned by the plane for that flight, how much of it can be attributed to passenger travel (versus, say, freight transport), divide by the number of passengers, and multiply by 3.16 (because burning 1 tonne of aviation fuel generates 3.16 tonnes of CO2). There are some minor adjustments that are detailed in the ICAO methodology, but that’s the general idea.
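The general idea can be sketched in a few lines. This is only the simplified formula described above, not the full ICAO methodology, and the flight numbers in the example (fuel burned, passenger share, passenger count) are made up for illustration.

```python
CO2_PER_TONNE_OF_FUEL = 3.16  # tonnes of CO2 per tonne of aviation fuel burned


def co2_per_passenger(fuel_burned_t, passenger_share, passengers):
    """Tonnes of CO2 attributed to one passenger for one flight.

    fuel_burned_t:   fuel burned by the plane for the flight, in tonnes
    passenger_share: fraction of that fuel attributed to passengers (vs. freight)
    passengers:      number of passengers on board
    """
    passenger_fuel = fuel_burned_t * passenger_share
    return passenger_fuel / passengers * CO2_PER_TONNE_OF_FUEL


# Hypothetical flight: 50 t of fuel, 80% attributed to passengers, 250 passengers.
example = co2_per_passenger(50, 0.8, 250)
print(f"{example:.3f} tonnes of CO2 per passenger")
```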

There are many sites that offer more-or-less easy-to-use calculators where you can enter the details of a specific flight, and that give you the option to fund a carbon offset project to match the emissions of that flight.

So, in theory, all I had to do was to find one of these sites, enter my flight data, and enter my credit card number.

Practical details

There were, however, two details that I needed to address.

First, finding a reputable organization. Since many carbon offset programs finance actions that are in another part of the globe, you generally can’t just go and check for yourself that they’re actually planting trees or doing whatever they promised to do. This is true for many other markets, of course; but I don’t know if there is a trustworthy certification system for consumer-oriented carbon offset programs.

Next, I had about 170 flights over 5 years (2015-2019). I was keeping track of the time I was spending in and out of the US for immigration reasons, so I already had a spreadsheet with almost every single flight during that period: departure airport, arrival airport, and date. I spent a few hours adding missing flights (domestic flights and flights not bound to or from the US) as well as the class of travel (economy except for a couple of upgrades). But thinking about manually entering that data on a website felt daunting. (Especially because I felt there had to be an easier way!)

Project Wren

Both challenges were solved when I saw someone I trusted endorse Project Wren. First, I appreciated that Wren gives us a way to estimate our carbon footprint depending on our lifestyle (where we live, what we eat, etc). They also offered multiple kinds of carbon offsets. And they had a flight calculator.

Alright, so I found the right “vehicle,” but I was still dreading entering all my flights manually. I was considering reimplementing the ICAO formula myself to compute my carbon footprint, and making a financial contribution of that amount. But before I could follow through on that plan, I was contacted by one of the co-founders of Wren, checking in to know if I needed help with my project (I had left my email address when creating a profile on Wren). During a short email exchange, I explained what I was trying to do, shared that spreadsheet, and got it back annotated with the CO2 equivalent and offset cost for each flight.

Results and thoughts

The numbers were astonishingly low. To offset my roughly 170 flights, it cost me barely $1000. Domestic flights were a few bucks each, and long distance travel (say, Europe-US) $15-20 each.

I found this both encouraging… and depressing.

Encouraging, because it means that these offsets are relatively easy. I was expecting something much higher, and I thought that I would have to make a more difficult choice. But paying $1000 for five years flying almost every week… felt like the least I could do. Of course, $1000 is a lot of money for many folks; but let’s be honest: if you can afford to travel that much, you can most likely afford the offset.

Depressing, precisely because it means that we could offset the carbon emissions of all plane travel by raising the price by peanuts. (One percent, maybe?) For domestic travel, the carbon offset would cost less than a coffee (and definitely less than a coffee at the airport or in the plane).

Of course, offsets are not a magic solution. They are but one of the many things that we need to do to tackle climate change. It turns out that they’re easier and cheaper than I thought, even given my specific profile.

Let’s be clear: I don’t consider these cheap carbon offsets as a free pass to fly around in as many planes as I want, as long as I offset the associated CO2 emissions. We need a holistic approach. During the last few years, I’ve flown significantly less. My main source of income is now my Docker and Kubernetes training courses. As a freelancer, I have more freedom about how I organize my work. I group my customer engagements so that I cross the Atlantic less often. Sometimes I lose a customer who wants me to fly “right there right now” and doesn’t want to wait. Well, so be it.

Within Europe, I take the train when it’s feasible, even if it’s sometimes a bit longer, and often more expensive. I have the privilege of having a job that lets me work from home (when I don’t travel), so I don’t commute and don’t own a car. Even with these efforts, there is still enough air travel to have a “carbon footprint” that is far worse than the average European, and even the average American. So I need to continue to improve that; and to keep looking at other options too.

Offsetting more than air travel

Project Wren isn’t limited to air travel. They can also compute someone’s average carbon footprint depending on where they live, the size of their house, what they eat, how they move around, and many other factors. It’s based on statistics and averages, of course, but it’s still very useful.

And when purchasing a carbon offset, you get the option of picking exactly how you want that offset to happen, i.e. to what kind of initiative the money will go to.

I encourage you to have a look at Project Wren, or at any other similar project, if only to get an idea of your carbon footprint. If you can finance a carbon offset project or reduce your carbon footprint in other ways, that’s fantastic, but my goal here was just to share my experience with one specific aspect of the battle that we’re fighting against climate change. Thanks for reading!

Deploying ephemeral Kubernetes clusters with Terraform and env0

2020-07-17T00:00:00+00:00

env0 is a SaaS that can deploy Terraform plans, track their cost, and automatically shut them down after a given time. I’m going to show how to use it to deploy short-lived Kubernetes clusters and make sure that they get shut down when we don’t use them anymore.

Wait but why

As you may or may not know, my main source of income is the delivery of Docker and Kubernetes training. When I prepare, test, or update my materials, I need to spin up Kubernetes clusters. Often, I can use a local cluster. In fact, I often use (simultaneously) k3d, kind, and minikube; especially since these tools are now able to provision multiple clusters, and clusters with multiple nodes (not just a one-node “toy” cluster).

I currently have the following contexts in my ~/.kube/config file:

[jp@hex ~]$ kubectl config get-contexts
CURRENT   NAME             CLUSTER          AUTHINFO            NAMESPACE
          aws              kubernetes       aws                 helmcoins
          k3d-awesome      k3d-awesome      admin@k3d-awesome   default
          k3d-hello        k3d-hello        admin@k3d-hello     blue
          k3d-yeehaw       k3d-yeehaw       admin@k3d-yeehaw    kube-system
*         kind-kind        kind-kind        kind-kind           default
          kind-superkind   kind-superkind   kind-superkind      green
          minikube         minikube         minikube            

But sometimes, I need a “real” cluster. It could be because:

I need to make it available to someone else
I need to expose pods with a Service of type LoadBalancer
I need to obtain TLS certificates with Let’s Encrypt (typically to run a Docker registry with e.g. Harbor or GitLab; Docker registries need TLS certificates!)
I need Persistent Volumes that are not node-local
I need more resources (e.g. demonstrate a multi-node ElasticSearch cluster using Elastic’s ECK operator)

… As you can see, there is no shortage of reasons (or excuses) to run a “real” cluster. (I say “real” with quotes, because the clusters that I run locally are just as real; but they aren’t reachable from outside my LAN and they have fewer resources.)

I have a bunch of scripts to spin up Kubernetes clusters. They’re designed specifically to provision a large number of clusters for a workshop or training. (I’ve used them to provision hundreds of clusters, for instance the morning just before a conference workshop. Back when conferences were still a thing, remember? Anyway.)

I often use these scripts to give myself one or a handful of clusters to run a bunch of tests. But I have to be careful to remember to shut down these clusters, otherwise they add to my cloud bill.

That’s where env0 comes in: it gives me a way to provision resources (Kubernetes clusters or anything else, really) and give them a specific lifetime. A few hours, a few days, whatever suits my needs.

It can even start and stop environments following a specific schedule. For instance, every morning at 9am, spin up my development cluster; and shut it down at 5pm. (Talk about enforcing work-life balance!😅)

I assume that many of my readers are tinkerers like me who can easily do something similar with e.g. a script triggered by a crontab, or maybe leveraging a service like GitHub Actions. But env0 has a lot of extra features that make it interesting even for the members of the I-can-do-this-with-a-tiny-shell-script club:

it can track the individual cost of each environment that we deploy (on AWS, Azure, and GCP)
it gives us a nice web frontend to see what’s running (rather than sifting through the console of our cloud provider)
it gives us a way to define “templates” and then make them self-service for others to use
it’s using Terraform and will take care of saving Terraform state (if you’ve been using Terraform before, you probably see what I mean; otherwise, I’ll explain in a bit!)

Before showing you a demo, I’ll talk a bit about Terraform. If you’re familiar with Terraform, feel free to skip to the next part.

Terraform

Terraform is one of the many outstanding Open Source projects created by HashiCorp. (They also make Consul, Nomad, Vagrant, Vault; just to name a few.)

Terraform is one way to do Infrastructure as code. (I think it’s the most common way; and I would argue that it’s also the best one, but that’s a purely personal opinion!)

In practical terms, this means that we can describe our infrastructure in configuration files, and then use Terraform to create/update/destroy that infrastructure. It is declarative and implements a reconciliation loop, which means that we can:

write configuration files describing our infrastructure
run Terraform to create all the things
make changes to the files
run Terraform again: it will create/update/destroy things accordingly
rinse and repeat as many times as we want
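To make the “declarative plus reconciliation loop” idea concrete, here is a toy sketch. It is not how Terraform is implemented (there are no providers, no state file, no dependency graph); the resource names and the dictionaries standing in for “configuration” and “infrastructure” are purely illustrative.

```python
def plan(desired, actual):
    """Compare desired and actual resources; return the list of changes to make."""
    changes = []
    for name, spec in desired.items():
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != spec:
            changes.append(("update", name))
    for name in actual:
        if name not in desired:
            changes.append(("destroy", name))
    return changes


def apply(desired, actual):
    """Make `actual` match `desired`, and return what was done."""
    changes = plan(desired, actual)
    for action, name in changes:
        if action == "destroy":
            del actual[name]
        else:
            actual[name] = dict(desired[name])  # copy, so later edits don't leak
    return changes


infra = {}                                             # nothing exists yet
config = {"vm-1": {"size": "small"}, "lb": {"port": 443}}
first = apply(config, infra)                           # creates both resources

config["vm-1"] = {"size": "large"}                     # edit the "files"
del config["lb"]
second = apply(config, infra)                          # updates vm-1, destroys lb
third = apply(config, infra)                           # nothing left to do
print(first, second, third, sep="\n")
```

Running `apply` again without changing the configuration is a no-op, which is what makes the “rinse and repeat” part safe.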

Of course, this only works with infrastructure that you can drive with an API. Cloud stuff usually works. Physical machines usually don’t. (Except if you’re using things like IPMI, PXE servers, and an API in front of all that; but I digress.)

Since our infrastructure is defined in configuration files (Terraform uses HCL, by the way), these files can be under version control, for instance in a git repository. Which means that we can use mechanisms like pull requests and code reviews to make changes to the infrastructure. Again, in concrete terms, this means that if I want to add a virtual machine or scale up a cluster, I will:

make changes to the configuration files
commit these changes to git
make a pull request to our central repo
ask a coworker to review that pull request and merge it (or perhaps do that part myself if I feel confident enough in my changes, and my team’s or organization’s policy allows it)
trigger Terraform (or just watch, if it’s triggered automatically) to apply my changes

This lets us keep track of which changes were made, when, why, by whom. It also makes it easy to roll back changes. It can also help to bring up copies of the whole stack; e.g. “we need to replicate all these VMs, load balancers, and assorted services, to run a bunch of tests, staging, or whatever”.

Terraform does not, however, provide a cloud-agnostic abstraction. This was one of my early misconceptions about the product, by the way: I thought that I could define a stack to run on AWS, and easily move it to Azure. Terraform does not do that. When you define resources, you define e.g. EC2 instances, or Google Cloud instances, or OpenStack instances. Converting from one to another can take a significant amount of work. There are abstractions; e.g. once you have a bunch of VMs, you can have a common thing that will SSH into them and configure them; but the part that will bring up the VMs will be different for each cloud provider.

Terraform also requires that you carefully keep a state file, typically named terraform.tfstate. You have one such file for each stack that you deploy and maintain with Terraform. If you are the only person working on your resources, you can just keep that file locally and you’ll be fine. But if multiple people work on a given stack, you need to keep that file in a central place. It could be on an S3 bucket or in a special-purpose VM or container. It’s also important to make sure that only one person (or rather, one execution of Terraform) accesses that file at any given time, so it is recommended to have some locking mechanism in place. Terraform supports various state locking mechanisms. HashiCorp also offers Terraform Cloud to manage all that stuff and put a nice web interface in front of it - for a fee, of course.
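To illustrate why locking matters, here is a minimal sketch of the “one execution at a time” idea using an exclusive lock file. This is not how Terraform’s backends implement locking; it only works on a local filesystem, it does not handle stale locks, and the `state_lock` helper and file names are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def state_lock(lock_path):
    """Hold an exclusive lock by atomically creating a lock file."""
    # O_CREAT | O_EXCL fails with FileExistsError if the file already exists,
    # so only one process at a time can get past this line.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)  # release the lock for the next run


lock_file = os.path.join(tempfile.mkdtemp(), "stack.lock")
with state_lock(lock_file):
    print("lock held, safe to read and write the state file")
```

A second run that tries to take the same lock while the first one holds it fails immediately instead of silently overwriting the state.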

Before we dive into env0, a little bit of Terraform-related vocabulary:

a Terraform configuration has inputs called variables (a well-designed plan will try to put all the configurable and tweakable values in these variables)
a Terraform configuration can also have outputs (outputs are values generated by the plan and the infrastructure that we use, like the IP address or DNS entry for an app’s load balancer)
a Terraform configuration is made of modules
each module is a bunch of configuration files (usually with a .tf extension)
terraform apply is the command that will synchronize the infrastructure’s state with the Terraform configuration (it is used for the initial plan execution, and subsequent modifications)
terraform plan will build a plan, or a kind of diff, if you will, between the Terraform configuration and the infrastructure state; it will show what would exactly happen if we were to execute terraform apply (i.e. “do you want to create/delete/change this?”)

Now, since I want to deploy Kubernetes clusters with Terraform, I need to find a suitable Terraform configuration.

Since I’m only an intermediate-level Terraform user, instead of writing my own Terraform configuration, I shopped around, and I found a few existing ones.

For simplicity, I decided to stick to managed Kubernetes clusters. This means that you don’t need a lot of Kubernetes-specific or cloud-specific knowledge to follow along. But if you want to get fancy, you can use a powerful Kubernetes distribution like Lokomotive or Typhoon and customize your cluster deployment to your liking.

Alright, let’s see how to use Terraform and env0 together to deploy some Kubernetes clusters!

env0 in action

If you have an AWS account, I encourage you to try it out for yourself. Here are some turn-by-turn instructions if you want to do exactly what I did, i.e. spin up some Kubernetes clusters!

Note: I added a few screenshots to show what the interface looks like. And if you prefer something less static, good news, I also recorded a video to show what it’s like! It’s one of my first videos, so feel free to let me know what you think :)

Step 1: create an env0 account. (You don’t need a credit card.)

Step 2: once you’re logged in, you must create an organization, so that you can create your own templates. (You can’t create templates in the demo organization.)

Step 3: configure policies. This is not strictly necessary, but this is (in my opinion) one of the very interesting features of env0, so I wanted to make sure that we’d have a look! If we click on “settings” (just above the organization name) and then on the “policies” tab, we will see the Maximum Environment TTL and the Default Environment TTL. These are the delays after which our environments get automatically destroyed. (Note that you can always change that later, after you deploy an environment. So don’t worry about “oops my environment is going to self-destruct and I can’t do anything about it!”, you can extend it as long as you need it.)

Step 4: configure AWS credentials.

We need to give our AWS credentials to env0, so that it can create cloud resources on our behalf. (Well, technically, env0 will run Terraform, and Terraform will create the resources.) We need an AWS access key and the corresponding secret key. If you are familiar with IAM, you know what to do! Otherwise, you can go to your AWS security credentials, click on “Access keys”, and “Create New Access Key”. You can use the new access key with env0, so that you can delete it when you’re done. Once you have an access key, go to variables in env0, and set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Make sure to tick the “sensitive” checkbox for the secret key. It will make sure that env0 doesn’t show (or expose in any way) that variable.

Step 5: create a template. Pick templates in the left column, then create a template.

in HTTP/S repository, enter https://github.com/terraform-providers/terraform-provider-aws
in Path, enter examples/eks-getting-started
enter a template name of your choice
save your template

Step 6: enable the template. This part is not super intuitive, so here is what you need to know: by default, templates don’t show up in projects. They have to be enabled for each project. This seems like a superfluous step when you have one project and one template; but if you have dozens of projects and hundreds of templates, it makes sense to select which ones are visible and where. So, anyway! You’ll have to go to “manage templates” and enable the template.

Step 7: create an environment. Alright, stuff is about to get real! In the “create environment” page, you will see the template that we just created. Select “run now”. No need to customize anything, just click “run”. If everything goes well, 10-15 minutes later your cluster will be ready. (Note that this delay is not caused by env0 or Terraform, but entirely by EKS; it’s particularly slow to provision the Kubernetes control plane.)

Generally, after provisioning resources with Terraform, the Terraform configuration generates outputs. The outputs could be IP addresses, passwords, or generally speaking, any kind of information allowing us to access the resources. In this case, the Terraform configuration generates a kubeconfig file. We need to download that file to use it with the kubectl command line.

Step 8: retrieve kubeconfig. Click on the “resources” tab of the environment. If the output panel is empty, just reload the page. You should see a kubeconfig row appear. Click on it, it will automatically copy the content of the kubeconfig file to your clipboard. Open a new file, paste the content of the kubeconfig file, save that under any name you like (say, kubeconfig.env0).

Note that this kubeconfig file invokes an external program, the aws-iam-authenticator. Again, this is not something specific to env0 or Terraform, but to EKS. If you don’t have that program, you need to install the AWS IAM authenticator before moving on.

Step 9: ~~profit~~ use Kubernetes cluster. All you have to do is tell kubectl to use the configuration file that you created in the previous step; for instance with kubectl --kubeconfig kubeconfig.env0 or by setting the environment variable KUBECONFIG:

export KUBECONFIG=kubeconfig.env0

Now if you kubectl get nodes you should see that you have a brand new 1-node cluster. Yay!

jp@zagreb:~$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-0-159.us-west-2.compute.internal   Ready    <none>   8m35s   v1.16.8-eks-fd1ea7

Note that the Terraform config that we used is great because it worked out of the box; but it’s not-so-great because the name of the cluster (and of a few other resources) is hard-coded. So if you try to deploy it a second time, it won’t work. I was able to deploy another cluster by tweaking a handful of files. The right solution (generating unique resource names) is left as an exercise for the reader, as we say.

Terraform versions

I was wondering if it would be easy to deploy any Terraform configuration with env0. So I decided to try the other EKS example that I had found!

If you want to run it for yourself, it’s pretty much just like the previous walk-thru, except when creating the template. This is the information that we need to enter:

in HTTP/S repository, enter https://github.com/terraform-aws-modules/terraform-aws-eks
in Path, enter examples/basic

Now, if we try to deploy this configuration, env0 complains, telling us that the Terraform version that we use is not compatible with the eks module. What is this about? I was surprised, because I had tried this module on my local machine before trying it with env0, and it worked fine!

It turns out that env0 uses tfenv to offer a convenient way to switch between Terraform versions. By default, tfenv will use the lowest Terraform version that is supposed to work with our Terraform configuration. And in the main.tf file of our Terraform configuration, there is a line that says required_version = ">= 0.12.0". This causes tfenv to use version 0.12.0, even though the requirements for the EKS module indicate that we need Terraform 0.12.9.

(Aside: at first, this sounds like a bug in tfenv. However, if we look at it more closely, determining the lowest Terraform version for the top-level module is easy, but resolving and possibly downloading all dependencies would be much more complex, and I understand why tfenv won’t do it.)
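The selection behavior described above can be sketched as follows. This is a simplified illustration, not tfenv’s actual code: it only understands `>=` constraints, and the list of available versions is a made-up example.

```python
def parse(version):
    """Turn '0.12.9' into (0, 12, 9) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def lowest_matching(constraint, available):
    """Return the lowest available version satisfying a '>= X.Y.Z' constraint."""
    operator, minimum = constraint.split()
    if operator != ">=":
        raise ValueError("this sketch only handles '>=' constraints")
    candidates = [v for v in available if parse(v) >= parse(minimum)]
    return min(candidates, key=parse)


available = ["0.11.14", "0.12.0", "0.12.9", "0.12.28"]  # hypothetical list

# Only the top-level constraint is looked at...
picked = lowest_matching(">= 0.12.0", available)
print(picked)                             # 0.12.0

# ...so the stricter requirement of a nested module is not satisfied.
print(parse(picked) >= parse("0.12.9"))   # False
```

Comparing versions as tuples of integers (rather than as strings) matters here: as strings, "0.12.9" would sort after "0.12.28".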

So, how do we fix that?

The env0 docs tell us how to specify the Terraform version, either by setting the ENV0_TF_VERSION environment variable, or by changing the required_version directive in our Terraform configuration.

I did the latter, by:

forking the terraform-aws-eks repository,
changing the required_version in my fork,
updating the env0 template to use my fork.

And after that, it should deploy like a charm. At the end of the deployment, we get a kubectl_config output which we can copy-paste to a kubeconfig file, just like before. Except this time we get a 3-node cluster:

jp@zagreb:~$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE     VERSION
ip-10-0-2-52.us-west-2.compute.internal    Ready    <none>   3m39s   v1.16.8-eks-fd1ea7
ip-10-0-2-69.us-west-2.compute.internal    Ready    <none>   3m39s   v1.16.8-eks-fd1ea7
ip-10-0-3-252.us-west-2.compute.internal   Ready    <none>   3m36s   v1.16.8-eks-fd1ea7

(And if we look closely, we’ll notice that this is actually two node groups, with nodes of different sizes. Fancy!)

What’s next?

There are at least 3 features that are worth mentioning, but that I’m going to skip (or keep for another blog post) since this is already getting fairly long.

GitOps. When I want to update one of these environments, I can make changes to my Terraform configurations, commit these changes, push these commits to a separate branch, and tell env0 to update an environment using that specific branch. And if I don’t like it, I can switch back to the original branch. This encourages a workflow where every change goes through version control, which is a pretty big deal, in my opinion. The env0 blog has a great post on that topic, showing why and how to use per-pull request environments.

Cost tracking. env0 automatically tags resources, and on some cloud providers (AWS, Azure, GCP) it can track the individual cost of each environment. This is definitely something that I want to play with, because even at my very modest scale, I often have multiple things going on in my cloud accounts, and if something gives me a way to keep track of how much each little toy experiment (or customer project) costs me, sign me up! The env0 blog also has a post on that topic, showing how to track cost over time.

API and CLI. env0 has an API, and it is relatively easy to use. I like using a web interface to get started and click around, but when automating things, nothing beats a CLI (or an API). One of my future goals is to start environments automatically with a one-liner. Meanwhile, I already hacked something together to list environments:

[jp@hex env0]$ ./env0 ls
EKS (terraform-providers jpetazzo's fork-76391       INACTIVE  2020-07-03T13:59:07.000Z
EKS (terraform-providers upstream)-31765             INACTIVE  2020-07-02T15:34:18.000Z
AWS EKS (jpetazzo's fork)-96167 (no TF VER env var)  INACTIVE  2020-07-02T14:33:40.000Z
AWS EKS (jpetazzo's fork)-32766                      INACTIVE  2020-07-02T14:32:12.000Z
AWS EKS-38020                                        INACTIVE  2020-07-02T13:37:40.000Z

And finally, you can also use custom flows, to declare hooks that should be executed at any point of the process; e.g. to execute custom scripts and actions before or after Terraform runs. Even if in theory, we can probably do everything we need within the Terraform configuration, it’s often easier to add a little shell snippet this way.

Wrapping up

env0 is a young product but it’s already very promising. In fact, I can see it being useful for many teams or organizations using Terraform, even if they don’t need the environment TTL or cost tracking features.

In the future, I will explore how to use it to provision environments for my workshops and training sessions. I wonder how it would scale to dozens or hundreds of environments, and how difficult it would be to integrate it in a self-serve workflow, for instance.

I’d also love to hear your ideas and suggestions! (After all, as I said multiple times earlier, I’m not a Terraform power-user.)

One more thing - env0’s team is very responsive and quick to address issues. During my tests, at some point, I hit a bug in the web UI that prevented me from stopping one environment. I reached out to the team. They immediately pointed me to the API (which gave me access past the web UI) and they fixed the web UI bug within a few hours. Kudos!

Streaming tech talks and training / Using Linux

2020-06-27T00:00:00+00:00

If you are using Linux as your main operating system, you might wonder if it’s doable to use it to stream content, and how. In this article, I’ll tell you everything I learned about this: what works, what doesn’t, and the various hacks that I’m using to keep it working.

This will be interesting if you are using (or want to use) Linux for these things, but there will be also technical tidbits (related to e.g. USB webcams) that you might find useful even if you’re using other systems.

We’re going to talk about:

supporting multiple webcams on Linux
USB bandwidth challenges
leveraging NVENC (NVIDIA’s GPU-accelerated encoding) for H264 parallel encoding
V4L2 and ALSA loopback devices, with or without OBS
using and abusing ffmpeg for various purposes
running a bunch of things in Docker containers because why not

It’ll be fun!

TL;DR

The short version is that Linux is a robust platform to stream video. As often with Linux, the user interface can be a bit less polished and I often had to learn more than I wanted to know about some parts of the system. However, it offers a lot of flexibility and you can combine things in very powerful ways.

If the previous paragraph left you wondering “what does that exactly mean?”, I will give you a little example. I’m using a Stream Deck to change scenes with OBS Studio. Each button on the Stream Deck is a tiny LCD screen. I’m told that on macOS and Windows, you can set things up so that the scene change buttons actually show a tiny preview of the scene that you’re about to change to. On my setup, these buttons just show a dull, boring text string. However, I am able to take the video stream that comes out of OBS, encode it with 4 different video bitrates using GPU acceleration, send these streams to various platforms like Twitch or YouTube, record locally the highest bitrate, while also sending it to another computer on my network that will rebroadcast it to a Zoom or Jitsi meeting. (Not because it’s fun, but because I actually need these features sometimes.)

Install this

I’m going to suggest that you run various commands to see for yourself how things work. If you want to follow along, or generally speaking, if you want to tinker with video on Linux, I recommend that you install:

v4l-utils or v4l2-utils or whatever package contains the program v4l2-ctl, a command-line tool to list information about webcams and tweak their settings (like autofocus and such)
guvcview, a handy GUI tool to preview a webcam, but also to tweak webcam settings while the webcam is in use under another program
ffmpeg, the ultimate video swiss-army knife, to convert files but also encode in real time, stream, transcode, and much more

You might also want to get:

v4l2loopback, a kernel module that implements a virtual webcam
obs-v4l2sink, an OBS plugin to send video to a V4L2 device (like the virtual webcam above)
gphoto2, if you want to try to use a DSLR as webcam (of course it only works on some models)

V4L2

On Linux, webcams use V4L2. That stands for “Video For Linux, version 2”. Every USB webcam that you plug into your system (as well as other video acquisition devices) will show up as a device node like /dev/video0.

On all the laptops that I used so far, the built-in webcam was actually a USB webcam, by the way.

All the USB webcams that I worked with actually showed up as two device nodes. The first one is the one from which you can get the actual video stream. The second one only yields “metadata”. I don’t know what kind of metadata. I wasn’t able to do anything useful with that other device node, so I just ignore it.

You can check it out for yourself and list your webcams like this:

v4l2-ctl --list-devices

If you have at least one webcam, it should be on /dev/video0, so there is a good chance that you can run the following command to see a preview of that webcam:

ffplay /dev/video0

This should open a window with a live preview of the webcam, and output a bunch of information in the terminal as well. For instance, on my laptop, I see this:

Input #0, video4linux2,v4l2, from '/dev/video0':
  Duration: N/A, start: 437216.677903, bitrate: 147456 kb/s
    Stream #0:0: Video: rawvideo (YUY2 / 0x32595559), yuyv422, 1280x720, 147456 kb/s, 10 fps, 10 tbr, 1000k tbn, 1000k tbc

What’s particularly interesting is the yuyv422, 1280x720, 147456 kb/s, 10 fps bit.

yuyv422 is the pixel format. While most screens work with RGB data (since they have actual red, green, and blue pixels), video acquisition and compression often works with YUV. Y is luminance (brightness), U and V are chrominance (color). Our eyes are more sensitive to changes in brightness than to changes in color, so YUV formats often discard some of the U and V data to save space without losing much in quality. That particular pixel format requires 16 bits per pixel (instead of 24 for RGB).
1280x720 is the current capture resolution.
10 fps is the current capture frame rate: 10 frames per second.
147456 kb/s is the data transfer, corresponding exactly to 1280 pixels x 720 lines x 16 bits per pixel x 10 frames per second.

That data transfer information is important, because many webcams are USB 2, which is limited to 480 Mb/s. This is a significant limiting factor, as we are about to see.
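If you want to redo that multiplication for another resolution or frame rate, here is a minimal sketch in shell. The function name raw_kbps is made up for this example; the numbers are the ones from the ffplay output above.

```shell
#!/bin/sh
# Raw (uncompressed) video bitrate, in kb/s:
# width x height x bits per pixel x frames per second, divided by 1000.
# raw_kbps is a name made up for this example.
raw_kbps() {
  echo $(( $1 * $2 * $3 * $4 / 1000 ))
}

# 1280x720, yuyv422 (16 bits per pixel), 10 frames per second
raw_kbps 1280 720 16 10   # prints 147456
```

Swap in 24 bits per pixel to see what the same stream would cost in RGB.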

The USB rabbit hole

You might have noticed that in the example above, we have “only” 10 frames per second. This is a bit low, and we can probably see that the video is a bit choppy. (TV and movies are typically 24 to 30 frames per second.) How can we get more?

Easy: use the -framerate option with ffplay. This instructs ffplay to try and open the device with extra parameters to achieve that frame rate. Our command line becomes:

ffplay -framerate 30 /dev/video0

On my system, we get exactly the same result as earlier (10 frames per second), with a message telling us:

The driver changed the time per frame from 1/30 to 1/10

This is because we asked for a frame rate and resolution too high for the webcam, or rather, its USB controller. The formula that we used above tells us that we would need 442 Mb/s for a raw, 30 fps video; but that’s just the video data. We need to add the overhead of the USB protocol. And even if we manage to stay below 480 Mb/s, we’re dangerously close to it, and the USB chipset in the webcam might not be able to pull it off.
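To get a feel for how tight this is, we can turn the formula around and ask for the highest frame rate that fits in a given budget. This is only a sketch: max_raw_fps is a name I made up, and the 80% figure is an arbitrary guess to leave room for protocol overhead, not a measured value. Real webcams only offer a handful of discrete frame rates anyway.

```shell
#!/bin/sh
# Highest whole frame rate whose raw bitrate fits in a budget.
# Usage: max_raw_fps WIDTH HEIGHT BITS_PER_PIXEL BUDGET_KBPS
# (max_raw_fps is a hypothetical helper, not an existing tool.)
max_raw_fps() {
  echo $(( $4 * 1000 / ($1 * $2 * $3) ))
}

# Theoretical USB 2 ceiling (480 Mb/s), ignoring protocol overhead
max_raw_fps 1280 720 16 480000   # prints 32

# Same thing with only 80% of the bus (arbitrary assumption)
max_raw_fps 1280 720 16 384000   # prints 26
```

On paper, 30 fps barely fits under the ceiling; as soon as we reserve some room for overhead, it doesn’t.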

So, how do we get a higher resolution and frame rate?

Compressed formats

Most webcams can use compressed formats as well, and that’s what we need here.

To see the various formats that our webcam can handle, we can use v4l2-ctl --list-formats, which on my built-in webcam, yields the following list:

	[0]: 'YUYV' (YUYV 4:2:2)
	[1]: 'MJPG' (Motion-JPEG, compressed)

My webcam (and all USB webcams I’ve seen so far) can send compressed frames in MJPEG. MJPEG is basically a sequence of JPEG pictures. It is widely supported, it is significantly more efficient than raw pictures, but not as efficient as other formats like H264, for instance. With MJPEG, each frame is an entire, independent frame. More advanced codecs will use interframe prediction and motion compensation, and they will send groups of pictures (GOP in short) consisting of a whole image called an intra frame followed by inter frames that are basically small “diffs” based on that whole image.

Each webcam (and capture device) advertises a full list of resolution, formats, and frame rates that it supports. We can see it with v4l2-ctl --list-formats-ext.
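That output can get pretty long, so here is a small sketch that boils it down to one line per format, resolution, and frame rate. The function name summarize_formats is made up, and the sample input below is illustrative (it mimics the layout of v4l2-ctl --list-formats-ext, it is not the actual output of my webcam). It only looks at “Discrete” sizes and intervals and ignores anything else.

```shell
#!/bin/sh
# Condense "v4l2-ctl --list-formats-ext" output to "FORMAT SIZE FPS" lines.
# summarize_formats is a hypothetical helper; it only handles Discrete entries.
summarize_formats() {
  awk '
    $1 ~ /^\[[0-9]+\]:$/                     { fmt = substr($2, 2, length($2) - 2) }
    $1 == "Size:" && $2 == "Discrete"        { size = $3 }
    $1 == "Interval:" && $2 == "Discrete"    { print fmt, size, substr($4, 2) + 0 }
  '
}

# Illustrative sample input (not taken from a real device)
summarize_formats <<'EOF'
    [0]: 'YUYV' (YUYV 4:2:2)
        Size: Discrete 1280x720
            Interval: Discrete 0.100s (10.000 fps)
        Size: Discrete 640x480
            Interval: Discrete 0.033s (30.000 fps)
    [1]: 'MJPG' (Motion-JPEG, compressed)
        Size: Discrete 1280x720
            Interval: Discrete 0.033s (30.000 fps)
EOF
```

With a real webcam, this would be used as v4l2-ctl --list-formats-ext | summarize_formats.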

We can tell ffmpeg to use a different format with the -pixel_format flag. That flag requires a format code. The format codes don’t quite match what v4l2-ctl tells us. For MJPEG, we should use mjpeg, not MJPG. To see these format codes, we can run ffplay -list_formats all /dev/video0. The pixel formats will be shown at the end:

[video4linux2,...] Raw       :     yuyv422 :           YUYV 4:2:2 : 640x480 ...
[video4linux2,...] Compressed:       mjpeg :          Motion-JPEG : 640x480 ...

The format codes are yuyv422 and mjpeg.

(Note that ffmpeg also shows us the supported resolutions, which is nice; but it doesn’t show the supported frame rates, which is why v4l2-ctl is more useful in that regard.)

On my system, the following command will grab video from the webcam using MJPEG:

ffplay -framerate 30 -pixel_format mjpeg /dev/video0

If you don’t get the full resolution of your webcam, it might be because you used it previously (in another program) at a different resolution. It looks like unless instructed otherwise, the device keeps whatever resolution it had last time. You can change the resolution with the -video_size flag, like this:

ffplay -framerate 30 -pixel_format mjpeg -video_size 1280x720  /dev/video0

You might notice that ffplay no longer tells us the bitrate of the video, because MJPEG doesn’t yield a constant bitrate: each frame can have a different size.

MJPEG and OBS Studio

In OBS Studio, the problem will manifest itself quite differently. On the three machines where I tried it, when I add a webcam in OBS, I can set the resolution, frame rate, and a “Video Format”.

The “Video Format” gives me a choice that looks like this:

YUYV 4:2:2
BGR3 (Emulated)
YU12 (Emulated)
YV12 (Emulated)

The first entry corresponds to the raw uncompressed format. The three others (with the (Emulated) annotation) correspond to formats that are converted on the fly. As it turns out, picking one of the “emulated” formats will configure the webcam to use MJPEG, and convert it to one of these formats.

When I try to use the uncompressed format with a frame rate and resolution that aren’t supported, the video for that webcam freezes. As soon as I switch to a lower frame rate, lower resolution, or to an emulated format, it unfreezes.

So we definitely want to use one of the emulated formats, unless we’re happy with a smaller resolution or lower frame rate, of course.

BGR3, YU12, etc.

When we stream or record video, it will almost always be YUV. However, when it gets displayed on our monitor, it will be RGB. Most video cards can convert YUV to RGB in hardware. With that in mind, it would make sense for OBS Studio to use YUV internally. This would save CPU cycles by avoiding superfluous RGB/YUV conversions, except to display the video preview on screen, which could be hardware-accelerated anyway.

However, I don’t know how OBS Studio works internally. I don’t know if its “native” internal format is RGB or YUV. I didn’t notice any difference in video quality or in CPU usage when switching between BGR3 and YU12, but my tests weren’t very scientific, so feel free to check for yourself.

Buffering

I experienced random lag issues with OBS Studio, especially after suspend/resume cycles. For instance, one camera would appear to lag behind the others, as if it had a delay of a few tenths of a second.

The “fix” is to change the resolution or frame rate. It’s enough to e.g. change from 30 to 15 fps, and then back to 30fps. Somehow it resets the acquisition process.

I recently tried to uncheck the “Use buffering” option in OBS, and it seems to solve the problem (I didn’t experience lag issues since then) without adverse effects.

About USB hubs

If you’re using USB 2 webcams and are experiencing issues, try to connect them directly to your computer. I’ve had issues (at the highest resolutions and frame rates) when connecting multiple webcams to the USB hub on my screen. It seemed like a good idea at first: being able to plug the webcams into the screen simplified wiring. However, the webcams are then sharing the USB bandwidth going from the hub to the computer.

The problems can even appear when the webcam is the only device plugged into the hub. Even worse: I’ve seen one webcam fail when plugged into some ports of my laptop’s docking station, but not others.

It turns out that some ports of that docking station were root ports, while others were actually behind an internal hub. (This is similar to what you get when you buy a 7-port USB hub; it isn’t actually a 7-port hub, but two 4-port hubs, the second one being chained to a port of the first one.)

Note that using USB 3 hubs or a fancy docking station won’t help you at all, because USB 2 and USB 3 use different data lanes. If you plug some USB 2 webcam (like a Logitech C920s) into a fancy USB 3 hub that has a 10 Gb/s link to your machine, all USB 2 (and USB 1) devices on that hub are going to use the USB 2 data wires going to the computer, and will be limited to 480 Mb/s total.

All these bandwidth constraints may or may not affect you at all. If you’re running a single webcam in MJPEG behind a couple of hubs, you’ll probably be fine. If you are running multiple webcams and/or at full HD or 4K resolutions and/or behind multiple hubs shared with other peripherals (keyboard, mouse, audio interfaces), it’s a different story. If you want to rule out that kind of issue, try connecting the webcam directly to the computer. If your computer has both USB 2 and USB 3 ports, use USB 3 ports (even if the webcam is USB 2) because on some computers, the USB 2 ports are already behind a hub.

USB 3 and USB-C

USB 3 offers much faster speeds. It starts at 5000 Mb/s, so 10x faster than USB 2, woohoo!

Webcams supporting USB 3 shouldn’t be affected by all the bandwidth issues mentioned above. So if you intend to have multiple cameras and super high resolutions, try to get USB 3 stuff. (Note, however, that if your goal is to improve video quality, you should first invest in lights and other equipment, as mentioned in part 2 of this article series.)

“How can I know if my stuff is USB 3?”

USB-C connectors (the rounded ones on modern laptops, phones, etc.) almost always indicate USB 3. (All the USB-C connectors I found on computers and webcams were USB 3. However, some cables have USB-C connectors but are only USB 2. Tricky, I know.)

For USB A connectors (the older rectangular ones), USB 3 is generally indicated by the blue color or by the SuperSpeed logo as shown below.

`usbtop` doesn’t work

While troubleshooting my USB bandwidth problems, I found a tool that seemed promising: usbtop. It shows the current USB bandwidth utilization.

Unfortunately, it made me waste a lot of time, because it showed numbers that were way smaller than reality, leading me to believe that I had a lot of available bandwidth, while my bus was, in fact, almost saturated.

I realized the problem when grabbing raw video output from a webcam. ffplay would give me an exact number which I could verify with a quick back-of-the-envelope calculation, while usbtop, alas, would show me some much smaller number.

V4L2 is back with a loopback

Let’s leave aside all that USB nonsense for a bit.

Linux makes it super easy to have virtual webcams, thanks to the V4L2 Loopback device. This is a device that looks like a webcam, except that you can also send video to it.

V4L2 Loopback is not part of the vanilla kernel, so you will need to install it. Look for a package named v4l2loopback-dkms or similar; it should take care of compiling the module for you.

I personally load the module with:

modprobe v4l2loopback video_nr=8,9 card_label=EOS1100D,OBS

This creates devices /dev/video8 and /dev/video9. They will show up respectively as EOS1100D and OBS in webcam selection dialogs. (The first one is to use a DSLR as a webcam, the second one is used to get the video output of OBS and pipe it to whatever I need.)
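If you don’t want to type that modprobe command after each reboot, here is a minimal sketch that writes the two usual config files. It assumes a systemd-based distribution (which reads /etc/modules-load.d) and the standard /etc/modprobe.d mechanism. The function name and its root argument are made up for this example, so that you can try it in a scratch directory before touching the real /etc.

```shell
#!/bin/sh
# Write the config files that load v4l2loopback at boot with the same options.
# persist_v4l2loopback and its ROOT argument are hypothetical: ROOT is a
# directory prefix, so nothing is written to the real /etc unless ROOT is empty.
persist_v4l2loopback() {
  root=$1
  mkdir -p "$root/etc/modules-load.d" "$root/etc/modprobe.d"
  # Which module to load at boot (read by systemd-modules-load)
  echo "v4l2loopback" > "$root/etc/modules-load.d/v4l2loopback.conf"
  # Which options to pass to it (read by modprobe)
  echo "options v4l2loopback video_nr=8,9 card_label=EOS1100D,OBS" \
    > "$root/etc/modprobe.d/v4l2loopback.conf"
}

# Dry run in a scratch directory
scratch=$(mktemp -d)
persist_v4l2loopback "$scratch"
cat "$scratch/etc/modprobe.d/v4l2loopback.conf"
```

Once the files look right, running persist_v4l2loopback "" as root would write them under the real /etc.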

In the examples below, I will assume that /dev/video9 is a V4L2 loopback device. Adapt accordingly.

For instance, here is how to Rickroll your friends or coworkers during a Skype or Zoom call and show them the video clip of Rick Astley instead of your face:

Install YouTube downloader script:
```
pip install --user youtube-dl
```

Download video:

youtube-dl https://www.youtube.com/watch?v=dQw4w9WgXcQ

For convenience, rename it to a shorter filename, e.g.:
```
mv *dQw4w9WgXcQ.mkv rickroll.mkv
```

Decode video and play it through loopback device:

ffmpeg -re -i rickroll.mkv -f v4l2 -s 1280x720 /dev/video9

Check that it looks fine:
```
ffplay /dev/video9
```
Go to Jitsi, Skype, Zoom, whatever, and select the virtual webcam (if you loaded the module like I mentioned above, it should show up as “OBS”). Enjoy.

Note that this only gives you video. We’ll talk about audio later.

Also note the -re flag that we used above: it tells ffmpeg to read the input file at “native frame rate”. Without this option, ffmpeg would read our video as fast as it can, resulting in a very accelerated Rick Astley in the output.

One more thing: if you want to play the video in a loop, add -stream_loop -1. You’re welcome.
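One detail that is easy to get wrong: -stream_loop is an input option, so it has to come before the -i flag. Here is a small sketch that prints (rather than runs) the full command, with or without looping. The function name play_to_webcam_cmd is made up for this example.

```shell
#!/bin/sh
# Print the ffmpeg command that plays FILE on DEVICE at 720p.
# Usage: play_to_webcam_cmd FILE DEVICE [loop]
# (play_to_webcam_cmd is a hypothetical helper; it only prints the command.)
play_to_webcam_cmd() {
  file=$1
  device=$2
  loop=""
  # -stream_loop applies to the input, so it goes before -i
  if [ "${3:-}" = "loop" ]; then
    loop="-stream_loop -1 "
  fi
  echo "ffmpeg ${loop}-re -i $file -f v4l2 -s 1280x720 $device"
}

play_to_webcam_cmd rickroll.mkv /dev/video9 loop
# prints: ffmpeg -stream_loop -1 -re -i rickroll.mkv -f v4l2 -s 1280x720 /dev/video9
```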

Good resolutions

You might have noticed in the example above that I resized the video to 720p. Without the -s 1280x720 we would get full HD, 1080p (1920x1080) output. This is mostly fine (in my experience, Zoom supports it) but many web-based video systems (like Jitsi) limit resolutions to 720p and below. If we hadn’t resized the video, our virtual webcam would advertise a picture size of 1920x1080, and Jitsi would filter it out, and our “OBS” virtual webcam wouldn’t show up. (Thanks to the folks at Mozilla who helped me figure that out by the way. They’ve been incredibly helpful!)

Furthermore, the V4L2 Loopback device only supports one resolution at a time, and it won’t switch as long as there is at least one reader or one writer attached to it. Which means that if you mess around a bit (e.g. if you do some tests with 1080p video and open some WebRTC test page) your browser might keep the video device open, and it would be stuck in 1080p. ffmpeg would still send the video in whatever resolution you tell it, but now that would be invalid (and the ffplay test would yield garbled video output, because ffplay would still see a 1080p device). To troubleshoot that kind of issue, you can run fuser -auv /dev/video9 to see which processes are currently using the file, and restart your browser if necessary.

Sending OBS output to a virtual webcam

If you want to use OBS with regular video conferencing apps, you can use an OBS plugin called obs-v4l2sink. It will add an entry “V4L2 Video Output” to the “Tools” menu in OBS, letting you send OBS video to a V4L2 loopback device. You can then use that loopback device in any app you want.

The resolution limit still applies: if you set up OBS with a resolution of 1080p (or higher), the virtual webcam may not show up in e.g. Jitsi and many other web-based systems. If you plan on using these, change the “canvas size” in OBS to a smaller resolution. (It will also significantly reduce CPU usage, so yay for that!)

Also, note that most video conferencing systems don’t handle 1080p, even when they boast “HD” quality. As explained in part 3 of this series, Zoom will scale down webcam resolution to 720p (or even lower), so sending 1080p output will be completely useless.

This will be very noticeable if you share a browser or terminal window this way. It will appear very pixelated to your viewers, regardless of your available network bandwidth and CPU.

However, if you want to send 1080p output to Zoom and retain HD quality, you can do it by using the “window or desktop sharing” feature of Zoom. When you share a window or desktop, Zoom switches encoding settings to give you an outstanding picture quality, at the expense of the frame rate.

You can try that by telling OBS to preview the video stream in a window of that size (or a screen of that resolution), then use the “share desktop” feature of Zoom to share that window or screen.

Note that if you want to use that desktop sharing trick, you don’t need the V4L2 Video Output plugin, nor the V4L2 Loopback module.

Sending OBS output to a virtual webcam (or to a screen and then capturing that screen) is also a good way to grab that output and then do whatever you want with it. I use GPU acceleration to encode simultaneously 4 different bitrates, send them all to a broadcast server, and save the highest one to disk, for instance. (More on that later.)

Using a DSLR or a phone as a webcam

Now that we familiarized ourselves with V4L2 Loopback, let’s see something more useful that we can do with it.

We’ll see how to use a DSLR as a webcam, and how to use a phone as a webcam.

Using a DSLR as a webcam

I already mentioned this in part 2 of this series. The advantage of using a DSLR as a webcam is that it should have a better sensor, and most importantly, better lenses. If you have a good DSLR but with a basic lens, there is a good chance that it won’t be better than a decent webcam. For instance, I tried with a Canon EOS1100D, known as the Rebel T3 in the US or the Kiss X50 in Japan; and the image was actually worse than with my Logitech webcams. But don’t let that discourage you, especially if you have a good lens kit!

One way to use a DSLR as a webcam is to use the HDMI output, and an HDMI capture device. I already covered that in part 2 of this series, so I will talk about another method here.

Many DSLRs support a “Live View” feature that can be accessed over USB using a fairly standard protocol. This “Live View” feature essentially gives a stream of JPEG pictures … so basically an MJPEG video stream.

To see if your DSLR supports it, connect it over USB and run gphoto2 --abilities. If the output includes “Live View”, it will probably work. Otherwise, it probably won’t.

On the cameras supporting it, all we have to do is:

gphoto2 --stdout --capture-movie \
| ffmpeg -i - -vcodec rawvideo -pix_fmt yuv420p -threads 0 -f v4l2 /dev/video8

And now we can use /dev/video8 in any application expecting a webcam.

The gPhoto remote page has some details, and a long list of cameras indicating if they are supported or not.

Note that the Live View typically has a much lower resolution than the camera. On the EOS 1100D, it was about 720p, so about 1 megapixel; much less than the 12 MP that this camera is capable of.

Using a phone as a webcam

You can also use a phone (or tablet) as a webcam. Here are a few reasons (or excuses) to do that:

you really want an extra camera for your multi-cam setup
your main webcam is broken
webcams are out of stock everywhere
you have a bunch of old phones lying around
you want to place the camera in a location that would make it difficult to connect it to the computer (or the wires would bother you)

In this example, we’re going to use an Android app. I’m pretty sure that similar apps exist on Apple devices, but you’ll have to find them on your own.

The app is IP Webcam. Install it, start it, then at the bottom of the main menu, tap “Start server”. It will show a camera view and a connection URL looking like “IPV4: http://192.168.1.123:8080”.

You can go to that URL and click on “Video renderer: Browser” to check that everything is fine.

The next step is to check with our favorite swiss-army knife if it can read that video stream:

ffplay http://192.168.1.123:8080/video

This should display the video coming from the phone. You might notice that the video starts lagging, and that the lag increases. This is because the app is probably sending at 30 fps, but ffplay thinks this is 25 fps, so it plays a bit slower than it should, and it “gets late”.

One way to address that is to force the frame rate:

ffplay http://192.168.1.123:8080/video -f mjpeg -framerate 30

Another way is to force immediate presentation of frames as they show up:

ffplay http://192.168.1.123:8080/video -vf setpts=0

(The PTS is the “presentation timestamp”, which tells the player when the frame should be displayed. This can be manipulated to achieve slow-motion or accelerated playback. Here, we set it to zero, which apparently has the effect of telling ffplay “omg dude you’re late, you were SO supposed to display that frame, like, forever ago, so do it now and we won’t tell anyone about it!”)

Now we can shove that video stream into our virtual webcam like this:

ffmpeg -i http://192.168.1.123:8080/video -pix_fmt yuv420p -f v4l2 /dev/video9

We don’t need to meddle with the PTS here, because by default, ffmpeg tries to read+convert+write frames as fast as it can, without any concern for their supposed play speed. You might see in the output that it’s operating at “1.2x” because it computes that it’s processing 30 frames per second on a 25 frames per second video stream. Whatever.

Using a phone as a webcam in OBS

With the technique described above, your phone is now a webcam (technically, a V4L2 device) that can be used with any Linux application, including OBS.

But there is another way to use the phone as a webcam, specifically with OBS. The app on the phone is exposing an MJPEG stream that can be shown in a browser. OBS can use a browser as a source. So you can display a browser in OBS, and in that browser, open the camera stream.

You need to install the OBS “linuxbrowser” plugin. (I’ve generally seen it packaged separately.) Then, in OBS, add a source of type “Linux Browser”. Change the URL to be http://192.168.1.123:8080/video (adapt to your phone’s IP address, of course) and you should be set.

I don’t know if this approach is better or worse than going through ffmpeg and V4L2 loopback. While ffmpeg and V4L2 loopback involve a few more steps, it’s likely that their code is a bit more optimized than the one in the OBS Linux Browser; but that’s just a completely unsupported guess from me. Try them and see what works best for you!

All hands on the Stream Deck

Getting the Stream Deck to work on Linux was easy. There is a Stream Deck UI project that just works.

However, getting the Stream Deck to play nice with OBS Studio was a different story. The Stream Deck UI can be configured to generate key strokes, and OBS Studio can be configured to use keyboard shortcuts. However, for unknown reasons, on my machine, OBS doesn’t seem to recognize the key strokes generated by the Stream Deck UI.

To work around the issue, I use obs-websocket, a plugin to allow OBS to be controlled through WebSockets. Then I use obs-websocket-py, a Python client library to interface with that plugin; and a little custom script called owc to invoke that library from the command line. Then I set up the Stream Deck UI to execute that script with the right arguments.

I also use the leglight library and another small custom script called elgatoctl to control my Elgato Key Light from the command line (and, by extension, from the Stream Deck).

Audio

When using OBS, you have at least two options for audio: route it through OBS, or bypass OBS entirely. “Routing audio through OBS” means adding some “Audio Capture” device in our OBS sources. “Bypassing OBS” means that we do not add any audio device in OBS, and use a separate tool to deal with audio.

If we stream or record directly from OBS, we must route audio through OBS, so that the RTMP stream or the recorded file includes audio.

Routing the audio through OBS is useful if you have e.g. some waiting music, or if you want to mute the mic when switching to specific scenes (like a “Let’s take a break, we’ll be right back” scene).

Each time you add an audio source in OBS, you can switch that source between three modes:

Monitor Off (the default)
Monitor only
Monitor and Output

(You can find these modes by right-clicking on the audio source in the “Audio Mixer” section, and selecting “Advanced Audio Properties”.)

“Monitor” means that OBS will play that audio input back to you. This can be useful if you have an audio source playing some background music, for instance. You would then select “Monitor and Output” so that the music goes to the RTMP stream and so that you can hear it locally.

Be careful if you use simultaneously “Monitor”, speakers, and an active mic input: the monitored audio input would play through the speakers and be picked up by the mic.

This whole “Monitor/Output” situation becomes even more confusing when you add PulseAudio to the mix (no pun intended). PulseAudio lets you remap application sounds to different outputs, so that you could e.g. play that song on your speakers but hear that Zoom call on your Bluetooth headset. Each monitored channel shows up separately in PulseAudio. The computer that I use for streaming has 3 audio interfaces and a virtual one. Troubleshooting the whole audio pipeline often turns into a wild goose chase.

Since I don’t really need anything fancy on the audio side (I just stream my voice), it is easier to directly route audio inputs to the system that I use for conferencing or streaming, rather than go through OBS. My mic is plugged into a small USB interface that has a very distinctive name (ATR2USB), so I just pick that one in e.g. Jitsi or Zoom; and when streaming with ffmpeg, I set up ffmpeg to directly grab audio from that interface.

ALSA Loop

One thing that seemed promising was to use a virtual ALSA loopback device. ALSA is a popular API to access audio input/output on Linux. While many applications nowadays use PulseAudio or JACK to access audio interfaces, PulseAudio and JACK themselves generally use ALSA to communicate with the hardware. ALSA is to audio what V4L2 is to video: they’re both low-level interfaces to access respectively audio and video input/output devices.

As you can guess from the title of this subsection, there is an ALSA loop device, and it can be used similarly to the V4L2 loop device.

For instance, we can configure OBS to send its audio to an ALSA loop device, and then use this loop device as a virtual mic in another application.

At least in theory.

In practice, if PulseAudio gets involved, good luck, have fun.

The following command will load the ALSA loopback device:

modprobe snd-aloop index=9 id=Loopback pcm_substreams=1

index=9 means that the loopback device will be ALSA device “card 9”. Again, I pick a number high enough to avoid any conflicts with my existing audio interfaces.

id=Loopback will be the name of the card.

pcm_substreams indicates how many streams you want that card to have. This is in case you want to have multiple cards with multiple streams each. I might be wrong, but I think that this does not correspond to the number of channels. pcm_substreams=1 seems to give you a stereo channel. More research needed here.

After loading the loopback device, you can check that it appears by listing input and output devices:

arecord -l
aplay -l

Since many webcams include microphones, the list of capture devices can get quite large:

$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: PCH [HDA Intel PCH], device 0: ALC298 Analog [ALC298 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: C920 [HD Pro Webcam C920], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 3: StreamCam [Logitech StreamCam], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 4: ATR2USB_1 [ATR2USB], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 9: Loopback [Loopback], device 0: Loopback PCM [Loopback PCM]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 9: Loopback [Loopback], device 1: Loopback PCM [Loopback PCM]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

The name of our device shows up correctly here, but PulseAudio doesn’t use it, and the ALSA loopback device shows up as “Built-In Audio”. Unfortunately, this is exactly the same name as the actual built-in audio interface, so good luck when trying to distinguish them.

But the trickiest part is that we have one card with two devices, each device sending to (and receiving from) the other one. In my setup above, if I send audio to ALSA device hw:9,0 I will be able to read it from hw:9,1, and vice versa.

By default, however, most (if not all) ALSA applications will use the first device (device 0) of a given card. If you want the ALSA loopback device to work for you, you need to either explicitly write to device 1, or read from device 1. Basically, you need to decide if you want to make your life more complicated on the sender (playing audio) side, or the receiver (recording audio) side.
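Since it’s easy to mix up which side is which, here is a tiny sketch that prints the other end of the loopback for a given device. It assumes the hw:CARD,DEVICE notation used above, with DEVICE being 0 or 1; the function name aloop_peer is made up for this example.

```shell
#!/bin/sh
# Given one side of the ALSA loopback (hw:CARD,DEVICE with DEVICE 0 or 1),
# print the side where that audio comes out.
# aloop_peer is a hypothetical helper, not part of ALSA.
aloop_peer() {
  card=${1#hw:}      # strip the "hw:" prefix  -> "9,0"
  card=${card%,*}    # keep the card number    -> "9"
  dev=${1##*,}       # keep the device number  -> "0"
  echo "hw:${card},$(( 1 - dev ))"
}

aloop_peer hw:9,0   # prints hw:9,1
aloop_peer hw:9,1   # prints hw:9,0
```

For instance, if an application plays to hw:9,0, something like arecord -D "$(aloop_peer hw:9,0)" -f cd test.wav should record it (assuming the player uses a matching sample format).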

In my case, if I want to record or stream audio with ffmpeg, it is very easy to tell ffmpeg to read from hw:9,1, so here is what I do:

Rename the audio interface in PulseAudio, so that it shows up in all applications as “Loopback”:

pacmd 'update-sink-proplist alsa_output.platform-snd_aloop.0.analog-stereo device.description="Loopback"'

Open e.g. pavucontrol, find OBS in the “Playback” tab, and assign it to the “Loopback” output.
Test audio playback with ffplay -f alsa hw:9,1, or use it as an input source in ffmpeg with -f alsa -i hw:9,1.

If you want to be able to use the loop device as an input in Jitsi, Zoom, or whatever, you can try the following additional steps:

Manually add the hw:9,1 audio source in PulseAudio:

pactl load-module module-alsa-source device=hw:9,1

Rename it to “Loopback”:

pacmd 'update-source-proplist alsa_input.hw_9_1 device.description="Loopback"'

Now you should see a “Loopback” option along your mics and other audio inputs.

Configuring PulseAudio to automatically load this module and renaming audio inputs and outputs is left as an exercise for the reader.
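For the impatient, here is an untested sketch of what that exercise could look like. It assumes that PulseAudio reads a per-user default.pa (typically ~/.config/pulse/default.pa) instead of the system-wide one when it exists, that the snd-aloop module is already loaded when PulseAudio starts, and that the sink and source names are the same as in the commands above. The function name is made up, and it takes the target path as an argument so that you can inspect the result before putting it in place.

```shell
#!/bin/sh
# Write a per-user PulseAudio startup script that sets up the loopback.
# write_pulse_loopback_config is a hypothetical helper; the device and
# sink/source names are the ones used earlier and may differ on your system.
write_pulse_loopback_config() {
  conf=$1
  mkdir -p "$(dirname "$conf")"
  cat > "$conf" <<'EOF'
.include /etc/pulse/default.pa
.nofail
load-module module-alsa-source device=hw:9,1
update-source-proplist alsa_input.hw_9_1 device.description="Loopback"
update-sink-proplist alsa_output.platform-snd_aloop.0.analog-stereo device.description="Loopback"
.fail
EOF
}

# Write to a scratch location first and have a look
scratch=$(mktemp -d)
write_pulse_loopback_config "$scratch/default.pa"
cat "$scratch/default.pa"
```

The .nofail directive is there so that PulseAudio still starts if the loopback card is missing.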

NVIDIA

If you have an NVIDIA GPU in your streaming machine, it can be advantageous to use it to leverage hardware-accelerated encoding (NVENC). In theory, this is great, because NVIDIA GPUs are very efficient with that. In practice, proprietary drivers get involved, and it’s a tire fire, especially when using multiple monitors. (The monitors have nothing to do with hardware encoding; but proprietary drivers mess things up pretty quickly in that regard.)

Disclaimer: it took me a while to find a combination of drivers, options, video modes, etc. that would work and be stable and not randomly screw everything up when rebooting or when suspending and resuming. At one point, it took me 20 minutes of rebooting, connecting/disconnecting screens, trying various tools (between e.g. xrandr and the NVIDIA control panel) until I got all my screens working properly, because the procedure that I had used successfully multiple times before just didn’t work anymore. My overall impression is that while the NVIDIA hardware is pretty impressive, the software is at the other end of the spectrum. That being said, if you’re not afraid of wrestling with proprietary drivers and dealing with a lot of nonsense, let’s see what we can do.

Handling multiple monitors

When streaming, I use at least one external monitor, preferably two (in addition to the internal monitor of the laptop that I’m using). The laptop is a ThinkPad P51 with a docking station. Both monitors are connected to the docking station over DisplayPort.

On Linux, the usual way to deal with multi-monitor setups is to use the RandR extension (unless you use Wayland, but I’m going to leave that aside). There is a command-line utility xrandr, a crude but effective GUI called arandr, and you can use a tool like autorandr to automatically switch modes when screens are connected and disconnected.

This works alright when using the NVIDIA open source drivers (nouveau), but in my experience, that doesn’t work very well when using the proprietary drivers. It looks like the proprietary drivers try to implement the RandR extension, but will fail randomly. At some point, I had almost found a particular xrandr command line that would work after booting the machine, and another one that would work when resuming after suspend - most of the time.

The most reliable method involves using nvidia-settings in CLI mode to set NVIDIA MetaModes. MetaModes are NVIDIA’s proprietary way to handle multiple monitors.

This is what I’m using to scale the internal 4K LCD to 1080p, and add one external monitor above the LCD, and another one to the right:

TRANSFORM="Transform=(0.500000,0.000000,0.000000,0.000000,0.500000,0.000000,0.000000,0.000000,1.000000)"
LCD="DPY-4: nvidia-auto-select @1920x1080 +0+1080 {ViewPortIn=1920x1080, ViewPortOut=3840x2160+0+0, $TRANSFORM, ResamplingMethod=Bilinear}"
RIGHT="DPY-0: nvidia-auto-select @1920x1080 +1920+0 {ViewPortIn=1920x1080, ViewPortOut=1920x1080+0+0}" 
TOP="DPY-1: nvidia-auto-select @1920x1080 +0+0 {ViewPortIn=1920x1080, ViewPortOut=1920x1080+0+0}"
nvidia-settings --assign CurrentMetaMode="$LCD, $TOP, $RIGHT"

This seems to work in a deterministic way. I think I had to restart the X session once because one screen didn’t come up, but otherwise this has been robust enough for my needs.
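If you have to maintain several layouts, it can help to generate the MetaMode string instead of editing it by hand. The following is only a sketch of how that could look in Python; the helper names (scale_transform, display) are made up for this example, and it assumes that the mode size and ViewPortIn are always the same, as they are in the layout above.

```python
def scale_transform(factor):
    """Build the Transform attribute for a uniform scale (0.5 = 4K shown as 1080p)."""
    matrix = [factor, 0.0, 0.0,
              0.0, factor, 0.0,
              0.0, 0.0, 1.0]
    return "Transform=(" + ",".join(f"{value:.6f}" for value in matrix) + ")"


def display(name, offset, view_in, view_out, extra=()):
    """Format one display entry of a MetaMode (hypothetical helper)."""
    attrs = [f"ViewPortIn={view_in}", f"ViewPortOut={view_out}+0+0", *extra]
    return f"{name}: nvidia-auto-select @{view_in} {offset} {{" + ", ".join(attrs) + "}"


# Same layout as above: scaled internal LCD, one monitor on top, one to the right.
lcd = display("DPY-4", "+0+1080", "1920x1080", "3840x2160",
              extra=(scale_transform(0.5), "ResamplingMethod=Bilinear"))
right = display("DPY-0", "+1920+0", "1920x1080", "1920x1080")
top = display("DPY-1", "+0+0", "1920x1080", "1920x1080")

metamode = ", ".join([lcd, top, right])
argv = ["nvidia-settings", "--assign", f"CurrentMetaMode={metamode}"]
print(" ".join(argv))
# To apply it for real: import subprocess; subprocess.run(argv, check=True)
```

The DPY-n names are specific to my machine; yours will probably differ.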

NVENC

After installing the proprietary drivers, make sure that you also have libnvidia-encode.so on your system. Then, if you already know how to use ffmpeg, all you have to do is replace -c:v libx264 with -c:v h264_nvenc and you’re pretty much set.
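Before swapping encoders, you may want to check that your ffmpeg build knows about h264_nvenc at all. Here is a small Python sketch that parses the output of ffmpeg -encoders; the function names are made up for this example, and it assumes that the encoder name is the second whitespace-separated field on each line. Being listed only means that ffmpeg was compiled with NVENC support, not that the driver side works.

```python
import subprocess


def has_encoder(encoders_output, name):
    """Return True if `name` is listed in the text printed by `ffmpeg -encoders`."""
    for line in encoders_output.splitlines():
        fields = line.split()
        # Encoder lines look roughly like: " V....D h264_nvenc  NVIDIA NVENC H.264 encoder"
        if len(fields) >= 2 and fields[1] == name:
            return True
    return False


def ffmpeg_has_encoder(name, ffmpeg="ffmpeg"):
    """Run ffmpeg and check its encoder list (requires ffmpeg to be installed)."""
    result = subprocess.run([ffmpeg, "-hide_banner", "-encoders"],
                            capture_output=True, text=True, check=True)
    return has_encoder(result.stdout, name)
```

Calling ffmpeg_has_encoder("h264_nvenc") on the streaming machine should then tell you whether the -c:v h264_nvenc option has a chance of working.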

Explaining the intricacies of H264 codecs, with profiles, tuning, bit rates, etc. would be beyond the scope of this article; but just for reference, here is the ffmpeg command line that I’m using:

ffmpeg \
  -thread_queue_size 1024 -f alsa -ac 2 -i hw:4,0 \
  -thread_queue_size 1024 -f v4l2 -frame_size 1920x1080 -framerate 30 -i /dev/video9 \
  -c:a aac \
  -map 0:a:0 -ac:a:0 1 -b:a:0 128k \
  -map 0:a:0 -ac:a:1 1 -b:a:1 64k \
  -map 0:a:0 -ac:a:2 1 -b:a:2 48k \
  -c:v h264_nvenc -preset ll -profile:v baseline -rc cbr_ld_hq \
  -filter_complex format=yuv420p,split=2[s1][30fps];[s1]fps=fps=15[15fps];[30fps]split=2[30fps1][30fps2];[15fps]split=2[15fps1][15fps2] \
  -map [30fps1] -b:v:0 3800k -maxrate:v:0 3800k -bufsize:v:0 3800k -g 30 \
  -map [30fps2] -b:v:1 1800k -maxrate:v:1 1800k -bufsize:v:1 1800k -g 30 \
  -map [15fps1] -b:v:2 900k -maxrate:v:2 900k -bufsize:v:2 900k -g 30 \
  -map [15fps2] -b:v:3 400k -maxrate:v:3 400k -bufsize:v:3 400k -g 30 \
  -f tee -flags +global_header \
  [f=mpegts:select=\'a:0,v:0\']udp://10.0.0.20:1234|[select=\'a:0,v:0\']recordings/YYYY-MM-DD_HH:MM:SS.mkv|[f=flv:select=\'a:0,v:0\']rtmp://1.2.3.4/live/4000k|[f=flv:select=\'a:0,v:1\']rtmp://1.2.3.4/live/2000k|[f=flv:select=\'a:1,v:2\']rtmp://1.2.3.4/live/1000k|[f=flv:select=\'a:2,v:3\']rtmp://1.2.3.4/live/500k

(Sorry about these two very long lines; in practice, I build them one piece at a time using a script.)
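To give an idea of what "one piece at a time" can look like, here is a Python sketch that assembles the per-rendition video options and the tee output specification from the command above. It is not my actual script; the helper names (video_args, tee_output) are invented for this example. Note that when the arguments are passed as a list rather than through a shell, the quotes around the select value don't need backslashes.

```python
def video_args(index, label, kbps, gop=30):
    """Options for one video rendition fed by a filter_complex output label."""
    rate = f"{kbps}k"
    return ["-map", f"[{label}]",
            f"-b:v:{index}", rate,
            f"-maxrate:v:{index}", rate,
            f"-bufsize:v:{index}", rate,
            "-g", str(gop)]


def tee_output(url, audio, video, fmt=None):
    """One entry of the tee muxer: pick one audio and one video stream for `url`."""
    options = []
    if fmt:
        options.append(f"f={fmt}")
    # The quotes protect the ':' and ',' inside the stream specifiers.
    options.append(f"select='a:{audio},v:{video}'")
    return "[" + ":".join(options) + "]" + url


renditions = [("30fps1", 3800), ("30fps2", 1800), ("15fps1", 900), ("15fps2", 400)]
video = []
for index, (label, kbps) in enumerate(renditions):
    video += video_args(index, label, kbps)

tee_spec = "|".join([
    tee_output("udp://10.0.0.20:1234", 0, 0, fmt="mpegts"),
    tee_output("recordings/YYYY-MM-DD_HH:MM:SS.mkv", 0, 0),
    tee_output("rtmp://1.2.3.4/live/4000k", 0, 0, fmt="flv"),
    tee_output("rtmp://1.2.3.4/live/2000k", 0, 1, fmt="flv"),
    tee_output("rtmp://1.2.3.4/live/1000k", 1, 2, fmt="flv"),
    tee_output("rtmp://1.2.3.4/live/500k", 2, 3, fmt="flv"),
])
print(tee_spec)
```

The resulting lists can then be concatenated with the input and audio options and handed to subprocess.run.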

This will:

acquire audio over ALSA device hw:4,0
acquire video over the V4L2 device /dev/video9 (the virtual webcam that comes out of OBS)
encode audio in mono (it’s just my voice, so I don’t care about stereo) with 3 bit rates, providing high / medium / low quality
encode video with 4 bit rates, using NVENC hardware encoding
record the highest audio and highest video bit rates to a local file
also send that high quality output over UDP to another machine on my LAN (10.0.0.20 in that example)
generate 4 streams and send them over RTMP to a remote broadcast server (1.2.3.4 in that example)

The encoder is tuned here for low latency: it’s using the Constrained Baseline Profile to disable B-frames. The two highest bit rates have 30 frames per second and a 1-second GOP size. The two lowest bit rates are reduced to 15 frames per second and use a 2-second GOP size. (This means higher latency when using HLS streaming, but significantly improves quality.)
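The GOP durations follow directly from the -g 30 option and the frame rate of each rendition. As a quick check of the arithmetic (the function name is just for illustration):

```python
def gop_seconds(gop_frames, fps):
    """Duration of one GOP: number of frames between keyframes divided by frame rate."""
    return gop_frames / fps


for label, fps in [("30 fps renditions", 30), ("15 fps renditions", 15)]:
    print(f"{label}: -g 30 gives a {gop_seconds(30, fps):.0f}-second GOP")
```

Since HLS segments need to start on a keyframe, a longer GOP generally means longer segments, hence the extra latency.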

You can view GPU usage with the nvidia-smi CLI tool.

On the other machine on my LAN (10.0.0.20), this is how I receive the stream and “feed” it into a virtual webcam for use with Jitsi, Zoom, etc. as well as an ALSA loop device as described above:

ffmpeg -f mpegts -i udp://0.0.0.0:1234 \
  -f v4l2 -s 1280x720 /dev/video9 \
  -f alsa -ac 2 -ar 44100 hw:9,1

(The -ac 2 lets us go back to stereo channels, and -ar 44100 resamples to 44.1 kHz, which may or may not be necessary, but ffmpeg does a much better job than PulseAudio when it comes to resampling audio.)

NVIDIA patch

One last thing about NVENC: NVIDIA artificially limits the number of streams that you can encode in parallel. For instance, with my GPU, I could get 3 streams, but when I added the 4th one, I got an error message similar to OpenEncodeSessionEx failed: out of memory (10).

However, nvidia-smi said that I still had lots of available memory (I had about 800 MB of GPU memory used out of 4 GB).

Surprisingly, according to the NVENC support matrix, my GPU (a Quadro M2200 / GM206) is supposed to be “unrestricted”. I suppose that the detection code is broken, or that the information on the NVIDIA website is invalid.

Still, since this is just a software restriction, you can remove it by using nvidia-patch, which will patch your NVIDIA drivers to remove the limitation.

(If you’re nervous about running nvidia-patch as root, you can easily confirm for yourself that it’s only touching the NVIDIA .so files, and it’s not replacing them with new versions, but actually patching the binary code with sed. While it’s not technically impossible that this would result in turning your GPU into a pumpkin, it’s probably okay-ish.)

Docker

I run all the things described above in Docker containers. This gives me a way to move the whole setup (including the OBS plugins, which need to be compiled) to another machine relatively easily.

There are many tricks involved, and I wouldn’t recommend that to anyone, except if you already have experience running desktop applications in containers.

The Dockerfiles and the Compose file that I use are available in my obs-docker repository on GitHub.

Wrapping up

There are many quirks and tweaks involved to get a good streaming setup up and running with Linux. I don’t know if things are easier on other systems. It’s generally easier to get started on a Mac or Windows system; but customization is also harder (or downright impossible).

For instance, hardware encoding on a Mac seems pretty random. Windows obviously has first-class support for NVIDIA GPUs, but if you want to encode multiple bit rates at the same time like I do, you probably have to break out ffmpeg anyway.

If you’re exploring the possibilities of streaming content from Linux, I hope that this article could give you some useful information!

Streaming tech talks and training / To OBS or not to OBS

2020-06-04T00:00:00+00:00

In this article, I’ll talk about the various services and tools that I tried to stream my presentations. I’m going to talk about OBS Studio, why and how I use it. I will also review a bunch of video conferencing and streaming platforms like Jitsi, Twitch, YouTube, Zoom.

This section should be relevant regardless of your operating system (i.e. applicable to Linux, Mac, or Windows), while part 4 will dive into everything specific to Linux. For context, please check part 1!

OBS Studio

When I started to look at what people were using to stream (whether it’s games, educational content, whatever), I saw OBS Studio coming up a lot. OBS stands for Open Broadcaster Software, and that’s exactly what it is.

I imagine that when you make a live TV show, you have possibly multiple cameras, mics, and a kind of mixer that lets you pick which camera you want to show at a given time; perhaps show multiple things at the same time (“picture-in-picture”), or add banners, titles, effects, and so on. OBS does exactly that, entirely in software.

You can arrange multiple “sources” (cameras, images, pre-recorded videos, text…) into “scenes”. Then you can switch between scenes just by pushing a button. If that sounds confusing, you can check the video of my talk, Troubleshooting Troublesome Pods, for an example. (Keep in mind that this was one of my first talks using OBS, and I was still getting used to it, working on the transitions, streaming quality, etc.)

Why OBS

So, why would someone want to use something like that, instead of just sharing their webcam and screen?

I’m going to give you a very personal answer. You’re welcome to disagree (strongly) with it.

It’s very difficult to keep an audience engaged, especially through a video. That’s why TED Talks are only 18 minutes, and that duration isn’t random, it was determined by science. My technical workshops and training courses are way, way longer than that. Over the years, I learned (consciously or not) a lot of techniques to be as engaging as possible and keep my students interested. Many of these techniques do not work for video content. For instance, walking on the stage, pointing at things (physically, with my arms and hands) on the screen. Projecting my voice to different parts of the room. The overall body language.

I want my training sessions to be successful, and that means keeping people interested. And it’s not just their responsibility, but also mine. Some folks can keep their attention on a screen share. I can do it for maybe 10 minutes, but certainly not for hours. This means deploying many new tricks and techniques. Dynamic video content is one of them. It’s obviously not the only one; and it doesn’t work the same way for everyone.

In my case, that means that I want to be able to switch between multiple cameras: one showing the upper half of my body (I present standing), typically when addressing the audience and showing slides; and the other one showing just my head, when running demos. So I need a way to efficiently put these things together and switch between views. That’s OBS.

OBS workflow overview

OBS works on Linux, Mac, and Windows, and the interface is virtually the same on all three platforms. You can use it with your webcam (or webcams, if you have multiple ones), mics. You can share your screen (or individual windows) with it. It supports live video effects (like chroma key or “green screen”).

When using OBS, you define one or multiple “scenes” (I will tell you the ones I use a bit later) and then you can output your video+audio feed in two ways:

by sending it over RTMP, a protocol very popular with virtually all streaming services including Twitch, YouTube, etc.;
by recording it to a local file.

As you can see, this doesn’t include familiar stuff like Skype, Zoom, etc., but there are ways to make it work, including:

showing the OBS preview (your live video) on a screen and sharing that screen,
using a virtual webcam plugin for OBS.

The first option works great if you have an extra monitor (it could be a virtual one if you know how to set that up), but will typically use a lot of CPU resources and may not always give you the best results (more on that later).

The second option should make OBS work with any system that can use any webcam. The “virtual webcam” setup will depend on your platform (it works differently on Linux, Mac, Windows).

My OBS scenes

I continuously tweak and iterate on this, but at the moment, I am using:

pre-roll,
slides with camera,
slides without camera,
fullscreen with camera,
fullscreen without camera,
break.

Pre-roll and break show a video in a loop, with a big countdown indicating when we will start (for the pre-roll) or when we will resume (for the break).

This is now the “slides with camera” scene:

I use that one when I don’t necessarily need the full resolution of the screen, and I want my body language to be visible. This is great for slides and diagrams, for instance. (My slides use very big text, so it’s generally not an issue if they only take a part of the screen.)

As you can see, that scene also shows important links. This is useful, because when people join, they connect to the video stream, but they don’t always have access to the other links (slides, chat room, etc.) so I found that it was helpful to have these links on screen regularly.

I have a similar scene without the camera, which I rarely use.

This is now the “fullscreen with camera” scene:

This is great when showing a text mode terminal, web browser, or anything where I need the full resolution of my screen and the full “real estate” of the stream; but I keep my head in a corner. And there is the same scene, without my head - because sometimes there is something important in that part of the screen.

I’m using a “mask” effect on that camera (the hexagon shape on the example above). It’s a tiny little detail, but it’s more pleasant to the eye, and when I show my slides using that scene, the slide number is in the top right corner. The mask lets the slide number show up.

Green screens

You might wonder why I’m not using a green screen. I do have a green screen, but as soon as you try to use multiple angles, it gets tricky to have the green screen as a consistent backdrop against all possible angles.

I personally think that it’s better to have multiple angles, rather than the transparent background effect that the green screen offers.

Countdowns

OBS lets you show text either as a “constant” (you define the text once and for all) or by pulling it from a file. In that case, it will periodically re-read that file and update the text. I have a Python script that runs in a loop and continuously updates a text file with the countdown, and then I set OBS to show that text file in the countdown scenes.
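A countdown script of this kind can be very small. The following is a minimal sketch, not the exact script that I use: the function names and the MM:SS format are choices made for this example. It writes to a temporary file and renames it, so that OBS should never read a half-written file.

```python
import math
import os
import time


def format_remaining(seconds):
    """Format a number of seconds as MM:SS, never going below 00:00."""
    seconds = max(0, math.ceil(seconds))
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def write_text(path, text):
    """Replace the file atomically: write a temporary file, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
    os.replace(tmp, path)


def run_countdown(path, duration, tick=0.5, clock=time.monotonic, sleep=time.sleep):
    """Update `path` with the time left until `duration` seconds have elapsed."""
    deadline = clock() + duration
    while True:
        remaining = deadline - clock()
        write_text(path, format_remaining(remaining))
        if remaining <= 0:
            break
        sleep(tick)
```

For a five-minute break, you would call something like run_countdown("countdown.txt", 5 * 60) and point the OBS text source at countdown.txt (a file name picked for this example).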

Switching scenes

You can switch scenes by clicking in the OBS interface, or with keyboard shortcuts. I am using a Stream Deck that sits next to my screen, and gives me buttons for each scene. The Stream Deck also has buttons to “start a five-minute break” as well as to add or subtract one minute from the break time (so that I can adjust the break duration in a pinch).

Studio mode

OBS also has a “studio mode” that lets you show a scene while you edit another one. This is great to prepare a “next shot” backstage, and then activate it. This sounds amazing to achieve something even more dynamic, but I imagine that it requires at least two people: one in front of the camera, another one behind (or rather, in front of the OBS interface, with their attention fully dedicated to it). I haven’t used it yet.

Quirks

I’m pretty happy about OBS, but there are also some downsides.

I’m going to list some of them here. That way, if one of them is a dealbreaker for you, you will know!

Out of the box, OBS can only stream over RTMP. As said above, most streaming sites support that, so that’s great; but if you want to use it for your video calls, you will have to install an extra plugin or do some hacks, as mentioned above.

It can’t stream to multiple destinations at the same time. Sometimes, this would be very convenient. Again, there are hacks to do that anyway if you need to.

The text features are “OK but not great”. If you are streaming in HD, you will want to use a ridiculously high font size, otherwise the text will show pixels. Since most fonts are vector-based these days, it would be great if it could handle that better. It would also be amazing to be able to change the color of the shadow, or put a backdrop, behind text.

When you get disconnected from the server to which you’re streaming, sometimes it will gracefully recover, but sometimes it will also remain stuck and you will have to quit and restart it.

It doesn’t refer to sources in a consistent way. On Linux, for instance, it will refer to webcams using their device nodes (something like /dev/video0, /dev/video4, etc.) and when you connect / disconnect cameras, these numbers can change. The cameras are then all messed up in OBS and you need to reassign them. It’s not a huge deal but I find it mildly annoying with just 2 cameras (3 if we count the internal webcam of the laptop, which I’m not using), so I imagine that it could get really obnoxious with lots of cameras. I’ve seen similar complaints from folks using it on a Mac, when their device names change for some reason, they have to re-add them to OBS.
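On Linux, udev usually creates stable symlinks for cameras under /dev/v4l/by-id, which at least makes it easier to figure out which /dev/videoN node belongs to which camera after replugging. Here is a small Python sketch that lists them; the function name is made up for this example, and whether OBS itself can be pointed at these stable paths is not something I have verified.

```python
import os


def stable_camera_names(by_id_dir="/dev/v4l/by-id"):
    """Map each stable symlink name to the device node it currently points to."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping  # no cameras connected, or no udev rules for V4L devices
    for name in sorted(os.listdir(by_id_dir)):
        mapping[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return mapping


if __name__ == "__main__":
    for name, node in stable_camera_names().items():
        print(f"{node}  <-  {name}")
```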

Not really a quirk, but: keep in mind that OBS (and the associated protocols and services) is more complex than just firing up Zoom and going. The results can be amazing, but you should be prepared to spend some time figuring it out. See for yourself if you think it’s worth the investment. In particular, if your goal is high quality (like 30 fps, full HD video), you will need some good hardware for encoding, and perhaps learn about video codecs and tuning. This is a whole other can of worms.

Broadcasting our content

Now that we’ve talked about OBS, let’s talk about how we get that precious video and audio content to our viewers.

Video calls vs streaming

First, let’s start with some general considerations.

From both a practical and technical standpoint, there are two kinds of systems: video calls, and streaming.

Video calls are real time (or almost real time, with typically less than half a second of delay, which is imperceptible, except in some specific scenarios, for instance if you try to perform live music with other people). There can be multiple participants sending audio or video at the same time, meaning that it’s possible to interact directly with the presenter. Most platforms accommodate dozens of viewers, some of them can even do hundreds.

Streaming is generally one person (or a very small group) sending to a larger audience. Since audio and video flow in one direction only, interaction requires a separate channel, like a live text chat or separate Q&A app. Most streaming platforms can accommodate thousands of viewers, and some of them will scale to millions of viewers. This is achieved by using very different protocols and techniques, which come with a higher latency. The “glass to glass delay” (the delay between the moment when you say or show something, and the moment when your audience hears or sees it) will be a few seconds in the best case scenario, but typically at least 20-30 seconds. The delay is acceptable to address questions as they come, but makes it harder to do a quick “show of hands”, or generally speaking, to ask a question to the audience and immediately react to it. Finally, streaming tends to offer better quality, because the longer delay allows the use of more efficient encoding and distribution mechanisms, in particular for viewers with slower connections.

Instinctively, a video call is great for a smaller, trusted audience. It lets you re-create the level of interaction that you could expect from a traditional in-person meet-up, or a mature classroom.

Streaming is great for a larger audience. It’s also less prone to trolling, heckling, or Zoombombing, since the audience cannot speak or show themselves. (They can still troll or harass through the Q&A or chat platform when there is one, though.) It re-creates something more similar to a large college amphitheater or conference talk.

If you’re wondering about technical differences: video calls transmit data directly, or with very few intermediaries. They can use a whole range of protocols, including proprietary and custom ones. On the other hand, streaming is generally done within a web browser, and will often use protocols like HLS or DASH, which break down the content into very short segments (a few seconds each) that are then played back to back by the client. These short segments are normal static files that can be distributed efficiently by a CDN. The whole process introduces the delay mentioned above, since content now needs to be transcoded, sliced, pushed to a CDN, buffered on the receiving side. The codecs used are also different, or tuned differently. Some codecs like the popular H264 can yield higher quality when they can “look ahead” at incoming frames, but that introduces extra latency. (I’m simplifying a lot of things here, but I hope this helps to understand why there is such a dramatic difference between the two approaches.)
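To get a feel for where the delay comes from, here is a back-of-the-envelope model in Python. It is a deliberate oversimplification resting on two assumptions of mine, not on measurements: that the player buffers about three segments before it starts playing, and that encoding, slicing and CDN distribution add a couple of seconds on top.

```python
def estimated_segment_delay(segment_seconds, buffered_segments=3, pipeline_seconds=2.0):
    """Rough glass-to-glass delay for segment-based streaming (HLS, DASH).

    buffered_segments and pipeline_seconds are assumed values, not measurements.
    """
    return segment_seconds * buffered_segments + pipeline_seconds


for segment in (2, 6, 10):
    delay = estimated_segment_delay(segment)
    print(f"{segment}-second segments: about {delay:.0f} seconds behind live")
```

With segments of 6 to 10 seconds, this lands in the 20-30 seconds range mentioned earlier; getting well below that means shorter segments, smaller buffers, or a different protocol altogether.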

My specific requirements

I’m going to give you a list of platforms and services that I tried. Again, this list is by no means exhaustive, and keep in mind that my needs are certainly very different from yours, so our final choices will certainly differ.

For reference, here is the use-case that I’m optimizing for.

I’m delivering tech training that spans multiple hours, with an audience of 10-100 people.
I want people to be able to see my face so that things remain as engaging as possible.
I don’t need my face to be in super high resolution.
I also want to be able to show my screen, with slides, text terminals, web browser.
These things, however, need to be as clean as possible. I am used to zooming text when needed (since I usually present on a video projector) but blurry text with compression artefacts can be tiresome to read.
I want latency to remain small so that I can easily interact with the audience, ask them questions, react to their answers.
I also want to record what I’m doing so that the audience can get a high quality replay.

I do not need to stream to hundreds, or thousands, of people.

I do not need to bring another speaker on the virtual stage (at least not at this point).

YouTube

I haven’t used it directly myself, but I’ve been on multiple shows, live podcasts, etc., that were streamed to YouTube.

I found the latency to be very high and ruled it out for my work. I’m aware that there are settings to supposedly reduce the latency, but I haven’t tried them. I couldn’t find an official document telling what would be the typical latency to expect; just individual statements mentioning anything from 1.5s (which would be great!) to 15s (which would be less great).

Google also has a reputation for changing how its products work over time, or even discontinuing them, so I didn’t want to invest much time or effort into investigating that. (For instance, there seems to be a whole thing around a “new” vs “classic” interface, with lots of people asking how to do things that they used to be able to and can’t find how to do anymore. That didn’t bode well.) However, if you have a great experience with YouTube streaming, don’t hesitate to let me know!

Jitsi

Jitsi is an open source video conferencing system. You can deploy it on your own servers, and there is also a free option, Jitsi Meet.

During my workshops, I typically switch between three different windows:

a web browser showing my slides,
a terminal where I run demos,
another web browser to show the result of these demos.

I thought that I could come up with something with Jitsi, where I would share these three windows + my webcam as 4 separate streams, allowing the viewers to pick what they wanted to see, and how they wanted to see it, at any given time.

Unfortunately, that didn’t turn out to be practical. Jitsi is fantastic if you want something that works “right here right now”, without having to install a program: it works in modern web browsers, using the WebRTC framework. However, sharing multiple windows turned out to be very CPU-intensive, and the quality wasn’t there. It was also inconvenient for the viewers. Overall, Jitsi is great for what it does (video calls) but not for my use-case.

I still plan on using it to provide live interaction with the students to promote a “classroom” kind of atmosphere.

Zoom

At this point, if you’re reading this article but haven’t heard about Zoom yet, I don’t know under which kind of rock you’ve been living :)

What you may or may not know is that Zoom has two products: Zoom Meetings and Zoom Webinars. Meetings are video calls (the one that you probably love or hate), Webinars look more like streaming: you’re the only one to present (optionally with co-hosts), there is a tiny bit extra latency (but barely), and the quality seems to be a bit more robust for the audience.

I discovered another difference between Meetings and Webinars. In Meetings, the audience can interact with you with “non verbal communication cues”. There are buttons to indicate “yes”, “no”, “faster”, “slower”, “I need a break”, that kind of thing. In Webinars, there is only a button to raise hand.

Zoom is great for live video calls. In my experience, it does really well on slow or unreliable network connections. It also makes it super easy to switch cameras and mics. The screen sharing has a really high quality (more on that later). On the downside, there have been stories in the news highlighting security concerns. I have opinions about that, but they are not relevant to the present conversation, so I will leave them aside. And more importantly, it has other issues that make it inferior for my use-case. I’ll talk about them now.

The Zoom chat

Zoom has an integrated chat. It’s convenient if you just need to paste some information to someone, like a URL or short command to type or error message. However, it lacks:

proper formatting (not just bold and italics, but most importantly, the ability to have monospaced code blocks; or even better, syntax highlighting),
an easy way to highlight someone,
efficient scrolling when there are lots of messages,
a better way to notice when a message is addressed to the whole audience vs just you.

You might think, “whoa, that guy seems picky about their chat room!” and you wouldn’t be wrong. But as it turns out, I regularly use Gitter when delivering workshops and training sessions, and it’s a completely different experience. It addresses all the shortcomings mentioned above, and when I polled training participants, they universally preferred Gitter. I will talk again about it later.

Zoom video codecs and tradeoffs

Zoom does something extremely smart with video codecs. When you share your webcam, it uses an average quality video encoding with low latency and a good frame rate. When you share your screen, it uses a very high quality video encoding, but with a much lower frame rate.

This is great for most people who want to share their screen (with slides, demos, whatever) and show their face as two separate streams. However, as mentioned above, I use OBS Studio to create a single video stream that alternates between my face, the slides, me next to the slides, etc.

There are at least two ways (that I’m aware of) to send my video to Zoom.

The first method is to use a virtual webcam. OBS sends my fancy video to the virtual webcam, and it shows up in Zoom (or in any other app for that matter). Unfortunately, this degrades the video quality: since Zoom “thinks” that I’m sharing a webcam, it’s using a lower quality video encoding. It’s not really visible when seeing someone’s face in a video call, but it becomes very apparent when sharing a terminal or browser.

The second method is to share a screen. The trick that I use is to get OBS to show the video output on a dedicated screen, then use Zoom’s “desktop sharing” on that specific screen. The quality is then crystal-clear, but the frame rate drops significantly, and it becomes very noticeable when I am visible on screen.

Zoom little details

When sharing a screen, the Zoom controls are always visible on that screen, and you may or may not be able to hide them. I couldn’t find a way to completely hide them, so what I do is that I move them off screen. (With a minor annoyance, though: my streaming setup has 3 screens, and for some reason, I cannot move the Zoom controls to the control screen, which is the only one hidden from the audience; so instead, I move them to the side, in a way that they are 90% off screen, but they still partly show up.)

One last thing: when sharing your desktop with Zoom, it uses a rather smart privacy feature that will grey out its own windows. For instance, if someone sends you a message through Zoom, and that message shows up on the desktop that is shared with the audience, they won’t see it: they will see a greyed out window instead. I imagine that this is pretty useful if someone sends you some private information (like a password) or some profanity, to prevent it from being seen by the audience.

Discord

It might surprise you to see Discord here. If you haven’t heard about Discord before, some people describe it as “Slack for gamers”. It has excellent audio and video sharing capabilities. I’ve seen and heard lots of people dismissing it on the grounds that it’s “for games”, but it looks promising. I haven’t had the opportunity to use it for a workshop or training yet, but I hope to try it at some point in the future.

In particular, I wish all the communities and groups out there that are systematically deploying Slack to provide chat communication would consider something like Discord. It seems to be using an order of magnitude less resources, and it doesn’t require you to create one separate account for each “team” (community, company, group…) that you want to join. But I digress!

Twitch

After watching some folks stream on Twitch, I was impressed by the video quality (and the fact that for the audience, it just works in a web browser), so I decided to try it out.

It is very straightforward to set up. Note that while the audience doesn’t need anything special, you need to send your video as an RTMP stream. In practice, that means using something like OBS Studio. (There are tons of other options, of course.)

And indeed, the quality was great. But!

There are a few downsides that you might want to consider.

First, there is no way to make a private stream on Twitch. You can kind of work around this by creating a new user for each stream, with a weird name like validcowgeneratorpotato, and rely on the fact that nobody will find it; but … it’s far from perfect, and while I don’t know if it breaks Twitch’s user agreements, it’s probably not what they have in mind!

More importantly, Twitch will probably not transcode your stream. Transcoding is the action of decoding and re-encoding your stream, generally with different (lower) bitrates and resolutions.

This means that if you stream at e.g. 2500 kb/s, your viewers will all receive a 2500 kb/s stream. This is great if they do have that capacity, because it will guarantee that they get the best possible stream (or at least, the exact quality that you’re sending). But if someone has a slower connection, they’ll be out of luck and there isn’t anything that you can do about it.

Twitch will offer transcoding if you are a “partner”, and might offer it (depending on available capacity) if you are an “affiliate” or even a regular user. (You can find more details on Twitch’s affiliate program page.)

This makes Twitch suitable for public events (and for regular streaming), but not for private workshops or training sessions.

I wish their technology was available by paying them, though, because I found it awesome.

Other streaming services

I also tried a few other streaming services. Generally speaking, the quality was great, but the latency was too high for my needs. (I typically had 20-30 seconds of latency.)

These platforms are designed for massive streaming to audiences of thousands or even millions of viewers, so they’re optimizing along different angles, of course.

Here are some very brief notes on the ones I tried.

Dailymotion

Super easy to set up once your account gets approved. I really liked the straightforward, “no-nonsense” interface. There aren’t a lot of things to tune or tweak, but at least I didn’t waste hours trying to fit a square peg in a round hole.

Wowza Streaming Cloud

The setup is relatively easy. However, there are lots of moving parts. It looks like you can customize a lot of things, but when I tried to reduce latency, I quickly got myself in situations where I was wondering “is this going to work, or blow up in my face?”

AWS Elemental MediaLive and MediaPackage

The setup was relatively hard, even for someone familiar with both the AWS ecosystem and the general streaming/encoding lingo. If you follow the docs and tutorials step by step, it’s easy to get something that works, but as soon as I tried to tweak things, I got myself in corners where it wouldn’t work and would give me rather obscure error messages.

Ant Media Server

I ended up trying Ant Media Server, because it promised “ultra low latency, 4k, 60fps streaming for thousands of viewers”. To be clear, I don’t care about 4k and 60fps, but if it can do that, it can certainly do 1080p at 30fps, and the low latency feature got my attention. The low latency feature is only available for the enterprise edition, but the enterprise edition is available on the AWS and Azure marketplaces with hourly prices. Since I don’t need this on 24/7, I thought it could be a good idea.

I’m still in the process of validating my whole setup with Ant Media Server, but (after a lot of tinkering) I’ve seen some pretty good results. Expect an update (or even a complete follow-up article) about it in the future.

(At the moment, I’m happy with the low latency streaming, less so with the adaptive transcoding, but I’ve found ways to work around it by encoding multiple streams at the source. Anyway!)

Virtual classrooms and webinar platforms

There are many products out there. Some of them seem extremely promising, and for many people, are probably better solutions than what I’m building.

Unfortunately, I haven’t found any solution yet that would let me stream my own video composition, or have the countdowns that I use for breaks, for instance. Most of them also can’t do high quality recording.

I expect this space to evolve a lot these days, since a lot of activity is switching to be online during the pandemic, so let me know if you hear about a product that you think I should test!

Everything else

I talked a lot about the video content and how to send it to the audience, but there are other things that matter to me.

Important information should be easily available

I alluded to this earlier in the OBS section. When I deliver a training or a workshop, there are more resources than just the video feed. There are slides; a chat room; possibly other things. I think it’s important to make sure that the links to these resources are super easy to get.

When delivering in-person training, I would often have e.g. the WiFi password and the URL of the slides on a flipchart or whiteboard; that way, if someone shows up late, they can easily get that essential information and catch up.

Same idea here. We’re not likely to be caught in traffic or delayed by public transit before connecting to a remote classroom; but we could have an unexpected mandatory OS reboot, a kid or other family member that needs immediate assistance, a headset that we thought would be charged but the battery is now empty, etc., so some folks will still be late, and it’s not their fault, and we need to make it easier for them. So I try to make sure that people have at least the link to the stream (or some other landing page) and then I have all the relevant information in the stream.

Chat platforms

The chat rooms that come with Twitch, Zoom, and many other video conference or streaming platforms generally provide the bare minimum level of functionality.

I gave some details earlier about the limits of the Zoom chat and suggested using Gitter instead.

You might wonder, “why not Slack?” - I think Slack is great for some scenarios; specifically the ones where people are expected to commit a significant amount of time to set it up and use it. But for a short event like a workshop, even a week-long training, I am not a huge fan of Slack. It requires setting up an account, getting a confirmation e-mail, and then you get all these features and channels. I prefer something lightweight like Gitter. Gitter can use SSO with GitHub, GitLab or Twitter (if you already have an account with these platforms, joining a Gitter chat room will be literally two clicks). It also uses significantly fewer resources.

Of course, you do you!

Q&A and polling

I want to keep exploring options here. For instance, I intend to soon test slido to see if it helps to do some quick “hand raising” kind of poll.

Conclusions

After doing my research, I decided to build my own “virtual classroom” by putting together various software bricks and services.

It’s a lot of moving parts (especially as you will see in part 4, where I describe the OBS and streaming setup), and sometimes that can be scary; you really don’t want everything to fall apart minutes before starting a course.

However, I really like the flexibility that this is giving me; the ability to pick the tools that fit my teaching style (and the nature of what I’m teaching).

I’d like to emphasize one last time that this is not “the” best way of doing things; it’s just how I do them right now, and it’s likely to change over time. But I hope that this (which started as a disorganized collection of notes for my future self) can be useful for you as well!

In part 4, I will describe how I got OBS (and associated paraphernalia) to run on Linux. In fact, I even got everything running in Docker containers, and I’ll also explain why.

Streaming tech talks and training / Hardware

2020-04-17T00:00:00+00:00

This is a long description of the various equipment (cameras, lights, mics, and more) that I am using, or that I have tried to use, to deliver online training and tech talks.

For context, please check part 1!

Desk and desk placement

Priority: low to high

It helps me a lot to stand when I’m presenting. I suppose that this is irrelevant for most people, and that many of you might actually prefer to sit; but it is easier for me and I have more energy when standing.

I considered using a standing desk (and might end up getting one) but for the time being, I’ve set up my streaming laptop on a delivery box and propped up the associated monitor.

August 2020 update: I got a Fully Jarvis standing desk and I’m very happy with it. It freed up a lot of real estate on the desk (because I don’t need the cardboard box and the weird wooden thing to prop up the streaming laptop). If you get one (or any kind of standing desk), I recommend one with memory buttons (so you can switch from one position to another very quickly) because it’s only a small fraction of the overall price and I found it very useful.

This section is first and foremost an excuse to show you an overview of what my setup looks like!

The cameras that I use are highlighted in red. The mic that I use is highlighted in green. The Stream Deck is highlighted in light blue.

There will be more pictures of specific parts when needed.

Note that the darkening curtain that you can see in front of the window behind me is typically pulled more to the left, so that the window doesn’t produce a backlight behind me when using the right-hand-side camera. Without that curtain, I would probably have to move things around a bit.

Now let’s move on to the nerdy tech stuff!

Mic

Priority: high

It’s important that people can hear me loud and clear, so this was one of my first investments.

I don’t want to use the mic that is built into my laptop, because it picks up too much background noise.

A built-in mic will pick up keyboard presses. It wasn’t obvious to me at first, until a friend told me to listen to my recording with headphones on. After 10 seconds I wanted to murder the person who was hammering on the keyboard (i.e. myself).

It will also pick up the whirring of the computer’s fans. Usually, my fan is idle, so it’s not a big deal. But when streaming video, the CPU will be hard at work to encode, and the fan will turn on.

Generally speaking, I want the mic to be as close as possible to my mouth, but also at a constant distance, so that I don’t have to worry about variations in volume.

TL,DR

I use a RØDE Wireless Go with a cheap lav mic and a cheap USB audio interface. The lav mic came with a super cheap kit ($10-20 online) and the USB audio interface was in the same price range (it’s an Audio-Technica ATR2USB).

Now here is a list of all the things I tried and considered …

Blue Yeti

Many people recommended the Blue Yeti, so I got one. Unfortunately, it doesn’t work for me, at least not for tech talks and training.

It’s an amazing mic to record voice or instruments in a very quiet environment, but for me, it didn’t help at all. I could hear the keyboard (and the fans) in even better quality than before! But I could still hear them, that’s the problem.

(The Blue Yeti has multiple modes: cardioid, stereo, etc.; yes, I tried them all; no, it didn’t help.)

The Blue Yeti is now in our living room. We use it when videoconferencing with friends, and apparently it’s pretty awesome at that. I also used it a few times in the past to record a cello.

Cheap lav mics

Lav mics are these tiny mics that you typically clip to your collar. “Lav” stands for “lavalier”; they are also called “lap mics” (which stands for “lapel”). They’re probably the most common type of mic when speaking at conferences.

The ones we use at conferences are usually high quality, wireless mics, and are pretty expensive.

But you can also get some really cheap wired ones. That’s what I did; I found a cheap two-for-one pack with extension cords as well as adapters to plug them to either a mic input, or a combined mic+headphones input. It was so cheap that I didn’t have much hope about the mics themselves, but I thought the adapters could be handy.

It turns out that these mics were great when plugged into my phone. The sound quality was excellent. However, when plugged into my computer (either directly or through a cheap USB audio interface) there was a ground hum. (The ground hum went away when unplugging the HDMI cable; not exactly an option when presenting to an audience.)

I now use one of these mics in combination with the next one, which is …

RØDE Wireless Go

So I kept looking, and eventually acquired a RØDE Wireless Go. It is a wireless mic with a receiver (that you plug into a 1/8in jack audio input) and a transmitter. The transmitter has a built-in mic, but also has a 1/8in jack mic input (allowing you to use a fancy lav mic if you have one).

Conceptually, this is very similar to the belt packs that are ubiquitous on the conference circuit, except that it’s much smaller (which is a plus for me when travelling). It is so small that you can just clip the transmitter directly to the collar of your shirt or t-shirt. It will be visible, but for tech talks, I don’t consider it an issue. It might actually give you some extra cyborg style points if you’re into that. The embedded mic is great, but according to some subjective testing, the quality is even better if I use one of the abovementioned lav mics and put the transmitter in my pocket.

It has one serious downside, though: instead of using AA batteries, the transmitter and the receiver have integrated batteries that you charge over USB. The batteries will last 6-7 hours. This is totally fine for a half-day workshop; but to go through the entire day, I need to recharge them during the breaks.

Why do I bother with a wireless system instead of just plugging it in? For starters, my cheap lav mic has this ground hum problem when it’s plugged directly into the computer. The issue doesn’t show up when going through the Wireless Go kit. Also, even if I am not going to move around on stage, I’m trying to have as few wires as possible. And when I want to take a quick break away from the computer, having to remove the mic (and then put it back on later) is annoying. The most important bit is to remember to turn it off or mute it before going to the bathroom. (Honestly, that might be one of the arguments in favor of a wired mic!)

Gaming headsets

When I do group conf calls, I get the opportunity to listen to everyone speak, and the folks with the best audio are often using gaming headsets, like the Sennheiser Game Zero or the HyperX Cloud. If you have one of those, try it out. You may or may not like how you look with these big bulky cans on your head, and that little operator-style mic stick in front of you, but I guarantee that your audience will love how you sound.

Earbuds

This is a budget-friendly option!

You can also use earbuds that have a mic, like the ones we sometimes use with headphones. These will already be vastly better than using the integrated mic of your laptop.

Don’t go too cheap though. Make sure that there is no weird hum or crackling sounds when you touch the cable. Also make sure that the mic doesn’t pick up the rustle of your shirt, or jewelry that you might be wearing. (Particularly important if you have nice, big earrings; if they touch the wires of your earbuds when you move your head, the result will be very unpleasant for your listeners.)

Bluetooth headsets

You can use a Bluetooth headset to listen, but whatever you do, never use the mic part of any Bluetooth headset: it won’t sound good. (Yes, even with $200 headsets or AirPods or whatever.)

Some Bluetooth headsets advertise a bunch of fancy codecs (like aptX) to get super-duper audio quality, but that’s only for the audio going to your ears. The headset is then usually using the A2DP profile. As soon as you use the headset mic, it switches to the Bluetooth Headset Profile (HSP), which uses different codecs. These codecs are made to encode voice to be carried by the telephone network. When placing a phone call, you won’t see the difference; but when streaming it will be very noticeable.

(If you want to dive into Bluetooth codecs and profiles and other details, you can check this excellent Audio over Bluetooth blog post.)

USB audio interfaces

Since I had ground hums with the built-in mic input of my computer, I tried other options as well.

I tried with a Focusrite Scarlett 18i20 and a PreSonus Studio 24c. Unfortunately, plugging 1/8in jack mics in these interfaces wasn’t straightforward.

So instead, I’m using a cheap Audio-Technica ATR2USB. Don’t be fooled, it’s nothing fancy; I got it for $10 a few years ago when I was messing around with Raspberry Pis.

It’s not so much for the quality upgrade (I didn’t notice any difference between the built-in mic input, the ATR2USB, the Scarlett, and the PreSonus interfaces), but for convenience: the mic can stay plugged into the interface, which is plugged into a USB hub or docking station, and I can plug/unplug everything via the hub or docking station instead of fiddling with a dozen cables.

Also, when I list my audio inputs, instead of “Built-in Audio Analog Input” I will see something like “ATR2USB Audio Input” and I know that this is my mic. It’s less mental load for me.

Other options

I realize that many folks might be shocked by the fact that I’m using a $10 lav mic and a $10 USB interface, with a $200 wireless pack in between.

I’m using the wireless pack to have fewer wires. It helps to feel “unencumbered” when presenting.

And the reason why I’m using the $10 mic and interface is because I couldn’t perceive a difference between them and $200 mics and interfaces. Or rather: somebody listening to me over a Zoom call couldn’t hear the difference when I switched from one mic or interface to another. It might be a whole different story if you’re recording; and perhaps I did something wrong that was leveling down my audio quality.

Lights

Priority: high

I’ve put lights before the camera. Which makes sense if you think about lights … camera … ACTION! but also if you consider the fact that the best camera in the world won’t give you a nice image if you don’t have good lighting. In fact, it will easily be outperformed by a very cheap webcam (including your laptop’s integrated camera) with proper lights.

Elgato Key Light

I’m going to admit that I went a bit overboard here. I got three Elgato Key Lights, because I’d seen someone say that they were the best thing since sliced bread and I didn’t have much time to do my research. They’re expensive, and I could probably have gotten just two.

But you can control them over a REST API, so you’re damn sure as heck I’m not sending them back!

These lights are LED panels giving a very smooth light. They only give you white light (not full colors) but you can set the brightness and the temperature. The temperature setting lets you make them white-slightly-reddish to white-slightly-blueish and everything in between.

You can control the lights with:

a macOS or Windows app (didn’t try it as I’m running Linux)
an Android app (that’s what I used to set them up)
probably an iOS app too (didn’t try)
a REST API

I fiddled a bit with leglight, a Python library to control these lights. It works. The Android app is enough for my needs anyway, but if some day, I want to code some automation to change the lights with a keyboard shortcut or the Stream Deck, I know it’ll be possible and easy.

August 2020 update: eventually, I wrote 10 lines of Python so that I could control the lights from my Stream Deck.
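For reference, here is a minimal sketch of what controlling a Key Light over HTTP can look like. It is not the script mentioned above. It assumes the endpoint, port, and value ranges described in community documentation of the lights’ HTTP API (port 9123, path /elgato/lights, brightness as a percentage, temperature in mireds between roughly 143 and 344); the IP address and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions (not from the article): port, path, and value ranges come from
# community documentation of the Key Light HTTP API.
PORT = 9123
PATH = "/elgato/lights"


def kelvin_to_mired(kelvin):
    """Convert a color temperature in Kelvin to mireds, clamped to 143-344."""
    mired = round(1_000_000 / kelvin)
    return max(143, min(344, mired))


def build_payload(on, brightness, kelvin):
    """Build the JSON document describing the desired state of one light."""
    return {
        "numberOfLights": 1,
        "lights": [
            {
                "on": 1 if on else 0,
                "brightness": max(0, min(100, int(brightness))),
                "temperature": kelvin_to_mired(kelvin),
            }
        ],
    }


def set_light(host, on=True, brightness=20, kelvin=4500, timeout=2):
    """Send the desired state to the light at `host` with an HTTP PUT."""
    data = json.dumps(build_payload(on, brightness, kelvin)).encode()
    request = urllib.request.Request(
        f"http://{host}:{PORT}{PATH}",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (hypothetical address): a "quick dim" action for a Stream Deck button.
# set_light("192.168.1.50", on=True, brightness=5, kelvin=3500)
```

Keeping the payload construction separate from the network call makes it easy to check the values without having a light on the network.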

Note that the lights come with telescopic arms that you can easily clamp on your desk. The arms and lights are very lightweight, so you don’t need to screw the clamp super hard. At the end of the arm, there is a kind of ball and socket joint that lets you orient the lights any way you like.

It’s interesting to note that the lights are screwed to the arms using what appears to be a standard 3/8-16 screw. I happen to also have a camera tripod using the same screw, which means that I can put an Elgato Key Light on my camera tripod, or conversely, that I can put a camera on the arm mount of the light.

Pictured here: two mounting arms that came with the Elgato Key Lights. On the left one, that’s a DSLR in portrait orientation. On the right one, an Elgato Key Light, on top of which I’ve installed a Logitech StreamCam.

How I’ve put my lights

I did some reading on Three-Point Lighting and that’s what I tried to achieve.

I placed one light on one side (close to my main camera), another light on the other side (set less bright), and the third one on the back. I’m going to keep experimenting, because that’s a topic I know nothing about, and I’m not even a good judge of what a good result should look like.

In the evening, I sometimes use the lights as desk lights (setting them on the lowest brightness) but that’s mostly because I don’t have a good desk lamp at the moment. For reference, other people might find these lights way too bright! In fact, if I ever implement some automation with these lights, it might be to have a quick off (or dim) button so that my partner can walk on the set without getting a headache because of how bright these are.

Other options

If you don’t want to pour that much money in lights, there are many other options. I’ve seen people get excellent results with cheap bedside lamps. Anything that gives you a white diffuse light (for instance something with a white, opaque lampshade) might work, as long as it’s uniform (so that it doesn’t cast too harsh shadows) and not too bright (so that it doesn’t make you wince and get headaches over time).

If you are next to a window with lots of daylight, make sure that the window is not behind you, or you might show up as a shadow or silhouette against the backlight.

Internet connection

Priority: high

The next most important thing (still before cameras, in my opinion) is a stable, reliable internet connection.

Because if your internet goes down, you might have the best equipment in the world, but your audience won’t see anything. Duh.

I have two connections here: one is DSL, the other one is cable. It might feel overkill, but it still costs way less than what most US families pay for a cable TV+internet bundle.

To make the setup as simple as possible, we have a WiFi router (we use a Netgear R6400 but honestly anything will do), and we then plug that router to either the DSL router, or the cable router (both provided by the telephone and cable companies respectively).

Using this extra router means that to switch the connection, we just have to plug “our” router to the cable or DSL router, and everything follows suit: we don’t need to tell our computers, phones, TVs, whatever, to switch to another WiFi network. (Some devices are wired anyway.)

I already had one network glitch while streaming an online training, and all we had to do was move that network cable, and we were back online in a couple of minutes.

From left to right: cable router, home router, and DSL router.

All I have to do to switch connections is move the white cable with the blue jacket from the yellow ports on the right, to the yellow ports on the left. Way easier than fiddling with routing tables on a Raspberry Pi in the middle of a training, if you ask me!

Cameras

Priority: high

Ah, finally, we’re talking about cameras!

This section might be anticlimactic, because after testing a bunch of different things, I end up using simple USB webcams. The main thing is not the quality of the camera itself, but the fact that it’s easy to place it wherever you need it. That’s why the built-in webcam of my laptop isn’t great: it’s basically a nose cam!

Let’s see what I use, and then I’ll describe at length what I tried.

TL,DR

I’m using a Logitech C920s as a “face cam” placed right on top of the monitor right ahead of me, and a Logitech StreamCam in portrait orientation, to get a 3/4 angle, half-body shot.

I use the former to have my face in a corner of the screen when doing demos in the terminal, and the latter to get a more “present” feel, e.g. on title screens or during Q&A. You can see an example of both layouts in my talk at FiqueEmCasaConf. (That link will take you straight to a point where I’m about to do a transition from one layout to the other one.)

Now, let’s see the various things I tried before getting there …

Logitech C920s

When I realized I needed a webcam, I was hesitating between the Logitech Brio, the C920s, and the Razer Kiyo.

The Brio supposedly has better quality; the Kiyo has a built-in LED ring. But both were out of stock and all I could find was the C920s, so I promptly acquired one before it went out of stock too.

Pros:

it works and the quality is mostly OK;
it can be mounted on top of a screen, on top of one of my Elgato Key Lights, or screwed on a camera tripod.

Cons:

it has auto-focus issues;
it is USB2 (not USB3) and this might cause problems in multi-camera setups.

I don’t know if these two issues are hardware or software related, so I will cover them in detail in part 4.
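To give a rough idea of why USB2 can be tight for video, here is a back-of-the-envelope calculation. It is only a sketch: it assumes uncompressed video at 2 bytes per pixel (as in the YUY2 format) and compares it to the nominal 480 Mbit/s USB 2.0 signaling rate, ignoring protocol overhead (which makes the real budget smaller).

```python
USB2_MBIT = 480  # nominal USB 2.0 signaling rate, before protocol overhead


def raw_video_mbit(width, height, fps, bytes_per_pixel=2):
    """Bit rate of an uncompressed video stream, in Mbit/s."""
    return width * height * bytes_per_pixel * 8 * fps / 1_000_000


def usb2_share(width, height, fps):
    """Fraction of the nominal USB 2.0 rate that the raw stream would need."""
    return raw_video_mbit(width, height, fps) / USB2_MBIT


for label, mode in {"720p30": (1280, 720, 30), "1080p30": (1920, 1080, 30)}.items():
    rate = raw_video_mbit(*mode)
    print(f"{label}: {rate:.0f} Mbit/s raw, {usb2_share(*mode):.0%} of nominal USB 2.0")
```

Raw 1080p at 30fps needs about twice what USB 2.0 can nominally carry, which is why webcams on USB2 typically send compressed video (e.g. MJPEG) at that resolution, and why several cameras sharing the same USB2 controller can step on each other.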

I use the webcam as my “face cam”. It is set on top of my “confidence monitor” right ahead of me.

DSLR / mirrorless cameras

Some folks are using DSLR or mirrorless cameras as webcams, and I wanted to give this a try.

Supposedly, you can get much better picture quality with a DSLR or mirrorless camera, than with a webcam.

Mostly because the sensor on a DSLR or mirrorless camera is much bigger than on a webcam, so it will receive more light per pixel, and will therefore perform much better in low lighting conditions. With the right lens (e.g. a 50mm equivalent with a large aperture), you can get amazing evening pictures without using a flash light. Cameras also have way more settings than webcams.

In practice, though … My experience was very different.

TL,DR: the bigger sensor of a nice camera helps with low lighting conditions, but if you have proper lighting, a webcam will do just as well. Also, cameras are designed to be operated by someone placed behind them, not in front, so manipulating settings can be complicated.

There are at least two ways to use cameras as a webcam:

connect it over USB and use a PTP-capable program like gphoto2 to obtain a “live view” from the sensor;
use the camera’s HDMI output and some HDMI capture interface.

Important: not all cameras support these modes! Some cameras might support none of them. Then you’re out of luck. Some cameras might support them with limitations.

Either way, if you’re using a camera as a webcam, you will probably need a way to keep it powered on for long periods of time. Some cameras might have DC adapters. Mine didn’t, but most cameras can be powered with a dummy battery: it has the shape and connector of the normal battery, but with a power cable that runs to an AC adapter. You will need to get one that is specific to your camera model (since different cameras use different batteries).

My partner has a Canon EOS1100D (also known as the EOS Rebel T3 in the US). I got it to work with both USB and HDMI. I also have a Sony NEX-C3 with some nice lenses. Unfortunately, that one didn’t work too well.

EOS1100D over USB

I will give more details in part 4, but the general idea is to get a “live view” from the sensor of the camera, and then turn that into a standard webcam device.

It worked reasonably well, but it was “only” 720p, so I was wondering if I might be able to use the camera’s HDMI output with an HDMI capture interface.
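As an illustration of the general idea (and not necessarily the exact setup that will be described in part 4), here is a sketch that pipes the live view from gphoto2 into ffmpeg, which writes it to a virtual webcam device. It assumes that gphoto2 and ffmpeg are installed and that the v4l2loopback kernel module is loaded; the device path /dev/video10 and the function names are hypothetical.

```python
import subprocess


def build_pipeline(device="/dev/video10"):
    """Return the two commands of a gphoto2 -> ffmpeg -> v4l2loopback pipeline."""
    # gphoto2 streams the camera's live view (as MJPEG) to standard output.
    capture = ["gphoto2", "--stdout", "--capture-movie"]
    convert = [
        "ffmpeg",
        "-i", "-",               # read the live view from standard input
        "-vcodec", "rawvideo",
        "-pix_fmt", "yuv420p",   # a pixel format that most applications accept
        "-f", "v4l2",
        device,                  # hypothetical v4l2loopback device node
    ]
    return capture, convert


def run_pipeline(device="/dev/video10"):
    """Feed the camera's live view to the loopback device until interrupted."""
    capture_cmd, convert_cmd = build_pipeline(device)
    capture = subprocess.Popen(capture_cmd, stdout=subprocess.PIPE)
    try:
        subprocess.run(convert_cmd, stdin=capture.stdout, check=True)
    finally:
        capture.terminate()
        capture.wait()
```

Once this is running, the loopback device shows up like any other webcam in programs that support V4L2 devices.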

EOS1100D over HDMI

The challenge here is to obtain what is called “clean HDMI output”. What does that mean? On many cameras, the HDMI output was designed to provide a bigger monitor for the camera. So on top of the image, we get a lot of information about the camera’s parameters (aperture, exposure…) and image characteristics (histograms and such).

Above: example of “unclean” HDMI output. Arguably, I could crop the exposure stops on the right; but the focus indicator, the green rectangle in the middle, is more annoying!

Below: example of “clean” HDMI output.

It turns out that with this EOS1100D, the only way to get clean HDMI output is to install a custom firmware. Yes, you read that right!

I gave it a try, and installed Magic Lantern on the camera. I’ve done a bit of reverse engineering a while ago, and I must say, Magic Lantern is mind blowing. Mad props to the folks who put in all that work!

After installing Magic Lantern, I could use the “Clear Overlays” feature to remove the focus rectangle. But after 30 minutes, the camera turns off the live view automatically! Unless you keep the shutter button half-pressed. Of course, keeping the shutter button half-pressed is not realistic while streaming … But there is another option in Magic Lantern to make the half-shutter button “sticky”, i.e. you press it once, and it behaves as if you were keeping it pressed. Press it again, and it releases it.

Alright, was it worth all the trouble?

Not so much, alas.

I’ll let you judge for yourself with the screenshots below. The first picture uses natural light. The second picture uses my Elgato Key Lights.

Left: EOS1100D with HDMI capture interface. Center: Logitech StreamCam. Top right: Logitech C920s. Bottom right: integrated webcam (in Thinkpad P51).

In my opinion, the colors on the EOS1100D (on the left) are a bit more red, and less flattering. I’m aware that this is very subjective. I’m also aware that it might be possible to fix that by fiddling with the camera settings. However, given that the camera is placed on an arm mount, accessing the settings is quite inconvenient. There might be a way to access the settings over USB, but I didn’t look into that (yet).

This comparison also shows the effect of lighting. I didn’t spend much time tuning the lights by the way; just turned them on to take that shot (40% brightness on the light on the right, located next to the EOS and the StreamCam; 20% on the lights on the left and behind me).

It also shows that the angle of the integrated webcam isn’t great, except to emphasize my beard and nostrils!

Sony NEX-C3

Unfortunately, that camera doesn’t seem to support live preview with gphoto2. It does have HDMI output, but it’s not clean HDMI output. It’s a much older camera so support is fairly limited. I have a few nice lenses for it, but I decided that it was not worth the time investment at this point, and I left it aside.

Logitech StreamCam

For my first training, I used the C920s and the EOS1100D. At the end of the training, since it had been successful and I was convinced by the usefulness of the two-camera setup, I decided to buy another camera, and managed to find a store that still had the Logitech StreamCam in stock.

It seems (to me at least) vastly superior to the C920s. The connection is USB3, so it’s not plagued by the same bandwidth issue (more on that in part 4).

Also, it can be mounted in portrait or landscape, which is great because I was using the EOS1100D in portrait anyway. It comes with both a clip-on mount and a screw mount. I clip it on top of one of my lights but I could also mount it on a tripod if needed.

And of course it’s a UVC device so it works without requiring special drivers.

Using your phone as a camera

If your phone has a good camera, you might be able to use it as a webcam.

I did some experiments with IP Webcam and got it to work. I didn’t end up using it, because I don’t have an arm or holder that would help me prop up the phone in a good position. I will give more details about how to get that to work in part 4, since my experiment was Linux-specific.

Razer Kiyo

August 2020: I got a Razer Kiyo when they got back in stock, because I needed an extra cam and I was curious about the LED ring feature.

The video quality on the Kiyo is pretty good (consistent with the C920s and the StreamCam). However, the LED ring is quite underwhelming. Sure, it’s better than nothing; but the light is not very flattering, especially compared to what I get with the Key Lights. I imagine that it could be useful when traveling, to have a good webcam without carrying a separate light?

Monitors

Priority: high

For me, it’s important to have at least two screens:

a captured screen, which will be seen by the audience, and where I will present my slides, terminal, and other demos;
a control screen, which will not be seen by the audience, and where I have my video studio software, controls, chat windows, and so on.

Note: there might be better, official names for these screens! If you know them, please let me know, I’ll be happy to update my terminology.

This is what my setup looks like:

On the left, that’s my streaming laptop. The internal monitor of that laptop, currently showing a terminal, is my “captured screen”. The external monitor above is my “control screen”, with OBS Studio as well as a browser window (shown here to control a Google Slides presentation deck). On the right, that’s my usual work laptop. The internal monitor is for whatever I want to do outside of the streaming session, during the breaks, etc.; and the external monitor shows a Gitter chat window used to interact with the audience.

Captured screen

This screen is what I want to show to the audience. It may or may not be visible at all times (for instance, at some point, I might switch to a layout where the audience sees me, but not my screen; or I might put up a “break” screen showing a countdown indicating when we will resume). But I keep in mind that it could be shown to the audience at any point, so I must not show any sensitive information there.

That screen shouldn’t be too fancy, because the goal is to capture it and stream it. Using Hi-DPI (retina) displays or non-standard aspect ratios might be counter-productive here.

In practice, I use the built-in monitor of my streaming laptop.

Control screen

That screen is the one that the audience doesn’t see, and has all the things that I need to manage my stream and the interaction with the audience.

In practice, I have OBS (the video compositing and streaming software that I use) on that screen, so that I can see exactly what I’m sending to my audience. (More about OBS in part 3!)

If I need to run a quick command, load up a password from my password manager, etc., I would also do that on that screen.

That’s also the right place for chat windows, so that I can keep an eye on them without having to constantly bring them up on the captured screen.

That screen can be as fancy as you like! For instance, it would be a great use for these huge, ultra-wide monitors. If your setup allows it, you could also use multiple monitors there. Another possibility is to have another laptop, or a tablet, for some extra screen real estate. There are many things that don’t need to run on the main streaming laptop, for instance:

chat windows,
speaker notes (e.g. list of commands for a particularly tricky demo),
an extra terminal or browser for quick tests and lookups.

Headphones or earbuds

Priority: high (if there is voice interaction) / low (otherwise)

If you will have voice interaction with other participants, please in the name of everything that is holy, use headphones.

Yes, there are echo cancellation and other systems to allow using a mic with a speakerphone at the same time; no, I do not rely on them. At best, they degrade the audio quality a bit; at worst, you get the most obnoxious echo or feedback sounds.

I’m using a simple pair of earbuds, but I’m considering getting AfterShokz Aeropex bone-conduction headphones. Why? For a few reasons.

Earbuds are OK for an hour or two, but for a half-day or even full-day training, they’re less comfortable. I’m told that the Aeropex are extremely comfortable to wear, even all day long, and the battery apparently lasts long enough, too.
Going wireless! Just like for the wireless mic, I would love to avoid being distracted by the extra wire.
I would like to be able to keep hearing other sounds. Yes, some people would rather block outside interference, and sometimes, it’s what I want; but if I’m streaming a course all day long from home, it’s useful to be able to hear the doorbell, or if my partner calls me, or notifications on my phone (which I can silence if I don’t want to hear them otherwise).

May 2020 update: I got the Aeropex, and I’m extremely happy with them. They are very light. So light, that I can wear them all day long without being bothered by them. (It’s almost like wearing glasses.) The battery lasts almost all day. As noted above in the section about Bluetooth headsets, I do not use the mic of the Aeropex; I only use them to listen.

Stream Deck

Priority: nice-to-have

The Stream Deck is basically an external control device with a bunch of programmable buttons. I got it thinking that there was a 50/50 chance that it would just be a useless gadget; but I actually like it quite a bit.

It’s not just programmable buttons: each button has an embedded LCD screen, so I can put a custom label or icon on it.

I programmed it so that a row of buttons lets me change scenes in OBS. Of course, I could use keyboard shortcuts, but it helps a lot to have a dedicated device to do that, with labels on the keys to remind me which key corresponds to which layout.

To be fair, if you have a fancy keyboard with programmable keys, and can add custom labels next to these keys, you’ll be set just as well.

On the Stream Deck, I also have buttons to engage a break timer and adjust its duration. I’ll say more about that in part 3.

In the future, I plan to also use it to adjust lights; at least to get a “master on/off” switch.

HDMI capture device

Priority: low (unless…)

An HDMI capture device is useful in at least two scenarios:

to use the HDMI output of a DSLR or mirrorless camera, as described earlier;
to use two separate laptops: one as a presenter laptop, the other as a streaming laptop.

I initially intended to get the Elgato Cam Link, but it was out of stock, so I looked for other options. I wasn’t going for a particular model; I just bought a couple of adapters and prayed that at least one of them would work.

ezcap265C

The ezcap265C is a UVC (USB video class) device, just like most webcams. This means that it follows a specific spec and doesn’t require any drivers (just like USB sticks and other external disks don’t require drivers, because they follow the USB “mass storage” specification).

It shows up as a webcam, and it “just works”. It’s important to connect it to a USB-C or USB3 port, because it appears to only support raw uncompressed capture in YUYV format, meaning that capturing 1080p at 60 fps requires about 2 Gb/s of bandwidth!
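
As a back-of-the-envelope check of that figure, here is the arithmetic as a small shell sketch. It assumes that YUYV (a 4:2:2 format) packs 2 bytes per pixel, and it ignores any USB protocol overhead; the function name is made up for illustration.

```shell
# Raw video bandwidth, in bits per second.
# Assumption: YUYV (4:2:2) stores 2 bytes per pixel; USB overhead is ignored.
raw_video_bits_per_second() (
  width=$1; height=$2; fps=$3
  bytes_per_pixel=2
  echo $(( width * height * bytes_per_pixel * fps * 8 ))
)

# 1080p at 60 fps
raw_video_bits_per_second 1920 1080 60
```

This prints 1990656000, i.e. just under 2 Gb/s. That is far more than the 480 Mb/s that USB2 can offer, which is why the port matters.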

Some folks asked me about latency. I didn’t see any latency with that device. I understand that some devices might use H264 encoding, making them suitable for use on USB2; perhaps the H264 encoding adds the latency.

Pengo HDMI grabber

I was less lucky with that device. It is supposed to be a bit better than the other one. In particular, it has HDMI pass-thru, meaning that you can plug a screen “after” it, to see what’s getting captured on that screen.

Unfortunately, it didn’t work. It might be defective, because when I plug it into one of my laptops, it shows up, then disappears after a few seconds, then shows up again, and so on. On the other laptop, it shows up only some of the time, and when it does, it doesn’t work and sometimes disappears again after a while.

(Technically, when I say “show up” and “disappear”, I mean by checking Linux kernel messages that typically indicate device connection and disconnection.)

Green screen

Priority: very low

The room where I’m streaming tends to be a bit messy, so I thought that a green screen could be a good investment. Unfortunately, its shipping was delayed by a few weeks, so I just cleaned up a bit, and ended up arranging the cameras differently anyway. When my green screen arrived, I didn’t unpack it for a while; and eventually I used it to show myself in front of a shipping container terminal during a presentation (while breaking the fourth wall at some point by lowering the screen to remind people that to prevent the spread of COVID-19, we should stay at home anyway).

You can definitely do funny things with a green screen but I don’t see it as an important investment.

Moreover, when using multiple cameras, it can be tricky to have a uniform green background everywhere!

Overhead scanner or camera

Priority: low or very low

My partner has a CZUR Aura scanner. It looks like a desk lamp, and it’s essentially a webcam mounted in a way that it can “look” at a document placed under it.

I tried to use it as a virtual whiteboard. During a training, if I need to make a quick diagram, I could just sketch it on actual paper placed under the camera.

Unfortunately, it’s not a fully compliant UVC device. I was able to get it to work (kind of) in low resolution, but it was very finicky. Mistakenly switching the resolution or frame rate could cause it to hang, requiring a reboot of the scanner.

Another (more reliable) option would be to use a normal webcam or DSLR (as explained earlier) on some kind of overhead arm. This would be particularly useful if I needed to show objects (e.g. for a “maker” stream, or when programming embedded devices, Arduinos, Raspberry Pis, etc).

Another option might be to use a tablet to draw things if needed.

This wasn’t really a hard requirement for me; it was mostly an experiment, and so far I don’t miss it.

Conclusions

So that’s all I tried so far!

To recap, I am using:

a streaming laptop, separate from my usual “work laptop”
an external monitor (24” full HD, nothing fancy)
three Elgato Key Lights
a cheap lav mic connected to a RØDE Wireless Go
an Audio-Technica ATR2USB audio interface
a Logitech C920s and a Logitech StreamCam
a Stream Deck to change scenes and have quick access to some useful scripts and functions

I also have, but I’m not (or barely) using:

a green screen
an ezcap265C HDMI capture interface
a Blue Yeti mic
a Canon EOS1100D flashed with Magic Lantern

In part 3, I will describe my software setup. It’s based on OBS Studio, which is available on Linux, macOS, and Windows.

In part 4, I will describe how I got OBS (and associated paraphernalia) to run on Linux. In fact, I even got everything running in Docker containers, and I’ll also explain why.

Streaming tech talks and training / Overview

2020-04-16T00:00:00+00:00

In March 2020, I started delivering online training sessions (instead of doing it in person). In this series of blog posts, I describe how I’ve set up what I call my “video streaming studio”, hoping that my experience and feedback can be useful to others.

In this first article, I’ll give some context so that you can understand what I’m doing and what I’m trying to achieve.

The second article will describe the hardware equipment that I’m using: computers, cameras, lights, and so on.

The third article will describe the general software setup. I’m using OBS Studio and I will explain what I do with it, how, and why.

The fourth article will describe how I got that to work on Linux. The first articles can be useful to you no matter which operating system you use (OBS Studio is available on macOS and Windows as well).

Before we dive in, a little bit of context: I’ve been delivering talks, workshops, and training for almost 10 years now, but it’s been my main source of income (as a freelancer) for a couple of years only. I do not have any kind of formal training in audio or video production. This means that I’ve certainly made some horrendous mistakes in my choice of equipment, software, and how I use them. Take everything I wrote with a boulder of salt! If you have advice, suggestions, questions, or any kind of feedback, you are welcome to contact me; I’d love to hear from you!

What I do

For the last couple of years, I’ve been delivering Docker and Kubernetes training, almost exclusively in person. I’ve done private training (where a company hires me to train anywhere from 6 to 60 employees at the same time), public training (where attendees pay per individual seat to attend), and conference training (where a conference organizer pays me to deliver a workshop or tutorial at their event). I’ve also delivered free workshops, spoken at meetups, and done a small number of online presentations.

When I present to an audience, what people see on the projector is a mirror of my screen. I do this because I constantly switch between slides (that are designed to be presented full screen), a command-line terminal (usually with a huge font, so that it’s easily readable even from the back of the room), and a web browser (when showing demos or looking up extra information or documentation).

I do not have a separate screen with speaker notes, because it makes the switching (to the terminal and web browser) less seamless.

It is certainly possible to deliver that kind of training using only a laptop computer with its built-in webcam and sharing my screen. But it is very hard to keep the audience engaged this way for long periods of time, so I wanted to up my game, so to speak.

Past attempts at producing video content

Since in-person training doesn’t scale, I tried a few times to record my classes. I’ve tried studio recording (without an audience, at my own pace) and live recording.

I found studio recording to be extremely difficult given my current skillset. I had to:

record myself present the course, filming my face;
record the hands-on sequences, labs, and demos separately;
edit everything together (adding slides in the process).

At this point, this is not something that I can do. I tried, and I failed. At best, I could produce 15 minutes of content with 2 weeks of work; and the result wasn’t outstanding. It was very difficult to get demos and the voice-over in sync. It required writing down most of what I wanted to say, many, many takes, and a very tedious editing process.

Live recording

Live recording seemed easier, because in theory, I would just have to hit “record” and present the way I usually do. In practice, of course, things are different.

I wrote another blog post about recording workshop videos with almost no budget that describes my experiences and the process that I used.

Streaming

Before the COVID-19 outbreak, I didn’t think I would like (or be able) to deliver my courses online. In March 2020, when it became obvious that in-person training wasn’t going to happen in the near future, I decided to get some equipment and get into streaming and online courses.

Hindsight is 20/20: streaming is a great format, in the sense that it makes some of my previous technical problems go away. Streamers are not expected to pace on stage, so there is no need for a camera operator to keep you in frame. There is also a lot of equipment, software, and platforms available for streamers, so I’m not in uncharted territory.

The expectations are also different. I think about it like music recorded in a studio vs performed live. I tend to be more demanding and notice problems more in recorded music (because the sound quality is also better), while being simultaneously more forgiving of, and more easily moved by, live music. I imagine that my audience will also be more forgiving with live content, where the expectations are different: lower expectations in terms of video quality (because we understand the technical constraints of streaming live video feeds) but higher expectations in terms of interactivity (because that’s the whole point of a live stream). That’s great, because the interaction and Q&A are precisely the parts that I’m comfortable with!

Results

If you want to see what my online talks look like, here are a couple of examples:

(Note that in both cases, the quality is not as good as it could be, because I was streaming to a third person who was then re-streaming it to YouTube Live. I hope to have “direct” streams soon too, with hopefully a better quality!)

What’s next

In part 2, I will describe the equipment that I am using or that I have tried.

In part 3, I will describe my software setup. It’s based on OBS Studio, which is available on Linux, macOS, and Windows.

In part 4, I will describe how I got OBS (and associated paraphernalia) to run on Linux. In fact, I even got everything running in Docker containers, and I’ll also explain why.

The Quest for Minimal Docker Images, part 3

2020-04-01T00:00:00+00:00

In the beginning of this series (first part, second part), we covered the most common methods to optimize Docker image size. We saw how multi-stage builds, combined with Alpine-based images, and sometimes static builds, would generally give us the most dramatic savings. In this last part, we will see how to go even farther. We will talk about standardizing base images, stripping binaries, assets optimization, and other build systems or add-ons like DockerSlim or Bazel, as well as the NixOS distribution.

We’ll also talk about small details that we left out earlier, but are important nonetheless, like timezone files and certificates.

The English version of this series was initially published on the Ardan Labs blog: parts 1, 2, 3. A French version (translated by Aurélien Violet and Romain Degez) is also available on the ENIX blog: parts 1, 2, 3. Enjoy your read!

Common bases

If our nodes run many containers in parallel (or even just a few), there’s one thing that can also yield significant savings.

Docker images are made of layers. Each layer can add, remove, or change files; just like a commit in a code repository, or a class inheriting from another one. When we execute a docker build, each line of the Dockerfile will generate one layer. When we transfer an image, we only transfer the layers that don’t already exist on the destination.

Layers save network bandwidth, but also storage space: if multiple images share layers, Docker needs to store these layers only once. And depending on the storage driver that you use, layers can also save disk I/O and memory, because when multiple containers need to read the same files from a layer, the system will read and cache these files only once. (This is the case with the overlay2 and aufs drivers.)

This means that if we’re trying to optimize network and disk access, as well as memory usage, in nodes running many containers, we can save a lot by making sure that these containers run images that have as many common layers as possible.

This can directly go against some of the guidelines that we gave before! For instance, if we’re building super optimized images using static binaries, these binaries might be 10x bigger than their dynamic equivalents. Let’s look at a few hypothetical scenarios when running 10 containers, each using a different image with one of these binaries.

Scenario 1: static binaries in a scratch image

weight of each image: 10 MB
weight of the 10 images: 100 MB

Scenario 2: dynamic binaries with ubuntu image (64 MB)

individual weight of each image: 65 MB
breakdown of each image: 64 MB for ubuntu + 1 MB for the specific binary
total disk usage: 74 MB (10x1 MB for individual layers + 64 MB for shared layers)

Scenario 3: dynamic binaries with alpine image (5.5 MB)

individual weight of each image: 6.5 MB
breakdown of each image: 5.5 MB for alpine + 1 MB for the specific binary
total disk usage: 15.5 MB

These static binaries looked like a good idea at first, but in these circumstances, they are highly counterproductive. The images will require more disk space, take longer to transfer, and use more RAM!
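
The arithmetic behind these three scenarios can be sketched in a few lines of shell. This is only a model of the reasoning above (one shared base, plus one specific layer per image), using kB so that everything stays an integer; the function name is made up for illustration.

```shell
# Total disk usage (in kB) for a set of images sharing one common base.
# shared_kb: size of the common base layers (0 for scratch: nothing is shared)
# unique_kb: size of the layers specific to each image
# count:     number of images
total_disk_kb() (
  shared_kb=$1; unique_kb=$2; count=$3
  echo $(( shared_kb + unique_kb * count ))
)

total_disk_kb 0     10000 10   # scenario 1: static binaries on scratch
total_disk_kb 64000 1000  10   # scenario 2: dynamic binaries on ubuntu
total_disk_kb 5500  1000  10   # scenario 3: dynamic binaries on alpine
```

This prints 100000, 74000, and 15500, matching the 100 MB, 74 MB, and 15.5 MB totals listed above. The base is paid for once, and only the specific layers grow with the number of images.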

However, for these scenarios to work, we need to make sure that all images actually use the exact same base. If we have some images using centos and others using debian, we’re ruining it. Even if we’re using e.g. ubuntu:16.04 and ubuntu:18.04. Even if we’re using two different versions of ubuntu:18.04! This means that when the base image is updated, we should rebuild all our images, to make sure that it’s consistent across all our containers.

This also means that we need to have good governance and good communication between teams. You might be thinking, “that’s not a technical issue!”, and you’d be right! It’s not a technical issue. Which means that for some folks, it will be much more difficult to address, because there is no amount of work that you can do by yourself that will solve it: you will have to involve other humans! Perhaps you absolutely want to use Debian, but another team absolutely wants to use Fedora. If you want to use common bases, you will have to convince that other team. Which means that you have to accept that they might convince you, too. Bottom line: in some scenarios, the most efficient solutions are the ones that require social skills, not technical skills!

Finally, there is one specific case where static images can still be useful: when we know that our images are going to be deployed in heterogeneous environments; or when they will be the only thing running on a given node. In that case, there won’t be any sharing happening anyway.

Stripping and converting

There are some extra techniques that are not specific to containers, and that can shave off a few megabytes (or sometimes just kilobytes) from our images.

Stripping binaries

By default, most compilers generate binaries with symbols that can be useful for debugging or troubleshooting, but that aren’t strictly necessary for execution. The tool strip will remove these symbols. This is not likely to be a game changer, but if you are in a situation where every byte counts, it’ll definitely help.

Dealing with assets

If our container image contains media files, can we shrink these, for instance by using different file formats or codecs? Can we host them somewhere else, so that the image that we ship is smaller? The latter is particularly useful if the code changes often, but the assets don’t. In that case, we should try to avoid shipping the assets each time we ship a new release of the code.

Compression: a bad good idea

If we want to reduce the size of our images, why not compress our files? Assets like HTML, JavaScript, and CSS should compress pretty well with zip or gzip. There are even more efficient methods like bzip2, 7z, lzma. At first, it looks like a simple way to reduce image size. And if we plan on serving these assets in compressed form, why not. But if our plan is to uncompress these assets before using them, then we will end up wasting resources!

Layers are already compressed before being transferred, so pulling our images won’t be any faster. And if we need to uncompress the files, the disk usage will be even higher than before, because on disk, we will now have both the compressed and uncompressed versions of the files! Worse: if these files are on shared layers, we won’t get any benefits from the sharing, since these files that we will uncompress when running our containers won’t be shared.

What about UPX? If you’re not familiar with UPX, it’s an amazing tool that reduces the size of binaries. It does so by compressing the binary, and adding a small stub to uncompress and run it transparently. If we want to reduce the footprint of our containers, UPX will also be very counter-productive. First, the disk and network usage won’t be reduced a single bit, since layers are compressed anyway; so UPX won’t get us anything here.

When running a normal binary, it is mapped in memory, so that only the bits that are needed get loaded (or “paged in”) when necessary. When running a binary compressed with UPX, the entire binary has to be uncompressed in memory. This results in higher memory usage and longer start times, especially with runtimes like Go that tend to generate bigger binaries.

(I once tried to use UPX on the hyperkube binary when trying to build optimized node images to run a local Kubernetes cluster in KVM. It didn’t go well, because while it reduced the disk usage for my VMs, their memory usage went up, by a lot!)

… And a few exotic techniques

There are other tools that can help us achieve smaller image sizes. This won’t be an exhaustive list …

DockerSlim

DockerSlim offers an almost magic technique to reduce the size of our images. I don’t know exactly how it works under the hood (beyond the design explanations in the README), so I’m going to make educated guesses. I suppose that DockerSlim runs our container, and checks which files were accessed by the program running in our container. Then it removes the other files. Based on that guess, I would be very careful before using DockerSlim, because many runtimes and frameworks are loading files dynamically, or lazily (i.e. the first time they are needed).

To test that hypothesis, I tried DockerSlim with a simple Django application. DockerSlim reduced it from 200 MB to 30 MB, which is great! However, while the home page of the app worked fine, many links were broken. I suppose this is because their templates hadn’t been detected by DockerSlim, and weren’t included in the final image. Error reporting itself was also broken, perhaps because the modules used to display and send exceptions were skipped as well. Any Python code that would dynamically import some module would run into this.

Don’t get me wrong, though: in many scenarios, DockerSlim can still do wonders for us! As always, when there is a very powerful tool like this, it is very helpful to understand its internals, because it can give us a pretty good idea about how it will behave.

Distroless

Distroless images are a collection of minimal images that are built with external tools, without using a classic Linux distribution package manager. It results in very small images, but without basic debugging tools, and without easy ways to install them.

As a matter of personal taste, I prefer having a package manager and a familiar distro, because who knows what extra tool I might need to troubleshoot a live container issue? Alpine is only 5.5 MB, and will allow me to install virtually everything I need. I don’t know if I want to let go of that! But if you have comprehensive methods to troubleshoot your containers without ever needing to execute tools from their image, then by all means, you can achieve some extra savings with Distroless.

Additionally, Alpine-based images will often be smaller than their Distroless equivalents. So you might wonder: why should we care about Distroless? For at least a couple of reasons.

First, from a security standpoint, Distroless images let you have very minimal images. Less stuff in the image means fewer potential vulnerabilities.

Second, Distroless images are built with Bazel, so if you want to learn or experiment with or use Bazel, they are a great collection of very solid examples to get started. What’s Bazel exactly? I’m glad you asked, and I’ll cover it in the next section!

Bazel (and other alternative builders)

There are some build systems that don’t even use Dockerfiles. Bazel is one of them. The strength of Bazel is that it can express complex dependencies between our source code and the targets that it builds, a bit like a Makefile. This allows it to rebuild only the things that need to be rebuilt; whether it’s in our code (when making a small local change) or our base images (so that patching or upgrading a library doesn’t trigger an entire rebuild of all our images). It can also drive unit tests, with the same efficiency, and run tests only for the modules that are affected by a code change.

This becomes particularly effective on very large code bases. At some point, our build and test system might need hours to run. And then it needs days, and we deploy parallel build farms and test runners, and it takes hours again, but requires lots of resources, and can’t run in a local environment anymore. It’s around that stage that something like Bazel will really shine, because it will be able to build and test only what’s needed, in minutes instead of hours or days.

Great! So should we jump to Bazel right away? Not so fast. Using Bazel requires learning a totally different build system, and might be significantly more complicated than Dockerfiles, even with all the fancy multi-stage builds and subtleties of static and dynamic libraries that we mentioned above. Maintaining this build system and the associated recipes will require significantly more work. While I don’t have first-hand experience with Bazel myself, according to what I’ve seen around me, it’s not unreasonable to plan for at least one full-time senior or principal engineer just to bear the burden of setting up and maintaining Bazel.

If our organization has hundreds of developers; if build or test times are becoming a major blocker and hinder our velocity; then it might be a good idea to invest in Bazel. Otherwise, if we’re a fledgeling startup or small organization, it may not be the best decision; unless we have a few engineers on board who happen to know Bazel very well and want to set it up for everyone else.

Nix

I decided to add a whole section about the Nix package manager because after the publication of parts 1 and 2, some folks brought it up with a lot of enthusiasm.

Spoiler alert: yes, Nix can help you achieve better builds, but the learning curve is steep. Maybe not as steep as with Bazel, but close. You will need to learn Nix, its concepts, its custom expression language, and how to use it to package code for your favorite language and framework (see the nixpkgs manual for examples).

Still, I want to talk about Nix, for two reasons: its core concepts are very powerful (and can help us to have better ideas about software packaging in general), and there is a particular project called Nixery that can help us when deploying containers.

What’s Nix?

The first time I heard about Nix was about 10 years ago, when I attended that conference talk. Back then, it was already full-featured and solid. It’s not a brand new hipster thing.

A little bit of terminology:

Nix is a package manager, that you can install on any Linux machine, as well as on macOS;
NixOS is a Linux distribution based on Nix;
nixpkgs is a collection of packages for Nix;
a “derivation” is a Nix build recipe.

Nix is a functional package manager. “Functional” means that every package is defined by its inputs (source code, dependencies…) and its derivation (build recipe), and nothing else. If we use the same inputs and the same derivation, we get the same output. However, if we change something in the inputs (if we edit a source file, or change a dependency) or in the build recipe, the output changes. That makes sense, right? If it reminds us of the Docker build cache, it’s perfectly normal: it’s exactly the same idea!

On a traditional system, when a package depends on another, the dependency is usually expressed very loosely. For instance, in Debian, python3.8 depends on python3.8-minimal (= 3.8.2-1) but that python3.8-minimal depends on libc6 (>= 2.29). On the other hand, ruby2.5 depends on libc6 (>= 2.17). So we install a single version of libc6 and it mostly works.

On Nix, packages depend on exact versions of libraries, and there is a very clever mechanism in place so that every program will use its own set of libraries without conflicting with the others. (If you wonder how this works: dynamically linked programs are using a linker that is set up to use libraries from specific paths. Conceptually, it’s not different from specifying #!/usr/local/bin/my-custom-python-3.8 to run your Python script with a particular version of the Python interpreter.)

For instance, when a program uses the C library, on a classic system, it refers to /usr/lib/libc.so.6, but with Nix, it might refer to /nix/store/6yaj...drnn-glibc-2.27/lib/libc.so.6 instead.

See that /nix/store path? That’s the Nix store. The things stored in there are immutable files and directories, identified by a hash. Conceptually, the Nix store is similar to the layers used by Docker, with one big difference: the layers apply on top of each other, while the files and directories in the Nix store are disjoint; they never conflict with each other (since each object is stored in a different directory).

On Nix, “installing a package” means downloading a number of files and directories in the Nix store, and then setting up a profile (essentially a bunch of symlinks so that the programs that we just installed are now available in our $PATH).
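
To make the “bunch of symlinks” idea more concrete, here is a toy model in plain shell. It does not use Nix at all: the store directory, the fake hash, and the hello program are all made up for illustration, and a real profile contains more than this.

```shell
# Toy model of a Nix-style profile: NOT real Nix, just directories and symlinks.
# The store path, the fake hash and the "hello" program are hypothetical.
demo_profile() (
  root=$(mktemp -d)
  pkg="$root/store/0000fakehash-hello-1.0"
  mkdir -p "$pkg/bin" "$root/profile/bin"

  # "Install" the package: its files live in their own store directory...
  printf '#!/bin/sh\necho "hello from the store"\n' > "$pkg/bin/hello"
  chmod +x "$pkg/bin/hello"

  # ...and the profile only holds a symlink pointing at them.
  ln -s "$pkg/bin/hello" "$root/profile/bin/hello"

  # Adding the profile's bin directory to $PATH makes the program available.
  # (We are in a subshell, so this does not leak into the calling shell.)
  PATH="$root/profile/bin:$PATH"
  hello

  rm -rf "$root"
)

demo_profile
```

Running it prints "hello from the store": the program is found through the profile, but its files never left their own directory in the (fake) store.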

Experimenting with Nix

That sounded very theoretical, right? Let’s see Nix in action.

We can run Nix in a container with docker run -ti nixos/nix.

Then we can check installed packages with nix-env --query or nix-env -q.

It will only show us nix and nss-cacert. Weird, don’t we also have, like, a shell, and many other tools like ls and so on? Yes, but in that particular container image, they are provided by a static busybox executable.

Alright, how do we install something? We can do nix-env --install redis or nix-env -i redis. The output of that command shows us that it’s fetching new “paths” and placing them in the Nix store. It will at least fetch one “path” for redis itself, and very probably another one for the glibc. As it happens, Nix itself (as in, the nix-env binary and a few others) also uses the glibc, but it could be a different version from the one used by redis. If we run e.g. ls -ld /nix/store/*glibc*/ we will then see two directories, corresponding to two different versions of glibc. As I write these lines, I get two versions of glibc-2.27:

ef5936ea667f:/# ls -ld /nix/store/*glibc*/
dr-xr-xr-x    ... /nix/store/681354n3k44r8z90m35hm8945vsp95h1-glibc-2.27/
dr-xr-xr-x    ... /nix/store/6yaj6n8l925xxfbcd65gzqx3dz7idrnn-glibc-2.27/

You might wonder: “Wait, isn’t that the same version?” Yes and no! It’s the same version number, but it was probably built with slightly different options, or different patches. Something changed, so from Nix’ perspective, these are two different objects. Just like when we build the same Dockerfile but change a line of code somewhere, the Docker builder keeps track of these small differences and gives us two different images.
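
Here is a toy illustration of that idea: deriving a store path from a hash of everything that went into the build. To be clear, this is not Nix’s actual hashing scheme; the function name, the input strings, and the 12-character truncation are made up, and it assumes that sha256sum (from GNU coreutils) is available.

```shell
# Toy model: a store path derived from the package name and its build inputs.
# This is NOT how Nix really computes its hashes; it only shows the principle.
toy_store_path() (
  name=$1; inputs=$2
  hash=$(printf '%s' "$name|$inputs" | sha256sum | cut -c1-12)
  echo "/nix/store/$hash-$name"
)

toy_store_path glibc-2.27 "patches=none"
toy_store_path glibc-2.27 "patches=none"       # same inputs: same path
toy_store_path glibc-2.27 "patches=some-fix"   # one input changed: new path
```

The first two lines of output are identical, while the third one has a different hash in front of the same glibc-2.27 name, which is the situation shown in the listing above.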

We can ask Nix to show us the dependencies of any file in the Nix store with nix-store --query --requisites or nix-store -qR. For instance, to see the dependencies of the Redis binaries that we just installed, we can do nix-store -qR $(which redis-server).

In my container, the output looks like this:

/nix/store/6yaj6n8l925xxfbcd65gzqx3dz7idrnn-glibc-2.27
/nix/store/mzqjf58zasr7237g8x9hcs44p6nvmdv7-redis-5.0.5

Now here comes the kicker. These directories are all we need to run Redis anywhere. Yes, that includes scratch. We don’t need any extra library. (Maybe just tweak our $PATH for convenience, but that’s not even strictly necessary.)

We can even generalize the process by using a Nix profile. A profile contains the bin directory that we need to add to our $PATH (and a few other things; but I’m simplifying for convenience). This means that if I run nix-env --profile myprof -i redis memcached, then myprof/bin will contain the executables for Redis and Memcached.

Even better, profiles are objects in the Nix store as well. Therefore, I can use that nix-store -qR command with them, to list their dependencies.

Creating minimal images with Nix

Using the commands that we’ve seen in the previous section, we can write the following Dockerfile:

FROM nixos/nix
RUN mkdir -p /output/store
RUN nix-env --profile /output/profile -i redis
RUN cp -va $(nix-store -qR /output/profile) /output/store
FROM scratch
COPY --from=0 /output/store /nix/store
COPY --from=0 /output/profile/ /usr/local/

The first stage uses Nix to install Redis in a new “profile”. Then, we ask Nix to list all the dependencies for that profile (that’s the nix-store -qR command) and we copy all these dependencies to /output/store.

The second stage copies these dependencies to /nix/store (i.e. their original location in Nix), and copies the profile as well. (Mostly because the profile directory contains a bin directory, and we want that directory to be in our $PATH!)

The result is a 35 MB image with Redis and nothing else. If you want a shell, just update the Dockerfile to have -i redis bash instead, and voilà!

If you’re tempted to rewrite all your Dockerfiles to use this, wait a minute. First, this image lacks crucial metadata like VOLUME, EXPOSE, as well as ENTRYPOINT and the associated wrapper. Next, I have something even better for you in the next section.

Nixery

All package managers work the same way: they download (or generate) files and install them on our system. But with Nix, there is an important difference: the installed files are immutable by design. When we install packages with Nix, they don’t change what we had before. Docker layers can affect each other (because a layer can change or remove a file that was added in a previous layer), but Nix store objects cannot.

Have a look at that Nix container that we ran earlier (or start a new one with docker run -ti nixos/nix). In particular, check out /nix/store. There are a bunch of directories like these ones:

b7x2qjfs6k1xk4p74zzs9kyznv29zap6-bzip2-1.0.6.0.1-bin/
cinw572b38aln37glr0zb8lxwrgaffl4-bash-4.4-p23/
d9s1kq1bnwqgxwcvv4zrc36ysnxg8gv7-coreutils-8.30/

If we use Nix to build a container image (like we did in the Dockerfile at the end of the previous section), all we need is a bunch of directories in /nix/store + a little bundle of symlinks for convenience.

Imagine that we upload each directory of our Nix store as an image layer in a Docker registry.

Now, when we need to generate an image with packages X, Y, and Z, we can:

generate a small layer with the bundle of symlinks to easily invoke any programs in X, Y, and Z (this corresponds to the last COPY line in the Dockerfile above),
ask Nix what are the corresponding store objects (for X, Y, and Z, as well as their dependencies), and therefore the corresponding layers,
generate a Docker image manifest that references all these layers.

This is exactly what Nixery is doing. Nixery is a “magic” container registry that generates container image manifests on the fly, referencing layers that are Nix store objects.

In concrete terms, if we do docker run -ti nixery.dev/redis/memcached/bash bash, we get a shell in a container that has Redis, Memcached, and Bash; and the image for that container is generated on the fly. (Note that we should rather do docker run -ti nixery.dev/shell/redis/memcached sh, because when an image starts with shell, Nixery gives us a few essential packages on top of the shell; like coreutils, for instance.)

There are a few extra optimizations in Nixery; if you’re interested, you can check this blog post or that talk from NixCon.

Other ways to leverage Nix

Nix can also generate container images directly. There is a pretty good example in this blog post. Note, however, that the technique shown in the blog post requires kvm and won’t work in most build environments leveraging cloud instances (except the ones with nested virtualization, which is still very rare) or within containers. Apparently, you will have to adapt the examples and use buildLayeredImage but I didn’t go that far so I don’t know how much work that entails.

To Nix or not to Nix?

In a short (or even not-so-short) blog post like this one, I cannot teach you how to use Nix “by the book” to generate perfect container images. But I could at least demonstrate some basic Nix commands, and show how to use Nix in a multi-stage Dockerfile to generate a custom container image in an entirely new way. I hope that these examples will help you to decide if Nix is interesting for your apps.

Personally, I look forward to using Nixery when I need ad-hoc container images, in particular on Kubernetes. Let’s pretend, for instance, that I need an image with curl, tar, and the AWS CLI. My traditional approach would have been to use alpine, and execute apk add curl tar py-pip and then pip install awscli. But with Nixery, I can simply use the image nixery.dev/shell/curl/gnutar/awscli!

And all the little details

If we use very minimal images (like scratch, but also to some extent alpine or even images generated with distroless, Bazel, or Nix), we can run into unexpected issues. There are some files that we usually don’t think about, but that some programs might expect to find on a well-behaved UNIX system, and therefore in a container filesystem.

What files are we talking about exactly? Well, here is a short, but non-exhaustive list:

TLS certificates,
timezone files,
UID/GID mapping files.

Let’s see what these files are exactly, why and when we need them, and how to add them to our images.

TLS certificates

When we establish a TLS connection to a remote server (e.g. by making a request to a web service or API over HTTPS), that remote server generally shows us its certificate. Generally, that certificate has been signed by a well-known certificate authority (or CA). Generally, we want to check that this certificate is valid, and that we know indeed the authority that signed it.

(I say “generally” because there are some very rare scenarios where either that doesn’t matter, or we validate things differently; but if you are in one of these situations, you should know. If you don’t know, assume that you must validate certificates! Safety first!)

The key (pun not intended) in that process lies in these well-known certificate authorities. To validate certificates of the servers that we connect to, we need the certificates of the certificate authorities. These are typically installed under /etc/ssl.

If we are using scratch or another minimal image, and we connect to a TLS server, we might get certificate validation errors. With Go, these look like x509: certificate signed by unknown authority. If that happens, all we need to do is add the certificates to our image. We can get them from pretty much any common image like ubuntu or alpine. Which one we use isn’t important, as they all come with pretty much the same bundle of certs.

The following line will do the trick:

COPY --from=alpine /etc/ssl /etc/ssl

By the way, this shows that if we want to copy files from an image, we can use --from to refer to that image, even if it’s not a build stage!
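If we want to check whether an image has a CA bundle before we hit that x509 error, a tiny diagnostic can help. The sketch below is only an illustration: the helper findCABundle is hypothetical, and the list of candidate paths is an assumption covering a few common distros, not an exhaustive or authoritative list.

```go
package main

import (
	"fmt"
	"os"
)

// Candidate CA bundle locations. This list is an assumption made for
// this example; other distros may store their bundle elsewhere.
var caBundleCandidates = []string{
	"/etc/ssl/certs/ca-certificates.crt", // e.g. Debian, Ubuntu, Alpine
	"/etc/ssl/cert.pem",                  // e.g. Alpine
	"/etc/pki/tls/certs/ca-bundle.crt",   // e.g. Fedora, CentOS
}

// findCABundle returns the first candidate that exists as a non-empty
// regular file (symlinks are followed), or ok=false if none does.
func findCABundle(candidates []string) (path string, ok bool) {
	for _, p := range candidates {
		info, err := os.Stat(p)
		if err == nil && info.Mode().IsRegular() && info.Size() > 0 {
			return p, true
		}
	}
	return "", false
}

func main() {
	if p, ok := findCABundle(caBundleCandidates); ok {
		fmt.Println("found CA bundle:", p)
		return
	}
	fmt.Println("no CA bundle found; TLS certificate validation will probably fail")
}
```

Built with CGO_ENABLED=0 and copied into the image, this should report a bundle once the COPY line above has been added to the Dockerfile.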

Timezones

If our code manipulates time, in particular local time (for instance, if we display time in local time zones, as opposed to dates or internal timestamps), we need timezone files. You might think: “Wait, what? If I want to manage timezones, all I need to know is the offset from UTC!” Ah, but that’s without accounting for daylight saving time! Daylight saving time (DST) is tricky, because not all places have DST. Among places that have DST, the change between standard time and daylight saving time doesn’t happen on the same date. And over the years, some places will implement (or cancel) DST, or change the period during which it’s used.

So if we want to display local time, we need files describing all this information. On UNIX, that’s the tzinfo or zoneinfo files. They are traditionally stored under /usr/share/zoneinfo.
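Here is a small Go sketch showing why a single offset from UTC isn’t enough: the same zone has two different offsets depending on the date. Note one assumption that is not part of the approach described in this post: the blank import of time/tzdata (available since Go 1.15) embeds a copy of the timezone database in the binary, so that this example works even where /usr/share/zoneinfo is missing. The helper offsetsFor is hypothetical.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // assumption: Go 1.15+; embeds a copy of the timezone database
)

// offsetsFor returns the offset from UTC (in seconds) of the given zone
// in mid-January 2020 and in mid-July 2020.
func offsetsFor(zone string) (winter, summer int, err error) {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return 0, 0, err
	}
	_, winter = time.Date(2020, time.January, 15, 12, 0, 0, 0, loc).Zone()
	_, summer = time.Date(2020, time.July, 15, 12, 0, 0, 0, loc).Zone()
	return winter, summer, nil
}

func main() {
	winter, summer, err := offsetsFor("Europe/Paris")
	if err != nil {
		fmt.Println("cannot load timezone:", err)
		return
	}
	fmt.Printf("Europe/Paris in January: UTC%+d\n", winter/3600)
	fmt.Printf("Europe/Paris in July:    UTC%+d\n", summer/3600)
}
```

Embedding the database makes the binary somewhat bigger, so whether to embed it or to copy the zoneinfo files into the image is a trade-off to evaluate for each project.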

Some images (e.g. centos or debian) do include timezone files. Others (e.g. alpine or ubuntu) do not. The package including the files is generally named tzdata.

To install timezone files in our image, we can do e.g.:

COPY --from=debian /usr/share/zoneinfo /usr/share/zoneinfo

Or, if we’re already using alpine, we can simply apk add tzdata.

To check if timezone files are properly installed, we can run a command like this one in our container:

TZ=Europe/Paris date

If it shows something like Fri Mar 13 21:03:17 CET 2020, we’re good. If it shows UTC, it means that the timezone files weren’t found.

UID/GID mapping files

One more thing that our code might need to do: looking up user and group IDs. This is done by looking up in /etc/passwd and /etc/group. Personally, the only scenario where I had to provide these files was to run desktop applications in containers (using tools like clink or Jessica Frazelle’s dockerfiles).

If you need to install these files in a minimal container, you could generate them locally, or in a stage of a multi-stage build, or bind-mount them from the host (depending on what you’re trying to achieve).

This blog post shows how to add a user to a build container, and then copy /etc/passwd and /etc/group to the run container.
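To illustrate what such a lookup involves, here is a minimal sketch that resolves a UID to a user name from passwd-formatted text. The helper lookupUserName and the sample entries are made up for the example; real programs would typically go through their language’s standard library (os/user in Go) rather than parse the file themselves.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lookupUserName scans passwd-formatted text, where each line looks like
// name:password:UID:GID:comment:home:shell, and returns the name of the
// first entry whose UID field equals uid.
func lookupUserName(passwd, uid string) (name string, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(passwd))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Split(line, ":")
		if len(fields) >= 3 && fields[2] == uid {
			return fields[0], true
		}
	}
	return "", false
}

func main() {
	// Hypothetical content for a minimal /etc/passwd.
	const passwd = "root:x:0:0:root:/root:/bin/sh\n" +
		"app:x:1000:1000::/home/app:/sbin/nologin\n"

	for _, uid := range []string{"0", "1000", "1234"} {
		if name, ok := lookupUserName(passwd, uid); ok {
			fmt.Printf("uid %s -> %s\n", uid, name)
		} else {
			fmt.Printf("uid %s -> (no entry)\n", uid)
		}
	}
}
```

In a scratch image without /etc/passwd, every lookup ends up in the “no entry” case, which is why some programs complain that they cannot find the current user.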

Conclusions

As you can see, there are many ways to reduce the size of our images. If you’re wondering, “what’s the absolute best method to reduce image size?”, bad news: there isn’t an absolute best method. As usual, the answer is “it depends”.

Multi-stage builds based on Alpine will give excellent results in many scenarios.

But some libraries won’t be available on Alpine, and building them might require more work than we’d want; so a multi-stage build using classic distros will do great in that case.

Mechanisms like Distroless or Bazel can be even better, but require a significant upfront investment.

Static binaries and the scratch image can be useful when deploying in environments with very little space, like embedded systems.

Finally, if we build and maintain many images (hundreds or more), we might want to stick to a single technique, even if it’s not always the best. It might be easier to maintain hundreds of images using the same structure, rather than having a plethora of variants and some exotic build systems or Dockerfiles for niche scenarios.

If there is a particular technique that you use and that I haven’t mentioned, let me know! I’d love to learn it.

Thanks and acknowledgements

The inspiration to write this series of articles came from that specific tweet by @ellenkorbes. When I deliver container training, I always spend some time explaining how to reduce the size of images, and I often go on fairly long tangents about dynamic vs static linking; and sometimes, I wonder if it’s really necessary to mention all these little details. When I saw L’s tweet and some of the responses to that tweet, I thought, “wow, I guess it might actually help a lot of people if I wrote down what I know about this!”. Next thing you know, I woke up next to an empty crate of Club Mate and three blog posts! 🤷🏻

If you are looking for amazing resources about running Go code on Kubernetes (and other adjacent topics), I strongly recommend that you check out L’s list of talks. In particular, The Quest For The Fastest Deployment Time will be super relevant if you’re working with Kubernetes and want to reduce the time between “saving my code in my editor” and “seeing these changes live on my Kubernetes cluster”. If you liked my blog posts, you will probably enjoy L’s presentation too. (There is also a Portuguese version of that talk on FiqueEmCasaConf.)

Much thanks to the folks who reached out to suggest improvements and additions! In particular:

David Delabassée for Java advice and jlink;
Sylvain Rabot for certificates, timezones, and UID and GID files;
Gleb Peregud and Vincent Ambo for sharing very useful resources on Nix.

These posts were initially written in English, and the English version was proofread by AJ Bowen, who caught many typos, mistakes, and pointed out many ways to improve my prose. All remaining errors are mine and mine only. AJ is currently working on a project involving historical preservation of ancient postcards, and if that’s your jam, you should totally subscribe here to know more.

The French version was translated by Aurélien Violet and Romain Degez. If you enjoyed reading the French version, make sure that you send them a big thank you because this represented a lot more work than it seems!

The Quest for Minimal Docker Images, part 2

2020-03-01T00:00:00+00:00

In the first part, we introduced multi-stage builds, static and dynamic linking, and briefly mentioned Alpine. In this second part, we are going to dive into some details specific to Go. Then we will talk more about Alpine, because it’s worth it; and finally we will see how things play out with other languages like Java, Node, Python, Ruby, and Rust.

So, what about Go?

You might have heard that Go does something very smart: when building a binary, it includes all the necessary dependencies in that binary, to facilitate its deployment.

You might think, “wait, that’s a static binary!” and you’d be right. Almost. (If you’re wondering what a static binary is, you can check the first part of this series.)

Some Go packages rely on system libraries. For instance, DNS resolution, because it can be configured in various ways (think /etc/hosts, /etc/resolv.conf, and some other files). As soon as our code imports one of these packages, Go needs to generate a binary that will call system libraries. For that, it enables a mechanism called cgo (which, generally speaking, allows Go to call C code) and it produces a dynamic executable, referencing the system libraries that it needs to call.

This means that a Go program that uses e.g. the net package will generate a dynamic binary, with the same constraints as a C program. That Go program will require us to copy the needed libraries, or to use an image like busybox:glibc.

We can, however, entirely disable cgo. In that case, instead of using system libraries, Go will use its own built-in reimplementations of these libraries. For instance, instead of using the system’s DNS resolver, it will use its own resolver. The resulting binary will be static. To disable cgo, all we have to do is set the environment variable CGO_ENABLED=0.

For instance:

FROM golang
COPY whatsmyip.go .
ENV CGO_ENABLED=0
RUN go build whatsmyip.go

FROM scratch
COPY --from=0 /go/whatsmyip .
CMD ["./whatsmyip"]

Since cgo is disabled, Go doesn’t link with any system library. Since it doesn’t link with any system library, it can generate a static binary. Since it generates a static binary, that binary can work in the scratch image. 🎉

Tags and netgo

It’s also possible to select which implementation to use on a per-package basis. This is done by using Go “tags”. Tags are instructions for the Go build process to indicate which files should be built or ignored. By enabling the tag “netgo”, we tell Go to use the native net package instead of the one relying on system libraries:

go build -tags netgo whatsmyip.go

If there are no other packages using system libraries, the result will be a static binary. However, if we use another package that causes cgo to be enabled, we’re back to square one.

(That’s why the CGO_ENABLED=0 environment variable is an easier way to guarantee that we get a static executable.)

Tags are also used to select which code to build on different architectures or different operating systems. If we have some code that needs to be different on Linux and Windows, or on Intel and ARM CPUs, we use tags as well to indicate to the compiler “only use this when building on Linux.”
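To give an idea of what this looks like in a source file, here is a minimal sketch of a file guarded by a build constraint. It uses the //go:build syntax (available since Go 1.17; older code uses // +build lines instead). The function platformName and the sibling file mentioned in the comments are hypothetical.

```go
//go:build linux

// This file is only compiled when targeting Linux. A sibling file
// (hypothetical, e.g. platform_windows.go) starting with
// "//go:build windows" would provide its own platformName.
package main

import "fmt"

// platformName reports which platform-specific file was compiled in.
func platformName() string {
	return "linux"
}

func main() {
	fmt.Println("built for", platformName())
}
```

Custom tags work the same way: a file starting with //go:build netgo is only included when we pass -tags netgo to go build.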

Alpine

We briefly mentioned Alpine in the first part, and we said “we’ll talk about it later.” Now is the time!

Alpine is a Linux distribution that, until a few years ago, most people would have called “exotic”. It’s designed to be small and secure, and uses its own package manager, apk.

Unlike e.g. CentOS or Ubuntu, it’s not backed by an army of maintainers paid by a huge company like Red Hat or Canonical. It has fewer packages than these distributions. (With its default, out-of-the-box repositories, Alpine has about 10,000 packages; Debian, Fedora, and Ubuntu each have more than 50,000.)

Before the rise of containers, Alpine wasn’t very popular, perhaps because very few people actually care about the installed size of their Linux system. After all, the size of programs, libraries, and other system files is usually negligible compared to the size of the documents and data that we manipulate (like pictures and movies for end users; or databases on servers).

Alpine was brought into the spotlight when people realized that it would make an excellent distribution for containers. We said it was small; how small exactly? Well, when containers became popular, everyone noticed that container images were big. They take up disk space; pulling them is slow. (There is a good chance that you’re reading this because you’re concerned by this very problem, right?) The first base images were using “cloud images” which were very popular on cloud servers, and weighed anywhere from a few hundred MB to a few GB. That size is fine for cloud instances (where the image gets transferred from an image storage system to a virtual machine, generally through a very fast local network), but pulling that over cable or DSL internet is much slower. And so distro maintainers started to work on smaller images specifically for containers. But while popular distributions like Debian, Ubuntu, and Fedora struggled to get under 100 MB (sometimes by removing potentially useful tools like ifconfig or netstat), Alpine set the score by having a 5 MB image, without sacrificing these tools.

Another advantage of Alpine Linux (in my opinion) is that its package manager is ridiculously fast. The speed of a package manager is usually not a major concern, because on a normal system, we only need to install things once; we’re not installing them over and over all the time. With containers, however, we are building images regularly, and we often spin up a container using a base image, and install a few packages to test something, or because we need an extra tool that wasn’t in the image.

Just for fun, I decided to get some popular base images, and check how long it took to install tcpdump in them. Look at the results:

Base image	Size	Time to install `tcpdump`
alpine:3.11	5.6 MB	1-2s
archlinux:20200106	409 MB	7-9s
centos:8	237 MB	5-6s
debian:10	114 MB	5-7s
fedora:31	194 MB	35-60s
ubuntu:18.04	64 MB	6-8s

The size is reported with docker images; the time was measured by running time docker run <image> <packagemanager> install tcpdump a few times on a t3.medium instance in eu-north-1. (When I’m in Europe, I use servers in Stockholm because Sweden’s electricity is cleaner than anywhere else and I care about the planet. Don’t believe the bullshit about eu-central-1 being “green”, the datacenters in Frankfurt run primarily on coal.)

If you want to know more about Alpine Linux internals, I recommend this talk by Natanael Copa.

Alright, so Alpine is small. How can we use it for our own applications? There are at least two strategies that are worth considering:

using alpine as our “run” stage,
using alpine as both our “build” and “run” stages.

Let’s try them out.

Using Alpine as our “run” stage

Let’s build the following Dockerfile, and run the resulting image:

FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c

FROM alpine
COPY --from=mybuildstage hello .
CMD ["./hello"]

We will get the following error message:

standard_init_linux.go:211: exec user process caused "no such file or directory"

We’ve seen that error message before, when we tried to run the C program in the scratch image. We saw that the problem came from the lack of dynamic libraries in the scratch image. It looks like the libraries are also missing from the alpine image, then?

Not exactly. Alpine also uses dynamic libraries. After all, one of its design goals is to achieve a small footprint; and static binaries wouldn’t help with that.

But Alpine uses a different standard C library. Instead of the GNU C library, it uses musl. (I personally pronounce it emm-you-ess-ell, but the official pronunciation is like “mussel” or “muscle”.) This library is smaller, simpler, and safer than the GNU C library. And programs dynamically linked against the GNU C library won’t work with musl, and vice versa.

You might wonder, “if musl is smaller, simpler, and safer, why don’t we all switch to it?”

… Because the GNU C library has a lot of extensions, and some programs do use these extensions; sometimes without even realizing that they’re using non-standard extensions. The musl documentation has a list of functional differences from the GNU C library.

Furthermore, musl is not binary-compatible. A binary compiled for the GNU C library won’t work with musl (except in some very simple cases), meaning that code has to be recompiled (and sometimes tweaked a tiny bit) to work with musl.

TL;DR: using Alpine as the “run” stage will only work if the program has been built for musl, which is the C library used by Alpine.
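If you’re curious about which C library a given binary expects, you can look at the “interpreter” (the dynamic loader) recorded in its ELF headers. On x86-64, binaries linked against the GNU C library typically request /lib64/ld-linux-x86-64.so.2, while musl binaries request /lib/ld-musl-x86_64.so.1; when that loader is missing from the image, we get the “no such file or directory” error shown earlier. The sketch below uses Go’s debug/elf package; the function name inspect is mine, and the program is meant as a quick diagnostic rather than a complete tool.

```go
package main

import (
	"debug/elf"
	"fmt"
	"io"
	"os"
	"strings"
)

// inspect returns the interpreter (dynamic loader) requested by the ELF
// file at path, along with the shared libraries it declares as needed.
// A fully static binary has no interpreter and no needed libraries.
func inspect(path string) (interp string, libs []string, err error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()

	for _, p := range f.Progs {
		if p.Type != elf.PT_INTERP {
			continue
		}
		// The PT_INTERP segment holds a NUL-terminated path.
		data, err := io.ReadAll(p.Open())
		if err != nil {
			return "", nil, err
		}
		interp = strings.TrimRight(string(data), "\x00")
	}

	libs, err = f.ImportedLibraries()
	if err != nil {
		return "", nil, err
	}
	return interp, libs, nil
}

func main() {
	paths := os.Args[1:]
	if len(paths) == 0 {
		// With no arguments, inspect this program itself.
		self, err := os.Executable()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		paths = []string{self}
	}
	for _, p := range paths {
		interp, libs, err := inspect(p)
		if err != nil {
			fmt.Printf("%s: %v\n", p, err)
			continue
		}
		if interp == "" {
			fmt.Printf("%s: no interpreter requested (static binary)\n", p)
			continue
		}
		fmt.Printf("%s: interpreter %s, libraries %v\n", p, interp, libs)
	}
}
```

Running it against the hello binary built with the gcc image and against one built on Alpine should show the two different loaders.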

That being said, it’s relatively easy to build a program for musl. All we have to do is to build it with Alpine itself!

Using Alpine as “build” and “run” stages

We’ve decided to generate a binary linked against musl, so that it can run in the alpine base image. We have two main routes to do that.

Some official images provide :alpine tags that should be as close as possible to the normal image, but use Alpine (and musl) instead.
Some official images do not have an :alpine tag; for those, we need to build an equivalent image ourselves, generally using alpine as a base.

The golang image belongs to the first category: there is a golang:alpine image providing the Go toolchain built on Alpine.

We can build our little Go program with a Dockerfile like this:

FROM golang:alpine
COPY hello.go .
RUN go build hello.go

FROM alpine
COPY --from=0 /go/hello .
CMD ["./hello"]

The resulting image is 7.5 MB. It is admittedly a lot for a program that merely prints “Hello, world!”, but:

a more complex program wouldn’t be much bigger,
this image contains a lot of useful tools,
since it’s based on Alpine, it’s easy and fast to add more tools, in the image or on the spot as needed.

Now, what about our C program? As I write these lines, there is no gcc:alpine image. So we have to start with the alpine image, and install a C compiler. The resulting Dockerfile looks like this:

FROM alpine
RUN apk add build-base
COPY hello.c .
RUN gcc -o hello hello.c

FROM alpine
COPY --from=0 hello .
CMD ["./hello"]

The trick is to install build-base (and not simply gcc) because the gcc package on Alpine would install the compiler, but not all the libraries that we need. Instead, we use build-base, which is the equivalent of the Debian or Ubuntu build-essential package, bringing in compilers, libraries, and tools like make.

Bottom line: when using multi-stage builds, we can use the alpine image as a base to run our code. If our code is a compiled program written in a language using dynamic libraries (which is the case of almost every compiled language that we may use in containers), we will need to generate a binary linked with Alpine’s musl C library. The easiest way to achieve that is to base our build image on top of alpine or another image using Alpine. Many official images offer a tag :alpine for that purpose.

For our “hello world” program, here are the final results, comparing all the techniques we’ve shown so far.

Single-stage build using the golang image: 805 MB
Multi-stage build using golang and ubuntu: 66.2 MB
Multi-stage build using golang and alpine: 7.6 MB
Multi-stage build using golang and scratch: 2 MB

That’s a 400x size reduction, or 99.75%. That sounds impressive, but let’s look at the results if we try with a slightly more realistic program that makes use of the net package.

Single-stage build using the golang image: 810 MB
Multi-stage build using golang and ubuntu: 71.2 MB
Multi-stage build using golang:alpine and alpine: 12.6 MB
Multi-stage build using golang and busybox:glibc: 12.2 MB
Multi-stage build using golang, CGO_ENABLED=0, and scratch: 7 MB

That’s still a 100x size reduction, a.k.a. 99%. Sweet!
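For the record, here is how these ratios are computed, as a tiny Go sketch (the function name reduction is mine). With the sizes listed above, the “hello world” case works out to about 402x (99.75% saved), and the net case to about 116x (roughly 99.1% saved).

```go
package main

import "fmt"

// reduction returns how many times smaller the "after" size is compared
// to the "before" size, and the percentage of the original size saved.
func reduction(beforeMB, afterMB float64) (factor, percent float64) {
	factor = beforeMB / afterMB
	percent = (1 - afterMB/beforeMB) * 100
	return factor, percent
}

func main() {
	// Sizes taken from the two lists above: single-stage golang image
	// versus the smallest multi-stage result.
	f, p := reduction(805, 2)
	fmt.Printf("hello world: %.1fx smaller, %.2f%% saved\n", f, p)

	f, p = reduction(810, 7)
	fmt.Printf("with net:    %.1fx smaller, %.2f%% saved\n", f, p)
}
```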

What about Java?

Java is a compiled language, but it runs on the Java Virtual Machine (or JVM). Let’s see what this means for multi-stage builds.

Static or dynamic linking?

Conceptually, Java uses dynamic linking, because Java code will call Java APIs that are provided by the JVM. The code for these APIs is therefore outside of your Java “executable” (typically a JAR or WAR file).

However, these Java libraries are not totally independent from the system libraries. Some Java functions might eventually call system libraries; for instance, when we open a file, at some point the JVM is going to call open(), fopen(), or some variant thereof. You can read that again: the JVM is going to call these functions; so the JVM itself might be dynamically linked with system libraries.

This means that in theory, we can use any JVM to run our Java bytecode; it doesn’t matter if it’s using musl or the GNU C library. So we can build our Java code with any image that has a Java compiler, and then run it with any image that has a JVM.

The Java Class Files Format

In practice, however, the format of Java class files (the bytecode generated by the Java compiler) has evolved over time. The bulk of the changes from one Java release to the next are located within the Java APIs. Some changes concern the language itself, like the addition of generics in Java 5. These can require changes to the format of Java class files, breaking compatibility with older versions.

This means that by default, classes compiled with a given version of the Java compiler won’t work with older versions of the JVM. But we can ask the compiler to target an older file format with the -target flag (up to Java 8) or the --release flag (from Java 9). The latter will also select the correct class path, to make sure that if we build code designed to run on e.g. Java 11, we don’t accidentally use libraries and APIs from Java 12 (which would prevent our code from running on Java 11).

(You can read this good blog post about Java Class File Versions if you want to know more about this.)
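If you want to check which Java version a given .class file targets, the information is right at the beginning of the file: a 4-byte magic number (0xCAFEBABE), followed by the minor and major version numbers, each stored as a 2-byte big-endian integer. Here is a minimal sketch in Go (to stay consistent with the other examples in this series); the function name is hypothetical, and the “major minus 44” rule is a convenient shortcut that holds for modern releases (52 is Java 8, 55 is Java 11).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// classFileJavaVersion inspects the first 8 bytes of a .class file
// (magic, minor version, major version; all big-endian) and returns
// the Java release matching the major version.
func classFileJavaVersion(header []byte) (int, error) {
	if len(header) < 8 {
		return 0, errors.New("header too short: need at least 8 bytes")
	}
	if binary.BigEndian.Uint32(header[0:4]) != 0xCAFEBABE {
		return 0, errors.New("not a Java class file (bad magic number)")
	}
	major := int(binary.BigEndian.Uint16(header[6:8]))
	// Shortcut valid for modern releases: 52 -> Java 8, 55 -> Java 11.
	return major - 44, nil
}

func main() {
	// Header of a class compiled for Java 11 (major version 55 = 0x37).
	header := []byte{0xCA, 0xFE, 0xBA, 0xBE, 0x00, 0x00, 0x00, 0x37}

	version, err := classFileJavaVersion(header)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("this class file targets Java", version)
}
```

In a real check, the header would be the first 8 bytes read from the .class file; a JVM older than the reported version will refuse to load that class.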

JDK vs JRE

If you are familiar with the way Java is packaged on most platforms, you probably already know about JDK and JRE.

JRE is the Java Runtime Environment. It contains what we need to run Java applications; namely, the JVM.

JDK is the Java Development Kit. It contains the same thing as the JRE, but it also has what we need to develop (and build) Java applications; namely, the Java compiler.

In the Docker ecosystem, most Java images provide the JDK, so they are suitable to build and run Java code. We will also see some images with a :jre tag (or a tag containing jre somewhere). These are images containing only the JRE, without the full JDK. They are smaller.

What does this mean in terms of multi-stage builds?

We can use the regular images for the build stage, and then a smaller JRE image for the run stage.

`java` vs `openjdk`

You might already know this if you’re using Java in Docker; but you shouldn’t use the java official images, because they aren’t receiving updates anymore. Instead, use the openjdk images.

You can also try the amazoncorretto ones (Corretto is Amazon’s fork of OpenJDK, with their extra patches).

Small Java images

Alright, so what should we use? If you’re on the market for small Java images, here are a few good candidates:

openjdk:8-jre-alpine (only 85 MB!)
openjdk:11-jre (267 MB) or even openjdk:11-jre-slim (204 MB) if you need a more recent version of Java
openjdk:14-alpine (338 MB) if you need an even more recent version

Unfortunately, not all combinations are available; e.g. openjdk:14-jre-alpine doesn’t exist (which is sad because it might perhaps be smaller than the -jre and -alpine variants) but there is probably a good reason for that. (If you are aware of that reason, please tell me, I’d love to know!)

Remember that you should build your code to match the JRE version. This blog post explains how to do that in various environments (IDE, Maven, etc.) if you need details.

But we can do even better, by building a custom JRE with jlink.

jlink

Java 9 (and later versions) includes a tool called jlink. With jlink, we can build a custom JVM, with only the components that we need. This can help us to reduce even further the size of our images. I find it particularly useful to get a small image with a recent version of the JRE, because the JRE tends to grow over time (since it adds more and more APIs). Thanks to jlink, we don’t have to choose between “small but old JRE” and “recent but big JRE”, we can have the best of both worlds!

Running the following command will create a custom JRE in /dir, with the JVM available as /dir/bin/java:

jlink --add-modules java.base,java.some.other.module,etc --output /dir

How do we find out the list of modules? We can use another tool called jdeps. In fact, jdeps --print-module-deps will specifically output the dependencies in a format suitable for jlink!

The Dockerfile below shows how to use jlink in a multi-stage setup. The build stage compiles the code, computes the dependencies with jdeps, then generates a JRE with jlink. The run stage copies the compiled code as well as the JRE.

FROM openjdk:15-alpine
RUN apk add binutils # for objcopy, needed by jlink
COPY hello.java .
RUN javac hello.java
RUN jdeps --print-module-deps hello.class > java.modules
RUN jlink --strip-debug --add-modules $(cat java.modules) --output /java

FROM alpine
COPY --from=0 /java /java
COPY --from=0 hello.class .
CMD exec /java/bin/java -cp . hello

Note that when using jlink, we need to be mindful about the C library that we’re using. Here, we wanted to go for the smallest possible image size, so we are using alpine in the run stage. Therefore, we need to use an Alpine-based image in the build stage, so that jlink generates a JRE compatible with musl.

(I would like to thank David Delabassée, who told me about jlink and encouraged me to try it out. When learning about jlink, the following resources were useful: this blog post by Yoan Blanc, this tutorial by Nicolai Parlog, and the jlink documentation. David also recommended that I check GraalVM, but I saved that for next time!)

Java: setting the score

You want some numbers? I got some numbers for you! I’ve built a trivial “hello world” program in Java:

class hello {
  public static void main(String [] args) {
    System.out.println("Hello, world!");
  }
}

You can find all the Dockerfiles in the minimage GitHub repo, and here are the sizes of the various builds.

Single-stage build using the java image: 643 MB
Single-stage build using the openjdk image: 490 MB
Multi-stage build using openjdk and openjdk:jre: 479 MB
Single-stage build using the amazoncorretto image: 390 MB
Multi-stage build using openjdk:11 and openjdk:11-jre: 267 MB
Multi-stage build using openjdk:15 with jlink and ubuntu: 106 MB
Multi-stage build using openjdk:8 and openjdk:8-jre-alpine: 85 MB
Multi-stage build using openjdk:15-alpine with jlink and alpine: 47 MB

What about interpreted languages?

If you mostly write code in an interpreted language like Node, Python, or Ruby, you might wonder if you should worry at all about all of this, and if there is any way to optimize image size. It turns out that the answer to both questions is yes!

Alpine with interpreted languages

We can use alpine and other Alpine-based images to run code in our favorite scripting languages. This will always work for code that only uses the standard library, or whose dependencies are “pure”, i.e. written in the same language, without calling into C code and external libraries.

Now, if our code has dependencies on external libraries, things can get more complicated. We will have to install these libraries on Alpine. Depending on the situation, this might be:

Easy, when the library includes installation instructions for Alpine. It will tell us which Alpine packages to install and how to build the dependencies. This is fairly rare, though, because Alpine isn’t as popular as Debian or Fedora, for instance.
Average, when the library doesn’t have installation instructions for Alpine, but has instructions for another distro and we can easily figure out which Alpine packages correspond to the other distro’s packages.
Hard, when our dependency is using packages that don’t have Alpine equivalents. Then we might have to build from source, and it will be a whole different story!

That last scenario is precisely the kind of circumstance when Alpine might not help, and might even be counterproductive. If we need to build from source, that means installing a compiler, libraries, headers … This will take extra space on the final image. (Yes, we could use multi-stage builds; but in that specific context, depending on the language, that can be complex, because we need to figure out how to produce a binary package for our dependencies.) Building from source will also take much longer.

There is one particular situation where using Alpine will exhibit all these issues: data science in Python. Popular packages like numpy or pandas are available as pre-compiled Python packages called wheels, but these wheels are tied to a specific C library. (“Oh, no!” you might think, “Not the libraries again!”) This means that they will install fine on the “normal” Python images, but not on the Alpine variants. On Alpine, they will require installing system packages and, in some cases, very lengthy rebuilds. There is a pretty good article dedicated to that problem, explaining how using Alpine can make Python Docker builds 50x slower.

If you read that article, you might think, “whoa, should I stay away from Alpine for Python, then?” I’m not so sure. For data science, probably yes. But for other workloads, if you want to reduce image size, it’s always worth a shot.
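To make the multi-stage idea mentioned above a bit more concrete, here is a minimal sketch of one way to produce binary packages (wheels) in a build stage and install them in a compiler-free run stage. The file names requirements.txt and app.py are placeholders, and build-base alone will not be enough for every package; treat this as a starting point, not a recipe.

```dockerfile
# Build stage: compile wheels for everything listed in requirements.txt
# (a hypothetical file listing our dependencies).
FROM python:alpine AS wheels
# build-base brings in gcc, make, and the C library headers; some packages
# will need additional -dev packages on top of this.
RUN apk add --no-cache build-base
WORKDIR /wheels
COPY requirements.txt .
RUN pip wheel --wheel-dir /wheels -r requirements.txt

# Run stage: install the pre-built wheels, without any compiler.
FROM python:alpine
WORKDIR /app
COPY --from=wheels /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links /wheels \
    -r /wheels/requirements.txt
# app.py is a placeholder for the actual application.
COPY . .
CMD ["python", "app.py"]
```

Two caveats: the /wheels directory still takes space in a layer of the final image, and compiled extensions may need shared libraries at run time, which then have to be installed with apk in the run stage as well.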

`:slim` images

If we want a compromise between the default images and their Alpine variants, we can check the :slim images. The slim images are usually based on Debian (and on the GNU C library) but they have been optimized for size, by removing a lot of non-essential packages. Sometimes, they might have just what you need; and sometimes, they will lack essential things (like, a compiler!) and installing these things will bring you back closer to the original size; but it’s nice to have the possibility to try and use them.

To give you an idea, here are the sizes of the default, :alpine, and :slim variants for some popular interpreted languages:

Image	Size
`node`	939 MB
`node:alpine`	113 MB
`node:slim`	163 MB
`python`	932 MB
`python:alpine`	110 MB
`python:slim`	193 MB
`ruby`	842 MB
`ruby:alpine`	54 MB
`ruby:slim`	149 MB

In the specific case of Python, here are the image sizes obtained after installing the popular packages matplotlib, numpy, and pandas on various Python base images:

Image and technique	Size
`python`	1.26 GB
`python:slim`	407 MB
`python:alpine`	523 MB
`python:alpine` multi-stage	517 MB

We can see that using Alpine doesn’t help us here: the result is larger than with :slim, and even a multi-stage build barely improves the situation. (You can find the relevant Dockerfiles in the minimage repository; they are the ones named Dockerfile.pyds.*.)

Don’t conclude too quickly that Alpine is bad for Python, though! Here are the sizes for a Django application using a large number of dependencies:

Image and technique	Size
`python`	1.23 GB
`python:alpine`	636 MB
`python:alpine` multi-stage	391 MB

(And in that specific case, I gave up on using the :slim image because it required installing too many extra packages.)

So as you can see, it’s not always clear cut. Sometimes, :alpine will give better results, and sometimes :slim will do it. If we really need to optimize the size of our images, we need to try both and see what happens. Over time, we will gather experience and get a feel for which variant is appropriate for which applications.
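One low-effort way to try both variants is to make the tag of the base image a build argument. The sketch below assumes a hypothetical application with a requirements.txt and an app.py; VARIANT is a name made up for this example.

```dockerfile
# VARIANT is a build argument introduced for this example; it selects
# the tag of the python base image (slim by default).
ARG VARIANT=slim
FROM python:${VARIANT}
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# app.py is a placeholder for the actual application.
COPY . .
CMD ["python", "app.py"]
```

We can then build it once with docker build --build-arg VARIANT=slim -t myapp:slim . and once with docker build --build-arg VARIANT=alpine -t myapp:alpine . and compare the results with docker images myapp. If the build only succeeds after installing extra packages on one of the variants, that is a useful data point too.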

Multi-stage with interpreted languages

What about multi-stage builds?

They will be particularly useful when we generate any kind of asset.

For instance, say we have a Django application (probably using some python base image) but we minify our JavaScript with UglifyJS and our CSS with Sass. The naive approach would be to include all that jazz in our image, but the Dockerfile would become complex (because we’d be installing Node in a Python image) and the final image would of course be very big. Instead, we can use multiple stages: one using node to minify our assets, and one using python for the app itself, bringing in the JS and CSS assets from the previous stage.

This is also going to result in better build times, since changes in the Python code won’t always result in a rebuild of the JS and CSS (and vice versa). In that specific case, I would even recommend using two separate stages for JS and CSS, so that changing one doesn’t trigger a rebuild of the other.
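Here is a rough sketch of what such a Dockerfile could look like, with one stage for the JS, one for the CSS, and one for the app. All file names (static/app.js, static/app.scss, requirements.txt) are placeholders, and the final CMD uses Django’s development server purely for illustration.

```dockerfile
# Stage 1: minify the JavaScript (file names are placeholders).
FROM node AS js
RUN npm install --global uglify-js
WORKDIR /src
COPY static/app.js .
RUN uglifyjs app.js --compress --mangle --output app.min.js

# Stage 2: compile and compress the stylesheets.
FROM node AS css
RUN npm install --global sass
WORKDIR /src
COPY static/app.scss .
RUN sass --style=compressed app.scss app.min.css

# Stage 3: the Django application itself, without Node.
FROM python
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Bring in only the generated assets from the two previous stages.
COPY --from=js /src/app.min.js static/
COPY --from=css /src/app.min.css static/
# Development server, for illustration only.
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
```

Because each asset stage only copies the files it needs, editing the stylesheet leaves the js stage cached, and editing Python code leaves both asset stages cached.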

What about Rust?

I am very curious about Rust, a modern programming language initially designed at Mozilla, with growing popularity in the web and infrastructure space. So I was wondering what kind of behavior to expect as far as Docker images are concerned.

It turns out that Rust generates binaries dynamically linked with the C library. So binaries built with the rust image will run with usual base images like debian, ubuntu, fedora, etc., but will not work with busybox:glibc. This is because the binaries are linked with libdl, which is not included in busybox:glibc at the moment.

However, there is a rust:alpine image, and the generated binaries work perfectly well with alpine as a base.

I wondered if Rust could produce static binaries. The Rust documentation explains how to do it. On Linux, this is done by building a special version of the Rust compiler, and it requires musl. Yes, the same musl used by Alpine. If you want to obtain minimal images with Rust, it should be fairly easy by following the instructions in the documentation, then dropping the resulting binaries in a scratch image.
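As an illustration, here is a minimal sketch of what this could look like with a rustup-managed toolchain like the one shipped in the rust image. This is one possible approach, not necessarily the one described in the documentation mentioned above. It assumes an x86_64 build machine, a pure-Rust program without C dependencies, and a hypothetical single-file program named hello.rs.

```dockerfile
FROM rust AS build
# Assumption: x86_64 build machine; other architectures need a
# different musl target triple.
RUN rustup target add x86_64-unknown-linux-musl
WORKDIR /src
# hello.rs is a hypothetical single-file program.
COPY hello.rs .
RUN rustc --target x86_64-unknown-linux-musl -o hello hello.rs

# Run stage: nothing but the statically linked binary.
FROM scratch
COPY --from=build /src/hello /hello
CMD ["/hello"]
```

Crates that wrap C libraries usually need extra work (a musl-capable C toolchain, for instance), so this sketch mostly applies to pure-Rust code.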

What’s next?

In the first two parts of this series, we covered the most common methods to optimize Docker image size, and we saw how they applied to various languages, compiled or interpreted.

In the third part, we will talk about a few more methods. We will see how standardizing on a specific base image can reduce not only image size, but also I/O and memory usage. We will mention a few techniques that are not specific to containers, but that can always be useful. And we will touch on more exotic builders, for the sake of completeness.

The Quest for Minimal Docker Images, part 1

2020-02-01T00:00:00+00:00

When getting started with containers, it’s pretty easy to be shocked by the size of the images that we build. We’re going to review a number of techniques to reduce image size, without sacrificing developers’ and ops’ convenience. In this first part, we will talk about multi-stage builds, because that’s where anyone should start if they want to reduce the size of their images. We will also explain the differences between static and dynamic linking, as well as why we should care about that. This will be the occasion to introduce Alpine.

In the second part, we will see some particularities relevant to various popular languages. We will talk about Go, but also Java, Node, Python, Ruby, and Rust. We will also talk more about Alpine and how to leverage it across the board.

In the third part, we will cover some patterns (and anti-patterns!) relevant to most languages and frameworks, like using common base images, stripping binaries and reducing asset size. We will wrap up with some more exotic or advanced methods like Bazel, Distroless, DockerSlim, or UPX. We will see how some of these will be counter-productive in some scenarios, but might be useful in others.

Note that the sample code and all the Dockerfiles mentioned here are available in a public GitHub repository, with a Compose file to build all the images and easily compare their sizes.

→ https://github.com/jpetazzo/minimage

What we’re trying to solve

Many people building their first Docker images that compile some code are unpleasantly surprised by the resulting image sizes.

Look at this trivial “hello world” program in C:

/* hello.c */
#include <stdio.h>

int main () {
  puts("Hello, world!");
  return 0;
}

We could build it with the following Dockerfile:

FROM gcc
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]

… But the resulting image will be more than 1 GB, because it will have the whole gcc image in it!

If we use e.g. the Ubuntu image, install a C compiler, and build the program, we get a 300 MB image, which looks better, but is still way too much for a binary that, by itself, is less than 20 kB:

$ ls -l hello
-rwxr-xr-x   1 root root 16384 Nov 18 14:36 hello

Same story with the equivalent Go program:

package main

import "fmt"

func main() {
	fmt.Println("Hello, world!")
}

Building this code with the golang image, the resulting image is 800 MB, even though the hello program is only 2 MB:

$ ls -l hello
-rwxr-xr-x 1 root root 2008801 Jan 15 16:41 hello

There has to be a better way!

Let’s see how to drastically reduce the size of these images. In some cases, we can achieve 99.8% size reduction (but we will see that it’s not always a good idea to go that far).

Pro tip: to easily compare the size of our images, we are going to use the same image name, but different tags. For instance, our images will be hello:gcc, hello:ubuntu, hello:thisweirdtrick, etc. That way, we can run docker images hello and it will list all the tags for that hello image, with their sizes, without being encumbered with the bazillions of other images that we have on our Docker engine.

Multi-stage builds

This is the first (and most drastic) step we can take to reduce the size of our images. We need to be careful, though, because if it’s done incorrectly, it can result in images that are harder to operate (or could even be completely broken).

Multi-stage builds come from a simple idea: “I don’t need to include the C or Go compiler and the whole build toolchain in my final application image. I just want to ship the binary!”

We obtain a multi-stage build by adding another FROM line in our Dockerfile. Look at the example below:

FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c
FROM ubuntu
COPY --from=mybuildstage hello .
CMD ["./hello"]

We use the gcc image to build our hello.c program. Then, we start a new stage (that I will call the “run stage”) using the ubuntu image. We copy the hello binary from the previous stage. The final image is 64 MB instead of 1.1 GB, so that’s about 95% size reduction:

$ docker images minimage
REPOSITORY          TAG                    ...         SIZE
minimage            hello-c.gcc            ...         1.14GB
minimage            hello-c.gcc.ubuntu     ...         64.2MB

Not bad, right? We can do even better. But first, a few tips and warnings.

You don’t have to use the AS keyword when declaring your build stage. When copying files from a previous stage, you can simply indicate the number of that build stage (starting at zero).

In other words, since mybuildstage is the first stage of our Dockerfile, the two lines below are equivalent:

COPY --from=mybuildstage hello .
COPY --from=0 hello .

Personally, I think it’s fine to use numbers for build stages in short Dockerfiles (say, 10 lines or less), but as soon as your Dockerfile gets longer (and possibly more complex, with multiple build stages), it’s a good idea to name the stages explicitly. It will help maintenance for your teammates (and also for future you who will review that Dockerfile months later).

Warning: use classic images

I strongly recommend that you stick to classic images for your “run” stage. By “classic”, I mean something like CentOS, Debian, Fedora, Ubuntu; something familiar. You might have heard about Alpine and be tempted to use it. Do not! At least, not yet. We will talk about Alpine later, and we will explain why we need to be careful with it.

Warning: `COPY --from` uses absolute paths

When copying files from a previous stage, paths are interpreted as relative to the root of the previous stage.

The problem appears as soon as we use a builder image with a WORKDIR, for instance the golang image.

If we try to build this Dockerfile:

FROM golang
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 hello .
CMD ["./hello"]

We get an error similar to the following one:

COPY failed: stat /var/lib/docker/overlay2/1be...868/merged/hello: no such file or directory

This is because the COPY command tries to copy /hello, but since the WORKDIR in golang is /go, the program path is really /go/hello.

If we are using official (or very stable) images in our build, it’s probably fine to specify the full absolute path and forget about it.

However, if our build or run images might change in the future, I suggest specifying a WORKDIR in the build image. This will make sure that the files are where we expect them, even if the base image that we use for our build stage changes later.

Following this principle, the Dockerfile to build our Go program will look like this:

FROM golang
WORKDIR /src
COPY hello.go .
RUN go build hello.go

FROM ubuntu
COPY --from=0 /src/hello .
CMD ["./hello"]

If you’re wondering about the efficiency of multi-stage builds for Golang, well, they let us go (no pun intended) from an 800 MB image down to a 66 MB one:

$ docker images minimage
REPOSITORY     TAG                              ...    SIZE
minimage       hello-go.golang                  ...    805MB
minimage       hello-go.golang.ubuntu-workdir   ...    66.2MB

Not bad!

`FROM scratch`

Back to our “Hello World” program. The C version is 16 kB, the Go version is 2 MB. Can we get an image of that size?

Can we build an image with just our binary and nothing else?

Yes! All we have to do is use a multi-stage build, and pick scratch as our run image (with some caveats, which we’ll see shortly). scratch is a virtual image. You can’t pull it or run it, because it’s completely empty. This is why if a Dockerfile starts with FROM scratch, it means that we’re building from scratch, without using any pre-existing ingredient.

This gives us the following Dockerfile:

FROM golang
COPY hello.go .
RUN go build hello.go

FROM scratch
COPY --from=0 /go/hello .
CMD ["./hello"]

If we build that image, its size is exactly the size of the binary (2 MB), and it works!

There are, however, a few things to keep in mind when using scratch as a base.

No shell

The scratch image doesn’t have a shell. This means that we cannot use the string syntax with CMD (or RUN, for that matter). Consider the following Dockerfile:

...
FROM scratch
COPY --from=0 /go/hello .
CMD ./hello

If we try to docker run the resulting image, we get the following error message:

docker: Error response from daemon: OCI runtime create failed:
container_linux.go:345: starting container process caused
"exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown.

It’s not presented in a very clear way, but the core information is here: /bin/sh is missing from the image.

This happens because when we use the string syntax with CMD or RUN, the argument gets passed to /bin/sh. This means that our CMD ./hello above will execute /bin/sh -c "./hello", and since we don’t have /bin/sh in the scratch image, this fails.

The workaround is simple: use the JSON syntax in the Dockerfile. CMD ./hello becomes CMD ["./hello"]. When Docker detects the JSON syntax, it runs the arguments directly, without a shell.

No debugging tools

The scratch image is, by definition, empty; so it doesn’t have anything to help us troubleshoot the container. No shell (as we said in the previous paragraph) but also no ls, ps, ping, and so on and so forth. This means that we won’t be able to enter the container (with docker exec or kubectl exec) to look around.

(Strictly speaking, there are some methods to troubleshoot our container anyway. We can use docker cp to get files out of the container; we can use docker run --net container: to interact with the network stack; we can interact with the container’s processes with docker run --pid container: or even directly from the host; similarly, we can enter the container’s various namespaces with a low-level tool like nsenter. Recent versions of Kubernetes have the concept of ephemeral containers, though it’s still in alpha. So let’s keep in mind that while these techniques are available, they will definitely make our lives more complicated, especially when we have so much to deal with already!)

One workaround here is to use an image like busybox or alpine instead of scratch. Granted, they’re bigger (respectively 1.2 MB and 5.5 MB), but in the grand scheme of things, it’s a small price to pay if we compare it to the hundreds of megabytes, or even gigabytes, of our original image.
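For our Go “hello world”, switching from scratch to busybox is a one-line change in the run stage. The sketch below reuses the WORKDIR convention from earlier; the stage name build is arbitrary.

```dockerfile
FROM golang AS build
WORKDIR /src
COPY hello.go .
RUN go build hello.go

# Same as the scratch version, but with a shell and basic tools.
FROM busybox
COPY --from=build /src/hello /hello
CMD ["/hello"]
```

With a long-running program built this way, we could then use docker exec with sh to look around inside the container, which is not possible with the scratch version.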

No libc

This one is trickier to troubleshoot. Our simple “hello world” in Go worked fine, but if we try to put a C program in the scratch image, or a more complex Go program (for instance, anything using network packages), we will get the following error message:

standard_init_linux.go:211: exec user process caused "no such file or directory"

Some file seems to be missing. But it doesn’t tell us which file is missing exactly.

The missing file is a dynamic library that is necessary for our program to run.

What’s a dynamic library and why do we need it?

After a program is compiled, it gets linked with the libraries that it is using. (As simple as it is, our “hello world” program is still using libraries; that’s where the puts function comes from.) A long time ago (before the 90s), we used mostly static linking, meaning that all the libraries used by a program would be included in the binary. This is perfect when software is executed from a floppy disk or a cartridge, or when there is simply no standard library. However, on a timesharing system like Linux, we run many concurrent programs that are stored on a hard disk; and these programs almost always use the standard C library. In that scenario, it gets more advantageous to use dynamic linking. With dynamic linking, the final binary doesn’t contain the code of all the libraries that it uses. Instead, it contains references to these libraries, like “this program needs functions cos and sin and tan from libtrigonometry.so”. When the program is executed, the system looks for that libtrigonometry.so and loads it alongside the program so that the program can call these functions.

Dynamic linking has multiple advantages.

It saves disk space, since common libraries don’t have to be duplicated anymore.
It saves memory, since these libraries can be loaded once from disk, and then shared between multiple programs using them.
It makes maintenance easier, because when a library is updated, we don’t need to recompile all the programs using that library.

(If we want to be thorough, memory savings aren’t a result of dynamic libraries but rather of shared libraries. That being said, the two generally go together. On Linux, dynamic library files typically have the extension .so, which stands for shared object. On Windows, it’s .DLL, which stands for Dynamic-link library.)

Back to our story: by default, C programs are dynamically linked. (This is also the case for Go programs that are using certain packages.) Our specific program uses the standard C library, which on recent Linux systems will be in libc.so.6. So in order to run, our program needs that file to be present in the container image. And if we’re using scratch, that file is obviously absent.

(Same thing if we use busybox or alpine, because busybox doesn’t contain a standard library, and alpine is using another one, which is incompatible. We’ll talk more about that later.)

How do we solve this?

There are at least 3 options.

Building a static binary

We can tell our toolchain to make a static binary. There are various ways to achieve that (depending on how we build our program in the first place), but if we’re using gcc, all we have to do is add -static to the command line:

gcc -o hello hello.c -static

The resulting binary is now 760 kB (on my system) instead of 16 kB. Of course, we’re embedding the library in the binary, so it’s much bigger. But that binary will now run correctly in the scratch image.

We can get an even smaller image if we build a static binary with Alpine. We will talk more about Alpine in the next article; but just for information, the result would be less than 100 kB!
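Putting the -static flag together with a multi-stage build and scratch, a minimal sketch could look like the following. It assumes that hello.c is the program shown earlier, with #include <stdio.h> at the top (recent gcc releases refuse to compile a call to puts without it); the stage name build is arbitrary.

```dockerfile
FROM gcc AS build
WORKDIR /src
COPY hello.c .
# -static embeds the C library in the binary, so the run stage
# doesn't need to provide any shared libraries.
RUN gcc -o hello hello.c -static

FROM scratch
COPY --from=build /src/hello /hello
CMD ["/hello"]
```

The resulting image contains a single file, so its size is essentially the size of the static binary.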

Adding the libraries to our image

We can find out which libraries our program needs with the ldd tool:

$ ldd hello
    linux-vdso.so.1 (0x00007ffdf8acb000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007ff897ef6000)
    /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff8980f7000)

We can see the libraries needed by the program and their paths (as resolved by the linker).

In the example above, the only “real” library is libc.so.6. linux-vdso.so.1 is related to a mechanism called VDSO (virtual dynamic shared object), which accelerates some system calls. Let’s pretend it’s not there. As for ld-linux-x86-64.so.2, it’s actually the dynamic linker itself. (Technically, our hello binary contains information saying, “hey, this is a dynamic program, and the thing that knows how to put all its parts together is ld-linux-x86-64.so.2”.)

If we were so inclined, we could manually add all the files listed above by ldd to our image. It would be fairly tedious, and difficult to maintain, especially for programs with lots of dependencies. For our little hello world program, this would work fine. But for a more complex program, for instance something using DNS, we would run into another issue. The GNU C library (used on most Linux systems) implements DNS (and a few other things) through a fairly complex mechanism called the Name Service Switch (NSS in short). This mechanism needs a configuration file, /etc/nsswitch.conf, and additional libraries. But these libraries don’t show up with ldd, because they are loaded later, when the program is running. If we want DNS resolution to work correctly, we still need to include them! (These libraries are typically found at /lib64/libnss_*.)

I personally can’t recommend going that route, because it is quite arcane, difficult to maintain, and it might easily break in the future.

`busybox:glibc`

There is an image designed specifically to solve all these issues: busybox:glibc. It is a small image (5 MB) using busybox (so providing a lot of useful tools for troubleshooting and operations) and providing the GNU C library (or glibc). That image contains precisely all these pesky files that we were mentioning earlier. This is what we should use if we want to run a dynamic binary in a small image.

Keep in mind, however, that if our program uses additional libraries, those will need to be copied as well.
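For our C “hello world”, a sketch of this approach could look like the following. It assumes that hello.c includes <stdio.h>, and that the C library in busybox:glibc is recent enough for a binary built with the gcc image, which is plausible for such a simple program but not guaranteed in general.

```dockerfile
FROM gcc AS build
WORKDIR /src
COPY hello.c .
# Regular (dynamic) build: the binary expects libc.so.6 and the
# dynamic linker to be present at run time.
RUN gcc -o hello hello.c

# busybox:glibc provides the GNU C library, plus a shell and basic tools.
FROM busybox:glibc
COPY --from=build /src/hello /hello
CMD ["/hello"]
```

If the program needed other libraries, each of them would need its own COPY --from=build line pointing at the path reported by ldd.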

Recap and (partial) conclusion

Let’s see how we did for our “hello world” program in C. Spoiler alert: this list includes results obtained by leveraging Alpine in ways that will be described in the next part of this series.

Original image built with gcc: 1.14 GB
Multi-stage build with gcc and ubuntu: 64.2 MB
Static glibc binary in alpine: 6.5 MB
Dynamic binary in alpine: 5.6 MB
Static binary in scratch: 940 kB
Static musl binary in scratch: 94 kB

That’s a 12000x size reduction, or 99.99% less disk space (and network usage).

Not bad.

Personally, I wouldn’t go with the scratch images (because troubleshooting them might be, well, trouble) but if that’s what you’re after, they’re there for you!

In the second part, we will mention some aspects specific to the Go language, including cgo and tags. We will also cover other popular languages, and we will talk more about Alpine, because it’s pretty awesome if you ask me.

Containers, microservices, and service meshes

2019-05-17T00:00:00+00:00

There is a lot of material out there about service meshes, and this is another one. Yay! But why? Because I would like to give you the perspective of someone who wishes service meshes had existed 10 years ago, long before the rise of container platforms like Docker and Kubernetes. I’m not claiming that this perspective is better or worse than others, but since service meshes are rather complex beasts, I believe that a multiplicity of points of view can help to understand them better.

I will talk about the dotCloud platform, a platform that was built on 100+ microservices and which supported thousands of production applications running in containers; I will explain the challenges that were faced when building and running it; and how service meshes would (or wouldn’t) have helped.

dotCloud history

I’ve already written about the history of the dotCloud platform and some of its design choices, but I hadn’t talked much about its networking layer. If you don’t want to dive into my previous blog post about dotCloud, all you need to know is that it was a PaaS allowing customers to run a wide range of applications (Java, PHP, Python…) supported by a wide range of data services (MongoDB, MySQL, Redis…) and with a workflow similar to the one of Heroku: you would push your code to the platform, the platform would build container images, and deploy these container images.

I will tell you how traffic was routed on the dotCloud platform; not because it was particularly great or anything (I think it was okay for the time!) but primarily because it’s the kind of design that could be easily implemented with today’s tools by a modest team in a short amount of time, if they needed a way to route traffic between a bunch of microservices or a bunch of applications. So it will give us a good comparison point between “what we’d get if we hacked it ourselves” vs. “what we’d get if we used an existing service mesh”, aka the good old “build vs. buy” quandary.

Traffic routing for hosted applications

Applications deployed on dotCloud could expose HTTP and TCP endpoints.

HTTP endpoints were dynamically added to the configuration of a cluster of Hipache load balancers. This is similar to what we can achieve today with Kubernetes Ingress resources and a load balancer like Traefik.

Clients could connect to HTTP endpoints using their associated domain names, provided that the domain name would point to dotCloud’s load balancers. Nothing fancy here.

TCP endpoints were associated with a port number, which was then communicated to all the containers of that stack through environment variables.

Clients could connect to TCP endpoints using a specified host name (something like gateway-X.dotcloud.com) and that port number.

That host name would resolve to a cluster of “nats” servers (no relationship whatsoever with NATS) that would route incoming TCP connections to the right container (or, in the case of load-balanced services, to the right containers).

If you’re familiar with Kubernetes, this will probably remind you of NodePort services.

The dotCloud platform didn’t have the equivalent of ClusterIP services: for simplicity, services were accessed the same way from the inside and from the outside of the platform.

This was simple enough that the initial implementations of the HTTP and TCP routing meshes were probably a few hundred lines of Python each, using fairly simple (I’d dare say, naive) algorithms, but they evolved over time to handle the growth of the platform and additional requirements.

It didn’t require extensive refactoring of existing application code. Twelve-factor applications in particular could directly use the address information provided through environment variables.

How was it different from a modern service mesh?

Observability was limited. There were no metrics at all for the TCP routing mesh. As for the HTTP routing mesh, later versions provided detailed HTTP metrics, showing error codes and response times; but modern service meshes go above and beyond, and provide integration with metrics collection systems like Prometheus, for instance.

Observability is important not only from an operational perspective (to help us troubleshoot issues), but also to deliver features like safe blue/green deployment or canary deployments.

Routing efficiency was limited as well. In the dotCloud routing mesh, all traffic had to go through a cluster of dedicated routing nodes. This meant potentially crossing a few AZ (availability zones) boundaries, and significantly increasing the latency. I remember troubleshooting issues with some code that was making 100+ SQL requests to display a given page, and opening a new connection to the SQL server for each request. When running locally, the page would load instantly, but when running on dotCloud, it would take a few seconds, because each TCP connection (and subsequent SQL request) would need dozens of milliseconds to complete. In that specific case, using persistent connections did the trick.

Modern service meshes do better than that. First of all, by making sure that connections are routed at the source. The logical flow is still client → mesh → service, but now the mesh runs locally, instead of on remote nodes, so the client → mesh connection is a local one, hence very fast (microseconds instead of milliseconds).

Modern service meshes also implement smarter load-balancing algorithms. By monitoring the health of the backends, they can send more traffic on faster backends, resulting in better overall performance.

Security is also stronger with modern service meshes. The dotCloud routing mesh was running entirely on EC2 Classic, and didn’t encrypt traffic (on the assumption that if somebody manages to sniff network traffic on EC2, you have bigger problems anyway). Modern service meshes can transparently secure all our traffic, for instance with mutual TLS authentication and subsequent encryption.

Traffic routing for platform services

Alright, we’ve discussed how applications communicated, but what about the dotCloud platform itself?

The platform itself was composed of about 100 microservices, responsible for various functions. Some of these services accepted requests from others, and some of them were background workers that would connect to other services, but not receive connections on their own. Either way, each service needed to know the addresses of the endpoints it needed to connect to.

A lot of high-level services could use the routing mesh described above. In fact, a good chunk of the 100+ microservices of the dotCloud platform were deployed as normal applications on the dotCloud platform itself. But a small number of low-level services (specifically, the ones implementing that routing mesh) needed something simpler, with fewer dependencies (since they couldn’t depend on themselves to function; that’s the good old “chicken-and-egg” problem).

These low-level, essential platform services were deployed by starting containers directly on a few key nodes, instead of relying on the platform’s builder, scheduler, and runner services. If you want a comparison with modern container platforms, that would be like starting our control plane with docker run directly on our nodes, instead of having Kubernetes doing it for us. This was fairly similar to the concept of static pods used by kubeadm, or by bootkube when bootstrapping a self-hosted cluster.

These services were exposed in a very simple and crude way: there was a YAML file listing these services, mapping their names to their addresses; and every consumer of these services needed a copy of that YAML file as part of their deployment.

On the one hand, this was extremely robust, because it didn’t involve maintaining an external key/value store like Zookeeper (remember, etcd or Consul didn’t exist at that time). On the other hand, it made it difficult to move services around. Each time a service was moved, all its consumers would need to receive an updated YAML file (and potentially be restarted). Not very convenient!

The solution that we started to implement was to have every consumer connect to a local proxy. Instead of knowing the full address+port of a service, a consumer would only need to know its port number, and connect over localhost. The local proxy would handle that connection, and route it to the actual backend. Now when a backend needs to be moved to another machine, or scaled up or down, instead of updating all its consumers, we only need to update all these local proxies; and we don’t need to restart consumers anymore.

(There were also plans to encapsulate traffic in TLS connections, and have another proxy on the receiving side as well to unwrap TLS and verify certificates, without involving the receiving service, which would be set up to accept connections only on localhost. More on that later.)

This is quite similar to Airbnb’s SmartStack; with the notable difference that SmartStack was implemented and deployed to production, while dotCloud’s new internal routing mesh ended up being shelved when dotCloud pivoted to Docker. ☺

I personally consider SmartStack as one of the precursors of systems like Istio, Linkerd, Consul Connect … because all these systems follow that pattern:

run a proxy on each node
consumers connect to the proxy
control plane updates the proxy’s configuration when backends change
… profit!

Implementing a service mesh today

If we had to implement a similar mesh today, we could use similar principles. For instance, we could set up an internal DNS zone, mapping service names to addresses in the 127.0.0.0/8 space. Then run HAProxy on each node of our cluster, accepting connections on each service address (in that 127.0.0.0/8 subnet) and forwarding / load-balancing them to the appropriate backends. HAProxy configuration could be managed by confd, which lets us store backend information in etcd or Consul, and automatically push updated configuration to HAProxy when needed.
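As a rough illustration of the templating step, here is a sketch that renders HAProxy `listen` sections from a dictionary of services. It stands in for what confd would do with a template; the function name, the shape of the dictionary, the addresses, and the timeout values are all assumptions, not something taken from a real deployment.

```python
def render_haproxy(services):
    """Render HAProxy config from {name: {"bind": addr, "backends": [addr, ...]}}.

    Hypothetical helper: in the setup described above, confd would
    generate this file from keys stored in etcd or Consul.
    """
    lines = [
        "defaults",
        "    mode tcp",
        "    timeout connect 5s",
        "    timeout client 1m",
        "    timeout server 1m",
        "",
    ]
    for name in sorted(services):
        service = services[name]
        lines.append(f"listen {name}")
        lines.append(f"    bind {service['bind']}")
        for index, backend in enumerate(service["backends"], start=1):
            lines.append(f"    server {name}-{index} {backend} check")
        lines.append("")
    return "\n".join(lines)


config = render_haproxy({
    "redis": {
        "bind": "127.0.0.10:6379",  # address the internal DNS zone would return
        "backends": ["10.0.2.7:6379", "10.0.2.8:6379"],
    },
})
```

When the backend list changes, the file is rendered again and HAProxy is reloaded; consumers keep using the same 127.0.0.0/8 address.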

This is more or less how Istio works! But with a few differences:

it uses Envoy Proxy instead of HAProxy
it stores backend configuration using the Kubernetes API instead of etcd or Consul
services are allocated addresses in an internal subnet (Kubernetes ClusterIP addresses) instead of 127.0.0.0/8
it has an extra component (Citadel) to add mutual TLS authentication between client and servers
it adds support for new features like circuit breaking, distributed tracing, canary deployments …

Let’s quickly review some of these differences.

Envoy Proxy

Envoy Proxy was written by Lyft. It has many similarities with other proxies (like HAProxy, NGINX, Traefik…) but Lyft wrote it because they needed features that didn’t exist in these other proxies at the time, and it made more sense to build a new proxy than to extend an existing one.

Envoy can be used on its own. If I have a given service that needs to connect to other services, I can set it up to connect to Envoy instead, and then dynamically configure and reconfigure Envoy with the location of my other services, while getting a lot of nifty extra features, for instance in the domain of observability. Instead of using a custom client library, or peppering my code with tracing calls, I direct my traffic to Envoy and let it collect metrics for me.

But Envoy can also be used as the data plane for a service mesh. This means that Envoy will now be configured by the control plane of that service mesh.

Control plane

Speaking of the control plane: Istio relies on the Kubernetes API for that purpose. This is not very different from using confd. Confd relies on etcd or Consul to watch a set of keys in a data store. Istio relies on the Kubernetes API to watch a set of Kubernetes resources.

Aparté: I personally found it really helpful to read this Kubernetes API description that states:

The Kubernetes API server is a “dumb server” which offers storage, versioning, validation, update, and watch semantics on API resources.

End of aparté.

Istio was designed to work with Kubernetes; and if you want to use it outside of Kubernetes, you will need to run an instance of the Kubernetes API server (and a supporting etcd service).

Service addresses

Istio relies on Kubernetes’ allocation of ClusterIP addresses, so Istio services get an internal address (not in the 127.0.0.0/8 range).

On a Kubernetes cluster without Istio, traffic going to the ClusterIP address for a given service is intercepted by kube-proxy, and sent to a backend of that service. More specifically, if you like to nail down the technical details: kube-proxy sets up iptables rules (or IPVS load balancers, depending on how it was set up) to rewrite the destination IP addresses of connections going to the ClusterIP address.

Once Istio is installed on a Kubernetes cluster, nothing changes, until it gets explicitly enabled for a given consumer or even an entire namespace, by injecting a sidecar container into the consumer pods. The sidecar will run an instance of Envoy, and set up a number of iptables rules to intercept traffic going to the other services and redirect that traffic to Envoy.

Combined with Kubernetes DNS integration, this means that our code can connect to a service name, and everything “just works”. In other words, our code would issue a request to e.g. http://api/v1/users/4242, api would resolve to 10.97.105.48, an iptables rule would intercept connections to 10.97.105.48 and redirect them to the local Envoy proxy, and that local proxy would route the request to the actual API backend. Phew!

Extra bells and whistles

Istio can also provide end-to-end encryption and authentication through mTLS (mutual TLS) with a component named Citadel.

It also features Mixer, a component that Envoy can query for every single request, to make an ad-hoc decision about that request depending on various factors like request headers, backend load… (Don’t worry: there are abundant provisions to make sure that Mixer is highly available, and that even if it breaks, Envoy can continue to proxy traffic.)

And of course, we mentioned observability: Envoy collects a vast amount of metrics, while providing distributed tracing. In a microservices architecture, if a single API request has to go through microservices A, B, C, and D, distributed tracing will add a unique identifier to the request when it enters the system, and preserve that identifier across sub-requests to all these microservices, making it possible to gather all related calls, their latencies, etc.
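The identifier propagation can be sketched in a few lines. This only illustrates the principle, not how Envoy or Istio implement it; the header name and both helper functions are assumptions made for the example.

```python
import uuid

# Header name chosen for this sketch (an assumption).
TRACE_HEADER = "x-request-id"


def ensure_trace_id(headers):
    """At the edge: keep an existing identifier, or create a new one."""
    traced = dict(headers)
    traced.setdefault(TRACE_HEADER, str(uuid.uuid4()))
    return traced


def sub_request_headers(incoming, extra=None):
    """Headers for a call to another microservice, carrying the same identifier."""
    outgoing = dict(extra or {})
    outgoing[TRACE_HEADER] = incoming[TRACE_HEADER]
    return outgoing


edge = ensure_trace_id({"accept": "application/json"})  # request enters at A
to_b = sub_request_headers(edge)                         # A calls B
to_c = sub_request_headers(to_b, {"accept": "text/plain"})  # B calls C
```

Since every hop carries the same value, a tracing backend can group the calls to A, B, and C into one trace.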

Build vs. buy

Istio has the reputation of being complex. By contrast, building a routing mesh like the one that I described in the beginning of this post is relatively straightforward with the tools that we have today. So, does it make sense to build our own service mesh instead?

If we have modest needs (if we don’t need observability, circuit breaker, and other niceties) we might want to build our own. But if we’re using Kubernetes, we might not even need to, because Kubernetes already provides basic service discovery and load balancing.

Now, if we have advanced requirements, “buying” a service mesh can be a much better option. (It’s not always exactly “buying” since Istio is open source, but we still have to invest engineering time to understand how it works, deploy, and operate it.)

Istio vs. Linkerd vs. Consul Connect

So far, we only spoke about Istio, but it’s not the only service mesh out there. Linkerd is another popular option, and there is also Consul Connect.

Which one should we pick?

Honestly, I don’t know, and at this point, I don’t consider myself knowledgeable enough to help anyone make that decision. There are some interesting articles comparing them, and even benchmarks.

One approach that has a lot of potential is to use a tool like SuperGloo. SuperGloo offers an abstraction layer to simplify and unify the APIs exposed by service meshes. Instead of learning about the specific (and, in my opinion, relatively complex) APIs of various service meshes, we can use the simpler constructs offered by SuperGloo, and switch seamlessly from one service mesh to another. A little bit as if we had an intermediary configuration format describing HTTP frontends and backends, and able to generate actual configuration for NGINX, HAProxy, Traefik, Apache …

I’ve dabbled a bit in Istio using SuperGloo, and in a future blog post, I would like to illustrate how to add Istio or Linkerd to an existing cluster using SuperGloo, and whether the latter holds its promise, i.e. allowing me to switch from one routing mesh to another without rewriting configurations.

If you enjoyed that post and would like me to try out some specific scenarios, I’d love to hear from you!

Recording video tutorials with (almost) zero budget

2019-03-28T00:00:00+00:00

I’ve just published a series of videos of a one-day Kubernetes tutorial that I recently delivered in London. I would like to share the method and tools that I used, because although the result is far from perfect, I believe it can be useful for other speakers who want to share their work with a wide audience without a huge investment (in time and equipment).

What are we talking about?

I regularly deliver workshops, tutorials, and other training sessions. The main topics are containers and Kubernetes. Sometimes it is a half-day or full-day workshop at a conference; sometimes a longer tutorial; I also deliver public and private training for various companies.

Speaking of which … Here is a message from our sponsor (i.e. myself)!

In April, I will deliver three training sessions in Paris (in French). There will be getting started with containers, deploying apps with Kubernetes, and Kubernetes administration and operations. French is not your thing? I got you covered with Kubernetes for administrators and operators, a two-day tutorial in June, at the O’Reilly Velocity conference in San Jose (CA). If you know someone who might be interested … I’d love if you could let them know! Thanks ♥

But in-person training doesn’t scale, and I’ve always wanted to reach a wider audience. A lot of high-quality courses are now available online through various platforms. Producing such a course is a lot of work; and for now, I (unfortunately) don’t have the resources to do that.

However, I thought that it should be easier to do a live recording of a workshop, and then make the recording available online. The result wouldn’t be as good as a real online course, but it would be better than nothing (and it would get me one step in the right direction if I ever decide to make such a course after all).

First attempts

When I was working at Docker Inc., I started recording the workshops I delivered at conferences. To keep things simple, I decided that I would just do a screen recording. Of course, having a camera is better (it’s more engaging to see the speaker) but it’s also way more complex.

When using a Mac, I used Quicktime in “screen recording” mode; when using a Linux machine, I used vokoscreen. I would stop the recording at each break (for coffee and lunch) and start it again before resuming. As a result, at the end of a one-day workshop, I would typically have 4 files, each about 90 minutes long.

These files were a good start, and they were pretty helpful for me to improve my workshops. I don’t know how it is for other speakers, but for me, during the workshop, I always feel like there are a thousand little things that I want to improve (for instance, in the slides) but it’s impossible to take good notes while delivering the workshop at the same time. The video helped me a lot with that.

However, I thought that nobody would want to sit through a 90-minute video. It’s too long. People probably want to know what’s in the video, and they want to go straight to the part that interests them.

So I wrote a Python script called decoup to help me slice and dice these video files. It works as follows:

first, I watch the video and write down the start/stop times of each section that I want to isolate, as well as the name of that section;
then, I run the script, which uses ffmpeg to do the actual cuts, and spits out a number of separate short files.

I use MPlayer to zoom through the video content and write down the start/stop times. It’s pretty efficient, and it typically takes me a few hours to go through one day of content and break it down in sections of about 5 minutes. (The shorter the sections, the more breaks you make, the longer it takes to write down the timestamps.)
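For illustration, here is a minimal sketch of the cutting step. It is not the actual decoup script: the function name, the output naming scheme, and the choice of ffmpeg options are assumptions. It turns a list of (start, stop, name) entries into one ffmpeg command per section.

```python
def cut_commands(source, sections):
    """Build one ffmpeg command per (start, stop, name) section.

    Hypothetical helper, not taken from decoup.
    """
    commands = []
    for index, (start, stop, name) in enumerate(sections, start=1):
        output = f"{index:02d}-{name}.mp4"
        commands.append([
            "ffmpeg", "-i", source,
            "-ss", start, "-to", stop,  # keep only this time range
            "-c", "copy",               # copy the streams without re-encoding
            output,
        ])
    return commands


commands = cut_commands("part1.mp4", [
    ("00:00:10", "00:05:30", "intro"),
    ("00:05:30", "00:11:00", "first-demo"),
])
```

Each command can then be passed to `subprocess.run(command, check=True)`. Copying streams is fast, but cuts then land on keyframes, so they may be off by a second or so; re-encoding gives exact cuts at the cost of time.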

If you want to see details about that process, you can check the decoup repository on GitHub.

After getting a bunch of short video files, I upload them to YouTube, and put them all in a playlist.

First results

Here is the result for a Docker Orchestration Workshop that I delivered in December 2016.

It was a good start! But the sound wasn’t great. I was recording using my laptop’s built-in microphone, so the sound would go up and down when I moved around the podium; and when I typed on the keyboard, the keystrokes were really loud. A lot of people brought that up, and I have to admit that it can quickly get on your nerves; even more so when you listen with headphones.

So, I wanted to improve the sound quality.

Improving the sound quality

Spoiler alert: I tried a number of microphones. (No, not the Propellerheads song😎)

The Blue Yeti is a really nice USB mic

I asked around what people were using to record podcasts and similar things, and the Blue Yeti was suggested to me. I got one, and I recorded myself delivering a very short segment, featuring slides and demos (and therefore, some fast keyboard action). I compared the sound obtained with the internal microphone of a Macbook Air 12, the internal microphone of a Thinkpad T440s, and the Blue Yeti. The Blue Yeti has various modes (mono/omni directional, etc.); I tried them all.

Alas, this microphone didn’t help to isolate the sound of my keyboard. Don’t get me wrong: this microphone is amazing. At some point, I set it to stereo and recorded myself walking around the room while talking; and when I played back the recording with my headphones, I could locate myself in space, and it was able to capture faint remote sounds that I hadn’t otherwise noticed. Really impressive! But it also captured my keyboard really well, unfortunately.

Hiring a pro

In September 2018, I delivered a bunch of Kubernetes training sessions with Enix SAS and we hired a pro to capture one session. He also interviewed some of the students.

We had a high-quality camera filming both speakers (there was me, but also Alexandre Buisine), wireless lapel mics, and I was also recording my screen like before.

The videos that we got out of this are of very high quality. Here are just a couple of examples. They are in French, but it will give you an idea of the result:

a promo video to show the training venue and atmosphere;
an explanation about declarative and imperative models.

The result is definitely worth it, but it’s a lot of work: you need an extra person during the workshop to film, and then it’s many, many, many hours of work after the workshop to produce the videos.

So I wanted to find something that I could do and re-do without having to hire a pro each time.

Multiple presenters

Quick aparté: delivering with a co-speaker can make things really tricky if each speaker presents with their own laptop. Now we need the recording from both computers; and if a speaker can intervene while the other is presenting, capturing their voice is another added challenge.

I asked the best A/V tech I know, Joe Laha, for advice. Joe has done A/V for countless conferences and tech events, including recording all the sessions from multiple editions of DevopsDays Minneapolis. Alas, his verdict was loud and clear: if I want to record multiple HDMI sources (and multiple audio inputs) reliably, I need equipment that is (a) expensive (b) bulky. (OK, to be fair, it’s not that bulky, but bigger than I want to fit in my suitcase when traveling.)

Of course, I should have listened to the pro. But I wanted to see for myself, so I bought a tiny, cheap HDMI recorder. Honestly, it’s a nice little gadget, especially for that price. I connected it between my laptop and the videoprojector, inserted a USB key, and voilà, it records my HDMI output.

I thought that I could combine it with a cheap HDMI switcher, and that would give me a way to record two presenters.

Problem: sometimes, the recording would stop. It’s not completely random; I think it happens when the output device (the videoprojector) shuts down. And I think that the projector shuts down when my computer screen saver is on for too long. The recorder has an LED indicating when it is recording, but it’s easy to forget about it.

And, it still doesn’t solve the annoying keyboard noise.

Get more microphones

We did a brief interview with Bret Fisher at a conference, and he used a couple of lapel mics connected to his phone. I thought it was a good idea, so I ordered a pair of cheap lapel mics and gave them a try.

Good news: with these, the noise of the keyboard is almost gone!

There are some downsides, though.

Wires. These are wired mics, meaning that I have to remove or unplug them each time I want to walk away from the podium (during the breaks, for instance). I found that it was only a minor inconvenience. However …

No signal indicator. Obviously, these are just simple mics, so they don’t have an LED or vu-meter indicating the strength of the signal. This caused me two problems. One time, when coming back from the break, I didn’t plug the mic correctly (the plug wasn’t all the way in). As a result, on the corresponding video segment, there is no sound. Oops. Second problem, since there is no vu-meter, it’s hard to know if you’re recording at a correct level. On some videos, my voice is clearly too loud and saturates the input. It’s not horrible, but it could have been easily avoided by doing a quick check with a program like pavucontrol or something equivalent.

Hum. This problem doesn’t come from the mics themselves, but rather from the mic input on the laptop. On most laptops, these inputs are not properly isolated. As a result, the recording has a 50 Hz (60 Hz in the US) low frequency hum. Unfortunately, disconnecting the laptop AC power didn’t help; it turns out that each time I got a hum, it came from the HDMI, and since the HDMI goes to the projector, disconnecting it is not really an option!

(Note: when my hands are resting on the keyboard’s palm rest, the hum disappears almost entirely. So perhaps I could work something out with e.g. an ESD bracelet?)

Removing the hum

I thought it should be possible to filter out the hum, since it always has the same level, is always in the same frequency bands …

There are a couple of noise filters in recent versions of ffmpeg, but they are not documented properly (or, if you prefer, I was too stupid to understand the docs) and I wasn’t able to get them to work.

However, sox has much better documentation, and I was able to use it to automatically process all my video files.

Here are the steps if you’re interested:

1. Using the “decoup” script mentioned above, isolate a few seconds of noise (i.e. a moment when I don’t speak, and nobody speaks, and there is just the loud BZZZZ sound). Let’s say this is noise.mp4.

2. Extract the sound track from that file:

```
# This generates noise.wav
ffmpeg -i noise.mp4 -vn noise.wav
```

3. Generate a “noise profile” from that file:

```
# This generates noise.prof
sox noise.wav -n noiseprof noise.prof
```

4. Extract the sound track that I want to process:

```
# This generates video.wav
ffmpeg -i video.mp4 -vn video.wav
```

5. Process it with the noise reduction filter:

```
# This generates filtered.wav
sox video.wav filtered.wav noisered noise.prof
```

6. Merge back the filtered audio track with the video:

```
# This generates video.avi
ffmpeg -i video.mp4 -i filtered.wav \
       -vcodec copy -acodec copy \
       -map 0:v:0 -map 1:a:0 \
       video.avi
```

7. Delete the temporary files:

```
rm video.wav filtered.wav
```

8. Repeat steps 4-7 for all the other files to process.
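Since steps 4-7 are identical for every file, they are easy to script. Here is a minimal sketch that builds the same ffmpeg, sox, and rm commands for a given video. The function name and the intermediate file names are my own choices for this example; the noise profile is assumed to have been generated once, as described above.

```python
from pathlib import Path


def denoise_commands(video, noise_profile="noise.prof"):
    """Commands for steps 4-7, applied to one video file.

    Hypothetical helper; outputs are named after the input file and
    written to the current directory.
    """
    stem = Path(video).stem
    wav = f"{stem}.wav"
    filtered = f"{stem}.filtered.wav"
    output = f"{stem}.avi"
    return [
        # Step 4: extract the sound track.
        ["ffmpeg", "-i", video, "-vn", wav],
        # Step 5: apply the noise reduction filter.
        ["sox", wav, filtered, "noisered", noise_profile],
        # Step 6: merge the filtered audio with the original video.
        ["ffmpeg", "-i", video, "-i", filtered,
         "-vcodec", "copy", "-acodec", "copy",
         "-map", "0:v:0", "-map", "1:a:0", output],
        # Step 7: delete the temporary files.
        ["rm", wav, filtered],
    ]


commands = denoise_commands("talk-01.mp4")
```

To process a whole directory, loop over the files and run each command with `subprocess.run(command, check=True)`, so that the loop stops at the first failure instead of deleting files after a failed merge.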

I work with .wav files because sox cannot work directly with compressed audio (at least, not with the audio format that I have). At the end, I generate a .avi file because it’s a flexible container (it can hold the codec from the .wav file, whereas a .mp4 file wouldn’t be able to).

There is no real need to recompress the audio, since I will upload the file to YouTube, and YouTube will recompress it anyway.

Upload to YouTube

The most painful part of the whole process is the upload. I couldn’t find an easy way to sort the videos in a playlist by name. I had scripted it a while ago (using Google Spreadsheets, sic!) but I couldn’t find the script this time. So I had to drag all the videos to the right place, one by one.

Ideally, I would also need to edit descriptions and titles en masse, and this doesn’t seem to be possible. I saw a few products that will do it for $$$. I might end up buying one of these, but I would prefer something that I can script easily.

Next steps

My friend Sébastien Wacquiez (who helped a lot with the logistics for our training sessions in Paris) strongly recommended that I use high-quality, wireless mics. I agree that it would be nice, but when I deliver a workshop by myself (without anyone to help me), I don’t have much time during the breaks, so I’m not even sure that I would have the time to change the batteries.

I’m considering getting a USB lapel mic (this should get rid of the hum, hopefully), or a nice USB audio interface. The latter would hopefully have vu-meters (making sure that I don’t record silence!), and while it sounds a bit overkill, I also do some music recording and mixing once in a while, so it could serve these purposes as well.

Another option (that I will almost certainly do!) is to display a small vu-meter in a corner of the screen. That would hopefully help me to realize immediately when the recording level is too high, or when something is not plugged properly.
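The check behind such a vu-meter is simple. Here is a minimal sketch, assuming 16-bit signed PCM samples (for instance read from a short test recording with Python’s `wave` module); the function names and both thresholds are arbitrary choices for this example.

```python
import math

FULL_SCALE = 32768  # magnitude of the largest 16-bit sample


def peak_dbfs(samples):
    """Peak level of 16-bit samples, in dB relative to full scale."""
    peak = max((abs(sample) for sample in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / FULL_SCALE)


def diagnose(samples, silence_db=-60.0, clip_db=-1.0):
    """Classify a chunk of audio; the thresholds are assumptions."""
    level = peak_dbfs(samples)
    if level <= silence_db:
        return "silent"    # mic unplugged or not all the way in?
    if level >= clip_db:
        return "too loud"  # the input is about to saturate
    return "ok"


unplugged = diagnose([0] * 1000)
saturated = diagnose([32767, -32768, 32767])
normal = diagnose([3000, -2500, 1200])
```

Running this on a few seconds of audio before each session (and after each break) would have caught both the unplugged mic and the saturated input.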

I hope that the end result (this Kubernetes workshop video recording) is helpful to many people who want to learn about Kubernetes. And if you like that kind of content and want it delivered to your team or organization, I can totally make that happen!

This is the perfect opportunity to bring up the training sessions that we’re organizing in the coming months!

There will be three sessions in French in April in Paris, and one in English in June in San Jose (CA).

Bien démarrer avec les containers, April 15-16th, Paris
Déployer ses applications avec Kubernetes, April 23-24th, Paris
Opérer et administrer Kubernetes, April 26th, Paris
Kubernetes for administrators and operators, June 10-11th, San Jose (CA)

I can also deliver private training, customized to your team. Please get in touch if you’re interested!

If you wonder what these training sessions look like, our slides and other materials are publicly available on http://container.training/. You will also find a few videos taken during previous sessions and workshops. This will help you to figure out if this content is what you need.

Running Kubernetes without nodes

2019-02-13T00:00:00+00:00

Capacity planning with Kubernetes is a non-trivial challenge. How many nodes should we deploy? What should be their size? When should we add or remove nodes to accommodate variations in load? One solution is to not deploy nodes, and provision resources on-demand instead. Let’s see how to do that.

What we’re trying to solve

When we deploy a Kubernetes cluster, we need to provision a given number of nodes to run our container workloads. If we provision too many nodes, we’re wasting money, because a lot of that capacity won’t be used. If we don’t provision enough nodes, our workloads won’t run. (Our pods will remain in Pending state until there is available capacity.)

We also need to pick the right size for our nodes. This is another opportunity to waste resources! If we provision smaller nodes, there could be some unused resources on each node. Imagine what happens if we deploy containers needing 10 GB of RAM, on nodes that have 16 GB of RAM: then we waste 6 GB of RAM per node! It would be much more efficient to use nodes with 32 GB of RAM. But, conversely, bigger nodes mean more unused resources when we’re not using all the capacity. Having nodes with 32 GB of RAM, but just a few small containers on them, isn’t a very good use of our cloud budget.
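The arithmetic is easy to check with a few lines of code. This sketch assumes identical containers and ignores the memory used by the system and by Kubernetes itself; the function name is hypothetical.

```python
def wasted_ram(node_gb, container_gb):
    """RAM left unused on a node packed with identical containers."""
    containers_per_node = node_gb // container_gb
    return node_gb - containers_per_node * container_gb


small_nodes = wasted_ram(16, 10)  # one container fits per node
big_nodes = wasted_ram(32, 10)    # three containers fit per node
```

With 16 GB nodes, 6 GB out of 16 (37.5%) is wasted; with 32 GB nodes, it is 2 GB out of 32 (6.25%).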

Finally, we need to pick the right type of node. This will sound obvious, but if our workloads are RAM-intensive or CPU-intensive, we need to pick nodes with more RAM or CPU respectively. Otherwise, we end up with more unused resources, and wasted money.

What about cluster auto scaling?

One approach is to automatically add nodes when we are at capacity. Doing this is easy. Doing it properly requires more care. If the auto scaling logic lives in a pod on your cluster, what happens when that pod gets evicted, but cannot be rescheduled because the cluster is out of capacity?

There are solutions to that problem, for instance:

using priorities or tolerations to make sure that this critical component can always run,
using a mechanism provided by the cloud infrastructure (for example, on AWS, a Lambda that would poll some Kubernetes metric and adjust the size of an Auto Scaling Group) …

But as we can see, this can get tricky. In particular, it’s tricky to test for these failure modes.

And how do we scale down? When we have e.g. 10 nodes, each using less than 50% of their capacity, we should be able to pack everything on 5 nodes and cut our costs in half. But Kubernetes will not repack pods for us. There are tools out there to achieve the opposite thing (i.e. rebalance pods after scaling up) but I’m not aware of tools to help us to scale down clusters. (If you know of such tools, let me know, I’ll be happy to reference them here!)
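To estimate how many nodes we would need after repacking, we can use a classic bin packing heuristic. The sketch below uses “first-fit decreasing” on a single resource; it is not something Kubernetes does for us, and the function name and the units are assumptions for this example.

```python
def nodes_needed(pod_requests, node_capacity):
    """Estimate the number of nodes needed, using first-fit decreasing."""
    free = []  # remaining capacity on each node opened so far
    for request in sorted(pod_requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"a pod needing {request} fits on no node")
        for index, available in enumerate(free):
            if request <= available:
                free[index] = available - request
                break
        else:
            # No existing node has room: open a new one.
            free.append(node_capacity - request)
    return len(free)


# 10 pods, each using 40% of a node, currently spread over 10 nodes.
repacked = nodes_needed([4] * 10, node_capacity=10)
```

Here, the 10 pods fit on 5 nodes. A real tool would also have to consider CPU, affinity rules, and the disruption caused by moving pods around.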

Conclusion: cluster auto scaling is great to accommodate more capacity; but less great to save resources.

Can we just not run nodes?

There are at least two promising services out there, which allow us to run containers directly, without running servers.

AWS Fargate proposes to “run containers without managing servers or clusters”.
Azure Container Instances proposes to “easily run containers on Azure without managing servers”.

(There are other similar services out there; if you think I should include your favorite one, let me know, I’ll be happy to add it to the list!)

How does this work?

These services will provision containers directly on some infrastructure managed by the cloud provider. We are billed for the resource usage of these containers, without paying for the overhead of the Kubernetes nodes.

This sounds great! With two caveats.

First, resource usage is more expensive. This is absolutely normal: we pay for the convenience of not running and maintaining our servers, and not wasting extra capacity. I did some back-of-the-envelope calculations, and found that Fargate would be significantly more expensive than EC2 if you do an apples-to-apples comparison (e.g., pick an EC2 instance size and match it to Fargate) but that Fargate would be cheaper if you try to run containers that are just a bit bigger than a given instance size (because then you have to pick a much bigger instance, and end up wasting money).

The second challenge is that Fargate is primarily designed to work with ECS. ECS is Amazon’s container service, and it is not Kubernetes.

Enter Virtual Kubelet.

Virtual Kubelet

Kubelet is the name of the Kubernetes agent that runs on every node of our cluster. When a node boots up, Kubelet is started. It connects to the Kubernetes API server, and it says (more or less) “Hi there, my name is node752. I have that many cores, that much RAM and disk space. Do you per chance have any pod that I should run?” and after that, it waits for instructions from the Kubernetes API server. The Kubernetes API server registers the node in etcd. From that point, the scheduler knows about the node, and will be able to assign pods to it. When a pod gets assigned to the node, the pod’s manifest is pushed to the node, and the node runs it. Later on, the Kubelet will keep updating the API server with the node’s status.

Virtual Kubelet is a program that uses the same API as Kubelet. It connects to the API server, introduces itself, and announces that it can run pods. Except, when it is assigned a pod, instead of creating containers (with Docker, CRI-O, containerd, or what have you), Virtual Kubelet will defer that work to a provider like Fargate or ACI.

So Virtual Kubelet looks like a regular cluster node (it shows up in the output of kubectl get nodes) except that it doesn’t correspond to an actual node. Anything scheduled on Virtual Kubelet will run on its configured provider.
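Here is a toy model of that idea. None of these classes correspond to real Kubernetes or Virtual Kubelet code; they are hypothetical stand-ins that only show how two very different implementations can sit behind the same “node” interface.

```python
class ApiServer:
    """Toy stand-in for the API server and scheduler."""

    def __init__(self):
        self.nodes = {}

    def register(self, node):
        self.nodes[node.name] = node

    def schedule(self, pod, node_name):
        return self.nodes[node_name].run_pod(pod)


class Kubelet:
    """A regular node: runs pods with its local container engine."""

    def __init__(self, name):
        self.name = name

    def run_pod(self, pod):
        return f"{self.name}: started {pod} with the local container engine"


class VirtualKubelet:
    """Looks like a node, but defers the work to a provider."""

    def __init__(self, name, provider):
        self.name = name
        self.provider = provider  # callable that runs the pod elsewhere

    def run_pod(self, pod):
        return f"{self.name}: {self.provider(pod)}"


api = ApiServer()
api.register(Kubelet("node752"))
api.register(VirtualKubelet("virtual-1", lambda pod: f"asked the provider to run {pod}"))

on_real_node = api.schedule("redis", "node752")
on_virtual_node = api.schedule("redis", "virtual-1")
```

From the point of view of the API server, both are just nodes that accept pods.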

Virtual Kubelet is not ready for production (yet). The GitHub page says:

Please note this software is experimental and should not be used for anything resembling a production workload.

But it is under active development by many contributors and I wouldn’t be surprised if it reached a more mature status soon, at least for Azure workloads.

It turns out that there is another solution out there that lets us provision resources on the fly for our pods.

Kiyot

Kiyot is a product by Elotl Inc. which implements the CRI (the Container Runtime Interface). The CRI is the interface between Kubelet and our specific container engine. For instance, CRI-O and containerd implement the CRI.

Kiyot looks like a container engine, but when asked to run a container, it will provision a cloud instance and run the container in it. It also deals with pods, i.e. containers sharing the same network namespaces, volumes, etc.: it runs all the containers of a pod within the same cloud instance.

(Implementation detail: the heavy lifting is actually done by Milpa, another product from Elotl; Kiyot is the CRI shim between Kubelet and Milpa.)

I was given the opportunity to try the beta version of Milpa and Kiyot, so I did! And I found it remarkably easy to set up and operate. Of course, there are some scenarios when it doesn’t behave exactly like a normal Kubernetes node, but that’s expected (more on that later).

The installation was straightforward. You can run Kiyot as a standalone process in an existing cluster, but the beta came with an installer based on kops, and all I had to do was:

set 3 environment variables (to provide my AWS credentials and indicate which region I wanted to use)
run a provisioning script
wait 5-10 minutes

… and at that point, I could run:

$ kubectl get nodes -o wide
NAME      STATUS  ROLES   AGE  VERSION  …  CONTAINER-RUNTIME
ip-172-…  Ready   master  5m   v1.10.7  …  docker://17.3.2
ip-172-…  Ready   node    1m   v1.10.7  …  kiyot://1.0.0

I see a node using the Kiyot container runtime. Whenever a pod is scheduled to that node, Kiyot will provision a virtual machine for it, and run the pod in the virtual machine.

I wanted to try a real-world workload on Kiyot. My main job these days is to deliver Kubernetes training. I have a bunch of labs and exercises that I use during my training sessions. I thought that running all these labs and exercises on my brand new Kiyot-powered cluster would be a good experiment. I was positively impressed by the results.

The screenshot above shows my AWS console after deploying one of my demo apps on the cluster. Each t3.nano instance corresponds to a pod on the cluster. My demo app is started with multiple kubectl run commands. When we use kubectl run to create resources on Kubernetes, they automatically get a run label matching the resource name. So if we do kubectl run redis --image=redis, we create a deployment named redis, with a label run=redis, and all the resources created by this deployment (specifically, pods and replica sets) will also have this label run=redis. Kiyot propagates these labels, materializing them as regular EC2 tags, which we can then display in the console. Long story short: the run column above shows the run label of our Kubernetes pods (and we can show any Kubernetes label in the console).
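I don’t know how Kiyot implements this propagation, but the mapping itself is straightforward. Here is a sketch (with a hypothetical function name) that turns a dictionary of Kubernetes labels into the list of Key/Value pairs that the EC2 API uses for tags.

```python
def labels_to_ec2_tags(labels):
    """Convert Kubernetes labels to a list of EC2-style Key/Value tags."""
    return [
        {"Key": key, "Value": value}
        for key, value in sorted(labels.items())
    ]


tags = labels_to_ec2_tags({"run": "redis", "pod-template-hash": "12345"})
```

A pod created with kubectl run redis would then give us a run tag with the value redis, which is what the console column displays.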

Implementation details

Each Milpa “cell” (that’s the name for the VMs running pods) runs a very lightweight REST API. I do not have shell access to the cells, but I am guessing that the “cells” are running a very trimmed Linux distribution. (Perhaps Alpine, perhaps even just a barebones kernel + initrd.) In fact, since each pod runs in its own VM, the cells wouldn’t even need a full-blown container engine. This means that the overhead for each pod is minimal. This is a pretty big deal, because on a “normal” Kubernetes node, there is a significant amount of resources used by Kubelet (and other essential Kubernetes components). I know that it’s possible to use tiny machines (like Raspberry Pis) as Kubernetes nodes, but usually, I do not recommend using machines with less than 4 GB of RAM as Kubernetes nodes. With the approach used by Milpa, tiny nodes (with 512 MB of RAM) work just fine.

Setting each pod in its own EC2 instance also simplifies the network setup a lot. Kubernetes networking can be complex, especially in cloud environments. We need overlay networks and/or a way to distribute routes and/or custom mechanisms (like the ENI plugin for AWS). With Milpa and Kiyot, the IP address of a pod is just the IP address of the underlying EC2 instance. We don’t need to map ports, encapsulate traffic, distribute routes, etc., everything is managed by the AWS network fabric, like for normal EC2 instances (since pods are normal EC2 instances).

The most noticeable difference is that it takes a bit longer to start a pod, since it involves provisioning an EC2 instance. In my experience, it took less than a minute for the pod to come up. That’s pretty good, since it includes instance provisioning, booting, pulling the image, and starting it.

Conclusions

Both Virtual Kubelet and Kiyot let us run Kubernetes workloads without provisioning Kubernetes nodes. Virtual Kubelet runs Kubernetes pods through a “container-as-a-service” provider, while Kiyot creates regular cloud instances for our pods.

In both cases, we pay for what we use, instead of provisioning extra capacity that we don’t use. Depending on your workloads, Kiyot can also be significantly cheaper, since it uses normal instances (instead of Fargate or ACI, which come at a premium).

In both cases, we benefit from additional security. (For instance, when using Kiyot, each pod runs in its own virtual machine.)

If you run large Kubernetes clusters (or, to put things differently: if your Kubernetes clusters incur non-trivial infrastructure bills!), I definitely recommend that you check out Virtual Kubelet and/or contact Elotl to get a free trial of Milpa and Kiyot.

Update: there is now a Community Edition available for Milpa and Kiyot, suitable for smaller deployments.

Using Compose to go from Docker to Kubernetes (1/2)

2019-01-22T00:00:00+00:00

For anyone using containers, Docker is a wonderful development platform, and Kubernetes is an equally wonderful production platform. But how do we go from one to the other? Specifically, if we use Compose to describe our development environment, how do we transform our Compose files into Kubernetes resources?

This is a translation of an article initially published in French. So feel free to read the French version if you prefer!

Before we dive in, I’d like to offer a bit of advertising space to the primary sponsor of this blog, i.e. myself: ☺

In April, I will deliver three training sessions in Paris (in French): getting started with containers, deploying apps with Kubernetes, and Kubernetes administration and operations. French is not your thing? I got you covered with Kubernetes for administrators and operators, a two-day tutorial in June, at the O’Reilly Velocity conference in San Jose (CA). If you know someone who might be interested … I’d love it if you could let them know! Thanks ♥

What are we trying to solve?

When getting started with containers, I usually suggest following this plan:

write a Dockerfile for one service, i.e. one component of your application, so that this service can run in a container;
run the other services of that app in containers as well, by writing more Dockerfiles or using pre-built images;
write a Compose file for the entire app;
… stop.

When you reach this stage, you’re already leveraging containers and benefiting from the work you’ve done so far, because at this point, anyone (with Docker installed on their machine) can build and run the app with just three commands:

git clone ...
cd ...
docker-compose up

Then, we can add a bunch of extra stuff: continuous integration (CI), continuous deployment (CD) to pre-production …

And then, one day, we want to go to production with these containers. And, within many organizations, “production with containers” means Kubernetes. Sure, we could debate about the respective merits of Mesos, Nomad, Swarm, etc., but here, I want to pretend that we chose Kubernetes (or that someone chose it for us), for better or for worse.

So here we are! How do we get from our Compose files to Kubernetes resources?

At first, it looks like this should be easy: Compose is using YAML files, and so is Kubernetes.

Original image by Jake Likes Onions, remixed by @bibryam.

There is just one thing: the YAML files used by Compose and the ones used by Kubernetes have nothing in common (except both being YAML). Even worse: some concepts have totally different meanings! For instance, when using Docker Compose, a service is a set of identical containers (sometimes placed behind a load balancer), whereas with Kubernetes, a service is a way to access a bunch of resources (for instance, containers) that don’t have a stable network address. When there are multiple resources behind a single service, that service then acts as a load balancer. Yes, these different definitions are confusing; yes, I wish the authors of Compose and Kubernetes had been able to agree on a common lingo; but meanwhile, we have to deal with it.
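To make the difference concrete, here is a minimal sketch of a Kubernetes service. The name web, the label app: web, and the port numbers are made up for illustration. Notice that it doesn’t describe any container: it only says how to reach whichever pods carry the matching label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical name
spec:
  type: ClusterIP
  selector:
    app: web             # hypothetical label; traffic goes to the pods that carry it
  ports:
  - port: 80             # port exposed by the service
    targetPort: 8080     # port the containers listen on (assumption)
    protocol: TCP
```

The containers themselves would be described elsewhere (typically in a Deployment), which is roughly the part that a Compose service covers.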

Since we can’t wave a magic wand to translate our YAML files, what should we do?

I’m going to describe three methods, each with its own pros and cons.

100% Docker

If we’re using a recent version of Docker Desktop (Docker Windows or Docker Mac), we can deploy a Compose file on Kubernetes with the following method:

In Docker Desktop’s preferences panel, select “Kubernetes” as our orchestrator. (If it was set to “Swarm” before, this might take a minute or two so that the Kubernetes components can start.)

Deploy our app with the following command:

docker stack deploy --compose-file docker-compose.yaml myniceapp

That’s all, folks!

In simple scenarios, this will work out of the box: Docker translates the Compose file into Kubernetes resources (Deployment, Service, etc.) and we won’t have to maintain extra files.

But there is a catch: this will run the app on the Kubernetes cluster running within Docker Desktop on our machine. How can we change that, so that the app runs on a production Kubernetes cluster?

If we’re using Docker Enterprise Edition, there is an easy solution: UCP (Universal Control Plane) can do the same thing, but while targeting a Docker EE cluster. As a reminder, Docker EE can run on the same cluster, side-by-side, applications managed by Kubernetes, and applications managed by Swarm. When we deploy an app by providing a Compose file, we pick which orchestrator we want to use, and that’s it.

(The UCP documentation explains this more in depth. We can also read this article on the Docker blog.)

This method is fantastic if we’re already using Docker Enterprise Edition (or plan to), because in addition to being the simplest option, it’s also the most robust, since we’ll benefit from Docker Inc’s support if needed.

Alright, but for the rest of us who do not use Docker EE, what do we do?

Use some tools

There are a few tools out there to translate a Compose file into Kubernetes resources. Let’s spend some time on Kompose, because it’s (in my humble opinion) the most complete at the moment, and the one with the best documentation.

We can use Kompose in two different ways: by working directly with our Compose files, or by translating them into Kubernetes YAML files. In the latter case, we deploy these files with kubectl, the Kubernetes CLI. (Technically, we don’t have to use the CLI; we could use these YAML files with other tools like WeaveWorks Flux or Gitkube, but let’s keep this simple!)

If we opt to work directly with our Compose files, all we have to do is use kompose instead of docker-compose for most commands. In practice, we’ll start our app with kompose up (instead of docker-compose up), for instance.

This method is particularly suitable if we’re working with a large number of apps, for which we already have a bunch of Compose files, and we don’t want to maintain a second set of files. It’s also suitable if our Compose files evolve quickly, and we want to maintain parity between our Compose files and our Kubernetes files.

However, sometimes, the translation produced by Kompose will be imperfect, or even outright broken. For instance, if we are using local volumes (docker run -v /path/to/data:/data ...), we need to find another way to bring these files into our containers once they run on Kubernetes. (By using Persistent Volumes, for instance.) Sometimes, we might want to adapt the application architecture: for instance, to ensure that the web server and the app server are running together, within the same pod, instead of being two distinct entities.
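As an illustration of that last point, here is a sketch of a Deployment whose pods run a web server and an app server side by side. All names, images, and ports below are placeholders, not something produced by Kompose. Containers in the same pod share a network namespace, so the web server can reach the app server on localhost.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: web
        image: nginx                      # placeholder web server image
        ports:
        - containerPort: 80
      - name: app
        image: example/app-server:latest  # placeholder app server image
        ports:
        - containerPort: 9000             # reachable from "web" as localhost:9000
```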

In that case, we can use kompose convert, which will generate the YAML files corresponding to the resources that would have been created with kompose up. Then, we can edit these files and touch them up at will before loading them into our cluster.

This method gives us a lot of flexibility (since we can edit and transform the YAML files as much as necessary before using them), but this means any change or edit might have to be done again when we update the original Compose file.

If we maintain many applications, but with similar architectures (perhaps they use the same languages, frameworks, and patterns), then we can use kompose convert, followed by an automated post-processing step on the generated YAML files. However, if we maintain a small number of apps (and/or they are very different from each other), writing custom post-processing scripts suited to every scenario may be a lot of work. And even then, it’s a good idea to double-check the output of these scripts a number of times, before letting them output YAML that would go straight to production. This might warrant even more work; more than you might want to invest.

This table (courtesy of XKCD) tells us how much time we can spend on automation before it gets less efficient than doing things by hand.

I’m a huge fan of automation. Automation is great. But before I automate something, I need to be able to do it …

… Manually

The best way to understand how these tools work is to do their job ourselves, by hand.

Just to make it clear: I’m not suggesting that you do this on all your apps (especially if you have many apps!), but I would like to show my own technique for converting a Compose app into Kubernetes resources.

The basic idea is simple: each line in our Compose file must be mapped to something in Kubernetes. If I were to print the YAML for both my Compose file and my Kubernetes resources, and put them side by side, for each line in the Compose file, I should be able to draw an arrow pointing to a line (or multiple lines) on the Kubernetes side.

This helps me to make sure that I haven’t skipped anything.

Now, I need to know how to express every section, parameter, and option in the Compose file. Let’s see how it works on a small example!

# Compose file                                                      | translation
version: "3"                                                        |
services:                                                           |
  php:                                                              | deployment/php
    image: jpetazzo/appthing:v1.2.3                                 | deployment/php
    external_links:                                                 | service/db
    - 'mariadb_db_1:db'                                             | service/db
    working_dir: /var/www/                                          | ignored
    volumes:                                                        | \
    - './apache2/sites-available/:/etc/apache2/sites-available/'    |  \
    - '/var/logs/appthing/:/var/log/apache2/'                       |   \
    - '/var/volumes/appthing/wp-config.php:/var/www/wp-config.php'  |    \ volumes
    - '/var/volumes/appthing/uploads:/var/www/wp-content/uploads'   |    /
    - '/var/volumes/appthing/composer:/root/.composer'              |   /
    - '/var/volumes/appthing/.htaccess:/var/www/.htaccess'          |  /
    - '/var/logs/appthing/app.log:/var/www/logs/application.log'    | /
    ports:                                                          | service/php
    - 8082:80                                                       | service/php
    healthcheck:                                                    | \
      test: ["CMD", "curl", "-f", "http://localhost/healthz"]       |  \
      interval: 30s                                                 |   liveness probe
      timeout: 5s                                                   |  /
      retries: 2                                                    | /
    extra_hosts:                                                    | hostAliases
    - 'sso.appthing.io:10.10.22.34'                                 | hostAliases

This is an actual Compose file written (and used) by one of my customers. I replaced image and host names to respect their privacy, but other than that, it’s verbatim. This Compose file is used to run a LAMP stack in a preproduction environment on a single server. The next step is to “Kubernetize” this app (so that it can scale horizontally if necessary).

Next to each line of the Compose file, I indicated how I translated it into a Kubernetes resource. In another post (to be published next week), I will explain step by step the details of this translation from Compose to Kubernetes.

This is a lot of work. Furthermore, that work is specific to this app, and has to be re-done for every other app! This doesn’t sound like an efficient technique, does it? In this specific case, my customer has a whole bunch of apps that are very similar to the first one that we converted together. Our goal is to build an app template (for instance, by writing a Helm Chart) that we can reuse, or at least use as a base, for many applications.

If the apps differ significantly, there’s no way around it: we need to convert them one by one.

In that case, my technique is to tackle the problem from both ends. In concrete terms, that means converting an app manually, and then thinking about what we can adapt and tweak so that the original app (running under Compose) can be easier to deploy with Kubernetes. Some tiny changes can help a lot. For instance, if we connect to another service through an FQDN (e.g. sql-57.whatever.com), we can replace it with a short name (e.g. sql) and use a Service (with an ExternalName or static endpoints). Or we can use an environment variable to switch the code behavior. If we normalize our applications, it is very likely that we will be able to deal with them automatically with Kompose or Docker Enterprise Edition.

(This, by the way, is the whole point of platforms like OpenShift or CloudFoundry: they restrict what you can do to a smaller set of options, making that set of options easier to manage from an automation standpoint. But I digress!)

Conclusions

Moving an app from Compose to Kubernetes requires transforming the application’s Compose file into multiple Kubernetes resources. There are tools (like Kompose) to do this automatically, but these tools are no silver bullet (at least, not yet).

And even if we use a tool, we need to understand how it works and what it’s producing. We need to be familiar with Kubernetes, its concepts, and various resource types.

This is the perfect opportunity to bring up the training sessions that we’re organizing in the coming months!

There will be three sessions in French in April in Paris, and one in English in June in San Jose (CA).

Bien démarrer avec les containers, April 15-16th, Paris
Déployer ses applications avec Kubernetes, April 23-24th, Paris
Opérer et administrer Kubernetes, April 26th, Paris
Kubernetes for administrators and operators, June 10-11th, San Jose (CA)

I can also deliver private training, customized to your team. Please get in touch if you’re interested!

In the second part of this article (to be published next week), we’ll dive into the technical details and explain how we adapted this LAMP application to run it on Kubernetes!

Using Compose to go from Docker to Kubernetes (2/2)

2018-11-14T00:00:00+00:00

This article is the follow-up to the previous one. Today, we’re going to dive into the details and see how to adapt an application described by a Compose file so that it can run on Kubernetes.

If you still can’t read French and wonder what this post is about: it’s an in-depth description of a technique that one can use to transform an app described by a Compose file into a set of Kubernetes resources.

I like writing articles for my blog, but I like training brilliant people (for instance, you, dear readers) on all these topics even more: containers, Kubernetes, Docker … Hence this short announcement:

In April, I will deliver three training sessions in Paris (in French): getting started with containers, deploying apps with Kubernetes, and Kubernetes administration and operations. Then, in June, there will be Kubernetes for administrators and operators in San Jose (California), in English. If you know someone who might be interested … Feel free to pass it along; thanks a lot! ♥

Previously, in this series

So, we want to “Kubernetize” the Compose file below:

# Compose file                                                     | translation
version: "3"                                                       |
services:                                                          |
  php:                                                             | deployment/php
    image: jpetazzo/apptruc:v1.2.3                                 | deployment/php
    external_links:                                                | service/db
    - 'mariadb_db_1:db'                                            | service/db
    working_dir: /var/www/                                         | ignored
    volumes:                                                       | \
    - './apache2/sites-available/:/etc/apache2/sites-available/'   |  \
    - '/var/logs/apptruc/:/var/log/apache2/'                       |   \
    - '/var/volumes/apptruc/wp-config.php:/var/www/wp-config.php'  |    \ volumes
    - '/var/volumes/apptruc/uploads:/var/www/wp-content/uploads'   |    /
    - '/var/volumes/apptruc/composer:/root/.composer'              |   /
    - '/var/volumes/apptruc/.htaccess:/var/www/.htaccess'          |  /
    - '/var/logs/apptruc/app.log:/var/www/logs/application.log'    | /
    ports:                                                         | service/php
    - 8082:80                                                      | service/php
    healthcheck:                                                   | \
      test: ["CMD", "curl", "-f", "http://localhost/healthz"]      |  \
      interval: 30s                                                |   liveness probe
      timeout: 5s                                                  |  /
      retries: 2                                                   | /
    extra_hosts:                                                   | hostAliases
    - 'sso.apptruc.fr:10.10.22.34'                                 | hostAliases

As a reminder, this is a real Compose file used by one of my customers. I only changed the image and host names for confidentiality, but other than that, it’s a real file, representative of what can be found in the wild.

I annotated the file to show (on the right-hand side) which Kubernetes concept or resource corresponds to each line.

Now, let’s look at this in more detail!

Where are my containers?

First of all, for each service (in the Compose sense), I created a Deployment in Kubernetes. For simplicity, I give this Deployment the same name as the Compose service (here, php).

To generate the YAML for my Deployment, I use the following command:

kubectl create deployment php \
        --image jpetazzo/apptruc:v1.2.3 \
        --dry-run -o yaml

Normally, this command generates the description of a resource (here, a Deployment) and then creates that resource on the cluster. But since we’re using the --dry-run option, it only generates the description, without creating the resource. We make sure that this description is displayed in YAML format with (as you probably guessed) -o yaml.

So far, so good.
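For reference, the YAML printed by this command looks roughly like the sketch below. The exact apiVersion and the empty or defaulted fields depend on the kubectl version, so this is an approximation rather than a verbatim capture.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php
  labels:
    app: php             # "kubectl create deployment" labels everything with app=<name>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: php
  template:
    metadata:
      labels:
        app: php
    spec:
      containers:
      - name: apptruc    # the container name is derived from the image name
        image: jpetazzo/apptruc:v1.2.3
```

The app: php label is worth remembering, since it is a convenient selector for any service that should send traffic to these pods.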

Outbound connections

Next, I see an external_links section, which maps the container mariadb_db_1 to the name db. So I am going to create a Kubernetes service named db. Several options are available.

In this case, it turns out that the mariadb_db_1 database is exposed on port 3306 on a machine named db.apptruc.fr. The simplest solution is then to create a service of type ExternalName. Concretely, all this does is add a CNAME record to the Kubernetes DNS (kube-dns or CoreDNS, depending on the Kubernetes version in use). As a result, when my application resolves the name db, the Kubernetes DNS will tell it “the name db is a CNAME for db.apptruc.fr; by the way, the corresponding IP address is 10.20.30.40.”
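A minimal sketch of such an ExternalName service, using the names from this example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  type: ExternalName
  externalName: db.apptruc.fr   # resolving "db" returns a CNAME pointing to this name
```

An equivalent manifest can also be generated with kubectl create service externalname db --external-name db.apptruc.fr.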

If my MariaDB server isn’t in DNS (and I only have its IP address), bad news: at the time of writing (Kubernetes 1.12), an ExternalName cannot point directly to an IP address. If I can’t (or don’t want to) create a DNS entry for my MariaDB server, I can use nip.io. Thanks to nip.io, I can get a DNS name for any IP address. All it takes is appending .nip.io to the IP address! In other words, if my MariaDB server has the address 10.20.30.40, I can create an ExternalName pointing to 10.20.30.40.nip.io and that’s it.

(Note that while nip.io is very convenient, using it creates a dependency on an external service. It also requires our cluster to have Internet access. This isn’t a heavy constraint in most cases, except for people running clusters that are completely isolated from the outside world…)

All of this only works if my MariaDB server is exposed on the default port (3306). What if my server is exposed on a different port?

Option 1: an ambassador. In this case, I could use hamba. This would mean adding a Deployment. The ambassador listens on port 3306, and relays each connection to whatever address and port we want. We could also use a MySQL proxy as an ambassador.

Option 2: a ClusterIP service and a static backend. Normally, in Kubernetes, a service gets its list of backends (or endpoints) through a selector. For instance, the selector can state “this service corresponds to all pods with the label app=toto”. Each time a pod with this label appears or disappears, it is added to or removed from the list of backends for the service. This amounts to dynamically reconfiguring a load balancer. But we can also create a service without a selector, and then manage the backends ourselves.

How do we do that in practice? Simply by loading a YAML file similar to the example below with kubectl create:

---
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  ports:
  - name: "3306"
    port: 3306
    protocol: TCP
    targetPort: 3306
  type: ClusterIP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: db
subsets:
- addresses:
  - ip: 10.20.30.40     # Change this
  ports:
  - name: "3306"
    port: 12345         # And this
    protocol: TCP

(Replace 10.20.30.40 and 12345 with the IP address and port that the service should map to, and that’s it!)

Avoid deploying anything too fatty, too salty, or too sweet

The next line in the Compose file is working_dir. In principle, I could carry this directive over into the YAML file for the php Deployment. But in this particular case, I asked myself: is this directive necessary? It turned out not to be useful, so we got rid of it.

There is a small but important lesson here: on the one hand, it’s important to make sure that we have transcribed all the information in the Compose file. On the other hand, blindly copying information can lead to an accumulation of small useless (or even counterproductive) things, whose purpose nobody quite remembers.

This is particularly true for (long) configuration files, and especially generated files. These files tend to be long (a program will always be less lazy than a human and will never balk at adding more lines!) and not always commented.

All too often, I’ve cleaned up a configuration of several hundred lines, reducing it to fewer than ten useful lines. Everything else was default values, or had no effect on the application. The result is a configuration that is much more readable, easy to understand, and easy to port or translate when switching systems or simply upgrading to a new version.

Mounting volumes

Then, we have a whole bunch of volumes. We were able to sort them into three categories:

configuration,
logs,
assets (images and others).

The configuration and the logs are spread across multiple directories. We could have created multiple configuration volumes and multiple log volumes, but we chose a slightly different method.

To begin with, we gather all the configuration files that we identified into a config directory, and then we turn this directory into a Kubernetes ConfigMap with the following command:

kubectl create configmap config --from-file=config \
        --dry-run -o yaml > configmap-config.yaml

This ConfigMap will be mounted as a volume (for instance at /config), which will rematerialize the content of the config directory in each container of the application.

Then, we modify the application’s start command, in order to create symbolic links to all these files. That way, at the location of each configuration file expected by the application, we’ll have a symbolic link pointing to the configuration file held in /config, and that directory corresponds to a Kubernetes ConfigMap.

We proceed similarly for logs. Here again, each log file or directory of the application is replaced by a symbolic link to /logs, and /logs is a volume.

Here is an excerpt of the YAML file for the php Deployment:

command:
- "sh"
- "-c"
- |
  set -e
  ln -sf /config/wp-config.php /var/www/wp-config.php
  ln -sf /config/.htaccess /var/www/.htaccess
  mkdir /etc/apache2/sites-include
  ln -sf /config/url-redirections /etc/apache2/sites-include/url-redirections
  ln -sf /config/000-default.conf /etc/apache2/sites-available/000-default.conf
  [ -d /logs/apache2 ] || mv /var/log/apache2 /logs/apache2
  ln -sf /logs/apache2 /var/log/apache2
  ln -sf /logs/application.log /var/www/logs/application.log
  exec sudo apachectl -DFOREGROUND

There is quite a lot to say about this command section!

Each volume declared in the Compose file is translated here (by an appropriate ln -sf command).
Since we have multiple commands to execute, we do so through sh -c "a_command && another_command && yet_another".
Rather than chaining all the commands with &&, we put a set -e at the beginning. This avoids accidentally forgetting a && (which would let the application start even if a link could not be created correctly).
To keep this readable, we use a multi-line YAML string as the argument to sh -c. Imagine what it would look like if the script were squeezed onto a single line, with ; to separate the commands!
At the end, when we start the container’s entrypoint, we do it with exec, so that the entrypoint is PID 1 in the container. If we ran sudo apachectl directly (without exec), then PID 1 would be sh and sudo apachectl would be a child process.
To figure out what to run (where does this sudo apachectl come from?), we simply ran docker inspect on the image.

Finally, for the assets, the best method would (ideally!) be to replace this shared directory with an object store. But that entails fairly heavy changes to the application, so in the meantime, we can use (for instance) an NFS share.
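Putting the three categories together, the volumes part of the php Deployment could look like the sketch below. The volume names, the NFS server address, and the export path are placeholders, and using an emptyDir for /logs is just one possible choice; this is not the customer’s actual manifest.

```yaml
# Fragment of the pod template (spec.template.spec) of the php Deployment
containers:
- name: php
  image: jpetazzo/apptruc:v1.2.3
  volumeMounts:
  - name: config
    mountPath: /config                      # target of the "ln -sf /config/..." links
  - name: logs
    mountPath: /logs                        # target of the "ln -sf /logs/..." links
  - name: uploads
    mountPath: /var/www/wp-content/uploads
volumes:
- name: config
  configMap:
    name: config                            # the ConfigMap created from the config directory
- name: logs
  emptyDir: {}                              # assumption: logs don't need to outlive the pod
- name: uploads
  nfs:
    server: nfs.example.internal            # placeholder NFS server
    path: /exports/uploads                  # placeholder export path
```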

Volumes and ConfigMaps are complex concepts. If you want to learn more about them, you can consult:

Inbound connections

We continue with the ports section. This application sits behind an HAProxy load balancer, configured to send requests to port 8082 of the Docker host where it runs. We’re going to keep the same pattern, but use a service of type NodePort and configure HAProxy to send requests to all the nodes of our Kubernetes cluster, on the allocated port.
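A minimal sketch of that NodePort service follows. It assumes that the pods carry the label app=php (which is what kubectl create deployment php sets up) and that the container listens on port 80, as in the 8082:80 mapping of the Compose file.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: php
spec:
  type: NodePort
  selector:
    app: php            # assumption: label carried by the pods of the php Deployment
  ports:
  - port: 80
    targetPort: 80      # container port from the Compose "8082:80" mapping
    protocol: TCP
    # nodePort is left unset, so Kubernetes allocates one
```

The allocated port shows up in the output of kubectl get service php; that is the port to put in the HAProxy backend configuration.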

If we had wanted to go further, we could have created an Ingress. This would have let us replace the HAProxy load balancer with a mechanism better integrated with Kubernetes, like Traefik for instance.

In this specific case, my customer wanted to keep their existing load balancers in order to migrate more gradually. This is a very healthy approach, which limits the number of new tools that operations teams have to learn. So we use a NodePort to stay as close as possible to the existing setup.

To learn more about Ingress resources, you can check the Kubernetes documentation or our training materials.

Probes

The healthcheck section is replaced by a liveness probe in the Deployment. I won’t go into details (this article is already long enough as it is), and will simply mention that it makes it possible to detect when the container has a problem, and to restart it automatically if so. To learn more about these probes, and about the difference between liveness and readiness probes, I invite you to check the documentation or, once again, our training materials.
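As a sketch, the healthcheck from the Compose file could map to a liveness probe like the one below. The field mapping is my own reading of the two formats, not an official equivalence; in particular, failureThreshold is only the closest match for retries.

```yaml
# Fragment of the php container spec
livenessProbe:
  exec:
    command: ["curl", "-f", "http://localhost/healthz"]
  periodSeconds: 30      # Compose "interval: 30s"
  timeoutSeconds: 5      # Compose "timeout: 5s"
  failureThreshold: 2    # Compose "retries: 2" (closest equivalent)
```

An httpGet probe on path /healthz and port 80 would be an alternative that doesn’t require curl in the image.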

Outbound connections (again)

Finally, the extra_hosts section lets us inject additional DNS entries. In this case, the name sso.apptruc.fr corresponds to a public IP address, and (in the specific case of this customer’s network) using this public IP address sends the traffic through the firewall. The extra_hosts section overrides this DNS name to map it to the service’s private IP address, so that it can be reached directly, without going through the firewall. (This topology is specific to this customer, but it shows up in other circumstances; for instance, in a cloud infrastructure, when an internal machine accesses an internal service, but through its external IP address.)

This extra_hosts section can be translated with a hostAliases section in the Deployment. (This is explained particularly well in the Kubernetes documentation.)
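With the values from the Compose file, a sketch of that hostAliases section looks like this:

```yaml
# Fragment of the pod template (spec.template.spec) of the php Deployment
hostAliases:
- ip: "10.10.22.34"
  hostnames:
  - "sso.apptruc.fr"
```

Kubernetes adds the corresponding line to /etc/hosts in each container of the pod.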

Cela dit, si on a plusieurs Deployment qui accèdent à un service de cette façon, on peut aussi souhaiter mettre en place quelque chose qui surcharge le nom DNS de ce service automatiquement pour tous les services.

Pour des noms courts (comme db ou api) on peut créer un service Kubernetes (comme expliqué plus haut pour db), mais pour un nom contenant des points (comme sso.apptruc.fr) cela n’est pas possible, car on ne peut pas avoir de point dans le nom d’une ressource Kubernetes. On peut, en revanche, configurer le DNS de Kubernetes pour “détourner” les requêtes pour sso.apptruc.fr afin de renvoyer une adresse IP de notre choix. Là aussi, il s’agit d’une opération non triviale. Si vous voulez en savoir plus à ce sujet, vous pouvez consulter cet excellent article en anglais.

Une autre solution est de changer le code afin d’accéder à sso (au lieu de sso.apptruc.fr) puis créer un service sso.

Conclusions

Phew! We have converted our application. And as this real-world example shows, automatic tools have their limits. A tool like Kompose, however sophisticated, could not have automatically created an NFS share for us. Today’s tools cannot guess which files are configuration files (and can be wrapped in a ConfigMap) and which files are logs (and can be placed in an emptyDir volume shared with a sidecar container that relays them to our logging platform). Maybe that will come, but we’re not there yet.

As mentioned in the previous article, it is more effective to tackle the problem from both ends: on one side, use a tool like Kompose to automate the work; on the other, analyze the result, understand what was not translated correctly, and fix it by hand; then, over time, modify the Compose file upstream so that Kompose can do a better job on the next pass.

Either way, there’s no getting around it: we have to get familiar with Kubernetes!

So let me take this opportunity to mention my upcoming training sessions in Paris and California!

There will be:

Getting started with containers, April 15-16 in Paris
Deploying applications with Kubernetes, April 23-24 in Paris
Operating and administering Kubernetes, April 26 in Paris
Kubernetes for administrators and operators, June 10-11 in San Jose

The Paris sessions are in French. If you want to level up at full speed, you can take all 3 back to back (they are designed to work together).

The San Jose session will take place as part of the Velocity conference.

I can also deliver custom training for your team. Feel free to contact me.

If you’re wondering what these sessions look like, our materials are freely available at http://container.training/, along with a few videos of previous sessions. That should help you judge whether they fit your needs.

From Docker to Kubernetes by way of Compose (1/2)

2018-11-07T00:00:00+00:00

In the world of containers, Docker is a fantastic development platform, and Kubernetes an equally fantastic production platform. How do we go from one to the other? In particular, if we use Compose to describe our development environment, how do we translate our Compose files into Kubernetes resources?

If you can’t read French and wonder what this post is about: It’s an overview of techniques that one can use to transform an app described by a Compose file into a set of Kubernetes resources. An English translation is available if you’re interested!

This article is part of a series of articles in French about Docker, Kubernetes, and containers in general. If you’d like an introduction to the topic, I invite you to read « Les conteneurs : par où commencer? »; if you’re more on the “ops” side and wonder what Docker (or containers in general) can do for you, I suggest « Dérisquer son infrastructure avec les conteneurs ».

Before getting to the heart of the matter, a quick word from this blog’s sponsor (in other words, me):

In April, I will deliver three training sessions in Paris (in French): getting started with containers, deploying applications with Kubernetes, and operating and administering Kubernetes. Then, in June, there will be Kubernetes for administrators and operators in San Jose (California), in English. If you know someone who might be interested, please pass it along; thank you very much! ♥

Problem statement

To get started with containers, I often recommend proceeding as follows:

write a Dockerfile for one service of an application, to run that service in a container,
run the application’s other services the same way,
write a Compose file for the application,
… pause.

At that stage, we already reap the benefits of containers, because anyone with Docker on their machine can run the application by typing three lines:

git clone ...
cd ...
docker-compose up

After that, we can add plenty of nice things: continuous integration, perhaps continuous deployment to pre-production …

But one fine day, we want to go to production. And in many cases, production for containers will mean Kubernetes. We could debate the merits of Mesos, Nomad, Swarm, etc., but here I’ll assume that we picked Kubernetes (or that someone picked it for us).

How do we get from our Compose files to our Kubernetes resources?

At first glance, seen from very (very) far away, it should be easy: Compose uses YAML, and so does Kubernetes.

Original image by Jake Likes Onions, remixed by @bibryam.

The problem is that Compose YAML and Kubernetes YAML have absolutely nothing in common. Worse: some concepts have completely different meanings. For example, in Docker Compose, a service is a set of identical containers (sometimes placed behind a load balancer), whereas in Kubernetes, a service is a mechanism for accessing resources (for example containers) whose network address isn’t fixed. When there are several resources behind the same service, it also acts as a load balancer. Yes, it’s a good way to sow confusion; yes, I too regret that the designers of Compose and Kubernetes didn’t get a chance to agree on vocabulary; but in the meantime we have to live with it.

Since we can’t translate our YAML with a wave of a magic wand, what do we do?

I’m going to present three ways to proceed, each with its pros and cons.

100% Docker

If we use a recent version of Docker Desktop (Docker Windows or Docker Mac), we can deploy a Compose file on Kubernetes as follows:

In Docker Desktop’s preferences, select Kubernetes as the orchestrator. (If we were on Swarm before, it may take a minute or two for the Kubernetes components to start.)

Deploy your application with the command:

docker stack deploy --compose-file docker-compose.yaml mabelleappli

That’s it!

In the simplest cases, this will work right away: Docker translates the Compose file into Kubernetes resources (Deployment, Service, etc.) and we won’t need to maintain any extra files.

But there’s a catch: this runs the application on our Docker Desktop. How do we run it on a production Kubernetes cluster?

If we use Docker Enterprise Edition, we’re saved: UCP (Universal Control Plane) lets us do exactly the same thing, but targeting our Docker EE cluster. As a reminder, Docker EE can run applications managed by Kubernetes and applications managed by Swarm side by side. When we deploy an application by supplying a Compose file, we indicate which orchestrator to use, and that’s all there is to it.

(The UCP documentation explains this in more detail. You can also read this article on the Docker blog.)

This method is a particularly good fit if we’re already a Docker Enterprise Edition customer, or plan to become one: besides being the simplest of the lot, it will also be the most robust, since we’ll benefit from Docker Inc.’s support in case of incompatibility.

Fine, but what about people who don’t use Docker EE?

With tools

There are several tools that can translate a Compose file into Kubernetes resources. I’ll mostly focus on Kompose, because it is (in my humble opinion) the most complete to date, and the best documented.

We can use Kompose in two ways: by working directly with our Compose files, or by translating them into Kubernetes YAML files that we then deploy with kubectl, the Kubernetes CLI. (Technically, we don’t have to use the CLI; we can feed these YAML files to other tools, for example WeaveWorks Flux or Gitkube, but I’m simplifying a bit.)

If we decide to work directly with our Compose files, we simply use kompose instead of docker-compose for most commands. Concretely, we’ll start our application with kompose up (instead of docker-compose up), for example.

This method fits when we work with a large number of applications for which we already have various Compose files, and we don’t want to maintain a second set of files. It also fits when our Compose files change quickly, and we want to avoid managing drift between our Compose files and our Kubernetes files.

In some cases, the translation performed by Kompose will be imperfect, or won’t work at all. For example, if we use local volumes (docker run -v /path/to/data:/data ...), we’ll have to find another way to bring that data into our containers on Kubernetes. (For example, by using Persistent Volumes.) Or we may want to take the opportunity to restructure the application a bit so that the web server and the application server run together in the same pod, instead of being two separate entities.

In that case, we can use kompose convert, which generates the YAML files for all the resources that kompose up would have created; we can then edit these files as much as we like before loading them into our cluster.

This method offers a lot of flexibility (since we can transform the YAML at will before using it), but it also means that every change to the Compose file forces us to decide whether to regenerate (and, if so, re-edit) our Kubernetes resources.

If you maintain many applications with similar architectures (and patterns), you can use kompose convert and then apply automatic post-processing to the generated YAML files. However, if you maintain few applications (and/or they are very different from one another), writing a post-processing script that handles every case will probably be a fairly heavy investment; and you will certainly want to check its work for a good while before letting it blindly generate YAML that goes straight to production.

I’m a big proponent of automation, but before automating something, we need to be able to do it …

… By hand

To really understand how these tools work, the best way is still to do their job by hand.

Let’s be clear: I’m not particularly recommending that you do this for all your applications (especially if you have many!), but I’d like to present “my” methodology for converting a Compose application into Kubernetes resources.

The basic idea is simple: every line of the Compose file must be translated in the Kubernetes result. If I displayed or printed the two side by side, then from each line of the Compose file I should be able to draw an arrow to its expression in Kubernetes.

That way I can be sure I haven’t forgotten anything.

Next, we need to know how to express each section, each parameter, each option of the Compose file. Let’s look at a small example in action!

# Compose file                                                     | translation
version: "3"                                                       |
  services:                                                        |
    php:                                                           | deployment/php
      image: jpetazzo/apptruc:v1.2.3                               | deployment/php
      external_links:                                              | service/db
      - 'mariadb_db_1:db'                                          | service/db
      working_dir: /var/www/                                       | ignored
      volumes:                                                     | \
      - './apache2/sites-available/:/etc/apache2/sites-available/' |  \
      - '/var/logs/apptruc/:/var/log/apache2/'                     |   \
      - '/var/volumes/apptruc/wp-config.php:/var/www/wp-config.php'|    \ volumes
      - '/var/volumes/apptruc/uploads:/var/www/wp-content/uploads' |    /
      - '/var/volumes/apptruc/composer:/root/.composer'            |   /
      - '/var/volumes/apptruc/.htaccess:/var/www/.htaccess'        |  /
      - '/var/logs/apptruc/app.log:/var/www/logs/application.log'  | /
      ports:                                                       | service/php
      - 8082:80                                                    | service/php
      healthcheck:                                                 | \
        test: ["CMD", "curl", "-f", "http://localhost/healthz"]    |  \
        interval: 30s                                              |   liveness probe
        timeout: 5s                                                |  /
        retries: 2                                                 | /
      extra_hosts:                                                 | hostAliases
      - 'sso.apptruc.fr:10.10.22.34'                               | hostAliases

Above is a real Compose file used by one of my clients. I replaced the image and host names to protect my client’s confidentiality, but apart from that everything is authentic. This Compose file is used to run a LAMP-stack application in pre-production. For now the application runs on a single machine, but the next step is to “Kubernetize” it (and allow horizontal scaling if needed).

I annotated the Compose file to show, next to each line, how I translated it into Kubernetes resources. In this second article, I go through, point by point, how I mapped Compose to Kubernetes.
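To give a flavor of two of these annotations, here is a hedged sketch of how the healthcheck and ports sections might look on the Kubernetes side. These are not the manifests actually used for this application: the app: php label and the resource names are assumptions, mapping interval, timeout, and retries to periodSeconds, timeoutSeconds, and failureThreshold is one plausible reading, and exposing port 8082 on the Service is a guess.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php
spec:
  selector:
    matchLabels:
      app: php              # assumed label
  template:
    metadata:
      labels:
        app: php
    spec:
      containers:
      - name: php
        image: jpetazzo/apptruc:v1.2.3
        ports:
        - containerPort: 80
        # healthcheck -> liveness probe
        livenessProbe:
          exec:             # runs inside the container, like the Compose healthcheck
            command: ["curl", "-f", "http://localhost/healthz"]
          periodSeconds: 30     # interval: 30s
          timeoutSeconds: 5     # timeout: 5s
          failureThreshold: 2   # retries: 2
---
# ports -> service/php
apiVersion: v1
kind: Service
metadata:
  name: php
spec:
  selector:
    app: php
  ports:
  - port: 8082              # assumed: keep the port published by Compose
    targetPort: 80
```

An httpGet probe on path /healthz and port 80 would be an alternative that doesn’t require curl in the image.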

All of this took a lot of work; work that is specific to this application, what’s more. How can we repeat it efficiently for other applications? In my example, my client has a whole string of similar applications. The goal is then to build an application template (for example, as a Helm Chart) that can be reused, or at least used as a starting point, for several applications.

If the applications are different from one another, there’s no getting around it: they have to be converted one by one.

In that case, I recommend tackling the problem from both ends. That is, we can convert an application by hand, then ask ourselves, “what can I change in the original application (in Compose format) to make it easier to run on Kubernetes?” Sometimes these are very simple changes: replacing a DNS name with a short name; using an environment variable to change the code’s behavior … If we standardize our applications enough, we may well be able to process them automatically afterwards with Kompose, Docker Enterprise Edition, or a similar tool.

Conclusions

Moving from Compose to Kubernetes requires transforming the Compose file into multiple Kubernetes resources. There are tools (like Kompose) that do it automatically, but these tools are no panacea (at least, not yet).

Even if we use a tool, we need to understand what it produces. So we need to be familiar with Kubernetes, its concepts, and its various resource types.

So let me take this opportunity to mention my upcoming training sessions in Paris and California!

There will be:

Getting started with containers, April 15-16 in Paris
Deploying applications with Kubernetes, April 23-24 in Paris
Operating and administering Kubernetes, April 26 in Paris
Kubernetes for administrators and operators, June 10-11 in San Jose

The Paris sessions are in French. If you want to level up at full speed, you can take all 3 back to back (they are designed to work together).

The San Jose session will take place as part of the Velocity conference.

I can also deliver custom training for your team. Feel free to contact me.

In the second part of this article, we’ll get into the technical details to explain how we adapted this LAMP application to run on Kubernetes!

The depression gnomes

2018-09-06T00:00:00+00:00

I’m going to try to explain how I felt when I was struggling with depression. There will be gnomes and other lousy metaphors, but don’t let that distract you.

Sometimes, depression feels like two miniature gnomic versions of me are fighting over the control of my brain. You know, a bit like the little do-gooder angel and the mischievous imp that we imagine sitting on our shoulders, giving us advice when moral dilemmas arise.

“You shouldn’t eat that cake! You’ve already eaten a big lunch! Furthermore, it’s full of fat and sugar, which are bad for your health!”

“Chocolate is delicious! You love the cake, and the cake loves you… Furthermore, just one slice won’t kill you! You can always go to the gym to sweat it out… Tomorrow… (Muhahaha!)”

Except the depression gnomes are not good and evil; they’re merely happy and sad.

The happy gnome is the one that is usually in charge, when I’m not depressed. He’s the one who tells me how to bring joy to myself and others. He’s an optimist, always sees the upside in every situation, and he sings “Always Look on the Bright Side of Life” in the shower.

The sad gnome is not only sad; he’s also an incorrigible defeatist. He’s normally not around, because he’s locking himself up in the basement. Everything scares him, including getting out of the basement. But when he roams free, he writes downer thinkpieces like “Ten Reasons Why It Will Never Work — Number Nine Will Scare You!” and he tends to be very, very convincing.

In January 2017, a few months after being diagnosed with depression, I made a big decision: I would learn and play the cello! Why the cello, and what impact it had on me, is a whole other story, which I won’t tell now. All we need to know is that I managed to rent a cello and to find a teacher. However, my cello lessons were rather far away. I had to drive half an hour to get there (and as much to get back). The drive was quite an ordeal, because my lessons were in the evening, it was dark outside, our car’s windshield was very dirty, and the headlights didn’t help much. Furthermore, I had just switched to a different antidepressant medication, and my body and brain were very busy coping with various chemical imbalances, making me even more tired than usual. The drive to my cello lessons and back was excruciating.

A little part of me was thinking, “Hey, you could get that car washed; it would maybe help?” That was the happy gnome, always willing to provide useful suggestions. But I wouldn’t act on it. Why? Because the sad gnome was in charge. And this is what he was saying:

“Whoa whoa whoaa there … Washing the car seems complicated. You certainly won’t wash that car yourself; you don’t even have the cleaning products for that, and it’s cold as duck outside. Taking it to a carwash, you say? And where are you going to find one? Google Maps? Oh yeah? Shall we talk about what happened last time we looked something up on Google Maps?” (Nothing happened, I don’t know what this was about. The sad gnome sometimes seems to know things that I don’t.) “That sounds dangerous, and, you know, complicated. You better drive that old van as it is, Sir.”

And I would listen to the sad gnome, because that’s what you do when you’re depressed. The happy gnome could put together giant neon signs advertising for free cookies when you’re hungry: you wouldn’t notice them; or you’d think it’s a scam. (Honestly, who would give away free cookies these days?)

Right on the road back from my cello teacher’s place, just before hopping on the freeway, there is a carwash. The automated kind, where you plop a few bucks in a machine or swipe your credit card, then drive over a huge contraption that plays Rocky’s theme and then brushes and waxes on and waxes off and does unspeakable things to your car, while you patiently wait inside like a sloth on Noah’s Ark.

I drove past this thing every single time. The happy gnome was jumping wildly up and down, cheering “Hey, look at that, by golly! Isn’t that precisely a beautiful carwash, exactly what we are looking for?” But the sad gnome was shaking his head. This is what he was saying:

“Naaaah, that’s going to be complicated. You will have to slow down, put your blinkers on, turn the wheel to get into their parking lot; then figure out their pricing structure, which probably has seven tiers of various upsells and options. This is all complicated and shit. You drive home now. We are tired and we want our bed.”

And I would just drive past it, because the sad gnome’s arguments sounded solid to me.

But the happy gnome wouldn’t give up. Eventually, I got a bit better. My old medication got purged from my body, and the new one helped a bit (for a short while). Of course, I drew lots of satisfaction and joy from playing the cello, too. And we got a subscription to Blue Apron, a service that delivers fancy meal ingredients to your door, complete with recipes, and you just have to follow the instructions and boom! Delicious food happens. All these good things allowed the happy gnome to be in charge once in a while. The sad gnome would tell us, “Sure! You try your thing! But when it fails miserably, I will have told you so!”; but miserable failures were rare.

One day, as I was driving back from the cello, I saw this carwash for the Nth time. And this time, I signaled, slowed down, turned the wheel, and pulled in. I got their cheapest option. When the thing started playing the Rocky theme, I laughed my ass off. There were multi-colored brushes and cleaning products and stuff making a rainbow on my windshield. When it was done, I drove to the vacuum cleaner station, and vacuumed the shit out of this car. I threw away all the things that were too encrusted with dirt to be recognizable, and then vacuumed again before driving home.

Lo and behold: I could now actually SEE on the road! I don’t know if it was because the windshield was cleaner, or the headlights were cleaner, or purely psychological, but either way, it felt so much better!

Encouraged by this immense success, the creative part of my brain was on fire!

“Let’s go do some shopping!”

I got some amazing cereal from Aldi. Duck breast at the Broadway Butcher Shop, to cook one of my favorite dishes: magret de canard avec pommes de terre sarladaises. I did the groceries, took out the recycling, then collected the various archeological artifacts lying around the car. I discarded what was obviously of no value (like the coupons for $2 off a dog wash valid until March 2013) and kept whatever might have sentimental value to the owner of the car (my partner’s mother, who was generously letting us borrow her car, by the way). I ordered one of these magic trees that you hang from your rear view mirror to dispense a light scent of “mountain breeze” or whatever; and I put a small garbage bag in the car to collect future trash instead of letting it pile up all over.

That felt awesome.

I’ll tell you, that car (which had 180,000 miles back then) looked so spanking new and shiny that when I picked up my Mom at the airport, she called it “nice.” (Granted, she had just spent more than 12 hours in multiple planes, so her judgement might have been slightly altered by that time; but still!)

So, what am I getting at?

For me, depression is this sad little gnome that constantly sits on the brakes of my brain, and tells me that it’s not gonna work, and therefore it’s not worth trying.

That sad little gnome is even actively sabotaging any effort at making things better. He’s hiding from my view all the nice things that people are doing to help me (including myself), and he’s telling me that everything they do is meant to hurt me and make me feel bad. He makes me forget that I have something delicious in the fridge, only to remind me sourly about it long after it’s past due date and has grown multicolored lifeforms.

Why the hell am I listening to this creepy sad little gnome? Because I know him. At that point, he had been around for 36 years in my head. And he knows me, too. He has known me for just as long. He knows me and my thoughts and my apprehensions and my fears, better than anyone else. He knows how to be convincing and carry his point home, alas. In the past, before I was depressed, he was silenced by the optimist, the adventurous, the creative, the imaginative part of my mind. “It’s not gonna work! It’s NEVER gonna wo—” “Well, I did it anyway, sorry!”

Sometimes, the sad gnome is right. It doesn’t work. But that’s OK, I try again, or try other things, and it’s all good. Depression kicked in when, for some reason, I started paying attention to the sad gnome. I noticed that he was able to prove the other one wrong more and more often. So, unwillingly, I listened to him more; and that was a vicious circle.

Breaking that vicious circle was hard. It still is. Before being diagnosed with depression, in a last-ditch attempt to get better, I took a break back home in France over the summer. One of my best friends visited from the US. We rented a little convertible and we toured Brittany and Normandy together. It was wonderful, and it helped, but it wasn’t enough. I started medication and therapy. More friends visited us in Kansas City for Christmas, and after their visit, there was a dancing pole in our bedroom, and I learned a few tricks on it. I started the cello, bought half a dozen Raspberry Pis, a soldering iron, 50 feet of LED strips, and built things. It also took a visit from my Mom, and from another one of my best friends; the unwavering support from the woman I love and cherish; and a cocktail of dubious chemicals flooding my central nervous system every morning.

But I feel alive again.

The sad little gnome is still up there; he’s still babbling about nonsense and sad little gnome things. But I’m re-learning to not pay attention to his endless rants. And sweet José Herbert Philemon Gontrand Creeps, it feels better.

If you’re depressed, I don’t know if you have a sad little gnome or fairy in your head. I bet you do. She’s been giving you a bunch of bad advice over the last few years. “You’re not happy with that person! Do this instead! Hey, what if we got drunk? It solves everything!” She also knows you very well, and knows which strings to pull to make you do what she wants, what she thinks is right. She doesn’t really mean you harm; no more than my sad little gnome. But she doesn’t believe that things can be better, because she’s looking down in the dirt and can’t see the birds flying in the sky above. You have to stop listening to the sad little fairy, and let your creative mind or soul or spirit take over again. Or maybe another part: creativity helped me, but everyone is different. Video games helped me too, because even when I couldn’t be good at anything, I could still be good at video games. (Sometimes it was the last thing convincing me that I hadn’t become completely stupid.)

Find your cello, create something, conquer something that’s easy, so that you can cherish these victories, even if they are of little merit; so that they give you the confidence to move to harder things. Never forget that no matter what happens and what you do, the happy fairy is never gonna give you up or let you down.

Juniors, seniors, and mentors

2018-08-15T00:00:00+00:00

What’s the difference between a junior and a senior software engineer? Is it the responsibility of a company to provide learning resources (e.g. time or mentoring) to its engineers? What makes a good mentor anyway?

All these questions are particularly important in the context of software engineering, a discipline where the tools and frameworks and languages evolve very quickly. At first glance, it seems like we need to keep learning if we want to be good at what we do. How can we make that work?

Note: this post is about software engineering roles and practices. It is likely that many of the points still hold in other fields—i.e. engineering in general, or non-engineers in software companies; but reader’s discretion is then advised.

Juniors and seniors

For a while (at least the first decade in my career in software), I never really thought too hard about what it meant to be a “junior” or “senior” engineer. It was probably something that came with experience, I thought. After some time in the industry (how long?) I would be able to tack “senior” next to my title, and that would be about it.

And then, someone challenged this thought process. It was in 2014, at the SCALE 12x conference. Lars Lehtonen said approximately this:

The primary skill of senior engineers is to train junior engineers. If you’re senior with no junior around, you’re not senior.

If you want the full context of that quote, you can check the recording of that talk at the LISA conference (the quote about junior engineers comes shortly after the 19-minute mark).

There are multiple ideas packed in there.

First, in every project, there will be some work that will be exciting and a great learning experience if it’s one of the first times we do it, but less interesting if we have done this 10 times in previous jobs or projects already.

It’s great to have someone “junior” to do that kind of work, with the help and supervision of someone “senior”. It will free up some time for the “senior” engineer, while helping the “junior” one to ramp up their skills.

Under that lens, the term “junior” just means “someone who has done less, and/or has less experience with, a specific task or tasks in a specific domain”, by contrast with “senior”. In other words, junior/senior is domain-dependent and team-dependent: we can be senior in one field (e.g. databases, containers), junior in another (e.g. frontend, machine learning). We can be senior relative to one team, and junior relative to another.

Juniors and janitors

Of course, the actual situation is not always as rosy as I described above. Sometimes, junior engineers are tasked with the boring and repetitive grunt work. Less exciting, for sure, but in the right environment, that can still be a great opportunity to learn and grow, for instance by trying to automate that work.

This is part of a larger conversation about the different kinds of tasks that need to be done when building and then scaling and operating an application. My favorite talk on this topic is Rock Stars, Builders, and Janitors by Alice Goldfuss.

Juniors and saviors

In this context, it means that we shouldn’t assign exclusively janitorial tasks to our junior team members. But it also means that conversely, if our team is overworked and short-handed, and our senior engineers don’t have the time to do all the complex, value-adding stuff that we’d like them to do, one easy and cost-effective solution is to hire some junior engineers. After a short ramp-up period, they will be able to take over the less complex tasks, freeing up time for the rest of the team.

After a while, junior engineers are not junior anymore, and we now have a better, stronger team. After I wrote about my experience with depression and burnout, many people reached out to share their stories; and I heard more than a few terrifying ones where an entire team was wiped out by burnout, one after the other, because after each departure, the workload and the overall situation got worse for the remaining people, and management failed to course-correct in a timely manner. The more you wait, the more expensive it gets to fix a situation like this one—not even mentioning the appalling damage caused by burnout. Hire junior engineers and train them before your best people start leaving in droves.

Learning resources

Training people requires adequate resources. How do we do that?

Everyone learns differently, and every organization has different budgets and people anyway.

Let’s start with the obvious: we should give time for people to learn and grow during office hours. We shouldn’t expect our employees to spend their evenings and week-ends learning new things. Otherwise, we are penalizing people who cannot invest that time, for instance because they are parents or generally speaking caretakers.

To quote Jen Simmons:

If you are spending a lot of your time learning while you code — while someone is paying you — then you are doing it right. You don’t need to learn it all ahead of time and show up to work already knowing. …

Learning on the job is the job. You’ll accumulate wisdom as you go. You’ll learn to recognize & prevent complex problems earlier & earlier in the process. But you’ll never reach a place where you don’t have to look things up, don’t have to keep learning (on someone else’s dime).

To progress in our careers, we need to keep learning and pick up new skills. If we cannot do that, we are stuck.

I should clarify, here, that there is nothing wrong with using your free time to gain new skills and work on side projects. It is very likely that this will accelerate your career, of course. Therefore, somebody with more free time and fewer responsibilities is likely to progress faster than a single parent taking care of two young kids and an elder while having to endure a long commute. Such is life. But our responsibility is to make sure that we give enough time for everyone to keep progressing, so that we don’t build an environment that is downright hostile or toxic for less privileged folks.

Always be learning

Some people might be thinking, “but we want to hire engineers that are productive from day 1; we select them because they have the set of skills that are required for the job, so that they can be operational faster!”

Oh dear, I have a few things to break to you.

In most big organizations (or, really, any place that has been around long enough to have a non-trivial stack), it will take weeks and even months to properly on-board an engineer. Yes, we keep touting how containers help us reduce friction, and how good infrastructure and platform tools enable us to push code with confidence very early after joining a team; but even with all that, every engineer at Facebook goes through a six-week bootcamp. I’ve heard a few times that it could take 3 to 6 months for engineers at Google to reach acceptable levels of productivity. (Keep in mind that there are, of course, outliers; these are just averages.)

In the big picture, it doesn’t matter if a new hire has to spend a few days or a week getting familiar with the specific framework that you’re using, or the API of your Cloud provider.

“But we are a startup; we can’t afford to wait months for people to be adding value!”

If you’re a startup, your employees need even more to keep learning, because your technology stack and your processes are even more likely to evolve than in a bigger company. On the other hand, if you are at an early stage, your existing stack is hopefully less complex, and they can get started faster—but they will still learn a lot during the first weeks and months.

I’m going to give you a personal example. When I joined dotCloud, the infrastructure was 99% AWS EC2. I had zero experience with it (I had perhaps fired up an instance with the console before; but I had never used the CLI or API and I wasn’t familiar with the specifics of AWS). I also had zero experience with ZeroMQ and MessagePack, which were powering the RPC layer used all over the place by dotCloud. That didn’t prevent Solomon Hykes and Sebastien Pahl from hiring me. If memory serves me well, one of the first things that I had to do was to add SSL termination to some services, in a reproducible, automated way. I spent some time messing around with ELBs, only to discover that there was a limit to the number of certificates that we could load back then, and that it wouldn’t work for us. Then I switched to small EC2 instances running NGINX instead. There is a good chance that somebody familiar with AWS wouldn’t have been faster, or not by much. Furthermore, while working on this, I also contributed useful features to the RPC layer (again, if memory serves me well, by improving introspection features and auto-documentation, making it easier for me but also for other engineers to discover the services that we were running and how to use them without having to pull up their source code each time). On that topic, requiring a candidate to know beforehand about ZeroMQ and MessagePack would probably have reduced the potential talent pool to unacceptable numbers anyway.

Louder for the folks in the back: you shouldn’t hire people for their current skills, but for their ability to pick up the new skills that they will need to do their job tomorrow, next month, next quarter, next year. Very few software engineers knew about containers in 2013 when Docker launched. Millions of developers have learned how to use Docker and containers since then; and most of them learned on the job, to the great satisfaction of their employers.

Mentors

There is another resource that is crucial to the development of good engineers: mentoring.

What’s that, exactly?

The first thing that comes to mind is usually a long-term, ongoing, one-to-one relationship between a more senior and a more junior person (see, we’re back to the junior/senior theme). I want to use a broader definition, so that it encompasses any kind of situation where someone takes some time to help someone else by providing them with information of any kind that they need to better do their job.

Here are a few examples of situations that many people would probably not consider as “mentoring”, but that I would like to put under that broader definition.

I’m getting started with a new project or in a new team, and a coworker is helping me to set up my environment, walking me through code, docs, wikis, tickets, whatever.
I am different from most of my coworkers, in a way that might be obvious or subtle, and somebody in the company (potentially outside of my team) has regular check-ins with me. This is useful if I’m the only woman in a team of men, or the only person of color in a team of white people, or the only person with a different native language, or the only person coming from a different education background.
I’m part of a cross-functional effort involving multiple teams with very different domains, and I often need to ask for information or clarification from other teams.
I’m a more junior team member, and I need frequent guidance and help from other engineers in the team or other employees in the organization. (This is a bit like a traditional mentoring situation, but shared across multiple people instead of having a designated mentor.)

How much mentoring should we provide? As much as necessary.

Is there such a thing as “too much mentoring”? No.

If we find ourselves thinking, “we are spending too much time training new people”, or similarly, “our senior engineers can’t get anything done because the new hires are taking too much of their time”, then we should re-read the paragraph above. The better we help new hires to ramp up their skills, the faster they will be able to accomplish complex tasks and free up our senior engineers’ time.

From what I’ve seen in various organizations, when people complain that “this employee is taking too long to be operational”, they are shifting the blame to the employee, while very often, the actual reasons are:

lack of on-boarding process;
lack of documentation;
unequal access to mentors or other resources;
cultural bias;
negative attitude towards asking questions.

Individually, these things can hinder someone’s progress; and combined, they can be even more damaging. For instance, if we don’t have proper documentation explaining how to set up a new hire’s environment, and rely on Alice to do it each time, anyone starting while Alice is on vacation will appear to be slower than the others, even if it’s totally not their fault.

Another example: if only Bob knows the ins and outs of our database setup, and the only way to get information is to sit with him at his desk, this puts remote workers at a disadvantage. If Bob also has bias (which is likely, because Bob is human and all humans have bias), he might not communicate as easily with people who are different from him, and therefore put them at a disadvantage.

Encouraging people to ask questions (rather than discouraging them by sending them the message that they should already know everything, and that asking questions is a sign of weakness) can also make a huge difference. Foster a culture where asking questions is normal and expected, regardless of experience and seniority.

Eventually, once we have fixed our on-boarding processes and documentation, made sure that key people are available in a fair manner, and trained our people to reduce the impact of bias, if someone is still a poor performer, letting them go might be the only option; but it should be done as fairly as possible, and we should see it as an opportunity to improve our hiring process. But I digress!

Always be teaching

The other side of the coin is that as an engineer, regardless of my level, teaching should be part of my core skills. It doesn’t mean that every engineer should be able to build a course curriculum, deliver a tutorial, give a talk, or anything like that; but every engineer should be able to answer questions from a peer of any level.

Learning is a critical skill for a good software engineer, but teaching is just as important. An awesome 10x engineer who can’t or won’t share what they know only brings short-term value to your organization. In the long term, they will become a liability: at best, by being gatekeepers to important information; at worst, by driving other people away.

Conclusions

To recap, a senior engineer might be more experienced in some areas, but first and foremost, they should be someone who is able and willing to share what they know.

Everyone (not only junior engineers) needs mentoring and easy access to information, during their whole career.

There is no such thing as too much mentoring.

We should promote cultures and environments where asking questions is always OK.

All these things will pay off quickly and make us more effective!

Thanks to AJ for proofreading an early version of that post and suggesting many fixes and improvements. All remaining typos and mistakes are mine.

De-risking your infrastructure with containers

2018-08-01T00:00:00+00:00

Containers are often described as a way to speed up development cycles, but they also make it possible to de-risk (or reduce the risks of, if the neologism makes you cringe ☺) deployment operations. How so? Thanks to a pattern that is probably familiar to some of you: “immutable infrastructure”. We are going to see how this pattern reduces risks, and how containers make it accessible to organizations of various sizes and skill levels.

Before we start, a quick word from the sponsor of this blog, i.e. myself!

In April, I will deliver three training sessions in Paris (in French): getting started with containers, deploying apps with Kubernetes, and Kubernetes administration and operations. Then, in June, there will be Kubernetes for administrators and operators, a two-day tutorial in English, at the O’Reilly Velocity conference in San Jose (CA). If you know someone who might be interested … I’d love if you could let them know! Thanks ♥

This post explains how containers can be used to implement immutable infrastructures, thus considerably reducing the risks associated with application deployment. If you want to know more about this, you can also check e.g. this talk that I gave at QCON a few years ago.

A brief history of deployment

I already talked about deployment in a previous post, highlighting how containers make it easier. Thanks to containers, instead of creating multiple kinds of packages (deb, rpm, npm, pip, jar, etc.), we only need to learn how to write a Dockerfile to be able to ship any software component. No more dependency problems, no more version differences between dev and prod: you have probably heard these arguments quite a few times already!

But containers also help us to reduce risks. More precisely, today I want to address the following question:

What do we do when a deployment goes wrong?

There can be plenty of good bad reasons for that to happen: a bug that slips through the cracks of QA (whether manual or automated), a performance regression, but also a problem with the deployment process itself.

So, what do we do? And how will containers help us?

Full speed astern

The first reflex when we realize that we deployed a bad version to production is to roll back, i.e. redeploy the previous version.

If we have a well-oiled deployment process and we still have the previous version of the code, this is fairly easy in theory. We just need to stick to some discipline; for instance, “our code must always be in a git repository, and every deployment must be done from a tag”. In that case, to roll back, we check out the previous tag and redeploy.

Note: if your code is not in a source control system, or if you are not using branches or tags yet, I suggest that you start there; you have even more to gain!
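As a rough sketch of the tag-based rollback described above, the helper below picks the tag to redeploy. It assumes that tags follow a version-like naming scheme (v1.0, v1.1, …) so that version order matches deployment order. The name previous_tag is made up for this example, and sort -V is the version sort found in GNU coreutils.

```shell
# Hypothetical helper: print the tag that comes right before "$1".
# Tags are read from stdin, one per line (for instance from `git tag`).
# Assumption: version order (sort -V) matches the order of deployments.
previous_tag() {
  local current="$1"
  # Appending "" forces awk to compare strings rather than numbers.
  sort -V | awk -v cur="$current" '
    ($0 "") == (cur "") { if (prev != "") print prev; exit }
    { prev = $0 }
  '
}

# Example, in a git repository:
#   git checkout "$(git tag | previous_tag v1.1)"
```

If the current tag is the oldest one (or is not found), nothing is printed, so the caller should check for an empty result before redeploying.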

Theory … and practice

Unfortunately, sometimes there is a catch. For instance, the new version of the code requires updating another component, and the new version of that component is not compatible with the old version of the code. Or, along the same lines, the problem is not in our code, but in one of its dependencies that was updated during the deployment. When we roll back the code, we then also have to remember to roll back the dependencies. And that is not always easy, or even possible! If we didn’t explicitly list the versions of all the dependencies that we use (and, recursively, the dependencies of these dependencies, and so on), it is hard to know what was installed before. With a bit of luck, it can be found in the deployment logs:

$ pip install 'Flask>=1.0'
Collecting Flask>=1.0
...
Installing collected packages: Flask
  Found existing installation: Flask 0.12.4
    Uninstalling Flask-0.12.4:
      Successfully uninstalled Flask-0.12.4
Successfully installed Flask-1.0.2

But the old versions of these dependencies also have to still be available. In the case of Flask above, everything is fine, because old versions are archived on PyPI, but that is not necessarily the case everywhere.
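If the logs are all we have, the previous versions can be pulled out of them. The sketch below assumes that the log contains the same “Uninstalling NAME-VERSION:” lines as the pip output shown above (pip’s output format is not guaranteed to be stable across versions). The name previous_versions and the file names in the example are made up.

```shell
# Hypothetical helper: read a deployment log on stdin and print one
# "name==version" line for each package that pip uninstalled.
# Assumption: the log has lines like "    Uninstalling Flask-0.12.4:".
previous_versions() {
  sed -n 's/^ *Uninstalling \(.*\)-\([^-]*\):$/\1==\2/p'
}

# Example (deploy.log and rollback.txt are placeholder names):
#   previous_versions < deploy.log > rollback.txt
#   pip install -r rollback.txt
```

The output uses the requirements file syntax, so it can be fed back to pip install -r to reinstall the old versions, as long as they are still available.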

It can also happen that the deployment process fails, but only on some servers. For instance, the deployment might require a lot of disk space: because it downloads and transforms large assets, or because it compiles sizable dependencies like ffmpeg. These operations always work on a freshly installed server (where the disk is empty) but will fail if we attempt a deployment on a server that has been around for longer, and whose disks are fuller.
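One way to catch this kind of failure early is to check the available disk space before starting the deployment. This is only a sketch: has_free_space is a hypothetical helper, and the threshold is something that you would have to estimate for your own deployment.

```shell
# Hypothetical helper: succeed if the filesystem holding "$1"
# has at least "$2" KiB available.
# df -P gives one line per filesystem, -k reports sizes in 1024-byte blocks;
# the 4th column of the 2nd line is the available space.
has_free_space() {
  local path="$1" needed_kib="$2" avail_kib
  avail_kib=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge "$needed_kib" ]
}

# Example (the path and the 2 GiB threshold are assumptions):
#   has_free_space /var/lib/app 2097152 || echo "not enough disk space"
```

This doesn’t make the deployment more reliable by itself, but at least it fails before anything has been modified on the server.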

And if we are particularly unlucky, we can also “break” the servers. By that, I mean causing a server crash, or (more subtly) inadvertently preventing future connections to the server (and therefore preventing us from fixing the deployment problem).

Fortunately, all the problems that I just described are rare. Unfortunately, they all end up happening to us some day. And the day it happens, finding the source of the problem is not always easy or quick. We want to go back to the previous version as fast as possible, and without feeling like we are playing poker.

This is where containers (and immutable infrastructure in general) will save the day.

Immutable infrastructure

The principle of immutable infrastructure is that we never modify a server. When we want to deploy a new version, we take a new server, we install the new version on that new server, then we replace the old server with the new one.

So, when we want to roll back, we just have to take the old server out of the closet and put it back into service.

The concept is simple; its implementation is less so.

If we use physical machines, the process is particularly heavy. We can use techniques like PXE to automatically provision new servers over their network connection, without physical intervention. But it is slow, expensive, and it requires skills that are hard to come by.

With virtual machines, the strategy is already more realistic. We can easily start and deploy virtual machines automatically: every public or private cloud worthy of the name offers an API and/or a CLI that lets us write scripts to launch servers.

Furthermore, tools like HashiCorp’s Packer let us create “golden images” of servers; for instance, if we use AWS, we can use Packer to automatically create an AMI (virtual machine image) each time we want to deploy. To go to production, we launch virtual machines with the image that we just created; and to roll back, we launch virtual machines with the previous version.

Move fast and break things

From there, we can do even better. When we go to production on the new servers, instead of shutting down the old ones, we can set them aside. The most radical way is to unplug them from the network; but we can also (in a slightly more refined way) take them out of the load balancers (or disconnect them from the message queues in the case of asynchronous workers). Then, when we want to roll back, we just have to plug the network back in (or put the backends back in the load balancer): it’s very easy, very fast, and also very reliable.

This idea makes it possible to implement two specific techniques: blue green deployment and canary releases.

In a blue green deployment, when we deploy a new version, we deploy a new set of servers (the green set) to replace the old one (the blue set); then, we switch all the traffic from one stack to the other. It is a bit like flipping a railroad switch, but at the level of our load balancers. If there is a problem, all we have to do is switch back to the old stack.
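The same switch can be mimicked on a single machine, which is sometimes handy to get a feel for the idea. In the sketch below, a symbolic link stands in for the load balancer, and two directories stand in for the blue and green stacks. The switch_to function and all the paths are made up for this example; a real setup would reconfigure a load balancer instead.

```shell
# Hypothetical helper: make the link "$1" point to the stack in "$2".
# -s: create a symbolic link, -f: replace the existing link,
# -n: do not follow the existing link if it points to a directory.
switch_to() {
  local link="$1" stack="$2"
  ln -sfn "$stack" "$link"
}

# Example (paths are assumptions):
#   switch_to /srv/app/current /srv/app/green   # go live on green
#   switch_to /srv/app/current /srv/app/blue    # roll back to blue
```

The important property is that the blue stack is left untouched while green is live, so going back is a single, cheap operation.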

A canary release is a release that is only exposed to a small number of users. Instead of switching all the traffic to the new version, we only send part of it. Depending on the case, this can be a fraction of the requests, or only the requests of some users, for instance. Then, we carefully watch what happens for these requests (or these users). If all goes well, we can send all the traffic to the new version (or even ramp up progressively). If our metrics show that error rates or latency are higher on the new version, or if users report problems, we go back to the original version; and in doing so, we only impacted a tiny fraction of the traffic (or of the users). Most of them didn’t even see the problem happen.

(The name canary release comes from the canaries that were used in coal mines to detect toxic gases like carbon monoxide: miners would carry a canary in a cage, and if the concentration of toxic gas got too high, the poor canary would pass out; but since canaries are more sensitive than humans, this would happen before the miners were affected, leaving them enough time to turn around and get back to safety.)
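To make the “only some users” variant more concrete, here is a minimal sketch of one way to decide which version a given user gets. It hashes the user identifier into one of 100 buckets, so that a given user always lands on the same version. The function name and the choice of cksum as the hash are assumptions made for this example; in practice, this logic usually lives in the load balancer or in the application itself.

```shell
# Hypothetical helper: print "canary" or "stable" for a given user.
# $1: user identifier, $2: percentage of users (0-100) sent to the canary.
route_for_user() {
  local user="$1" percent="$2" bucket
  # cksum prints "<checksum> <size>"; we keep the checksum and
  # reduce it to a bucket number between 0 and 99.
  bucket=$(( $(printf '%s' "$user" | cksum | cut -d ' ' -f 1) % 100 ))
  if [ "$bucket" -lt "$percent" ]; then
    echo canary
  else
    echo stable
  fi
}

# Example: send roughly 5% of the users to the new version.
#   route_for_user "some-user-id" 5
```

Rolling back then means setting the percentage back to 0, and ramping up means raising it progressively to 100.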

These processes have been described at length by organizations like Netflix, for instance, or Facebook. By the way, this is how Facebook was able to drop the “move fast and break things” motto, and keep only the “move fast” part.

The problem with these techniques is that they often require fairly heavy tooling, or even entire teams whose mission is to provide a development platform to the rest of the organization. Netflix employs more than 5000 people, Facebook more than 25000. Can smaller organizations afford to adopt such effective techniques?

Spoiler alert: yes!

Containers to the rescue

If you have used Docker (even very superficially), there is a good chance that you already have the skills required to perform such a rollback.

If you take care to apply a different tag each time you build an image, all your previous images remain available in case of a problem.

For instance:

# We build the image for our app
# (the final "." is the build context: the current directory) ...
docker build -t monappli:v1.0 .
# ... And we run it.
docker run -d -p 80:80 --name monappli monappli:v1.0
# ... We change the code, and we re-build ...
docker build -t monappli:v1.1 .
# ... Then we stop the old version ...
docker rm -f monappli
# ... And we run the new one (with the same name, so that we can stop it later).
docker run -d -p 80:80 --name monappli monappli:v1.1
# ... We realize that we have a problem:
# ... We stop the current version ...
docker rm -f monappli
# ... And we restart the old one.
docker run -d -p 80:80 --name monappli monappli:v1.0
# ... And voilà!

These commands (docker build/run/rm) are basic Docker commands. They are enough to perform a reliable and extremely fast rollback. No need to learn Packer or Terraform (even though they are excellent tools!), or to fine-tune scripts that drive your cloud’s CLI or API.

If you want more details, you can check the free version of our “introduction to containers” training materials (this link will take you directly to the relevant chapter).

What about orchestration?

The example above involves a single container deployed on a single server. If your application runs on a cluster (which will hopefully be the case sooner or later, if your application meets with success and the traffic that comes with it), things get more complicated.

Should we run the commands above on all our servers? In parallel, sequentially? We could. Or, we could let an orchestrator like Kubernetes take care of it for us.

With Kubernetes, if our app runs as a Deployment named monappli, moving to version v1.1 becomes:

kubectl set image deployment/monappli monappli=monappli:v1.1

This command will progressively replace the application’s containers so that they use the monappli:v1.1 image. “Progressively” means making sure that we never have:

more than one container out of service (up to Kubernetes 1.10),
more than 25% of the total out of service (from Kubernetes 1.11 on).

(Of course, these numbers are only the defaults; the exact values, either absolute or as a proportion of the total, can be adjusted for each deployment.)
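To get a feel for what a percentage like 25% means in practice, here is the arithmetic as a small sketch. It relies on the rounding rule given in the Kubernetes documentation for the maxUnavailable setting of a rolling update (a percentage is converted to an absolute number by rounding down). The function name is made up for this example.

```shell
# Hypothetical helper: how many containers may be out of service
# during a rolling update.
# $1: total number of replicas, $2: maxUnavailable as a percentage.
max_unavailable() {
  local replicas="$1" percent="$2"
  # Shell integer division truncates, which matches "rounding down".
  echo $(( replicas * percent / 100 ))
}

# Examples:
#   max_unavailable 10 25   # prints 2
#   max_unavailable 4 25    # prints 1
```

With very few replicas the result can be 0, in which case the update has to proceed by starting extra containers first (this is what the maxSurge setting is for).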

As for the rollback, you have probably guessed it, it is done with:

kubectl set image deployment/monappli monappli=monappli:v1.0

That’s all!
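Kubernetes also keeps track of the previous revisions of a Deployment, so another option is kubectl rollout undo, which goes back to the previous revision without having to name the image. The sketch below assumes that the app runs as a Deployment. The rollback function and the KUBECTL variable are made up for this example (the variable only exists so that the command can be swapped out, e.g. to print it instead of running it).

```shell
# Command used to talk to the cluster; override it to do a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: roll the Deployment named "$1"
# back to its previous revision.
rollback() {
  local deployment="$1"
  "$KUBECTL" rollout undo "deployment/$deployment"
}

# Example:
#   rollback monappli
```

Compared to setting the image explicitly, this doesn’t require remembering which version was running before.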

If you want more details, we also have a free version of our Kubernetes training materials (here again, the link takes you directly to the relevant chapter).

The advantages of containers

Deploying a container image is faster than deploying a virtual machine image. This is mechanical: a container image ships fewer components than a virtual machine image. It will therefore be faster to build, but also to deploy to the servers. And if you take advantage of Docker’s caching system, building a new image is a matter of seconds, and so is deploying it to the servers through a registry (even for a large application), thanks to the layer system used by Docker.

Starting a container is also faster than starting a virtual machine.

Finally, more and more cloud providers offer per-minute billing from the first minute, but there are still many platforms that bill by the hour; so with virtual machines, each deployment costs a bit of money for each new server that is launched.

Bottom line: using containers is not only easier, but also faster and cheaper.

Getting started with Docker and Kubernetes

As far as Docker is concerned, the community has an extremely rich collection of tutorials, both to get started and to go further. I particularly recommend the “labs” available on training.play-with-docker.com.

As for Kubernetes, likewise, you will find many tutorials and training courses, including in French.

If you prefer to be trained in person, that’s possible too!

So let me take this opportunity to mention my upcoming training sessions in Paris and in California!

There will be:

Getting started with containers on April 15-16 in Paris
Deploying apps with Kubernetes on April 23-24 in Paris
Kubernetes administration and operations on April 26 in Paris
Kubernetes for administrators and operators on June 10-11 in San Jose

The training sessions in Paris are in French. If you want to level up at full speed, you can take all 3 sessions back to back (they are designed to work together).

The training session in San Jose will take place as part of the Velocity conference.

I can also deliver custom training for your team. Feel free to contact me.

Containers: where to start?

2018-03-28T00:00:00+00:00

For a few years now, the software industry has been talking a lot about containers, and in particular about two flagship projects of that ecosystem: Docker and Kubernetes. This post gives a high-level introduction (what are containers for?) and an example roadmap that you can use on your “journey” to adopt this technology and get the most out of it.

Before we start, a quick word from the sponsor of this blog, i.e. myself!

In April, I will deliver three training sessions in Paris (in French): getting started with containers, deploying apps with Kubernetes, and Kubernetes administration and operations. Then, in June, there will be Kubernetes for administrators and operators, a two-day tutorial in English, at the O’Reilly Velocity conference in San Jose (CA). If you know someone who might be interested … I’d love if you could let them know! Thanks ♥

If you already know the basics of containers and want to see the roadmap that I suggest, it’s over here!

Why get into containers?

If you are already familiar with deployment challenges, feel free to skip directly to the next section.

The what? Deployment?

Deployment is one of the technical challenges of modern computing. To clarify: we are talking here about deploying application code on one (or many!) servers. Indeed, we rarely edit the code running on production servers directly! We generally work on a local copy. Then, the application code goes through a series of steps, more or less numerous and more or less complex, before it ends up in production, where it is accessed by our users.

In its simplest form, deploying a static website boils down to copying the site’s files to a server. We used to do that in the 90s with the FTP protocol. Nowadays, we are much more demanding: even if a site remains purely static, it is a good idea to serve it through a CDN (to offer optimal performance from anywhere on the globe). Moreover, we want to be able to do a rollback, i.e. go back to a previous version in case of a mistake (to remove contentious content, or if we goofed and accidentally deleted a whole section of the site). As a result, sophisticated services like Netlify have appeared, offering modern features while keeping the historical simplicity of “I copy my files to the server and poof, I’m done!” (Netlify is used, for instance, for the Kubernetes documentation.)

But the majority of modern web applications require much more complex operations than a simple file transfer. Some languages like Java or Go are compiled. We must make sure that the right version of the compiler (or of the interpreter, for other languages) is used. Almost all modern projects have software dependencies, and there too, we must take care to use the right versions. These versions are almost always different between the server environment and the development environment. And this is only the tip of the iceberg!

Moreover, deployment is not only about web applications, but also about all the backends of mobile applications. As for traditional applications (office software or games), they too increasingly need a backend to work.

In theory, there are (and have been for a long time!) many solid tools to address these challenges:

package managers (like npm, rpm, pip, dpkg…),
configuration management tools (like Ansible, Chef, Puppet, Salt…),
best practices such as generating golden images, blue/green deployment, etc.

In practice, these tools and practices are often hard to master. This can lead to two situations: modest organizations that cannot afford to put these methods in place (for lack of in-house expertise), and wealthier organizations, in which dedicated staff take care of them. But the latter creates a gap between the development teams and the teams in charge of deployment (the “ops”), and this gap prevents the organization from embracing a “devops” approach (where developers are able to deploy their code autonomously and reliably).

This is where containers enter the scene.

(If you read this post in French, you may have noticed that I sometimes use the English word container and sometimes the French word conteneur. It’s just so that nobody gets jealous!☺)

Deploying with containers

Containers solve a large part of the problems associated with deployment. How? Rather than diving into technical considerations about namespaces, control groups, and copy-on-write, I’m going to share my favorite explanation with you. To understand it, all you need is a smartphone on which you have installed applications.

Specifically, when you installed those applications (whether through Apple’s “store”, Google’s, or another manufacturer’s), all you had to do was press a button. The application downloaded itself, along with all its dependencies. And then it started without a hitch. (In principle!)

Containers give a similar result for applications that run not on a mobile phone, but on a server (or a development machine). As a system administrator, if I want to run a container on a server, I perform a very simple operation (the equivalent of the click in the app store), and a few moments later, the code in the container starts. Mobile applications abstract away the exact phone model, the version of iOS or Android installed, and the other applications present. In the same way, containers abstract away my server model (the manufacturer if it’s a physical machine, the hypervisor if it’s a virtual machine), the version of Linux (or even Windows) installed, and the other programs running on the server.

From there, containers “appeal” to (at least) two audiences.

First, developers who struggle with their workstation. Annie works on a machine running Debian GNU/Linux, Bernard on a Mac, Christophe on a PC running Windows 7, and Diane on Windows 10. If you find this disparity exaggerated, think of organizations that bring in consultants, for example. Or of the fact that, over time, the versions of Java, PHP, Python, etc. will diverge significantly from one workstation to the next.

Containers make it possible to have a consistent development environment. This works (and improves the team’s work) even if containers are limited to the workstation (and are not used on the servers). Annie, Bernard, Christophe and Diane may each have a different operating system, but if they use Docker (and its Docker for Mac and Docker for Windows variants) they can all develop very simply in Ubuntu or CentOS containers (if that is the distribution used on the servers).

When a new hire joins the team, they will be productive much faster; the same goes when someone (inside or outside the company) has to step in occasionally: no more time wasted installing dozens of dependencies, making sure all the versions are correct, etc.

Second, containers can also help the teams that take care of “going to production”, that is, the famous deployment mentioned at the beginning of this article. Instead of requiring the installation (and sometimes the upgrade) of dozens or even hundreds of dependencies, you just run a container. Better still: if something goes wrong, it is very easy to go back to the previous version. A bit as if, with a mobile application, you could install two versions side by side. The new update doesn’t work, or you don’t like it? No problem: launch the old version. Problem solved!

OK, but building a container … That’s complicated, isn’t it?

It was hard until 2013. Then, in 2013, Docker made containers (which had existed since the early 2000s) accessible to the many. As a result, writing a Dockerfile (the recipe used to build a container image) is now much easier than building a package for a package manager or getting to grips with a configuration management tool. That is what made the popularity of Docker and containers explode.

OK, where do I start?

In 7 years of experience at Docker Inc., I have had the honor of helping teams of all kinds get started with containers (with Docker or with other tools). So I’m going to give you a recipe that I have seen work many times, in organizations of all sizes (a few people or a few thousand people), for web, mobile, machine learning …

Step 1: “containerize” a first service. I say service because you don’t need to take on an entire application. You can start with a small component within a larger application. Typically, you’ll pick a service with many software dependencies and a temperamental build process, because that is precisely the kind of scenario where the progress will be most visible!

Step 2: “containerize” the other services of the application, and describe the entire application stack with a tool like Docker Compose. This will make the development process uniform for the application as a whole. At the end of this phase, you will be able to run this application identically on any workstation (macOS, Windows, Linux) in the blink of an eye.

Step 3: set up a CI/CD pipeline (continuous integration / continuous deployment) to improve code quality. There are two distinct initiatives here:

Continuous integration: every time a change is recorded in the code repository (after each “commit”), unit tests are run automatically, making it possible to detect regressions before they affect your users.
Continuous deployment: every time a change is recorded in the code repository, the new version of the code is automatically deployed to a staging (or pre-production) environment. This lets the developer (or a QA team) run functional tests on a “live” version of the application and, once again, catch problems before your users do.

Both initiatives require the ability to create ephemeral environments on the fly. Asking a system administrator to provision a set of virtual machines every time a test needs to run is out of the question! Containers are particularly well suited, because creating a container from a script (for example) is both very simple and very fast.
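
To give an idea of how little scripting this takes, here is a minimal Python sketch that assembles a docker run command for a throwaway test environment. The image name (python:3), the test command, and the /tmp/checkout path are placeholders chosen for illustration; only the --rm, -v, and -w flags come from the Docker CLI itself.

```python
import shlex


def ephemeral_test_command(image, test_cmd, source_dir):
    """Build a `docker run` invocation for a throwaway test environment.

    --rm removes the container as soon as the command exits, -v mounts the
    checked-out code at /src, and -w makes /src the working directory.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{source_dir}:/src",
        "-w", "/src",
        image,
        *test_cmd,
    ]


# Placeholder values: any image and test command would do.
cmd = ephemeral_test_command(
    "python:3", ["python", "-m", "unittest"], "/tmp/checkout"
)
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would start the container, run the tests, and remove the container once they finish.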

Step 4: extend the continuous deployment process to production. This means that every code change goes through the CI/CD stage, and if the tests pass, the containers are installed on the production servers, ready to start. Going to production can then happen very quickly (starting the new containers and stopping the old ones typically takes a few seconds), or even fully automatically if the automated tests are exhaustive enough. This last step generally relies on an orchestrator like Kubernetes, Mesos, or Swarm.

Each step brings concrete, tangible benefits. You don’t need to roll out the entire plan before seeing results! For example, you can start with the first steps, see the gains for yourself, then continue at your own pace, as your needs evolve.

Getting trained, on your own or with help

The Docker community is extremely rich in tutorials of all kinds to get started and go further. I particularly recommend the “labs” available on training.play-with-docker.com.

So let me take this opportunity to mention my upcoming training sessions in Paris and California!

There will be:

Bien démarrer avec les containers (getting started with containers), April 15-16 in Paris
Déployer ses applications avec Kubernetes (deploying your applications with Kubernetes), April 23-24 in Paris
Opérer et administrer Kubernetes (operating and administering Kubernetes), April 26 in Paris
Kubernetes for administrators and operators, June 10-11 in San Jose

The Paris sessions are in French. If you want to level up at full speed, you can take the 3 sessions back to back (they are designed to work together).

The San Jose session will take place as part of the Velocity conference.

I can also deliver custom training for your team. Don’t hesitate to contact me.

Test drive of AppSwitch, the "network stack from the future"

2018-03-13T00:00:00+00:00

I was given the opportunity to test AppSwitch, a network stack for containers and hybrid setups that promises to be super easy to deploy and configure, while offering outstanding performance. Sounds too good to be true? Let’s find out.

A bit of context

One of the best perks of my job at Docker has been the incredible connections that I was able to make in the industry. That’s how I met Dinesh Subhraveti, one of the original authors of Linux Containers. Dinesh gave me a sneak peek at his new project, AppSwitch.

AppSwitch abstracts the networking stack of an application, just like containers (and Docker in particular) abstract the compute dimension of the application. At first, I found this statement mysterious (what does it mean exactly?), bold (huge if true!), and exciting (because container networking is hard).

The state of container networking

There are (from my perspective) two major options today for container networking: CNM and CNI.

CNM, the Container Network Model, was introduced by Docker. It lets you create networks that are secure by default, in the sense that they are isolated from each other. A given container can belong to zero, one, or many networks. This is conceptually similar to VLANs, a technology that has been used for decades to partition and segregate Ethernet networks. CNM doesn’t require you to use overlay networks, but in practice, most CNM implementations will create multiple overlay networks.

CNI, the Container Network Interface, was designed for Kubernetes, but is also used by other orchestrators. With CNI, all containers are placed on one big flat network, and they can all communicate with each other. Isolation is a separate feature (implemented in Kubernetes with network policies.) On the other hand, this simpler model is easier to understand and implement from a sysadmin and netadmin point of view, since a straightforward implementation can be done with plain routing and CIDR subnets. That being said, a lot of CNI plugins still rely on some sort of overlay network anyway.

Both approaches have pros and cons, and if you ask your developers, your sysadmins, and your security team what to do, it can be very difficult to get everyone to agree! In particular, if you need to blend different platforms: containers and VMs, Swarm and Kubernetes, Google Cloud and Azure, …

That’s why something like AppSwitch is relevant, since it abstracts all that stuff. OK, but how?

How AppSwitch works

Or rather: how I think it works, in a simplified way.

AppSwitch intercepts all networking calls made by a process (or group of processes). When you create a socket, AppSwitch “sees” it. When you connect() to something, if the destination address is known to AppSwitch, it will directly plumb the connection where it needs to go. And it learns about servers when they call bind().
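
As a way to picture this, here is a toy Python model of that bookkeeping. It is only a sketch of the behavior described above, not AppSwitch’s implementation: the ToySwitch class, its method names, the server labels, and the “passthrough” fallback for unknown destinations are all assumptions made for illustration.

```python
class ToySwitch:
    """Toy model of the described behavior; not AppSwitch's actual code."""

    def __init__(self):
        self.servers = {}  # (ip, port) -> label of the server that bound it

    def on_bind(self, ip, port, server):
        # A server called bind(): remember where that address lives.
        self.servers[(ip, port)] = server

    def on_connect(self, ip, port):
        # A client called connect(): if the destination is known, plumb the
        # connection straight to it. Otherwise (an assumption of this sketch)
        # leave the call to the regular network stack.
        server = self.servers.get((ip, port))
        if server is None:
            return "passthrough"
        return f"direct:{server}"


switch = ToySwitch()
switch.on_bind("1.1.1.1", 8000, "web@node2")
known = switch.on_connect("1.1.1.1", 8000)
unknown = switch.on_connect("8.8.8.8", 53)
print(known)    # direct:web@node2
print(unknown)  # passthrough
```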

“Intercepting network calls? Isn’t that … slow?”

No! Indeed, it would be slow if it were using standard ptrace(). But instead, it is using mechanisms that are similar to the ones used to execute e.g. kernel performance profiling. These mechanisms are specifically designed to have a very low overhead.

Furthermore, AppSwitch doesn’t intercept the data path calls (like read and write). That means the I/O speeds are at least as good as native. I say at least because AppSwitch may transparently short-circuit the network endpoints over a fast UNIX connection when possible.

“Do I need to adapt or recompile my code?”

No, you don’t need to recompile, and as far as I understand, that works even if you’re not using the libc; and even if your binaries are statically linked. Cool.

In practice

The exact UX for AppSwitch is not finalized yet. In the version that I have tested, you execute your programs with a special ax executable, conceptually similar to sudo, chroot, nsenter, etc. A program started this way, and all its children, will be using AppSwitch for their network stack. It is also possible to tie AppSwitch to network namespaces or even other kinds of namespaces so that the program doesn’t have to be started with ax.

There is a demo at the end of this post, but keep in mind that the final UX may be different.

Benefits

This system has a handful of advantages. It abstracts the network stack, but it also simplifies the actual traffic on the network. In some scenarios, this is going to yield better performance. And in the long term, I believe that it will also bring subtle improvements, similar to what unikernels have done in the past (and will do in the future).

Let’s break that down quickly.

Independent of CNI, CNM, or what have you

Network mechanisms in container-land rely heavily on network namespaces. Virtual machines rely heavily on virtual NICs. AppSwitch abstracts both things away. The network API is now at the kernel level. You run Linux code? You’re good. (Windows apps are a different story, of course.)

You can connect together applications running in containers, VMs, physical machines, and it’s completely transparent.

Simpler networking stack

As we’ve seen in the introduction, overlay networks are very frequent in the world of containers. As a result, when a container communicates with another, we get network traffic looking like this:

(Slide from Laurent Bernaille’s presentation Deeper Dive In Overlay Networks, DockerCon Europe 2017.)

The useful payload in that diagram is within the black rectangle. Everything else is overhead. Granted, that overhead is small (a few bytes each time), which is why overlay networks aren’t that bad in practice (if they are implemented correctly). But there is also a significant operational cost, as those layers add complexity to the system, making it rigid and/or difficult to set up and operate.

AppSwitch lets us get rid of these layers, because once a connection has been identified at the socket level, we do not need any of the other identity information. It reminds me a little bit of ATM (or the more recent MPLS), where packets do not contain full information like “this is a packet from host H1 on port P1 to host H2 on port P2.” Instead, each packet carries only a short label, and that label is enough for the recipient to know which flow the packet belongs to. AppSwitch somehow seems to do this without touching the packets or even having access to them.

Faster local communication

One thing that network people like to do is to complain about the performance of the Linux bridge. The core of the issue is that the Linux bridge code was single threaded (I don’t know if that’s still the case), and this would slow down container-to-container communication (as well as anything going over a Linux bridge). There are remediations (like using Open vSwitch, for instance) but AppSwitch lets us sidestep the problem entirely.

In the special case of two containers communicating locally, the classic flow would look like this:

write() -> TCP -> IP -> veth -> bridge -> veth -> IP -> TCP -> read()

And with AppSwitch, it becomes this:

write() -> UNIX -> read()

There is not even a real IP stack in that scenario and the application doesn’t even notice it!
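
The reason the application doesn’t notice can be sketched in a few lines of Python: code written against the socket API behaves the same whether the two ends are joined by a UNIX socket or by TCP. This is only an illustration using the standard socket module. The TCP leg goes over the loopback interface rather than a veth pair and a bridge, the exchange helper is a name made up for this sketch, and nothing here involves AppSwitch itself.

```python
import socket


def exchange(writer, reader, payload):
    """Send payload on one socket and read it back from the other."""
    writer.sendall(payload)
    received = b""
    while len(received) < len(payload):
        chunk = reader.recv(4096)
        if not chunk:
            break
        received += chunk
    return received


# write() -> UNIX -> read(): a connected pair of UNIX stream sockets.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
unix_result = exchange(left, right, b"hello")
left.close()
right.close()

# write() -> TCP -> IP -> loopback -> IP -> TCP -> read()
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server, _ = listener.accept()
tcp_result = exchange(client, server, b"hello")
for sock in (client, server, listener):
    sock.close()

# The same function, unchanged, worked over both transports.
print(unix_result == tcp_result)
```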

Demo

Alright, a little less conversation a little more action!

Remember: the exact UX of AppSwitch will probably be different, but this is what the version I tested looks like.

Getting started

First of all, since I wanted to understand the installation process from A to Z, I did carefully read the docs, and then I decided to reduce the instructions to the bare minimum.

First, I clone the AppSwitch repository:

git clone git@github.com:appswitch/appswitch
cd appswitch

Then, I compile the kernel module. Wait, a kernel module? As I mentioned earlier, this version of AppSwitch works by intercepting network calls at the kernel level. But I’m told that in a future version, it will be able to offer the same functionality (and performance!) without needing the kernel module.

make trap
cp trap/ax.ko ~

Next up, I build the AppSwitch userland piece. That part is written in Go, and currently recommends Go 1.8. I’m using a trick to run the build with Go 1.8, regardless of the exact version of Go on my machine:

docker run \
	-v /usr/local/sbin:/go/bin \
	-v $(pwd):/go/src/github.com/appswitch/appswitch \
	golang:1.8 go get -v github.com/appswitch/appswitch/ax

(If you just went “WAT!” at this and are curious, you can check this other blog post for crunchy details.)

Then I copy the module and userland binary to my test cluster. (The IP addresses of my test machines are in ~/hosts.txt, and I use parallel-ssh to control my machines.)

tar -cf- /usr/local/sbin/ax* ~/ax.ko | 
	parallel-ssh -I -h ~/hosts.txt -O StrictHostKeyChecking=no \
	sudo tar -C/ -xf-

We can now load the module:

parallel-ssh -h ~/hosts.txt -O StrictHostKeyChecking=no \
	sudo insmod ~/ax.ko

Next, we need to run the AppSwitch daemon on our cluster. AppSwitch is structured like the early versions of Docker: the daemon and client are packaged together as one statically linked binary. There is another similarity: the daemon exposes a REST API that the client consumes.

The exact command that we need to run is shown below; it boils down to running ax -daemon -service.neighbors X where X is the address of another node (or several comma-separated addresses). You don’t need to specify all nodes: AppSwitch will use Serf to establish cluster membership. I specify two nodes in the example below to be on the safe side in case one node is down for an extended period of time.

The whole command is wrapped within a systemd unit, because why not.

parallel-ssh -I -h ~/hosts.txt sudo tee /etc/systemd/system/appswitch.service <<EOF
[Unit]
Description=AppSwitch

[Service]
ExecStart=/usr/local/sbin/ax -daemon -service.neighbors 10.0.0.1,10.0.0.2

[Install]
WantedBy=multi-user.target
EOF

Then we fire up that systemd unit:

parallel-ssh -h ~/hosts.txt sudo systemctl start appswitch

At this point, AppSwitch is running on the whole cluster. Nodes can be added and removed at will.

Basic use

To run a server through AppSwitch, I have to give it an IP identity, like this:

ax -ip 1.1.1.1 python3 -m http.server

(This command runs a static HTTP server on port 8000.)

Then, any process running through AppSwitch, on any node, can access that service by referencing that IP address:

ax curl -I 1.1.1.1:8000
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.5.2
Date: Thu, 08 Mar 2018 03:11:11 GMT
Content-type: text/html; charset=utf-8
Content-Length: 623

Advanced use

AppSwitch also lets us set a name for each service, and then use that name to connect to it:

# Server
ax -name web python3 -m http.server
# Client
ax curl web:8000

It also gives us load balancing, out of the box, just by running multiple server processes with the same name or IP address identity.
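
The idea can be pictured with a toy Python model in which several servers register under the same name and clients are spread across them. I didn’t check which balancing policy AppSwitch actually uses, so the round-robin rotation below is an assumption, as are the NamedBackends class and the node labels.

```python
import itertools


class NamedBackends:
    """Toy model: servers registered under one name share incoming clients."""

    def __init__(self):
        self.backends = {}  # name -> list of backend labels
        self.cursors = {}   # name -> round-robin iterator over those labels

    def register(self, name, backend):
        self.backends.setdefault(name, []).append(backend)
        # Rebuild the iterator so the new backend joins the rotation.
        self.cursors[name] = itertools.cycle(list(self.backends[name]))

    def pick(self, name):
        # Round robin is an assumption of this sketch, not a documented policy.
        return next(self.cursors[name])


pool = NamedBackends()
pool.register("web", "node1")
pool.register("web", "node2")
picks = [pool.pick("web") for _ in range(4)]
print(picks)
```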

There is an interesting-looking system of labels that controls which clients can see and communicate with which servers, but I didn’t investigate that in depth.

Performance

That part is particularly interesting. I benchmarked raw transfers between two EC2 VMs, measured with iperf. AppSwitch performance was so close to native performance, that it was indistinguishable. In fact, on average, AppSwitch was even slightly faster than native! (On my test machines, I saw 980 Mb/s with AppSwitch and 950 Mb/s without.) It is probably a coincidence; perhaps my VM had a noisy neighbor during my tests.

This funny result reminded me of a VMware benchmark I saw a while ago, where Redis ran faster on a VM than on the host. This was caused (if I remember correctly) by limiting the VM to one CPU, and pinning that CPU to a physical one on the host. This would prevent the VM (and the Redis process inside) from switching CPUs (and the associated cache misses). The moral of the story is that sometimes, we get seemingly impossible results, but if we dig enough, there is a perfectly logical explanation. Perhaps there is a similar story here as well, who knows.

I also conducted tests with many small parallel requests. The performance here was significantly lower, but when I tried to figure out why, I noticed that systemd-journald was using up most of the CPU on the machine. It turns out that the debug build of AppSwitch that I’m running is pretty verbose about what it does. At some point I’d like to do more testing, but for now I was happy with the results.

Conclusions

With virtual machines, the interface between the VM and the rest of the world is the API with the hypervisor. It’s a small API, but it’s specific to each hypervisor, and it is very far from what our applications need.

With containers, the interface between the container and the rest of the world is the kernel syscalls ABI. It’s a much bigger API, and it’s specific to Linux. It’s also a very stable API, because each time somebody breaks it (by design or by mistake) they receive a deluge of profanity and verbal abuse.

Container engines like Docker give us a very efficient way to abstract compute resources: a container image for Linux x86_64 can run on pretty much any Linux x86_64 machine (and will eventually run at near-native speed on Windows x86_64 machines too, thanks to some pretty cool stuff happening in the Windows ecosystem.)

However, containers don’t abstract the network stack — at all. Our applications still run in things that have (virtual) network interfaces, and communicate by sending IP packets. When you think about it, these things (interfaces and packets) are not essential, and in fact, they do not exist at the kernel syscall boundary. Most applications create sockets, connect() or bind() them, and then read and write on file descriptors. No packets, no interfaces.
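
To make that concrete, here is a minimal Python sketch of what a typical networked program does at the syscall boundary. It uses only the standard library; the loopback address and the “ping” payload are arbitrary choices for the example. Notice that the code names an address and a port, but never a network interface or a packet.

```python
import os
import socket

# Server side: create a socket, bind() it, and listen.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
listener.listen(1)

# Client side: create a socket and connect() it.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())
server, _ = listener.accept()

# From here on, both ends are plain file descriptors: write on one, read on
# the other, exactly as one would with a pipe or a file.
os.write(client.fileno(), b"ping")
data = os.read(server.fileno(), 4)
print(data)

for sock in (client, server, listener):
    sock.close()
```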

As a result, a container connected with AppSwitch doesn’t even need to have an IP address, or even a full IP stack! That’s why AppSwitch is exciting: it offers us a way to get the features that we need, without carrying the overhead of legacy concepts — just like Docker captured exactly what was needed to abstract a runtime environment, without having to deal with concepts like a PCI bus, SCSI adapter, or APIC controller.

Tax implications of relocating to the US

2018-02-26T00:00:00+00:00

This is a boring post about boring tax stuff. It’s boring but I wish I had known that when I moved to the US — it would have saved me more than $10K.

TL;DR: if you live in the US and own shares in foreign companies (even something tiny), you are supposed to declare it to the IRS each year. You won’t be taxed on it, but you have to declare it. Nobody told me anything about that when I moved to the US. When I learned about it, the compliance procedure ended up costing me more than $10K. (It wouldn’t have cost me anything if I had known ahead of time.)

This post will be interesting for people who are US tax residents and own stuff outside of the US. This is the case for many tech workers expatriating to the US, especially the ones who have non-zero work experience, because they are very likely to have earned some equity in their origin country: stocks, stock options, contributions to retirement plans … And they are also very likely to have founded or taken participations in small companies. (And of course, this is not exclusive to tech workers!)

Keep in mind that this is just the results of my own research, relevant to my own situation. Your situation is probably different. Do your own research before deciding what you should or shouldn’t do!

Should I bother?

You should bother if you are a US tax resident and you own stuff outside of the US.

Am I a US tax resident?

If you are a US citizen or a permanent resident (i.e. a green card holder), then you are a US tax resident, regardless of where you live and work. (This doesn’t mean that you will pay taxes to the US; but you must file with the IRS every year.)

If you are in the US with a work visa (e.g. H1B or L1) and live in the US more than half of the year, you also are a tax resident. In other words, when you expatriate to the US, if you arrive before July 1st, you will probably be a tax resident the first year; otherwise you will start being a tax resident the second year.

Do I own stuff outside of the US?

Almost certainly, yes. You probably have bank accounts, unless you closed them just before moving to the US.

The IRS wants to know about everything you own outside of the US. Yes, freaking everything.

Even your bank accounts should be declared (that’s the infamous FBAR, for Foreign Bank Account Report), except if they are below $10K. Then you can skip them, because the IRS doesn’t give a damn about amounts that low.

I was told about the $10K rule, so I thought I was fine, because I had less than $10K on my French accounts when I left France. But there are a few “gotchas.”

First of all, what matters is not how much money you have at the end of the year, but the maximum value you had at any given point in time. In my case, that was fine, because I moved to the US in February and had been preparing for that before, so my French accounts had been low the whole year anyway. Cool.

Next, what matters is not individual accounts, but the aggregate (total) value. So if you have two accounts with $6K each, you should report them.

Combining these two rules gets really interesting. One CPA told me that if you have $6K in one account, and move it to another account, then in theory it gets you above the limit. I couldn’t find a definitive answer to that question, and it’s not a huge deal anyway, but it shows how silly the whole thing can be. In my case, I was fine the first few years (I was just keeping a few thousand EUR in my French account to pay for my health insurance, my French mobile phone so I could keep my cell number, that kind of stuff) but a few years later, I ended up having more than $10K spread across a few accounts (yay lucky me!) so I had to declare these.
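
The two readings of the rule can be compared with a small calculation. The sketch below is only an illustration of the ambiguity described above: the balances are made up (the $6K transfer and the $10K threshold are the only figures taken from this post), and I don’t know which reading the IRS actually applies.

```python
THRESHOLD_USD = 10_000

# Hypothetical balances in USD at four points in the year:
# $6K sits in account A, then gets moved to account B.
balances = {
    "A": [6_000, 6_000, 0, 0],
    "B": [0, 0, 6_000, 6_000],
}

# Reading 1 (the CPA's "in theory"): add up each account's own maximum.
sum_of_maxima = sum(max(series) for series in balances.values())

# Reading 2: the highest combined balance held at any single point in time.
peak_of_total = max(sum(point) for point in zip(*balances.values()))

print(sum_of_maxima, sum_of_maxima > THRESHOLD_USD)  # 12000 True
print(peak_of_total, peak_of_total > THRESHOLD_USD)  # 6000 False
```

Under the first reading, the same $6K is counted twice and crosses the threshold; under the second, it never does.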

And finally, there is all the “non-money” stuff. You have shares in a company? You must declare them. Yes, even if you just have 1% in your buddy’s tiny company. You have any kind of retirement plan — the foreign equivalent of an IRA or 401k? You must declare it. Again: you won’t pay taxes on it, but the IRS wants to know.

Cherry on top: if you have access to foreign accounts that aren’t yours, e.g. if you have a checkbook or are an authorized user on such accounts (because you’re an admin for your company with overseas offices or whatever), you also have to declare that. But hopefully, if you are in that situation, your company will help you to do that. Hopefully.

Come on, should I really bother ?!?

A.k.a, “what happens if I don’t declare my stuff?”

This is a good question.

We could unpack it in multiple sub-questions:

What happens if I don’t declare?
How will the IRS find out, anyway?

The penalties are brutal. If you go through the Streamlined Foreign Offshore Procedures, which is a way to say “oops I didn’t know I had to do that, so here are my updated tax returns” the IRS gives you a fine. The amounts vary. In my case it was about 5% of all amounts + tax preparation costs; a bit more than $10K. If you’re good with administrative paperwork, perhaps you can do it yourself and save the tax preparation costs; but I’m not good with administrative paperwork, and I didn’t want to screw up and I wanted to be in a good position if I get audited by the IRS.

If the IRS catches you, then it’s a whole different story. The fines can reach $100K, or even half of the value of the assets. This page on the IRS website will give you a rough idea. (They moved stuff around recently so I cannot find the page I had specifically for individuals.)

Now, how would they find out? They probably won’t, if you have a relatively modest amount of money (less than $100K) and aren’t constantly moving it around. Same thing if you have a modest amount of equity. However, the US (and a lot of countries with which the US has treaties and such) do have systems to report “high” transactions. Generally there are no pre-defined thresholds (to avoid people constantly staying below the threshold) but in theory, if a big amount of money (something between $10K and $100K) suddenly shows up on your account, your bank will very probably ask you where this comes from, and if you can’t or won’t tell, they alert TRACFIN (in France) or the local equivalent.

In 2010, the US passed FATCA (not to be confused with FACTA), which gives the IRS better visibility into the foreign assets of US tax residents. This law was then progressively translated into treaties, on a case-by-case basis, with other countries. The IRS is trying to get rid of the bank secrecy of places that regularly qualify for the Tax Evasion And Other White-Collar Crimes Olympics, like Switzerland, Hong Kong, Luxembourg … and the way they do that is by pressuring foreign banks through their US branches. This means that as soon as your foreign bank knows that you are a US tax resident, they have to transmit information to the IRS. (Look up FATCA AEI if you want to know more about that.)

In my specific case, it was a simple decision. I have a small amount of equity in Docker. If Docker eventually gets acquired or goes public, I hope that this equity will be worth a million USD or two. Perhaps more if my ex-coworkers do great. I don’t know if that would make me rich enough to be on the radar for the IRS, but I didn’t want to take any chances. I decided that it was safer for me to comply, and file (sometimes re-file) everything that was needed, to make sure that I’d be 100% clean with the IRS. This did cost me a bunch of money, but brought me a lot of peace of mind.

PFIC and 5471

Now we get to the really ugly and annoying part, and this is where you will understand why there is more venture capital available in the US than anywhere else.

If you own more than 10% of a foreign company

If you own more than 10% of a foreign (non-US) company, this company must file a form 5471 with the IRS. If the company doesn’t do it, then you must do it. What the hell is that form? As I understand it, it is a kind of financial x-ray for the IRS. It has a lot of accounting and legal information about the company.

This is mind-blowingly annoying, because just preparing that form is going to cost you about $500 per year. (Some CPAs can be cheaper, or more expensive; that’s just an average.)

Yes, if you own 25% of a company worth 4,000 EUR, somebody must pay $500 every year so that you can file your US taxes correctly.

You may want to sell these shares before moving to the US, just in case. Again: the IRS probably doesn’t give a damn about it, but if they wanted, they could screw you big time.

Passive Foreign Investment Companies

If you own shares in a PFIC, things also get complicated. “Jérôme, I never heard about PFIC before, so I’m pretty sure that I don’t own shares in one!” Wrong. I had shares in a PFIC even though I had no idea!

A few years before moving to the US, I had invested in a company in France. But I didn’t invest directly in the company. I invested in a holding company, and the holding company invested in the company. This allows the company to keep a simpler “cap table” (list of people owning shares), since instead of having 10 extra investors, they have one, and that investor is a company grouping the 10 new investors. It makes things simpler for a lot of people.

Except that this holding is then considered as a Passive Foreign Investment Company, or PFIC. (If you want a moderate amount of details, you can check the Wikipedia definition for PFIC.)

Alright, what does that mean in practical terms? First of all, more forms and paperwork. And also, since this PFIC value was in Euros, the value (converted to USD) did change over time. And therefore, I had to report capital gains and losses — even though I didn’t buy or sell anything, even though that company didn’t buy or sell anything, even though my shares didn’t generate any dividends whatsoever. Sounds completely crazy? Yup!
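
Here is the currency effect in isolation, as a small Python calculation. The 10,000 EUR stake and the two exchange rates are hypothetical numbers picked for the example, and real PFIC reporting involves more than this one subtraction; the point is only that the USD figure moves while nothing else does.

```python
def usd_value(value_eur, eur_usd_rate):
    """Convert a EUR amount to USD at the given rate, rounded to cents."""
    return round(value_eur * eur_usd_rate, 2)


holding_eur = 10_000  # hypothetical stake, unchanged all year
rate_start = 1.10     # hypothetical EUR/USD rate at the start of the year
rate_end = 1.20       # hypothetical EUR/USD rate at the end of the year

start_usd = usd_value(holding_eur, rate_start)
end_usd = usd_value(holding_eur, rate_end)
paper_gain_usd = round(end_usd - start_usd, 2)

# No purchase, no sale, no dividend: the gain comes from the rate alone.
print(start_usd, end_usd, paper_gain_usd)
```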

Consequences for investors

This means that if I want to invest money outside of the US, it will incur significant overheads for me at tax season. Of course, if I’m investing one million dollars, I probably don’t care about paying $1K every year in extra tax preparation fees. But for more reasonable amounts … Even at $100K, these fees will wipe out about 1% of the investment each year. (And of course, any gains that you realize will be taxed on top of that.)
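To put numbers on this, here is a minimal sketch of that overhead calculation. It assumes a flat $1,000 per year in extra preparation fees (the figure used above) and ignores everything else; the function name is made up for the illustration.

```python
def yearly_fee_overhead(investment_usd, yearly_fees_usd=1_000):
    """Return the yearly tax preparation fees as a percentage of the investment."""
    return 100 * yearly_fees_usd / investment_usd


# Compare a few investment sizes against the same flat yearly fee.
for amount in (1_000_000, 100_000, 10_000):
    print(f"${amount:>9,}: {yearly_fee_overhead(amount):.1f}% of the investment per year")
```

The smaller the investment, the larger the share eaten by fees that do not shrink with it.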

That’s one of the reasons why it’s way easier, as a startup, to find capital in the US than anywhere else. Because the US tax law makes it inconvenient and expensive for US investors to put their money abroad. Next time you hear people complaining about how it’s hard to raise seed money outside of the US, think about it.

The bottom line

All these reporting requirements are primarily intended to track money laundering and tax evasion schemes, which is a noble endeavor. They end up imposing absurd constraints on completely lawful individuals, but since these individuals are either foreign nationals living in the US, or people investing abroad (i.e. money getting out of the US), the US has zero incentive to make that less complicated.

On the one hand, a lot of these requirements and constraints are “rich people problems,” applying to you only if you’re wealthy enough to invest or own equity. On the other hand, it is increasingly frequent in the tech sector to be compensated with stock or stock options, so you may qualify sooner than you think!

The really important bit is that if you are aware of these requirements ahead of time (i.e. before expatriating to the US), you can save yourself a lot of trouble (and money) by making sure that your bank accounts are below $10K, and selling your stock — or at least deciding what is worth keeping.

We’ll be back soon with more exciting posts about container technology, devops, mental health, and diversity!

Seven years at Docker

2018-02-17T00:00:00+00:00

TL,DR: I have left Docker Inc. to take a sabbatical and recover from depression and burnout. I plan to dedicate the next six months to family, friends, meditation, music, and generally speaking, enjoy life to recharge for whatever will come next.

This text is an adaptation of the message that I sent last week to my coworkers to announce my departure. I’m now sharing it with a wider audience, because mental health is serious stuff, and I wish we all felt more comfortable talking about it. I also wanted to share with my friends, the Docker community, the container ecosystem, and beyond, some thoughts about what has been for me an incredible journey.

February 6th was my last day at Docker. Seven years and one day earlier, I boarded a big bird of metal that would take Sam Alba, Sébastien Pahl, and me from Paris to San Francisco, and we joined the dotCloud office on Third Street. I couldn’t imagine What Would Happen Next.

The dotCloud office at Founder’s Den, early 2011. (Credit: SFGate)

From dotCloud to Docker

In 2011, our tiny startup was fearlessly competing with Heroku, which had just been acquired by Salesforce for $250M. We were the first PaaS to support so many languages and databases, thanks to the extensive use of this obscure kun-tay-nerr technology. You could count our engineering team on one hand, and all of us were both on-call and doing customer support. We had weekly contests about who would solve the most support cases.

In 2012, I gave my first “real” talk at a “real” conference. It was about the other cool piece of tech at dotCloud: our ZeroRPC library. (One of the Xooglers who joined us back then even told us, “I wish we had something that simple and straightforward at Google!”) I’m grateful for the incredible work that my peers had put into this project, as it enabled me to speak at PyCON, and encouraged me to try and speak at more conferences.

In 2013, you know what happened: Solomon Hykes presented Docker at the same PyCON conference (one year later), and over the following months, the whole dotCloud engineering team shifted to Docker. Meanwhile, I gave my first container talk at the SCALE conference in Los Angeles; and after that talk, I was invited to present Docker in Beijing, and then in Moscow. These were incredible opportunities, both personally (I forged some long-lasting friendships during these trips) and professionally: thanks to our combined efforts, we were able to issue joint statements with Baidu and Yandex, announcing that they were now using Docker!

From SRE manager to evangelist

In 2014, I gave an average of two talks per week; but most importantly, I spoke at LinuxCon, OSCON, and LISA. I would have been satisfied with my career if it had given me the opportunity to attend these conferences; but now I was speaking there (and would be, multiple times). Again, this wouldn’t have been possible without the fantastic work done by the Docker core team. Being a developer advocate or an evangelist is generally hard; but it’s markedly easier when your product is as helpful and as approachable as Docker. That year, I also turned down an invitation to speak at AWS re:invent because they didn’t have a code of conduct back then. (They eventually added one; probably not by my sole request, but I like to think that it contributed!)

In 2015, [HEAVY SPEAKING INTENSIFIES]. I enabled our partners in Europe by training about a hundred customers and other trainers in a couple of weeks, and gave my first keynotes in Paris and São Paulo. For the first time, I found the courage to speak on stage about sexism and harassment in open source communities, and the reactions I got made me realize that these problems were far worse and more prevalent than I had thought. I was on stage 7 times at LinuxCon that year, and I still don’t know if that deserves an entry in the wall of fame or shame. I finally spoke at re:invent, and it was ridiculous. During that whole time, I was helped and empowered by the whole Docker team to give my best: engineering was always here for me if I had a tricky last-minute technical question; and I could also rely on everyone else in the company for logistics and overall support. That made a huge difference.

From busy to burned out

In 2016, in addition to my regular talks, I delivered an increasing number of orchestration workshops. Unfortunately, that’s also when I found my limits. I should have been kinder with myself; but I didn’t realize it until it was too late: my mental state deteriorated until I was diagnosed with depression in October. Fortunately, by that time, the company had many fantastic speakers among its ranks; and the Docker Captains program had taken off — so there was no negative impact when I shifted my focus.

I started antidepressants and therapy. Results were not encouraging at first; but after switching medication twice and finally being referred to a psychiatrist, my symptoms became easier to manage. I started having more energy, so I used it to take care of myself and do things that would make me happy. Cooking fine meals. Reading. Learning the cello. Dating. Building cool stuff with Raspberry Pis. Eventually, things got better.

In 2017, I continued to deliver workshops, and I helped to shape DockerCon’s Black Belt track. It’s hard to find words to describe how much joy and satisfaction I drew from this opportunity. In Austin, the Black Belt track is the track that got the highest ratings and attendance. I also improved the diversity of that track: in Copenhagen, the majority of the talks featured a speaker from a traditionally underrepresented background. Reaching out to these outstanding speakers, helping them when necessary, sometimes coaching them, has been one of the most rewarding steps of my career; and there again, I would never have been able to do it without the full support of our team.

Black Belt track speakers from DockerCon 2017 in Austin.

In the summer of 2017, while participating in a study about mental health, expatriation, and remote teams in the tech industry, I took the Maslach Burnout Inventory. The MBI is a test to assess burnout factors. I was in the red zone. Alas, neither my GP nor my psychiatrist knew much about burnout, and I felt on my own. Out of sheer coincidence, I ended up talking to a doctor who was more knowledgeable on that topic. I will write more about this in the future; but long story short, in September, I decided that I needed to take a break in 2018.

Before taking that break, I focused my energy on Docker’s Kubernetes strategy. One week after we announced support for Kubernetes plans in Copenhagen, I was delivering a Kubernetes workshop at ContainerCon; and I delivered that workshop 3 times internally at Docker (which gave me the perfect opportunity to visit our Raleigh office and hang out with the wonderful folks there!). The materials are available on kube.container.training, by the way.

The last two months of 2017 were a grueling struggle to figure out what would be the best way for me to take that break. I wanted to take at least 6 months off, which is more than the 12 weeks allowed by the FMLA. (The FMLA allows employees to take up to 12 weeks of unpaid leave.) Docker doesn’t have a sabbatical program, and didn’t want to create one. My doctors didn’t want to fill out the paperwork that would have allowed me to take a medical leave of absence. Switching doctors wouldn’t help because filling out that kind of paperwork for mental health reasons requires being seen over a longer period of time; and I didn’t want to wait 3 or even 6 more months, only to perhaps be denied my leave anyway. So my only solution was to quit. This would have been a financially difficult proposition, but I was able to sell a large chunk of my equity in Docker in 2017, meaning that I have a comfortable safety net for now.

From startup to sabbatical

In 2018, I’m going to take a lot of time for myself. I’m learning Rust. I’m writing a tiny Ableton clone to connect a grid controller (like the Monome or the LaunchPad) to a Raspberry Pi to play live music. I’m going to do a Vipassana meditation retreat. I hope to mentor folks who weren’t as lucky and privileged as I was, and be a better ally. The first step was to quit Docker, and that was the most difficult one; but the road ahead looks great.

A lot of people have asked me if I would be joining Heptio / Microsoft / some other company, and some folks asked if I’d be open to some consulting gigs. First of all, while I would be humbled and honored to be deemed fit to work with teams like Heptio’s or some of the Azure folks, I don’t plan on going back to full-time employment until at least September. As for consulting, sure! You can contact me here.

From me to you

One last thing — all the achievements that I listed above are not mine alone. I assume that you mostly saw my happy, productive, engaging side during all these years; but one person in particular also had to deal with me when I was heavily depressed, exhausted, struggling to perform the simplest tasks, and much less interesting to be around. My partner since 2014 supported me unconditionally all that time, and helped me walk through some of my darkest moments. I owe her more than words can tell.

I also owe a lot to a very long list of coworkers, friends, and everything in between. If we’ve worked together or collaborated in any way; if you’ve been a supportive ear or even just a smile during these years: I want you to know that these successes are also yours. I hope that our paths will cross again and that the future holds many opportunities to help each other.

Peace,

jpetazzo out 🎤💨🤚

Recovering the productivity stolen by depression with kanban and emoji

2017-12-24T00:00:00+00:00

I want to share a few organizational tools that helped me to be more productive while dealing with stress, anxiety, and depression. They include post-it notes, Trello cards, calendars, and emojis (just to name a few). I’m sharing them in the hopes that they can be a source of ideas and inspiration for those struggling with similar conditions.

This post was initially published on The Human In The Machine.

In 2016, I was diagnosed with depression, and it significantly affected the way I worked. I was less creative but more reactive. It felt impossible for me to write new content and have great ideas. All I could do was react, i.e. respond to e-mails, show up in meetings, and get things done at the last minute but at immense cost.

Thanks to therapy and medication, my depression eventually retreated, but it was still throwing the occasional wrench in my brain machinery. I was facing a growing workload, looming critical deadlines, a swirl of unrelated tasks on the professional and personal sides, and this all felt very, very overwhelming. As a result, I was procrastinating, pushing things back; which eventually made everything worse (obviously). I learned that, even though I didn’t feel anxious, these were telltale signs of anxiety and work-related stress.

To break out of the vicious circle, I decided that I should organize.

I had never been particularly good at organizing myself. In the past, I had used agendas, notepads, ticket trackers, and Trello boards when my job required it; but there was always some external stimulus and structure to guide me. (The company had a process. My project group had this one person who knew how things should be done and taught us the right way.) This time, I was on my own, and I didn’t know where to start.

TODO: make a list of things to do

To be fair, I wasn’t starting entirely from scratch. I already had a to-do list process:

take a notepad, preferably one with lines;
write one task per line, drawing a little square next to each task;
when you complete a task, tick it off;
if you decide that a task is not worth completing anymore, cross the box;
when the page gets full, copy each incomplete task to a new page and continue there;
that moment when you copy incomplete tasks is a good opportunity to review, “Do I still want to do that?”

On the to-do list above, you can see that I actually have two columns on my notebook. The left column is for the long-standing tasks; the right column is for a collection of chores and high-churn tasks that should be completed quickly; I didn’t want them to pollute the main list.

Notice how I started with a well-defined process, but I’m already amending it to accommodate my needs. Balancing rigor and flexibility was (and still is) a difficult act!

But somehow, the to-do lists weren’t working anymore; or at least, they weren’t enough. What was wrong?

The secret to multi-tasking: do one thing at a time

It turns out that I often ended up starting one task from my to-do list, then interrupting myself with another, then being interrupted by some e-mail or notification, then feeling more attracted to another thing (that was also on my to-do list, mind you!), and at the end of the day, I felt like I had accomplished nothing because I wasn’t able to tick anything off the list.

Very providentially, it is about that time that I learned a critical piece of information about Kanban that had escaped me until then.

If you’re not familiar with Kanban (for software development, not for industrial production) I’m going to over-simplify it for you. Let’s take a board with 3 columns, labeled TODO, DOING, DONE. Tasks are written on post-it notes, which all start in the TODO column. When we start working on a task, we move it to DOING, and when it’s completed, to DONE. (We can add extra columns like TESTING or BACKLOG or BLOCKED if we want, but that’s the gist of it.) This gives us a nice, visual way to track progress, and we get gratification from moving things to the DONE column.

The essential piece of information that I had missed about Kanban, is that we should limit the number of cards in each column. If we have 10 things in DOING, it is hard to focus and get anything done.

How could I miss this? I don’t know! But that gave me an idea to enforce focus.

The most basic Kanban board ever

My Kanban board would have one column: DOING. That column would be allowed to have up to 3 tasks. I didn’t have a physical board to work with; so I decided to use my external monitor as a very expensive post-it notes holder. (I don’t often use my external monitor, because I work on the road a lot, so my usual workflow doesn’t involve an external monitor.)

In other words: whenever I would start working on something, I had to write it down on a post-it note first, and stick that note on the screen in front of me. I wouldn’t be allowed to have more than three post-it notes on the screen, so I couldn’t be working on more than three things at a time.

Inception: in the picture above, I am explaining my newfound organizational joy to a friend.

This worked pretty well. I think that the simple fact of having a constant visual reminder that I should be working on the tasks written on the post-it notes in front of me (and nothing else!) was very helpful to avoid distractions.

From that point, I started to actually get stuff done, and tick off more items from my list. YAY!

For two million pixels more

At some point, I wanted to reclaim that glorious external monitor for more useful purposes than just holding a couple of post-it notes. (I was using it sometimes to display some documentation pages, for instance.)

That’s when I decided to resurrect my Trello account — and not a minute earlier.

I think it is important to emphasize this: instead of using Trello because it might be useful, I waited until I really needed something to fit the need.

I created a new board. I initially had TODO, DOING, and DONE columns. Then I added a column for “today” and another for “this week.” The idea was to have a hierarchy:

TODO can have many tasks, but I’m not working on any of them right now;
“This week” can have up to 10 tasks, and it’s a collection of things that I want to tackle soon, but they’re still not distracting me;
“This day” can have up to 5 tasks, and I expect to start working on some of them today;
DOING was renamed to “this hour.” It can still have up to 3 tasks, and I was hoping that the new name would be a constant reminder that tasks should be short enough (to avoid never-ending sagas that won’t let you tick a box to completion). It helped!
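For readers who think better in code, the column limits above can be sketched in a few lines of Python. This is only an illustration of the idea (a board that refuses a card when the target column is full), not how Trello works; the `Board` and `ColumnFull` names are made up for the example.

```python
class ColumnFull(Exception):
    """Raised when a column has already reached its card limit."""


class Board:
    def __init__(self, limits):
        # limits maps a column name to its maximum number of cards (None = no limit).
        self.limits = limits
        self.columns = {name: [] for name in limits}

    def _has_room(self, column):
        limit = self.limits[column]
        return limit is None or len(self.columns[column]) < limit

    def add(self, column, card):
        if not self._has_room(column):
            raise ColumnFull(f"{column!r} already holds {self.limits[column]} cards")
        self.columns[column].append(card)

    def move(self, card, source, target):
        # Check the target first, so a refused move leaves the board untouched.
        if not self._has_room(target):
            raise ColumnFull(f"{target!r} already holds {self.limits[target]} cards")
        self.columns[source].remove(card)  # raises ValueError if the card is not there
        self.columns[target].append(card)


# The hierarchy described above: the closer to "now", the fewer cards allowed.
board = Board({"TODO": None, "this week": 10, "this day": 5, "this hour": 3, "DONE": None})
board.add("TODO", "write blog post")
board.move("write blog post", "TODO", "this week")
```

The point of raising an error instead of silently accepting the card is the same as with the post-it notes: to start something new, something else has to leave the column first.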

This is what my Kanban looked like. It hasn’t changed since then, by the way!

Using Trello (instead of physical post-it notes) gives me a few advantages:

I can now use the screen for other purposes if needed;
I can add a card even when I’m not in front of my computer, by using Trello’s mobile app;
I can easily attach information to a task, making sure that it will be available when needed.

Let me elaborate on that last point. If one of my tasks is to read an article and send a summary to someone, I can put the article URL and that person’s address in the Trello task. Before, I would have needed to keep an extra message in my inbox, or a note in another system. Reducing the overall number of items helps a lot to have better visibility; at least for me.

Do not use all the fancy features

A few times, I thought about using e.g. Slack notifications, due dates and calendar integration … But I resisted. My process isn’t to use a feature just because it’s there and might be useful. I want to use a feature only if it clearly solves a use-case.

For instance, due dates sounded cool! But:

they overloaded the display;
exporting them to a calendar would have been a catastrophe given the state of my agenda;
I would probably end up shuffling them constantly and wasting a lot of time in the process.

So, no due dates.

No exceptions allowed! Except when they are

I made an exception for checklists, because they are perfect for short tasks and chores that don’t quite deserve their own Trello card, but that I want to track anyway.

For instance, some days, I have a card “house chores” that can look like this:

And when working on training content, sometimes I have a list of short actions, ideas, concepts … that I want to add; these might also warrant a checklist rather than individual cards:

My “rules” for checklists are as follows:

no more than 10 items per checklist;
once all items are done, move the card to DONE;
after ticking one item, I am allowed to send back the card to the previous column.

To clarify the last point: if I have my house chores list in the “doing this hour” column, I am allowed (and sometimes encouraged) to do one task and then send it back to “doing this day” or even “doing this week.”

The DONE column

You might wonder what is the point of the DONE column. Can’t we just archive a card (i.e. remove it from the board) when the corresponding action is done?

In traditional Kanban, the DONE column is here for the benefits of other people too, so they can know that something has been completed. It is also helpful if you do e.g. sprint retrospectives.

If I am alone using this board, why keep the DONE column?

Because it feels good! I’m not going to pretend that I am overpowered by joy each time that I move a card to the DONE column; or that at the end of the day, seeing all these cards fills me with a sense of achievement.

But close.

Seriously, I believe in the positive reinforcement effect of having a visual cue of what I have accomplished. Furthermore, at the end of the day, I can have my own miniature retrospective and remind myself that I did something that day.

In fact, I took the habit of archiving the cards in the DONE column every morning. It may or may not give me a little morale boost (“Yesterday, you did all these things! Today is going to be another great day! Go you!”), but it can’t hurt.

Dealing with recurring tasks

This system might sound like a well-oiled machine, but it had two major flaws.

The first one is that I had to compel myself to sit in front of my Trello board at least once every morning and shuffle cards around. Hopefully, the “this hour” and “this day” columns would be empty or almost empty. This would be the only moment where I would allow myself big changes, basically anything that moves a card by more than a column at a time.

If I skipped that card shuffle session, my whole day would go to waste. Instead of being a glorious day of untamed productivity, it would merely disappear in the shadow of the previous day.

The second flaw was the lack of support for recurring tasks. Every day, I wanted to make sure that I:

filled out my mood chart;
checked if I had a new message from my therapist, and replied if necessary;
checked my e-mail at least once;
shuffled my Trello cards as explained above.

(On the topic of e-mails: “checking my e-mail” means going over my inbox and making sure that there is no urgent e-mail requiring immediate attention. E-mails that can be dealt with in less than 5 minutes are dealt with; other ones generally get a Trello card or are just left sitting there.)

I needed to find a way to remember my daily tasks; and that would solve both problems at the same time.

Ideally, I also needed a way that would be habit-forming, i.e. that would help me to think about these things naturally, without external help.

Behavioral sciences to the rescue!

The emoji motivational calendar

A friend of mine once told me that her therapist had given her a calendar, on which she was supposed to affix stickers when she completed specific tasks on specific days, and the idea stuck.

I also read somewhere that if you keep doing something regularly, you can form a habit.

And finally, I’m an unapologetic fan of emoji.

So I came up with the idea of the emoji motivational calendar, a calendar that has a few checkboxes every day, corresponding to the regular actions that need to be done that day.

This is what the first iteration looked like.

It shows the entire month, and each day, there are four little checkboxes, each below an emoji representing the task to do:

📊 fill out my mood chart;
👩🏼‍⚕️ check my therapist’s messages;
✉️ check my e-mail inbox;
🗂 shuffle my Trello cards.

I printed the calendar and left it next to my keyboard on my desk.

Every morning, I started the day with these tasks. Each time I completed a task, I would check the box below it (by filling it entirely). Once all the recurring daily tasks were done, I would work on the tasks from my Trello-Kanban board.

(You can see the “missed” tasks with a single slash in the box below them.)

Did that work?

This was very helpful! As you can see on the picture above, even if I missed a few things here and there, I was overall consistently doing my daily tasks. Sometimes I would catch myself in the middle of the day, “oh shoot I forgot!” but it was still better than realizing it the day after.

However, it hasn’t quite formed a habit—yet. If I don’t have my calendar in a conspicuous place next to my computer, I’m likely to forget to do things. This typically causes problems when I’m traveling to a conference, or if I have a full-day speaking engagement; anything that requires me to wake up early and start my day right away instead of going through my usual morning routine.

Refining the process

Each month, I edit a new version of the calendar, with a few tweaks. Here is the version for the current month:

In this version, I made sure that the emojis were in the most logical order (in the first one, the “shuffle cards” emoji was before the “check inbox” one).

I divided the actions between “before work” and “after work” (that’s what the vertical divider is for).

I added two easy actions at the beginning and end of each day. Those bright smiling emoji that you see? They mean “brush your teeth,” because it’s an easy one to get even on the most difficult days; and a good reminder precisely for these days.

“Fill out my mood chart” is now an evening action rather than a morning one. (Small implementation detail but I found it easier that way.)

You can also see a particular emoji, different for every day of the week. I won’t explain them all; just know that the little music notes mean “play some music.” I found that this was a clever way to build my self-care routine (at least some parts of it!) into the system.

The last thing that is a bit counter-intuitive is that I allow myself to be “late” on any action, as long as it’s not due again yet. For instance, “Monday” has music notes. If I don’t play music Monday, I leave the box unchecked; but if I play music Wednesday, I’m allowed to check it. However, if by Friday (the next day with music notes) I haven’t played any music, then it’s too late: I put a slash in that box. Likewise, if I don’t brush my teeth in the evening, I can’t make up for it by doing it the next morning, since there is another smiling face in the morning. 😁
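Since that rule is easier to apply than to explain, here is a minimal sketch of it in Python. Days are plain numbers (say, days of the month), and the function name and the three state names are made up for the illustration.

```python
def box_state(box_day, next_due_day, done_days, today):
    """Return the state of the box for an action scheduled on box_day.

    next_due_day is the next day on which the same action is scheduled again.
    done_days is the set of days on which the action was actually performed.
    """
    # Doing the action on its day, or on any later day before it is due again, counts.
    if any(box_day <= day < next_due_day for day in done_days):
        return "checked"
    # Once the action is due again, it is too late to make up for it.
    if today >= next_due_day:
        return "slashed"
    # Otherwise the box stays empty for now and can still be checked later.
    return "open"


# Music is scheduled on Monday (day 1) and Friday (day 5); I played on Wednesday (day 3).
print(box_state(box_day=1, next_due_day=5, done_days={3}, today=3))
```

Playing on Friday itself would count for Friday’s box, not for Monday’s, which is why the comparison with the next due day is strict.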

Implementation details

If you are inspired by this calendar and want to do your own, feel free to use my template, with the following caveat: I couldn’t find a way to print emoji correctly, so I just take a screenshot of the “print preview” screen, and then I print that screenshot. True (embarrassing) story, but it works.

Results

These tricks and techniques helped me to recover a decent level of productivity and to feel much better about myself.

That being said, if you get the impression that this made me a super-productive, mega-organized person, that is unfortunately very far from the truth. At the beginning of almost every day, I stare sternly at a few cards in the “this day” column (things that were supposed to be completed the day before!) and sometimes I put them back in “this week.” By the end of the week, there are still a bunch of things to do, and often, they are the most challenging ones; the ones that I kept postponing all along.

My organization method doesn’t really help with that. However, the big difference is that these tasks don’t completely block my pipeline anymore. Instead of falling into a debilitating circle of stress, anxiety, and guilt, I can schedule (and complete) smaller, easier tasks. I make progress. And sometimes, this gives me the energy to tackle the bigger ones. Like writing this post.

Wrapping it up

I believe that organization is a very personal matter; what works for me won’t work for you and vice-versa. It’s even likely that what works for me today will not work for me next year, let alone forever.

But getting inspiration from each other’s toolboxes can help us to solve our own challenges. These tools helped me to overcome a particularly difficult moment of my life, personally and professionally, and gave me a brighter outlook. If it can give ideas that will empower a few readers to improve their lives too, I’ll consider it a wild success. And tick a box somewhere! ✅

Thanks and acknowledgements go to all the people who provided feedback that helped to shape this post; either directly by proofreading it, or indirectly by letting me bounce ideas at them. In particular AJ, Amir, Anne: I had a whole team of A-players to help me!

USB-C redux

2017-12-16T00:00:00+00:00

A few months ago, I started using a 2017 12” Macbook Air. This machine has only two ports: an audio jack, and one USB-C port. That USB-C port is the only thing you have to plug in external storage and monitors, network connectivity, and of course, a power supply. I had to do some research to understand how USB-C works, and find the perfect adapters (at least, the perfect adapters for what I do).

Here is a summary!

Update

This was written in 2017. The product recommendations in this post are probably not relevant anymore. However, the general description of how USB works (in particular, USB-C alt modes) is still relevant if you’re trying to figure this out.

I also recommend checking this article which explains how USB-C data lanes get shared when using DP alt modes; it’s useful if you plan on using high resolution and high frame rate displays (think 4K at 60 fps).

TL,DR

If you have a machine like the one I had, with only one USB-C port, and without Thunderbolt support, I recommend getting docks and dongles similar to these ones:

at home: Plugable USB-C docking station
on the go: TNP USB-C multi adapter

Note that these products may not be available anymore today, so you might have to find other ones … Sorry!

The docking station is great. Its only downside (for me!) is that it doesn’t have a DisplayPort output; but you can get 4K on the HDMI port (if your monitor supports it; some older monitors support higher resolutions only on their DP inputs). This means that you can’t use DisplayPort MST to cascade multiple screens. However, the dock has two extra DisplayLink ports (one DVI, one HDMI) that might or might not be helpful. (I’ll cover DisplayLink briefly later.)

The adapter is great too. Its only downside (again: for me!) is that it uses a lot of power, and that’s why I ended up also getting a dock. The charger that comes with the MacBook Air delivers 40W. Once I insert the adapter between the charger and the MacBook, the latter reports that it is connected to a 13W power supply. This means that the battery will charge very slowly, or even drain slowly, if you are doing CPU intensive tasks (and that machine has a very weak CPU, so sometimes “having too many tabs open” can be CPU intensive!). Also, that means that the adapter dissipates a lot of heat. Finally, if you don’t need that plethora of connectors, there are smaller adapters that you might like more.

Both worked out of the box without installing drivers (but I had other USB adapters in the past, so perhaps at some point I installed a driver that took care of business).

If your machine supports Thunderbolt 3 (like the Macbook Pros, which also have more than one USB-C port), you can also look at this Thunderbolt 3 Docking Station. (Thanks Bryan for the recommendation!)

Finally, I’m aware that these adapters are not cheap. This is not an extortionist move from the adapter lobby: as we will see when we dive into the details of USB-C, some features are simple, others require more complex circuitry.

If you’re on a budget, you may get similar functionalities by getting multiple cheaper adapters and switching between them when necessary. Then again, if you’re on a budget, I would humbly suggest staying away from Apple hardware.

What I wish I had been told about USB-C

There is a lot of information out there, and it’s not easy to find palatable technical information, between marketing announcements, outdated press releases, and arcane spec sheets. This is my attempt at explaining USB-C in terms that are “just technical enough.”

According to Wikipedia:

USB-C, technically known as USB Type-C, is a 24-pin USB connector system.

(In this whole post, I am using “USB-C” for “USB Type-C”.)

So, USB-C is a connector. It’s not a protocol! The protocol would be e.g. USB 2, or USB 3, or something else.

This connector has the ability to carry many different electric signals, including:

USB 1, USB 2 (for compatibility with existing devices)
USB 3 (the kind you’ve probably seen on the blue connectors with extra pins)
power (to charge e.g. a phone, but also more power-hungry devices like laptops, up to 100W as of late 2017)
DisplayPort
Thunderbolt
HDMI
a few other fancy things

All these electric signals can be present on a USB-C connector (but maybe not all at the same time!), and as far as I understand, none of them is mandatory.

So when you see a USB-C connector, it could be:

just for power (that’s the case for the connector on a charger for a phone or computer)
just for USB (that’s the case for the basic USB Type A / Type C adapters)
just for DisplayPort, or Thunderbolt, or HDMI (that’s the case for the basic video adapters sold by Apple)

But it could also be (almost) all these things at the same time!

Alternate modes

The 12” Macbook Air has only one USB-C connector, but that connector can support many different electric signals simultaneously.

This is a feature in the USB-C spec, called “alternate modes” or “alt modes” in short. That’s how signals like DisplayPort, Thunderbolt, or HDMI are supported. When an “alt mode” is enabled, some “high speed lanes” (electric wires normally used for USB 3) are hijacked to transport the corresponding alt mode instead.

Carrying DisplayPort, Thunderbolt, or HDMI signals requires enabling the corresponding alt mode. It has to be supported on both sides of the cable (i.e. by the host and the device). You can’t connect a Thunderbolt device over USB-C to a host that doesn’t support Thunderbolt (more on that later).

If that helps, you can imagine that inside this computer, we actually have a bunch of sockets for power, USB 2, USB 3, DisplayPort, and HDMI; and all these sockets are connected to that single USB-C connector. Then you can plug in an (expensive) adapter or docking station to get all these sockets back.

I’m very bad with drawing, but I found a nice diagram in this document:

The document has other schematics and explanations that you might like if you want to know more.

The 12” Macbook Air doesn’t have Thunderbolt

Just because a machine has USB-C doesn’t mean that the machine supports all these protocols and signals. For instance, the 12” Macbook Air does not have Thunderbolt. The 13” and 15” Macbook Pros do have Thunderbolt. This means that if you plug an Apple Thunderbolt display, using Apple’s adapter, into a Macbook Pro, it will work; but if you plug the same display, with the same adapter, into a 12” Macbook Air, no dice.

To make things even more frustrating and confusing: the Thunderbolt connector is physically identical to a miniDP connector. Any other (non-Thunderbolt) display with a miniDP connector will work on any Macbook with the correct USB-C adapter (because it will use the DisplayPort protocol).

Thanks Apple, I guess.

Connecting screens over USB-C, the easy way

If we want to connect external monitors with USB-C, we have plenty of options.

Assuming that the external monitor has an HDMI (or DisplayPort) connector, the most straightforward option is to use an adapter leveraging “HDMI alt mode” or “DisplayPort alt mode”. If you have multiple USB-C ports, these adapters are a good option, because they are cheap, since the circuitry in them is pretty basic. Of course, the source (i.e. your laptop) needs to support HDMI alt mode or DisplayPort alt mode (the latter is also known as VESA alt mode, by the way).

Most laptops with USB-C ports will support these modes, but I don’t know if this is true for all laptops. (E.g. I don’t know about Chromebooks and other cheap ones.)

Phones and tablets are a totally different story! They may or may not support alt modes. I don’t expect any phone to support HDMI or DisplayPort alt modes. However, there is “MHL alt mode” which seems to be designed to carry video signals from mobile devices. I don’t have any device supporting that so I don’t know if you can use the same adapters or need different ones.

And then, there is DisplayLink.

Connecting screens with DisplayLink

DisplayLink is basically “video stream over USB.”

A DisplayLink adapter might look physically exactly like an alt mode adapter; except that it will work very differently. When you connect a DisplayLink adapter, instead of negotiating alt mode to allocate a few wires to HDMI signals, it will present itself as a regular USB device—i.e. one that shows up in lsusb. The driver for this USB device will behave like a graphics adapter. When you display something on this graphics adapter, the display is encoded, sent over the USB protocol, decoded by the DisplayLink adapter, and shown on the connected physical screen.
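On Linux, that difference gives a quick way to tell the two kinds of adapters apart: look for a DisplayLink device in the lsusb output. The sketch below is a rough check, not something I have validated on real hardware. It assumes that the adapter reports DisplayLink’s USB vendor ID (17e9, as far as I know), and has_displaylink is a made-up helper name.

```shell
# Hypothetical helper: succeeds if lsusb lists at least one device
# carrying the DisplayLink vendor ID (assumed here to be 17e9).
has_displaylink() {
  lsusb | grep -qi 'ID 17e9:'
}
```

For instance, has_displaylink && echo "DisplayLink adapter detected" prints a message only when such a device is plugged in. An alt mode adapter would not show up this way, since its video output is not a USB device.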

These extra steps mean that a DisplayLink adapter will use extra CPU cycles (because of the video encoding), and depending on your setup, this can add a tiny bit (or a good bit) of extra latency. Various sources recommend NOT using DisplayLink for gaming.

Superficial research showed that there might be DisplayLink drivers available for Linux, but I didn’t try.

So far, it sounds like DisplayLink has a bunch of drawbacks: it needs a custom driver, eats CPU cycles, adds latency … But it has two advantages: you can plug as many as you want on your machine (since they’re just normal USB devices), and I saw references to Android drivers, meaning that it might work on some tablets.

This is why you can end up with an adapter that works out of the box, without drivers, on a machine; and an adapter that works almost out of the box (if, say, the driver is loaded automatically) on a tablet; but the adapters are not interchangeable (they won’t work with the other device) because they’re fundamentally different.

There is a lot of “maybe” in that section, because I didn’t take the time to try DisplayLink so far. Sorry!

One adapter to rule them all

Alright, now that we are armed with all that knowledge, let’s find the best adapter EVER.

Everyone’s needs are different, but I wanted to find a way to have the following connectors on my Mac:

gigabit Ethernet
a few USB ports
VGA (it’s getting very rare that I need that one, but who knows)
HDMI
power

The latter might seem weird, but many adapters (including some from Apple) don’t pass power to the computer; and remember: that 12” MacBook Air has only one connector. You then end up with a difficult choice: do I want to connect my external monitor, or do I want to charge my battery?

I also wanted to be able to connect everything at the same time.

“Whoa, that Jérôme guy for sure is picky!”

As it happens, when I deliver a full day workshop, I need:

wired connectivity
at least one USB port for my remote clicker
VGA or HDMI for the projector
power (full day workshop, remember)

I also wanted to get an extra USB-C port on the adapter, because I wanted to be able to buy USB-C devices (e.g. memory sticks, security tokens…) without having to choose between the device and everything else.

It turns out that I had to drop that last requirement, as (in August 2017) I couldn’t find any adapter that would connect to a single USB-C port and then provide more USB-C ports (in addition to my other requirements).

I got this adapter. The reviews might not be stellar, but it works great for me. It also has SD and miniSD card readers (which I use once every blue moon to re-image a Raspberry Pi), and audio output (because why not). In addition to VGA and HDMI, it has a miniDP connector as well.

The adapter can also be used as a USB charger: if it is connected to the AC adapter, but not to the computer, it will still deliver power to the USB A ports.

Likewise if it is connected to the computer, but not to the AC adapter: it can charge your phone and other devices from the battery of your laptop.

Note, however, that when you plug/unplug the AC adapter, it seems to “reset” the adapter (as if you had disconnected and reconnected all the peripherals). Keep that in mind if you have a disk connected, or if you’re performing live music with a USB MIDI controller.

One dock to rule them all

I do some video editing sometimes (as well as some other CPU intensive tasks), and my adapter then has a “small” problem: the battery will charge very slowly, or even discharge if the CPU stays running at high speeds for continued periods of time. So eventually, I also got a dock. It’s not a “dock” like the docks I was used to (where you physically lock the computer to a base).

There are many docks out there, with varying options. I wanted something just like my adapter, but with full power delivery to the host, and with at least one extra USB-C port, so that I wouldn’t be constantly plugging/unplugging stuff if I decided to buy some USB-C peripheral.

I picked that dock. It might seem expensive, but the other ones that did fit my requirements were sometimes more than $300 (!). There were also a bunch of docks boasting Thunderbolt support, and I didn’t know if that meant “and also, it supports Thunderbolt!” or “and by the way, it requires Thunderbolt!” — the latter would have been a showstopper.

The dock also has mic and headphone connectors, which can be super convenient. I had a headset connected to these for a while. I never really understood how macOS picked which default output to use, but most modern conferencing software has easily accessible settings to switch audio devices to make up for that.

Having a dock also means much less plugging/unplugging: the dock can stay at home, and all my peripherals can stay connected to it, while the adapter stays in my backpack for when I’m on the go.

A couple of observations:

in hindsight, it might have been better to get a dock with DisplayPort (to support cascading displays with MST), but the HDMI output on the dock can carry 4K, so I’m fine with it;
initially, I wanted a dock that got power from USB-C (to simplify the different types of cables I had around) but I couldn’t find any. Perhaps because it’s way cheaper to tack on a classic AC adapter than to add a fancy USB-C power supply and the corresponding circuitry in the dock.

Wrapping it up

After spending a bunch of time reading on USB-C, trying to understand what would work, what wouldn’t, etc., I think it is pretty fantastic to be able to use a single connector for so many things. The transfer speeds are orders of magnitude faster than USB 2: with Thunderbolt 3 over a USB-C connector, you can supposedly get 40 Gb/s. You can even connect external GPUs through USB-C, because Thunderbolt can carry PCI Express lanes! (Don’t hold your breath, though: this is still pretty early stage.)

However, since not all hosts (machines) and devices support all these modes, debugging problems gets really complicated, especially without knowing the underlying fundamentals of USB-C, alt modes, etc.

For instance, if I plug my dock or my adapter into an Android phone with a USB-C connector, nothing happens. Obviously, I didn’t expect my phone to suddenly drive my 4K monitor, connect over my gigE interface, and mount my external disks connected to the dock. But the USB-C spec includes a lot of signaling and negotiation to let each side identify itself and its capabilities. It would be fantastic if the devices could use that and report it adequately: it would make for a much better user experience. Hopefully that will evolve in the future.

If there is an adapter or dock that you particularly like, feel free to drop me a note, I’ll add it here for others!

Letter to Santa Kube

2017-12-06T00:00:00+00:00

A few months ago, I wrote and delivered a Kubernetes orchestration workshop, based on my Swarm orchestration workshop. While doing so, I hit a few snags; and since I’m attending KubeCon this week, I thought this would be the perfect occasion to track down Santa Kube and give them my wishlist for Christmas! 🎄🎅🏿❤️

As a foreword: I don’t consider myself a Kube expert, and while I did a bit of research, I might have missed some obvious workarounds. If that is the case, I trust my readers to let me know, and I will be happy to update this post and give them all the credit (as well as a chunk of my eternal gratitude!)

`kubectl logs`

The kubectl logs command lets you stream logs from a single container, and it lets you retrieve logs from multiple pods using a selector. However, it doesn’t let you combine both:

# This is okay
kubectl logs --follow my-little-pod

# This is okay too
kubectl logs --selector run=deploy-oh-my-deploy

# But not this
kubectl logs --selector run=deploy-oh-my-deploy --follow

I’ve been spoiled by the Docker CLI, which lets me stream the logs of multiple replicas at the same time, and will prefix each line with the source of the log line:

I would be out-of-this-world-ly delighted if I could get the convenience of multi-container log streaming, together with the power of Kubernetes selectors.

Knowing which log line comes from which container in which pod would be fantastic, too. 🌈

I wrote my own bashery that basically repeatedly fetches logs using --since and then does something ugly to remove duplicate lines, but then I realized that this was so wrong on so many levels and stopped before that code would spring to life and try to devour my face.
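A less hacky stopgap is to start one kubectl logs --follow per matching pod and tag each line with the name of its pod. This is only a sketch: follow_selector is a made-up helper, it only sees the pods that exist when it starts, and it follows the default container of each pod.

```shell
# Hypothetical helper: follow the logs of every pod matching a
# selector, prefixing each line with the name of the pod.
follow_selector() {
  selector="$1"
  # "-o name" prints one "pod/<name>" entry per line.
  for pod in $(kubectl get pods --selector "$selector" -o name); do
    name="${pod#pod/}"
    # One background "kubectl logs" per pod; sed tags each line.
    kubectl logs --follow "$name" | sed "s|^|[$name] |" &
  done
  # Block until every stream ends (Ctrl-C stops them all).
  wait
}
```

Usage would look like follow_selector run=deploy-oh-my-deploy. Pods created afterwards (e.g. after a scale-up) are not picked up, which is one reason why a purpose-built tool remains preferable.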

I have tried kail, but the installation was far from painless (why can’t I just go get a Go program? 😭) and it doesn’t play nice with white terminals (at first, I thought it was buggy because a lot of text was in white on a white background).

And then there is stern which looks pretty dope too. Update: I tried it! Thanks @lestrrat for encouraging me to try it by the way.

Using stern to view logs is painless.

Download binary from GitHub.
chmod +x the freshly downloaded binary.
./stern_linux_amd64 pod_or_deployment_or_whatever_name --tail 1 -t.
Profit!

I just wish this were embedded in the kubectl CLI. I would also get much joy and happiness from a thing that could be installed with just go get (without messing with additional dependency management tools) since I could then install it without installing Go!

`kubeadm` token

When writing the provisioning scripts for my Kubernetes training clusters, I used kubeadm. It works great, except for one little detail: extracting the token generated by kubeadm init is way too complicated.

I’m pretty sure that somebody will point out that I missed an obvious option (because for Christmas’ sake, it HAS TO be there!), but I looked high and low and couldn’t find it. Show me how! And I will then be able to deprecate the pretty horrible things that I’m currently doing to retrieve that token in a stable-ish way. (I didn’t feel dirty enough to parse the porcelain-ish output of the command!)

For reference, the equivalent command on Docker is docker swarm join-token -q worker (it outputs the join token for the cluster, without further decoration).
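One possible way around the extraction problem is to avoid extracting anything: pick the token first, then hand it to kubeadm init. The sketch below assumes a kubeadm release that ships the token generate subcommand and the --token flag of kubeadm init; init_with_known_token is a hypothetical helper name.

```shell
# Hypothetical helper: choose the token up front instead of scraping
# it from the output of "kubeadm init".
init_with_known_token() {
  # "kubeadm token generate" only prints a random, well-formed token.
  token=$(kubeadm token generate) || return 1
  # Pass it explicitly; kubeadm's own output goes to stderr so that
  # stdout carries nothing but the token.
  kubeadm init --token "$token" >&2 || return 1
  printf '%s\n' "$token"
}
```

A provisioning script could then run TOKEN=$(init_with_known_token) on the first node and reuse $TOKEN when joining the others.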

Stray pods

Selectors and labels are an amazing combo (like peanut butter and jelly). They allow you, for instance, to remove a faulty pod from a load balancer, just by changing its labels, in such a way that the pod is no longer “selected” by its load balancer; but the pod still exists and you can examine it at leisure.

Even better: if that pod is managed by e.g. a replica set, you can change its labels so that the pod is set aside, and the replica set creates a replacement pod immediately.

However! If you do that a lot, you end up with a bunch of stray pods. (Or orphan pods? I don’t know which word would be more appropriate there.)

For years and years, people have complained that Docker would eat up all their disk space. For years and years, Docker’s answer was, “just craft up some shell script to remove old containers and images, then drop that in a crontab”; but we had to wait for Docker 1.13 to introduce docker system prune to provide a better answer.

I’m still getting familiar with the Kubernetes data model. I haven’t figured out if each resource created by another resource has some kind of “parent” link, that would help to track who-created-who and whether the parent resource is still watching after their child or if the latter is now outside of the scope of the former’s selector. So I don’t know if this is as simple as cobbling together 100 lines of script; but I think we shouldn’t wait 3 years for some kubectl prune pods command to show up!

(Or perhaps I’m completely missing the point, and there is a better way to handle these situations? I’m always happy to be enlightened. ⚡🙇🏻‍♂️ )
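For what it’s worth, Kubernetes objects can carry a list of owners in metadata.ownerReferences, which looks a lot like the “parent” link mentioned above. The sketch below lists pods that have no owner at all. It assumes that a controller drops its owner reference when a pod leaves its selector, and it cannot tell a stray pod from a pod that was created by hand, so treat the output as candidates rather than as a kill list. stray_pods is a made-up helper name.

```shell
# Hypothetical helper: list pods that have no owner (no replica set,
# job, etc. referencing them). Extra arguments go to kubectl.
stray_pods() {
  # custom-columns prints "<none>" when a field is missing.
  kubectl get pods --no-headers "$@" \
    -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
    | awk '$2 == "<none>" { print $1 }'
}
```

For instance, stray_pods --namespace default prints one pod name per line, which can be reviewed before anything gets deleted.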

Figure out port numbers when they are in image metadata

Docker images have some metadata indicating which ports are exposed, but for some reason, Kubernetes can’t or won’t use that.

For instance, if I do:

kubectl run nginx-is-love --image=nginx
kubectl expose deployment nginx-is-love

… kubectl will tell me that I need to specify the port number for my NGINX deployment. Which is weird, because that port number definitely is in the image! Proof is, if I do docker run -dP nginx it gets exposed automatically.

I suppose that this piece of metadata is lost somewhere in translation, but I don’t know where.
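As a workaround, the port can be read from the image metadata and handed to kubectl expose. The sketch below assumes that the image is present on the local Docker Engine and that using its first exposed port is good enough; expose_with_image_port is a hypothetical helper.

```shell
# Hypothetical helper: expose a deployment on the first port listed
# in the metadata of its image.
expose_with_image_port() {
  deployment="$1"
  image="$2"
  # Exposed ports are stored as keys like "80/tcp"; keep the number.
  port=$(docker image inspect \
    --format '{{range $p, $v := .Config.ExposedPorts}}{{println $p}}{{end}}' \
    "$image" | head -n 1 | cut -d/ -f1)
  # Give up if the image does not declare any port.
  [ -n "$port" ] || return 1
  kubectl expose deployment "$deployment" --port "$port"
}
```

With the example above, that would be expose_with_image_port nginx-is-love nginx.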

Support recent versions of Docker

Kubernetes officially supports Docker 1.12, 1.13, and 17.03. Anything after 17.03 is not supported. Which is kind of sad, because Docker 17.03-ce has been EOL for almost 6 months now. 🤷

Why does Kubernetes use such an old version of Docker? Because in the early days of Docker, the API had to evolve quickly, and breaking changes happened regularly. Building on top of a fast-moving API is hard, and this prompted two things: freezing the version of Docker used in most Kubernetes deployments, and the development of the CRI (Container Runtime Interface), to support other container engines like rkt, CRI-O, and containerd.

Interestingly, the very same thing happened for Docker itself. It initially relied on LXC for container execution; and when LXC started to evolve too quickly (and when it became impossible to support all the different versions of LXC in existence out there), Docker introduced libcontainer and promoted it as the primary execution engine.

There is a big difference in the two situations, though: LXC doesn’t have an API. LXC is leveraged by invoking lxc-start, so the API contract is … lxc-start’s manpage. On the other hand, Docker exposes a versioned API, and has been offering backward compatibility on that API for a while. Breakages are considered release critical bugs. In other words: if you use the versioned API (i.e. /v1.31/containers instead of /containers) and stick to a given API number, but get a different behavior on a newer version of the Docker Engine, you can report that bug and it will be treated as a release critical bug. That means that if you’re using Docker CE, you can track the latest version (and get bug fixes and new features) with the guarantee that if something breaks, people will look at it.
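To make the versioned API concrete, here is a minimal sketch that queries the Engine over its UNIX socket with the version pinned in the URL. It assumes that the Engine listens on /var/run/docker.sock and that the caller is allowed to access it; docker_api_get and the DOCKER_SOCKET variable are made up for this example.

```shell
# Hypothetical helper: GET a Docker Engine API path with a pinned
# API version, e.g. "v1.31" and "/containers/json".
docker_api_get() {
  version="$1"
  path="$2"
  # DOCKER_SOCKET is an illustrative override, not a Docker setting.
  curl --silent --unix-socket "${DOCKER_SOCKET:-/var/run/docker.sock}" \
    "http://localhost/${version}${path}"
}
```

Calling docker_api_get v1.31 /containers/json lists running containers in the format defined by that API version, whichever (compatible) Engine version answers.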

I’m glad that the CRI exists, and allows for multiple options for the container engine today. And it’s great that in the context of Kubernetes, picking one over the other won’t (or at least, shouldn’t) have any impact on the user. But I’m also glad that there are initiatives to support the Docker API in a more stable way.

By the way, I’m fully aware that this one will probably require collaboration from my coworkers as well. If you’re working on this but need some contacts at Docker to move forward, I’ll be more than happy to make intros and beg, bribe, seduce, or threaten my loved coworkers so that everybody wins at the end!

And while we’re here, I’m also going to acknowledge the pretty cool stuff that Virtual Kubelet is doing, and that moves the API border at a different level.

Final words

The real conclusion of this post is that quite frankly, nothing significant got in my way when I learned how to do with Kubernetes the stuff I had been doing with Docker since June 2015. I’m sure that you’ll agree that none of the points mentioned above is a showstopper to anyone wondering if they can apply their Docker knowledge (and Swarm in particular) to Kubernetes.

And of course, I’m well aware that Swarm has its shortcomings as well. This post is, by no means, advocating for or against either orchestrator.

Thanks; and if you want to chat more about all this, please reach out, I’ll be thankful to learn from you!

DevOps, Docker, and Empathy

2017-10-31T00:00:00+00:00

Just because we’re using containers doesn’t mean that we “do DevOps.” Docker is not some kind of fairy dust that you can sprinkle around your code and applications to deploy faster. It is only a tool, albeit a very powerful one. And like every tool, it can be misused. Guess what happens when we misuse a power tool? Power fuck-ups. Let’s talk about it.

I’m writing this because I have seen a few people expressing very deep frustrations about Docker, and I would like to extend a hand to show them that instead of being a giant pain in the neck, Docker can help them to work better, and (if that’s their goal) be an advantage rather than a burden in their journey (or their “digital transformation” if we want to speak fancy.)

Docker: hurting or helping the DevOps cause?

I recently attended a talk where the speaker tried to make the point that Docker was anti-devops, for a number of reasons (that I will list below.) However, each of these reasons was (in my opinion) not exactly a problem with Docker, but rather in the way that it was used (or sometimes, abused). Furthermore, all these reasons were, in fact, not specific to Docker, but generic to cloud deployment, immutable infrastructure, and other things that are generally touted as good things in the DevOps movement, along with cultural choices like cross-team collaboration. The speaker confirmed this when I asked at the end of the talk, “did you identify any issue that was specific to Docker and containers and not to cloud in general?” — there was none.

What are these “Docker problems?” Let’s view a few of them.

We crammed this monolith in a container …

… and called it a microservice.

In his excellent talk “The Five Stages of Cloud Native”, Casey West describes an evolution pattern that he has seen in many organizations when they adopt microservices.

Some of us (especially in the enterprise) are putting multiple services in a container, including an SSH daemon used for default access, and calling it a day.

Is this a problem? Yes and no.

Yes, it is a problem if we pretend that this is the final goal of our containerization journey. Containers really shine with small services, and that’s why the Venn diagram of folks embracing containers and folks embracing micro-services has a pretty big overlap. We can tar -cf- / ... | docker import and obtain a container image of our system. Should we? Probably not.

Except if we acknowledge that this is just a first step. There are many good reasons to do this:

verifying that our code (and all associated services) runs correctly in a container;
making it easier to run that VM in a local environment, to leverage the ease of installation of e.g. Docker4Mac and Docker4Windows;
running that VM on a container platform, to be able to control and manage a mix of containers and VMs from an interface that “understands” containers;
or even having a point-in-time snapshot of your system, that you will be able to start in a pinch in case of unexpected incident.

Docker Inc. has a program called “Modernize Traditional Applications” (MTA in short), aiming at helping the adoption of containers for legacy apps. A lot of people seem to believe that this program is basically “import all our VM images as containers and YOLO,” which couldn’t be farther from the truth. If you’re a big organization leveraging that program, you will first identify the apps that are the best fit for containerization. Then, there are tools and wizards (like image2docker) to generate Dockerfiles, that you will progressively fine-tune so that the corresponding service can be built quickly and efficiently. The MTA program doesn’t make this entirely automatic, but it helps considerably in the process and gives a huge jump-start.

Yes, some VMs might end up running, almost unchanged, as containers; in particular for apps that don’t receive updates anymore but have to be kept running anyway. But if somebody told you, “I’m going to turn all your VMs into containers so that you can have more DevOps,” you were played, my friend.

You know what? We had exactly the same challenge 10 years ago, when EC2 became a thing. “We took our physical servers and turned them as-is into AMIs and we are now making good use of the cloud!” said no-one ever. Moving applications to the cloud requires changes. Sometimes it’s easy, and sometimes, well, you have to replace this SQL database with an object store. This is not a problem unique to containers.

Shadow IT is back, with a vengeance

“Shadow IT,” if you’re not familiar with the term, is when Alice and Bob decide to get some cloud VMs with the company credit card, because their company IT requires them to fill 4 forms and wait 2 weeks to get a VM in their data center. It’s good for developers, because they can finally work quickly; it’s bad for the IT department, because now they have lots of unknown resources lying around and it’s a nightmare to manage and/or clean up afterwards. Let alone the fact that these costs, seemingly small at first, add up after a while.

Since the rise of Docker, it’s not uncommon to hear the following story: our developers, instead of getting VMs from the IT department, get one giant big VM, install Docker on it, and now they don’t have to ask for VMs each time they need a new environment.

Some people think that this is bad, because we’re repeating the same mistakes as before.

Let me reframe this. If our IT department is not able to give us resources quickly enough, and our developers prefer to start an N-tier complex app with a single docker-compose up command, perhaps the problem is not Docker. Perhaps our IT department could use this as an opportunity, instead of a threat. Docker gives us fantastic convenience and granularity to manage shadow IT. If we agree to let our developers run things on EC2, we will have to learn and leverage a lot of new things, such as access control with IAM and tagging resources so that we can identify what belongs to which project, what is production, etc. We could use separate AWS accounts but this comes with other drawbacks, like AZ naming, security groups synchronization… With Docker, we can use a much simpler model. New project? Allocate it a new Docker host. Give UNIX shell access to the folks who need to use it. We all know how to manage that, and we can always evolve this later if needed.

If anything, Docker is helping IT departments to have a more manageable shadow IT, and that’s good — because these IT departments can now do more useful things than provisioning VMs each time a developer needs a new environment.

To rephrase with fewer words and the wit of Andrew Clay Shafer: “Good job configuring servers this year! … said no CEO ever.”

Persistent services, or “dude, where’s my data?”

“If you run a database in a container, when you restart the container, the data is gone!” That’s false on many levels.

The only way to really lose data is if you start your database container with docker run --rm and the data is not on a volume.

Of course, if you docker run mysql, then stop that container, then docker run mysql again, you get a new MySQL container, with a new, empty database. But the old database is still there, only a docker start command away.

In fact, even if you docker rm the container, or run it through Compose and execute docker-compose down or docker-compose rm, your data will still be there, in a volume. This is because all the official images for data services (MySQL, Redis, MongoDB, etc.) persist their state to a volume, and the volume has to be destroyed explicitly (for instance with docker rm -v or docker-compose down -v). The exception is docker run --rm, which also removes the anonymous volumes created for that container.

Of course, if you don’t know this, and are just learning Docker, you might freak out and wonder where your data is. That’s perfectly valid. But after looking around a bit, you’ll be able to find and recover it.
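“Looking around” can be as simple as asking Docker which volumes a container uses, ideally before removing it. The sketch below assumes that the container still exists (running or stopped); volumes_of is a made-up helper name.

```shell
# Hypothetical helper: print the names of the volumes mounted in a
# container. Bind mounts have no volume name and are skipped.
volumes_of() {
  docker inspect --format '{{range .Mounts}}{{println .Name}}{{end}}' "$1" \
    | grep -v '^$'
}
```

Each name printed by volumes_of my-database (a placeholder container name) can then be passed to docker volume inspect to see where the data lives on the host.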

However, if you run in the cloud (say, for instance, EC2) and are storing anything on instance store … Good luck. Now you can really lose data super easily. You should have been using an EBS volume! If you didn’t know that, too bad, too late, your data is gone, and all the Googling in the world won’t get it back. (Oh, and let’s not forget that for at least half a decade, EBS volumes have been plagued with performance and reliability issues, and have even caused region-wide outages on EC2.)

Bottom line: managing databases is way harder than managing stateless services, because production issues can incur not only downtime, but also data loss. To quote Charity Majors, “the closer you get to laying bits down on disk, the more paranoid and risk averse you should be”.

No matter what avenue you choose for your databases (containers, VMs, self-hosted, managed by a third party), take appropriate measures and make sure you have a plan for when things go south. (That plan can start with “backups”!)

The tragedy of the unmaintained images

What happens if our stack uses the jpetazzo/nginx:custom image, and that sketchy jpetazzo individual stops maintaining it? We will quickly be exposed to security issues or worse.

That is, indeed, a shame. That would never happen with distro packages! We would never use a PPA, and certainly not download some .deb or .rpm files to install them from a second-hand Puppet recipe.

Just in case you had a doubt: the last paragraph was pure, unadulterated sarcasm. Virtually every organization has an app that uses an odd package, installs some library straight from somebody’s GitHub repository master branch, or relies on some hidden gem like left-pad, unknowingly lurking in the bowels of a shell script hidden under thousands of lines of config management cruft.

We can address all the bitter criticism we want to Docker and the sketchy, unmaintained images that haunt the Docker Hub, but realistically, Docker is not the first platform that allows developers to share their work.

If we worry about our developers using unvetted Docker images, I wonder: how do we check what they’re using in requirements.txt, package.json, Gemfile, pom.xml, and other dependencies?

In fact, Docker gives us significant improvements over the status quo. Products like CoreOS Clair or Docker Security Scanning let us analyze images at rest, finding vulnerabilities without requiring direct access to our servers. Read-only containers and docker diff give us easy ways to enforce or check compliance of our applications to make sure that they do not deviate.
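As an illustration of the docker diff part, here is a minimal compliance check that fails as soon as a container’s filesystem differs from its image. It is a sketch under the assumption that any change counts as a deviation (real applications usually need a list of allowed paths); check_drift is a hypothetical helper.

```shell
# Hypothetical helper: report filesystem changes made in a container
# since it was created. Returns 0 if there are none, 1 if there are
# some, 2 if docker diff itself failed.
check_drift() {
  changes=$(docker diff "$1") || return 2
  if [ -n "$changes" ]; then
    # docker diff prints one line per added (A), changed (C), or
    # deleted (D) path.
    printf '%s\n' "$changes"
    return 1
  fi
}
```

Running the container with docker run --read-only is the enforcing counterpart: instead of detecting writes after the fact, it prevents them.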

Works in my container — ops problem now

In the early days of Docker, “works on my machine - ops problem now” was one of the memes used to convey the advantage of Docker. Ship a container image! It will work everywhere.

According to some perceptions, however, the reality is different:

we went from “blindly shipping tarballs” to “blindly shipping containers”;
Docker put us back 5 years with regards to culture adoption.

These two points are very important. Let’s discuss them in detail.

Building empathy

Going from “works on my machine” to “works on my container” was huge progress. In Spring 2015, I had the honor of keynoting the TIAD conference in Paris; and I tried to show in practical ways how we could use Docker to foster empathy between teams, and break down silos. The presentation was in French, but my slides are in English. My core idea was built around a number of specific experiences.

When I was doing customer support for dotCloud (the PaaS that eventually pivoted to become Docker), I was constantly being challenged by the variety of stacks and frameworks that our customers were using. PHP (and half a dozen frameworks like Laravel, Symfony, Drupal, etc.), Python (with Flask, Django, Pyramid, just to name a few!), Ruby, Node.js, Perl, Java (with all the variety of languages that you can run on top of the JVM) — dotCloud could run all of them. When a customer opened an issue, I had two options: try to reproduce it from scratch (that’s how I wrote my first Clojure program, by the way), or ask the customer if I could clone their environment (including the dotcloud.yml file, a distant paleolithic ancestor of the Compose file). The latter would give me a huge head start to reproduce the issue.

Imagine, as a customer, telling your support representative: “When I do requests to S3 from my PHP webapp, they time out once in a while; however, if I do that from the CLI, they always work.” Unless you give them access to your environment, they are very unlikely to figure out what’s going on. However, if you write a tiny Dockerfile, and explain “if you run docker-compose up and then curl localhost:8000 in a loop, you’ll see the problem” — they are way more likely to be able to help. And even if it works on their machine, now at least you know that it’s not a code / version / library problem.
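To make the “curl in a loop” part concrete, here is a minimal sketch of such a loop. The probe helper, the number of iterations, and the 5-second timeout are illustration choices of mine (assumptions, not something taken from a real support ticket); adapt them to whatever reproduces your problem.

```shell
# probe N CMD...: run CMD N times and report how many runs failed.
# (Hypothetical helper, for illustration only.)
probe() {
  total=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$total" ]; do
    # Any non-zero exit status (e.g. a timeout) counts as a failure.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$total requests failed"
}
```

After docker-compose up, something like “probe 100 curl --silent --fail --max-time 5 localhost:8000” would then give the support person a number to look at, instead of a vague “once in a while.”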

Good luck achieving the same thing by hurling tarballs of code.

It doesn’t stop here. In too many organizations, alas, communication between support and dev teams is highly dysfunctional, with level 1-2 support engineers being considered as a lower tier of engineers, because the “soft skills” (aka “being a decent, good human”) that they have are devalued in comparison to the “technical skills” of developers. As a result, it can be difficult for support teams to get developers to acknowledge issues, until they attract the attention of upper management. Docker can be helpful here as well, because support teams can reproduce issues in a containerized environment, thus providing functional tests. It is then easier for the dev team to look at these issues, because the “tedious work” (of reproducing the problem in strictly controlled conditions) has been done for them.

Wait, couldn’t we already do that before?

Of course. Reproducible environments with Vagrant, Puppet, etc. are not a new thing. What’s new is bringing the power of a Dockerfile to a crowd that can’t or won’t learn how to use a configuration management system.

The title of my TIAD keynote was “Docker: automation for the rest of us” because I’m deeply convinced that it gives access to powerful tools to a larger crowd.

Successfully embracing DevOps principles requires us to agree on and use some common tools and languages. Don’t get me wrong: I’m not talking about technical tools or programming languages. But if the majority of the people supposed to “do DevOps” in our organization are left on the side of the road because the tools that we have picked are too complex for them, we won’t get far in our DevOps journey, and we won’t digitally transform much.

Harder Better Faster Stronger Docker

I recently found myself joking about the fact that “Docker lets us go faster; but if we’re facing a wall, we’re just going to hit it harder.” I mean it. But I think that’s good. Because it means that we’re going to fail fast, and we’ll improve faster. Which is one of the key points of DevOps. Shorten that feedback cycle, because each iteration lets us improve the process. The faster we iterate, the faster we improve.

One particular quote I’ve seen surprised me so much, that I wondered if it was said seriously:

“We had disciplined ourselves to work in cloud environments, as close as possible to our production setups. Docker allows us to work locally, in very different conditions; it takes us 5 years back.”

My first thought was, “That person must be joking or trolling.” Docker gives us back the ability to work locally. If your team, organization, or tooling, required you to work in the cloud, it was taking you 25 years back, to the era of mainframes and minis. We should celebrate a tool that lets us work locally, not decry it; because we can work faster, without waiting for the CI pipeline to pick up our commit and test it and deploy it to preprod just to see a trivial change. (These steps should be mandatory when we submit something to others for review, though.)

But velocity has a cost (and no, I’m not talking about the price of conference tickets.)

It’s not about the tools, and yet …

The number of tools at our disposal keeps growing. We used to joke about the multiplication of JavaScript frameworks, but if you have an AWS account, log into the AWS console and have a look at the number of services out there. Do you even know what they all do? I don’t. Go has barely solidified its place as a language of choice for infrastructure projects, and some of us are already trying to displace it with Rust. Everybody and their dog is getting excited about Kubernetes, but which one of its 15 different network plugins are we going to pick when we deploy it? Docker has a boatload of features at each release, but even I don’t have the time to know all of them. Should we look into Habitat, Flatpak, Buildah?

We don’t have to keep up with everything, though. And more importantly, we don’t have to embrace new things at 1/10th of the speed of light. As early as 2014, people were asking me if “Docker was ready for production.” It was ready, if you knew what you were doing. Most of the time, my answer was: “Start Dockerizing an app or two. Write a Compose file. Empower your developers to use Docker. Set up CI, QA, a staging environment. You will get a huge ROI in the process, and by the time you’re done, you will have acquired a huge amount of operational knowledge about Docker, and you will be able to answer that question on your own.”

I feel bad for all the folks who went straight to production without taking the time to consider what they were doing and learn more about the technology. (Except the ones doing high frequency trading on CentOS 6, because I do like me a good joke.)

This is not specific to Docker. Today we laugh at the poor souls who edit files on the servers, only to have them overwritten by Puppet the next minute; forgetting that years ago, we were these poor souls and we had no idea what the hell was going on, persuaded that the computers were conspiring against us.

Docker is not the perfect tool; but it’s a pretty good one. It brings to the masses (or at least, to a larger number) lots of techniques that everybody wanted to implement, but that only Netflix managed to get right. Today, with Docker, a one-person-team can build artefacts for any language, run them on their local machine whatever its operating system, and deploy them on any cloud. And that’s just a first step!

So instead of complaining that Docker is killing our DevOps efforts, it would be more productive to explain how to refactor the anti-patterns that we see out there.

Containers will not fix your broken culture

(This is the title of an excellent talk by Bridget Kromhout, covering these topics as well.)

If there is one point where I strongly agree, it’s that the DevOps movement is more about a culture shift than embracing a new set of tools. One of the tenets of DevOps is to get people to talk together.

Implementing containers won’t give us DevOps.

You can’t buy DevOps by the pound, and it doesn’t come in a box, or even in intermodal containers.

It’s not just about merging “Dev” and “Ops,” but also getting these two to sit at the same table and talk to each other.

Docker doesn’t enforce these things (I pity the fool who preaches or believes it) but it gives us a table to sit at, and a common language to facilitate the conversation. It’s a tool, just a tool indeed, but it helps people share context and thus understanding.

That’s not too bad.

I’ll take it.

I would like to thank Bridget Kromhout for giving thoughtful and constructive feedback on an early version of that post. All remaining typos and mistakes are my own. I take full responsibility for what is written here; so please send complaints and rants my way!

Being a track host – tips from DockerCon

2017-05-05T00:00:00+00:00

You’ve agreed to be a “track host,” but you don’t know how to do it? Or you’ve done it before, but you’d love to swap a few tips, tricks, and new ideas to do it better next time? I’ve got you covered!

I’ve been a track host twice. Both times were for DockerCon, in the “Black Belt Tech” track. I had a wonderful time, because that gave me the perfect excuse to sit during the whole conference in a track that was both fascinating and enlightening to me, and listen to speakers of the highest caliber; some of them my friends, some of them my idols, some of them both at the same time. I’ll be honest with you: the first time, I had no idea what I was doing. I knew I was supposed to introduce the speakers and make sure that everything went fine, but that was pretty much it.

This experience alone doesn’t make me qualified to tell others how to be good track hosts, or “MCs.” However, I have spoken at conferences many, many times. If we include meetups and internal presentations at various companies, I have given more than one hundred talks over the last couple of years. This gave me a feel for what worked (and what didn’t) as a speaker. When I started “MC’ing” at DockerCon, I tried to keep these experiences in mind, and to turn them into useful lessons to do my job better.

These tips and tricks are for the “day of.” If you are also involved in the program committee, speaker selection, and pre-conference preparation – I have another post that I want to write about that; but in the present one, I’m sticking purely to things happening the day(s) of the conference.

Before the sessions

This is going to sound silly, but make sure that you know when your speakers arrive. Know how to contact them. (It could be a phone number or any kind of instant messaging where you know that they’ll be responsive.)

Good speakers will typically check their room the day before; but you can also set up a tour. Sometimes you have the room sufficiently ahead of time to offer rehearsals in that room; sometimes you’ll just be able to show them what it’s like. This is important for the speakers: it lets them anticipate the size of the room, how the audience will be seated, where the screen is (or screens are). If there are multiple projector screens, they’ll know that they shouldn’t use a laser pointer (since they’ll only be able to point at one screen at a time). They should also be able to check if there is a mirror screen, allowing them to see what they’re projecting without turning their back to the audience. That’s also a good moment to check that their computer (if they’re using theirs) can connect to the projector, and that they can connect to the network (wired or wireless) if they need it.

I highly recommend having a few video adapters on hand. Most conference projectors use HDMI connectors, and VGA is also pretty common. But lots of laptops won’t have these connectors, and will instead use miniDP, or USB-C for recent models. Ideally, your conference AV team should be able to provide a few adapters to accommodate everyone. Ideally, you should tell your speakers “bring your adapters.” But eventually, you will miss an adapter. It’s better to realize that ahead of time than five minutes before the beginning of a talk!

The day of the talks

Make sure you know who the AV person is, so you can track them down quickly if anything goes wrong. Will they be in the room at all times, or will they be there just at the beginning of each talk (to mic up the speaker) and then go away? In the latter case, it’s a good idea to get their phone number so that you can call/text them if there is a problem.

Check what type of mic the speakers will have (ideally, lapel or over-ear); what type of mic you will have to make announcements; if there will be an extra mic for Q&A. If some of the talks will have more than one speaker, ask the AV team if each speaker will have their own mic.

The AV person is sometimes knowledgeable about the lighting of the room. If you want to feel over-prepared, inquire about that: what should you do if the lights go off? Or conversely, if they go on, making it harder to see the projector screen?

You should save yourself a few seats in the front row. The perfect seats should be close to the stage or podium (so you can easily interact with the speaker if needed), but also directly facing a screen (so that you can take good pictures during the talk). It’s often hard to get both, especially in a room with two screens, one on each side, and the speaker right in the middle.

It’s also a good idea to save a few good seats for the speakers. Write down (or print) “SPEAKER” in the middle of a few sheets of paper, and place them on the seats that you want to reserve.

Meeting your speakers

Ideally, you asked your speakers to be in the room 10-15 minutes before the beginning of the talk, and to meet you at the front. It’s time for a few pre-flight checks!

First and foremost, confirm with the speaker the duration of the talk. I know, that might seem weird; but some of us deliver a lot of talks, and it can be hard to remember if your conference has 40/45/50-minute speaking slots, and whether that includes Q&A.

Speaking of Q&A, ask your speaker if they’ll do one! Not everyone does. Some speakers prefer to keep the whole time slot for their talk because they have a lot of content to cover; some speakers prefer to have their Q&A in the “hallway track.” So check with them!

If the speaker will be taking questions at the end of their talk, remind them that they should repeat the questions (unless you have an extra mic for the audience to ask questions). This will make sure that everyone hears the questions, and if the talks are recorded, it will make sure that the questions are on the recording.

If the speaker has a “complicated” name (and by “complicated,” I mean something that you don’t know how to pronounce), ask them how you should pronounce it. Full disclosure: as a French person, the pronunciation of English makes no sense at all to me. I mean, what do you expect from a language where ghoti and fish can be pronounced the same way? I expect that it’s also true the other way around, since I’m asked very often how to pronounce my last name (which is written “Petazzoni”). I’ll tell you: I don’t care how you pronounce my name. I got it from my grandfather, who was Italian; but I was born and raised in France, and I can’t speak much Italian, save for counting to ten and a few expletives. Italians would pronounce it “pet-a-DZOH-nee,” French typically go for “pet-ah-zoh-nee,” but I mostly give talks in English anyway, so I don’t care. I will be slightly sad if you write it incorrectly (e.g. “Pettazoni”), so I will give you a mnemonic trick: it only has two Zs, like “pizza.” I will also be slightly sad if you mix it up (e.g. “Bertaloni”, it happened), unless you call me “Pizza Toni” with an overly heavy Italian accent – but in that case, you also have to pour me a glass of limoncello. As a speaker, I will appreciate if you ask me how to pronounce my name; and I will invariably reply with an honest smile that it doesn’t matter. Bottom line: ask your speaker! I bet most of them won’t hold a grudge if you don’t nail the pronunciation, but they will always appreciate you asking.

Ask your speaker how they’d like to be introduced. Same story: they generally don’t care (or rather, don’t have anything specific on their mind); but they might give you a hint about what matters to them. If they work for a company doing something particularly impressive, it’s a good time to do some fact-checking. For instance, recently I introduced Brendan Gregg, who works for Netflix. I remembered that Netflix accounted for about one third of the internet traffic of the US during peak hours, but I wasn’t 100% sure about that; so I checked with him.

If you can, try to find a little anecdote or story to introduce the speaker. I personally try to find something fun or exciting. It doesn’t have to add to the talk (the speaker will take care of that; not me!) so I just try to get people’s attention before the speaker begins. There is some research showing that people pay more attention after they laugh; so I aim for “funny” instead of “insightful” when introducing speakers. They’ll be the smart ones; not me!

If you’re going to do some live-tweeting of the session (or even just a few pictures), make sure you have their Twitter handle.

Just before the beginning of a talk

If the room is pretty full, and a bunch of people are standing in the back of the room, there are a couple of techniques that you can use to optimize resource utilization.

(1) You can announce that there are available seats and point at them.

(2) Even better: you can invite people seated near the aisles to shift away from the aisles. This will let people find available seats easily, without having to disturb a bunch of attendees to find a spot in the middle of a row. (This technique was taught to me by the amazing Bridget Kromhout!)

A couple of minutes before the designated time, go on stage (if you’re not there already). Welcome the audience. If people are talking and chatting, don’t worry: when you start speaking with the Holy Microphone, they’ll go quiet. (I never had to do “Shhhh!” or ask people to stop talking so far!) Then, introduce the speaker. Their name, their title, the topic of their talk; the small anecdote about them or their subject … And it’s off to them!

Upping the social media game: if you feel like it, and if the speaker is OK with it, now is the moment to take a great selfie with the speaker and the audience in the background!

During the talk

Now is the moment to listen to the talk and tweet stuff. Insightful quotes, unexpected numbers or results, hot takes, summary slides… Here are a few tips and tricks that I use.

Use both your phone and laptop to tweet! The laptop is better for “text-only” tweets (because it’s much faster to type with a real keyboard), and the phone will be great for pictures.
Always include the conference hashtag and the speaker’s @ handle.
Remember that you can put multiple pictures in a tweet. If a few slides go together, you can tweet them in a single tweet; or you can also tweet side-by-side a picture of a slide, and a picture of the speaker.
Sometimes you can save a few characters by tagging the speaker on their picture, rather than mentioning them in text.
Likewise, if at some point you need to mention a bunch of people or organizations, you could tag them all on a picture.
Use threading when necessary.
If you did a talk rehearsal, you will probably be able to anticipate the moments that are “photo worthy.” Use this to your (and your audience’s) advantage.
If you have a copy of the slide deck, you can use it to tweet captures of the slides (instead of blurry pics). If you’re using Linux, “scrot -s” is your friend; on OSX, Shift-Control-Command-4 will let you select a portion of the screen and copy it to the clipboard. You can then directly paste it (Command-V) when composing a new tweet.

Of course, you don’t have to use all these tips all the time. But I guarantee that they’ll come in handy!

Last thing on the social media side: check regularly for tweets tagged with the conference hashtag. You can do that very easily, even if you don’t use a fancy Twitter client. Just enter the conference hashtag in Twitter search, and switch to “live.” Then, whenever you see an interesting tweet (related to the session where you are right now or anything else), retweet it for reach. Even if you don’t tweet much yourself, retweeting other folks ends up having a significant impact overall.

The end is near

Five minutes before the end of the talk, if you see that the speaker will probably run over, it’s time to flash them the helpful “FIVE MINUTES” sign that you had printed in huge letters beforehand. Oh, you forgot to print it? So did I. Every single time. Instead, I take my favorite editor, put it full screen, and type “5 MINUTES” with a huge font. Then I flash that to the speaker. It usually works.

Q&A

If there is a Q&A, you might have some housekeeping to do!

The easiest scenario is when there are mics on mic stands for the Q&A. People will then line up behind the mic, and all you will have to do is to intervene if there is a risk of exceeding the allotted time. (“We are running out of time, but don’t hesitate to reach out to the speaker after the talk!”)

If there is no mic at all for the Q&A, make sure that the speaker repeats the questions (especially if there is a recording).

If the AV team gave you 1 or more extra mics for the Q&A, if you are able-bodied, now is the perfect time to get some exercise! You can run with the mic to hand it to whoever is raising their hand. If your mobility is reduced, or if you don’t want to run around, get someone to carry the mic for you, or keep the mic but repeat the questions yourself.

After the talk

Thank the speaker, ask the audience to give one more round of applause. It’s a good idea to tell the audience how long it is until the next talk. If your conference has a rating system for talks, remind people to use it.

Some conferences want to have the speakers’ slides. Some speakers don’t turn in their slides ahead of time. As a result, at some conferences, somebody shows up right at the end of the talk to ask the speaker to copy their slides on a USB stick. Give your speakers some time to breathe!

Parting words

I gave a ton of ideas, hints, things to do; but of course, not everything will apply to your conference. You might or might not be comfortable with some of the things I mentioned here. So feel free to adapt as much as needed!

Train people well enough

2017-04-26T00:00:00+00:00

I’d like to tell you a short story illustrating why training your employees is crucial to the success of your organization.

I was born and raised in France, and worked there until 2010. Of course, I’m a native French speaker. What about other languages? Well, I could vaguely get around in German, and my written English was pretty good. So good, in fact, that most people with whom I was interacting (through emails or instant messaging) could easily mistake me for a native English speaker. My spoken English was a very different story, though. We’ll get back to that soon enough!

In 2011, I moved to San Francisco to join dotCloud, the startup that eventually became Docker. We were 5 or 6 engineers in a coworking space, Founder’s Den, at 625 3rd Street in San Francisco. My extremely thick French accent did not get too much in the way when working with dotCloud founders Solomon Hykes and Sebastien Pahl, who were both perfectly fluent in French and English (and German as well for Sebastien); or with my fellow compatriot Sam Alba. (I have to give props to Mark Erdmann, who was the only one who didn’t speak French in the office back then. Thanks for keeping up, dude!)

One day, Solomon’s sister visited our office and filmed us at work. She was shooting a documentary about tech startups, and interviewed all of us. That’s how I realized that my French accent back then was, to put things mildly, not awesome.

Countless people (including our investors–true story!) had told me multiple times that I should “absolutely not try to change it,” but it turns out that said accent was so thick, that Solomon’s sister had to add subtitles when I was speaking. Ouch.

I don’t know if that was related, but later on, Solomon encouraged us to take English lessons.

Multiple times a week, an English teacher (paid by the company) would come to the office, and that’s how I learned that the words law and low are, in fact, pronounced very differently (shocker!).

I still have the thick French accent, but people can understand me more easily now. In 2013, when dotCloud became Docker, the SRE team that I was managing was reduced to a huge team of one, to borrow the words of my amazing coworker Kristie Howard. I considered switching gigs. But I accidentally submitted a talk about containers at the SCALE conference; and after that talk, I was asked to do a repeat in Beijing, and then in Moscow. My speaker career really took off, and I gave up to 100 talks per year about Docker. (We can have a conversation about whether that was a sane, healthy thing to do; but that’ll be in a later post!)

Without these English lessons, I wouldn’t have been able to speak at so many anglophone conferences and meetups. Because even if his accent is “cute,” you don’t want a barely intelligible French dude to speak at your conference.

Instead of becoming Docker’s first evangelist, I’d probably have continued to build infrastructure and fling requests at cloud API endpoints (at Docker or elsewhere).

At the end of the day, these modest English lessons had a huge impact for me. But what was the value-add for the company? Well, I said it one paragraph above. At a moment when Docker needed adoption, traction, and to build a community, here I was, a passable speaker but with a deep knowledge of the product and the tech behind it. If you can put a dollar amount on this, let me know; but without overstating my achievements, I want to believe that the return on investment for Docker was tremendous. Bigly.

I’m going to conclude with this joke, that most of you probably know already:

– CFO asks CEO: What happens if we spend money training our people and then they leave?
– CEO: What happens if we don’t and they stay?

And if you want to say the same things but with the class and words of Richard Branson:

“Train people well enough so they can leave, treat them well enough so they don’t want to.”

I’d like to thank AJ Bowen for proofreading this post. All remaining mistakes and typos are my own. By the way, are you looking for a Pythonista who is also fluent in Go? Somebody with outstanding interpersonal and communication skills; willing and able to write properly documented code; somebody with great attention to details, as in “my CLIs have bash completion”? AJ is looking for a remote job. Get in touch with me so we can discuss my referral fees ;-)

I’d also like to thank Kristie Howard, who suggested a few changes and improvements to that post, and contributed to my never-ending English education. You’re the best!

From dotCloud to Docker

2017-02-24T00:00:00+00:00

Have you heard about dotCloud? If you haven’t, I’m going to give you a hint: it is a PaaS company. Another hint: eventually, dotCloud open-sourced their container engine. That container engine became Docker.

This is a quasi-archeological account of some of the early design decisions of dotCloud, some of which have shaped how Docker is today (and how it is not). “How is this relevant to my interests?” you ask. If you are not using containers, and not planning to, ever, then this article will not be very useful to you. Otherwise, I hope that you can learn a lot from our past successes and failures. At the very least, you will understand why Docker was built this way.

This was initially published as a guest post on Taos’ blog. I would like to thank Julie Gunderson for inviting me to share this with Taos’ audience!

Also, if you want to know more about the early story of dotCloud and Docker, you should read 5 years at Docker by my coworker Ken Cochrane. It’s pretty dope and covers lots of things that I didn’t mention here.

First of all, a disclaimer

Don’t consider this as a set of guidelines, recommendations, or whatever. It’s important to keep in mind that when dotCloud was created (and for quite a while!), things were very different:

EC2 had just rolled out support for custom kernels
EBS had major outages at least once a year
Linux 3.0 wasn’t out yet
Consul and etcd didn’t exist
Go 1.0 hadn’t been released yet

This might help to get some perspective on some of our technical choices.

Take everything I say here with a grain of salt. I no longer have access to the original dotCloud code, and while I knew that codebase pretty well, I don’t have an eidetic memory and it’s very possible (and even likely) that I misremember a few things. If you were there, and think that I got something wrong, let me know! I’ll be happy to fix it.

And then, a short page of boring history

At $STARTUP_NAME, we always knew that containers were the future, and we were using them before they were cool! We are true containhipsters and we are glad that everybody else is finally seeing the blinding light that we saw decades ago!

This is not a cheap shot at Bryan Cantrill, for whom I have an inordinate amount of respect and admiration. Sometimes I wish I had been born earlier (and also smarter), and got a chance to work on Solaris. Now that’s a decent OS (even if the userland is extremely picky about who it makes friends with), and when the Joyent crew runs containers, they’re not messing around. (If you want a taste of what it’s like to run containers in production like a boss, check out this talk, you’ll see what I’m talking about.)

But no Solaris for me! Instead, a friend whose hair and beard could rival Stallman’s gave me a Slackware CD in the mid-90s, and I’ve been stuck with Linux ever since. (I tried FreeBSD once. I managed to crash the installer and then went on to file one of the most inane bug reports ever.)

Fast forward to 2008, when fellow hacker Solomon Hykes gives me (and others) a demo of dotCloud. Back then, dotCloud was a CLI tool that let you author container images, move them around, and easily instantiate them on multiple machines. That demo was honestly pretty similar to the one in Solomon’s lightning talk at PyCon in 2013, but the tech behind it was very different. And for a good reason: what Solomon demo’ed in 2013 was the result of 5+ years of trial, error, and learning hard truths the hard way.

Flintstone’s Docker

This is what our first containers looked like

The dotCloud container engine (the ancestor of Docker) started as a Python CLI tool called dc. (Yes, we knew that it conflicted with the old-school desk calculator program. No, we didn’t care.)

dc acted as a frontend to LXC and AUFS. Specifically, dc could:

manage container images (pull/push them from/to a registry),
create a container using one of these images (leveraging AUFS copy-on-write),
configure the container, by allowing any file to be generated from a template (for instance, putting the correct IP address in /etc/network/interfaces),
start the container, by automatically creating its LXC configuration file and invoking lxc-start,
dynamically expose ports, by managing a set of iptables rules,
and a few other cool things.
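To give an idea of what “generating a file from a template” can look like, here is a minimal sketch using sed. I don’t remember the template syntax that dc actually used, so the @IP@ and @NETMASK@ placeholders, the render_template helper, and the addresses below are all made up for the sake of illustration.

```shell
# render_template TEMPLATE IP NETMASK: print TEMPLATE with placeholders filled in.
# The @IP@ / @NETMASK@ placeholder syntax is hypothetical, not dc's actual format.
render_template() {
  sed -e "s/@IP@/$2/g" -e "s/@NETMASK@/$3/g" "$1"
}

# A template for a Debian-style /etc/network/interfaces file.
template=$(mktemp)
cat > "$template" <<'EOF'
auto eth0
iface eth0 inet static
    address @IP@
    netmask @NETMASK@
EOF

# Render it for one (made-up) container address.
rendered=$(render_template "$template" 10.1.2.3 255.255.0.0)
echo "$rendered"
rm -f "$template"
```

In a real setup, the rendered output would be written into the container’s filesystem before starting it, rather than printed.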

This is what interacting with dc looked like. Keep in mind that I haven’t used dc in 3 years and I don’t have the code anymore, so this is only approximate.

# pull an image
dc image_pull ubuntu@f00db33f
# create a container
dc container_create ubuntu@f00db33f jerome.awesome.ubuntu
# start the container
dc container_start jerome.awesome.ubuntu
# enter the container, like with "docker exec"
dc container_enter jerome.awesome.ubuntu
# there are a bunch of commands to manage port mappings
# the following one will allocate a random port
dc container_connection_add jerome.awesome.ubuntu tcp 80
# check which port was allocated
dc network_ls

So far, so good.

There are a few profound differences, however, between dc and modern Docker.

Image format

We wanted to be able to accurately track and audit changes made to containers, and possibly “transplant” them (e.g. when a new release of Ubuntu comes out, run your application on that new release without rebuilding it). It sounds like a good idea at first! We stored images in Mercurial repositories, using the metashelf extension to track special files and permissions. This means that image operations were slow. Furthermore, it turns out that you can’t “rebase” a filesystem image like you would rebase a bunch of source commits. It kind of works as long as you’re only changing configuration files; but it’s useless if these configuration files are generated from templates anyway. And it doesn’t work at all for binaries or bytecode.

As a result, authoring images was a slow, bulky process, requiring some extra tooling to be done efficiently. One of us (Louis Opter if I remember correctly) was generally in charge of updating the dotCloud official images; and everybody else hoped that they’d never have to do it.

That’s why Docker just used AUFS layers as-is. It was good enough, it was fast, and since an AUFS layer is just a bunch of files masking their counterparts in the original image, it means that you can still get the list of modifications very easily.
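
To make “you can still get the list of modifications very easily” concrete, here is a rough sketch that walks a layer directory. It relies on the AUFS convention that a deleted file is recorded as a whiteout entry named `.wh.<filename>`; the `layer_changes` helper and its output format are made up for this example, and it only looks at files (opaque-directory markers and other `.wh..wh.*` bookkeeping entries are skipped).

```python
import os

WHITEOUT_PREFIX = ".wh."


def layer_changes(layer_dir: str) -> list[tuple[str, str]]:
    """Return sorted (kind, relative_path) pairs for the files in one layer.

    kind is "deleted" for whiteouts and "added-or-modified" otherwise.
    """
    changes = []
    for root, _dirs, files in os.walk(layer_dir):
        rel_dir = os.path.relpath(root, layer_dir)
        for name in files:
            if name.startswith(WHITEOUT_PREFIX + WHITEOUT_PREFIX):
                # AUFS bookkeeping (e.g. .wh..wh.aufs), not a user change.
                continue
            if name.startswith(WHITEOUT_PREFIX):
                hidden = name[len(WHITEOUT_PREFIX):]
                path = os.path.normpath(os.path.join(rel_dir, hidden))
                changes.append(("deleted", path))
            else:
                path = os.path.normpath(os.path.join(rel_dir, name))
                changes.append(("added-or-modified", path))
    return sorted(changes)
```

No repository, no diffing of two trees: the layer itself is the changelog.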

Dependency on AUFS

The dotCloud platform ran for about 5 years, and used AUFS all along. I’m going to be brutally honest: no other option would have worked for us. BTRFS used too much memory (and still does), because multiple containers running the same image lead to page cache duplication. Device Mapper thin provisioning didn’t exist (and has the same memory issues anyway). Ditto for ZFS. The other union filesystems (UnionMount, anyone?) were hardly maintained, and had tons of edge cases.

That’s why we used AUFS. It had its quirks, but for our use case, it worked beautifully. It allowed us to pack hundreds of containers on instances with 32 GB of RAM. It ran flawlessly everything we threw at it, except MongoDB (something to do with fancy mmap semantics), which prompted us to introduce volumes.

When we rolled out the first versions of Docker, we knew that the dependency on AUFS would eventually become an issue. We were particularly lucky, in the sense that it became an issue after Docker got enough traction to convince Red Hat to do a lot of the hard work involved to bring Docker to mainstream kernels. That’s how Alexander Larsson (and later, Vincent Batts) ended up writing the Device Mapper and BTRFS “graph drivers.” On the Docker side, core maintainers Guillaume Charmes, Michael Crosby, Victor Vieux and Solomon Hykes himself did the heavy lifting required to modularize that part of the Docker Engine.

Entering a running container

Back in the day, you couldn’t do the equivalent of docker exec or nsenter, because they both rely on the setns() system call. That system call appeared in Linux 3.0, in 2011. So how did we manage, then?

If you have used LXC, you might remember lxc-attach, which gives you a console on a running container. It could have worked, but we found it rather capricious. It was acceptable if you just wanted to get a terminal in a runaway container; but you couldn’t depend on it as a remote command execution engine to setup database replication, for instance. It was conceptually closer to a serial console.

This leaves you with two options:

patch your kernel to add support for the setns() system call;
run an SSH server in your containers.

We did both. We had an abstract execution engine that would use setns() when available, and fall back to SSH otherwise.
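
The shape of such an engine can be sketched as a function that picks a transport and returns the command to run. Everything here is hypothetical (the original code is gone): `build_exec_argv`, `init_pid`, and `ssh_host` are invented names, and today’s nsenter stands in for whatever the patched kernels exposed back then.

```python
import shlex


def build_exec_argv(command, *, init_pid=None, ssh_host=None, setns_available=False):
    """Return the argv that would run `command` inside a container.

    Prefers the setns() path; falls back to the container's SSH server.
    """
    if setns_available and init_pid is not None:
        # nsenter(1) joins the namespaces of the target PID through setns().
        return [
            "nsenter", "--target", str(init_pid),
            "--mount", "--uts", "--ipc", "--net", "--pid", "--",
        ] + list(command)
    if ssh_host is not None:
        # SSH takes a single command string, so quote the arguments.
        return ["ssh", f"root@{ssh_host}", shlex.join(command)]
    raise RuntimeError("no way to enter this container")
```

Callers never need to know which path was taken, which is what made it possible to boot an unpatched kernel in an emergency without touching the code that depended on remote execution.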

This means that our containers were all running an SSH server. I was a huge fan of this SSH server, by the way, because it allowed me to do all kinds of cool hacks. This may come as a surprise, especially when one knows that I wrote this blog post, but that merely demonstrates my ability to change my mind, amirite?

Why have both? Because we wanted the performance and convenience of setns(), but we didn’t want to rely on it and be forced to stick to an older kernel if a wild kernel vulnerability appeared.

The container daemon itself

Since containers are managed by LXC, you don’t need a long-running daemon (and at this point there was no container engine per se). In fact, if you scratch the surface, you realize that each container has its own long-running daemon: it’s lxc-start (it’s similar to rkt or runc) and you connect to it using an abstract socket (from memory, @/var/lib/lxc/<containername>).

This is great, because it’s simple. At least, it seems simple. Each container was fully contained (so to speak) within /var/lib/dotcloud/<containername>, so you could move a container simply by copying that directory to another machine. Of course, copying this directory while the container is running requires extra precautions; but there was something satisfying and UNIX-y in the fact that a container was just a directory, after all.

Of course we couldn’t help but build our own RPC layer

Perfect, we have our dc tool on our container nodes; now we need to slap an API on top of that to orchestrate deployments from a central place. Since containers are standalone, the process exposing that API doesn’t have to be bullet-proof, and you can update/upgrade/restart it without being worried about your containers being restarted.

Almost all the communication between processes and hosts was done using ZeroRPC. ZeroRPC is basically RPC over ZeroMQ, using MessagePack to serialize parameters, return values, and exceptions. MessagePack is similar to JSON, but way more efficient. (We didn’t care much about efficiency except for the high-traffic use cases like metrics and logs.)

If you’re curious about ZeroRPC, I presented it at PyCon a few years ago. Unfortunately, my French accent was a few orders of magnitude thicker than it is today (which says a lot) so you might struggle to understand me, sorry ☹

ZeroRPC allowed us to expose almost any Python module or class like this:

# Expose Python built-in module "time" over port 1234
zerorpc-server --listen tcp://0.0.0.0:1234 time &
# Call time.sleep(4)
zerorpc-client tcp://localhost:1234 sleep 4

ZeroRPC also supports some fan-out topologies, including broadcast (all nodes receiving the function call; return value is discarded) and worker queue (all nodes subscribe to a “hub;” you send a function call to the hub, one idle worker will get it, so you get transparent load balancing of requests).

The original ZeroRPC was synchronous, but François-Xavier Bourlet implemented an asynchronous version (making use of coroutines), as well as “streaming” — basically, the ability for a function to return an iterator/generator, very useful for logs and live metrics! Andrea Luzzardi also implemented the zerotracer, which allowed us to get full traces of API calls using transparent middlewares. But I digress.

Let’s sprinkle micro-services all over

So here we are, with a “containers” service running on each node, letting us do the following operations from a central place:

create containers
start/stop/destroy them

Listing containers (and gathering core host metrics) relied on a separate service called “hostinfo.” This service would just scan all the containers deployed locally, aggregate their status, and send it all to a central place.

So thanks to “hostinfo” we can also list all containers from that central place. Cool.

In the very first versions, dotCloud was building your apps “in place,” i.e. when you push your code, the code would be copied to a temporary directory in the container (while it’s still running the previous version of your app!), the build would happen, then a switcheroo happens (a symlink is updated to point to the new version) and processes are restarted.
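
The symlink “switcheroo” is a classic trick, and a minimal version looks like the sketch below. The `switch_release` helper and the `.tmp` suffix are assumptions for illustration; the post doesn’t show how the dotCloud build system actually did it.

```python
import os


def switch_release(current_link: str, new_release_dir: str) -> None:
    """Atomically repoint the `current_link` symlink at `new_release_dir`."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # leftover from an interrupted switch
    os.symlink(new_release_dir, tmp_link)
    # rename(2) replaces the old link in a single step, so readers see
    # either the previous release or the new one, never a missing link.
    os.replace(tmp_link, current_link)
```

The processes still have to be restarted afterwards, since anything already running keeps its files open in the old directory.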

To keep things clean and simple, this build system was managed by a separate service, that directly accessed the container data structures on disk. So we had the “container manager,” “hostinfo,” and the “build manager,” all accessing a bunch of containers and configuration files in the same directory (/var/lib/dotcloud).

Then we added support for separate builds (probably similar to Heroku’s “slugs”). The build would happen in a separate container; then that container image would be transferred to the right host, and a switcheroo would happen (the old container is replaced by the new one).

We had the equivalent of volumes, so by making sure that the old and new containers were on the same host, this process could be used for stateful apps as well. This, by the way, was probably a Very Bad Idea, as ditching stateful apps would have simplified things immensely for us. Keep in mind, though, that we were running not only web apps but also databases like MySQL, PostgreSQL, MongoDB, Redis, etc. I was one of the strong proponents of keeping stateful containers on board, and in retrospect I was almost certainly wrong, since it made our lives way more complicated than they could have been. But I digress again!

To keep things simple and reduce impact to existing systems (at this point, we had a bunch of customers that each already generated more than $1K of monthly revenue, and we wanted to play safe), when we rolled out that new system, it was managed by another service. So now on our hosts we had the “container manager,” “hostinfo,” the “build manager” (for in-place builds), and the “deploy manager.”

(Small parenthesis: we didn’t transfer full container images, of course. We transferred only the AUFS rw layer; so that’s the equivalent of a two-line Dockerfile doing FROM python-nginx-uwsgi and RUN dotcloud-build.sh then pushing the resulting image around.)

Then we added a few extra services also accessing container data; in no specific order, there was a remote execution manager (used e.g. by the MySQL replication system), a metrics collector, and a bunch of hacks to work around EC2/EBS issues, kernel issues, the out-of-memory killer, etc.; for instance, in some scenarios, the OOM killer would leave the container in a weird state and we would need a few special operations to clean it up. In the early days this was manual ops work, but as soon as we had enough data it was automated.

The process tree on a container node looked like this:

- init -+- container
        +- hostinfo
        +- runner
        +- builder
        +- deployer
        +- metrics
        +- oomwrangler
        +- someotherstuff
        +- lxc-start for container X -+- process of container X
        |                             \- other process of container X
        +- lxc-start for container Y --- process of container Y
        \- lxc-start for container Z --- process of container Z

So at this point we have a bunch of services accessing a bunch of on-disk structures. Locking was key. The problem is that some operations are slow, so you don’t want to lock when unnecessary (e.g. you don’t want to lock everything while you’re merely pulling an image). Some operations can fail gracefully (e.g. it’s OK if metrics collection fails for a few minutes). Some operations are really important and you absolutely want to know if they went wrong (e.g. the stuff that watches over MySQL replica provisioning). Sometimes it’s OK to ignore a container for a bit (e.g. for metrics) but sometimes you absolutely want to know if it’s there (because if it’s not, a failover mechanism will spin it up somewhere else; so having containers disappearing in a transient manner would be bad).
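
To give an idea of what “using the locking primitives correctly” involves, here is a sketch of a per-container lock with the two behaviors described above: wait for the lock when the operation matters, or give up immediately when skipping the container is acceptable. The post doesn’t say which primitive these services used; flock(2) advisory locks, the `.lock` file name, and the `container_lock` helper are assumptions.

```python
import fcntl
import os
from contextlib import contextmanager


class ContainerBusy(Exception):
    """Raised when a non-blocking attempt finds the container locked."""


@contextmanager
def container_lock(container_dir: str, *, exclusive: bool = True, wait: bool = True):
    """Hold an advisory lock on one container directory."""
    lock_path = os.path.join(container_dir, ".lock")  # hypothetical file name
    flags = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not wait:
        flags |= fcntl.LOCK_NB
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError as exc:
            raise ContainerBusy(container_dir) from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A metrics collector would use `exclusive=False, wait=False` and shrug off `ContainerBusy`, while a deploy would take the exclusive lock and wait. The catch is that the lock is only advisory: one script that touches the directory without asking is enough to break everyone’s assumptions.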

To spice things up further, our ops toolkit was based on the dc CLI tool, so that tool had to play nice with everything else.

Still with us? Get ready for another episode of “embarrassing early start-up decisions.”

Dalek says, ORCHESTRATE

When your container platform runs 10 containers on a handful of nodes, you can place them manually; especially if you don’t create or resize containers all the time.

But when you have thousands of containers (dotCloud peaked above 100,000 containers) running across hundreds of nodes, and your users constantly deploy and scale services, you need an orchestrator. More specifically, you need to automate resource scheduling.

In dotCloud’s case, we wanted to be able to make an API call to create a container, with the following parameters:

some unique identifier for this container (combination of owner, app, service…)
container base image and some templating parameters
resources needed (mainly RAM)
a “high availability token” which was used to prevent two containers with the same token from running on the same host (e.g. the leader and follower of a replicated database)

The API call should pick a machine to run the container, while honoring the various constraints specified in the call (available resources and HA token).
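
Taken in isolation, those constraints fit in a few lines. The sketch below is the textbook version, with invented names (`Node`, `place`) and a best-fit rule chosen purely for illustration; it also assumes that the caller already knows the exact state of every node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_ram_mb: int
    ha_tokens: set = field(default_factory=set)


def place(nodes, ram_mb, ha_token=None):
    """Pick a node with enough RAM that doesn't already hold `ha_token`.

    Returns the chosen node (after reserving the resources), or None.
    """
    candidates = [
        node for node in nodes
        if node.free_ram_mb >= ram_mb
        and (ha_token is None or ha_token not in node.ha_tokens)
    ]
    if not candidates:
        return None
    # Best fit: leave as little unused RAM as possible on the chosen node.
    best = min(candidates, key=lambda node: (node.free_ram_mb - ram_mb, node.name))
    best.free_ram_mb -= ram_mb
    if ha_token is not None:
        best.ha_tokens.add(ha_token)
    return best
```

The logic is the easy part. The hard part is the `nodes` argument: getting an accurate, up-to-date list of what every machine is running.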

And by “orchestration” I mean “scheduling”

In theory, any good CS grad student will tell you that this seems like a perfectly good case to use some bin packing algorithm.

In practice, anybody who has worn a pager long enough knows that network latency and packet loss are both non-zero quantities, and that therefore, we are facing a distributed systems problem (aka potential nuclear waste dumpster fire).

Most “standard” algorithms assume that you know the full state of the cluster when taking a scheduling decision. But in our scenario, you don’t know the state of the cluster. You have to query each machine. The request has to go over the network, and then the machine has to read the state of all its containers before replying. Both operations (network round-trip and gathering container state) can and will take some time. Using aggressive timeouts (to avoid waiting forever for unreachable nodes) gets problematic when a host is very loaded (and takes a while to gather container state).

“Caching,” I heard someone say. Excellent idea! Caching is easy. Cache invalidation, however, is one of the hardest things in computer science. How, why, when would we have to invalidate the cache? Whenever some other system (other than the scheduler) makes changes to the containers. We ended up having lots of cluster maintenance tasks to move containers around, resize them, regroup them (if you regroup containers using the same image, you realize huge memory savings). These operations were implemented by relatively simple scripts, relying on the fact that each container was fully contained in a directory. To preserve these semantics, you need to somehow watch all your container configurations for changes, and trigger cache invalidation events when changes happen. Alternative option: implement these operations with the scheduler. That was not realistic. We were constantly expecting the unexpected (AWS “degraded performance” and “elevated error rates,” some customer spiking 100x, 1 Mpps distributed denial-of-service attacks, etc.) so we needed the ability to cobble together solutions fast, without breaking too many things. A central scheduler was out, at least in the beginning.

Distributed, robust, suboptimal scheduling

Our scheduler would just broadcast the request to a subset of nodes, and place it in a retry queue. These nodes would try to acquire a lock (implemented by a centralized Redis). The one acquiring the lock would carry on and deploy your container. Once the container is up and running, another system takes care of removing it from the retry queue (which gets rebroadcast once in a while).
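
Here is a toy, single-process model of that flow. It is not the dotCloud code: `LockService` is an in-memory stand-in for the central Redis (with “set if absent” semantics), the broadcast is a plain loop, and all the names are invented.

```python
import random
import threading


class LockService:
    """In-memory stand-in for the central Redis used for locking."""

    def __init__(self):
        self._owners = {}
        self._mutex = threading.Lock()

    def acquire(self, key, owner):
        """Return True if `owner` holds the lock for `key` after this call."""
        with self._mutex:
            return self._owners.setdefault(key, owner) == owner


def schedule(request_id, nodes, locks, fanout=3, rng=random):
    """Broadcast to a random subset of nodes; return the lock winner, if any."""
    targets = rng.sample(nodes, min(fanout, len(nodes)))
    winners = [node for node in targets if locks.acquire(request_id, node)]
    return winners[0] if winners else None


def rebroadcast(retry_queue, running, nodes, locks, fanout=3, rng=random):
    """One pass over the retry queue; requests whose container is up are dropped."""
    remaining = []
    for request_id in retry_queue:
        if request_id in running:
            continue
        schedule(request_id, nodes, locks, fanout, rng)
        remaining.append(request_id)
    return remaining
```

Nobody needs a global view of the cluster: a node that is down or overloaded simply never claims the lock, and the request comes around again on the next pass.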

That’s an extremely naive algorithm, but it’s also very resilient. By the way, some of the fastest search algorithms work by scattering your request on multiple nodes and gathering only the first (fastest) replies, to make sure that you get a good response time (at the expense of correctness if some nodes are overloaded or down).

The main single point of failure was the Redis used for locking, and even that one was not a big deal since it didn’t really store anything. (We always meant to replace it with Zookeeper or Doozer, but it never turned out to be worth the pain!)

If this “algorithm” makes you cringe, that’s fair. We weren’t particularly proud of it, and we wanted our next container engine to support better semantics.

Summoning daemons

At this point, we really dreamed of a single point of entry to the container engine, to avoid locking issues. At the very least, all container metadata should be mediated by an engine exposing a clean API. We had a pretty good idea of what was needed, and that’s what shaped the first versions of the Docker API.

The first versions of Docker were still relying on LXC. The process tree on the container nodes would have looked like this:

- init -+- container
        +- hostinfo
        +- runner
        +- builder
        +- deployer
        +- metrics
        +- oomwrangler
        +- someotherstuff
        +- docker
        +- lxc-start for container X -+- process of container X
        |                             \- other process of container X
        +- lxc-start for container Y --- process of container Y
        \- lxc-start for container Z --- process of container Z

“Waitaminute,” you say, “that’s exactly the same thing as before!”

Yes! But now, all our management processes (container, hostinfo, etc.) would go through “docker” instead of accessing container metadata on disk. No more crazy locking, no more hoping that everybody used the locking primitives correctly instead of accessing stuff directly, etc.

From LXC to libcontainer

Then, as containers picked up steam, LXC development (which was pretty much dead, or at least making very slow progress) came to life, and in a few months, there were more LXC versions than in the few years before. This broke Docker a few times, and that’s what led to the development of libcontainer, which made it possible to program cgroups and namespaces directly, without going through LXC. You could put container processes directly under the container engine, but having an intermediary process helps a lot, so that’s what we did; it was named dockerinit.

The process tree now looked like this:

- init --- docker -+- dockerinit for container X -+- process of container X
                   |                              \- other process of container X
                   +- dockerinit for container Y --- process of container Y
                   \- dockerinit for container Z --- process of container Z

But now you have a problem: if the docker process is restarted, you end up orphaning all your “dockerinits.” For simplicity, docker and dockerinit share a bunch of file descriptors (giving access to the container’s stdout and stderr). The idea was to eventually make dockerinit a full-blown, standalone mini-daemon, able to pass FDs around across UNIX sockets, buffer logs, and do whatever else is needed.

Having a daemon to manage the containers (we’re talking low-level management here, i.e. listing, starting, getting basic metrics) is crucial. I’m sorry if I failed to convince you that it was important; but believe me, you don’t want to operate containers at scale without some kind of API. (Executing commands over SSH is fine until you have more than 10 containers per machine, then you really want a true API ☺)

When the daemon does too much

But at the same time, the Docker Engine has lots of features and complexity: builds, image management, semantic REST API over HTTP, etc.; those features are essential (they are what helped to drive container adoption, while vserver, openvz, jails, zones, LXC, etc. kept containers contained (sorry!) to the hosting world) but it’s totally reasonable that you don’t want all that code near your production stuff.

That’s why in Docker Engine 1.11, we decided to split all the low-level container management functions out into containerd, while everything else stays in the Docker Engine.

You can think of containerd as a simplified Docker Engine. You can do the equivalent of docker ps, docker run, docker kill; but it doesn’t deal with builds, and its API uses gRPC instead of REST.

The process tree looks like this:

- init - docker - containerd -+- shim for container X -+- process of container X
                              |                        \- other process of container X
                              +- shim for container Y --- process of container Y
                              \- shim for container Z --- process of container Z

The big upside (which doesn’t appear on the diagram) is that the link between docker and containerd can be severed and reestablished, e.g. to restart or upgrade the Docker Engine. (This can be achieved with the live restore configuration option.)

Going full circle

The dotCloud container engine started as a simple, standalone CLI tool. It was augmented with a collection of “sidekick” daemons, each providing a little bit of extra functionality. Eventually, this architecture showed its limits. The first version of the Docker Engine gathered all the features that were deemed necessary in a single daemon. Too many features? I don’t think so; precisely because these features made the success of Docker. The first versions of Docker sacrificed modularity, but that was only temporary. Over time, features were separated from the Docker Engine again. Today, you can use runc or containerd to run containers without the Docker Engine. Clustering features are provided by SwarmKit. External image builders are available, e.g. dockramp or box.

Two years ago, Docker committed to the motto “batteries included, but swappable.” It’s still doing exactly that: providing what most people need to build, ship, and run containerized apps, but giving an increasing number of options to remove whatever you don’t need or don’t like. And it all started with some really, really embarrassing container management code in Python, almost 10 years ago!

Adventures in GELF

2017-01-20T00:00:00+00:00

If you are running apps in containers and are using Docker’s GELF logging driver (or are considering using it), the following musings might be relevant to your interests.

Some context

When you run applications in containers, the easiest logging method is to write on standard output. You can’t get simpler than that: just echo, print, write (or the equivalent in your programming language!) and the container engine will capture your application’s output.

Other approaches are still possible, of course; for instance:

you can use syslog, by running a syslog daemon in your container or exposing a /dev/log socket;
you can write to regular files and share these log files with your host, or with other containers, by placing them on a volume;
your code can directly talk to the API of a logging service.

In the last scenario, this service can be:

a proprietary logging mechanism operated by your cloud provider, e.g. AWS CloudWatch or Google Stackdriver;
provided by a third-party specialized in managing logs or events, e.g. Honeycomb, Loggly, Splunk, etc.;
something running in-house, that you deploy and maintain yourself.

If your application is very terse, or if it serves very little traffic (because it has three users, including you and your dog), you can certainly run your logging service in-house. My orchestration workshop even has a chapter on logging which might give you the false idea that running your own ELK cluster is all unicorns and rainbows, while the truth is very different and running reliable logging systems at scale is hard.

Therefore, you certainly want the possibility to send your logs to somebody else who will deal with the complexity (and pain) that comes with real-time storing, indexing, and querying of semi-structured data. It’s worth mentioning that these people can do more than just managing your logs. Some systems like Sentry are particularly suited to extract insights from errors (think traceback dissection); and many modern tools like Honeycomb will deal not only with logs but also any kind of event, letting you crossmatch everything together to find out the actual cause of that nasty 3am outage.

But before getting there, you want to start with something easy to implement, and free (as much as possible).

That’s where container logging comes in handy. Just write your logs on stdout, and let your container engine do all the work. At first, it will write plain boring files; but later, you can reconfigure it to do something smarter with your logs, without changing your application code.

Note that the ideas and tools that I discuss here are orthogonal to the orchestration platform that you might or might not be using: Kubernetes, Mesos, Rancher, Swarm … They can all leverage the logging drivers of the Docker Engine, so I’ve got you covered!

The default logging driver: `json-file`

By default, the Docker Engine will capture the standard output (and standard error) of all your containers, and write them in files using the JSON format (hence the name json-file for this default logging driver). The JSON format annotates each line with its origin (stdout or stderr) and its timestamp, and keeps each container log in a separate file.
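
Each line of such a file is a standalone JSON object with three keys: `log` (the text, including its trailing newline), `stream`, and `time`. The sketch below parses a made-up two-line sample; `parse_json_file_log` is a hypothetical helper, not part of Docker.

```python
import json

# Made-up sample in the json-file layout: one JSON object per line.
SAMPLE = (
    '{"log":"hello world\\n","stream":"stdout","time":"2017-01-20T12:34:56.000000001Z"}\n'
    '{"log":"oops\\n","stream":"stderr","time":"2017-01-20T12:34:57.000000002Z"}\n'
)


def parse_json_file_log(text):
    """Return a list of (time, stream, message) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        message = record["log"].rstrip("\n")
        entries.append((record["time"], record["stream"], message))
    return entries


only_errors = [entry for entry in parse_json_file_log(SAMPLE) if entry[1] == "stderr"]
```

The timestamps are kept as strings here because they carry nanosecond precision, which older Python versions cannot parse with `datetime.fromisoformat`.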

When you use the docker logs command (or the equivalent API endpoint), the Docker Engine reads from these files and shows you whatever was printed by your container. So far, so good.

The json-file driver, however, has (at least) two pain points:

by default, the log files will grow without bounds, until you run out of disk space;
you cannot make complex queries such as “show me all the HTTP requests for virtual host api.container.church between 2am and 7am having a response time of more than 250ms but only if the HTTP status code was 200/OK.”

The first issue can easily be fixed by giving some extra parameters to the json-file driver in Docker to enable log rotation. The second one, however, requires one of these fancy log services that I was alluding to.

Even if your queries are not as complex, you will want to centralize your logs somehow, so that:

logs are not lost forever when the cloud instance running your container disappears;
you can at least grep the logs of multiple containers without dumping them entirely through the Docker API or having to SSH around.

Aparté: when I was still carrying a pager and taking care of the dotCloud platform, our preferred log analysis technique was called “Ops Map/Reduce” and involved fabric, parallel SSH connections, grep, and a few other knick-knacks. Before you laugh at our antiquated techniques, let me ask you how your team of 6 engineers dealt with the log files of 100,000 containers 5 years ago and let’s compare our battle scars and PTSD-related therapy bills around a mug of tea, beer, or other suitable beverage. ♥

Beyond `json-file`

Alright, you can start developing (and even deploying) with the default json-file driver, but at some point, you will need something else to cope with the amount of logs generated by your containers.

That’s where the logging drivers come in handy: without changing a single line of code in your application, you can ask your faithful container engine to send the logs somewhere else. Neat.

Docker supports many other logging drivers, including but not limited to:

awslogs, if you’re running on Amazon’s cloud and don’t plan to migrate to anything else, ever;
gcplogs, if you’re more a Google person;
syslog, if you already have a centralized syslog server and want to leverage it for your containers;
gelf.

I’m going to stop the list here because GELF has a few features that make it particularly interesting and versatile.

GELF

GELF stands for Graylog Extended Log Format. It was initially designed for the Graylog logging system. If you haven’t heard about Graylog before, it’s an open source project that pioneered “modern” logging systems like ELK. In fact, if you want to send Docker logs to your ELK cluster, you will probably use the GELF protocol! It is an open standard implemented by many logging systems (open or proprietary).

What’s so nice about the GELF protocol? It addresses some (if not most) of the shortcomings of the syslog protocol.

With the syslog protocol, a log message is mostly a raw string, with very little metadata. There is some kind of agreement between syslog emitters and receivers; a valid syslog message should be formatted in a specific way, so that the following information can be extracted:

a priority: is this a debug message, a warning, something purely informational, a critical error, etc.;
a timestamp indicating when the thing happened;
a hostname indicating where the thing happened (i.e. on which machine);
a facility indicating if the message comes from the mail system, the kernel, and such and such;
a process name and number;
etc.

That protocol was great in the 80s (and even the 90s), but it has some shortcomings:

as it evolved over time, there are almost 10 different RFCs to specify, extend, and retrofit it to various use-cases;
the message size is limited, meaning that very long messages (e.g.: tracebacks) have to be truncated or split across messages;
at the end of the day, even if some metadata can be extracted, the payload is a plain, unadorned text string.

GELF made a very risqué move and decided that every log message would be a dict (or a map or a hash or whatever you want to call them). This “dict” would have the following fields:

version;
host (who sent the message in the first place);
timestamp;
short and long version of the message;
any extra field you would like!

At first you might think, “OK, what’s the deal?” but this means that when a web server logs a request, instead of having a raw string like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

You get a dict like that:

{
  "client": "127.0.0.1",
  "user": "frank",
  "timestamp": "2000-10-10 13:55:36 -0700",
  "method": "GET",
  "uri": "/apache_pb.gif",
  "protocol": "HTTP/1.0",
  "status": 200,
  "size": 2326
}

This also means that the logs get stored as structured objects, instead of raw strings. As a result, you can make elaborate queries (something close to SQL) instead of carving regexes with grep like a caveperson.
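
As an illustration, the sketch below does the regex carving once, at the source, and then wraps the result in a GELF-style dict. The helper names are invented. The GELF part follows the 1.1 format as I understand it: `version`, `host`, and `short_message` are the standard fields, and extra fields carry a leading underscore.

```python
import re

LINE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Common Log Format, as in the Apache example above.
CLF = re.compile(
    r'(?P<client>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Turn one access log line into a dict with typed status and size."""
    match = CLF.match(line)
    if match is None:
        raise ValueError("not a Common Log Format line")
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def to_gelf(record, host):
    """Wrap the parsed request in a GELF-style message (extra fields get "_")."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": f'{record["method"]} {record["uri"]} {record["status"]}',
    }
    message.update({f"_{key}": value for key, value in record.items()})
    return message


records = [parse_clf(LINE)]
big_ok_responses = [r for r in records if r["status"] == 200 and r["size"] > 1000]
```

The last line is the whole point: once `status` and `size` are real fields with real types, the query is a comparison instead of a pattern.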

OK, so GELF is a convenient format that Docker can emit, and that is understood by a number of tools like Graylog, Logstash, Fluentd, and many more.

Moreover, you can switch from the default json-file to GELF very easily; which means that you can start with json-file (i.e. not setup anything in your Docker cluster), and later, when you decide that these log entries could be useful after all, switch to GELF without changing anything in your application, and automatically have your logs centralized and indexed somewhere.

Using a logging driver

How do we switch to GELF (or any other format)?

Docker provides two command-line flags for that:

--log-driver to indicate which driver to use;
--log-opt to pass arbitrary options to the driver.

These options can be passed to docker run, indicating that you want this one specific container to use a different logging mechanism; or to the Docker Engine itself (when starting it) so that it becomes the default option for all containers.

(If you are using the Docker API to start your containers, these options are passed to the create call, within the HostConfig.LogConfig structure.)
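
For reference, here is roughly what that part of the create request looks like, built as a Python dict. The sketch only assembles the JSON body and sends nothing; `create_container_body` is a made-up helper. The example options, `max-size` and `max-file`, are the json-file driver’s log rotation settings, and the values are arbitrary.

```python
import json


def create_container_body(image, cmd, log_driver, log_opts):
    """Build the JSON body for a container create call with a log config."""
    return {
        "Image": image,
        "Cmd": list(cmd),
        "HostConfig": {
            "LogConfig": {
                "Type": log_driver,        # what --log-driver sets
                "Config": dict(log_opts),  # what each --log-opt sets
            },
        },
    }


body = create_container_body(
    "alpine",
    ["echo", "hello", "world"],
    "json-file",
    {"max-size": "10m", "max-file": "3"},
)
payload = json.dumps(body, indent=2)
```

Note that the option values are strings, even when they look like numbers.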

The “arbitrary options” vary for each driver. In the case of the GELF driver, you can specify a bunch of options but there is one that is mandatory: the address of the GELF receiver.

If we have a GELF receiver on the machine 1.2.3.4 on the default UDP port 12201, you can start your container as follows:

docker run \
  --log-driver gelf --log-opt gelf-address=udp://1.2.3.4:12201 \
  alpine echo hello world

The following things will happen:

the Docker Engine will pull the alpine image (if necessary)
the Docker Engine will create and start our container
the container will execute the command echo with arguments hello world
the process in the container will write hello world to the standard output
the hello world message will be passed to whoever is watching (i.e. you, since you started the container in the foreground)
the hello world message will also be caught by Docker and sent to the logging driver
the gelf logging driver will prepare a full GELF message, including the host name, the timestamp, the string hello world, but also a bunch of information about the container, including its full ID, name, image name and ID, environment variables, and much more;
this GELF message will be sent through UDP to 1.2.3.4 on port 12201.
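To give an idea of what such a message looks like, here is a minimal sketch in Python. It follows the GELF 1.1 convention that extra fields are prefixed with an underscore. The helper name build_gelf_message and the field values are made up for this example, and the exact set of fields that Docker sends may differ.

```python
import json
import time


def build_gelf_message(host, text, container_fields):
    """Return a GELF 1.1 payload as bytes (plain, uncompressed JSON)."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": text,
        "timestamp": time.time(),
    }
    # GELF requires additional fields to be prefixed with an underscore.
    for key, value in container_fields.items():
        message["_" + key] = value
    return json.dumps(message).encode("utf-8")


payload = build_gelf_message(
    "node1",  # hypothetical host name
    "hello world",
    {"container_name": "example", "image_name": "alpine"},  # hypothetical values
)
print(payload.decode("utf-8"))
```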

Then, hopefully 1.2.3.4 receives the UDP packet, processes it, writes the message to some persistent indexed store, and allows you to retrieve or query it.

Hopefully.

I would tell you a UDP joke, but

If you have ever been on-call or responsible for other people’s code, you are probably cringing by now. Our precious logging message is within a UDP packet that might or might not arrive at our logging server (UDP has no delivery guarantees). If our logging server goes away (a nice wording for “crashes horribly”), our packet might arrive, but our message will be silently ignored, and we won’t know anything about it. (Technically, we might get an ICMP message telling us that the host or port is unreachable, but at that point, it will be too late, because we won’t even know which message this is about!)

Perhaps we can live with a few dropped messages (or a bunch, if the logging server is being rebooted, for instance). But what if we live in the Cloud, and our server evaporates? Seriously, though: what if I’m sending my log messages to an EC2 instance, and for some reason that instance has to be replaced with another one? The new instance will have a different IP address, but my log messages will continue to stubbornly go to the old address.

DNS to the rescue

An easy technique to work around volatile IP addresses is to use DNS. Instead of specifying 1.2.3.4 as our GELF target, we will use gelf.container.church, and make sure that this points to 1.2.3.4. That way, whenever we need to send messages to a different machine, we just update the DNS record, and our Docker Engine happily sends the messages to the new machine.

Or does it?

If you have to write some code sending data to a remote machine (say, gelf.container.church on port 12345), the simplest version will look like this:

Resolve gelf.container.church to an IP address (A.B.C.D).
Create a socket.
Connect this socket to A.B.C.D, on port 12345.
Send data on the socket.
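Here is a rough Python sketch of those four steps, using UDP. It sends to a throwaway receiver on 127.0.0.1 so that it can run anywhere without a real logging server; send_once is a made-up helper name, not something from Docker or go-gelf.

```python
import socket


def send_once(hostname, port, data):
    # 1. Resolve the name to an IP address.
    address = socket.gethostbyname(hostname)
    # 2. Create a socket (UDP here).
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # 3. Connect the socket to the resolved address.
        sock.connect((address, port))
        # 4. Send data on the socket.
        sock.send(data)
    finally:
        sock.close()
    return address


# Demo against a local receiver bound to an ephemeral port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)
port = receiver.getsockname()[1]

send_once("127.0.0.1", port, b"hello world")
received, _ = receiver.recvfrom(1024)
receiver.close()
print(received)
```

Note that this version goes through the resolution step on every single call.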

If you must send data multiple times, you will keep the socket open, both for convenience and efficiency purposes. This is particularly important with TCP sockets, because before sending your data, you have to go through the “3-way handshake” to establish the TCP connection; in other words, the 3rd step in our list above is very expensive (compared to the cost of sending a small packet of data).

In the case of UDP sockets, you might be tempted to think: “Ah, since I don’t need to do the 3-way handshake before sending data (the 3rd step in our list above is essentially free), I can go through all 4 steps each time I need to send a message!” But in fact, if you do that, you will quickly realize that you are now stumped by the first step, the DNS resolution. DNS resolution is less expensive than a TCP 3-way handshake, but barely: it still requires a round-trip to your DNS resolver.

Aparté: yes, it is possible to have very efficient local DNS resolvers. Something like pdns-recursor or dnsmasq running on localhost will get you some craaazy fast DNS response times for cached queries. However, if you need to make a DNS request each time you need to send a log message, it will add an indirect, but significant, cost to your application, since every log line will generate not only one syscall, but three. Damn! And some people (like, almost everyone running on EC2) are using their cloud provider’s DNS service. These people will incur two extra network packets for each log line. And when the cloud provider’s DNS is down, logging will be broken. Not cool.

Conclusion: if you log over UDP, you don’t want to resolve the logging server address each time you send a message.

Hmmm … TCP to the rescue, then?

It would make sense to use a TCP connection, and keep it up as long as we need it. If anything horrible happens to the logging server, we can trust the TCP state machine to detect it eventually (because timeouts and whatnot) and notify us. When that happens, we can then re-resolve the server name and re-connect. We just need a little bit of extra logic in the container engine, to deal with the unfortunate scenario where the write on the socket gives us an EPIPE error, also known as “Broken pipe” or in plain English “the other end is not paying attention to us anymore.”
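That “little bit of extra logic” could look roughly like the Python sketch below. This is only an illustration of the idea, not what Docker or go-gelf do; ReconnectingSender is a made-up name, and the connect parameter exists only so that the sketch can be exercised without a network.

```python
import socket


class ReconnectingSender:
    """Keep one TCP connection open; reconnect once if a write fails."""

    def __init__(self, hostname, port, connect=socket.create_connection):
        self.hostname = hostname
        self.port = port
        self._connect = connect  # injectable, so the sketch can be tested offline
        self._sock = None

    def send(self, data):
        if self._sock is None:
            self._sock = self._connect((self.hostname, self.port))
        try:
            self._sock.sendall(data)
        except OSError:  # includes BrokenPipeError (EPIPE) and ConnectionResetError
            self._sock.close()
            # create_connection() resolves the name again before connecting.
            self._sock = self._connect((self.hostname, self.port))
            self._sock.sendall(data)
```

It retries only once: if the second attempt fails as well, the error goes back to the caller, who has to decide whether to drop the message or queue it.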

Let’s talk to our GELF server using TCP, and the problem will be solved, right?

Right?

Unfortunately, the GELF logging driver in Docker only supports UDP.

(╯°□°)╯︵ ┻━┻

At this point, if you’re still with us, you might have concluded that computing is just a specialized kind of hell, that containers are the antichrist, and Docker is the harbinger of doom in disguise.

Before drawing hasty conclusions, let’s have a look at the code.

When you create a container using the GELF driver, this function is invoked, and it creates a new gelfWriter object by calling gelf.NewWriter.

Then, when the container prints something out, eventually, the Log function of the GELF driver is invoked. It essentially writes the message to the gelfWriter.

This GELF writer object is implemented by an external dependency, github.com/Graylog2/go-gelf.

Look, I see it coming, he’s going to do some nasty fingerpointing and put the blame on someone else’s code. Despicable!

Hot potato

Let’s investigate this package, in particular the NewWriter function, the Write method, and the other methods called by the latter, WriteMessage and writeChunked. Even if you aren’t very familiar with Go, you will see that these functions do not implement any kind of reconnection logic. If anything bad happens, the error bubbles up to the caller, and that’s it.

If we conduct the same investigation with the code on the Docker side (with the links in the previous section), we reach the same conclusions. If an error occurs while sending a log message, the error is passed to the layer above. There is no reconnection attempt, neither in Docker’s code, nor in go-gelf’s.

This, by the way, explains why Docker only supports the UDP transport. If you want to support TCP, you have to support more error conditions than UDP. To phrase things differently: TCP support would be more complicated and more lines of code.

Haters gonna hate

One possible reaction is to get angry at the brave soul who implemented go-gelf, or the one who implemented the GELF driver in Docker. Another, better reaction is to be thankful that they wrote that code, rather than no code at all!

Workarounds

Let’s see how to solve our logging problem.

The easiest solution is to restart our containers whenever we need to “reconnect” (technically, resolve and reconnect). It works, but it is very annoying.

A slightly better solution is to send logs to 127.0.0.1:12201, and then run a packet redirector to “bounce” or “mirror” these packets to the actual logger; e.g.:

socat UDP-LISTEN:12201 UDP:gelf.container.church:12201

This needs to run on each container host. It is very lightweight, and whenever gelf.container.church is updated, instead of restarting your containers, you merely restart socat.

(You could also send your log packets to a virtual IP, and then use some fancy iptables -t nat ... -j DNAT rules to rewrite the destination address of the packets going to this virtual IP.)

Another option is to run Logstash on each node (instead of just socat). It might seem overkill at first, but it will give you a lot of extra flexibility with your logs: you can do some local parsing, filtering, and even “forking,” i.e. deciding to send your logs to multiple places at the same time. This is particularly convenient if you are switching from one logging system to another, because it will let you feed both systems in parallel for a while (during a transition period).

Running Logstash (or another logging tool) on each node is also very useful if you want to be sure that you don’t lose any log message, because it would be the perfect place to insert a queue (using Redis for simple scenarios, or Kafka if you have stricter requirements).

Even if you end up sending your logs to a service using a different protocol, the GELF driver is probably the easiest one to set up to connect Docker to e.g. Logstash or Fluentd, and then have Logstash or Fluentd speak to the logging service with the other protocol.

UDP packets sent to localhost can’t be lost, except if the UDP socket runs out of buffer space. This could happen if your sender (Docker) is faster than your receiver (Logstash/Fluentd), which is why we mentioned a queue earlier: the queue will allow the receiver to drain the UDP buffer as fast as possible to avoid overflows. Combine that with a large enough UDP buffer, and you’ll be safe.

Future directions

Even if running a cluster-wide socat is relatively easy (especially with Swarm mode and docker service create --mode global), we would rather have good behavior out of the box.

There are already some GitHub issues related to this: #23679, #17904, and #16330. One of the maintainers has joined the conversation and there are some people at Docker Inc. who would love to see this improved.

One possible fix is to re-resolve the GELF server name once in a while, and when a change is detected, update the socket destination address. Since DNS provides TTL information, it could even be used to know how long the IP address can be cached.
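As a rough illustration of that idea (and not a patch for Docker, which is written in Go), here is a Python sketch of a UDP sender that caches the resolved address and looks it up again once it gets too old. The standard resolver call used here does not expose the DNS TTL, so a fixed max_age stands in for it; the class name and its parameters are made up for this example.

```python
import socket
import time


class ReResolvingUDPSender:
    """Send UDP datagrams, re-resolving the hostname every max_age seconds."""

    def __init__(self, hostname, port, max_age=60.0,
                 resolve=socket.gethostbyname, clock=time.monotonic):
        self.hostname = hostname
        self.port = port
        self.max_age = max_age
        self._resolve = resolve  # injectable for testing
        self._clock = clock      # injectable for testing
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._address = None
        self._resolved_at = None

    def current_address(self):
        now = self._clock()
        if self._address is None or now - self._resolved_at >= self.max_age:
            try:
                self._address = self._resolve(self.hostname)
                self._resolved_at = now
            except OSError:
                # Resolver trouble: keep the last known address, if any,
                # and try again on the next call.
                if self._address is None:
                    raise
        return self._address

    def send(self, data):
        self._sock.sendto(data, (self.current_address(), self.port))
```

With this approach, most log lines cost a single sendto call, and a DNS change is picked up after at most max_age seconds.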

If you need better GELF support, I have good news: you can help! I’m not going to tell you “just send us a pull request, ha ha ha!” because I know that only a very small number of people have both the time and expertise to do that — but if you are one of them, then by all means, do it! There are other ways to help, though.

First, you can monitor the GitHub issues mentioned above (#23679 and #17904). If the contributors and maintainers ask for feedback, indicate what would (or wouldn’t) work for you. If you see a proposition that makes sense, and you just want to say “+1” you can do it with GitHub reactions (the “thumbs up” emoji works perfectly for that). And if somebody proposes a pull request, testing it will be extremely helpful and instrumental to get it accepted.

If you look at one of these GitHub issues, you will see that there was already a patch proposed a long time ago; but the person who asked for the feature in the first place never tested it, and as a result, it was never merged. Don’t get me wrong, I’m not putting the blame on that person! It’s a good start to have a GitHub issue as a kind of “meeting point” for people needing a feature, and people who can implement it.

It’s quite likely that in a few months, half of this post will be obsolete because the GELF driver will support TCP connections and/or correctly re-resolve addresses for UDP sockets!

Yes, all men

2017-01-15T00:00:00+00:00

In conversations about sexism (in the tech industry or elsewhere), men are often surprised to hear how bad the situation is for some of their women coworkers and friends. We often are tempted to say “this wouldn’t happen in my company.” If you are an expatriate or travel abroad, there is also the variant “in my country, we treat women fairly!” I would like to share something that made me think twice about this.

A few years ago, I was invited to talk about “the future of the cloud” at a tech event in Paris. This would be a 15-minute talk, with a very wide audience (both technical and non-technical folks). That was something very different from my usual mandate: back then, half of my talks were “Introduction to Docker and containers” and the other half consisted of more advanced topics revolving around containers (containers and security, containers and microservices, containers and immutable infrastructure, containers this and that, you get the idea).

Just a few days before, one of my coworkers had published a gut-wrenching blog post where she was describing the daily harassment that she was facing. That woman was (and still is) one of the best open source developers of my generation, and back then, she was working on a very popular open source project. And as a result, she was receiving a constant stream of horrible emails and other messages containing death threats, photoshopped pictures, and more.

Of course, not all women in the tech industry have to deal with behavior as extreme; but sexism and bias are rampant in our industry. Women have to work harder to get the same amount of credit; we collectively have biases that push women out of scientific disciplines (as illustrated by the story of this transgender man who suddenly found himself way better considered at work than when his name was Barbara). The software industry, and open source in particular, is no better: far from the “meritocracy” often advertised, open source communities don’t welcome women and studies show that women write code that is at least as good as men’s, but will be rated lower if their gender is known.

I’m French and spent the first 30 years of my life in France. I became aware of the extreme sexism in tech only when I moved to the US. Let me clarify: I am not stating that the US is more (or less) sexist than France. I’m merely saying that I was blissfully ignorant of the issue before. Obviously, I was aware that in my CS degree, only 10% of the students were women; but it never crossed my mind that women could have a lower proficiency than men in the field. My mother was a math teacher who used LaTeX to typeset the assignments that she handed out to her students. My sister knew her way around a Linux system (text mode, back then) to access IRC and copy CDs. My girlfriend in University floored me as we were debugging code together because she could instantly spot which pointers were on the heap or on the stack. The (few) women in our class back then were also in the top tier of our alumni. And yet.

And yet, when I started to become aware of how rampant and ubiquitous sexism could be, a little voice in me kept whispering silently: “Not all men are like that. Look, in France, you never saw anything like this happen.” To be honest, I didn’t know – because I had never investigated sexism in tech in France.

Until that talk.

I hadn’t prepared anything for that talk. A few hours before, I realized that I was completely unable to talk about Docker for a non-technical audience. And as I tried to chalk out ideas, to come up with colorful metaphors, I kept thinking about my friend’s blog post; about what she (and possibly many other women) were facing.

So when I climbed on stage, I spent a few minutes babbling not very convincingly about Docker, containers, and DevOps. And then I tried to talk about sexism in tech. I don’t know if this was very convincing; honestly I don’t even dare rewatch that talk because I had slept two hours that night and my performance was probably very poor. The only thing I remember clearly is that I finished the talk by saying, “my vision for the future of the cloud is as follows: in ten years, on this stage, there should be 5 men and 5 women, instead of 1 woman and 9 men like today.”

You Will Never Guess What Happens Next.

After my talk, a lot of women came to thank me for “talking about it.” That was quite surprising, especially given the small number of women attending the event. The numbers didn’t add up. A few hours later, when I left the venue, somebody who was sitting at a café across the street even hailed me to chat about it.

I think it’s about that time that the little voice telling me “not all men” died in my head, because this experience made me understand that even in France, where we’re all about “Liberté, Égalité, Fraternité,” sexism is just as bad as in the US, and men shouldn’t try to pretend otherwise.

Next time you hear someone pretend that in their school, university, country, region … sexism is “not as bad,” think about this. Think about the fact that women cannot talk about sexism without facing negative consequences. So let’s try at least to change our discourse. If you say, “this doesn’t happen in my community,” you are basically silencing anybody who would like to say otherwise. Next time, try this instead: “I’m sorry that this happened to you. I wasn’t aware of this problem. How can I help?”

One thing you can do is to fight against your own biases. One of my favorite techniques is to deliberately apply reverse bias. At a tech conference, never ask a woman, “so you’re a recruiter?” but try “so, are you rather in dev or ops?” If you are attending a meetup full of dudes, and before the presentation starts, you are striking up a conversation with one of the only women in attendance, don’t ask her if she works here. Instead, ask her if she’s the speaker. (I did once, and she was. True story!)

Words have meaning

2017-01-06T00:00:00+00:00

I usually write in English, about container technology. This one was originally written in French, and is about very different topics. You’ve been warned! :-)

Oh, look, a cow laying an egg!

That’s a bit of a strange sentence, isn’t it? Grammatically speaking, it is perfectly correct. It urges whoever hears it to observe an unusual sight: a cow laying an egg. That’s where the problems begin. Because it’s hard to find anything more mammalian than a cow, and mammals don’t lay eggs. A cow laying an egg is like a fire-breathing flying dragon, an avenging archangel, or a nice FN (Front National) voter: we can imagine what it looks like (barely), but we know that it only exists in our imagination.

I couldn’t find an artist’s rendition of a nice FN voter.

So what does it mean?

If I tell you, “oh, look, a cow laying an egg,” what will you understand? Well, there are many possible interpretations.

Maybe I thought I saw a cow laying an egg, sitting in the grass; and not knowing that cows don’t lay eggs, I’m trying to draw your attention to this unusual fact. In that case, you will probably correct me kindly, explaining that even though cows are sometimes neighbors with hens, that doesn’t make them lay eggs.

The cow is certainly not far from its nest, let’s keep looking.

Maybe I’m not entirely of sound mind, or maybe I have Wernicke’s aphasia with a semantic predominance following a stroke, for instance; in other words, I have a brain lesion that affects the way I use language.

Maybe I’m not pointing at a real cow laying an egg, but at a representation of a cow laying an egg. One of the strengths of humans is being able to imagine, to conceptualize things that don’t exist. An artist can draw a cow sitting on an egg. What I actually meant was: “oh, look, a painting of a cow laying an egg,” except that if this conversation takes place in an art gallery, it would be somewhat redundant to specify that we’re talking about a painting (since that’s all there is around us).

This is not a cow laying an egg.

Maybe I’m spinning a dubious metaphor, and I’m talking about a particularly large pregnant woman. That would also demonstrate a certain aptitude for sexist and fatphobic jokes; but it would by no means mean that a cow (the farm animal that we milk) is laying an egg (from which a little cow will hatch after a while).

Or maybe I’m creating a diversion: everybody will look in the direction I’m pointing, realize that a cow laying an egg makes no sense, and start laughing at this good joke. A few people will start explaining to me that it isn’t possible, others will step in to explain to them that it’s a joke, and meanwhile, my accomplice is picking your pockets to relieve you of your wallets and cell phones.

Me and my pal the cow are going to treat ourselves to a one-way ticket to Rio and sip caipis there until the end of our days!

Anti-white racism is the same thing

Well, that escalated quickly

If I talk about “anti-white racism,” what does that mean? To answer, let’s start with the definition of racism: racism is an ideology which holds that races exist, and that some are superior to others. We could stop there, and conclude that “anti-white racism” refers to a belief according to which white-skinned individuals are inferior to others. That would be jumping the gun. Because the main problem with racism isn’t the belief. After all, there are people who believe in God, or in the invisible hand of the market, or that the world will end in 2012, and so on. Everyone can believe whatever they want. The problem is the consequences of that belief.

The consequences of racism are many. The lynching (i.e. hanging without trial) of several thousand Black people by the Ku Klux Klan, for example. Modern variant: in 2015, a terrorist shoots down about ten Black parishioners in a church. Closer to home (for my French readers), the “ratonnades” (racist mob attacks on North Africans) of 1973. What all these massacres have in common is that they are perpetrated by people who believe they are doing the right thing, because in any case, the white race is superior (according to them), and so it is not a problem to violently eliminate people who aren’t white (still according to them).

If only it had remained a theory … 😢

There are other consequences that are less deadly but just as unjust: prejudice, which leads to greater or lesser difficulty finding a job, finding housing, or getting into a nightclub, because one wasn’t born the right color.

For many people, the word racism is loaded with all these meanings, in the same way that for many people, when we talk about a cow (without context), we think of the animal. Of course, if we talk about a “peau de vache” (literally “cowhide,” meaning a nasty person) or a “vache à lait” (a cash cow), the word takes on a different meaning; but the egg-laying cow I was talking about earlier remains a mythical animal, with no tangible existence. Anti-white racism is the same: it doesn’t exist, because there is no movement leading to murder or discrimination against groups of people solely because their skin is white.

Yes, there have been massacres of white people. One of the most sinister examples is probably the Holocaust. But what made the difference between the murderers and the victims was not the color of their skin, but whether or not they were Jewish. So it is not anti-white, but antisemitic.

Yes, a white-skinned person can sometimes feel uncomfortable when they find themselves (for once) surrounded by people who are different. That person may feel excluded when they can’t get access to certain places. If this happens to you one day: congratulations, you have just experienced for a moment what is daily life for millions of people of color in Europe and North America. For a brief moment, you have shown that you are capable of empathy, that is, of putting yourself in other people’s shoes to understand what they feel.

The next time you feel uneasy because you’re walking through a “Black” neighborhood, or because an individual a bit too tanned for your taste sits down across from you on the bus, tell yourself that for them, this unease can happen all the time. Because there are so many racist white people that it is entirely reasonable for a Black person to assume that a white person will always try to screw them over just because they are Black. After all, the FN was getting 25% of the vote in France not long ago; in other words, if I don’t look like a “native” French person, one in four people I pass in the street or at work would be quite happy to see me leave the country. That’s enough to make me a bit paranoid, isn’t it?

And heterophobia: same fight

What is heterophobia? At first glance, even for someone who has never come across the word (and who never studied Greek), it’s approachable. Homophobia is the rejection and hatred of homosexuals, so heterophobia must be the rejection and hatred of heterosexuals. Did I get that right?

No.

Because here too, when we talk about homophobia, we are not only talking about a belief or backward ideas, but about their consequences. The Orlando shooting, for example. France, unfortunately, is not to be outdone when it comes to homophobic violence.

Yet there are no comparable “heterophobic” crimes. Heterosexual people dying, yes, that happens every day. But heterosexual people dying because they are heterosexual, in other words who would not have died had they been homosexual, that is considerably harder to find. Last I heard, the worldwide figure for heterophobic crimes is … zero.

The picture above unfortunately does not show two men who love men who love hunting in the forest and rolling naked in the mud while making love to their assault rifles (although for that last part, it’s debatable). The people posing in this picture think it’s funny to use the rainbow flag (a symbol of tolerance for all sexualities) as a practice target. Civics homework in three questions:
- Would they have been better off keeping the target they were using before, knowing that it was a portrait of Obama?
- Do you know what the straight pride flag looks like? (Without searching the Internet!)
- When was the last time people took a heterosexual symbol (a flag or equivalent) to burn it or shoot at it? (You’re allowed to search the Internet, good luck!)

So why do people use these words?

Excellent question! And some of the answers are the same as for the egg-laying cow.

Maybe these people are misinformed, and don’t know the definition of these words. Maybe the people who talk about “anti-white racism” don’t know, or don’t fully realize, the scale of the horrors and massacres that have been perpetrated in the name of racism, and of the discrimination that still exists today.

Maybe their definition is different. There are people for whom the “purity of the white race” is a very important concept, and when they see a mixed marriage (meaning, between a white person and a person of another skin color), they call it the “genocide of the white people.” You will understand that for the descendants of victims of real genocides, that is a particularly bitter pill to swallow. Imagine the dialogue:
“We are viiiictims of a genociiiide!”
“Oh no! You too, your grandparents were killed in the Nazi concentration camps?”
“Nooo! Worse than that! My sister married an Arab!”
“Oh wow, bro, that’s rough, have you thought about talking to someone about it?…”

Maybe it isn’t innocent, actually

You have surely heard that the pen is mightier than the sword. Yes, it’s a figure of speech: in a brawl, if you pull out a pencil, you have less chance of making it out than with a knife (all other things being equal). But words have meaning. When we deliberately use certain words, it is sometimes to create certain emotions in the people listening to us. This is part of pathos, a persuasion technique more than two thousand years old.

So maybe it’s a carefully thought-out strategy. By using strong words, words that make us think of massacres and atrocious crimes, these people hope to stir up negative emotions in us, short-circuiting our intelligence. The mere use of the word “heterophobia” implies that there is a movement against heterosexuals, aiming to exclude them, ready to go as far as beating them up, torturing them, killing them. It’s the same principle when people talk about strikers “taking hostage” the users of public transportation. We forget a little too quickly that in a hostage situation, the main risk is not being late for work, or struggling for hours in transit, even in the snow. What’s at stake is dying. Just because you were in the wrong place at the wrong time. But by talking about “hostage taking,” we make a metaphor that people understand right away, without thinking, and we plant in people’s subconscious the idea that strikers are dangerous criminals ready to kill innocent people to get what they want. (Which comes in handy in case we later want to send in the army against said strikers: after all, they’re dangerous hostage takers, aren’t they?)

What’s great about this technique is that a word or an expression is even easier to spread than an idea. You just have to use it around you, and it spreads like a virus. Most of the time, there’s no need to even explain: people understand what you mean. And precisely because there’s no need to explain, you can end up with “healthy carriers” of the virus, who will use these words and repeat them around them without realizing that they are collaborating with a propaganda machine serving an ideology that would make them vomit if they were aware of it.

Conclusion

Chances are that the people who use these words are creating a distraction, not to empty your pockets (although…), but to fill your head with shit.

If you remember only one thing from all this, just remember that heterophobia is a cow laying an egg.

Go + Docker = ♥

2016-09-09T00:00:00+00:00

This is a short collection of tips and tricks showing how Docker can be useful when working with Go code. For instance, I’ll show you how to compile Go code with different versions of the Go toolchain, how to cross-compile to a different platform (and test the result!), or how to produce really small container images.

The following article assumes that you have Docker installed on your system. It doesn’t have to be a recent version (we’re not going to use any fancy feature here).

Go without `go`

… And by that, we mean “Go without installing go”.

If you write Go code, or if you have even the slightest interest in the Go language, you certainly have the Go compiler and toolchain installed, so you might be wondering “what’s the point?”; but there are a few scenarios where you want to compile Go without installing Go.

You still have this old Go 1.2 on your machine (that you can’t or won’t upgrade), and you have to work on this codebase that requires a newer version of the toolchain.
You want to play with cross compilation features of Go 1.5 (for instance, to make sure that you can create OS X binaries from a Linux system).
You want to have multiple versions of Go side-by-side, but don’t want to completely litter your system.
You want to be 100% sure that your project and all its dependencies download, build, and run fine on a clean system.

If any of this is relevant to you, then let’s call Docker to the rescue!

Compiling a program in a container

When you have installed Go, you can do go get -v github.com/user/repo to download, build, and install a library. (The -v flag is just here for verbosity, you can remove it if you prefer your toolchain to be swift and silent!)

You can also do go get github.com/user/repo/... (yes, that’s three dots) to download, build, and install all the things in that repo (including libraries and binaries).

We can do that in a container!

Try this:

docker run golang go get -v github.com/golang/example/hello/...

This will pull the golang image (unless you have it already; then it will start right away), and create a container based on that image. In that container, go will download a little “hello world” example, build it, and install it. But it will install it in the container … So how do we run that program now?

Running our program in a container

One solution is to commit the container that we just built, i.e. “freeze” it into a new image:

docker commit $(docker ps -lq) awesomeness

Note: docker ps -lq outputs the ID (and only the ID!) of the last container that was executed. If you are the only user on your machine, and if you haven’t created another container since the previous command, that container should be the one in which we just built the “hello world” example.

Now, we can run our program in a container based on the image that we just built:

docker run awesomeness hello

The output should be Hello, Go examples!.

Bonus points

When creating the image with docker commit, you can use the --change flag to specify arbitrary Dockerfile commands. For instance, you could use a CMD or ENTRYPOINT command so that docker run awesomeness automatically executes hello.
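
As a minimal sketch of that idea: the helper name commit_with_cmd is made up for illustration, and /go/bin/hello assumes that the binary landed in $GOPATH/bin, which is /go/bin in the golang image.

```shell
# Hypothetical helper: freeze the last container into an image whose
# default command is the hello binary built by `go get`.
# Assumption: the binary lives at /go/bin/hello inside the container.
commit_with_cmd() {
    image_name="$1"
    docker commit --change 'CMD ["/go/bin/hello"]' "$(docker ps -lq)" "$image_name"
}

# Usage: commit_with_cmd awesomeness && docker run awesomeness
```

With an image committed this way, docker run awesomeness should no longer need the hello argument.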

Running in a throwaway container

What if we don’t want to create an extra image just to run this Go program?

We got you covered:

docker run --rm golang sh -c \
    "go get github.com/golang/example/hello/... && exec hello"

Wait a minute, what are all those bells and whistles?

--rm tells the Docker CLI to automatically issue a docker rm command once the container exits. That way, we don’t leave anything behind.
We chain together the build step (go get) and the execution step (exec hello) using the shell logical operator &&. If you’re not a shell aficionado, && means “and”. It will run the first part (go get...), and if (and only if!) that part is successful, it will run the second part (exec hello). If you wonder why this is like that: it works like a lazy “and” evaluator, which needs to evaluate the right-hand side only if the left-hand side evaluates to true.
We pass our commands to sh -c, because if we were to simply do docker run golang "go get ... && hello", Docker would try to execute the program named go SPACE get SPACE etc. and that wouldn’t work. So instead, we start a shell and instruct the shell to execute the command sequence.
We use exec hello instead of hello: this will replace the current process (the shell that we started) with the hello program. This ensures that hello will be PID 1 in the container, instead of having the shell as PID 1 and hello as a child process. This is totally useless for this tiny example, but when we run more useful programs, this will allow them to receive external signals properly, since external signals are delivered to PID 1 of the container. What kind of signal, you might be wondering? A good example is docker stop, which sends SIGTERM to PID 1 in the container.
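
The effect of exec can be observed without Docker at all. The sketch below (the function name is only for illustration) starts a shell, prints its PID, then uses exec to replace it with a second shell that prints its own PID. Both lines should show the same number, because exec reuses the existing process instead of forking a child.

```shell
# Print the PID of a shell, then exec into another shell and print again.
# With exec, the second program takes over the very same process.
show_pid_across_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Usage: show_pid_across_exec
```

In a container, the same mechanism is what lets hello inherit PID 1 from the shell started by sh -c.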

Using a different version of Go

When you use the golang image, Docker expands that to golang:latest, which (as you might guess) will map to the latest version available on the Docker Hub.

If you want to use a specific version of Go, that’s very easy: specify that version as a tag after the image name.

For instance, to use Go 1.5, change the example above to replace golang with golang:1.5:

docker run --rm golang:1.5 sh -c \
    "go get github.com/golang/example/hello/... && exec hello"

You can see all the versions (and variants) available on the Golang image page on the Docker Hub.
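
If you juggle several toolchain versions, a small loop can help. This is only a sketch: the helper name is invented, and the tags that you pass must exist on the Docker Hub (the function does not check them).

```shell
# Hypothetical helper: run `go version` under each image tag given as argument.
# Tags are not validated here; pick them from the image page on the Docker Hub.
go_version_for_tags() {
    for tag in "$@"; do
        docker run --rm "golang:$tag" go version
    done
}

# Usage: go_version_for_tags 1.5 latest
```

Replacing go version with the go get command shown earlier would build the same code with each toolchain in turn.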

Installing on our system

OK, so what if we want to run the compiled program on our system, instead of in a container?

We could copy the compiled binary out of the container. Note, however, that this will work only if our container architecture matches our host architecture; in other words, if we run Docker on Linux. (I’m leaving out people who might be running Windows Containers!)

The easiest way to get the binary out of the container is to map the $GOPATH/bin directory to a local directory. In the golang container, $GOPATH is /go. So we can do the following:

docker run -v /tmp/bin:/go/bin \
  golang go get github.com/golang/example/hello/...
/tmp/bin/hello

If you are on Linux, you should see the Hello, Go examples! message. But if you are, for instance, on a Mac, you will probably see:

-bash: /tmp/bin/hello: cannot execute binary file

What can we do about it?

Cross-compilation

Go 1.5 comes with outstanding out-of-the-box cross-compilation abilities, so if your container operating system and/or architecture doesn’t match your system’s, it’s no problem at all!

To enable cross-compilation, you need to set GOOS and/or GOARCH.

For instance, assuming that you are on a 64-bit Mac:

docker run -e GOOS=darwin -e GOARCH=amd64 -v /tmp/crosstest:/go/bin \
  golang go get github.com/golang/example/hello/...

The output of cross-compilation is not directly in $GOPATH/bin, but in $GOPATH/bin/${GOOS}_${GOARCH}. In other words, to run the program, you have to execute /tmp/crosstest/darwin_amd64/hello.
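
Here is a hedged sketch wrapping the command above in a function (the name cross_build and its argument order are assumptions, not part of any tool). It prints the host directory where the binaries are expected to land.

```shell
# Hypothetical wrapper around the cross-compilation command shown above.
# Arguments: target OS, target architecture, host output directory, package.
cross_build() {
    goos="$1"; goarch="$2"; outdir="$3"; pkg="$4"
    docker run --rm -e GOOS="$goos" -e GOARCH="$goarch" \
        -v "$outdir:/go/bin" golang go get "$pkg" || return 1
    # Assumption: the target differs from the container's own platform;
    # otherwise the binaries go directly into $outdir.
    echo "$outdir/${goos}_${goarch}"
}

# Usage: cross_build darwin amd64 /tmp/crosstest github.com/golang/example/hello/...
```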

Installing straight to the `$PATH`

If you are on Linux, you can even install directly to your system bin directories:

docker run -v /usr/local/bin:/go/bin \
  golang go get github.com/golang/example/hello/...

However, on a Mac, trying to use /usr as a volume will not mount your Mac’s filesystem to the container. It will mount the /usr directory of the Moby VM (the small Linux VM hidden behind the Docker whale icon in your toolbar).

You can, however, use /tmp or something in your home directory, and then copy it from there.

Building lean images

The Go binaries that we produced with this technique are statically linked. This means that they embed all the code that they need to run, including all dependencies. This contrasts with dynamically linked programs, which don’t contain some basic libraries (like the “libc”) and use a system-wide copy which is resolved at run time.

This means that we can drop our Go compiled program in a container, without anything else, and it should work.

Let’s try this!

The `scratch` image

There is a special image in the Docker ecosystem: scratch. This is an empty image. It doesn’t need to be created or downloaded, since by definition, it is empty.

Let’s create a new, empty directory for our new Go lean image.

In this new directory, create the following Dockerfile:

FROM scratch
COPY ./hello /hello
ENTRYPOINT ["/hello"]

This means:

start from scratch (an empty image),
add the hello file to the root of the image,
define this hello program to be the default thing to execute when starting this container.

Then, produce our hello binary as follows:

docker run -v $(pwd):/go/bin --rm \
  golang go get github.com/golang/example/hello/...

Note: we don’t need to set GOOS and GOARCH here, precisely because we want a binary that will run in a Docker container, not on our host system. So leave those variables alone!

Then, we can build the image:

docker build -t hello .

And test it:

docker run hello

(This should display Hello, Go examples!.)

Last but not least, check the image’s size:

docker images hello

If we did everything right, this image should be about 2 MB. Not bad!

Building something without pushing to GitHub

Of course, if we had to push to GitHub each time we wanted to compile, we would waste a lot of time.

When you want to work on a piece of code and build it within a container, you can mount a local directory to /go in the golang container, so that the $GOPATH is persisted across invocations: docker run -v $HOME/go:/go golang ....

But you can also mount local directories to specific paths, to “override” some packages (the ones that you have edited locally). Here is a complete example:

# Adapt the two following environment variables if you are not running on a Mac
export GOOS=darwin GOARCH=amd64
mkdir go-and-docker-is-love
cd go-and-docker-is-love
git clone https://github.com/golang/example
cat example/hello/hello.go
sed -i.bak s/olleH/eyB/ example/hello/hello.go
docker run --rm \
  -v $(pwd)/example:/go/src/github.com/golang/example \
  -v $(pwd):/go/bin/${GOOS}_${GOARCH} \
  -e GOOS -e GOARCH \
  golang go get github.com/golang/example/hello/...
./hello
# Should display "Bye, Go examples!"

The special case of the `net` package and CGo

Before diving into real-world Go code, we have to confess something: we lied a little bit about the static binaries. If you are using CGo, or if you are using the net package, the Go linker will generate a dynamic binary. In the case of the net package (which a lot of useful Go programs out there are using indeed!), the main culprit is the DNS resolver. Most systems out there have a fancy, modular name resolution system (like the Name Service Switch) which relies on plugins which are, technically, dynamic libraries. By default, Go will try to use that; and to do so, it will produce a dynamically linked binary.
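
To find out which kind of binary you ended up with, one option is to ask ldd. The function below is a rough sketch with an invented name; it assumes a glibc-style ldd, which exits with a non-zero status when given a static binary.

```shell
# Hypothetical check: succeed if ldd reports no dynamic dependencies.
# Assumption: ldd exits non-zero for statically linked binaries.
is_static() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 2; }
    if ldd "$1" >/dev/null 2>&1; then
        return 1    # ldd resolved shared libraries: dynamically linked
    fi
    return 0        # ldd refused the file: treated as statically linked
}

# Usage: is_static ./hello && echo "safe to drop into FROM scratch"
```

Note that ldd also refuses files that are not executables at all, so this is only meaningful for binaries that you just built.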

How do we work around that?

Re-using another distro’s libc

One solution is to use a base image that has the essential libraries needed by those Go programs to function. Almost any “regular” Linux distro based on the GNU libc will do the trick. So instead of FROM scratch, you would use FROM debian or FROM fedora, for instance. The resulting image will be much bigger now; but at least, the bigger bits will be shared with other images on your system.

Note: you cannot use Alpine in that case, since Alpine is using the musl library instead of the GNU libc.

Bring your own libc

Another solution is to surgically extract the files needed, and place them in your container with COPY. The resulting container will be small. However, this extraction process leaves the author with the uneasy impression of a really dirty job, and they would rather not go into more details.

If you want to see for yourself, look around ldd and the Name Service Switch plugins mentioned earlier.

Producing static binaries with `netgo`

We can also instruct Go to not use the system’s libc, and substitute Go’s netgo library, which comes with a native DNS resolver.

To use it, just add -tags netgo -installsuffix netgo to the go get options.

-tags netgo instructs the toolchain to use netgo.
-installsuffix netgo will make sure that the resulting libraries (if any) are placed in a different, non-default directory. This will avoid conflicts between code built with and without netgo, if you do multiple go get (or go build) invocations. If you build in containers like we have shown so far, this is not strictly necessary, since there will be no other Go code compiled in this container, ever; but it’s a good idea to get used to it, or at least know that this flag exists.
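
Put together with the earlier build command, this could look like the following sketch (the helper name is made up; the flags are the ones described above).

```shell
# Hypothetical helper: build a package with the netgo tag in a throwaway
# container, dropping the resulting binaries into the current directory.
netgo_build() {
    pkg="$1"
    docker run --rm -v "$(pwd):/go/bin" golang \
        go get -tags netgo -installsuffix netgo "$pkg"
}

# Usage: netgo_build github.com/user/repo/...
```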

The special case of SSL certificates

There is one more thing that you have to worry about if your code has to validate SSL certificates; for instance if it will connect to external APIs over HTTPS. In that case, you need to put the root certificates in your container too, because Go won’t bundle those into your binary.

Installing the SSL certificates

There again, there are multiple options available, but the easiest one is to use a package from an existing distribution.

Alpine is a good candidate here because it’s so tiny. The following Dockerfile will give you a base image that is small, but has an up-to-date bundle of root certificates:

FROM alpine:3.4
RUN apk add --no-cache ca-certificates apache2-utils

Check it out; the resulting image is only 6 MB!

Note: the --no-cache option tells apk (the Alpine package manager) to get the list of available packages from Alpine’s distribution mirrors, without saving it to disk. You might have seen Dockerfiles doing something like apt-get update && apt-get install ... && rm -rf /var/cache/apt/*; this achieves something equivalent (i.e. not leave package caches in the final image) with a single flag.

As an added bonus, putting your application in a container based on the Alpine image gives you access to a ton of really useful tools: now you can drop a shell into your container and poke around while it’s running, if you need to!

Wrapping it up

We saw how Docker can help us to compile Go code in a clean, isolated environment; how to use different versions of the Go toolchain; and how to cross-compile between different operating systems and platforms.

We also saw how Go can help us to build small, lean container images for Docker, and described a number of associated subtleties linked (no pun intended) to static libraries and network dependencies.

Beyond the fact that Go is a really good fit for a project like Docker, we hope that we showed you how Go and Docker can benefit from each other and work really well together!

Acknowledgements

This was initially presented during the hack day at GopherCon 2016.

I would like to thank all the people who proofread this material and gave ideas and suggestions to make it better; including but not limited to:

Aaron Lehmann
Stephen Day
AJ Bowen

All mistakes and typos are my own; all the good stuff is theirs! ☺

One container to rule them all

2016-04-03T00:00:00+00:00

A while ago, I wrote about how to bind-mount the Docker control socket instead of running Docker-in-Docker. This is a huge win for CI use-cases, and many others. Here I want to talk about a more generic scenario: controlling any Docker setup (local or remote Engine, but also Swarm clusters) from a container, and the benefits that it brings us.

Bind-mounting the control socket

If you have never done this before, I invite you to try the following command (on a Linux machine running a local Docker Engine, or if you are one of the lucky few who have access to Docker Mac when this article is published):

docker run -v /var/run/docker.sock:/var/run/docker.sock \
           -ti docker docker ps

This will execute docker ps in a container (using the docker official image), and it will display the containers running on your local Docker Engine. There will be at least the container running docker ps itself, and possibly other containers that are running at that moment.

This gives us a way to control the Docker Engine from within a container. This is particularly convenient when you want to create containers from within a container, without running Docker-in-Docker.

However, this only works if you are connecting to Docker using a local UNIX socket. In other words, it doesn’t work if you are using:

a remote Docker Engine (with or without TLS authentication),
a local boot2docker VM,
a Swarm cluster.

If you are using a remote Engine without TLS authentication, the only thing you need to do is to set the DOCKER_HOST environment variable. Then all standard tools (like the Docker CLI and Docker Compose) will automatically detect this variable and use it to contact the Engine. But if you are using TLS authentication (and you definitely should!) things are a bit more complex.

I want to give you a generic method to connect to any Docker API endpoint from within a container, regardless of its location (local or remote), even if it’s using TLS, even if it’s actually a Swarm cluster instead of a single Engine.

Let’s look at our environment

To connect to remote Docker API endpoints using TLS, we need a bit more than the DOCKER_HOST environment variable. First of all, we need to tell our local Docker client to use TLS, by setting the DOCKER_TLS_VERIFY environment variable. We also need to provide the client with:

a private key,
a certificate (used to prove our identity to the remote server),
a root certificate (used to check the identity of the remote server).

Those elements will be stored in three files in PEM format:

key.pem,
cert.pem,
ca.pem.

By default, the client looks for those files in the ~/.docker directory, but this can be changed by setting the DOCKER_CERT_PATH environment variable to another directory.
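
Before going further, it can be handy to check that the three files are where the client expects them. This is a small sketch (the function name is an assumption) that applies the same default as described above.

```shell
# Hypothetical pre-flight check: make sure the three PEM files are present
# in the directory where the Docker client will look for them.
check_tls_material() {
    cert_dir="${DOCKER_CERT_PATH:-$HOME/.docker}"
    for f in ca.pem cert.pem key.pem; do
        if [ ! -f "$cert_dir/$f" ]; then
            echo "missing: $cert_dir/$f" >&2
            return 1
        fi
    done
    echo "TLS material found in $cert_dir"
}

# Usage: check_tls_material
```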

If you are using Docker Machine, you can easily check what those variables look like with the env command; e.g. if you have a Docker Machine named node1:

$ docker-machine env node1
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://11.22.33.44:2376"
export DOCKER_CERT_PATH="/home/docker/.docker/machine/machines/node1"
export DOCKER_MACHINE_NAME="node1"
# Run this command to configure your shell:
# eval $(docker-machine env node1)

If we want to connect to this Docker Engine from within a container, we just need to set those environment variables, and make the three PEM files available within the container.

Transporting TLS material and settings

Of course, we could manually copy-paste the environment variables and the three PEM files from our host to our container. But we are going to automate the process, and store everything we need in a data container named dockercontrol. Assuming that our environment is currently set up to talk to some remote Docker API endpoint secured with TLS, we can run the following command:

$ tar -C $DOCKER_CERT_PATH -cf- ca.pem cert.pem key.pem |
  docker run --name dockercontrol -i -v /docker busybox \
  sh -c "tar -C /docker -xf-
         echo export DOCKER_HOST=$DOCKER_HOST >>/docker/env
         echo export DOCKER_TLS_VERIFY=1 >>/docker/env
         echo export DOCKER_CERT_PATH=/docker >>/docker/env"

This will create a tar archive locally, containing the three PEM files; then it will stream this archive to a container which will unpack it on the fly; and finally we create an env file that can be sourced later to restore the environment variables.

Now, when we need to talk to our Docker API endpoint from a container, all we have to do is to start that container with --volumes-from dockercontrol, and source /docker/env:

$ docker run --rm --volumes-from dockercontrol docker \
  sh -c "source /docker/env; docker ps"

If you want a totally transparent operation (i.e. you don’t want to change the container so that it sources /docker/env) you can also read that file on your host, and pass down the environment variables to your containers.

Data containers vs. named volumes

You might be wondering why I’m using an old-fashioned data container instead of creating a proper named volume (with docker volume create). This would indeed be more “Dockerish” but wouldn’t work (yet) with Swarm; e.g. when doing docker run -v dockercontrol:/docker … we would have to add an affinity constraint to make sure that the container is created on the host that has the dockercontrol volume. I think it’s simpler to use a data container for now.

Putting everything together

I wrote a little shell script to automate the whole process; it’s available on jpetazzo/dctrl on GitHub.

It lets you run:

$ dctrl purple

This will create a data container named purple, holding the information necessary to connect to the current Docker API endpoint.

Then, if you need to run a container that has access to this Docker API endpoint, you can do:

$ eval "$(docker run --rm --volumes-from $CONTROL alpine \
          sed 's/DOCKER_/DOCKERCONTROL_/' /docker/env)"
$ docker run --volumes-from $CONTROL \
  -e DOCKER_HOST=$DOCKERCONTROL_HOST \
  -e DOCKER_TLS_VERIFY=$DOCKERCONTROL_TLS_VERIFY \
  -e DOCKER_CERT_PATH=$DOCKERCONTROL_CERT_PATH \
  ...

What can we use this for?

This allows us to create containers from a container, even when running on e.g. a Swarm cluster. (I initially thought about writing this blog post when three different people, in the same week, asked me “how can I bind-mount the Docker socket when the Docker Engine is not local, but on a remote host?”)

But this is also very useful in the general case when you need a container to be able to interact with your overall Docker setup; e.g. to set up or reconfigure a load balancer. See for instance dockercloud-haproxy, which accesses the Docker Events API to notice when backends are added to a service, and dynamically updates the load balancer configuration accordingly.

Another example would be the implementation of a replication controller in a container. This container would be given e.g. a Compose file and a set of scaling parameters (the number of desired instances for each service). It would bring up the application described by the Compose file, scale it according to the scaling parameters, and watch the Docker Events API to adjust the number of containers should any node go down during the lifecycle of the application. (That container would be started with a rescheduling policy, to be automatically redeployed by Swarm if its own node goes down.)
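
The "watch the Docker Events API" part can be sketched with the CLI. The function below is purely illustrative (its name is invented and it only logs); a real controller would compare the running containers with the desired count and start replacements. It is meant to run in a container that has access to the API endpoint as described earlier.

```shell
# Hypothetical watcher: react whenever a container dies.
# Here we only log the raw event line; a real controller would reconcile
# the number of running containers with the scaling parameters.
watch_container_deaths() {
    docker events --filter event=die | while read -r event; do
        echo "container died: $event"
    done
}

# Usage: watch_container_deaths
```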

Generally speaking, any kind of application that needs access to the Docker API would benefit from this as soon as you want to be able to run it seamlessly in a container.

Power to the people

This takes me to one of my favorite features of Docker Swarm: the fact that it uses the same API as the Docker Engine.

This means that as a developer, when I build my application on my local machine with a single Docker Engine, I leverage the full API that will be available on a Swarm cluster:

if I need to partition my app across multiple networks, I can do it on a single node with the default bridge driver, and when I deploy on a Swarm cluster, everything will work exactly the same way, thanks to the overlay driver (or whatever network plugin has been deployed by my ops team);
if I need to use persistence and volumes, same story: I will use the default local driver in my environment, and if I’m running on a cluster with a volume plugin like Flocker or PortWorx, it will automatically achieve reliable persistence without changing anything on my side;
if I want to automatically scale up and down a background worker depending on the backlog size of a message queue, I can develop and test this locally, because the API used to scale (and gather metrics) will be the same in my local environment and the production one.

This last example is particularly powerful. If you are developing an app intended to run in the public cloud, and want to use auto-scaling, you won’t be able to test the auto-scaling behavior locally - unless your cloud provider gives you the option of installing a fully functional cloud instance locally, on your development laptop. With Docker, the fully functional cloud instance is the Docker Engine that you’re already using to power your containers.

If you have some other creative scenario involving controlling the Docker API from within a container, let me know!

Want to learn more about Docker? I will deliver two Docker workshops next month (May 2016) in Austin, Texas: an intro-level workshop and an advanced orchestration workshop using Compose, Swarm, and Machine to build, ship, and run distributed applications. If you want to attend, you can get 20% off the conference and workshop prices using the code PETAZZONI20. I will also deliver those workshops at other conferences in Europe and the US, so if you’re interested, let me know!

“I am a feminist, but…”

2015-09-15T00:00:00+00:00

TL;DR: we all have different perceptions and experiences. Just because you’re fine with a picture, book, movie, etc., doesn’t mean that everybody will accept it equally well. Even if you are the nicest person in the world. Let’s accept it, and be aware of each other’s sensitivities.

What am I talking about?

A recent CommitStrip Episode describes two guys and a girl watching an episode of the TV show Mr Robot. This TV show has the reputation of being technically accurate in many ways, which makes it popular among people who know how modern technology actually works. (Those people can often be frustrated by the considerable suspension of disbelief required to watch most movies where computers, networks, and other technology artifacts are a central point of the plot. Anyway!)

The first image just shows the three characters on a couch, watching the show. By looking at it, I was already afraid of what would be coming next. And I was right: they watch the show, with an avalanche of technical words: DDOS! Rootkits! Boot sequence! And the two guys are totally into the story, but the girl doesn’t get it.

I think that’s sexist, and I’m not OK with it.

Why would that be sexist?

It’s conveying the stereotype that women are “not technical,” that they are not interested in computers, and when there is a “geeky” TV show they won’t “get it.”

There is indeed a serious problem in the tech industry: there are many more men than women, and the imbalance becomes even worse when you focus on engineering positions. Some people naively think that this is because men are better than women in tech roles, but it’s simply not true. Every day, we use inventions and concepts created or pioneered by women.

Also, things are getting worse: the percentage of women in tech is decreasing, which seems to indicate that we’re doing something horribly wrong, causing many women to leave this industry.

There are multiple causes, but sexism in the tech industry is one of them. Sexism isn’t always obvious objectification of women or crass harassment. It’s also all the little things.

One of the things we do wrong is to perpetuate the stereotype that women don’t understand computers, and that when a bunch of dudes talk about technology, the token woman will be the one who doesn’t understand.

“But I don’t find that offensive! And I’m a feminist/a woman!”

Good for you! But other people think differently. Not a few; a lot, in fact. Because we all have different perceptions. Maybe you are a woman, working in tech, and you are lucky enough to be respected by your peers, have never been asked if you were a recruiter at a conference, have never received rape and death threats on IRC and in your emails. Good!

But some (unfortunately, many) women in our industry have a very different experience.

In July 2015, I gave a short talk about the future of the Cloud, and one thing I said at the end was (short version), “And I hope that in the future of the Cloud, tech will be a nicer industry for women.” After that talk, some women in the audience asked me, “Hey, is it true, that horrible stuff you were describing? Harassment, death threats, etc.? It never happened to me!” But a considerably larger number of women thanked me for talking about this, and some of them shared their stories. That’s sad, and this has to stop.

What’s the responsibility of a web comic there?

It’s a shared responsibility for all of us. Collectively, we have to make tech welcoming for women. We have to call out people when they make a stupid Mother’s day joke, or when they publish a book that will convey the idea that girls need boys to fix their computers.

Put differently: if some day I have a daughter, and if she’s interested in computers, I don’t want her to look at this sexist shit and unconsciously get the message that only boys understand technology, and that she should go back to playing with her dolls. I want her to look up to some of the badass women out there who are amazing role models and pursue that path if she wants to, rather than getting the feeling that she’s not welcome in this boys’ club.

How to deliver a great tech tutorial

2015-09-10T00:00:00+00:00

Here are a few tips and tricks that I learned when building and then delivering the Docker Fundamentals course at Docker Inc. This course is a two-day training designed to be delivered to small groups (up to 20 people) but we also delivered the intro part many times at tech conferences, to groups of varying sizes (50 to 300).

Foreword

I wrote this in a hurry. The style is probably not very good, and I should probably move some parts around. Pull requests welcome :P

Now, without further ado, here are all the things you could do to deliver a great tutorial!

Pre-provisioned cloud VMs

We did this for every single Docker tutorial: just before the tutorial (like, the night before the tutorial, or even a few hours before it starts) we would create cloud VMs, pre-provisioned with all the things the students will need.

Constraints

You will need stable Internet access. Some people balked at the very idea of doing the training on remote machines. “What if the conference WiFi goes down?” In the case of Docker, we want to pull images, download packages, etc.; so by doing things “in the cloud” we just maintain one SSH connection per student, instead of having each student download images and packages for 10s or 100s of MB.

At Velocity Santa Clara in 2014, we had 300 people in the room, and it worked pretty well. Just clarify ahead of time to set expectations.

How?

To automate the provisioning of the images, we recommend using cloud-init because it’s ridiculously simple. In our case, we start from an Ubuntu 14.04 image (all serious providers will have this available; if not, switch to another provider), and we provide a script as the cloud-init payload. (On EC2, that’s the user data field.)

The script gets executed at first boot. In our case, the script installs the Docker Engine, Docker Compose, Docker Swarm, and pre-pulls a few images. It also sets a custom user and password in the VM.

After provisioning the images, we have a script that gathers the IP addresses, and generates a printable HTML file that has little cards, one per machine, showing IP address + login + password. We print that file, cut out the cards, and hand them out to the students.
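
To give an idea of what the card generation can look like, here is a bare-bones sketch. It is not our actual script: the function name, the input format (one "ip login password" line per machine on stdin), and the HTML markup are all assumptions made for the example.

```shell
# Hypothetical card generator: reads "ip login password" lines on stdin
# and prints a minimal HTML page with one card per machine.
make_cards() {
    echo "<html><body>"
    while read -r ip login password; do
        [ -n "$ip" ] || continue    # skip empty lines
        echo "<div class=\"card\">$ip<br>$login<br>$password</div>"
    done
    echo "</body></html>"
}

# Usage: make_cards < machines.txt > cards.html
```

A bit of CSS on the card class is then enough to get something that prints nicely and is easy to cut out.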

Providers

We successfully used AWS EC2, Gandi, and Digital Ocean. We don’t endorse a specific one. If you have a huge training (basically, if you need 100+ VMs) EC2 is slightly easier to deal with because you can provision tons of VMs in a single API call, while with the others, you’ll have to do loops (and it will take longer).

If you are on a budget and your tutorial is about an open source project or a worthy cause, I recommend that you contact Gandi, because while they don’t do any kind of traditional advertisement, they spend their marketing budget helping ethical projects and the open source movement in general; so you might be able to obtain a discount or some other kind of arrangement with them.

Alternate methods

If you can’t afford providing one VM per student, you can also have VM images (VirtualBox recommended) that you will hand out on USB keys, and have a smaller number of VMs for people who can’t/won’t use VirtualBox.

If you have a small number of VMs and you don’t want to print credentials, you can also put the credentials in a shared Google Spreadsheet and have people tick them off when they use a VM.

Prepare your material

Obviously, you want to prepare your material ahead of time. Try to highlight the hands-on parts (i.e., the commands that people are expected to run in the environments), so that people who just want to “see it in action” can jump straight to the point.

I’m a huge fan of keeping my presentation materials in a repo, in diffable format. This means that PowerPoint and Keynote are out. If you can afford the time investment that goes with those tools, great! But unfortunately, I cannot.

(Note: I’m not saying that PowerPoint or Keynote are more complicated to use, or require any particular kind of training. What I’m saying is that in the long run, maintaining a complex document, with successive versions, bug fixes, collaborators, etc., turns out to be a nightmare with those formats. By keeping my materials in markdown, I can store them in a GitHub repo, accept pull requests, etc.)

I have used two different systems: showoff, and remark. Remark is a simple markdown-to-HTML thing; you can see an example in my Docker orchestration workshop. I added a custom class for the hands-on sections, so that people can identify them easily.

Showoff is way more advanced. You also write slides in markdown, but when you present them, you start a custom server that can be accessed in presenter or viewer mode. The presenter has a fancy interface, and viewers can “sync” their view to the presenter, so that the slides auto-advance as the presenter goes through the material. It’s great if you want to go the extra mile.

Beta-test your material

Before the first delivery, enroll 1 or 2 “candid users” to be your beta testers. Go through the material with them at the expected pace. See what works, what doesn’t. Don’t hesitate to take significant pauses to overhaul content if you see that it really doesn’t work.

If you are going to have TAs and must train them, this is the perfect opportunity to do it!

Put sample code on GitHub

If you can, put your material (slides) on GitHub. If you cannot (e.g. if you make a living off your training and building the material represents a huge time investment for you, and/or you’re licensing that material), consider putting the sample code on GitHub anyway.

Why? So that people can easily download it, refer to it, and even fix it (through pull requests).

This is particularly important for any file exceeding a few lines, or for code samples spread across multiple files. It’s way faster to git clone a sample repo, than to copy-paste multiple files, or even worse, type them manually.

Put all examples on GitHub

Something I learned recently: if your workshop has a section where people:

start with some sample code that you provide;
execute it;
then tweak it and execute again;
tweak it more;
etc.

… Then you should consider providing all the successive versions of the code. Either as files with different names, or maybe different tags or branches in the repo; whatever suits your fancy.

This helps a lot, compared to having slides saying “and now you have to change files X and Y so they look like this.”

Put your slides online during the training

If you can, have the slides available online during the training. There are multiple advantages:

it will help people that are far from the projector screen, or have bad eyesight, or don’t have a direct line of sight to the projector screen…
it will help people who are lagging behind a little bit, or who want to look ahead before asking a question
it will help to copy-paste sample code rather than typing it

My 2c tip

When I spin up the VMs for the students, I set one aside for myself. I assign a DNS name for this VM (e.g. training.dckr.info) and I deploy a small static web server on it, with the material that I want to share. When the training involves manager/worker topologies or any kind of setup where the students have to refer to a well-known node, I make this VM the well-known node, and I tell them to use the DNS name (instead of the IP address, which is error-prone).
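
To give an idea of what such a “small static web server” can look like, here is a minimal sketch using only the Python standard library. The function name, the `slides` directory, and port 8000 are assumptions made up for this example; any static file server will do the job.

```python
import functools
import http.server


def make_static_server(directory, host="0.0.0.0", port=8000):
    """Return an HTTP server that serves the files found in `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` argument (Python 3.7+),
    # so we don't have to chdir() into the directory holding the material.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted), assuming the material lives in ./slides:
#   make_static_server("slides").serve_forever()
```

Students can then reach the material at the DNS name of that VM, on whichever port was picked.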

Try to always have at least one TA

If you have more than 20-30 students, it’s great to have a Teaching Assistant. When somebody is stuck, they can raise their hand, and the TA will help them - instead of requiring you to break the flow and stop everything to help them.

Also, when you notice things to fix (typos, commands that don’t work as expected…) it’s great if the TA can double as a scribe and take reliable notes. Those notes will generally be better than the 3 cryptic keywords that you’ll jot down on a scratch file or a post-it note.

Bonus point if the TA makes pull requests during the training, so that immediately after, you can review and fix. The alternative is to postpone that until later, and then life happens and by the time the next training is there you still haven’t fixed the material!

Review your changes

When you make significant changes (moving content around, switching a tool for another…) make sure that you re-run the whole material. It’s very easy to forget a little detail that will make the whole thing obscure (e.g. at some point you decide to replace wget with curl, but you leave a few references to wget in a few places, and now your students are super confused!)

Things I don’t do, but you should

satisfaction surveys
follow-up emails
your suggestion here

Using Docker-in-Docker for your CI or testing environment? Think twice.

2015-09-03T00:00:00+00:00

The primary purpose of Docker-in-Docker was to help with the development of Docker itself. Many people use it to run CI (e.g. with Jenkins), which seems fine at first, but they run into many “interesting” problems that can be avoided by bind-mounting the Docker socket into your Jenkins container instead.

Let’s see what this means. If you want the short solution without the details, just scroll to the bottom of this article. ☺

Update (July 2020): when I wrote this blog post in 2015, the only way to run Docker-in-Docker was to use the --privileged flag in Docker. Today, the landscape is very different. Container security and sandboxing have advanced significantly, with e.g. rootless containers and tools like sysbox. The latter lets you run Docker-in-Docker without the --privileged flag, and even comes with optimizations for some specific scenarios, like running multiple nodes of a Kubernetes cluster as ordinary containers. This article has been updated to reflect that!

Docker-in-Docker: the good

More than two years ago, I contributed the -privileged flag in Docker and wrote the first version of dind. The goal was to help the core team to work faster on Docker development. Before Docker-in-Docker, the typical development cycle was:

hackity hack
build
stop the currently running Docker daemon
run the new Docker daemon
test
repeat

And if you wanted a nice, reproducible build (i.e. in a container), it was a bit more convoluted:

hackity hack
make sure that a workable version of Docker is running
build new Docker with the old Docker
stop Docker daemon
run the new Docker daemon
test
stop the new Docker daemon
repeat

With the advent of Docker-in-Docker, this was simplified to:

hackity hack
build+run in one step
repeat

Much better, right?

Docker-in-Docker: the bad

However, contrary to popular belief, Docker-in-Docker is not 100% made of sparkles, ponies, and unicorns. What I mean here is that there are a few issues to be aware of.

One is about LSM (Linux Security Modules) like AppArmor and SELinux: when starting a container, the “inner Docker” might try to apply security profiles that will conflict with or confuse the “outer Docker.” This was actually the hardest problem to solve when trying to merge the original implementation of the -privileged flag. My changes worked (and all tests would pass) on my Debian machine and Ubuntu test VMs, but it would crash and burn on Michael Crosby’s machine (which was Fedora if I remember correctly). I can’t remember the exact cause of the issue, but it might have been because Mike is a wise person who runs with SELINUX=enforcing (I was using AppArmor) and my changes didn’t take SELinux profiles into account.

Docker-in-Docker: the ugly

The second issue is linked to storage drivers. When you run Docker in Docker, the outer Docker runs on top of a normal filesystem (EXT4, BTRFS, what have you) but the inner Docker runs on top of a copy-on-write system (AUFS, BTRFS, Device Mapper, etc., depending on what the outer Docker is set up to use). There are many combinations that won’t work. For instance, you cannot run AUFS on top of AUFS. If you run BTRFS on top of BTRFS, it should work at first, but once you have nested subvolumes, removing the parent subvolume will fail. Device Mapper is not namespaced, so if multiple instances of Docker use it on the same machine, they will all be able to see (and affect) each other’s image and container backing devices. No bueno.

There are workarounds for many of those issues; for instance, if you want to use AUFS in the inner Docker, just promote /var/lib/docker to be a volume and you’ll be fine. Docker added some basic namespacing to Device Mapper target names, so that if multiple invocations of Docker run on the same machine, they won’t step on each other.

Yet, the setup is not entirely straightforward, as you can see from those issues on the dind repository on GitHub.

Docker-in-Docker: it gets worse

And what about the build cache? That one can get pretty tricky too. People often ask me, “I’m running Docker-in-Docker; how can I use the images located on my host, rather than pulling everything again in my inner Docker?”

Some adventurous folks have tried to bind-mount /var/lib/docker from the host into the Docker-in-Docker container. Sometimes they share /var/lib/docker with multiple containers.

The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.

Why is that? It’s one of the hard-learned lessons from the dotCloud days. The dotCloud container engine worked by having multiple processes accessing /var/lib/dotcloud simultaneously. Clever tricks like atomic file replacement (instead of in-place editing), peppering the code with advisory and mandatory locking, and other experiments with safe-ish systems like SQLite and BDB only got us so far; and when we refactored our container engine (which eventually became Docker) one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense.

(Don’t get me wrong: it’s totally possible to do something nice and reliable and fast involving multiple processes and state-of-the-art concurrency management; but we think that it’s simpler, as well as easier to write and to maintain, to go with the single actor model of Docker.)

This means that if you share your /var/lib/docker directory between multiple Docker instances, you’re gonna have a bad time. Of course, it might work, especially during early testing. “Look ma, I can docker run ubuntu!” But try to do something more involved (pull the same image from two different instances…) and watch the world burn.

This means that if your CI system does builds and rebuilds, each time you restart your Docker-in-Docker container, you might be nuking its cache. That’s really not cool.

Docker-in-Docker: and then it gets better

You certainly have heard some variant of that famous quote by Mark Twain: “They didn’t know it was impossible, so they did it.”

Many folks tried to run Docker-in-Docker safely. A few years ago, I had modest success with user namespaces and some really nasty hacks (including mocking the cgroups pseudo-fs structure over tmpfs mounts so that the container runtime wouldn’t complain too much; fun times) but it appeared that a clean solution would be a major endeavor.

That clean solution exists now: it’s called sysbox. Sysbox is an OCI runtime that can be used instead of, or in addition to, runc. It makes it possible to run “system containers” that would typically require the privileged flag, without the privileged flag; and provides adequate isolation between these containers, as well as between these containers and their host.

Sysbox also provides optimizations to run containers-in-containers. Specifically, when running multiple instances of Docker side by side, it is possible to “seed” them with a shared set of images. This saves both a lot of disk space and a lot of time, and I think this makes a huge difference when running e.g. Kubernetes nodes in containers.

(Running Kubernetes nodes in containers can be particularly useful for CI/CD, when you want to deploy a Kubernetes staging app or run tests in its own cluster, without the infrastructure cost and time overhead of deploying a full cluster on dedicated machines.)

Long story short: if your use case really absolutely mandates Docker-in-Docker, have a look at sysbox, it might be what you need.

The socket solution

Let’s take a step back here. Do you really want Docker-in-Docker? Or do you just want to be able to run Docker (specifically: build, run, sometimes push containers and images) from your CI system, while this CI system itself is in a container?

I’m going to bet that most people want the latter. All you want is a solution so that your CI system like Jenkins can start containers.

And the simplest way is to just expose the Docker socket to your CI container, by bind-mounting it with the -v flag.

Simply put, when you start your CI container (Jenkins or other), instead of hacking something together with Docker-in-Docker, start it with:

docker run -v /var/run/docker.sock:/var/run/docker.sock ...

Now this container will have access to the Docker socket, and will therefore be able to start containers. Except that instead of starting “child” containers, it will start “sibling” containers.

Try it out, using the docker official image (which contains the Docker binary):

docker run -v /var/run/docker.sock:/var/run/docker.sock \
           -ti docker

This looks like Docker-in-Docker, feels like Docker-in-Docker, but it’s not Docker-in-Docker: when this container creates more containers, those containers will be created in the top-level Docker. You will not experience nesting side effects, and the build cache will be shared across multiple invocations.

⚠️ Former versions of this post advised bind-mounting the docker binary from the host to the container. This is not reliable anymore, because the Docker Engine is no longer distributed as an (almost) static binary.

If you want to use e.g. Docker from your Jenkins CI system, you have multiple options:

installing the Docker CLI using your base image’s packaging system (i.e. if your image is based on Debian, use .deb packages),
using the Docker API.
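
To illustrate the second option, here is a minimal sketch that talks to the Docker API through the bind-mounted socket, using only the Python standard library. The class and function names are made up for this example; `/containers/json` is the Docker Engine API endpoint that lists running containers. It assumes that the socket is available at /var/run/docker.sock inside the CI container, as in the docker run command shown earlier.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that goes through a UNIX socket instead of TCP."""

    def __init__(self, socket_path, timeout=10):
        # The host name is only used for the Host: header.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def docker_get(path, socket_path="/var/run/docker.sock"):
    """Send a GET request to the Docker API; return (status, decoded JSON)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()


# Usage, from a container started with the socket bind-mounted:
#   status, containers = docker_get("/containers/json")
```

In practice you would probably reach for an existing client library rather than raw HTTP, but this shows that nothing more than access to the socket is involved.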

Someone said that 30% of the images on the Docker Registry contain vulnerabilities

2015-05-27T00:00:00+00:00

This number is wonderful. Not because it’s high or low, but because it exists. The fact that it is possible (and relatively easy) to compute this metric means that it will be possible (and relatively easy) to improve it, among other things.

Disclaimer: I work for Docker, and while this post is not sponsored or approved by my employer, you are obviously welcome to take it with a grain of salt.

The original number was published on BanyanOps Blog.

For more information and another point of view, you can also read this InfoQ article.

Counting vulnerabilities

First, let’s see how we can come up with those metrics. The process is rather simple:

get a list of images on the Docker registry;
download those images;
audit them for vulnerabilities.

That looks almost too simple, so let’s dive a little bit into the details.

Listing images

Listing official images is easy. They are all built using an automated system called bashbrew, using publicly available recipes. By the way, this means that if you want to rebuild the official images yourself, it is very easy to do so. (Keep in mind that some of those recipes include blobs and tarballs used for bootstrapping purposes; so sometimes you will have to go one step further to rebuild those blobs and tarballs.)

The recipes for all official images are available in the docker-library on GitHub.

Listing other images (the ones belonging to users and organizations) is harder. The hub doesn’t provide a way to list them all right now, so an acceptable workaround is to search for a very common word, e.g. a, and go from there. Of course, this requires some crawling; and you might end up missing a few users, but that will get you pretty close. (That being said, I’m told that the new registry API has something nice to make that task easier…)

Downloading the images

Downloading the images is trivial. If you want to do it without much fuss, just run a Docker daemon, and run docker pull username/imagename:tag.

If you want to get a tarball of the container filesystem, that’s easy: docker export works on containers rather than images, so create a container from the image first (e.g. with docker create username/imagename:tag), then run docker export on that container. (Redirect your standard output somewhere, otherwise your terminal will be a sad panda.)

If you don’t trust the Docker daemon, you can also check the registry API (v1, v2) and download the layers through the API, then reconstruct the image from those layers. I’ll spare you the details, but as of today, layers are regular tarballs, and you can just unpack them on top of each other (in the right order) to reconstruct an image. Nothing fancy is involved; the only “trick” is to watch for whiteouts. Whiteouts are special marker files indicating that “a file used to be there, but it is no more.” In other words, if a layer has the file /etc/foo.conf but it was removed in an upper layer, then that upper layer will have /etc/.wh.foo.conf, and the file foo.conf won’t show up in the container. It is masked by the whiteout, so to speak.
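
The whiteout logic described above can be sketched in a few lines of Python. This is a simplified model, not the code used by Docker: each layer is represented as a dictionary mapping paths to file contents instead of a real tarball, the function name is made up, and other marker variants (such as the ones flagging a whole directory as opaque) are ignored.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."


def apply_layers(layers):
    """Merge layers (lowest first) into a single {path: content} view."""
    merged = {}
    for layer in layers:
        # First, honor the whiteouts: they mask files coming from lower layers.
        for path in layer:
            directory, name = posixpath.split(path)
            if name.startswith(WHITEOUT_PREFIX):
                target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
                for existing in list(merged):
                    # Remove the target itself, or anything below it
                    # if the target was a directory.
                    if existing == target or existing.startswith(target + "/"):
                        del merged[existing]
        # Then, add (or overwrite) the regular files of this layer.
        for path, content in layer.items():
            if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
                merged[path] = content
    return merged


layers = [
    {"/etc/foo.conf": "v1", "/etc/bar.conf": "v1"},      # lower layer
    {"/etc/.wh.foo.conf": "", "/etc/bar.conf": "v2"},    # upper layer
]
print(apply_layers(layers))  # {'/etc/bar.conf': 'v2'}
```

Here /etc/foo.conf is masked by the whiteout in the upper layer, while /etc/bar.conf is simply replaced by its newer version.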

As it turns out, the amazing Tianon actually wrote a script to do exactly that, if you’re interested!

Auditing the images

There are a few different things you can do at this stage. The details are way beyond the scope of this post; but here are some of the things that you might want to do in a comprehensive security audit:

execute yum-security or equivalent, to make sure that no security upgrade is available at this point;
better: get the list and versions of all installed packages, and check that no vulnerable version is present;
compute the hash of each file on the system, and compare them against a set of hashes of known vulnerable files;
execute automated tools (like chkrootkit) to find suspicious files;
execute a number of vulnerability tests, tailored for specific vulnerabilities. The goal of those tests is to try to exploit a vulnerability, and tell you “your system is vulnerable because I managed to exploit this vulnerability” or “I failed to exploit this vulnerability, so your system is probably not vulnerable.”

Things get particularly interesting in the context of containers, because it becomes easy (and convenient) to automate all those things with Docker. For instance, you can put your vulnerability analysis toolkit in /tmp/toolkit, then for each image $I, execute something like docker run -v /tmp/toolkit:/toolkit $I /toolkit/runall.sh.

(Note: this assumes that your toolkit is statically linked and/or self-contained, i.e. doesn’t rely on anything in your container image that might fool the toolkit itself. My main point here is to show that if you need to hammer your container image with a bunch of tests, you can do that in containers to make your life easier, and the overall process will be much faster than it would usually be if you had to make a full copy of the audited machine for each test.)
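
As a rough sketch, the “for each image” loop could look like this in Python. The docker run command is the one given above; the function names are made up, and treating a zero exit status of runall.sh as “PASS” is an assumption about how the (hypothetical) toolkit reports its result.

```python
import subprocess


def audit_command(image, toolkit_dir="/tmp/toolkit"):
    """Build the docker run command that executes the toolkit in one image."""
    return [
        "docker", "run",
        "-v", f"{toolkit_dir}:/toolkit",
        image,
        "/toolkit/runall.sh",
    ]


def audit_images(images, run=subprocess.run):
    """Run the toolkit in each image; return {image: "PASS" or "FAIL"}."""
    results = {}
    for image in images:
        completed = run(audit_command(image))
        # Assumption: runall.sh exits with a non-zero status if a test fails.
        results[image] = "PASS" if completed.returncode == 0 else "FAIL"
    return results


# Usage (requires a Docker daemon and an actual toolkit in /tmp/toolkit):
#   print(audit_images(["ubuntu:12.04", "ubuntu:14.04"]))
```

The `run` parameter is only there so that the loop can be exercised without a Docker daemon, by passing a stand-in for subprocess.run.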

Improving the metric

Alright, so we run all those tests, and we find that an outrageously high number of images contain vulnerable packages. How can we change that?

For official images, the easiest path is to follow Docker’s security guidelines. Down the road, as the number of official images increases, Docker will improve this mechanism to automatically notify upstream security lists for official images.

For non-official images, you can check the Author field in an image:

$ docker inspect --format '{{.Author}}' bin/ngrep
Jerome Petazzoni <jerome@docker.com>

If the image comes from an automated build, you can look up its source repository, and contact them directly.

If you are directly impacted by the vulnerability, and want things to move faster, you can rebuild the image yourself, and/or investigate to see what’s needed to patch the vulnerability, and submit a pull request with the appropriate changes. The intent here is not to offload security to the end users, but rather to empower them to contribute to security if they are willing and able to do so.

Down the road, you can expect all those steps to be improved and streamlined. Automation will be built to reduce the friction around contacting the appropriate authority, and minimize the time required to release patched versions.

But 30% is a lot, right?

It might sound like 30% of “vulnerable images” is a lot. That’s also what I thought at first. But if you take a closer look, a large fraction of those images are older images that are deliberately not updated.

What? Deliberately not updated?

Yes, and there are a couple of good reasons for that. The first one is (for some of them) parity with other media. Some distributions want version XYZ to be consistent across CD/DVD media, network installs, VM images, and containers. The second reason (which also explains the first reason) is repeatable builds.

Imagine that you have a problem with some servers running Ubuntu 12.04, but you can’t reproduce the issue with a new install of Ubuntu 12.04 (let alone 14.04). After investigating further, it turns out that the problem only appears on machines installed at a given time, with Ubuntu 12.04.2. If a container image is available for 12.04.2, you will be able to reproduce the bug; otherwise, you will have to fetch it from elsewhere somehow. That’s why the Docker Hub has images for some older versions in the exact state that they were when they were released - including security issues. That being said, we have put pretty big yellow police tape everywhere saying “LEGACY IMAGES - DO NOT CROSS,” so we hoped that it would be obvious that those images should not be included in a security metric…

Let’s hope that people will realize that next time they compute metrics on the Docker Hub.

Taking action - locally

We might be running vulnerable images! Halp! What do, what do?

There again, the situation isn’t as bad as it looks. When you (or anybody else) do your audit of those images (official, public, or private), the outcome is a list of images (as unique hashes) along with a “PASS” or “FAIL” status. (In the case of “FAIL” you hopefully have some details, e.g. “Seems to be vulnerable to ShellShock / CVE-2014-7187 and others” or “Has package OpenSSL 1.0.1c / CVE-2014-0160.”)

Webscale security audit

You can take this list, and compare it to the images you have locally. That’s where things get really interesting. By doing a simple (and cheap) match of your local images with this list, you will know instantly if you are running vulnerable images. That scales nicely to thousands or millions of hosts.
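
Here is a minimal sketch of that matching step. The data structures are assumptions: the audit result is modeled as a dictionary mapping an image hash to a (status, details) pair, the local images as a plain list of hashes, and the hashes themselves are placeholders.

```python
def check_local_images(local_image_ids, audit_results):
    """Split local images into failing ones and ones the audit never saw."""
    failing = {}
    unknown = []
    for image_id in local_image_ids:
        if image_id not in audit_results:
            unknown.append(image_id)
            continue
        status, details = audit_results[image_id]
        if status == "FAIL":
            failing[image_id] = details
    return failing, unknown


# Hypothetical audit list: image hash -> (status, details).
audit_results = {
    "aaa111": ("PASS", ""),
    "bbb222": ("FAIL", "Has package OpenSSL 1.0.1c / CVE-2014-0160"),
}
# Hypothetical list of image hashes found on a host.
local_image_ids = ["aaa111", "bbb222", "ccc333"]

failing, unknown = check_local_images(local_image_ids, audit_results)
print(failing)  # {'bbb222': 'Has package OpenSSL 1.0.1c / CVE-2014-0160'}
print(unknown)  # ['ccc333']
```

Each lookup is a dictionary access, which is why the check stays cheap even with a long audit list and many hosts. Images missing from the list (like the last one here) are typically custom or private images, which need their own scan.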

It also means that things can be decoupled nicely: your security auditor doesn’t need access to your production systems (or even to your development ones). They don’t even need to know what you are running: they perform an analysis on a broad range of images, and you consume the result. You can also have multiple security companies and compare their results.

What if my containers have been modified after creation?

For starters, you shouldn’t do that. If you need to upgrade something in a container, you should make a new image and run that image. OK, but what if you’ve done it anyway?

Then all bets are off, but at least we can find out that it’s happening. As part of the security audit, you can run docker diff on your running containers to find out if they have been modified. (Normally, the output of docker diff should be empty. Note that if you have started a container with a shell, or dropped into a container with docker exec, you might see a few modifications though. But production containers should not show any change.)

Protip: you can even prevent modifications, by running your containers with the --read-only flag. This will make the container filesystem read-only, guaranteeing that docker diff will remain empty.

To inspect all your containers with a single command, you can do:

docker ps -q | xargs -I {} docker diff {}

(Courtesy of @diogomonica!)

What if I have built custom containers?

If you have built your own containers, I suggest that you push them to a repository. If it’s the public one, we’re back to the initial scenario. If it’s a private repository… Let’s check the next section!

What about private images and registries?

What if you are pushing private images? What if you are pushing to a local registry, or to Docker Hub Enterprise?

Things obviously get more complex. You can’t expect someone to magically tell you “image ABC is vulnerable to CVE-XYZ” if they never saw image ABC.

Here are a few things that can happen:

security providers can offer image scanners, that you can run on your images;
security providers can go farther, and integrate with the Docker registry. This can be done either by delegating read access (for private images on the Docker Hub) or even by on-prem deployment of the security scanner (in the case of Docker Hub Enterprise). In both cases, that gives the ability to automatically scan an image right after it’s pushed, and immediately report any vulnerability.

Conclusions

There are two things that I would like to emphasize, because I believe that they will yield positive results in the security field.

Having numbers is good. Once we have metrics, we can improve them. Docker takes security seriously, and you can be sure that we’ll work with the community and image maintainers to improve those metrics.
Having an ecosystem and community like those around Docker and the Docker Hub makes them amazing places to standardize. As Solomon pointed out in a few keynotes, one of the most important things in Docker is not the technology, but to get people to agree on something.

The last point means that Docker now has enough critical mass to justify the development of transverse tools (including security audit) that will benefit the whole ecosystem. The outcome will be improved security - for everybody.

Docker cares about security

If you get the impression that Docker Inc. doesn’t care about security, you’re far from the truth. As pointed out above, we have a responsible disclosure security policy, and we have always been very fast to address issues that we were aware of. No software is exempt from bugs. Docker is written by humans, and even if some of them are amazing, they still make mistakes. What matters is how seriously we take security reports and how fast we address them; and I think we’ve been doing well on that side.

If you want to make your Docker install more secure, I recommend that you also check dockerbench. As I write those lines, it contains an automated assessment tool, evaluating a Docker host using the criteria of the CIS Docker 1.6 Benchmark. It checks a large number of things (e.g., that SELinux or AppArmor are enabled) and produces a report.

This is the first of many tools that Docker will produce or contribute to, to help you to run Docker safely without holding a Ph.D in container security or hiring Taylor Swift.

Also, we encourage public discussion, and security concerns are no exception! There is an interesting thread on the Docker Library repository about this topic.

Extra notes

I’ve been asked to clarify why containers are useful at all, if we don’t triple-check the provenance of all the things we run. Here are a few examples.

Containers allow us to test risky things (like the infamous curl ... | sh) in a sandbox to see exactly what they’re doing, thanks to docker diff.
Containers allow us to test risky things (like a commercial vendor’s install.sh) in a sandbox to see exactly what they’re doing, thanks to docker diff.
Containers allow us to test risky things (like installing a npm, pip, gem… package of unknown origin) in a sandbox to see exactly what they’re doing, thanks to docker diff.
Containers allow us to test risky things (like installing a deb, rpm, or other distribution package) in a sandbox to see exactly what they’re doing, thanks to docker diff.
Containers allow us to test risky things (like installing a dangerous squid package) in a sandbox to see exactly what they’re doing, thanks to docker diff.

I guess you see the pattern here. Just because things come in a familiar form doesn’t mean that they are safe. But we can use Docker to improve security.

Putting data in a volume in a Dockerfile

2015-01-19T00:00:00+00:00

In a Dockerfile, if you put data in a directory, and then declare that directory to be a volume, weird things can happen. Let’s see what exactly.

The problem

Someone contacted me to ask about very slow build times. They told me: “This is weird. In this Dockerfile, the VOLUME and CMD lines take a few minutes. Why is that?”

The diagnostic

I was very intrigued, and investigated. And I found the reason!

The Dockerfile looked roughly like this:

FROM someimage
RUN mkdir /data
ADD http://.../somefile1 /data/somefile1
ADD http://.../somefile2 /data/somefile2
ADD http://.../somefile3 /data/somefile3
ADD http://.../somefile4 /data/somefile4
VOLUME /data
CMD somebinary

The files added were very big (more than 10 GB total).

The ADD steps do exactly what you think they do: they download the files, and place them in /data.

Then, two particular things happen.

First, when you get to VOLUME /data, you inform Docker that you want /data to be a volume. If Docker doesn’t do anything special, when you create a container from that image, /data will be an empty volume. So instead, when you create a container, Docker makes /data a volume (so far, so good!) and then, it copies all the files from /data (in the image) to this new volume. If those files are big, then the copy will take some time, of course.

This copy operation will happen each time a container is created from the image.

The second thing might also surprise you: even though VOLUME and CMD just modify some metadata, they still create a new container from the image, then modify that metadata, and finally create the image from the modified container.

It means that a new “anonymous” volume will be created for /data, and its content will be populated from the image - for each step of the Dockerfile, even when it’s not strictly necessary.

The solution

So, what do?

Don’t put a lot of data in a volume directory. That’s pretty much it!

It’s OK to have a few megabytes of data in a volume directory. For instance, a blank (or almost empty) database containing a small data set. But, if it’s bigger than that, you probably want to do things differently.

How exactly?

The easiest way is to not use a volume. Just put the data in a normal directory, and it will be part of the copy-on-write filesystem. This is the right thing to do if the data will be read-only, or if it will have very few modifications during the lifetime of the container.

A few examples:

a GeoIP database (mapping IP addresses to geographic information);
pre-generated tiles for a map server;
data and possibly (slow-to-update) search indexes for a significant corpus, like e.g. offline copies of Wikipedia;
etc.

Now what if you really want the data to be on a volume, because you need native I/O speeds?

Then, of course, use a volume. But you should decouple the application and its data. Author a first container image for the application itself, without any significant amount of data (or maybe a minimal test set, allowing you to test that the image works properly). It is OK to put the data on a volume. Since it is small, it won’t cause a significant performance degradation when you work with this container.

Then, author a second container image, just for this data. If you need the application to generate the data, you can base this second container image on the first one. In this image, you can put the data on a volume, but you don’t have to. It is probably better to not declare the data directory as a volume, to avoid the bad surprise of “oops, I’ve triggered a 10 GB data copy again!” each time you start this container.

Once you have your two container images ready, create a container from the second one. If the data directory is not a volume, it is time to declare it explicitly now, with the -v option. This container will be a “data container”; it will not run the service. (When creating it, you could override the process with --entrypoint true, for instance.)

Last but not least, start the actual service container, based on the first image, using the data container volumes with the --volumes-from option.
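
To make the last two steps more concrete, here is a small Python sketch that builds the two corresponding docker command lines. The image and container names are placeholders, and using docker create (rather than docker run) for the data container is an assumption; the -v, --entrypoint, and --volumes-from flags are the ones mentioned above.

```python
def data_container_commands(data_image, app_image,
                            data_dir="/data", data_container="app-data"):
    """Return the commands creating the data container, then the service."""
    create_data = [
        "docker", "create",
        "--name", data_container,
        "-v", data_dir,            # declare the data directory as a volume now
        "--entrypoint", "true",    # this container will not run the service
        data_image,
    ]
    run_service = [
        "docker", "run",
        "--volumes-from", data_container,  # reuse the data container volumes
        app_image,
    ]
    return create_data, run_service


# "mydata" and "myapp" are hypothetical image names.
for command in data_container_commands("mydata", "myapp"):
    print(" ".join(command))
# docker create --name app-data -v /data --entrypoint true mydata
# docker run --volumes-from app-data myapp
```

The resulting lists can be passed as-is to subprocess.run, or the printed commands can be typed in a shell. The big copy into the volume then happens once, when the data container is created, rather than each time a service container starts.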

Voilà!

Additional readings

Check this out:

My notes on Amazon's ECS (EC2 Container Service), aka Docker on AWS

2015-01-14T00:00:00+00:00

This morning, I watched AWS’ webinar presenting their container service. Here are some quick notes, for those of you who are as curious as I was about it!

This is not meant to be an intro to Docker. This is not meant to be an intro to EC2 or to AWS. This is for people who are already familiar with AWS, specifically with EC2, and who are already familiar with Docker, and wonder what’s behind the ECS (EC2 Container Service) announcements made at AWS re:invent last November.

AWS has made the video available if you want to watch the webinar yourself.

Bullet points

TL,DR:

it’s supposed to be a set of building blocks, usable “as-is” or as part of something more complex
your containers will run on your EC2 instances (a bit like for Elastic Beanstalk, if you’re familiar with that)
there is no additional cost: you pay only for the EC2 resources
it only works on VPC
the service is currently in preview (behind a sign-up wall) in us-east-1; general availability will come in the next few months
there is no console dashboard yet; you have to use the CLI or API
for now, you can only start containers from public images hosted on the Docker Hub, but that’s expected to change when the service goes out of preview

Glossary of terms

Here is some vocabulary to help you to mash through the ECS docs.

Container instance

A “container instance” can be any EC2 instance, running any distro (Amazon Linux, Ubuntu, CoreOS…)

It just needs two extra software components:

the Docker daemon,
the AWS ECS agent.

The ECS agent is open source (Apache license). You can check the ECS agent repo on github.

Cluster

That’s a pool of resources (i.e. of container instances).

A cluster starts out empty, and you can dynamically scale it up and down by adding and removing instances.

You can have mixed types of instances in here.

It’s a regional object, that can span multiple AZs.

Task definition

It’s an app definition in JSON. The format is conceptually similar to Fig, but not quite like it.

I don’t know why they didn’t pick something more like Fig, or more like the Docker Compose project. It might be because almost everything else on AWS is in JSON, and they wanted to stick to that.

Note: Micah Hausler wrote container-transform, a tool to convert Fig/Compose YAML files to the ECS task format:

@jpetazzo @docker I wrote a little fig.yml <==> ecs-task.json converter. https://t.co/brcOtFvLAk
— Micah Hausler (@micahhausler) January 15, 2015

Task

A task is an instantiation of a task definition. In other words, that will be a group of related, running containers.

The workflow

So, how does one use that? The workflow looks like this:

Build image using whatever you want.
Push image to registry.
Create JSON file describing your task definition.
Register this task definition with ECS.
Make sure that your cluster has enough resources.
Start a new task from the task definition.

Now, diving into the details; there are 3 ways to start a task:

Use the CLI command start-task. You must then specify the cluster to use, the task definition, and the exact container instance on which to start it. It’s a bit like doing manual scheduling.
Use the CLI command run-task. You must then specify the cluster, task definition, and an instance count. It will use the default ECS resource scheduler (which is a random scheduler).
Bring your own scheduler!

The webinar had a demo involving Mesos; they started containers from Marathon, from Chronos, and using the CLI as well, and the containers were visible everywhere. That looked cool. Initially, I didn’t understand how it worked; but the people who built it were kind enough to chime in and explain:

@jpetazzo the Mesos integration is via a Mesos scheduler driver
— Deepak Singh (@mndoci) January 15, 2015

@mndoci @jpetazzo the scheduler drive speaks to AWS ECS only, there are no Mesos masters or slaves involved. Just ECS + Marathon/Chronos
— William Thurston (@williamthurston) January 15, 2015

Networking

Not much on that side.

Containers can be linked (the task definition lets you name containers, and then indicate that a container is linked to another one) but I don’t know how that works.

My personal take

My understanding is that ECS, as it is today, is a technology preview.

There are some items that are still to be clarified, like the use of private registries (but on that front, Docker Hub Enterprise might eventually come on AWS; and it will likely integrate nicely with ECS).

Docker-centric point of view

I would love if:

container instances and clusters could be managed with Docker Machine
task definitions and tasks could be managed with Docker Compose as the frontend
Docker Swarm could be used as a custom scheduler

Those interoperability points would let anyone move their container workloads seamlessly from/to ECS. More importantly, they would let anyone use the elasticity and scale of EC2, without having to learn APIs and concepts specific to ECS.

AWS-centric point of view

I would love if:

ECS could integrate with Cloud Formation (that’s planned)
I could also build images (that’s pretty trivial with an ad-hoc instance)…
… and push them on an S3-backed registry that would be neatly integrated with ECS (notably for security credentials)

Last words

Full disclaimer: I haven’t tested ECS yet (and unfortunately, I don’t know if I’ll be able to). So if you have any feedback or tip that would be useful for others, don’t hesitate to let me know!

Attach a volume to a container while it is running

2015-01-13T00:00:00+00:00

It has been asked on #docker-dev recently if it was possible to attach a volume to a container after it was started. At first, I thought it would be difficult, because of how the mnt namespace works. Then I thought better :-)

TL,DR

To attach a volume into a running container, we are going to:

use nsenter to mount the whole filesystem containing this volume on a temporary mountpoint;
create a bind mount from the specific directory that we want to use as the volume, to the right location of this volume;
umount the temporary mountpoint.

It’s that simple, really.

Preliminary warning

In the examples below, I deliberately included the $ sign to indicate the shell prompt and to help distinguish between what you’re supposed to type, and what the machine is supposed to answer. There are some multi-line commands, with > continuation characters. I am aware that it makes the examples really painful to copy-paste. If you want to copy-paste code, look at the sample script at the end of this post!

Step by step

In the following example, I assume that I started a simple container named charlie, with the following command:

$ docker run --name charlie -ti ubuntu bash

I also assume that I want to mount the host directory /home/jpetazzo/Work/DOCKER/docker to /src in my container.

Let’s do this!

nsenter

First, you will need nsenter, with the docker-enter helper script. Why? Because we are going to mount filesystems from within our container, and for security reasons, our container is not allowed to do that. Using nsenter, we will be able to run an arbitrary command within the context (technically: the namespaces) of our container, but without the associated security restrictions. Needless to say, this can be done only with root access on the Docker host.

The simplest way to install nsenter and its associated docker-enter script is to run:

$ docker run --rm -v /usr/local/bin:/target jpetazzo/nsenter

For more details, check the nsenter project page.

Find our filesystem

We want to mount the filesystem containing our host directory (/home/jpetazzo/Work/DOCKER/docker) in the container.

We have to find on which filesystem this directory is located.

First, we will canonicalize (or dereference) the file, just in case it is a symbolic link - or its path contains any symbolic link:

$ readlink --canonicalize /home/jpetazzo/Work/DOCKER/docker
/home/jpetazzo/go/src/github.com/docker/docker

A-ha, it is indeed a symlink! Let’s put that in an environment variable to make our life easier:

$ HOSTPATH=/home/jpetazzo/Work/DOCKER/docker
$ REALPATH=$(readlink --canonicalize $HOSTPATH)

Then, we need to find which filesystem contains that path. We will use an unexpected tool for that, df:

$ df $REALPATH
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda2      245115308 156692700  86157700  65% /home/jpetazzo

Let’s use the -P flag (to force POSIX format, just in case you have an exotic df, or someone runs that on Solaris or BSD when those systems get Docker too) and put the result into a variable as well:

$ FILESYS=$(df -P $REALPATH | tail -n 1 | awk '{print $6}')

Find the device (and sub-root) of our filesystem

Now, in a world without bind mounts or BTRFS subvolumes, we would just have to look into /proc/mounts to find out the device corresponding to the /home/jpetazzo filesystem, and we would be golden. But on my system, /home/jpetazzo is a subvolume on a BTRFS pool. To get subvolume information (or bind mount information), we will check /proc/self/mountinfo.

If you had never heard about mountinfo, check proc.txt in the kernel docs, and be enlightened :-)

So, first, let’s retrieve the device of our filesystem:

$ while read DEV MOUNT JUNK
> do [ $MOUNT = $FILESYS ] && break
> done </proc/mounts
$ echo $DEV
/dev/sda2

Next, retrieve the sub-root (i.e. the path of the mounted filesystem, within the global filesystem living in this device):

$ while read A B C SUBROOT MOUNT JUNK
> do [ $MOUNT = $FILESYS ] && break
> done < /proc/self/mountinfo 
$ echo $SUBROOT
/jpetazzo

Perfect. Now we know that we will need to mount /dev/sda2, and inside that filesystem, go to /jpetazzo, and from there, to the remaining path to our file (in our example, /go/src/github.com/docker/docker).
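To make the field layout more concrete, here is the same loop applied to a single mountinfo line. The line is made up for the example (it is not copied from my machine); the only fields that matter are the fourth one (the root of the mount within its filesystem) and the fifth one (the mount point).

```shell
#!/bin/sh
# A made-up mountinfo line, for illustration only. The fields are:
# mount ID, parent ID, major:minor, root (our "sub-root"), mount point,
# then mount options and other fields that we don't need here.
SAMPLE='36 25 0:32 /jpetazzo /home/jpetazzo rw,relatime shared:20 - btrfs /dev/sda2 rw,space_cache'
FILESYS=/home/jpetazzo

# Same loop as above, but reading from a here-document
# instead of /proc/self/mountinfo.
while read A B C SUBROOT MOUNT JUNK
do [ "$MOUNT" = "$FILESYS" ] && break
done <<EOF
$SAMPLE
EOF

echo "$SUBROOT"
```

On a plain (non-BTRFS, non-bind) mount, that fourth field would simply be `/`.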

Let’s compute this remaining path, by the way:

$ SUBPATH=$(echo $REALPATH | sed s,^$FILESYS,,)

Note: this works as long as there are no , in the path. If you have an idea to make that work regardless of the funky characters that might be in the path, let me know! (I shall invoke the Shell Triad to the rescue: jessie, soulshake, tianon?)
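One possible answer is to skip sed entirely and let the shell strip the prefix with parameter expansion. This is only a sketch, and the path below is hypothetical (it contains a comma on purpose, to show the case that would trip the sed version):

```shell
#!/bin/sh
# Hypothetical values, including a comma that would break the sed version.
FILESYS='/home/jpetazzo'
REALPATH='/home/jpetazzo/go/src/weird,dir/docker'

# ${VAR#prefix} removes the shortest matching prefix; quoting "$FILESYS"
# makes the shell treat it as a literal string rather than a pattern.
SUBPATH=${REALPATH#"$FILESYS"}

echo "$SUBPATH"
```

Since the prefix is quoted, characters like `*`, `?` or `[` in the mount point are not interpreted as glob patterns either.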

The last thing that we need to do before diving into the container, is to resolve the major and minor device numbers for this block device. stat will do it for us:

$ stat --format "%t %T" $DEV
8 2

Note that those numbers are in hexadecimal, and later, we will need them in decimal. Here is a hackish way to convert them easily:

$ DEVDEC=$(printf "%d %d" $(stat --format "0x%t 0x%T" $DEV))
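The sample output above (8 2) reads the same in hexadecimal and in decimal, which hides the problem a bit. Here is the same conversion with made-up device numbers where the difference is visible:

```shell
#!/bin/sh
# Hypothetical device numbers as stat would print them with %t and %T
# (hexadecimal, without prefix): major 0x103, minor 0x10.
HEXMAJOR=103
HEXMINOR=10

# Prefixing with 0x lets printf parse them as hexadecimal.
DEVDEC=$(printf "%d %d" "0x$HEXMAJOR" "0x$HEXMINOR")

echo "$DEVDEC"
```

This prints 259 16, which are the values that mknod expects.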

Putting it all together

There is one last subtle hack. For reasons that are beyond my understanding, some filesystems (including BTRFS) will update the device field in /proc/mounts when you mount them multiple times. In other words, if we create a temporary block device named /tmpblkdev in our container, and use that to mount our filesystem, then our filesystem (in the host!) will appear as /tmpblkdev instead of e.g. /dev/sda2. This sounds like a little detail, but in fact, it will screw up all future attempts to resolve the filesystem block device.

Long story short: we have to make sure that the block device node in the container is located at the same path as its counterpart on the host.

Let’s do this:

$ docker-enter charlie -- sh -c \
> "[ -b $DEV ] || mknod --mode 0600 $DEV b $DEVDEC"

Create a temporary mount point, and mount the filesystem:

$ docker-enter charlie -- mkdir /tmpmnt
$ docker-enter charlie -- mount $DEV /tmpmnt

Make sure that the volume mount point exists, and bind mount the volume on it:

$ docker-enter charlie -- mkdir -p /src
$ docker-enter charlie -- mount -o bind /tmpmnt/$SUBROOT/$SUBPATH /src

Clean up after ourselves:

$ docker-enter charlie -- umount /tmpmnt
$ docker-enter charlie -- rmdir /tmpmnt

(We don’t clean up the device node. We could be extra fancy and detect whether the device existed in the first place, but this is already pretty complex as it is right now!)

Voilà!

Automating the hell out of it

This little snippet is almost copy-paste ready.

#!/bin/sh
set -e
CONTAINER=charlie
HOSTPATH=/home/jpetazzo/Work/DOCKER/docker
CONTPATH=/src

REALPATH=$(readlink --canonicalize $HOSTPATH)
FILESYS=$(df -P $REALPATH | tail -n 1 | awk '{print $6}')

while read DEV MOUNT JUNK
do [ $MOUNT = $FILESYS ] && break
done </proc/mounts
[ $MOUNT = $FILESYS ] # Sanity check!

while read A B C SUBROOT MOUNT JUNK
do [ $MOUNT = $FILESYS ] && break
done < /proc/self/mountinfo 
[ $MOUNT = $FILESYS ] # Moar sanity check!

SUBPATH=$(echo $REALPATH | sed s,^$FILESYS,,)
DEVDEC=$(printf "%d %d" $(stat --format "0x%t 0x%T" $DEV))

docker-enter $CONTAINER -- sh -c \
    "[ -b $DEV ] || mknod --mode 0600 $DEV b $DEVDEC"
docker-enter $CONTAINER -- mkdir /tmpmnt
docker-enter $CONTAINER -- mount $DEV /tmpmnt
docker-enter $CONTAINER -- mkdir -p $CONTPATH
docker-enter $CONTAINER -- mount -o bind /tmpmnt/$SUBROOT/$SUBPATH $CONTPATH
docker-enter $CONTAINER -- umount /tmpmnt
docker-enter $CONTAINER -- rmdir /tmpmnt

Status and limitations

This will not work on filesystems which are not based on block devices.

It will only work if /proc/mounts correctly lists the block device node (which, as we saw above, is not necessarily true).

Also, I only tested this on my local environment; I didn’t even try on a cloud instance or anything like that, but I would love to know if it works there or not!

Gravlax (cured salmon) recipe

2014-12-27T00:00:00+00:00

This is my recipe for Gravlax (cured salmon). It makes a great appetizer. The only downside is that you must prepare it in advance, since it needs to cure 36 to 48 hours in the fridge.

Ingredients

The following quantities will easily feed 15 people if the salmon is the main appetizer (and you will probably have some leftovers, but that’s OK because you can keep the salmon a few days in the fridge without any problem, since it will be cured). Don’t hesitate to scale down the recipe if you have fewer people!

4 pounds of salmon (either filets with the skin, or a slice of the whole fish, cut “en portefeuille”)
5 ounces of sugar
5 ounces of salt
some fresh dill
pepper (if you can grind some Sichuan pepper, go for it!)
1 pound of “fromage blanc” (in the US, it’s generally sold as “greek yogurt”)
2 big lemons
some olive oil

Finely chop the dill. If you don’t have specialized equipment, you can use kitchen scissors. It’s better to remove the stems, at least the bigger ones; it’s not mandatory but it will be more pleasant (and the taste is in the leaves and smaller stems anyway).

The fish

If you see some fishbones poking out of the salmon, remove them now.

In a bowl, mix the sugar, salt, one small teaspoon of ground pepper, and half of the dill.

Smear the mix evenly on the salmon flesh.

Now press the filets against each other, flesh against flesh.

Put the salmon in a dish with a relatively high edge (maybe 2 inches), because the fish is going to let out a lot of liquid.

Put a weight on the fish, so that the filets are tightly pressed against each other. I typically put a cutting board (smaller than the dish), and let one or two packs of beer rest on the board.

Put the dish in the fridge. Flip it every 12 hours. Leave at least 36 hours total (48 hours is even better).

When you remove the fish from the fridge, you should see a lot of fishy water in the dish. Drain it. Then rinse the salmon profusely, to remove the sugar and salt mix. There will probably be some bits of dill embedded in the fish; that’s OK.

Dice the salmon. I typically make cubes about the size of those appetizer cheese cubes, but you can also make thin slices.

To dice the fish, I use a very sharp, toothless knife. I cut thin strips (half an inch wide, 2 to 4 inches long), with the skin; then I slice the strip in the middle, and make the blade slide between the skin and the flesh.

Now, wash your hands 3 times, because after cutting the fish, they probably stink :-D

The sauce

Let’s prepare the sauce. Mix the greek yogurt, the juice of the two lemons, one tablespoon of olive oil, and the other half of the dill. Add a bit of salt and pepper according to your taste.

Note: a few hours after cutting it, the salmon will taste very salty, but with the sauce, it should be just perfect. In my experience, after a day or so, it doesn’t taste that salty, and can be enjoyed without the sauce. I don’t know if the taste of salt really goes away, or if my taste buds get accustomed.

Serve!

I generally split the salmon in multiple bowls, paired with smaller bowls containing the sauce. Unless your guests enjoy the smell of fish on their fingers, you probably want to consider providing picks (wooden toothpicks work just fine) :-)

Multiple Docker containers logging to a single syslog

2014-08-24T00:00:00+00:00

This is a simple recipe showing how to run syslog in one container, and then send the syslog messages of multiple other containers to that one.

The Dockerfile and basic instructions are available on a tiny GitHub repo: https://github.com/jpetazzo/syslogdocker.

The concept is very simple.

First, we build a container with the following characteristics:

has rsyslogd daemon installed, and defined as the default command;
/dev is defined to be a volume;
/var/log is defined to be a volume.

Here is a Dockerfile for such a container.

Then, we start that container; but we use an explicit host bind-mount, e.g.:

docker run --name syslog -d -v /tmp/syslogdev:/dev syslog

Why the explicit host bind-mount? Because that container will create /dev/log when rsyslog starts, and we want to “pick up” that socket and bind-mount it in our future containers, without having to bind-mount the whole /dev. If we just use --volumes-from, we will pick up the whole /dev. It won’t have a big impact for now, but if later we do fancy stuff (like adding custom devices) it could mess things up, so let’s be fine-grained.

Later versions of Docker might allow fine-grained --volumes-from, which will be even better.

Then we can start any container, bind-mounting the /dev/log into it:

docker run -v /tmp/syslogdev/log:/dev/log myimage somecommand

For an educational example, you can do this:

docker run -v /tmp/syslogdev/log:/dev/log ubuntu logger hello

That’s it! That container will send log messages to /dev/log, which will actually be the socket created by rsyslogd.

You can see the logs by running another container with --volumes-from syslog and checking the files in /var/log.

For bonus points, you can try to see what happens when you use journald or something that tries to be container-aware :-)

If you run SSHD in your Docker containers, you're doing it wrong!

2014-06-23T00:00:00+00:00

When they start using Docker, people often ask: “How do I get inside my containers?” and other people will tell them “Run an SSH server in your containers!” but that’s a very bad practice. We will see why it’s wrong, and what you should do instead.

Note: if you want to comment or share this article, use the canonical version hosted on the Docker Blog. Thank you!

Your containers should not run an SSH server

…Unless your container is an SSH server, of course.

It’s tempting to run the SSH server, because it gives an easy way to “get inside” of the container. Virtually everybody in our craft used SSH at least once in their life. Most of us use it on a daily basis, and are familiar with public and private keys, password-less logins, key agents, and even sometimes port forwarding and other niceties.

With that in mind, it’s not surprising that people would advise you to run SSH within your container. But you should think twice.

Let’s say that you are building a Docker image for a Redis server or a Java webservice. I would like to ask you a few questions.

What do you need SSH for?

Most likely, you want to do backups, check logs, maybe restart the process, tweak the configuration, possibly debug the server with gdb, strace, or similar tools. We will see how to do those things without SSH.

How will you manage keys and passwords?

Most likely, you will either bake those into your image, or put them in a volume. Think about what you should do when you want to update keys or passwords. If you bake them into the image, you will need to rebuild your images, redeploy them, and restart your containers. Not the end of the world, but not very elegant either. A much better solution is to put the credentials in a volume, and manage that volume. It works, but has significant drawbacks. You should make sure that the container does not have write access to the volume; otherwise, it could corrupt the credentials (preventing you from logging into the container!), which could be even worse if those credentials are shared across multiple containers. If only SSH could be elsewhere, that would be one less thing to worry about, right?

How will you manage security upgrades?

The SSH server is pretty safe, but still, when a security issue arises, you will have to upgrade all the containers using SSH. That means rebuilding and restarting all of them. That also means that even if you need a pretty innocuous memcached service, you have to stay up-to-date with security advisories, because the attack surface of your container is suddenly much bigger. Again, if SSH could be elsewhere, that would be a nice separation of concerns, wouldn’t it?

Do you need to “just add the SSH server” to make it work?

No. You also need to add a process manager; for instance Monit or Supervisor. This is because Docker will watch one single process. If you need multiple processes, you need to add one at the top-level to take care of the others. In other words, you’re turning a lean and simple container into something much more complicated. If your application stops (if it exits cleanly or if it crashes), instead of getting that information through Docker, you will have to get it from your process manager.

You are in charge of putting the app inside a container, but are you also in charge of access policies and security compliance?

In smaller organizations, that doesn’t matter too much. But in larger groups, if you are the person putting the app in a container, there is probably a different person responsible for defining remote access policies. Your company might have strict policies defining who can get access, how, and what kind of audit trail is required. In that case, you definitely don’t want to put an SSH server in your container.

But how do I …

Backup my data?

Your data should be in a volume. Then, you can run another container, and with the --volumes-from option, share that volume with the first one. The new container will be dedicated to the backup job, and will have access to the required data.

Added benefit: if you need to install new tools to make your backups or to ship them to long term storage (like s3cmd or the like), you can do that in the special-purpose backup container instead of the main service container. It’s cleaner.

Check logs?

Use a volume! Yes, again. If you write all your logs under a specific directory, and that directory is a volume, then you can start another “log inspection” container (with --volumes-from, remember?) and do everything you need here.

Again, if you need special tools (or just a fancy ack-grep), you can install them in the other container, keeping your main container in pristine condition.

Restart my service?

Virtually all services can be restarted with signals. When you issue /etc/init.d/foo restart or service foo restart, it will almost always result in sending a specific signal to a process. You can send that signal with docker kill -s <signal>.

Some services won’t listen to signals, but will accept commands on a special socket. If it is a TCP socket, just connect over the network. If it is a UNIX socket, you will use… a volume, one more time. Set up the container and the service so that the control socket is in a specific directory, and that directory is a volume. Then you can start a new container with access to that volume; it will be able to use the socket.

“But, this is complicated!” - not really. Let’s say that your service foo creates a socket in /var/run/foo.sock, and requires you to run fooctl restart to be restarted cleanly. Just start the service with -v /var/run (or add VOLUME /var/run in the Dockerfile). When you want to restart, execute the exact same image, but with the --volumes-from option and overriding the command. It looks like this:

# Starting the service
CID=$(docker run -d -v /var/run fooservice)
# Restarting the service with a sidekick container
docker run --volumes-from $CID fooservice fooctl restart

It’s that simple!

Edit my configuration?

If you are performing a durable change to the configuration, it should be done in the image - because if you start a new container, the old configuration will be there again, and your changes will be lost. So, no SSH access for you!

“But I need to change my configuration over the lifetime of my service; for instance to add new virtual hosts!”

In that case, you should use… wait for it… a volume! The configuration should be in a volume, and that volume should be shared with a special-purpose “config editor” container. You can use anything you like in this container: SSH + your favorite editor, or a web service accepting API calls, or a crontab fetching the information from an outside source; whatever.

Again, you’re separating concerns: one container runs the service, another deals with configuration updates.

“But I’m doing temporary changes, because I’m testing different values!”

In that case, check the next section!

Debug my service?

That’s the only scenario where you really need to get a shell into the container. Because you’re going to run gdb, strace, tweak the configuration, etc.

In that case, you need nsenter.

Introducing `nsenter`

nsenter is a small tool that lets you enter namespaces. Technically, it can enter existing namespaces, or spawn a process into a new set of namespaces. “What are those namespaces you’re blabbering about?” They are one of the essential constituents of containers.

The short version is: with nsenter, you can get a shell into an existing container, even if that container doesn’t run SSH or any kind of special-purpose daemon.

Where do I get `nsenter`?

Check jpetazzo/nsenter on GitHub. The short version is that if you run:

docker run -v /usr/local/bin:/target jpetazzo/nsenter

… this will install nsenter in /usr/local/bin and you will be able to use it immediately.

nsenter might also be available in your distro (in the util-linux package).

How do I use it?

First, figure out the PID of the container you want to enter:

PID=$(docker inspect --format '{{.State.Pid}}' <container_name_or_ID>)

Then enter the container:

nsenter --target $PID --mount --uts --ipc --net --pid

You will get a shell inside the container. That’s it.

If you want to run a specific script or program in an automated manner, add it as an argument to nsenter. It works a bit like chroot, except that it works with containers instead of plain directories.

What about remote access?

If you need to enter a container from a remote host, you have (at least) two ways to do it:

SSH into the Docker host, and use nsenter;
SSH into the Docker host, where a special key will force a specific command (namely, nsenter).

The first solution is pretty easy; but it requires root access to the Docker host (which is not great from a security point of view).

The second solution uses the command= pattern in SSH’s authorized_keys file. You are probably familiar with “classic” authorized_keys files, which look like this:

ssh-rsa AAAAB3N…QOID== jpetazzo@tarrasque

(Of course, a real key is much longer; in the file it sits on a single line, even though it usually wraps over several lines on screen.)

You can also force a specific command. If you want to be able to check the available memory on your system from a remote host, using SSH keys, but you don’t want to give full shell access, you can put this in the authorized_keys file:

command="free" ssh-rsa AAAAB3N…QOID== jpetazzo@tarrasque

Now, when that specific key connects, instead of getting a shell, it will execute the free command. It won’t be able to do anything else.

(Technically, you probably want to add no-port-forwarding; check the manpage authorized_keys(5) for more information.)
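To tie this back to containers, the forced command can be a small wrapper that calls nsenter. What follows is a minimal sketch, not a hardened tool. It assumes that the client passes the container name as the SSH command (sshd exposes it to the forced command in the SSH_ORIGINAL_COMMAND environment variable), that the wrapper is installed under a path of your choosing, and that only plain container names should be accepted. The function names and the validation rule are mine.

```shell
#!/bin/sh
# Hypothetical forced-command wrapper, to be referenced from
# authorized_keys with command="/path/to/this/script".

# Accept only plain container names, so that nothing supplied by the
# client can be interpreted by the shell.
valid_name() {
    case "$1" in
        ""|*[!A-Za-z0-9_.-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Look up the PID of the container, then enter its namespaces.
enter_container() {
    name=$1
    valid_name "$name" || { echo "invalid container name" >&2; return 1; }
    pid=$(docker inspect --format '{{.State.Pid}}' "$name") || return 1
    exec nsenter --target "$pid" --mount --uts --ipc --net --pid
}

# Only act when sshd handed us a client-supplied command.
if [ -n "${SSH_ORIGINAL_COMMAND:-}" ]; then
    enter_container "$SSH_ORIGINAL_COMMAND"
fi
```

With something like this in place, `ssh dockerhost charlie` would drop the key holder into the charlie container and nowhere else, provided that the account running the wrapper is allowed to run docker and nsenter.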

The crux of this mechanism is to split responsibilities. Alice puts services within containers; she doesn’t deal with remote access, logging, and so on. Betty will add the SSH layer, to be used only in exceptional circumstances (to debug weird issues). Charlotte will take care of logging. And so on.

Wrapping up

Is it really Wrong (uppercase double you) to run the SSH server in a container? Let’s be honest, it’s not that bad. It’s even super convenient when you don’t have access to the Docker host, but still need to get a shell within the container.

But we saw here that there are many ways to not run an SSH server in a container, and still get all the features we want, with a much cleaner architecture.

Docker allows you to use whatever workflow is best for you. But before jumping in the “my container is really a small VPS” bandwagon, be aware that there are other solutions, so you can make an informed decision!

Setting up a transparent proxy for your Docker containers

2014-06-17T00:00:00+00:00

If you build a lot of containers, and have a not-so-fast internet link, you might be spending a lot of time waiting for packages to download. It would be nice if all those downloads could be automatically cached, without tweaking your Dockerfiles, right?

Or, maybe your corporate network forbids direct outside access, and requires you to use a proxy. Then you can edit this recipe so that it cascades to the corporate proxy. Your containers will use the transparent proxy, which itself will pass along to the corporate proxy.

I want this now!

Just do this:

docker run --net host jpetazzo/squid-in-a-can
iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to 3129

That’s it. Now all HTTP requests going through your Docker host will be transparently routed through the proxy running in the container.

Note: it will only affect HTTP traffic on port 80.

Note: traffic originating from the host will not be affected, because the PREROUTING chain is not traversed by packets originating from the host.

Note: if your Docker host is also a router for other things (e.g. if it runs various virtual machines, or is a VPN server, etc), those things will also see their HTTP traffic routed through the proxy. They have to use internal IP addresses, though.

Note: if you plan to run this on EC2 (or any kind of infrastructure where the machine has an internal IP address), you should probably tweak the ACLs, or make sure that outside machines cannot access ports 3128 and 3129 on your host.

How does it work?

The jpetazzo/squid-in-a-can container runs a really basic Squid3 proxy. You can see the Dockerfile for this image on the Docker Hub.

Rather than writing my own configuration file, I patch the default Debian configuration. The main thing is to enable intercept on another port (here, 3129).

Then, this container should be started using the network namespace of the host (that’s what the --net host option is for). Another strategy would be to start the container with its own namespace. Then, the HTTP traffic can be directed to it with a DNAT rule. The problem with this approach is that Squid will “see” the traffic as being directed to its own IP address, instead of the destination HTTP server IP address; and since Squid 3.3, it refuses to honor such requests.

(The reasoning is, that it would then have to trust the HTTP Host: header to know where to send the request. You can check CVE-2009-0801 for details.)

Notes

Ideas for improvement:

persistent caching (with, obviously, a volume!)
easy chaining to an upstream proxy
set up the iptables rules automatically if the container runs in privileged mode

Don’t hesitate to fork it on GitHub and contribute! :-)

Attaching to a container with Docker 0.9 and libcontainer

2014-03-23T00:00:00+00:00

If you upgraded your Docker installation to 0.9, you are now using libcontainer to run your containers. And if you were using lxc-attach, you probably noticed that it doesn’t work anymore. Here are two ways to recover the “attach” feature with Docker containers.

What happened?

First, let’s explain exactly what’s involved here. Before 0.9, Docker was using the LXC userland tools to start containers. It means that docker run … eventually translated to a call to lxc-start. As such, Docker containers could be managed with the LXC userland tools, including lxc-attach to obtain a shell within an existing container. This is a very convenient feature, because you can drop into a container without having to run a special server process within the container.

lxc-attach relies on the fact that each container created with lxc-start listens on a specific socket: by default, an abstract socket named /var/lib/lxc/<container_name>/command. It uses that socket to infer the PID of the container, and then uses the setns() syscall to attach a new process to the namespaces used by the container.

Docker 0.9 ships with the “native” execution driver, which uses libcontainer instead of the LXC userland tools. And guess what, libcontainer doesn’t create that abstract socket, so lxc-attach is confused and can’t locate the container.

There are (at least) three solutions:

use nsenter, a little Linux tool to fiddle with namespaces and enter them (as you could guess from the name!);
use nsinit, a tool that comes with libcontainer;
revert to the LXC driver (if you can’t install nsenter or nsinit).

Use `nsenter`

In most distros, nsenter is in the util-linux package. It has been part of util-linux since version 2.23. Unfortunately, Debian and Ubuntu still ship with util-linux 2.20 as of March 2014; so you will have to compile it yourself:

cd /tmp
curl https://www.kernel.org/pub/linux/utils/util-linux/v2.24/util-linux-2.24.tar.gz \
     | tar -zxf-
cd util-linux-2.24
./configure --without-ncurses
make nsenter
cp nsenter /usr/local/bin

(You might have to adjust the configure line a little bit.)

Now, find the PID of the first process of the container (actually, any PID will do, but this is just easier and safer):

PID=$(docker inspect --format '{{.State.Pid}}' my_container_id)

Then, enter like this:

nsenter --target $PID --mount --uts --ipc --net --pid

Voilà, you are now in the container!

nsenter does not drop capabilities; so the shell started by nsenter can do more stuff (and more harm!) than a normal process running within the container.
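
If you attach to containers often, the two steps above can be wrapped in a small shell function. This is only a sketch: the names nsenter_command and docker_enter are made up for this example, and docker_enter assumes that you are root and that both docker and nsenter are in your PATH. The last line is a dry run that just prints the command, so you can try it anywhere.

```shell
#!/bin/sh
# Hypothetical helpers; neither name is part of Docker or util-linux.

# Print the nsenter command line for a given PID, without running it.
nsenter_command() {
    echo "nsenter --target $1 --mount --uts --ipc --net --pid"
}

# Find the PID of the container's first process, then enter its namespaces.
# Needs root, a running Docker daemon, and nsenter in the PATH.
docker_enter() {
    pid=$(docker inspect --format '{{.State.Pid}}' "$1") || return 1
    # A stopped container reports PID 0: there is nothing to attach to.
    [ "$pid" -gt 0 ] || { echo "container $1 is not running" >&2; return 1; }
    eval "$(nsenter_command "$pid")"
}

# Dry run with an arbitrary PID:
nsenter_command 1234
# prints: nsenter --target 1234 --mount --uts --ipc --net --pid
```

With that in place, docker_enter my_container_id would do the lookup and the nsenter call in one go. The warning about capabilities applies to it just the same.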

Note: when looking for details about nsenter, I realized that Sebastien Han already posted a very similar recipe. If you want to use nsenter with a Docker version older than 0.9, his recipe works best (since Docker pre-0.9 doesn’t have .State.Pid).

Use `nsinit`

According to Michael Crosby, it is even better to use nsinit. And he’s a core maintainer of Docker, and primary author of libcontainer; so you bet he knows what he’s talking about ☺

To install nsinit, you need a Go development environment. (On Debian/Ubuntu, apt-get install golang-go might be sufficient.)

Then, assuming that your GOPATH etc. is set correctly, all you need is:

go install github.com/dotcloud/docker/pkg/libcontainer/nsinit/nsinit

Then, you need to go to the container configuration directory. Where’s that? It’s in /var/lib/docker/execdriver/native/<container_id>/. Find the short ID of your container with docker ps, then go to the right directory (you will need root access, since /var/lib/docker is readable only by root).

Then, once in that directory, just run nsinit exec bash. That’s all.

You can check this Asciinema demo to see it in action!

Revert to the LXC driver

If you can compile neither nsenter nor nsinit, well, your last option is to revert Docker to use the LXC driver.

First, stop your Docker daemon. Then edit the daemon start options (on Debian/Ubuntu, edit /etc/default/docker and find the line with DOCKER_OPTS). Add -e lxc. Restart Docker. Done. You can now use lxc-attach again, but each morning, when you see your face in the mirror, you will have to remember that this is the face of someone who is missing all the goodness of libcontainer!

Resizing Docker containers with the Device Mapper plugin

2014-01-29T00:00:00+00:00

If you’re using Docker on CentOS, RHEL, Fedora, or any other distro that doesn’t ship by default with AUFS support, you are probably using the Device Mapper storage plugin. By default, this plugin will store all your containers in a 100 GB sparse file, and each container will be limited to 10 GB. This article will explain how you can change that limit, and move container storage to a dedicated partition or LVM volume.

Warning

At some point, Docker storage driver internals have changed significantly, and the technique described here doesn’t work anymore. If you want to change the filesystem size for Docker containers using the Device Mapper storage driver, you should use the --storage-opt flag of the Docker Engine.

You can find abundant documentation for the --storage-opt flag in the Docker Engine reference documentation.

The rest of this article has been left for historical purposes, but take it with a grain of salt. The downside of fast-changing, rapidly-evolving software projects is that nothing is ever cast in stone! :-)

How it works

To really understand what we’re going to do, let’s look at how the Device Mapper plugin works.

It is based on the Device Mapper “thin target”. It’s actually a snapshot target, but it is called “thin” because it allows thin provisioning. Thin provisioning means that you have a (hopefully big) pool of available storage blocks, and you create block devices (virtual disks, if you will) of arbitrary size from that pool; but the blocks will be marked as used (or “taken” from the pool) only when you actually write to them.

This means that you can oversubscribe the pool; e.g. create thousands of 10 GB volumes with a 100 GB pool, or even a 100 TB volume on a 1 GB pool. As long as you don’t actually write more blocks than you actually have in the pool, everything will be fine.

Additionally, the thin target is able to perform snapshots. It means that at any time, you can create a shallow copy of an existing volume. From a user point of view, it’s exactly as if you now had two identical volumes, that can be changed independently. As if you had made a full copy, except that it was instantaneous (even for large volumes), and they don’t use twice the storage. Additional storage is used only when changes are made in one of the volumes. Then the thin target allocates new blocks from the storage pool.

Under the hood, the “thin target” actually uses two storage devices: a (large) one for the pool itself, and a smaller one to hold metadata. This metadata contains information about volumes, snapshots, and the mapping between the blocks of each volume or snapshot, and the blocks in the storage pool.

When Docker uses the Device Mapper storage plugin, it will create two files (if they don’t already exist) in /var/lib/docker/devicemapper/devicemapper/data and /var/lib/docker/devicemapper/devicemapper/metadata to hold respectively the storage pool and associated metadata. This is very convenient, because no specific setup is required on your side (you don’t need an extra partition to store Docker containers, or to set up LVM or anything like that). However, it has two drawbacks:

the storage pool will have a default size of 100 GB;
it will be backed by a sparse file, which is great from a disk usage point of view (because just like volumes in the thin pool, it starts small, and actually uses disk blocks only when it gets written to) but less great from a performance point of view, because the VFS layer adds some overhead, especially for the “first write” scenario.

Before checking how to resize a container, we will see how to make more room in that pool.

We need a bigger pool

Warning: the following will delete all your containers and all your images. Make sure that you back up any precious data!

Remember what we said above: Docker will create the data and metadata files if they don’t exist. So the solution is pretty simple: just create the files for Docker, before starting it!

Stop the Docker daemon, because we are going to reset the storage plugin, and if we remove files while it is running, Bad Things Will Happen©.
Wipe out /var/lib/docker. Warning: as mentioned above, this will delete all your containers and all your images.
Create the storage directory: mkdir -p /var/lib/docker/devicemapper/devicemapper.
Create your pool: dd if=/dev/zero of=/var/lib/docker/devicemapper/devicemapper/data bs=1G count=0 seek=250 will create a sparse file of 250G. If you specify bs=1G count=250 (without the seek option) then it will create a normal file (instead of a sparse file).
Restart the Docker daemon. Note: by default, if you have AUFS support, Docker will use it; so if you want to enforce the use of the Device Mapper plugin, you should add -s devicemapper to the command-line flags of the daemon.
Check with docker info that Data Space Total reflects the correct amount.
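
If the count=0 seek=N trick looks like black magic, you can try it on a small scale first. The sketch below works in a scratch directory created with mktemp (nothing under /var/lib/docker is touched) and uses a 16 MiB file instead of 250 GB; the variable names are mine. It assumes a dd that truncates the output file at the seek offset (GNU dd does), and a filesystem that supports sparse files.

```shell
#!/bin/sh
# Scratch directory, so that the real Docker storage is left alone.
SCRATCH=$(mktemp -d)
POOL="$SCRATCH/data"

# count=0 copies no blocks at all; seek=16 places the end of the file
# 16 blocks of 1 MiB further, so we get a 16 MiB file without any data in it.
dd if=/dev/zero of="$POOL" bs=1048576 count=0 seek=16 2>/dev/null

APPARENT_BYTES=$(wc -c < "$POOL" | tr -d ' ')   # size seen by applications
USED_KB=$(du -k "$POOL" | cut -f1)              # space really allocated on disk

echo "apparent size: $APPARENT_BYTES bytes"
echo "disk usage:    $USED_KB KiB"

rm -rf "$SCRATCH"
```

The apparent size is the full 16 MiB, while the disk usage should be close to zero. That gap is exactly what makes the default Docker pool cheap to create, and it shrinks as containers write to the pool.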

We need a faster pool

Warning: the following will also delete all your containers and images. Make sure you pull your important images to a registry, and save any important data you might have in your containers.

An easy way to get a faster pool is to use a real device instead of a file-backed loop device. The procedure is almost the same. Assuming that you have a completely empty hard disk, /dev/sdb, and that you want to use it entirely for container storage, you can do this:

Stop the Docker daemon.
Wipe out /var/lib/docker. (That should sound familiar, right?)
Create the storage directory: mkdir -p /var/lib/docker/devicemapper/devicemapper.
Create a data symbolic link in that directory, pointing to the device: ln -s /dev/sdb /var/lib/docker/devicemapper/devicemapper/data.
Restart Docker.
Check with docker info that the Data Space Total value is correct.

Using RAID and LVM

If you want to consolidate multiple similar disks, you can use software RAID10. You will end up with a /dev/mdX device, and will link to that. Another very good option is to turn your disks (or RAID arrays) into LVM Physical Volumes, and then create two Logical Volumes, one for data, another for metadata. I don’t have specific advice regarding the optimal size of the metadata pool; it looks like 1% of the data pool would be a good idea.
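
To put numbers on that rule of thumb, here is a tiny sketch. The function name metadata_size_mib, the volume group name vg0, and the logical volume names are all invented for this example, and the 1% figure is only the guess mentioned above. The lvcreate commands are printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: metadata size in MiB for a data pool given in GiB,
# using the "roughly 1% of the data pool" guess from the text.
metadata_size_mib() {
    echo $(( $1 * 1024 / 100 ))
}

DATA_GIB=250
META_MIB=$(metadata_size_mib "$DATA_GIB")

# Dry run: print the commands instead of running them.
# "vg0", "data" and "metadata" are placeholder names.
echo "lvcreate -L ${DATA_GIB}G -n data vg0"
echo "lvcreate -L ${META_MIB}M -n metadata vg0"
```

For a 250 GB data volume, this suggests a metadata volume of 2560 MB.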

Just like above, stop Docker, wipe out its data directory, then create symbolic links to the devices in /dev/mapper, and restart Docker.

If you need to learn more about LVM, check the LVM howto.

Growing containers

By default, if you use the Device Mapper storage plugin, all images and containers are created from an initial filesystem of 10 GB. Let’s see how to get a bigger filesystem for a given container.

First, let’s create our container from the Ubuntu image. We don’t need to run anything in this container; we just need it (or rather, its associated filesystem) to exist. For demonstration purposes, we will run df in this container, to see the size of its root filesystem.

$ docker run -d ubuntu df -h /
4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603

We now have to run some commands as root, because we will be affecting the volumes managed by the Device Mapper. In the instructions below, all the commands denoted with # have to run as root. You can run the other commands (starting with the $ prompt) as your regular user, as long as it can access the Docker socket, of course.

Let’s look into /dev/mapper; there should be a symbolic link corresponding to this container’s filesystem. It will be prefixed with docker-X:Y-Z-:

# ls -l /dev/mapper/docker-*-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603
lrwxrwxrwx 1 root root 7 Jan 31 21:04 /dev/mapper/docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603 -> ../dm-8

Note that full name; we will need it. First, let’s have a look at the current table for this volume:

# dmsetup table docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603
0 20971520 thin 254:0 7

The second number is the size of the device, in 512-byte sectors. The value above corresponds to 10 GB (20971520 × 512 bytes).

Let’s compute how many sectors we need for a 42 GB volume:

$ echo $((42*1024*1024*1024/512))
88080384

An amazing feature of the thin snapshot target is that it doesn’t limit the size of the volumes. When you create it, a thin provisioned volume uses zero blocks, and as you start writing to those blocks, they are allocated from the common block pool. But you can start writing block 0, or block 1 billion: it doesn’t matter to the thin snapshot target. The only thing determining the size of the filesystem is the Device Mapper table.

Confused? Don’t worry. The TL;DR is: we just need to load a new table, which will be exactly the same as before, but with more sectors. Nothing else.

The old table was 0 20971520 thin 254:0 7. We will change the second number, and be extremely careful about leaving everything else exactly as it is. Your volume will probably not be 7, so use the right values!

So let’s do this:

# echo 0 88080384 thin 254:0 7 | dmsetup load docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603

Now, if we check the table again, it will still be the same, because the new table has to be activated first, with the following command:

# dmsetup resume docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603

After that command, check the table one more time, and it will have the new number of sectors.

We have resized the block device, but we still need to resize the filesystem. This is done with resize2fs:

# resize2fs /dev/mapper/docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603
resize2fs 1.42.5 (29-Jul-2012)
Filesystem at /dev/mapper/docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603 is mounted on /var/lib/docker/devicemapper/mnt/4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 3
The filesystem on /dev/mapper/docker-0:37-1471009-4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603 is now 11010048 blocks long.
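
A quick sanity check on that last line: resize2fs reports the size in filesystem blocks, not in sectors. Assuming 4 KiB blocks (the usual ext4 choice for a filesystem of this size; the output above doesn't state it), the arithmetic gives back exactly the device size that we loaded in the table:

```shell
#!/bin/sh
BLOCKS=11010048      # from the resize2fs output above
BLOCK_SIZE=4096      # assumption: 4 KiB filesystem blocks
SECTORS=88080384     # size loaded in the Device Mapper table

FS_BYTES=$(( BLOCKS * BLOCK_SIZE ))
DEV_BYTES=$(( SECTORS * 512 ))

echo "filesystem: $FS_BYTES bytes"
echo "device:     $DEV_BYTES bytes"
echo "size in GB: $(( FS_BYTES / 1024 / 1024 / 1024 ))"
```

Both byte counts are identical, and they correspond to 42 GB: the filesystem now fills the whole resized device.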

As an optional step, we will restart the container, to check that we indeed have the right amount of free space:

$ docker start 4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603
$ docker logs 4ab0bdde0a0dd663d35993e401055ee0a66c63892ba960680b3386938bda3603
df: Warning: cannot read table of mounted file systems: No such file or directory
Filesystem      Size  Used Avail Use% Mounted on
-               9.8G  164M  9.1G   2% /
df: Warning: cannot read table of mounted file systems: No such file or directory
Filesystem      Size  Used Avail Use% Mounted on
-                42G  172M   40G   1% /

Want to automate the whole process? Sure:

CID=$(docker run -d ubuntu df -h /)
DEV=$(basename $(echo /dev/mapper/docker-*-$CID))
dmsetup table $DEV | sed "s/0 [0-9]* thin/0 $((42*1024*1024*1024/512)) thin/" | dmsetup load $DEV
dmsetup resume $DEV
resize2fs /dev/mapper/$DEV
docker start $CID
docker logs $CID
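
The sed expression in that script rewrites the table with a pattern match. If you prefer something a bit more defensive, here is a sketch that changes only the second field, refuses anything that doesn't look like a single thin table, and refuses to shrink the volume. The function name new_thin_table is invented for this example. It only prints the new table, so it can be tried without root.

```shell
#!/bin/sh
# Hypothetical helper: print a thin table with a new size.
# $1 = current table (output of "dmsetup table DEV"), $2 = new size in GB.
new_thin_table() {
    new_sectors=$(( $2 * 1024 * 1024 * 1024 / 512 ))
    # Split the current table into its fields: start, length, target, pool, id.
    set -- $1
    [ $# -eq 5 ] && [ "$1" = "0" ] && [ "$3" = "thin" ] || return 1
    # Growing only: shrinking the device under a live filesystem loses data.
    [ "$new_sectors" -ge "$2" ] || return 1
    echo "$1 $new_sectors $3 $4 $5"
}

new_thin_table "0 20971520 thin 254:0 7" 42
# prints: 0 88080384 thin 254:0 7

# Possible use as root, with DEV set as in the script above:
#   TABLE=$(new_thin_table "$(dmsetup table "$DEV")" 42) &&
#       echo "$TABLE" | dmsetup load "$DEV"
```

Since the function returns a non-zero status when it doesn't recognize the table, the dmsetup load in the usage comment is skipped instead of being fed garbage.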

Growing images

Unfortunately, the current version of Docker won’t let you grow an image as easily. You can grow the block device associated with an image, then create a new container from it, but the new container won’t have the right size.

Likewise, if you commit a large container, the resulting image won’t be bigger (this is due to the way that Docker will prepare the filesystem for this image).

It means that currently, if a container is really more than 10 GB, you won’t be able to commit it correctly without additional tricks.

Conclusions

Docker will certainly expose nicer ways to grow containers, because the code changes required are very small. Managing the thin pool and its associated metadata is a bit more complex (since there are many different scenarios involved, and a potential data migration, that we did not cover here, since we wiped out everything when building the new pool), but the solutions that we described will take you pretty far already.

As usual, if you have further questions or comments, don’t hesitate to ping me on IRC (jpetazzo on Freenode) or Twitter (@jpetazzo)!

Network booting machines with a PXE server running in a Docker container

2013-12-07T00:00:00+00:00

When you want to install a new machine, or boot in rescue mode, the usual method is to boot from a CD or USB stick. But virtually all modern computers with an Ethernet interface can also boot from the network. Here is how to set up a boot server super easily, by running it in a Docker container.

Netboot 101

On Intel machines (32 or 64 bits), the network boot mechanism is called PXE. It uses multiple protocols:

DHCP lets the booting machine discover the IP address it should use, and retrieve some basic parameters, like DNS, gateway, address of server to boot from;
TFTP is used to download the code to execute – typically a loader which, in turn, can fetch a Linux kernel and initrd.

PXE can be used in many scenarios; but we can simplify and consider two cases.

You can’t (or don’t want to) use a CD/USB media to install/reinstall the machine. If you are managing thousands of machines, you don’t want to haul around a stack of bootable CDs or USB sticks; and you don’t want to have to find one to reinstall a single machine, either. My university was (and probably still is) using PXE to painlessly deploy Linux and Windows on thousands of machines. It took maybe 10 minutes for a single person to install, reinstall, or upgrade a lab of 25 machines.
You want to run totally diskless. In that case, after booting, Linux will typically switch to a NFS root filesystem. Since it is possible to operate from a read-only NFS root, it means that you can boot hundreds of machines from a single PXE+NFS server. Installing or upgrading packages is extremely easy (and doesn’t require a reboot of the diskless machines!). Your workstations can be more reliable (since hard disks are often the number one failure cause), run faster (since a gigabit Ethernet network will have faster throughput and lower latency than a typical spinning disk), use less power, and be more silent (since you can spin down the hard disk, or remove it altogether).

Here, we will show how to build a PXE server (DHCP+TFTP) to boot a machine to the Debian install system. It lets you install Debian entirely from the network. You can of course tailor it to your needs (and I hope someone will submit interesting pull requests to make that happen).

Why run that in a Docker container?

As you will see, setting up a PXE server is not hard. It used to be more complicated, but dnsmasq simplified the whole thing immensely, since it combines a DNS, DHCP, and TFTP server, and can be configured entirely from the command-line.

So why bother using a container for that?

The setup is not hard, but it’s still a bit of work (especially if you’re not familiar with those protocols). PXE used to be very picky (some machines would require random magic DHCP options to be there, or they would just ignore the DHCP server). It got better with the years, but still, it’s great to have something that is known to work, rather than re-installing the environment each time, and wondering if it doesn’t work because you’re missing some magic option that you forgot to write down. (Happened to me countless times, until I froze my whole boot server in a chroot!)
PXE uses DHCP, and running a DHCP server can be disruptive. Almost all networks use DHCP for automatic IP address allocation and configuration now; so if you run a DHCP server on your machine, you will probably disrupt the local network (and get in trouble with the local network administrator, unless you’re the local network administrator; then you will be the one troubleshooting weird issues with machines suddenly misbehaving because they were “hooked” by your new DHCP server). It would be a good idea to have an easy way to start and stop the boot server. A VM would be great, but VMs are so 2000; this is the 2010s, so let’s containerize all the things!
Because we can! :-)

Pre-requirements

Of course, you need to have Docker on your machine. Install it already!

Then, you will need pipework. Just download it from the repository; it’s a simple shell script.

Running the boot server

Two steps:

PXECID=$(docker run -d jpetazzo/pxe)
pipework br0 $PXECID 192.168.242.1/24

Now, the PXE server is booting anything connected to the br0 bridge; but usually, nothing is connected to that bridge. So, assuming that eth0 is your Ethernet interface, just do brctl addif br0 eth0 – and that’s it! Now you can boot PXE machines on the network connected to eth0!

Alternatively, you can put VMs on br0 and achieve the same result.

When you want to stop the boot server, just do docker kill $PXECID.

How did you build the container?

With a Dockerfile, of course. Let’s look at this Dockerfile.

First, we’ll use Debian (because I love Debian).

FROM stackbrew/debian:jessie

Then, we declare some environment variables. If you want to netboot 32-bit machines, you can change ARCH; if you want to install the jessie distribution instead, update DIST. And of course you can update the mirror if you want.

ENV ARCH amd64
ENV DIST wheezy
ENV MIRROR http://ftp.nl.debian.org

Now we install the required packages. Dnsmasq is the DNS+DHCP+TFTP server. We will need wget a bit later; and iptables will be used to give network access to the netbooted machines.

RUN apt-get -q update
RUN apt-get -qy install dnsmasq wget iptables

We install pipework. Pipework is used in the container for one trivial thing: waiting until eth1 becomes available. eth1 will appear “automagically” when we run pipework in the host, after starting the container.

RUN wget --no-check-certificate https://raw.github.com/jpetazzo/pipework/master/pipework
RUN chmod +x pipework

Download the Linux kernel, ramdisk, and PXE boot loader from the Debian mirror. The WORKDIR instruction means that all further lines will be executed in /tftp.

RUN mkdir /tftp
WORKDIR /tftp
RUN wget $MIRROR/debian/dists/$DIST/main/installer-$ARCH/current/images/netboot/debian-installer/$ARCH/linux
RUN wget $MIRROR/debian/dists/$DIST/main/installer-$ARCH/current/images/netboot/debian-installer/$ARCH/initrd.gz
RUN wget $MIRROR/debian/dists/$DIST/main/installer-$ARCH/current/images/netboot/debian-installer/$ARCH/pxelinux.0

Then we generate a minimal boot configuration. This works like this:

the DHCP server will tell the netbooted machines “hey, you should first execute pxelinux.0!”
pxelinux.0 is a boot loader which, in turn, will try to load a configuration file from pxelinux.cfg/XXX; it will try multiple different files in that directory, and will eventually try pxelinux.cfg/default
this file tells the boot loader “get the files called linux and initrd.gz and use them respectively as a kernel and initial ramdisk, then boot!”

RUN mkdir pxelinux.cfg
RUN printf "DEFAULT linux\nKERNEL linux\nAPPEND initrd=initrd.gz\n" >pxelinux.cfg/default
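
If you wonder which “multiple different files” pxelinux.0 tries before falling back to default, here is a sketch that lists them for a given client. The function name pxelinux_candidates and the sample MAC address are mine. It follows the lookup order documented for PXELINUX (hardware address first, then the IP address in hexadecimal, one digit shorter at each attempt) but skips the optional lookup by client UUID.

```shell
#!/bin/sh
# Hypothetical helper: list the configuration files that pxelinux.0 asks for,
# in order, for a client with the given MAC and IPv4 addresses.
pxelinux_candidates() {
    mac=$1
    ip=$2
    # Hardware type (01 means Ethernet), then the MAC address in lower case
    # with dashes instead of colons.
    echo "pxelinux.cfg/01-$(echo "$mac" | tr 'A-F:' 'a-f-')"
    # The IPv4 address in upper case hexadecimal, then the same string with
    # one more trailing digit removed at each attempt.
    hex=$(printf '%02X%02X%02X%02X' $(echo "$ip" | tr '.' ' '))
    while [ -n "$hex" ]; do
        echo "pxelinux.cfg/$hex"
        hex=${hex%?}
    done
    # Last resort, and the only file that our Dockerfile creates.
    echo "pxelinux.cfg/default"
}

pxelinux_candidates 88:99:AA:BB:CC:DD 192.168.242.2
```

For that client, the list starts with pxelinux.cfg/01-88-99-aa-bb-cc-dd, continues with pxelinux.cfg/C0A8F202, pxelinux.cfg/C0A8F20, and so on, and ends with pxelinux.cfg/default. Dropping a file with one of the earlier names in /tftp/pxelinux.cfg would be one way to give a specific machine (or subnet) its own boot configuration.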

Last but not least, we define the command that should run within the container. This one is big! We could have used a script; but since it’s not that big, we decided to use line continuations instead.

This command will enable network connection sharing, it will wait for the pipework-provided network interface to come up, then it will start dnsmasq. Dnsmasq really does all the work!

CMD \
    echo Setting up iptables... &&\
    iptables -t nat -A POSTROUTING -j MASQUERADE &&\
    echo Waiting for pipework to give us the eth1 interface... &&\
    /pipework --wait &&\
    echo Starting DHCP+TFTP server...&&\
    dnsmasq --interface=eth1 \
            --dhcp-range=192.168.242.2,192.168.242.99,255.255.255.0,1h \
            --dhcp-boot=pxelinux.0,pxeserver,192.168.242.1 \
            --pxe-service=x86PC,"Install Linux",pxelinux \
            --enable-tftp --tftp-root=/tftp/ --no-daemon
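
The 192.168.242 prefix shows up three times in that command, plus once more in the pipework call on the host. If that network clashes with one of yours, all four have to change together. The sketch below makes the dependency explicit; dnsmasq_command is a made-up name, the 10.42.0 prefix is an arbitrary example, and the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the dnsmasq command line for a given /24 prefix.
# Calling it with "192.168.242" reproduces the values used in the Dockerfile.
dnsmasq_command() {
    prefix=$1
    echo "dnsmasq --interface=eth1" \
         "--dhcp-range=$prefix.2,$prefix.99,255.255.255.0,1h" \
         "--dhcp-boot=pxelinux.0,pxeserver,$prefix.1" \
         "--pxe-service=x86PC,\"Install Linux\",pxelinux" \
         "--enable-tftp --tftp-root=/tftp/ --no-daemon"
}

dnsmasq_command 10.42.0
```

With that prefix, the matching command on the host would be pipework br0 $PXECID 10.42.0.1/24, since the address given to the container has to be the one advertised in --dhcp-boot.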

How do I boot something else?

I hope that this container can be used as a base for more complex stuff. If you extend it to add menus and other things, don’t hesitate to submit pull requests. It would be awesome to have a bigger, more universal, PXE boot server!

What’s the deal with the hard-coded 192.168.242…?

I had two options when writing this: using pipework, or not using pipework. First, let’s see what it means to not use pipework.

If we don’t use pipework, we need to expose the UDP ports used by DHCP and TFTP. Then, since the goal is to boot machines sitting on the “real” network (i.e. not the Docker internal network), we need to probe that network to figure out the address of the default gateway, of the DNS server, and a range of available addresses to use for DHCP. Once we have that information, we can use it to start dnsmasq.

I think that this would have been much more complicated to get right. In some scenarios it would have been completely impossible. So I decided to use pipework instead, and use an arbitrary network.

Acknowledgements…

Thanks to Tianon, who suggested that this might be possible. Your feedback (and your contributions to Docker in general) are awesome!

Efficient management of Python project dependencies with Docker

2013-12-01T00:00:00+00:00

There are many ways to handle Python app dependencies with Docker. Here is an overview of the most common ones – with a twist.

In our examples, we will make the following assumptions:

you want to write a Dockerfile for a Python app;
the code is directly at the top of the repo (i.e. there’s a setup.py file at the root of the repo);
your app requires Flask (and possibly other dependencies).

Using your distro’s packages

This is the easiest method, but it has some pretty strict requirements.

The Python dependencies that you need must be packaged by your distro. (Obviously!)
Almost as obvious, but a bit more tricky: your distro has to carry the specific version that you need. You want Django 1.6 but your distro only has 1.5? Too bad!
You must be able to map the Python package name to the distro package name. Again, that sounds really obvious, and it’s not a big deal if you are familiar with your distro. For instance, on Debian/Ubuntu, in most cases, Python package xxx will be packaged as python-xxx. But if you have to deal with a complex Python app with a large-ish requirements.txt file, things might be more tedious.
If you run multiple apps in the same environment, their requirements must not conflict with each other. For instance, if you install (on the same machine) a CMS system and a ticket tracking system both depending on different versions of Django, you’re in trouble.

The most common answer to those constraints is “just use virtualenv instead!”, and this is the generally accepted strategy. However, before ditching distro packages, let’s remember two key things!

If we’re using Docker, most of those problems go away (just like when using virtualenv), because you can use different containers for different apps (and get rid of version conflicts). Also, if you need a more recent (or older) version of a package, you can use a more recent (or older) version of the distro, and a moderate amount of luck will make sure that you can find the right thing. Just look at e.g. http://packages.debian.org/ or http://packages.ubuntu.com/ to check version numbers first.
Sometimes, it happens that a specific Python dependency will be incompatible with your Python version, or some other library on your system. Example: I recently stumbled upon a version of simplejson which didn’t work with Python 3.2. This is less likely to occur with distro packages, because such problems will be caught by the packagers and the other users. Free QA!

So what does your Dockerfile look like?

# Use a specific version of Debian (because it has the exact Python for us)
FROM stackbrew/debian:jessie
RUN apt-get install -qy python3
RUN apt-get install -qy python3-flask
ADD . /myapp
WORKDIR /myapp
RUN python3 setup.py install
EXPOSE 8000
CMD myapp --port 8000

Pretty simple – especially if you don’t have too many requirements. Note how we apt-get install each package with a separate command. It creates more Docker layers, but that’s OK, and it means that if you add more dependencies later, the cache will be used. If you use a single line, each time you add a new package, everything will be downloaded and installed again.

requirements.txt

If you can’t use the packages of your distro (they don’t have that specific version that you absolutely need!), or if you are using some stuff which is just not packaged at all, here’s our “plan B”. In that situation, you will generally have a requirements.txt file, describing the dependencies of the app, pinned to specific versions. That kind of file can be generated with pip freeze, and those dependencies can then be installed with pip install -r requirements.txt.

That’s also the preferred solution when you want to use some dependencies straight from GitHub, BitBucket, or any other code repository, because pip supports that too.

Let’s see first what the Dockerfile will look like, and discuss the pros and cons of this approach.

FROM stackbrew/debian:jessie
RUN apt-get install -qy python3
RUN apt-get install -qy python3-pip
ADD . /myapp
WORKDIR /myapp
RUN pip-3.3 install -r requirements.txt
RUN pip-3.3 install .
EXPOSE 8000
CMD myapp --port 8000

While it looks similar to what we did earlier, there is actually a huge difference (apart from the fact that dependencies are no longer handled by Debian, but directly by pip). Dependencies are now installed after the ADD command. This is a big deal because as of Docker 0.7.1, the ADD command is not cached, which means that all subsequent commands are not cached, either. So each time you build this Dockerfile, you end up re-installing all the dependencies, which could take some time.

This is a significant drawback, because development is now significantly slower, since each build can take minutes instead of seconds.

So how do we solve that problem? Well, let’s see!

Two Dockerfiles

A common workaround to the ADD issue is to use two Dockerfiles. The first one installs your dependencies, the second one installs your code. They will look like this:

FROM stackbrew/debian:jessie
RUN apt-get install -qy python3
RUN apt-get install -qy python3-pip
ADD requirements.txt /
RUN pip-3.3 install -r requirements.txt

This first Dockerfile should be built with a specific name, e.g. by running docker build -t myapp . in its directory. Then, the second Dockerfile reuses it:

FROM myapp
ADD . /myapp
WORKDIR /myapp
RUN pip-3.3 install .
EXPOSE 8000
CMD myapp --port 8000

Now, code modifications won’t cause all dependencies to be re-installed. However, if you change dependencies, you have to manually rebuild the first image, then the second.

This workaround is good, but has two drawbacks.

You have to remember to rebuild the first image when you update dependencies. That sounds obvious and easy, but what happens if someone else updates requirements.txt, and then you pull their changes from git? Are you sure that you will notice the change? Maybe you should set up a git hook to remind you?
Workflows like Trusted Builds get more complicated as well. It’s still possible to get full automation, though. You can put the first Dockerfile (and the requirements file) in a subdirectory of the repository, and create a first Trusted Build for e.g. username/myappbase, pointing at that subdirectory. Then create a second Trusted Build, e.g. username/myapp, pointing at the root directory, and using FROM username/myappbase.

I appreciate the convenience of being able to use two Dockerfiles, but at the same time, I believe that it makes the build process more complicated and error-prone.

So let’s see what else we could do!

One-by-one pip install

We are in a kind of catch-22: we want to pip install -r requirements.txt, but if we ADD requirements.txt we break caching, and we want caching.

What would MacGyver do?

Instead of installing from requirements.txt, let’s install each package manually, with pip, with different RUN commands. That way, those commands can be properly cached. See the following Dockerfile:

FROM stackbrew/debian:jessie
RUN apt-get install -qy python3
RUN apt-get install -qy python3-pip
RUN pip-3.3 install Flask
RUN pip-3.3 install some-other-dependency
ADD . /myapp
WORKDIR /myapp
RUN pip-3.3 install .
EXPOSE 8000
CMD myapp --port 8000

Now we won’t reinstall dependencies each time we rebuild. Great. However, our dependencies are now duplicated in two places: in requirements.txt, and in Dockerfile. It’s not the end of the world, but if you update one of them without the other, confusion will ensue.

So this solution is nice from a build time and tooling perspective, but it doesn’t abide by “DRY” principles (Don’t Repeat Yourself), which is another way to say that it can be subtly error-prone as well.
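If you go that route, the duplication can at least be checked mechanically. Here is a rough sketch that compares the package names in requirements.txt with the ones installed by RUN pip lines in the Dockerfile. The parsing is deliberately naive (it only lower-cases names and ignores version specifiers), and the function names are invented for this example.

```python
import re

# Characters that end a package name in a requirement specifier.
NAME_END = r"[<>=!~\[; ]"


def requirement_names(requirements_text):
    """Extract bare package names from the content of requirements.txt."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and options such as -r or -e
        names.add(re.split(NAME_END, line, maxsplit=1)[0].lower())
    return names


def dockerfile_pip_names(dockerfile_text):
    """Extract package names from 'RUN pip... install <name>' lines."""
    names = set()
    pattern = re.compile(r"^RUN\s+pip\S*\s+install\s+(.+)$", re.IGNORECASE)
    for line in dockerfile_text.splitlines():
        match = pattern.match(line.strip())
        if not match:
            continue
        for word in match.group(1).split():
            if word.startswith("-") or word == "." or word.endswith(".txt"):
                continue  # flags, the local project, or a requirements file
            names.add(re.split(NAME_END, word, maxsplit=1)[0].lower())
    return names


def drift(requirements_text, dockerfile_text):
    """Return (missing from Dockerfile, missing from requirements.txt)."""
    reqs = requirement_names(requirements_text)
    docker = dockerfile_pip_names(dockerfile_text)
    return sorted(reqs - docker), sorted(docker - reqs)
```

Running drift on the two files in a test or a pre-commit hook is enough to notice when one of them was updated without the other.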

Combo

I’m therefore suggesting to mix two of the previous solutions to solve the issue! Really, the idea is to install dependencies twice. Or rather, to install them the first time with RUN statements (which get cached), and execute pip install -r requirements.txt after the ADD. The latter won’t get cached, but pip is nice, and it won’t reinstall things that are already installed.

That way, you leverage the caching system of the Docker builder, but at the same time, if you update requirements.txt without updating Dockerfile, the pip install command will patch up your image anyway, by upgrading your dependencies to the right version. The build will just be slower until you update the Dockerfile, but that’s it.

The Dockerfile will look like this:

FROM stackbrew/debian:jessie
RUN apt-get install -qy python3
RUN apt-get install -qy python3-pip
RUN pip-3.3 install Flask
RUN pip-3.3 install some-other-dependency
ADD . /myapp
WORKDIR /myapp
RUN pip-3.3 install -r requirements.txt
RUN pip-3.3 install .
EXPOSE 8000
CMD myapp --port 8000

Virtualenv

If you followed carefully, you noticed that we mentioned virtualenv in the beginning of this post, but we haven’t used it so far. Is virtualenv useful with Docker? It depends!

On a regular machine (be it your local development machine or a deployment server), you will have multiple Python apps. If they rely only on Python dependencies that happen to be packaged by your distro, great. Otherwise, virtualenv will come to the rescue; either as a sidekick to your distro’s packages (by complementing them) or as a total replacement (if you create the virtualenv with --no-site-packages).

With Docker, you will generally deploy one single app per container; so why use virtualenv? It might still be useful to avert conflicts between Python libs installed as distro packages, and libs installed with pip. This is not very likely for simple projects, but if you have a bigger codebase with many dependencies, and also install distro packages bringing their own Python dependencies with them, it could happen.

Other points of view

There is no right or wrong solution for that matter. Depending on the size of your project, on the number of dependencies, and how they interact with your distro, one method can be better than another.

On that topic, I suggest that you read Nick Stinemates’ blog post about running Python apps with Docker, or Paul Tagliamonte’s blog post about the respective merits of apt and pip.

Unveiling Flynn, a new PAAS based on Docker

2013-11-17T00:00:00+00:00

Earlier this month, I attended the first Flynn meet-up in San Francisco, where the project was presented by its authors. Here’s what I have to say about it.

Important reminder: this post hasn’t been sponsored, endorsed, approved, or anything, neither by my employer (Docker Inc.) nor by the Flynn team. All opinions expressed here are my own.

Flynn? Docker? What?

Docker is an Open Source runtime for Linux Containers. It was released in March 2013 by Docker Inc. (my employer), and since then, many projects have been based on (or integrated with) it.

Linux Containers being a very good component for Platform-as-a-Service systems, multiple Open Source PAAS were started on top of Docker. Deis is one of them; Flynn is another. Until very recently, there weren’t a lot of technical details available about Flynn; so I wanted to know more – and the first Flynn meet-up in San Francisco, hosted by Twilio, was the best place to get that information!

For more information about who develops Flynn, how it’s funded, etc., just refer to the project website. I’ll try to cover only technical and architectural topics here.

Flynn technical overview

… or, at least, as I understand it. You’ve been warned :-)

General architecture

A Flynn cluster (or grid) is composed of multiple Docker hosts. Each Docker host will run a number of Docker containers, each holding a “service appliance”. A service appliance is a basic function useful to the whole cluster.

There will be service appliances to deal with scheduling, load balancing, code builds, code execution, etc.

Each service appliance can be deployed (and possibly scaled) individually.

Layer 0 / Layer 1

Flynn is broken down into two layers. Layer 0 provides basic services: host management, scheduling framework, and service discovery. This can be used standalone; for instance if you need something to manage a cluster of Docker machines, without the whole PAAS business on top of it. Layer 0 currently uses Go RPC to communicate (but this will be replaced by a cross-platform RPC system later).

Layer 1 contains everything else that is needed to implement the PAAS itself. PAAS-specific concepts (applications, builds…) are implemented in Layer 1, and don’t exist in Layer 0.

The Grid: the 4 large squares are hosts, the small squares are service appliances. L, for instance, is Lorne, the host management service.

Bestiary of Service Appliances

Here are some of the service appliances. The first two implement “Layer 0”, and everything else is “Layer 1”.

Lorne is the host service. There will be one instance of it on each host in the cluster. It interfaces with Docker. If I understand correctly, it’s an adapter between Flynn discovery/naming/etc. facilities and Docker.

Sampi is the scheduling service. “Scheduling” here means “given the current state of the grid, and the current resource allocation on each node, where should I run this new job, which needs such and such resources?”. To quote the authors: “this does a job similar to Mesos, but for 1000x less lines of code”. To be more accurate, Sampi itself doesn’t do any scheduling; but it presents a consistent view of the cluster (and resource usage) to the actual schedulers, and serializes transactions. In other words, it prevents two concurrent schedulers (or two concurrent operations by the same scheduler) from putting the cluster in a state where resource constraints wouldn’t be satisfied anymore. This is inspired by the Google Omega paper. The real schedulers are implemented on top of Sampi; there are currently two tiny schedulers implemented in the controller API (to support basic scaling and one-off jobs), and something more robust will be added later.

There is a git frontend. It’s a generic SSH server, able to accept git-over-SSH connections, receive git pushes, and then ship them to other parts of the grid. Given that Flynn author Jeff Lindsay is also the author of Dokku and gitreceive, that part should work very well.

The controller exposes the API used to control the whole thing.

The router is a HTTP and TCP load-balancer for inbound traffic. For HA purposes, there should obviously be multiple instances of that guy across the whole grid. As far as I understand, this appliance works closely with the service discovery mechanism – which is expected, since it has to track the location of backends across the cluster as services are created, scaled, and destroyed.

There is also a slug builder and a slug runner. I’m less familiar with Heroku’s funked up terminology, but I expect that the slug builder will take some code (previously received and stored by the git frontend), build it (remember that in the case of Python, Ruby, and other interpreted languages, “build” often means “install dependencies expressed by pip, setuptools, Gemfile, etc.”), and store it as a “slug”. Then the slug runner will somehow instantiate one or multiple containers (depending on scaling parameters) to execute the code with its dependencies.
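To make the Sampi description above more concrete, here is a toy model of "present a consistent view, and serialize transactions". It is in no way Flynn's code (Sampi is written in Go, and I haven't read it closely); the ClusterState class, the first_fit scheduler, and the "free memory per host" resource model are all assumptions made for illustration.

```python
import threading


class ClusterState:
    """Toy shared view of free memory per host, with serialized commits."""

    def __init__(self, free_memory_by_host):
        self._free = dict(free_memory_by_host)
        self._lock = threading.Lock()

    def snapshot(self):
        """Give a scheduler a consistent copy of the current state."""
        with self._lock:
            return dict(self._free)

    def commit(self, host, memory_needed):
        """Apply a placement decided by a scheduler, or reject it.

        The check and the update happen under one lock, so two schedulers
        working from stale snapshots cannot overcommit the same host.
        """
        with self._lock:
            if self._free.get(host, 0) < memory_needed:
                return False
            self._free[host] -= memory_needed
            return True


def first_fit(snapshot, memory_needed):
    """A tiny scheduler: pick the first host with enough free memory."""
    for host in sorted(snapshot):
        if snapshot[host] >= memory_needed:
            return host
    return None
```

Two schedulers can each take a snapshot and both pick the same host; only the first commit succeeds, and the second scheduler has to take a fresh snapshot and try again.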

Principles of Service Appliances

The global idea is that each appliance should perform only a small, simple task, and compose nicely with others. To use the words of the authors, appliances should “focus on a single function, but be optimally minimal”.

They should do one thing, and do it well, rather than combining multiple features. That makes it possible to scale them separately, and to replace a specific component more easily. A very good example is the builder and runner duo. In the early days, the dotCloud PAAS combined both functions in a single component: compute resources were allocated across the cluster, containers were created, then the build process happened in situ; i.e. the container building an app was the same as the one running the app. This was fine for small, un-scaled apps; but it was very inefficient for apps with dozens of containers, since the build process would be replicated N times. Later, the snapshotting builder was deployed; it orchestrated the builds on separate containers, stored the build artefacts, then deployed them on the runtime. As a result, builds were faster, more reliable, and hitless upgrades of applications became possible.

Another principle in Flynn is that each service appliance should have an API. Anyone who has worked with distributed or large-scale systems will take this for granted; but still, it’s good to remember that APIs are essential to automation and orchestration. You can script API calls much more easily than you can script ttys, SSH commands, web forms, or clicks in GUIs.

Additionally, appliances should use the service discovery mechanism of the platform, so they can be discovered by other services. Using service discovery also means avoiding hard-coded API endpoints and other bad habits that will bite you when scaling or replicating an existing setup.

Appliances should also be clusterable, i.e. scalable for performance and/or reliability.

They should be self-contained – which means that they should not rely on other components when it’s not necessary. I also believe that this is very important, especially when your organization scales out, and different teams (or maybe just different developers) assume ownership and responsibility for different services. When something is down (or doesn’t behave properly), the people maintaining it should be accountable for it. If the service cannot perform as intended because it depends on another component, it should identify the issue and report it accordingly, and, if possible, degrade gracefully. Consider as an example an online shop. If it uses a 3rd party service to perform searches, an outage of that 3rd party service shouldn’t take down the whole website. Search features will be unavailable, but everything else should continue to work. Likewise, in a PAAS, an outage of the build service will prevent you from deploying new versions of your apps, but shouldn’t affect scaling, metrics, or basically the function of existing apps.
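The online shop example could look like the following sketch. Everything here is hypothetical (search_products, render_page, and the shape of the page dictionary); the point is only the pattern: report the failure of the dependency, then keep serving everything else.

```python
def search_products(query, search_backend):
    """Return (results, degraded) so the page can still render on failure."""
    try:
        return search_backend(query), False
    except Exception as error:
        # Identify and report the real cause instead of failing silently...
        print(f"search unavailable: {error}")
        # ...and fall back to an empty result rather than crashing the page.
        return [], True


def render_page(query, search_backend):
    """Build the page; only the search part depends on the 3rd party service."""
    results, degraded = search_products(query, search_backend)
    page = {"catalog": "ok", "cart": "ok", "results": results}
    if degraded:
        page["notice"] = "Search is temporarily unavailable."
    return page
```

In a real service, the print would be a proper log line or a metric, so that the people who own the search service see the problem.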

Last but not least, appliances should be pluggable. It should be possible to replace a single service with a different implementation without rewriting everything else. A typical example is the routing component. The authors of Flynn told us that it would be straightforward to replace their router with something custom based on Nginx or HAProxy if need be. This particular example rings a bell. Recently, I discussed with the team from Yandex working on the Cocaine project, which integrates with Docker, and one of their questions was “we need to handle hundreds of requests per second on this specific system, so how can we bypass the default networking model and use ours instead?”. One size doesn’t fit all: since no system will be able to cater for everyone’s needs, just make sure that you can replace it with a more suitable version!

Those principles are good not only for Flynn, not only for PAAS, but for most distributed systems out there.

Service discovery

Service discovery is a key part in any distributed system, so it deserves a section of its own.

Etcd

Etcd is a highly-available key/value store, similar to Zookeeper, except that it is based on the Raft algorithm instead of Paxos. From my (arguably limited) experience with both systems, Etcd is much easier to deploy and operate (but just see for yourself).

Flynn uses etcd as a backend for the service discovery mechanism. As said above, etcd is based on the Raft protocol, which is a strong consensus protocol. In other words, it will allow writes as long as a majority (more than 50%) of the cluster members are alive and connected.
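Raft only commits a write once a strict majority of the cluster members have acknowledged it, so the arithmetic is worth spelling out. The helper names below are mine, not etcd's.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def can_accept_writes(cluster_size, reachable_members):
    """True if enough members are alive and connected to commit a write."""
    return reachable_members >= quorum(cluster_size)


def tolerated_failures(cluster_size):
    """How many members can be lost while still accepting writes."""
    return cluster_size - quorum(cluster_size)
```

Note that a 4-node cluster with 2 nodes up is exactly at 50% and cannot accept writes, and that 4 nodes tolerate the same single failure as 3 nodes, which is why odd cluster sizes are the usual recommendation.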

Flynn also uses etcd to store configuration information for its various components.

Interestingly, etcd has been developed by the CoreOS project, which is… also based on Docker. It’s a small world we live in.

`sdutil`

There was a pattern that I found particularly interesting. Usually, interfacing with an existing service discovery system is complex, and requires extensive modifications in your code. So Flynn comes with a tool named sdutil, which can wrap any existing TCP server to plumb it with the service discovery mechanism, like this:

sdutil exec www:8080 /path/to/www/daemon --daemon-options...

This will run your daemon with specified flags, and, assuming that it runs on port 8080, it will inform the service discovery mechanism that the www service is running here. If the daemon crashes, exits, or whatever, sdutil will detect this, and unregister the service.
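The wrapper pattern itself is simple enough to sketch in a few lines. This is not how sdutil is implemented (I haven't checked its source); InMemoryRegistry is a made-up stand-in for a real discovery backend, and run_registered is a name I picked for the example.

```python
import subprocess


class InMemoryRegistry:
    """Stand-in for a real discovery backend such as etcd (hypothetical API)."""

    def __init__(self):
        self.services = {}

    def register(self, name, port):
        self.services[name] = port

    def unregister(self, name):
        self.services.pop(name, None)


def run_registered(registry, name, port, command):
    """Start the daemon, announce it, and withdraw it when it stops."""
    process = subprocess.Popen(command)
    registry.register(name, port)
    try:
        # Block until the daemon exits, whether cleanly or after a crash.
        return process.wait()
    finally:
        # Runs on normal exit, crash, or interruption of the wrapper.
        registry.unregister(name)
```

The daemon doesn't need to know anything about service discovery: the wrapper owns the registration and ties it to the lifetime of the process.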

More details available on the sdutil repository.

What’s next?

To quote the authors, “Flynn is not a PAAS yet; it is a Docker scheduler” – but it’s getting there. The roadmap is ambitious. In 2014, Flynn should acquire the following features:

log aggregation
infrastructure cloning
autoscaling and provisioning (hybrid cloud)
permissions and access control
datastore appliances

What are those “datastore appliances”? I’m getting there in the next section.

How do I run e.g. PostgreSQL, ElasticSearch…?

With a datastore appliance. (Told you!)

If I understand correctly the model proposed by Flynn, you will have to run multiple Docker containers: some of them will be data nodes (e.g. PostgreSQL servers, masters and slaves), and some of them will be manager nodes (exposing an API to manage the service).

This reminds me a little bit of the Heroku plugin model: data services are not regular Heroku (or Flynn) apps; they are implemented “on the side” and provide a service that can be consumed by apps.

Interesting parallels with the dotCloud PAAS

There are many similarities between Flynn and dotCloud. This is not very surprising, since they both implement a PAAS. Actually, many components are mapped one-to-one:

dotCloud also has a per-host container manager;
dotCloud also has a routing layer to handle load-balancing and scaling for HTTP and TCP services;
dotCloud also has a SSH endpoint to handle git, mercurial, and rsync code uploads;
dotCloud also has a component implementing a REST API to interface with the outside world;
dotCloud also has a builder (to transform source code into a ready-to-run image) and a runner (to execute those images);
the dotCloud scheduler works a bit differently, but conceptually, there is one as well.

The key differences would be in the service discovery mechanism: dotCloud doesn’t use etcd (which didn’t exist 3 years ago). It uses Riak as a data store, and relies on ZeroRPC for intercomponent communication. The use of ZeroRPC (rather than a classic REST API) allowed us to develop and deploy distributed services extremely quickly, since it made it possible to call remote code transparently, without having to abstract it with a RESTful interface. On the other hand, it also means that the code is much more entangled: when it’s cheap and convenient to call the service next door, I mean next host, you do it – and the result is a higher interdependency of the components.

From a user point of view, another key difference is the way to persist state. If you have used Heroku, you know that you cannot persist anything without relying on a 3rd party service (like S3, or, most frequently, the PostgreSQL add-on). And if you have used dotCloud, you know that conversely, each scaled instance of a service has its own local storage that you can retain across successive deployments. Flynn implements both, at different levels. Containers implementing service appliances can have persistent storage (that doesn’t get removed when the container is terminated, and can be re-used by other containers), but apps on top of Layer 1 will initially be stateless.

When building and operating the dotCloud platform at scale, we learned (the hard way) that stateful containers are much more complex to get right. When a container is stateless, you can destroy it, move it elsewhere, scale it at will. If it is idle, it can be removed, and redeployed later. When a container is stateful, you can’t do that anymore. You can stop it when idle, but you can’t destroy it – otherwise, its data is lost. Migrating it to another host means redeploying its code (which is easy) but also moving its data (which is harder, and can take a long time if there is a lot of data). It cannot be scaled as easily, since new instances won’t have the same data.

Of course, it means that each database has to be implemented through a specific service appliance. But that’s a very acceptable tradeoff, especially if service appliances are properly interoperable. The Flynn project can then bootstrap the process with some service appliances, and the community can add more. This wasn’t an option for dotCloud, where only specific parts of the PAAS were Open Source, preventing implementations of internal components by the community.

Conclusion: what did I think of it?

As a potential PAAS user, I would say that Flynn will be a serious option for people with medium to large-ish apps running on traditional PAAS like Heroku (or dotCloud, provided that all service appliances exist for all your stateful services). Just like “private cloud” made sense for people who needed the ability to spin VMs with specific constraints (location, latency, performance, cost…), “private PAAS” will make sense for people who need the same flexibility with apps.

As a devops/sysadmin operating a PAAS, I really like the whole concept and architecture. As often, the devil is in the details, but at least the overall plan makes a lot of sense, and I wouldn’t be afraid of operating a platform like that. (Then again, keep in mind that I have been part of the core team of dotCloud for 3 years, so my views on what it takes to operate a PAAS might be biased.)

As a Docker user, I’m a bit less happy, because it doesn’t look like integration with existing Docker containers will be easy. Flynn apps have to go through the slug builder and runner. Can I push an app with a Dockerfile? Run an existing container image? Conversely, how easy will it be to build a Docker container from a Flynn app, to run it standalone, without the whole platform? From what I could understand, the roadmap of Flynn is driven by the requests made by the organizations sponsoring the development of the project, and those features haven’t been mentioned a lot so far. I hope that it will evolve (or that implementing the missing parts will be easy), since it would mean that in addition to being a PAAS leveraging Docker, Flynn could be the Docker PAAS; i.e. the solution for anyone who is sold on Docker and its concept, and wants to take that to the next level.

Also, why the name?

Since Jeff described Flynn as a Grid, I believe that the project is named after this other Flynn ☺

The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships? Motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I’d never see. And then one day… I got in.

Additional reading…

Flynn dev environment (as a Vagrantfile), including video demo
Flynn blog post about demo and roadmap, including video of the first meet-up

Function pointers in IDL

2013-10-27T00:00:00+00:00

To help @EstelleDeau to refactor some code, I had a look at introspection and reflection features in IDL. It is a really weird language (especially when my primary languages are now Python and Go), but it was a fun ride.

Note: we are talking about Interactive Data Language here; not Interface description language. The former is a programming language used for data analysis by e.g. NASA; the latter is used to describe component interfaces for e.g. RPC.

Why?

Why am I doing this?

Science!

My wife @EstelleDeau is an astrophysicist, and as part of her job, she uses IDL to process heaps of data (mainly sent by the Cassini spacecraft, but also from other sources).

Why IDL? Mainly for historical reasons. When you advance in your career as a scientist in a very specialized field, you build your own toolkit to analyze data, fit it to various models, graph it in nifty ways, etc.; and often, this is a very specialized toolkit. If you’re bored or curious, have a look at the tech specs for ISS data (I mean, it’s just 171 pages). I can understand that if someone wrote that kind of code in a language, they wouldn’t want to rewrite it in another. So, here we are, with IDL.

This picture was assembled from a mosaic of smaller pictures, taken by a 1 megapixel, grayscale digital camera, moving at thousands of mph around Saturn. Told you: SCIENCE!.

Why function pointers?

It started with a simple idea: a lot of this IDL code had endless sequences of “if i EQ 1 then … else if i EQ 2 then … else …”, and I wanted to apply some classic refactoring:

put each code section in its own function,
replace the long sequence of if/then/else with an array of function pointers.
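For reference, here is what that refactoring looks like in a language that does have first-class functions. The handlers are placeholders; only the shape matters: one function per branch, and a table mapping the value of i to the function to call.

```python
def handle_one(value):
    """Formerly the body of the 'if i EQ 1' branch."""
    return value + 1


def handle_two(value):
    """Formerly the body of the 'else if i EQ 2' branch."""
    return value * 2


def handle_default(value):
    """Formerly the body of the final 'else' branch."""
    return value


# The table replaces the if/then/else chain.
HANDLERS = {1: handle_one, 2: handle_two}


def dispatch(i, value):
    """Look up the handler for i and call it, falling back to the default."""
    handler = HANDLERS.get(i, handle_default)
    return handler(value)
```

Adding a case then means writing one function and adding one entry to the table, instead of growing the chain of conditions.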

Let’s dive into IDL

The “workbench” looks like your average Eclipse-like IDE, with its load of quirks and fails. For instance, you have a button to build, and another to run; when run, it will first try to build, but if the build fails, it will run the old version. Also, the keyboard shortcuts (on a Mac) for those actions are Cmd+F8 and Cmd+Shift+F8. Since F8 requires Fn+F8, you end up pressing Cmd+Shift+Fn+F8. I’m comfortable with 7th and 9th chords on a piano so I won’t mind, but most people will probably use the mouse.

It looks like IDL is half-compiled, half-interpreted; i.e. while it requires a compilation phase, a lot of checking (and therefore, potential errors) happens at run time, which is quite surprising; as in “seriously, couldn’t you catch that at compile time?”

That being said, the online documentation of IDL is actually useful, once you know what to look for.

What does IDL look like?

Like this:

function mult,a,b
  return,a*b
end

pro main_prog
  print,mult(2,4)
end

This displays 8.

The syntax is definitely weird, but makes sense if you think about the fact that IDL was born before the 80s, and was inspired by Fortran. You will often see UPPERCASE_NAMES(LIKE_THIS) and some /FLAG_NAMES which betray its VMS heritage (where do you think MS-DOS got that crap from?).

Do we have pointers?

IDL has pointers; the equivalent of &schmoo is PTR_NEW(schmoo) (you don’t really have to use uppercase names, but it helps to get into the atmosphere). However, if you try ptr_new(mult) when mult is a function, you will have a very bad time. It works, but when you try to actually dereference the pointer, it will crash.

So, no function pointers.

When you don’t have function pointers, plan B is to evaluate arbitrary code. After looking around, I see that we have EXECUTE to do exactly that. And the doc here gets really helpful, since it mentions:

The EXECUTE function compiles and executes one or more IDL statements contained in a string at run-time. EXECUTE is limited by two factors: The need to compile the string at runtime makes EXECUTE inefficient in terms of speed. The EXECUTE function cannot be used in code that runs in the IDL Virtual Machine. Use of the EXECUTE function is not permitted when IDL is in Virtual Machine mode. The CALL_FUNCTION, CALL_METHOD, and CALL_PROCEDURE routines do not share this limitation; in many cases, uses of EXECUTE can be replaced with calls to these routines.

And, sure enough, you can do this:

result = call_function("mult", 3, 4)

We could declare an array of strings, containing the names of the functions we need to call; but let’s keep looking a bit.

Structures

IDL has an interesting data type: structures.

At first, they look like structs, or maybe hash tables:

my_struct = {x: 42, y: 60, z: -1, color: "red"}

But they are ordered as well. You can access my_struct.x with my_struct.(0).

Also, there is a kind of inheritance system:

my_new_struct = {my_struct, background: "black"}

I think it is better to think of structures as “annotated arrays”, i.e. regular arrays that come with a convenient label for each position, rather than real dictionaries. And, sure enough, there is a tag_names function that returns an ordered array of all the tags/labels/fields of your structure.

I looked for the equivalent for Python’s getattr, but it looks like it doesn’t exist; however, I found a StackOverflow answer which helped me to write it:

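function getattr,struct,attr
  ; tag_names returns upper-case names, so compare without regard to case
  tnames = tag_names(struct)
  tindex = where(strcmp(tnames, attr, /fold_case) EQ 1)
  ; where returns an array; it contains -1 when nothing matched
  if tindex[0] EQ -1 then begin
    print,"NOT FOUND: ",attr
    ;EXIT?
  endif
  return,struct.(tindex[0])
end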

I haven’t figured out yet how to raise an exception or properly halt the program. EXIT crashed the workbench (we had to exit and restart it). There was probably some magic button that we could have pressed to restore it to working condition, but we couldn’t find it :-)

Putting everything together

In this example, we have a number of functions that need to be called in order, with specific parameters.

function job_foo,a,b
  print,"doing job foo",a,b
  return,42
end

function job_bar,a,b
  print,"doing job bar",a,b
  return,105
end

pro run_all_jobs
  jobs = {$
    job_foo: {a: 42, b: 4},$
    job_bar: {a: 10, b: 5} $
  }
  jobnames = tag_names(jobs)
  for i=0,n_tags(jobs)-1 do begin
    jobname = jobnames(i)
    job = jobs.(i)
    print,"Starting job ",jobname
    r = call_function(jobname, job.a, job.b)
    print,"Result of job: ",r
  endfor
end

As shown above, statements can span over multiple lines, by using $ at the end of the line. In the 70s, backslashes were a hipster thing.

And here, we prompt the user for the specific job they need to run:

pro run_one_jobs
  jobs = {$
    job_foo: {a: 42, b: 4},$
    job_bar: {a: 10, b: 5} $
  }
  jobnames = tag_names(jobs)
  print,"Which job should we run?"
  print,jobnames
  jobname = ""
  read,">>> ",jobname
  job = getattr(jobs, jobname)
  print,"Starting job ",jobname
  r = call_function(jobname, job.a, job.b)
  print,"Result of job: ",r
end

Another interesting fact: if you don’t do jobname = "", IDL will assume by default that it is a number, and read will complain that the thing you have entered doesn’t parse as a number.

Conclusions

If, as a professional programmer, IDL gives you heartburn, you can soothe the pain by watching this breathtaking video made with images taken by Cassini ISS cameras! :-)

If you know IDL and know better ways to do what has been shown here, I would be very happy to hear about it. Thank you!

Sesquimosa

2013-10-22T00:00:00+00:00

A mimosa is half a mosa. A sesquimosa is one mosa and a half. If you like mimosas, you might like this beverage thrice as much :-)

Recipe for two flutes of sesquimosa:

pour 6 oz of peach juice in a shaker;
add a dash of bitter angostura;
add a dash of lemon juice;
add 1 oz of gold rum;
add 1 oz of cointreau;
add some ice cubes and shake;
add 4 oz of perrier (sparkling water will do);
serve!

Securing Docker in the wild

2013-10-20T00:00:00+00:00

By default, the Docker API is exposed over a local UNIX socket. If you want to control Docker from a remote host, you can configure Docker to expose its API over a TCP socket instead. However, Docker itself doesn’t implement authentication. We will see here how we can use SSL certificate authentication to encrypt and authenticate the Docker API.

The plan

This is a very simple recipe, using socat in front of the Docker API. socat will accept HTTPS connections, make sure that the client shows an appropriate certificate, and relay the connection to the UNIX socket. The client should either use socat as well to wrap a normal connection into a SSL connection; or use OpenSSL (or a similar crypto library) to do the wrapping directly.

A few words about certificates

I won’t do a full intro to public key crypto; but the basic idea is the following:

the server (i.e. Docker) and each client connecting to it have to generate their own private key;
they get a certificate authority to sign those keys, delivering them a certificate;
when a client connects to the server, each party asks the other one to present its certificate, and is able to verify the validity of the certificate.

In other words: the client will know for sure that it’s talking to the server, and the server will know for sure that it’s talking to an authorized client.

In this example, we will cut corners. The client, server, and certificate authority will actually be the same entity. They will use the same key and certificate.

Get prepared

We need to install socat on both the client and server; and we need openssl somewhere (doesn’t matter where exactly: it’s purely for generation of the key material).

apt-get install socat openssl

socat is a very common tool, so it should be available for your distro, even if it’s an exotic one.

Generate key and certificate

Here is my quick-and-dirty recipe to generate a RSA key (stored in key.pem) and a self-signed certificate (stored in cert.pem), valid for 100 years:

openssl genrsa -out key.pem 2048
openssl req -new -key key.pem -x509 -out cert.pem -days 36525 -subj /CN=WoopWoop/

Run that anywhere, then copy both key.pem and cert.pem to the client and the server.

On server (running Docker)

Docker should run as usual. Then start socat like this:

socat \
  OPENSSL-LISTEN:4321,fork,reuseaddr,cert=cert.pem,cafile=cert.pem,key=key.pem \
  UNIX:/var/run/docker.sock

fork means that socat will fork a new child process for each incoming connection (instead of handling only one connection and exiting right away).

reuseaddr is a useful socket option, so that if you exit and restart socat, it won’t tell you that the address is already taken.

By default, OPENSSL connections made with socat require the other end to show a valid certificate; unless you add verify=0. In this case, we want to encrypt connections and check certificates (to deny unauthorized clients), so the defaults are good.

On client (running e.g. Docker CLI)

The symmetrical invocation of socat looks like this:

socat \
  UNIX-LISTEN:/tmp/docker.sock,fork \
  OPENSSL:$SERVERADDR:4321,cert=cert.pem,cafile=cert.pem,key=key.pem

Now you can point your Docker CLI to the server through the tunnel, like this:

docker -H unix:///tmp/docker.sock run -t -i busybox sh

On client (using an HTTP client API)

If you want to connect to the Docker daemon with a regular HTTP client (which maybe cannot connect to a UNIX socket to do HTTP requests), try this version:

socat \
  TCP-LISTEN:4321,bind=127.0.0.1,fork \
  OPENSSL:$SERVERADDR:4321,cert=cert.pem,cafile=cert.pem,key=key.pem

The Docker API is then available on http://127.0.0.1:4321.
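To give an idea of what such a "regular HTTP client" could look like, here is a minimal Python sketch using only the standard library. It assumes that the socat tunnel above is listening on 127.0.0.1:4321; the helper names (api_url, get_json) are made up for this example, and /version is picked only because it is a cheap, read-only endpoint.

```python
import json
import urllib.request

# Assumption: the socat tunnel from the previous step listens here.
TUNNEL = "http://127.0.0.1:4321"


def api_url(path, base=TUNNEL):
    """Build the URL of a Docker API endpoint reached through the tunnel."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def get_json(path, base=TUNNEL, timeout=5):
    """Send a GET request through the tunnel and decode the JSON reply."""
    with urllib.request.urlopen(api_url(path, base), timeout=timeout) as resp:
        return json.load(resp)
```

With socat running on both ends, get_json("/version") should return the daemon's version information as a dictionary.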

Enjoy!

What’s next?

It would obviously be much better to use a separate certificate authority, and generate different keys and certificates for the server and for each client. “This is left as an exercise for the reader,” as we say! :-)

How to configure Docker to start containers on a specific IP address range

2013-10-16T00:00:00+00:00

A recurring question on the Docker mailing list and on the Docker IRC channel is “how can I change the network range used by Docker?”. While Docker itself doesn’t have a configuration option to change this network range (yet!), it is very easy to change it, and here is how.

Docker’s default behavior

When you (or your distro’s init scripts) start the Docker daemon, the daemon will check if it was given a -b option on the command-line. This option specifies the name of the bridge interface to be used by Docker. All the containers will be bound to this bridge. If the -b option is not specified, Docker will use the name docker0 instead.

Then, Docker will check if that bridge interface actually exists. If it does, it will use it – and use whatever IP address and netmask are configured on this interface. For instance, if you already have a bridge br0 set up with IP address 10.3.3.100/24, and start the Docker daemon with -b br0, then containers will be started on IP addresses from 10.3.3.1 to 10.3.3.99, then (skipping the bridge address) from 10.3.3.101 to 10.3.3.254.
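If you want to double-check that range, the arithmetic can be reproduced with Python's ipaddress module. This is only an illustration of which addresses are available on the bridge's network; it is not Docker's actual allocator, and container_addresses is a name I made up.

```python
import ipaddress


def container_addresses(bridge_cidr):
    """Yield usable host addresses on the bridge's network, skipping the bridge itself."""
    bridge = ipaddress.ip_interface(bridge_cidr)
    for host in bridge.network.hosts():
        if host != bridge.ip:
            yield host


addresses = list(container_addresses("10.3.3.100/24"))
# First address, the two addresses around the bridge, and the last address.
print(addresses[0], addresses[98], addresses[99], addresses[-1])
```

The network and broadcast addresses (10.3.3.0 and 10.3.3.255) are excluded by hosts(), and the bridge address is skipped explicitly.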

If the interface doesn’t exist, Docker will create it, and assign an IP address to it. But of course, it cannot just pick a random IP address: it would always conflict with someone’s IP addressing plan out there. So Docker tries to be smart. It tries a number of different ranges, until it finds one that doesn’t overlap with an existing route on your system, or with your DNS server. (You can see the whole list in network.go.)

Hell is paved with good intentions

But Docker only knows about your directly connected routes (using the ip route command) and your DNS server (checking /etc/resolv.conf). The first address that Docker tries to use is 172.17.42.1/16. Suppose that your machine’s IP address is 192.168.1.2/24, your default gateway is 192.168.1.1/24, and you happen to have an internal server on 172.17.6.6, reachable through your default gateway. Docker won’t “see” the route to that server (it will only see the default route), and it won’t be able to “know” that it shouldn’t use that network.
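The scenario above can be reproduced with a few lines of Python. This is a sketch of the idea (comparing a candidate range with the networks that are visible locally), not a transcription of what Docker does internally; conflicts is a hypothetical helper.

```python
import ipaddress


def conflicts(candidate, known_networks):
    """Return the known networks that overlap with the candidate range."""
    cand = ipaddress.ip_interface(candidate).network
    return [n for n in known_networks if cand.overlaps(ipaddress.ip_network(n))]


# What the check can see: only the directly connected route.
visible = ["192.168.1.0/24"]
no_conflict_seen = conflicts("172.17.42.1/16", visible) == []

# What it cannot see: the internal server behind the default gateway.
server = ipaddress.ip_address("172.17.6.6")
server_is_shadowed = server in ipaddress.ip_interface("172.17.42.1/16").network

print(no_conflict_seen, server_is_shadowed)
```

Both values are true at the same time: nothing visible overlaps with 172.17.0.0/16, yet the server's address falls right inside it.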

In other words, Docker’s network allocation scheme is not bullet-proof. It’s still useful, because instead of working 99% of the time, it probably works 99.99% (I’m completely making up those numbers); but the remaining 0.01% still need a solution.

So what should I do?

If you are in that 0.01%, the solution is very simple: just create your own bridge, configure it with a fixed address, and tell Docker to use it. Done.

If you do it manually, it will look like this (on Ubuntu):

stop docker
ip link add br0 type bridge
ip addr add 172.30.1.1/20 dev br0
ip link set br0 up
docker -d -b br0

If you want to persist your changes across server reboots, you can add the bridge to /etc/network/interfaces. On my laptop, I have the following definition in that file:

auto br0
iface br0 inet static
        address 10.1.1.1
        netmask 255.255.255.0
        bridge_ports dummy0
        bridge_stp off
        bridge_fd 0

My version of the ifupdown scripts requires that a bridge_ports option is present; otherwise, it doesn’t recognize the interface as a bridge. Therefore, I put a dummy interface in it. Also, for bonus points, I disabled STP and reduced the forwarding delay to zero.

Then, I updated my Docker init script to add -b br0.

Note: I used br0 because I also have other VMs running on this machine (using QEMU, VirtualBox, and sometimes KVM) and I configured everything to use br0, so my containers and my VirtualBox VMs can communicate directly. But to make things simpler, you can just use the name docker0 in your interfaces definition file, and Docker will pick it up automatically without extra configuration.

But I don’t want to edit my system files; can’t Docker do this?

Not yet. But it would be reasonable to extend the -b option to specify the address and netmask to use; for instance -b br0 would still use the br0 interface “as-is”, but -b br0=192.168.1.1/24 would create the interface and assign an IP address.
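To make the proposal a bit more concrete, here is how the two forms of the option could be told apart. This is purely hypothetical: the feature doesn't exist, Docker is written in Go rather than Python, and parse_bridge_option is a name invented for this sketch.

```python
import ipaddress


def parse_bridge_option(value):
    """Split 'name' or 'name=address/prefix' into (name, interface or None)."""
    name, sep, cidr = value.partition("=")
    if not name:
        raise ValueError("missing bridge name")
    if not sep:
        return name, None              # use the existing interface as-is
    # Raises ValueError if the address or prefix is malformed.
    return name, ipaddress.ip_interface(cidr)
```

With this, "br0" yields ("br0", None), while "br0=192.168.1.1/24" yields the name plus an address object from which both the bridge IP and the network can be derived.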

Docker is an Open Source project, and contributing is really easy. If you really need that feature, it could be the perfect opportunity to learn Go :-)

Seriously, though, if you want to implement this, don’t hesitate to open a GitHub issue (after having read the contributing guidelines) to indicate that you will be working on it; and we’ll look forward to reviewing your pull requests!

Gathering container metrics

2013-10-08T00:00:00+00:00

Linux Containers rely on control groups which not only track groups of processes, but also expose a lot of metrics about CPU, memory, and block I/O usage. We will see how to access those metrics, and how to obtain network usage metrics as well. This is relevant for “pure” LXC containers, as well as for Docker containers.

Locate your control groups

Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup. Under that directory, you will see multiple sub-directories, called devices, freezer, blkio, etc.; each sub-directory actually corresponds to a different cgroup hierarchy.

On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies. In that case, instead of seeing the sub-directories, you will see a bunch of files in that directory, and possibly some directories corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

grep cgroup /proc/mounts

Control groups hierarchies

The fact that different control groups can be in different hierarchies means that you can use completely different groups (and policies) for e.g. CPU allocation and memory allocation. Let’s make up a completely imaginary example: you have a 2-CPU system running Python webapps with Gunicorn, a PostgreSQL database, and accepting SSH logins. You can put each webapp and each SSH session in their own memory control group (to make sure that a single app or user doesn’t use up the memory of the whole system), and at the same time, stick the webapps and database on a CPU, and the SSH logins on another CPU.

Of course, if you run LXC containers, each hierarchy will have one group per container, and all hierarchies will look the same.

Merging or splitting hierarchies is achieved by using special options when mounting the cgroup pseudo-filesystems. Note that if you want to change that, you will have to remove all existing cgroups in the hierarchies that you want to split or merge.

Enumerating our cgroups

You can look into /proc/cgroups to see the different control group subsystems known to the system, the hierarchy they belong to, and how many groups they contain.

You can also look at /proc/<pid>/cgroup to see which control groups a process belongs to. The control group will be shown as a path relative to the root of the hierarchy mountpoint; e.g. / means “this process has not been assigned into a particular group”, while /lxc/pumpkin means that the process is likely to be a member of a container named pumpkin.
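Each line of /proc/<pid>/cgroup has three colon-separated fields: a hierarchy ID, a comma-separated list of subsystems, and the path of the group. Here is a small Python sketch to turn that into a dictionary. The sample content is made up (it reuses the pumpkin container from above), and parse_proc_cgroup is just a name I picked.

```python
def parse_proc_cgroup(text):
    """Map each subsystem name to the cgroup path of the process."""
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path comes last and is the only field that may contain colons.
        _hierarchy_id, subsystems, path = line.split(":", 2)
        for subsystem in subsystems.split(","):
            groups[subsystem] = path
    return groups


# Made-up sample; on a real system, read /proc/<pid>/cgroup instead.
sample = "4:memory:/lxc/pumpkin\n3:cpuacct,cpu:/lxc/pumpkin\n2:freezer:/\n"
print(parse_proc_cgroup(sample)["memory"])
```

To inspect the current process, pass it the content of /proc/self/cgroup.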

Finding the cgroup for a given container

For each container, one cgroup will be created in each hierarchy. On older systems with older versions of the LXC userland tools, the name of the cgroup will be the name of the container. With more recent versions of the LXC tools, the cgroup will be lxc/<container_name>.

Additional note for Docker users: the container name will be the full ID or long ID of the container. If a container shows up as ae836c95b4c3 in docker ps, its long ID might be something like ae836c95b4c3c9e9179e0e91015512da89fdec91612f63cebae57df9a5444c79. You can look it up with docker inspect or docker ps -notrunc.

Putting everything together: on my system, if I want to look at the memory metrics for a Docker container, I have to look at /sys/fs/cgroup/memory/lxc/<longid>/.

Collecting memory, CPU, block I/O metrics

For each subsystem, we will find one pseudo-file (in some cases, multiple) containing statistics about used memory, accumulated CPU cycles, or number of I/O operations completed. Those files are easy to parse, as we will see.

Memory metrics

Those will be found in the memory cgroup (duh!). Note that the memory control group adds a little overhead, because it does very fine-grained accounting of the memory usage on your system. Therefore, many distros chose to not enable it by default. Generally, to enable it, all you have to do is to add some kernel command-line parameters: cgroup_enable=memory swapaccount=1.

The metrics are in the pseudo-file memory.stat. Here is what it will look like:

cache 11492564992
rss 1930993664
mapped_file 306728960
pgpgin 406632648
pgpgout 403355412
swap 0
pgfault 728281223
pgmajfault 1724
inactive_anon 46608384
active_anon 1884520448
inactive_file 7003344896
active_file 4489052160
unevictable 32768
hierarchical_memory_limit 9223372036854775807
hierarchical_memsw_limit 9223372036854775807
total_cache 11492564992
total_rss 1930993664
total_mapped_file 306728960
total_pgpgin 406632648
total_pgpgout 403355412
total_swap 0
total_pgfault 728281223
total_pgmajfault 1724
total_inactive_anon 46608384
total_active_anon 1884520448
total_inactive_file 7003344896
total_active_file 4489052160
total_unevictable 32768

The first half (without the total_ prefix) contains statistics relevant to the processes within the cgroup, excluding sub-cgroups. The second half (with the total_ prefix) includes sub-cgroups as well.
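Since every line is a name followed by a number, parsing is straightforward. Here is a minimal Python sketch that splits the two halves into separate dictionaries; parse_memory_stat is a hypothetical helper, and it deliberately skips any line that doesn't have exactly two fields.

```python
def parse_memory_stat(text):
    """Return (own, total): two dicts of integer values keyed by metric name."""
    own, total = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue                   # skip blank or unexpected lines
        key, value = fields[0], int(fields[1])
        if key.startswith("total_"):
            total[key[len("total_"):]] = value
        else:
            # Includes the hierarchical_*_limit entries, which have no prefix.
            own[key] = value
    return own, total
```

Feed it the content of the memory.stat file of a cgroup; own["rss"] and total["rss"] then give the two flavors of the same metric.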

Some metrics are “gauges”, i.e. values that can increase or decrease (e.g. swap, the amount of swap space used by the members of the cgroup). Some others are “counters”, i.e. values that can only go up, because they represent occurrences of a specific event (e.g. pgfault, which indicates the number of page faults which happened since the creation of the cgroup; this number can never decrease).
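The distinction matters when you graph those values: a gauge can be plotted as-is, while a counter is usually turned into a rate by comparing two readings. Here is a sketch of that computation, assuming you take the readings yourself at a known interval (counter_rate is a made-up name).

```python
def counter_rate(previous, current, interval_seconds):
    """Average number of events per second between two readings of a counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if current < previous:
        # A counter never decreases; a lower value suggests that the cgroup
        # was destroyed and recreated between the two readings.
        return None
    return (current - previous) / interval_seconds
```

For instance, two pgfault readings of 1000 and 1600 taken 60 seconds apart give an average of 10 page faults per second.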

Let’s see what those metrics stand for. All memory amounts are in bytes (except for event counters).

cache is the amount of memory used by the processes of this control group that can be associated precisely with a block on a block device. When you read and write files from and to disk, this amount will increase. This will be the case if you use “conventional” I/O (open, read, write syscalls) as well as mapped files (with mmap). It also accounts for the memory used by tmpfs mounts. I don’t know exactly why; it might be because tmpfs filesystems work directly with the page cache.
rss is the amount of memory that doesn’t correspond to anything on disk: stacks, heaps, and anonymous memory maps.
mapped_file indicates the amount of memory mapped by the processes in the control group. In my humble opinion, it doesn’t give you information about how much memory is used; it rather tells you how it is used.
pgpgin and pgpgout are a bit tricky. If you are used to vmstat, you might think that they indicate the number of times that a page had to be read and written (respectively) by a process of the cgroup, and that they should reflect both file I/O and swap activity. Wrong! In fact, they correspond to charging events. Each time a page is “charged” (=added to the accounting) to a cgroup, pgpgin increases. When a page is “uncharged” (=no longer “billed” to a cgroup), pgpgout increases.
pgfault and pgmajfault indicate the number of times that a process of the cgroup triggered a “page fault” and a “major fault”, respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a SIGSEGV signal, typically killing it with the famous Segmentation fault message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process’ own copy of the page. “Major” faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it’s a regular (or “minor”) fault.
swap is (as expected) the amount of swap currently used by the processes in this cgroup.
active_anon and inactive_anon are the amounts of anonymous memory that have been identified as respectively active and inactive by the kernel. “Anonymous” memory is the memory that is not linked to disk pages. In other words, that’s the equivalent of the rss counter described above. In fact, the very definition of the rss counter is active_anon+inactive_anon-tmpfs (where tmpfs is the amount of memory used up by tmpfs filesystems mounted by this control group). Now, what’s the difference between “active” and “inactive”? Pages are initially “active”; and at regular intervals, the kernel sweeps over the memory, and tags some pages as “inactive”. Whenever they are accessed again, they are immediately retagged “active”. When the kernel is almost out of memory, and time comes to swap out to disk, the kernel will swap “inactive” pages.
Likewise, the cache memory is broken down into active_file and inactive_file. The exact formula is cache=active_file+inactive_file+tmpfs. The exact rules used by the kernel to move memory pages between active and inactive sets are different from the ones used for anonymous memory, but the general principle is the same. Note that when the kernel needs to reclaim memory, it is cheaper to reclaim a clean (=non modified) page from this pool, since it can be reclaimed immediately (while anonymous pages and dirty/modified pages have to be written to disk first).
unevictable is the amount of memory that cannot be reclaimed; generally, it will account for memory that has been “locked” with mlock. It is often used by crypto frameworks to make sure that secret keys and other sensitive material never get swapped out to disk.
Last but not least, the memory and memsw limits are not really metrics, but a reminder of the limits applied to this cgroup. The first one indicates the maximum amount of physical memory that can be used by the processes of this control group; the second one indicates the maximum amount of RAM+swap.

Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge will be split between the control groups. It’s nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.

CPU metrics

Now that we’ve covered memory metrics, everything else will look very simple in comparison. CPU metrics will be found in the cpuacct controller.

For each container, you will find a pseudo-file cpuacct.stat, containing the CPU usage accumulated by the processes of the container, broken down between user and system time. If you’re not familiar with the distinction, user is the time during which the processes were in direct control of the CPU (i.e. executing process code), and system is the time during which the CPU was executing system calls on behalf of those processes.

Those times are expressed in ticks of 1/100th of a second. (Actually, they are expressed in “user jiffies”. There are USER_HZ “jiffies” per second, and on x86 systems, USER_HZ is 100. This used to map exactly to the number of scheduler “ticks” per second; but with the advent of higher frequency scheduling, as well as tickless kernels, the number of kernel ticks wasn’t relevant anymore. It stuck around anyway, mainly for legacy and compatibility reasons.)
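Converting those values to seconds is a simple division. The sketch below asks the system for the number of ticks per second (through sysconf) rather than hard-coding 100; the function name and the sample numbers are mine.

```python
import os


def parse_cpuacct_stat(text, user_hz=None):
    """Convert the 'user N' and 'system N' lines from jiffies to seconds."""
    if user_hz is None:
        user_hz = os.sysconf("SC_CLK_TCK")   # USER_HZ as seen from userland
    seconds = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            seconds[fields[0]] = int(fields[1]) / user_hz
    return seconds
```

With USER_HZ at 100, a file containing "user 46011" and "system 22222" corresponds to about 460 seconds of user time and 222 seconds of system time.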

Block I/O metrics

Block I/O is accounted in the blkio controller. Different metrics are scattered across different files. While you can find in-depth details in the blkio-controller file in the kernel documentation, here is a short list of the most relevant ones:

blkio.sectors contains the number of 512-byte sectors read and written by the processes that are members of the cgroup, device by device. Reads and writes are merged in a single counter.
blkio.io_service_bytes indicates the number of bytes read and written by the cgroup. It has 4 counters per device, because for each device, it differentiates between synchronous vs. asynchronous I/O, and reads vs. writes.
blkio.io_serviced is similar, but instead of showing byte counters, it will show the number of I/O operations performed, regardless of their size. It also has 4 counters per device.
blkio.io_queued indicates the number of I/O operations currently queued for this cgroup. In other words, if the cgroup isn’t doing any I/O, this will be zero. Note that the opposite is not true. In other words, if there is no I/O queued, it does not mean that the cgroup is idle (I/O-wise). It could be doing purely synchronous reads on an otherwise quiescent device, which is therefore able to handle them immediately, without queuing. Also, while it is helpful to figure out which cgroup is putting stress on the I/O subsystem, keep in mind that it is a relative quantity. Even if a process group does not perform more I/O, its queue size can increase just because the device load increases because of other devices.

For each file, there is a _recursive variant, that aggregates the metrics of the control group and all its sub-cgroups.

Also, it’s worth mentioning that in most cases, if the processes of a control group have not done any I/O on a given block device, the block device will not appear in the pseudo-files. In other words, you have to be careful each time you parse one of those files, because new entries might have appeared since the previous time.
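Here is one way to deal with that when parsing files like blkio.io_service_bytes, where each line holds a device (major:minor), an operation name, and a value. It is a sketch with made-up function names: entries are stored per device, and a device missing from a sample is treated as zero when computing a difference.

```python
from collections import defaultdict


def parse_blkio(text):
    """Parse 'major:minor Operation value' lines into {device: {operation: value}}."""
    devices = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue                   # e.g. a trailing grand 'Total N' line
        device, operation, value = fields
        devices[device][operation] = int(value)
    return dict(devices)


def delta(previous, current, device, operation):
    """Difference between two samples; a device absent from a sample counts as 0."""
    before = previous.get(device, {}).get(operation, 0)
    after = current.get(device, {}).get(operation, 0)
    return after - before
```

That way, a block device that shows up for the first time between two collections doesn't break the collector.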

Collecting network metrics

Interestingly, network metrics are not exposed directly by control groups. There is a good explanation for that: network interfaces exist within the context of network namespaces. The kernel could probably accumulate metrics about packets and bytes sent and received by a group of processes, but those metrics wouldn’t be very useful. You want (at least!) per-interface metrics (because traffic happening on the local lo interface doesn’t really count). But since processes in a single cgroup can belong to multiple network namespaces, those metrics would be harder to interpret: multiple network namespaces means multiple lo interfaces, potentially multiple eth0 interfaces, etc.; so this is why there is no easy way to gather network metrics with control groups.

So what shall we do? Well, we have multiple options.

Iptables

When people think about iptables, they usually think about firewalling, and maybe NAT scenarios. But iptables (or rather, the netfilter framework for which iptables is just an interface) can also do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP traffic on a web server:

iptables -I OUTPUT -p tcp --sport 80

There is no -j or -g flag, so the rule will just count matched packets and go to the following rule.

Later, you can check the values of the counters, with:

iptables -nxvL OUTPUT

(Technically, -n is not required, but it will prevent iptables from doing DNS reverse lookups, which are probably useless in this scenario.)

Counters include packets and bytes. If you want to set up metrics for container traffic like this, you could execute a for loop to add two iptables rules per container IP address (one in each direction), in the FORWARD chain. This will only meter traffic going through the NAT layer; you will also have to add traffic going through the userland proxy.
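That loop could look like the following sketch. It only builds the commands (so you can review them before running anything as root); the container addresses are made up, and accounting_rules is a hypothetical helper.

```python
def accounting_rules(container_ips, chain="FORWARD"):
    """Return one iptables command per direction for each container address."""
    commands = []
    for ip in container_ips:
        # No -j flag: the rules only count packets, as explained above.
        commands.append(["iptables", "-I", chain, "-s", ip])   # sent by the container
        commands.append(["iptables", "-I", chain, "-d", ip])   # received by the container
    return commands


for command in accounting_rules(["172.17.0.2", "172.17.0.3"]):
    print(" ".join(command))
```

To actually install the rules, each command can be passed to subprocess.run(command, check=True).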

Then, you will need to check those counters on a regular basis. If you happen to use collectd, there is a nice plugin to automate iptables counters collection.

Interface-level counters

Since each container has a virtual Ethernet interface, you might want to check directly the TX and RX counters of this interface. However, this is not as easy as it sounds. If you use Docker (as of current version 0.6) or lxc-start, then you will notice that each container is associated to a virtual Ethernet interface in your host, with a name like vethKk8Zqi. Figuring out which interface corresponds to which container is, unfortunately, difficult. (If you know an easy way, let me know.)

In the long run, Docker will probably take over the setup of those virtual interfaces. It will keep track of their names, and make sure that it can easily associate containers with their respective interfaces.

But for now, the best way is to check the metrics from within the containers. I’m not talking about running a special agent in the container, or anything like that. We are going to run an executable from the host environment, but within the network namespace of a container.

ip-netns magic

To do that, we will use the ip netns exec command. This command will let you execute any program (present in the host system) within any network namespace visible to the current process. This means that your host will be able to enter the network namespace of your containers, but your containers won’t be able to access the host, nor their sibling containers. Containers will be able to “see” and affect their sub-containers, though.

The exact format of the command is:

ip netns exec <nsname> <command...>

For instance:

ip netns exec mycontainer netstat -i

How does the naming system work? How does ip netns find mycontainer? Answer: by using the namespaces pseudo-files. Each process belongs to one network namespace, one PID namespace, one mnt namespace, etc.; and those namespaces are materialized under /proc/<pid>/ns/. For instance, the network namespace of PID 42 is materialized by the pseudo-file /proc/42/ns/net.

When you run ip netns exec mycontainer ..., it expects /var/run/netns/mycontainer to be one of those pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a container, we need to:

find out the PID of any process within the container that we want to investigate;
create a symlink from /var/run/netns/<somename> to /proc/<thepid>/ns/net;
execute ip netns exec <somename> ....

Now, we need to figure out a way to find the PID of a process (any process!) running in the container that we want to investigate. This is actually very easy. You have to locate one of the control groups corresponding to the container. We explained how to locate those cgroups in the beginning of this post, so we won’t cover that again.

On my machine, a control group will typically be located in /sys/fs/cgroup/devices/lxc/<containerid>. Within that directory, you will find a pseudo-file called tasks. It contains the list of the PIDs that are in the control group, i.e., in the container. We can take any of them; so the first one will do.

Putting everything together, if the “short ID” of a container is held in the environment variable $CID, then you can do this:

TASKS=/sys/fs/cgroup/devices/lxc/$CID*/tasks
PID=$(head -n 1 $TASKS)
mkdir -p /var/run/netns
ln -sf /proc/$PID/ns/net /var/run/netns/$CID
ip netns exec $CID netstat -i

The same mechanism is used in Pipework to setup network interfaces within containers from outside the containers.

Tips for high-performance metric collection

Note that running a new process each time you want to update metrics is (relatively) expensive. If you want to collect metrics at high resolutions, and/or over a large number of containers (think 1000 containers on a single host), you do not want to fork a new process each time.

Here is how to collect metrics from a single process. You will have to write your metric collector in C (or any language that lets you do low-level system calls). You need to use a special system call, setns(), which lets the current process enter any arbitrary namespace. It requires, however, an open file descriptor to the namespace pseudo-file (remember: that’s the pseudo-file in /proc/<pid>/ns/net).

However, there is a catch: you must not keep this file descriptor open. If you do, when the last process of the control group exits, the namespace will not be destroyed, and its network resources (like the virtual interface of the container) will stay around forever (or until you close that file descriptor).

The right approach would be to keep track of the first PID of each container, and re-open the namespace pseudo-file each time.
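Here is what this could look like in Python, as a rough sketch rather than a finished collector. It assumes Python 3.12 or later (where os.setns is available), a single-threaded process running as root, and it reads the interface counters from /proc/net/dev; both function names are mine.

```python
import os


def parse_net_dev(text):
    """Parse the content of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:            # the first two lines are headers
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]), "rx_packets": int(fields[1]),
            "tx_bytes": int(fields[8]), "tx_packets": int(fields[9]),
        }
    return stats


def read_container_net_dev(pid):
    """Enter the network namespace of `pid` and read its interface counters.

    The calling process remains in that namespace afterwards, until it
    enters another one.
    """
    fd = os.open(f"/proc/{pid}/ns/net", os.O_RDONLY)
    try:
        os.setns(fd, os.CLONE_NEWNET)
        with open("/proc/net/dev") as f:
            return parse_net_dev(f.read())
    finally:
        os.close(fd)                               # never keep this descriptor around
```

The descriptor is opened, used, and closed within a single call, so the collector never holds a namespace alive between two collections.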

Collecting metrics when a container exits

Sometimes, you do not care about real time metric collection, but when a container exits, you want to know how much CPU, memory, etc. it has used.

The current implementation of Docker (as of 0.6) makes this particularly challenging, because it relies on lxc-start, and when a container stops, lxc-start carefully cleans up behind it. If you really want to collect the metrics anyway, here is how.

For each container, start a collection process, and move it to the control groups that you want to monitor by writing its PID to the tasks file of the cgroup. The collection process should periodically re-read the tasks file to check if it’s the last process of the control group. (If you also want to collect network statistics as explained in the previous section, you should also move the process to the appropriate network namespace.)

When the container exits, lxc-start will try to delete the control groups. It will fail, since the control group is still in use; but that’s fine. Your process should now detect that it is the only one remaining in the group. Now is the right time to collect all the metrics you need!

Finally, your process should move itself back to the root control group, and remove the container control group. To remove a control group, just rmdir its directory. It’s counter-intuitive to rmdir a directory while it still contains files; but remember that this is a pseudo-filesystem, so usual rules don’t apply. After the cleanup is done, the collection process can exit safely.
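The detection step can be sketched as follows. This only covers the "am I the last one?" check and the polling loop; moving the process into the cgroup, collecting the metrics, and the final rmdir are left out. The function names and the polling interval are arbitrary choices.

```python
import os
import time


def only_collector_left(tasks_text, collector_pid):
    """True when the collector is the sole PID listed in the tasks file."""
    pids = {int(pid) for pid in tasks_text.split()}
    return pids == {collector_pid}


def wait_until_alone(tasks_path, poll_seconds=1.0):
    """Block until the current process is the last member of the cgroup."""
    pid = os.getpid()
    while True:
        with open(tasks_path) as f:
            if only_collector_left(f.read(), pid):
                return
        time.sleep(poll_seconds)
```

Once wait_until_alone returns, the container is gone and the metrics in the cgroup pseudo-files are final.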

As you can see, collecting metrics when a container exits can be tricky; for this reason, it is usually easier to collect metrics at regular intervals (e.g. every minute) and rely on that instead.

Wrapping it up

To recap, we covered:

how to locate the control groups for containers;
reading and interpreting compute metrics for containers;
different ways to obtain network metrics for containers;
a technique to gather overall metrics when a container exits.

As we have seen, metrics collection is not insanely difficult, but still involves many complicated steps, with special cases like those for the network subsystem. Docker will take care of this, or at least expose hooks to make it more straightforward. It is one of the reasons why we repeat over and over “Docker is not production ready yet”: it’s fine to skip metrics for development, continuous testing, or staging environments, but it’s definitely not fine to run production services without metrics!

Last but not least, note that even with all that information, you will still need a storage and graphing system for those metrics. There are many such systems out there. If you want something that you can deploy on your own, you can check e.g. collectd or Graphite. There are also “-as-a-Service” offerings. Those services will store your metrics and let you query them in various ways, for a given price. Some examples include Librato, AWS CloudWatch, New Relic Server Monitoring, and many more.

Acknowledgements

This post was initially published on the Docker blog.

I’m indebted to Andrew Rothfusz for proofreading this article. If any mistakes or typos remain, I take full responsibility for them :-)

Use policy-rc.d to prevent services from starting automatically

2013-10-06T00:00:00+00:00

When you install (or upgrade) a service, the package manager will try to start (or restart) this service. If you are working on a normal server, this is usually what you want. But if you are inside a chroot environment, or maintaining some kind of golden image, you don’t want to start services. If you are using Debian/Ubuntu-based distros, there is a super easy way to solve the problem: the /usr/sbin/policy-rc.d script.

Sysadmin inhibits service start and stop with this weird trick

When anything needs to start and stop services on Debian or Ubuntu, it doesn’t invoke init scripts directly: it goes through invoke-rc.d. So, instead of doing /etc/init.d/foobar start, a well-behaved postinstall script should do invoke-rc.d foobar start. It will do exactly the same thing, except that it will run policy-rc.d foobar start first. (If /usr/sbin/policy-rc.d doesn’t exist, it is skipped.)

The policy-rc.d script has only one job: it tells invoke-rc.d whether the action is allowed or not, through its exit status. An exit status of 0 means “action allowed”; an exit status of 101 means “action not allowed”. There are other possibilities, for more complicated scenarios. You can read the details in the invoke-rc.d interface documentation.

So, to prevent services from being started automatically when you install packages with dpkg, apt, etc., just do this (as root):

echo exit 101 > /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d

If you’re not root, you can use the sudo tee trick, i.e.:

echo exit 101 | sudo tee /usr/sbin/policy-rc.d
sudo chmod +x /usr/sbin/policy-rc.d
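The one-liner above denies everything. Since policy-rc.d can be any executable, a finer-grained policy is possible; here is a hedged Python sketch that allows actions for a few services and denies the rest. The allow-list is hypothetical, and the argument handling is simplified: it skips options and only looks at the init script name, which invoke-rc.d passes before the action.

```python
#!/usr/bin/env python3
import sys

ALLOWED = 0        # exit status meaning "action allowed"
FORBIDDEN = 101    # exit status meaning "action not allowed"

# Hypothetical allow-list: services that may still be started or stopped.
ALLOWED_SERVICES = {"ssh"}


def policy(service, allowed_services=ALLOWED_SERVICES):
    """Return the exit status that invoke-rc.d should see for this service."""
    return ALLOWED if service in allowed_services else FORBIDDEN


def main(argv):
    """Pick the init script name out of the arguments and apply the policy."""
    args = [arg for arg in argv[1:] if not arg.startswith("--")]
    if not args:
        return FORBIDDEN               # nothing to decide on: stay on the safe side
    return policy(args[0])
```

To use it as /usr/sbin/policy-rc.d, end the file with sys.exit(main(sys.argv)) and make it executable.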

Why she got no bangs?

If you already knew about policy-rc.d, here is a second chance to learn something new today!

You might be wondering “shouldn’t I put #!/bin/sh in the beginning of the policy-rc.d script?”

If there is no shebang at the beginning of the file, the OS will try to execute it as a “normal” binary. The execve syscall will fail with ENOEXEC (Exec format error). Well, unless your script happens to conveniently have an ELF signature (or another binary signature recognized by your system), but this is very unlikely.

What happens next depends on the calling program.

The exec wrappers in the libc that search the PATH (execlp and execvp) will try to use /bin/sh as a fallback to invoke the program – which is why I didn’t deem it necessary to add a shebang to the policy-rc.d script.

However, if you are running a shell, it will use execve directly, and if it fails, it will try to execute the script itself. In other words, if you call a script without shebang from bash, then bash will be used to execute it. (If the script is neither a standard executable nor a bash script, major confusion will ensue.)

Note that in some languages, execvp, execlp, and other execve wrappers do not always call their libc counterparts. This is why in Python (for instance), if you use execvp on a script without a shebang, you will get the ENOEXEC error. It will not try to use /bin/sh like the normal libc call.
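You can observe this from Python with the subprocess module, which (as far as I can tell) also hands the file straight to execve, without the /bin/sh fallback. The sketch below creates a throwaway script without a shebang, tries to run it directly, then runs it through /bin/sh explicitly. It assumes a Linux system where the temporary directory is not mounted noexec.

```python
import errno
import os
import stat
import subprocess
import tempfile


def run_without_shebang():
    """Run a shebang-less script directly, then through /bin/sh.

    Returns (errno of the direct attempt or None, exit status through /bin/sh).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "policy-rc.d")
        with open(script, "w") as f:
            f.write("exit 101\n")
        os.chmod(script, stat.S_IRWXU)
        try:
            subprocess.run([script])               # no shell fallback here
            direct = None
        except OSError as exc:
            direct = exc.errno
        via_sh = subprocess.run(["/bin/sh", script]).returncode
    return direct, via_sh
```

The direct attempt should fail with errno.ENOEXEC, while the explicit /bin/sh invocation returns the exit status of the script.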

Isn’t that great? :-)

Gunsub: keep your GitHub notifications under control

2013-09-26T00:00:00+00:00

Gunsub means “GitHub Unsubscribe”. It lets you be aware of everything happening in a given GitHub repository (through GitHub’s e-mail notifications), without getting too much spam. It lets the first notification go through, then automatically unsubscribes you from further messages in the same thread (unless you comment or are mentioned in the thread).

What’s the point?

I wrote this because I wanted to follow closely what was happening in the Docker repository, but at some point, I realized that I spent too much time dealing with the constant stream of e-mail notifications.

What I really needed was an initial notification for the first message of each conversation (e.g. when an issue is created). I didn’t want the rest of the conversation. If I want to get involved, all I have to do is to manually subscribe to the thread (which happens automatically if I comment on the issue through the website or through e-mail, or if someone mentions me on the issue).

How does it work?

Gunsub uses the GitHub API; specifically, the /notifications endpoint. It checks all the notifications that I have received. For each notification, the API indicates the reason for the notification: is it because I was mentioned there? Or automatically subscribed because I’m watching the repository? Or something else? If I was automatically subscribed, then Gunsub checks whether there is subscription information for that thread. If there is, it indicates either a manual subscription or, conversely, that I’m already ignoring that thread; in either case, Gunsub doesn’t change the subscription setting. However, if there is no subscription information, Gunsub will unsubscribe me from further notifications. The subscription information gets overridden if I comment or get mentioned anyway.
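The decision described above boils down to a few lines. This is a sketch of the logic only, not the actual Gunsub code: the function name is made up, it assumes that the API reports the reason "subscribed" for threads I follow only because I watch the repository, and it represents “no subscription information” as None.

```python
def should_unsubscribe(reason, subscription):
    """Decide whether to unsubscribe from a notification thread.

    reason: the reason given by the API for the notification. The value
        "subscribed" is assumed to mean "watching the repository".
    subscription: None if the thread has no subscription information,
        otherwise whatever the API returned for it (hypothetical shape).
    """
    if reason != "subscribed":
        # Mentioned, commented, etc.: I'm involved, so leave it alone.
        return False
    if subscription is not None:
        # Manually subscribed or already ignored: don't touch the setting.
        return False
    # Automatically subscribed, and no explicit choice made yet.
    return True


if __name__ == "__main__":
    # Hypothetical notifications: (reason, subscription information).
    samples = [
        ("subscribed", None),
        ("mention", None),
        ("subscribed", {"subscribed": True, "ignored": False}),
    ]
    for reason, subscription in samples:
        print(reason, subscription, should_unsubscribe(reason, subscription))
```

Keeping the decision in a pure function like this makes it easy to check the rules without talking to the API at all.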

This is awesome, how can I use it too?

Thank you! The code is available on /jpetazzo/gunsub. Gunsub only uses the Python standard library, so you don’t need to install anything fancy. You only need to set two environment variables, GITHUB_USER and GITHUB_PASSWORD, and run it with python gunsub.py.

Optionally, you may set GITHUB_INCLUDE_REPOS or GITHUB_EXCLUDE_REPOS to a comma-separated list of repositories to include or exclude. If you do not specify anything, by default, Gunsub will act upon all your repositories; if you specify GITHUB_INCLUDE_REPOS, it will act only on those; and if you specify GITHUB_EXCLUDE_REPOS, it will act on all repositories except those. If you specify both, it will be a little bit silly, but it will work anyway, operating on all included repositories except those in the exclude list.
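To make the include/exclude rules concrete, here is a small sketch of how they can combine. It is an illustration and not the code from gunsub.py: the function names are hypothetical, and it assumes repositories are written as owner/name.

```python
def parse_repo_list(value):
    """Turn a comma-separated string into a set of repository names."""
    return {name.strip() for name in (value or "").split(",") if name.strip()}


def repo_is_selected(repo, include="", exclude=""):
    """Return True if notifications for this repository should be handled.

    include and exclude are the raw values of GITHUB_INCLUDE_REPOS and
    GITHUB_EXCLUDE_REPOS (an empty string means "not set").
    """
    included = parse_repo_list(include)
    excluded = parse_repo_list(exclude)
    if included and repo not in included:
        # An include list was given, and this repository is not on it.
        return False
    # With or without an include list, the exclude list has the last word.
    return repo not in excluded


if __name__ == "__main__":
    for repo in ("dotcloud/docker", "jpetazzo/gunsub", "jpetazzo/other"):
        print(repo, repo_is_selected(
            repo,
            include="dotcloud/docker,jpetazzo/gunsub",
            exclude="jpetazzo/gunsub",
        ))
```

With both variables set, the result is “everything in the include list, minus everything in the exclude list”, which matches the slightly silly case described above.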

By default, Gunsub will do one pass over your notifications, unsubscribe from the “passive” notifications, and exit. But you can also set the GITHUB_POLL_INTERVAL environment variable to be a delay (in seconds): in that case, it will run in a loop, waiting for the indicated delay between each iteration.
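The “one pass, or loop forever” behavior can be sketched as follows. Again, this is not the actual implementation: run and one_pass are hypothetical names, and the max_passes and sleep parameters only exist so that the loop can be tried out without waiting.

```python
import os
import time


def run(one_pass, environ=os.environ, sleep=time.sleep, max_passes=None):
    """Call one_pass() once, or repeatedly if GITHUB_POLL_INTERVAL is set.

    Returns the number of passes that were executed.
    """
    interval = environ.get("GITHUB_POLL_INTERVAL")
    passes = 0
    while True:
        one_pass()
        passes += 1
        if not interval:
            # No interval configured: single pass, then exit.
            break
        if max_passes is not None and passes >= max_passes:
            # Safety valve for experiments; not part of the described behavior.
            break
        sleep(float(interval))
    return passes


if __name__ == "__main__":
    # Three quick passes with a fake sleep that returns immediately.
    count = run(
        lambda: print("checking notifications..."),
        environ={"GITHUB_POLL_INTERVAL": "300"},
        sleep=lambda seconds: None,
        max_passes=3,
    )
    print(count)
```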

Known issues

There are two issues that I’m aware of.

If someone opens an issue, and another comment is added quickly thereafter (i.e. before Gunsub enters its periodic loop to unsubscribe you), you will receive two e-mail notifications. I believe that it is not a problem, since you will probably handle both messages simultaneously (in low level I/O parlance, the interrupts have been coalesced, or the I/O requests have been merged, if you will :-)).

More importantly, sometimes you will see a notification in your inbox, and think right away “ah, I know this stuff, I will reply to that guy!”. Before replying, remember that you are only seeing the first message of the thread. You should open the thread on GitHub to see if other people have replied. This will spare you embarrassing moments, believe me! :-)

This seriously sucks, there are better ways to do it!

Please let me know. This is the first time I’ve done something meaningful with the GitHub API. I found the documentation to be technically accurate, but a lot of explanations were missing. For instance, when posting a subscription, there are two boolean flags: ignore and subscribe. Everywhere I looked, they were XORed (i.e., if ignore is true then subscribe is false and vice-versa). Is it meaningful to set them both to true, or both to false? I don’t know. So if you know more efficient ways to do that, I’d love to hear about it!

You should use the If-Modified-Since…

Yes, I understand that it would be nicer; and I might implement this soon enough. Consider this a Minimal Viable Product :-)

Running from Docker

Gunsub is so simple that it can probably run literally anywhere, even on Windows or OS X machines. However, in an ongoing effort to CONTAINERIZE ALL THE THINGS!, I wrote a tiny Dockerfile to run it inside a Docker container, and I uploaded it to the Docker registry.

If you already have a Docker installation, you can do something like this:

docker run -d -e GITHUB_USER=johndoe -e GITHUB_PASSWORD=SecretSesame \
       -e GITHUB_POLL_INTERVAL=300 jpetazzo/gunsub

… and Docker will start a Gunsub container, running the main loop every five minutes.

First post with Jekyll

2013-09-23T00:00:00+00:00

This is the blog I should have set up 15 years ago. Here I will talk about cool hacks, cooking, cocktails, books I’ve read (or sometimes I haven’t), linguistics… And I decided to use Jekyll to run it.

Why this blog?

I love to share about the stuff I do. At $WORK I manage ops for the dotCloud PaaS, and I spread the word about lightweight virtualization, Linux Containers, and Docker. This content has been published on the dotCloud blog or the Docker blog.

But I would also like to talk about other topics, not related to my work (or not directly). So I had to do something I had postponed for the last 15 years or so: set up my own blog :-)

Why Jekyll?

When I wrote my recent entries for the Docker blog, I drafted them in Markdown format, using Gist as a scratchpad. I like neat, lean markup formats like reStructuredText and Markdown. Moreover, I want to be able to write efficiently during my commute, or when on planes. (I don’t fly so often, but when I do, I’d rather make it productive if I can’t get some sleep.)

I don’t remember how I learned about Jekyll, but it was exactly what I was looking for: a decent blogging system, apparently designed to work with plain text source files. The GitHub Pages integration is the icing on the cake.

First steps with Jekyll

I did a local install of Jekyll using Stevedore. I will talk more about Stevedore another time; but to give you an idea, it was as simple as:

jpetazzo@tarrasque:~$ stevedore new jekyll
jpetazzo@tarrasque:~$ stevedore enter jekyll
jpetazzo@stevedore-jekyll:~$ sudo apt-get install -qy ruby1.8 rubygems1.8
[...]
jpetazzo@stevedore-jekyll:~$ sudo gem install jekyll
[...]
jpetazzo@stevedore-jekyll:~$ jekyll new jpetazzo.github.io
jpetazzo@stevedore-jekyll:~$ cd jpetazzo.github.io
jpetazzo@stevedore-jekyll:~/jpetazzo.github.io$ jekyll serve --watch --drafts
[...]
  Server running... press ctrl-c to stop.

Then in a different terminal:

jpetazzo@tarrasque:~$ stevedore url jekyll 4000
http://10.1.1.7:4000/

Then, I essentially started to customize the CSS and HTML templates a little bit, and wrote this.

Once I was happy with the result, I did a git init, added a .gitignore, committed everything to the appropriate GitHub repository, and there you go!

What’s next?

I will probably tweak the layout a little bit to make it nicer (or less ugly), maybe add some Twitter feed and/or nicer social links; and obviously, write more exciting content!