WTF, Linux?

Sometimes, weird things happen on your system.

The first thing to look at on any given system is the output of top:

```
top - 16:34:38 up 2 days,  1:46, 19 users,  load average: 0.74, 0.45, 0.39
Tasks: 290 total,   1 running, 288 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.5 us,  5.2 sy,  0.0 ni, 91.8 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem :  15785.9 total,   1554.9 free,   3441.4 used,  10789.5 buff/cache
MiB Swap:    976.0 total,    976.0 free,      0.0 used.  11682.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10856 tycho     20   0 1575288  66744  15424 S   5.9  0.4  25:42.61 hangups
 4046 root      20   0  311044  48760  27076 S   4.0  0.3  30:22.79 Xorg
12812 tycho     20   0 1127988 316872 152760 S   3.0  2.0  13:29.05 chrome
12852 tycho     20   0  370588 103756  61092 S   2.0  0.6   6:22.84 chrome
30949 tycho     20   0   51804  18328  11976 S   2.0  0.1   0:00.81 urxvt
31013 tycho     20   0   12268   4316   3428 R   2.0  0.0   0:01.20 top
  642 root     -51   0       0      0      0 S   1.0  0.0   8:55.92 irq/134-iwlwifi
26517 tycho     20   0  603672 147324  78360 S   1.0  0.9   0:09.02 chrome
    1 root      20   0  166772  11124   7692 S   0.0  0.1   0:07.28 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.09 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-events_highpri
    8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0  0.0   0:03.80 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0  0.0   3:36.11 rcu_sched
   11 root      rt   0       0      0      0 S   0.0  0.0   0:00.70 migration/0
   12 root     -51   0       0      0      0 S   0.0  0.0   0:00.00 idle_inject/0
```

top allows you to sort by what's using lots of CPU, memory, etc. This machine is not loaded at all, according to the load average:

load average: 0.74, 0.45, 0.39

The first number is the load average over the last minute, the second over the last five minutes, and the third over the last 15 minutes. Note that load averages are not normalized by core count, so a machine with four cores and a load average of 4 is totally CPU bound. Additionally, the load average is really the number of processes that "wanted" to run, not the number of processes that were actually running. So a machine with four cores and a load average of 16 is 4x oversubscribed on CPU.
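If you want that per-core arithmetic done for you, here is a minimal sketch (Linux-only, since it reads /proc/loadavg):

```python
import os

# The first three fields of /proc/loadavg are the 1/5/15-minute averages.
with open('/proc/loadavg') as f:
    averages = [float(x) for x in f.read().split()[:3]]

# Divide by core count; a per-core value near 1.0 means fully subscribed,
# and anything above that means processes are waiting for CPU.
cores = os.cpu_count()
for label, value in zip(('1m', '5m', '15m'), averages):
    print(f'{label}: {value:.2f} ({value / cores:.2f} per core)')
```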

But the most magical line in top is the per-cpu state line:

%Cpu(s): 2.5 us, 5.2 sy, 0.0 ni, 91.8 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st

Often, a careful reading of this line can tell you what is going on with a system. The numbers are percentages and, like load averages, values > 100 can still be sensible. But the most interesting parts are the suffixes. From the top man page:

us, user    : time running un-niced user processes
sy, system  : time running kernel processes
ni, nice    : time running niced user processes
id, idle    : time spent in the kernel idle handler
wa, IO-wait : time waiting for I/O completion
hi          : time spent servicing hardware interrupts
si          : time spent servicing software interrupts
st          : time stolen from this vm by the hypervisor

Mostly, you'll see high numbers in the us, sy, id, and wa columns. Large values in us or ni mean that the workload is CPU bound. These will generally correlate with non-zero values in sy, since sy counts time spent running in the kernel. Large values in sy relative to us almost always indicate a problem.
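To see where those numbers come from, here is a sketch that samples the aggregate cpu line of /proc/stat twice and computes the same percentages that top derives over its refresh interval:

```python
import time

# Jiffy counters on the aggregate "cpu" line of /proc/stat, in order:
# us ni sy id wa hi si st (guest fields omitted).
FIELDS = ('us', 'ni', 'sy', 'id', 'wa', 'hi', 'si', 'st')

def cpu_times():
    with open('/proc/stat') as f:
        return [int(x) for x in f.readline().split()[1:9]]

before = cpu_times()
time.sleep(1)
after = cpu_times()

# Each percentage is that counter's share of the total delta.
deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas) or 1
print(', '.join(f'{100 * d / total:.1f} {name}' for name, d in zip(FIELDS, deltas)))
```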

stacker: build OCI images without host privilege

Ahoy! Recently, I've been working on a tool called stacker, which allows unprivileged users to build OCI images. The images it generates are built without uid shifting, so they look like any other OCI image generated by Docker or some other mechanism, while not requiring root (worth noting that this use case is what James Bottomley has described as his motivation for writing shiftfs).

Some base setup is required in order to make this happen, though. First, you can follow stacker's install guide to build and install it.

Next, as with any user namespaces setup, stacker needs a delegation of 65536 subuids and subgids. On my Ubuntu VM with the ubuntu user, this looks like:

$ grep ubuntu /etc/subuid
ubuntu:165536:65536
$ grep ubuntu /etc/subgid
ubuntu:165536:65536

Note that these can be any 65k range of subuids; stacker will use whatever range is delegated to the user you run it as.

Finally, stacker also needs a btrfs filesystem. Stacker was designed to build a large number of varying images from a single base image, and uses btrfs snapshots to avoid a large amount of I/O (and compression/decompression) when rolling filesystems back to their original state. For the purposes of this blog post, we can just use a loopback mounted btrfs filesystem. A slightly modified excerpt from the stacker test suite:

# btrfs setup
sudo truncate -s 100G btrfs.loop
sudo mkfs.btrfs btrfs.loop
sudo mkdir -p roots
# allow for unprivileged subvolume deletion; use a sane flushing strategy
sudo mount -o user_subvol_rm_allowed,flushoncommit,loop btrfs.loop roots
# now make sure ubuntu can actually do stuff with this filesystem
sudo chown -R ubuntu:ubuntu roots

And with that, we can actually run stacker and build an image:

stacker build -f ./stacker.yaml

What goes in stacker.yaml you ask? Consider the example from stacker's readme:

centos:
  from:
    type: tar
    url: http://example.com/centos.tar.gz
  environment:
    http_proxy: http://example.com:8080
    https_proxy: https://example.com:8080
  labels:
    foo: bar
    bar: baz
boot:
  from:
    type: built
    tag: centos
  run: |
    yum install -y openssh-server
    echo meshuggah rocks
web:
  from:
    type: built
    tag: centos
  import: ./lighttp.cfg
  run: |
    yum install -y lighttpd
    cp /stacker/lighttp.cfg /etc/lighttpd/lighttp.cfg
  entrypoint: lighttpd
  volumes:
    - /data/db
  working_dir: /var/lib/www

Each top-level key names a tag in the OCI image to be built; in this case there will be three tags at the end: centos, boot, and web (notably, this example is quite contrived :). Underneath each tag, there are the following keys:

- from: this describes the base image that stacker will start from. You can either start from some other image in the same stackerfile, a Docker image, or a tarball.
- import: a set of files to download or copy into the container. Stacker will put these files at /stacker, which will be automatically cleaned up after the commands in the run section are run and the image is finalized.
- run: this is the set of commands to run in order to build the image; they are run in a user namespaced container, with the set of imported files available in /stacker.
- environment, labels, working_dir, volumes: these all correspond exactly to the similarly named bits in the OCI image config spec, and are available for users to pass things through to the runtime environment of the image (the sketch below shows where they land in the built image's config).
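Since these keys map straight onto the OCI image config, one way to check the result is to walk the OCI layout of the built image, from index.json through the manifest to the config blob. A sketch below; the ./oci output directory and the ref-name annotation are assumptions about what stacker leaves on disk:

```python
import json
import os

OCI_DIR = 'oci'  # assumed output directory

def read_blob(digest):
    # blobs live at blobs/<algorithm>/<hex digest>
    algo, hexdigest = digest.split(':', 1)
    with open(os.path.join(OCI_DIR, 'blobs', algo, hexdigest)) as f:
        return json.load(f)

with open(os.path.join(OCI_DIR, 'index.json')) as f:
    index = json.load(f)

for desc in index['manifests']:
    manifest = read_blob(desc['digest'])
    config = read_blob(manifest['config']['digest']).get('config', {})
    tag = desc.get('annotations', {}).get('org.opencontainers.image.ref.name')
    print(tag, config.get('Entrypoint'), config.get('Labels'), config.get('WorkingDir'))
```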

That's a bit about stacker. Hopefully some more details about the internals will appear at some point :). Happy hacking!

Just how expensive is slub_debug=p?

Recently, I became interested in a debugging option in the Linux kernel:

slub_debug=p

This enables SLUB poisoning, which catches certain classes of use-after-free and out-of-bounds bugs at the cost of extra work on every allocation and free. To get a rough idea of that cost, I ran a parallel build benchmark with and without the option.

with slub_debug=p

```
Average Half load -j 2 Run (std deviation):
Elapsed Time 44.586 (1.67125)
User Time 73.874 (2.51294)
System Time 7.756 (0.741741)
Percent CPU 182.4 (0.547723)
Context Switches 13880.8 (157.161)
Sleeps 15745.2 (24.3146)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 32.702 (0.400087)
User Time 89.22 (16.3062)
System Time 8.945 (1.37014)
Percent CPU 266.4 (88.5729)
Context Switches 15701 (1929.57)
Sleeps 15722.2 (78.1875)
```

without slub_debug=p

```
Average Half load -j 2 Run (std deviation):
Elapsed Time 40.614 (0.232873)
User Time 69.978 (0.503061)
System Time 5.09 (0.182209)
Percent CPU 184.4 (0.547723)
Context Switches 13596 (121.501)
Sleeps 15740.4 (46.4629)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 30.622 (0.171523)
User Time 86.233 (17.1381)
System Time 5.874 (0.853557)
Percent CPU 270.1 (90.3431)
Context Switches 15370.3 (1875.97)
Sleeps 15777.4 (74.43)
```

So on this benchmark, slub_debug=p costs roughly 7-10% in elapsed time, and the overhead shows up mostly where you would expect from a kernel-side debugging option: system time (7.8 vs 5.1 seconds at -j 2, and 8.9 vs 5.9 at -j 4).

Linux Piter

Last weekend I attended the Linux Piter conference for the second year in a row. I have thoroughly enjoyed this conference, both for the caliber of the speakers (Christoph Hellwig and Lennart Poettering this year) and even more for the caliber of the audience. I get interesting technical questions, suggestions, and insights about my talks when I present there. I would liken it to a conference like linux.conf.au: a less corporate, more community-focused audience that is highly technical.

Getting to Russia can be complicated, but speaking there has perks beyond the technical program: the committee puts on a "cultural day" the day after the conference, showing visitors around Saint Petersburg, which is a much nicer speaker gift than a box of chocolates or yet another USB charger :)

Using the LXD API from Python

After our recent splash at ODS in Vancouver, it seems that there is a lot of interest in writing some python code to drive LXD to do various things. The first option is to use pylxd, a project maintained by a friend of mine at Canonical named Chuck Short. However, its primary client is OpenStack, so it is Python 2 only, and since it keeps its dependencies light it uses raw urllib and friends, which as you know can sometimes be...painful :)

Another option would be to use python's awesome requests module, which is considerably more user friendly. However, since LXD uses client certificates, it can be a bit challenging to get the basic bits going. Here's a small program that just does some GETs to the API, to see how it might work:

import os.path

import requests

# Reuse the client certificate pair that the lxc client generates.
conf_dir = os.path.expanduser('~/.config/lxc')
crt = os.path.join(conf_dir, 'client.crt')
key = os.path.join(conf_dir, 'client.key')

# verify=False skips validating the server's (self-signed) certificate.
print(requests.get('https://127.0.0.1:8443/1.0', verify=False, cert=(crt, key)).text)

which gives me (piped through jq for sanity):

$ python3 lxd.py | jq .
{
  "type": "sync",
  "status": "Success",
  "status_code": 200,
  "metadata": {
    "api_compat": 1,
    "auth": "trusted",
    "config": {
      "trust-password": true
    },
    "environment": {
      "backing_fs": "ext4",
      "driver": "lxc",
      "kernel_version": "3.19.0-15-generic",
      "lxc_version": "1.1.2",
      "lxd_version": "0.9"
    }
  }
}

It just piggybacks on the certificates the lxc client generates for now, but it would be great to have some python code that could generate those as well!
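As a starting point, here is a sketch that shells out to the openssl CLI to produce a self-signed pair; the key size and validity period are arbitrary choices, not necessarily what lxc itself uses:

```python
import subprocess

def generate_client_cert(key_path, crt_path, common_name='lxd-client'):
    # One self-signed cert, no passphrase on the key (-nodes), valid ~10 years.
    subprocess.check_call([
        'openssl', 'req', '-x509', '-newkey', 'rsa:4096',
        '-keyout', key_path, '-out', crt_path,
        '-days', '3650', '-nodes',
        '-subj', '/CN=' + common_name,
    ])

generate_client_cert('client.key', 'client.crt')
```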

Another bit I should point out for people is lxd's --debug flag, which prints out every request it receives and every response it sends. I found this useful while developing the default lxc client, and it will probably be useful to those of you out there who are developing your own clients.

Happy hacking!

Live Migration in LXD

There has been a lot of interest on the various mailing lists as well as internally at Canonical about the state of migration in LXD, so I thought I'd write a bit about the current state of affairs.

Migration in LXD today passes the "Doom demo" test, i.e. it works well enough to reproduce the LXD announcement demo under certain conditions, which I'll cover below. There is still a lot of ongoing work to make CRIU (the underlying migration technology) work with all these configurations, so support will eventually arrive for everything. For now, though, you'll need to use the configuration I describe below.

First, I should note that things currently won't work on a systemd host. Since systemd re-mounts the rootfs as MS_SHARED, lots of things automatically become shared mounts, which confuses CRIU. There are several mailing list threads about ongoing work with respect to shared mounts in CRIU and I expect something to be merged that will resolve the situation shortly, but for now your host machine needs to be a non-systemd host (i.e. trusty or utopic will work just fine, but not vivid).

You'll need to install the daily versions of liblxc and lxd from their respective PPAs on each host:

sudo apt-add-repository -y ppa:ubuntu-lxc/daily
sudo apt-add-repository -y ppa:ubuntu-lxc/lxd-git-master
sudo apt-get update
sudo apt-get install lxd

Also, you'll need to uninstall lxcfs on both hosts:

sudo apt-get remove lxcfs

liblxc currently doesn't support migrating the mount configuration that lxcfs uses, although there is some work on that as well. The overmounting issue has been fixed in lxcfs, so I expect to land some patches in liblxc soon that will make lxcfs work.

Next, you'll want to set a password for your new lxd instance:

lxc config set password foo

You need some images in lxd, which can be acquired easily enough by lxd-images (of course, this only needs to be done on the source host of the migration):

lxd-images import lxc ubuntu trusty amd64 --alias ubuntu

You'll also need to set a few configuration items in lxd. First, the container needs to be privileged, although there is yet more ongoing work to remove this restriction. There are also a few things that CRIU does not support, so we need to set our container config to respect those as well. You can do all of this using lxd's profiles mechanism, that is:

lxc config profile create migratable
lxc config profile edit migratable

And paste the following content in instead of what's there:

name: migratable
config:
  raw.lxc: |
    lxc.console = none
    lxc.cgroup.devices.deny = c 5:1 rwm
    lxc.mount.auto =
    lxc.mount.auto = proc:mixed sys:mixed
  security.privileged: "true"
devices:
  eth0:
    nictype: bridged
    parent: lxcbr0
    type: nic

Now, launch your container:

lxc launch ubuntu migratee -p migratable

Finally, add both of your LXDs as non unix-socket remotes (required for now, but not forever):

lxc remote add lxd thishost:8443   # don't use localhost here
lxc remote add lxd2 otherhost:8443 # use a publicly addressable name

Profiles used by a particular container need to be present on both the source of the migration and the sink, so we should copy the profile to the sink as well:

lxc config profile copy migratable lxd2:

And now, you're ready for the magic!

lxc start migratee
lxc move lxd:migratee lxd2:migratee

With luck, you'll have migrated the container to lxd2. Of course, things don't always go right the first time. The full log file for the migration attempts should be available in /var/log/lxd/migratee/migration_{dump|restore}_<timestamp>.log, on the respective host where the dump or restore took place. If you aren't successful in migrating things (or parsing the dump/restore log), feel free to mail lxc-users, and I can help you debug what went wrong.

Happy hacking!

setproctitle() in Linux

While working on LXD, one of the things I occasionally do is submit patches to LXC (e.g. the migration work or other things). In particular, the LXC monitor process (the process that's the parent of the container's init) is fork()ed from the C API call, so it inherits the name of whatever binary invoked the API (in our case, LXD). This can be slightly confusing (especially in the case where LXD dies but a process that looks like it is named LXD lives on). Should be easy enough to fix, right? Lots of *nixes seem to have a setproctitle() function to correct this, so we'll just call that!

And lo, there is prctl() which has a PR_SET_NAME mode that we can use. Done! Except for one small caveat from the man page:

The name can be up to 16 bytes long, and should be null-terminated if it contains fewer bytes.

Yes, you read that right: 16 bytes. Not useful for a lot of process names, especially something like the ideal title for an LXC monitor:

[lxc monitor] /var/lib/lxc container-name
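You can watch the kernel truncate it with a quick ctypes experiment (PR_SET_NAME is 15 in <linux/prctl.h>):

```python
import ctypes

PR_SET_NAME = 15  # from <linux/prctl.h>

libc = ctypes.CDLL('libc.so.6', use_errno=True)
libc.prctl(PR_SET_NAME, b'[lxc monitor] /var/lib/lxc container-name', 0, 0, 0)

# comm is capped at 16 bytes including the trailing NUL.
with open('/proc/self/comm') as f:
    print(f.read().strip())  # prints '[lxc monitor] /'
```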

Ok, so how hard can it be to write our own? If you look around on the internet, a lot of people suggest something like strcpy(argv[0], "my-proc-name"). That works, but what happens if your process name is longer than the original? You smash the stack! Try cat /proc/<pid>/environ on the program below:

#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    char buf[1024];

    /* build a 1023-character name, much longer than argv[0] */
    memset(buf, '0', sizeof(buf));
    buf[1023] = 0;

    /* overwrite argv[0], and whatever lies after it on the stack */
    strncpy(argv[0], buf, sizeof(buf));

    sleep(10000);
    return 0;
}

If your process name is longer than the original environment, you overwrite something else potentially more useful, which could cause all sorts of nastiness, especially as something that runs as root.

The thing is, the environment isn't necessarily all that useful; it doesn't indicate the current environment, just the initial environment. So we could use that space for the process name, as long as the kernel knew the environment wasn't valid any more. prctl() to the rescue again, we can pass it PR_SET_MM and PR_SET_MM_ENV_{START|END} to update these locations.

Problem solved! Except that we want to do this from liblxc.so, which has no concept of argv. prctl() has no PR_GET_MM calls, so we can't just go the other way with it. We could invent some ugly API where you have to pass argv in, but that would require users to either set their argv pointers up front, or carry them around until they needed them, or something similarly ugly. Instead, we steal an idea from the CRIU codebase: we look in /proc/<pid>/stat. This file has (in fields 48-51, if your kernel is new enough) exactly the values you'd want from a hypothetical PR_GET_MM_*! Thus, we can use this file to find out inside of liblxc where it is safe to put the new proctitle.
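Here is what that lookup looks like, sketched in Python rather than liblxc's C (fields 48-51 are arg_start, arg_end, env_start, and env_end, and need a 3.5+ kernel):

```python
with open('/proc/self/stat') as f:
    stat = f.read()

# comm (field 2) is parenthesized and may contain spaces,
# so split on the last ')' and count fields from there.
fields = stat.rsplit(')', 1)[1].split()

# fields[0] is field 3 of the file, so field 48 is at index 45.
arg_start, arg_end, env_start, env_end = (int(x) for x in fields[45:49])
print('args: [%#x, %#x) env: [%#x, %#x)' % (arg_start, arg_end, env_start, env_end))
```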

Putting it all together, liblxc now has an implementation of setproctitle() that will overwrite your initial environment (but is careful not to overwrite anything else), which can be used to set process titles longer than 16 bytes. Enjoy!

Live Migration of Linux Containers

Recently, I've been playing around with checkpoint and restore of Linux containers. One of the obvious applications is checkpointing on one host and restoring on another (i.e. live migration). Live migration has all sorts of interesting applications, so it is nice to know that at least a proof of concept of it works today.

Anyway, onto the interesting bits! The first thing I did was create two VMs and install the development versions of CRIU and LXC on both hosts:

sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get install lxc

sudo apt-get install build-essential protobuf-c-compiler
git clone https://github.com/xemul/criu && cd criu && sudo make install

Then, I created a container:

sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64

Since the work on container checkpoint/restore is so young, not all container configurations are supported. In particular, I had to add the following to my config:

cat << EOF | sudo tee -a /var/lib/lxc/u1/config
# hax for criu
lxc.console = none
lxc.tty = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF

Finally, although the lxc-checkpoint tool allows us to checkpoint and restore containers, there is no support for migration directly today. There are several tools in the works for this, but for now we can just use a cheesy shell script:

cat > migrate << "EOF"
#!/bin/sh
set -e

usage() {
  echo "usage: $0 container user@host.to.migrate.to"
  exit 1
}

if [ "$(id -u)" != "0" ]; then
  echo "ERROR: Must run as root."
  usage
fi

if [ "$#" != "2" ]; then
  echo "Bad number of args."
  usage
fi

name=$1
host=$2

checkpoint_dir=/tmp/checkpoint

do_rsync() {
  rsync -aAXHltzh --progress --numeric-ids --devices --rsync-path="sudo rsync" $1 $host:$1
}

# we assume the same lxcpath on both hosts, that is bad.
LXCPATH=$(lxc-config lxc.lxcpath)

lxc-checkpoint -n $name -D $checkpoint_dir -s -v

do_rsync $LXCPATH/$name/
do_rsync $checkpoint_dir/

ssh $host "sudo lxc-checkpoint -r -n $name -D $checkpoint_dir -v"
ssh $host "sudo lxc-wait -n u1 -s RUNNING"
EOF
chmod +x migrate

Now, for the magic show! I've set up the container I created above to be a web server running micro-httpd that serves an incredibly important message:

$ ssh ubuntu@$(sudo lxc-info -n u1 -H -i)
ubuntu@u1:~$ sudo apt-get install micro-httpd
ubuntu@u1:~$ echo "Meshuggah is the best metal band." | sudo tee /var/www/index.html
ubuntu@u1:~$ exit
$ curl -s $(sudo lxc-info -n u1 -H -i)
Meshuggah is the best metal band.

Let's migrate!

$ sudo ./migrate u1 ubuntu@criu2.local
  # lots of rsync output...
$ ssh ubuntu@criu2.local 'curl -s $(sudo lxc-info -n u1 -H -i)'
Meshuggah is the best metal band.

Of course, there are several caveats to this. You've got to add the lines above to your config, which means you can't dump containers with ttys. Since containers have the host's fusectl bind mounted and fuse mounts aren't supported by criu, containers or hosts using fuse can't be dumped. You can't migrate unprivileged containers yet. There are probably others that I'm forgetting, though a list of troubleshooting steps is available at criu.org/LXC#Troubleshooting.

There is ongoing work in both CRIU and LXC to get rid of all the caveats above, so stay tuned!