WTF, Linux?
Sometimes, weird things happen on your system.
The first thing to look at on any given system is the output of top:
```
top - 16:34:38 up 2 days,  1:46, 19 users,  load average: 0.74, 0.45, 0.39
Tasks: 290 total,   1 running, 288 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.5 us,  5.2 sy,  0.0 ni, 91.8 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem :  15785.9 total,   1554.9 free,   3441.4 used,  10789.5 buff/cache
MiB Swap:    976.0 total,    976.0 free,      0.0 used.  11682.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10856 tycho     20   0 1575288  66744  15424 S   5.9   0.4  25:42.61 hangups
 4046 root      20   0  311044  48760  27076 S   4.0   0.3  30:22.79 Xorg
12812 tycho     20   0 1127988 316872 152760 S   3.0   2.0  13:29.05 chrome
12852 tycho     20   0  370588 103756  61092 S   2.0   0.6   6:22.84 chrome
30949 tycho     20   0   51804  18328  11976 S   2.0   0.1   0:00.81 urxvt
31013 tycho     20   0   12268   4316   3428 R   2.0   0.0   0:01.20 top
  642 root     -51   0       0      0      0 S   1.0   0.0   8:55.92 irq/134-iwlwifi
26517 tycho     20   0  603672 147324  78360 S   1.0   0.9   0:09.02 chrome
    1 root      20   0  166772  11124   7692 S   0.0   0.1   0:07.28 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.09 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0   0.0   0:03.80 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0   0.0   3:36.11 rcu_sched
   11 root      rt   0       0      0      0 S   0.0   0.0   0:00.70 migration/0
   12 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0
```
`top` allows you to sort by what's using lots of CPU, memory, etc. This machine is not loaded at all, according to the load average:
load average: 0.74, 0.45, 0.39
The first number is the load average in the last minute, the second number is the load average over the last five minutes, and the third number is the load average over the last 15 minutes. Note that load averages are a multiple of cores, so a machine with four cores and a load average of 4 is totally CPU bound. Additionally, the load average is really the number of processes that "wanted" to run, not the number of processes that were actually running. So a machine with four cores and a load average of 16 is 4x oversubscribed on CPU.
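For a quick gut check, you can compute the load per core directly; here's a small sketch using `/proc/loadavg` and `nproc`:

```
# 1-minute load divided by core count; > 1.0 means the CPUs are oversubscribed
awk -v cores="$(nproc)" '{ printf "load per core: %.2f\n", $1 / cores }' /proc/loadavg
```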
But the most magical line in top is the per-cpu state line:
%Cpu(s): 2.5 us, 5.2 sy, 0.0 ni, 91.8 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
Often, a careful reading of this line can tell you what is going on with a system. The numbers here are percentages, and are similar to load averages in that percentages > 100 are still sensible. But the most interesting things here are the suffixes. From the top man page:
us, user    : time running un-niced user processes
sy, system  : time running kernel processes
ni, nice    : time running niced user processes
id, idle    : time spent in the kernel idle handler
wa, IO-wait : time waiting for I/O completion
hi          : time spent servicing hardware interrupts
si          : time spent servicing software interrupts
st          : time stolen from this vm by the hypervisor
Mostly, you'll see high numbers in the `us`, `sy`, `id`, and `wa` columns. Large values in `us` or `ni` mean that the workload is CPU bound. These will generally correlate with non-zero values in `sy`, since `sy` indicates time stuff was running in the kernel. Large values in `sy` relative to `us` almost always indicate a problem.
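When you do see `sy` dominating, a reasonable next step is to figure out which processes are spending all that time in the kernel; one way to sketch that, assuming sysstat's `pidstat` is installed:

```
# per-process user (%usr) vs. kernel (%system) CPU, sampled every second
pidstat -u 1
```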
stacker: build OCI images without host privilege
Ahoy! Recently, I've been working on a tool called stacker, which allows unprivileged users to build OCI images. The generated images are built without uid shifting, so they look like any other OCI image generated by Docker or some other mechanism, while not requiring root (worth noting that this is what James Bottomley has described as his motivation for writing shiftfs).
Some base setup is required in order to make this happen, though. First, you can follow stacker's install guide to build and install it.
Next, as with any user namespaces setup, stacker needs a 65k uid delegation. On my ubuntu VM with the ubuntu user, this looks like:
$ grep ubuntu /etc/subuid
ubuntu:165536:65536
$ grep ubuntu /etc/subgid
ubuntu:165536:65536
Note that these can be any 65k range of subuids; stacker will use whatever range is delegated to the user you run it as.
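If the user doesn't have a delegation yet, recent versions of `usermod` can add one; the range below is just an example:

```
# delegate 65536 uids and gids starting at 165536 to the ubuntu user
sudo usermod --add-subuids 165536-231071 --add-subgids 165536-231071 ubuntu
```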
Finally, stacker also needs a btrfs filesystem. Stacker was designed to build a large number of varying images from a single base image, and it uses btrfs to avoid a large amount of I/O (and compression/decompression) when undiffing filesystems back to their original state. For the purposes of this blog post, we can just use a loopback mounted btrfs filesystem. A slightly modified excerpt from the stacker test suite:
# btrfs setup
sudo truncate -s 100G btrfs.loop
sudo mkfs.btrfs btrfs.loop
sudo mkdir -p roots
# allow for unprivileged subvolume deletion; use a sane flushing strategy
sudo mount -o user_subvol_rm_allowed,flushoncommit,loop btrfs.loop roots
# now make sure ubuntu can actually do stuff with this filesystem
sudo chown -R ubuntu:ubuntu roots
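A quick sanity check that the unprivileged setup works (it's `user_subvol_rm_allowed` that makes the delete possible without root):

```
# as the ubuntu user, no sudo required
btrfs subvolume create roots/test
btrfs subvolume delete roots/test
```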
And with that, we can actually run stacker and build an image:
stacker build -f ./stacker.yaml
What goes in `stacker.yaml`, you ask? Consider the example from stacker's readme:
centos:
    from:
        type: tar
        url: http://example.com/centos.tar.gz
    environment:
        http_proxy: http://example.com:8080
        https_proxy: https://example.com:8080
    labels:
        foo: bar
        bar: baz
boot:
    from:
        type: built
        tag: centos
    run: |
        yum install openssh-server
        echo meshuggah rocks
web:
    from:
        type: built
        tag: centos
    import: ./lighttp.cfg
    run: |
        yum install lighttpd
        cp /stacker/lighttp.cfg /etc/lighttpd/lighttp.cfg
    entrypoint: lighttpd
    volumes:
        - /data/db
    working_dir: /var/lib/www
The top level describes the name of a tag in the OCI image to be built; in this case there will be three tags at the end: `centos`, `boot`, and `web` (notably, this example is quite contrived :). Underneath those, there are the following keys:

* `from`: this describes the base image that stacker will start from. You can either start from some other image in the same stackerfile, a Docker image, or a tarball.
* `import`: a set of files to download or copy into the container. Stacker will put these files at `/stacker`, which will be automatically cleaned up after the commands in the `run` section are run and the image is finalized.
* `run`: this is the set of commands to run in order to build the image; they are run in a user namespaced container, with the set of imported files available in `/stacker`.
* `environment`, `labels`, `working_dir`, `volumes`: these all correspond exactly to the similarly named bits in the OCI image config spec, and are available for users to pass things through to the runtime environment of the image.
That's a bit about stacker. Hopefully some more details about the internals will appear at some point :). Happy hacking!
Just how expensive is slub_debug=p?
Recently, I became interested in a debugging option in the Linux kernel, `slub_debug=p`, which poisons slab allocations in order to catch use-after-free and related bugs, and I wondered just how expensive it is. Here are the results of a kernel build benchmark with `slub_debug=p`:
```
Average Half load -j 2 Run (std deviation):
Elapsed Time 44.586 (1.67125)
User Time 73.874 (2.51294)
System Time 7.756 (0.741741)
Percent CPU 182.4 (0.547723)
Context Switches 13880.8 (157.161)
Sleeps 15745.2 (24.3146)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 32.702 (0.400087)
User Time 89.22 (16.3062)
System Time 8.945 (1.37014)
Percent CPU 266.4 (88.5729)
Context Switches 15701 (1929.57)
Sleeps 15722.2 (78.1875)
```
and without `slub_debug=p`:
```
Average Half load -j 2 Run (std deviation):
Elapsed Time 40.614 (0.232873)
User Time 69.978 (0.503061)
System Time 5.09 (0.182209)
Percent CPU 184.4 (0.547723)
Context Switches 13596 (121.501)
Sleeps 15740.4 (46.4629)

Average Optimal load -j 4 Run (std deviation):
Elapsed Time 30.622 (0.171523)
User Time 86.233 (17.1381)
System Time 5.874 (0.853557)
Percent CPU 270.1 (90.3431)
Context Switches 15370.3 (1875.97)
Sleeps 15777.4 (74.43)
```

So on this machine, poisoning costs roughly 7-10% in elapsed time, driven by a roughly 50% jump in system time.
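As an aside, if you want to double check that poisoning is actually enabled on a running system, slub exposes a flag per cache in sysfs; the cache name below is just an example:

```
# prints 1 when poisoning is enabled for this cache
cat /sys/kernel/slab/kmalloc-64/poison
```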
Linux Piter
Last weekend I attended the Linux Piter conference for the second year in a row. I have thoroughly enjoyed this conference both years, for the caliber of the speakers (Christoph Hellwig and Lennart Poettering this year), but even more for the caliber of the audience: I receive interesting technical questions, suggestions, and insights about my talks when I present there. I would liken it to a conference like linux.conf.au: a less corporate, more community focused audience which is highly technical.
Getting to Russia can be complicated for most, but speaking here is rewarding beyond just the technical aspects: the program committee puts on a "cultural day" the day after the conference, showing visitors around Saint Petersburg, which is a much nicer speaker gift than a box of chocolates or another USB charger :)
Using the LXD API from Python
After our recent splash at ODS in Vancouver, it seems that there is a lot of interest in writing some python code to drive LXD to do various things. The first option is to use pylxd, a project maintained by a friend of mine at Canonical named Chuck Short. However, the primary client of pylxd is OpenStack, and thus it is python2; it also avoids adding a lot of dependencies, using raw python urllib and friends, which as you know can sometimes be...painful :)
Another option would be to use python's awesome requests module, which is considerably more user friendly. However, since LXD uses client certificates, it can be a bit challenging to get the basic bits going. Here's a small program that just does some GETs to the API, to see how it might work:
import os.path

import requests

# piggyback on the client certificates that the lxc client generates
conf_dir = os.path.expanduser('~/.config/lxc')
crt = os.path.join(conf_dir, 'client.crt')
key = os.path.join(conf_dir, 'client.key')

# verify=False since LXD's server certificate is self-signed
print(requests.get('https://127.0.0.1:8443/1.0', verify=False, cert=(crt, key)).text)
which gives me (piped through jq for sanity):
$ python3 lxd.py | jq .
{
"type": "sync",
"status": "Success",
"status_code": 200,
"metadata": {
"api_compat": 1,
"auth": "trusted",
"config": {
"trust-password": true
},
"environment": {
"backing_fs": "ext4",
"driver": "lxc",
"kernel_version": "3.19.0-15-generic",
"lxc_version": "1.1.2",
"lxd_version": "0.9"
}
}
}
It just piggybacks on the certificates generated by the lxc client for now, but it would be great to have some python code that could generate those as well!
Another bit I should point out for people is lxd's `--debug` flag, which prints out every request it receives and every response it sends. I found this useful while developing the default `lxc` client, and it will probably be useful to those of you out there who are developing your own clients.
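If you want to watch that debug output while poking at the API without writing any python, the same GET works from curl, assuming the same certificate paths as above (`-k` because the server certificate is self-signed):

```
curl -sk --cert ~/.config/lxc/client.crt --key ~/.config/lxc/client.key \
    https://127.0.0.1:8443/1.0 | jq .
```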
Happy hacking!
Live Migration in LXD
There has been a lot of interest on the various mailing lists as well as internally at Canonical about the state of migration in LXD, so I thought I'd write a bit about the current state of affairs.
Migration in LXD today passes the "Doom demo" test, i.e. it works well enough to reproduce the LXD announcement demo under certain conditions, which I'll cover below. There is still a lot of ongoing work to make CRIU (the underlying migration technology) work with all these configurations, so support will eventually arrive for everything. For now, though, you'll need to use the configuration I describe below.
First, I should note that things currently won't work on a systemd host. Since systemd re-mounts the rootfs as `MS_SHARED`, lots of things automatically become shared mounts, which confuses CRIU. There are several mailing list threads about ongoing work with respect to shared mounts in CRIU, and I expect something to be merged that will resolve the situation shortly, but for now your host machine needs to be a non-systemd host (i.e. trusty or utopic will work just fine, but not vivid).
You'll need to install the daily versions of liblxc and lxd from their respective PPAs on each host:
sudo apt-add-repository -y ppa:ubuntu-lxc/daily
sudo apt-add-repository -y ppa:ubuntu-lxc/lxd-git-master
sudo apt-get update
sudo apt-get install lxd
Also, you'll need to uninstall `lxcfs` on both hosts:
sudo apt-get remove lxcfs
`liblxc` currently doesn't support migrating the mount configuration that lxcfs uses, although there is some work on that as well. The overmounting issue has been fixed in lxcfs, so I expect to land some patches in liblxc soon that will make lxcfs work.
Next, you'll want to set a password for your new lxd instance:
lxc config set password foo
You need some images in `lxd`, which can be acquired easily enough via lxd-images (of course, this only needs to be done on the source host of the migration):
lxd-images import lxc ubuntu trusty amd64 --alias ubuntu
You'll also need to set a few configuration items in lxd. First, the container needs to be privileged, although there is yet more ongoing work to remove this restriction. There are also a few things that CRIU does not support, so we need to set our container config to respect those as well. You can do all of this using lxd's profiles mechanism, that is:
lxc config profile create migratable
lxc config profile edit migratable
And paste the following content in instead of what's there:
name: migratable
config:
  raw.lxc: |
    lxc.console = none
    lxc.cgroup.devices.deny = c 5:1 rwm
    lxc.mount.auto =
    lxc.mount.auto = proc:mixed sys:mixed
  security.privileged: "true"
devices:
  eth0:
    nictype: bridged
    parent: lxcbr0
    type: nic
Now, launch your container:
lxc launch ubuntu migratee -p migratable
Finally, add both of your LXDs as non-unix-socket remotes (required for now, but not forever):
lxc remote add lxd thishost:8443 # don't use localhost here
lxc remote add lxd2 otherhost:8443 # use a publicly addressable name
Profiles used by a particular container need to be present on both the source of the migration and the sink, so we should copy the profile to the sink as well:
lxc config profile copy migratable lxd2:
And now, you're ready for the magic!
lxc start migratee
lxc move lxd:migratee lxd2:migratee
With luck, you'll have migrated the container to `lxd2`. Of course, things don't always go right the first time. The full log file for the migration attempts should be available in `/var/log/lxd/migratee/migration_{dump|restore}_<timestamp>.log` on the respective host where the dump or restore took place. If you aren't successful in migrating things (or parsing the dump/restore log), feel free to mail lxc-users, and I can help you debug what went wrong.
Happy hacking!
setproctitle() in Linux
While working on LXD, one of the things I occasionally do is submit patches to LXC (e.g. the migration work or other things). In particular, the LXC monitor process (the process that's the parent of a container's init) is `fork()`ed in the C API call, so it inherits the name of whatever binary invoked the API (in our case, LXD). This could be slightly confusing (especially in the case where LXD dies but a process that looks like it is named LXD lives on). Should be easy enough to fix, right? Lots of *nixes seem to have a `setproctitle()` function to correct this, so we'll just call that!
And lo, there is `prctl()`, which has a `PR_SET_NAME` mode that we can use. Done! Except for one small caveat from the man page:
The name can be up to 16 bytes long, and should be null-terminated if it contains fewer bytes.
Yes, you read that right: 16 bytes. That's not useful for a lot of process names, especially something which would be ideal for LXC:
[lxc monitor] /var/lib/lxc container-name
Ok, so how hard can it be to write our own? If you look around on the internet, a lot of people suggest something like `strcpy(argv[0], "my-proc-name")`. That works, but what happens if your process name is longer than the original? You smash the stack! Try `cat /proc/<pid>/environ` on the program below:
#include <string.h>
#include <stdio.h>
#include <unistd.h>  /* for sleep() */

int main(int argc, char* argv[]) {
    /* build a 1023-character name, almost certainly longer than argv[0] */
    char buf[1024];
    memset(buf, '0', sizeof(buf));
    buf[1023] = 0;

    /* overwrite argv[0], running past its end into the environment */
    strncpy(argv[0], buf, sizeof(buf));
    sleep(10000);
    return 0;
}
If your process name is longer than the original environment, you overwrite something else potentially more useful, which could cause all sorts of nastiness, especially as something that runs as root.
The thing is, the environment isn't necessarily all that useful; it doesn't indicate the current environment, just the initial environment. So we could use that space for the process name, as long as the kernel knew the environment wasn't valid any more. `prctl()` to the rescue again: we can pass it `PR_SET_MM` and `PR_SET_MM_ENV_{START|END}` to update these locations.
Problem solved! Except that we want to do this from `liblxc.so`, which has no concept of argv. `prctl()` has no `PR_GET_MM` calls, so we can't just go the other way with it. We could invent some ugly API where you have to pass it in, but that would require users to either set their argv pointers up front, or carry it around until they needed it, or something similarly ugly. Instead, we steal an idea from the CRIU codebase: we look in `/proc/<pid>/stat`. This file has (in columns 48-51, if your kernel is new enough) exactly the values a `PR_GET_MM` would give you! Thus, we can use this file to find out from inside liblxc where it is safe to put the new proctitle.
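You can eyeball these fields from a shell, too, with the caveat that naive whitespace splitting is only correct when the comm field (column 2) contains no spaces:

```
# prints arg_start, arg_end, env_start, env_end for the awk process itself
awk '{ print "arg:", $48, $49, "env:", $50, $51 }' /proc/self/stat
```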
Putting it all together, liblxc now has an implementation of `setproctitle()` that will overwrite your initial environment (but is careful not to overwrite anything else), which can be used to set process titles longer than 16 bytes. Enjoy!
Live Migration of Linux Containers
Recently, I've been playing around with checkpoint and restore of Linux containers. One of the obvious applications is checkpointing on one host and restoring on another (i.e. live migration). Live migration has all sorts of interesting applications, so it is nice to know that at least a proof of concept of it works today.
Anyway, onto the interesting bits! The first thing I did was create two vms, and install criu's and lxc's development versions on both hosts:
sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get install lxc
sudo apt-get install build-essential protobuf-c-compiler
git clone https://github.com/xemul/criu && cd criu && sudo make install
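Before going any further, it's worth asking criu whether the kernel on each VM supports everything it needs; criu ships a self-test for exactly this:

```
# criu prints errors for any missing kernel features, "Looks good." otherwise
sudo criu check
```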
Then, I created a container:
sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64
Since the work on container checkpoint/restore is so young, not all container configurations are supported. In particular, I had to add the following to my config:
cat << EOF | sudo tee -a /var/lib/lxc/u1/config
# hax for criu
lxc.console = none
lxc.tty = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF
Finally, although the lxc-checkpoint tool allows us to checkpoint and restore containers, there is no support for migration directly today. There are several tools in the works for this, but for now we can just use a cheesy shell script:
cat > migrate << "EOF"
#!/bin/sh
set -e

usage() {
    echo $0 container user@host.to.migrate.to
    exit 1
}

if [ "$(id -u)" != "0" ]; then
    echo "ERROR: Must run as root."
    usage
fi

if [ "$#" != "2" ]; then
    echo "Bad number of args."
    usage
fi

name=$1
host=$2
checkpoint_dir=/tmp/checkpoint

do_rsync() {
    rsync -aAXHltzh --progress --numeric-ids --devices --rsync-path="sudo rsync" $1 $host:$1
}

# we assume the same lxcpath on both hosts, that is bad.
LXCPATH=$(lxc-config lxc.lxcpath)

lxc-checkpoint -n $name -D $checkpoint_dir -s -v
do_rsync $LXCPATH/$name/
do_rsync $checkpoint_dir/
ssh $host "sudo lxc-checkpoint -r -n $name -D $checkpoint_dir -v"
ssh $host "sudo lxc-wait -n $name -s RUNNING"
EOF
chmod +x migrate
Now, for the magic show! I've set up the container I created above to be a web server running micro-httpd that serves an incredibly important message:
$ ssh ubuntu@$(sudo lxc-info -n u1 -H -i)
ubuntu@u1:~$ sudo apt-get install micro-httpd
ubuntu@u1:~$ echo "Meshuggah is the best metal band." | sudo tee /var/www/index.html
ubuntu@u1:~$ exit
$ curl -s $(sudo lxc-info -n u1 -H -i)
Meshuggah is the best metal band.
Let's migrate!
$ sudo ./migrate u1 ubuntu@criu2.local
# lots of rsync output...
$ ssh ubuntu@criu2.local 'curl -s $(sudo lxc-info -n u1 -H -i)'
Meshuggah is the best metal band.
Of course, there are several caveats to this. You've got to add the lines above to your config, which means you can't dump containers with ttys. Since containers have the host's fusectl bind mounted and fuse mounts aren't supported by criu, containers or hosts using fuse can't be dumped. You can't migrate unprivileged containers yet. There are probably others that I'm forgetting, though a list of troubleshooting steps is available at criu.org/LXC#Troubleshooting.
There is ongoing work in both CRIU and LXC to get rid of all the caveats above, so stay tuned!