2017-07-10 09:49:12 +00:00
---
2018-01-06 16:22:12 +00:00
title: Albatross - provisioning, deploying, managing, and monitoring virtual machines
2017-07-10 09:49:12 +00:00
author: hannes
tags: mirageos, deployment, provisioning
abstract: all we need is X.509
---
2023-05-16 17:21:47 +00:00
EDIT (2023-05-16): Please take a look at [the updated article](/Posts/Albatross).
2017-07-10 09:49:12 +00:00
## How to deploy unikernels?
MirageOS has a pretty good story on how to compose your OCaml libraries into a
virtual machine image. The `mirage` command line utility contains all the
knowledge about which backend requires which library. This enables it to write a
unikernel using abstract interfaces (such as a network device). Additionally the
`mirage` utility can compile for any backend. (It is still unclear whether this
is a sustainable idea, since the `mirage` tool needs to be adjusted for every
new backend, but also for additional implementations of an interface.)
Once a virtual machine image has been created, it needs to be deployed. I run
my own physical hardware, with all the associated upsides and downsides.
Specifically I run several physical [FreeBSD](https://freebsd.org) machines on
the Internet, and use the [bhyve](http://bhyve.org) hypervisor with MirageOS as
2021-09-08 17:43:52 +00:00
described [earlier](/Posts/Solo5). Recently, Martin
2017-07-10 09:49:12 +00:00
Lucina
[developed](https://github.com/Solo5/solo5/pull/171/commits/e67a007b75fa3fcee5c082aab04c9fe9e897d779)
a
[`vmm`](https://svnweb.freebsd.org/base/head/sys/amd64/include/vmm.h?view=markup)
backend for [Solo5](https://github.com/solo5/solo5). This means there is no
need to use virtio anymore, or grub2-bhyve, or the bhyve binary (which links
`libvmmapi` that already had a [security
advisory](https://www.freebsd.org/security/advisories/FreeBSD-SA-16:38.bhyve.asc)).
Instead of the bhyve binary, a ~70kB small `ukvm-bin` binary (dynamically
linking libc) can be used which is the solo5 virtual machine monitor on the host
side.
Until now, I manually created and deployed virtual machines using shell scripts,
ssh logins, and a network file system shared with the FreeBSD virtual machine
which builds my MirageOS unikernels.
But there are several drawbacks with this approach, the biggest is that sharing
resources is hard - to enable a friend to run their unikernel on my server,
they'll need to have a user account, and even privileged permissions to
create virtual network interfaces and execute virtual machines.
To get rid of these ad-hoc shell scripts and copying of virtual machine images,
I developed an UNIX daemon which accomplishes the required work. This daemon
waits for (mutually!) authenticated network connections, and provides the
2018-01-11 14:02:01 +00:00
desired commands; to create a new virtual machine, to acquire a block device of
a given size, to destroy a virtual machine, to stream the console output of a
2017-07-10 09:49:12 +00:00
virtual machine.
## System design
The system bears minimalistic characteristics. The single interface to the
outside world is a TLS stream over TCP. Internally, there is a family of
processes, one of which has superuser privileges, communicating via unix domain
2017-07-10 10:26:01 +00:00
sockets. The processes do not need any persistent storage (apart from the
2017-07-10 09:49:12 +00:00
revocation lists). A brief enumeration of the processes is provided below:
* `vmmd` (superuser privileges), which terminates TLS sessions, proxies messages, and creates and destroys virtual machines (including setup and teardown of network interfaces and virtual block devices)
* `vmm_stats` periodically gathers resource usage and network interface statistics
* `vmm_console` reads console output of every provided fifo, and stores this in a ringbuffer, replaying to a client on demand
* `vmm_log` consumes the event log (login, starting, and stopping of virtual machines)
The system uses X.509 certificates as tokens. These are authenticated key value
stores. There are four shapes of certificates: a *virtual machine certificate*
which embeds the entire virtual machine image, together with configuration
information (resource usage, how many and which network interfaces, block device
access); a *command certificate* (for interactive use, allowing (a subset of)
commands such as attaching to console output); a *revocation certificate* which
contains a list of revoked certificates; and a *delegation certificate* to
distribute resources to someone else (an intermediate CA certificate).
The resources which can be controlled are CPUs, memory consumption, block
storage, and access to bridge interfaces (virtual switches) - encoded in the
virtual machine and delegation certificates. Additionally, delegation
certificates can limit the number of virtual machines.
Leveraging the X.509 system ensures that the client always has to present a
certificate chain from the root certificate. Each intermediate certificate is a
delegation certificate, which may further restrict resources. The serial
numbers of the chain is used as unique identifier for each virtual machine and
other certificates. The chain restricts access of the leaf certificate as well:
only the subtree of the chain can be viewed. E.g. if there are delegations to
both Alice and Bob from the root certificate, they can not see each other
virtual machines.
Connecting to the vmmd requires a TLS client, a CA certificate, a leaf
certificate (and the delegation chain) and its private key. In the background,
it is a multi-step process using TLS: first, the client establishes a TLS
2018-01-11 14:02:01 +00:00
connection where it authenticates the server using the CA certificate, then the
2017-07-10 09:49:12 +00:00
server demands a TLS renegotiation where it requires the client to authenticate
with its leaf certificate and private key. Using renegotiation over the
encrypted channel prevents passive observers to see the client certificate in
clear.
Depending on the leaf certificate, the server logic is slightly different. A
command certificate opens an interactive session where - depending on
permissions encoded in the certificate - different commands can be issued: the
console output can be streamed, the event log can be viewed, virtual machines
can be destroyed, statistics can be collected, and block devices can be managed.
When a virtual machine certificate is presented, the desired resource usage is
checked against the resource policies in the delegation certificate chain and
the currently running virtual machines. If sufficient resources are free, the
embedded virtual machine is started. In addition to other resource information,
a delegation certificate may embed IP usage, listing the network configuration
(gateway and netmask), and which addresses you're supposed to use. Boot
arguments can be encoded in the certificate as well, they are just passed to the
virtual machine (for easy deployment of off-the-shelf systems).
If a revocation certificate is presented, the embodied revocation list is
verified, and stored on the host system. Revocation is enforced by destroying
any revoked virtual machines and terminating any revoked interactive sessions.
If a delegation certificate is revoked, additionally the connected block devices
are destroyed.
The maximum size of a virtual machine image embedded into a X.509 certificate
transferred over TLS is 2 ^ 24 - 1 bytes, roughly 16 MB. If this turns out to
be not sufficient, compression may help. Or staging of deployment.
## An example
Instructions on how to setup `vmmd` and the certificate authority are in the
2018-01-06 16:22:12 +00:00
README file of the [`albatross` git repository](https://github.com/hannesm/albatross). Here
2017-07-10 09:49:12 +00:00
is some (stripped) terminal output:
```bash
> openssl x509 -text -noout -in admin.pem
Certificate:
Data:
Serial Number: b7:aa:77:f6:ca:08:ee:6a
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=dev
Subject: CN=admin
X509v3 extensions:
1.3.6.1.4.1.49836.42.42: ....
1.3.6.1.4.1.49836.42.0: ...
> openssl asn1parse -in admin.pem
403:d=4 hl=2 l= 18 cons: SEQUENCE
405:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
417:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020780
423:d=4 hl=2 l= 17 cons: SEQUENCE
425:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.0
437:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020100
> openssl asn1parse -in hello.pem
410:d=4 hl=2 l= 18 cons: SEQUENCE
412:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
424:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020520
430:d=4 hl=2 l= 18 cons: SEQUENCE
432:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.5
444:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:02020200
450:d=4 hl=2 l= 17 cons: SEQUENCE
452:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.6
464:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020101
469:d=4 hl=5 l=3054024 cons: SEQUENCE
474:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.9
486:d=5 hl=5 l=3054007 prim: OCTET STRING [HEX DUMP]:A0832E99B204832E99AD7F454C46
```
The MirageOS private enterprise number is 1.3.6.1.4.1.49836, I use the arc 42
here. I use 0 as version (an integer), where 0 is the current version.
42 is a bit string representing the permissions. 5 the amount of memory, 6 the
CPU id, and 9 finally the virtual machine image (as ELF binary). If you're
eager to see more, look into the `Vmm_asn` module.
Using a command certificate establishes an interactive session where you can
review the event log, see all currently running virtual machines, or attach to
the console (which is then streamed, if new console output appears while the
interactive session is active, you'll be notified). The `db` file is used to
translate between the internal names (mentioned above, hashed serial numbers) to
common names of the certificates - both on command input (`attach hello`) and
output.
```bash
> vmm_client cacert.pem admin.bundle admin.key localhost:1025 --db dev.db
$ info
info sn.nqsb.io: 'cpuset' '-l' '7' '/tmp/vmm/ukvm-bin.net' '--net=tap27' '--' '/tmp/81363f.0237f3.img' 91540 taps tap27
info nqsbio: 'cpuset' '-l' '5' '/tmp/vmm/ukvm-bin.net' '--net=tap26' '--' '/tmp/81363f.43a0ff.img' 91448 taps tap26
info marrakesh: 'cpuset' '-l' '4' '/tmp/vmm/ukvm-bin.net' '--net=tap25' '--' '/tmp/81363f.cb53e2.img' 91368 taps tap25
info tls.nqsb.io: 'cpuset' '-l' '9' '/tmp/vmm/ukvm-bin.net' '--net=tap28' '--' '/tmp/81363f.ec692e.img' 91618 taps tap28
$ log
log: 2017-07-10 09:43:39 +00:00: marrakesh LOGIN 128.232.110.109:43142
log: 2017-07-10 09:43:39 +00:00: marrakesh STARTED 91368 (tap tap25, block no)
log: 2017-07-10 09:43:51 +00:00: nqsbio LOGIN 128.232.110.109:44663
log: 2017-07-10 09:43:51 +00:00: nqsbio STARTED 91448 (tap tap26, block no)
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io LOGIN 128.232.110.109:38182
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io STARTED 91540 (tap tap27, block no)
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io LOGIN 128.232.110.109:11178
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io STARTED 91618 (tap tap28, block no)
log: 2017-07-10 09:44:25 +00:00: hannes LOGIN 128.232.110.109:24207
success
$ attach hello
console hello: 2017-07-09 18:44:52 +00:00 | ___|
console hello: 2017-07-09 18:44:52 +00:00 __| _ \ | _ \ __ \
console hello: 2017-07-09 18:44:52 +00:00 \__ \ ( | | ( | ) |
console hello: 2017-07-09 18:44:52 +00:00 ____/\___/ _|\___/____/
console hello: 2017-07-09 18:44:52 +00:00 Solo5: Memory map: 512 MB addressable:
console hello: 2017-07-09 18:44:52 +00:00 Solo5: unused @ (0x0 - 0xfffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: text @ (0x100000 - 0x1e4fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: rodata @ (0x1e5000 - 0x217fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: data @ (0x218000 - 0x2cffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: heap >= 0x2d0000 < stack < 0x20000000
console hello: 2017-07-09 18:44:52 +00:00 STUB: getenv() called
console hello: 2017-07-09 18:44:52 +00:00 2017-07-09 18:44:52 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:53 +00:00 2017-07-09 18:44:53 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:54 +00:00 2017-07-09 18:44:54 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:55 +00:00 2017-07-09 18:44:55 -00:00: INF [application] hello
```
If you use a virtual machine certificate, depending on allowed resource the
virtual machine is started or not:
```bash
> vmm_client cacert.pem hello.bundle hello.key localhost:1025
success VM started
```
## Sharing is caring
Deploying unikernels is now easier for myself on my physical machine. That's
fine. Another aspect comes *for free* by reusing X.509: further delegation (and
limiting thereof). Within a delegation certificate, the basic constraints
extension must be present which marks this certificate as a CA certificate.
This may as well contain a path length - how many other delegations may follow -
or whether the resources may be shared further.
If I delegate 2 virtual machines and 2GB of memory to Alice, and allow an
arbitrary path length, she can issue tokens to her friend Carol and Dan, each up
to 2 virtual machines and 2 GB memory (but also less -- within the X.509 system
even more, but vmmd will reject any resource increase in the chain) - who can
further delegate to Eve, .... Carol and Dan won't know of each other,
and vmmd will only start up to 2 virtual machines using 2GB of memory in total
(sum of Alice, Carol, and Dan deployed virtual machines). Alice may revoke any
issued delegation (using a revocation certificate described above) to free up
some resources for herself. I don't need to interact when Alice or Dan share
their delegated resources further.
## Security
There are several security properties preserved by `vmmd`, such as the virtual
machine image is never transmitted in clear. Only properly authenticated
clients can create, destroy, gather statistics of _their_ virtual machines.
Two disjoint paths in the delegation tree are not able to discover anything
about each other (apart from caches, which depend on how CPUs are delegated and
their concrete physical layout). Only smaller amounts of resources can be
delegated further down. Each running virtual machine image is strongly isolated
from all other virtual machines.
As mentioned in the last section, delegations of delegations may end up in the
hands of malicious people. Vmmd limits delegations to allocate resources on the
host system, namely bridges and file systems. Only top delegations - directly
signed by the certificate authority - create bridge interfaces (which are
explicitly named in the certificate) and file systems (one zfs for each top
delegation (to allow easy snapshots and backups)).
The threat model is that clients have layer 2 access to the hosts network
interface card, and all guests share a single bridge (if this turns out to be a
problem, there are ways to restrict to a point-to-point interface with routed IP
addresses). A malicious virtual machine can try to hijack ethernet and IP
addresses.
Possible DoS scenarios include also to spawn VMs very fast (which immediately
crash) or generating a lot of console output. Both is indirectly handled by the
control channel: to create a virtual machine image, you need to setup a TLS
connection (with two handshakes) and transfer the virtual machine image (there
is intentionally no "respawn on quit" option). The console output is read by a
single process with user privileges (in the future there may be one console
reading process for each top delegation). It may further be rate limited as
well. The console stream is only ever sent to a single session, as soon as
someone attaches to the console in one session, all other sessions have this
console detached (and are notified about that).
The control channel itself can be rate limited using the host system firewall.
The only information persistently stored on a block device are the certificate
revocation lists - virtual machine images, FIFOs, unix domain sockets are all
stored in a memory-backed file system. A virtual machine with a lots of disk
operation may only delay or starve revocation list updates - if this turns out
to be a problem, the solution may be to use separate physical block devices for
the revocation lists and virtual block devices for clients.
## Conclusion
I showed a minimalistic system to provision, deploy, and manage virtual machine
images. It also allows to delegate resources (CPU, disk, ..) further. I'm
pretty satisfied with the security properties of the system.
The system embeds all data (configuration, resource policies, virtual machine
images) into X.509 certificates, and does not rely on an external file transfer
protocol. An advantage thereof is that all deployed images have been signed
with a private key.
All communication between the processes and between the client and the server
use a wire protocol, with structured input and output - this enables more
advanced algorithms (e.g. automated scaling) and fancier user interfaces than
the currently provided terminal based one.
The delegation mechanism allows to actually share computing resources in a
decentralised way - without knowing the final recipient. Revocation is builtin,
which can at any point delete access of a subtree or individual virtual machine
to the system. Instead of requesting revocation lists during the handshake,
they are pushed explicitly by the (sub)CA revoking a certificate.
While this system was designed for a physical server, it should be
straightforward to develop a Google compute engine / EC2 backend which extracts
the virtual machine image, commands, etc. from the certificate and deploys it to
your favourite cloud provider. A virtual machine image itself is only
processor-specific, and should be portable between different hypervisors - being
it FreeBSD and VMM, Linux and KVM, or MacOSX and Hypervisor.Framework.
2018-01-06 16:22:12 +00:00
The code is available [on GitHub](https://github.com/hannesm/albatross). If you want
2017-07-10 09:49:12 +00:00
to deploy your unikernel on my hardware, please send me a certificate signing
request. I'm interested in feedback, either via
[twitter](https://twitter.com/h4nnes) or open issues in the repository. This
article itself is stored [in a different
2021-11-19 18:04:52 +00:00
repository](https://git.robur.io/hannes/hannes.robur.coop) (in case you have typo or
2017-07-10 09:49:12 +00:00
grammatical corrections).
I'm very thankful to people who gave feedback on earlier versions of this
article, and who discussed the system design with me. These are Addie, Chris,
2017-07-10 10:02:56 +00:00
Christiano, Joe, mato, Mindy, Mort, and sg.