Albatross - provisioning, deploying, managing, and monitoring virtual machines
Published: 2017-07-10 (last updated: 2023-05-16)
EDIT (2023-05-16): Please take a look at the updated article.
How to deploy unikernels?
MirageOS has a pretty good story on how to compose your OCaml libraries into a
virtual machine image. The mirage
command line utility contains all the
knowledge about which backend requires which library. This enables it to write a
unikernel using abstract interfaces (such as a network device). Additionally the
mirage
utility can compile for any backend. (It is still unclear whether this
is a sustainable idea, since the mirage
tool needs to be adjusted for every
new backend, but also for additional implementations of an interface.)
Once a virtual machine image has been created, it needs to be deployed. I run
my own physical hardware, with all the associated upsides and downsides.
Specifically I run several physical FreeBSD machines on
the Internet, and use the bhyve hypervisor with MirageOS as
described earlier. Recently, Martin
Lucina
developed
a
vmm
backend for Solo5. This means there is no
need to use virtio anymore, or grub2-bhyve, or the bhyve binary (which links
libvmmapi
that already had a security
advisory).
Instead of the bhyve binary, a ~70kB small ukvm-bin
binary (dynamically
linking libc) can be used which is the solo5 virtual machine monitor on the host
side.
Until now, I manually created and deployed virtual machines using shell scripts, ssh logins, and a network file system shared with the FreeBSD virtual machine which builds my MirageOS unikernels.
But there are several drawbacks with this approach, the biggest is that sharing resources is hard - to enable a friend to run their unikernel on my server, they'll need to have a user account, and even privileged permissions to create virtual network interfaces and execute virtual machines.
To get rid of these ad-hoc shell scripts and copying of virtual machine images, I developed an UNIX daemon which accomplishes the required work. This daemon waits for (mutually!) authenticated network connections, and provides the desired commands; to create a new virtual machine, to acquire a block device of a given size, to destroy a virtual machine, to stream the console output of a virtual machine.
System design
The system bears minimalistic characteristics. The single interface to the outside world is a TLS stream over TCP. Internally, there is a family of processes, one of which has superuser privileges, communicating via unix domain sockets. The processes do not need any persistent storage (apart from the revocation lists). A brief enumeration of the processes is provided below:
vmmd
(superuser privileges), which terminates TLS sessions, proxies messages, and creates and destroys virtual machines (including setup and teardown of network interfaces and virtual block devices)vmm_stats
periodically gathers resource usage and network interface statisticsvmm_console
reads console output of every provided fifo, and stores this in a ringbuffer, replaying to a client on demandvmm_log
consumes the event log (login, starting, and stopping of virtual machines)
The system uses X.509 certificates as tokens. These are authenticated key value stores. There are four shapes of certificates: a virtual machine certificate which embeds the entire virtual machine image, together with configuration information (resource usage, how many and which network interfaces, block device access); a command certificate (for interactive use, allowing (a subset of) commands such as attaching to console output); a revocation certificate which contains a list of revoked certificates; and a delegation certificate to distribute resources to someone else (an intermediate CA certificate).
The resources which can be controlled are CPUs, memory consumption, block storage, and access to bridge interfaces (virtual switches) - encoded in the virtual machine and delegation certificates. Additionally, delegation certificates can limit the number of virtual machines.
Leveraging the X.509 system ensures that the client always has to present a certificate chain from the root certificate. Each intermediate certificate is a delegation certificate, which may further restrict resources. The serial numbers of the chain is used as unique identifier for each virtual machine and other certificates. The chain restricts access of the leaf certificate as well: only the subtree of the chain can be viewed. E.g. if there are delegations to both Alice and Bob from the root certificate, they can not see each other virtual machines.
Connecting to the vmmd requires a TLS client, a CA certificate, a leaf certificate (and the delegation chain) and its private key. In the background, it is a multi-step process using TLS: first, the client establishes a TLS connection where it authenticates the server using the CA certificate, then the server demands a TLS renegotiation where it requires the client to authenticate with its leaf certificate and private key. Using renegotiation over the encrypted channel prevents passive observers to see the client certificate in clear.
Depending on the leaf certificate, the server logic is slightly different. A command certificate opens an interactive session where - depending on permissions encoded in the certificate - different commands can be issued: the console output can be streamed, the event log can be viewed, virtual machines can be destroyed, statistics can be collected, and block devices can be managed.
When a virtual machine certificate is presented, the desired resource usage is checked against the resource policies in the delegation certificate chain and the currently running virtual machines. If sufficient resources are free, the embedded virtual machine is started. In addition to other resource information, a delegation certificate may embed IP usage, listing the network configuration (gateway and netmask), and which addresses you're supposed to use. Boot arguments can be encoded in the certificate as well, they are just passed to the virtual machine (for easy deployment of off-the-shelf systems).
If a revocation certificate is presented, the embodied revocation list is verified, and stored on the host system. Revocation is enforced by destroying any revoked virtual machines and terminating any revoked interactive sessions. If a delegation certificate is revoked, additionally the connected block devices are destroyed.
The maximum size of a virtual machine image embedded into a X.509 certificate transferred over TLS is 2 ^ 24 - 1 bytes, roughly 16 MB. If this turns out to be not sufficient, compression may help. Or staging of deployment.
An example
Instructions on how to setup vmmd
and the certificate authority are in the
README file of the albatross
git repository. Here
is some (stripped) terminal output:
> openssl x509 -text -noout -in admin.pem
Certificate:
Data:
Serial Number: b7:aa:77:f6:ca:08:ee:6a
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=dev
Subject: CN=admin
X509v3 extensions:
1.3.6.1.4.1.49836.42.42: ....
1.3.6.1.4.1.49836.42.0: ...
> openssl asn1parse -in admin.pem
403:d=4 hl=2 l= 18 cons: SEQUENCE
405:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
417:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020780
423:d=4 hl=2 l= 17 cons: SEQUENCE
425:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.0
437:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020100
> openssl asn1parse -in hello.pem
410:d=4 hl=2 l= 18 cons: SEQUENCE
412:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
424:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020520
430:d=4 hl=2 l= 18 cons: SEQUENCE
432:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.5
444:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:02020200
450:d=4 hl=2 l= 17 cons: SEQUENCE
452:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.6
464:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020101
469:d=4 hl=5 l=3054024 cons: SEQUENCE
474:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.9
486:d=5 hl=5 l=3054007 prim: OCTET STRING [HEX DUMP]:A0832E99B204832E99AD7F454C46
The MirageOS private enterprise number is 1.3.6.1.4.1.49836, I use the arc 42 here. I use 0 as version (an integer), where 0 is the current version.
42 is a bit string representing the permissions. 5 the amount of memory, 6 the
CPU id, and 9 finally the virtual machine image (as ELF binary). If you're
eager to see more, look into the Vmm_asn
module.
Using a command certificate establishes an interactive session where you can
review the event log, see all currently running virtual machines, or attach to
the console (which is then streamed, if new console output appears while the
interactive session is active, you'll be notified). The db
file is used to
translate between the internal names (mentioned above, hashed serial numbers) to
common names of the certificates - both on command input (attach hello
) and
output.
> vmm_client cacert.pem admin.bundle admin.key localhost:1025 --db dev.db
$ info
info sn.nqsb.io: 'cpuset' '-l' '7' '/tmp/vmm/ukvm-bin.net' '--net=tap27' '--' '/tmp/81363f.0237f3.img' 91540 taps tap27
info nqsbio: 'cpuset' '-l' '5' '/tmp/vmm/ukvm-bin.net' '--net=tap26' '--' '/tmp/81363f.43a0ff.img' 91448 taps tap26
info marrakesh: 'cpuset' '-l' '4' '/tmp/vmm/ukvm-bin.net' '--net=tap25' '--' '/tmp/81363f.cb53e2.img' 91368 taps tap25
info tls.nqsb.io: 'cpuset' '-l' '9' '/tmp/vmm/ukvm-bin.net' '--net=tap28' '--' '/tmp/81363f.ec692e.img' 91618 taps tap28
$ log
log: 2017-07-10 09:43:39 +00:00: marrakesh LOGIN 128.232.110.109:43142
log: 2017-07-10 09:43:39 +00:00: marrakesh STARTED 91368 (tap tap25, block no)
log: 2017-07-10 09:43:51 +00:00: nqsbio LOGIN 128.232.110.109:44663
log: 2017-07-10 09:43:51 +00:00: nqsbio STARTED 91448 (tap tap26, block no)
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io LOGIN 128.232.110.109:38182
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io STARTED 91540 (tap tap27, block no)
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io LOGIN 128.232.110.109:11178
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io STARTED 91618 (tap tap28, block no)
log: 2017-07-10 09:44:25 +00:00: hannes LOGIN 128.232.110.109:24207
success
$ attach hello
console hello: 2017-07-09 18:44:52 +00:00 | ___|
console hello: 2017-07-09 18:44:52 +00:00 __| _ \ | _ \ __ \
console hello: 2017-07-09 18:44:52 +00:00 \__ \ ( | | ( | ) |
console hello: 2017-07-09 18:44:52 +00:00 ____/\___/ _|\___/____/
console hello: 2017-07-09 18:44:52 +00:00 Solo5: Memory map: 512 MB addressable:
console hello: 2017-07-09 18:44:52 +00:00 Solo5: unused @ (0x0 - 0xfffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: text @ (0x100000 - 0x1e4fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: rodata @ (0x1e5000 - 0x217fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: data @ (0x218000 - 0x2cffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: heap >= 0x2d0000 < stack < 0x20000000
console hello: 2017-07-09 18:44:52 +00:00 STUB: getenv() called
console hello: 2017-07-09 18:44:52 +00:00 2017-07-09 18:44:52 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:53 +00:00 2017-07-09 18:44:53 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:54 +00:00 2017-07-09 18:44:54 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:55 +00:00 2017-07-09 18:44:55 -00:00: INF [application] hello
If you use a virtual machine certificate, depending on allowed resource the virtual machine is started or not:
> vmm_client cacert.pem hello.bundle hello.key localhost:1025
success VM started
Sharing is caring
Deploying unikernels is now easier for myself on my physical machine. That's fine. Another aspect comes for free by reusing X.509: further delegation (and limiting thereof). Within a delegation certificate, the basic constraints extension must be present which marks this certificate as a CA certificate. This may as well contain a path length - how many other delegations may follow - or whether the resources may be shared further.
If I delegate 2 virtual machines and 2GB of memory to Alice, and allow an arbitrary path length, she can issue tokens to her friend Carol and Dan, each up to 2 virtual machines and 2 GB memory (but also less -- within the X.509 system even more, but vmmd will reject any resource increase in the chain) - who can further delegate to Eve, .... Carol and Dan won't know of each other, and vmmd will only start up to 2 virtual machines using 2GB of memory in total (sum of Alice, Carol, and Dan deployed virtual machines). Alice may revoke any issued delegation (using a revocation certificate described above) to free up some resources for herself. I don't need to interact when Alice or Dan share their delegated resources further.
Security
There are several security properties preserved by vmmd
, such as the virtual
machine image is never transmitted in clear. Only properly authenticated
clients can create, destroy, gather statistics of their virtual machines.
Two disjoint paths in the delegation tree are not able to discover anything about each other (apart from caches, which depend on how CPUs are delegated and their concrete physical layout). Only smaller amounts of resources can be delegated further down. Each running virtual machine image is strongly isolated from all other virtual machines.
As mentioned in the last section, delegations of delegations may end up in the hands of malicious people. Vmmd limits delegations to allocate resources on the host system, namely bridges and file systems. Only top delegations - directly signed by the certificate authority - create bridge interfaces (which are explicitly named in the certificate) and file systems (one zfs for each top delegation (to allow easy snapshots and backups)).
The threat model is that clients have layer 2 access to the hosts network interface card, and all guests share a single bridge (if this turns out to be a problem, there are ways to restrict to a point-to-point interface with routed IP addresses). A malicious virtual machine can try to hijack ethernet and IP addresses.
Possible DoS scenarios include also to spawn VMs very fast (which immediately crash) or generating a lot of console output. Both is indirectly handled by the control channel: to create a virtual machine image, you need to setup a TLS connection (with two handshakes) and transfer the virtual machine image (there is intentionally no "respawn on quit" option). The console output is read by a single process with user privileges (in the future there may be one console reading process for each top delegation). It may further be rate limited as well. The console stream is only ever sent to a single session, as soon as someone attaches to the console in one session, all other sessions have this console detached (and are notified about that).
The control channel itself can be rate limited using the host system firewall.
The only information persistently stored on a block device are the certificate revocation lists - virtual machine images, FIFOs, unix domain sockets are all stored in a memory-backed file system. A virtual machine with a lots of disk operation may only delay or starve revocation list updates - if this turns out to be a problem, the solution may be to use separate physical block devices for the revocation lists and virtual block devices for clients.
Conclusion
I showed a minimalistic system to provision, deploy, and manage virtual machine images. It also allows to delegate resources (CPU, disk, ..) further. I'm pretty satisfied with the security properties of the system.
The system embeds all data (configuration, resource policies, virtual machine images) into X.509 certificates, and does not rely on an external file transfer protocol. An advantage thereof is that all deployed images have been signed with a private key.
All communication between the processes and between the client and the server use a wire protocol, with structured input and output - this enables more advanced algorithms (e.g. automated scaling) and fancier user interfaces than the currently provided terminal based one.
The delegation mechanism allows to actually share computing resources in a decentralised way - without knowing the final recipient. Revocation is builtin, which can at any point delete access of a subtree or individual virtual machine to the system. Instead of requesting revocation lists during the handshake, they are pushed explicitly by the (sub)CA revoking a certificate.
While this system was designed for a physical server, it should be straightforward to develop a Google compute engine / EC2 backend which extracts the virtual machine image, commands, etc. from the certificate and deploys it to your favourite cloud provider. A virtual machine image itself is only processor-specific, and should be portable between different hypervisors - being it FreeBSD and VMM, Linux and KVM, or MacOSX and Hypervisor.Framework.
The code is available on GitHub. If you want to deploy your unikernel on my hardware, please send me a certificate signing request. I'm interested in feedback, either via twitter or open issues in the repository. This article itself is stored in a different repository (in case you have typo or grammatical corrections).
I'm very thankful to people who gave feedback on earlier versions of this article, and who discussed the system design with me. These are Addie, Chris, Christiano, Joe, mato, Mindy, Mort, and sg.