<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Albatross - provisioning, deploying, managing, and monitoring virtual machines</title><meta charset="UTF-8"/><link rel="stylesheet" href="/static/css/style.css"/><link rel="stylesheet" href="/static/css/highlight.css"/><script src="/static/js/highlight.pack.js"></script><script>hljs.initHighlightingOnLoad();</script><link rel="alternate" href="/atom" title="Albatross - provisioning, deploying, managing, and monitoring virtual machines" type="application/atom+xml"/><meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover"/></head><body><nav class="navbar navbar-default navbar-fixed-top"><div class="container"><div class="navbar-header"><a class="navbar-brand" href="/Posts">full stack engineer</a></div><div class="collapse navbar-collapse collapse"><ul class="nav navbar-nav navbar-right"><li><a href="/About"><span>About</span></a></li><li><a href="/Posts"><span>Posts</span></a></li></ul></div></div></nav><main><div class="flex-container"><div class="post"><h2>Albatross - provisioning, deploying, managing, and monitoring virtual machines</h2><span class="author">Written by hannes</span><br/><div class="tags">Classified under: <a href="/tags/mirageos" class="tag">mirageos</a><a href="/tags/deployment" class="tag">deployment</a><a href="/tags/provisioning" class="tag">provisioning</a></div><span class="date">Published: 2017-07-10 (last updated: 2023-05-16)</span><article><p>EDIT (2023-05-16): Please take a look at <a href="/Posts/Albatross">the updated article</a>.</p>
<h2 id="how-to-deploy-unikernels">How to deploy unikernels?</h2>
<p>MirageOS has a pretty good story on how to compose your OCaml libraries into a
virtual machine image. The <code>mirage</code> command line utility contains all the
knowledge about which backend requires which library. This enables you to write a
unikernel using abstract interfaces (such as a network device). Additionally the
<code>mirage</code> utility can compile for any backend. (It is still unclear whether this
is a sustainable idea, since the <code>mirage</code> tool needs to be adjusted for every
new backend, and also for every additional implementation of an interface.)</p>
<p>Once a virtual machine image has been created, it needs to be deployed. I run
my own physical hardware, with all the associated upsides and downsides.
Specifically I run several physical <a href="https://freebsd.org">FreeBSD</a> machines on
the Internet, and use the <a href="http://bhyve.org">bhyve</a> hypervisor with MirageOS as
described <a href="/Posts/Solo5">earlier</a>. Recently, Martin
Lucina
<a href="https://github.com/Solo5/solo5/pull/171/commits/e67a007b75fa3fcee5c082aab04c9fe9e897d779">developed</a>
a
<a href="https://svnweb.freebsd.org/base/head/sys/amd64/include/vmm.h?view=markup"><code>vmm</code></a>
backend for <a href="https://github.com/solo5/solo5">Solo5</a>. This means there is no
need to use virtio anymore, or grub2-bhyve, or the bhyve binary (which links
<code>libvmmapi</code>, which has already had a <a href="https://www.freebsd.org/security/advisories/FreeBSD-SA-16:38.bhyve.asc">security
advisory</a>).
Instead of the bhyve binary, a small (~70kB) <code>ukvm-bin</code> binary (dynamically
linking libc) can be used, which is the Solo5 virtual machine monitor on the host
side.</p>
<p>Until now, I manually created and deployed virtual machines using shell scripts,
ssh logins, and a network file system shared with the FreeBSD virtual machine
which builds my MirageOS unikernels.</p>
<p>This approach has several drawbacks; the biggest is that sharing
resources is hard - to enable a friend to run their unikernel on my server,
they'll need to have a user account, and even privileged permissions to
create virtual network interfaces and execute virtual machines.</p>
<p>To get rid of these ad-hoc shell scripts and copying of virtual machine images,
I developed a UNIX daemon which accomplishes the required work. This daemon
waits for (mutually!) authenticated network connections, and provides the
desired commands: to create a new virtual machine, to acquire a block device of
a given size, to destroy a virtual machine, and to stream the console output of a
virtual machine.</p>
<h2 id="system-design">System design</h2>
<p>The system bears minimalistic characteristics. The single interface to the
outside world is a TLS stream over TCP. Internally, there is a family of
processes, one of which has superuser privileges, communicating via Unix domain
sockets. The processes do not need any persistent storage (apart from the
revocation lists). A brief enumeration of the processes is provided below:</p>
<ul>
<li><code>vmmd</code> (superuser privileges), which terminates TLS sessions, proxies messages, and creates and destroys virtual machines (including setup and teardown of network interfaces and virtual block devices)
</li>
<li><code>vmm_stats</code> periodically gathers resource usage and network interface statistics
</li>
<li><code>vmm_console</code> reads the console output from every provided fifo and stores it in a ringbuffer, replaying it to a client on demand
</li>
<li><code>vmm_log</code> consumes the event log (login, starting, and stopping of virtual machines)
</li>
</ul>
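<p>To illustrate the ringbuffer idea behind <code>vmm_console</code>, here is a minimal Python
sketch. It is not the albatross implementation (which is written in OCaml); the
class name <code>ConsoleBuffer</code> and the capacity of 3 lines are made up for this
example.</p>
<pre><code class="language-python">from collections import deque


class ConsoleBuffer:
    """Keep only the most recent console lines of one virtual machine."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen silently drops its oldest entry when full
        self.lines = deque(maxlen=capacity)

    def write(self, line: str) -> None:
        self.lines.append(line)

    def replay(self) -> list:
        # Oldest retained line first, as a newly attached client would read it
        return list(self.lines)


buf = ConsoleBuffer(capacity=3)
for n in range(5):
    buf.write(f"hello {n}")
print(buf.replay())
</code></pre>
<p>Only the last three lines survive, so the memory used per virtual machine stays
bounded no matter how much a guest prints.</p>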
<p>The system uses X.509 certificates as tokens. These are authenticated key-value
stores. There are four shapes of certificates: a <em>virtual machine certificate</em>
which embeds the entire virtual machine image, together with configuration
information (resource usage, how many and which network interfaces, block device
access); a <em>command certificate</em> (for interactive use, allowing (a subset of)
commands such as attaching to console output); a <em>revocation certificate</em> which
contains a list of revoked certificates; and a <em>delegation certificate</em> to
distribute resources to someone else (an intermediate CA certificate).</p>
<p>The resources which can be controlled are CPUs, memory consumption, block
storage, and access to bridge interfaces (virtual switches) - encoded in the
virtual machine and delegation certificates. Additionally, delegation
certificates can limit the number of virtual machines.</p>
<p>Leveraging the X.509 system ensures that the client always has to present a
certificate chain from the root certificate. Each intermediate certificate is a
delegation certificate, which may further restrict resources. The serial
numbers of the chain are used as a unique identifier for each virtual machine and
other certificates. The chain restricts access of the leaf certificate as well:
only the subtree of the chain can be viewed. E.g. if there are delegations to
both Alice and Bob from the root certificate, they cannot see each other's
virtual machines.</p>
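<p>The subtree rule can be sketched in a few lines of Python. This is an
illustration under assumptions, not the logic of vmmd: identifiers are modelled
as tuples of serial numbers (root excluded), the serials are made-up strings, and
the function <code>can_view</code> is hypothetical.</p>
<pre><code class="language-python">def can_view(viewer_chain: tuple, vm_chain: tuple) -> bool:
    """A leaf sees a virtual machine only if it sits below the leaf's issuer."""
    # Drop the leaf's own serial: what remains names the delegation it hangs off
    issuer_path = viewer_chain[:-1]
    return vm_chain[: len(issuer_path)] == issuer_path


# Hypothetical serials: two delegations from the root, plus a root-signed admin
alice_cmd = ("alice", "alice-cmd")
alice_vm = ("alice", "web")
bob_vm = ("bob", "dns")
root_admin = ("admin",)

print(can_view(alice_cmd, alice_vm))
print(can_view(alice_cmd, bob_vm))
print(can_view(root_admin, bob_vm))
</code></pre>
<p>Alice's command certificate matches her own virtual machine but not Bob's, while
a certificate signed directly by the root has an empty issuer path and therefore
matches everything.</p>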
<p>Connecting to vmmd requires a TLS client, a CA certificate, a leaf
certificate (and the delegation chain) and its private key. In the background,
it is a multi-step process using TLS: first, the client establishes a TLS
connection where it authenticates the server using the CA certificate, then the
server demands a TLS renegotiation where it requires the client to authenticate
with its leaf certificate and private key. Using renegotiation over the
encrypted channel prevents passive observers from seeing the client certificate in
the clear.</p>
<p>Depending on the leaf certificate, the server logic is slightly different. A
command certificate opens an interactive session where - depending on
permissions encoded in the certificate - different commands can be issued: the
console output can be streamed, the event log can be viewed, virtual machines
can be destroyed, statistics can be collected, and block devices can be managed.</p>
<p>When a virtual machine certificate is presented, the desired resource usage is
checked against the resource policies in the delegation certificate chain and
the currently running virtual machines. If sufficient resources are free, the
embedded virtual machine is started. In addition to other resource information,
a delegation certificate may embed IP usage, listing the network configuration
(gateway and netmask), and which addresses you're supposed to use. Boot
arguments can be encoded in the certificate as well; they are just passed to the
virtual machine (for easy deployment of off-the-shelf systems).</p>
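<p>A rough Python sketch of such an admission check follows. It only covers the
number of virtual machines and memory, and the names <code>Policy</code> and <code>admit</code> as
well as the data layout are assumptions made for this example rather than the
data structures used by vmmd.</p>
<pre><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    vms: int        # maximum number of virtual machines
    memory_mb: int  # maximum total memory in MB


def admit(chain_policies, running_per_delegation, requested_memory_mb):
    """Return True if one more VM fits under every delegation in the chain.

    chain_policies: policies from the top delegation down to the issuer.
    running_per_delegation: for each policy, the memory sizes (MB) of the
    virtual machines already running somewhere in that delegation's subtree.
    """
    for policy, running in zip(chain_policies, running_per_delegation):
        if len(running) + 1 > policy.vms:
            return False
        if sum(running) + requested_memory_mb > policy.memory_mb:
            return False
    return True


alice = Policy(vms=2, memory_mb=2048)
carol = Policy(vms=2, memory_mb=1024)
# One 512 MB VM already runs under Carol, and therefore also under Alice
running = [[512], [512]]
print(admit([alice, carol], running, 512))
print(admit([alice, carol], running, 1024))
</code></pre>
<p>The first request fits both policies; the second is refused because it would
exceed Carol's 1024 MB, even though Alice's policy alone would allow it.</p>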
<p>If a revocation certificate is presented, the embedded revocation list is
verified, and stored on the host system. Revocation is enforced by destroying
any revoked virtual machines and terminating any revoked interactive sessions.
If a delegation certificate is revoked, additionally the connected block devices
are destroyed.</p>
<p>The maximum size of a virtual machine image embedded into an X.509 certificate
transferred over TLS is 2 ^ 24 - 1 bytes, roughly 16 MB. If this turns out to
be insufficient, compression or staged deployment may help.</p>
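<p>The arithmetic behind this limit, and the effect compression can have, is easy
to check in Python. This is only an illustration: the all-zero stand-in image
and the helper <code>fits</code> are made up, and nothing here implies that vmmd accepts
compressed images.</p>
<pre><code class="language-python">import zlib

MAX_IMAGE_BYTES = 2 ** 24 - 1  # 16777215 bytes, one byte short of 16 MiB


def fits(image: bytes) -> bool:
    """True if the image is within the size limit stated above."""
    return MAX_IMAGE_BYTES >= len(image)


# Stand-in for an image that is exactly one byte too large (all zero bytes)
raw = bytes(2 ** 24)
packed = zlib.compress(raw, 9)

print(MAX_IMAGE_BYTES, fits(raw), fits(packed))
</code></pre>
<p>This prints <code>16777215 False True</code>. Real unikernel images compress far less
well than a run of zero bytes, so the gain in practice depends on the image.</p>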
<h2 id="an-example">An example</h2>
<p>Instructions on how to set up <code>vmmd</code> and the certificate authority are in the
README file of the <a href="https://github.com/hannesm/albatross"><code>albatross</code> git repository</a>. Here
is some (stripped) terminal output:</p>
<pre><code class="language-bash">> openssl x509 -text -noout -in admin.pem
Certificate:
Data:
Serial Number: b7:aa:77:f6:ca:08:ee:6a
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN=dev
Subject: CN=admin
X509v3 extensions:
1.3.6.1.4.1.49836.42.42: ....
1.3.6.1.4.1.49836.42.0: ...

> openssl asn1parse -in admin.pem
403:d=4 hl=2 l= 18 cons: SEQUENCE
405:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
417:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020780
423:d=4 hl=2 l= 17 cons: SEQUENCE
425:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.0
437:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020100

> openssl asn1parse -in hello.pem
410:d=4 hl=2 l= 18 cons: SEQUENCE
412:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.42
424:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:03020520
430:d=4 hl=2 l= 18 cons: SEQUENCE
432:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.5
444:d=5 hl=2 l= 4 prim: OCTET STRING [HEX DUMP]:02020200
450:d=4 hl=2 l= 17 cons: SEQUENCE
452:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.6
464:d=5 hl=2 l= 3 prim: OCTET STRING [HEX DUMP]:020101
469:d=4 hl=5 l=3054024 cons: SEQUENCE
474:d=5 hl=2 l= 10 prim: OBJECT :1.3.6.1.4.1.49836.42.9
486:d=5 hl=5 l=3054007 prim: OCTET STRING [HEX DUMP]:A0832E99B204832E99AD7F454C46
</code></pre>
<p>The MirageOS private enterprise number is 1.3.6.1.4.1.49836; I use the arc 42
here. Below that arc, 0 holds the version (an integer), where 0 is the current version.</p>
<p>42 is a bit string representing the permissions, 5 the amount of memory, 6 the
CPU id, and finally 9 the virtual machine image (as an ELF binary). If you're
eager to see more, look into the <code>Vmm_asn</code> module.</p>
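<p>The short hex dumps in the <code>asn1parse</code> output above are themselves DER-encoded
values, and can be decoded by hand. The following Python sketch handles just the
two types that occur there (INTEGER and BIT STRING, short-form lengths only);
the function <code>decode_der</code> is written for this example and is unrelated to the
<code>Vmm_asn</code> module.</p>
<pre><code class="language-python">def decode_der(hex_dump: str):
    """Decode a short-form DER INTEGER or BIT STRING given as a hex string."""
    raw = bytes.fromhex(hex_dump)
    tag, length, body = raw[0], raw[1], raw[2:]
    if length >= 0x80 or length != len(body):
        raise ValueError("only short-form, exactly sized values are handled")
    if tag == 0x02:  # INTEGER: big-endian two's complement
        return int.from_bytes(body, "big", signed=True)
    if tag == 0x03:  # BIT STRING: first byte counts unused trailing bits
        unused = body[0]
        bits = "".join(f"{byte:08b}" for byte in body[1:])
        return bits[: len(bits) - unused]
    raise ValueError(f"unsupported tag {tag:#04x}")


print(decode_der("020100"))    # version in admin.pem
print(decode_der("02020200"))  # memory in hello.pem
print(decode_der("020101"))    # CPU id in hello.pem
print(decode_der("03020780"))  # permissions in admin.pem
print(decode_der("03020520"))  # permissions in hello.pem
</code></pre>
<p>The values come out as version 0, 512 for the memory, CPU id 1, and the bit
strings <code>1</code> and <code>001</code> for the permissions of the two certificates. What each
permission bit stands for is defined in the source and not decoded here.</p>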
<p>Using a command certificate establishes an interactive session where you can
review the event log, see all currently running virtual machines, or attach to
the console (which is then streamed: if new console output appears while the
interactive session is active, you'll be notified). The <code>db</code> file is used to
translate between the internal names (the hashed serial numbers mentioned above) and
the common names of the certificates - both on command input (<code>attach hello</code>) and
output.</p>
<pre><code class="language-bash">> vmm_client cacert.pem admin.bundle admin.key localhost:1025 --db dev.db
$ info
info sn.nqsb.io: 'cpuset' '-l' '7' '/tmp/vmm/ukvm-bin.net' '--net=tap27' '--' '/tmp/81363f.0237f3.img' 91540 taps tap27
info nqsbio: 'cpuset' '-l' '5' '/tmp/vmm/ukvm-bin.net' '--net=tap26' '--' '/tmp/81363f.43a0ff.img' 91448 taps tap26
info marrakesh: 'cpuset' '-l' '4' '/tmp/vmm/ukvm-bin.net' '--net=tap25' '--' '/tmp/81363f.cb53e2.img' 91368 taps tap25
info tls.nqsb.io: 'cpuset' '-l' '9' '/tmp/vmm/ukvm-bin.net' '--net=tap28' '--' '/tmp/81363f.ec692e.img' 91618 taps tap28
$ log
log: 2017-07-10 09:43:39 +00:00: marrakesh LOGIN 128.232.110.109:43142
log: 2017-07-10 09:43:39 +00:00: marrakesh STARTED 91368 (tap tap25, block no)
log: 2017-07-10 09:43:51 +00:00: nqsbio LOGIN 128.232.110.109:44663
log: 2017-07-10 09:43:51 +00:00: nqsbio STARTED 91448 (tap tap26, block no)
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io LOGIN 128.232.110.109:38182
log: 2017-07-10 09:44:07 +00:00: sn.nqsb.io STARTED 91540 (tap tap27, block no)
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io LOGIN 128.232.110.109:11178
log: 2017-07-10 09:44:21 +00:00: tls.nqsb.io STARTED 91618 (tap tap28, block no)
log: 2017-07-10 09:44:25 +00:00: hannes LOGIN 128.232.110.109:24207
success
$ attach hello
console hello: 2017-07-09 18:44:52 +00:00 | ___|
console hello: 2017-07-09 18:44:52 +00:00 __| _ \ | _ \ __ \
console hello: 2017-07-09 18:44:52 +00:00 \__ \ ( | | ( | ) |
console hello: 2017-07-09 18:44:52 +00:00 ____/\___/ _|\___/____/
console hello: 2017-07-09 18:44:52 +00:00 Solo5: Memory map: 512 MB addressable:
console hello: 2017-07-09 18:44:52 +00:00 Solo5: unused @ (0x0 - 0xfffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: text @ (0x100000 - 0x1e4fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: rodata @ (0x1e5000 - 0x217fff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: data @ (0x218000 - 0x2cffff)
console hello: 2017-07-09 18:44:52 +00:00 Solo5: heap >= 0x2d0000 < stack < 0x20000000
console hello: 2017-07-09 18:44:52 +00:00 STUB: getenv() called
console hello: 2017-07-09 18:44:52 +00:00 2017-07-09 18:44:52 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:53 +00:00 2017-07-09 18:44:53 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:54 +00:00 2017-07-09 18:44:54 -00:00: INF [application] hello
console hello: 2017-07-09 18:44:55 +00:00 2017-07-09 18:44:55 -00:00: INF [application] hello
</code></pre>
<p>If you use a virtual machine certificate, the virtual machine is started or
not, depending on the allowed resources:</p>
<pre><code class="language-bash">> vmm_client cacert.pem hello.bundle hello.key localhost:1025
success VM started
</code></pre>
<h2 id="sharing-is-caring">Sharing is caring</h2>
<p>Deploying unikernels is now easier for me on my physical machine. That's
fine. Another aspect comes <em>for free</em> by reusing X.509: further delegation (and
limiting thereof). Within a delegation certificate, the basic constraints
extension must be present which marks this certificate as a CA certificate.
This extension may also contain a path length - how many further delegations may
follow - which determines whether the resources may be shared further.</p>
<p>If I delegate 2 virtual machines and 2GB of memory to Alice, and allow an
arbitrary path length, she can issue tokens to her friends Carol and Dan, each for
up to 2 virtual machines and 2GB of memory (or less; the X.509 system itself
would permit more, but vmmd rejects any resource increase in the chain), and they
can delegate further to Eve, and so on. Carol and Dan won't know of each other,
and vmmd will only start up to 2 virtual machines using 2GB of memory in total
(the sum of the virtual machines deployed by Alice, Carol, and Dan). Alice may revoke any
issued delegation (using a revocation certificate described above) to free up
some resources for herself. I don't need to interact when Alice or Dan share
their delegated resources further.</p>
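<p>The rule that resources may never grow along a chain can be sketched as
follows. This is a simplified Python model, not vmmd's code: each delegation is
a plain dictionary, the function <code>chain_is_valid</code> and the key names are invented
for this example, and a resource missing from the parent is treated as not
delegated at all.</p>
<pre><code class="language-python">def chain_is_valid(chain):
    """Each delegation may only keep or shrink the limits of its parent."""
    for parent, child in zip(chain, chain[1:]):
        for resource, limit in child.items():
            # A resource the parent does not hold counts as a limit of 0
            if limit > parent.get(resource, 0):
                return False
    return True


alice = {"vms": 2, "memory_mb": 2048}
carol = {"vms": 2, "memory_mb": 2048}  # same limits as Alice: allowed
dan = {"vms": 1, "memory_mb": 512}     # fewer resources: allowed
eve = {"vms": 3, "memory_mb": 512}     # more virtual machines than Carol

print(chain_is_valid([alice, carol]))
print(chain_is_valid([alice, dan]))
print(chain_is_valid([alice, carol, eve]))
</code></pre>
<p>The first two chains pass and the third is rejected. Note that this check is
separate from the accounting of running virtual machines: Carol and Dan may each
hold a token for the full 2GB, but the total in use below Alice is still capped
by her delegation.</p>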
<h2 id="security">Security</h2>
<p>There are several security properties preserved by <code>vmmd</code>; for example, the virtual
machine image is never transmitted in the clear. Only properly authenticated
clients can create, destroy, and gather statistics of <em>their</em> virtual machines.</p>
<p>Two disjoint paths in the delegation tree are not able to discover anything
about each other (apart from caches, which depend on how CPUs are delegated and
their concrete physical layout). Only smaller amounts of resources can be
delegated further down. Each running virtual machine image is strongly isolated
from all other virtual machines.</p>
<p>As mentioned in the last section, delegations of delegations may end up in the
hands of malicious people. vmmd restricts which delegations can allocate resources
on the host system, namely bridges and file systems. Only top delegations - directly
signed by the certificate authority - create bridge interfaces (which are
explicitly named in the certificate) and file systems (one ZFS file system for
each top delegation, to allow easy snapshots and backups).</p>
<p>The threat model is that clients have layer 2 access to the host's network
interface card, and all guests share a single bridge (if this turns out to be a
problem, there are ways to restrict to a point-to-point interface with routed IP
addresses). A malicious virtual machine can try to hijack Ethernet and IP
addresses.</p>
<p>Possible DoS scenarios also include spawning VMs very quickly (which immediately
crash) or generating a lot of console output. Both are indirectly handled by the
control channel: to create a virtual machine, you need to set up a TLS
connection (with two handshakes) and transfer the virtual machine image (there
is intentionally no "respawn on quit" option). The console output is read by a
single process with user privileges (in the future there may be one console
reading process for each top delegation). It may also be rate limited.
The console stream is only ever sent to a single session: as soon as
someone attaches to the console in one session, all other sessions have this
console detached (and are notified about that).</p>
<p>The control channel itself can be rate limited using the host system firewall.</p>
<p>The only information persistently stored on a block device is the set of certificate
revocation lists - virtual machine images, FIFOs, and Unix domain sockets are all
stored in a memory-backed file system. A virtual machine with lots of disk
operations may only delay or starve revocation list updates - if this turns out
to be a problem, the solution may be to use separate physical block devices for
the revocation lists and virtual block devices for clients.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I showed a minimalistic system to provision, deploy, and manage virtual machine
images. It also allows resources (CPU, disk, ...) to be delegated further. I'm
pretty satisfied with the security properties of the system.</p>
<p>The system embeds all data (configuration, resource policies, virtual machine
images) into X.509 certificates, and does not rely on an external file transfer
protocol. An advantage thereof is that all deployed images have been signed
with a private key.</p>
<p>All communication between the processes and between the client and the server
uses a wire protocol, with structured input and output - this enables more
advanced algorithms (e.g. automated scaling) and fancier user interfaces than
the currently provided terminal based one.</p>
<p>The delegation mechanism makes it possible to actually share computing resources in a
decentralised way - without knowing the final recipient. Revocation is built in,
and can at any point remove access of a subtree or individual virtual machine
to the system. Instead of requesting revocation lists during the handshake,
they are pushed explicitly by the (sub)CA revoking a certificate.</p>
<p>While this system was designed for a physical server, it should be
straightforward to develop a Google Compute Engine / EC2 backend which extracts
the virtual machine image, commands, etc. from the certificate and deploys it to
your favourite cloud provider. A virtual machine image itself is only
processor-specific, and should be portable between different hypervisors - be
it FreeBSD and VMM, Linux and KVM, or MacOSX and Hypervisor.Framework.</p>
<p>The code is available <a href="https://github.com/hannesm/albatross">on GitHub</a>. If you want
to deploy your unikernel on my hardware, please send me a certificate signing
request. I'm interested in feedback, either via
<a href="https://twitter.com/h4nnes">twitter</a> or open issues in the repository. This
article itself is stored <a href="https://git.robur.io/hannes/hannes.robur.coop">in a different
repository</a> (in case you have typographical or
grammatical corrections).</p>
<p>I'm very thankful to people who gave feedback on earlier versions of this
article, and who discussed the system design with me. These are Addie, Chris,
Christiano, Joe, mato, Mindy, Mort, and sg.</p>
</article></div></div></main></body></html>