---
title: All your metrics belong to influx
author: hannes
tags: mirageos, monitoring, deployment
abstract: How to monitor your MirageOS unikernel with albatross and monitoring-experiments
---

# Introduction to monitoring

At [robur](https://robur.coop) we use a range of MirageOS unikernels. Recently, we worked on improving their operations story. One part is shipping binaries using our [reproducible builds infrastructure](https://builds.robur.coop). Another part is that, once deployed, we want to observe what is going on.

I first got in touch with monitoring - collecting and graphing metrics - through [MRTG](https://oss.oetiker.ch/mrtg/) and [munin](https://munin-monitoring.org/), and the simple network management protocol [SNMP](https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol). From the whole-system perspective, I find it crucial that the monitoring part of a system does not add pressure. This favours a push-based design, where reporting is done at the disposition of the system.

The rise of monitoring where graphs are generated dynamically (such as [Grafana](https://grafana.com/)) and can be programmed (with a query language) by the operator is very neat: it allows metrics to be put in relation after they have been recorded. Thus, if there's a thesis why something went berserk, you can graph the collected data from the past and prove or disprove the thesis.

# Monitoring a MirageOS unikernel

From the operational perspective, taking security into account, the data should either be authenticated and integrity-protected, or be transmitted on a private network. We chose the latter: there's a private network interface used only for monitoring. Access to that network is only granted to the unikernels and the metrics collector.

For MirageOS unikernels, we use the [metrics](https://github.com/mirage/metrics) library, whose design shares with [logs](https://erratique.ch/software/logs) the idea that work is only performed if a reporter is registered.
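
The "no reporter, no work" idea can be illustrated with a few lines of plain OCaml. This is only a sketch using the standard library: the names `reporter`, `set_reporter` and `report` are hypothetical and are not the API of metrics or logs. The point is that the data point is passed as a function, which is only called when somebody is listening.

```ocaml
(* Sketch of the "no reporter, no work" idea. All names are hypothetical;
   this is not the API of the metrics or logs libraries. *)

(* The currently registered reporter, if any. *)
let reporter : (string -> unit) option ref = ref None

let set_reporter r = reporter := Some r

(* [report f] only calls [f] when a reporter is registered, so the cost of
   gathering the data point is not paid otherwise. *)
let report (f : unit -> string) =
  match !reporter with
  | None -> ()
  | Some send -> send (f ())

let () =
  (* Without a reporter, the closure is never evaluated. *)
  report (fun () -> failwith "never evaluated");
  (* With a reporter, the data point is computed and handed over. *)
  set_reporter print_endline;
  report (fun () -> "uptime=1i")
```

With this shape, instrumented code can call `report` unconditionally, and the operator decides at startup whether any work is done.
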
We use the Influx line protocol via TCP to report via [Telegraf](https://www.influxdata.com/time-series-platform/telegraf/) to [InfluxDB](https://www.influxdata.com/). But due to the design of [metrics](https://github.com/mirage/metrics), other reporters can be developed and used -- prometheus, SNMP, your-other-favourite are all possible.

Apart from monitoring metrics, we use the same network interface for logging via syslog. Since the logs library separates the log message generation (in the OCaml libraries) from the reporting, we developed [logs-syslog](https://github.com/hannesm/logs-syslog), which registers a log reporter sending each log message to a syslog sink.

We developed a small library for metrics reporting of a MirageOS unikernel in the [monitoring-experiments](https://github.com/roburio/monitoring-experiments) package - which also allows dynamically adjusting the log level and disabling or enabling metrics sources.

## Required components

Install from your operating system the packages providing telegraf, influxdb, and grafana.

Set up telegraf to contain a socket listener:

```
[[inputs.socket_listener]]
  service_address = "tcp://192.168.42.14:8094"
  keep_alive_period = "5m"
  data_format = "influx"
```

Use a unikernel that reports to Influx (below the heading "Unikernels (with metrics reported to Influx)" on [builds.robur.coop](https://builds.robur.coop)) and provide `--monitor=192.168.42.14` as boot parameter. Conventionally, these unikernels expect a second network interface (on the "management" bridge) where telegraf (and a syslog sink) are running. You'll need to pass `--net=management` and `--arg='--management-ipv4=192.168.42.x/24'` to albatross-client-local.

Albatross provides an `albatross-influx` daemon that reports information from the host system about the unikernels to influx. Start it with `--influx=192.168.42.14`.

## Adding monitoring to your unikernel

If you want to extend your own unikernel with metrics, follow along these lines.
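
For orientation, it may help to see what a reported data point looks like on the wire: one line of Influx line protocol, consisting of a measurement name, optional tags, one or more fields, and an optional timestamp. The following is a minimal standard-library sketch of such an encoder. It is not the encoder used by the metrics library, and the measurement, tag, and field names in the usage example are made up for illustration.

```ocaml
(* Illustrative Influx line protocol encoder (standard library only).
   Not the encoder used by the metrics library. *)

type field =
  | Int_field of int
  | Float_field of float
  | String_field of string
  | Bool_field of bool

(* Prefix every character listed in [chars] with a backslash. *)
let escape chars s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if List.mem c chars then Buffer.add_char b '\\';
      Buffer.add_char b c)
    s;
  Buffer.contents b

(* Measurement names escape commas and spaces; tag keys, tag values and
   field keys additionally escape the equals sign. *)
let escape_measurement = escape [ ','; ' ' ]
let escape_key = escape [ ','; '='; ' ' ]

let field_value = function
  | Int_field i -> string_of_int i ^ "i" (* integers carry an [i] suffix *)
  | Float_field f -> Printf.sprintf "%.17g" f
  | String_field s -> "\"" ^ escape [ '"'; '\\' ] s ^ "\""
  | Bool_field b -> string_of_bool b

(* [line ~tags ~fields measurement] renders one data point (without the
   trailing newline). [fields] must not be empty. *)
let line ?timestamp ~tags ~fields measurement =
  let pair encode (k, v) = escape_key k ^ "=" ^ encode v in
  let tags = List.map (fun t -> "," ^ pair escape_key t) tags in
  let fields = String.concat "," (List.map (pair field_value) fields) in
  let ts =
    match timestamp with None -> "" | Some t -> " " ^ Int64.to_string t
  in
  String.concat "" (escape_measurement measurement :: tags) ^ " " ^ fields ^ ts

let () =
  (* Hypothetical measurement and field names, tagged with the unikernel name. *)
  print_endline
    (line ~tags:[ ("vm", "my.unikernel") ]
       ~fields:[ ("bytes", Int_field 42) ]
       "net")
(* prints: net,vm=my.unikernel bytes=42i *)
```

The `vm` tag in the usage example mirrors the tag that the reported metrics carry, which is what lets several unikernels share one InfluxDB.
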
An example is the [dns-primary-git](https://github.com/roburio/dns-primary-git) unikernel, where on the branch `future` we have a single commit ahead of main that adds monitoring. The difference is in the unikernel configuration and the main entry point. See the [binary builds](https://builds.robur.coop/job/dns-primary-git-monitoring/build/latest/) in contrast to the [non-monitoring builds](https://builds.robur.coop/job/dns-primary-git/build/latest/).

In config, four new command line arguments are added: `--monitor=IP`, `--monitor-adjust=PORT`, `--syslog=IP`, and `--name=STRING`. In addition, the package `monitoring-experiments` is required. A second network interface `management_stack` using the prefix `management` is also required and passed to the unikernel. Since the syslog reporter requires a console (to report when logging fails), a console is passed to the unikernel as well. Each reported metric includes a tag `vm=` that can be used to distinguish several unikernels reporting to the same InfluxDB.

Command line arguments:

```patch
   let doc = Key.Arg.info ~doc:"The fingerprint of the TLS certificate." [ "tls-cert-fingerprint" ] in
   Key.(create "tls_cert_fingerprint" Arg.(opt (some string) None doc))
 
+let monitor =
+  let doc = Key.Arg.info ~doc:"monitor host IP" ["monitor"] in
+  Key.(create "monitor" Arg.(opt (some ip_address) None doc))
+
+let monitor_adjust =
+  let doc = Key.Arg.info ~doc:"adjust monitoring (log level, ..)" ["monitor-adjust"] in
+  Key.(create "monitor_adjust" Arg.(opt (some int) None doc))
+
+let syslog =
+  let doc = Key.Arg.info ~doc:"syslog host IP" ["syslog"] in
+  Key.(create "syslog" Arg.(opt (some ip_address) None doc))
+
+let name =
+  let doc = Key.Arg.info ~doc:"Name of the unikernel" ["name"] in
+  Key.(create "name" Arg.(opt string "ns.nqsb.io" doc))
+
 let mimic_impl random stackv4v6 mclock pclock time =
   let tcpv4v6 = tcpv4v6_of_stackv4v6 $ stackv4v6 in
   let mhappy_eyeballs = mimic_happy_eyeballs $ random $ time $ mclock $ pclock $ stackv4v6 in
```

Requiring `monitoring-experiments`, registering command line arguments:

```patch
     package ~min:"3.7.0" ~max:"3.8.0" "git-mirage";
     package ~min:"3.7.0" "git-paf";
     package ~min:"0.0.8" ~sublibs:["mirage"] "paf";
+    package "monitoring-experiments";
+    package ~sublibs:["mirage"] ~min:"0.3.0" "logs-syslog";
   ] in
   foreign
-    ~keys:[Key.abstract remote_k ; Key.abstract axfr]
+    ~keys:[
+      Key.abstract remote_k ; Key.abstract axfr ;
+      Key.abstract name ; Key.abstract monitor ; Key.abstract monitor_adjust ; Key.abstract syslog
+    ]
     ~packages
```

Added console and a second network stack to `foreign`:

```patch
     "Unikernel.Main"
-    (random @-> pclock @-> mclock @-> time @-> stackv4v6 @-> mimic @-> job)
+    (console @-> random @-> pclock @-> mclock @-> time @-> stackv4v6 @-> mimic @-> stackv4v6 @-> job)
+
```

Passing a console implementation (`default_console`) and a second network stack (with `management` prefix) to `register`:

```patch
+let management_stack = generic_stackv4v6 ~group:"management" (netif ~group:"management" "management")
 
 let () =
   register "primary-git"
-    [dns_handler $ default_random $ default_posix_clock $ default_monotonic_clock $
-     default_time $ net $ mimic_impl]
+    [dns_handler $ default_console $ default_random $ default_posix_clock $ default_monotonic_clock $
+     default_time $ net $ mimic_impl $ management_stack]
```

Now, in the unikernel module the functor changes (console and second network stack added):

```patch
@@ -4,17 +4,48 @@
 
 open Lwt.Infix
 
-module Main (R : Mirage_random.S) (P : Mirage_clock.PCLOCK) (M : Mirage_clock.MCLOCK) (T : Mirage_time.S) (S : Mirage_stack.V4V6) (_ : sig end) = struct
+module Main (C : Mirage_console.S) (R : Mirage_random.S) (P : Mirage_clock.PCLOCK) (M : Mirage_clock.MCLOCK) (T : Mirage_time.S) (S : Mirage_stack.V4V6) (_ : sig end) (Management : Mirage_stack.V4V6) = struct
 
   module Store = Irmin_mirage_git.Mem.KV(Irmin.Contents.String)
   module Sync = Irmin.Sync(Store)
```

And in the `start` function, the command line arguments are processed and used to set up syslog and metrics monitoring to the specified addresses. Also, a TCP listener waits for monitoring and logging adjustments if `--monitor-adjust` was provided:

```patch
   module D = Dns_server_mirage.Make(P)(M)(T)(S)
+  module Monitoring = Monitoring_experiments.Make(T)(Management)
+  module Syslog = Logs_syslog_mirage.Udp(C)(P)(Management)
 
-  let start _rng _pclock _mclock _time s ctx =
+  let start c _rng _pclock _mclock _time s ctx management =
+    let hostname = Key_gen.name () in
+    (match Key_gen.syslog () with
+     | None -> Logs.warn (fun m -> m "no syslog specified, dumping on stdout")
+     | Some ip -> Logs.set_reporter (Syslog.create c management ip ~hostname ()));
+    (match Key_gen.monitor () with
+     | None -> Logs.warn (fun m -> m "no monitor specified, not outputting statistics")
+     | Some ip -> Monitoring.create ~hostname ?listen_port:(Key_gen.monitor_adjust ()) ip management);
     connect_store ctx >>= fun (store, upstream) ->
     load_git None store upstream >>= function
     | Error (`Msg msg) ->
```

Once you have compiled the unikernel (or downloaded a binary with monitoring), start it by passing `--net:service=tap0` and `--net:management=tap10` (or whichever your `tap` interfaces are), and as unikernel arguments `--ipv4=` and `--management-ipv4=192.168.42.2/24` for IPv4 configuration, `--monitor=192.168.42.14`, `--syslog=192.168.42.10`, `--name=my.unikernel`, `--monitor-adjust=12345`.

With this, your unikernel will report metrics using the influx protocol to 192.168.42.14 on port 8094 (every 10 seconds), and syslog messages via UDP to 192.168.42.10 (port 514). You should see your InfluxDB getting filled and your syslog server receiving messages.

When you configure [Grafana to use InfluxDB](https://grafana.com/docs/grafana/latest/getting-started/getting-started-influxdb/), you'll be able to see the data in the data sources.

Please reach out to us (at team AT robur DOT coop) if you have feedback and suggestions.