This commit is contained in:
Hannes Mehnert 2023-11-28 22:25:19 +01:00
parent b3ee690da1
commit ca196709ce

@@ -31,15 +31,15 @@ Since late August we are running some unikernels using µTCP, e.g. the [retreat]
One of our secondary nameservers attempts to receive zones (via AXFR using TCP) from another nameserver that is currently not running. Thus each SYN packet is answered with a corresponding RST. Below I graphed the network utilization (sent data/packets on the positive y-axis, received on the negative) over time (x-axis) on the left, and the memory usage (bytes on the y-axis) over time (x-axis) on the right, of our nameserver - you can observe that both increase over time, and roughly every 3 hours the unikernel hits its configured memory limit (64 MB), crashes with out of memory, and is restarted. The graph below is using the mirage-tcpip stack.
[<img src="/static/img/a.ns.mtcp.png" width="500" />](/static/img/a.ns.mtcp.png)
[<img src="/static/img/a.ns.mtcp.png" width="750" />](/static/img/a.ns.mtcp.png)
Now, after switching over to µTCP, graphed below, there is much less network utilization and the memory limit is only reached after 36 hours, which is a great result. Still, it is not very satisfying that the unikernel leaks memory. Both graphs contain on their left side a few hours of mirage-tcpip, and shortly after 20:00 on Nov 23rd µTCP got deployed.
[<img src="/static/img/a.ns.mtcp-utcp.png" width="500" />](/static/img/a.ns.mtcp-utcp.png)
[<img src="/static/img/a.ns.mtcp-utcp.png" width="750" />](/static/img/a.ns.mtcp-utcp.png)
Investigating the involved parts showed that a TCP connection that was never established has been registered at the MirageOS layer, but the pure core does not expose an event from the received RST that the connection has been cancelled. This means the MirageOS layer piles up all the connection attempts, and doesn't inform the application that the connection couldn't be established. Once this was well understood, developing the [required code changes](https://github.com/robur-coop/utcp/commit/67fc49468e6b75b96a481ebe44dd11ce4bb76e6c) was straightforward. The graph shows that the fix was deployed at 15:25. The memory usage is constant afterwards, but the network utilization increased enormously.
[<img src="/static/img/a.ns.utcp-utcp.png" width="500" />](/static/img/a.ns.utcp-utcp.png)
[<img src="/static/img/a.ns.utcp-utcp.png" width="750" />](/static/img/a.ns.utcp-utcp.png)
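To make the leak described above concrete, here is a minimal Python sketch. It is not the OCaml code of µTCP or of its MirageOS layer: the class `UpperLayer`, its fields, and the `propagate_reset` flag are made up for this illustration. It only assumes what the text says, namely that every connection attempt is registered, and that without an event for the received RST the entry is never removed and the application is never told.

```python
from dataclasses import dataclass, field


@dataclass
class UpperLayer:
    """Hypothetical stand-in for the layer that tracks connection attempts."""

    propagate_reset: bool                          # False models the bug
    pending: dict = field(default_factory=dict)    # connection id -> state
    errors: list = field(default_factory=list)     # what the application was told

    def connect(self, conn_id):
        # Every attempt is registered before the handshake completes.
        self.pending[conn_id] = "syn-sent"

    def on_rst(self, conn_id):
        if not self.propagate_reset:
            # Bug: the RST is not surfaced as an event, so the entry stays.
            return
        # Fix: drop the entry and inform the application.
        if self.pending.pop(conn_id, None) is not None:
            self.errors.append((conn_id, "connection refused"))


def simulate(attempts, propagate_reset):
    """Run `attempts` connects that are each answered with a RST."""
    layer = UpperLayer(propagate_reset)
    for conn_id in range(attempts):
        layer.connect(conn_id)
        layer.on_rst(conn_id)
    return len(layer.pending), len(layer.errors)


leaky = simulate(1000, propagate_reset=False)
fixed = simulate(1000, propagate_reset=True)
```

In this sketch `leaky` ends up with 1000 registered entries and no error reported, while `fixed` ends up with an empty table and 1000 reported errors. Reporting the error is also what lets the application react to the failed attempt.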
Now, the network utilization is unwanted. This was hidden by the application waiting forever for the TCP connection to be established. Our bugfix uncovered another issue, a tight loop:
- the nameserver attempts to connect to the other nameserver (`request`);
@@ -48,11 +48,11 @@ Now, the network utilization is unwanted. This was hidden by the application wai
This is unnecessary since the DNS server code has a timer to attempt to connect to the remote nameserver periodically (but takes a break between attempts). After understanding this behaviour, we worked on [the fix](https://github.com/mirage/ocaml-dns/pull/347) and redeployed the nameserver. The graph shows the tight loop on the left edge (so you have a comparison); at 16:05 we deployed the fix - since then it looks pretty smooth, both in memory usage and in network utilization.
[<img src="/static/img/a.ns.utcp-fixed.png" width="500" />](/static/img/a.ns.utcp-fixed.png)
[<img src="/static/img/a.ns.utcp-fixed.png" width="750" />](/static/img/a.ns.utcp-fixed.png)
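The difference between the tight loop and the timer-driven retry can be sketched with a bit of arithmetic on a virtual clock. This is a Python illustration only, not the ocaml-dns code: the function name `attempts_in_window`, the 1 ms cost per refused attempt, and the 10 s break between attempts are all assumptions chosen for the example, since the text does not state the real values.

```python
def attempts_in_window(window_ms, retry_delay_ms, attempt_cost_ms=1):
    """Count connection attempts to a peer that answers every SYN with a RST.

    All times are integer milliseconds on a virtual clock.
    attempt_cost_ms is an assumed round trip for the SYN/RST exchange;
    retry_delay_ms is the break taken before the next attempt
    (0 models the tight loop, a positive value models the timer).
    """
    now = 0
    attempts = 0
    while now < window_ms:
        attempts += 1                            # send SYN, receive RST
        now += attempt_cost_ms + retry_delay_ms  # then wait (or not) and retry
    return attempts


tight_loop = attempts_in_window(60_000, retry_delay_ms=0)
with_timer = attempts_in_window(60_000, retry_delay_ms=10_000)
```

Under these assumed numbers, one minute of the tight loop produces 60000 attempts, while the timer-driven variant produces 6. The exact figures do not matter; the point is that retrying immediately is bounded only by the round trip time.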
To give you the entire picture, below is the graph where you can spot the mirage-tcpip stack (lots of network, restarting every 3 hours), µTCP-without-informing-application (ran for 3 * ~36 hours), dns-server-high-network-utilization (which only lasted for a brief period, thus it is more of a point in the graph), and finally the unikernel with both fixes applied.
[<img src="/static/img/a.ns.all.png" width="500" />](/static/img/a.ns.all.png)
[<img src="/static/img/a.ns.all.png" width="750" />](/static/img/a.ns.all.png)
# Conclusion