Learning on the Job - Part 2
I have a Google Keep note where I keep a list of things that I want to write about. Naturally, most of the items there are about things that happen at work. I must be honest: I was not expecting to be exposed to this myriad of technologies and situations at this speed. Every week feels like a month of learning. Well, “learning”. I barely have time to take stock of the things I encounter and write about them, and writing is when I really feel I have learned. Well, let’s write, then.
Observing a system in real time
The main system of the team I am a part of is a distributed system, meaning that it is divided into “modules” that interact with each other via gRPC.
These modules have varying levels of observability maturity, but it is good enough that we can query traces in Grafana with no problems. (I have plans to write a blog post about the “observability flow”.)
We were debugging a situation where service1 sent a message to service2 and, by the time service2 responded, service1 had already timed out and closed the connection. I had never seen a client timeout before.
The first hypothesis was natural: service2 is taking too long to process service1's message. However, when we checked service2's traces, we saw that it responded almost immediately once the message arrived. So the next thought was: the message is taking too long to arrive at service2. Maybe some congestion at the network layer?
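To make that reasoning concrete, here is a toy model of the timeline. It is only a sketch: the `Timeline` type, the function name and every number in it are made up for illustration and are not measurements from our system. The point is that the span recorded by service2 only covers the processing time, so it can look perfectly healthy while the client still blows its deadline.

```python
from dataclasses import dataclass


@dataclass
class Timeline:
    """Hypothetical breakdown of one request, in milliseconds."""
    request_transit_ms: float   # time for the request to reach service2
    processing_ms: float        # the only part a service2 span measures
    response_transit_ms: float  # time for the response to travel back


def client_times_out(t: Timeline, deadline_ms: float) -> bool:
    # The client's clock covers the whole round trip, not just processing.
    total = t.request_transit_ms + t.processing_ms + t.response_transit_ms
    return total > deadline_ms


# Made-up numbers: service2 answers in 5 ms, but the network is slow.
slow_network = Timeline(request_transit_ms=900, processing_ms=5, response_transit_ms=200)
healthy = Timeline(request_transit_ms=2, processing_ms=5, response_transit_ms=2)

print(client_times_out(slow_network, deadline_ms=1000))  # True
print(client_times_out(healthy, deadline_ms=1000))       # False
```

In both cases the service2 trace would show the same 5 ms of work, which is exactly why application traces alone could not settle the question.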
At this moment we realized that our observability setup only covered our applications. If something were to happen between the boundaries of our applications, we had no way to see it. My manager asked me to research ways for us to have some level of observability at the network layer.
After some research I came across node_exporter. It reads hardware and operating system data from the Linux pseudo-filesystems (/proc and /sys) and exposes them at an HTTP endpoint, so Prometheus can scrape them and make them available for querying and plotting.
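To give an idea of what "exposes them at an endpoint" means, the endpoint returns plain text with one sample per line. The sketch below parses a tiny hand-written sample of that text. The metric names are real node_exporter names, but the values and devices are made up, and the parser is deliberately simplified: it only understands lines carrying a single `device` label, which is an assumption for this example and not how Prometheus itself parses the format.

```python
import re

# Hand-written sample in the Prometheus text format (values are invented).
SAMPLE = """\
# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.234e+06
node_network_receive_bytes_total{device="lo"} 512
node_network_receive_drop_total{device="eth0"} 3
"""

# Simplified: metric name, exactly one device label, then the value.
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{device="(?P<device>[^"]*)"\}\s+(?P<value>\S+)$'
)


def parse_device_metrics(text: str) -> dict:
    """Return {(metric_name, device): value} for the lines we understand."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything outside this simplified shape
        samples[(match["name"], match["device"])] = float(match["value"])
    return samples


metrics = parse_device_metrics(SAMPLE)
print(metrics[("node_network_receive_drop_total", "eth0")])  # 3.0
```

In practice Prometheus does this scraping and parsing for us; the sketch is only meant to show how simple the data on the wire is.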
With it we were able to capture, per network device, the flow of bytes and packets, the number of errors and even the number of packets being dropped by the CPU after the NIC (Network Interface Card) handed the packets over to it. As a side effect, I learned a little bit about softirqs (software interrupt requests). This is a good read about it.
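The "dropped by the CPU" numbers come from /proc/net/softnet_stat, which has one row per CPU with hexadecimal columns: the first is packets processed, the second is packets dropped, and the third is how many times the softirq ran out of budget (time squeeze). Here is a minimal sketch of reading it. The sample rows are invented, and the row index is used as a stand-in for the CPU number, which is a simplification on my part.

```python
def parse_softnet_stat(text: str) -> list:
    """Parse the first three hex columns of /proc/net/softnet_stat."""
    stats = []
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for cpu, fields in enumerate(rows):
        if len(fields) < 3:
            continue  # not a row we know how to read
        stats.append({
            "cpu": cpu,  # row index, used here as a stand-in for the CPU number
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),
            "time_squeeze": int(fields[2], 16),
        })
    return stats


# Invented sample with two CPUs; real rows have more columns.
SAMPLE = (
    "0000a1b2 00000000 00000003 00000000\n"
    "0000ffff 00000010 00000000 00000000\n"
)

softnet = parse_softnet_stat(SAMPLE)
total_dropped = sum(row["dropped"] for row in softnet)
print(total_dropped)  # 16
```

On a Linux host the same function could be fed the contents of the real file. node_exporter exposes these counters as node_softnet_processed_total, node_softnet_dropped_total and node_softnet_times_squeezed_total, so we never have to parse the file ourselves.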
Now, with the data above, we cannot follow a message/request trace-style, because we are only collecting metrics, not spans. But at least, when things fail, we will be able to see if the network is generating errors in some unusual way. We will also be able to see if the CPU, and not the NICs, is part of the problem. I hope we get to use this data soon.
So, the flow is mostly the following: node_exporter captures the data from files, Prometheus scrapes it and we visualize it with Grafana.
While setting up the visualizations, I had to learn about virtual network devices (Docker and Nomad use them) in order to query the right data (we are interested only in physical hardware data).
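The filtering idea can be sketched like this. The names and prefixes below are assumptions based on common defaults (the loopback device, veth pairs, Docker bridges, Nomad's bridge) and are not the exact list we used; the right list depends on the host.

```python
# Assumed naming conventions for virtual devices; adjust per host.
VIRTUAL_NAMES = {"lo"}
VIRTUAL_PREFIXES = ("veth", "docker", "br-", "nomad")


def is_physical(device: str) -> bool:
    """Guess, from the name alone, whether a device is physical hardware."""
    if device in VIRTUAL_NAMES:
        return False
    return not device.startswith(VIRTUAL_PREFIXES)


# Hypothetical device names as they could appear in the "device" label.
devices = ["lo", "eth0", "docker0", "veth1a2b3c", "nomad", "enp3s0"]
physical = [d for d in devices if is_physical(d)]
print(physical)  # ['eth0', 'enp3s0']
```

Matching on names is a heuristic. A more reliable check on Linux is whether the device lives under /sys/devices/virtual/net, but a name filter has the advantage of translating directly into a label matcher in the dashboard query.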
I’m glad I wrote this post (it is better than not writing it), but I can’t help feeling that it is very shallow. I want to be able to get into the nitty-gritty details of how these technologies work. After all, I want to write the kind of content that I myself like to consume. However, it is becoming very clear that one hour per week on a Sunday will not be sufficient to produce this kind of content.
I’m gonna think about how to deal with this ‘content shallowness’ going forward.
Thank you for your interest and time!