We recently upgraded one of our clusters to Kubernetes 1.17.0, an upgrade we had already rolled out to several other, much larger Kubernetes clusters without issues. Since this upgrade, however, we started seeing some bizarre networking behavior involving delays. None of it seemed reproducible, but it all felt like DNS trouble, such as connections that took exactly 60 seconds to get started.
The next step for us was to try to reproduce the issue. To make it easy to replicate requests inside the environment, we started a shell inside a container in the cluster and made requests to a few different services. We noticed that some requests were failing, with no particular consistency: instead of resolving, they came back with SERVFAIL from CoreDNS. We saw this as a good sign, because we finally had a specific failure we could dig into.
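Our reproduction loop boiled down to hammering the resolver and checking each response code. Here is a minimal sketch of that kind of check (the service name below is a placeholder, and the canned header line only illustrates what a failing `dig` response looks like):

```shell
# dig prints the response code in its ";; ->>HEADER<<-" line, so spotting
# a SERVFAIL is a simple grep on its output:
is_servfail() { grep -q 'status: SERVFAIL'; }

# Inside the cluster this would be driven by real lookups, e.g.:
#   for i in $(seq 1 100); do
#     dig my-service.default.svc.cluster.local | is_servfail && echo "attempt $i: SERVFAIL"
#   done
# Canned dig header line of a failing response, for illustration:
echo ';; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345' \
  | is_servfail && echo "got SERVFAIL" || echo "lookup ok"
```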
We enabled query logging in CoreDNS through its ConfigMap, and once that was done, we started seeing a stream of logs with all the DNS requests. We could now watch resolution happening and finally see exactly when CoreDNS returned SERVFAIL. Yet another good sign, as this helped us narrow down the issue further: only one particular pod was responding with SERVFAIL.
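For reference, enabling this just means adding the `log` plugin to the server block of the Corefile in the `coredns` ConfigMap. A trimmed example (your actual block will contain more plugins than shown here):

```
.:53 {
    log        # enables per-query logging
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```

After editing the ConfigMap with `kubectl -n kube-system edit configmap coredns`, CoreDNS picks the change up automatically if the `reload` plugin is enabled; otherwise, restart the CoreDNS pods.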
```shell
for i in $(kubectl -n kube-system get pods -l k8s-app=kube-dns -o name); do
  echo $i $(kubectl -n kube-system logs $i | grep -c SERVFAIL)
done
```
Having narrowed the issue down to a specific pod, we wanted to confirm it. We started sending requests directly to the pod's IP address to see whether we could get a SERVFAIL without going through the clusterIP. This also eliminated any possibility that kube-proxy was involved. As expected, at some point the pod started responding with SERVFAIL for a few moments before recovering again.
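The direct-to-pod check itself is straightforward with dig's `@server` syntax; the pod name below is a placeholder:

```shell
# Grab the pod IP from the API, then query it directly,
# bypassing the service clusterIP (and therefore kube-proxy):
POD_IP=$(kubectl -n kube-system get pod <coredns-pod-name> -o jsonpath='{.status.podIP}')
dig @"$POD_IP" kubernetes.default.svc.cluster.local
```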
I believe it’s always important to take a step back, review the issue at hand, and list everything discovered so far in order to determine the next troubleshooting steps. So, at this point:
- Intermittent DNS resolution failures with SERVFAIL replies
- Only one CoreDNS pod is responding with SERVFAIL; the other pods are OK
- Problematic CoreDNS pod responds with SERVFAIL even with direct-to-pod communication
- Eliminated the possibility of kube-proxy being an issue
At this point, the next best step was to focus on the node hosting this pod, since something on it was clearly behaving differently from the others. We started a tcpdump on that node to monitor all traffic going in and out of the pod. This way we could see whether the SERVFAIL was coming from the upstream DNS resolver, whether the request was leaving the pod at all, and generally get some extra visibility into what exactly was going on.
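The capture itself can be as simple as the following, run as root on the node in question (filtering on port 53 keeps the capture down to DNS traffic; `-i any` captures across all interfaces on Linux):

```shell
tcpdump -i any -nn -s0 port 53 -w /tmp/coredns-dns.pcap
# ...reproduce a failure, then read the capture back, e.g.:
#   tcpdump -nn -r /tmp/coredns-dns.pcap | grep -i servfail
```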
Once we caught our first failure, things started to get pretty interesting. The upstream server that CoreDNS forwards requests to was the one responding with SERVFAIL, essentially meaning the problem was a layer further up. What was peculiar, though, is that this DNS server was not the one listed in the /etc/resolv.conf file on the host (which should be copied into the CoreDNS container, since the DNS policy for the CoreDNS deployment is "Default").
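A mismatch like this is easy to spot by comparing the host's file with the copy the container sees; in the cluster that means /etc/resolv.conf on the host versus `kubectl exec <coredns-pod> -- cat /etc/resolv.conf`. A sketch with stand-in files (the nameserver addresses are made up):

```shell
# Stand-ins for the host file and the stale copy inside the container:
printf 'nameserver 10.0.0.2\n' > /tmp/host-resolv.conf
printf 'nameserver 10.0.0.99\n' > /tmp/container-resolv.conf

if diff -q /tmp/host-resolv.conf /tmp/container-resolv.conf >/dev/null; then
  echo "in sync"
else
  echo "container has a stale resolv.conf"
fi
```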
In this particular case, the pod had been restarted since the /etc/resolv.conf file was updated, yet the contents of that file inside the container were clearly pointing at the old information. At this point, we had two possible theories:
- Kubelet is reading the resolv.conf file from some other place
- Kubelet only reads the resolv.conf file on start
To check the first theory, we looked at /var/lib/kubelet/config.yaml, whose resolvConf option pointed to /etc/resolv.conf. That meant kubelet should be reading it from there.
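That check is a one-liner against the kubelet config file, shown here against a canned excerpt (on a real node you would grep the actual /var/lib/kubelet/config.yaml):

```shell
# Canned excerpt of a kubelet configuration file, for illustration:
cat > /tmp/kubelet-config.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
resolvConf: /etc/resolv.conf
EOF
grep resolvConf /tmp/kubelet-config.yaml
```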
However, it’s not that simple. Since we deploy our Kubernetes clusters via kubeadm, and kubeadm has some black magic to handle the black magic that is systemd-resolved, the command line that launches the kubelet included a flag pointing it at the resolv.conf file that systemd-resolved generates at /run/systemd/resolve/resolv.conf, overriding the configuration file.
```shell
# ps auxf | grep resolv.conf
root  46322 17.1  0.0 6322496 182336 ?  Ssl  2019 202:50 /usr/bin/kubelet
    --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd
    --resolv-conf=/run/systemd/resolve/resolv.conf ...
```
I’ll let you take a wild guess at what resolvers we found inside that file. That’s right: the ones that weren’t working correctly and that we had already moved away from. Once we cleaned those up, everything started working again. We also cleaned up /var/lib/kubelet/kubeadm-flags.env for this deployment, removing the options that already existed in our kubelet configuration file. With that, cluster health finally went back to normal.
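For illustration, the kind of line we cleaned up lives in /var/lib/kubelet/kubeadm-flags.env; the values below are placeholders, not our actual flags:

```
KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd --resolv-conf=/run/systemd/resolve/resolv.conf"
```

Anything set both here and in /var/lib/kubelet/config.yaml is a candidate for removal from the flags file, since command-line flags override the configuration file and are the harder of the two to audit. The kubelet needs a restart to pick up the change.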
Hopefully, this story of iterative, step-by-step troubleshooting can help you find the root cause of your own issues. This kind of knowledge tends to be tribal and to live within organizations, so we’re happy to start sharing more of these experiences to help those working to resolve similar problems.
If you’re looking for someone to do all of this for you instead of going through the entire learning process yourself, check out our OpenStack consulting services. We’d be happy to help you with any troubles you’re dealing with (hopefully not DNS!) :)