Worker node network failure monitoring

Hi,

We’re looking for an efficient tool to monitor network connectivity between worker nodes, specifically to detect when a process on one worker node cannot communicate with a process on another worker node. The tool should not be tied to any specific CNI (Container Network Interface). Are there any recommended tools or best practices for this kind of inter-node network monitoring? We found the tools below suggested online, but can you recommend a better one?

| Tool | Real-Time | Traffic Visualization | Metrics | Policy Awareness |
|---|---|---|---|---|
| Cilium + Hubble | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Istio + Kiali | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Weave Scope | :white_check_mark: | :white_check_mark: | :x: | :x: |
| Pixie | :white_check_mark: | :white_check_mark: | :white_check_mark: | :x: |
| Prometheus + Grafana | :x: (depends) | :white_check_mark: | :white_check_mark: | :x: |

Regards,
Debasis

Cilium or Istio is probably your best bet, as both sit in the traffic path of every pod and can therefore monitor traffic in real time. Weave Scope is a dead project (the company behind it ran out of funding). Prometheus isn’t designed for traffic monitoring, but you should still use it to scrape the metrics your chosen solution exposes, so you can monitor that solution’s own health.
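If you also want a baseline check that does not depend on any CNI, one option is a small probe on each node that simply tries to open a TCP connection to a known port on every other node. The following is only a rough sketch of that idea, not a feature of any tool listed above: the function names `probe` and `probe_peers` are made up for illustration, and you would have to supply the peer addresses and a port that something on each node is actually listening on.

```python
from __future__ import annotations

import socket
import time


def probe(host: str, port: int, timeout: float = 2.0) -> tuple[bool, float]:
    """Attempt one TCP connection and return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection performs the full TCP handshake or raises OSError
        # (refused, unreachable, timed out, name resolution failure, ...).
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start


def probe_peers(peers: list[tuple[str, int]], timeout: float = 2.0) -> dict[str, bool]:
    """Probe every (host, port) peer once and map 'host:port' to reachability."""
    return {
        f"{host}:{port}": probe(host, port, timeout)[0]
        for host, port in peers
    }
```

You could call `probe_peers` on a timer with a list of the other nodes' addresses and export the results as metrics for Prometheus to scrape. Keep in mind that this only shows whether a TCP handshake to that one port succeeds; it says nothing about per-pod traffic or network policy, which is where Hubble or Kiali come in.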

What you’re after here is “tracing”, one of the pillars of observability:

  • Logging (Elastic, Datadog, Splunk, etc.)
  • Monitoring (Prometheus, Grafana)
  • Tracing (Hubble, Kiali)