Observability on Event Hubs
Overview
Observability is an important part of any system, and with Event Hubs we have to monitor several kinds of logs to understand what is going on. To summarize this post: we used Azure Monitor for basic infrastructure-level logs, Time Series Insights for validating the contents of Event Hubs, and Application Map for latency.
Architecture
In this architecture, Azure Functions sits between two Event Hubs, filtering the data from the first and pushing it to the second.
As data flows from the first hub to the next, its content changes, but there is no way to validate this just by looking at Event Hubs.
We want to validate the contents of data while we are developing.
We used Time Series Insights (TSI) to confirm the data points at each stage of Event Hubs. Looking at the graph, we could spot the spike. We could do a similar validation by checking each log with a standard debugger, but as the data grows this gets more complicated and time-consuming. In our scenario, TSI served as a debugging tool on top of Event Hubs that gave us quick feedback while we were developing.
We want to observe the throughput of each component.
We used Application Map from Application Insights to see the latency and volume of each component. Beyond latency and volume, we can dig deeper into each component's latency and the error logs associated with it. We use the latency data to plan improvements. (The image here is taken from the official documentation, but the idea still applies.)
These are the steps we took to improve our system from a latency point of view.
Step 1: Spot where it takes longer.
Step 2: Check the logs at the points with longer latency by clicking on the components.
Step 3: Fix those bugs and check the latency again.
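The first step above can be sketched in a few lines if the latency samples are exported from the map. This is only an illustration: the component names and numbers below are made up, and the function is a hypothetical helper, not part of Application Insights.

```python
from statistics import mean


def slowest_components(latencies_ms, top_n=3):
    """Return (component, average latency in ms) pairs, slowest first.

    latencies_ms maps a component name to a list of latency samples.
    Components without samples are skipped.
    """
    averages = {
        name: mean(samples)
        for name, samples in latencies_ms.items()
        if samples
    }
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical samples, for illustration only.
samples = {
    "ingest-hub": [12, 15, 11],
    "filter-function": [240, 310, 275],
    "output-hub": [20, 18, 25],
}

for name, avg in slowest_components(samples):
    print(f"{name}: {avg:.1f} ms")
```

The component at the top of this list is the one whose logs are worth opening first in Step 2.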
We want to observe whether the system can take the load.
How much load Event Hubs can take is easy to look up here: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-scalability#throughput-units. For one throughput unit, the limits are “Ingress: Up to 1 MB per second or 1000 events per second (whichever comes first).” and “Egress: Up to 2 MB per second or 4096 events per second.”
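As a rough sanity check, the quoted limits can be turned into a back-of-the-envelope estimate of how many throughput units an expected load needs. This is a minimal sketch: the function name and the example load are assumptions for illustration, and it only applies the four per-unit limits quoted above.

```python
import math

# Per-throughput-unit limits, as quoted from the documentation above.
INGRESS_MB_PER_TU = 1
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MB_PER_TU = 2
EGRESS_EVENTS_PER_TU = 4096


def required_throughput_units(ingress_mb_s, ingress_events_s,
                              egress_mb_s, egress_events_s):
    """Estimate the throughput units needed for a per-second load.

    Whichever limit is hit first decides the result, so we take the
    largest of the four ratios and round up. At least one unit is
    always required.
    """
    needed = max(
        ingress_mb_s / INGRESS_MB_PER_TU,
        ingress_events_s / INGRESS_EVENTS_PER_TU,
        egress_mb_s / EGRESS_MB_PER_TU,
        egress_events_s / EGRESS_EVENTS_PER_TU,
    )
    return max(1, math.ceil(needed))


# Hypothetical load: small events, so the event count is the limit.
print(required_throughput_units(0.5, 2500, 0.5, 2500))
```

In this example the data volume fits in one unit, but 2500 events per second on ingress exceeds the limit of two units, so the estimate is driven by event count rather than by size.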
Although we have those numbers, we wanted to confirm whether our expected load could actually be handled. We came up with two scenarios in which the load cannot be handled, and both can be monitored with Azure Monitor.
1. Azure Functions cannot handle the load
This can be observed by plotting a graph of ingress and egress. If egress is lower than ingress over the same period, we can narrow the issue down to the output side. The first graph shows egress substantially lower than ingress. The second graph shows egress and ingress being equal, from which we conclude that Azure Functions can handle the current load.
2. Event Hubs cannot handle the load
This can be observed by plotting a graph of the throttle count. If this number is consistently greater than zero, there could be a load issue at the Event Hubs layer.
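Both checks can also be expressed as simple rules over the metric values once they are exported from Azure Monitor. The sketch below is an assumption-laden illustration: the function names, the 10% tolerance, and the "half of the samples" rule for "consistently" are our own choices, and the input lists stand in for metric series sampled at the same interval.

```python
def functions_falling_behind(ingress, egress, tolerance=0.1):
    """Scenario 1: egress is lower than ingress over the same window.

    ingress and egress are message counts per interval. Returns True
    when total egress trails total ingress by more than `tolerance`
    (a fraction of ingress; 0.1 is an arbitrary illustrative value).
    """
    total_in = sum(ingress)
    total_out = sum(egress)
    if total_in == 0:
        return False
    return (total_in - total_out) / total_in > tolerance


def hub_throttled(throttle_counts, min_fraction=0.5):
    """Scenario 2: throttle count is consistently greater than zero.

    "Consistently" is interpreted here as: at least `min_fraction` of
    the sampled intervals have a non-zero throttle count.
    """
    if not throttle_counts:
        return False
    nonzero = sum(1 for count in throttle_counts if count > 0)
    return nonzero / len(throttle_counts) >= min_fraction


# Hypothetical series, for illustration only.
print(functions_falling_behind([100, 100, 100], [40, 50, 45]))
print(hub_throttled([0, 0, 3, 0]))
```

A single non-zero throttle sample, as in the second call, is treated as noise rather than a sign that the hub is overloaded.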