How to Meter Execution Duration
Measuring CPU hours of containers, pods, and serverless in seconds
Charging customers based on execution duration, such as CPU hours, is a standard pricing model for cloud and DevOps companies. AWS's per-second billing for EC2 VMs is a prime example. Despite its long history, billing by execution duration comes with its fair share of challenges. Metering workloads like container runtime, CPU hours, or serverless duration can be complex: customers expect per-second granularity, and usage has to be attributed correctly across the start and end of each billing period. A typical metering method is to detect start and stop lifecycle events and calculate the difference between the two. However, this approach runs into trouble when start or stop events get lost or when long-running workloads span multiple billing periods. This article shares our learnings on metering execution duration effectively so that you avoid under- or overcharging your customers.
One of the most common solutions we hear from customers is to listen to start and stop events, store them as database records, and calculate the delta between the two timestamps. This method appears practical at first but becomes complex quickly once you need to handle dropped start/stop events, long-running workloads, or shifting billing periods. For long-running processes, you also have to retain the start event for months or even years.
For instance, the following workload lifecycle events:
|1234||start||2023-01-01 - 00:00|
|1234||stop||2023-03-05 - 06:34|
can be translated into the following usage:
|Billing Period Start||Billing Period End||Duration in Seconds|
|2023-01-01||2023-01-31||2,678,400s (31 days)|
|2023-02-01||2023-02-28||2,419,200s (28 days)|
|2023-03-01||2023-03-31||369,240s (4 days, 6h 34m)|
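The per-period split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: billing periods are taken to be calendar months, timestamps are naive datetimes in a single time zone, and the function name `split_by_month` is hypothetical.

```python
from datetime import datetime


def split_by_month(start: datetime, stop: datetime) -> dict[str, int]:
    """Split the interval [start, stop) into seconds per calendar month."""
    usage: dict[str, int] = {}
    cursor = start
    while cursor < stop:
        # First instant of the month following `cursor`.
        if cursor.month == 12:
            next_month = datetime(cursor.year + 1, 1, 1)
        else:
            next_month = datetime(cursor.year, cursor.month + 1, 1)
        # The slice ends at the month boundary or at the stop event,
        # whichever comes first.
        period_end = min(next_month, stop)
        key = cursor.strftime("%Y-%m")
        usage[key] = int((period_end - cursor).total_seconds())
        cursor = period_end
    return usage


# The start and stop timestamps of workload 1234 from the events above.
usage = split_by_month(
    datetime(2023, 1, 1, 0, 0),
    datetime(2023, 3, 5, 6, 34),
)
for period, seconds in usage.items():
    print(period, seconds)
```

Note that the start event has to stay available until the stop event arrives, which is the retention requirement mentioned earlier.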
Indeed, you can make lifecycle event-driven metering work, but it requires managing complexities around:
- Collecting and storing lifecycle events
- Handling lost start and stop events
- Retaining events for long-running workloads
- Calculating usage in billing periods
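To make the lost-event problem concrete, here is a minimal Python sketch that pairs start and stop events per workload and reports whatever it cannot match. The `Event` type, the `pair_events` function, and the anomaly messages are hypothetical names chosen for illustration, not part of any particular metering system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    workload_id: str
    kind: str  # "start" or "stop"
    at: datetime


def pair_events(events: list[Event]):
    """Pair start/stop events per workload and report unmatched ones."""
    open_starts: dict[str, datetime] = {}
    durations: list[tuple[str, float]] = []
    anomalies: list[str] = []
    for ev in sorted(events, key=lambda e: e.at):
        if ev.kind == "start":
            if ev.workload_id in open_starts:
                # A second start means the previous stop was probably lost.
                anomalies.append(f"{ev.workload_id}: start without prior stop")
            open_starts[ev.workload_id] = ev.at
        elif ev.kind == "stop":
            started = open_starts.pop(ev.workload_id, None)
            if started is None:
                # The start event was probably lost; duration is unknown.
                anomalies.append(f"{ev.workload_id}: stop without start")
            else:
                seconds = (ev.at - started).total_seconds()
                durations.append((ev.workload_id, seconds))
    # Still open: either genuinely running, or the stop event was lost.
    still_open = sorted(open_starts)
    return durations, anomalies, still_open


events = [
    Event("a", "start", datetime(2023, 1, 1, 0, 0, 0)),
    Event("a", "stop", datetime(2023, 1, 1, 0, 1, 0)),
    Event("b", "stop", datetime(2023, 1, 1, 0, 2, 0)),
    Event("c", "start", datetime(2023, 1, 1, 0, 3, 0)),
]
durations, anomalies, still_open = pair_events(events)
```

In this sketch, workload `b` cannot be billed at all and workload `c` is indistinguishable from one that is still running, which is exactly the ambiguity that makes lifecycle-driven metering hard to get right.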
So, what's the alternative if you'd rather not deal with these complexities?
Let's explore the heartbeat-style usage metering approach.
Heartbeat-style metering takes a different approach to measuring workload execution duration. Instead of marking start and stop events, you periodically check whether a workload is still running. If it is, the metering system increments a counter associated with that workload.
At a high level, implementing heartbeat-style metering involves two main steps:
1. Scraping workload status: Set up a process that regularly checks whether a workload is running. This check occurs at a predefined frequency, such as every second.
2. Incrementing the counter: If the workload is running when checked, increment a counter associated with that workload. The counter tracks the total duration of the workload, which is the execution duration used for billing.
For instance, if a workload has a heartbeat signal every 5 seconds and the counter reaches a value of 10, this suggests the workload ran for 50 seconds (5 seconds x 10 heartbeats).
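The two steps and the arithmetic above can be sketched as follows. This is a simplified illustration: the scraper is simulated with a plain loop rather than a timer, and the names `record_heartbeats`, `duration_seconds`, and the workload ID `pod-1234` are hypothetical.

```python
from collections import Counter

HEARTBEAT_INTERVAL_SECONDS = 5  # assumed check frequency


def record_heartbeats(counters: Counter, running_workloads: list[str]) -> None:
    """One scrape: increment the counter of every workload seen running."""
    for workload_id in running_workloads:
        counters[workload_id] += 1


def duration_seconds(counters: Counter, workload_id: str) -> int:
    """Execution duration = number of heartbeats x heartbeat interval."""
    return counters[workload_id] * HEARTBEAT_INTERVAL_SECONDS


counters: Counter = Counter()
# Simulate ten scrapes during which the workload was running each time.
for _ in range(10):
    record_heartbeats(counters, ["pod-1234"])

print(duration_seconds(counters, "pod-1234"))  # 10 heartbeats x 5 s = 50
```

A workload that was never observed simply has a count of zero, so no start event needs to be stored or retained.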
This system operates independently of start and stop events, meaning it doesn't suffer from the same issues regarding missing events or difficulties with long-running workloads. You can use streaming aggregation to efficiently and accurately process usage events generated by heartbeats.
Check out our Kubernetes Pod Runtime Duration example on GitHub.
One of the primary considerations with heartbeat-style metering is the significant increase in the number of usage events generated. A month-long billing period is approximately 2.6 million seconds, so a single workload checked every second produces roughly that many events per month. To handle this volume, you can use event streaming to aggregate usage events into time windows. Using a monitoring solution to count usage is certainly possible, but it can lead to the inaccuracies discussed in this article.
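As a rough illustration of aggregating heartbeats into time windows, the Python sketch below folds per-second heartbeats into one-minute tumbling windows in memory. A production setup would typically do this in a stream processor; the window size, the `aggregate` function, and the `(workload_id, unix_timestamp)` event shape are assumptions made for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def aggregate(heartbeats, interval_seconds: int = 1) -> dict:
    """Sum heartbeat durations per (workload, window start).

    `heartbeats` is an iterable of (workload_id, unix_timestamp) pairs.
    """
    windows: dict = defaultdict(int)
    for workload_id, ts in heartbeats:
        # Align the timestamp to the start of its window.
        window_start = ts - (ts % WINDOW_SECONDS)
        windows[(workload_id, window_start)] += interval_seconds
    return dict(windows)


# 150 one-second heartbeats for a single workload, starting at t=0.
heartbeats = [("pod-1234", t) for t in range(150)]
windows = aggregate(heartbeats)

for (workload_id, window_start), seconds in sorted(windows.items()):
    print(workload_id, window_start, seconds)
```

Here 150 raw events collapse into three window records, and the same pattern scales the 2.6 million events of a month down to about 43,000 one-minute records per workload.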
Having looked at both methods of metering workload execution duration, start/stop event detection and heartbeat-style metering, let's compare them side by side to understand their respective advantages and potential drawbacks.
Accuracy: Start/stop event detection depends on every lifecycle event being captured and fails when events are dropped. Heartbeat-style metering is more reliable because it periodically checks the state of the workload, which reduces the risk of under- or overbilling due to missing lifecycle events.
Complexity: Start/stop detection, although less complex regarding volume, can pose challenges due to lost events and long-running workloads. On the other hand, heartbeat metering is simpler to implement but produces significantly more usage events, which can be challenging to process.
Scalability: Start/stop event detection might be easier to scale, provided that the event handling and retention challenges are adequately addressed. In contrast, heartbeat-style metering can generate a high volume of events, which may require streaming aggregation.
In conclusion, despite its high volume of usage events, heartbeat-style metering offers a more straightforward and more accurate way to meter workload execution duration. To process the usage events generated by heartbeats efficiently, you can use streaming aggregation or an out-of-the-box usage metering solution like https://github.com/openmeterio/openmeter.