The Metering Challenge
It is challenging to build a metering system that is scalable, accurate and cost-effective:
- Cost: Database writes at scale can be expensive.
- Accuracy: Metrics systems may not provide the needed accuracy due to sampling.
- Latency: Periodic batch processing in data warehouses can introduce latency.
Data is classified as auditable when the loss of any record is intolerable and complete retention is required. An auditable dataset is expected to be comprehensive and complete. Examples of auditable data include transaction logs, replication logs, and billing/finance events.
Operational data, by contrast, doesn't require strict completeness. To keep costs manageable, sampling is often employed, and some degree of data loss is acceptable. Tools designed for operational data prioritize efficiency, frequently skipping retries and the costly guarantees of exactly-once delivery. Examples of operational data include telemetry, metrics, and contextual data describing each request and system component.
OpenMeter is designed to provide the auditable guarantees billing requires while retaining the scalability and real-time characteristics typically seen only with operational data.
Monitoring systems are tailored to gather, process, and store operational data. However, these systems often lack the consistency guarantees crucial for auditable use cases like billing, where legally binding contracts require accurate and complete data.
Metrics systems like Prometheus scale well to large numbers of time-series data points, but they advise keeping label cardinality low, which makes it difficult to track individual users' resource consumption at scale.
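A rough calculation shows how quickly a per-user label blows up the number of active time series (all figures below are hypothetical, chosen only to illustrate the multiplicative growth):

```python
# Hypothetical sizing: active series grow multiplicatively with each label.
metrics = 5            # e.g. requests, bytes_in, bytes_out, duration, errors
users = 100_000        # one label value per user
endpoints = 50         # a second label dimension

series = metrics * users * endpoints
print(series)  # 25000000 — 25 million active series, far past typical guidance
```

This is why per-user billing labels are a poor fit for a metrics system: each new label dimension multiplies, rather than adds to, the series count.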
Scaling databases to accommodate write-intensive event ingestion and real-time queries can be costly. Not only do you need to store every single record in the database, you must also scan through them for aggregation at query time, driving up load and cost.
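A back-of-the-envelope comparison makes the scale concrete (all volumes below are hypothetical; the point is the ratio between raw events scanned at query time and pre-aggregated rows):

```python
# Hypothetical volumes for one month of usage data.
users = 50_000
events_per_user_per_day = 10_000
days = 30

# Rows a query must scan when aggregating raw events on the fly.
raw_rows = users * events_per_user_per_day * days

# Rows stored if usage is rolled up to one row per user per hour.
hourly_rows = users * 24 * days

print(raw_rows)     # 15000000000 — 15 billion raw events
print(hourly_rows)  # 36000000 — 36 million hourly aggregates, a ~400x reduction
```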
This is why OpenMeter uses stream processing to pre-aggregate usage data before storing it in a long-term database. As part of the data processing, we send usage events to a message queue first to avoid overwhelming the database and to protect against data loss in case of failures. The long-term aggregates are then stored in an OLAP database designed to store and query large volumes of analytics data efficiently.
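The flow above can be sketched as a minimal in-memory tumbling-window aggregation. This is an illustrative sketch, not OpenMeter's implementation: the event fields, the one-minute window, and the plain list standing in for the message queue are all assumptions, and the returned dict stands in for rows written to an OLAP store.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size (illustrative)

def window_start(ts: int) -> int:
    """Truncate a Unix timestamp to the start of its tumbling window."""
    return ts - ts % WINDOW_SECONDS

def pre_aggregate(events: list[dict]) -> dict:
    """Fold raw usage events into per-(subject, window) sums.

    Instead of writing every event to the database, only one row per
    subject and window reaches long-term storage.
    """
    aggregates: dict = defaultdict(float)
    for event in events:  # in production, events are consumed from the queue
        key = (event["subject"], window_start(event["time"]))
        aggregates[key] += event["value"]
    return dict(aggregates)

# Three raw events collapse into two stored rows.
queue = [
    {"subject": "customer-1", "time": 100, "value": 2.0},
    {"subject": "customer-1", "time": 110, "value": 3.0},
    {"subject": "customer-2", "time": 130, "value": 1.0},
]
rows = pre_aggregate(queue)
print(rows)  # {('customer-1', 60): 5.0, ('customer-2', 120): 1.0}
```

Because aggregation happens as events stream in, the OLAP database only stores and queries the compact windowed rows, which keeps both write volume and query cost bounded.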
Storing all records in a data warehouse and processing them in batches may be cost-effective, but it leads to outdated meters. Most companies run daily batch processing, resulting in stale data that is unsuitable for immediate-response product use cases, such as usage gating, customer dashboards, and billing thresholds.
Stale usage aggregation also means slower feedback loops around consumption, which can lead to unexpected charges and customer dissatisfaction.