We’ve been running a bit of a bake-off of observability tools for the past few years. Over the last couple of months we integrated Google Cloud Monitoring (formerly Stackdriver Metrics) to track custom application metrics. The impetus for this was twofold:
- Reduced cognitive load: it’s baked into Google Cloud so it’s theoretically one less tool to have (although GCP Console is so vast I think of it as many tools)
- Reduced operational costs
The upshot is that it is inordinately complicated to get working (both the engineering and the visualizations), and I don’t think the effort to implement it is worth any savings in ongoing costs. Even though Stackdriver was founded in 2012, the tooling feels really basic compared to competitors (likely because of the long integration into Google Cloud). The off-the-shelf reports are very high-level, and the interface is buggy and slow.
Price
Let’s start with cost. Monitoring, log aggregation, and observability services are very expensive. Here are some quotes from tech leaders about observability products:
“Neither [New Relic nor DataDog] is cheap, per se”
“We switched [from an ELK solution] to DataDog…. The developers like it, but it isn’t as cheap as it used to be.”
“If you want a price break, go with Honeycomb…. Want to spend all your money? New Relic can help with that”
“From our experience, pricing pretty quickly dominates everything else…. DataDog is very good, and expensive…. Dynatrace is even more expensive.”
For many of these services, I’ve found that the cost of log collection and monitoring ends up being more than the cost of hosting the resources being monitored. Maybe I just need to get over that and accept it, because what I’m aiming to reduce is labor.
As we were already using Google Cloud Logging for log aggregation, and Google Cloud Monitoring already covers basic VM and container metrics without charge, we’d only be adding costs for custom metrics. That made it an easy choice to try out.
The cost savings didn’t play out. There’s significant effort to get the basic operational reports that we use out of the system. My experience with New Relic is years old, but they were good in the APM space at getting important, tactical metrics into the out-of-the-box dashboards. Sysdig has operational metrics, but those were less useful: average CPU across a cluster rather than the 99th percentile, which is what I care about. The biggest issue I encountered, though, was getting custom metrics data into Google Cloud Monitoring.
Custom Metrics
I specifically was looking at Monitoring for custom metrics. There are things we want to track within our application to determine where performance bottlenecks are and where we should focus our work. Some examples:
- the running time of background jobs
- cache hit/miss rates
- bytes billed for BigQuery analysis
- background job queue depth
We’ve done this with New Relic, Sysdig, Hosted Graphite, and Redis with hand-rolled reports.
Adding custom metrics to Google Cloud Monitoring is a chore, and there are not a lot of examples of others doing it (the first clue that this isn’t very popular). Mark Chmarny wrote a brief post about writing metrics and Aja Hammerly wrote one on reading metrics (I never found the follow-up she promised about writing metrics). In the end, we resorted to the API documentation and the source and documentation for the Ruby library.
One of the challenges for me was wrapping my head around the concepts they use for data collection. First, you have to create (or have already created) a Metric Descriptor, which defines the schema for the metric data.
client = Google::Cloud::Monitoring::Metric.new(version: :v3)
project_path = Google::Cloud::Monitoring::V3::MetricServiceClient.project_path(project_id)

# Define the schema for the custom metric before any data is written
client.create_metric_descriptor(
  project_path,
  description: "...",
  labels: [{key: "...", description: "..."}],
  metric_kind: Google::Api::MetricDescriptor::MetricKind::CUMULATIVE,
  type: "custom.googleapis.com/...",
  unit: "s",
  value_type: Google::Api::MetricDescriptor::ValueType::DOUBLE
)
You can look these fields up in the API reference. The most confusing part here is the metric_kind. There are three kinds, shown below, but DELTA cannot be used for custom metrics:
| Kind | Description |
| --- | --- |
| GAUGE | An instantaneous measurement of a value. |
| DELTA | The change in a value during a time interval. |
| CUMULATIVE | A value accumulated over a time interval. Cumulative measurements in a time series should have the same start time and increasing end times, until an event resets the cumulative value to zero and sets a new start time for the following points. |
I understand the GAUGE type: I’d use that for CPU or memory or disk usage, a value at a point in time. CUMULATIVE makes sense for counting things: how many API calls did the process make during a rate-limiting window? The requirements on CUMULATIVE make coding it really complicated, especially if the metrics collection is not embedded in the application code. In order to report the same start time, the metrics recorder must maintain state, not just labels (to be aggregated by Monitoring).
These two usable kinds seem to miss a whole lot of interesting metrics. From my examples above, is the running time of a job CUMULATIVE? That’s what we chose, and we had to report it at the start to “reset” the counter and then again at the end of the job with the value to record. What about the cost of a resource? Is that a GAUGE because it’s a cost at a particular time?
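To make the CUMULATIVE case concrete, here is a minimal sketch of the point we report when a job finishes; the variable names are mine, and the create_time_series call that actually writes it is covered below.
# The job's running time as a CUMULATIVE point: the interval starts when
# the job started and ends "now", and the value is the elapsed seconds.
# (We also write a zero-valued point at job start to reset the series,
# as described above.)
job_start = Time.now
# ... the job does its work ...
job_end = Time.now

runtime_point = {
  interval: {
    start_time: Google::Protobuf::Timestamp.new(seconds: job_start.to_i),
    end_time: Google::Protobuf::Timestamp.new(seconds: job_end.to_i)
  },
  value: {double_value: job_end - job_start}
}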
The next concept is the Time Series. You can think of it as a measurement (a data point) associated with a set of labels.
client = Google::Cloud::Monitoring::Metric.new(version: :v3)
project_path = Google::Cloud::Monitoring::V3::MetricServiceClient.project_path(project_id)

# A single measurement: the interval it covers and its value
data_point = {
  interval: {
    start_time: Google::Protobuf::Timestamp.new(seconds: start_time.to_i),
    end_time: Google::Protobuf::Timestamp.new(seconds: end_time.to_i)
  },
  value: {double_value: ...}
}

# The time series ties the point to a metric type, its labels, and the
# monitored resource it was collected on
time_series = {
  metric: {
    type: METRIC_NAME,
    labels: {job_id: job_id, context: context, app: app}
  },
  resource: {
    type: "gce_instance",
    labels: {
      instance_id: instance_id || "unknown",
      zone: zone || "unknown",
      project_id: project_id
    }
  },
  points: [data_point]
}

client.create_time_series(project_path, [time_series])
While the TimeSeries structure supports multiple data points, the documentation states that each time series in a write request must contain exactly one point, so in practice you record one data point per time series per API call.
There’s one more complication: you can’t submit metrics for a given time series (defined by its labels) more than once per second. This required us to add a label to guarantee each time series was unique, which leads to a very high-cardinality dimension. The documentation warns against this, and most monitoring systems (Honeycomb being the exception) don’t handle high cardinality very well. So far, I haven’t seen detrimental effects of this in Monitoring.
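Put together, a minimal sketch of how a write ends up looking for us (build_time_series here is a hypothetical helper that assembles the time_series hash shown earlier; it is not part of the Google client):
# Each series carries exactly one point, and the job_id label keeps each
# series unique so we never write to the same series twice within a second.
series_a = build_time_series(job_id: "job-123", runtime: 12.5)
series_b = build_time_series(job_id: "job-456", runtime: 3.2)

# One point per time series, but multiple time series per call.
client.create_time_series(project_path, [series_a, series_b])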
We’ve wrapped up a few metrics here to test this, but I’m not happy with the results. Compared to the StatsD interface that Sysdig, Honeycomb, and most monitoring systems support for custom metrics, it’s byzantine. With Sysdig’s StatsD interface we have shell scripts that write metrics; there’s no way I could do that with this API.
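For contrast, this is roughly what the same measurement looks like as a StatsD write (a sketch assuming an agent listening on localhost:8125; the metric name is made up):
require "socket"

# One UDP datagram in the StatsD text format: name:value|type
socket = UDPSocket.new
socket.send("background_job.runtime:12.5|ms", 0, "127.0.0.1", 8125)
socket.close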