Slicekit Team

OpenTelemetry from day one, and the cardinality mistake that blows up your bill

Wiring traces, metrics and logs in before the first feature is the easy part. The expensive trap is high-cardinality metric labels, and one rule keeps it from wrecking your storage bill.

There are two ways observability hurts. The first is the one everyone warns you about: you promise to “add it later,” then the first production incident arrives, an endpoint is slow or an error rate is climbing, and there is nothing to look at. You thread tracing through code that was never built for it, under pressure, while the incident is still open.

The second hurts later and more quietly. You added observability on time and you were diligent, yet six months in your Prometheus instance is eating memory you did not budget for and your Loki bill has tripled. Nobody broke anything. Someone just put a user id on a metric label. This post covers both problems, because a team that ships telemetry carelessly only trades the retrofit pain for a storage bill that surprises them, and the fix for the second problem is a single rule you can adopt on day one.

Why wired-from-day-one beats retrofitting

Slicekit wires the pipeline in before you write a single feature, so your very first request is already traced, measured and logged. The reason is not tidiness. It is that the signals are only worth what their correlation buys you, and correlation has to be configured together, up front.

Adding traces today and logs next quarter leaves them unlinked: you get the data but not the thread that connects it. When the whole request path is instrumented from the start, a single browser action produces one trace spanning the HTTP endpoint, the database calls and any outbound requests, and from the slow span you jump straight to its correlated logs. The ad hoc alternative, a few log lines added near the code that broke, only ever teaches you about the one path you happened to instrument. There is no trace tying the request together and no metric to alert on before a user notices. Doing the wiring once, when it is cheap, is what makes that thread exist at all.

How it actually works in .NET

A point of confusion worth clearing up: in .NET there is no proprietary OpenTelemetry tracing API you have to learn. OpenTelemetry is a vendor-neutral standard for traces, metrics and logs, transported over OTLP, and in .NET it plugs into the runtime APIs you already have. System.Diagnostics.ActivitySource and Activity are the tracer and span. System.Diagnostics.Metrics.Meter is the metrics API. Logs flow through the ILogger you are already calling. All three signals are stable in .NET, so this is not a preview you are betting on. (Profiles are an emerging fourth signal and are not yet stable; treat them as experimental, not production.)
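To make the correlation concrete, here is a minimal console sketch using only the runtime types. It assumes .NET 6 or later (where the default id format is W3C), and the source name Demo.Correlation, the Log helper and the in-memory logLines list are hypothetical stand-ins: in a real application ILogger and the OpenTelemetry SDK play those roles. The point it illustrates is that Activity.Current flows through async calls, so a log line written anywhere inside the request can be stamped with the same trace id.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

var source = new ActivitySource("Demo.Correlation"); // hypothetical source name

// Stand-in for the OpenTelemetry SDK: without a listener no Activity is created.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Correlation",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

var logLines = new List<string>();

// Hypothetical stand-in for a logger: stamps each line with the ambient trace id.
void Log(string message)
{
    var traceId = Activity.Current?.TraceId.ToString() ?? "none";
    logLines.Add($"trace={traceId} {message}");
}

async Task QueryDatabaseAsync()
{
    // Child span: inherits the trace id of the enclosing request span.
    using var child = source.StartActivity("db-query");
    await Task.Delay(10);
    Log("query finished");
}

string requestTraceId;
using (var request = source.StartActivity("http-request"))
{
    requestTraceId = request!.TraceId.ToString();
    Log("request started");
    await QueryDatabaseAsync();
}
Log("outside any request");

foreach (var line in logLines) Console.WriteLine(line);
```

The first two lines carry the same trace id even though one is written inside an awaited method, and the last line reports none because no span is active once the request span has been disposed.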

Slicekit turns on the standard auto-instrumentation (ASP.NET Core, HttpClient and EF Core), so endpoints and database calls are traced without you writing anything. When you want a custom span or counter in a slice, you add it with those same runtime types, no vendor SDK:

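// Requires: using System.Diagnostics; using System.Diagnostics.Metrics;
// Declared once per slice as static fields.
private static readonly ActivitySource Source = new("Slicekit.Search");
private static readonly Counter<long> Rebuilds =
    new Meter("Slicekit.Search").CreateCounter<long>("search.index.rebuilds");

// Inside the handler: project.id goes on the span; the counter carries no tags.
// StartActivity returns null when nothing is listening, hence the span?. call.
using var span = Source.StartActivity("rebuild-search-index");
span?.SetTag("project.id", projectId);
Rebuilds.Add(1);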
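If you want to see what that snippet emits without any backend running, the runtime ships listener types you can subscribe with. The sketch below is a console-only illustration, not how Slicekit wires things: ActivityListener and MeterListener stand in for the OpenTelemetry SDK, and the project id value is made up. It also shows why the snippet uses the null-conditional call: StartActivity returns null when nothing is listening.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

var source = new ActivitySource("Slicekit.Search");
var rebuilds = new Meter("Slicekit.Search").CreateCounter<long>("search.index.rebuilds");

// No listener registered yet, so StartActivity returns null.
bool spanWithoutListener;
using (var unobserved = source.StartActivity("rebuild-search-index"))
{
    spanWithoutListener = unobserved is not null;
}

// Stand-in for the SDK's trace pipeline: collect spans as they stop.
var stopped = new List<Activity>();
using var activityListener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Slicekit.Search",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = activity => stopped.Add(activity),
};
ActivitySource.AddActivityListener(activityListener);

// Stand-in for the SDK's metrics pipeline: sum the counter's measurements.
long rebuildTotal = 0;
using var meterListener = new MeterListener();
meterListener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Slicekit.Search") l.EnableMeasurementEvents(instrument);
};
meterListener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => rebuildTotal += value);
meterListener.Start();

var projectId = "proj-42"; // hypothetical id, for illustration only
using (var span = source.StartActivity("rebuild-search-index"))
{
    span?.SetTag("project.id", projectId);
    rebuilds.Add(1);
}

var recordedProjectId = stopped[0].GetTagItem("project.id");
Console.WriteLine($"span created without a listener: {spanWithoutListener}");
Console.WriteLine($"spans captured: {stopped.Count}, project.id: {recordedProjectId}");
Console.WriteLine($"rebuild counter total: {rebuildTotal}");
```

In a Slicekit application you would not write these listeners yourself; the OpenTelemetry SDK subscribes in the same way and exports what it receives.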

The application (the .NET API in the diagram) emits all three signals over OTLP to an OpenTelemetry Collector, which fans them out to purpose-built backends, with Grafana in front of all three.

[Diagram: .NET API (Serilog + OTel SDK) → OTel Collector (receives, routes) → Tempo (traces), Prometheus (metrics), Loki (logs) → Grafana (dashboards, alerts)]

One request becomes one trace spanning the endpoint, the handler, the database call and every published message, correlated with its logs and metrics.
Signal    Stored in    Explored in
Traces    Tempo        Grafana
Metrics   Prometheus   Grafana
Logs      Loki         Grafana

The collector is technically optional, but recommended in production: it is the seam where you batch, sample and re-export, and where you would later repoint everything at a managed backend without touching application code. Everything is in the Docker Compose stack, so docker compose up -d gives you Grafana with dashboards already provisioned at localhost:3010.

The cardinality mistake, and the rule that prevents it

Here is where teams hurt themselves. A metric is not one number; it is one time series per unique combination of label values. Put a bounded label on a counter, say http.route with a few dozen routes, and you have a few dozen series. Put an unbounded value on it, a user id, a request id, a full URL with ids baked into the path, an email address, and you get a new series for every distinct value that has ever appeared. That is cardinality, and it is the single biggest driver of cost in a metrics backend.

This is not a tuning detail. The number of active time series is what drives Prometheus memory and storage, so an innocent-looking user_id label on a request counter can turn a handful of series into millions and take the instance down. The official Prometheus guidance is blunt: every unique combination of label values is a new time series, so keep cardinality low and never use labels for values that can grow without bound. The same trap exists in Loki, where high-cardinality stream labels fragment storage into many small streams and drive up log cost the same way.
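The "one series per unique combination of label values" arithmetic is easy to check locally. The following is a rough sketch, not a model of Prometheus internals: a MeterListener approximates the backend by keeping one entry per instrument and distinct tag combination. The meter name, instrument names, routes and user ids are all invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("Demo.Cardinality"); // hypothetical meter name
var byRoute = meter.CreateCounter<long>("demo.requests.by_route");
var byUser = meter.CreateCounter<long>("demo.requests.by_user");

// Approximation of a metrics backend: instrument name -> distinct tag combinations.
var series = new Dictionary<string, HashSet<string>>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Cardinality") l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    var parts = new List<string>();
    foreach (var tag in tags) parts.Add($"{tag.Key}={tag.Value}");
    parts.Sort(StringComparer.Ordinal); // tag order must not create a new series
    if (!series.TryGetValue(instrument.Name, out var combos))
        series[instrument.Name] = combos = new HashSet<string>();
    combos.Add(string.Join(",", parts));
});
listener.Start();

string[] routes = { "/projects", "/projects/{id}", "/search" };
for (var request = 0; request < 10_000; request++)
{
    var route = routes[request % routes.Length];
    var userId = $"user-{request % 2_500}"; // 2,500 distinct simulated users

    byRoute.Add(1, new KeyValuePair<string, object?>("http.route", route)); // bounded
    byUser.Add(1, new KeyValuePair<string, object?>("user.id", userId));    // unbounded
}

var routeSeries = series["demo.requests.by_route"].Count;
var userSeries = series["demo.requests.by_user"].Count;
Console.WriteLine($"by route: {routeSeries} series");
Console.WriteLine($"by user:  {userSeries} series");
```

The same 10,000 requests produce 3 series on the route-labelled counter and 2,500 on the user-labelled one, and the second number keeps growing with every new user while the first does not.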

The rule that prevents all of it is simple and worth internalizing once:

Keep unbounded, high-cardinality values off metric labels. Put them on traces and logs instead, and use exemplars to jump from a metric to the traces behind it.

A user id belongs on a span attribute and in a log line, where one extra distinct value costs you one more searchable record, not one more permanent time series. Your metrics stay aggregatable (error rate by route, latency by handler) while the per-request detail lives in the trace you can pivot to. In the earlier rebuild-search-index snippet, project.id sits on the span, while the counter it accompanies carries no unbounded dimension. That is the pattern to follow when you add your own instrumentation.
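As a sketch of the same split with tags on the metric too, the example below records a request duration with only bounded dimensions and puts the user id on the span. Everything named here is hypothetical (the Demo.Orders source and meter, the demo.request.duration histogram, the outcome tag and the RecordRequest helper), and the two listeners exist only so the console program can show where each value ended up.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical names for illustration; not part of Slicekit.
var source = new ActivitySource("Demo.Orders");
var meter = new Meter("Demo.Orders");
var duration = meter.CreateHistogram<double>("demo.request.duration", unit: "ms");

// Stand-in listeners so the sketch can inspect what was recorded.
var spans = new List<Activity>();
using var activityListener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Orders",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = activity => spans.Add(activity),
};
ActivitySource.AddActivityListener(activityListener);

var metricTagKeys = new HashSet<string>();
using var meterListener = new MeterListener();
meterListener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Orders") l.EnableMeasurementEvents(instrument);
};
meterListener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
{
    foreach (var tag in tags) metricTagKeys.Add(tag.Key);
});
meterListener.Start();

void RecordRequest(string routeTemplate, int statusCode, string userId, double elapsedMs)
{
    using var span = source.StartActivity("handle-request");

    // Unbounded values: one more searchable record, not one more time series.
    span?.SetTag("user.id", userId);
    span?.SetTag("http.route", routeTemplate);

    // Bounded values only: the route template and a two-valued outcome.
    duration.Record(elapsedMs,
        new KeyValuePair<string, object?>("http.route", routeTemplate),
        new KeyValuePair<string, object?>("outcome", statusCode >= 500 ? "error" : "ok"));
}

RecordRequest("/projects/{id}", 200, "user-8841", 12.5);
RecordRequest("/projects/{id}", 503, "user-17", 840.0);

var firstUser = spans[0].GetTagItem("user.id");
Console.WriteLine($"metric tag keys: {string.Join(", ", metricTagKeys)}");
Console.WriteLine($"first span user.id: {firstUser}");
```

The histogram can still answer "latency by route" and "error rate by route", and finding the one slow request for a specific user is a trace search rather than a metric query.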

What Slicekit defaults, and what you still own

Slicekit ships the three signals already wired through a collector to Tempo, Prometheus and Loki, all explored in Grafana, with sane defaults and dashboards provisioned from deploy/. Application logs go through Serilog and out over OTLP to Loki, and audit events ride that same pipeline rather than living in a separate table, so audit retention becomes a Loki configuration instead of an application purge job. (More on that in a tamper-evident audit trail.) Alertmanager is in the stack so you can alert on the metrics that matter.

What the template cannot do for you is enforce discipline inside the slices you write next. The correlation, the backends and the wiring are handled. The cardinality of your own metrics is yours to guard: when you add a Meter counter for a new feature, that is the moment to ask whether any label on it can grow without bound, and to move that value onto the span or the log instead. The pipeline being free on day one is what makes it tempting to be sloppy with it. One rule keeps the bill flat.

See the observability guide for the collector configuration and how to add custom spans and metrics to your own slices.