Feature: Distributor usage trackers #4162
Should we create a doc issue for this when it's ready?
core code looks great. have some Qs but mainly just needs changelog and docs
if len(t.series) >= maxCardinality {
	// Overflow
	// It goes into the unlabeled bucket
	// TODO - Do we want to do something else?
i really like this solution and can't come up with a better one off the top of my head. you can clearly see the severity of the unknown data and choose whether or not to pick a different label set.
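For readers skimming the thread, here is a minimal sketch of the overflow behavior being discussed. It is not the PR's code: the `tracker` type, its fields, and `observe` are hypothetical names, and the overflow bucket is modeled as a separate counter purely for illustration.

```go
package main

// tracker counts ingested bytes per series key, with at most
// maxCardinality distinct keys. All names here are hypothetical.
type tracker struct {
	maxCardinality int
	series         map[string]uint64
	overflow       uint64 // bytes that could not get their own series
}

func newTracker(maxCardinality int) *tracker {
	return &tracker{
		maxCardinality: maxCardinality,
		series:         make(map[string]uint64),
	}
}

// observe adds size bytes to the series for key. Existing series are always
// updated. A new series is only created while there is room under the cap;
// otherwise the bytes go to the overflow bucket, so nothing is dropped.
func (t *tracker) observe(key string, size uint64) {
	if _, ok := t.series[key]; ok || len(t.series) < t.maxCardinality {
		t.series[key] += size
		return
	}
	t.overflow += size
}
```

Because every byte lands either in a named series or in the overflow counter, the size of the overflow bucket shows directly how much data the current label set fails to attribute.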
modules/distributor/usage/tracker.go
Outdated
if v == "" {
	continue
}
for i, d := range dimensions {
so it looks like span data "overrides" resource level data. should we require scopes on the configured dimensions to enforce level?
I'm imagining a scenario like:
batches: [
  {
    atts: {
      "foo": "A",
    },
    spans: [
      {
        atts: {
          "foo": "B",
        },
      },
      {
        // no value for foo
      },
    ]
  }
]
will the first span overwrite the foo value to be "B" and that will carry forward to the second span even though it doesn't have a value for "foo"?
Definitely considered this, but for this use case we need to check both areas, as different applications can be instrumented differently (service A puts the attribute on the resource, and service B puts the attribute on the span). Was thinking we could evolve into that in the future: parsing as a TraceQL identifier allows us to know if it's a plain old label (current), or scoped (span.foo), or even unscoped again (.foo).
Another consideration was to keep the configuration simple, and consistent across signals. For example here is the PR for Mimir: grafana/mimir#9392
Ultimately I think at this time I wouldn't want to require scopes yet.
Ah sorry reading closer, yes I think you are right and that is a bug. It looks like "B" would carry forward to the second span.
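One possible shape for a fix, sketched under assumptions: the `span` and `batch` types below are simplified stand-ins for the real trace structures, and `labelsPerSpan` is a hypothetical helper, not a function from this PR. The idea is to resolve the resource-level values once per batch and reset to them at the start of every span, so a span-level override cannot leak into the next span.

```go
package main

// span and batch are simplified, hypothetical stand-ins for the
// real trace data structures.
type span struct {
	attrs map[string]string
}

type batch struct {
	attrs map[string]string
	spans []span
}

// labelsPerSpan returns, for each span, the values of the configured
// dimensions. Span attributes override resource attributes, but the
// override only applies to the span that carries it.
func labelsPerSpan(b batch, dimensions []string) [][]string {
	// Resolve resource-level values once per batch.
	resource := make([]string, len(dimensions))
	for i, d := range dimensions {
		resource[i] = b.attrs[d]
	}

	out := make([][]string, 0, len(b.spans))
	for _, s := range b.spans {
		// Start every span from a fresh copy of the resource values so a
		// previous span's override does not carry forward.
		vals := make([]string, len(resource))
		copy(vals, resource)
		for i, d := range dimensions {
			if v := s.attrs[d]; v != "" {
				vals[i] = v
			}
		}
		out = append(out, vals)
	}
	return out
}
```

With the scenario above (resource `foo=A`, first span `foo=B`, second span without `foo`), this yields "B" for the first span and "A" for the second. The per-span copy allocates, so a real implementation would likely reuse a buffer instead.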
dimensions := u.labelsFn(tenant)
if len(dimensions) == 0 {
	// Not configured
	// TODO - Should we put it all in the unattributed bucket instead?
I would lean on the side of defaulting to service name, that seems more useful than the unattributed bucket 🤔
Whatever is decided here, it should be the same for all our DBs.
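If the service-name default wins out, the fallback could be as small as the sketch below. This only illustrates one of the options under discussion: `effectiveDimensions` and `defaultDimension` are hypothetical names, and using the OpenTelemetry `service.name` resource attribute as the default is an assumption, not something decided in this thread.

```go
package main

// defaultDimension is an assumed fallback. "service.name" is the standard
// OpenTelemetry resource attribute that carries the service name.
const defaultDimension = "service.name"

// effectiveDimensions returns the tenant's configured dimensions, or the
// assumed default when none are configured.
func effectiveDimensions(configured []string) []string {
	if len(configured) == 0 {
		return []string{defaultDimension}
	}
	return configured
}
```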
I'm happy to add docs in this PR if we want. At the minimum I will update the config and url sections.
Pushed some changes to configure via
What this PR does:
This is a new feature that allows a tenant to accurately track the amount of ingested traffic by a set of custom labels. It's similar to the existing traces_spanmetrics_size_total metric created by the generators but improves on it in some key ways.
**Need**
The core need is to export a set of highly accurate metrics on ingested traffic that tenants can use for cost-attribution. This means that every ingested byte can be attributed to something, i.e. a team or department, so that tracing costs can be reconciled. Any attribute in the tracing data can be used.
**Reasons for a new feature**:
**Important concepts about this new feature**
/usage_metrics. This is so that they aren't mixed with the existing operational /metrics, because they are expected to have much higher cardinality and a different purpose.
proto.Size(), and the series tracking contains buffering that enables it to be zero-allocation for existing series. If you run the benchmark you will see:
BenchmarkUsageTracker 1995282 5904 ns/op 0 B/op 0 allocs/op
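To illustrate the kind of buffering that can keep the existing-series path allocation-free, here is a minimal sketch. It is not the PR's tracker: `usageTracker`, `newUsageTracker`, and `observe` are hypothetical names, and the NUL-separated key encoding is an assumption. It relies on two things: a byte buffer reused across calls, and the Go compiler's optimization that a map lookup keyed by `string(b)` does not allocate a new string.

```go
package main

// usageTracker is a hypothetical, simplified tracker keyed by the
// concatenated label values of a series.
type usageTracker struct {
	buf    []byte             // reused across calls to build the series key
	series map[string]*uint64 // byte counters per series
}

func newUsageTracker() *usageTracker {
	return &usageTracker{
		buf:    make([]byte, 0, 256),
		series: make(map[string]*uint64),
	}
}

// observe adds size bytes to the series identified by values.
func (t *usageTracker) observe(values []string, size uint64) {
	// Rebuild the key in the reused buffer instead of allocating a new one.
	t.buf = t.buf[:0]
	for _, v := range values {
		t.buf = append(t.buf, v...)
		t.buf = append(t.buf, 0) // separator; assumes values contain no NUL bytes
	}

	// Lookup with string(t.buf) directly in the index expression, which the
	// compiler can perform without copying the bytes into a new string.
	if c, ok := t.series[string(t.buf)]; ok {
		*c += size
		return
	}

	// New series: this path allocates the key string and the counter.
	c := size
	t.series[string(t.buf)] = &c
}
```

Only the first observation of a series allocates. Later observations reuse the buffer and update the counter through the stored pointer, provided the key fits in the buffer's capacity.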
TODOs
There are some remaining questions:
If overrides.cost_attribution.dimensions isn't set then the distributor records nothing. But maybe it makes sense to have default behavior here? Service name may be a good default.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]