
Commit 73ba287

Otel with collector works well
1 parent 78b175a commit 73ba287

22 files changed

Lines changed: 195 additions & 260 deletions

charts/mlrun-ce/README.md

Lines changed: 25 additions & 79 deletions
@@ -107,20 +107,16 @@ helm --namespace mlrun \
  --set opentelemetry-operator.enabled=true \
  --set opentelemetry.namespaceLabel.enabled=true \
  --set opentelemetry.collector.enabled=true \
- --set opentelemetry.collector.scrapeMode=otel \
  --set opentelemetry.instrumentation.enabled=true \
  mlrun/mlrun-ce
  ```

- > **Important:** When enabling OpenTelemetry, set `opentelemetry.collector.scrapeMode=otel` to collect metrics
- > via the OTEL sidecar and prevent duplicate metrics. The default is `direct` (for when OTEL is disabled).
-
  The installation will:
  - Deploy the OpenTelemetry Operator
- - Create an OpenTelemetryCollector CR (sidecar mode)
+ - Create an OpenTelemetryCollector CR (deployment mode: one collector per namespace)
  - Create an Instrumentation CR for Python auto-instrumentation
- - Label the namespace with `opentelemetry.io/inject=enabled`
- - Configure Prometheus to scrape OTEL sidecar metrics (port 8889)
+ - Label and annotate the namespace so all Python pods are auto-instrumented
+ - Configure Prometheus to scrape OTEL collector metrics (port 8889)

  #### Step 5: Verify OpenTelemetry Installation

@@ -140,21 +136,14 @@ kubectl -n mlrun get instrumentations
  kubectl -n mlrun get pods | grep opentelemetry
  ```

- #### Step 6: Verify Jupyter has OTEL Sidecar Annotations
+ #### Step 6: Verify OTel Pod Labels and Namespace Annotation

  ```bash
- kubectl -n mlrun get deployment -l app.kubernetes.io/component=jupyter-notebook \
- -o jsonpath='{.items[0].spec.template.metadata.annotations}' | jq .
- ```
+ # Check that the namespace has the instrumentation annotation (enables auto-instrumentation for all Python pods)
+ kubectl get namespace mlrun -o jsonpath='{.metadata.annotations}' | jq .

- You should see annotations like:
- ```json
- {
- "instrumentation.opentelemetry.io/inject-python": "my-mlrun-otel-instrumentation",
- "prometheus.io/port": "8889",
- "prometheus.io/scrape": "true",
- "sidecar.opentelemetry.io/inject": "my-mlrun-otel-collector"
- }
+ # Check pod labels: Jupyter, SeaweedFS, and Nuclio function pods should have mlrun.io/otel=true
+ kubectl -n mlrun get pods --show-labels | grep mlrun.io/otel
  ```

  ### Installing MLRun-ce on minikube
@@ -185,7 +174,7 @@ Override those [in the normal methods](https://helm.sh/docs/chart_template_guide
  ### Configuring OpenTelemetry (Observability)

  MLRun CE includes the OpenTelemetry Operator for collecting metrics and traces from your ML workloads.
- The operator runs in **sidecar mode**, automatically injecting collector containers into annotated pods.
+ The operator runs one collector **Deployment** per namespace. Instrumented pods send OTLP metrics to the collector, which exports them to Prometheus.

  > **Note:** OpenTelemetry is **disabled by default**. See below for how to enable it.

@@ -212,10 +201,10 @@ kubectl label namespace <your-namespace> opentelemetry.io/inject=enabled
  #### Default Configuration

  By default, OpenTelemetry is **disabled**. When enabled, it provides:
- - Namespace labeling for OTEL operator webhook targeting
- - Sidecar collector injection for instrumented pods
- - Python auto-instrumentation for Jupyter notebooks
- - Prometheus metrics export on port 8889
+ - A single OTel Collector Deployment per namespace (OTLP receiver → Prometheus exporter on port 8889)
+ - Namespace-level Python auto-instrumentation (all Python pods in the namespace are instrumented automatically)
+ - `mlrun.io/otel: "true"` label on Jupyter, SeaweedFS, and Nuclio function pods
+ - Prometheus scrapes the collector pod (not individual pods)

  #### Enabling OpenTelemetry

@@ -228,7 +217,6 @@ helm --namespace mlrun install my-mlrun \
  --set opentelemetry-operator.enabled=true \
  --set opentelemetry.namespaceLabel.enabled=true \
  --set opentelemetry.collector.enabled=true \
- --set opentelemetry.collector.scrapeMode=otel \
  --set opentelemetry.instrumentation.enabled=true \
  mlrun/mlrun-ce
  ```
@@ -240,7 +228,6 @@ helm --namespace mlrun upgrade my-mlrun \
  --set opentelemetry-operator.enabled=true \
  --set opentelemetry.namespaceLabel.enabled=true \
  --set opentelemetry.collector.enabled=true \
- --set opentelemetry.collector.scrapeMode=otel \
  --set opentelemetry.instrumentation.enabled=true \
  mlrun/mlrun-ce
  ```
@@ -253,13 +240,12 @@ helm --namespace mlrun upgrade my-mlrun \
  --set opentelemetry.collector.enabled=false \
  --set opentelemetry.instrumentation.enabled=false \
  --set opentelemetry.namespaceLabel.enabled=false \
- --set opentelemetry.collector.scrapeMode=direct \
  mlrun/mlrun-ce
  ```

  #### Custom Resource Limits

- Configure collector sidecar resources:
+ Configure collector resources:

  ```bash
  helm --namespace mlrun install my-mlrun \
@@ -282,63 +268,23 @@ helm --namespace mlrun install my-mlrun \

  #### Adding OpenTelemetry to Custom Workloads

- To instrument your own deployments with the OTEL sidecar and Python auto-instrumentation:
-
- 1. Ensure your namespace has the OpenTelemetry label:
- ```bash
- kubectl label namespace <your-namespace> opentelemetry.io/inject=enabled
- ```
-
- 2. Add these annotations to your pod spec:
- ```yaml
- metadata:
- annotations:
- sidecar.opentelemetry.io/inject: "<release-name>-otel-collector"
- instrumentation.opentelemetry.io/inject-python: "<release-name>-otel-instrumentation"
- prometheus.io/scrape: "true"
- prometheus.io/scrape-mode: "otel"
- prometheus.io/port: "8889"
- ```
-
- #### Preventing Prometheus/OTEL Metric Overlap
-
- To prevent duplicate metrics when using both Prometheus direct scraping and OpenTelemetry,
- MLRun CE uses a **scrape-mode** annotation system:
-
- | Scrape Mode | Description | Use Case |
- |-------------|-------------|----------|
- | `direct` | Direct Prometheus scraping only | **Default** - When OTEL is disabled |
- | `otel` | Metrics collected via OTEL sidecar only | **Recommended when OTEL enabled** |
- | `both` | Both OTEL and direct scraping | Debugging/transition only |
-
- > **Note:** The default scrape mode is `direct`. When enabling OpenTelemetry, you must set
- > `--set opentelemetry.collector.scrapeMode=otel` to collect metrics via the OTEL sidecar.
-
- **How it works:**
- - OTEL-collected metrics have the `mlrun_otel_` prefix and `metrics_source=otel_collector` label
- - Direct-scraped metrics have `metrics_source=direct_scrape` label
- - Prometheus scrape configs filter based on `prometheus.io/scrape-mode` annotation
+ Python instrumentation is applied **namespace-wide**: any Python pod in the MLRun namespace is automatically instrumented when OTel is enabled. No per-pod annotations are required.

- **Configure scrape mode when enabling OTEL:**
+ For pods in other namespaces, annotate that namespace and reference the Instrumentation CR as `<namespace>/<name>`, since the CR lives in the MLRun namespace:
  ```bash
- helm --namespace mlrun install my-mlrun \
- --set opentelemetry-operator.enabled=true \
- --set opentelemetry.collector.enabled=true \
- --set opentelemetry.collector.scrapeMode=otel \
- --set opentelemetry.instrumentation.enabled=true \
- mlrun/mlrun-ce
+ kubectl annotate namespace <your-namespace> \
+ instrumentation.opentelemetry.io/inject-python=<mlrun-namespace>/<release-name>-otel-instrumentation
  ```

- **Query metrics by source in Prometheus:**
- ```promql
- # OTEL-collected metrics only
- {metrics_source="otel_collector"}
-
- # Direct-scraped metrics only
- {metrics_source="direct_scrape"}
+ The `mlrun.io/otel: "true"` label is applied to: **Jupyter**, **SeaweedFS** (master, volume, filer, s3, admin), and **Nuclio function pods** (via `functionDefaults.metadata.labels`). This label is used for Prometheus metric filtering and enrichment.

- # OTEL metrics use prefix
+ **Query OTEL-collected metrics in Prometheus:**
+ ```promql
+ # OTEL metrics use the mlrun_otel_ prefix
  mlrun_otel_http_server_duration_seconds_bucket{...}
+
+ # Filter by source
+ {metrics_source="otel_collector"}
  ```

  #### Split Installation (Admin/Non-Admin)
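
The README change above moves Python auto-instrumentation from per-pod annotations to a single namespace annotation, so a quick check is to compare that annotation with the expected Instrumentation CR name. The sketch below assumes the `<release-name>-otel-instrumentation` pattern the README documents and no `nameOverride`/`fullnameOverride`; the function names are hypothetical and not part of the chart.

```shell
# Hypothetical helpers (not part of the chart): compare the namespace
# annotation with the Instrumentation CR name the README documents.

# Print the expected CR name for a Helm release, following the README's
# "<release-name>-otel-instrumentation" pattern (assumes no name overrides).
expected_instrumentation() {
  printf '%s-otel-instrumentation\n' "$1"
}

# $1 = Helm release name, $2 = annotation value read from the namespace.
# Prints "ok" on a match; otherwise prints a diagnostic and returns 1.
check_inject_annotation() {
  expected=$(expected_instrumentation "$1")
  if [ "$2" = "$expected" ]; then
    echo "ok"
  else
    echo "mismatch: got '$2', expected '$expected'"
    return 1
  fi
}

# Example against a live cluster (requires kubectl access):
#   value=$(kubectl get namespace mlrun \
#     -o jsonpath='{.metadata.annotations.instrumentation\.opentelemetry\.io/inject-python}')
#   check_inject_annotation my-mlrun "$value"
```

Keeping the comparison separate from the `kubectl` call lets the check run without a cluster, for example in a CI step that already has the annotation value.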

charts/mlrun-ce/non_admin_installation_values.yaml

Lines changed: 2 additions & 2 deletions
@@ -88,8 +88,8 @@ opentelemetry-operator:
  enabled: false

  # OpenTelemetry CRs - enabled for user namespace
- # The namespace will be labeled with opentelemetry.io/inject=enabled
- # so the operator can inject sidecars into pods
+ # The namespace will be labeled and annotated for OTel deployment-mode collection
+ # and namespace-wide Python auto-instrumentation.
  opentelemetry:
  namespaceLabel:
  enabled: true

charts/mlrun-ce/templates/NOTES.txt

Lines changed: 15 additions & 26 deletions
@@ -134,47 +134,36 @@ OpenTelemetry Operator is enabled!
  - Namespace selector: opentelemetry.io/inject=enabled
  {{- if .Values.opentelemetry.collector.enabled }}
  {{- "\n" }}
- OpenTelemetry Collector (sidecar mode):
- - Collector CR: {{ .Release.Name }}-otel-collector
+ OpenTelemetry Collector (deployment mode):
+ - Collector CR: {{ include "mlrun-ce.otel.collector.fullname" . }}
  - Mode: {{ .Values.opentelemetry.collector.mode }}
- - OTLP gRPC endpoint: localhost:{{ .Values.opentelemetry.collector.otlp.grpcPort }} (inside pod)
- - OTLP HTTP endpoint: localhost:{{ .Values.opentelemetry.collector.otlp.httpPort }} (inside pod)
- - Prometheus metrics port: {{ .Values.opentelemetry.collector.prometheus.port }}
- - Prometheus scrape mode: {{ .Values.opentelemetry.collector.scrapeMode }}
- {{- if eq .Values.opentelemetry.collector.scrapeMode "direct" }}
-
- ⚠️ WARNING: Scrape mode is "direct" - OTEL sidecar metrics will NOT be collected!
- To collect metrics via OTEL, reinstall with: --set opentelemetry.collector.scrapeMode=otel
- {{- end }}
+ - OTLP gRPC endpoint: {{ include "mlrun-ce.otel.collector.fullname" . }}-collector:{{ .Values.opentelemetry.collector.otlp.grpcPort }}
+ - OTLP HTTP endpoint: {{ include "mlrun-ce.otel.collector.fullname" . }}-collector:{{ .Values.opentelemetry.collector.otlp.httpPort }}
+ - Prometheus metrics port: {{ .Values.opentelemetry.collector.prometheus.port }} (scraped by Prometheus from the collector pod)
  {{- end }}
  {{- if .Values.opentelemetry.instrumentation.enabled }}
  {{- "\n" }}
  OpenTelemetry Auto-Instrumentation:
- - Instrumentation CR: {{ .Release.Name }}-otel-instrumentation
+ - Instrumentation CR: {{ include "mlrun-ce.otel.instrumentation.fullname" . }}
  {{- if .Values.opentelemetry.instrumentation.python.enabled }}
- - Python auto-instrumentation: enabled
+ - Python auto-instrumentation: enabled (namespace-wide via namespace annotation)
  {{- end }}
  {{- if .Values.opentelemetry.instrumentation.java.enabled }}
  - Java auto-instrumentation: enabled
  {{- end }}
  {{- end }}
  {{- if .Values.opentelemetry.namespaceLabel.enabled }}
  {{- "\n" }}
- Namespace Label:
- - Namespace {{ .Release.Namespace }} is labeled with: {{ .Values.opentelemetry.namespaceLabel.key }}={{ .Values.opentelemetry.namespaceLabel.value }}
+ Namespace OTel configuration:
+ - Label: {{ .Values.opentelemetry.namespaceLabel.key }}={{ .Values.opentelemetry.namespaceLabel.value }}
+ {{- if .Values.opentelemetry.instrumentation.enabled }}
+ - Python instrumentation annotation set on namespace {{ .Release.Namespace }} (covers all Python pods in it)
  {{- end }}
+ {{- end }}
+ {{- if or .Values.opentelemetry.collector.enabled .Values.opentelemetry.instrumentation.enabled }}
  {{- "\n" }}
- Prometheus Scrape Modes:
- - "otel" : Metrics collected via OTEL sidecar only (recommended)
- - "direct" : Direct Prometheus scraping only (current: {{ .Values.opentelemetry.collector.scrapeMode }})
- - "both" : Both methods active (for debugging)
- {{- "\n" }}
- To add OTEL instrumentation to your pods, add these annotations:
- sidecar.opentelemetry.io/inject: "{{ .Release.Name }}-otel-collector"
- instrumentation.opentelemetry.io/inject-python: "{{ .Release.Name }}-otel-instrumentation"
- prometheus.io/scrape: "true"
- prometheus.io/scrape-mode: "otel"
- prometheus.io/port: "{{ .Values.opentelemetry.collector.prometheus.port }}"
+ Pods labeled with mlrun.io/otel=true: Jupyter, SeaweedFS (master/volume/filer/s3/admin), and Nuclio function pods.
+ {{- end }}
  {{- end }}

  Happy MLOPSing!!! :]
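
The NOTES output now points at a collector Service (`<collector fullname>-collector:<port>`) instead of `localhost`. A workload configured by hand, rather than through the injected Instrumentation, would need the same address. The sketch below builds a namespace-qualified URL so it also resolves from other namespaces. The helper name is hypothetical, `my-mlrun-otel` is an assumed collector CR name, and 4318 is the conventional OTLP/HTTP port, not necessarily the chart's `otlp.httpPort` default, which this diff does not show.

```shell
# Hypothetical helper: build the in-cluster OTLP/HTTP URL of the collector Service.
# $1 = collector CR name (what "mlrun-ce.otel.collector.fullname" renders to)
# $2 = namespace the collector runs in
# $3 = OTLP HTTP port (assumed 4318, the conventional OTLP/HTTP port)
otlp_http_endpoint() {
  printf 'http://%s-collector.%s.svc:%s\n' "$1" "$2" "${3:-4318}"
}

# Standard OpenTelemetry SDK environment variables for a manually configured process.
OTEL_EXPORTER_OTLP_ENDPOINT=$(otlp_http_endpoint my-mlrun-otel mlrun)
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT OTEL_EXPORTER_OTLP_PROTOCOL

echo "$OTEL_EXPORTER_OTLP_ENDPOINT"
```

Inside the collector's own namespace the short form used by the chart (`<fullname>-collector:<port>`) resolves as well; the `.<namespace>.svc` suffix only matters for cross-namespace callers.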

charts/mlrun-ce/templates/_helpers.tpl

Lines changed: 15 additions & 22 deletions
@@ -423,7 +423,7 @@ OpenTelemetry helpers
  OpenTelemetry Collector name
  */}}
  {{- define "mlrun-ce.otel.collector.name" -}}
- {{- default "otel-collector" .Values.opentelemetry.collector.nameOverride | trunc 63 | trimSuffix "-" }}
+ {{- default "otel" .Values.opentelemetry.collector.nameOverride | trunc 63 | trimSuffix "-" }}
  {{- end }}

  {{/*
@@ -433,7 +433,7 @@ OpenTelemetry Collector fullname
  {{- if .Values.opentelemetry.collector.fullnameOverride }}
  {{- .Values.opentelemetry.collector.fullnameOverride | trunc 63 | trimSuffix "-" }}
  {{- else }}
- {{- $name := default "otel-collector" .Values.opentelemetry.collector.nameOverride }}
+ {{- $name := default "otel" .Values.opentelemetry.collector.nameOverride }}
  {{- if contains $name .Release.Name }}
  {{- .Release.Name | trunc 63 | trimSuffix "-" }}
  {{- else }}
@@ -526,7 +526,7 @@ spec:
  endpoint: 0.0.0.0:{{ .Values.opentelemetry.collector.prometheus.port }}
  namespace: {{ .Values.opentelemetry.collector.prometheus.namespace }}
  const_labels:
- collector_mode: sidecar
+ collector_mode: deployment
  metrics_source: otel_collector
  resource_to_telemetry_conversion:
  enabled: true
@@ -579,6 +579,8 @@ metadata:
  labels:
  {{- include "mlrun-ce.otel.labels" . | nindent 4 }}
  spec:
+ exporter:
+ endpoint: http://{{ include "mlrun-ce.otel.collector.fullname" . }}-collector:{{ .Values.opentelemetry.collector.otlp.httpPort }}
  propagators:
  {{- toYaml .Values.opentelemetry.instrumentation.propagators | nindent 4 }}
  sampler:
@@ -589,24 +591,6 @@ spec:
  valueFrom:
  fieldRef:
  fieldPath: metadata.labels['app.kubernetes.io/name']
- - name: OTEL_RESOURCE_ATTRIBUTES
- value: >-
- k8s.namespace.name=$(OTEL_RESOURCE_ATTRIBUTES_NAMESPACE),
- k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),
- k8s.container.name=$(OTEL_RESOURCE_ATTRIBUTES_CONTAINER_NAME),
- service.namespace=$(OTEL_RESOURCE_ATTRIBUTES_NAMESPACE)
- - name: OTEL_RESOURCE_ATTRIBUTES_NAMESPACE
- valueFrom:
- fieldRef:
- fieldPath: metadata.namespace
- - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME
- valueFrom:
- fieldRef:
- fieldPath: metadata.name
- - name: OTEL_RESOURCE_ATTRIBUTES_CONTAINER_NAME
- valueFrom:
- fieldRef:
- fieldPath: metadata.name
  - name: OTEL_METRICS_EXPORTER
  value: otlp
  - name: OTEL_TRACES_EXPORTER
@@ -624,7 +608,7 @@ spec:
  - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
  value: "false"
  - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
- value: ""
+ value: "aws_lambda"
  {{- end }}
  {{- if .Values.opentelemetry.instrumentation.java.enabled }}
  java:
@@ -636,3 +620,12 @@ spec:
  value: "true"
  {{- end }}
  {{- end }}
+ ..
+ {{/*
+ OTel pod label: marks a pod as OTel-monitored for metric enrichment and discovery.
+ Namespace-level instrumentation annotation (set by namespace-label job) handles Python auto-instrumentation.
+ Wrap usage with: {{- if and .Values.opentelemetry.collector.enabled .Values.opentelemetry.instrumentation.enabled }}
+ */}}
+ {{- define "mlrun-ce.otel.podLabels" -}}
+ mlrun.io/otel: "true"
+ {{- end }}
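
The default collector name changes from `otel-collector` to `otel`, which also changes the rendered CR name and, through the `-collector` suffix used in the exporter endpoint above, the Service name. The sketch below mirrors the `fullname` helper in POSIX shell so the effect is easy to see. It is an illustration only: the final else-branch of the helper is outside this diff, so the `<release>-<name>` join is assumed from the usual Helm convention, and `otel_fullname` is a hypothetical name.

```shell
# Sketch of the Helm "fullname" convention in POSIX shell (illustration only).
# $1 = release name, $2 = component name (defaults to "otel", the new default)
otel_fullname() {
  release=$1
  name=${2:-otel}
  case $release in
    *"$name"*) full=$release ;;          # release name already contains the component name
    *)         full="$release-$name" ;;  # assumed else-branch: "<release>-<name>"
  esac
  full=$(printf '%.63s' "$full")         # equivalent of `trunc 63`
  printf '%s\n' "${full%-}"              # equivalent of `trimSuffix "-"`
}

otel_fullname my-mlrun                  # prints my-mlrun-otel
otel_fullname my-mlrun otel-collector   # prints my-mlrun-otel-collector (old default name)
otel_fullname otel-stack                # prints otel-stack
```

Under these assumptions a release named `my-mlrun` now yields the CR `my-mlrun-otel`, and the endpoint in the Instrumentation spec resolves to `my-mlrun-otel-collector` rather than a doubled `...-otel-collector-collector`.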

charts/mlrun-ce/templates/jupyter-notebook/deployment.yaml

Lines changed: 3 additions & 16 deletions
@@ -14,22 +14,9 @@ spec:
  metadata:
  labels:
  {{- include "mlrun-ce.jupyter.selectorLabels" . | nindent 8 }}
- {{- if and .Values.opentelemetry.collector.enabled .Values.opentelemetry.instrumentation.enabled }}
- annotations:
- # OpenTelemetry sidecar injection
- sidecar.opentelemetry.io/inject: "{{ include "mlrun-ce.otel.collector.fullname" . }}"
- # Python auto-instrumentation injection
- instrumentation.opentelemetry.io/inject-python: "{{ include "mlrun-ce.otel.instrumentation.fullname" . }}"
- # Prometheus scraping configuration
- # scrape-mode controls how metrics are collected to prevent duplicates:
- # "otel" - Only OTEL sidecar metrics (recommended)
- # "both" - Both OTEL and direct scraping (debugging)
- # "direct" - Only direct scraping (OTEL metrics ignored)
- prometheus.io/scrape: "true"
- prometheus.io/scrape-mode: {{ .Values.opentelemetry.collector.scrapeMode | quote }}
- prometheus.io/port: "{{ .Values.opentelemetry.collector.prometheus.port }}"
- prometheus.io/path: "/metrics"
- {{- end }}
+ {{- if and .Values.opentelemetry.collector.enabled .Values.opentelemetry.instrumentation.enabled }}
+ {{- include "mlrun-ce.otel.podLabels" . | nindent 8 }}
+ {{- end }}
  spec:
  {{- with .Values.jupyterNotebook.image.pullSecrets }}
  imagePullSecrets:

charts/mlrun-ce/templates/opentelemetry/namespace-label.yaml

Lines changed: 8 additions & 2 deletions
@@ -9,7 +9,7 @@ metadata:
  labels:
  {{ include "mlrun-ce.otel.labels" . | indent 4 }}
  annotations:
- "helm.sh/hook": post-install,post-upgrade
+ "helm.sh/hook": pre-install,pre-upgrade
  "helm.sh/hook-weight": "-10"
  "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  "helm.sh/hook-timeout": "120s"
@@ -30,5 +30,11 @@ spec:
  - |
  echo "Labeling namespace {{ .Release.Namespace }} for OpenTelemetry..."
  kubectl label namespace {{ .Release.Namespace }} {{ .Values.opentelemetry.namespaceLabel.key }}={{ .Values.opentelemetry.namespaceLabel.value }} --overwrite
- echo "Namespace labeled successfully!"
+ {{- if .Values.opentelemetry.instrumentation.enabled }}
+ echo "Annotating namespace for namespace-wide Python auto-instrumentation..."
+ kubectl annotate namespace {{ .Release.Namespace }} \
+ instrumentation.opentelemetry.io/inject-python={{ include "mlrun-ce.otel.instrumentation.fullname" . }} \
+ --overwrite
+ {{- end }}
+ echo "Namespace configured for OpenTelemetry successfully!"
  {{- end -}}
