Skip to content

health_check: include HTTP status code in health check events#44390

Open
k1chik wants to merge 3 commits intoenvoyproxy:mainfrom
k1chik:fix/40221-health-check-status-code
Open

health_check: include HTTP status code in health check events#44390
k1chik wants to merge 3 commits intoenvoyproxy:mainfrom
k1chik:fix/40221-health-check-status-code

Conversation

@k1chik
Copy link
Copy Markdown

@k1chik k1chik commented Apr 10, 2026

Commit Message:
health_check: include HTTP status code in health check events

Additional Description:
Add http_status_code field to HealthCheckEjectUnhealthy and HealthCheckFailure proto messages. When the HTTP health checker receives a non-2xx response, the status code is threaded through handleFailure() and setUnhealthy() into the event logger and populated on the proto. Non-HTTP health checkers (TCP, gRPC, Redis,
Thrift) pass 0, which the logger skips — the field is omitted from the event for non-HTTP failures.

This fix was developed with assistance from Claude Code. I fully understand all submitted code and take ownership of this PR.

Risk Level: Low

Testing:

  • bazel test //test/common/upstream:health_checker_impl_test --config=clang -c fastbuild
  • bazel test //test/extensions/health_checkers/redis:redis_test --config=clang -c fastbuild
  • bazel test //test/extensions/health_checkers/thrift:thrift_test --config=clang -c fastbuild

Docs Changes: N/A

Release Notes: Added under new_features in changelogs/current.yaml.

Platform Specific Features: N/A

Fixes #40221

k1chik added 3 commits April 7, 2026 14:28
Add http_status_code field to HealthCheckEjectUnhealthy and
HealthCheckFailure proto messages. When the HTTP health checker
receives a non-2xx response, the status code is threaded through
handleFailure() and setUnhealthy() into the event logger and
populated on the proto. Non-HTTP health checkers (TCP, gRPC, Redis,
Thrift) pass the default value of 0, which the logger skips.

Fixes envoyproxy#40221

Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com>
…eck-status-code

Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com>

# Conflicts:
#	changelogs/current.yaml
#	source/extensions/health_checkers/http/health_checker_impl.cc
…eck-status-code

Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com>
@repokitteh-read-only
Copy link
Copy Markdown

Hi @k1chik, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #44390 was opened by k1chik.

see: more, trace.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #44390 was opened by k1chik.

see: more, trace.

Copy link
Copy Markdown
Member

@phlax phlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @k1chik - a couple of docs nits

bool first_check = 2;

// The HTTP status code of the health check response that caused this failure.
// Only set when the health checker type is HTTP and the failure type is ACTIVE.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Only set when the health checker type is HTTP and the failure type is ACTIVE.
// Only set when the health checker type is HTTP and the failure type is ``ACTIVE``.


// The HTTP status code of the health check response that caused this failure.
// Only set when the health checker type is HTTP and the failure type is ACTIVE.
// A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures
// A value of ``0`` indicates that no HTTP status code was recorded (e.g., network-level failures

@phlax
Copy link
Copy Markdown
Member

phlax commented Apr 14, 2026

@k1chik needs main merge

@adisuissa i think this is waiting on your review

// Only set when the health checker type is HTTP and the failure type is ACTIVE.
// A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures
// or non-HTTP health checkers).
uint32 http_status_code = 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the type be HttpStatus?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StatusCode enum enforces defined_only: true, so non-standard codes like nginx's 499 or Cloudflare's 520-530 range would get rejected, IMO I'd rather just log whatever the upstream actually sent back. Also, the enum serializes as string names like ServiceUnavailable wrapped in a nested object, which isn't ideal when we’re trying to filter events by status code. A flat uint32 keeps things simple, lines up with the other fields in the same message, and doesn't tie a logging change to a shared API type. Happy to change this if you feel differently though, open to guidance here.

cluster_->prioritySet().getMockHostSet(0)->hosts_[0]->coarseHealth());
}

TEST_F(HttpHealthCheckerImplTest, HttpStatusCode500) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between the various HttpStatusCodeXXX tests?
Do they exercise different code-paths?
If not, please remove the unnecessary tests.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, kept two tests: HttpStatusCodeInHealthCheckEvents, which covers the non-2xx path (503), and HttpStatusCodeZeroOnTimeout, which is a distinct code path. I dropped the rest!

// Only set when the health checker type is HTTP and the failure type is ACTIVE.
// A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures
// or non-HTTP health checkers).
uint32 http_status_code = 3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

// The type of failure that caused this ejection.
HealthCheckFailureType failure_type = 1 [(validate.rules).enum = {defined_only: true}];

// The HTTP status code of the health check response that triggered the ejection.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level question:
are there other http-specific responses that are expected to be populated?
If so, will it make sense to add a type that is shared between HealthCheckEjectUnhealthy and HealthCheckFailure and is HTTP specific and is populated when an HTTP rejection occurs?

Copy link
Copy Markdown
Author

@k1chik k1chik Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. no other HTTP-specific fields are planned for this PR. but if you'd prefer a shared sub-message now for future extensibility, happy to add one. just let me know which way you want to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve health status logging on envoy

3 participants