health_check: include HTTP status code in health check events#44390
health_check: include HTTP status code in health check events#44390k1chik wants to merge 3 commits intoenvoyproxy:mainfrom
Conversation
Add http_status_code field to HealthCheckEjectUnhealthy and HealthCheckFailure proto messages. When the HTTP health checker receives a non-2xx response, the status code is threaded through handleFailure() and setUnhealthy() into the event logger and populated on the proto. Non-HTTP health checkers (TCP, gRPC, Redis, Thrift) pass the default value of 0, which the logger skips. Fixes envoyproxy#40221 Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com>
…eck-status-code Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com> # Conflicts: # changelogs/current.yaml # source/extensions/health_checkers/http/health_checker_impl.cc
…eck-status-code Signed-off-by: k1chik <107162115+k1chik@users.noreply.github.com>
|
Hi @k1chik, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
| bool first_check = 2; | ||
|
|
||
| // The HTTP status code of the health check response that caused this failure. | ||
| // Only set when the health checker type is HTTP and the failure type is ACTIVE. |
There was a problem hiding this comment.
| // Only set when the health checker type is HTTP and the failure type is ACTIVE. | |
| // Only set when the health checker type is HTTP and the failure type is ``ACTIVE``. |
|
|
||
| // The HTTP status code of the health check response that caused this failure. | ||
| // Only set when the health checker type is HTTP and the failure type is ACTIVE. | ||
| // A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures |
There was a problem hiding this comment.
| // A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures | |
| // A value of ``0`` indicates that no HTTP status code was recorded (e.g., network-level failures |
|
@k1chik needs main merge @adisuissa i think this is waiting on your review |
| // Only set when the health checker type is HTTP and the failure type is ACTIVE. | ||
| // A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures | ||
| // or non-HTTP health checkers). | ||
| uint32 http_status_code = 2; |
There was a problem hiding this comment.
The StatusCode enum enforces defined_only: true, so non-standard codes like nginx's 499 or Cloudflare's 520-530 range would get rejected, IMO I'd rather just log whatever the upstream actually sent back. Also, the enum serializes as string names like ServiceUnavailable wrapped in a nested object, which isn't ideal when we’re trying to filter events by status code. A flat uint32 keeps things simple, lines up with the other fields in the same message, and doesn't tie a logging change to a shared API type. Happy to change this if you feel differently though, open to guidance here.
| cluster_->prioritySet().getMockHostSet(0)->hosts_[0]->coarseHealth()); | ||
| } | ||
|
|
||
| TEST_F(HttpHealthCheckerImplTest, HttpStatusCode500) { |
There was a problem hiding this comment.
What is the difference between the various HttpStatusCodeXXX tests?
Do they exercise different code-paths?
If not, please remove the unnecessary tests.
There was a problem hiding this comment.
thanks, kept two tests: HttpStatusCodeInHealthCheckEvents, which covers the non-2xx path (503), and HttpStatusCodeZeroOnTimeout, which is a distinct code path. I dropped the rest!
| // Only set when the health checker type is HTTP and the failure type is ACTIVE. | ||
| // A value of 0 indicates that no HTTP status code was recorded (e.g., network-level failures | ||
| // or non-HTTP health checkers). | ||
| uint32 http_status_code = 3; |
| // The type of failure that caused this ejection. | ||
| HealthCheckFailureType failure_type = 1 [(validate.rules).enum = {defined_only: true}]; | ||
|
|
||
| // The HTTP status code of the health check response that triggered the ejection. |
There was a problem hiding this comment.
High-level question:
are there other http-specific responses that are expected to be populated?
If so, will it make sense to add a type that is shared between HealthCheckEjectUnhealthy and HealthCheckFailure and is HTTP specific and is populated when an HTTP rejection occurs?
There was a problem hiding this comment.
good point. no other HTTP-specific fields are planned for this PR. but if you'd prefer a shared sub-message now for future extensibility, happy to add one. just let me know which way you want to go.
Commit Message:
health_check: include HTTP status code in health check events
Additional Description:
Add
http_status_codefield toHealthCheckEjectUnhealthyandHealthCheckFailureproto messages. When the HTTP health checker receives a non-2xx response, the status code is threaded throughhandleFailure()andsetUnhealthy()into the event logger and populated on the proto. Non-HTTP health checkers (TCP, gRPC, Redis,Thrift) pass 0, which the logger skips — the field is omitted from the event for non-HTTP failures.
This fix was developed with assistance from Claude Code. I fully understand all submitted code and take ownership of this PR.
Risk Level: Low
Testing:
bazel test //test/common/upstream:health_checker_impl_test --config=clang -c fastbuildbazel test //test/extensions/health_checkers/redis:redis_test --config=clang -c fastbuildbazel test //test/extensions/health_checkers/thrift:thrift_test --config=clang -c fastbuildDocs Changes: N/A
Release Notes: Added under new_features in changelogs/current.yaml.
Platform Specific Features: N/A
Fixes #40221