Commit 2650680

Document max inflight tasks (TraceMachina#2167)
* Document max inflight tasks
* docs: use json5 fences in production config
1 parent faad8bb commit 2650680

2 files changed

Lines changed: 32 additions & 11 deletions

web/platform/src/content/docs/docs/config/basic-configs.mdx

Lines changed: 3 additions & 0 deletions
@@ -89,6 +89,9 @@ memory and filesystem stores instead of S3 and Redis.
 "worker_api_endpoint": {
 "uri": "grpc://127.0.0.1:50061",
 },
+// Limit concurrent actions on this worker to avoid saturation.
+// Set to 0 for unlimited.
+"max_inflight_tasks": 16,
 "cas_fast_slow_store": "WORKER_FAST_SLOW_STORE",
 "upload_action_result": {
 "ac_store": "AC_MAIN_STORE",

web/platform/src/content/docs/docs/config/production-config.mdx

Lines changed: 29 additions & 11 deletions
@@ -26,7 +26,7 @@ At the top level of the CAS Config, we've stores and servers. Each server define

 Specifically, under servers, we've two separate servers defined:

-```json
+```json5
 "servers": [{
 "listener": {
 "http": {
@@ -52,7 +52,7 @@ Specifically, under servers, we've two separate servers defined:
 ```

 Let’s focus on the main server that exposes the CAS and ActionCache services.
-```json
+```json5
 {
 "listener": {
 "http": {
@@ -82,7 +82,7 @@ From this definition, we see that an HTTP listener binds to port 50051 on all ne
 This server hosts four services: CAS, ac, capabilities, and bytestream. The capabilities service is needed for supporting the Bazel protocol. The bytestream service is used to stream data to and from the CAS and is recommended for handling large objects.

 You might be wondering what the “main” object under "CAS" and “AC” services means. In this case, it indicates the instance name, which means you need to pass --remote_instance_name=main. Alternatively, you can use the following Configuration so your Bazel clients don’t have to pass the --remote_instance_name parameter:
-```json
+```json5
 "cas": [{
 "cas_store": "cas_STORE"
 }],
@@ -128,7 +128,7 @@ Completeness checking store verifies if the output files & folders exist in the

 Effectively, this store ensures the CAS and ActionCache are in a consistent state for a given Action digest (key). If not, then the requested Action digest is treated as a cache miss and needs to be re-computed. As mentioned above, the Remote execution proto gives hints about the behavior of the ActionCache, such as this comment for the GetActionResult endpoint:

-```json
+```json5
 // Implementations SHOULD ensure that any blobs referenced from the
 // [ContentAddressableStorage][build.bazel.remote.execution.v2.ContentAddressableStorage]
 // are available at the time of returning the
@@ -160,7 +160,7 @@ The slow side of the Action Cache `fast_slow` in our cloud platform uses the Red
 Notice that we pull the actual address of Redis from the REDIS_STORE_URL environment variable, which helps keep the Config structure free of environment specific settings.

 The fast side of the Action Cache `fast_slow` store is a `size_partitioning` store:
-```json
+```json5
 "size_partitioning":{
 "size": 1000,
 "lower_store": {
@@ -187,7 +187,7 @@ That covers the stores for the ActionCache, now let’s look at the CAS service
 CAS

 The NativeLink CAS service stores content using a cryptographic hash of the content itself as the cache key, known as Content Addressable Storage. From a distributed build system perspective, it makes sense to use a CAS since we can avoid rebuilding outputs during the build process because the CAS guarantees stored content hasn't changed for any given hash key. However, we’re not here to learn how Bazel remote caching works with CAS, as there are plenty of resources about that on the Web, so let’s turn our attention to how the NativeLink CAS store works. In the Config JSON, we define the top-level cas_STORE:
-```json
+```json5
 "cas_STORE": {
 "existence_cache": {
 "backend": {
@@ -219,7 +219,7 @@ Intuitively, this store is an optimization that helps speed up requests for the
 Here we’re using a verify store which verifies the size of the data being uploaded into the CAS. This store helps ensure the integrity of your CAS. In this case, we chose to not have a store named cas_VERIFY_STORE that references the cas_FAST_SLOW_STORE but that would be an acceptable Configuration if you wanted to avoid nesting stores within stores in your Configuration.

 The back-end for the verify store is a `fast_slow` store. Let’s look at the slow store first.
-```json
+```json5
 "slow": {
 "size_partitioning":{
 "size": 1500000,
@@ -258,7 +258,7 @@ To recap, for our CAS slow store, we send smaller objects to Redis and larger to

 On the fast side, we use a similar approach we did for ActionCache using `size_partitioning` scheme with a memory store.

-```json
+```json5
 "fast": {
 "size_partitioning":{
 "size": 64000,
@@ -283,7 +283,7 @@ CAS Config JSON
 Here is the final CAS Config JSON without the 99 extra shards for writing to S3.

 ## Production CAS JSON
-```json
+```json5
 {
 "stores": {
 "AC_FAST_SLOW_STORE": {
@@ -454,6 +454,24 @@ Here is the final CAS Config JSON without the 99 extra shards for writing to S3.
 }
 ```

+## Limit Worker Inflight Tasks
+
+If your workers are getting saturated, cap the number of concurrent tasks they
+will accept with `max_inflight_tasks`. This helps avoid runaway scheduling when
+actions spike or when a single worker falls behind.
+
+```json5
+workers: [{
+local: {
+worker_api_endpoint: {
+uri: "grpc://127.0.0.1:50061",
+},
+// Set to 0 for unlimited.
+max_inflight_tasks: 32,
+}
+}]
+```
+

 ## Speed Up NativeLink by Turning Off a Hidden Redis Query

@@ -469,11 +487,11 @@ Every time this runs, it fires off a wildcard query to Redis. These queries aren

 Add one line to your scheduler config:

-```json
+```json5
 worker_match_logging_interval_s: -1
 ```

-```json
+```json5
 schedulers: [
 {
 name: "MAIN_SCHEDULER",
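
The semantics documented for `max_inflight_tasks` in this commit (a cap on concurrent tasks per worker, with 0 meaning unlimited) can be illustrated with a small sketch. This is not NativeLink's implementation; the `InflightLimiter` class and its method names are hypothetical and exist only to show how such a cap might behave.

```python
import threading


class InflightLimiter:
    """Hypothetical helper: caps concurrent tasks; a limit of 0 means unlimited."""

    def __init__(self, max_inflight_tasks: int) -> None:
        if max_inflight_tasks < 0:
            raise ValueError("max_inflight_tasks must be >= 0")
        self.max_inflight_tasks = max_inflight_tasks
        self._inflight = 0
        self._lock = threading.Lock()

    def try_accept(self) -> bool:
        """Reserve a slot; return False when the worker is at its cap."""
        with self._lock:
            capped = self.max_inflight_tasks != 0
            if capped and self._inflight >= self.max_inflight_tasks:
                return False
            self._inflight += 1
            return True

    def release(self) -> None:
        """Free a slot once a task finishes."""
        with self._lock:
            if self._inflight == 0:
                raise RuntimeError("release() called with no inflight tasks")
            self._inflight -= 1

    @property
    def inflight(self) -> int:
        """Number of tasks currently holding a slot."""
        with self._lock:
            return self._inflight
```

With a limit of 2, a third `try_accept()` call is refused until `release()` frees a slot, while a limit of 0 never refuses. How a real worker and scheduler react to a refused task (queueing, rerouting to another worker) is outside the scope of this sketch.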
