Skip to content

Commit ef16d33

Browse files
committed
Add POST /resources/find-slot orchestrator proxy and documentation
Adds orchestrator proxy endpoint for the reports find-slot API: handler, response controller, connexion controller, OpenAPI spec, constants, and test stubs. Also adds ARCHITECTURE.md and ROADMAP.md documenting the system design and future improvement plans.
1 parent 4e9979b commit ef16d33

8 files changed

Lines changed: 740 additions & 2 deletions

File tree

ARCHITECTURE.md

Lines changed: 374 additions & 0 deletions
Large diffs are not rendered by default.

ROADMAP.md

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
# ROADMAP — Resource Calendar & Scheduling
2+
3+
Future improvements related to resource availability calendar, find-slot, and advance scheduling.
4+
5+
## 1. Extend `determine_future_lease_time` to handle links and facility ports
6+
7+
**Current state**: `determine_future_lease_time` in `orchestrator_kernel.py` only processes `NodeSliver` (compute). Network slivers and facility port slivers are silently skipped (`if not isinstance(requested_sliver, NodeSliver): continue`).
8+
9+
**Improvement**: Extend the `ResourceTracker` and `determine_future_lease_time` to also check link bandwidth availability and facility port VLAN availability when auto-adjusting lease times for future slices. This would prevent the orchestrator from picking a start time where compute fits but a required link is saturated.
10+
11+
**Files**: `orchestrator/core/resource_tracker.py`, `orchestrator/core/orchestrator_kernel.py`
12+
13+
## 2. Add bin-packing to `determine_future_lease_time`
14+
15+
**Current state**: Each compute reservation is checked independently against candidate hosts. Two VMs that each need 32 cores could both "see" the same 32 free cores on the same host, leading to a false positive.
16+
17+
**Improvement**: Track cumulative demand across compute reservations (greedy bin-packing), similar to what `find_slot()` in the reports DB does. This ensures the auto-adjusted start time actually has enough capacity for all VMs simultaneously.
18+
19+
**Files**: `orchestrator/core/orchestrator_kernel.py`, `orchestrator/core/resource_tracker.py`
20+
21+
## 3. FABlib integration for `find-slot`
22+
23+
**Current state**: Users must craft raw HTTP POST requests to `POST /resources/find-slot` via the orchestrator.
24+
25+
**Improvement**: Add a `find_available_slot()` convenience method to `fabrictestbed-extensions` (FABlib) that wraps the orchestrator endpoint. This lets users discover available windows directly from their Jupyter notebooks before submitting slices.
26+
27+
**Repo**: `fabrictestbed-extensions` (external)
28+
29+
## 4. OrchestratorProxy client method for `find-slot`
30+
31+
**Current state**: The orchestrator exposes `POST /resources/find-slot` but the Python `OrchestratorProxy` client class (used by FABlib) does not have a corresponding method.
32+
33+
**Improvement**: Add `find_slot()` to `OrchestratorProxy` so FABlib and other Python clients can call it without raw HTTP.
34+
35+
**Files**: `orchestrator/orchestrator_proxy.py` (if applicable)
36+
37+
## 5. Portal integration for find-slot
38+
39+
**Current state**: The calendar UI shows per-slot availability. Users must manually inspect it to find windows.
40+
41+
**Improvement**: Add a "Find Available Window" form in the portal that calls `POST /resources/find-slot` and displays matching time windows. Users could then click to pre-fill slice creation with the selected window.
42+
43+
## 6. Unified scheduling: replace `determine_future_lease_time` with reports-backed find-slot
44+
45+
**Current state**: Two parallel systems find available time windows — `determine_future_lease_time` (live broker data, compute only) and `find_slot` (reports DB, compute + links + facility ports).
46+
47+
**Improvement**: Once `find_slot` is proven reliable and the reports DB refresh cadence is sufficient, consider having the orchestrator's advance scheduling thread use the reports API instead of querying the broker directly. This would unify the logic and extend coverage to all resource types. Requires ensuring reports data freshness is acceptable for server-side scheduling decisions.
48+
49+
**Trade-off**: Reports data is ~1 hour stale vs live broker queries. May not be suitable for time-critical allocation decisions.
50+
51+
## 7. Smarter sliding window with skip-ahead optimization
52+
53+
**Current state**: `find_slot()` in reports `db_manager.py` checks every hour sequentially. For a 30-day range that's 720 iterations, each checking `duration` hours.
54+
55+
**Improvement**: When a window fails at hour `dh` within the duration, skip ahead to the next hour where the blocking sliver ends rather than advancing by 1. This reduces iterations significantly for heavily loaded time ranges.
56+
57+
**Files**: `reports_api/database/db_manager.py` (`_check_window`, `find_slot`)
58+
59+
## 8. Async and scalability improvements for high-concurrency workloads
60+
61+
**Context**: As AI agents and automated workflows increasingly interact with FABRIC, the orchestrator and reports APIs may face high concurrent request volumes. The current synchronous, single-threaded-per-request architecture will become a bottleneck.
62+
63+
### 8a. Orchestrator API — async framework and connection pooling
64+
65+
**Current state**: The orchestrator uses Connexion/Flask (synchronous WSGI). Each incoming request blocks a worker thread for the entire duration, including downstream calls to broker, reports API, and database. Under high concurrency from AI agents making rapid slice/find-slot/resource queries, worker threads will be exhausted.
66+
67+
**Improvements**:
68+
- Migrate to an async-capable framework (e.g., Connexion 3.x with ASGI, or FastAPI) so request handlers can `await` downstream calls without blocking threads
69+
- Use `aiohttp` or `httpx.AsyncClient` with connection pooling for outbound calls to reports API and broker
70+
- Add request queuing and backpressure (return `429 Too Many Requests` with `Retry-After` header) to protect downstream services
71+
- Consider async SQLAlchemy (`asyncio` extension) for database operations
72+
73+
**Files**: `orchestrator/swagger_server/`, `orchestrator/core/orchestrator_handler.py`
74+
75+
### 8b. Reports API — async queries and caching
76+
77+
**Current state**: Reports API uses Flask (synchronous). Complex queries like `find_slot` with 30-day ranges scan 720+ hourly slots against multiple DB queries. Under concurrent load, DB connection pool and CPU become bottlenecks.
78+
79+
**Improvements**:
80+
- Parallelize independent DB queries (host capacities, link capacities, facility port capacities, slivers) using `concurrent.futures.ThreadPoolExecutor` or async SQLAlchemy
81+
- Add short-lived response caching (e.g., 5-minute TTL) for `find_slot` results — since reports DB only refreshes hourly, identical queries within minutes will return the same result
82+
- Pre-compute and cache hourly allocation arrays on reports DB refresh rather than computing on every request
83+
- Consider read replicas for the reports PostgreSQL to distribute query load
84+
85+
**Files**: `reports_api/database/db_manager.py`, `reports_api/response_code/calendar_controller.py`
86+
87+
### 8c. Orchestrator advance scheduling — parallel host queries
88+
89+
**Current state**: `determine_future_lease_time` queries broker reservations per-host sequentially. For slices with many candidate nodes across multiple sites, this is O(nodes) sequential round-trips.
90+
91+
**Improvements**:
92+
- Query candidate host reservations in parallel using `concurrent.futures.ThreadPoolExecutor` — each `get_reservations(node_id=c, ...)` call is independent
93+
- Batch reservation queries where possible (query all candidate nodes in one DB call with `node_id IN (...)`)
94+
95+
**Files**: `orchestrator/core/orchestrator_kernel.py`, `orchestrator/core/resource_tracker.py`
96+
97+
### 8d. Client-side — async and rate-limited clients
98+
99+
**Current state**: `ReportsApi` and `OrchestratorProxy` clients use synchronous `requests` library. AI agents making many concurrent calls will block on each.
100+
101+
**Improvements**:
102+
- Provide async client variants using `httpx.AsyncClient` for both `ReportsApi` and `OrchestratorProxy`
103+
- Add client-side rate limiting and retry with exponential backoff to be a good citizen under load
104+
- FABlib: provide both `find_available_slot()` (sync, for notebooks) and `async_find_available_slot()` (async, for agent frameworks)
105+
106+
**Files**: `reports_client/fabric_reports_client/reports_api.py`, `orchestrator/orchestrator_proxy.py`
107+
108+
### 8e. Long-running find-slot — async job pattern
109+
110+
**Current state**: Large find-slot queries (30-day range, many resources) may take seconds. Under high concurrency, these tie up worker threads.
111+
112+
**Improvements**:
113+
- For requests exceeding a threshold (e.g., >7 day range with >3 resources), return `202 Accepted` with a job ID
114+
- Process in background worker (Celery, or simple thread pool)
115+
- Client polls `GET /calendar/find-slot/{job_id}` for result
116+
- Add TTL-based cleanup for completed jobs
117+
118+
**Files**: `reports_api/response_code/calendar_controller.py`, `orchestrator/core/orchestrator_handler.py`

fabric_cf/orchestrator/core/orchestrator_handler.py

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1203,3 +1203,36 @@ def list_resources_calendar(self, *, token: str, start_date: str, end_date: str,
12031203
self.logger.error(traceback.format_exc())
12041204
self.logger.error(f"Exception occurred processing list_resources_calendar e: {e}")
12051205
raise e
1206+
1207+
def find_resource_slot(self, *, token: str, body: dict) -> dict:
1208+
"""
1209+
Proxy find-slot request to reports API.
1210+
:param token: Fabric Identity Token
1211+
:param body: Request body with start, end, duration, resources, max_results
1212+
:returns dict with find-slot results
1213+
"""
1214+
try:
1215+
self.__authorize_request(id_token=token, action_id=ActionId.query)
1216+
1217+
reports_conf = self.config.get_reports_api()
1218+
if not reports_conf or not reports_conf.get("enable", False):
1219+
raise OrchestratorException(message="Reports API is not enabled")
1220+
1221+
reports_host = reports_conf.get("host")
1222+
reports_token = reports_conf.get("token")
1223+
if not reports_host:
1224+
raise OrchestratorException(message="Reports API host is not configured")
1225+
1226+
from fabric_reports_client.reports_api import ReportsApi
1227+
reports_api = ReportsApi(base_url=reports_host, token=reports_token)
1228+
return reports_api.find_slot(
1229+
start_time=body.get("start"),
1230+
end_time=body.get("end"),
1231+
duration=body.get("duration"),
1232+
resources=body.get("resources"),
1233+
max_results=body.get("max_results", 1)
1234+
)
1235+
except Exception as e:
1236+
self.logger.error(traceback.format_exc())
1237+
self.logger.error(f"Exception occurred processing find_resource_slot e: {e}")
1238+
raise e

fabric_cf/orchestrator/swagger_server/controllers/resources_controller.py

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -94,3 +94,16 @@ def resources_calendar_get(start_date, end_date, interval=None, site=None, host=
9494
return rc.resources_calendar_get(start_date=start_date, end_date=end_date, interval=interval,
9595
site=site, host=host, exclude_site=exclude_site,
9696
exclude_host=exclude_host)
97+
98+
99+
def resources_find_slot(body): # noqa: E501
100+
"""Find available time slots for resources
101+
102+
Find the earliest time windows where all requested resources are simultaneously available. # noqa: E501
103+
104+
:param body: Resource request payload
105+
:type body: dict
106+
107+
:rtype: Resources
108+
"""
109+
return rc.resources_find_slot(body=body)

fabric_cf/orchestrator/swagger_server/response/constants.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@
3232
RESOURCES_PATH = '/resources'
3333
RESOURCES_SUMMARY_PATH = '/resources/summary'
3434
RESOURCES_CALENDAR_PATH = '/resources/calendar'
35+
RESOURCES_FIND_SLOT_PATH = '/resources/find-slot'
3536
PORTAL_RESOURCES_PATH = '/portalresources'
3637
PORTAL_RESOURCES_SUMMARY_PATH = '/portalresources/summary'
3738

fabric_cf/orchestrator/swagger_server/response/resources_controller.py

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,9 +32,9 @@
3232
from fabric_cf.orchestrator.swagger_server.models.resources import Resources # noqa: E501
3333
from fabric_cf.orchestrator.swagger_server import received_counter, success_counter, failure_counter
3434
from fabric_cf.orchestrator.swagger_server.response.constants import (
35-
GET_METHOD, RESOURCES_PATH, PORTAL_RESOURCES_PATH,
35+
GET_METHOD, POST_METHOD, RESOURCES_PATH, PORTAL_RESOURCES_PATH,
3636
RESOURCES_SUMMARY_PATH, PORTAL_RESOURCES_SUMMARY_PATH,
37-
RESOURCES_CALENDAR_PATH
37+
RESOURCES_CALENDAR_PATH, RESOURCES_FIND_SLOT_PATH
3838
)
3939
from fabric_cf.orchestrator.swagger_server.response.utils import get_token, cors_error_response, cors_success_response
4040

@@ -301,3 +301,32 @@ def resources_calendar_get(start_date: str, end_date: str, interval: str = "day"
301301
logger.exception(e)
302302
failure_counter.labels(GET_METHOD, RESOURCES_CALENDAR_PATH).inc()
303303
return cors_error_response(error=e)
304+
305+
306+
def resources_find_slot(body: dict) -> Resources:
307+
"""Proxy find-slot request to reports API
308+
309+
:param body: Request body with start, end, duration, resources, max_results
310+
:rtype: Resources
311+
"""
312+
handler = OrchestratorHandler()
313+
logger = handler.get_logger()
314+
received_counter.labels(POST_METHOD, RESOURCES_FIND_SLOT_PATH).inc()
315+
try:
316+
token = get_token()
317+
318+
result = handler.find_resource_slot(token=token, body=body)
319+
response = Resources()
320+
response.data = [result]
321+
response.size = 1
322+
response.type = "resources.find_slot"
323+
success_counter.labels(POST_METHOD, RESOURCES_FIND_SLOT_PATH).inc()
324+
return cors_success_response(response_body=response)
325+
except OrchestratorException as e:
326+
logger.exception(e)
327+
failure_counter.labels(POST_METHOD, RESOURCES_FIND_SLOT_PATH).inc()
328+
return cors_error_response(error=e)
329+
except Exception as e:
330+
logger.exception(e)
331+
failure_counter.labels(POST_METHOD, RESOURCES_FIND_SLOT_PATH).inc()
332+
return cors_error_response(error=e)

fabric_cf/orchestrator/swagger_server/swagger/swagger.yaml

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -554,6 +554,120 @@ paths:
554554
security:
555555
- bearerAuth: []
556556
x-openapi-router-controller: fabric_cf.orchestrator.swagger_server.controllers.resources_controller
557+
/resources/find-slot:
558+
post:
559+
tags:
560+
- resources
561+
summary: Find available time slots for resources
562+
description: Find the earliest time windows where all requested resources are simultaneously available for a given duration. Proxied from the reports API.
563+
operationId: resources_find_slot
564+
requestBody:
565+
required: true
566+
content:
567+
application/json:
568+
schema:
569+
type: object
570+
required:
571+
- start
572+
- end
573+
- duration
574+
- resources
575+
properties:
576+
start:
577+
type: string
578+
format: date-time
579+
description: Start of search range (ISO 8601)
580+
end:
581+
type: string
582+
format: date-time
583+
description: End of search range (ISO 8601)
584+
duration:
585+
type: integer
586+
minimum: 1
587+
description: Consecutive hours needed
588+
max_results:
589+
type: integer
590+
minimum: 1
591+
maximum: 50
592+
default: 1
593+
description: Maximum number of windows to return (1-50)
594+
resources:
595+
type: array
596+
minItems: 1
597+
description: Array of resource requests
598+
items:
599+
type: object
600+
required:
601+
- type
602+
properties:
603+
type:
604+
type: string
605+
enum:
606+
- compute
607+
- link
608+
- facility_port
609+
site:
610+
type: string
611+
cores:
612+
type: integer
613+
ram:
614+
type: integer
615+
disk:
616+
type: integer
617+
components:
618+
type: object
619+
additionalProperties:
620+
type: integer
621+
site_a:
622+
type: string
623+
site_b:
624+
type: string
625+
bandwidth:
626+
type: integer
627+
name:
628+
type: string
629+
vlans:
630+
type: integer
631+
responses:
632+
"200":
633+
description: OK
634+
content:
635+
application/json:
636+
schema:
637+
$ref: '#/components/schemas/resources'
638+
"400":
639+
description: Bad Request
640+
content:
641+
application/json:
642+
schema:
643+
$ref: '#/components/schemas/status_400_bad_request'
644+
"401":
645+
description: Unauthorized
646+
content:
647+
application/json:
648+
schema:
649+
$ref: '#/components/schemas/status_401_unauthorized'
650+
"403":
651+
description: Forbidden
652+
content:
653+
application/json:
654+
schema:
655+
$ref: '#/components/schemas/status_403_forbidden'
656+
"404":
657+
description: Not Found
658+
content:
659+
application/json:
660+
schema:
661+
$ref: '#/components/schemas/status_404_not_found'
662+
"500":
663+
description: Internal Server Error
664+
content:
665+
application/json:
666+
schema:
667+
$ref: '#/components/schemas/status_500_internal_server_error'
668+
security:
669+
- bearerAuth: []
670+
x-openapi-router-controller: fabric_cf.orchestrator.swagger_server.controllers.resources_controller
557671
/portalresources/summary:
558672
get:
559673
tags:

0 commit comments

Comments
 (0)