Feat: Allow specifying a minimum number of intervals to include for each model in a plan#4780
Feat: Allow specifying a minimum number of intervals to include for each model in a plan#4780
Conversation
8db1a48 to
237c765
Compare
| @click.option( | ||
| "--min-intervals", | ||
| default=0, | ||
| help="In new environments created against a specific time range, ensure that models contain at least this many intervals", |
There was a problem hiding this comment.
I don't think this only impacts new environments, does it?
There was a problem hiding this comment.
Yes, that was meant to say dev environments (there was a check that threw an error if you specified this on prod).
I've updated the text and removed the dev environment check because specifying this on prod is harmless in the sense that it doesnt do anything because prod doesnt support --start and --end so already considers the full time range
237c765 to
718ea0f
Compare
718ea0f to
0ed5945
Compare
0ed5945 to
3a1a7c3
Compare
| if not snapshot: | ||
| continue | ||
|
|
||
| starting_point = plan_end_dt |
There was a problem hiding this comment.
Shouldn't we use interval_end_per_model here instead of a global end?
There was a problem hiding this comment.
Since it's an explicit goal not to backfill anything beyond what exists in prod, yes.
I've updated this
| ), | ||
| end_bounded=not run, | ||
| ensure_finalized_snapshots=self.config.plan.use_finalized_state, | ||
| start_override_per_model=start_override_per_model, |
There was a problem hiding this comment.
Should the name be consistent with interval_end_per_model? I don't care which one is it, but I feel like they represent similar thing and should be named similarly.
There was a problem hiding this comment.
I've renamed interval_end_per_model to end_override_per_model on the Plan to match
| snapshot_start_date = start_dt | ||
|
|
||
| snapshot_start_override = start_override_per_model.get(snapshot.name, None) | ||
| snapshot_start_date = snapshot_start_override or start_dt |
There was a problem hiding this comment.
snapshot_start_date = start_override_per_model.get(snapshot.name, start_dt)
?
There was a problem hiding this comment.
Good point, updated
02bbb5f to
9c5dfc2
Compare
| # for example, A(hourly) <- B(daily) | ||
| # if min_intervals=1, A would have 1 hour and B would have 1 day | ||
| # but B depends on A so in order for B to have 1 valid day, A needs to be expanded to 24 hours | ||
| backfill_dag = self.dag.prune(*backfill_model_fqns) |
There was a problem hiding this comment.
Are you sure we can reuse this DAG? Wouldn't the loaded DAG be different if they use a selector?
There was a problem hiding this comment.
I thought it was fine due to the pruning, but i've updated the code to construct a new DAG
| ] | ||
|
|
||
| # start from the leaf nodes and work back towards the root because the min_start at the root node is determined by the calculated starts in the leaf nodes | ||
| for subdag in reversed_subdags: |
There was a problem hiding this comment.
Wouldn't this contain overlapping subdags? Why can't we reverse the whole DAG and just traverse it in one go? So something like:
reversed_dag = dag.reversed
for model_fqn in reversed_dag:
snapshot = snapshots_by_model_fqn[model_fqn]
# Get the minimum start from all immediate children of this snapshot
min_child_start = min([
start_overrides.get(fqn, sys.max)
for fqn in reversed_dag.get(model_fqn, set())
])
# Proceed with computing the start for this snapshot and taking a min of computed start and min_child_start
```
There was a problem hiding this comment.
Yep, the key difference here is only checking immediate children which I missed when scanning the DAG API originally (I thought it had to be all downstream nodes).
I've adjusted it as per your suggestion and also set an override whether it's needed or not so there is always a value for each node in the start_overrides dict
…ach model in a plan
…ld start date override
9c5dfc2 to
1e56590
Compare
| return None, None | ||
|
|
||
| default_end = max(max_interval_end_per_model.values()) | ||
| default_end = to_timestamp(max(max_interval_end_per_model.values())) |
There was a problem hiding this comment.
Just curious, why did we change this to datetime only to convert back to timestamp later?
There was a problem hiding this comment.
Rome wasnt built in a day and the rest of the code in that method was ints.
One day I hope we will use proper types internally and push TimeLike and co back to the edges / user input handling only
This addresses part of issue #4069 , albeit in a slightly different way to what is described in the ticket.
This PR adds a new plan option,
--min-intervals, intended to be used like so:What this option does is ensure that all models in the new environment have at least 1 interval backfilled, even if their interval unit is larger than the relative time period specified for
--start.It does this by allowing a list of per-model start date overrides to be supplied to a plan (similar to the existing
interval_end_per_modelargument). If there is a start date override available for a given snapshot, it gets used, otherwise the plan start date gets used.Thus,
--min-intervalsis implemented in terms of calculating the earliest start date that would be needed to cover--min-intervalsintervals. If this calculated date is earlier than the plan start date, it is added to the start date overrides.The start date overrides are used by:
DeployabilityIndex.create()to ensure that the adjusted per-model start date still results in deployable datamissing_intervals()to override the start date that is given toSnapshot.missing_intervals()to return intervals that can be outside the default plan boundsThe immediate use-case is for PR environments created by the CI/CD bot which would allow you to say things like:
Right now these are excluded which can result in downstream daily models missing data in PR envs.
This option could be extended in future to also specify the minimum number of intervals to cover for dev previews