diff --git a/.codex/skills/high-performance-java/SKILL.md b/.codex/skills/high-performance-java/SKILL.md new file mode 100644 index 00000000000..ac06d137129 --- /dev/null +++ b/.codex/skills/high-performance-java/SKILL.md @@ -0,0 +1,194 @@ +--- +name: high-performance-java +description: Use when writing, reviewing, or reshaping HotSpot Java where algorithmic complexity, data-structure choice, throughput, latency, allocation rate, zero-copy, lazy evaluation, non-materialization, runtime specialization, query-engine code generation, Janino, primitive collections, performance libraries, intrinsics, SuperWord auto-vectorization, or C2 assembly matter. Also use for advanced algorithmic problem solving in Java, including dynamic programming, graph/range techniques, cache-aware code shape, and choosing between interpreted, vectorized, and compiled execution paths. Bias toward asymptotic wins first, then the right execution model, then specialized hot-path code, then benchmark and JIT evidence. +--- + +# High-Performance Java + +Use this skill for Java hot paths, algorithm-heavy Java, and JVM-side runtime specialization. Default bias: asymptotic win first, then the right execution model, then fewer allocations, fewer copies, less polymorphism, narrower code shape, stronger evidence. + +HotSpot-only v1. Baseline assumptions: +- repo baseline: JDK 21 +- current local runtime may be newer +- low-level claims stay provisional until benchmark + JIT evidence agree +- algorithm/data-structure claims stay provisional until they match the actual workload constraints +- runtime codegen claims stay provisional until cold-start cost, warm steady-state behavior, and fallback behavior are all understood + +## Core loop + +1. Identify the workload shape and constraints. +2. Pick the algorithm and data structure that change the slope. +3. Decide whether the workload should stay interpreted, become vectorized/batched, or justify runtime specialization/code generation. +4. 
Find the hot loop, hot call chain, or hot operator pipeline. +5. Write the narrow fast path first. +6. Push generic abstraction, materialization, and dispatch out of the loop. +7. Benchmark before claiming improvement. +8. Inspect HotSpot decisions before claiming JVM-level reasons. + +## Default coding bias + +- Prefer an algorithmic win over a micro win. +- Prefer data structures that fit the operation mix, memory budget, and key domain. +- Prefer the right execution model over reflexively adding code generation. +- Prefer primitive-friendly layouts before boxed object graphs. +- Prefer zero-copy over copy-transform-copy. +- Prefer reuse over per-item allocation. +- Prefer lazy traversal over full materialization. +- Prefer primitives, flat arrays, and tight counted loops in hot paths. +- Prefer monomorphic calls that inline away. +- Prefer specialized lambda/adaptor code for the active workload. +- Prefer one fast path plus one cold fallback over a single generalized hot path. +- Prefer Janino only when generated Java can stay simple, code size can stay bounded, and compile cost can be amortized. + +## Hard rules + +- Do not micro-optimize a fundamentally wrong algorithm. +- Do not defend a perf change with style arguments alone. +- Do not claim “faster” without a measurement path. +- Do not claim “JIT will optimize this” without checking inlining / compilation evidence. +- Do not add a specialized library until you know what property it buys: fewer allocations, fewer copies, lower contention, off-heap layout, better primitive support, stronger compilation/runtime specialization, or a stronger algorithm. +- Do not introduce Janino or other runtime codegen unless compile latency, cache keys, code-size limits, classloader lifetime, and fallback behavior are explicit. +- Do not compile entire query plans blindly when only a subset of operators is hot or fusible. 
+- Do not generate fancy modern Java syntax for Janino unless support is verified on the active Janino/runtime combination; conservative generated Java is the default. +- Do not keep elegant-but-generic stream pipelines in verified hot loops. +- Do not pay interface / visitor / wrapper overhead inside the hottest loop unless evidence shows it disappears. +- Do not default to boxed `Map` / `Set` / `List` shapes when primitive collections or flat arrays better fit the dominant path. + +## Design checklist + +Ask these first: +- What are `N`, `Q`, the update/query ratio, and the memory budget? +- Is the main problem asymptotic complexity, cache locality, allocation pressure, branchiness, contention, I/O, or execution-model overhead? +- What operation dominates: membership, counting, top-k, range query, join, shortest path, DP transition, parsing, encoding, filter/projection evaluation, aggregation, or tuple materialization? +- Can the key/value/state space stay primitive or bit-packed? +- Can the workload become offline, batched, sorted, prefix-based, vectorized, or compressed? +- What allocates on the steady-state path? +- What copies bytes, chars, arrays, or collections? +- What materializes intermediate state that could stay streamed or cursor-based? +- What dispatch stays virtual or megamorphic in the inner loop? +- What loop shape blocks scalar replacement, inlining, or SuperWord vectorization? +- What “generic” branch handles cases the active workload never uses? +- How often will a generated shape execute, and can compile cost be amortized? +- Can compiled artifacts be cached by normalized shape, types, nullability, and algorithm choice? +- What is the fallback path for cold queries, oversized generated code, compile failure, or classloader churn? +- What method-size, class-size, or constant-pool limits could the generated code hit? +- Who owns generated classes, caches, and classloaders over time? 
+ +## Workflow + +### 0) Pick the algorithmic shape + +- Estimate the real workload: input size, query count, mutation pattern, latency target, and memory ceiling. +- Choose the algorithm and data structure before tuning loop syntax. +- Favor contiguous, cache-friendly, primitive-heavy representations when semantics allow. +- For dynamic programming, define state, transition cost, base case, iteration order, and whether state compression is possible. +- For graph/range/string problems, look for offline transforms, prefix structures, monotonic structures, or specialized search before hand-tuning. + +Read these only when relevant: +- [references/algorithms-data-structures.md](references/algorithms-data-structures.md) for algorithm and data-structure selection. +- [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) for dynamic programming and advanced problem-solving patterns. + +### 1) Choose the execution model before shaping the code + +- Ask whether the path should stay interpreted, become vectorized/batched, or justify runtime code generation. +- Prefer interpretation for cold, one-shot, or highly irregular workloads when compile latency will dominate. +- Prefer vectorization/batching when cache-miss hiding, SIMD-friendly processing, or blocking operator boundaries dominate. +- Prefer runtime code generation when the same shape executes repeatedly, per-tuple overhead dominates, and generated code can stay narrow and bounded. +- In query engines, fuse straight pipelines first; split at blocking operators, large mutable state, code-size pressure, or unstable branches. +- If Janino is chosen, keep generated Java conservative, keep helper methods small, and plan for cache + fallback from the start. + +Detailed guidance: see [references/codegen-and-janino.md](references/codegen-and-janino.md). + +### 2) Shape the code for HotSpot + +- Split hot and cold paths. +- Hoist invariant checks and decoding outside the loop. 
+- Replace generic callback stacks with narrow-path adapters. +- Reuse mutable carriers only when ownership is clear. +- Keep loop bodies predictable, contiguous, and exception-light. +- For generated code, favor explicit loops, primitive locals/fields, simple helper methods, and stable call targets. + +Detailed rules: see [references/coding-rules.md](references/coding-rules.md). + +### 3) Measure + +If you are in this RDF4J repo, use the local benchmark wrapper first: + +```bash +scripts/run-single-benchmark.sh --module --class --method +``` + +If you are outside RDF4J, use JMH or an existing reproducible micro/macro benchmark. + +Measurement workflow: see [references/evidence-workflow.md](references/evidence-workflow.md). + +### 4) Explain with JVM evidence + +When a benchmark moves, inspect what HotSpot actually did: +- compilation tier +- inlining success/failure +- intrinsic usage when relevant +- allocation pressure +- assembly / C2 logs when needed + +Use sibling skill [hotspot-jit-forensics](../hotspot-jit-forensics/SKILL.md) for method-scoped C2 evidence. Use `async-profiler-java-macos` when wall/cpu/alloc evidence is needed on macOS. + +### 5) Use libraries intentionally + +- Prefer the JDK first when it is close enough and operationally simpler. +- Reach for specialized libraries when they remove boxing, copies, parser overhead, contention, off-heap indirection, or runtime compilation friction the JDK cannot. +- Check dependency health before adding a new library. +- Benchmark the library choice against the simplest credible in-repo baseline. + +Library reference: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md). + +### 6) Report honestly + +Frame conclusions as: +- hypothesis +- algorithm/data-structure choice +- execution-model choice +- benchmark result +- JIT/profile evidence +- confidence + +If assembly is unavailable, say so and fall back to compilation logs, inlining diagnostics, and profile data. 
+ +## Trigger examples + +Use this skill when the user asks to: +- remove allocation pressure from a parser, iterator, encoder, decoder, or query loop +- make a Java path zero-copy or lazy +- choose the right data structure for a Java workload +- solve a dynamic programming, graph, interval, ranking, or range-query problem in Java under performance constraints +- replace boxed collections with primitive or cache-friendly structures +- choose between the JDK and specialized Java performance libraries +- decide whether a query engine should stay interpreted, become vectorized, or use Janino/runtime code generation +- design generated Java for projections, filters, joins, aggregations, or expression evaluators +- specialize code for one workload instead of many +- explain whether a HotSpot optimization actually happened +- ground a Java perf change in benchmark + C2 evidence + +## Reference map + +- Algorithms and data structures: [references/algorithms-data-structures.md](references/algorithms-data-structures.md) +- Advanced coding techniques: [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) +- Codegen and Janino for query engines: [references/codegen-and-janino.md](references/codegen-and-janino.md) +- High-performance Java libraries: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md) +- Coding rules: [references/coding-rules.md](references/coding-rules.md) +- Evidence workflow: [references/evidence-workflow.md](references/evidence-workflow.md) +- JDK version guardrails: [references/jdk-21-26-notes.md](references/jdk-21-26-notes.md) + +## Output contract + +When you use this skill, the answer should usually include: +- workload model and asymptotic bottleneck +- execution-model recommendation: interpreted, vectorized/batched, Janino/runtime codegen, or another compilation path +- algorithm and data-structure recommendation +- hot-path hypothesis +- concrete code-shape recommendation +- 
cache/fallback plan when runtime codegen is part of the design +- library recommendation when a library meaningfully changes the design +- benchmark command or benchmark evidence +- JIT/profile evidence or the missing prerequisite +- a confidence statement tied to the active JDK diff --git a/.codex/skills/high-performance-java/agents/openai.yaml b/.codex/skills/high-performance-java/agents/openai.yaml new file mode 100644 index 00000000000..32383dbcad8 --- /dev/null +++ b/.codex/skills/high-performance-java/agents/openai.yaml @@ -0,0 +1,4 @@ +interface: + display_name: "High-Performance Java" + short_description: "Hot-path Java, plus algorithm/perf-library, execution-model, and runtime-codegen guidance" + default_prompt: "Use $high-performance-java to choose the right algorithm, execution model, data structure, runtime codegen strategy, library, and HotSpot-friendly code shape for a high-performance Java task." diff --git a/.codex/skills/high-performance-java/references/advanced-coding-techniques.md b/.codex/skills/high-performance-java/references/advanced-coding-techniques.md new file mode 100644 index 00000000000..3b20bb09f2b --- /dev/null +++ b/.codex/skills/high-performance-java/references/advanced-coding-techniques.md @@ -0,0 +1,220 @@ +# Advanced Coding Techniques + +Use this reference when the problem needs more than basic loops and collections: dynamic programming, advanced search, state compression, offline transforms, or optimization patterns that materially change runtime. + +## Dynamic programming checklist + +Before writing code, define: +- state: the minimum information needed to continue +- transition: how one state moves to the next +- base case: the smallest solved states +- order: top-down memoization or bottom-up tabulation +- objective: min, max, count, feasibility, reconstruction +- memory plan: full table, rolling rows, bitset, or sparse map + +If any of those are fuzzy, the DP is not ready. 
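As a concrete instance of the checklist, here is a minimal 0/1 knapsack sketch (class and method names are invented for illustration): the state is the best value per capacity in one flat primitive array, the transition tries each item once, the base case is the all-zero array, and reverse capacity order is what enforces 0/1 semantics.

```java
// 0/1 knapsack over a single primitive array: dp[c] = best value using capacity c.
// Base case: dp[c] = 0 with no items considered.
final class Knapsack {
    static int maxValue(int[] weight, int[] value, int capacity) {
        int[] dp = new int[capacity + 1];
        for (int i = 0; i < weight.length; i++) {
            // Reverse iteration: each item is used at most once (0/1 semantics).
            // Forward iteration here would give the unbounded-knapsack variant.
            for (int c = capacity; c >= weight[i]; c--) {
                dp[c] = Math.max(dp[c], dp[c - weight[i]] + value[i]);
            }
        }
        return dp[capacity];
    }
}
```

If every part of the checklist maps onto a line of this shape, the DP is ready; if not, the missing piece is usually the state definition.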
+ +## DP implementation bias in Java + +- Prefer flat primitive arrays over nested object graphs. +- Flatten `dp[row][col]` into one array when locality matters. +- Use sentinel values (`INF`, `-1`, impossible masks) instead of wrapper objects. +- Compress dimensions aggressively when a transition only needs prior rows or prior prefixes. +- Use iterative tabulation when recursion depth or call overhead is risky. +- Use memoization when the reachable state space is sparse or pruning is strong. + +## Common DP families + +### 1D DP + +Use for: +- linear decisions +- prefix optimization +- classic knapsack-style transitions + +Java notes: +- Often compresses to one array. +- Direction matters: reverse iterate for 0/1 knapsack; forward iterate for unbounded knapsack. + +### 2D grid / sequence DP + +Use for: +- edit distance +- LCS variants +- path counting +- interval composition + +Java notes: +- Two rolling rows often replace the full matrix. +- Keep row-major iteration consistent with memory layout. + +### Interval DP + +Use for: +- merge cost +- matrix chain multiplication +- optimal parenthesization +- palindrome partitioning + +Heuristic: +- Try increasing interval length order. +- Precompute reusable range costs. + +### Tree DP + +Use for: +- subtree aggregation +- rerooting +- independent set / matching variants on trees + +Java notes: +- Iterative traversal can avoid stack overflow. +- Store parent/index arrays once; reuse buffers for passes. + +### DAG DP + +Use for: +- longest path in DAG +- path counts +- dependency-ordered optimization + +Heuristic: +- Topological order first, transitions second. + +### Bitmask DP + +Use for: +- small `n` subset problems +- travelling-salesman-style state +- assignment and partition variants + +Java notes: +- Use `int` masks up to 31 bits, `long` masks up to 63. +- Precompute subset transitions when reused heavily. +- Beware exponential memory growth; consider meet-in-the-middle. 
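A minimal Held-Karp sketch of the bitmask pattern above, assuming a small complete distance matrix (names are invented for illustration): `dp[mask][last]` packs the visited set into an `int` mask, and a halved `Integer.MAX_VALUE` is the addition-safe sentinel recommended earlier.

```java
// Held-Karp over int bitmasks: dp[mask][last] = cheapest path that visits
// exactly the cities in 'mask' and ends at 'last', starting from city 0.
final class TspMask {
    static int shortestTour(int[][] dist) {
        int n = dist.length, full = (1 << n) - 1;
        int INF = Integer.MAX_VALUE / 2; // sentinel that survives one addition
        int[][] dp = new int[1 << n][n];
        for (int[] row : dp) java.util.Arrays.fill(row, INF);
        dp[1][0] = 0; // base case: only city 0 visited
        for (int mask = 1; mask <= full; mask++) {
            for (int last = 0; last < n; last++) {
                if (dp[mask][last] >= INF) continue; // unreachable state
                for (int next = 0; next < n; next++) {
                    if ((mask & (1 << next)) != 0) continue; // already visited
                    int nm = mask | (1 << next);
                    dp[nm][next] = Math.min(dp[nm][next], dp[mask][last] + dist[last][next]);
                }
            }
        }
        int best = INF;
        for (int last = 1; last < n; last++) best = Math.min(best, dp[full][last] + dist[last][0]);
        return best;
    }
}
```

Memory is `O(2^n * n)` ints, which is exactly the exponential growth the note above warns about; past roughly `n = 20`, consider meet-in-the-middle instead.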
+ +### Digit DP + +Use for: +- counting numbers with digit constraints +- lexicographic numeric constraints + +State usually includes: +- position +- tight/limited flag +- started/leading-zero flag +- problem-specific accumulator + +## DP optimization patterns + +### Prefix/suffix acceleration + +If a transition scans prior states, ask whether prefix minima/maxima/sums can reduce it from `O(n^2)` to `O(n)`. + +### Monotonic queue optimization + +Use when transitions need min/max over a sliding window. + +### Divide-and-conquer DP optimization + +Use when the optimal split point is monotonic across rows or columns. + +### Convex hull trick / Li Chao tree + +Use when transitions are of the form: +- `dp[i] = min_j(m[j] * x[i] + b[j])` +- `max` variant of the same + +Only use when the algebra really matches. + +### Bitset DP + +Use when boolean subset transitions can become word-parallel bit operations. + +Examples: +- subset sum +- knapsack feasibility +- reachability layers + +### State compression + +Reduce dimensions by: +- keeping only prior row/column +- encoding booleans into bits +- coordinate-compressing sparse values +- using ids instead of objects + +## Search and optimization patterns + +### Binary search on answer + +Use when: +- feasibility is monotonic +- exact objective is hard but checking a threshold is easier + +### Meet-in-the-middle + +Use when: +- brute force is `2^n` +- `n` is small enough to split into two `2^(n/2)` halves + +### Branch and bound + +Use when: +- you can compute tight upper/lower bounds +- a good heuristic ordering prunes much of the tree + +### Iterative deepening + +Use when: +- memory is tight +- solution depth is unknown but usually shallow + +### Offline query processing + +Use when: +- query order is irrelevant +- sorting queries/events lets you reuse structure updates + +## Greedy and exchange-thinking + +Before building DP or search, test whether a greedy proof exists: +- local choice stays globally optimal +- exchange 
argument repairs any non-greedy optimal solution +- matroid-like or interval-scheduling structure is present + +If greedy works, it often beats DP both asymptotically and operationally. + +## Range and sequence patterns + +- Sliding window: monotonic boundary expansion or contraction. +- Two pointers: sorted arrays, pair/triple sums, dedup, partitioning. +- Monotonic stack: next greater/smaller, histogram, span problems. +- Difference arrays: batch range updates. +- Prefix sums / xor / hashes: cheap repeated range queries. + +## Java-specific implementation notes + +- Avoid recursion for deep graphs, trees, or DP unless the depth bound is small. +- Replace tuple objects with parallel arrays or packed longs in hot paths. +- Pre-size arrays and reusable buffers for repeated test cases. +- Be explicit about overflow; use `long` for counts/costs unless `int` is proven safe. +- Separate correctness code from hot code paths once the algorithm is clear. + +## Problem-solving ladder + +When stuck, try this order: +1. Can I sort or batch the work? +2. Can I precompute prefix, suffix, or compressed state? +3. Can a different data structure remove a nested loop? +4. Is the problem actually graph, interval, or DP in disguise? +5. Can the state shrink to primitives or bits? +6. Can I prove greedy, monotonicity, or convexity? + +## Red flags + +- DP state includes fields that do not affect future transitions. +- Memoization key is a heavyweight object when a few ints suffice. +- Full `O(n^2)` table retained even though only one frontier is used. +- Search explores symmetric states repeatedly. +- A library data structure is used where a flat array plus sort is enough. 
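As one worked instance of the monotonic-queue pattern from the range-and-sequence list above (a sketch with invented names, not a library API): a primitive `int[]` ring of indexes keeps values strictly decreasing from front to back, so every sliding-window maximum is read from the head, in O(n) total and with no boxed deque.

```java
// Sliding-window maximum via a monotonic deque of indexes stored in a flat int[].
// Invariant: a[deque[head]] > a[deque[head+1]] > ... so the head is the window max.
final class WindowMax {
    static int[] maxPerWindow(int[] a, int k) {
        int[] out = new int[a.length - k + 1];
        int[] deque = new int[a.length]; // index storage; live region is [head, tail)
        int head = 0, tail = 0;
        for (int i = 0; i < a.length; i++) {
            if (head < tail && deque[head] <= i - k) head++;          // index left the window
            while (head < tail && a[deque[tail - 1]] <= a[i]) tail--; // keep values decreasing
            deque[tail++] = i;
            if (i >= k - 1) out[i - k + 1] = a[deque[head]];
        }
        return out;
    }
}
```

Each index enters and leaves the deque at most once, which is where the linear bound comes from.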
diff --git a/.codex/skills/high-performance-java/references/algorithms-data-structures.md b/.codex/skills/high-performance-java/references/algorithms-data-structures.md new file mode 100644 index 00000000000..96dc2481c54 --- /dev/null +++ b/.codex/skills/high-performance-java/references/algorithms-data-structures.md @@ -0,0 +1,181 @@ +# Algorithms and Data Structures + +Use this reference when the main question is algorithmic shape, data-structure choice, or whether a complexity change dominates any JVM-level tuning. Biggest wins usually come from changing the slope before shaving cycles. + +## Triage first + +Before choosing a structure, answer these: +- Is the workload one-shot, batched, or online? +- Do you need insertion order, sorted order, or just membership? +- Are keys dense integers, sparse integers, strings, tuples, or custom objects? +- Are queries point lookups, range queries, top-k queries, path queries, or aggregate queries? +- Is the structure static after build, append-only, or heavily mutable? +- Can the state stay primitive, bit-packed, or index-based? + +## Default data-structure bias + +- `int[]`, `long[]`, `byte[]`: best starting point when size is known or can grow geometrically. +- `ArrayList`: good general dynamic array when boxing is acceptable and traversal dominates. +- `ArrayDeque`: default queue/stack/deque. Better cache shape than `LinkedList`. +- `HashMap` / `HashSet`: baseline for sparse membership and counting when boxing cost is acceptable. +- `TreeMap` / `TreeSet`: only when ordered updates and queries are intrinsic. Do not pay `O(log n)` if sort-once plus scan works. +- `BitSet`: excellent for dense integer domains, set algebra, visited flags, and some DP/state compression. + +## Primitive-first guidance + +If keys/values are primitive and the path is hot: +- Prefer flat arrays when bounds are manageable. +- Prefer primitive maps/sets/heaps over boxed collections when boxing dominates time or memory. 
+- Use coordinate compression when raw keys are large but the distinct key count is moderate. +- Represent relations as integer ids plus parallel arrays instead of object graphs when traversal dominates. + +## Membership, dedup, counting + +- Hash table: default for sparse exact membership and frequency counting. +- Sort plus scan: strong when the data is batch-oriented, read-mostly, or you also need grouping/order. +- BitSet / boolean array: best for dense bounded integer keys. +- Bloom filter: prefilter only. Use when false positives are acceptable but false negatives are not. + +Red flags: +- Nested membership scans over lists. +- Repeated `contains` on `ArrayList` in hot code. +- Boxing primitive keys when the domain can be compressed. + +## Top-k, ranking, scheduling + +- Binary heap / priority queue: streaming top-k, best-first search, event scheduling. +- Quickselect: one-shot kth element or top-k partition when full sort is wasteful. +- Bucket/counting approach: when values live in a small bounded domain. +- Monotonic deque: sliding-window min/max in linear time. + +Java notes: +- JDK `PriorityQueue` is fine for many cases but boxes primitives. +- For tiny fixed `k`, a sorted small array can beat a heap. + +## Prefix, range, and interval workloads + +- Prefix sum: immutable range-sum/count queries. +- Difference array: batched range updates with one final sweep. +- Fenwick tree: point updates plus prefix/range aggregates in `O(log n)` with low constants. +- Segment tree: more flexible range updates/queries, but heavier than Fenwick. +- Sparse table: immutable idempotent range queries such as min/max/gcd. +- Sweep line: interval overlap, event merging, skyline, booking, and geometry-style event problems. + +Decision rule: +- Static data: prefer prefix sums or sparse tables. +- Dynamic point updates: Fenwick first. +- Complex dynamic range operations: segment tree only when simpler structures fail. 
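A minimal sketch of the "Fenwick first" rule above (illustrative class; flat `long[]`, 1-based internally, 0-based API): both operations walk `x & -x` steps, which is what keeps the constants low compared to a segment tree.

```java
// Fenwick (binary indexed) tree: point add + prefix/range sum in O(log n)
// over a single flat long[] with no node objects.
final class Fenwick {
    private final long[] tree; // 1-based internally

    Fenwick(int n) { tree = new long[n + 1]; }

    void add(int i, long delta) {            // i is 0-based
        for (int x = i + 1; x < tree.length; x += x & -x) tree[x] += delta;
    }

    long prefixSum(int i) {                  // sum of values[0..i]
        long s = 0;
        for (int x = i + 1; x > 0; x -= x & -x) s += tree[x];
        return s;
    }

    long rangeSum(int lo, int hi) {
        return prefixSum(hi) - (lo > 0 ? prefixSum(lo - 1) : 0);
    }
}
```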
+ +## Graph workloads + +- BFS: unweighted shortest path, level order, flood fill. +- 0-1 BFS: edge weights only `0` or `1`. +- Dijkstra: non-negative weighted shortest path. +- Topological sort plus DP: DAG path/count problems. +- Union-find (disjoint set union): connectivity under merges, Kruskal, component grouping. +- Tarjan/Kosaraju: strongly connected components. + +Java notes: +- Prefer adjacency as primitive arrays or compact edge lists when the graph is large. +- Avoid per-edge objects on hot traversals. +- Beware recursion depth on DFS; iterative stacks are often safer. + +## String and sequence workloads + +- Sliding window / two pointers: substring/segment constraints with monotonic boundaries. +- KMP or Z-function: repeated pattern matching in linear time. +- Rolling hash: fast substring comparisons with collision caveat. +- Trie: prefix queries and dictionary walks when the alphabet is manageable. +- Aho-Corasick: multiple pattern matching. +- Patience sorting / tails array: `O(n log n)` LIS. + +Java notes: +- Avoid repeated substring materialization in tight loops. +- Work on `byte[]`, `char[]`, offsets, or integer ids when possible. + +## Ordered search and offline transforms + +- Sort plus binary search: often simpler and faster than maintaining ordered trees. +- Coordinate compression: map large sparse keys into `[0..m)` for arrays, Fenwick trees, and bitsets. +- Offline queries: sort events/queries once, then answer in a sweep. +- Meet-in-the-middle: split exponential search into two half-enumerations. + +## Data-structure atlas + +### Arrays and flat buffers + +Use when: +- Traversal dominates. +- Keys can be mapped to integer indexes. +- You need maximum locality and minimum allocation. + +Avoid when: +- Sparse domains would explode memory. +- Mutation semantics need expensive shifting and you cannot batch. + +### Hash tables + +Use when: +- Exact membership/counting dominates. +- Order is irrelevant or can be restored later. 
+ +Avoid when: +- Dense bounded keys fit a bitset or direct array. +- You only need one batch query and sort plus scan is cheaper. + +### Heaps + +Use when: +- You need repeated access to min/max with incremental updates. +- Best-first exploration or top-k streaming dominates. + +Avoid when: +- You only need one final order; sort once instead. +- `k` is tiny and a fixed small array is cheaper. + +### Bitsets + +Use when: +- The domain is dense or compressible. +- Boolean DP, visited state, set algebra, or fast intersections matter. + +Avoid when: +- The domain is too sparse after compression. + +### Fenwick and segment trees + +Use when: +- Simple arrays are too static. +- Query/update interleaving matters. + +Avoid when: +- Prefix sums or difference arrays solve the same problem. +- The workload is too small to justify structural overhead. + +### Union-find + +Use when: +- The workload is merge-only connectivity. +- You need amortized near-constant component unions/finds. + +Avoid when: +- You need deletions or rich path queries. + +## Algorithmic red flags + +- `O(n^2)` nested scans hidden inside "simple" collection code. +- Re-sorting on every query. +- `LinkedList` for queue/stack workloads. +- Object-per-node or object-per-edge layouts in large graphs or DP tables. +- Recomputing prefix information instead of caching it. +- Dense DP stored as `Map` when a flat array works. +- Maintaining balanced trees when sort-once plus array search is enough. + +## Escalation rule + +If you can change: +- `O(n^2)` to `O(n log n)` or `O(n)`, +- boxed/object-heavy state to primitive/flat state, +- online mutable work to offline batched work, + +do that before micro-tuning loop syntax or arguing about JIT trivia. 
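To make the union-find entry concrete, a flat-array sketch with path halving and union by size (names are invented for illustration): two `int[]` arrays, no node objects, and amortized near-constant operations.

```java
// Disjoint set union over flat int arrays: path halving on find,
// union by size on merge. No per-node objects.
final class Dsu {
    private final int[] parent, size;

    Dsu(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving: point at grandparent
            x = parent[x];
        }
        return x;
    }

    boolean union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;        // already connected
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        size[ra] += size[rb];
        return true;
    }
}
```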
diff --git a/.codex/skills/high-performance-java/references/codegen-and-janino.md b/.codex/skills/high-performance-java/references/codegen-and-janino.md new file mode 100644 index 00000000000..15178c8b086 --- /dev/null +++ b/.codex/skills/high-performance-java/references/codegen-and-janino.md @@ -0,0 +1,259 @@ +# Codegen and Janino for Query Engines + +Use this reference when the question is not just “how do I make this loop faster?” but “should this JVM path be specialized or compiled at runtime at all?” + +This file is especially relevant for: +- query engines +- expression evaluators +- generated filters, projections, joins, and aggregations +- repeated execution of the same logical shape with different bindings +- deciding between interpretation, vectorization, and runtime code generation + +## First decision: should you use runtime codegen? + +Ask these before introducing Janino or any other runtime compiler: +- Is the same plan or operator shape executed enough times to amortize compile cost? +- Is the bottleneck per-tuple interpreter/dispatch overhead rather than a worse algorithm or poor data layout? +- Can the generated code stay small, simple, and stable enough to compile quickly? +- Can you cache by normalized plan shape, types, nullability, and algorithm choice? +- Do you already have a correct interpreted or vectorized fallback? + +If the answers are mostly “no”, do not start with Janino. 
+ +## Decision rule + +### Prefer interpretation when: +- the workload is cold or one-shot +- plans are highly irregular or rarely repeated +- compile latency would dominate wall time +- you still do not understand the real hot operators + +### Prefer vectorization or batching when: +- you need better cache-miss hiding +- SIMD-friendly bulk processing is available +- pipelines naturally break at blocking operators +- compile latency is hard to amortize + +### Prefer Janino/runtime codegen when: +- the same shape runs repeatedly +- per-row function-call / virtual-dispatch / boxing overhead dominates +- the generated code can be primitive-heavy and monomorphic +- the engine already has a clean IR/template layer and fallback path + +## What Janino is good at + +Janino is a small embedded Java compiler for runtime compilation. Treat it as a pragmatic JVM tool for turning generated Java source into bytecode in memory. + +Good fits: +- scalar expression evaluators +- generated projections and filters +- compact join/aggregation helpers +- medium-size fused operator pipelines +- metadata or dispatch classes that are expensive to interpret repeatedly + +Bad fits: +- giant whole-query classes without splitting +- source that relies on modern Java syntax unless verified +- designs that need very tight control over bytecode layout or native code generation +- systems with no plan for classloader ownership, eviction, and cache pressure + +## What other Java-based engines and frameworks do + +### Spark SQL + +Spark uses whole-stage Java code generation and compiles generated Java with Janino. The engine explicitly tracks compilation time and inspects generated bytecode statistics. It also splits generated code to stay under JVM method-size and constant-pool limits. 
+ +Design lesson: +- generate fused Java for hot pipelines +- but split aggressively before code size becomes a correctness or compile-time problem +- treat compile time as a first-class metric, not as background noise + +### Flink Table/SQL runtime + +Flink uses generated runtime classes broadly and also exposes Janino `ExpressionEvaluator` compilation for expressions. Flink caches compiled code because repeated Janino compilation creates new class loaders/classes and can become a metaspace/class-unloading bottleneck. + +Design lesson: +- cache compiled artifacts by code shape and classloader context +- own classloader lifetime deliberately +- watch for cache-related class leaks, not just compile speed + +### Apache Calcite + +Calcite uses Janino for scalar `RexNode` compilation and for generated metadata dispatch handlers. + +Design lesson: +- Janino is useful even when you are not compiling whole pipelines +- expression compilation and dispatch generation are often easier wins than compiling everything + +### Apache Drill + +Drill supports both Janino and the JDK compiler, and by default uses Janino only below a configurable source-size threshold. Larger generated sources are handed to the JDK compiler. + +Design lesson: +- do not bind the engine to one compiler policy +- use size-based or complexity-based compiler selection +- have a fallback when Janino stops being the right tool + +## What the research says + +### Foundational result + +Compiled query execution can substantially outperform classic iterator-style interpretation because it reduces generic per-tuple overhead and gives the compiler a tighter, more optimizable control flow. + +### Important correction + +Compiled execution is not a universal winner. Research comparing compiled and vectorized query engines found no single paradigm that always dominates. 
+ +### More recent direction + +State-of-the-art work focuses on: +- reducing compilation latency +- compiling only the fragments that pay for themselves +- combining vectorized and compiled execution instead of forcing a binary choice +- adaptive compilation decisions during execution + +Practical conclusion: +- “Use Janino everywhere” is not state of the art +- “Never use Janino” is also wrong +- the modern answer is selective, cached, fallback-friendly specialization + +## Query-engine design rules + +### 1) Separate logical planning from codegen + +Do not generate Java directly from arbitrary logical trees everywhere. + +Prefer: +- logical plan +- physical plan +- small codegen IR / templates / operator fragments +- generated Java only at the final step + +This keeps code splitting, reuse, and fallback manageable. + +### 2) Compile fragments, not dogma + +Good fusion targets: +- scan -> filter -> project +- probe-side inner loops +- simple aggregate update loops +- expression trees that are repeatedly evaluated + +Good split points: +- hash build / sort / materialization boundaries +- highly branchy optional logic +- very large generated state +- code-size pressure +- operators that already benefit from vectorized kernels + +### 3) Keep generated Java conservative + +For Janino, default to a conservative Java subset: +- primitive locals and fields +- explicit loops +- explicit null checks +- simple helper methods +- predictable class shapes +- minimal reflection +- minimal generics in generated source + +Do not assume support for newer Java syntax just because the runtime JDK is modern. 
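A minimal sketch of what this conservative subset can look like coming out of a template layer. The class and method names here are purely illustrative, not an engine API, and a real engine would emit from a validated IR rather than raw ints:

```java
// Hypothetical template emitter: turns a (column, threshold) filter spec into
// the conservative generated-Java subset described above. All names are
// illustrative, not an actual engine API.
public final class GeneratedFilterTemplate {

    public static String generate(String className, int column, int threshold) {
        // Primitive locals, an explicit counted loop, explicit braces,
        // no generics, no lambdas: friendly to Janino and to C2 alike.
        return "public final class " + className + " {\n"
                + "    public static int filter(int[][] rows, int[] selection) {\n"
                + "        int out = 0;\n"
                + "        for (int i = 0; i < rows.length; i++) {\n"
                + "            int v = rows[i][" + column + "];\n"
                + "            if (v > " + threshold + ") {\n"
                + "                selection[out] = i;\n"
                + "                out = out + 1;\n"
                + "            }\n"
                + "        }\n"
                + "        return out;\n"
                + "    }\n"
                + "}\n";
    }
}
```

The emitted source would then go to the runtime compiler (for example Janino's `SimpleCompiler`, or the JDK compiler once the source grows), with the spec values doubling as part of the compile cache key.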
+ +### 4) Design for code-size limits up front + +Watch these failure modes: +- huge methods +- giant static initializers +- constant-pool pressure +- large switch ladders +- giant string-built source blobs + +Countermeasures: +- split helper methods early +- split classes if state grows too large +- move large literals/references into arrays or external holders +- stop fusing once code-size pressure starts dominating + +### 5) Cache compiled artifacts deliberately + +Cache keys often need more than the SQL string. Include the pieces that change the generated machine shape, such as: +- normalized operator tree or pipeline shape +- physical operator choices +- input schema / internal types +- nullability +- sort/hash/key layout +- relevant runtime feature flags + +The cache policy must also answer: +- who owns the classloader? +- when are generated classes collectible? +- what is the eviction policy? +- what is the fallback on cache miss or compile failure? + +### 6) Measure cold and warm separately + +For codegen, “query time” is ambiguous. + +Always separate: +- first-run compile + execute +- warm cached execute +- compile-failure fallback path +- classloading/metaspace side effects + +If you only report the warm number, you can easily hide a bad design. + +### 7) Keep a non-codegen path + +A query engine should usually keep at least one of: +- interpreted fallback +- vectorized fallback +- alternate compiler path + +Reasons: +- cold queries +- oversized generated sources +- Janino language or method-size limits +- production debugging +- incremental rollout and A/B comparison + +## Janino-specific checklist for generated query code + +Before you approve a Janino-based design, check these: +- Is the generated source simple enough for Janino rather than javac/bytecode generation? +- Are helper methods split before they approach size limits? +- Are repeated plan shapes cached? +- Is there a strategy for dumping generated source on failure? 
+- Is compile latency measured independently from execution latency? +- Is the parent classloader stable and intentional? +- Is there a fallback compiler or interpreted/vectorized path? +- Does the codegen layer avoid user text injection and only emit from validated IR/templates? + +## When to prefer something other than Janino + +### Prefer the JDK compiler when: +- source is large +- compile latency is less critical than language compatibility +- Janino limitations become binding + +### Prefer ASM / bytecode libraries when: +- you need exact bytecode control +- source generation itself becomes expensive or brittle +- you need to avoid Java-parser/compiler overhead entirely + +### Prefer native/IR-based compilation when: +- research-grade peak throughput matters most +- you need tighter control than the JVM compilation pipeline gives you +- you are prepared to own a much more complex toolchain + +## Practical recommendation for RDF/query-engine style work + +A strong JVM default is: +1. start with a correct interpreted path +2. add vectorized/batched kernels for the obvious bulk operators +3. add Janino for repeated scalar/pipeline specializations that remain small and stable +4. cache compiled shapes aggressively +5. split code before limits force you to +6. keep a fallback for cold or oversized paths +7. benchmark cold, warm, and failure/fallback paths separately + +That is much closer to current best practice than either “compile nothing” or “compile everything.” diff --git a/.codex/skills/high-performance-java/references/coding-rules.md b/.codex/skills/high-performance-java/references/coding-rules.md new file mode 100644 index 00000000000..acfb3434c3f --- /dev/null +++ b/.codex/skills/high-performance-java/references/coding-rules.md @@ -0,0 +1,65 @@ +# Coding Rules + +Use these rules only for real or suspected hot paths. Outside hot code, keep the code simple. 
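One hedged illustration of the bias these rules encode, before the rules themselves. This is a hypothetical hot path with illustrative names; it assumes well-formed ASCII input, and a real change of this kind would still need the benchmark evidence described elsewhere in this skill:

```java
// Hypothetical hot-path sketch: sum comma-separated ASCII integer fields
// directly from the source buffer (offset/length cursor, primitive
// accumulator) instead of splitting into Strings and parsing boxed values.
// Assumes well-formed input: digits and commas only, non-negative values.
public final class CsvSum {

    // Sums the integer fields in buf[from, to) without allocating per field.
    public static long sumFields(byte[] buf, int from, int to) {
        long total = 0;
        int value = 0;
        for (int i = from; i < to; i++) {
            byte b = buf[i];
            if (b == ',') {
                total += value;   // field boundary: fold into the accumulator
                value = 0;
            } else {
                value = value * 10 + (b - '0');
            }
        }
        return total + value;     // last field has no trailing comma
    }
}
```

The materializing alternative (`new String(buf)`, `split(",")`, `Integer.parseInt` per field) allocates a string and an array element for every field; this shape keeps the loop on primitives and the source buffer.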
+ +## Zero-copy rules + +- Pass slices, offsets, lengths, cursors, or views instead of copying into new arrays/strings/collections. +- Decode or parse directly from the source buffer when ownership and lifetime allow it. +- Delay conversion to `String`, boxed numbers, or collection objects until a boundary that actually needs them. +- Prefer bulk operations that map to JDK intrinsics when they replace manual copy/compare loops. + +## Reuse rules + +- Reuse mutable carriers, builders, encoders, decoders, and scratch arrays when one owner controls lifetime. +- Reinitialize reusable state cheaply; do not reconstruct deep object graphs inside the loop. +- Avoid thread-local caches unless the access pattern is proven hot and safe. +- Do not reuse objects across boundaries where aliasing or stale-state bugs become likely. + +## Lazy and non-materializing rules + +- Stream results directly to the consumer when the consumer can handle incremental delivery. +- Prefer iterators/cursors/sinks over `collect then filter/map`. +- Keep intermediate state as indices, spans, or primitive accumulators instead of wrapper objects. +- Materialize once at a boundary; not at each transformation stage. + +## Dispatch and inlining rules + +- Prefer `static`, `private`, or effectively final call targets on the inner path. +- Keep call sites monomorphic when possible; push interface selection above the hot loop. +- Split fast path from generic path when one workload dominates. +- Flatten tiny wrapper/helper layers when they prevent clear inlining. +- Treat interface-heavy visitor chains and generic function stacks as suspects until proven free by evidence. + +## Intrinsic and vectorization rules + +- Prefer primitive arrays and contiguous memory access. +- Write simple counted loops with hoisted bounds and invariant checks. +- Avoid hidden aliasing, side exits, and exception-heavy bodies in vectorizable loops. 
+- Prefer JDK library methods that HotSpot commonly treats specially over open-coded copies/comparisons/hashes when semantics match. +- Verify vectorization and intrinsic assumptions on the active JDK; do not assume cross-version stability. + +## Lambda specialization rules + +- Generate or choose workload-specific lambdas/adapters when the hot path only needs one shape. +- Prebind constants and remove unused branches from the inner callback. +- Avoid polymorphic chains of `Function` / `Predicate` / `Consumer` in hot loops when a direct method or specialized adapter will do. +- Prefer one specialized lambda per workload over one generalized lambda with internal branching. + +## Anti-patterns + +| Anti-pattern | Hot-path cost | Prefer | +| --- | --- | --- | +| Streams in verified hot loops | allocation, boxing, dispatch | direct counted loop | +| Generic visitor/callback towers | missed inline, megamorphism | split fast path + cold fallback | +| Temporary wrappers per item | allocation pressure | primitive fields or reusable carrier | +| Defensive copies on steady-state path | bandwidth + GC | views/slices/ownership checks | +| Materialize then filter/map | memory + latency | lazy cursor/sink pipeline | +| Repeated decode/encode boundary crossings | redundant work | keep native form longer | +| One abstraction for all workloads | branchy hot path | specialized narrow path | + +## Decision rule + +If a change makes the code uglier but removes copies, allocations, or polymorphism from a measured hot path, it can be worth it. + +If the path is not hot, do not apply these rules aggressively. 
diff --git a/.codex/skills/high-performance-java/references/evidence-workflow.md b/.codex/skills/high-performance-java/references/evidence-workflow.md
new file mode 100644
index 00000000000..d2fb6d1280d
--- /dev/null
+++ b/.codex/skills/high-performance-java/references/evidence-workflow.md
@@ -0,0 +1,78 @@
+# Evidence Workflow
+
+Use this workflow before making strong performance claims.
+
+## RDF4J path
+
+1. Reproduce with the local benchmark wrapper.
+
+```bash
+scripts/run-single-benchmark.sh --module <module> --class <class> --method <method>
+```
+
+2. If the benchmark moves but cause is unclear:
+   - use `--enable-jfr` for benchmark-side JFR capture
+   - or use `async-profiler-java-macos` for cpu / alloc / wall evidence on macOS
+3. If code shape or JIT behavior is the question:
+   - use [hotspot-jit-forensics](../hotspot-jit-forensics/SKILL.md)
+   - capture compilation tier, inlining decisions, and method-scoped C2 evidence
+
+## Generic Java path
+
+1. Build the smallest reproducible JMH or app-level benchmark.
+2. Capture baseline result.
+3. Change code shape.
+4. Capture candidate result with same JVM, flags, input size, and warmup assumptions.
+5. If the delta matters, inspect JIT evidence:
+
+```bash
+java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=jit.xml -XX:+PrintCompilation -jar app.jar
+```
+
+If assembly or per-method diagnostics are needed, move to focused compiler directives and the `hotspot-jit-forensics` workflow.
+
+## Additional workflow for runtime codegen / Janino
+
+If the change introduces generated Java or runtime compilation, do not stop at a single warm benchmark. 
+ +Also capture: +- cold compile + first execute time +- warm cached execute time +- generated source size or code-shape proxy +- cache hit/miss behavior if caching is part of the design +- fallback behavior on compile failure or code-size overflow +- classloader / metaspace symptoms if repeated compilation is involved + +## Output contract + +Report these five items: +- benchmark delta: throughput/latency before vs after +- allocation delta: lower / unchanged / unknown +- JIT evidence: inline success/failure, tier, bailout, intrinsic, vectorization clue, or “not inspected” +- exact command or benchmark selector +- confidence: high / medium / low + +If runtime codegen is involved, also report: +- cold compile cost: measured / unknown +- warm cache behavior: hit / miss / unknown +- fallback path exercised: yes / no / unknown + +## Confidence rules + +- High: repeatable benchmark delta plus matching profile/JIT evidence +- Medium: repeatable benchmark delta without definitive low-level proof +- Low: one run, noisy run, or JVM explanation not verified + +For runtime codegen, confidence drops if cold-start cost, cache reuse, or fallback behavior were not inspected. + +## Fallback when assembly is unavailable + +Do not stop at “assembly unavailable”. + +Still collect: +- `jit.xml` +- compiler directives output +- `PrintCompilation` / inlining diagnostics +- async-profiler or JFR evidence + +Then say the exact missing piece: for example `hsdis` not installed or assembly printing not enabled. 
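The cold-versus-warm separation described in this workflow can be sketched with a JDK-only probe. This is a hypothetical harness using `javax.tools` as a stand-in for whatever runtime compiler the engine actually uses; all names are illustrative:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical probe: measure cold (compile + load + first call) and warm
// (later call on the already-loaded class) separately, never as one number.
public final class ColdWarmProbe {

    public static long[] measure() {
        try {
            Path dir = Files.createTempDirectory("codegen-probe");
            Path src = dir.resolve("Gen.java");
            Files.writeString(src,
                    "public final class Gen {"
                            + " public static long run(long n) {"
                            + "  long s = 0; for (long i = 0; i < n; i++) { s += i; } return s; } }");

            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            long t0 = System.nanoTime();
            if (javac.run(null, null, null, "-d", dir.toString(), src.toString()) != 0) {
                throw new IllegalStateException("compile failed");
            }
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                var run = loader.loadClass("Gen").getMethod("run", long.class);
                long first = (long) run.invoke(null, 1_000L); // includes classloading cost
                long cold = System.nanoTime() - t0;

                long t1 = System.nanoTime();
                long again = (long) run.invoke(null, 1_000L);
                long warm = System.nanoTime() - t1;
                if (again != first) {
                    throw new IllegalStateException("result mismatch");
                }
                return new long[] { cold, warm };
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Report the two numbers separately. In a real engine the same split applies per cached shape, with cache hits skipping the cold leg entirely.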
diff --git a/.codex/skills/high-performance-java/references/high-performance-java-libraries.md b/.codex/skills/high-performance-java/references/high-performance-java-libraries.md new file mode 100644 index 00000000000..e4f09a131c2 --- /dev/null +++ b/.codex/skills/high-performance-java/references/high-performance-java-libraries.md @@ -0,0 +1,238 @@ +# High-Performance Java Libraries + +Use this reference when the JDK baseline is known and you need to decide whether a library meaningfully improves layout, primitive support, concurrency, serialization, caching, runtime code generation, or observability. + +## Selection rule + +Do not add a library because it is "fast" in the abstract. Add it only when it buys at least one concrete property: +- primitive collections without boxing +- better buffer or off-heap control +- lower-contention queues or caches +- tighter binary encoding +- runtime compilation or code-generation support you actually need +- observability or benchmarking you cannot credibly replace + +Always compare against the simplest viable JDK baseline first. + +## JDK first choices + +Start here before adding dependencies: +- `ArrayDeque`: queue/stack/deque default +- `BitSet`: dense boolean/set algebra and bit-parallel state +- `PriorityQueue`: heap baseline +- `ConcurrentHashMap`: baseline concurrent map +- `LongAdder` / `LongAccumulator`: striped counters under contention +- `VarHandle`: low-level atomic/ordered field access +- `ByteBuffer`: baseline direct or heap buffer abstraction +- `javax.tools.JavaCompiler`: baseline full Java compiler when compatibility matters more than minimal footprint +- JMH, JFR, and `jcmd`: measurement and runtime evidence + +If these solve the problem with acceptable cost, stop. 
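Two of the picks above in use, as a hedged sketch. The class name is illustrative, and the thread counts below are toy values, not a benchmark:

```java
import java.util.BitSet;
import java.util.concurrent.atomic.LongAdder;

// Illustrative uses of two JDK-first choices: BitSet for dense set algebra,
// LongAdder for contended counting.
public final class JdkFirstDemo {

    // BitSet: bit-parallel intersection, no per-element objects.
    public static int overlap(int[] a, int[] b, int universe) {
        BitSet sa = new BitSet(universe);
        for (int x : a) {
            sa.set(x);
        }
        BitSet sb = new BitSet(universe);
        for (int x : b) {
            sb.set(x);
        }
        sa.and(sb);
        return sa.cardinality();
    }

    // LongAdder: striped writes, so contended increments scale better than a
    // single AtomicLong; reads (sum) pay the aggregation cost instead.
    public static long countInParallel(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    adder.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return adder.sum();
    }
}
```

If shapes like these cover the need, stop there; the libraries below only earn their place when a profile shows the JDK baseline falling short.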
+ +## Primitive collections + +### fastutil + +Use when: +- primitive maps, sets, lists, heaps, or big arrays are needed +- boxing in JDK collections is visible in memory or CPU profiles + +Good fit: +- `int -> int`, `long -> long`, and similar dense/sparse maps +- adjacency lists, frequency maps, index maps + +Caution: +- still benchmark against flat arrays when keys can be compressed + +### HPPC + +Use when: +- you want lean primitive collections with a smaller API surface +- hot loops need primitive containers without a broad framework + +### Eclipse Collections primitive containers + +Use when: +- you already use Eclipse Collections +- you need richer collection operations but want primitive variants + +## Buffers, off-heap, and low-latency plumbing + +### Agrona + +Use when: +- you need direct buffers, ring buffers, counters, or low-latency transport helpers +- you want explicit control over memory layout and flyweight-style access + +### Chronicle Bytes / Chronicle Queue / Chronicle Map + +Use when: +- off-heap or memory-mapped storage is intrinsic to the design +- inter-process communication or persisted queue semantics matter + +Caution: +- operational complexity is much higher than plain on-heap structures + +### Netty `ByteBuf` + +Use when: +- the stack already uses Netty +- pooled buffers and zero-copy byte handling matter + +Avoid when: +- pulling in Netty only for a small standalone buffer need + +## Runtime compilation and code generation + +### Janino + +Use when: +- you need embedded, in-memory compilation of generated Java at runtime +- generated classes stay compact, conservative, and easy to split +- the same generated shapes execute enough times to amortize compile cost +- you want a pragmatic JVM path for expression evaluators or medium-size query-engine pipelines + +Good fit: +- generated filters and projections +- scalar expression evaluators +- compact aggregation or join helpers +- metadata or dispatch classes that are expensive to 
interpret repeatedly + +Caution: +- do not use Janino as a reflex +- plan for code-size limits, compile latency, cache keys, classloader lifetime, and fallback +- generated Java should stay conservative and simple +- if the source gets large or needs richer Java support, compare against the JDK compiler or a bytecode-level approach + +### ASM / bytecode-generation libraries + +Use when: +- source-code generation becomes brittle or expensive +- you need exact bytecode control +- Janino or javac overhead becomes the bottleneck + +Caution: +- complexity is much higher than template-based Java generation +- debugging and maintenance costs rise quickly + +## Concurrency and queues + +### JCTools + +Use when: +- single-producer/single-consumer or MPSC queue semantics are well defined +- `java.util.concurrent` queues show contention or allocation issues + +### LMAX Disruptor + +Use when: +- you have a staged event-processing pipeline +- extremely low latency and mechanical sympathy matter more than API simplicity + +Caution: +- only a fit for specific architectures; not a general queue replacement + +### Caffeine + +Use when: +- you need a production cache with strong hit-rate behavior and concurrency +- cache eviction policy quality matters, not just raw map speed +- you need to cache compiled artifacts, metadata, or other reuse-heavy structures with bounded growth + +## Bitmaps and compressed sets + +### RoaringBitmap + +Use when: +- integer sets are sparse-to-medium density +- you need fast unions, intersections, or membership with lower memory than plain bitsets + +Good fit: +- analytics filters +- posting lists +- visited/frontier sets with large sparse ids + +## Serialization, parsing, and wire formats + +### Jackson + +Use when: +- interoperability and ecosystem support matter more than max throughput + +Tune before replacing: +- reuse `ObjectMapper` +- avoid tree model on hot paths +- stream when full materialization is unnecessary + +### DSL-JSON / jsoniter / 
specialized parsers + +Use when: +- JSON remains required but generic reflection-heavy parsing is too expensive + +### Protocol Buffers + +Use when: +- schema evolution and interoperability matter + +### FlatBuffers / SBE / Chronicle Wire + +Use when: +- binary layout, lower-copy reads, or ultra-low latency wire handling matter more than generality + +Caution: +- these choices affect interfaces and tooling, not just speed + +## Numerics and vector-style work + +### JDK Vector API + +Use when: +- the workload is data parallel +- you can express operations as bulk lane-wise math + +Caution: +- JDK-version-sensitive; validate on the active runtime + +### EJML and similar numerics libraries + +Use when: +- matrix or numeric kernels dominate and bespoke loops are not the business value + +## Benchmarking and profiling + +### JMH + +Use when: +- you need trustworthy microbenchmarks + +### JFR + +Use when: +- you need low-overhead production-friendly profiling + +### async-profiler + +Use when: +- you need CPU, wall, alloc, or lock evidence with low overhead + +## Practical defaults + +If the bottleneck is: +- boxing in maps/sets: try `fastutil` first +- queue contention: compare JDK queues with `JCTools` +- cache behavior: use `Caffeine` +- sparse integer set algebra: use `RoaringBitmap` +- direct/off-heap buffer control: look at `Agrona` +- repeated compilation of the same generated shapes: use `Caffeine` or an equivalent bounded cache around Janino/JDK compilation +- serious binary wire efficiency: compare Protobuf with FlatBuffers or SBE +- generated Java source getting too large or too fragile: compare Janino with `JavaCompiler` or ASM rather than forcing one tool to fit every case + +## Library red flags + +- Adding a library before a JDK baseline exists +- Replacing a simple array algorithm with a complex dependency +- Using a concurrency library without matching the actual producer/consumer pattern +- Choosing off-heap because it sounds faster, not because GC or 
sharing semantics require it +- Adopting a serialization stack without accounting for ecosystem, tooling, and evolution constraints +- Introducing Janino without an explicit cache/fallback/classloader strategy +- Using runtime codegen to paper over a bad algorithm or wrong execution model diff --git a/.codex/skills/high-performance-java/references/jdk-21-26-notes.md b/.codex/skills/high-performance-java/references/jdk-21-26-notes.md new file mode 100644 index 00000000000..8e4fad77f23 --- /dev/null +++ b/.codex/skills/high-performance-java/references/jdk-21-26-notes.md @@ -0,0 +1,42 @@ +# JDK 21 to 26 Notes + +Treat JDK behavior as version-sensitive. + +## Defaults + +- Repository baseline: JDK 21 +- Current local runtime may be newer (e.g. JDK 26) +- Advice about inlining, intrinsics, vectorization, and loop optimizations must be checked on the active runtime + +## What stays stable enough + +- Fewer allocations usually helps +- Fewer copies usually helps +- Monomorphic hot calls are easier to inline than megamorphic ones +- Primitive, contiguous loop shapes are friendlier to optimization than object-heavy callback stacks + +## What must be verified + +- Whether a specific JDK method lowers to an intrinsic on this runtime +- Whether SuperWord or related loop optimizations fire for this loop shape +- Whether a call chain fully inlines on this runtime +- Whether scalar replacement / escape analysis removes the expected allocation +- Whether benchmark results carry across JDK 21 and JDK 26 + +## Reporting rule + +When giving low-level JVM explanations, say which JDK you are talking about. + +Good: +- `On JDK 26 this loop appears to inline fully and the benchmark improves by 12%.` +- `On the JDK 21 baseline, verify the same claim before treating it as settled.` + +Bad: +- `HotSpot will optimize this.` +- `The JVM should vectorize it.` + +## Upgrade rule + +If a change is intended for the repo baseline, prefer evidence on JDK 21. 
+ +If only a newer runtime is available locally, say that clearly and lower confidence until the baseline JVM is checked. diff --git a/AGENTS.md b/AGENTS.md index 6d98efc76de..46cc656a079 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -127,46 +127,6 @@ When writing complex features or significant refactors, use an ExecPlan (as desc When writing complex features or significant refactors, use an ExecPlan (as described in PLANS.md) from design to implementation. -## PIOSEE Decision Model (Adopted) - -Use this as a compact, repeatable loop for anything from a one‑line bug fix to a multi‑quarter program. - -### P — **Problem** - -**Goal:** State the core problem and what “good” looks like. -**Ask:** Who’s affected? What outcome is required? What happens if we do nothing? -**Tip:** Include measurable target(s): error rate ↓, latency p95 ↓, revenue ↑, risk ↓. - -### I — **Information** - -**Goal:** Gather only the facts needed to move. -**Ask:** What do logs/metrics/user feedback say? What constraints (security, compliance, budget, SLA/SLO)? What assumptions must we test? - -### O — **Options** - -**Goal:** Generate viable ways forward, including “do nothing.” -**Ask:** What are 2–4 distinct approaches (patch, redesign, buy vs. build, defer)? What risks, costs, and second‑order effects? -**Tip:** Check guardrails: reliability, security/privacy, accessibility, performance, operability, unit economics. - -### S — **Select** - -**Goal:** Decide deliberately and document why. -**Ask:** Which option best meets the success criteria under constraints? Who is the decision owner? What’s the fallback/abort condition? -**Tip:** Use lightweight scoring (e.g., Impact×Confidence÷Effort) to avoid bike‑shedding. - -### E — **Execute** - -**Goal:** Ship safely and visibly. -**Ask:** What is the smallest safe slice? How do we de‑risk (feature flag, canary, dark launch, rollback)? Who owns what? 
-**Checklist:** Traces/logs/alerts; security & privacy checks; docs & changelog; incident plan if relevant. - -### E — **Evaluate** - -**Goal:** Verify outcomes and learn. -**Ask:** Did metrics hit targets? Any regressions or side effects? What will we keep/change next loop? -**Output:** Post‑release review (or retro), decision log entry, follow‑ups (tickets), debt captured. -**Tip:** If outcomes miss, either **iterate** (new Options) or **reframe** (back to Problem). - --- ### Benchmarking workflow (repository-wide) @@ -489,7 +449,6 @@ When writing complex features or significant refactors, use an ExecPlan (as desc ## Working Loop -* **PIOSEE first:** restate Problem, gather Information, list Options; then Select, Execute, Evaluate. * **Plan:** small, verifiable steps; keep one `in_progress`, or follow PLANS.md (ExecPlans) * **Change:** minimal, surgical edits; keep style/structure consistent. * **Format:** `mvn -o -Dmaven.repo.local=.m2_repo -q -T 2C process-resources` @@ -688,7 +647,6 @@ Immediately after creating any new Java source file, add the signature comment ( * **Files touched:** list file paths. * **Commands run:** key build/test commands. * **Verification:** which tests passed, where you checked reports. -* **PIOSEE trace (concise):** P/I/O summary, selected option/routine, key evaluate outcomes. * **Evidence:** *Routine A:* failing output (pre‑fix) and passing output (post‑fix). *Routine B:* pre‑ and post‑green snippets from the **same selection** + **Hit Proof**. 
diff --git a/core/common/iterator/src/main/java/org/eclipse/rdf4j/common/iteration/DistinctIteration.java b/core/common/iterator/src/main/java/org/eclipse/rdf4j/common/iteration/DistinctIteration.java index 0f9215ac837..69e9952bf8b 100644 --- a/core/common/iterator/src/main/java/org/eclipse/rdf4j/common/iteration/DistinctIteration.java +++ b/core/common/iterator/src/main/java/org/eclipse/rdf4j/common/iteration/DistinctIteration.java @@ -27,7 +27,7 @@ public class DistinctIteration extends FilterIteration { /** * The elements that have already been returned. */ - private final Set excludeSet; + private Set excludeSet; /*--------------* * Constructors * @@ -76,26 +76,13 @@ public DistinctIteration(CloseableIteration iter, Supplier> */ @Override protected boolean accept(E object) { - if (inExcludeSet(object)) { - // object has already been returned - return false; - } else { - add(object); - return true; - } + return add(object); } @Override protected void handleClose() { - - } - - /** - * @param object - * @return true if the object is in the excludeSet - */ - private boolean inExcludeSet(E object) { - return excludeSet.contains(object); + // help GC by removing link to set + excludeSet = null; } /** diff --git a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSet.java b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSet.java index 9b0fa63b047..0a2736e6d07 100644 --- a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSet.java +++ b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSet.java @@ -8,8 +8,11 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.query.algebra.evaluation; +import java.lang.invoke.MethodHandles; +import 
java.lang.invoke.VarHandle; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; @@ -43,9 +46,24 @@ public class ArrayBindingSet extends AbstractBindingSet implements MutableBindin private static final long serialVersionUID = -1L; + @FunctionalInterface + public interface SortedBindingNamesCache { + List get(long activeBindingMask); + } + private static final Logger logger = LoggerFactory.getLogger(ArrayBindingSet.class); private static final Value NULL_VALUE = Values .iri("urn:null:d57c56f3-41a9-468e-8dce-5706ebdef84c_e88d9e52-27cb-4056-a889-1ea353fa6f0c"); + private static final VarHandle LAST_EQUALS_MISMATCH_INDEX; + private static int lastEqualsMismatchIndex; + static { + try { + LAST_EQUALS_MISMATCH_INDEX = MethodHandles.lookup() + .findStaticVarHandle(ArrayBindingSet.class, "lastEqualsMismatchIndex", int.class); + } catch (ReflectiveOperationException e) { + throw new ExceptionInInitializerError(e); + } + } private final String[] bindingNames; @@ -54,6 +72,9 @@ public class ArrayBindingSet extends AbstractBindingSet implements MutableBindin private boolean empty; private final Value[] values; + private final SortedBindingNamesCache sharedSortedBindingNamesCache; + private long activeBindingMask; + private int cachedSize = -1; private int cachedHashCode; /** @@ -64,16 +85,29 @@ public class ArrayBindingSet extends AbstractBindingSet implements MutableBindin * @param names The binding names. */ public ArrayBindingSet(String... 
names) { + this(names, null); + } + + public ArrayBindingSet(String[] names, SortedBindingNamesCache sharedSortedBindingNamesCache) { this.bindingNames = names; this.values = new Value[names.length]; + this.sharedSortedBindingNamesCache = sharedSortedBindingNamesCache; this.empty = true; + this.cachedSize = 0; } public ArrayBindingSet(BindingSet toCopy, Set names, String[] namesArray) { + this(toCopy, names, namesArray, null); + } + + public ArrayBindingSet(BindingSet toCopy, Set names, String[] namesArray, + SortedBindingNamesCache sharedSortedBindingNamesCache) { assert !(toCopy instanceof ArrayBindingSet); this.bindingNames = namesArray; this.values = new Value[this.bindingNames.length]; + this.sharedSortedBindingNamesCache = sharedSortedBindingNamesCache; + int size = 0; for (int i = 0; i < this.bindingNames.length; i++) { Binding binding = toCopy.getBinding(this.bindingNames[i]); @@ -82,20 +116,36 @@ public ArrayBindingSet(BindingSet toCopy, Set names, String[] namesArray if (this.values[i] == null) { this.values[i] = NULL_VALUE; } + size++; } else if (hasBinding(this.bindingNames[i])) { this.values[i] = NULL_VALUE; + size++; } } this.empty = toCopy.isEmpty(); + this.cachedSize = this.empty ? 0 : size; + if (sharedSortedBindingNamesCache != null) { + this.activeBindingMask = calculateActiveBindingMask(); + } assert !this.empty || size() == 0; } public ArrayBindingSet(ArrayBindingSet toCopy, String... 
names) { + this(toCopy, names, null); + } + + public ArrayBindingSet(ArrayBindingSet toCopy, String[] names, + SortedBindingNamesCache sharedSortedBindingNamesCache) { this.bindingNames = names; this.values = Arrays.copyOf(toCopy.values, toCopy.values.length); + this.sharedSortedBindingNamesCache = sharedSortedBindingNamesCache; this.empty = toCopy.empty; + this.cachedSize = toCopy.cachedSize; + if (sharedSortedBindingNamesCache != null) { + this.activeBindingMask = calculateActiveBindingMask(); + } assert !this.empty || size() == 0; } @@ -117,6 +167,7 @@ public BiConsumer getDirectSetBinding(String bindingName return (v, a) -> { a.values[index] = v == null ? NULL_VALUE : v; a.empty = false; + a.updateActiveBindingMask(index, true); a.clearCache(); }; } @@ -132,6 +183,7 @@ public BiConsumer getDirectAddBinding(String bindingName assert a.values[index] == null; a.values[index] = v == null ? NULL_VALUE : v; a.empty = false; + a.updateActiveBindingMask(index, true); a.clearCache(); }; @@ -280,18 +332,23 @@ public Iterator iterator() { @Override public int size() { if (isEmpty()) { + cachedSize = 0; return 0; } - int size = 0; + if (cachedSize == -1) { + int size = 0; - for (Value value : values) { - if (value != null) { - size++; + for (Value value : values) { + if (value != null) { + size++; + } } + + cachedSize = size; } - return size; + return cachedSize; } @Override @@ -355,23 +412,28 @@ private boolean slowIsCompatible(BindingSet other) { public List getSortedBindingNames() { if (sortedBindingNames == null) { - int size = size(); - - if (size == 1) { - for (int i = 0; i < bindingNames.length; i++) { - if (values[i] != null) { - sortedBindingNames = Collections.singletonList(bindingNames[i]); - } - } + if (sharedSortedBindingNamesCache != null) { + sortedBindingNames = sharedSortedBindingNamesCache.get(activeBindingMask); } else { - ArrayList names = new ArrayList<>(size); - for (int i = 0; i < bindingNames.length; i++) { - if (values[i] != null) { - 
names.add(bindingNames[i]); + int size = size(); + + if (size == 1) { + for (int i = 0; i < bindingNames.length; i++) { + if (values[i] != null) { + sortedBindingNames = Collections.singletonList(bindingNames[i]); + break; + } + } + } else { + ArrayList names = new ArrayList<>(size); + for (int i = 0; i < bindingNames.length; i++) { + if (values[i] != null) { + names.add(bindingNames[i]); + } } + names.sort(String::compareTo); + sortedBindingNames = names; } - names.sort(String::compareTo); - sortedBindingNames = names; } } @@ -393,6 +455,7 @@ public void addBinding(Binding binding) { assert this.values[index] == null; this.values[index] = value == null ? NULL_VALUE : value; empty = false; + updateActiveBindingMask(index, true); clearCache(); } @@ -405,6 +468,7 @@ public void setBinding(Binding binding) { Value value = binding.getValue(); this.values[index] = value == null ? NULL_VALUE : value; empty = false; + updateActiveBindingMask(index, true); clearCache(); } @@ -416,6 +480,7 @@ public void setBinding(String name, Value value) { } this.values[index] = value; + updateActiveBindingMask(index, value != null); if (value == null) { this.empty = true; for (Value value1 : this.values) { @@ -437,6 +502,8 @@ public boolean isEmpty() { private void clearCache() { bindingNamesSetCache = null; + sortedBindingNames = null; + cachedSize = -1; cachedHashCode = 0; } @@ -459,7 +526,33 @@ public void addAll(ArrayBindingSet other) { } clearCache(); + if (sharedSortedBindingNamesCache != null) { + activeBindingMask = calculateActiveBindingMask(); + } + + } + + private void updateActiveBindingMask(int index, boolean hasValue) { + if (sharedSortedBindingNamesCache == null) { + return; + } + + long maskBit = 1L << index; + if (hasValue) { + activeBindingMask |= maskBit; + } else { + activeBindingMask &= ~maskBit; + } + } + private long calculateActiveBindingMask() { + long mask = 0L; + for (int i = 0; i < values.length; i++) { + if (values[i] != null) { + mask |= 1L << i; + } + } + 
return mask; } @Override @@ -495,9 +588,20 @@ public boolean equals(Object other) { } if (bindingNames == o.bindingNames) { - for (int i = 0; i < values.length; i++) { - if (values[i] != o.values[i]) { - if (!Objects.equals(values[i], o.values[i])) { + int valuesLength = values.length; + int startIndex = (int) LAST_EQUALS_MISMATCH_INDEX.getOpaque(); + if (startIndex >= valuesLength) { + startIndex = 0; + } + + for (int i = 0; i < valuesLength; i++) { + int index = startIndex + i; + if (index >= valuesLength) { + index -= valuesLength; + } + if (values[index] != o.values[index]) { + if (!Objects.equals(values[index], o.values[index])) { + LAST_EQUALS_MISMATCH_INDEX.setOpaque(index); return false; } } diff --git a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/impl/ArrayBindingBasedQueryEvaluationContext.java b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/impl/ArrayBindingBasedQueryEvaluationContext.java index 8ae18963cd5..26f008efe07 100644 --- a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/impl/ArrayBindingBasedQueryEvaluationContext.java +++ b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/impl/ArrayBindingBasedQueryEvaluationContext.java @@ -8,15 +8,19 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.query.algebra.evaluation.impl; +import java.util.ArrayList; import java.util.Arrays; +import java.util.Collections; import java.util.Comparator; import java.util.HashMap; import java.util.HashSet; import java.util.LinkedHashMap; import java.util.List; import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; import java.util.function.BiConsumer; import java.util.function.Function; import java.util.function.Predicate; @@ -50,6 +54,9 @@ public final class 
ArrayBindingBasedQueryEvaluationContext implements QueryEvaluationContext { + private static final int FULL_MASK_CACHE_MAX_VARIABLES = 12; + private static final int FULL_MASK_CACHE_MAX_VARIABLES_POW = 2 << FULL_MASK_CACHE_MAX_VARIABLES; + public static final Predicate HAS_BINDING_FALSE = (bs) -> false; public static final Function GET_BINDING_NULL = (bs) -> null; public static final Function GET_VALUE_NULL = (bs) -> null; @@ -67,6 +74,7 @@ public final class ArrayBindingBasedQueryEvaluationContext implements QueryEvalu private final BiConsumer[] setBinding; private final BiConsumer[] addBinding; private final Comparator comparator; + private final ArrayBindingSet.SortedBindingNamesCache sortedBindingNamesCache; private final boolean initialized; @@ -77,8 +85,9 @@ public ArrayBindingBasedQueryEvaluationContext(QueryEvaluationContext context, S this.context = context; this.allVariables = allVariables; this.allVariablesSet = Set.of(allVariables); - this.defaultArrayBindingSet = new ArrayBindingSet(allVariables); this.comparator = comparator; + this.sortedBindingNamesCache = createSortedBindingNamesCache(); + this.defaultArrayBindingSet = new ArrayBindingSet(allVariables, sortedBindingNamesCache); hasBinding = new Predicate[allVariables.length]; getBinding = new Function[allVariables.length]; @@ -115,7 +124,7 @@ public Dataset getDataset() { @Override public ArrayBindingSet createBindingSet() { - return new ArrayBindingSet(allVariables); + return new ArrayBindingSet(allVariables, sortedBindingNamesCache); } @Override @@ -329,12 +338,77 @@ public BiConsumer addBinding(String variableName) { @Override public ArrayBindingSet createBindingSet(BindingSet bindings) { if (bindings instanceof ArrayBindingSet) { - return new ArrayBindingSet((ArrayBindingSet) bindings, allVariables); + return new ArrayBindingSet((ArrayBindingSet) bindings, allVariables, sortedBindingNamesCache); } else if (bindings == EmptyBindingSet.getInstance()) { return createBindingSet(); } else { - return 
new ArrayBindingSet(bindings, allVariablesSet, allVariables); + return new ArrayBindingSet(bindings, allVariablesSet, allVariables, sortedBindingNamesCache); + } + } + + private ArrayBindingSet.SortedBindingNamesCache createSortedBindingNamesCache() { + if (allVariables.length <= Long.SIZE) { + return new TieredSortedBindingNamesCache(); + } + return null; + } + + private List createSortedBindingNames(long activeBindingMask) { + int size = Long.bitCount(activeBindingMask); + + if (size == 0) { + return Collections.emptyList(); } + + if (size == 1) { + return Collections.singletonList(allVariables[Long.numberOfTrailingZeros(activeBindingMask)]); + } + + ArrayList names = new ArrayList<>(size); + for (int i = 0; i < allVariables.length; i++) { + if ((activeBindingMask & (1L << i)) != 0) { + names.add(allVariables[i]); + } + } + names.sort(String::compareTo); + return names; + } + + private final class TieredSortedBindingNamesCache implements ArrayBindingSet.SortedBindingNamesCache { + + private final List[] fullMaskCache; + private final ConcurrentHashMap> largerBindingCache = new ConcurrentHashMap<>(); + + private TieredSortedBindingNamesCache() { + if (allVariables.length <= FULL_MASK_CACHE_MAX_VARIABLES) { + fullMaskCache = new List[1 << allVariables.length]; + } else { + fullMaskCache = null; + } + } + + @Override + public List get(long activeBindingMask) { + if (fullMaskCache != null) { + return getFromFullMaskCache(activeBindingMask); + } + + return largerBindingCache.computeIfAbsent(activeBindingMask, + ArrayBindingBasedQueryEvaluationContext.this::createSortedBindingNames); + } + + private List getFromFullMaskCache(long activeBindingMask) { + int index = (int) activeBindingMask; + List sortedBindingNames = fullMaskCache[index]; + if (sortedBindingNames != null) { + return sortedBindingNames; + } + + List created = createSortedBindingNames(activeBindingMask); + fullMaskCache[index] = created; + return fullMaskCache[index]; + } + } public static String[] 
findAllVariablesUsedInQuery(QueryRoot node) { diff --git a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/BadlyDesignedLeftJoinIterator.java b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/BadlyDesignedLeftJoinIterator.java index be0eae6c875..bec5afe007f 100644 --- a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/BadlyDesignedLeftJoinIterator.java +++ b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/BadlyDesignedLeftJoinIterator.java @@ -10,8 +10,11 @@ *******************************************************************************/ package org.eclipse.rdf4j.query.algebra.evaluation.iterator; +import java.util.NoSuchElementException; import java.util.Set; +import org.eclipse.rdf4j.common.iteration.CloseableIteration; +import org.eclipse.rdf4j.common.iteration.LookAheadIteration; import org.eclipse.rdf4j.query.BindingSet; import org.eclipse.rdf4j.query.MutableBindingSet; import org.eclipse.rdf4j.query.QueryEvaluationException; @@ -23,7 +26,7 @@ /** * @author Arjohn Kampman */ -public class BadlyDesignedLeftJoinIterator extends LeftJoinIterator { +public class BadlyDesignedLeftJoinIterator extends LookAheadIteration { /*-----------* * Variables * @@ -33,6 +36,12 @@ public class BadlyDesignedLeftJoinIterator extends LeftJoinIterator { private final Set problemVars; + private final CloseableIteration leftIter; + + private final QueryEvaluationStep rightEvaluationStep; + + private CloseableIteration rightIter; + /*--------------* * Constructors * *--------------*/ @@ -43,10 +52,12 @@ public BadlyDesignedLeftJoinIterator( BindingSet inputBindings, Set problemVars, QueryEvaluationStep rightEvaluationStep) throws QueryEvaluationException { - super(strategy, join, getFilteredBindings(inputBindings, problemVars), rightEvaluationStep); + leftIter = strategy.evaluate(join.getLeftArg(), 
getFilteredBindings(inputBindings, problemVars)); + this.rightEvaluationStep = rightEvaluationStep; + rightIter = null; + join.setAlgorithm(this); this.inputBindings = inputBindings; this.problemVars = problemVars; - } /*---------* @@ -58,18 +69,20 @@ public BadlyDesignedLeftJoinIterator(QueryEvaluationStep left, Set problemVars, QueryEvaluationStep rightEvaluationStep) throws QueryEvaluationException { - super(left, getFilteredBindings(inputBindings, problemVars), rightEvaluationStep); + leftIter = left.evaluate(getFilteredBindings(inputBindings, problemVars)); + this.rightEvaluationStep = rightEvaluationStep; + rightIter = null; this.inputBindings = inputBindings; this.problemVars = problemVars; } @Override protected BindingSet getNextElement() throws QueryEvaluationException { - BindingSet result = super.getNextElement(); + BindingSet result = getNextLeftJoinElement(); // Ignore all results that are not compatible with the input bindings while (result != null && !inputBindings.isCompatible(result)) { - result = super.getNextElement(); + result = getNextLeftJoinElement(); } if (result != null) { @@ -94,6 +107,63 @@ protected BindingSet getNextElement() throws QueryEvaluationException { return result; } + private BindingSet getNextLeftJoinElement() throws QueryEvaluationException { + + try { + CloseableIteration nextRightIter = rightIter; + while (nextRightIter == null || nextRightIter.hasNext() || leftIter.hasNext()) { + BindingSet leftBindings = null; + + if (nextRightIter == null) { + if (leftIter.hasNext()) { + // Use left arg's bindings in case join fails + leftBindings = leftIter.next(); + nextRightIter = rightIter = rightEvaluationStep.evaluate(leftBindings); + } else { + return null; + } + } else if (!nextRightIter.hasNext()) { + // Use left arg's bindings in case join fails + leftBindings = leftIter.next(); + + nextRightIter.close(); + nextRightIter = rightIter = rightEvaluationStep.evaluate(leftBindings); + } + + if (nextRightIter == 
QueryEvaluationStep.EMPTY_ITERATION) { + rightIter = null; + return leftBindings; + } + + if (nextRightIter.hasNext()) { + return nextRightIter.next(); + } + + if (leftBindings != null) { + rightIter = null; + // Join failed, return left arg's bindings + return leftBindings; + } + } + } catch (NoSuchElementException ignore) { + // probably, one of the iterations has been closed concurrently in + // handleClose() + } + + return null; + } + + @Override + protected void handleClose() throws QueryEvaluationException { + try { + leftIter.close(); + } finally { + if (rightIter != null) { + rightIter.close(); + } + } + } + /*--------------------* * Static util method * *--------------------*/ diff --git a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/OrderIterator.java b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/OrderIterator.java index 43ad68233b4..7d63e887845 100644 --- a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/OrderIterator.java +++ b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/iterator/OrderIterator.java @@ -19,6 +19,7 @@ import java.io.ObjectOutputStream; import java.io.Serializable; import java.util.AbstractQueue; +import java.util.ArrayDeque; import java.util.ArrayList; import java.util.Arrays; import java.util.Collection; @@ -336,7 +337,7 @@ private static class SortedIterators implements Iterator { private final List> iterators; - private final TreeMap> head; + private final TreeMap> head; private final boolean distinct; @@ -375,8 +376,8 @@ public E next() { if (head.isEmpty()) { return null; } else { - Entry> e = head.firstEntry(); - advance(e.getValue().remove(0)); + Entry> e = head.firstEntry(); + advance(e.getValue().removeFirst()); if (e.getValue().isEmpty()) { head.remove(e.getKey()); } @@ -387,11 +388,12 @@ public E next() { private void advance(int i) { while 
(iterators.get(i).hasNext()) { E key = iterators.get(i).next(); - if (!head.containsKey(key)) { - head.put(key, new LinkedList<>(List.of(i))); + ArrayDeque integers = head.get(key); + if (integers == null) { + head.put(key, new ArrayDeque<>(List.of(i))); break; } else if (!distinct) { - head.get(key).add(i); + integers.add(i); break; } } diff --git a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/util/OrderComparator.java b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/util/OrderComparator.java index 5086cdb3e6b..7e9a9ad229e 100644 --- a/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/util/OrderComparator.java +++ b/core/queryalgebra/evaluation/src/main/java/org/eclipse/rdf4j/query/algebra/evaluation/util/OrderComparator.java @@ -142,8 +142,8 @@ public int compare(BindingSet o1, BindingSet o2) { } // binding set sizes are equal. compare on binding names. - if (o2bindingNamesOrdered != null && !sortedEquals(o1bindingNamesOrdered, o2bindingNamesOrdered) - || !o1.getBindingNames().equals(o2.getBindingNames())) { + if (!o1.getBindingNames().equals(o2.getBindingNames()) || ((o2bindingNamesOrdered != null) + && !sortedEquals(o1bindingNamesOrdered, o2bindingNamesOrdered))) { if (o2bindingNamesOrdered == null) { o2bindingNamesOrdered = getSortedBindingNames(o2.getBindingNames()); @@ -178,6 +178,10 @@ public int compare(BindingSet o1, BindingSet o2) { } private boolean sortedEquals(List o1bindingNamesOrdered, List o2bindingNamesOrdered) { + if (o1bindingNamesOrdered == o2bindingNamesOrdered) { + return true; + } + if (o1bindingNamesOrdered.size() != o2bindingNamesOrdered.size()) { return false; } diff --git a/core/queryalgebra/evaluation/src/test/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSetTest.java b/core/queryalgebra/evaluation/src/test/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSetTest.java index 0e7cf0ffe36..beb8f8423af 
100644 --- a/core/queryalgebra/evaluation/src/test/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSetTest.java +++ b/core/queryalgebra/evaluation/src/test/java/org/eclipse/rdf4j/query/algebra/evaluation/ArrayBindingSetTest.java @@ -8,15 +8,18 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.query.algebra.evaluation; import static org.assertj.core.api.Assertions.fail; import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertFalse; import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertSame; import static org.junit.jupiter.api.Assertions.assertTrue; import java.util.Iterator; +import java.util.List; import java.util.NoSuchElementException; import org.eclipse.rdf4j.model.ValueFactory; @@ -24,6 +27,8 @@ import org.eclipse.rdf4j.model.vocabulary.RDF; import org.eclipse.rdf4j.query.Binding; import org.eclipse.rdf4j.query.BindingSet; +import org.eclipse.rdf4j.query.algebra.evaluation.impl.ArrayBindingBasedQueryEvaluationContext; +import org.eclipse.rdf4j.query.algebra.evaluation.impl.QueryEvaluationContext; import org.eclipse.rdf4j.query.impl.MapBindingSet; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Test; @@ -161,4 +166,47 @@ public void testThreeWithTwoElementsSetIterator() { assertNotNull(e); } } + + @Test + public void testSortedBindingNamesShouldBeSharedWithinEvaluationContext() { + ArrayBindingBasedQueryEvaluationContext context = new ArrayBindingBasedQueryEvaluationContext( + new QueryEvaluationContext.Minimal(vf.createLiteral("now"), null, null), + new String[] { "z", "a", "m" }, null); + + ArrayBindingSet left = context.createBindingSet(); + ArrayBindingSet right = context.createBindingSet(); + + left.setBinding("z", RDF.ALT); + left.setBinding("a", RDF.BAG); + 
right.setBinding("z", RDF.FIRST); + right.setBinding("a", RDF.NIL); + + List leftSortedBindingNames = left.getSortedBindingNames(); + List rightSortedBindingNames = right.getSortedBindingNames(); + + assertEquals(List.of("a", "z"), leftSortedBindingNames); + assertSame(leftSortedBindingNames, rightSortedBindingNames); + } + + @Test + public void testSortedBindingNamesShouldBeSharedForSmallBindingSetInLargeVariableArray() { + ArrayBindingBasedQueryEvaluationContext context = new ArrayBindingBasedQueryEvaluationContext( + new QueryEvaluationContext.Minimal(vf.createLiteral("now"), null, null), + new String[] { "z13", "z12", "z11", "z10", "z9", "z8", "z7", "z6", "z5", "z4", "z3", "z2", "a1" }, + null); + + ArrayBindingSet left = context.createBindingSet(); + ArrayBindingSet right = context.createBindingSet(); + + left.setBinding("z13", RDF.ALT); + left.setBinding("a1", RDF.BAG); + right.setBinding("z13", RDF.FIRST); + right.setBinding("a1", RDF.NIL); + + List leftSortedBindingNames = left.getSortedBindingNames(); + List rightSortedBindingNames = right.getSortedBindingNames(); + + assertEquals(List.of("a1", "z13"), leftSortedBindingNames); + assertSame(leftSortedBindingNames, rightSortedBindingNames); + } } diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java index 403ef1c9dda..0224d45fcf1 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStore.java @@ -59,7 +59,9 @@ import java.util.Arrays; import java.util.Collection; import java.util.Collections; +import java.util.HashMap; import java.util.HashSet; +import java.util.Map; import java.util.Optional; import java.util.Set; import java.util.WeakHashMap; @@ -200,6 +202,8 @@ class ValueStore extends AbstractValueFactory { */ private long nextId = 1; private boolean freeIdsAvailable; + private ValueStoreHashFile 
hashFile; + private final Map pendingHashUpdates = new HashMap<>(); private volatile long nextValueEvictionTime = 0; @@ -220,6 +224,7 @@ class ValueStore extends AbstractValueFactory { private Object[] previousNamespaceEntry; private final long valueEvictionInterval; + private final boolean valueHashCacheEnabled; ValueStore(File dir, LmdbStoreConfig config) throws IOException { this.dir = dir; @@ -228,6 +233,7 @@ class ValueStore extends AbstractValueFactory { this.autoGrow = config.getAutoGrow(); this.mapSize = config.getValueDBSize(); this.valueEvictionInterval = config.getValueEvictionInterval(); + this.valueHashCacheEnabled = config.getValueHashCacheEnabled(); open(); int cacheSize = nextPowerOfTwo(config.getValueCacheSize()); @@ -278,6 +284,18 @@ class ValueStore extends AbstractValueFactory { commit(); } + private void openHashFileQuietly() { + hashFile = null; + if (!valueHashCacheEnabled) { + return; + } + try { + hashFile = new ValueStoreHashFile(dir); + } catch (IOException e) { + logger.warn("Could not initialize LMDB hash cache", e); + } + } + private void logValues() throws IOException { readTransaction(env, (stack, txn) -> { long cursor = 0; @@ -313,6 +331,7 @@ private void logValues() throws IOException { private void open() throws IOException { // create directory if it not exists dir.mkdirs(); + openHashFileQuietly(); try (MemoryStack stack = stackPush()) { PointerBuffer pp = stack.mallocPointer(1); @@ -404,8 +423,10 @@ private long nextId(byte type) throws IOException { MDBVal keyData = MDBVal.calloc(stack); MDBVal valueData = MDBVal.calloc(stack); if (mdb_cursor_get(cursor, keyData, valueData, MDB_FIRST) == MDB_SUCCESS) { + long freedId = data2id(keyData.mv_data()); + clearStoredHash(freedId); // remove lower 2 type bits - long value = data2id(keyData.mv_data()) >> 2; + long value = freedId >> 2; // delete entry E(mdb_cursor_del(cursor, 0)); return value; @@ -461,6 +482,51 @@ ValueStoreRevision getRevision() { return revision; } + int 
getStoredHash(long id) { + Integer pendingHash; + synchronized (pendingHashUpdates) { + pendingHash = pendingHashUpdates.get(id); + } + if (pendingHash != null) { + return pendingHash; + } + if (hashFile == null) { + return 0; + } + try { + return hashFile.get(id); + } catch (IOException e) { + resetHashFileQuietly("read", e); + return 0; + } + } + + void storeHash(long id, int hash) { + if (id == LmdbValue.UNKNOWN_ID) { + return; + } + if (writeTxn != 0) { + synchronized (pendingHashUpdates) { + pendingHashUpdates.put(id, hash); + } + return; + } + writeHashNow(id, hash); + } + + void clearStoredHash(long id) { + if (id == LmdbValue.UNKNOWN_ID) { + return; + } + if (writeTxn != 0) { + synchronized (pendingHashUpdates) { + pendingHashUpdates.put(id, 0); + } + return; + } + writeHashNow(id, 0); + } + protected byte[] getData(long id) throws IOException { return readTransaction(env, (stack, txn) -> { MDBVal keyData = MDBVal.calloc(stack); @@ -963,6 +1030,7 @@ public long getId(Value value, boolean create) throws IOException { valueIDCache.put(nv, id); } + storeHashIfAbsent(id, value); } return id; @@ -1174,6 +1242,7 @@ protected void freeUnusedIdsAndValues(MemoryStack stack, long txn, Set rev E(mdb_put(txn, freeDbi, idVal, emptyVal, 0)); // delete id -> value association E(mdb_del(txn, dbi, idVal, null)); + clearStoredHash(data2id(idVal.mv_data().duplicate())); // delete id and value from unused list E(mdb_cursor_del(unusedIdsCursor, 0)); @@ -1228,6 +1297,7 @@ void endTransaction(boolean commit) throws IOException { long stamp = revisionLock.writeLock(); try { E(mdb_txn_commit(writeTxn)); + flushPendingHashUpdates(); long revisionId = lazyRevision.getRevisionId(); cleaner.register(lazyRevision, () -> { synchronized (unusedRevisionIds) { @@ -1244,9 
+1314,11 @@ void endTransaction(boolean commit) throws IOException { } } else { E(mdb_txn_commit(writeTxn)); + flushPendingHashUpdates(); } } else { mdb_txn_abort(writeTxn); + clearPendingHashUpdates(); } writeTxn = 0; invalidateRevisionOnCommit = false; @@ -1298,6 +1370,7 @@ public void clear() throws IOException { new File(dir, "data.mdb").delete(); new File(dir, "lock.mdb").delete(); + ValueStoreHashFile.deleteIfPresent(dir); clearCaches(); open(); @@ -1335,11 +1408,81 @@ private void closeReadTransactions() { public void close() throws IOException { if (env != 0) { + if (writeTxn == 0) { + flushPendingHashUpdates(); + } closeReadTransactions(); endTransaction(false); mdb_env_close(env); env = 0; } + if (hashFile != null) { + try { + hashFile.close(); + } catch (IOException e) { + logger.warn("Could not close LMDB hash cache", e); + } + hashFile = null; + } + } + + private void storeHashIfAbsent(long id, Value value) { + if (getStoredHash(id) == 0) { + storeHash(id, value.hashCode()); + } + } + + private void writeHashNow(long id, int hash) { + if (hashFile == null) { + return; + } + try { + hashFile.put(id, hash); + } catch (IOException e) { + resetHashFileQuietly("write", e); + } + } + + private void flushPendingHashUpdates() { + Map updates; + synchronized (pendingHashUpdates) { + if (pendingHashUpdates.isEmpty()) { + return; + } + updates = new HashMap<>(pendingHashUpdates); + pendingHashUpdates.clear(); + } + + for (Map.Entry entry : updates.entrySet()) { + writeHashNow(entry.getKey(), entry.getValue()); + } + } + + private void clearPendingHashUpdates() { + synchronized (pendingHashUpdates) { + pendingHashUpdates.clear(); + } + } + + private void resetHashFileQuietly(String operation, IOException cause) { + logger.warn("Resetting LMDB hash cache after {} failure", operation, cause); + clearPendingHashUpdates(); + if (!valueHashCacheEnabled) { + hashFile = null; + return; + } + ValueStoreHashFile currentHashFile = hashFile; + hashFile = null; + try { + if 
(currentHashFile != null) { + currentHashFile.delete(); + } else { + ValueStoreHashFile.deleteIfPresent(dir); + } + } catch (IOException deleteFailure) { + logger.warn("Could not delete LMDB hash cache", deleteFailure); + } + openHashFileQuietly(); } private static final class ReadTxn { diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashFile.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashFile.java new file mode 100644 index 00000000000..75bdd28e5cb --- /dev/null +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashFile.java @@ -0,0 +1,323 @@ +/******************************************************************************* + * Copyright (c) 2026 Eclipse RDF4J contributors. + * + * All rights reserved. This program and the accompanying materials + * are made available under the terms of the Eclipse Distribution License v1.0 + * which accompanies this distribution, and is available at + * http://www.eclipse.org/org/documents/edl-v10.php. 
+ * + * SPDX-License-Identifier: BSD-3-Clause + *******************************************************************************/ +// Some portions generated by Codex +package org.eclipse.rdf4j.sail.lmdb; + +import java.io.File; +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.nio.MappedByteBuffer; +import java.nio.channels.FileChannel; +import java.nio.channels.FileChannel.MapMode; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; +import java.nio.file.StandardOpenOption; +import java.util.ArrayList; +import java.util.List; +import java.util.Properties; +import java.util.zip.CRC32C; + +final class ValueStoreHashFile implements AutoCloseable { + + static final String FILE_NAME = "hashes.dat"; + static final String INTEGRITY_FILE_NAME = FILE_NAME + ".integrity"; + private static final long SEGMENT_SIZE = 256L * 1024 * 1024; + private static final String INTEGRITY_VERSION = "1"; + private static final String VERSION_KEY = "version"; + private static final String SIZE_KEY = "size"; + private static final String CRC32C_KEY = "crc32c"; + private static final byte[] ZERO_CHUNK = new byte[8192]; + + private final File file; + private final File integrityFile; + private final List segments = new ArrayList<>(); + private FileChannel channel; + private long mappedSize; + private boolean discardExistingContents; + + ValueStoreHashFile(File dir) throws IOException { + file = new File(dir, FILE_NAME); + integrityFile = new File(dir, INTEGRITY_FILE_NAME); + prepareForOpen(); + open(); + } + + synchronized int get(long id) throws IOException { + if (id < 0) { + return 0; + } + + long byteOffset = byteOffset(id); + if (byteOffset + Integer.BYTES > mappedSize) { + return 0; + } + + return buffer(byteOffset).getInt(offset(byteOffset)); + } + + synchronized void put(long id, int hash) throws IOException { + if (id < 0) { + return; + } + + long byteOffset = byteOffset(id); + 
ensureCapacity(byteOffset + Integer.BYTES); + buffer(byteOffset).putInt(offset(byteOffset), hash); + } + + synchronized void clear(long id) throws IOException { + if (id < 0) { + return; + } + + long byteOffset = byteOffset(id); + if (byteOffset + Integer.BYTES > mappedSize) { + return; + } + + buffer(byteOffset).putInt(offset(byteOffset), 0); + } + + synchronized void force() { + for (MappedByteBuffer segment : segments) { + segment.force(); + } + } + + synchronized void delete() throws IOException { + close(false); + deleteCacheFilesQuietly(file.toPath(), integrityFile.toPath()); + } + + @Override + public synchronized void close() throws IOException { + close(true); + } + + private void open() throws IOException { + channel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.READ, + StandardOpenOption.WRITE); + if (discardExistingContents) { + try { + channel.truncate(0); + discardExistingContents = false; + } catch (IOException ignored) { + // The channel may still have platform-specific file locks via prior mappings. + // Fall back to ignoring old bytes and zero-filling new mappings on demand. 
+ } + return; + } + mapSegments(channel.size()); + } + + private void ensureCapacity(long requiredSize) throws IOException { + if (requiredSize <= mappedSize) { + return; + } + + long newSize = roundUp(requiredSize); + channel.write(java.nio.ByteBuffer.wrap(new byte[] { 0 }), newSize - 1); + mapSegments(newSize); + } + + private void mapSegments(long targetSize) throws IOException { + while (mappedSize < targetSize) { + long remaining = targetSize - mappedSize; + long size = Math.min(SEGMENT_SIZE, remaining); + MappedByteBuffer segment = channel.map(MapMode.READ_WRITE, mappedSize, size); + if (discardExistingContents) { + zero(segment); + } + segments.add(segment); + mappedSize += size; + } + } + + private MappedByteBuffer buffer(long byteOffset) { + return segments.get((int) (byteOffset / SEGMENT_SIZE)); + } + + private int offset(long byteOffset) { + return (int) (byteOffset % SEGMENT_SIZE); + } + + private long byteOffset(long id) { + return Math.multiplyExact(id, (long) Integer.BYTES); + } + + private long roundUp(long requiredSize) { + long remainder = requiredSize % SEGMENT_SIZE; + if (remainder == 0) { + return requiredSize; + } + return requiredSize + (SEGMENT_SIZE - remainder); + } + + static void deleteIfPresent(File dir) throws IOException { + deleteCacheFiles(new File(dir, FILE_NAME).toPath(), new File(dir, INTEGRITY_FILE_NAME).toPath()); + } + + private void prepareForOpen() throws IOException { + Path filePath = file.toPath(); + Path integrityPath = integrityFile.toPath(); + if (!Files.exists(filePath)) { + Files.deleteIfExists(integrityPath); + return; + } + if (!Files.exists(integrityPath)) { + discardExistingContents = true; + return; + } + if (!hasValidIntegrity(filePath, integrityPath)) { + discardExistingContents = true; + } + Files.deleteIfExists(integrityPath); + } + + private boolean hasValidIntegrity(Path filePath, Path integrityPath) throws IOException { + Properties properties = new Properties(); + try (InputStream inputStream = 
Files.newInputStream(integrityPath)) { + properties.load(inputStream); + } + + if (!INTEGRITY_VERSION.equals(properties.getProperty(VERSION_KEY))) { + return false; + } + + try { + long expectedSize = Long.parseLong(properties.getProperty(SIZE_KEY)); + long expectedCrc32c = Long.parseLong(properties.getProperty(CRC32C_KEY)); + return expectedSize == Files.size(filePath) && expectedCrc32c == computeCrc32c(filePath); + } catch (NumberFormatException e) { + return false; + } + } + + private void close(boolean writeIntegrity) throws IOException { + IOException failure = null; + if (!segments.isEmpty()) { + try { + force(); + } catch (RuntimeException e) { + failure = new IOException("Could not force hash cache file " + file, e); + } + } + if (channel != null) { + try { + channel.close(); + } catch (IOException e) { + failure = append(failure, e); + } finally { + channel = null; + } + } + segments.clear(); + mappedSize = 0L; + + Path filePath = file.toPath(); + Path integrityPath = integrityFile.toPath(); + if (writeIntegrity && failure == null) { + if (discardExistingContents) { + deleteCacheFilesQuietly(filePath, integrityPath); + } else { + try { + writeIntegrityFile(filePath, integrityPath); + } catch (IOException e) { + deleteCacheFilesQuietly(filePath, integrityPath); + failure = append(failure, e); + } + } + } else if (!writeIntegrity) { + Files.deleteIfExists(integrityPath); + } else if (failure != null) { + deleteCacheFilesQuietly(filePath, integrityPath); + } + + if (failure != null) { + throw failure; + } + } + + private void writeIntegrityFile(Path filePath, Path integrityPath) throws IOException { + if (!Files.exists(filePath)) { + Files.deleteIfExists(integrityPath); + return; + } + + String content = VERSION_KEY + "=" + INTEGRITY_VERSION + "\n" + + SIZE_KEY + "=" + Files.size(filePath) + "\n" + + CRC32C_KEY + "=" + computeCrc32c(filePath) + "\n"; + ByteBuffer encoded = StandardCharsets.UTF_8.encode(content); + try (FileChannel integrityChannel = 
FileChannel.open(integrityPath, StandardOpenOption.CREATE, + StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) { + while (encoded.hasRemaining()) { + integrityChannel.write(encoded); + } + integrityChannel.force(true); + } + } + + private long computeCrc32c(Path path) throws IOException { + CRC32C crc32c = new CRC32C(); + byte[] buffer = new byte[8192]; + try (InputStream inputStream = Files.newInputStream(path)) { + int read; + while ((read = inputStream.read(buffer)) != -1) { + crc32c.update(buffer, 0, read); + } + } + return crc32c.getValue(); + } + + private static void deleteCacheFiles(Path filePath, Path integrityPath) throws IOException { + IOException failure = null; + try { + Files.deleteIfExists(integrityPath); + } catch (IOException e) { + failure = append(failure, e); + } + try { + Files.deleteIfExists(filePath); + } catch (IOException e) { + failure = append(failure, e); + } + if (failure != null) { + throw failure; + } + } + + private static void deleteCacheFilesQuietly(Path filePath, Path integrityPath) { + try { + deleteCacheFiles(filePath, integrityPath); + } catch (IOException ignored) { + // best effort only + } + } + + private static IOException append(IOException failure, IOException next) { + if (failure == null) { + return next; + } + failure.addSuppressed(next); + return failure; + } + + private static void zero(MappedByteBuffer segment) { + segment.position(0); + while (segment.hasRemaining()) { + int length = Math.min(segment.remaining(), ZERO_CHUNK.length); + segment.put(ZERO_CHUNK, 0, length); + } + segment.position(0); + } +} diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreRevision.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreRevision.java index 05928026933..85342c6154f 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreRevision.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreRevision.java @@ -8,6 +8,7 
@@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.sail.lmdb; import java.io.Serializable; @@ -62,7 +63,7 @@ public ValueStore getValueStore() { } public boolean resolveValue(long id, LmdbValue value) { - return valueStore.resolveValue(id, value); + return valueStore != null && valueStore.resolveValue(id, value); } } @@ -91,7 +92,7 @@ public ValueStore getValueStore() { @Override public boolean resolveValue(long id, LmdbValue value) { - if (valueStore.resolveValue(id, value)) { + if (valueStore != null && valueStore.resolveValue(id, value)) { // set unwrapped version of revision value.setInternalID(id, revision); return true; @@ -105,4 +106,19 @@ public boolean resolveValue(long id, LmdbValue value) { ValueStore getValueStore(); boolean resolveValue(long id, LmdbValue value); + + default int getStoredHash(long id) { + ValueStore valueStore = getValueStore(); + if (valueStore == null || valueStore.getRevision().getRevisionId() != getRevisionId()) { + return 0; + } + return valueStore.getStoredHash(id); + } + + default void storeHash(long id, int hash) { + ValueStore valueStore = getValueStore(); + if (valueStore != null && valueStore.getRevision().getRevisionId() == getRevisionId()) { + valueStore.storeHash(id, hash); + } + } } diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfig.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfig.java index d7af70f2f46..5d3b857f51d 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfig.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfig.java @@ -80,6 +80,8 @@ public class LmdbStoreConfig extends BaseSailConfig { private long valueEvictionInterval = Duration.ofSeconds(60).toMillis(); + private boolean valueHashCacheEnabled = false; + 
/*--------------* * Constructors * *--------------*/ @@ -213,6 +215,15 @@ public LmdbStoreConfig setPageCardinalityEstimator(boolean pageCardinalityEstima return this; } + public boolean getValueHashCacheEnabled() { + return valueHashCacheEnabled; + } + + public LmdbStoreConfig setValueHashCacheEnabled(boolean valueHashCacheEnabled) { + this.valueHashCacheEnabled = valueHashCacheEnabled; + return this; + } + @Override public Resource export(Model m) { Resource implNode = super.export(m); @@ -255,6 +266,9 @@ public Resource export(Model m) { if (valueEvictionInterval != Duration.ofSeconds(60).toMillis()) { m.add(implNode, LmdbStoreSchema.VALUE_EVICTION_INTERVAL, vf.createLiteral(valueEvictionInterval)); } + if (valueHashCacheEnabled) { + m.add(implNode, LmdbStoreSchema.VALUE_HASH_CACHE_ENABLED, vf.createLiteral(true)); + } return implNode; } @@ -379,6 +393,17 @@ public void parse(Model m, Resource implNode) throws SailConfigException { + " property, found " + lit); } }); + + Models.objectLiteral(m.getStatements(implNode, LmdbStoreSchema.VALUE_HASH_CACHE_ENABLED, null)) + .ifPresent(lit -> { + try { + setValueHashCacheEnabled(lit.booleanValue()); + } catch (IllegalArgumentException e) { + throw new SailConfigException( + "Boolean value required for " + LmdbStoreSchema.VALUE_HASH_CACHE_ENABLED + + " property, found " + lit); + } + }); } catch (ModelException e) { throw new SailConfigException(e.getMessage(), e); } diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreSchema.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreSchema.java index 75a64db9ead..5bb185c4766 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreSchema.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreSchema.java @@ -87,6 +87,11 @@ public class LmdbStoreSchema { */ public final static IRI VALUE_EVICTION_INTERVAL; + /** + * 
http://rdf4j.org/config/sail/lmdb#valueHashCacheEnabled + */ + public final static IRI VALUE_HASH_CACHE_ENABLED; + static { ValueFactory factory = SimpleValueFactory.getInstance(); TRIPLE_INDEXES = factory.createIRI(NAMESPACE, "tripleIndexes"); @@ -101,5 +106,6 @@ public class LmdbStoreSchema { AUTO_GROW = factory.createIRI(NAMESPACE, "autoGrow"); PAGE_CARDINALITY_ESTIMATOR = factory.createIRI(NAMESPACE, "pageCardinalityEstimator"); VALUE_EVICTION_INTERVAL = factory.createIRI(NAMESPACE, "valueEvictionInterval"); + VALUE_HASH_CACHE_ENABLED = factory.createIRI(NAMESPACE, "valueHashCacheEnabled"); } } diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbBNode.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbBNode.java index d30195c089b..30416423f68 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbBNode.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbBNode.java @@ -8,6 +8,7 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.sail.lmdb.model; import java.io.ObjectStreamException; @@ -123,6 +124,22 @@ public boolean equals(Object o) { return super.equals(o); } + @Override + public int hashCode() { + if (internalID != UNKNOWN_ID) { + int cachedHash = revision.getStoredHash(internalID); + if (cachedHash != 0) { + return cachedHash; + } + } + + int hash = super.hashCode(); + if (internalID != UNKNOWN_ID) { + revision.storeHash(internalID, hash); + } + return hash; + } + protected Object writeReplace() throws ObjectStreamException { init(); return this; diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbIRI.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbIRI.java index e9eabab7274..3efbb471ce7 100644 --- 
a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbIRI.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbIRI.java @@ -8,6 +8,7 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.sail.lmdb.model; import java.io.ObjectStreamException; @@ -214,12 +215,25 @@ public boolean equals(Object o) { @Override public int hashCode() { + if (internalID != UNKNOWN_ID) { + int cachedHash = revision.getStoredHash(internalID); + if (cachedHash != 0) { + return cachedHash; + } + } + + int hash; if (this.iriString != null) { - return this.iriString.hashCode(); + hash = this.iriString.hashCode(); + } else { + init(); + hash = iriString.hashCode(); } - init(); - return iriString.hashCode(); + if (internalID != UNKNOWN_ID) { + revision.storeHash(internalID, hash); + } + return hash; } protected Object writeReplace() throws ObjectStreamException { diff --git a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbLiteral.java b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbLiteral.java index 2ecd1c41651..800c3bf88f5 100644 --- a/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbLiteral.java +++ b/core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/model/LmdbLiteral.java @@ -8,6 +8,7 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.sail.lmdb.model; import java.io.ObjectStreamException; @@ -245,8 +246,19 @@ public boolean equals(Object o) { @Override public int hashCode() { + if (internalID != UNKNOWN_ID) { + int cachedHash = revision.getStoredHash(internalID); + if (cachedHash != 0) { + return cachedHash; + } + } + init(); - return super.hashCode(); + int hash = super.hashCode(); + if (internalID != UNKNOWN_ID) { + 
revision.storeHash(internalID, hash); + } + return hash; } @Override diff --git a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashCacheTest.java b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashCacheTest.java new file mode 100644 index 00000000000..62029e075cb --- /dev/null +++ b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreHashCacheTest.java @@ -0,0 +1,208 @@ +/******************************************************************************* + * Copyright (c) 2026 Eclipse RDF4J contributors. + * + * All rights reserved. This program and the accompanying materials + * are made available under the terms of the Eclipse Distribution License v1.0 + * which accompanies this distribution, and is available at + * http://www.eclipse.org/org/documents/edl-v10.php. + * + * SPDX-License-Identifier: BSD-3-Clause + *******************************************************************************/ +// Some portions generated by Codex +package org.eclipse.rdf4j.sail.lmdb; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import java.io.File; +import java.io.IOException; +import java.io.RandomAccessFile; +import java.lang.reflect.Field; +import java.nio.MappedByteBuffer; +import java.nio.file.Files; +import java.util.List; +import java.util.function.Consumer; + +import org.eclipse.rdf4j.model.IRI; +import org.eclipse.rdf4j.model.util.Values; +import org.eclipse.rdf4j.sail.lmdb.config.LmdbStoreConfig; +import org.eclipse.rdf4j.sail.lmdb.model.LmdbIRI; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +class ValueStoreHashCacheTest { + + private ValueStore valueStore; + + @Test + void defaultConfigShouldNotPersistHashCache(@TempDir File dataDir) throws Exception { + valueStore = createValueStore(dataDir, new LmdbStoreConfig()); + + IRI iri = 
Values.iri("urn:hash:default-disabled"); + int expectedHash = iri.hashCode(); + long id = storeValue(iri); + + assertFalse(hashFile(dataDir).exists()); + + valueStore.close(); + valueStore = createValueStore(dataDir, new LmdbStoreConfig()); + + LmdbIRI lazyValue = (LmdbIRI) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertTrue("hashCode should initialize lazy IRIs when the hash cache is disabled", isInitialized(lazyValue)); + assertFalse(hashFile(dataDir).exists()); + assertFalse(integrityFile(dataDir).exists()); + } + + @Test + void enabledHashCacheShouldWriteIntegrityFileAndClearItOnOpen(@TempDir File dataDir) throws Exception { + LmdbStoreConfig config = hashCacheEnabledConfig(); + valueStore = createValueStore(dataDir, config); + + IRI iri = Values.iri("urn:hash:integrity-clean"); + int expectedHash = iri.hashCode(); + long id = storeValue(iri); + + valueStore.close(); + valueStore = null; + + assertTrue(hashFile(dataDir).exists()); + assertTrue("clean shutdown should create hash cache integrity metadata", integrityFile(dataDir).exists()); + + valueStore = createValueStore(dataDir, config); + + assertFalse("startup should remove integrity metadata until the next clean shutdown", + integrityFile(dataDir).exists()); + + LmdbIRI lazyValue = (LmdbIRI) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertFalse("validated cache should keep lazy IRIs unresolved", isInitialized(lazyValue)); + } + + @Test + void enabledHashCacheShouldDropCacheWhenIntegrityFileIsMissing(@TempDir File dataDir) throws Exception { + assertCorruptedCacheFallsBackToRecomputedHash(dataDir, file -> assertTrue(file.delete()), false); + } + + @Test + void enabledHashCacheShouldDropCacheWhenHashFileIsCorrupted(@TempDir File dataDir) throws Exception { + assertCorruptedCacheFallsBackToRecomputedHash(dataDir, file -> { + try (RandomAccessFile 
randomAccessFile = new RandomAccessFile(file, "rw")) { + randomAccessFile.seek(0); + randomAccessFile.writeInt(0x10203040); + } catch (IOException e) { + throw new RuntimeException(e); + } + }, true); + } + + @Test + void enabledHashCacheShouldDropCacheWhenIntegrityMetadataIsCorrupted(@TempDir File dataDir) throws Exception { + assertCorruptedCacheFallsBackToRecomputedHash(dataDir, file -> { + try { + Files.writeString(file.toPath(), "broken"); + } catch (IOException e) { + throw new RuntimeException(e); + } + }, false); + } + + @Test + void deleteShouldCleanMappedSegments(@TempDir File dataDir) throws Exception { + File valuesDir = new File(dataDir, "values"); + assertTrue(valuesDir.mkdirs()); + + ValueStoreHashFile hashFile = new ValueStoreHashFile(valuesDir); + hashFile.put(1L, 1234); + + MappedByteBuffer segment = firstSegment(hashFile); + assertTrue(hashFile(dataDir).exists()); + + hashFile.delete(); + + assertEquals(1234, segment.getInt(Integer.BYTES)); + assertTrue("delete should clear tracked mapped segments", segments(hashFile).isEmpty()); + assertFalse("delete should invalidate integrity metadata immediately", integrityFile(dataDir).exists()); + + ValueStoreHashFile reopened = new ValueStoreHashFile(valuesDir); + try { + assertEquals("reopened cache should ignore stale mapped contents after delete", 0, reopened.get(1L)); + } finally { + reopened.close(); + } + } + + private void assertCorruptedCacheFallsBackToRecomputedHash(File dataDir, Consumer<File> corruptor, + boolean corruptHashFile) throws Exception { + LmdbStoreConfig config = hashCacheEnabledConfig(); + valueStore = createValueStore(dataDir, config); + + IRI iri = Values.iri("urn:hash:corrupt"); + int expectedHash = iri.hashCode(); + long id = storeValue(iri); + + valueStore.close(); + valueStore = null; + + corruptor.accept(corruptHashFile ? 
hashFile(dataDir) : integrityFile(dataDir)); + + valueStore = createValueStore(dataDir, config); + + LmdbIRI lazyValue = (LmdbIRI) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertTrue("invalid cache metadata should force lazy IRIs to recompute their hash", isInitialized(lazyValue)); + } + + private ValueStore createValueStore(File dataDir, LmdbStoreConfig config) throws IOException { + return new ValueStore(new File(dataDir, "values"), config); + } + + private long storeValue(IRI iri) throws Exception { + valueStore.startTransaction(true); + long id = valueStore.storeValue(iri); + valueStore.commit(); + return id; + } + + private LmdbStoreConfig hashCacheEnabledConfig() { + return new LmdbStoreConfig().setValueHashCacheEnabled(true); + } + + private File hashFile(File dataDir) { + return new File(new File(dataDir, "values"), ValueStoreHashFile.FILE_NAME); + } + + private File integrityFile(File dataDir) { + return new File(new File(dataDir, "values"), ValueStoreHashFile.INTEGRITY_FILE_NAME); + } + + private boolean isInitialized(Object value) throws Exception { + Field initializedField = value.getClass().getDeclaredField("initialized"); + initializedField.setAccessible(true); + return initializedField.getBoolean(value); + } + + private MappedByteBuffer firstSegment(ValueStoreHashFile hashFile) throws Exception { + return segments(hashFile).get(0); + } + + @SuppressWarnings("unchecked") + private List<MappedByteBuffer> segments(ValueStoreHashFile hashFile) throws Exception { + Field segmentsField = ValueStoreHashFile.class.getDeclaredField("segments"); + segmentsField.setAccessible(true); + return (List<MappedByteBuffer>) segmentsField.get(hashFile); + } + + @AfterEach + void after() throws Exception { + if (valueStore != null) { + valueStore.close(); + } + } +} diff --git a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreTest.java 
b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreTest.java index cdfef66f530..513dfe7f098 100644 --- a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreTest.java +++ b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/ValueStoreTest.java @@ -8,9 +8,11 @@ * * SPDX-License-Identifier: BSD-3-Clause *******************************************************************************/ +// Some portions generated by Codex package org.eclipse.rdf4j.sail.lmdb; import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertNotEquals; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertNull; @@ -18,6 +20,7 @@ import java.io.File; import java.io.IOException; +import java.lang.reflect.Field; import java.util.Arrays; import java.util.Collections; import java.util.HashSet; @@ -35,6 +38,8 @@ import org.eclipse.rdf4j.model.vocabulary.RDFS; import org.eclipse.rdf4j.model.vocabulary.XSD; import org.eclipse.rdf4j.sail.lmdb.config.LmdbStoreConfig; +import org.eclipse.rdf4j.sail.lmdb.model.LmdbBNode; +import org.eclipse.rdf4j.sail.lmdb.model.LmdbIRI; import org.eclipse.rdf4j.sail.lmdb.model.LmdbLiteral; import org.eclipse.rdf4j.sail.lmdb.model.LmdbValue; import org.junit.Assert; @@ -58,7 +63,15 @@ public void before(@TempDir File dataDir) throws Exception { } private ValueStore createValueStore() throws IOException { - return new ValueStore(new File(dataDir, "values"), new LmdbStoreConfig()); + return createValueStore(new LmdbStoreConfig()); + } + + private ValueStore createValueStore(LmdbStoreConfig config) throws IOException { + return new ValueStore(new File(dataDir, "values"), config); + } + + private LmdbStoreConfig hashCacheEnabledConfig() { + return new LmdbStoreConfig().setValueHashCacheEnabled(true); } @Test @@ -278,6 +291,149 @@ public void testDisableGc() throws Exception { assertNotEquals("IDs should NOT have been reused since GC is 
disabled", Collections.emptySet(), ids); } + @Test + public void testLazyIriHashCodeDoesNotInitializeAfterRestart() throws Exception { + IRI iri = Values.iri("urn:hash:iri"); + int expectedHash = iri.hashCode(); + long id = storeValueAndReopen(iri, hashCacheEnabledConfig()); + + LmdbIRI lazyValue = (LmdbIRI) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertFalse("hashCode should not initialize lazy IRIs after restart", isInitialized(lazyValue)); + } + + @Test + public void testLazyLiteralHashCodeDoesNotInitializeAfterRestart() throws Exception { + Literal literal = Values.literal("literal-hash"); + int expectedHash = literal.hashCode(); + long id = storeValueAndReopen(literal, hashCacheEnabledConfig()); + + LmdbLiteral lazyValue = (LmdbLiteral) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertFalse("hashCode should not initialize lazy literals after restart", isInitialized(lazyValue)); + } + + @Test + public void testLazyBNodeHashCodeDoesNotInitializeAfterRestart() throws Exception { + valueStore.close(); + valueStore = createValueStore(hashCacheEnabledConfig()); + + LmdbBNode bNode = valueStore.createBNode("hash-bnode"); + int expectedHash = bNode.hashCode(); + long id = storeValueAndReopen(bNode, hashCacheEnabledConfig()); + + LmdbBNode lazyValue = (LmdbBNode) valueStore.getLazyValue(id); + assertFalse(isInitialized(lazyValue)); + assertEquals(expectedHash, lazyValue.hashCode()); + assertFalse("hashCode should not initialize lazy bnodes after restart", isInitialized(lazyValue)); + } + + @Test + public void testRecycledIdsClearCachedHash() throws Exception { + valueStore.close(); + valueStore = createValueStore(hashCacheEnabledConfig()); + + LmdbBNode first = valueStore.createBNode("hash-first"); + int firstHash = first.hashCode(); + long firstId; + valueStore.startTransaction(true); + firstId = 
valueStore.storeValue(first); + valueStore.commit(); + valueStore.getLazyValue(firstId).hashCode(); + + valueStore.startTransaction(true); + valueStore.gcIds(Collections.singleton(firstId), new HashSet<>()); + valueStore.commit(); + + reopenValueStore(hashCacheEnabledConfig()); + + LmdbBNode second = valueStore.createBNode("hash-second"); + int secondHash = second.hashCode(); + long secondId; + valueStore.startTransaction(true); + secondId = valueStore.storeValue(second); + valueStore.commit(); + + assertEquals("ID should have been recycled", firstId, secondId); + assertNotEquals("test values must not share the same hash", firstHash, secondHash); + + reopenValueStore(hashCacheEnabledConfig()); + + LmdbBNode lazyValue = (LmdbBNode) valueStore.getLazyValue(secondId); + assertFalse(isInitialized(lazyValue)); + assertEquals(secondHash, lazyValue.hashCode()); + assertNotEquals("recycled IDs must not keep the previous hash", firstHash, lazyValue.hashCode()); + assertFalse("hashCode should not initialize recycled lazy bnodes after restart", isInitialized(lazyValue)); + } + + @Test + public void testStaleBNodeHashCodeIgnoresReusedIdAfterClear() throws Exception { + valueStore.close(); + valueStore = createValueStore(hashCacheEnabledConfig()); + + LmdbBNode first = valueStore.createBNode("hash-first"); + int firstHash = first.hashCode(); + long firstId; + valueStore.startTransaction(true); + firstId = valueStore.storeValue(first); + valueStore.commit(); + + ValueStoreRevision firstRevision = valueStore.getRevision(); + + valueStore.startTransaction(true); + valueStore.gcIds(Collections.singleton(firstId), new HashSet<>()); + valueStore.commit(); + + valueStore.unusedRevisionIds.add(firstRevision.getRevisionId()); + valueStore.forceEvictionOfValues(); + valueStore.startTransaction(true); + valueStore.commit(); + + LmdbBNode second = valueStore.createBNode("hash-second"); + int secondHash = second.hashCode(); + long secondId; + valueStore.startTransaction(true); + secondId = 
valueStore.storeValue(second); + valueStore.commit(); + + assertEquals("GC should make the old ID reusable", firstId, secondId); + assertNotEquals("test values must not share the same hash", firstHash, secondHash); + assertEquals("stale values must keep their original hash after a revision change", firstHash, first.hashCode()); + } + + private long storeValueAndReopen(Value value) throws Exception { + return storeValueAndReopen(value, new LmdbStoreConfig()); + } + + private long storeValueAndReopen(Value value, LmdbStoreConfig config) throws Exception { + valueStore.close(); + valueStore = createValueStore(config); + + long id; + valueStore.startTransaction(true); + id = valueStore.storeValue(value); + valueStore.commit(); + reopenValueStore(config); + return id; + } + + private void reopenValueStore() throws Exception { + reopenValueStore(new LmdbStoreConfig()); + } + + private void reopenValueStore(LmdbStoreConfig config) throws Exception { + valueStore.close(); + valueStore = createValueStore(config); + } + + private boolean isInitialized(Object value) throws Exception { + Field initializedField = value.getClass().getDeclaredField("initialized"); + initializedField.setAccessible(true); + return initializedField.getBoolean(value); + } + @AfterEach public void after() throws Exception { valueStore.close(); diff --git a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/benchmark/ThemeQueryBenchmark.java b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/benchmark/ThemeQueryBenchmark.java index e51876ff34e..11b97cd1de3 100644 --- a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/benchmark/ThemeQueryBenchmark.java +++ b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/benchmark/ThemeQueryBenchmark.java @@ -29,6 +29,7 @@ import org.eclipse.rdf4j.benchmark.rio.util.ThemeDataSetGenerator; import org.eclipse.rdf4j.benchmark.rio.util.ThemeDataSetGenerator.Theme; import org.eclipse.rdf4j.common.transaction.IsolationLevels; +import 
org.eclipse.rdf4j.query.TupleQuery; import org.eclipse.rdf4j.query.explanation.Explanation; import org.eclipse.rdf4j.queryrender.sparql.TupleExprIRRenderer; import org.eclipse.rdf4j.repository.sail.SailRepository; @@ -53,12 +54,13 @@ import org.openjdk.jmh.runner.Runner; import org.openjdk.jmh.runner.RunnerException; import org.openjdk.jmh.runner.options.OptionsBuilder; +import org.openjdk.jmh.runner.options.TimeValue; @State(Scope.Benchmark) -@Warmup(iterations = 1, batchSize = 1, timeUnit = TimeUnit.SECONDS, time = 30) +@Warmup(iterations = 10, batchSize = 1, timeUnit = TimeUnit.SECONDS, time = 1) @BenchmarkMode({ Mode.AverageTime }) @Fork(value = 1, jvmArgs = { "-Xms32G", "-Xmx32G" }) -@Measurement(iterations = 1, batchSize = 1, timeUnit = TimeUnit.SECONDS, time = 10) +@Measurement(iterations = 10, batchSize = 1, timeUnit = TimeUnit.SECONDS, time = 1) @OutputTimeUnit(TimeUnit.MILLISECONDS) public class ThemeQueryBenchmark { @@ -72,18 +74,20 @@ public class ThemeQueryBenchmark { private static final long EXPECTED_TRIPLES_DATA_SIZE_BYTES = 1500921856L; private static final long EXPECTED_VALUES_DATA_SIZE_BYTES = 713687040L; - @Param({ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" }) + @Param({ + // "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", + "11", "12" }) public int z_queryIndex; @Param({ "MEDICAL_RECORDS", - "SOCIAL_MEDIA", - "LIBRARY", - "ENGINEERING", - "HIGHLY_CONNECTED", - "TRAIN", - "ELECTRICAL_GRID", - "PHARMA" +// "SOCIAL_MEDIA", +// "LIBRARY", +// "ENGINEERING", +// "HIGHLY_CONNECTED", +// "TRAIN", +// "ELECTRICAL_GRID", +// "PHARMA" }) public String themeName; @@ -97,7 +101,11 @@ public class ThemeQueryBenchmark { public static void main(String[] args) throws RunnerException { var opt = new OptionsBuilder() .include("ThemeQueryBenchmark") - .forks(1) + .forks(0) + .measurementTime(TimeValue.milliseconds(1000)) + .measurementIterations(10) + .measurementBatchSize(1) + .warmupIterations(0) .build(); new Runner(opt).run(); } @@ -119,6 
+127,26 @@ public void setup() throws IOException { } } + @Benchmark + public long executeQuery() { + try (var connection = repository.getConnection()) { + long count; + TupleQuery tupleQuery = connection.prepareTupleQuery(query); + tupleQuery.setMaxExecutionTime(10); + try (var evaluate = tupleQuery.evaluate()) { + count = evaluate + .stream() + .count(); + } + + if (count != expected) { + throw new IllegalStateException("Unexpected count: expected " + expected + " but got " + count); + } + + return count; + } + } + private void ensureDataLoadedAndValidated() throws IOException { var expectedDbFileSizes = readExpectedDbFileSizes(); if (!hasExpectedDbFileSizes(expectedDbFileSizes)) { @@ -285,24 +313,6 @@ public void tearDown() { storeConfig = null; } - @Benchmark - public long executeQuery() { - try (var connection = repository.getConnection()) { - long count; - try (var evaluate = connection.prepareTupleQuery(query).evaluate()) { - count = evaluate - .stream() - .count(); - } - - if (count != expected) { - throw new IllegalStateException("Unexpected count: expected " + expected + " but got " + count); - } - - return count; - } - } - @Test @Disabled public void testQueryCounts() throws IOException { @@ -346,6 +356,7 @@ public void setupVerifiesExpectedDbFileSizesInFixedStore() throws IOException { } @Test + @Disabled public void executeQueryReturnsExpectedCountForPharmaQueryTenAfterFreshGeneration() throws IOException { FileUtils.deleteDirectory(STORE_DIRECTORY); themeName = "PHARMA"; diff --git a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfigTest.java b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfigTest.java index 18133dc9482..a8786ad4597 100644 --- a/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfigTest.java +++ b/core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/config/LmdbStoreConfigTest.java @@ -46,6 +46,11 @@ void noReadaheadDefaultsToDisabled() { 
assertThat(invokeBooleanGetter(new LmdbStoreConfig(), "getNoReadahead")).isFalse(); } + @Test + void valueHashCacheDefaultsToDisabled() { + assertThat(new LmdbStoreConfig().getValueHashCacheEnabled()).isFalse(); + } + @ParameterizedTest @ValueSource(booleans = { true, false }) void testThatLmdbStoreConfigParseAndExportNoReadahead(final boolean noReadahead) { @@ -82,6 +87,18 @@ void testThatLmdbStoreConfigParseAndExportValueEvictionInterval(final long value ); } + @ParameterizedTest + @ValueSource(booleans = { true, false }) + void testThatLmdbStoreConfigParseAndExportValueHashCacheEnabled(final boolean valueHashCacheEnabled) { + testParseAndExport( + LmdbStoreSchema.VALUE_HASH_CACHE_ENABLED, + Values.literal(valueHashCacheEnabled), + LmdbStoreConfig::getValueHashCacheEnabled, + valueHashCacheEnabled, + valueHashCacheEnabled + ); + } + @ParameterizedTest @ValueSource(booleans = { true, false }) void testThatLmdbStoreConfigParseAndExportAutoGrow(final boolean autoGrow) { diff --git a/core/sail/memory/src/test/java/org/eclipse/rdf4j/sail/memory/benchmark/ThemeQueryBenchmark.java b/core/sail/memory/src/test/java/org/eclipse/rdf4j/sail/memory/benchmark/ThemeQueryBenchmark.java index 734d032fdb2..85d48c6be82 100644 --- a/core/sail/memory/src/test/java/org/eclipse/rdf4j/sail/memory/benchmark/ThemeQueryBenchmark.java +++ b/core/sail/memory/src/test/java/org/eclipse/rdf4j/sail/memory/benchmark/ThemeQueryBenchmark.java @@ -62,7 +62,7 @@ public class ThemeQueryBenchmark { private static final String STORE_NAME = "memory"; - @Param({ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" }) + @Param({ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12" }) public int z_queryIndex; @Param({ diff --git a/core/sail/nativerdf/src/test/java/org/eclipse/rdf4j/sail/nativerdf/benchmark/ThemeQueryBenchmark.java b/core/sail/nativerdf/src/test/java/org/eclipse/rdf4j/sail/nativerdf/benchmark/ThemeQueryBenchmark.java index da9625981ea..e0fe77f4560 100644 --- 
a/core/sail/nativerdf/src/test/java/org/eclipse/rdf4j/sail/nativerdf/benchmark/ThemeQueryBenchmark.java +++ b/core/sail/nativerdf/src/test/java/org/eclipse/rdf4j/sail/nativerdf/benchmark/ThemeQueryBenchmark.java @@ -56,7 +56,7 @@ @OutputTimeUnit(TimeUnit.MILLISECONDS) public class ThemeQueryBenchmark { - @Param({ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10" }) + @Param({ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12" }) public int z_queryIndex; @Param({ diff --git a/pom.xml b/pom.xml index 482d839eceb..68b46991162 100644 --- a/pom.xml +++ b/pom.xml @@ -57,7 +57,7 @@ true 1.7.36 1.2.13 - 2.25.3 + 2.25.4 4.5.14 2.21.0 2.21 diff --git a/site/content/documentation/programming/lmdb-store.md b/site/content/documentation/programming/lmdb-store.md index ced2e80d236..3445fe2e342 100644 --- a/site/content/documentation/programming/lmdb-store.md +++ b/site/content/documentation/programming/lmdb-store.md @@ -107,6 +107,8 @@ config.setTripleIndexes("spoc,ospc,psoc"); config.setForceSync(true); // disable autogrow, enabled by default config.setAutoGrow(false); +// persist value hash codes across restarts, disabled by default +config.setValueHashCacheEnabled(true); // set maximum size of value db to 1 GiB config.setValueDBSize(1_073_741_824L); @@ -116,6 +118,11 @@ config.setTripleDBSize(1_073_741_824L); Repository repo = new SailRepository(new LmdbStore(dataDir), config); ``` +The optional value hash cache stores precomputed `Value.hashCode()` results in `hashes.dat`. It is disabled by default. +When enabled, LMDB writes a `hashes.dat.integrity` sidecar on clean shutdown and only trusts the cache again on the +next startup if that integrity metadata validates. Invalid or stale hash cache files are discarded automatically and +the store falls back to recomputing hashes lazily. 
+ ## Required storage space, RAM size and disk performance You can expect a footprint of around 120 - 130 bytes per quad when using the LMDB store with 3 indexes (like spoc, ospc and psoc). diff --git a/testsuites/benchmark-common/src/main/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalog.java b/testsuites/benchmark-common/src/main/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalog.java index 99637fc0808..970896f0f7e 100644 --- a/testsuites/benchmark-common/src/main/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalog.java +++ b/testsuites/benchmark-common/src/main/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalog.java @@ -21,7 +21,7 @@ public final class ThemeQueryCatalog { - public static final int QUERY_COUNT = 11; + public static final int QUERY_COUNT = 13; private static final Map> QUERIES = new EnumMap<>(Theme.class); @@ -160,7 +160,72 @@ public final class ThemeQueryCatalog { " FILTER NOT EXISTS { ?patient med:hasMedication ?m2 . ?m2 med:code ?c .", " FILTER(?c = \"MED-1005\") }", "}"), - 1L))); + 1L), + query("Medical: denormalized patient encounter medication view", + medicalPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a med:Patient ;", + " med:hasEncounter ?encounter .", + " ?encounter a med:Encounter ;", + " med:hasObservation ?observation .", + " OPTIONAL {", + " ?root med:name ?patientName .", + " OPTIONAL {", + " ?root med:hasMedication ?medication .", + " OPTIONAL {", + " ?medication med:code ?medicationCode .", + " OPTIONAL { ?medication med:dosage ?medicationDosage . }", + " }", + " }", + " }", + " OPTIONAL {", + " ?encounter med:recordedOn ?recordedOn .", + " OPTIONAL {", + " ?encounter med:handledBy ?practitioner .", + " OPTIONAL { ?practitioner med:name ?practitionerName . }", + " }", + " }", + " OPTIONAL {", + " ?encounter med:hasCondition ?condition .", + " OPTIONAL {", + " ?condition med:code ?conditionCode .", + " OPTIONAL { ?condition med:description ?conditionDescription . 
}", + " }", + " }", + " OPTIONAL {", + " ?observation med:value ?observationValue .", + " OPTIONAL { ?observation med:unit ?observationUnit . }", + " }", + "}", + "ORDER BY ?root ?encounter ?observation"), + 199461L), + query("Medical: union-heavy denormalized patient encounter view", + medicalPrefix + String.join("\n", + "SELECT DISTINCT ?root ?optName ?optConditionCode ?optEncounter2 ?optLabel ?optEncounter ?optPractitioner ?optMedicationCode ?optDosage WHERE {", + " { ?root a med:Patient . }", + " UNION", + " { ?root a med:Encounter . }", + " OPTIONAL {", + " ?root med:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root med:hasEncounter ?optEncounter .", + " OPTIONAL { ?optEncounter med:recordedOn ?optDate . }", + " OPTIONAL { ?optEncounter med:handledBy ?optPractitioner . }", + " }", + " OPTIONAL {", + " ?root med:hasMedication ?optMedication .", + " OPTIONAL { ?optMedication med:code ?optMedicationCode . }", + " OPTIONAL { ?optMedication med:dosage ?optDosage . }", + " }", + " OPTIONAL {", + " ?root med:hasEncounter ?optEncounter2 .", + " ?optEncounter2 med:hasCondition ?optCondition .", + " OPTIONAL { ?optCondition med:code ?optConditionCode . }", + " }", + "}"), + 347473L))); String socialPrefix = String.join("\n", "PREFIX social: ", @@ -398,6 +463,63 @@ public final class ThemeQueryCatalog { " OPTIONAL { ?e social:name ?optName . }", " FILTER(?optName IN (\"user7\", \"user8\", \"user9\", \"user10\", \"user11\"))", "}"), + 1L), + query("Social: denormalized users and posts view", + socialPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a social:User ;", + " social:follows ?followed .", + " ?post social:authored ?root .", + " OPTIONAL {", + " ?root social:name ?rootName .", + " OPTIONAL {", + " ?followed social:name ?followedName .", + " OPTIONAL { ?followed social:follows ?followedFollows . 
}", + " }", + " }", + " OPTIONAL {", + " ?post social:authored ?postAuthor .", + " OPTIONAL {", + " ?postAuthor social:name ?postAuthorName .", + " OPTIONAL { ?postAuthor social:follows ?postAuthorFollows . }", + " }", + " }", +// " OPTIONAL {", +// " ?follower social:follows ?root .", +// " OPTIONAL {", +// " ?follower social:name ?followerName .", +// " OPTIONAL { ?follower social:follows ?followerFollows . }", +// " }", +// " }", + "}", + "ORDER BY ?root ?followed ?post"), + 1L), + query("Social: union-heavy denormalized users and posts view", + socialPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " { ?root a social:User . }", + " UNION", + " { ?root a social:Post . }", + " OPTIONAL {", + " ?root social:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root social:follows ?optFollowed .", + " OPTIONAL { ?optFollowed social:name ?optFollowedName . }", + " }", + " OPTIONAL {", + " ?post social:authored ?root .", + " OPTIONAL { ?root social:follows ?optAuthorFollows . }", + " OPTIONAL { ?post social:authored ?optPostAuthor . }", + " }", + " OPTIONAL {", + " { ?optFollower social:follows ?root . }", + " UNION", + " { ?root social:follows ?optFollower . }", + " OPTIONAL { ?optFollower social:name ?optFollowerName . }", + " }", + "}"), 1L))); String libraryPrefix = String.join("\n", @@ -532,6 +654,67 @@ public final class ThemeQueryCatalog { " MINUS { ?branch lib:name ?name2 .", " FILTER(CONTAINS(LCASE(STR(?name2)), \"branch 0\")) }", "}"), + 1L), + query("Library: denormalized catalog and loans view", + libraryPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a lib:Loan ;", + " lib:borrowedBy ?member ;", + " lib:loanedCopy ?copy .", + " ?copy lib:locatedAt ?branch .", + " ?book a lib:Book ;", + " lib:hasCopy ?copy .", + " OPTIONAL {", + " ?member lib:name ?memberName .", + " OPTIONAL {", + " ?branch lib:name ?branchName .", + " OPTIONAL { ?book lib:title ?bookTitle . 
}", + " }", + " }", + " OPTIONAL {", + " ?book lib:writtenBy ?author .", + " OPTIONAL {", + " ?author lib:name ?authorName .", + " OPTIONAL { ?author lib:country ?authorCountry . }", + " }", + " }", + " OPTIONAL {", + " ?root lib:loanDate ?loanDate .", + " OPTIONAL {", + " ?root lib:dueDate ?dueDate .", + " OPTIONAL { ?root lib:returnedDate ?returnedDate . }", + " }", + " }", + "}", + "ORDER BY ?root ?member ?copy"), + 1L), + query("Library: union-heavy denormalized catalog and loans view", + libraryPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " { ?root a lib:Book . }", + " UNION", + " { ?root a lib:Loan . }", + " OPTIONAL {", + " ?root lib:title ?optTitle .", + " BIND(?optTitle AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root lib:hasCopy ?optCopy .", + " OPTIONAL { ?optCopy lib:locatedAt ?optBranch . }", + " OPTIONAL { ?optBranch lib:name ?optBranchName . }", + " }", + " OPTIONAL {", + " ?root lib:loanedCopy ?optLoanCopy .", + " OPTIONAL { ?root lib:borrowedBy ?optMember . }", + " OPTIONAL { ?optMember lib:name ?optMemberName . }", + " }", + " OPTIONAL {", + " { ?root lib:writtenBy ?optAuthor . }", + " UNION", + " { ?bookForAuthor lib:writtenBy ?optAuthor . ?bookForAuthor lib:hasCopy ?optCopy2 . }", + " OPTIONAL { ?optAuthor lib:name ?optAuthorName . }", + " }", + "}"), 1L))); String engineeringPrefix = String.join("\n", @@ -649,7 +832,69 @@ public final class ThemeQueryCatalog { " FILTER(?optComponent != ?assembly)", " MINUS { ?requirement eng:satisfies ?component . 
}", "}"), - 1L))); + 1L), + query("Engineering: denormalized components and requirements view", + engineeringPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a eng:Requirement ;", + " eng:satisfies ?component ;", + " eng:verifiedBy ?test .", + " ?component a eng:Component ;", + " eng:partOf ?assembly .", + " OPTIONAL {", + " ?root eng:name ?requirementName .", + " OPTIONAL {", + " ?component eng:name ?componentName .", + " OPTIONAL { ?assembly eng:name ?assemblyName . }", + " }", + " }", + " OPTIONAL {", + " ?component eng:dependsOn ?dependency .", + " OPTIONAL {", + " ?dependency eng:name ?dependencyName .", + " OPTIONAL { ?dependency eng:partOf ?dependencyAssembly . }", + " }", + " }", + " OPTIONAL {", + " ?test eng:verifiedBy ?measurement .", + " OPTIONAL {", + " ?measurement eng:measuredValue ?measuredValue .", + " OPTIONAL { ?measurement eng:measuredAt ?measuredAt . }", + " }", + " }", + "}", + "ORDER BY ?root ?component ?test"), + 1L), + query("Engineering: union-heavy denormalized components and requirements view", + engineeringPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " { ?root a eng:Component . }", + " UNION", + " { ?root a eng:Requirement . }", + " OPTIONAL {", + " ?root eng:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root eng:partOf ?optAssembly .", + " OPTIONAL { ?optAssembly eng:name ?optAssemblyName . }", + " }", + " OPTIONAL {", + " ?root eng:dependsOn ?optDependency .", + " OPTIONAL { ?optDependency eng:name ?optDependencyName . }", + " }", + " OPTIONAL {", + " ?root eng:satisfies ?optSatisfiedComponent .", + " OPTIONAL { ?optSatisfiedComponent eng:partOf ?optSatisfiedAssembly . }", + " }", + " OPTIONAL {", + " { ?root eng:verifiedBy ?optVerification . }", + " UNION", + " { ?verificationOwner eng:verifiedBy ?optVerification . FILTER(?verificationOwner = ?root) }", + " OPTIONAL { ?optVerification eng:measuredValue ?optMeasuredValue . 
}", + " }", + "}"), + 134229L))); String connectedPrefix = String.join("\n", "PREFIX conn: ", @@ -766,6 +1011,58 @@ public final class ThemeQueryCatalog { " ?n2 conn:weight ?w2 . FILTER(?w2 < ?threshold) }", " MINUS { ?node conn:connectsTo ?node . }", "}"), + 1L), + query("Connected: denormalized node neighborhood view", + connectedPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a conn:Node ;", + " conn:connectsTo ?neighbor .", + " ?neighbor conn:connectsTo ?neighbor2 .", + " OPTIONAL {", + " ?root conn:weight ?rootWeight .", + " OPTIONAL {", + " ?neighbor conn:weight ?neighborWeight .", + " OPTIONAL { ?neighbor2 conn:weight ?neighbor2Weight . }", + " }", + " }", + " OPTIONAL {", + " ?incoming conn:connectsTo ?root .", + " OPTIONAL {", + " ?incoming conn:weight ?incomingWeight .", + " OPTIONAL { ?incoming conn:connectsTo ?incomingNeighbor . }", + " }", + " }", + " OPTIONAL {", + " ?root conn:connectsTo ?optionalNeighbor .", + " OPTIONAL {", + " ?optionalNeighbor conn:connectsTo ?optionalNeighbor2 .", + " OPTIONAL { ?optionalNeighbor2 conn:weight ?optionalNeighbor2Weight . }", + " }", + " }", + "}"), + 1L), + query("Connected: union-heavy denormalized node neighborhood view", + connectedPrefix + String.join("\n", + "SELECT * WHERE {", + " { ?root a conn:Node . }", + " UNION", + " { ?root conn:connectsTo ?optUnionNeighbor . }", + " OPTIONAL {", + " ?root conn:connectsTo ?optNeighbor .", + " OPTIONAL { ?optNeighbor conn:weight ?optNeighborWeight . }", + " }", + " OPTIONAL {", + " ?root conn:weight ?optWeight .", + " BIND(?optWeight AS ?optNodeWeight)", + " }", + " OPTIONAL {", + " { ?optIncoming conn:connectsTo ?root . }", + " UNION", + " { ?root conn:connectsTo ?optIncoming . }", + " OPTIONAL { ?optIncoming conn:weight ?optIncomingWeight . 
}", + " }", + " FILTER(?optNodeWeight != 0 || !BOUND(?optNodeWeight))", + "}"), 1L))); String trainPrefix = String.join("\n", @@ -888,7 +1185,70 @@ public final class ThemeQueryCatalog { " FILTER(?optSection != ?op)", " MINUS { ?op train:name ?name2 . FILTER(CONTAINS(LCASE(STR(?name2)), \"op 1\")) }", "}"), - 1L))); + 1L), + query("Train: denormalized network and service view", + trainPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a train:TrainService ;", + " train:runsOnSection ?section ;", + " train:passesThrough ?operationalPoint .", + " ?section train:partOfLine ?line ;", + " train:hasTrackSection ?track .", + " OPTIONAL {", + " ?root train:name ?serviceName .", + " OPTIONAL {", + " ?line train:name ?lineName .", + " OPTIONAL { ?operationalPoint train:name ?operationalPointName . }", + " }", + " }", + " OPTIONAL {", + " ?root train:scheduledTime ?scheduledTime .", + " OPTIONAL {", + " ?section train:connectsOperationalPoint ?sectionOperationalPoint .", + " OPTIONAL { ?sectionOperationalPoint train:name ?sectionOperationalPointName . }", + " }", + " }", + " OPTIONAL {", + " ?line train:connectsOperationalPoint ?lineOperationalPoint .", + " OPTIONAL {", + " ?lineOperationalPoint train:name ?lineOperationalPointName .", + " OPTIONAL { ?lineOperationalPoint train:code ?lineOperationalPointCode . }", + " }", + " }", + " OPTIONAL {", + " ?track train:length ?trackLength .", + " OPTIONAL { ?track train:trackType ?trackType . }", + " }", + "}"), + 1L), + query("Train: union-heavy denormalized network and service view", + trainPrefix + String.join("\n", + "SELECT * WHERE {", + " { ?root a train:Line . }", + " UNION", + " { ?root a train:SectionOfLine . }", + " OPTIONAL {", + " ?root train:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root train:partOfLine ?optLine .", + " OPTIONAL { ?optLine train:name ?optLineName . 
}", + " }", + " OPTIONAL {", + " ?root train:connectsOperationalPoint ?optOperationalPoint .", + " OPTIONAL { ?optOperationalPoint train:name ?optOperationalPointName . }", + " }", + " OPTIONAL {", + " { ?service train:runsOnSection ?root . }", + " UNION", + " { ?service train:passesThrough ?optOperationalPoint2 .", + " ?root train:connectsOperationalPoint ?optOperationalPoint2 . }", + " OPTIONAL { ?service train:scheduledTime ?optScheduledTime . }", + " }", + " OPTIONAL { ?root train:hasTrackSection ?optTrackSection . }", + "}"), + 943354L))); String gridPrefix = String.join("\n", "PREFIX grid: ", @@ -1005,7 +1365,76 @@ public final class ThemeQueryCatalog { " FILTER(?optValue > 200)", " FILTER NOT EXISTS { ?load grid:loadValue ?low . FILTER(?low < 50) }", "}"), - 1L))); + 1L), + query("Grid: denormalized substation asset view", + gridPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a grid:Transformer ;", + " grid:feeds ?substation ;", + " grid:hasMeter ?meter .", + " ?substation a grid:Substation .", + " ?meter grid:measures ?load .", + " OPTIONAL {", + " ?substation grid:name ?substationName .", + " OPTIONAL {", + " ?root grid:capacity ?transformerCapacity .", + " OPTIONAL { ?meter grid:serial ?meterSerial . }", + " }", + " }", + " OPTIONAL {", + " ?load grid:loadValue ?loadValue .", + " OPTIONAL {", + " ?load grid:loadType ?loadType .", + " OPTIONAL { ?load grid:unit ?loadUnit . }", + " }", + " }", + " OPTIONAL {", + " ?line grid:connectsTo ?substation .", + " OPTIONAL {", + " ?line grid:capacity ?lineCapacity .", + " OPTIONAL { ?line grid:name ?lineName . }", + " }", + " }", + " OPTIONAL {", + " ?generator grid:feeds ?substation .", + " OPTIONAL {", + " ?generator grid:capacity ?generatorCapacity .", + " OPTIONAL { ?generator grid:name ?generatorName . }", + " }", + " }", + "}"), + 1L), + query("Grid: union-heavy denormalized substation asset view", + gridPrefix + String.join("\n", + "SELECT * WHERE {", + " { ?root a grid:Substation . 
}", + " UNION", + " { ?root a grid:Transformer . }", + " OPTIONAL {", + " ?root grid:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root grid:feeds ?optFed .", + " OPTIONAL { ?optFed grid:name ?optFedName . }", + " }", + " OPTIONAL {", + " { ?generator grid:feeds ?root . }", + " UNION", + " { ?root grid:feeds ?substation . ?generator grid:feeds ?substation . }", + " OPTIONAL { ?generator grid:capacity ?optGeneratorCapacity . }", + " }", + " OPTIONAL {", + " ?root grid:hasMeter ?optMeter .", + " OPTIONAL { ?optMeter grid:measures ?optLoad . }", + " OPTIONAL { ?optLoad grid:loadValue ?optLoadValue . }", + " }", + " OPTIONAL {", + " ?line grid:connectsTo ?root .", + " OPTIONAL { ?line grid:capacity ?optLineCapacity . }", + " }", + "}"), + 621654L))); String pharmaPrefix = String.join("\n", "PREFIX pharma: ", @@ -1168,7 +1597,84 @@ public final class ThemeQueryCatalog { "}", "GROUP BY ?pathway", "HAVING(COUNT(DISTINCT ?drug) > 1)"), - 51L))); + 51L), + query("Pharma: denormalized trial and drug view", + pharmaPrefix + String.join("\n", + "SELECT DISTINCT * WHERE {", + " ?root a pharma:ClinicalTrial ;", + " pharma:hasArm ?arm ;", + " pharma:studiesDisease ?disease .", + " ?arm pharma:armDrug ?drug ;", + " pharma:hasResult ?result .", + " OPTIONAL {", + " ?drug pharma:name ?drugName .", + " OPTIONAL {", + " ?disease pharma:name ?diseaseName .", + " OPTIONAL { ?root pharma:name ?trialName . }", + " }", + " }", + " OPTIONAL {", + " ?drug pharma:targets ?target .", + " OPTIONAL {", + " ?target pharma:inPathway ?pathway .", + " OPTIONAL { ?pathway pharma:name ?pathwayName . }", + " }", + " }", + " OPTIONAL {", + " ?result pharma:pValue ?pValue .", + " OPTIONAL {", + " ?result pharma:effectSize ?effectSize .", + " OPTIONAL { ?result pharma:responseRate ?responseRate . 
}", + " }", + " }", + " OPTIONAL {", + " ?drug pharma:hasMolecule ?molecule .", + " OPTIONAL {", + " ?molecule pharma:inClass ?drugClass .", + " OPTIONAL { ?drugClass pharma:name ?drugClassName . }", + " }", + " }", + " OPTIONAL {", + " ?drug pharma:indicatedFor ?indication .", + " OPTIONAL {", + " ?drug pharma:contraindicatedFor ?contraindication .", + " OPTIONAL { ?drug pharma:hasSideEffect ?sideEffect . }", + " }", + " }", + "}"), + 1L), + query("Pharma: union-heavy denormalized trial and drug view", + pharmaPrefix + String.join("\n", + "SELECT * WHERE {", + " { ?root a pharma:Drug . }", + " UNION", + " { ?root a pharma:ClinicalTrial . }", + " OPTIONAL {", + " ?root pharma:name ?optName .", + " BIND(?optName AS ?optLabel)", + " }", + " OPTIONAL {", + " ?root pharma:targets ?optTarget .", + " OPTIONAL { ?optTarget pharma:inPathway ?optPathway . }", + " }", + " OPTIONAL {", + " ?root pharma:indicatedFor ?optIndication .", + " OPTIONAL { ?root pharma:contraindicatedFor ?optContraindication . }", + " }", + " OPTIONAL {", + " { ?root pharma:hasArm ?optArm . }", + " UNION", + " { ?trial pharma:hasArm ?optArm . ?optArm pharma:armDrug ?root . }", + " OPTIONAL { ?optArm pharma:hasResult ?optResult . }", + " OPTIONAL { ?optResult pharma:pValue ?optPValue . }", + " OPTIONAL { ?optResult pharma:effectSize ?optEffectSize . }", + " }", + " OPTIONAL {", + " ?combo pharma:combinationOf ?root .", + " OPTIONAL { ?combo pharma:synergyScore ?optSynergy . 
}", + " }", + "}"), + 25710L))); validateQueries(); } diff --git a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogComplexityTest.java b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogComplexityTest.java index 68880833f81..1ebfaa67453 100644 --- a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogComplexityTest.java +++ b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogComplexityTest.java @@ -45,9 +45,26 @@ void eachThemeHasEightComplexQueries() { long markers = COMPLEX_MARKERS.stream() .filter(normalized::contains) .count(); - assertTrue(markers >= 2, + boolean hasLargeOptionalDenormalizedShape = normalized.contains("SELECT DISTINCT *") + && countOccurrences(normalized, "OPTIONAL") >= 6 + && countOccurrences(normalized, " .") >= 8; + assertTrue(markers >= 2 || hasLargeOptionalDenormalizedShape, "Theme " + theme + " query " + index + " lacks complexity markers: " + query); } } } + + private static int countOccurrences(String input, String token) { + int occurrences = 0; + int fromIndex = 0; + while (fromIndex >= 0) { + int found = input.indexOf(token, fromIndex); + if (found < 0) { + break; + } + occurrences++; + fromIndex = found + token.length(); + } + return occurrences; + } } diff --git a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpansionTest.java b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpansionTest.java index 1d205245496..b50c97aaa90 100644 --- a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpansionTest.java +++ b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpansionTest.java @@ -21,7 +21,7 @@ class ThemeQueryCatalogExpansionTest { - private static final int 
EXPANDED_QUERY_COUNT = 11; + private static final int EXPANDED_QUERY_COUNT = 13; @Test void eachThemeHasExpandedQueryCount() { diff --git a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpectedCountTest.java b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpectedCountTest.java index 1ab0f883e8b..89a035fae6c 100644 --- a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpectedCountTest.java +++ b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogExpectedCountTest.java @@ -23,14 +23,14 @@ class ThemeQueryCatalogExpectedCountTest { @Test void expectedCountsMatchCatalogValues() { Map<Theme, long[]> expectedCounts = Map.of( - Theme.MEDICAL_RECORDS, new long[] { 1, 1, 135, 1, 1, 1, 8335, 1, 8335, 1, 1 }, - Theme.SOCIAL_MEDIA, new long[] { 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1 }, - Theme.LIBRARY, new long[] { 1, 1, 3, 1, 1, 1, 5081, 1, 10, 1, 1 }, - Theme.ENGINEERING, new long[] { 1, 1, 3, 1, 1, 1, 520, 1, 520, 1, 1 }, - Theme.HIGHLY_CONNECTED, new long[] { 1, 1, 36767, 1, 1, 1, 40251, 1, 1, 40251, 1 }, - Theme.TRAIN, new long[] { 1, 1, 3, 1, 1, 1, 7836, 1, 1, 67388, 1 }, - Theme.ELECTRICAL_GRID, new long[] { 1, 1, 10, 1, 1, 1, 9364, 1, 0, 1, 1 }, - Theme.PHARMA, new long[] { 1, 80, 0, 2216, 1, 1, 1, 1, 1635, 1, 51 } + Theme.MEDICAL_RECORDS, new long[] { 1, 1, 135, 1, 1, 1, 8335, 1, 8335, 1, 1, 199461, 347473 }, + Theme.SOCIAL_MEDIA, new long[] { 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1 }, + Theme.LIBRARY, new long[] { 1, 1, 3, 1, 1, 1, 5081, 1, 10, 1, 1, 1, 1 }, + Theme.ENGINEERING, new long[] { 1, 1, 3, 1, 1, 1, 520, 1, 520, 1, 1, 1, 134229 }, + Theme.HIGHLY_CONNECTED, new long[] { 1, 1, 36767, 1, 1, 1, 40251, 1, 1, 40251, 1, 1, 1 }, + Theme.TRAIN, new long[] { 1, 1, 3, 1, 1, 1, 7836, 1, 1, 67388, 1, 1, 943354 }, + Theme.ELECTRICAL_GRID, new long[] { 1, 1, 10, 1, 1, 1, 9364, 1, 0, 1, 1, 1, 621654 }, +
Theme.PHARMA, new long[] { 1, 80, 0, 2216, 1, 1, 1, 1, 1635, 1, 51, 1, 25710 } ); for (Map.Entry<Theme, long[]> entry : expectedCounts.entrySet()) { diff --git a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogOptimizerGapTest.java b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogOptimizerGapTest.java index 11870f18bdc..9e4f2674e50 100644 --- a/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogOptimizerGapTest.java +++ b/testsuites/benchmark-common/src/test/java/org/eclipse/rdf4j/benchmark/common/ThemeQueryCatalogOptimizerGapTest.java @@ -25,6 +25,9 @@ class ThemeQueryCatalogOptimizerGapTest { private static final String OPTIONAL_FILTER_MARKER = "FILTER(?OPT"; private static final String DISJUNCTIVE_MARKER = "||"; private static final String IN_LIST_MARKER = " IN "; + private static final String UNION_MARKER = "UNION"; + private static final String OPTIONAL_MARKER = "OPTIONAL"; + private static final String OPTIONAL_VARIABLE_MARKER = "?OPT"; @Test void eachQueryTargetsKnownOptimizerGaps() { @@ -38,9 +41,29 @@ void eachQueryTargetsKnownOptimizerGaps() { boolean hasGapMarker = normalized.contains(OPTIONAL_FILTER_MARKER) || normalized.contains(DISJUNCTIVE_MARKER) || normalized.contains(IN_LIST_MARKER); - assertTrue(hasGapMarker, "Theme " + theme + " query " + index - + " lacks optimizer-gap markers: " + query); + boolean hasLargeOptionalDenormalizedShape = normalized.contains("SELECT DISTINCT *") + && !normalized.contains(UNION_MARKER) + && countOccurrences(normalized, OPTIONAL_MARKER) >= 6; + boolean hasUnionHeavyDenormalizedShape = countOccurrences(normalized, UNION_MARKER) >= 1 + && countOccurrences(normalized, OPTIONAL_MARKER) >= 6 + && countOccurrences(normalized, OPTIONAL_VARIABLE_MARKER) >= 6; + assertTrue(hasGapMarker || hasLargeOptionalDenormalizedShape || hasUnionHeavyDenormalizedShape, + "Theme " + theme + " query " + index + " lacks 
optimizer-gap markers: " + query); } } } + + private static int countOccurrences(String input, String token) { + int occurrences = 0; + int fromIndex = 0; + while (fromIndex >= 0) { + int found = input.indexOf(token, fromIndex); + if (found < 0) { + break; + } + occurrences++; + fromIndex = found + token.length(); + } + return occurrences; + } } diff --git a/testsuites/benchmark/src/test/java/org/eclipse/rdf4j/benchmark/plan/QueryPlanSnapshotCliTest.java b/testsuites/benchmark/src/test/java/org/eclipse/rdf4j/benchmark/plan/QueryPlanSnapshotCliTest.java index fdb2c861ba4..33f860c660f 100644 --- a/testsuites/benchmark/src/test/java/org/eclipse/rdf4j/benchmark/plan/QueryPlanSnapshotCliTest.java +++ b/testsuites/benchmark/src/test/java/org/eclipse/rdf4j/benchmark/plan/QueryPlanSnapshotCliTest.java @@ -32,6 +32,7 @@ import java.util.Map; import java.util.concurrent.TimeUnit; +import org.eclipse.rdf4j.benchmark.common.ThemeQueryCatalog; import org.eclipse.rdf4j.benchmark.common.plan.QueryPlanCapture; import org.eclipse.rdf4j.benchmark.common.plan.QueryPlanCaptureContext; import org.eclipse.rdf4j.benchmark.common.plan.QueryPlanExplanation; @@ -328,8 +329,10 @@ void runAllThemeQueriesForSingleThemePrintsBatchEtaStartAndSummary() throws Exce cli.run(options); String printed = outputBuffer.toString(StandardCharsets.UTF_8); + String expectedSummary = "Completed run-all mode: " + ThemeQueryCatalog.QUERY_COUNT + + " queries across 1 theme."; assertTrue(printed.contains("ETA start:"), printed); - assertTrue(printed.contains("Completed run-all mode: 11 queries across 1 theme."), printed); + assertTrue(printed.contains(expectedSummary), printed); assertFalse(printed.contains("Theme=SOCIAL_MEDIA"), printed); } diff --git a/tools/workbench/src/test/java/org/eclipse/rdf4j/workbench/commands/SummaryServletCoverageTest.java b/tools/workbench/src/test/java/org/eclipse/rdf4j/workbench/commands/SummaryServletCoverageTest.java index 8746343ebf6..582a70e1b22 100644 --- 
a/tools/workbench/src/test/java/org/eclipse/rdf4j/workbench/commands/SummaryServletCoverageTest.java +++ b/tools/workbench/src/test/java/org/eclipse/rdf4j/workbench/commands/SummaryServletCoverageTest.java @@ -12,6 +12,9 @@ package org.eclipse.rdf4j.workbench.commands; import static org.assertj.core.api.Assertions.assertThat; +import static org.mockito.ArgumentMatchers.eq; +import static org.mockito.ArgumentMatchers.isNull; +import static org.mockito.ArgumentMatchers.nullable; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.verify; import static org.mockito.Mockito.when; @@ -61,7 +64,8 @@ void serviceKeepsRenderingWhenStatisticsCollectionIsInterrupted() throws Excepti Thread.interrupted(); } - verify(builder).result("memory", "Memory repo", null, null, null, null); + verify(builder).result(eq("memory"), eq("Memory repo"), isNull(), isNull(), nullable(String.class), + nullable(String.class)); verify(builder).end(); }