32 commits
3381e6b
implement benchmark
hmottestad Apr 5, 2026
1a5213f
implement lftj
hmottestad Apr 5, 2026
67f5e26
start benchmarking
hmottestad Apr 5, 2026
9191827
start benchmarking
hmottestad Apr 5, 2026
82b55cb
add a skill for writing high performance java code
hmottestad Apr 5, 2026
03b9e6d
continue optimizing lftj
hmottestad Apr 5, 2026
7aa92c3
continue optimizing lftj
hmottestad Apr 5, 2026
9fde7ed
continue optimizing lftj
hmottestad Apr 5, 2026
a930d37
continue optimizing lftj
hmottestad Apr 5, 2026
06ca808
continue optimizing lftj
hmottestad Apr 5, 2026
100bf1d
lftj is faster
hmottestad Apr 5, 2026
d4b9849
lftj is faster
hmottestad Apr 5, 2026
d57102f
lftj is faster
hmottestad Apr 5, 2026
e831cc8
fixes
hmottestad Apr 5, 2026
71ab6b7
new best
hmottestad Apr 5, 2026
08892f5
new best
hmottestad Apr 5, 2026
e0e8b68
wip
hmottestad Apr 5, 2026
8cc530f
fastest yet, with codegen
hmottestad Apr 5, 2026
9f0d330
fastest yet, with codegen
hmottestad Apr 5, 2026
6363274
fastest yet, with codegen
hmottestad Apr 5, 2026
38a8216
more tests
hmottestad Apr 6, 2026
e110c96
more tests and fixes
hmottestad Apr 6, 2026
ebbe896
updated results
hmottestad Apr 6, 2026
3d4fb79
more tests
hmottestad Apr 6, 2026
02fce3c
improve skill
hmottestad Apr 6, 2026
9d28525
even faster
hmottestad Apr 6, 2026
584a3d1
even faster
hmottestad Apr 6, 2026
c63b138
even faster
hmottestad Apr 6, 2026
47cb10b
fix bugs
hmottestad Apr 6, 2026
13db1a1
fix bugs
hmottestad Apr 7, 2026
79e6bab
fix bugs
hmottestad Apr 7, 2026
ba6ddb2
fix bugs
hmottestad Apr 7, 2026
164 changes: 164 additions & 0 deletions .codex/skills/high-performance-java/SKILL.md
@@ -0,0 +1,164 @@
---
name: high-performance-java
description: Use when writing, reviewing, or reshaping HotSpot Java where algorithmic complexity, data-structure choice, throughput, latency, allocation rate, zero-copy, lazy evaluation, non-materialization, primitive collections, performance libraries, intrinsics, SuperWord auto-vectorization, or C2 assembly matter. Also use for advanced algorithmic problem solving in Java, including dynamic programming, graph/range techniques, and cache-aware code shape. Bias toward asymptotic wins first, then specialized hot-path code, then benchmark and JIT evidence.
---

# High-Performance Java

Use this skill for Java hot paths and algorithm-heavy Java. Default bias: asymptotic win first, then fewer allocations, fewer copies, less polymorphism, narrower code shape, stronger evidence.

HotSpot-only v1. Baseline assumptions:
- repo baseline: JDK 21
- current local runtime may be newer
- low-level claims stay provisional until benchmark + JIT evidence agree
- algorithm/data-structure claims stay provisional until they match the actual workload constraints

## Core loop

1. Identify the workload shape and constraints.
2. Pick the algorithm and data structure that change the slope.
3. Find the hot loop or hot call chain.
4. Write the narrow fast path first.
5. Push generic abstraction, materialization, and dispatch out of the loop.
6. Benchmark before claiming improvement.
7. Inspect HotSpot decisions before claiming JVM-level reasons.

## Default coding bias

- Prefer an algorithmic win over a micro win.
- Prefer data structures that fit the operation mix, memory budget, and key domain.
- Prefer primitive-friendly layouts before boxed object graphs.
- Prefer zero-copy over copy-transform-copy.
- Prefer reuse over per-item allocation.
- Prefer lazy traversal over full materialization.
- Prefer primitives, flat arrays, and tight counted loops in hot paths.
- Prefer monomorphic calls that inline away.
- Prefer specialized lambda/adaptor code for the active workload.
- Prefer one fast path plus one cold fallback over a single generalized hot path.
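
The primitive-layout bias above can be sketched with a small frequency-count example (class and method names are hypothetical): the boxed map allocates an `Integer` per key and per updated value, while the flat array stays allocation-free on the steady-state path.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyCount {
    // Generic shape: boxes every key and value, hashes on every update.
    static Map<Integer, Integer> countBoxed(int[] keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // Primitive shape: one contiguous array, tight counted loop.
    // Assumes keys are non-negative and below keyBound.
    static int[] countFlat(int[] keys, int keyBound) {
        int[] counts = new int[keyBound];
        for (int k : keys) {
            counts[k]++;
        }
        return counts;
    }
}
```

The flat variant only fits when the key domain is small and dense; when it is, it is both the faster and the simpler shape.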

## Hard rules

- Do not micro-optimize a fundamentally wrong algorithm.
- Do not defend a perf change with style arguments alone.
- Do not claim “faster” without a measurement path.
- Do not claim “JIT will optimize this” without checking inlining / compilation evidence.
- Do not add a specialized library until you know what property it buys: fewer allocations, fewer copies, lower contention, off-heap layout, better primitive support, or a stronger algorithm.
- Do not keep elegant-but-generic stream pipelines in verified hot loops.
- Do not pay interface / visitor / wrapper overhead inside the hottest loop unless evidence shows it disappears.
- Do not default to boxed `Map<K, V>` / `Set<T>` / `List<T>` shapes when primitive collections or flat arrays better fit the dominant path.

## Design checklist

Ask these first:
- What are `N`, `Q`, the update/query ratio, and the memory budget?
- Is the main problem asymptotic complexity, cache locality, allocation pressure, branchiness, contention, or I/O?
- What operation dominates: membership, counting, top-k, range query, join, shortest path, DP transition, parsing, encoding?
- Can the key/value/state space stay primitive or bit-packed?
- Can the workload become offline, batched, sorted, prefix-based, or compressed?
- What allocates on the steady-state path?
- What copies bytes, chars, arrays, or collections?
- What materializes intermediate state that could stay streamed or cursor-based?
- What dispatch stays virtual or megamorphic in the inner loop?
- What loop shape blocks scalar replacement, inlining, or SuperWord vectorization?
- What “generic” branch handles cases the active workload never uses?

## Workflow

### 0) Pick the algorithmic shape

- Estimate the real workload: input size, query count, mutation pattern, latency target, and memory ceiling.
- Choose the algorithm and data structure before tuning loop syntax.
- Favor contiguous, cache-friendly, primitive-heavy representations when semantics allow.
- For dynamic programming, define state, transition cost, base case, iteration order, and whether state compression is possible.
- For graph/range/string problems, look for offline transforms, prefix structures, monotonic structures, or specialized search before hand-tuning.

Read these only when relevant:
- [references/algorithms-data-structures.md](references/algorithms-data-structures.md) for algorithm and data-structure selection.
- [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) for dynamic programming and advanced problem-solving patterns.

### 1) Shape the code for HotSpot

- Split hot and cold paths.
- Hoist invariant checks and decoding outside the loop.
- Replace generic callback stacks with narrow-path adapters.
- Reuse mutable carriers only when ownership is clear.
- Keep loop bodies predictable, contiguous, and exception-light.
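
A minimal sketch of the hot/cold split, using a hypothetical unsigned-int parser: the common all-digit case is a tight, allocation-free loop, and anything unusual exits to a cold method (overflow handling is omitted; this is a shape illustration, not a production parser).

```java
class FastParse {
    static int parseUnsignedInt(String s) {
        int n = s.length();
        if (n == 0) {
            return parseSlowPath(s);
        }
        int value = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return parseSlowPath(s); // cold fallback, kept out of the hot loop
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    // General, allocation-tolerant handling (whitespace, errors) stays cold.
    private static int parseSlowPath(String s) {
        return Integer.parseInt(s.trim());
    }
}
```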

Detailed rules: see [references/coding-rules.md](references/coding-rules.md).

### 2) Measure

If you are in this RDF4J repo, use the local benchmark wrapper first:

```bash
scripts/run-single-benchmark.sh --module <module> --class <fqcn> --method <benchmarkMethod>
```

If you are outside RDF4J, use JMH or an existing reproducible micro/macro benchmark.

Measurement workflow: see [references/evidence-workflow.md](references/evidence-workflow.md).

### 3) Explain with JVM evidence

When a benchmark moves, inspect what HotSpot actually did:
- compilation tier
- inlining success/failure
- intrinsic usage when relevant
- allocation pressure
- assembly / C2 logs when needed

Use sibling skill [hotspot-jit-forensics](../hotspot-jit-forensics/SKILL.md) for method-scoped C2 evidence. Use `async-profiler-java-macos` when wall/cpu/alloc evidence is needed on macOS.

### 4) Use libraries intentionally

- Prefer the JDK first when it is close enough and operationally simpler.
- Reach for specialized libraries when they remove boxing, copies, parser overhead, contention, or off-heap indirection the JDK cannot.
- Check dependency health before adding a new library.
- Benchmark the library choice against the simplest credible in-repo baseline.

Library reference: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md).

### 5) Report honestly

Frame conclusions as:
- hypothesis
- algorithm/data-structure choice
- benchmark result
- JIT/profile evidence
- confidence

If assembly is unavailable, say so and fall back to compilation logs, inlining diagnostics, and profile data.

## Trigger examples

Use this skill when the user asks to:
- remove allocation pressure from a parser, iterator, encoder, decoder, or query loop
- make a Java path zero-copy or lazy
- choose the right data structure for a Java workload
- solve a dynamic programming, graph, interval, ranking, or range-query problem in Java under performance constraints
- replace boxed collections with primitive or cache-friendly structures
- choose between the JDK and specialized Java performance libraries
- specialize code for one workload instead of many
- explain whether a HotSpot optimization actually happened
- ground a Java perf change in benchmark + C2 evidence

## Reference map

- Algorithms and data structures: [references/algorithms-data-structures.md](references/algorithms-data-structures.md)
- Advanced coding techniques: [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md)
- High-performance Java libraries: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md)
- Coding rules: [references/coding-rules.md](references/coding-rules.md)
- Evidence workflow: [references/evidence-workflow.md](references/evidence-workflow.md)
- JDK version guardrails: [references/jdk-21-26-notes.md](references/jdk-21-26-notes.md)

## Output contract

When you use this skill, the answer should usually include:
- workload model and asymptotic bottleneck
- algorithm and data-structure recommendation
- hot-path hypothesis
- concrete code-shape recommendation
- library recommendation when a library meaningfully changes the design
- benchmark command or benchmark evidence
- JIT/profile evidence or the missing prerequisite
- a confidence statement tied to the active JDK
4 changes: 4 additions & 0 deletions .codex/skills/high-performance-java/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "High-Performance Java"
  short_description: "Hot-path Java plus algorithm/perf-library guidance"
  default_prompt: "Use $high-performance-java to choose the right algorithm, data structure, library, and HotSpot-friendly code shape for a high-performance Java task."
@@ -0,0 +1,220 @@
# Advanced Coding Techniques

Use this reference when the problem needs more than basic loops and collections: dynamic programming, advanced search, state compression, offline transforms, or optimization patterns that materially change runtime.

## Dynamic programming checklist

Before writing code, define:
- state: the minimum information needed to continue
- transition: how one state moves to the next
- base case: the smallest solved states
- order: top-down memoization or bottom-up tabulation
- objective: min, max, count, feasibility, reconstruction
- memory plan: full table, rolling rows, bitset, or sparse map

If any of those are fuzzy, the DP is not ready.
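
As a worked illustration of the checklist (names are hypothetical), minimum-coin change with each item answered in a comment:

```java
import java.util.Arrays;

class CoinChange {
    // Assumes positive coin values and a non-negative amount.
    static int minCoins(int[] coins, int amount) {
        final int INF = Integer.MAX_VALUE / 2;       // sentinel, not a wrapper object
        int[] dp = new int[amount + 1];              // state: best count per sub-amount
        Arrays.fill(dp, INF);
        dp[0] = 0;                                   // base case: zero coins reach 0
        for (int a = 1; a <= amount; a++) {          // order: bottom-up tabulation
            for (int c : coins) {                    // transition: spend one coin
                if (c <= a) {
                    dp[a] = Math.min(dp[a], dp[a - c] + 1);
                }
            }
        }
        return dp[amount] >= INF ? -1 : dp[amount];  // objective: min, -1 if unreachable
        // memory plan: full 1D table; no compression needed at this size
    }
}
```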

## DP implementation bias in Java

- Prefer flat primitive arrays over nested object graphs.
- Flatten `dp[row][col]` into one array when locality matters.
- Use sentinel values (`INF`, `-1`, impossible masks) instead of wrapper objects.
- Compress dimensions aggressively when a transition only needs prior rows or prior prefixes.
- Use iterative tabulation when recursion depth or call overhead is risky.
- Use memoization when the reachable state space is sparse or pruning is strong.

## Common DP families

### 1D DP

Use for:
- linear decisions
- prefix optimization
- classic knapsack-style transitions

Java notes:
- Often compresses to one array.
- Direction matters: reverse iterate for 0/1 knapsack; forward iterate for unbounded knapsack.
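
The direction rule above, as a side-by-side sketch (hypothetical names): 0/1 knapsack iterates capacity downward so each item is counted at most once; unbounded iterates upward so an item can feed its own reuse.

```java
class Knapsack {
    static int zeroOne(int[] w, int[] v, int cap) {
        int[] dp = new int[cap + 1];
        for (int i = 0; i < w.length; i++) {
            for (int c = cap; c >= w[i]; c--) {      // reverse: item taken once
                dp[c] = Math.max(dp[c], dp[c - w[i]] + v[i]);
            }
        }
        return dp[cap];
    }

    static int unbounded(int[] w, int[] v, int cap) {
        int[] dp = new int[cap + 1];
        for (int i = 0; i < w.length; i++) {
            for (int c = w[i]; c <= cap; c++) {      // forward: item reusable
                dp[c] = Math.max(dp[c], dp[c - w[i]] + v[i]);
            }
        }
        return dp[cap];
    }
}
```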

### 2D grid / sequence DP

Use for:
- edit distance
- LCS variants
- path counting
- interval composition

Java notes:
- Two rolling rows often replace the full matrix.
- Keep row-major iteration consistent with memory layout.
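
A rolling-row sketch for edit distance (hypothetical names): two reusable rows replace the full `(n+1) x (m+1)` matrix, and the rows are swapped rather than reallocated.

```java
class EditDistance {
    static int distance(String a, String b) {
        int m = b.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j;                              // base row: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                              // base column: i deletions
            for (int j = 1; j <= m; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] t = prev;                           // swap rows, no reallocation
            prev = curr;
            curr = t;
        }
        return prev[m];
    }
}
```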

### Interval DP

Use for:
- merge cost
- matrix chain multiplication
- optimal parenthesization
- palindrome partitioning

Heuristic:
- Try increasing interval length order.
- Precompute reusable range costs.

### Tree DP

Use for:
- subtree aggregation
- rerooting
- independent set / matching variants on trees

Java notes:
- Iterative traversal can avoid stack overflow.
- Store parent/index arrays once; reuse buffers for passes.

### DAG DP

Use for:
- longest path in DAG
- path counts
- dependency-ordered optimization

Heuristic:
- Topological order first, transitions second.

### Bitmask DP

Use for:
- small `n` subset problems
- travelling-salesman-style state
- assignment and partition variants

Java notes:
- Use `int` masks up to 31 bits, `long` masks up to 63.
- Precompute subset transitions when reused heavily.
- Beware exponential memory growth; consider meet-in-the-middle.
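
A Held-Karp style sketch of bitmask DP (hypothetical names): `dp[mask][last]` is the cheapest path that visits exactly the cities in `mask` and ends at `last`, starting from city 0. Memory is `O(2^n * n)`, so this only fits small `n`.

```java
import java.util.Arrays;

class TspBitmask {
    static int shortestTour(int[][] dist) {
        int n = dist.length;
        final int INF = Integer.MAX_VALUE / 2;
        int[][] dp = new int[1 << n][n];
        for (int[] row : dp) {
            Arrays.fill(row, INF);
        }
        dp[1][0] = 0;                                     // start at city 0
        for (int mask = 1; mask < (1 << n); mask++) {
            for (int last = 0; last < n; last++) {
                if (dp[mask][last] >= INF || (mask & (1 << last)) == 0) continue;
                for (int next = 0; next < n; next++) {
                    if ((mask & (1 << next)) != 0) continue;
                    int m2 = mask | (1 << next);          // extend the visited set
                    dp[m2][next] = Math.min(dp[m2][next], dp[mask][last] + dist[last][next]);
                }
            }
        }
        int best = INF;
        int full = (1 << n) - 1;
        for (int last = 1; last < n; last++) {
            best = Math.min(best, dp[full][last] + dist[last][0]); // return home
        }
        return best;
    }
}
```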

### Digit DP

Use for:
- counting numbers with digit constraints
- lexicographic numeric constraints

State usually includes:
- position
- tight/limited flag
- started/leading-zero flag
- problem-specific accumulator

## DP optimization patterns

### Prefix/suffix acceleration

If a transition scans prior states, ask whether prefix minima/maxima/sums can reduce it from `O(n^2)` to `O(n)`.
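
A before/after sketch of that reduction (hypothetical names), for the transition `dp[i] = min over j < i of dp[j] + cost[i]`: the inner scan collapses to a running prefix minimum.

```java
class PrefixDp {
    // Naive O(n^2) shape: rescans all prior states per transition.
    static long[] naive(long[] cost) {
        int n = cost.length;
        long[] dp = new long[n];
        dp[0] = cost[0];
        for (int i = 1; i < n; i++) {
            long best = Long.MAX_VALUE;
            for (int j = 0; j < i; j++) {
                best = Math.min(best, dp[j]);
            }
            dp[i] = best + cost[i];
        }
        return dp;
    }

    // O(n) shape: carry min(dp[0..i-1]) forward instead of rescanning.
    static long[] fast(long[] cost) {
        int n = cost.length;
        long[] dp = new long[n];
        dp[0] = cost[0];
        long prefixMin = dp[0];
        for (int i = 1; i < n; i++) {
            dp[i] = prefixMin + cost[i];
            prefixMin = Math.min(prefixMin, dp[i]);
        }
        return dp;
    }
}
```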

### Monotonic queue optimization

Use when transitions need min/max over a sliding window.
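
A minimal sliding-window-minimum sketch (hypothetical names): a deque holds indices whose values are kept strictly increasing, so the window minimum is always at the front and each index is pushed and popped at most once.

```java
import java.util.ArrayDeque;

class WindowMin {
    // Assumes 1 <= k <= a.length.
    static int[] minPerWindow(int[] a, int k) {
        int[] out = new int[a.length - k + 1];
        ArrayDeque<Integer> dq = new ArrayDeque<>();     // indices, values increasing
        for (int i = 0; i < a.length; i++) {
            while (!dq.isEmpty() && a[dq.peekLast()] >= a[i]) {
                dq.pollLast();                           // dominated candidates leave
            }
            dq.addLast(i);
            if (dq.peekFirst() <= i - k) {
                dq.pollFirst();                          // front slid out of the window
            }
            if (i >= k - 1) {
                out[i - k + 1] = a[dq.peekFirst()];
            }
        }
        return out;
    }
}
```

In a verified hot path, the boxed `ArrayDeque<Integer>` would itself be replaced with an `int[]` ring buffer, per the primitive-layout bias above.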

### Divide-and-conquer DP optimization

Use when the optimal split point is monotonic across rows or columns.

### Convex hull trick / Li Chao tree

Use when transitions are of the form:
- `dp[i] = min_j(m[j] * x[i] + b[j])`
- `max` variant of the same

Only use when the algebra really matches.

### Bitset DP

Use when boolean subset transitions can become word-parallel bit operations.

Examples:
- subset sum
- knapsack feasibility
- reachability layers
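
A subset-sum feasibility sketch in word-parallel form (hypothetical names). A single `long` covers targets up to 63, because Java masks shift distances for `long` to 0-63; larger targets need a `long[]` or a shifted `BitSet`.

```java
class SubsetSumBits {
    // Assumes 0 < w < 64 for every weight and 0 <= target < 64.
    static boolean reachable(int[] weights, int target) {
        long bits = 1L;                 // bit s set <=> sum s is reachable
        for (int w : weights) {
            bits |= bits << w;          // "skip or take w": 64 sums per operation
        }
        return (bits >>> target & 1L) == 1L;
    }
}
```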

### State compression

Reduce dimensions by:
- keeping only prior row/column
- encoding booleans into bits
- coordinate-compressing sparse values
- using ids instead of objects

## Search and optimization patterns

### Binary search on answer

Use when:
- feasibility is monotonic
- exact objective is hard but checking a threshold is easier
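
A sketch of the pattern (hypothetical names): split an array into at most `k` contiguous parts minimizing the largest part sum. Feasibility is monotonic in the allowed cap, so the answer is bisected while each check is a linear greedy pass.

```java
class SplitArray {
    static long minLargestSum(int[] a, int k) {
        long lo = 0, hi = 0;
        for (int x : a) {
            lo = Math.max(lo, x);            // cap must fit the largest element
            hi += x;                         // one part holds everything
        }
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (feasible(a, k, mid)) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }

    // Greedy check: pack left to right, opening a new part only when forced.
    private static boolean feasible(int[] a, int k, long cap) {
        int parts = 1;
        long sum = 0;
        for (int x : a) {
            if (sum + x > cap) {
                parts++;
                sum = x;
            } else {
                sum += x;
            }
        }
        return parts <= k;
    }
}
```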

### Meet-in-the-middle

Use when:
- brute force is `2^n`
- `n` is small enough to split into two `2^(n/2)` halves

### Branch and bound

Use when:
- you can compute tight upper/lower bounds
- a good heuristic ordering prunes much of the tree

### Iterative deepening

Use when:
- memory is tight
- solution depth is unknown but usually shallow

### Offline query processing

Use when:
- query order is irrelevant
- sorting queries/events lets you reuse structure updates

## Greedy and exchange-thinking

Before building DP or search, test whether a greedy proof exists:
- local choice stays globally optimal
- exchange argument repairs any non-greedy optimal solution
- matroid-like or interval-scheduling structure is present

If greedy works, it often beats DP both asymptotically and operationally.

## Range and sequence patterns

- Sliding window: monotonic boundary expansion or contraction.
- Two pointers: sorted arrays, pair/triple sums, dedup, partitioning.
- Monotonic stack: next greater/smaller, histogram, span problems.
- Difference arrays: batch range updates.
- Prefix sums / xor / hashes: cheap repeated range queries.
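
One of the patterns above as a sketch (hypothetical names): a difference array turns each range-add update into two writes, with a single prefix pass recovering the final values.

```java
class DiffArray {
    // Each update is {lo, hi, delta} with hi inclusive and 0 <= lo <= hi < n.
    static int[] applyRangeAdds(int n, int[][] updates) {
        int[] diff = new int[n + 1];
        for (int[] u : updates) {
            diff[u[0]] += u[2];              // delta starts at lo
            diff[u[1] + 1] -= u[2];          // and stops after hi
        }
        int[] out = new int[n];
        int run = 0;
        for (int i = 0; i < n; i++) {
            run += diff[i];                  // prefix pass materializes values
            out[i] = run;
        }
        return out;
    }
}
```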

## Java-specific implementation notes

- Avoid recursion for deep graphs, trees, or DP unless the depth bound is small.
- Replace tuple objects with parallel arrays or packed longs in hot paths.
- Pre-size arrays and reusable buffers for repeated test cases.
- Be explicit about overflow; use `long` for counts/costs unless `int` is proven safe.
- Separate correctness code from hot code paths once the algorithm is clear.
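
The packed-long replacement for tuple objects can be sketched as (hypothetical names; assumes each half fits in 32 bits):

```java
class Packed {
    static long pack(int hi, int lo) {
        // Mask lo to stop sign extension from clobbering the high half.
        return ((long) hi << 32) | (lo & 0xFFFF_FFFFL);
    }

    static int high(long p) {
        return (int) (p >>> 32);
    }

    static int low(long p) {
        return (int) p;
    }
}
```

Packed longs sort, hash, and store in flat `long[]` arrays with no per-pair allocation, which is the point of the substitution.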

## Problem-solving ladder

When stuck, try this order:
1. Can I sort or batch the work?
2. Can I precompute prefix, suffix, or compressed state?
3. Can a different data structure remove a nested loop?
4. Is the problem actually graph, interval, or DP in disguise?
5. Can the state shrink to primitives or bits?
6. Can I prove greedy, monotonicity, or convexity?

## Red flags

- DP state includes fields that do not affect future transitions.
- Memoization key is a heavyweight object when a few ints suffice.
- Full `O(n^2)` table retained even though only one frontier is used.
- Search explores symmetric states repeatedly.
- A library data structure is used where a flat array plus sort is enough.