32 commits
3381e6b
implement benchmark
hmottestad Apr 5, 2026
1a5213f
implement lftj
hmottestad Apr 5, 2026
67f5e26
start benchmarking
hmottestad Apr 5, 2026
9191827
start benchmarking
hmottestad Apr 5, 2026
82b55cb
add a skill for writing high performance java code
hmottestad Apr 5, 2026
03b9e6d
continue optimizing lftj
hmottestad Apr 5, 2026
7aa92c3
continue optimizing lftj
hmottestad Apr 5, 2026
9fde7ed
continue optimizing lftj
hmottestad Apr 5, 2026
a930d37
continue optimizing lftj
hmottestad Apr 5, 2026
06ca808
continue optimizing lftj
hmottestad Apr 5, 2026
100bf1d
lftj is faster
hmottestad Apr 5, 2026
d4b9849
lftj is faster
hmottestad Apr 5, 2026
d57102f
lftj is faster
hmottestad Apr 5, 2026
e831cc8
fixes
hmottestad Apr 5, 2026
71ab6b7
new best
hmottestad Apr 5, 2026
08892f5
new best
hmottestad Apr 5, 2026
e0e8b68
wip
hmottestad Apr 5, 2026
8cc530f
fastest yet, with codegen
hmottestad Apr 5, 2026
9f0d330
fastest yet, with codegen
hmottestad Apr 5, 2026
6363274
fastest yet, with codegen
hmottestad Apr 5, 2026
38a8216
more tests
hmottestad Apr 6, 2026
e110c96
more tests and fixes
hmottestad Apr 6, 2026
ebbe896
updated results
hmottestad Apr 6, 2026
3d4fb79
more tests
hmottestad Apr 6, 2026
02fce3c
improve skill
hmottestad Apr 6, 2026
9d28525
even faster
hmottestad Apr 6, 2026
584a3d1
even faster
hmottestad Apr 6, 2026
c63b138
even faster
hmottestad Apr 6, 2026
47cb10b
fix bugs
hmottestad Apr 6, 2026
13db1a1
fix bugs
hmottestad Apr 7, 2026
79e6bab
fix bugs
hmottestad Apr 7, 2026
ba6ddb2
fix bugs
hmottestad Apr 7, 2026
164 changes: 164 additions & 0 deletions .codex/skills/high-performance-java/SKILL.md
@@ -0,0 +1,164 @@
---
name: high-performance-java
description: Use when writing, reviewing, or reshaping HotSpot Java where algorithmic complexity, data-structure choice, throughput, latency, allocation rate, zero-copy, lazy evaluation, non-materialization, primitive collections, performance libraries, intrinsics, SuperWord auto-vectorization, or C2 assembly matter. Also use for advanced algorithmic problem solving in Java, including dynamic programming, graph/range techniques, and cache-aware code shape. Bias toward asymptotic wins first, then specialized hot-path code, then benchmark and JIT evidence.
---

# High-Performance Java

Use this skill for Java hot paths and algorithm-heavy Java. Default bias: asymptotic win first, then fewer allocations, fewer copies, less polymorphism, narrower code shape, stronger evidence.

HotSpot-only v1. Baseline assumptions:
- repo baseline: JDK 21
- current local runtime may be newer
- low-level claims stay provisional until benchmark + JIT evidence agree
- algorithm/data-structure claims stay provisional until they match the actual workload constraints

## Core loop

1. Identify the workload shape and constraints.
2. Pick the algorithm and data structure that change the slope.
3. Find the hot loop or hot call chain.
4. Write the narrow fast path first.
5. Push generic abstraction, materialization, and dispatch out of the loop.
6. Benchmark before claiming improvement.
7. Inspect HotSpot decisions before claiming JVM-level reasons.

## Default coding bias

- Prefer an algorithmic win over a micro win.
- Prefer data structures that fit the operation mix, memory budget, and key domain.
- Prefer primitive-friendly layouts before boxed object graphs.
- Prefer zero-copy over copy-transform-copy.
- Prefer reuse over per-item allocation.
- Prefer lazy traversal over full materialization.
- Prefer primitives, flat arrays, and tight counted loops in hot paths.
- Prefer monomorphic calls that inline away.
- Prefer specialized lambda/adaptor code for the active workload.
- Prefer one fast path plus one cold fallback over a single generalized hot path.
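
The primitive-layout bias above can be sketched with a small frequency-count example (class and method names are hypothetical): the boxed map allocates an `Integer` per key and per updated value, while the flat array stays allocation-free on the steady-state path.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyCount {
    // Generic shape: boxes every key and value, hashes on every update.
    static Map<Integer, Integer> countBoxed(int[] keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // Primitive shape: one contiguous array, tight counted loop.
    // Assumes keys are non-negative and below keyBound.
    static int[] countFlat(int[] keys, int keyBound) {
        int[] counts = new int[keyBound];
        for (int k : keys) {
            counts[k]++;
        }
        return counts;
    }
}
```

The flat variant only fits when the key domain is small and dense; when it is, it is both the faster and the simpler shape.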

## Hard rules

- Do not micro-optimize a fundamentally wrong algorithm.
- Do not defend a perf change with style arguments alone.
- Do not claim “faster” without a measurement path.
- Do not claim “JIT will optimize this” without checking inlining / compilation evidence.
- Do not add a specialized library until you know what property it buys: fewer allocations, fewer copies, lower contention, off-heap layout, better primitive support, or a stronger algorithm.
- Do not keep elegant-but-generic stream pipelines in verified hot loops.
- Do not pay interface / visitor / wrapper overhead inside the hottest loop unless evidence shows it disappears.
- Do not default to boxed `Map<K, V>` / `Set<T>` / `List<T>` shapes when primitive collections or flat arrays better fit the dominant path.

## Design checklist

Ask these first:
- What are `N`, `Q`, the update/query ratio, and the memory budget?
- Is the main problem asymptotic complexity, cache locality, allocation pressure, branchiness, contention, or I/O?
- What operation dominates: membership, counting, top-k, range query, join, shortest path, DP transition, parsing, encoding?
- Can the key/value/state space stay primitive or bit-packed?
- Can the workload become offline, batched, sorted, prefix-based, or compressed?
- What allocates on the steady-state path?
- What copies bytes, chars, arrays, or collections?
- What materializes intermediate state that could stay streamed or cursor-based?
- What dispatch stays virtual or megamorphic in the inner loop?
- What loop shape blocks scalar replacement, inlining, or SuperWord vectorization?
- What “generic” branch handles cases the active workload never uses?

## Workflow

### 0) Pick the algorithmic shape

- Estimate the real workload: input size, query count, mutation pattern, latency target, and memory ceiling.
- Choose the algorithm and data structure before tuning loop syntax.
- Favor contiguous, cache-friendly, primitive-heavy representations when semantics allow.
- For dynamic programming, define state, transition cost, base case, iteration order, and whether state compression is possible.
- For graph/range/string problems, look for offline transforms, prefix structures, monotonic structures, or specialized search before hand-tuning.

Read these only when relevant:
- [references/algorithms-data-structures.md](references/algorithms-data-structures.md) for algorithm and data-structure selection.
- [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) for dynamic programming and advanced problem-solving patterns.

### 1) Shape the code for HotSpot

- Split hot and cold paths.
- Hoist invariant checks and decoding outside the loop.
- Replace generic callback stacks with narrow-path adapters.
- Reuse mutable carriers only when ownership is clear.
- Keep loop bodies predictable, contiguous, and exception-light.
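
A minimal sketch of the hot/cold split, using a hypothetical unsigned-int parser: the common all-digit case is a tight, allocation-free loop, and anything unusual exits to a cold method (overflow handling is omitted; this is a shape illustration, not a production parser).

```java
class FastParse {
    static int parseUnsignedInt(String s) {
        int n = s.length();
        if (n == 0) {
            return parseSlowPath(s);
        }
        int value = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return parseSlowPath(s); // cold fallback, kept out of the hot loop
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    // General, allocation-tolerant handling (whitespace, errors) stays cold.
    private static int parseSlowPath(String s) {
        return Integer.parseInt(s.trim());
    }
}
```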

Detailed rules: see [references/coding-rules.md](references/coding-rules.md).

### 2) Measure

If you are in this RDF4J repo, use the local benchmark wrapper first:

```bash
scripts/run-single-benchmark.sh --module <module> --class <fqcn> --method <benchmarkMethod>
```

If you are outside RDF4J, use JMH or an existing reproducible micro/macro benchmark.

Measurement workflow: see [references/evidence-workflow.md](references/evidence-workflow.md).

### 3) Explain with JVM evidence

When a benchmark moves, inspect what HotSpot actually did:
- compilation tier
- inlining success/failure
- intrinsic usage when relevant
- allocation pressure
- assembly / C2 logs when needed

Use sibling skill [hotspot-jit-forensics](../hotspot-jit-forensics/SKILL.md) for method-scoped C2 evidence. Use `async-profiler-java-macos` when wall/cpu/alloc evidence is needed on macOS.

### 4) Use libraries intentionally

- Prefer the JDK first when it is close enough and operationally simpler.
- Reach for specialized libraries when they remove boxing, copies, parser overhead, contention, or off-heap indirection the JDK cannot.
- Check dependency health before adding a new library.
- Benchmark the library choice against the simplest credible in-repo baseline.

Library reference: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md).

### 5) Report honestly

Frame conclusions as:
- hypothesis
- algorithm/data-structure choice
- benchmark result
- JIT/profile evidence
- confidence

If assembly is unavailable, say so and fall back to compilation logs, inlining diagnostics, and profile data.

## Trigger examples

Use this skill when the user asks to:
- remove allocation pressure from a parser, iterator, encoder, decoder, or query loop
- make a Java path zero-copy or lazy
- choose the right data structure for a Java workload
- solve a dynamic programming, graph, interval, ranking, or range-query problem in Java under performance constraints
- replace boxed collections with primitive or cache-friendly structures
- choose between the JDK and specialized Java performance libraries
- specialize code for one workload instead of many
- explain whether a HotSpot optimization actually happened
- ground a Java perf change in benchmark + C2 evidence

## Reference map

- Algorithms and data structures: [references/algorithms-data-structures.md](references/algorithms-data-structures.md)
- Advanced coding techniques: [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md)
- High-performance Java libraries: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md)
- Coding rules: [references/coding-rules.md](references/coding-rules.md)
- Evidence workflow: [references/evidence-workflow.md](references/evidence-workflow.md)
- JDK version guardrails: [references/jdk-21-26-notes.md](references/jdk-21-26-notes.md)

## Output contract

When you use this skill, the answer should usually include:
- workload model and asymptotic bottleneck
- algorithm and data-structure recommendation
- hot-path hypothesis
- concrete code-shape recommendation
- library recommendation when a library meaningfully changes the design
- benchmark command or benchmark evidence
- JIT/profile evidence or the missing prerequisite
- a confidence statement tied to the active JDK
4 changes: 4 additions & 0 deletions .codex/skills/high-performance-java/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "High-Performance Java"
  short_description: "Hot-path Java plus algorithm/perf-library guidance"
  default_prompt: "Use $high-performance-java to choose the right algorithm, data structure, library, and HotSpot-friendly code shape for a high-performance Java task."
@@ -0,0 +1,220 @@
# Advanced Coding Techniques

Use this reference when the problem needs more than basic loops and collections: dynamic programming, advanced search, state compression, offline transforms, or optimization patterns that materially change runtime.

## Dynamic programming checklist

Before writing code, define:
- state: the minimum information needed to continue
- transition: how one state moves to the next
- base case: the smallest solved states
- order: top-down memoization or bottom-up tabulation
- objective: min, max, count, feasibility, reconstruction
- memory plan: full table, rolling rows, bitset, or sparse map

If any of those are fuzzy, the DP is not ready.
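
As a worked illustration of the checklist (names are hypothetical), minimum-coin change with each item answered in a comment:

```java
import java.util.Arrays;

class CoinChange {
    // Assumes positive coin values and a non-negative amount.
    static int minCoins(int[] coins, int amount) {
        final int INF = Integer.MAX_VALUE / 2;       // sentinel, not a wrapper object
        int[] dp = new int[amount + 1];              // state: best count per sub-amount
        Arrays.fill(dp, INF);
        dp[0] = 0;                                   // base case: zero coins reach 0
        for (int a = 1; a <= amount; a++) {          // order: bottom-up tabulation
            for (int c : coins) {                    // transition: spend one coin
                if (c <= a) {
                    dp[a] = Math.min(dp[a], dp[a - c] + 1);
                }
            }
        }
        return dp[amount] >= INF ? -1 : dp[amount];  // objective: min, -1 if unreachable
        // memory plan: full 1D table; no compression needed at this size
    }
}
```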

## DP implementation bias in Java

- Prefer flat primitive arrays over nested object graphs.
- Flatten `dp[row][col]` into one array when locality matters.
- Use sentinel values (`INF`, `-1`, impossible masks) instead of wrapper objects.
- Compress dimensions aggressively when a transition only needs prior rows or prior prefixes.
- Use iterative tabulation when recursion depth or call overhead is risky.
- Use memoization when the reachable state space is sparse or pruning is strong.

## Common DP families

### 1D DP

Use for:
- linear decisions
- prefix optimization
- classic knapsack-style transitions

Java notes:
- Often compresses to one array.
- Direction matters: reverse iterate for 0/1 knapsack; forward iterate for unbounded knapsack.
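
The direction rule above, as a side-by-side sketch (hypothetical names): 0/1 knapsack iterates capacity downward so each item is counted at most once; unbounded iterates upward so an item can feed its own reuse.

```java
class Knapsack {
    static int zeroOne(int[] w, int[] v, int cap) {
        int[] dp = new int[cap + 1];
        for (int i = 0; i < w.length; i++) {
            for (int c = cap; c >= w[i]; c--) {      // reverse: item taken once
                dp[c] = Math.max(dp[c], dp[c - w[i]] + v[i]);
            }
        }
        return dp[cap];
    }

    static int unbounded(int[] w, int[] v, int cap) {
        int[] dp = new int[cap + 1];
        for (int i = 0; i < w.length; i++) {
            for (int c = w[i]; c <= cap; c++) {      // forward: item reusable
                dp[c] = Math.max(dp[c], dp[c - w[i]] + v[i]);
            }
        }
        return dp[cap];
    }
}
```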

### 2D grid / sequence DP

Use for:
- edit distance
- LCS variants
- path counting
- interval composition

Java notes:
- Two rolling rows often replace the full matrix.
- Keep row-major iteration consistent with memory layout.
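
A rolling-row sketch for edit distance (hypothetical names): two reusable rows replace the full `(n+1) x (m+1)` matrix, and the rows are swapped rather than reallocated.

```java
class EditDistance {
    static int distance(String a, String b) {
        int m = b.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j;                              // base row: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                              // base column: i deletions
            for (int j = 1; j <= m; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] t = prev;                           // swap rows, no reallocation
            prev = curr;
            curr = t;
        }
        return prev[m];
    }
}
```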

### Interval DP

Use for:
- merge cost
- matrix chain multiplication
- optimal parenthesization
- palindrome partitioning

Heuristic:
- Try increasing interval length order.
- Precompute reusable range costs.

### Tree DP

Use for:
- subtree aggregation
- rerooting
- independent set / matching variants on trees

Java notes:
- Iterative traversal can avoid stack overflow.
- Store parent/index arrays once; reuse buffers for passes.

### DAG DP

Use for:
- longest path in DAG
- path counts
- dependency-ordered optimization

Heuristic:
- Topological order first, transitions second.

### Bitmask DP

Use for:
- small `n` subset problems
- travelling-salesman-style state
- assignment and partition variants

Java notes:
- Use `int` masks up to 31 bits, `long` masks up to 63.
- Precompute subset transitions when reused heavily.
- Beware exponential memory growth; consider meet-in-the-middle.
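
A Held-Karp style sketch of bitmask DP (hypothetical names): `dp[mask][last]` is the cheapest path that visits exactly the cities in `mask` and ends at `last`, starting from city 0. Memory is `O(2^n * n)`, so this only fits small `n`.

```java
import java.util.Arrays;

class TspBitmask {
    static int shortestTour(int[][] dist) {
        int n = dist.length;
        final int INF = Integer.MAX_VALUE / 2;
        int[][] dp = new int[1 << n][n];
        for (int[] row : dp) {
            Arrays.fill(row, INF);
        }
        dp[1][0] = 0;                                     // start at city 0
        for (int mask = 1; mask < (1 << n); mask++) {
            for (int last = 0; last < n; last++) {
                if (dp[mask][last] >= INF || (mask & (1 << last)) == 0) continue;
                for (int next = 0; next < n; next++) {
                    if ((mask & (1 << next)) != 0) continue;
                    int m2 = mask | (1 << next);          // extend the visited set
                    dp[m2][next] = Math.min(dp[m2][next], dp[mask][last] + dist[last][next]);
                }
            }
        }
        int best = INF;
        int full = (1 << n) - 1;
        for (int last = 1; last < n; last++) {
            best = Math.min(best, dp[full][last] + dist[last][0]); // return home
        }
        return best;
    }
}
```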

### Digit DP

Use for:
- counting numbers with digit constraints
- lexicographic numeric constraints

State usually includes:
- position
- tight/limited flag
- started/leading-zero flag
- problem-specific accumulator

## DP optimization patterns

### Prefix/suffix acceleration

If a transition scans prior states, ask whether prefix minima/maxima/sums can reduce it from `O(n^2)` to `O(n)`.
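
A before/after sketch of that reduction (hypothetical names), for the transition `dp[i] = min over j < i of dp[j] + cost[i]`: the inner scan collapses to a running prefix minimum.

```java
class PrefixDp {
    // Naive O(n^2) shape: rescans all prior states per transition.
    static long[] naive(long[] cost) {
        int n = cost.length;
        long[] dp = new long[n];
        dp[0] = cost[0];
        for (int i = 1; i < n; i++) {
            long best = Long.MAX_VALUE;
            for (int j = 0; j < i; j++) {
                best = Math.min(best, dp[j]);
            }
            dp[i] = best + cost[i];
        }
        return dp;
    }

    // O(n) shape: carry min(dp[0..i-1]) forward instead of rescanning.
    static long[] fast(long[] cost) {
        int n = cost.length;
        long[] dp = new long[n];
        dp[0] = cost[0];
        long prefixMin = dp[0];
        for (int i = 1; i < n; i++) {
            dp[i] = prefixMin + cost[i];
            prefixMin = Math.min(prefixMin, dp[i]);
        }
        return dp;
    }
}
```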

### Monotonic queue optimization

Use when transitions need min/max over a sliding window.
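
A minimal sliding-window-minimum sketch (hypothetical names): a deque holds indices whose values are kept strictly increasing, so the window minimum is always at the front and each index is pushed and popped at most once.

```java
import java.util.ArrayDeque;

class WindowMin {
    // Assumes 1 <= k <= a.length.
    static int[] minPerWindow(int[] a, int k) {
        int[] out = new int[a.length - k + 1];
        ArrayDeque<Integer> dq = new ArrayDeque<>();     // indices, values increasing
        for (int i = 0; i < a.length; i++) {
            while (!dq.isEmpty() && a[dq.peekLast()] >= a[i]) {
                dq.pollLast();                           // dominated candidates leave
            }
            dq.addLast(i);
            if (dq.peekFirst() <= i - k) {
                dq.pollFirst();                          // front slid out of the window
            }
            if (i >= k - 1) {
                out[i - k + 1] = a[dq.peekFirst()];
            }
        }
        return out;
    }
}
```

In a verified hot path, the boxed `ArrayDeque<Integer>` would itself be replaced with an `int[]` ring buffer, per the primitive-layout bias above.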

### Divide-and-conquer DP optimization

Use when the optimal split point is monotonic across rows or columns.

### Convex hull trick / Li Chao tree

Use when transitions are of the form:
- `dp[i] = min_j(m[j] * x[i] + b[j])`
- `max` variant of the same

Only use when the algebra really matches.

### Bitset DP

Use when boolean subset transitions can become word-parallel bit operations.

Examples:
- subset sum
- knapsack feasibility
- reachability layers
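
A subset-sum feasibility sketch in word-parallel form (hypothetical names). A single `long` covers targets up to 63, because Java masks shift distances for `long` to 0-63; larger targets need a `long[]` or a shifted `BitSet`.

```java
class SubsetSumBits {
    // Assumes 0 < w < 64 for every weight and 0 <= target < 64.
    static boolean reachable(int[] weights, int target) {
        long bits = 1L;                 // bit s set <=> sum s is reachable
        for (int w : weights) {
            bits |= bits << w;          // "skip or take w": 64 sums per operation
        }
        return (bits >>> target & 1L) == 1L;
    }
}
```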

### State compression

Reduce dimensions by:
- keeping only prior row/column
- encoding booleans into bits
- coordinate-compressing sparse values
- using ids instead of objects

## Search and optimization patterns

### Binary search on answer

Use when:
- feasibility is monotonic
- exact objective is hard but checking a threshold is easier
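
A sketch of the pattern (hypothetical names): split an array into at most `k` contiguous parts minimizing the largest part sum. Feasibility is monotonic in the allowed cap, so the answer is bisected while each check is a linear greedy pass.

```java
class SplitArray {
    static long minLargestSum(int[] a, int k) {
        long lo = 0, hi = 0;
        for (int x : a) {
            lo = Math.max(lo, x);            // cap must fit the largest element
            hi += x;                         // one part holds everything
        }
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (feasible(a, k, mid)) hi = mid;
            else lo = mid + 1;
        }
        return lo;
    }

    // Greedy check: pack left to right, opening a new part only when forced.
    private static boolean feasible(int[] a, int k, long cap) {
        int parts = 1;
        long sum = 0;
        for (int x : a) {
            if (sum + x > cap) {
                parts++;
                sum = x;
            } else {
                sum += x;
            }
        }
        return parts <= k;
    }
}
```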

### Meet-in-the-middle

Use when:
- brute force is `2^n`
- `n` is small enough to split into two `2^(n/2)` halves

### Branch and bound

Use when:
- you can compute tight upper/lower bounds
- a good heuristic ordering prunes much of the tree

### Iterative deepening

Use when:
- memory is tight
- solution depth is unknown but usually shallow

### Offline query processing

Use when:
- query order is irrelevant
- sorting queries/events lets you reuse structure updates

## Greedy and exchange-thinking

Before building DP or search, test whether a greedy proof exists:
- local choice stays globally optimal
- exchange argument repairs any non-greedy optimal solution
- matroid-like or interval-scheduling structure is present

If greedy works, it often beats DP both asymptotically and operationally.

## Range and sequence patterns

- Sliding window: monotonic boundary expansion or contraction.
- Two pointers: sorted arrays, pair/triple sums, dedup, partitioning.
- Monotonic stack: next greater/smaller, histogram, span problems.
- Difference arrays: batch range updates.
- Prefix sums / xor / hashes: cheap repeated range queries.
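
One of the patterns above as a sketch (hypothetical names): a difference array turns each range-add update into two writes, with a single prefix pass recovering the final values.

```java
class DiffArray {
    // Each update is {lo, hi, delta} with hi inclusive and 0 <= lo <= hi < n.
    static int[] applyRangeAdds(int n, int[][] updates) {
        int[] diff = new int[n + 1];
        for (int[] u : updates) {
            diff[u[0]] += u[2];              // delta starts at lo
            diff[u[1] + 1] -= u[2];          // and stops after hi
        }
        int[] out = new int[n];
        int run = 0;
        for (int i = 0; i < n; i++) {
            run += diff[i];                  // prefix pass materializes values
            out[i] = run;
        }
        return out;
    }
}
```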

## Java-specific implementation notes

- Avoid recursion for deep graphs, trees, or DP unless the depth bound is small.
- Replace tuple objects with parallel arrays or packed longs in hot paths.
- Pre-size arrays and reusable buffers for repeated test cases.
- Be explicit about overflow; use `long` for counts/costs unless `int` is proven safe.
- Separate correctness code from hot code paths once the algorithm is clear.
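
The packed-long replacement for tuple objects can be sketched as (hypothetical names; assumes each half fits in 32 bits):

```java
class Packed {
    static long pack(int hi, int lo) {
        // Mask lo to stop sign extension from clobbering the high half.
        return ((long) hi << 32) | (lo & 0xFFFF_FFFFL);
    }

    static int high(long p) {
        return (int) (p >>> 32);
    }

    static int low(long p) {
        return (int) p;
    }
}
```

Packed longs sort, hash, and store in flat `long[]` arrays with no per-pair allocation, which is the point of the substitution.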

## Problem-solving ladder

When stuck, try this order:
1. Can I sort or batch the work?
2. Can I precompute prefix, suffix, or compressed state?
3. Can a different data structure remove a nested loop?
4. Is the problem actually graph, interval, or DP in disguise?
5. Can the state shrink to primitives or bits?
6. Can I prove greedy, monotonicity, or convexity?

## Red flags

- DP state includes fields that do not affect future transitions.
- Memoization key is a heavyweight object when a few ints suffice.
- Full `O(n^2)` table retained even though only one frontier is used.
- Search explores symmetric states repeatedly.
- A library data structure is used where a flat array plus sort is enough.