| 1 | +--- |
| 2 | +name: high-performance-java |
| 3 | +description: Use when writing, reviewing, or reshaping HotSpot Java where algorithmic complexity, data-structure choice, throughput, latency, allocation rate, zero-copy, lazy evaluation, non-materialization, runtime specialization, query-engine code generation, Janino, primitive collections, performance libraries, intrinsics, SuperWord auto-vectorization, or C2 assembly matter. Also use for advanced algorithmic problem solving in Java, including dynamic programming, graph/range techniques, cache-aware code shape, and choosing between interpreted, vectorized, and compiled execution paths. Bias toward asymptotic wins first, then the right execution model, then specialized hot-path code, then benchmark and JIT evidence. |
| 4 | +--- |
| 5 | + |
| 6 | +# High-Performance Java |
| 7 | + |
| 8 | +Use this skill for Java hot paths, algorithm-heavy Java, and JVM-side runtime specialization. Default bias: asymptotic win first, then the right execution model, then fewer allocations, fewer copies, less polymorphism, narrower code shape, stronger evidence. |
| 9 | + |
| 10 | +This v1 covers HotSpot only. Baseline assumptions: |
| 11 | +- repo baseline: JDK 21 |
| 12 | +- current local runtime may be newer |
| 13 | +- low-level claims stay provisional until benchmark + JIT evidence agree |
| 14 | +- algorithm/data-structure claims stay provisional until they match the actual workload constraints |
| 15 | +- runtime codegen claims stay provisional until cold-start cost, warm steady-state behavior, and fallback behavior are all understood |
| 16 | + |
| 17 | +## Core loop |
| 18 | + |
| 19 | +1. Identify the workload shape and constraints. |
| 20 | +2. Pick the algorithm and data structure that change the slope. |
| 21 | +3. Decide whether the workload should stay interpreted, become vectorized/batched, or justify runtime specialization/code generation. |
| 22 | +4. Find the hot loop, hot call chain, or hot operator pipeline. |
| 23 | +5. Write the narrow fast path first. |
| 24 | +6. Push generic abstraction, materialization, and dispatch out of the loop. |
| 25 | +7. Benchmark before claiming improvement. |
| 26 | +8. Inspect HotSpot decisions before claiming JVM-level reasons. |
| 27 | + |
| 28 | +## Default coding bias |
| 29 | + |
| 30 | +- Prefer an algorithmic win over a micro win. |
| 31 | +- Prefer data structures that fit the operation mix, memory budget, and key domain. |
| 32 | +- Prefer the right execution model over reflexively adding code generation. |
| 33 | +- Prefer primitive-friendly layouts over boxed object graphs. |
| 34 | +- Prefer zero-copy over copy-transform-copy. |
| 35 | +- Prefer reuse over per-item allocation. |
| 36 | +- Prefer lazy traversal over full materialization. |
| 37 | +- Prefer primitives, flat arrays, and tight counted loops in hot paths. |
| 38 | +- Prefer monomorphic calls that inline away. |
| 39 | +- Prefer specialized lambda/adapter code for the active workload. |
| 40 | +- Prefer one fast path plus one cold fallback over a single generalized hot path. |
| 41 | +- Prefer Janino only when generated Java can stay simple, code size can stay bounded, and compile cost can be amortized. |
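A minimal sketch of the primitive-layout bias above, assuming a byte-valued key domain (0..255). The class and method names (`ByteHistogram`, `countBoxed`, `countFlat`) are hypothetical and only illustrate the shape; whether the flat version is actually faster on a given workload still needs benchmark and JIT evidence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class; not part of any library or of this repo.
final class ByteHistogram {

    // Boxed baseline: every update goes through Integer boxing, hashing,
    // and a map node lookup.
    static Map<Integer, Integer> countBoxed(byte[] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (byte b : data) {
            counts.merge(b & 0xFF, 1, Integer::sum);
        }
        return counts;
    }

    // Primitive shape: the key domain is 0..255, so a flat int[] is the table.
    // One allocation up front, then a tight counted loop over contiguous memory.
    static int[] countFlat(byte[] data) {
        int[] counts = new int[256];
        for (int i = 0; i < data.length; i++) {
            counts[data[i] & 0xFF]++;
        }
        return counts;
    }
}
```

The flat version only works because the key domain is small and dense; a sparse or unbounded key domain would point toward a primitive hash map instead.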
| 42 | + |
| 43 | +## Hard rules |
| 44 | + |
| 45 | +- Do not micro-optimize a fundamentally wrong algorithm. |
| 46 | +- Do not defend a perf change with style arguments alone. |
| 47 | +- Do not claim “faster” without a measurement path. |
| 48 | +- Do not claim “JIT will optimize this” without checking inlining / compilation evidence. |
| 49 | +- Do not add a specialized library until you know what property it buys: fewer allocations, fewer copies, lower contention, off-heap layout, better primitive support, stronger compilation/runtime specialization, or a stronger algorithm. |
| 50 | +- Do not introduce Janino or other runtime codegen unless compile latency, cache keys, code-size limits, classloader lifetime, and fallback behavior are explicit. |
| 51 | +- Do not compile entire query plans blindly when only a subset of operators is hot or fusible. |
| 52 | +- Do not generate fancy modern Java syntax for Janino unless support is verified on the active Janino/runtime combination; conservative generated Java is the default. |
| 53 | +- Do not keep elegant-but-generic stream pipelines in verified hot loops. |
| 54 | +- Do not pay interface / visitor / wrapper overhead inside the hottest loop unless evidence shows it disappears. |
| 55 | +- Do not default to boxed `Map<K, V>` / `Set<T>` / `List<T>` shapes when primitive collections or flat arrays better fit the dominant path. |
| 56 | + |
| 57 | +## Design checklist |
| 58 | + |
| 59 | +Ask these first: |
| 60 | +- What are `N`, `Q`, the update/query ratio, and the memory budget? |
| 61 | +- Is the main problem asymptotic complexity, cache locality, allocation pressure, branchiness, contention, I/O, or execution-model overhead? |
| 62 | +- What operation dominates: membership, counting, top-k, range query, join, shortest path, DP transition, parsing, encoding, filter/projection evaluation, aggregation, or tuple materialization? |
| 63 | +- Can the key/value/state space stay primitive or bit-packed? |
| 64 | +- Can the workload become offline, batched, sorted, prefix-based, vectorized, or compressed? |
| 65 | +- What allocates on the steady-state path? |
| 66 | +- What copies bytes, chars, arrays, or collections? |
| 67 | +- What materializes intermediate state that could stay streamed or cursor-based? |
| 68 | +- What dispatch stays virtual or megamorphic in the inner loop? |
| 69 | +- What loop shape blocks scalar replacement, inlining, or SuperWord vectorization? |
| 70 | +- What “generic” branch handles cases the active workload never uses? |
| 71 | +- How often will a generated shape execute, and can compile cost be amortized? |
| 72 | +- Can compiled artifacts be cached by normalized shape, types, nullability, and algorithm choice? |
| 73 | +- What is the fallback path for cold queries, oversized generated code, compile failure, or classloader churn? |
| 74 | +- What method-size, class-size, or constant-pool limits could the generated code hit? |
| 75 | +- Who owns generated classes, caches, and classloaders over time? |
| 76 | + |
| 77 | +## Workflow |
| 78 | + |
| 79 | +### 0) Pick the algorithmic shape |
| 80 | + |
| 81 | +- Estimate the real workload: input size, query count, mutation pattern, latency target, and memory ceiling. |
| 82 | +- Choose the algorithm and data structure before tuning loop syntax. |
| 83 | +- Favor contiguous, cache-friendly, primitive-heavy representations when semantics allow. |
| 84 | +- For dynamic programming, define state, transition cost, base case, iteration order, and whether state compression is possible. |
| 85 | +- For graph/range/string problems, look for offline transforms, prefix structures, monotonic structures, or specialized search before hand-tuning. |
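The dynamic-programming checklist above can be sketched on a textbook case. This is an illustrative 0/1 knapsack, chosen here as an assumption (the source names no specific problem), with the state compressed from a 2D table to a single `long[]`. `Knapsack` and `maxValue` are hypothetical names, and weights are assumed non-negative.

```java
// Hypothetical example class illustrating state, transition, base case,
// iteration order, and state compression for 0/1 knapsack.
final class Knapsack {

    // State:      best[c] = max total value with capacity c, using the items seen so far.
    // Base case:  best[c] = 0 for every c (no items considered yet).
    // Transition: best[c] = max(best[c], best[c - w] + v), O(1) per (item, capacity) pair.
    // Cost:       O(items * capacity) time, O(capacity) memory after compression.
    static long maxValue(int[] weights, long[] values, int capacity) {
        long[] best = new long[capacity + 1];
        for (int i = 0; i < weights.length; i++) {
            int w = weights[i];
            long v = values[i];
            // Iteration order matters: descending capacity reads best[c - w]
            // before this item has updated it, so each item is used at most once.
            for (int c = capacity; c >= w; c--) {
                long candidate = best[c - w] + v;
                if (candidate > best[c]) {
                    best[c] = candidate;
                }
            }
        }
        return best[capacity];
    }
}
```

Iterating capacity in ascending order instead would turn this into the unbounded variant, which is why the iteration order belongs in the design notes and not only in the code.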
| 86 | + |
| 87 | +Read these only when relevant: |
| 88 | +- [references/algorithms-data-structures.md](references/algorithms-data-structures.md) for algorithm and data-structure selection. |
| 89 | +- [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) for dynamic programming and advanced problem-solving patterns. |
| 90 | + |
| 91 | +### 1) Choose the execution model before shaping the code |
| 92 | + |
| 93 | +- Ask whether the path should stay interpreted, become vectorized/batched, or justify runtime code generation. |
| 94 | +- Prefer interpretation for cold, one-shot, or highly irregular workloads when compile latency will dominate. |
| 95 | +- Prefer vectorization/batching when cache-miss hiding, SIMD-friendly processing, or blocking operator boundaries dominate. |
| 96 | +- Prefer runtime code generation when the same shape executes repeatedly, per-tuple overhead dominates, and generated code can stay narrow and bounded. |
| 97 | +- In query engines, fuse straight pipelines first; split at blocking operators, large mutable state, code-size pressure, or unstable branches. |
| 98 | +- If Janino is chosen, keep generated Java conservative, keep helper methods small, and plan for cache + fallback from the start. |
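A minimal sketch of the "cache + fallback from the start" idea, under stated assumptions: `SpecializationCache`, `compiler`, `interpreter`, and `shapeKey` are hypothetical names, and the `compiler` function is only a stand-in for whatever Janino-backed step a real engine would use. No actual Janino calls appear here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: caches one specialized operator per normalized shape key
// and falls back to a generic path when specialization fails.
final class SpecializationCache {

    private final ConcurrentHashMap<String, LongUnaryOperator> cache = new ConcurrentHashMap<>();
    // Assumption: stand-in for a runtime compiler; may throw on oversized or invalid code.
    private final Function<String, LongUnaryOperator> compiler;
    // Assumption: stand-in for the generic interpreted path; expected not to throw.
    private final Function<String, LongUnaryOperator> interpreter;

    SpecializationCache(Function<String, LongUnaryOperator> compiler,
                        Function<String, LongUnaryOperator> interpreter) {
        this.compiler = compiler;
        this.interpreter = interpreter;
    }

    // shapeKey is assumed to be already normalized (shape, types, nullability, algorithm).
    LongUnaryOperator lookup(String shapeKey) {
        return cache.computeIfAbsent(shapeKey, key -> {
            try {
                return compiler.apply(key);
            } catch (RuntimeException e) {
                // The fallback is cached too, so a failing shape is not recompiled on every call.
                return interpreter.apply(key);
            }
        });
    }
}
```

This sketch leaves out two things the rules above still require: the map is unbounded, so a real design needs a size limit and eviction, and nothing here owns the classloaders of generated classes.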
| 99 | + |
| 100 | +Detailed guidance: see [references/codegen-and-janino.md](references/codegen-and-janino.md). |
| 101 | + |
| 102 | +### 2) Shape the code for HotSpot |
| 103 | + |
| 104 | +- Split hot and cold paths. |
| 105 | +- Hoist invariant checks and decoding outside the loop. |
| 106 | +- Replace generic callback stacks with narrow-path adapters. |
| 107 | +- Reuse mutable carriers only when ownership is clear. |
| 108 | +- Keep loop bodies predictable, contiguous, and exception-light. |
| 109 | +- For generated code, favor explicit loops, primitive locals/fields, simple helper methods, and stable call targets. |
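The hot/cold split and hoisting rules above can be sketched on a small parser. This is an illustrative example with hypothetical names (`DigitParser`, `parseInt`, `parseSlow`): the hot path handles one to nine plain ASCII digits, and everything else (signs, long inputs, invalid bytes) is delegated to a separate cold method. Whether HotSpot inlines and compiles it as hoped remains something to verify.

```java
import java.nio.charset.StandardCharsets;
import java.util.Objects;

// Hypothetical example class: narrow fast path plus a cold fallback.
final class DigitParser {

    // Parses buf[from, to) with the semantics of Integer.parseInt on ASCII text.
    static int parseInt(byte[] buf, int from, int to) {
        // Hoisted: one range check up front instead of per-element validation.
        Objects.checkFromToIndex(from, to, buf.length);
        int length = to - from;
        if (length == 0 || length > 9) {
            // Cold: empty input, or long enough that int overflow is possible.
            return parseSlow(buf, from, to);
        }
        int value = 0;
        for (int i = from; i < to; i++) {
            int digit = buf[i] - '0';
            if (digit < 0 || digit > 9) {
                // Cold: sign character or invalid byte.
                return parseSlow(buf, from, to);
            }
            value = value * 10 + digit; // at most 9 digits, so this cannot overflow
        }
        return value;
    }

    // Cold path kept out of the hot method body; allocates and may throw
    // NumberFormatException.
    private static int parseSlow(byte[] buf, int from, int to) {
        String text = new String(buf, from, to - from, StandardCharsets.US_ASCII);
        return Integer.parseInt(text);
    }
}
```

Keeping the allocation and the exception in `parseSlow` keeps the hot method small and allocation-free on the common input.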
| 110 | + |
| 111 | +Detailed rules: see [references/coding-rules.md](references/coding-rules.md). |
| 112 | + |
| 113 | +### 3) Measure |
| 114 | + |
| 115 | +If you are in this RDF4J repo, use the local benchmark wrapper first: |
| 116 | + |
| 117 | +```bash |
| 118 | +scripts/run-single-benchmark.sh --module <module> --class <fqcn> --method <benchmarkMethod> |
| 119 | +``` |
| 120 | + |
| 121 | +If you are outside RDF4J, use JMH or an existing reproducible micro/macro benchmark. |
| 122 | + |
| 123 | +Measurement workflow: see [references/evidence-workflow.md](references/evidence-workflow.md). |
| 124 | + |
| 125 | +### 4) Explain with JVM evidence |
| 126 | + |
| 127 | +When a benchmark moves, inspect what HotSpot actually did: |
| 128 | +- compilation tier |
| 129 | +- inlining success/failure |
| 130 | +- intrinsic usage when relevant |
| 131 | +- allocation pressure |
| 132 | +- assembly / C2 logs when needed |
| 133 | + |
| 134 | +Use sibling skill [hotspot-jit-forensics](../hotspot-jit-forensics/SKILL.md) for method-scoped C2 evidence. Use `async-profiler-java-macos` when wall/cpu/alloc evidence is needed on macOS. |
| 135 | + |
| 136 | +### 5) Use libraries intentionally |
| 137 | + |
| 138 | +- Prefer the JDK first when it is close enough and operationally simpler. |
| 139 | +- Reach for specialized libraries when they remove boxing, copies, parser overhead, contention, off-heap indirection, or runtime compilation friction the JDK cannot. |
| 140 | +- Check dependency health before adding a new library. |
| 141 | +- Benchmark the library choice against the simplest credible in-repo baseline. |
| 142 | + |
| 143 | +Library reference: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md). |
| 144 | + |
| 145 | +### 6) Report honestly |
| 146 | + |
| 147 | +Frame conclusions as: |
| 148 | +- hypothesis |
| 149 | +- algorithm/data-structure choice |
| 150 | +- execution-model choice |
| 151 | +- benchmark result |
| 152 | +- JIT/profile evidence |
| 153 | +- confidence |
| 154 | + |
| 155 | +If assembly is unavailable, say so and fall back to compilation logs, inlining diagnostics, and profile data. |
| 156 | + |
| 157 | +## Trigger examples |
| 158 | + |
| 159 | +Use this skill when the user asks to: |
| 160 | +- remove allocation pressure from a parser, iterator, encoder, decoder, or query loop |
| 161 | +- make a Java path zero-copy or lazy |
| 162 | +- choose the right data structure for a Java workload |
| 163 | +- solve a dynamic programming, graph, interval, ranking, or range-query problem in Java under performance constraints |
| 164 | +- replace boxed collections with primitive or cache-friendly structures |
| 165 | +- choose between the JDK and specialized Java performance libraries |
| 166 | +- decide whether a query engine should stay interpreted, become vectorized, or use Janino/runtime code generation |
| 167 | +- design generated Java for projections, filters, joins, aggregations, or expression evaluators |
| 168 | +- specialize code for one workload instead of many |
| 169 | +- explain whether a HotSpot optimization actually happened |
| 170 | +- ground a Java perf change in benchmark + C2 evidence |
| 171 | + |
| 172 | +## Reference map |
| 173 | + |
| 174 | +- Algorithms and data structures: [references/algorithms-data-structures.md](references/algorithms-data-structures.md) |
| 175 | +- Advanced coding techniques: [references/advanced-coding-techniques.md](references/advanced-coding-techniques.md) |
| 176 | +- Codegen and Janino for query engines: [references/codegen-and-janino.md](references/codegen-and-janino.md) |
| 177 | +- High-performance Java libraries: [references/high-performance-java-libraries.md](references/high-performance-java-libraries.md) |
| 178 | +- Coding rules: [references/coding-rules.md](references/coding-rules.md) |
| 179 | +- Evidence workflow: [references/evidence-workflow.md](references/evidence-workflow.md) |
| 180 | +- JDK version guardrails: [references/jdk-21-26-notes.md](references/jdk-21-26-notes.md) |
| 181 | + |
| 182 | +## Output contract |
| 183 | + |
| 184 | +When you use this skill, the answer should usually include: |
| 185 | +- workload model and asymptotic bottleneck |
| 186 | +- execution-model recommendation: interpreted, vectorized/batched, Janino/runtime codegen, or another compilation path |
| 187 | +- algorithm and data-structure recommendation |
| 188 | +- hot-path hypothesis |
| 189 | +- concrete code-shape recommendation |
| 190 | +- cache/fallback plan when runtime codegen is part of the design |
| 191 | +- library recommendation when a library meaningfully changes the design |
| 192 | +- benchmark command or benchmark evidence |
| 193 | +- JIT/profile evidence or the missing prerequisite |
| 194 | +- a confidence statement tied to the active JDK |