Codegen and Janino for Query Engines

Use this reference when the question is not just “how do I make this loop faster?” but “should this JVM path be specialized or compiled at runtime at all?”

This file is especially relevant for:

  • query engines
  • expression evaluators
  • generated filters, projections, joins, and aggregations
  • repeated execution of the same logical shape with different bindings
  • deciding between interpretation, vectorization, and runtime code generation

First decision: should you use runtime codegen?

Ask these before introducing Janino or any other runtime compiler:

  • Is the same plan or operator shape executed enough times to amortize compile cost?
  • Is the bottleneck per-tuple interpreter/dispatch overhead rather than a worse algorithm or poor data layout?
  • Can the generated code stay small, simple, and stable enough to compile quickly?
  • Can you cache by normalized plan shape, types, nullability, and algorithm choice?
  • Do you already have a correct interpreted or vectorized fallback?

If the answers are mostly “no”, do not start with Janino.

Decision rule

Prefer interpretation when:

  • the workload is cold or one-shot
  • plans are highly irregular or rarely repeated
  • compile latency would dominate wall time
  • you still do not understand the real hot operators

Prefer vectorization or batching when:

  • you need better cache-miss hiding
  • SIMD-friendly bulk processing is available
  • pipelines naturally break at blocking operators
  • compile latency is hard to amortize

Prefer Janino/runtime codegen when:

  • the same shape runs repeatedly
  • per-row function-call / virtual-dispatch / boxing overhead dominates
  • the generated code can be primitive-heavy and monomorphic
  • the engine already has a clean IR/template layer and fallback path

What Janino is good at

Janino is a small, embeddable Java compiler designed for runtime use. Treat it as a pragmatic JVM tool for turning generated Java source into bytecode in memory, not as a drop-in replacement for javac.
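As a concrete illustration, a minimal scalar-predicate evaluator using Janino's `ExpressionEvaluator` API might look like the sketch below. It requires the `org.codehaus.janino:janino` dependency on the classpath, and the column names are invented:

```java
import org.codehaus.janino.ExpressionEvaluator;

public class JaninoExprDemo {
    public static void main(String[] args) throws Exception {
        // Compile the predicate once; reuse the compiled form per row.
        ExpressionEvaluator ee = new ExpressionEvaluator();
        ee.setParameters(new String[] {"price", "qty"},
                         new Class[] {double.class, int.class});
        ee.setExpressionType(boolean.class);
        ee.cook("price * qty > 100.0"); // parse + compile to in-memory bytecode

        // 12.5 * 10 = 125.0 > 100.0
        Object hit = ee.evaluate(new Object[] {12.5, 10});
        System.out.println(hit);
    }
}
```

The compile step (`cook`) is the expensive part; amortizing it across many `evaluate` calls is the whole point of the sections that follow.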

Good fits:

  • scalar expression evaluators
  • generated projections and filters
  • compact join/aggregation helpers
  • medium-size fused operator pipelines
  • metadata or dispatch classes that are expensive to interpret repeatedly

Bad fits:

  • giant whole-query classes without splitting
  • source that relies on modern Java syntax unless verified
  • designs that need very tight control over bytecode layout or native code generation
  • systems with no plan for classloader ownership, eviction, and cache pressure

What other Java-based engines and frameworks do

Spark SQL

Spark uses whole-stage Java code generation and compiles generated Java with Janino. The engine explicitly tracks compilation time and inspects generated bytecode statistics. It also splits generated code to stay under JVM method-size and constant-pool limits.

Design lesson:

  • generate fused Java for hot pipelines
  • but split aggressively before code size becomes a correctness or compile-time problem
  • treat compile time as a first-class metric, not as background noise

Flink Table/SQL runtime

Flink uses generated runtime classes broadly and also exposes Janino ExpressionEvaluator compilation for expressions. Flink caches compiled code because repeated Janino compilation creates new class loaders/classes and can become a metaspace/class-unloading bottleneck.

Design lesson:

  • cache compiled artifacts by code shape and classloader context
  • own classloader lifetime deliberately
  • watch for cache-related class leaks, not just compile speed

Apache Calcite

Calcite uses Janino for scalar RexNode compilation and for generated metadata dispatch handlers.

Design lesson:

  • Janino is useful even when you are not compiling whole pipelines
  • expression compilation and dispatch generation are often easier wins than compiling everything

Apache Drill

Drill supports both Janino and the JDK compiler, and by default uses Janino only below a configurable source-size threshold. Larger generated sources are handed to the JDK compiler.

Design lesson:

  • do not bind the engine to one compiler policy
  • use size-based or complexity-based compiler selection
  • have a fallback when Janino stops being the right tool
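A Drill-style size-based selection policy can be sketched in a few lines. The threshold value and names here are hypothetical, not Drill's actual defaults:

```java
public class CompilerPolicy {
    enum Backend { JANINO, JDK }

    // Hypothetical threshold; the real cutoff should come from configuration
    // and be tuned against observed compile latency and failure rates.
    static final int JANINO_MAX_SOURCE_CHARS = 256 * 1024;

    static Backend choose(String generatedSource) {
        return generatedSource.length() <= JANINO_MAX_SOURCE_CHARS
                ? Backend.JANINO
                : Backend.JDK;
    }

    public static void main(String[] args) {
        System.out.println(choose("class Tiny {}"));        // small source -> JANINO
        System.out.println(choose("x".repeat(300 * 1024))); // oversized -> JDK
    }
}
```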

What the research says

Foundational result

Compiled query execution can substantially outperform classic iterator-style interpretation because it reduces generic per-tuple overhead and gives the compiler a tighter, more optimizable control flow.

Important correction

Compiled execution is not a universal winner. Research comparing compiled and vectorized query engines found no single paradigm that always dominates.

More recent direction

State-of-the-art work focuses on:

  • reducing compilation latency
  • compiling only the fragments that pay for themselves
  • combining vectorized and compiled execution instead of forcing a binary choice
  • adaptive compilation decisions during execution

Practical conclusion:

  • “Use Janino everywhere” is not state of the art
  • “Never use Janino” is also wrong
  • the modern answer is selective, cached, fallback-friendly specialization

Query-engine design rules

1) Separate logical planning from codegen

Do not generate Java source directly from arbitrary logical plan trees.

Prefer:

  • logical plan
  • physical plan
  • small codegen IR / templates / operator fragments
  • generated Java only at the final step

This keeps code splitting, reuse, and fallback manageable.
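One way to keep codegen as the final step is an IR whose nodes can both interpret themselves (the fallback path) and emit a source fragment (the codegen path). A minimal sketch, with illustrative names:

```java
// Each IR node carries both behaviors, so the engine can run interpreted
// or hand the emitted fragment to the code generator at the last step.
interface Expr {
    double eval(double[] row);      // interpreted fallback
    String toJava(String rowVar);   // source fragment for the generated class
}

final class Col implements Expr {
    final int idx;
    Col(int idx) { this.idx = idx; }
    public double eval(double[] row) { return row[idx]; }
    public String toJava(String rowVar) { return rowVar + "[" + idx + "]"; }
}

final class Mul implements Expr {
    final Expr l, r;
    Mul(Expr l, Expr r) { this.l = l; this.r = r; }
    public double eval(double[] row) { return l.eval(row) * r.eval(row); }
    public String toJava(String rowVar) {
        return "(" + l.toJava(rowVar) + " * " + r.toJava(rowVar) + ")";
    }
}

public class IrDemo {
    public static void main(String[] args) {
        Expr revenue = new Mul(new Col(0), new Col(1));
        System.out.println(revenue.eval(new double[] {12.5, 10.0})); // 125.0
        System.out.println(revenue.toJava("row")); // (row[0] * row[1])
    }
}
```

Because both paths share one IR, a compile failure or cold query can fall back to `eval` without a separate plan representation.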

2) Compile fragments, not dogma

Good fusion targets:

  • scan -> filter -> project
  • probe-side inner loops
  • simple aggregate update loops
  • expression trees that are repeatedly evaluated

Good split points:

  • hash build / sort / materialization boundaries
  • highly branchy optional logic
  • very large generated state
  • code-size pressure
  • operators that already benefit from vectorized kernels

3) Keep generated Java conservative

For Janino, default to a conservative Java subset:

  • primitive locals and fields
  • explicit loops
  • explicit null checks
  • simple helper methods
  • predictable class shapes
  • minimal reflection
  • minimal generics in generated source

Do not assume support for newer Java syntax just because the runtime JDK is modern.
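A sketch of what "conservative" means in practice: a generator that emits explicit loops, primitive locals, and explicit null checks rather than streams, lambdas, or `var`. The kernel shape is illustrative:

```java
public class ConservativeTemplate {
    // Emits a selection-vector filter kernel in a deliberately plain Java
    // subset: explicit loop, primitive locals, explicit null handling.
    // The predicate fragment must come from validated IR, never raw user text.
    static String filterSource(String predicate) {
        return ""
            + "public int filter(double[] col, boolean[] isNull, int[] sel) {\n"
            + "    int n = 0;\n"
            + "    for (int i = 0; i < col.length; i++) {\n"
            + "        if (isNull[i]) continue;\n"
            + "        double v = col[i];\n"
            + "        if (" + predicate + ") { sel[n++] = i; }\n"
            + "    }\n"
            + "    return n;\n"
            + "}\n";
    }

    public static void main(String[] args) {
        System.out.print(filterSource("v > 100.0"));
    }
}
```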

4) Design for code-size limits up front

Watch these failure modes:

  • huge methods
  • giant static initializers
  • constant-pool pressure
  • large switch ladders
  • giant string-built source blobs

Countermeasures:

  • split helper methods early
  • split classes if state grows too large
  • move large literals/references into arrays or external holders
  • stop fusing once code-size pressure starts dominating
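A toy illustration of early method splitting. The statement threshold is invented; real engines typically split on estimated bytecode size, since the JVM caps each method's code at 64 KB:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Hypothetical limit chosen for the demo; tune against real code size.
    static final int MAX_STMTS_PER_METHOD = 4;

    // Split a flat list of generated statements into chained helper methods
    // so no single generated method grows without bound.
    static String emit(List<String> stmts) {
        StringBuilder body = new StringBuilder("void apply() {\n");
        List<String> helpers = new ArrayList<>();
        for (int i = 0; i < stmts.size(); i += MAX_STMTS_PER_METHOD) {
            String name = "apply_" + helpers.size();
            StringBuilder h = new StringBuilder("void " + name + "() {\n");
            int end = Math.min(i + MAX_STMTS_PER_METHOD, stmts.size());
            for (String s : stmts.subList(i, end)) {
                h.append("    ").append(s).append("\n");
            }
            h.append("}\n");
            helpers.add(h.toString());
            body.append("    ").append(name).append("();\n");
        }
        body.append("}\n");
        for (String h : helpers) body.append(h);
        return body.toString();
    }

    public static void main(String[] args) {
        List<String> stmts = new ArrayList<>();
        for (int i = 0; i < 10; i++) stmts.add("sum += col[" + i + "];");
        System.out.println(emit(stmts)); // 10 statements -> 3 helpers
    }
}
```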

5) Cache compiled artifacts deliberately

Cache keys often need more than the SQL string. Include the pieces that change the generated machine shape, such as:

  • normalized operator tree or pipeline shape
  • physical operator choices
  • input schema / internal types
  • nullability
  • sort/hash/key layout
  • relevant runtime feature flags

The cache policy must also answer:

  • who owns the classloader?
  • when are generated classes collectible?
  • what is the eviction policy?
  • what is the fallback on cache miss or compile failure?
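A minimal sketch of a shape-keyed cache. The key fields are illustrative, and a production cache would also bound its size, define eviction, and track classloader lifetime so generated classes stay collectible:

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class CodegenCache {
    // Key carries everything that changes the generated shape,
    // not just the SQL text.
    static final class Key {
        final String planShape;     // normalized operator-tree fingerprint
        final String schemaTypes;   // internal types, in order
        final long nullabilityMask; // per-column nullability
        Key(String p, String s, long n) {
            planShape = p; schemaTypes = s; nullabilityMask = n;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return nullabilityMask == k.nullabilityMask
                && planShape.equals(k.planShape)
                && schemaTypes.equals(k.schemaTypes);
        }
        @Override public int hashCode() {
            return Objects.hash(planShape, schemaTypes, nullabilityMask);
        }
    }

    final Map<Key, Object> cache = new ConcurrentHashMap<>();
    final AtomicInteger compiles = new AtomicInteger();

    // computeIfAbsent gives one compile per shape, even under concurrency.
    Object compiled(Key key, Function<Key, Object> compile) {
        return cache.computeIfAbsent(key, k -> {
            compiles.incrementAndGet();
            return compile.apply(k);
        });
    }

    public static void main(String[] args) {
        CodegenCache c = new CodegenCache();
        Key a = new Key("scan>filter>project", "double,int", 0b01L);
        Key b = new Key("scan>filter>project", "double,int", 0b01L);
        c.compiled(a, k -> "compiled-class");
        c.compiled(b, k -> "compiled-class"); // same shape -> cache hit
        System.out.println(c.compiles.get()); // 1
    }
}
```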

6) Measure cold and warm separately

For codegen, “query time” is ambiguous.

Always separate:

  • first-run compile + execute
  • warm cached execute
  • compile-failure fallback path
  • classloading/metaspace side effects

If you only report the warm number, you can easily hide a bad design.
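A minimal harness shape for reporting the two numbers separately. The compile and execute bodies are stubs standing in for a real `cook()` call and a generated operator's run loop:

```java
public class ColdWarmBench {
    // Stand-ins for the real phases.
    static void compile() { for (int i = 0; i < 1_000; i++) Math.sqrt(i); }
    static void execute() { for (int i = 0; i < 1_000; i++) Math.sqrt(i); }

    static long nanos(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        long cold = nanos(() -> { compile(); execute(); }); // first run
        long warm = nanos(ColdWarmBench::execute);          // cached class
        // Report both; a warm-only number hides compile latency entirely.
        System.out.println("cold(ns)=" + cold + " warm(ns)=" + warm);
    }
}
```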

7) Keep a non-codegen path

A query engine should usually keep at least one of:

  • interpreted fallback
  • vectorized fallback
  • alternate compiler path

Reasons:

  • cold queries
  • oversized generated sources
  • Janino language or method-size limits
  • production debugging
  • incremental rollout and A/B comparison
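A sketch of the fallback dispatch itself, with the compile step stubbed out. A real engine would invoke the runtime compiler here and catch its checked exceptions; the size check and the returned predicate are stand-ins:

```java
import java.util.function.IntPredicate;

public class FallbackDemo {
    // Stub for "try to compile": returns null when compilation is refused
    // or fails, which a real implementation would detect via exceptions.
    static IntPredicate compileOrNull(String source) {
        if (source.length() > 64) return null; // simulate an oversized source
        return v -> v > 100;                   // simulate the compiled predicate
    }

    // Always resolve to something executable: compiled if possible,
    // interpreted otherwise.
    static IntPredicate plan(String source, IntPredicate interpreted) {
        IntPredicate compiled = compileOrNull(source);
        return compiled != null ? compiled : interpreted;
    }

    public static void main(String[] args) {
        IntPredicate interpreted = v -> v > 100; // slower but always correct
        IntPredicate p = plan("v > 100", interpreted);
        System.out.println(p.test(150)); // true, via the compiled path
    }
}
```

The same dispatch point is where incremental rollout and A/B comparison naturally attach: route a fraction of plans to the interpreted path and compare results.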

Janino-specific checklist for generated query code

Before you approve a Janino-based design, check these:

  • Is the generated source simple enough for Janino rather than javac/bytecode generation?
  • Are helper methods split before they approach size limits?
  • Are repeated plan shapes cached?
  • Is there a strategy for dumping generated source on failure?
  • Is compile latency measured independently from execution latency?
  • Is the parent classloader stable and intentional?
  • Is there a fallback compiler or interpreted/vectorized path?
  • Does the codegen layer avoid user text injection and only emit from validated IR/templates?

When to prefer something other than Janino

Prefer the JDK compiler when:

  • source is large
  • compile latency is less critical than language compatibility
  • Janino limitations become binding

Prefer ASM / bytecode libraries when:

  • you need exact bytecode control
  • source generation itself becomes expensive or brittle
  • you need to avoid Java-parser/compiler overhead entirely

Prefer native/IR-based compilation when:

  • research-grade peak throughput matters most
  • you need tighter control than the JVM compilation pipeline gives you
  • you are prepared to own a much more complex toolchain

Practical recommendation for RDF/query-engine style work

A strong JVM default is:

  1. start with a correct interpreted path
  2. add vectorized/batched kernels for the obvious bulk operators
  3. add Janino for repeated scalar/pipeline specializations that remain small and stable
  4. cache compiled shapes aggressively
  5. split code before limits force you to
  6. keep a fallback for cold or oversized paths
  7. benchmark cold, warm, and failure/fallback paths separately

That is much closer to current best practice than either “compile nothing” or “compile everything.”