[GH-2830] Improve Geography query support - core by zhangfengcdt · Pull Request #2831 · apache/sedona

zhangfengcdt · 2026-04-08T15:28:58Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.

Core architecture:

WKBGeography — stores WKB bytes as primary representation with lazy-parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
GeographyWKBSerializer — WKB serializer with 0xFF format byte, backward-compatible with legacy S2-native format
GeographyUDT, implicits.scala, GeometrySerde — switched to WKBSerializer for all serialization paths

Geography functions (3 new):

Level 1 (JTS): ST_NPoints
Level 2 (JTS + Spheroid): ST_Distance
Level 3 (S2): ST_Contains

Docs: API docs for all 3 new functions in docs/api/sql/geography/

Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR

How was this patch tested?

all unit tests pass in common module (WKBGeographyTest, FunctionTest)
GeographyFunctionTest.scala — Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips

Did this PR include necessary documentation updates?

Yes, I have updated the documentation.

…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.

- Functions.distance(Geography, Geography) → Spheroid.distance() — returns meters - Functions.area(Geography) → Spheroid.area() — returns m² - Functions.length(Geography) → Spheroid.length() — returns meters - All route through WKBGeography.getJTSGeometry() (no S2 parse needed)

also, added getShapeIndexGeography() as a third lazy cache

- null type ambiguity - level 1 functions missing Geography dispatch

…hole winding order

jiayuasu · 2026-04-09T01:30:01Z

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece.

zhangfengcdt · 2026-04-09T22:14:13Z

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece.

I am still working on it. I will clean up the benchmark codes and also for this PR, it will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move other to individual PRs following the merging.

…unctions

zhangfengcdt · 2026-04-11T16:45:55Z

ST Function Performance: Geography vs Geometry (cached objects, ns/op)

ST_NPoints (Level 1 — JTS accessor)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │         2 │        2 │    1x │
  └─────────────────────┴───────────┴──────────┴───────┘

ST_Distance (Level 2 — S2 geodesic distance)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       269 │       12 │   22x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │     1,576 │      373 │  4.2x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │     1,419 │      613 │  2.3x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │    69,279 │    3,874 │   18x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │   224,696 │  129,518 │  1.7x │
  └─────────────────────┴───────────┴──────────┴───────┘

ST_Contains (Level 3 — S2 predicate)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       284 │        8 │   36x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │       664 │        8 │   83x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │       684 │        8 │   86x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │       677 │        8 │   87x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │       703 │        8 │   88x │
  └─────────────────────┴───────────┴──────────┴───────┘

paleolimbot

This seems like it is headed in the right direction!

The WKBGeography is the right direction I think (In C++ I call this a GeoArrowGeography and it's slightly more general but the same idea). As written, I am not sure it is able to make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).

This may not be true in Java, but in C++ the S2Polygon and maybe S2Polyline has an internal index for itself and also one for each loop (lazily constructed, but almost always needed for initializing from WKB to figure out the nesting of shells/holes). The optimization of implementing the S2Shape interface directly on WKB is pretty much 100% to avoid those extra index builds (so that for any given function there's at most one shape index per object that is built). I don't know how flexible the S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance (but maybe the first thing you want to do is get the functions and tests in place).

paleolimbot · 2026-04-14T13:57:11Z

+        if (result == null) {
+          try {
+            WKBReader reader = new WKBReader();
+            result = reader.read(wkbBytes);


I am guessing that this line is very expensive (it is in C++). The recent redo of s2geography in C++ that I just did is mostly about avoiding this line (e.g., by manually iterating over edges and using lower-level S2 primitives.

We now avoid this for Point, LineString, and Polygon by implementing WkbS2Shape (an S2Shape on pre-converted S2Point arrays) and building the ShapeIndex directly from it.

paleolimbot · 2026-04-14T14:00:36Z

+  @Override
+  public S2Region region() {
+    return getS2Geography().region();
+  }


I optimized this one for points using the S2PointRegion for points to avoid constructing Geography. That might not matter here (in general when implementing the functions I also made every attempt to avoid accessing the Region to avoid the index build)

paleolimbot · 2026-04-14T14:10:14Z

+  /** Spherical containment test using S2 boolean operations. */
+  public static boolean contains(Geography g1, Geography g2) {
+    if (g1 == null || g2 == null) return false;
+    Predicates pred = new Predicates();
+    return pred.S2_contains(toShapeIndex(g1), toShapeIndex(g2), s2Options());
+  }


This one has a similar array of optimizations for small geometries, including for small polygons (I probably should have put the small polygon optimization in the distance path as well but I ran out of will)

zhangfengcdt · 2026-04-15T18:04:06Z

This seems like it is headed in the right direction!

The WKBGeography is the right direction I think (In C++ I call this a GeoArrowGeography and it's slightly more general but the same idea). As written, I am not sure it is able to make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).

This may not be true in Java, but in C++ the S2Polygon and maybe S2Polyline has an internal index for itself and also one for each loop (lazily constructed, but almost always needed for initializing from WKB to figure out the nesting of shells/holes). The optimization of implementing the S2Shape interface directly on WKB is pretty much 100% to avoid those extra index builds (so that for any given function there's at most one shape index per object that is built). I don't know how flexible the S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance (but maybe the first thing you want to do is get the functions and tests in place).

@paleolimbot Thanks for the review and they are really helpful. I have implemented some of the optimizations. Here's what changed and the benchmark results.

use little endian wkb for fast w/r
get dim from wkb bytes directly
add new WkbS2Shape and pre-convert wkb coordinate to S2Point (cached)
use regions for points
get cell union for points
point to point distance fast path
ShapeIndex is built from WkbS2Shape

  ST_NPoints (Level 1 — JTS accessor)

  ┌───────────────────┬────────┬───────┬────────┐
  │       Shape       │ Before │ After │ Change │
  ├───────────────────┼────────┼───────┼────────┤
  │ Point             │      2 │     2 │  ~same │
  ├───────────────────┼────────┼───────┼────────┤
  │ Polygon (16 vtx)  │      2 │     2 │  ~same │
  ├───────────────────┼────────┼───────┼────────┤
  │ Polygon (500 vtx) │      2 │     2 │  ~same │
  └───────────────────┴────────┴───────┴────────┘

  ST_Distance (Level 2 — S2 geodesic)

  ┌─────────────────────┬─────────┬─────────┬───────────────────────────────┐
  │        Shape        │ Before  │  After  │            Change             │
  ├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ Point               │     269 │     245 │ 1.1x faster (point fast path) │
  ├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ LineString (16 vtx) │   1,576 │   1,575 │                         ~same │
  ├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ Polygon (16 vtx)    │   1,419 │   2,248 │                   1.6x slower │
  ├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ Polygon (64 vtx)    │  69,279 │  64,515 │                  1.07x faster │
  ├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
  │ Polygon (500 vtx)   │ 224,696 │ 218,689 │                  1.03x faster │
  └─────────────────────┴─────────┴─────────┴───────────────────────────────┘

The major improvement: WkbS2Shape avoids building 3 redundant S2ShapeIndex instances (one per S2Loop, one for S2Polygon, and only the ShapeIndexGeography one is actually used).

  ST_Contains true (Level 3 — S2 predicate, point inside polygon)

  ┌─────────────────────┬────────┬───────┬─────────────┐
  │        Shape        │ Before │ After │   Change    │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Point               │    284 │   248 │ 1.1x faster │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ LineString (16 vtx) │    664 │   238 │ 2.8x faster │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (16 vtx)    │    684 │   239 │ 2.9x faster │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (64 vtx)    │    677 │   233 │ 2.9x faster │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (500 vtx)   │    703 │   254 │ 2.8x faster │
  └─────────────────────┴────────┴───────┴─────────────┘

For st_contains false case, it becomes slower because S2BooleanOperation.Contains() has different internal execution paths for true vs false results. Not sure if sedona-db also does this (S2Polygon for polygon predicates), which may see similar tradeoffs. But here is the results:

  ST_Contains false (Level 3 — S2 predicate, point outside polygon)

  ┌─────────────────────┬────────┬───────┬─────────────┐
  │        Shape        │ Before │ After │   Change    │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Point               │    226 │   223 │       ~same │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ LineString (16 vtx) │    220 │   653 │   3x slower │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (16 vtx)    │    222 │   657 │   3x slower │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (64 vtx)    │    222 │   657 │   3x slower │
  ├─────────────────────┼────────┼───────┼─────────────┤
  │ Polygon (500 vtx)   │    246 │   692 │ 2.8x slower │
  └─────────────────────┴────────┴───────┴─────────────┘

paleolimbot · 2026-04-15T18:22:09Z

Cool!

In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case.

- point container return false immediately - point target use S2ClosestEdgeQuery instead of building ShapeIndexTarget

zhangfengcdt · 2026-04-16T15:48:25Z

Cool!

In s2geography there's quite a bit of code to avoid the shape index build at all for ST_Distance() and each individual predicate, which may bring your results closer to on par. Notably, there's the possible intersection check which should help the "point is not in polygon" case.

Thanks for pointing to the s2geography optimizations. I implemented the following two optimization for now:

point container to false directly early stop
pointTarget for distance: avoids building ShapeIndex for the point side

I also tried to implement the MayIntersect that might help with the contains false case, but looks like the MayIntersects path (~700ns) is more expensive than the cached ShapeIndex path (~220ns) in a single call.

In this PR, I was thinking to optimize the hot path, which may benefit more with cached ShapeIndex (good for joins). For the code path optimizations, adding mayIntersects pre-filters may be better; potentially adding checking covering intersection pre-filtering as well. I'd prefer to add them in following PRs with a configurable flags to avoid the hot path gets slower.

There seems to be tradeoffs we need to make default and also custom sedona configs for users to tune these optimizations since the geography has too much overhead compared to the geometry functions.

What do you think?

paleolimbot · 2026-04-16T16:39:43Z

That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common).

One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper.

zhangfengcdt · 2026-04-17T14:16:41Z

That sounds good to me! I the work I did in C++ was mostly to expose all of the possible optimizations but I haven't tuned them yet (and the tuning is probably much different in Java than it is in C++). The tuning/extreme optimization is also much better done in the presence of integration testing against BigQuery/PostGIS (which is in the works on the SedonaDB side). The point optimizations are nice because they truly are easy outs (and point/polygon is I think common).

One of the things that may have affected my initial implementation is that SedonaDB benchmarks mostly have one Scalar argument (where the cost of computing the covering is amortized over many more rows). I didn't implement a S2LatLngRect bound comparison but I think computing that might be cheaper.

Yes, one scalar case is a very popular case and the ShapeIndex cache and/or covering will help a lot in performance. The join also benefit from the cache and covering.

zhangfengcdt · 2026-04-17T14:19:35Z

@jiayuasu Could you take a look again? I think the core is ready and I'd create a couple of following up PRs for different geography ST functions once this core PR is merged.

I'd also like to add a few optimization configs once all of the st functions are merged so we can measure the performance under different optimizations in all.

Thanks!

Copilot

Pull request overview

Implements WKB-based serialization for Sedona Geography (with lazy JTS/S2/ShapeIndex caching) and adds initial Geography function support (ST_NPoints, ST_Distance, ST_Contains), along with Spark SQL integration tests and documentation updates.

Changes:

Introduces WKBGeography as the primary Geography representation plus GeographyWKBSerializer for backward-compatible serialization.
Extends Spark SQL expressions to dual-dispatch between Geometry (JTS) and Geography (S2) for ST_NPoints, ST_Distance, and ST_Contains.
Adds/updates unit + integration tests and publishes new SQL API docs for the new Geography functions.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
spark/common/src/test/scala/org/apache/sedona/sql/geography/GeographyFunctionTest.scala	New Spark SQL integration tests for Geography constructors + representative L1/L2/L3 functions + serialization round-trips.
spark/common/src/test/scala/org/apache/sedona/sql/geography/FunctionsTest.scala	Removes older geography function tests (envelope/EWKT).
spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsTest.scala	Updates expected WKTs to reflect S2 normalization behavior (loop orientation/winding).
spark/common/src/test/scala/org/apache/sedona/sql/geography/ConstructorsDataFrameAPITest.scala	Updates expected WKT for multipolygon due to S2 normalization behavior.
spark/common/src/test/scala/org/apache/sedona/sql/FunctionResolverSuite.scala	Updates resolver test expectations for ambiguity handling.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/JoinQueryDetector.scala	Adjusts join predicate detection to accommodate `ST_Contains` no longer being an `ST_Predicate`.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/implicits.scala	Switches Geography serialization/deserialization to `GeographyWKBSerializer`.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Predicates.scala	Makes `ST_Contains` dual-dispatch (Geometry + Geography) via `InferredExpression`.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala	Adds Geography overloads for `ST_Distance` and `ST_NPoints`.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/FunctionResolver.scala	Changes overload resolution behavior for ambiguous matches (prefers first).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/GeographyUDT.scala	Switches UDT serialization/deserialization to `GeographyWKBSerializer`.
spark/common/src/main/java/org/apache/sedona/core/utils/SedonaConf.java	Adds config to enable eager ShapeIndex building at Geography deserialization.
docs/api/sql/geography/Geography-Functions/ST_NPoints.md	New docs page for `ST_NPoints`.
docs/api/sql/geography/Geography-Functions/ST_Distance.md	New docs page for `ST_Distance`.
docs/api/sql/geography/Geography-Functions/ST_Contains.md	New docs page for `ST_Contains`.
docs/api/sql/geography/Geography-Functions.md	Updates Geography function index to include the new functions and refresh descriptions.
common/src/test/java/org/apache/sedona/common/S2Geography/WKBGeographyTest.java	New unit tests for `WKBGeography` + serializer behavior and legacy compatibility.
common/src/test/java/org/apache/sedona/common/Geography/FunctionTest.java	Extends common geography function tests to cover nPoints/distance/contains and refactors envelope tests.
common/src/main/java/org/apache/sedona/common/geometrySerde/GeometrySerde.java	Routes Geography serialization through `GeographyWKBSerializer`.
common/src/main/java/org/apache/sedona/common/geography/Functions.java	Adds Geography implementations for nPoints/distance/contains plus helper conversions.
common/src/main/java/org/apache/sedona/common/geography/Constructors.java	Updates geography constructors/converters to produce `WKBGeography` and adds EWKB SRID extraction.
common/src/main/java/org/apache/sedona/common/S2Geography/WkbS2Shape.java	New lightweight `S2Shape` backed by WKB bytes for faster index building on simple types.
common/src/main/java/org/apache/sedona/common/S2Geography/WKBGeography.java	New Geography implementation storing WKB bytes with lazy JTS/S2/ShapeIndex caches.
common/src/main/java/org/apache/sedona/common/S2Geography/GeographyWKBSerializer.java	New WKB-based serializer with 0xFF marker and legacy S2-native fallback.
common/src/main/java/org/apache/sedona/common/S2Geography/Distance.java	Adds a point-to-index distance helper using `S2ClosestEdgeQuery.PointTarget`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-18T03:26:41Z

+          this.dim = 0;
+          double lon = buf.getDouble(5);
+          double lat = buf.getDouble(13);
+          S2Point p = S2LatLng.fromDegrees(lat, lon).toPoint();
+          this.vertices = new S2Point[] {p};
+          this.totalEdges = 1;
+          this.chainStarts = new int[] {0};
+          this.chainLengths = new int[] {1};
+          this.vertexOffsets = new int[] {0};
+          this.containsOriginValue = false;
+          break;
+        }
+
+      case 2: // LineString
+        {
+          this.dim = 1;
+          int numCoords = buf.getInt(5);
+          this.vertices = readVertices(buf, 9, numCoords);
+          this.totalEdges = Math.max(0, numCoords - 1);


The byte offsets here assume 2D WKB without an embedded SRID (e.g., Point X/Y at 5/13, LineString count at 5, Polygon ring count at 5). If the input is EWKB with the SRID flag set, coordinates start after the 4-byte SRID; if it’s Z/M, the coordinate stride is larger than 16 bytes. This will build an incorrect S2Shape (and therefore incorrect predicates/distances) for those valid inputs. Please compute a coordOffset based on the decoded EWKB flags and a stride based on input dimension, similar to common/.../WKBReader.java.

Copilot · 2026-04-18T03:26:42Z

+  /** Returns the base WKB geometry type (1-7), masking off Z/M/SRID flags. */
+  private int wkbBaseType() {
+    boolean le = (wkbBytes[0] == 0x01);
+    int raw;
+    if (le) {
+      raw =
+          (wkbBytes[1] & 0xFF)
+              | ((wkbBytes[2] & 0xFF) << 8)
+              | ((wkbBytes[3] & 0xFF) << 16)
+              | ((wkbBytes[4] & 0xFF) << 24);
+    } else {
+      raw =
+          ((wkbBytes[1] & 0xFF) << 24)
+              | ((wkbBytes[2] & 0xFF) << 16)
+              | ((wkbBytes[3] & 0xFF) << 8)
+              | (wkbBytes[4] & 0xFF);
+    }
+    return raw & 0xFF;
+  }


wkbBaseType() claims to mask off Z/M/SRID flags, but raw & 0xFF only keeps the lowest byte. This breaks ISO WKB Z/M encodings (e.g., 1001/3001) and doesn’t actually remove the EWKB SRID flag safely. Consider decoding typeInt the same way as WKBReader (e.g., geometryType = (typeInt & 0xffff) % 1000) and separately tracking whether SRID/Z/M flags are present for offset/stride calculations.

Copilot · 2026-04-18T03:26:42Z

+  public S2Point extractPoint() {
+    boolean le = (wkbBytes[0] == 0x01);
+    ByteBuffer bb =
+        ByteBuffer.wrap(wkbBytes).order(le ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
+    double lon = bb.getDouble(5);
+    double lat = bb.getDouble(13);
+    return S2LatLng.fromDegrees(lat, lon).toPoint();


extractPoint() reads coordinates at fixed offsets (5/13), which is only correct for 2D WKB without an embedded SRID. For EWKB where the SRID flag is set, bytes 5–8 are the SRID and coordinates start at offset 9; this will produce wrong points and therefore wrong region(), getCellUnionBound(), and the point fast paths in geography distance()/predicates. Please derive the coordinate offset from the decoded typeInt (SRID/Z/M flags) before reading X/Y.

Copilot · 2026-04-18T03:26:42Z

        case joinConditionMatcher(predicate, extraCondition) =>
          predicate match {
+            case ST_Contains(Seq(leftShape, rightShape)) =>
+              Some(
+                JoinQueryDetection(
+                  left,
+                  right,
+                  leftShape,
+                  rightShape,
+                  SpatialPredicate.CONTAINS,
+                  false,
+                  extraCondition))


ST_Contains now supports Geography via InferredExpression, but this join planning path always treats it as a geometry join (isGeography = false) and the join execution path deserializes join keys with GeometrySerializer (see TraitJoinQueryBase.toSpatialRDD). If a user writes a join condition ST_Contains(geogA, geogB), this optimization will likely throw at runtime by attempting to deserialize Geography bytes as Geometry. Since geography join support is explicitly deferred, this detector should skip planning spatial joins when either side is GeographyUDT (or otherwise detect geography inputs and fall back to Spark’s default join).

Copilot · 2026-04-18T03:26:42Z

        // Make sure that there's no ambiguity in the best match.
        val ambiguousMatches = functionWithMatchResults.filter { case (_, matchResult) =>
          matchResult == bestMatch._2
        }
        if (ambiguousMatches.length == 1) {
          function
        } else {
-          // Detected ambiguous matches, throw exception
-          val candidateTypesMsg = ambiguousMatches
-            .map { case (function, _) =>
-              "  (" + function.sparkInputTypes.mkString(", ") + ")"
-            }
-            .mkString("\n")
-          throw new IllegalArgumentException(
-            "Ambiguous function call. Candidates are: \n" + candidateTypesMsg)
+          // Multiple candidates match equally (e.g., null inputs matching both Geometry and
+          // Geography overloads). Prefer the first candidate — for null inputs all overloads
+          // return null, so the choice is irrelevant.
+          ambiguousMatches.head._1
        }


The ambiguity handling change (ambiguousMatches.head) applies to all equally-good matches, not just the null-literal case mentioned in the comment. This can silently pick an arbitrary overload in situations where two overloads require the same number of implicit casts, changing behavior from a clear error to potentially wrong results. Safer options are: (1) keep throwing on ambiguity unless all ambiguous calls contain NullType arguments, or (2) add a deterministic tie-breaker based on a specificity/precedence rule that can’t change semantics unexpectedly.

Copilot · 2026-04-18T03:26:42Z

+  public WkbS2Shape(byte[] wkb) {
+    boolean le = (wkb[0] == 0x01);
+    ByteBuffer buf =
+        ByteBuffer.wrap(wkb).order(le ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
+    int wkbType = buf.getInt(1) & 0xFF;
+


WKB type decoding is incorrect here: buf.getInt(1) & 0xFF drops EWKB/ISO type information (e.g., Z/M types like 1001/3001) and can misclassify geometries. It should read the full typeInt, derive geometryType = (typeInt & 0xffff) % 1000 (as WKBReader does), and also account for EWKB flags such as SRID (0x20000000) when computing the payload offset.

jiayuasu · 2026-04-18T03:28:45Z

+
+Return type: `Boolean`
+
+Since: `v1.9.0`


Let's make it 1.9.1. I don't think the geography function will come in on time with 1.9.0

jiayuasu · 2026-04-18T03:31:50Z

+/**
+ * Serializer for Geography objects using WKB as the primary format. Supports backward-compatible
+ * reading of legacy S2-native serialized data. Format discrimination by first byte: 0xFF for WKB
+ * format [0xFF][4-byte SRID big-endian][WKB payload], or 1-10 for legacy S2-native format.


I suggest we remove the backward compatibility of the S2 native format unless it helps with the performance. This will reduce the complexity of the codebase. The S2 SerDe format is never announced and we don't expect active users of this feature.

jiayuasu · 2026-04-18T03:34:28Z

+  }
+
+  /** Mean Earth radius in meters (WGS84 mean radius). */
+  private static final double EARTH_RADIUS_METERS = 6371008.8;


There is another place in Sedona codebase (Distance Spheroid / Sphere) also hard code the value of earth radius. can you unify them together?

jiayuasu · 2026-04-18T03:35:53Z

+SQL Example
+
+```sql
+SELECT ST_Contains(


Can you add SVG visuals for this too? See Geometry ST_Contains: https://sedona.apache.org/latest/api/sql/Predicates/ST_Contains/

jiayuasu · 2026-04-18T03:37:58Z

There is a bug in the geography test helper which never actually validates the correctness of the code: #2810

Is it still true in this PR?

zhangfengcdt added 20 commits April 7, 2026 09:12

Implement Geography type and migration

6a5c98c

Add WKBGeography class that stores WKB bytes as the primary represent…

c511380

…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.

implements more st functions for geography

33134a6

implement ST_Contains, ST_Intersects, ST_Equals

307f37c

update plan document

86b65da

add jmh benckmarks

a3294e5

add SerializerComparisonBench

56a66ed

wip

4c79336

add all st functions to the benchmarks

bf686e7

uses S2ClosestEdgeQuery for true geometry-to-geometry minimum distance

a0b4618

also, added getShapeIndexGeography() as a third lazy cache

update benchmark resutls

5252459

remove plan document

1567b68

add random geometry/geography generator with configurable vertex count

aa9c093

create GeographyFunctionIntegrationTest

73e56b6

function tests

f5e7459

add spark.sedona.geography.eagerShapeIndex

8b93733

add GeoParquet scenario to the benchmark

c133a13

clean up comments formating

d9063a1

updated Geography-Functions.md

bbbbb2f

zhangfengcdt requested a review from jiayuasu as a code owner April 8, 2026 15:28

zhangfengcdt added 7 commits April 8, 2026 08:36

add st_within

5f346e5

fix lint errors

349ad64

fix java style issue

ca4783e

fix pre-commit lint

606ac4e

fix failed tests

75a6894

- null type ambiguity - level 1 functions missing Geography dispatch

fix tests - updated expected multipolygon WKT to match S2-normalized …

b35bcac

…hole winding order

fix more tests

89239fd

zhangfengcdt marked this pull request as draft April 9, 2026 22:12

zhangfengcdt added 2 commits April 10, 2026 09:15

clean up benchmark code and keep only the core arch and 3 sample st f…

0839a97

…unctions

rephrase st classification in doc

6dbb188

zhangfengcdt changed the title ~~[GH-2830] Improve Geography query support - functions~~ [GH-2830] Improve Geography query support - core Apr 11, 2026

zhangfengcdt marked this pull request as ready for review April 13, 2026 15:58

zhangfengcdt requested a review from paleolimbot April 13, 2026 19:52

paleolimbot reviewed Apr 14, 2026

View reviewed changes

zhangfengcdt added 2 commits April 15, 2026 09:32

WIP - add WkbS2Shape

383537d

Add optimizations and add WkbS2Shape implementing S2Shape

0270ab4

add optimization to

5d9f2e7

- point container return false immediately - point target use S2ClosestEdgeQuery instead of building ShapeIndexTarget

jiayuasu requested a review from Copilot April 18, 2026 03:19

Copilot started reviewing on behalf of jiayuasu April 18, 2026 03:20 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

jiayuasu reviewed Apr 18, 2026

View reviewed changes

Conversation

zhangfengcdt commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

Uh oh!

jiayuasu commented Apr 9, 2026

Uh oh!

zhangfengcdt commented Apr 9, 2026

Uh oh!

zhangfengcdt commented Apr 11, 2026

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhangfengcdt commented Apr 15, 2026

Uh oh!

paleolimbot commented Apr 15, 2026

Uh oh!

zhangfengcdt commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paleolimbot commented Apr 16, 2026

Uh oh!

zhangfengcdt commented Apr 17, 2026

Uh oh!

zhangfengcdt commented Apr 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jiayuasu commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

zhangfengcdt commented Apr 8, 2026 •

edited

Loading

zhangfengcdt commented Apr 16, 2026 •

edited

Loading