GH-50027: [Format][C++][Python] Add arrow.range canonical extension type #50028

Open
Hoeze wants to merge 11 commits into apache:main from Hoeze:feat/arrow-range-extension

Conversation

@Hoeze Hoeze commented May 24, 2026

Rationale for this change

This is a draft implementation for #50027, a new canonical range extension type.

What changes are included in this PR?

This PR provides the spec text, a C++ reference implementation, PyArrow bindings, and the supporting documentation.

Are these changes tested?

I ran the tests locally but have not yet tried the new types in any other project.

Note that I made heavy use of AI to create this PR and copied many structures from the fixed shape tensor extension type. I reviewed each change and hope the changes I made are meaningful.
Nevertheless, I am not sure whether the C++ parts are comprehensive or if I missed anything; this is my first contribution to Arrow.

Are there any user-facing changes?

No, this is an addition of a new canonical extension type.

Hoeze added 8 commits May 24, 2026 13:03
Add a canonical extension type for bounded ranges (mathematical intervals),
distinct from Arrow's calendar Interval (duration) type.

- Spec: docs/source/format/CanonicalExtensions.rst adds the Range section.
  Storage is Struct<lower, upper> with both bounds nullable (null = +/-infinity,
  treated as exclusive). A closed parameter (left/right/both/neither, pandas
  vocabulary) is carried as JSON extension metadata; the subtype is read from
  storage. Disambiguates from the calendar Interval type per DB convention
  (INTERVAL = duration, RANGE/PERIOD = bounded set).
- C++ reference impl: cpp/src/arrow/extension/range.{h,cc} (RangeType/RangeArray)
  with serialize/deserialize, storage validation, registration in the global
  registry, tests, and CMake/meson wiring.
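
To make the bound semantics described above concrete, here is a minimal pure-Python sketch. It is not the PR's C++ or PyArrow API: the helper name `range_contains` is hypothetical, and integer bounds are assumed for simplicity. It only models the stated rules (a null bound means an infinite, exclusive bound; `closed` selects which finite bounds are inclusive).

```python
from typing import Optional

# The four closedness values named in the spec text (pandas vocabulary).
VALID_CLOSED = ("left", "right", "both", "neither")


def range_contains(lower: Optional[int], upper: Optional[int],
                   closed: str, value: int) -> bool:
    """Hypothetical helper: is the finite `value` inside the range?"""
    if closed not in VALID_CLOSED:
        raise ValueError(f"closed must be one of {VALID_CLOSED}, got {closed!r}")
    lower_inclusive = closed in ("left", "both")
    upper_inclusive = closed in ("right", "both")
    # A null (None) bound stands for -infinity / +infinity, so it never
    # excludes a finite value and the `closed` flag is irrelevant for it.
    if lower is not None:
        if value < lower or (value == lower and not lower_inclusive):
            return False
    if upper is not None:
        if value > upper or (value == upper and not upper_inclusive):
            return False
    return True
```

For example, `range_contains(1, 5, "left", 5)` is false because the upper bound is exclusive, while `range_contains(None, 5, "right", -1000)` is true because the null lower bound is unbounded.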
The closedness is no longer defaulted on the wire: empty metadata or a JSON
object without a "closed" key is now rejected by Deserialize, so a serialized
arrow.range is always unambiguous. The C++ convenience default argument for
constructing a RangeType in code is left-closed ([lower, upper)), matching the
PostgreSQL/Rust/Python range convention. Spec and tests updated.
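
The rejection rule above can be sketched in pure Python as follows. This is an illustration only, not the C++ `Deserialize` code: the function names are hypothetical, and rejecting an unknown `closed` value or non-object JSON is an assumption that goes beyond what the commit message states.

```python
import json

VALID_CLOSED = ("left", "right", "both", "neither")


def serialize_closed(closed: str = "left") -> str:
    """Hypothetical: emit the JSON metadata; the default mirrors the
    in-code convenience default ([lower, upper)), not a wire default."""
    if closed not in VALID_CLOSED:
        raise ValueError(f"invalid closed value: {closed!r}")
    return json.dumps({"closed": closed})


def deserialize_closed(serialized: str) -> str:
    """Hypothetical: parse metadata, rejecting anything ambiguous."""
    if not serialized:
        raise ValueError("arrow.range metadata must not be empty")
    try:
        metadata = json.loads(serialized)
    except json.JSONDecodeError as exc:
        raise ValueError("arrow.range metadata is not valid JSON") from exc
    # Assumption: anything other than a JSON object is rejected as well.
    if not isinstance(metadata, dict) or "closed" not in metadata:
        raise ValueError('arrow.range metadata requires a "closed" key')
    closed = metadata["closed"]
    # Assumption: values outside the four known strings are rejected.
    if closed not in VALID_CLOSED:
        raise ValueError(f"invalid closed value: {closed!r}")
    return closed
```

With this shape, a round trip such as `deserialize_closed(serialize_closed("both"))` succeeds, while `""` and `"{}"` both raise, so a serialized type never relies on an implicit closedness.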
Verified by building the arrow-canonical-extensions-test target (50/50 pass,
10/10 RangeType). Two fixes to the previously-uncompiled test:
- include arrow/array/array_nested.h for the full StructArray definition
  (it is only forward-declared in type_fwd.h).
- wrap the CheckDeserialize helper in an anonymous namespace to avoid a
  link-time collision with the identically named helper in opaque_test.cc.
@github-actions
⚠️ GitHub issue #50027 has been automatically assigned in GitHub to PR creator.

@rok
Member

rok commented May 24, 2026

See comment.

Hoeze added 2 commits May 24, 2026 23:03
Add a sibling canonical extension type to arrow.range that stores bound
inclusivity per value via non-nullable boolean lower_inc/upper_inc fields,
storage Struct<lower:T, upper:T, lower_inc:bool, upper_inc:bool>.

arrow.range carries a single type-level closed parameter, sufficient for
discrete ranges that canonicalize to one closedness (int4range, int8range,
daterange). Continuous ranges (numrange, tsrange, tstzrange) cannot be
canonicalized, so closedness must travel with each value. arrow.range_inc
mirrors PostgreSQL's internal range representation for that case; both types
coexist.
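
As a rough illustration of why discrete ranges can share one type-level closedness, the sketch below rewrites any integer range into left-closed form by stepping a bound to its neighbour. The function name is hypothetical and this is not code from the PR; a continuous subtype has no "next value", so the same rewrite is not available there.

```python
from typing import Optional, Tuple


def canonicalize_int_range(lower: Optional[int], upper: Optional[int],
                           lower_inc: bool, upper_inc: bool
                           ) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical: rewrite an integer range to the form [lower, upper)."""
    # An exclusive lower bound n is equivalent to an inclusive bound n + 1.
    if lower is not None and not lower_inc:
        lower += 1
    # An inclusive upper bound n is equivalent to an exclusive bound n + 1.
    if upper is not None and upper_inc:
        upper += 1
    # Null (infinite) bounds are left untouched.
    return lower, upper
```

For instance, the closed range [1, 5] and the open range (0, 6) both become the pair `(1, 6)`, read as [1, 6).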

The type has no metadata parameters: inclusivity lives in storage, so
Serialize emits {} and Deserialize accepts empty/{}/extra keys. A null
(infinite) bound is always exclusive regardless of its flag.
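
A small pure-Python model of the per-value rule may help here. The names `RangeIncValue`, `effective_inclusivity`, and `contains` are hypothetical and are not the PR's API; the sketch only encodes the statement that a null bound is exclusive whatever its flag says, and it is meant for finite query values.

```python
from typing import NamedTuple, Optional, Tuple


class RangeIncValue(NamedTuple):
    """Hypothetical mirror of one storage row: lower, upper, and two flags."""
    lower: Optional[float]
    upper: Optional[float]
    lower_inc: bool
    upper_inc: bool


def effective_inclusivity(v: RangeIncValue) -> Tuple[bool, bool]:
    # A null bound is infinite and therefore exclusive, regardless of its flag.
    return (v.lower is not None and v.lower_inc,
            v.upper is not None and v.upper_inc)


def contains(v: RangeIncValue, x: float) -> bool:
    """Is the finite value `x` inside the range `v`?"""
    lower_inc, upper_inc = effective_inclusivity(v)
    if v.lower is not None:
        if x < v.lower or (x == v.lower and not lower_inc):
            return False
    if v.upper is not None:
        if x > v.upper or (x == v.upper and not upper_inc):
            return False
    return True
```

Here `RangeIncValue(None, 2.5, True, True)` has effective flags `(False, True)`: the stored `lower_inc=True` is ignored because the lower bound is null.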

Covers C++ (type, array, registration, tests), pyarrow bindings and tests,
and the format spec, status table, and C++/Python API docs.
Copilot AI review requested due to automatic review settings June 4, 2026 19:52
Contributor

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

}

// ---------------------------------------------------------------------------
// RangeIncType
Member

Perhaps we can find a better name than RangeInc?

Author

Absolutely 😄
Some ideas:

  • VarRangeType
  • VariableClosedRangeType
  • PerValueRangeType

What do you think?

Author

@Hoeze Hoeze Jun 4, 2026

One more idea:
FixedClosednessRangeType and VariableClosednessRangeType (similar to the tensor type naming).

Member

Range and GranularRange? Naming is hard.

Under CMAKE_UNITY_BUILD (Windows CI), range_test.cc and opaque_test.cc are
merged into one translation unit. Both declared a CheckDeserialize helper
(range's in an anonymous namespace, opaque's in namespace arrow), making the
unqualified call ambiguous and failing the MSVC build with C2668. Rename the
range helper to CheckRangeDeserialize to remove the collision.