[SPARK-56916][SQL] Simplify ElementAt array codegen under ANSI mode#55941

Open
gengliangwang wants to merge 1 commit into apache:master from gengliangwang:SPARK-56916-element-at
@gengliangwang
Member

Title: [SPARK-56916][SQL] Simplify ElementAt array codegen under ANSI mode
Base: master (independent)
Head: gengliangwang:SPARK-56916-element-at

### What changes were proposed in this pull request?

Introduce `ArrayUtils.java` with a single helper
`elementAtIndexExact(int length, int index, QueryContext context)` and
use it from `ElementAt`'s `ArrayType` branch in both `doGenCode` and
`doElementAt` (eval).

The helper normalizes a 1-based `element_at` index against the array
length and returns the 0-based position, throwing
`invalidElementAtIndexError` for out-of-bound and
`invalidIndexOfZeroError` for zero index. The caller still emits the
type-specific `arr.get(pos, dataType)` (not the helper, since the
return type depends on the array element type).
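
To make the contract concrete, here is a minimal, self-contained sketch of the index normalization described above. It is not the code from this PR: the class name `ElementAtIndexSketch` is hypothetical, the `QueryContext` parameter is omitted, and plain JDK exceptions stand in for `invalidElementAtIndexError` and `invalidIndexOfZeroError`.

```java
/** Hypothetical stand-in for the helper described above; not the code from this PR. */
final class ElementAtIndexSketch {
    private ElementAtIndexSketch() {}

    /**
     * Maps a 1-based element_at index (negative values count from the end)
     * to a 0-based array position. JDK exceptions replace Spark's error
     * helpers, and the QueryContext argument is left out (assumptions).
     */
    static int elementAtIndexExact(int length, int index) {
        // Widen to long before abs so Integer.MIN_VALUE cannot overflow.
        if (length < Math.abs((long) index)) {
            throw new ArrayIndexOutOfBoundsException(
                "index " + index + " is out of bounds for array of length " + length);
        }
        if (index == 0) {
            throw new IllegalArgumentException("element_at index 0 is invalid; indexes start at 1");
        }
        // Positive indexes count from the front, negative ones from the back.
        return index > 0 ? index - 1 : length + index;
    }
}
```

Under these assumptions, an array of length 3 maps index `1` to position `0` and index `-1` to position `2`, while `0` and `4` both throw.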

The non-ANSI branch is left inline because it can choose between
`defaultValueOutOfBound` (an `Option[Expression]` that requires
codegen access) or `null`.
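
For contrast, a rough sketch of what a lenient (non-ANSI) lookup looks like when written as ordinary Java. Everything here is illustrative: `LenientElementAtSketch` and `elementAtOrDefault` are hypothetical names, a plain `List` stands in for Spark's array data, and the sketch assumes that index 0 still raises an error in the lenient path. In the real codegen the default is an expression that must be generated, which is why that branch stays inline.

```java
import java.util.List;

/** Hypothetical illustration of the lenient path; not code from this PR. */
final class LenientElementAtSketch {
    private LenientElementAtSketch() {}

    /**
     * Returns the element at a 1-based index, or the supplied default
     * (possibly null) when the index is out of bounds.
     */
    static <T> T elementAtOrDefault(List<T> array, int index, T defaultValueOutOfBound) {
        int length = array.size();
        // Out of bounds: fall back to the default instead of throwing.
        if (length < Math.abs((long) index)) {
            return defaultValueOutOfBound;
        }
        // Assumption: a zero index is an error even in the lenient path.
        if (index == 0) {
            throw new IllegalArgumentException("element_at index 0 is invalid; indexes start at 1");
        }
        int pos = index > 0 ? index - 1 : length + index;
        return array.get(pos);
    }
}
```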

### Why are the changes needed?

Part of SPARK-56908 (umbrella). The ANSI `ElementAt` codegen body was
the largest single inline body in `collectionOperations.scala` -- the
helper collapses ~12 lines to ~3 per call site.

### Does this PR introduce _any_ user-facing change?

No. The compiled behavior is identical; only the emitted Java source
text changes.

### How was this patch tested?

```
build/sbt "catalyst/testOnly *CollectionExpressionsSuite"
```

59/59 pass.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
@gengliangwang
Member Author


Stack overview (SPARK-56908 umbrella)

This PR is part of a stack of 8 PRs against SPARK-56908. Order:

  1. #55934: [SPARK-56909][SQL] Simplify Cast to int/long codegen under ANSI mode (this stack base)
  2. #55935: [SPARK-56910][SQL] Simplify Cast to byte/short codegen under ANSI mode
  3. #55936: [SPARK-56911][SQL] Simplify Cast to decimal codegen under ANSI mode
  4. #55937: [SPARK-56912][SQL] Simplify Cast to boolean codegen under ANSI mode
  5. #55939: [SPARK-56914][SQL] Simplify decimal arithmetic codegen under ANSI mode (depends on #55936)
  6. #55938: [SPARK-56913][SQL] Simplify BinaryArithmetic byte/short codegen under ANSI mode (independent)
  7. #55940: [SPARK-56915][SQL] Simplify MakeDate/MakeInterval codegen under ANSI mode (independent)
  8. #55941: [SPARK-56916][SQL] Simplify ElementAt array codegen under ANSI mode (independent)

PRs 1-4 are linearly stacked on each other (each branch is based on the previous one). PR 5 (decimal arithmetic) is stacked on top of PR 3 (cast decimal) since it uses CastUtils.changePrecisionExact. PRs 6, 7, 8 branch off master independently.

Contributor

@cloud-fan cloud-fan left a comment

Summary

Prior state and problem. ElementAt.doGenCode for ANSI mode contained ~12 lines of inline codegen for index validation (length check, zero-index check, sign normalization), with the same logic duplicated in Scala in doElementAt (eval). Per the SPARK-56908 umbrella, this was the largest single inline body in collectionOperations.scala.

Design approach. Extract the ANSI-mode validation into a Java static helper ArrayUtils.elementAtIndexExact(int length, int index, QueryContext) and call it from both eval and codegen. Each method now splits case _: ArrayType into a failOnError branch (uses the helper) and a non-failOnError branch (kept inline — only the ANSI branch is unified in this PR).

Key design decisions. The helper returns the validated 0-based int; the type-specific arr.get(pos, dataType) remains at the call site so the helper stays independent of element type.
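
A small sketch of why the element read stays at the call site: one index helper can serve call sites with different element types, each doing its own typed access. The names below (`ElementAtCallSiteSketch`, `intElementAt`, `stringElementAt`) are hypothetical, Java arrays stand in for Spark's array data, and JDK exceptions replace the Spark error helpers.

```java
/** Hypothetical illustration of the helper/call-site split; not code from this PR. */
final class ElementAtCallSiteSketch {
    private ElementAtCallSiteSketch() {}

    /** Element-type-agnostic part: validate and normalize the 1-based index. */
    static int elementAtIndexExact(int length, int index) {
        if (length < Math.abs((long) index)) {
            throw new ArrayIndexOutOfBoundsException(
                "index " + index + " is out of bounds for array of length " + length);
        }
        if (index == 0) {
            throw new IllegalArgumentException("element_at index 0 is invalid; indexes start at 1");
        }
        return index > 0 ? index - 1 : length + index;
    }

    /** Call site for int elements: the typed read happens here, not in the helper. */
    static int intElementAt(int[] arr, int index) {
        return arr[elementAtIndexExact(arr.length, index)];
    }

    /** Call site for String elements, reusing the same helper unchanged. */
    static String stringElementAt(String[] arr, int index) {
        return arr[elementAtIndexExact(arr.length, index)];
    }
}
```

Because the helper only returns an `int` position, supporting another element type means adding another typed read and leaving the validation logic untouched.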

Implementation sketch.

  • New file ArrayUtils.java in org.apache.spark.sql.catalyst.expressions with a single static method.
  • ElementAt.doElementAt and ElementAt.doGenCode each gain a new case _: ArrayType if failOnError => branch.

Behavior verified case-by-case against pre-PR (OOB, zero index, negative index, empty array). Codegen scaffolding (nullCheck) is identical to pre-PR.

LGTM with two minor nits inline.

* of inline length / zero / sign-normalization codegen with a return of
* the normalized array position (0-based).
*/
public final class ArrayUtils {
Contributor

The stack uses per-operation naming (CastUtils, ArithmeticUtils, DateTimeConstructorUtils). ArrayUtils is broader than its single element_at-specific helper, and there's already an ArrayExpressionUtils.java in the same package that serves array-expression helpers. Risk: future readers won't know which utility class to look in, and ArrayUtils becomes a magnet for unrelated array helpers.

Consider renaming to ElementAtUtils (matches DateTimeConstructorUtils-style per-operation naming), or folding elementAtIndexExact into the existing ArrayExpressionUtils. WDYT?

Comment on lines +28 to +30
* {@link ElementAt} on {@code ArrayType}: a single call replaces ~12 lines
* of inline length / zero / sign-normalization codegen with a return of
* the normalized array position (0-based).
Contributor

The `a single call replaces ~12 lines ...` clause describes the PR's effect rather than the helper's contract: once merged, the original 12-line inline form isn't visible to future readers. The peer `CastUtils.java` doesn't include similar line-count claims.

Suggested change
- * {@link ElementAt} on {@code ArrayType}: a single call replaces ~12 lines
- * of inline length / zero / sign-normalization codegen with a return of
- * the normalized array position (0-based).
+ * {@link ElementAt} on {@code ArrayType}.
