Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window fuzzer test is flaky in nightly runs #11213

Open
kagamiori opened this issue Oct 10, 2024 · 5 comments · May be fixed by #11293
Open

Window fuzzer test is flaky in nightly runs #11213

kagamiori opened this issue Oct 10, 2024 · 5 comments · May be fixed by #11293
Labels
bug Something isn't working fuzzer-found

Comments

@kagamiori
Copy link
Contributor

Description

Example failure: https://github.com/facebookincubator/velox/actions/runs/11247502907/job/31271630633

Error Reproduction

No response

Relevant logs

No response

@kagamiori kagamiori added bug Something isn't working fuzzer Issues related the to Velox fuzzer test components. fuzzer-found labels Oct 10, 2024
@kagamiori kagamiori self-assigned this Oct 10, 2024
@kevinwilfong
Copy link
Contributor

I noticed this on some of my PRs as well, I was wondering if it my be an issue exposed by #10006 with NaNs in the off1 column

@kagamiori
Copy link
Contributor Author

kagamiori commented Oct 11, 2024

Update:
I reproduced the error with this command: velox_window_fuzzer_test --enable_window_reference_verification --presto_url=http://127.0.0.1:8080 --logtostderr=1 --minloglevel=0 --duration_sec=360 --num_batches=1 --batch_size=30 --only=stddev_samp --seed=2356821713 --duration_sec=0 --steps=1 -v=1.

This iteration runs Window[1][partition by [p0] order by [s0 DESC NULLS FIRST] w0 := stddev_samp(ROW["c0"]) RANGE between off0 FOLLOWING and off1 FOLLOWING]. There is one partition containing three input rows with the exact same values of sorting key, frame start, and frame end, i.e., they are peer rows and should all return the same window result. I found that WindowPartition::computeKRangeFrameBounds() doesn't handle NaN correctly and produces incorrect frame bounds. This further makes AggregateWindowFunction::analyzeFrameValues() to incorrectly consider the frames in this partition to be incremental and calls incrementalAggregation().

I'm continuing investigating.

Example queries to reproduce the mismatch:

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 asc
        RANGE BETWEEN 0.0 PRECEDING AND nan() PRECEDING 
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- different results

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 asc
        RANGE BETWEEN nan() PRECEDING AND 0.0 PRECEDING 
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- Presto throws while Prestissimo doesn't

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 desc
        RANGE BETWEEN 0.0 FOLLOWING  AND nan() FOLLOWING  
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- Presto throws while Prestissimo doesn't

SELECT
    s0,
    SUM(c0) OVER (
        ORDER BY
            s0 desc
        RANGE BETWEEN nan() FOLLOWING  AND 0.0 FOLLOWING  
    )
FROM (
    VALUES
        (1, 1.0),
        (2, 1.0),
        (3, 1.0)
) t(c0, s0); -- different results

@kagamiori
Copy link
Contributor Author

Both Presto and Velox results of the queries above are incorrect. prestodb/presto#23822

@kagamiori kagamiori removed the fuzzer Issues related the to Velox fuzzer test components. label Oct 14, 2024
@kagamiori
Copy link
Contributor Author

kagamiori commented Oct 14, 2024

Hi @aditi-pandit and @pramodsatya, the recent window fuzzer enhancement of k-range frames found this bug in both Presto and Velox. We're considering making window operations return NULL when any of the frame bound is NaN (prestodb/presto#23822 (comment)). Please let us know if you have any thoughts. Thanks!

@pramodsatya
Copy link
Collaborator

Hi @aditi-pandit and @pramodsatya, the recent window fuzzer enhancement of k-range frames found this bug in both Presto and Velox. We're considering making window operations return NULL when any of the frame bound is NaN (prestodb/presto#23822 (comment)). Please let us know if you have any thoughts. Thanks!

Thanks @kagamiori, returning null for Nan frame bounds would be better.

@kagamiori kagamiori removed their assignment Oct 16, 2024
kagamiori added a commit to kagamiori/velox that referenced this issue Oct 17, 2024
Summary:
NaN could appear as the frame bound of k-range frames in window 
operations. When NaN appear in either of the frame bounds, the window 
operation should return NULL at this row.

This diff fixes facebookincubator#11213.

Differential Revision: D64510519
@kagamiori kagamiori linked a pull request Oct 17, 2024 that will close this issue
kagamiori added a commit to kagamiori/velox that referenced this issue Oct 17, 2024
)

Summary:

NaN could appear as the frame bound of k-range frames in window 
operations. When NaN appear in either of the frame bounds, the window 
operation should return NULL at this row.

This diff fixes facebookincubator#11213.

Differential Revision: D64510519
kagamiori added a commit to kagamiori/velox that referenced this issue Oct 17, 2024
)

Summary:

NaN could appear as the frame bound of k-range frames in window 
operations. When NaN appear in either of the frame bounds, the window 
operation should return NULL at this row.

This diff fixes facebookincubator#11213.

Differential Revision: D64510519
kagamiori added a commit to kagamiori/velox that referenced this issue Oct 17, 2024
)

Summary:

NaN could appear as the frame bound of k-range frames in window 
operations. When NaN appear in either of the frame bounds, the window 
operation should return NULL at this row.

This diff fixes facebookincubator#11213.

Differential Revision: D64510519
kagamiori added a commit to kagamiori/velox that referenced this issue Oct 18, 2024
Summary:
Window fuzzer test is flaky due to incorrect handling of NaN in frame bounds in the Window operator (facebookincubator#11213). 
Presto needs to be fixed too to match Velox behavior (prestodb/presto#23822). 
Before then, make window fuzzer not generate NaN in frame bounds to avoid fuzzer failures.

Differential Revision: D64580491
facebook-github-bot pushed a commit that referenced this issue Oct 18, 2024
Summary:
Pull Request resolved: #11296

Window fuzzer test is flaky due to incorrect handling of NaN in frame bounds in the Window operator (#11213).
Presto needs to be fixed too to match Velox behavior (prestodb/presto#23822).
Before then, make window fuzzer not generate NaN in frame bounds to avoid fuzzer failures.

Reviewed By: amitkdutta

Differential Revision: D64580491

fbshipit-source-id: 28c0a26983c3fa8a78543d1b77691019dd70ffb0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fuzzer-found
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants