
[SUPPORT] - Data loss after 3 days following upgrade from Hudi 0.11.1 to 0.14.0 #11959

Open
RuyRoaV opened this issue Sep 18, 2024 · 3 comments
Labels

data-loss (loss of data only; use the data-consistency label for an inconsistent view)
priority:critical (production down; pipelines stalled; need help asap)

Comments


RuyRoaV commented Sep 18, 2024


Describe the problem you faced


We have a COW table that is updated via an UPSERT operation through a Glue job; the operations were initially performed on Hudi 0.11.1. The table is partitioned by year, month, and day.

Some days after upgrading to Hudi 0.14.0, we noticed that we had fewer rows in partitions starting from the upgrade date. Moreover, records for a given partition day were dropped with a delay of 3 days. We observed this behaviour when counting the records by partition using Glue or Athena.

On the other hand, we also have a Redshift Spectrum subscription built from this table, and when doing the row count check there, we could see the "correct" number of rows. However, we could also see duplicated data.

Furthermore, we upgraded 4 tables from Hudi 0.11.1 to Hudi 0.14.0, and this is the only table where we observed such behaviour.

To Reproduce

Steps to reproduce the behavior:

  1. Table in Hudi 0.11.1
  2. Upgrade to Hudi 0.14.0
  3. Wait 3 days to observe the data loss.

These are the write configurations we set:

[Screenshot: write configurations, 2024-06-21 13:12]
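Since the screenshot is not copy-pastable, here is a hypothetical reconstruction of what the write options for this kind of job typically look like, as a reference point for discussion. Every table name, field, and path below is a placeholder, not our actual configuration:

```python
# Hypothetical Hudi write options for an upsert into a COW table
# partitioned by year/month/day (all values are placeholders).
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "id",           # assumed key
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

# df is the incoming batch DataFrame produced earlier in the Glue job.
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://my-bucket/path/to/table"  # assumed base path
)
```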

Expected behavior

We should see the correct number of rows in Athena / Glue.

Could you please shed some light on why this could have happened?

Environment Description

  • Hudi version : 0.14.0

  • Spark version : 3.3.0 (Glue 4)

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No


danny0405 added the data-loss label on Sep 19, 2024

migeruj commented Sep 19, 2024

Hi! I'm another Hudi user like you; I'm not directly affiliated with the Hudi project.

Could you please format your write configurations as copyable JSON? That will make the issue easier to replicate. From what I can see, nothing stands out as a problem so far.

Also, are you using the hudi-aws-bundle for your Glue Job? There was a breaking change introduced in version 0.13.0, which might affect your setup, though I’m not sure if it applies in your case.

Check the breaking changes and behaviour changes in the 0.13.0 and 0.14.0 release notes:

0.14.0 Changes
0.13.0 Changes

Also check the known regressions: 0.14.0 and 0.14.1 have regressions related to duplicates with the ComplexKeyGenerator. Given that, consider using 0.13.0 until it is resolved.
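A quick way to tell whether that regression could apply is to check which key generator the table was written with; it is recorded in the table's `.hoodie/hoodie.properties` file. A minimal sketch, assuming an S3 bucket and table prefix (and that the property is present, which may depend on the Hudi version that created the table):

```python
import boto3

# Read the table's hoodie.properties from S3 (bucket and key are assumptions)
# and print any key-generator setting recorded there.
s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-bucket",
    Key="path/to/table/.hoodie/hoodie.properties",
)
for line in obj["Body"].read().decode("utf-8").splitlines():
    if "keygenerator" in line:
        print(line)  # e.g. hoodie.table.keygenerator.class=...ComplexKeyGenerator
```

If it reports ComplexKeyGenerator, ruling out the 0.14.x duplicates regression first would be worthwhile.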

If you’ve tried everything else, I recommend the following steps:

  1. Compare the checkpoints from before and after the Hudi upgrade to see whether any behaviour changed.
  2. Use the hudi-cli to check the commit history; this can help track down any issues with the data or commits (a Spark-based sketch follows below).
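If running hudi-cli is awkward from the Glue side, a rough Spark equivalent is to group by the `_hoodie_commit_time` metadata column that Hudi stores on every record; that shows which commits the surviving rows came from. A minimal sketch, with the table path assumed:

```python
# `spark` is the active SparkSession (ambient in Glue / spark-shell).
# Count surviving rows per commit; a commit after which older partitions
# shrink is a good place to start digging (base path is an assumption).
df = spark.read.format("hudi").load("s3://my-bucket/path/to/table")
(df.groupBy("_hoodie_commit_time")
   .count()
   .orderBy("_hoodie_commit_time")
   .show(100, truncate=False))
```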

danny0405 (Contributor) commented:

@ad1happy2go Do you have a chance to help reproduce this?

ad1happy2go added the priority:critical label on Sep 20, 2024
rangareddy commented:

Hi @RuyRoaV

I have a few questions before we identify the issue and suggest a solution:

  1. You mentioned that the row count matches the number of records in Redshift Spectrum. Could you please elaborate on how you tested and verified this count?
  2. How did you verify the record count by partition using Glue or Athena? More details would help us understand the issue and find a solution. You could also launch Spark and verify the count there (see the sketch after this list).
  3. You mentioned this issue affects a single table. Is there a difference in how this table was created compared to the others?
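For question 2, here is a minimal sketch of the Spark-side verification (the table path and partition column names are assumptions):

```python
# `spark` is the active SparkSession; adjust the base path to your table.
df = spark.read.format("hudi").load("s3://my-bucket/path/to/table")

# Rows per day partition: compare against the Athena/Glue and Spectrum counts.
df.groupBy("year", "month", "day").count().orderBy("year", "month", "day").show(50)

# Duplicate check: within a partition, _hoodie_record_key should be unique.
(df.groupBy("_hoodie_partition_path", "_hoodie_record_key")
   .count()
   .filter("count > 1")
   .show(20, truncate=False))
```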
