[SUPPORT] Add Support for Variant Type and Spark 4.0.0 Preview #12022

Open
soumilshah1995 opened this issue Sep 29, 2024 · 0 comments
Labels
feature-enquiry issue contains feature enquiries/requests or great improvement ideas

Comments

@soumilshah1995

Feature Request:

I would like to request support for the Variant type in Apache Hudi, along with compatibility with the Spark 4.0.0 preview. The Variant type introduced in Spark 4.0 is designed to speed up the handling of semi-structured data such as JSON, and is reported to be up to 8x faster than working with raw JSON strings. Supporting it could make Hudi considerably more efficient when processing complex data formats.
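
For illustration, here is a rough sketch of the kind of query this request is about. It only builds a Spark SQL string that parses a JSON string column into a Variant once and then pulls typed fields out of it. `parse_json` and `variant_get` are Spark 4.0 SQL functions; the helper `variant_select_sql`, the table name `events`, and the column name `payload` are made up for the example. Whether such a query can run against a Hudi table is exactly what this issue asks for, so treat it as plain Spark SQL and not as an existing Hudi API.

```python
def variant_select_sql(table, json_col, fields):
    """Build a Spark SQL query that parses a JSON string column into a
    Variant once and extracts typed fields from the parsed value.

    `fields` maps an output alias to a (JSON path, Spark SQL type) pair.
    This helper is hypothetical; only parse_json and variant_get are
    Spark 4.0 functions.
    """
    projections = ", ".join(
        f"variant_get(v, '{path}', '{sql_type}') AS {alias}"
        for alias, (path, sql_type) in fields.items()
    )
    return (
        f"SELECT {projections} "
        f"FROM (SELECT parse_json({json_col}) AS v FROM {table}) AS parsed"
    )


# Hypothetical table and column names, used only for illustration.
query = variant_select_sql(
    "events",
    "payload",
    {"city": ("$.address.city", "string"), "age": ("$.age", "int")},
)
print(query)
# With a Spark 4.0 session available, this could be run as: spark.sql(query).show()
```

The helper interpolates its arguments directly into the SQL text, so it is only suitable for trusted, hard-coded names like the ones above.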

Why is this needed:

Hudi’s support for large-scale data management would benefit greatly from the ability to handle semi-structured data types, such as those managed through the Variant Type.
As more organizations transition to Spark 4.0 (once officially released), maintaining compatibility will ensure that Hudi remains up-to-date with modern data processing pipelines.
Efficient handling of complex data formats like JSON and XML would make Hudi a more versatile solution for data lakes.

Use Case:

Processing and storing large datasets that contain a mix of structured and semi-structured data.
Leveraging Variant Type for faster querying and reduced overhead when dealing with nested and complex data structures.
Compatibility with Spark 4.0 will help early adopters of the latest Apache Spark features to continue using Hudi seamlessly in their pipelines.

References:

Spark 4.0.0 Preview with Variant Type Support
Delta Lake Variant Type Documentation

Additional Context: Currently, AWS services such as EMR and Glue do not fully support Spark 4.0. These platforms are expected to adopt it in the near future, so adding early support in Hudi would make the transition smoother for users.

@ad1happy2go ad1happy2go added the feature-enquiry issue contains feature enquiries/requests or great improvement ideas label Oct 1, 2024
Projects
Status: Awaiting Triage