[data] preprocessor: use map_batches in MaxAbsScaler, MinMaxScaler, UniformKBinsDiscretizer #48097

hongchaodeng · 2024-10-18T01:22:23Z

Why are these changes needed?

dataset.aggregate requires fully materializing the dataset in memory.
This PR rewrites MaxAbsScaler, MinMaxScaler, and UniformKBinsDiscretizer to compute the necessary statistics with a map_batches pass instead of dataset.aggregate.
This gives these preprocessors a streaming implementation.
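
To illustrate the idea, here is a minimal sketch of the per-batch approach. It is not the code in this PR. It uses only the standard library; the helper names (`batch_min_max`, `merge_min_max`, `iter_batches`) are hypothetical, and batches are assumed to be non-empty dicts mapping column names to lists of numbers. In Ray Data, a per-batch function of roughly this shape would be passed to `Dataset.map_batches`, and the small partial results would then be merged.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Batch = Dict[str, List[float]]
Stats = Dict[str, Tuple[float, float]]


def batch_min_max(batch: Batch, columns: List[str]) -> Stats:
    """Partial statistics for one batch: (min, max) per requested column."""
    return {col: (min(batch[col]), max(batch[col])) for col in columns}


def merge_min_max(partials: Iterable[Stats]) -> Stats:
    """Fold per-batch partials into global (min, max) per column.

    Only one small partial is consumed at a time, so the full dataset
    never has to be held in memory.
    """
    merged: Stats = {}
    for partial in partials:
        for col, (lo, hi) in partial.items():
            if col in merged:
                cur_lo, cur_hi = merged[col]
                merged[col] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                merged[col] = (lo, hi)
    return merged


def iter_batches() -> Iterator[Batch]:
    """Stand-in for a streamed dataset (hypothetical example data)."""
    yield {"x": [1.0, -4.0], "y": [10.0, 20.0]}
    yield {"x": [3.0, 2.0], "y": [5.0, 15.0]}


columns = ["x", "y"]
stats = merge_min_max(batch_min_max(b, columns) for b in iter_batches())

# MaxAbsScaler only needs the largest absolute value per column,
# which can be derived from the same (min, max) pairs.
max_abs = {col: max(abs(lo), abs(hi)) for col, (lo, hi) in stats.items()}
```

The same (min, max) pairs cover all three preprocessors in this sketch: MinMaxScaler uses them directly, MaxAbsScaler derives the maximum absolute value from them, and a uniform discretizer can split the [min, max] range into equal-width bins.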

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

[data] preprocessor: use map_batches in MaxAbsScaler, MinMaxScaler, U…

5ba5ddd

…niformKBinsDiscretizer Signed-off-by: hongchaodeng <[email protected]>

hongchaodeng requested review from scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners October 18, 2024 01:22

hongchaodeng added the "go" label (add ONLY when ready to merge, run all tests) Oct 18, 2024
