Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data/preprocessors] concatenator should preserve order of concatenated #47997

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

martinbomio
Copy link
Contributor

@martinbomio martinbomio commented Oct 11, 2024

Make concatenator preserve order.
This is a breaking change, since the Concatenator since now the expectation is that the list of columns to include is specify always, and that list will define the order of the columns in the output column.

Why are these changes needed?

See issue #47996

Related issue number

Closes #47996

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch 4 times, most recently from 531a467 to b8cf5d6 Compare October 12, 2024 15:03
@martinbomio martinbomio marked this pull request as ready for review October 12, 2024 22:50
@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Oct 16, 2024
@richardliaw richardliaw changed the title fix: concatenator should preserve order of concatenated [data/preprocessors] concatenator should preserve order of concatenated Oct 16, 2024
@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch 2 times, most recently from 2f701df to 071ad2b Compare October 17, 2024 01:31
Comment on lines 62 to 63
columns: A list of columns to concatenate. The order will define the order
of the concatenated columns.
Copy link
Contributor

@richardliaw richardliaw Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
columns: A list of columns to concatenate. The order will define the order
of the concatenated columns.
columns: A list of columns to concatenate. The provided order of the columns
will be retained during concatenation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - should we just keep it as include?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with keeping it, but just want to point out that it would be still a breaking change, and that other preporcessors use columns to define the columns that the preprocessor would be apply to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be breaking in particular?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be a required argument right now, if non is provided it will fail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline - this makes sense and columns is more consistent with other preprocessors as well.

@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch from 071ad2b to 191af73 Compare October 17, 2024 01:48
@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch 2 times, most recently from 88acac8 to a5ec10e Compare October 17, 2024 13:07
@martinbomio martinbomio requested a review from a team as a code owner October 17, 2024 13:07
@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch 2 times, most recently from 351b643 to a88e22a Compare October 17, 2024 14:58
Copy link
Contributor

@scottjlee scottjlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall, just a couple small nits/docs requests

@@ -20,6 +20,8 @@ class Concatenator(Preprocessor):
:class:`~ray.air.util.tensor_extensions.pandas.TensorArrayElement` objects of
shape :math:`(m,)`, where :math:`m` is the number of columns concatenated.
The :math:`m` concatenated columns are dropped after concatenation.
The preprocessor preserves the order of the columns provided in the ``colummns``
argument and will use that order when on transform, and transform_batch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
argument and will use that order when on transform, and transform_batch.
argument and will use that order when calling ``transform()`` and ``transform_batch()``.

@@ -20,6 +20,8 @@ class Concatenator(Preprocessor):
:class:`~ray.air.util.tensor_extensions.pandas.TensorArrayElement` objects of
shape :math:`(m,)`, where :math:`m` is the number of columns concatenated.
The :math:`m` concatenated columns are dropped after concatenation.
The preprocessor preserves the order of the columns provided in the ``colummns``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i cannot add a comment to the docstring above this line because it wasn't modified in this PR. but can we also modify this line, to indicate that only columns specified in the columns arg will be included?

Combine numeric columns into a column of type
    :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@martinbomio martinbomio force-pushed the martinbomio/fix-concatenator-order branch from a88e22a to ddb92f4 Compare October 17, 2024 22:47
@martinbomio
Copy link
Contributor Author

@richardliaw @scottjlee adressed all your comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Concatenate preprocessor does not preserve order of concatenation
4 participants