
Adds Amazon AWS S3 support #268 #298

Open · wants to merge 1 commit into main

Conversation

Lucas-Mc (Collaborator)

This change adds support for data stored in Amazon AWS S3 buckets. For the moment, wherever a function has a pn_dir parameter, the user can provide the S3 URI from which to find the desired input file. This should always begin with 's3' in order to work correctly. An example input would be: 's3://my-aws-bucket/'.

For example, I can read record 100 (both .dat and .hea) stored on my-aws-bucket like this:

>>> import wfdb
>>> rec = wfdb.rdrecord('100', pn_dir='s3://my-aws-bucket')

As mentioned in #73, I think it would be nice to transition away from WFDB's dependence on files and take an approach along these lines:

1. You have the wfdb-python library that only takes contents as input
2. You have helper functions (which download from PhysioNet and read the contents for you)

Each helper function would cover a separate backend (PhysioNet, AWS S3, Hadoop HDFS, etc.). This would probably be a significant change (I think pn_dir should be renamed to remote_dir at least), so something to look forward to in v4.0??
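The two-part split above could be sketched as a small dispatcher that routes on the URI scheme. All names below (open_remote and the _fetch_* helpers) are hypothetical stand-ins for illustration, not the actual wfdb API; the real helpers would fetch bytes from their backend and hand only the contents to the parsing core.

```python
from urllib.parse import urlparse

# Hypothetical backend helpers; real implementations would fetch
# bytes from S3, HTTP, or PhysioNet respectively.
def _fetch_s3(remote_dir, file_name):
    return ('s3', remote_dir, file_name)

def _fetch_http(remote_dir, file_name):
    return ('http', remote_dir, file_name)

def _fetch_physionet(pn_dir, file_name):
    return ('physionet', pn_dir, file_name)

def open_remote(remote_dir, file_name):
    """Pick a backend helper from the URI scheme; the parsing core
    would only ever see the returned contents."""
    scheme = urlparse(remote_dir).scheme
    if scheme == 's3':
        return _fetch_s3(remote_dir, file_name)
    if scheme in ('http', 'https'):
        return _fetch_http(remote_dir, file_name)
    # A bare directory such as 'mitdb/1.0.0' keeps today's
    # PhysioNet behavior.
    return _fetch_physionet(remote_dir, file_name)
```

Routing on the parsed scheme (rather than a string prefix) is what makes adding further backends like HDFS a local change.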

Fixes #268.

@bemoody (Collaborator) commented Aug 20, 2021

Nice to know this is possible, though I haven't tested it.

I do see a few problems with the implementation here that would need to be addressed:

  • Numerous functions check whether remote_dir.startswith('s3'). This clashes with the existing namespace: for example, rdrecord("foo", pn_dir="s345db/1.0") ought to read the record http://physionet.org/files/s345db/1.0/foo. If reusing the pn_dir API is desirable, the check needs to be unambiguous (for example, remote_dir.startswith('s3://') rather than remote_dir.startswith('s3')).

  • _stream_dat ignores start_byte and byte_count, which means it will give wrong results when reading a subset of a signal file, or when reading a signal file with a prolog.

  • Rather than creating abstractions to simplify the logic, this adds complexity to both the high-level record functions and the low-level download functions.

For a lot of reasons, including reducing the existing complexity, I'm trying to replace the download functions with an abstract file I/O layer (see pull #320).

I know nothing about s3fs, but just looking at how you're using it here, I'm guessing that it already provides a BufferedIOBase interface or something similar, which means it ought to be simpler, and much more maintainable, to build this on top of pull #320 instead.
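If the S3 backend does hand back a file-like object, honoring start_byte and byte_count (the second point above) reduces to an ordinary seek/read. The helper name below is hypothetical, and io.BytesIO merely stands in for whatever file object the backend returns:

```python
import io

def read_range(fileobj, start_byte, byte_count):
    """Read byte_count bytes starting at start_byte from any
    file-like object; a BufferedIOBase from an I/O layer would
    work the same way as the io.BytesIO stand-in used here."""
    fileobj.seek(start_byte)
    return fileobj.read(byte_count)

# A .dat file with a 6-byte prolog before the signal bytes:
dat = io.BytesIO(b'PROLOGsignalbytes')
read_range(dat, 6, 6)  # b'signal' -- the prolog is skipped
```

Ignoring these two arguments, as _stream_dat currently does, would instead return bytes from offset 0, which is exactly the wrong-subset bug described above.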

I'd welcome thoughts from anyone about what the ideal API should look like. pn_dir is pretty clunky in my mind.

Successfully merging this pull request may close issue: read records stored in AWS S3.