
Adds feature displaying project view count #1101

Open

wants to merge 10 commits into base: dev

Conversation

@shulim (Contributor) commented Jul 15, 2020

This is a work in progress that starts to address #738. The aim is to display a view count for each project, visible to anyone who accesses the project page, to give a sense of how much a project has been referenced so far. The view count data is obtained by parsing NGINX logs. Here is what it looks like so far:

[Screenshot: Screen Shot 2020-07-15 at 2 20 34 PM, showing the view count on the published project page]

Here are the changes/new files:

  • physionet-django/project/models.py: creates a Metrics table that stores the daily and running total view counts for each project, keyed by date.

  • physionet-django/project/templates/project/published_project.html: adds a line to display the latest view count, as shown in the picture.

  • physionet-django/project/views.py: handles Metrics view

  • physionet-django/project/management/commands/get_metrics_data.py: a management command that can be run regularly to update view counts as each log is produced daily (it can be called using python manage.py get_metrics_data [path_to_file]). The log_parser function parses the logs to count unique views (defined here as visits from distinct IP addresses), update_metrics takes the data from log_parser and updates the Metrics table in the database, and validate_log_file makes sure the log file can/should be parsed. (A rough sketch of how these pieces fit together follows this list.)

  • physionet-django/project/management/commands/test_get_metrics_data.py: tests the get_metrics_data command using 3 sample log files. It can be run with python manage.py test project.management.commands.test_get_metrics_data (the test only works if resetdb and loaddemo are run beforehand; I'm not sure if this is an issue. I attempted to create a setUp function that could reset the database before each test, but haven't been able to get it working yet).

  • physionet-django/project/fixtures/physionet_access.log.1, physionet-django/project/fixtures/physionet_access.log.2, and physionet-django/project/fixtures/physionet_access.log.3 are sample log files I generated using log-generator (https://pypi.org/project/log-generator/). The IP addresses and user agents are random, and the user names were taken from the PhysioNet demo. I wasn’t sure where to put these logs, so I put them in fixtures—please let me know if there might be a better place or way to store them.
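
For orientation, here is a rough sketch of how the pieces of get_metrics_data.py fit together, as described above. This is not the actual code from the PR: the command class structure and the argument name are assumptions, while log_parser, update_metrics, and validate_log_file are the functions named in the list.

    # Hypothetical outline only; the bodies of the helper functions are omitted.
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = 'Update project view counts from an NGINX access log.'

        def add_arguments(self, parser):
            # path_to_file, as in: python manage.py get_metrics_data [path_to_file]
            parser.add_argument('path_to_file', type=str)

        def handle(self, *args, **options):
            filename = options['path_to_file']
            validate_log_file(filename)        # make sure the file can/should be parsed
            view_data = log_parser(filename)   # unique views (distinct IPs) per project
            update_metrics(view_data)          # write the counts to the Metrics table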

Please let me know your thoughts on this approach or any comments/questions you may have. Thank you!

@tompollard (Member) left a comment

Thanks @shulim, this is a great start. I've added a few comments, though haven't tested or looked at the parsing functionality yet. You should be able to fix the test by adding the migration.

@shulim force-pushed the count_views branch 6 times, most recently from 535fd71 to a570373 on July 17, 2020 at 22:56

@Lucas-Mc (Contributor) commented Jul 20, 2020

Hey @shulim! Thanks for all the work on this! It looks good, I just have one comment / suggestion. Should we mention to users that the view count is only cumulative starting at a certain date (the date of the oldest log file)? I think this would prevent confusion from authors who are wondering why their project only has 3 views if it's been on PhysioNet for 4 years.

Maybe this could be a circle question mark hover element such as the one after Roger's name in your photo that says "Note: the view count is cumulative as of July 20, 2020" or as simple as writing "16 views starting July 20, 2020" instead of just "16 views". Anyways, nice work!

EDIT: Maybe you could also explain in the hover element how the view count is determined? Or would this cause users to try and hack the system to up their project's view counts disingenuously?

@shulim (Contributor, Author) commented Jul 22, 2020

Hi @Lucas-Mc, thanks for the helpful feedback! I’ll work on a circle question mark hover element to add a note on when the views started being counted.

EDIT: Maybe you could also explain in the hover element how the view count is determined?

I think adding a quick explanation would be nice! I can add a note about views being counted as unique IP address visits to any of the project’s pages on a given day (will think more on the wording as I go, but if you have any suggestions, I’ll work them in).

Or would this cause users to try and hack the system to up their project's view counts disingenuously?

I’m not sure if I can accurately predict users’ responses to this—though I currently think that this wouldn’t be a large concern, I’m also open to further discussion on this, maybe once I add the question mark element. Thanks again!

@bemoody (Collaborator) commented Jul 28, 2020

Thanks, @shulim. This is a great start!

One general issue is that this requires that:

  • all requests in a given log file were made on the same date
  • there is only one log file for a particular date

It's better not to rely on such assumptions; instead, we should look at the date/time of each request, add up the number of requests for each date, and update the corresponding counters in the database.

At present, the system is configured to run logrotate at midnight local time, which means that if everything is working, then a new log file will be started a few seconds or minutes after midnight. But there are a lot of reasons that might not always be the case, and we don't want to rely on it.
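
As a minimal sketch of that suggestion (illustrative only; it assumes NGINX combined-format lines, where the timestamp is the fourth whitespace-separated field, e.g. "[28/Jul/2020:00:00:03"):

    import datetime
    from collections import Counter

    def requests_per_date(filename):
        """Tally requests by the date in each log line, rather than
        assuming one log file corresponds to one date."""
        counts = Counter()
        with open(filename) as log_file:
            for line in log_file:
                parts = line.split()
                try:
                    # parts[3] looks like "[28/Jul/2020:00:00:03"
                    date = datetime.datetime.strptime(
                        parts[3].lstrip('['), '%d/%b/%Y:%H:%M:%S').date()
                except (IndexError, ValueError):
                    continue
                counts[date] += 1
        return counts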

Some details about the database model:

(models.py; line 2636)

    core_project = models.ForeignKey('project.CoreProject', 
                                     on_delete=models.DO_NOTHING, 
                                     default=None)

Don't use DO_NOTHING. Usually the correct value for on_delete is CASCADE; sometimes something else is more appropriate. DO_NOTHING is dangerous and for experts only.

Neither 'core_project' nor 'date' should have a default value (and None doesn't make sense anyway since these fields are not nullable.)

We probably want to say that these objects should be unique for a given project and date:

    class Meta:
        unique_together = (('core_project', 'date'),)
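
Taken together, the model might end up looking roughly like this (a sketch only; the viewcount and running_viewcount fields follow the version of the Metrics model quoted later in this thread):

    class Metrics(models.Model):
        core_project = models.ForeignKey('project.CoreProject',
                                         on_delete=models.CASCADE)
        viewcount = models.PositiveIntegerField(default=0)
        running_viewcount = models.PositiveIntegerField(default=0)
        date = models.DateField()

        class Meta:
            unique_together = (('core_project', 'date'),)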

Some details about the parsing:

(get_metrics_data.py; line 98)

    my_file = open(filename).read()
    file_lines = my_file.split('\n')
    # ...
    for line in file_lines:
        # do stuff with line

Reading the entire log file into a gigantic string, and then splitting that into a list of strings, is not efficient. It's much better to read the file one line at a time:

    with open(filename) as file:
        for line in file:
            # do stuff with line

(line 106)

            if '?' not in parts[6] and 'GET' in parts[5] and (
                    '/files' in parts[6] or '/content' in parts[6]):

Why do you want to exclude requests with a query string ("?")? If there are special cases, then explain what those are.

The expression '/files' in parts[6] or '/content' in parts[6] will identify paths that contain the substring "/files" or "/content". But that's not what you want: you only want to find paths that begin with a slash, followed by "files" or "content", followed by another slash (which is then followed by a project slug.)

For example, you don't want to recognize:

  • /content
  • /filesystems/abcd
  • /about/database/files
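
One way to express that check (a sketch, not necessarily how the PR implements it; it also strips the query string, which is discussed further below):

    def extract_slug(path):
        """Return the project slug for paths like /content/<slug>/... or
        /files/<slug>/..., and None for anything else."""
        path = path.split('?')[0]       # ignore any query string
        segments = path.split('/')      # '/content/<slug>/1.0' -> ['', 'content', '<slug>', '1.0']
        if len(segments) > 2 and segments[1] in ('content', 'files') and segments[2]:
            return segments[2]
        return None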

(line 117)

        except IndexError:
            if line:
                print("Invalid line in log:", line)

I imagine the print was for debugging; we don't want to print anything here.

@shulim (Contributor, Author) commented Jul 30, 2020

Thanks @bemoody for the detailed feedback! I'll look at this and get back to you next week on the suggested modifications.

@shulim (Contributor, Author) commented Aug 5, 2020

@bemoody Thank you again for your suggestions! I've incorporated most of them, and I have a few questions on some:

One general issue is that this requires that:

  • all requests in a given log file were made on the same date
  • there is only one log file for a particular date

It's better not to rely on such assumptions; instead, we should look at the date/time of each request, add up the number of requests for each date, and update the corresponding counters in the database.

Thank you, that makes sense! I originally had the log parser keeping track of different dates within a single log file, and I can go back to that. I am, however, a little concerned about ensuring that logs don't get double-counted. If, for example, I have information in my table already from a log with entries on August 5, I don't want to accidentally call the command on the same log file and re-add the Aug 5 entries to the total view count. The current way I have in place to do this is to find the latest date for a slug and make sure that the additional views I add are for a date after the last update. If I do not assume that there is only one log file for a particular date, do you have any suggestions for how to go about preventing double-counting?

One thing that comes to mind is storing the IP addresses in the Metrics table so that I only count an IP address viewing a project page once per day, though I'm not sure if that could be a privacy concern. That would also significantly change how I store/access the data in multiple other places, but I'd be happy to work on that if it helps address my not-so-sound assumptions.

Some details about the database model:

    core_project = models.ForeignKey('project.CoreProject', 
                                     on_delete=models.DO_NOTHING, 
                                     default=None)

Don't use DO_NOTHING. Usually the correct value for on_delete is CASCADE, sometimes something else is more appropriate. DO_NOTHING is dangerous and for experts only.

Sounds good, I changed to CASCADE. I plan to commit this, along with the other changes from your feedback, all together after sorting out the other parts here.

Neither 'core_project' nor 'date' should have a default value (and None doesn't make sense anyway since these fields are not nullable.)

When I originally had 'core_project' and 'date' without default values, I was prompted to enter default values when attempting to make the migration. Do you have any thoughts on how to address this?

We probably want to say that these objects should be unique for a given project and date:

    class Meta:
        unique_together = (('core_project', 'date'),)

Thanks, that makes sense--I added this in the Metrics class.

Some details about the parsing:
Reading the entire log file into a gigantic string, and then splitting that into a list of strings, is not efficient. It's much better to read the file one line at a time:

    with open(filename) as file:
        for line in file:
            # do stuff with line

Thank you for the suggestion; I replaced my original code with these lines.

(line 106)

            if '?' not in parts[6] and 'GET' in parts[5] and (
                    '/files' in parts[6] or '/content' in parts[6]):

Why do you want to exclude requests with a query string ("?") ? If there are special cases then explain what those are.

I'm trying to avoid counting logs that contain requests similar to the following:
"GET /content/?topic=brain+tumor&orderby=relevance-desc&types=0&types=1&types=2&types=3 HTTP/1.1"
I intend to exclude queries like the one above from being counted as a view. However, I'd be open to other suggestions on getting around this, if applicable.

The expression '/files' in parts[6] or '/content' in parts[6] will identify paths that contain the substring "/files" or "/content". But that's not what you want: you only want to find paths that begin with a slash, followed by "files" or "content", followed by another slash (which is then followed by a project slug.)

Thanks, I added a slash so that the parser now searches for paths with /files/ and /content/.

(line 117)

        except IndexError:
            if line:
                print("Invalid line in log:", line)

I imagine the print was for debugging; we don't want to print anything here.

Noted. Instead of if line: and the print statement, I put continue because I don't want a single weirdly formatted line to raise something general like an InvalidLogError. I'm not sure if this is the right move, though, so please let me know your thoughts on this.

@bemoody (Collaborator) commented Aug 6, 2020 via email

@shulim (Contributor, Author) commented Aug 10, 2020

@bemoody thanks for your response! I believe I've addressed all the points we discussed and committed them, so if you get a chance, please let me know what you think.

Maybe we could add a model that simply records which log files have been processed, perhaps identified by the file's size and modification time? Or even just identified by the earliest/latest timestamps in the file?

Sounds good! I ended up storing the filename and creation timestamp in a new MetricsLogData table. I am hashing each file parsed and using the hashes stored in the table to determine whether or not a file will be parsed.
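
A minimal sketch of the hashing idea described here, assuming a SHA-256 digest of the file contents (the helper name and the exact check are illustrative; log_hash is the field on MetricsLogData):

    import hashlib

    def file_hash(filename):
        """Hash the log file contents so an already-parsed file can be
        recognized and skipped."""
        digest = hashlib.sha256()
        with open(filename, 'rb') as log_file:
            for chunk in iter(lambda: log_file.read(8192), b''):
                digest.update(chunk)
        return digest.hexdigest()

    # e.g. skip the file if its hash is already recorded:
    # if MetricsLogData.objects.filter(log_hash=file_hash(filename)).exists(): ...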

I think that should only happen if you're adding new columns (fields) to an existing table (model). Since this table doesn't exist yet, a default value isn't necessary. So if you delete the project/migrations/0045_metrics.py file and re-run the makemigrations command, it shouldn't complain.

This works great, thanks!

But you could write something like:

    path = parts[6].split('?')[0]
    slug = path.split('/')[2]

to ignore any query string and just extract the project slug. In general, any request may contain query parameters and we want to count them either way.

Makes sense, incorporated this.

Yeah, I think if there are lines you can't parse then ignoring them is probably the best strategy in general. Maybe it'd be good to print a warning message just showing the number of unparseable lines?

Okay, did this!

@bemoody (Collaborator) commented Aug 13, 2020

Thanks! Great improvements, although I still see some small issues here.

In the Metrics class, to mark fields as "unique together" you need to place the "unique_together" attribute inside a "Meta" class. See the Affiliation class for an example of how this is done.

In the update_metrics function, then, we want to add to the existing Metrics objects, if they exist, rather than creating new objects. To put it another way, we want the Metrics object to indicate the number of hits that actually occurred on August 13, even though some of those will have been recorded in the "August 12" and some in the "August 13" log file.

(And we don't want to have two Metrics objects for August 13, because then we don't know which object to look at.)

Finally, we don't want to create a MetricsLogData object until after we've parsed the file and saved the results - if the process is interrupted before the results are saved, it shouldn't record the file as having been processed. The best way would be to do all of the database updates - creating and modifying Metrics objects, and creating the MetricsLogData object - as a single atomic transaction.

In theory we could do this:

with transaction.atomic():
    if a MetricsLogData object exists:
        display a warning
    else:
        create a MetricsLogData object
        parse the file
        create/update Metrics objects

But it's better not to hold database locks for a long time, so it'd be better to:

parse the file
with transaction.atomic():
    if a MetricsLogData object exists:
        display a warning
    else:
        create a MetricsLogData object
        create/update Metrics objects
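
In Django terms, that outline might look roughly like the following (a sketch only; log_parser, update_metrics, and file_hash stand in for whatever the PR names these steps, and the warning text is illustrative):

    import datetime
    import os
    from django.db import transaction

    def process_log(filename):
        view_data = log_parser(filename)          # parse outside the transaction
        log_hash = file_hash(filename)
        creation_datetime = datetime.datetime.fromtimestamp(
            os.path.getctime(filename), datetime.timezone.utc)
        with transaction.atomic():
            if MetricsLogData.objects.filter(log_hash=log_hash).exists():
                print('Warning: log file already processed; skipping.')
                return
            MetricsLogData.objects.create(filename=filename,
                                          creation_datetime=creation_datetime,
                                          log_hash=log_hash)
            update_metrics(view_data)             # create/update Metrics objects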

@shulim (Contributor, Author) commented Aug 17, 2020

Thank you @bemoody for the continued feedback! I've tried to address each point below.

In the Metrics class, to mark fields as "unique together" you need to place the "unique_together" attribute inside a "Meta" class. See the Affiliation class for an example of how this is done.

Oops, not sure how I missed that. Fixed it!

In the update_metrics function, then, we want to add to the existing Metrics objects, if they exist, rather than creating new objects. To put it another way, we want the Metrics object to indicate the number of hits that actually occurred on August 13, even though some of those will have been recorded in the "August 12" and some in the "August 13" log file.

(And we don't want to have two Metrics objects for August 13, because then we don't know which object to look at.)

Oh yeah, that's important. I'm now using get_or_create for the project in update_metrics.

Finally, we don't want to create a MetricsLogData object until after we've parsed the file and saved the results - if the process is interrupted before the results are saved, it shouldn't record the file as having been processed. The best way would be to do all of the database updates - creating and modifying Metrics objects, and creating the MetricsLogData object - as a single atomic transaction.

In theory we could do this:

with transaction.atomic():
    if a MetricsLogData object exists:
        display a warning
    else:
        create a MetricsLogData object
        parse the file
        create/update Metrics objects

But it's better not to hold database locks for a long time, so it'd be better to:

parse the file
with transaction.atomic():
    if a MetricsLogData object exists:
        display a warning
    else:
        create a MetricsLogData object
        create/update Metrics objects

That's neat, thank you! I've followed that general outline in my latest commit.

@bemoody (Collaborator) commented Aug 20, 2020

One small issue in handle():

creation_datetime = make_aware(datetime.datetime.strptime(
    time.ctime(os.path.getctime(filename)),
    "%a %b %d %H:%M:%S %Y"))

This is rather convoluted, and also would be wrong for one hour every year. :)
The simple way to convert a Unix time to a datetime object is:

creation_datetime = datetime.datetime.fromtimestamp(
    os.path.getctime(filename), datetime.timezone.utc)

I was confused about what you're trying to measure here. When you say "view count", I would assume that means the number of times a page was viewed (or, really, the number of times the page was loaded by a web browser.)

But that's not what you're counting here; you're trying to count the number of distinct visitors in each one-day period (assuming that roughly, one person corresponds to one IP address.) You did explain this, but I missed it in my first few times reading.

So, some of the issues I've mentioned don't make a lot of sense in that context. Counting distinct values is different from counting events. But please try to more clearly express what you mean, both in terms of variable names used in the code, and how the information is presented on the project page.

I may have more to add after I've thought about this a little more, but thanks for your work so far!

@shulim (Contributor, Author) commented Sep 2, 2020

The simple way to convert a Unix time to a datetime object is:

creation_datetime = datetime.datetime.fromtimestamp(
    os.path.getctime(filename), datetime.timezone.utc)

@bemoody Thank you for the suggestion; I've updated the code accordingly.

I was confused about what you're trying to measure here. When you say "view count", I would assume that means the number of times a page was viewed (or, really, the number of times the page was loaded by a web browser.)

But that's not what you're counting here; you're trying to count the number of distinct visitors in each one-day period (assuming that roughly, one person corresponds to one IP address.) You did explain this, but I missed it in my first few times reading.

So, some of the issues I've mentioned don't make a lot of sense in that context. Counting distinct values is different from counting events. But please try to more clearly express what you mean, both in terms of variable names used in the code, and how the information is presented on the project page.

I may have more to add after I've thought about this a little more, but thanks for your work so far!

Thank you for your thoughts on this! I currently consider “views” and “[distinct] visitors (per day)” to be synonymous, at least from my less-technical perspective. I think that counting the total number of times a page was loaded by a web browser, if one person did it multiple times in a day, might make it too easy to “hack” the system (Lucas’s concern mentioned earlier), whereas refreshing the count every day for distinct visitors would hopefully preserve most of the page’s activity, excluding repeat visitors within a day. From reading up on how different sites count views, it seems like each has its own definition of the term, so I settled on this early on after discussing briefly with you, Tom, and Lucas. I also believe that "views" is the most commonly used term for metrics related to what I am counting, so I thought that this might make more sense than saying "50 visitors" or another less commonly-used term. However, I realize the importance of clarity and am open to refining/revising this definition.

Edit: I accidentally closed this, but I'm still open for feedback!

@shulim closed this Sep 2, 2020
@shulim reopened this Sep 2, 2020

@bemoody (Collaborator) commented Sep 3, 2020

I agree that trying to estimate the number of distinct visitors is a more useful measurement of project "popularity" than simply looking at the number of page requests or bytes downloaded.

(Page requests and bytes downloaded are also useful metrics that I think we should be collecting for other reasons - and no one number can tell you the whole story - but we can add these and other things later, once we have the groundwork in place.)

I don't agree that we should worry about people "gaming" the system, because anything we do here can be gamed. Anybody can have 1000 different IP addresses in one day, if they want to. Nobody should put much stock in these numbers; what's really important are questions like how many courses are using the data, and how many peer-reviewed papers are being published.

One remaining issue here is that the logic of update_metrics doesn't deal well with the few seconds after midnight. The way update_metrics is currently written, all of the visits from the current log file are added together, and associated with the latest date seen for that project. If someone is actively downloading the project at midnight (which is usually the case for mimic-cxr), the latest date for the project will be today; if not, the latest date will be yesterday.

Furthermore, the update function adds to the existing "running_viewcount" but replaces the existing "viewcount". This results in very weird and inconsistent information being stored in the database.

For example, if we assume log rotation happens 15 seconds after midnight, and project X has:

100 visitors between 2020-08-19 00:00:15 and 2020-08-19 23:59:59
 no visitors between 2020-08-20 00:00:00 and 2020-08-20 00:00:14
200 visitors between 2020-08-20 00:00:15 and 2020-08-20 23:59:59
  5 visitors between 2020-08-21 00:00:00 and 2020-08-21 00:00:14
150 visitors between 2020-08-21 00:00:15 and 2020-08-21 23:59:59
 no visitors between 2020-08-22 00:00:00 and 2020-08-22 00:00:14

the database would end up looking like this:

date         viewcount  running_viewcount
2020-08-19   100        100
2020-08-21   150        455

...205 visits have "vanished", and there is no entry for 2020-08-20 at all.

Moreover, since those 5 visitors just after midnight were probably also represented among the 200 and the 150, we don't really want to count those as 5 additional visits anyway.

One possibility would be to revert to logic like you had before, where we assume one log file = one day, and associate all visits in the log file with a single date. In effect, treating the log rotation (rather than local midnight) as the start/end of the bucket. This isn't a great long-term solution but should make things work "well enough for now."

The better solution, in my view, would be to define visits in a way that is independent of the time of day - one request at 23:55 followed by one at 00:05 should be considered one visit, not two. Doing that would require either saving more information after each run, or parsing two log files at a time to obtain overlapping windows. Either way would require a decent amount of work, so I don't want to suggest that you try to implement it at this stage.

@shulim (Contributor, Author) commented Jan 7, 2021

One possibility would be to revert to logic like you had before, where we assume one log file = one day, and associate all visits in the log file with a single date. In effect, treating the log rotation (rather than local midnight) as the start/end of the bucket.

@bemoody Thanks for the detailed feedback, and I'm sorry for the delay in getting back to you. I went forward with the one log file = one day approach, since although the few visitors who visit shortly after midnight may be counted with the previous day's total, the count displayed won't be affected. In get_metrics_data.py, I changed the format of the dictionary outputted by log_parser to fit the approach, as well as the update_metrics function.

In addition to any feedback you have on this, is there anything else that might need attention before merging? I'm planning to work on a small notice to indicate the day that views started being counted.

Thank you again!

@tompollard (Member) commented Sep 21, 2021

@shulim We'd like to make a push to merge this! Please could you rebase on dev to deal with the conflict? It is less of a scary merge than it initially seems.

We (well @cx1111) recently refactored the project models.py for readability. As you'll see, models.py now includes a bunch of imports from a modelcomponents package.

We will (I think) need to move the following chunk to one of the modelcomponents/ modules (e.g. metadata.py, access.py, or publishedproject.py; my preference is maybe metadata.py):

    class Metrics(models.Model):
        '''
        Stores the daily and running total view counts for each project, as well
        as date recorded.
        '''
        core_project = models.ForeignKey('project.CoreProject',
                                         on_delete=models.CASCADE)
        viewcount = models.PositiveIntegerField(default=0)
        running_viewcount = models.PositiveIntegerField(default=0)
        date = models.DateField()

        class Meta:
            unique_together = (('core_project', 'date'),)


    class MetricsLogData(models.Model):
        '''
        Stores the filenames, creation timestamps, and hashes of each log file
        that is parsed.
        '''
        filename = models.CharField(max_length=128)
        creation_datetime = models.DateTimeField()
        log_hash = models.CharField(max_length=128)

@tompollard (Member) commented:
Bump! Message recently received from one of our dataset contributors:

I note that you do not give the download statistics on your website so I was wondering if there is any way that I can find out how many times this dataset has been downloaded, and monitor the number of downloads over time.
