Consider revisiting pkg download algorithms #358

Open
richierocks opened this issue Jun 28, 2017 · 5 comments

Comments

@richierocks

richierocks commented Jun 28, 2017

This follows a series of emails from Matt Dowle. He thinks that the current algorithm overweights automatic downloads, and that counting multiple downloads of a package from a single IP address in one day as a single download would give a more accurate picture of how many people are using which packages.

@ludov04
Contributor

ludov04 commented Jun 28, 2017

Thanks, that's an interesting idea.

We are indeed using the 'same' method as Arun in his cran.stats package: for each download, we look in a time window of [-60 sec, +60 sec] to see whether a reverse dependency was downloaded from the same IP.
However, we do not use cran.stats itself. The method was reimplemented in JavaScript, using our database to figure out dependencies, as we do not have an R server somewhere with all the packages installed.
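
For readers who don't want to dig through the JavaScript, here is a rough Python sketch of the windowing idea as described above. It is not the RDocumentation code: the `(timestamp, ip_id, package)` record layout, the `REVERSE_DEPS` map and the function name `classify_downloads` are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Time window described in this thread: +/- 60 seconds.
WINDOW = timedelta(seconds=60)

# Hypothetical reverse-dependency map: package -> packages that depend on it.
# In practice this would come from a database of package metadata.
REVERSE_DEPS = {
    "R6": {"dplyr"},
    "glue": {"dplyr"},
}

def classify_downloads(records, reverse_deps=REVERSE_DEPS, window=WINDOW):
    """Label each (timestamp, ip_id, package) record 'direct' or 'indirect'.

    A download is treated as 'indirect' when the same ip_id also fetched
    one of the package's reverse dependencies within +/- `window` of it.
    """
    # Group downloads by IP so each record is only compared with its own IP.
    by_ip = defaultdict(list)
    for ts, ip_id, pkg in records:
        by_ip[ip_id].append((ts, pkg))

    labels = []
    for ts, ip_id, pkg in records:
        dependents = reverse_deps.get(pkg, set())
        indirect = any(
            other_pkg in dependents and abs(other_ts - ts) <= window
            for other_ts, other_pkg in by_ip[ip_id]
        )
        labels.append((ts, ip_id, pkg, "indirect" if indirect else "direct"))
    return labels
```

The inner scan is quadratic per IP, which is fine for a sketch; a real implementation would presumably sort by timestamp and only scan neighbouring records.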

For reference, the code is here:
https://github.com/datacamp/RDocumentation-app/blob/master/api/services/ElasticSearchService.js
and
https://github.com/datacamp/RDocumentation-app/blob/master/api/services/DownloadStatsService.js

We'll look into Matt's idea.

@mattdowle

mattdowle commented Jun 28, 2017

Thanks for looking into it, Ludovic. I did not know RDocumentation-app was here with an Issues tracker, so this is great. I had emailed DataCamp last year and had been following up over email. Much better here on GitHub instead.

  • Could the method and time window be stated on, or linked from, the main https://www.rdocumentation.org/trends page? Since the original idea and implementation for splitting direct from indirect downloads came from Arun and his cran.stats package, it would be nice if that got some credit.

  • Are R6 and glue really the top 1 and 2 directly downloaded packages currently? It has happened before that one ip_id downloads the same package many thousands of times in a single day. For example, I replied to Rob Hyndman in Jan 2016 here:
    http://robjhyndman.com/hyndsight/fpp-downloads/#comment-2479833620
    Please be sure to click that comment's 'see more' button to see the plot I posted. Could the same thing be happening again?
    Here are my slides and video from 2015, where I suggested cleaning this dirty data by not counting an ip_id downloading the same package more than once in a single day.

I believe the proposed adjustment is almost trivial. It's just a unique of (date, ip_id, package) first (i.e. one extra line) before counting by package, so that every ip_id still counts, but not more than once per day.
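
To make the adjustment concrete, here is a minimal Python sketch of the unique-then-count step. It assumes the log has already been reduced to `(date, ip_id, package)` tuples; the function name `daily_unique_counts` is made up for this example and is not taken from cran.stats or RDocumentation.

```python
from collections import Counter

def daily_unique_counts(records):
    """Count downloads per package, counting each ip_id at most once per day.

    `records` is an iterable of hashable (date, ip_id, package) tuples.
    """
    # The "one extra line": keep only unique (date, ip_id, package) rows.
    unique_rows = set(records)
    # Then count by package as before.
    return Counter(pkg for _date, _ip_id, pkg in unique_rows)
```

With this, an ip_id that fetches the same package thousands of times on one day contributes 1 to that package's count for that day, while the same ip_id on a different day still counts again.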

Here's that plot, where the fpp package was being downloaded over 30,000 times in a single day by a single ip_id.

image
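
Cases like the one in the plot can be spotted by counting rows per (date, ip_id, package) before any deduplication. The following Python sketch shows one way to do that; the `threshold` default of 1000 and the name `heavy_downloaders` are arbitrary choices for illustration, not values used by any of the tools discussed here.

```python
from collections import Counter

def heavy_downloaders(records, threshold=1000):
    """Return (date, ip_id, package, n) rows with n >= threshold, largest first.

    `records` is an iterable of hashable (date, ip_id, package) tuples,
    one per raw download.
    """
    # Count how many times each ip_id fetched each package on each day.
    per_key = Counter(records)
    rows = [
        (date, ip_id, pkg, n)
        for (date, ip_id, pkg), n in per_key.items()
        if n >= threshold
    ]
    # Sort so the most extreme repeat downloaders come first.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Running something like this over a month of logs for a suspicious package would show whether a handful of ip_ids dominate its raw count.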

@filipsch
Contributor

filipsch commented Jan 5, 2018

@ludov04 hasn't this long since been addressed?

@mattdowle

@filipsch If it's resolved, that's great. I just assumed viridisLite was not really the number 1 most directly downloaded package. Is that true then? It's been number 1 for many months.
image

@ludov04
Contributor

ludov04 commented Jan 15, 2018

The algorithm was indeed revisited sometime in August to exclude repeated downloads from the same IP on the same day. However, as @mattdowle mentioned, it's unlikely that viridisLite, pillar and R6 are the top downloaded packages, so I guess something else is biasing the data.
I'm leaving this issue open; we'll revisit that feature more deeply in the next RDocumentation iteration.

4 participants