RHIPE - R and Hadoop Integrated Processing Environment

RHIPE (phonetic spelling: hree-pay’ [1]) is a Java package that integrates the R environment with Hadoop, the open source implementation of Google’s MapReduce.

[1] This is Greek for "a moment in time". See here for pronunciation: Greek Lexicon

Using RHIPE it is possible to code map-reduce algorithms in R, e.g.

library(Rhipe)

## Map: split each input line on whitespace and emit (word, TRUE) for every word.
m <- expression({
  y <- unlist(strsplit(unlist(map.values), "[[:space:]]+"))
  sapply(y, function(r) rhcollect(r, TRUE))
  ## instead of the previous line, you could also pre-aggregate inside the map:
  ## z <- table(y)
  ## sapply(names(z), function(r) rhcollect(r, z[[r]]))
})
## Reduce: sum the values received for each key (word).
r <- expression(
    pre = {
      count <- 0
    },
    reduce = {
      count <- sum(as.numeric(unlist(reduce.values)), count)
    },
    post = {
      rhcollect(reduce.key, count)
    })
## comb=TRUE also runs the reduce expression as a combiner.
z <- rhmr(map = m, reduce = r, comb = TRUE, inout = c("text", "sequence"),
          ifolder = "/tmp/50mil", ofolder = "/tmp/tof")
rhex(z)

Or simply load Rhipe and type

rhwordcount(infolder,outfolder)

where infolder is the input text file of words (or a folder of such files) and outfolder is the destination directory.
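For readers without a cluster at hand, the logic of the map and reduce expressions above can be imitated in plain R. This is only a local sketch under stated assumptions: it does not use RHIPE or Hadoop at all, and the name `local_wordcount` and the sample input are made up for illustration.

```r
# Local, single-process imitation of the word count job (no RHIPE, no Hadoop).
# local_wordcount is a hypothetical helper, not part of the Rhipe package.
local_wordcount <- function(lines) {
  # "map": split every line on whitespace, one emitted word per token
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]  # drop empty tokens caused by leading blanks
  # "shuffle": group a value of 1 per occurrence by its key (the word)
  grouped <- split(rep(1, length(words)), words)
  # "reduce": sum the values collected for each key
  vapply(grouped, sum, numeric(1))
}

counts <- local_wordcount(c("the quick brown fox", "the lazy dog the end"))
print(counts)
```

The result is a named numeric vector with one entry per distinct word, which mirrors the (word, count) pairs the reduce step emits with rhcollect.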

1 Mailing List

The mailing list is hosted on Google Groups at http://groups.google.com/group/rhipe . Your first post will be moderated.

2 Data types supported

A useful subset of R data types is supported, but not all of them; see Data Types.

3 More Information

For more information about what RHIPE is and is not, read the FAQ. Please note that RHIPE does not work on Mac OS X Snow Leopard.

4 Download

Source

The source code is hosted on GitHub at http://github.com/saptarshiguha/RHIPE/

To check out the current version, install git and run

git clone git://github.com/saptarshiguha/RHIPE.git

To download a specific version X, e.g. 0.45

git clone git://github.com/saptarshiguha/RHIPE.git
cd RHIPE
git checkout 0.45

The current version is always the master.

Versions

Read the documentation for installation. Current is the latest version.

| Version | Download | Date |
|---|---|---|
| 0.61 | Rhipe_0.61.tar.gz | Wed Sep 08 03:12:20 EDT 2010 |
| 0.60 | Rhipe_0.60.tar.gz | Fri May 28 17:13:16 EDT 2010 |
| 0.59 | Rhipe_0.59.tar.gz | Tue May 04 18:15:26 EDT 2010 |
| 0.58 | Rhipe_0.58.tar.gz | Fri Apr 09 01:41:48 EDT 2010 |
| 0.57 | Rhipe_0.57.tar.gz | Sat Mar 20 11:20:21 EDT 2010 |
| 0.56 | Rhipe_0.56.tar.gz | Sat Feb 27 17:04:44 EST 2010 |
| 0.55 | Rhipe_0.55.tar.gz | Sun Feb 21 00:49:04 EST 2010 |
| 0.54 | Rhipe_0.54.tar.gz | |
| 0.53 | Rhipe_0.53.tar.gz | |
| 0.52 | Rhipe_0.52.tar.gz | |
| 0.51 | Rhipe_0.51.tar.gz | |
| 0.5 | Rhipe_0.5.tar.gz | |
| 0.44 | rhipe.0.44.tgz | |

5 EC2

See the documentation.

6 Documentation

The documentation can be found here. A PDF version can be found here.

7 Contact

sguha -AT- purdue -DOT- edu

8 News

8.0.0.0.0.1 Wed Sep 08 03:06:34 EDT 2010
  • Version 0.61, minor=3
  • Some modifications made to sorting of keys. Works now.
  • Manual completely re-written
  • The experimental Java-as-a-server episode has been re-written.
8.0.0.0.0.2 Wed Aug 04 13:19:09 EDT 2010
  • Version 0.61, minor=2
  • Ordering of numeric and alphabetical keys (not default)
8.0.0.0.0.3 Wed Jun 30 13:28:21 EDT 2010
  • Version 0.61
  • Added a partitioner that partitions on the i’th element of a scalar vector (strings, numerics and integers)
8.0.0.0.0.4 Fri May 28 17:11:16 EDT 2010
  • Now version 0.60
  • Added asynch options to rhex, so jobs can run in the background freeing the R console. The return value can be used to monitor job progress. See Miscellaneous Commands for more information.
8.0.0.0.0.5 Thu May 06 21:29:36 EDT 2010
  • Added rhcp and rhmv to copy and move files when both source and destination are on the HDFS (thanks to Jeff Li)
8.0.0.0.0.6 Tue May 04 18:15:35 EDT 2010
  • Fixed some bugs in the comparator.
8.0.0.0.0.7 Thu Apr 23 12:48:45 EDT 2010
  • Fixed comparators; rhgetkey is working again (0.59-2).
8.0.0.0.0.8 Thu Apr 22 12:23:37 EDT 2010
  • Fixed a bug in rhlapply, would not read in data. Thanks to eddyu
  • rhoptions()$version now displays major, minor, date and notes. I added this since I make changes to RHIPE but keep the version the same.
8.0.0.0.0.9 Mon Apr 19 02:00:22 EDT 2010
  • Less memory allocation in the key/value(s).
  • rhread no longer runs a mapreduce job to convert sequence files to binary. It also has a head-like function.
    • If multicore is installed, then running rhread(..,mc=TRUE) will deserialize in parallel, which might or might not be slower …
  • rhex takes an option mapred which is of the same form as mapred in rhmr. This will override the mapred value in rhmr.
  • rhgetkey takes a parameter skip to read in large databases; there is also no need for a trailing “*”.
8.0.0.0.0.10 Thur Apr 15
  • Moved to protobuf-2.3
8.0.0.0.0.11 Fri Apr 09 01:42:05 EDT 2010
  • rhls can now recurse
  • rhread now need only take a folder (no need for rhmap.sqs to read map files). Use the type argument to specify sequence (or text) files or map files.
  • Similarly rhmr does not need rhmap.sqs
8.0.0.0.0.12 Wed Apr 07 16:56:07 EDT 2010
  • rhread takes a max argument that reads in only max number of key-value pairs
  • rhex passes all extra arguments to the system command.
8.0.0.0.0.13 Sat Mar 20 20:51:58 EDT 2010
  • Combiner bug fixed; it still needs to be tested. Mail if numbers do not match.
8.0.0.0.0.14 Sat Mar 20 11:20:44 EDT 2010
  • Fixed the combiner. It halves the wordcount run time. The combiner logic is run in the R interpreter C code. It is still alpha, so if you get erroneous results kindly report them back.
  • Also fixed a buffer overflow in main.c. Thanks to Will Nolan.
  • Values and keys can now be up to 256MB.
8.0.0.0.0.15 Fri Feb 19 20:43:25 EST 2010
  • EC2 now works!
8.0.0.0.0.16 Thu Jan 14 20:19:24 EST 2010
  • Counters are returned to the R session (for rhmr only). That is, the return value of rhmr is a list: the first element indicates success/failure and the second contains all the counters visible in the job UI.
8.0.0.0.0.17 Wed Jan 13 02:52:27 EST 2010
  • Fixed a bug where errors in R code were not appearing. Somewhat fixed. Version stays the same.
8.0.0.0.0.18 Thu Dec 24 11:58:04 EST 2009
  • Released version 0.54
  • Introduced a Hadoop Map File Outputformat and functions for reading a key from map files (see help on rhmr and misc functions)
  • Fixed a bug for the case when no reducer is specified but RHIPE Java code threw an exception.
8.0.0.0.0.19 Sun Dec 13 22:11:53 EST 2009
  • Release **Version 0.53**
  • Bug fixes:
    • Inserted R_CStackLimits: since I’m using Protobuf, a threaded library, it was upsetting R.
    • Removed Rf_duplicate
  • Data types have been enhanced, now allows scalar vectors with attributes. Experimental.
  • As a result, data.frames can now be written and read back in.
  • Impose a 64MB key, value serialization limit (workaround to come in future). Objects bigger than this will be written successfully, but will fail to read and will cause the job to fail.
8.0.0.0.0.20 Thu Dec 10 13:28:19 EST 2009
  • rhcounter, available in mapreduce code, is more versatile. Previously, ‘,’ in the counter names would upset Hadoop. Not anymore; see the documentation for rhmr.
8.0.0.0.0.21 Wed Dec 2 12:44:23 EST 2009
  • Fixed a failure when running RHIPE from different UIDs. It now writes to /tmp/logger-UID. The version number is still the same.
8.0.0.0.0.22 Mon Oct 12 11:18:31 EDT 2009
  • Removed the dependency on rJava. Getting it to work with Hadoop classpaths caused too much grief. The actual RHIPE program remains unchanged but the client handler (R package) is a bit slower(?)
8.0.0.0.0.23 Sun Sep 27 22:01:33 EDT 2009
  • Names are only read for VECSXP (list objects), because of a strange bug.
8.0.0.0.0.24 Tue Sep 8 15:35:24 EDT 2009
  • Moved to Hadoop 0.20
  • Uses protobuf for serialization, fewer R types allowed
  • Does not depend on Rserve, single R package to install
8.0.0.0.0.25 Fri Aug 7 2009, Version 0.45
  • Web site revamped. Beginning with the current version, the entire manual is in PDF or can be accessed at the documentation link.
  • Source code is available on Git, go to the download page for instructions.
  • Stopped seeding via secure random generator, so the user will have to seed it to avoid correlated streams. On RHEL linux when running rhlapply on 145K+ tasks, /dev/random would block.
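Since seeding is now left to the user, one common way to avoid correlated streams across tasks is to give each task its own L'Ecuyer-CMRG stream using the parallel package that ships with R. The following is a minimal sketch in plain R, not RHIPE's own mechanism: `n_tasks`, `task_seeds` and `draw_for_task` are hypothetical names, the base seed is arbitrary, and how a seed reaches each rhlapply task is left open.

```r
library(parallel)  # ships with R; provides nextRNGStream()

RNGkind("L'Ecuyer-CMRG")
set.seed(20090807)   # base seed chosen by the user (arbitrary value here)

n_tasks <- 4         # hypothetical number of tasks
task_seeds <- vector("list", n_tasks)
s <- .Random.seed
for (i in seq_len(n_tasks)) {
  task_seeds[[i]] <- s       # seed for task i
  s <- nextRNGStream(s)      # jump ahead to the next independent stream
}

# Inside task i: install that task's seed before drawing random numbers.
draw_for_task <- function(i, n = 3) {
  assign(".Random.seed", task_seeds[[i]], envir = globalenv())
  runif(n)
}

draws <- lapply(seq_len(n_tasks), draw_for_task)
```

Each task then draws from its own stream, and re-running a task with the same seed reproduces its draws.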