RHIPE - R and Hadoop Integrated Processing Environment

#

RHIPE(phonetic spelling: hree-pay’ [1]) is a java package that integrates the R environment with Hadoop, the open source implementation of Google’s mapreduce. Using RHIPE it is possible to code map-reduce algorithms in R e.g [1] This is greek for a moment in time. See here for pronunciation: Greek Lexicon

m <- expression({
  y <- unlist(strsplit(unlist(map.values),"[[:space:]]+"))
  sapply(y,function(r) rhcollect(r,T))
  ## instead of the previous line, you could also do
  ## z <-  table(sapply(y,function(r) rhcollect(r,T)))
  ## sapply(names(z), function(r) rhcollect(r, z[r]))
})
r <- expression(
    pre={
      count=0
    },
    reduce={
      count <- sum(as.numeric(unlist(reduce.values)),count)
    },post={
      rhcollect(reduce.key,count)
    })
z=rhmr(map=m,reduce=r,comb=T,inout=c("text","sequence"),ifolder="/tmp/50mil",ofolder='/tmp/tof')
rhex(z)

Or just, load Rhipe and type

rhwordcount(infolder,outfolder)

where infolder is the input file(or folder of files) of words(text file) and outfolder is the destination directory.

1 Mailing List

Mailing list is hosted on Google Groups. The url is http://groups.google.com/group/rhipe . Your first post will be moderated.

2 Data types supported

Well, a useful subset, but not all, see Data Types

3 More Information

For more information about what RHIPE is and not, read the FAQ. Please note, this does not work on Mac OS X Snow Leopard.

4 Download

Source

The source code is present on Git, go here http://github.com/saptarshiguha/RHIPE/

To check out the current version, install git

git clone git://github.com/saptarshiguha/RHIPE.git

To download version X e.g 0.45

git clone git://github.com/saptarshiguha/RHIPE.git
git checkout 0.45

The current version is always the master.

Versions

Read the documentation for installation. Current is the latest version.

Version	Download
0.61	Rhipe_0.61.tar.gz	Wed Sep 08 03:12:20 EDT 2010
0.60	Rhipe_0.60.tar.gz	Fri May 28 17:13:16 EDT 2010
0.59	Rhipe_0.59.tar.gz	Tue May 04 18:15:26 EDT 2010
0.58	Rhipe_0.58.tar.gz	Fri Apr 09 01:41:48 EDT 2010
0.57	Rhipe_0.57.tar.gz	Sat Mar 20 11:20:21 EDT 2010
0.56	Rhipe_0.56.tar.gz	Sat Feb 27 17:04:44 EST 2010
0.55	Rhipe\_0.55.tar.gz	Sun Feb 21 00:49:04 EST 2010
0.54	Rhipe\_0.54.tar.gz
0.53	Rhipe\_0.53.tar.gz
0.52	Rhipe\_0.52.tar.gz
0.51	Rhipe\_0.51.tar.gz
0.5	Rhipe\_0.5.tar.gz
0.44	rhipe.0.44.tgz

5 EC2

See the documentation.

6 Documentation

The documentation can be found here. PDF version can be found here

7 Contact

sguha -AT- purdue -DOT- edu

8 News

8.0.0.0.0.1 Wed Sep 08 03:06:34 EDT 2010

Version 0.61, minor=3
Some modifications made to sorting of keys. Works now.
Manual completely re-written
The experimental java as a server episode has been re-written.

8.0.0.0.0.2 Wed Aug 04 13:19:09 EDT 2010

Version 0.61, minor=2
Ordering of numeric and alphabetical keys (not default)

8.0.0.0.0.3 Wed Jun 30 13:28:21 EDT 2010

Version 0.61
Added a partitioner that partitions on the i’th element of a scalar vector (strings, numerics and integers)

8.0.0.0.0.4 Fri May 28 17:11:16 EDT 2010

Now version 0.60
Added asynch options to rhex, so jobs can run in the background freeing the R console. The return value can be used to monitor job progress. See Miscellaneous Commands for more information.

8.0.0.0.0.5 Thu May 06 21:29:36 EDT 2010

Added rhcp and rhmv to copy and moves files when both source and destination are on the HDFS (thanks to Jeff Li)

8.0.0.0.0.6 Tue May 04 18:15:35 EDT 2010

Some bugs in the comparator - fixed.

8.0.0.0.0.7 Thu Apr 23 12:48:45 EDT 2010 -

fixed comparators, rhgetkey working again. (0.59-2)

8.0.0.0.0.8 Thu Apr 22 12:23:37 EDT 2010

Fixed a bug in rhlapply, would not read in data. Thanks to eddyu
rhoptions()$version now has displays major, minor , date and notes. I added this since i make changes to RHIPE but keep the version the same.

8.0.0.0.0.9 Mon Apr 19 02:00:22 EDT 2010

Less memory allocation in the key/value(s).
rhread now does not do a mapreduce job to convert sequence files to binary. Also has a head like function.
- if multicore is installed, then running rhread(..,mc=TRUE) will deserialize in parallel, which might or might be slower …
rhez takes an option mapred which is of the same form as mapred in rhmr. This will override the mapred value in rhmr.
rhgetkey takes a parameter skip to read in large databases, also no need for trailing “*”.

8.0.0.0.0.10 Thur Apr 15

moved to protobuf-2.3

8.0.0.0.0.11 Fri Apr 09 01:42:05 EDT 2010

rhls can now recurse
rhread now need only take a folder (no need for rhmap.sqs to read map files). Use the type argument to specify sequence(or text) files or map files.
Similarly rhmr does not need rhmap.sqs

8.0.0.0.0.12 Wed Apr 07 16:56:07 EDT 2010

rhread takes a max argument that reads in only max number of key-value pairs
rhex passes all extra arguments to the system command.

8.0.0.0.0.13 Sat Mar 20 20:51:58 EDT 2010

Combiner bug fixed, it’s still needs to be tested. Mail if numbers do not match.

8.0.0.0.0.14 Sat Mar 20 11:20:44 EDT 2010

Fixed combiner, still alpha, but it halves the wordcount speed. The combiner logic is run in the R interpreter C code. However it is still alpha, so if you get erroneous results kindly report them back.
Also fixed a buffer overflow in main.c. Thanks to Will Nolan.
Values and Keys can be now be upto 256MB.

8.0.0.0.0.15 Fri Feb 19 20:43:25 EST 2010

EC2 now works!

8.0.0.0.0.16 Thu Jan 14 20:19:24 EST 2010

Counters are returned to the R session (for rhmr only). That is the return value of rhmr is a list, the first element indicates success/failure and the second are all the counters visible in the job UI.

8.0.0.0.0.17 Wed Jan 13 02:52:27 EST 2010

Fixed a bug where errors in R code were not appearing. Somewhat fixed. Version stays the same.

8.0.0.0.0.18 Thu Dec 24 11:58:04 EST 2009

Released version 0.54

Introduce a Hadoop Map File Outputformat and functions for reading a key from map files(see help on rhmr and misc functions)
Fixed a bug for the case when no reducer is specified but RHIPE java code threw an exception.

8.0.0.0.0.19 Sun Dec 13 22:11:53 EST 2009

Release **Version 0.53**
Bug fixes:
- Inserted R\_CStackLimits, since I’m using Protobuf a threaded library, it was upsetting R.
- Removed Rf\_duplicate
Data types have been enhanced, now allows scalar vectors with attributes. Experimental.
A result of which can now write data.frames and read them back in.
Impose 64MB key,value serialization limit(workaround to come in future). Objects bigger than this will be written successfully,but will fail to read and will cause the job to fail.

8.0.0.0.0.20 Thu Dec 10 13:28:19 EST 2009

rhcounter ,available in mapreduce code, is more versatile. Previously, ‘,’ in the counter names would upset Hadoop. Not anymore, see documentation for rhmr

8.0.0.0.0.21 Wed Dec 2 12:44:23 EST 2009

Failed when running RHIPE from different UID’s. Now writes to /tmp/logger-UID. Version number is still the same

8.0.0.0.0.22 Mon Oct 12 11:18:31 EDT 2009

Removed the dependency on rJava. Getting it to work with Hadoop classpaths caused to much grief. The actualy RHIPE program remains unchanged but the client handler (R package) is a bit slower(?)

8.0.0.0.0.23 Sun Sep 27 22:01:33 EDT 2009

Names are only read for VECSXP (list objects), because of a strange bug.

8.0.0.0.0.24 Tue Sep 8 15:35:24 EDT 2009

Moved to Hadoop 0.20
Uses protobuf for serialization, fewer R types allowed
Does not depend on Rserve, single R package to install

8.0.0.0.0.25 Fri Aug 7 2009, Version 0.45

Web site revamped. Beginning with the current version, the entire manual is in PDF or can be accessed at the documentation link.
Source code is available on Git, go to the download page for instructions.
Stopped seeding via secure random generator, so the user will have to seed it to avoid correlated streams. On RHEL linux when running rhlapply on 145K+ tasks, /dev/random would block.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

index.org

index.org

RHIPE - R and Hadoop Integrated Processing Environment

1 Mailing List

2 Data types supported

3 More Information

4 Download

5 EC2

6 Documentation

7 Contact

8 News

8.0.0.0.0.1 Wed Sep 08 03:06:34 EDT 2010

8.0.0.0.0.2 Wed Aug 04 13:19:09 EDT 2010

8.0.0.0.0.3 Wed Jun 30 13:28:21 EDT 2010

8.0.0.0.0.4 Fri May 28 17:11:16 EDT 2010

8.0.0.0.0.5 Thu May 06 21:29:36 EDT 2010

8.0.0.0.0.6 Tue May 04 18:15:35 EDT 2010

8.0.0.0.0.7 Thu Apr 23 12:48:45 EDT 2010 -

8.0.0.0.0.8 Thu Apr 22 12:23:37 EDT 2010

8.0.0.0.0.9 Mon Apr 19 02:00:22 EDT 2010

8.0.0.0.0.10 Thur Apr 15

8.0.0.0.0.11 Fri Apr 09 01:42:05 EDT 2010

8.0.0.0.0.12 Wed Apr 07 16:56:07 EDT 2010

8.0.0.0.0.13 Sat Mar 20 20:51:58 EDT 2010

8.0.0.0.0.14 Sat Mar 20 11:20:44 EDT 2010

8.0.0.0.0.15 Fri Feb 19 20:43:25 EST 2010

8.0.0.0.0.16 Thu Jan 14 20:19:24 EST 2010

8.0.0.0.0.17 Wed Jan 13 02:52:27 EST 2010

8.0.0.0.0.18 Thu Dec 24 11:58:04 EST 2009

8.0.0.0.0.19 Sun Dec 13 22:11:53 EST 2009

8.0.0.0.0.20 Thu Dec 10 13:28:19 EST 2009

8.0.0.0.0.21 Wed Dec 2 12:44:23 EST 2009

8.0.0.0.0.22 Mon Oct 12 11:18:31 EDT 2009

8.0.0.0.0.23 Sun Sep 27 22:01:33 EDT 2009

8.0.0.0.0.24 Tue Sep 8 15:35:24 EDT 2009

8.0.0.0.0.25 Fri Aug 7 2009, Version 0.45

Files

index.org

Latest commit

History

index.org

File metadata and controls

RHIPE - R and Hadoop Integrated Processing Environment

1 Mailing List

2 Data types supported

3 More Information

4 Download

5 EC2

6 Documentation

7 Contact

8 News

8.0.0.0.0.1 Wed Sep 08 03:06:34 EDT 2010

8.0.0.0.0.2 Wed Aug 04 13:19:09 EDT 2010

8.0.0.0.0.3 Wed Jun 30 13:28:21 EDT 2010

8.0.0.0.0.4 Fri May 28 17:11:16 EDT 2010

8.0.0.0.0.5 Thu May 06 21:29:36 EDT 2010

8.0.0.0.0.6 Tue May 04 18:15:35 EDT 2010

8.0.0.0.0.7 Thu Apr 23 12:48:45 EDT 2010 -

8.0.0.0.0.8 Thu Apr 22 12:23:37 EDT 2010

8.0.0.0.0.9 Mon Apr 19 02:00:22 EDT 2010

8.0.0.0.0.10 Thur Apr 15

8.0.0.0.0.11 Fri Apr 09 01:42:05 EDT 2010

8.0.0.0.0.12 Wed Apr 07 16:56:07 EDT 2010

8.0.0.0.0.13 Sat Mar 20 20:51:58 EDT 2010

8.0.0.0.0.14 Sat Mar 20 11:20:44 EDT 2010

8.0.0.0.0.15 Fri Feb 19 20:43:25 EST 2010

8.0.0.0.0.16 Thu Jan 14 20:19:24 EST 2010

8.0.0.0.0.17 Wed Jan 13 02:52:27 EST 2010

8.0.0.0.0.18 Thu Dec 24 11:58:04 EST 2009

8.0.0.0.0.19 Sun Dec 13 22:11:53 EST 2009

8.0.0.0.0.20 Thu Dec 10 13:28:19 EST 2009

8.0.0.0.0.21 Wed Dec 2 12:44:23 EST 2009

8.0.0.0.0.22 Mon Oct 12 11:18:31 EDT 2009

8.0.0.0.0.23 Sun Sep 27 22:01:33 EDT 2009

8.0.0.0.0.24 Tue Sep 8 15:35:24 EDT 2009

8.0.0.0.0.25 Fri Aug 7 2009, Version 0.45