Stern Center for Research Computing

New York University • Leonard Stern School of Business

Running Hadoop, Hive, and Mahout at the Stern Center for Research Computing

Hadoop is a Linux-based processing system, so to use it you need to be reasonably familiar with Unix/Linux commands.

The login system for the Hadoop cluster is bigdata.stern.nyu.edu. It has a large storage area (/bigtemp) for temporary file storage, since most users' home directories are limited in space (less than 1 GB). The /bigtemp area can be used as a staging area to load your data into the Hadoop cluster.

You should store your data in /bigtemp/yournetid; you can then copy (put) the data into the Hadoop file system, where you can manipulate it with Hadoop commands.

For example, at the bigdata command prompt, type

mkdir /bigtemp/yournetid

You can then use sftp, scp, or wget to move your data into

/bigtemp/yournetid
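
As a minimal sketch, assuming a data file named mydata.csv on your own machine and a download URL http://example.com/mydata.csv (both placeholder names, not real Stern resources), the transfer could look like this:

# run from your own machine: copy the file to the staging area
scp mydata.csv yournetid@bigdata.stern.nyu.edu:/bigtemp/yournetid/

# or, run on bigdata itself: download directly into the staging area
wget -P /bigtemp/yournetid http://example.com/mydata.csv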

Once you have the data in /bigtemp/yournetid, you can use Hadoop commands to move it into the cluster.

To access Hadoop, type

ssh yournetid@bigdata.stern.nyu.edu

Typing

hadoop fs -mkdir test

should create a directory “test” in /user/yournetid (which is your default folder in the Hadoop file system).

Type

hadoop fs -lsr

and you will get a recursive listing of all of your files in Hadoop.
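
(On newer Hadoop releases, -lsr is deprecated in favor of

hadoop fs -ls -R

which produces the same recursive listing.)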

hive

will enter the Hive command-line environment.
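
You do not have to use the interactive shell: the hive command can also run a single statement with its -e option, or a file of HiveQL statements with -f. As a minimal sketch (the script name below is a placeholder):

hive -e "show tables;"

hive -f /bigtemp/yournetid/myquery.hql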

mahout options

will run a Mahout job.
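
If you are unsure which options to use, running mahout with no arguments prints the list of valid program names, and most programs accept --help to print their own options. For example (kmeans is just one such program, used here as an illustration):

mahout

mahout kmeans --help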

Important things to remember:

Hadoop keeps all of its files in its own file system, called HDFS (the Hadoop Distributed File System). You need to move your files from Linux into the Hadoop file system with the

hadoop fs -put /mylocalpath/mylocalfile myhadoopfilename

command. That will copy the file at /mylocalpath/mylocalfile to myhadoopfilename in HDFS, i.e. to /user/yournetid/myhadoopfilename.

If your files are in /bigtemp/yournetid, the command would look like this:

hadoop fs -put /bigtemp/yournetid/yourfilename yourfilename
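
You can confirm that the copy worked by listing the file in HDFS:

hadoop fs -ls yourfilename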

One thing to watch out for: many Hadoop commands (and Hive and Mahout) work on a directory/folder of files as opposed to a single file.

So you often have to create a folder in Hadoop and put your file(s) in the folder.

In this case, you would first create the folder.

hadoop fs -mkdir yourproject

and then

hadoop fs -put /bigtemp/yournetid/yourfilename yourproject/

This will create the file /user/yournetid/yourproject/yourfilename in HDFS.
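
To copy files back out of HDFS to the Linux side (for example, output that a job has written into yourproject), hadoop fs -get is the counterpart of -put. Here yourfilename stands for whatever file you want to retrieve:

hadoop fs -get yourproject/yourfilename /bigtemp/yournetid/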

To experiment, you might download the single-node version of Hadoop and run it locally to get used to where it stores files.