Running hadoop, hive and mahout at the Stern Center for Research Computing
Hadoop is a Linux-based processing system, so to use it you need to be reasonably familiar with Unix/Linux commands.
The login system for the hadoop cluster is bigdata.stern.nyu.edu. It has a large storage area (/bigtemp) for temporary file storage, since most users' home directories are limited in space (< 1GB). The /bigtemp area can be used as a staging area to load your data into the hadoop cluster.
You should store your data in /bigtemp/yournetid, and then you can copy (put) the data into the hadoop system where you can manipulate it using hadoop.
i.e. At the bigdata command prompt type
You can then use sftp, scp or wget commands to move your data into the staging area.
Once you have the data in /bigtemp/yournetid you can use hadoop commands to move it into the cluster.
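As a rough sketch of the staging step described above, the script below shows one way to copy a file from your own machine into the staging area with scp. The file name mydata.csv and the helper names staging_target and stage_file are hypothetical, and the script assumes your directory /bigtemp/yournetid already exists on the login system. The actual transfer is commented out so that nothing is copied by accident.

```shell
#!/bin/sh
# Sketch only: NETID is a placeholder (assumption); set it to your own NetID.
NETID="${NETID:-yournetid}"
HOST="bigdata.stern.nyu.edu"

# Print the scp destination for your staging directory on the login system.
staging_target() {
    echo "$NETID@$HOST:/bigtemp/$NETID/"
}

# Copy one local file to the staging area (run this on your own machine).
stage_file() {
    scp "$1" "$(staging_target)"
}

# Example transfer, commented out on purpose (mydata.csv is hypothetical):
# stage_file mydata.csv

# Show where files would be copied.
echo "scp target: $(staging_target)"
```

With the default placeholder, the last line prints the target yournetid@bigdata.stern.nyu.edu:/bigtemp/yournetid/. To use it for real, set NETID and uncomment the stage_file line.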
To access hadoop, use the hadoop command. For example,
hadoop fs -mkdir test
should create a directory “test” in /user/yournetid (which is your default folder in the hadoop file system).
Type
hadoop fs -lsr
and you will get a recursive list of all of your files in hadoop. (On newer hadoop versions, -lsr is deprecated in favor of hadoop fs -ls -R.)
Typing hive will enter the hive command line environment.
The mahout command will run a mahout job.
Important things to remember.
hadoop keeps all of its files in its own file system called “hdfs”. You need to move your files from linux to the hadoop file system with the
hadoop fs -put /mylocalpath/mylocalfile myhadoopfilename
command. That will copy the file at /mylocalpath/mylocalfile into hdfs under the name myhadoopfilename.
If your files are in /bigtemp/yournetid, the command would look like this:
hadoop fs -put /bigtemp/yournetid/yourfilename yourfilename
One thing to watch out for: many hadoop commands (as well as hive and mahout) work on a directory/folder of files as opposed to a single file.
So you often have to create a folder in hadoop and put your file(s) in the folder.
In this case, you would first create the folder and then put the file into it.
hadoop fs -mkdir yourproject
hadoop fs -put /bigtemp/yournetid/yourfilename yourproject/
This will create the file /user/yournetid/yourproject/yourfilename in hdfs.
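The folder-then-put sequence above can be collected into a small script. This is only a sketch: the helper names staged_path and run are hypothetical, the variable defaults are the placeholders used in this document, and by default the script just prints each hadoop command. Commands are executed only when RUN=1 is set and the hadoop command is available.

```shell
#!/bin/sh
# Sketch only: NETID, FILE and PROJECT are placeholders (assumptions).
NETID="${NETID:-yournetid}"
FILE="${FILE:-yourfilename}"
PROJECT="${PROJECT:-yourproject}"

# Build the path of a file in your /bigtemp staging directory.
staged_path() {
    echo "/bigtemp/$NETID/$1"
}

# Print a command; execute it only if RUN=1 and hadoop is installed.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ] && command -v hadoop >/dev/null 2>&1; then
        "$@"
    fi
}

# Create the project folder in hdfs (under /user/yournetid by default).
run hadoop fs -mkdir "$PROJECT"
# Copy the staged file from linux into the hdfs folder.
run hadoop fs -put "$(staged_path "$FILE")" "$PROJECT/"
# List the folder to confirm the file arrived.
run hadoop fs -ls "$PROJECT"
```

Running it without RUN=1 is a safe way to check the commands before touching the cluster; on the login system you might run it as RUN=1 NETID=... FILE=... PROJECT=... sh followed by the script name.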
To experiment, you might download the single user version of hadoop and run it locally to get used to where it stores files.