Stern Center for Research Computing

New York University • Leonard Stern School of Business

Hadoop

Hadoop is an open-source project for running data-intensive applications in a distributed computing environment.

A simplistic description of Hadoop's capability would be to take the Linux grep, sort, and uniq filters and have them work in a distributed environment over thousands of computers and petabytes of data. More generally, Hadoop and related projects now extend grid and cluster computing concepts into the text data mining and analysis arena, beyond strictly scientific computing applications.
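
As a rough sketch of that idea (the streaming jar path and the input and output directory names below are placeholders that depend on the installation), Hadoop streaming lets ordinary Linux commands serve as the map and reduce steps. Because Hadoop sorts the map output before it reaches the reducer, the job below is the distributed analogue of sort | uniq -c, counting duplicate lines across the whole input:

hadoop jar /path/to/hadoop-streaming.jar -input mylogs -output linecounts -mapper /bin/cat -reducer 'uniq -c'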

Stern Research Computing has a small Hadoop cluster.
Currently there are 13 processing nodes with about 50 cores and about 6TB of disk.
In addition to Hadoop and MapReduce, Hive, Pig, and Mahout (0.8) are also installed.

Why is This of Interest for Research?

Hadoop can dramatically increase our ability to do high-end business research projects that have been difficult up to now: for example, building machine learning applications, crawling the web for information about business topics, extracting data from public documents such as SEC filings, or building customized search engines for research use. A specific example is the processing of the large amounts of data we receive every night from the Options Price Reporting Authority (OPRA). Our current approach parses the data using C and the Linux grep command, then sorts it. This could all be done directly in the Hadoop environment, and would presumably scale linearly with the number of nodes.

Getting Started

To use Hadoop, please follow the instructions below.

Running Hadoop, Hive, and Mahout at the Stern Center for Research Computing

First, you must have your Stern userid enabled for Hadoop. To do that, please send an email to research@stern.nyu.edu, or call the help desk at 212-998-0180 to create a ticket for research computing.

To access Hadoop,

ssh yournetid@bigdata.stern.nyu.edu

Typing

hadoop fs -mkdir test

should create a directory “test” in /user/yournetid (which is your default folder in the Hadoop file system).

Type

hadoop fs -lsr

and you will get a list of all of your files in Hadoop (initially just the test directory you just created).
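
To remove the test directory later, type

hadoop fs -rmr test

(on newer Hadoop releases the equivalent is hadoop fs -rm -r test).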

Typing

hive

will enter the Hive command-line environment.
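
As a hypothetical illustration (the table name, column names, and file name here are made up), a short Hive session might look like the following; queries such as the SELECT below run as MapReduce jobs behind the scenes:

hive> CREATE TABLE quotes (symbol STRING, price DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/user/yournetid/quotes.csv' INTO TABLE quotes;
hive> SELECT symbol, COUNT(*) FROM quotes GROUP BY symbol;

Note that LOAD DATA INPATH moves the file within HDFS (into Hive's warehouse directory) rather than copying it.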

Typing

mahout options

will run a Mahout job.
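
Running mahout with no arguments prints the list of available programs. As a rough example (the directory names are placeholders, and flags differ between drivers), converting a directory of plain-text files into the SequenceFile format that many Mahout jobs expect looks something like:

mahout seqdirectory -i mytextdir -o mytextseq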

Important things to remember

Hadoop keeps all of its files in its own file system called HDFS. You need to move your files from Linux to the Hadoop file system with the

hadoop fs -put /mylocalpath/mylocalfile myhadoopfilename

command. That will copy the file at

/mylocalpath/mylocalfile

to

myhadoopfilename

in HDFS, at /user/yournetid/myhadoopfilename.
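
You can verify the copy with hadoop fs -ls (or inspect a file with hadoop fs -cat), and copy results back to the Linux side with hadoop fs -get, for example (mylocalcopy is just a placeholder name):

hadoop fs -get myhadoopfilename /mylocalpath/mylocalcopy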

Good luck….