Hadoop is an open-source project that works in a distributed computing environment on data intensive applications.
A simplistic description of the capability of hadoop would be to take the Linux grep, sort, and uniq filters and have them work in a distributed environment over 1000s of computers and petabytes of data. In general, Hadoop and related projects now extend grid and cluster computing concepts into the text data mining and analysis arena, beyond strictly scientific computing applications.
Stern Research Computing has a small hadoop cluster.
Currently there are 13 processing nodes with about 50 cores and about 6TB of disk.
In addition to hadoop and map-reduce, hive,oig and mahout (0.8) are also installed.
Why is This of Interest for Research?
Hadoop can dramatically increase the ability to do high end business research projects, which have been difficult up to now. For example, building machine learning applications, crawling the web for information about business topics, extracting data from public documents like SEC filings, building customized search engines for research uses, etc. A specific example is the processing of large amounts of data that we receive every night from the Options Price Reporting Authority (OPRA). Our current approach parses the data using C and the Linux grep command, then sorts it. This all be could be done directly in the Hadoop environment, and would presumably scale linearly with the number of nodes.
To use hadoop, please look at these instructions.
Running hadoop, hive and mahout at the Stern Center for Research Computing
First, you must have your Stern userid enabled for hadoop To do that, please send an email to email@example.com, or call the help desk at 212-998-0180 and create a ticket for research computing.
To access hadoop,
hadoop fs -mkdir test
Should create a directory “test” in /user/yournetid (which is your default folder in the hadoop file system).
hadoop fs -lsr
and you will get a list of all of your files in hadoop (initially just the test directory you just created).
will enter the hive command line environment
will run a mahout job.
Important things to remember.
hadoop keeps all of its files in its own file system called “hdfs”. You need to move your files from linux to the hadoop files system with the
hadoop fs -put /mylocalpath/mylocalfile myhadoopfilename
command. That will copy the file at