What is interesting for this blog entry is the use of the HDFS and Hive layers for processing high-volume data files.
I wish I had enough processing power on my laptop to clone the VM and emulate a 2-4 node cluster; instead, I used a single Ubuntu Linux (64-bit) VM.
I installed Hadoop (http://www.apache.org/dyn/closer.cgi/hadoop/common/) and Hive (http://www.apache.org/dyn/closer.cgi/hive/).
Once installed, you need to set a few environment variables to make things easier to work with:
export JAVA_HOME=<PATH_TO_JDK>
export HIVE_HOME=<PATH_TO_HIVE>
export HADOOP_HOME=<PATH_TO_HADOOP>
Once you’ve extracted the Hadoop and Hive files into the above paths, you need to edit the configuration files to add your node information. You’ll have to edit the following files in $HADOOP_HOME/conf/
core-site.xml
hdfs-site.xml
mapred-site.xml
to add the following properties inside the <configuration> element (fs.default.name goes in core-site.xml, the dfs.* properties in hdfs-site.xml, and mapred.job.tracker in mapred-site.xml); this process must be repeated on all nodes participating in the cluster. Make sure that the host name is resolvable; if not, make appropriate entries in /etc/hosts.
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>localhost:50090</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>localhost:50075</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
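If the host names are not resolvable, the /etc/hosts entries would look something like the following (the host names and addresses here are hypothetical; substitute your own):
192.168.1.10 hadoop-master
192.168.1.11 hadoop-node1
192.168.1.12 hadoop-node2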
Add these variables to your PATH
export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$PATH
Edit the following files to customise the environment.
Edit $HADOOP_HOME/conf/hadoop-env.sh:
uncomment the line containing JAVA_HOME and set its value according to your environment.
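Once uncommented, the line in hadoop-env.sh looks something like this (the JDK path is just an example; point it at your own installation):
export JAVA_HOME=/usr/lib/jvm/java-6-sun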
Edit $HIVE_HOME/conf/hive-default.xml:
look for the key hive.metastore.warehouse.dir
and modify its value to the path where you want Hive to store its files.
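The entry in hive-default.xml looks like this (the value shown is the default; change it to wherever you want the Hive files stored):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>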
NOTE: Run all the commands as a non-root user.
The next thing you will want to do is format HDFS for use.
$ hadoop namenode -format
This formats the filesystem and prepares HDFS for use. Since HDFS is literally a file system, you need to upload the files that need to be processed. Once the files are on a common path across all the nodes, the MR (Map-Reduce) algorithm can be written and deployed to get the job done.
$ $HADOOP_HOME/bin/start-all.sh
This starts the Hadoop services.
Now, Hive simplifies all of that by providing a SQL interface, saving me from implementing the MR job myself.
To try this out, I used the following CSV file from http://explore.data.gov/Foreign-Commerce-and-Aid/U-S-Overseas-Loans-and-Grants-Greenbook-/5gah-bvex
The file can be downloaded and used for this sample.
Now launch Hive, and execute all the commands at the Hive prompt
$ hive
hive> CREATE TABLE usa_gov_data (country STRING, comments STRING, year1 INT, year2 INT, year3 INT, year4 INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
'quote.delim'='("|\[|\])',
'field.delim'=',',
'serialization.null.format'='-')
STORED AS TEXTFILE;
When the command is executed, you will see the following messages on successful execution.
OK
Time taken: 7.879 seconds
The CSV file contains 2500 rows and 60 columns. For simplicity of demonstration, I've created a table which accommodates only the first 6 columns.
The serialization.format and the serialization/deserialization (SerDe) classes are explicitly specified. These helper classes perform the parsing of the file during the load.
quote.delim : parses out quotes in the CSV file.
field.delim : change this to whatever delimiter your file uses (don't forget to escape it)
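To see why the quote handling matters, here is a quick shell sketch (the row is made up, but shaped like this dataset) showing how a naive comma split miscounts fields when a quoted field itself contains a comma:

```shell
# A made-up row shaped like the dataset; '-' marks nulls,
# matching serialization.null.format above.
line='Afghanistan,"Child Survival, and Health",-,-,100,200'

# A naive split on ',' reports 7 fields instead of 6, because the
# comma inside the quoted field is counted as a delimiter too.
printf '%s\n' "$line" | awk -F',' '{print NF}'
```

This is exactly the case the quote.delim property is there to handle: the SerDe parses out the quotes rather than splitting inside them.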
To list the tables that you’ve created
hive> show tables;
OK
usa_gov_data
Time taken: 0.156 seconds
Now it’s time to upload the content:
hive> LOAD DATA LOCAL INPATH '<PATH_TO>/us_economic_assistance.csv' OVERWRITE INTO TABLE usa_gov_data;
Copying data from file:/home/innovator/Downloads/us_economic_assistance.csv
Copying file: file:/home/innovator/Downloads/us_economic_assistance.csv
Loading data to table default.usa_gov_data
Deleted hdfs://localhost:9000/user/hive/warehouse/usa_gov_data
OK
Time taken: 0.469 seconds
Once the file is uploaded, we can run queries. (If you browse HDFS, you will find the whole file stored as-is in that location.) This implies that HDFS is just a placeholder for all the objects; it is up to the MR algorithm to process them while applying the query logic.
hive> select * from usa_gov_data where country = 'Afghanistan';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201107012339_0001, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201107012339_0001
Kill Command = /home/innovator/apps/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://localhost:9001 -kill job_201107012339_0001
2011-07-01 23:41:31,068 Stage-1 map = 0%, reduce = 0%
2011-07-01 23:41:36,215 Stage-1 map = 100%, reduce = 0%
2011-07-01 23:41:39,244 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201107012339_0001
OK
Here is a snapshot of the results
Afghanistan Child Survival and Health 0 0 0 0
Afghanistan Department of Defense Security Assistance 0 0 0 0
Afghanistan Development Assistance 0 0 0 0
Afghanistan Economic Support Fund/Security Support Assistance 0 0 0 0
Afghanistan Food For Education 0 0 0 0
Afghanistan Global Health and Child Survival 0 0 0 0
Afghanistan Inactive Programs 0 0 0 0
Afghanistan Migration and Refugee Assistance 0 0 0 0
Afghanistan Narcotics Control 0 0 0 0
Afghanistan Nonproliferation, Anti-Terrorism, Demining and Related 0 0 0 0
Afghanistan Other Active Grant Programs 0 0 0 0
Afghanistan Other State Assistance 0 0 0 0
Afghanistan Other USAID Assistance 0 0 0 0
Afghanistan Other USDA Assistance 0 0 0 0
Afghanistan Peace Corps 0 0 0 0
Afghanistan Section 416(b)/ Commodity Credit Corporation Food for Progress 0 0 0 0
Afghanistan Title I 0 0 0 0
Afghanistan Title II 0 0 0 0
Time taken: 16.61 seconds
Just in case you want to get rid of the table you created …
hive> drop table usa_gov_data;
There you have an end-to-end demo of how Hadoop and Hive can be leveraged. Now imagine a high-volume-data scenario where loads of files are sitting on a stack of HDFS nodes. Hive will federate the query across all the nodes participating in the cluster and return an aggregated view of the data, which makes it a really powerful tool for distributed computing.
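To make that concrete, here is a sketch of the kind of aggregate query Hive would federate across the cluster (it assumes the year1..year4 columns of the table created above hold numeric amounts):
hive> SELECT country, SUM(year1 + year2 + year3 + year4) AS total
    > FROM usa_gov_data
    > GROUP BY country;
Unlike the plain filter query earlier (which needed no reduce tasks), the GROUP BY forces Hive to compile a reduce phase into the MapReduce job.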
However, the remaining task is to work out the implementation of the parsing technique, which is key to processing the objects; understand that well.