Hadoop Ecosystem Tour
This lecture by Aaron gives a tour of the projects in the Hadoop ecosystem and lists other libraries that Hadoop supports. It is not very technical, but it is a nice breadth-first overview of Hadoop.
The following are some points I picked up from the video!
(1) Since Hadoop is heavily based on Google's research papers, its architecture is similar to Google's, apart from some variation in the names.
Google MapReduce -> Hadoop MapReduce, Google File System -> Hadoop Distributed File System (HDFS), Bigtable -> HBase [very large scale database system], Chubby -> ZooKeeper [distributed locking system for synchronization]
(2) Pig: a data-flow-oriented language. Queries written in Pig are compiled down to MapReduce programs, which execute over the dataset. Pig is a client-side library and a client-side shell.
(3) Hive works the same way: it is a client-side library and a client-side shell. Its query language looks very much like SQL and supports SELECT, JOIN, GROUP BY and other SQL-like commands.
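To make the "compiled down to MapReduce" idea in points (2) and (3) concrete, here is a minimal single-process sketch of how a query like SELECT department, COUNT(*) ... GROUP BY department could be broken into map, shuffle and reduce steps. The data and the function names are made up for illustration; the plans that Pig and Hive really generate are far more elaborate.

```python
from collections import defaultdict

# Hypothetical input rows: (employee, department)
rows = [("ann", "eng"), ("bob", "ops"), ("cat", "eng")]


def map_phase(records):
    """Map: emit (group key, 1) for every input row."""
    return [(dept, 1) for _, dept in records]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate each key's values, here a COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(rows)))
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them; the shape of the computation stays the same.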
(4) HBase: a column-store database. HDFS makes it hard to look up individual data records of 1 KB or less; HBase facilitates that. HBase has named rows and columns, with an identifier associated with each. The data is sparse: a table can have a huge number of columns, but not every (row, column) cell has data in it.
BLOBs can be stored in these cells. Also, unlike SQL, which is strongly typed, every value in HBase is just a string (an uninterpreted sequence of bytes).
Bulk scan: different mappers can work on different slices of the row space, which builds on the existing MapReduce platform.
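The sparse, untyped table and the sliced bulk scan from point (4) can be pictured with a toy in-memory model. This is only a sketch: the class SparseTable, its methods and the row keys are hypothetical and are not the HBase API.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory stand-in for an HBase-style table (not the HBase API)."""

    def __init__(self):
        # Only cells that were written take up space: row -> {column -> value}.
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        self._rows[row][column] = str(value)  # values are kept untyped, as strings

    def get(self, row, column):
        return self._rows.get(row, {}).get(column)  # None for an empty cell

    def scan(self, start_row, stop_row):
        """Yield (row, cells) for start_row <= row < stop_row, in row-key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, dict(self._rows[row])


table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user2", "info:email", "bob@example.com")
table.put("user7", "info:name", "Eve")

# Two hypothetical mappers, each scanning its own slice of the row space.
slice_a = list(table.scan("user0", "user5"))
slice_b = list(table.scan("user5", "user9"))
```

Each row stores only the columns it actually has, and because rows are ordered by key, the row space can be cut into ranges that separate mappers scan independently.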
(5) ZooKeeper: a distributed consensus engine. A database can be locked in ZooKeeper (on the database's identifier in ZooKeeper), and other clients do not attempt to access it while the lock is held.
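The locking behaviour in point (5) can be sketched with a toy, single-process registry. This is an assumption-laden illustration, not the ZooKeeper API: LockRegistry, the path "/locks/orders-db" and the worker names are all hypothetical, and a real deployment would get the same guarantee across many machines.

```python
class LockRegistry:
    """Toy single-process stand-in for a lock service (not the ZooKeeper API)."""

    def __init__(self):
        self._owners = {}  # resource identifier -> client currently holding the lock

    def acquire(self, resource_id, client):
        """Return True if client now holds the lock, False if someone else does."""
        owner = self._owners.setdefault(resource_id, client)
        return owner == client

    def release(self, resource_id, client):
        """Release the lock, but only if client is the current holder."""
        if self._owners.get(resource_id) == client:
            del self._owners[resource_id]


locks = LockRegistry()
first = locks.acquire("/locks/orders-db", "worker-1")    # lock is free
second = locks.acquire("/locks/orders-db", "worker-2")   # held by worker-1
locks.release("/locks/orders-db", "worker-1")
third = locks.acquire("/locks/orders-db", "worker-2")    # free again
```

The point is that clients coordinate through the shared identifier: a client that fails to acquire the lock stays away from the resource until the holder releases it.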
(6) FUSE-DFS: allows HDFS volumes to be mounted like ordinary Linux file systems. However, this does not mean that HDFS can be used as a general-purpose file system.
(7) Pipes and Streaming libraries: Pipes is a way to write MapReduce code in C++, and Streaming can be used with an arbitrary scripting language.
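As a rough idea of what a Streaming job in a scripting language looks like, here is a word-count sketch in Python. Streaming feeds records to the script as lines of text and expects tab-separated key/value lines back; in a real job the mapper and the reducer would each read sys.stdin and print their output. The names mapper, reducer and run_local are my own, and run_local only imitates the sort that Hadoop performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a Streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Local stand-in for a job: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["a b a\n", "b c\n"])
```

Because the two stages only exchange plain text lines, the same pattern works in any language that can read standard input and write standard output.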
(8) Hadoop can be used on the Amazon EC2 cloud.
(9) Scribe is Facebook's log aggregation tool, and Mahout is a machine learning library that can be used with Hadoop. Mahout takes its name from the Hindi word for one who drives an elephant.