This lecture by Aaron gives a tour of the projects in the Hadoop ecosystem and lists other libraries that Hadoop supports. It is not very technical, but it is a nice breadth video about Hadoop.

The following are some points I picked up from the video:

(1) Since Hadoop is heavily based on Google's research papers, its architecture is also similar to Google's, except for some variation in the names.

Google MapReduce –> Hadoop MapReduce, Google File System –> Hadoop Distributed File System (HDFS), Bigtable –> HBase [very large scale database system], Chubby –> ZooKeeper [distributed locking system for synchronization]

(2) Pig: a data-flow-oriented language. Queries written in Pig get compiled down to MapReduce programs, which execute over the dataset. Pig is a client-side library and a client-side shell.

(3) Hive: works in the same way, as a client-side library and a client-side shell. Its query language looks very much like SQL and supports SELECT, JOIN, GROUP BY and other SQL-like commands.
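To make "compiled down to MapReduce" a little more concrete, here is a minimal plain-Python sketch of how a query such as SELECT user, COUNT(*) ... GROUP BY user could be broken into map, shuffle and reduce steps. It is only an illustration of the idea, not what Pig or Hive actually generate; the table contents and all function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical input table of (user, page) rows, invented for illustration.
ROWS = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for user, _page in rows:
        yield user, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; here the aggregate is COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_user(rows):
    """Chain the three steps, the way a compiled query plan would."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling count_by_user(ROWS) runs the three steps in order. On a real cluster the map and reduce steps would run on many machines, and the shuffle would be done by the framework.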

(4) HBase: column-store database. HDFS makes it hard to look up individual data records of 1 KB or less, and HBase facilitates that. HBase has named rows and columns, with an identifier associated with each of these. The data is sparse: HBase can have a huge number of columns, but not every (row, column) cell has data in it.

BLOBs can be stored in these cells. Also, unlike SQL, which is strongly typed, everything in HBase is just an uninterpreted string of bytes.

Bulk scan: different mappers can work on different slices of the row space, so bulk scans build on the existing MapReduce platform.
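The data model described above can be sketched as a toy in Python. This is a stand-in to show sparse (row, column) cells and row-range scans, not the HBase API; the class name, method names and sample keys are all assumptions for the example.

```python
class SparseTable:
    """Toy stand-in for a sparse table with named rows and columns."""

    def __init__(self):
        self._rows = {}  # row key -> {column name -> value as bytes}

    def put(self, row, column, value):
        """Store a value in one (row, column) cell."""
        self._rows.setdefault(row, {})[column] = value

    def get(self, row, column):
        """Return the cell value, or None if the cell was never written."""
        # Empty cells take no space; they are simply absent from the dict.
        return self._rows.get(row, {}).get(column)

    def scan(self, start_row, stop_row):
        """Yield (row, cells) for start_row <= row < stop_row in key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, dict(self._rows[row])


table = SparseTable()
table.put("row-a", "name", b"Ann")
table.put("row-m", "email", b"m@example.com")

# Two "mappers", each handed a different slice of the row space.
slice_one = list(table.scan("row-a", "row-k"))
slice_two = list(table.scan("row-k", "row-z"))
```

Each scan call here plays the role of one mapper working through its own range of row keys, which is the idea behind the bulk scan.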

(5) ZooKeeper: distributed consensus engine. A DB could be locked in ZooKeeper (on the DB's identifier in ZooKeeper), and other clients do not attempt to access it while the lock is held.
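The locking idea can be sketched with a small in-process registry. This is only a stand-in that runs inside one Python process; it is not the ZooKeeper API and it does nothing distributed. The class name, method names and the "/db/orders" identifier are assumptions for the example.

```python
import threading


class LockRegistry:
    """In-process stand-in for a lock service keyed on resource identifiers."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dict below
        self._owners = {}  # resource identifier -> client holding the lock

    def acquire(self, resource_id, client):
        """Try to take the lock; return True on success, False if held."""
        with self._guard:
            if resource_id in self._owners:
                return False
            self._owners[resource_id] = client
            return True

    def release(self, resource_id, client):
        """Release the lock, but only if this client actually holds it."""
        with self._guard:
            if self._owners.get(resource_id) == client:
                del self._owners[resource_id]
                return True
            return False


registry = LockRegistry()
first = registry.acquire("/db/orders", "client-A")   # lock is free
second = registry.acquire("/db/orders", "client-B")  # lock is already held
```

A client that fails to acquire the lock is expected to stay away from the resource until the holder releases it. A real lock service also has to deal with clients that crash while holding a lock, which this sketch ignores.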

(6) FUSE-DFS: allows mounting HDFS volumes as Linux file systems. However, this does not imply that HDFS could be used as a general-purpose file system.

(7) Pipes and Streaming libraries: Pipes is a way to write code in C++ and connect it to MapReduce, while Streaming can be used with an arbitrary scripting language.
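As a rough sketch of what a Streaming job looks like in a scripting language, here is a word count in Python. Streaming passes records to the mapper and reducer as lines on standard input and reads tab-separated key/value lines back from standard output, and the reducer sees its input sorted by key. The function names are my own, and the functions take any iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _word, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the sort Hadoop does between stages.
mapped = list(mapper(["a b a", "b a"]))
counts = list(reducer(sorted(mapped)))
```

In an actual job each function would live in its own small script that loops over sys.stdin and prints each result line, and the two scripts would be passed to the Streaming jar as the mapper and the reducer.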

(8) Hadoop can be run on the Amazon EC2 cloud.

(9) Scribe is Facebook's log aggregation tool, and Mahout is a machine learning library which can be used with Hadoop. Mahout derives its name from the Hindi word for one who drives an elephant :P


