The videos from Cloudera seem to be the best online source of learning Hadoop. I saw a couple of their videos, and the second one explains in detail abt map reduce and HDFS.

The video speaks about Map Reduce in the first half and then continues to HDFS in the second.The following are some of the points I  picked up from the video.. A few of them are actual words from the video. This slide is used by the presenter.

(1) Traditional “Map” , in other functional languages  like LISP  yield a single key value paid  output for a Single  Key value pair input. This is a simple upper case mapper.maps  inputs of lower case to upper.

(“foo”, “bar”)  (“FOO”, “BAR”)
(“Foo”, “other”)  (“FOO”, “OTHER”)
(“key2”, “data”)  (“KEY2”, “DATA”)
(“foo”, “bar”)  ---> (“FOO”, “BAR”)

(“Foo”, “other”)  ---> (“FOO”, “OTHER”)

(“key2”, “data”)  --->(“KEY2”, “DATA”)

But MapReduce  is influenced by this paradigm but not married to it ( :P ).
(2)The explode mapper emits more than  one output.It is also possible not to emit any output.

(“A”, “cats”) ---> (“A”, “c”), (“A”, “a”),(“A”, “t”), (“A”, “s”)

(“B”, “hi”) ---> (“B”, “h”), (“B”, “i”)

Filter Mapper – is used to boil  down/eliminate or simply filter out outputs.Ex: Output only if a value is prime .

let map(k, v) = if (isPrime(v)) then emit(k, v)

(“foo”, 7)  --->(“foo”, 7)

(“test”, 10)  ---> (nothing)

The Mappers  might sometime  not use a Key @ all. The Mappers do  not communicate with one another.

(3)Reducer : is the one that aggregates all the results generated by the Mappers.

-> All of the Mapperscan contribute to any of the Reducers.

The reducers return one final value for each of the Keys- The reducers are also isolated !

(4)Map Reduce does not have a notion of streaming data. All the data are aggregated and then sent to the Reducers.Looks like a performance bottleneck, but Mappers and Reducers are physically the samenodes. In practice, the spare cycles are picked  up by mappers  for other tasks upfront.

Ordering of Key,value pairs : The values are generally not sorted,but the  order in which the keys are fed to the Reducers are sorted in increasig order.

An example where  both Map and Reduce are used togather : Counting words across a set of documents.The Key,value pair would be (the word,1) which means that the word was seen once. Pairs with the common keys are feed to the same Reducer,which endup summing up the number of occurences. Assuming  there are some 50K unique words, 50K calls are made to the reduce  method focussing on one Key @ a time in isolation from other keys and values. However,the order through which the reducer is fed data with is sorted by the order of Keys.

Combine : A phase between Map and Reduce , performed locally on the Mapping node.A simple  reason why this could be done is

- the word ‘the’may appear 1000s of time  in a document , and it is a waste of network resources to send all of these instances,where infact only thenumber of ‘the’s is required. The summing reducer is associative and commutative ,so the order of the words dont matter.

- Free to design a logic for the combiner, in this case the same method used to reduce can be used to combine.

- repeated Combination could  be performed to keep the dataset bounded. Emit 100 keys -combine to oneand then emit another key and combine the 101 keys to 1

- The Mapper generates data , and fills up the buffer. Once the buffer is full,the Mapper node runs the combiner freeing up the buffer for mapper to generate more keys.This could  happen  multiple times .

HDFS :- (unfortunately called a File system,but is more a Dataset system )

based on Google File System (GFS) and it was built with the following assumptions:

- individual disks fail all the time.

Doesnt make a difference in using a HD with  greater Mean time to failure, as it becomes pointless when the system scales.  ” The law of large numbers ensure all probabilities become certainities.”

Once a file is closed , its immutable – append is the only way to update. This  is because , the data is replicated(to survive hardware failure) in different HDs -so the update   has to be synchronised. And also keeping multiple updates from different users  synchronised acrossed  multiple machines is hard.

Append is relatively easier than update , so it seems to have been implemented first.

Facebook uses HIVE to store their  log files.Each block is copied to @least three datanodes.

A single “namenode ” that has all Meta data. The namenode can be configured to back up in real time,if it fails- the system suspends service till namenode is restarted/  made available again.

Hadoop has no cache protocol.

- to access a certain chunk of data, the client sents a request to the Master node with the name of the file and the chunk number.The master node, returns a handle to the client.It also mentions the 3 datanodes where the data is available and the client fetches the data directly from the datanode.

The Masternode only serves the  MetaData, all of these from the RAM for fast lookup.

- the client chooses datanode in random. But there is a notion of Rack locality,tries to download from a Disk in the same Rack.

- Resumes download in  case of failure halfway through a read.

Scalability Bottleneck – the size of the cluster is limited by the RAM space in the metadata node. Namenode has 3 or 4 times RAM than that of datanodes.

-Data nodes do not know what files they hold,they save them as opaque block object – named as  block 3002840772 something!!

- The metadata on the RAM is also  written to the Hard Disk or some Remote HD,so if it fails there is always an image in HD.But it however never  interactively responds from the HD.



No Responses Yet to “MapReduce and HDFS”  

  1. No Comments Yet

Leave a Reply