Programming with Hadoop
26Oct09
This video covers programming with Hadoop. I could not follow a good 20 minutes of it.
I picked up the following points from the video, though there are a few more still to understand.
(1) Some Hadoop terminology
Job: A full program that runs both the mapper and the reducer together across a data set.
Task: A job is broken down into tasks. An individual mapper or reducer is referred to as a task.
Task attempt: If a mapper crashes on one machine, it may be restarted. It is still the same task, but a different task attempt.
(2) The system retries a failing task over and over but eventually gives up, for example when a particular input slice keeps triggering the failure. At this point,
(a) either the entire job can fail
(b) or a quality factor can be set that specifies how many map/reduce tasks must succeed for the job to meet its overall goal.
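To make point (2) concrete for myself, here is a minimal sketch in plain Java with no Hadoop on the classpath. The class name, the limit of 4 attempts and the percentage threshold are all my own assumptions for illustration. In the old `org.apache.hadoop.mapred` API the matching knobs are, as far as I know, `JobConf.setMaxMapAttempts` and `JobConf.setMaxMapTaskFailuresPercent`.

```java
import java.util.function.IntPredicate;

class RetryDemo {
    // Hypothetical limit, chosen only for this illustration.
    static final int MAX_ATTEMPTS = 4;

    /** Runs one task, retrying until an attempt succeeds or the attempts run out. */
    static boolean runTask(int taskId, IntPredicate attemptSucceeds) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            // Same task id every time; only the attempt number changes.
            if (attemptSucceeds.test(taskId)) {
                return true;
            }
        }
        return false; // every attempt failed, e.g. because of a bad input slice
    }

    /** The job succeeds if the share of failed tasks stays within maxFailedPercent. */
    static boolean runJob(int numTasks, int maxFailedPercent, IntPredicate attemptSucceeds) {
        int failed = 0;
        for (int taskId = 0; taskId < numTasks; taskId++) {
            if (!runTask(taskId, attemptSucceeds)) {
                failed++;
            }
        }
        return failed * 100 <= maxFailedPercent * numTasks;
    }

    public static void main(String[] args) {
        // Task 3 stands in for the input slice that always triggers the failure.
        IntPredicate badSliceAtTask3 = taskId -> taskId != 3;
        System.out.println(runJob(10, 0, badSliceAtTask3));  // case (a): no failures tolerated
        System.out.println(runJob(10, 10, badSliceAtTask3)); // case (b): 10% of tasks may fail
    }
}
```

With one permanently failing task out of ten, the first call reports a failed job and the second reports success, which is the difference between cases (a) and (b).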
(3) The same task can be attempted in parallel by different mappers; whichever attempt completes first wins and the others are killed.
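The idea in point (3) can be imitated on a single machine with threads. This is only an analogy, not how Hadoop schedules anything: the class and attempt names are made up, and the sleep times stand in for a slow and a fast node.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeDemo {
    /** One attempt of the same task; delayMillis simulates a slow or a fast node. */
    static Callable<String> attempt(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis); // interrupted if another attempt finishes first
            return name;
        };
    }

    /** Starts all attempts at once and returns the result of the first one to finish. */
    static String runSpeculatively(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            // invokeAny returns one successfully completed result and cancels the rest.
            return pool.invokeAny(attempts);
        } catch (Exception e) {
            throw new RuntimeException("all attempts failed", e);
        } finally {
            pool.shutdownNow(); // "kill" any attempt still running
        }
    }

    public static void main(String[] args) {
        String winner = runSpeculatively(Arrays.asList(
                attempt("attempt_0 (slow node)", 2000),
                attempt("attempt_1 (fast node)", 50)));
        System.out.println("winner: " + winner);
    }
}
```

The result is taken from whichever attempt returns first, so a single slow machine does not hold up the whole job.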
(4) There is a job tracker that runs on the master node and tells the slave nodes which particular task units they must run. A task tracker runs on each slave and is responsible for managing all the tasks on that node. The tasks run in a different JVM from that of the task tracker, so even if a particular task crashes, the task tracker is isolated and continues to run.
(5) There is exactly one task tracker per node, and all of the tasks on that node report to it.
(6) The job tracker is decided well in advance, and its IP is published in the configuration file sent to the slaves.
A job cannot be run from a loose set of .class files; they must all be assembled into a .jar file. The client then uploads the JAR into HDFS and writes the configuration specs for the job, typically as an XML file. Using RPC, the client sends the job tracker a pointer to the JAR's location in HDFS along with the XML config file. The job tracker then notifies all the task trackers to download the JAR from the shared HDFS.
(7) Sometimes a record spills over into the next block, but Hadoop takes care of this by reading past the end of the block.
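Here is a small sketch of how I picture point (7), written in plain Java over an in-memory string. It is not Hadoop's actual record reader. It assumes newline-terminated records and a tiny made-up block size of 8 characters, and it uses one rule: a record belongs to the split in which its first character lies.

```java
import java.util.ArrayList;
import java.util.List;

class SplitReaderDemo {
    /** Returns the newline-terminated records owned by the split [start, end). */
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        if (start > 0) {
            // Skip the partial record that began in an earlier split.
            // Searching from start - 1 keeps a record that begins exactly at 'start'.
            int newline = data.indexOf('\n', start - 1);
            pos = (newline < 0) ? data.length() : newline + 1;
        }
        while (pos < end && pos < data.length()) {
            int newline = data.indexOf('\n', pos);
            // The record's end may lie beyond 'end': this is the read past the block.
            int lineEnd = (newline < 0) ? data.length() : newline;
            records.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\n";
        int blockSize = 8; // hypothetical and far smaller than a real HDFS block
        for (int start = 0; start < data.length(); start += blockSize) {
            int end = Math.min(start + blockSize, data.length());
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

Here "bravo" starts inside the first block and ends in the second, so the first split reads past its own end to finish it, and the second split skips that tail. Every record is read exactly once.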
(8) The JobConf object describes the specs of a job. It gets serialized, sent across the network, and deserialized back into a JobConf object on the receiving machine.
(9) FileInputFormat.addInputPath(conf, path) is used to specify the input to the mapper. The path can be a single file or a directory (in which case all the files in the directory are used).
FileOutputFormat.setOutputPath(conf, path) specifies where the reducers write their results.
runJob() blocks: the program waits until the MapReduce job finishes and only then goes on. submitJob() sends the job to the job tracker's queue and returns a handle to the job.
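The difference between the blocking and non-blocking calls is easy to mimic in plain Java. The sketch below is only an analogy with made-up names and a fake job that sleeps and returns 42; the real calls, as far as I know, are `JobClient.runJob(conf)` and `JobClient.submitJob(conf)`, the latter returning a `RunningJob` handle.

```java
import java.util.concurrent.CompletableFuture;

class SubmitDemo {
    /** Stand-in for a MapReduce job: sleeps, then reports a made-up result. */
    static int doJob(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 42;
    }

    /** Non-blocking: hands the job to a background thread and returns a handle at once. */
    static CompletableFuture<Integer> submitJob(long millis) {
        return CompletableFuture.supplyAsync(() -> doJob(millis));
    }

    /** Blocking: submits the job and waits for it to finish before returning. */
    static int runJob(long millis) {
        return submitJob(millis).join();
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> handle = submitJob(200);
        System.out.println("done right after submit? " + handle.isDone());
        System.out.println("result via handle: " + handle.join());
        System.out.println("result via runJob: " + runJob(200));
    }
}
```

With the handle, the caller is free to do other work or poll for completion, whereas the blocking call is simpler when the program has nothing else to do until the job ends.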
(10) The client sends a message to the master every 10 seconds; in response, the client may receive new data to map/reduce.
The JAR is cached, so it is not downloaded again.
(11) Once a task is completed, the node shuts down the JVM and spawns a new one for the next task, which is wasteful. JVM reuse would solve this and is being considered.