Hadoop Beginner's Guide

Garry Turkington (2013)
Review date: May, 2013
Summary

This is an introductory book on Hadoop. With this mission in mind, the author goes for breadth when covering the topic. The first chapter talks about big data, challenges, scaling, the MapReduce paradigm, and the origin of Hadoop. How cloud technologies fit into all of this is also covered at this point. The second chapter contains detailed instructions on setting up Hadoop on a local server and making it solve a classic problem: counting how many times words occur in a document. Then the exercise is repeated using Amazon's Elastic MapReduce.

Chapter three, called "Understanding MapReduce", takes a few steps back and explains the paradigm in greater detail, with emphasis on the Java implementation. The subsequent chapter is mostly about Hadoop's streaming API, which allows implementing mappers and reducers in any language, as long as they can read from standard input and write to standard output. A dataset containing UFO sightings is used in the examples, and the mappers and reducers are written in Ruby.
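To give a feel for the streaming contract, here is a minimal sketch of the classic word count in Python. It is not taken from the book (whose examples are in Ruby), and the names mapper, reducer and run_locally are made up for illustration. The sort step only stands in for Hadoop's shuffle phase, so the whole thing runs in a single process without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, the way a streaming mapper
    would write tab-separated key/value lines to standard output."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word. The input must already be sorted by key,
    which is what the shuffle phase guarantees in a real job."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort (stand-in for shuffle) -> reduce in one process.
    Hypothetical helper for illustration only."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, the mapper and the reducer would presumably live in two separate scripts that read sys.stdin and print their records, with Hadoop taking care of the sorting in between.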

The next chapter, "Advanced MapReduce Techniques", explains the challenges in performing joins and solving graph problems. These areas don't immediately come to mind for a Hadoop beginner. The chapter also covers Avro, a serialization framework, and how it can be used in conjunction with Hadoop.

Chapter six is about Hadoop's fault tolerance. It describes various failure scenarios and explains how recovery is performed. The chapter contains scenarios that simulate failure (and recovery) of pretty much every component in a Hadoop cluster. Chapter seven is mostly about configuration and, to some extent, scaling.

Chapters 8, 9, and 10 are about Hadoop in a larger context and interoperability with other systems and frameworks. These chapters cover Apache Hive (a data warehouse that provides a relational view on Hadoop data), moving data to and from relational databases, and getting data into Hadoop in general.

Opinion

I read this book "out of context", meaning that I didn't have an interesting problem solvable by MapReduce at hand and a dire need to learn Hadoop at the time of reading. Instead, I took time to read this book with the purpose of determining whether it's a good beginner book or not. All in all, I'd say that it is. The author really succeeds in creating a context for Hadoop and its ecosystem.

From the second chapter onwards, Hadoop is gradually introduced using very detailed instructions. The general format is to list every single command the user needs to type along with its output, so the book is full of terminal session listings. All such listings are followed by sections called "What just happened?" that explain in detail the purpose of the commands and their output. This is actually quite helpful; readers who understand what's happening from just looking at the session listing can safely skip these sections.

The above approach should enable any reader, regardless of level of experience, to follow along and do the exercises or labs, which is a good thing for a beginner book. I have a remark about this though: the session dumps could have been proofread better! I can't say that I read them through a magnifying glass, but still I found quite a few errors.

As for the contents, the book can be thought of as being divided into two parts: "Core and advanced Hadoop", and "Hadoop in a bigger context", where chapters one to five make up the first part. In fact, to get started, the reader only needs to read chapters two through four. I liked this structure. However, I reacted to one thing: the book never shows the monster! In my opinion, the introductory chapter fails to actually establish a case for Hadoop and MapReduce. Yes, it's about big data, scaling and problems and so on, but I couldn't find a logical transition to Hadoop as a solution to these problems. Instead, chapter two illustrates the framework with a distributed calculation of pi and the word counting program (Hadoop's version of the "Hello world" program).

In a later chapter, Hadoop is used to process a dataset with UFO sightings, and then, in a chapter on advanced techniques, a graph problem is solved. Not until that chapter did I start getting a feeling for what kind of problems Hadoop and MapReduce should be used for. This is what I mean by "never showing the monster". Since this is an introductory text, I'd prefer the first or second chapter to describe some problems that are good candidates for the MapReduce paradigm, illustrate one of them, and then show how a distributed computation would help.

That said, I may be off track here. This is a book on Hadoop, and not MapReduce in general, and I did say that I read it without having intricate MapReduce problems at hand. This is pretty much my only criticism. If a reader doesn't perceive this as a problem, then there's nothing to complain about. After reading the book, I feel that I have a very good feeling for what Hadoop does and what building blocks in its ecosystem to use. I was actually even able to find some favorite chapters!

I liked chapter four because of its examples; they felt quite realistic and relevant. Chapter five was a favorite because it was good at explaining why joins in Hadoop are hard and because it was able to explain how MapReduce can be applied to graph traversal. For a beginner, graph traversal isn't the first problem that comes to mind. Besides, it's the first time I've read a non-academic text that mentions Bloom filters (this is probably only interesting to someone who has studied computer science).

Finally, I also liked chapter six (the one on fault tolerance and error recovery). By showing the reader how to repair a cluster and explaining how redundancy is handled, the book inspires confidence. As a reader, I'm able to say: "OK, I know how to set this thing up for basic use, and I have a chance of fixing it if it breaks".

One more thing... Here and there the book contains examples of how to use Amazon's EMR. This didn't feel awfully important to me, but it provides an even more solid explanation of how to apply the framework to bigger problems and how it can be used in a cloud environment.

To sum up: a good and comprehensive book on Hadoop that covers the framework and its ecosystem, with verbose, easy-to-follow examples and a structure that leaves the reader with a sense of getting the big picture. Minus: it could devote some more pages to the MapReduce paradigm.

Who should read this book

Those who want very practical advice on getting started with Hadoop will find this book helpful.



