Hortonworks Apache Hadoop Developer Hadoop 2.0 Certification exam for Pig and Hive Developer Online Training

Question #1

Which one of the following statements describes a Pig bag, tuple, and map, respectively?

  • A . Unordered collection of maps, ordered collection of tuples, ordered set of key/value pairs
  • B . Unordered collection of tuples, ordered set of fields, set of key value pairs
  • C . Ordered set of fields, ordered collection of tuples, ordered collection of maps
  • D . Ordered collection of maps, ordered collection of bags, and unordered set of key/value pairs

Correct Answer: B
Question #2

You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster.

Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?

  • A . Run all the nodes in your production cluster as virtual machines on your development workstation.
  • B . Run the hadoop command with the -jt local and the -fs file:/// options.
  • C . Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
  • D . Run simldooop, the Apache open-source software for simulating Hadoop clusters.

Correct Answer: C
Question #3

Which HDFS command uploads a local file X into an existing HDFS directory Y?

  • A . hadoop scp X Y
  • B . hadoop fs -localPut X Y
  • C . hadoop fs -put X Y
  • D . hadoop fs -get X Y

Correct Answer: C
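
For reference, a minimal sketch of the put command (the file and directory names here are hypothetical), uploading localfile.txt into the existing HDFS directory /user/data:

hadoop fs -put localfile.txt /user/data/
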
Question #4

In Hadoop 2.0, which TWO of the following processes work together to provide automatic failover of the NameNode? Choose 2 answers

  • A . ZKFailoverController
  • B . ZooKeeper
  • C . QuorumManager
  • D . JournalNode

Correct Answer: A,D
Question #5

To use a Java user-defined function (UDF) with Pig, what must you do?

  • A . Define an alias to shorten the function name
  • B . Pass arguments to the constructor of UDFs implementation class
  • C . Register the JAR file containing the UDF
  • D . Put the JAR file into the user's home folder in HDFS

Correct Answer: C
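
As an illustration (the JAR name, package, and relation below are hypothetical), a Pig script must REGISTER the JAR before the UDF can be invoked; DEFINE is optional and merely provides a short alias:

REGISTER myudfs.jar;
DEFINE UPPER com.example.pig.Upper();
B = FOREACH A GENERATE UPPER(name);
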
Question #6

When is the earliest point at which the reduce method of a given Reducer can be called?

  • A . As soon as at least one mapper has finished processing its input split.
  • B . As soon as a mapper has emitted at least one record.
  • C . Not until all mappers have finished processing all records.
  • D . It depends on the InputFormat used for the job.

Correct Answer: C

Explanation:

In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data. Another job that starts later and would actually use those reduce slots cannot use them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also set mapred.reduce.slowstart.completed.maps on a job-by-job basis.

Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, setting it to 0.1 would probably be appropriate.
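
The same property can also be supplied per job on the command line (the JAR, driver, and path names below are hypothetical; the driver must accept generic options via ToolRunner):

hadoop jar myjob.jar MyDriver -D mapred.reduce.slowstart.completed.maps=0.9 input output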

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are started in a MapReduce job?

Question #7

Which one of the following statements describes the relationship between the ResourceManager and the ApplicationMaster?

  • A . The ApplicationMaster requests resources from the ResourceManager
  • B . The ApplicationMaster starts a single instance of the ResourceManager
  • C . The ResourceManager monitors and restarts any failed Containers of the ApplicationMaster
  • D . The ApplicationMaster starts an instance of the ResourceManager within each Container

Correct Answer: A
Question #8

Which HDFS command copies an HDFS file named foo to the local filesystem as localFoo?

  • A . hadoop fs -get foo LocalFoo
  • B . hadoop -cp foo LocalFoo
  • C . hadoop fs -ls foo
  • D . hadoop fs -put foo LocalFoo

Correct Answer: A
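
For reference, -get also has the equivalent alias -copyToLocal (names as in the question):

hadoop fs -get foo localFoo
hadoop fs -copyToLocal foo localFoo
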
Question #9

You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file.

Which is the best way to make this library available to your MapReduce job at runtime?

  • A . Have your system administrator copy the JAR to all nodes in the cluster and set its location in the HADOOP_CLASSPATH environment variable before you submit your job.
  • B . Have your system administrator place the JAR file on a Web server accessible to all cluster nodes and then set the HTTP_JAR_URL environment variable to its location.
  • C . When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
  • D . Package your code and the Apache Commons Math library into a zip file named JobJar.zip

Correct Answer: C

Explanation:

The usage of the jar command is like this,

Usage: hadoop jar <jar> [mainClass] args…

If you want the commons-math3.jar to be available to all the tasks, you can either have it copied to every node and added to HADOOP_CLASSPATH, or pass it with the -libjars option when submitting the job, which ships the JAR to each node through the distributed cache.
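
A minimal sketch of the -libjars invocation (the JAR, driver, and path names are hypothetical):

hadoop jar myjob.jar MyDriver -libjars commons-math3.jar input output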

Question #12

In a MapReduce job with 500 map tasks, how many map task attempts will there be?

  • A . It depends on the number of reduces in the job.
  • B . Between 500 and 1000.
  • C . At most 500.
  • D . At least 500.
  • E . Exactly 500.

Correct Answer: D

Explanation:

From Cloudera Training Course:

Task attempt is a particular instance of an attempt to execute a task

- There will be at least as many task attempts as there are tasks

- If a task attempt fails, another will be started by the JobTracker

- Speculative execution can also result in more task attempts than completed tasks
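
For example, speculative execution (one source of extra attempts) can be turned off per job with the MRv1 properties below (JAR, driver, and path names are hypothetical):

hadoop jar myjob.jar MyDriver -D mapred.map.tasks.speculative.execution=false -D mapred.reduce.tasks.speculative.execution=false input output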

Question #13

You want to count the number of occurrences for each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner.

Will you be able to reuse your existing Reducer as your combiner in this case, and why or why not?

  • A . Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
  • B . No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
  • C . No, because the Reducer and Combiner are separate interfaces.
  • D . No, because the Combiner is incompatible with a mapper which doesn’t use the same data type for both the key and value.
  • E . Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.

Correct Answer: A

Explanation:

Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual mapper nodes, which can greatly reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and may execute it more than once if required. Therefore your MapReduce jobs should not depend on the combiner's execution.
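
In driver code, reusing the reducer as a combiner is a one-line addition (a sketch using the new API; the class name is hypothetical):

job.setReducerClass(WordCountReducer.class);
job.setCombinerClass(WordCountReducer.class); // valid here because summing is commutative and associative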

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

Question #14

What data does a Reducer reduce method process?

  • A . All the data in a single input file.
  • B . All data produced by a single mapper.
  • C . All data for a given key, regardless of which mapper(s) produced it.
  • D . All data for a given value, regardless of which mapper(s) produced it.

Correct Answer: C

Explanation:

Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.

All values with the same key are presented to a single reduce task.
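
For reference, a sketch of the old-API reduce signature this describes, assuming Text keys and IntWritable values inside a Reducer implementation:

public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // called once per unique key, with an iterator over every value emitted for that key
}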

Reference: Yahoo! Hadoop Tutorial, Module 4: MapReduce

Question #15

Given a directory of files with the following structure: line number, tab character, string:

Example:

1	abialkjfjkaoasdfjksdlkjhqweroij

2	kadfjhuwqounahagtnbvaswslmnbfgy

3	kjfteiomndscxeqalkzhtopedkfsikj

You want to send each line as one record to your Mapper.

Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?

  • A . SequenceFileAsTextInputFormat
  • B . SequenceFileInputFormat
  • C . KeyValueFileInputFormat
  • D . BDBInputFormat

Correct Answer: C

Explanation:

http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
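
Note that the class actually shipped in the old Hadoop API for this purpose is KeyValueTextInputFormat, which treats everything before the first tab on a line as the key and the rest as the value; assuming that class, the completed line would read:

conf.setInputFormat(KeyValueTextInputFormat.class);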

Question #16

Examine the following Hive statements:

Assuming the statements above execute successfully, which one of the following statements is true?

  • A . Each reducer generates a file sorted by age
  • B . The SORT BY command causes only one reducer to be used
  • C . The output of each reducer is only the age column
  • D . The output is guaranteed to be a single file with all the data sorted by age

Correct Answer: A
Question #17

When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?

  • A . When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value and when the reduce operation is both commutative and associative.
  • B . When the signature of the reduce method matches the signature of the combine method.
  • C . Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
  • D . Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
  • E . Never. Combiners and reducers must be implemented separately because they serve different purposes.

Correct Answer: A

Explanation:

You can use your reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

Question #18

What does the following WebHDFS command do?

curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"

  • A . Make a directory /foo/bar
  • B . Read a file /foo/bar
  • C . List a directory /foo
  • D . Delete a directory /foo/bar

Correct Answer: B
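
For comparison, the WebHDFS operations matching the other answer choices look like this (same host, port, and paths as above):

curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=MKDIRS"
curl -i "http://host:port/webhdfs/v1/foo?op=LISTSTATUS"
curl -i -X DELETE "http://host:port/webhdfs/v1/foo/bar?op=DELETE"
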
Question #19

You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.

Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.

  • A . hadoop “mapred.job.name=Example” MyDriver input output
  • B . hadoop MyDriver mapred.job.name=Example input output
  • C . hadoop MyDriver -D mapred.job.name=Example input output
  • D . hadoop setproperty mapred.job.name=Example MyDriver input output
  • E . hadoop setproperty (“mapred.job.name=Example”) MyDriver input output

Correct Answer: C

Explanation:

Configure the property using the -D key=value notation:

-D mapred.job.name=’My Job’

You can list a whole bunch of options by calling the streaming jar with just the -info argument
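
A minimal sketch of such a driver (old API; the class and job details are hypothetical). ToolRunner parses generic options like -D into the Configuration before run() is called:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyDriver.class); // getConf() already holds the -D overrides
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}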

Reference: Python hadoop streaming: Setting a job name

Question #20

Determine which best describes when the reduce method is first called in a MapReduce job?

  • A . Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
  • B . Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
  • C . Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
  • D . Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.

Correct Answer: B

Explanation:

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are started in a MapReduce job?

Question #21

You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt.

How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a path object representing this directory?

  • A . Four, all files will be processed
  • B . Three, the pound sign is an invalid character for HDFS file names
  • C . Two, file names with a leading period or underscore are ignored
  • D . None, the directory cannot be named jobdata
  • E . One, no special characters can prefix the name of an input file

Correct Answer: C

Explanation:

Files starting with '_' or '.' are considered hidden and are ignored by FileInputFormat's default path filter, while '#' is an allowed character in HDFS file names. Here _first.txt and .third.txt are skipped, so only second.txt and #data.txt are processed.

Question #22

In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?

  • A . m × n (i.e., m multiplied by n)
  • B . n
  • C . m
  • D . m+n (i.e., m plus n)
  • E . m^n (i.e., m to the power of n)

Correct Answer: A

Explanation:

A MapReduce job with m mappers and n reducers involves up to m × n distinct copy operations, since each mapper may have intermediate output destined for every reducer. For example, 500 mappers and 20 reducers yield up to 10,000 copy operations.

Question #23

Which Hadoop component is responsible for managing the distributed file system metadata?

  • A . NameNode
  • B . Metanode
  • C . DataNode
  • D . NameSpaceManager

Correct Answer: A
Question #24

Review the following data and Pig code.

M,38,95111

F,29,95060

F,45,95192

M,62,95102

F,56,95102

A = LOAD 'data' USING PigStorage(',') AS (gender:chararray, age:int, zip:chararray);

B = FOREACH A GENERATE age;

Which one of the following commands would save the results of B to a folder in hdfs named myoutput?

  • A . STORE A INTO 'myoutput' USING PigStorage(',');
  • B . DUMP B using PigStorage('myoutput');
  • C . STORE B INTO 'myoutput';
  • D . DUMP B INTO 'myoutput';

Correct Answer: C
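
As a usage note, STORE writes the relation to the named HDFS folder (one part file per reducer), while DUMP only prints to the console and takes no output path. A delimiter may be supplied just as with LOAD:

STORE B INTO 'myoutput' USING PigStorage(',');
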
Question #25

MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.

  • A . Health state checks (heartbeats)
  • B . Resource management
  • C . Job scheduling/monitoring
  • D . Job coordination between the ResourceManager and NodeManager
  • E . Launching tasks
  • F . Managing file system metadata
  • G . MapReduce metric reporting
  • H . Managing tasks

Correct Answer: B,C

Explanation:

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

Note:

The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

/ Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.

/ Managing the parallelization execution of any specific job. Under YARN, this will be done separately for each job.

Reference: Apache Hadoop YARN - Concepts & Applications

Question #26

Assuming the following Hive query executes successfully:

Which one of the following statements describes the result set?

  • A . A bigram of the top 80 sentences that contain the substring "you are" in the lines column of the inputdata table.
  • B . An 80-value ngram of sentences that contain the words "you" or "are" in the lines column of the inputdata table.
  • C . A trigram of the top 80 sentences that contain "you are" followed by a null space in the lines column of the inputdata table.
  • D . A frequency distribution of the top 80 words that follow the subsequence "you are" in the lines column of the inputdata table.

Correct Answer: D
Question #27

Given the following Pig commands:

Which one of the following statements is true?

  • A . The $1 variable represents the first column of data in ‘my.log’
  • B . The $1 variable represents the second column of data in ‘my.log’
  • C . The severe relation is not valid
  • D . The grouped relation is not valid

Correct Answer: B
Question #28

What does Pig provide to the overall Hadoop solution?

  • A . Legacy language Integration with MapReduce framework
  • B . Simple scripting language for writing MapReduce programs
  • C . Database table and storage management services
  • D . C++ interface to MapReduce and data warehouse infrastructure

Correct Answer: B
Question #29

What types of algorithms are difficult to express in MapReduce v1 (MRv1)?

  • A . Algorithms that require applying the same mathematical function to large numbers of individual binary records.
  • B . Relational operations on large amounts of structured and semi-structured data.
  • C . Algorithms that require global, shared state.
  • D . Large-scale graph algorithms that require one-step link traversal.
  • E . Text analysis algorithms on large collections of unstructured text (e.g, Web crawls).

Correct Answer: C
Explanation:

MapReduce tasks run independently and share no global state with one another, which is why algorithms that require global, shared state are difficult to express in MRv1.

Limitations of MapReduce - where not to use MapReduce:

While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. There are problems for which MapReduce is not suited, and papers that address the limitations of MapReduce.

Question #33

You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value.

As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?

  • A . Processor and network I/O
  • B . Disk I/O and network I/O
  • C . Processor and RAM
  • D . Processor and disk I/O

Correct Answer: B
Question #34

Which one of the following statements regarding the components of YARN is FALSE?

  • A . A Container executes a specific task as assigned by the ApplicationMaster
  • B . The ResourceManager is responsible for scheduling and allocating resources
  • C . A client application submits a YARN job to the ResourceManager
  • D . The ResourceManager monitors and restarts any failed Containers

Correct Answer: D
Question #35

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values.

Which interface should your class implement?

  • A . Combiner <Text, IntWritable, Text, IntWritable>
  • B . Mapper <Text, IntWritable, Text, IntWritable>
  • C . Reducer <Text, Text, IntWritable, IntWritable>
  • D . Reducer <Text, IntWritable, Text, IntWritable>
  • E . Combiner <Text, Text, IntWritable, IntWritable>

Correct Answer: D
Question #36

Which one of the following Hive commands uses an HCatalog table named x?

  • A . SELECT * FROM x;
  • B . SELECT x.* FROM org.apache.hcatalog.hive.HCatLoader('x');
  • C . SELECT * FROM org.apache.hcatalog.hive.HCatLoader('x');
  • D . Hive commands cannot reference an HCatalog table

Correct Answer: C
Question #37

Given the following Pig command:

logevents = LOAD 'input/my.log' AS (date:chararray, level:string, code:int, message:string);

Which one of the following statements is true?

  • A . The logevents relation represents the data from the my.log file, using a comma as the parsing delimiter
  • B . The logevents relation represents the data from the my.log file, using a tab as the parsing delimiter
  • C . The first field of logevents must be a properly-formatted date string or the load will return an error
  • D . The statement is not a valid Pig command

Correct Answer: B
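
For context: with no USING clause, Pig loads through PigStorage with its default tab delimiter, so the statement above behaves like this equivalent form (a sketch):

logevents = LOAD 'input/my.log' USING PigStorage('\t') AS (date:chararray, level:string, code:int, message:string);
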
Question #38

Consider the following two relations, A and B. Which one of the following statements joins A by a1 and B by b2?

  • A . C = JOIN B BY a1, A BY b2;
  • B . C = JOIN A BY a1, B BY b2;
  • C . C = JOIN A a1, B b2;
  • D . C = JOIN A $0, B $1;

Correct Answer: B
Question #39

Given the following Hive commands:

Which one of the following statements Is true?

  • A . The file mydata.txt is copied to a subfolder of /apps/hive/warehouse
  • B . The file mydata.txt is moved to a subfolder of /apps/hive/warehouse
  • C . The file mydata.txt is copied into Hive's underlying relational database.
  • D . The file mydata.txt does not move from its current location in HDFS

Correct Answer: A
Question #40

In a MapReduce job, the reducer receives all values associated with the same key.

Which statement best describes the ordering of these values?

  • A . The values are in sorted order.
  • B . The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
  • C . The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
  • D . Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.

Correct Answer: B

Explanation:

Note:

* Input to the Reducer is the sorted output of the mappers.

* The framework calls the application’s Reduce function once for each unique key in the sorted order.

* Example:

For the given sample input the first map emits:

< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>

The second map emits:

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

Question #41

Which describes how a client reads a file from HDFS?

  • A . The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
  • B . The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
  • C . The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
  • D . The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

Correct Answer: A

Explanation:

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client communicates with HDFS?

Question #42

For each input key-value pair, mappers can emit:

  • A . As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  • B . As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.
  • C . One intermediate key-value pair, of a different type.
  • D . One intermediate key-value pair, but of the same type.
  • E . As many intermediate key-value pairs as designed, as long as all the keys have the same types and all the values have the same type.

Correct Answer: E

Explanation:

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
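
A sketch of a map method illustrating this (new API, inside a hypothetical Mapper<LongWritable, Text, Text, IntWritable>): it emits zero or many pairs per input record, all with the same intermediate types:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
        context.write(new Text(word), new IntWritable(1)); // any number of Text/IntWritable pairs
    }
}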

Reference: Hadoop Map-Reduce Tutorial

Question #43

You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.

  • A . There is no difference in output between the two settings.
  • B . With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
  • C . With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
  • D . With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.

Correct Answer: D

Explanation:

* It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

Note:

Reduce

In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).

Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

The output of the Reducer is not sorted.
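
A sketch of how the two settings compared in the question are chosen in driver code (new API; the Job instance is hypothetical):

job.setNumReduceTasks(0); // zero reducers: map output goes straight to HDFS, one file per map task
job.setNumReduceTasks(1); // one reducer: all matches are gathered into a single output file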

Question #44

In Hadoop 2.0, which one of the following statements is true about a standby NameNode?

The Standby NameNode:

  • A . Communicates directly with the active NameNode to maintain the state of the active NameNode.
  • B . Receives the same block reports as the active NameNode.
  • C . Runs on the same machine and shares the memory of the active NameNode.
  • D . Processes all client requests and block reports from the appropriate DataNodes.

Correct Answer: B
Question #45

In the reducer, the MapReduce API provides you with an iterator over Writable values.

What does calling the next() method return?

  • A . It returns a reference to a different Writable object each time.
  • B . It returns a reference to a Writable object from an object pool.
  • C . It returns a reference to the same Writable object each time, but populated with different data.
  • D . It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
  • E . It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.

Correct Answer: C

Explanation:

Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.
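
As a practical consequence, a value must be copied if it is kept past the current iteration (a sketch, assuming the old-API Iterator<IntWritable> values argument):

List<IntWritable> saved = new ArrayList<IntWritable>();
while (values.hasNext()) {
    // copy the contents; storing the reference itself would leave every element equal to the last value read
    saved.add(new IntWritable(values.next().get()));
}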

Reference: manipulating iterator in mapreduce
