Updated by edureka.co on May 05, 2021

Top 20 Hadoop Interview Questions

Looking for Hadoop cluster interview questions that employers frequently ask? Here is the second list of Hadoop Cluster Interview Questions, which covers setting up a Hadoop cluster.

Always keep in mind that theoretical knowledge alone is not enough to crack an interview. Employers also expect candidates to have practical knowledge of and hands-on experience with Hadoop. These Hadoop Cluster Interview Questions will help you gain practical knowledge of the Hadoop framework.

Source: https://www.edureka.co/blog/interview-questions/hadoop-interview-questions-hadoop-cluster/

1

What are the features of Standalone (local) mode?

In standalone mode there are no daemons; everything runs in a single JVM.
It has no DFS and uses the local file system.
Standalone mode is suitable only for running MapReduce programs during development, for testing.
It is one of the least used environments.

2

What are the features of Pseudo mode?

Pseudo mode is used for both development and testing. In Pseudo mode, all the daemons run on the same machine.

3

What are the features of Fully Distributed mode?

This is an important question because Fully Distributed mode is used in production, where ‘n’ machines form a Hadoop cluster and the Hadoop daemons run across those machines. The NameNode runs on one host and the DataNodes run on the other hosts. A NodeManager is installed on every DataNode host and is responsible for executing tasks on that host. All the NodeManagers are managed by the ResourceManager, which receives processing requests and passes the relevant parts of each request to the corresponding NodeManagers.

4

What is configured in /etc/hosts and what is its role in setting up a Hadoop cluster?

This is a technical question that tests your grasp of the basics. The /etc/hosts file contains hostnames and their IP addresses; it maps each IP address to a hostname. In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts so that we can refer to machines by hostname instead of by IP address.
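
As a minimal sketch of this mapping, the script below writes a few entries in /etc/hosts format to a temporary file (not the real /etc/hosts) and looks up the IP address for a hostname. The hostnames, the addresses and the lookup_ip helper are assumptions made for illustration only.

```shell
#!/bin/sh
# Sample entries in /etc/hosts format: IP address first, then hostname(s).
# These names and addresses are hypothetical, not taken from a real cluster.
HOSTS_FILE="$(mktemp)"
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1      localhost
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2
EOF

# lookup_ip NAME: print the IP address mapped to NAME in the sample file.
lookup_ip() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$HOSTS_FILE"
}

lookup_ip master    # prints 192.168.1.10
lookup_ip slave2    # prints 192.168.1.12
```

With entries like these in place on every node, configuration files and commands can name the machines (master, slave1) rather than hard-coding addresses.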

5

What are the default port numbers of NameNode, ResourceManager & MapReduce JobHistory Server?

You are expected to remember the basic server port numbers if you are working with Hadoop. The default web UI port numbers for these daemons are as follows:

NameNode – 50070 (Hadoop 2.x; Hadoop 3.x moved this to 9870)

ResourceManager – 8088

MapReduce JobHistory Server – 19888

6

What are the main Hadoop configuration files?

♣ Tip: Generally, approach this question by naming the four main configuration files in Hadoop and briefly describing each to show your expertise.

core-site.xml: core-site.xml tells the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings of Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml: hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
mapred-site.xml: mapred-site.xml contains the configuration settings of the MapReduce framework, such as the number of JVMs that can run in parallel, the size of the mapper and the reducer, the CPU cores available for a process, etc.
yarn-site.xml: yarn-site.xml contains the configuration settings of the ResourceManager and NodeManager, such as application memory management size, the operations needed on programs and algorithms, etc.
These files are in the etc/hadoop/ directory inside the Hadoop installation directory (conf/ in Hadoop 1.x).
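
To make the role of core-site.xml concrete, here is a minimal sketch that writes a one-property core-site.xml into a temporary directory and reads the NameNode address back out. The fs.defaultFS property is the Hadoop 2 and later name for this setting; the host master, the port 9000 and the get_property helper are assumptions for illustration, not values taken from this article.

```shell
#!/bin/sh
# Write a minimal core-site.xml into a scratch directory (not a real install).
CONF_DIR="$(mktemp -d)"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# get_property FILE NAME: print the <value> that follows <name>NAME</name>.
get_property() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            gsub(/.*<value>|<\/value>.*/, "")   # strip the surrounding tags
            print
            exit
        }
    ' "$1"
}

get_property "$CONF_DIR/core-site.xml" fs.defaultFS   # prints hdfs://master:9000
```

The helper only handles the simple one-tag-per-line layout shown here; a real deployment would read the setting with Hadoop's own tooling.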

7

How does the Hadoop CLASSPATH play a vital role in starting or stopping Hadoop daemons?

♣ Tip: To check your knowledge on Hadoop the interviewer may ask you this question.

The CLASSPATH includes all the directories containing the jar files required to start or stop Hadoop daemons. The CLASSPATH is set inside the /etc/hadoop/hadoop-env.sh file.

8

What is a spill factor with respect to the RAM?

♣ Tip: This is a theoretical question, but if you add a practical taste to it, you might get a preference.

The map output is stored in an in-memory buffer; when this buffer is almost full, the spilling phase starts and moves the data to a temporary folder on disk.

Map output is first written to the buffer, whose size is set by the mapreduce.task.io.sort.mb property. By default, it is 100 MB.

When the buffer reaches a certain threshold, it starts spilling the buffered data to disk. This threshold is specified in mapreduce.map.sort.spill.percent (0.80 by default).
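
A quick worked example of the arithmetic, assuming the 100 MB default buffer and a mapreduce.map.sort.spill.percent of 0.80 (the stock default, expressed here as the whole number 80 because shell arithmetic is integer-only). The spill_threshold_mb helper is a hypothetical name used only for this sketch.

```shell
#!/bin/sh
SORT_MB=100        # mapreduce.task.io.sort.mb
SPILL_PERCENT=80   # mapreduce.map.sort.spill.percent of 0.80, as a percentage

# spill_threshold_mb BUFFER_MB PERCENT: buffer fill level (in MB) at which
# spilling to disk begins.
spill_threshold_mb() {
    echo $(( $1 * $2 / 100 ))
}

spill_threshold_mb "$SORT_MB" "$SPILL_PERCENT"   # prints 80
```

So with these settings a spill starts once about 80 MB of the 100 MB buffer is in use; raising the buffer to 200 MB would move that point to 160 MB.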

9

What is the command to extract a compressed file in tar.gz format?

This is an easy question: the tar -xvf /file_location/filename.tar.gz command will extract the tar.gz compressed file.
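
The round trip can be tried safely in a scratch directory. This sketch creates a small archive and extracts it again; it passes -z explicitly to request gzip handling, which is a choice made here for portability rather than something the answer above requires. All file names are made up for the example.

```shell
#!/bin/sh
WORK_DIR="$(mktemp -d)"
mkdir -p "$WORK_DIR/src" "$WORK_DIR/out"
echo "hello hadoop" > "$WORK_DIR/src/sample.txt"

# Create a gzip-compressed archive (-c create, -z gzip, -f archive file).
tar -czf "$WORK_DIR/sample.tar.gz" -C "$WORK_DIR/src" sample.txt

# Extract it into out/ (-x extract, -z gunzip, -v list files, -f archive file).
tar -xzvf "$WORK_DIR/sample.tar.gz" -C "$WORK_DIR/out"

cat "$WORK_DIR/out/sample.txt"   # prints hello hadoop
```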

10

How will you check whether Java and Hadoop are installed on your system?

By using the following commands we can check whether Java and Hadoop are installed and whether their paths are set in the .bashrc file:
For checking Java – java -version
For checking Hadoop – hadoop version
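
Before running the version commands, it can help to confirm that each tool is on the PATH at all. The sketch below does that with the shell's built-in command -v; check_tool is a hypothetical helper name introduced for this example.

```shell
#!/bin/sh
# check_tool NAME: report whether NAME resolves to an executable on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

check_tool java
check_tool hadoop
```

If a tool is reported as not found even though it is installed, the usual cause is a missing PATH entry in .bashrc.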

11

What is the full form of fsck?

The full form of fsck is File System Check. HDFS supports the fsck (filesystem check) command to check for various inconsistencies. It is designed for reporting the problems with the files in HDFS, for example, missing blocks of a file or under-replicated blocks.

12

Which are the main hdfs-site.xml properties?

The three main hdfs-site.xml properties are:
dfs.name.dir (dfs.namenode.name.dir in Hadoop 2 and later) gives the location where the NameNode stores its metadata (FsImage and edit logs), whether on the local disk or in a remote directory.
dfs.data.dir (dfs.datanode.data.dir in Hadoop 2 and later) gives the location where the DataNodes store their data.
fs.checkpoint.dir (dfs.namenode.checkpoint.dir in Hadoop 2 and later) is the directory on the filesystem where the Secondary NameNode stores the temporary images of the edit logs that are to be merged with the FsImage for backup.
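
A minimal sketch of how these three settings might look in hdfs-site.xml, assuming Hadoop 2 or later, where they are named dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir. The /data/hadoop/... paths are placeholders chosen for this example, and the file is written to a temporary directory rather than a real installation.

```shell
#!/bin/sh
CONF_DIR="$(mktemp -d)"
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///data/hadoop/checkpoint</value>
  </property>
</configuration>
EOF

# List the property names defined in the file, one per line.
sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$CONF_DIR/hdfs-site.xml"
```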

13

What happens if you get a ‘connection refused java exception’ when you type hadoop fsck?

If you get a ‘connection refused java exception’ when you type hadoop fsck, it could mean that the NameNode is not working on your VM.

14

How can we view the compressed files via HDFS command?

We can view compressed files in HDFS using the hadoop fs -text /filename command.

15

What is the command to move into safe mode and exit safe mode?

♣ Tip: Approach this question by first explaining safe mode and then moving on to the commands.

Safe Mode in Hadoop is a maintenance state of the NameNode during which the NameNode doesn’t allow any changes to the file system. During Safe Mode, the HDFS cluster is read-only and doesn’t replicate or delete blocks.

To know the status of safe mode, you can use the command: hdfs dfsadmin -safemode get
To enter safe mode: hdfs dfsadmin -safemode enter
To exit safe mode: hdfs dfsadmin -safemode leave

16

What does the ‘jps’ command do?

The jps command is used to check all the Hadoop daemons, such as NameNode, DataNode, ResourceManager, NodeManager etc., which are running on the machine.
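
The sketch below shows one way to check for specific daemons in jps output, which lists one "<pid> <class name>" pair per line. To keep the example runnable without a cluster it uses a hard-coded sample; the process IDs are made up, and is_running is a hypothetical helper. On a real node the sample would be replaced with JPS_OUTPUT="$(jps)".

```shell
#!/bin/sh
# Illustrative jps-style output; the process IDs are invented for this example.
JPS_OUTPUT="4821 NameNode
4975 DataNode
5260 ResourceManager
5388 NodeManager
5710 Jps"

# is_running DAEMON: succeed if DAEMON appears as a class name in the output.
is_running() {
    printf '%s\n' "$JPS_OUTPUT" |
        awk -v d="$1" '$2 == d { found = 1 } END { exit !found }'
}

for daemon in NameNode DataNode SecondaryNameNode; do
    if is_running "$daemon"; then
        echo "$daemon is running"
    else
        echo "$daemon is NOT running"
    fi
done
```

Matching on the whole second field avoids a false positive such as SecondaryNameNode being counted as NameNode, which a plain substring search would produce.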

17

How can I restart the NameNode?

This question has two answers, and giving both will earn you a plus point. We can restart the NameNode by either of the following methods:

Stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command, and then start it again using ./sbin/hadoop-daemon.sh start namenode.
Use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them all again.

18

How can we check whether NameNode is working or not?

To check whether the NameNode is working or not, use the jps command. It shows all the running Hadoop daemons, so you can see whether the NameNode daemon is among them.

19

How can we look at the Namenode in the web browser UI?

If you want to look at the NameNode in the browser, the port number for the NameNode web UI is 50070. We can open it in a web browser using http://master:50070/dfshealth.jsp.

20

What are the network requirements for Hadoop?

You should answer this question as follows: the Hadoop core uses Secure Shell (SSH) to communicate with the slave nodes and to launch the server processes on them. It requires a password-less SSH connection between the master and all the slaves and secondary machines, so that it does not have to ask for authentication every time, since the master and the slaves communicate frequently.
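
A password-less connection is usually set up with an SSH key pair that has an empty passphrase. The sketch below generates such a pair into a temporary directory so that it does not touch ~/.ssh; the key type, the file name and the user and host in the comments are assumptions for illustration. It skips key generation if ssh-keygen is not installed.

```shell
#!/bin/sh
KEY_DIR="$(mktemp -d)"
KEY_FILE="$KEY_DIR/id_demo"

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase; -q suppresses the progress output.
    ssh-keygen -q -t ed25519 -N "" -f "$KEY_FILE"
    echo "public key written to $KEY_FILE.pub"
else
    echo "ssh-keygen not available; skipping key generation"
fi

# On a real cluster the public key would then be copied to each slave node,
# for example (user and slave1 are placeholders):
#   ssh-copy-id -i "$KEY_FILE.pub" user@slave1
# after which "ssh user@slave1" should log in without a password prompt.
```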