Looking out for Hadoop Cluster Interview Questions that are frequently asked by employers? Here is the second list of Hadoop Cluster Interview Questions, which covers setting up a Hadoop cluster.
Always keep in mind that theoretical knowledge alone is not enough to crack an interview. Employers also expect candidates to have practical knowledge and hands-on experience with Hadoop. So, these Hadoop Cluster Interview Questions will help you gain practical knowledge of the Hadoop framework.
Source: https://www.edureka.co/blog/interview-questions/hadoop-interview-questions-hadoop-cluster/
In stand-alone mode, there are no daemons; everything runs in a single JVM.
It has no DFS and it uses the local file system.
Stand-alone mode is suitable only for running MapReduce programs during development and testing.
It is one of the least used environments.
Pseudo-distributed mode is used both for development and for testing. In pseudo-distributed mode, all the daemons run on the same machine.
This is an important question, as fully distributed mode is used in the production environment, where ‘n’ number of machines form a Hadoop cluster. Hadoop daemons run on a cluster of machines: there is one host on which the NameNode runs and other hosts on which DataNodes run. A NodeManager is installed on every DataNode host and is responsible for executing tasks on that node. All these NodeManagers are managed by the ResourceManager, which receives the processing requests and then passes parts of the requests to the corresponding NodeManagers.
This is a technical question which tests your basic concepts. The /etc/hosts file maps hostnames to the IP addresses of the hosts. In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts, so that we can easily use hostnames instead of IP addresses.
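For example, the /etc/hosts file of a small three-node cluster might contain entries like the following (the IP addresses and hostnames here are illustrative placeholders, not values from this article):
192.168.1.100   master
192.168.1.101   slave1
192.168.1.102   slave2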
You are expected to remember basic server port numbers if you are working with Hadoop. The port numbers for the corresponding daemons are as follows:
NameNode – 50070
ResourceManager – 8088
MapReduce JobHistory Server – 19888
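If you want to verify these on a live cluster, one quick check on a Linux host is to list the listening ports and grep for the ones above (root privileges may be needed to see the owning process names; this is just an illustrative command, not part of Hadoop itself):
sudo netstat -tlnp | grep -E '50070|8088|19888'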
♣ Tip: Generally, approach this question by naming the 4 main configuration files in Hadoop and giving a brief description of each to show your expertise.
core-site.xml: core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as the I/O settings that are common to HDFS & MapReduce.
hdfs-site.xml: hdfs-site.xml contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
mapred-site.xml: mapred-site.xml contains configuration settings of the MapReduce framework, like the number of JVMs that can run in parallel, the memory available to the mappers and reducers, the CPU cores available for a process, etc.
yarn-site.xml: yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as container memory limits and other resource-management parameters.
These files are located in the etc/hadoop/ directory inside the Hadoop installation directory (conf/ in older Hadoop 1.x releases).
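As a rough illustration, a minimal core-site.xml pointing the daemons at a NameNode running on a host called master (the hostname and port are placeholders, not values from this article) could look like:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>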
♣ Tip: To check your knowledge of Hadoop, the interviewer may ask you this question.
CLASSPATH includes all the directories containing jar files required to start/stop Hadoop daemons. The CLASSPATH is set inside /etc/hadoop/hadoop-env.sh file.
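For example, to add extra jars to the daemons' classpath, you would typically append them to the HADOOP_CLASSPATH variable inside hadoop-env.sh (the path below is only an example):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/extra-jars/*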
♣ Tip: This is a theoretical question, but if you add a practical taste to it, you might get a preference.
The map output is stored in an in-memory buffer; when this buffer is almost full, the spill phase starts in order to move the data to a temporary folder on disk.
Map output is first written to this buffer, whose size is decided by the mapreduce.task.io.sort.mb property; by default, it is 100 MB.
When the buffer reaches a certain threshold, it starts spilling the buffered data to disk. This threshold is specified by the mapreduce.map.sort.spill.percent property (0.80, i.e. 80%, by default).
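For example, to raise the sort buffer to 256 MB while keeping the default 80% spill threshold, you could add the following to mapred-site.xml (the values are purely illustrative):
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>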
This is an easy question: the tar -xvf /file_location/filename.tar.gz command will extract the tar.gz compressed file (older versions of tar may need the -z flag, i.e. tar -xzvf, to handle the gzip compression).
By using the following commands, we can check whether Java and Hadoop are installed and whether their paths are set inside the .bashrc file:
For checking Java – java -version
For checking Hadoop – hadoop version
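These commands will only work if the paths have been exported; a typical .bashrc entry (the install locations below are just examples, not fixed paths) looks like:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin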
The full form of fsck is File System Check. HDFS supports the fsck (filesystem check) command to check for various inconsistencies. It is designed for reporting the problems with the files in HDFS, for example, missing blocks of a file or under-replicated blocks.
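For example, to check the whole filesystem and report files, blocks and their locations, you can run:
hdfs fsck / -files -blocks -locations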
The three main hdfs-site.xml properties are:
dfs.name.dir gives you the location where the NameNode stores its metadata (FsImage and edit logs), whether on local disk or on a remote directory.
dfs.data.dir gives you the location where the DataNodes store the actual data blocks.
fs.checkpoint.dir is the directory on the filesystem where the Secondary NameNode stores the temporary copies of the FsImage and edit logs that it merges to create a checkpoint, which serves as a backup.
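Put together, a minimal hdfs-site.xml using these properties might look like the sketch below (the local paths are placeholders; newer Hadoop 2.x releases prefer the names dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir):
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/datanode</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/home/hadoop/hdfs/secondary</value>
  </property>
</configuration>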
If you get a ‘connection refused java exception’ when you type hadoop fsck, it could mean that the NameNode is not running on your VM.
We can view compressed files in HDFS using the hadoop fs -text /filename command.
♣ Tip: Approach this question by first explaining safe mode and then moving on to the commands.
Safe Mode in Hadoop is a maintenance state of the NameNode during which NameNode doesn’t allow any changes to the file system. During Safe Mode, HDFS cluster is read-only and doesn’t replicate or delete blocks.
To know the status of safe mode, you can use the command: hdfs dfsadmin -safemode get
To enter safe mode: hdfs dfsadmin -safemode enter
To exit safe mode: hdfs dfsadmin -safemode leave
The jps command is used to check all the Hadoop daemons, like the NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.
This question has two answers; answering both will give you a plus point. We can restart the NameNode by the following methods:
You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using ./sbin/hadoop-daemon.sh start namenode.
Use the ./sbin/stop-all.sh command and then the ./sbin/start-all.sh command, which will stop all the daemons first and then start all of them.
To check whether the NameNode is working or not, use the jps command; this will show all the running Hadoop daemons, and there you can check whether the NameNode daemon is running or not.
If you want to look at the NameNode in the browser, the port number for the NameNode web UI is 50070. You can check it in a web browser using http://master:50070/dfshealth.jsp.
You should answer this question as follows: the Hadoop core uses Secure Shell (SSH) to communicate with the slaves and to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master, all the slaves, and the secondary machines, so that it does not have to ask for authentication every time, since the master and slaves require constant communication.
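A typical way to set up password-less SSH from the master to a slave (the user and host names below are placeholders) is:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # generate a key pair with an empty passphrase on the master
ssh-copy-id hadoop@slave1                  # copy the public key into the slave's authorized_keys
ssh hadoop@slave1                          # verify that login now works without a password prompt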