14 Deep Learning Interview Questions and Answers

Subhaditya Mukherjee
Machine Learning engineer who loves to code and write

What is a ResNet and where would you use it? Is it efficient?

Among the various neural networks used for computer vision, ResNet (Residual Neural Network) is one of the most popular. It allows us to train extremely deep neural networks, which is the prime reason for its widespread use. Before ResNet, training extremely deep neural networks was very difficult.

To understand why, we must look at the vanishing gradient problem, an issue that arises when the gradient is backpropagated through all the layers. Because a large number of multiplications are performed, the size of the gradient keeps decreasing until it becomes extremely small; the early layers then barely update and the network performs badly. ResNet helps to counter the vanishing gradient problem.
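
A rough back-of-the-envelope sketch of this shrinking effect, assuming sigmoid activations (whose derivative never exceeds 0.25) and ignoring the weight matrices entirely. It is an illustration of the repeated multiplication, not a model of any real network:

```python
# Upper bound on the sigmoid derivative: s'(z) = s(z) * (1 - s(z)) <= 0.25
MAX_SIGMOID_GRAD = 0.25

def gradient_scale(depth):
    """Largest factor by which `depth` stacked sigmoid layers can scale a
    gradient, ignoring the weights (a simplifying assumption)."""
    return MAX_SIGMOID_GRAD ** depth

scales = {depth: gradient_scale(depth) for depth in (5, 10, 30)}
for depth, scale in scales.items():
    print(f"{depth:>2} layers: gradient scaled by at most {scale:.3e}")
```

Even under this best case, the bound collapses quickly as layers are added, which is the behaviour skip connections are meant to counter.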

The efficiency of this network depends heavily on the concept of skip connections. A skip connection is a shortcut path through which the gradient can flow, which in effect helps counter the vanishing gradient problem.

[Figure: an example of a skip connection]

In general, a skip connection lets the signal bypass a few layers. Skip connections are also called identity shortcut connections, because a block can compute an identity function simply by relying on the shortcut, without the input having to pass through the skipped layers.

The skipping of these layers makes ResNet an extremely efficient network.
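
A minimal NumPy sketch of a residual block with an identity shortcut. The two-layer fully connected form and the names `residual_block`, `w1` and `w2` are assumptions made for illustration; real ResNets use convolutional layers with batch normalisation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two small dense layers."""
    hidden = relu(x @ w1)
    residual = hidden @ w2        # F(x), the part the block has to learn
    return relu(residual + x)     # the skip connection adds the input back

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 8)))   # non-negative inputs, e.g. outputs of an earlier ReLU

# With all weights at zero, F(x) is zero and the block reduces to the identity.
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
out = residual_block(x, w1, w2)
print(np.allclose(out, x))
```

Because the shortcut already carries the input forward, the layers inside the block only need to learn a correction on top of it.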

Dropout is an essential requirement in some neural networks. Why is it necessary?

Overfitting is probably one of the biggest problems when it comes to neural networks. It typically occurs when a complicated model is trained on a very small dataset: the model memorises the training data and then performs poorly on data it has not seen.

To counter overfitting, one of the most useful methods is dropout. Dropout randomly removes some units during training, so each training step in effect uses a different, thinned architecture; the result resembles training many architectures in parallel.

Without dropout, some units may adapt to fix errors caused by other units. Because any unit may be missing when dropout takes place, units can no longer rely on each other in this way. In general, dropout can be applied to any of the layers apart from the output layer. It can be used with most types of networks, including convolutional neural networks and Long Short-Term Memory (LSTM) networks.

Note that units in both hidden and visible layers can be dropped. After a dropout step, a reduced network is produced, with both the incoming and outgoing edges removed for every dropped node.

A common choice for the probability of a hidden node being dropped is 0.5. In effect, as training is not performed on all nodes at once, overfitting is reduced. This also leads to the model learning more generic features, which generalise better to new data.

Dropout generally gives better performance on large networks. Dropout generally performs better with a large learning rate but with a decay factor.

What is a Sobel filter? How would you implement it in Python?

The Sobel filter performs a two-dimensional spatial gradient measurement on an image and so emphasises regions of high spatial frequency. In effect, this means finding edges.

In most cases, Sobel filters are used to find the approximate absolute gradient magnitude at every point in a grayscale image. The operator consists of a pair of 3×3 convolution kernels, one of which is simply the other rotated by 90 degrees.

These kernels respond to edges that run horizontally or vertically with respect to the pixel grid, one kernel for each orientation. The kernels can be applied separately, or their outputs can be combined to find the absolute magnitude of the gradient at every point.

The Sobel operator has a larger convolution kernel than simpler edge detectors such as the Roberts cross, so it smooths the image to a greater extent and is less sensitive to noise. It also produces higher output values for similar edges compared to other methods.

The output values of the operator can overflow the maximum pixel value allowed by the image type. To avoid this, use an image type that supports a larger range of pixel values.

Implementation in Python

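To implement it in Python, we can use the OpenCV module (installable from pip as opencv-python):

import cv2
import numpy as np

# Read the image as grayscale (flag 0); replace the placeholder path with your own file
img = cv2.imread('your image.jpg', 0)

# Laplacian (second derivative) of the image, useful for comparison with the Sobel output
laplacian = cv2.Laplacian(img, cv2.CV_64F)

# Sobel gradients along x and y; CV_64F keeps negative values and avoids overflow.
# ksize=5 uses an extended 5×5 kernel; ksize=3 gives the 3×3 kernels described above.
sobelx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=5)
sobely = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=5)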
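
To make the pair of 3×3 kernels concrete, here is a hand-written sketch in plain NumPy, independent of OpenCV. The helper `filter2d_valid` and the tiny synthetic test image are assumptions made for illustration; the sign convention of the kernels varies between libraries and does not affect the gradient magnitude:

```python
import numpy as np

# Kernel that responds to vertical edges (changes along x)
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
# The second kernel is the first one rotated by 90 degrees
KY = np.rot90(KX)

def filter2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = kernel.shape
    rows = image.shape[0] - kh + 1
    cols = image.shape[1] - kw + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic 6x6 image with a vertical step edge: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

gx = filter2d_valid(image, KX)
gy = filter2d_valid(image, KY)
magnitude = np.sqrt(gx ** 2 + gy ** 2)
print(magnitude[0])
```

Only the windows that straddle the step produce a non-zero response, and the kernel for the other orientation stays silent because the image does not change from row to row.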

How do you add dropout to a Neural Network?

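Dropout can be added very easily to a neural network. The code is as follows, where np is NumPy and first_layer holds the hidden-layer activations with shape (len(X), hidden_dim):

if dropout_flag:
  # Zero each unit with probability dropout_percentage, then rescale the survivors
  mask = np.random.binomial(1, 1 - dropout_percentage, size=(len(X), hidden_dim))
  first_layer *= mask / (1 - dropout_percentage)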
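
For a version that runs on its own, here is a small sketch of the same idea (often called inverted dropout). The function name `dropout`, its `training` flag and the all-ones activations are assumptions made for illustration:

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero units during training and rescale the rest so that the
    expected activation is unchanged. At inference time, do nothing."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.binomial(1, keep_prob, size=activations.shape)
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones((1000, 50))                 # stand-in hidden-layer activations

dropped = dropout(hidden, drop_prob=0.5, rng=rng)
unchanged = dropout(hidden, drop_prob=0.5, rng=rng, training=False)
print(dropped.mean())
```

Rescaling by the keep probability during training means no extra scaling is needed when the network is used for prediction.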

What is the purpose of a Boltzmann Machine?

Boltzmann machines are algorithms based on physics, specifically on thermal equilibrium. A special and better-known case is the restricted Boltzmann machine (RBM), in which there are no connections between units within the same layer; connections exist only between the visible layer and the hidden layer.

The concept was introduced by Geoffrey Hinton, together with Terry Sejnowski; Hinton later received the Turing Award. In general, the algorithm borrows from the laws of thermodynamics and tries to optimise a global distribution of energy in the system.

In graph-theoretic terms, a restricted Boltzmann machine is a symmetric bipartite graph, i.e. two layers joined by symmetric connections. These machines are a form of unsupervised learning, which means that no labels are provided with the data. They use stochastic binary units to settle into an equilibrium state.

Boltzmann machines are a kind of Markov random field, so the restricted Boltzmann machine can be regarded as an undirected graphical model. It is used in dimensionality reduction, collaborative filtering, learning features as well as modelling. It can also be used for classification and regression. In general, restricted Boltzmann machines are composed of a two-layer network which can then be extended further.

Note that these models are probabilistic in nature. Each node in the system learns low-level features from items in the dataset. For example, if we take a grayscale image, each node in the visible layer will take just one pixel value from the image.

Part of the process of creating such a machine is building a feature hierarchy, in which sequences of activations are grouped into features. Borrowing from thermodynamics, the machine follows a simulated annealing process to separate signal from noise.
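
A minimal sketch of a restricted Boltzmann machine in NumPy, assuming the usual energy function E(v, h) = -a·v - b·h - vᵀWh and binary units. The sizes, the random weights and the function names are illustrative assumptions, and no training step is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_energy(v, h, W, a, b):
    """Energy of a joint configuration of visible units v and hidden units h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def sample_hidden(v, W, b, rng):
    """Sample stochastic binary hidden units given the visible units."""
    prob = sigmoid(b + v @ W)
    return (rng.random(prob.shape) < prob).astype(float), prob

def sample_visible(h, W, a, rng):
    """Sample stochastic binary visible units given the hidden units."""
    prob = sigmoid(a + W @ h)
    return (rng.random(prob.shape) < prob).astype(float), prob

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3                    # e.g. six pixels, three hidden features
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # only visible-to-hidden weights exist
a = np.zeros(n_visible)                       # visible biases
b = np.zeros(n_hidden)                        # hidden biases

v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # one binary "image"
h0, h_prob = sample_hidden(v0, W, b, rng)
v1, v_prob = sample_visible(h0, W, a, rng)
print(rbm_energy(v0, h0, W, a, b))
```

The single weight matrix between the two layers reflects the restriction: there are no weights connecting units within the same layer.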

What is the advantage of Boltzmann Machines?

The advantage of Boltzmann machines is that many of them can be stacked together to make a system which is generally called a deep belief network.

Deep belief networks are interesting as they can discover many complex features and patterns in the training data. Their main disadvantage is that they are relatively slow compared with other models. Nodes in adjacent layers are connected to each other, but no two nodes in the same layer are connected. Each layer computes on the input it receives from the layer before it.

Why do we have gates in neural networks?

To understand gates we must first understand recurrent neural networks.

Recurrent neural networks allow information to be stored as memory by means of loops. Thus, the output of a recurrent neural network depends not only on the current input but also on past inputs, which are stored in the memory of the network. Training uses backpropagation through time, and for longer sequences the truncated version is generally used.

Gates are generally used in networks that are dependent on time. In effect, any network which would require memory, so to speak, would benefit from the use of gates. These gates are generally used to keep track of any information that is required by the network without leading to a state of either vanishing or exploding gradients. Such a network can also preserve the error through time. Since a sense of constant error is maintained, the network can learn better.

These gated units can be considered as units with recurrent connections. They also contain additional neurons that act as gates. If you relate this process to a signal processing system, the gate regulates which part of the signal passes through. A sigmoid activation function is used, which means that gate values lie between 0 and 1.

An advantage of using gates is that it enables the network to either forget information that it has already learnt or to selectively ignore information either based on the state of the network or the input the gate receives.

Gates are extensively used in recurrent neural networks, especially in Long Short-Term Memory (LSTM) networks. A standard LSTM cell has three gates: an input gate, a forget gate and an output gate.
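
A minimal sketch of a single LSTM step in NumPy, assuming the standard formulation with input, forget and output gates. The function name `lstm_step` and the stacked weight layout `W` are illustrative choices, not a particular library's API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step. W has shape (input_dim + hidden_dim, 4 * hidden_dim)
    and stacks the weights for the three gates and the candidate values."""
    n = h_prev.shape[-1]
    z = np.concatenate([x, h_prev], axis=-1) @ W + b
    i = sigmoid(z[..., 0 * n:1 * n])   # input gate: how much new information to write
    f = sigmoid(z[..., 1 * n:2 * n])   # forget gate: how much old memory to keep
    o = sigmoid(z[..., 2 * n:3 * n])   # output gate: how much memory to expose
    g = np.tanh(z[..., 3 * n:4 * n])   # candidate values for the memory cell
    c = f * c_prev + i * g             # updated cell state (the "memory")
    h = o * np.tanh(c)                 # updated hidden state (the output)
    return h, c

input_dim, hidden_dim = 4, 3
x = np.ones(input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = np.ones(hidden_dim)

# With zero weights every gate outputs sigmoid(0) = 0.5,
# so half of the old cell state is kept and nothing new is written.
W = np.zeros((input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = lstm_step(x, h_prev, c_prev, W, b)
print(c, h)
```

The additive update of the cell state is what lets the error be preserved across many time steps.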

Transfer learning is one of the most useful concepts today. Where can it be used?

Pre-trained models are probably one of the most common use cases for transfer learning.

For anyone who does not have access to huge computational power, training complex models is always a challenge. Transfer learning aims to help by both improving the performance and speeding up your network.

In layman's terms, transfer learning is a technique in which a model that has already been trained to do one task is used for another without much change. It is related to, but not the same as, multi-task learning, in which a single model is trained on several tasks at once.

Many pre-trained models are available online. Any of these models can be used as a starting point for the new model. After loading the pre-trained weights, the model must be refined and adapted to the required data by tuning its parameters.
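
A toy sketch of this workflow in NumPy: a frozen feature extractor whose weights stand in for a pre-trained model, plus a small new output layer trained on the new data. Everything here is an assumption made for illustration; in particular `pretrained_w` is random rather than actually downloaded, and the dataset is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for weights loaded from a pre-trained model (hypothetical: random here)
pretrained_w = rng.normal(size=(10, 16))
frozen_copy = pretrained_w.copy()

def extract_features(x):
    """Frozen feature extractor: these weights are never updated."""
    return np.tanh(x @ pretrained_w)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y, eps=1e-12):
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Small labelled dataset for the new task (synthetic)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)

features = extract_features(X)
head_w = np.zeros(16)                 # new output layer, trained from scratch
head_b = 0.0
initial_loss = log_loss(sigmoid(features @ head_w + head_b), y)

learning_rate = 0.1
for _ in range(500):
    p = sigmoid(features @ head_w + head_b)
    head_w -= learning_rate * features.T @ (p - y) / len(y)
    head_b -= learning_rate * np.mean(p - y)

final_loss = log_loss(sigmoid(features @ head_w + head_b), y)
print(initial_loss, final_loss)
```

Only the new layer is trained, which is why this approach needs far less data and compute than training the whole network.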

The general idea behind transfer learning is to transfer knowledge, not data. For humans, this task is easy: we can generalise models which we created mentally a long time ago for a different purpose, and one or two samples are almost always enough. Neural networks, however, require huge amounts of data and computational power.

Transfer learning should generally be used when we do not have a lot of labelled training data, or when a network already exists for the task we are trying to achieve, probably trained on a much larger dataset. Note, however, that the input to the new model must have the same size as the input the pre-trained model was trained on. Also, this works only if the tasks are fairly similar to each other and the learned features generalise. For example, a model that has learned to recognise vehicles can probably be extended to recognise aeroplanes and helicopters.

What are some real-life examples where Transfer Learning can be used?

An example where transfer learning can be used is photograph classification. Since training a network on a huge number of photograph categories is rarely feasible on a normal machine, pre-trained weights can be used directly. If you are using your own dataset, you might need to fine-tune the parameters before the network works accurately.

Transfer learning is very widely used with image data and language data. Since words are mapped to very high dimensional vector spaces, it becomes easy to find words with similar meaning in different languages or even in the same language.

Why are deep learning models referred to as black boxes?

Lately, the concept of deep learning being a black box has been floating around. A black box is a system whose functioning cannot be properly grasped but the output produced can be understood and utilised.

Now, since most models are mathematically sound and are created based on legit equations, how is it possible that we do not know how the system works?

First, it is almost impossible to visualize the functions that are generated by a system. Most machine learning models end up with such complex output that it is not possible for a human to make sense of it.

Second, there are networks with millions of parameters. As humans, we can grasp around 10 to 15 parameters, but analysing a million of them seems out of the question.

Third and most important, it becomes very hard, if not impossible, to trace back why the system made the decisions it did. This may not sound like a huge problem to worry about but consider the case of a self driving car. If the car hits someone on the road, we need to understand why that happened and prevent it. But this isn’t possible if we do not understand how the system works.

To keep deep learning models from being black boxes, a new field called Explainable Artificial Intelligence, or simply Explainable AI, is emerging. This field aims to produce intermediate results and to trace back the decision-making process of a system.

Why is the process of weight initialization an important step in deep learning?

Building even a small neural network is a challenging task, and we do not want results that are less than satisfactory. The first step towards an efficient neural network is weight initialisation. With improper initialisation, the neural network might be prevented from learning at all.

The core objective is to prevent the activation outputs of the layers from exploding or vanishing as data passes through the network. This happens because of repeated matrix multiplication, one of the core mathematical operations behind neural networks: the products can become too large for the system to handle, or shrink towards zero.
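
A small experiment that makes this visible, assuming a stack of purely linear layers so that only the effect of the repeated matrix products remains. The layer count, width and the three weight scales are arbitrary choices for illustration:

```python
import numpy as np

def final_activation_scale(weight_std, n_units=256, n_layers=50, seed=0):
    """Push random inputs through `n_layers` linear layers whose weights are
    drawn with the given standard deviation; return the mean absolute output."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(32, n_units))
    for _ in range(n_layers):
        w = rng.normal(scale=weight_std, size=(n_units, n_units))
        x = x @ w
    return np.abs(x).mean()

too_large = final_activation_scale(1.0)                  # activations explode
too_small = final_activation_scale(0.01)                 # activations vanish
scaled = final_activation_scale(1.0 / np.sqrt(256))      # scale tied to layer width

print(f"std=1.0:       {too_large:.3e}")
print(f"std=0.01:      {too_small:.3e}")
print(f"std=1/sqrt(n): {scaled:.3e}")
```

Tying the scale of the weights to the width of the layer keeps the activations in a usable range, which is the idea behind the more careful initialisation schemes.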

With proper weight initialisation, a network converges quickly and with less error. Optimisation is thus achieved in the least time possible.

What are the types of weight initialization?

There are two major types of weight initialisation: zero initialisation and random initialisation.

Zero initialisation: In this process, biases and weights are initialised to 0. If the weights are set to 0, the derivatives of the loss function with respect to all the weights in a weight matrix become equal. Hence, every unit in a layer receives the same update in subsequent iterations, and the units never learn different features. Setting the bias to 0 cancels out any effect it may have.

All hidden units become symmetric due to zero initialisation. In general, zero initialisation is not very useful or accurate for classification and thus must be avoided when any classification task is required.

Random initialisation: As compared to zero initialisation, this involves setting random values for the weights. The main disadvantage is that very large weights push the sigmoid activation function into saturation, with outputs close to 0 or 1, where the gradient is tiny and learning is slow. Likewise, if very small values are set, the signal shrinks from layer to layer and the learning time also increases.

Setting values that are too high or too low thus generally leads to the exploding or vanishing gradient problem.

Newer types of weight initialisation like “He initialisation” and “Xavier initialisation” have also emerged. They scale the random values according to the number of inputs and outputs of each layer; their derivations are not covered here.
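
Although the derivations are beyond the scope of this answer, the resulting formulas are short. The sketch below assumes the commonly used variants, Xavier (Glorot) uniform with limit sqrt(6 / (fan_in + fan_out)) and He normal with standard deviation sqrt(2 / fan_in); the function names are illustrative:

```python
import numpy as np

def zero_init(fan_in, fan_out):
    """Zero initialisation: every weight starts at 0."""
    return np.zeros((fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform: the range depends on inputs and outputs."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: the spread depends on the number of inputs."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_zero = zero_init(512, 256)
w_xavier = xavier_uniform(512, 256, rng)
w_he = he_normal(512, 256, rng)

print(w_xavier.min(), w_xavier.max())
print(w_he.std())
```

Xavier initialisation is usually paired with sigmoid or tanh activations, and He initialisation with ReLU activations.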

What does tuning of hyperparameters signify? Explain with examples.

A hyperparameter is a variable that is set before training and defines the structure of the network or how it is trained. Let’s go through some hyperparameters and see the effect of tuning them.

  1. Number of hidden layers – The number of hidden layers affects the output, accuracy and training time of the neural network. Having a large number of these layers may sometimes increase accuracy, at the cost of longer training.
  2. Learning rate – This is simply a measure of how fast the neural network will change its parameters. A large learning rate might speed up learning, but may also prevent the network from converging. A smaller learning rate will probably slow down training, but makes it more likely that the network converges.
  3. Number of epochs – This is the number of times the entire training data is run through the network. Increasing the number of epochs generally improves accuracy up to a point, after which the network may start to overfit.
  4. Momentum – Momentum controls how strongly past parameter updates influence the direction and size of the current update. A well-chosen momentum value can lead to a better network.
  5. Batch size – Batch size is the number of training samples that are fed to the network before every parameter update.
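
To show where these values enter a training loop, here is a minimal sketch of mini-batch gradient descent with momentum on a tiny linear regression problem in NumPy. The dataset is synthetic and the hyperparameter values are arbitrary starting points, not recommendations:

```python
import numpy as np

hyperparameters = {
    "learning_rate": 0.05,   # step size of each update
    "momentum": 0.9,         # weight given to the previous update
    "batch_size": 32,        # samples per parameter update
    "epochs": 20,            # full passes over the training data
}

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(256, 3))
y = X @ true_w                       # synthetic, noise-free targets

def mse(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
velocity = np.zeros(3)
initial_loss = mse(w)

for epoch in range(hyperparameters["epochs"]):
    order = rng.permutation(len(X))  # reshuffle the data every epoch
    for start in range(0, len(X), hyperparameters["batch_size"]):
        batch = order[start:start + hyperparameters["batch_size"]]
        error = X[batch] @ w - y[batch]
        grad = 2.0 * X[batch].T @ error / len(batch)
        velocity = (hyperparameters["momentum"] * velocity
                    - hyperparameters["learning_rate"] * grad)
        w = w + velocity

final_loss = mse(w)
print(initial_loss, final_loss, w)
```

Changing any one of the four values and rerunning the loop is a quick way to see its effect on the final loss.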

[Figure: example hyperparameters for an SVM model in TensorFlow]

What is a Tensor in deep learning?

Before data can be processed by a neural network, it must be given a regular structure. This data structure is called a tensor.

In general, tensors are just multidimensional arrays. They are useful because they let us represent data with a very large number of dimensions in an easy way. This is an important aspect, since we cannot visualise data beyond a few dimensions.

In deep learning, most data can be represented in the form of n-dimensional arrays, and hence we use tensors. There is also a type of processing unit built for them, called the tensor processing unit (TPU).

An example of tensors in TensorFlow is as follows:

import tensorflow as tf

# dtype must be passed by keyword: the second positional argument of tf.Variable is `trainable`
strings = tf.Variable(["Hello"], dtype=tf.string)
decimals = tf.Variable([3.14159, 2.71828], dtype=tf.float32)
integers = tf.Variable([2, 3, 5, 7, 11], dtype=tf.int32)
complexNumbers = tf.Variable([12.3 - 4.85j, 7.5 - 6.23j], dtype=tf.complex64)
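
Since tensors are multidimensional arrays, the ideas of rank (number of dimensions) and shape can also be sketched with plain NumPy arrays as a stand-in. The image sizes below are arbitrary examples:

```python
import numpy as np

scalar = np.array(3.14)                 # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])      # rank 1: a list of numbers
matrix = np.zeros((28, 28))             # rank 2: e.g. one grayscale image
batch = np.zeros((32, 28, 28, 3))       # rank 4: e.g. 32 colour images

for name, tensor in [("scalar", scalar), ("vector", vector),
                     ("matrix", matrix), ("batch", batch)]:
    print(f"{name}: rank {tensor.ndim}, shape {tensor.shape}")
```

The same rank and shape vocabulary applies to TensorFlow tensors.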

What are capsules? How are they useful in deep learning?

In deep learning, and especially while working with image data, it is essential to preserve the location of specific structures in the image. Most neural networks fail to do so, especially those in the category of generative networks.

For example, take the case of generating faces using a generative adversarial network. At the end of the generation we may have a face with all its features jumbled: the nose above the eyes, the mouth below the chin, and so on. Quite obviously, this is not a face. But since the machine has no concept of facial structure, the output looks correct to the system. To prevent such a mistake, we use capsule networks.

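Capsule networks help to retain the structure and location of features, especially while working with images. Each capsule outputs a vector that describes a feature present in the object: the length of the vector indicates how likely the feature is to be present, and its direction encodes properties such as position and orientation.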

A capsule might also specify many more attributes and parameters. It is obvious how useful these networks can be in preserving the structural integrity of the generated object.
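
One way to see how a capsule's output vector can carry this information is the "squash" non-linearity used in the original capsule network design, which shrinks a vector's length into the range 0 to 1 while keeping its direction. The sketch below assumes that formulation; the two-dimensional input is a made-up example:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Scale vector s so its length lies between 0 and 1; direction is kept."""
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(squared_norm + eps)
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)

raw = np.array([3.0, 4.0])          # raw capsule output with length 5
capsule_output = squash(raw)

presence = np.linalg.norm(capsule_output)   # read as "how likely is the feature present"
direction = capsule_output / presence       # read as the feature's properties
print(presence, direction)
```

Long raw vectors end up with a length close to 1 and short ones close to 0, so the length can be read as a probability.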