Deep learning job interviews. A necessary evil. Most beginners in the industry break out in a cold sweat at the mere thought of a machine learning or deep learning job interview. How do I prepare for my upcoming deep learning job interview? What kind of deep learning interview questions are they going to ask me? What questions should I ask them? These are just a few of the thoughts that run through the mind of any interviewee. The problem with most machine learning or deep learning interviews is that you never know whether to bring your lucky whiteboard marker or your lucky keyboard. Not to mention that the deep learning questions you will be asked in your next job interview are hardly predictable.
The good news? We’ve collated 100 deep learning technical interview questions from the insights of our industry experts on what kind of questions they ask most often. So, keep calm and read on to see what kind of questions you can expect in the hot seat in your next deep learning job interview. Ready to dive in? Then let’s get started!
The first step in choosing a neural network model is to understand the data well and then decide on the best model for it. Whether the problem is linearly separable also matters when deciding on a neural network model. So, the task at hand and the data play a vital role in choosing the best neural network model for a given problem. However, it is always better to start with a simple model like a multi-layer perceptron (MLP) with just one hidden layer, unlike a CNN, LSTM, or RNN, which require configuring more nodes and layers. The MLP is considered the simplest neural network because it is less sensitive to weight initialization and there is little network structure to define beforehand.
The curse of dimensionality (the problems that arise when working with high-dimensional data) is a common problem when working on machine learning or deep learning projects. It causes many difficulties while training a model because a lot of parameters have to be trained on a scarce dataset, leading to issues like overfitting, long training times, and poor generalization. PCA and autoencoders are used to tackle these issues. PCA is an unsupervised technique in which the data is projected onto the directions of highest variance, while autoencoders are neural networks that compress the data into a low-dimensional latent space and then try to reconstruct the original high-dimensional data.
PCA and autoencoders are effective only when the features have some relationship with each other. A general rule of thumb for choosing between PCA and autoencoders is the size of the data. Autoencoders work great for larger datasets and PCA works well for smaller datasets. Autoencoders are usually preferred when there is a need to model non-linearities and relatively complex relationships. Autoencoders can encode a lot of information in fewer dimensions when the low-dimensional structure is curved or non-linear, making them a better choice than PCA in such scenarios.
Autoencoders are usually preferred for identifying data anomalies rather than for reducing data. Anomalous data points can be identified using the reconstruction error. PCA is not good at reconstructing data, particularly when there are non-linear relationships.
Given a business problem, there is no hard and fast rule to determine the exact number of neurons and hidden layers required to build a neural network architecture. As a rule of thumb, the optimal size of a hidden layer usually lies between the size of the output layer and the size of the input layer. However, here are some common approaches that make a good starting point for building a neural network architecture:
One common problem with using ANNs for image classification is that they react differently to an input image and its shifted versions. Consider a simple example where one image has a dog in the top left and another image has a dog at the bottom right. An ANN trained on the first image will tend to assume that a dog always appears in that section of an image, which is not the case. ANNs require concrete data points: if you are building a deep learning model to distinguish between cats and dogs, the length of the ears, the width of the nose, and other features must be provided as data points, whereas a CNN extracts spatial features from the input images itself. When there are thousands of features to extract, a CNN is a better choice because it gathers features on its own, unlike an ANN where each individual feature needs to be measured.
Training a neural network model becomes computationally heavy (requiring additional storage and processing capability) as the number of layers and parameters increases. Tuning the increased number of parameters can be a tedious task with ANN, unlike CNN where the time for tuning parameters is reduced making it an ideal choice for image classification problems.
A common problem with Tanh or Sigmoid functions is that they saturate. Once they saturate, their gradients are close to zero, so the learning algorithm can no longer adapt the weights and improve the performance of the model. Thus, Sigmoid or Tanh activation functions can prevent the neural network from learning effectively, leading to the vanishing gradient problem. The vanishing gradient problem can be addressed by using the Rectified Linear Unit (ReLU) activation function instead of sigmoid and by using Xavier initialization.
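To make the saturation argument concrete, here is a minimal sketch in plain Python. It assumes a simplified chain of identical layers whose weights are all 1, so the backpropagated gradient is just the product of the local activation derivatives; the helper names are illustrative and not from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(local_grad, pre_activation, depth):
    # Simplifying assumption: every layer sees the same pre-activation and has
    # weight 1, so the chain rule reduces to a power of the local derivative.
    return local_grad(pre_activation) ** depth

max_sigmoid_grad = sigmoid_grad(0.0)                              # best case for sigmoid
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth=10)     # 0.25 ** 10
saturated_chain = chained_gradient(sigmoid_grad, 5.0, depth=10)   # saturated units
relu_chain = chained_gradient(relu_grad, 5.0, depth=10)           # stays at 1.0

print(max_sigmoid_grad, sigmoid_chain, saturated_chain, relu_chain)
```

Even in the best case the sigmoid chain shrinks by a factor of four per layer, while the ReLU chain passes the gradient through unchanged for active units.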
The exploding gradient problem occurs when the model weights grow exponentially and become unexpectedly large during training. In a neural network with n hidden layers, n derivatives are multiplied together during backpropagation. If the terms being multiplied are greater than 1, the gradient grows exponentially as it propagates through the model and eventually explodes. Exploding gradients are a serious problem because the resulting unstable updates prevent the model from learning from its training data, leading to a poor loss and hurting the overall accuracy of the model. One can deal with the exploding gradient problem through gradient clipping, weight regularization, or the use of LSTMs.
Constant validation accuracy is a common problem when training any neural network, often because the network simply memorizes the training samples, which is an overfitting problem. Overfitting means that the neural network model performs very well on the training sample but its performance drops on the validation set. Here are some tips to try to fix constant validation accuracy in a CNN:
The learning rate is one of the most important configurable hyperparameters used in the training of a neural network. It is typically a small positive value, often between 0 and 1. Choosing the learning rate is one of the most challenging aspects of training a neural network because it is the parameter that controls how quickly or slowly a neural network model adapts to a given problem and learns. A higher learning rate means that the model needs fewer training epochs and makes rapid changes, but it can overshoot and settle on a suboptimal solution or never converge, while a smaller learning rate means that the model takes a long time to converge and may get stuck. Thus, it is advisable not to use a learning rate that is too low or too high; instead, a good learning rate value should be discovered through trial and error.
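The effect of the learning rate is easy to see on a toy problem. The sketch below is an illustration only: it runs plain gradient descent on the assumed function f(x) = x**2 (not a neural network), and the three learning rates are arbitrary values picked to show the three regimes.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimise f(x) = x**2, whose gradient is 2x, and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # ends very close to the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the small rate the iterate is still far from the minimum after 50 steps, with the large rate it diverges, and only the middle value converges.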
A neural network typically has one or more hidden layers along with input and output layers. Neural networks that use a single hidden layer are known as shallow neural networks, while those that use multiple hidden layers are referred to as deep neural networks. Both shallow and deep networks are capable of fitting any function, but shallow networks require a lot of parameters, unlike deep networks, which can fit functions with a limited number of parameters because of their several layers. Deep networks are preferred today over shallow networks because at every layer the model learns a novel and more abstract representation of the input. They are also much more efficient in terms of the number of parameters and computations compared to shallow networks.
Yes, there is a possibility that the neural network model will learn even if all the biases are initialized to 0.
No, it is not possible to train a model by initializing all the weights to 0 because the neural network will never learn to perform a given task. Initializing all weights to zero causes the derivatives to be the same for every w in W[1], so the neurons learn the same features in every iteration. Not just 0, but any kind of constant initialization of weights is likely to produce a poor result.
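The symmetry problem can be checked numerically. The following is a minimal sketch under stated assumptions: a tiny network with three sigmoid hidden units, a linear output, a squared-error loss, and no biases. The function name and the sample input are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_weight_grads(w_init, x, target, n_hidden=3):
    """Gradients of 0.5 * (y - target)**2 w.r.t. the hidden-layer weights
    when every weight in the network is set to the same constant w_init."""
    W1 = [[w_init for _ in x] for _ in range(n_hidden)]   # hidden-layer weights
    W2 = [w_init] * n_hidden                              # output-layer weights
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(W2, h))               # linear output unit
    d_y = y - target                                      # dLoss/dy
    return [[d_y * W2[j] * h[j] * (1.0 - h[j]) * xi for xi in x]
            for j in range(n_hidden)]

grads_const = hidden_weight_grads(0.5, x=[1.0, -2.0], target=1.0)
grads_zero = hidden_weight_grads(0.0, x=[1.0, -2.0], target=1.0)

print(grads_const)  # every hidden unit receives exactly the same gradient
print(grads_zero)   # with zero weights the hidden-layer gradients vanish
```

Because every hidden unit receives the same update, the units stay identical after each step, which is why random initialization is used to break the symmetry.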
Without non-linearities, a neural network will act like a perceptron regardless of how many layers it has, making the output linearly dependent on the input. In other words, a neural network with n layers and m hidden units with linear activation functions is equivalent to a linear neural network without hidden layers that can only find linear separation boundaries. A neural network without non-linearities cannot find appropriate solutions or classify the data correctly for complex problems.
The problem with deep neural networks is that they are very likely to overfit training data with few examples. Overfitting can be reduced by ensembles of networks with different model configurations, but this requires the additional effort of maintaining multiple models and is also computationally expensive. Dropout is one of the easiest and most successful methods to reduce dependencies in deep neural networks and overcome overfitting problems. With dropout regularization, a single neural network model is used to simulate many different network architectures by randomly dropping out nodes during training. It is considered an effective method of regularization because it reduces generalization error and is also computationally cheap.
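Here is a minimal sketch of how dropout can be applied to a layer's activations. It assumes the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at inference time; the function name and arguments are illustrative, not a library API.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the survivors by 1 / (1 - drop_prob). Requires drop_prob < 1."""
    if not training or drop_prob == 0.0:
        return list(activations)          # inference: pass activations through
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                    # fixed seed so the run is repeatable
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)  # stays close to the original mean
unchanged = dropout(hidden, drop_prob=0.5, training=False)

print(mean_after)
```

Each training step samples a different mask, so the network effectively trains a different thinned sub-network every time.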
You will need to know about One-Shot Learning for Face Recognition, which is a classification task where one or more examples (faces in this case) are used for classifying new examples (faces) in the future. One also needs to know how to index data so that a new face can be retrieved faster. A new face can be recognized by finding the vectors that are closest (most similar) to the input face, but the system would become extremely slow if we were to calculate the distance to all 12 million vectors. A convenient way is to index the data in real vector space by dividing it into structures that are easy to query (almost like a tree data structure). This makes it possible to find the nearest vectors very quickly whenever new data is available. Techniques like Annoy indexing, Locality Sensitive Hashing, and Approximate Nearest Neighbours can be used for this purpose.
Flexibility makes deep learning powerful. Neural networks are universal function approximators, so even if the problem at hand is complex (where the formula relating input and output is not known), a neural network can approximate it. Also, transfer learning (where the trained weights of an existing neural network are used to initialize the weights of another network that performs a similar task) makes the application of deep learning much easier in situations where training a neural network from scratch is costly, or almost impossible because of data scarcity.
Faster and more powerful computational resources are also a prime reason for the adoption of neural network architectures. One cannot deny that, with GPU acceleration, a neural network can be trained in minutes when it would otherwise take days to learn.
Yes, it is definitely possible to build deep networks using a linear function as the activation function for each layer if the problem is represented by a linear equation. However, a composition of linear functions is itself a linear function, so nothing extraordinary can be achieved with a deep network here: adding more layers or nodes to the network will not increase the predictive power of the machine learning model.
A decrease in the accuracy of a deep learning model after a few epochs implies that the model is memorizing the peculiarities of the training dataset rather than learning features that generalize. This is referred to as overfitting of the deep learning model. You can use either dropout regularization or early stopping to fix this issue. Early stopping, as the phrase implies, stops training the deep learning model the moment you notice a drop in the accuracy of the model. Dropout regularization is a technique in which a few nodes are randomly dropped during training so that the remaining nodes cannot rely too heavily on any single node.
With images as inputs, an improperly set learning rate can cause noisy features. The learning rate strongly affects the prediction quality of a model, and an ill-chosen value can result in an unconverged neural network.
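A quick numerical check of this point, as a sketch: two linear layers (the weights below are arbitrary example values, and biases are omitted to keep it short) produce exactly the same output as a single layer whose weight matrix is the product of the two.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: 2 inputs -> 3 units
W2 = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]      # layer 2: 3 units -> 2 outputs
x = [4.0, -2.0]

two_layer_output = matvec(W2, matvec(W1, x))  # linear activation in each layer
W_combined = matmul(W2, W1)                   # one equivalent 2x2 weight matrix
one_layer_output = matvec(W_combined, x)

print(two_layer_output, one_layer_output)     # both are [10.0, 2.0]
```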
19) What do you understand by the terms Batch, Iterations, and Epoch in training a neural network model?
20) Is it possible to calculate the learning rate for a model a priori?
For simple models, it could be possible to set the best learning rate value a priori. However, for complex models, it is not possible to calculate the best learning rate through theoretical deductions that can actually make accurate predictions. Observations and experiences do play a vital role in defining the optimal learning rate.
21) What is the theoretical foundation of neural networks?
To answer this question, one needs to explain the universal approximation theorem, which forms the basis for why neural networks work.
Introducing non-linearity via an activation function allows us to approximate any function. It’s quite simple, really. — Elon Musk
According to the Universal Approximation Theorem, a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function to a reasonable accuracy for inputs in a specific range. However, the guarantee holds only within that range, and a function with large gaps cannot be approximated well. This means that if a neural network is trained with inputs between 20 and 30, we cannot be assured that it will work well for inputs between 60 and 70.
22) What are the commonly used approaches to set the learning rate?
23) Is there any difference between neural networks and deep learning?
Ideally, there is no significant difference between deep learning networks and neural networks. Deep learning networks are neural networks, but with more complex architectures than those of the 1990s. It is the availability of hardware and computational resources that has made it feasible to implement them now.
24) You want to train a deep learning model on a 10GB dataset but your machine has 4GB RAM. How will you go about implementing a solution to this deep learning problem?
One possible way to answer this question is to say that the neural network can be trained by accessing the data through a memory-mapped NumPy array (for example, numpy.memmap) and defining a small batch size. A memory-mapped array does not load the complete dataset into memory but creates a mapping to the dataset on disk, so only the batches being read are brought into RAM. NumPy also offers tools for storing large datasets in compressed form, and its arrays can be used with other NN packages like PyTorch, TensorFlow, or Keras.
25) How will the predictions of a neural network be affected if you use a ReLU activation function and then use the Sigmoid function in the final layer of the network?
The neural network will predict only one class for all types of inputs because the output of a ReLU activation function is always non-negative, and the sigmoid of a non-negative value is always at least 0.5.
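This can be verified in a few lines. The sketch assumes the ReLU output is fed directly into the final sigmoid and that a 0.5 threshold is used to pick the class; the sample pre-activation values are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

pre_activations = [-8.0, -1.5, 0.0, 0.3, 4.0]

# ReLU clamps negatives to 0, and sigmoid(0) is 0.5, so no output can fall
# below the 0.5 decision threshold.
probabilities = [sigmoid(relu(z)) for z in pre_activations]
predicted_classes = [1 if p >= 0.5 else 0 for p in probabilities]

print(probabilities)
print(predicted_classes)  # every input is assigned to class 1
```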
26) What are the limitations of using a perceptron?
A major drawback of using a perceptron is that it can only learn linearly separable functions and cannot handle non-linear inputs.
27) How will you differentiate between a multi-class and multi-label classification problem?
In a multi-class classification problem, the classification task has more than two mutually exclusive classes, whereas in a multi-label problem each label represents a different classification task, although the tasks are somehow related. For example, classifying a set of images of animals that may be cats, dogs, or bears is a multi-class classification problem that assumes each sample has only one label, meaning an image can be classified as either a cat or a dog but not both at the same time. Now imagine an image that shows both a cat and a dog. That image needs to be classified as both cat and dog because it shows both animals. In a multi-label classification problem, a set of labels is assigned to each sample and the classes are not mutually exclusive. So, a pattern can belong to one or more classes in a multi-label classification problem.
28) What do you understand by transfer learning?
You know how to ride a bicycle, so it will be easy for you to learn to ride a motorbike. This is transfer learning. You have some skill and you can learn a new skill that relates to it without having to learn it from scratch. Transfer learning is a process in which the learning is transferred from one model to another without making the new model learn everything from scratch. The features and weights can be reused for training the new model. Transfer learning makes it easier to train a model when there is limited data.
29) What is fine-tuning and how is it different from transfer learning?
In transfer learning, the feature extraction part remains untouched and only the prediction layer is retrained by changing the weights based on the application. On the contrary in fine-tuning, the prediction layer along with the feature extraction stage can be retrained making the process flexible.
30) Why do we use convolutions for images instead of using fully connected layers?
Each convolution kernel in a CNN acts like its own feature detector and has partially built-in translation invariance. Using convolutions lets one preserve, encode, and make use of the spatial information in the image, unlike fully connected layers, which do not have any relative spatial information.
31) What do you understand by Gradient Clipping?
Gradient Clipping is used to deal with the exploding gradient problem that occurs during the backpropagation. The gradient values are forced element-wise to a particular minimum or maximum value if the gradient has crossed the expected range. Gradient clipping provides numerical stability while training a neural network but does not provide any performance improvements.
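A minimal sketch of gradient clipping on a plain list of gradient values follows. The first helper implements the element-wise clipping described above; the second, clipping by global norm, is a commonly used alternative that is not mentioned in the answer and is included here only for comparison. Both helper names are illustrative.

```python
import math

def clip_by_value(grads, min_value, max_value):
    """Force each gradient element into the range [min_value, max_value]."""
    return [max(min_value, min(max_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [0.5, -30.0, 40.0]                        # two exploding components
clipped_value = clip_by_value(grads, -1.0, 1.0)   # [0.5, -1.0, 1.0]
clipped_norm = clip_by_norm(grads, max_norm=5.0)  # same direction, norm of 5
clipped_norm_length = math.sqrt(sum(g * g for g in clipped_norm))

print(clipped_value, clipped_norm)
```

Element-wise clipping can change the direction of the gradient, whereas norm clipping only shortens it.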
32) What do you understand by end-to-end learning?
It is a deep learning process where a model gets raw data as the input and all the various parts are trained simultaneously to produce the desired outcome with no intermediate tasks. The advantage of end-to-end learning is that there is no need for explicit feature engineering, which usually leads to a lower bias. A good example that you can quote in the context of end-to-end learning is driverless cars. They use human-provided input as guidance and are trained to automatically learn and process the information using a CNN to complete tasks.
33) Are convolutional neural networks translation-invariant?
Convolutional neural networks are translation-invariant only to a certain extent, and pooling makes them more so. Making a CNN completely translation-invariant might not be possible. Feeding the right kind of data can get closer to this, although it might not be a feasible solution.
34) What is the advantage of using small kernels like 3x3 over using a few large ones?
Smaller kernels let you stack more layers, so you can use a greater number of activation functions and let the CNN learn a more discriminative mapping function. Also, a stack of small kernels covers the same spatial context as a single large kernel while using fewer computations and parameters, making small kernels a better choice than large ones.
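The parameter savings can be worked out with simple arithmetic. This is a sketch under stated assumptions: stride-1 convolutions, the same number of input and output channels in every layer (64 is an arbitrary example value), and biases ignored.

```python
def conv_params(kernel_size, in_channels, out_channels, num_layers=1):
    """Weight count (biases ignored) for a stack of square conv layers,
    assuming in_channels == out_channels when num_layers > 1."""
    return num_layers * kernel_size * kernel_size * in_channels * out_channels

def receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return num_layers * (kernel_size - 1) + 1

channels = 64
stacked_3x3 = conv_params(3, channels, channels, num_layers=2)  # 73,728 weights
single_5x5 = conv_params(5, channels, channels)                 # 102,400 weights

print(receptive_field(3, 2), receptive_field(5, 1))  # both see a 5x5 region
print(stacked_3x3, single_5x5)
```

Two stacked 3x3 layers see the same 5x5 region as one 5x5 layer with roughly 28% fewer weights, and they apply two non-linearities instead of one.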
35) How can you generate a dataset on multiple cores in real-time that can be fed to the deep learning model?
One of the major challenges in CV today is the need to load large datasets of videos and images when there is not enough memory on the machine. In such situations, data generators act as a magic wand for loading a memory-hungry dataset. You can talk about the various data generators that the Keras model class supports. When working with big data, in most cases it is not necessary to load all the data into RAM, as doing so wastes memory, could lead to memory overflow, and takes longer to process. Generator functions are highly beneficial here because they generate the data batch by batch and feed it directly into the model for training.
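The idea behind a data generator can be sketched with a plain Python generator. This is a single-process illustration only and does not show the multi-core part of the question; `load_sample` is a hypothetical stand-in for code that would read one image or video frame from disk.

```python
def batch_generator(num_samples, batch_size, load_sample):
    """Yield one batch at a time so that only batch_size samples
    need to be held in memory at once."""
    for start in range(0, num_samples, batch_size):
        stop = min(start + batch_size, num_samples)
        yield [load_sample(i) for i in range(start, stop)]

def load_sample(index):
    # Hypothetical loader: returns a fake (features, label) pair.
    return (index, index % 2)

# list(...) is used here only to inspect the batches; a training loop would
# iterate over the generator and discard each batch after using it.
batches = list(batch_generator(num_samples=10, batch_size=4,
                               load_sample=load_sample))
batch_sizes = [len(b) for b in batches]

print(batch_sizes)  # [4, 4, 2]
```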
36) How do you bring balance to the force when handling imbalanced datasets in deep learning?
It is next to impossible to have a perfectly balanced real-world dataset when working on deep learning problems so there will be some level of class imbalance within the data that can be tackled either by –
37) What are the benefits of using batch normalization when training a neural network?
38) Which is better LSTM or GRU?
LSTM works well for problems where accuracy is critical and the sequences are long, whereas if you want lower memory consumption and faster operations, opt for GRU. For a detailed answer, refer to: /recipes/what-is-difference-between-gru-and-lstm-explain-with-example
39) The RMSProp and Adam optimizers adjust gradients. Does this mean that they perform gradient clipping?
This does not inherently mean that they perform gradient clipping because gradient clipping involves setting up predetermined values beyond which the gradients cannot go, unlike Adam and RMSProp that make multiplicative adjustments to gradients.
40) Can you name a few hyperparameters used for training a neural network?
When training any neural network there are two types of hyperparameters: those that define the structure of the neural network and those that determine how the neural network is trained. Listed are a few hyperparameters that are set before training any neural network:
41) When is multi-task learning usually preferred?
Multi-task learning with deep neural networks is a subfield in which several tasks are learned by a shared model. This reduces overfitting, enhances data efficiency, and speeds up the learning process through the use of auxiliary information. Multi-task learning is useful when there is only a small amount of data for any given task and we can benefit from training a deep learning model on a larger dataset from a related task.
42) Explain the Adam Optimizer in one minute.
Adam (Adaptive Moment Estimation) is an optimization algorithm designed to deal with sparse gradients on noisy problems. The Adam optimizer improves convergence through momentum, which helps ensure that a model does not get stuck at a saddle point, and also provides per-parameter updates for faster convergence.
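For readers who want to see the update rule behind that summary, here is a minimal sketch of Adam on a single scalar parameter. The toy objective (x - 3)**2 and the learning rate of 0.1 are assumptions chosen for illustration; the beta and epsilon values are the commonly quoted defaults.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam on one scalar parameter and return its final value."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return x

# Toy objective f(x) = (x - 3)**2 with gradient 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)

print(x_min)  # ends close to the minimum at x = 3
```

In a real network the same update is applied independently to every weight, which is what gives each parameter its own effective step size.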
43) Which loss function is preferred for multi-category classification?
Cross-Entropy loss function
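As an illustration of how this loss behaves, the sketch below computes softmax followed by cross-entropy for a three-class problem. The logits are made-up example values and the helper names are illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                            # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

uniform_loss = cross_entropy(softmax([0.0, 0.0, 0.0]), true_class=1)
confident_right = cross_entropy(softmax([0.0, 5.0, 0.0]), true_class=1)
confident_wrong = cross_entropy(softmax([5.0, 0.0, 0.0]), true_class=1)

print(uniform_loss, confident_right, confident_wrong)
```

The loss is small when the model is confidently right, equals log(3) for a uniform guess over three classes, and grows large when the model is confidently wrong.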
44) To what kind of problems can the cross-entropy loss function be applied?
45) List the steps to implement a gradient descent algorithm.
46) How important is it to shuffle the training data when using batch gradient descent?
Shuffling the training dataset will not make much of a difference because the gradient is calculated at every epoch using the complete training dataset.
47) What is the benefit of using max-pooling in classification convolutional neural networks?
The feature maps become smaller after max-pooling in a CNN, which reduces computation and also gives more translation invariance. Also, we don’t lose much semantic information because we’re taking the maximum activation.
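A small sketch of 2x2 max-pooling with stride 2 on a feature map stored as a list of lists follows. The 4x4 input values are arbitrary and the helper is illustrative rather than a library function.

```python
def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 over a 2D list with even dimensions."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2x2(feature_map)

print(pooled)  # [[6, 2], [2, 9]]
```

The 4x4 map shrinks to 2x2, and the strongest activation in each window survives even if it moves by a pixel within that window.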
48) Can you name a few data structures that are commonly used in deep learning?
You can talk about computational graphs, tensors, matrices, data frames, and lists.
49) Can you add an L2 regularization to a recurrent neural network to overcome the vanishing gradient problem?
This can actually worsen the vanishing gradient problem because the L2 regularization will shrink weights towards zero.
50) How will you implement Batch Normalization in RNN?
Batch normalization is not straightforward to use in an RNN because statistics are computed per batch, so batch normalization does not account for the recurrent part of the neural network. An alternative is layer normalization in the RNN, or reparameterizing the LSTM layer in a way that allows the use of batch normalization.
1. Given that there are so many deep learning algorithms, how will you determine which deep learning algorithm has to be used for a dataset?
Artificial Neural Network: An Artificial Neural Network, sometimes called a Classic Neural Network, is a connection of multilayered perceptrons. This algorithm can be used when the data is properly structured in tabular form. Both classification and regression problems can be solved using ANNs.
Convolutional Neural Networks: These networks are the best proven ones for building any prediction model that takes image data as input. In general terms, CNNs work best on data with spatial relationships, and hence they can also produce state-of-the-art results for NLP problems such as topic modelling, document classification, and so on.
Recurrent Neural Networks: RNNs come into the picture when we have sequential data where the order of the data is also important. RNNs can provide solutions for problems involving time series data. More often, rather than vanilla RNNs, gated networks like LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are proven to give much better results.
Autoencoders: Autoencoders are widely used in the deep learning community these days because of their ability to operate automatically based on their inputs even before taking an activation function and final output decoding. They can be used for problems such as feature detection, recommendation systems, and other compelling problems.
2. How do one-hot encoding and label encoding affect the dimensionality of a dataset?
Label encoding does not really affect the dimensionality of the dataset because in label encoding we only assign a numeric label to each category in the column.
For example,
Place of birth | Place of birth (encoded)
Delhi | 0
Hyderabad | 1
Chennai | 2
Delhi | 0
In the above example, we are mapping Delhi -> 0, Hyderabad -> 1, and Chennai -> 2.
In one hot encoding, we create a column for each category in the dataset. Thus, the more categories there are in the column, the more columns are generated after one hot encoding. Let us consider the very same dataset that we saw above. After one hot encoding it will look like the table shown below.
Place of birth (Delhi) | Place of birth (Hyderabad) | Place of birth (Chennai)
1 | 0 | 0
0 | 1 | 0
0 | 0 | 1
1 | 0 | 0
If the value is ‘Delhi’, then only the column meant for ‘Delhi’ takes the value 1 and the other columns take the value 0.
Often, we drop the last (or first) category after one hot encoding the variable, because if all the remaining category columns are 0 for an entry, it is clear that the entry belongs to the category that we dropped. This is explained more clearly with the example below.
Place of birth (Delhi) | Place of birth (Hyderabad)
1 | 0
0 | 1
0 | 0
1 | 0
Here, we already know that there are 3 unique categories in the variable (Delhi, Hyderabad, and Chennai). There are two zeros in the 3rd row, which clearly implies that it belongs to neither of the two listed categories, and the one that remains is Chennai. Therefore, the decoded value for that row is Chennai.
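The three tables above can be reproduced with a short sketch in plain Python. The helper names and the `drop_last` flag are illustrative and not taken from any library; the flag drops the last category's column, as in the final table.

```python
def label_encode(values):
    """Map each distinct category to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(codes, num_categories, drop_last=False):
    """One column per category; optionally drop the last category's column."""
    width = num_categories - 1 if drop_last else num_categories
    return [[1 if code == j else 0 for j in range(width)] for code in codes]

places = ["Delhi", "Hyderabad", "Chennai", "Delhi"]
codes, mapping = label_encode(places)        # still one column
one_hot = one_hot_encode(codes, len(mapping))                  # three columns
one_hot_dropped = one_hot_encode(codes, len(mapping), drop_last=True)  # two

print(codes)            # [0, 1, 2, 0]
print(one_hot)          # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(one_hot_dropped)  # [[1, 0], [0, 1], [0, 0], [1, 0]]
```

Label encoding keeps a single column, while one hot encoding adds one column per category (or one fewer when a category is dropped).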
3. Why are GPUs important for implementing deep learning models?
Whenever we are trying to build any neural network model, the model training phase is the most resource-consuming job. Each iteration of model training comprises thousands (or even more) of matrix multiplication operations. If there are fewer than around 100,000 (1 lakh) parameters in a neural network model, then it would not take more than a few minutes (or a few hours at most) to train. But when we have millions of parameters, that is when our ordinary computers would probably give up. This is where GPUs come into the picture. GPUs (Graphics Processing Units) are processors with far more ALUs (Arithmetic Logic Units) than normal CPUs, and they are specifically meant for this kind of heavy mathematical computation.
4. Which is the best algorithm for face detection?
There are several machine learning algorithms available for face detection, but the best ones are those that involve CNNs and deep learning. Some notable algorithms are FaceNet, Probabilistic Face Embeddings, ArcFace, CosFace, and SphereFace.
5. What evaluation approaches do you use to gauge the effectiveness of deep learning models?
6. When training a neural network, you observe that the loss does not decrease in the first few epochs. What are the possible reasons for this?
7. What are the commonly used techniques to deal with the overfitting of a deep learning model?
8. What kind of gradient descent variant is the best for handling data that is too big to handle in RAM simultaneously?
9. How will you explain the success and recent rise in demand for deep learning in the industry?
10. How do you select the depth of a neural network?
1) What is Deep Learning?
2) Which deep learning framework do you prefer to work with – PyTorch or TensorFlow, and why? Refer to PyTorch vs TensorFlow for the answer.
3) Talk about a deep learning project you’ve worked on and the tools you used?
4) Have you used the ReLU activation function in your neural network? Can you explain how the ReLU activation function works?
Yes, I have used ReLU in my neural networks. ReLU stands for Rectified Linear Unit. Basically, the function returns the input value as it is if it is positive and returns zero if it is negative. When plotted, the function is flat at zero for negative inputs and a straight line with slope 1 for positive inputs.
The main purpose of formulating this function was to overcome the vanishing gradient problem caused by earlier activation functions like Sigmoid and Tanh, which prevented us from building deeper neural network models. Nowadays, this function has become the default activation function for many types of neural network models because models that use it are easy to train and suffer far less from the vanishing gradient problem.
5) How often do you use pre-trained models for your neural network?
6) What does the future of video analysis look like with the use of deep learning solutions? How effective/good is video analysis currently?
7) Tell us about your passion for deep learning. Do you like to participate in deep learning/machine learning hackathons, write blogs around novel deep learning tools, or attend local meetups, etc.?
8) Describe the last time you felt frustrated solving a deep learning challenge, and how did you overcome it?
9) What is more important to you: the performance of your deep learning model or its accuracy?
10) Given the dataset, how will you decide which deep learning model to use and how to implement it?
11) What is the last deep learning research paper you’ve read?
12) What are the most commonly used neural network paradigms? (Hint: Talk about Encoder-Decoder Structures, LSTM, GAN, and CNN)
13) Is it possible to use a neural network as a tool of dimensionality reduction?
14) How do deep learning models tackle the curse of dimensionality?
15) What are the pros and cons of using neural networks?
Pros:
1. Neural networks are highly flexible and can be used for both classification and regression problems, and sometimes for problems much more complex than that.
2. Neural networks are highly scalable. We can add as many layers with as many neurons as we want.
3. Neural networks are proven to produce the best results when we have a lot of data points.
4. They work best for non-linear data such as image data, text data, and so on.
5. They can be used on any data that can be converted to numbers.
Cons:
1. The well-known disadvantage of neural networks is their "black box" nature. That is, we don't know how or why our neural network came up with a certain output. For example, when we feed an image of a dog into a neural network and it predicts it to be a duck, we may find it difficult to understand what caused it to arrive at this prediction.
2. Developing a neural network model can take a long time.
3. Neural networks are more computationally expensive than traditional algorithms.
4. The amount of computational power needed for a neural network depends mostly on the size of the data and on the depth and complexity of the network.
5. Training a neural network model typically requires much more data than training a traditional machine learning model.
16) How is a Capsule Neural Network different from a Convolutional Neural Network?
17) What is a GAN and what are the different types of GAN you’ve worked with?
18) For any given problem, how do you decide if you have to use transfer learning or fine-tuning?
Transfer learning is a method in which a model developed for one task is reused as the starting point for a second task. Fine-tuning is one approach to achieving transfer learning. In the simplest form of transfer learning, we take a model that was trained on one dataset, keep its pre-trained weights frozen, and train only a new output layer on a second dataset, which may have a different distribution of classes. In fine-tuning, we additionally unfreeze some or all of the pre-trained layers and continue training them on the second dataset. Usually, we change the learning rate to a smaller one so that the updates do not have a significant impact on the already adjusted weights. To decide which method to choose, one should experiment first with the frozen model, as it is easy and fast, and if that does not serve the purpose, then use fine-tuning.
19) Can you share some tricks or techniques that you use to fight overfitting in a deep learning model and get better generalization?
A model is said to overfit when it performs well on the training data (low bias) but performs poorly on the test data (high variance). In short, the model has fitted the particular patterns of its training data too closely and does not generalize to other data. Overfitting can be detected by comparing performance metrics such as loss and accuracy on the training data against the same metrics on held-out data. There are several techniques one can use to reduce the overfitting of a deep learning model:
• Increase the size of the training data.
• Reduce the number of hidden layers or the number of units in them; this will reduce the network's capacity.
20) Explain the difference between Gradient Descent and Stochastic Gradient Descent.
To begin with, gradient descent and stochastic gradient descent are both popular machine learning and deep learning optimization algorithms that update a set of parameters iteratively in order to minimize an error function. In gradient descent, the entire dataset is used to compute each parameter update, while in stochastic gradient descent each update is computed from a single training sample. For example, if a dataset has 10,000 data points, GD processes all 10,000 of them before making one update, so each iteration is slow, whereas stochastic GD makes an update after every single sample. Because its updates are far more frequent, stochastic gradient descent usually converges faster than gradient descent on large datasets, although each individual update is noisier.
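The difference in update frequency can be sketched with a toy example in plain Python. Everything here is an assumption made for illustration: the one-parameter model y = w·x, the noise-free data generated with w = 3, the squared-error loss, the learning rate, and the function names `batch_gd` and `sgd`.

```python
import random

# Toy data (an assumption for illustration): y = 3x exactly, no noise.
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]


def batch_gd(xs, ys, lr=0.1, epochs=50):
    """Gradient descent: one update per pass over the whole dataset."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        # Gradient of the mean squared error, averaged over every sample.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        updates += 1
    return w, updates


def sgd(xs, ys, lr=0.1, epochs=50, seed=0):
    """Stochastic gradient descent: one update per training sample."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient from this sample only
            w -= lr * grad
            updates += 1
    return w, updates


w_gd, n_gd = batch_gd(xs, ys)
w_sgd, n_sgd = sgd(xs, ys)
print(f"GD:  w={w_gd:.6f} after {n_gd} updates")
print(f"SGD: w={w_sgd:.6f} after {n_sgd} updates")
```

Over the same number of passes through the data, the stochastic version performs 20 times as many parameter updates here, one per sample instead of one per epoch. On real, noisy data those updates would fluctuate around the minimum rather than settle exactly on it.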
21) Which one do you think is more powerful – a two-layer NN without any activation function or a two-layer decision tree?
22) Can you name the breakthrough project that garnered the popularity and adoption of deep learning?
23) Differentiate between bias and variance with respect to deep learning models and how can you achieve a balance between the two?
When interpreting predictions, understanding the prediction errors is most important. There are mainly two broad types of errors: reducible and irreducible. Reducible errors come in two kinds, bias and variance. Gaining a proper understanding of these errors helps one build an accurate model by avoiding overfitting and underfitting.
In order to obtain the optimal balance between the two errors, the model should aim to maintain both a low bias and a low variance. A model with a good balance of bias and variance neither overfits nor underfits the data.
Bias – In the above diagram, the training error (blue dotted line) is high in the initial stage (high bias) and then decreases steadily (low bias). High bias means the model is underfitting the data, and hence the model must have a low bias to achieve good results. In order to achieve low bias:
Variance – In deep learning, variance shows up in practice as the difference between the validation error and the training error. In the above figure, we can see that the gap between the training error and the validation error is high, i.e., the variance is high. This is the case of overfitting. The model should have low variance, which can be achieved by:
i. Increasing the training data
ii. Using regularization
iii. Using different neural network architectures
24) What are your thoughts about using GPT-3 for our business?
GPT-3, or the third-generation Generative Pre-trained Transformer, is a large neural network language model. GPT-3 is a text predictor: given a text or phrase, it returns a human-like completion in natural language. GPT-3 has a wide range of applications serving the industry today. It is a powerful tool for building applications such as responding to customer queries or language translation (say, asking a question in English and expecting an answer in Spanish).
GPT-3 has also been demonstrated on tasks ranging from creating spreadsheets to generating complex CSS or even deploying Amazon Web Services (AWS) instances. So, can using GPT-3 help your business? Well, it can help in many ways. It all depends on what you need it to do, but it is a highly versatile deep learning model that has been applied to many applications.
Some more applications of GPT-3 that you can probably use in your business are:
25) Can you train a neural network without using back-propagation? If yes, what technique will you use to accomplish this?
26) Describe your research experience in the field of deep learning?
27) Explain the working of a perceptron.
• Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts.
• A perceptron is one of the simplest ANN (artificial neural network) units; it performs a simple computation to detect features in the input data.
• The perceptron is based on an artificial neuron called a threshold logic unit (TLU).
• The inputs and output are numbers rather than binary on/off values, and each input connection is associated with a weight.
• The TLU computes a weighted sum of its inputs, $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w}^\top \mathbf{x}$, then applies a step function to that sum and outputs the result: $h_{\mathbf{w}}(\mathbf{x}) = \operatorname{step}(z)$.
• A single TLU can be used for simple linear binary classification.
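The computation described in the points above can be sketched in a few lines of plain Python. This is an illustrative example, not a trained model: the weights are picked by hand so that the unit behaves like a logical AND, the step function is assumed to output 1 when z ≥ 0, and a bias term `b` is added to the weighted sum (equivalent to an extra input fixed at 1 with its own weight), which the formula above leaves out.

```python
def step(z: float) -> int:
    """Heaviside step function: 1 if z >= 0, else 0 (one common convention)."""
    return 1 if z >= 0 else 0


def tlu(x, w, b=0.0) -> int:
    """Threshold logic unit: weighted sum of the inputs, then a step."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)


# Hand-picked (not learned) parameters that make the TLU act as logical AND.
weights = [1.0, 1.0]
bias = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", tlu(x, weights, bias))
```

The line w1·x1 + w2·x2 + b = 0 is the decision boundary. Only the input (1, 1) falls on its positive side, which is why a single TLU is enough for this linearly separable problem.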
28) Differentiate between a feed-forward neural network and a recurrent neural network.
29) Why don’t we see the exploding or vanishing gradient problem in feed-forward neural networks?
30) How do you decide the size of the filter when performing a convolution operation in a CNN?
31) When designing a CNN, can we find out how many convolutional layers we should use?
32) What do you understand by a computational graph?
33) Differentiate between PCA and Autoencoders.
34) Which one is better for reconstruction: a linear autoencoder or PCA?
35) How is deep learning related to representation learning?
36) Explain the Borel Measurable function.
37) How are Gradient Boosting and Gradient Descent different from each other?
38) In a logistic regression model, will all the gradient descent algorithms lead to the same model if run for a long time?
39) What is the benefit of shuffling a training dataset when using batch gradient descent?
40) Explain the cross-entropy loss function.
41) Why is cross-entropy preferred as the cost function for multi-class classification problems?
42) What happens if you do not use any activation functions in a neural network?
43) What is the importance of having residual neural networks?
44) There is a neuron in the hidden layer that always results in a large error in backpropagation. What could be the reason for this?
45) Explain the working of forward propagation and backpropagation in deep learning.
46) Is there any difference between feature learning and feature extraction?
47) Do you know the difference between valid padding and same padding in a CNN?
48) How does deep learning outperform traditional machine learning models in time series analysis?
49) Can you explain the parameter sharing concept in deep learning?
50) How many trainable parameters are there in a Gated Recurrent Unit cell and in a Long Short-Term Memory cell?
51) What are the key components of an LSTM?
52) What are the components of a Generative Adversarial Network?
That wraps up this post on the most common deep learning engineer interview questions and answers. Whether you’re a beginner or a seasoned professional, hopefully these deep learning job interview questions and answers have been useful and have boosted your confidence for your next deep learning engineer job interview.
Congrats! You now know the kind of deep learning interview questions you can expect in your next job interview. However, there is still a lot to learn to solidify your deep learning knowledge and get hands-on experience working with diverse deep learning projects and deep learning frameworks like PyTorch, TensorFlow, and Keras. ProjectPro helps you move right into practice with 60+ end-to-end solved data science and machine learning projects where you will learn how to develop machine learning/deep learning models from scratch and develop a high-level ability to think about productionized machine learning systems. Get started today to take your deep learning skills to the next level and build a fantastic job-winning portfolio of projects.
We would love to hear your own machine learning or deep learning interview experiences. If you have any other interesting deep learning interview questions to share that can be helpful, please send an email with the questions and answers to khushbu.shah@dezyre.com to make the learning experience for the community enriching and valuable. All the questions and answers shared would be posted on the blog with due credit to the author.