
# Top 80 Data Science Interview Questions in 2021

Last updated on Nov. 24, 2020, 3:57 p.m. | Kirandeep Kaur

This blog contains a review of common interview questions and answers that are asked during a Data Science interview. You’ll get to know how to answer basic data science interview questions related to predictions, underfitting, and overfitting. Also, with 80 Data Science interview questions, you will walk through the typical questions related to statistics and probability and some more related to data structures and algorithms. The technical skills required to appear for the interview are:

• Python
• SQL
• Statistics and Probability
• Algorithms
• Supervised and Unsupervised Machine Learning

## Q1) Explain Supervised and unsupervised learning?

Supervised learning – When the target variable is known for the problem statement, it becomes Supervised learning. This is applied to perform regression and classification.

Example: Linear Regression and Logistic Regression.

Unsupervised learning – When the target variable is unknown for the problem statement, it becomes Unsupervised learning. This is widely used to perform Clustering.

Example: K-Means and Hierarchical clustering.

## Q2) What are the commonly used algorithms?

• Linear regression
• Logistic regression
• Random Forest
• KNN

## Q3) What is precision?

Precision is the ratio of true positives to all predicted positives, i.e. TP / (TP + FP). It is one of the most commonly used metrics for classification. The range is from 0 to 1, where 1 represents 100%.

## Q4) What is recall?

Recall is the ratio of true positives to all actual positives, i.e. TP / (TP + FN). It is also called the true positive rate. The range is from 0 to 1.

## Q5) Which metric is used to measure accuracy in a classification problem?

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
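As a minimal sketch, precision, recall, and the F1 score can be computed from confusion-matrix counts. The function name and the counts below are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Guard against division by zero when there are no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```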

## Q6) What is a normal distribution?

A normal distribution is a symmetric, bell-shaped distribution of data around the mean, in which the mean, median, and mode are equal.

## Q7) What is overfitting?

A large gap between the training error and the test error is a serious problem for the business. If the error rate on the training set is low and the error rate on the test set is high, the model is said to be overfitting.

## Q8) What is underfitting?

A model that predicts poorly on both the training data and the test data is also a serious problem for the business. If the error rate on the training set is high and the error rate on the test set is also high, the model is said to be underfitting.

## Q9) What is a univariate analysis?

An analysis that is applied to one attribute at a time is called univariate analysis. The boxplot is one of the most widely used univariate tools. Scatter plots and Cook's distance are methods used for bivariate and multivariate analysis.

## Q10) Name a few methods for Missing Value Treatments.

Central Imputation – This method relies on measures of central tendency. Missing values are filled with the mean or median for numerical data and with the mode for categorical data.

KNN – K Nearest Neighbour imputation

The distance between records is calculated across two or more attributes using the Euclidean distance, and the nearest neighbours found this way are used to treat the missing values. In central imputation (CI), by contrast, the mean and mode are used.
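Central imputation can be sketched in a few lines of plain Python. The helper name `central_impute` and the sample columns are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, mode


def central_impute(values, numeric=True):
    """Replace None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40, None]           # hypothetical numerical column
cities = ["Pune", "Delhi", None, "Pune"]  # hypothetical categorical column

print(central_impute(ages))
print(central_impute(cities, numeric=False))
```

The missing ages are filled with the mean of the observed ages (30), and the missing city with the most frequent observed city.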

## Q11) What is the Pearson correlation?

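The Pearson correlation measures the strength and direction of the linear relationship between two variables, for example between predicted and actual values. The range is from -1 to +1: -1 is a perfect negative correlation, whereas +1 is a perfect positive correlation.

The formula is r = m * Sd(x) / Sd(y), where m is the slope of the regression line of y on x and Sd is the standard deviation.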
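The sketch below computes the Pearson coefficient directly from its covariance definition and then checks it against the slope-based formula, reading m as the least-squares slope of y on x (an assumption, since the text does not define m). The data points are made up for illustration.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Sums of squares and cross-products around the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / sqrt(sxx * syy)     # Pearson correlation from its definition
m = sxy / sxx                 # least-squares slope of y on x
sd_x = sqrt(sxx / (n - 1))    # sample standard deviations
sd_y = sqrt(syy / (n - 1))

print(round(r, 4), round(m * sd_x / sd_y, 4))  # 0.7746 0.7746
```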

## Q12) How can data visualizations be used effectively?

To deliver insights effectively and efficiently, data visualization should not be limited to bar charts, line charts, or other stereotypical diagrams. Data can be represented in a much more visually pleasing way. The one thing to take care of is conveying the intended insight or finding correctly to the audience. Once that baseline is set, the innovative and creative part can help you come up with better-looking and more functional dashboards. There is a fine line between a simple, insightful dashboard and an awesome-looking dashboard that delivers no useful insight.

## Q13) How to understand the problems faced during data analysis?

Most of the problems faced during hands-on analysis or data science come from a poor understanding of the problem at hand and from focusing more on tools, end results, and other aspects of the project. Breaking the problem down to a granular level and understanding it takes a lot of time and practice to master. Going back to square one in data science projects is something seen in a lot of companies, and even in your own projects or Kaggle problems.

## Q14) What are the time series algorithms?

Time series algorithms like ARIMA, ARIMAX, SARIMA, and Holt-Winters are very interesting to learn and use to solve a lot of complex problems for businesses. Data preparation plays a vital role in time series analysis. Stationarity, seasonality, cycles, and noise need time and attention. Take as much time as you need to get the data right. After that, you can run any model on it.

## Q15) How can I achieve accuracy in the first model that I built?

Building machine learning models involves a lot of interesting steps. Models with 90% accuracy don't come on the very first attempt. You have to try many feature selection techniques to get to that point, which means a lot of trial and error. The process will help you learn new concepts in statistics, math, and probability.

## Q16) What is the basic responsibility of a Data Scientist?

As data scientists, we have the responsibility to make complex things simple enough that anyone without context should understand what we are trying to convey.

The moment we start over-explaining even the simple things, the mission of making the complex simple is lost. This happens a lot when we are doing data visualization.

Less is more. Rather than pushing too much information to the reader’s brain, we need to figure out how easily we can help them consume a dashboard or a chart.

The process is simple to say but difficult to implement. You must bring the complex business value out of a self-explanatory chart. It’s a skill every data scientist should strive towards and good to have in their arsenal.

## Q17) Explain RUN-Group processing.

To use RUN-group processing, you start the procedure and then submit multiple RUN-groups. A RUN-group is a group of statements that contains at least one action statement and ends with a RUN statement. It can contain other SAS statements such as AXIS, BY, GOPTIONS, LEGEND, or WHERE.

## Q18) State the definition of BY-Group processing.

BY-group processing is a method of processing observations from one or more SAS data sets that are grouped or ordered by the values of one or more common variables. All data sets that are being combined must include one or more BY variables.

## Q19) Explain Precision and Recall.

### Recall:

It is also known as the true positive rate: the number of positives your model has correctly identified compared with the actual number of positives present in the data.

### Precision

It is also known as the positive predictive value, and it is based more on the prediction: the number of accurate positives the model found compared with the number of positives it actually claims.

## Q20) Explain the F1 score and How it is used?

The F1 score is a measure of a model's performance. It is the harmonic mean of the model's Precision and Recall. An F1 score of 1 is the best and 0 is the worst.

## Q21) Explain the confounding variables.

These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. The estimate is distorted when the study fails to account for the confounding factor.

## Q22) How can you randomize the items of a list in place in Python?

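Use the shuffle function from the random module, as in the example below:

```python
from random import shuffle

x = ['Data', 'Class', 'Blue', 'Flag', 'Red', 'Slow']

# shuffle() reorders the list in place and returns None
shuffle(x)

print(x)
```

Output (the order varies from run to run):

['Red', 'Data', 'Blue', 'Slow', 'Class', 'Flag']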

## Q23) How to get indices of N maximum values in a NumPy array?

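The indices of the N maximum values in a NumPy array can be obtained with argsort(): sort the indices by value, keep the last N, and reverse them so that the largest value comes first. For a five-element array whose three largest values sit at positions 4, 3, and 1, with N = 3: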

Output:

[ 4 3 1 ]
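One way to produce the output above is sketched here. The input array and N = 3 are assumptions, since the text does not state its input; `argsort()` is the standard NumPy method for getting sorting indices.

```python
import numpy as np

arr = np.array([1, 3, 2, 4, 5])  # assumed example array, not from the original
n = 3

# argsort() returns indices in ascending order of value;
# keep the last n and reverse so the largest value comes first.
top_n_idx = arr.argsort()[-n:][::-1]

print(top_n_idx)  # [4 3 1]
```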

## Q24) How to create 3D plots or visualizations using NumPy/SciPy?

As with 2D plotting, 3D graphics are beyond the scope of NumPy and SciPy, but just as in the 2D case, packages exist that integrate with NumPy. Matplotlib provides basic 3D plotting in the mplot3d subpackage, whereas Mayavi provides a wide range of high-quality 3D visualization features, utilizing the powerful VTK engine.

## Q25) What are the types of biases that can occur during sampling?

Undercoverage occurs when some members of the population are inadequately represented in the sample, for example when a survey relies on a convenience sample drawn from telephone directories and car registration lists. The common types of sampling bias are:

• Selection bias
• Undercoverage bias
• Survivorship bias

## Q26) Which Python library is used for data visualization?

Plotly, also called Plot.ly after its main online platform, is one such library. It is an interactive online visualization tool used for data analytics, scientific graphs, and other visualizations. It offers some great APIs, including one for Python.

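## Q27) How can you access a specific script inside a module?

If the whole module needs to be imported, we can simply use from pandas import *. To access only a specific name inside the module, import it explicitly, for example from pandas import DataFrame.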

## Q28) What is a nonparametric test used for?

Non-parametric tests do not assume that the data follow a specific distribution. They can be used whenever the data do not meet the assumptions of parametric tests.

## Q29) What are the pros and cons of the Decision Trees algorithm?

Pros

• Easy to interpret.
• Will ignore irrelevant independent variables since information gain will be minimal.
• Can handle missing data.
• Fast modeling.

Cons

• Many combinations are possible to create a tree.
• There are chances that it might not find the best tree possible.

## Q30) Name some Classification Algorithms.

Following are some commonly used classification algorithms:

• Logistic Regression
• Naive Bayes Classifier
• Decision Trees
• Random Forest
• Neural Networks
• K Nearest Neighbor

## Q31) What are the pros and cons of the Naive Bayes algorithm?

Pros

• Big sized data is handled easily
• Multiclass performance is good and accurate
• It is not computationally intensive

Cons

• Assumes independence of the predictor variables

## Q32) What are the types of Skewness?

There are two types: right (positive) skew and left (negative) skew.

## Q33) What is skewed data?

Skewed data is data whose distribution is not symmetric and has a longer tail towards the right or the left.

## Q34) What is an outlier?

An outlier is a value that lies far away from the rest of the values in the data set.

## Q35) What are the applications of data science?

Following are the applications of data science:

• Optical character recognition
• Recommendation engines
• Filtering algorithms
• Personal assistants
• Surveillance
• Autonomous driving
• Facial recognition and more.

## Q36) Define EDA and what are the steps to perform EDA?

EDA [exploratory data analysis] is an approach to analyzing data to summarize their main characteristics, often with visual methods.

Steps included to perform EDA:

• Make a summary of observations
• Describe central tendencies or core part of the dataset
• Describe the shape of the data
• Identify potential associations
• Develop insight into errors, missing values, and major deviations

## Q37) What are the various types of data available in Enterprises?

• Structured data
• Unstructured data
• Big data from social media, surveys, pictures, audio, video, drawings, maps.
• Machine-generated data from instruments
• Real-time data feeds

## Q38) What are the various types of analysis performed on data?

• Univariate analysis is descriptive statistical analysis that looks at a single variable at a given point in time.
• Bivariate analysis tries to explain the relationship between two variables at a time, as in a scatterplot.
• Multivariate analysis deals with the study of more than two variables to understand the effect of the variables on the responses.
• Univariate – 1 variable
• Bivariate – 2 variables
• Multivariate – more than 2 variables

## Q39) What is the difference between primary data and secondary data?

Data that you collect yourself is primary data; it is collected afresh and for the first time. Data that someone else has collected and that you then use is secondary data.

## Q40) State the difference between qualitative & quantitative?

Quantitative methods analyze the data based on numbers. The qualitative method analyzes the data by attributes.

## Q41) What is a histogram?

A histogram is a graphical representation of the distribution of numerical data, based on how often values occur (their frequencies).

## Q42) What are the measures of central tendency?

• Mean
• Median
• Mode

## Q43) What are quartiles?

Quartiles are three points in the data that divide the data into four groups. Each group consists of a quarter of the data.
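A minimal sketch using the standard library's `statistics.quantiles` on a made-up data set. Note that several conventions exist for computing quartiles; this call uses the function's default ("exclusive") method, so other tools may report slightly different cut points.

```python
from statistics import quantiles

data = [2, 4, 6, 8, 10, 12, 14]  # hypothetical data set

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(data, n=4)

print(q1, q2, q3)  # 4.0 8.0 12.0
```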

## Q44) What are the commonly used error metrics in regression tasks?

MSE – Mean squared error – Average of the square of errors

RMSE – Root mean square error – the root of MSE

MAPE – Mean absolute percentage error
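The three regression metrics above can be sketched in plain Python. The actual and predicted values are hypothetical, and MAPE is expressed here as a percentage.

```python
from math import sqrt


def mse(actual, predicted):
    """Mean squared error: the average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: the square root of MSE."""
    return sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100, 200, 400]     # hypothetical observed values
predicted = [110, 180, 400]  # hypothetical model output

print(round(mse(actual, predicted), 2))   # 166.67
print(round(rmse(actual, predicted), 2))  # 12.91
print(round(mape(actual, predicted), 2))  # 6.67
```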

## Q45) What are the commonly used error metrics in classification tasks?

• F1 score
• Accuracy
• Sensitivity
• Specificity
• Recall
• Precision

## Q46) What is it called when there is more than 1 explanatory variable in the regression task?

Multiple linear regression

## Q47) What are residuals in a regression task?

The difference between the predicted value and the actual value is called the residual.

## Q48) What are the main classifications in Machine learning?

• Supervised learning
• Unsupervised learning
• Reinforcement learning

## Q49) What are the main types of supervised learning tasks?

• Classification task (categorical in nature)
• Regression task (continuous in nature)

## Q50) Give a simple representation for Linear Equation.

Y = mx + c

Where Y is the dependent variable; x is the independent variable; m is the slope; c is the intercept

## Q51) What is the R square value?

The R squared value tells us how closely the regression line fits the actual values. It is the proportion of the variance in the dependent variable that the model explains.
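To tie the last few answers together, here is a small sketch that fits a straight line by ordinary least squares (an assumed fitting method; the text does not name one), then computes the residuals and R squared. The data is made up, and the residual is taken as actual minus predicted, which is one common sign convention.

```python
xs = [1, 2, 3, 4, 5]  # hypothetical independent variable
ys = [2, 4, 5, 4, 5]  # hypothetical dependent variable

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares slope (m) and intercept (c) of the line y = m*x + c.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
c = mean_y - m * mean_x

predicted = [m * x + c for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# R squared = 1 - (residual sum of squares / total sum of squares).
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(round(m, 2), round(c, 2), round(r_squared, 2))  # 0.6 2.2 0.6
```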

## Q52) What are some common ways of imputation?

Mean imputation, median imputation, KNN imputation, stochastic regression, substitution

## Q53) What is the difference between series and list

The list is size and data mutable

Series is data mutable but not size mutable

## Q54) Which function gives the summary statistics of a DataFrame?

describe()

## Q55) What is the difference between a dictionary and a set?

Dictionary has key-value pair

Set does not have key-value pairs and set has only unique elements
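A short illustration of the difference, using made-up values:

```python
# Dictionary: each key maps to a value.
inventory = {"apple": 3, "pear": 5}
inventory["apple"] += 1

# Set: no values attached, and duplicates collapse into one element.
tags = {"red", "green", "red"}

print(inventory["apple"])  # 4
print(len(tags))           # 2
```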

## Q56) Which function can be used to filter a DataFrame?

The query function can be used to filter a data frame.

## Q57) What is the function to create a test train split?

from sklearn.model_selection import train_test_split. This function is used to create a train/test split from the data.

## Q58) What is pickling and unpickling?

Pickling is the process of serializing a Python object into a byte stream, so that a data structure can be saved to the physical drive or hard disk.

Unpickling is the reverse: reading a pickled file from a hard disk or physical storage drive and rebuilding the object.
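A minimal round trip with the standard `pickle` module is sketched below. The record contents and the file name are hypothetical, and a temporary directory is used so the example leaves nothing behind in the working directory.

```python
import os
import pickle
import tempfile

record = {"model": "knn", "k": 5, "scores": [0.81, 0.79]}  # hypothetical object

path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Pickling: serialize the object and write it to disk.
with open(path, "wb") as f:
    pickle.dump(record, f)

# Unpickling: read the bytes back and rebuild the object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)  # True
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.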

## Q59) How to convert n number of series to a data frame?

DataFrame(data={'col1': series1, 'col2': series2})

## Q60) How to select a section of a data frame?

Rows and columns can be selected using the loc (label-based) and iloc (position-based) indexers.

## Q61) How to differentiate between KNN and K-means clustering?

KNN stands for K-Nearest Neighbours; it is a supervised classification algorithm. K-means is an unsupervised clustering algorithm.

## Q62) Difference between “long” and “wide” format data?

In the wide format, each subject's repeated responses are in a single row, and each response is in a separate column. In the long format, each row is one time point per subject. You can recognize data in wide format by the fact that the columns generally represent groups.

## Q63) Why is cleaning data an important part of the process?

Cleaning data at the source is a big job. Fixing messy data that comes from many uncontrolled sources can take up to 80% of our time.

## Q64) Which language do you find suitable for text analysis: R or Python?

Python, because it has a rich set of libraries that give researchers high-quality data analysis tools and data structures, whereas R is less convenient in this respect. So, Python is more suited to text analysis.

## Q65) What are the two main elements of the Hadoop architecture?

HDFS and YARN are the two main components of the Hadoop architecture.

HDFS- It stands for Hadoop Distributed File System. This is the distributed storage layer on which Hadoop works. It can store and retrieve large amounts of data quickly.

YARN- It stands for Yet Another Resource Negotiator. It allocates resources and handles workloads.

## Q66) How do Data Scientists use statistics?

Statistics helps data scientists look into data for patterns and hidden insights, and convert big data into big insights. It helps them get a better idea of what customers are expecting. Data scientists can learn about consumer behavior, interest, engagement, retention, and finally conversion through statistics. It helps them build powerful data models to validate certain inferences and predictions. All of this can be turned into a powerful business proposition by giving users exactly what they want.

## Q67) What is Logistic Regression?

It is a statistical technique or model for analyzing a data set and predicting a binary outcome. The outcome must be zero or one, or a binary outcome such as yes or no. Random forest is another important technique, used for classification, regression, and other tasks on the data.

## Q68) Why is data cleansing important in data analysis?

Since the data comes from many sources, it is important to make sure it is good enough for analysis. This is where data cleansing matters. Data cleansing deals with detecting and correcting data records, ensuring that the data is complete and accurate, and deleting or modifying the components that are wrong or irrelevant, as required. This process can be used together with data wrangling or batch processing.

Once the data has been cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data is prone to errors caused by human negligence, corruption during transmission or storage, and other things. Data cleansing takes a huge part of a data scientist's time and effort, because of the many sources the data comes from and the speed at which it arrives.

## Q69) Explain the real-world scenarios to use Machine Learning?

Here are some situations where machine learning can be found in real-world applications:

• E-commerce: Understanding customer churn, ad targeting, and remarketing
• Search Engine: Ranking pages depending on the searcher's personal preferences
• Finance: Assessing investment opportunities and risks, detecting fraudulent transactions
• Medicare: Designing medicines depending on the patient's history and needs
• Robotics: Machine learning to handle situations that are out of the ordinary
• Social Media: Understanding relationships and recommending connections
• Extracting information: Framing questions to get answers from databases on the web

## Q70) What is Linear Regression?

This is the most commonly used method for predictive analysis. Linear regression is used to describe the relationship between a dependent variable and one or more independent variables. The main task in linear regression is fitting a single line through a scatter plot.

Linear regression consists of the following three steps:

• Determining and analyzing the correlation and direction of the data
• Estimating the model
• Ensuring the usefulness and validity of the model

It is widely used in scenarios where a cause-and-effect model comes into play. For example, you may want to know the effect of a specific action in order to determine the various outcomes.

## Q71) What Is Interpolation and Extrapolation?

Interpolation and extrapolation are both important in any statistical analysis. Extrapolation is the estimation of a value by extending a known set of values or facts into an area or region that is unknown. It is a technique for inferring something using the data that is available.

Interpolation, on the other hand, is the method of determining a value that falls between a known set or sequence of values. This is especially useful if you have data at the two ends of a particular region, but you do not have enough data points at the specific point of interest. This is when you use interpolation to determine the required value.
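As a small illustration, the same straight-line formula performs interpolation when the query point lies between the two known points and extrapolation when it lies outside them. Linear estimation is only one possible method, and the function name and data points here are hypothetical.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Known points (hypothetical): (10, 100) and (20, 200).
inside = linear_estimate(10, 100, 20, 200, 15)   # interpolation: 15 is within [10, 20]
outside = linear_estimate(10, 100, 20, 200, 30)  # extrapolation: 30 is beyond 20

print(inside, outside)  # 150.0 300.0
```

Extrapolated values deserve more caution, since nothing in the data confirms that the trend continues outside the observed range.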

## Q72) When can the null hypothesis be rejected based on the P-value, and when can it not?

P-value > 0.05 indicates weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

P-value < 0.05 indicates strong evidence against the null hypothesis, so the null hypothesis can be rejected.

P-value = 0.05 is the marginal value, indicating that it could go either way.

## Q73) What is Backpropagation and explain its working.

Backpropagation is the training algorithm used for multi-layer neural networks. With this method, the error is propagated from the output end of the network back to all the weights inside the network, which permits efficient computation of the gradient.

The following are the steps used in backpropagation:

• Forward propagation of the training data
• Computing derivatives using the output and the target
• Backpropagating to compute the derivative of the error w.r.t. the output activation
• Using the previously calculated derivatives to compute the derivatives for the earlier layers
• Updating the weights
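The steps above can be sketched on a deliberately tiny network: one input, one sigmoid hidden unit, and one linear output, trained on a single example. The architecture, initial weights, learning rate, and squared-error loss are all assumptions made for illustration, not something the text prescribes.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_step(params, x, y, lr):
    """One forward pass, one backward pass, and one weight update."""
    w1, b1, w2, b2 = params

    # Forward propagation.
    h = sigmoid(w1 * x + b1)       # hidden activation
    y_hat = w2 * h + b2            # network output
    loss = 0.5 * (y_hat - y) ** 2  # squared error against the target

    # Derivative of the error w.r.t. the output.
    d_out = y_hat - y

    # Backpropagate with the chain rule, reusing d_out for earlier layers.
    d_w2 = d_out * h
    d_b2 = d_out
    d_z = d_out * w2 * h * (1.0 - h)  # through the sigmoid
    d_w1 = d_z * x
    d_b1 = d_z

    # Update the weights by gradient descent.
    new_params = (w1 - lr * d_w1, b1 - lr * d_b1, w2 - lr * d_w2, b2 - lr * d_b2)
    return new_params, loss


params = (0.1, 0.0, 0.2, 0.0)  # arbitrary initial weights
losses = []
for _ in range(100):
    params, loss = train_step(params, x=1.0, y=0.5, lr=0.5)
    losses.append(loss)

print(losses[0] > losses[-1])  # True: the error shrinks as training proceeds
```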

## Q74) What are the variants of Back Propagation?

Stochastic Gradient Descent: A single training example is used to calculate the gradient and update the parameters.

Batch Gradient Descent: The gradient is calculated over the complete data set, and one update is performed at every iteration.

Mini-batch Gradient Descent: This is the most popular optimization algorithm. It is a variant of Stochastic Gradient Descent in which a mini-batch of samples is used rather than a single training example.
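In the sketch below, the three variants differ only in the batch size. It fits a line y = m*x + c to noise-free synthetic data; the function name, learning rate, epoch count, and data are assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # batch_size=1 is stochastic, batch_size=len(xs) is batch,
        # and anything in between is mini-batch gradient descent.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_m = sum(2 * (m * xs[i] + c - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_c = sum(2 * (m * xs[i] + c - ys[i]) for i in batch) / len(batch)
            m -= lr * grad_m
            c -= lr * grad_c
    return m, c


xs = [i / 10 for i in range(10)]  # synthetic inputs 0.0 .. 0.9
ys = [2 * x + 1 for x in xs]      # generated from m=2, c=1

for name, size in [("stochastic", 1), ("mini-batch", 5), ("batch", len(xs))]:
    m, c = fit_line(xs, ys, batch_size=size)
    print(name, round(m, 3), round(c, 3))
```

All three runs should land close to m = 2 and c = 1; what changes is how many parameter updates happen per pass over the data.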

## Q75) Name a few frameworks of Deep Learning.

• Caffe
• Keras
• PyTorch
• Chainer
• TensorFlow
• MS Cognitive Toolkit

## Q76) There is a race track with five lanes. There are 25 horses of which you want to find out the three fastest horses. What is the minimal number of races needed to identify the 3 fastest horses of those 25?

Divide the 25 horses into 5 groups of 5 horses each. Racing each group once (5 races) ranks the horses within each group. A sixth race between the five group winners identifies the fastest horse overall. A seventh and final race decides the second and third fastest horses; it is run between the second and third horses from the overall winner's group, the first and second horses from the group whose winner came second in race six, and the winner of the group whose winner came third. So the minimum number of races needed is 7.
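The seven-race procedure can be checked with a small simulation. The hidden speeds are random and made up; the procedure only learns about them through the finishing order of each race. Function names are hypothetical.

```python
import random


def race(horses, speed):
    """Run one race (at most 5 horses) and return them fastest first."""
    assert len(horses) <= 5
    return sorted(horses, key=lambda h: speed[h], reverse=True)


def top_three(speed):
    """Find the 3 fastest of 25 horses, counting the races used."""
    horses = list(speed)
    races = 0

    # Races 1-5: rank the horses inside each group of five.
    groups = []
    for i in range(0, 25, 5):
        groups.append(race(horses[i:i + 5], speed))
        races += 1

    # Race 6: the five group winners; its winner is the fastest overall.
    winners = race([g[0] for g in groups], speed)
    races += 1
    by_winner = {g[0]: g for g in groups}
    a, b, c = (by_winner[w] for w in winners[:3])

    # Race 7: the only horses that can still be second or third overall.
    final = race([a[1], a[2], b[0], b[1], c[0]], speed)
    races += 1
    return [a[0], final[0], final[1]], races


rng = random.Random(42)
speed = dict(enumerate(rng.sample(range(1000), 25)))  # distinct hidden speeds

result, races = top_three(speed)
expected = sorted(speed, key=speed.get, reverse=True)[:3]
print(result == expected, races)  # True 7
```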

## Q77) What are Auto-encoders?

Auto-encoders are simple learning networks that transform inputs into outputs with minimal error. This means that the resulting outputs are very close to the inputs.

## Q78) What are Tensors?

Tensors are mathematical objects that represent higher-dimensional collections of data, such as characters and numbers, and are characterized by their rank. They are fed as inputs to neural networks.

## Q79) What is the reason for resampling?

Resampling is performed in any one of the following cases:

• Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
• Substituting labels on data points when performing significance tests
• Validating models by using random subsets (bootstrapping, cross-validation)
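The first case, drawing randomly with replacement to estimate the accuracy of a statistic, can be sketched as a bootstrap estimate of the standard error of the mean. The sample values, the number of resamples, and the function name are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_se_of_mean(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    resampled_means = [
        mean(rng.choices(sample, k=len(sample)))  # one bootstrap resample
        for _ in range(n_resamples)
    ]
    # The spread of the resampled means approximates the standard error.
    return stdev(resampled_means)


sample = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]  # hypothetical observations
se = bootstrap_se_of_mean(sample)
print(round(se, 2))
```

For this sample the estimate should come out near 1.0, in line with the spread of the data divided by the square root of the sample size.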

## Q80) What are the types of SVM Kernels?

• Linear Kernel
• Polynomial Kernel
• Radial Basis Function (RBF) Kernel
• Sigmoid Kernel
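The linear, polynomial, and sigmoid kernels can be written as plain functions of two vectors. The parameter names (`degree`, `gamma`, `coef0`) and their default values are assumptions chosen for illustration.

```python
from math import tanh


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=0.1, coef0=0.0):
    return tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]  # hypothetical feature vectors

print(linear_kernel(u, v))      # 11.0
print(polynomial_kernel(u, v))  # 144.0
print(sigmoid_kernel(u, v))     # tanh(1.1), a value between -1 and 1
```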

## Conclusion

The best way to prepare for a data science interview is to equip yourself with the skills that are most in demand in the field of data science. These skills include programming languages, statistics and analysis, probability, logical reasoning, and so on. So practice these skills, showcase them in your data science interview, and ace it with confidence. Remember, practice is the key to success. All the best!

Data science is the skill and technology that every industry is craving. Having a data science skillset in the current era means having an in-demand career option in your pocket. If you are also dreaming of becoming a data scientist, then check out the Data Science training at Codegnan. We have trained hundreds of data scientists until now.

The salary of a data scientist in India ranges from INR 365k per annum to 500k per annum.

Our data science training will help you master data science analytics skills through real-world projects in multiple domains like Big Data, Data Science, and Machine Learning. The trending world of data science is waiting for you to be skilled.
