Top Answers to Data Science Interview Questions

By Codegnan | 3987 Views | 10 Mins Read | Updated on March 10, 2020

This blog reviews common questions and answers that are asked during a Data Science interview. You'll learn how to answer basic data science interview questions related to prediction, underfitting, and overfitting. With 80 Data Science interview questions, you will also walk through typical questions on statistics and probability, and some more on data structures and algorithms. The technical skills required to appear for the interview are:
1.Python
2.SQL
3.Statistics and Probability
4.Algorithms
5.Supervised and Unsupervised Machine Learning

Q1) Explain supervised and unsupervised learning.

Supervised learning – When the target variable is known for the problem statement, the problem is a supervised learning problem. It is applied to perform regression and classification.

Example: Linear Regression and Logistic Regression.

Unsupervised learning – When the target variable is unknown for the problem statement, the problem is an unsupervised learning problem. It is widely used to perform clustering.

Example: K-Means and Hierarchical clustering.

Q2) What are the commonly used algorithms?

1.Linear regression
2.Logistic regression
3.Random Forest
4.KNN

Q3) What is precision?

Precision is the ratio of true positives to all predicted positives: Precision = TP / (TP + FP). It is one of the most commonly used error metrics for classification. The range is from 0 to 1, where 1 represents 100%.

Q4) What is recall?

Recall is the ratio of true positives to all actual positives: Recall = TP / (TP + FN). The range is from 0 to 1.

Q5) Which metric combines precision and recall to measure accuracy in a classification problem?

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
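As a quick illustration, all three metrics can be computed with scikit-learn; the toy labels below are hypothetical:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical predicted labels

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))         # 2PR / (P + R) = 0.75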

Q6) What is a normal distribution?

A normal distribution is a symmetric, bell-shaped distribution in which the data are distributed evenly around the center, so the mean, median, and mode are all equal.

Q7) What is overfitting?

A model with high inconsistency between its training error and its test error leads to serious business problems. If the error rate on the training set is low but the error rate on the test set is high, the model is said to be overfitting.

Q8) What is underfitting?

A model that predicts poorly on both the training data and the test data also leads to serious business problems. If the error rate on the training set is high and the error rate on the test set is also high, the model is said to be underfitting.

Q9) What is a univariate analysis?

An analysis that is applied to one attribute at a time is called univariate analysis. The boxplot is one of the most widely used univariate plots. Scatter plots and Cook's distance are methods used for bivariate and multivariate analysis.

Q10) Name a few methods for Missing Value Treatments.

Central Imputation – This method uses central tendencies. Missing values are filled with the mean or median for numerical attributes and the mode for categorical attributes.

KNN – K Nearest Neighbour imputation

The distance between two or more attributes is calculated using Euclidean distance, and the nearest neighbours are used to fill in the missing values. (In central imputation, by contrast, the mean and mode are used.)
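A minimal sketch of both treatments with scikit-learn (the toy matrix is hypothetical; KNNImputer assumes scikit-learn 0.22 or later):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])  # hypothetical data with a missing value

# Central imputation: fill missing values with the column mean
print(SimpleImputer(strategy='mean').fit_transform(X))

# KNN imputation: fill missing values from the nearest rows (Euclidean distance)
print(KNNImputer(n_neighbors=2).fit_transform(X))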

Q11) What is Pearson correlation?

Correlation between predicted and actual data can be examined and understood using this method. The range is from -1 to +1. -1 refers to negative 100% whereas +1 refers to positive 100%.

The formula is r = Cov(X, Y) / (Sd(X) * Sd(Y)); equivalently, r = m * Sd(X) / Sd(Y), where m is the slope of the fitted regression line.
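For example, with NumPy (toy data, for illustration only):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

print(np.corrcoef(x, y)[0, 1])  # Pearson correlation coefficient, between -1 and +1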

Q12) How and by what methods data visualizations can be used effectively?

To deliver insights effectively and efficiently, data visualization need not be limited to bar charts, line charts, or other stereotypical diagrams; data can be represented in much more visually appealing ways. One thing to take care of is conveying the intended insight or finding correctly to the audience. Once that baseline is set, innovation and creativity can help you build better-looking, functional dashboards. There is a fine line between a simple, insightful dashboard and an awesome-looking dashboard that yields no useful insight.

Q13) How to understand the problems faced during data analysis?

Most problems faced during hands-on analysis or data science work result from a poor understanding of the problem at hand, with too much focus on tools, end results, and other parts of the project. Breaking the problem down to a granular level and understanding it takes a lot of time and practice to master. Having to go back to square one happens in data science projects at many organizations, and even in your own projects or Kaggle problems.

Q14) What are the time series algorithms?

Time series algorithms such as ARIMA, ARIMAX, SARIMA, and Holt-Winters are very interesting to learn and use to solve complex business problems. The data plays a fundamental role in time series analysis: stationarity, seasonality, cycles, and noise need time and attention. Take as much time as you need to get the data right; only then run a model on it.
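As a minimal sketch, an ARIMA model can be fitted with statsmodels (the series below is synthetic, and the import path assumes statsmodels 0.12 or later):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series, for illustration only
y = pd.Series(np.random.randn(48).cumsum(),
              index=pd.date_range('2015-01', periods=48, freq='M'))

fit = ARIMA(y, order=(1, 1, 1)).fit()  # (p, d, q): AR, differencing, MA terms
print(fit.forecast(steps=6))           # forecast the next six periods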

Q15) How can I achieve accuracy in the first model that I built?

Building machine learning models involves a lot of interesting steps. Models with 90% accuracy don't come on the very first attempt; you have to apply a lot of feature selection techniques to get to that point, which means a lot of trial and error. The process will help you learn new concepts in statistics, math, and probability.

Q16) What is the basic responsibility of a Data Scientist?

As a data scientist, we have the responsibility to make complex things simple enough that anyone without context can understand what we are trying to convey.

The moment we start explaining even the simple things, the mission of making the complex simple goes away. This happens a lot when we are doing data visualization.

Less is more. Rather than pushing too much information into the reader's brain, we need to figure out how easily we can help them consume a dashboard or a chart.

The process is simple to describe but difficult to implement. You must bring the complex business value out of a self-explanatory chart. It's a skill every data scientist should strive towards and is good to have in their arsenal.

Q17) Explain RUN-Group processing.

To use RUN-group processing, you start the procedure and then submit multiple RUN-groups. A RUN-group is a group of statements that contains at least one action statement and ends with a RUN statement. It can contain other SAS statements such as AXIS, BY, GOPTIONS, LEGEND, or WHERE.

Q18) State the definition of BY-Group processing.

BY-group processing is a method of processing observations from one or more SAS data sets that are grouped or ordered by the values of one or more shared variables. All data sets being combined must include one or more BY variables.

Q19) Explain Precision and Recall.

Recall:

Recall is also known as the true positive rate: the number of positives your model claims compared to the actual number of positives present in the data.

Precision

Precision is also known as the positive predictive value. It is based on the predictions: it indicates the number of correct positives the model claims compared to the total number of positives it claims.

Q20) Explain F1 score and How it is used?

The F1 score is a measure of a model's performance: the harmonic mean of the model's precision and recall. An F1 score of 1 is the best result and 0 is the worst.

Q21) Explain the confounding variables.

Confounding variables are extraneous variables in a statistical model that correlate, directly or inversely, with both the subject and the objective variable. A study fails when it does not account for a confounding factor.

Q22) How can you randomize the items of a list in place in Python?

Below is an example:

from random import shuffle

x = ['Data', 'Class', 'Blue', 'Flag', 'Red', 'Slow']
shuffle(x)  # shuffles the list in place
print(x)

Output (the order varies from run to run):

['Red', 'Data', 'Blue', 'Slow', 'Class', 'Flag']

Q23) How to get indices of N maximum values in a NumPy array?

We can get the indices of N maximum values in a NumPy array using the below code:

import numpy as np

arr = np.array([1, 3, 2, 4, 5])
print(arr.argsort()[-3:][::-1])  # indices of the 3 largest values, largest first

Output:

[4 3 1]

Q24) How to create 3D plots or visualizations using NumPy/SciPy?

Like 2D plotting, 3D graphics is beyond the scope of NumPy and SciPy, but, just as in the 2D case, packages exist that integrate with NumPy. Matplotlib provides basic 3D plotting in the mplot3d subpackage, whereas Mayavi provides a wide range of high-quality 3D visualization features, built on the powerful VTK engine.
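A minimal mplot3d sketch (matplotlib 3.2 or later is assumed for the projection argument):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))  # a simple Gaussian surface

ax = plt.figure().add_subplot(projection='3d')  # 3D axes from mplot3d
ax.plot_surface(X, Y, Z)
plt.show()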

Q25) What are the types of biases that can occur during sampling?

Some simple examples of sampling bias are described below. Undercoverage occurs when some members of the population are inadequately represented in the sample; for example, a survey that draws its sample only from telephone directories and car registration lists.

1.Selection bias

2.Under coverage bias

3.Survivorship bias


Q26) Which Python library is used for data visualization?

Plotly, also called Plot.ly after its main online platform, is an interactive online visualization tool used for data analytics, scientific graphs, and other visualizations. It provides great APIs, including one for Python.

Q27) How can you access a specific script inside a module?

If the whole module needs to be imported, we can simply use from pandas import *. To access a specific name inside the module, import it directly, for example from pandas import DataFrame.

Q28) What is a nonparametric test used for?

Nonparametric tests do not assume that the data follow a specific distribution. They can be used whenever the data do not meet the assumptions of parametric tests.

Q29) What are the pros and cons of the Decision Trees algorithm?

Pros
1.Easy to interpret.
2.Ignores irrelevant independent variables, since their information gain will be minimal.
3.Can handle missing data.
4.Fast modelling.

Cons
1.Many combinations are possible when creating a tree.
2.It might not find the best tree possible.

Q30) Name some Classification Algorithms.

Following are some commonly used classification algorithms:
1.Logistic Regression
2.Naive Bayes Classifier
3.Decision Trees
4.Random Forest
5.Neural Networks
6.K Nearest Neighbor

Q31) What are pros and cons of Naive Bayes algorithm?

Pros
1.Handles large datasets easily
2.Good, accurate multiclass performance
3.Not process-intensive

Cons
1.Assumes independence of the predictor variables

Q32) What are the types of Skewness?

The two types are right (positive) skew and left (negative) skew.

Q33) What is skewed data?

Skewed data is a distribution whose values are asymmetric, trailing off towards the right or the left rather than being evenly distributed around the center.

Q34) What is an outlier?

An outlier is a value that lies far away from the rest of the values in the data set.

Q35) What are the applications of data science?

Following are the application of data science:
1.Optical character recognition
2.Recommendation engines
3.Filtering algorithms
4.Personal assistants
5.Advertising
6.Surveillance
7.Autonomous driving
8.Facial recognition and more.

Q36) Define EDA and what are the steps to perform EDA?

EDA [exploratory data analysis] is an approach to analyzing data to summarize their main characteristics, often with visual methods.

Steps included to perform EDA (a short pandas sketch follows the list):

1.Make summary of observations

2.Describe central tendencies or core part of dataset

3.Describe shape of data

4.Identify potential associations

5.Develop insight into errors, missing values and major deviations
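A minimal pandas sketch of these steps (data.csv is a hypothetical dataset):

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical dataset

print(df.head())        # summary of observations
print(df.describe())    # central tendencies and shape of the numeric columns
print(df.corr())        # potential associations between numeric columns
print(df.isna().sum())  # missing values per column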

Q37) What are the various types of data available in Enterprises?

1.Structured data
2.Unstructured data
3.Big data from social media, surveys, pictures, audio, video, drawings, maps.
4.Machine generated data from instruments
5.Real time data feeds

Q38) What are the various types of analysis performed on data?

1.Univariate analysis describes a single variable at a time, e.g., its distribution and summary statistics.
2.Bivariate analysis examines the relationship between two variables at a time, as in a scatterplot.
3.Multivariate analysis studies more than two variables at once to understand the effect of several variables on a response.
4.Univariate – 1 variable
5.Bivariate – 2 variables
6.Multivariate – more than 2 variables

Q39) What is the difference between primary data and secondary data?

Primary data is data collected first-hand by the researcher, fresh and for the first time. Secondary data is data that someone else has collected and that you are reusing.

Q40) State the difference between qualitative & quantitative?

Quantitative methods analyze data based on numbers. Qualitative methods analyze data by attributes or categories.

Q41) What is histogram?

A histogram is a representation of the distribution of numerical data based on the occurrences or frequencies of values falling into each bin.

Q42) What are the common measures of central tendencies?

1.Mean
2.Median
3.Mode

Q43) What are quartiles?

Quartiles are three points in the data that divide the data into four groups. Each group consists of a quarter of data.

Q44) What are the commonly used error metrics in regression tasks?

MSE – Mean squared error – the average of the squared errors

RMSE – Root mean squared error – the square root of MSE

MAPE – Mean absolute percentage error – the average of the absolute percentage errors
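A minimal sketch computing all three with scikit-learn and NumPy (the actual and predicted values below are made up; mean_absolute_percentage_error assumes scikit-learn 0.24 or later):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)                                             # MSE
print(np.sqrt(mse))                                    # RMSE
print(mean_absolute_percentage_error(y_true, y_pred))  # MAPE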

Q45) What are the commonly used error metrics for classification tasks?

1.F1 score
2.Accuracy
3.Sensitivity
4.Specificity
5.Recall
6.Precision

Q46) What is it called when there are more than 1 explanatory variable in the regression task?

Multiple linear regression.

Q47) What are residuals in a regression task?

The difference between the actual value and the predicted value is called the residual.

Q48) What are the main classifications in Machine learning?

1.Supervised learning
2.Unsupervised learning
3.Reinforcement learning

Q49) What are the main types of supervised learning tasks?

1.Classification task (categorical in nature)
2.Regression task (continuous in nature)

Q50) Give a simple representation for Linear Equation.

y = mx + c

where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept.

Q51) What is R square value?

The R-squared value tells us how well the regression line fits the actual values: the proportion of the variance in the dependent variable that is explained by the model.
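A minimal scikit-learn sketch tying the last two questions together (toy data; score() returns the R-squared value):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # independent variable x
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # dependent variable y

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope m and intercept c
print(model.score(X, y))              # R-squared value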

Q52) What are some common ways of imputation?

Mean imputation, median imputation, KNN imputation, stochastic regression, substitution

Q53) What is the difference between series and list

A list is both size-mutable and data-mutable.

A pandas Series is data-mutable but not size-mutable.

Q54) Which function is used to get descriptive statistics of a dataframe?

describe() – it returns descriptive statistics (count, mean, standard deviation, min, quartiles, max) of a DataFrame's numeric columns.

Q55) What is the difference between a dictionary and a set?

A dictionary stores key-value pairs.

A set has no key-value pairs and contains only unique elements.

Q56) Which function can be used to filter a DataFrame?

The query() function can be used to filter a DataFrame.

Q57) What is the function to create a test train split?

from sklearn.model_selection import train_test_split. This function is used to create a train/test split of the data.
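A minimal usage sketch (the feature matrix and target below are hypothetical):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features
y = np.arange(10)                 # hypothetical target

# 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)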

Q58) What is pickling and unpickling?

Pickling is the process of serializing a Python object and saving it to a physical drive or hard disk.

Unpickling reads a pickled file back from the hard disk or physical storage drive and reconstructs the object.
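A minimal sketch with the standard-library pickle module (the object and file name are hypothetical):

import pickle

data = {'model_name': 'example', 'accuracy': 0.9}  # hypothetical object

with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)       # pickling: write the object to disk

with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)  # unpickling: read the object back
print(restored)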

Q59) How to convert n number of series to a dataframe?

import pandas as pd

df = pd.DataFrame(data={'col1': series1, 'col2': series2})

Q60) How to select a section of a dataframe?

Rows and columns can be selected using the loc (label-based) and iloc (integer-position-based) indexers.
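For example (toy DataFrame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

print(df.loc['x':'y', 'a'])  # loc: label-based selection
print(df.iloc[0:2, 0])       # iloc: integer-position-based selection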

Q61) How to differentiate from KNN and K-means clustering?

KNN stands for K-Nearest Neighbours; it is classified as a supervised algorithm and is used for classification (and regression). K-means is an unsupervised clustering algorithm.

Q62) Difference between “long” and “wide” format data?

In the wide format, each subject's responses are in a single row, with each response in a separate column. In the long format, each row is one observation: one measurement per subject per time point. You can recognize wide-format data by the fact that columns usually represent groups.

Q63) How Cleaning Data is an important part of the process?

Cleaning data at the point of work is a big job. As the number of uncontrolled data sources increases, the time needed to clean the data grows, and cleaning can take up to 80% of the analysis time.

Q64) Which language you find suitable for text analysis: R or Python?

Python, because it has rich libraries that give researchers high-quality data analysis tools and data structures suited to text, whereas R is less convenient here. So Python is better suited to text analysis.

Q65) What are the two main components of the Hadoop architecture?

HDFS and YARN are the two main components of the Hadoop architecture.

HDFS – Hadoop Distributed File System. This is Hadoop's distributed storage layer, which makes it possible to store and retrieve large volumes of data at any time.

YARN – It stands for Yet Another Resource Negotiator. It allocates cluster resources and manages workloads.

Q67) What is Logistic Regression?

Logistic regression is a statistical technique or model for analyzing a dataset and predicting a binary outcome. The outcome has to be binary: zero or one, yes or no. (Random forest, by contrast, is an ensemble technique used for classification, regression, and other tasks.)
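A minimal scikit-learn sketch of logistic regression on a binary outcome (toy data):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # hypothetical predictor
y = np.array([0, 0, 0, 1, 1, 1])              # binary outcome

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[2.5]]))  # probabilities of each class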

Q68) Why is data cleaning important in data analysis?

Since data originates from numerous sources, we must ensure it is adequate for analysis, and data cleaning is significant for that. Data cleaning is the process of detecting and fixing problems in the data, ensuring that it is complete and accurate; wrong or corrupt parts are excluded or changed as required. The process can be performed interactively with data-wrangling tools or through batch processing.

Once the data has been cleaned, it conforms to the rules of the data on the system. Data cleaning is a significant part of data science because corruption creeps in through human carelessness, transfer, or storage, among other things. Because of the volume and speed at which Big Data arrives, data cleaning takes up a huge part of a data scientist's time and effort.

Q69) Explain the real-world scenarios to use Machine Learning?

Here are some situations where machine learning is found in real-world applications:
1.E-commerce: understanding customers, ad targeting, and reviews
2.Search engines: ranking pages based on the searcher's personal preferences
3.Finance: assessing investment opportunities and risks, detecting fraudulent transactions
4.Healthcare: designing treatments based on the patient's history and needs
5.Robotics: machine learning to handle situations outside the norm
6.Social media: understanding relationships and making recommendations
7.Information extraction: framing questions to get answers from databases on the web

Q70) What is Linear Regression?

Linear regression is the most commonly used method for predictive analysis. It describes the relationship between a dependent variable and one or more independent variables; its main task is fitting a single straight line to a scatter plot.
Linear regression involves the following three steps:
1.Determining and analyzing the correlation and direction of the data
2.Estimating the model
3.Ensuring the usefulness and validity of the model
It is widely used in scenarios with a cause-and-effect character, for example, estimating the effect of a specific action on various outcomes.

Q71) What Is Interpolation and Extrapolation?

Interpolation and extrapolation are both important in statistical analysis. Extrapolation is the estimation of a value by extending known facts beyond the region where data are available; it is a technique for inferring something outside the range of the observed data.

Interpolation, on the other hand, is the method of estimating a value that lies between known values. It is especially useful when you have data on both sides of a particular region but not enough data points at the specific point of interest.
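A minimal NumPy sketch contrasting the two (the data follow y = 2x, chosen for illustration):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])  # y = 2x, hypothetical data

print(np.interp(1.5, x, y))  # interpolation: estimate inside the data range -> 3.0

# Extrapolation: fit a line and evaluate beyond the observed range
m, c = np.polyfit(x, y, 1)
print(m * 5.0 + c)           # estimate at x = 5, outside the data -> 10.0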

Q72) When P-values can be rejected and when cannot be rejected?

P-value > 0.05 indicates weak evidence against the null hypothesis that means it cannot be rejected.

P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

P-value=0.05 is the marginal value indicating the possibilities for going either way.

Q73) What is Back propagation and explain its working.

Back propagation is the training algorithm used for multi-layer neural networks. It moves the error from the output end of the network back through all the weights inside the network, permitting efficient computation of the gradient.
The following steps are used in back propagation:
1.Forward-propagate the training data through the network
2.Compute the error using the output and the target
3.Back-propagate to compute the derivative of the error w.r.t. the output activation
4.Use the previously calculated derivatives to compute derivatives for the earlier layers
5.Update the weights

Q74) What are the variants of Back Propagation?

Stochastic Gradient Descent: uses a single training example to calculate the gradient and update the parameters.

Batch Gradient Descent: calculates the gradient on the complete data set and performs one update per iteration.

Mini-batch Gradient Descent: the most popular optimization algorithm in practice. It is a variant of Stochastic Gradient Descent in which, rather than a single training example, a mini-batch of samples is used.
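A minimal NumPy sketch of the idea on a linear model (synthetic data; batch_size = 1 gives stochastic gradient descent and batch_size = len(X) gives batch gradient descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)  # y = 3x + 1 + noise

w, b, lr, batch_size = 0.0, 0.0, 0.1, 16
for epoch in range(50):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb            # prediction error on the mini-batch
        w -= lr * (2 * err * xb).mean()    # gradient step for the weight
        b -= lr * (2 * err).mean()         # gradient step for the bias
print(w, b)  # should approach 3.0 and 1.0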

Q75) Name a few frameworks of Deep Learning.

1.Caffe
2.Keras
3.Pytorch
4.Chainer
5.TensorFlow
6.MS Cognitive Toolkit

Q76) There is a race track with five lanes. There are 25 horses of which you want to find out the three fastest horses. What is the minimal number of races needed to identify the 3 fastest horses of those 25?

Seven races. Divide the 25 horses into 5 groups of 5 and race each group (5 races); this determines the winner of each group. A sixth race between the 5 group winners identifies the fastest horse overall. To find the second- and third-fastest, run a seventh race between the second- and third-place finishers of the winners' race, the second- and third-place horses from the overall winner's original group, and the second-place horse from the runner-up's original group; the top two of that race are the second- and third-fastest of the 25.

Q77) What are Auto-encoders?

Auto-encoders are simple learning networks that transform inputs into outputs with the minimum possible error, meaning the resulting outputs are very close to the inputs.

Q78) What are Tensors?

Tensors are mathematical objects that generalize scalars, vectors, and matrices to higher dimensions; the number of dimensions is the tensor's rank. They represent the data fed as input to neural networks.

Q79) What is the reason for resampling?

Resampling is performed in any one of the following cases:
1.Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
2.Substituting labels on data points when performing significance tests
3.Validating models by using random subsets (bootstrapping, cross-validation)

Q80) What are the types of SVM Kernels?

1.Linear Kernel
2.Polynomial Kernel
3.Radial basis function (RBF) Kernel
4.Sigmoid Kernel
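A minimal scikit-learn sketch; swapping the kernel argument of SVC changes the decision boundary (toy data):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # hypothetical 2-D points
y = np.array([0, 0, 1, 1])

for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[0.9, 0.2]]))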

Conclusion

The best way to prepare for a data science interview is to pack your bag with the skills that are required most in the field of data science. These skills span programming languages, statistics and analysis, probability, logical reasoning, and more. So practice the skills you want to showcase, and ace your data science interview with confidence. Remember, practice is the key to success. All the best!
