29 Data Science Interview Questions


Getting ready for a data science interview? 

You’ll face data science interview questions about coding, statistics, machine learning, and real-world problem-solving. 

In this guide, we’ll break down common data science interview questions and how to answer them.

Whether you’re a beginner or an expert, these tips will help you ace your next interview!

💡 Want to become job-ready for a high-paying data science role? Explore our courses.

Beginner data science interview questions

1. What is Data Science?

Data Science is the study of data to find useful insights, patterns, and trends. It combines statistics, programming, and domain knowledge to make better decisions. Businesses use data science for multiple tasks like predicting sales, detecting fraud, and improving customer experience.

Example: A streaming platform like Netflix uses data science to suggest movies based on your past watching habits.

2. Explain the difference between Supervised and Unsupervised Learning 

The difference between supervised and unsupervised learning is:

  • Supervised Learning: The computer learns from labeled data (data with correct answers). It makes predictions based on past examples.
  • Example: A spam filter learns from past emails labeled as “spam” or “not spam.”
  • Unsupervised Learning: The computer finds patterns in data without labeled answers. It groups similar things together.
  • Example: Netflix groups similar users based on what they watch and suggests shows they might like.
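To make the split concrete, here is a minimal scikit-learn sketch (the tiny arrays are invented for illustration): a classifier learns from labeled data, while a clustering model groups unlabeled points on its own.

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1], [2], [8], [9]])

# Supervised: labels are provided, the model learns the mapping from X to y
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [8.5]]))   # e.g. [0 1]

# Unsupervised: no labels, the model groups similar points by itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # e.g. [1 1 0 0] (cluster ids, arbitrary order)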

3. What is Linear Regression?

Linear Regression is a method to predict a value using a straight line. It shows the relationship between two things. If one increases, the other might increase or decrease.

Example:

Imagine you sell ice cream. If the temperature rises, you sell more. A straight line can predict future sales based on temperature by plotting a graph with “Temperature” on the X-axis and “Ice Cream Sales” on the Y-axis.

Mathematically, the formula is:

Sales = (Slope × Temperature) + Intercept

If the slope is 10, it means with every 1°C increase in temperature, you will sell 10 more ice creams.
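A quick sketch of this idea with NumPy, using made-up temperature and sales numbers; np.polyfit estimates the slope and intercept from the data.

import numpy as np

# Hypothetical temperatures (°C) and ice cream sales
temperature = np.array([20, 25, 30, 35])
sales = np.array([200, 250, 300, 350])

# Fit a straight line: sales ≈ slope * temperature + intercept
slope, intercept = np.polyfit(temperature, sales, 1)
print(slope, intercept)        # ≈ 10.0 and ≈ 0.0 for this made-up data
print(slope * 40 + intercept)  # predicted sales at 40°C -> ≈ 400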

4. Describe a confusion matrix with an example.

A confusion matrix is a table used to measure how well a classification model performs. It compares predicted vs. actual results.

The table has four main parts:

  • True Positive (TP): Correctly predicted as positive
  • False Positive (FP): Incorrectly predicted as positive
  • True Negative (TN): Correctly predicted as negative
  • False Negative (FN): Incorrectly predicted as negative

Example:

If a doctor’s AI model predicts if a person has a disease:

  • TP: Sick people correctly identified as sick
  • FP: Healthy people wrongly identified as sick
  • TN: Healthy people correctly identified as healthy
  • FN: Sick people wrongly identified as healthy
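If the interviewer asks for code, scikit-learn's confusion_matrix does the counting for you; the labels below are invented for illustration.

from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = sick, 0 = healthy
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() on the 2x2 matrix gives TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")   # TP=3, FP=1, TN=3, FN=1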

5. What are sampling techniques? 

Sampling is the process of selecting a subset of data from a larger dataset. Instead of analyzing an entire population, you pick a smaller group to uncover patterns. Sampling saves time and effort while still providing accurate insights.

Types of Sampling:

  • Random Sampling – In Random sampling, every item has an equal chance of selection. (E.g., lottery draw)
  • Stratified Sampling – In Stratified sampling, the entire population is divided into groups, then samples are taken from each group. (E.g., selecting students from different grades)
  • Systematic Sampling – In Systematic sampling, every nth item is chosen. (E.g., checking every 10th product in a factory)
  • Cluster Sampling – In Cluster sampling, the entire population is divided into groups, and one or more groups are selected randomly. (E.g., picking random schools in a city for a survey)
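Here is a small sketch of random and stratified sampling using pandas and scikit-learn; the DataFrame is made up for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "student": range(1, 11),
    "grade": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

# Random sampling: every row has an equal chance of being picked
random_sample = df.sample(n=4, random_state=42)

# Stratified sampling: keep the grade proportions in the sample
stratified_sample, _ = train_test_split(
    df, train_size=0.4, stratify=df["grade"], random_state=42
)
print(random_sample)
print(stratified_sample["grade"].value_counts())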

6. Define Pruning in a Decision Tree Algorithm

Pruning is a technique used in decision trees to remove unnecessary branches that make the model too complex. It helps improve accuracy by preventing overfitting, which happens when a model learns too much from training data and performs poorly on new data.

Example:

Imagine a tree that predicts if a student will pass an exam. If the tree splits into too many branches (e.g., “Did the student eat breakfast?” or “What color is their notebook?”), it becomes too complex. Pruning removes these unnecessary branches and keeps only the important ones, like “Did the student study?”
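In scikit-learn, pruning can be approximated with cost-complexity pruning via the ccp_alpha parameter (or by limiting max_depth); a minimal sketch on the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unpruned tree: grows until the leaves are pure (risk of overfitting)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pruned tree: ccp_alpha > 0 removes branches that add little value
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())  # the pruned tree has fewer leaves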

7. What is the difference between long-format data and wide-format data?

The difference between long-format and wide-format data lies in how the data is organized in a table. In long format, each row represents one observation of a variable, whereas in wide format, multiple observations of the same variable appear in different columns.

You can use long format representation for charts and analysis because it stores data in a structured way. The wide format makes reading data easy when comparing values side by side.

For example, 

Wide format data representation 

| Student | January score | February score | March score |
|---------|---------------|----------------|-------------|
| Alice   | 85            | 88             | 90          |
| Bob     | 78            | 80             | 92          |

Long-format data representation 

| Student | Month    | Score |
|---------|----------|-------|
| Alice   | January  | 85    |
| Alice   | February | 88    |
| Alice   | March    | 90    |
| Bob     | January  | 78    |
| Bob     | February | 80    |
| Bob     | March    | 92    |
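In pandas, you can convert between the two formats with melt (wide to long) and pivot (long to wide); a minimal sketch using the tables above:

import pandas as pd

wide = pd.DataFrame({
    "Student": ["Alice", "Bob"],
    "January": [85, 78],
    "February": [88, 80],
    "March": [90, 92],
})

# Wide -> long: one row per (Student, Month) observation
long = wide.melt(id_vars="Student", var_name="Month", value_name="Score")
print(long)

# Long -> wide: months become columns again
print(long.pivot(index="Student", columns="Month", values="Score"))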

8. What is the purpose of Cross-Validation?

The purpose of Cross-validation is to check if a model performs well on new, unseen data. Instead of testing on the same data used for training, cross-validation splits data into parts: one part is used for training and another for testing.

For example, imagine you are training a model to predict if it will rain based on past weather data. 

If you test it using the same training data, it will look perfect but might fail on new data. 

Cross-validation ensures the model generalizes well by testing on different data splits, improving reliability.
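A short sketch with scikit-learn's cross_val_score, which handles the splitting for you (the built-in iris dataset just stands in for any dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())   # average accuracy across the 5 test folds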

9. When does Bias occur?

Bias occurs when a model makes incorrect assumptions about data, leading to poor predictions. It happens when the model is too simple and ignores important patterns. This is also called underfitting.

Example:

A model predicts house prices using only the size of a house but ignores location, number of rooms, and condition. If it assumes “bigger is always more expensive,” it will make incorrect predictions because other factors also affect housing prices. This is called bias.

To fix bias, we need a more flexible model that considers more relevant features. Strategies such as feature selection and engineering, regularization techniques, and cross-validation help build such models.

10. What is the difference between precision and recall?

Precision and recall both measure how well a model predicts positive cases, but from different angles.

Precision: How many predicted positives are actually correct?

Recall: How many actual positives were correctly predicted?

Example:

A model predicts if emails are spam.

If precision is high, most emails labeled as spam are actually spam.

If recall is high, the model catches most spam emails but may misclassify some normal emails as spam.

A medical example makes this clearer. If a doctor tests for cancer:

  • High precision → fewer false positives (wrongly saying someone has cancer).
  • High recall → fewer false negatives (missing actual cancer cases).
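In code, scikit-learn's precision_score and recall_score compute both metrics; the labels below are invented for illustration.

from sklearn.metrics import precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(actual, predicted))  # 0.75 -> 3 of the 4 predicted spam are truly spam
print(recall_score(actual, predicted))     # 0.75 -> 3 of the 4 actual spam emails were caught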


Intermediate data science interview questions

11. What is the curse of dimensionality?

The curse of dimensionality happens when a dataset has too many features (dimensions), making it harder for machine learning models to learn properly. As dimensions increase, data points spread out, making distance-based calculations (like in KNN or clustering) less meaningful. This leads to poor model performance.

Example:

Imagine you are searching for a friend in a small park (2D space). It’s easy to find them. But if you search in a giant forest (100D space), it becomes much harder. The same happens with high-dimensional data—models struggle to find meaningful patterns.

What are the feature selection methods used to select the right variables?

The feature selection methods used to select the right variables include:

  • Filter Methods – In Filter methods, you can use statistical tests like correlation or chi-square to remove irrelevant features.
  • Wrapper Methods – With Wrapper methods, you can train models with different feature sets and select the best (e.g., Recursive Feature Elimination – RFE).
  • Embedded Methods – In Embedded methods, you can use built-in model techniques like Lasso Regression to eliminate unimportant features.

Feature selection reduces the number of input variables, improving model accuracy and speed.

For example, suppose we predict house prices with 50 features (like area, number of rooms, the color of the walls, etc.). Using an embedded method, we might find that the color of the walls doesn't impact the price and remove it, making the model simpler and better.
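As a quick illustration of the embedded approach, Lasso regression drives the coefficients of unhelpful features toward zero; the data below is synthetic and the alpha value is just an assumption for this sketch.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
area = rng.uniform(50, 200, 100)               # genuinely useful feature
wall_color = rng.integers(0, 5, 100)           # irrelevant feature
price = 1000 * area + rng.normal(0, 100, 100)  # price depends only on area

X = np.column_stack([area, wall_color])
lasso = Lasso(alpha=50.0).fit(X, price)
print(lasso.coef_)   # the wall_color coefficient is driven to (essentially) zero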

12. What is Overfitting in Machine Learning?

Overfitting in machine learning occurs when a model learns too much from training data, capturing noise instead of patterns. It performs well on training data but fails on new data. This happens when the model is too complex, or the training data is too small.

You can fix it by using more training data, applying the regularization (L1/L2) method, or using simpler models like pruning decision trees.

Example:

Imagine a student memorizing answers instead of understanding concepts. They score 100% in practice tests but fail real exams with new questions. In machine learning, a deep decision tree with too many branches overfits, memorizing data instead of generalizing.

13. What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a technique to reduce the number of features while keeping the most important information. It transforms data into a smaller set of new variables (principal components) that capture most of the variation in the original dataset. PCA is used in image compression, face recognition, and finance.

You can think of PCA as cleaning a messy room. If you have too many things lying around, it’s hard to find what you need. But if you organize them neatly, you can still keep the most important ones while saving space.

For example:

Suppose we analyze student performance based on 10 subjects. Instead of looking at all 10 scores, PCA can create two principal components:

  • PC1: Overall academic strength
  • PC2: Strength in science vs. arts

This reduces complexity while preserving insights.
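Here is a sketch with scikit-learn's PCA; the random scores below stand in for the 10 subject marks.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.normal(70, 10, size=(100, 10))   # 100 students, 10 subjects (synthetic)

pca = PCA(n_components=2)
reduced = pca.fit_transform(scores)

print(reduced.shape)                   # (100, 2) -> just two principal components
print(pca.explained_variance_ratio_)   # how much variation each component keeps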

What is the Bias-Variance Tradeoff?

The bias-variance tradeoff is the balance between two types of mistakes a machine learning model can make.

  • Bias happens when a model is too simple and makes too many assumptions. It doesn’t learn enough from the data and makes errors. This is called underfitting.
  • Variance happens when a model is too complex and tries to learn every small detail from the data, even the noise. It performs well on training data but poorly on new data. This is called overfitting.

If a model has high bias, it won’t perform well because it ignores important patterns. 

If it has high variance, it will struggle with new data. 

The goal of the Bias-Variance Tradeoff is to find a balance where the model learns well without being too simple or too complex.

14. What is the concept of Ensemble Learning?

Ensemble Learning is a technique where multiple machine learning models are combined to improve accuracy. Instead of relying on a single model, we use multiple models (weak learners) and combine their predictions to make a stronger final prediction. This helps reduce errors and improves stability.

Example:

Imagine a group of doctors diagnosing a patient. If only one doctor gives an opinion, there is a higher chance of error. But if 10 doctors analyze the case and vote on the best diagnosis, the final decision is more accurate.

Common ensemble methods include:

  • Bagging (Bootstrap Aggregating): Training multiple models on different samples of data (e.g., Random Forest).
  • Boosting: Training models sequentially, where each model corrects the mistakes of the previous one (e.g., AdaBoost, XGBoost).
  • Stacking: Combining multiple models using another model as a final decision-maker.
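A minimal sketch of bagging and boosting with scikit-learn (the built-in breast cancer dataset is just a convenient stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees trained on bootstrap samples, predictions averaged
bagging = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting: trees trained sequentially, each correcting the previous one's errors
boosting = GradientBoostingClassifier(random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())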

How do you handle missing data in a dataset?

Missing data can lead to biased results and inaccurate predictions. To handle missing values, we follow these steps:

  • Identify Missing Data: Check which columns have missing values using df.isnull().sum() in Python (Pandas).
  • Remove Rows/Columns: If a column has too many missing values (e.g., 80% missing), we may drop it.
  • Imputation (Filling Missing Values):
    1. Mean/Median: For numerical data (e.g., filling missing ages with the average age).
    2. Mode: For categorical data (e.g., filling missing city names with the most common city).
    3. Forward/Backward Fill: Filling missing values using previous or next values in time-series data.
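The snippet below sketches these steps with pandas; the DataFrame and column names are made up.

import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 35, 40],
    "city": ["Hyderabad", "Delhi", None, "Delhi"],
})

print(df.isnull().sum())                               # 1. Identify missing values per column
df["age"] = df["age"].fillna(df["age"].mean())         # 2. Mean imputation for numerical data
df["city"] = df["city"].fillna(df["city"].mode()[0])   # 3. Mode imputation for categorical data
print(df)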

15. What is the difference between batch gradient descent and stochastic gradient descent?

Gradient Descent is an optimization algorithm used to minimize errors in machine learning models. The difference between Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD) lies in how they update weights.

  • Batch Gradient Descent: Computes the gradient using the entire dataset in each step. It is more stable but slower for large datasets.
  • Stochastic Gradient Descent: Updates weights after processing each individual data point. It is faster but noisier.
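A bare-bones NumPy sketch of the two update styles for a one-parameter linear model (synthetic data, fixed learning rate chosen for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100)
y = 3 * X + rng.normal(0, 0.1, 100)   # true slope ≈ 3

w, lr = 0.0, 0.1

# Batch gradient descent: one update per pass, using the whole dataset
for _ in range(100):
    grad = -2 * np.mean(X * (y - w * X))
    w -= lr * grad
print(w)   # close to 3

w = 0.0
# Stochastic gradient descent: one update per individual data point
for _ in range(10):
    for xi, yi in zip(X, y):
        grad = -2 * xi * (yi - w * xi)
        w -= lr * grad
print(w)   # also close to 3, but the path taken is noisier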

16. What is the purpose of outlier detection in data science?

Outliers are extreme values that do not follow the normal pattern of the data. Detecting and handling outliers is crucial because they can distort statistical models and affect machine learning accuracy.

Why Detect Outliers?

  • Improve Model Accuracy: Outliers can skew averages and predictions.
  • Detect Data Errors: Outliers may indicate incorrect or corrupted data.
  • Identify Rare Events: Fraud detection systems use outlier detection to catch unusual transactions.
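A simple sketch of outlier detection using the interquartile range (IQR) rule with NumPy; the transaction amounts are invented.

import numpy as np

amounts = np.array([20, 22, 25, 23, 21, 24, 500])   # 500 looks suspicious

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)   # [500]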

17. How to implement a real-time data processing pipeline using different tools?

Real-time data processing means analyzing data as it arrives rather than storing it first and then processing it later. Apache Kafka is a popular tool for handling real-time data streams.

Steps to build a real-time pipeline:

  • Data Source (Producers): Collects data from sensors, websites, or applications.
  • Kafka Broker: Kafka stores and transmits the data in real-time.
  • Consumers (Processing Layer): Reads data and processes it using Apache Spark or Flink.
  • Storage & Visualization: Processed data is stored in a database (e.g., Elasticsearch) and displayed on dashboards (e.g., Grafana).
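As a rough sketch using the kafka-python client (the broker address, topic name, and JSON payload are assumptions for illustration):

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer side: send events to a Kafka topic as they happen
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"ip": "192.168.1.1", "page": "/home"})
producer.flush()

# Consumer side: read and process events in real time
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # downstream, this could feed Spark/Flink or a dashboard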

Advanced data science interview questions

18. Mention the steps involved in an analytics project

An analytics project follows a structured approach:

  • Understand the Problem – You first need to define the business objective.
  • Collect Data – Then, gather relevant data from different sources.
  • Clean and Prepare Data – Data scientists then handle the missing values, remove duplicates, and standardize formats.
  • Explore Data (EDA) – You must use statistics and visualizations to find patterns.
  • Feature Engineering – To create meaningful features to improve model accuracy, you will use different feature engineering techniques. 
  • Select and Train Model – Then, you can choose algorithms to train ML models.
  • Evaluate Model – The next step is to measure the model's accuracy using metrics like RMSE and F1-score.
  • Deploy Model – Finally, you can integrate the model into a real-world system.
  • Monitor & Improve – You can keep track of its performance and retrain the model if needed.

Example:

An e-commerce company wants to predict product demand. They collect sales data, clean it, analyze seasonal trends, build a forecasting model, test its accuracy, and deploy it for inventory management.

👉 Data science project ideas for beginners with source code

19. What are Eigenvectors and Eigenvalues?

Eigenvectors and eigenvalues help simplify complex datasets by reducing dimensions while preserving essential information.

  • Eigenvalues measure the magnitude of the transformation.
  • Eigenvectors show the direction of transformation.

In Principal Component Analysis (PCA), we use eigenvectors and eigenvalues to identify the most important patterns in data.

Example:

Imagine a dataset with 100 variables (features). PCA helps reduce them to a few principal components by identifying eigenvectors that capture maximum variance. This is useful in facial recognition, where eigenfaces represent the most significant facial features.
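NumPy computes both directly; here is a minimal sketch on a small 2×2 matrix.

import numpy as np

A = np.array([[4, 2],
              [1, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # e.g. [5. 2.]
print(eigenvectors)   # each column is the eigenvector for the matching eigenvalue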

20. Explain the principles of predictive modeling. 

Predictive modeling involves building models that use historical data to predict future outcomes. The key principles are:

  • Understand Business Context – Define the goal (e.g., predicting customer churn).
  • Choose the Right Algorithm – Use regression, decision trees, or neural networks.
  • Train on Historical Data – Learn patterns from past data.
  • Validate with New Data – Check if the model generalizes well.
  • Measure Performance – Use RMSE, accuracy, or AUC-ROC for evaluation.

Example:

Netflix uses predictive modeling to recommend movies. It analyzes your past viewing history and suggests shows using collaborative filtering.

21. Describe different regularisation techniques. 

Regularization prevents overfitting by penalizing complex models. The main techniques are:

  • L1 Regularization (Lasso) – Shrinks some feature coefficients to zero, performing feature selection.
  • L2 Regularization (Ridge) – Distributes penalty across coefficients, reducing their impact without making them zero.
  • Elastic Net – Combines L1 and L2 for balanced regularization.
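All three are available in scikit-learn; a quick sketch on synthetic data with one irrelevant feature (the alpha values are arbitrary choices for this example):

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 100)   # the third feature is irrelevant

print(Lasso(alpha=0.5).fit(X, y).coef_)        # L1: the irrelevant coefficient becomes exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)        # L2: all coefficients shrink, none forced to 0
print(ElasticNet(alpha=0.5).fit(X, y).coef_)   # a mix of both behaviors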

22. When is resampling done?

Resampling improves model accuracy by modifying the dataset. It is done when:

  • Handling Imbalanced Data – Use SMOTE (Synthetic Minority Over-sampling Technique) when one class is underrepresented.
  • Evaluating Models – Apply Cross-Validation to test model performance on different data subsets.
  • Bootstrapping – Generate multiple samples from limited data to improve estimates.

Example:

A bank has 100,000 loan applications but only 5,000 fraud cases. Instead of training on unbalanced data, we oversample fraud cases to prevent bias.
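A sketch with the imbalanced-learn library's SMOTE (assuming imbalanced-learn is installed; the synthetic dataset mimics the fraud imbalance):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 95% legitimate, 5% fraud
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))       # e.g. Counter({0: 1895, 1: 105})

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))   # the classes are now balanced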

23. How can you calculate Euclidean Distance in Python?

Euclidean Distance is the straight-line distance between two points in an n-dimensional space. In Python, you can calculate it using NumPy or the math.dist() function.

  • Using NumPy:

import numpy as np

point1 = np.array([3, 4])
point2 = np.array([6, 8])
distance = np.linalg.norm(point1 - point2)
print(distance)  # Output: 5.0

  • Using math.dist():

import math

point1 = (3, 4)
point2 = (6, 8)
distance = math.dist(point1, point2)
print(distance)  # Output: 5.0

24. How would you detect bogus Instagram accounts used for scamming consumers?

Detecting bogus Instagram accounts involves analyzing behavior patterns, follower ratios, and content irregularities. Here are a few signals you can use to identify them:

  • Check Profile Activity – Scammers often have low posts and very high follow counts.
  • Engagement Metrics – Real users have natural likes/comments; scammers use automated bots.
  • Profile Picture & Bio Analysis – Scammers often use stock images or generic bios.
  • Text & Sentiment Analysis – NLP can detect fake DMs or phishing messages.
  • Graph Analysis – Analyzing friend connections helps uncover fake networks.

Example:

A scam account may follow 10,000 users but only receive 50 likes per post. Running an anomaly detection model (e.g., Isolation Forest) can flag such accounts.
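A rough sketch of the anomaly-detection idea with scikit-learn's IsolationForest; the account features and numbers are invented.

import numpy as np
from sklearn.ensemble import IsolationForest

# Columns: [posts, accounts_followed, avg_likes_per_post]
accounts = np.array([
    [120, 500, 300],
    [200, 800, 450],
    [150, 600, 350],
    [3, 10000, 5],   # suspicious profile: almost no posts, mass following, no engagement
])

model = IsolationForest(contamination=0.25, random_state=0).fit(accounts)
print(model.predict(accounts))   # -1 flags the anomaly, 1 marks the normal accounts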

Given a list A of objects and another list B which is identical to A except that one element is removed, find that removed element.

The most efficient way to solve this problem is using the set difference or the sum difference approach.

  • Using set difference:

A = [1, 2, 3, 4, 5]
B = [1, 2, 4, 5]

missing_element = set(A) - set(B)
print(missing_element)  # Output: {3}

  • Using sum difference:

missing_element = sum(A) - sum(B)
print(missing_element)  # Output: 3

For example:

If you have a list of items you packed for a trip (A = [shoes, jeans, shirt, hat]) and after unpacking, you check (B = [shoes, jeans, shirt]), the missing item (hat) is found using this approach.

25. What is the purpose of the MapReduce framework in big data processing? 

MapReduce is a framework for processing large datasets in parallel across multiple nodes. It has two key functions:

  • Map Step: Divides data into smaller chunks and processes them in parallel.
  • Reduce Step: Aggregates the processed data into meaningful results.

Example:

Suppose you have a huge log file containing web traffic data.

  • Map Phase: Each server processes a portion of the logs and counts visits per IP.
  • Reduce Phase: The results from all servers are combined to find the total visits per IP.

Code example:

from collections import Counter

logs = ["192.168.1.1", "192.168.1.2", "192.168.1.1"]
mapped = [(ip, 1) for ip in logs]      # Map: emit (ip, 1) pairs
reduced = Counter()
for ip, count in mapped:               # Reduce: sum the counts per IP
    reduced[ip] += count
print(reduced)  # Output: Counter({'192.168.1.1': 2, '192.168.1.2': 1})

26. What are generative adversarial networks (GANs) and their applications in data science?

GANs are a type of neural network where two models—the Generator and Discriminator—compete to improve data generation.

  • Generator: Creates fake data trying to mimic real data.
  • Discriminator: Tries to differentiate real from fake data.
  • As training progresses, the Generator gets better at creating realistic outputs.

Its applications include:

  • Image Generation – You can use GANs for creating AI-generated human faces (e.g., ThisPersonDoesNotExist.com).
  • Super-Resolution – It can enhance low-quality images.
  • Data Augmentation – It can create synthetic medical images for better AI training.
  • Fraud Detection – It helps in detecting deepfakes by training models to distinguish real from fake.
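A very condensed PyTorch sketch of the two-player setup on toy 1-D data (real GANs need far more training detail; the layer sizes and learning rates here are arbitrary):

import torch
import torch.nn as nn

# Generator: turns random noise into fake samples
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs the probability that a sample is real
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 1) * 2 + 5          # "real" data: normal distribution around 5
    fake = G(torch.randn(32, 4))

    # Train the discriminator: push real -> 1 and fake -> 0
    d_loss = loss_fn(D(real), torch.ones(32, 1)) + loss_fn(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make the discriminator output 1 for fakes
    g_loss = loss_fn(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(100, 4)).mean().item())   # should drift toward the real mean (≈ 5) as training works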

Reasoning-based data science interview questions

Here's a curated list of questions designed to test both technical knowledge and reasoning ability, spanning statistics, machine learning, programming, and problem-solving:

27. Write a function to calculate the Euclidean distance between two points in n-dimensional space. Then, optimize it for large-scale data.

The Euclidean distance between two points A and B in n-dimensional space is calculated using the formula:

d(A, B) = √((A₁ − B₁)² + (A₂ − B₂)² + … + (Aₙ − Bₙ)²)

For large-scale data, we optimize it using vectorized operations with NumPy instead of looping over dimensions.

Example:
Here’s a function to compute Euclidean distance efficiently:

import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

# Example
point1 = [3, 4, 5]
point2 = [1, 1, 1]
print(euclidean_distance(point1, point2))  # Output: ≈ 5.385

Optimization for Large-Scale Data:

For datasets with millions of points, we can use NumPy broadcasting or Scipy’s cdist function:

from scipy.spatial.distance import cdist

def batch_euclidean(A, B):
    return cdist(A, B, metric='euclidean')

# Example: Distance between multiple points
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[0, 0], [1, 1]])
print(batch_euclidean(A, B))  # Returns distance matrix

How would you implement a binary search algorithm? What are its time and space complexities, and in what scenarios is it preferable to linear search?

Binary search efficiently finds an element in a sorted list by repeatedly dividing the search space in half.

Implementation:

def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = left + (right - left) // 2  # Prevents overflow in large lists
        if arr[mid] == target:
            return mid  # Target found, return index
        elif arr[mid] < target:
            left = mid + 1  # Search in the right half
        else:
            right = mid - 1  # Search in the left half
    return -1  # Target not found

Time & Space Complexity

  • Time Complexity:
    • Best case: O(1) (when the target is at the middle index)
    • Worst/Average case: O(log n) (the list is halved at each step)
  • Space Complexity:
    • Iterative version: O(1) (uses only a few extra variables)
    • Recursive version: O(log n) (due to the recursive call stack)

When is it preferable to linear search?

You can use Linear Search when:

  • The list is unsorted (sorting takes O(n log n), which is costly).
  • The list is small, where O(n) is not a big issue.

| Scenario                          | Binary Search | Linear Search |
|-----------------------------------|---------------|---------------|
| Sorted data                       | Yes           | No            |
| Small data (few elements)         | No            | Yes           |
| Large data (millions of elements) | Yes           | No            |
| Dynamic list                      | No            | Yes           |

Example Comparison:

Searching for a name in a sorted phonebook? Use Binary Search

Finding a rare letter in an unsorted paragraph? Use Linear Search

28. Given two tables, orders (order_id, customer_id, amount) and customers (customer_id, signup_date), write a query to find the average order amount for customers who signed up in 2023.

We need to join the orders table with the customers table based on customer_id and filter customers who signed up in 2023.

SQL Query

SELECT AVG(o.amount) AS avg_order_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.signup_date BETWEEN '2023-01-01' AND '2023-12-31';

Explanation

  • JOIN customers c ON o.customer_id = c.customer_id → Links both tables
  • WHERE c.signup_date BETWEEN '2023-01-01' AND '2023-12-31' → Filters customers from 2023
  • AVG(o.amount) → Computes the average order value

Example Dataset

Orders Table

| order_id | customer_id | amount |
|----------|-------------|--------|
| 101      | 1           | 200    |
| 102      | 2           | 150    |
| 103      | 3           | 300    |

Customers Table

| customer_id | signup_date |
|-------------|-------------|
| 1           | 2023-02-10  |
| 2           | 2022-11-15  |
| 3           | 2023-06-01  |

For this dataset, the average order amount for 2023 customers would be (200 + 300) / 2 = 250.

29. How would you handle missing data in a dataset? Discuss pros/cons of methods like mean imputation, k-NN imputation, or deletion.

To handle missing data, you can use mean imputation, k-NN imputation, or deletion.

Mean imputation replaces missing values with the column’s average, which is simple but may reduce accuracy. k-NN imputation finds similar data points to estimate missing values, making it more precise but slower.

Deletion removes incomplete rows or columns, which is easy but may lose important data. The best method depends on the dataset’s size and completeness.

| Method                       | Pros                                                         | Cons                                                                |
|------------------------------|--------------------------------------------------------------|---------------------------------------------------------------------|
| Deletion (drop rows/columns) | Works well when little data is missing                       | Loses valuable info if many rows are dropped                         |
| Mean/Median imputation       | Simple and quick, keeps all data                             | Can distort the data distribution; doesn't work well with outliers   |
| k-NN imputation              | More accurate than mean imputation, preserves relationships  | Slow for large datasets                                              |

Example:

Given this dataset:

| Age | Salary |
|-----|--------|
| 25  | 50000  |
| 30  | NaN    |
| 35  | 80000  |

  • Mean Imputation:

Replace missing salary with the mean of available salaries.


Implementation:

import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35], 'Salary': [50000, None, 80000]})
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)

Output:

| Age | Salary  |
|-----|---------|
| 25  | 50000.0 |
| 30  | 65000.0 |
| 35  | 80000.0 |

  • K-NN Imputation:

Fills missing values based on similar data points.


Implementation

from sklearn.impute import KNNImputer
import numpy as np

data = np.array([[25, 50000], [30, np.nan], [35, 80000]])
imputer = KNNImputer(n_neighbors=2)
data_imputed = imputer.fit_transform(data)
print(data_imputed)

Output:

| Age  | Salary  |
|------|---------|
| 25.0 | 50000.0 |
| 30.0 | 65000.0 |
| 35.0 | 80000.0 |

  • Deletion (Dropping rows or columns):

Given dataset:

| ID | Age | Salary | Department |
|----|-----|--------|------------|
| 1  | 25  | 50000  | HR         |
| 2  | 30  | NaN    | IT         |
| 3  | 35  | 80000  | NaN        |

Implementation:

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Age': [25, 30, 35],
    'Salary': [50000, None, 80000],
    'Department': ['HR', 'IT', None]
})

# Drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)

Output:

| ID | Age | Salary  | Department |
|----|-----|---------|------------|
| 1  | 25  | 50000.0 | HR         |

Since rows 2 and 3 have NaN values, they are removed. 

Why Enroll in our data science course?

Want a high-paying job in tech? Our Data Science Course in Hyderabad gives you the skills to work with data, build AI models, and solve real-world problems. 

  • 6 months of expert training 
  • Hands-on projects (like rain prediction & chatbots) 
  • 300 hours of live classes 
  • Job placement support 
  • No prior coding needed

Join 2,700+ students who landed great jobs!

Here you can check our recent student placements at Codegnan.

Whether you're a beginner or an IT professional looking to upskill, our flexible learning options and 100% placement assistance make us the perfect choice. Secure a high-paying career in data science. Contact us today.
