Class 12 AI Code 843 Unit 2 Data Science Methodology NCERT Book Solution
UNIT 2: Data Science Methodology: An Analytic Approach to Capstone Project

Activity – Solution
Activity 1
Mr. Pavan Sankar visited a food festival and wants to sample various cuisines. But due to health concerns, he has to avoid certain dishes. However, since the dishes were not categorized by cuisine, he found it challenging and wished for assistance in identifying the cuisines offered.
Q1. Are you ready to help Pavan Sankar?
Ans: Yes
Q2. What do you think is the actual problem of Pavan?
Ans: Pavan could not identify the cuisines of the dishes kept at the food festival.
Q3. Can we predict the cuisine of a given dish using the name of the dish only?
Ans: No, it is difficult, as the names of many dishes may be unfamiliar.
Q4. Let’s check. The following dish names were taken from the menu of a restaurant in India:
● Bubble and Squeak
● Phan Pyat
● Jadoh
Are you able to tell the cuisines of these dishes?
Ans: No
Q5. Is it possible to determine the dish name and cuisine of a dish with its images alone?
Ans: No. Some images will look similar, so it is difficult to identify the cuisine.
Q6. What about determining the cuisine of a dish based on its ingredients?
Ans: To a certain extent, we could identify the cuisine from the ingredients of the dishes.
Q7. What is the name of the dish prepared from the ingredients given in the Figure?

Ans: Vegetable Pulav

Activity 2
Mr. Pavan Sankar has set his goal to find the dish and its cuisine using its ingredients. He plans to proceed as shown in the flowchart in Fig 2.3.
Observe the flowchart and answer the questions.
Q1. Which type of analytics questioning is being utilized here?
A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics
Ans: B) Diagnostic Analytics
Q2. What type of approach is chosen here?
Ans: Classification Approach
Q3. Which algorithm is depicted in the figure given here?
Ans: Decision Tree
Activity 3
Mr. Pavan Sankar is now ready with a classification approach. Now he needs to identify the data
requirements.
Q1. Write down the name of two cuisines, five dishes from each cuisine and the ingredients needed for the five dishes separately.
Ans: Cuisine: Indian

| Dish | Ingredients |
|---|---|
| 1. Aloo Gobi | Potato, Cauliflower, Masalas, Oil, Salt |
| 2. Naan | Flour, Yeast, Salt, Milk |
| 3. Butter Chicken | Chicken, Butter, Masala, Oil, Salt |
| 4. Gulab Jamun | Dough, Oil, Sugar |
| 5. Poha | Rice, Potato, Turmeric |

Cuisine: Chinese

| Dish | Ingredients |
|---|---|
| 1. Manchow Soup | Vegetables, Ginger, Garlic, Soy sauce, Chilli |
| 2. Mapo Tofu | Tofu, Pork, Vegetables, Soy sauce, Chilli, Garlic |
| 3. Chow Mein | Noodles, Sesame oil, Chicken, Garlic, Soy sauce |
| 4. Chicken Fried Rice | Rice, Vegetables, Chicken, Soy sauce, Oil, Salt |
| 5. Char Siu | Soy sauce, Garlic, Honey, Spices, Sesame oil |
Q2. To collect the data on ingredients, in what format should the data be collected?
Ans:
- Data can be collected in a table format.
- A text file can also be created.
- For available dishes, images can also be collected.
Activity 4
Q1. If you need the dish names of American cuisine, how will you collect the data?
Ans: To get the dish names of American cuisine, we can use web scraping. Personal interviews with Americans are also possible if any are available nearby.
Q2. You want to try out some healthy recipes in the Indian culture. Mention the different ways you could collect the data.
Ans: Collect the data directly from the places where the culture is maintained, interview grandparents, or refer to textbooks that have the cultural context.
Q3. How can you collect a large amount of data and where can it be stored?
Ans: A large amount of data can be collected through online sources; many websites provide large datasets free of cost. The data can be stored as a CSV file or in a relational database in the cloud.
Activity 5
Q1. Semolina, which is called rava or suji in Indian households, is a by-product of durum wheat. Name a few dishes made from semolina. How will you differentiate the data of different dishes?
Ans: Upma, Rava Kichadi, Kesari, Suji Pancakes, etc.
The main ingredient of all these dishes is suji. Depending on whether salt or sugar is added, the dish becomes savoury or sweet.
With different ingredients added to the base ingredient suji, different types of dishes are made.
Q2. Given below is a sample of data collected during the data collection stage. Let us try to understand it.

a. Basic ingredients of sushi are rice, soy sauce, wasabi and vegetables. Is the dish listed in the data? Are all ingredients available?
Ans: Yes, the dish is listed. However, vegetables and soy sauce are not available in the data.
b. Find out the ingredients for the dish “Pulao”. Check for invalid data or missing data.
Ans: Common ingredients for the dish “Pulao” are Rice, Vegetables, Oil, Garlic, Soy sauce, Salt, Chilli and Onion.
Here, in the data, Garlic is not found (missing data).
c. Inspect all columns for invalid, incorrect or missing data and list them below.
Ans: Invalid: Salt (Y), Oil (N), Sugar (N)
Incorrect: Rice (one), Chicken (2)
Missing data: Potato, Soy sauce
d. Which ingredient is common to all dishes? Which ingredient is not used for any dish?
Ans: Common to all dishes: Rice. Not used for any dish: Potato.
Activity 6
Q1. Are there any textual mistakes in the data given in Table 1? Mention them, if any.
Ans: Yes. For the dish Pulao, the value of the Country column is given as “Indiana” instead of “Indian”.
Q2. In Table 1, incorrect data was identified in the columns rice and chicken. Write the possible ways to rectify them.
Ans: In the column Rice, “one” is written instead of the numeral 1; it can be rectified by replacing “one” with 1. In the column Chicken, “2” is written; it can be rectified by changing it to the format used consistently by the rest of that column (for example, Y/N).
Q3. Is the first column name appropriate? Can you suggest a better name?
Ans: No. As this is a table of dishes and cuisines, the column name “Country” can be replaced with “Cuisine”.
Q4. First three values of the first column seem to be similar. Do we need to make any corrections to this data?
Ans: The first three values are similar because the dishes belong to the same cuisine. However, “Indiana” should be changed to “Indian”.
Q5. Do the dishes with common ingredients come under the same cuisine? Why?
Ans: Yes, some ingredients may be common to dishes under the same cuisine. This is determined by the culture and food habits of the people associated with that cuisine.
Activity 7
Q1. Name two programming languages which can be used to implement the Decision Tree Algorithm.
Ans: Python, R
Q2. In the problem of identifying dish name and cuisine, if we choose the Decision Tree algorithm to solve the problem and Python as the tool, name some libraries which will help in the implementation.
Ans: numpy, pandas, re, sklearn, matplotlib, itertools, random
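To see how these libraries fit together, here is a minimal sketch (not from the textbook) that trains a Decision Tree on a tiny, made-up ingredient table; the dishes, ingredients and column names below are illustrative assumptions only.

```python
# Minimal sketch: predicting cuisine from ingredient flags with sklearn.
# The data below is made up for illustration only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "soy_sauce": [0, 0, 1, 1],   # 1 = ingredient used, 0 = not used
    "masala":    [1, 1, 0, 0],
    "garlic":    [0, 1, 1, 1],
    "cuisine":   ["Indian", "Indian", "Chinese", "Chinese"],
})

X = data.drop(columns="cuisine")  # features: ingredient indicators
y = data["cuisine"]               # target: cuisine label

model = DecisionTreeClassifier()
model.fit(X, y)

# Predict the cuisine of a new dish that uses soy sauce and garlic
new_dish = pd.DataFrame({"soy_sauce": [1], "masala": [0], "garlic": [1]})
print(model.predict(new_dish))    # expected: ['Chinese']
```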
Activity 8
Q1. In the cuisine identification problem, on which set will the Decision tree be built: Training or Test?
Ans: Training
Q2. Name any diagnostic metric which can be used to determine an optimal classification model.
Ans: Confusion Matrix, Log Loss, etc.
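As a quick illustration (with made-up labels, not taken from the textbook), both of these metrics are available in sklearn.metrics:

```python
# Hedged sketch: evaluating a classifier with a Confusion Matrix and log loss.
from sklearn.metrics import confusion_matrix, log_loss

# Made-up true and predicted labels (1 = Indian, 0 = Chinese)
y_test = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
print(confusion_matrix(y_test, y_pred))

# Log loss needs predicted probabilities; columns are [P(class 0), P(class 1)]
y_prob = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.4], [0.9, 0.1], [0.3, 0.7]]
print(log_loss(y_test, y_prob))
```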
Activity 9
Q1. Mention some ways to embed the solution into mobiles or websites.
Ans: The trained model may be packaged into APK files and integrated into mobile apps (using Thunkable), or embedded into websites created with Weebly.
EXERCISES – Solution
A. Objective-type questions
1. Which is the hardest stage in the foundational methodology of Data Science?
a. Business Understanding
b. Data collection
c. Modelling
d. Evaluation
Ans: a. Business Understanding
2. Business sponsors define the problem and project objectives from a ______ perspective.
a. Economic
b. Feedback
c. Business
d. Data Collection
Ans: c. Business
3. Match the following and choose the correct options:
i. Descriptive approach A. Statistical Analysis
ii. Diagnostic approach B. Current Status
iii. Predictive approach C. How to solve it?
iv. Prescriptive approach D. Probabilities of action
a. (i)—A , (ii)—B, (iii) – C , (iv)—D
b. (i)—B , (ii)—A, (iii) – D , (iv)—C
c. (i)—D , (ii)—B, (iii) – A , (iv)—C
d. (i)—A , (ii)—C, (iii) – B , (iv)—D
Ans: b. (i)—B , (ii)—A, (iii) – D , (iv)—C
4. Arrange the following statements in order
i: Gaps in data will be identified and plans to fill/make substitutions will have to be made
ii: Decisions are made whether the collection requires more data or not
iii: Descriptive statistics and visualization is applied to dataset
iv: Identify the necessary data content, formats and sources
a. i, ii, iii, iv
b. iv, ii, iii, i
c. i, iii, ii, iv
d. ii, i, iii, iv
Ans: b. iv, ii, iii, i
5. Data Modelling focuses on developing models that are either ________ or _________.
a. Supervised, Unsupervised
b. Predictive, Descriptive
c. Classification, Regression
d. Train-test split, Cross Validation
Ans: b. Predictive, Descriptive
6. Answer the question based on the given statement.
Statement 1- There is no optimal split percentage
Statement 2- The most common split percentage between training and testing data is 20%-80%
a. Statement 1 is true Statement 2 is false
b. Statement 2 is true Statement 1 is false
c. Both Statement 1 and 2 are true
d. Both Statement 1 and 2 are false
Ans: a. Statement 1 is true Statement 2 is false
7. Train-test split function is imported from which Python module?
a. sklearn.model_selection
b. sklearn.ensemble
c. sklearn.metrics
d. sklearn.preprocessing
Ans: a. sklearn.model_selection
8. Identify the incorrect statement:
i. cross-validation gives a more reliable measure of your model’s quality
ii. cross-validation takes short time to run
iii. cross-validation gets multiple measures of model’s quality
iv. cross-validation is preferred with small data
a. ii and iii
b. iii only
c. ii only
d. ii, iii and iv
Ans: c. ii only
9. Identifying the necessary data content, formats and sources for initial data collection is done in which step of Data Science methodology?
a. Data requirements
b. Data Collection
c. Data Understanding
d. Data Preparation
Ans: a. Data requirements
10. Data sets are available online. From the given options, which one does not provide online data?
a. UNICEF
b. WHO
c. Google
d. Edge
Ans: d. Edge
11. A _____ set is a set of historical data in which outcomes are already known.
a. Training set
b. Test set
c. Validation set
d. Evaluation set
Ans: a. Training Set
12. ______ data set is used to evaluate the fit machine learning model.
a. Training set
b. Test set
c. Validation set
d. Evaluation set
Ans: b. Test Set
13. x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
From the above line of code, identify the training data set size.
a. 0.2
b. 0.8
c. 20
d. 80
Ans: b. 0.8
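This can be verified with a short sketch (the 100-sample data below is made up):

```python
# test_size=0.2 reserves 20% for testing and keeps 80% for training.
from sklearn.model_selection import train_test_split

x = list(range(100))    # 100 hypothetical samples
y = [i % 2 for i in x]  # hypothetical labels

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(len(x_train), len(x_test))  # -> 80 20
```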
14. In k-fold cross validation, what does k represent?
a. number of subsets
b. number of experiments
c. number of folds
d. all of the above
Ans: d. all of the above
15. Identify the correct points regarding MSE given below:
i. MSE is expanded as Median Squared Error
ii. MSE is the standard deviation of the residuals
iii. MSE is preferred with regression
iv. MSE penalizes large errors more than small errors
a. i and ii
b. ii and iii
c. iii and iv
d. ii, iii and iv
Ans: c. iii and iv
B. Short Answer Questions
1. How many steps are there in Data Science Methodology? Name them in order.
Ans: There are 10 steps in Data Science Methodology. They are Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment and Feedback
2. What do you mean by Feature Engineering?
Ans: Feature Engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work.
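A minimal sketch (with a made-up table, in the spirit of Activity 5) of engineering a feature from domain knowledge:

```python
# Domain knowledge: a suji dish with sugar and no salt is usually sweet.
import pandas as pd

dishes = pd.DataFrame({
    "dish":  ["Kesari", "Upma"],
    "sugar": [1, 0],
    "salt":  [0, 1],
})
# New feature derived from domain knowledge, not present in the raw data
dishes["is_sweet"] = (dishes["sugar"] == 1) & (dishes["salt"] == 0)
print(dishes)
```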
3. Data is collected from different sources. Explain the different types of sources with examples.
Ans: Data can be collected from two types of sources: primary data sources and secondary data sources.
(a) Primary sources are sources created specifically to collect the data for analysis. Examples include interviews, surveys, marketing campaigns, feedback forms, IoT sensor data, etc.
(b) Secondary data is data which is already stored and ready for use. Data given in books, journals, websites, internal transactional databases, etc., are some examples.
4. Which step of Data Science Methodology is related to constructing the data set? Explain.
Ans: The Data Understanding stage is related to constructing the data set. In this stage, we check whether the data collected represents the problem to be solved and evaluate whether it is relevant, comprehensive, and suitable for addressing the specific problem or question at hand. Techniques such as descriptive statistics and visualization can be applied to the dataset to assess its content, quality, and initial insights.
5. Write a short note on the steps done during Data Preparation.
Ans: Data Preparation is the most time-consuming stage. Here data is transformed into a state where it is easier to work with. Feature Engineering is also a part of Data Preparation.
Data preparation includes
- cleaning the data (dealing with invalid or missing values, removing duplicates and giving the data a suitable format)
- combining data from multiple sources (archives, tables and platforms)
- transforming data into meaningful input variables
6. Differentiate between descriptive modelling and predictive modelling.
Ans: Descriptive modelling and Predictive modelling are based on the analytic approach that was taken, either statistically driven or machine learning driven.
- Descriptive modeling is a mathematical process that describes real-world events and the relationships between factors responsible for them. An example of a descriptive model might examine things like: if a person did this, then they are likely to prefer that.
- Predictive modeling is a process that uses data mining and probability to forecast outcomes. For example, a predictive model tries to yield yes/no or stop/go type outcomes. The data scientist uses a training set for predictive modeling.
7. Explain the different metrics used for evaluating Classification models.
Ans:
Confusion Matrix
A Confusion Matrix is a table used to evaluate the performance of a classification model. It summarizes the predictions against the actual outcomes.
Precision and Recall
Precision measures “What proportion of predicted Positives is truly Positive?”
Precision = (TP)/(TP+FP).
Precision should be as high as possible.
Recall measures “What proportion of actual Positives is correctly classified?”
Recall = (TP)/(TP+FN)
F1-score
A good F1 score means that you have low false positives and low false negatives, so you’re correctly identifying real threats, and you are not disturbed by false alarms. An F1 score is considered perfect when it is 1, while the model is a total failure when it is 0.
F1 = 2* (precision * recall)/(precision + recall)
Accuracy
Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP+TN)/(TP+FP+FN+TN)
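The four formulas above can be implemented directly, as in this small sketch (the counts used are made up):

```python
# Computes Precision, Recall, F1-score and Accuracy from confusion-matrix counts.
def classification_metrics(TP, FP, FN, TN):
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * precision * recall / (precision + recall)
    accuracy  = (TP + TN) / (TP + FP + FN + TN)
    return precision, recall, f1, accuracy

# Example with hypothetical counts
print(classification_metrics(TP=50, FP=10, FN=5, TN=35))
```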
8. Is Feedback a necessary step in Data Science Methodology? Justify your answer.
Ans: Yes, feedback is necessary. Feedback from the users will help refine the model and assess it for performance and impact. Data scientists can automate any or all of the feedback so that the model-refresh process speeds up and delivers improved results quickly. The value of the model depends on successfully incorporating feedback and making adjustments for as long as the solution is required.
9. Write a comparative study on train-test split and cross validation.
Ans: Train-Test Split vs Cross Validation
| Train-Test Split | Cross Validation |
|---|---|
| Normally applied on large data sets. | Normally applied on small data sets. |
| Divides the data into a training data set and a testing data set. | Divides the dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the remaining data. |
| Gives the accuracy of the model on the validation data set. | Gives the average accuracy across all validation sets. |
| Clear demarcation between training data and testing data. | Every data point, at some stage, is in either the testing or the training data set. |
10. Why is model validation important?
Ans: Model Validation offers a systematic approach to measuring a model's accuracy and reliability, providing insights into how well it generalizes to new, unseen data.
The benefits of Model Validation include
- Enhancing the model quality.
- Reducing the risk of errors.
- Preventing the model from overfitting and underfitting.
C. Long Answer Questions
1. Explain the procedure of k-fold cross validation with suitable diagram.
Ans: In k-fold cross validation we work with k subsets of the dataset. For example, if we have 5 folds or experiments (here k = 5), we divide the data into 5 pieces, each being 20% of the full dataset.
We run an experiment called experiment 1 which uses the first fold as a holdout set, and everything else as training data. This gives us a measure of model quality based on a 20% holdout set. We then run a second experiment, where we hold out data from the second fold (using everything except the 2nd fold for training the model.) This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
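In scikit-learn, the whole procedure is one call, as in this hedged sketch (the dataset here is synthetic, generated only for illustration):

```python
# 5-fold cross validation: 5 experiments, each holding out a different 20%.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)  # toy data
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across the 5 holdout sets
```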

2. Data is the main part of any project. How will you find the requirements of data, collect it, understand the data and prepare it for modelling?
Ans: For any model, the data is made ready in four steps.
a) Data Requirements: In the data requirements stage, we identify the necessary data content, formats, and sources for initial data collection. 5W1H questions may be employed. Here we identify the types of data required and decide how to store the data, considering the structure in which it should be organized, whether in a table, a text file, or a database. We also identify the sources from which we can collect the data, along with any necessary cleaning or organization steps.
b) Data Collection: Data collection is a systematic process of gathering observations or measurements. In this phase, the data requirements are revisited and decisions are made as to whether the collection requires more or less data. Today's high-performance database analytics enable data scientists to utilize large datasets. Data can be collected from primary data sources (surveys, interviews, etc.) or secondary data sources (social media data tracking, web scraping, etc.).
c) Data Understanding: This stage is related to constructing the data set. We check whether the data collected represents the problem to be solved and evaluate whether it is relevant, comprehensive, and suitable for addressing the specific problem or question at hand. Techniques such as descriptive statistics and visualization can be applied to the dataset to assess its content, quality, and initial insights.
d) Data Preparation: Data Preparation is the most time-consuming stage. Here data is transformed into a state where it is easier to work with. Data preparation includes cleaning the data, combining data from multiple sources, and transforming data into meaningful input variables. Feature Engineering is also a part of Data Preparation.
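A short pandas sketch (with a made-up frame, echoing the fixes identified in Activity 6) of what such preparation can look like:

```python
# Typical cleaning steps: rename a column, fix text values, fix types,
# and fill missing values. The frame below is illustrative only.
import pandas as pd

df = pd.DataFrame({
    "Country": ["Indian", "Indiana", "Chinese"],
    "Rice":    ["1", "one", "1"],
    "Salt":    ["Y", None, "N"],
})

df = df.rename(columns={"Country": "Cuisine"})              # better name
df["Cuisine"] = df["Cuisine"].replace("Indiana", "Indian")  # textual fix
df["Rice"] = df["Rice"].replace("one", "1").astype(int)     # type fix
df["Salt"] = df["Salt"].fillna("N")                         # fill missing
print(df)
```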
D. Case study
1. Calculate MSE and RMSE values for the data given below using MS Excel.

Ans: Steps in Excel
1. Enter the Data as given above.

2. Calculate Squared Errors.
- Label column C as Squared Error, i.e., type “Squared Error” in cell C1.
- In cell C2, enter the formula =(A2-B2)^2
- Drag or copy this formula down to cell C11.

3. Calculate the Mean Squared Error (MSE)
- In cell A12, type Mean Squared Error (MSE).
- In cell C12, enter the formula =AVERAGE(C2:C11)
- This computes the average of all squared errors.
4. Calculate the Root Mean Squared Error (RMSE)

- In cell A13, type Root Mean Squared Error (RMSE).
- In cell C13, enter the formula =SQRT(C12)
- This takes the square root of the MSE to compute the RMSE.
For the given data: MSE = 62.5 and RMSE = 7.91
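The same computation can be done in Python; since the table's values are in the figure above, the numbers in this sketch are placeholders only:

```python
# MSE is the mean of the squared errors; RMSE is its square root.
actual    = [10, 20, 30]  # placeholder values, not the table's data
predicted = [12, 18, 33]  # placeholder values

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse  = sum(squared_errors) / len(squared_errors)
rmse = mse ** 0.5
print(mse, rmse)
```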

2. Given a Confusion matrix, calculate Precision, Recall, F1 score, and Accuracy.

Ans:
Summary of Metrics:
- Precision: 78.9%
- Recall: 83.3%
- F1 Score: 81.0%
- Accuracy: 82.5%
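As a cross-check, counts of TP = 15, FP = 4, FN = 3 and TN = 18 reproduce these percentages; these counts are assumed here, since the actual matrix is given in the figure above:

```python
# Assumed counts consistent with the summary above (the original
# confusion matrix is in the figure, which is not reproduced here).
TP, FP, FN, TN = 15, 4, 3, 18

precision = TP / (TP + FP)                                  # 0.789... -> 78.9%
recall    = TP / (TP + FN)                                  # 0.833... -> 83.3%
f1        = 2 * precision * recall / (precision + recall)   # ~0.811   -> ~81%
accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 0.825    -> 82.5%
print(precision, recall, f1, accuracy)
```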
E. Competency-Based Questions
1. A transportation company aims to optimize its delivery routes and schedules to minimize costs and improve delivery efficiency. The company wants to use Data Science to identify the most optimal routes and delivery time windows based on historical delivery data and external factors such as traffic and weather conditions. Various questions are targeted by the data scientist to achieve this business goal. Identify the analytical approach model that can be used for each.
- a) determine the most suitable delivery routes for perishable goods, ensuring timely deliveries without explicitly using past data to make predictions.
- b) gather insights on the average delivery times for different vehicle types, how they vary based on the complexity of delivery route.
- c) group delivery routes into different categories based on the average delivery time and order volume.
Ans:
a. Predictive model
b. Descriptive model
c. Classification model
2. A leading investment firm aims to improve their client portfolio management system. They want to know whether Artificial Intelligence could be used to better understand clients’ investment preferences and risk tolerance levels. Which stage of Data Science methodology can you relate this to?
Ans: Business Understanding
3. An Online Learning Platform has implemented a recommendation system to suggest personalized courses to users. They need to assess the effectiveness and accuracy of this system. Which stage of Data Science methodology can you relate this to?
Ans: Evaluation
4. A data scientist is working to improve public transportation services by analyzing commuter travel patterns. He has encountered a scenario where he needs to understand the impact of major events on commuter behavior. For instance, the city is hosting a large-scale sporting event, and the data scientist needs to assess how this event affects commuting patterns, such as changes in peak travel times, shifts in preferred modes of transportation, and alterations in popular routes.
Which stage of Data Science methodology is he in? List the steps he needs to follow.
Ans: The data scientist in this scenario is in the stage of Data Collection within the Data Science methodology.
To address the scenario effectively, the data scientist should:
- Identify Relevant Data Sources
- Gather Data
- Clean and Prepare Data
- Analyze Data
- Interpret Results
5. A data scientist is tasked with developing a machine learning model to predict customer churn for a small e-commerce startup. Only a limited dataset is available for this task. The dataset contains information about customer demographics, purchase history, website interactions, and whether they churned or not. Considering the challenge posed by the limited dataset size, which approach would you recommend the data scientist use for training the churn prediction model: a simple train-test split or cross-validation? Justify your recommendation with regard to the dataset's size and generalizability.
Ans: Considering the limited dataset size and the need for robustness and generalizability in the model, I would recommend the cross-validation approach. Cross-validation involves splitting the dataset into multiple subsets (folds), training the model on different combinations of these subsets, and evaluating its performance on the remaining data. This mitigates the risk of overfitting or underfitting the model, which is common with a small dataset. Additionally, cross-validation maximizes the utilization of limited data by using each data point for both training and validation across multiple folds, eliminating the need for additional data.
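A hedged sketch of this recommendation (with synthetic data standing in for the startup's dataset, which is not available here):

```python
# With a small dataset, 5-fold cross validation gives a steadier quality
# estimate than a single train-test split. For classifiers, cv=5 uses
# stratified folds, which helps when churners are a minority class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, weights=[0.8], random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average quality and its variability
```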
6. Identify the type of big data analytics (descriptive, predictive) used in the following:
a. A clothing brand monitors social media mentions to understand customer perception. It uses data in the form of social media posts, comments, and reviews containing brand mentions to get a clear picture of overall customer sentiment and areas where they excel or fall short.
b. A factory aims to predict equipment failures before they occur to minimize downtime. It uses sensor data from machines (temperature, vibration, power consumption) coupled with historical maintenance records to identify patterns in sensor data that indicate an impending equipment failure.
Ans:
a. Descriptive Analytics. This involves summarizing and analyzing historical social media data to understand customer sentiment and perception towards the clothing brand.
b. Predictive Analytics. Specifically, it involves using historical sensor data from machines to predict equipment failures before they occur, minimizing downtime.
7. Identify the type of big data analytics (diagnostic, prescriptive) used in the following:
a. A subscription service experiences a rise in customer cancellations. It uses customer account information, usage data (frequency of logins, features used), and support ticket logs to identify potential reasons for churn.
b. A food delivery service wants to improve delivery efficiency and reduce delivery times. It uses customer location data, order details, historical delivery times, and traffic patterns to calculate the most efficient delivery routes.
Ans:
a. Diagnostic Analytics. This involves analyzing customer account information, usage data, and support ticket logs to diagnose potential reasons for customer cancellations or churn.
b. Prescriptive Analytics. This involves analyzing customer location data, order details, historical delivery times, and traffic patterns to prescribe the most efficient delivery routes and reduce delivery times.