Protein_Prediction

Names: Ethan Deng, Jason Gu

Overview

Here is our exploratory data analysis on this dataset: Here

Framing the Problem

In this project, we build a model that predicts the amount of protein in a recipe from the other values in the nutrition column: calories, total_fat, sugar, sodium, saturated fats, and carbohydrates. These features appear to be correlated with the amount of protein in a recipe. This is a regression problem rather than a classification problem, because we are predicting a quantitative value (the amount of protein in grams).
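As a rough illustration of this setup, the sketch below separates one nutrition entry into input features and the protein target. It assumes each entry is stored as a string holding a seven-value list in the order shown in `NUTRITION_FIELDS`; that storage format, the field order, the helper name `split_nutrition`, and the example row are all assumptions made for illustration, not details taken from our code.

```python
import ast

# Assumed field order inside each nutrition entry (not stated in this write-up).
NUTRITION_FIELDS = [
    "calories", "total_fat", "sugar", "sodium",
    "protein", "saturated_fat", "carbohydrates",
]
TARGET = "protein"


def split_nutrition(raw):
    """Parse one nutrition string into (feature dict, protein target)."""
    values = ast.literal_eval(raw)
    if len(values) != len(NUTRITION_FIELDS):
        raise ValueError(
            f"expected {len(NUTRITION_FIELDS)} values, got {len(values)}"
        )
    record = dict(zip(NUTRITION_FIELDS, map(float, values)))
    target = record.pop(TARGET)  # protein is the value we want to predict
    return record, target


# Made-up example row, for illustration only.
features, protein = split_nutrition("[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]")
```

Removing protein from the feature dictionary keeps the target out of the model inputs.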

Baseline Model

[Image: screenshot, 2023-12-13 4:19:21 PM]

Final Model

For our final model, we moved away from the linear regression model because it performed poorly on our evaluation metrics: R^2, RMSE, and MAE. We decided to try a Random Forest Regressor instead, because random forests tend to handle imbalanced data better and are less prone to overfitting.
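For reference, the three evaluation metrics named above can be written out directly. This is a minimal sketch using only the standard library; the function name `regression_metrics` and the sample values are hypothetical and are not results from our models.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)                 # root mean squared error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                   # coefficient of determination

    return {"r2": r2, "rmse": rmse, "mae": mae}


# Made-up values, for illustration only.
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

A higher R^2 and lower RMSE and MAE indicate a better fit. RMSE penalizes large errors more heavily than MAE does.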

Algorithm and Hyperparameters

We chose the Random Forest Regressor as our prediction model because it is better at finding non-linear relationships between the inputs and outputs, is less prone to overfitting than other complex models such as single decision trees, and can handle datasets with irrelevant features without a significant loss in performance.

We used GridSearchCV with the number of folds varying from 5 to 15 to find the best hyperparameters. The search is left as comments in the code because it takes a long time to run.

For hyperparameters, we tuned a combination of the number of estimators, the maximum depth, and the maximum number of features.
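A grid search over these three hyperparameters could look roughly like the sketch below. The data is synthetic, and the candidate values in `param_grid`, the RMSE-based scoring choice, and the fixed `random_state` are assumptions for illustration; the values actually searched are not listed in this write-up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 6 feature columns and a non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(150, 6))
y = 0.1 * X[:, 0] + np.sqrt(X[:, 1]) + rng.normal(0, 1, size=150)

# Hypothetical candidate values; the grid actually searched is not given.
param_grid = {
    "n_estimators": [5, 15],
    "max_depth": [None, 5],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,  # the write-up varied the fold count between 5 and 15
    scoring="neg_root_mean_squared_error",  # assumed scoring choice
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
```

Every combination in the grid is fitted once per fold, so the running time grows with both the grid size and the number of folds, which is why a search like this can take a long time on the full dataset.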

The hyperparameters that ended up performing the best in the new model are as follows:

Number of Estimators (n_estimators): 15
Maximum Depth of Trees (max_depth): None
Maximum Features (max_features): 'sqrt'

Comparing the performance of the baseline model to the new model:

Fairness Analysis

In this analysis, we compare the RMSE of two groups: meat recipes (Group X) and non-meat recipes (Group Y). A recipe is classified as a meat recipe if any of the words ‘beef’, ‘chicken’, ‘pork’, or ‘fish’ appears in the ingredients column. RMSE is the evaluation metric used to measure prediction accuracy in each group. We want to determine whether there is a significant difference in the prediction accuracy of protein content (as measured by RMSE) between these two groups.
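The procedure described above can be sketched as follows. This is a minimal standard-library version that assumes the test statistic is the absolute difference in RMSE between the two groups and that both groups are non-empty. The statistic, the number of permutations, the helper names, and the sample data are all hypothetical and may differ from what our code actually does.

```python
import math
import random

MEAT_WORDS = ("beef", "chicken", "pork", "fish")


def is_meat(ingredients):
    """True if any meat keyword appears in the ingredients text."""
    text = ingredients.lower()
    return any(word in text for word in MEAT_WORDS)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def permutation_test(errors, labels, n_permutations=1000, seed=0):
    """Return (observed |RMSE difference|, empirical p-value)."""
    def abs_diff(lbls):
        group_x = [e for e, m in zip(errors, lbls) if m]       # meat
        group_y = [e for e, m in zip(errors, lbls) if not m]   # non-meat
        return abs(rmse(group_x) - rmse(group_y))

    observed = abs_diff(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if abs_diff(shuffled) >= observed:
            hits += 1
    return observed, hits / n_permutations


# Made-up ingredients and prediction errors, for illustration only.
ingredients = ["chicken breast, rice", "flour, sugar", "ground beef, onion",
               "lettuce, tomato", "pork loin, garlic", "oats, milk"]
errors = [6.0, 1.0, -5.0, 0.5, 7.0, -1.5]
labels = [is_meat(text) for text in ingredients]
observed, p_value = permutation_test(errors, labels)
```

Shuffling the labels keeps the group sizes fixed while breaking any link between group membership and prediction error, which is the situation the null hypothesis describes.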

Results:

RMSE for Meat Recipes: 5.660917830525564

RMSE for Non-Meat Recipes: 4.362381737844239

p-value from permutation test: 0.00

Conclusion:

The comparison of RMSE between meat and non-meat recipes yielded a statistically significant difference. The RMSE for non-meat recipes (4.362) is lower than that for meat recipes (5.661). The p-value of 0.00 is strong evidence against the null hypothesis that the model predicts protein content equally well for both groups. We therefore conclude that prediction accuracy differs between meat and non-meat recipes and that, by this RMSE criterion, our model is not fair.