
Machine Learning Projects
This project focuses on developing a robust Credit Risk Prediction Model, leveraging machine learning algorithms to assess the likelihood of loan default based on financial and demographic data.

This project focuses on developing a healthcare cost prediction model that lets healthcare businesses set premiums according to customer parameters.

1. Credit Risk Prediction Model for Loan Applications
Overview:
This project developed a robust Credit Risk Prediction Model that uses machine learning algorithms to assess the likelihood of loan default based on financial and demographic data.
1. Problem Statement:
Financial institutions need an efficient method for evaluating the creditworthiness of loan applicants. A predictive model that accurately forecasts default risks based on an applicant’s profile can help minimize financial risk and enhance decision-making.
2. Data Exploration and Preprocessing:
The project utilized loan application data that included demographic details, financial metrics, and historical loan information. Through thorough exploratory data analysis (EDA), key patterns and insights were derived, guiding feature engineering for better prediction accuracy.
Key Steps:
- Handling Missing Values: Missing values were imputed or removed based on domain knowledge.
- Feature Engineering: New features like the Loan-to-Income (LTI) ratio, Delinquency Ratio, and Average Days Past Due (DPD) were created to capture deeper insights into the applicant’s financial behavior and repayment capacity.
- Outlier Detection: Outliers in key numerical features were addressed to improve model stability.
3. Exploratory Data Analysis (EDA):
EDA helped uncover patterns that influence loan default risks, highlighting critical features for inclusion in the model.
- Age Distribution: Defaulters had a slightly younger average age (37.12) compared to non-defaulters (39.7), indicating that younger borrowers may be more likely to default.
- KDE Plots for Continuous Features: Features such as Loan Tenure, Delinquent Months, and Credit Utilization showed stronger associations with default, while others like Loan Amount and Income required feature transformations (e.g., LTI ratio).
- Feature Relationships: The analysis suggested that the Loan-to-Income (LTI) ratio and Delinquency Ratio are strong predictors of loan default.

4. Feature Engineering:
After identifying key variables from EDA, the following features were engineered to improve model performance:
- Loan-to-Income (LTI) Ratio: The ratio of loan amount to income, which captures the borrower’s financial strain.
- Delinquency Ratio: Reflects the frequency of missed payments during the loan term, indicating repayment behavior.
- Average Days Past Due (DPD): Measures the severity of missed payments, which significantly correlates with loan defaults.
These features were instrumental in distinguishing between defaulters and non-defaulters.
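The three engineered features can be sketched in a few lines of plain Python. This is only an illustration: the field names (`loan_amount`, `income`, `total_loan_months`, `delinquent_months`, `total_dpd`) are hypothetical, since the dataset's actual columns are not listed here, and expressing the delinquency ratio as a percentage is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LoanRecord:
    # Hypothetical field names; the real dataset columns may differ.
    loan_amount: float
    income: float
    total_loan_months: int
    delinquent_months: int
    total_dpd: int  # total days past due summed over the loan term


def loan_to_income(rec: LoanRecord) -> float:
    """Loan amount divided by income; 0.0 when income is not positive."""
    return rec.loan_amount / rec.income if rec.income > 0 else 0.0


def delinquency_ratio(rec: LoanRecord) -> float:
    """Share of loan months with a missed payment, as a percentage (assumed unit)."""
    if rec.total_loan_months <= 0:
        return 0.0
    return 100.0 * rec.delinquent_months / rec.total_loan_months


def avg_dpd(rec: LoanRecord) -> float:
    """Average days past due per delinquent month; 0.0 if never delinquent."""
    if rec.delinquent_months <= 0:
        return 0.0
    return rec.total_dpd / rec.delinquent_months
```

Guarding each division keeps applicants with zero income or no delinquent months from producing errors or infinite values.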



5. Feature Selection and Encoding:
To enhance model performance, feature selection and encoding techniques were employed:
- Numerical Features: Features with high multicollinearity were removed based on the Variance Inflation Factor (VIF). The final numerical features included age, number of dependents, loan tenure, and credit utilization.
- Categorical Features: Features like residence type, loan purpose, and loan type were selected based on their Information Value (IV) and Weight of Evidence (WOE). These features were then one-hot encoded to ensure compatibility with machine learning algorithms.
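For readers unfamiliar with WOE and IV, the sketch below uses the common textbook definitions: WOE for a category is the log of its share of non-defaulters over its share of defaulters, and IV sums the share difference times WOE. It is not the project's code. The function name `woe_iv` is made up for illustration, and categories lacking either class are simply skipped here, whereas production code usually applies smoothing.

```python
import math
from collections import Counter


def woe_iv(categories, defaults):
    """Return ({category: WOE}, IV) for one categorical feature.

    `defaults` holds 1 for a defaulter ("bad") and 0 for a non-defaulter ("good").
    """
    good, bad = Counter(), Counter()
    for cat, flag in zip(categories, defaults):
        (bad if flag == 1 else good)[cat] += 1
    total_good, total_bad = sum(good.values()), sum(bad.values())

    woe, iv = {}, 0.0
    for cat in sorted(set(good) | set(bad)):
        if good[cat] == 0 or bad[cat] == 0:
            continue  # WOE is undefined without both classes (simplification)
        dist_good = good[cat] / total_good
        dist_bad = bad[cat] / total_bad
        woe[cat] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[cat]
    return woe, iv
```

A higher IV suggests the feature separates defaulters from non-defaulters more strongly, which is why it is a common screening criterion for categorical features.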
6. Model Training and Optimization:
Multiple machine learning models were evaluated to find the most accurate and interpretable solution for credit risk prediction, and several attempts were made to improve model performance.
Modeling Attempts:
- Initial Exploration (Baseline): Logistic Regression, Random Forest, and XGBoost models were tested. All three achieved nearly the same F1 score, so Logistic Regression was chosen for its better interpretability.
- Handling Class Imbalance: Techniques like Random Under Sampler and SMOTE-Tomek were applied to address the class imbalance, improving recall for the minority class (default).
- Hyperparameter Optimization: RandomizedSearchCV and Optuna were used to fine-tune model parameters for better performance.
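To illustrate the idea behind random under-sampling, here is a minimal plain-Python sketch. It is not the library implementation used in the project: the function name `random_undersample` and the fixed seed are assumptions, and it balances the classes one-to-one by keeping every minority row and a random subset of majority rows.

```python
import random


def random_undersample(rows, labels, seed=42):
    """Keep all minority-class rows and an equally sized random majority sample."""
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)

    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class distribution.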
7. Final Model Evaluation:
Comparing recall, precision, and macro F1 score across the trained models, we concluded that Logistic Regression was the best model. After optimization, it achieved:
- Precision (Class 1 - Default): 0.56
- Recall (Class 1 - Default): 0.94
- F1-Score (Class 1): 0.70
- AUC: 0.98 (excellent ability to distinguish defaults from non-defaults)
- Gini Coefficient: 0.96 (indicates strong model performance in rank ordering)
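The reported figures can be cross-checked with two standard identities: F1 is the harmonic mean of precision and recall, and the Gini coefficient equals 2 × AUC − 1. The helper names below are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def gini_from_auc(auc: float) -> float:
    """Gini coefficient derived from the area under the ROC curve."""
    return 2 * auc - 1


print(round(f1_score(0.56, 0.94), 2))   # 0.7
print(round(gini_from_auc(0.98), 2))    # 0.96
```

The low precision alongside high recall reflects a model tuned to catch most defaulters at the cost of flagging some non-defaulters.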


8. Deployment - Streamlit Application:
The final step was developing a user-friendly application to enable loan officers to use the model for real-time credit risk predictions.
Streamlit Application Features:
- Input Fields: Users can input demographic and financial data like age, income, loan amount, and loan tenure.
- Loan-to-Income (LTI) Ratio: The LTI ratio is dynamically calculated and displayed.
- Prediction Results: Once the "Calculate Risk" button is pressed, the app computes and displays the default probability, credit score (scaled between 300 and 900), and risk category (Poor, Average, Good, Excellent).
The application allows for quick assessment of loan applicants, making it a valuable tool for financial institutions.
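One plausible way to turn a default probability into a 300-900 score and a risk category is sketched below. The linear mapping and the category cut-offs (500, 650, 750) are assumptions made for illustration; the app's actual scaling and thresholds are not stated here.

```python
def credit_score(default_probability: float, low: int = 300, high: int = 900) -> int:
    """Map a default probability to a score; a lower probability gives a higher score.

    Assumes a simple linear mapping, which may differ from the deployed app.
    """
    p = min(max(default_probability, 0.0), 1.0)  # clamp to [0, 1]
    return round(low + (1.0 - p) * (high - low))


def risk_category(score: int) -> str:
    """Bucket a score into the four categories; cut-offs are hypothetical."""
    if score < 500:
        return "Poor"
    if score < 650:
        return "Average"
    if score < 750:
        return "Good"
    return "Excellent"
```

For example, under these assumptions a default probability of 0.5 maps to a score of 600, which falls in the "Average" bucket.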
9. Business Impact:
This solution helps financial institutions:
- Reduce Risk: By accurately predicting which borrowers are at high risk of default.
- Improve Loan Approval Decisions: Enabling data-driven decision-making with a credit score and risk category.
- Increase Operational Efficiency: Streamlining the credit assessment process with a real-time, automated risk prediction tool.
10. Future Enhancements:
- Model Updates: Incorporating more advanced machine learning models like Gradient Boosting Machines (GBM) or Neural Networks for potentially higher accuracy.
- Data Integration: Adding more data sources such as credit history, employment status, and spending behavior to improve prediction.
- Application Scalability: The Streamlit app can be integrated with bank systems for seamless loan application processing.
11. Conclusion:
This credit risk prediction project highlights my ability to work on the entire machine learning pipeline, from data preprocessing and feature engineering to model development, evaluation, and deployment. The combination of strong technical skills and business acumen allows me to create impactful solutions that enhance decision-making in critical financial processes.
Link to Live Application:
2. Predictive Health Insurance Model for Shield Insurance
Overview:
In collaboration with AtliQ AI, this project aimed to develop a Predictive Health Insurance Premium Model for Shield Insurance. The model was designed to estimate health insurance premiums based on multiple customer factors, with the overall goal of automating and optimizing premium pricing decisions.
1. Objectives:
- Achieve >97% model accuracy.
- Ensure that 95% of predictions deviate by less than 10% from actual values.
- Provide a cloud-deployed solution for seamless access.
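The second objective can be checked with a small helper that reports what share of predictions fall within a given percentage of the actual premium. This is a sketch: the function name `share_within_tolerance` is hypothetical, and it assumes every actual value is positive.

```python
def share_within_tolerance(actual, predicted, tolerance_pct=10.0):
    """Percentage of predictions whose absolute error is below tolerance_pct of actual.

    Assumes all actual values are positive (true for premiums).
    """
    hits = sum(
        1
        for a, p in zip(actual, predicted)
        if abs(p - a) / a * 100 < tolerance_pct
    )
    return 100.0 * hits / len(actual)


actual = [100, 200, 300, 400]
predicted = [105, 230, 290, 400]
print(share_within_tolerance(actual, predicted))  # 75.0
```

The objective is met when this value is at least 95 on held-out data.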
2. Data Exploration and Preprocessing:
- Missing and Duplicate Values: Removed entries with missing or duplicate data.
- Outlier Treatment: Age restricted to values ≤ 100 for realistic predictions. Income outliers addressed using a quantile-based approach (99.9% threshold).
- Negative Values Correction: Converted negative number_of_dependants to absolute values.
- Categorical Standardization: Unified inconsistent labels in the smoking_status column.
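The cleaning steps above might look roughly like this in plain Python, operating on a list of dictionaries. Several details are assumptions: the label variants in `SMOKING_LABELS` are hypothetical, income outliers are dropped rather than capped, and the quantile uses a simple nearest-rank rule.

```python
import math

# Hypothetical label variants; the real inconsistent labels are not listed.
SMOKING_LABELS = {"Not Smoking": "No Smoking", "Does Not Smoke": "No Smoking"}


def quantile(values, q):
    """Nearest-rank quantile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def clean_records(records):
    """Apply the cleaning steps to a list of dicts with hashable values."""
    seen, cleaned = set(), []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue  # drop rows with missing data
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        if rec["age"] > 100:
            continue  # unrealistic age
        fixed = dict(rec)
        fixed["number_of_dependants"] = abs(fixed["number_of_dependants"])
        fixed["smoking_status"] = SMOKING_LABELS.get(
            fixed["smoking_status"], fixed["smoking_status"]
        )
        cleaned.append(fixed)
    if not cleaned:
        return cleaned
    # Assumption: rows above the 99.9% income quantile are dropped, not capped.
    cap = quantile([r["income_lakhs"] for r in cleaned], 0.999)
    return [r for r in cleaned if r["income_lakhs"] <= cap]
```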


3. Exploratory Data Analysis (EDA):
Univariate Analysis:
- Gender distribution: 54.96% Male, 45.04% Female.
- Analyzed BMI categories, smoking habits, and medical histories.
Bivariate Analysis:
- Income vs. Insurance Plan Preference: Heatmaps highlighted customer segmentation.
- Smoking Status vs. Gender: Revealed behavior trends influencing risk factors.

4. Feature Engineering:
Risk Score Computation:
- Weighted scores for conditions like heart disease.
- Normalized Risk Score for uniform scaling.
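A weighted, normalized risk score could be computed along these lines. The weights in `RISK_WEIGHTS` are invented for illustration (only heart disease is named as a condition in this write-up), and min-max normalization across the dataset is an assumption about what "normalized" means here.

```python
# Hypothetical weights; the project's actual values are not given.
RISK_WEIGHTS = {"heart disease": 8, "diabetes": 6, "thyroid": 5, "none": 0}


def raw_risk_score(conditions):
    """Sum the weights of a customer's listed conditions (unknown ones count as 0)."""
    return sum(RISK_WEIGHTS.get(c.strip().lower(), 0) for c in conditions)


def normalize_scores(scores):
    """Min-max scale raw scores to the range [0, 1]."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


histories = [["Heart disease"], ["None"], ["Diabetes", "Thyroid"]]
raw = [raw_risk_score(h) for h in histories]  # [8, 0, 11]
normalized = normalize_scores(raw)
```

Scaling to [0, 1] puts the risk score on the same footing as other min-max scaled features.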
5. Feature Selection and Encoding:
- Encoding: Ordinal Encoding for ordered categories (e.g., insurance_plan). One-Hot Encoding for nominal variables.
- Feature Selection: Removed redundant columns to reduce multicollinearity.
- Scaling: Used MinMaxScaler for continuous features.
- Multicollinearity Analysis: Removed income_lakhs due to high Variance Inflation Factor (VIF).
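The difference between the two encodings, plus min-max scaling, can be shown with small plain-Python helpers. These stand in for the library transformers and are not the project's code; the plan levels in `PLAN_ORDER` are hypothetical.

```python
# Hypothetical ordered levels for insurance_plan.
PLAN_ORDER = {"Bronze": 1, "Silver": 2, "Gold": 3}


def ordinal_encode(value, order=PLAN_ORDER):
    """Map an ordered category to its integer rank."""
    return order[value]


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category for a nominal variable."""
    return {f"{prefix}_{cat}": int(value == cat) for cat in categories}


def min_max_scale(values):
    """Scale values linearly to [0, 1], as MinMaxScaler does per feature."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

Ordinal encoding preserves the ranking between plan tiers, while one-hot encoding avoids implying any order between values such as gender.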
6. Model Training and Optimization:
Data Splitting:
- Split data into 70% training and 30% testing using train_test_split.
Models Used:
- Linear Regression: Baseline model with an R² score of 0.93.
- Ridge Regression: Controlled coefficient magnitudes with L2 regularization.
- XGBoost Regressor: Initial R² score of 0.98. Fine-tuned hyperparameters using RandomizedSearchCV. Final MSE: 1,563,064, RMSE: 1,250.
Residual Analysis:
- Residuals distribution identified higher errors for younger customers (age < 25).
Segmented Model Approach:
- Model for Young Customers (age



7. Final Model Evaluation:
Accuracy: Overall model accuracy exceeded 98% on the test set. For young customers (age
Premium Prediction for Old Population


Premium Prediction for Young Population


8. Deployment - Streamlit Application:
Main Components:
- User Interface (UI): Inputs for age, income, BMI, and medical history. Button for predictions.
- Prediction Logic: Uses predict() to call the appropriate model based on age.
- Results Display: Predicted premium shown with st.success().
Helper Functions:
- calculate_normalized_risk(): Computes risk score.
- preprocess_input(): Handles scaling and encoding.
- handle_scaling(): Selects appropriate scaler.
Deployment:
- Cloud deployment for remote accessibility.
- Streamlit application for ease of use by insurance underwriters.
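The age-based dispatch described above can be sketched as follows. The cutoff of 25 is an assumption borrowed from the residual analysis, the signature differs from whatever the app's own predict() takes, and the two models are stand-in callables rather than the trained estimators.

```python
YOUNG_AGE_CUTOFF = 25  # assumed from the residual analysis (age < 25)


def predict(features, model_young, model_rest, cutoff=YOUNG_AGE_CUTOFF):
    """Route the input to the young-customer model or the general model by age."""
    model = model_young if features["age"] < cutoff else model_rest
    return model(features)


# Stand-in models for illustration only; they return fixed premiums.
def young_stub(features):
    return 5000.0


def rest_stub(features):
    return 15000.0


print(predict({"age": 22}, young_stub, rest_stub))  # 5000.0
print(predict({"age": 40}, young_stub, rest_stub))  # 15000.0
```

Keeping the routing in one function means the UI code only needs to collect inputs and display the returned premium.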

9. Key Outcomes
- Improved Accuracy: Segmentation reduced prediction errors.
- Best Model Performance: XGBoost for the older segment and Linear Regression for the young segment.
- Actionable Insights: Enhanced premium prediction, lowering business risk.
10. Conclusion:
This project demonstrated an end-to-end machine learning pipeline, from data cleaning to deployment, with a focus on user-friendly interfaces. The segmented modeling strategy significantly enhanced accuracy, making the model robust and reliable for real-world application in health insurance premium prediction.