Explore a variety of data science projects showcasing different machine learning techniques and approaches across diverse domains. These projects use Python together with libraries such as Keras and NLTK to deliver insights and predictions. More models and advanced methods are in development, adding further depth to the portfolio.
The sentiment analysis project compares OpenAI’s GPT-3.5-turbo with NLTK on a subset of a 1.6GB dataset of 500k tweets from the first 65 days of the Russia-Ukraine war, and the two tools reach broadly similar conclusions. A correlation analysis finds no significant relationship between tweet sentiment and the number of likes, retweets, replies, or quotes, suggesting sentiment has little bearing on these engagement metrics. A daily sentiment heatmap shows that, although the per-tweet labels from GPT-3.5-turbo and NLTK agree on only 51% of tweets, both methods capture the same overall trend: a marked increase in negative tweets beginning on February 24, 2022, the day the war started.
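A minimal sketch of how the per-tweet comparison could be set up, assuming NLTK's VADER analyzer and the OpenAI chat completions API; the prompt wording, label mapping, and sample tweets are illustrative, not the project's exact code:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from openai import OpenAI

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def vader_label(text: str) -> str:
    # VADER's compound score lies in [-1, 1]; +/-0.05 is the usual cutoff.
    score = sia.polarity_scores(text)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


def gpt_label(text: str) -> str:
    # Ask GPT-3.5-turbo for the same three labels so outputs are comparable.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify the tweet's sentiment. "
                        "Answer with one word: positive, negative, or neutral."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()


tweets = ["Praying for peace.", "This invasion is horrifying."]  # hypothetical samples
pairs = [(vader_label(t), gpt_label(t)) for t in tweets]
agreement = sum(v == g for v, g in pairs) / len(pairs)
print(f"Per-tweet label agreement: {agreement:.0%}")
```

Measuring agreement this way is what yields the 51% overlap figure: the tools often disagree on individual tweets while still tracing the same daily trend in aggregate.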
In the spam prediction task, KNN, Logistic Regression, SVM, Decision Tree, and Random Forest classifiers were tested. The data was split 75%/25% into training and test sets so that accuracy could be measured on unseen data. To guard against overfitting, k-fold cross-validation was used, with stratified k-fold preserving the class balance within each fold. Although KNN was the initial choice, the Decision Tree proved the best model based on its high cross-validation score; a sketch of this comparison workflow follows below. Adding features like "word_freq_inherited" and "word_freq_won" could improve predictive power. For a continuous target variable, regression models such as Linear, Lasso, Ridge, and Random Forest regression would be considered instead.
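A sketch of that workflow using scikit-learn, with a synthetic stand-in for the spam feature matrix (the project's real features and hyperparameters will differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the spam dataset (imbalanced binary labels).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.6, 0.4], random_state=42)

# 75%/25% train/test split; stratify keeps the spam ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Stratified 5-fold CV: each fold preserves the class proportions,
# so the score is not skewed by an unlucky, imbalanced fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging accuracy over the five folds gives a more stable estimate than a single train/test split, which is how the Decision Tree was identified as the strongest model here.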
In the flower type prediction task, a neural network classifier built with Keras was trained in two configurations. Model 1, with 2 hidden layers and more neurons per layer, outperformed Model 2 on both train and test accuracy and kept the two close together, indicating strong predictive performance and good generalization to new data. Model 2 showed a larger gap between train and test accuracy, a sign of overfitting. Model 1's higher and more consistent accuracy across both sets makes it the more effective model for this classification task.
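A minimal sketch of a Model-1-style network, assuming the classic Iris dataset and the 2-hidden-layer shape described above; the layer sizes, optimizer, and epoch count are illustrative, not the project's actual settings:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Load and scale the data; scaling helps the network converge.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Two hidden layers, as in Model 1 (neuron counts are assumptions).
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # 4 flower measurements
    keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    keras.layers.Dense(16, activation="relu"),    # hidden layer 2
    keras.layers.Dense(3, activation="softmax"),  # 3 flower classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, verbose=0)

# Comparing train vs. test accuracy is the overfitting check described
# above: a large gap (as in Model 2) means the network memorized the
# training set rather than learning to generalize.
_, train_acc = model.evaluate(X_train, y_train, verbose=0)
_, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"train: {train_acc:.3f}  test: {test_acc:.3f}")
```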