Explore a variety of data science projects showcasing different machine learning techniques and approaches across diverse domains. These projects use Python together with libraries such as Keras and NLTK to deliver insights and predictions. More models and advanced methods are in development, adding further depth to the portfolio.
The sentiment analysis project compares OpenAI’s GPT-3.5-turbo with NLTK on a subset of a 1.6GB dataset of 500k tweets from the first 65 days of the Russia-Ukraine war, and the two tools reach broadly similar conclusions. A correlation analysis finds no significant relationship between tweet sentiment and the number of likes, retweets, replies, or quotes, suggesting sentiment has little bearing on these engagement metrics. A daily sentiment heatmap shows that, although the per-tweet labels from GPT-3.5-turbo and NLTK agree on only 51% of tweets, both methods capture the same overall trend: a marked increase in negative tweets beginning on February 24, 2022, the day the war started.
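A minimal sketch of how the per-tweet comparison could be set up, assuming NLTK's VADER analyzer and the OpenAI chat completions API; the prompt wording, label mapping, and sample tweets are illustrative, not the project's exact code:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from openai import OpenAI

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def vader_label(text: str) -> str:
    # VADER's compound score lies in [-1, 1]; +/-0.05 is the usual cutoff.
    score = sia.polarity_scores(text)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


def gpt_label(text: str) -> str:
    # Ask GPT-3.5-turbo for the same three labels so outputs are comparable.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify the tweet's sentiment. "
                        "Answer with one word: positive, negative, or neutral."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()


tweets = ["Praying for peace.", "This invasion is horrifying."]  # hypothetical samples
pairs = [(vader_label(t), gpt_label(t)) for t in tweets]
agreement = sum(v == g for v, g in pairs) / len(pairs)
print(f"Per-tweet label agreement: {agreement:.0%}")
```

Measuring agreement this way is what yields the 51% overlap figure: the tools often disagree on individual tweets while still tracing the same daily trend in aggregate.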
In the spam prediction task, KNN, Logistic Regression, SVM, Decision Tree, and Random Forest classifiers were tested. The data was split 75%/25% into training and test sets so that accuracy could be measured on unseen data. To guard against overfitting, k-fold cross-validation was used, with stratified k-fold preserving the class balance within each fold. Although KNN was the initial choice, the Decision Tree proved the best model based on its high cross-validation score; a sketch of this comparison workflow follows below. Adding features like "word_freq_inherited" and "word_freq_won" could improve predictive power. For a continuous target variable, regression models such as Linear, Lasso, Ridge, and Random Forest regression would be considered instead.
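A sketch of that workflow using scikit-learn, with a synthetic stand-in for the spam feature matrix (the project's real features and hyperparameters will differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the spam dataset (imbalanced binary labels).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.6, 0.4], random_state=42)

# 75%/25% train/test split; stratify keeps the spam ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Stratified 5-fold CV: each fold preserves the class proportions,
# so the score is not skewed by an unlucky, imbalanced fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging accuracy over the five folds gives a more stable estimate than a single train/test split, which is how the Decision Tree was identified as the strongest model here.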
In the flower type prediction task, a neural network classifier built with Keras was trained in two configurations. Model 1, with 2 hidden layers and more neurons per layer, outperformed Model 2 on both train and test accuracy and kept the two close together, indicating strong predictive performance and good generalization to new data. Model 2 showed a larger gap between train and test accuracy, a sign of overfitting. Model 1's higher and more consistent accuracy across both sets makes it the more effective model for this classification task.
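A minimal sketch of a Model-1-style network, assuming the classic Iris dataset and the 2-hidden-layer shape described above; the layer sizes, optimizer, and epoch count are illustrative, not the project's actual settings:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Load and scale the data; scaling helps the network converge.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Two hidden layers, as in Model 1 (neuron counts are assumptions).
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # 4 flower measurements
    keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    keras.layers.Dense(16, activation="relu"),    # hidden layer 2
    keras.layers.Dense(3, activation="softmax"),  # 3 flower classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, verbose=0)

# Comparing train vs. test accuracy is the overfitting check described
# above: a large gap (as in Model 2) means the network memorized the
# training set rather than learning to generalize.
_, train_acc = model.evaluate(X_train, y_train, verbose=0)
_, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"train: {train_acc:.3f}  test: {test_acc:.3f}")
```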