Lia Cappellari
Data Scientist
Enthusiastic and results-driven Data Scientist with a strong foundation in industrial engineering. Proficient in R, Python, SQL, HTML, and JavaScript, with a solid grasp of machine learning fundamentals. Experienced with AWS, conducting impactful data analyses, and implementing process improvements using Six Sigma methodologies. Proven leadership as President of Alpha Pi Mu, coupled with work experience optimizing operations at CAES, Restaurant World, and currently SteerBridge Strategies.
My Projects
Ribbit - An app for automated frog species identification and classification
This project aims to build an application that leverages an API and machine learning for real-time identification and classification of amphibian species from user-uploaded recordings of frog calls. The data gathered will be contributed to global biodiversity repositories to support conservation efforts.
Approach:
This project is currently in progress. Information will be updated when complete.
Impact:
This project is currently in progress. Information will be updated when complete.
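While the full pipeline is still in progress, the classification idea can be sketched as a toy nearest-centroid classifier over call features. Everything here is illustrative: the three-feature space (dominant frequency, call duration, pulse rate) and the species centroid values are made-up stand-ins, not the project's actual model or data.

```python
import numpy as np

# Hypothetical species "centroids" in a toy 3-feature space:
# (dominant frequency kHz, call duration s, pulse rate Hz) -- illustrative values only.
SPECIES_CENTROIDS = {
    "Spring Peeper": np.array([3.0, 0.15, 20.0]),
    "American Bullfrog": np.array([0.2, 0.8, 5.0]),
    "Green Frog": np.array([0.4, 0.1, 8.0]),
}

def classify_call(features):
    """Nearest-centroid classification of a frog-call feature vector."""
    features = np.asarray(features, dtype=float)
    return min(SPECIES_CENTROIDS,
               key=lambda s: np.linalg.norm(features - SPECIES_CENTROIDS[s]))

print(classify_call([2.8, 0.12, 18.0]))  # closest to the Spring Peeper centroid
```

In a real deployment, the feature vector would be extracted from the uploaded audio and the classifier would be a trained model rather than fixed centroids.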

This fully interactive webpage, built to explore historical Olympic data, provides users with in-depth insights into Olympic medal distributions, athlete performance, and country dominance across various sports and time periods. The goal of the project was to create a user-friendly interface that allows dynamic exploration of the data through multiple visualizations and filters, offering a comprehensive view of the Olympics.
Approach:
• Webpage Development: The webpage was developed using HTML, CSS, and JavaScript for frontend design and responsiveness. The visualizations were embedded using Tableau to offer a seamless user experience.
• Interactive Features: The webpage includes multiple dashboards that let users explore medal counts by medal type, athlete, sport, country, gender, or Games year, with different chart types serving different analytical purposes. Filters, including drop-down menus and sliders, let users customize the analysis by Games year, discipline, and country, adjusting the visualizations in real time.
• Tools Used: HTML, CSS, JavaScript for webpage design and interactivity. Tableau was used for creating dynamic and interactive visualizations embedded into the webpage. Python (Pandas) for data cleaning and preparation. Flask for deployment.
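The pandas cleaning step might look like the sketch below. The column names and sample rows are a hypothetical schema for illustration, not the project's actual dataset.

```python
import pandas as pd

# Hypothetical raw Olympic results -- schema and values are illustrative only.
raw = pd.DataFrame({
    "Athlete": ["A. Runner", "B. Swimmer", "B. Swimmer", None],
    "Country": ["USA", "JPN", "JPN", "FRA"],
    "Year": ["2016", "2020", "2020", "2020"],
    "Medal": ["Gold", "Silver", "Silver", "NA"],
})

clean = (
    raw.dropna(subset=["Athlete"])                     # drop rows missing an athlete name
       .drop_duplicates()                              # remove duplicated result rows
       .assign(Year=lambda d: d["Year"].astype(int))   # numeric year for range filters
       .query("Medal != 'NA'")                         # keep medal-winning rows only
)

# Medal counts by country -- the kind of aggregate fed into the Tableau dashboards
medals_by_country = clean.groupby("Country")["Medal"].count()
print(medals_by_country.to_dict())
```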
The goal of this project was to predict student dropout rates in higher education using a variety of machine learning models. We explored three models: a baseline logistic regression, a random forest, and a neural network.
Approach:
• Data Preprocessing: Cleaned and preprocessed data from the educational institution's student database, handling missing values and normalizing numerical features.
• Modeling: Developed a baseline logistic regression model and iterated over Random Forest and Neural Network models, tuning hyperparameters using grid search and cross-validation.
• Performance Metrics: Assessed model performance using accuracy, precision, recall, and F1 score. Random Forest yielded the best results with an F1 score of 0.78.
• Tools Used: Python (Pandas, Scikit-learn), Jupyter Notebooks for modeling and visualization.
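The modeling workflow above (baseline logistic regression, then a random forest tuned with grid search and cross-validation, compared on F1) can be sketched as follows. Synthetic data stands in for the private student dataset, and the hyperparameter grid is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the institution's student data
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: logistic regression
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Random forest tuned with grid search + 5-fold cross-validation on F1
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=5, scoring="f1",
).fit(X_tr, y_tr)

print("baseline F1:", round(f1_score(y_te, baseline.predict(X_te)), 2))
print("tuned RF F1:", round(f1_score(y_te, grid.best_estimator_.predict(X_te)), 2))
```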
Impact:
The Random Forest model provided the institution with insights into which factors were most indicative of student dropout, enabling targeted interventions for at-risk students.
This experiment explored how expressing gratitude affects tipping behavior at a New York City coffee shop. A sign thanking customers for "supporting a local business" was placed in front of the register for the treatment group, while the control group saw no sign.
Approach:
• Data Collection: Collected tipping percentages from both groups over a 2-week period.
• Statistical Analysis: Performed a two-sample t-test to determine whether there was a statistically significant difference in tipping behavior between the two groups. Also conducted tests for heterogeneous treatment effects across days of the week.
• Results: While the treatment group tipped 1.08% more on average, the difference was not statistically significant at the 5% level (p > 0.05), though it was significant at the 10% level.
• Tools Used: R for statistical analysis and visualizations, Excel for data aggregation.
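The analysis was done in R, but the core test translates directly to Python/SciPy. The simulated tip percentages below are illustrative stand-ins for the study's actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated tip percentages -- illustrative, not the experiment's real observations
control = rng.normal(loc=15.0, scale=4.0, size=70)    # no sign at the register
treatment = rng.normal(loc=16.1, scale=4.0, size=70)  # gratitude sign displayed

# Two-sample (Welch's) t-test on the difference in mean tip percentage
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A heterogeneous-effects check would repeat this comparison within subgroups (e.g., per day of the week).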
The objective of this project was to identify inefficiencies in the material handling process at Thermo Fisher Scientific and implement data-driven improvements to reduce scrap and operational costs.
Approach:
• Data Analysis: Collected and analyzed operational data using SQL and R. Applied clustering techniques to group product units by operational similarity, scrap rate, and potential savings. Identified key areas of waste and inefficiency, particularly in the handling of raw materials and defective parts.
• Process Optimization: Applied Six Sigma methodologies to propose changes in the material handling process, leading to more efficient workflows and reduced scrap.
• Results: Reduced scrap by 10%, resulting in annual cost savings of $700,000. Created an interactive R Shiny dashboard for future use.
• Tools Used: R for statistical analysis, SQL for querying data from operational databases.
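The original clustering was done in R; a minimal Python sketch of the same idea, standardizing per-unit metrics and clustering them with k-means, is shown below. The scrap-rate and handling-cost values are synthetic, not Thermo Fisher data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative per-unit metrics: scrap rate (%) and handling cost ($)
scrap_rate = np.concatenate([rng.normal(2, 0.5, 40), rng.normal(8, 1.0, 20)])
handling_cost = np.concatenate([rng.normal(50, 10, 40), rng.normal(120, 15, 20)])
X = np.column_stack([scrap_rate, handling_cost])

# Standardize so both metrics contribute equally, then cluster into two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

# The high-scrap cluster highlights candidates for process-improvement focus
for k in range(2):
    print(f"cluster {k}: mean scrap {scrap_rate[labels == k].mean():.1f}%")
```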
This project aimed to identify key factors contributing to drug overdoses, focusing on geographic and temporal trends as well as substance combinations that lead to higher risk of overdose.
Approach:
• Data Collection: Aggregated data from public health sources, including the CDC, on drug overdose rates, substance types, and demographic information.
• Exploratory Data Analysis (EDA): Used Python (Pandas, Matplotlib, Seaborn) for in-depth EDA, identifying trends by age, gender, and location. Conducted time series analysis to observe changes in overdose rates over time.
• Statistical Modeling: Built regression models to explore relationships between variables (e.g., opioid usage and overdose rates), identifying key predictors.
• Tools Used: Python for data wrangling and visualization, Tableau for presenting geographic and temporal trends.
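The time-series portion of the modeling step can be sketched as a simple linear time-trend fit. The yearly overdose rates below are synthetic placeholders, not CDC figures.

```python
import numpy as np

# Illustrative yearly overdose rates per 100k -- synthetic, not CDC data
years = np.arange(2010, 2021)
rates = np.array([12.3, 13.1, 13.2, 13.8, 14.7, 16.3,
                  19.8, 21.7, 20.7, 21.6, 28.3])

# Least-squares linear trend: the simplest form of the regression step above.
# The full analysis would add predictors such as substance type and demographics.
slope, intercept = np.polyfit(years, rates, deg=1)
print(f"estimated trend: {slope:+.2f} deaths per 100k per year")
```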
Impact:
The analysis provided actionable insights for public health officials, identifying geographic areas and demographic groups most at risk for drug overdoses, potentially informing future policy interventions.
This project involved developing a user-friendly web portal using R Shiny to enable seamless data analysis and visualization for the Design of Experiments (DOE) methodology.
Approach:
• Web Portal Development: Built an interactive web interface using R Shiny that allows users to upload datasets and conduct a 7-step DOE analysis.
• Analysis Features: The portal performs tasks such as model significance testing, residual analysis, ANOVA assumptions verification, and data visualization.
• Customization: Included functionality for customizable plots, interactive tables, and downloadable reports, making the portal accessible for both novice and expert users.
• Tools Used: R Shiny for web development, ggplot2 for data visualization, R for statistical analysis.
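The portal itself is built in R Shiny, but the core significance test it runs at the ANOVA step can be illustrated in a few lines of Python/SciPy. The three factor-level samples are made-up measurements for demonstration.

```python
from scipy import stats

# Illustrative response measurements for three levels of a one-factor design
level_a = [12.1, 11.8, 12.5, 12.0]
level_b = [13.4, 13.1, 13.8, 13.6]
level_c = [11.0, 10.7, 11.3, 11.1]

# One-way ANOVA: is at least one level's mean different from the others?
f_stat, p_value = stats.f_oneway(level_a, level_b, level_c)
print(f"F = {f_stat:.1f}, p = {p_value:.4f}")
```

The portal wraps tests like this with residual diagnostics and assumption checks (normality, equal variance) before reporting significance.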
Impact:
The portal enabled users to perform complex statistical analyses without needing advanced coding skills, making it a valuable tool for researchers and engineers conducting experimental designs.