Predicting Water Potability Using Machine Learning and Feature Engineering Techniques

Abstract

Ensuring access to clean and safe drinking water remains a critical global challenge. Traditional water safety assessments rely on laboratory testing, which is time-consuming, resource-intensive, and slow to scale. In this study, I explore the application of machine learning to classify water as potable or non-potable based on its chemical and physical characteristics. To address two key challenges, class imbalance and low correlation between the features and the target, I implement several modeling strategies, including feature expansion and ensemble learning. The final stacked model achieved 66% accuracy, highlighting both the potential and the limitations of data-driven approaches to water safety prediction.

1. Introduction

Access to potable water is a foundational public health necessity. However, the methods used to determine water safety typically require manual sampling and laboratory analysis, which are costly and slow. The goal of this project is to explore whether water potability can be reliably predicted using supervised machine learning models trained on various water quality indicators. This study focuses on identifying meaningful patterns in water chemistry data to classify water samples as either potable or non-potable, using techniques that address the challenges of feature selection, class imbalance, and model optimization.

2. Materials and Methods

2.1 Dataset

The dataset used for this study includes nine attributes representing chemical and physical water quality indicators: pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic Carbon, Trihalomethanes, and Turbidity. The target variable, Potability, is binary (0 = non-potable, 1 = potable). The classes are imbalanced: non-potable samples outnumber potable ones.

2.2 Preprocessing

Rows with missing values were removed. Features were standardized with StandardScaler so that each has zero mean and unit variance. A correlation matrix revealed weak linear relationships between the input features and potability, indicating the need for further feature engineering to improve predictive performance.
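The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the data are synthetic random numbers standing in for the nine indicators, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): 200 rows, 9 features.
rng = np.random.default_rng(0)
n_rows, n_features = 200, 9
X_raw = rng.normal(size=(n_rows, n_features))
y_raw = rng.integers(0, 2, size=n_rows)

# Blank out one value in each of 20 distinct rows to mimic missing measurements.
missing_rows = rng.choice(n_rows, size=20, replace=False)
X_raw[missing_rows, 0] = np.nan

# 1. Drop every row that contains at least one missing value.
keep = ~np.isnan(X_raw).any(axis=1)
X_clean, y_clean = X_raw[keep], y_raw[keep]

# 2. Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_clean)

# 3. Pearson correlation of each feature with the binary target.
corr_with_target = np.array(
    [np.corrcoef(X_scaled[:, j], y_clean)[0, 1] for j in range(n_features)]
)
print(X_clean.shape)  # (180, 9)
print(np.round(corr_with_target, 3))
```

When a held-out test set is used, fitting the scaler on the training split only and reusing it to transform the test split avoids leaking test-set statistics into the model.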

2.3 Feature Engineering

To address the low correlation between the original features and the target variable, second-degree polynomial features were generated, increasing the feature set from 9 to 45. Recursive Feature Elimination (RFE) was then used to reduce dimensionality and retain only the most impactful features. Empirical testing showed that keeping the top 15–20 features gave the best observed trade-off between model complexity and performance.

2.4 Addressing Class Imbalance

The class imbalance resulted in a model bias toward predicting non-potable water, leading to a high false negative rate for potable water. To counteract this, I implemented the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples of potable water, improving the model’s ability to correctly classify minority class examples.
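The core idea of SMOTE, creating new minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be illustrated with a short NumPy sketch. This is a simplified illustration on synthetic data, not the implementation used in the study; the function name smote_sketch and its parameters are hypothetical. In practice, the SMOTE class in the imbalanced-learn package (with its fit_resample method) is the usual choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: column 0 is the query point itself (distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)   # randomly chosen seed points
    picks = rng.integers(1, k + 1, size=n_new)       # one of the k true neighbors
    neighbors = neighbor_idx[base, picks]
    gap = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + gap * (X_min[neighbors] - X_min[base])


# Synthetic, imbalanced training data (assumption): 70 non-potable, 30 potable.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 9))
y_train = np.array([0] * 70 + [1] * 30)

# Generate enough potable samples to match the majority class.
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_sketch(X_min, n_new)

X_bal = np.vstack([X_train, X_syn])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
print(np.bincount(y_bal))  # [70 70]
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class distribution and contains no synthetic samples.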

3. Modeling and Results

3.1 Baseline Models

Initial models included Logistic Regression and Random Forest. Logistic Regression achieved 50.37% accuracy, barely better than chance, most likely because its linear decision boundary cannot capture the nonlinear relationships in the data. Random Forest improved accuracy to 58.81%, benefiting from its ability to model nonlinear relationships, though hyperparameter tuning via GridSearchCV provided minimal gains.
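A minimal sketch of the two baselines, assuming a stratified train/test split and a small, hypothetical hyperparameter grid (the grid used in the study is not reported). The data are synthetic, so the printed accuracies will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced stand-in for the water quality data (assumption).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline 1: Logistic Regression (linear decision boundary).
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))

# Baseline 2: Random Forest tuned with a small, illustrative grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))

# Reference point: accuracy of always predicting the majority class.
majority_acc = max(y_test.mean(), 1 - y_test.mean())

print(f"Logistic Regression: {log_reg_acc:.3f}")
print(f"Random Forest:       {rf_acc:.3f}  (best params: {search.best_params_})")
print(f"Majority baseline:   {majority_acc:.3f}")
```

On an imbalanced test set, the majority-class rate is a more informative reference point than 50%, since a model that always predicts the larger class already reaches it.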

3.2 Polynomial Feature Expansion

After introducing polynomial features, model accuracy improved substantially. A tuned Random Forest classifier with polynomial features achieved 68.49% accuracy, up from 58.81% for the baseline Random Forest. After further optimization (e.g., increasing n_estimators to 230), accuracy increased to 69.73%.

3.3 Threshold Adjustment

The model was more effective at identifying non-potable water than potable water, with recall scores of 0.83 and 0.48 respectively. To address this, the probability threshold for predicting the potable class was lowered from 0.5 to 0.4. This adjustment reduced accuracy to 66% but produced a better balance of F1-scores (0.64 for potable, 0.68 for non-potable), making the model more useful in practice, where missed potable samples and missed non-potable samples both carry a cost.
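Threshold adjustment amounts to comparing the predicted probability of the potable class against a cutoff other than 0.5. The sketch below shows the mechanism on synthetic data; the helper predict_with_threshold is hypothetical, and the printed scores will not match those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5,
    weights=[0.6, 0.4], flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of class 1 (potable).
proba_potable = model.predict_proba(X_test)[:, 1]


def predict_with_threshold(proba, threshold):
    """Label a sample potable (1) when its probability reaches the threshold."""
    return (proba >= threshold).astype(int)


pred_default = predict_with_threshold(proba_potable, 0.5)
pred_lowered = predict_with_threshold(proba_potable, 0.4)

# Recall for the potable class can only stay the same or rise as the cutoff drops.
recall_default = recall_score(y_test, pred_default)
recall_lowered = recall_score(y_test, pred_lowered)
print(f"Potable recall at 0.5: {recall_default:.2f}")
print(f"Potable recall at 0.4: {recall_lowered:.2f}")

# Per-class F1 at the lowered threshold: [non-potable, potable].
print(f1_score(y_test, pred_lowered, average=None))
```

Lowering the cutoff trades non-potable recall for potable recall, so the choice of threshold depends on which kind of error is more costly in the intended setting.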

3.4 Stacked Ensemble Model

To further improve classification performance, a stacked ensemble model was developed using Random Forest, Logistic Regression, and LightGBM as base learners, with Gradient Boosting as the meta-learner. The final stacked model matched the 66% accuracy of the threshold-adjusted model while achieving a better balance between precision and recall across both classes.

4. Discussion

This study demonstrates the complexity of predicting water potability from chemical attributes alone. Although polynomial feature engineering raised Random Forest accuracy from 58.81% to 69.73%, limitations remain. After threshold adjustment and stacking, the final model correctly classifies approximately two out of every three samples. While this is a promising result for exploratory purposes, it is not yet sufficient for deployment in high-stakes settings.

Future improvements could include incorporating external features such as geographic, seasonal, or industrial context, using more advanced classifiers like XGBoost or CatBoost, and optimizing the meta-model in the ensemble. Additionally, expanding the dataset and reducing noise could further enhance predictive stability.

5. Conclusion

This project explored a machine learning approach to water potability classification using publicly available chemical water quality data. Although the final stacked model achieved 66% accuracy, further refinement is necessary to meet real-world safety standards. Nonetheless, this work underscores the importance of feature engineering, class balancing, and ensemble modeling in building effective predictive systems. Future iterations could bring such models closer to deployment as supplementary tools for public health monitoring.