Predicting oil contamination in water using machine learning on microbial compositions
Document Type
Article
Publication Date
1-1-2026
Abstract
We present a compact and generative machine-learning framework that predicts oil contamination based on microbial community compositions from experimental samples. Our method combines dimensionality reduction with data augmentation and generative modeling to address high-dimensional, non-linear, and sparse microbial data. To reduce the 503-dimensional bacterial composition dataset, we compared three dimensionality reduction techniques: feature importance from random forest, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). Feature importance outperformed PCA and t-SNE, improving predictive performance and identifying microbial species most strongly correlated with oil contamination. To mitigate data scarcity, we augmented the training data using an augmented data neural network (ADNN) with noise injection. Samples generated by a variational autoencoder (VAE) were used as controlled perturbations to probe model robustness during stress testing. Using the top 3-10 bacterial features, our model achieved an R² value of up to 0.99 in both training and stress testing for predicting oil contamination from microbial data. In a bottle-level hold-out evaluation (22 splits at an 80/20 bottle ratio), performance on held-out bottles was lower and variable (mean test R² = -0.150), indicating limited generalization within this cohort. These results should be interpreted as a feasibility demonstration requiring validation on larger independent datasets.
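To make the bottle-level hold-out and the negative mean test R² concrete, the following is a minimal sketch and not the authors' implementation. It assumes that R² is the usual coefficient of determination, 1 - SS_res / SS_tot, and that a "bottle-level" split means all samples from a given bottle fall on the same side of the split. The function names, the toy bottle layout, and the seed are hypothetical and chosen only for illustration.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bottle_split(bottle_ids, test_fraction=0.2, seed=0):
    """Split sample indices so every sample from a bottle lands on one side."""
    bottles = sorted(set(bottle_ids))
    rng = random.Random(seed)
    rng.shuffle(bottles)
    n_test = max(1, round(test_fraction * len(bottles)))
    test_bottles = set(bottles[:n_test])
    train_idx = [i for i, b in enumerate(bottle_ids) if b not in test_bottles]
    test_idx = [i for i, b in enumerate(bottle_ids) if b in test_bottles]
    return train_idx, test_idx


# Hypothetical toy layout: 10 bottles with 3 samples each (not the study data).
bottle_ids = [b for b in range(10) for _ in range(3)]
train_idx, test_idx = bottle_split(bottle_ids)
print(len(train_idx), len(test_idx))  # 24 6

# R² is 0 when predictions equal the held-out mean, and negative when
# they are worse than simply predicting that mean.
y_true = [1.0, 2.0, 3.0]
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # 0.0
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # -3.0
```

Under this definition, a mean test R² of -0.150 would mean that, on average across splits, predictions on held-out bottles were slightly worse than predicting the mean of the held-out values, which is consistent with the limited generalization the abstract reports.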
Publication Title
PloS one
Recommended Citation
Gao, T., Bigcraft, I., Techtmann, S., & Nakamura, I. (2026). Predicting oil contamination in water using machine learning on microbial compositions. PloS one, 21(3), e0344571.
http://doi.org/10.1371/journal.pone.0344571
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2461