Predicting oil contamination in water using machine learning on microbial compositions

Document Type

Article

Publication Date

1-1-2026

Department

Department of Physics; Department of Biological Sciences

Abstract

We present a compact and generative machine-learning framework that predicts oil contamination based on microbial community compositions from experimental samples. Our method combines dimensionality reduction with data augmentation and generative modeling to address high-dimensional, non-linear, and sparse microbial data. To reduce the 503-dimensional bacterial composition dataset, we compared three dimensionality reduction techniques: feature importance from random forest, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). Feature importance outperformed PCA and t-SNE, improving predictive performance and identifying microbial species most strongly correlated with oil contamination. To mitigate data scarcity, we augmented the training data using an augmented data neural network (ADNN) with noise injection. Samples generated by a variational autoencoder (VAE) were used as controlled perturbations to probe model robustness during stress testing. Using the top 3-10 bacterial features, our model achieved an R² value of up to 0.99 in both training and stress testing for predicting oil contamination from microbial data. In a bottle-level hold-out evaluation (22 splits at an 80/20 bottle ratio), performance on held-out bottles was lower and variable (mean test R² = -0.150), indicating limited generalization within this cohort. These results should be interpreted as a feasibility demonstration requiring validation on larger independent datasets.
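The bottle-level hold-out and the negative mean test R² reported above can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`bottle_split`, `add_noise`, `r2_score`), the toy bottle layout, and the noise scale are assumptions chosen for illustration, and only the Python standard library is used. The sketch shows a split that keeps every bottle entirely in either the training or the test set, a simple Gaussian noise-injection step, and why R² falls below zero when predictions are worse than the held-out mean.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bottle_split(bottle_ids, test_fraction=0.2, seed=0):
    """Split sample indices so each bottle lands wholly in train or test."""
    bottles = sorted(set(bottle_ids))
    rng = random.Random(seed)
    rng.shuffle(bottles)
    n_test = max(1, round(test_fraction * len(bottles)))
    test_bottles = set(bottles[:n_test])
    train_idx = [i for i, b in enumerate(bottle_ids) if b not in test_bottles]
    test_idx = [i for i, b in enumerate(bottle_ids) if b in test_bottles]
    return train_idx, test_idx


def add_noise(rows, sigma=0.01, copies=3, seed=0):
    """Noise injection: emit jittered copies of each feature row, clipped at 0."""
    rng = random.Random(seed)
    return [
        [max(0.0, x + rng.gauss(0.0, sigma)) for x in row]
        for row in rows
        for _ in range(copies)
    ]


# Hypothetical toy layout: 10 bottles with 3 samples each.
bottle_ids = [b for b in range(10) for _ in range(3)]
train_idx, test_idx = bottle_split(bottle_ids)

# Hypothetical relative abundances for two samples and two taxa.
augmented = add_noise([[0.2, 0.8], [0.5, 0.5]])

# R² on three held-out values under three kinds of prediction.
y_test = [1.0, 2.0, 3.0]
r2_perfect = r2_score(y_test, [1.0, 2.0, 3.0])  # exact predictions
r2_mean = r2_score(y_test, [2.0, 2.0, 2.0])     # always predict the mean
r2_bad = r2_score(y_test, [3.0, 2.0, 1.0])      # worse than the mean
print(r2_perfect, r2_mean, r2_bad)
```

Repeating `bottle_split` with different `seed` values gives a set of independent bottle-level splits. Because R² compares the model against a constant prediction at the held-out mean, a negative average across splits means the model did worse than that baseline on unseen bottles.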

Publication Title

PLOS ONE

Recommended Citation

Gao, T., Bigcraft, I., Techtmann, S., & Nakamura, I. (2026). Predicting oil contamination in water using machine learning on microbial compositions. PLOS ONE, 21(3), e0344571. http://doi.org/10.1371/journal.pone.0344571
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2461

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Version

Publisher's PDF

Michigan Tech Publications
