Date of Award

2025

Document Type

Open Access Master's Thesis

Degree Name

Master of Science in Biological Sciences (MS)

Administrative Home Department

Department of Biological Sciences

Advisor 1

Stephen M. Techtmann

Committee Member 1

Trista J. Vick-Majors

Committee Member 2

Ishi M. Keenum

Committee Member 3

Dukka B. KC

Abstract

Advances in genomic sequencing have dramatically increased both the volume of genomic data and the speed at which it is generated. These advances make it possible to profile the genetic information of organisms and communities at unprecedented scales. Many methods have been developed to identify and classify genes within these datasets. However, many generic gene annotation pipelines struggle to accurately predict specific protein classes that are poorly represented in their databases. Two major challenges exist for classifying specific protein classes in metagenomic datasets. First, many metagenomic assemblies are highly fragmented, so many genes are only partially recovered. Second, many protein families have very few representatives in databases, and many of those representatives have ambiguous annotations. The accurate detection of bacterial toxins in complex environments and the reliable prediction of ice-binding proteins (IBPs) remain unresolved problems for current bioinformatics pipelines. Fragmented or incomplete reads often obscure toxin-encoding genes within metagenomic datasets, while small, rigidly defined IBP datasets and outdated learning algorithms hamper predictive accuracy for antifreeze proteins (AFPs) and ice-nucleation proteins (INPs). Here, we introduce two computational strategies that address these methodological constraints: one uses profile Hidden Markov Models (pHMMs) for toxin detection, and the other uses a protein language model (ESM-2) for IBP classification. Our findings show a substantial improvement over existing tools. The pHMM-based toxin detection system, by focusing on core structural domains, achieved 99% sensitivity and specificity; it identified 5,120 toxin-related sequences in wastewater samples and detected no positives in food samples. The PLM-ICE framework yielded a Matthews correlation coefficient (MCC) of 0.984 for antifreeze proteins and 0.927 for ice-nucleation proteins, exceeding the metrics of previously described classifiers. Future studies can extend the approach to other protein families or unify these strategies into broader annotation platforms, offering a path toward more nuanced analyses of large-scale genomic data.
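For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how sensitivity, specificity, and the Matthews correlation coefficient are computed from a binary confusion matrix. The function names are illustrative assumptions and are not taken from the thesis; no counts from the thesis are reproduced here.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives recovered: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives recovered: TN / (TN + FP)."""
    total = tn + fp
    return tn / total if total else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal sum is zero, a common convention
    for the undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike accuracy it stays informative when the positive and negative classes are imbalanced.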
