Date of Award


Document Type

Open Access Master's Thesis

Degree Name

Master of Science in Computer Science (MS)

Administrative Home Department

Department of Computer Science

Advisor 1

Dukka KC

Committee Member 1

Laura Brown

Committee Member 2

Tatyana Karabencheva-Christova


Sumoylation is an essential post-translational modification involved in a diverse range of eukaryotic cellular mechanisms, and it plays a significant role in DNA repair. Some researchers hypothesize that a high level of sumoylation in cancer cells improves the cells' chances of survival under stress conditions by regulating tumor-related proteins.

This study belongs to the rapidly growing field that applies computational methods to the life sciences. Predicting protein structure and molecular function and designing new drugs are just a few of the applications in this area of research. By leveraging computational power, researchers can analyze vast amounts of biological data, unravel complex biological processes, and gain insights into the inner workings of living systems.

The prediction of sumoylation sites has attracted considerable effort within the field because of its promising applications in cancer treatment. Studies approach the problem from various angles: constructing novel datasets, employing knowledge of physical and chemical properties, designing sophisticated learning models, or even employing homology inference. Only roughly half of the known sumoylation sites in proteins follow the consensus motifs, and many sumoylation events occur only under stress conditions, so the problem often requires further investigation. Modern sumoylation predictors often consider a window around a lysine of interest in the raw protein sequence. Such window-based approaches tend to lose information that is important for learning and that might be inferred from the substrate protein sequence on a larger scale. This study addresses the importance of context-sensitive features of individual amino acid residues. The protein language models employed here act as feature extractors, providing an efficient mapping of protein information onto a dense numeric representation.
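The contrast between the two feature strategies can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the flank size of 7 residues, and the padding symbol "X" are assumptions chosen for illustration, and the per-residue embeddings are taken as given (in practice they would come from a protein language model run over the full sequence).

```python
from typing import List, Sequence, Tuple


def lysine_positions(sequence: str) -> List[int]:
    """Return the 0-based indices of every lysine (K) in the sequence."""
    return [i for i, residue in enumerate(sequence) if residue == "K"]


def lysine_windows(sequence: str, flank: int = 7, pad: str = "X") -> List[Tuple[int, str]]:
    """Window-based view: (1-based position, local window) for each lysine.

    The flank size and padding symbol are illustrative assumptions.
    """
    padded = pad * flank + sequence + pad * flank
    # In the padded string, residue i sits at index i + flank, so the slice
    # [i, i + 2 * flank + 1) is centred on it.
    return [(i + 1, padded[i : i + 2 * flank + 1]) for i in lysine_positions(sequence)]


def lysine_embeddings(
    sequence: str, per_residue: Sequence[Sequence[float]]
) -> List[Tuple[int, Sequence[float]]]:
    """Context-sensitive view: pick the embedding row of each lysine.

    `per_residue` is assumed to hold one vector per residue, computed from
    the whole protein, so each row already reflects protein-wide context.
    """
    if len(per_residue) != len(sequence):
        raise ValueError("expected exactly one embedding vector per residue")
    return [(i + 1, per_residue[i]) for i in lysine_positions(sequence)]
```

In the window-based view, each candidate lysine is described only by its immediate neighbours. In the embedding-based view, the feature vector for the same lysine is a row of a matrix computed over the entire protein, so it can carry information from distant residues.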

The objective of the study is to provide evidence for using protein language model (pLM) representations instead of hand-crafted features. Compared with the reference study GPS-SUMO, the proposed approach achieved a 4% improvement in the area under the ROC curve (ROC AUC); the smooth curve of class threshold indicates that the predictor can be calibrated to a desired estimation of sumoylation event probability without animations.
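For readers unfamiliar with the metric, the following Python sketch shows one common way to compute ROC AUC and to apply a class threshold to predicted probabilities. It is an illustration under stated assumptions, not the evaluation code of the thesis: the function names are hypothetical, and the AUC is computed with the pairwise-ranking formulation (the probability that a randomly chosen positive site is scored above a randomly chosen negative one).

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute ROC AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def predict_at_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Label a site as sumoylated (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]
```

Because ROC AUC depends only on the ranking of the scores, it is independent of any particular threshold. The threshold is chosen afterwards to trade sensitivity against specificity.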

By employing dense, feature-rich representations from language models, the study demonstrates the advantages of using pLMs over hand-crafted feature extractors. The protein-wide context contains the information necessary for efficient prediction, which highlights the importance of pLM development for further advances in computational proteome studies.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.