Dissertations, Master's Theses and Master's Reports

Finer details of language modeling: text segmentation, working within resource limits, and watermarking

Evan Gordon Lucas, Michigan Technological University

Date of Award

2023

Document Type

Open Access Dissertation

Degree Name

Doctor of Philosophy in Electrical Engineering (PhD)

Administrative Home Department

Department of Electrical and Computer Engineering

Advisor 1

Timothy C. Havens

Committee Member 1

Timothy Schulz

Committee Member 2

Sidike Paheding

Committee Member 3

Anthony J. Pinar

Abstract

Language modeling is a vast sub-field of natural language processing, and this work addresses several specific problems within it: how to segment texts, how to improve sparse transformer performance on summarization tasks, how to use character-level models for dialect determination, how to watermark large language models, and how to incorporate minimal human feedback for continual or online learning. Although these topics are varied, they all relate to the general problem of handling sequential data. Language and text can be thought of as a high-dimensional sequence in which each chunk (a word, a character, or several of either) carries meaning that depends on adjacent chunks. This meaning is often represented with an embedding whose dimension is in the hundreds. The chapters of this dissertation each consider a different scale of this representation, from the character level up to the sentence level.

Chapter 2 starts from the most basic unit of information, using a single bit to generate training labels. In this work, we query a "human" about whether the model's classification predictions are correct. If a prediction is correct, the example can be trained on normally; if it is incorrect, the training label must be chosen intelligently to make use of that information. This chapter explores different surrogate labeling strategies and their impact on model performance. Moving to the small end of the representation scale, Chapter 3 uses character-level features to differentiate between Southwestern and Eastern Ojibwe.

From here, we generalize to the sub-word token scale, where each meaningful unit is either a word or part of a word. This is also the most common representation used in large language models today, and two chapters of this dissertation focus on developments at this level. Chapter 4 fine-tunes a backdoor watermark into a common large language model and then demonstrates an attack that reveals the trigger word or phrase. Backdoor watermarks have been proposed as a way to assert ownership of publicly released language (and other AI) models. The demonstrated attack generates text using a sampling-based approach and then performs frequency analysis on the generated text. Chapter 7 addresses the problem of summarizing large texts with sparse attention transformers and attempts to improve summarization performance by adding global self-attention to selected tokens.

Finally, we end with sentence-level representations. Chapter 5 proposes a new text segmentation metric that does not require a reference segmentation set, a requirement of all existing segmentation metrics. The proposed method generates an embedding for each sentence in a text and modifies a common cluster validity index to function as a text segmentation metric. Experiments are performed to demonstrate the correlation between existing reference-based segmentation metrics and the proposed metric. Chapter 6 builds on this work by further modifying the proposed segmentation metric to operate as a fuzzy clustering metric and attempting to treat text segmentation as a fuzzy problem. Fuzzy text segmentation is a relatively unexplored field, even though language often holds multiple meanings and is an inherently fuzzy medium. Although each of these works approaches a slightly different problem, each provides an original and substantial contribution to the enormous and exciting field of natural language processing.
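
As an illustration of the single-bit feedback setting described for Chapter 2, the following is a minimal Python sketch of one possible surrogate labeling strategy. It is not taken from the dissertation: the function name `surrogate_label` and the renormalization rule (drop the rejected class and rescale the model's remaining probabilities) are assumptions chosen for illustration, and the dissertation may use different strategies.

```python
from typing import List


def surrogate_label(probs: List[float], predicted: int, is_correct: bool) -> List[float]:
    """Build a training target from one bit of feedback.

    Hypothetical helper, not the dissertation's method.
    probs:      model's predicted class probabilities
    predicted:  index of the class the model chose
    is_correct: the single bit of feedback from the "human"
    """
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two classes")
    if is_correct:
        # Feedback confirms the prediction: use an ordinary one-hot target.
        return [1.0 if i == predicted else 0.0 for i in range(n)]
    # Feedback rejects the prediction: remove its probability mass and
    # renormalize what the model assigned to the remaining classes.
    remaining = [0.0 if i == predicted else p for i, p in enumerate(probs)]
    total = sum(remaining)
    if total == 0.0:
        # Degenerate case: spread mass uniformly over the other classes.
        return [0.0 if i == predicted else 1.0 / (n - 1) for i in range(n)]
    return [p / total for p in remaining]


if __name__ == "__main__":
    # The model favored class 0, and the feedback says that was wrong.
    print(surrogate_label([0.5, 0.25, 0.25], predicted=0, is_correct=False))
```

The returned list can serve as a soft target for a cross-entropy loss. When the feedback bit is positive, the target reduces to the usual one-hot label, which matches the "trained on normally" case in the abstract.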

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License

Recommended Citation

Lucas, Evan Gordon, "Finer details of language modeling: text segmentation, working within resource limits, and watermarking", Open Access Dissertation, Michigan Technological University, 2023.

https://doi.org/10.37099/mtu.dc.etdr/1599

Included in

Artificial Intelligence and Robotics Commons

ORCID

0009-0001-6425-3416
