Cluster Validity for Fuzzy Text Segmentation

Document Type

Conference Proceeding

Publication Date



College of Computing; Department of Computer Science


Topical text segmentation is an unsupervised learning process that separates documents, transcripts, and other text streams into segments (i.e., clusters), where the text in each segment is considered topically similar and distinct from the text in other segments. In this paper, we consider the task of fuzzy text segmentation, where words or utterances have shared membership in all segments. This is especially relevant for text sources like transcripts, where multiple topics are often discussed simultaneously: e.g., cost and deliverables in a sales meeting. One challenge in segmentation and clustering is how to choose the algorithm's hyperparameters (e.g., the number of clusters). To address this, we propose a fuzzy cluster validity metric, a modified Davies-Bouldin index, and demonstrate how this index can be used to tune a fuzzy text segmentation algorithm. We also show how fuzzy clustering can be used as a form of text segmentation and present some applications on benchmark data.
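
The abstract does not state the form of the modified Davies-Bouldin index, so the following is only an illustrative sketch of one common way to make the index fuzzy: cluster centroids and scatters are weighted by each point's membership raised to a fuzzifier m. The function name `fuzzy_davies_bouldin`, the fuzzifier default, and the weighting scheme are all assumptions made for illustration and should not be read as the paper's definition.

```python
import math


def fuzzy_davies_bouldin(points, memberships, m=2.0):
    """Membership-weighted Davies-Bouldin index (lower is better).

    Assumed formulation, not taken from the paper:
    points      -- list of n feature vectors (e.g., utterance embeddings)
    memberships -- n rows of c membership degrees, one column per segment
    m           -- fuzzifier applied to the memberships (hypothetical default)
    Requires at least two clusters with distinct centroids and nonzero
    total membership in every cluster.
    """
    n_clusters = len(memberships[0])
    dim = len(points[0])
    centroids, scatters = [], []
    for k in range(n_clusters):
        weights = [u[k] ** m for u in memberships]
        total = sum(weights)
        # Membership-weighted mean of the points.
        centroid = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
        # Membership-weighted mean distance to that centroid.
        scatter = sum(
            w * math.dist(p, centroid) for w, p in zip(weights, points)
        ) / total
        centroids.append(centroid)
        scatters.append(scatter)
    # For each cluster, take its worst (largest) similarity ratio to any other.
    worst = [
        max(
            (scatters[k] + scatters[j]) / math.dist(centroids[k], centroids[j])
            for j in range(n_clusters)
            if j != k
        )
        for k in range(n_clusters)
    ]
    return sum(worst) / n_clusters
```

Under these assumptions, tuning a hyperparameter such as the number of segments would amount to running the segmentation algorithm for each candidate value and keeping the value with the lowest index.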

Publication Title

IEEE International Conference on Fuzzy Systems