Document Type
Article
Publication Date
6-2026
Department
Department of Mechanical and Aerospace Engineering
Abstract
The success of machine learning models in industrial applications depends heavily on the quality of the datasets used to train them. However, large-scale datasets, especially those constructed through crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in the manufacturing domain, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitation and Refinement (VLSR), a vision-language-based framework for label sanitation and refinement in multi-label manufacturing image datasets. The method uses the CLIP vision-language model to embed both images and their associated textual labels into a shared semantic space, and then addresses two key tasks using cosine similarity between embeddings. First, a similarity-based dataset quality assessment compares each image-label pair to identify irrelevant, misspelled, or semantically weak labels and to surface the most semantically aligned label for each image. Second, density-based clustering is applied to the text embeddings to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is used to evaluate the proposed framework. Experimental results demonstrate that VLSR successfully identifies problematic labels and improves label consistency. To assess the impact of VLSR on classification applications, experiments were conducted with five classification models. Baseline accuracy ranged from 16% to 24%; VLSR cleaning improved accuracy for all models, with ConvNeXt rising to 49%. The strongest results were achieved by filtering the dataset based on VLSR, where accuracy reached 81% with ConvNeXt. Standard augmentations on the filtered set did not yield further gains, which suggests that dataset curation is the primary driver of classification performance. This work therefore presents a solution for dataset curation in multi-label manufacturing scenarios where label noise is prevalent.
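The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are made-up stand-ins for CLIP embeddings, the function names `assess_labels` and `group_labels` are hypothetical, and the `threshold`, `eps`, and `min_samples` values are arbitrary assumptions. The abstract only says "density-based clustering", so the DBSCAN-style procedure below is one possible choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assess_labels(image_emb, label_embs, threshold=0.5):
    """Score every label against one image embedding.

    Returns the best-aligned label, the labels scoring below `threshold`
    (an assumed cut-off), and the raw scores.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    flagged = sorted(name for name, s in scores.items() if s < threshold)
    return best, flagged, scores

def group_labels(label_embs, eps=0.25, min_samples=2):
    """DBSCAN-style grouping of labels on cosine distance (1 - similarity).

    Returns a dict mapping label -> cluster id, with -1 meaning noise.
    """
    names = list(label_embs)

    def neighbors(i):
        # All points (including i itself) within cosine distance eps of i.
        return [j for j in range(len(names))
                if 1.0 - cosine(label_embs[names[i]], label_embs[names[j]]) <= eps]

    cluster = {}
    next_id = 0
    for i in range(len(names)):
        if i in cluster:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            cluster[i] = -1          # provisionally noise
            continue
        cluster[i] = next_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if j in cluster and cluster[j] != -1:
                continue             # already belongs to a cluster
            previously_noise = j in cluster
            cluster[j] = next_id
            if previously_noise:
                continue             # border point: joins but does not expand
            nb = neighbors(j)
            if len(nb) >= min_samples:
                queue.extend(nb)     # core point: keep expanding the cluster
        next_id += 1
    return {names[i]: cid for i, cid in cluster.items()}

# Toy embeddings (assumptions, not real CLIP outputs).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "cnc machine": [1.0, 0.0, 0.0],
    "milling machine": [0.8, 0.6, 0.0],
    "cat": [0.0, 0.0, 1.0],
}

best, flagged, scores = assess_labels(image_emb, label_embs)
groups = group_labels(label_embs)
print(best, flagged, groups)
```

In a real pipeline the vectors would come from a CLIP image encoder and text encoder, and the threshold and clustering parameters would need to be tuned on the dataset at hand.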
Publication Title
Machine Learning with Applications
Recommended Citation
Mahjourian, N., & Nguyen, V. (2026). Sanitizing manufacturing dataset labels using vision-language models. Machine Learning with Applications, 24. http://doi.org/10.1016/j.mlwa.2026.100893
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2487
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Version
Publisher's PDF
Publisher's Statement
© 2026 The Authors. Published by Elsevier Ltd. https://doi.org/10.1016/j.mlwa.2026.100893