Document Type

Article

Publication Date

6-2026

Department

Department of Mechanical and Aerospace Engineering

Abstract

The success of machine learning models in industrial applications depends heavily on the quality of the datasets used to train them. However, large-scale datasets, especially those built through crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. The problem is particularly pronounced in the manufacturing domain, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitation and Refinement (VLSR), a vision-language-based framework for label sanitation and refinement in multi-label manufacturing image datasets. The method uses the CLIP vision-language model to embed both images and their associated textual labels into a shared semantic space, and it addresses two key tasks by computing the cosine similarity between embeddings. First, a similarity-based dataset quality assessment compares image-label pairs to identify irrelevant, misspelled, or semantically weak labels and to surface the most semantically aligned label for each image. Second, density-based clustering is applied to the text embeddings to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is used to evaluate the proposed framework. Experimental results demonstrate that VLSR successfully identifies problematic labels and improves label consistency. To assess the impact of VLSR on classification, experiments were conducted with five classification models. Baseline accuracy ranged from 16% to 24%, while VLSR cleaning improved accuracy for all models, with ConvNeXt rising to 49%. The strongest results were achieved by filtering the dataset based on VLSR, where accuracy reached 81% with ConvNeXt. Standard augmentations on the filtered set did not yield further gains, which suggests that dataset curation is the primary driver of classification performance. This work therefore presents a solution for dataset curation in multi-label manufacturing scenarios where label noise is prevalent.
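The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three-dimensional vectors are hand-made stand-ins for CLIP embeddings, the function names (`assess_labels`, `group_labels`) are hypothetical, and the threshold, `eps`, and `min_samples` values are arbitrary assumptions. The abstract says only that the clustering is density-based, so a small DBSCAN-style procedure over cosine distance is used here as one possible choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assess_labels(image_emb, label_embs, threshold):
    """Score every label against one image embedding.

    Returns the best-aligned label and the labels scoring below threshold.
    """
    scores = {name: cosine(image_emb, emb) for name, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    weak = sorted(name for name, s in scores.items() if s < threshold)
    return best, weak


def group_labels(label_embs, eps, min_samples):
    """DBSCAN-style grouping on cosine distance (1 - cosine similarity).

    Returns {label: group id}; -1 marks labels that join no group.
    """
    names = list(label_embs)
    vecs = [label_embs[n] for n in names]

    def neighbors(i):
        # A point counts as its own neighbor, as in common DBSCAN variants.
        return [j for j in range(len(vecs))
                if 1.0 - cosine(vecs[i], vecs[j]) <= eps]

    noise = -1
    assigned = {}
    next_id = 0
    for i in range(len(vecs)):
        if i in assigned:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            assigned[i] = noise
            continue
        assigned[i] = next_id
        while seeds:
            j = seeds.pop()
            if j in assigned and assigned[j] != noise:
                continue  # already belongs to a group
            unvisited = j not in assigned
            assigned[j] = next_id
            if unvisited:
                nbrs = neighbors(j)
                if len(nbrs) >= min_samples:
                    seeds.extend(nbrs)  # j is a core point: keep expanding
        next_id += 1
    return {names[i]: assigned[i] for i in range(len(names))}


# Toy stand-ins for embeddings (assumed values, not CLIP outputs).
LABEL_EMBS = {
    "cnc machine": [0.90, 0.10, 0.00],
    "cnc mill": [0.85, 0.20, 0.00],
    "lathe": [0.10, 0.90, 0.10],
    "sunset": [0.00, 0.10, 0.95],
}
IMAGE_EMB = [0.80, 0.30, 0.05]

best, weak = assess_labels(IMAGE_EMB, LABEL_EMBS, threshold=0.3)
groups = group_labels(LABEL_EMBS, eps=0.05, min_samples=2)
print(best, weak, groups)
```

With these toy vectors, "cnc mill" is the best-aligned label, "sunset" falls below the threshold, and "cnc machine" and "cnc mill" share a group while the other two labels remain ungrouped. In practice the embeddings would come from a CLIP image and text encoder, and the thresholds would need tuning on the dataset at hand.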

Publisher's Statement

© 2026 The Authors. Published by Elsevier Ltd. https://doi.org/10.1016/j.mlwa.2026.100893

Publication Title

Machine Learning with Applications

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Version

Publisher's PDF
