Michigan Tech Publications

Sanitizing manufacturing dataset labels using vision-language models

Document Type

Article

Publication Date

6-2026

Department

Department of Mechanical and Aerospace Engineering

Abstract

The success of machine learning models in industrial applications depends heavily on the quality of the datasets used to train them. However, large-scale datasets, especially those built through crowd-sourcing and web scraping, often suffer from label noise, inconsistencies, and errors. The problem is particularly pronounced in the manufacturing domain, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitation and Refinement (VLSR), a vision-language framework for label sanitation and refinement in multi-label manufacturing image datasets. The method uses the CLIP vision-language model to embed both images and their associated textual labels into a shared semantic space, then addresses two key tasks by computing the cosine similarity between embeddings. First, a similarity-based dataset quality assessment compares image-label pairs to identify irrelevant, misspelled, or semantically weak labels and to surface the most semantically aligned label for each image. Second, density-based clustering on the text embeddings groups semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is used to evaluate the framework. Experimental results demonstrate that VLSR successfully identifies problematic labels and improves label consistency. To assess the impact of VLSR on classification, experiments were conducted with five classification models. Baseline accuracy ranged from 16% to 24%, and VLSR cleaning improved accuracy for all models, with ConvNeXt rising to 49%. The strongest results came from filtering the dataset based on VLSR, where accuracy reached 81% with ConvNeXt. Standard augmentations on the filtered set did not yield further gains, which suggests that dataset curation is the primary driver of classification performance. This work therefore presents a solution for dataset curation in multi-label manufacturing scenarios where label noise is prevalent.
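
The two tasks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three-dimensional vectors below are made-up stand-ins for CLIP embeddings, the label names and the `threshold`, `eps`, and `min_samples` values are hypothetical, and DBSCAN is assumed here as one possible density-based clustering method because the abstract does not name a specific algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assess_labels(image_vec, label_vecs, threshold):
    """Score every label against one image embedding.

    Returns the best-aligned label and the sorted list of labels whose
    similarity falls below the (hypothetical) threshold.
    """
    scores = {name: cosine(image_vec, vec) for name, vec in label_vecs.items()}
    best = max(scores, key=scores.get)
    weak = sorted(name for name, s in scores.items() if s < threshold)
    return best, weak

def dbscan(vectors, eps, min_samples):
    """Minimal DBSCAN over cosine distance (1 - cosine similarity).

    Returns one cluster id per vector; -1 marks noise.
    """
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if 1.0 - cosine(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
    return labels

# Toy data: illustrative vectors, not real CLIP outputs.
image_vec = [0.95, 0.15, 0.0]
label_vecs = {
    "lathe": [1.0, 0.1, 0.0],
    "turning machine": [0.8, 0.4, 0.0],
    "robot arm": [0.0, 1.0, 0.1],
    "holiday": [0.0, 0.0, 1.0],
}

best_label, weak_labels = assess_labels(image_vec, label_vecs, threshold=0.5)
label_groups = dbscan(list(label_vecs.values()), eps=0.1, min_samples=2)
```

With these toy inputs, `best_label` is the label closest to the image, `weak_labels` lists the candidates to review or drop, and `label_groups` assigns the same cluster id to labels whose embeddings lie close together. In practice the vectors would come from a CLIP image encoder and text encoder, and the thresholds would need tuning on the dataset at hand.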

Publication Title

Machine Learning with Applications

Recommended Citation

Mahjourian, N., & Nguyen, V. (2026). Sanitizing manufacturing dataset labels using vision-language models. Machine Learning with Applications, 24. http://doi.org/10.1016/j.mlwa.2026.100893
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/2487

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Version

Publisher's PDF

Included in

Mechanical Engineering Commons
