Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications

Document Type

Article

Publication Date

3-1-2026

Abstract

Occupational safety remains a persistent global challenge despite advances in regulatory frameworks and safety technologies. Unstructured incident narratives, such as accident reports and safety logs, offer valuable context for understanding workplace hazards but remain underutilized because of the lack of safety-specific language models. This study addresses that gap by adapting pretrained transformer-based models (BERT and ALBERT) to the occupational safety domain through Domain-Adaptive Pretraining (DAPT). We construct a large-scale, multi-source corpus of over 2.4 million documents spanning several industrial sectors, including mining, construction, transportation, and chemical processing, augmented with safety-related academic abstracts to preserve general linguistic understanding and mitigate catastrophic forgetting. Using this corpus, we develop two domain-adapted models, safetyBERT and safetyALBERT, through continual pretraining on the masked language modeling objective. Intrinsic evaluation using pseudo-perplexity (PPPL) demonstrates substantial improvements: safetyBERT and safetyALBERT achieve 76.9% and 90.3% reductions in PPPL, respectively, over their general-domain counterparts. Extrinsic evaluation on the Mine Safety and Health Administration (MSHA) injury dataset across three classification tasks (accident type, mining equipment, and degree of injury) shows consistent gains; both models outperform a diverse set of baselines, including general-purpose models (BERT, ALBERT, DistilBERT, RoBERTa), a domain-specific scientific model (SciBERT), and a large language model (Llama 3.1-8B), and safetyALBERT achieves competitive results despite its parameter-efficient design. To further assess generalization in low-resource settings, we evaluate the models on the small-scale Alaska insurance claims dataset from the mining industry across two classification tasks: claim type and injured body part. Both safetyBERT and safetyALBERT maintain strong performance under this constraint, underscoring the value of domain adaptation in data-constrained environments. Additionally, multi-task classification on the MSHA dataset using the safety-domain models shows improved generalization and more balanced performance across underrepresented classes. These findings confirm that DAPT effectively enhances language understanding in safety-critical domains while enabling scalable, resource-efficient deployment. This work lays the foundation for integrating domain-adapted natural language processing (NLP) systems into occupational health and safety management frameworks.
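
The abstract's two core techniques, continual masked language model pretraining (DAPT) and pseudo-perplexity (PPPL) scoring, can be sketched with the Hugging Face transformers library. This is a minimal illustration, not the authors' code: the corpus file name, hyperparameters, and the pseudo_perplexity helper below are assumptions for the sketch, and PPPL follows the masked-token scoring formulation of Salazar et al. (2020).

```python
# Minimal sketch of DAPT (continual MLM pretraining) and pseudo-perplexity.
# Paths, hyperparameters, and helper names are illustrative assumptions.
import math
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"  # starting checkpoint for domain adaptation

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# --- Domain-adaptive pretraining (continual MLM) ---------------------------
# Assumes a plain-text file of safety narratives, one document per line.
raw = load_dataset("text", data_files={"train": "safety_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, following BERT's standard MLM recipe.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="safetyBERT",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
# trainer.train()  # continual pretraining on the MLM objective

# --- Intrinsic evaluation: pseudo-perplexity (PPPL) -------------------------
@torch.no_grad()
def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and score the true token under the model."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nll, n = 0.0, 0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[ids[i]].item()
        n += 1
    return math.exp(nll / n)  # lower is better; a domain-adapted model
                              # should score safety narratives lower
```

A lower PPPL on held-out incident narratives indicates the adapted model assigns higher probability to domain text, which is the sense in which the reported 76.9% and 90.3% reductions measure improvement.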

Publication Title

Expert Systems with Applications
