Deciphering Stereotypes in Pre-Trained Language Models

Ma, Weicheng; Scheible, Henry; Wang, Brian; Veeramachaneni, Goutham; Chowdhary, Pratim; Sun, Alan; Koulogeorge, Andrew; Wang, Lili; Yang, Diyi; Vosoughi, Soroush

doi:10.18653/v1/2023.emnlp-main.697

Article Status

Published

Authors/contributors

Ma, Weicheng (Author)
Scheible, Henry (Author)
Wang, Brian (Author)
Veeramachaneni, Goutham (Author)
Chowdhary, Pratim (Author)
Sun, Alan (Author)
Koulogeorge, Andrew (Author)
Wang, Lili (Author)
Yang, Diyi (Author)
Vosoughi, Soroush (Author)
Bouamor, Houda (Editor)
Pino, Juan (Editor)
Bali, Kalika (Editor)

Title

Deciphering Stereotypes in Pre-Trained Language Models

Abstract

Warning: This paper contains content that is stereotypical and may be upsetting. This paper addresses the issue of demographic stereotypes present in Transformer-based pre-trained language models (PLMs) and aims to deepen our understanding of how these biases are encoded in these models. To accomplish this, we introduce an easy-to-use framework for examining the stereotype-encoding behavior of PLMs through a combination of model probing and textual analyses. Our findings reveal that a small subset of attention heads within PLMs are primarily responsible for encoding stereotypes and that stereotypes toward specific minority groups can be identified using attention maps on these attention heads. Leveraging these insights, we propose an attention-head pruning method as a viable approach for debiasing PLMs, without compromising their language modeling capabilities or adversely affecting their performance on downstream tasks.

Proceedings Title

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Conference Name

2023 Conference on Empirical Methods in Natural Language Processing

Publisher

Association for Computational Linguistics

Place

Singapore

Date

2023

Pages

11328-11345

DOI

10.18653/v1/2023.emnlp-main.697

Citation Key

ma2023a

URL

https://aclanthology.org/2023.emnlp-main.697

Extra

<Title>: Deciphering Stereotypes in Pre-Trained Language Models
Read_Status: New
Read_Status_Date: 2026-01-26T11:33:43.294Z

Citation

Ma, W., Scheible, H., Wang, B., Veeramachaneni, G., Chowdhary, P., Sun, A., Koulogeorge, A., Wang, L., Yang, D., & Vosoughi, S. (2023). Deciphering Stereotypes in Pre-Trained Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11328–11345). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.697

Link to this record

https://aievidencehub.org/lib/X5HC6Y4M