Less is More: Parameter-Free Text Classification with Gzip
Article Status
Published
Authors/contributors
- Jiang, Zhiying (Author)
- Yang, Matthew Y. R. (Author)
- Tsirlin, Mikhail (Author)
- Tang, Raphael (Author)
- Lin, Jimmy (Author)
Title
Less is More: Parameter-Free Text Classification with Gzip
Abstract
Deep neural networks (DNNs) are often used for text classification tasks as they usually achieve high levels of accuracy. However, DNNs can be computationally intensive, with billions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that is easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a $k$-nearest-neighbor classifier. Without any training, pre-training or fine-tuning, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve satisfactory accuracy.
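The compressor-plus-kNN idea the abstract describes can be sketched in a few lines: use gzip's compressed length as a proxy for Kolmogorov complexity, compute a Normalized Compression Distance (NCD) between the query and each training text, and take a majority vote over the $k$ nearest neighbors. This is a minimal illustration, not the authors' released implementation; the function names and the toy dataset are invented for the example.

```python
import gzip

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: small when x and y share structure."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(query: str, train: list[tuple[str, str]], k: int = 2) -> str:
    """Label the query by majority vote among its k nearest training texts."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Toy training set (illustrative only; results on strings this short are noisy).
train = [
    ("the striker scored two goals in the final match", "sports"),
    ("the goalkeeper saved a penalty in extra time", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stocks fell sharply after the earnings report", "finance"),
]
print(knn_classify("the midfielder scored a late goal", train))
```

Note that no parameters are learned: classification cost is one compression per training document per query, which is why the method transfers to OOD and low-resource settings without fine-tuning.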
Repository
arXiv
Archive ID
arXiv:2212.09410
Date
2022-12-19
Citation Key
jiang2022
Accessed
19/10/2023, 17:39
Short Title
Less is More
Extra
arXiv:2212.09410 [cs]
Citation
Jiang, Z., Yang, M. Y. R., Tsirlin, M., Tang, R., & Lin, J. (2022). Less is More: Parameter-Free Text Classification with Gzip (arXiv:2212.09410). arXiv. http://arxiv.org/abs/2212.09410