Title
Dataset 3: Forum only language models
Files
Download Forum, 5k, 1-gram (34 KB)
Download Forum, 5k, 2-gram, small (6.9 MB)
Download Forum, 5k, 2-gram, large (23.8 MB)
Download Forum, 5k, 3-gram, small (31.2 MB)
Download Forum, 5k, 3-gram, large (209.0 MB)
Download Forum, 5k, 4-gram, small (39.2 MB)
Download Forum, 5k, 4-gram, large (303.9 MB)
Download Forum, 20k, 1-gram (126 KB)
Download Forum, 20k, 2-gram, small (11.3 MB)
Download Forum, 20k, 2-gram, large (48.5 MB)
Download Forum, 20k, 3-gram, small (38.8 MB)
Download Forum, 20k, 3-gram, large (314.4 MB)
Download Forum, 20k, 4-gram, small (41.9 MB)
Download Forum, 20k, 4-gram, large (352.2 MB)
Download Forum, 64k, 1-gram (356 KB)
Download Forum, 64k, 2-gram, small (13.2 MB)
Download Forum, 64k, 2-gram, large (60.9 MB)
Download Forum, 64k, 3-gram, small (41.4 MB)
Download Forum, 64k, 3-gram, large (347.2 MB)
Download Forum, 64k, 4-gram, small (43.1 MB)
Download Forum, 64k, 4-gram, large (359.9 MB)
Description
These word language models were trained only on the forum data (141M words). Each model is available with a 5K, 20K, or 64K vocabulary and as a 1-gram, 2-gram, 3-gram, or 4-gram model. For the 2-gram, 3-gram, and 4-gram models, different entropy pruning thresholds were used to create a small and a large version of each.
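As a rough aid for choosing a download, the combinations described above can be enumerated with a short script. This is only a sketch based on the file list on this page: the function names `list_model_variants` and `label` are hypothetical, and the labels mirror the download titles, not the actual file names, which are not given here.

```python
from itertools import product

# Values taken from the file list on this page.
VOCAB_SIZES = ["5k", "20k", "64k"]
ORDERS = [1, 2, 3, 4]
PRUNING_VARIANTS = ["small", "large"]


def list_model_variants():
    """Return (vocab, order, variant) tuples for every listed model."""
    variants = []
    for vocab, order in product(VOCAB_SIZES, ORDERS):
        if order == 1:
            # The 1-gram models are listed without a small/large variant.
            variants.append((vocab, order, None))
        else:
            for size in PRUNING_VARIANTS:
                variants.append((vocab, order, size))
    return variants


def label(vocab, order, size):
    """Build a label in the style of the download titles (hypothetical helper)."""
    base = f"Forum, {vocab}, {order}-gram"
    return base if size is None else f"{base}, {size}"


if __name__ == "__main__":
    for variant in list_model_variants():
        print(label(*variant))
```

Each vocabulary size yields seven models (one 1-gram plus a small and a large version for each of the three higher orders), matching the 21 downloads listed under Files.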
Publication Date
2019
Keywords
text mining, text analysis, mobile text
Disciplines
Computer Sciences
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Vertanen, Keith and Kristensson, Per Ola, "Dataset 3: Forum only language models" (2019). Mobile Text Dataset and Language Models. 3.
https://digitalcommons.mtu.edu/mobiletext/3
Publisher's Statement
This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which is also available on Digital Commons @ Michigan Tech.