Dataset 3: Forum only language models

Files

Download Forum, 5k, 1-gram (34 KB)

Download Forum, 5k, 2-gram, small (6.9 MB)

Download Forum, 5k, 2-gram, large (23.8 MB)

Download Forum, 5k, 3-gram, small (31.2 MB)

Download Forum, 5k, 3-gram, large (209.0 MB)

Download Forum, 5k, 4-gram, small (39.2 MB)

Download Forum, 5k, 4-gram, large (303.9 MB)

Download Forum, 20k, 1-gram (126 KB)

Download Forum, 20k, 2-gram, small (11.3 MB)

Download Forum, 20k, 2-gram, large (48.5 MB)

Download Forum, 20k, 3-gram, small (38.8 MB)

Download Forum, 20k, 3-gram, large (314.4 MB)

Download Forum, 20k, 4-gram, small (41.9 MB)

Download Forum, 20k, 4-gram, large (352.2 MB)

Download Forum, 64k, 1-gram (356 KB)

Download Forum, 64k, 2-gram, small (13.2 MB)

Download Forum, 64k, 2-gram, large (60.9 MB)

Download Forum, 64k, 3-gram, small (41.4 MB)

Download Forum, 64k, 3-gram, large (347.2 MB)

Download Forum, 64k, 4-gram, small (43.1 MB)

Download Forum, 64k, 4-gram, large (359.9 MB)

Description

These word language models were trained only on the forum data (141M words). Each model is available with a 5K, 20K, or 64K vocabulary and as a 1-gram, 2-gram, 3-gram, or 4-gram model. For the 2-gram, 3-gram, and 4-gram models, two different entropy pruning thresholds were used to create a small and a large version of each; the 1-gram models come in a single version.
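As a quick orientation aid, the combinations described above can be enumerated programmatically. The following is a minimal sketch, not part of the dataset: the function name `list_variants` and the label strings are illustrative assumptions modeled on the entries in the Files list, and the rule that 1-gram models have no small/large split is inferred from that list.

```python
from itertools import product

# Values taken from the Description and the Files list.
VOCAB_SIZES = ["5k", "20k", "64k"]
NGRAM_ORDERS = [1, 2, 3, 4]
PRUNING_LEVELS = ["small", "large"]


def list_variants():
    """Return one label per model variant (hypothetical helper).

    Assumption: 1-gram models have a single version, while higher-order
    models have a small and a large pruned version, as the Files list
    suggests.
    """
    variants = []
    for vocab, order in product(VOCAB_SIZES, NGRAM_ORDERS):
        if order == 1:
            variants.append(f"Forum, {vocab}, 1-gram")
        else:
            for level in PRUNING_LEVELS:
                variants.append(f"Forum, {vocab}, {order}-gram, {level}")
    return variants


if __name__ == "__main__":
    for label in list_variants():
        print(label)
```

Under these assumptions the sketch yields 21 labels, one for each entry in the Files list: seven per vocabulary size.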

Publication Date

2019

Keywords

text mining, text analysis, mobile text

Disciplines

Computer Sciences

Publisher's Statement

This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which is available on Digital Commons @ Michigan Tech.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
