"Dataset 2: Recommended Language Models" by Keith Vertanen and Per Ola Kristensson

Mobile Text Dataset and Language Models

Title

Dataset 2: Recommended Language Models

Authors

Keith Vertanen, Michigan Technological University
Per Ola Kristensson, University of Cambridge

Files

Download 30 characters, 12-gram, tiny (4.2 MB)

Download 30 characters, 12-gram, small (39.2 MB)

Download 30 characters, 12-gram, large (398.9 MB)

Download 64k words, 3-gram, tiny (4.0 MB)

Download 64k words, 3-gram, small (39.9 MB)

Download 64k words, 3-gram, large (400.2 MB)

Description

These word models were trained with a sentence start word of <s>, a sentence end word of </s>, and an unknown word <unk>. The word vocabulary was the most frequent 64K words in the forum dataset that were also in a list of 330K known English words. All words are in lowercase. The character models are 12-gram models and were trained using interpolated Witten-Bell smoothing. The character model vocabulary consists of the lowercase letters a-z, apostrophe, <sp> for a space, <s> for sentence start, and </s> for sentence end.
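As an illustration of the vocabularies described above, the following Python sketch maps raw text to the token sequences the word and character models would expect. The function names are hypothetical, and two choices are assumptions not stated in the description: characters outside the character vocabulary are simply dropped, and words are split on whitespace only.

```python
import re

SENT_START, SENT_END, SPACE, UNK = "<s>", "</s>", "<sp>", "<unk>"


def to_char_tokens(sentence):
    """Map a sentence to character-model symbols: a-z, apostrophe, <sp>, <s>, </s>."""
    text = sentence.lower()
    # Assumption: characters outside the vocabulary are dropped, not mapped.
    text = re.sub(r"[^a-z' ]", "", text)
    # Collapse runs of spaces so each word boundary becomes a single <sp>.
    text = " ".join(text.split())
    tokens = [SENT_START]
    for ch in text:
        tokens.append(SPACE if ch == " " else ch)
    tokens.append(SENT_END)
    return tokens


def to_word_tokens(sentence, vocab):
    """Map a sentence to word-model symbols, replacing unknown words with <unk>."""
    # Assumption: whitespace tokenization; the dataset's own tokenizer may differ.
    words = sentence.lower().split()
    return [SENT_START] + [w if w in vocab else UNK for w in words] + [SENT_END]


if __name__ == "__main__":
    print(to_char_tokens("It's ok"))
    print(to_word_tokens("Hello zzyzx", {"hello"}))
```

The 26 letters plus apostrophe, <sp>, <s>, and </s> give the 30 symbols named in the character model file listings.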

The perplexities in the above table are per-word or per-letter perplexities, averaged over four evaluation test sets. The test sets were:

Held-out forum data from our mined collection of forum posts.
Messages written by Enron employees on their BlackBerry mobile devices (the Enron mobile dataset).
Tweets written on a mobile device in the summer of 2015.
SMS messages from the NUS SMS corpus and the Mobile Forensics Text Message corpus.
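For readers who want to reproduce this kind of figure, here is a minimal Python sketch of a per-token perplexity averaged across several test sets. It assumes you already have the base-10 log probability a model assigned to each token of each test set; the function names are hypothetical. Taking the arithmetic mean of the per-set perplexities is one reading of "averaged", and which tokens are counted (for example, whether </s> is included) is also an assumption the description does not settle.

```python
import math


def perplexity(log10_probs):
    """Per-token perplexity from a list of base-10 log probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token probability")
    return 10 ** (-sum(log10_probs) / len(log10_probs))


def average_perplexity(per_set_log10_probs):
    """Arithmetic mean of the perplexities of several test sets (assumed averaging)."""
    ppls = [perplexity(lp) for lp in per_set_log10_probs.values()]
    return sum(ppls) / len(ppls)


if __name__ == "__main__":
    # Toy numbers for illustration only; these are not results from the dataset.
    toy = {"set_a": [-1.0, -1.0], "set_b": [-2.0, -2.0]}
    print(average_perplexity(toy))
```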

The above mixture models were trained on a total of 504M words of data: 126M words of forum data, 126M words from Twitter's streaming API between December 2010 and June 2012, 126M words of forum data from the ICWSM 2011 Spinn3r dataset, and 126M words of blog data from the ICWSM 2009 Spinn3r dataset.

Publication Date

2019

Keywords

language models, text mining, text analysis

Disciplines

Computer Sciences

Publisher's Statement

This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which can be accessed here on Digital Commons @ Michigan Tech.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Vertanen, Keith and Kristensson, Per Ola, "Dataset 2: Recommended Language Models" (2019). Mobile Text Dataset and Language Models. 2.
https://digitalcommons.mtu.edu/mobiletext/2
