Mobile Text Dataset and Language Models

This collection contains supplementary materials for the paper "Mining, Analyzing, and Modeling Text Written on Mobile Devices." Access the paper here: https://digitalcommons.mtu.edu/michigantech-p/934/

Dataset 1: Mobile Text Dataset

Keith Vertanen and Per Ola Kristensson

This zip file contains the sentences mined from public web forums and blogs. Additional details about the dataset:
- The data is split into training, development, and test sets based on the original domain name the text was mined from.
- The sent_*.txt files are tab-delimited, with one sentence per line, each parsed from a particular post. Each line also contains the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post the sentence was obtained from.
- The sets subdirectory contains the groupings used in Section 2 of the paper.
- The 64K word list used in the paper, and the 5K and 20K word lists used for the forum-only models found here: https://digitalcommons.mtu.edu/mobiletext/3/
- Various word lists used.
- Posts and Email development and test sets.
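
As a rough illustration of the tab-delimited layout described above, the following Python sketch reads such lines and counts sentences by form factor. The column order in FIELDS is an assumption (the description lists the fields but not their order), and the sample rows are invented for the example, so check an actual sent_*.txt file before relying on it.

```python
import csv
import io
from collections import Counter

# Hypothetical column order: the dataset description names these fields
# but does not state their order in the file.
FIELDS = ["sentence", "device", "forum_software", "form_factor", "input_type"]

def read_sentences(handle):
    """Yield one dict per tab-delimited line, skipping malformed lines."""
    # QUOTE_NONE keeps quotes and apostrophes in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue
        yield dict(zip(FIELDS, row))

# Invented sample lines standing in for a real sent_*.txt file.
sample = io.StringIO(
    "see you at noon\tiPhone\tvBulletin\tphone\ttouch\n"
    "running late sorry\tiPad\tphpBB\ttablet\ttouch\n"
    "this line is malformed\n"
)
rows = list(read_sentences(sample))
by_form_factor = Counter(row["form_factor"] for row in rows)
print(by_form_factor)
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) instead of the in-memory sample.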

Dataset 2: Recommended Language Models

Keith Vertanen and Per Ola Kristensson

These word models were trained with a sentence start word of <s>, a sentence end word of </s>, and an unknown word <unk>. The word vocabulary was the most frequent 64K words in the forum dataset that were also in a list of 330K known English words. All words are in lowercase. The character models are 12-gram models and were trained using interpolated Witten-Bell smoothing. The character model vocabulary consists of the lowercase letters a-z, apostrophe, <sp> for a space, <s> for sentence start, and </s> for sentence end.
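
To make the two vocabularies concrete, here is a minimal Python sketch that maps a sentence to character-model tokens and to word-model tokens. The function names are hypothetical, and two behaviors are assumptions rather than the authors' documented preprocessing: characters outside the character vocabulary are simply dropped, and words are split on whitespace.

```python
import string

# Character vocabulary from the description: a-z and apostrophe
# (plus the special tokens <sp>, <s>, and </s>).
ALLOWED = set(string.ascii_lowercase) | {"'"}

def char_tokens(sentence):
    """Map a sentence to character-model tokens."""
    # Assumption: characters outside the vocabulary are dropped, and words
    # left empty after filtering are skipped.
    words = ["".join(ch for ch in w if ch in ALLOWED)
             for w in sentence.lower().split()]
    words = [w for w in words if w]
    tokens = ["<s>"]
    for index, word in enumerate(words):
        if index > 0:
            tokens.append("<sp>")  # one space token between words
        tokens.extend(word)
    tokens.append("</s>")
    return tokens

def word_tokens(sentence, vocab):
    """Lowercase, split on whitespace, and map unknown words to <unk>."""
    words = [w if w in vocab else "<unk>" for w in sentence.lower().split()]
    return ["<s>"] + words + ["</s>"]

print(char_tokens("I'm ok"))
print(word_tokens("See you zzyzx", {"see", "you"}))
```

With a real word list, `vocab` would be a set loaded from one of the word list files rather than the two-word toy set used here.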

The perplexities in the above table are per-word or per-letter perplexities, averaged over four evaluation test sets. The test sets were:
- Held out forum data from our mined collection of forum posts.
- Messages written by Enron employees on their BlackBerry mobile devices (the Enron mobile dataset).
- Tweets written on a mobile device in the summer of 2015.
- SMS messages from the NUS SMS corpus and the Mobile Forensics Text Message corpus.
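
For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-token perplexity from per-token log10 probabilities and then average it over several test sets. The set names and probability values are toy placeholders, not results from the paper, and the use of an unweighted arithmetic mean across sets is an assumption about how the averaging was done.

```python
def perplexity(log10_probs):
    """Per-token perplexity from per-token log10 probabilities."""
    average = sum(log10_probs) / len(log10_probs)
    return 10 ** (-average)

def mean_perplexity(per_set_log10_probs):
    """Unweighted mean of the per-set perplexities (assumed averaging scheme)."""
    values = [perplexity(lp) for lp in per_set_log10_probs.values()]
    return sum(values) / len(values)

# Toy values only: every token in set_a has probability 0.1,
# and every token in set_b has probability 0.01.
toy_sets = {
    "set_a": [-1.0, -1.0, -1.0],
    "set_b": [-2.0, -2.0],
}
print(perplexity(toy_sets["set_a"]))
print(mean_perplexity(toy_sets))
```

Averaging the perplexities per set, rather than pooling all tokens first, gives each test set equal weight regardless of its size.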

The above mixture models were trained on a total of 504M words of data: 126M words of forum data, 126M words from Twitter's streaming API between December 2010 and June 2012, 126M words of forum data from the ICWSM 2011 Spinn3r dataset, and 126M words of blog data from the ICWSM 2009 Spinn3r dataset.

Dataset 3: Forum only language models

Keith Vertanen and Per Ola Kristensson

These word language models were trained on only the forum data (141M words). For these models there is a choice of 5K, 20K, or 64K vocabulary sizes. These are available as 1-gram, 2-gram, 3-gram, or 4-gram models. Different entropy pruning thresholds were used to create a small and large version of each word language model.
