Download (624.0 MB)
This zip file contains the sentences mined from public web forums and blogs. Additional details about the dataset:
- The data is split into training, development, and test sets based on the original domain name the text was mined from.
- The sent_*.txt files are tab-delimited, with one sentence per line, each parsed from a particular post. Each line also contains the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post the sentence was obtained from.
- The set's subdirectory contains the groupings used in Section 2 of the paper.
- A 64K word list (used in the paper), and the 5K and 20K word lists used with the Forum-only models found here: https://digitalcommons.mtu.edu/mobiletext/3/
- Various word lists used.
- Posts and Email development and test sets.
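As a rough illustration of how the tab-delimited sent_*.txt files described above might be read, here is a minimal Python sketch. The column order in ASSUMED_COLUMNS, the field names, and the function name read_sentences are assumptions made for this example; this page lists the fields but does not state which column holds which one, so check a few lines of the actual files before relying on it. The sample line is made up and is not taken from the dataset.

```python
from typing import Dict, Iterable, Iterator, Sequence

# ASSUMPTION: this column order and these field names are illustrative only.
# The dataset description names the fields but not their positions.
ASSUMED_COLUMNS = (
    "sentence",
    "device_name",
    "forum_software",
    "form_factor",   # "tablet" or "phone"
    "device_input",  # "touch" or "touch+key"
)


def read_sentences(
    lines: Iterable[str],
    columns: Sequence[str] = ASSUMED_COLUMNS,
) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-delimited line, keyed by the given column names."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(columns):
            # Skip lines that do not match the assumed layout.
            continue
        yield dict(zip(columns, fields))


# Made-up sample line, used only to show the call shape.
sample = ["see you later\tExamplePhone\tExampleForum\tphone\ttouch\n"]
for record in read_sentences(sample):
    print(record["form_factor"], "|", record["sentence"])
```

To read a real file, pass an open file object instead of the sample list, for example `open(path, encoding="utf-8")`; the encoding here is also an assumption. Passing a different `columns` tuple adapts the sketch if the actual column order differs.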
Publication Date
2019
Keywords
mobile text, text mining, modeling text
Disciplines
Computer Sciences
Recommended Citation
Vertanen, Keith and Kristensson, Per Ola, "Dataset 1: Mobile Text Dataset" (2019). Mobile Text Dataset and Language Models. 1.
Publisher's Statement
This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which can be accessed here on Digital Commons @ Michigan Tech.