Download (624.0 MB)
This zip file contains sentences mined from public web forums and blogs. Additional details about the dataset:
- The data is split into training, development, and test sets based on the original domain name the text was mined from.
- The sent_*.txt files are tab-delimited. Each line contains one sentence parsed from a particular post, along with the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post it was obtained from.
- The set's subdirectory contains the groupings used in Section 2 of the paper.
- A 64K word list (used in the paper), plus the 5K and 20K word lists used on the Forum-only models found here: https://digitalcommons.mtu.edu/mobiletext/3/
- Various word lists used.
- Posts and Email development and test sets.
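As a rough illustration of working with the tab-delimited sent_*.txt files described above, the following Python sketch splits each line into its raw fields and counts how many fields the lines have. The column order is not stated here, so the code deliberately does not name the columns. The function names `read_fields` and `field_count_histogram` are hypothetical, and UTF-8 encoding is an assumption.

```python
from collections import Counter
from pathlib import Path
from typing import Iterator, List


def read_fields(path: Path) -> Iterator[List[str]]:
    """Yield each non-empty line of a tab-delimited file as a list of raw fields.

    Assumption: the file is UTF-8 text; undecodable bytes are replaced
    rather than raising an error.
    """
    with path.open(encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line:
                # Column order is not documented here, so fields stay unnamed.
                yield line.split("\t")


def field_count_histogram(path: Path) -> Counter:
    """Count how many lines have each number of tab-separated fields."""
    return Counter(len(fields) for fields in read_fields(path))
```

Running `field_count_histogram` on each file matched by `Path(root).rglob("sent_*.txt")`, where `root` is wherever the zip was extracted, is one way to confirm the column layout before assigning names to the fields.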
For further details, please see the forthcoming paper.
mobile text, text mining, modeling text
Vertanen, Keith and Kristensson, Per Ola, "Dataset 1: Mobile Text Dataset" (2019). Mobile Text Dataset and Language Models. 1.