Dataset 1: Mobile Text Dataset



Download (624.0 MB)


This zip file contains sentences mined from public web forums and blogs. Additional details about the dataset:

  • The data is split into training, development, and test sets based on the original domain name the text was mined from.
  • The sent_*.txt files are tab-delimited, with one sentence per line, each parsed from a particular post. Each line also gives the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post the sentence was obtained from.
  • The set's subdirectory contains the groupings used in Section 2 of the paper.
  • A 64K word list (used in the paper), plus the 5K and 20K word lists used for the Forum-only models, found here:
  • Various word lists used.
  • Posts and Email development and test sets.
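
As a rough illustration of working with the tab-delimited sent_*.txt files described above, the following Python sketch reads one record per line and tallies sentences by device form factor. The column order in ASSUMED_COLUMNS, the helper names read_sentences and count_by, and the three sample lines are all assumptions made up for this example; they are not taken from the dataset, so check the actual files before relying on them.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: this column order is hypothetical. The dataset description lists
# these fields but does not state the order in which they appear on each line.
ASSUMED_COLUMNS = ("sentence", "device", "forum_software", "form_factor", "input_type")


def read_sentences(handle, columns=ASSUMED_COLUMNS):
    """Yield one dict per tab-delimited line, keyed by the given column names."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            continue  # skip lines that do not match the assumed layout
        yield dict(zip(columns, row))


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


# Invented sample lines standing in for a real sent_*.txt file.
sample = io.StringIO(
    "see you soon\tiPhone 4\tvBulletin\tphone\ttouch\n"
    "running late\tDroid 2\tphpBB\tphone\ttouch+key\n"
    "nice screen\tiPad\tvBulletin\ttablet\ttouch\n"
)

records = list(read_sentences(sample))
form_factor_counts = count_by(records, "form_factor")
print(form_factor_counts)
```

To run this against a real file, the StringIO object could be swapped for something like open(path, encoding="utf-8", newline=""), where the UTF-8 encoding is likewise an assumption.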

Publication Date



Keywords

mobile text, text mining, modeling text


Computer Sciences

Publisher's Statement

This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which can be accessed here on Digital Commons @ Michigan Tech.
