This zip file contains sentences mined from public web forums and blogs. Additional details about the dataset:

  • The data is split into training, development, and test sets based on the original domain name the text was mined from.
  • The sent_*.txt files are tab-delimited, with one sentence per line, each parsed from a particular post. Each line also contains the device name, forum software, device form factor (tablet or phone), and device input (touch or touch+key) associated with the post the sentence was obtained from.
  • The set's subdirectory contains the groupings used in Section 2 of the paper.
  • 64K word list (used in the paper), 5K and 20K word lists used on Forum-only models found here:
  • Various word lists used.
  • Posts and Email development and test sets.
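
Since the sent_*.txt files are tab-delimited, they can be read with any standard delimited-text parser. The following is a minimal Python sketch, not part of the dataset itself. The description above does not state the order of the fields on each line, so the column order in ASSUMED_COLUMNS is a hypothetical placeholder, as are the field names and the function name read_sentences. Check a few lines of the actual files and adjust the order before relying on it.

```python
import csv
from typing import Dict, Iterable, Iterator, Sequence

# ASSUMPTION: the real column order is not documented here. This ordering
# and these field names are hypothetical; verify against the actual files.
ASSUMED_COLUMNS = ("sentence", "device", "forum_software", "form_factor", "input")


def read_sentences(
    lines: Iterable[str],
    columns: Sequence[str] = ASSUMED_COLUMNS,
) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed tab-delimited line."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text
    # instead of treating them as CSV field quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and lines whose field count does not match.
        if len(row) != len(columns):
            continue
        yield dict(zip(columns, row))
```

To use it, pass an open file handle (opened with encoding="utf-8" and newline="") for one of the sent_*.txt files, or any iterable of lines. Passing a different tuple as columns is enough to match the true field order once it is known.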

This data supports the paper "Mining, analyzing, and modeling text written on mobile devices," which can be accessed here on Digital Commons @ Michigan Tech.

