L3i TextCopies Dataset

This dataset contains clean, text-only, typewritten documents in a mix of single-column and double-column layouts. The text uses all of the characters, and only those characters, that Tesseract can recognize with its default English training data.

The dataset is made of 22 pages of text with the following characteristics:

  • 1 page of a scientific article with a single-column header and a double-column body
  • 3 pages of scientific articles with a double-column layout
  • 2 pages of programming code with a single-column layout
  • 4 pages of a novel with a single-column layout
  • 2 pages of legal texts with a single-column layout
  • 4 pages of invoices with a single-column layout
  • 4 pages of payslips with a single-column layout
  • 2 pages of birth certificate extracts with a single-column layout

Several variants of these 22 text pages were created by combining:

  • 6 fonts: Arial, Calibri, Courier, Times New Roman, Trebuchet and Verdana
  • 3 font sizes: 8, 10 and 12 points
  • 4 styles: normal, bold, italic and the combination of bold and italic
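
As a quick check of the combinations listed above, the following Python sketch enumerates every (page, font, size, style) tuple. The variable names and the page numbering from 1 to 22 are illustrative assumptions, not part of the dataset's actual file naming.

```python
from itertools import product

# Values taken from the lists above; the names are illustrative only.
FONTS = ["Arial", "Calibri", "Courier", "Times New Roman", "Trebuchet", "Verdana"]
SIZES_PT = [8, 10, 12]
STYLES = ["normal", "bold", "italic", "bold italic"]
N_PAGES = 22  # number of source text pages

# Every font / size / style combination applied to a page.
variants = list(product(FONTS, SIZES_PT, STYLES))

# Every source page rendered in every variant (page numbers are hypothetical).
documents = [
    (page, font, size, style)
    for page in range(1, N_PAGES + 1)
    for font, size, style in variants
]

print(len(variants), "variants per page")
print(len(documents), "documents in total")
```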

This makes 22 × 6 × 3 × 4 = 1584 documents. We printed these documents with three printers (a Konica Minolta Bizhub 223, a Sharp MX M904 and a Sharp MX M850) and scanned them with three scanners at several resolutions between 150 dpi and 600 dpi. Each document was scanned the same number of times at each resolution, both to obtain enough images and to avoid statistical bias.

This makes a dataset of 42768 document images, that is, 27 images per document.
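
The image count can be cross-checked with a little arithmetic. The sketch below assumes that every document was printed on each of the three printers and scanned on each of the three scanners once per resolution; this full cross is an assumption, not something stated above. Under that assumption, the totals imply three scanning resolutions.

```python
# Totals stated in this document.
N_DOCUMENTS = 1584
N_IMAGES = 42768
N_PRINTERS = 3
N_SCANNERS = 3

# Number of scanned images per source document.
images_per_document, leftover = divmod(N_IMAGES, N_DOCUMENTS)

# ASSUMPTION: every printer/scanner pair was used once per resolution.
implied_resolutions, unexplained = divmod(
    images_per_document, N_PRINTERS * N_SCANNERS
)

print("images per document:", images_per_document)
print("implied number of resolutions:", implied_resolutions)
```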

The dataset webpage is http://navidomass.univ-lr.fr/TextCopies/
The dataset is hosted on sftp://pixlshare.univ-lr.fr.
Its size is 36GB.

Please contact sebastien (dot) eske (at) univ-lr (dot) fr to request access.