This dataset contains clean, text-only, typewritten documents in a mix of single-column and double-column layouts. The text uses all of the characters, and only the characters, that Tesseract can recognize with its default English training data.
The dataset is made of 22 pages of text with the following characteristics:
Several variants of these 22 text pages were created by combining:
This makes 1584 documents. We printed these documents with three printers (a Konica Minolta Bizhub 223, a Sharp MX M904 and a Sharp MX M850) and scanned them with three scanners at different resolutions between 150 dpi and 600 dpi. Each document was scanned the same number of times at each resolution, both to obtain enough images and to avoid statistical bias.
This makes a dataset of 42768 document images.
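The totals above can be cross-checked with a little arithmetic. The sketch below uses only the figures stated in this description (22 pages, 1584 documents, 42768 images, three printers, three scanners). The derived values are inferences from those totals, not figures given by the authors, and reading the last one as "scans per printer/scanner pair" is an assumption about how the scans were distributed.

```python
# Figures stated in the dataset description.
PAGES = 22
DOCUMENTS = 1584
IMAGES = 42768
PRINTERS = 3
SCANNERS = 3

# Derived values (inferred from the totals, not stated by the authors).
variants_per_page, rem_variants = divmod(DOCUMENTS, PAGES)
images_per_document, rem_images = divmod(IMAGES, DOCUMENTS)

# Assumption: every document went through every printer/scanner pair,
# so the remaining factor is the number of scans per pair.
scans_per_pair, rem_scans = divmod(images_per_document, PRINTERS * SCANNERS)

print(variants_per_page, images_per_document, scans_per_pair)
```

All three divisions are exact, so the stated counts are mutually consistent: each text page has the same number of variants, and each document has the same number of scanned images.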
Please contact sebastien (dot) eske (at) univ-lr (dot) fr to have access to it.
Copyright © 2010-2014 IAPR TC10