Title: A New Generation of Textual Corpora: Mining Corpora from Very Large Collections
Date: 2007
Creator: Crane, Gregory
Creator: Babeu, Alison
Creator: Stewart, Gordon
Format: application/pdf
Organizations: Perseus Project
Topics: Digital libraries
Topics: OCR evaluation
Topics: Ancient Greek
Topics: Text alignment

Citable URL: http://hdl.handle.net/10427/14853
Author: Crane, Gregory; Babeu, Alison; Stewart, Gordon
Date: 2007
Citation: Stewart, Gordon, Crane, Gregory, and Alison Babeu. A New Generation of Textual Corpora: Mining Corpora from Very Large Collections. Preprint of paper accepted to JCDL 2007. Copyright ACM, 2007. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in JCDL '07: Proceedings of the 7th ACM/IEEE joint conference on Digital libraries, http://doi.acm.org/10.1145/1255175.1255247. 2007. Permanent URL: http://hdl.handle.net/10427/14853.
Rights: http://www.acm.org/pubs/copyright_policy/#Retained

Abstract: While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding that costs two or more times as much as keyboarding of English, both for the sake of accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% professional data entry accuracy rate. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora traditionally have left out, provide better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.
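
To give a sense of scale for the 99.95% figure in the abstract: at that rate, roughly 5 characters in every 10,000 are wrong. The sketch below shows one common way such a rate could be estimated, by comparing OCR output against a trusted reference transcription with character-level edit distance. This is an illustration under stated assumptions, not the evaluation procedure used in the paper, which this record does not describe. The function names `edit_distance` and `character_accuracy` are hypothetical, and the choice to normalize both strings to Unicode NFC (so that precomposed and decomposed forms of accented Greek letters compare as equal) is an assumption made for this example.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered correctly, floored at 0.

    Assumption: both strings are normalized to NFC before comparison so
    that equivalent encodings of accented characters are not counted as errors.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("reference must be non-empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    # Toy example: one wrong character out of four.
    print(character_accuracy("abcd", "abxd"))
```

Under this definition, a page of 2,000 reference characters would need to contain at most one error to stay at or above 99.95%.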