Perseus Project Publications, 1987 -- 2014

This series is part of Tufts Published Scholarship, 1983 -- 2014

Series Overview

Title: Perseus Project Publications
Dates: 1987 -- 2014
Call Number: PB.001
Size: 56 Digital Object(s)
Language(s): English.  

Description

This series contains scholarship published by Tufts University faculty and staff working on the Perseus project.

Arrangement

This series is arranged by first author and chronologically by submission within first author.

Access and Use

Access Restrictions

Open for research.

Detailed Contents List


 
Item ID: PB.001.001.00001
Type: Item
Access: Open for research.
 
Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries (preprint) 2006

Abstract: This paper describes several incunabular assumptions that impose upon early digital libraries the limitations drawn from print, and argues for a design strategy aimed at providing customization and personalization services that go beyond the limiting models of print distribution, based on services and experiments developed for the Greco-Roman collections in the Perseus Digital Library. Three features fundamentally characterize a successful digital library design: finer granularity of collection objects, automated processes, and decentralized community contributions.
Item ID: PB.001.001.00002
Type: Item
Access: Open for research.
 
ePhilology: When the Books Talk to Their Readers 2006

Abstract: This paper suggests directions in which an ePhilology may evolve. Philology here implies that language and literature are the objects of study but assumes that language and literature must draw upon the full cultural context and thus sees in philological analysis a starting point for the scientia totius antiquitatis - the systematic study of all ancient culture. The term ePhilology implicitly states that, while our strategic goal may remain the scientia totius antiquitatis, the practices whereby we pursue this strategic goal must evolve into something qualitatively different from the practices of the past.
Item ID: PB.001.001.00003
Type: Item
Access: Open for research.
 
Perseus: An Interactive Curriculum on Classical Greek Civilization, A Proposal to the Annenberg/CPB Project 1988

Traditionally, students of the humanities have encountered barriers in their learning. The primary sources of their studies are often physically inaccessible and difficult to assimilate. In Perseus, therefore, we have begun to use the technology of interactive computing and the dense storage capacity of optical disk to create a vast, highly cross-referenced database of textual and visual information, and to experiment with ways to let the user explore this information. Perseus promotes active learning and, we believe, has already begun to alter the three-way relationship between teacher, student, and source material.
Item ID: PB.001.001.00004
Type: Item
Access: Open for research.
 
Perseus: An Interactive Curriculum on Classical Greek Civilization, A Proposal to the Annenberg/CPB Project. Appendix 1988

Appendix to the 1988 proposal to the Annenberg Foundation for initial support of the Perseus Project.
Item ID: PB.001.001.00005
Type: Item
Access: Open for research.
 
A New Generation of Textual Corpora: Mining Corpora from Very Large Collections 2007

While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results from an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding that costs two or more times as much as keyboarding of English both for accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% professional data entry accuracy rate. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora traditionally have left out, provide better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.
Item ID: PB.001.001.00006
Type: Item
Access: Open for research.
 
The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th Century Newspaper Collection 2006

This paper evaluates automatic extraction of ten named entity classes from a 19th century newspaper, the Civil War years of the Richmond Times Dispatch, digitized with IMLS support by the University of Richmond. This paper analyzes success with ten categories of entities prominent in these newspapers and the particular problems that these classes of named entities raise. Personal and place names are familiar but some more important categories (such as ship names and military units) illustrate some of the challenges that named entity identification confronts as it evolves into a fundamental tool not only for automatic metadata generation but also for searching and browsing. We conclude by suggesting the kinds of knowledge sources that digital libraries need to assemble as part of their machine readable reference collections to support named entity identification as a core service.
Item ID: PB.001.001.00007
Type: Item
Access: Open for research.
 
eScience and the Humanities 2007

Humanists face problems that are comparable to those of their colleagues in the sciences. Like scientists, humanists have electronic sources and datasets that are too large for traditional labor intensive analysis. They also need to work with materials that presuppose more background knowledge than any one researcher can master: no one can, for example, know all the languages needed for subjects that cross multiple disciplines. Unlike their colleagues in the sciences, however, humanists have relatively few resources with which to develop this new infrastructure. They must therefore systematically cultivate alliances with better funded disciplines, learning how to build on emerging infrastructure from other disciplines and, where possible, contributing to the design of a cyberinfrastructure that serves all of academia, including the humanities.
Item ID: PB.001.001.00008
Type: Item
Access: Open for research.
 
Item ID: PB.001.001.00009
Type: Item
Access: Open for research.
 
Philology in an Electronic Age 2007

This paper considers two questions, one broad and thus more a series of further questions, the other more practical and with a particular suggestion for action. This paper emerges from a July 2002 conference about lexica and few would be willing to argue that classicists could not improve upon the foundations left to us by Liddell, Scott, Jones, and the others who labored on our shared Greek-English Lexicon. But simply because we could improve upon what we already have does not tell us what we should build now. We do not need a better nineteenth-century lexicon. We do not even need the best possible twentieth-century lexicon. We need to develop lexicographic resources to serve non-lexicographic readers of the twenty-first century.
Item ID: PB.001.001.00010
Type: Item
Access: Open for research.
 
Drudgery and Deep Thought: Designing Digital Libraries for the Humanities 2001

This is an expanded version of an article that appeared in the Communications of the ACM, May 2001 section on digital libraries. The published version is more informal. In this article, we describe the challenges of creating digital libraries for the humanities, based on our experience with a series of heterogeneous projects ranging from Ancient Greece to 19th-century London.
Item ID: PB.001.001.00011
Type: Item
Access: Open for research.
 
New Technology and New Roles: The Need for 'Corpus Editors' 2000

Digital libraries challenge humanists and other academics to rethink the relationship between technology and their work. At the Perseus Project, we have seen the rise of a new combination of skills. The "Corpus Editor" manages a collection of materials that are thematically coherent and focused but are too large to be managed solely with the labor-intensive techniques of traditional editing. The corpus editor must possess a degree of domain specific knowledge and technical expertise that virtually no established graduate training provides. This new position poses a challenge to humanists as they train and support members of the field pursuing new, but necessary tasks.
Item ID: PB.001.001.00012
Type: Item
Access: Open for research.
 
Document Quality Indicators and Corpus Editions 2001

Corpus editions can only be useful to scholars when users know what to expect of the texts. We argue for text quality indicators, both general and domain-specific.
Item ID: PB.001.001.00013
Type: Item
Access: Open for research.
 
New Technologies for Reading: The Lexicon and the Digital Library 1998

Books, codices, really, have begun to share the stage with electronic documents, a coexistence that is likely to prove as enduring as that of manuscript and print. While electronic documents are still quite new, we now have enough experience to speak with something other than anticipation, whether fretful or eager, about how print and electronic media complement one another. This discussion will focus on one particular reference work: the entire text of the Liddell-Scott-Jones Greek English Lexicon (9th edition 1940) has been available on the World Wide Web since the fall of 1995. Between July 9, 1996 and August 20, 1997, the electronic lexicon was accessed 336,649 times, with usage growing steadily (as I write in August 1997, the lexicon is used more than 1,000 times each day).
Item ID: PB.001.001.00014
Type: Item
Access: Open for research.
 
Building a Digital Library: The Perseus Project as a Case Study in the Humanities 1996

This paper outlines some of our preliminary findings in the Perseus Project, an on-going digital library on ancient Greek culture that has been under development since 1987.
Item ID: PB.001.001.00015
Type: Item
Access: Open for research.
 
Cultural Heritage Digital Libraries: Needs and Components 2002

This paper describes preliminary conclusions from a long-term study of cultural heritage digital collections. First, those features most important to cultural heritage digital libraries are described. Second, we list those components that have proven most useful in boot-strapping new collections.
Item ID: PB.001.001.00016
Type: Item
Access: Open for research.
 
Item ID: PB.001.001.00017
Type: Item
Access: Open for research.
 
Building a Hypertextual Digital Library in the Humanities: A Case Study on London 2001

This paper describes the creation of a new humanities digital library collection: 11,000,000 words and 10,000 images representing books, images and maps on pre-twentieth century London and its environs. The London collection contained far more dense and precise information than the materials from the Greco-Roman world on which we had previously concentrated. The London collection thus allowed us to explore new problems of data structure, manipulation, and visualization. This paper contrasts our model for how humanities digital libraries are best used with the assumptions that underlie many academic digital libraries on the one hand and more literary hypertexts on the other. Since encoding guidelines such as those from the TEI provide collection designers with far more options than any one project can realize, this paper describes what structures we used to organize the collection and why. We particularly emphasize the importance of mining historical "authority lists" (encyclopedias, gazetteers, etc.) and then generating automatic "span-to-span" links within the collection.
Item ID: PB.001.001.00018
Type: Item
Access: Open for research.
 
Towards a Cultural Heritage Digital Library 2003

This paper surveys research areas relevant to cultural heritage digital libraries. The emerging National Science Digital Library promises to establish the foundation on which those of us beyond the scientific and engineering community will likely build. This paper thus articulates the particular issues that we have encountered in developing cultural heritage collections. We provide a broad overview of audiences, collections, and services.
Item ID: PB.001.001.00019
Type: Item
Access: Open for research.
 
Item ID: PB.001.001.00020
Type: Item
Access: Open for research.
 
A Document Recognition System for Early Modern Latin 2006

Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many of the texts in the language, especially the early modern, make heavy use of special characters like ligatures and accented abbreviations. Current OCRs are inadequate for our purpose: their built-in training sets do not include all these special characters; further, post-processing of OCR output is based on data and methods specific to the domain language, and most of the current systems do not implement error-correction tools for Latin. This abstract outlines the development of a document recognition system for medieval and early modern Latin texts. We first evaluate the performance of the open source OCR framework, Gamera, on these manuscripts. We then incorporate language modeling functions to sharpen the character recognition output.
Item ID: PB.001.001.00021
Type: Item
Access: Open for research.
 
 
Student Researchers, Citizen Scholars and the Trillion Word Library 2012

The surviving corpora of Greek and Latin are relatively compact but the shift from books and written objects to digitized texts has already challenged students of these languages to move away from books as organizing metaphors and to ask, instead, what do you do with a billion, or even a trillion, words? We need a new culture of intellectual production in which student researchers and citizen scholars play a central role. And we need as a consequence to reorganize the education that we provide in the humanities, stressing participatory learning, and supporting a virtuous cycle where students contribute data as they learn and learn in order to contribute knowledge. We report on five strategies that we have implemented to further this virtuous cycle: (1) reading environments by which learners can work with languages that they have not studied, (2) feedback for those who choose to internalize knowledge about a particular language, (3) methods whereby those with knowledge of different languages can collaborate to develop interpretations and to produce new annotations, (4) dynamic reading lists that allow learners to assess and to document what they have mastered, and (5) general e-portfolios in which learners can track what they have accomplished and document what they have contributed and learned to the public or to particular groups.
Item ID: PB.001.001.00023
Type: Item
Access: Open for research.
 
Guidelines for the Syntactic Annotation of Latin Treebanks (Draft) 2007

This paper presents a preliminary set of guidelines for the syntactic annotation of Latin treebanks, as jointly developed for the Latin Dependency Treebank and the Index Thomisticus.
Item ID: PB.001.002.00001
Type: Item
Access: Open for research.
 
The Latin Dependency Treebank in a Cultural Heritage Digital Library 2007

This paper describes the mutually beneficial relationship between a cultural heritage digital library and a historical treebank: an established digital library can provide the resources and structure necessary for efficiently building a treebank, while a treebank, as a language resource, is a valuable tool for audiences traditionally served by such libraries.
Item ID: PB.001.002.00002
Type: Item
Access: Open for research.
 
Building a Dynamic Lexicon from a Digital Library 2008

We describe here in detail our work toward creating a dynamic lexicon from the texts in a large digital library. By leveraging a small structured knowledge source (a 30,457 word treebank), we are able to extract selectional preferences for words from a 3.5 million word Latin corpus. This is promising news for low-resource languages and digital collections seeking to leverage a small human investment into much larger gain. The library architecture in which this work is developed allows us to query customized subcorpora to report on lexical usage by author, genre or era and allows us to continually update the lexicon as new texts are added to the collection.
Item ID: PB.001.002.00003
Type: Item
Access: Open for research.
 
The Logic and Discovery of Textual Allusion 2008

We describe here a method for discovering imitative textual allusions in a large collection of Classical Latin poetry. In translating the logic of literary allusion into computational terms, we include not only traditional IR variables such as token similarity and n-grams, but also a comparison of syntactic structure. This provides a more robust search method for Classical languages since it accommodates their relatively free word order and rich inflection, and has the potential to improve fuzzy string searching in other languages as well.
Item ID: PB.001.002.00004
Type: Item
Access: Open for research.
 
Item ID: PB.001.002.00005
Type: Item
Access: Open for research.
 
Item ID: PB.001.002.00006
Type: Item
Access: Open for research.
 
Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection 2010

We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2% projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6% accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.
Item ID: PB.001.002.00007
Type: Item
Access: Open for research.
 
An Ownership Model of Annotation: The Ancient Greek Dependency Treebank 2009

We describe here the first release of the Ancient Greek Dependency Treebank (AGDT), a 190,903-word syntactically annotated corpus of literary texts including the works of Hesiod, Homer and Aeschylus. While the far larger works of Hesiod and Homer (142,705 words) have been annotated under a standard treebank production method of soliciting annotations from two independent reviewers and then reconciling their differences, we also put forth with Aeschylus (48,198 words) a new model of treebank production that draws on the methods of classical philology to take into account the personal responsibility of the annotator in the publication and ownership of a "scholarly" treebank.
Item ID: PB.001.002.00008
Type: Item
Access: Open for research.
 
Structured Knowledge for Low-Resource Languages: The Latin and Ancient Greek Dependency Treebanks 2009

We describe here our work in creating treebanks -- large collections of syntactically annotated data -- for Latin and Ancient Greek. While the treebanks themselves present important datasets for traditional research in philology and linguistics, the layers of structured knowledge they contain (including disambiguated lemma, morphological, and syntactic information for every word) help offset the comparatively small size of extant Greek and Latin texts for text mining applications. We describe two such uses for these Classical treebanks -- discovering lexical knowledge from a large corpus with the help of a small treebank, and identifying patterns of text reuse.
Item ID: PB.001.002.00009
Type: Item
Access: Open for research.
 
The Latin and Ancient Greek Dependency Treebanks 2011

This paper describes the development, composition, and several uses of the Ancient Greek and Latin Dependency Treebanks, large collections of Classical texts in which the syntactic, morphological and lexical information for each word is made explicit. To date, over 200 individuals from around the world have collaborated to annotate over 350,000 words, including the entirety of Homer's Iliad and Odyssey, Sophocles' Ajax, all of the extant works of Hesiod and Aeschylus, and selections from Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil. While perhaps the most straightforward value of such an annotated corpus for Classical philology is the morphosyntactic searching it makes possible, it also enables a large number of downstream tasks, such as inducing the syntactic behavior of lexemes and automatically identifying similar passages between texts.
Item ID: PB.001.002.00010
Type: Item
Access: Open for research.
 
Measuring Historical Word Sense Variation 2011

We describe here a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a WSD classifier, we are able to automatically classify the Latin word senses in a 389 million word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate the performance of seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean square error) to the observed historical variation.
Item ID: PB.001.002.00011
Type: Item
Access: Open for research.
 
Named Entity Identification and Cyberinfrastructure 2007

Well-established instruments such as authority files and a growing set of data structures such as CIDOC CRM, FRBRoo, and MODS provide the foundation for emerging, new digital services. While solid, these instruments alone neither capture the essential data on which traditional scholarship depends nor enable the services which we can already identify as fundamental to any eResearch, cyberinfrastructure or virtual research environment for intellectual discourse. This paper describes a general model for primary sources, entities and thematic topics, the gap between this model and emerging infrastructure, and the tasks necessary to bridge it.
Item ID: PB.001.003.00001
Type: Item
Access: Open for research.
 
Analyzing Human Systems Across Time, Space, Language, and Culture 2011

Due to the rise of very large, heterogeneous collections, increasingly sophisticated multilingual services, and expanding high performance computing infrastructure, we are now in a position to begin studying 4000 years of linguistic data from around the world, tracing change within languages, the interaction of languages, the evolution and circulation of ideas, and the patterns of human society. Language has been an impenetrable barrier: we can reach any point on the globe in a matter of hours, but the amount of time required to master a new language remains unchanged. We can, however, now begin to work with far more languages than we could ever study, much less master. We are now in a position to pursue broader questions and to pursue these with greater rigor than would have been possible in print. A great deal of work remains to be done, however, for very large collections are not scientific corpora and need extensive processing, and many written sources do not yet lend themselves to optical character recognition. Simply scaling up existing systems to analyze millions of books poses software engineering challenges. Perhaps most important of all, we need to train a new generation of researchers who can bridge the intellectual gaps between the relevant computational methods and new research for social, behavioral and economic sciences.
Item ID: PB.001.003.00002
Type: Item
Access: Open for research.
 
Item ID: PB.001.003.00003
Type: Item
Access: Open for research.
 
Spacing Out: Web 3D and the Reconstruction of Archaeological Sites 2000

The emergence of high-speed processors with 3D graphics acceleration and the accessibility of high-speed internet connections have propelled web 3D from a frustrating to a viable internet technology. Without a doubt, the addition of spatial dimension to customary 2D web graphics can be visually gripping. But greater potential for the technology lies beyond its ability to command attention. Web 3D presents students and scholars alike with a new tool for visualizing and understanding ancient sites. Fundamentally, 3D digital reconstruction allows architecture, sculpture, and other remains to be considered in their respective contexts as a whole, rather than as individual items divorced from their intended settings. At the same time, digital reconstruction can incorporate a distinction between the "real" and the "hypothetical" just as found in other types of modern restoration work. The Perseus Project is currently preparing 3D models of various archaeological sites. These will be linked to other related materials already in its digital library: maps; plans; photographs; QTVR walkthroughs; site, architecture, and object catalogs; and literary and historical documents. In part, the models are intended as databases of geographic and architectural information in their own right. They are also intended to serve as the underpinning for a contextual presentation of architectural and freestanding sculptures. This paper surveys Perseus' work on the Apollo sanctuary at Delphi. It addresses both the process of digital modeling and the method used to integrate the model with Perseus' existing tools and databases.
Item ID: PB.001.004.00001
Type: Item
Access: Open for research.
 
Extracting Geometry from Digital Models in a Cultural Heritage Digital Library 2003

This paper describes research to enhance the integration between digital models and the services provided by the document management systems of digital libraries. Processing techniques designed for XML texts are applied to X3D models, allowing specific geometry to be automatically retrieved and displayed. The research demonstrates that models designed on object-oriented paradigms are most easily exploited by XML document management systems.
Item ID: PB.001.004.00002
Type: Item
Access: Open for research.
 
Excavating the Hard Drive: Archaeological Research, XML, and 3D Graphics 2003

This paper focuses on the X3D XML application of the VRML international 3D graphic standard. It addresses the integration of X3D graphic files into the Perseus XML document management system. And it addresses the creation of a tool to extract and represent graphic elements from multiple files. The tool provides a specific research mechanism for the discovery of embedded graphic data interspersed through a large corpus, for example, allowing an archaeologist to retrieve every cornice from an extensive collection of files without having to open and search each file manually.
Item ID: PB.001.004.00003
Type: Item
Access: Open for research.
 
London calling: GIS, VR, and the Victorian period 2001

The Bolles Collection of Tufts University represents a comprehensive and integrated collection of sources on the history and topography of Victorian London. Texts, images, maps, and three-dimensional reconstructions are all interconnected forming a body of material that transcends the limits of print publication and exploits the flexibility of the electronic medium. The Perseus Digital Library has incorporated Geographic Information System and Virtual Reality technologies in a set of tools intended to help readers synthesize and visualize the numerous temporal and spatial interconnections between Bolles Collection materials. The tools, which are applicable to any large assemblage of related documents, also help readers grasp the complex temporal-spatial interactions that shape historical materials in general.
Item ID: PB.001.005.00001
Type: Item
Access: Open for research.
 
Generating and Reintegrating Geospatial Data 2000

The process of building a geospatial component to access existing materials in the Perseus Digital Library has raised interesting questions about the interaction between historical and geospatial data. The traditional methods of describing geographic features' names and locations do not provide a complete solution for historical data such as that in the Perseus Digital Library. Very often data sources for a spatial database must be created from the historical materials themselves.
Item ID: PB.001.005.00002
Type: Item
Access: Open for research.
 
Services Make the Repository 2006

This paper provides an overview of the collaboration between the Perseus Project and the Digital Collections and Archives at Tufts University in moving the collections of the Perseus Project into the DCA's Fedora-based repository, as well as a listing of potential services necessary to support a successful institutional repository.
Item ID: PB.001.005.00003
Type: Item
Access: Open for research.
 
Detecting Events with Date and Place Information in Unstructured Text 2002

Digital libraries of historical documents provide a wealth of information about past events, often in unstructured form. Once dates and place names are identified and disambiguated, using methods that can differ by genre, we examine collocations to detect events. Collocations can be ranked by several measures, which vary in effectiveness according to type of events, but the log-likelihood measure offers a reasonable balance between frequently and infrequently mentioned events and between larger and smaller spatial and temporal ranges. Significant date-place collocations can be displayed on timelines and maps as an interface to digital libraries. More detailed displays can highlight key names and phrases associated with a given event.
Item ID: PB.001.006.00001
Type: Item
Access: Open for research.
 
Detecting and Browsing Events in Unstructured Text 2002

Previews and overviews of large, heterogeneous information resources help users comprehend the scope of collections and focus on particular subsets of interest. For narrative documents, questions of "what happened? where? and when?" are natural points of entry. Building on our earlier work at the Perseus Project with detecting terms, place names, and dates, we have exploited co-occurrences of dates and place names to detect and describe likely events in document collections. We compare statistical measures for determining the relative significance of various events. We have built interfaces that help users preview likely regions of interest for a given range of space and time by plotting the distribution and relevance of various collocations. Users can also control the amount of collocation information in each view. Once particular collocations are selected, the system can identify key phrases associated with each possible event to organize browsing of the documents themselves.
Item ID: PB.001.006.00002
Type: Item
Access: Open for research.
 
Disambiguating Geographic Names in a Historical Digital Library 2001

Geographic interfaces provide natural, scalable visualizations for many digital library collections, but the wide range of data in digital libraries presents some particular problems for identifying and disambiguating place names. We describe the toponym-disambiguation system in the Perseus digital library and evaluate its performance. Name categorization varies significantly among different types of documents, but toponym disambiguation performs at a high level of precision and recall with a gazetteer an order of magnitude larger than most other applications.
Item ID: PB.001.006.00003
Type: Item
Access: Open for research.
 
Management of XML Documents in an Integrated Digital Library 2000

We describe a generalized toolset developed by the Perseus Project to manage XML documents in the context of a large, heterogeneous digital library. The system manages multiple DTDs through mappings from elements in the DTD to abstract document structures. The abstraction of document metadata, both structural and descriptive, facilitates the development of application-level tools for knowledge management and document presentation. We discuss the implementation of the XML back end and describe applications for cross citation retrieval, toponym extraction and plotting, automatic hypertext generation, morphology, and word co-occurrence.
Item ID: PB.001.006.00004
Type: Item
Access: Open for research.
 
Integrating Harvesting into Digital Library Content 2002

The Open Archives Initiative has gained success by aiming between complex federation schemes and low functionality web crawling. Much information still remains hidden inside documents catalogued by OAI metadata. We discuss how subdocument information can be exposed by data providers and exploited by service providers. We discuss services for citation reversal and name and term linking with harvested data in the Perseus Project's document management system and a proxy service for automatically adding these links to OAI documents outside Perseus.
Item ID: PB.001.006.00005
Type: Item
Access: Open for research.
 
Integrating data from The Perseus Project and Arachne using the CIDOC CRM: An Examination from a Software Developer's Perspective 2006

In a joint effort, The Perseus Project, a digital library hosted at Tufts University, and Arachne, the central database for archaeological objects of the German Archaeological Institute (DAI) and the Research Archive for Ancient Sculpture at the University of Cologne (FA), want to make their data accessible to a greater audience using the CIDOC CRM data model. Given the fact that the information in each of their databases is of interest to a large community of people, efforts to overcome the current lack of data integration have to be made. Aside from the philosophical implications and the mathematical background involved, the main concern of this project will be the practicability of a software implementation of all relevant concepts using basic Semantic Web technologies as described by the W3C, along with an investigation of the usability of the CIDOC CRM for a multilingual interface. The main purpose of the implementation process is to get a deeper understanding of the concepts and technologies involved when dealing with the Semantic Web, ontologies in general and the CIDOC CRM in particular. For the process of implementation it is essential that software tools team up with a methodical process, and that appropriate tools be discovered or developed and documented. Functional requirements have a tendency to evolve relatively rapidly when information systems are used by historians. Information systems in the humanities are especially confronted with the problem of constant change through the acquisition of new project partners carrying new and varied source material. As a consequence, potential integration efforts have to cope with changes of the database schemas and therefore should be flexible.
Item ID: PB.001.007.00001
Type: Item
Access: Open for research.
 
Managing Authority Lists for Customized Linking and Visualization: A Service for the National STEM Digital Library 2002

We propose two broad classes of service to the NSDL. First, we will provide automatic linking services that automatically bind key words and phrases to supplementary information. Such automatic linking services are already in place in the Perseus Digital Library. These services will help students, professionals outside a particular discipline, and the interested public to read documents full of unfamiliar technical terms and concepts. Astronomy students and curious amateurs may need to see expansions of some acronyms, e.g., MACHO: massive compact halo object, such as neutron stars and brown dwarfs, or pictures of Kuiper belt objects. These services can be of particular help to undergraduates as they shift from textbooks to scientific literature: the student struggling with research papers on bioluminescence, for example, will be able to locate information about particular chemical processes or relevant species of echinoderms. Second, we will base automatic linking on authority control of names and terms and on links among different authority lists such as thesauri, glossaries, encyclopedias, subject hierarchies, and object catalogues.
Item ID: PB.001.008.00001
Type: Item
Access: Open for research.
 
Vocabulary Building in the Perseus Digital Library 2002

In this paper, we will describe a new computational tool in the Perseus Digital Library designed to help students learn vocabulary by generating Latin and Greek word lists that are tailored to reading assignments.
Item ID: PB.001.009.00001
Type: Item
Access: Open for research.
 
Collecting Fragmentary Authors in a Digital Library (Greek Fragmentary Historians) 2009

This paper discusses new work to represent, in a digital library of classical sources, authors whose works themselves are lost and who survive only where surviving authors quote, paraphrase or allude to them. It describes initial works from a digital collection of such fragmentary authors designed not only to capture but to extend the ontologies that traditional scholarship has developed over generations: the aim is to represent every nuance of print conventions while using the capabilities of digital libraries to extend our ability to identify fragments, to represent what we have identified, and to render the results of that work intellectually and physically more accessible than was possible in print culture.
Item ID: PB.001.010.00001
Type: Item
Access: Open for research.
 
Improving OCR Accuracy for Classical Critical Editions 2009

This paper describes a work-flow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR scanned text. While the most recently available OCR engines are now able after suitable training to deal with the polytonic Greek fonts used in 19th and 20th century editions, further improvements can also be achieved with postprocessing. In particular, the progressive multiple alignment method applied to different OCR outputs based on the same images is discussed in this paper.
Item ID: PB.001.011.00001
Type: Item
Access: Open for research.
 
Rethinking Critical Editions of Fragmentary Texts By Ontologies 2009

This paper discusses the main issues encountered in the design of a domain ontology to represent ancient literary texts that survive only in fragments, i.e. through quotations embedded in other texts. The design approach presented in the paper combines a knowledge domain analysis conducted through semantic spaces with the integration of well established ontologies and the application of ontology design patterns. After briefly describing the specific meaning of "fragment" in a literary context, the paper gives insights into the main conceptual issues of the ontology design process. Lastly, it outlines the overall architecture of protocols, services and data repositories which is required to implement a digital edition of fragments based on the proposed ontology.
Item ID: PB.001.012.00001
Type: Item
Access: Open for research.
 
When Printed Hypertexts Go Digital: Information Extraction from the Parsing of Indices 2009

Modern critical editions of ancient works generally include manually created indices of other sources quoted in the text. Since indices can be considered as a form of domain specific language, the paper presents a parsing-based approach to the problem of extracting information from them to support the creation of a collection of fragmentary texts. This paper first considers the characteristics and structure of quotation indices and their importance when dealing with fragmentary texts. It then presents the results of applying a fuzzy parser to the OCR transcription of an index of quotations to extract information from potentially noisy input.
Item ID: PB.001.012.00002
Type: Item
Access: Open for research.
 
Item ID: PB.001.013.00001
Type: Item
Access: Open for research.