Harvard-Yenching Library

Transforming Classical Chinese Texts into Searchable Databases with AI

Thursday, November 7, 2024, 12:00pm - 1:00pm
Harvard ID required, Presentation
Location: CGIS South S354
Date: November 7, 2024
Time: 12 – 1 PM
Registration: Click here. (Lunch will be served.)

As artificial intelligence becomes integral to the digital humanities, it offers innovative methods that transform research capabilities and uncover new insights into historical texts and cultural narratives. This talk will demonstrate how AI-powered pipelines can process large volumes of unstructured classical Chinese texts, such as genealogies and Qing dynasty government employee records, including those from the Da Qing jin shen quan shu, into organized, searchable databases.

The pipeline addresses a longstanding challenge in classical Chinese studies: the labor-intensive manual data entry process. It is designed to efficiently process millions of pages from historical Chinese texts, tackling complexities like layout identification and precision in text extraction. Central to this effort is customized Optical Character Recognition (OCR), which improves text extraction accuracy, paired with Named Entity Recognition (NER) models that identify key fields. The result is a set of clean, tabular databases that improve accessibility, allowing researchers to analyze Chinese historical content with unprecedented efficiency. Furthermore, this methodology holds potential applications for other languages, including Japanese, Korean, Arabic, and Latin, broadening its impact.

By exploring these methodologies and their implications, this presentation aims to show how integrating advanced technological tools enriches scholarly inquiry in the digital humanities, providing deeper insights into patterns and narratives within Chinese history and beyond. This approach promises to revolutionize data collection, paving the way for alternative research practices across various linguistic contexts.


Speaker’s Bio

Guenther Lomas is a graduate of the University of Toronto and the founder of Sigtica, which specializes in document intelligence and the use of AI to transform unstructured data into structured databases. His work focuses on applying artificial intelligence technologies to the digital humanities, particularly in quantitative history and cultural heritage preservation. His latest project is a collaboration with the Lee-Campbell Research Group at the Hong Kong University of Science and Technology, where he uses Qing dynasty-era employee records, including records from the Da Qing jin shen quan shu, to develop tools that facilitate the digitization of classical Chinese texts.

Co-sponsored by Digital China Initiative and Harvard-Yenching Library.

Event Organizer

Gloria Cadder