In recent years, the Heidelberg Research Architecture project Early Chinese Periodicals Online (ECPO) has evolved from a data silo into an open-access research platform. In the first decade of its existence, the project’s focus was on the systematization of digitized early Chinese press material. This resulted in a searchable database for image scans and bilingual metadata with over 435,000 entries, including 300,000 scans, 85,000 records and 50,000 agent names from Republican-era magazines and newspapers.
The ECPO platform was implemented in collaboration with the Institute of Modern History, Academia Sinica, Taiwan, and made possible with funding from the Chiang Ching-kuo Foundation for International Scholarly Exchange. The platform has since been developed with further support from various institutions, such as the Centre for Asian and Transcultural Studies (CATS) Library, the Heidelberg Centre for Transcultural Studies (HCTS), the Institute of Chinese Studies and the Research Council Cultural Dynamics in Globalized Worlds from the University of Heidelberg, along with the Konfuzius-Institut Heidelberg and the University of Erlangen-Nürnberg as affiliated partners.
The ECPO database combines what we call “extensive” and “intensive” approaches to China’s early periodicals.
The extensive approach comprises a comprehensive catalog and record of Republican-era art and literary periodicals, including such basic data as title, editor, publisher, location and dates of publication, periodicity, format, prominent contributors, etc.
The intensive approach involves archiving digital cover-to-cover copies of entire runs of periodicals, analyzing their complete contents, and tagging them with structured metadata.
So far, six journals (four women’s journals and two entertainment periodicals) have been included in the database using the intensive approach, including the four magazines analysed in depth in the WoMag database. In the extensive section of the database, we have so far been able to work on a selection of some 150 journals in addition to those contained in the Xiaobao database.
We will continue to upgrade the database’s keyword metadata by mapping it, as far as possible, to established structured thesauri such as the Getty Art and Architecture Thesaurus and its Chinese version, AAT Taiwan, in cooperation with the ASDC. This mapping in turn opens up the possibility of linking ECPO directly to other digital collections.
ECPO is an Open Access resource developed by independent programmers and the Heidelberg Research Architecture (HRA). All bibliographic descriptions can be accessed through the ECPO API in MODS XML format. This allows broad sharing and exchange of the project’s research results in the future.
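As an illustration of what working with a MODS XML description might look like, the following Python sketch reads a title, names and a date from a record. It is an assumption-laden example, not ECPO's own tooling: the sample record is invented for illustration and is not actual ECPO output, the function name `summarize_mods` is hypothetical, and the API endpoint is deliberately left out. Only the MODS namespace (`http://www.loc.gov/mods/v3`) and the element names are taken from the MODS standard itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the MODS standard (version 3).
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record for illustration; not actual ECPO output.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example Article Title</title></titleInfo>
  <name type="personal">
    <namePart>Example Author</namePart>
  </name>
  <originInfo><dateIssued>1920</dateIssued></originInfo>
</mods>"""


def summarize_mods(xml_text):
    """Return title, names and issue date from a single MODS record."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "mods:titleInfo/mods:title", default="", namespaces=MODS_NS
    )
    names = [
        part.text
        for part in root.findall("mods:name/mods:namePart", MODS_NS)
        if part.text
    ]
    date = root.findtext(
        "mods:originInfo/mods:dateIssued", default="", namespaces=MODS_NS
    )
    return {"title": title, "names": names, "date": date}


if __name__ == "__main__":
    print(summarize_mods(SAMPLE_MODS))
```

Real records will likely carry more fields (for example bilingual titles or role terms), so the paths above would need to be adapted to the structure the API actually returns.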
We are continuously working on improving our agent records. ECPO comprises more than 50,000 different names of persons, groups, or institutions. These names are recorded as they occur in the original publications. We developed a cross-database agent service which allows us to manage name records, assign them to individual agent records, or split similar names into separate agents. We use international authority files, like VIAF or GND, and larger knowledge bases like Wikidata and DBpedia, as well as the Chinese encyclopedia Baidu Baike, to uniquely identify our agent records and provide users with links to additional information on the respective agent. For an example, see the agent record of Bao Tianxiao 包天笑.
Recently, ECPO started to work with neural networks with a focus on document layout recognition of Republican newspapers and OCRing individual text segments. Our aim is to advance the processing of Republican China newspapers and provide the content as full text. To learn more about our results, please follow our project presentations and have a look at the bibliography.
For an introduction to the features of the database and the recent experiments with neural networks, please have a look at the video Ground Truth, Neural Networks, OCR: Towards Full Text of Republican China Newspapers, presented at AAS2021. We have created a short tutorial on using the database to help you find your way through it. In that section we offer lists of the periodicals included in ECPO. We also provide a special section about the technical background. There you will find information about the systems we use and about the APIs we currently provide.
As the material basis of the database consists mostly of image scans, the project has been running experiments on one Republican newspaper to explore approaches towards full-text generation. Computer-aided processing of image scans of historical periodicals is still challenging with the current state of technology, in particular because processing standards for Latin-script newspapers are not applicable to the Chinese context. It is only with new approaches in machine learning that it has become possible to transform material which was inaccessible just a few years ago. However, many challenges remain. Extremely complex layouts make reliable automatic page segmentation difficult, and this has so far prevented full-text generation for these newspapers even within China.
The application of artificial intelligence requires a ground truth data set. This error-free, manually corrected text with structural information is used both for evaluation and the training of software models for text and layout recognition. In fall of 2021, the project successfully implemented OCR on a sample from the newspaper 晶報 Jing bao (The Crystal), with a character error rate below 3% (Henke 2021). On that basis, the project is now expanding and generalizing its approach. With additional funding recently received from the Research Council Cultural Dynamics in Globalized Worlds for the first half of 2022, the project is currently producing a new data set. The project’s aim is to offer a solution to automatically produce full text from Republican newspapers using neural networks and machine learning.
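To make the character error rate mentioned above more concrete, here is a minimal Python sketch of how such a figure is commonly computed: the edit (Levenshtein) distance between the OCR output and the ground truth, divided by the length of the ground truth. This is the conventional definition and an assumption on our part; the source does not state exactly how the reported rate was calculated, and the function names `levenshtein` and `character_error_rate` are illustrative only.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalized by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # One wrong character out of four (traditional vs. simplified form).
    print(character_error_rate("晶報上海", "晶报上海"))
```

Under this definition, a rate below 3% means fewer than three character edits per hundred characters of ground truth text.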
The project’s current work will not only further develop its original aims, but will also contribute to the field of research as a whole. With the disclosure of the project’s network models and data sets, its results can be reproduced and evaluated, and its approaches can be adopted by others in the field. Although processing non-Latin scripts is still a challenge in many cases, the project hopes that its work may serve as a good-practice example for such initiatives.