Topic Modeling the Hàn diăn Ancient Classics

Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi'an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the "Handian" ancient classics corpus (Hàn diăn gŭ jí, i.e., the "Han canon" or "Chinese classics"). It contains classics of ancient Chinese philosophy, documents of historical and biographical significance, and literary works. We begin by describing the Digital Humanities context of this joint project, and the advances in humanities computing that made this project feasible. We describe the corpus and introduce our application of probabilistic topic modeling to this corpus, with attention to the particular challenges posed by modeling ancient Chinese documents. We give a specific example of how the software we have developed can be used to aid discovery and interpretation of themes in the corpus. We outline more advanced forms of computer-aided interpretation that are also made possible by the programming interface provided by our system, and the general implications of these methods for understanding the nature of meaning in these texts.


Introduction and Context
The use of computers to support scholarship in the humanities reaches back over 50 years. 1 The first decades of the twenty-first century have seen the acceleration of humanities computing, particularly in North America and Europe, with the field coalescing around the label "Digital Humanities" (DH). The recent growth of DH is the product of a feedback loop caused by several factors including: (1) the increasing availability of digitized materials, especially on the World Wide Web; (2) increased computer storage capacity and processing capacity; (3) advances in text modeling and visualization algorithms; (4) deepening understanding by scholars of the interpretive possibilities provided by computational methods; (5) funding commitments from government and private foundations; and (6) last, but not least, the growing perception among many young scholars and doctoral students that DH is an exciting area of inquiry and an important enhancement to their career prospects. 2 DH projects may concern themselves with many different media: text, images, audio, video, etc. However, our focus here is the analysis of written texts. Textual analysis constitutes the largest component of DH. This is largely because written language has been central to the construction and transmission of intellectual culture, and because of the relative ease with which text can be encoded and shared. These factors have resulted in enormous amounts of textual material recently becoming available.
As an example from category (1), increased availability of digitized materials, we highlight the HathiTrust (HT) digital library. 3 It started as a collaboration among major university research libraries in the United States to digitally scan the books in their collections. 4 The page images from these books have been converted to text using optical character recognition (OCR) software. The HT collection now comprises over 14 million scanned volumes, equivalent to around five billion (5,000,000,000) pages based on the HathiTrust estimated average of 350 pages per book. 5 (Perhaps as many as half a million of these books are Chinese language volumes, as determined by a search at babel.hathitrust.org.) By any standard, this is a vast amount of text: more than could be read in multiple human lifetimes. Because of the collection's enormous scale, the digitized pages in the HT are relatively uncurated. Despite the care with which editors prepared the original physically printed editions, the images and OCR representations of the pages contain scanning errors that have not been corrected. Nevertheless, the HT digital library is a treasure trove for DH that offers multiple possibilities for analysis. 6
At the same time, traditional scholarly editions have become increasingly digital, making available highly curated editions of historically and culturally significant text corpora. These projects use a labor-intensive process of inserting markup according to widely adopted standards for the semantic web, such as TEI (Text Encoding Initiative) and OWL (Web Ontology Language). 7 Only some of these editions are fully accessible to all users, however, because the costs of producing and curating them must often be recovered by subscriptions paid by libraries or individual users. For some scholarly purposes, the standard of care with which such editions are produced is essential, but access to imperfectly digitized texts, such as provided by the HathiTrust and by Project Gutenberg, 8 is adequate for many projects. However, even full access to relatively open archives such as the HathiTrust faces some restrictions because of complicated copyright issues which vary from country to country.
None of the large-scale repositories would have been possible without the exponential growth in computer capacity that has occurred since the advent of computing. Known popularly as "Moore's Law", the doubling of computer speed and memory roughly every 18 to 24 months for the same cost has produced supercomputers capable of processing terabytes of data, and personal laptops capable of storing and processing far more material than any single person could hope to read in several lifetimes. At the same time, the Internet and its hypertext offshoot, the World Wide Web, have made distributed repositories and cluster computing possible. This growth in speed, storage, and networking has been accompanied by increasingly sophisticated algorithms for processing text and visualizing the results. The earliest efforts in humanities computing focused mainly on counting and localizing key words in texts. Other DH approaches applied network analysis to names, dates, places and other metadata such as citations, extracted from text. More recently, techniques for modeling full text contents have been introduced by computer scientists. Originally developed for the purposes of information retrieval, techniques such as latent semantic analysis (LSA), probabilistic topic modeling using latent Dirichlet allocation (LDA), and neural-network models of word embeddings have been adopted within DH. 9
Although differing in their details, what these methods have in common is their representation of documents and words as vectors within a multidimensional space. In some representations, the dimensions of the space correspond to words. In other representations, the dimensions may correspond to concepts, topics, or other abstractions from the data. Algorithms based on these vector representations are capable of identifying hidden (latent) factors in text. Such representations allow for interesting and meaningful measures of similarity among terms and documents, for example the cosine of the angle between vectors, or well-defined information-theoretic measures on the probability spaces. With such methods, DH scholars are beginning to move beyond counting words, to detecting and analyzing patterns of historical significance at cultural scale 10 and at the scale of an individual. 11
There is a small but growing literature on large-scale statistical modeling of Chinese language texts. Ouyang analyzed a corpus of over 40,000 ancient documents downloaded from multiple sources. This was used to plot the temporal distributions of word frequencies and geographic distributions of authors. 12 Huang and Yu modeled the SongCi poetry corpus, first converting it to tonally marked pinyin to conserve poetically important pronunciation information. 13 Nichols and colleagues reported initial modeling of the Chinese Text Project corpus 14 in a conference paper. (Further below, we describe differences between this corpus and the Handian.) With additional collaborators, this group has now conducted two studies that are currently unpublished but under review. In the first, they apply topic models to address scholarly questions about the relationships among important texts of ancient Chinese philosophy. In the second, they use topic modeling to investigate the concepts of mind and body in ancient Chinese philosophy. 15
Although we share similar scholarly objectives with these researchers, our approach in this paper is unique in that for the first time anywhere we bring the benefits of computational modeling of ancient Chinese texts to a robust public platform that is mirrored on both sides of the Pacific. Besides being a useful portal to the texts, our approach foregrounds the interpretive issues surrounding topic models, 16 and makes more sophisticated exploration and analysis of interpretive questions possible for experts and novices alike.
The Chinese language presents interesting challenges for humanities computing. Both modern and ancient Chinese, but especially the latter, rely heavily on context for the interpretation of individual characters and words, 17 and some researchers have argued that differences in Chinese morphology make some of the techniques that work well for DH work in Western languages less applicable to Chinese. 18 Words in Chinese are highly polysemous, requiring considerable amounts of context for their proper interpretation. The study of ancient Chinese philosophy is especially challenging because this ambiguity and openness to multiple interpretation seems to be deliberately exploited by the ancient masters. 19 Take, for example, the character '道', which could refer to Taoism, but has up to 10 meanings in ancient Chinese texts, such as 'way' or 'road', and is also used as a verb to mean 'say'. At the same time, the long and relatively continuous history of the Chinese nation has enabled the transmission of a rich corpus of ancient texts to the present day. Computational modeling of these texts does not, as we see it, aim to remove the human from the humanities. Rather, by enabling the discovery and quantitative analysis of connections, computational methods promise at least these two benefits: (i) enhanced means of access to large sets of documents, and (ii) new sources of evidence about texts that can support the ongoing discussion of their interpretation relative to the past and the present. We are also interested in a more general theme (iii), concerning the potential broader significance for theoretical discussions of the nature of meaning and the role of language in conceptual schemes.
Our primary contribution in this paper is of type (i), to provide enhanced access to a corpus of ancient Chinese documents. Specifically, we introduce an application of the InPhO Topic Explorer 20 developed at Indiana University, Bloomington, USA, to a large, public corpus of ancient Chinese texts, resulting from collaboration with philosophers and computer scientists at Xi'an Jiaotong University, Shaanxi, China. We also discuss potential projects and future research of type (ii) concerning the analysis of the themes in ancient Chinese philosophy and other literary sources. We present a very brief discussion of the broader significance (iii) before the conclusions section of this paper.

Selecting and Preparing the Corpus
A good understanding of Chinese intellectual culture during the classical period is important in itself, and essential for understanding the reception of Western ideas during various stages of China's history, and vice versa. As philosophers, we are particularly interested in philosophical texts, but we recognize that the boundaries between philosophy and other areas such as religion and political theory are fuzzy at best, and practically non-existent in some cultures or during certain periods of history. Thus, rather than try to demarcate "philosophy" from the rest, we decided to pursue our computational inquiry with as broad a corpus as we could locate.
A secondary consideration is that we want our work to provide a public benefit by being accessible to scholars and the public. It is less than optimal to analyze sources that only a few people (not even all scholars) have access to. For example, although the Wenyuange Edition of the Siku Quanshu archive 21 is of high quality for scholars, it is accessible only to those with subscriptions that are locked to specific IP addresses. Thus we conducted a scan of repositories of ancient Chinese documents, and found that the crowd-sourced website at zdic.net provided the best combination of quantity and access to a large number of classic texts, thanks to its permissive re-use policy under a Creative Commons 1.0 Public Domain Dedication. 22 The full website at www.zdic.net contains a dictionary of Chinese characters, a dictionary of words, a dictionary of idioms and several other resources. Among them is the collection of classics identified as 汉典古籍 (Hàn diăn gŭ jí) or Chinese classics, the portion we refer to as the "Handian" corpus, directly accessible at http://gj.zdic.net/, and it is this portion of the website that we chose to model. This section of the website is not without problems, however. It contains a diverse collection of different file formats, containing both traditional and simplified characters, and of varying quality because they have been crowd-sourced from many different users using many different sources, with varying degrees of scholarly care. A better-curated corpus is the Chinese Text Project used by Nichols, Slingerland and colleagues. 23 Although this site can be downloaded for private and academic use, its re-use policy is not as permissive as that of the Handian, and the online analysis tools require a subscription. Furthermore, because ctext.org is registered in Panama and hosted in the USA as well as directed towards English-speaking users, access by users in mainland China is generally slower and more difficult than for zdic.net, which is registered and hosted in China.
For our initial goals, the benefits of accessibility, especially to Chinese users, outweighed the concerns about corpus curation quality. Such concerns are also partly mitigated by the topic modeling methods (described in more detail below). Because topic models treat documents as unordered "bags of words", they are relatively robust in the face of the "noise" provided by the variable quality of the texts. The techniques we describe here can be applied to more scholarly editions of the same texts. By demonstrating the power of the approach with the Handian corpus, we hope to encourage curators of scholarly editions to incorporate similar methods and make their efforts publicly available. We have made the products of our research available for all at our Indiana University website in the USA, mirrored at the Xi'an Jiaotong University website in China. 24
In November of 2016 we crawled and downloaded the four sections of the Han classics from the gj.zdic.net site. These sections, which are derived from the Siku Quanshu (the library of the Qianlong Emperor in Four Sections), are the 经部 (Jīng Bù), containing Confucian classics, 史部 (Shǐ Bù), historiographic works, 子部 (Zǐ Bù), containing writings of the philosophical schools, and 集部 (Jí Bù), a section of miscellaneous anthologies, including poetry, drama, and other works of literature. Each of the sections contains a multi-level tree of further subsections terminating in text files. For example, within the 经 (Jīng) section are three subsections, labeled 十三经 (thirteen classics), 十三经注疏 (thirteen classics annotations), and 经学史及小学类 ("history of classical studies and traditional Chinese philology"), and these are further subdivided. 25 We found that some of the files were index files listing the contents of the directories, so we discarded these.
We developed a custom mixture of automated and semi-automated methods to extract the original texts from the downloaded HTML pages. Next we cleaned the corpus by regularizing the characters and their encoding method. Because of the mixture of traditional and simplified characters in the corpus, we decided to map all characters to simplified characters. This entails a loss of visual, aural and etymological information, important for interpretation by knowledgeable readers, but of no direct use to the algorithms beneath the topic modeling process. (In the future we will provide additional support for both traditional and simplified characters within the Topic Explorer.) After this preliminary processing, we found that quite a few files were empty, some representing documents lost to history, others not present for other reasons. So, we removed these files, leaving 18,818 files for analysis. 26 These files contain approximately 100 million individual characters.
Chinese does not use spaces to separate words, but some words comprise multiple characters. Hence, text modelers face a choice of whether to model the corpus character-by-character or whether to segment the text into words. Because the vast majority of ancient Chinese words are written as single characters, the character-by-character option may have been a reasonable choice for this corpus. It was our judgment, however, that segmentation of the texts into words rather than characters would improve interpretability of the models. 27 Software to address the word segmentation problem in modern Chinese exists, but these solutions are dictionary-based. Thus it was necessary for us to deploy a dictionary of ancient Chinese, which we constructed from different sources. 28
After applying the dictionary to our corpus, we identified nearly 84 million word tokens comprising nearly 85,000 unique word types. The most common word in the Handian corpus is 之 (zhī, it/this/for) at just over 1.25 million occurrences and the most common two-character word is 天下 (tiānxià, the world) at 83,805 occurrences, 93rd most frequent in the overall list. Very high frequency words are relatively uninformative and they tend to overwhelm the available methods for corpus analysis, both because of the additional time to process so many characters in a corpus of this scale, and because the highly frequent terms tend to dominate more meaningful terms in the trained models. Therefore, it is normal to develop a "stop list" of such words to remove them from the corpus. 29 Our stop list of 187 words is larger than the 132 words listed by Slingerland et al., 30 and the two lists overlap in 50 words. The relative disjointness of the two can be explained by the differences in size and scope of the two corpora and the different objectives of the two projects. For example, we found it useful to filter out more of the frequently occurring number words.
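The segmentation and stop-list steps described above can be illustrated with a minimal sketch. It assumes greedy forward maximum matching, a common dictionary-based segmentation technique that is not necessarily the one we used, and it builds the stop list purely from frequency rank, whereas a real stop list is usually reviewed by hand. The lexicon, the toy documents, and all function names are hypothetical.

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed, illustrative method).

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing longer matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def build_stop_list(token_lists, n_most_common):
    """Return the n most frequent word types across all documents."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word for word, _ in counts.most_common(n_most_common)}


def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [tok for tok in tokens if tok not in stop_list]


# Hypothetical toy lexicon and documents, for illustration only.
lexicon = {"天下", "阴阳"}
docs = ["天下之阴阳", "之道之天下"]

segmented = [segment(doc, lexicon) for doc in docs]
stop_list = build_stop_list(segmented, n_most_common=1)
filtered = [remove_stop_words(toks, stop_list) for toks in segmented]

print(segmented)  # two-character lexicon entries are kept whole
print(stop_list)  # the single most frequent word type
print(filtered)
```

In this toy example the high-frequency function word 之 is the only entry in the stop list, so the two-character words survive filtering while 之 is removed.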

LDA Topic Modeling
Based on our previous experience working with large text collections within the InPhO team at IU, we chose to apply LDA (Latent Dirichlet Allocation) topic modeling to the Handian corpus. (LDA is named for the 19th-century mathematician Gustav Dirichlet, who laid the foundation in probability theory for the technique.) LDA topic modeling has become popular within DH in recent years, although the interpretation of this kind of model remains a matter of considerable discussion. 31 It treats documents initially as "bags of words": all grammatical structure and information about word order within sentences or documents is ignored, and the document's initial profile is simply the frequency with which each word appears in it. Topic modeling aims to find latent (hidden) structure among these "bags of words" by re-representing each document as a mixture of topics. A topic may also be thought of as a writing context, as we now explain.
We understand topic models to provide a theory about writing. Authors of documents combine different subjects of discussion. Different authors working within similar cultural contexts have overlapping interests in various subjects, but they combine the available topics differently. When writing about good behavior, for example, one may be concerned with good behavior in the public sphere of business or politics or religion, or in the family or social community, or as a topic within moral philosophy. An author is more or less likely to use a given word when writing about each of these subjects. For example, the words 'sister' or 'father' are more likely to be used when the author's subject is family than when writing about business. Other words may have very similar likelihoods of being used in these contexts. For example, the word 'virtue' might be equally likely to be used by authors discussing family or business matters. Discussion of good behavior may span the contexts of nature, family history, legal cases, theology, mythology, etc. Across a large corpus of documents we may expect to see these themes arising in different combinations, both when different authors are writing within similar cultures, and when one author writes at different times in his or her career. Furthermore, writers write for different contexts and audiences: letters to friends or family or superiors, philosophical dialogues, public speeches, etc. Each of these contexts also changes the likelihood of the author selecting certain words, and the same word in different contexts may produce slight or major variations in meaning.
LDA topic modeling provides a method for automatically identifying topics within a set of documents. At the end of a training process: (a) each topic is represented as a total probability distribution over all the words in the corpus, that is, every word is assigned a probability in every topic, and the sum of all the word probabilities within one topic is equal to one; and (b) each document is represented as a total probability distribution over the topics, that is, every topic is assigned a probability in every document, and the sum of the topic probabilities within one document is likewise equal to one.
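To make properties (a) and (b) concrete, the following is a minimal sketch of LDA trained with a toy collapsed Gibbs sampler, using only the Python standard library. Collapsed Gibbs sampling is one standard way to fit LDA; we do not claim it matches the training code in any particular software package. The function name, the smoothing values alpha and beta, and the toy documents are all assumptions made for illustration.

```python
import random


def train_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of token lists. Returns (vocab, phi, theta) where
    phi[t] is topic t's distribution over vocab and theta[d] is
    document d's distribution over topics.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # (a) each topic is a probability distribution over all words
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    # (b) each document is a probability distribution over all topics
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


# Hypothetical, already segmented toy documents.
toy_docs = [
    ["阴阳", "五行", "气", "阴阳"],
    ["气", "五行", "阴阳"],
    ["诗", "酒", "月", "诗"],
    ["酒", "月", "诗"],
]
vocab, phi, theta = train_lda(toy_docs, num_topics=2)
print([round(sum(row), 6) for row in phi])    # each topic sums to one
print([round(sum(row), 6) for row in theta])  # each document sums to one
```

Because of the smoothing terms, every word receives a nonzero probability in every topic and every topic a nonzero probability in every document, which is why the top few words of a topic never exhaust it.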
The model starts with random probabilities assigned to the word-topic and topic-document distributions. It is trained by a process of adjusting the word-topic and topic-document probability distributions. The word-topic and topic-document distributions are controlled by two parameters (technically "hyperparameters" or "priors") that are set to ensure that there is sufficient variation in the probabilities assigned to the topics in the documents and to the words in the topics. The number of topics is chosen by the modeler. Our group typically trains multiple models with different numbers of topics, and we compare the different models to each other. For the present study we trained models with 20, 40, 60, 80, and 100 topics. In general, with too few topics, each topic becomes very general and hard to interpret. With too many topics, some of the topics are specialized on just a few documents, making them less useful for finding common themes. While there exist methods within computer science for estimating an optimally efficient number of topics for a given corpus, users of the models may prefer a coarse-grained scheme (fewer topics) for some purposes while other users may prefer a more fine-grained scheme (more topics) for other purposes. 32 Furthermore, working with multiple models simultaneously fosters the kind of "interpretive pluralism" that characterizes humanities computing. 33
The process by which we built these models using the InPhO topic-explorer package consists of four steps: initialization of the corpus object, preparation of the corpus by filtering words according to their frequency, training the corpus models, and launching the Topic Explorer's Hypershelf interface. 34

Using the Topic Explorer & Notebooks
The Topic Explorer provides both a map-like visualization of the topic space (described further below) and a "Hypershelf" that allows users to experiment with the trained model to explore the corpus in any standard web browser. We call the latter interface a Hypershelf because although the browser initially presents documents from the corpus in a single linear order, it can be rearranged by the users to reflect their interests, and any document can be opened to view the full text. Thus, the Hypershelf initially provides a top-level "distant reading" 35 view of the corpus, but allows the user to zoom down to the original text. This supports a two-way interaction in which interpretation of the texts helps the user to interpret the topics in the model, while interpretation of the topics in the model can help the user to interpret the texts. (We provide an example of this interplay below.) The Hypershelf has two main modes: a document-centered view and a topic-centered view.
Beginning with the document-centered view, the user may either select a document at random or use the search box to enter a few characters. These characters are automatically matched to the document labels, and the user can select a document from the drop-down list (Figure 1 shows initial options for 论语, Lúnyǔ, the Analects). Once a document is selected, and a number of topics for the model is chosen, the browser window is filled with a row of multicolored bars (Figure 2), each block of color corresponding to a topic. The top row represents the topics assigned to the document by the computer during the final training cycle, according to the key at the right. Hovering over any of the colored sections displays a list of the highest probability words for that topic (see Figure 3). It is important to remember, however, that every word is assigned a probability in every topic, so these first few words do not exhaust the context provided by the topic. Each subsequent row represents the topic distribution of another document from the corpus, scaled such that the length of the bar indicates similarity to the top document. 36
Figure 1. Autocompletion of document names.
Initially, similarity between documents is shown with respect to the entire topic mixture associated with the focal document, but clicking the mouse on any of the topics re-sorts the list according to the overall proportion of each document that the model assigned to the selected topic. This capacity of the Hypershelf allows users to rearrange the documents according to their interest in a particular topic (Figure 3).
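The two orderings just described can be sketched as follows. This is an illustration under stated assumptions rather than the Hypershelf's actual code: similarity of topic mixtures is measured here with the Jensen-Shannon divergence, and the document names and topic proportions are invented.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def rank_by_similarity(focal, theta):
    """Order documents by closeness of their whole topic mixture to the focal one."""
    return sorted(theta, key=lambda name: jsd(theta[focal], theta[name]))


def rank_by_topic(topic, theta):
    """Order documents by the proportion assigned to one selected topic."""
    return sorted(theta, key=lambda name: theta[name][topic], reverse=True)


# Hypothetical topic mixtures for four documents in a three-topic model.
theta = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
    "doc_d": [0.2, 0.7, 0.1],
}

print(rank_by_similarity("doc_a", theta))  # focal document first
print(rank_by_topic(1, theta))             # re-sorted after "clicking" topic 1
```

The first ordering compares entire mixtures, so a document can rank high without sharing the focal document's largest topic exactly; the second ignores everything except the selected topic.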
From this point the user may click the "Top documents for Topic…" button below the key on the right (not shown in the figure) to select the documents from the entire corpus that have been assigned the highest proportion of the selected topic. Alternatively, the user may choose to refocus on any of the other documents in the display by clicking on the arrow icon to the left of a row. (This icon appears when the mouse hovers nearby.) The user may read the full document by clicking on the "page" icon, which appears to the left of the arrow icon.

Results
We successfully trained topic models on the corpus of over 18,000 classical Chinese documents and made them available to explore interactively online. 37 We believe our choice to do word segmentation rather than single character modeling is justified by the contribution that the two-character words make to the interpretability of the topics, as well as by our investigation of 阴阳 (yinyang) within the corpus, as described below.
Topic models for humanities computing cannot be evaluated against a "gold standard" of correct performance, because no such standard exists. Neither could such a standard exist if one takes seriously the idea that the process of interpretation at the core of the humanities applies to the models as much as the texts (see Discussion section below), and is as variable as the interests of the users themselves. Ultimately, a topic-modeling approach succeeds or fails according to the ability of users to use the models for their own purposes, be it self-education, pedagogy, exploratory research, or systematic analysis of the texts. In future work we intend to assess how users respond to the topic models, and to conduct more complete analyses of relationships among the texts using the models. Here we present an example of how a particular individual used the Topic Explorer modeling and visualization results for self-guided investigation and serendipitous discovery, a process we refer to as "guided serendipity". 38
Our subject, one of the Chinese coauthors of this paper, began this project with only a basic familiarity with ancient Chinese philosophy acquired from an undergraduate course. He decided initially to investigate the concept of yinyang (阴阳). Using the capacity of the Topic Explorer for topic-mediated term search, this term was queried in the 60-topic model. 39 Documents are retrieved according to their overall similarity to the topics selected by the term. The practical import is that because searches are topic-mediated, the documents retrieved need not contain the actual query term.
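The idea of a topic-mediated search can be sketched as follows. This is a simplified illustration, not the Topic Explorer's implementation: it assumes the query term is converted into weights over topics in proportion to the term's probability in each topic, and that documents are then scored by how much of their topic mixture falls on those topics. All names and numbers are hypothetical.

```python
def topic_weights_for_term(term, phi):
    """Weight each topic by the (normalized) probability it gives the term."""
    raw = [topic.get(term, 0.0) for topic in phi]
    total = sum(raw)
    if total == 0:
        raise KeyError(f"term not in model vocabulary: {term}")
    return [r / total for r in raw]


def topic_mediated_search(term, phi, theta):
    """Rank documents by the overlap of their topic mixture with the term's topics."""
    weights = topic_weights_for_term(term, phi)
    scores = {
        name: sum(w * p for w, p in zip(weights, mixture))
        for name, mixture in theta.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical two-topic model: word probabilities per topic.
phi = [
    {"阴阳": 0.30, "五行": 0.30, "气": 0.40},
    {"诗": 0.60, "酒": 0.40},
]
# Hypothetical document-topic mixtures and document contents.
theta = {
    "divination_text": [0.90, 0.10],
    "medical_text": [0.80, 0.20],
    "poem": [0.05, 0.95],
}
texts = {
    "divination_text": ["阴阳", "五行"],
    "medical_text": ["气", "五行"],
    "poem": ["诗", "酒"],
}

ranked = topic_mediated_search("阴阳", phi, theta)
print(ranked)
print("阴阳" in texts["medical_text"])  # retrieved without containing the term
```

In this toy example the medical text is ranked second even though it never uses the query term, because it draws heavily on the same topic.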
The first document identified in this way is from the 子部 (zǐ bù) section of the corpus, which contains writings of the philosophical schools. It is from the 术数 (shu shu, or divination) section of the corpus, specifically the 三命通会 (Sān Mìng Tōng Huì), an important book from the Ming dynasty. The specific chapter located in this way is 卷一•论支干源流 (On the Origin of the Chinese Sexagenary Cycle), describing the "ten Heavenly Stems" (yang) used in combination with "twelve Earthly Branches" (yin) as a calendar dating system.
A little further down the list of documents, in the 7th row, is a chapter from the Confucianism subsection 儒家 (rújiā). The chapter labeled '參兩萹第二' from the volume labeled '张子正蒙' in the corpus is part of the work Zheng Meng (正蒙), which is very significant within the Confucian tradition. It was written by Zhang Zai (1020-1077), an important thinker of the Song Dynasty from Shaanxi province. The chapter relates yinyang theory to the astronomical calendar and the classical theory of Five Phases: Wood, Fire, Earth, Metal, Water (also referring to Jupiter, Mars, Saturn, Venus and Mercury respectively) used to explain the laws governing speed and direction of planetary motion.
Pursuing the idea that the Five Phases Theory provides the backbone of a broad system of thought encompassing many areas, our subject re-sorted the Hypershelf by clicking the arrow to the left of the top row, to refocus on this document. 40 He then inspected the topics and identified topic 15 as seemingly most relevant to his interests. Clicking on that topic reorders the results according to the proportion of the topic allocated to each document in the list.
The top document identified in this way is also from the Confucian section of the Handian corpus, but in this case it is volume 12 from the book 春秋繁露 (Chūn Qiū Fán Lù, sometimes abbreviated as "Fan Lu", and also known in English as the "Luxuriant Dew of the Spring and Autumn Annals"), which relates the changing of the seasons to yinyang. By inspecting the titles of the documents near the top of the list, our subject noticed that many of them are from the Chūn Qiū Fán Lù, from the Zheng Meng, and from a third important work titled '三命通会' (Sān Mìng Tōng Huì), which is a text about fortune telling and divination. This helped our subject to understand that the Topic Explorer could help him identify in which parts of Chinese culture the yinyang theory was prominent, namely Confucianism, Daoism, and traditional Chinese medicine.
For example, the document 卷二•论五行旺相休囚死并寄生十二宫 is part of the Sān Mìng Tōng Huì about the Five Phases theory, explaining the positive and negative relationships existing among Wood, Fire, Earth, Metal, and Water, and various ways in which these elements are related to human life, health, and death. Also present are numerous documents from the 医家 (Yījiā, or traditional medicine) section. Refocusing the Topic Explorer by clicking on the arrow icon to the left of the document HandianCorpus/『子部』/医家/医学源流论/卷上•病同人异论.txt retrieves a large number of related medical texts. In particular, these include the document from the 素问 (Sù wèn, or 'basic questions') section labeled 八正神明论篇第二十六 (Part 26 of the book Bā Zhèng Shénmíng Lùn), which is a very famous dialog between the mythologized "Yellow Emperor" Huángdì (黄帝) and his minister, Qí Bó (岐伯), supposed to have lived in the third millennium BCE. They discuss acupuncture in the context of qi (气), a very important concept concerning life force or vital energy in traditional Chinese medicine, and they connect qi to yinyang.
Our coauthor reports that before using the Topic Explorer his concept of the Five Phases Theory was ambiguous, but in the interplay between topics and documents he learned many new details about the Five Phases Theory and its relationship to yinyang. For an expert, these interrelations may be well known, but for a learner, the capacity to rapidly relate the concepts in this way serves a very valuable function. His understanding of the complexity of the concept of qi was also broadened, leading to a plan to investigate this concept further in future work with the topic models. This example shows how one individual's understanding of the connectedness of concepts from traditional Chinese medicine, astronomy, and religion was deepened by interaction with both the high-level overviews provided by the topic model and the close reading of specific texts directed by following the models. Although this is just one case, we believe that it is not unique: the Hypershelf interface of the Topic Explorer supports spontaneous exploration and guided serendipity, customized to the user's particular interests.
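The re-sorting step used in this walkthrough, in which documents are reordered by the proportion of a selected topic, can be sketched in a few lines. The sketch is illustrative only and is not the Topic Explorer's implementation: the function name `resort_by_topic`, the dictionary layout, and the proportions are hypothetical placeholders rather than values from the Handian models.

```python
from typing import Dict, List, Tuple


def resort_by_topic(
    doc_topics: Dict[str, Dict[int, float]], topic: int
) -> List[Tuple[str, float]]:
    """Order documents by the proportion of `topic` assigned to each, highest first."""
    # A document with no mass on the topic is treated as having proportion 0.
    ranked = [(doc, props.get(topic, 0.0)) for doc, props in doc_topics.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic proportions (placeholders, not model output).
doc_topics = {
    "doc_a": {15: 0.10, 3: 0.90},
    "doc_b": {15: 0.55, 3: 0.45},
    "doc_c": {3: 1.00},
}

for doc, share in resort_by_topic(doc_topics, topic=15):
    print(doc, share)
```

Under these placeholder values, selecting topic 15 moves doc_b to the top of the list, which mirrors how clicking a topic brings the documents most heavily loaded on it into view.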
We turn now from the Hypershelf to an interactive visualization of the entire topic space which is also provided by the Topic Explorer software package. This visualization can be explored interactively at InPhO websites. Figure 4 shows a map and cluster analysis of the topics across all five models. The map is generated using the isomap procedure applied to the JSD measure.41 Isomap is a technique for reducing a high dimensional space (here, the probability space of the word distributions in the models) to fewer dimensions, in this case two. Such dimensionality reductions are useful for identifying principal components of the model structure. Whereas the standard MDS (multidimensional scaling) algorithms are linear, isomap detects non-linear structure in the data. The map allows one to assess the overall similarity of topics in the different models (20, 40, 60, 80, 100). The relationships among topics revealed by these figures are not strictly hierarchical; nevertheless, topics from the models with higher numbers of topics tend to cluster around topics from the models with smaller numbers of topics.
Groups of topics are clustered and colored automatically according to an arbitrary choice of ten clusters. Although these clusters are very broad, some general themes emerge. For example, the dark green and dark purple clusters in the lower right are related to literature and poetry, the light blue region contains topics related to Confucianism, while the dark blue region below it spans topics related to traditional religions and traditional Chinese medicine. The light orange and dark orange regions cover different aspects of history; for example, topics related to military history are more prominent in the darker orange region. The dark red area corresponds to political and diplomatic topics while the pink cluster at the bottom left covers topics related to administration. The four topics colored light purple at the bottom center are heavily loaded with geographical terms, which are of course quite generally used in everything from poetry to military history and administration.
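The Jensen-Shannon measure that underlies the map can be made concrete with a small worked example. The following is a minimal sketch, assuming each topic is represented as a probability distribution over a shared vocabulary; it is not the Topic Explorer's own code. The function names and the toy distributions are hypothetical, and the choices of base-2 logarithms and of returning the square root of the divergence (which makes the measure a metric) are assumptions rather than details stated in the text.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def pairwise_distances(topics: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of distances between every pair of topics."""
    return [[jensen_shannon_distance(a, b) for b in topics] for a in topics]


# Toy word distributions over a four-word vocabulary (placeholders, not model output).
topic_a = [0.5, 0.5, 0.0, 0.0]
topic_b = [0.0, 0.0, 0.5, 0.5]
topic_c = [0.4, 0.4, 0.1, 0.1]

matrix = pairwise_distances([topic_a, topic_b, topic_c])
```

In this toy case the two topics that share no vocabulary are at the maximum distance of 1, while topic_c sits closer to topic_a than to topic_b. A matrix of such pairwise distances is the kind of input that a manifold-learning routine such as isomap can embed in two dimensions.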
Figure 6 shows a similar comparison for the terms 气 (qi) and 阴阳 (yinyang). Here the distributions are quite similar, although the topics related to qi are more concentrated on the right side of the diagram whereas topics related to yinyang are distributed a bit more across central parts of the map. The relative confinement of topics related to qi corresponds to the fact that the Isomap algorithm has placed health and traditional medicine topics on the right side of the map in the dark blue cluster. Among the most central topics in the map (i.e., those closest to the 0,0 origin in the map) are these from the right side of the dark orange cluster:
20:11 命, 官, 贼, 授, 兵, 尔, 巡抚, 营, 阿, 民, 大臣, 部, 明, 馀, 总督, …
40:23 命, 尔, 大臣, 授, 馀, 额, 阿, 营, 克, 总督, 部, 明, 巡抚, 布, 乾隆, …
60:18 命, 大臣, 馀, 总督, 授, 巡抚, 营, 议, 乾隆, 额, 署, 奏, 匪, 康熙, 由, …
80:40 馀, 命, 巡抚, 总督, 大臣, 议, 署, 匪, 由, 调, 乾隆, 奏, 免, 州县, 设, …
100:63 命, 大臣, 巡抚, 总督, 奏, 议, 学士, 谕, 乾隆, 署, 由, 授, 州县, 直, 调, …
These topics are highly loaded with terms related to government officials, but also contain some words related to criminality. The centrality of these topics may be seen as reflecting both the large number of government documents in the Handian corpus, and the central importance of the civil service in China for the preservation and transmission of classical Chinese culture and values. It is also worth noting that topic 20:2 (阙, 德, 臣, 无, 元, 圣, 表, 可, 命, 实, 天, 奉, 道, 文, 神, ...) is actually slightly closer to the center than 20:11, but it is grouped with the light blue cluster of topics. Visual inspection of the highest probability terms suggests that 20:11 is more closely aligned than 20:2 with the other topics listed above, which lends some confidence to the clustering technique. It is important to keep in mind, however, that the isomap plot is generated using the full term distribution, not just the first 15 terms shown here. A complete assessment of the topics and their related documents would go beyond simple inspection of the top terms.
Figure 6. See Figures 4 and 5 for explanation of layout and coloring scheme. Here we compare the distribution of topics for two terms: (6a) 气, qi, and (6b) 阴阳, yinyang. The distribution of these topics in the corpus is rather similar, but 气 is slightly more concentrated on the right side of the map, where topics related to health and medicine are clustered.

Discussion
LDA topic models are not themselves interpretations of the documents (indeed, they stand in need of interpretation themselves),42 but they may assist scholars in exploring and interpreting large collections of materials. Ultimately there is no substitute for reading the documents, but the Topic Explorer interface, through its Hypershelf and Topic Isomap components, can guide scholars and learners alike to documents that they might not have otherwise encountered or thought to look for, resulting in a particularly productive form of guided serendipity.
Topic models are interesting to think about from the perspective of theories of meaning. While they do not capture exact meanings ("John loves Mary" and "Mary loves John" are viewed as identical statements under the "bag of words" assumption), they are quite successful at capturing something like the general gist or context of the words being used. Scholars of Chinese literature have emphasized the high degree of context sensitivity for the meanings of words in the Chinese language,43 but a strength of the topic modeling approach is that the same word is placed in multiple contexts, helping with the process of disambiguation. At the same time, because the models have a solid grounding in information theory, the use of metric measures such as the Jensen-Shannon distance is feasible for many applications. This provides new forms of evidence for humanistic discussions.
Although the corpus we used may be missing some potentially important documents, it is large enough that the topic models we derived from this corpus prove to be adequate for various purposes. Improved curation of the corpus nevertheless remains an important goal of our group for the future, and will be reflected in future iterations of the Handian Topic Explorer mirror site. Future work will allow us to address questions about topical relationships among the documents in the Handian corpus and about historical and geographical shifts in the topic distributions as represented in the corpus model, and ultimately to analyze the behavior of individual authors.
Finally, and more speculatively, philosophers of mind and cognitive science have sometimes been tempted by the idea that meaning or semantic content assignment is a kind of measurement process rather than the assertion of a relationship to a determinate abstract proposition.44 Computer scientists have started to provide the means to convert this idea into quantitative models45 to which measures such as Jensen-Shannon distance can be applied. Thus, the Digital Humanities are poised to have a significant impact on philosophical and practical discussions of the nature of meaning.

Conclusions
Topic models present a powerful new tool for computer-assisted interpretation in the humanities. We have described some particular issues faced in using topic models with ancient Chinese texts, and we have detailed the process of training LDA topic models on the Handian corpus of over 18,000 classical Chinese texts using the InPhO Topic Explorer. The results of these efforts and the software we have developed have been made publicly available via the Hypershelf interface at mirror sites at Xi'an Jiaotong University and Indiana University. This interface allows users to visualize the results of the modeling process. We have provided some preliminary description and analysis of the topics discovered by the algorithms using the more advanced notebook features of the Topic Explorer. These preliminary investigations have revealed some interrelationships among Confucian, Taoist and Buddhist themes, and the penetration of these themes in many aspects of traditional Chinese culture, from medicine to government. By following the threads among specific texts, guided by these topic models, scholars may exploit these new tools to enrich their understanding and interpretation of China's rich cultural heritage.
We developed a stopword list of 187 words, including 19 of these 20 most frequent terms (all but 州, zhōu, state or prefecture). The list was developed using a mixed strategy of (a) checking word frequencies; (b) checking inverse document frequencies (i.e., the logarithm of the ratio of the total number of documents to the number of documents containing the word; a low score means the word is widely distributed throughout the corpus, with a score of zero meaning that the word is present in every file); and (c) a cycle of training and retraining topic models, inspecting the topics, and adding terms to the stop list if they occurred in a high proportion of the topics and were judged uninformative about the topic.
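The inverse document frequency score described in step (b) can be illustrated with a short worked example. This is a minimal sketch of the formula as stated (the logarithm of total documents over documents containing the term), not the script actually used to build the stop list. The function name, the use of the natural logarithm, and the three toy documents are assumptions made for illustration.

```python
import math
from typing import Dict, List, Set


def inverse_document_frequency(documents: List[Set[str]]) -> Dict[str, float]:
    """Return log(N / n_t) for each term t, where N is the number of documents
    and n_t is the number of documents that contain t."""
    total = len(documents)
    vocabulary = set().union(*documents)
    scores = {}
    for term in vocabulary:
        containing = sum(1 for doc in documents if term in doc)
        scores[term] = math.log(total / containing)
    return scores


# Toy documents as sets of tokens (placeholders, not drawn from the corpus).
toy_documents = [
    {"之", "气"},
    {"之", "阴阳"},
    {"之", "气", "命"},
]

idf = inverse_document_frequency(toy_documents)
for term, score in sorted(idf.items(), key=lambda item: item[1]):
    print(term, round(score, 3))
```

In the toy data, 之 occurs in every document and so scores exactly zero, which would make it a candidate for the stop list, while a term found in only one document receives the highest score.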

The criteria we considered in deciding whether to place a term on the stop list were:
1. Does the term sometimes have a culturally/philosophically significant meaning?
2. Where it appears among the highest probability words for a topic, is it most likely to be interpreted with that significant meaning?
3. If it were removed from those topics, would the topics be less interpretable?
The resulting list of 187 stop words is available at http://inphodata.cogs.indiana.edu/handian/cn_stop.txt. The top 20 words with their frequencies after the stopword list was applied are shown below.

Figure 2. Similarity of documents to Book 1, Chapter 1 (卷一 学而第一) of the Analects (论语) in the 60-topic model. Each bar represents a document, and the colors represent the distribution of topics assigned to that document. (Different topics may be assigned the same color.) The length of the bar indicates overall similarity to the document on the first row. See Figure 3 for additional details about the Topic Explorer display.

Figure 3. Highest probability words for each topic appear in the topic key at right when the cursor is positioned over the key, or over the corresponding topic in any of the document rows. Clicking in either location causes the Hypershelf to re-sort the list of results according to the proportion of that topic assigned to each document. Here we show the reordering of results after selecting Topic 51 as the comparison dimension for documents most similar to Book 1, Chapter 1 (卷一 学而第一) of the Analects (论语) in the 60-topic model.

Figure 4. Topics from all five models arranged and clustered by the isomap algorithm. Circle size is inversely related to the number of topics in the model: the largest circles represent topics from the 20-topic model, the smallest those from the 100-topic model. Overlapping circles of different sizes indicate congruence between topics from models at different levels. See main text for further details, and the next two figures for applications of the map to topics and terms.

Figure 5. See Figure 4 for explanation of the underlying map. Here, colors are saturated according to relevance to the word entered in the search box: (5a) 孔子, Confucius, and (5b) 佛, Buddha/Buddhism. Both are significant for many of the topics in the models, but 佛 selects a more specific set of topics, showing up as the saturated blue circles just to the right of the 0,0 origin of the map. See main text for further discussion.