Evaluating Transformer Models and Human Behaviors on Chinese Character Naming

Abstract Neural network models have been proposed to explain the grapheme-phoneme mapping process in humans for many alphabetic languages. These models not only successfully learned the correspondence between letter strings and their pronunciations, but also captured human behavior in nonce word naming tasks. How would neural models perform on an unknown-character task in a non-alphabetic language such as Chinese, and how well would they capture human behavior? In this study, we first collect human speakers' answers on an unknown Chinese character naming task and then evaluate a set of transformer models by comparing their performance with human behavior on the same task. We found that the models and humans behaved very similarly: they showed similar accuracy distributions over characters and had a substantial overlap in answers. In addition, the models' answers are highly correlated with humans' answers. These results suggest that transformer models can capture humans' character naming behavior well.


Introduction
Many aspects of language can be characterized as quasi-regular: the relationship between inputs and outputs is systematic but allows many exceptions. Grapheme-phoneme mapping is an example of such quasi-regularity. For example, the letter string '-ave' in English is regularly pronounced as /eɪv/ as in GAVE and SAVE, with the exception of /æv/ in HAVE. Human speakers can easily grasp both patterns: in a nonce word naming experiment, most speakers pronounced the word TAVE as /teɪv/, while some pronounced it as /tæv/ (Glushko, 1979).
To explain the grapheme-phoneme mapping process, many models have been proposed, among which the Dual Route Cascaded (DRC) model and the connectionist model are the two most influential yet opposing. The DRC model (Coltheart et al., 2001; Coltheart, 1978) proposes that grapheme-phoneme mapping is implemented in two separate routes: a lexical route that directly maps a word's spelling to its pronunciation through a dictionary-like lookup procedure, and a non-lexical route that applies grapheme-phoneme correspondence 'rules' to convert letters to their corresponding pronunciations. The implementation of the DRC model requires domain-specific knowledge, such as spelling-to-sound rules. In contrast, the connectionist model (Seidenberg and McClelland, 1989; Plaut et al., 1996) proposes that a word's pronunciation is generated by a neural network that takes the orthographic representation as input and outputs the phonological representation, without requiring explicit knowledge of grapheme-phoneme correspondence rules. Both models can explain various behaviors in word identification, such as the faster identification of frequent words compared to infrequent ones. Therefore, there is still an ongoing debate about which model better captures the grapheme-phoneme mapping process.
However, these two models have mostly been tested on alphabetic languages (e.g., English and German), and it is still unclear how they would generalize to a non-alphabetic language such as Chinese. The DRC model seems to be unfit for Chinese because there are no regularities in Chinese that can be defined as grapheme-phoneme correspondence rules (Yang et al., 2009). In addition, Coltheart et al. (2001) asserted that "the Chinese, Japanese and Korean writing systems are structurally so different from the English writing system, that a model like the DRC model would simply not be applicable" (p. 236). Thus the connectionist model is the only candidate.

Table 1: Examples of characters with the phonetic radical 青<qing>, sorted into different regularity types. Syllable onsets and finals are bold when they are the same as the phonetic radical's.

regular       清, 情, 圊, 晴 <qing>
alliterating  倩, 輤 <qian>
rhyming       精, 靖, 菁 <jing>
irregular     猜 <cai>, 靚 <liang>, 靛 <dian>

The majority (81%) of Chinese characters are phono-semantic compounds (Li and Kang, 1993), which consist of a phonetic radical that carries pronunciation information (denoted by pinyin) and a semantic radical that carries semantic information. For example, in the character 晴 (<qing2>, 'sunny'), the left side 日 (<ri4>, 'sun') is the semantic radical, and the right side 青 (<qing1>, 'blue') is the phonetic radical. Although the phonetic radical does not encode componential information about the pronunciation (e.g., its first part does not represent the first phoneme or syllable onset the way letter strings do), the relationship between the phonetic radical's pinyin and the character's pinyin is also quasi-regular. Ignoring tonal differences, a character's pinyin can be categorized into four types (Fang et al., 1986): regular, the same as the phonetic radical's pinyin; alliterating, deviating in the syllable final; rhyming, deviating in the syllable onset; and irregular, deviating in both syllable onset and final (see Table 1 for examples). Pronouncing an unknown character thus involves two steps: first identifying the phonetic radical, and then applying a regularity pattern to its pinyin. However, there are no reliable cues to identify the phonetic radical, and the regularity patterns are quite arbitrary (Yang et al., 2009). How do Chinese speakers name an unknown character, and how well can neural models capture Chinese speakers' behavior?
In our study, we first collected human speakers' answers on unknown character naming, since no prior study has investigated how Chinese adults read unknown characters. We then trained a set of sequence-to-sequence transformer models with different settings on 4,281 phono-semantic characters. Neither human speakers nor models can name the unknown characters accurately, but the transformers have a slightly better average accuracy (47.4%) than the human speakers (45.3%). We then evaluated how closely the results of our aggregated transformers matched those of the human participants, in terms of the variety of answer types and answer overlaps. In general, both the transformers and the human speakers are able to identify the phonetic radical correctly and apply all four types of regularities to infer the pinyin, and the transformer models show a high correlation with human data in the proportion of each regularity type. In addition, there is considerable agreement between the answers generated by our models and those given by humans. Our results demonstrate that transformer models can capture human behavior in unknown Chinese character naming well.

Related Work and Current Study
Skilled Chinese readers make use of the phonetic radicals to name characters (Chen, 1996; Zhou and Marslen-Wilson, 1999; Ding et al., 2004), and previous studies have measured how phonetic radicals influence character naming in two ways: regularity and consistency (Fang et al., 1986; Hue, 1992; Hsu et al., 2009). Regularity is exemplified in Table 1, and consistency is defined as the proportion of characters sharing the same phonetic radical that also share the same pinyin. For example, 12 characters share the phonetic radical 青<qing> in Table 1, among which 3 characters (精, 靖, 菁) have the same pinyin <jing>, so the consistency score for these characters is 0.25 (3/12). Many studies have found regularity and consistency effects for human speakers: regular and more consistent characters are named faster and more accurately, and these effects are stronger for low-frequency characters than for high-frequency ones (Lien, 1985; Liu et al., 2003; Tsai et al., 2005). Previous studies of Chinese character modeling with phonetic radicals as inputs have successfully simulated the regularity and consistency effects. Yang et al. (2009) trained a feed-forward network on 4,468 Chinese characters and tested the model on 120 characters (seen in training). The input to the model includes the character's radicals and the radicals' positions (e.g., left-right, up-down).[6] The output of the model is the phonological features (e.g., stop, lateral) of the character's pinyin. They also measured human speakers' response latency[7] on each of the 120 test characters. By comparing the human speakers' response latency and the model's sum squared error, they found very similar regularity and consistency effects. In addition, Hsiao and Shillcock (2004, 2005) trained a feed-forward model on 2,159 left-right structured characters, with each character appearing according to its log token frequency. The input included each character's radicals, and the output was the character's pinyin.
They analyzed the training accuracy of the model and found the model's sum squared errors lower for the regular characters, which successfully simulated the regularity effect.
The regularity and consistency effects reveal that both human speakers and neural models utilize the statistical distribution of phonetic radicals in naming familiar characters. However, these effects cannot apply to unknown character naming, since speakers do not know the statistics of characters they have never seen. Therefore, we propose a new metric, the saliency of the phonetic radical, to measure how phonetic radicals influence speakers' unknown character naming behavior. Saliency is defined as the fraction of regular characters among all characters sharing the same phonetic radical. For example, the phonetic radical 青<qing> appears in 12 characters in Table 1, among which 4 characters (清, 情, 圊, 晴) are regular. Thus the saliency score of 青<qing> is 0.33 (4/12). The more salient a phonetic radical is, the more likely a character containing it is pronounced the same as the radical.
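The saliency metric reduces to a simple proportion over the characters sharing a radical. The sketch below recomputes the 青<qing> example from Table 1; the toneless-pinyin pairs are reproduced here purely for illustration:

```python
def saliency(characters):
    """Saliency of a phonetic radical: the fraction of regular
    characters (character pinyin identical to the radical's pinyin,
    tones ignored) among all characters sharing that radical.

    `characters` is a list of (character_pinyin, radical_pinyin)
    pairs for one phonetic radical, as toneless strings."""
    regular = sum(1 for char_py, rad_py in characters if char_py == rad_py)
    return regular / len(characters)

# Toneless pinyins for the 12 characters with radical 青<qing> in Table 1:
qing_family = (
    [("qing", "qing")] * 4      # 清 情 圊 晴 (regular)
    + [("qian", "qing")] * 2    # 倩 輤 (alliterating)
    + [("jing", "qing")] * 3    # 精 靖 菁 (rhyming)
    + [("cai", "qing"), ("liang", "qing"), ("dian", "qing")]  # irregular
)
print(round(saliency(qing_family), 2))  # 4 regular out of 12 -> 0.33
```

The consistency score from the related work can be computed analogously by counting pairs whose character pinyins match each other rather than the radical's.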
We hypothesized that human speakers would show a saliency effect in unknown character naming: they would name a character more accurately if its phonetic radical is more salient. We expected to find a similar saliency effect in the models. In addition, we also closely examined the models' answers and the humans' answers to investigate whether the models can represent human speakers' behavior.

[6] There are 10 different Chinese character structures, clustered by the arrangement of the character radicals, e.g., left-right (日+青=晴), top-down (相+心=想), and enclosure (口+或=國). The left-right structure is the most common type (71%) (Hsiao and Shillcock, 2006).
[7] Response latency measures the response speed, usually in milliseconds.

Data
The base character dataset consists of 4,341 Chinese characters constructed from the IDS dataset in the CHISE project (Morioka, 2008). The original IDS (Ideographic Description Sequence) dataset contains 18,347 characters used in China, Japan, and Korea, with the decomposition of each character into its phonetic and semantic radicals. The character selection criteria include: 1) the character is used in Chinese; 2) it is a phono-semantic compound; 3) it has a left-right structure. The character's pinyin, along with its phonetic and semantic radicals' pinyins, was collected using the pinyin package. The frequency of each character was extracted from the BLCU Corpus Center (Xun et al., 2016). We further labeled each character's regularity (regular, alliterating, rhyming, or irregular, as described in Table 1). In addition, we calculated each phonetic radical's saliency.
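The regularity labeling amounts to comparing the toneless syllable onset and final of the character with those of its phonetic radical. A minimal sketch, assuming the onset/final segmentation is already given:

```python
def regularity(char_onset, char_final, rad_onset, rad_final):
    """Label a character's regularity type from its toneless pinyin
    (onset, final) and its phonetic radical's toneless pinyin."""
    if char_onset == rad_onset and char_final == rad_final:
        return "regular"
    if char_onset == rad_onset:   # same onset, different final
        return "alliterating"
    if char_final == rad_final:   # same final, different onset
        return "rhyming"
    return "irregular"

# Characters from Table 1, all with phonetic radical 青 (onset q, final ing):
print(regularity("q", "ing", "q", "ing"))  # 晴 -> regular
print(regularity("q", "ian", "q", "ing"))  # 倩 -> alliterating
print(regularity("j", "ing", "q", "ing"))  # 精 -> rhyming
print(regularity("d", "ian", "q", "ing"))  # 靛 -> irregular
```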
There are 660 radicals after decomposing the 4,341 characters, among which 46 radicals only serve as semantic radicals, 493 only serve as phonetic radicals, and 121 serve as both. Each radical appears in 7 characters on average, with a range of 1 to 30. 80% of the characters in our database have the phonetic radical on the right, though there are many exceptions; e.g., the semantic radical '戈' <ge> always appears on the right.

Test Data
We selected 60 characters with different phonetic radicals from the dataset as our test data; they are listed in Table 14 in Appendix B. The test characters were selected following criteria that ensure human speakers are unfamiliar with the character while familiar with its phonetic radical: 1) the character appears fewer than 5 times in the whole corpus; 2) the phonetic radical appears in more than 4 other characters. The average saliency score for these phonetic radicals is 0.43, with the score distribution shown in Figure 1. Some test characters have more than one pinyin, e.g., '硞' (<que>, <ke>, <ku>), which yields 88 pinyins for the 60 characters. The distribution of regularity types for the test characters is shown in Table 2.

Training Data
We excluded the 60 test characters and used the remaining 4,281 characters as our training data. Regular is the most common type (42.7%), followed by irregular, rhyming, and alliterating. Since many characters have extremely low frequency and are not known to Chinese speakers, we used three training datasets with characters of different frequencies to represent native speakers' vocabulary sizes. The ALL dataset uses all 4,281 characters. The MID dataset consists of the 2,140 characters in the top 50% by frequency. The HIGH dataset consists of the 1,070 characters in the top 25% by frequency. The statistics of these training sets are shown in Table 3.

Human Experiment

The average age of our 55 participants is 26.3, and 80% of them have an education background of college or above. In the experiments, they were asked whether they knew the character and were prompted to type the pinyin of the character. The detailed experiment procedure and sample questions are described in Appendix A.

Results: Human Answer's Accuracy
In general, the test characters are unknown to the participants. Accuracy is calculated on the syllable onset and final, ignoring the tone, since tones are more affected by the speaker's accent than syllable onsets and finals are. For polyphone characters, as long as the participant named one correct pinyin, we counted the answer as correct. The average accuracy over all participants is 45.3% (27 out of 60 characters), with a range of 26.7%-68.3%. Some characters are more difficult to name than others; for example, 8 characters have an accuracy of 0, meaning that none of the participants named them correctly. A character's accuracy is calculated as the proportion of participants who named it correctly, ranging from 0 to 98.2%. There is a strong positive correlation between a character's accuracy and its phonetic radical's saliency (r = 0.62), which confirms our hypothesis about the saliency effect: the more salient the phonetic radical, the more participants named the character correctly. The accuracy measures how well human speakers grasp the grapheme-phoneme distributional patterns in Chinese. The results show that even native speakers cannot accurately predict the pronunciation of an unknown character, reflecting the complex nature of the Chinese grapheme-phoneme mapping system.
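The scoring scheme described above (toneless onset+final match, with any accepted reading counting as correct for polyphones) can be sketched as follows; the characters and answers below are invented for illustration:

```python
def name_correct(answer, gold_pinyins):
    """An answer, given as an (onset, final) pair with tone ignored,
    is correct if it matches any of the character's accepted pinyins.
    Polyphone characters contribute several gold pinyins."""
    return any(answer == gold for gold in gold_pinyins)

def participant_accuracy(answers, gold):
    """`answers` maps character -> (onset, final) named by one
    participant; `gold` maps character -> list of accepted pairs."""
    correct = sum(name_correct(answers[c], gold[c]) for c in gold)
    return correct / len(gold)

# Hypothetical two-character test; 硞 is a polyphone with three readings.
gold = {"硞": [("q", "ue"), ("k", "e"), ("k", "u")], "晴": [("q", "ing")]}
answers = {"硞": ("k", "e"), "晴": ("j", "ing")}
print(participant_accuracy(answers, gold))  # 1 of 2 correct -> 0.5
```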

Results: Human Answer's Variability
Since the participants named the characters' pinyins differently, each character has a variety of unique answers. On average, each character has 6.7 unique answers, with a minimum of 2 and a maximum of 15. The number of answers is negatively correlated with the saliency of the phonetic radical (r = -0.51): the more salient the phonetic radical, the fewer answers the speakers guessed.
We defined 5 answer types based on regularity. The participants either guessed the character's pinyin to be the same as its phonetic radical's (regular), changed the syllable final (alliterating), changed the syllable onset (rhyming), changed both (irregular), or mistakenly used the semantic radical to name the character (semantic).[11] We present the answer types for the character '煔' as an example in Table 4, and define the production probability P_p of an answer type as the proportion of participants who produced that type.
The average production probability for each type is listed in Table 5. Most participants are able to identify the phonetic radical correctly, as the average production probability of the semantic type is only 2%. The regular answer type has the highest production probability (58%), suggesting that participants are most likely to name a character the same as its phonetic radical. The production probabilities of answer types for each character are plotted in Figure 3 in Section 6.
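The production probability P_p is simply the proportion of participants whose answer falls into each regularity type; a minimal sketch with an invented split of 10 participants' answers for one character:

```python
from collections import Counter

def production_probability(answer_types):
    """P_p of each answer type for one character: the proportion of
    participants whose answer falls into that regularity type."""
    counts = Counter(answer_types)
    total = len(answer_types)
    return {t: n / total for t, n in counts.items()}

# Hypothetical answer types from 10 participants for one character:
types = ["regular"] * 6 + ["rhyming"] * 2 + ["irregular", "semantic"]
pp = production_probability(types)
print(pp["regular"])  # 6 of 10 participants -> 0.6
```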

Transformer Model
To model the joint probability of the syllable onset and final, we used sequence-to-sequence transformers (Vaswani et al., 2017), trained from scratch, to generate the pinyin of Chinese characters.[12]

Experiment Setups
Both the encoder and decoder of all our models had 2 layers, 4 attention heads, 128 expected features in the input, and 256 as the dimension of the feed-forward network. For training, we split the dataset into train/dev splits of 90-10 and replaced tokens that appear only once in the training data with 〈unk〉. We set dropout to 0.1 and batch size to 16, and used the Adam optimizer (Kingma and Ba, 2015) with learning rates varied over the course of training as in Vaswani et al. (2017). We used 5 different random seeds and trained for 40 epochs with early stopping in all of our experiments. For inference, we set the beam size to 3.

[11] When examining the data, we found that some participants named the character the same as its semantic radical. We loosely defined this type of error as the semantic type. It could also be that the participants applied the irregular pattern to the phonetic radical, and the resulting pinyin happened to be the same as the semantic radical's pinyin; however, there is no way to confirm this. We asked some of our participants (with linguistics backgrounds) to explain how they guessed the pinyin, and none of them could articulate their thinking process.
[12] We did not use a classification model because there are certain rules in pinyin formation (e.g., /ü/ cannot follow /b/, /p/, /m/, /f/), which requires the model to learn the syllable onsets and finals jointly.
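Under the stated hyperparameters, the architecture can be sketched with PyTorch's built-in nn.Transformer. This is only a skeleton: the token embeddings, vocabulary handling, learning-rate schedule, and beam-search decoding used in the actual experiments are omitted, and the batch-first layout is an assumption of this sketch:

```python
import torch
import torch.nn as nn

# Hyperparameters as described in the text.
model = nn.Transformer(
    d_model=128,           # 128 expected features in the input
    nhead=4,               # 4 attention heads
    num_encoder_layers=2,  # 2-layer encoder
    num_decoder_layers=2,  # 2-layer decoder
    dim_feedforward=256,   # feed-forward dimension
    dropout=0.1,
    batch_first=True,
)

# One batch of 16 already-embedded sequences: radical tokens in,
# pinyin tokens out. Real inputs would come from learned embeddings.
src = torch.zeros(16, 5, 128)
tgt = torch.zeros(16, 7, 128)
out = model(src, tgt)
print(out.shape)  # (batch, target length, d_model)
```

A projection from `d_model` back to the output vocabulary (not shown) would produce the per-token onset/final logits.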

Experiment 1
We trained a set of models to simulate the grapheme-phoneme mapping process in Chinese speakers. Our BASE model used the phonetic radical's orthographic form to generate the syllable onset and final (without tone) of the target character. We further examined whether identifying the phonetic radical before generating the syllable onset and final would improve the model's performance. We labeled the phonetic radical's position (left or right) with two methods: LABELm and LABELs. LABELm used the true position of the phonetic radical as the ground-truth label. Since human speakers do not always identify the phonetic radical's position correctly, LABELs instead labeled the position of the phonetic radical based on phonetic similarity: we calculated the phonetic similarity between the character's pinyin and the two radicals' pinyins using the Chinese Phonetic Similarity Estimator (Li et al., 2018), and the radical with the higher phonetic similarity was labeled as the phonetic radical.[13] We further labeled the regularity type of the characters based on LABELm and LABELs, yielding LABELmr and LABELsr. Examples of input and gold output in the training data are shown in Table 6. All the models were trained on the ALL, MID, and HIGH datasets as described in Section 3.2.
Since previous studies suggested that the regularity and consistency effects are more prominent for low-frequency characters than for high-frequency ones (e.g., Ziegler et al., 2000; Chen et al., 2009), the frequency of known characters might also influence how participants predict unknown characters. We therefore added a frequency label as an input feature on the full training data, yielding the ALL+FREQ model. The characters were categorized into four categories based on their frequency: 'rare' (frequency = 1), 'low' (1 < frequency ≤ 50th percentile), 'mid' (50th percentile < frequency ≤ 75th percentile), and 'high' (frequency > 75th percentile). The distribution of regularity types is similar for characters of different frequencies.
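The four frequency buckets can be sketched as a simple threshold function; the percentile cut-off values below are hypothetical:

```python
def frequency_category(freq, p50, p75):
    """Map a raw corpus frequency to the four buckets described in
    the text, given the 50th- and 75th-percentile cut-offs."""
    if freq == 1:
        return "rare"
    if freq <= p50:
        return "low"
    if freq <= p75:
        return "mid"
    return "high"

# Hypothetical percentile cut-offs for illustration:
p50, p75 = 120, 900
print([frequency_category(f, p50, p75) for f in (1, 50, 500, 5000)])
# -> ['rare', 'low', 'mid', 'high']
```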
The summary of the number of characters and each regularity type can be found in Appendix B, Table 12.
In addition, we added two conditions for output in training all models: Shuffling and Adding tones. We shuffle the position of the syllable onset and final in model output to explore the impact of the generated order since we don't know if the human speakers identify the syllable onset or syllable final first in character naming. We also add tones before the 'End' token in the generation to see whether it improves the model performance. Examples of input and output of the conditions are shown in Table 6. In total, there are 80 types of models with different settings.

Accuracy Results
We calculated the test accuracy the same way as for the human data: we only counted the accuracy of the syllable onset and final. For polyphone characters, as long as the model predicted one correct pinyin, the answer was counted as correct.

[13] For example, the character '烙' <luo4> ('flatiron') consists of the semantic radical '火' <huo3> ('fire') and the phonetic radical '各' <ge4> ('each'). The distance between <luo4> and <huo3> is 7.5, and the distance between <luo4> and <ge4> is 35.6. For LABELs, the output label is therefore 'left', although the left radical '火' is actually the semantic radical.
The average accuracy of all 400 models (80 types × 5 random seeds) is 42.1%, which is significantly lower than the humans' accuracy (45.3%; t = 3.15, p < 0.01). The average accuracy of each type of model is listed in Table 7. The best-performing model is ALL+FREQ with LABELm, without tone and with shuffling, which achieved an accuracy of 50.3%. Compared to the BASE model, adding the phonetic radical's position label and the character's regularity label usually improved the model's accuracy. Adding tone generally hurt the model's accuracy. Shuffling the syllable onset and final and adding the frequency label to the input did not change the model's accuracy much.

Experiment 2
In Experiment 1, the input of our models only used the orthographic form of the radicals, which is how the previous literature described the Chinese grapheme-phoneme mapping process. However, the models might not have enough data to learn the full mapping from radicals to pinyin, because many radicals appeared only once or twice in the training data, since we only included compound characters with the left-right structure. For example, the phonetic radical '乘' <cheng> occurred only once in the training data, in the character '剩' <sheng>. The models would not be able to accurately learn the pinyins of such radicals. Human speakers, however, know the pinyin of most radicals, since many radicals are also commonly used as stand-alone characters; e.g., '乘' is a stand-alone character meaning 'to multiply'. In order to better model human speakers, it is necessary to inject the radicals' pinyins as external information into the model. The model would also benefit from the added radicals' pinyins when generating the character's pinyin.
In addition, pinyin plays an important role in modern Chinese speakers' reading and spelling experience. Pinyin is a Romanized phonetic coding system created in 1958 to promote literacy (Zhou, 1958). In the information age, pinyin has become indispensable in Chinese speakers' lives because it is the dominant typing system for computers, smartphones, and other electronic devices. The prevalent experience of typing characters through pinyin has challenged the traditional view that Chinese characters are processed purely through orthographic forms (Tan et al., 2013). Many recent studies have found that pinyin mediates the character recognition process (Chen et al., 2017; Lyu et al., 2021; Yuan et al., 2022). To better capture modern Chinese speakers' character naming process, it is necessary to incorporate the radical's orthographic form as well as its pinyin in our models. Therefore, in Experiment 2, we added the radicals' pinyins (syllable onset, syllable final, and tone) to the input, as shown in Table 8. We used the same model variations as in Experiment 1[15] and trained 80 different types of models (5 random seeds for each type) with the new input. The training settings are the same as in Experiment 1.
Accuracy Results Adding pinyin to the input increased the models' accuracy.[16] The average accuracy of the 400 models in Experiment 2 is 47.4%, which is significantly higher than the humans' accuracy (t = -2.7, p < 0.01). The accuracy for each type of model is listed in Table 11 in Appendix B. The best-performing model is ALL+FREQ with LABELmr, without tone and with shuffling, which achieved an accuracy of 55%. The effects of the different labels, adding tone, and shuffling are similar to those in Experiment 1.

Comparison Between Models' Results and Human Behaviors
In this section, we compare the transformer models' results in Experiments 1 (MODEL[-PINYIN]) and 2 (MODEL[+PINYIN]) with human performance. Since human participants differ, i.e., they have different vocabularies and may use different strategies to identify the phonetic radical, we used all 80 models in each experiment to represent the human variety. Following Corkery et al. (2019), each random initialization was also treated as an individual participant. Therefore, the sample size for the human participants is 55, and the sample size for the models in each experiment is 400 (80 models × 5 initializations). We focused on three types of similarities: 1) accuracy, i.e., do humans and models show similar accuracy on each character? 2) overlap, i.e., do humans and models predict the same pinyin for each character? 3) variability, i.e., do humans and models show similar answer regularity patterns?

[15] For the output, we added LABELm, LABELs, LABELmr, and LABELsr, as well as adding tone and shuffling. For the input, we added the frequency label to create ALL+FREQ.
[16] We cannot fully rule out the possibility that the increased accuracy is due to the model having longer inputs with pinyin rather than the model making use of the phonetic information. However, input length might not have a significant impact on the models, because our models with frequency labels (ALL vs. ALL+FREQ) also vary in input length but their accuracies did not change much.
Accuracy We calculated each character's accuracy for MODEL[-PINYIN] and MODEL[+PINYIN]. First, both models showed a saliency effect: the models' character accuracy is positively correlated with the saliency score (Pearson r = 0.48 for MODEL[-PINYIN] and r = 0.57 for MODEL[+PINYIN]), which is not significantly different from the humans' saliency correlation (r = 0.62). In addition, there is a strong correlation between human character accuracy and both models' character accuracy (MODEL[-PINYIN] r = 0.79, MODEL[+PINYIN] r = 0.88), suggesting that humans and models are in high agreement. In conclusion, the transformer models' answers are very similar to the human answers in terms of character accuracy.
Overlap The overlap rate was computed to measure the extent to which different human speakers (and models) predict the same answers for each character. For example, if participants 1 and 2 have 30 identical answers, their overlap rate is 50% (30/60). The overlap rate for human-human, human-all models, and human-best model is shown in Figure 2. In general, the humans' answers are more similar to each other than to the models' answers.

Variability Like human speakers, the transformer models also produce different answers for each character. We categorized these answers based on their regularity type and calculated the models' averaged production probability (P_p) for each answer type, as listed in Table 10. We further calculated the Spearman correlation (ρ) and Pearson correlation (r) between the production probabilities of each type in the human answers and the models' answers on each character (N = 60). All the regularity types are highly correlated except for the semantic type. The models did not produce as many semantic-type answers as humans, suggesting the models are better at identifying the phonetic radical than humans. In addition, we also calculated the cross-entropy between the humans and the models on the production probabilities of the 5 regularity types. The production probability of different regularity types for each character is shown in Figure 3. The answer type patterns are very similar for humans and models, except for the semantic type. Humans produced semantic-type answers for 15 characters, while both our models produced semantic-type answers for fewer characters, with a much smaller production probability. This implies that phonetic radicals are identified differently by humans and transformer models.
Humans are affected by a wide range of linguistic knowledge when identifying the phonetic radical, including the semantic meaning of the radical, vocabulary size, and reading comprehension (Anderson et al., 2013; Yeh et al., 2017). The models did not receive these extra inputs, and thus did not closely capture human behavior on the semantic answer type.
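The pairwise overlap rate used in this comparison reduces to a per-character agreement fraction; a minimal sketch, with invented answers for a 4-character test set:

```python
def overlap_rate(answers_a, answers_b):
    """Fraction of test characters on which two participants (or a
    participant and a model) gave the same pinyin answer."""
    assert answers_a.keys() == answers_b.keys()
    same = sum(answers_a[c] == answers_b[c] for c in answers_a)
    return same / len(answers_a)

# Hypothetical toneless answers from two participants:
a = {"煔": "shan", "硞": "ke", "輤": "qian", "靛": "ding"}
b = {"煔": "shan", "硞": "ku", "輤": "qian", "靛": "dian"}
print(overlap_rate(a, b))  # agree on 2 of 4 characters -> 0.5
```

Averaging this rate over all pairs within the human group, or between each human and each model, gives the human-human and human-model overlap rates reported above.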

Conclusion and Discussion
Conclusion We evaluated transformer models and human behaviors on an unknown Chinese character naming task. This task is difficult for both humans and transformer models, as the average accuracy is below 50%. Humans have higher accuracy than MODEL[-PINYIN] and lower accuracy than MODEL[+PINYIN], and the models and the humans perform very similarly. First, saliency effects were found in both the human data and the models' results, suggesting that both models and humans utilize the statistical distribution of the phonetic radical to infer a character's pinyin. Further, although humans' answers are more similar to each other, our models also achieved substantial overlap with humans' answers. Additionally, the production probability of each answer type is highly correlated between models and humans (except for the semantic type), suggesting that both models and humans are able to apply all regularity patterns in producing answers. Finally, models with the radicals' pinyins in the input are more similar to humans and achieved higher accuracy.
Capturing quasi-regularity Our work also relates to the long-standing criticism that neural networks may only learn the most frequent class and cannot extend to minority classes, and thus would fail to learn the quasi-regularity in languages (Marcus et al., 1995). Previous studies on morphological inflection have shown that neural models overgeneralize the most frequent inflections on nonce words and have almost no correlation with humans' production probabilities on the less frequent inflections (e.g., ρ = 0.05 for the /-er/ suffix in German plurals (McCurdy et al., 2020), and r = 0.17 for irregular English verbs (Corkery et al., 2019)). However, our results showed that transformer models can learn the quasi-regularity in Chinese character naming: the models produce all answer types, and the production probability of each type is highly correlated with the human data. Our results do not contradict the previous studies, however. Chinese character naming and morphological inflection both exhibit quasi-regularity, but the two domains are very different: the patterns in Chinese character naming are less rule-governed. This paper's contribution to the debate on quasi-regularity in language processing is not to provide a 'yes' or 'no' answer; instead, we used a novel task and showed that neural models have the potential to model human behaviors in learning quasi-regularity. We hope our study can inspire future work to apply diverse tasks and conduct more detailed examinations of neural models' ability to learn quasi-regularity.
Modeling Chinese reading with neural networks Our study also contributes to the current debate over whether reading skill is acquired by a domain-general statistical learning mechanism (Plaut, 2005) or by language-specific knowledge, as in the DRC model (Coltheart et al., 2001). Our results demonstrate that a general statistical learning mechanism (implemented as the transformer model) can learn the Chinese grapheme-phoneme mapping. We not only successfully simulated the general saliency effects in humans' unknown character naming behavior, but also showed in detail that the answers produced by models and humans are highly similar. Another contribution to modeling Chinese reading is that ours is the first study to incorporate the radicals' pinyins in the model. Models with pinyin in the input not only had better accuracy, but are also more similar to human behavior. Our results echo the recent literature on the pinyin effect: for modern Chinese speakers, who type characters through pinyin more often than they handwrite characters, pinyin can be an important mediator of the grapheme-phoneme mapping process.