WordPars: A tool for orthographic and phonological neighborhood and other psycholinguistic statistics in Persian

Esmaeelpour, Elmira; Saneei, Sarah; Nourbakhsh, Mandana

doi:10.3758/s13428-021-01712-4

Published: 09 November 2021

Volume 54, pages 1902–1911, (2022)

Behavior Research Methods

Abstract

This paper presents a Windows program providing statistics on word and non-word stimuli in Persian, including word frequency, orthographic and phonological length, orthographic and phonological neighbors, and transposed-letter neighbors. It also generates possible non-words that are orthographic neighbors of the target word. Persian is an under-represented Indo-European language that has historically been influenced by Arabic and adopted certain characteristics in its writing, e.g., the omission of short vowels. This tool aims to help researchers in psycholinguistics, specifically with regard to isolated word recognition in Persian. This downloadable program computes the aforementioned indices free of charge. This tool uses two corpora, i.e., Zaya corpus and WorldLex, for reporting and computing statistics and is a user-friendly program provided in English that is also easy to work with for non-Persian researchers. It can be accessed from https://github.com/ssaneei/Wordpars.

It is becoming increasingly difficult to ignore the high demand for well-established tools for conducting linguistic studies that involve stimuli in a given language. In recent years, there has been a growing interest in tools for psycholinguistic research, since they make it possible to select items that differ as little as possible in given linguistic factors, for example, word frequency and orthographic and phonological neighborhood size. Various helpful tools can be found for this purpose in different languages; for instance, N-Watch for English (Davis, 2005), BuscaPalabras for Spanish (Davis & Perea, 2005), and E-Hitz (Perea et al., 2006) for Basque.

Nonetheless, far too little attention has been paid to establishing tools for psycholinguistic research, specifically for word recognition and reading in Persian. The development of psycholinguistic tools for Persian is important because it is an Indo-European language that has historically been influenced by Arabic and adopted certain characteristics in its writing, such as the omission of short vowels and the inclusion of various homographs, which make its orthographic system more complicated and inconsistent. Furthermore, Persian is an under-represented language in terms of the literature on this subject. The lack of such tools led us to design a program that derives orthographic and phonological statistics and neighbors, transposed-letter words, and non-word neighbors, and that generates non-words, to ease researchers’ path in carrying out experiments. The use of this tool can also be beneficial for instructors teaching Persian as a foreign language. For example, in terms of phonological similarities and word learning, words or novel words with more phonological similarities to other words can be learnt faster than those with fewer similarities (Luce & Large, 2001; Stamer & Vitevitch, 2012; Storkel, 2004; Storkel et al., 2006), and this advantage can be taken into consideration in teaching Persian as a foreign language.

Research on letter similarity or confusability can provide valuable information about visual word processing (see, e.g., Gutiérrez-Sigut et al., 2019). In Persian, many letters share the same base word, which can be confusable for the learners of Persian regardless of the place/location of diacritical marks. Wiley et al. (2016) studied the issue in Arabic in a same-different judgment task, and Perea et al. (2018) did so in a masked priming lexical decision task. The Persian writing system includes Arabic script with very few added letters; therefore, researchers of Persian can use the psycholinguistic indexes from WordPars with databases of letter similarity in Arabic script. For example, Boudelaa et al. (2020) present data on the similarities of Arabic letters and the allographs in three domains of visual, phonetic, and motoric.

This paper fills the aforementioned gap by presenting a tool (i.e., WordPars) for computing some psycholinguistic statistics, providing a list of neighbors of words and generating non-word neighbors and transposed-letter words in Persian. At present, Persian scholars mostly refer manually to corpora in order to find the frequency of a given word. The most cited corpora are FARSDAT, a Persian spoken corpus (Bijankhan, 2004), PLDB, a written one (Assi, 2005), and Zaya (compiled by Eslami et al., 2004). For this study, we used Zaya corpus and WorldLex due to their free accessibility, unlike the other corpora, and Zaya seems to be a suitable Persian corpus for phonological studies, owing to its phonological transcription.

The WordPars program is a tool that mainly provides statistics such as orthographic and phonological length and word frequency and generates neighbors of a given word, with word, non-word, transposed-letter word, and transposed-letter non-word options, which is crucial for studies on word processing and reading (see, e.g., Brysbaert et al., 2018). This tool is seemingly the first of its kind for this purpose. It can help scholars conduct psycholinguistic research in Persian more readily, thus leading to the development of research into Persian's unique and unknown characteristics. The findings of this study can be useful beyond the Persian language and the unique characteristics of Persian, as they can be utilized for developing or challenging the dominant theories in word recognition and reading. Also, it is vital to deal with pseudowords for conducting text-to-speech, as they can form actual words in the future.

This program enables users to add the phonetic transcription of words not presented in Zaya corpus; it is not possible to provide phonetic transcription automatically since short vowels are not written in Persian (for example, "در" /dor/, which means ‘pearl’, and "در" /dar/, which means ‘door’). Therefore, the results will be in alignment with the exact homonym. The program can further generate non-words from a target word, and both words and non-words can be extracted from this tool.

The program runs on Windows operating system, requires about 512 MB of RAM, and needs approximately 128 MB of hard disk space. It is downloadable and freely accessible from https://github.com/ssaneei/Wordpars.

The noteworthy merit of the tool introduced in this paper is that it is presumably the first tool on this subject. This user-friendly tool is offered in English for ease of work and has an internal Persian keyboard for non-Persian researchers.

Persian language

Persian is a branch of Indo-European languages. There are currently three major regional variations of Persian: Persian spoken in Iran, in Afghanistan, and in Tajikistan. The variation studied in this research is the one spoken in Tehran, Iran's capital, which is sometimes referred to as Modern Colloquial Persian. The variety of Persian spoken in Iran is also known as Farsi.

As an SOV language, Persian has nearly lost the inherited synthetic nominal and verbal inflection and their inflectional classes. Consequently, the inflectional distinction of case, number and gender has also been lost (Windfuhr, 2009). In addition, a distinctive characteristic of modern Persian, which is becoming the focus of more syntactic and semantic studies, is the increasing number of compound verbs relative to simple verbs. For example, the Persian equivalent of a simple word in English, such as /walk/, is /qadam zadan/, a compound verb consisting of two words. In terms of phonology, Persian has six vowels (i.e., /i, e, a, ɑ, o, u/). It also has 23 consonants, including eight stops /p – b, t – d, c – ɟ, ɢ, ʔ/, eight fricatives /f – v, s – z, ʃ – ʒ, x, h/, two affricates /t͡ʃ, d͡ʒ/, and five sonorants /m, n, j, l, r/ (Modarresi Ghavami, 2018).

Regarding orthography, like Arabic, Persian uses the Arabic script, has a consonantal system and is written from right to left. This script is used not only for the Arabic language but also for Indo-European languages such as Kurdish, Urdu, and Pashto, as well as for Uyghur, a Turkic language that uses Arabic script for its writing system (Yakup et al., 2015). There are 32 letters in the modern Persian alphabet. There are two main innovations in Persian. One is that additional letters were created for the four Persian consonants /p/, /tʃ/, /g/, /ʒ/ (Windfuhr, 2009). The other one is that the three long vowels are represented by the letter of the consonant nearest in pronunciation. Diacritics do not represent the three short vowels. Phonological and semantic ambiguity emerges when the reader is faced with a string of consonantal letters (Baluch, 2005).

The default vocabularies

The only accessible and free-of-charge word database in Persian that provides a phonetic form of words is Zaya (Eslami et al., 2004). Each Zaya entry includes Persian written and phonological form, part of speech, stress pattern, and frequency. Zaya documents claim that they have used a ten-million-word corpus and a Persian dictionary from the year 2002 for extracting about 55,000 entries. We assume that the frequencies are inserted according to the number of appearances in these corpora. Zaya corpus contains a total of 54,553 lines. These lines are dedicated to 163 "Affixes", 33 "Syntactic Categories", 14 "Affix Types", and 54,340 words, marked as "entries". Furthermore, three lines have been left blank. For this tool, we have considered only the word entries that had at least a length of one and all of whose characters were letters, space, or zero-width non-joiner (ZWNJ). We did not use affixes for this tool. Also, it seems that there is no homogeneity in the corpus structure between affixes and entries. In addition, since using all the words in this corpus would mean a large number of rare words, we considered only those entries with a frequency above 5.

We should acknowledge that Zaya is not a perfect corpus, as only about 38% (21,000) of its words have frequencies above ten. The corpus does not seem to cover all the Persian words available. To address this issue, WordPars can analyze any Persian word defined by the users in its first tab. Researchers can define the phonological form of words not present in the corpus based on the guideline in the program. Also, to cover the insufficiency in orthography, we used another corpus named WorldLex. Researchers should bear in mind that some of the words with a frequency of 5000 in Zaya corpus may not showcase their real frequencies, and these figures may be unreliable. This issue can be considered a limitation of Zaya corpus, and an extra section is added to the software’s help tab to make a note of it.

WorldLex (Gimenes & New, 2016) was used for the orthography and frequency part in WordPars. This corpus, which is freely available for 66 languages and can be downloaded from http://wordlex.lexique.org, includes both informal and formal language because it has been derived from Twitter, blogs, and subtlex. It must be noted that this corpus contains words, without ZWNJ or any space, from documents collected from public pages, together with their frequencies. There are two files for each language, i.e., one without spell-checker validation that contains 591,710 entries, and the other one with spell-checker validation that contains 67,047 entries. Hereafter, we will call the first one WorldLexAll and the second one WorldLexStandardized. In this study, we used both of these files, though we eliminated the strings with numbers, stretched letters (such as “جمــــــــــــــعـــــــــــــه” instead of “جمعه”), Kojira He “ۀ”, diacritics, and dashes. Also, we only took the entries with a frequency above 1 for the blogs and eliminated the entries with zero frequency in subtlex and Twitter. Interestingly, there are some words from different Persian dialects in this corpus. We kept only words with more than one and fewer than 12 letters. The total number of words used in our study from this corpus was 122,912 for WorldLexAll and 33,289 for WorldLexStandardized.
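The string-level cleaning rules described above can be sketched as a small filter. This is only an illustration under stated assumptions: the function name keep_entry is hypothetical, diacritics are approximated by the Unicode category for non-spacing marks, and the frequency thresholds are left out because the layout of the WorldLex files is not described here.

```python
import unicodedata

TATWEEL = "\u0640"       # kashida, the character used to stretch letters
HEH_WITH_YEH = "\u06c0"  # the letter "ۀ", excluded in the description above

def keep_entry(word: str, min_len: int = 2, max_len: int = 11) -> bool:
    """Return True if a raw string passes the (assumed) cleaning rules."""
    # Keep only strings with more than one and fewer than 12 characters.
    if not (min_len <= len(word) <= max_len):
        return False
    for ch in word:
        # Reject stretched letters, "ۀ", and dashes.
        if ch in (TATWEEL, HEH_WITH_YEH, "-"):
            return False
        # Reject any digit (Latin, Arabic-Indic, or Persian).
        if ch.isdigit():
            return False
        # Reject combining marks, used here as a stand-in for diacritics.
        if unicodedata.category(ch) == "Mn":
            return False
    return True
```

With this sketch, a plain word such as "جمعه" is kept, while its stretched variant and any string carrying a vowel diacritic are rejected.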

As mentioned above, each of these two files contains three different frequencies—the frequency for blogs, for Twitter and for subtlex. Adding all the word frequencies in these two files, we assume that, in WorldLexAll, 14,847,183 words are extracted from blogs, 12,250,143 words from Twitter and 15,096,117 words from subtlex. After the standardization phase and spell checking, the same figure in WorldLexStandardized became 12,717,663 words from blogs, 8,973,249 from Twitter and 11,614,876 from the news.

WorldLex itself consists of two kinds of corpora, one with all the words used in it (i.e., WorldLexAll) and the other a standardized one (i.e., WorldLexStandardized). Using these three corpora helped the researchers achieve better results because the words in Zaya are mostly older than the words in WorldLex corpora due to the nature of blogs, tweets and news items, which are more recent in comparison with texts derived from books. Also, blogs and tweets are written by all groups of the society and can be more fruitful for sociolinguistic research. Furthermore, spelling errors, which are becoming common and sometimes lead to changes in orthography, can be thus tracked and reviewed in terms of their frequency. Moreover, many up-to-date concepts and proper nouns exist in WorldLex as opposed to Zaya.

It should be noted that in order to remove the low-frequency words, a subjective solution based on intuitive criteria has been used for each corpus. This issue can be considered a limitation of the tool in terms of the psychological validity of the output.

The corpora defined above were used internally by the program (i.e., WordPars). In the following sections, we will explain the tool and its features in detail.

Specifying the stimuli to be analyzed

The menus in this tool are presented in English in order to simplify the job for people who are not native Persian speakers. WordPars includes a guide file that provides a table for each grapheme and its phonological form according to the default vocabulary. Consequently, if researchers are not familiar with phonetic rules, they can enter the phonetic form of words not existing in Zaya corpus based on the guide.

The program contains two main tabs: One for calculating the neighborhood similarity (match/distance) and the other one for neighbor finding. To work with the first tab in WordPars, there are four ways to input the stimuli: (1) By typing, (2) using the "Open from a file" button and then opening a text file, (3) by pasting it from the clipboard (right-click pop-up menu or simply the shortcut Ctrl-V), or (4) by the use of WordPars’ internal Persian keyboard.

The second option is useful for files including lists of words previously prepared, and the third option is handy when the word list is in an Excel spreadsheet or another format. The fourth option is mostly dedicated to non-Persian users who do not have a Persian keyboard. WordPars provides an internal Persian keyboard to enable users to enter words by pressing the buttons.

For the first tab, each word of the stimuli must be on a separate line. The list of words must therefore have an even number of lines. If the number is odd, WordPars will show a warning. In case the user does not add/delete a word, it will omit the last word. Besides, the input must contain characters only from the sets below:

Persian alphabet (32 letters),
Some letters from the Arabic alphabet, which are frequent in Persian words,
ZWNJ (i.e., zero-width non-joiner), which is equal to semi-space; for example, in the word "بی‌نظیر", which means unique,
Space,
Newline character, which is the word separator.

The set of valid characters to be inserted is thus {آ ,ة ,ي ,إ ,أ ,ء‌‌ ,آ ,ۀ ,ض ,ص ,ث ,ق ,ف ,غ ,ع ,ه ,خ ,ح ,ج ,چ ,ش ,س ,ی ,ب ,ل ,ا ,ت ,ن ,م ,ک ,گ ,ظ ,ط ,ز ,ر ,ذ ,د ,ئ ,و ,پ ,ؤ ,ژ space, ZWNJ, newline}. Users must be aware that if they insert words containing letters other than those in this list, either the letter will not show in the text box or a warning message will be shown. Consequently, if users are typing or pasting something but nothing changes, it must be that the characters are not valid. Nonetheless, as noted earlier, users must pay attention that ZWNJ has not been inserted in Worldlex corpora.
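A minimal sketch of the input rules above (valid characters, one word per line, an even number of lines) might look as follows. The names read_word_pairs, PERSIAN_LETTERS and ARABIC_EXTRAS are hypothetical, and raising an error on invalid characters is an assumption; the program itself is described as either hiding the character or showing a warning.

```python
PERSIAN_LETTERS = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"  # the 32 letters
ARABIC_EXTRAS = "آءأإؤئةيۀ"  # additional accepted forms from the set above
ZWNJ = "\u200c"              # zero-width non-joiner
VALID_CHARS = set(PERSIAN_LETTERS) | set(ARABIC_EXTRAS) | {ZWNJ, " "}

def read_word_pairs(text):
    """Split newline-separated stimuli into consecutive pairs.

    Returns (pairs, dropped), where dropped is the last word if the
    number of words was odd, and None otherwise.
    """
    words = [line for line in text.split("\n") if line]
    for word in words:
        invalid = sorted(set(word) - VALID_CHARS)
        if invalid:
            raise ValueError(f"invalid characters in {word!r}: {invalid}")
    dropped = None
    if len(words) % 2 == 1:
        dropped = words.pop()  # an odd list loses its last word
    pairs = list(zip(words[0::2], words[1::2]))
    return pairs, dropped
```

For example, a three-line input yields one pair and reports the third word as dropped, which mirrors the behavior described for an odd number of lines.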

Afterwards, users can decide which corpus they want to work with: Zaya, WorldLexAll or WorldLexStandardized. In the end, by pressing the Submit button, the list of words will be loaded.

In the second tab, WordPars will get a single word and a number as distance. This word can be typed or simply pasted from the clipboard or inserted with the help of the internal Persian keyboard. The input format would be the same as the other tab. The user can then select the orthography or phonology options.

Checking phonetics

After defining the words and pressing the submit button, another window will show up for the user to select among heteronyms ("سر" /sar/ vs. "سر" /sor/) or, if the user is working in the first tab, to insert phonetic forms for words not found in Zaya corpus. In this window, the frequencies of a word's entries (including polysemous or homonymous ones) are added together.

The checking phonetics form will be available for edits until a new input is added.

Also, a "Help me" button is provided, in which phonetic representations of Zaya corpus are shown in a table (Fig. 1).

Available statistics

By pressing the "Calculate Similarity" button, the program reports the statistics, including word frequency, orthographic length, phonological length, orthographic distance and match, and phonological distance and match of each word pair. Finally, the user can attain an overall result, i.e., the mean and standard deviation related to all distances and matches of both orthography and phonology, by pressing the "overall results" button. It is possible to analyze two items or as many items as the user requires (Figs. 2, 3, 4, and 5).

By inserting a word in the second tab of the program together with a distance and choosing among the possible homonyms in the Checking Phonetics Form, the program can perform orthographic or phonological tasks. In the case of orthography, users can decide on having words, non-words, transposed-letter words, or transposed-letter non-words; however, in the case of phonology, users have the "words" option, and words will be shown together with their frequencies.

Also, as WorldLex consists of orthographic forms of words, users can choose between two corpora (i.e., Zaya and WorldLexStandardized) while searching for orthographic neighbors.

Frequency

The frequency of words can be one of the most fundamental factors in conducting experiments in psycholinguistics, specifically the relationship between word frequency and reaction time (RT) (see, e.g., Carroll & White, 1973; Monsell et al., 1989, for early evidence). In addition, word frequency has clear electrophysiological correlates, showing that lower-frequency words are mainly in central and frontal electrode sites (Winsler et al., 2018).

In both tabs, WordPars will show the word frequency, consulting Zaya corpus and WorldLex. Although Zaya corpus provides frequencies according to each word’s part of speech (POS), POS is not shown in this program; instead, the frequencies of homonyms are summed. In addition, frequencies based on WorldLex include Twitter, blogs, and subtlex. WorldLex contains two files: the first one with frequencies of all the strings (WorldLexAll) and the second one with only the strings that have spell-check validation (WorldLexStandardized). Users can choose whichever one they prefer. Selecting Zaya corpus enables users to see just one frequency, while selecting either of the WorldLex corpora offers three frequencies. In the first tab, WordPars enables users to work with words not existing in the corpora and even non-words; a zero frequency will be assigned to words or non-words entered by the user that do not exist in the selected corpus. In the second tab, if the word entered by the user does not exist in the selected corpus, orthographic neighbors will still be generated, and a frequency of -1 will be shown for that word. Meanwhile, as it is not feasible to have the phonological form of a word not present in Zaya, WordPars will not work for these words in the phonological neighbors part. Moreover, as WorldLex does not contain phonological forms, it cannot support separate frequencies for homonyms. It will only show the first entry in the corpus (e.g., if the user enters the word ‘در’, WordPars will not ask them which ‘در’ they mean, because it cannot provide phonological forms of the word for the user to choose from), and the frequency it shows is the sum of all homographs.
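The summing of frequencies over homographs, and the placeholder value for words missing from the selected corpus, can be illustrated with a short sketch. The entry layout (orthographic form, phonological form, frequency), the function names, and the frequencies in the toy data are all assumptions made for illustration; they are not taken from Zaya.

```python
from collections import defaultdict

def build_frequency_index(entries):
    """Sum frequencies over all entries that share an orthographic form."""
    index = defaultdict(int)
    for orth, _phon, freq in entries:
        index[orth] += freq
    return dict(index)

def lookup_frequency(index, word, missing=0):
    """Return the summed frequency, or a placeholder for unknown words.

    The text describes 0 as the placeholder in the first tab and -1 in
    the second tab, so the placeholder is left as a parameter.
    """
    return index.get(word, missing)

# Toy entries with made-up frequencies for the two readings of "در".
toy_entries = [("در", "dar", 100), ("در", "dor", 7), ("باد", "bAd", 50)]
toy_index = build_frequency_index(toy_entries)
```

Here the two readings of "در" collapse into a single summed frequency, while an unknown string returns whichever placeholder the caller passes.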

Orthographic statistics

When given at least one pair of words, WordPars calculates their orthographic similarity. Likewise, users can paste/type a list of words or use the internal Persian keyboard. Users must be aware that they should only use Persian characters. Then, by pressing the calculation button, WordPars will show the orthographic length and match/distance of each pair, an average, and the standard deviation of the matches and distances of all the pairs.
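As a rough illustration of the per-pair and overall statistics, the sketch below computes an edit distance for each pair and then the mean and standard deviation over all pairs. The function names are hypothetical, only the distance (not the match score) is covered, and the use of the sample standard deviation is an assumption, since the text does not say which variant the program reports.

```python
from statistics import mean, stdev

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # match or substitution
        previous = current
    return previous[-1]

def summarize_pairs(pairs):
    """Return per-pair distances plus their mean and standard deviation."""
    distances = [levenshtein(a, b) for a, b in pairs]
    spread = stdev(distances) if len(distances) > 1 else 0.0
    return distances, mean(distances), spread
```

Orthographic length in this sketch would simply be len(word) for each member of a pair.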

Phonological statistics

WordPars will compute phonological similarity for a list of words containing at least two words. Even if a non-word is entered, the similarity will be calculated, although it is up to the user to define the phonetics for words not existing in Zaya corpus. If the user does not fill the phonetic form, WordPars will not count that pair in the list of words and will omit it from the list. Finally, an overall average and standard deviation of all the similarities will be calculated, as well.

Orthographic neighborhood

Nowadays, most psycholinguistic studies require computing orthographic neighborhood (see, for example, Barnhart & Goldinger, 2015). There are different methods to form neighbors: deletion, addition, and substitution (Marian et al., 2012). In the first method, N (the distance) letters will be omitted from the word, resulting in a new word. For example, assuming the word "ساکن" and the distance of 1, by omitting the last letter, we will have "ساک". In the second method, N letters will be added to the word to form a new word. For example, with a distance of 2, "ساکن" will become "مساکین". In the last method, N letters in the word will be replaced by other letters in the alphabet. For instance, considering a distance of 3, "ساکن" will be changed to "آنتن". Users can choose between these methods for producing neighbors when using WordPars.

We have mainly focused on a measure of orthographic similarity that is less restrictive than Coltheart's orthographic neighborhood size metric (ON), namely the Levenshtein distance, on which the orthographic Levenshtein distance 20 (OLD20) measure of Yarkoni et al. (2008) is based. The Levenshtein distance, which is often used interchangeably with edit distance, is the minimum number of changes that need to be made to transform word A into word B. The changes that can take place are the removal, insertion, or substitution of a character in the string.

A word's orthographic neighborhood refers to the set of words that share all but one letter with that word (Coltheart et al., 1977). In the first tab, users can select from two algorithms: Levenshtein and Damerau–Levenshtein distance (Coltheart et al., 1977; Hosangadi, 2012; Yarkoni et al., 2008). These two algorithms differ slightly from each other: in addition to deletion, insertion, and substitution, Damerau–Levenshtein also allows the transposition of two adjacent characters. Normally, the two yield nearly equal statistics.
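The difference between the two distances can be made concrete with a short sketch. The second function implements the restricted variant of Damerau–Levenshtein (often called optimal string alignment), which is an assumption: the text does not say which variant WordPars uses. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with insertion, deletion and substitution."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[-1][-1]

def osa_distance(a, b):
    """Same as above, plus transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # An adjacent swap counts as a single change.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For the pair "ساکن" and "اسکن", whose first two letters are swapped, the plain Levenshtein distance is 2, whereas the variant with transpositions gives 1; for pairs without adjacent swaps the two values coincide.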

In the second tab, the user can enter or paste a word from the clipboard or type it using the internal Persian keyboard. Then a distance must be selected, with the maximum value being one unit less than the word's length. For ease of use, we have set the maximum value of the related box to this amount. After setting these two values and pressing the "Find orthographic neighbors" button, WordPars will provide a list of the neighbors of the given word, each on a separate line, with their frequencies taken from either Zaya or WorldLex, depending on the selected corpus. In this tab, we use the Levenshtein algorithm for words. For example, assuming the word "ساکن" and the distance of 1 with Zaya corpus, the list shown in Fig. 6 will emerge.
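A neighbor search of this kind can be sketched as a scan over a word-to-frequency table. The function name and the dictionary layout are hypothetical, and returning only words at exactly the requested distance is an assumption. The toy lexicon reuses the five neighbors and frequencies that this paper reports for "ساکن" in Zaya; the extra entry and its frequency are made up.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def orthographic_neighbors(word, lexicon, distance=1):
    """Return {neighbor: frequency} for entries at exactly `distance`."""
    if not 1 <= distance <= len(word) - 1:
        raise ValueError("distance must be between 1 and len(word) - 1")
    return {candidate: freq for candidate, freq in lexicon.items()
            if levenshtein(word, candidate) == distance}

toy_lexicon = {
    "مساکن": 15, "ساک": 12, "ساکت": 105, "سالن": 1020, "سان": 42,
    "آنتن": 7,  # made-up frequency, included as a non-neighbor at distance 1
}
```

A linear scan like this is simple but slow on a full corpus, which is in line with the remark later in the paper that neighbor searches take some time.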

Users can check boxes for insertion, deletion and substitution in the second tab. One of the other merits of WordPars is that the user can see which process has been used. For example, if the user enters the word ‘ساکن’ with a distance of 1 and searches for orthographic neighbors in Zaya corpus, WordPars will yield five results: ‘مساکن (mmmmi) 15,ساک (dmmm) 12,ساکت (smmm) 105,سالن (msmm) 1020,سان (mdmm) 42’. As can be seen, the first item is the word in Persian, the number is its frequency, and the string in parentheses defines the process by which each of these five words can be generated from the original word (‘ساکن’). Assuming the first neighbor (‘مساکن’), we can see that only one letter has been inserted at the beginning of the word, so we have an ‘i’, which shows ‘insertion’, and four ‘m’s, which show ‘match’. There are four matches because the original word and the neighbor have four letters in common. Assuming the second neighbor, which is (‘ساک’), one can see that only the last letter has been omitted, so the process would contain a ‘d’ for ‘deletion’ and other letters would be in ‘match’ with the original word. For the third neighbor (‘ساکت’), we see that the last letter has been changed from (‘ن’) to (‘ت’), so the process has three matches and one ‘s’, showing the ‘substitution’. Users should be aware that since Persian words are written from right to left, the process has been indicated in the same direction.

In WordPars, one can choose to see only neighbors that are the result of one, two or all of the processes of deletion, insertion and substitution with the original word. Also, if the neighbor is in a transposed letter relationship with the original word, WordPars will show the label of “TL” for “TransposedLetter”.
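The process labels (m, s, i, d) can be recovered from the Levenshtein table by tracing back one optimal path. The sketch below is one possible way to do this, not necessarily how WordPars does it, and the function name is hypothetical. It returns the labels in logical (first letter to last letter) order; reversing the string gives the right-to-left display used in the examples above.

```python
def edit_script(source, target):
    """Return one optimal sequence of m/s/i/d labels turning source into target."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                ops.append("m" if cost == 0 else "s")  # match or substitution
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("d")  # a letter of the source was deleted
            i -= 1
            continue
        ops.append("i")      # a letter of the target was inserted
        j -= 1
    return "".join(reversed(ops))
```

For instance, edit_script("ساکن", "ساکت") gives "mmms" in logical order, and its reverse, "smmm", matches the label shown for that neighbor above. When several optimal paths exist, this sketch returns only one of them.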

Phonological neighborhood

The other aspect of lexical processing that has received significant attention in recent decades is the role of phonological neighborhood in psycholinguistic studies. The phonological neighborhood of a word can be defined as the set of words that differ from it by a single phoneme (Luce & Pisoni, 1987). Words reside in dense or sparse neighborhoods depending on how many phonologically similar words they have (Luce & Pisoni, 1987). To better illustrate the point at hand, take the word "باد", which means ‘wind’; this word has many phonetically similar words (e.g. شاد، داد، ماد، باغ, etc.), whereas a word such as "صبح", which means ‘morning’, has a sparse neighborhood.

Two words are phonological neighbors when they differ by a phoneme. Words with many neighbors are processed more rapidly in comparison with those with few neighbors (Yates, 2005). In this tool, phonological processing is based on the Levenshtein algorithm. The user can enter or paste a word from the clipboard in the second tab or via the internal Persian keyboard. Afterwards, the user defines a distance with a maximum value of one unit less than the length of the word’s phonological form. As the user might not know this maximum value restriction, we have set a limit on the combo box for the phonological distance so that it can never exceed the maximum possible number.

In this case, users have only one choice, i.e., the neighbor words. We have not placed a non-word neighbor option because, in Persian, the orthographic form of a word cannot be assigned by its phonological form due to the vague and inconsistent orthographic system of Persian compared to the spoken format of the language; for instance, the phonological form /saba/ can be written as "سبا" or "صبا", which are two words with different meanings.

After selecting a distance and entering the word, WordPars will ask for checking the phonetics to know if typing "سر" is intended to be read as /sar/, /sor/ or /ser/. After the user has approved their choice, WordPars will show a list of neighbor words and their frequencies. Since WordPars must first find the given word in Zaya corpus, check all the phonological forms in terms of distance, and add the orthographic form of the word to the list, this process will take a bit of time. We have placed a "Stop" button with which users can stop the process whenever they want.
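The two steps described above, choosing among the pronunciations of a written form and then scanning the corpus for phonological neighbors, can be sketched as follows. Everything about the data layout is an assumption: entries are taken to be (orthographic form, phoneme tuple, frequency), the phoneme symbols are illustrative rather than Zaya's own transcription, and all frequencies are made up.

```python
def sequence_distance(a, b):
    """Levenshtein distance over sequences of phonemes."""
    previous = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (pa != pb)))
        previous = current
    return previous[-1]

def pronunciations(orth, lexicon):
    """List the phonological forms stored for one written form."""
    return [phon for o, phon, _freq in lexicon if o == orth]

def phonological_neighbors(phonemes, lexicon, distance=1):
    """Return (orthographic form, frequency) of entries at exactly `distance`."""
    return [(o, freq) for o, phon, freq in lexicon
            if sequence_distance(phonemes, phon) == distance]

toy_lexicon = [
    ("باد", ("b", "ɑ", "d"), 120),
    ("شاد", ("ʃ", "ɑ", "d"), 80),
    ("داد", ("d", "ɑ", "d"), 60),
    ("باغ", ("b", "ɑ", "ɢ"), 90),
    ("صبح", ("s", "o", "b", "h"), 70),
    ("سر", ("s", "a", "r"), 300),
    ("سر", ("s", "e", "r"), 20),
]
```

Storing phonemes as tuples rather than strings keeps multi-character symbols such as affricates as single units when distances are computed.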

Transposed-letter effects

Research on transposed-letter effects is receiving growing attention, mainly with regard to word recognition and reading (see, e.g., Mousikou et al., 2015). In transposed-letter neighbors, two words share the same letters, but two letters are swapped. A mis-spelled word resulting from letter transposition can be inconsistent with its base word (e.g., Perea & Lupker, 2003). The priming effect can be more visible in a word with two letters that are swapped than a word with letters that are replaced by different letters (Perea & Lupker, 2003).

WordPars, using either Zaya or WorldLexStandardized, can handle the option of transposed-letter words and non-words in the second tab, in the "orthographic details" section. It seems that this program is the first tool to generate transposed-letter words and non-words in Persian. Presuming the word "ساکن", selecting transposed-letter words will bring up words like "اسکن", a borrowed word in Persian meaning ‘to scan’, and choosing transposed-letter non-words will provide non-words such as "کاسن". For this part, no distance is needed.
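One way to generate transposed-letter candidates is to swap every pair of positions and then split the results by lexicon membership. This is a sketch under assumptions: the function name is hypothetical, swapping any two positions (not only adjacent ones) is inferred from the "کاسن" example, and the filtering of impossible non-words that the program applies is not reproduced.

```python
from itertools import combinations

def transposed_letter_forms(word, lexicon):
    """Return (words, non_words) obtained by swapping two letters of `word`."""
    forms = set()
    for i, j in combinations(range(len(word)), 2):
        if word[i] != word[j]:  # swapping identical letters changes nothing
            chars = list(word)
            chars[i], chars[j] = chars[j], chars[i]
            forms.add("".join(chars))
    words = sorted(f for f in forms if f in lexicon)
    non_words = sorted(f for f in forms if f not in lexicon)
    return words, non_words
```

With a lexicon that contains "اسکن", the call transposed_letter_forms("ساکن", {"اسکن"}) places "اسکن" among the words and "کاسن" among the non-words.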

Non-words

As a unit of speech or text, pseudowords could have been an actual word, but they are not, and non-words are illegal letter strings in terms of orthography and phonology (Ziegler et al., 1997). They are used in a variety of psycholinguistic studies (see, e.g., Gathercole, 2006). Non-words and pseudowords are essential in lexical decision tasks when there are some strings of letters or sounds and when the participant should decide whether the stimulus is a real word or not. Non-words are of use for studying non-word reading and word recognition (Keuleers & Brysbaert, 2010).

WordPars can generate non-words when presented with a word and a distance. Given the lack of tools to generate non-words in Persian, we did not have any hints to rely on for the present tool, and we have made a few rules for this first step so as to avoid non-words that largely violate Persian phonotactics. We hope that the researchers using this tool will provide us with their feedback so that the tool can be improved in its later versions. There have been many studies in other languages in which Wuggy (Keuleers & Brysbaert, 2010) has had a significant role. Yet, Wuggy is not the same as our tool, as its goal is to create non-words that carefully match sets of words. Because of the lack of previous research on the subject in Persian, non-words are created by substitution in WordPars, and, unlike Wuggy, we do not apply a special algorithm to match the non-words with a base word in factors such as bigram frequency.

As the Persian alphabet has 32 letters, WordPars forms non-words by substituting these letters for the letters of the word at each position (considering the distance provided); the number of candidate strings increases as the distance and/or the word's length increases. For instance, take the disyllabic word "ساکن" with a distance of three; WordPars substitutes three of the four letters each time and adds the result to a list of non-words. According to combinatorics, this substitution could generate a very large number of strings. To illustrate, each of the three places could be filled with any of the 32 letters, but as mentioned above, we have restricted some occurrences. Hence, the number of non-words that WordPars actually considers is somewhat smaller than this theoretical maximum. Also, some of these strings are themselves words, recognized by the Zaya corpus or WorldLexStandardized. Therefore, two kinds of strings are omitted from the non-word list: words existing in the corpora and impossible non-words. For example, for the word "ساکن", which means ‘being still’, and a distance of three, a possible non-word would be "سلان", whereas impossible non-words would be "سسسا", "ااین", or "سثطذ".
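
The substitution procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the WordPars code: all function names are hypothetical, "distance" is read as the number of positions changed, a letter is never replaced by itself, and the legality rule `no_triple_letters` is a stand-in because the actual WordPars restrictions are not listed in the text. Under these assumptions, a four-letter word at distance three yields at most C(4,3) × 31³ = 119,164 strings; this is the sketch's own count, not a figure taken from the text.

```python
from itertools import combinations, product
from math import comb


def upper_bound(word_length, distance, alphabet_size=32):
    """Count strings that differ from a word in exactly `distance` places.

    Assumes every letter of the word belongs to the alphabet.
    """
    return comb(word_length, distance) * (alphabet_size - 1) ** distance


def substitution_candidates(target, distance, alphabet):
    """Yield every string differing from `target` in exactly `distance` places."""
    for positions in combinations(range(len(target)), distance):
        # At each chosen position, try every letter except the original one.
        choices = [[c for c in alphabet if c != target[p]] for p in positions]
        for replacement in product(*choices):
            letters = list(target)
            for p, c in zip(positions, replacement):
                letters[p] = c
            yield "".join(letters)


def no_triple_letters(s):
    """Hypothetical legality rule: reject three identical letters in a row."""
    return all(not (a == b == c) for a, b, c in zip(s, s[1:], s[2:]))


def generate_nonwords(target, distance, alphabet, lexicon, is_legal):
    """Keep candidates that are neither attested words nor illegal strings."""
    return [s for s in substitution_candidates(target, distance, alphabet)
            if s not in lexicon and is_legal(s)]


# Toy run with a five-letter alphabet so that it finishes instantly.
target = "ساکن"
toy_alphabet = sorted(set(target + "ل"))
candidates = list(substitution_candidates(target, 3, toy_alphabet))
nonwords = generate_nonwords(target, 3, toy_alphabet, set(), no_triple_letters)
```

In the toy run, the string "سسسا" appears among the raw candidates but is filtered out by the stand-in rule. A realistic filter would encode Persian phonotactic constraints instead.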

As explained above, this process can take a long time, which is the main disadvantage of this approach. For polysyllabic words, there would be a combinatorial explosion (Keuleers & Brysbaert, 2010), leading to billions of candidate non-words. We have therefore added a "Stop" button, and WordPars also shows the strings being processed in real time. When the user stops the process, the list of words is shown after a short while.
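
One way to keep such a long-running generation interruptible is to consume the candidates lazily and check a stop flag on every step. The sketch below illustrates that general pattern only; it is not taken from WordPars, the names `stoppable_collect` and `show` are hypothetical, and it assumes that the strings collected before the stop request are the ones returned.

```python
import threading
from itertools import product


def stoppable_collect(candidates, stop_event, on_progress=None):
    """Consume `candidates` until exhausted or until `stop_event` is set."""
    collected = []
    for candidate in candidates:
        if stop_event.is_set():
            break  # stop requested: return what has been collected so far
        collected.append(candidate)
        if on_progress is not None:
            on_progress(candidate)  # e.g. refresh a progress label in a GUI
    return collected


stop = threading.Event()
seen = []


def show(candidate):
    """Record progress and simulate a click on "Stop" after five strings."""
    seen.append(candidate)
    if len(seen) == 5:
        stop.set()


# A lazy stream of 3**10 = 59,049 strings; only the first few are ever built.
stream = ("".join(p) for p in product("abc", repeat=10))
partial = stoppable_collect(stream, stop, show)
```

Because the stream is a generator, stopping early avoids building the remaining strings. In a GUI, the stop event would typically be set from the button's click handler while the collection runs in a worker thread.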

Saving the outputs

There are three ways to save outputs in WordPars: (1) users can select the content of each box, choose copy from the right-click pop-up menu, and paste it wherever necessary; (2) users can press the "Copy" button in the tabs to copy the related list to their clipboard; (3) users can press the "Save" button in each tab, and all the content of the related part will be saved to the file location that the user selects, in .txt or .tsv format (Fig. 7).

The first and second forms of saving are suitable for those working in Excel or with text files, as the desired information can be pasted in directly. This mode of copying the output will also be more suitable for those working on articles. If the user needs separate documents for different results, it will probably be more efficient to choose the third form, which generates a file named and placed in the location of the user’s choosing.
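
For readers who post-process the saved files, a tab-separated export of the kind described above can be written with a few lines of Python. This is a sketch under assumptions: the two-column layout, the header names, and the file name are hypothetical, since the exact layout of the WordPars output files is not specified here.

```python
import csv
import os
import tempfile


def save_results(rows, path, header=("item", "value")):
    """Write `rows` to `path` as UTF-8, tab-separated text with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


rows = [
    ("اسکن", "transposed-letter word"),
    ("کاسن", "transposed-letter non-word"),
]
out_path = os.path.join(tempfile.mkdtemp(), "wordpars_output.tsv")
save_results(rows, out_path)
```

Writing the file as UTF-8 keeps the Persian strings intact, and a .tsv file of this shape opens directly in spreadsheet software.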

Conclusions

To conclude, the WordPars program may be considered the first tool that can be used for psycholinguistic studies in Persian, particularly word recognition studies. This tool gives researchers access to a wide range of statistics (i.e., word frequency, orthographic and phonological length, and orthographic and phonological similarity) as well as orthographic neighbors (words and transposed-letter words) and phonological neighbor words.

In addition, it generates the possible non-words of a target word at a given distance. The program can be downloaded and used free of charge. Users can access the statistics for the default vocabulary corpora (i.e., Zaya and WorldLex) used in this tool from https://github.com/ssaneei/Wordpars.

References

Assi, M. S. (2005). PLDB Persian linguistics database pazuhešgarn (researchers). Technical report, Institute for humanities and cultural studies, Iran.
Baluch, B. (2005). Persian orthography and its relation to literacy. In R. M. Joshi & P. G. Aaron (Eds.), Handbook of orthography and literacy (pp. 365–376). Mahwah: Erlbaum.
Barnhart, A. S., & Goldinger, S. D. (2015). Orthographic and phonological neighborhood effects in handwritten word perception. Psychonomic Bulletin & Review, 22(6), 1739–1745. https://doi.org/10.3758/s13423-015-0846-z
Bijankhan, M. (2004). The role of the corpus in writing a grammar: An introduction to software. Iranian Journal of Linguistics, 19(2), 48–67.
Boudelaa, S., Perea, M., & Carreiras, M. (2020). Matrices of the frequency and similarity of Arabic letters and allographs. Behavior Research Methods, 52, 1893–1905. https://doi.org/10.3758/s13428-020-01353-z
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27(1), 45–50. https://doi.org/10.1177/0963721417727521
Carroll, J. B., & White, M. N. (1973). Word frequency and age of acquisition as determiners of picture-naming latency. The Quarterly Journal of Experimental Psychology, 25(1), 85–95. https://doi.org/10.1080/14640747308400325
Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance VI (pp. 535–555). Erlbaum.
Davis, C. J. (2005). N-watch: A program for deriving neighborhood size and other psycholinguistic statistics. Behavior Research Methods, 37(1), 65–70. https://doi.org/10.3758/BF03206399
Davis, C. J., & Perea, M. (2005). BuscaPalabras: A program for deriving orthographic and phonological neighborhood statistics and other psycholinguistic indices in Spanish. Behavior Research Methods, 37(4), 665–671. https://doi.org/10.3758/BF03192738
Eslami, M., SharifiAtashgah, M., AlizadehLamjiri, S., & Zandi, T. (2004). Zaya Words in Persian. The First Research Workshop in Persian and Computer. Tehran: Iran.
Gathercole, S. E. (2006). Nonword repetition and word learning: The nature of the relationship. Applied Psycholinguistics, 27(4), 513. https://doi.org/10.1017/S0142716406060383
Gimenes, M., & New, B. (2016). Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48(3), 963–972. https://doi.org/10.3758/s13428-015-0621-0
Gutiérrez-Sigut, E., Marcet, A., & Perea, M. (2019). Tracking the time course of letter visual-similarity effects during word recognition: A masked priming ERP investigation. Cognitive, Affective, and Behavioral Neuroscience, 19(4), 966–984. https://doi.org/10.3758/s13415-019-00696-1
Hosangadi, S. (2012). Distance measures for sequences. arXiv preprint arXiv:1208.5713.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633. https://doi.org/10.3758/BRM.42.3.627
Luce, P. A., & Large, N. R. (2001). Phonotactics, density, and entropy in spoken word recognition. Language and Cognitive Processes, 16(5–6), 565–581. https://doi.org/10.1080/01690960143000137
Luce, P. A., & Pisoni, D. B. (1987). Speech perception: Recent trends in research, theory, and applications. Human communication and its disorders. Norwood, NJ: Ablex. https://doi.org/10.1121/1.392451
Marian, V., Bartolotti, J., Chabal, S., & Shook, A. (2012). CLEARPOND: Cross-linguistic easy-access resource for phonological and orthographic neighborhood densities. PLoS One, 7(8), e43230. https://doi.org/10.1371/journal.pone.0043230
Modarresi Ghavami, G. (2018). Sound System. In A. Sedighi, & P. Shabani-Jadidi (Eds.). The Oxford handbook of Persian linguistics (pp. 91–111). Oxford University Press.
Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word recognition tasks: Where are they? Journal of Experimental Psychology: General, 118(1), 43. https://doi.org/10.1037/0096-3445.118.1.43
Mousikou, P., Kinoshita, S., Wu, S., & Norris, D. (2015). Transposed-letter priming effects in reading aloud words and nonwords. Psychonomic Bulletin & Review, 22(5), 1437–1442. https://doi.org/10.3758/s13423-015-0806-7
Perea, M., & Lupker, S. J. (2003). Does judge activate COURT? Transposed-letter similarity effects in masked associative priming. Memory & Cognition, 31(6), 829–841. https://doi.org/10.3758/BF03196438
Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610–615. https://doi.org/10.3758/BF03193893
Perea, M., Abu Mallouh, R., Mohammed, A., Khalifa, B., & Carreiras, M. (2018). Does visual letter similarity modulate masked form priming in young readers of Arabic? Journal of Experimental Child Psychology, 169, 110–117. https://doi.org/10.1016/j.jecp.2017.12.004
Stamer, M. K., & Vitevitch, M. S. (2012). Phonological similarity influences word learning in adults learning Spanish as a foreign language. Bilingualism (Cambridge, England), 15(3), 490. https://doi.org/10.1017/S1366728911000216
Storkel, H. L. (2004). Methods for minimizing the confounding effects of word length in the analysis of phonotactic probability and neighborhood density. Journal of Speech, Language, and Hearing Research: JSLHR, 47(6), 1454–1468. https://doi.org/10.1044/1092-4388(2004/108)
Storkel, H. L., Armbrüster, J., & Hogan, T. P. (2006). Differentiating phonotactic probability and neighborhood density in adult word learning. Journal of Speech, Language, and Hearing Research: JSLHR, 49(6), 1175–1192. https://doi.org/10.1044/1092-4388(2006/085)
Wiley, R. W., Wilson, C., & Rapp, B. C. (2016). The effects of alphabet and expertise on letter perception. Journal of Experimental Psychology: Human Perception and Performance, 42, 1186–1203. https://doi.org/10.1037/xhp0000213
Windfuhr, G. L. (2009). Persian. In B. Comrie (Ed.), The world’s major languages (pp. 445–459). Routledge.
Winsler, K., Midgley, K. J., Grainger, J., & Holcomb, P. J. (2018). An electrophysiological megastudy of spoken word recognition. Language, Cognition and Neuroscience, 33(8), 1063–1082. https://doi.org/10.1080/23273798.2018.1455985
Yakup, M., Abliz, W., Sereno, J., & Perea, M. (2015). Extending models of visual-word recognition to semicursive scripts: Evidence from masked priming in Uyghur. Journal of Experimental Psychology: Human Perception and Performance, 41(6), 1553. https://doi.org/10.1037/xhp0000143
Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin & Review, 15(5), 971–979. https://doi.org/10.3758/PBR.15.5.971
Yates, M. (2005). Phonological neighbors speed visual word processing: Evidence from multiple tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1385–1397. https://doi.org/10.1037/0278-7393.31.6.1385
Ziegler, J. C., Besson, M., Jacobs, A. M., Nazir, T. A., & Carr, T. H. (1997). Word, pseudoword, and nonword processing: A multitask comparison using event-related brain potentials. Journal of Cognitive Neuroscience, 9(6), 758–775. https://doi.org/10.1162/jocn.1997.9.6.758

Author information

Authors and Affiliations

Department of Linguistics, Alzahra University, Tehran, Iran
Elmira Esmaeelpour & Mandana Nourbakhsh
Language and Linguistics Center, Sharif University of Technology, Tehran, Iran
Sarah Saneei

Corresponding author

Correspondence to Sarah Saneei.

Ethics declarations

We have no conflicts of interest to disclose.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Esmaeelpour, E., Saneei, S. & Nourbakhsh, M. WordPars: A tool for orthographic and phonological neighborhood and other psycholinguistic statistics in Persian. Behav Res 54, 1902–1911 (2022). https://doi.org/10.3758/s13428-021-01712-4

Accepted: 15 September 2021
Published: 09 November 2021
Issue Date: August 2022
DOI: https://doi.org/10.3758/s13428-021-01712-4
