1 Introduction

The Internet and specifically the World Wide Web offer access from around the world to a huge pool of data. However, accessing data does not necessarily equate with accessing information. In order to have access to information, a certain capacity to interpret and understand the data is required. A myriad of factors aid or inhibit access to information, with language being one of the most important, obvious and visible ones. An information-rich text is useless if the reader sees only letters or signs because she cannot understand the language.

Statistics show that the majority of data on the Web is provided in English, while the majority of users are not native speakers of English. The reasons for the dominance of English on the Web are manifold; they include the origins of the Internet/Web and the role of English as an international lingua franca. This remains true even today, when Internet growth is fastest in Asia, the Middle East and Africa (Internet Coaching Library 2008a; see Footnote 1).

Nonetheless, even with shifting language dominance on the Web it remains a fact that information can be accessed either in a native (L1) or in a non-native language (L2)––or not accessed at all, when the necessary language knowledge is lacking. This raises the questions of what impact this type of language divide has on Internet users and what can be done to overcome this divide.

Initial research on L2 information access has shown that non-native speakers may indeed differ from native speakers in how they prefer to search for information. For example, they differ in their use of search options (Kralisch and Berendt 2004, 2005). Consequently, it can be expected that neglecting existing differences between native and non-native speakers within Web standards and best practices considerably limits access to information for speakers of less-dominant languages.

Assuring equal information access across languages can follow two approaches: one is to enhance translation and multi-language retrieval tools, the other is to investigate the impact of language on Web search in terms of user behaviour and attitudes, and to derive consequences for Web standards and practices. In our study we focus on the second approach.

Our main objective is to investigate how data distribution across languages on the Web affects how Web users access information. For the purpose of our study, the term data distribution across languages can be approximated with the number of Web hosts per language. A number of questions arise: when do L2 users access a non-native language Web site? Which language-related factors determine whether non-native or native speakers search for information on a certain Web site? We will investigate these questions by measuring Web (language) content and (access and search) behaviour. Beyond that, another question arises: does language-based data distribution also affect attitudinal outcomes of information seeking?

Answers to these questions are helpful for three reasons. The first is scientific: an increase in knowledge about Web users. The second is commercial: since site translation and adaptations often represent important investment decisions, knowledge about users’ language-related decision processes is valuable for appropriate linguistic adaptations. The third reason is ethical: the goal of increasing participation in Internet communication.

The article is structured as follows: after an overview of the conceptual framework and related work in Sect. 2, Sect. 3 describes data and empirical studies on behaviour. Section 4 complements this by studies on attitudes. Section 5 concludes with an outlook.

2 Conceptual framework and literature overview

There is a large body of literature on language, but only a few studies are dedicated specifically to language and its impacts on the Web and its usage. We therefore need a conceptual framework to select relevant prior work. In the introduction of this section, we outline a framework for this purpose in terms of three perspectives 1–3. Literature on 1–3 is then described in Sects. 2.1–2.3. Section 2.4 summarises the introduced constructs and gives an overview of how they are operationalised in our empirical studies in Sects. 3 and 4.

Many decisions concerning Web content production and use are made on an individual basis (1), but many also emerge from the interaction of various individuals’ attitudes and behaviours (2). For example, while the individual user chooses whether or not to buy a product in a given online store, she will only be able to do so if that store offers an interface in a language she understands and if she has found it––in her favourite search engine, via a link on another site she used, or as a personal recommendation. Out of all these conditions, only the first may be a purely personal decision of that user––the remaining ones also depend on choices made by other agents on the Web: the choice to create a specific interface, the choice to index and/or link to a Web site, or to talk about it. In addition (3), the Web (in whatever language) is not just “some” store of content but a software/information system.

With perspective (1), we describe the basics of individuals’ language-related perceptions of value and costs on the Web, and we derive the expectation of a general preference of L1 over L2 processing. This perspective is informed by cognitive science/linguistics.

Perspective (2) starts from the observation that information offered in a given language constitutes a supply of (information in that) language. This is met by demand for information from Web users. Given the cognitive preference for certain languages derived in (1), this demand becomes demand for (information in) that language. The Web may be regarded as a market where this supply and this demand meet, and we employ basic economic reasoning to derive further expectations concerning individuals’ value perceptions on the Web.

Finally, using (3), we combine these considerations to emphasize the close relationship between individual usage and acceptance of information systems on the one hand and market processes on the other hand (see Footnote 2).

2.1 Cognition: language and information-seeking behaviour

Despite the huge amount of information resources on the Web, it is not unusual to experience moments when one stops seeking information because any further search “is not worth the effort”, even if previous searches were unsuccessful.

A theoretical foundation of the link between cognitive burden (i.e. effort) and behavioural reactions is provided by the Information Foraging Theory (Pirolli and Card 1995, 1999). Pirolli and Card’s model aims to explain user strategies for seeking, gathering and consuming information on the Web as a result of the effort involved. The theory posits that users continue to follow links as long as the information gained from following the link is not exceeded by the costs of accessing it, where costs are determined by time and cognitive effort. Thus, an examination of the impact of language on link-following behaviour requires an analysis of the potential additional navigation costs and values.
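
The stopping rule described above can be illustrated with a minimal sketch. The function name, the numeric scale for gains and costs, and the language cost factor are illustrative assumptions; they are not part of Pirolli and Card's formal model.

```python
from typing import Iterable, Tuple


def links_followed(links: Iterable[Tuple[float, float]],
                   language_cost_factor: float = 1.0) -> int:
    """Count how many links a user follows before giving up.

    Each link is an (expected_gain, base_cost) pair on an arbitrary common
    scale. The user keeps following links as long as the expected gain is
    not exceeded by the cost. The cost is scaled by a hypothetical language
    factor (1.0 for L1 processing, greater than 1.0 for L2 processing).
    """
    followed = 0
    for expected_gain, base_cost in links:
        cost = base_cost * language_cost_factor
        if cost > expected_gain:
            break  # further search "is not worth the effort"
        followed += 1
    return followed


# Hypothetical trail of links with diminishing gains and a constant base cost.
trail = [(5.0, 2.0), (4.0, 2.0), (3.0, 2.0), (2.5, 2.0), (1.0, 2.0)]
l1_depth = links_followed(trail, language_cost_factor=1.0)
l2_depth = links_followed(trail, language_cost_factor=1.5)
```

With these made-up values, the higher processing cost makes the L2 user abandon the trail one link earlier than the L1 user, which is the kind of additional navigation cost the analysis has to account for.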

The Revised-Hierarchy Model by Dufour and Kroll (1995) is a representation of language mechanisms in bilinguals, which investigates the cognitive costs of processing in L1 and L2. It is often used in cross-linguistic market research (e.g., Luna et al. 2002, 2003). The higher cognitive burden for non-native speakers is explained in the model with the mechanisms of how second languages are acquired and stored. The model also emphasizes that the higher cognitive costs remain, even after the individual has become fluent in both languages. Hence, costs of L2 information processing are higher than those of L1 information processing. In the same manner, L2 information processing at a lower proficiency level requires more cognitive effort than L2 information processing at a higher proficiency level (for examples, see Hahne 2001; Steffenson et al. 1979).

In sum, we expect that language skills as one determinant of cognitive effort affect search behaviour on the World Wide Web: other things being equal, native speakers of a Web site’s languages are more likely to access this particular Web site as they experience lower costs.

Our earlier results (Kralisch and Berendt 2004, 2005) suggest that the ratio of L1 users accessing a Web site considerably exceeds that of L2 users: a logfile study of a large and heavily frequented site offered in five of the largest languages on the Web (www.dermis.net) showed that 83.2% were sessions of users who are native speakers of one of the languages offered on the site at that time (“L1 users”), while 16.8% were sessions of non-native speakers (“L2 users”). The share of L1 users was not only significantly larger than that of L2 users; it was also 50% higher than the corresponding proportion among all Internet users.

Furthermore, we applied the concept of linguistically determined foraging to the use of search options. Similar to link-following behaviour, the decision of whether or not to use a certain search option represents a trade-off between inherent benefits of the search option, in terms of time or information quality, and the cognitive effort involved. For instance, a lower cognitive effort of search option A (e.g. picture-supported search) does not automatically mean that every user prefers A over B (e.g. a search engine) if search option B offers better features than A, such as faster access or more comprehensive information. To the best of our knowledge, language-related aspects have not been considered in other studies of Information Foraging.

In line with the Information Foraging Theory, all of these findings suggest that language-based cognitive burden is a determinant of Web-site access and therefore influences information seeking.

2.2 Economics: information and information flow on the Web

Conclusions on the impact of language-related cognitive burden face a problem: the lack of control for a potential bias of the hyperlink distribution. We argue that cognitive burden is only one factor in the relationship between language and who accesses a site. We suggest that accessing a site is not only based on users’ capacities to access the site, but also based on their opportunities to access it. Language is a determinant of these opportunities as it affects the structure of the World Wide Web, in terms of the number of Web hosts per language and in terms of how Web sites are linked among each other. In other words, language has an impact on how information flows on the Web.

Information can be available in a Web site (supply) and desired by a user (demand), and ideally an information flow will couple these two. With regard to the World Wide Web, the term information flow is generally used to describe hyperlink distribution, i.e. how hyperlinks are set to link various sources of information. However, this distribution only describes an intended flow of information (and in this sense supply). The actual flow of information arises from the ways in which people follow these links and gather information (and in this sense represents demand). In this subsection, we elaborate on expectations concerning supply and demand; in Sects. 3 and 4, we suggest measures and compare values of intended and actual information flow.

By referring to supply and demand, we take an economic approach to language questions in this subsection. Economic approaches to language have been developed and tested in various studies outside of the WWW, for example with respect to the choice of language when doing business (e.g., Grin 1994a). For surveys of the economics of language, we refer the reader to Vaillancourt (1985) and Grin (1994b, 1996).

2.2.1 Amount of content per language

The relationship between the number of (potential) Internet users speaking a certain language (demand) and the number of Web hosts in that language (supply) can be founded on two arguments. The first line of reasoning is based on a market perspective where the number of Web hosts in a language follows the number of potential customers/visitors. Due to simple mechanisms of supply and demand, a higher number of potential customers should attract a higher number of suppliers. The second argument is that a higher number of (native) speakers increases the number of people who are able to create a Web site in that particular language. Both arguments lead one to expect that the number of sites per language should be higher for languages with many speakers than for those with few speakers.

However, the relationship between the number of speakers and the number of Web hosts is not necessarily straightforward. Despite a lack of empirical research, it can be expected that the number of users and the number of Web sites are not directly proportional, due to scale, network and other effects. Scale effects predict that the costs per produced unit decrease with each additional unit. The fact that the value of a network increases more than by n (namely by n² − n according to Metcalfe’s law) with n additional members is a network effect (Metcalfe 1995; see also Odlyzko and Tilly 2005). Different wealth and heterogeneous education levels as well as discriminatory marketing goals (Grin 1994b; see Footnote 3) represent further influencing factors. In summary, therefore, supply and demand may not be related proportionally.
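
The network effect can be made concrete with a small calculation. The sketch assumes the reading of Metcalfe's law in which the value of a network with n members is counted as the n² − n ordered pairs of distinct members; the function names are illustrative.

```python
def metcalfe_value(n: int) -> int:
    """Number of ordered pairs of distinct members in a network of size n."""
    if n < 0:
        raise ValueError("network size must be non-negative")
    return n * n - n


def marginal_value(n: int) -> int:
    """Value added when a network of size n gains one more member."""
    return metcalfe_value(n + 1) - metcalfe_value(n)


# Doubling the membership more than doubles the value under this reading.
small = metcalfe_value(10)
large = metcalfe_value(20)
```

Under this assumption each new member adds 2n to the value, so the marginal value grows with network size rather than staying constant, which is why supply need not scale proportionally with the number of speakers.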

2.2.2 Language-related links between web sites

Web sites are mutually dependent entities that constitute a system (Park and Thelwall 2003). Language can be expected to not only influence the total amount of information available to Web users, but also how information sources (i.e. how many and which Web sites) are linked among each other and therefore how easy/likely it is to find and access a certain Web site.

Bharat et al. (2001) and Halavais (2000) studied the role of geographic borders and language affiliation in link-setting behaviour. It was shown that the number of links within a country domain is generally much higher than towards any other country domain. For example, about 90 percent of the hyperlinks on U.S. websites link to other American sites. In Europe, between 60 and 70 percent of the hyperlinks are directed to other national websites. Among the remaining hyperlinks, 70 percent link from Europe to U.S. Web sites. Interestingly, results also revealed that strong geographical connections were sometimes overridden by language affiliation (e.g. Brazil–Portugal) (Halavais 2000).

The studies by Bharat et al. and Halavais are first indicators of the potential impact of language: Web sites in different languages are less connected than sites in the same language. However, these studies analysed data aggregated on the national level and therefore provided limited insight into the role of language.

2.3 Information systems and technology acceptance: attitudes

The aforementioned relationships between language and access to information concern behavioural aspects of information seeking. To complement these investigations, we suggest adding an investigation of attitudinal outcomes of the role of language in information seeking on the Web.

A theoretical basis of linking technology with behaviour and attitudes is provided by Davis’ (1993) Technology Acceptance Model (TAM). This model has been used and validated in numerous follow-up studies (for a review, see Lee et al. 2003). The results have shown that “usefulness” and “ease of use” of a technology system are significant determinants of the acceptance of that system in terms of attitudes towards it and subsequent system use.

We expect an analogous influence of language-related perceptions of usefulness and ease of use on attitudes. Language-related cognitive effort has a crucial impact on the perceived ease of use. Therefore, the costs of language processing during information seeking can be expected to affect attitudes in Web search.

Evidence of a negative relationship between cognitive effort and satisfaction has been found in consumer research and analyses of users of search or decision support tools (e.g., Bechwati and Xia 2003; Branting 2004).

In addition to language proficiency, a user’s perception of how much information is offered in his/her native language is taken into account as a second linguistic variable. According to basic economic theories, a product’s usefulness and therefore value decreases with an increase in the amount of it offered on the market (Menger 1871). Transferring this assertion to language issues and the Web, we expect that the language-related value of information decreases as more information is offered in that language on the Web. In addition, value perceptions are also determined by topic; thus a large amount of content on a topic in a native language may also reduce the value of content on that topic in other languages.

2.4 Constructs and operationalisations

In the following Sects. 3 and 4, we describe data analyses and empirical studies that examined the consequences of these observations on language-related behaviour and attitudes. An overview is given in Table 1.

Table 1 Constructs from the literature (col.2) introduced in the literature overview (col.1) are operationalised, in Section (col.3), as measures (col.4)

3 Language and behaviour

In an ideal world, all Web users would be able to access content regardless of language, empowered by ubiquitous high-quality language tools. Failing this, in a “linguistically equitable world”, all users would have access to content in their own language to an extent that is representative of this language group’s size. This raises two questions: (a) Is the amount of available content representative of the size of language groups? (b) Do L2 users not use sites because they consider them too difficult to use, or because they do not reach them in the first place?

These questions translate into questions about behaviour focussed on content creation and link setting.

  • Assuming that each set of Web users (who are also potential content creators) with a given native language constitutes a market, is content produced proportional to that market’s size?

  • Do Web authors link to the available content in that language proportional to the amount of that content?

In Sects. 3.1 and 3.2, we show evidence of behaviour that disadvantages languages that are not English. A third question follows from these:

  • Do Web users accept the information offered in their native language? I.e., (how) does link-setting behaviour influence link-following behaviour?

In Sect. 3.3, we show evidence of a third behavioural effect that further diminishes the multilinguality of the Web.

3.1 Behaviour I: content-creation behaviour leads to an under-representation of non-English languages

3.1.1 Hypotheses

In a “linguistically equitable Web”, all users would face a supply of content in their own language to an extent that reflects this language group’s size. Conversely, if a language (e.g., English) dominates, the relations will differ.

3.1.2 Method

3.1.2.1 Measures

If the first hypothesis held, the following ratio would be expected to be 1 regardless of the size of the language group:

$$ \frac{\text{Percentage of Web hosts with main language } L}{\text{Percentage of Internet users whose L1 is } L} $$

3.1.2.2 Data

We analysed data for 2005, the last year for which global statistics of the number of hosts by language were available, and concentrate on the seven largest linguistic user groups (Global Reach 2005).
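
The ratio defined in Sect. 3.1.2.1 can be computed as in the following minimal sketch. The function name and all share values are hypothetical and serve only to show the calculation; they are not the figures reported in Table 2.

```python
def representation_ratio(host_share_pct: float, user_share_pct: float) -> float:
    """Percentage of Web hosts in a language divided by the percentage of
    Internet users whose L1 is that language.

    A value of 1 indicates proportional supply, a value below 1 indicates
    under-representation, and a value above 1 over-representation.
    """
    if user_share_pct <= 0:
        raise ValueError("user share must be positive")
    return host_share_pct / user_share_pct


# Hypothetical (host share, user share) pairs in percent, NOT data from Table 2.
shares = {
    "language A": (60.0, 30.0),
    "language B": (5.0, 10.0),
    "language C": (8.0, 8.0),
}
ratios = {lang: representation_ratio(h, u) for lang, (h, u) in shares.items()}
under_represented = sorted(lang for lang, r in ratios.items() if r < 1.0)
```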

3.1.3 Results

Table 2 shows that the number of Web hosts is distributed highly unevenly. All non-English languages have values clearly below 1.

Table 2 Languages: users and Web hosts

3.1.4 Discussion

The results indicate that non-English languages are under-represented on the Web in terms of the amount of content supplied.

While the Web has grown since these data were collected, 2008 figures show that the shares of these user groups have remained quite stable (Internet Coaching Library 2005, 2008b): the global user base grew from below 900 million users to 1.26 billion users. Proportions of users of the seven largest linguistic user groups changed as follows: English (32% → 30%); Chinese (11% → 15%); Japanese (8% → 7%); Spanish (7% → 9%); German (6% → 5%); French (4% → 5%); Portuguese (3% → 4%). In spite of the lower percentage of English native speakers, English is still the largest language in terms of host numbers (which increased by ca. 60%): English-language country domains (.us, .uk, .ie, …) continued as the largest share, and this share increased by ca. 1 percentage point. In addition, (.net, .com, .org) hosts, which are to a large extent in English, still make up 52.5% of all hosts (down from 62.6% in 2005; see Footnote 4).

3.2 Behaviour II: link-setting behaviour leads to an under-representation of non-English languages

Users whose native language is not English are thus faced with a relatively small(er) supply of Web content in their native language. So do they at least have the chance of reaching this content to the extent that it exists? To investigate this, we focused on reaching by navigation and thus the link structure of the Web.

3.2.1 Hypotheses

If the link distribution was “linguistically equitable”, all users would face a number of links to content in their own language to an extent that reflects the supply of content in this language. Conversely, if a language (e.g., English) dominates, the relations will differ.

3.2.2 Method

3.2.2.1 Measures

If the first hypothesis held, the following ratio would always be expected to be 1 regardless of the size of the language group:

$$ \frac{\text{Percentage of inlinks to content in language } L}{\text{Percentage of Web hosts with main language } L} $$

The inlinks in this expression could be defined “statically” (how many inlinks exist on the Web) or “dynamically” (how many inlinks are actually used). The first can be measured by a crawler analysis of the Web graph, the second by a referrer analysis of a Web server log.
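
The "dynamic" measurement can be sketched as follows. The example assumes that the server writes the widely used Combined Log Format; the log format of the investigated site is not stated in the text, and the host names and log lines below are invented for illustration.

```python
import re
from urllib.parse import urlsplit

# One Combined Log Format entry (assumed format):
# host ident user [time] "request" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def external_referrers(log_lines, site_host):
    """Return the set of distinct referrer URLs that point in from other hosts."""
    referrers = set()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        referrer = match.group("referrer")
        ref_host = urlsplit(referrer).hostname
        if ref_host is None or ref_host == site_host:
            continue  # no referrer ("-") or navigation within the site
        referrers.add(referrer)
    return referrers


# Invented log lines for a hypothetical site www.example.org.
sample_log = [
    '192.0.2.1 - - [01/Feb/2005:10:00:00 +0100] "GET /index.html HTTP/1.1" 200 1234 "http://example.es/links.html" "Mozilla/4.0"',
    '192.0.2.2 - - [01/Feb/2005:10:00:05 +0100] "GET /page.html HTTP/1.1" 200 2345 "http://www.example.org/index.html" "Mozilla/4.0"',
    '192.0.2.3 - - [01/Feb/2005:10:00:09 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '192.0.2.4 - - [01/Feb/2005:10:01:00 +0100] "GET /index.html HTTP/1.1" 200 1234 "http://example.es/links.html" "Mozilla/4.0"',
]
distinct = external_referrers(sample_log, "www.example.org")
```

Collecting the referrers in a set yields distinct inlinks that were used at least once, which is the quantity the dynamic definition refers to.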

3.2.2.2 Data

These considerations make it obvious that the question of this section cannot be answered with respect to the Web as a whole. The static analysis would require access to the whole index of a major search engine, and even that would only be an approximation (given that each search engine covers only parts of the Web). The dynamic analysis would require access to all those sites’ Web server logs, which is impossible because Web server logs are only disclosed in exceptional circumstances. We therefore decided to investigate this question using data about one site, www.dermis.net. The site was chosen for a number of reasons: it is very large, heavily frequented by users from across the world, offers information for domain experts and laypeople in one of the largest growth areas of Web content (e-Health, see MarketIntellNow 2008), and it is available in five major languages: English, Spanish, German, French, and Portuguese (see Footnote 5). This site was also used by Kralisch and Berendt (2004, 2005); in contrast to the earlier study, a later dataset was used, and the analysis methods extended beyond the earlier study’s focus on search behaviour.

Two datasets were used for the analysis: data from a Web crawler and data about distinct referrers from the server log. The crawling dataset provides distributions of existing inlinks as indications of link-setting behaviour and therefore intended information flow. The referrer dataset provides distributions of which of these inlinks were used at least once, as indications of link-following behaviour and therefore actual information flow. Due to uncertainties involved in automatic language identification, we also cross-validated the links found by the crawler with the external referrers resulting from this analysis.

3.2.2.3 Procedure

The Web crawler is based on Jobo (www.matuschek.net) and was developed by Thomas Mandl. The crawler queries search engines to collect information about other Web sites and their links to the Web site investigated. For each of these links, the dataset contains the URL of both the source and target page and their language. For this analysis, all Web pages were considered independent objects regardless of potential relationships. We used the language classifier Ngramj (http://sourceforge.net/projects/ngramj) for automatic classification of the source pages’ languages. In cases where pages contain text in more than one language, we assumed that there is one main language, and the tool’s results were used (see also Martins and Silva 2005).
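
To illustrate the general idea behind n-gram-based language identification, the following toy sketch scores a text by the character trigrams it shares with small per-language profiles. This is a simplification chosen for illustration and does not reproduce the implementation of Ngramj; the training sentences, function names, and scoring rule are all assumptions.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count the character trigrams of a lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def classify_language(text: str, profiles: dict) -> str:
    """Return the language whose profile shares the most trigrams with text."""
    sample = trigram_profile(text)

    def shared(profile: Counter) -> int:
        # Sum the sample's trigram occurrences that also appear in the profile.
        return sum(count for gram, count in sample.items() if gram in profile)

    return max(profiles, key=lambda lang: shared(profiles[lang]))


# Tiny invented training texts; a real classifier needs far larger corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}
```

A page is assigned exactly one label by this procedure, which mirrors the single-main-language assumption made for multilingual pages.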

To obtain the referrer dataset, we analysed 374,458 user sessions consisting of 497,912 page requests, recorded in the site’s logfile between February and April 2005. The analysis followed the standards of referrer analysis: data preparation including sessionization followed the usual steps (e.g., Cooley et al. 1999) and was mostly done automatically using WUMPREP (www.hypknowsys.de). External referrers that indicated the use of a search engine were excluded. To determine the language of the referrer inlinks, we used top-level domains as proxies; for example, a page from an .es domain was assumed to be in Spanish (see Footnote 6).
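
The two filtering steps, excluding search-engine referrers and using the top-level domain as a language proxy, can be sketched as follows. The domain-to-language table and the list of search-engine markers are hypothetical and deliberately incomplete; the treatment of generic domains as "unknown" is likewise an assumption of this sketch.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical, incomplete mapping from country-code TLD to assumed language.
TLD_LANGUAGE = {
    "es": "Spanish",
    "de": "German",
    "fr": "French",
    "pt": "Portuguese",
    "jp": "Japanese",
    "ru": "Russian",
}

# Hypothetical host-name fragments that indicate a search engine.
SEARCH_ENGINE_MARKERS = ("google.", "yahoo.", "search.")


def referrer_language(referrer_url: str) -> Optional[str]:
    """Return the assumed language of a referrer, or None if it is excluded."""
    host = urlsplit(referrer_url).hostname
    if host is None:
        return None  # not a usable URL
    if any(marker in host for marker in SEARCH_ENGINE_MARKERS):
        return None  # search-engine referrers are excluded from the analysis
    tld = host.rsplit(".", 1)[-1]
    return TLD_LANGUAGE.get(tld, "unknown")
```

For example, `referrer_language("http://www.example.es/links.html")` yields `"Spanish"`, while a referrer on a search-engine host is dropped regardless of its top-level domain.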

3.2.3 Results

The crawler found 4,220 links pointing to pages within the site. The usage analysis found between 5,191 (April 2005) and 7,748 (February 2005) distinct links. The overlap between both sets––the existing/found inlinks versus the used inlinks––is only about 20% of each set. This means that the search engine queried by the crawler has not registered all pages linking to the site, and many links known to the search engine were not used. The results are shown in Table 3. For comparison, the global distribution data (cf. Table 2) are listed in column 1.
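
The overlap between found and used inlinks can be expressed as a share of each set, as in this minimal sketch. The URLs are invented and the function name is illustrative.

```python
def overlap_shares(found_inlinks: set, used_inlinks: set) -> tuple:
    """Return the overlap as a fraction of the found set and of the used set."""
    if not found_inlinks or not used_inlinks:
        raise ValueError("both sets must be non-empty")
    common = found_inlinks & used_inlinks
    return len(common) / len(found_inlinks), len(common) / len(used_inlinks)


# Invented example: five crawled inlinks, five used inlinks, one in common.
found = {"http://a.example/1", "http://b.example/2", "http://c.example/3",
         "http://d.example/4", "http://e.example/5"}
used = {"http://e.example/5", "http://f.example/6", "http://g.example/7",
        "http://h.example/8", "http://i.example/9"}
share_of_found, share_of_used = overlap_shares(found, used)
```

Reporting both fractions matters because the two sets generally differ in size, so a single overlap count corresponds to different shares of each.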

Table 3 Link-setting and link-following behaviour: How many hosts link to a site with five L1, and how many unique referrers lead to sessions on this site?

All non-German languages have values clearly below 1. The high value for German probably results from the site being operated by a team in Germany.

3.2.4 Discussion

The smaller the language, the smaller the relative percentages of inlinks; this holds for all languages for which the site offers users an interface in their L1 (above the line in Table 3). The number of links from sites in languages that are only served as L2 interfaces (below the line) is even smaller (with the exception of Chinese, for which we have no good explanation). This indicates that non-English languages are under-represented on the Web in terms of the links that content creators set to content in those languages.

3.3 Behaviour III: link-following behaviour reinforces the under-utilization of non-English content

3.3.1 Hypotheses and data

In a follow-up study of the data described in Sect. 3.2, we investigated the relationships between referrer languages, users’ native languages and the chosen site language. Based on the previously described findings, we expected to find an over-representation of English even in the usage of a site offered also in other languages.

3.3.2 Method/measures

Proportions of usage by language were investigated. We derived data about the users’ native languages from logfile information. Detailed geographic information was obtained from IP addresses by means of a geocoder (www.geobytes.com). This was combined with a language database we created, which includes data on regional language distribution and language status.
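
The combination of geographic information with a language database can be sketched as a two-step lookup. The sketch assumes that the geocoding step has already mapped each IP address to a country code and that each country has a single main native language; the table below is a hypothetical excerpt and ignores multilingual regions and language status.

```python
# Site languages as listed in Sect. 3.2.2.2.
SITE_LANGUAGES = {"English", "Spanish", "German", "French", "Portuguese"}

# Hypothetical excerpt of a language database: country code -> main language.
COUNTRY_LANGUAGE = {
    "DE": "German",
    "ES": "Spanish",
    "BR": "Portuguese",
    "US": "English",
    "JP": "Japanese",
    "RU": "Russian",
}


def classify_user(country_code: str) -> tuple:
    """Return (assumed native language, user group) for a geocoded session.

    A user counts as an "L1" user if the assumed native language is offered
    on the site, and as an "L2" user otherwise.
    """
    language = COUNTRY_LANGUAGE.get(country_code.upper())
    if language is None:
        return ("unknown", "unknown")
    group = "L1" if language in SITE_LANGUAGES else "L2"
    return (language, group)
```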

3.3.3 Results

Not only were there substantially fewer referrers from L2 sites to the investigated site, as shown in Table 3; these were also used to different extents. A further analysis of the data described in the previous section showed that country-specific links were always used most by users with the corresponding mother tongue (e.g., .jp by Japanese). Inlinks from .de were the only exception, since they were used more by English native speakers than by German native speakers. However, most country-specific referrers have little importance in an overall view. The .de inlinks are the most used inlinks in all language groups except the Russian one (which only used .ru inlinks). The share of .de inlinks was over 40% for all seven languages shown in Table 3. The shares of .com inlinks were second across languages, with single-digit shares for all languages except Chinese (15%).

Due to the outstanding role of English and German, we analysed the link-following behaviour of visitors directed to the site through a .com or a .de inlink in more detail as examples of “international”/English and German language referrers, respectively. Table 4 shows which percentage of users with a given native language (rows) chose the different site versions (columns). Bold and italic type indicate majority behaviours. The table investigates (a) whether links to native language content are used if they are available (above the line) and (b) which language is used when a native language version is not available (below the line). The investigation was conducted in two scenarios: in scenario 1, the visitors come from a .com Web site, in scenario 2 from a .de site.
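
The row percentages of such a table can be derived from session records as in the following sketch. The session data are invented for illustration and do not correspond to the values in Table 4; the function names are assumptions.

```python
from collections import Counter, defaultdict


def version_shares(sessions):
    """Compute, per native language, the percentage of sessions per site version.

    `sessions` is an iterable of (native_language, chosen_site_version) pairs
    for one referrer scenario (for example, all sessions arriving via .com).
    """
    counts = defaultdict(Counter)
    for native, version in sessions:
        counts[native][version] += 1
    shares = {}
    for native, versions in counts.items():
        total = sum(versions.values())
        shares[native] = {v: 100.0 * c / total for v, c in versions.items()}
    return shares


def majority_version(row: dict) -> str:
    """Return the site version with the highest share in one table row."""
    return max(row, key=row.get)


# Invented sessions for a single scenario.
sessions = ([("German", "German")] * 3 + [("German", "English")]
            + [("Japanese", "English")] * 2)
shares = version_shares(sessions)
```

Running the same computation separately for each referrer scenario gives the two halves of the table, which can then be compared column by column.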

Table 4 Link-following behaviour: impact of referrer language and user native language on chosen site language

3.3.4 Discussion

With regard to question (a), the table shows that, where the native language is available, users go either to their native language version or to the English language version. A prior navigation in English appears to strengthen the dominance of English (all values but one in the “English” column are higher in the left part of the table than in the right part), while a prior navigation in a native language appears to weaken this dominance (see especially the high value for German → German in the main diagonal of the right part of the table).

With regard to question (b), results indicate that the majority of users choose the English-language version if the native language is not available, regardless of the language of the previously visited site.

Taken together, these results indicate that users have a clear preference for navigation in their native language when it is available via a link, but if that is not available, they accept the necessity to navigate in English.

In summary, the results described in Sects. 3.1–3.3 indicate that behavioural tendencies both of content providers and of content users lead to mutually reinforcing under-representations of non-English languages––compared to the respective market size or available options, there is less content in these languages, this content is linked to less, and the links are followed less often.

4 Language and attitudes

The advantage of behavioural studies like those described in the previous section is that they make very large-scale studies possible; their disadvantage is that they do not provide explanations of why certain behaviour occurs. We therefore complement the behavioural studies by a––necessarily smaller––attitudinal study that scrutinizes in detail a subgroup of users about which we ask a fourth question:

  • Do Web users appreciate that content is provided in their native language? I.e., concentrating just on those users who have successfully navigated the first three behavioural obstacles (content in their language exists, is hyperlinked, and is accessed), how does language affect these users’ attitude to the content?

4.1 Hypotheses

In this study, we examined the role of language as a determinant of users’ satisfaction with a Web site. The impact of language is expected to be twofold.

First, a user’s language skills affect the perceived cognitive effort required when using the Web site. In contrast to the previous analyses, in this study cognitive effort was measured in terms of effort saved. “Saved effort” relates to the difference in cognitive effort between the use of a user’s native language and the use of a non-native language. An L1 user will generally save effort (relative to a situation in which she would access information in a second language, usually English). The less knowledge of that other language she has, the more effort she will save. Conversely, an L2 user will expend effort (relative to a situation in which he would access information in his first language), i.e. experience a negative saved effort. The less knowledge of the access language he has, the higher the expended effort or equivalently the smaller the saved effort. In terms of the constructs of the Technology Acceptance Model (TAM, see Sect. 2.3): the higher the perceived saved effort, the higher the system’s (comparative) ease of use. From this and the TAM’s positive relationship between ease of use and satisfaction, we derive the following hypothesis:

H1

Users’ perceived language-related saved effort is positively related to their satisfaction with a Web site.

Second, the overall amount of information in a language is also expected to influence satisfaction. The argument has two steps: (a) As argued in Sect. 2.3, we expect that the higher the amount of native-language information on the Web, the lower the value (usefulness) of a site in that language. Since users do not have perfect information, their perceived amount of information in a given language is expected to be decisive. (b) The TAM leads us to expect that usefulness increases satisfaction. We therefore consider the perceived amount of native-language information on the Web as the inverse of usefulness and formulate the following hypothesis:

H2

Users’ perceived amount of native-language information on the Web is negatively related to their satisfaction with a site in that language.

Besides their influences on satisfaction, these two constructs may themselves interact. In Davis’ (1993) results, the ease of use has a positive impact on usefulness. The expected relation in our study is not as straightforward. First, saved effort is a construct that expresses ease of use from a comparative view: how much effort is saved by using a particular feature instead of another feature. In a similarly comparative way, a site is more useful to the extent that the market supply is perceived as smaller. Second, different perception and learning processes may operate in this dynamic and relative setting. Two types of processes are conceivable, depending on whether perception processes or market demand-side processes dominate: (a) Users with higher saved effort (= less knowledge of non-native languages) have a higher need for native-language sites; and their perception evaluates the existing supply as insufficient. In this case, saved effort would correlate negatively with the perceived amount of native-language information. (b) Users from smaller-language communities develop higher competencies in English to be able to profit from the (mostly non-native) content. In this case, the perceived amount of native-language information would correlate positively with the saved effort.

4.2 Method

4.2.1 Participants

Data was obtained from two surveys that together yielded 103 valid answer sets. English native speakers were excluded from the analysis. (This was done because it can be assumed that English is the most frequent L2 language for Web users. Consequently, for L1 presentation, L2 knowledge was equated with English-language skills, and respondents whose L1 is English were excluded.)

4.2.2 Materials

Survey 1 was conducted on the Web site described in Sect. 3.2 between April and September 2005; survey 2 on another e-Health Web site (affiliated to the first site, but targeted towards domain experts and more communicative than informative uses) between July and September 2005. Users were informed by links on the site itself (1 and 2) and an invitation to all registered users by E-mail (2). Both surveys employed a questionnaire. Questionnaires were offered in the site’s five languages (1: English, French, German, Spanish, Portuguese) or in English and German (2). The questionnaires asked users to rate a number of statements concerning (a) their impression of the specific site, (b) using the Web (as a whole) in their native and in other languages, and (c) their language skills.Footnote 7

4.2.3 Measures

Questionnaires for both surveys were based on similar item questions but had to be adapted to the characteristics and features of each site and the requests of the collaborating partners.

4.2.3.1 Saved effort (SAVEFF)

Four questions with a 7-point Likert scale (6-point in survey 2) concerned this construct. L2 users were asked whether, if they accessed a site in their native language rather than the current one in a second language, they would save time while surfing and seeking information, would surf more, and would perceive less effort. L1 users were asked whether, if they accessed a site in a second language rather than the current one in their native language, they would lose time, surf less and perceive more effort.

4.2.3.2 Perceived amount of native language information online (ILO)

Three items, with answers on 7-/6-point Likert scales, contained questions about the amount of general and medical information on the Web in the user’s native language, as well as his/her tendency to communicate in English as a result of insufficient native language information.

4.2.3.3 Satisfaction (SAT)

Survey 1 asked users to evaluate the site on a 7-point semantic differential scale with the adjective pairs satisfied/unsatisfied, content/discontent, positive/negative, pleased/unpleased. The 6-point semantic differential in survey 2 was based on Davis’ (1993) items for assessing satisfaction.

Respondents were asked to specify their native language and their English language skills.

4.3 Results

4.3.1 Basic results: L1/L2 users

The ratio of native speakers to non-native speakers participating in survey 1 mirrored the results from our previous studies on the usage of a site dominated by non-domain experts (Kralisch and Berendt 2005, see Sect. 2). Among the 35 valid cases were 30 L1 users (85.7%) and 5 L2 users (14.3%). Due to the low number of L2 participants, analyses were restricted to the L1 user group.

In survey 2, the ratio was reversed, probably because the target group of the site investigated in survey 2 consists only of domain experts (this is also consistent with the result from the earlier study). Among the 68 valid cases were 12 L1 users (17.6%) and 56 L2 users (82.4%). Due to the low number of L1 participants, analyses were restricted to the L2 user group. We split the analysis by whether the respondents used the discussion forum or the individual-consultation service of the site, and here report only the results of the former (because it is more similar to the general-information nature of the site of survey 1). This reduced the number of valid cases to 54.

In the following, we first describe the results of survey 2 that investigated the typical kind of user who stands to benefit from language tools: L2 users who at present do not get access to information in their native language. We then progress to survey 1 that investigated users who already profit from an existing translation into their native language.

4.3.2 Attitudes I: L2 users are more satisfied when they can save cognitive effort

Answers concerning the three constructs were tested for reliability; results were sufficient for SAVEFF and SAT (Cronbach’s alpha = 0.892 and 0.950, resp.), but too low for ILO (0.460). We therefore computed construct scores, as the means of the respective items, only for SAVEFF and SAT. A factor analysis (principal component analysis, Varimax rotation) also revealed the three expected factors with the respective items and factor loadings, further confirming the item-construct classification.
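For readers unfamiliar with the reliability coefficient used here, the following is a minimal sketch of how Cronbach’s alpha can be computed from item responses. It is not the analysis script used in the study; the function name `cronbach_alpha` and the example ratings are hypothetical, and the sketch assumes complete answer sets (no missing values).

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns.

    `items` holds one list of scores per questionnaire item; all lists
    must cover the same respondents in the same order (assumption:
    no missing values).
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sum score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 7-point ratings of five respondents on three items.
example_items = [
    [6, 5, 7, 3, 4],
    [6, 4, 7, 2, 5],
    [5, 5, 6, 3, 4],
]
print(round(cronbach_alpha(example_items), 3))
```

A construct whose items vary together yields a value close to 1, whereas weakly related items (as for ILO in survey 2) yield a low value.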

SAVEFF correlates significantly with the users’ English proficiency levels (active knowledge: r = 0.388, p < 0.05 / passive knowledge r = 0.322, p < 0.05).

4.3.2.1 Saved effort––satisfaction (H1)

There was a significant positive correlation between SAVEFF and SAT (r = 0.313, p < 0.05), corroborated by significant correlations between the single items.

4.3.2.2 Amount of native-language information––satisfaction (H2)

No significant correlation between the items for ILO and SAT was found; for two items, a negative correlation trend was found (r = −0.257, p = 0.062; r = −0.231, p = 0.09).

4.3.2.3 Saved effort––amount of native-language information

There was no significant correlation between SAVEFF and ILO.

In summary, the only finding regarding L2 users was a confirmation of hypothesis 1: L2 users are more satisfied when they can save cognitive effort.

4.3.3 Attitudes II: the satisfaction of L1 users depends on their command of English and their perception of other available information

Answers concerning the three constructs were tested for reliability, with sufficient results in each case (Cronbach’s alpha was 0.855, 0.875 and 0.736 for SAVEFF, SAT and ILO, resp.). A factor analysis (Principal component analysis, Varimax rotation) also revealed the three expected factors with the respective items and factor loadings.

Given the sufficient reliability coefficients and the constraints due to the limited data set, each construct was represented by the mean of the items composing it. The correlation between SAVEFF and the users’ English-language skills was fairly strong (r = −0.698, p < 0.001). Thus, saved effort seems to be an appropriate reflection of a user’s perceived linguistic effort.

We tested for bivariate Spearman correlations among the single items and between the items and the constructs. Significant correlations between the items loading on the same factor confirm the results obtained from the reliability tests and factor analysis, and therefore the validity of the chosen items. However, consistent significant correlations between items of different factors were not revealed; instead correlations between single items indicated potential relationships.
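As an illustration of the bivariate Spearman correlations mentioned above, the following minimal sketch computes the coefficient as the Pearson correlation of average ranks, which handles the ties that are frequent in Likert-scale data. The function names are hypothetical, the sketch does not compute p-values, and it is not the software used in the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(saved_effort, english_skill)` would be called with two hypothetical lists holding one score per respondent.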

4.3.3.1 Saved effort––satisfaction (H1)

No correlation between SAVEFF and SAT was found.

4.3.3.2 Amount of native-language information––satisfaction (H2)

ILO was negatively correlated with SAT, though not significantly (r = −0.419, p = 0.27). A low ILO in this study of L1 users means that the user perceives little content in L1 and is therefore used to visiting L2 sites.

4.3.3.3 Saved effort––amount of native-language information

SAVEFF was negatively correlated with ILO (r = −0.423, p < 0.05): the more effort saved, the lower the perceived amount of medical information.

In summary, the negative relationship between SAVEFF and ILO supports the assumption of a language-skill related perception of the amount of native-language information. In addition, the findings are consistent with H2 (even though not statistically significant), i.e. visitors used to visiting L2 sites tend to be more satisfied with an L1 Web site than those used to obtaining information in their native language.

4.4 Discussion

Users who are at present relegated to an L2 usage situation and who speak English poorly expend effort and tend to experience lower satisfaction. They would clearly benefit from more translated content and/or language tools. Such tools might be less relevant for L2 users who speak English well. L1 users who speak English poorly save effort. They think that there is, in general, little L1 content for them. This group would also benefit from more translated content and/or language tools. In contrast, L1 users who speak English well do not save much effort, but tend to think that there is, in general, much L1 content for them. Moreover, the more content is perceived to exist in the native language in general, the less satisfactory the specific present site appears to be.

This can result from several reasons: (a) These L1 users’ good command of English enables them to navigate wide areas of the Web effortlessly (and they do this, see Sect. 3). They therefore know more competitor Web sites and consequently rate the site less highly. (b) Good command of English may be confounded with high domain expertise (and maybe also with a better knowledge of domain-specific information) and a consequently more critical evaluation of a site. (c) Since they are used to high-quality Web sites in English, these users may develop country-of-origin attitudes and associate Web sites in their own language with lower quality.Footnote 8

5 Summary, conclusions and outlook

In this paper, we have investigated the existence and use of non-English content as well as attitudes towards content in different languages. First, we have shown that the continued under-representation of non-English languages results from a number of behavioural effects: there is less content in these languages (than the size of the language groups would warrant), this content is linked to less (than its share of all Web content would warrant), and the links are followed less often (than the opportunities given by links would warrant).

Second, we have shown that attitudes confirm earlier work on the general desirability of more translation and better language tools (Chung et al. 2006). However, our results also shed a differentiated light on users’ attitudes, revealing a complex interplay of English language skills, the perceived saved effort of using native-language content, the perceived overall supply in that language on the Web, and satisfaction.

Further research is needed to replicate and extend these conclusions. In particular, this should go beyond the present studies’ limitations.

5.1 Limitations

First, our aggregate measures (such as percentages of Web hosts) should be complemented by more fine-grained analyses of, e.g., content in different application domains. Second, while we did use a combination of behavioural and attitudinal methods, neither is without problems. Results from Web server log files have the advantage of yielding huge datasets gained under ecologically valid usage conditions, but they need to be interpreted with caution because actions do not reveal the individual user’s reasons for coming, staying or leaving, and log files generally do not capture users’ tasks and intentions. In general, the field nature of these studies makes it difficult to control many variables. In addition, the native-language assignment resting on IP addresses has a margin of error.Footnote 9 Results from questionnaires, on the other hand, face problems of self-selection and the limitations of introspective assessments (such as language skills). In addition, the specifics of our sites led to a confounding of language skills (L1 vs. L2) with domain expertise (layperson vs. expert) in the studies reported in Sect. 4. (An interesting finding was that the expert users displayed a large demand for language tools. We would not have expected this, given our earlier results on domain knowledge mediating language skills, see Kralisch and Berendt 2005.)

Third, the restriction to two Web sites and five languages should be lifted; further studies on other sites, in other domains and with further languages (especially Chinese and Arabic, the fastest-growing languages on the Web) should complement our findings. Finally, most of the data used in this study originate from 2005. While some current data on Internet usage indicate that the basic relations have not changed and many of our methods and findings rest on basic and comparatively persistent cognitive structures, it is possible that learning and market processes have modified the Web environment. Further investigations and long-term analyses are needed.

5.2 Conclusion

In conclusion, the creation of language tools should start from the basic observation that everyone is an L2 user in some contexts. In these contexts, savings in cognitive effort––for example, due to the presence of language tools––are universally appreciated and helpful for performance. Many known results (e.g., Chung et al. 2006) support this conclusion. However, they may not sufficiently take into account that societal factors interact with individual cognitive factors: people act in a dynamic Web context and adapt to the (still very unevenly distributed) linguistic information in complex ways.

Taken together, these results point to a possible digital divide and different possible strategies for deploying language tools in originally English-language Web environments: there exists a “linguistic upper class” of people who are proficient in English, often prefer to navigate in English (even if offered content in their own language), and are more scrutinizing of the quality of Web content. These users probably do not care much about whether sites make efforts to provide them with content in their own languages (in fact, this may actually make them more critical of a site). However, the “linguistic lower class” of people who are not so proficient in English do perceive the (real) scarcity of information in their native language and are highly appreciative of content in this language. Translations and language tools for these users are important tools for winning their approval, but it should not be forgotten that “content comes first”.

While the impact of language on Web usage behaviour and attitudes is a line of research that is only at its beginning and therefore necessarily yields comparatively general insights, we also want to derive some more specific conclusions for design.

5.3 Conclusions for design

First, language tools may be more useful for generic search engines than for individual sites. Language tools will be highly appreciated, and in this sense be worth the investment, in more generic tools for cross-site, non-English or cross-language search and retrieval. However, individual sites may be better advised to make technology development and deployment choices depend on domain, user target groups and competing information.

Second, content and search-tool designers should not draw simplistic conclusions based on behaviour alone, because this is not a reliable indicator of attitudes and preferences. In the absence of links and/or content in their native language, users will acquiesce to English-language content. However, their preferences will persist. We therefore encourage designers to explicitly offer access, by linking or search tools, also to content in languages other than English. This will not only make users more satisfied, but may also replace the current negative reinforcement loops with positive ones that ultimately result in more non-English content.

Finally, cross-language search tools could build on these insights. Both linguistic capacities and preferences predict that queries will probably be formulated most often in the user’s native language. In response to this, search tools could compile result sets in different languages and rank the results by a combination of language skills and preferences (L1 first, then L2s in decreasing order of proficiency, assuming English in first place and/or resorting to information gathered from the user) and relevance to the query/completeness of the answer. The compilation of result sets could utilise the contents, anchor texts, metadata or tags of resources, and it could leverage multilingual thesauri or aligned corpora and automatic translation (see Airio 2008 for experimental evidence of the benefits of query translation especially for users with moderate language skills). Users should be enabled to customize the relative importance of language and comprehensiveness. Alternatively, automatic personalization could suggest non-native-language content only if user behaviour indicates a hitherto-unsuccessful search.
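The ranking idea outlined above can be sketched as follows. This is only an illustration under stated assumptions, not a tested design: the `Result` class, the proficiency scores between 0 and 1 (native language = 1.0), the linear weighting and the parameter `language_weight` are all hypothetical choices introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str          # hypothetical resource identifier
    language: str     # language code of the resource
    relevance: float  # assumed query relevance in [0, 1]


def rank_results(results, proficiency, language_weight=0.5):
    """Rank results by a weighted mix of language fit and relevance.

    `proficiency` maps language codes to the user's skill in [0, 1];
    languages missing from the mapping count as 0 (assumption).
    `language_weight` is the user-adjustable importance of language
    relative to relevance.
    """
    if not 0.0 <= language_weight <= 1.0:
        raise ValueError("language_weight must lie in [0, 1]")

    def score(result):
        language_fit = proficiency.get(result.language, 0.0)
        return (language_weight * language_fit
                + (1.0 - language_weight) * result.relevance)

    # sorted() is stable, so equal scores keep their original order.
    return sorted(results, key=score, reverse=True)


# Hypothetical user: German native speaker, good English, little French.
user_proficiency = {"de": 1.0, "en": 0.7, "fr": 0.2}
candidates = [
    Result("https://example.org/fr", "fr", 0.95),
    Result("https://example.org/en", "en", 0.80),
    Result("https://example.org/de", "de", 0.60),
]
for hit in rank_results(candidates, user_proficiency, language_weight=0.5):
    print(hit.language, hit.url)
```

Setting `language_weight` to 0 reduces the ranking to pure relevance, while a value of 1 orders results strictly by language (L1 first, then L2s by proficiency); intermediate values correspond to the customizable trade-off between language and comprehensiveness.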