An Exploratory Study of Models of Mobile Map User Experience

Several user studies have been conducted to evaluate the User Experience (UX) of thematic mobile maps, but models describing the results beyond point studies are still lacking. This article explored mathematical functions to predict the UX on the visualization types Choropleth Maps and Graduated Symbol Maps. Ten different Choropleth Maps and ten different Graduated Symbol Maps were utilized to conduct a user study, in which 30 participants solved information-gathering tasks on a mobile device. The data from the first 20 participants served as input to build 12 mathematical models on the accuracy, efficiency, perceived mental demand, perceived performance, perceived effort demanded and perceived frustration level for solving the given map tasks. The predictive performance of the models was then evaluated using data from the remaining ten participants and the predictions were within 30% of unseen empirical data. The models obtained are relevant to the design of adaptive and plastic geovisualizations on mobile devices.


Introduction
Geographical maps serve as a tool to visually present information about the world in an abstract and simple way. These can be 'topographic maps' that show the shape of the Earth's surface to support orientation, or 'thematic maps' that focus on a specific topic, also often named 'single-topic maps' (Wade et al. 2006). These representations enable map users to gain knowledge and support them in decision-making processes. Regardless of the type of geovisualization, the conveyed information should be easy for map users to perceive. Hence, geographic visualizations have undergone many improvements to refine map readability and map user experience (Bessadok and Dominguès 2011; Harrie 2009). Over time, information technology has developed so rapidly that 'personal' computers have become ubiquitous, mostly in the form of mobile phones. For this reason, mobile maps have become essential for decision-making processes and for gathering geographic information, not only for map experts but for everyone, regardless of their level of map-use skill.
In the field of thematic mapping, different types of maps can be used to visualize the same quantitative data: choropleth map, graduated symbol map, proportional symbol map, dot density map, isoline map, dasymetric map, cartogram and flow map (see Słomska-Przech and Gołębiowska 2021; Golebiowska et al. 2021). The user experience of these geovisualizations is increasingly studied (Słomska-Przech and Gołębiowska 2021; Brychtová 2015b; Gorte and Degbelo 2022; Coltekin 2015, 2016; Brychtová and Çöltekin 2017b). The goals of these studies are diverse, for example, comparing the impact of different visualization types on user performance during map reading tasks (Gorte and Degbelo 2022; Słomska-Przech and Gołębiowska 2021), or investigating the impact of different design parameters (spatial distance, colour distance, legend position, font size) on the interpretation of maps (Brychtová 2015b; Coltekin 2015, 2016; Brychtová and Çöltekin 2017b).
Although these user studies have provided valuable insights, models systematically describing how the user experience of thematic maps varies as a function of changes in design parameters are still lacking. There are different types of models in HCI (see Oulasvirta 2019 for examples) and 'model' denotes here a representation in mathematical terms of the behaviour of a dependent variable in response to a design parameter (i.e. mathematical models). The benefits of these mathematical models are at least threefold. First, and from the theoretical point of view, they enable generalizations beyond the results of empirical 'point studies'. That is, mathematical models can be seen as the conceptual 'glue' (Oulasvirta and Hornbaek 2016) that links different empirical studies examining the impact of a design parameter on a dependent variable. Second, and from the practical viewpoint, they can inform design by allowing numerical derivation of design consequences (see Oulasvirta and Hornbaek 2021). In that sense, and as indicated in (Cockburn and Gutwin 2010;Bailly et al. 2014), they can reduce the dependence on empirical evaluations when testing the performance of a particular prototype. Third, these models are needed to help computers 'understand' UX. That is, they are needed to formulate optimization constraints for computational user interface design (Oulasvirta 2016;Oulasvirta et al. 2020) and intelligent geovisualizations (Degbelo and Kray 2018).
This work systematically investigated the impact of two design parameters on the user experience of map-based products on mobile devices. The focus was on the following research questions:
1. Which mathematical function best describes the relationship between user experience and the colour distance on a choropleth map?
2. Which mathematical function best describes the relationship between user experience and the size distance between symbols on a graduated symbol map?
3. How do the mathematical functions perform on unseen data?
In this exploratory study, we are particularly interested in learning about the form of the mathematical functions. The two types of maps investigated are shown in Fig. 1. We chose choropleth maps (CM) and graduated symbol maps (GSM) as a starting point because they are the most frequently used to visualise open data (see e.g. . The contributions of this work are threefold:
• An open-source prototype that generates choropleth maps with different colour distances and graduated symbol maps with different symbol size distances for mobile devices;
• Twelve mathematical functions describing the user experience of choropleth and graduated symbol maps. Their evaluation has shown that they are promising: their predictions remain within 30% of unseen empirical data;
• Hypotheses about the relationships between colour/size distance and pragmatic as well as hedonic aspects of user experience on mobile devices.

Related work
The aim of the work is to learn about mathematical functions that can be used to describe the user experience of mobile maps (CM and GSM). Hence, this section briefly reviews previous work on user studies on mobile map UX, user studies on choropleth and graduated symbol maps, and mathematical models of interaction in HCI. In essence, though previous work has contributed several user studies about digital maps on mobile devices, mathematical models of UX are still rare.

User studies on mobile maps user experience
There is a growing body of work studying the user experience on mobile maps. The goal is to detect issues when interacting with mobile maps and find ways to improve the overall map user experience. For instance, Bartling et al. (2021) created an online survey to test 84 (topographic) map variations to evaluate the relationship between user context and user experience. Participants interacting with polygons on most of the map variations had high task success, comfort, and confidence ratings. Bertel et al. (2017) compared the effect of a visual display condition and a tactile display condition on spatial knowledge acquisition on mobile devices. Their study revealed a distinctive strength for each condition: the visual display condition helped build up survey knowledge more, while the tactile condition helped build up mental route knowledge more. Another work focusing on topographic maps proposed a method using speech interaction for editing maps on mobile devices (Degbelo and Somaskantharajan 2020). Participants used 11 speech commands to add features to the mobile map. This work showed that using speech recognition is feasible and usable, while the user experience was rated rather average. Einfeldt and Degbelo (2021) studied the impact of the sequence of UI elements and tab/scrolling navigation for forms on mobile devices. They reported that sequence matters while the UI navigation modalities had no impact on the UX. Finally, Horbiński et al. (2020) investigated the placement of buttons for map-related tasks on mobile devices and pointed at a discrepancy between the users' preferences and the current practice in designing mobile maps. 
While the aforementioned works have provided insight into users' wishes and mental models from different perspectives, they differ from the current work in two key respects: (1) their focus was on topographic maps, not thematic maps as addressed here; and (2) they studied factors of positive user experience, not mathematical models of UX.

User Studies on Choropleth and Graduated Symbol Maps
There are also a couple of studies that attempted to assess the effectiveness of CMs and GSMs for specific tasks. For instance, Schiewe (2019) observed a dark-is-more bias in his study. The dark-is-more bias refers to the fact that even without a legend, people associated the darkest colour hues with the largest data values. He also reported that including a map legend improves the correctness of the association between colour values (dark/light) and data values (large/small), especially for non-experts. The comparison of graduated symbol maps to data tables was the topic of , and the authors reported that graduated symbol maps made 'space-alone compare' information more visible, while the tables made 'space-in-time compare' information more visible to users. Choropleth, graduated symbol and isoline maps were compared in Słomska-Przech and Gołębiowska (2021), and the results showed that choropleth maps were the most effective to complete map-relevant tasks. In addition, a combination of the visual variables colour and orientation was introduced as 'Choriented Map' in Gorte and Degbelo (2022), where participants answered questions on choropleth, graduated symbol and choriented maps on mobile devices. Choriented maps resulted in comparable and sometimes better usability and performance than CMs and GSMs. Change blindness in animated choropleth maps was investigated in Fish et al. (2011). They pointed out that map readers have difficulty detecting changes in animated choropleth maps, and tend to overestimate their own change detection abilities.
Fig. 1 Visualization types investigated in the work: (a) colour distance refers to the average of the distances between the colours used to represent adjacent classes in the map legend; (b) size distance denotes the average of the distances between the circles used to represent adjacent classes in the map legend
The studies by Brychtová and colleagues (Brychtová 2015b; Brychtova and Coltekin 2016; Brychtová and Çöltekin 2017b) are closest to the current work because they investigated the impact of colour distance on map readability on choropleth maps. Nonetheless, there are a few important differences. First, the focus of Brychtova and Coltekin (2016) was on the colour distance between map labels and background colours, not on colour distances between adjacent classes on the legend as in this work. Second, Brychtová and Çöltekin (2017b) investigated colour distances between adjacent classes as done in this work, but used six levels (0, 2, 4, 6, 8, 10) of colour distances on a desktop device, while we use 10 levels on mobile devices (see Sect. 3). Third, their work focused on six classes whereas we use five. Fourth, they investigated both sequential and qualitative colour schemes, while the current study focuses only on a sequential scheme. Finally, they focused on fewer dependent variables (i.e. performance-related ones), while our study also includes subjective metrics (see Sect. 3).

Mathematical Models of Interaction in HCI
Mathematical models may be divided into analytical models (i.e. models that use a known mathematical formula to describe phenomena) and machine/deep learning models (i.e. created from machine learning algorithms through training using labelled data and/or unlabelled data). Previous work in HCI has contributed both types of models. As to analytical models, much work has focused on understanding the selection of menu items, and the UX of websites. The Search-Decision-Pointing (SDP) model of Cockburn et al. (2007) incorporates Fitts's law (time taken to move to a target item) and the Hick-Hyman law (time to find an item) to predict menu selection time. It also accommodates users' increasing expertise. Scrolling hierarchical lists was the topic of Cockburn and Gutwin (2009). They pointed out that when users can anticipate the location of items in a list, the time to acquire them is best modelled by logarithmic functions. When they cannot anticipate the location of items, linear models are more appropriate. Cockburn and Gutwin (2010) proposed the Constrained Input Navigation model to predict human navigation and selection performance in constrained-input scenarios. Bailly et al. (2014) proposed a mathematical model to predict total selection time in linear menus. Their model incorporates three components: serial search (top-to-bottom inspection of items), directed search (direct glance at the target item) and pointing.
In the context of websites, Tuch et al. (2009) reported an inverse-linear relationship between visual complexity and pleasure (i.e. start pages with low visual complexity were rated by users as more pleasurable). A replication of several studies confirmed that inverse-linear relationship (Miniukovich and Marchese 2020). Miniukovich and De Angeli (2014) also observed a negative correlation between visual complexity and aesthetics for mobile apps. They proposed linear regression models for visual complexity, visual aesthetics, and visual complexity+aesthetics on mobile devices. Finally, Reinecke et al. (2013) proposed a model to predict the initial impression of a website's aesthetics based on its colourfulness and visual complexity.
Examples of work on machine/deep learning models include Li et al. (2018) and Ramakrishnan and Kaur (2020). With the assumption that UX is affected by page load times, Ramakrishnan and Kaur (2020) compared 18 machine learning models to predict page loading time. They reported that the Radial Basis Function and Random Forest models seem more promising for that task. Li et al. (2018) used recurrent neural networks to predict human performance regarding the execution of a sequence of selection tasks. The type of task considered was the selection of a target item from a vertical menu/list.

Summary
Cartography and Information Visualization research have contributed insights into users' preferences regarding map-related tasks, and Human-Computer Interaction research has produced useful models for the areas of pointing, menu interaction, and user experience of websites. Nonetheless, there is still a need for models of users' experience with visualizations in general, and maps in particular. The models developed during this work attempt to address that gap, focusing on question-answering tasks with maps on mobile devices.

Method
As discussed in Nosek et al. (2018), there is a dichotomy between prediction (i.e. have an idea about how the world works and make new observations to test whether that idea is a reasonable explanation) and postdiction (i.e. use existing observations of nature to generate ideas about how the world works). This work does some postdiction on the topic of predictive models. That is, it uses observations made in the user study to generate ideas about what predictive models of mobile map UX could look like, for some dependent variables. Hence, the work is at the exploratory end of the exploratory-confirmatory spectrum. We took the following steps:
• Step 1: set a number n_steps of colour/size distances to collect UX values for. The value of n_steps in this work was 10 because at least ten observations per predictor variable are needed to build a regression model that allows good estimates (see e.g. Babyak 2004).
• Step 2: collect a number n_values of user experience values for each colour/size distance. For this exploratory study, n_values was set to four, which means that the experiment was designed so that four different participants provide a UX value for a given colour/size distance. We built a prototype to generate mobile maps with different colour/size distances.
• Step 3: find a representative value for each colour/size distance. The representative value was computed using the median of the four values from Step 2. The median is a more robust summary statistic compared to the mean because one single outlier can drastically impact the mean (see e.g. Daszykowski et al. 2007).
• Step 4: find the regression line that best fits the 10 representative UX values across all colour/size distances.
• Step 5: collect some empirical data about the predictive performance of the best regression model. This was done by comparing predicted and observed values for n_testvalues = 2. That is, the predicted values were compared to the UX values provided by two different participants. In the absence of baseline models to compare our models to, this step is useful to document the expected performance of the identified regression lines on unseen data.
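Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the analysis script used in the study (which relied on lm() in R); the function names bic and select_polynomial, the input layout and the use of NumPy are assumptions made for illustration. The BIC is computed under the assumption of Gaussian errors, with the residual variance counted as one additional parameter.

```python
import math
from statistics import median

import numpy as np


def bic(rss, n, n_coefs):
    """BIC of a least-squares fit, assuming Gaussian errors.

    Assumption: the residual variance counts as one extra estimated
    parameter, so the penalty uses n_coefs + 1.
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + (n_coefs + 1) * math.log(n)


def select_polynomial(ux_values, max_degree=5):
    """Return (degree, bic, coefficients) of the polynomial with the lowest BIC.

    ux_values maps each colour/size distance to the list of UX values
    collected for it (four values per distance in the study design).
    """
    distances = sorted(ux_values)
    xs = np.array(distances, dtype=float)
    # Step 3: one representative value (the median) per distance.
    ys = np.array([median(ux_values[d]) for d in distances], dtype=float)

    best = None
    # Step 4: compare polynomial fits of degree 1 to max_degree.
    for degree in range(1, max_degree + 1):
        coefs = np.polyfit(xs, ys, degree)
        rss = float(np.sum((np.polyval(coefs, xs) - ys) ** 2))
        # Note: math.log fails if rss is exactly 0 (a perfect fit).
        score = bic(rss, len(xs), degree + 1)
        if best is None or score < best[1]:
            best = (degree, score, coefs)
    return best
```

Calling select_polynomial with a dictionary of ten distances, each mapped to four UX values, returns the degree, the BIC and the coefficients (highest power first) of the selected model.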
Since the work learns the mathematical models from data, the reader may question why this is not done using machine learning models. While machine/deep learning models can produce more accurate predictions, their drawbacks are that (1) they typically require a lot of training data, and (2) it is challenging to get insights into what the model actually learned. For these reasons, we chose to focus on regression models so that we can get mathematical formulas as results, and hence, have outcomes that are good candidates to inform further work in analytical modelling. We provide a definition of key terms next (Sect. 3.1), followed by the description of the prototype (Sect. 3.2), the study design (Sect. 3.3) and the analysis strategy (Sect. 3.4).

Definition of Terms
On both visualization types (choropleth and graduated symbol maps), geographical data is divided into five separate classes using the data classification technique natural breaks. Natural breaks divide the data into "natural" classes, by determining the best arrangement of values (Chen et al. 2013). The transformation of this data into visuals implies the use of colour value (CM) and size (GSM) as visual variables, and hence a choice of a colour distance and size distance.

Colour Distance
Colour distance is defined as a metric between two colours within a colour space. The colour distance used in this work is the CIEDE2000 colour difference formula developed by the International Commission on Illumination (CIE) (Sharma et al. 2005). It is based on the CIELAB colour space applying the coordinates L* (lightness), a* (red/green value) and b* (blue/yellow value) to define a colour. Sequential colour schemes (Brewer 1994) were used on choropleth maps. In this scheme, the darker the colour value, the larger the value it represents. Colour values for choropleth maps were generated by the Sequential Colour Scheme Generator developed by Brychtová (2015a). The Sequential Colour Scheme Generator provides colour values after specifying the origin of the colour scheme, the number of classes and the colour distances between adjacent classes. Ten different choropleth maps were developed. The colour distances used on these maps ranged from 2 to 11. The colour distances between neighbouring classes remained the same on one map: for example, if the colour distance between Class A and B was 2, the colour distance between Class B and C was also 2, and so on.
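To make the notion of colour distance concrete, the following Python sketch computes the CIEDE2000 difference between two CIELAB colours and the average distance between adjacent legend classes. It follows the published formulation of the formula with all weighting factors set to 1; it is not the code of the Sequential Colour Scheme Generator, and the function names ciede2000 and mean_adjacent_distance are hypothetical. Colours are assumed to be given as (L*, a*, b*) tuples, so the conversion from hex codes is left out.

```python
import math


def ciede2000(lab1, lab2):
    """CIEDE2000 colour difference between two (L*, a*, b*) tuples (kL = kC = kH = 1)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar**7 / (c_bar**7 + 25**7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    # Hue angles in degrees, in [0, 360); defined as 0 for achromatic colours.
    h1p = math.degrees(math.atan2(b1, a1p)) % 360 if c1p else 0.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360 if c2p else 0.0

    d_l = L2 - L1
    d_c = c2p - c1p
    if c1p * c2p == 0:
        d_h_angle = 0.0
    else:
        d_h_angle = h2p - h1p
        if d_h_angle > 180:
            d_h_angle -= 360
        elif d_h_angle < -180:
            d_h_angle += 360
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_angle / 2))

    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    if c1p * c2p == 0:
        h_bar = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        h_bar = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        h_bar = (h1p + h2p + 360) / 2
    else:
        h_bar = (h1p + h2p - 360) / 2

    t = (1 - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p**7 / (c_bar_p**7 + 25**7))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    return math.sqrt((d_l / s_l) ** 2 + (d_c / s_c) ** 2 + (d_h / s_h) ** 2
                     + r_t * (d_c / s_c) * (d_h / s_h))


def mean_adjacent_distance(lab_colours):
    """Average CIEDE2000 distance between adjacent classes of a legend."""
    pairs = zip(lab_colours, lab_colours[1:])
    return sum(ciede2000(c1, c2) for c1, c2 in pairs) / (len(lab_colours) - 1)
```

For a five-class scheme with a constant colour distance between neighbouring classes, mean_adjacent_distance should return approximately that constant.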

Size Distance
Size distance indicates the circle diameter difference between neighbouring classes. In this work, ten different size distances were used, ranging from 2.5 to 25 in steps of 2.5. Circle sizes were specified in pixels. For all size distances, the class representing the lowest values of the dataset was represented with a circle diameter of 10 pixels. The size distance between adjacent classes stayed the same on a map. This means, for instance, that a size distance of 2.5 provided circles with diameters of 10, 12.5, 15, 17.5 and 20; the largest size distance of 25 generated circles with diameters of 10, 35, 60, 85 and 110.
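The mapping from a size distance to the circle diameters of the five classes can be sketched as follows. The function name circle_diameters and its parameters are hypothetical; the defaults (five classes, smallest diameter of 10 pixels) are taken from the description above.

```python
def circle_diameters(size_distance, n_classes=5, base_diameter=10.0):
    """Diameters (in pixels) of the circles for each class, smallest class first."""
    return [base_diameter + i * size_distance for i in range(n_classes)]


# The ten size distances used in the study: 2.5, 5.0, ..., 25.0
size_distances = [2.5 * k for k in range(1, 11)]

print(circle_diameters(2.5))   # [10.0, 12.5, 15.0, 17.5, 20.0]
print(circle_diameters(25))    # [10.0, 35.0, 60.0, 85.0, 110.0]
```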

Prototype
The prototype was created using React-Native, an open-source JavaScript framework to develop mobile applications on multiple platforms (e.g. iOS, Android) utilizing the same code base. React-Native uses a component-based approach to build fast and responsive user interfaces. It also comes with a reload feature that allows one to make and see changes in real-time. Both choropleth and graduated symbol maps are integrated with Mapbox GL JS in React-Native. Mapbox is an easy-to-implement mapping system and gives developers the control to style and customize maps individually by adding custom markers, polygons or polylines. The geographic focus of this work was Europe.
To visualize SDG (Sustainable Development Goal) data for European countries, their geometries were needed first. For performance reasons, a GeoJSON with low resolution (110 m) was used, which is sufficient for conducting the user study. The GeoJSON is freely available for download. Datasets displayed on the maps were open-source and were provided as JSON datasets from the articles at Our World in Data. The generation of the maps for the user study via the prototype is described below. The source code is available on GitHub.

Generating Choropleth Maps
Sequential colour schemes were applied on choropleth maps. There were five different blue colours with a fixed colour distance between adjacent classes. The colour hexes were calculated by the sequential colour scheme generator developed by Brychtová (2015a). Since no colour can be labelled as the standard colour for choropleth maps, the choice of blue as a colour for the experiment stems from the authors' subjective preference. European countries (geometries) are coloured according to the dataset value they represent and the class these values fall into. Countries without data are coloured grey.

Generating Graduated Symbol Maps
Circles are the (de facto) standard symbols for graduated symbol maps, and hence our GSMs used them as symbols. The GSMs were generated by displaying circles of five different sizes in relation to the dataset value and the class it belongs to. The centroid (i.e. the arithmetic mean position of a polygon) can lie outside a country if its shape is irregular, which is unfavourable for the user study. For example, the arithmetic mean position of Norway could be outside of Norway (for instance in Sweden), which means that this circle could no longer be assigned to Norway. Therefore, circles were placed at the countries' pole of inaccessibility, the most distant internal point from the polygon outline (Garcia-Castellanos and Lombardo 2007). No circle was presented for a country without data.
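The difference between the centroid and the pole of inaccessibility can be illustrated on a simple U-shaped polygon. The Python sketch below is only a brute-force grid approximation written for illustration: it is neither the algorithm of Garcia-Castellanos and Lombardo (2007) nor the code used in the prototype, and the function names, the example polygon and the grid resolution are assumptions.

```python
import math


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices without repetition."""
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(px, py, x1, y1, x2, y2):
    """Euclidean distance from point (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(px - x1, py - y1)
    t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area = cx = cy = 0.0
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        cross = x1 * y2 - x2 * y1
        area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3 * area), cy / (3 * area)


def approximate_pole(ring, steps=60):
    """Grid search for the interior point farthest from the outline.

    Returns (x, y, distance). The result is only as precise as the grid.
    """
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    min_x, min_y = min(xs), min(ys)
    width, height = max(xs) - min_x, max(ys) - min_y
    best = (min_x, min_y, -1.0)
    for i in range(steps + 1):
        for j in range(steps + 1):
            x = min_x + i * width / steps
            y = min_y + j * height / steps
            if not point_in_polygon(x, y, ring):
                continue
            d = min(distance_to_segment(x, y, *ring[k], *ring[(k + 1) % len(ring)])
                    for k in range(len(ring)))
            if d > best[2]:
                best = (x, y, d)
    return best


# Hypothetical U-shaped "country": the centroid falls into the notch.
u_shape = [(0, 0), (6, 0), (6, 6), (4, 6), (4, 2), (2, 2), (2, 6), (0, 6)]
```

For this shape, polygon_centroid returns a point inside the notch (outside the polygon), whereas approximate_pole returns a point inside the polygon, which is the property that motivated the placement of the circles.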

Tasks
We focused on question-answering tasks with mobile maps. These questions are typically useful during the exploratory stage of data analysis. Sarikaya et al. (2018) identified six types of high-level data characteristics: trends, outliers, clusters, frequency, distribution, and correlation. To make the study manageable, the focus was on cluster-questions (i.e. identification of geographic entities that belong to a common class). Since the questions involved the 'attribute-in-space' operand of Roth (2013a), they are attribute-in-space/cluster questions. We originally considered including several types of questions (e.g. cluster, frequency), but it proved challenging to distribute them in a balanced way across map types and participants. Also, it would have been more challenging to extract models relevant only to one question type, if the data was collected by mixing up question types. For that reason, we came to the conclusion that it is best to build models that address one type of question at a time (trends, outliers, clusters, frequency, and so on), and see in the future how these can be combined into models that generalize over question types. We did not control for panning and zooming behaviour, nor did we measure it. Since the map was interactive, the participants were free to pan/zoom as much as they wanted to, in order to find answers to the questions. The datasets interacted with touched on different topics: proportion of women in managerial positions (SDG 5.5.2); share of children who report being bullied; share of people disagreeing that vaccines are safe; and share of tax revenue. Though the datasets communicated temporal information, that temporal information was only shown in the title (see Fig. 1). Hence, there was no need to provide assistance for temporal navigation across different years. The structure of the study and the information-gathering tasks are available in the supplementary material (Sect. 7, Appendix A).

Study Variables
The independent variables of the study were: different types of map visualization (Choropleth Maps and Graduated Symbol Maps); different colour distances on Choropleth Maps, and different size distances on Graduated Symbol Maps. Previous studies reported that the spatial distance between symbols affects the performance of users during map reading tasks on choropleth maps (Brychtová and Çöltekin 2017b) and graduated symbol maps (Cybulski 2020). Nonetheless, these studies were done using static maps. In the current case, the map is interactive and thus the spatial distance between the enumeration units (countries) was constantly changing because the gap between symbols varies as one zooms in or out. Hence, the implications of these studies are not directly transferable to the current study. As a result, the spatial distance was not considered as a variable during the study.
The main dependent variable is the user experience of the participants. User experience has several facets, notably a pragmatic and a hedonic dimension (see e.g. Hassenzahl 2005). As for the pragmatic dimension, we measured both accuracy (percentage of correct answers) and efficiency (time in seconds needed for solving the tasks). The following formula was used to calculate the accuracy of a participant's answer to a question about the countries fulfilling a constraint (see supplementary material, Sect. 7, Appendix A):
$$\mathrm{accuracy} = \frac{cag}{\max(ag, ca)}$$
where cag is the number of correct answers given, ag is the total number of answers given by a participant and ca is the number of correct answers. As for the hedonic dimension, we used a modified version of the NASA TLX questionnaire. The NASA Task Load Index (TLX) is a questionnaire that is used to assess workload on six dimensions (Hart 2006). The overall NASA TLX score derives from the subscales mental demand, physical demand, temporal demand, performance, effort and frustration. Since the user study does not ask for any physical effort, the dimension 'physical demand' was not captured. Participants were not pressured by time, therefore the dimension 'temporal demand' was also omitted. Moreover, the NASA TLX was simplified by reducing the number of scale points to seven. We measured the perceived mental demand (how mentally demanding was the task?); perceived performance (how successful were you in accomplishing what you were asked to do?); perceived effort (how hard did you have to work to accomplish your level of performance?); and perceived frustration (how insecure, discouraged, irritated, stressed and annoyed were you?). All four hedonic variables were measured on a seven-point Likert scale from 1 to 7, where 1 equals low and 7 equals high. A value of 4 represents the middle of the scale, which means neither low nor high.
In sum, user experience was measured through six variables: two touching on the pragmatic dimension and four on the hedonic dimension. The modified version of the NASA TLX questionnaire used in the work is available as supplementary material (Sect. 7, Appendix B).
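As an illustration of the accuracy measure (correct answers given, divided by the larger of the number of answers given and the number of correct answers), a minimal Python sketch is shown below. The function name, the representation of answers as sets of country codes and the handling of the empty case are assumptions; the country codes are made up for the example.

```python
def accuracy(answers_given, correct_answers):
    """Accuracy of one answer: cag / max(ag, ca), as a fraction between 0 and 1."""
    given = set(answers_given)
    correct = set(correct_answers)
    cag = len(given & correct)   # correct answers given
    ag = len(given)              # answers given
    ca = len(correct)            # correct answers
    if max(ag, ca) == 0:
        # Assumption: the source does not define this case.
        raise ValueError("accuracy is undefined without any answers")
    return cag / max(ag, ca)


# Hypothetical question with four correct countries.
correct = {"DE", "FR", "IT", "ES"}
print(accuracy({"DE", "FR", "PL"}, correct))               # 0.5 (2 of 4 found, one wrong)
print(accuracy({"DE", "FR", "IT", "ES", "PL"}, correct))   # 0.8 (all found, one extra)
```

Dividing by the maximum penalizes both missing countries and superfluous ones: naming every country on the map would not yield a perfect score.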

Procedure and Apparatus
Each session involved one participant and a moderator. First, the moderator gave a brief explanation of the objective and the procedure of the study. The participant was then asked to sign the consent forms and an additional statement on vision, declaring that they had normal or corrected-to-normal vision (by wearing glasses or contact lenses). After that, the mobile device with the pre-installed prototype application was handed out. A Samsung Galaxy A32 5G (display size: 720 × 1600 pixels) was used throughout the experiment. An online questionnaire in LimeSurvey was provided to the participants on a separate laptop. They first filled out a questionnaire about background information. They then went on to interact with the prototype application and give their answers to the questions. At the end of the study, the moderator wrapped up the session by asking the participant if there was some feedback they were willing to share and giving them their rewards (sweets). The study was conducted in a room where sunlight did not obscure the participant's view of the map, because unfavourable lighting conditions could affect colour perception on choropleth maps and thus worsen the study results and the time taken to solve the tasks. The study was pilot-tested and approved by the institutional ethics board.

Participants
Thirty users participated in the experiment. They were divided into two groups. The first group included the first 20 participants, whose study results were used to develop the models. In this group, there were eleven male and nine female participants. The majority of the participants were between 23 and 27 years old (11/20). Seven were 28-32 years old and two participants stated that they were older than 33 (34 and 58). All participants were frequent mobile map users (at least once a week). Twelve participants reported being very familiar with choropleth maps (12/20). The remaining eight participants reported being familiar with choropleth maps. None reported being unfamiliar with them. Regarding graduated symbol maps, six participants were very familiar with them. Ten participants mentioned being familiar with graduated symbol maps, whereas four participants reported being unfamiliar with them. The second group included the remaining ten participants, whose data was used to test the predictive performance of the models. It included seven female and three male participants. One participant was between 18 and 22 years old. Four participants stated being 23 to 27 years old. Three participants reported an age between 28 and 32, while two stated being older than 33 (34 and 35). All participants were frequent mobile map users (at least once per week). Most participants stated being very familiar with choropleth maps (7/10), while two participants reported being familiar with them. Only one participant reported being unfamiliar with choropleth maps. Regarding graduated symbol maps, three participants were very familiar with them. Five reported being familiar with them, while two stated being unfamiliar with graduated symbol maps.

Data Analysis
The aim of the modelling process was to find the model that best fits the median values for each dependent variable on both visualization types. As described above, the data from the first 20 participants in the experiment was used to learn the models. Five different polynomial regression functions were tested to fit the data by using the lm() linear model function in R, i.e. the tests went from a linear regression model to a fifth-degree polynomial regression model. The best fit was identified by comparing the Bayesian information criterion (BIC) values of these five models. The BIC discounts goodness-of-fit to the extent that it is realized by a model that is overly complex, and hence implements the principle of parsimony (see Vandekerckhove et al. 2015). The BIC leads to models with lower prediction error compared to other criteria (e.g. adjusted R squared, see Sharma et al. 2021). The model with the lowest BIC was selected as the "best" fit for a given set of data points. The predictive performance of the best fit was then evaluated using the data from the remaining ten participants. The units of the different dependent variables (accuracy, efficiency, and so on) are different. To make the results comparable across dependent variables and map types, a delta ( Δ ) was computed and expressed in percentage of unseen, empirical data. For accuracy and efficiency, Δ =

Results
As mentioned in Sect. 3.4, the BIC was used to select the best fit for the data among the five polynomial regression models. The BIC is useful for choosing between two or more alternative models (the lower the BIC, the better the model), but its absolute values have no practical meaning. In contrast, the adjusted R-squared ($R^2$) indicates the percentage of the variation in the response variable that can be explained by the predictor variable in the model, adjusted for the number of predictor variables (Yang and Berdine 2015). The higher the $R^2$ value, the better the model. Because absolute values of $R^2$ are easier to interpret, they are reported in this section. To reduce the density of the plots, only the best fit is visualized. Table 1 gives an overview of the parameters of the aforementioned predictive models. Each equation for a dependent variable is presented separately in the corresponding section below. As for the error on the test datasets, we report the average (i.e. median value) of all deltas across all colour/size distances, because the median is more robust to outliers.

Accuracy
Figure 2a, b present the accuracy results on choropleth and graduated symbol maps respectively. The median accuracies ranged from 40.5 to 98.5% on choropleth maps and from 82 to 100% on graduated symbol maps. On choropleth maps, participants performed best at colour distances 10 (98.06%) and 11 (98.53%) and worst at colour distances 2 (40.54%) and 4 (49%). The best performances on graduated symbol maps were at size distances 10, 17.5 and 25 (100% each). On this visualization type, only two size distances performed worse than 95%, namely size distances 2.5 (82%) and 7.5 (91.26%).
On choropleth maps, the best fit was a first-degree polynomial regression model, which had an adjusted R-squared of 0.8835. The equation of this linear model is:
$$y = 5.8217x + 33.046$$
The predictions on the test dataset using this model had an average error (median) of 10.83% of unseen empirical data (Mean: 14.65%, Max: 56.89%, Min: 0.29%, SD: 13.04%). That is, on average, the accuracy values predicted by the model and those of the test data differed by 10.83%.
The best fit on graduated symbol maps was a third-degree polynomial, which had an adjusted R-squared of 0.5119 (see Table 1 for its parameters). Here, on average, the accuracy values predicted by the model and those of the test data differed by 2.14% (Mean: 7.33%, Max: 34.51%, Min: 0%, SD: 11.12%).
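As a worked illustration, the linear accuracy model for choropleth maps can be evaluated at the colour distances whose medians are quoted above. The sketch below is in Python; `relative_gap` is an illustrative measure assumed here and is not necessarily the Δ defined in the data analysis section. The comparison also uses the medians of the first 20 participants, not the test data.

```python
def predict_accuracy_cm(colour_distance):
    """Linear accuracy model for choropleth maps, as reported in the text."""
    return 5.8217 * colour_distance + 33.046

def relative_gap(predicted, observed):
    """Absolute difference as a percentage of the observed value (assumed measure)."""
    return abs(predicted - observed) / observed * 100

# Median accuracies (%) of the first 20 participants quoted in the text.
reported_medians = {2: 40.54, 4: 49.0, 10: 98.06, 11: 98.53}

for distance, observed in reported_medians.items():
    predicted = predict_accuracy_cm(distance)
    gap = relative_gap(predicted, observed)
    print(f"colour distance {distance}: predicted {predicted:.2f}%, "
          f"median {observed:.2f}%, gap {gap:.1f}%")
```

At colour distance 10, for example, the model gives 5.8217 × 10 + 33.046 = 91.263%, against a quoted median of 98.06%.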

Efficiency
The diagrams for efficiency are shown in Fig. 3a, b. Here, each black data point represents the average time taken by participants across all tasks in the four sections (see supplementary material, Sect. 7, Appendix A). The average times needed to answer one question ranged around 50-300 s on both map types. On choropleth maps, participants were fastest at colour distances 10 (108.64 s) and 11 (111.63 s) and slowest at colour distances 2 (166.5 s) and 3 (168.74 s). The results on graduated symbol maps show the fastest performances at size distances 7.5 (104.97 s) and 20 (111.38 s) and the slowest at size distances 2.5 (162.5 s) and 5 (170.27 s) for each question.
On graduated symbol maps, the best fit was obtained with a linear equation ($R^2 = 0.1711$; see Table 1 for its parameters). The efficiency values predicted by the model and those of the test data differed by 22.64% on average (Mean: 22.79%, Max: 51.72%, Min: 5.09%, SD: 13.77%).

Perceived Mental Demand
Figure 4a shows that the median values ranged between 2.5 (colour distance 10) and 6.5 (colour distance 3) on choropleth maps. On graduated symbol maps (Fig. 4b), the values spanned between 2.5 (size distance 12.5) and 5.5 (size distance 2.5). The mental demand was rated highest at colour distances 3 (6.5) and 5 (6) and lowest at colour distances 10 (2.5) and 11 (3) on choropleth maps. Regarding graduated symbol maps, the mental demand ratings were highest at size distances 2.5 (5.5) and 5 (5) and lowest at 12.5 (2.5) and 15 (3).
The data points regarding graduated symbol maps were best fitted by a fourth-degree polynomial regression model ($R^2 = 0.6815$). The equation of this model is:
$$y = -0.0001734x^4 + 0.0088982x^3 - 0.1381002x^2 + 0.53885x + 4.875$$
On average, the mental demand values predicted by the model and those of the test data differed by 14.17% (Mean: 18.60%, Max: 52.00%, Min: 0%, SD: 13.21%).

Perceived Performance
The average performance ratings were 4.35 on choropleth and 5.55 on graduated symbol maps. The values ranged from 2.5 (colour distance 5) to 5.5 (colour distances 10 and 11) with approximately equal distribution over all colour distances (Fig. 5a). On graduated symbol maps (Fig. 5b) the span was between 3 (size distance 2.5) and 6.5 (size distances 10 and 20). Notably, apart from size distance 2.5, all size distances were rated 5 or higher. The performance on choropleth maps was rated highest at colour distances 10 and 11 (5.5 each) and lowest at colour distances 5 (2.5) and 3 (3.5). The performance on graduated symbol maps was rated highest at size distances 10 and 20 (6.5 each) and lowest at size distances 2.5 (3) and 22.5 (5).
On choropleth maps, a linear equation fitted the data points best ($R^2 = 0.4104$):
$$y = 0.21515x + 2.95152$$
The performance values predicted by the model and those of the test data differed by 14.92% on average (Mean: 15.63%, Max: 36.50%, Min: 0.50%, SD: 9.20%).
The best fit on graduated symbol data was a third-degree regression model ($R^2 = 0.6343$):
$$y = 0.001486x^3 - 0.073706x^2 + 1.11251x + 0.966667$$
The performance values predicted by the model and those of the test data differed by 19.83% on average (Mean: 21.95%, Max: 76.83%, Min: 1.17%, SD: 16.05%).

Perceived Effort Demanded
The average effort ratings were 5.15 on choropleth and 4.4 on graduated symbol maps. The values ranged from 3 (colour distance 10) to 6.5 (colour distance 5) on choropleth maps. Notably, apart from colour distances 10 and 11, all colour distances were rated 5 or higher (Fig. 6a). The effort on choropleth maps was rated highest at colour distances 3 (6) and 5 (6.5) and lowest at colour distances 10 (3) and 11 (3.5). On graduated symbol maps (Fig. 6b), the span was between 3 (size distance 15) and 5.5 (size distance 2.5). The lowest ratings were provided
On graduated symbol maps, the best fit was a second-degree regression model ($R^2 = 0.2412$; see Table 1 for its parameters). The effort values predicted by the model and those of the test data differed by 11.83% on average (Mean: 13.95%, Max: 31.83%, Min: 1.3%, SD: 11.03%).

Perceived Frustration Level
Choropleth maps were most frustrating at colour distances 3, 4 and 8 (5.5 each) and least frustrating at colour distances 10 (1.5) and 11 (2.5) as seen in Fig. 7a. On graduated symbol maps (Fig. 7b), the highest level of frustration was perceived when working with size distance 2.5 (5.5) and the lowest at size distance 10 with 1 (very low frustration) as well as 5, 15 and 25 (1.5 each).
On choropleth maps, a fifth-degree polynomial regression model was the best fit ($R^2 = 0.6093$):
$$y = 0.007051x^5 - 0.23048x^4 + 2.840035x^3 - 16.352855x^2 + 43.312005x - 36.575758$$
The frustration values predicted by the model and those of the test data differed by 29% on average (Mean: 29.50%, Max: 66.83%, Min: 1.50%, SD: 20.40%).
On graduated symbol maps, the best fit was a third-degree polynomial regression model ($R^2 = 0.4385$). The equation of this model is:
$$y = -0.002959x^3 + 0.132354x^2 - 1.747591x + 8.716667$$
The frustration values predicted by the model and those of the test data differed by 20.75% on average (Mean: 24.45%, Max: 68.83%, Min: 1.33%, SD: 17.33%).

Multiverse Analysis
As mentioned in Sect. 3, a key step during the analysis was the choice of a representative value for each colour/size distance. While the median was used to compute that representative value, there are alternative techniques (e.g. mean or mode). Since the regression model fits the representative values across all colour/size distances, the form and monotonicity of the mathematical function may be influenced by where exactly the representative value for each colour/size distance lies. Conducting (large-scale) user studies that control several parameters (task, device, user background), so that we can generalize across settings and become more confident about the "true" representative value for each colour/size distance, is beyond the scope of this work.
To evaluate the sensitivity of the results to the location of the representative values, we conducted a sensitivity analysis (a.k.a. multiverse analysis). As discussed in Steegen et al. (2016), a multiverse analysis acknowledges that datasets during an analysis are, to some extent, actively constructed. It involves performing the analysis of interest across the whole set of datasets that arise from different reasonable choices for data processing. It offers an idea of how much the conclusions change because of arbitrary choices in data construction. The key idea was to shed some light on the question "how sensitive are the results to alternative locations of the representative values?"
The analysis iterates over all input data points about a million times ($4^{10}$ = 1,048,576 times exactly: we have four values per step and 10 steps). At each iteration, we take a value randomly from the four input values available for a step (= colour/size distance). We then look for the best regression model given those randomly chosen input points and document both its form (1st deg, 2nd deg and so on) and monotonicity (monotonic vs non-monotonic). This is basically equivalent to assuming that any of the values collected has an equal chance to be taken as the "representative value" if more data is collected, and simulating the possible consequences of that scenario. Table 2 summarizes the frequencies of occurrences of the function forms as well as the monotonicity trends across all 1,048,576 iterations. The following observations can be made about the models for the different aspects of user experience:
• Accuracy Choropleth: likely a 1st-degree polynomial, likely monotonic;
• Accuracy GSM: likely a 5th-degree polynomial, likely non-monotonic;
• Efficiency Choropleth: likely a 1st-degree polynomial, likely monotonic;
• Efficiency GSM: likely a 1st-degree polynomial, likely monotonic;
• Mental Demand Choropleth: the results are inconclusive. Although a 1st-degree polynomial is the most frequent best model, the probability that the model is non-monotonic lies at 62%;
• Mental Demand GSM: likely a 1st-degree polynomial, likely monotonic;
• Performance Choropleth: the results are inconclusive. Although a 1st-degree polynomial is the most frequent best model, the probability that the model is non-monotonic lies at 54%;
• Performance GSM: likely a 5th-degree polynomial, likely non-monotonic;
• Effort Choropleth: likely a 2nd-degree polynomial, likely non-monotonic;
• Effort GSM: likely a 1st-degree polynomial, likely monotonic;
• Frustration Choropleth: likely a 1st-degree polynomial, likely monotonic;
• Frustration GSM: likely a 1st-degree polynomial, likely monotonic.
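The idea behind the multiverse analysis can be sketched in a few lines of Python. The sketch is deliberately simplified and rests on several assumptions: it enumerates every combination of one value per step (one reading of the $4^{10}$ count) instead of sampling, it fits only a straight line per combination and records the sign of its slope (whereas the study selects among five polynomial degrees), and it uses five steps of invented ratings rather than the study data. The function names are hypothetical.

```python
from itertools import product

def ols_slope(xs, ys):
    """Slope of the least-squares straight line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def multiverse_slope_signs(steps, values_per_step):
    """Tally the sign of the fitted slope over every choice of one value per step."""
    tally = {"increasing": 0, "decreasing": 0, "flat": 0}
    for chosen in product(*values_per_step):
        slope = ols_slope(steps, chosen)
        if slope > 0:
            tally["increasing"] += 1
        elif slope < 0:
            tally["decreasing"] += 1
        else:
            tally["flat"] += 1
    return tally

# Hypothetical ratings: four values for each of five steps (not study data).
steps = [2.5, 5, 7.5, 10, 12.5]
values_per_step = [
    [5, 6, 5.5, 4],
    [5, 4, 6, 3],
    [3, 4, 2, 5],
    [2, 3, 4, 1],
    [3, 2, 4, 5],
]
tally = multiverse_slope_signs(steps, values_per_step)
print(tally, "universes:", sum(tally.values()))  # 4**5 combinations in this toy case
print("combinations with ten steps:", 4 ** 10)
```

The share of combinations falling into each category plays the role of the frequencies reported in Table 2: the more the tally concentrates on one outcome, the less the conclusion depends on which value is taken as representative.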

Theoretical Value of the Models
The section addresses four topics: the usefulness of the models for hypothesis generation, their theoretical implications for the study of visual variables, the lessons learned about predictors of mobile map UX, and the relevance of the models for plastic maps.

Hypothesis Generation
As mentioned in Sect. 3, one key aim and outcome of the study is to generate hypotheses about predictive models. A look at the equations (Table 1) suggests some tentative statements about the form of the mathematical functions that predict mobile map UX. A linear equation suggests a function of the form dependent variable = $a + b \cdot X$, where $X$ denotes either a colour distance or a size distance; a second-degree polynomial function suggests a function of the form dependent variable = $a + b \cdot X + c \cdot X^2$; and so on. In the long run, once the form of the model is validated for a class of questions (e.g. clusters), the constants may be learned through calibration work for other settings (e.g. devices). A second direction in which the results can help formulate hypotheses is about the monotonicity of the relationship between the colour/size distance and the dependent variable (i.e. whether these are monotonic or non-monotonic relationships). Below, we summarize the hypotheses that can be formulated based on the data. We formulate a hypothesis when the three pieces of evidence available (i.e. function obtained from the analysis using the median value, most frequent function form and monotonicity trends) converge. When this is not the case, we refrain from making a statement. A look at the graphs from Sect. 4 and at Tables 1 and 2 suggests the following:
• Colour distance and accuracy: the relationship is likely linear (and hence monotonic);
• Size distance and accuracy: the relationship is likely non-linear and non-monotonic;
• Colour distance and efficiency: the relationship is likely linear (and hence monotonic);
• Size distance and efficiency: the relationship is likely linear (and hence monotonic);
• Size distance and perceived performance: the relationship is likely non-linear and non-monotonic;
• Colour distance and perceived effort: the relationship is likely non-linear and non-monotonic;
• No strong hypotheses can be formulated for the following relationships, based on the current data:
  - Colour distance and perceived mental demand
  - Size distance and perceived mental demand
  - Colour distance and perceived performance
  - Size distance and perceived effort
  - Colour distance and perceived frustration
  - Size distance and perceived frustration
In sum, the pieces of evidence (i.e. models based on the median plus sensitivity analysis) have converged for the pragmatic aspects of user experience (accuracy and efficiency), but are still conflicting for the majority of hedonic aspects of user experience (perceived mental demand, perceived performance, perceived effort, and perceived frustration level). Formulating these hypotheses about the function form and the monotonicity properties would have been challenging using machine/deep learning models. It should also be noted that the hypotheses only apply to mobile devices, cluster tasks and the range of colour/size distances investigated. Indeed, the UX may exhibit one trend (e.g. linear) on a range of values and another trend (e.g. logarithmic) on another range of values. The trends may also be different if the type of device/task changes, but this needs to be investigated in future work.
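Whether a fitted polynomial is monotonic on the range studied can be checked numerically from the sign of its derivative. The Python sketch below applies such a check to two of the median-based equations reported in the results. The helper names are hypothetical, the check samples the derivative on a grid rather than solving for its roots, and the colour distance range 2 to 11 is assumed from the smallest and largest distances quoted in the text.

```python
def poly_eval(coefs, x):
    """Evaluate a polynomial (coefficients from highest to lowest degree) by Horner's rule."""
    result = 0.0
    for c in coefs:
        result = result * x + c
    return result

def derivative(coefs):
    """Coefficients of the derivative, highest degree first."""
    degree = len(coefs) - 1
    return [c * (degree - i) for i, c in enumerate(coefs[:-1])]

def is_monotonic_on(coefs, lo, hi, samples=1000):
    """True if the derivative keeps one sign at every sampled point in [lo, hi]."""
    d = derivative(coefs)
    slopes = [poly_eval(d, lo + (hi - lo) * i / samples) for i in range(samples + 1)]
    return all(s >= 0 for s in slopes) or all(s <= 0 for s in slopes)

# Median-based models reported in the text (coefficients, highest degree first).
accuracy_cm = [5.8217, 33.046]                                # choropleth accuracy
frustration_gsm = [-0.002959, 0.132354, -1.747591, 8.716667]  # GSM frustration

print("accuracy, colour distances 2-11:", is_monotonic_on(accuracy_cm, 2, 11))
print("frustration, size distances 2.5-25:", is_monotonic_on(frustration_gsm, 2.5, 25))
```

Such a check only covers the function obtained from the median values, which is one of the three pieces of evidence considered before a hypothesis is formulated.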
Bertin (1983) suggested seven visual variables, which were subsequently extended into a list of 12 (Roth 2017): location, size, shape, orientation, colour hue, colour value, colour saturation, orientation texture, arrangement, crispness, resolution and transparency. Because the choropleth maps investigated in this work use colour value and the graduated symbol maps use size as visual variables, the hypotheses made just above on the form of the modelling function and its monotonicity are implicit hypotheses about the trajectory of the impact of colour value and size on design. Hence, they can inform a theory of visual variables. For instance, Garlandini and Fabrikant (2009) compared participants' response times (efficiency) for change detection tasks using size and colour value and reported that size was faster. A look at Fig. 3 suggests that the assertion that 'size is faster than colour' cannot hold for all colour and size distances. Though the purpose of the study was not to compare CMs and GSMs, the figure suggests that for the ranges of colour/size distances investigated and the cluster tasks, efficiency data fall in a similar range of values. The colour/size distances were not documented in Garlandini and Fabrikant (2009), but one can guess that they were such that the efficiency values for the two visual variables fell in different ranges in their study.

Predictors of Mobile Map UX
The $R^2$ values ranged from 41 to 88% for colour distances, and from 17 to 68% for size distances (Table 1). These values are relatively high for a single regressor and suggest that colour and size distances are good candidate predictors for models of mobile map UX on CMs and GSMs respectively. Another implication from Table 1 is that authors should discuss the possible impact of colour/size distances on mobile map UX ratings obtained in their experiments. Colour/size distances indeed appear to have a non-negligible contribution to the dependent variable measured.

Plastic Maps
Kray and Degbelo (2019) adapted the idea of user interface plasticity (Thevenin and Coutaz 1999;Coutaz 2010) to the domain of maps and proposed the concept of map plasticity. Like in the original plasticity framework, 'plastic maps' foresee an interactor model. A recommendation made about interactors in Thevenin and Coutaz (1999) was that they should "specify the abstract data types they are able to handle. They should also be able to evaluate their appropriateness as well as their rendering cost" (Page 113). The models of map UX learned in this work could be one way of informing these interactors as they are evaluating their appropriateness in maximizing the user experience of map users.

Practical Usefulness of the Models
In addition to their usefulness in advancing cartographic theory discussed above, the mathematical models also have some practical value. For instance, several authors have mentioned that empirically-derived guidelines for map design are still lacking (Roth 2013b;Kray et al. 2017;Degbelo et al. 2022). Though the $R^2$ values from 17 to 88% remind us that not every model is a 'good' fit for the data points collected at this point, the empirically-derived models are still useful to formulate some recommendations (e.g. they give some indications about trends of the dependent variables for both map types). A word of caution: using them for predictions should be done only if the designer considers the error ranges documented above to be acceptable for their task. The functions modelling accuracy and efficiency (Figs. 2 and 3) show that increasing the colour distance on choropleth maps implies having greater success and needing less time to answer questions. The trend that accuracy improved with increasing colour distances was also observed in Brychtová and Çöltekin (2017b), who recommended the colour distance 10 to cartographers and information visualization designers. Our data suggest that colour distance 11, which was not tested in their work, might yield even better results from the accuracy and efficiency point of view. Taking all six models into consideration, colour distances 10 and 11 are favourites among the colour distances investigated, if designers aim for the 'best' user experience as users answer cluster questions on mobile devices.
On graduated symbol maps, the model anticipates the lowest accuracy at a size distance of 2.5. The remaining size distances might produce accuracies greater than 90%. The efficiency model suggests that a graduated symbol map with a size distance of 25 would be best to answer questions efficiently. Combining these two factors, the size distance of 25 would be best to answer cluster questions, if only pragmatic aspects of UX are important. Map users might perceive the least mental demand at size distances 12.5 and 15 according to our model; perceive the highest performance level at size distances 10 and 12.5; perceive that the least effort is demanded at size distances 12.5, 15 and 17.5; and be less frustrated at size distances 7.5, 10, 12.5 and 25. Taking all these factors into account, the models suggest a graduated symbol map with a size distance of 12.5 as adequate for the 'best' user experience if hedonic aspects of UX should be given more weight.
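A designer could use the models in this way to shortlist candidate settings. The Python sketch below evaluates the third-degree frustration model for graduated symbol maps reported in the results at the ten size distances and keeps those whose predicted frustration falls below a cut-off. The cut-off of 2 and the names used are assumptions made for this illustration only.

```python
def frustration_gsm(size_distance):
    """Third-degree frustration model for graduated symbol maps, as reported in the text."""
    x = size_distance
    return -0.002959 * x**3 + 0.132354 * x**2 - 1.747591 * x + 8.716667

# The ten size distances referred to in the text: 2.5, 5, ..., 25.
size_distances = [2.5 * step for step in range(1, 11)]

# Hypothetical cut-off chosen for this sketch only.
THRESHOLD = 2.0

low_frustration = [d for d in size_distances if frustration_gsm(d) < THRESHOLD]
for d in size_distances:
    print(f"size distance {d}: predicted frustration {frustration_gsm(d):.2f}")
print("below cut-off:", low_frustration)
```

With this cut-off, the sketch keeps size distances 7.5, 10, 12.5 and 25, the same distances named above as less frustrating. A different cut-off would yield a different shortlist, and the error ranges reported for the model should be kept in mind when using such predictions.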
Finally, an area where the models can be of practical value is that of toolkit design. Several toolkits have been proposed in Cartography research to support the creation of choropleth and/or graduated symbol maps. These include for instance SDG Viz (Gong 2019), AdaptiveMaps, the GAV Toolkit (Van Ho et al. 2012) and the Geoviz Toolkit (Hardisty and Robinson 2011). The default colour/size distance used in these toolkits has rarely, if at all, been discussed. Hence, our work can inform the design of future toolkits for interactive map creation, and more broadly of tools in need of good defaults to support the creation of choropleth/graduated symbol maps semi-automatically.

Limitations
One limitation of the work relates to the number of participants. Only four data points for each colour/size distance were used to build the models and two data points were used to test them. These were helpful to learn what the models could look like, but increasing the number of data points is needed to find out what the models truly look like. In addition, the number of colour/size distance steps used to construct the model could be extended (i.e. go beyond 10). The evaluation has not yet addressed the question of how well the model performs for colour/size distances we have not investigated (i.e. colour/size distances in-between two step values or beyond the range of the ten selected).
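To see why distances beyond the range studied need care, consider the linear accuracy model for choropleth maps evaluated at colour distance 12, one step above the largest colour distance quoted in the results (11). This is a plain evaluation of the reported equation, not a result from the study:

$$y(12) = 5.8217 \times 12 + 33.046 = 102.9064$$

An accuracy above 100% is not attainable, so this model at least cannot keep its linear form far beyond the range investigated.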
In addition, the experiment used real data and this comes with the risk of participants' prior knowledge as a confounding variable. Nonetheless, since the questions were of type cluster, answering them necessitated the interaction with the actual datasets given in the experiment, as opposed to general knowledge about the topic. That is, finding all countries that belong to a given cluster is a question that requires interaction steps with the map to find the specific spatial entities within the cluster. Given the nature of the task and the fact that we used four different datasets, the impact of the participants' individual levels of familiarity with each dataset on the overall results is likely minimal. Still, since the level of familiarity with each topic of the four datasets (e.g. share of children who report being bullied, rate of women in senior and middle management positions, and so on) was not explicitly controlled, no statement can be made at this point about the impact of participants' familiarity with the topics on the mathematical functions obtained.
Furthermore, the user group that participated was relatively homogeneous. The ages of 29 of the 30 participants ranged from 18 to 35, and only one participant was significantly older than the rest of the group (58). Hence, the results apply at best to the age group 18-35 only. From the implementation perspective, choropleth maps used the colour blue to visualize geographical data. The extent to which the results generalize to other colours still needs to be tested empirically. Likewise, circles were utilized to represent data on graduated symbol maps. The extent to which the results hold for other symbols such as squares or triangles needs to be investigated as well. In addition, the study investigated questions of the type cluster only. Asking different types of questions may lead to different tentative models.

Future Work
There is still much work to be done to develop mature computational models for the UX of maps and geovisualizations more broadly. These computational models should cover more thematic maps, more tasks, more users, more devices, more modalities and more design parameters. Once specific computational models for each dimension become available, there will be the question of combining them into a coherent framework and deriving, if possible, a general theory of UX of geovisualizations. Various types of thematic maps were already mentioned in Sect. 1; tasks relevant to thematic maps can be derived from the classifications provided in e.g. Roth (2013a) and Brehmer and Munzner (2013); a typical way of distinguishing users of interactive maps is through their abilities, expertise and/or motivation (see e.g. Roth 2013b; Degbelo 2022); additional devices could include desktop computers, interactive tabletops and surfaces, smartwatches or large displays; additional modalities could go beyond the visual to include the haptic, auditory, olfactory and gustatory modalities (Hogan 2018). Finally, additional design parameters, beyond those already mentioned in Sect. 5.3, could include the number of data classes, the number of alternative visualizations of the geographic dataset offered in a single interface (e.g. as a map, chart, data table), and the use or not of a dark mode.

Conclusion
This exploratory study has investigated what mathematical functions describing the user experience of maps on mobile devices could look like. The outcomes of the analysis are hypotheses about the behaviour of map user experience as one changes colour or size distance. The work also provided a quantification of the expected errors for predictions using the mathematical functions on unseen data. Overall, we have learned that some aspects of user experience are likely linear monotonic (e.g. the relationship between colour distance and accuracy) while some are probably not (e.g. the relationship between size distance and accuracy). A replication of this study in other contexts is needed to confirm these observations.

Supplementary Material
All scripts used during the analysis are available on GitHub (https://github.com/SulaxanSo/Mapbox_UX). The supplementary material showing the tasks given to the participants and the error values on the test dataset is available at https://doi.org/10.6084/m9.figshare.21908079.