A Systematic Literature Review of User Evaluation in Immersive Analytics

User evaluation is a common and useful tool for systematically generating knowledge and validating novel approaches in the domain of Immersive Analytics. Since this research domain centres around users, user evaluation is of extraordinary relevance. Additionally, Immersive Analytics is an interdisciplinary field of research where different communities bring in their own methodologies. It is vital to investigate and synchronise these different approaches with the long-term goal of reaching a shared evaluation framework. While there have been several studies focusing on Immersive Analytics as a whole or on certain aspects of the domain, this is the first systematic review of the state of evaluation methodology in Immersive Analytics. The main objective of this systematic literature review is to identify current practice in user evaluation in the domain of Immersive Analytics, in accordance with the PRISMA protocol, and thereby to illustrate methodologies and research areas that are still underrepresented in user studies.


Introduction
Immersive Analytics (IA) is a research domain that focuses on adding a dimension of immersion to visual data analysis. It provides users with new opportunities to engage with their data during the analysis process [CCC*15]. This is especially interesting for uncovering relations and patterns in multi-dimensional data and for use cases where spatial knowledge is beneficial [GHAWK16]. For example, immersive data representations can aid in discovering multidimensional clusters [KWO*20] or in decision making using immersive graph layouts [KMLM16]. Nevertheless, 2D representations are still more suitable for interaction that requires precision, such as data value measurement or comparison. This illustrates the necessity to not only include papers that specifically use the term IA but also earlier research from before the term became known within the community.
IA is an interdisciplinary field of research with different communities working on the topic. There are three main research communities in IA: Virtual and Augmented Reality, Visualisation, and Human-Computer Interaction. Since the initial definition of IA, there have been multiple workshops at conferences within these communities. In 2016, the first workshop was held at IEEE VR, alongside a Dagstuhl seminar and a workshop at the ACM ISS conference. The first workshop at IEEE VIS followed in 2017. At ACM CHI, there have been three workshops on IA so far, in the years 2019, 2020 and 2022. Additionally, there was a second workshop at ACM ISS in 2022, and a workshop on IA was also held at IEEE ISMAR in 2022 and 2023.
With each of these communities bringing their own background and research approach to the interdisciplinary field of IA, it is vital to formalise these practices. This is especially important, as the focus of IA is to enhance the data analysis process by providing multisensory interfaces [CCC*15]. This focus on human interaction and user experience requires evaluating these newly developed approaches in user studies. Moreover, there are concepts from different research disciplines that are especially relevant for user evaluation in IA. In this field of research, for example, place illusion and plausibility illusion [Sla09] need to be considered, as they may influence how well users are able to immerse themselves in the data. Furthermore, the use of immersive technology can introduce ergonomic issues or phenomena like simulator sickness, which in turn influence the suitability for data analysis. Another factor is the impact of this type of technology on mental workload, and we also need to acknowledge that the subjective user experience is an essential aspect of how well data analysis can be performed.
Besides, collaborative data analysis has been defined as an integral facet of IA [BCBM18], which adds another layer of complexity to the evaluation that should not be disregarded. In addition to all the factors relevant to IA for single users, aspects like awareness and social behaviour need to be considered. Such a collaborative data analysis process has, for example, been studied by Lee et al. Therefore, defining an evaluation framework in this complex environment has also been marked as one of the grand challenges in IA [EBC*21]. To achieve this long-term goal, we first need to explore how user evaluation has been performed in this field so far. While other surveys do report on research that focuses on evaluation [FP21, KFS*22], they do not look at the details of the evaluation, but rather report on this research as its own category and give a short overview. While this provides some guidance on evaluation, we believe that the research community would greatly benefit from a more detailed investigation of how user studies are performed in IA. With this state-of-the-art report, we lay the foundation for a discussion on how we conduct user evaluations and take the first step towards defining an evaluation process. Thus, in this work we elaborate on all aspects of the user evaluation process in the area of immersive analytics. In this process, we investigate study methodology and design to reveal which methods are still underrepresented in current IA user evaluation. Furthermore, we look at which immersive technologies and data representations are currently utilised in the literature. The evaluation of study participants, tasks and data sets as well as measures and data analysis methods provides guidance for future research in IA when designing a study. Finally, the analysis of the evaluation goals, and of the methodology used to evaluate those goals, leads us to common evaluation strategies. In summary, the main contributions of this systematic literature review are:
• Overview of current practices in user evaluation in IA
• Analysis of evaluation methodologies uncovering underrepresented study designs
• Review of relevant measures and methods to be considered in user evaluation of IA
• Guidance for future research by providing an overview of common evaluation strategies to achieve specific evaluation goals

Background and Related Work
Since empirical evaluations are critical for generating insights in a user-centred field of research, there have been several reviews of evaluation practices in different fields. Lam et al. [LBI*12] presented a review of evaluation in information visualisation consisting of seven evaluation scenarios. This includes four scenarios aimed at understanding data analysis and three scenarios to evaluate visualisations. For each scenario, goals and output, research questions and methods, and examples are provided. While this classification is mainly designed to guide evaluation, it has also been used as a basis for further systematic reviews of evaluation practices [IIC*13, MSK*20]. Isenberg et al. [IIC*13] extended these scenarios and classified papers over a period of ten years of research in visualisation within these categories, while Merino et al. [MSK*20] used it for the classification of evaluation scenarios in mixed and augmented reality, next to their other data categories of venue, research topic, cognitive aspects and configuration. Additionally, they applied the categorisation of paper types introduced by Munzner [Mun08], which categorises papers based on intention and projected outcome. This classification was also adopted as the primary classification by Merino et al. when investigating the evaluation of software visualisations [MGAN18]. While all of these reviews focus on empirical evaluation, they include user-based evaluations as well as system evaluations.
Swan and Gabbard, on the other hand, focus only on user-based evaluation in AR, classifying studies into the three categories of investigating Perception and Cognition, Performance and Interaction Techniques, and Collaboration [SIG05]. This classification was then extended with system usability studies by Dünser et al. [DGB08]. Moreover, they introduced five categories for study approaches and methods: objective measurements, subjective measurements, qualitative analysis, usability evaluation techniques and informal evaluations. Dey et al. [DBLS18] then investigated usability studies in AR over a ten-year period, looking at study type, collected data, supported senses, study design, participants, display types and application areas, without using prior classification systems. Furthermore, da Silva et al. [dSTCT19] used a systematic review approach to analyse evaluation practice in AR tools for education, focusing mainly on education-specific criteria but also including a more general classification along methodologies. Finally, Saffo et al. [SBC*23] presented a design space for immersive analytics, also mentioning two different types of evaluation studies, i.e. comparing approaches and understanding how users perform.

Methods
At the beginning of this systematic review, in accordance with the PRISMA protocol [PMB*21], we defined eligibility criteria for including and excluding publications from the review. In the following, we describe the selection and data collection processes, list all investigated data categories and discuss risks of bias in our study.

Eligibility Criteria
To be eligible for our systematic review, publications need to meet three criteria: they must include a user study, a visual data representation for data analysis and an immersive analytics technology. For the user study, we considered any description of real users completing a study procedure. This includes case studies with real users but excludes any use case descriptions without users, which are also sometimes referred to as case studies. The visual data representation criterion included any plot or rendering of the data that was used within the user study in a visual data analysis scenario. Therefore, we excluded data representations that were used in other ways, e.g. for selecting a specific object to evaluate a novel input method. Finally, the immersive analytics technology criterion included any stereoscopic display system, devices that were tracked in space, or extremely large displays that encompass the user's whole view. In addition, the technology must be used in the user study.
We included all peer-reviewed publications available up to November 30, 2023, that were written in English and where the full text was accessible to us, either directly at the publisher's website or by using Google as a search engine. Therefore, we excluded 33 non-English publications and 33 duplicate studies, as illustrated in Figure 1.

Information Sources
Since the main communities involved in research on IA are Virtual and Augmented Reality, Visualisation and Human-Computer Interaction, we searched the databases most relevant to these three communities: IEEE Xplore (2442 results), ACM Digital Library (1997 results) and Wiley Online Library (371 results). This resulted in a total of 4810 publications that needed to be checked for their eligibility. We included all publications listed in these databases. In some cases, this includes work that was published with different publishers, such as Springer, when the respective publication was still listed in the ACM Digital Library.

Search Strategy
Since there is no defined way in which user studies are referred to in the respective publications, we based our database search on the visual data representation for data analysis and the immersive analytics technology criteria. We defined a list of keywords for both criteria, where each publication needed to include at least one keyword from each column, see Table 1. Furthermore, we added three additional keywords that already imply both an immersive analytics technology and data analysis. To avoid missing any relevant publications, we used a full-text search. After the search, we used the filters of the databases to filter for journal and conference articles, including workshop proceedings, that were published by one of the three publishers included in our review.
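As an illustration, the combination of the two keyword columns into a single full-text query can be sketched as follows. This is a minimal sketch: the keyword lists below are placeholders, as the actual search terms are those listed in Table 1, and the exact query syntax differs between databases.

```python
# Placeholder keyword columns -- the real terms are listed in Table 1 of the review.
data_keywords = ["visualization", "visual analytics", "data analysis"]
technology_keywords = ["virtual reality", "augmented reality", "head-mounted display"]
# Terms that already imply both an immersive technology and data analysis.
extra_keywords = ["immersive analytics"]

def build_query(col_a, col_b, extras):
    """Full-text query: (any term of column A) AND (any term of column B),
    OR any of the extra terms that cover both criteria on their own."""
    any_a = " OR ".join(f'"{t}"' for t in col_a)
    any_b = " OR ".join(f'"{t}"' for t in col_b)
    any_extra = " OR ".join(f'"{t}"' for t in extras)
    return f"(({any_a}) AND ({any_b})) OR ({any_extra})"

print(build_query(data_keywords, technology_keywords, extra_keywords))
```

Such a generated string would then be adapted to the advanced-search syntax of each individual database.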

Selection Process
For an overview of the selection process, see Figure 1. Since our main eligibility criteria (having a user study and using an immersive analytics technology and a visual data representation in the study) are not always mentioned in the title, we had to retrieve the reports before the screening phase. Here, we excluded 66 publications that we could not access with reasonable effort. Given the large number of results from our search, each of the 4744 publications was then screened by one researcher to identify whether it contained a user study, a visual data representation and an immersive analytics technology. We always looked for the user study first and only continued to look for the other eligibility criteria when we found a user study. In this phase, we excluded 3382 publications for not reporting on a user study, 391 for missing immersive analytics technology and 195 for not including a visual data representation in an analytics task. Whenever the respective researcher was unsure whether a publication was relevant, e.g. when a data representation is described but it is not clear whether it was used in the study, it was marked as relevant and sent to the next stage of the selection process, where each of the relevant publications was assessed by another researcher. This was the case for 710 studies. In this stage, we again excluded 63 papers for not reporting on a user study, 75 for missing immersive technology and 195 for violating the visual data representation criterion. When a single publication reported on multiple studies, each of the studies was assessed individually. Overall, this led to a total of 231 studies, reported in 209 publications, being included in the detailed analysis for this systematic review.

Data Collection Process
Each of these 231 studies was reviewed by one researcher to extract any reported information on the study design and the study goals. The extracted information was entered into a spreadsheet and any ambiguities were resolved through discussion.
Using this process, we extracted data for eleven data items, listed in Table 2. We then developed categories for each of these based on the collected data using a bottom-up approach. The only exceptions were the study data sets, where we used the categorisation of Munzner [Mun14], the participant details, where it was not necessary to form categories, and the study designs, which were categorised according to commonly used categories [Fie18, LFH17].

Study Risk of Bias Assessment
We did not include methods to assess the risk of bias for the studies we analysed in this publication, since the procedure of the study itself is the subject of this systematic review. Therefore, we included all forms of user studies to accurately reflect the state of the art in user evaluation in IA.
Data Visualisations: Simple plots, Geo-spatial, Networks, Scatterplots, Volume rendering, Trajectories, 3D bar charts, Point clouds and simple objects, Timeline, Multi-dimensional, or Others
Immersive Analytics Technologies: Head-mounted displays, Large screen displays, CAVE-like systems, Desktop VR, or Spatially tracked mobile displays

Results
In this section, we present the results of our detailed analysis and the insights we gained. An overview of the analysed categories and the structure of this work can be found in Table 2. As our main goal is to reveal how user evaluation is conducted, we structure our results according to their respective order in the research process. We start with the evaluation goals of the analysed user studies and then move on to the study methodology. We then discuss what measures were investigated in the user studies and what methods were used for the data collection process, including a table of standardised questionnaires that were employed. Then we analyse which data types and study task types have been used in the user studies. After that, we examine which types of data visualisations and immersive analytics technologies have been employed. Thereafter, we elaborate on the participant demographics and publication venues. Finally, we discuss the interrelation between study goals, evaluation methodologies and measures in our evaluation strategies.

Evaluation Goal
The goal of the evaluation is the most relevant identifier for selecting a study procedure. However, there are many more parameters to consider when choosing a study type. Researchers might consider the availability of the target user group, the expertise for certain study types in their team, or their desired venue for a publication. Although these considerations are pragmatic in nature, they are still important to ensure that a study can be conducted to a high standard. Nonetheless, selecting a study type that fits the evaluation goals remains the most important consideration.
As mentioned in Section 2.5, we used a bottom-up approach to form the categories for classifying the studies. The categories for the evaluation goals are primarily based on the description of the purpose of the empirical study in the respective papers. For the following classification, we then also looked into the study design to validate whether it matches the stated goal. Finally, we considered the reported results to assign a study to a specific evaluation goal. We chose not to use the scenarios proposed by Lam et al. [LBI*12] and other related work [IIC*13, MSK*20], as they also include system evaluations. Moreover, this classification was originally created to guide researchers in finding the best evaluation approach for their scenario; it thus suggests different methods and is not primarily intended for analysing the state of the art.
Evaluate novel prototype: By far the most common goal in our corpus of studies is the evaluation of a specific prototype presented in the respective publication, which occurred in 136/231 studies, see Table 3. Here, the focus of the publication lies on the introduction and description of the prototype and its implementation. This can include a novel visualisation approach for a specific type of data, e.g. 3D time-varying field data [DZQX21], a prototype for applied visual analytics in a certain domain, such as the geo-temporal visualisation for law enforcement [CWT18], or a novel device for interacting with the data, such as the MADE-axes [SLT*21]. The evaluation is then often conducted to validate the relevance of the introduced technique and also includes studies where the novel prototype is compared to an existing approach. This type of evaluation goal can be achieved with all five types of study methodologies described in Section 3.2, depending on the specific research question. While the most common study methodology for this goal is the quantitative experiment, this is the only evaluation goal in our data that utilises informal evaluations, see Figure 8.
The quantitative experiments are based on a research question that clearly defines quantitatively measurable metrics, which are mostly used to compare the novel approach to at least one existing approach. The most prominent metric used in these experiments is user performance, see Section 3.3. In the studies applying a qualitative methodology, research questions are more open and mostly focus on the user experience of the novel approaches. With the mixed-methods approaches, on the other hand, the research question needs to include both quantitative and qualitative aspects. Case studies were employed to describe how the novel technique or prototype was used by domain experts and enhances their work. The informal studies mostly provide initial user opinions on usefulness, highlight advantages and disadvantages, and give suggestions for improvements and missing functionality.
Compare existing approaches: For studies sorted into this category, the goal of the evaluation is to compare different techniques, such as interaction techniques [WSN21], visualisations [BRLD17], interaction methods [DCW*18] or variations of the same approach [YDJ*19], which was the case for 45/231 studies. Especially when using a quantitative methodology, this goal is similar to the evaluation of a novel prototype. However, in this case the approach itself is neither novel nor the central aspect. For this evaluation goal, the key contribution of the publication is one or more user studies which compare implementations of concepts from the literature, often in terms of their suitability for a specific domain or problem. Alternatively, different variations of a specific existing approach might be compared to identify the best-fitting parameterisation for a certain application. The most common methodology for this goal is the quantitative experiment, since a statistical analysis of the differences between several conditions is a straightforward way to compare those items. When using a qualitative methodology, it is often more difficult to present a sufficient comparison of the elements under investigation; here, the experience of the researchers in synthesising the data is key. This goal could also be achieved by employing a mixed-methods approach, but this was not found in any of the analysed studies.

Technology comparison:
The comparison of technologies is mostly a domain- or problem-specific approach where different display approaches or physical input modalities for interaction in the virtual environment are compared. In contrast to the Compare existing approaches category, these studies focus entirely on the influence of the hardware, mainly the display type, and often evaluate a 3D representation vs. a 2D representation while keeping the data visualisation as similar as possible. This category was found in 22/231 studies and includes the early example of Ware and Franck [WF96], who investigated the difference between 2D projections and 3D stereoscopic visualisations for tracking paths in graph visualisations, as well as the more recent example of Wagner Filho et al. [FSN20], who compared an immersive version of a Space-Time Cube to a desktop version. This goal was mostly addressed using a quantitative study methodology, with only one exception where a qualitative study was used.

Foundational Research
The 19/231 publications that were sorted into this category focus on generalisable results that are not directly linked to a specific implementation or domain. Thus, the tasks for this study goal are as simple and abstract as possible to stay away from a specific application scenario, e.g. [AWR18]. Other approaches leaning slightly more towards a domain are, for example, the investigation of the influence of real-world backgrounds in AR on the perception of data plots [SD21]. In general, this evaluation goal can be achieved by conducting a quantitative experiment, a qualitative study or a mixed-methods study, as these approaches allow for comparisons between different conditions.
Formative improvement: This study goal was found in 9/231 studies and is characterised by mostly small studies on usability, usefulness or design aspects of a specific prototype that is still in the implementation phase at that point. The main goal here is to make informed design decisions that are based on the feedback and requirements of real users and domain experts. By its nature, this means that this type of evaluation goal is bound to a specific domain and application scenario. This was the least common study goal in our analysis and was completed either with a qualitative or a mixed-methods study methodology. In contrast to the evaluation of a novel prototype, this study is formative in the design process, and the main outcome is the description of how a prototype was adapted based on the study results.
Given the large number of studies introducing a novel technique or approach, the question arises why this goal is so much more common than the others. The reason could be that by implementing and presenting a novel approach, the original contribution of a publication is clear from the beginning. For other study goals, such as the comparison of existing approaches or foundational research, the contribution relies more heavily on the study results and is therefore not completely predictable at the start of the research. Only once the study has been conducted and analysed does it become clear how publishable the research is, while the implementation effort has already been invested. Therefore, introducing and evaluating a novel prototype offers a better ratio of effort to publishability. This leads, however, to a lack of evaluation in the other areas, such as the comparison of existing approaches.

Study Methodology
The study methodology is determined by the combination of a research goal, the specific research question and the methodological competences of the researchers. Other approaches (partially) describe this methodological perspective when reporting on study design methods [dSTCT19] or collected data [DBLS18]. It is also included in the description of scenarios by Lam et al. [LBI*12], which has in turn been utilised in further systematic reviews for paper classification [IIC*13, MSK*20]. However, we believe that this choice of methodology needs to be considered at the beginning of designing a user study, as it requires different perspectives and competences, and affects the structure of a study and the methods that can be utilised.
Furthermore, our classification of the evaluation methodology partially uses the same categories as related work. Specifically, "Case Study" and "Experiment" [dSTCT19] as well as "qualitative analysis" and "informal evaluation" [DGB08] have been used in prior reviews of evaluation. As far as we are aware, the systematic mixed-methods approach has not been analysed in a review of evaluation before.
Since the research methodology is not clearly stated in many publications, we categorised the approaches based on the methods and measures used during the evaluation as well as the reporting. We found five different study categories in our publication corpus: quantitative controlled experiments, qualitative user studies, mixed-methods studies, qualitative case studies and informal evaluations, see Table 4. We consider as mixed-methods those studies that include both formal quantitative and qualitative measures to arrive at their results. We do not include in the mixed-methods category approaches that only collect informal qualitative feedback, such as informal verbal feedback or open comments in a custom questionnaire.
Quantitative experiment: This is the most common study methodology. Here, the hypotheses are formed based on prior results and literature. Then, one or more specific (independent) variables are manipulated while keeping environmental variables equal, so that a causal relationship can be assumed between the modified independent variable and the outcome (dependent) variable [LFH17]. To analyse whether the independent variable has the expected effect on the dependent variables, statistical data analysis is applied. The most common experimental design is the within-subjects design, which was used by 89/119 studies. This means that each subject of the study completed each condition. Therefore, a change in the dependent variable cannot be caused by the different backgrounds of the individuals that completed the condition. However, this experimental design requires a method for counterbalancing the order of conditions to avoid the confounding variable of learning effects influencing the outcome of the dependent variable [LFH17]. The most common counterbalancing method in our analysed studies was the Latin Square Design.
In contrast, 15/119 studies used a between-subjects design, which is useful when learning effects in a condition are so large that they cannot be counteracted by employing counterbalancing techniques. When using a between-subjects design, it is therefore important to ensure that the different groups assigned to each condition are similar in terms of any relevant parameters such as prior knowledge, age distribution and gender distribution [LFH17]. In the mixed experimental design, which we found in 9/119 studies, several independent variables are modified, with some being treated as within-subjects variables and others as between-subjects variables. The remaining 6/119 quantitative experiments did not provide information on their study design.
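To illustrate the Latin Square counterbalancing mentioned above, the following minimal sketch generates a balanced Latin square (Williams design) for an even number of conditions. The condition names are hypothetical and only serve as an example; this is one common construction, not the procedure of any specific reviewed study.

```python
def balanced_latin_square(n):
    """Williams design for an even number of conditions n: every condition
    occurs exactly once per position across participants, and each condition
    immediately precedes every other condition equally often, counterbalancing
    both order and carry-over effects."""
    assert n % 2 == 0, "the classic Williams construction requires an even n"
    # First row interleaves low and high condition indices: 0, 1, n-1, 2, n-2, ...
    first = [0] + [((j + 1) // 2) if j % 2 else (n - j // 2) for j in range(1, n)]
    # Each subsequent participant's order is a cyclic shift of the first row.
    return [[(c + i) % n for c in first] for i in range(n)]

# Hypothetical display conditions, only for illustration.
conditions = ["2D desktop", "HMD", "CAVE", "large display"]
for i, row in enumerate(balanced_latin_square(len(conditions))):
    print(f"participant {i + 1}:", [conditions[c] for c in row])
```

With n participants (or any multiple of n), every condition then appears equally often in every position of the session.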

Qualitative user study:
The second most common study methodology we found was the qualitative user study. Here, qualitative data is collected by means of interviews, observations and questionnaires with open questions. The approach to qualitative analysis is inherently different from that of quantitative studies, as there are no fixed hypotheses that are verified by collecting and analysing data. Qualitative analysis rather aims to describe and explain what is happening in a rich and holistic manner [BFM16]. In the analysed studies, qualitative data is often used in an exploratory approach to gain insights using a rather small number of participants, but in a formalised process as opposed to informal evaluations. Moreover, this study methodology is often chosen when working with domain experts, as it is often hard to find enough experts for a quantitative evaluation. Furthermore, experts are rarely interested in traditional performance metrics when evaluating a system that is designed to suit their sense-making process [BYK*21]. The qualitative data these domain experts provide is then much more useful for evaluating and improving the system than testing a closed hypothesis with a large number of novice users. While most of the studies are snapshots at one specific point in time, it is also possible to conduct a study in several sessions over a longer period, as was the case for the study of scientific collaboration in VR by Olaosebikan et al. [OABK*22], which was performed over the course of two months.
Mixed-methods study: When studies systematically collect qualitative and quantitative data, they are classified as mixed-methods studies. However, we only included in this category studies that also synthesised both types of data in their reporting. Qualitative studies that included a short custom questionnaire with a Likert scale were classified as qualitative, and quantitative studies that added an open feedback question were classified as quantitative, as this does not constitute a systematic and equal mixed-methods approach. In addition to the studies that collected both types of data in the same study, there was one publication that used a different approach by first conducting a quantitative study and then following up with a qualitative study [KCWK20]. This resembles a classic explanatory sequential design, as opposed to the exploratory approach that conducts a qualitative study first, followed by a quantitative study to test the hypothesis developed based on the qualitative data [Cre09]. While conducting two separate user studies for a mixed-methods approach appears to double the time and effort going into a project, researchers who want to collect both types of data within the same study often face the challenge of different requirements for quantitative and qualitative data collection. For example, sampling strategies for quantitative research might not be suitable or feasible for qualitative research [BFM16]. Quantitative data collection requires more participants and a very strict control of the environment and the information users are provided with. Qualitative studies, on the other hand, require fewer participants, but here it is necessary to also consider the individual differences of each participant. This includes, for example, deliberately changing questions in a semi-structured interview to get the most information out of every participant, or giving assistance to a participant who is stuck in a task. As Blandford et al. describe it, qualitative research can often be seen as a shared journey, whereas quantitative research can mostly be clearly defined to start with a hypothesis and end in a conclusion [BFM16]. Furthermore, the effort needed for data analysis in a qualitative scenario increases approximately linearly with every additional participant, while in a quantitative scenario it makes little difference to add more participants.
Case study: The fourth study methodology that we found was the case study. In this type of study, a small number of real users, sometimes only one, is asked to use a tool or prototype in a real scenario. This scenario must be in line with the users' expertise and the purpose of the prototype. This allows an in-depth analysis of how a tool can be used in a real-world application scenario [LFH17]. In the analysed studies, this approach was often used to describe user behaviour and motivation as well as the usefulness of and improvement opportunities for a specific prototype. We excluded, however, any study where the term "case study" was used but that did not include actual users and instead described how the proposed tool would be used by a fictitious persona in a use case defined by the researchers.
Informal study: In addition to the formal study approaches, we found informal evaluations, which do not follow a structured methodological approach. In this category, researchers demonstrate their approach to users or experts and collect informal verbal or written feedback for initial insights. This type of evaluation is useful for formative evaluations in a design process or as a pilot study: it does not provide conclusive results but highlights advantages, disadvantages, and room for improvement early in the design or development process.

Measures and Data Collection
When collecting data, we categorise the data into quantitative or qualitative as well as objective or subjective data. Quantitative and qualitative data can describe the same occurrence via different paths. While quantitative data is a numerical description, qualitative data contains a detailed description in the form of textual, video, or audio data. However, qualitative data can also be quantified, for example, by counting the occurrences of a specific event in a video. Subjective data can be either qualitative or quantitative and is mediated by the user. This data reflects a user's perception and is mostly collected using interviews or questionnaires. Objective data, on the other hand, is not directly mediated by the participant of a study. Quantitative objective data is, for example, the measured task completion time, and qualitative objective data can be obtained through qualitative observations of a user's behaviour. While Dünser et al. found a prevalence of objective data in their analysis published in 2008 [DGB08], we could not confirm this trend in our current analysis, as the use of questionnaires and interviews to collect subjective data has gained in popularity while objective measures are still widely used, see Table 5. Furthermore, we analysed which measures or concepts were frequently investigated in the analysed studies and how they were assessed. The classification is purely based on the concepts, standardised measures, and interview questions that were stated in the respective publications. However, we grouped some measures into the group of user experience to include aspects that were either measured on their own or in the context of standardised measures which capture multiple aspects related to user experience.

User Performance: The second most common measurement in our data is user performance, which describes how well a user is able to complete the study task. It is commonly assessed by collecting data on task completion time and error rate, see Table 5. This is especially common for the goals of evaluating a novel technique or comparing existing techniques. In addition to the objective time and error measurements, performance can be assessed subjectively by employing a custom questionnaire. Using both subjective and objective measures also allows researchers to compare the actual performance to the user's perception of it. Furthermore, performance can be assessed by observation, where a researcher judges what constitutes good and what bad performance. While this is still an objective measure, it gives researchers a more detailed view of the performance than simple time and error measurements. Other, more study-specific approaches to measuring performance are the use of system log data (e.g. the number of interactions with the study prototype [NKE * 17]), a post-study knowledge test [ATC21], and the use of eye-tracking [DMG * 01], see Table 5.
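To make the typical objective performance measures concrete, a minimal Python sketch could compute task completion time and error rate from a trial log. The log format below (start time, end time, correctness per trial) is our own hypothetical illustration, not drawn from any of the analysed studies:

```python
# Hypothetical trial log: (start_time_s, end_time_s, answer_correct) per trial
trials = [(0.0, 12.5, True), (13.0, 20.0, False), (21.0, 29.5, True)]

# Task completion time per trial and its mean
completion_times = [end - start for start, end, _ in trials]
mean_time = sum(completion_times) / len(completion_times)

# Error rate: share of trials answered incorrectly
error_rate = sum(1 for _, _, ok in trials if not ok) / len(trials)
```

Study-specific measures such as interaction counts could be derived from the same kind of system log in an analogous way.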
User Experience: The most common measure evaluated in the user studies we analysed is user experience (UX). Since IA always involves users, it is important to include their experiences in the evaluations. However, most studies only look at specific aspects of UX, see Figure 3. Where UX is considered holistically, it is measured using custom and standardised questionnaires as well as qualitative interviews, the "Thinking Aloud" method [ES84, NCY02], and informal verbal feedback, see Table 5. The most commonly measured aspects of UX were usefulness, ease of use, and usability, see Figure 3. Customised questionnaires are the dominant method for measuring these UX aspects. In addition, the standardised "System Usability Scale" (SUS) questionnaire [Bro95] is a common tool to measure usability. For ease of use, the "Single Ease Question" [SD09] is a standardised and simple option, see Table 6. Usefulness, on the other hand, was not measured using a standardised questionnaire; it was, however, commonly evaluated using qualitative interviews and informal verbal feedback, see Figure 3.
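As an illustration of how such a standardised questionnaire is scored, the SUS follows a fixed scheme (ten items on a five-point scale, alternating positive and negative wording); a minimal sketch of Brooke's standard computation could look like this:

```python
def sus_score(responses):
    """System Usability Scale score (0-100) from ten responses on a
    1-5 Likert scale: odd-numbered items contribute (r - 1),
    even-numbered items (5 - r); the sum is scaled by 2.5."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    contributions = [r - 1 if i % 2 == 0 else 5 - r
                     for i, r in enumerate(responses)]
    return sum(contributions) * 2.5
```

A neutral response pattern (all 3s) yields a score of 50, which is why SUS results are usually interpreted against published benchmark values rather than against the midpoint of the raw scale.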
Simulator Sickness: There are different terms to describe users experiencing sickness during an immersive experience, such as simulator sickness, cybersickness, VR sickness, or motion sickness. We use simulator sickness, a term which gained popularity as it comes with its own questionnaire, the "Simulator Sickness Questionnaire" (SSQ) [RSKL93]. It covers symptoms of nausea, oculomotor disturbances, and disorientation and is commonly administered before and after the immersive experience to assess how the experience changed the perception of sickness symptoms in comparison to the baseline. In our data, the SSQ was the most common method to measure simulator sickness, followed by custom questionnaires. Furthermore, two more standardised questionnaires were used in the studies we analysed: the "Virtual Reality Sickness Questionnaire" [KPCC18] and the simulator sickness questionnaire based on the study of Bouchard et al. [BRR07], see Table 6. Both of these measures, however, are derived from Kennedy's SSQ. In addition to these questionnaires, Zielasko et al. [ZWK19] added verbal feedback on the participants' state of well-being, see Table 5.
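As a sketch of the pre/post comparison described above, the SSQ subscale sums can be converted into scores using the commonly cited weights from Kennedy et al.; the raw input values below are hypothetical, not data from any analysed study:

```python
# Commonly cited SSQ conversion weights (Kennedy et al.); raw inputs
# are sums of the 0-3 symptom ratings for the respective subscale
N_W, O_W, D_W, TOTAL_W = 9.54, 7.58, 13.92, 3.74

def ssq_scores(nausea_raw, oculomotor_raw, disorientation_raw):
    return {
        "nausea": nausea_raw * N_W,
        "oculomotor": oculomotor_raw * O_W,
        "disorientation": disorientation_raw * D_W,
        "total": (nausea_raw + oculomotor_raw + disorientation_raw) * TOTAL_W,
    }

# Administered before and after the immersive exposure; the change
# relative to the baseline is the quantity of interest
pre, post = ssq_scores(1, 2, 0), ssq_scores(3, 4, 2)
delta_total = post["total"] - pre["total"]
```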
User Preference: Especially in comparative study methodologies, a common and simple measure is user preference. It is mostly evaluated using custom questionnaires and verbal feedback, or as part of a qualitative interview, see Table 5. However, some studies infer the users' preference from system log data by analysing how much they interacted with different features of the study prototype.
Usage Behaviour: By analysing usage behaviour, researchers describe overall usage patterns and strategies as well as specific interaction behaviours, for example, when a user uses a tool in a way unintended by the researchers. This is especially common in case studies, as this study methodology mainly describes how a tool is used in a real-world scenario. The most common way to analyse this measure is observation. Complementing direct observational strategies, user behaviour and movement are often reconstructed from log data, position tracking, and eye-tracking. When custom questionnaires, qualitative interviews, and verbal feedback are applied, the subjective data obtained can then reveal the motives behind the users' behaviour.
Presence: Presence was defined as "the sense of being there", or how humans respond to an immersive system with the feeling of truly being in the virtual environment [SBL * 95, SVS05]. A later approach, however, split the term into "place illusion" and "plausibility illusion". Place illusion is bound to the feeling of being in a specific place in spite of the knowledge that this is not true. In contrast, plausibility illusion is connected to events and actions as "the illusion that what is apparently happening is really happening (even though you know for sure that it is not)" [Sla09]. While presence can be (and has been [TTCLER16]) argued to be part of user experience, we classified it in a separate category, as it is specific to the immersive experience and not always part of UX measures. Since UX is a common concept across all kinds of systems, we consider it in a more general context in our classification. While presence can be assessed via evaluation-specific attributes, such as signs of anxiety when standing in front of a virtual pit [SKMY09], this is usually not feasible in IA. Therefore, the most common evaluation method is the use of standardised and sometimes custom questionnaires, see Table 5. The standardised questionnaires for this measure include the Igroup Presence Questionnaire [SFR01] and the MEC Spatial Presence Questionnaire [VWG * 04], see Table 6.
Workload: Workload is generally used to evaluate how demanding a task is. Here, we include mental workload, physical workload, and general task load, which covers both physical and mental effort, often in combination with temporal demand and effort. Workload is mainly measured using standardised questionnaires, such as the NASA Task Load Index (NASA-TLX) [HS88] and a variation that omits the weighting of the subscales, also known as the raw Task Load Index [Har06], as well as custom questionnaires that are sometimes based on this measure [BRLD17]. Furthermore, the mental effort scale introduced by Paas [Paa92] was reported as a measure of cognitive load. Additionally, mental workload can be measured using physiological measurements such as electroencephalography (EEG) data, see Table 6.
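The difference between the classic NASA-TLX and the raw TLX mentioned above can be sketched in a few lines; the ratings and weights below are hypothetical, and in the classic variant the six weights result from 15 pairwise comparisons and therefore sum to 15:

```python
def raw_tlx(ratings):
    """Raw TLX: unweighted mean of the six subscale ratings (0-100)."""
    assert len(ratings) == 6
    return sum(ratings) / 6

def weighted_tlx(ratings, weights):
    """Classic NASA-TLX: weighted mean using the pairwise-comparison
    weights, which sum to 15."""
    assert len(ratings) == len(weights) == 6 and sum(weights) == 15
    return sum(r * w for r, w in zip(ratings, weights)) / 15

# Hypothetical subscale ratings: mental, physical, temporal,
# performance, effort, frustration
ratings = [70, 30, 50, 40, 60, 20]
weights = [5, 1, 3, 2, 4, 0]
```

Omitting the weighting step, as in the raw TLX, trades the individual weighting of subscales for a considerably shorter procedure [Har06].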
Open Prototype Feedback: This measure is a main component of informal evaluation and other domain-specific evaluations, such as formative improvements. It provides researchers with concrete advantages and disadvantages as well as improvement suggestions and missing features for the studied tool. It is mainly evaluated using verbal feedback and open-ended questions in custom questionnaires. When a more formalised approach is used, qualitative interviews, Thinking Aloud protocols, or observations can be employed, see Table 5.
Collaborative Behaviour: This measure is only used in the few collaborative studies. While it is similar to usage behaviour, as it can also describe behavioural patterns, it adds the component of collaborative interaction. Here, researchers can look at collaborative coupling, communication, and collaborative awareness. This measure is especially interesting when going beyond co-located collaboration and investigating remote or cross-reality collaboration. Within the studies we analysed, it is evaluated using observations, qualitative interviews, custom questionnaires, and Thinking Aloud protocols, see Table 5. While collaboration has been described as an integral part of IA [EBC * 21], we found only a relatively small number of studies on collaboration in our analysis (15/231), similar to Saffo et al. [SBDE23] and Dey et al. [DBLS18].

Study Tasks and Datasets
While all analysed studies are in the general context of IA, not all focus solely on data analysis. Thus, our task classification does not exclusively refer to analysis tasks. The simplest type of task within the analysed studies was the simple lookup task (50/231). Here, the user does not need to compare any data points but rather looks for the one data point that fulfils the given criteria. Another common task was the comparison of single items (56/231), which required users, for example, to compare given items to identify the higher value. For pattern or trend identification (59/231) as well as outlier and maximum identification (14/231), users are required to get an overview of the whole data set to identify underlying patterns. This is also relevant for the comparison of different patterns (28/231). Furthermore, path tracing (23/231) is common in scenarios using node-link diagrams or tree structures for the visualisation of graph data. Such analysis tasks, which have an inherently correct answer, are especially common in quantitative comparative or mixed-methods studies, as they already provide a metric for task accuracy. However, closed tasks can also be useful to familiarise the participants with the analysis tool [LHC * 21]. The open exploration (43/231) of the data set and the implemented visualisations, on the other hand, is common when applying the qualitative, mixed-methods, or case study methodology. Moreover, this is also a common task for collaboration studies, as it allows participants to discuss the data in detail and enables different strategies within a group [LHC * 21]. Additionally, we also found a group of studies where the task was focused on the use of a tool in its intended context (15/231), such as the data analysis of user study data [HWF * 22]. Apart from these analytical tasks, there were also many study-specific tasks (64/231), such as navigation tasks [BNC * 03] and assembly tasks [WNN18]. The papers are categorised according to their task types in Table 7. However, 18/231 studies are not included in the table, as we could not find any information on their specific study tasks.
For data type classification we primarily used the definitions by Munzner [Mun14]. Several studies used more than one type of data in their study. The most common data type within the analysed studies was table data (134/231). This describes data where a single data point comprises multiple discrete attributes which can easily be displayed within tables. Numerous charts are available to display this type of data, such as bar charts, scatterplots, and parallel coordinate plots. In this category we also included data with a geospatial reference, such as GPS coordinates. While in the visualisation process this data is usually displayed using a map, which in itself would be classified as geometry data, the map is just a way to give context to the attributes of the geospatial table data. Field data (35/231) was most commonly found in medical imaging data, such as MRI data, or when analysing fluid dynamics. This data type is more complex than table data, as it often describes a volumetric point that has a specific value. While many medical imaging data sets are displayed using surface renderings instead of volume renderings, which would be considered geometry data, we decided to include this data in the field data category, as the surface rendering simply serves the purpose of high-performance visualisation. Network data (37/231) specifies a relationship within the data, which also includes hierarchical data like trees. Geometry data (37/231) refers to data about the shape of objects. This data would typically be displayed using surface rendering. In addition to these categories described by Munzner [Mun14], 7/231 studies used textual data and 6/231 used image data in their user studies.
While the data types mostly correspond to their visualisation type, there are some examples where this is not the case, e.g. when network data is extracted from textual data to visualise references across documents [PDS19].

Data Visualisations
Within the analysed publications we found numerous different ways of visualising data. We grouped the visualisations into eleven groups and depict their use over time in Figure 4.
The most common type of data visualisation we found are simple plots, which occurred in 49/231 studies. This includes all classical two-dimensional plots such as 2D bar charts, line charts, and histograms, as well as pie charts. This type of chart is often combined with other approaches to visualise data and has greatly increased in use over the last five years, which is likely due to the many upcoming tools that combine multiple visualisation types for a specific use case. Consequently, the data representations that are common within a domain are also implemented, such as the histograms in material science [GGH21].
Next are geospatial visualisations, which are used to display data with a geospatial reference. Network representations were used in 44/231 studies. This mainly includes node-link representations: by connecting nodes through edges, they are the most common form of visualising network data. Within the node-link representations we also find the hierarchical variation of tree visualisations [SLF10]. A unique and inherently three-dimensional variation of these tree structures is the cone tree. This data representation was used by Elmqvist et al., who investigated the benefit of motion constraints for navigating complex 3D environments [ETT08]. Moreover, we also found a compound graph which combines a matrix representation and node-link diagrams [AHE16]. Scatterplots, both two- and three-dimensional, were used in 38/231 of the analysed studies. This chart is well suited to describing correlations and trends and enables users to discuss patterns in the data in collaborative scenarios [LHC * 21]. Furthermore, it is a common 2D visualisation that enables a simple transformation to 3D by adding a third axis. However, it also suffers from distortion based on the user's viewpoint; thus the view of a small and a tall person on the data can vary and distort the perceived pattern. A special variant of the scatterplot, the scatterplot matrix, was investigated by Batch et al.

Volume rendering, which we understand to be the display of volumetric data achieved either through volumetric or surface rendering, was used in 33/231 studies. It is used to display both geometry and field data. Common application areas of this data representation type are the rendering of medical imaging data [KCWK20, JK22] and the material science domain [GGH21, JLS * 13].
For displaying the course of objects or particles as well as spatial movement, trajectories were found in 23/231 of the analysed studies. In this category, we also include flowlines, which were used to visualise the airflow in a cavity [Men12].
Also used in 23/231 of the user studies were 3D bar charts. While 2D bar charts are widespread in everyday life, the 3D bar chart has been gaining increasing attention in IA research over the last five years, see Figure 4. This chart type is, for example, used in a comparison with a physical replica [DUWW22].
Point clouds and simple shapes were adopted in 21/231 studies. These items were often used as simple and abstract visualisations to connect a task to a visual analytics context without the influence of a specific dataset. Büschel et al. used spheres as "mock data items" [BMMD19], and Guéniat et al. performed an abstract outlier search task by asking users to find three eggs in a cloud of spheres [GCG * 13].
We also found timeline visualisations in 12/231 of our analysed studies, which can also be used as interaction items to manipulate the time axis in an interactive chart [HWF * 22].
The final individual visualisation category we found was multidimensional charts, which include multiple different axes. This type was used in 10/231 analysed studies and consists of parallel coordinate plots [HZBR21] and radar plots [SDA19].
The remaining, rarely used visualisation types were combined into the "others" category. 31/231 studies employed one or more of these data representations, which include 2D medical image slice data [CBC * 20], beeswarm plots [SAHCV20], Gantt charts [SCT * 22], and many others.
Besides visualisation for data analysis, there are publications that focus on the placement of data, which were not included in our analysis [LLWD22, GSL21, KHKH13, DAE * 08].

Immersive Analytics Technologies
With our definition of immersive analytics systems in the paper selection process, the technologies in the publications can be divided into the five display categories of Head-Mounted Displays (HMDs) (175/231), large screen displays (20/231), CAVE-like systems (20/231), DesktopVR (16/231), and spatially tracked mobile displays (22/231). A few studies incorporate more than one of these technologies.
In the category of displays, by far the most commonly used technology was the HMD. This is due to the recent (and still ongoing) surge in the development of consumer-grade hardware in this area, which facilitated the widespread use of VR and AR in data visualisation and data analysis. Inside this category, the most commonly used HMDs were the HTC Vive, the Microsoft HoloLens, and the Oculus Quest. Additionally, there were some earlier HMDs such as the Virtual Research V8 HMD. Before these modern HMDs were available, however, researchers mainly used CAVE-like devices and DesktopVR or large screen displays (LSDs) like powerwall installations combined with tracking and stereoscopic glasses. While some of the multi-user CAVE-like apparatuses are still in use, the combination of a regular display and stereoscopic glasses has mainly been replaced by HMDs, see Figure 3.6. After the rise of smartphones and tablets, these mobile devices were also used for IA with spatial tracking. Compared to the other display technologies, their usage is rather rare, which is likely caused by their limited display size: one of the key advantages of IA, a large field of view and a huge layout space for data, is constrained by this restriction. In our extracted data, three more dedicated setups have been used for IA: spatial augmented reality on a data physicalisation [HOZ * 21], projection on a semi-transparent spatial display [KBGM15], and projection on a table [DASS20].
Turning from displays to input devices, these devices are often required to enable spatial interaction. The most common interaction device type in this category was the tracked controller (118/231). Additionally, 16/231 studies tracked custom tangible objects instead of controllers. These devices are often designed to support data manipulation, as, for example, in ImAxes [CDH16]. 11/231 studies also tracked the users themselves, who manipulate data points and experience the data visualisation by moving across the tracking space. Gesture (44/231), voice (7/231), and gaze (11/231) tracking have also been used for data manipulation and interaction. When using mobile devices like phones and tablets, the interaction often relies on touch input (31/231) and on spatially tracking the mobile device (5/231).
In general, it can be observed that input devices or input modalities are selected depending on the display technology used. In contrast to traditional keyboard-and-mouse interaction in desktop systems, no standardised input approaches are available for the spatial interaction that is often required for immersive analytics.

Participants
Out of the 231 studies, 221 reported the number of participants, with an overall median participant count of 14. However, since the number of participants varies greatly depending on the type of evaluation, we show the distribution of participant numbers for each study type, as well as the overall distribution, in Figure 6. The median number of participants for quantitative experiments is 17 (range 4-60), for qualitative studies it is 10 (range 2-25), for mixed-methods studies 15 (range 3-32), for case studies 4 (range 1-10), and for informal evaluations 5 (range 1-36). This shows that our extracted data also contains very small quantitative studies, meaning that their statistical power is presumably very limited. Furthermore, while case studies and informal evaluations are almost equal in terms of median number of participants, the informal evaluation is more flexible in the number of study subjects. This is also due to the case study being more complex in the data analysis and result presentation of its inherently qualitative data. However, there are also calls arguing that traditional approaches to sample size calculation may be inherently flawed. In the medical domain, Bacchetti argues that the assumption that any statistical evaluation needs at least 80% power is harmful to scientific progress and suggests multiple alternative approaches to sample size planning [Bac10].
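As a rough illustration of the conventional power-based sample size planning that Bacchetti criticises, the following sketch uses the common normal approximation for a two-sided, two-sample comparison of means; the z-values are hardcoded for α = 0.05 and 80% power, and exact t-based computations yield slightly larger numbers:

```python
import math

def n_per_group(effect_size):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means at alpha = 0.05 and 80% power
    (normal approximation; slightly underestimates the exact result)."""
    z_alpha = 1.960  # two-sided alpha = 0.05
    z_beta = 0.842   # power = 0.80
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

For a medium effect (Cohen's d = 0.5) this yields about 63 participants per group, far above the median of 17 participants per quantitative study found in our corpus, which suggests that many of the analysed experiments were only powered to detect large effects.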
In terms of participant age, it is noteworthy that this demographic parameter was reported in different ways. The average age is also reflected in the most commonly recruited group: 66/231 studies sourced their participants from students and campus staff in an opportunity sample, resulting in a low average age. 41/231 studies, on the other hand, recruited domain experts from the specific user group of their prototype for their evaluation. This difference in sampling also relates to the type of evaluation that was employed. Among case studies, which require only a small number of participants, 10/12 studies that reported on sampling recruited domain experts. For qualitative studies it was similar, with 17/29, and for informal studies 5/8 recruited experts. In contrast, only 6/56 quantitative experiments and 3/11 mixed-methods studies were conducted with experts.
For gender, 139/231 studies reported the distribution of male and female participants, with 6/139 studies including participants who either identified as non-binary or did not wish to disclose their gender. The median number of female participants per study is 5. Overall, this is similar to the results of Dey et al. [DBLS18] and Merino et al. [MSK * 20], who also found participants in their reviews to be mostly young, male university students. However, as Offenwanger et al. explain, it is not only women who are underrepresented: in notations that only declare how many participants were female, non-binary gender identities are invisible [OMC * 21].
In our corpus, 41/231 studies reported on their subjects' vision, with 34/41 reporting that all participants had normal or corrected-to-normal vision when participating in the study. Furthermore, 103/231 studies gathered data on their subjects' experience with VR or AR technology, 66/231 reported their subjects' experience with visual data analysis, and 38/231 studies reported on other participant variables, such as frequency of computer use, dominant hand, or education level.

Publication Venues
We categorised the publications according to their publication venue into journal (51/231), conference (167/231), and workshop publications. Since 2016, there has also been an increase in workshop publications that report on user evaluations. The wide variety of venues illustrates that IA is relevant for numerous different domains, where it adds an immersive component to the current process of visual data analysis. On the other hand, it also highlights that in current research there is not yet a particular conference or journal that is the main venue for research in IA. This is also emphasised by the fact that workshops on IA have taken place at different conferences but often were not repeated at the same conference in the following year. For example, the first workshop was held in 2016 at IEEE VR. Among others, this was followed by a workshop at IEEE VIS in 2017, three workshops at ACM CHI in 2019, 2020, and 2022, as well as two workshops at IEEE ISMAR in 2022 and 2023. The distribution of venues for IA-related research also illustrates the interdisciplinary nature of the topic. The most common venues for publications are centred around the communities of Virtual Reality, Human-Computer Interaction, and Information Visualisation.

Evaluation Strategies
When planning a study, the most integral parts are the evaluation goal, the study methodology, and the measures, typically determined in that order. Overall, there is no single path where only one methodology is suitable to reach a specific evaluation goal. Rather, different methodologies provide different opportunities to researchers, similar to the elaboration of Lam et al. on the "Many-to-Many Mapping" between their described research methods and scenarios [LBI * 12]. In our analysis, we discover that evaluate novel is by far the most common and the most versatile goal, as it can be accomplished with any methodology. Moreover, it is the only study goal that was accomplished using the informal evaluation methodology (17/136) and case studies (16/136). The comparison of existing approaches (34/45) and the comparison of technologies (21/22) both have a clear focus on quantitative experiments, as this study methodology provides clear and widely recognised standards for comparing different elements. Despite that, the comparison of existing approaches and techniques was also evaluated using qualitative approaches (7/45). This is, however, a less common approach for comparative study goals, as the results of this study methodology are more dependent on the subjects. While for the foundational research goal we also found more quantitative experiments (11/19), this is also influenced by the general preference for this approach in IA research. However, with 4/19, it has the highest number of mixed-methods studies relative to its occurrence out of all study goals. While the formative improvement goal also has a relatively high number of mixed-methods studies (2/9), there were only few studies with this goal in our analysis, and it is far more common to choose a qualitative study design to achieve this goal (7/9), see Figure 8.
The decision on the study methodology then sets the procedural structure for the study and is an important factor in choosing the specific methods for data collection, which can collect quantitative, qualitative, or both kinds of data. However, the choice of study methodology does not limit the measures that can be investigated, see Figure 9. While in our analysis almost any measure was used at least once in combination with quantitative experiments, qualitative studies, and mixed-methods studies, there is a clear prevalence of some combinations of methodology and measures. User performance is almost exclusively measured using a quantitative experiment (103/130). Qualitative studies, on the other hand, show a clear preference for measuring user experience (36/57) and open feedback (28/57), which includes advantages and disadvantages of the item under investigation. Additionally, qualitative approaches are very common when analysing collaborative usage behaviours (7/15). In collaborative usage scenarios, quantitative experiments often face difficulties with controlling for confounding variables, as the group dynamic in a collaborative study can usually not be completely controlled. This intended volatility in scenarios where more than one participant is involved makes qualitative or mixed-methods studies highly suitable for this measure. After reporting the detailed analysis results, we want to discuss the most relevant challenges for future evaluation in Immersive Analytics.
Standardised measurement methods: When applying methods for data collection, the custom questionnaire was a frequently employed tool for many measures, see Table 5. While it is often useful and necessary to construct a questionnaire for specific research questions, we found several instances where a standardised questionnaire was available and had been used in other studies, see Table 6. We also found several instances mentioning that a custom questionnaire was based on a standardised one or that the standardised questionnaire was adapted. In these cases, we counted the respective questionnaire as a custom questionnaire, since its validity has not been tested in a study. This has also been discussed by Hart in 2006 [Har06] when analysing how the widespread NASA-TLX questionnaire had been used in the 20 years since its creation. Hart also mentions that changing a standardised questionnaire means that the results might not be comparable to similar studies using the original questionnaire. We would therefore argue for not adapting questionnaires without securing their validity, for reporting an adapted questionnaire as a custom questionnaire, for clarifying whether the results can be compared to other studies, and for choosing a standardised, validated questionnaire whenever one is suitable.

Benchmark tasks and data sets:
The basis for any user study is the question of what task a participant should complete to obtain reliable and comparable results. We did find different simplified task types that are used across different data. However, we could not identify a benchmark task that could be used in different studies and allow comparisons between their results. The situation for data sets was quite similar. There are some data sets, such as the wine quality data from the UCI Machine Learning Repository [FVP * 18, MSGY * 20] or the Melbourne housing auction data [SS22, LHC * 21], which were each used in two studies. However, to enhance comparability across different publications, for example when introducing a novel interaction technique, using a standardised and open-source data set would be useful. This could form the first part of an evaluation, while a more scenario-specific task could then constitute the second part of the study. We would therefore argue for the development of different tasks in combination with public data sets to be used as benchmarks when introducing new tools or techniques.
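As a sketch of what such a shared benchmark could look like in practice, a study could pin the data set and task list in a machine-readable specification and verify the downloaded file against a published checksum. All names, task types, the URL, and the hash below are hypothetical placeholders, not an established community benchmark.

```python
import hashlib
import json

# Hypothetical benchmark specification: the URL, hash, and task types
# are illustrative placeholders only.
BENCHMARK = {
    "dataset": {
        "name": "winequality-red",
        "url": "https://archive.ics.uci.edu/.../winequality-red.csv",
        "sha256": "<published file hash goes here>",
    },
    "tasks": [
        {"id": "T1", "type": "value-retrieval", "repetitions": 10},
        {"id": "T2", "type": "cluster-identification", "repetitions": 5},
    ],
}

def dataset_matches(raw_bytes: bytes, spec: dict) -> bool:
    """Return True if the downloaded file matches the pinned checksum."""
    return hashlib.sha256(raw_bytes).hexdigest() == spec["dataset"]["sha256"]

# The specification is plain JSON, so it can be shared alongside a
# publication as supplementary material.
spec_json = json.dumps(BENCHMARK, indent=2)
```

Pinning a checksum rather than only a URL guards against silent changes to the hosted file, which would otherwise undermine comparisons between studies.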
Study reporting: In our analysis, we often struggled to find concrete information on core aspects of the study design, e.g. when trying to compare demographics across all studies; this has also been observed in a prior systematic review by Isenberg et al. [IIC * 13]. Each evaluation strategy requires different types of information, validation and metrics to assess its quality. While a small number of participants critically influences the validity of a quantitative experiment, it is perfectly suitable for a qualitative study or case study. This also extends to the reporting of results. For statistical results, this means following reporting guidelines and carefully interpreting results without relying on simply categorising them as statistically significant or not significant. For further information on this pitfall of dichotomous interpretation of statistical results, see the comment by Amrhein et al. [AGM19]. Furthermore, clearly stating the measures and methods in the user study description supports readers in understanding the user study immediately. Therefore, we want to encourage researchers to clearly state their evaluation goal, evaluation methodology, measures and methods when writing up user studies. This enables readers to consider the context of the reported metrics and results. We also believe that clearly stating the methodology will boost the overall methodological competence of the IA community as a whole. Crisan and Elliott also provide a checklist that can guide researchers when preparing their study for publication, so that the merit of the study can be adequately judged by reviewers and readers [CE18].
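As a minimal, hypothetical illustration of reporting beyond a significance verdict, one can report the point estimate of an effect together with an interval estimate. The sketch below uses a normal-approximation confidence interval for a mean difference (a t-based interval would be preferable for samples this small), and the data values are invented for illustration.

```python
import math
import statistics

def mean_diff_ci(a, b, z=1.96):
    """Mean difference between two independent samples with an
    approximate 95% confidence interval (normal approximation;
    a t-based interval is preferable for small samples)."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return diff, (diff - z * se, diff + z * se)

# Invented task-completion times (seconds) for two hypothetical conditions.
immersive = [41.2, 38.5, 44.0, 40.1, 39.7, 42.3]
desktop = [45.9, 47.2, 44.8, 46.5, 48.0, 45.1]

diff, (lo, hi) = mean_diff_ci(immersive, desktop)
report = f"mean difference = {diff:.1f} s, 95% CI [{lo:.1f}, {hi:.1f}]"
```

Reporting the estimate and interval lets readers judge the magnitude and uncertainty of an effect instead of reducing it to a binary significant/non-significant label.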
Research transparency: To improve the transparency and quality of research, preregistration of studies is becoming increasingly popular in user studies, a trend we also found in our analysis [YTHL22, SGBI22, SBDE23, WBR * 20, YCB * 21, BCC * 20, DUWW22, LPED20]. To preregister, researchers submit details on their planned study to an online registry. Once the study has been conducted and evaluated, researchers can then show that they followed their plan and did not change their data analysis when the expected outcome did not materialise. In our corpus we found eight studies that were preregistered. This is a very positive trend that enhances the quality and reliability of empirical studies. The practice boosts confidence in research quality and is useful not only for quantitative but for all user studies. However, preregistration is not the singular solution to enhancing research transparency [BPSS * 21, Har18]. In many cases, the information published for preregistration does not fully enable replication studies. Therefore, Watson argues for promoting the completely transparent Open Methodology approach [Wat15]. This way, journals would need to ensure that studies are fully reproducible before publishing, increasing the robustness and verifiability of results. This would also help combat the replication crisis, which is rooted in the application of null hypothesis significance testing [CDBG20] and publication bias [VSHZ18]. Preregistration can also help to reduce the waste of resources, as researchers can see which studies are ongoing and design their own research to address complementary research questions [BPSS * 21]. Like many others before us [BPSS * 21, CDBG20, KH18, NEDM18, WVA * 16], we therefore want to encourage researchers to follow this practice of preregistration and data sharing in their future user studies. The most popular registry for user studies in our analysis was the Open Science Framework.
Evaluating the immersive component: The use of immersive technologies is one of the defining characteristics of IA. However, only 33/231 studies reported including measures that are specific to this type of technology, such as presence or simulator sickness. These concepts were measured through standardised questionnaires, custom questionnaires and verbal feedback. It is possible that more studies included questions on presence and simulator sickness in their qualitative interviews or custom questionnaires but did not report them in the results or their methods description. Moreover, within the reviewed studies we did not find standardised questionnaires or evaluation procedures for other immersion-specific measures, although such instruments exist. For example, there is a user experience questionnaire for virtual environments that includes aspects of traditional user experience questionnaires as well as presence and simulator sickness questionnaires [TTCLER16]. Additionally, there has been an evaluation of user experience in immersive virtual environments to create design guidelines, which also includes additional categories such as locomotion and fluidity, difficulty of input devices, and scene design considerations [GLL18]. Therefore, it might be necessary to extend the understanding of user experience in immersive environments by including additional facets, such as the experience of input and locomotion, into standardised measures. We want to encourage researchers to create standardised and validated measures that consider the immersive component, and to look for and use such measures when conducting their user studies in IA.

Conclusion
We have presented a systematic review of how user studies are planned and conducted in the domain of Immersive Analytics, following the PRISMA protocol. To this end, we checked 4678 publications from three different databases for eligibility and included 231 studies, reported in 209 individual publications, in our analysis. We collected data on publication venue, evaluation goal, study methodology, study design, measures, data collection methods, investigated data sets, study tasks, participants, visual data representation and immersive analytics technologies. For each investigated dimension, we formed suitable categories and reported distributions, trends and further insights. Finally, we discussed challenges and future work for user evaluation in Immersive Analytics.

Declarations
Study registration: This systematic review was not registered online during the planning stage. This increases the risk of parallel duplicate research and does not allow checking whether the review protocol was conducted as planned. Nevertheless, we followed the PRISMA protocol to adhere to the guidelines for systematic reviews.
Funding: This work is part of X-PRO.The project X-PRO is financed by research subsidies granted by the government of Upper Austria.
MCH * 18]. Therefore, Marriott et al. argue that linked 2D and 3D perspectives might be useful [MCH * 18], and Yang et al. found that immersive environments provide the opportunity to connect and seamlessly transition between such 2D data representations [YDM * 21]. Since the term IA was coined by Chandler et al. in 2015 [CCC * 15], the field has emerged as an active and steadily growing research area [EBC * 21]. There are multiple surveys on IA, with Fonnet and Prié [FP21] as well as Klein et al. [KSS22] surveying the whole field and Kraus et al. exploring the subtopic of abstract 3D visualisations [KFS * 22]. Nevertheless, there is also earlier work summarising research in IA before the term became popular [GHAWK16].

Figure 1 :
Figure 1: Flowchart of the paper selection process based on the PRISMA flowchart by [PMB * 21]

Figure 2 :
Figure 2: Distribution of evaluation goals over time

[AHSB22]. Arjun et al. also included eye gaze data to infer mental workload. For physical workload, Drogemuller et al. report on the use of tracking data of the participants to infer their physical effort [DCW * 18] and Satriadi et al. [SCT * 22] use Borg's ratings of perceived exertion [Bor98] to measure physical exertion, see Table

Figure 3 :
Figure 3: Aspects of User Experience and the Methods used to measure them.

Figure 4 :
Figure 4: All visualisations that were found in at least ten studies and their use over time.

Figure 5 :
Figure 5: Distribution of Immersive Analytics Technology over time and timeline of technology releases

Figure 6 :
Figure 6: Boxplot of the range of participant numbers for each study type and an overall category.The actual values are depicted as jitter on top of the boxplots to show the distribution.

Figure 7 :
Figure 7: Distribution over time of publications containing a user study in the field of IA and the studied immersive analytics technologies and the types of venues they were published in

Figure 8 :
Figure 8: Distribution of study methodologies per evaluation goal

Figure 9 :
Figure 9: Distribution of evaluation goals, the employed study methodology and the investigated measures

Table 1 :
Search terms for the database search

Table 5 :
Measures and their respective evaluation methods

Table 6 :
Standardised questionnaires reported in the user studies.