Effects of Individual Research Practices on fNIRS Signal Quality and Latent Characteristics

Functional near-infrared spectroscopy (fNIRS) is an increasingly popular tool for cross-cultural neuroimaging studies. However, the reproducibility and comparability of fNIRS studies are still open issues in the scientific community. The heterogeneity of experimental practices and the lack of clear guidelines regarding fNIRS use contribute to undermining the reproducibility of results. For this reason, much effort is now directed at assessing the impact of heterogeneous experimental practices in creating divergent fNIRS results. The current work aims to assess differences in fNIRS signal quality in data collected by two different labs in two different cohorts: Singapore (SG, N=74) and Italy (ITA, N=84). Random segments of 20s were extracted from each channel in each participant’s NIRScap, and 1280 deep features were obtained using a deep learning model trained to classify the quality of fNIRS data. Two datasets were generated: the ALL dataset (segments with bad and good data quality) and the GOOD dataset (segments with good quality only). Each dataset was divided into train and test partitions, which were used to train and evaluate the performance of a Support Vector Machine (SVM) model in classifying the cohorts from signal quality features. Results showed that the SG cohort had significantly higher occurrences of bad signal quality in the majority of the fNIRS channels. Moreover, the SVM correctly classified the cohorts when using the ALL dataset. However, the performance dropped almost completely (except for six channels) when the SVM had to classify the cohorts using data from the GOOD dataset. These results suggest that fNIRS raw data obtained by different labs might possess different levels of quality as well as different latent characteristics beyond quality per se. The current study highlights the importance of defining clear guidelines for the conduction of fNIRS experiments and for the reporting of data quality in fNIRS manuscripts.


I. INTRODUCTION
Functional near-infrared spectroscopy (fNIRS) is a non-invasive optical neuroimaging technique. fNIRS allows changes in brain activity to be measured thanks to the relative transparency of biological tissues to near-infrared light [1]. fNIRS emits near-infrared light from a matrix of light sources positioned on the scalp. This light travels through the biological tissues on the surface of the head and is eventually detected by a series of light detectors. Since oxy- and deoxygenated hemoglobin possess distinct light absorption spectra, brain activity can be measured by comparing the amount of absorbed infrared light at different wavelengths.
Due to its advantages over more traditional neuroimaging techniques (e.g., electroencephalography and functional magnetic resonance imaging), such as cost-effectiveness and portability [2], the use of fNIRS is gaining momentum in social and cognitive neuroscience [3], [4], [5], [6]. Cross-cultural studies are an expanding topic in the field of cultural and developmental neuroscience, whose growth has been tremendously facilitated by the adoption of fNIRS [7]. One seminal example is the work of Lloyd-Fox and colleagues [8], who used fNIRS to compare attentional- and learning-related brain activity in infant data collected from the United Kingdom and Gambia. Using a similar approach, they were also able to compare the neural activity in response to social vs. non-social stimuli in infants from the United Kingdom, Gambia, and Bangladesh [9].
The problem of reproducibility and cross-comparability of fNIRS studies is one of the main current issues in neuroimaging [10]. As highlighted by Pinti et al. [11], there is high heterogeneity in the pre-processing pipelines and analytical procedures across fNIRS studies. Different research groups adopt different approaches when it comes to, for instance, assessing signal quality or correcting motion artifacts. In turn, different methodological choices lead to divergent results, undermining the reproducibility and validity of studies [12]. To tackle this issue, in recent years, several groups of researchers helped develop standardized and reproducible procedures for treating fNIRS signals (e.g., [13] for motion artifact treatment).
A crucial step in any fNIRS signal processing pipeline is the assessment of the Signal Quality (SQ), which reduces the impact of low-quality data on the research outcomes by excluding bad signals from subsequent analysis [14]. Notwithstanding its importance, there is currently no established reference nor consensus on how to assess the SQ of fNIRS signals. The adopted approaches range from completely subjective evaluations by human experts to completely automatized procedures based on SQ indicators, such as the Scalp Coupling (SC) and Scalp Coupling Power (SCP) [15], the Coefficient of Variation (CV) and Coefficient of Variation of the Wavelengths (CVW) [16], the Signal Quality Index [17], and the association with cardiac signals [18], among others.
On the other hand, some researchers have focused their work on assessing the role played by more technical aspects in influencing the reproducibility of fNIRS results (e.g., [19], [20]). Orihuela-Espina and colleagues [21] included the ambient light, the laboratory conditions, and the instrumentation in their taxonomy of experimental factors influencing fNIRS experimentation. More recently, Gemignani et al. [22] reported high reproducibility of the results in infant fNIRS studies across different NIRS machines, testing sites, and developmental populations.
A further source of variability that has been less investigated in the fNIRS literature is the experimenter, who plays a critical role in fostering the replicability of fNIRS results. In fact, many steps within a typical fNIRS experiment pipeline still heavily rely on subjective evaluations [21]. The researcher is critical in decisions regarding the positioning, location, and registration of sources and detectors. The high degree of subjectivity in the optode positioning seems to be one of the major factors undermining the within-individual reproducibility of fNIRS [23]. Previous research has also suggested that there are differences in how experts rate the quality of fNIRS data [24] or choose the appropriate processing parameters to identify motion artifacts. Ultimately, these non-measurable individual choices can have a direct impact on the acquired data. However, this potential issue has been rarely considered in the scientific literature, and little is known about how subjective experimental choices impact the latent characteristics of the collected signal at its very basic and fundamental level, beyond its "quality" per se. This poses a critical concern for studies that require data collected from different cohorts to be merged and compared in a meaningful and reproducible way, as, for instance, in cross-cultural or multi-lab studies.
The present study aims to raise awareness in the scientific community about how individual and non-measurable practices might influence the collected fNIRS data. Particularly, the study investigates to what extent fNIRS data collection, and the consequent signal characteristics, depend on individual research practices. Specifically, we aim to assess whether fNIRS signals collected by different experimenters show significant differences in terms of signal quality and signal characteristics. To do so, we replicated the same experimental settings in two different labs and countries, where two different teams of researchers collected resting-state fNIRS data. Subsequently, we combined statistical and data-driven methods to highlight differences between the two cohorts. First, we statistically compared the prevalence of bad-quality signals in the data collected in the two labs; then we applied a machine learning approach to classify the lab from which fNIRS signals were collected. The classification was performed both on a dataset with signal segments of good and bad quality and on a dataset with only good-quality signal segments. The use of the latter dataset allows investigating whether there are some latent characteristics in the signals, beyond those associated with signal quality, that allow differentiating among labs. This approach was needed because, differently from signal quality, the latent characteristics of fNIRS signals were not directly quantified in the current study.

II. MATERIALS AND METHODS
This study analyzes data collected in two cohorts by two different experimental labs, located in Singapore (SG) and Italy (ITA), and aims to investigate differences in fNIRS signal quality and signal latent characteristics. The acquired data underwent feature extraction and signal quality inference using an end-to-end deep learning approach [24], followed by an analysis of signal quality, dimensionality reduction, and classification of the cohorts using an SVM model. The methods used in this experiment provide a comprehensive approach to comparing fNIRS data quality between different experimental labs and cohorts.

A. Data Collection
Neural data were obtained from a cross-cultural study that aimed to investigate the neural underpinnings of role-play [25]. Role-play is a technique that is commonly used in clinical settings to alleviate psychopathological symptoms. Data were collected by two experimental labs: SG and ITA. The two labs have a high overlap in terms of experimental practices and training for the use of fNIRS. Data collections were approved by Nanyang Technological University (NTU-IRB-2021-03-013) and by the University of Trento (2022-059). The experiment was conducted following the guidelines of the Declaration of Helsinki; informed consent was obtained from all participants.
A total of 160 participants, corresponding to 80 dyads, were recruited for the study, which involved two cohorts: the SG cohort (N=76, 38 dyads) and the ITA cohort (N=84, 42 dyads). Data of two subjects in the SG cohort were lost due to technical issues during the experiment; thus, the total number of participants is N=158. All participants were aged between 18 and 35 years old. Participants in the SG cohort were recruited via convenience and snowball sampling from a University's research participation program, social media sites, and personal networks. Participants in the ITA cohort were recruited via convenience and snowball sampling from social media sites and personal networks. Across both cohorts, the recruited dyads consisted of people who had an existing peer relationship. All participants had no history of known and diagnosed health or neurological conditions, particularly conditions that may alter the oxygen-binding capacity of the blood. The two members of each dyad were identified using the letters "A" and "B".
The experimental session consisted of four phases during which the brain activity of both members of the dyad was monitored using fNIRS. The experiment started with a 2-min recording to assess the baseline brain activity of the participants in resting state. During this phase, participants were asked to sit silently in front of each other and to move their limbs as little as possible; only data from this phase were used in this study. The following three phases were: naturalistic conversation, role-play, and role reversal, each of which lasted for a total of 5 min.
The fNIRS signal was recorded from both members of each dyad using a tandem fNIRS scanning approach. The caps were equipped with 8 LED sources emitting light at wavelengths of 760nm and 850nm, along with 7 detectors, arranged in accordance with a standard prefrontal cortex (PFC) montage [3]. This configuration resulted in 20 fNIRS channels for each subject. The positioning of the channels followed the commonly used standard international 10-20 EEG layout; for the ITA cohort, the distance between sources and detectors never exceeded a maximum optimal distance of 3cm (this practice could not be enforced in the SG cohort due to a logistical lack of optode stabilisers). The same standardized acquisition procedures were used in both labs. For the SG cohort, a NIRSport device (NIRx Medical Technologies LLC) was used, with a sampling rate of 7.81Hz. For the ITA cohort, a NIRSport2 device (NIRx Medical Technologies LLC) was used, with a sampling rate of 10.17Hz.

B. Feature Extraction and Signal Quality Inference
To facilitate comparison between datasets, the acquired fNIRS signals were resampled to a common sampling rate of 10 Hz. Apart from the resampling, the collected data underwent no other signal processing steps, to ensure that the results are not affected by the processing pipeline and its parameters.
Subsequently, random segments of 20 seconds in length were independently extracted for each experimental session, participant, and channel.
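The resampling and segmentation steps can be sketched as follows. This is a minimal illustration in Python, assuming each channel is stored as a 1-D NumPy array; the function names and the example recording are hypothetical, not part of the study's code.

```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

TARGET_FS = 10.0   # common sampling rate (Hz)
SEGMENT_S = 20     # segment length (s)

def resample_to_common_rate(signal, fs_in, fs_target=TARGET_FS):
    """Resample a 1-D channel signal to the common sampling rate.

    Polyphase resampling requires an integer up/down ratio, so the
    rate change is approximated by a rational factor.
    """
    ratio = Fraction(fs_target / fs_in).limit_denominator(1000)
    return resample_poly(signal, ratio.numerator, ratio.denominator)

def extract_random_segment(signal, fs=TARGET_FS, seg_s=SEGMENT_S, rng=None):
    """Extract one random segment of seg_s seconds from a channel signal."""
    if rng is None:
        rng = np.random.default_rng()
    n_seg = int(seg_s * fs)
    start = rng.integers(0, len(signal) - n_seg + 1)
    return signal[start:start + n_seg]

# Example: a 2-min resting-state channel sampled at 7.81 Hz (SG device)
rng = np.random.default_rng(0)
raw = rng.standard_normal(int(120 * 7.81))
resampled = resample_to_common_rate(raw, fs_in=7.81)
segment = extract_random_segment(resampled, rng=rng)
print(segment.shape)  # (200,) -> 20 s at 10 Hz
```

In this sketch, one such segment would be drawn independently per session, participant, and channel, as described above.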
Each segment was given as input to a deep learning (DL) network trained to classify the quality of segments of fNIRS signals [24], [26]. The use of DL to assess signal quality has proven more accurate than approaches based on manual thresholds, and it represents a more objective signal quality control for fNIRS signals [24]. The network is composed of two parts: the first part includes a sequence of convolutional blocks that computes the 1280 features from the input segment; the second part is a sequence of fully connected layers that classifies the quality of the input segment based on the computed features [24]. From each segment, we thus derived the values of the 1280 features and the detected signal quality.
Two distinct datasets of fNIRS segments were obtained: the first dataset (ALL dataset) comprised all segments extracted from all signals and channels, including segments classified as having bad signal quality; the second dataset (GOOD dataset) was the subset of segments that were classified as having good signal quality. Due to the removal of segments with detected bad quality, the number of segments in the SG and ITA cohorts differs for each channel.
To be able to evaluate the generalisability of the predictive models, each dataset was split into two partitions. Data from members labeled as "A" were used for training the models ("Train" partitions), while data from members labeled as "B" were used for testing ("Test" partitions). The number of samples in the ALL dataset for the two cohorts is: Train partition: N_SG=36, N_ITA=42; Test partition: N_SG=38, N_ITA=42; these are the same for all channels. The number of samples in the GOOD dataset differs for each channel and is reported along with the classification performance in Table III.
The ALL dataset was also used to assess the difference in the prevalence of signals with bad quality between the two cohorts. The prevalence was compared using a chi-square test; the odds of having a signal with bad quality were computed for both the SG and ITA cohorts, and the effect size was reported as the odds ratio (OR) for the SG cohort compared to the ITA cohort. The OR was computed as the ratio between the odds of observing a bad-quality signal in the SG cohort and the corresponding odds in the ITA cohort.
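As an illustration, the chi-square comparison and the OR for a single channel could be computed as follows. This is a sketch with made-up counts, not the actual study data.

```python
import numpy as np
from scipy.stats import chi2_contingency

def bad_quality_odds_ratio(bad_sg, good_sg, bad_ita, good_ita):
    """Compare the prevalence of bad-quality segments between cohorts.

    Returns the chi-square p-value and the odds ratio (OR) of a
    bad-quality segment in the SG cohort relative to the ITA cohort.
    """
    table = np.array([[bad_sg, good_sg],
                      [bad_ita, good_ita]])
    _, p_value, _, _ = chi2_contingency(table)
    odds_sg = bad_sg / good_sg
    odds_ita = bad_ita / good_ita
    return p_value, odds_sg / odds_ita

# Illustrative counts for one channel (hypothetical numbers)
p, or_sg = bad_quality_odds_ratio(bad_sg=12, good_sg=24, bad_ita=3, good_ita=39)
print(f"p = {p:.4f}, OR = {or_sg:.2f}")  # OR = 6.50
```

Note that when a cohort has zero bad-quality segments in a channel, the OR is undefined, which is why some channels in Table I report no OR.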

C. Dimensionality Reduction
To reduce the dimensionality of the extracted features, a Principal Component Analysis (PCA) was performed to extract the five main components from the 1280 standardized features. It is important to note that the standardization parameters (mean and standard deviation) and the PCA transformation matrices were derived solely from the Train partitions to prevent information leakage between the Train and Test partitions.
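A leakage-free implementation of this step can be sketched with scikit-learn, fitting both the scaler and the PCA on the Train partition only; random matrices stand in for the actual deep features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.standard_normal((78, 1280))   # deep features, Train partition
X_test = rng.standard_normal((80, 1280))    # deep features, Test partition

# Standardization and PCA are fit on the Train partition only,
# then applied unchanged to the Test partition (no information leakage).
reducer = make_pipeline(StandardScaler(), PCA(n_components=5))
Z_train = reducer.fit_transform(X_train)
Z_test = reducer.transform(X_test)
print(Z_train.shape, Z_test.shape)  # (78, 5) (80, 5)
```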

D. Classification of Cohort
To classify the cohort from which the segments were derived, a Support Vector Machine (SVM) model with a radial basis function kernel was trained on the values of the five principal components computed by the PCA. The SVM model was selected due to its ability to handle high-dimensional data and its good performance in classification tasks. Model performance was evaluated using the Matthews Correlation Coefficient (MCC), a measure of the quality of binary classifications that takes into account true positives, true negatives, false positives, and false negatives [27].
The C parameter of the SVM model was optimized by grid search (C: 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000) using a 10 × 5-fold cross-validation scheme [28], [29] to enhance classification performance. Balanced class weighting was used; all other parameters were left at their default values. Finally, the final model was trained on the Train partition and evaluated on the Test partition to assess its classification performance and generalizability. To provide an indication of the reliability of the results, we employed a bootstrapping technique to compute the MCC along with 90% Confidence Intervals (90% CI). The bootstrapping process involved randomly selecting 25% of the samples from the dataset with replacement and then calculating the MCC score on this selected subset. The procedure was repeated 1000 times to obtain a distribution of MCC scores. To derive the overall MCC with 90% CI, we determined the 50th percentile as well as the 5th and 95th percentiles of the generated distribution. These values represent the central tendency and the range of the MCC scores, providing a measure of the classification performance with a 90% level of confidence. The same procedure was applied to both the ALL and GOOD datasets. Although the ALL dataset is representative of the signals collected in the two cohorts, the different prevalence of bad signal quality segments in the SG and ITA cohorts could introduce a bias in the predictive performance, since the network used to extract the 1280 deep features was trained to recognize the quality of the signals. The second dataset (GOOD dataset) was created to address this issue and obtain an estimation of the classification performance that was less influenced by the differences in signal quality between the two cohorts, and more by the non-measured latent characteristics of the signals.
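The training and evaluation procedure described above can be sketched as follows, using scikit-learn with synthetic data in place of the actual PCA components; the labels and sample sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Five PCA components per segment; labels: 0 = SG, 1 = ITA (synthetic data)
X_train = rng.standard_normal((78, 5)); y_train = rng.integers(0, 2, 78)
X_test = rng.standard_normal((80, 5));  y_test = rng.integers(0, 2, 80)

# RBF-kernel SVM; C optimized by grid search with a 10x5-fold CV scheme,
# balanced class weights, other parameters left at their defaults.
grid = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid={"C": [10.0 ** k for k in range(-5, 5)]},
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)

# Bootstrap: MCC on 25% resamples (with replacement), 1000 repetitions;
# report the 50th percentile and the 5th-95th percentile range (90% CI).
scores = []
n_boot = len(y_test) // 4
for _ in range(1000):
    idx = rng.integers(0, len(y_test), n_boot)
    scores.append(matthews_corrcoef(y_test[idx], y_pred[idx]))
mcc_lo, mcc_med, mcc_hi = np.percentile(scores, [5, 50, 95])
print(f"Test MCC = {mcc_med:.2f} (90% CI: {mcc_lo:.2f} to {mcc_hi:.2f})")
```

With random labels, as here, the MCC distribution is expected to center near zero; on the real data, this per-channel procedure yields the values reported in Tables II and III.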

III. RESULTS

A. Signal Quality
The quality of the signal segments was objectively detected using a DL classification model [24]. We focused on the segments of the ALL dataset to investigate the prevalence of bad signal quality segments in the two cohorts, for both the Train and Test partitions. Significant differences emerged (see Table I). From the results, it appears that the prevalence of bad-quality signals is higher in the SG cohort: six channels (2, 5, 8, 9, 10, 15) had no bad-quality segments in the ITA dataset (so it was not possible to compute the ORs); for all other channels, except for channels 3, 4, 11, and 19, the OR is greater than one. No notable differences in the OR values emerge between the Train and Test partitions. The chi-squared test reported significant results for most channels, in either the Train or Test partition; only channels 1, 13, and 18 showed no significant difference in any partition. We note that the p-values reported in Table I have not been corrected for multiple hypothesis testing, as the primary aim was to provide indications about the size of the differences, not to confirm any research hypotheses.

B. Machine Learning
We then evaluated the results of the predictive models that classify the cohort of the signal segments for the ALL and GOOD datasets.
Regarding the ALL dataset (Table II), we note that we obtained positive median MCC values in both partitions for most channels, with the exception of channels 1, 3, 4, 16, and 20, for which negative MCC values on the Test partition were obtained. Overall, the results on the Test partition are in general lower than on the Train partition, indicating that the training procedure was partially affected by overfitting. A more conservative interpretation of the results can be derived from the analysis of the 5%CI values obtained on the Test partition. In particular, six further channels (6, 11, 12, 13, 14, 18) reported negative MCC values, or values lower than MCC=0.1. The results for these channels indicate that the model was not able to robustly differentiate between the SG and ITA cohorts. For the remaining 9 channels, however, the 5%CI values range between MCC=0.19 and MCC=0.73, suggesting that, to different extents, the model was able to extract information to classify the two cohorts. The results on the ALL dataset, however, could be influenced by differences in the prevalence of bad-quality signals in the two cohorts, since the 1280 features used for the PCA were obtained by a DL network trained to classify the signal quality.
Regarding the GOOD dataset (Table III), which only includes segments with a detected good signal quality, we note that for the majority of the channels (14 channels) the 5%CI value on the Test partition is negative or lower than 0.1, suggesting that the results obtained on the ALL dataset were partially influenced by the different prevalence of bad-quality signals in the two cohorts. However, for the remaining six channels, the 5%CI ranges between 0.12 and 0.61. Specifically, the median bootstrap MCC values on the Train and Test partitions were, respectively: channel 2: 0.859/0.509; channel 7: 1.000/0.533; channel 8: 0.739/0.656; channel 10: 0.829/0.866; channel 17: 0.700/0.592; channel 18: 0.570/0.446. This suggests that the model could learn differences in the data, not immediately related to the signal quality, to discriminate the two cohorts.

TABLE III
PERFORMANCE OF THE CLASSIFICATION OF THE LAB OF ORIGIN OF A FUNCTIONAL NEAR-INFRARED SPECTROSCOPY SIGNAL SEGMENT OF THE DATASET THAT INCLUDES ONLY SEGMENTS WITH A DETECTED GOOD SIGNAL QUALITY. RESULTS ARE REPORTED IN TERMS OF THE MATTHEWS CORRELATION COEFFICIENT (MCC) (5%CI-95%CI) FOR BOTH THE TRAIN AND TEST PARTITIONS
IV. DISCUSSION

The current work aimed to investigate the differences in terms of signal quality and presence of latent characteristics in raw data collected by two labs in two different cohorts (SG and ITA). To do so, fNIRS resting-state data were used. A pre-trained DL network was used to classify the quality of fNIRS segments. A first analysis assessed the prevalence of bad-quality signal segments in the two cohorts by adopting a statistical approach. Subsequently, an SVM model was used to classify the cohorts from portions of fNIRS signals. The classification task was conducted with two different datasets: a dataset of segments with different quality (ALL dataset) and a dataset with only good-quality segments (GOOD dataset). Overall, results showed that data from the SG cohort were more likely to have bad-quality segments as compared to data collected from the ITA cohort. Accordingly, the machine learning model was able to classify the cohorts from the fNIRS signals. For the majority of channels, the performance of the SVM model dropped when the analysis was conducted on the GOOD dataset, which included only portions of signals with good quality. However, a correct classification of the cohort from the fNIRS signals was observed in six channels (channels 2, 7, 8, 10, 17, and 18) even when the machine learning analysis was conducted on the GOOD dataset. This result suggests that there could be some differences in the latent signal characteristics between the two cohorts that do not depend on the signal quality.
Before discussing the results, there is one limitation of the study that needs to be considered. Due to a lack of optode stabilisers, we could not enforce a maximum source-detector distance of 3cm in the SG cohort. However, to rule out the possibility that the results depended on this technical aspect, we considered the "unconstrained" source-detector distance when not using optode stabilisers, computed on a template head model. Out of all the channels with an unconstrained source-detector distance greater than 3.5cm, only channel 7 was part of the six channels for which we obtained a good classification performance on the GOOD dataset. This suggests that the different source-detector distances in the two cohorts did not play a significant role in this study and did not explain the observed results.
Technology might also have played a role in differentiating the fNIRS signals. The two labs collected data using different fNIRS machines, with the ITA lab using the most recent version of the same device. It might be that the technological advancements led to signals with better quality for the data in the ITA cohort. However, the technological differences are minimal and are not expected to cause such a significant variation in terms of signal quality [19], [20].
These findings might seem in contrast with the results from studies that report a high level of reproducibility of fNIRS studies (e.g., [22], [30], [31]). However, it should be noted that the current study focused on the characteristics of the raw collected data, not on the results of a complete fNIRS study. We demonstrate that different experimenters produce data with different characteristics, both in terms of signal quality and signal characteristics. Signal processing procedures that are applied to the raw collected signals are expected to reduce the influence of these differences. For instance, signals with bad quality are typically removed from the dataset, and differences are managed at the statistical level using approaches based on Generalized Linear Models [13], [32]. On the other hand, differences in the quality of raw data might even get amplified by heterogeneous approaches in the pre-processing and in the analytical procedures [11]. While these concerns should be better investigated in future research, the results of this study already speak to the urgency of shared procedures, standardization, and guidelines in fNIRS research. Moreover, the results of this study suggest that, beyond "signal quality" per se, different researchers might collect signals with some differences in their fundamental latent characteristics.
Large fNIRS studies might involve multiple experimenters in the data collection. Typically, experimenters are trained by practice to use the instrumentation and software so that they can independently collect new data, but this might not be enough. An additional effort should be directed at training researchers to compare the quality and characteristics of the collected data with gold standards, for instance, as done by Blasi and colleagues [33]. This would allow researchers to consistently assess their actual competence. The two teams of experimenters considered in this study were selected because they shared a common knowledge base, training history, and data collection procedures, built through previous research collaborations. The differences in terms of signal quality observed in the current work are therefore likely an underestimation of those that would emerge if two different, non-collaborating labs attempted to replicate each other's work.
Notwithstanding the common background, the lack of objective step-by-step guidelines for conducting fNIRS experiments might have left room for subjective interpretations. For instance, one researcher might be more rigorous than another when optimizing the light transmission between sources and detectors before starting the experiment. A study focusing on the assessment of fNIRS data quality [24] showed that different experts evaluate the quality of fNIRS signals differently. That study considered the signal quality assessment that is performed after the data collection, to decide whether or not to include a data point in the following analysis. However, it is also important to remember that a subjective signal quality assessment is always performed by the experimenter before the data collection, to evaluate whether the sources and detectors have been correctly positioned and whether the experimental conditions are optimal. Different experimenters might therefore produce data with different quality and characteristics.
Notably, the current study raises awareness among the scientific community about the existence of these differences in the collected data, paving the way for future systematic investigations of which factors are responsible for the observed differences. These results are significant in two ways. First, in terms of signal quality, fNIRS researchers should be able to "control" and minimize the signals with low quality. This is important because neural data are typically financially expensive and time- and effort-consuming for both the experimenter and the participant. Second, in terms of signal characteristics, it is important to raise awareness of the fact that other factors beyond signal quality might have an impact on the reproducibility of fNIRS studies.
Finally, while fNIRS is widely adopted in neuroscience and brain research studies, its adoption in real-world applications is growing, even in life-critical conditions. For instance, in the medical field it is being proposed as a tool to monitor the cognitive load of surgeons [34] or to identify brain death [35]. Within these contexts, the reliability and comparability of collected fNIRS data are even more crucial.
We argue that the reproducibility of fNIRS studies would benefit from clear guidelines for reporting the quality of collected data: not only in terms of the number of channels/clusters that are discarded during the analysis [14], but also in terms of the signal quality detected before the acquisition itself. Reporting this information could also motivate the experimenter to periodically revise the adopted procedures.

V. CONCLUSION
The current study aimed to assess differences in terms of signal quality and latent signal characteristics in data collected by two labs in different cohorts. Although the labs that collected the data highly overlap in terms of training and research practices for conducting fNIRS studies, differences between the two cohorts emerged, both from the statistical analysis of the prevalence of bad-quality signals and from the machine learning classification models. Future studies could adopt a similar approach to investigate whether current pre-processing and analytical practices amplify or buffer the differences in the quality of raw fNIRS data, and ultimately evaluate the impact of these initial differences in undermining the reproducibility of fNIRS studies. Overall, we argue that the reproducibility and cross-comparability of fNIRS studies would benefit from clear guidelines and gold standards not only for analyzing the data and reporting the results but also for the experiment setup and data collection.

TABLE I
PREVALENCE OF FUNCTIONAL NEAR-INFRARED SPECTROSCOPY SIGNALS WITH BAD QUALITY ACROSS THE TWO COHORTS FOR BOTH THE TRAIN AND TEST PARTITIONS OF THE ALL DATASET. THE ODDS RATIO (OR) REPRESENTS THE ODDS OF HAVING A BAD SIGNAL IN THE SINGAPORE (SG) COHORT AS COMPARED TO THE ITALY (ITA) COHORT. FOR EACH CHANNEL, THE DIFFERENCES IN THE PREVALENCE OF

TABLE II
PERFORMANCE OF THE CLASSIFICATION OF THE LAB OF ORIGIN OF A FUNCTIONAL NEAR-INFRARED SPECTROSCOPY SIGNAL SEGMENT OF THE DATASET THAT INCLUDES ALL SEGMENTS. RESULTS ARE REPORTED IN TERMS OF THE MATTHEWS CORRELATION COEFFICIENT (MCC) (5%CI-95%CI) FOR BOTH THE TRAIN AND TEST PARTITIONS