Computational Testing for Automated Preprocessing: a Matlab toolbox to enable large scale electroencephalography data processing

ABSTRACT
EEG is a rich source of information regarding brain functioning. However, the preprocessing of EEG data can be quite complicated, due to several factors. For example, the distinction between true neural sources and noise is indeterminate; EEG data can also be very large. These various factors create a large number of subjective decisions with consequent risk of compound error. Existing tools present the experimenter with a large choice of analysis methods. Yet it remains a challenge for the researcher to integrate methods for batch processing of the typically large datasets, and to compare methods to choose an optimal approach across the many possible parameter configurations. Additionally, many tools still require a high degree of manual decision making for, e.g., the classification of artefacts in channels, epochs or segments. This introduces extra subjectivity, is slow, and is not reproducible. Batching and well-designed automation can help to regularise EEG preprocessing, and thus reduce human effort, subjectivity, and consequent error.
We present the Computational Testing for Automated Preprocessing (CTAP) toolbox, to facilitate: i) batch processing that is easy for experts and novices alike; ii) testing and manual comparison of preprocessing methods. CTAP extends the existing data structure and functions from the well-known EEGLAB toolbox, based on Matlab, and produces extensive quality control outputs. CTAP is available under MIT licence from https://github.com/bwrc/ctap .


INTRODUCTION
Measurement of human electroencephalography (EEG) is a rich source of information regarding certain aspects of brain functioning, and is the most lightweight and affordable method of brain imaging. Although it can be possible to see certain large effects without preprocessing at all, in the general case EEG analysis requires careful preprocessing, with some degree of trial-and-error. Such difficult EEG preprocessing needs to be supported with appropriate tools. The kinds of tools required for signal processing depend on the properties of the data, and the general-case properties of

• feature and raw data export

We will next briefly motivate each of the benefits above.

Incomplete runs: A frequent task is to make a partial run of a larger analysis. This happens, for example, when new data arrives or when the analysis fails for a few measurements. The incomplete run might involve a subset of a) subjects, b) measurements, c) analysis branches, d) collections of analysis steps, e) single steps; or any combination of these. CTAP provides tools to make these partial runs while keeping track of the intermediate saves.

Computer Science
Bookkeeping: A given EEG analysis workflow can have several steps, branches to explore alternatives, and a frequent need to reorganise analysis steps or to add additional steps in between.

Combined with incomplete runs, these requirements call for a system that can find the correct input file based on step order alone. CTAP does this, and saves researchers time and energy for more productive tasks.

Error handling: Frequently, simple coding errors or abnormal measurements can cause a long batch run to fail midway. CTAP catches such errors, saves their content into log files for later reference, and continues the batch run. For debugging purposes it is also possible to override this behaviour and use Matlab's built-in debugging tools to solve the issue.

level tools to work with mixed text and numeric data. To this end, CTAP provides its own format for storing data, and several export options. Small datasets can be exported as, e.g., comma-delimited text (csv), while larger sets are more practically saved in an SQLite database. CTAP also offers the possibility to store single-trial and average ERP data in HDF5 format, which makes export to, e.g., R and Python simple.

In summary, CTAP lets the user focus on content, instead of time-consuming implementation of foundation functionality. In the rest of the paper, we describe how the CTAP toolbox does this, using a synthetic dataset as a running example. We start with related work, followed by the Materials & Methods section detailing the architecture and usage of CTAP. The Results section then describes the technical details and outcomes of a

1 For example, we conducted a search of the SCOPUS database for articles published after 1999, with "EEG" and "electroencephalography" in the title, abstract, or keywords, plus "Signal Processing" or "Signal Processing, Computer-Assisted" in keywords, and restricted to the subject areas "Neuroscience", "Engineering" or "Computer Science". The search returned over 300 hits, growing year-by-year from 5 in 2000 up to a mean value of 36 between 2010 and 2015.
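The catch-and-continue batch behaviour described above can be pictured with a minimal sketch; the names used here (run_pipe, measurements, log_file, debug_mode) are illustrative stand-ins, not CTAP's actual internals:

```matlab
% Sketch of catch-and-continue batch error handling, as described above.
% All names are illustrative assumptions, not CTAP's actual internals.
measurements = struct('id', {'subj01', 'subj02'});    % dummy batch list
log_file     = 'ctap_errors.log';
debug_mode   = false;                                 % set true to debug
run_pipe     = @(m) fprintf('processing %s\n', m.id); % stand-in for the pipe

for i = 1:numel(measurements)
    try
        run_pipe(measurements(i));   % all analysis steps for one file
    catch ME
        if debug_mode
            rethrow(ME);             % override: stop and use Matlab debugger
        end
        % log the error and continue, so one bad file cannot kill the batch
        fid = fopen(log_file, 'a');
        fprintf(fid, '%s: %s\n', measurements(i).id, ME.message);
        fclose(fid);
    end
end
```

The design point is that a failure is recorded per measurement, so the remainder of a long batch run is never wasted.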
bookkeeping gives the user a distinct advantage over the common approach of 'EEGLAB + a few scripts', which seems simple on its face, but in practice is non-trivial as the number and complexity of operations grows. As all algorithms added to CTAP produce quality control outputs automatically, fast performance comparison is possible between methods or method parameters, speeding the discovery of (locally) optimal solutions. The system has the potential to enable such parameter optimization by automated methods, although this is not yet implemented.

The core activity of CTAP is preprocessing EEG data by cleaning artefacts, i.e. detection and either correction or removal of data that is not likely to be attributable to neural sources. Running an analysis with CTAP requires specifying:

• what analysis functions to apply, and in which order (analysis pipe)
• the analysis environment and parameters for the analysis functions (configuration)
• which EEG measurements/files to process (measurement configuration)

Typically, the analysis is run by calling a single script that defines all of the above and passes these on to the CTAP_pipeline_looper.m function, which performs all requested analysis steps on all specified measurements. In the following, we describe in more detail how the configurations

Manuscript to be reviewed
the whole analysis in smaller chunks, and to manually check the mid-way results as often as needed, e.g., while debugging. Further on, the ability to create branches is important to help explore alternative ways of analysing the same data.

To specify the order of steps and sets within a pipe, we recommend creating a single m-file for each intended pipe⁴. This file will define both the step sets and all the custom parameters to be used in the steps. Default parameters are provided, but it is optimal to fine-tune the behaviour by providing one's own parameters. Both pipe and parameter information is handled using data structures, rather than hard-coding. CTAP then handles assignment of parameters to functions based on name matching.
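A pipe-definition m-file of the kind described above might look like the following sketch. The field names (stepSets, id, funH) and the parameter-matching structure are simplified assumptions for illustration; the CTAP repository examples show the documented interface:

```matlab
% Sketch of a pipe-definition m-file, following the description above.
% Field names are illustrative assumptions, not CTAP's documented API.
stepSet(1).id   = '1_load';
stepSet(1).funH = {@CTAP_load_data, @CTAP_reref_data};   % steps, in order

stepSet(2).id   = '2_artefacts';
stepSet(2).funH = {@CTAP_detect_bad_channels, @CTAP_reject_data};

% Custom parameters are held in a struct and matched to functions by
% name, rather than hard-coded into the pipe:
params.detect_bad_channels.method = 'variance';

Cfg.pipe.stepSets = stepSet;   % the pipe: step sets, in execution order
Cfg.ctap          = params;    % the configuration: per-function parameters
```

Keeping both pipe and parameters as data structures is what lets CTAP reorganise steps, insert new ones, and locate the correct intermediate file by step order alone.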

Once the steps and their parameters are defined, the last requirement to run the pipe is to define the input data. In CTAP, the input data are specified using a

Users can also call the ctapeeg_*.m functions directly as part of their own custom scripts, since these are meant to be used like, e.g., any EEGLAB analysis function.

Analysis results are saved separately for each pipe. A typical structure contains:

221
• intermediate results as EEGLAB datasets, in one directory per step set; names are taken from the step set IDs as defined by the user, prefixed by step number.

223
• export directory: exported feature data (txt, csv or SQLite format).

224
• features directory: computed EEG features in Matlab format.

225
• logs directory: log files from each run.

• quality control directory: quality control plots, reflecting the visualisations of analysis steps chosen by the user.

Apart from running the complete pipe at once, the user has many options to run just a subset of the pipe, analyse only certain measurements, or otherwise adjust usage.

Before-and-after 'Peeks': The CTAP_peek_data.m function is called near the start (after initial loading and re-referencing) and at the end of the pipe. Visual inspection of raw data is a fundamental step in EEG evaluation, and quantitative inspection of channel-wise statistics is also available. A logical approach is to compare raw data at the same time-points from before and after any correction

The blink template option compares mean activity of detected blink events to activations for each IC.

CTAP_filter_blink_ica.m is used to filter blink-related IC data, and reconstruct the EEG

In this section, we show the output of CTAP as applied to the synthetic dataset, based on the analysis-pipe steps shown above. The pipe outputs ∼30MB of EEG data after each step set; thus, after debugging, all steps can be expressed as one set, and the data will occupy ∼62MB (before and after processing). Additionally, the quality control outputs of this pipe occupy ∼70MB of space, mostly in the many images of the peek-data and reject-data functions.

Before-and-after 'Peeks'
Raw data: Figure 3 shows raw data before and after preprocessing.

has the desired effect on the power spectrum, and that its response to a unit step function is reasonable.

Bad channels: In total, 10 bad channels were found, which included all six 'wrecked' channels; this shows the algorithm is slightly greedy, which is probably preferable in the case of a high-resolution electrode set with over 100 channels. Bad channels are rejected and interpolated before proceeding (not plotted, as it is a straightforward operation).

Bad segments: An example of bad segment detection, using simple histogram-based amplitude thresholding, is shown in Figure 10. In this case the bad data is high-amplitude EMG, but in a general setting, e.g., motion artefacts often exhibit extreme amplitudes. Using these figures, the user can quickly check what kind of activity exceeds the amplitude threshold in the dataset.

EEG amplitude histograms for four channels (A) before preprocessing, and (B) after preprocessing. The fitted normal probability density function (PDF) is shown as a red solid curve. Upper and lower 2.5% quantiles are vertical black solid lines; data inside these limits was used to estimate the trimmed standard deviation (SD), and the normal PDF fitted using the trimmed SD is shown as a black solid curve. The distribution mean is the vertical dashed blue line. Channel D15 has clearly been detected as bad, removed and interpolated.

Of the 50 EMG artefacts inserted in the synthetic data, 37 still existed, at least partially, at the end of the pipe. The low rejection percentage is due to the fact that EMG manifests more as a change in the frequency spectrum than in amplitude, yet the pipe looked for deviant amplitudes only.

Figure 6. Scatter plot of the criterion used to detect blinks. The horizontal axis shows the criterion value, while the vertical axis is random data to avoid over-plotting. The classification is done by fitting two Gaussian distributions using the EM algorithm and assigning labels based on likelihoods.
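The trimmed-SD fit described in the histogram caption can be sketched as follows. This is a simplified illustration of the idea, not CTAP's actual implementation; the 3-SD cut-off is an assumed example, and quantile requires the Statistics and Machine Learning Toolbox (it is built in to Octave):

```matlab
% Sketch of the trimmed-SD amplitude criterion described above: drop the
% upper and lower 2.5% of samples, estimate the SD from the rest, and
% flag extreme amplitudes. Illustrative only, not CTAP's implementation.
x = randn(10000, 1);               % stand-in for one channel's amplitudes
q = quantile(x, [0.025 0.975]);    % upper and lower 2.5% quantile limits
core = x(x >= q(1) & x <= q(2));   % data inside the limits
sd_trim = std(core);               % trimmed SD, robust to extreme values
mu = mean(x);
is_bad = abs(x - mu) > 3 * sd_trim;   % flag samples beyond, e.g., 3 SDs
```

Trimming before estimating the SD keeps high-amplitude artefacts from inflating the threshold that is supposed to detect them.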

We have presented CTAP, an EEG preprocessing workflow-management system that provides extensive functionality for quickly building configurable, comparative, exploratory analysis pipes.

exactly what CTAP provides.

As different analysis strategies and methods can vary greatly, CTAP was implemented as a modular system. Each analysis can be constructed from discrete steps, which can be implemented as stand-alone functions. As CTAP is meant to be extended with custom analysis functions, the interface between core CTAP features and external scripts is well defined in the documentation. The only requirement is to suppress any pop-ups or GUI elements, which would prevent the automatic execution of the analysis pipe¹¹. It is also up to the user to call the functions in the right order.

The system supports branching. This means that the analysis can form a tree-like structure, where some stage is used as input for multiple subsequent workflows. To allow this, any pipe can act as a starting point for another pipe. The CTAP repository provides a simple example to get the user going. For branches to appear, the bare minimum is a collection of three pipes, of which one is run

11 As noted above, for this reason much original code has been refactored to avoid runtime-visible or focus-grabbing outputs. The ultimate aim is for CTAP to interface directly to Matlab functions to remove the dependency on EEGLAB releases, while retaining compatibility with the EEGLAB data structure.

first. The other two both act on this output, but in different ways. Currently the user is responsible for calling the pipes of a branched setting in a meaningful order. However, this is straightforward to implement, and having the analysis logic exposed in the main batch file makes it easy to, e.g., run only a subset of the branches.
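The minimal three-pipe branched setting could be pictured as below; the pipe names and the run_pipe stand-in are invented for illustration and are not from the CTAP repository example:

```matlab
% Sketch of the minimal branched setting described above: one root pipe
% whose saved output is the starting point of two alternative pipes.
% All names are invented for illustration.
run_pipe = @(name, src) fprintf('%s reads from %s\n', name, src);

run_pipe('pipe_root',     'raw_data');    % run first; saves its output
run_pipe('pipe_branch_A', 'pipe_root');   % e.g., ICA-based cleaning
run_pipe('pipe_branch_B', 'pipe_root');   % e.g., filtering-based cleaning
```

Because the calling order is explicit in the batch file, running only one branch is a matter of commenting out a single line.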

Although CTAP works as a batch processing pipeline, it supports seamless integration of manual operations. This works such that the user can define a pipeline of operations, insert save points at appropriate steps, and work manually on that data before passing it back to the pipe. The main extra benefit that CTAP brings is to handle bookkeeping for all pipeline operations, such that manual operations become exceptional events that can be easily tracked, rather than one more in a large number of operations to manage.

CTAP never overrides the user's configuration options, even when these might break the pipe.

For example, CTAP_reject_data.m contains code to auto-detect the data to reject. However, the user can set this option explicitly, and can do so without having first called any corresponding detection function, which will cause preprocessing on that file to fail. Allowing this failure to happen is the most straightforward approach, and ultimately more robust. Combined with an informative error message, the user gets immediate feedback on what is wrong with the pipe.

On the other hand, CTAP does provide several features to handle failure gracefully. As noted, the pipe will not crash if a single file has an unrecoverable error, although that file will not be processed

In contrast to many analysis plugins built on top of EEGLAB, no GUI was included in CTAP.

While GUIs have their advantages (more intuitive data exploration, easier for novice users, etc.), there is a very poor return on investment in adding one to a complex batch-processing system like CTAP. A GUI also sets limits on configurability, and can constrain automation if CTAP is executed on hardware without graphical capabilities. The absence of a GUI also makes the development of extensions easier, as there are fewer dependencies to handle.

In contrast to many other broad-focus physiological data analysis tools, CTAP is designed to meet a very focused goal with a specific approach. This does, however, create some drawbacks.

Compared to scripting one's own pipeline from scratch, there are usage constraints imposed by the heavy use of struct-passing interfaces. Some non-obvious features may take time to master, and it can be difficult (albeit unnecessary) to understand the more complex underlying processes.

CTAP is also built to enable easy further development by third parties, through standardised interfaces and structures. This was a feature of the original EEGLAB code, but contrasts with many of the EEGLAB-compatible tools released since, whose functionality was often built in an ad hoc manner. The main requirement for development is to understand the content and purpose of the EEG.CTAP field (which is extensively documented in the wiki), and the general logic of CTAP.

Developers can easily extend the toolbox by using (or emulating) the existing ctapeeg_*.m functions, especially the ctapeeg_detect_*.m functions, which are simply interfaces to external tools for detecting artefacts. Existing CTAP_*.m functions can be relatively more complex to understand, but the existing template provides a guideline for development with the correct interface.
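A new detection function in the spirit of the ctapeeg_detect_*.m family might be sketched as below. The exact signature, parameter names and return convention are assumptions for illustration only; the wiki documents the real interface:

```matlab
% Sketch of a detection function emulating the ctapeeg_detect_*.m style
% described above. The signature and field names are illustrative
% assumptions, not CTAP's documented interface.
function [is_bad, params] = ctapeeg_detect_extreme_amplitude(EEG, varargin)
    % Parse name-value parameters, with a documented default
    p = inputParser;
    p.addParameter('sdFactor', 3);       % threshold in standard deviations
    p.parse(varargin{:});
    params = p.Results;

    % Flag samples where any channel exceeds the amplitude threshold;
    % no pop-ups or GUI elements, so the batch pipe can run unattended
    sd = std(double(EEG.data(:)));
    is_bad = any(abs(EEG.data) > params.sdFactor * sd, 1);
end
```

Note that the function takes its options as name-value pairs and opens no GUI, satisfying the two interface requirements stated above.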

Future work
CTAP is far from finalised, and development will continue after the initial release of the software.

The main aim of future work is to evolve CTAP from workflow management towards better automation, with computational comparative testing of analysis methods, to discover optimal parameters and help evaluate competing approaches.

As stated above, the potential to fully automate EEG processing is constrained by the indeterminacy of EEG: known as the inverse problem, this means that it is not possible to precisely determine a ground truth for the signal, i.e. a unique relationship to neural sources. The signal can also be highly variable between individuals, and even between intra-individual recording sessions (Dandekar et al., 2007). These factors imply that there cannot be a general algorithmic solution to extract neurally-generated electrical field information from EEG, thus always requiring some human intervention. By contrast, for example in magnetoencephalography, certain physical properties of the system permit inference of sources even from very noisy data (Taulu and Hari, 2009) (although recording of clean data is always preferable, it is not always possible, e.g. with deep brain stimulation patients; Airaksinen et al., 2011).

Many publications have described methods for processing EEG for different purposes, such as removing artefacts, estimating signal sources, analysing event-related potentials (ERPs), and so on. However, despite the wealth of methodological work done, there is a lack of benchmarking, or of tools for the comparison of such methods. The outcome is that the most reliable way to assess each method is to learn how it works, apply it, and test the outcome on one's own data: a highly time-consuming process which is hardly competitive with simply performing the bulk of preprocessing in a manual way, as seems to remain the gold standard.
The effect of each method on the data is also not commonly characterised, such that methods to correct artefacts can often introduce noise to the data, especially where there was no artefact (false positives).

Thus, we also aim to enable testing and comparison of automated methods for preprocessing.

This is still work in progress, as we are building an extension for CTAP that improves testing and comparison of preprocessing methods by repeated analyses on synthetic data. This extension, tentatively titled Handler for sYnthetic Data and Repeated Analyses (HYDRA), will use synthetic data to generate ground-truth controlled tests of preprocessing methods. It will have the capability to generate new synthetic data matching the parameters of the lab's own data, and to compare the outcomes of methods applied to this data in a principled computational manner. This will allow experimenters to find good methods for their data, or developers to flexibly test and benchmark their novel methods.

Another desirable, though non-vital, future task is to expand the quality control output to include functionality such as statistical testing of detected bad data, so that the experimenter can make a more informed decision. Although statistical testing is already implied in many methods of bad data detection, it is not visible to users. This will take the form of automated tools to compare the output from two (or more) peeks, to help visualise changes in both baseline level and local wave forms.

Such aims naturally complement the work of others in the field, and it is hoped that opportunities will arise to pool resources and develop better solutions through collaboration.

The ultimate goal of CTAP is to improve on typical ways of preprocessing high-dimensional EEG data, through a structured framework for automation. We will meet this goal via the following three steps: a) facilitate processing of large quantities of EEG data; b) improve the reliability and objectivity of such processing; c) support the development of smart

algorithms to tune the thresholds of statistical selection methods (for bad channels, epochs, segments or components) to provide results which are robust enough to minimise manual intervention.

We have now addressed aim a), partly also b), and laid the groundwork to continue developing solutions for c). Thus the work described here provides the solid foundation needed to complete CTAP, and thereby help to minimise human effort, subjectivity and error in EEG analysis; and facilitate easy, reliable batch processing for experts and novices alike.