Automating Areas of Interest Analysis in Mobile Eye Tracking Experiments based on Machine Learning

For an in-depth, AOI-based analysis of mobile eye tracking data, a preceding gaze assignment step is inevitable. Current solutions such as manual gaze mapping or marker-based approaches are tedious and not suitable for applications manipulating tangible objects. This makes mobile eye tracking studies with several hours of recording difficult to analyse quantitatively. We introduce a new machine learning-based algorithm, the computational Gaze-Object Mapping (cGOM), that automatically maps gaze data onto respective AOIs. cGOM extends state-of-the-art object detection and segmentation by mask R-CNN with a gaze mapping feature. The new algorithm's performance is validated against a manual fixation-by-fixation mapping, which is considered the ground truth, in terms of true positive rate (TPR), true negative rate (TNR) and efficiency. Using only 72 training images with 264 labelled object representations, cGOM is able to reach a TPR of approx. 80% and a TNR of 85% compared to the manual mapping. The break-even point is reached at 2 hours of eye tracking recording for the total procedure, or at 1 hour when considering human working time only. Together with the real-time capability of the mapping process after completed training, even hours of eye tracking recording can be evaluated efficiently. (Code and video examples have been made available at: https://gitlab.ethz.ch/pdz/cgom.git)


Introduction
Areas of Interest (AOIs) are widely used for stimuli-driven, quantitative analysis of eye tracking data and allow the determination of important metrics such as dwell time or transitions (Holmqvist et al., 2011). Despite the progress in eye tracking software over the last years, AOI analysis for mobile eye trackers is still an error-prone and time-consuming manual task. In particular, this applies to studies in which the participants move around and interact with tangible objects, as is often the case for usability testing in real-world applications (Mussgnug, Singer, Lohmeyer, & Meboldt, 2017). As a result of these challenges, many scientists hesitate to use mobile eye tracking in their research even though it is often the appropriate tool for the study design (Vansteenkiste, Cardon, Philippaerts, & Lenoir, 2015).

Various methods exist for assigning gaze data to the respective AOIs, such as manual frame-by-frame or fixation-by-fixation analysis and dynamic AOIs based on either key frames or different types of markers. Ooms et al. (2015) state that dynamic AOIs based on interpolation between key frames are generally not suitable for interactive eye tracking studies. Vansteenkiste et al. (2015) add that for experiments in natural settings, it is almost inevitable to manually assign the gaze point frame-by-frame to a static reference image or, as proposed in their paper and state of the art by now, by means of a fixation-by-fixation algorithm. These manual methods are very effective and applicable to virtually any case, but also highly tedious. Over the last few years, marker-based approaches using visible, infrared or natural markers have become more and more common and are now widely used for the automated computation of AOIs (Kiefer, Giannopoulos, Kremer, Schlieder, & Raubal, 2014; Pfeiffer & Renner, 2014; Zhang, Zheng, Hong, & Mou, 2015). Although the use of markers can accelerate the evaluation process enormously, they limit the types of scenes that can be analyzed (Evans, Jacobs, Tarduno, & Pelz, 2012).
Applied to interactive experiments with tangible objects, markers represent a potential disturbance factor for the analysis of natural attentional distribution, cannot be attached to small objects because of the necessary minimum detectable size, must face the front camera to be detected, and generally cannot be used for objects that move and rotate during the experiment (e.g. a rolling ball).
To overcome these limitations, object detection algorithms could be applied directly to the objects of interest instead of to markers (De Beugher, Brone, & Goedeme, 2014). In recent years, major breakthroughs in object detection have been achieved by machine learning approaches based on deep convolutional neural networks (deep CNNs) (Garcia-Garcia et al., 2018). Until recently, CNN-based object detection algorithms were only able to roughly predict the position of an object by means of bounding boxes (Chukoskie et al., 2018). Figure 1 (left) shows the disadvantage of such a rectangular AOI using a simple diagonally placed pen as an example. The excessive size and the shape of the AOI can lead to high error rates, in particular in experimental setups in which overlapping is expected (Orquin, Ashby, & Clarke, 2016). In 2017, mask R-CNN was introduced (He, Gkioxari, Dollar, & Girshick, 2017) as one of the first deep CNNs that not only detects objects, but also outputs binary masks that follow the objects' close contours (Figure 1, right). In this article, a study is conducted that compares AOI analysis using Semantic Gaze Mapping (SGM), which is integrated in SMI BeGaze 3.6 (SensoMotoric Instruments, Teltow, Germany) and is considered the ground truth, with an AOI algorithm based on mask R-CNN that is introduced here for the first time.
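The core mapping step of such a mask-based approach can be illustrated with a short Python sketch (the function and variable names are ours for illustration and not taken from the cGOM code base): for a given scene-video frame, the gaze point is assigned to the label of the first predicted instance mask that contains the gaze pixel, and to the background otherwise.

```python
import numpy as np

def map_gaze_to_aoi(masks, labels, gaze_xy, background="BG"):
    """Assign a single gaze point to an AOI label.

    masks   : list of HxW boolean arrays, one per detected object instance
              (e.g. the binary masks predicted by a mask R-CNN model)
    labels  : list of class names parallel to `masks` (e.g. "syringe", "bottle")
    gaze_xy : (x, y) gaze position in pixel coordinates of the scene frame
    """
    x, y = int(round(gaze_xy[0])), int(round(gaze_xy[1]))
    for mask, label in zip(masks, labels):
        h, w = mask.shape
        # Only gaze samples inside the frame and inside a mask are assigned.
        if 0 <= y < h and 0 <= x < w and mask[y, x]:
            return label
    # No mask contains the gaze point: count it towards the background.
    return background


# Toy example: one detection in a 720p frame, gaze lands inside its mask.
frame_masks = [np.zeros((720, 1280), dtype=bool)]
frame_masks[0][300:420, 600:700] = True          # rough "syringe" region
print(map_gaze_to_aoi(frame_masks, ["syringe"], (640, 360)))   # -> syringe
```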
Semantic Gaze Mapping. SGM is a manual fixation-by-fixation analysis method used to connect the gaze point of each fixation to the underlying AOI in a static reference view (as described by Vansteenkiste et al., 2015). Successively, for each fixation of the eye tracking recording, the fixation's middle frame is shown to the analyst (e.g. for a fixation consisting of seven frames, only the fourth frame is displayed). The analyst then evaluates the position of the gaze point in that frame and manually assigns the fixation to the corresponding AOI in the reference view.
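The choice of the displayed frame can be made explicit with a small sketch (illustrative code, not part of SGM or BeGaze), assuming the fixation spans a known, inclusive range of scene-video frames:

```python
def middle_frame(first_frame, last_frame):
    """Return the frame shown to the analyst for one fixation.

    The fixation spans the inclusive frame range [first_frame, last_frame];
    for an even number of frames the lower of the two middle frames is
    assumed here.
    """
    n_frames = last_frame - first_frame + 1
    return first_frame + (n_frames - 1) // 2


# A fixation covering seven frames (10..16): the fourth frame (13) is displayed.
assert middle_frame(10, 16) == 13
```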

Methods
The study presented in this article consisted of two parts. Firstly, a handling task was observed in order to create a homogeneous data set in a fully controlled test environment. Secondly, the main study was conducted by analysing the data sets of the handling task while varying the evaluation method between the two factor levels SGM (Semantic Gaze Mapping) and cGOM (computational Gaze-Object Mapping).

Handling Task
Participants. Ten participants (9 male, 1 female; mean age 26.6 years, range 21-30 years) conducted the handling task wearing the eye tracking glasses. All participants had normal or corrected-to-normal vision and were either mechanical engineering students or PhD students.

Design
For the main study, the data set of the handling task was analyzed by the two evaluation methods SGM and cGOM.
Both evaluation methods were compared in terms of conformance with the ground truth and efficiency, quantified through the two dependent variables (i) fixation count per AOI and (ii) required time for each evaluation step. For the calculation of the fixation count per AOI, three AOIs were defined: all syringes were labelled as syringe without further differentiation, the disinfectant dispenser was referred to as bottle, and all gaze points that fell on neither of these AOIs were assigned to the background ("BG").
True positive rates (TPR) and true negative rates (TNR) were calculated for the AOIs syringe and bottle to evaluate the effectivity of the algorithm. The TPR describes how many of the fixations that the ground truth assigns to an AOI were also assigned to it by the cGOM tool, while the TNR describes the same comparison for the fixations that the ground truth does not assign to that AOI.
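How these per-AOI rates follow from the two parallel, fixation-level label sequences (the SGM ground truth and the cGOM output) can be sketched as follows; the function name and toy data are illustrative and not taken from the published cGOM code:

```python
from collections import Counter

def tpr_tnr(ground_truth, predicted, aoi):
    """TPR and TNR of the automated mapping for one AOI.

    ground_truth, predicted : parallel lists with one AOI label per fixation
                              (e.g. "syringe", "bottle", "BG")
    aoi                     : the AOI label being evaluated
    """
    tp = fn = tn = fp = 0
    for gt, pred in zip(ground_truth, predicted):
        if gt == aoi:                 # ground truth: fixation lies on this AOI
            tp += pred == aoi
            fn += pred != aoi
        else:                         # ground truth: fixation does not lie on it
            fp += pred == aoi
            tn += pred != aoi
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy data: five fixations labelled by the ground truth (SGM) and by cGOM.
gt   = ["syringe", "syringe", "bottle", "BG", "syringe"]
cgom = ["syringe", "BG",      "bottle", "BG", "syringe"]
print(Counter(cgom))                  # fixation count per AOI (variable (i))
print(tpr_tnr(gt, cgom, "syringe"))   # (0.666..., 1.0)
```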
Even though cGOM is able to assign the gaze point of every frame to the corresponding AOI, for reasons of comparability the assignment was also performed using only the fixations' middle frames. For the comparison of efficiency, the sub-times of the total mapping process are presented in Table 2.

Effectivity evaluation. Figure 4 shows the results for TPR and TNR.
Efficiency of the computational mapping. The whole process using the algorithm required 236 minutes in total. The exact times of the subtasks of the mapping process are shown in Table 3.

Discussion
The goal of the main study was to investigate whether the newly introduced machine learning-based algorithm cGOM has the potential to replace conventional, manual AOI evaluation in experimental setups with tangible objects. To this end, manual gaze mapping using SGM, which was considered the ground truth, was compared with cGOM in terms of performance. In the process, it was quantified whether the new algorithm is able to effectively map gaze data to AOIs (RQ1) and from which recording duration onwards the algorithm works more efficiently than manual mapping (RQ2). Based on the results of this study, we evaluate both research questions.

(RQ1) How effective is cGOM in assigning fixations to respective AOIs in comparison with the ground truth?
The two objects used during the handling task were deliberately selected because they represent potential challenges for machine learning. On the one hand, the syringes are partially transparent and constantly change their length during decanting. On the other hand, both the syringes and the bottle have partly tapered contours, which were assumed to be difficult to reproduce with close-contour masks, in particular when working with small training data sets. According to the results presented in Figure 4, the assignment by the computational mapping reaches a TPR of 79% for the AOI syringe and 58% for the AOI bottle.