<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">

<article article-type="research-article" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
 <front>
    <journal-meta>
	<journal-id journal-id-type="publisher-id">Jemr</journal-id>
      <journal-title-group>
        <journal-title>Journal of Eye Movement Research</journal-title>
      </journal-title-group>
      <issn pub-type="epub">1995-8692</issn>
	  <publisher>								
	  <publisher-name>Bern Open Publishing</publisher-name>
	  <publisher-loc>Bern, Switzerland</publisher-loc>
	</publisher>
    </journal-meta>
    <article-meta>
	<article-id pub-id-type="doi">10.16910/jemr.11.6.6</article-id> 
	  <article-categories>								
				<subj-group subj-group-type="heading">
					<subject>Research Article</subject>
				</subj-group>
		</article-categories>
      <title-group>
        <article-title>Automating Areas of Interest Analysis in Mobile Eye Tracking Experiments based on Machine Learning</article-title>
      </title-group>
	   <contrib-group> 
				<contrib contrib-type="author">
					<name>
						<surname>Wolf</surname>
						<given-names>Julian</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>
				<contrib contrib-type="author">
					<name>
						<surname>Hess</surname>
						<given-names>Stephan</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>
				<contrib contrib-type="author">
					<name>
						<surname>Bachmann</surname>
						<given-names>David</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>
				<contrib contrib-type="author">
					<name>
						<surname>Lohmeyer</surname>
						<given-names>Quentin</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>
				<contrib contrib-type="author">
					<name>
						<surname>Meboldt</surname>
						<given-names>Mirko</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>                        				
        <aff id="aff1">
		<institution>ETH Zürich</institution>,   <country>Switzerland</country>
        </aff>
		</contrib-group>   

		
	  <pub-date date-type="pub" publication-format="electronic"> 
		<day>10</day>  
		<month>12</month>
        <year>2018</year>
      </pub-date>
	  <pub-date date-type="collection" publication-format="electronic"> 
	  <year>2018</year>
	</pub-date>
      <volume>11</volume>
      <issue>6</issue>
	 <elocation-id>10.16910/jemr.11.6.6</elocation-id> 
	<permissions> 
	<copyright-year>2018</copyright-year>
	<copyright-holder>Wolf, J., Hess, S., Bachmann, D., Lohmeyer, Q., &#x26; Meboldt, M.</copyright-holder>
	<license license-type="open-access">
  <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License, 
  (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">
    https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use and redistribution provided that the original author and source are credited.</license-p>
</license>
	</permissions>
      <abstract>
<p>For an in-depth, AOI-based analysis of mobile eye tracking data, a preceding gaze assignment step is inevitable. Current solutions such as manual gaze mapping or marker-based approaches are tedious and not suitable for applications manipulating tangible objects. This makes mobile eye tracking studies with several hours of recording difficult to analyse quantitatively. We introduce a new machine learning-based algorithm, the computational Gaze-Object Mapping (cGOM), that automatically maps gaze data onto respective AOIs. cGOM extends state-of-the-art object detection and segmentation by mask R-CNN with a gaze mapping feature. The new algorithm’s performance is validated against a manual fixation-by-fixation mapping, which is considered ground truth, in terms of true positive rate (TPR), true negative rate (TNR) and efficiency. Using only 72 training images with 264 labelled object representations, cGOM is able to reach a TPR of approx. 80% and a TNR of 85% compared to the manual mapping. The break-even point is reached at 2 hours of eye tracking recording for the total procedure, or 1 hour when considering human working time only. Together with a real-time capability of the mapping process after completed training, even hours of eye tracking recording can be evaluated efficiently. <italic>(Code and video examples have been made available at: <ext-link ext-link-type="uri" xlink:href="https://gitlab.ethz.ch/pdz/cgom.git" xlink:show="new">https://gitlab.ethz.ch/pdz/cgom.git</ext-link>)</italic></p>
      </abstract>
      <kwd-group>
        <kwd>mobile eye tracking</kwd>
        <kwd>areas of interest</kwd>
        <kwd>machine learning</kwd>
        <kwd>mask R-CNN</kwd>
        <kwd>object detection</kwd>
        <kwd>gaze mapping</kwd>
        <kwd>tangible objects</kwd>
        <kwd>cGOM</kwd>
        <kwd>usability</kwd> 
      </kwd-group>
    </article-meta>
  </front>	
  <body>

    <sec id="S1">
      <title>Introduction</title>


<p>Areas of Interest (AOIs) are widely used for stimuli-driven,
quantitative analysis of eye tracking data and allow the determination
of important metrics such as dwell time or transitions (<xref ref-type="bibr" rid="b1">1</xref>). Despite
the progress in eye tracking software over the last years, AOI analysis
for mobile eye trackers is still an error-prone and time-consuming
manual task. In particular, this applies to studies in which the
participants move around and interact with tangible objects. This is
often the case for usability testing in real-world applications (<xref ref-type="bibr" rid="b2">2</xref>). As
a result of these challenges, many scientists hesitate to use
mobile eye tracking in their research even though it is often the
appropriate tool for the study design (<xref ref-type="bibr" rid="b3">3</xref>).</p>

<p>Various methods exist for assigning gaze data to respective AOIs such
as manual frame-by-frame or fixation-by-fixation analysis and dynamic
AOIs using either key frames or different types of markers. Ooms et al.
(<xref ref-type="bibr" rid="b4">4</xref>) state that dynamic AOIs based on interpolation between key frames
are generally not suitable for interactive eye tracking studies.
Vansteenkiste et al. (<xref ref-type="bibr" rid="b3">3</xref>) add that for experiments in natural settings,
it is almost inevitable to manually assign the gaze point frame-by-frame
to a static reference image or, as proposed in their paper and now
state of the art, using a fixation-by-fixation algorithm. These
manual methods are very effective and applicable to any possible case,
but also highly tedious. Over the last few years, marker-based
approaches using visible, infrared or natural markers have become more
and more common and are now widely used for automated computing of AOIs
(<xref ref-type="bibr" rid="b5 b6 b7">5, 6, 7</xref>). Although the use of markers can accelerate the evaluation process
enormously, they limit the types of scenes that can be analyzed
(<xref ref-type="bibr" rid="b8">8</xref>). Applied to interactive experiments with tangible objects, they
represent a potential disturbance factor for analyzing natural
attentional distribution, cannot be attached to small objects due to the
necessary minimum detectable size, must face the front camera for
detection and generally cannot be used for objects that move and rotate
during the experiment (e.g. a rolling ball).</p>

<p>To overcome these limitations, object detection algorithms could be
applied directly to the objects of interest and not to markers (<xref ref-type="bibr" rid="b9">9</xref>). In
recent years, major breakthroughs in object detection have been achieved
by machine learning approaches based on deep convolutional neural
networks (deep CNNs) (<xref ref-type="bibr" rid="b10">10</xref>). Until recently, CNN-based object detection
algorithms were solely able to roughly predict the position of an object
by means of bounding boxes (<xref ref-type="bibr" rid="b11">11</xref>). Figure 1 (left) shows the disadvantage
of such a rectangular AOI using a simple diagonally placed pen as an
example. The oversize and shape of the AOI can lead to high error rates,
in particular in experimental setups in which overlapping is expected
(<xref ref-type="bibr" rid="b12">12</xref>).</p>

<fig id="fig01" fig-type="figure" position="float">
					<label>Figure 1:</label>
					<caption>
						<p>Bounding box created by a conventional deep CNN (left) and
close contour mask created by mask R-CNN (right).</p>
					</caption>
					<graphic id="graph01" xlink:href="jemr-11-06-f-figure-01.png"/>
				</fig>

<p>In 2017, mask R-CNN was introduced (<xref ref-type="bibr" rid="b13">13</xref>) as one of the first deep CNNs
that not only detects the objects, but also outputs binary masks that
cover the objects' close contours (Figure 1, right). In this article, a
study is conducted to compare AOI analysis using Semantic Gaze Mapping
(SGM), which is integrated in SMI BeGaze 3.6 (SensoMotoric Instruments,
Teltow, Germany) and is considered ground truth, with an AOI
algorithm based on mask R-CNN being introduced here for the first
time.</p>

<p><italic>Semantic Gaze Mapping.</italic> SGM is a manual
fixation-by-fixation analysis method used to connect the gaze point of
each fixation to the underlying AOI in a static reference view (<xref ref-type="bibr" rid="b3">3</xref>).
Successively for each fixation of the eye tracking recording, the
fixation’s middle frame is shown to the analyst (e.g. for a fixation
consisting of seven frames only the fourth frame is displayed). The
analyst then evaluates the position of the gaze point in the frame and
clicks on the corresponding AOI in the reference image.</p>
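The middle-frame selection described above can be sketched as follows; the function name, the 0-based indexing and the rounding convention for even-length fixations are illustrative assumptions, not part of SMI's software:

```python
def middle_frame_index(first_frame: int, n_frames: int) -> int:
    """0-based video index of a fixation's middle frame.

    Matches the example in the text: for a fixation spanning seven
    frames, the 1-based position (7 + 1) // 2 = 4, i.e. the fourth
    frame, is displayed."""
    return first_frame + (n_frames + 1) // 2 - 1
```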

<p><italic>Computational Gaze-Object Mapping (cGOM)</italic>. cGOM is
based on a loop function that iterates through all fixations’ middle
frames and always performs the same routine of (i) object detection
using mask R-CNN and (ii) comparison of object and gaze coordinates. In
detail, each frame consists of a number of pixels that can be precisely
described by x and y coordinates in the two-dimensional plane with the
origin in the top left corner of the image. Mask R-CNN uses plain video
frames as input and outputs the frame with a suggested set of
corresponding pixels for each object of interest. If the gaze coordinate
matches a coordinate of an object of interest, cGOM automatically
assigns the gaze to the respective AOI.</p>
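The coordinate comparison in step (ii) can be illustrated by a minimal sketch; the function and its interface are assumptions for illustration, and mask R-CNN's actual output format differs in detail:

```python
import numpy as np

def map_gaze_to_aoi(masks, labels, gaze_x, gaze_y):
    """Assign a gaze point to an AOI.

    masks: list of H x W boolean arrays (one per detected object),
    with the origin in the top left corner as described above;
    labels: object class names, parallel to masks.
    Returns the label of the first mask containing the gaze pixel,
    or 'BG' (background) if no object is hit."""
    col, row = int(round(gaze_x)), int(round(gaze_y))
    for mask, label in zip(masks, labels):
        h, w = mask.shape
        if 0 <= row < h and 0 <= col < w and mask[row, col]:
            return label
    return "BG"
```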

<p>The performance of the two evaluation methods is compared in terms of
conformance with the ground truth and efficiency, expressed by the two
research questions <italic>RQ1</italic> and <italic>RQ2</italic>. The
goal of the study is to investigate whether the new algorithm offers the
potential of replacing conventional, manual evaluation for study designs
with tangible objects. Mask R-CNN, which is the core element of the cGOM
algorithm, has already surpassed other state-of-the-art networks in
object detection and segmentation tasks when trained on huge online data
sets (<xref ref-type="bibr" rid="b13">13</xref>). However, since the creation of such data sets is very
time-consuming and not feasible for common studies, a small and more
realistically sized training data set will be used for the
investigations in this article.</p>

<p>(RQ1) How effective is cGOM in assigning fixations to respective AOIs
in comparison with the ground truth?</p>

<p>(RQ2) At which recording duration does the efficiency of the
computer-based evaluation exceed that of the manual evaluation?</p>
    </sec>
	
    <sec id="S2">
      <title>Methods</title>


<p>The study presented in this article consisted of two parts. Firstly,
an observation of a handling task was performed for creating a
homogeneous data set in a fully controlled test environment. Secondly,
the main study was conducted by analysing the data sets of the handling
task and varying the evaluation method in the two factor levels
<italic>SGM (Semantic Gaze Mapping)</italic> and <italic>cGOM
(computational Gaze-Object Mapping)</italic>.</p>

    <sec id="S2a">
      <title>Handling Task</title>

<p><italic>Participants.</italic> 10 participants (9 males and 1 female,
average 26.6 years, range 21-30 years) conducted the handling task
wearing the eye tracking glasses. All participants had normal or
corrected-to-normal vision and were either mechanical engineering
students or PhD students.</p>

<p><italic>Material.</italic> The data was collected using the SMI ETG 2
eye tracking glasses. The front camera offers a scene resolution of 1280
x 960 px (viewing angle: 60° horizontal, 46° vertical) at a sampling
frequency of 24 Hz, and the gaze point measurement has an accuracy of
0.5° over all distances.</p>
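As a rough back-of-the-envelope check (assuming a simple linear pixels-per-degree mapping, which ignores lens distortion), the stated 0.5° accuracy corresponds to roughly ten pixels in the scene video:

```python
def accuracy_in_pixels(resolution_px: int, fov_deg: float, accuracy_deg: float) -> float:
    """Approximate angular accuracy in scene-camera pixels, assuming a
    uniform pixels-per-degree mapping across the field of view."""
    return resolution_px / fov_deg * accuracy_deg

# 1280 px over 60° horizontally at 0.5° accuracy -> about 10.7 px
```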

<p><italic>Stimuli.</italic> The stimuli (Figure 2) were placed on a
table covered in a green tablecloth and consisted of five transparent
syringes, two beige bowls and one disinfectant dispenser. Four of the
syringes had a green piston and maximum filling capacities of 2, 5, 10
and 20 ml and one was fully transparent with a filling capacity of 50
ml.</p>

<fig id="fig02" fig-type="figure" position="float">
					<label>Figure 2:</label>
					<caption>
						<p>Spatial arrangement of the stimuli at the beginning of the
handling task.</p>
					</caption>
					<graphic id="graph02" xlink:href="jemr-11-06-f-figure-02.png"/>
				</fig>

<p>The bowl on the left side was almost completely filled with water and
the other one was empty. Moreover, there was one long piece of adhesive
tape attached to the tablecloth with five filling quantities written on
it (12 ml, 27 ml, 19 ml, 150 ml and 87 ml).</p>

<p><italic>Task.</italic> The participants were asked to decant the
filling quantities specified on the adhesive tape from the left bowl to
the right bowl. The two bowls and the disinfectant dispenser should not
be moved and the participants were only allowed to use the maximum
filling quantity for each syringe. After each completed decanting of one
of the five preset values, the participants were instructed to disinfect
their hands.</p>
    </sec>
	
    <sec id="S2b">
      <title>Design</title>

<p>For the main study, the data set of the handling task was analyzed by
the two evaluation methods <italic>SGM</italic> and
<italic>cGOM</italic>. Both evaluation methods were compared in terms of
conformance with the ground truth and efficiency, quantified through the
two dependent variables (i) fixation-count per AOI and (ii) required
time for each evaluation step. For calculation of the
<italic>fixation-count per AOI</italic>, three AOIs were defined. All
syringes were equally labelled as <italic>syringe</italic> without
further differentiation. The disinfectant dispenser was referred to as
<italic>bottle</italic>. All gaze points that did not fall on either AOI
were to be assigned to the <italic>background (“BG”)</italic>. True
positive rates (TPR) and true negative rates (TNR) were calculated for
the AOIs <italic>syringe</italic> and <italic>bottle</italic> to
evaluate the effectiveness of the algorithm. While TPR describes how many
of the true positive assignments of the ground truth were found by the
<italic>cGOM</italic> tool, TNR describes the same comparison for the
true negative assignments or non-assignments.</p>
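Under these definitions, the per-AOI TPR and TNR can be computed from the two fixation-label sequences; this sketch assumes both methods label the same fixations, with the manual (SGM) labels taken as ground truth:

```python
def tpr_tnr(ground_truth, predicted, aoi):
    """Per-AOI true positive rate and true negative rate.

    A fixation counts as positive when the ground truth assigns it to
    the given AOI; TPR is the share of those the prediction recovers,
    TNR the share of ground-truth negatives that the prediction also
    leaves unassigned to the AOI."""
    pairs = list(zip(ground_truth, predicted))
    tp = sum(g == aoi and p == aoi for g, p in pairs)
    fn = sum(g == aoi and p != aoi for g, p in pairs)
    tn = sum(g != aoi and p != aoi for g, p in pairs)
    fp = sum(g != aoi and p == aoi for g, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```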

<p>Even though <italic>cGOM</italic> is able to assign the gaze point of
each frame to the corresponding AOI, for reasons of comparability the
assignment was also performed using only the fixations’ middle frame.
For comparison of efficiency, the <italic>required time for each
evaluation step</italic> of <italic>SGM</italic> and
<italic>cGOM</italic> was measured and summed up. For all manual work
steps, the times were averaged over all analysts, whereas all
computational steps were measured in one representative run. Finally,
the relation of data size and required time for evaluation was plotted
and extended by a linear trend line to allow the determination and
visualization of the break-even point of both evaluation methods.</p>
    </sec>
	
    <sec id="S2c">
      <title>Participants &#x26; Materials</title>

<p>Five professional analysts (5 males, average 29 years, range 26-37
years), experienced in eye tracking data analysis, performed both the
evaluation using <italic>Semantic Gaze Mapping</italic> and the manual
steps of <italic>computational Gaze-Object Mapping</italic> (e.g. data
labelling). For the latter, they received training prior to execution.
All operations concerning mask R-CNN were performed on a single NVIDIA
Tesla V100 GPU (Graphics Processing Unit) via Amazon Web Services (AWS)
cloud computing. Both Semantic Gaze Mapping and the export of gaze data
were performed using SMI’s BeGaze 3.6.</p>
    </sec>
	
    <sec id="S2d">
      <title>Procedure</title>

<p>The evaluation process is divided into three purely manual steps for
<italic>SGM</italic> and five steps for <italic>cGOM</italic> with the
latter consisting of two computational operations and three that demand
manual execution by the analyst. The respective steps of both methods
are explained in the following.</p>

<p><italic>Semantic Gaze Mapping.</italic> Initially, the evaluation was
prepared once by loading a reference image with all objects of interest
into the software and drawing AOIs accordingly. This reference image and
all respective AOIs can be reused over all trials. Subsequently, the
manual mapping for the recordings of the handling tasks was performed
for all fixations until at last, the data was exported.</p>

<p><italic>Computational Gaze-Object Mapping.</italic> First, training
images were collected to train mask R-CNN on the handling task setup,
using only 72 frames from the recording of a pilot study. All images
were taken from the front camera of the eye tracking glasses, resulting
in a corresponding resolution of the training and the test images. Due
to the small amount of training images, it was of great importance to
keep environmental conditions constant throughout all studies. Mask
R-CNN requires labelled images as training input, in the same form as
they should later be outputted. To this end, close contour masks were manually drawn
on all objects of interest in the training images. Once all images were
labelled, the training of the neural network was started. This operation
is purely computer-based and thus did not require the analyst's working
time, allowing the analyst to export the gaze data in the meantime. The
<italic>cGOM</italic> algorithm requires start time, duration and end
time of all fixations, the x and y coordinates of the gaze point as well
as the raw video recording of the front camera. Once all data was
prepared, in a final step, the algorithm performed the gaze-object
mapping of all trials in a single run.</p>
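The final mapping step iterates the routine from the Introduction over all fixations. A minimal sketch, in which the frame reader and the detector are stand-in parameters rather than the authors' actual interfaces:

```python
def run_cgom(fixations, get_frame, detect_objects):
    """Map every fixation to an AOI label.

    fixations: iterable of (middle_frame_index, gaze_x, gaze_y);
    get_frame(index) -> scene-video frame;
    detect_objects(frame) -> (masks, labels), where each mask is an
    indexable 2-D boolean structure (mask[row][col]).
    Returns one label per fixation, 'BG' when no mask contains the gaze."""
    results = []
    for frame_idx, gaze_x, gaze_y in fixations:
        frame = get_frame(frame_idx)
        masks, labels = detect_objects(frame)
        aoi = "BG"
        for mask, label in zip(masks, labels):
            if mask[int(gaze_y)][int(gaze_x)]:
                aoi = label
                break
        results.append(aoi)
    return results
```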
    </sec>
    </sec>
    	
    <sec id="S3">
      <title>Results</title>

<p>As presented in <italic>Methods</italic>, the algorithm in its core
is an already established state-of-the-art convolutional neural
network for object detection including the prediction of masks. Figure 3
shows that the algorithm is able to perform the mapping of specific gaze
points to gazed-at objects when comparing the position of the object
masks with the gaze point coordinates. On the one hand, this chapter
shall evaluate the mapping performance of the algorithm in comparison to
the manual mapping, which is considered ground truth. On the other
hand, it shall provide an overview of the time needed to operate the
algorithm and present the break-even point from which on the algorithm
approach is faster than the manual mapping.</p>

<fig id="fig03" fig-type="figure" position="float">
					<label>Figure 3:</label>
					<caption>
						<p>Detection of the two objects syringe and bottle by the cGOM
algorithm. The detected objects are marked with masks, coloured
according to the object class. Their positions in the image are then
compared with the corresponding gaze point coordinate (red ring).</p>
					</caption>
					<graphic id="graph03" xlink:href="jemr-11-06-f-figure-03.png"/>
				</fig>

<p><italic>Effectiveness evaluation.</italic> Figure 4 shows the results
for TPR and TNR of the two AOIs <italic>syringe</italic> and
<italic>bottle.</italic> For the AOI <italic>syringe,</italic> the
<italic>cGOM</italic> tool achieves a TPR of 79% and a TNR of 85%. For
the AOI <italic>bottle</italic>, the TPR of the <italic>cGOM</italic>
tool is 58% and the TNR is 98%. Table 1 shows the overview statistics of
the fixation mapping both for SGM and for cGOM, including the number of
fixations mapped [-], the average fixation duration [ms] and the
standard deviation of the fixation durations [ms] for the two examined
objects syringe and bottle.</p>

<fig id="fig04" fig-type="figure" position="float">
					<label>Figure 4:</label>
					<caption>
						<p>True positive rate (TPR) and true negative rate (TNR) of
the cGOM assignment, compared to the manual assignment, which is
considered ground truth [%]. Relation between the results of the
manual mapping (abscissa) and the mapping by the cGOM tool (ordinate).
264 labelled representations for the AOI syringe, and 32 labelled
representations for the AOI bottle were used for training the cGOM
algorithm.</p>
					</caption>
					<graphic id="graph04" xlink:href="jemr-11-06-f-figure-04.png"/>
				</fig>

<p><italic>Efficiency of the manual mapping.</italic> The analysts
needed a total of 128 minutes on average for the mapping, with a
standard deviation of 14 minutes. The subtasks were preparation,
mapping and export of the data sample. The main time was needed to map
all 4356 fixations. The exact sub-times for the total mapping process
are presented in Table 2.</p>

<p><italic>Efficiency of the computational mapping.</italic> The whole
process using the algorithm required 236 minutes in total. The exact
times of the subtasks of the mapping process are shown in Table 3. The
mapping aided by the algorithm requires manual and computational steps.
The <italic>cGOM</italic> tool spends most of the time on the steps of
training and mapping that are solely conducted by a computer. The steps
for collecting training images, manual labelling of the training images
and export of the main study gaze data have to be performed by an
operator. The labelled training set of this study included 72 images
showing in total 264 representations of the object
<italic>syringe</italic> and 32 representations of the object
<italic>bottle</italic>. These three manual steps required 129 minutes
in total.</p>

<p><italic>Break-even point.</italic> Figure 5 shows the approximated
break-even point for the average time required for the manual mapping.
The measurement points for the manual
mapping are the averaged interim times for the single-participant data
sets. The completely analysed data sample (video time) totals up to 52
minutes. The start time for the data mapping was dependent on the
required preparation for the manual and for the computational mapping.
The average of the measured mapping times is extended by a linear trend
line (dashed lines in Figure 5). The grey cone shows the extended
linearized times of the fastest and slowest expert. All linearization is
based on the assumption that the experts take sufficient breaks. The
horizontal dashed line in Figure 5 indicates the time of the three
manual steps required for the <italic>cGOM</italic> tool. For the manual
<italic>SGM</italic> tool, the ratio of required mapping time to the
duration of the to-be-assigned gaze data sample is 2.48 on average. For the
<italic>cGOM</italic> tool, the ratio is 0.77 on average. The break-even
point of the whole manual mapping procedure and the whole computational
mapping procedure lies approximately at a data sample size of 02:00h.
When comparing only the manual steps of both procedures, the break-even
point reduces to a data sample size of approximately 01:00h (see Figure
5).</p>
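With the linear time models above, the break-even point follows from equating the two totals. The numbers below are taken from Tables 2 and 3 and the stated ratios; the function itself is an illustrative reconstruction, not the authors' analysis code:

```python
def break_even_minutes(fixed_a, rate_a, fixed_b, rate_b):
    """Recording duration (minutes) at which two linear time models
    t = fixed + rate * duration require equal total time."""
    return (fixed_b - fixed_a) / (rate_a - rate_b)

# SGM: ~2.5 min of preparation/export, 2.48 min of mapping per recorded
# minute; cGOM in total: ~196 min of fixed effort (training images,
# labelling, network training, export), 0.77 min per recorded minute.
total = break_even_minutes(2.5, 2.48, 196, 0.77)       # ~113 min, i.e. ~2 h
# Counting only the manual cGOM steps (~129 min fixed, ~0 min/minute):
manual_only = break_even_minutes(2.5, 2.48, 129, 0.0)  # ~51 min, i.e. ~1 h
```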

<fig id="fig05" fig-type="figure" position="float">
					<label>Figure 5:</label>
					<caption>
<p>Graphical representation of data sample size in hours and
minutes (abscissa) and time required for mapping in hours (ordinate) for
manual mapping using SGM (●) and computational mapping using cGOM (▲).
The linear approximation is based on the measured sub-times data.
According to the approximation, a sample size of 4 hours video time
would require 6.5 hours for the computational mapping and 10 hours for
the manual mapping on average. The grey cone represents the distribution
of the manual mapping and the linear continuation. The break-even point
is at approx. 02:00h gaze data sample size. The break-even point for the
man-hour investment for cGOM is at approx. 01:00h gaze data sample
size.</p>
					</caption>
					<graphic id="graph05" xlink:href="jemr-11-06-f-figure-05.png"/>
				</fig>

<table-wrap id="t01" position="float">
					<label>Table 1:</label>
					<caption>
						<p>Overview statistics of the fixations mapped by SGM and cGOM
for the objects syringe and bottle. The applied statistics are number of
fixations [-], fixation duration mean [ms] and fixation duration standard
deviation [ms].</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th><italic><bold>Mapping characteristics</bold></italic></th>
        <th>SGM (syringe)</th>
        <th>cGOM (syringe)</th>
        <th>SGM (bottle)</th>
        <th>cGOM (bottle)</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of fixations [-]</td>
        <td>2016</td>
        <td>1934</td>
        <td>79</td>
        <td>82</td>
      </tr>
      <tr>
        <td>Duration mean [ms]</td>
        <td>445</td>
        <td>434</td>
        <td>311</td>
        <td>248</td>
      </tr>
      <tr>
        <td>Duration SD [ms]</td>
        <td>568</td>
        <td>571</td>
        <td>296</td>
        <td>146</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<table-wrap id="t02" position="float">
					<label>Table 2:</label>
					<caption>
						<p>SGM - Overview of the mapping sub-times (mean and standard
deviation in minutes) of the manual mapping by five professional
analysts. The sub-tasks were preparation, mapping and export of the data
sample.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th><italic><bold>SGM (Semantic Gaze
        Mapping)</bold></italic></th>
        <th>Mean [min]</th>
<th>Standard deviation [min]</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Preparation</td>
        <td>1.5</td>
        <td>1</td>
      </tr>
      <tr>
        <td>Mapping</td>
        <td>126</td>
        <td>13</td>
      </tr>
      <tr>
        <td>Export</td>
        <td>1</td>
        <td>0.5</td>
      </tr>
      <tr>
        <td><bold>Total time</bold></td>
        <td><bold>128.5</bold></td>
        <td><bold>14</bold></td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<table-wrap id="t03" position="float">
					<label>Table 3:</label>
					<caption>
						<p>cGOM - Overview of mapping sub-times (mean and standard
deviation in minutes) of the computational mapping by the algorithm.
The mapping aided by the algorithm requires manual steps <sup>(#)</sup>
and computational steps <sup>(*)</sup>.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th><italic><bold>cGOM (computational Gaze-Object
        Mapping)</bold></italic></th>
        <th>Mean [min]</th>
<th>Standard deviation [min]</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Collecting training images <sup>(#)</sup></td>
        <td>15</td>
        <td>0</td>
      </tr>
      <tr>
        <td>Manual labelling of the training images <sup>(#)</sup></td>
        <td>113</td>
        <td>38</td>
      </tr>
      <tr>
        <td>Training of the neural network <sup>(*)</sup></td>
        <td>67</td>
        <td>0</td>
      </tr>
      <tr>
        <td>Export of the main study gaze data <sup>(#)</sup></td>
        <td>1</td>
        <td>0.5</td>
      </tr>
      <tr>
        <td>Mapping of the main study gaze data <sup>(*)</sup></td>
        <td>40</td>
        <td>0</td>
      </tr>
      <tr>
        <td><bold>Total operator time</bold></td>
        <td><bold>129</bold></td>
        <td><bold>38</bold></td>
      </tr>
      <tr>
        <td><bold>Total time</bold></td>
        <td><bold>236</bold></td>
        <td><bold>38</bold></td>
      </tr>
    </tbody>
  </table>
</table-wrap>

    </sec>
	
    <sec id="S4">
      <title>Discussion</title>

<p>The goal of the main study was to investigate whether the newly
introduced machine learning-based algorithm <italic>cGOM</italic> offers
the potential of replacing conventional, manual AOI evaluation in
experimental setups with tangible objects. Therefore, manual gaze
mapping using <italic>SGM</italic>, which was considered ground
truth, and <italic>cGOM</italic> were compared with regard to their
performance. In the process, it was quantified whether the new algorithm
is able to effectively map gaze data to AOIs (RQ1) and from which
recording duration on the algorithm works more efficiently than the
manual mapping (RQ2). Based on the results of this study we evaluate
both research questions.</p>
<p specific-use="wrapper">
  <disp-quote>
    <p><italic>(RQ1) How effective is cGOM in assigning fixations to
    respective AOIs in comparison with the ground truth?
    </italic></p>
  </disp-quote>
</p>

<p>The two objects used during the handling task were deliberately
selected because they represent potential challenges for machine
learning. On the one hand, the syringes are partially transparent and
constantly change their length during decanting. On the other hand, both
the syringes and the bottle have partly tapered contours, which were
assumed to be difficult to reproduce by close contour masks, in particular
when working with small training data sets. According to the results
presented in Figure 4, the assignment by the computational mapping has a
TPR of 79% for the AOI <italic>syringe</italic> and 58% for the AOI
<italic>bottle</italic>. For training the neural network, not only the
number of training images but also the total number of object
representations in these images is important. Since the stimuli
consisted of five <italic>syringes</italic> and only one
<italic>bottle</italic>, the 72 training images included 264
representations of the AOI <italic>syringe</italic> on which the neural
network could learn, but only 32 representations of the AOI
<italic>bottle</italic>. Due to the small learning basis, the masks
produced by the algorithm sometimes did not include crucial features
like the tip of the bottle (Figure 3).</p>

<p>For the AOI <italic>syringe</italic>, the TNR is slightly better than
the TPR, whereas for the AOI <italic>bottle</italic> the TNR greatly
exceeds the TPR. The relation between TPR and TNR can be well explained
by the quality of the created masks. The masks tend to cover too little
of the object rather than too much. The further the masks recede inwards
from the outer contour, or the more they exclude crucial features as in
the case of the bottle, the fewer true positives are registered, but the
higher the probability that true negatives are recorded.</p>
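This trade-off follows directly from how gaze-object mapping with segmentation masks works: a fixation counts as a hit only if the gaze point falls inside the predicted mask. The following sketch is a simplification of such a point-in-mask test; the function name and the binary-mask representation are assumptions for illustration, not the published implementation:

```python
import numpy as np

def map_fixation(masks, x, y):
    """Return the AOI label whose binary mask contains the gaze point,
    or None if the point lies outside every mask.

    masks: dict mapping AOI label -> 2D boolean array (H x W).
    x, y:  gaze coordinates in pixels (column, row).
    """
    for label, mask in masks.items():
        if mask[int(y), int(x)]:
            return label
    return None

# A mask shrunk inwards from the true contour misses gaze points near
# the object border: fewer true positives, but also fewer false positives.
h, w = 100, 100
bottle = np.zeros((h, w), dtype=bool)
bottle[30:70, 30:70] = True              # predicted mask, smaller than the real object
print(map_fixation({"bottle": bottle}, 50, 50))  # gaze well inside the mask
print(map_fixation({"bottle": bottle}, 75, 50))  # gaze near the border, outside the mask
```

A gaze point near the true contour of the bottle therefore falls outside the shrunken mask and is registered as a (false) negative, which inflates the TNR relative to the TPR.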

<p>In line with the results, it can be concluded that for the AOI
<italic>syringe</italic> the conformance with the ground truth is
already promising, but can still be increased further. For the AOI
<italic>bottle,</italic> the true positive rate is not yet sufficient,
but at almost 60% it is surprisingly high given that the neural network
was shown only 32 object representations of the <italic>bottle</italic>
during training. In comparison, large online databases such as MS COCO
work with hundreds to thousands of labelled images for training a single
object. Objects already represented in the COCO data set may even be
used in a plug-and-play approach without training, using only the
gaze-object mapping feature of the presented algorithm. In contrast to
the 72 images needed for the tailored approach of this study, the COCO
dataset includes more than 320k labelled images in total
(<xref ref-type="bibr" rid="b14">14</xref>). To improve the performance
of the <italic>cGOM</italic> algorithm, the number of training images
and object representations can be increased until a sufficient TPR and
TNR are reached.</p>
<p specific-use="wrapper">
  <disp-quote>
    <p><italic>(RQ2) At which recording duration does the efficiency of
    the computer-based evaluation exceed that of the manual
    evaluation?</italic></p>
  </disp-quote>
</p>

<p>The <italic>cGOM</italic> tool exceeds the manual evaluation when its
procedure requires less total time. The amount of data from which
onwards the <italic>cGOM</italic> tool is faster than the manual
evaluation is called the break-even point. For the break-even point, one
has to distinguish between the time for the total computational mapping
procedure and the time a person has to invest (see man-hour investments
of <italic>cGOM</italic> in Figure 5). Most of the time is needed for
training the algorithm on the training images, which is performed solely
by the computer, and for labelling the training images, which has to be
performed once by the analyst. For the total procedure, and assuming the
experts&#8217; average speed for manual mapping, the break-even point lies at
2 hours of eye tracking recording. When focussing only on the time a
person has to invest, the break-even point reduces to just 60 minutes of
eye tracking recording.</p>
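The break-even point follows from comparing two linear time models: a fixed setup cost (labelling and training) plus a per-minute mapping rate for cGOM, versus a purely per-minute rate for manual mapping. The sketch below uses the rates reported in this study (0.77 for cGOM, 2.48 for manual mapping); the setup time is an illustrative assumption chosen so that the break-even lands at the reported 2 hours, not a figure taken from the paper:

```python
def break_even_minutes(setup_min, cgom_rate, manual_rate):
    """Recording length at which total cGOM time equals manual-mapping time.

    Solve: setup_min + cgom_rate * t = manual_rate * t
    =>     t = setup_min / (manual_rate - cgom_rate)
    """
    return setup_min / (manual_rate - cgom_rate)

# Rates from the study: cGOM maps in real time with a factor of 0.77,
# while the experts needed on average 2.48x the recording length.
# The setup time (labelling + training) is an assumed illustrative value.
t = break_even_minutes(setup_min=205.2, cgom_rate=0.77, manual_rate=2.48)
print(round(t))  # -> 120 minutes, i.e. roughly the reported 2 hours
```

Excluding the computer-only training time from the setup cost shrinks `setup_min` and, by the same formula, moves the break-even point down towards the reported 60 minutes of recording.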

<p>Once the <italic>cGOM</italic> algorithm has been prepared, it needs
less than 8 minutes for every 10 minutes of eye tracking recording and
is thus able to work in real time with a factor of 0.77. This is far
faster than the manual mapping, which on average requires 2.48 times the
recording length. This difference in evaluation speed does not even take
into account the analyst&#8217;s symptoms of fatigue, which increase
considerably with longer <italic>SGM</italic> evaluation times. Given
the break-even points of only 2 hours of eye tracking recording, or
1 hour when considering only the human working time, and the real-time
capability of the mapping, the authors conclude that the efficiency of
the <italic>cGOM</italic> tool exceeds that of manual mapping for the
majority of mobile eye tracking studies.</p>

<p>Due to the novelty of the algorithm presented in this study, there
are several limitations regarding the results. First and foremost, to
achieve the best possible results with a minimum amount of training
images, the results presented are only valid in a laboratory environment
with constant conditions. The amount of training data has to be higher
to cover the possible variations in field studies. Further
investigations are also required to determine which other objects are
suitable for the presented algorithm and how the characteristics and
number of objects influence the evaluation time. Although the comparison
with SGM as a ground truth allows for a good assessment of the
algorithm&#8217;s potential, it is questionable whether the results of the
manual mapping are always error-free, given the subjective evaluations
by the analysts. The approach of using five independent professional
analysts tries to compensate for this limitation.</p>

<p>Moreover, the gaze data set consisted of only 52 minutes of video
material, and the derived linear extrapolation may not hold, as human
analysts cannot work without interruption and their mapping speed
decreases over time. The measured times for the computational and the
manual mapping may not be reproduced with a different computer system or
a different set of gaze data. The computational system used for this
study has a state-of-the-art graphics processing unit (GPU), which is
not comparable to that of a standard computer in terms of speed and
hence in the time needed for training the algorithm and mapping the gaze
data. Cloud computing, which was used in this study, lowers this
barrier, as its cost continues to decrease and it democratises access to
the required computational performance.</p>

<p>As described in the introduction, AOI analysis of mobile, interactive
eye tracking studies with tangible objects is still a time-consuming and
challenging task. The presented <italic>cGOM</italic> algorithm is a
first step to address this gap and complement state-of-the-art methods
of automated AOI analysis (e.g. markers). For objects that are trained
with a sufficient amount of training data like the AOI
<italic>syringe</italic>, the algorithm already shows a promising TPR
and TNR. Due to the early break-even point, both for the total procedure
and in particular considering human working time, as well as the
real-time capability of the mapping process, even hours of eye tracking
recording can be evaluated. Currently, manual mapping of such amounts of
data would require a time investment that is difficult or impossible to
realise from both an economic and a human point of view. Consequently,
this approach of using machine learning for the mapping task promises to
enable the mapping of large amounts of gaze data in a reliable,
standardized way and within a short period. It lays the foundation for
profound research on AOI metrics and lowers the barriers many
researchers still face when considering mobile eye tracking for their
own research.</p>
    </sec>
	
    <sec id="S5" sec-type="COI-statement">
      <title>Ethics and Conflict of Interest</title>

<p>The authors declare that the contents of the article are in agreement
with the ethics described in
<ext-link ext-link-type="uri" xlink:href="http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html" xlink:show="new">http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html</ext-link>.
The Zurich ethics committee confirms that this research project does not
fall within the scope of the Human Research Act (HRA) and that therefore
no authorization from the ethics committee is required (BASEC No.
Req-2018-00533, 27<sup>th</sup> June 2018). All participants were asked
to read and sign a consent form describing the type of recorded data and
how these data would be used for publication. The authors declare that
they have no conflict of interest regarding the publication of this
paper.</p>
    </sec>
	
    <sec id="S6">
      <title>Acknowledgements</title>

<p>The authors Julian Wolf and Stephan Hess contributed equally to the
publication of this paper and are both considered first author. We wish
to thank all participants of this study for their time.</p>
    </sec>
</body>
<back>
<ref-list>
<ref id="b11"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Chukoskie</surname> <given-names>L</given-names></name>, <name><surname>Guo</surname> <given-names>S</given-names></name>, <name><surname>Ho</surname> <given-names>E</given-names></name>, <name><surname>Zheng</surname> <given-names>Y</given-names></name>, <name><surname>Chen</surname> <given-names>Q</given-names></name>, <name><surname>Meng</surname> <given-names>V</given-names></name>, <etal>et al.</etal></person-group> Quantifying Gaze Behavior during Real World Interactions using Automated Object, Face, and Fixation Detection. IEEE Transactions on Cognitive and Developmental Systems. <year>2018</year>:1-. doi: <pub-id pub-id-type="doi" specific-use="author">10.1109/TCDS.2018.2821566</pub-id>.</mixed-citation></ref>
<ref id="b9"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>De Beugher</surname> <given-names>S</given-names></name>, <name><surname>Brone</surname> <given-names>G</given-names></name>, <name><surname>Goedeme</surname> <given-names>T.</given-names></name></person-group> <article-title>Automatic Analysis of In-the-Wild Mobile Eye-tracking Experiments using Object, Face and Person Detection.</article-title> <source>Proceedings of the 2014 9th International Conference on Computer Vision Theory and Applications (Visapp)</source>, <volume>Vol 1</volume>. <year>2014</year>:<fpage>625</fpage>-<lpage>33</lpage>. PubMed PMID: WOS:000412726800079.</mixed-citation></ref>
<ref id="b8"><mixed-citation publication-type="unknown" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Evans</surname> <given-names>KM</given-names></name>, <name><surname>Jacobs</surname> <given-names>RA</given-names></name>, <name><surname>Tarduno</surname> <given-names>JA</given-names></name>, <name><surname>Pelz</surname> <given-names>JB</given-names></name></person-group>. Collecting and Analyzing Eye-tracking Data in Outdoor Environments. J Eye Movement Res. <year>2012</year>;5(2). PubMed PMID: WOS:000328117400006.</mixed-citation></ref>
<ref id="b10"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Garcia-Garcia</surname> <given-names>A</given-names></name>, <name><surname>Orts-Escolano</surname> <given-names>S</given-names></name>, <name><surname>Oprea</surname> <given-names>S</given-names></name>, <name><surname>Villena-Martinez</surname> <given-names>V</given-names></name>, <name><surname>Martinez-Gonzalez</surname> <given-names>P</given-names></name>, <name><surname>Garcia-Rodriguez</surname> <given-names>J.</given-names></name></person-group> <article-title>A survey on deep learning techniques for image and video semantic segmentation.</article-title> Appl Soft Comput. <year>2018</year>;70:41-65. doi: <pub-id pub-id-type="doi" specific-use="author">10.1016/j.asoc.2018.05.018</pub-id>. PubMed PMID: WOS:000443296000004.</mixed-citation></ref>
<ref id="b13"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>He</surname> <given-names>KM</given-names></name>, <name><surname>Gkioxari</surname> <given-names>G</given-names></name>, <name><surname>Dollar</surname> <given-names>P</given-names></name>, <name><surname>Girshick</surname> <given-names>R</given-names></name></person-group>. Mask R-CNN. Ieee I Conf Comp Vis. <year>2017</year>:2980-8. doi: <pub-id pub-id-type="doi" specific-use="author">10.1109/ICCV.2017.322</pub-id>. PubMed PMID: WOS:000425498403005.</mixed-citation></ref>
<ref id="b1"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><name><surname>Holmqvist</surname>, <given-names>K.</given-names></name>, <name><surname>Nystr&#246;m</surname>, <given-names>M.</given-names></name>, <name><surname>Andersson</surname>, <given-names>R.</given-names></name>, <name><surname>Dewhurst</surname>, <given-names>R.</given-names></name>, <name><surname>Jarodzka</surname>, <given-names>H.</given-names></name>, &#x26; <name><surname>van de Weijer</surname>, <given-names>J.</given-names></name></person-group> (<year>2011</year>). <source>Eye Tracking: A Comprehensive Guide To Methods And Measures</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford university press</publisher-name>.</mixed-citation></ref>
<ref id="b5"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>Kiefer</surname> <given-names>P</given-names></name>, <name><surname>Giannopoulos</surname> <given-names>I</given-names></name>, <name><surname>Kremer</surname> <given-names>D</given-names></name>, <name><surname>Schlieder</surname> <given-names>C</given-names></name>, <name><surname>Raubal</surname> <given-names>M</given-names></name></person-group>. <article-title>Starting to get bored: an outdoor eye tracking study of tourists exploring a city panorama.</article-title>&#160;<source>Proceedings of the Symposium on Eye Tracking Research and Applications</source>; <conf-loc>Safety Harbor, Florida</conf-loc>. 2578216: ACM; <year>2014</year>. p. <fpage>315</fpage>-<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1145/2578153.2578216</pub-id></mixed-citation></ref>
<ref id="b14"><mixed-citation publication-type="unknown" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>TY</given-names></name>, <name><surname>Maire</surname> <given-names>M</given-names></name>, <name><surname>Belongie</surname> <given-names>S</given-names></name>, <name><surname>Hays</surname> <given-names>J</given-names></name>, <name><surname>Perona</surname> <given-names>P</given-names></name>, <name><surname>Ramanan</surname> <given-names>D</given-names></name>, <etal>et al.</etal></person-group> Microsoft COCO: Common Objects in Context. Lect Notes Comput Sc. <year>2014</year>;8693:740-55. PubMed PMID: WOS:000345528200048.</mixed-citation></ref>
<ref id="b2"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Mussgnug</surname> <given-names>M</given-names></name>, <name><surname>Singer</surname> <given-names>D</given-names></name>, <name><surname>Lohmeyer</surname> <given-names>Q</given-names></name>, <name><surname>Meboldt</surname> <given-names>M</given-names></name></person-group>. <article-title>Automated interpretation of eye-hand coordination in mobile eye tracking recordings.</article-title> Identifying demanding phases in human-machine interactions. Kunstl Intell. <year>2017</year>;31(4):331-7. doi: <pub-id pub-id-type="doi" specific-use="author">10.1007/s13218-017-0503-y</pub-id>. PubMed PMID: WOS:000424411300004.</mixed-citation></ref>
<ref id="b4"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Ooms</surname> <given-names>K</given-names></name>, <name><surname>Coltekin</surname> <given-names>A</given-names></name>, <name><surname>De Maeyer</surname> <given-names>P</given-names></name>, <name><surname>Dupont</surname> <given-names>L</given-names></name>, <name><surname>Fabrikant</surname> <given-names>S</given-names></name>, <name><surname>Incoul</surname> <given-names>A</given-names></name>, <etal>et al.</etal></person-group> <article-title>Combining user logging with eye tracking for interactive and dynamic applications.</article-title> Behav Res Methods. <year>2015</year>;47(4):977-93. doi: <pub-id pub-id-type="doi" specific-use="author">10.3758/s13428-014-0542-3</pub-id>. PubMed PMID: WOS:000364511400006.</mixed-citation></ref>
<ref id="b12"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Orquin</surname> <given-names>JL</given-names></name>, <name><surname>Ashby</surname> <given-names>NJS</given-names></name>, <name><surname>Clarke</surname> <given-names>ADF</given-names></name></person-group>. Areas of Interest as a Signal Detection Problem in Behavioral Eye-Tracking Research. J Behav Decis Making. <year>2016</year>;29(2-3):103-15. doi: <pub-id pub-id-type="doi" specific-use="author">10.1002/bdm.1867</pub-id>. PubMed PMID: WOS:000373309700002.</mixed-citation></ref>
<ref id="b6"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>Pfeiffer</surname> <given-names>T</given-names></name>, <name><surname>Renner</surname> <given-names>P.</given-names></name></person-group> <article-title>EyeSee3D: a low-cost approach for analyzing mobile 3D eye tracking data using computer vision and augmented reality technology.</article-title>&#160;<source>Proceedings of the Symposium on Eye Tracking Research and Applications</source>; <conf-loc>Safety Harbor, Florida</conf-loc>. 2578183: ACM; <year>2014</year>. p. <fpage>195</fpage>-<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1145/2578153.2578183</pub-id></mixed-citation></ref>
<ref id="b3"><mixed-citation publication-type="unknown" specific-use="linked"><person-group person-group-type="author"><name><surname>Vansteenkiste</surname> <given-names>P</given-names></name>, <name><surname>Cardon</surname> <given-names>G</given-names></name>, <name><surname>Philippaerts</surname> <given-names>R</given-names></name>, <name><surname>Lenoir</surname> <given-names>M</given-names></name></person-group>. Measuring dwell time percentage from head-mounted eye-tracking data - comparison of a frame-by-frame and a fixation-by-fixation analysis. Ergonomics. <year>2015</year>;58(5):712-21. doi: <pub-id pub-id-type="doi" specific-use="author">10.1080/00140139.2014.990524</pub-id>. PubMed PMID: WOS:000354453400005.</mixed-citation></ref>
<ref id="b7"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y</given-names></name>, <name><surname>Zheng</surname> <given-names>XJ</given-names></name>, <name><surname>Hong</surname> <given-names>W</given-names></name>, <name><surname>Mou</surname> <given-names>XQ</given-names></name></person-group>. A Comparison Study of Stationary and Mobile Eye Tracking on EXITs Design in a Wayfinding System. 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (Apsipa). <year>2015</year>:649-53. PubMed PMID: WOS:000382954100124.</mixed-citation></ref>
</ref-list>
</back>
</article>
