<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">

<article article-type="research-article" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
 <front>
    <journal-meta>
	<journal-id journal-id-type="publisher-id">Jemr</journal-id>
      <journal-title-group>
        <journal-title>Journal of Eye Movement Research</journal-title>
      </journal-title-group>
      <issn pub-type="epub">1995-8692</issn>
	  <publisher>								
	  <publisher-name>Bern Open Publishing</publisher-name>
	  <publisher-loc>Bern, Switzerland</publisher-loc>
	</publisher>
    </journal-meta>
    <article-meta>
	<article-id pub-id-type="doi">10.16910/jemr.11.6.2</article-id> 
	  <article-categories>								
				<subj-group subj-group-type="heading">
					<subject>Research Article</subject>
				</subj-group>
		</article-categories>
      <title-group>
        <article-title>MAGiC: A Multimodal Framework for Analysing Gaze in Dyadic Communication</article-title>
      </title-group>
	   <contrib-group> 
				<contrib contrib-type="author">
					<name>
						<surname>Arslan Aydın</surname>
						<given-names>Ülkü</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>
				<contrib contrib-type="author">
					<name>
						<surname>Kalkan</surname>
						<given-names>Sinan</given-names>
					</name>
					<xref ref-type="aff" rid="aff2">2</xref>
				</contrib>	
				<contrib contrib-type="author">
					<name>
						<surname>Acartürk</surname>
						<given-names>Cengiz</given-names>
					</name>
					<xref ref-type="aff" rid="aff1">1</xref>
				</contrib>        			
        <aff id="aff1">
		<institution>Cognitive Science Program Middle East Technical University Ankara</institution>,   <country>Turkey</country>
        </aff>
        <aff id="aff2">
		<institution>Computer Science Department Middle East Technical University Ankara</institution>,   <country>Turkey</country>
        </aff>        
		</contrib-group>   

		
	  <pub-date date-type="pub" publication-format="electronic"> 
		<day>12</day>  
		<month>11</month>
        <year>2018</year>
      </pub-date>
	  <pub-date date-type="collection" publication-format="electronic"> 
	  <year>2018</year>
	</pub-date>
      <volume>11</volume>
      <issue>6</issue>
	 <elocation-id>10.16910/jemr.11.6.2</elocation-id> 
	<permissions> 
	<copyright-year>2018</copyright-year>
	<copyright-holder>Arslan Aydın, Ü., Acartürk, C. &#x26; Kalkan, S.</copyright-holder>
	<license license-type="open-access">
  <license-p>This work is licensed under a Creative Commons Attribution 4.0 International License, 
  (<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">
    https://creativecommons.org/licenses/by/4.0/</ext-link>), which permits unrestricted use and redistribution provided that the original author and source are credited.</license-p>
</license>
	</permissions>
      <abstract>
        <p>The analysis of dynamic scenes has been a challenging domain in eye tracking research.
This study presents a framework, named MAGiC, for analyzing gaze contact and gaze aversion
in face-to-face communication. MAGiC provides an environment that is able to detect
and track the conversation partner’s face automatically, overlay gaze data on top of the face
video, and incorporate speech by means of speech-act annotation. Specifically, MAGiC integrates
eye tracking data for gaze, audio data for speech segmentation, and video data for
face tracking. MAGiC is an open source framework and its usage is demonstrated via publicly
available video content and wiki pages. We explored the capabilities of MAGiC
through a pilot study and showed that it facilitates the analysis of dynamic gaze data by
reducing the annotation effort and the time spent for manual analysis of video data.</p>
      </abstract>
      <kwd-group>
        <kwd>Gaze analysis</kwd>
        <kwd>speech analysis</kwd>
        <kwd>automatic face detection</kwd>
        <kwd>automatic speech segmentation</kwd>	
      </kwd-group>
    </article-meta>
  </front>	
  <body>

    <sec id="S1">
      <title>Introduction</title>

<p>In face-to-face social communication, interlocutors exchange both verbal
and non-verbal signals. Non-verbal signals are conveyed in various
modalities, such as facial expressions, gestures, intonation and eye
contact. Previous research has shown that non-verbal messages prevail over
synchronous verbal messages in case of a conflict between the two. In
particular, interlocutors usually interpret non-verbal messages rather
than verbal messages as a reflection of true feelings and intentions
(<xref ref-type="bibr" rid="b1 b2">1, 2</xref>). Therefore, an investigation of the structural
underpinnings of social interaction requires the study of both
non-verbal modalities and verbal modalities of communication. In the
present study, we focus on gaze as a non-verbal modality in face-to-face
communication. In particular, we focus on eye contact and gaze
aversion.</p>

<p>Eye contact is a crucial signal for social communication. It plays a
major role in initiating a conversation, in regulating turn taking
(<xref ref-type="bibr" rid="b3 b4">3, 4</xref>), in signaling topic (<xref ref-type="bibr" rid="b5 b6 b7 b8">5, 6, 7, 8</xref>) and in adjusting the
conversational roles of interlocutors (<xref ref-type="bibr" rid="b9 b10 b11">9, 10, 11</xref>). Moreover,
interlocutor’s putative mental states, such as
<italic>interest</italic>, are usually inferred from gaze (<xref ref-type="bibr" rid="b12">12</xref>).
In particular, eye contact is a fundamental, initial step for capturing
the attention of the communication partner and establishing joint
attention (<xref ref-type="bibr" rid="b13 b14">13, 14</xref>).</p>

<p>Gaze aversion is another coordinated interaction pattern that
regulates conversation. Gaze aversion is the act of intentionally
looking away from the interlocutor. Previous research has explored
the effects of gaze aversion on avoidance and approach. These studies
have shown that an averted gaze of an interlocutor initiates a tendency
to avoid, whereas a direct gaze initiates a tendency to approach
(<xref ref-type="bibr" rid="b15">15</xref>). Similarly, participants give higher ratings for
likeability and attractiveness when picture stimuli involve a face with
a direct gaze contact, compared to the stimuli that involve a face with
averted gaze (<xref ref-type="bibr" rid="b16 b17">16, 17</xref>).</p>

<p>The conversational functions of gaze aversion are also closely
related to speech (<xref ref-type="bibr" rid="b18 b19 b20">18, 19, 20</xref>). In particular, gaze serves to
repeat, complement, regulate and substitute for a verbal message.
Speech requires complementary functions, such as temporal
coordination of embodied cognitive processes including planning, memory
retrieval for lexical and semantic information, and phonemic
construction (<xref ref-type="bibr" rid="b21 b22 b23 b24 b25">21, 22, 23, 24, 25</xref>).</p>

<p>A closer look at speech as a communication modality reveals that
speech carries various useful signals about the content or quality of
speech itself, such as intonation, volume, pitch variations, speed and
actions done through speech (viz. speech acts). In the present study, we
focus on <italic>speech acts</italic> due to their salient role in
conversation. According to the speech act theory
(<xref ref-type="bibr" rid="b26 b27">26, 27</xref>), language is a tool to perform acts, as well as to
describe things and inform interlocutors about them.</p>

<p>The speech act theory is concerned with the function of language in
communication. It states that a speech act consists of various
components that have distinct roles. For analyzing language in
communication, discourse should be segmented into units that have
communicative functions. The relevant communicative functions should be
identified and labelled accordingly. The speech acts are usually
identified by analyzing the content of speech. However, temporal
properties of speech convey information to the interlocutor, too. For
instance, a pause may be conceived as a signal for a
shift in topic (<xref ref-type="bibr" rid="b24">24</xref>). Similarly, a pause may be an indicator of a
speaker’s fluency (<xref ref-type="bibr" rid="b28">28</xref>) and even of a speech
disorder (<xref ref-type="bibr" rid="b29">29</xref>). The framework that we present in this study
(viz. MAGiC) enables researchers to perform analyses by employing both
content of speech and its temporal properties. In the following section,
we present a major challenge for which MAGiC proposes a solution, namely
gaze data analysis in dynamic scenes.</p>
    </sec>
	
    <sec id="S2">
      <title>Gaze Data Analysis in Dynamic Scenes</title>

<p>Eye tracker manufacturers have been providing researchers with the
tools for identifying basic eye movement measures, such as gaze position
and duration, as well as a set of derived measures, such as Area of
Interest (AOI) statistics. The study of gaze in social interaction,
however, requires more advanced tools that would enable the researcher
to automatically analyze gaze data on dynamical scene recordings. The
analysis of gaze data in dynamical scenes has been a well-acknowledged
problem in eye tracking research (<xref ref-type="bibr" rid="b30">30</xref>) largely due to the
technical challenges in recognizing and tracking objects in a dynamic
scene. This is because eye trackers generate a raw data stream, which
contains a list of points-of-regard (POR) during the course of tracking
the participant’s eyes. In a stationary scene, it is relatively
straightforward to specify sub-regions (i.e., Areas of Interest, AOIs)
of the stimuli on the display. This specification is then used for
extracting AOI-based eye movement statistics. In the case of a dynamic
scene (cf. mobile eye trackers), the lack of predefined areas leads to
challenges in the automatic analysis of gaze data. A number of solutions
have been proposed to improve dynamic gaze data analysis and to make it
more robust against human errors in manual data annotation, such as
using infrared markers, employing inter-rater analysis and combining
state-of-the-art object recognition techniques for image processing.
However, each method has its own limitations (<xref ref-type="bibr" rid="b31 b32 b33">31, 32, 33</xref>). For
instance, infrared markers may lead to visual distraction. In addition,
in the case of multiple object detection, markers are not economically
or ergonomically feasible, since they must be attached to each
individual object to be tracked, as reported by previous research
(<xref ref-type="bibr" rid="b34 b35">34, 35</xref>). To the best of our knowledge, there is no commonly
accepted method for eye movement analysis in dynamic scenes. In this
study, we propose a solution to this problem in a specific domain, i.e.,
the dynamic analysis of faces, as presented in the following
section.</p>
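<p>The difference between AOI analysis in stationary and dynamic scenes can be sketched as follows. This is a minimal illustration with hypothetical function names, not MAGiC's implementation; in MAGiC the per-frame AOI would come from the tracked face boundary rather than a hand-specified rectangle.</p>

```python
def in_aoi(por, aoi):
    """Return True if a point-of-regard (x, y) falls inside an AOI (x, y, w, h)."""
    x, y, w, h = aoi
    px, py = por
    return x <= px <= x + w and y <= py <= y + h

def aoi_hits_static(por_stream, aoi):
    # Stationary scene: one fixed AOI rectangle for the whole recording.
    return sum(in_aoi(p, aoi) for p in por_stream)

def aoi_hits_dynamic(por_stream, aoi_per_frame):
    # Dynamic scene: the AOI must be re-derived for every frame,
    # e.g. from the face bounding box returned by a tracker.
    return sum(in_aoi(p, a) for p, a in zip(por_stream, aoi_per_frame))
```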

<p>We focus on a relatively well-developed subdomain of object
recognition: Face recognition. The recognition of faces has been subject
to intense research in computer vision due to its potential and
importance in daily life applications, e.g. in security. Accordingly,
MAGiC employs face recognition techniques to automatically detect gaze
contact and gaze aversion in dynamic scenarios, where eye movement data
are recorded. It aims to support the analysis of dynamic scenes by
reducing the effort spent on time-consuming and error-prone manual
annotation of gaze data.</p>

<p>MAGiC also provides an environment that facilitates the analysis of
audio recordings. Manual segmentation of audio recordings into speech
components and pause components is neither efficient nor reliable, since
it may exclude potentially meaningful information from the analyses
(<xref ref-type="bibr" rid="b36 b37">36, 37</xref>). In the following section, we present a technical
overview of the framework by presenting its components for face tracking
and speech segmentation.</p>
    </sec>
	
    <sec id="S3">
      <title>A Technical Overview of the MAGiC Framework</title>

<p>In the two subsections below, we present how <italic>face
tracking</italic> and <italic>speech segmentation</italic> are conducted
by MAGiC through its open source components.</p>

    <sec id="S3a">
      <title>Face Tracking</title>

<p>Face tracking has been a challenging topic in computer vision. In
face tracking, a face in a video-frame is detected first, and then it is
tracked throughout the stream. In the present study, we employ an
established face tracking toolkit called <italic>OpenFace</italic>, an
open source tool for analyzing facial behavior (<xref ref-type="bibr" rid="b38">38</xref>).
<italic>OpenFace</italic> combines out-of-the-box solutions with the
state-of-the-art research to perform tasks including facial-landmark
detection, head-pose estimation and action unit (AU) recognition.
MAGiC’s face tracking method is based on Baltrušaitis et al.
(<xref ref-type="bibr" rid="b38">38</xref>), Baltrušaitis, Mahmoud, &#x26; Robinson (<xref ref-type="bibr" rid="b39">39</xref>), and
Baltrušaitis, Robinson, &#x26; Morency (<xref ref-type="bibr" rid="b40">40</xref>).</p>

<p><italic>OpenFace</italic> utilizes a face detector pre-trained in
<italic>dlib</italic>, an open source machine-learning library written
in C++ (<xref ref-type="bibr" rid="b41">41</xref>). The Max-margin
object-detection algorithm (MMOD) of the face detector uses Histogram of
Oriented Gradients (HOG) feature extraction. The face detector is
trained on sub-windows in an image. Since the number of windows may be
large even in moderately sized images, a relatively small amount of data
is sufficient for training (<xref ref-type="bibr" rid="b41 b42">41, 42</xref>). After a face is detected,
<italic>OpenFace</italic> detects the facial landmarks by utilizing an
instance of the Constrained Local Model (CLM), namely the Constrained
Local Neural Field (CLNF), which performs feature detection even in
complex scenes. The response maps are extracted by using pre-trained
patch experts. Patch responses are optimized with a fitting method, viz.
Non-Uniform Regularized Landmark Mean-Shift (NU-RLMS, see Figure 1).</p>
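<p>The HOG idea behind the detector's features can be illustrated with a toy gradient-orientation histogram. This is a schematic sketch of the technique, not the <italic>dlib</italic> implementation: gradient orientations within a patch are accumulated, weighted by gradient magnitude, into a small histogram that serves as the feature vector a detector is trained on.</p>

```python
import numpy as np

def hog_histogram(patch, n_bins=9):
    """Unsigned-orientation gradient histogram for a 2-D grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))      # image gradients
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees, as in classic HOG.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((orientation / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m                               # magnitude-weighted voting
    return hist / (np.linalg.norm(hist) + 1e-9)    # L2-normalize
```

For a patch whose intensity increases left to right, for example, all gradient energy falls into the first (near-horizontal) orientation bin.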

<fig id="fig01" fig-type="figure" position="float">
					<label>Figure 1.</label>
					<caption>
						<p>A demonstration of
        <italic>OpenFace</italic> methodology, adapted from Baltrušaitis
        et al. (<xref ref-type="bibr" rid="b40">40</xref>). The illustration is intentionally limited to
        two landmark patch experts for the sake of clarity (all photos
        used with the permission of the participant).</p>
					</caption>
					<graphic id="graph01" xlink:href="jemr-11-06-b-figure-01.png"/>
				</fig>

<p>The CLM (Constrained Local Model) is composed of three main steps.
First, a Point Distribution Model (PDM) extracts the mean geometry of a
shape from a set of training shapes. A statistical shape model is built
from a given set of samples. Each shape in the training set is
characterized by a set of landmark points. The number of landmarks and
the anatomical locations represented by specific landmark points should
be consistent from one shape to the next. For instance, for a face
shape, specific landmark points may always correspond to eyelids. In
order to minimize the sum of squared distances to the mean of the set,
each training shape is aligned into a common coordinate frame by
rotation, translation and scaling. Principal Component Analysis
(PCA) is then used to pick out the correlations between groups of
landmarks among the training shapes. At the end of the PDM step, patches
are created around each facial landmark. The patches are trained with a
given set of face-shapes.</p>
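<p>The PDM construction described above, i.e. aligning the training shapes to a common frame and then applying PCA to extract the main modes of shape variation, can be condensed into a short sketch. The alignment below is simplified (it removes only translation and scale, not rotation), and all names and data are illustrative rather than taken from <italic>OpenFace</italic>.</p>

```python
import numpy as np

def align(shape):
    """Remove translation and scale (a simplified Procrustes alignment)."""
    centered = shape - shape.mean(axis=0)          # remove translation
    return centered / np.linalg.norm(centered)     # remove scale

def build_pdm(shapes, n_modes=2):
    """Return the mean shape vector and the first principal modes of variation."""
    X = np.stack([align(s).ravel() for s in shapes])   # one row per training shape
    mean = X.mean(axis=0)
    # PCA via SVD of the centered shape matrix; rows of vt are the modes.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_modes]
```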

<p><bold>Patch Experts</bold>, also known as <italic>local
detectors</italic>, are used for calculating response maps that
represent the probability that a certain landmark is aligned at
image location <italic>x</italic><sub>i</sub> (Eq. (1)), adapted from
Baltrušaitis et al. (<xref ref-type="bibr" rid="b40">40</xref>). A total of 68 patch experts are
employed to localize 68 facial landmark positions, as presented in
Figure 2.</p>

<fig id="eq01" fig-type="equation" position="anchor">
					<label>(1)</label>
					<graphic id="equation01" xlink:href="jemr-11-06-b-equation-01.png"/>
				</fig>



<p>where <inline-formula>
<mml:math id="m2"><mml:mi>I</mml:mi></mml:math></inline-formula>
is an intensity image, and <inline-formula>
<mml:math id="m3"><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>
is the output of a logistic regressor, with a value between
<italic>0</italic> and <italic>1</italic> (<italic>0</italic>
representing no alignment and <italic>1</italic> representing perfect
alignment). Due to their computational advantages and implementational
simplicity, Support Vector Regressors (SVRs) are usually employed as
patch experts. The CLNF (Constrained Local Neural Field) model, in
contrast, uses the LNF approach, which considers spatial features
that lead to fewer peaks, smoother responses and reduced noise.</p>
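<p>The role of a patch expert can be illustrated with a schematic response map, in the spirit of Eq. (1): at every candidate location, a linear filter is applied to the local intensity patch and passed through a logistic function, yielding an alignment probability between 0 and 1. The weights below are hypothetical, not trained.</p>

```python
import numpy as np

def response_map(image, weights, bias, half=1):
    """Logistic patch-expert responses over all interior pixel locations."""
    h, w = image.shape
    out = np.zeros((h - 2 * half, w - 2 * half))
    for r in range(half, h - half):
        for c in range(half, w - half):
            # Linear score of the local patch, squashed to a probability.
            patch = image[r - half:r + half + 1, c - half:c + half + 1]
            score = np.dot(weights.ravel(), patch.ravel()) + bias
            out[r - half, c - half] = 1.0 / (1.0 + np.exp(-score))
    return out
```

The location with the highest response is the patch expert's best guess for the landmark position; the fitting stage then reconciles these per-landmark guesses with the shape model.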

<p><bold>Regularised Landmark Mean Shift</bold> (RLMS) is the next step
of the CLM (Constrained Local Model). RLMS is a common method to solve
the fitting problem. It updates the CLM parameters to get closer to a
solution. An iterative fitting method is used to update the initial
parameters of the CLM, until achieving a convergence to an optimal
solution. The general concept of iterative fitting is defined in Eq.
(2), adapted from Baltrušaitis et al. (<xref ref-type="bibr" rid="b40">40</xref>):</p>

<fig id="eq02" fig-type="equation" position="anchor">
					<label>(2)</label>
					<graphic id="equation02" xlink:href="jemr-11-06-b-equation-02.png"/>
				</fig>

<p>where <inline-formula>
<mml:math id="m6"><mml:mi>R</mml:mi></mml:math></inline-formula>
is a regularization term and <inline-formula>
<mml:math id="m7"><mml:msub><mml:mi>D</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>
represents the misalignment measure for image
<inline-formula>
<mml:math id="m8"><mml:mi>I</mml:mi></mml:math></inline-formula>
at image location <inline-formula>
<mml:math id="m9"><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>.
Regularizing the model parameters is necessary to prevent overfitting
(overfitting causes a model to perform poorly on data not seen during
training). RLMS does not discriminate between the confidence levels of
the response maps. To cope with noisy response maps, a non-uniform RLMS
that weights the mean-shift vectors is employed.
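<p>The confidence-weighted, regularized update can be sketched in a deliberately simplified form (identity Jacobian, diagonal weights, regularization toward the mean shape at the origin). This is an illustration of the idea, not <italic>OpenFace</italic>'s implementation; all names and values are hypothetical.</p>

```python
import numpy as np

def rlms_step(p, peaks, confidences, reg=1.0):
    """One regularized, confidence-weighted mean-shift update of landmarks p."""
    w = confidences[:, None]                   # per-landmark weight (NU-RLMS idea)
    v = peaks - p                              # mean-shift vectors toward the peaks
    return p + (w * v - reg * p) / (w + reg)   # regularized update

p = np.zeros((3, 2))                           # initial landmark positions
peaks = np.array([[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]])
conf = np.array([3.0, 1.0, 0.1])               # low-confidence landmarks pull less
for _ in range(10):                            # iterate until convergence
    p = rlms_step(p, peaks, conf)
```

After convergence, high-confidence landmarks sit close to their response peaks, while low-confidence ones are shrunk toward the regularized (mean) shape.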

<p>At the end of RLMS, the <italic>OpenFace</italic> toolkit detects a
total of 68 facial landmarks (Figure 2). The detection of the face
boundaries based on facial landmarks enables more precise calculations
than using a rectangle that covers the face region.</p>

<fig id="fig02" fig-type="figure" position="float">
					<label>Figure 2.</label>
					<caption>
						<p>A total of 68 landmark positions on a face.</p>
					</caption>
					<graphic id="graph02" xlink:href="jemr-11-06-b-figure-02.png"/>
				</fig>

<p>We extended the <italic>OpenFace</italic> source code with a set
of improvements that allow the user to perform manual AOI
annotation, generate visualizations that employ proposed input
parameters, build a custom face detector and then use the detector to
track the face, and generate separate output files depending on the
input parameters. In the following section, we present the speech
segmentation module.</p>
    </sec>
	
    <sec id="S3b">
      <title>Speech Segmentation</title>

<p>Speech is a continuous audio stream with dynamically changing and
usually indistinguishable parts. Speech analysis has been recognized as
a challenging domain of research, since it is difficult to automatically
identify clear boundaries between speech-related units. Speech analysis
involves two interrelated families of methodologies, namely <italic>speech
segmentation</italic> and <italic>diarization</italic>. Speech
segmentation is the separation of the audio recordings into units of
homogeneous parts, such as speech, silence, and laugh. Diarization is
used for extracting various characteristics of signals, such as speaker
identity, gender, channel type and background environment (e.g., noise,
music, silence). The MAGiC framework addresses both methodologies, since
both segmentation and identification are indispensable components of
face-to-face conversation.</p>

<p>In MAGiC, we employed the <italic>CMUSphinx</italic> Speech
Recognition System (<xref ref-type="bibr" rid="b43">43</xref>) by extending it for the analysis of
recorded speech. <italic>CMUSphinx</italic> is an open source,
platform-independent and speaker-independent speech recognition system.
It is integrated with <italic>LIUM</italic>, an open source toolkit for
speaker segmentation and diarization. The speech analysis process starts
with feature extraction. <italic>CMUSphinx</italic> functions extract
features, such as Mel-frequency Cepstral Coefficients (MFCC), which
collectively represent power spectrum of a sound segment. It then
performs speech segmentation based on Bayesian Information Criterion
(<xref ref-type="bibr" rid="b44 b45">44, 45</xref>).</p>

<p>The MAGiC framework performs two passes over the sound signal for
speech segmentation. In the first pass, a distance-based segmentation
process detects the <italic>change points</italic> by means of a
likelihood measure, namely Generalized Likelihood Ratio (GLR). In the
second pass, the system mixes together successive segments from the same
speaker. After the segmentation, Bayesian Information Criterion (BIC)
hierarchical clustering is performed with an initial set that consists
of one cluster per segment. At each iteration, the
<inline-formula>
<mml:math id="m10"><mml:mrow><mml:mstyle mathvariant="normal"><mml:mi>Δ</mml:mi></mml:mstyle><mml:msub><mml:mtext mathvariant="normal">BIC</mml:mtext><mml:mtext mathvariant="normal">ij</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula>
values for two successive clusters <inline-formula>
<mml:math id="m11"><mml:mi>i</mml:mi></mml:math></inline-formula>
and <inline-formula>
<mml:math id="m12"><mml:mi>j</mml:mi></mml:math></inline-formula>
are defined, as described by Meignier and Merlin (<xref ref-type="bibr" rid="b46">46</xref>), as
follows:</p>

<fig id="eq03" fig-type="equation" position="anchor">
					<label>(3)</label>
					<graphic id="equation03" xlink:href="jemr-11-06-b-equation-03.png"/>
				</fig>

<p>where <inline-formula>
<mml:math id="m14"><mml:mrow><mml:mo stretchy="true" form="prefix">|</mml:mo><mml:msub><mml:mi>Σ</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="true" form="postfix">|</mml:mo></mml:mrow></mml:math></inline-formula>,
<inline-formula>
<mml:math id="m15"><mml:mrow><mml:mo stretchy="true" form="prefix">|</mml:mo><mml:msub><mml:mi>Σ</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="true" form="postfix">|</mml:mo></mml:mrow></mml:math></inline-formula>
and <inline-formula>
<mml:math id="m16"><mml:mrow><mml:mo stretchy="true" form="prefix">|</mml:mo><mml:mi>Σ</mml:mi><mml:mo stretchy="true" form="postfix">|</mml:mo></mml:mrow></mml:math></inline-formula>
are the determinants of the covariance matrices associated with clusters
<inline-formula>
<mml:math id="m17"><mml:mi>i</mml:mi></mml:math></inline-formula>,
<inline-formula>
<mml:math id="m18"><mml:mi>j</mml:mi></mml:math></inline-formula>
and <inline-formula>
<mml:math id="m19"><mml:mrow><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mi>i</mml:mi><mml:mspace width="0.222em"></mml:mspace><mml:mo>+</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></inline-formula>;
<inline-formula>
<mml:math id="m20"><mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mspace width="0.333em"></mml:mspace><mml:mtext mathvariant="normal"> and </mml:mtext><mml:mspace width="0.333em"></mml:mspace></mml:mrow><mml:msub><mml:mi>n</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mspace width="0.222em"></mml:mspace></mml:mrow></mml:math></inline-formula>
refer to the total lengths of cluster <inline-formula>
<mml:math id="m21"><mml:mi>i</mml:mi></mml:math></inline-formula>
and cluster <inline-formula>
<mml:math id="m22"><mml:mi>j</mml:mi></mml:math></inline-formula>;
λ is the smoothing parameter that is chosen to get a good estimator, and
<inline-formula>
<mml:math id="m23"><mml:mi>P</mml:mi></mml:math></inline-formula>
is the penalty factor. The ∆BIC value for each pair of successive
clusters is calculated, and the two clusters are merged when the value
is below 0.</p>
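<p>The merge test can be sketched as follows over toy feature clusters (e.g., MFCC frames). This is an illustrative reading of Eq. (3), not the <italic>LIUM</italic> implementation; in particular, the penalty factor below is the standard BIC penalty for full-covariance Gaussians, assumed here for concreteness.</p>

```python
import numpy as np

def delta_bic(ci, cj, lam=1.0):
    """Delta-BIC between clusters ci, cj (rows = frames, cols = features)."""
    def logdet(x):
        # Log-determinant of the cluster's sample covariance matrix.
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    merged = np.vstack([ci, cj])
    ni, nj, n = len(ci), len(cj), len(merged)
    d = merged.shape[1]
    # Standard BIC penalty for a d-dimensional full-covariance Gaussian.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(merged)
            - 0.5 * ni * logdet(ci) - 0.5 * nj * logdet(cj)
            - lam * penalty)
```

Two clusters drawn from the same distribution (same speaker) yield a negative value and are merged; well-separated clusters yield a positive value and stay apart.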

<p>As the next step of the speech analysis, Viterbi decoding is applied
for re-segmentation. A Gaussian Mixture Model (GMM) with eight
components is employed to represent the clusters. The parameters of the
mixture are estimated by Expectation Maximization (EM). To minimize the
number of undesired segments, such as overly long segments or segments
that overlap with word boundaries, segment boundaries are slightly moved
to nearby low-energy points, and long segments are cut iteratively to
create segments that are shorter than 20 seconds. Until this stage in
the workflow, non-normalized features that preserve background
information are employed during segmentation and clustering. This method
facilitates differentiating speakers and assigning one single speaker to
each cluster. On the other hand, it may also lead to the allocation of
the same speaker to multiple clusters. To resolve this issue, GMM-based
speaker clustering is performed with normalized features to assign the
same speaker to the same cluster. The GMM iterates until it reaches a
pre-defined threshold value. Figure 3 shows the workflow of speaker
diarization.</p>

<fig id="fig03" fig-type="figure" position="float">
					<label>Figure 3.</label>
					<caption>
						<p>Typical workflow for speaker
        diarization and segmentation, adapted from LIUM Speaker
        Diarization Wiki Page
        (<ext-link ext-link-type="uri" xlink:href="http://www-lium.univ-lemans.fr/diarization/doku.php/overview" xlink:show="new">http://www-lium.univ-lemans.fr/diarization/doku.php/overview</ext-link>)</p>
					</caption>
					<graphic id="graph03" xlink:href="jemr-11-06-b-figure-03.png"/>
				</fig>


<p>We extended the <italic>CMUSphinx</italic> source code and
made the following additions. <italic>CMUSphinx</italic> does
not generate segments for the whole audio. For instance, it does
not generate segments for the parts where the speaker could not
be identified. However, those non-segmented parts might contain
useful information. Thus, we carried out additional development
to automatically generate audio segments of non-segmented parts.
To do this, the time interval of each successive segment was
calculated. If there existed a time difference between the end
of the previous segment and the beginning of the next one, we
created a new audio-segment that covered that time-range. We
also added a new functionality for segmenting audio with
specified intervals.</p>
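<p>The gap-filling addition described above can be sketched as follows (hypothetical function name; segments are (start, end) pairs in seconds): for every time range not covered by a produced segment, including before the first and after the last one, a new gap segment is emitted.</p>

```python
def fill_gaps(segments, audio_length):
    """Return gap segments covering the non-segmented parts of the audio."""
    gaps, cursor = [], 0.0
    for start, end in sorted(segments):
        if start > cursor:                 # non-segmented part before this segment
            gaps.append((cursor, start))
        cursor = max(cursor, end)          # tolerate overlapping segments
    if cursor < audio_length:              # trailing non-segmented part
        gaps.append((cursor, audio_length))
    return gaps
```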


    </sec>
    </sec>

    <sec id="S4">
      <title>Demonstration of the MAGiC Framework: A Pilot Study</title>

<p>This section reports a pilot study that demonstrates the
functionalities and benefits of the MAGiC framework. The setting is a
mock job interview, in which a pair of participants wear eye tracking
glasses and conduct an interview. The gaze data and the video data
are then analyzed by MAGiC.</p>

    <sec id="S4a">
      <title>Participants, Materials and Design</title>

<p>Three pairs of male participants (university students as volunteers)
took part in the pilot study (mean age 28, SD = 4.60). The task was a
mock job interview. One of the participants was randomly assigned the
role of interviewer and the other the role of interviewee. All the
participants were native Turkish speakers and had
normal or corrected-to-normal vision. No time limit was imposed on
the participants.</p>

<p>At the beginning of the session, the participants were informed about
the task. Both participants wore monocular Tobii eye tracking glasses
with a sampling rate of 30 Hz and a 56°x40° recording visual angle
for the visual scene. The glasses recorded the scene-camera video
and the sound, in addition to gaze data. The IR
(infrared)-marker calibration process was repeated until reaching 80%
accuracy. After the calibration, the participants were seated on the
opposite sides of a table, approximately 100 cm away from each other. A
beep sound was introduced to indicate the beginning of a session, for
synchronization in data analysis.</p>

<p>Eight common job interview questions, adapted from Villani, Repetto,
Cipresso, &#x26; Riva (<xref ref-type="bibr" rid="b47">47</xref>), were presented to the interviewer on
paper. The interviewer was instructed to ask the given questions, and
also to evaluate the interviewee for each question by using paper and
pencil.</p>
    </sec>

    <sec id="S4b">
      <title>Data Analysis</title>

<p>We conducted data analysis using the speech analysis module, the AOI
analysis module and the summary module in MAGiC. As a test environment,
a PC was used with an Intel Core i5 2410M CPU at 2.30 GHz with 8 GB RAM
running Windows 7 Enterprise (64 bit).</p>

<p><bold>Speech Analysis</bold>. First, a MAGiC function (“Extract and
Format Audio”) was employed to extract the audio and then to format the
extracted audio for subsequent analysis. This function was run
separately for each participant in the pair. Therefore, in total, six
sound (.wav) files were produced. Each run took one to two seconds for
the extraction. Second, the formatted audio files were segmented one by
one. Audio-segments and a text file were created. The text file
contained the id number and the duration of each segment. The number of
segments varied depending on the length and the content of the audio
(Table 1). Each run took one to two seconds for the analysis.</p>

<table-wrap id="t01" position="float">
					<label>Table 1.</label>
					<caption>
						<p>Audio length and the number of segments for each
participant’s recording.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th></th>      
        <th colspan="2">Interviewer/ Interviewee</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td></td>
        <td>Audio Length (m:ss.ms)</td>
        <td>Number of Segments</td>
      </tr>
      <tr>
        <td>Pair-1</td>
        <td>3:46.066/ 3:57.00</td>
        <td>170/176</td>
      </tr>
      <tr>
        <td>Pair-2</td>
        <td>5:25.066/ 5:40.00</td>
        <td>120/200</td>
      </tr>
      <tr>
        <td>Pair-3</td>
        <td>5:28.000/ 5:09.00</td>
        <td>246/208</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>Third, time-interval estimation, synchronization and re-segmentation
were performed for each pair by using an interface that we call the
“Time Interval Estimation” panel.</p>

<p>When the experiment session is conducted with multiple recording
devices, one of the major issues is synchronization of the recordings.
Currently, eye tracker manufacturers do not provide synchronization
solutions. In most cases, the device clocks are set manually. MAGiC
provides a semi-automatic method for synchronizing multiple recordings
from a participant pair. In this method, the user is expected to specify
the initial segment of the session in both recordings. Since the user
identifies the beginning of the session by listening to the automatically
created segments instead of the whole recording, the resulting time
estimation is more accurate. Then, MAGiC calculates the time offset to
provide synchronization by taking the time difference of the specified
initial segments. After performing the re-segmentation process (by
utilizing synchronization information and by merging segments from both
recordings), we end up with equal-length session intervals for
participants within each pair. The closer the microphone is to a
participant, the cleaner the gathered audio recording. Thus, segmenting
multiple recordings of the same session may yield different numbers of
segments. A re-segmentation process merges
segments from different recordings in order to reduce data loss. Table 2
presents the experiment duration in milliseconds and the number of
segments produced after re-segmentation in our pilot study. Each run
took one to two seconds.</p>
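<p>The offset computation and merge described above can be sketched as follows. This is an illustrative reconstruction, not MAGiC’s actual code; segment boundaries are represented simply as start times in seconds:</p>

```python
# Illustrative sketch of the semi-automatic synchronization and
# re-segmentation logic; the data layout is hypothetical.

def time_offset(initial_start_a, initial_start_b):
    """Clock offset between two recordings of the same session, taken as
    the time difference of the user-identified initial segments."""
    return initial_start_a - initial_start_b

def resegment(boundaries_a, boundaries_b, offset):
    """Merge segment boundaries from both recordings onto recording A's
    clock, yielding one shared set of equal-length session intervals."""
    shifted_b = [t + offset for t in boundaries_b]
    merged = sorted(set(boundaries_a) | set(shifted_b))
    return list(zip(merged, merged[1:]))  # (start, end) pairs

# Recording A's clock runs 10 s ahead of recording B's.
offset = time_offset(12.4, 2.4)
intervals = resegment([12.4, 13.0, 14.5], [2.4, 3.2, 4.5], offset)
print(offset, intervals)
```

<p>Note that merging typically yields more intervals than either recording alone, which is why the segment counts in Table 2 exceed those in Table 1.</p>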

<table-wrap id="t02" position="float">
					<label>Table 2.</label>
					<caption>
						<p>Audio length and number of segments for each participant’s
recording. The number of segments increased after re-segmentation (see
Table 1).</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th></th>
        <th>Exp. Duration (m:ss.ms)</th>
        <th>Number of Segments</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Pair-1</td>
        <td>3:02.40</td>
        <td>261</td>
      </tr>
      <tr>
        <td>Pair-2</td>
        <td>5:05.40</td>
        <td>282</td>
      </tr>
      <tr>
        <td>Pair-3</td>
        <td>4:42.60</td>
        <td>406</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>Finally, speech annotation was performed. A list of pre-defined
speech acts was prepared as the first step of the analysis: Speech,
Speech Pause, Thinking (e.g., “uh”, “er”, “um”, “eee”),
Ask-Question, Greeting (e.g., “welcome”, “thanks for your attendance”),
Confirmation (e.g., “good”, “ok”, “huh-huh”), Questionnaire Filling
(the interviewer filling in the questionnaire), Pre-Speech (i.e., warming up the
voice), Reading and Articulation of Questions, Laugh, Signaling end of
the speech (e.g., “that is all”).</p>

<p>The next step was the manual annotation process. For the end user,
this process involved selecting the speech act(s) and annotating the
segments. At each annotation, a new line was appended and displayed,
which contained the relevant segment's time-interval, its associated
participant (if any), and the user-selected speech-act(s). The annotation
was performed for the three pairs of participants separately. Each run
took ten to twenty minutes, depending on the session length.</p>

<p><bold>AOI Analysis</bold>. All six video recordings of the pilot
study were processed with <italic>OpenFace</italic>’s
<italic>default-mode face detector</italic>. The tracking process
produced two-dimensional landmarks on the interlocutor’s face image. The
process took 4 to 10 minutes per video. Then, gaps of at most two
frames in duration were filled in by linear interpolation of the raw
gaze data. The raw gaze data file included the frame number, the gaze
point classification (either <italic>Unclassified</italic> or
<italic>Fixation</italic>), and the x-y coordinates. The filled gaps
comprised approximately 2% of the total raw gaze data (Table 3). The gap
filling process took less than a second per pair.</p>
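<p>The gap-filling step can be sketched as follows; this is a minimal reconstruction of the described behavior (interior gaps of at most two frames interpolated linearly), with a hypothetical data layout:</p>

```python
# Minimal sketch of gap filling by linear interpolation; not MAGiC's
# actual code. Frames without raw gaze data are represented as None.

def fill_gaps(gaze, max_gap=2):
    """gaze: list of (x, y) tuples, or None for frames with no data.
    Returns a copy in which interior gaps of at most `max_gap` frames
    are filled by linear interpolation between the neighboring samples."""
    filled = list(gaze)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1
            gap = j - i
            # Interpolate only interior gaps no longer than max_gap.
            if 0 < i and j < len(filled) and gap <= max_gap:
                (x0, y0), (x1, y1) = filled[i - 1], filled[j]
                for k in range(gap):
                    t = (k + 1) / (gap + 1)
                    filled[i + k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            i = j
        else:
            i += 1
    return filled

# The two-frame gap is filled; the three-frame gap is left untouched.
data = [(0.0, 0.0), None, None, (3.0, 3.0), None, None, None, (7.0, 7.0)]
print(fill_gaps(data))
```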

<table-wrap id="t03" position="float">
					<label>Table 3.</label>
					<caption>
						<p>The number and ratio of the filled gaps for each
participant’s raw gaze data.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th></th>      
        <th colspan="2">Interviewer/ Interviewee</th>
      </tr>

      <tr>
        <td></td>
        <td>Number of
        filled gaps</td>
        <td>Ratio of
        filled gaps (%)</td>
      </tr>
    </thead>
    <tbody>      
      <tr>
        <td>Pair-1</td>
        <td>146 / 236</td>
        <td>2.15 / 3.32</td>
      </tr>
      <tr>
        <td>Pair-2</td>
        <td>171 / 236</td>
        <td>1.75 / 2.31</td>
      </tr>
      <tr>
        <td>Pair-3</td>
        <td>157 / 335</td>
        <td>1.60 / 3.61</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>After the gap filling process, we performed AOI detection by setting
the parameters for eye tracker accuracy and image resolution. In the
present study, the size of the captured images for face tracking was 720
× 480 pixels, while the eye tracker image-frame resolution was 640 ×
480. The eye tracking glasses had a reported degree of accuracy of half
a degree of visual angle. The built-in scene camera recording angles of
the eye tracking glasses were 56 degrees horizontal and 40 degrees
vertical. The seating distance between the participants was
approximately 100 cm. Accordingly, the eye tracker accuracy was 4.84
pixels horizontal and 5.34 pixels vertical. The AOI detection took a
couple of seconds. Table 4 presents the number and the ratio of
image-frames for which AOI detection failed due to an undetected face.
The results indicate that higher undetected-face rates were observed in
the interviewer’s recordings. Nevertheless, face detection succeeded for
more than 90% of the frames on average.</p>
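<p>The degree-to-pixel conversion above can be reconstructed as follows, assuming a simple pinhole camera model; since MAGiC’s exact conversion may differ, the figures this sketch produces approximate rather than reproduce the reported values:</p>

```python
import math

# Converting eye tracker accuracy (degrees of visual angle) into scene
# camera pixels under a pinhole camera model. This is our own
# reconstruction, not MAGiC's actual code.

def accuracy_in_pixels(accuracy_deg, fov_deg, resolution_px):
    """Pixels spanned by `accuracy_deg` of visual angle at the image
    center, for a camera with the given field of view and resolution."""
    # Focal length in pixels, derived from the field of view.
    focal_px = (resolution_px / 2) / math.tan(math.radians(fov_deg / 2))
    return focal_px * math.tan(math.radians(accuracy_deg))

# Pilot study parameters: 0.5 deg accuracy, 56 x 40 deg FOV, 640 x 480 frame.
h = accuracy_in_pixels(0.5, 56, 640)
v = accuracy_in_pixels(0.5, 40, 480)
print(round(h, 2), round(v, 2))
```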

<table-wrap id="t04" position="float">
					<label>Table 4.</label>
					<caption>
<p>The number and ratio of image-frames in which the face could not be
detected.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th></th>      
        <th colspan="2">Interviewer/ Interviewee</th>
      </tr>

      <tr>
        <td></td>
        <td>Number of
        undetected</td>
        <td>Ratio of
        undetected (%)</td>
      </tr>
    </thead>
    <tbody>      
      <tr>
        <td>Pair-1</td>
        <td>570 / 173</td>
        <td>10.4 / 3.16</td>
      </tr>
      <tr>
        <td>Pair-2</td>
        <td>2113 / 488</td>
        <td>23.1 / 5.33</td>
      </tr>
      <tr>
        <td>Pair-3</td>
        <td>1251 / 117</td>
        <td>14.8 / 1.38</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>The absence of gaze data is another issue that leads to failure in
AOI detection. Table 5 shows the ratio of undetected AOIs due to the
absence of gaze data.</p>

<table-wrap id="t05" position="float">
					<label>Table 5.</label>
					<caption>
<p>The number and ratio of image-frames in which raw gaze data were
absent.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">

    <thead>
      <tr>
        <th></th>      
        <th colspan="2">Interviewer/ Interviewee</th>
      </tr>

      <tr>
        <td></td>
        <td>Number of
        undetected</td>
        <td>Ratio of
        undetected (%)</td>
      </tr>
    </thead>
    <tbody>      
      <tr>
        <td>Pair-1</td>
        <td>3237 / 392</td>
        <td>59.1 / 7.16</td>
      </tr>
      <tr>
        <td>Pair-2</td>
        <td>4762 / 1050</td>
        <td>52.0 / 11.50</td>
      </tr>
      <tr>
        <td>Pair-3</td>
        <td>4010 / 1732</td>
        <td>47.3 / 20.40</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>The failure in AOI detection on the interviewer’s side was
approximately 50%. This is due to the experimental setting, where the
interviewer looked at the questions to read them. This is a situation
that experiment designers face frequently in dynamic experiment
settings. The MAGiC framework’s interface allows the user to detect the
source of the problem and to annotate it with a label through a panel
interface that we call “Visualize Tracking”. The panel interface
displays the recording by overlaying the detected facial landmarks, raw
gaze data and gaze annotation (looking at the interlocutor’s face, i.e.,
in, or looking away from the interlocutor’s face, i.e., out) on top of
the video recording for each frame, as shown in Figure 4.</p>

<fig id="fig04" fig-type="figure" position="float">
					<label>Figure 4.</label>
					<caption>
						<p>A snapshot from the visualize-tracking panel.</p>
					</caption>
					<graphic id="graph04" xlink:href="jemr-11-06-b-figure-04.png"/>
				</fig>

<p>The analysis of the scenes with the “Visualize Tracking” panel
revealed that the missing raw gaze data were due to the interviewer
reading and articulating the questions, and evaluating the interviewee’s
responses with paper and pencil. In those cases, the interviewer looked
outside of the glasses frame to read the questions in the notebook. In
our pilot study, the manual annotation took 15 to 20 minutes per pair,
on average.</p>

<p>The final step in the AOI analysis comprised two further functions
provided by the MAGiC framework: a re-analysis step, which merged the
automatically detected AOIs with the manually extracted AOI-labels, and
a comparison of the resulting detection ratio with the previous
outcomes.</p>
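<p>The merge can be sketched as follows, assuming a simple frame-to-label mapping (the data structures and label names here are hypothetical, not MAGiC’s internal representation):</p>

```python
# Sketch of the re-analysis step: automatic AOI labels are merged with
# manually extracted labels, the manual labels filling frames where
# automatic detection failed. The data layout is illustrative only.

def merge_aoi_labels(automatic, manual):
    """automatic, manual: dicts mapping frame number to 'in'/'out'/None.
    Automatic labels win where present; manual labels fill the gaps."""
    merged = {}
    for frame in set(automatic) | set(manual):
        label = automatic.get(frame)
        if label is None:
            label = manual.get(frame)
        merged[frame] = label
    return merged

def detection_ratio(labels):
    """Fraction of frames carrying a usable AOI label."""
    return sum(1 for v in labels.values() if v is not None) / len(labels)

auto = {0: "in", 1: None, 2: None, 3: "out"}   # frames 1-2 undetected
hand = {1: "out", 2: "in"}                     # manual annotation
merged = merge_aoi_labels(auto, hand)
print(detection_ratio(auto), detection_ratio(merged))  # 0.5 1.0
```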

<p>Table 6 shows face-detection and gaze-detection accuracies for the
interviewer’s recordings. The results reveal an improvement of more than
30% after the final step, compared to the previous analysis steps (cf.
Table 4 and Table 5).</p>

<table-wrap id="t06" position="float">
					<label>Table 6.</label>
					<caption>
<p>The number and ratio of the image-frames in which the face and the
gaze could not be detected.</p>
					</caption>
					<table frame="hsides" rules="groups" cellpadding="3">
    <tbody>

      <tr>
        <td></td>      
        <td colspan="2">Face</td>
      </tr>


      <tr>
        <td>Id</td>
        <td>Number of
        undetected face</td>
        <td>Ratio of
        undetected face (%)</td>
      </tr>
      <tr>
        <td>1</td>
        <td>4</td>
        <td>0.07</td>
      </tr>
      <tr>
        <td>2</td>
        <td>38</td>
        <td>0.41</td>
      </tr>
      <tr>
        <td>3</td>
        <td>5</td>
        <td>0.06</td>
      </tr>
      <tr>
        <td></td>
        <td></td>
        <td></td>
      </tr>
   
      <tr>
        <td></td>      
        <td colspan="2">Gaze</td>
      </tr>
     
      <tr>
        <td>Id</td>
        <td>Number of
        undetected gaze</td>
        <td>Ratio of
        undetected gaze (%)</td>
      </tr>
      <tr>
        <td>1</td>
        <td>1508</td>
        <td>27.55</td>
      </tr>
      <tr>
        <td>2</td>
        <td>1292</td>
        <td>14.10</td>
      </tr>
      <tr>
        <td>3</td>
        <td>1143</td>
        <td>13.48</td>
      </tr>
    </tbody>
  </table>
</table-wrap>

<p>The analyses also revealed the distribution of the interlocutor’s
gaze locations. The findings showed a tendency toward more frequent gaze
aversion to the right side, especially to the bottom-right (see Figure 5).</p>

<fig id="fig05" fig-type="figure" position="float">
					<label>Figure 5.</label>
					<caption>
						<p>Participants looked at the interlocutor’s face during
        21.7% of the dwell time. The bottom-left corner, with 29.6%,
        was the most frequently viewed region.</p>
					</caption>
					<graphic id="graph05" xlink:href="jemr-11-06-b-figure-05.png"/>
				</fig>

 <p>Rightward shifts are usually associated with verbal
 thinking, whereas leftward shifts are usually associated with
 visual imagery (<xref ref-type="bibr" rid="b48">48</xref>). On the other hand, more recent
 studies report that the proposed directional patterns do not
 consistently occur when a question elicits verbal or
 visuospatial thinking. Instead, individuals are more likely
 to avert their gaze while listening to a question from the
 partner (see Ehrlichman &#x26; Micic (<xref ref-type="bibr" rid="b49">49</xref>) for a
 review).</p>

 <p>A further investigation of the relation between the mutual gaze
 behavior of the conversation pairs and the speech acts was conducted by a two-way
 ANOVA. The speech-acts had eleven levels (Speech, Speech Pause,
 Thinking, Ask-Question, Greeting, Confirmation, Questionnaire
 Filling, Pre-Speech, Reading Questions, Laugh and Signaling End
 of the Speech) and the mutual gaze behavior had four levels
 (Face Contact, Aversion, Mutual Face Contact, Mutual
 Aversion).</p>

 <p>The analysis with normalized gaze distribution frequency
 revealed a main effect of gaze behavior,
 <italic>F</italic>(3,72)=58.3, <italic>p</italic>&#x3C;.05. The
 Tukey post hoc test was performed to establish the significance
 of differences in frequency scores across gaze behaviors
 and speech-acts. It revealed that the frequency of Gaze Aversion
 (<italic>M</italic>=0.5, <italic>SD</italic>=0.12) was
 significantly larger than the frequency of Face Contact
 (<italic>M</italic>=0.1, <italic>SD</italic>=0.19,
 <italic>p</italic>&#x3C;.05), the frequency of Mutual Face Contact
 (<italic>M</italic>=0.02, <italic>SD</italic>=0.06,
 <italic>p</italic>&#x3C;.05), as well as the frequency of Mutual
 Aversion (<italic>M</italic>=0.38, <italic>SD</italic>=0.15,
 <italic>p</italic>&#x3C;.05). Moreover, the frequency of Mutual
 Aversion was significantly larger than the frequency of Face
 Contact (<italic>p</italic>&#x3C;.05) and the frequency of Mutual
 Face-Contact (<italic>p</italic>&#x3C;.05), while there was no
 significant difference between the frequency of Face Contact and
 the frequency of Mutual Face Contact
 (<italic>p</italic>=0.31).</p>

 <p>Finally, the interaction between speech-acts and gaze
 behavior was investigated. The results indicated that when the
 participants were <italic>thinking</italic>, there was a
 significant difference between the frequency of Mutual
 Aversion (<italic>M</italic>=0.58, <italic>SD</italic>=0.07) and
 the frequency of Face Contact (<italic>M</italic>=0.03,
 <italic>SD</italic>=0.05, <italic>p</italic>&#x3C;.05), as well as
 significant difference between the frequency of Mutual Aversion
 and the frequency of Mutual Face Contact
 (<italic>M</italic>=0.01, <italic>SD</italic>=0.02,
 <italic>p</italic>=.02).</p>

    </sec>
    </sec>

    <sec id="S5">
      <title>An Evaluation of the Contributions of the MAGiC Framework</title>

<p>In this section, we report how the MAGiC framework facilitated gaze
analysis in the reported pilot study. MAGiC reduced the time spent
preparing manually annotated gaze and audio data for each image-frame of
a scene video. To manually identify gaze contact, gaze aversion, and
gaze location, a researcher would annotate 36,000 image-frames for a
10-minute session recorded by a 60 Hz eye tracker. Assuming that it
takes 1 second to manually annotate a frame, the annotation would last
10 hours. MAGiC took approximately 5 to 10 minutes when run on a typical
present-day personal computer (Intel Core i5 2.3 GHz CPU and 8 GB of
RAM). The time spent for the Area of Interest (AOI) and audio annotation
was also reduced. The automated annotation improved the quality of the
annotated data, since it is difficult for human annotators to detect
speech instances at this level of temporal granularity. Because full
annotation remains the holy grail of gaze data analysis in dynamic
scenes, MAGiC also offers the user an interface for manual AOI
annotation. This component of MAGiC is one of the main targets for
improvement in future versions.</p>
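<p>The time estimate above follows from straightforward arithmetic, which can be made explicit as:</p>

```python
# The arithmetic behind the manual-annotation estimate: a 10-minute
# session recorded at 60 Hz, annotated at one second per frame.

session_minutes = 10
sampling_hz = 60
seconds_per_frame = 1

frames = session_minutes * 60 * sampling_hz       # 36,000 image-frames
manual_hours = frames * seconds_per_frame / 3600  # 10.0 hours
print(frames, manual_hours)  # 36000 10.0
```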

<p>MAGiC provides the functionality for visualizing face tracking data
and AOI annotation frame-by-frame. It overlays the detected facial
landmarks, the raw gaze data, and the status of gaze interaction in a
single video recording. It also displays the ratio of non-annotated gaze
data (thus, the success level of face detection) as a percentage of
total data to the user. The absence of raw gaze data or undetected faces
are the major reasons for the failure of automatic AOI annotation. The
user can provide training data to create a custom face detector for better
face detection performance. The MAGiC software is licensed under the GNU
General Public License (GPL). Therefore, the source code of the
application is openly distributed and programmers are encouraged to
study and contribute to its development. In addition to MAGiC, we also
provide the modified component toolkits (OpenFace for face tracking,
dlib for training of a custom face detector, and CMUSphinx for speech
segmentation) in MAGiC’s GitHub repository (MAGiC_v1.0):
<ext-link ext-link-type="uri" xlink:href="https://github.com/ulkursln/MAGiC/releases" xlink:show="new">https://github.com/ulkursln/MAGiC/releases</ext-link></p>
    </sec>

    <sec id="S6">
      <title>Usability Analysis of MAGiC</title>

<p>This section reports a usability analysis of the MAGiC framework. For
the analysis, the AOI Analysis interface and the Speech Analysis
interface were randomly assigned to a total of eight participants. The
participants performed data analysis by using publicly available sources
(see Supplementary material<xref ref-type="fn" rid="fn1">1</xref>). The
usability analysis was conducted in three steps, as described below:</p>

<p specific-use="wrapper">
  <disp-quote>
    <p>(1) Perform the analysis manually,</p>

    <p>(2) Perform the analysis by using MAGiC,</p>

    <p>(3) Assess the usability of MAGiC by using the 7-point-scale ISO
    9241/10 questionnaire.</p>
  </disp-quote>
</p>

<p>The usability test scores are presented in Figure 6.</p>

<fig id="fig06" fig-type="figure" position="float">
					<label>Figure 6.</label>
					<caption>
						<p>All of the usability metrics were scored above average.</p>
					</caption>
					<graphic id="graph06" xlink:href="jemr-11-06-b-figure-06.png"/>
				</fig>


<p>We recorded the time spent to perform data analysis, and then we
compared it to the average duration when the participants performed the
same analysis manually. In the AOI analysis, the mean duration to
annotate a single frame decreased from 29.1 seconds
(<italic>SD</italic>=22.7) for manual annotation to an average of 0.09
seconds (<italic>SD</italic>=0.02) in MAGiC. In the speech analysis, the
mean duration for a single annotation decreased from 44.5 seconds
(<italic>SD</italic>=8.8) for manual annotation to an average of 7.1
seconds (<italic>SD</italic>=1.4) in MAGiC.</p>
    </sec>

    <sec id="S7">
      <title>Discussion and Conclusion</title>

<p>In the present study, we introduced the MAGiC framework. It provides
researchers with an environment for analyzing the gaze behavior of a
pair in conversation. Human-human conversation settings are usually
dynamic scenes, in which the conversation partners exhibit a set of
specific gaze behaviors, such as gaze contact and gaze aversion. MAGiC
detects and tracks the interlocutor’s face automatically in a video
recording. Then it overlays gaze location data to detect gaze contact
and gaze aversion behavior. It also incorporates speech data into the
analysis by providing an interface for the annotation of speech-acts.</p>

<p>MAGiC facilitates the analysis of dynamic eye tracking data by
reducing the annotation effort and the time spent for frame-by-frame
manual analysis of video data. Its capability for automated multimodal
(i.e., gaze and speech-act) analysis makes MAGiC advantageous over
error-prone human annotation. The MAGiC interface allows researchers to
visualize the face tracking process, gaze-behavior status and annotation
efficiency on the same display. It also allows the user to train the
face tracking components by providing labelled images manually.</p>

<p>The environment has been developed as an open source software tool,
which is available for public use and development. MAGiC has been
developed by integrating a set of open source software tools, in
particular <italic>OpenFace</italic> for analyzing facial behavior,
<italic>dlib</italic> for training a custom face detector and
<italic>CMUSphinx</italic> for analyzing recorded speech, and by
extending their capabilities for detecting eye movement behavior and for
annotating speech data simultaneously with gaze data. MAGiC’s user
interface is composed of a rich set of panels, which provide the user an
environment for conducting a guided, step-by-step analysis.</p>

<p>MAGiC is able to process data from a single eye tracker or data in a
dual eye tracking setting. We demonstrated MAGiC’s capabilities in a
pilot study, which was conducted in a dual eye-tracking setting. We
described MAGiC’s data analysis capabilities through the analysis steps
performed on the recorded data. We intentionally employed
a low-frequency eye-tracker, with a relatively low video quality, and a
low-illuminated environment, since these are typical real-environment
challenges that influence face tracking capabilities. Our analysis
revealed that MAGiC is able to exhibit acceptable success ratios in
automatic analyses, with an average Area of Interest (AOI) labelling
(i.e., gaze contact and gaze aversion detection) efficiency of
approximately 80%. Likely improvements in eye tracking recording
frequency, eye tracking data quality, and image resolution of video
recordings have the potential to increase the accuracy of MAGiC’s
outputs to better levels. We also note that MAGiC’s speech analysis
component, namely <italic>CMUSphinx</italic>, provides several
high-quality acoustic models, although there is no pre-built acoustic
model for Turkish. Despite this challenge, MAGiC returned successful
results for the speech analysis, too. The speech-act annotation also
helped us overcome speech segmentation issues by providing
sub-segments for speech.</p>

<p>All the data analyses were completed in approximately two hours for
the three pairs of participants. Our usability analyses revealed that
manual, frame-by-frame video analysis and
speech segmentation take much longer to complete, in addition to being
prone to human annotator errors.</p>

<p>Currently, MAGiC is in its first version. Our future work will include
making improvements in the existing capabilities of MAGiC, as well as
developing new capabilities. For instance, the face-detection ratio may be
increased by employing the recently published OpenFace 2.0. Also, in its
current version, MAGiC sets an AOI-label on the interlocutor’s face
image. We plan to expand this labelling method so that it processes
other objects, such as the objects on a table. This will expand the
domain of use of MAGiC into a broader range of dynamic visual
environments not limited to face-to-face communication. However, this
development would require training a detector for the relevant objects,
which is a challenging issue for generalization of the object
recognition capabilities. Moreover, the face-tracking function of MAGiC
already makes it possible to extract facial expressions, based on the
Facial Action Coding System (FACS). As a further improvement, MAGiC may
automatically summarize facial expressions during the course of a
conversation.</p>

<p>Finally, for speech analysis, MAGiC provides functions for
semi-automatically synchronizing recordings. Further development of
MAGiC may address improving its synchronization capabilities, its
capability to transcribe speech into text and its capability for
training speech-act annotation with pre-defined speech acts and
automating subsequent annotations.</p>
    </sec>

    <sec id="S8"  sec-type="COI-statement">
      <title>Ethics and Conflict of Interest</title>

<p>The author(s) declare(s) that the contents of the article are in
agreement with the ethics described in
<ext-link ext-link-type="uri" xlink:href="http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html" xlink:show="new">http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html</ext-link>
and that there is no conflict of interest regarding the publication of
this paper.</p>
    </sec>
</body>
<back>
<fn-group>
  <fn id="fn1">
    <p>See the MAGiC App Channel on YouTube,
    <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/channel/UC2gvq0OluwpdjVKGSGg-vaQ" xlink:show="new">https://www.youtube.com/channel/UC2gvq0OluwpdjVKGSGg-vaQ</ext-link>,
    and the MAGiC App Wiki Page on GitHub</p>
  </fn>
</fn-group>

<ref-list>
<ref id="b18"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Abele</surname>, <given-names>A.</given-names></name></person-group> (<year>1986</year>). <article-title>Functions of gaze in social interaction: Communication and monitoring.</article-title> <source>Journal of Nonverbal Behavior</source>, <volume>10</volume>(<issue>2</issue>), <fpage>83</fpage>&#8211;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1007/BF01000006</pub-id><issn>0191-5886</issn></mixed-citation></ref>
<ref id="b1"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Archer</surname>, <given-names>D.</given-names></name>, &#x26; <name><surname>Akert</surname>, <given-names>R. M.</given-names></name></person-group> (<year>1977</year>). <article-title>Words and everything else: Verbal and nonverbal cues in social interpretation.</article-title> <source>Journal of Personality and Social Psychology</source>, <volume>35</volume>(<issue>6</issue>), <fpage>443</fpage>&#8211;<lpage>449</lpage>. <pub-id pub-id-type="doi">10.1037/0022-3514.35.6.443</pub-id><issn>0022-3514</issn></mixed-citation></ref>
<ref id="b19"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><name><surname>Argyle</surname>, <given-names>M.</given-names></name>, &#x26; <name><surname>Cook</surname>, <given-names>M.</given-names></name></person-group> (<year>1976</year>). <source>Gaze and mutual gaze. Cam-bridge: Univ</source>. <publisher-name>Press</publisher-name>.</mixed-citation></ref>
<ref id="b26"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><name><surname>Austin</surname>, <given-names>J. L.</given-names></name></person-group> (<year>1962</year>). <source>How to do things with words</source>. <publisher-name>University Press</publisher-name>.</mixed-citation></ref>
<ref id="b9"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Bales</surname>, <given-names>R. F.</given-names></name>, <name><surname>Strodtbeck</surname>, <given-names>F. L.</given-names></name>, <name><surname>Mills</surname>, <given-names>T. M.</given-names></name>, &#x26; <name><surname>Roseborough</surname>, <given-names>M. E.</given-names></name></person-group> (<year>1951</year>). <article-title>Channels of Communication in Small Groups.</article-title> <source>American Sociological Review</source>, <volume>16</volume>(<issue>4</issue>), <fpage>461</fpage>&#8211;<lpage>468</lpage>. <pub-id pub-id-type="doi">10.2307/2088276</pub-id><issn>0003-1224</issn></mixed-citation></ref>
<ref id="b40"><mixed-citation publication-type="unknown" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Baltrusaitis</surname> <given-names>T</given-names></name>, <name><surname>Robinson</surname> <given-names>P</given-names></name>, <name><surname>Morency</surname> <given-names>L-P</given-names></name></person-group>. Constrained Local Neural Fields for Robust Facial Landmark De-tection in the Wild. 2013 IEEE International Confe-rence on Computer Vision Workshops; <year>2013</year>.354&#8211;361.</mixed-citation></ref>
<ref id="b39"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>Baltrusaitis</surname>, <given-names>T.</given-names></name>, <name><surname>Mahmoud</surname>, <given-names>M.</given-names></name>, &#x26; <name><surname>Robinson</surname>, <given-names>P.</given-names></name></person-group> <article-title>Cross-dataset learning and person-specific normalisation for automatic Action Unit detection.</article-title> <source>11th IEEE Interna-tional Conference and Workshops on Automatic Face and Gesture Recognition (FG)</source>. <year>2015</year>;6(1):<fpage>1</fpage>&#8211;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/FG.2015.7284869</pub-id></mixed-citation></ref>
<ref id="b38"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>Baltrusaitis</surname>, <given-names>T.</given-names></name>, <name><surname>Robinson</surname>, <given-names>P.</given-names></name>, &#x26; <name><surname>Morency</surname>, <given-names>L.-P.</given-names></name></person-group> <article-title>OpenFace: An open source facial behavior analysis toolkit.</article-title> <source>2016 IEEE Winter Conference on Applications of Comput-er Vision (WACV)</source>; <year>2016</year>.<fpage>1</fpage>&#8211;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1109/WACV.2016.7477553</pub-id></mixed-citation></ref>
<ref id="b12"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Baron-Cohen</surname>, <given-names>S.</given-names></name>, <name><surname>Wheelwright</surname>, <given-names>S.</given-names></name>, &#x26; <name><surname>Jolliffe</surname>, <given-names>T.</given-names></name></person-group> (<year>1997</year>). <article-title>Is There a &#8220;Language of the Eyes&#8221;? Evidence from Normal Adults, and Adults with Autism or Asperger Syn-drome.</article-title> <source>Visual Cognition</source>, <volume>4</volume>(<issue>3</issue>), <fpage>311</fpage>&#8211;<lpage>331</lpage>. <pub-id pub-id-type="doi">10.1080/713756761</pub-id><issn>1350-6285</issn></mixed-citation></ref>
<ref id="b44"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Barras</surname>, <given-names>C.</given-names></name>, <name><surname>Zhu</surname>, <given-names>X.</given-names></name>, <name><surname>Meignier</surname>, <given-names>S.</given-names></name>, &#x26; <name><surname>Gauvain</surname>, <given-names>J.-L.</given-names></name></person-group> (<year>2006</year>). <article-title>Multistage speaker diarization of broadcast news.</article-title> <source>IEEE Transactions on Audio, Speech, and Language Processing</source>, <volume>14</volume>(<issue>5</issue>), <fpage>1505</fpage>&#8211;<lpage>1512</lpage>. <pub-id pub-id-type="doi">10.1109/TASL.2006.878261</pub-id><issn>1558-7916</issn></mixed-citation></ref>
<ref id="b31"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Brone</surname>, <given-names>G.</given-names></name>, <name><surname>Oben</surname>, <given-names>B.</given-names></name>, &#x26; <name><surname>Goedeme</surname>, <given-names>T.</given-names></name></person-group> <article-title>Towards a more effec-tive method for analyzing mobile eye-tracking data.</article-title> Proceedings of the 1st international workshop on Per-vasive eye tracking &#x26; mobile eye-based interaction - PETMEI 11. ACM; <year>2011</year>.53-56.</mixed-citation></ref>
<ref id="b5"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Cassell</surname>, <given-names>J.</given-names></name>, <name><surname>Bickmore</surname>, <given-names>T.</given-names></name>, <name><surname>Billinghurst</surname>, <given-names>M.</given-names></name>, <name><surname>Campbell</surname>, <given-names>L.</given-names></name>, <name><surname>Chang</surname>, <given-names>K.</given-names></name>, <name><surname>Vilhjalmsson</surname>, <given-names>H.</given-names></name>, <etal>. . .</etal></person-group>. <article-title>Embodiment in con-versational interfaces.</article-title> Proceedings of the SIGCHI conference on Human factors in computing systems the CHI is the limit - CHI 99. ACM;<year>1999</year>.520&#8211;527</mixed-citation></ref>
<ref id="b45"><mixed-citation publication-type="conference" specific-use="parsed"><person-group person-group-type="author"><name><surname>Chen</surname>, <given-names>S.</given-names></name>, &#x26; <name><surname>Gopalakrishnan</surname>, <given-names>P.</given-names></name></person-group> <article-title>Speaker, environment and channel change detection and clustering via the Bayesian information criterion.</article-title> In <source>DARPA Broadcast News Transcription and Understanding Workshop</source>; <year>1998</year>; <conf-loc>Landsdowne, VA, USA</conf-loc>.</mixed-citation></ref>
<ref id="b32"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>De Beugher</surname>, <given-names>S.</given-names></name>, <name><surname>Brone</surname>, <given-names>G.</given-names></name>, &#x26; <name><surname>Goedeme</surname>, <given-names>T.</given-names></name></person-group> Automatic Analysis of In-the-Wild Mobile Eye-tracking Experi-ments using Object, Face and Person Detection. Pro-ceedings of the 9th International Conference on Computer Vision Theory and Applications. <year>2014</year>.IEEE;625-633.</mixed-citation></ref>
<ref id="b3"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Duncan</surname>, <given-names>S.</given-names></name></person-group> (<year>1972</year>). <article-title>Some signals and rules for taking speaking turns in conversations.</article-title> <source>Journal of Personality and Social Psychology</source>, <volume>23</volume>(<issue>2</issue>), <fpage>283</fpage>&#8211;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1037/h0033031</pub-id><issn>0022-3514</issn></mixed-citation></ref>
<ref id="b49"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Ehrlichman</surname>, <given-names>H.</given-names></name>, &#x26; <name><surname>Micic</surname>, <given-names>D.</given-names></name></person-group> (<year>2012</year>). <article-title>Why Do People Move Their Eyes When They Think?</article-title> <source>Current Directions in Psychological Science</source>, <volume>21</volume>(<issue>2</issue>), <fpage>96</fpage>&#8211;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1177/0963721412436810</pub-id><issn>0963-7214</issn></mixed-citation></ref>
<ref id="b21"><mixed-citation publication-type="book-chapter" specific-use="restruct"><person-group person-group-type="author"><name><surname>Elman</surname>, <given-names>J. L.</given-names></name></person-group> (<year>1995</year>). <chapter-title>Language as a dynamical system</chapter-title>. In <person-group person-group-type="editor"><name><given-names>R. F.</given-names> <surname>Port</surname></name> &#x26; <name><given-names>T.</given-names> <surname>van Gelder</surname></name> (<role>Eds.</role>),</person-group> <source>Mind as motion: Explo-rations in the dynamics of cognition</source> (pp. <fpage>195</fpage>&#8211;<lpage>223</lpage>). <publisher-name>MIT Press</publisher-name>.</mixed-citation></ref>
<ref id="b13"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Fasola</surname>, <given-names>J.</given-names></name>, &#x26; <name><surname>Mataric</surname>, <given-names>M. J.</given-names></name></person-group> (<year>2012</year>). <article-title>Using Socially Assistive Hu-man&#8211;Robot Interaction to Motivate Physical Exercise for Older Adults.</article-title> <source>Proceedings of the IEEE</source>, <volume>100</volume>(<issue>8</issue>), <fpage>2512</fpage>&#8211;<lpage>2526</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2012.2200539</pub-id><issn>0018-9219</issn></mixed-citation></ref>
<ref id="b22"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Ford</surname>, <given-names>M.</given-names></name>, &#x26; <name><surname>Holmes</surname>, <given-names>V. M.</given-names></name></person-group> (<year>1978</year>). <article-title>Planning units and syntax in sentence production.</article-title> <source>Cognition</source>, <volume>6</volume>(<issue>1</issue>), <fpage>35</fpage>&#8211;<lpage>53</lpage>. <pub-id pub-id-type="doi">10.1016/0010-0277(78)90008-2</pub-id><issn>0010-0277</issn></mixed-citation></ref>
<ref id="b36"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><collab>Goldman-EislerF</collab></person-group>. (<year>1968</year>). <source>Psycholinguistics: Experiments in spontaneous speech</source>. <publisher-name>Academic Press</publisher-name>.</mixed-citation></ref>
<ref id="b10"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><name><surname>Goodwin</surname>, <given-names>C.</given-names></name></person-group> (<year>1981</year>). <source>Conversational organization: interaction between speakers and hearers</source>. <publisher-name>Aca-demic Press</publisher-name>.</mixed-citation></ref>
<ref id="b28"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Grosjean</surname>, <given-names>F.</given-names></name>, &#x26; <name><surname>Lane</surname>, <given-names>H.</given-names></name></person-group> (<year>1976</year>). <article-title>How the listener integrates the components of speaking rate.</article-title> <source>Journal of Experimental Psychology. Human Perception and Performance</source>, <volume>2</volume>(<issue>4</issue>), <fpage>538</fpage>&#8211;<lpage>543</lpage>. <pub-id pub-id-type="doi">10.1037/0096-1523.2.4.538</pub-id><pub-id pub-id-type="pmid">1011003</pub-id><issn>0096-1523</issn></mixed-citation></ref>
<ref id="b6"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Grosz</surname>, <given-names>B.</given-names></name>, &#x26; <name><surname>Sidner</surname>, <given-names>C.</given-names></name></person-group> (<year>1986</year>). <article-title>Attention, intentions, and the structure of discourse.</article-title> <source>Computational Linguistics</source>, <volume>12</volume>(<issue>3</issue>), <fpage>175</fpage>&#8211;<lpage>204</lpage>.<issn>0891-2017</issn></mixed-citation></ref>
<ref id="b37"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Hieke</surname>, <given-names>A. E.</given-names></name>, <name><surname>Kowal</surname>, <given-names>S.</given-names></name>, &#x26; <name><surname>Oconnell</surname>, <given-names>D. C.</given-names></name></person-group> (<year>1983</year>). <article-title>The Trouble with &#8220;Articulatory&#8221; Pauses.</article-title> <source>Language and Speech</source>, <volume>26</volume>(<issue>3</issue>), <fpage>203</fpage>&#8211;<lpage>215</lpage>. <pub-id pub-id-type="doi">10.1177/002383098302600302</pub-id><issn>0023-8309</issn></mixed-citation></ref>
<ref id="b15"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Hietanen</surname>, <given-names>J. K.</given-names></name>, <name><surname>Lepp&#228;nen</surname>, <given-names>J. M.</given-names></name>, <name><surname>Peltola</surname>, <given-names>M. J.</given-names></name>, <name><surname>Linna-Aho</surname>, <given-names>K.</given-names></name>, &#x26; <name><surname>Ruuhiala</surname>, <given-names>H. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Seeing direct and averted gaze activates the approach-avoidance motivational brain systems.</article-title> <source>Neuropsychologia</source>, <volume>46</volume>(<issue>9</issue>), <fpage>2423</fpage>&#8211;<lpage>2430</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuropsychologia.2008.02.029</pub-id><pub-id pub-id-type="pmid">18402988</pub-id><issn>0028-3932</issn></mixed-citation></ref>
<ref id="b29"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Hird</surname>, <given-names>K.</given-names></name>, <name><surname>Brown</surname>, <given-names>R.</given-names></name>, &#x26; <name><surname>Kirsner</surname>, <given-names>K.</given-names></name></person-group> (<year>2006</year>). <article-title>Stability of lexical defi-cits in primary progressive aphasia: Evidence from natural language.</article-title> <source>Brain and Language</source>, <volume>99</volume>(<issue>1-2</issue>), <fpage>137</fpage>&#8211;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandl.2006.06.083</pub-id><issn>0093-934X</issn></mixed-citation></ref>
<ref id="b30"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="editor"><name><surname>Holmqvist</surname>, <given-names>K.</given-names></name>, <name><surname>Nystr&#246;m</surname>, <given-names>N.</given-names></name>, <name><surname>Andersson</surname>, <given-names>R.</given-names></name>, <name><surname>Dewhurst</surname>, <given-names>R.</given-names></name>, <name><surname>Jarodzka</surname>, <given-names>H.</given-names></name>, &#x26; <name><surname>Van de Weijer</surname>, <given-names>J.</given-names></name> (<role>Eds.</role>)</person-group>. (<year>2011</year>). <source>Eye tracking: a comprehensive guide to methods and measures. Ox-ford</source>. <publisher-name>Oxford University Press</publisher-name>.</mixed-citation></ref>
<ref id="b20"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Kendon</surname>, <given-names>A.</given-names></name></person-group> (<year>1967</year>). <article-title>Some functions of gaze-direction in social interaction.</article-title> <source>Acta Psychologica</source>, <volume>26</volume>(<issue>1</issue>), <fpage>22</fpage>&#8211;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1016/0001-6918(67)90005-4</pub-id><pub-id pub-id-type="pmid">6043092</pub-id><issn>0001-6918</issn></mixed-citation></ref>
<ref id="b41"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>King</surname>, <given-names>D. E.</given-names></name></person-group> (<year>2009</year>). <article-title>Dlib-ml: A Machine Learning Toolkit.</article-title> <source>Journal of Machine Learning Research</source>, <volume>10</volume>(<issue>2</issue>), <fpage>1755</fpage>&#8211;<lpage>1758</lpage>.<issn>1532-4435</issn></mixed-citation></ref>
<ref id="b42"><mixed-citation publication-type="preprint" specific-use="unparsed"><person-group person-group-type="author"><name><surname>King</surname>, <given-names>D. E.</given-names></name></person-group> <article-title>Max-Margin Object Detection</article-title> [<comment>Internet</comment>]. <year>2015</year> [cited 2018Oct28]. Available from: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1502.00046">https://arxiv.org/abs/1502.00046</ext-link></mixed-citation></ref>
<ref id="b23"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Kirsner</surname>, <given-names>K.</given-names></name>, <name><surname>Dunn</surname>, <given-names>J.</given-names></name>, &#x26; <name><surname>Hird</surname>, <given-names>K.</given-names></name></person-group> <article-title>Language productions: A complex dynamic system with a chronometric foot-print.</article-title> Paper presented at: The International Confe-rence on Computational Science;<year>2005</year> <month>May</month>; Atlanta, GA.</mixed-citation></ref>
<ref id="b14"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Kleinke</surname>, <given-names>C. L.</given-names></name></person-group> (<year>1986</year>). <article-title>Gaze and eye contact: A research review.</article-title> <source>Psychological Bulletin</source>, <volume>100</volume>(<issue>1</issue>), <fpage>78</fpage>&#8211;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1037/0033-2909.100.1.78</pub-id><pub-id pub-id-type="pmid">3526377</pub-id><issn>0033-2909</issn></mixed-citation></ref>
<ref id="b48"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Kocel</surname>, <given-names>K.</given-names></name>, <name><surname>Galin</surname>, <given-names>D.</given-names></name>, <name><surname>Ornstein</surname>, <given-names>R.</given-names></name>, &#x26; <name><surname>Merrin</surname>, <given-names>E. L.</given-names></name></person-group> (<year>1972</year>). <article-title>Lateral eye movement and cognitive mode.</article-title> <source>Psychonomic Science</source>, <volume>27</volume>(<issue>4</issue>), <fpage>223</fpage>&#8211;<lpage>224</lpage>. <pub-id pub-id-type="doi">10.3758/BF03328944</pub-id><issn>0033-3131</issn></mixed-citation></ref>
<ref id="b24"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Krivokapi</surname>, <given-names>J.</given-names></name></person-group> (<year>2007</year>). <article-title>Prosodic planning: Effects of phrasal length and complexity on pause duration.</article-title> <source>Journal of Phonetics</source>, <volume>35</volume>(<issue>2</issue>), <fpage>162</fpage>&#8211;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1016/j.wocn.2006.04.001</pub-id><pub-id pub-id-type="pmid">18379639</pub-id><issn>0095-4470</issn></mixed-citation></ref>
<ref id="b43"><mixed-citation publication-type="conference" specific-use="parsed"><person-group person-group-type="author"><name><surname>Lamere</surname>, <given-names>P.</given-names></name>, <name><surname>Kwok</surname>, <given-names>P.</given-names></name>, <name><surname>Gouvea</surname>, <given-names>E.</given-names></name>, <name><surname>Raj</surname>, <given-names>B.</given-names></name>, <name><surname>Singh</surname>, <given-names>R.</given-names></name>, <name><surname>Walker</surname>, <given-names>W.</given-names></name>, <name><surname>Warmuth</surname>, <given-names>M.</given-names></name>, &#x26; <name><surname>Wolf</surname>, <given-names>P.</given-names></name></person-group> <article-title>The CMU SPHINX-4 speech recognition system.</article-title> In <source>Proceedings of the IEEE Intl. Conf. on Acoustics, Speech and Signal Processing</source>; <year>2003</year>; <conf-loc>Hong Kong</conf-loc>.</mixed-citation></ref>
<ref id="b16"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Mason</surname>, <given-names>M. F.</given-names></name>, <name><surname>Tatkow</surname>, <given-names>E. P.</given-names></name>, &#x26; <name><surname>Macrae</surname>, <given-names>C. N.</given-names></name></person-group> (<year>2005</year>). <article-title>The look of love: Gaze shifts and person perception.</article-title> <source>Psychological Science</source>, <volume>16</volume>(<issue>3</issue>), <fpage>236</fpage>&#8211;<lpage>239</lpage>. <pub-id pub-id-type="doi">10.1111/j.0956-7976.2005.00809.x</pub-id><pub-id pub-id-type="pmid">15733205</pub-id><issn>0956-7976</issn></mixed-citation></ref>
<ref id="b2"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Mehrabian</surname>, <given-names>A.</given-names></name>, &#x26; <name><surname>Wiener</surname>, <given-names>M.</given-names></name></person-group> (<year>1967</year>). <article-title>Decoding of inconsistent communications.</article-title> <source>Journal of Personality and Social Psychology</source>, <volume>6</volume>(<issue>1</issue>), <fpage>109</fpage>&#8211;<lpage>114</lpage>. <pub-id pub-id-type="doi">10.1037/h0024532</pub-id><pub-id pub-id-type="pmid">6032751</pub-id><issn>0022-3514</issn></mixed-citation></ref>
<ref id="b46"><mixed-citation publication-type="conference" specific-use="parsed"><person-group person-group-type="author"><name><surname>Meignier</surname>, <given-names>S.</given-names></name>, &#x26; <name><surname>Merlin</surname>, <given-names>T.</given-names></name></person-group> <article-title>LIUM SpkDiarization: An Open Source Toolkit ForDiarization.</article-title> In <source>Proceedings of the CMU SPUD Workshop</source>; <year>2010</year><month>March</month>; <conf-loc>Dallas, Texas</conf-loc>; <fpage>1</fpage>-<lpage>6</lpage> p.</mixed-citation></ref>
<ref id="b34"><mixed-citation publication-type="conference" specific-use="linked"><person-group person-group-type="author"><name><surname>Munn</surname>, <given-names>S. M.</given-names></name>, <name><surname>Stefano</surname>, <given-names>L.</given-names></name>, &#x26; <name><surname>Pelz</surname>, <given-names>J. B.</given-names></name></person-group> <article-title>Fixation-identification in dynamic scenes.</article-title> Proceedings of the 5th symposium on Applied perception in graphics and visualization - APGV 08; <year>2008</year>; ACM; 33-42 p. <pub-id pub-id-type="doi">10.1145/1394281.1394287</pub-id></mixed-citation></ref>
<ref id="b17"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Pfeiffer</surname>, <given-names>U. J.</given-names></name>, <name><surname>Timmermans</surname>, <given-names>B.</given-names></name>, <name><surname>Bente</surname>, <given-names>G.</given-names></name>, <name><surname>Vogeley</surname>, <given-names>K.</given-names></name>, &#x26; <name><surname>Schilbach</surname>, <given-names>L.</given-names></name></person-group> (<year>2011</year>). <article-title>A non-verbal Turing test: Differentiating mind from machine in gaze-based social interaction.</article-title> <source>PLoS One</source>, <volume>6</volume>(<issue>11</issue>), <fpage>e27591</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0027591</pub-id><pub-id pub-id-type="pmid">22096599</pub-id><issn>1932-6203</issn></mixed-citation></ref>
<ref id="b25"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Power</surname>, <given-names>M. J.</given-names></name></person-group> (<year>1985</year>). <article-title>Sentence Production and Working Memo-ry.</article-title> <source>The Quarterly Journal of Experimental Psychology Section A.</source>, <volume>37</volume>(<issue>3</issue>), <fpage>367</fpage>&#8211;<lpage>385</lpage>. <pub-id pub-id-type="doi">10.1080/14640748508400940</pub-id></mixed-citation></ref>
<ref id="b7"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Quek</surname>, <given-names>F.</given-names></name>, <name><surname>Mcneill</surname>, <given-names>D.</given-names></name>, <name><surname>Bryll</surname>, <given-names>R.</given-names></name>, <name><surname>Kirbas</surname>, <given-names>C.</given-names></name>, <name><surname>Arslan</surname>, <given-names>H.</given-names></name>, <name><surname>Mccullough</surname>, <given-names>K.</given-names></name>, <name><surname>Furuyama</surname>, <given-names>N.</given-names></name>, &#x26; <name><surname>Ansari</surname>, <given-names>R.</given-names></name></person-group> (<year>2000</year>). <article-title>Gesture, speech, and gaze cues for discourse segmentation.</article-title> <source>Proceedings IEEE Confe-rence on Computer Vision and Pattern Recognition CVPR</source>, <volume>2</volume>, <fpage>247</fpage>&#8211;<lpage>254</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2000.854800</pub-id></mixed-citation></ref>
<ref id="b8"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Quek</surname>, <given-names>F.</given-names></name>, <name><surname>Mcneill</surname>, <given-names>D.</given-names></name>, <name><surname>Bryll</surname>, <given-names>R.</given-names></name>, <name><surname>Duncan</surname>, <given-names>S.</given-names></name>, <name><surname>Ma</surname>, <given-names>X.-F.</given-names></name>, <name><surname>Kirbas</surname>, <given-names>C.</given-names></name>, <name><surname>McCullough</surname>, <given-names>K. E.</given-names></name>, &#x26; <name><surname>Ansari</surname>, <given-names>R.</given-names></name></person-group> (<year>2002</year>, <month>January</month>). <article-title>Multimodal human discourse: Gesture and speech.</article-title> <source>ACM Transactions on Computer-Human Interaction</source>, <volume>9</volume>(<issue>3</issue>), <fpage>171</fpage>&#8211;<lpage>193</lpage>. <pub-id pub-id-type="doi">10.1145/568513.568514</pub-id><issn>1073-0516</issn></mixed-citation></ref>
<ref id="b4"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Sacks</surname>, <given-names>H.</given-names></name>, <name><surname>Schegloff</surname>, <given-names>E. A.</given-names></name>, &#x26; <name><surname>Jefferson</surname>, <given-names>G.</given-names></name></person-group> (<year>1974</year>). <article-title>A Simplest Sys-tematics for the Organization of Turn-Taking for Conversation.</article-title> <source>Language</source>, <volume>50</volume>(<issue>4</issue>), <fpage>696</fpage>&#8211;<lpage>735</lpage>. <pub-id pub-id-type="doi">10.1353/lan.1974.0010</pub-id><issn>0097-8507</issn></mixed-citation></ref>
<ref id="b11"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Schegloff</surname>, <given-names>E. A.</given-names></name></person-group> (<year>1968</year>). <article-title>Sequencing in Conversational Open-ings.</article-title> <source>American Anthropologist</source>, <volume>70</volume>(<issue>6</issue>), <fpage>1075</fpage>&#8211;<lpage>1095</lpage>. <pub-id pub-id-type="doi">10.1525/aa.1968.70.6.02a00030</pub-id><issn>0002-7294</issn></mixed-citation></ref>
<ref id="b27"><mixed-citation publication-type="book" specific-use="restruct"><person-group person-group-type="author"><name><surname>Searle</surname>, <given-names>J. R.</given-names></name></person-group> (<year>1969</year>). <source>Speech Acts: An Essay in the Philosophy of Language</source>. <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9781139173438</pub-id></mixed-citation></ref>
<ref id="b35"><mixed-citation publication-type="conference" specific-use="unparsed"><person-group person-group-type="author"><name><surname>Stuart</surname>, <given-names>S.</given-names></name>, <name><surname>Galna</surname>, <given-names>B.</given-names></name>, <name><surname>Lord</surname>, <given-names>S.</given-names></name>, <name><surname>Rochester</surname>, <given-names>L.</given-names></name>, &#x26; <name><surname>Godfrey</surname>, <given-names>A.</given-names></name></person-group> Quantifying saccades while walking: Validity of a novel velocity-based algorithm for mobile eye track-ing. 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. <year>2014</year>; 5739-5742</mixed-citation></ref>
<ref id="b33"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Stuart</surname>, <given-names>S.</given-names></name>, <name><surname>Hunt</surname>, <given-names>D.</given-names></name>, <name><surname>Nell</surname>, <given-names>J.</given-names></name>, <name><surname>Godfrey</surname>, <given-names>A.</given-names></name>, <name><surname>Hausdorff</surname>, <given-names>J. M.</given-names></name>, <name><surname>Rochester</surname>, <given-names>L.</given-names></name>, &#x26; <name><surname>Alcock</surname>, <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Do you see what I see? Mobile eye-tracker contextual analysis and inter-rater reliability.</article-title> <source>Medical &#x26; Biological Engineering &#x26; Computing</source>, <volume>56</volume>(<issue>2</issue>), <fpage>289</fpage>&#8211;<lpage>296</lpage>. <pub-id pub-id-type="doi">10.1007/s11517-017-1669-z</pub-id><pub-id pub-id-type="pmid">28712014</pub-id><issn>0140-0118</issn></mixed-citation></ref>
<ref id="b47"><mixed-citation publication-type="journal" specific-use="restruct"><person-group person-group-type="author"><name><surname>Villani</surname>, <given-names>D.</given-names></name>, <name><surname>Repetto</surname>, <given-names>C.</given-names></name>, <name><surname>Cipresso</surname>, <given-names>P.</given-names></name>, &#x26; <name><surname>Riva</surname>, <given-names>G.</given-names></name></person-group> (<year>2012</year>). <article-title>May I ex-perience more presence in doing the same thing in vir-tual reality than in reality? An answer from a simu-lated job interview.</article-title> <source>Interacting with Computers</source>, <volume>24</volume>(<issue>4</issue>), <fpage>265</fpage>&#8211;<lpage>272</lpage>. <pub-id pub-id-type="doi">10.1016/j.intcom.2012.04.008</pub-id><issn>0953-5438</issn></mixed-citation></ref>
</ref-list>
</back>
</article>
