Detecting expert’s eye using a multiple-kernel Relevance Vector Machine

Decoding mental states from the pattern of neural activity or overt behavior is an intensely pursued goal. Here we applied machine learning to detect expertise from the oculomotor behavior of novice and expert billiard players during free viewing of a filmed billiard match with no specific task, and in a dynamic trajectory prediction task involving ad-hoc, occluded billiard shots. We adopted a ground framework for feature space fusion and a Bayesian sparse classifier, namely a Relevance Vector Machine. By testing different combinations of simple oculomotor features (gaze shift amplitude and direction, and fixation duration), we could classify on an individual basis which group (novice or expert) the observers belonged to, with an accuracy of 82% for the match and 87% for the shots. These results provide evidence that, at least in the particular domain of billiard sport, a signature of expertise is hidden in very basic aspects of oculomotor behavior, and that expertise can be detected at the individual level both with ad-hoc testing conditions and under naturalistic conditions, given suitable data mining. Our procedure paves the way for the development of a test for the “expert’s eye”, and promotes the use of eye movements as an additional signal source in Brain-Computer-Interface (BCI) systems.

We have recently provided evidence that the eye movements of novice and expert billiard players differ when they have to predict the outcome of partially-occluded single shots (Crespi, Robino, Silva, & de'Sperati, 2012). Specifically, in order to solve the visual prediction task, novices tended to adopt a strategy based on mental extrapolation of the ball trajectory, whereas experts monitored certain diagnostic points along the trajectory. By exploiting the eye movement differences between novices and experts, we could also identify the temporal boundaries of the single billiard shots contained in a videoclip, thus in fact realizing a sort of physiologically-based video parser (Robino, Crespi, Silva, & de'Sperati, 2012).
In the present study we extend our previous work and ask whether the differences in eye movements of novices and experts are robust enough to detect expertise i) at the individual level, and ii) under not only ad-hoc, controlled conditions but also naturalistic, unconstrained conditions, i.e., during free viewing of a billiard match without a specific task. Also, iii) we aim to detect the "expert's eye" by analyzing the data regardless of the visual stimulus, that is, relying only on the oculomotor behavior. Meeting these three conditions would be an important step towards automatic expertise detection.
Quantifying reliably and uniquely a complex behavior such as a sequence of exploratory eye movements (the so-called scanpath) is a non-trivial challenge. The existing methods can be classified into two broad classes, both pioneered by Larry Stark (see Hacisalihzade, Stark, & Allen, 1992, for a combined use of both). The first approach aims at characterizing the spatial distribution of fixations on the scene (spatio-temporal, in case of dynamic scenes) and at providing some similarity metrics (Brandt & Stark, 1997). Methods following this approach can be further distinguished as content-driven or data-driven (Grindinger et al., 2011).
The content-driven approach largely relies upon Regions Of Interest (ROIs), identified a priori in the stimulus and analyzed in terms of fixations falling inside them. The data-driven approach, in contrast, directly exploits scanpaths, or features extracted from them, independent of whatever was presented as the stimulus. An important advantage of the latter approach is that it obviates the need for arbitrary ROI definition.
The similarity of two scanpaths can in principle be measured by using ROI-based methods followed by coding of the sequence in which ROIs are visually inspected. A common method is the string edit, in which a string is defined by assigning each ROI a discrete symbol (e.g., a character), so that each scanpath is transformed into a string of symbols. Then the editing cost of transforming one string into another is computed, e.g., as the Levenshtein distance (Brandt & Stark, 1997; Choi, Mosley, & Stark, 1995; Hacisalihzade et al., 1992; Foulsham & Underwood, 2008). Other methods are also used, such as the Needleman-Wunsch algorithm borrowed from bioinformatics (Cristino, Mathôt, Theeuwes, & Gilchrist, 2010). However, ROI-based methods suffer from well-known limitations, mostly related to how to cluster and regionalize fixations (Hacisalihzade et al., 1992; Privitera & Stark, 2000). For instance, many methods rely upon dividing the image into a regular grid, but this way of operating loses any reference to the content of the image and introduces quantization errors; in this limit case string edit techniques turn into a data-driven approach, while exploiting an oversimplified representation of the data. Semantic ROIs could be used instead (Privitera & Stark, 2000; Josephson & Holmes, 2006), but these have by definition different sizes, and therefore the approximation of fixation position can be very coarse and subtle differences in oculomotor behavior cancelled. In the last few years, heatmaps have become a very popular data-driven tool: heatmaps are plots in which a given oculomotor quantity (typically, the fixation dwell-time) is coded as colored, semi-transparent "bubbles" superimposed on the two-dimensional image. This graphical representation is very appealing, but it is mostly used to convey an immediate, qualitative impression of the attended regions within a figure (see, however, Caldara & Miellet, 2011; Crespi et al., 2012). Other methods have also been proposed, based on the construction of an average scanpath (Hembrooke, Feusner, & Gay, 2006), or that minimize an energy function (Dempere-Marco, Hu, Ellis, Hansell, & Yang, 2006), or that end up with a multidimensional vector rather than a single scalar quantity (Jarodzka, Holmqvist, & Nyström, 2010). A main concern of these approaches is to quantify the similarity between scanpaths, which is a crucial issue in certain applications where an average observer is needed (Boccignone et al., 2008).
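To make the string-edit idea concrete, the following minimal Python sketch computes the Levenshtein distance between two scanpaths coded as ROI strings. It is only an illustration of the general technique reviewed above, not part of the present analysis; the function name `levenshtein` and the two example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical scanpaths: each letter is the ROI visited by one fixation.
scanpath_1 = "ABCAD"
scanpath_2 = "ABDAD"
distance = levenshtein(scanpath_1, scanpath_2)
```

The distance depends entirely on how the ROIs were defined, which is the limitation discussed above for grid-based and semantic ROIs.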
The second approach, again pioneered by Stark, takes straightforwardly into account the very stochastic nature of scanpaths. Indeed, gaze-shift processes, and especially saccadic eye movements, exhibit noisy, idiosyncratic variation of visual exploration by different observers viewing the same scene, or even by the same subject along different trials; this is a well-known issue debated since the early eye tracking studies by Ellis and Stark (1986), who modeled sequences using Markov transition probability matrices identified from experimental sequences (see Hayes, Petrov, & Sederberg, 2011 for a detailed discussion on methods aiming at capturing statistical regularities in temporally extended eye movement sequences). Here we follow this second approach or, more precisely, the very rationale behind such an approach: namely, we consider the gaze shift behavior as a realization of a stochastic process (Feng, 2006; Brockmann & Geisel, 2000; Boccignone & Ferraro, 2014, 2013b, 2013a). In other terms, the distribution functions and the temporal dynamics of eye movements are specified by the stochastic process. In this perspective the visual exploratory features we can measure (saccade amplitude and direction, fixation duration) can be thought of as random variables generated by such a process, however complex it may be (Tatler & Vincent, 2008, 2009).
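As a small illustration of the Markov description mentioned above, the sketch below estimates a first-order transition probability matrix from a single sequence of visited regions. It is a generic example and not a reconstruction of the procedure of Ellis and Stark (1986); the function name `transition_matrix` and the toy sequence are assumptions.

```python
from collections import Counter


def transition_matrix(sequence, states):
    """Row-normalised first-order transition probabilities estimated from one sequence."""
    counts = Counter(zip(sequence, sequence[1:]))  # counts of consecutive pairs
    matrix = {}
    for s in states:
        row_total = sum(counts[(s, t)] for t in states)
        matrix[s] = {t: (counts[(s, t)] / row_total if row_total else 0.0)
                     for t in states}
    return matrix


# Toy sequence over two regions, A and B.
P = transition_matrix("AABABBA", "AB")
```

Each row of `P` sums to one whenever the corresponding state has at least one outgoing transition in the sequence.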
In order to discriminate between the different oculomotor behaviors exhibited by novices and experts, there are two options: to provide a model for the generating process, or to exploit the generated oculomotor pattern. As for the first option, investigating expertise differences in dynamic tasks, such as a billiard match, is a complex modeling issue, and involves aspects far beyond the limits of current computational models (Borji & Itti, 2013). The second option, i.e., analyzing the generated oculomotor pattern, relies upon the rationale that the key requirements of expertise are discriminability and consistency across different stimuli (Shanteau, Weiss, Thomas, & Pounds, 2002), properties that should be reflected in the generated pattern.
Specifically, in this study we have tried to deal with two problems. First, machine learning approaches as usually applied to the analysis of eye movements tend to overlook the feature representation problem. In order to spot behavioral characteristics (expertise or cognitive impairments) in a data-driven way, a scanpath can be analyzed by using several features (e.g., Lagun et al., 2011). Each feature, in turn, might be differently related to a number of factors, from low-level biomechanics to learnt knowledge of the structure of the world and the distribution of objects of interest (Tatler & Vincent, 2009). Thus, within a machine learning perspective, we are dealing with features from different sources, where there may be limited or no a priori knowledge of their significance and contribution to the classification task. Clearly, concatenating all the features into a single feature space does not guarantee optimal performance, and it exposes the classifier to the "curse of dimensionality" problem.
Second, though the Support Vector Machine (SVM) methodology has proven to be a powerful one, it has a number of well-known limitations (Tipping, 2001; Murphy, 2012). Although relatively sparse, SVMs make unnecessarily liberal use of basis functions, since the number of support vectors required typically grows linearly with the size of the training set; predictions are not probabilistic, which is particularly crucial in classification, where posterior probabilities of class membership are necessary to adapt to varying class priors and asymmetric misclassification costs; the kernel function must satisfy Mercer's condition, namely, it must be the continuous symmetric kernel of a positive integral operator.
In order to cope with these problems, we have exploited a ground framework for feature space fusion followed by a Bayesian sparse classification technique (Tipping, 2001) with the ability of achieving sparse solutions that utilize only a subset of the basis functions. In particular, we have considered the basic oculomotor parameters of saccade amplitude, direction, and fixation duration as different information sources that are combined at the composite kernel space level and classified through a Relevance Vector Machine (RVM), namely a multiple-kernel RVM (mRVM; Psorakis, Damoulas, & Girolami, 2010; Damoulas & Girolami, 2009a). See Appendix A for a detailed discussion of the RVM approach and its main differences with respect to SVMs.
To the best of our knowledge this approach has never been used with eye movement data.

Materials and Methods
The present analyses were performed on raw data acquired in the course of previous experiments (Crespi et al., 2012; Robino et al., 2012). The reader is referred to those works for details concerning stimuli and data acquisition.

Participants
Forty-two healthy participants volunteered for the experiment (all men but one, with normal or corrected-to-normal vision, aged between 27 and 70 years, naïve as to the purpose of the experiment). Half of them were elite billiard players, recruited on the basis of their national ranking, whereas the other half had no or occasional experience in billiard playing. The study was conducted in accordance with the recommendations of the Declaration of Helsinki and the local Ethical Committee. Before starting the experiments, all participants signed the informed consent.

Stimuli and procedure
The stimuli were movies of a billiard match or of individual shots, recorded from the top of the billiard table. The stimuli were subsequently presented on a computer screen. Whereas the former stimulus represented a real match without any experimental constraint, the shots were prepared by asking a professional player to execute a number of ad-hoc shots.
Match. This stimulus typology consisted of a piece of a billiard match (M), in which two professional players (the opponents) alternated in launching with the stick the cue ball (own ball) towards the target ball (opponent's ball) in such a way that the latter, but not the former, would knock down as many skittles as possible (there were 5 skittles in the central region of the table) and/or touch a third ball (a small red ball). The movie lasted 5 minutes and contained 11 shots, alternating naturally between the two opponents. The shots differed in complexity, orientation, number of cushions, duration, ball velocity, and spin. The billiard match was always presented first.
Shots. The other stimulus typology consisted of 24 different shots with no spin, ultimately directed towards the central skittle. The shots were either short (2 cushions, SS) or long (5 cushions, LS). The initial direction of the shot (immediately after the contact with the stick) was either towards the right or the left, or towards the upper or the lower side of the table, in a balanced design. There were three versions of the shots: in one version the central skittle was knocked down, whereas in the other two versions the ball passed just beside the skittle, to the right or to the left. In each shot, the final part of the trajectory was occluded 200 ms after the ball bounced on the second (SS) or the third (LS) cushion, because the observers' task was to tell whether or not the ball would strike the skittle (see below). There were 2 repetitions for each shot, for a total of 48 stimuli, presented in a pseudo-random sequence. The duration was 15 minutes. The shot trajectories, including the occluded portion, are illustrated in Figure 2.
Procedure. Observers watched the stimuli while seated about 57 cm in front of the computer screen, with the head resting on a forehead support. For the match stimulus, the observers were simply instructed to pay attention to the movie in order to answer some general questions afterwards. For the shots stimulus, their task was instead to predict, with a verbal response for each trial, whether or not the ball would strike the skittle. Eye movements were acquired through infrared video-oculography (Eyegaze System, LC Technologies; sampling frequency: 60 Hz; nominal precision: 0.18 deg). Monocular recordings were performed unobtrusively via a remote camera mounted below the computer screen. Gaze direction was determined by means of the pupil-center-corneal reflection method.

Data Analyses
Ocular fixations were identified by means of a dispersion criterion: we defined gaze samples as belonging to a fixation if they were located within an area of 25 pixels (corresponding to 0.67 deg) for a minimum duration of 6 video frames (corresponding to 100 ms). Gaze shifts were defined as the transition from one fixation to the next.
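The dispersion criterion can be illustrated with the minimal Python sketch below. It is not the implementation used in the study: the function name `detect_fixations`, the reading of the 25-pixel area as a bound on both the horizontal and the vertical extent of the samples, and the toy gaze trace are assumptions made for the example. Only the thresholds (25 pixels, 6 frames at 60 Hz) come from the text.

```python
def detect_fixations(samples, max_dispersion=25.0, min_samples=6):
    """Return fixations as (start, end, centre_x, centre_y); end index is exclusive."""

    def within_dispersion(window):
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        # Assumed criterion: bounding box no larger than max_dispersion per axis.
        return (max(xs) - min(xs) <= max_dispersion
                and max(ys) - min(ys) <= max_dispersion)

    fixations = []
    start, n = 0, len(samples)
    while start + min_samples <= n:
        end = start + min_samples
        if within_dispersion(samples[start:end]):
            # Grow the window for as long as the samples stay together.
            while end < n and within_dispersion(samples[start:end + 1]):
                end += 1
            window = samples[start:end]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((start, end, cx, cy))
            start = end
        else:
            start += 1  # slide past a sample that belongs to a gaze shift
    return fixations


# Toy gaze trace in pixels, one sample per 60 Hz video frame.
trace = [(100, 100)] * 8 + [(300, 300)] * 7 + [(500, 100)] * 3
fixations = detect_fixations(trace)
durations_ms = [(end - start) * 1000 / 60 for start, end, _, _ in fixations]
```

In this toy trace the last three samples do not reach the 6-frame minimum and are therefore not counted as a fixation.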
The problem of distinguishing billiard experts from novice observers by assessing their oculomotor behavior can be recast as a classification procedure in a supervised learning setting. A feature set should be defined in order to capture the oculomotor behavior of the observers. To this end, for each observer, given the sequence of fixations $\{r_t\}_{t=1}^{N_T}$, where the vector $r_t$ represents the fixation position (coordinates) at time $t$, we computed the amplitude and direction of each gaze shift $\{l_t, \theta_t\}_{t=1}^{N_T}$, where $l_t$ is defined as the Euclidean distance between two successive fixations, and $\theta_t = \tan^{-1}(\Delta y_t / \Delta x_t)$ is the direction of the gaze shift between successive fixations, $\Delta x_t$ and $\Delta y_t$ being the horizontal and vertical components. These two features are good descriptors of the exploratory oculomotor activity (Tatler & Vincent, 2008, 2009; Boccignone & Ferraro, 2013b, 2013a). As a third feature we used the fixation duration $\{f_t\}_{t=1}^{N_T}$, which is also a useful descriptor of the oculomotor behavior in terms of visual processing (Viviani, 1990).
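The two gaze shift features can be computed from a fixation sequence as in the short sketch below. The function name `gaze_shifts` and the example coordinates are hypothetical. Note one assumption: the sketch uses the four-quadrant `atan2`, which resolves the direction over the full circle, whereas the text only states the inverse tangent of the ratio of the vertical to the horizontal component.

```python
import math


def gaze_shifts(fixations):
    """Amplitude and direction (radians) of the shift between successive fixations."""
    amplitudes, directions = [], []
    for (x0, y0), (x1, y1) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        amplitudes.append(math.hypot(dx, dy))   # Euclidean distance
        directions.append(math.atan2(dy, dx))   # assumed four-quadrant angle
    return amplitudes, directions


# Three hypothetical fixation positions give two gaze shifts.
amplitudes, directions = gaze_shifts([(0, 0), (3, 4), (3, 0)])
```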
Because we assume that the scanpath is the result of an underlying stochastic process (Boccignone & Ferraro, 2014), we summarize the random sample $\{l_t, \theta_t, f_t\}_{t=1}^{N_T}$ through the empirical distribution functions (histograms), which we denote as the random vectors $x^l$, $x^\theta$, $x^f$, respectively, where the vector dimension $D$ represents the number of bins of the histogram. In the following analyses $D = 6$ is used. The feature vector $x^s$ is thus a summary of the behavior of a single observer with respect to a particular feature space or source of information $s = 1, \dots, S$, here $S = 3$.
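A histogram summary with six bins can be sketched as follows. The binning range, the use of equal-width bins and the normalisation to relative frequencies are assumptions, since the text specifies only the number of bins; the function name `histogram` and the example values are hypothetical.

```python
def histogram(values, lo, hi, bins=6):
    """Relative frequencies of values over [lo, hi], using equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to hi is assigned to the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


# Hypothetical gaze shift amplitudes (deg) summarised as a 6-dimensional vector.
x_amplitude = histogram([0, 1, 2, 3, 4, 5, 6], lo=0, hi=6)
```

Each observer would then be described by one such vector per feature, i.e., three vectors of dimension six.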
In conclusion, each observer $n$, $n = 1, \dots, N$, is represented in the dataset $\{X, t\}$, where the matrix $X$ is the collection of features from all $N$ observers, whose behavior is summarized by the three feature vectors of dimension $D$, and $t$ collects the corresponding class labels $t_n$, expert or novice (binary classification). Then, the posterior probability for observer $n$ to be classified as expert or novice will be $P(t_n \mid x_n^1, \dots, x_n^S)$, and according to Bayesian decision theory we would assign observer $n$ to the class that has the maximum a posteriori probability (MAP).
From a pattern recognition perspective, one could in principle use different classifiers trained on the different feature spaces, but classifier combination methodologies (product combination rule, mean combination rule, etc.) would then require specific assumptions such as independence of the feature spaces or, at the opposite end, extreme correlation. Here we adopt the strategy of combining the feature spaces, and, in particular, we exploit the composite kernel construction technique (Damoulas, Ying, Girolami, & Campbell, 2008; Damoulas & Girolami, 2009a, 2009b), which is summarized at a glance in Figure 1.
First, the individual feature vectors were mapped into kernels (the kernel trick, Murphy, 2012) and thus embedded in Hilbert spaces via base kernels, which can be represented as the matrices $K^s \in \mathbb{R}^{N \times N}$. Each element of $K^s$ can be constructed through a suitable kernel function, which can be chosen based on prior knowledge, cross-validation or even via inference from a pool of kernel functions. Different choices are possible for the kernel functions, among which the most used are the linear and the Gaussian kernel.
Figure 1. The fixation sequence is represented in different feature spaces $s = 1, \dots, S$; each feature $x^s$ is then separately mapped in a kernel space, each space being generated via kernel $K^s$ of parameters $\theta^s$. The separate kernel spaces are then combined in a composite kernel space, which is eventually used for classification.
In turn, base kernels can be combined into a composite kernel $K^\beta \in \mathbb{R}^{N \times N}$ whose elements are:
$$K^\beta(x_i, x_j) = \sum_{s=1}^{S} \beta_s K^s(x_i^s, x_j^s). \quad (1)$$
This way, the composite kernel is a weighted summation of the base kernels, with $\beta_s$ as the corresponding weight for each one. Also, notice that in a multiple kernel setting we are free to choose different kernels for constructing the individual kernel spaces. As long as we employ at least two different feature spaces, even when the same kernel shape (e.g., Gaussian) is adopted for both spaces (cfr. Figure 1), the multiple kernel learning (MKL) procedure makes it possible to adapt individual kernel parameters so as to capture the statistics of information source $s$ as represented in the corresponding feature space (data-driven approach).
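The construction of base kernels and of their weighted sum can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pMKL code used in the study: the Gaussian kernel is written with the common parametrisation exp(-rho * squared distance), which may differ from the one adopted by the authors, the weights are fixed by hand rather than learned, and all function names and data are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, rho=1.0):
    """Assumed form exp(-rho * ||x - y||^2); the source parametrisation may differ."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-rho * sq_dist)


def gram_matrix(vectors, kernel):
    """N x N base kernel matrix for one feature space."""
    return [[kernel(xi, xj) for xj in vectors] for xi in vectors]


def composite_kernel(base_kernels, betas):
    """Weighted sum of base kernel matrices, one weight per feature space."""
    n = len(base_kernels[0])
    return [[sum(b * K[i][j] for b, K in zip(betas, base_kernels))
             for j in range(n)] for i in range(n)]


# Two hypothetical observers described in two feature spaces.
amplitude = [[0.5, 0.5], [1.0, 0.0]]
direction = [[0.2, 0.8], [0.2, 0.8]]
K_amp = gram_matrix(amplitude, lambda x, y: gaussian_kernel(x, y, rho=1.0))
K_dir = gram_matrix(direction, linear_kernel)
K_beta = composite_kernel([K_amp, K_dir], betas=[0.5, 0.5])
```

Using a Gaussian kernel for one space and a linear kernel for another mirrors the mixed configuration that a multiple kernel setting allows.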
The detection of expertise in the eye movements of the $n$-th subject in terms of the maximum a posteriori $P(t_n \mid x_n^1, \dots, x_n^S)$ can be obtained at the most general level as:
$$P(t_n \mid x_n^1, \dots, x_n^S) = P(t_n \mid W, k_n^\beta), \quad (2)$$
where the term on the r.h.s. is the Multinomial probit likelihood for the calculation of class membership probabilities (see Appendix A for a discussion and Damoulas & Girolami, 2009a; Psorakis et al., 2010 for further details). In Eq. 2, $W \in \mathbb{R}^{N \times C}$ is the matrix of model parameters; the variable $k_n^\beta$ is a row of the kernel matrix $K^\beta \in \mathbb{R}^{N \times N}$, whose elements are the $K^\beta(x_i, x_j)$ defined in Eq. 1, and it expresses how related, based on the selected kernel function, observation $x_n$ is to the others of the training set (Appendix A). Given the posterior $P(t_n \mid x_n^1, \dots, x_n^S)$, classification $t_n = c$, $c \in \{1, \dots, C\}$, is attained by using the MAP rule:
$$t_n = \arg\max_{c} P(t_n = c \mid x_n^1, \dots, x_n^S). \quad (3)$$
The Multinomial probit likelihood $P(t_n \mid W, k_n^\beta)$ in Eq. 2 above can be computed provided that the parameters $W, k_n^\beta$ are known. In a Bayesian framework, the latter can be inferred (learned) from data by introducing a prior distribution for the regression parameters $W$ (cfr. Appendix A), and to such end one suitable methodology is the Relevance Vector Machine (RVM, Tipping, 2001) framework in the variant proposed in Psorakis et al. (2010). RVMs can be considered the Bayesian counterpart of SVMs. They are Bayesian sparse machines, that is, they employ sparse Bayesian learning via an appropriate prior formulation. Not only do they overcome some of the limitations affecting SVMs (Appendix A), but they also achieve sparser solutions (and hence they are faster at test time) than SVMs (Tipping, 2001; Murphy, 2012). In particular, we have exploited the multi-class RVM (precisely, m-RVM1, Psorakis et al., 2010). Clearly, in our case the multi-class capability of the m-RVM1 (Psorakis et al., 2010) is redundant, since we are dealing with a binary classification problem (C = 2). However, essential in our case is the ability of achieving sparse solutions that utilize only a subset of the basis functions, the relevance vectors (Murphy, 2012), together with a ground framework for feature space fusion (Damoulas & Girolami, 2009a).
To sum up, the training and testing procedure was the following. We exploited a leave-one-out approach where, for all observers, at each step, N − 1 observers are enrolled for the training set and the N-th observer is used as one sample of the test set (Murphy, 2012) to be classified as in Eq. 3.
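The leave-one-out schedule can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the classifier; it is only a placeholder and is not the m-RVM used in the study. All function names and the toy data are hypothetical.

```python
def leave_one_out(features, labels, fit, predict):
    """Train on N - 1 observers and test on the held-out one, for every observer."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += int(predict(model, features[i]) == labels[i])
    return correct / len(features)


def fit_centroids(xs, ys):
    """Placeholder classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy data: six observers, two features each.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
labels = ["novice"] * 3 + ["expert"] * 3
loo_accuracy = leave_one_out(features, labels, fit_centroids, predict_centroid)
```

Passing `fit` and `predict` as arguments keeps the schedule independent of the classifier, so any model with the same two-step interface could be substituted.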
The input to the training and testing procedure was shaped in the form of all possible combinations of the feature vectors (histograms) $\{x^s\}_{s=1}^{S}$ (single features, pairs, or the full set; see the Supplementary Table). Further, given the input, all possible mappings using the linear and/or the Gaussian kernel were considered. Since the Gaussian kernel has a free parameter, the scale $\rho$, at each learning step a 5-fold cross-validation procedure was carried out to tune this parameter; validation was performed by varying the scale parameter in the range $\rho \in \{2^{-15}, \dots, 2^{3}\}$. This interval was discretised using a sampling step $\delta = 0.5$. The learning and classification steps accomplished in the leave-one-out schedule (see Appendix A for a general description) were performed by using the MATLAB software implementation of the m-RVM1 available at http://www.dcs.gla.ac.uk/inference/pMKL, with standard parameter initialization.
In the following Section, the reported results were obtained after 5 classification runs for each kernel and feature configuration taken into account, each run exploiting the leave-one-out procedure described above. At the beginning of each run the input data were randomly shuffled.
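The parameter grid and the fold structure can be illustrated as below. The sketch assumes that the sampling step of 0.5 applies to the exponent of 2, which the text does not state explicitly, and it splits the training observers into contiguous folds for simplicity. The function names are hypothetical, and the model fitting inside each fold is omitted.

```python
def rho_grid(min_exp=-15.0, max_exp=3.0, step=0.5):
    """Candidate scales 2**e, with e from min_exp to max_exp (assumed step on e)."""
    n_steps = int(round((max_exp - min_exp) / step))
    return [2.0 ** (min_exp + k * step) for k in range(n_steps + 1)]


def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n_items // k + (1 if f < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


grid = rho_grid()
# 41 = N - 1 training observers available at each leave-one-out step.
folds = k_fold_indices(41, k=5)
```

Under this assumption the grid contains 37 candidate values, each of which would be scored on the five validation folds before the best one is retained.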

Results
Expert and novice observers exhibited rather similar exploratory eye movements when watching a given stimulus; at least this is the qualitative impression when observing the cumulative gaze position over time condensed in single snapshots (Figure 2). Examples of individual scanpaths are illustrated in Figure 3. Here too, as in the pooled data of Figure 2, a certain degree of similarity between experts and novices can be appreciated at visual inspection. For example, in the single shots the ball trajectories can often be glimpsed from the raw scanpaths. We quantified the scanpaths by means of three oculomotor features, namely, fixation duration, gaze shift amplitude and gaze shift direction, which were used as input to the classifier either as single features or concatenated in pairs or in a triplet.
The distributions of these basic oculomotor features looked very similar between experts and novices (see Figure 4 and its polar plots). Despite this apparent similarity, however, in all cases there were statistically significant differences between experts' and novices' distributions (2-sample Kolmogorov-Smirnov test for fixation duration and gaze shift amplitude, always p < 0.01; 2-sample Kuiper test for gaze shift direction, always p < 0.01). Indeed, across the 3 shots experts had on average slightly shorter fixations (−16 ms), and somewhat larger and more counterclockwise-rotated gaze shifts (+0.15 deg and +0.336 rad).
Such small differences, however, can be exploited to discriminate between novices and experts when raw features are processed by a suitable classifier. For this purpose an RVM was chosen as the classifier. We first used equal kernel functions (linear or Gaussian) for all feature channels (cfr. Figure 1), while taking into consideration different numbers of sources/feature spaces s. Analysis of the results showed that classifier performance for the features $x^\theta$ derived from saccadic directions was worse with the Gaussian kernel: this led us to use mixed kernel functions, namely a Gaussian kernel for the length of shifts and fixation times, and a linear one for directions.
The outcomes obtained from the different kernels were quite similar, as can be seen in Supplementary Table 1. Therefore, the following analysis is performed solely on the results obtained with the multiple kernel approach, because it is more flexible and novel than single kernel methods. Moreover, except for the case of short shots, it was the only approach where the best performance was attained with more than one feature or combination of features (three for the long shots and two for the match), thus indicating a higher efficiency than the other approaches.
Tables 1 and 2 report results in terms of accuracy (percent correct) and discriminability ($d'$), respectively. Accuracy was defined as $N_c / N_{tot}$, where $N_c$ is the number of trials in which correct classification was attained, regardless of the stimulus (novice or expert), and $N_{tot}$ is the total number of trials. Discriminability was computed as $Z_H - Z_F$, where $Z_H$ is the z-transformed hit rate (a hit being a "novice" classification given a "novice" stimulus) and $Z_F$ is the z-transformed false alarm rate (a false alarm being a "novice" classification given an "expert" stimulus). Discriminability represents the capability of the classifier to separate novices and experts, regardless of the decision criterion.
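Both indices are straightforward to compute, as the following sketch shows. The function names and the example rates are hypothetical; only the 37-out-of-42 count echoes a figure reported in this paper. The z-transform is taken as the inverse of the standard normal cumulative distribution, which is undefined for rates of exactly 0 or 1, so such rates would need a correction that is not shown here.

```python
from statistics import NormalDist


def accuracy(n_correct, n_total):
    """Proportion of correct classifications."""
    return n_correct / n_total


def d_prime(hit_rate, false_alarm_rate):
    """Discriminability: z-transformed hit rate minus z-transformed false alarm rate."""
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


best_accuracy = accuracy(37, 42)
# Hypothetical rates: 90% of novices and 10% of experts labelled "novice".
example_d = d_prime(0.9, 0.1)
```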
For both accuracy and discriminability the reported tables represent the mean values across the 5 classifier repetitions, separately for each feature or feature combination and for each stimulus typology. We define the best performance as the highest classification score reported within each stimulus typology (short shots, long shots, match), regardless of which feature, or combination thereof, contributed to it. In case of ties, the best performance was stipulated to be the one in which both accuracy and discriminability were highest. From Table 1 it can be seen that the classification rate was rather good (range: 63.80% to 88.09%) and always above chance (p < 0.01 even for the lowest classification rate, one-tailed binomial test), with a rather high best performance within each stimulus typology (88.09%, 86.19% and 81.90%, marked in green; red denotes the worst performances within each stimulus typology).
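A one-tailed binomial test against chance can be sketched as below. The number of classifications entering the test is not stated in the text; the example assumes, purely for illustration, that the 42 observers of the 5 runs are pooled into 210 trials, so that the lowest rate of 63.80% corresponds to 134 correct classifications. The function name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """One-tailed probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed pooling: 42 observers x 5 runs = 210 trials, 134 of them correct.
p_value = binomial_tail(134, 210)
```

The result depends directly on the assumed number of trials, so a test based on a different trial count could lead to a different p-value.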
In the best case (88.09%) this amounts to saying that the RVM correctly classified 37 out of 42 observers as novices or experts, with a moderate bias to classify novices better than experts (predictive value for novices: 0.917; predictive value for experts: 0.851). Considering the best performances, which show the achievement of the classifier, accuracy was higher with the short shots (88.09%) than with the match (81.90%), with the performance with the long shots being intermediate (86.19%). A one-way ANOVA among the 3 best performances showed a marginally significant effect of classification conditions (either stimulus type or feature; F(2, 12) = 3.547, p = 0.062). Post-hoc LSD pairwise tests indicated that, whereas the two former figures (88.09% and 86.19%) did not differ significantly from each other (p > 0.4), their differences from the accuracy obtained with the match stimulus (81.90%) were statistically significant and marginally significant, respectively (p = 0.023 and p = 0.097).
No clear tendency could be appreciated as to which feature, or combination of features, best contributed to the classification. From Table 1 it can be seen that in no case did the same feature, or combination thereof, determine the best accuracy across the three stimulus typologies. In terms of mean performance, using single features provided a somewhat better result (80.31%) than combining them in pairs (75.13%) or in a triplet (76.82%). The best classification performance within each stimulus category was never obtained with the triplet of features, though only in one case did the triplet determine the worst performance (67.61%). An almost identical pattern of results was obtained by computing $d'$ as the index of performance (Table 2). Again, the best performance within each stimulus category was higher with the shots than with the match. Interestingly, the three worst performances (marked in red in the Tables) also coincided for accuracy and discriminability, and were higher for the long shots than for the short shots.

Discussion
In this study we have applied machine learning techniques (MKL-based feature combination and RVM) to analyze the oculomotor behavior of individual observers engaged in a visual task, with the aim of classifying them as experts or novices. To this end, we administered various visual stimuli and tasks to 42 subjects, half novices and half expert billiard players. As stimuli we used a portion of a real match, video-recorded from the top, containing several shots of variable length and complexity, as well as a number of ad-hoc individual shots, also video-recorded from the top in a real setting. The match stimulus was associated with a free-viewing observation condition, while for the individual shots, which were occluded in the final part of the trajectory, observers were asked to predict the outcome of the shot, which implicitly placed a significant constraint on the deployment of visuospatial attention and, consequently, on the overt scanpath. Thus, we demonstrated that, in both constrained and unconstrained naturalistic viewing conditions, eye movements contain enough information to detect an internal state such as expertise.
To our knowledge this is the first time that MKL-based feature combination and RVM techniques are applied to eye movement data. A very recent study by Henderson, Shinkareva, Wang, Luke, and Olejarczyk (2013) successfully inferred the observers' cognitive task (search, memorizing, reading) through classification. However, for the purpose of that study, a dedicated classifier was trained for each observer, and a simple baseline technique such as the Naïve Bayes classifier was sufficient. Clearly, when addressing a scenario in which individual observers are classified as belonging to one or another population, more sophisticated machine learning tools are needed. Many studies used an approach based on SVM classification (e.g., Lagun et al., 2011; Eivazi & Bednarik, 2011; Bednarik et al., 2005; Vig et al., 2009; Tseng et al., 2013; Bulling, Ward, Gellersen, & Tröster, 2011; Bednarik, Vrzakova, & Hradis, 2012). Beyond some limitations inherent to SVM (Tipping, 2001; Murphy, 2012), it is worth pointing out that the final classification step is just one side of the problem when spotting expertise from scanpaths in a data-driven way, the other side being how features are best combined and exploited. As anticipated in the Introduction, to address these issues we have adopted a feature fusion strategy relying on multiple kernel combination.
A comment is due on the choice of the features. The features we have used are typical basic parameters that characterize saccadic exploration of static scenes. However, our stimuli also contained moving elements (e.g., the ball motion) capable of eliciting smooth pursuit eye movements, which are characterized by different parameters. Thus, it may be argued that using saccade parameters is not entirely appropriate. Let us firstly note that in our experiment smooth pursuit eye movements were in fact not frequent. Although this may sound surprising, consider that our observers were not instructed to follow the moving target; also, the ball motion occupied only a minor part of the overall stimulus duration, and furthermore its motion was not continuous but interrupted by bounces, which implied rather frequent catch-up/anticipatory saccades. To take specific figures, consider the shot trials (Crespi et al., 2012): the ball was in motion for about 2.1 seconds in each trial, on average. During this short time window, the eyes spent on average only 63% of the time in slow motion (tangential velocity between 0.5 and 40 deg/s with a minimum duration of 100 ms), which amounts to about 1.3 seconds per trial. Considering that the mean recording window within a trial was 12.4 seconds, this indicates that smooth pursuit eye movements contributed to the overall eye movement pattern for only about 10% of the time (1.3 s out of 12.4 s). We did not measure all these parameters in the match task, but we can assume comparable figures. Secondly, much of the difference between experts and novices was found when the ball was not moving (ROI analysis, figs. 5 and 6 in Crespi et al., 2012; VDA peaks, fig. 2 in Robino et al., 2012). Thirdly, and more importantly, from the perspective of machine learning, segmenting a gradually changing signal into discrete elements and using them as features for the classifier is perfectly legitimate. Using virtual fixations or any other preprocessing of the oculomotor traces before the classification step is just a matter of convenience, as it is well known that machine learning techniques are blind as to the nature of the underlying processes. To the extent that features bring information, they work (features do not introduce new information).

Indeed, by combining only three basic parameters of visual exploration, the overall classification accuracy, expressed as percent correct and averaged across stimulus types and oculomotor features, scored a respectable 78%. More interesting is to consider the best performance for each stimulus type, which testifies to the achievement of the classifier, and which depends on the features used. The best performance ranged between 81.90% and 88.09% (1.852 to 2.399 in terms of d′), which is a quite remarkable result, especially considering that a naturalistic, unconstrained viewing condition was included (M). Besides confirming that eye movements contain a signature of billiard expertise (Crespi et al., 2012), this finding demonstrates that, even ignoring "where" the gaze is directed, i.e., to which objects or events overt visuospatial attention is allocated (content-driven approach), the "expert's eye" can be identified at the individual level from "how" the gaze is shifted, i.e., from basic oculomotor features such as saccade amplitude and direction and fixation duration (data-driven approach). Clearly, this does not amount to saying that the physiology of eye movements is modified by expertise, nor that expertise in a given field could be detected by using any visual stimulus whatsoever, but simply that there is not always the need to match the oculomotor features with the visual features, a common approach that we also used in our past work (Crespi et al., 2012; Robino et al., 2012). Notably, expertise detection was successful at the level of individual observers (see below).
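The relation between percent correct and d′ quoted above can be made concrete with a minimal sketch. It assumes the usual signal-detection definition, d′ = z(hit rate) − z(false-alarm rate), and treats "expert classified as expert" as a hit and "novice classified as expert" as a false alarm; this labelling, the function name `d_prime`, and the example rates are illustrative assumptions, not values taken from the study.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Discriminability index: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical unbiased classifier: 81.9% correct on both groups,
# i.e. hit rate 0.819 and false-alarm rate 0.181 (assumed, not reported).
print(round(d_prime(0.819, 0.181), 2))
```

Because d′ depends on the separate hit and false-alarm rates, a classifier with a small bias towards one group yields a d′ slightly different from the unbiased value for the same overall accuracy.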
The classification accuracy was higher with the shots than with the match. This difference, despite being small, is in keeping with the idea that the individual scanpath provides an indication about the degree of "expertise allocation", that is, how much an observer is actually using knowledge: the more expertise is used, the larger the systematic differences in visual exploration between a novice and an expert, hence the higher the classification performance. For example, the prediction task in which participants had to make a rapid guess as to the outcome of the shots ("will the ball hit the central skittle?") would seem to leave little room for free ocular exploration, especially for the short shots, thus reducing the idiosyncratic component of ocular exploration. As a consequence, the systematic differences between novices and experts emerge more clearly. Conversely, the fact that during match observation observers had no specific task, and that the pace of the shots was relatively relaxed, allowed more free eye movements, especially after the shots. In other words, the difference in classification accuracy between the shots and the match stimulus might depend on the different degree of "expertise allocation" in the two conditions, being higher in the shot prediction task than in the relatively unconstrained match observation task. Indeed, we had previously proposed that, during billiard match observation, it is precisely the alternation between the focusing of attention on the upcoming shot and the post-shot relaxation that allowed us to successfully parse the shot alternation exclusively on the basis of the scanpath differences between novices and experts (Robino et al., 2012).
The above considerations underscore the importance of selecting a proper test setting in order to detect expertise from the scanpath. On the one hand, it is clearly better to find the conditions (i.e., stimuli and tasks) that best elicit the use of expertise. These should be as stringent and controlled as possible, such as for example the ad-hoc shots coupled with the prediction task that we have used, where the highest classification performance was attained. On the other hand, it is intriguing that the RVM yielded a high accuracy, though not the highest, also with the match stimulus (81.90%). Considering the uncontrolled variability of a real billiard match, coupled with the lack of a specific task for the observers, we think this is a remarkable achievement in terms of capability to extract information from eye movements in naturalistic conditions. Pervasive behavioural monitoring of real-life visual exploration through wearable eye trackers may take advantage of high-performance classification methods such as RVM (Schumann et al., 2008; Hart, Onceanu, Sohn, Wightman, & Vertegaal, 2009; Noris, Nadel, Barker, Hadjikhani, & Billard, 2012; Vidal, Turner, Bulling, & Gellersen, 2012). Furthermore, especially for real-life conditions, it is crucial that the scanpath analysis can be data-driven, at least as much as possible, as a content-driven approach would inevitably require manual labeling of each video frame in terms of semantically-identified regions or visual elements. Indeed, this would preclude an automatic analysis of real-life scanpaths, and even more so a real-time analysis.
Firstly, our findings suggest that a number of low-level physiological parameters of visual exploration behavior could be suitably used to automatically decode inner cognitive processes to the benefit of BCI systems. In the field of neuro-rehabilitation, for example, many efforts are directed at decoding motor imagery and covert motor commands from brain signals with the goal of driving prosthetic devices and boosting motor improvement through neurofeedback training (Silvoni et al., 2011). Central to this endeavor is the capability to extract, in the simplest possible way, useful neural information from subjects engaged in some sort of mental imagery task. For this, brain activity is recorded via amplifiers and decoded using on-line classification algorithms. Brain signals are not the only physiological correlate of mental imagery, however. Eye movements have been shown to tag in a precise way an elusive covert process such as mental imagery (Brandt & Stark, 1997; Johansson, Holsanova, Dewhurst, & Holmqvist, 2012), and, more specifically, dynamic motion imagery (de'Sperati, 1999, 2003; de'Sperati & Santandrea, 2005; Jonikaitis, Deubel, & de'Sperati, 2009; Crespi et al., 2012). Thus, the methodological approach that we have described in the present study might be profitably applied to extract eye movement information to drive BCI external devices. For example, automatically classifying good and bad imagery performance could help to refine the mental training procedures until expertise is achieved, so that incorrect signals are not erroneously sent to the BCI device. Also, a classifier could detect spurious eye movements (or their absence) that might mean that visuospatial attention has been drawn away from the current imagery task. In sum, an oculomotor-based channel with efficient classification capabilities could be suitably paired with EEG-based or fMRI-based channels to improve mind reading performance in hybrid BCI systems with multiple input signal sources (Amiri, Fazel-Rezai, & Asadpour, 2013).
Another potential application of our approach is the development of an expertise test based on the "expert's eye". Clearly, a general expertise test cannot exist. Expertise is specific to particular domains, and it can be of various types and qualities (e.g., declarative-conceptual, procedural, strategic; De Jong & Ferguson-Hessler, 1996). Although expertise is ultimately established by directly measuring performance (e.g., through questionnaire scores, as in school grades, or with official rankings, as in sports), an indirect assessment of the visual exploratory behaviour may uncover subtle aspects underlying expertise in all those cases where visual information is crucial (e.g., understanding the working of a mechanical apparatus, or providing legal authentication of a painting, or playing chess, or detecting faults in sports). For example, in our previous study on billiard expertise we have documented, through eye movement recording, the passage from intuitive, procedural knowledge based on mental imagery, a strategy typical of novices, to rule-based, conceptual knowledge, which was expressed only in experts (Crespi et al., 2012). Incidentally, this may explain the small bias that we have found with the best performance towards a higher misclassification of experts than novices: because experts can adopt a novice's strategy but a novice cannot adopt an expert's strategy, a classifier can be fooled by an expert but not by a novice.
The capability to detect expertise automatically, that is, without the need of semantically analyzing which particular objects and events of a visual scene the gaze of an observer is directed to, will enhance "mind reading" methods. However, it should be borne in mind that a psychophysiological test for the expert's eye would not substitute direct measures of expertise, but rather complement them. Thus, finding a mismatch between the output of an automatic "ocular expertmeter" and the outcome of direct evaluation of expertise obtained with classical methods (e.g., testing, questionnaires) could raise issues as to what strategy or what evidence has actually been used. For example, assuming that the scanpath is indicative of expertise, the finding of an anomalous scanpath in inspecting the figures of a difficult geometry exam would perhaps question what mental procedure was used by a student who nonetheless provided all correct answers; an alternative interpretation could be that the student answered correctly by chance.
The automatic recognition of individual traits through behavioral analyses is an intensely pursued goal. Biometrics is a field of study aimed at identifying individuals through their unique biological characteristics or behavioral patterns. Biological methods in biometrics include for example fingerprint, face, or iris verification, whereas behavioral methods include voice, signature, typing or gait analysis. Recently, behavioral biometrics has been applied to eye movements, with the goal of identifying individuals through their oculomotor patterns (Holland & Komogortsev, 2011), even in a task-independent way (Bednarik et al., 2005; Kinnunen, Sedlak, & Bednarik, 2010). In these studies various methods to analyze eye movements have been used, with an ensuing performance, however, still short of the accepted standards for biometrics systems. Our work was aimed at distinguishing a novice from an expert, that is, two classes of individuals rather than a given individual as in biometrics. Nevertheless, the high classification rates that we obtained, even in a poorly constrained scenario such as match observation, suggest that our approach based on feature space fusion and a Bayesian sparse classifier could be profitably applied to personal identification as well. It is interesting that a similar set of eye movement features (e.g., duration and amplitude of saccades) can be used successfully for both individual and categorical classification (personal identity or expertise). This seems to confirm that these basic features are more than just oculomotor traits.

The SVM approach (Cristianini & Shawe-Taylor, 2000) relies on building a classifier of the form $\mathrm{sign}[f(x; w)]$, where $t_n \in \{-1, 1\}$ (binary classification, $C = 2$) and $f(x; w)$ is a linear regression model, $\hat{t} = f(x; w)$, that approximates the true mapping function $t$:

$$f(x; w) = \sum_{i=1}^{M} w_i \phi_i(x) = w^T \phi(x) \tag{4}$$

In Eq. 4, $\phi(\cdot)$ represents a set of generally nonlinear and fixed basis functions, mapping the input space into a higher dimensional space, and $w \in \mathbb{R}^M$ are adjustable parameters (or weights) that appear linearly in (4). Note that, though the model is linear in the parameters, it may still be highly flexible, as the size of the basis set, $M$, may be very large. The objective of training is to estimate good values for those parameters, which in the SVM framework is accomplished through an optimization technique (Cristianini & Shawe-Taylor, 2000). Also, in the SVM, the model is implicitly defined such that $M = N$, i.e., designing one basis function for each example in the training set. A particular kind of function, known as a kernel function, is employed; it provides an implicit calculation of the inner product between $\phi(x_i)$ and $\phi(x_j)$, i.e., $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. Thus, predictions are based on the function:

$$f(x; w) = \sum_{n=1}^{N} w_n K(x, x_n) \tag{5}$$

The key feature of the SVM is that, in the classification case, its target function attempts to minimise a measure of error on the training set while simultaneously maximising the margin between the two classes (in the feature space implicitly defined by the kernel). This is a highly effective mechanism for avoiding over-fitting, which leads to good generalisation. It furthermore results in a sparse model dependent only on a subset of kernel functions: those associated with training examples $x_n$ that lie either on the margin or on the "wrong" side of it, namely the support vectors (Cristianini & Shawe-Taylor, 2000).
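The kernel-based decision rule described above, a weighted sum of kernel evaluations against the training examples followed by a sign, can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian kernel, its width, the function names, and the hand-picked weights are all hypothetical, and no training (neither SVM optimisation nor RVM inference) is performed here.

```python
import math


def gaussian_kernel(x, z, width=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def predict(x, train_x, weights, kernel=gaussian_kernel):
    """Return sign[f(x)] with f(x) = sum_n w_n K(x, x_n)."""
    f = sum(w * kernel(x, x_n) for w, x_n in zip(weights, train_x))
    return 1 if f >= 0 else -1


# Two hypothetical training examples with hand-picked (not learned) weights.
train_x = [(0.0, 0.0), (3.0, 3.0)]
weights = [1.0, -1.0]
print(predict((0.1, 0.1), train_x, weights))
print(predict((2.9, 3.0), train_x, weights))
```

In a sparse model most weights are zero, so only the few training examples with non-zero weights contribute to the sum.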
The RVM has the same functional form as SVMs, but is conceived in a Bayesian framework (Tipping, 2001). Following the standard probabilistic formulation, the targets are assumed to be samples generated from the model (4) perturbed with a noise process $\epsilon$:

$$t_n = f(x_n; w) + \epsilon_n \tag{6}$$

Here $\epsilon$ represents the error between the estimated targets $\hat{t}$ and the true ones $t$, which is assumed to be normally distributed with zero mean and unknown variance $\sigma^2$, i.e., $t_n \sim P(t_n \mid x_n, w, \sigma^2) = \mathcal{N}(t_n \mid f(x_n; w), \sigma^2)$, where the latter notation specifies a Gaussian distribution $\mathcal{N}$ over the target labels with mean $f(x_n; w)$ and variance $\sigma^2$. Assuming independent and identically distributed observations, the data likelihood can be written as:

$$P(t \mid x, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid f(x_n; w), \sigma^2) \tag{7}$$

From now on, we will write terms such as $P(t \mid x, w, \sigma^2)$ as $P(t \mid w, \sigma^2)$. Omitting the $x$ variables is purely for notational convenience and implies no further model assumptions.
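As a minimal sketch of the Gaussian data likelihood just described, the function below evaluates its logarithm for a set of targets and model outputs under independent, identically distributed noise. The function name and the toy numbers are illustrative assumptions.

```python
import math


def gaussian_log_likelihood(targets, predictions, noise_var):
    """log prod_n N(t_n | f(x_n; w), sigma^2) for i.i.d. Gaussian noise."""
    n = len(targets)
    sq_err = sum((t - f) ** 2 for t, f in zip(targets, predictions))
    return -0.5 * n * math.log(2.0 * math.pi * noise_var) - sq_err / (2.0 * noise_var)


# Hypothetical targets and two candidate sets of model outputs f(x_n; w).
targets = [1.0, -1.0, 1.0]
good_fit = [0.9, -0.8, 1.1]
poor_fit = [-0.5, 0.4, 0.0]
print(gaussian_log_likelihood(targets, good_fit, noise_var=0.25))
print(gaussian_log_likelihood(targets, poor_fit, noise_var=0.25))
```

For a fixed noise variance, outputs closer to the targets give a higher log-likelihood, which is what the Bayesian treatment weighs against the prior on the weights.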
In a Bayesian framework, the model parameters $w$ and $\sigma^2$ are considered as random variables. These are estimated by first assigning prior distributions and then estimating their posterior distribution using the likelihood of the observed data (Eq. 7). The key of the RVM approach (Tipping, 2001) is to define a prior conditional distribution on each coefficient $w_i$, such that, according to the Automatic Relevance Determination (ARD) mechanism (MacKay, 1992), all unnecessary coefficients are pruned:

$$P(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}(w_i \mid 0, \alpha_i^{-1}) \tag{8}$$

where $\alpha = (\alpha_1, \ldots, \alpha_M)^T$ is the vector of RVM hyperparameters. Since many of these hyperparameters usually assume very large values, the distributions of their associated weights will be sharply peaked around zero. This has the effect of switching off basis functions for which there is no evidence in the data, yielding sparse prediction models. Thus, unlike the SVM, the RVM explicitly encodes the criterion of model sparsity as a prior over the model weights. Whilst in SVM regression/classification a desirable level of sparsity has to be brought about indirectly by determining an error or margin parameter via a cross-validation scheme, the Bayesian formulation of the regression problem in the RVM allows for a prior structure that explicitly encodes the desirability of sparse representations (Tipping, 2001). As a practical consequence, for SVM the support vectors are typically formed by "borderline", difficult-to-classify samples in the training set, which are located near the decision boundary of the classifier; in contrast, for RVM the relevance vectors are formed by samples appearing to be more representative of the two classes, which are located away from the decision boundary of the classifier.
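The pruning effect of a large hyperparameter can be illustrated in the simplest possible setting: a single weight with a zero-mean Gaussian prior of precision alpha and a Gaussian likelihood. This is a sketch under those assumptions only (one basis function, known noise variance); the function name and the toy data are hypothetical, and the full RVM estimates all hyperparameters jointly rather than fixing them by hand.

```python
def posterior_weight(phi, targets, alpha, noise_var=1.0):
    """Posterior mean and variance of one weight w.

    Model: t_n = w * phi_n + noise, prior w ~ N(0, 1/alpha).
    """
    precision = alpha + sum(p * p for p in phi) / noise_var
    mean = sum(p * t for p, t in zip(phi, targets)) / noise_var / precision
    return mean, 1.0 / precision


phi = [1.0, 2.0, 3.0]
targets = [1.0, 2.0, 3.0]  # generated with w = 1 and no noise

# Small alpha: broad prior, the data dominate and w stays close to 1.
print(posterior_weight(phi, targets, alpha=1e-6))
# Large alpha: prior sharply peaked at zero, the weight is switched off.
print(posterior_weight(phi, targets, alpha=1e6))
```

As alpha grows, both the posterior mean and the posterior variance of the weight shrink towards zero, which is the mechanism by which the corresponding basis function is effectively removed from the model.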
The RVM classifier based on Multiple Kernels (Damoulas & Girolami, 2009a; Psorakis et al., 2010) can be obtained by generalizing Eq. (5) as follows. The base kernels can be combined into an $N \times N$ composite kernel as $K^{\beta}(x_i, x_j) = \sum_{s=1}^{S} \beta_s K_s(x_i^s, x_j^s)$ (Eq. 1, Data analyses Section).
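The composite kernel above is a weighted sum of one kernel matrix per feature space. A minimal sketch follows, assuming Gaussian base kernels and combination weights that sum to one; both choices, the kernel widths, the function names, and the toy per-observer descriptors are illustrative assumptions rather than the settings used in the study.

```python
import math


def gaussian_gram(samples, width):
    """N x N Gaussian kernel matrix for one feature space."""
    def k(x, z):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-sq_dist / (2.0 * width ** 2))
    return [[k(x, z) for z in samples] for x in samples]


def composite_kernel(feature_spaces, widths, betas):
    """K_beta[i][j] = sum_s beta_s * K_s(x_i^s, x_j^s)."""
    grams = [gaussian_gram(s, w) for s, w in zip(feature_spaces, widths)]
    n = len(grams[0])
    return [[sum(b * g[i][j] for b, g in zip(betas, grams)) for j in range(n)]
            for i in range(n)]


# Made-up descriptors for N = 3 observers in S = 3 feature spaces.
amplitude = [(2.2,), (2.5,), (2.1,)]
direction = [(0.1, 0.9), (0.4, 0.6), (0.2, 0.8)]
duration = [(250.0,), (230.0,), (210.0,)]
K = composite_kernel([amplitude, direction, duration],
                     widths=[1.0, 0.5, 20.0], betas=[0.5, 0.2, 0.3])
for row in K:
    print([round(v, 3) for v in row])
```

Each feature space can have a different dimensionality, since only the resulting kernel matrices, all of size N x N, are combined.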
More precisely, denote with the matrix $X \in \mathbb{R}^{N \times D}$ the input data from which the kernel matrix $K^{\beta} \in \mathbb{R}^{N \times N}$ is derived, where each row $k_n^{\beta}$ expresses how related, based on the selected kernel function, observation $x_n$ is to the others of the training set. The learning process involves the learning of the model parameters $W \in \mathbb{R}^{N \times C}$, which through the quantity $W^T K^{\beta}$ act as a voting system to express which relationships of the data are important in order for our model to have appropriate discriminative properties. By introducing the auxiliary variables $Y \in \mathbb{R}^{N \times C}$, we regress on $Y$ with a standardized noise model; thus, for a sample $n$ and a class $c$, Eq. 7 can be written as:

$$y_{nc} \mid w_c, k_n^{\beta} \sim \mathcal{N}(y_{nc} \mid w_c^T k_n^{\beta}, 1)$$

where the vector $w_c$ defines the $c$-th column of the model parameters matrix $W$. The regression target is linked to the classification label by setting $t_n = i$ if $y_{in} > y_{jn}\ \forall j \neq i$. This way, the posterior class membership distribution is the multinomial probit likelihood (details in Damoulas & Girolami, 2009a):

$$P(t_n = i \mid W, k_n^{\beta}) = \mathbb{E}_{p(u)} \left\{ \prod_{j \neq i} \Phi\left(u + (w_i - w_j)^T k_n^{\beta}\right) \right\} \tag{11}$$

where $u \sim \mathcal{N}(0, 1)$ and $\Phi$ is the Gaussian cumulative distribution function. Following the RVM approach, the elements $w_{nc}$ of matrix $W$ follow a normal distribution with zero mean and variance $\alpha_{nc}^{-1}$, where the latter are the elements of the hyper-parameter matrix $A \in \mathbb{R}^{N \times C}$, which in turn follow a Gamma distribution with hyperparameters $a$ and $b$. With sufficiently small hyperparameters $a, b$ ($< 10^{-5}$), the scales $A$ restrict $W$ around its zero mean due to small variance.
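The multinomial probit class probability described above is an expectation over a standard normal variable u of a product of Gaussian CDFs. The sketch below evaluates it numerically with a plain trapezoid rule on a fixed grid, used here as a simple stand-in for whatever quadrature scheme the original implementation employed. The function name, the grid settings, and the example scores (one regression output per class for a single sample) are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def class_probability(scores, i, n_points=2001, limit=8.0):
    """P(t = i) = E_u[ prod_{j != i} Phi(u + scores[i] - scores[j]) ], u ~ N(0, 1).

    scores[c] stands for the regression output of class c for one sample.
    """
    step = 2.0 * limit / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        u = -limit + k * step
        term = _STD_NORMAL.pdf(u)
        for j, s in enumerate(scores):
            if j != i:
                term *= _STD_NORMAL.cdf(u + scores[i] - s)
        # Trapezoid rule: the two end points carry half weight.
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * term * step
    return total


# Hypothetical per-class scores for one observer (three classes).
scores = [0.3, -0.2, 1.0]
probs = [class_probability(scores, i) for i in range(len(scores))]
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

With two classes, as in the novice versus expert problem, the expectation reduces to a single Gaussian CDF of the score difference divided by the square root of two, which offers a convenient sanity check for the numerical integration.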
The learning procedure for latent variables Y and parameters W, A, β is a generalised Expectation-Maximization algorithm, which can be summarised as follows.
Step 1: use a type-II Maximum Likelihood (ML) procedure, which maximises the log of the marginal likelihood $\log P(Y \mid K^{\beta}, A) = \log \int P(Y \mid K^{\beta}, W) P(W \mid A)\, dW$ with respect to $A$, and boils down to either adding a sample or updating its associated hyper-parameter $\alpha_{nc}$; thus, the model can start with a single sample and proceed in a constructive manner.
Step 2: perform an M-step for obtaining W.
Step 3: perform an E-step for Y.
Step 4: obtain the $\beta_s$ weights via constrained Quadratic Programming.
Steps 1 to 4 are iterated using as a convergence measure the % mean change of $(Y - K^{\beta} W)^2$. Once the parameters of the model have been learned, the multinomial probit likelihood for the calculation of class membership probabilities $P(t_n = i \mid W, k_n^{\beta})$ (Eq. 11) is computed by resorting to a quadrature approximation for solving the expectation integral. For details, see Damoulas and Girolami (2009a) and Psorakis et al. (2010).

Figure 1. Data analysis in multiple-kernel representation. The fixation sequence is represented in different feature spaces $s = 1, \ldots, S$; each feature $x^s$ is then separately mapped into a kernel space, each space being generated via a kernel $K_s$ with parameters $\theta_s$. The separate kernel spaces are then combined in a composite kernel space, which is eventually used for classification.

Figure 4), with very close median values (fixation duration - novices vs. experts: 247 vs. 231 ms, 231 vs. 215 ms, 247 vs. 230 ms, respectively for SS, LS and Match; gaze shift amplitude - novices vs. experts: 2.219 vs. 2.458 deg, 2.383 vs. 2.525 deg, 2.076 vs. 2.150 deg, respectively for SS, LS and Match). Also the shapes of the gaze shift direction distributions looked rather similar

Figure 2. Raw eye position (yellow) recorded during shot and match viewing, for both novices (2(a)) and experts (2(b)). All data from all participants are superimposed. The traces recorded during shot viewing have been re-oriented onto a single shot trajectory (red arrows, with the dashed part representing the occluded portion of the trajectory) for clarity of the graphical presentation. Rows, from top to bottom: Short shots (SS), Long shots (LS), Match (M).

Figure 3. Examples of scanpaths of individual observers, for both novices (left panels) and experts (right panels), for the three types of stimuli (M, SS, LS). For both SS and LS, 24 scanpaths, lasting individually about ten seconds, are superimposed in each panel (one for each trial), whereas during match observation there is only a single, 5 min long continuous scanpath. For simplicity, the same background table image has been used in all panels. Here the ocular traces during shot viewing are shown in their original orientation.

DOI 10.16910/jemr.7.2.3 ISSN 1995-8692 This article is licensed under a Creative Commons Attribution 4.0 International license.

Figure 4. Distributions of the three oculomotor features used to classify expertise. Top panels (4(a)), fixation duration; middle panels (4(b)), gaze shift amplitude; bottom panels (4(c)), gaze shift direction. Vertical solid lines, median values. SS = Short Shots, LS = Long Shots.

Table 1
Mean classification accuracy with the Multiple Kernel analysis.Base features: gaze shift amplitude (A), gaze shift direction (D), fixation duration (F).Best and worst performances are marked in green and in red, respectively

Table 2
Classification discriminability (d′) with the Multiple Kernel analysis. Base features: gaze shift amplitude (A), gaze shift direction (D), fixation duration (F). Best and worst performances are marked in green and in red, respectively.

Denote by $\{x_n, t_n\}_{n=1}^{N}$ a training data set of $N$ samples, with $x_n \in \mathbb{R}^D$ and $t_n \in \mathcal{C}$, $\mathcal{C}$ being the classification space of dimension $C$.

Supplementary Table 1
Comparison of the three RVM analyses: linear kernel, Gaussian kernel, multiple (mixed) kernel. Both mean classifier accuracy (left panels) and discriminability (right panels) are shown.