Representative Scanpath Identification for Group Viewing Pattern Analysis

Scanpaths are composed of fixations and saccades. Viewing trends reflected by scanpaths play an important role in scientific studies like saccadic model evaluation and real-life applications like artistic design. Several scanpath synthesis methods have been proposed to obtain a scanpath that is representative of the group viewing trend, but most of them either target a specific category of viewing materials like webpages or leave out some useful information like gaze duration. Our previous work defined the representative scanpath as the barycenter of a group of scanpaths, which shows the averaged shape of multiple scanpaths. In this paper, we extend our previous framework to take gaze duration into account, obtaining representative scanpaths that describe not only attention distribution and shift but also attention span. The extended framework consists of three steps: eye-gaze data preprocessing, scanpath aggregation and gaze duration analysis. Experiments demonstrate that the framework serves the purpose of mining viewing patterns well and that “barycenter” based representative scanpaths can better characterize the pattern.

Keywords: eye movement, eye tracking, representative scanpath, attention, viewing pattern, barycenter, gaze duration

Scanpaths reflect the ebbs and flows of visual attention. According to Yarbus' research (1967), scanpaths from different observers for the same visual stimuli in free viewing conditions are similar but not identical. The scanning order of one subject is not perfectly congruent with that of others as shown in Figure 1 (a), so it remains a challenging task to identify from multiple scanpaths a pattern that reflects the attention synchrony of different subjects as shown in Figure 1 (b). Such a pattern not only plays an important role in understanding how humans perceive and explore their surrounding scenes but also reveals some important properties of visual stimuli, so it has a wide range of applications in many fields. For example, in psychology, it can be used to identify reading habits of experts and detect reading disorder; in marketing, it can tell us which parts of an advertisement first grab customer attention and help to design a more user-friendly interface; in computer vision, it can be regarded as the group viewing pattern to train a network for scanpath prediction.

Related Work
Several methods were proposed to analyze scanpaths. For example, T-pattern is a tool to discover repetitive scan patterns in each individual scanpath (Magnusson, 2000; Burmester & Mast, 2010). Others attempt to characterize complex scanning patterns in dynamic tasks such as air traffic control (McClung & Kang, 2016). However, to get the group viewing pattern, like the identified scanpath in Figure 1(b), which we call the representative scanpath, we need to take into account all individual scanpaths rather than focus on a single one. The surge of interest in dynamic visual attention gives rise to various methods for representative scanpath identification, most of which either stem from sequence mining algorithms or target a specific category of visual stimuli such as web pages (Eraslan, Yesilada & Harper, 2014, 2016a, 2016b, 2016c, 2017a, 2017b). So they have limitations when applied to analyze scanpaths on other kinds of visual stimuli. Existing methods to analyze scanpaths include extracting common subsequences shared by all the subjects (Eraslan et al., 2014; Goldberg & Helfman, 2010; Hembrooke, Feusner, & Gay, 2006; West, Haake, Rozanski, & Karn, 2006). However, in the case where there is no common component shared by individual scanpaths, methods in this category will fail to produce any pattern. To be more tolerant of individual differences, sequential pattern mining algorithms can be used to obtain frequent subsequences supported by a specified number of subjects (Hejmady & Narayanan, 2012). But a fixed threshold of subject number can hardly be suitable for all the images due to the varying degree of scanpath inconsistency incurred by personal viewing habits and visual stimuli properties. Hence the produced subsequences may still be too short to reflect the complete viewing pattern. Instead of simply focusing on subsequences, scanpath trend analysis (STA) (Eraslan et al., 2016b) is proposed to acquire the viewing pattern from a whole new perspective.
STA first selects representatively trending instances from scanpath components and then rearranges them based on their average rank in all the individual scanpaths. To make STA more tolerant, a new parameter, the tolerance level, which allows trending instances to be shared by a subset of scanpaths rather than all of them, is added to the original STA algorithm (Eraslan et al., 2017b), but it is difficult to propose a specific tolerance level. The main limitation of STA and its variant is that they target web pages and rely on the natural segmentation of visual elements (e.g., navigation bar, text box, etc.) to denote scanpaths by character strings.

Apart from the above studies, researchers in the computer vision community are also interested in eye tracking data. Saliency models predicting fixation distribution and saccadic models predicting scanpaths are two important topics in computer vision. While the fixation density map (Engelke, Liu, Wang, Le Callet, Heynderickx, Zepernick, & Maeder, 2013) has been widely accepted as the baseline to evaluate saliency model performance, few efforts have been dedicated to finding an appropriate baseline for saccadic models. Generally, researchers obtain the upper bound of scanpath prediction performance based on inter-observer consistency and choose from individual scanpaths the one that is the closest to the rest on behalf of all the scanpaths for visualization (Jiang, Boix, Roig, Xu, Van Gool, & Zhao, 2016). Similar to STA, the inter-observer consistency method (IOC) also preprocesses recorded individual scanpaths into sequences based on clustering results. Scanpath similarity is measured by the Needleman-Wunsch string matching algorithm. Such simplification retains the viewing order but abandons the spatial distribution of scanpaths.
However, it is fixation order and fixation distribution that jointly determine scanpath shape. So some researchers adopted Dynamic Time Warping (DTW) (Sakoe & Chiba, 1978) to directly compare scanpaths without preprocessing or simplification (Le Meur & Liu, 2015). In our previous work (Li, Zhang, & Chen, 2017), we proposed the Candidate-constrained DTW Barycenter Averaging (CDBA) algorithm to take into account spatial distribution when analyzing the viewing trend. But there is still little discussion about the important role that gaze duration plays in characterizing scanpaths. Hence, in this paper we extend the framework to generalize viewing trends in not only scanpath shape but also gaze duration. Experiments are conducted to assess the ability of the obtained scanpaths to reflect viewing patterns.

Methodology
The overall framework to obtain the representative scanpath is shown in Figure 2. It consists of three steps: eye-gaze data preprocessing, scanpath aggregation and gaze duration analysis. Fixation position, order and duration are fully exploited to identify the viewing pattern. The preprocessing step is divided into three substeps: outlier removal, AOI extraction and center identification. The second step focuses on scanpath shape, in which multiple scanpaths are aggregated into a single one. Finally, based on the aggregated scanpath, we analyze the pattern from the perspective of gaze duration and combine the analysis results from all three aspects to obtain the representative scanpath.

Eye-gaze Data Preprocessing
Eye-gaze data are generally expressed by sequences of fixations. Each fixation is recorded as a point with coordinates and gaze duration. The preprocessing step prepares the data for the subsequent pattern mining procedures: outlier removal ensures the consistency of remaining scanpaths, AOI extraction facilitates a higher-level representation, and center identification retains the spatial distribution of scanpath components.
Outlier Removal. With different preferences, subjects allocate fixations in irregular and idiosyncratic manners. In addition, inevitable errors in eye tracking and data processing increase the uncertainty of recorded fixations. Therefore, fixations that are isolated might come from interesting viewing behaviors of subjects or measurement errors of eye trackers, leading to discrepancy among scanpaths. Even when fixation distributions are similar, how fixations are sequentially arranged to reflect the actual viewing process still varies with different individuals. Therefore, both fixation position and order are potential causes of scanpath inconsistency.
Figure 2. The extended framework to find a representative scanpath that shows attention distribution, attention shift as well as attention span.

To eliminate the influence of outlier scanpaths on both spatial distribution and temporal order, we exclude outlier scanpaths with a boxplot at the very beginning. A boxplot is a statistical tool that enables us to detect outliers and observe the dispersion degree of data. Algorithm 1 explains how the boxplot works in detail. In Algorithm 1, we use Dynamic Time Warping (DTW) (Sakoe & Chiba, 1978) to calculate the distance or dissimilarity between any two scanpaths. Outlier removal guarantees inter-observer consistency to some degree so the resulting pattern can reflect the common trend from the compatible majority.
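The boxplot rule can be sketched as follows. This is a minimal illustration rather than Algorithm 1 itself: it assumes that a symmetric matrix of pairwise DTW distances is already available, scores each scanpath by its mean distance to the others, and applies the conventional upper fence Q3 + 1.5 × IQR. The function name `remove_outlier_scanpaths` and the whisker factor 1.5 are assumptions made for the example.

```python
import numpy as np

def remove_outlier_scanpaths(dist, whisker=1.5):
    """Return indices of scanpaths kept by a boxplot rule (illustrative).

    dist: (k, k) symmetric matrix of pairwise DTW distances, zero diagonal,
          with k >= 2.
    whisker: boxplot whisker factor; 1.5 is the conventional value (assumed).
    """
    dist = np.asarray(dist, dtype=float)
    k = dist.shape[0]
    # Mean distance from each scanpath to the other k - 1 scanpaths.
    score = dist.sum(axis=1) / (k - 1)
    q1, q3 = np.percentile(score, [25, 75])
    upper_fence = q3 + whisker * (q3 - q1)
    # Only unusually distant scanpaths are treated as outliers.
    return [i for i in range(k) if score[i] <= upper_fence]
```

Only the upper fence is used here because a scanpath that is unusually close to all the others is not a source of inconsistency.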
AOI Extraction. According to Gestalt theory (Kanizsa 1979), the nature of unified whole is not simply the addition of its parts. So visual attraction is not from a single pixel but a whole region of interest. It is possible that for the same visual target, fixations scatter on different locations due to the high degree of viewing freedom. As a result, fixation based scanpaths do not facilitate an abstract expression, making it hard to identify what is common in eye tracking data. Therefore, we should express the representative scanpath by higher level components such as AOIs. For example, ScanMatch (Cristino et al., 2010) algorithm uses grid mask to transform fixation based scanpaths to AOI sequences. But the number of grids is flexibly determined and AOIs are not associated with image content. Considering that fixations are stimulus-driven, the clustering structure of fixations is closely related to the distribution of visual attraction. Hence, the representative scanpaths we discuss in this paper are composed of AOIs that are associated with fixation clusters.
All the fixation points are clustered by the algorithm proposed by Rodriguez et al. (2014), which considers two properties of points: local density ρ and distance from points with higher density δ.
Fixations with large values of ρ and δ are recognized as cluster exemplars. To determine the number of clusters, γ = ρ × δ is calculated for each fixation and all the values are sorted in decreasing order. Then a threshold is set so that fixations with γ larger than the threshold stick out and the cluster number is determined accordingly. The threshold can be set empirically as the arithmetic mean or the geometric mean. In our experiment, we used the weighted geometric mean of the values $\gamma_1, \gamma_2, \ldots, \gamma_n$, which have been sorted in decreasing order. The weighted geometric mean puts more emphasis on larger γ and leads to fewer and less overlapped clusters than the geometric mean.
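The clustering step can be illustrated with a short sketch. It assumes a cutoff kernel for ρ with a hypothetical cutoff distance `d_c`, and it uses the plain geometric mean of γ as the threshold, not the weighted geometric mean used in our experiment. The function name `density_peak_clusters` and the tie-breaking by index are choices made for the example.

```python
import numpy as np

def density_peak_clusters(points, d_c):
    """Cluster fixations by density peaks (illustrative sketch).

    points: (n, 2) fixation coordinates; d_c: cutoff distance (assumed).
    Returns (labels, gamma); labels[i] is the cluster index of fixation i.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # rho: number of other fixations closer than d_c (cutoff kernel).
    rho = (dist < d_c).sum(axis=1) - 1
    order = np.argsort(-rho, kind="stable")  # decreasing density
    delta = np.zeros(n)
    parent = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]  # fixations ranked as denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    gamma = rho * delta
    # Threshold: plain geometric mean over the positive gamma values.
    positive = gamma[gamma > 0]
    threshold = np.exp(np.log(positive).mean()) if positive.size else 0.0
    is_center = gamma > threshold
    is_center[order[0]] = True  # the densest fixation always seeds a cluster
    labels = np.full(n, -1)
    next_label = 0
    for i in order:
        if is_center[i]:
            labels[i] = next_label
            next_label += 1
        else:
            # Inherit the label of the nearest denser fixation.
            labels[i] = labels[parent[i]]
    return labels, gamma
```

Each non-exemplar fixation inherits the label of its nearest neighbor of higher density, so one pass in decreasing density order is enough to assign all fixations to AOIs.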
Center Identification. Now all the fixations are assigned to different AOIs. To retain the spatial information of scanpaths, we need to take into account the locations of AOIs. Instead of simply averaging coordinates or choosing points with large γ as centers, we adopt a random walk based method  to identify AOI centers, which is more robust and less likely to be affected by edge points of a cluster. The random walk based method aims to obtain a coefficient for each fixation in the AOI and calculates the weighted center as the final AOI center.
The coefficient of each fixation is updated iteratively. The update involves the initial coefficient of the fixation, which is defined by fixation density, a normalizing parameter, and the transition probability from fixation i to fixation j. The transition probability is determined by the Euclidean distance from fixation i to fixation j, and a parameter σ is introduced to influence the center distribution subtly.
Different from simple segmentation or grid mask that only allows scanpaths to be treated as character strings, AOI centers make it possible to denote scanpaths by sequences of coordinates and thus can also be regarded as indicators of AOI distribution. AOIs with identified centers are considered as candidate components for the representative scanpath in the aggregation stage.

Scanpath Aggregation
Generally speaking, the barycenter of points in a cluster is regarded as the representative or exemplar of the cluster. Likewise, we aggregate multiple scanpaths into a single one by computing the "barycenter" of the scanpath set. In other words, we try to calculate a representative scanpath that is the closest to individual scanpaths in terms of average distance. Mathematically, the representative scanpath is defined as follows:

$$s^{*} = \arg\min_{s} \frac{1}{|S|} \sum_{s_i \in S} d(s, s_i) \qquad (1)$$

where $s^{*}$ is the representative scanpath, $s$ is any scanpath that may become the representative scanpath, $s_i$ is an individual scanpath in the given scanpath set $S$, and $d$ is a function calculating the distance or dissimilarity between two scanpaths.
Here we utilize Dynamic Time Warping (DTW) to measure scanpath distance. DTW was first put forward for speech recognition and then widely used in time series analysis (Berndt & Clifford, 1994). Traditional string matching algorithms like the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970) and Levenshtein distance (Levenshtein, 1965) simply treat scanpaths as strings and need to additionally construct a cost matrix to take into account spatial proximity, while DTW already involves the construction. In most cases, scanpaths are recorded as sequences of components with coordinates. Given two scanpaths $A = A_m = \langle a_1, a_2, \cdots, a_m \rangle$ and $B = B_n = \langle b_1, b_2, \cdots, b_n \rangle$, the DTW distance is recursively computed by the following formula:

$$DTW(A_i, B_j) = \delta(a_i, b_j) + \min\{DTW(A_{i-1}, B_{j-1}),\ DTW(A_{i-1}, B_j),\ DTW(A_i, B_{j-1})\}, \quad (1 \le i \le m,\ 1 \le j \le n)$$

where $A_i$ and $B_j$ are the subsequences of A and B, $a_i$ and $b_j$ are components of scanpaths A and B respectively, and δ() is the Euclidean distance function. The distance or dissimilarity between scanpath A and B is:

$$d(A, B) = DTW(A_m, B_n)$$

It is difficult to directly get the optimal solution of Equation (1). Hence, we add the following constraints to make it feasible:
- The representative scanpath must be composed of abstract scanpath components such as AOIs;
- Any two contiguous components in the representative scanpath must be contiguous in at least one individual scanpath;
- The occurrence count of each component in the representative scanpath does not exceed the maximum occurrence count of the component in all the individual scanpaths.
These constraints not only simplify the aggregation but also force the obtained scanpath to be more reasonable. The first constraint guarantees the aggregated scanpath is expressed at a higher level. The second and the third constraints ensure that the aggregated scanpath will not deviate too far from individual scanpaths. We propose two methods for scanpath aggregation.
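The DTW recursion and the average-distance objective of Equation (1) can be sketched in a few lines. This is a plain dynamic-programming version without any windowing, and the function names `dtw` and `average_dtw` are chosen for the example.

```python
import math

def dtw(a, b):
    """DTW distance between two scanpaths given as lists of (x, y) points."""
    m, n = len(a), len(b)
    # cost[i][j] holds the DTW distance between the first i points of a
    # and the first j points of b; the border is infinite except cost[0][0].
    cost = [[math.inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[m][n]

def average_dtw(candidate, scanpaths):
    """Objective of Equation (1): mean DTW distance to all individual scanpaths."""
    return sum(dtw(candidate, s) for s in scanpaths) / len(scanpaths)
```

A candidate with a smaller `average_dtw` value is closer to the "barycenter" of the scanpath set; repeated points cost nothing because DTW may align several components of one scanpath to a single component of the other.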
Heuristic Method. The heuristic method first constructs a candidate set for each AOI. The candidate set contains all the potential subsequent AOIs for a certain AOI. In other words, each AOI in the candidate set for a given AOI must follow that AOI in at least one individual scanpath. Then all the possible scanpaths are enumerated by extending scanpaths of 1 fixation to scanpaths of n fixations, where n is the specified maximum fixation number. A scanpath is extended by choosing an AOI from the candidate set of the last AOI on the scanpath and adding it to the end. When the occurrence count of a certain AOI is equal to its maximum occurrence count in individual scanpaths, the AOI is removed from the candidate set and thus will not appear in later enumerated scanpaths. Finally, the scanpath with the smallest average DTW distance from individual scanpaths is chosen from all the enumerated scanpaths as the representative. When n is large enough, we can get the theoretically optimal result for Equation (1), which provides a lower bound of the average distance.
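The enumeration under the contiguity and occurrence constraints can be sketched as below. The sketch assumes that individual scanpaths have already been converted to sequences of AOI labels; the function names are hypothetical, and the final selection by average DTW distance is left out.

```python
from collections import Counter, defaultdict

def build_constraints(aoi_seqs):
    """Candidate successors and maximum occurrence count of every AOI."""
    successors = defaultdict(set)
    max_count = Counter()
    for seq in aoi_seqs:
        # b may follow a only if it does so in some individual scanpath.
        for a, b in zip(seq, seq[1:]):
            successors[a].add(b)
        for aoi, c in Counter(seq).items():
            max_count[aoi] = max(max_count[aoi], c)
    return successors, max_count

def enumerate_scanpaths(aoi_seqs, n):
    """All AOI scanpaths of length 1..n that satisfy both constraints."""
    successors, max_count = build_constraints(aoi_seqs)
    frontier = [(a,) for a in sorted(max_count)]
    results = list(frontier)
    for _ in range(n - 1):
        extended = []
        for path in frontier:
            used = Counter(path)
            for b in sorted(successors[path[-1]]):
                # Skip AOIs that already reached their maximum occurrence.
                if used[b] < max_count[b]:
                    extended.append(path + (b,))
        results.extend(extended)
        frontier = extended
    return results
```

The number of enumerated scanpaths grows quickly with n and with the number of AOIs, which is why the method is costly in time and space.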
Candidate-constrained DTW Barycenter Averaging (CDBA) algorithm. Since the heuristic method is time and space consuming, we propose another algorithm for scanpath aggregation by imposing some constraints on the DTW Barycenter Averaging (DBA) algorithm (Petitjean, Ketterlin, & Gancarski, 2011) as an approximation (Li et al., 2017). CDBA also needs to construct a candidate set for each AOI and adjust the set members like the heuristic method. Then it defines an initial average scanpath as the reference scanpath and updates the reference scanpath iteratively. For each iteration, CDBA consists of two steps: computing DTW between every individual scanpath and the reference scanpath and updating the components of the reference scanpath.
- DTW computation. When computing DTW between two sequences, we can obtain the accumulation matrix and find the path of cost accumulation, which indicates the optimal alignment between sequences. The process of DTW computation is repeated between every actual scanpath and the reference scanpath.
- Scanpath update. In the update step, each component of the reference scanpath is updated by the "constrained barycenter" of fixations that are aligned to it during the computation process. The "constrained barycenter" means an AOI belonging to the candidate set and having the minimum average distance with all the aligned fixations.
The above two steps are repeated until the reference scanpath does not change. The process of CDBA is summarized in Algorithm 2.
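The two alternating steps can be illustrated with a simplified sketch. It is not Algorithm 2: the candidate set of every component is taken to be all AOI centers, so the contiguity and occurrence constraints are omitted, and the initial reference scanpath is supplied by the caller. The names `dtw_alignment`, `cdba_sketch` and `max_iter` are assumptions made for the example.

```python
import numpy as np

def dtw_alignment(ref, seq):
    """For each reference component, list the indices of seq aligned to it."""
    m, n = len(ref), len(seq)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(ref[i - 1] - seq[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack along the path of cost accumulation.
    assoc = [[] for _ in range(m)]
    i, j = m, n
    while i > 0 and j > 0:
        assoc[i - 1].append(j - 1)
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    return assoc

def cdba_sketch(init_idx, centers, scanpaths, max_iter=50):
    """Iterate DTW computation and scanpath update until the reference is stable.

    init_idx: AOI indices of the initial reference scanpath.
    centers: (k, 2) AOI centers; scanpaths: list of (n_i, 2) fixation arrays.
    """
    centers = np.asarray(centers, dtype=float)
    scanpaths = [np.asarray(s, dtype=float) for s in scanpaths]
    ref_idx = list(init_idx)
    for _ in range(max_iter):
        aligned = [[] for _ in ref_idx]
        for sp in scanpaths:
            assoc = dtw_alignment(centers[ref_idx], sp)
            for k, js in enumerate(assoc):
                aligned[k].extend(sp[j] for j in js)
        new_idx = []
        for pts in aligned:
            pts = np.asarray(pts)
            # AOI center with the minimum average distance to aligned fixations.
            mean_d = np.linalg.norm(centers[:, None, :] - pts[None, :, :], axis=2).mean(axis=1)
            new_idx.append(int(np.argmin(mean_d)))
        if new_idx == ref_idx:
            break
        ref_idx = new_idx
    return ref_idx
```

Because every reference component receives at least one aligned fixation from each individual scanpath, the update step is always well defined; `max_iter` only guards against cycling between equally good references.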

Gaze Duration Analysis
After scanpath aggregation, we obtain an aggregated scanpath that can tell us not only which areas draw our attention but also the priority of attraction. In this section, we aim to embed gaze duration into the aggregated scanpath. To specify how long an AOI can hold our attention, we transform each individual scanpath (of fixations) into an AOI sequence (of clusters) and statistically analyze the gaze duration of each AOI for all the individual scanpaths. The gaze duration of each AOI in the aggregated scanpath is obtained by averaging the gaze duration of the same AOI in all the individual sequences. Note that when we analyze AOI duration, one and the same AOI appearing more than once in a sequence is regarded as different AOIs and will be distinguished by their appearing order in the sequence.
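The duration analysis can be sketched as follows. The sketch assumes that consecutive fixations falling in the same AOI are merged into one visit whose duration is the sum of the fixation durations, and that the average for a visit is taken over the individual scanpaths that contain it; both choices, like the function names, are assumptions made for the example.

```python
from collections import defaultdict

def visit_durations(aois, durations):
    """Collapse one fixation-level scanpath into AOI visits.

    The k-th visit to an AOI is keyed as (aoi, k), so repeated visits to
    the same AOI are treated as different scanpath components.
    """
    visits = {}
    seen = defaultdict(int)
    prev, key = None, None
    for aoi, dur in zip(aois, durations):
        if aoi != prev:  # a new visit starts when the AOI changes
            key = (aoi, seen[aoi])
            seen[aoi] += 1
            visits[key] = 0.0
        visits[key] += dur
        prev = aoi
    return visits

def assign_durations(aggregated, individual):
    """Attach averaged visit durations to the aggregated AOI scanpath.

    individual: list of (aois, durations) pairs, one per subject.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for aois, durations in individual:
        for key, dur in visit_durations(aois, durations).items():
            totals[key] += dur
            counts[key] += 1
    seen = defaultdict(int)
    result = []
    for aoi in aggregated:
        key = (aoi, seen[aoi])
        seen[aoi] += 1
        # None marks a visit that no individual scanpath contains.
        result.append(totals[key] / counts[key] if counts[key] else None)
    return result
```

For instance, with one subject visiting A (300 ms in total), B (300 ms) and A again (400 ms) and another visiting A (100 ms) and B (500 ms), the aggregated scanpath A, B, A receives 200, 400 and 400 ms.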

Eye Tracking Study

Eye Tracking Data
To investigate the rationality of representative scanpaths, we conduct experiments on two large public eye-tracking data sets, namely the OSIE data set (Xu, Jiang, Wang, Kankanhalli & Zhao, 2014) and the MIT1003 data set (Judd, Ehinger, Durand, & Torralba, 2009).
- The OSIE data set contains 700 images. Each image is freely viewed by 15 subjects for 3 seconds. All the images are of the size 800 × 600 pixels.
- The MIT1003 data set includes 1003 scenes freely viewed by 15 subjects for 3 seconds. The longest dimension of each image is 1024 pixels.

Procedure
The key process in our framework is scanpath aggregation, which can be substituted by other methods like eMine (Eraslan et al., 2014), STA (Eraslan et al., 2016b), SPAM (Hejmady & Narayanan, 2012) and IOC (Jiang et al., 2016). The first three of them cannot directly operate on scanpaths consisting of fixations with coordinates and need to convert scanpaths into character strings. IOC also relies on some preprocessing steps for scanpath quantization. To make sure the comparison is fair, we adopt the same preprocessing step in our framework. The outlier removal process excludes on average 0.61 and 0.86 scanpaths per image for the OSIE and MIT1003 data sets, respectively. In addition, despite the outlier removal process, eMine still fails to produce any result for some images, so for the eMine algorithm, we only consider cases in which it has final outputs. For the SPAM algorithm, we set the minimum supporting number of subjects as half of the total number, which may lead to more than one frequent subsequence, so we choose from these frequent subsequences the one that is optimal with regard to Equation (1) as the representative scanpath. For the IOC algorithm, we adapt it for our framework by taking DTW as its distance function and choosing the scanpath with the smallest average DTW. For the heuristic method, we need to determine the specified maximum number n when enumerating all the possible scanpaths. Figure 3 shows average DTW varying with the given maximum length n. For both data sets, when n is equal to or larger than 8, the average DTW does not change and the heuristic method can get the theoretically best results. So the maximum number n is set as 8 in later discussion for the heuristic method unless otherwise stated.

Due to the high degree of viewing freedom, it is hard to define ground truth representative scanpaths. One way to evaluate the rationality of the obtained scanpath is to compare it against each individual scanpath with the standard string-edit algorithm as suggested by Eraslan et al. (2016b, 2016c). More sophisticated methods to compare scanpaths like ScanMatch (Cristino et al., 2010), MultiMatch (Jarodzka, Holmqvist, & Nyström, 2010), and ScanGraph (Dolezalova & Popelka, 2016) have also been developed, facilitating the evaluation.
In our experiment, the evaluation of representative scanpaths is conducted at three different levels:
- Scanpath length: scanpath length reflects the frequency of attention shift, so we compare the length distribution to check whether representative scanpaths can reflect this property;
- Scanpath shape: scanpath shape, partly influenced by scanpath length, is related to both spatial distribution and temporal order, which is measured by DTW in our experiment;
- Overall scanpath similarity: overall scanpath similarity comprehensively considers scanpath shape and gaze duration. ScanMatch and MultiMatch can provide such comparison.

Results
Analysis of Scanpath Length. Scanpath length reflects the frequency of attention shift. Figure 4 and Figure 5 analyze the length of representative scanpaths for both OSIE and MIT1003 datasets. From Figure 4 (a) and Figure 5 (a), we can find that length distributions of individual scanpaths are similar to a normal distribution, which indicates that for only a small number of images, people concentrate on certain areas (hardly shift) or roam over the whole image (frequently shift) while for most images the shift frequency is relatively stable, neither too large nor too small. Thus the bell-shaped property should also be reflected by representative scanpaths. Considering that all the representative scanpaths are AOI based while individual scanpaths are fixation based, the absolute values of scanpath length may be different but the bell-shaped property of scanpath length distribution should be kept. However, eMine, STA and SPAM fail to retain this property and obtain right-tailed distributions. All of them are more likely to get shorter representative scanpaths, which reflect the pattern that for most images, subjects tend to concentrate on certain areas and hardly shift their attention. IOC, CDBA and the heuristic method can keep the bell-shaped distributions.

Analysis of Scanpath Shape. In this part, we evaluate the ability of representative scanpaths to reflect attention distribution and attention shift, that is, the shape of representative scanpaths. We measure this ability by computing the average distance (DTW) between the representative scanpath and all the actually recorded scanpaths as suggested by Le Meur & Liu (2015). Quantitative results are shown in Table 1. A smaller DTW means a better result. The average DTW between representative scanpaths obtained by the heuristic method and all the recorded scanpaths is the smallest. In other words, the heuristic method produces the best solutions for Equation (1), followed by CDBA and IOC. The results of statistical analysis are presented in Table 2, which shows there is a significant difference between the results of our proposed "barycenter" based methods (CDBA and heuristic) and other methods.

Analysis of Overall Scanpath Similarity.
In this part, we estimate and assign gaze duration to scanpaths obtained in the aggregation step. Note that none of the existing algorithms except for STA have discussed representative scanpaths with gaze duration. Even though STA employs the duration information when identifying trending elements, it still focuses on the analysis of trending scanpaths and does not further analyze gaze duration. To make fair comparisons, we combine our gaze duration analysis method with all the methods proposed for scanpath aggregation, i.e., eMine, SPAM, STA and IOC. The overall scanpath similarity is evaluated by MultiMatch (Jarodzka et al., 2010) and ScanMatch (Cristino et al., 2010). MultiMatch compares scanpaths from five aspects: vector similarity, direction similarity, length similarity, position similarity and duration similarity. ScanMatch only outputs an integrated score reflecting order consistency, spatial proximity and duration similarity. The parameters involved in the ScanMatch implementation are set as follows: Xbin = 24, Ybin = 18, Threshold = 3.5, GapValue = 0, TempBin = 100 (TempBin = 0 when duration is not taken into account). We compare the representative scanpath with each actually recorded scanpath using both algorithms and compute the average scores. Table 3 shows the results on both datasets. The larger the scores, the better the results. Our proposed methods (CDBA* and Heuristic*) still outperform eMine, STA and SPAM, but the advantages of our methods over IOC are not so obvious. Then we further conduct a statistical test on the ScanMatch results (with duration). The difference between the proposed methods and the first three methods, i.e., eMine, STA and SPAM, is significant on both data sets but this is not the case with IOC. It can be seen that although the heuristic method can get a smaller average distance in terms of DTW, its MultiMatch and ScanMatch scores are neck and neck with those of CDBA and IOC on both datasets. This may be caused by the fact that DTW directly takes Euclidean distance as elements in the cost matrix while both MultiMatch and ScanMatch conduct scanpath simplification or quantization before comparison.

Summary
In our experiment, we can regard the adaptation of IOC as constructing a candidate set that contains AOI-level scanpaths transformed from individual fixation-level scanpaths. In other words, IOC actually finds an optimal solution of Equation (1) under stricter constraints. In addition, CDBA and the heuristic method are also based on Equation (1), and the outputs of CDBA can actually be regarded as approximations of the heuristic results. Compared with the heuristic method, IOC chooses from a smaller candidate set while CDBA searches the set in a more efficient way, but these three algorithms share a similar idea, choosing a scanpath from a candidate scanpath set as the representative. In this sense, all the algorithms we discussed above can be categorized as follows: (1) "barycenter" based: IOC, CDBA, heuristic; (2) subsequence based: eMine, SPAM; (3) others: STA.
When evaluated by scanpath length, the "barycenter" based method can well keep the bell shaped distribution of scanpath length. The comparison by DTW also indicates that all the "barycenter" based methods can produce representative scanpaths similar to actually recorded individual scanpaths in scanpath shape. As for overall scanpath similarity, the "barycenter" based methods improve the performance by a large margin over others, which consolidates that "barycenter" based aggregated scanpaths are more suitable to be combined with gaze duration to get final representative scanpaths. In summary, representative scanpaths obtained by "barycenter" based methods can better describe viewing patterns.

Interpretation of Representative Scanpaths
Figure 6 shows the aggregated scanpaths obtained by different algorithms. In Figure 6, red circles represent AOIs. Yellow arrows indicate the direction and numbers indicate the order. Images 1009 and 1033 respectively contain only one conspicuous foreground object and three objects without many distractors in the background, while images 1263 and 1270 both contain multiple objects with complex backgrounds. eMine, STA and SPAM obviously produce shorter scanpaths that may not be able to reflect complete viewing patterns. In particular, eMine only identifies one common AOI in all the individual scanpaths and fails to provide any information about attention shift for images 1009, 1033 and 1263. The "barycenter" based methods (IOC, CDBA and the heuristic method) produce identical results for images 1009 and 1263. For image 1009, the representative scanpaths show that attention is first attracted by the dog's head, then transferred to the body and finally goes back to the head. For image 1263, the pattern is that subjects are first attracted by faces, then linger between faces, next explore objects with which the female and the male are interacting (the food they are eating), and finally redirect their attention to human faces. For images 1033 and 1270, representative scanpaths obtained by IOC, CDBA and the heuristic method are a little different. It is difficult to conclude which scanpath can better describe the viewing pattern since they actually contain some common segments. Take image 1270 for example: all three representative scanpaths start with an AOI located near the image center, which is consistent with the well-known center bias. The main difference between the obtained patterns lies in the priority of the AOI on the zip-top can and the AOI on the computer screen. The heuristic method and CDBA prioritize the AOI on the zip-top can while IOC does the opposite. Note that there are some letters on the can. Considering text is a top-down factor capable of guiding visual attention (Ramanishka, Das, Zhang, & Saenko, 2017), the pattern obtained by the heuristic method and the CDBA algorithm may be more reasonable. In addition, although we do not have any so-called ground truth viewing pattern, the identified patterns seem to be congruent with human intuition and some verified findings such as center bias and top-down effects, whether there are one or several foreground objects, simple or complex backgrounds. However, in some cases where the priorities of different visual stimuli are not clear (e.g., image 1033), the identified patterns can only provide limited knowledge.

Figure 7 visualizes the representative scanpaths obtained by our proposed methods (CDBA and heuristic) with duration pattern for image i1182314083 from the MIT1003 data set (Judd et al., 2009). The radius of red circles is proportional to the total gaze duration on the corresponding AOI. Figure 8 shows the duration patterns of individual scanpaths. It can be seen that the duration pattern of the representative scanpath is visually consistent with the duration pattern of individual scanpaths and can reflect the group trend from an overall perspective.

Discussion
In this article, we extend our previous framework to identify representative scanpaths from multiple individual scanpaths for natural images. Unlike most existing work, we also analyze the duration pattern. The proposed framework consists of three steps: eye-gaze data preprocessing, scanpath aggregation and gaze duration analysis. Experiments demonstrate that our proposed framework is able to identify representative scanpaths reflecting group viewing patterns on natural images.
Based on the algorithms used for scanpath aggregation, we further categorize representative scanpaths as follows: (1) "barycenter" based; (2) subsequence based; (3) others. Some algorithms are specially designed to identify viewing patterns on a specific kind of visual stimuli, so their performance degrades when they are applied to other kinds of stimuli. For natural images, we find that "barycenter" based representative scanpaths are the closest to individual scanpaths. Such representative scanpaths for natural images are useful in various fields. For example, computer vision researchers who build saccadic models to predict human scanpaths need a reliable ground truth scanpath against which predicted scanpaths can be compared. In addition, it is much easier to visualize and analyze one representative scanpath than multiple individual scanpaths that largely overlap, which makes it possible to validate assumptions about visual attention and eye movements such as center bias and top-down bias. The representative scanpath with duration pattern can also give a hint about what first grabs visual attention and what holds attention for a long period, providing knowledge about what kinds of images are obvious visual attractors.
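One way to make "closest to individual scanpaths" concrete is to compare the mean distance from each candidate representative scanpath to the individual scanpaths. The sketch below assumes that scanpaths are sequences of (x, y) AOI centers and uses plain dynamic time warping (DTW) with Euclidean point cost as the distance; this choice, the function names and the toy coordinates are assumptions for illustration and may differ from the measure used in our experiments.

```python
import math

def dtw(a, b):
    """DTW cost between two sequences of (x, y) points (Euclidean point cost)."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def mean_distance(candidate, scanpaths):
    """Average DTW cost from a candidate representative to each individual scanpath."""
    return sum(dtw(candidate, s) for s in scanpaths) / len(scanpaths)

# Hypothetical individual scanpaths and two candidate representatives.
individual = [
    [(0, 0), (3, 4)],
    [(0, 0), (0, 0), (3, 4)],
]
candidate_full = [(0, 0), (3, 4)]   # keeps the attention shift
candidate_short = [(0, 0)]          # a single common AOI only

full_score = mean_distance(candidate_full, individual)
short_score = mean_distance(candidate_short, individual)
```

In this toy case the candidate that retains the attention shift has the lower mean distance, while the single-AOI candidate is penalized for every fixation it leaves unexplained.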
However, our work has some limitations. For example, the eye tracking data set only involves 15 participants, which means there are at most 15 scanpaths for each image, so it is necessary to construct a much larger data set with more participants. A larger data set would in turn raise challenges for the proposed algorithms, such as how to efficiently determine the initial reference scanpath for CDBA and how to reduce the space and time cost of the heuristic method. In addition, we use a data-driven approach to obtain AOIs, but it could be better to associate AOIs with semantically meaningful objects. The incorporation of semantic segmentation in the preprocessing step needs further investigation.

Conclusions
Eye tracking data provide insights into how humans perceive and explore their surroundings. Traditional methods for analyzing scanpaths target a specific kind of viewing stimuli such as web pages and neglect the duration pattern, so the scanpaths obtained by such methods are not able to reflect the viewing pattern on natural images correctly or comprehensively. In this paper, we extend our previous framework to identify representative scanpaths, considering temporal order, spatial distribution and gaze duration. The framework consists of three steps: eye-gaze data preprocessing, scanpath aggregation and gaze duration analysis. The second step is the key to representative scanpath identification and can be replaced by traditional methods such as eMine. Based on the algorithms chosen, we further categorize the obtained representative scanpaths as subsequence based, "barycenter" based and others. Experiments demonstrate that our framework serves the purpose of generalizing viewing patterns well and that the "barycenter" based representative scanpaths describe the patterns better.

Ethics and Conflict of Interest
The author(s) declare(s) that the contents of the article are in agreement with the ethics described in http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html and that there is no conflict of interest regarding the publication of this paper.