Visual Complexity and Its Effects on Referring Expression Generation.
In multimodal human-machine conversation, successfully interpreting human attention is critical. While attention has been studied extensively in both linguistic and visual processing, it is not clear how linguistic attention aligns with visual attention in multimodal conversational interfaces. To address this issue, we conducted a preliminary investigation of how attention reflected in linguistic discourse aligns with attention indicated by gaze fixations during human-machine conversation. Our empirical findings show that entities that are more attended in the linguistic discourse correspond to a higher intensity of gaze fixations, and that the smoother a linguistic transition is, the smaller the distance between the corresponding fixation distributions. These findings provide insight into how language and gaze can be combined to predict attention, which has important implications for tasks such as word acquisition and object recognition.