A final aspect of our work on language concerns the multimodality of face-to-face conversations, and even of video conferences, in which gestures and other meaningful movements, as well as objects and actions visible in the context, can be semantically relevant. Understanding how the nonlinguistic context adds content to a conversation and, conversely, how the content of a conversation can help us understand what is going on in the visual scene will be crucial for developing models of conversation sophisticated enough to support natural conversation between humans, on the one hand, and assistants or embodied agents such as collaborative robots, on the other.