The ability to reliably transcribe spoken conversation opens up the possibility of exploiting transcribed data for tasks that require advanced language understanding capabilities, ranging from automatic summarization, to more fruitful dialogues with artificial assistants, to full-fledged situated interactions with assistants able to exploit information from the visual context during conversation. In collaboration with academic and industrial research partners, our team is strongly invested in developing innovative models of language understanding that draw on our solid experience in machine learning and on a hybrid approach to research that brings linguistic expertise to bear on machine learning algorithms.
Our research on automatic summarization has led to improved models of lexical importance and discourse similarity that allow us to more reliably identify the utterances most central to a conversation. To extend this work to models capable of producing detailed summaries and meeting minutes, we are currently working on algorithms to track how utterances relate to one another in a conversation: does an utterance serve to answer a question that was asked, to provide an explanation of something that was said, or to correct or disagree with an argument that was put forward, for example? Identifying such relations involves integrating our work on summarization with our work on dialogue and discourse modeling.
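To make the idea of identifying central utterances concrete, the sketch below scores each utterance in a conversation by how strongly it overlaps lexically with the rest, a simple degree-centrality heuristic. This is only a minimal illustration of the general technique, not the team's actual models; the function names and the toy meeting transcript are invented for the example.

```python
# Minimal sketch: rank utterances by lexical centrality.
# An utterance that shares vocabulary with many other turns is
# treated as more central to the conversation.
import math
from collections import Counter

def tokenize(utterance):
    """Lowercase, punctuation-stripped word tokens (illustrative only)."""
    return [w.lower().strip(".,?!") for w in utterance.split()]

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def centrality(utterances):
    """Score each utterance by its summed similarity to all other turns."""
    toks = [tokenize(u) for u in utterances]
    return [
        sum(cosine(toks[i], toks[j]) for j in range(len(toks)) if j != i)
        for i in range(len(toks))
    ]

def top_k(utterances, k=2):
    """Return the k most central utterances, in original order."""
    scores = centrality(utterances)
    ranked = sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
    return [utterances[i] for i in sorted(ranked[:k])]
```

In practice the team's models of lexical importance would replace the raw word-overlap similarity used here, but the shape of the computation, a graph of utterances scored by connectivity, is the same.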
Drawing on our team’s expertise in modeling discourse structure, our research on dialogue extends work on discourse parsing for text and chat to model multi-party, spoken conversation. Progress in discourse parsing is greatly hindered by a dearth of annotated conversational data, as well as by the need for linguistic expertise to exploit it. We are currently tackling both of these problems with an approach to weak supervision that allows expert annotators to study a small but representative sample of data and write labeling rules that can be used to automatically annotate large data sets. This approach allows us to easily incorporate heterogeneous sources of information that can be useful for dialogue modeling, from discursive cues to acoustic information.
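The weak-supervision approach described above can be sketched as a set of labeling functions, each encoding one expert-written rule that votes for a discourse relation between two turns or abstains, with a simple majority vote aggregating the votes. This is a hand-rolled sketch under assumed cue words and invented label names, not the team's actual rule set or aggregation model.

```python
# Minimal sketch of rule-based weak supervision for discourse relations.
# Each labeling function inspects a (previous, current) turn pair and
# either votes for a relation label or abstains (returns None).
ABSTAIN = None

def lf_question_answer(prev, curr):
    # Rule: a turn following a question likely answers it.
    return "question_answer" if prev.rstrip().endswith("?") else ABSTAIN

def lf_correction(prev, curr):
    # Rule: turns opening with a contrastive cue likely correct the previous one.
    cues = ("no,", "actually", "that's not")
    return "correction" if curr.lower().startswith(cues) else ABSTAIN

def lf_explanation(prev, curr):
    # Rule: turns opening with "because" likely explain the previous one.
    return "explanation" if curr.lower().startswith("because") else ABSTAIN

LABELING_FUNCTIONS = [lf_question_answer, lf_correction, lf_explanation]

def label_pair(prev, curr):
    """Aggregate non-abstaining votes by majority; abstain if no rule fires."""
    votes = [lf(prev, curr) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

Real weak-supervision pipelines typically weight labeling functions by their estimated accuracy rather than taking a flat majority vote, and the rules themselves could draw on acoustic features (pauses, prosody) just as easily as on the lexical cues shown here.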
A final aspect of our work on language concerns the multimodality of face-to-face conversations, or even video conferences, in which gestures or other meaningful movements, as well as objects and actions visible in the context, can be semantically relevant. Understanding how the nonlinguistic context adds content to a conversation, and conversely, how the content of a conversation can help us understand what is going on in the visual scene, will be crucial for developing models of conversation sophisticated enough to facilitate natural conversation between humans on the one hand and assistants or embodied agents, such as collaborative robots, on the other.