Multimodal Error Correction with Natural Language and Pointing Gestures

Stefan Constantin, Fevziye Irem Eyiokur, Dogucan Yaman, Leonard Bärmann, Alex Waibel; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 1976-1986

Abstract


Error correction is crucial in human-computer interaction, as it can provide supervision for incrementally learning artificial intelligence. If a system maps entities such as objects or persons of an unknown class to inappropriate existing classes, or misrecognizes entities from known classes when the train-test discrepancy is too high, error correction is a natural way for a user to improve the system. For an agent with visual perception, if such an entity is in the system's view, pointing gestures can dramatically simplify the error correction. Therefore, we propose a modularized system for multimodal error correction using natural language and pointing gestures. First, pointing line generation and region proposal detect whether there is a pointing gesture and, if so, which candidate objects (i.e., regions of interest, RoIs) are on the pointing line. Second, these RoIs (if any) and the user's utterances are fed into a VL-T5 network to extract and link both the class name and the corresponding RoI of the referred entity, or to output that there is no error correction. In the latter case, the utterances can be passed to a downstream component for Natural Language Understanding. We use additional, challenging annotations for an existing real-world pointing gesture dataset to evaluate our proposed system. Furthermore, we demonstrate our approach by integrating it into a real-world steerable laser pointer robot, enabling interactive multimodal error correction and thus incremental learning of new objects.
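
The first stage described in the abstract, selecting the candidate RoIs that lie on the pointing line, can be illustrated with a minimal sketch. The abstract does not specify how the paper computes this, so everything below is an assumption made for illustration: the pointing line is modelled as a 2D ray in image coordinates, RoIs are axis-aligned boxes, and the standard slab test for ray-box intersection is swapped in for whatever the authors use. The names `ray_hits_box` and `rois_on_pointing_line` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed RoI layout: (x_min, y_min, x_max, y_max) in image coordinates.
Box = Tuple[float, float, float, float]


def ray_hits_box(origin: Sequence[float], direction: Sequence[float], box: Box) -> bool:
    """Slab test: does the ray origin + t * direction (t >= 0) cross the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box[:2], box[2:]):
        if abs(d) < 1e-12:
            # Ray is parallel to this axis: it must already lie inside the slab.
            if o < lo or o > hi:
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        # Shrink the parameter interval in which the ray is inside all slabs.
        t_near = max(t_near, t1)
        t_far = min(t_far, t2)
        if t_near > t_far:
            return False
    return True


def rois_on_pointing_line(
    origin: Sequence[float], direction: Sequence[float], rois: List[Box]
) -> List[int]:
    """Return the indices of all RoIs intersected by the pointing ray."""
    return [i for i, box in enumerate(rois) if ray_hits_box(origin, direction, box)]


if __name__ == "__main__":
    # Hypothetical example: a ray from the image origin along the diagonal.
    boxes = [(2.0, 2.0, 4.0, 4.0), (5.0, 0.0, 6.0, 1.0), (-3.0, -3.0, -1.0, -1.0)]
    print(rois_on_pointing_line((0.0, 0.0), (1.0, 1.0), boxes))
```

In this sketch, an empty result corresponds to the "if any" case in the abstract, where no RoI is passed on together with the user's utterances. Boxes behind the ray origin are rejected because the parameter t is restricted to non-negative values.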

Related Material


[bibtex]
@InProceedings{Constantin_2023_ICCV,
    author    = {Constantin, Stefan and Eyiokur, Fevziye Irem and Yaman, Dogucan and B\"armann, Leonard and Waibel, Alex},
    title     = {Multimodal Error Correction with Natural Language and Pointing Gestures},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {1976-1986}
}