Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos


To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, where an arm robot executes a DMP sequence acquired from a cooking video using the audio-visual Transformer. Experiments with Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences, reaching 2.3 times the METEOR score obtained with a baseline video-to-action Transformer. The model achieved a task success rate of 32% when given task knowledge of the object.


  •  Hori, C., Peng, P., Harwath, D., Liu, X., Ota, K., Jain, S., Corcodel, R., Jha, D.K., Romeres, D., Le Roux, J., "Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos", arXiv, June 2023.
