Sound3DVDet: 3D Sound Source Detection using Multiview Microphone Array and RGB Images


Spatial localization of 3D sound sources is an important problem in many real-world scenarios, especially when the sources have no visually distinguishable characteristics, e.g., finding a gas leak or a malfunctioning motor. In this paper, we cast this task in a novel audio-visual setting by introducing an acoustic-camera rig consisting of a centered pinhole RGB camera and a uniform circular array of four coplanar microphones. Using this setup, we propose Sound3DVDet, a Transformer-based 3D sound source localization model that treats this task as a set prediction problem. It first learns a set of initial sound source locations (dubbed queries) from a single view of the microphone array signal, then feeds the query set to a sequence of Transformer-like layers for refinement. At each layer, every query repeatedly aggregates sound source cues from the other views. We deeply supervise the initial sound source queries, the intermediate layer queries, and the final output by measuring their respective discrepancies from the ground-truth source locations via bipartite matching. To evaluate our method, we introduce a new dataset, the Sound3DVDet Dataset, consisting of nearly 6k scenes produced using the SoundSpaces simulator. We conduct extensive experiments on this dataset and show the efficacy of our approach against closely related methods, demonstrating significant improvements in localization accuracy.
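The DETR-style set-prediction supervision described above pairs each predicted source with a ground-truth source via bipartite matching before computing the loss. The sketch below illustrates the idea with a brute-force matcher over permutations using Euclidean distance as the cost; this is only an illustrative stand-in (a real implementation would use the Hungarian algorithm, and the paper's matching cost may include additional terms such as source presence).

```python
from itertools import permutations
import math

def match_queries(pred, gt):
    """Bipartite matching between predicted and ground-truth 3D source
    locations, as used in set-prediction losses.

    Brute-force over permutations: fine for the handful of sources in a
    scene, but a Hungarian solver (e.g. scipy's linear_sum_assignment)
    would be used in practice. Assumes len(pred) == len(gt); this is an
    illustrative sketch, not the paper's exact formulation.
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(gt))):
        # Total Euclidean distance if prediction i is assigned to gt perm[i].
        cost = sum(math.dist(pred[i], gt[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Two predictions listed in the opposite order of the ground truth:
# the matcher recovers the permutation (1, 0) with zero total distance.
assignment, cost = match_queries(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
)
```

After matching, the per-pair distances (and any classification terms) are summed into the training loss; applying this at every refinement layer yields the deep supervision described above.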