TR2018-195

Automatic Evaluation of End-to-End Dialog Systems with Adequacy-Fluency Metrics


    •  d’Haro, L.F., Banchs, R., Hori, C., Li, H., "Automatic Evaluation of End-to-End Dialog Systems with Adequacy-Fluency Metrics", Special issue on DSTC6 in Computer Speech and Language, DOI: 10.1016/j.csl.2018.12.004, Vol. 55, pp. 200-215, March 2019.
      @article{dHaro2019mar,
        author = {d’Haro, Luis Fernando and Banchs, Rafael and Hori, Chiori and Li, Haizhou},
        title = {Automatic Evaluation of End-to-End Dialog Systems with Adequacy-Fluency Metrics},
        journal = {Special issue on DSTC6 in Computer Speech and Language},
        year = 2019,
        volume = 55,
        pages = {200--215},
        month = mar,
        publisher = {Elsevier},
        doi = {10.1016/j.csl.2018.12.004},
        url = {https://www.merl.com/publications/TR2018-195}
      }
  • Research Area: Speech & Audio

Abstract:

End-to-end dialog systems are gaining interest due to recent advances in deep neural networks and the availability of large human-human dialog corpora. However, although it is fundamental to systematically improving the performance of such systems, automatic evaluation of the generated dialog utterances remains an unsolved problem. Indeed, most of the proposed objective metrics show low correlation with human evaluations. In this paper, we evaluate a two-dimensional metric designed to operate at the sentence level, which considers the syntactic and semantic information carried by the answers generated by an end-to-end dialog system with respect to a set of references. The proposed metric, when applied to outputs generated by the systems participating in track 2 of the DSTC-6 challenge, shows a higher correlation with human evaluations (up to 12.8% relative improvement at the system level) than the best of the alternative state-of-the-art automatic metrics currently available.
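
The system-level comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes, since the abstract does not say, that sentence-level scores are averaged per system and that agreement with human judgments is measured with Pearson correlation. All function names are hypothetical and are not taken from the paper.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def system_level_correlation(
    metric_scores: Dict[str, List[float]],
    human_scores: Dict[str, List[float]],
) -> float:
    """Correlate per-system mean metric scores with per-system mean human scores.

    Assumption: averaging sentence-level scores is the aggregation step;
    the paper may aggregate differently.
    """
    if set(metric_scores) != set(human_scores):
        raise ValueError("both inputs must cover the same systems")
    systems = sorted(metric_scores)
    metric_means = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
    human_means = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]
    return pearson(metric_means, human_means)


def relative_improvement(new_corr: float, baseline_corr: float) -> float:
    """Relative improvement of one correlation over another, in percent."""
    if baseline_corr == 0:
        raise ValueError("baseline correlation must be non-zero")
    return 100.0 * (new_corr - baseline_corr) / abs(baseline_corr)
```

Under these assumptions, a figure such as "12.8% relative improvement" would correspond to `relative_improvement(new_corr, baseline_corr)`, where `baseline_corr` is the system-level correlation of the best competing metric.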