
paper review: Multimodal Transformer for Unaligned Multimodal Language Sequences



Multimodal Transformer for Unaligned Multimodal Language Sequences



The authors want to understand how we associate a voice with a face. The paper builds extensively on the VGGFace and VoxCeleb datasets. Its main contributions can be summarized as follows:

  1. Introduce a CNN for binary and multi-way matching of faces with audio.
  2. Use different audio clips to identify the dynamic speaker.
  3. Show that the CNN matches human performance on easy examples (different genders) but exceeds human judgment on hard examples (faces that share the same gender, age, and nationality).

Abstract (Chinese)


Research Objective

We examine whether faces and voices encode redundant identity information, and measure to what extent they do.

Background and Problems

Main work

  1. We provide an extensive human-subject study, with both a larger participant pool and a larger dataset than earlier studies.
  2. We learn a co-embedding of the modal representations of human faces and voices, and evaluate the learned representations extensively, revealing unsupervised correlations to demographic, prosodic, and facial features.
  3. We present a new dataset of audiovisual recordings of speeches by 181 individuals with diverse demographic backgrounds, totaling over 3 hours of recordings, with demographic annotations.
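
To make the co-embedding idea in item 2 more concrete, here is a minimal sketch of N-way voice-to-face matching. It assumes that some face and voice encoders have already mapped their inputs into a shared vector space, and that candidates are ranked by cosine similarity. Both are illustrative assumptions, not details confirmed by the paper; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_voice_to_face(voice_embedding, face_embeddings):
    """Pick the candidate face whose embedding is closest to the voice embedding.

    Returns (index of the best candidate, list of similarity scores).
    With two candidates this is the binary task; with more it is multi-way.
    """
    scores = [cosine_similarity(voice_embedding, face) for face in face_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy embeddings (hypothetical values, standing in for encoder outputs).
voice = [0.9, 0.1, 0.0]
faces = [
    [0.0, 1.0, 0.0],  # candidate 0: points in a different direction
    [1.0, 0.2, 0.0],  # candidate 1: nearly parallel to the voice vector
]

best_index, scores = match_voice_to_face(voice, faces)
print("best candidate:", best_index)
```

In this toy setup the second candidate is chosen because its vector is almost parallel to the voice vector. A real system would learn the encoders so that embeddings of the same person end up close together.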

Limitations: the authors' own dataset is not large enough.

Related work



  1. First, with human subjects, establishing a baseline for how well people perform such tasks.
  2. Second, on machines using deep neural networks, demonstrating that machines perform on par with humans.
  1. However, we emphasize that, similar to lie detectors, such associations should not be used for screening purposes or as hard evidence. Our work suggests the possibility of learning the associations by referring to a part of the human cognitive process, but not their definitive nature, which we believe would be far more complicated than it is modeled as in this work.
  1. Not reflected.


Inspiration for me

Source: https://blog.csdn.net/liupeng19970119/article/details/113784089