New Ideas and Trends in Deep Multimodal Content Understanding: A Review
This survey focuses on the analysis of two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning, in which monomodal image classifiers such as VGG, ResNet, and the Inception module are the central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial networks, and their variants. These models go beyond simple image classifiers in that they can perform both uni-directional multimodal tasks (e.g., image captioning, image generation) and bi-directional ones (e.g., cross-modal retrieval, visual question answering). In addition, we analyze two aspects of the challenge of achieving better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we outline several promising directions for future research.