Rethinking the Form of Latent States in Image Captioning
Refereed conference paper presented and published in conference proceedings


Other information
Abstract: RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by curiosity about a question: how do the spatial structures in the latent states affect the resultant captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between the input visual domain and the output linguistic domain.
Publisher acceptance date: 12.07.2018
Authors: Bo Dai, Deming Ye, Dahua Lin
Conference name: 15th European Conference on Computer Vision, ECCV 2018
Conference start date: 08.09.2018
Conference end date: 14.09.2018
Conference location: Munich, Germany
Conference country/region: Germany
Proceedings title: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Year of publication: 2018
Month: 9
Volume: 11209
Publisher: Springer
Pages: 294 - 310
ISBN: 978-303001227-4
ISSN: 03029743
Language: American English

Last updated on 2021-12-01 at 01:15