Information Aggregation for Multi-Head Attention with Routing-by-Agreement
Refereed conference paper presented and published in conference proceedings


Other information
Abstract: Multi-head attention is appealing for its ability to jointly extract different types of information from multiple representation subspaces. Concerning the information aggregation, a common practice is to use a concatenation followed by a linear transformation, which may not fully exploit the expressiveness of multi-head attention. In this work, we propose to improve the information aggregation for multi-head attention with a more powerful routing-by-agreement algorithm. Specifically, the routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on linguistic probing tasks and machine translation tasks prove the superiority of the advanced information aggregation over the standard linear transformation.
All Author(s) List: Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, Zhaopeng Tu
Name of Conference: 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
Start Date of Conference: 02/06/2019
End Date of Conference: 07/06/2019
Place of Conference: Minnesota, USA
Country/Region of Conference: United States of America
Proceedings Title: Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year: 2019
Month: 6
Pages: 3566-3575
Languages: English-United States
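
Illustrative sketch: the abstract describes routing as iteratively updating how much of each part is assigned to each whole, based on their agreement. The pure-Python example below shows one generic way such a loop can look, modelled on the dynamic routing used in capsule networks. It is not taken from the paper: the names `route_by_agreement`, `votes`, `softmax`, and `squash` are hypothetical, the dot product as the agreement measure is an assumption, and how the vote vectors would be derived from attention heads is not specified in this record.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(vec):
    """Capsule-style squashing: keeps direction, maps the norm into [0, 1)."""
    sq_norm = sum(v * v for v in vec)
    if sq_norm == 0.0:
        return [0.0 for _ in vec]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * v for v in vec]


def route_by_agreement(votes, num_iters=3):
    """votes[i][j] is the vector that part i proposes for whole j.

    Assumes num_iters >= 1. Returns (wholes, coupling), where
    coupling[i][j] is the proportion of part i assigned to whole j
    that was used to build the returned wholes.
    """
    n_parts, n_wholes = len(votes), len(votes[0])
    dim = len(votes[0][0])
    # Routing logits start at zero, so every part is split evenly at first.
    logits = [[0.0] * n_wholes for _ in range(n_parts)]
    for _ in range(num_iters):
        # Each part distributes itself over the wholes (each row sums to 1).
        coupling = [softmax(row) for row in logits]
        # Each whole is the squashed, coupling-weighted sum of its votes.
        wholes = [
            squash([
                sum(coupling[i][j] * votes[i][j][d] for i in range(n_parts))
                for d in range(dim)
            ])
            for j in range(n_wholes)
        ]
        # Agreement (assumed here: dot product) raises or lowers the logits.
        for i in range(n_parts):
            for j in range(n_wholes):
                logits[i][j] += sum(
                    votes[i][j][d] * wholes[j][d] for d in range(dim)
                )
    return wholes, coupling


# Two parts that agree on whole 0 and contradict each other on whole 1.
example_votes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, -1.0]],
]
example_wholes, example_coupling = route_by_agreement(example_votes)
```

In this toy input the two parts propose nearly the same vector for the first whole and opposite vectors for the second, so the routing shifts their assignment towards the first whole over the iterations.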

Last updated on 2019-10-17 at 17:30