MS-Transformer: Introduce multiple structural priors into a unified transformer for encoding sentences

Bibliographic Details
Published in: Computer Speech & Language, Vol. 72, p. 101304
Main Authors: Qi, Le; Zhang, Yu; Yin, Qingyu; Liu, Ting
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01-03-2022
Description
Summary: Transformers have been widely utilized in recent NLP studies. Unlike CNNs or RNNs, the vanilla Transformer is position-insensitive and thus incapable of capturing the structural priors between sequences of words. Existing studies commonly apply a single mask strategy to Transformers to incorporate structural priors, failing to model the richer structural information of texts. In this paper, we aim to introduce multiple types of structural priors into Transformers, proposing the Multiple Structural Priors Guided Transformer (MS-Transformer), which maps different structural priors onto different attention heads through a novel multi-mask based multi-head attention mechanism. In particular, we integrate two categories of structural priors: the sequential order and the relative position of words. To capture the latent hierarchical structure of texts, we extract this information not only from word contexts but also from dependency syntax trees. Experimental results on three tasks show that MS-Transformer achieves significant improvements over other strong baselines.
Highlights:
• Multi-mask strategies can introduce different priors into different attention heads.
• Multi-mask strategies can guide models to learn more precise dependencies.
• The sequential order and relative position of words are taken as structural priors.
• Structural priors help models capture sentence structure from multiple aspects.
ISSN: 0885-2308, 1095-8363
DOI: 10.1016/j.csl.2021.101304
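As a reading aid, below is a minimal PyTorch sketch of the multi-mask based multi-head attention idea described in the abstract: a self-attention layer in which each head receives its own boolean mask, so different heads can be guided by different structural priors. This is not the authors' implementation; the specific masks shown (forward/backward sequential-order masks and local relative-position windows), the class and function names, and the tensor shapes are illustrative assumptions, and the dependency-syntax-tree priors mentioned in the abstract are omitted.

```python
# Minimal sketch (not the paper's released code) of multi-mask multi-head attention:
# each attention head applies its own structural mask before the softmax.
import math
import torch
import torch.nn as nn


def forward_mask(seq_len: int) -> torch.Tensor:
    """Sequential-order prior: attend only to the current and earlier positions."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


def backward_mask(seq_len: int) -> torch.Tensor:
    """Sequential-order prior in the opposite direction."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))


def window_mask(seq_len: int, radius: int = 2) -> torch.Tensor:
    """Relative-position prior: attend only within a local window of +-radius tokens."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= radius


class MultiMaskAttention(nn.Module):
    """Multi-head self-attention where every head uses its own boolean mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_masks: list) -> torch.Tensor:
        # x: (batch, seq_len, d_model); head_masks: one (seq_len, seq_len) bool mask per head
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)      # (b, heads, t, t)
        mask = torch.stack(head_masks).to(x.device)                    # (heads, t, t)
        scores = scores.masked_fill(~mask.unsqueeze(0), float("-inf")) # disallowed positions
        return self.out((scores.softmax(-1) @ v).transpose(1, 2).reshape(b, t, -1))


# Usage: four heads, each guided by a different structural prior.
x = torch.randn(2, 8, 64)
attn = MultiMaskAttention(d_model=64, n_heads=4)
masks = [forward_mask(8), backward_mask(8), window_mask(8, 1), window_mask(8, 3)]
out = attn(x, masks)  # (2, 8, 64)
```

The only point of the sketch is the mechanism itself: because the mask is applied per head rather than once for the whole layer, distinct heads can encode distinct structural priors within a single attention block.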