Attention Is All You Need
Vaswani et al., 2017
Summary
- Proposes a sequence transduction model based solely on attention mechanisms, instead of complex recurrent or convolutional neural networks
- Outperforms state-of-the-art models on machine translation tasks at a fraction of the training cost
- Links: [ website ] [ pdf ]
Background
- RNNs have been firmly established as the SotA approaches for sequence modeling and transduction problems (e.g. language modeling, machine translation)
- RNNs factor computation along symbol positions; this inherently sequential nature precludes parallelization within training examples
- The Transformer model relies entirely on attention mechanisms instead of recurrence to model dependencies between input and output
- Attention mechanisms allow modeling of dependencies regardless of the distance in the sequence
Methods
- Encoder: Maps an input sequence of symbols to a sequence of continuous representations
- Composed of a stack of $N$ identical layers, each with two sub-layers: multi-head self-attention and position-wise fully connected feed-forward network
- Residual connection around each of the sub-layers, followed by layer normalization (see the sketch below)
- All sub-layers in the model and the embedding layers produce outputs of the same dimension, $d_{model} = 512$ in the base model
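A minimal NumPy sketch of the residual connection plus layer normalization wrapped around each sub-layer; the helper names are illustrative, and layer normalization's learnable gain and bias are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((10, 512))  # (seq_len, d_model)
out = sublayer_connection(x, lambda h: h)                # identity stands in for a real sub-layer
print(out.shape)                                         # (10, 512)
```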
- Decoder: Generates output sequence of symbols one element at a time
- Also composed of a stack of $N$ identical layers, each with three sub-layers
- Additional multi-head attention over output of the encoder stack
- Masks the self-attention sub-layer to prevent positions from attending to subsequent positions (see the masking sketch below)
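A small illustration of the decoder-side masking, assuming the usual implementation of setting scores for future positions to $-\infty$ before the softmax; shapes and names are illustrative:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: entry (i, j) is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).standard_normal((5, 5))  # raw attention scores for 5 positions
masked = np.where(causal_mask(5), scores, -np.inf)         # block attention to future positions
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)             # softmax: future positions get weight 0
print(np.round(weights, 2))                                # upper triangle is all zeros
```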
- Multi-head Attention: Project queries, keys, and values $h$ times with different, learned linear projections
- Perform attention function in parallel, concatenate resulting output values, and then apply a final projection
- Enables the model to jointly attend to information from different representation subspaces at different positions (sketched below)
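A hedged NumPy sketch of scaled dot-product attention and the multi-head wrapper; random matrices stand in for the learned projections, and batching is omitted for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed per head."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    return softmax(scores) @ V                        # (h, n, d_k)

def multi_head_attention(x, h=8, d_model=512):
    """Project x into h heads, attend in parallel, concatenate, and project back."""
    n, d_k = x.shape[0], d_model // h
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
    W_o = rng.standard_normal((h * d_k, d_model))          # final output projection

    Q = np.einsum('nd,hdk->hnk', x, W_q)                   # per-head queries
    K = np.einsum('nd,hdk->hnk', x, W_k)                   # per-head keys
    V = np.einsum('nd,hdk->hnk', x, W_v)                   # per-head values
    heads = scaled_dot_product_attention(Q, K, V)          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)  # concatenate heads
    return concat @ W_o                                    # (n, d_model)

x = np.random.default_rng(1).standard_normal((10, 512))   # 10 tokens, d_model = 512
print(multi_head_attention(x).shape)                       # (10, 512)
```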
- Position-wise Feed-forward Networks: a fully connected network (two linear transformations with a ReLU in between) applied to each position separately and identically, as sketched below
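A short sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, using the base model's dimensions ($d_{model} = 512$, $d_{ff} = 2048$); the weights here are random placeholders:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, n = 512, 2048, 10
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = rng.standard_normal((n, d_model))
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```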
- Positional Encoding: Inject information about relative (or absolute) position of the tokens in the sequence
- Sum the positional encodings with input embeddings
- Fixed positional encoding using sine and cosine functions of different frequencies (sketched below)
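A sketch of the fixed sinusoidal encoding, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, summed with the input embeddings as described above:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sine on even dimensions, cosine on odd dimensions, with geometrically spaced frequencies."""
    pos = np.arange(n_positions)[:, None]           # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)   # (n, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.default_rng(0).standard_normal((10, 512))
x = embeddings + positional_encoding(10, 512)  # encodings are summed with the input embeddings
print(x.shape)                                 # (10, 512)
```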
Results
- Compared to recurrent and convolutional layers, self-attention has:
- Better computational complexity ($O(n^2 \cdot d)$ vs. $O(n \cdot d^2)$ for RNNs) when the sequence length $n$ is smaller than the representation dimensionality $d$, which is generally the case
- Similar complexity of sequential operations to CNNs ($O(1)$), better than RNNs ($O(n)$)
- Shorter maximum path length ($O(1)$) than CNNs ($O(\log_k(n))$) and RNNs ($O(n)$)
- Outperforms SotA single models, achieving a BLEU score of 41 on the WMT 2014 English-to-French translation task at less than a quarter of their training cost
- Model variations on English-to-German translation task:
- Performance drops off with too few or too many attention heads
- Reducing the attention key dimensionality $d_k$ hurts performance, indicating that a more sophisticated compatibility function than dot product may be beneficial
- Bigger models are better
- Dropout is helpful for reducing over-fitting
- Learned positional embeddings perform nearly identically to the fixed sinusoidal encoding
- Demonstrates generalization to other tasks by performing well on English constituency parsing, outperforming previously reported RNN sequence-to-sequence models
Conclusion
- Presents the Transformer, a sequence transduction model that replaces recurrent layers and relies entirely on attention mechanisms
- Trains significantly faster than recurrent or convolutional architectures while achieving better translation quality
- Future work includes investigating restricted attention mechanisms to efficiently handle large inputs and outputs