Uses bijective networks to identify large subspaces of invariance-based vulnerability and introduces the independence cross-entropy loss, which partially alleviates this vulnerability.
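A minimal sketch of the split-and-penalize idea behind this loss, not the paper's exact objective: the output of a bijective (invertible) network is split into semantic logits `z_s` and nuisance variables `z_n`; standard cross-entropy fits `z_s` to the label, while an added term discourages label information in `z_n`. The uniform-readout KL penalty, `IndependenceCELoss`, `nuisance_head`, and `beta` are illustrative assumptions here, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IndependenceCELoss(nn.Module):
    """Sketch: cross-entropy on semantic logits plus a term pushing the
    nuisance variables toward label-independence (hypothetical form)."""

    def __init__(self, num_classes: int, nuisance_dim: int, beta: float = 1.0):
        super().__init__()
        self.num_classes = num_classes
        self.beta = beta
        # Hypothetical linear readout, used only to measure how much
        # label information remains in the nuisance variables.
        self.nuisance_head = nn.Linear(nuisance_dim, num_classes)

    def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # z is the bijective network's output; split into semantics/nuisance.
        z_s = z[:, : self.num_classes]
        z_n = z[:, self.num_classes :]
        ce = F.cross_entropy(z_s, labels)  # fit the label on the semantic part
        # Drive the nuisance readout toward a uniform prediction, i.e.
        # minimize KL(uniform || p(label | z_n)).
        log_p = F.log_softmax(self.nuisance_head(z_n), dim=1)
        uniform = torch.full_like(log_p, 1.0 / self.num_classes)
        indep = F.kl_div(log_p, uniform, reduction="batchmean")
        return ce + self.beta * indep

# Usage with stand-in data: a 64-dim invertible-network output,
# 10 semantic dims + 54 nuisance dims (dimensions are arbitrary).
loss_fn = IndependenceCELoss(num_classes=10, nuisance_dim=54)
z = torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))
loss = loss_fn(z, labels)
```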
The Transformer, a sequence transduction model that dispenses with recurrent layers and relies entirely on attention mechanisms, achieves new SotA on machine translation tasks while significantly reducing training time.
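The operation at the core of the model is scaled dot-product attention, `Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V`. A minimal PyTorch rendering (the tensor shapes and mask convention below are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(QK^T / sqrt(d_k)) V — the attention primitive the
    Transformer stacks in place of recurrence."""
    d_k = q.size(-1)
    # Similarity of each query to each key, scaled by sqrt(d_k)
    # to keep the softmax in a well-conditioned regime.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # attention distribution per query
    return weights @ v                   # weighted sum of value vectors

# Example: batch of 2 sequences, length 5, model dim 64 (self-attention).
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 5, 64)
```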