Transformers without Normalization
Abstract
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation, as a drop-in replacement for normalization layers in Transformers.
Given an input tensor x, DyT is defined as:

DyT(x) = γ ⊙ tanh(αx) + β

where:
- α is a learnable scalar that scales the input before the tanh squashing function;
- γ and β are learnable per-channel vectors, initialized to ones and zeros respectively, mirroring the affine parameters of standard normalization layers.
The DyT module can be implemented in just a few lines of PyTorch code:
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        # Learnable scalar that scales the input before the tanh.
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        # Per-channel affine parameters, as in normalization layers.
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        x = torch.tanh(self.alpha * x)
        return x * self.weight + self.bias
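As a quick sanity check, DyT can be dropped in wherever `nn.LayerNorm` would normally sit. The toy block below is an illustrative sketch, not code from the paper: the names `d_model` and the minimal `nn.Sequential` wrapper are assumptions chosen for brevity, and the module is redefined so the example is self-contained.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise drop-in replacement for LayerNorm."""
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        return torch.tanh(self.alpha * x) * self.weight + self.bias

# Hypothetical minimal block: DyT takes the place of nn.LayerNorm(d_model).
d_model = 64
block = nn.Sequential(
    DyT(d_model),                 # instead of nn.LayerNorm(d_model)
    nn.Linear(d_model, d_model),
)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
y = block(x)
print(y.shape)                    # same shape as the input
```

Note that at initialization (γ = 1, β = 0) the output of DyT alone is bounded in (-1, 1) by the tanh, unlike LayerNorm, whose output scale depends on the input statistics.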