Transformers without Normalization

Abstract

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation DyT(x)=tanh(αx) as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, S-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Given an input tensor x, the DyT layer is defined as:

DyT(x)=γtanh(αx)+β

where α is a learnable scalar parameter, and γ and β are learnable per-channel vector parameters.

The DyT module can be implemented in just a few lines of PyTorch:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        # Learnable scalar α, shared across all channels.
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        # Per-channel affine parameters γ (weight) and β (bias).
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # DyT(x) = γ · tanh(αx) + β
        x = torch.tanh(self.alpha * x)
        return x * self.weight + self.bias
```
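As a quick sanity check, the sketch below drops DyT into the position a normalization layer would occupy and verifies the output shape and range. The tensor shapes and `d_model` value are illustrative assumptions, not from the paper, and the DyT class is restated so the snippet runs on its own:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """DyT(x) = weight * tanh(alpha * x) + bias, as defined above."""
    def __init__(self, num_features, alpha_init_value=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init_value)
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        return torch.tanh(self.alpha * x) * self.weight + self.bias

# Illustrative shapes: (batch, tokens, channels), d_model chosen arbitrarily.
d_model = 64
x = torch.randn(2, 10, d_model)

# Used exactly where nn.LayerNorm(d_model) would normally sit.
norm = DyT(d_model)
y = norm(x)
print(y.shape)  # torch.Size([2, 10, 64]) — same shape as the input
```

Because tanh is bounded in (−1, 1) and γ, β initialize to ones and zeros, the output at initialization lies strictly inside (−1, 1); the per-channel affine parameters then learn to rescale it, mirroring the affine part of LayerNorm.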