When using the torch.nn.modules.transformer.Transformer module/object, the first layer is the encoder.layers.0.self_attn layer, which is a MultiheadAttention layer, i.e.
from torch.nn.modules.transformer import Transformer
bumblebee = Transformer()
bumblebee.parameters
[out]:
<bound method Module.parameters of Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
If we print out the sizes of the encoder's parameters, we see:
for name in bumblebee.encoder.state_dict():
    print(name, '\t', bumblebee.encoder.state_dict()[name].shape)
[out]:
layers.0.self_attn.in_proj_weight torch.Size([1536, 512])
layers.0.self_attn.in_proj_bias torch.Size([1536])
layers.0.self_attn.out_proj.weight torch.Size([512, 512])
layers.0.self_attn.out_proj.bias torch.Size([512])
layers.0.linear1.weight torch.Size([2048, 512])
layers.0.linear1.bias torch.Size([2048])
layers.0.linear2.weight torch.Size([512, 2048])
layers.0.linear2.bias torch.Size([512])
layers.0.norm1.weight torch.Size([512])
layers.0.norm1.bias torch.Size([512])
layers.0.norm2.weight torch.Size([512])
layers.0.norm2.bias torch.Size([512])
It looks like 1536 is 512 * 3, so the layers.0.self_attn.in_proj_weight parameter seems to store all three Q, K and V projection matrices of the transformer architecture in a single tensor.
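As a quick sanity check (a minimal sketch of my own, not part of the original code, and the Q/K/V ordering of the chunks is my assumption), the packed [1536, 512] tensor can be split into three [512, 512] pieces along dim 0, one per projection:

from torch.nn.modules.transformer import Transformer

bumblebee = Transformer()  # defaults: d_model=512, nhead=8
in_proj = bumblebee.encoder.layers[0].self_attn.in_proj_weight

# Splitting the packed [1536, 512] tensor along dim 0 gives three
# [512, 512] matrices, one each for (assumed) Q, K and V.
w_q, w_k, w_v = in_proj.chunk(3, dim=0)
print(w_q.shape, w_k.shape, w_v.shape)
# torch.Size([512, 512]) torch.Size([512, 512]) torch.Size([512, 512])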
From https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/activation.py#L649:
class MultiheadAttention(Module):
    def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None):
        super(MultiheadAttention, self).__init__()
        self.embed_dim = embed_dim
        self.kdim = kdim if kdim is not None else embed_dim
        self.vdim = vdim if vdim is not None else embed_dim
        self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim

        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"

        if self._qkv_same_embed_dim is False:
            self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
            self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
            self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
        else:
            self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
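Following that branch, a small experiment (my own sketch, not from the linked source; details may vary slightly across PyTorch versions) shows which parameters get created in each case:

from torch.nn import MultiheadAttention

# kdim/vdim left as None -> they default to embed_dim, so the packed
# in_proj_weight of shape [3 * embed_dim, embed_dim] is created.
mha_same = MultiheadAttention(embed_dim=512, num_heads=8)
print(mha_same.in_proj_weight.shape)   # torch.Size([1536, 512])

# kdim/vdim different from embed_dim -> separate per-projection weights.
mha_diff = MultiheadAttention(embed_dim=512, num_heads=8, kdim=256, vdim=128)
print(mha_diff.q_proj_weight.shape)    # torch.Size([512, 512])
print(mha_diff.k_proj_weight.shape)    # torch.Size([512, 256])
print(mha_diff.v_proj_weight.shape)    # torch.Size([512, 128])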
And the note in the MultiheadAttention docstring says:

Note: if kdim and vdim are None, they will be set to embed_dim such that query, key, and value have the same number of features.
Is that correct, i.e. does in_proj_weight really pack the Q, K and V projection matrices into a single tensor?