编程语言
首页 > 编程语言> > BERT源码详解(一)——HuggingFace Transformers源码解

BERT源码详解(一)——HuggingFace Transformers源码解

作者:互联网

 

众所周知,BERT模型自2018年问世起就各种屠榜,开启了NLP领域预训练+微调的范式。到现在,BERT的相关衍生模型层出不穷(XL-Net、RoBERTa、ALBERT、ELECTRA、ERNIE等),要理解它们可以先从BERT这个始祖入手。

HuggingFace是一家总部位于纽约的聊天机器人初创服务商,很早就捕捉到BERT大潮流的信号并着手实现基于pytorch的BERT模型。这一项目最初名为pytorch-pretrained-bert,在复现了原始效果的同时,提供了易用的方法以方便在这一强大模型的基础上进行各种玩耍和研究。

随着使用人数的增加,这一项目也发展成为一个较大的开源社区,合并了各种预训练语言模型以及增加了Tensorflow的实现,并且在2019年下半年改名为Transformers。截止写文章时(2021年3月30日)这一项目已经拥有43k+的star,可以说Transformers已经成为事实上的NLP基本工具。

https://github.com/huggingface/transformers​github.com/huggingface/transformers

本文基于Transformers版本4.4.2(2021年3月19日发布)项目中,pytorch版的BERT相关代码,从代码结构、具体实现与原理,以及使用的角度进行分析,包含以下内容:

  1. BERT Tokenization分词模型(BertTokenizer)
  2. BERT Model本体模型(BertModel)
    1. BertEmbeddings
    2. BertEncoder
      1. BertLayer
        1. BertAttention
          1. BertSelfAttention
          2. BertSelfOutput
        2. BertIntermediate
        3. BertOutput
      2. BertPooler
  3. BERT-based Models应用模型(请看下篇)
    1. BertForPreTraining
    2. BertForSequenceClassification
    3. BertForMultiChoice
    4. BertForTokenClassification
    5. BertForQuestionAnswering
  4. BERT训练与优化(请看下篇)
    1. Pre-Training
    2. Fine-Tuning
      1. AdamW
      2. Warmup

下篇传送门:

Riroaki:BERT源码详解(二)——HuggingFace Transformers最新版本源码解读134 赞同 · 25 评论文章正在上传…重新上传取消​


1 Tokenization(BertTokenizer)

和BERT有关的Tokenizer主要写在/models/bert/tokenization_bert.py/models/bert/tokenization_bert_fast.py 中。

这两份代码分别对应基本的BertTokenizer,以及不进行token到index映射的BertTokenizerFast,这里主要讲解第一个。

class BertTokenizer(PreTrainedTokenizer):
    """
    Construct a BERT tokenizer. Based on WordPiece.

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
    Users should refer to this superclass for more information regarding those methods.
    ...
    """

BertTokenizer 是基于BasicTokenizerWordPieceTokenizer 的分词器:

BertTokenizer 有以下常用方法:

以及,类自身的方法:

>>> from transformers import BertTokenizer
>>> bt = BertTokenizer.from_pretrained('./bert-base-uncased/')
>>> bt('I like natural language progressing!')
{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

2 Model(BertModel)

和BERT模型有关的代码主要写在/models/bert/modeling_bert.py中,这一份代码有一千多行,包含BERT模型的基本结构和基于它的微调模型等。

下面从BERT模型本体入手分析:

class BertModel(BertPreTrainedModel):
    """

    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in `Attention is
    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    To behave as an decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration
    set to :obj:`True`. To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
    input to the forward pass.
    """ 

BertModel主要为transformer encoder结构,包含三个部分:

  1. embeddings,即BertEmbeddings类的实体,对应词嵌入;
  2. encoder,即BertEncoder类的实体;
  3. pooler, 即BertPooler类的实体,这一部分是可选的。
补充:注意BertModel也可以配置为Decoder,不过下文中不包含对这一部分的讨论。

下面将介绍BertModel的前向传播过程中各个参数的含义以及返回值:

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_values=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ): ...
补充:注意,这里的head_mask对注意力计算的无效化,和下文提到的注意力头剪枝不同,而仅仅把某些注意力的计算结果给乘以这一系数。

返回部分如下:

        # BertModel的前向传播返回部分
        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )

可以看出,返回值不但包含了encoder和pooler的输出,也包含了其他指定输出的部分(hidden_states和attention等,这一部分在encoder_outputs[1:])方便取用:

        # BertEncoder的前向传播返回部分,即上面的encoder_outputs
        if not return_dict:
            return tuple(
                v
                for v in [
                    hidden_states,
                    next_decoder_cache,
                    all_hidden_states,
                    all_self_attentions,
                    all_cross_attentions,
                ]
                if v is not None
            )
        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=hidden_states,
            past_key_values=next_decoder_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
            cross_attentions=all_cross_attentions,
        )

此外,BertModel还有以下的方法,方便BERT玩家进行各种骚操作:

  1. get_input_embeddings:提取embedding中的word_embeddings即词向量部分;
  2. set_input_embeddings:为embedding中的word_embeddings赋值;
  3. _prune_heads:提供了将注意力头剪枝的函数,输入为{layer_num: list of heads to prune in this layer}的字典,可以将指定层的某些注意力头剪枝。
补充:剪枝是一个复杂的操作,需要将保留的注意力头部分的Wq、Kq、Vq和拼接后全连接部分的权重拷贝到一个新的较小的权重矩阵(注意先禁止grad再拷贝),并实时记录被剪掉的头以防下标出错。具体参考BertAttention部分的prune_heads方法。

2.1 BertEmbeddings

包含三个部分求和得到:

  1. word_embeddings,上文中subword对应的嵌入。
  2. token_type_embeddings,用于表示当前词所在的句子,辅助区别句子与padding、句子对间的差异。
  3. position_embeddings,句子中每个词的位置嵌入,用于区别词的顺序。和transformer论文中的设计不同,这一块是训练出来的,而不是通过Sinusoidal函数计算得到的固定嵌入。一般认为这种实现不利于拓展性(难以直接迁移到更长的句子中)。

三个embedding不带权重相加,并通过一层LayerNorm+dropout后输出,其大小为(batch_size, sequence_length, hidden_size)

补充:这里为什么要用LayerNorm+Dropout呢?为什么要用LayerNorm而不是BatchNorm?可以参考一个不错的回答:

transformer 为什么使用 layer normalization,而不是其他的归一化方法?202 赞同 · 12 评论回答正在上传…重新上传取消​

2.2 BertEncoder

包含多层BertLayer,这一块本身没有特别需要说明的地方,不过有一个细节值得参考:

利用gradient checkpointing技术以降低训练时的显存占用。

补充:gradient checkpointing即梯度检查点,通过减少保存的计算图节点压缩模型占用空间,但是在计算梯度的时候需要重新计算没有存储的值,参考论文《Training Deep Nets with Sublinear Memory Cost》,过程如下示意图:

在BertEncoder中,gradient checkpoint是通过torch.utils.checkpoint.checkpoint实现的,使用起来比较方便,可以参考文档:

torch.utils.checkpoint - PyTorch 1.8.1 documentation​pytorch.org/docs/stable/checkpoint.html

这一机制的具体实现比较复杂(没看懂),在此不作展开。

再往深一层走,就进入了Encoder的某一层:

2.2.1 BertLayer

这一层包装了BertAttention和BertIntermediate+BertOutput(即Attention后的FFN部分),以及这里直接忽略的cross-attention部分(将BERT作为Decoder时涉及的部分)。

理论上,这里顺序调用三个子模块就可以,没有什么值得说明的地方。

然而这里又出现了一个细节

        # 这是forward的一部分
        self_attention_outputs = self.attention(
            hidden_states,
            attention_mask,
            head_mask,
            output_attentions=output_attentions,
            past_key_value=self_attn_past_key_value,
        )
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

        # 中间省略一部分……

        layer_output = apply_chunking_to_forward(
            self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
        )
        outputs = (layer_output,) + outputs

        # 省略一部分……

        return outputs

    # 这是feed_forward_chunk的部分
    def feed_forward_chunk(self, attention_output):
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output

看到上面那个apply_chunking_to_forwardfeed_forward_chunk了吗(为什么要整这么复杂,直接调用它不香吗)?

那么这个apply_chunking_to_forward到底是啥?深入看看……

def apply_chunking_to_forward(
    forward_fn: Callable[..., torch.Tensor], chunk_size: int, chunk_dim: int, *input_tensors
) -> torch.Tensor:
    """
    This function chunks the :obj:`input_tensors` into smaller input tensor parts of size :obj:`chunk_size` over the
    dimension :obj:`chunk_dim`. It then applies a layer :obj:`forward_fn` to each chunk independently to save memory.

    If the :obj:`forward_fn` is independent across the :obj:`chunk_dim` this function will yield the same result as
    directly applying :obj:`forward_fn` to :obj:`input_tensors`.
    ...
    """

哦,又是一个节约显存的技术——包装了一个切分小batch或者低维数操作的功能:这里参数chunk_size其实就是切分的batch大小,而chunk_dim就是一次计算维数的大小,最后拼接起来返回。

不过,在默认操作中不会特意设置这两个值(在源代码中默认为0和1),所以会直接等效于正常的forward过程。

继续往下深入,就是Transformer的核心:BertAttention部分,以及紧随其后的FFN部分。

2.2.1.1 BertAttention

本以为attention的实现就在这里,没想到还要再下一层……其中,self成员就是多头注意力的实现,而output成员实现attention后的全连接+dropout+residual+LayerNorm一系列操作。

class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)
        self.pruned_heads = set()

首先还是回到这一层。这里出现了上文提到的剪枝操作,即prune_heads方法:

    def prune_heads(self, heads):
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(
            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
        )

        # Prune linear layers
        self.self.query = prune_linear_layer(self.self.query, index)
        self.self.key = prune_linear_layer(self.self.key, index)
        self.self.value = prune_linear_layer(self.self.value, index)
        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

        # Update hyper params and store pruned heads
        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
        self.pruned_heads = self.pruned_heads.union(heads) 

这里的具体实现概括如下:

接下来就到重头戏——Self-Attention的具体实现。

2.2.1.1.1 BertSelfAttention

预警:这一块可以说是模型的核心区域,也是唯一涉及到公式的地方,所以将贴出大量代码。

初始化部分:

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

然后是重点,也就是前向传播过程。

首先回顾一下multi-head self-attention的基本公式:

其中   表示注意力头的个数,  表示向量拼接,  。

而这些注意力头,众所周知是并行计算的,所以上面的query、key、value三个权重是唯一的——这并不是所有heads共享了权重,而是“拼接”起来了。

补充:原论文中多头的理由为Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.而另一个比较靠谱的分析有:

看看forward方法:

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        mixed_query_layer = self.query(hidden_states)

        # 省略一部分cross-attention的计算
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # ...

到这里实现了K与Q相乘,获得raw attention scores的部分,按公式接下来应该是按dk进行scaling并做softmax的操作。然而——

先出现在眼前的是一个奇怪的positional_embedding,以及一堆爱因斯坦求和:

        # ...
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            seq_length = hidden_states.size()[1]
            position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r
            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
        # ...
补充:关于爱因斯坦求和约定,参考以下文档

torch.einsum - PyTorch 1.8.1 documentation​pytorch.org/docs/stable/generated/torch.einsum.html

补充:这里的positional_embedding引入了attention map中的位置嵌入——为什么要这么做呢?我目前还没搞明白……

对于不同的positional_embedding_type,有三种操作:

暂时跳过这一迷惑的部分,回到正常attention的流程:

        # ...
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask  # 这里为什么是+而不是*?

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        # 省略decoder返回值部分……
        return outputs

重大疑问:这里的attention_scores = attention_scores + attention_mask是在做什么?难道不应该是乘mask吗?

因为这里的attention_mask已经【被动过手脚】,将原本为1的部分变为0,而原本为0的部分(即padding)变为一个较大的负数,这样相加就得到了一个较大的负值:

(Pdb) attention_mask
tensor([[[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        ...,
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]]],
       device='cuda:0')

那么,这一步是在哪里执行的呢?

我在modeling_bert.py中没有找到答案,但是在modeling_utils.py中找到了一个特别的类:class ModuleUtilsMixin,在它的get_extended_attention_mask方法中发现了端倪:

    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple[int], device: device) -> Tensor:
        """
        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

        Arguments:
            attention_mask (:obj:`torch.Tensor`):
                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
            input_shape (:obj:`Tuple[int]`):
                The shape of the input to the model.
            device: (:obj:`torch.device`):
                The device of the input to the model.

        Returns:
            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
        """
        # 省略一部分……

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        return extended_attention_mask

那么,这个函数是在什么时候被调用的呢?和BertModel有什么关系呢?

OK,这里涉及到BertModel的继承细节了:BertModel继承自BertPreTrainedModel ,后者继承自PreTrainedModel,而PreTrainedModel继承自[nn.Module, ModuleUtilsMixin, GenerationMixin] 三个基类。——好复杂的封装!

这也就是说, BertModel必然在中间的某个步骤对原始的attention_mask调用了get_extended_attention_mask ,导致attention_mask从原始的[1, 0]变为[0, -1e4]的取值。

真相只有一个!

最终在BertModel的前向传播过程中找到了这一调用(第944行):

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

问题解决了:这一方法不但实现了改变mask的值,还将其广播(broadcast)为可以直接与attention map相加的形状。

不愧是你,HuggingFace。

抱脸虫可不是乱叫的!

除此之外,值得注意的细节有:

OK, that's all for attention.

2.2.1.1.2 BertSelfOutput

这一块操作略多但不复杂,一目了然:

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
补充:这里又出现了LayerNorm和Dropout的组合,只不过这里是先Dropout,进行残差连接后再进行LayerNorm。至于为什么要做残差连接,最直接的目的就是降低网络层数过深带来的训练难度,对原始输入更加敏感~

2.2.1.2 BertIntermediate

看完了BertAttention,在Attention后面还有一个全连接+激活的操作:

class BertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states
补充:为什么要过一个FFN?不知道……谷歌最近的论文貌似说明只有attention的模型什么用都没有:

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth​arxiv.org/abs/2103.03404

作为参考(图源网络):

至于为什么在transformer中要用这个激活函数……

补充:看了一些研究,应该是说GeLU比ReLU这些表现都好,以至于后续的语言模型都沿用了这一激活函数。

2.2.1.3 BertOutput

在这里又是一个全连接+dropout+LayerNorm,还有一个残差连接residual connect:

class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

这里的操作和BertSelfOutput不能说没有关系,只能说一模一样……非常容易混淆的两个组件。

以下内容还包含基于BERT的应用模型,以及BERT相关的优化器和用法,将在下一篇文章作详细介绍。

2.2.3 BertPooler

这一层只是简单地取出了句子的第一个token,即[CLS]对应的向量,然后过一个全连接层和一个激活函数后输出:

(这一部分是可选的,因为pooling有很多不同的操作)

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

Takeaways·小结

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      # 如此重复11层,直到大厦崩塌:)
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )

HiroLin:小白Bert系列-参数计算29 赞同 · 7 评论文章正在上传…重新上传取消​

(未完待续)


后记

标签:BERT,layer,Transformers,self,attention,states,源码,hidden,size
来源: https://blog.csdn.net/javastart/article/details/120999800