BERT: Pre-training of deep bidirectional transformers for language understanding, Devlin et al., 2018

Paper, Tags: #nlp, #embeddings

New language representation model. Unlike Peters et al., 2018, BERT pretrains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT can be fine-tuned with just one additional output layer.

2 strategies for applying pre-trained language representations to downstream tasks:

  • Feature-based (ELMo): use task-specific architectures that include the pre-trained representations as additional features
  • Fine-tuning (OpenAI GPT, BERT): minimal task-specific parameters, trained on downstream tasks by simply fine-tuning all pre-trained parameters.

Current techniques restrict the power of the pre-trained representations, especially for fine-tuning approaches. Major limitation: LMs are unidirectional. OpenAI GPT uses a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). These restrictions are sub-optimal for sentence-level tasks, and harmful for token-level tasks, where it's crucial to incorporate context from both directions.

BERT alleviates the unidirectionality constraint by using a "masked LM" (MLM) pre-training objective. The MLM randomly masks some of the tokens from the input, and the goal is to predict the original vocabulary id of the masked word based only on its context. MLM enables the representation to fuse the left and right context, which allows us to train a deep bidirectional Transformer. We also use a "next sentence prediction" task that jointly pretrains text-pair representations.
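A minimal sketch of how the MLM objective can be scored: the loss is cross-entropy computed only at the masked positions. The names (`hidden_states`, `lm_head`, `mlm_labels`) and all shapes/values are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2 sequences, length 8, BERT-like vocab and hidden size.
vocab_size, hidden_size = 30522, 768
hidden_states = torch.randn(2, 8, hidden_size)        # stand-in for the encoder output
lm_head = torch.nn.Linear(hidden_size, vocab_size)    # projection back to the vocabulary

# Labels hold the original token ids at masked positions and -100 everywhere else,
# so cross_entropy (with ignore_index=-100) only scores the masked tokens.
mlm_labels = torch.full((2, 8), -100)
mlm_labels[0, 3] = 1037   # original id of a masked token (illustrative value)
mlm_labels[1, 5] = 2054

logits = lm_head(hidden_states)                        # (2, 8, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size),
                       mlm_labels.view(-1),
                       ignore_index=-100)
```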

Peters et al., 2018 uses shallow concatenation of independently trained left-to-right and right-to-left LMs.

Pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks.

Related work

1. Unsupervised feature-based approaches

Left-to-right LMs, later generalized to sentence embeddings and paragraph embeddings. ELMo extracts context-sensitive features from a left-to-right and a right-to-left LM, with shallow concatenation forming the contextual representation of each token. Not deeply bidirectional.

2. Unsupervised fine-tuning approaches

Only a few parameters must be learned from scratch: pre-training is done on unlabeled text, and the model is then fine-tuned for a supervised downstream task.

3. Transfer learning from supervised data

There has been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (NLI) and machine translation (MT).

BERT

2 steps in the framework:

  1. Pre-training: model trained on unlabeled data over different pre-training tasks
  2. Fine-tuning: BERT is first initialized with pre-trained parameters, and all parameters are fine-tuned using labeled data from the downstream tasks.

BERT has a unified architecture across different tasks; there's minimal difference between the pre-trained architecture and the final downstream architecture.

The model is a multi-layer bidirectional Transformer encoder based on the original implementation in paper #5. 2 model sizes, BERT_base and BERT_large.

The BERT Transformer uses bidirectional self-attention, while the OpenAI GPT Transformer uses constrained self-attention, where every token can only attend to context to its left.
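The difference can be pictured as two attention masks over the same sequence; this is a toy sketch, not taken from the paper or any particular implementation.

```python
import torch

seq_len = 5

# BERT-style (bidirectional): every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

# GPT-style (left-to-right / causal): position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```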

1. Pretraining BERT

Unlike Peters et al., 2018, we don't use traditional left-to-right or right-to-left language models; we pretrain BERT using 2 unsupervised tasks:

  1. Masked LM
    • Standard LMs can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly 'see itself'.
    • We simply mask some percentage of the input tokens at random, and then predict those masked tokens
    • Downside: this creates a mismatch between pre-training and fine-tuning, because the [MASK] token doesn't appear during fine-tuning. To mitigate this, we don't always replace 'masked' words with the actual [MASK] token: 15% of tokens are chosen for masking; of those, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged (see the sketch after this list).
    • The Transformer encoder doesn't know which words it will be asked to predict or which have been replaced by random words, so it must keep a distributional contextual representation of every input token.
    • MLM converges slower than left-to-right LMs
  2. Next sentence prediction (NSP)
    • Many downstream tasks are based on understanding the relationship between two sentences, which isn't directly captured by LMs. That's why we use a next sentence prediction task, which can be generated from any monolingual corpus.
    • When choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (IsNext), and 50% of the time it's a random sentence from the corpus (NotNext).

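A toy sketch of how the pre-training data could be built, covering the 80/10/10 masking rule and the IsNext/NotNext pairing. The vocabulary, helper names (`mask_tokens`, `make_nsp_pair`), and data layout are hypothetical stand-ins under the assumption that each document is a list of tokenized sentences.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Select ~15% of tokens; of those, 80% -> [MASK], 10% -> random token,
    10% left unchanged. Labels record the original tokens to predict."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the MLM target at this position
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, labels

def make_nsp_pair(doc, all_docs):
    """Pick sentences A and B: 50% the true next sentence (IsNext),
    50% a sentence sampled from the corpus (NotNext)."""
    i = random.randrange(len(doc) - 1)            # assumes doc has >= 2 sentences
    a = doc[i]
    if random.random() < 0.5:
        return a, doc[i + 1], "IsNext"
    # ideally sample from a different document than `doc`
    return a, random.choice(random.choice(all_docs)), "NotNext"
```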
As input data for pretraining, we use the BooksCorpus (800M words) and English Wikipedia (2,500M words). It's critical to use a document-level corpus rather than a shuffled sentence-level corpus, in order to extract long contiguous sequences.

2. Fine-tuning BERT

It's straightforward, since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks by swapping out the appropriate inputs and outputs.

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all parameters end-to-end. Compared to pre-training, fine-tuning is relatively inexpensive.
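A minimal sketch of the "one additional output layer" idea for sentence classification, assuming a generic pre-trained encoder (`pretrained_encoder` is a stand-in, not the paper's code) that returns hidden states of shape (batch, seq_len, hidden_size).

```python
import torch
import torch.nn as nn

class BertForClassification(nn.Module):
    """Pre-trained encoder plus a single task-specific output layer on [CLS]."""
    def __init__(self, pretrained_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder                    # initialized from pre-training
        self.classifier = nn.Linear(hidden_size, num_labels) # the one added layer

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)     # (B, T, H), hypothetical interface
        cls = hidden[:, 0]                                    # representation of the [CLS] token
        return self.classifier(cls)

# Fine-tuning updates *all* parameters end-to-end, not just the new head:
# model = BertForClassification(pretrained_encoder)
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```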

Model sizes

Scaling to extreme model sizes leads to large improvements on very large and also very small scale tasks, provided that the model has been sufficiently pre-trained.

Feature-based approach

Feature-based approaches have some advantages. Not all tasks can be easily represented by a Transformer encoder architecture. Also, there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.
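A sketch of that workflow: run a frozen encoder once to cache contextual features, then train a cheap model on top. `pretrained_encoder`, `batches`, and `num_labels` are hypothetical stand-ins, and using the [CLS] vector is just one possible choice of feature.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(pretrained_encoder, batches):
    """Compute contextual features once with a frozen encoder."""
    pretrained_encoder.eval()
    feats = []
    for input_ids, attention_mask in batches:
        hidden = pretrained_encoder(input_ids, attention_mask)  # (B, T, H)
        feats.append(hidden[:, 0])                              # e.g. one [CLS] vector per example
    return torch.cat(feats)

# cached = extract_features(pretrained_encoder, batches)   # expensive step, done once
# probe = nn.Linear(cached.size(-1), num_labels)            # cheap model trained on the cache
```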