# Transformers
_____

**Overview**
- Transformers compute `vector-space representations` of natural language that
  are suitable for use in deep learning models.  
- They rely solely on attention mechanisms to compute representations of their
  input and output, without using sequence-aligned RNNs or convolutions.  
- The benefit of the transformer architecture is that it helps the model capture
  dependencies across much longer sequences than was practical with traditional
  RNNs, LSTMs, and GRUs, although the context length is still finite.
- Still lacks contextual understanding.  

### Why Transformers?
- **paper**: *Attention Is All You Need - 2017*
- Attention allows the model to focus on the relevant parts of the input
  sequence as needed.  
- Self-attention is the method the `Transformer` uses to bake the
  “understanding” of other relevant words into the one we’re currently
  processing. 
- Earlier models such as `ELMo` relied on LSTMs and had no attention mechanism,
  so they lacked an efficient way of `focusing` on the important words in each
  sentence.  
- This is the problem the Transformer network addressed by using the attention mechanism.

**Structure**
- The `Attention Is All You Need` paper used attention to improve the performance
  of machine translation. The authors created a model with two main parts: 
  - **Encoder**: This part of the “Attention is all you need” model processes
    the input text, looks for important parts, and creates an embedding for
    each word based on relevance to other words in the sentence. (Stack of 6
    encoders) 
  - **Decoder**: This takes the output of the encoder, which is an embedding,
    and then turns that embedding back into a text output, i.e. the translated
    version of the input text (Stack of 6 Decoders). 

![image](../assets/transformer.png)

**Notes:**
- Neither the encoder nor the decoder used any recurrence or looping, like traditional RNNs. 
- Instead, they used layers of `attention` through which the information passes
  linearly. It didn’t loop over the input multiple times – instead, the
  Transformer passes the input through multiple attention layers.  
- You can think of each attention layer as `learning` more about the input,
  i.e. looking at different parts of the sentence and trying to discover more
  semantic or syntactic information.  
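- This matters for the vanishing gradient problem: information no longer has to
  pass through many recurrent time steps, so gradients have shorter paths to
  travel between distant positions.
- [Reference](http://jalammar.github.io/illustrated-bert/)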

## Attention

![image](../assets/attention.png)

##### Attention Mechanism 

- **Query**: a vector derived from the input word vector of the token currently
  being processed. 
- **Keys**: vectors derived from the input word vectors of all the other tokens,
  and of the query token as well. 
- **Values**: the values stored in the dictionary for each token (i.e. for each
  key). 

##### Calculate Self-Attention 
1. For each word, create a `Query` vector, a `Key` vector, and a `Value` vector.
   These are created by multiplying the embedding by three matrices that were
   learned during the training process.
2. Calculate a score to determine how much focus to place on other parts of the
   input sentence as we encode a word at a certain position.  
   - The score is calculated by taking the `dot product` of the query vector with
     the key vector; the result is a scalar.
     - This value is the **attention score**; it measures how relevant the
       key is to the query.
   - So if we’re processing the self-attention for the word in position #1, the
     first score would be the dot product of `q1` and `k1`.  
   - The second score would be the dot product of `q1` and `k2`, and so on.
3. Divide the scores by 8, the square root of the key-vector dimension (64) used
   in the paper; this leads to more stable gradients.
4. Apply `softmax` to normalize the scores so they’re all positive and add up to 1.
   - This determines how much each word will be expressed at this position (e.g.
     the first word).
5. Multiply each value vector by its `softmax` score; this drowns out
   irrelevant words by multiplying them by small values. 
6. Sum up the weighted value vectors; this **weighted sum** is the output of the
   self-attention layer at this position.
7. The resulting vector is the one we send along to the feed-forward neural
   network. 
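
The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the paper's implementation: the helper names (`matvec`, `dot`, `self_attention`), the toy 2-dimensional embeddings, and the identity projection matrices are all assumptions chosen so the numbers are easy to follow. A real model would use learned matrices and a tensor library.

```python
import math

def softmax(scores):
    """Normalize scores so they are all positive and sum to 1 (step 4)."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vector, matrix):
    """Multiply a row vector of length n by an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(embeddings, w_q, w_k, w_v):
    # Step 1: project every embedding into query, key, and value vectors.
    queries = [matvec(x, w_q) for x in embeddings]
    keys = [matvec(x, w_k) for x in embeddings]
    values = [matvec(x, w_v) for x in embeddings]
    d_k = len(keys[0])

    outputs, all_weights = [], []
    for q in queries:
        # Steps 2-3: dot-product scores against every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # Step 4: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # Steps 5-6: weight each value vector and sum them up.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three "words" with hypothetical 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the learned matrices
outputs, weights = self_attention(embeddings, identity, identity, identity)
```

Each row of `weights` holds the attention one word pays to every word in the sentence, and each row of `outputs` is the vector that would be passed to the feed-forward network. With `d_k = 64` the scaling factor in the sketch becomes the 8 mentioned in step 3.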

[Reference](https://jalammar.github.io/illustrated-transformer/)

### Encoders
- The encoders are all identical in structure (yet they do not share weights).
  Each one is broken down into two sublayers: self-attention and feed-forward  
- The encoder’s inputs first flow through a self-attention layer which helps
  the encoder look at other words in the input sentence as it encodes a
  specific word. 
- The outputs of the self-attention layer are fed to a feed-forward neural network. 

![image](../assets/encoder.png)

### Decoders
- The decoder has both those layers, but between them is an attention layer
  that helps the decoder focus on relevant parts of the input sentence 
- The output of the top encoder is then transformed into a set of attention vectors K and V. 
- These are to be used by each decoder in its “encoder-decoder attention” layer
  which helps the decoder focus on appropriate places in the input sequence 
- The decoder stack outputs a vector of floats, which a final fully connected
  layer projects into a logits vector with one score per vocabulary word.
- A softmax layer then turns those scores into probabilities.
- Output: the cell with the highest probability is chosen, and the word
  associated with it is produced as the output for this time step. 
- **Reference**: http://jalammar.github.io/illustrated-transformer/
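
The last three bullets (projection to logits, softmax, picking the most probable word) can be sketched as follows. The five-word vocabulary, the projection weights, and the decoder vector are made-up values for illustration only, and `next_word` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word(decoder_vector, projection, vocab):
    # Final fully connected layer: one logit per vocabulary word.
    logits = [sum(x * w for x, w in zip(decoder_vector, weights))
              for weights in projection]
    probabilities = softmax(logits)
    # Greedy choice: the cell with the highest probability wins.
    best = max(range(len(vocab)), key=lambda i: probabilities[i])
    return vocab[best], probabilities

# Hypothetical toy values: 3-dimensional decoder output, 5-word vocabulary.
vocab = ["i", "am", "a", "student", "<eos>"]
projection = [
    [0.1, 0.0, 0.2],  # weights for "i"
    [2.0, 1.0, 0.0],  # weights for "am"
    [0.0, 0.3, 0.1],  # weights for "a"
    [0.5, 0.5, 0.5],  # weights for "student"
    [0.0, 0.0, 1.0],  # weights for "<eos>"
]
decoder_vector = [1.0, 0.5, -0.5]
word, probabilities = next_word(decoder_vector, projection, vocab)
```

Here the logit for "am" is the largest, so `word` is `"am"`. Always taking the top-scoring word is the greedy strategy described above; other decoding strategies exist but are outside the scope of these notes.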

![image](../assets/decoder.png)

______

## BERT

- The BERT family of models uses the `Transformer` encoder architecture to
  process each token of input text in the full context of all tokens before
  and after, hence the name: `Bidirectional Encoder Representations from
  Transformers`.
- BERT models are usually pre-trained on a large corpus of text, then
  fine-tuned for specific tasks. 
- [Tensorflow Example Reference](https://www.tensorflow.org/text/tutorials/classify_text_with_bert)
- [Visual Guide to Using BERT Reference](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
  

#### BERT Tasks
- Question answering
- NER
- Semantic similarity
- Document classification 
- Predicting masked words
  
#### Why BERT?
- Earlier approaches such as ELMo and Word2Vec were still limited in how well
  they could take advantage of large amounts of training data.  
- This was a serious obstacle to improving how well these models perform across
  a range of NLP tasks. Large-scale pre-training followed by fine-tuning set the
  scene for the arrival of models like BERT.

#### BERT Overview
- BERT is pre-trained on a large corpus of unlabelled text, including the entire
  English Wikipedia and BookCorpus, roughly `3.3 billion` words in total. 
- Used for transfer learning  
- BERT uses a `Masked Language Model` objective: it randomly masks words in the
  sentence and then tries to predict them.  
- To predict a masked word, the model looks in both directions and uses the full
  context of the sentence. 
  The bidirectional aspect creates a representation of each word that is based
  on the other words in the sentence, i.e. it is context-based.  
- **Example:**
  - In “I accessed the bank account,” a unidirectional contextual model would
    represent “bank” based on “I accessed the” but not “account.” BERT
    represents “bank” using both the preceding and the following context.
- BERT relies on Transformers (the attention mechanism that learns contextual
  relationships between words in a text). 
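
A rough sketch of the masking step, assuming whitespace-split words instead of real word pieces. The `mask_tokens` helper is hypothetical; the 15% default follows the BERT paper, but the sketch simplifies the paper's procedure by always substituting `[MASK]` (the paper sometimes keeps the original token or swaps in a random one).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens; return the masked sequence and the prediction targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model is trained to recover this word
        else:
            masked.append(token)
            labels.append(None)   # no prediction needed at this position
    return masked, labels

tokens = "i accessed the bank account".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

During pre-training the model sees `masked` and is scored only on the positions where `labels` is not `None`, using the words on both sides of each `[MASK]`.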

#### Inner-workings of BERT 
- The input to the encoder for BERT is a sequence of tokens, which are first
  converted into vectors and then processed in the neural network. 
- **Token embeddings:** A `[CLS]` token is added to the input word tokens at the
  beginning of the first sentence and a `[SEP]` token is inserted at the end of
  each sentence. 
- **Segment embeddings:** A marker indicating Sentence A or Sentence B is added
  to each token. This allows the encoder to distinguish between sentences. 
- **Positional embeddings:** A positional embedding is added to each token to
  indicate its position in the sequence. 
- Model sizes:
  - **BERT-Base:** 12 layers, 768 hidden units, 12 attention heads, 110M parameters
  - **BERT-Large:** 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
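
To make the three kinds of input information concrete, here is a small sketch that assembles the special tokens, segment markers, and positions for a sentence pair. `build_bert_input` is a hypothetical helper and the words are already split by hand; a real pipeline would use a WordPiece tokenizer. In BERT each of these ids indexes an embedding table and the three resulting vectors are summed per token.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and compute segment and position ids for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # sentence A
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)   # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))        # 0, 1, 2, ...
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "playing"])
```

For this pair, `tokens` starts with `[CLS]`, each sentence ends with `[SEP]`, the first six entries of `segment_ids` are `0`, and the remaining four are `1`.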

### BERT Example with TensorFlow 

- Load a pre-trained BERT model from `TensorFlow Hub`.
- **[BERT-Base, Uncased](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3)**
  - Text inputs have been normalized the "uncased" way, meaning that the text
    has been lower-cased before tokenization into word pieces, and any accent
    markers have been stripped. 
- **[Small BERTs](https://tfhub.dev/google/collections/bert/1)**
  - Small BERTs have the same general architecture but fewer and/or smaller
    Transformer blocks, which lets you explore tradeoffs between speed, size
    and quality. 

**Preprocessing the Model**

- Text inputs need to be transformed to numeric token ids and arranged in
  several Tensors before being input to `BERT`. 
  
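  ```python
  tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
  bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)  # needs `import tensorflow_hub as hub` and `import tensorflow_text`
  ```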

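- `BertTokenizer`: tokenizer classes store the vocabulary for each model and
  provide methods for encoding strings into lists of token indices and decoding
  them back.

- The preprocessing model returns three tensors:
  - `input_word_ids`: the token ids of the input text.
  - `input_mask`: `1` for real tokens and `0` for padding.
  - `input_type_ids`: the segment (sentence A or B) of each token.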
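
As an illustration of how these three tensors relate to each other, the sketch below pads one already-tokenized example to a fixed length. `pack_inputs` is a hypothetical helper, not part of TensorFlow Hub, and the token ids are illustrative placeholders. Truncation here simply cuts off the end, which a real preprocessor handles more carefully.

```python
def pack_inputs(token_ids, type_ids, seq_length, pad_id=0):
    """Pad (or naively truncate) one example to seq_length and build its mask."""
    token_ids = token_ids[:seq_length]
    type_ids = type_ids[:seq_length]
    n_pad = seq_length - len(token_ids)
    return {
        "input_word_ids": token_ids + [pad_id] * n_pad,
        "input_mask": [1] * len(token_ids) + [0] * n_pad,  # 1 = real token, 0 = padding
        "input_type_ids": type_ids + [0] * n_pad,
    }

# Placeholder ids standing in for "[CLS] hello world [SEP]".
packed = pack_inputs([101, 7592, 2088, 102], [0, 0, 0, 0], seq_length=6)
```

All three lists have the same length, so they can be stacked into batch tensors, and `input_mask` tells the encoder which positions to ignore.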

**Simple Example using Keras**

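```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (registers the ops the preprocessor needs)

# Raw strings go in; the preprocessor turns them into BERT's input tensors.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder_inputs = preprocessor(text_input)

# Small BERT encoder (10 layers, hidden size 256, 4 heads), trainable for fine-tuning.
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/2", trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 256].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 256].
```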

### BERT - Semantic Similarity

- Initial problem with BERT for semantic similarity:
  - Finding the most similar pair in a collection of 10,000 sentences requires
    about 50 million inference computations (`n(n-1)/2` sentence pairs, ~65
    hours) with BERT.  
  - This makes the construction of BERT unsuitable for semantic similarity search.
- Research published in 2019 modified BERT to derive
  semantically meaningful sentence embeddings that can be compared using
  `cosine-similarity`; the resulting model is Sentence-BERT (`SBERT`). 
- With `SBERT`, each sentence is encoded once into a contextualized embedding,
  and the embeddings are then compared directly.
- We can use the `SentenceTransformer` class from the `sentence-transformers`
  library (with models hosted on the Hugging Face Hub) to map sentences to
  embeddings. 
- BERT is limited to `512 word pieces`, which corresponds to about 300-400 words 
- If resumes and job descriptions exceed this constraint, we would need to
  think about ways to work around this. 
- One approach would be to break the resume/job description up into smaller
  sections, compute a similarity score for each section, and then take the
  average. We would then rank the top `n` jobs for a user based on this
  score. 
- **Link**: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial
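
The section-averaging idea above could look roughly like this. Everything here is an assumption for illustration: the 2-dimensional vectors stand in for real `SBERT` sentence embeddings, the job names are invented, and averaging over every resume-section/job-section pair is just one possible reading of "compute a similarity score for each section then take the average".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_section_similarity(resume_sections, job_sections):
    """Average similarity over every resume-section / job-section pair."""
    scores = [cosine_similarity(r, j)
              for r in resume_sections for j in job_sections]
    return sum(scores) / len(scores)

def top_n_jobs(resume_sections, jobs, n):
    """Rank job names by average section similarity, highest first."""
    ranked = sorted(
        jobs,
        key=lambda name: average_section_similarity(resume_sections, jobs[name]),
        reverse=True)
    return ranked[:n]

# Hypothetical embeddings: one vector per section that fits in 512 word pieces.
resume = [[1.0, 0.0], [0.8, 0.2]]
jobs = {
    "data scientist": [[0.9, 0.1]],
    "chef": [[0.0, 1.0]],
    "ml engineer": [[1.0, 0.0], [0.7, 0.3]],
}
best_matches = top_n_jobs(resume, jobs, n=2)
```

Because each section is embedded once and then compared with cheap vector arithmetic, this avoids running BERT on every pair of texts, which was the original bottleneck.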