Overview

  • In this section, we will talk about tokenization, which is a huge part of NLP and of how machines understand language.
  • Understanding language is a daunting task for machines: we want them to be able to read text and grasp its meaning.
  • For machines to be able to learn language, they need to break text into standard units, called tokens, and then process those units.
  • This process is called tokenization, and its output is fed as input to language models such as BERT.
  • To date, we still don't know the level of semantic understanding models learn, though it is thought that they learn syntactic knowledge at the lower levels of the neural network and semantic knowledge at the higher levels. (Source)

Technicals of Tokenization

  • Let’s dive a little deeper into how tokenization works today.
  • To start off, there are many ways to tokenize text: you can split on spaces, add a split character between words, or simply break the input sequence into separate words.

  • This can be visualized by the image above, from Source.
  • As we stated earlier, we use one of these options as a way to split the larger text into a smaller unit, a token, to serve as input to the model.
  • Additionally, in order for the model to learn relationships between words in a sequence of text, we need to represent each token as a vector.
  • We do this in lieu of hard-coding grammatical rules into our system, as the complexity of doing so would be exponential, since the rules would change per language.
  • Instead, with a vector representation, the model can encode meaning in the dimensions of this vector (a minimal sketch of these steps follows this list).
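  • To make the options above concrete, here is a minimal sketch in Python of splitting text into word-level tokens and mapping them to integer IDs (which an embedding layer would later turn into vectors). The example sentence and the tiny vocabulary built from it are assumptions for illustration only.

```python
import re

text = "Machines need to break text into standard units."

# Option 1: tokenize by splitting on whitespace.
whitespace_tokens = text.split()

# Option 2: tokenize on word boundaries, keeping punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Map each token to an integer ID; a model's embedding layer would map
# these IDs to the vector representations discussed above.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]

print(whitespace_tokens)
print(word_tokens)
print(token_ids)
```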

Sub-word Tokenization

  • Subword Tokenization is a method that breaks words down into sub-tokens.
  • The main idea behind sub-word tokenization is that many words in a language have a common prefix or suffix, and by breaking words into smaller units, we can more effectively handle rare and out-of-vocabulary words.
  • Sub-word tokenization better handles out-of-vocabulary (OOV) words by combining one or more common sub-words (such as “any” and “place” for “anyplace”).
  • This also reduces the model size and helps in efficiency.
  • A toy illustration of this splitting idea is shown below, followed by a few algorithms that perform sub-word tokenization.
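  • The sketch below shows the basic idea: an out-of-vocabulary word like “anyplace” can still be represented by pieces that are in the vocabulary. The tiny vocabulary and the greedy longest-match strategy here are simplifying assumptions for illustration; the algorithms below each build their vocabulary and choose segmentations differently.

```python
def split_into_subwords(word, vocab):
    """Greedily take the longest vocabulary entry that matches at each position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return [word]  # no piece matches; fall back to the whole word
    return pieces

vocab = {"any", "place", "some", "thing", "body"}
print(split_into_subwords("anyplace", vocab))   # ['any', 'place']
print(split_into_subwords("something", vocab))  # ['some', 'thing']
```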

Byte Pair Encoding (BPE)

  • The algorithm starts with a set of individual characters as the initial subwords.
  • Then, it iteratively replaces the most frequent pair of bytes (or characters) in the text with a new, unused byte.
  • This new byte represents the merged pair and is considered a new subword.
  • This process is repeated for a fixed number of iterations or until a certain number of subwords is reached.
  • For example, consider the word “internationalization”. Starting from individual characters, BPE would first merge the most frequent adjacent pair (say “i” and “n” into the new symbol “in”), then the next most frequent pair (perhaps “t” and “i” into “ti”), and so on.
  • Depending on the corpus and the number of merges, the resulting subwords might be pieces such as “in”, “ter”, “nation”, “al”, and “ization”. (Example generated with help from ChatGPT.)
  • BPE: just uses the frequency of occurrences to identify the best merge at every iteration, until it reaches the predefined vocabulary size. (Source) A minimal training sketch follows this list.
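  • Below is a minimal sketch of the BPE merge loop described above, written in plain Python. The toy word-frequency corpus and the fixed number of merges are assumptions for illustration; real implementations also record the learned merge rules so they can be applied to new text.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Each word starts as a sequence of individual characters.
word_freqs = {tuple("lower"): 5, tuple("lowest"): 3, tuple("newer"): 6}

for _ in range(4):  # fixed number of merge iterations
    pair_counts = get_pair_counts(word_freqs)
    best = max(pair_counts, key=pair_counts.get)
    word_freqs = merge_pair(best, word_freqs)
    print("merged", best)

print(list(word_freqs))  # the words, now segmented into learned subwords
```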

Unigram Subword Tokenization

  • The algorithm starts by defining a vocabulary of the most frequent words and representing the remaining words as combinations of the vocabulary words.
  • Then it iteratively splits the most probable word into smaller parts until a certain number of subwords is reached.
  • Unigram: a fully probabilistic model which does not rely on frequency of occurrences alone. Instead, it trains a language model over the subwords, removes the token whose removal hurts the overall likelihood the least, and then starts over, until it reaches the final vocabulary size. (Source) A sketch of how such a model segments a word at inference time follows this list.
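  • To show what “fully probabilistic” means in practice, the sketch below segments a word under a unigram model by picking the split whose subword probabilities give the highest total likelihood (a Viterbi search over prefixes). The subword probabilities are made-up values for illustration; in the real algorithm they are learned during the pruning loop described above.

```python
import math

# Assumed (made-up) unigram log-probabilities for a handful of subwords.
subword_logp = {
    "inter": math.log(0.05), "national": math.log(0.04), "nation": math.log(0.03),
    "al": math.log(0.08), "ization": math.log(0.02), "i": math.log(0.10),
    "zation": math.log(0.01),
}

def unigram_tokenize(word):
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)  # (best log-prob, back-pointer)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in subword_logp and best[start][0] > -math.inf:
                score = best[start][0] + subword_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the back-pointers to recover the most likely segmentation.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        if start is None:
            return [word]  # no valid segmentation under this vocabulary
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

print(unigram_tokenize("internationalization"))  # ['inter', 'national', 'ization']
```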

WordPiece

  • The algorithm starts with a set of words, and then iteratively splits the most probable word into smaller parts. For each split, the algorithm assigns a probability to the newly created subwords based on their frequency in the text. This process is repeated until a certain number of subwords is reached.
  • For example, consider the word “internationalization”.
  • Using WordPiece, the algorithm would first identify the most probable word “international” and split it into “international” and “ization”.
  • Next, it would identify the next most probable word “ization” and split it into “i” and “zation”. The resulting subwords would be: “international”, “ization”, “i”, “zation”. Note: This example was generated by OpenAI’s ChatGPT.
  • WordPiece: similar to BPE, it uses frequency of occurrences to identify potential merges, but it makes the final decision based on the likelihood of the merged token. (Source) A sketch of this scoring rule follows this list.
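  • The sketch below contrasts the two merge criteria: raw pair frequency (BPE) versus a likelihood-style score that divides the pair frequency by the frequencies of its parts (the rule commonly attributed to WordPiece). The toy corpus is an assumption for illustration.

```python
from collections import Counter

# Toy corpus: each word is a sequence of characters.
words = [tuple("hugging"), tuple("hug"), tuple("hugs"), tuple("bug"), tuple("bun")]

symbol_freq = Counter(s for w in words for s in w)
pair_freq = Counter((a, b) for w in words for a, b in zip(w, w[1:]))

def wordpiece_score(pair):
    # Higher when the pair occurs together more often than its parts' overall frequencies suggest.
    return pair_freq[pair] / (symbol_freq[pair[0]] * symbol_freq[pair[1]])

best_by_freq = max(pair_freq, key=pair_freq.get)      # what BPE would merge next
best_by_score = max(pair_freq, key=wordpiece_score)   # what WordPiece would merge next

print("most frequent pair:  ", best_by_freq)
print("highest-scoring pair:", best_by_score)
```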

SentencePiece

  • One of the main differences between SentencePiece and typical BPE or WordPiece tooling is that SentencePiece works directly on raw, untokenized text, which means that it does not require any external pre-tokenization or language-specific preprocessing.
  • It is able to learn the subword units directly from the input text.
  • SentencePiece also has a built-in mechanism for handling out-of-vocabulary words.
  • It can generate new subword units on the fly for words that are not present in the initial vocabulary.
  • Additionally, SentencePiece can handle multiple languages with a single model and can also be used for text normalization.
  • SentencePiece is widely used in modern transformer-based models such as ALBERT, XLNet, and T5.
  • These models use SentencePiece to segment the input text into subword units, which allows them to handle out-of-vocabulary words and to reduce the model size; a short usage sketch follows this list.
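  • Below is a minimal sketch of training and using a SentencePiece model with the open-source sentencepiece Python package (pip install sentencepiece). The corpus file name, model prefix, vocabulary size, and example sentence are assumptions for illustration; any plain-text corpus with one sentence per line will do.

```python
import sentencepiece as spm

# Train directly on raw text; no pre-tokenization is required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # assumed: a plain-text file, one sentence per line
    model_prefix="toy_sp",   # writes toy_sp.model and toy_sp.vocab
    vocab_size=1000,
    model_type="unigram",    # "bpe" is also supported
)

# Load the trained model and segment new text into subword units.
sp = spm.SentencePieceProcessor(model_file="toy_sp.model")
print(sp.encode("internationalization is hard", out_type=str))  # subword pieces
print(sp.encode("internationalization is hard", out_type=int))  # their IDs
```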

  • Below is a comparative analysis of these four algorithms, as stated by ChatGPT:

References