Overview

  • Preprocessing is a critical step that transforms and cleans raw text so that it becomes easier to understand and more efficient to work with.
  • These preprocessing techniques reduce the complexity of the language data, increase the efficiency of the computational processes, and often enhance the performance of NLP models. However, they need to be applied judiciously, keeping in mind the requirements of the specific NLP task.
  • Below, we will look at a few common ways of preprocessing raw text for NLP.

Stemming

  • Stemming is a basic and heuristic process in NLP used to reduce words to their root or base form. This is accomplished by chopping off word endings, most often suffixes. For example, the stem of the words “jumps”, “jumping”, and “jumped” is “jump”. Stemming helps reduce the vocabulary that a model needs to know, which can improve computational efficiency.
  • However, stemming can sometimes be too crude a method, as it merely chops off the ends of words using simple rules without understanding the context. This could lead to erroneous outputs where the stemmed word is not a valid word itself or has a different meaning.
  • Let’s see this in action with PorterStemmer to perform suffix trimming, as referenced from nlplanet.org:
  • “The Porter stemmer algorithm does not keep a lookup table for actual stems of each word but applies algorithmic rules to generate stems.”
  • Note: for languages other than English, the SnowballStemmer can be leveraged to perform stemming.
Let’s see how to use it. First, we import the necessary modules.

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
The PorterStemmer class can be imported from the nltk.stem module. Then, we instantiate a PorterStemmer object.

stemmer = PorterStemmer()
The stem function can then be used to stem individual words.

print(stemmer.stem("cat")) # -> cat
print(stemmer.stem("cats")) # -> cat

print(stemmer.stem("walking")) # -> walk
print(stemmer.stem("walked")) # -> walk

print(stemmer.stem("achieve")) # -> achiev

print(stemmer.stem("am")) # -> am
print(stemmer.stem("is")) # -> is
print(stemmer.stem("are")) # -> are
To stem all the words in a text, we can use the PorterStemmer on each token produced by the word_tokenize function.

text = "The cats are sleeping. What are the dogs doing?"

tokens = word_tokenize(text)
tokens_stemmed = [stemmer.stem(token) for token in tokens]
print(tokens_stemmed)
# ['the', 'cat', 'are', 'sleep', '.', 'what', 'are', 'the', 'dog', 'do', '?']

Lemmatization

  • Lemmatization, similar to stemming, is used to reduce a word to its base form, but it considers the morphological analysis of the words. It returns the lemma of the word, which is the dictionary form or the base form. The process involves understanding the context and part of speech of the word, making it more complex and accurate than stemming.
  • For example, the word “better” has “good” as its lemma. Lemmatization would correctly identify this, while stemming would not. However, the complexity of lemmatization can also make it computationally more expensive than stemming.
  • Let’s see lemmatization in action, as referenced from nlplanet.org:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
We’ll use the WordNetLemmatizer, which leverages WordNet to find existing lemmas. Then, we create an instance of the WordNetLemmatizer class and use the lemmatize method.

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("achieve")) # -> achieve
The lemmatizer is able to reduce the word achieve to its lemma achieve, unlike stemmers, which reduce it to the non-existent word achiev.
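As noted in the first bullet, lemmatization depends on the part of speech of the word. The lemmatize method accepts an optional pos argument ("n" for noun, "v" for verb, "a" for adjective); here is a short sketch of how it changes the result:

print(lemmatizer.lemmatize("better"))           # -> better (default POS is noun)
print(lemmatizer.lemmatize("better", pos="a"))  # -> good
print(lemmatizer.lemmatize("are", pos="v"))     # -> be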

Stopwords

  • Stopwords in NLP are the words that are filtered out before processing a text. These are usually the most common words in a language like “is”, “in”, “an”, “the”, “and”, etc. These words do not carry significant meaning and are usually removed from texts to help reduce the dataset size and improve computational efficiency.
  • However, in some NLP tasks, like sentiment analysis, stopwords can carry significant meaning, and removing them could potentially affect the performance of the model. Therefore, the removal of stopwords should be carefully considered based on the task at hand.

  • Let’s see stopwords in action, as referenced from nlplanet.org:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
Then, we retrieve the stopwords for the English language with stopwords.words("english"). There are 179 stopwords in total; note that they are all lowercase, and they are words that appear very frequently in many types of English text.

english_stopwords = stopwords.words('english')
print(f"There are {len(english_stopwords)} stopwords in English")
print(english_stopwords[:10])
There are 179 stopwords in English
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
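To actually filter stopwords out of a text, we can keep only the tokens whose lowercase form is not in this list; a minimal sketch reusing the word_tokenize function from earlier:

text = "The cats are sleeping. What are the dogs doing?"
tokens = word_tokenize(text)
# Keep only the tokens that are not stopwords (the comparison is done in lowercase).
filtered_tokens = [token for token in tokens if token.lower() not in english_stopwords]
print(filtered_tokens)
# ['cats', 'sleeping', '.', 'dogs', '?']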

Tokenization

  • Tokenization is the process of breaking up text into smaller pieces called tokens. These tokens could be sentences or words. This is often the first step in NLP preprocessing.
  • For example, sentence tokenization breaks a paragraph into individual sentences, while word tokenization breaks a sentence into individual words, as in the sketch below.
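A minimal sketch using NLTK’s sent_tokenize and word_tokenize functions (both rely on the punkt resource downloaded earlier):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "The cats are sleeping. What are the dogs doing?"
# Sentence tokenization: split the text into sentences.
print(sent_tokenize(text))
# ['The cats are sleeping.', 'What are the dogs doing?']
# Word tokenization: split the text into words and punctuation tokens.
print(word_tokenize(text))
# ['The', 'cats', 'are', 'sleeping', '.', 'What', 'are', 'the', 'dogs', 'doing', '?']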

Lowercasing

  • Lowercasing is a common preprocessing step where all the text is converted to lower case. This helps to avoid having multiple copies of the same words. For example, “House”, “house”, and “HOUSE” will be considered as different words unless they are all converted into the same case, preferably lower case.
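In Python this is a single call to the built-in str.lower method; a minimal sketch:

words = ["House", "house", "HOUSE"]
# After lowercasing, all three variants become the same token.
print([word.lower() for word in words])
# ['house', 'house', 'house']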

Punctuation Removal

  • Punctuation can provide grammatical context to a sentence, which supports our understanding. However, for a vectorizer that simply counts words without regard to context, punctuation adds no value, so we remove all special characters.
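A minimal sketch using str.translate together with the string.punctuation constant from the standard library:

import string

text = "Hello, world! Is punctuation really needed?"
# Build a translation table that deletes every ASCII punctuation character.
no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
print(no_punctuation)
# Hello world Is punctuation really needed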

Spell Check and Correction

  • Typos and spelling mistakes are common in text data. Spell check and correction can be used to correct these errors. This step can help in reducing multiple copies of words. For example, “speling” and “spelling” will be considered as two different words unless corrected.
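One way to do this in Python is with a third-party package such as pyspellchecker (used here only as an illustrative assumption, not something required by the rest of this article); a minimal sketch:

# Requires: pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()
# correction() returns the most likely intended spelling of a word.
print(spell.correction("speling"))   # expected -> spelling
print(spell.correction("spelling"))  # expected -> spelling (already correct)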

Noise Removal

  • Noise removal is about removing characters, digits, and pieces of text that can interfere with your text analysis. Noise removal can be performed in various ways, including the removal of text file headers and footers, HTML, XML, and so on.
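As a small illustration, a regular expression can strip HTML tags from a scraped snippet (a simple sketch; a dedicated HTML parser is more robust for real-world documents):

import re

html = "<p>The <b>cats</b> are sleeping.</p>"
# Remove anything that looks like an HTML/XML tag.
clean_text = re.sub(r"<[^>]+>", "", html)
print(clean_text)
# The cats are sleeping.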

Text Normalization

  • Text normalization includes converting all text to the same case (usually lowercase), removing punctuation, converting numbers to their word equivalents, and so on.
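A minimal normalization sketch that combines lowercasing and punctuation removal (converting numbers to their word equivalents would need an extra package such as num2words, mentioned here only as an assumption and not shown):

import string

def normalize(text):
    # Lowercase the text, then strip punctuation.
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalize("The CATS are sleeping!"))
# the cats are sleeping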

Part-of-Speech Tagging

  • Part-of-speech tagging involves identifying the part of speech (noun, verb, adjective, etc.) of each word in a sentence. This can be important for understanding the sentence structure and can be especially useful in tasks like named entity recognition, question answering, etc.
  • We will look at this in action, as referenced from nlplanet.org:
POS Tagging with Python
The NLTK library provides an easy-to-use pos_tag function that takes a text as input and returns the part-of-speech of each token in the text.

text = word_tokenize("They refuse to go")
print(nltk.pos_tag(text))

text = word_tokenize("We need the refuse permit")
print(nltk.pos_tag(text))
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('go', 'VB')]
[('We', 'PRP'), ('need', 'VBP'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
PRP is a personal pronoun, NN a noun, VBP a present-tense verb, VB a base-form verb, DT a determiner, and so on; the complete list of parts of speech that the pos_tag function can return is the Penn Treebank tag set. In the previous example, the word refuse is correctly tagged as a verb or a noun depending on the context.

The pos_tag function assigns parts-of-speech to words leveraging their context (i.e. the sentences they are in), applying rules learned over tagged corpora.