NLP Text Preprocessing Steps for Machine Learning Algorithms

Dipayon Paul
11 min read · Apr 13, 2023


Image source: https://python.plainenglish.io/text-preprocessing-in-natural-language-processing-in-python-8aeb7bfdaee7

Text preprocessing is a crucial step in natural language processing and machine learning, where raw text data is transformed into a format that can be easily understood and analyzed by machines. It involves cleaning, transforming, and enriching the raw text data to improve the accuracy and efficiency of machine learning algorithms.

Text data often contains noise in the form of unwanted characters, special symbols, HTML tags, and inconsistent formats. Preprocessing the text data helps in removing these noise elements, standardizing the text format, and extracting meaningful features that can be used to build machine learning models.

Text preprocessing includes several steps, such as lowercasing, tokenization, stop word removal, stemming, lemmatization, spell checking and correction, part-of-speech tagging, named entity recognition, vectorization, and feature extraction.

Properly preprocessed text data can lead to more accurate and efficient machine learning models, especially in tasks such as sentiment analysis, text classification, text summarization, and language translation.

The following are the common steps involved in text preprocessing:

1. Tokenization

The process of converting a raw text into a sequence of tokens (words, phrases, symbols, etc.) is called tokenization.

Tokenization is needed in text processing to break down a larger text into smaller pieces called tokens, which can be analyzed more easily. Tokenization is important because most NLP algorithms require their input to be in the form of tokens, rather than a full block of text.

For example, consider the sentence: “The quick brown fox jumped over the lazy dog.” Tokenization of this sentence would break it down into individual tokens such as “The”, “quick”, “brown”, “fox”, “jumped”, “over”, “the”, “lazy”, and “dog”. Once the text is tokenized, various NLP tasks can be performed on these tokens, such as part-of-speech tagging, named entity recognition, sentiment analysis, etc.

In addition to being a necessary step for most NLP algorithms, tokenization can also improve the accuracy of the analysis by reducing ambiguity in the text. For example, consider the word “running”. Depending on the context, it could be interpreted as a verb or a noun. By breaking the text into tokens, the algorithm can more easily determine the intended meaning of each word based on the surrounding context.

A Python code example of tokenization is given below:

from nltk.tokenize import word_tokenize

# run nltk.download('punkt') once if the tokenizer data is missing
text = "This is a sample text for tokenization."
tokens = word_tokenize(text)
print(tokens)

Output:

['This', 'is', 'a', 'sample', 'text', 'for', 'tokenization', '.']
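
Tokenization can also be done at the sentence level rather than the word level. Below is a minimal sketch using NLTK's sent_tokenize; the example text is purely illustrative:

from nltk.tokenize import sent_tokenize

text = "This is the first sentence. Here is another one."
sentences = sent_tokenize(text)
print(sentences)
# ['This is the first sentence.', 'Here is another one.']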

2. Stopword Removal

Stopwords are commonly used words in a language, such as “the,” “and,” “a,” etc., that do not add much meaning to the text. Removing these words helps to reduce the noise in the text data.

Stop word removal is necessary in text preprocessing because stop words are very common in a language but convey little specific meaning in the context of a sentence. They take up space and can slow down the analysis, so removing them from the text can improve the efficiency and accuracy of the analysis.

For example, in the sentence “The cat sat on the mat”, the stop words are “the” and “on”. Removing these stop words would result in the remaining words having more importance in the analysis: “cat”, “sat”, and “mat”.

A Python code example of stopword removal is given below:

from nltk.corpus import stopwords

# run nltk.download('stopwords') once if the corpus is missing
stop_words = set(stopwords.words("english"))

# 'tokens' is the token list produced in the tokenization step above;
# the stopword list is lowercase, so the capitalized "This" is kept
tokens = [word for word in tokens if word not in stop_words]
print(tokens)

Output:

['This', 'sample', 'text', 'tokenization', '.']

3. Stemming

Stemming is the process of reducing a word to its base or root form. For example, the words “jumping”, “jumps”, and “jumped” would all be reduced to “jump” by a stemming algorithm. The main goal of stemming is to reduce different forms of a word to a common base form, which can help in tasks like text classification, sentiment analysis, and information retrieval.

Stemming is important in text processing because it helps to reduce the number of unique words in a text corpus, which can make it easier to process and analyze the data. By reducing words to their base form, we can group together different variations of the same word and treat them as a single term. This can improve the accuracy and efficiency of many natural language processing tasks.

There are several popular stemming algorithms used in text processing, including the Porter stemming algorithm and the Snowball stemming algorithm. These algorithms work by applying a set of rules to strip affixes from words and reduce them to their base form.

A Python code example of stemming is given below:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# stem the stop-word-filtered tokens from the previous step
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)

Output:

['thi', 'sampl', 'text', 'token', '.']
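
The Snowball stemmer mentioned above is also available in NLTK. Here is a minimal sketch; the exact stems can differ slightly from the Porter stemmer's:

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
# e.g. "running" -> "run", "studies" -> "studi"
print([snowball.stem(word) for word in ["running", "studies", "jumped"]])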

4. Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (known as a lemma) so that they can be analyzed as a single item, rather than multiple different forms. For example, the word “running” can be reduced to its base form “run” through lemmatization.

Lemmatization is required in text preprocessing to reduce the variations of words and to group similar words together, which can aid in the analysis and understanding of the text. By reducing words to their base form, it becomes easier to count and analyze occurrences of words and to identify relationships between words in the text. This is particularly useful in natural language processing (NLP) tasks such as sentiment analysis, topic modeling, and text classification.

Stemming and Lemmatization are two common techniques used in natural language processing for reducing words to their base or root forms. The main difference between stemming and lemmatization is that stemming is a crude process of removing suffixes from words to obtain their root forms, while lemmatization is a more sophisticated process of mapping words to their base forms using a vocabulary and morphological analysis of words.

Stemming operates by chopping off the ends of words using simple rules and heuristics, without taking into account the context or meaning of the word. For example, the word “running” would be stemmed to “run”, while “studies” would be reduced to “studi”, which is not a dictionary word. Stemming is faster and less resource-intensive than lemmatization, but it can produce inaccuracies due to over-stemming or under-stemming of words.

Lemmatization, on the other hand, uses a vocabulary and morphological analysis of words to map words to their base forms, or lemmas. This process takes into account the context and part of speech of the word and can produce more accurate results than stemming. For example, the word “running” would be lemmatized to “run”, and “ran” would also be lemmatized to “run”, provided the lemmatizer knows both are verbs. However, lemmatization is more computationally expensive than stemming and requires more resources, such as a dictionary or thesaurus.
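
A minimal sketch of this difference using NLTK's PorterStemmer and WordNetLemmatizer is shown below; note that the lemmatizer needs to be told the part of speech (here pos="v") to map verb forms correctly:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# run nltk.download('wordnet') once if the corpus is missing
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"), stemmer.stem("ran"))        # run ran
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("ran", pos="v"))                # run run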

A Python code example of lemmatization is given below:

from nltk.stem import WordNetLemmatizer

# run nltk.download('wordnet') once if the corpus is missing
lemmatizer = WordNetLemmatizer()
# without a pos argument, each word is treated as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_words)

Output:

['This', 'sample', 'text', 'tokenization', '.']

5. Part-of-speech (POS) tagging

Part-of-speech (POS) tagging is the process of identifying and labeling the part of speech of each word in a sentence, such as noun, verb, adjective, adverb, etc. POS tagging is useful in various natural language processing tasks like sentiment analysis, text classification, information extraction, and machine translation.

Here’s an example of how to perform POS tagging using the Natural Language Toolkit (nltk) library in Python:

import nltk

# run nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# once if the required resources are missing

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)
# Perform POS tagging
pos_tags = nltk.pos_tag(words)
# Print the POS tags
print(pos_tags)

Output:

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

In the output, each word in the sentence is tagged with its respective part-of-speech label. For example, ‘The’ is tagged as a determiner (DT), ‘quick’ is tagged as an adjective (JJ), ‘brown’ and ‘fox’ are both tagged as nouns (NN), ‘jumps’ is tagged as a verb in the third person singular present (VBZ), ‘over’ is tagged as a preposition (IN), and so on.

6. Named Entity Recognition (NER)

Named Entity Recognition (NER) is a natural language processing technique used to identify and extract named entities from a given text. Named entities can be people, organizations, locations, products, and so on.

NER is an important step in text preprocessing as it helps in extracting important information from the text, which can be used for various tasks like sentiment analysis, text classification, information retrieval, etc.

Here is a Python code example for performing NER using the spaCy library:

import spacy

# the small English model can be installed once with:
#   python -m spacy download en_core_web_sm

# Load the English language model
nlp = spacy.load("en_core_web_sm")
# Sample text for NER
text = "Apple is looking at buying U.K. startup for $1 billion"
# Process the text with the language model
doc = nlp(text)
# Extract named entities from the text
for ent in doc.ents:
    print(ent.text, ent.label_)

In this code, we first load the English language model using the spacy.load() method. We then define a sample text for which we want to perform NER. We process the text using the loaded language model and get a Doc object. Finally, we loop through the ents property of the Doc object to extract the named entities and their labels.

The output of the above code will be:

Apple ORG
U.K. GPE
$1 billion MONEY

In this output, we can see that the named entities in the sample text have been correctly identified and their labels have been assigned. “Apple” is identified as an organization (ORG), “U.K.” is identified as a geopolitical entity (GPE), and “$1 billion” is identified as a monetary value (MONEY).

7. Spell Checking and Correction

Spell checking and correction is the process of identifying and correcting spelling errors in the text. It is an important step in text preprocessing as it can improve the accuracy of natural language processing algorithms that are applied to text data.

Python provides several libraries for spell-checking and correction such as PySpellChecker, TextBlob, autocorrect, etc. Here is an example using the PySpellChecker library:

!pip install pyspellchecker

from spellchecker import SpellChecker

# initialize the spell checker
spell = SpellChecker()

# example sentence with spelling errors
sentence = "Ths sentnce hs spellng erors that nd to b corcted."

# tokenize the sentence on whitespace
tokens = sentence.split()

# iterate over tokens and replace misspelled ones with the suggested correction
for i in range(len(tokens)):
    corrected = spell.correction(tokens[i])
    # correction() returns None when no suggestion is found
    if corrected and corrected != tokens[i]:
        tokens[i] = corrected

# join the corrected tokens back into a sentence
corrected_sentence = ' '.join(tokens)
print(corrected_sentence)

The expected output of the above code is:

This sentence has spelling errors that need to be corrected.

In this example, we first install and import the PySpellChecker library. We then define an example sentence with spelling errors that need to be corrected. We tokenize the sentence using the split() method and iterate over each token in the sentence. For each token, we use the correction() method of the SpellChecker class to check if it is misspelled. If it is, we replace it with the corrected spelling. Finally, we join the corrected tokens back into a sentence using the join() method and print the corrected sentence.
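
TextBlob, mentioned above, offers a more compact alternative. Below is a minimal sketch, assuming the textblob package (and its corpora) is installed:

from textblob import TextBlob

text = "Ths sentnce hs spellng erors."
corrected = TextBlob(text).correct()
print(corrected)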

8. Removing HTML tags, punctuation, and special characters

Removing HTML tags, punctuation, and special characters is necessary for text preprocessing to clean the text data and make it ready for further processing. HTML tags, punctuation, and special characters do not contribute to the meaning of the text and can cause issues during text analysis.

Here is an example of Python code for removing HTML tags, punctuation, and special characters from a text:

import re
import string

def remove_html_tags(text):
    # strip anything that looks like an HTML tag
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

def remove_punctuation(text):
    # drop all punctuation characters
    clean_text = text.translate(str.maketrans('', '', string.punctuation))
    return clean_text

def remove_special_characters(text):
    # keep only letters, digits, and whitespace
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return clean_text

text = "<p>Hello, world!</p>"
clean_text = remove_html_tags(text)
clean_text = remove_punctuation(clean_text)
clean_text = remove_special_characters(clean_text)
print(clean_text)

The output of the above code is:

Hello world

In the above example, we define three functions for removing HTML tags, punctuation, and special characters from the text. The re and string modules are used for regular expression and string operations.

The remove_html_tags function uses a regular expression to remove all HTML tags from the text.

The remove_punctuation function uses the translate method of a string to remove all punctuation marks from the text.

The remove_special_characters function uses a regular expression to remove all special characters from the text except for alphabets, digits, and whitespace.

Finally, we apply these functions in a pipeline to the original text and print the cleaned text as output.
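
Regular expressions are only a rough way to strip markup; a dedicated HTML parser is usually more robust. Below is a minimal sketch using BeautifulSoup, assuming the beautifulsoup4 package is installed:

from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b>!</p>"
text = BeautifulSoup(html, "html.parser").get_text()
print(text)  # Hello, world!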

9. Converting to Lowercase

Lowercasing the text is a common preprocessing step in natural language processing (NLP) to make text data consistent and easier to analyze. This step involves converting all the letters in the text to lowercase so that words that differ only by the case are treated as the same word.

Lowercasing the text helps in reducing the number of unique words in the text, which in turn helps in reducing the sparsity of the data. This is important because most NLP algorithms work better when there is less sparsity in the data. Additionally, it can help with tasks like word matching and search, since it eliminates the case sensitivity of the text.

For example, consider the following two sentences:

“The quick brown fox jumps over the lazy dog.”
“THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.”

If we lowercase both of these sentences, they become identical:

“the quick brown fox jumps over the lazy dog.”

This makes it easier to analyze and process the text since we only need to consider one unique sentence instead of two.

A Python code example is given below:

text = "This is a sample TEXT for preprocessing"
text = text.lower()
print(text)

Output:

this is a sample text for preprocessing

10. Text Vectorization

Text vectorization is the process of transforming raw text into a numerical representation that can be used by machine learning algorithms. This is a crucial step in text preprocessing as most machine learning algorithms work with numerical data. There are several ways to vectorize text, including Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word Embeddings.

The BoW representation treats text as a collection of unique words, ignoring their order and context. It creates a vocabulary of the unique words present in the text corpus and then generates a matrix where each row represents a document and each column represents a word in the vocabulary. The value in each cell of the matrix is the frequency of that word in the corresponding document.

TF-IDF representation is similar to BoW, but it takes into account the importance of a word in a document and in the entire corpus. It assigns a weight to each word based on its frequency in the document and its inverse frequency in the entire corpus.

Word embeddings represent words as dense vectors in a continuous vector space, where the distance between vectors reflects the semantic similarity between the words. Word embeddings are created by training a neural network on a large corpus of text and then extracting the learned vector representations of the words.
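
Below is a minimal sketch of the word-embedding approach using gensim's Word2Vec; the toy corpus and parameters are purely illustrative (real embeddings are trained on much larger corpora), and it assumes the gensim package is installed:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# train a tiny Word2Vec model (gensim 4.x API)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# each word is now a dense 50-dimensional vector
print(model.wv["cat"].shape)
print(model.wv.similarity("cat", "dog"))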

Here is an example of how to vectorize text using BoW and TF-IDF representations in Python:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example text corpus
corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]

# Vectorize text using BoW representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("BoW representation:")
print(X_bow.toarray())
print("Vocabulary:")
# on scikit-learn versions older than 1.0, use get_feature_names() instead
print(list(vectorizer.get_feature_names_out()))

# Vectorize text using TF-IDF representation
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)

print("TF-IDF representation:")
print(X_tfidf.toarray().round(2))

Output:

BoW representation:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Vocabulary:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
TF-IDF representation:
[[0.   0.47 0.58 0.38 0.   0.   0.38 0.   0.38]
 [0.   0.69 0.   0.28 0.   0.54 0.28 0.   0.28]
 [0.51 0.   0.   0.27 0.51 0.   0.27 0.51 0.27]
 [0.   0.47 0.58 0.38 0.   0.   0.38 0.   0.38]]

In the code above, we first create a sample text corpus. Then we use the CountVectorizer and TfidfVectorizer classes from scikit-learn to vectorize the text using the BoW and TF-IDF representations, respectively. Finally, we print the resulting matrices and the vocabulary, rounding the TF-IDF weights for readability.
