Natural Language Processing Fundamentals
Lesson 2: Text Preprocessing Techniques in NLP
Objectives:
- Understand the importance of text preprocessing.
- Learn various text preprocessing techniques.
- Implement text cleaning, tokenization, and feature extraction.
- Practice with code examples.
2.1 Importance of Text Preprocessing
Text preprocessing is a crucial step in NLP: it transforms raw text into a format suitable for analysis. Removing noise and irrelevant information typically improves the performance of downstream NLP models.
2.2 Text Cleaning Techniques
2.2.1 Lowercasing: Convert all text to lowercase to ensure uniformity.
text = "Natural Language Processing is Fascinating!"
clean_text = text.lower()
print(clean_text)
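# Output: natural language processing is fascinating!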
2.2.2 Removing Punctuation: Strip out punctuation marks to focus on the words themselves.
import string
text = "Hello, world! Welcome to NLP."
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)
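# Output: Hello world Welcome to NLP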
2.2.3 Removing Numbers: Remove numbers if they are not relevant for the analysis.
import re
text = "There are 3 apples."
clean_text = re.sub(r'\d+', '', text)
print(clean_text)
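# Output: There are  apples.  (the removed digit leaves a double space; see 2.2.4)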
2.2.4 Removing White Spaces: Remove leading, trailing, and repeated internal white space.
text = "  This is a text  with extra spaces.  "
clean_text = " ".join(text.split())  # split() breaks on any whitespace run, so this also collapses internal runs
print(clean_text)
2.2.5 Handling Contractions: Expand contractions to their full forms (e.g., "isn't" to "is not") with the third-party contractions package (pip install contractions).
from contractions import fix
text = "I can't believe it's happening!"
expanded_text = fix(text)
print(expanded_text)
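# Typical output: I cannot believe it is happening!

In practice, the cleaning steps above are usually chained into a single helper. Below is a minimal sketch that applies them in order; the function name clean_text and the exact ordering are illustrative choices, not a standard API. Note that contractions are expanded first, while the apostrophes they depend on are still present.

import re
import string
from contractions import fix

def clean_text(text):
    # Expand contractions before punctuation removal strips the apostrophes
    text = fix(text)
    # Lowercase for uniformity
    text = text.lower()
    # Drop punctuation, then digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    # Collapse whitespace runs and trim the ends
    return " ".join(text.split())

print(clean_text("  I can't believe there are 3 apples!  "))
# Typical output: i cannot believe there are apples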
2.3 Tokenization
2.3.1 Word Tokenization: Split text into individual words.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models (newer NLTK versions may also need 'punkt_tab')
text = "Natural Language Processing is fascinating!"
tokens = word_tokenize(text)
print(tokens)
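# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']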
2.3.2 Sentence Tokenization: Split text into sentences.
from nltk.tokenize import sent_tokenize  # uses the same punkt models downloaded in 2.3.1
text = "Natural Language Processing is fascinating! It has many applications."
sentences = sent_tokenize(text)
print(sentences)
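# Output: ['Natural Language Processing is fascinating!', 'It has many applications.']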
2.3.3 Tokenization with SpaCy:
import spacy
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("Natural Language Processing is fascinating!")
tokens = [token.text for token in doc]
print(tokens)
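# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']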
2.4 Removing Stop Words
Stop words are high-frequency words such as "the", "is", and "and" that often carry little meaning for analysis.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]  # word tokens from section 2.3
print(filtered_tokens)
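# Output: ['Natural', 'Language', 'Processing', 'fascinating', '!'] ("is" has been removed)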
2.5 Stemming and Lemmatization
2.5.1 Stemming: Reduces words to a root form by stripping affixes with heuristic rules; the result is not always a valid word.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stems = [stemmer.stem(word) for word in words]
print(stems)
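# Output: ['run', 'runner', 'ran'] (note that "ran" is not reduced)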
2.5.2 Lemmatization: Reduces words to their base or dictionary form (the lemma), using vocabulary and part-of-speech context.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("running runners ran")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
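# Typical output: ['run', 'runner', 'run'] (exact lemmas depend on the POS tags the model assigns)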
2.6 Feature Extraction
2.6.1 Bag of Words (BoW): Convert text into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
documents = ["I love NLP.", "NLP is fun."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())
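# Output:
# ['fun' 'is' 'love' 'nlp']
# [[0 0 1 1]
#  [1 1 0 1]]
# Note: the default token pattern drops single-character tokens, so "I" is not in the vocabulary.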
2.6.2 Term Frequency-Inverse Document Frequency (TF-IDF): Weight words by how frequent they are within a document and how rare they are across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # same documents list as in 2.6.1
print(vectorizer.get_feature_names_out())
print(X.toarray())
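For reference, scikit-learn's TfidfVectorizer with its defaults (smooth_idf=True, norm='l2') computes the weight of a term t in a document d as
tfidf(t, d) = tf(t, d) * idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1,
with tf(t, d) the raw count of t in d, n the number of documents, and df(t) the number of documents containing t; each document vector is then L2-normalized.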
2.7 Summary and Next Steps
In this lesson, we explored various text preprocessing techniques, including text cleaning, tokenization, stop word removal, stemming, lemmatization, and feature extraction.
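As a closing example, the sketch below chains the main steps of this lesson into one function. The name preprocess and the exact ordering are illustrative choices for this lesson, not a fixed recipe; in practice you would pick only the steps your task needs.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Clean: lowercase and strip punctuation
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    # Tokenize into words
    tokens = word_tokenize(text)
    # Remove stop words, then stem what remains
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Natural Language Processing is fascinating!"))
# Typical output: ['natur', 'languag', 'process', 'fascin']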
Next Steps:
- In Lesson 3, we will delve into more advanced text preprocessing techniques and explore the concept of word embeddings.