Natural Language Processing Fundamentals

Lesson 3: Advanced Text Preprocessing and Word Embeddings

Objectives:

  • Learn advanced text preprocessing techniques.
  • Understand the concept and applications of word embeddings.
  • Implement word embeddings using popular methods.

3.1 Advanced Text Preprocessing Techniques

3.1.1 Handling Special Characters and HTML Tags: Remove or replace special characters and HTML tags from text.

from bs4 import BeautifulSoup
import re

html_text = "<p>Hello, world!</p>"
soup = BeautifulSoup(html_text, "html.parser")
text = soup.get_text()

# Remove anything that is not a word character or whitespace
clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)

3.1.2 Normalization: Convert text to a consistent representation, for example by applying Unicode normalization so that visually identical strings share the same underlying code points.

import unicodedata

text = "Café"
normalized_text = unicodedata.normalize('NFKD', text)
print(normalized_text)
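
Continuing the example above, a common follow-up to NFKD normalization is stripping accents entirely, for instance when a downstream tool expects ASCII-only text. This is a minimal sketch of that optional step, not part of Unicode normalization itself:

text = "Café au lait"
# Decompose accented characters, then drop the combining marks by encoding to ASCII
decomposed = unicodedata.normalize('NFKD', text)
ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
print(ascii_text)  # Cafe au lait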

3.1.3 Spelling Correction: Correct spelling errors in text.

from textblob import TextBlob

text = "I have a speling error."
corrected_text = TextBlob(text).correct()
print(corrected_text)

3.2 Introduction to Word Embeddings

Word embeddings are dense vector representations of words that capture semantic meaning. Unlike traditional one-hot encoding, embeddings represent words in a continuous vector space where similar words have similar representations.
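To make the contrast concrete, the sketch below compares a one-hot vector with dense embeddings. The three-word vocabulary and the 4-dimensional embedding values are made up purely for illustration:

import numpy as np

vocab = ["king", "queen", "apple"]

# One-hot: one dimension per vocabulary word, all zeros except a single 1
one_hot_king = np.array([1, 0, 0])

# Dense embeddings: a few continuous dimensions shared by all words (values are illustrative)
embeddings = {
    "king":  np.array([0.52, 0.81, -0.14, 0.33]),
    "queen": np.array([0.49, 0.78, -0.10, 0.35]),
    "apple": np.array([-0.62, 0.05, 0.91, -0.20]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words end up close together in the vector space; unrelated words do not
print(cosine(embeddings["king"], embeddings["queen"]))  # high
print(cosine(embeddings["king"], embeddings["apple"]))  # low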

3.2.1 Why Use Word Embeddings?

  • Semantic Meaning: Captures context and meaning.
  • Dimensionality Reduction: Reduces the size of feature vectors compared to one-hot encoding.
  • Improved Performance: Enhances the performance of NLP models.

3.2.2 Popular Word Embedding Methods:

  • Word2Vec
  • GloVe (Global Vectors for Word Representation)
  • FastText

3.3 Implementing Word Embeddings

3.3.1 Word2Vec:

Word2Vec is a method for learning vector representations of words from large text corpora. It has two main models:

  • Continuous Bag of Words (CBOW): Predicts the current word based on surrounding words.
  • Skip-gram: Predicts surrounding words based on the current word.

from gensim.models import Word2Vec

# Sample data
sentences = [["natural", "language", "processing", "is", "fun"],
             ["word", "embeddings", "capture", "semantic", "meaning"]]

# Train Word2Vec model (sg=0 selects the CBOW architecture; sg=1 selects skip-gram)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Get vector for a word
vector = model.wv['word']
print(vector)

# Find similar words
similar_words = model.wv.most_similar('word')
print(similar_words)
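
The model above uses CBOW (sg=0). Training the skip-gram variant only requires flipping the sg flag; with a corpus this tiny the resulting vectors are not meaningful, so treat this as a usage sketch only:

# Skip-gram variant of the same training call
skipgram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Cosine similarity between two in-vocabulary words
print(skipgram_model.wv.similarity('natural', 'language'))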

3.3.2 GloVe:

GloVe is another popular word embedding method that learns word vectors from global word co-occurrence statistics rather than from local context windows alone. The example below uses the third-party glove-python package (installed separately from gensim) and reuses the sentences list defined in the Word2Vec example.

from glove import Corpus, Glove

# Build the word co-occurrence matrix from the sample sentences above
corpus = Corpus()
corpus.fit(sentences, window=3)

# Train GloVe model
glove = Glove(no_components=50, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

# Get vector for a word
vector = glove.word_vectors[glove.dictionary['word']]
print(vector)

# Find similar words
word = 'word'
similar_words = {k: v for k, v in glove.most_similar(word, number=5)}
print(similar_words)

3.3.3 FastText:

FastText is an extension of Word2Vec that considers subword information, which helps in handling out-of-vocabulary words better.

from gensim.models import FastText

# Train FastText model (character n-gram vectors are learned alongside the word vectors)
model = FastText(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Get vector for a word
vector = model.wv['word']
print(vector)

# Find similar words
similar_words = model.wv.most_similar('word')
print(similar_words)
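
Because FastText builds word vectors from character n-grams, it can also produce a vector for a word it never saw during training. A quick check on the toy model above (the misspelling is just one example of an out-of-vocabulary token):

# 'embeddingz' was never seen during training, but its character n-grams still yield a vector
oov_word = 'embeddingz'
print(oov_word in model.wv.key_to_index)  # False: not in the trained vocabulary
print(model.wv[oov_word][:5])             # still returns a vector, composed from n-grams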

3.4 Summary and Next Steps

In this lesson, we covered advanced text preprocessing techniques and explored word embeddings. We implemented word embeddings using Word2Vec, GloVe, and FastText.

Next Steps:

  • In Lesson 4, we will explore text classification techniques and build a simple text classifier using the preprocessing and embedding techniques learned.