2. Text Representation
Bag-of-Words model and its limitations:
The Bag-of-Words (BoW) model represents a document by counting how often each word occurs in it. It disregards the order and structure of words, treating each document as an unordered collection (a "bag") of words. Limitations of the BoW model include:
- Lack of semantic meaning: The BoW model does not capture the semantic relationships between words.
- Loss of word order: Since word order is not considered, the BoW model may lose important context and meaning.
- High dimensionality: BoW representation results in high-dimensional sparse vectors, which can be computationally expensive to process.
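As a minimal sketch of the BoW idea (assuming a recent scikit-learn, 1.0 or later; the two toy documents are invented for illustration), a count-based representation can be built with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# Two toy documents (illustrative only)
docs = ["the cat sat on the mat", "the dog sat on the log"]
# Each document becomes a row of word counts over the shared vocabulary
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary (matrix columns)
print(bow.toarray())                       # counts per document; mostly zeros for real corpora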
Term Frequency-Inverse Document Frequency (TF-IDF) representation:
TF-IDF representation is a numerical statistic that reflects the importance of a word in a document relative to a corpus. It combines term frequency (TF), which measures the frequency of a word in a document, with inverse document frequency (IDF), which penalizes words that appear frequently across documents. TF-IDF helps to identify words that are important to a document while filtering out common words that occur across many documents.
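A common formulation is tfidf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. A minimal sketch with scikit-learn's TfidfVectorizer (same toy documents as above; the vectorizer's defaults, such as smoothed IDF and L2 normalization, are library choices rather than part of the definition):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the cat sat on the mat", "the dog sat on the log"]
# TF-IDF weighted document-term matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))  # words shared by both documents (e.g. "the") receive lower weight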
Word embeddings: Word2Vec, GloVe:
Word embeddings are dense vector representations of words in a continuous vector space, where similar words are closer together in the vector space. Two popular methods for generating word embeddings are Word2Vec and GloVe:
- Word2Vec: A neural network-based model that learns word embeddings by predicting the context words given a target word (skip-gram model) or predicting the target word given context words (continuous bag-of-words model).
- GloVe (Global Vectors for Word Representation): A method for generating word embeddings by factorizing the matrix of word co-occurrence statistics. GloVe captures global word-word co-occurrence statistics across the entire corpus.
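In gensim (assuming version 4.x), switching between the two Word2Vec training objectives is a single flag; a minimal sketch on made-up toy sentences:
from gensim.models import Word2Vec
sentences = [["natural", "language", "processing"],
             ["language", "models", "learn", "word", "vectors"]]
# sg=1 selects the skip-gram objective; sg=0 (the default) selects CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(skipgram.wv["language"][:5])  # first few dimensions of one learned vector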
Lab Activity: Training Word2Vec embeddings on a corpus and performing similarity tasks
In this lab activity, we will train Word2Vec embeddings on a corpus of text data and perform similarity tasks using the trained embeddings. The steps involved include:
- Preprocess the text data: Tokenization, cleaning, and normalization.
- Train Word2Vec model: Using the preprocessed text data to learn word embeddings.
- Perform similarity tasks: Calculate cosine similarity between word vectors to identify similar words or phrases.
Code for Lab Activity:
Step 1: Preprocess the text data
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
# Sample text data
text_data = "Natural Language Processing is a fascinating field. It involves the development of algorithms and models to enable computers to understand, interpret, and generate human language data."
# Tokenization
tokens = word_tokenize(text_data)
# Lowercase, drop punctuation tokens, and remove stop words so that the
# training vocabulary matches the lowercase queries used in Step 3
stop_words = set(stopwords.words('english'))
filtered_tokens = [word.lower() for word in tokens
                   if word.isalpha() and word.lower() not in stop_words]
print(filtered_tokens)
Step 2: Train Word2Vec model
from gensim.models import Word2Vec
# Train Word2Vec model; it expects a list of tokenized sentences, so the
# single token list is wrapped in a list. min_count=1 keeps every word,
# which only makes sense for a corpus this small.
model = Word2Vec([filtered_tokens], min_count=1)
# Save trained model
model.save("word2vec.model")
Step 3: Perform similarity tasks
# Load trained Word2Vec model
model = Word2Vec.load("word2vec.model")
# Cosine similarity between the two word vectors (tokens were lowercased in Step 1)
similarity = model.wv.similarity('natural', 'processing')
print("Similarity between 'natural' and 'processing':", similarity)