2. Text Representation

Bag-of-Words model and its limitations:
The Bag-of-Words (BoW) model represents a document by counting how often each word occurs in it. It disregards the order and structure of words, treating each document as an unordered bag (multiset) of words. Limitations of the BoW model include the following (a short code sketch follows the list):

  • Lack of semantic meaning: The BoW model does not capture the semantic relationships between words.
  • Loss of word order: Since word order is not considered, the BoW model may lose important context and meaning.
  • High dimensionality: BoW representation results in high-dimensional sparse vectors, which can be computationally expensive to process.
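
To make the bag-of-words idea concrete, here is a minimal sketch using scikit-learn's CountVectorizer. scikit-learn is not part of the lab setup below, so treat this as an optional illustration under that assumption:

        # Minimal bag-of-words sketch (assumes scikit-learn is installed)
        from sklearn.feature_extraction.text import CountVectorizer

        docs = [
            "Natural Language Processing is fascinating.",
            "Language models process natural language data.",
        ]

        vectorizer = CountVectorizer()
        bow_matrix = vectorizer.fit_transform(docs)

        print(vectorizer.get_feature_names_out())  # vocabulary (one column per word)
        print(bow_matrix.toarray())                # word counts per document

Each row is a document and each column a vocabulary word; word order is gone, and most entries are zero, which is exactly the sparsity problem noted above.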

Term Frequency-Inverse Document Frequency (TF-IDF) representation:
TF-IDF is a numerical statistic that reflects how important a word is to a document relative to a corpus. It is the product of term frequency (TF), which measures how often a word occurs in a document, and inverse document frequency (IDF), which down-weights words that appear in many documents. TF-IDF therefore highlights words that are distinctive for a document while filtering out common words that occur across the whole corpus.
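
As a quick illustration, TF-IDF weights can be computed with scikit-learn's TfidfVectorizer; this is a sketch under the assumption that scikit-learn is available, since the lab below only uses NLTK and gensim:

        # Minimal TF-IDF sketch (assumes scikit-learn is installed)
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "Natural Language Processing is fascinating.",
            "Language models process natural language data.",
            "Computers interpret and generate human language.",
        ]

        tfidf = TfidfVectorizer()
        tfidf_matrix = tfidf.fit_transform(docs)

        # 'language' appears in every document, so it receives the lowest
        # possible IDF weight; words unique to one document get a higher IDF.
        print(tfidf.get_feature_names_out())
        print(tfidf_matrix.toarray().round(2))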

Word embeddings: Word2Vec, GloVe:
Word embeddings are dense vector representations of words in a continuous vector space, where words with similar meanings lie close together. Two popular methods for generating word embeddings are Word2Vec and GloVe; a short example of loading pretrained GloVe vectors follows the list:

  • Word2Vec: A neural network-based model that learns word embeddings by predicting the context words given a target word (skip-gram model) or predicting the target word given context words (continuous bag-of-words model).
  • GloVe (Global Vectors for Word Representation): A method for generating word embeddings by factorizing the matrix of word co-occurrence statistics. GloVe captures global word-word co-occurrence statistics across the entire corpus.
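
Pretrained embeddings can also be used directly instead of being trained from scratch. The sketch below uses gensim's downloader API to fetch a small pretrained GloVe model; it assumes an internet connection for the initial download, and "glove-wiki-gigaword-50" is just one of the models packaged with gensim-data:

        # Loading pretrained GloVe vectors through gensim's downloader API
        import gensim.downloader as api

        # Downloads the vectors on first use and returns a KeyedVectors object
        glove_vectors = api.load("glove-wiki-gigaword-50")

        # Nearest neighbours in the embedding space
        print(glove_vectors.most_similar("computer", topn=5))

        # Cosine similarity between two word vectors
        print(glove_vectors.similarity("king", "queen"))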

Lab Activity: Training Word2Vec embeddings on a corpus and performing similarity tasks
In this lab activity, we will train Word2Vec embeddings on a corpus of text data and perform similarity tasks using the trained embeddings. The steps involved include:

  1. Preprocess the text data: Tokenization, cleaning, and normalization.
  2. Train Word2Vec model: Using the preprocessed text data to learn word embeddings.
  3. Perform similarity tasks: Calculate cosine similarity between word vectors to identify similar words or phrases.

Code for Lab Activity:
Step 1: Preprocess the text data


        import nltk
        from nltk.tokenize import word_tokenize
        from nltk.corpus import stopwords

        # Download the tokenizer model and the stop-word list
        nltk.download('punkt')
        # nltk.download('punkt_tab')  # uncomment if your NLTK version requires it
        nltk.download('stopwords')

        # Sample text data
        text_data = "Natural Language Processing is a fascinating field. It involves the development of algorithms and models to enable computers to understand, interpret, and generate human language data."

        # Tokenization
        tokens = word_tokenize(text_data)

        # Cleaning and normalization: lowercase every token, keep only alphabetic
        # tokens, and remove English stop words. Lowercasing matters because
        # Step 3 looks up 'natural' and 'processing' in the Word2Vec vocabulary.
        stop_words = set(stopwords.words('english'))
        filtered_tokens = [word.lower() for word in tokens
                           if word.isalpha() and word.lower() not in stop_words]

        print(filtered_tokens)
    
Step 2: Train Word2Vec model

        from gensim.models import Word2Vec

        # Train a Word2Vec model on the (single-sentence) corpus.
        # vector_size: embedding dimensionality, window: context window size,
        # min_count=1 keeps every token because the corpus is tiny,
        # sg=0 uses the CBOW architecture (sg=1 would use skip-gram).
        model = Word2Vec([filtered_tokens], vector_size=100, window=5, min_count=1, sg=0)

        # Save the trained model to disk
        model.save("word2vec.model")
    
Step 3: Perform similarity tasks

        # Load the trained Word2Vec model
        model = Word2Vec.load("word2vec.model")

        # Cosine similarity between two word vectors; both words are in the
        # vocabulary because Step 1 lowercased the tokens
        similarity = model.wv.similarity('natural', 'processing')
        print("Similarity between 'natural' and 'processing':", similarity)

        # With a corpus this small the learned similarities are essentially
        # noise; train on a larger corpus for meaningful results.
    
