3. Language Modeling

Introduction to language models:
Language models are statistical models that assign probabilities to sequences of words, typically by predicting the next word given the preceding context. Two common types of language models are:

  • n-gram models: These models predict the next word in a sequence from the preceding n-1 words, using counts of n-grams observed in a corpus. They are simple and efficient but suffer from data sparsity (the curse of dimensionality) as n grows; a minimal bigram sketch follows this list.
  • Neural language models: These models use neural networks to learn the probability distribution of the next word given its context. Because they represent words as dense vectors, they can capture complex dependencies between words and generalize to word combinations not seen during training.
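
To make the n-gram idea concrete, here is a minimal count-based bigram model in plain Python. It is only a sketch: the toy corpus and the bigram_counts / predict_next names are invented for illustration and are not part of the lab code below.

        # Minimal count-based bigram model (illustrative sketch)
        from collections import defaultdict, Counter

        corpus = "the cat sat on the mat the cat ate the fish".split()

        # Count how often each word follows each preceding word
        bigram_counts = defaultdict(Counter)
        for prev, nxt in zip(corpus[:-1], corpus[1:]):
            bigram_counts[prev][nxt] += 1

        def predict_next(word):
            """Return the most likely next word and its estimated probability."""
            counts = bigram_counts[word]
            total = sum(counts.values())
            best, freq = counts.most_common(1)[0]
            return best, freq / total

        print(predict_next("the"))  # ('cat', 0.5) on this toy corpus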

Recurrent Neural Networks (RNNs) for sequence modeling:
Recurrent Neural Networks (RNNs) are a type of neural network architecture designed to handle sequential data. They have connections that form directed cycles, allowing them to maintain a state or memory of previous inputs. RNNs are commonly used for tasks such as language modeling, time series prediction, and machine translation.
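
At its core, an RNN applies the recurrence h_t = tanh(W x_t + U h_{t-1} + b) at every time step, carrying the hidden state h forward. Below is a minimal NumPy sketch of that recurrence; the toy dimensions and the rnn_step name are assumptions made for illustration, not part of the lab code.

        # Minimal vanilla RNN recurrence (illustrative sketch)
        import numpy as np

        def rnn_step(x_t, h_prev, W, U, b):
            """One RNN time step: combine the current input with the previous hidden state."""
            return np.tanh(W @ x_t + U @ h_prev + b)

        rng = np.random.default_rng(0)
        input_dim, hidden_dim = 3, 4  # toy dimensions
        W = rng.normal(size=(hidden_dim, input_dim))
        U = rng.normal(size=(hidden_dim, hidden_dim))
        b = np.zeros(hidden_dim)

        # Run the recurrence over a short sequence of random "word vectors"
        h = np.zeros(hidden_dim)
        for x_t in rng.normal(size=(5, input_dim)):
            h = rnn_step(x_t, h, W, U, b)
        print(h.shape)  # (4,) -- the final hidden state summarizes the whole sequence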

Long Short-Term Memory (LSTM) networks:
Long Short-Term Memory (LSTM) networks are a variant of RNNs that address the vanishing gradient problem, which arises when gradients shrink as they are propagated back through many time steps during backpropagation through time. LSTMs introduce gating mechanisms (input, forget, and output gates) that control the flow of information through an explicit cell state, enabling them to learn long-range dependencies in sequential data more effectively.
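
As a rough illustration of those gates, here is a single LSTM cell step written in NumPy. The stacked parameter layout and the lstm_cell_step name are assumptions made for this sketch; in practice, frameworks such as Keras implement these details internally.

        # One LSTM cell step with explicit gates (illustrative sketch)
        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
            """W, U, b stack the parameters of the input, forget, output, and candidate gates."""
            n = h_prev.shape[0]
            z = W @ x_t + U @ h_prev + b       # pre-activations for all four gates
            i = sigmoid(z[0:n])                # input gate: how much new information to write
            f = sigmoid(z[n:2*n])              # forget gate: how much old cell state to keep
            o = sigmoid(z[2*n:3*n])            # output gate: how much of the cell state to expose
            g = np.tanh(z[3*n:4*n])            # candidate values for the cell state
            c_t = f * c_prev + i * g           # update the cell state
            h_t = o * np.tanh(c_t)             # new hidden state
            return h_t, c_t

        # Toy dimensions: input size 3, hidden size 2
        rng = np.random.default_rng(0)
        W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
        h, c = lstm_cell_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
        print(h, c)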

Lab Activity: Building an LSTM-based language model for text generation
In this lab activity, we will build an LSTM-based language model using TensorFlow/Keras for text generation. The steps involved include:

  1. Preprocess the text data: Tokenization, sequence generation.
  2. Build and train the LSTM model: Define the architecture and train the model on the preprocessed data.
  3. Generate text using the trained model: Generate new text sequences based on the learned patterns.

Code for Lab Activity:
Step 1: Preprocess the text data


        # Import necessary libraries
        import numpy as np
        from tensorflow.keras.preprocessing.text import Tokenizer
        from tensorflow.keras.preprocessing.sequence import pad_sequences

        # Sample text data
        text_data = "Natural Language Processing is a fascinating field. It involves the development of algorithms and models to enable computers to understand, interpret, and generate human language data."

        # Tokenization
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts([text_data])
        sequences = tokenizer.texts_to_sequences([text_data])

        # Generate input-output sequences: every prefix of the token list
        # becomes an input whose target is the token that follows it
        token_list = sequences[0]
        input_sequences = [token_list[:i + 1] for i in range(1, len(token_list))]

        # Pad all sequences to the same length
        max_sequence_length = max(len(seq) for seq in input_sequences)
        input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

        # Split into inputs (all tokens but the last) and targets (the last token)
        X = input_sequences[:, :-1]
        y = input_sequences[:, -1]
    
Step 2: Build and train the LSTM model

        from tensorflow.keras.models import Sequential
        from tensorflow.keras.layers import Embedding, LSTM, Dense

        # Build LSTM model
        model = Sequential()
        model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=10, input_length=max_sequence_length-1))
        model.add(LSTM(50))
        model.add(Dense(len(tokenizer.word_index)+1, activation='softmax'))

        # Compile model
        model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Train model
        model.fit(X, y, epochs=100, verbose=0)
    
Step 3: Generate text using the trained model

        # Generate text using the trained model
        seed_text = "Natural Language Processing"
        for _ in range(10):
            # Tokenize seed text
            encoded = tokenizer.texts_to_sequences([seed_text])[0]
            encoded = pad_sequences([encoded], maxlen=max_sequence_length-1, padding='pre')

            # Predict next word
            predicted_index = np.argmax(model.predict(encoded), axis=-1)[0]

            # Map predicted index to word
            predicted_word = ""
            for word, index in tokenizer.word_index.items():
                if index == predicted_index:
                    predicted_word = word
                    break

            seed_text += " " + predicted_word

        print("Generated text:", seed_text)
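
The loop above always picks the single most likely next word (greedy decoding), which tends to produce repetitive text. A common variation is to sample from the predicted distribution with a temperature parameter. The sketch below assumes the model, tokenizer, and max_sequence_length defined earlier; the sample_next_word name and the temperature value are illustrative choices, not part of the original lab.

        # Variation: sample the next word instead of always taking the argmax
        def sample_next_word(seed_text, temperature=1.0):
            encoded = tokenizer.texts_to_sequences([seed_text])[0]
            encoded = pad_sequences([encoded], maxlen=max_sequence_length-1, padding='pre')
            probs = model.predict(encoded, verbose=0)[0]

            # Rescale the distribution with a temperature, then draw a word index
            logits = np.log(probs + 1e-9) / temperature
            probs = np.exp(logits) / np.sum(np.exp(logits))
            index = np.random.choice(len(probs), p=probs)
            return tokenizer.index_word.get(index, "")

        print(sample_next_word("Natural Language Processing", temperature=0.8))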
    
