
Entity types (PERSON, ORG, LOC, DATE) in NLP - ML Experiment: Train & Evaluate

Experiment - Entity types (PERSON, ORG, LOC, DATE)
Problem: You want to build a model that can recognize named entities in text, specifically people (PERSON), organizations (ORG), locations (LOC), and dates (DATE). The current model identifies entities but often confuses entity types or misses some entities.
Current Metrics: Training accuracy: 92%, Validation accuracy: 75%, Validation F1-score: 0.70
Issue: The model is overfitting. Training accuracy is high, but validation accuracy and F1-score are much lower, which indicates poor generalization.
Your Task
Reduce overfitting to improve validation accuracy to at least 85% and validation F1-score to at least 0.80, while keeping training accuracy below 90%.
You can only adjust model architecture and training hyperparameters.
You cannot change the dataset or add more data.
You must keep the entity types limited to PERSON, ORG, LOC, and DATE.
Solution
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

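# Assume X_train, y_train, X_val, y_val are preprocessed and ready:
# X_* hold integer token ids with shape (num_sentences, max_len),
# y_* hold one integer label id per token, with the same shape.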

vocab_size = 10000  # example vocabulary size
embedding_dim = 64
max_len = 100  # max length of input sequences
num_classes = 5  # 4 entity types + 1 for 'O' (no entity)

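model = Sequential([
    # Explicit input shape; Embedding's input_length argument is deprecated in recent Keras
    tf.keras.Input(shape=(max_len,)),
    Embedding(vocab_size, embedding_dim),
    # return_sequences=True keeps one output per token, as sequence labeling requires
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    # One softmax over the 5 labels for every token position
    Dense(num_classes, activation='softmax')
])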

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(X_val, y_val),
                    callbacks=[early_stop])
Added dropout layers after LSTM and Dense layers to reduce overfitting.
Lowered learning rate to 0.0005 for smoother convergence.
Added early stopping to stop training when validation loss stops improving.
Results Interpretation

Before: Training accuracy: 92%, Validation accuracy: 75%, Validation F1-score: 0.70

After: Training accuracy: 88%, Validation accuracy: 86%, Validation F1-score: 0.82

Adding dropout and early stopping helped reduce overfitting, improving validation accuracy and F1-score while slightly lowering training accuracy. This shows how controlling model complexity and training duration helps models generalize better.
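The training code above tracks only accuracy, and the page does not say how the F1-score was computed. One common choice, shown here as a minimal sketch and not as the method used for the reported numbers, is a token-level F1 averaged over the four entity types with the 'O' label left out. The function name `macro_f1` and the toy label sequences are hypothetical.

```python
from collections import Counter

ENTITY_TYPES = ["PERSON", "ORG", "LOC", "DATE"]  # 'O' is deliberately excluded


def macro_f1(true_labels, pred_labels, classes=ENTITY_TYPES):
    """Token-level F1, averaged with equal weight over the given classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(true_labels, pred_labels):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed
    scores = []
    for label in classes:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up labels, one per token)
gold = ["PERSON", "PERSON", "O", "LOC", "O", "DATE", "ORG"]
pred = ["PERSON", "ORG", "O", "LOC", "O", "DATE", "LOC"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(gold, pred):.2f}")
# prints: accuracy=0.71  macro_f1=0.58
```

In this toy case accuracy looks better than F1 because correctly predicted 'O' tokens raise accuracy but do not count toward the entity-type average. This is one reason to report both numbers.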
Bonus Experiment
Try using a pretrained language model like BERT for named entity recognition on the same dataset.
💡 Hint
Use a pretrained transformer model with a token classification head and fine-tune it on your entity dataset for potentially better accuracy.
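One practical detail when fine-tuning a transformer for token classification is that subword tokenizers may split a word into several pieces, so word-level labels have to be aligned with the pieces. The sketch below shows one common convention under stated assumptions: `word_ids` is a list like the one returned by `word_ids()` on a Hugging Face fast tokenizer's encoding (`None` for special tokens), only the first piece of each word keeps its label, and other positions get -100, the value PyTorch's cross-entropy loss ignores by default. The function name, the label ids, and the sample split are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

# Hypothetical label ids: O=0, PERSON=1, ORG=2, LOC=3, DATE=4


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Map word-level labels onto subword positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)  # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(ignore_index)  # continuation piece of the same word
        previous = word_id
    return aligned


# "Albert Einstein visited Paris", assuming "Einstein" is split into two pieces
word_labels = [1, 1, 0, 3]
word_ids = [None, 0, 1, 1, 2, 3, None]
print(align_labels(word_ids, word_labels))
# prints: [-100, 1, 1, -100, 0, 3, -100]
```

Labeling only the first piece is a design choice. An alternative is to copy the word's label to every piece, which changes how much each word contributes to the loss.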

Practice

1. Which entity type label would you use to mark the name "Albert Einstein" in a text?
easy
A. PERSON
B. ORG
C. LOC
D. DATE

Solution

  1. Step 1: Understand entity types

    PERSON labels identify names of people in text.
  2. Step 2: Match the example to entity type

    "Albert Einstein" is a person's name, so it fits PERSON.
  3. Final Answer:

    PERSON -> Option A
  4. Quick Check:

    PERSON = Albert Einstein [OK]
Hint: Names of people are PERSON entities
Common Mistakes:
  • Confusing ORG with PERSON
  • Labeling locations as PERSON
  • Using DATE for names
2. Which of the following is the correct way to label the entity type for "Google" in a named entity recognition task?
easy
A. LOC
B. ORG
C. PERSON
D. DATE

Solution

  1. Step 1: Identify what Google represents

    Google is a company, which is an organization.
  2. Step 2: Match to entity type

    ORG is the label for organizations like companies.
  3. Final Answer:

    ORG -> Option B
  4. Quick Check:

    ORG = Google [OK]
Hint: Companies and institutions are labeled ORG
Common Mistakes:
  • Labeling companies as LOC
  • Using PERSON for organizations
  • Confusing DATE with ORG
3. Given the sentence: "Barack Obama visited Paris on July 14, 2015." Which of the following is the correct sequence of entity types for ["Barack Obama", "Paris", "July 14, 2015"]?
medium
A. [PERSON, LOC, ORG]
B. [ORG, LOC, DATE]
C. [PERSON, LOC, DATE]
D. [PERSON, ORG, DATE]

Solution

  1. Step 1: Identify each entity type

    "Barack Obama" is a person, "Paris" is a location, and "July 14, 2015" is a date.
  2. Step 2: Match entities to types in order

    The sequence is PERSON, LOC, DATE.
  3. Final Answer:

    [PERSON, LOC, DATE] -> Option C
  4. Quick Check:

    PERSON, LOC, DATE = Barack Obama, Paris, July 14, 2015 [OK]
Hint: Match each entity to person, place, or date in order
Common Mistakes:
  • Confusing ORG with LOC
  • Mixing DATE with ORG
  • Wrong order of entity types
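To connect question 3 with the token-level model in the experiment, the sketch below turns the sentence's entity spans into one label per token and then into integer ids. It is illustrative only: the tokenization, the `(start, end, type)` span format, the helper name `label_tokens`, and the id mapping are assumptions, chosen to match the 5 classes (4 entity types plus 'O') used in the solution code.

```python
# Hypothetical mapping; any consistent assignment of ids to the 5 labels works
LABEL_TO_ID = {"O": 0, "PERSON": 1, "ORG": 2, "LOC": 3, "DATE": 4}


def label_tokens(tokens, entities):
    """Give each token its entity type, or 'O' if it is outside every entity.

    entities holds (start, end, entity_type) spans over token indices, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        for i in range(start, end):
            labels[i] = entity_type
    return labels


tokens = ["Barack", "Obama", "visited", "Paris", "on", "July", "14", ",", "2015", "."]
entities = [(0, 2, "PERSON"), (3, 4, "LOC"), (5, 9, "DATE")]

labels = label_tokens(tokens, entities)
label_ids = [LABEL_TO_ID[label] for label in labels]

for token, label in zip(tokens, labels):
    print(f"{token:8s} {label}")
print(label_ids)  # prints: [1, 1, 0, 3, 0, 4, 4, 4, 4, 0]
```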
4. You have a named entity recognition model that labels "Amazon" as a LOC (location). What is the most likely error in this labeling?
medium
A. Amazon is an organization, so it should be ORG
B. Amazon is a person, so LOC is wrong
C. Amazon is a date, so LOC is incorrect
D. Amazon is a location, so LOC is correct

Solution

  1. Step 1: Understand the entity "Amazon"

    "Amazon" can also name a river and a rainforest, but without context pointing to those, it most often refers to the company, which is an organization.
  2. Step 2: Correct entity type for Amazon

    ORG is the correct label for companies like Amazon.
  3. Final Answer:

    Amazon is an organization, so it should be ORG -> Option A
  4. Quick Check:

    ORG = Amazon company [OK]
Hint: Companies are ORG, not LOC
Common Mistakes:
  • Assuming Amazon is only a location
  • Labeling company names as PERSON
  • Ignoring context of entity
5. You want to extract all dates and locations from the sentence: "The conference was held in New York on March 3rd, 2023, and attended by experts from Google." Which entity types should your model identify to get the correct information?
hard
A. PERSON and LOC
B. PERSON and ORG
C. ORG and DATE
D. LOC and DATE

Solution

  1. Step 1: Identify entities to extract

    The task is to extract dates and locations only.
  2. Step 2: Match entity types for locations and dates

    Locations are labeled LOC and dates are labeled DATE.
  3. Final Answer:

    LOC and DATE -> Option D
  4. Quick Check:

    LOC and DATE = New York, March 3rd, 2023 [OK]
Hint: Dates = DATE, places = LOC
Common Mistakes:
  • Extracting PERSON or ORG instead
  • Mixing LOC with ORG
  • Ignoring DATE entities
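Question 5 amounts to filtering a model's output by entity type. The sketch below assumes the model's predictions are already available as `(text, label)` pairs; the list shown is written by hand for illustration, not produced by a model, and `filter_entities` is a hypothetical helper.

```python
def filter_entities(entities, wanted_types):
    """Keep only the (text, label) pairs whose label is in wanted_types."""
    return [(text, label) for text, label in entities if label in wanted_types]


# Hand-written predictions for the sentence in question 5
entities = [
    ("New York", "LOC"),
    ("March 3rd, 2023", "DATE"),
    ("Google", "ORG"),
]

selected = filter_entities(entities, {"LOC", "DATE"})
print(selected)
# prints: [('New York', 'LOC'), ('March 3rd, 2023', 'DATE')]
```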