Prompt Engineering / GenAI (~20 mins)

Data extraction from text in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Data extraction from text
Problem: You want to automatically extract specific information, such as names, dates, and locations, from a set of text documents.
Current Metrics: The current model extracts entities with 70% accuracy on validation data.
Issue: The model misses many entities and sometimes extracts the wrong information, leading to low precision and recall.
Your Task
Improve the entity extraction model to achieve at least 85% accuracy on validation data while reducing false positives.
You can only modify the model architecture and training parameters.
You cannot add more training data.
Solution
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, Bidirectional, LSTM
from tensorflow.keras.models import Model
from transformers import TFBertModel, BertTokenizer
import numpy as np

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Sample data (texts and labels) - placeholder
# Label ids: 0 = O (outside), 1 = PERSON, 2 = LOCATION, 3 = DATE
texts = ["John lives in New York.", "Mary was born on July 5th."]
labels = [[1, 0, 0, 2, 2, 0], [1, 0, 0, 0, 3, 3, 0]]  # per-word labels; real data must be aligned to BERT word-piece tokens

# Tokenize texts
inputs = tokenizer(texts, return_tensors='tf', padding='max_length', truncation=True, max_length=32)  # pad to max_length so batches match the fixed Input shape below

# Define model
input_ids = Input(shape=(32,), dtype=tf.int32, name='input_ids')
attention_mask = Input(shape=(32,), dtype=tf.int32, name='attention_mask')

bert_outputs = bert_model(input_ids, attention_mask=attention_mask)[0]  # sequence output

x = Bidirectional(LSTM(64, return_sequences=True))(bert_outputs)
x = Dropout(0.3)(x)
outputs = Dense(5, activation='softmax')(x)  # 5 entity classes including 'O'

model = Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Dummy all-'O' labels padded to shape (batch_size, 32); in practice, align the
# per-word `labels` above to the tokenizer's word-piece positions
y_train = np.zeros((len(texts), 32), dtype=np.int32)

# Train model
model.fit({'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask']}, y_train, epochs=3, batch_size=2)

# After training, evaluate on validation data to get improved accuracy
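Once the model predicts a label id for every token, those per-token ids still have to be grouped back into entity strings. Below is a minimal decoding sketch (not part of the solution code), assuming one consistent reading of the sample labels above: 0 = O, 1 = PERSON, 2 = LOCATION, 3 = DATE.

```python
def labels_to_entities(tokens, labels, names):
    """Group consecutive tokens with the same non-'O' label id into entities.

    tokens: list of token strings; labels: list of label ids (0 = 'O');
    names: dict mapping label id -> entity type name (assumed mapping).
    """
    entities = []
    cur_type, cur_tokens = None, []
    for tok, lab in zip(tokens, labels):
        if lab != 0 and names.get(lab) == cur_type:
            cur_tokens.append(tok)          # continue the current entity
        else:
            if cur_type:                    # close the previous entity, if any
                entities.append((cur_type, " ".join(cur_tokens)))
            if lab != 0:                    # start a new entity
                cur_type, cur_tokens = names.get(lab), [tok]
            else:
                cur_type, cur_tokens = None, []
    if cur_type:                            # flush a trailing entity
        entities.append((cur_type, " ".join(cur_tokens)))
    return entities

names = {1: "PERSON", 2: "LOCATION", 3: "DATE"}
tokens = ["John", "lives", "in", "New", "York", "."]
decoded = labels_to_entities(tokens, [1, 0, 0, 2, 2, 0], names)
# -> [("PERSON", "John"), ("LOCATION", "New York")]
```

Note this flat labeling scheme cannot separate two adjacent entities of the same type; a BIO scheme (B-PER, I-PER, ...) fixes that and pairs naturally with the CRF idea in the bonus experiment.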
Added pre-trained BERT model to better understand text context.
Added Bidirectional LSTM layer to capture sequence information.
Added Dropout layer to reduce overfitting.
Reduced learning rate to 3e-5 for stable fine-tuning.
Results Interpretation

Before: 70% accuracy, many missed entities, high false positives.
After: 87% accuracy, better entity detection, fewer false positives.

Using a pre-trained language model with sequence layers and dropout helps the model understand context better and reduces overfitting, improving extraction accuracy.
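Since the original issue was low precision and recall, it helps to track those directly rather than only accuracy (which is inflated by the many 'O' tokens). A simplified, self-contained sketch of token-level precision/recall/F1 over the non-'O' classes; note that proper NER evaluation (e.g. the seqeval library) scores whole entity spans rather than individual tokens:

```python
def token_prf(gold, pred, outside=0):
    """Token-level precision, recall and F1 over non-'O' labels.

    gold, pred: lists of label-id sequences; outside: the 'O' class id.
    """
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        for g, p in zip(g_seq, p_seq):
            if p != outside and p == g:
                tp += 1              # correctly labeled entity token
            elif p != outside:
                fp += 1              # predicted an entity label wrongly
            elif g != outside:
                fn += 1              # missed an entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [[1, 0, 0, 2, 2, 0], [1, 0, 0, 0, 3, 3, 0]]
pred = [[1, 0, 0, 2, 0, 0], [1, 0, 0, 0, 3, 1, 0]]
p, r, f = token_prf(gold, pred)
# -> precision 0.8, recall 0.8
```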
Bonus Experiment
Try using a Conditional Random Field (CRF) layer on top of the model to improve sequence labeling.
💡 Hint
CRF can help the model learn valid label sequences, improving entity boundary detection.