Caching datasets in TensorFlow - ML Experiment: Train & Evaluate

Experiment - Caching datasets
Problem: You are training a TensorFlow model on a dataset loaded from disk. Each epoch reloads and preprocesses the data, which slows training.
Current Metrics: Training time per epoch: 120 seconds; Validation accuracy: 85%; Training accuracy: 90%
Issue: Training is slow because the dataset is not cached, so every epoch repeats the disk reads and preprocessing.
Your Task
Use dataset caching to reduce training time per epoch by at least 30% without reducing accuracy.
Do not change the model architecture.
Do not reduce the dataset size.
Keep the same batch size and number of epochs.
Solution
TensorFlow
import tensorflow as tf
import time

# Simulate loading a dataset from disk
raw_dataset = tf.data.Dataset.range(10000)

# Example preprocessing function.
# Returns a (feature, label) pair of float32 tensors with shape (1,):
# model.fit needs targets, and the Dense layer expects input_shape=(1,).
def preprocess(x):
    feature = tf.reshape(tf.cast(x, tf.float32), (1,))
    return feature, feature * 2

# Prepare dataset without caching
dataset = raw_dataset.map(preprocess).batch(32)

# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Measure training time without caching
start_time = time.time()
model.fit(dataset, epochs=3, verbose=0)
end_time = time.time()
print(f"Training time without caching: {end_time - start_time:.2f} seconds")

# Prepare dataset with caching: cache after map, before batch.
# The first epoch still runs preprocess and fills the cache;
# later epochs read the cached elements instead.
cached_dataset = raw_dataset.map(preprocess).cache().batch(32)

# Rebuild the same model so training starts from fresh weights
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Measure training time with caching
start_time = time.time()
model.fit(cached_dataset, epochs=3, verbose=0)
end_time = time.time()
print(f"Training time with caching: {end_time - start_time:.2f} seconds")
Added .cache() method after preprocessing and before batching to cache the dataset in memory.
Corrected input_shape from () to (1,) in Dense layer to match dataset element shape.
Kept model architecture and batch size unchanged.
Measured training time before and after caching to confirm speedup.
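One way to see what the cache saves is to count how often the preprocessing function actually runs. The sketch below is an illustration, not part of the original solution: `counted_preprocess`, `tf_preprocess`, and the `calls` counter are hypothetical names, and `tf.py_function` is used only so that a Python-side counter can observe each call.

```python
import tensorflow as tf

calls = {"n": 0}  # hypothetical counter, for illustration only

def counted_preprocess(x):
    # Runs in Python each time the pipeline actually computes an element.
    calls["n"] += 1
    return x * 2

def tf_preprocess(x):
    # Wrap the Python function so it can be used inside Dataset.map.
    y = tf.py_function(counted_preprocess, [x], tf.int64)
    y.set_shape(())
    return y

uncached = tf.data.Dataset.range(5).map(tf_preprocess)
cached = tf.data.Dataset.range(5).map(tf_preprocess).cache()

# Three "epochs" without a cache: preprocessing runs for every element, every epoch.
for _ in range(3):
    list(uncached.as_numpy_iterator())
uncached_calls = calls["n"]

# Three "epochs" with a cache: preprocessing runs only while the cache is filled.
calls["n"] = 0
for _ in range(3):
    list(cached.as_numpy_iterator())
cached_calls = calls["n"]

print("calls without cache:", uncached_calls)
print("calls with cache:", cached_calls)
```

With 5 elements and 3 passes, the uncached pipeline calls the function 15 times and the cached one 5 times, which is where the per-epoch saving comes from.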
Results Interpretation

Before caching: Training time per epoch was 120 seconds with 85% validation accuracy.

After caching: Training time per epoch reduced to about 80 seconds with validation accuracy still at 85%.

Caching datasets in TensorFlow reduces repeated data loading and preprocessing, speeding up training without affecting model accuracy.
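As a quick check of the numbers above against the task's 30% target, the arithmetic can be written out. This is only a worked calculation on the figures quoted in this experiment; the variable names are illustrative.

```python
baseline_seconds = 120.0   # time per epoch before caching (from the experiment)
cached_seconds = 80.0      # time per epoch after caching (from the experiment)
target_reduction = 0.30    # required reduction from the task

reduction = (baseline_seconds - cached_seconds) / baseline_seconds
max_allowed_seconds = baseline_seconds * (1 - target_reduction)
meets_target = reduction >= target_reduction

print(f"Reduction: {reduction:.1%}")
print(f"Epoch time must be at most {max_allowed_seconds:.0f} seconds")
print(f"Target met: {meets_target}")
```

An epoch time of 80 seconds is a reduction of about 33%, which is below the 84-second ceiling implied by the 30% target.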
Bonus Experiment
Try caching the dataset to disk instead of memory using cache(filename) and compare training times.
💡 Hint
Use .cache('cache_file.tf-data') to cache on disk and observe if training time improves similarly.
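A minimal sketch of the disk variant follows, assuming a temporary directory is an acceptable place for the cache. The directory handling and the tiny `range(10)` dataset are illustrative choices, not part of the exercise.

```python
import os
import tempfile
import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Passing a filename makes cache() write to disk instead of memory.
    cache_path = os.path.join(tmp_dir, "cache_file.tf-data")
    dataset = tf.data.Dataset.range(10).map(lambda x: x * 2).cache(cache_path)

    # The first full pass computes the elements and writes the cache files.
    first_pass = [int(v) for v in dataset.as_numpy_iterator()]
    cache_files = sorted(os.listdir(tmp_dir))

    # Later passes read the elements back from the cache files.
    second_pass = [int(v) for v in dataset.as_numpy_iterator()]

print(cache_files)
print(first_pass == second_pass)
```

To compare training times, the same `cache(cache_path)` call can replace `.cache()` in the solution pipeline. A disk cache trades some read speed for not having to hold the dataset in memory.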

Practice

1. What is the main purpose of using dataset.cache() in TensorFlow?
easy
A. To save the dataset in memory for faster repeated access
B. To shuffle the dataset randomly before each epoch
C. To split the dataset into training and testing parts
D. To normalize the dataset values between 0 and 1

Solution

  1. Step 1: Understand what caching means in datasets

    Caching stores the dataset results so they don't need to be recomputed or reloaded each time.
  2. Step 2: Identify the effect of dataset.cache()

    This method saves the dataset in memory (or disk if filename given) to speed up repeated access.
  3. Final Answer:

    To save the dataset in memory for faster repeated access -> Option A
  4. Quick Check:

    Caching = faster repeated access [OK]
Hint: Caching stores data to avoid repeated loading delays [OK]
Common Mistakes:
  • Confusing caching with shuffling
  • Thinking caching splits data
  • Assuming caching normalizes data
2. Which of the following is the correct syntax to cache a TensorFlow dataset to a file named 'cache.tf'?
easy
A. dataset.cache_file('cache.tf')
B. dataset.cache = 'cache.tf'
C. dataset.cache('cache.tf')
D. cache(dataset, 'cache.tf')

Solution

  1. Step 1: Recall the method signature for caching to disk

    TensorFlow's cache() method accepts an optional filename string to cache on disk.
  2. Step 2: Match the correct syntax

    The correct syntax is calling dataset.cache('filename'), so dataset.cache('cache.tf') is correct.
  3. Final Answer:

    dataset.cache('cache.tf') -> Option C
  4. Quick Check:

    cache(filename) = dataset.cache('cache.tf') [OK]
Hint: Use dataset.cache('filename') to cache on disk [OK]
Common Mistakes:
  • Assigning cache as a property instead of calling it
  • Using a non-existent cache_file method
  • Calling cache as a separate function
3. Consider the following code snippet:
import tensorflow as tf
raw_data = tf.data.Dataset.range(3)
cached_data = raw_data.cache()
for item in cached_data:
    print(item.numpy())
for item in cached_data:
    print(item.numpy())

What will be the output of this code?
medium
A. 0 1 2 3 4 5
B. 0 1 2 0 1 2
C. 0 1 2
D. Error because dataset cannot be iterated twice

Solution

  1. Step 1: Understand caching effect on iteration

    The cache() method stores dataset elements after first iteration, so subsequent iterations are faster and repeat the same data.
  2. Step 2: Analyze the two loops

    The first loop prints 0,1,2 and caches them. The second loop prints the cached 0,1,2 again without recomputing.
  3. Final Answer:

    0 1 2 0 1 2 -> Option B
  4. Quick Check:

    The second iteration yields the same elements, served from the cache [OK]
Hint: A cached dataset can be iterated repeatedly; caching changes the cost, not the output [OK]
Common Mistakes:
  • Thinking second loop prints new numbers
  • Assuming error on second iteration
  • Believing cache disables iteration
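To confirm the answer to question 3, the snippet can be run with and without `cache()`. This is a small sketch; the helper `two_passes` is a hypothetical name introduced here for illustration.

```python
import tensorflow as tf

raw_data = tf.data.Dataset.range(3)
cached_data = raw_data.cache()

def two_passes(ds):
    # Iterate the dataset twice and collect every element that is produced.
    return [int(item.numpy()) for _ in range(2) for item in ds]

cached_output = two_passes(cached_data)
uncached_output = two_passes(raw_data)

print(cached_output)
print(uncached_output)
```

Both lists contain 0, 1, 2 twice. The cache does not change what an iteration yields, only whether the elements are recomputed or read back from storage.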
4. You wrote this code to cache a dataset:
dataset = tf.data.Dataset.range(5)
cached = dataset.cache
for x in cached:
    print(x.numpy())

What is the error in this code?
medium
A. Cannot iterate over cached dataset
B. Dataset.range should be Dataset.from_tensor_slices
C. cache method does not exist in tf.data.Dataset
D. Missing parentheses after cache method call

Solution

  1. Step 1: Check how cache is used

    The cache method must be called with parentheses: cache(), not accessed as a property.
  2. Step 2: Identify the error cause

    Using dataset.cache without parentheses returns a bound method object, not a dataset, so the for loop raises a TypeError because a method is not iterable.
  3. Final Answer:

    Missing parentheses after cache method call -> Option D
  4. Quick Check:

    cache() needs parentheses to work [OK]
Hint: Always call cache() with parentheses [OK]
Common Mistakes:
  • Forgetting parentheses on cache method
  • Confusing cache with dataset creation
  • Assuming cache is a property
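The failure in question 4 and its fix can be reproduced side by side. This sketch catches the error only so that both cases can run in one script; the variable names are illustrative.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# Without parentheses this is a bound method, not a dataset.
not_a_dataset = dataset.cache
error_type = None
try:
    for x in not_a_dataset:
        print(x.numpy())
except TypeError as err:
    error_type = type(err).__name__
    print("Iteration failed:", err)

# With parentheses, cache() returns a dataset that can be iterated.
cached = dataset.cache()
values = [int(x.numpy()) for x in cached]
print(values)
```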
5. You have a large dataset that takes time to preprocess. You want to cache it on disk to avoid reprocessing every training run. Which code snippet correctly caches the dataset on disk and then batches it for training?
hard
A.
dataset = tf.data.TFRecordDataset('data.tfrecord')
dataset = dataset.cache('cache_file')
dataset = dataset.batch(32)
B.
dataset = tf.data.TFRecordDataset('data.tfrecord')
dataset = dataset.batch(32)
dataset = dataset.cache('cache_file')
C.
dataset = tf.data.TFRecordDataset('data.tfrecord')
dataset = dataset.shuffle(1000)
dataset = dataset.cache()
D.
dataset = tf.data.TFRecordDataset('data.tfrecord')
dataset = dataset.cache()
dataset = dataset.shuffle(32)

Solution

  1. Step 1: Understand caching order importance

    Caching should happen before batching to store the full preprocessed dataset, avoiding repeated preprocessing.
  2. Step 2: Identify correct code order

    dataset = tf.data.TFRecordDataset('data.tfrecord')
    dataset = dataset.cache('cache_file')
    dataset = dataset.batch(32)
    caches dataset on disk first, then batches it. Other options either batch before caching or miss caching to disk.
  3. Final Answer:

    dataset = dataset.cache('cache_file') before batching -> Option A
  4. Quick Check:

    Cache before batch to save preprocessing time [OK]
Hint: Cache before batching to avoid repeated preprocessing [OK]
Common Mistakes:
  • Batching before caching, which stores fixed batches instead of individual elements
  • Not specifying filename for disk caching
  • Caching after shuffle, which stores one shuffled order and replays it every epoch
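The effect of where `cache()` sits relative to `shuffle()` and `batch()` can be checked on a small dataset. The sketch below uses an in-memory cache and `Dataset.range(100)` as stand-ins for the TFRecord pipeline in the question. The helper `epochs` is a hypothetical name.

```python
import tensorflow as tf

def epochs(ds, n=2):
    # Collect the element order seen in each of n passes over the dataset.
    return [[int(v) for v in ds.as_numpy_iterator()] for _ in range(n)]

base = tf.data.Dataset.range(100)

# shuffle -> cache: the first shuffled order is stored and replayed.
frozen = base.shuffle(100).cache()
frozen_epochs = epochs(frozen)

# cache -> shuffle: elements come from the cache, then are reshuffled each pass.
fresh = base.cache().shuffle(100)
fresh_epochs = epochs(fresh)

# Batching comes last, so the cache holds individual elements.
train_ds = fresh.batch(32)
batch_sizes = [int(batch.shape[0]) for batch in train_ds.as_numpy_iterator()]

print("shuffle -> cache repeats the same order:", frozen_epochs[0] == frozen_epochs[1])
print("cache -> shuffle repeats the same order:", fresh_epochs[0] == fresh_epochs[1])
print("batch sizes:", batch_sizes)
```

Placing `cache()` before `shuffle()` and `batch()` keeps the cached data reusable while still letting each epoch see a new order and new batch composition.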