
Checkpointing agent progress in Agentic AI - ML Experiment: Train & Evaluate

Problem: You have an AI agent that learns tasks over time. The agent's progress is not saved during training, so if interrupted, all learning is lost.
Current Metrics: Agent completes 80% of tasks in a session but loses all progress if stopped early.
Issue: No checkpointing means no saved progress. Training must restart from zero after interruption.
Your Task
Implement checkpointing to save the agent's state periodically so training can resume without loss.
Use only built-in or standard libraries for saving/loading state.
Checkpoint every 5 training episodes.
Ensure the agent resumes correctly from the last checkpoint.
Solution
import os
import pickle

class Agent:
    def __init__(self):
        self.knowledge = 0  # simple progress measure
        self.episode = 0

    def train_episode(self):
        # Simulate learning by increasing knowledge
        self.knowledge += 1
        self.episode += 1

    def save_checkpoint(self, filename='agent_checkpoint.pkl'):
        with open(filename, 'wb') as f:
            pickle.dump({'knowledge': self.knowledge, 'episode': self.episode}, f)

    def load_checkpoint(self, filename='agent_checkpoint.pkl'):
        if os.path.exists(filename):
            with open(filename, 'rb') as f:
                data = pickle.load(f)
                self.knowledge = data['knowledge']
                self.episode = data['episode']
                print(f"Loaded checkpoint: episode {self.episode}, knowledge {self.knowledge}")
        else:
            print("No checkpoint found, starting fresh.")


def train_agent(total_episodes=20, checkpoint_interval=5):
    agent = Agent()
    agent.load_checkpoint()

    for _ in range(agent.episode, total_episodes):
        agent.train_episode()
        print(f"Episode {agent.episode}: knowledge {agent.knowledge}")

        if agent.episode % checkpoint_interval == 0:
            agent.save_checkpoint()
            print(f"Checkpoint saved at episode {agent.episode}")

    # Save final checkpoint
    agent.save_checkpoint()
    print("Training complete and checkpoint saved.")


if __name__ == '__main__':
    train_agent()
Added save_checkpoint and load_checkpoint methods that persist the agent's state with pickle.
Implemented checkpoint saving every 5 episodes during training (checkpoint_interval=5).
Modified the training loop to start from the last saved episode, so a restarted run skips work that is already checkpointed.
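The resume behavior above can be exercised directly. This is a minimal, self-contained sketch (the checkpoint file lives in a throwaway temp directory, an illustrative choice, not part of the solution): it trains 7 episodes with a checkpoint every 5, simulates a crash, then reloads, recovering the state as of episode 5.

```python
import os
import pickle
import tempfile

# Throwaway checkpoint path so repeated runs don't collide (illustrative).
ckpt = os.path.join(tempfile.mkdtemp(), 'agent_checkpoint.pkl')

state = {'knowledge': 0, 'episode': 0}

# First run: 7 episodes, checkpointing every 5 episodes.
for _ in range(7):
    state['knowledge'] += 1
    state['episode'] += 1
    if state['episode'] % 5 == 0:
        with open(ckpt, 'wb') as f:
            pickle.dump(state, f)

# Simulated crash: episodes 6-7 were never checkpointed and are lost,
# but everything up to the last checkpoint survives.
with open(ckpt, 'rb') as f:
    resumed = pickle.load(f)

print(resumed)  # {'knowledge': 5, 'episode': 5}
```

Only the work since the last checkpoint (episodes 6-7 here) is lost, which is exactly the guarantee the experiment is after.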
Results Interpretation

Before checkpointing: Agent loses all progress if training stops early.

After checkpointing: Agent resumes from last saved state, preserving progress and saving time.

Checkpointing is essential for long-running training: it bounds the work lost to an interruption to at most one checkpoint interval, and it lets training resume where it left off instead of restarting from zero.
Bonus Experiment
Try implementing checkpointing with a human-readable format like JSON instead of pickle.
💡 Hint
Convert the agent state to a dictionary and save/load it with the json module. Remember to handle data types correctly: JSON supports only dicts, lists, strings, numbers, booleans, and null, so types like tuples or sets need conversion.
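Following that hint, here is one possible JSON variant (function names and the default filename are illustrative, not part of the original solution). The agent's state is already a flat dict of ints, so it round-trips through JSON without any type conversion:

```python
import json
import os

def save_checkpoint_json(state, filename='agent_checkpoint.json'):
    # Write the state dict as human-readable JSON (indent for readability).
    with open(filename, 'w') as f:
        json.dump(state, f, indent=2)

def load_checkpoint_json(filename='agent_checkpoint.json'):
    # Return the saved state, or a fresh state if no checkpoint exists.
    if os.path.exists(filename):
        with open(filename, 'r') as f:
            return json.load(f)
    return {'knowledge': 0, 'episode': 0}

state = {'knowledge': 7, 'episode': 7}
save_checkpoint_json(state)
print(load_checkpoint_json())  # {'knowledge': 7, 'episode': 7}
```

Unlike pickle, the resulting file can be inspected and edited in any text editor, and loading it cannot execute arbitrary code, which makes JSON the safer choice for simple state like this.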