
Function calling in LLMs in Agentic AI - Model Metrics & Evaluation

Which metrics matter for function calling in LLMs, and why

When evaluating function calling in large language models (LLMs), the headline metric is function call accuracy: how often the model makes the correct decision about which function to call, if any, for a given input. It matters because the right result depends on picking the right function, just as a job depends on choosing the right tool.

Other important metrics are precision and recall for function calls. Precision is the share of the model's function calls that were correct. Recall is the share of needed function calls that the model actually made. Together they show whether the model is calling wrong functions too often or missing calls it should make.

Confusion matrix for function calling
      Outcome              | Meaning
      ---------------------|------------------------------------------------------------------
      True Positive (TP)   | Model called the correct function when needed
      False Positive (FP)  | Model called a wrong function, or called one when none was needed
      True Negative (TN)   | Model correctly did not call a function
      False Negative (FN)  | Model missed calling a needed function

      Example:
      TP = 80, FP = 10, TN = 900, FN = 20
      Total samples = 1010
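
The counts in this example can be turned into the three metrics discussed so far. The sketch below is only arithmetic on the numbers above; the variable names are chosen for illustration.

```python
# Counts from the example above.
tp, fp, tn, fn = 80, 10, 900, 20

total = tp + fp + tn + fn
accuracy = (tp + tn) / total   # share of all decisions that were right
precision = tp / (tp + fp)     # share of made calls that were correct
recall = tp / (tp + fn)        # share of needed calls that were made

print(f"total={total} accuracy={accuracy:.3f} "
      f"precision={precision:.3f} recall={recall:.3f}")
# prints: total=1010 accuracy=0.970 precision=0.889 recall=0.800
```

Accuracy looks strongest here because the 900 true negatives dominate the total, while recall shows that 20 of the 100 needed calls were missed.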
    
Precision vs Recall tradeoff with examples

If the model calls functions too often, it may have high recall (few missed calls) but low precision (many wrong calls). For example, calling a weather function when not asked wastes resources.

If the model is too cautious, it may have high precision (calls are mostly correct) but low recall (misses some needed calls). For example, not calling a calculator function when math is requested leads to wrong answers.

Good function calling balances precision and recall so the model calls the right functions when needed without many mistakes.
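
One way to see the tradeoff is to imagine that the model produces a confidence score for "a function call is needed" and calls a function whenever the score passes a threshold. This is a simplified sketch: the scores, labels, and thresholds below are made up for illustration, and it only models the call or no-call decision, not which function is chosen.

```python
# Hypothetical data: (confidence that a call is needed, whether a call was truly needed).
examples = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.10, False),
]

def precision_recall(threshold):
    """Treat every example scoring at or above the threshold as a function call."""
    tp = sum(1 for score, needed in examples if score >= threshold and needed)
    fp = sum(1 for score, needed in examples if score >= threshold and not needed)
    fn = sum(1 for score, needed in examples if score < threshold and needed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

eager = precision_recall(0.2)      # calls often: precision 5/7, recall 5/5
cautious = precision_recall(0.85)  # calls rarely: precision 2/2, recall 2/5

print(f"eager:    precision={eager[0]:.2f} recall={eager[1]:.2f}")
print(f"cautious: precision={cautious[0]:.2f} recall={cautious[1]:.2f}")
```

With the low threshold no needed call is missed, but two of the seven calls are wrong. With the high threshold every call is correct, but three of the five needed calls are missed.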

What good vs bad metric values look like
  • Good: Precision and recall both above 0.85 means the model calls functions correctly most of the time and rarely misses needed calls.
  • Bad: Precision below 0.5 means many wrong function calls, causing confusion or errors.
  • Bad: Recall below 0.5 means the model misses many needed function calls, leading to incomplete or wrong answers.
  • Accuracy alone can be misleading if most inputs do not require function calls (high TN), so precision and recall are more informative.
Common pitfalls in metrics for function calling
  • Accuracy paradox: High accuracy can happen if the model rarely calls functions, but this misses many needed calls (low recall).
  • Data leakage: If test data includes exact prompts seen in training, metrics may be unrealistically high.
  • Overfitting: Model may memorize function calls for training prompts but fail on new inputs, causing poor generalization.
  • Ignoring context: Metrics must consider if the function call was appropriate for the input context, not just if a function was called.
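
The data leakage pitfall can be checked with a simple overlap test between training and test prompts. The sketch below is one possible check, not a complete solution: the helper names and prompts are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an added assumption so that trivially different copies of a prompt still count as duplicates.

```python
def normalize(prompt):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(prompt.lower().split())

def leaked_prompts(train_prompts, test_prompts):
    """Return test prompts whose normalized text also appears in the training set."""
    seen = {normalize(p) for p in train_prompts}
    return [p for p in test_prompts if normalize(p) in seen]

# Hypothetical prompts, for illustration only.
train = ["What's the weather in Paris?", "Add 5 and 3"]
test = ["what's the weather in  Paris?", "Book a meeting for Monday"]

overlap = leaked_prompts(train, test)
print(overlap)  # the first test prompt is flagged as a near-copy of a training prompt
```

Any prompt this check flags should be removed from the test set before metrics are reported. Paraphrased duplicates are not caught by this approach.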
Self-check question

Your model has 98% accuracy but only 12% recall on needed function calls. Is it good for production?

Answer: No. The high accuracy likely comes from many true negatives (no function call needed). But 12% recall means the model misses 88% of needed function calls, so it will often fail to perform required actions. This is not good for production.
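
To make the answer concrete, here is one set of counts that produces exactly these numbers. The counts are hypothetical and chosen only to match 98% accuracy and 12% recall.

```python
# Hypothetical counts chosen to reproduce the numbers in the question.
tp, fn = 12, 88     # 100 inputs needed a call; the model made only 12 of them
fp, tn = 0, 4300    # 4300 inputs needed no call, and the model made none

total = tp + fp + tn + fn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)

# A model that never calls any function gets every no-call input right.
never_call_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"never_call_accuracy={never_call_accuracy:.3f}")
# prints: accuracy=0.98 recall=0.12 never_call_accuracy=0.977
```

In this scenario a model that never calls a function would score almost the same accuracy, which is why accuracy alone says little here.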

Key Result
Precision and recall are key to evaluating both the correctness and the completeness of function calling in LLMs.

Practice

1. What is the main purpose of function calling in large language models (LLMs)?
easy
A. To prevent the LLM from understanding user questions
B. To let the LLM run specific external functions and get precise results
C. To slow down the LLM's response time intentionally
D. To make the LLM generate random text without any control

Solution

  1. Step 1: Understand function calling role

    Function calling lets LLMs connect to external code or tools to perform tasks.
  2. Step 2: Identify the main benefit

    This connection helps LLMs provide accurate, task-specific answers by running real functions.
  3. Final Answer:

    To let the LLM run specific external functions and get precise results -> Option B
  4. Quick Check:

    Function calling purpose = precise external function use [OK]
Hint: Function calling means running real code from the LLM [OK]
Common Mistakes:
  • Thinking function calling makes LLMs slower
  • Believing it causes random text generation
  • Assuming it blocks understanding questions
2. Which of the following is the correct way to specify a function call in an LLM prompt?
easy
A. {"name": "get_weather", "parameters": {"city": "Paris"}}
B. function_call: get_weather(city='Paris')
C. call_function('get_weather', city='Paris')
D. run get_weather with city=Paris

Solution

  1. Step 1: Recognize JSON format for function calls

    LLMs use structured JSON to specify function names and parameters clearly.
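  2. Step 2: Match the correct JSON syntax

    {"name": "get_weather", "parameters": {"city": "Paris"}} is a JSON object with "name" and "parameters", which is the format used in this lesson. Some APIs name the second key differently (for example "arguments"), so check the documentation of the API you use.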
  3. Final Answer:

    {"name": "get_weather", "parameters": {"city": "Paris"}} -> Option A
  4. Quick Check:

    Function call format = JSON object [OK]
Hint: Function calls in LLMs use JSON with name and parameters [OK]
Common Mistakes:
  • Using plain text instead of JSON
  • Trying to call functions like regular code
  • Missing quotes around keys or values
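
Because the call is plain JSON, an application can read it with a standard JSON parser. A minimal sketch, assuming the "name" and "parameters" keys used in this question:

```python
import json

# The function call from Option A, as a raw JSON string.
raw = '{"name": "get_weather", "parameters": {"city": "Paris"}}'

call = json.loads(raw)           # raises json.JSONDecodeError on malformed JSON
function_name = call["name"]
arguments = call["parameters"]

print(function_name, arguments)  # get_weather {'city': 'Paris'}
```

The other options would fail at the json.loads step, because they are not valid JSON.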
3. Given this function call JSON sent to an LLM:
{"name": "calculate_sum", "parameters": {"a": 5, "b": 3}}

What should the LLM do next?
medium
A. Return the sum 8 as the function output
B. Ignore the function call and generate unrelated text
C. Ask the user to provide values for a and b
D. Throw an error because parameters are missing

Solution

  1. Step 1: Understand the function call content

    The JSON specifies a function named "calculate_sum" with parameters a=5 and b=3.
  2. Step 2: Determine expected LLM behavior

    The function should be run with these inputs and its result, 8, returned. In practice the application hosting the LLM executes the function and passes the result back to the model.
  3. Final Answer:

    Return the sum 8 as the function output -> Option A
  4. Quick Check:

    Function call with inputs = output sum 8 [OK]
Hint: LLM runs function with given inputs and returns result [OK]
Common Mistakes:
  • Ignoring the function call and chatting instead
  • Requesting inputs already provided
  • Assuming missing parameters cause errors
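
The step of running the function can be sketched as a small dispatcher on the application side. This is an illustrative sketch: the REGISTRY dictionary and the execute_call helper are hypothetical names, not part of any specific LLM API.

```python
import json

def calculate_sum(a, b):
    """The real function behind the "calculate_sum" name."""
    return a + b

# Hypothetical registry mapping function names to Python callables.
REGISTRY = {"calculate_sum": calculate_sum}

def execute_call(raw_call):
    """Parse a function call JSON string and run the matching function."""
    call = json.loads(raw_call)
    func = REGISTRY[call["name"]]
    return func(**call["parameters"])

result = execute_call('{"name": "calculate_sum", "parameters": {"a": 5, "b": 3}}')
print(result)  # 8
```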
4. You wrote this function call JSON for an LLM:
{"name": "get_user_info", "params": {"user_id": 42}}

Why might the LLM fail to execute this call?
medium
A. Because the function name must be uppercase
B. Because user_id must be a string, not a number
C. Because the key should be "parameters", not "params"
D. Because the JSON is missing a closing brace

Solution

  1. Step 1: Check JSON keys for function calling

    In the call format used here, the expected key for the inputs is "parameters", not "params".
  2. Step 2: Identify why the call fails

    With "params", the inputs are not recognized, so the function cannot be run.
  3. Final Answer:

    Because the key should be "parameters", not "params" -> Option C
  4. Quick Check:

    Correct key = "parameters" [OK]
Hint: Use "parameters" key exactly in function call JSON [OK]
Common Mistakes:
  • Using "params" instead of "parameters"
  • Assuming number types cause failure
  • Thinking function names must be uppercase
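
A small validation step can catch this kind of mistake before any function is run. The validate_call helper below is a hypothetical sketch that assumes the "name" and "parameters" format used in these questions.

```python
def validate_call(call):
    """Return a list of problems found in a parsed function call dict."""
    problems = []
    if not isinstance(call.get("name"), str):
        problems.append('missing or non-string "name"')
    if not isinstance(call.get("parameters"), dict):
        problems.append('missing or non-object "parameters"')
    unexpected = sorted(set(call) - {"name", "parameters"})
    if unexpected:
        problems.append(f"unexpected keys: {unexpected}")
    return problems

bad = validate_call({"name": "get_user_info", "params": {"user_id": 42}})
good = validate_call({"name": "get_user_info", "parameters": {"user_id": 42}})

print(bad)   # reports the missing "parameters" key and the unexpected "params" key
print(good)  # []
```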
5. You want to build a chatbot that can book appointments by calling an external scheduling function. Which approach best uses function calling in LLMs to achieve this?
hard
A. Ask the user to book appointments manually without automation
B. Make the LLM guess appointment times without calling any function
C. Hardcode all appointment slots inside the LLM prompt text
D. Define a function schema with name and parameters, let LLM call it with user inputs, then run the real scheduler

Solution

  1. Step 1: Understand chatbot function calling design

    Function calling lets the LLM decide when to call the scheduler with user data.
  2. Step 2: Choose best integration method

    Defining a function schema and letting the LLM call it dynamically is the correct approach.
  3. Final Answer:

    Define a function schema with name and parameters, let LLM call it with user inputs, then run the real scheduler -> Option D
  4. Quick Check:

    Function calling enables dynamic external task calls [OK]
Hint: Use function schema and let LLM call real scheduler [OK]
Common Mistakes:
  • Ignoring function calling and guessing answers
  • Hardcoding data inside prompt text
  • Not automating booking at all
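
The approach in Option D can be sketched end to end. Everything here is an assumption made for illustration: the schema layout, the book_appointment function, and the in-memory list standing in for a real scheduling service. The model itself is not called; the example starts from a call the model might emit.

```python
# Hypothetical schema describing the scheduling function to the model.
BOOK_APPOINTMENT_SCHEMA = {
    "name": "book_appointment",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "time": {"type": "string"},
        },
        "required": ["date", "time"],
    },
}

booked = []  # stand-in for a real scheduling backend

def book_appointment(date, time):
    """Record the booking and return a confirmation the chatbot can relay."""
    booked.append((date, time))
    return f"Booked for {date} at {time}"

def handle_model_call(call):
    """Run the scheduler only if the call names it and has every required input."""
    if call["name"] != BOOK_APPOINTMENT_SCHEMA["name"]:
        return "Unknown function"
    required = BOOK_APPOINTMENT_SCHEMA["parameters"]["required"]
    missing = [key for key in required if key not in call["parameters"]]
    if missing:
        return f"Need more details: {', '.join(missing)}"
    return book_appointment(**call["parameters"])

# A call as the model might emit it after the user asks for a slot.
confirmation = handle_model_call(
    {"name": "book_appointment", "parameters": {"date": "2025-03-10", "time": "14:00"}}
)
print(confirmation)  # Booked for 2025-03-10 at 14:00
```

Checking for missing inputs before running the scheduler lets the chatbot ask a follow-up question instead of making a broken booking.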