When evaluating function calling in large language models (LLMs), a key metric is function call accuracy: the fraction of inputs for which the model selects the correct function. It matters because calling the wrong function cannot produce the correct result, just as the wrong tool cannot do the job.
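As a rough illustration, function call accuracy can be computed by comparing the function name the model chose with the expected one for each input. The sketch below assumes one expected function per input; the helper `function_call_accuracy` and the function names (`get_weather`, `send_email`, `search_web`) are hypothetical and chosen only for the example.

```python
def function_call_accuracy(predicted, expected):
    """Return the fraction of inputs where the predicted function name
    equals the expected one. Assumes exactly one function per input."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0  # convention chosen here for an empty evaluation set
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical evaluation data: one function name per input.
expected = ["get_weather", "send_email", "get_weather", "search_web"]
predicted = ["get_weather", "send_email", "search_web", "search_web"]

accuracy = function_call_accuracy(predicted, expected)
print(f"accuracy = {accuracy:.2f}")  # 3 of 4 match, so 0.75
```

This version checks only the function name. A stricter evaluation might also compare the arguments passed to the function, which is a separate design choice.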
Other important metrics are precision and recall for function calls. Precision is the fraction of the calls the model made that were correct, while recall is the fraction of the calls it should have made that it actually made. Together they show the balance between calling wrong or unnecessary functions (low precision) and missing needed ones (low recall).
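A minimal sketch of how precision and recall might be computed when each input can require zero or more function calls. It assumes calls are compared by name only and pooled across all inputs (micro-averaging); the helper `precision_recall` and the function names are hypothetical.

```python
def precision_recall(predicted_calls, expected_calls):
    """Micro-averaged precision and recall over sets of function names.

    predicted_calls / expected_calls: one collection of names per input.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predicted_calls, expected_calls):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)  # called and needed
        fp += len(pred - gold)  # called but not needed
        fn += len(gold - pred)  # needed but not called
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation data: the third input needs no call at all.
expected_calls = [{"get_weather"}, {"search_web", "send_email"}, set()]
predicted_calls = [{"get_weather", "search_web"}, {"search_web"}, set()]

precision, recall = precision_recall(predicted_calls, expected_calls)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this data the model makes one unnecessary call on the first input, which lowers precision, and misses `send_email` on the second, which lowers recall.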