Testing and benchmarking for LLM agents, including behavioral testing, capability assessment, and reliability metrics.
openclaw install @rustyorb/agent-evaluation
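
To make "reliability metrics" concrete, here is a minimal sketch of one such metric: the pass rate of a behavioral check over repeated runs of the same prompt. It does not use this package's API, which is not documented here. The names `pass_rate`, `flaky_agent`, and `steady_agent` are hypothetical, and the agents are deterministic stand-ins for a real LLM call.

```python
import itertools
from typing import Callable


def pass_rate(
    agent: Callable[[str], str],
    prompt: str,
    check: Callable[[str], bool],
    runs: int = 5,
) -> float:
    """Return the fraction of runs whose output satisfies the behavioral check."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


# Hypothetical stand-in agents (assumptions, not part of any real library).
_answers = itertools.cycle(["4", "5"])


def flaky_agent(prompt: str) -> str:
    """Alternates between a right and a wrong answer to simulate unreliability."""
    return next(_answers)


def steady_agent(prompt: str) -> str:
    """Always returns the same answer."""
    return "4"


def is_four(output: str) -> bool:
    """Behavioral check: the agent must answer exactly "4"."""
    return output.strip() == "4"


flaky_rate = pass_rate(flaky_agent, "What is 2 + 2?", is_four, runs=4)
steady_rate = pass_rate(steady_agent, "What is 2 + 2?", is_four, runs=4)
print(f"flaky: {flaky_rate:.2f}, steady: {steady_rate:.2f}")
```

Repeating the same prompt separates capability (can the agent ever pass the check?) from reliability (how often does it pass?). With a real agent, `runs` would need to be large enough for the rate to be stable.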