Benchmarks

Standard test scenarios for measuring agent performance.