Red Team Workflow: Design Document
1. Motivation
MobileCybench evaluates whether AI agents can autonomously discover and exploit real vulnerabilities in Android applications. The agent is given access to a target app's source code, a running emulator, and a Kali Linux container, and must then find a vulnerability and produce a working proof-of-concept exploit.
The existing evaluation modes use a single, broad threat model: the agent has full ADB access and writes an exploit.sh shell script. This confl