Measuring AI Ability to Complete Long Cybersecurity Tasks


In March 2025, the Model Evaluation & Threat Research (METR) group introduced AI task time horizons as a method for measuring the length of tasks that models can complete autonomously and coherently. They demonstrated rapid capability growth across frontier systems, with task horizons doubling roughly every seven months. While this framework has primarily been applied to general software and knowledge work, its implications for adversarial domains remain largely unexplored.
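The doubling claim implies simple exponential growth in horizon length. As a rough illustration only (the starting horizon and projection window below are hypothetical, not figures from METR's study or from this work):

```python
def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Project a task time horizon forward, assuming a fixed doubling period."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# A hypothetical 60-minute horizon today, projected 21 months out
# (three doubling periods at 7 months each):
print(projected_horizon(60, 21))  # 480.0 minutes
```

The point of the exercise is that under a constant doubling period, small differences in the measured horizon today compound quickly over a planning timeline.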

In this talk, I present work I've done with Sean Peters and Jack Payne, extending METR’s methodology to offensive cybersecurity workflows, alongside a complementary human baseline study to ground and interpret model performance.

To better understand offensive model capabilities, we assembled realistic multi-step offensive task sequences drawn from a suite of industry-standard benchmarks. Both human participants and frontier models were evaluated across increasing task lengths to quantify sustained autonomy, coherence, and failure modes.

Initial results indicate that AI task horizons in offensive cyber are already meaningful and extending rapidly. In several domains, models can chain complex tool-driven actions in ways that resemble early-stage intrusion playbooks rather than isolated exploitation steps. The human study provides critical context, highlighting where models approach or diverge from human performance as task length increases.

The talk will cover the experimental design, empirical findings, and key limitations, emphasizing how horizon-based evaluation, combined with human grounding, surfaces trends that standalone, static benchmarks may miss.

Finally, this work is positioned as exploratory research. It raises questions about whether similar horizon trends appear in defensive workflows: how could we measure defensive task horizons, and what methods would allow meaningful comparisons to offensive performance? If the trend does not replicate in defense, what interventions, tooling, or policy changes could help close the gap? This framing invites further investigation and provides a roadmap for research and practitioner engagement in understanding and mitigating offense–defense asymmetries under AI automation.


Jeremy Miller, Sr. Manager, Cybersecurity Strategy & Research, OffSec

Jeremy Miller is an offensive security leader and educator, currently focused on how AI automation is reshaping adversarial capability. He spent over a decade at Offensive Security in technical and leadership roles across content development, training, and workforce development programs, bridging hands-on offensive methodology with pedagogy and strategy.

His current research, in collaboration with Sean Peters and Jack Payne, applies the METR AI task time horizon framework to realistic offensive cyber workflows, grounded by complementary human studies to measure autonomy scaling in adversarial domains.

Jeremy’s interests center on offense–defense asymmetry, empirical evaluation of autonomous systems, and translating AI security and safety research into practical implications for decision makers.