Beyond the AI Noise: What Actually Improves Security?

Anvil Signal

What we're seeing, building, and learning in the field.

Our latest installment shares a snapshot of the research, tooling, and field observations coming out of Anvil Secure, and this one starts with news we're very proud of.

Recently, Anvil was trusted by Anthropic as one of a select group of external security partners to triage and validate Mythos findings as part of Project Glasswing.

The project gets at the central question underlying all the AI noise: not what can AI find, but what actually improves security. Where does automation help, where does it create risk, and where do human context and judgement still matter most?

AI is not the only area we are focused on, but given the pace of change, it is an increasingly important one. You can see that focus in much of our recent work, from AI security testing and LLM-assisted code review to passkey architecture, Burp Suite tooling, and vulnerability validation at scale.

Here is what we've been thinking about lately.

LATEST INSIGHTS

How We Use LLMs in Secure Code Review

Research Director Tao Sauvage explores how Anvil uses LLMs during secure code review, not as autonomous reviewers but as tools guided by engineers.

The post walks through where LLMs assist with planning, task breakdown, state management, phased review, and intermediate validation, and why human judgement still plays a critical role.

How We Test AI: LLM & GenAI Security Methodology at Anvil Secure

Security Engineer George Damiris walks through how Anvil approaches LLM and GenAI security assessments, from threat modelling and architecture review to adversarial testing of agentic workflows, MCP integrations, RAG pipelines, tool use, access boundaries, and data exposure risks.

For teams building or deploying AI-enabled apps, this is a practical look at what meaningful AI security testing needs to include.

Demystifying Passkeys: Under the Hood

Security Engineer Matteo Giordano's passkeys series takes a closer look at what is happening beneath the surface of passwordless authentication.

Part 1 focuses on the FIDO2 protocol layer, while Part 2 explores passkey architecture, synced vs device-bound credentials, and the implementation details that still shape real-world risk.

READ PART 1

READ PART 2

ByteBanter Demo on PortSwigger Discord

Senior Security Engineer Andrea Braschi recently presented ByteBanter live on the PortSwigger Discord.

ByteBanter is an open-source Burp Suite extension for generating context-aware Intruder payloads using Burp AI. The demo walks through how the extension works, newly released functionality, and how it can support testing workflows.

WATCH THE DEMO

VIEW THE TOOL

Want to see more? Access all research, tools, and perspectives we’re building.

See all research

What this means for security teams

Across these updates, the theme is not just speed but trust.

Security teams are being asked to adopt AI, modernize authentication, review more code, and respond to more findings, often without more time or more people. Faster workflows can look like progress while also introducing more noise and ambiguity.

That is why depth still matters. AI systems need to be tested across the full execution chain. LLMs can support secure code review, but they are most useful when guided by human judgement. Passkeys can materially improve authentication security, but implementation choices and deployment details still shape real-world risk.

The same is true for AI-assisted vulnerability discovery. More findings do not automatically mean better security outcomes. Teams still need to validate exploitability, assess impact, and prioritize what to fix.

That work reflects a challenge many teams are starting to face: finding more is only useful if you can determine what actually matters.

A few questions worth asking inside your own organization:

Where are we using AI in development, testing, or security workflows today?
What sensitive systems, source code, data, or tools can those workflows access?
When AI surfaces a finding, who validates exploitability, severity, and business impact?
Are we measuring progress by volume, or by better security decisions?
Where do we need more context, structure, or expert review?

For security and engineering leaders, the question is no longer just what can we find, but what can we trust, what should we act on, and what changes the risk picture?

Start a Conversation