March 2026
RESEARCH

Testing AI for Vulnerability Research: 4 Approaches & Where I Failed

#AI
#Vulnerability Research
#SAML
#HTTP Smuggling
#Security

TL;DR

I tested 4 AI-assisted approaches for finding vulnerabilities over one week. Found real bugs — 14 confirmed vulns in one target in 20 minutes. Also burned time on an approach that found nothing useful. AI can find bugs — that’s true. But which ones actually have impact? That’s another story. Everything I’m sharing here is a real bug, spec violation, or genuinely wrong practice — but when you actually try to exploit them, most fall apart. AI is fast at coverage, hypothesis generation, and code analysis. It’s bad at impact assessment, validation, and knowing what’s actually exploitable. Every model inflated findings. The researcher’s judgment was the difference between noise and CVEs every single time. AI doesn’t replace security researchers — it’s a force multiplier.


Note: I don’t have a formal AI/ML background. This was my first time seriously using AI for vulnerability research. Take this with a grain of salt — your results will likely differ based on your target selection, how you set things up, your prompts, and your overall approach. I deliberately picked some hardened targets to see how well AI performs under pressure, which skews the numbers. I could be wrong about some of my conclusions. This is just what I tried and what I observed. The whole thing took about a week and a half. The bugs shown here are examples, not the complete list — some findings are omitted due to pending disclosure or because they’d identify specific targets.




Over roughly a week, I tested 4 different AI-assisted approaches to finding vulnerabilities and found real bugs — some critical. I also wasted days on approaches that produced nothing useful. This is an honest breakdown of what worked, what didn’t, and where AI fits in this workflow.


Approach 1: Blackbox RFC Spray

I used AI to help build the testing setup, pointed it at the research direction, and had it systematically map RFC gaps, contradictions, and MAY/SHOULD ambiguities across HTTP/1.0, HTTP/1.1, proxy specifications, and HTTP/2.

The primary focus was HTTP/1.1. The AI mapped where specs disagree — RFC 7230 vs RFC 9110 on Content-Length handling (first-wins vs last-wins), where “MUST reject” becomes “MAY ignore” across spec versions, where proxies are allowed to be lenient. From those gaps, it generated payload variations and I sprayed them across a matrix of 23 backend servers and proxies in Docker labs.

Results: 55 findings across 23 backend servers, tested through H1 and H2→H1 proxy chains. After triage: 8 confirmed exploitable with end-to-end combo lab PoCs, 11 path traversal bugs, 4 logic bugs, plus a funky chunk RFC violation across 16 servers and 6 proxies. 3 turned out to be false positives, 2 were already reported by other researchers.

Confirmed Bugs

The exploitable bugs all follow the same pattern: the AI finds a parser disagreement, I build the proxy→backend chain in Docker labs to prove it’s exploitable end-to-end.

CL octal parsing via strtol(base=0). A backend’s Content-Length parser uses strtol with auto-base detection. CL:0400 is read as octal 256, not decimal 400. Proxy reads it as decimal 400 and forwards 400 body bytes. Backend only consumes 256 — the remaining 144 bytes become a smuggled request. Confirmed exploitable through two major proxies in default configuration — no lenient mode needed.

POST / HTTP/1.1
Content-Length: 0400       ← proxy: decimal 400. backend: octal 256.

[256 bytes padding]GET /smuggled HTTP/1.1
Host: target               ← backend treats these 144 leftover bytes as next request

Bare LF chunk framing. A backend accepts bare \n (without \r) as chunk line terminator. One proxy forwards bare LF chunks. Backend and proxy disagree on body boundaries — 1 request in, 2 responses out. Confirmed end-to-end.
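
For illustration, a bare-LF framed body might look like this (a sketch of the disagreement, not a working PoC; the trailing bytes a proxy would need before it considers the body complete are omitted):

POST / HTTP/1.1
Host: target
Transfer-Encoding: chunked

5\n                        ← backend: bare LF ends the chunk-size line
AAAAA\n
0\n
\n                         ← backend: body complete here
GET /smuggled HTTP/1.1     ← backend parses this as a second request;
Host: target                 the proxy still counts these bytes as body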

Duplicate TE first-wins. Two TE headers: TE:identity + TE:chunked. Backend takes first (identity → falls to CL). A proxy that forwards both uses chunked. Body framing disagrees — smuggling confirmed through one proxy.

Multi-valued TE ignored. TE:gzip,chunked — backend doesn’t recognize the comma-separated value as chunked, falls to CL. Proxy parses chunked from the list. Same desync pattern, confirmed through same proxy.

Space-in-name header truncation. A backend silently drops ALL headers after encountering a space in a header name. Content-Length placed after the malformed header vanishes. Body becomes a smuggled request. Dead in pure H1 — every proxy rejects in default config. Revived through a major proxy’s documented lenient mode:

POST / HTTP/1.1
Host: target
X -Forwarded: x           ← space terminates backend's header parsing
Content-Length: 43         ← backend NEVER SEES THIS (truncated)

GET /admin HTTP/1.1        ← these 43 bytes aren't consumed as body
Host: target               ← backend treats them as the next request

The proxy checks ACLs on the POST; the backend processes the smuggled GET to /admin. ACL bypass with a full PoC.

NUL byte header splitting. A backend splits headers on NUL bytes. X-Foo\x00Content-Length: 50 becomes two separate headers. Dead in pure H1 — all proxies reject NUL in headers. Revived via H2 downgrade: one H2→H1 proxy forwards NUL bytes in header values.

Path traversal bugs (11 across 3 backends). Backslash traversal (/public\../admin → /admin), null byte truncation (/admin%00.jpg → /admin), encoded dot traversal (/%2e%2e/admin → /admin), semicolon stripping (/public;/../admin → /admin), fragment as path separator. These work through any proxy — proxies forward URI paths without normalizing these patterns. ACL bypass trivial to demo.

Underscore-to-hyphen header normalization. Two backends map _ to - in header names internally. X_Forwarded_For → X-Forwarded-For. Proxies that strip X-Forwarded-For from clients won’t strip X_Forwarded_For. IP spoofing past proxy XFF filters through any proxy.

CRLF in quoted chunk-extension. RFC doesn’t allow CRLF inside quoted strings in chunk-ext. 16 servers + 6 proxies don’t track quote state — CRLF inside a quoted chunk-ext value is treated as a real line terminator, promoting attacker text to trailer headers. Reported across the board, but chunk extensions are barely implemented across the ecosystem, and there’s no practical exploitation since most implementations treat them the same way.
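
For illustration, the last-chunk variant looks roughly like this (header name illustrative): a quote-unaware parser ends the line at the embedded CRLF and reads the attacker text as a trailer, while a quote-aware one treats the whole thing as a single malformed extension value.

POST / HTTP/1.1
Host: target
Transfer-Encoding: chunked

5;note="ok"\r\n
AAAAA\r\n
0;note="a\r\n              ← quote-unaware parser: last-chunk line ends here
X-Injected: evil"\r\n      ← …and this line is read as a trailer header
\r\n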

H2→H1 Downgrade Testing

We also tested the H2→H1 conversion layer across 15 proxies with ~3,600 test cases across 26 categories. The H2 testing found mostly RFC violations — proxies accepting things they shouldn’t per spec, but not actually forwarding them in exploitable ways. The main value was reviving dead H1 bugs: the NUL byte header splitting and bare CR/LF delivery paths only exist through H2 proxies that have different validation than H1 proxies.

Some hypotheses were my ideas that the AI formalized, others the AI generated on its own from the research material we’d gathered:

Empty DATA frame → zero-size chunk injection. From the H2 spec material, the AI reasoned that an empty DATA frame (length=0, no END_STREAM) mid-body could convert to 0\r\n\r\n — the chunked terminator — smuggling everything after it.
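
If a converter translated frames to chunks naively, the hypothesized failure would look roughly like this (an illustration of the hypothesis, not observed behavior of any particular proxy):

DATA frame, len=200   →   c8\r\n <200 bytes> \r\n
DATA frame, len=0     →   0\r\n\r\n                ← the chunked body terminator
DATA frame, len=50    →   backend parses these bytes as the start of a new request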

RST_STREAM after partial body → dirty backend connection. H2 POST with CL:1000, deliver 200 bytes, RST_STREAM. Proxy pools the backend connection, next victim’s request fills the remaining body. In practice, proxies handle RST_STREAM cleanly and don’t return dirty connections — this one didn’t pan out.

The Catch

The AI didn’t find any of the exploitable chains. I did. The AI found the backend parser bugs and mapped which proxies forward what. I connected the dots — the lenient proxy modes, the specific bytes that trigger each behavior, the end-to-end chains. That’s the pattern: AI maps the surface, researcher builds the exploit.

It also inflated results badly. Out of 55 initial findings, many were false chains that looked plausible on paper but fell apart in testing. 3 were outright false positives (0 drain hits on re-test). The model would flag a payload as “working” when the proxy-backend pair would never actually desync on it. I still had to sit in labs validating every single one.

Good for mapping RFC gaps, generating research directions, and finding parser bugs. Bad at chaining logic, bad at knowing what’s actually exploitable. The exploitable chains came from manual work.


Approach 2: Two Fine-Tuned Models + Opus Brain

The most involved setup. Three-tier model architecture: a 32B security expert fine-tuned on ~24,000 training pairs (CVE analysis, attack patterns, spec gaps, encoding tricks), a 14B implementation expert fine-tuned per-target on code tracing and edge case behavior, and Opus as orchestrator deciding what to ask each one.

Target: XML parsers. I scoped 61 parsers across 11 language ecosystems, combined all the attack-surface knowledge into the security expert model, and tested 12 parsers against ~2,800 payloads.

Results: 53 confirmed bugs, 147 suspicious behaviors across 12 parsers. Zero reportable.

Bugs Found (Not Exploitable)

The numbers look impressive: 200 total findings across 12 parsers. These are real spec violations, real wrong behaviors, real bugs. But I tested them against major downstream projects — SAML libraries, signature verifiers, document processors — and none of the bugs were exploitable in practice. Every single one needs a specific precondition that doesn’t hold in real deployments.

Some examples of why they don’t matter:

Prototype pollution via constructor.prototype. The parser sanitizes __proto__ but not the constructor chain. <constructor><prototype><isAdmin>true</isAdmin></prototype></constructor> creates a pollution gadget. But exploitation requires the downstream application to pass parsed output through a deep-merge utility like _.merge — and the major projects consuming these parsers don’t do that.
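
A minimal sketch of the missing precondition, using a hand-rolled deep-merge as the hypothetical downstream sink (the parser output is simplified to plain JSON):

// Simplified stand-in for the parser's output of
// <constructor><prototype><isAdmin>true</isAdmin></prototype></constructor>
const parsed = JSON.parse('{"constructor":{"prototype":{"isAdmin":"true"}}}');

// The kind of naive deep-merge a downstream app would need to have:
// it recurses into whatever already exists at target[key], including
// the inherited `constructor` function and its `prototype`.
function deepMerge(target, src) {
  for (const key of Object.keys(src)) {
    if (src[key] && typeof src[key] === 'object' && target[key]) {
      deepMerge(target[key], src[key]);
    } else {
      target[key] = src[key];
    }
  }
  return target;
}

deepMerge({}, parsed);
console.log({}.isAdmin); // "true": Object.prototype is now polluted

The downstream projects I checked never pass parser output through a merge like this, which is exactly the missing precondition.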

Numeric character reference passthrough. XML §4.1 says &#N; MUST be resolved. Multiple parsers leave them as literal strings. &#60;script&#62; passes untouched but a browser would decode it. Exploitable only if the parsed output goes directly into an HTML context without re-encoding — and the downstream SAML/XML-signing libraries I checked don’t do that.

Duplicate attribute acceptance (last-write-wins). XML §3.1 forbids duplicate attributes. Multiple parsers silently accept them. But for this to be a security issue, a SAML library would need to rely on attribute uniqueness for assertion integrity — and the ones I tested validate signatures before extracting attributes, so duplicate attributes in unsigned content don’t help.

Missing attribute newline normalization. XML §3.3.3 says LF in attribute values must be replaced with space. Multiple parsers preserve raw newlines. Breaks C14N and digital signatures — but only if the signing library and the verifying library use different parsers with different normalization behavior. The libraries I tested use the same parser for both.

Stopnode premature close. </script> inside a string literal breaks out of the stop node. Direct XSS vector — but only in an HTML-in-XML configuration that nobody uses in the downstream projects I checked.

The pattern is always the same: the bug is real, the spec violation is real, but the precondition for exploitation doesn’t exist in the actual downstream consumers.

Why It Failed

This is the core problem with targeting parsers in isolation. A parser doing something weird only matters if something downstream trusts that output dangerously. The AI is good at finding the weird behavior. It can’t tell you whether any real application actually hits that path.

The fine-tuned model architecture did have one real advantage though: the main LLM brain had better visibility. It could directly question the specialist models about implementation details, like “does this parser normalize whitespace before or after entity expansion?” and get answers grounded in the actual codebase instead of hallucinated ones. The reasoning was sharper because the orchestrator could query domain-specific knowledge on demand at a fraction of the cost of sending the full codebase to a frontier model every time. When it asked about prototype pollution, it wasn’t guessing from general knowledge, it was informed by the implementation expert’s understanding of how the parser actually builds its output objects. The target was wrong, not the architecture. If I’d pointed this at a SAML library instead of isolated parsers, I think the results would have been different. I might revisit this when I have the budget.

The fine-tuning itself wasn’t the waste. The target selection was.

This approach was cheaper per target run (roughly $4–6), but the fine-tuning time, compute, and content gathering cost far more, and it’s only worth doing when you’re planning systematic research across an ecosystem.


Approach 3: Hypothesis-Driven from Code and CVE History

Instead of spraying blind, this approach reasons about the target:

  1. Indexes the target codebase using tree-sitter — function signatures, call graphs, entry points
  2. Summarizes code structure and identifies trust boundaries
  3. Pulls CVE history from OSV.dev for the target and its dependencies
  4. Runs behavioral probes — 30-50 edge-case tests to observe actual library behavior before theorizing
  5. Generates structured hypotheses with confidence levels based on code patterns + past vulns + observed behavior
  6. Maps dependency impact chains — if library A has a flaw, which downstream libraries inherit it?

I tested this on a SAML authentication library and its dependency chain — 7 libraries covering XML parsing, signature verification, XPath evaluation, and session management.

Results: 38 hypotheses tested across 3 runs. 12 confirmed (31%), 3 partial, 23 rejected.

The hypothesis quality was the real output here, more than the bugs themselves. Each hypothesis wasn’t just “test X for vulnerability Y.” It had structure: the specific function, what assumption it violates, the attack vector, and why it thinks this based on CVE patterns in similar libraries. One hypothesis looked at a 2021 signature bypass CVE, traced how the fix changed XPath evaluation, cross-referenced that with the current logout flow, and asked “does this same trust boundary violation exist here?” It cited the specific code lines, the CVE pattern, the assumption being violated. That kind of cross-library reasoning across a dependency chain takes days to do manually.
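
For a sense of the shape, a hypothesis record looked roughly like this (field names are illustrative, not the tool’s actual schema):

// Illustrative only: the structure of a generated hypothesis
const hypothesis = {
  id: 'H-17',
  target: 'logout flow: XPath-based signature lookup',
  assumption: 'the element whose signature is verified is the element later consumed',
  vector: 'move the signature reference so verification and consumption diverge',
  priorArt: '2021 signature-bypass CVE in a sibling library (same fix pattern)',
  confidence: 'medium',
  test: 'craft a response where the verified node differs from the extracted node',
};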

Bugs Found (Low Exploitation Impact)

The biggest differentiator was the deep analysis step — Opus with a 100K token budget for extended thinking.

Baseline findings (9 medium/low):

  • Unvalidated relay parameter — open redirect/XSS, no sanitization. Confirmed across GET and POST bindings, no auth required. But depends on application-level handling that most frameworks already mitigate.
  • No size validation on auth payloads — 5MB base64 payload = ~2.5s CPU, ~30MB RAM. Zero size checks. Unauthenticated, but a slow resource drain, not a crash.
  • Cross-tenant certificate exposure — Host header manipulation in multi-tenant metadata generation leaks one tenant’s certs. Requires multi-tenant config most deployments don’t use.
  • Cache design forces disabling replay protection — default cache provider doesn’t work in clustered deployments, pushing users to disable replay checks entirely.
  • Async logout race — logout URL generation runs async but session destruction fires unconditionally before it completes.
  • Deprecated URL parser differential — url.parse creates req.url/req.query divergence for redirect-binding signature verification.

Deep findings (3 the baseline missed):

  • Session survives SP-initiated logout — when a logout acknowledgment arrives, the handler calls this.pass() — continuing to the next middleware — without ever calling req.logout(). The session survives indefinitely after the user “logs out.” Confirmed: 5/5 tests showed pass_called: true, req_logout_called: false, session_still_active: true. But requires a specific SLO flow most deployments don’t exercise.
  • IdP-initiated SLO race condition — logout sends the redirect before session destruction completes, leaving a ~50ms race window. If the session store write fails, the session survives. Tiny window, depends on store failure.
  • Query parameter priority override — query-string parameters unconditionally take priority over the POST body, allowing GET parameters to override POST-binding messages (see the sketch below).
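
The query-priority override reduces to a parameter-extraction order where the query string wins. A minimal sketch of the pattern (not the library’s actual code):

// If message extraction looks anything like this, an attacker can append
// ?SAMLRequest=... to a POST-binding endpoint and have it override the body.
function getSamlMessage(req) {
  return req.query.SAMLRequest || req.body.SAMLRequest;
  //     ^ query string checked first, so a GET parameter wins over the POST body
}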

The bugs are real. The reasoning that found them was impressive. But the exploitation impact was limited across the board. Individually, none of these were things I’d confidently report as vulnerabilities.

Why It Failed

Where it failed hardest: validation. Test cases looked correct but didn’t trigger the bugs reliably. The model would write a PoC, claim it worked, and be wrong — or partially right with the wrong assertion. The hypothesis generation was good. The lab work was not.

Good research assistant. Bad lab partner.


Approach 4: Domain Research + Frontier Model (SECRA)

Completely different from Approach 3. Instead of generating hypotheses for an AI to test, I dropped the hypothesis step entirely. SECRA gathers domain-level research (CVEs, attack techniques, spec violations, code patterns) and structures it into goals for a frontier model to investigate one at a time.

The pipeline builds a research package for a target domain:

  1. CVE analysis — Pull every CVE for the domain, extract root causes, fix gaps, and variant angles. For SAML: 100 CVEs across all known implementations.
  2. Technique extraction — From blogs, writeups, commits, and advisories, extract concrete attack techniques. 149 techniques for SAML.
  3. Pattern mapping — 25 code-level vulnerability patterns with what to search for, why it’s a bug, how to test it, and which CVEs it maps to.
  4. Spec analysis — Map MUST/SHOULD requirements, discretion zones (where the spec says “MAY” and implementations diverge), and undefined behaviors. For SAML: things like “spec doesn’t define behavior with multiple Assertion elements,” “spec doesn’t address XML comment injection in NameID values,” “URL comparison rules for Destination/Recipient not specified.”
  5. Novel angles — Attack hypotheses that don’t come from existing CVEs. Things like xml:base attribute injection to redirect Reference URI resolution, or internal DTD ATTLIST injection to fabricate ID attributes on forged elements.

This gets packaged with a guide: pick one pattern or angle, read the target code, trace the data flow, write a concrete test, run it, report the result. One thing at a time. Don’t scan everything at once.

The frontier model gets all of this as context and works through it against the target codebase. It reads the actual source, writes test harnesses, executes them, and checks whether each pattern or angle exists in the target. The research is pre-done. The model investigates and validates; it doesn’t theorize.

One caveat: I had to review everything to make sure the model wasn’t taking shortcuts — disabling security features in the test setup, weakening configurations, or subtly making the code more vulnerable and then “finding” the bug it introduced. This happens. The model wants to show results, so it’ll disable signature verification “for testing,” use an insecure default it configured itself, or write a harness that strips validation the real library would perform. You need to verify every finding exists in the real library with its real configuration, not just in the model’s lab. Can’t skip this step.

I ran this against multiple targets. The results varied dramatically based on one thing: how much prior security attention the target had.

Hardened Target: A Mature SAML Library

18 findings. Breakdown:

  • 7 were design choices the maintainers documented
  • 4 were spec violations maintainers call “working as intended”
  • 3 were unreachable dead code
  • 2 were theoretical with no practical exploitation
  • 1 was cosmetic

1 was reportable. A scope filtering function used string suffix matching instead of domain boundary matching:

host  = "login.notevil.com"
scope = "evil.com"

# Vulnerable check: is "evil.com" a suffix of "login.notevil.com"?
# position("evil.com") == len("login.notevil.com") - len("evil.com")
# 9 == 9 → TRUE (should be FALSE — not a domain boundary)
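
A boundary-aware check needs either an exact match or a dot immediately before the suffix. A sketch of the difference (illustrative, not the library’s code):

// Naive suffix match: what the vulnerable check effectively does
const naive = (host, scope) => host.endsWith(scope);
naive('login.notevil.com', 'evil.com');   // true (wrong)

// Domain-boundary aware
const scoped = (host, scope) => host === scope || host.endsWith('.' + scope);
scoped('login.notevil.com', 'evil.com');  // false
scoped('login.evil.com', 'evil.com');     // true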

An attacker’s IdP at notevil.com can assert scoped attributes for evil.com. That’s a 5% hit rate. The AI inflated everything else. It couldn’t tell the difference between “this code does something weird” and “this is exploitable.” Every finding came with high confidence and detailed reasoning that sounded convincing. 17 out of 18 were noise.

Fresh Target: A SAML Library

Same tool, different target — a SAML library with significant downloads. Still actively used. It just hadn’t received focused security review.

Bugs Found

14 confirmed vulnerabilities in 20 minutes.

  • 2 authentication bypasses — full SAML assertion forgery, requires understanding the entire authentication flow and where signature verification actually binds to assertion consumption
  • CBC padding panic — CBC padding validation crashes on malformed input. The error path is distinguishable (theoretically a padding oracle), but full Vaudenay decryption wasn’t exploitable in practice. The DoS via the panic is real — unauthenticated crash with a single crafted message.
  • XSLT execution — attacker-controlled XSLT transforms executed during signature verification. Code execution surface.
  • Multiple DoS vectors — resource exhaustion through crafted SAML messages

Plus 3 patterns from the research that were checked and correctly ruled out — the model didn’t hallucinate findings where they didn’t exist.

Deeper Ecosystem Run

On a separate run against a different SAML ecosystem (7 libraries), the frontier model worked through the research package — patterns, techniques, spec violations — against each library in the chain. 9 findings were CVE-worthy after triage, 6 validated with standalone PoCs.

The structured domain research made a real difference here. The model wasn’t guessing. It was checking whether known vulnerability patterns from 100 CVEs across the SAML ecosystem exist in this specific target.

Bugs Found

CBC padding crash via error type leak. A CBC decryption function disables the crypto library’s built-in padding check, then does manual validation. On invalid padding, it calls a callback variable that doesn’t exist in scope — throwing a ReferenceError instead of a generic Error. The caller’s try/catch distinguishes these error types. Theoretically that’s a padding oracle — the model flagged it because the research package included Vaudenay patterns from CVE history, and the code matched: manual padding check after setAutoPadding(false). In practice, the full decryption attack wasn’t exploitable — but the crash on the error path is real. Unauthenticated DoS with a single crafted ciphertext. The AI correctly identified the vulnerable code pattern; it just overestimated the exploitation impact.
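
The vulnerable shape, roughly (a sketch of the pattern the model matched, not the library’s source; the decipher calls are Node’s standard crypto API):

const crypto = require('crypto');

function decryptCbc(key, iv, ciphertext) {
  const decipher = crypto.createDecipheriv('aes-256-cbc', key, iv);
  decipher.setAutoPadding(false);   // built-in padding check disabled
  const plain = Buffer.concat([decipher.update(ciphertext), decipher.final()]);

  const pad = plain[plain.length - 1];
  if (pad < 1 || pad > 16) {
    // `callback` was never defined in this scope: invalid padding throws a
    // ReferenceError, which the caller's try/catch can tell apart from a
    // normal Error (and which crashes the process when unhandled).
    callback(new Error('invalid padding'));
  }
  return plain.subarray(0, plain.length - pad);
}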

XPath injection in key retrieval. An encryption library takes a URI attribute from an XML element and concatenates it directly into an XPath expression:

//*[@Id='ATTACKER_CONTROLLED_VALUE']/*[...]

No escaping. Inject a single quote and a union operator, and you select arbitrary elements as the encryption key material:

<RetrievalMethod URI="#'] | //*[local-name(.)='CipherValue'][1] | //*[@Id='x" />

Silent algorithm downgrade. A switch statement matching OAEP digest algorithm URIs uses the wrong XML namespace in its case labels. The standard URIs sent by every major identity provider don’t match any case, so they all fall through to the default: SHA-1. Every deployment unknowingly uses SHA-1 for OAEP instead of the SHA-256/512 they configured.
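
The code pattern, roughly (the URIs are placeholders; the point is that the case labels live under a namespace no real IdP ever sends):

// Sketch of the downgrade: standard OAEP digest URIs never match a case,
// so everything falls through to the SHA-1 default.
function oaepDigest(algorithmUri) {
  switch (algorithmUri) {
    case 'http://wrong-namespace.example#sha256':   // never sent by any IdP
      return 'sha256';
    case 'http://wrong-namespace.example#sha512':
      return 'sha512';
    default:
      return 'sha1';                                // what every deployment actually gets
  }
}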

Unsigned message acceptance. A signature verification function for redirect-binding messages returns true when the Signature query parameter is absent entirely — not “invalid signature,” but “no signature = valid.” Unauthenticated attacker sends a logout request with no signature, and the library accepts it:

GET /slo?SAMLRequest=<deflated LogoutRequest>&RelayState=...
# No Signature, no SigAlg query parameters
# hasValidSignatureForRedirect() returns true
# User's session terminated

Replay bypass via unsigned attribute stripping. Replay protection checks an attribute on the outer (unsigned) message envelope. The signed inner assertion doesn’t include this attribute. Strip it from the envelope — the signature still validates, but the replay check sees “no attribute present” and skips validation entirely, treating the message as unsolicited. Same signed assertion accepted multiple times.
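
In SAML terms the shape is roughly this (attribute name illustrative; the point is that the replay marker lives only on the unsigned envelope):

<!-- Replay marker sits on the outer, unsigned envelope -->
<samlp:Response InResponseTo="req-123">
  <saml:Assertion>                      <!-- signed, but carries no replay marker -->
    <ds:Signature>...</ds:Signature>
    ...
  </saml:Assertion>
</samlp:Response>

<!-- Strip InResponseTo from the envelope: the signature still validates, the -->
<!-- replay check sees no marker, treats the message as unsolicited, and skips. -->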

Text node truncation via XML comment injection. An XML library extracts identity values using .data on the first child text node. The attacker controls the IdP and signs an assertion containing a comment-injected NameID. The signature is valid because the attacker signed it themselves. But the SP’s identity extraction only reads the first text node before the comment boundary:

<!-- Attacker (evil.com IdP) signs assertion with: -->
<NameID>admin@victim.com<!---->.evil.com</NameID>

<!-- Signature verifies: attacker signed it, digest matches ✓ -->
<!-- DOM .data extraction: "admin@victim.com" (first text node only) -->
<!-- SP authenticates attacker as admin@victim.com -->

The dependency impact chain analysis was useful here. The pipeline traced how a bug in an encryption library flows through the signature verification layer into the SSO login flow, confirming the crash is reachable from the unauthenticated attack surface.

None of these are headline SAML issues like a full signature bypass, but they are still findings on hardened targets. The caveat is that bugs on hardened targets tend to be situational.

Other Findings

Across other SAML libraries with the same pipeline:

  • CBC padding panic — unauthenticated DoS, full PoC
  • Deflate bomb — 464KB compressed → 350MB expanded, 2 requests kill a 1GB container
  • InResponseTo replay — nonce deletion commented out in the source, responses replayable indefinitely
  • Metadata signature bypass — unsigned metadata accepted as trusted, enabling attacker IdP registration
  • KeyInfo certificate injection — untrusted certificates from the SAML response used for signature verification. Full authentication bypass.
  • Expired certificate acceptance — certificate validation always ignores expiry. Revoked/expired certs still trusted.
  • Double-encrypted assertion signature bypass — signature verification skipped on the second decryption loop. Attacker with the SP’s public key can inject a forged assertion.

Clean, reportable bugs with clear impact.

The Contrast

Same AI. Same pipeline. Same researcher.

One target: 5% hit rate in about 20-30 minutes (finding, validation, impact analysis). The other: 14 confirmed bugs in 20 minutes. The ecosystem run: 9 CVE-worthy findings across 7 libraries.

The AI didn’t get smarter between runs. The difference was prior security attention. On the hardened library, every obvious bug was already found by humans. What remained needed deep spec knowledge and real deployment context: understanding SAML’s canonicalization requirements, how XML Signature’s Reference URI resolution actually works, what “working as intended” means when the spec is ambiguous. The AI can’t do that.

On the fresh targets, the bugs weren’t easy. Auth bypasses aren’t trivial. But nobody had looked closely yet. The AI could apply known vulnerability patterns to code that hadn’t been tested against them. That’s where it works best.


What I Learned

Target selection > model sophistication. Fine-tuned models found nothing reportable. The standard SECRA pipeline found 14 bugs in 20 minutes. The difference was where I pointed it, not how smart the model was.

AI inflates findings. Always. Every approach had this problem. The model can’t assess impact — it finds weird behavior and calls it a vulnerability. 17 out of 18 findings on a hardened target were noise, all presented with high confidence. The researcher’s judgment is the filter.

Deep reasoning > more payloads. Extended thinking found bugs that baseline analysis missed — tracing code paths, reasoning about edge-case state. Spraying more tests wouldn’t have found those.

Structured research > hypothesis generation. Feeding the model proven vulnerability patterns from real CVEs produced more reportable bugs than letting it theorize from code. Investigate, don’t speculate.

Validation is still the gap. The model will take shortcuts — disabling security features in test setups, weakening configs, then “finding” the bug it introduced. You have to verify every finding against the real library with real configuration.

Fresh targets win. Same AI, same pipeline — 5% hit rate on a hardened library, 14 bugs in 20 minutes on a fresh one. If you’re spending compute, point it at code that hasn’t had focused security attention.

AI is an enabler, not a replacement. It makes you faster. It doesn’t replace the researcher. Maybe that changes. Right now, it’s not there.


SECRA

SECRA is the tool I built and used across all 4 approaches. It’s a context engine — it gathers domain-level security research and packages it so a frontier model can work through it systematically.

For each target domain, it builds a research package: CVE history with root causes and variant angles, attack techniques extracted from blogs/writeups/commits, code-level vulnerability patterns, spec analysis (MUST/SHOULD requirements, discretion zones, undefined behaviors), and novel attack angles that don’t come from existing CVEs. The model gets this alongside the target codebase and works through it one pattern at a time — read the code, write a test, run it, report.

Each approach in this post used SECRA differently. Approach 1 used it to map RFC gaps and generate payload variations. Approach 2 used it to build training data and feed the fine-tuned models. Approach 3 used it for CVE history and code indexing to generate hypotheses. Approach 4 used it most directly — the full research package fed to a frontier model.

Sonnet handles bulk work like CVE summarization and technique extraction. Findings are cached by ecosystem/name/version, so dependencies analyzed once are reusable across targets. Right now I have 4 versions of it; I’ll look more into the hybrid version when I have some budget.


Numbers

Approach | Findings | Reportable | Lesson
RFC Spray | 55 across 23 servers. 8 exploitable (combo lab PoC), 11 path traversal, 4 logic bugs, funky chunks across 16 servers + 6 proxies | Smuggling, ACL bypass, cache poisoning (reported) | AI maps surface, researcher builds chains
Fine-Tuned Models | 53 bugs + 147 suspicious across 12 parsers (~2,800 payloads) | 0 reportable (all need preconditions that don’t hold) | Wrong target, right architecture
Hypothesis Engine | 38 hypotheses, 12 confirmed (31%) | 3 confirmed, low practical impact | Good hypotheses, bad validation
SECRA | 100 CVEs, 25 patterns, 149 techniques fed to frontier model | 14+ across multiple targets, 9 CVE-worthy | Domain research + right target = results

One week. Some of it wasted, some of it extremely productive. The difference was never the AI — it was where I pointed it. AI works well with concrete goals; if you just tell it to “go look for bugs,” the results might not be that good.

The bugs listed in this post are representative examples — the full findings across all approaches are more extensive. Some findings are pending disclosure. Details will be updated once CVEs are assigned. Most of the findings are situational; that’s one of the reasons I’m documenting all of this.