Hadrian OpenHack — AI Pentest on a Vulnerable Flask App | Stack of Truths

Hadrian Just Open-Sourced Their Pentest AI — We Ran It on a Vulnerable App and Here’s What It Found

May 21, 2026 — 7 min read — Pedro Jose

On May 18, Hadrian Security dropped OpenHack on GitHub — a lightweight, file‑based workspace for source‑guided whitebox security review. MIT license. 12 expert agents. Checkpointed, scenario‑first workflow. It works with Claude Code, Codex, Cursor, or any model harness that can follow a prompt file.

I cloned it. I pointed it at a deliberately vulnerable Flask app running on this VPS. Then I let it run.

⚡ THE BOTTOM LINE

OpenHack is not a magic “push button, get pentest” tool. It’s an automation framework for human‑guided, AI‑assisted source review. In the right hands, it finds real vulnerabilities with full evidence chains. In the wrong hands, it’s a noise machine.

What Is OpenHack?

OpenHack is a collection of agents and tools that replicates how the Hadrian research team performs automated vulnerability research. The core idea is checkpointed, scenario‑first review:

  • Reconnaissance agents discover attack surfaces (routes, sinks, auth boundaries, upload handlers).
  • A router agent turns those surfaces into scenarios — one routing unit + one expert + one proof question.
  • Twelve expert agents (OWASP‑aligned) prove or reject each scenario.
  • An independent triage agent decides which verified candidates become final findings.
  • Humans approve every phase transition.
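The "one routing unit + one expert + one proof question" idea can be pictured as a small data structure. The sketch below is a hypothetical illustration, not OpenHack's actual schema: the `Scenario` fields, the `build_scenarios` helper, and the proof questions are all assumptions made for this post. Only the endpoints, expert names, and the `S001` ID style come from the run described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One routing unit + one expert + one proof question (hypothetical schema)."""
    scenario_id: str
    routing_unit: str      # e.g. a route or sink surfaced by recon
    expert: str            # one of the twelve expert agent names
    proof_question: str    # what the expert must prove or reject


def build_scenarios(candidates):
    """Turn (routing_unit, expert, question) tuples into numbered scenarios."""
    return [
        Scenario(f"S{index:03d}", unit, expert, question)
        for index, (unit, expert, question) in enumerate(candidates, start=1)
    ]


# Illustrative router input; the proof questions are made up for this example.
candidates = [
    ("/profile/update", "injection",
     "Does user-controlled input reach template rendering?"),
    ("/admin/import", "software-data-integrity-failures",
     "Is the uploaded file deserialized with an unsafe loader?"),
]

scenarios = build_scenarios(candidates)
for scenario in scenarios:
    print(scenario.scenario_id, scenario.expert, scenario.routing_unit)
```

Keeping each scenario this narrow is what makes a result easy to accept or reject: one expert answers one question about one surface.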

The workflow is state‑machine driven and persists everything to files: cloned source, recon items, scenario prompts, results, finding candidates, triage decisions, and final findings. This means you can pause, resume, or hand off a run without losing context.
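To make the pause-and-resume idea concrete, here is a minimal sketch of a file-backed checkpoint with a human approval gate. It is an assumption-laden illustration, not OpenHack's code: the phase names, the `state.json` file, and the `load_state` and `advance` helpers are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical phase names; OpenHack's real state names may differ.
PHASES = ["recon", "scenarios", "expert-review", "triage", "final"]


def load_state(path):
    """Read the checkpoint file, or start a fresh run at the first phase."""
    if path.exists():
        return json.loads(path.read_text())
    return {"phase": PHASES[0], "approved": []}


def advance(path, approved_by_human):
    """Move to the next phase only when a human has approved the transition."""
    state = load_state(path)
    if not approved_by_human:
        return state  # stay put; nothing is written
    position = PHASES.index(state["phase"])
    if position + 1 < len(PHASES):
        state["approved"].append(state["phase"])
        state["phase"] = PHASES[position + 1]
        path.write_text(json.dumps(state, indent=2))
    return state


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "state.json"
    advance(checkpoint, approved_by_human=True)    # recon -> scenarios
    advance(checkpoint, approved_by_human=False)   # not approved, no change
    resumed = load_state(checkpoint)               # simulates a later session
    print(resumed["phase"])
```

Because the state lives in a file rather than in a chat context, a second person or a different model harness can pick up the run from the last approved phase.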

📌 THE 12 EXPERT AGENTS

broken-access-control · security-misconfiguration · software-supply-chain-failures · cryptographic-failures · injection · memory-buffer-boundary-errors · insecure-design · authentication-failures · software-data-integrity-failures · sensitive-information-exposure · path-traversal-unrestricted-upload · unrestricted-resource-consumption

Our Test Environment — A Vulnerable Flask App

I spun up a small Flask application on this VPS. It was intentionally vulnerable, but not trivial: SQL injection via unsanitized user input, SSTI through a misconfigured Jinja2 template, and a YAML deserialization bug in an admin “import config” endpoint.

Then I ran OpenHack using the CLI workflow:

$ openhack init-run vulnerable-app https://github.com/example/vulnerable-flask.git --run-id test-001
$ openhack run-recon vulnerable-app test-001 --all-agents --semgrep
$ openhack create-scenarios vulnerable-app test-001
$ openhack record-scenario-backlog vulnerable-app test-001 router-result.json
$ openhack render-scenario-prompt vulnerable-app test-001 S001
$ openhack record-scenario-result vulnerable-app test-001 S001 result.json
# (repeat for each scenario)
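The "repeat for each scenario" step lends itself to a small helper. The sketch below only builds the argument lists for the two per-scenario subcommands shown above and does not execute anything. The scenario ID `S002` and the `result-<id>.json` naming are assumptions for illustration; the run above used a single `result.json`.

```python
def scenario_commands(target, run_id, scenario_ids):
    """Build the render/record command pairs for each scenario ID."""
    commands = []
    for scenario_id in scenario_ids:
        commands.append(
            ["openhack", "render-scenario-prompt", target, run_id, scenario_id]
        )
        # Assumed naming: one result file per scenario.
        commands.append(
            ["openhack", "record-scenario-result", target, run_id, scenario_id,
             f"result-{scenario_id}.json"]
        )
    return commands


commands = scenario_commands("vulnerable-app", "test-001", ["S001", "S002"])
for argv in commands:
    print(" ".join(argv))
```

In practice the expert run and the human review sit between the render and record steps, so these would be run one at a time rather than as a batch.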

What OpenHack Found

🔴 CRITICAL (CVSS 9.8) — SSTI in profile update

The injection expert flagged a Jinja2 template that rendered user‑controlled `{{ … }}` without sanitization. The evidence included a full payload that executed `os.popen('cat /etc/passwd').read()` and returned the result.

# Finding excerpt
Endpoint: /profile/update
Parameter: "bio"
Payload: {{ self.__init__.__globals__.__builtins__.__import__('os').popen('cat /etc/passwd').read() }}
Result: root:x:0:0:root:/root:/bin/bash …

🔴 CRITICAL (CVSS 9.1) — YAML deserialization in /admin/import

The software‑data‑integrity‑failures expert discovered a call to `yaml.load()` (not `safe_load`) on an uploaded config file. The proof of concept demonstrated RCE using a crafted YAML payload that invoked a Python subprocess.

payload: !!python/object/apply:subprocess.check_output ['cat /etc/passwd']

🟠 HIGH (CVSS 8.1) — SQL injection in search endpoint

The injection agent found an unparameterised `WHERE` clause built by string concatenation. The report included the exact line number, the vulnerable parameter (`q`), and a proof‑of‑concept payload that pulled usernames and passwords from the `users` table.

GET /search?q=' UNION SELECT username, password FROM users--
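For readers who want to see why string concatenation is the problem, here is a minimal sketch using Python's built-in `sqlite3` and an in-memory database. The `products` table, its columns, and both function names are assumptions for illustration; the test app's real schema and query are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, description TEXT)")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO products VALUES ('lamp', 'desk lamp')")
conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")


def search_vulnerable(q):
    # String concatenation: the input becomes part of the SQL statement.
    sql = "SELECT name, description FROM products WHERE name = '" + q + "'"
    return conn.execute(sql).fetchall()


def search_safe(q):
    # Placeholder binding: the input is only ever treated as a value.
    sql = "SELECT name, description FROM products WHERE name = ?"
    return conn.execute(sql, (q,)).fetchall()


payload = "' UNION SELECT username, password FROM users--"
leaked = search_vulnerable(payload)
blocked = search_safe(payload)
print(leaked)
print(blocked)
```

With the concatenated query the payload closes the string literal and appends its own `UNION`; with the bound parameter the same input is just an unusual product name that matches nothing.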
⚠️ NOT FOUND (still important)

OpenHack missed a logic flaw in the password reset flow (unlimited OTP attempts). That’s the limitation of source‑guided AI review: it doesn’t exercise the running application, so business logic flaws and race conditions can slip through. A human pentester would have caught it.
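The missing control is easy to state in code. The sketch below is a hypothetical attempt limiter, not the test app's code: the `OtpVerifier` class, the in-memory counters, and the threshold of five attempts are all assumptions.

```python
import hmac

MAX_ATTEMPTS = 5  # assumed threshold; pick one that fits your threat model


class OtpVerifier:
    """Tracks failed OTP guesses per account and locks out after a limit."""

    def __init__(self, expected_codes):
        self.expected_codes = dict(expected_codes)  # account -> current OTP
        self.failures = {}                          # account -> failed guesses

    def verify(self, account, code):
        if self.failures.get(account, 0) >= MAX_ATTEMPTS:
            return "locked"
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(self.expected_codes.get(account, ""), code):
            self.failures.pop(account, None)
            return "ok"
        self.failures[account] = self.failures.get(account, 0) + 1
        return "invalid"


verifier = OtpVerifier({"alice": "493021"})
results = [verifier.verify("alice", f"{guess:06d}") for guess in range(6)]
final = verifier.verify("alice", "493021")
print(results, final)
```

A real implementation would also expire the lockout and invalidate the OTP. The point is that the absence of this counter is a design gap, which is hard to spot when a review is organized around dangerous sinks.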

The Good, The Bad, and The Ugly

✅ The Good

  • Evidence‑driven findings — every vulnerability came with source lines, request examples, and a clear exploit path.
  • Checkpointed workflow — you can stop after recon, review the routing units, then only run scenarios that make sense.
  • Human in the loop — you approve every phase transition, so you never get a 200‑page report full of noise.
  • Semgrep integration — optional static analysis hints make the recon phase smarter.

❌ The Bad

  • Not a turnkey solution — OpenHack is a framework, not a product. You need to understand the workflow, review each prompt, and manually approve every step.
  • Still requires expert oversight — the triage agent can make bad calls. I saw one false positive flagged as “critical” because a developer comment looked like an injection.
  • No runtime testing — OpenHack analyses source code. It won’t find misconfigured S3 buckets, exposed debug endpoints, or business logic flaws.

🔐 THE REAL TAKEAWAY

OpenHack is a force multiplier for experienced pentesters. It automates the boring parts of source review — finding sinks, mapping auth boundaries, generating evidence chains — so a human can focus on business logic and complex exploit chains.

Give OpenHack to a junior analyst, and you’ll get noise. Give it to a seasoned pentester, and you’ll get findings in hours instead of weeks.

Should You Use OpenHack?

  • YES if you have a security team that can interpret the output and approve scenarios manually.
  • YES if you want to automate the tedious parts of source‑assisted whitebox review.
  • NO if you expect to run it once and get a production‑ready pentest report.
  • NO if you don’t have the expertise to weed out false positives and catch missed vulnerabilities.

The MIT license means you can fork it, extend it, and bake it into your own CI/CD pipeline — but the model harness still needs your brain.

🧠 THE HUMAN FACTOR

OpenHack found two critical vulnerabilities and one high‑severity vulnerability in under an hour. A human would have taken two days. But the human also would have found the password reset logic flaw that OpenHack completely missed.

AI + human = speed and depth. Either alone = blind spots.

What We Did Next — A Human Pentest

After OpenHack finished, I ran a manual pentest on the same Flask app. I found:

  • 🔴 The same SQLi, SSTI, and YAML deserialization (OpenHack found all three)
  • 🟠 A broken object level authorization (BOLA) in the /api/user/<id> endpoint
  • 🟡 The password reset OTP bypass (infinite tries, no lockout)
  • 🔵 Hardcoded credentials in a frontend JavaScript file
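The BOLA finding comes down to a missing ownership check. As a hedged sketch of what that check could look like (the `User` class, the admin role, and `can_read_user` are hypothetical and not taken from the test app):

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: int
    is_admin: bool = False


def can_read_user(requester, target_user_id):
    """Allow access only to your own record, or to any record if admin."""
    return requester.is_admin or requester.user_id == target_user_id


alice = User(user_id=1)
admin = User(user_id=99, is_admin=True)

print(can_read_user(alice, 1))
print(can_read_user(alice, 2))
print(can_read_user(admin, 2))
```

A handler for `/api/user/<id>` would call a check like this before loading the record and return a 403 or 404 when it fails.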

OpenHack accelerated the initial discovery phase by 70%. The human then spent time on the logic flaws that the AI couldn’t see.

📌 THE COST-BENEFIT

Without OpenHack: 2 days of manual source + runtime testing → €2,500 – €5,000.
With OpenHack + ½ day human validation → half the time, deeper coverage.

The tool is free. The expertise is not. That’s where the value lives.

The Bottom Line

Hadrian just gave the security community a powerful open‑source scalpel. But a scalpel doesn’t replace a surgeon.

OpenHack is free. The expertise to use it right is not.

If you have an experienced pentester on staff, run OpenHack tomorrow. If you don’t, running it alone will just give you a false sense of security.

Let’s audit your code together — findings in hours, not weeks.

🦞🔐

AI-assisted pentesting, guided by human expertise.

Full AI Agent Pentest: €3,000. Code Security Review: €1,500. AI Red Team: €5,000.

📩 DM @StackOfTruths on X

Free 15-min consultation. No hard sell. Just honest answers about your code security.

