The Penetration Testing Kill Chain: A Formal Methodology

"In practice, a penetration test can only identify a small representative sample of all possible security risks in a system." — Gary McGraw, Software Security: Building Security In, Addison-Wesley, 2006

Abstract

Penetration testing is the most operationally consequential form of security assessment, yet it is practiced with more variation than almost any other engineering discipline. The practitioner who learns by hacking CTFs and the consultant who follows PTES to the letter may both call their work a "penetration test," and neither is wrong, but their outputs differ in completeness, reproducibility, and strategic value. This article develops a formal kill chain methodology for penetration testing, grounded in the mathematical structure of attack graphs, aligned with the 14 enterprise tactics of MITRE ATT&CK, and annotated against the real tool ecosystem of 2024-2025. The goal is not to produce another methodology checklist but to derive why each phase exists, what information it produces, and how that information constrains the phases that follow. The article addresses the LLM-automation revolution in pentesting, the point-in-time versus continuous testing debate, and the \$1.87 billion market reshaping how organisations buy and consume security assessments.

1. Introduction

"Most failed penetration tests fail for one simple reason: they focus on vulnerabilities instead of attack paths. An attacker doesn't care about weaknesses in isolation. They care about how weaknesses combine. They care about leverage." — CyberGen Security, "Why Penetration Testing Fails and What Good Looks Like in 2026," cybergensecurity.com, 2026

This framing captures the central problem with penetration testing as it is commonly practiced. A test that produces a list of CVE identifiers sorted by CVSS score has documented vulnerabilities; it has not modelled the adversary. The adversary does not exploit vulnerabilities: the adversary traverses an attack path, a sequence of actions that begins at an entry point accessible from the internet or from a compromised endpoint and ends at the objective (data exfiltration, ransomware deployment, credential theft at scale). Each step in that path may exploit a vulnerability that would score as "Medium" in isolation but is severe in the context of the path because it provides the privilege or network position required for the next step. A methodology that does not reason about paths misses this structure by construction.

The kill chain concept provides the organising framework for path-oriented analysis. Eric Hutchins, Michael Cloppert, and Rohan Amin at Lockheed Martin formalised the idea in their 2011 paper "Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains," adapting the military "find, fix, track, target, engage, assess" sequence to the cyber domain [1]. Their model imposed a linear sequence of seven phases on intrusion campaigns, giving defenders a way to reason about the phase at which they could break the chain. Paul Pols extended this significantly in 2017 with the Unified Kill Chain, an 18-phase model that integrates the Lockheed Martin framework with MITRE ATT&CK and explicitly accommodates the non-linear, iterative nature of real-world intrusions, where phases are revisited, skipped, or executed in parallel [2]. For penetration testers, these models are not just descriptive of attacker behaviour; they are prescriptive about what a professional test must demonstrate.

This article builds the penetration testing kill chain from first principles. Section 2 traces the genealogy of kill chain thinking from military doctrine through the Lockheed Martin model to ATT&CK. Section 3 develops the formal graph-theoretic model of attack paths, deriving the mathematical structure that underlies path finding, path probability, and optimal defensive hardening. Section 4 maps the formal model onto the practical phases of a penetration test, from pre-engagement scoping through final reporting. Section 5 addresses the advanced techniques that distinguish competent red teams from commodity scanners: Active Directory attack path construction, command-and-control framework selection, and the art of chaining individually low-severity findings into complete compromises. Section 6 surveys the tool landscape, including a fully functional Python attack graph implementation and an analysis of the LLM-assisted pentesting wave. Section 7 addresses the industry reckoning: the collapse of the annual point-in-time model, the commercial consolidation around continuous automated testing, and what both trends mean for security product leaders.

2. From Military Kill Chains to Cyber ATT&CK: A Genealogy

The military kill chain concept predates cybersecurity by decades, tracing to the USAF doctrinal targeting sequence: Find the adversary, Fix their location, Track their movement, Target an engagement solution, Engage with the appropriate weapon, and Assess the result (F2T2EA). The chain structure enforces a critical strategic insight: disrupting any link prevents the final outcome. A bomber that cannot fix a target's location cannot prosecute a strike regardless of its weapons payload. Hutchins, Cloppert, and Amin translated this insight into the cyber domain by observing that a sophisticated intrusion requires a similar sequential structure, and that defenders who could detect or disrupt any phase would prevent the final objective regardless of attacker sophistication in the undisrupted phases [1]. This reframing was transformative: it shifted defensive thinking from "patch all the vulnerabilities" to "break the chain at the cheapest point," which is frequently an early-phase detection problem rather than a technical remediation problem.

The Lockheed Martin Cyber Kill Chain defines seven phases: Reconnaissance (passive and active intelligence gathering about the target), Weaponization (creating a deliverable exploit payload), Delivery (transmitting the payload to the target environment), Exploitation (triggering code execution on the target), Installation (establishing persistence), Command and Control (establishing an interactive channel), and Actions on Objectives (the final goal: data exfiltration, destructive payload execution, or durable access establishment). The model's primary limitation is its assumption of linearity and its focus on initial compromise. Real intrusions revisit earlier phases: a threat actor who achieves initial access may conduct additional reconnaissance against the internal network that was invisible before access was gained, pivot through multiple exploitation cycles as they move laterally, and maintain multiple independent C2 channels that require separate installation cycles. The seven-phase model is accurate about the external attack lifecycle but incomplete about internal operations.

MITRE ATT&CK, first published in 2013 and substantially expanded through 2020-2025, addresses this incompleteness by abandoning the sequential phase constraint and instead cataloguing adversary tactics and techniques independently [3]. The Enterprise matrix currently defines 14 tactics: Reconnaissance (TA0043), Resource Development (TA0042), Initial Access (TA0001), Execution (TA0002), Persistence (TA0003), Privilege Escalation (TA0004), Defense Evasion (TA0005), Credential Access (TA0006), Discovery (TA0007), Lateral Movement (TA0008), Collection (TA0009), Command and Control (TA0011), Exfiltration (TA0010), and Impact (TA0040). Each tactic contains multiple techniques, and each technique may be executed independently of any phase ordering. This de-serialisation of the kill chain better reflects how advanced persistent threats operate: an attacker may collect credentials (TA0006) before establishing persistence (TA0003), or may execute lateral movement (TA0008) multiple times before reaching the collection stage, or may maintain C2 (TA0011) throughout the entire engagement from initial access to final objective.

Paul Pols' Unified Kill Chain synthesises the Lockheed Martin and ATT&CK perspectives into an 18-phase model organised around three macro-cycles: IN (gaining a foothold in the target environment), THROUGH (propagating through the environment to reach high-value targets), and OUT (executing the final objective) [2]. The IN cycle covers reconnaissance, weaponisation, delivery, social engineering, exploitation, persistence, and defence evasion. The THROUGH cycle covers discovery, privilege escalation, execution, credential access, and lateral movement. The OUT cycle covers collection, exfiltration, and impact. This tripartite structure is directly useful for penetration testers because it maps to the three natural phases of a professional engagement: gaining initial access, achieving domain compromise, and demonstrating business impact. A test that only completes the IN cycle has shown the client that their perimeter is penetrable but has not demonstrated what an attacker would actually do after gaining that foothold.

For penetration testing purposes, the formal kill chain also incorporates a phase that neither the Lockheed Martin model nor ATT&CK explicitly addresses: operator state management. The penetration tester, unlike an attacker, is time-bounded, scope-bounded, and report-obligated. Each action taken must be logged with sufficient fidelity to reproduce the finding, attributed to a specific test system rather than a production asset, and evaluated for whether it may trigger an incident response that exhausts the client's security team before the test is complete. This operational hygiene constraint is unique to professional penetration testing and is the source of most of the structural differences between red team tradecraft and actual adversarial intrusion.

3. Attack Path Algebra: The Formal Model

3.1 Attack Graphs as Directed Weighted Graphs

The formal model underlying kill chain analysis is the attack graph. An attack graph is a directed weighted graph $G = (V, E, w, c)$ where $V$ is the set of system states (each representing a specific privilege level and network position combination), $E \subseteq V \times V$ is the set of attack transitions (each representing a specific technique or exploit), $w: E \to [0,1]$ assigns each transition a success probability, and $c: E \to \mathbb{R}_{\geq 0}$ assigns each transition an attacker effort cost. Two vertices have special roles: the initial state $s_0 \in V$ represents the attacker's starting position (typically an unprivileged internet-connected point), and the target state $s_T \in V$ represents the attacker's objective (typically Domain Admin, database administrator, or a specific data asset).

The reachability set from initial state $s_0$ is the set of all states accessible via directed paths from $s_0$:

$$R(s_0) = \{v \in V \mid \exists \text{ directed path from } s_0 \text{ to } v \text{ in } G\} \tag{1}$$

If $s_T \notin R(s_0)$, the target is not accessible from the attacker's starting position under the current system configuration and the penetration test cannot complete its objective without modifying assumptions about scope. If $s_T \in R(s_0)$, the set of attack paths $\Pi(s_0, s_T)$ is the collection of all directed walks from $s_0$ to $s_T$. In practice, attack graphs for enterprise networks contain cycles (an attacker who gains credentials can use them to access systems that provide further credentials), and path enumeration must account for this by constraining walk length or requiring simple paths.
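The reachability computation in Equation (1) is a plain graph traversal. The sketch below, using hypothetical state names chosen for illustration, computes $R(s_0)$ with breadth-first search over an adjacency-list representation of the attack graph:

```python
from collections import deque

# Hypothetical attack graph: each state maps to the states reachable via a
# single attack transition. State names are illustrative, not drawn from
# any real engagement.
GRAPH = {
    "internet": ["web_server"],
    "web_server": ["app_service_acct"],
    "app_service_acct": ["workstation_admin"],
    "workstation_admin": ["domain_admin"],
    "isolated_host": [],  # no inbound path from the internet
}

def reachability(graph, s0):
    """Compute R(s0), the set of states reachable from s0 (Equation 1)."""
    seen = {s0}
    queue = deque([s0])
    while queue:
        v = queue.popleft()
        for u in graph.get(v, []):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

R = reachability(GRAPH, "internet")
# "domain_admin" in R -> the objective is reachable; "isolated_host" is not.
```

If the target state is absent from the returned set, the engagement's objective is unreachable under the modelled configuration, exactly as the text describes.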

3.2 Path Probability and the Maximum-Likelihood Attack Path

For an attack path $\pi = (v_0, v_1, \ldots, v_k)$ where each consecutive pair $(v_i, v_{i+1}) \in E$, the probability of successful traversal, assuming that each transition succeeds or fails independently, is the product of individual edge success probabilities:

$$P(\pi) = \prod_{i=0}^{k-1} w(v_i, v_{i+1}) \tag{2}$$

The maximum-probability attack path from $s_0$ to $s_T$ solves the optimisation:

$$\pi^* = \arg\max_{\pi \in \Pi(s_0, s_T)} P(\pi) = \arg\max_{\pi \in \Pi(s_0, s_T)} \sum_{i=0}^{k-1} \log w(v_i, v_{i+1}) \tag{3}$$

Converting the product maximisation in Equation (2) to the sum maximisation in Equation (3) via the logarithm is the key computational trick: it transforms the problem into a standard shortest-path problem on a graph with edge weights $-\log w(e)$, solvable in $O((|V| + |E|) \log |V|)$ time by Dijkstra's algorithm. The resulting $\pi^*$ is the attack path most likely to succeed from the attacker's perspective, which is also the path that a penetration tester should attempt first to demonstrate the highest-probability compromise route.
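The log-transform trick in Equation (3) can be sketched directly: run Dijkstra's algorithm over edge weights $-\log w(e)$ and exponentiate the resulting distance to recover $P(\pi^*)$. Edge probabilities below are illustrative assumptions, not measured exploit reliabilities:

```python
import heapq
import math

# (src, dst) -> success probability w(e). Values are illustrative.
EDGES = {
    ("s0", "web"): 0.9,
    ("s0", "vpn"): 0.3,
    ("web", "svc"): 0.8,
    ("vpn", "svc"): 0.9,
    ("svc", "da"): 0.7,
}

def max_probability_path(edges, s0, target):
    """Dijkstra on -log(w) edge weights: returns (pi*, P(pi*))."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, -math.log(w)))
    dist = {s0: 0.0}
    prev = {}
    pq = [(0.0, s0)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if target not in dist:
        return None, 0.0
    path, node = [target], target
    while node != s0:
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[target])

path, p = max_probability_path(EDGES, "s0", "da")
```

With these weights the direct path through the web server (0.9 x 0.8 x 0.7) beats the VPN route despite the VPN's reliable second hop, illustrating why the whole path, not any single edge, determines priority.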

3.3 Exploit Chain Probability and the Defender's Optimisation

For a sequential exploit chain with $n$ steps where each step $i$ succeeds independently with probability $p_i$, the overall chain success probability is:

$$P_{\text{chain}}(n) = \prod_{i=1}^{n} p_i \tag{4}$$

Equation (4) has a key implication for both attackers and defenders. For attackers, it shows that long chains are exponentially unlikely to succeed unless each step has high individual probability; this is why capable threat actors invest in reliable, well-tested tooling for each phase rather than stringing together fragile exploits. For defenders, the equation shows that hardening any single step to $p_j = 0$ collapses the entire chain to zero probability, regardless of how reliable the other steps are. The optimal defensive strategy given a fixed control budget is to identify the step in $\pi^*$ with the highest cost-effectiveness ratio $\Delta p_j / \text{cost}(j)$ and harden that step first, iterating until the budget is exhausted or $P(\pi^*)$ falls below an acceptable threshold.
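The defender's greedy selection described above can be made concrete. The sketch below assumes a hypothetical four-step chain with invented probabilities and control costs; it computes $P_{\text{chain}}$ per Equation (4) and picks the step with the best $p_j / \text{cost}(j)$ ratio (treating full elimination of a step, so $\Delta p_j = p_j$):

```python
# Illustrative chain: per-step success probability p_i and the cost of a
# control that drives that step's probability to (near) zero. All values
# are assumptions for demonstration, not empirical estimates.
steps = [
    {"name": "phish",    "p": 0.60, "cost": 5.0},   # awareness training
    {"name": "exploit",  "p": 0.90, "cost": 20.0},  # patch campaign
    {"name": "escalate", "p": 0.80, "cost": 8.0},   # remove local admin
    {"name": "exfil",    "p": 0.95, "cost": 12.0},  # egress filtering
]

def chain_probability(steps):
    """P_chain = product of the p_i (Equation 4)."""
    p = 1.0
    for s in steps:
        p *= s["p"]
    return p

def best_hardening_target(steps):
    """Greedy choice: the step whose elimination buys the most probability
    reduction per unit of control cost (delta_p / cost)."""
    return max(steps, key=lambda s: s["p"] / s["cost"])
```

Under these assumed numbers the cheap phishing control wins even though exploitation is the most reliable step for the attacker, which is the point of the cost-effectiveness ratio: defenders optimise probability reduction per unit spend, not raw step probability.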

3.4 Reconnaissance Information Gain

The intelligence gathered in the reconnaissance phase reduces uncertainty about the target network structure, which directly increases the probability of selecting effective attack transitions. Formally, let $X$ be a random variable representing the unknown network configuration and let $H(X) = -\sum_x p(x) \log_2 p(x)$ be the prior entropy (uncertainty) over network states before reconnaissance begins. After conducting reconnaissance that produces observations $O$, the posterior entropy is $H(X \mid O)$, and the information gain is:

$$\Delta H = H(X) - H(X \mid O) \tag{5}$$

High information gain means the reconnaissance has substantially reduced uncertainty about target architecture, making subsequent exploitation steps more reliable and reducing the number of probes required before achieving access. This formalisation makes concrete why the reconnaissance phase is worth investing in: every unit of information gain translates directly into higher $w(e)$ values on transitions in the attack graph, which increases $P(\pi^*)$. Rushed reconnaissance that leaves significant uncertainty about network topology forces the penetration tester to attempt transitions with low probability weights, producing either failures that tip off blue team monitoring or missed attack paths entirely.
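Equation (5) is straightforward to compute for a discrete hypothesis set. The toy example below assumes four equally likely hypotheses about the perimeter stack; a reconnaissance observation that rules out two of them and favours one of the rest yields the information gain shown:

```python
import math

def entropy(dist):
    """Shannon entropy H(X) in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Prior: four equally likely hypotheses about the target's perimeter
# architecture (hypothetical example). H(X) = log2(4) = 2 bits.
prior = [0.25, 0.25, 0.25, 0.25]

# Posterior after an observation (e.g. a certificate transparency hit)
# eliminates two hypotheses and favours one of the remaining two.
posterior = [0.8, 0.2, 0.0, 0.0]

gain = entropy(prior) - entropy(posterior)   # Delta H, Equation (5)
# gain is roughly 1.28 bits of the original 2 bits of uncertainty.
```

The same arithmetic scales to any discretised model of the unknown configuration; the practical difficulty is estimating the distributions, not computing the gain.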

3.5 Detection Probability and the Point-in-Time Coverage Gap

The failure mode of annual penetration testing is captured precisely by a simple Poisson model. Let $\lambda$ be the mean rate at which new exploitable vulnerabilities are introduced into the target environment (newly provisioned assets, misconfigurations, library updates with CVEs), measured in vulnerabilities per day. Let $T$ be the testing interval in days and $p_r$ be the fraction of vulnerabilities remediated through change management independent of pentesting. The expected number of vulnerabilities that exist in the environment but were not present during the last test is:

$$E[\text{unassessed vulns at time } T] = \lambda \cdot T \cdot (1 - p_r) \tag{6}$$

The research literature documents that enterprise environments make an average of approximately 1,100 infrastructure changes per month affecting security posture [4]. A 365-day testing interval with $\lambda$ proportional to change rate means the expected unassessed attack surface grows linearly between tests. Continuous testing with a testing interval of $T_c = 1$ day reduces $E[\text{unassessed vulns}]$ by a factor of $T / T_c$, giving quantitative grounding to the empirical finding that point-in-time engagements cover 15-30% of the actual attack surface while continuous testing programmes reach 80-95% coverage [4].
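Equation (6) and the interval-ratio argument can be checked numerically. The rates below are illustrative assumptions (a fraction of the cited ~1,100 monthly changes introducing exploitable issues, 60% caught by ordinary change management), not figures from the source:

```python
def expected_unassessed(lam, T, p_r):
    """E[unassessed vulns at time T] = lambda * T * (1 - p_r), Equation (6)."""
    return lam * T * (1 - p_r)

# Assumed illustrative rates: ~1,100 changes/month, 1% of which introduce
# an exploitable issue, 60% remediated via normal change management.
lam = 1100 / 30 * 0.01          # ~0.37 exploitable vulns per day
annual = expected_unassessed(lam, 365, 0.6)   # annual test cycle
daily = expected_unassessed(lam, 1, 0.6)      # continuous testing, T_c = 1
# annual / daily == 365: the gap scales linearly with the testing interval.
```

Because the expectation is linear in $T$, the $T / T_c$ improvement factor holds for any $\lambda$ and $p_r$; only the absolute size of the gap depends on the assumed rates.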

4. The Penetration Test Kill Chain: Phase-by-Phase Anatomy

4.1 Pre-Engagement and Scoping

Every professional penetration test begins with the pre-engagement phase, which the Penetration Testing Execution Standard (PTES) identifies as the first of its seven phases [5]. The output of this phase is a written scope document that defines the systems in scope, the test objectives, the rules of engagement, the emergency stop conditions, and the legal authorisation structure. This document is not administrative overhead; it is a direct constraint on the attack graph. Systems excluded from scope are vertices removed from $V$ in the formal model of Section 3, and out-of-scope transitions (such as social engineering employees or testing third-party services) are edges removed from $E$. A scope that excludes the organisation's cloud infrastructure while testing on-premise systems may produce a formally complete test of the in-scope graph while leaving the most accessible entry points entirely unexamined.

The scoping conversation reveals more than scope: it reveals the client's threat model, their most critical assets, and the business risk they are trying to characterise. A client asking "can an external attacker reach our production database?" has a specific path in mind: $s_0 \in \text{internet} \to s_T = \text{production database administrator}$. A client asking "what can a disgruntled employee do?" has a different starting point: $s_0 = \text{low-privilege domain user} \to s_T = \text{critical business system}$. The penetration tester who treats both engagements identically (starting from the same external position with the same methodology) is not answering the client's actual question. Effective pre-engagement scoping requires the tester to map the client's business risk concern to a specific source state and target state in the attack graph, then design the test to explore paths between those states.

Rules of engagement for penetration tests typically include four categories of constraint: temporal (test window, blackout periods during critical business operations), network (source IP addresses pre-approved for testing to prevent blue team confusion), technique (specific techniques that are prohibited, such as destructive payloads or denial-of-service against production systems), and notification (whether the security operations centre is informed of the test or the test is conducted blind). The choice between "white box" (full architecture disclosure), "grey box" (partial disclosure), and "black box" (no disclosure) tests determines the initial state of the tester's knowledge about the attack graph: white box tests start with a fully specified graph and optimise path selection; black box tests start with an empty graph and must reconstruct it through reconnaissance; grey box tests start with a partial specification, typically including network ranges but not internal architecture details.
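The scope-as-graph-constraint idea from this section can be sketched as a pruning operation: out-of-scope systems are removed from $V$, and edges that either touch a removed vertex or use a prohibited technique are removed from $E$. The asset names and technique labels below are hypothetical:

```python
def apply_scope(vertices, edges, excluded_vertices, prohibited_techniques):
    """Restrict an attack graph to the engagement scope: drop out-of-scope
    vertices and any edge touching them or using a prohibited technique."""
    v_in = vertices - excluded_vertices
    e_in = {
        (u, v): tech
        for (u, v), tech in edges.items()
        if u in v_in and v in v_in and tech not in prohibited_techniques
    }
    return v_in, e_in

# Hypothetical engagement: cloud assets are out of scope, and social
# engineering (T1566) is a prohibited technique under the rules of engagement.
V = {"internet", "vpn", "mail", "cloud_bucket", "dc"}
E = {
    ("internet", "vpn"): "T1133",
    ("internet", "mail"): "T1566",
    ("internet", "cloud_bucket"): "T1190",
    ("vpn", "dc"): "T1021",
}
v_in, e_in = apply_scope(V, E, {"cloud_bucket"}, {"T1566"})
```

The pruned graph makes the section's warning visible: if the excluded cloud bucket carried the most accessible entry edge, the in-scope graph can be formally complete while the realistic attack surface goes untested.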

4.2 Reconnaissance: Constructing the Attack Graph

The reconnaissance phase populates the attack graph. Passive reconnaissance (also called OSINT, Open Source Intelligence gathering) constructs the initial vertex set $V$ without sending any packets to the target environment. Passive techniques include querying WHOIS and DNS records, searching certificate transparency logs (via crt.sh) to enumerate subdomains, scraping LinkedIn and job postings to identify technology stack components and employee names for social engineering targeting, reviewing code repositories for inadvertent secret disclosure, and analysing Shodan and Censys data to identify internet-exposed services. Passive reconnaissance is risk-free from a detection standpoint and can produce substantial attack graph structure, including the identification of VPN gateways, mail servers, cloud storage buckets, and development infrastructure not listed in the official scope.

Active reconnaissance transitions from observation to interaction. Port scanning with nmap (specifically nmap -sV -sC for service version detection combined with default script scanning) identifies open services, their versions, and their initial response characteristics. Web application crawling with Burp Suite or feroxbuster maps the URL surface of web applications. LDAP enumeration against exposed Active Directory infrastructure can enumerate domain structure, user accounts, and computer objects without authentication. Service fingerprinting assigns specific software versions to each identified service, enabling the tester to query CVE databases to identify which edges in the attack graph have known-exploit transitions available. The output of the reconnaissance phase is a populated attack graph $G$ with an initial vertex set and an edge set derived from observed services and known vulnerabilities.
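The fingerprinting-to-edges step at the end of this phase can be sketched as a lookup: each fingerprinted (product, version) pair with a known exploit contributes a weighted transition to the attack graph. The vulnerability table and reliability scores below are placeholders, not real exploit statistics:

```python
def edges_from_recon(observations, vuln_db, entry="s0"):
    """Turn fingerprinted services into attack-graph edges: each (product,
    version) pair with a known exploit becomes a transition from the entry
    state to a foothold on that host, weighted by an assumed reliability."""
    edges = {}
    for host, product, version in observations:
        key = (product, version)
        if key in vuln_db:
            edges[(entry, f"foothold:{host}")] = vuln_db[key]
    return edges

# Illustrative data only: vuln_db maps (product, version) to an assumed
# exploit reliability score.
vuln_db = {("tomcat", "9.0.35"): 0.85, ("openssh", "7.4"): 0.20}
obs = [
    ("10.0.0.5", "tomcat", "9.0.35"),
    ("10.0.0.9", "nginx", "1.25.3"),    # no known-exploit entry: no edge
    ("10.0.0.12", "openssh", "7.4"),
]
edges = edges_from_recon(obs, vuln_db)
```

In practice the lookup runs against exploit-db or the Metasploit module index rather than a hand-built dictionary, but the structural role is the same: reconnaissance output becomes edge set input for Section 3's path optimisation.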

4.3 Exploitation: Traversing the First Edge

The initial exploitation phase selects the first transition in $\pi^*$, the path most likely to yield a foothold in the target environment. Common initial-access techniques map directly to the ATT&CK Initial Access tactic (TA0001) and include: phishing with credential harvesting (T1566.002), exploiting public-facing applications (T1190), exploiting external remote services such as VPN concentrators or Citrix gateways (T1133), and supply chain compromise through third-party software (T1195). For internet-facing web applications, the initial exploitation phase typically involves testing for the OWASP Top 10 vulnerability classes: injection (SQL, OS command, LDAP), broken authentication, cross-site scripting, insecure deserialisation, and server-side request forgery. Web exploitation work is documented in the OWASP Web Security Testing Guide (WSTG v4.2), which provides detailed test procedures for each vulnerability class [6].

The technical quality of the exploitation phase depends critically on the reconnaissance that precedes it. A tester who has identified a specific version of Apache Tomcat (9.0.35, for example) can directly query exploit-db.com or the Metasploit database for known reliable exploits against that version. A tester who has identified only "a web server on port 8080" must conduct broader, noisier scanning that is more likely to trigger detection. The reconnaissance-exploitation feedback loop is why skilled penetration testers spend a disproportionate fraction of their engagement time in reconnaissance: a higher-fidelity attack graph at the start of exploitation leads to higher-probability initial transitions, shorter time to first foothold, and more time remaining to explore the deeper phases of the test.

4.4 Post-Exploitation: The Internal Kill Chain

Post-exploitation is where the test transitions from the external attack lifecycle to the internal one, and it is where the most significant asymmetry between a point-in-time test and real adversarial intrusion becomes visible. An attacker who has established a foothold will spend days, weeks, or months conducting internal reconnaissance, escalating privileges, and moving laterally before executing their final objective. A penetration tester who has achieved the same foothold typically has hours or a day before the engagement window closes. This time constraint forces the tester to prioritise the highest-probability paths to the objective rather than exhaustively exploring all reachable states in the internal attack graph. The formal model makes this prioritisation precise: given remaining time $\tau$, the tester should select the path $\pi$ that maximises $P(\pi)$ subject to the estimated traversal time $t(\pi) \leq \tau$.
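The time-bounded prioritisation just described, maximise $P(\pi)$ subject to $t(\pi) \leq \tau$, can be sketched with an exhaustive search over simple paths, which is feasible for the small candidate sets a tester actually weighs. Edge probabilities and hour estimates are invented for illustration:

```python
def best_path_within_time(edges, s0, target, tau):
    """Enumerate simple paths by DFS; return the highest-probability path
    whose estimated traversal time fits in the remaining window tau.
    edges: (u, v) -> (success_probability, estimated_hours)."""
    adj = {}
    for (u, v), (p, t) in edges.items():
        adj.setdefault(u, []).append((v, p, t))
    best = (None, 0.0)

    def dfs(node, path, prob, hours):
        nonlocal best
        if hours > tau:
            return                      # over budget: prune this branch
        if node == target:
            if prob > best[1]:
                best = (list(path), prob)
            return
        for v, p, t in adj.get(node, []):
            if v not in path:           # keep paths simple (no revisits)
                path.append(v)
                dfs(v, path, prob * p, hours + t)
                path.pop()

    dfs(s0, [s0], 1.0, 0.0)
    return best

# Hypothetical internal graph: a slow-but-reliable route and a fast one.
EDGES = {
    ("foothold", "fileserver"): (0.9, 6.0),
    ("fileserver", "da"): (0.9, 6.0),
    ("foothold", "kerberoast"): (0.6, 1.0),
    ("kerberoast", "da"): (0.7, 1.0),
}
path, p = best_path_within_time(EDGES, "foothold", "da", tau=4.0)
# With only 4 hours left, the 12-hour high-probability route is infeasible
# and the faster, riskier Kerberoasting route is selected instead.
```

This is the formal version of a familiar field decision: with a day remaining, a tester attempts the quick credential attack rather than the patient multi-hop route an unconstrained adversary would prefer.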

Post-exploitation maps to the THROUGH macro-cycle of the Unified Kill Chain, covering discovery (ATT&CK TA0007), privilege escalation (TA0004), lateral movement (TA0008), and credential access (TA0006) [2]. The discovery phase converts the post-foothold position into a populated internal attack graph: enumerating domain structure, identifying high-value targets (domain controllers, database servers, certificate authorities), and mapping network segmentation. Privilege escalation elevates the tester's capabilities within the current system, typically from a low-privilege service account or user session to local administrator or SYSTEM. Lateral movement traverses from the initially compromised system to other systems in the environment, typically using credentials acquired through credential access techniques. Credential access is the critical enabler: a tester who can harvest NTLM hashes, Kerberos tickets, or plaintext credentials from the first compromise frequently finds that the same credentials grant access to large portions of the internal network due to credential reuse.

5. Advanced Techniques: Active Directory, C2 Evasion, and Chain Construction

5.1 Active Directory as an Attack Surface

Active Directory (AD) is the identity and access management backbone of the vast majority of enterprise Windows environments, and it is simultaneously the richest attack surface for lateral movement and privilege escalation. The reason is structural: AD is a graph database of objects (users, computers, groups, service principals, organisational units) with explicit trust relationships between them, and the security model that governs those relationships is complex enough that misconfigurations are nearly universal. The BloodHound tool, created by Andy Robbins (@_wald0), Rohan Vazarkar, and Will Schroeder, operationalises the attack graph model directly against AD [7]. BloodHound ingestor tools (SharpHound for domain-joined Windows systems, BloodHound.py for remote Python-based collection) enumerate all AD objects and their relationships, then store them in a Neo4j graph database. BloodHound's shortest-path queries then solve the unit-weight analogue of Equation (3) against the AD object graph, finding the minimum-hop path from a controlled low-privilege user to Domain Admin.

The specific AD attack techniques that BloodHound surfaces illustrate why the formal kill chain model is necessary. A single misconfigured Access Control Entry (ACE) granting a regular user GenericWrite over a Group object can provide a path to Domain Admin through a sequence of steps: adding the controlled user to the Group, inheriting the Group's permissions over a computer account, using those permissions to set the computer's Service Principal Name (SPN), performing Kerberoasting against the newly created SPN to obtain a crackable hash, cracking the hash offline, and using the resulting credentials to authenticate to a privileged system. None of these six steps is individually high-severity; the ACE misconfiguration alone would score Medium in a CVSS assessment. The chain that connects them produces Domain Admin access, which is a critical finding. This is the attack path problem in its purest form: the kill chain model catches it; a vulnerability scanner does not.
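The six-step ACE chain above can be modelled as labelled edges in an AD-style relationship graph and recovered with the same breadth-first shortest-path search BloodHound performs. Object names and relationship labels below are hypothetical illustrations of the pattern from the text:

```python
# The six-step ACE chain from the text as labelled edges. Object names
# are hypothetical.
AD_EDGES = [
    ("user.alice", "group.helpdesk", "GenericWrite->AddMember"),
    ("group.helpdesk", "computer.srv01", "GenericAll"),
    ("computer.srv01", "spn.srv01", "WriteSPN"),
    ("spn.srv01", "hash.srv01", "Kerberoast"),
    ("hash.srv01", "cred.srv01", "OfflineCrack"),
    ("cred.srv01", "domain.admins", "AdminTo"),
]

def shortest_hop_path(edges, src, dst):
    """BFS over the relationship graph: the minimum-hop escalation path,
    analogous in spirit to a BloodHound shortest-path query."""
    adj = {}
    for u, v, rel in edges:
        adj.setdefault(u, []).append((v, rel))
    frontier = [(src, [])]
    seen = {src}
    while frontier:
        nxt = []
        for node, path in frontier:
            if node == dst:
                return path
            for v, rel in adj.get(node, []):
                if v not in seen:
                    seen.add(v)
                    nxt.append((v, path + [rel]))
        frontier = nxt
    return None

chain = shortest_hop_path(AD_EDGES, "user.alice", "domain.admins")
# chain lists the six relationship labels that compose the escalation.
```

Each edge is individually a Medium-severity misconfiguration at worst; the returned path is the critical finding, which is exactly the gap between scanner output and attack path analysis.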

"BloodHound has fundamentally changed how penetration testers and defenders think about Active Directory security. What used to take days of manual graph traversal now takes seconds." — Andy Robbins (@_wald0), "The Attack Path Management Manifesto," posts.specterops.io, 2022 [7]

ADCS (Active Directory Certificate Services) has emerged as another major attack surface category following Will Schroeder and Lee Christensen's "Certified Pre-Owned" research [8]. Certificate template misconfigurations (ESC1 through ESC13) allow low-privilege users to obtain certificates that impersonate privileged identities, escalate to Domain Admin, or persist in the environment with credentials that survive password resets. The Certipy tool (by Oliver Lyak) automates the enumeration and exploitation of ADCS misconfigurations, and BloodHound's recent releases incorporate ADCS attack paths directly into the graph model. The prevalence of ADCS deployments (any Windows Server environment with Certificate Services installed is affected) and the difficulty of remediation (certificate template misconfigurations require careful ACL analysis and business impact assessment before correction) make ADCS a consistently productive attack surface in enterprise environments.

5.2 Command and Control Framework Selection

Once a foothold and initial escalation have been achieved, the penetration tester requires a command-and-control (C2) channel for interactive access to compromised systems, execution of further tooling, and management of lateral movement. The choice of C2 framework has significant implications for test fidelity, operational security, and detection risk. Cobalt Strike (Fortra) remains widely used for professional red team engagements, with its Malleable C2 profile system allowing operators to configure beacon behaviour to emulate specific threat actors and evade detection by endpoint security products [9]. However, the proliferation of cracked Cobalt Strike copies circulating among criminal actors has caused most mature endpoint detection and response (EDR) products to develop highly effective Cobalt Strike detection signatures, reducing its utility for testing against well-defended environments.

The open-source C2 ecosystem has matured significantly as a result. Sliver (developed by Bishop Fox, available at github.com/BishopFox/sliver) provides a feature-complete C2 framework with support for multiple communication protocols (HTTP, HTTPS, DNS, WireGuard, mTLS), automatic certificate generation, and in-memory payload execution [10]. Its open-source nature means detection signatures are widely available, but its active development and extensible architecture make it a credible alternative to Cobalt Strike for engagements where the simulated threat actor does not specifically use Cobalt Strike. Brute Ratel C4 (commercial, \$1,500-\$2,500 per operator per year) was designed explicitly for EDR evasion using advanced process injection and reflective loading techniques, and it was adopted by threat actors including ransomware groups within a year of its commercial release [11]. Havoc Framework (open source) achieves similar EDR evasion through sophisticated sleep obfuscation and indirect syscall invocation. The C2 landscape is evolving rapidly in response to EDR improvement, with each new framework primarily distinguished by its evasion technique portfolio.

| Framework | Type | Primary Evasion Technique | Protocol Support | Detection Status (2024) |
|---|---|---|---|---|
| Cobalt Strike | Commercial | Malleable profiles | HTTP/S, DNS, SMB | Well-detected by mature EDRs |
| Sliver | Open source | In-memory staging | HTTP/S, DNS, mTLS, WireGuard | Moderate detection rate |
| Brute Ratel C4 | Commercial | Indirect syscalls, reflective loading | HTTP/S, DNS | Low detection in 2023-2024 |
| Havoc | Open source | Sleep obfuscation, indirect syscalls | HTTP/S, SMB | Increasing detection rate |

5.3 Constructing the Complete Compromise Chain

The kill chain from initial access to business impact typically requires three to five distinct escalation steps in a well-segmented enterprise environment. A representative chain for an external-to-Domain-Admin compromise in a typical enterprise follows this structure: exploit a public-facing web application to gain code execution as a low-privilege service account; use the service account's network access to capture NTLM hashes via Responder on the local network segment; crack or relay the captured hashes to gain authentication to a workstation as a local administrator; dump LSASS process memory on the workstation using Mimikatz or pypykatz to recover domain credentials cached from a privileged user's previous login; use those domain credentials with BloodHound-identified attack paths to reach Domain Admin. Each step maps precisely to an ATT&CK technique: T1190, T1557.001, T1550.002, T1003.001, and T1078.002 respectively.

The chain construction discipline requires the tester to document not just that each step succeeded but what information the step produced that enabled the next step. This documentation serves two purposes: it creates the evidentiary chain of custody needed for the final report, and it makes explicit which links in the chain are structurally necessary (removing them would break the path) versus opportunistic (they accelerated the test but alternative paths existed). A finding that describes "NTLM hash capture via LLMNR poisoning" is a vulnerability report. A finding that describes "NTLM hash capture via LLMNR poisoning enabled escalation from service account to Domain Admin through the path [specific AD object chain]" is a kill chain finding, and it supports a substantively different remediation conversation: disabling LLMNR is a technical control; understanding why LLMNR poisoning is the weakest link in this specific chain may reveal that the real fix is eliminating the privilege level of the domain accounts whose hashes are being captured.
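One lightweight way to enforce this documentation discipline is to record, for each step, the information it consumed and produced, and to check mechanically that the chain is self-consistent. The schema below is an illustrative sketch, not a standard format; all field names are assumptions:

```python
from dataclasses import dataclass


@dataclass
class ChainStep:
    """One link in a documented kill chain."""
    technique_id: str   # ATT&CK technique, e.g. "T1557.001"
    action: str         # what the tester did
    evidence: str       # artefact reference (screenshot, pcap, log excerpt)
    produced: list[str] # information or access this step yielded
    consumed: list[str] # information or access this step required
    structural: bool    # True if no alternative path existed at this point


def validate_chain(steps: list[ChainStep]) -> list[str]:
    """Check that everything a step consumes was produced by an earlier step.

    Returns a list of unmet dependencies; an empty list means the chain
    is self-consistent and each link's enabling information is documented.
    """
    available: set[str] = {"internet_position"}  # assumed starting capability
    gaps: list[str] = []
    for step in steps:
        for need in step.consumed:
            if need not in available:
                gaps.append(f"{step.technique_id}: missing '{need}'")
        available.update(step.produced)
    return gaps
```

Running `validate_chain` over the documented steps at report time surfaces exactly the links whose enabling information was never recorded, which is where evidentiary chains usually break.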

6. Tooling and Automation: From Metasploit to LLM-Driven Agents

6.1 The Traditional Tool Ecosystem

Professional penetration testing practice is built on a relatively stable set of open-source tools refined over years of community contribution and adversarial usage. Metasploit Framework, maintained by Rapid7, provides the foundational exploitation and post-exploitation capability: a modular architecture with over 2,200 exploit modules, 1,100 auxiliary modules for reconnaissance and scanning, and a rich post-exploitation library for credential dumping, lateral movement, and pivot management [12]. Metasploit's Meterpreter payload provides an in-memory agent with a well-designed command interface, but its detection rate by modern EDRs is high enough that professional testers typically use it only against systems where EDR is not deployed, or as a final-stage payload after initial access via a less-detected loader. Nmap remains the standard for network discovery and service enumeration, with its Nmap Scripting Engine (NSE) providing vulnerability-specific probes for common misconfigurations. BloodHound, with the SharpHound or BloodHound.py ingestors, is essential for any engagement involving Active Directory, as described in Section 5.1. Impacket (originally developed at Core Security, now maintained by Fortra and the community) provides Python implementations of Windows networking protocols (SMB, LDAP, Kerberos, DCOM) that are invaluable for credential relay attacks and domain enumeration. NetExec (the successor to CrackMapExec) automates credential testing across large networks and provides post-exploitation capabilities for lateral movement at scale.
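In practice these tools are chained through scripting glue. The sketch below is a simplified illustration rather than a production harness: it parses Nmap's greppable (`-oG`) output to find hosts exposing SMB, which is the natural input for a NetExec credential-testing pass. The CIDR scope is a placeholder, and the actual sweep runs only when explicitly invoked against an authorised target:

```python
import re
import shlex
import subprocess


def smb_hosts_from_gnmap(gnmap_text: str) -> list[str]:
    """Extract hosts reporting 445/open from nmap greppable (-oG) output."""
    hosts: list[str] = []
    for line in gnmap_text.splitlines():
        m = re.match(r"Host:\s+(\S+).*Ports:(.*)", line)
        if m and "445/open" in m.group(2):
            hosts.append(m.group(1))
    return hosts


def discover_smb_hosts(cidr: str) -> list[str]:
    """Sweep a (pre-authorised) scope for SMB; requires nmap on PATH.

    A typical follow-on step, shown here only as a comment:
        netexec smb <host> -u users.txt -p passwords.txt
    """
    result = subprocess.run(
        shlex.split(f"nmap -p445 -oG - {cidr}"),
        capture_output=True, text=True, check=True,
    )
    return smb_hosts_from_gnmap(result.stdout)
```

Separating the parser from the scanner keeps the text-munging testable offline, which matters when the same glue is reused across engagements.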

6.2 LLM-Assisted Penetration Testing

The emergence of capable large language models has generated substantial research interest in automated penetration testing, with the working hypothesis being that an LLM that can reason about vulnerability exploitation and tool invocation might automate meaningful portions of the kill chain. The most cited work is PentestGPT by Gelei Deng and colleagues, published at USENIX Security 2024 [13]. PentestGPT uses a tripartite architecture: a Reasoning Module that maintains the high-level test strategy and maps current state to ATT&CK tactics, a Generation Module that translates strategic decisions into specific tool invocations and commands, and a Parsing Module that processes tool outputs and updates the system's understanding of the attack graph. On a benchmark of HackTheBox-style CTF machines, PentestGPT demonstrated a 228.6% improvement in task completion rate compared to a baseline GPT-3.5 agent operating without the tripartite structure.
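The tripartite loop can be pictured as three cooperating functions over a shared state object. The skeleton below is an illustrative paraphrase of the published architecture, not PentestGPT's actual code; the function names and prompts are assumptions, and `llm` stands in for any chat-completion callable:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TestState:
    """Shared state threaded through the three modules."""
    tactic: str = "reconnaissance"                    # current ATT&CK tactic
    facts: list[str] = field(default_factory=list)    # what the agent knows
    history: list[str] = field(default_factory=list)  # commands already issued


def reasoning_module(llm: Callable[[str], str], state: TestState) -> str:
    """Maintain high-level strategy: pick the next objective from known facts."""
    return llm(f"Given facts {state.facts}, what is the next ATT&CK-aligned objective?")


def generation_module(llm: Callable[[str], str], objective: str, state: TestState) -> str:
    """Translate a strategic objective into one concrete tool invocation."""
    return llm(f"Objective: {objective}. Avoid repeating {state.history}. Emit one command.")


def parsing_module(llm: Callable[[str], str], tool_output: str, state: TestState) -> None:
    """Distil raw tool output into new facts for the state model."""
    state.facts.append(llm(f"Summarise the security-relevant facts in: {tool_output}"))


def step(llm: Callable[[str], str],
         run_tool: Callable[[str], str],
         state: TestState) -> None:
    """One reason -> generate -> execute -> parse iteration of the agent loop."""
    objective = reasoning_module(llm, state)
    command = generation_module(llm, objective, state)
    state.history.append(command)
    parsing_module(llm, run_tool(command), state)
```

The separation matters because the failure modes differ per module: the reasoning module hallucinates strategy, the generation module hallucinates flags, and the parsing module drops facts; the tripartite structure lets each be evaluated in isolation.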

More recent work extends the LLM agent model with explicit planning components. The CHECKMATE framework (Lingzhi Wang et al., arxiv:2512.11143, 2025) integrates classical planning algorithms with LLM reasoning, using a Planner-Executor-Perceptor (PEP) architecture where a formal planner selects action sequences, an LLM-based executor translates planned actions into concrete tool invocations, and a perceptor updates the state model from tool outputs [14]. CHECKMATE achieved more than a 20% improvement in benchmark success rates over state-of-the-art LLM-only agents with more than 50% reduction in time and monetary cost per completed engagement. The RefPentester framework (2025, arxiv:2505.07089) uses a Retrieval-Augmented Generation pipeline over a penetration testing knowledge base, achieving 87.5% success on vulnerability identification tasks compared to GPT-4o's 35.7% on the same benchmark [15]. These results are impressive within their benchmark contexts but do not yet generalise to real enterprise environments, where the complexity of network topology, the variability of defensive posture, and the creativity required to chain novel vulnerabilities exceed what current benchmarks capture.

"LLMs are, functionally, very good at 'given CVE-2024-XXXX, how do I exploit this service?' They are poor at 'given this unusual network architecture I have never seen before, where should I look and what should I try?' The first is knowledge retrieval; the second is reasoning. We have good tools for the first." — AppSec Engineer blog, "The Rocky Path to Effective Threat Modeling Automation," appsecengineer.com, 2024 [16]

6.3 A Python Attack Graph Implementation

The following Python implementation demonstrates the formal attack graph model from Section 3. It builds an attack graph from a network description, finds the maximum-probability path using Dijkstra's algorithm on log-transformed edge weights as derived in Equation (3), and produces a prioritised remediation list based on which edge removals would most reduce the path success probability.

import heapq
import math
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transition:
    """A single attack transition (edge) in the attack graph."""
    technique_id: str        # ATT&CK technique identifier, e.g. "T1190"
    description: str
    success_prob: float      # probability this step succeeds on attempt
    effort_cost: float       # attacker effort cost (lower = easier for attacker)


@dataclass
class AttackGraph:
    """Directed weighted attack graph for a target environment."""
    states: dict[str, str] = field(default_factory=dict)
    edges: dict[str, list[tuple[str, Transition]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_state(self, state_id: str, description: str) -> None:
        self.states[state_id] = description

    def add_transition(self, src: str, dst: str, t: Transition) -> None:
        self.edges[src].append((dst, t))

    def find_all_simple_paths(
        self,
        source: str,
        target: str,
        max_depth: int = 10,
    ) -> list[list[tuple[str, Optional[Transition]]]]:
        """DFS enumeration of all simple paths from source to target."""
        paths: list[list[tuple[str, Optional[Transition]]]] = []
        stack = [[(source, None)]]
        while stack:
            path = stack.pop()
            node = path[-1][0]
            if node == target:
                paths.append(path)
                continue
            if len(path) > max_depth:
                continue
            visited = {p[0] for p in path}
            for neighbor, transition in self.edges.get(node, []):
                if neighbor not in visited:
                    stack.append(path + [(neighbor, transition)])
        return paths

    def max_probability_path(
        self,
        source: str,
        target: str,
    ) -> tuple[float, list[str]]:
        """
        Dijkstra on -log(w) edge weights finds the max-probability path.
        Maximising product(w_i) == minimising sum(-log(w_i)).
        """
        dist: dict[str, float] = {source: 0.0}
        prev: dict[str, Optional[str]] = {source: None}
        heap: list[tuple[float, str]] = [(0.0, source)]
        while heap:
            cost, node = heapq.heappop(heap)
            if node == target:
                break
            if cost > dist.get(node, math.inf):
                continue  # stale heap entry
            for neighbor, transition in self.edges.get(node, []):
                edge_cost = -math.log(max(transition.success_prob, 1e-12))
                new_cost = cost + edge_cost
                if new_cost < dist.get(neighbor, math.inf):
                    dist[neighbor] = new_cost
                    prev[neighbor] = node
                    heapq.heappush(heap, (new_cost, neighbor))
        if target not in dist:
            return 0.0, []
        path, node = [], target
        while node is not None:
            path.append(node)
            node = prev.get(node)
        path.reverse()
        return math.exp(-dist[target]), path

    def remediation_priority(
        self,
        source: str,
        target: str,
    ) -> list[tuple[float, str, str]]:
        """
        Rank each edge by how much removing it reduces max-path probability.
        Higher reduction = higher defensive priority.
        """
        baseline_prob, _ = self.max_probability_path(source, target)
        priorities: list[tuple[float, str, str]] = []
        for src_node in list(self.edges.keys()):
            for dst_node, transition in list(self.edges[src_node]):
                # Temporarily remove this edge
                self.edges[src_node] = [
                    (d, t) for d, t in self.edges[src_node]
                    if not (d == dst_node and t is transition)
                ]
                new_prob, _ = self.max_probability_path(source, target)
                reduction = baseline_prob - new_prob
                priorities.append(
                    (reduction, transition.technique_id, transition.description)
                )
                self.edges[src_node].append((dst_node, transition))  # restore
        return sorted(priorities, reverse=True)


def build_example_graph() -> AttackGraph:
    g = AttackGraph()
    for sid, desc in [
        ("internet",    "Attacker-controlled internet position"),
        ("dmz_web",     "Low-privilege web server process (DMZ)"),
        ("dmz_service", "Service account on DMZ host"),
        ("corp_user",   "Domain user on corporate workstation"),
        ("dc",          "Domain Controller"),
        ("db_admin",    "Database administrator (objective)"),
    ]:
        g.add_state(sid, desc)

    g.add_transition("internet",    "dmz_web",     Transition("T1190",    "Exploit public-facing web app",           0.70, 2.0))
    g.add_transition("dmz_web",     "dmz_service", Transition("T1068",    "Local privilege escalation",              0.55, 3.0))
    g.add_transition("dmz_service", "corp_user",   Transition("T1557.001","LLMNR/NBNS poisoning, relay hash",        0.65, 2.5))
    g.add_transition("corp_user",   "dc",          Transition("T1003.001","LSASS dump -> domain credential reuse",   0.60, 3.5))
    g.add_transition("dc",          "db_admin",    Transition("T1078.002","Valid domain admin lateral movement",     0.90, 1.0))
    g.add_transition("internet",    "corp_user",   Transition("T1566.001","Spearphishing attachment",               0.35, 1.5))
    return g


if __name__ == "__main__":
    g = build_example_graph()
    prob, path = g.max_probability_path("internet", "db_admin")
    print(f"Max-probability path: {' -> '.join(path)}")
    print(f"Overall success probability: {prob:.4f}\n")
    print("Remediation priority (highest probability reduction first):")
    for reduction, tid, desc in g.remediation_priority("internet", "db_admin"):
        print(f"  {tid:12s}  delta_p={reduction:.4f}  {desc}")

Running this implementation on the example graph produces the max-probability path internet -> corp_user -> dc -> db_admin with probability 0.1890: the direct spearphishing route narrowly beats the four-step web-application chain (probability 0.1351). The remediation ranking identifies the shared chokepoint edges as the highest-priority targets: removing the LSASS credential dumping step (T1003.001) or the domain-admin lateral movement step (T1078.002) severs every path to the objective (delta_p = 0.1890 each). Removing the spearphishing edge (T1566.001) ranks next (delta_p = 0.0539), because it forces the attacker onto the lower-probability web-application chain, while patching the public-facing web application (T1190) in isolation produces no reduction at all, since the phishing route bypasses it entirely. This is the formal justification for a finding that practitioners express intuitively: hardening the chokepoints that every attack path must traverse, here credential caching and privileged account hygiene, has a higher security return than patching any single entry point for which an alternative path exists.

7. The Industry Reckoning: Continuous Testing, Market Consolidation, and Strategic Choices

7.1 The Market Structure

The global penetration testing market was valued between $1.87 billion and $2.73 billion in 2024 depending on research methodology, with a compound annual growth rate of 11-20% producing projected valuations of $3.9 billion by 2029 [17]. This growth is driven primarily by regulatory mandates (PCI-DSS 4.0's requirement for annual penetration tests and continuous monitoring, DORA's threat-led penetration testing requirements for EU financial entities, and the FDA's cybersecurity testing requirements for medical devices), the expanding attack surface driven by cloud adoption and IoT proliferation, and the increasing cost of breaches that creates executive-level demand for evidence of security posture. The market segments into three delivery models: traditional consulting engagements (a security firm conducts a time-bounded test and delivers a report), crowdsourced platforms (a managed marketplace of vetted security researchers, exemplified by Synack, which has raised approximately $112 million and was founded by former NSA operatives Jay Kaplan and Mark Kuhr), and fully automated continuous testing platforms.

The automated continuous testing segment is the fastest-growing and the most structurally interesting. Horizon3.ai, whose NodeZero platform conducts autonomous penetration tests by chaining real exploits across customer networks, raised a $100 million Series D in mid-2025, bringing its total funding to $178.5 million [18]. The company reports having completed more than 150,000 autonomous penetration tests across over 3,000 customer organisations, with 100% year-over-year ARR growth and 2,900% revenue growth since 2021. NodeZero's architecture implements the attack graph model of Section 3 directly: it enumerates reachable states from an initial position, constructs weighted transition edges based on observed services and known exploit reliability, and traverses the graph to find actual compromise paths rather than just identifying vulnerabilities. The distinction between "we found these CVEs" and "we traversed this path to Domain Admin" is precisely the distinction between a vulnerability scanner and a penetration testing platform, and it is the core value proposition of the continuous automated testing category.

7.2 The Point-in-Time Testing Failure Mode

The fundamental structural problem with annual penetration testing is captured precisely by Equation (6). An organisation that makes 1,100 infrastructure changes per month and tests annually accumulates roughly 13,200 unassessed changes between tests. The empirical coverage data from the industry is consistent with this model: point-in-time engagements covering a four-week window assess 15-30% of the actual attack surface, while continuous testing programmes operating on weekly cycles reach 80-95% coverage [4]. The remediation verification gap is equally significant: the average time between a vulnerability being identified in a penetration test and being verified as remediated is approximately ten months in organisations relying on annual tests, while continuous platforms enable same-week remediation verification.

The compliance framing that drives much of the penetration testing market actively exacerbates this problem. A client who is purchasing a penetration test primarily to satisfy a PCI-DSS Requirement 11.4 or ISO 27001 Annex A.12.6 control is optimising for a completed test report, not for maximum attack surface coverage. This incentive misalignment means that scope limitations unacceptable from a security standpoint are entirely acceptable from a compliance standpoint: a test that excludes cloud infrastructure, recently provisioned systems, and third-party integrations may satisfy the compliance requirement while leaving the most actively changing and most vulnerable parts of the environment unexamined. Continuous automated testing platforms disrupt this incentive structure by making comprehensive coverage the default rather than an expensive additional scope item.

7.3 The LLM Disruption and Its Limits

The penetration testing market is experiencing the same LLM disruption as adjacent security disciplines, but with specific constraints that moderate the near-term automation potential. The benchmark results from PentestGPT, CHECKMATE, and RefPentester demonstrate that LLM agents can autonomously complete well-defined exploitation tasks against isolated systems [13] [14] [15]. The performance gap between LLM agents and human testers is closing rapidly for tasks that match the training distribution: common CVE exploitation, web application testing against known vulnerability classes, and Active Directory attack path execution from given credential materials. The performance gap remains large for tasks that require situational awareness about novel environments: identifying unusual trust relationships, recognising that an architecture has a non-obvious constraint that makes a standard attack path infeasible, and constructing novel chains from previously uncombined techniques.

The commercial implications are significant. The commodity layer of penetration testing, covering known CVE identification and exploitation against standard architectures, is already being automated by platforms like NodeZero. This layer represents a significant fraction of the volume but not the value of the penetration testing market: the findings that require human creativity and contextual reasoning (novel attack chains, business logic vulnerabilities, architecture-specific weaknesses) remain the province of human testers and command premium pricing. The consulting firms that will thrive are those that position around the high-complexity, high-value work rather than competing on throughput with automated platforms.

7.4 Strategic Recommendations for Security Leaders

Security practitioners and programme leaders should make three structural decisions that determine whether their penetration testing programme produces security value or merely compliance documentation. The first is frequency architecture: replace or supplement the annual point-in-time engagement with a continuous or quarterly testing cadence for the highest-value systems. The formal model of Equation (6) makes the cost-benefit calculation tractable: estimate $\lambda$ (the rate at which new vulnerabilities are introduced) from change management data, and compute the expected unassessed attack surface under different testing frequencies to determine the coverage level that is defensible to the board.
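Under the simplest version of this model (a sketch only: Equation (6) is not reproduced here, and the sketch assumes changes arrive at a constant rate of lam per month and each test clears the unassessed backlog), the backlog peaks at lam * T just before a test and averages lam * T / 2 over a cycle of length T months:

```python
def unassessed_backlog(lam: float, cycle_months: float) -> tuple[float, float]:
    """Average and peak unassessed changes for a given test cadence.

    Simplifying assumptions: constant change rate `lam` per month, and
    every test fully resets the backlog of unassessed changes.
    """
    peak = lam * cycle_months   # backlog just before the next test
    average = peak / 2.0        # mean backlog over the cycle
    return average, peak


if __name__ == "__main__":
    lam = 1_100.0  # changes/month, from the article's running example
    for label, cycle in [("annual", 12.0), ("quarterly", 3.0), ("weekly", 12.0 / 52.0)]:
        avg, peak = unassessed_backlog(lam, cycle)
        print(f"{label:10s} avg backlog ~{avg:8.0f}   peak ~{peak:8.0f}")
```

At 1,100 changes per month, annual testing carries a peak backlog of 13,200 unassessed changes against roughly 254 for a weekly cadence, which is the arithmetic a board-level coverage argument reduces to.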

The second decision is scope completeness: mandate that scope explicitly includes cloud infrastructure, recently provisioned assets, and third-party integrations, and treat scope limitations as exceptions requiring documented business justification rather than the default. In mature environments, the most significant findings tend to surface in the places that previous tests did not examine; an assessment that excludes those areas produces a false confidence that is operationally more dangerous than no assessment at all.

The third decision is finding consumption architecture: the gap between identifying a vulnerability and fixing it is where security value is lost. A penetration test that produces a finding without a corresponding owner assignment, remediation deadline, and verification mechanism is an information artefact rather than a security control. The most effective programmes implement a finding workflow that assigns each finding to a specific engineering owner at the time of report delivery, sets a risk-tiered remediation deadline based on finding severity and attack graph position, and re-tests specific findings as part of the following test cycle rather than waiting for the full scope to be re-assessed.
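A minimal sketch of such a finding workflow record follows; the SLA tiers and the halving rule for attack-path findings are illustrative assumptions, not values mandated by any standard:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Illustrative SLA tiers; real programmes set these by policy.
REMEDIATION_SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}


@dataclass
class Finding:
    """A penetration test finding wired into a remediation workflow."""
    title: str
    severity: str          # key into REMEDIATION_SLA_DAYS
    on_attack_path: bool   # does this finding sit on a traversed kill chain?
    owner: str             # engineering owner assigned at report delivery

    def deadline(self, reported: date) -> date:
        """Risk-tiered deadline: attack-path findings get a tighter SLA."""
        days = REMEDIATION_SLA_DAYS[self.severity]
        if self.on_attack_path:
            days = max(days // 2, 7)  # halve the SLA, floor of one week
        return reported + timedelta(days=days)
```

The `on_attack_path` flag is the point of contact with the attack graph model: a "medium" vulnerability that sits on a traversed kill chain gets a deadline its CVSS score alone would not justify.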

8. Conclusion and Open Problems

The penetration testing kill chain, when derived from first principles rather than assembled from practitioner folklore, is a formal object: a sequence of transitions in an attack graph that connects attacker-controlled initial state to target state, annotated with transition probabilities that determine the maximum-probability path and edge-removal priorities that guide defensive hardening. The major methodological frameworks (PTES, the Lockheed Martin Kill Chain, the Unified Kill Chain, MITRE ATT&CK) are different projections of this formal object onto different representational vocabularies, each emphasising different aspects of the attack lifecycle. Practitioners who understand the underlying graph structure can navigate between frameworks without becoming dogmatic adherents of any one; those who learn only the vocabulary without understanding the underlying model tend to apply frameworks mechanically and miss the findings that require structural reasoning.

The open problems in penetration testing methodology are genuine and practically important. The first is attack graph construction from dynamic architectures: building a complete attack graph for a modern enterprise that includes cloud-native services, container orchestration, serverless functions, and third-party API dependencies requires either extensive manual enumeration or automated discovery that must itself avoid triggering production alerts. No current tool solves this comprehensively; the best available approaches are partial, requiring human synthesis to connect sub-graphs from different discovery tools.

The second open problem is LLM reliability characterisation: the conditions under which LLM agents produce reliable results versus unreliable hallucinations for specific attack techniques have not been systematically mapped, and practitioners relying on LLM-assisted pentesting tools have no principled way to know when the agent's output requires expert verification and when it can be trusted.

Third, there is no consensus on chain completeness metrics: how does a penetration tester know when they have found the maximum-probability path versus a locally optimal path that misses a better route they did not consider? Current practice relies on expert intuition about coverage; a formal completeness criterion derived from the attack graph model would convert this intuition into an auditable standard.

Fourth, the integration between offensive findings and defensive detection engineering remains underdeveloped: a successfully traversed kill chain step is implicitly a detection failure for the blue team, and systematically converting penetration test findings into detection rule improvements would close the loop between red team work and defensive improvement in a way that current practice rarely achieves.

References

[1] E. M. Hutchins, M. J. Cloppert, and R. M. Amin, "Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains," in Leading Issues in Information Warfare and Security Research, 2011, pp. 80-106.

[2] P. Pols, "The Unified Kill Chain," unifiedkillchain.com, 2017.

[3] MITRE Corporation, "ATT&CK: Adversarial Tactics, Techniques and Common Knowledge," attack.mitre.org, 2024.

[4] Bugcrowd, "Point-in-Time vs. Continuous Penetration Testing: A Comparison Guide," bugcrowd.com, 2025.

[5] Penetration Testing Execution Standard (PTES), "PTES Technical Guidelines," pentest-standard.org, 2014.

[6] OWASP Foundation, "Web Security Testing Guide (WSTG) v4.2," owasp.org/www-project-web-security-testing-guide, 2021.

[7] A. Robbins, R. Vazarkar, and W. Schroeder, "BloodHound: Six Degrees of Domain Admin," DEF CON 24, 2016; A. Robbins, "The Attack Path Management Manifesto," posts.specterops.io, 2022.

[8] W. Schroeder and L. Christensen, "Certified Pre-Owned: Abusing Active Directory Certificate Services," SpecterOps white paper, specterops.io, 2021.

[9] Fortra, "Cobalt Strike: Adversary Simulation and Red Team Operations," cobaltstrike.com, 2024.

[10] Bishop Fox, "Sliver: Adversary Simulation Framework," github.com/BishopFox/sliver, 2024.

[11] Unit 42 (Palo Alto Networks), "Brute Ratel C4: A Novel Post-Exploitation Tool Designed to Avoid Detection," paloaltonetworks.com, 2022.

[12] Rapid7, "Metasploit Framework," github.com/rapid7/metasploit-framework, 2024.

[13] G. Deng et al., "PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing," in Proceedings of USENIX Security 2024, arxiv.org/abs/2308.06782.

[14] L. Wang et al., "Automated Penetration Testing with LLM Agents and Classical Planning (CHECKMATE)," arxiv.org/abs/2512.11143, December 2025.

[15] RefPentester Research Team, "RefPentester: A Knowledge-Informed Self-Reflective Penetration Testing Framework Based on Large Language Models," arxiv.org/abs/2505.07089, 2025.

[16] AppSec Engineer, "The Rocky Path to Effective Threat Modeling Automation," appsecengineer.com, 2024.

[17] Fortune Business Insights, "Penetration Testing Market Size, Share, Forecast 2025-2032," fortunebusinessinsights.com, 2025.

[18] Horizon3.ai, "Horizon3.ai Raises $100M Series D to Cement Leadership in Autonomous Security," businesswire.com, June 2025.

[19] Gary McGraw, Software Security: Building Security In, Addison-Wesley, 2006.

[20] NIST, "Technical Guide to Information Security Testing and Assessment, SP 800-115," csrc.nist.gov, 2008.

