Data Exfiltration

Data exfiltration explained: how sensitive data leaves systems, why tools miss it, and how to detect risky behavior patterns before loss occurs.

What is Data Exfiltration?

Data exfiltration is the unauthorised transfer of sensitive data from an organisation's systems to a destination outside its control. It's the mechanism by which data breaches become material: the point at which exposure transitions from a posture risk to an actual loss of information with regulatory, legal, and competitive consequences.

But the way exfiltration actually happens in practice diverges sharply from how most security controls are designed to catch it.

How data exfiltration actually happens

Security architecture built in the 2000s and 2010s assumed exfiltration would look like an intrusion: an unauthorised user accessing systems through a blocked port, transferring data to a suspicious IP address, triggering a firewall rule or IDS alert. That assumption shaped the controls most enterprises still run today.

The real picture is different. MITRE ATT&CK documents exfiltration over web services and cloud storage as common attacker techniques, specifically because these channels blend into normal traffic and are typically already permitted. An attacker or insider moving data to Google Drive, Dropbox, or OneDrive isn't violating a firewall rule. They're using a destination the IT team approved. The traffic looks like normal business activity because, from a network control standpoint, it is.

Modern exfiltration doesn't require a malicious tool, a command-and-control server, or an obvious violation. It requires access, patience, and a legitimate channel.

The sequence that makes a material exfiltration incident runs something like this. An identity with authorised access queries a sensitive dataset, perhaps larger than usual but within policy. The results are exported to a local file. The file is renamed, chunked, or compressed. The archive is transferred using a cloud storage service, collaboration platform, or web application that IT has already sanctioned. Cleanup activity follows. The identity continues working normally.

Five steps. Every one of them technically permitted at the individual event level. No rule fires. No alert triggers. The data is gone.
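That five-step chain can be modeled as a sequence of discrete events. A minimal sketch, using an invented event schema and identity name, shows why per-event controls stay silent: every step passes its individual policy check.

```python
# Hypothetical sketch of the five-step chain above as discrete events.
# The event schema and identity name are invented for illustration.

EVENTS = [
    {"step": "query",   "identity": "analyst_7", "detail": "large but in-policy read",            "permitted": True},
    {"step": "export",  "identity": "analyst_7", "detail": "results written to results.csv",      "permitted": True},
    {"step": "stage",   "identity": "analyst_7", "detail": "renamed and compressed to q3.zip",    "permitted": True},
    {"step": "upload",  "identity": "analyst_7", "detail": "q3.zip sent to sanctioned cloud app", "permitted": True},
    {"step": "cleanup", "identity": "analyst_7", "detail": "local copies removed",                "permitted": True},
]

def single_event_alerts(events):
    """A per-event control flags only events that violate policy in isolation."""
    return [e for e in events if not e["permitted"]]

print(single_event_alerts(EVENTS))  # [] -- no step violates policy, so nothing fires
```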

The three categories of exfiltration vectors

External attacker-driven exfiltration

An attacker compromises a credential through phishing, credential stuffing, or a third-party breach. They use that valid credential to access the environment, locate sensitive data, and transfer it out through permitted channels. Because the credential is valid and the destination is allowed, the exfiltration looks like legitimate user activity right up until the point where the behavioral pattern diverges from the account's baseline.

Insider-driven exfiltration

A current or departing employee with legitimate access deliberately transfers sensitive data for personal gain or to benefit a competitor. The data they access is within their authorised scope. The channels they use are approved. Detection depends entirely on recognising that the pattern of access and transfer is inconsistent with legitimate business purpose, not on blocking an unauthorised action.

Negligent or accidental exfiltration

An employee emails a sensitive spreadsheet to a personal account to work over the weekend. A developer pushes source code containing embedded database credentials to a public repository. A contractor copies a project folder to personal cloud storage for convenience. None of these are deliberate exfiltration in the malicious sense, but the data has left the organisation's control. The regulatory and reputational consequences are the same as deliberate theft.

That third category is the highest-frequency scenario. By raw count, most exfiltration incidents are negligent, not malicious. By cost per incident, the most expensive are typically malicious insiders or external attacker-driven scenarios.

Why conventional controls consistently fail to detect it

The failure mode is structural. Most security controls are designed to evaluate single events at single boundaries, not sequences across systems and time.

A DLP rule fires when a specific content pattern crosses a specific egress boundary. It doesn't know what happened upstream of that boundary. It doesn't know whether the content was accessed unusually, whether the volume was anomalous for this identity, whether the file was created by an export 10 minutes earlier, or whether this is the third time this week the identity has sent similar content to an external destination.

A database activity monitoring (DAM) alert fires when a query exceeds a threshold or accesses a sensitive table at an unusual hour. It doesn't see what happens to the data after the query returns results and the session closes.

An endpoint DLP agent may see the file compression operation. It doesn't know whether that file's contents originated from a suspicious database query.

So: each tool fires on the event within its scope. None of them evaluate whether the events across tools constitute a threat pattern.

That's the detection gap that exfiltration reliably exploits. Not a gap in any individual tool's coverage. A gap between tools that no individual tool is designed to bridge.

What detection actually requires

Catching modern exfiltration requires evaluating the chain of events, not the individual steps.

Access at an unusual hour, to data outside the identity's historical access pattern, in volumes above their baseline, followed by local file operations consistent with staging, followed by an upload to a permitted destination through a process that doesn't match the identity's normal application usage: that sequence is the signal. Not any single component of it.
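A sequence detector makes that distinction concrete. The sketch below is a simplification, with invented event names and an arbitrary two-hour window; it flags an identity only when the ordered chain occurs, never on the individual steps.

```python
# Minimal sketch of sequence-based detection: flag an identity when an ordered
# chain of otherwise-permitted steps completes within a time window.
# Event names, the chain, and the window size are illustrative assumptions.

from datetime import datetime, timedelta

CHAIN = ["sensitive_query", "local_staging", "external_upload"]

def detect_chain(events, window=timedelta(hours=2)):
    """Return identities whose events contain CHAIN, in order, within `window`."""
    by_identity = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        by_identity.setdefault(e["identity"], []).append(e)
    flagged = set()
    for identity, evs in by_identity.items():
        idx, start = 0, None
        for e in evs:
            if start is not None and e["ts"] - start > window:
                idx, start = 0, None          # window expired; restart matching
            if e["kind"] == CHAIN[idx]:
                if idx == 0:
                    start = e["ts"]
                idx += 1
                if idx == len(CHAIN):
                    flagged.add(identity)
                    break
    return flagged

events = [
    {"identity": "a", "kind": "sensitive_query", "ts": datetime(2026, 5, 1, 9, 0)},
    {"identity": "a", "kind": "local_staging",   "ts": datetime(2026, 5, 1, 9, 20)},
    {"identity": "a", "kind": "external_upload", "ts": datetime(2026, 5, 1, 9, 45)},
    {"identity": "b", "kind": "sensitive_query", "ts": datetime(2026, 5, 1, 9, 0)},
    {"identity": "b", "kind": "external_upload", "ts": datetime(2026, 5, 1, 9, 30)},  # no staging step
]
print(detect_chain(events))  # {'a'} -- only the completed chain is flagged
```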

Building that detection requires four things operating together.

Behavioural baselines per identity. What does normal look like for this specific user or service account? What data do they typically access, in what volumes, at what hours, from which source IPs? Deviation from that personalised baseline is the first indicator.
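A per-identity baseline can start as simply as a deviation score against that identity's own history. A minimal sketch, with invented volumes and an arbitrary threshold:

```python
# Sketch of a per-identity volume baseline: flag a day's access volume when it
# deviates strongly from that identity's own history. The history values and
# the z-score threshold are illustrative assumptions.

import statistics

def volume_anomaly(history, today, z_threshold=3.0):
    """True if `today` sits more than z_threshold std devs above the identity's mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today > mean              # flat history: any increase is a deviation
    return (today - mean) / stdev > z_threshold

# 30 days of row counts for one identity, then a 50,000-row day:
history = [900, 1100, 1000, 950, 1050] * 6
print(volume_anomaly(history, 50_000))   # True  -- far outside this identity's norm
print(volume_anomaly(history, 1_200))    # False -- busy day, but within baseline
```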

Data sensitivity context. The same behavioural anomaly carries different risk depending on what data is involved. An unusual query against a customer PII table is a higher-confidence signal than an unusual query against an internal wiki. The sensitivity of the data involved determines how urgently the behavioural signal needs to be investigated.
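One way to express that weighting: scale the behavioural anomaly score by a sensitivity tier. The tiers and weights below are illustrative assumptions, not a standard taxonomy; real deployments would derive them from a classification model.

```python
# Hypothetical sensitivity tiers and weights, hard-coded for illustration only.
SENSITIVITY_WEIGHT = {"public": 0.1, "internal": 0.4, "confidential": 0.8, "pii": 1.0}

def risk_score(anomaly_score, sensitivity):
    """Scale a 0-to-1 behavioural anomaly score by the data it touched."""
    return anomaly_score * SENSITIVITY_WEIGHT[sensitivity]

# The identical anomaly ranks far higher against a PII table than a wiki:
wiki = risk_score(0.9, "internal")
pii = risk_score(0.9, "pii")
print(pii > wiki)  # True
```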

Endpoint ground truth. What happened on the device between the database query and the network transmission? Was a file created? Was it renamed or compressed? Which process handled the upload? Kernel-level telemetry connects the data access event to the egress event through the local operations that occurred in between, producing the chain rather than isolated endpoints.

Data lineage. Where did the data originate? How did it propagate? Which systems, identities, and processes touched it between origin and exfiltration? Lineage converts the sequence from a collection of separate telemetry events into a single narrative that security teams can act on and regulators can audit.
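In miniature, lineage is a walk over "derived from" edges, from the egress event back to the data's origin. The node names below are invented to echo the staging example earlier in this piece:

```python
# Sketch of lineage as a directed graph of "derived from" edges, walked back
# from the egress event to the origin. Node names are illustrative assumptions.

PARENT = {   # child artifact/event -> what it was derived from
    "upload:q3.zip->cloud": "file:q3.zip",
    "file:q3.zip": "file:results.csv",
    "file:results.csv": "query:customers_db",
    "query:customers_db": None,          # origin: the sensitive dataset itself
}

def lineage(egress_event):
    """Walk parent links from an egress event back to the data's origin."""
    chain, node = [], egress_event
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

print(lineage("upload:q3.zip->cloud"))
# ['upload:q3.zip->cloud', 'file:q3.zip', 'file:results.csv', 'query:customers_db']
```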

Without all four layers, detection produces either high false positive rates, where anomalous behaviour gets flagged without knowing whether it involves sensitive data, or high false negative rates, where sensitive data moves through channels that look normal from any single monitoring point.

GenAI as an emerging exfiltration surface

Generative AI has introduced an exfiltration path that most conventional controls weren't designed to see.

A user working in a browser-based GenAI application pastes customer data into a prompt for analysis. The model processes it, generates a response, and the interaction is logged nowhere in the organisation's security tooling. The data was pasted as text into a web interface, not sent as a file through a monitored channel. There's no file creation event for endpoint DLP to catch. The destination is a permitted HTTPS endpoint. Network DLP sees encrypted traffic to an allowed domain.

The data left the organisation's governed environment as part of a conversational interaction. It didn't look like exfiltration. It looked like an employee using a productivity tool.

That's the shadow AI exfiltration problem. The data movement is real. The governance model that should govern it doesn't exist yet for most enterprises. Detection requires monitoring what enters and exits GenAI interactions through browser-level visibility or endpoint telemetry, not just through traditional file and network monitoring controls.

How to prevent data exfiltration

Prevention and detection operate at different layers. Neither is sufficient alone.

Prevention controls reduce the attack surface: data classification so sensitive data carries labels that controls can enforce against, access governance so the population of users who can reach sensitive data is as small as the job requires, DLP policies at egress channels to block the transmission events that get caught, and CASB controls inside cloud applications to prevent sharing configurations that create exposure before any active exfiltration occurs.
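As a toy illustration of classification feeding enforcement, an egress check can consult a label-to-channel allowlist. The label and channel names here are invented:

```python
# Hypothetical label-to-channel allowlist; "any" permits every channel.
EGRESS_ALLOWED = {
    "public": {"any"},
    "internal": {"corp_email", "corp_drive"},
    "confidential": {"corp_drive"},
    "pii": set(),  # in this sketch, PII never leaves via self-service channels
}

def may_egress(label, channel):
    """Permit a transfer only if the channel is allowlisted for the data's label."""
    allowed = EGRESS_ALLOWED[label]
    return "any" in allowed or channel in allowed

print(may_egress("internal", "corp_drive"))  # True
print(may_egress("pii", "corp_drive"))       # False
```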

Detection controls catch what prevention misses: the sequences that bypass individual prevention controls because each step is technically permitted. Behavioural analytics, sequence detection, data lineage tracking, and endpoint telemetry catch the chain.

The real problem most organisations have isn't a gap in either prevention or detection in isolation. It's that prevention controls and detection controls aren't connected to the same intelligence model. Prevention is informed by the classification labels that exist. Detection fires on events within each tool's scope. The correlation between what data is at risk, who is accessing it, how it's moving, and where it's going lives in no single system.

That's the operational gap exfiltration reliably finds. The fix isn't more tools. It's connecting the tools that exist to a shared classification model and a shared behavioural model so that prevention policies enforce against accurate labels and detection evaluates sequences rather than events.

Frequently asked questions

What is data exfiltration?

Data exfiltration is the unauthorised transfer of sensitive data from an organisation's systems to an external destination outside its control. It can be caused by external attackers using compromised credentials, malicious insiders deliberately stealing data, or negligent employees accidentally exposing data through unsanctioned channels.

What is the difference between data exfiltration and a data breach?

Data exfiltration is the mechanism: the actual transfer of data to an unauthorised destination. A data breach is the broader incident that may involve exfiltration but can also include unauthorised access to data without its movement, exposure through misconfiguration, or disclosure through system compromise. Exfiltration typically makes a breach material because it creates irreversible data loss rather than merely a control failure.

What are common data exfiltration methods?

Cloud storage uploads to personal accounts through permitted platforms like Google Drive, Dropbox, or OneDrive. Email transmission to personal accounts or external domains. USB and removable media transfers. Web uploads through browser-based applications including GenAI tools. DNS tunnelling and other covert network channels used by sophisticated attackers. The most common enterprise exfiltration paths today are the first three; cloud storage and email in particular are typically already-permitted channels.

How do you detect data exfiltration?

Effective detection requires evaluating sequences across multiple telemetry sources, not single events. Behavioural baseline anomalies at the identity level identify unusual access patterns. Data sensitivity context determines whether the anomaly involves data worth investigating. Endpoint telemetry captures local staging and file operations between access and transmission. Data lineage tracking connects the origin to the egress event. Together these produce sequence-based detection that identifies the chain, not just individual events that appear legitimate in isolation.

What is the difference between data exfiltration and data leakage?

Data leakage typically refers to accidental or negligent exposure, often through misconfiguration, oversharing, or careless handling rather than deliberate transfer. Data exfiltration implies intentional extraction, though it can also cover scenarios where data moves outside organisational control due to negligent behaviour that wasn't deliberately malicious. The distinction matters for regulatory and legal purposes but the detection challenge is similar in both cases.

Can DLP prevent data exfiltration?

DLP reduces exfiltration risk at monitored egress channels for content that matches policy rules. It doesn't catch exfiltration through channels it doesn't monitor, through permitted destinations that appear legitimate, through content that has been transformed or renamed to bypass pattern matching, or through the staging and preparation steps that precede transmission. DLP is one prevention layer. It's not comprehensive exfiltration prevention without behavioural detection and data lineage context underneath it.

Published May 1, 2026