Data Breach
Data breach explained with causes, costs, and response strategies. Learn how to detect incidents faster, limit impact, and meet compliance requirements.
What is a Data Breach?
A data breach is any security incident in which sensitive, confidential, or protected information is accessed, exposed, stolen, or transmitted by an unauthorised party, or by an authorised party acting outside the scope of legitimate business purpose. It's not limited to external attacks. Breaches caused by insider misuse, negligent exposure, or system misconfiguration are equally material from a legal and regulatory standpoint.
The definition matters because it determines what triggers notification obligations. Under GDPR, a personal data breach must be notified to the supervisory authority without undue delay and, where feasible, within 72 hours of the controller becoming aware of it, unless the breach is unlikely to result in a risk to individuals' rights and freedoms. Under India's DPDP Act and similar frameworks, comparable timelines apply. "Becoming aware" doesn't mean finishing your investigation. It means the moment the organisation reasonably believes a breach has occurred.
That clock starts well before most organisations know what data was actually involved.
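The arithmetic of that clock is simple enough to sketch. The snippet below is a minimal illustration, assuming a 72-hour window as in the GDPR example; the function names and the timeline are hypothetical, and the window that actually applies depends on the framework in question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a 72-hour window, as in the GDPR example above.
# Other frameworks may set a different window; pass the one that applies.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at, window=NOTIFICATION_WINDOW):
    """Return the latest time a notification can be filed."""
    if aware_at.tzinfo is None:
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + window


def hours_remaining(aware_at, now, window=NOTIFICATION_WINDOW):
    """Hours left on the clock at `now`; negative once the deadline has passed."""
    return (notification_deadline(aware_at, window) - now).total_seconds() / 3600


# Hypothetical timeline: awareness on Monday 09:00 UTC,
# scoping still in progress on Wednesday 15:00 UTC.
aware_at = datetime(2025, 3, 3, 9, 0, tzinfo=timezone.utc)
now = datetime(2025, 3, 5, 15, 0, tzinfo=timezone.utc)

deadline = notification_deadline(aware_at)
remaining = hours_remaining(aware_at, now)
print(deadline.isoformat(), remaining)
```

The point of recording the awareness timestamp explicitly is that everything else in the response is measured against it, whether or not scoping has finished.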
What causes data breaches
Data breaches don't have a single cause. They cluster into three categories, each with different detection and response implications.
External attacks
An attacker compromises credentials through phishing, brute force, or a third-party breach, then uses those credentials to access and exfiltrate data. This is the category most people picture when they hear "data breach," but it's not the most common source by frequency. Credential theft also typically produces incidents that look like insider activity until attribution is clear.
Insider incidents
A current or departing employee deliberately extracts sensitive data for personal gain, competitive advantage, or malicious intent. Or, more commonly, an employee accidentally exposes data through a misconfigured sharing setting, an email sent to the wrong recipient, or sensitive data pushed to a public code repository. Both are breaches. Both trigger regulatory obligations. The accidental ones are far more frequent; the deliberate ones are typically far more expensive.
Misconfiguration and exposure
An S3 bucket is left publicly accessible. A database in a development environment that shares infrastructure with production has no authentication requirement. A collaboration platform's default sharing setting is "anyone with the link" and nobody changes it. No data was stolen in the conventional sense. But the data was accessible to anyone who found it. That's a breach.
That third category is systematic and persistent. DSPM tools exist specifically to catch it continuously, because quarterly audits don't find misconfigurations that appeared between assessment cycles.
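What a continuous check looks for can be sketched in a few lines. The example below is illustrative only: the `Resource` record, its field names, and the inventory are hypothetical, and it stands in for whatever inventory and classification data a real tool would collect, not for any particular product's API.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    kind: str              # e.g. "bucket", "database"
    public: bool           # reachable by anyone who finds it
    requires_auth: bool    # caller must authenticate
    holds_sensitive: bool  # classification says sensitive data is present


def exposure_findings(resources):
    """Return (name, reason) pairs for sensitive resources that are exposed."""
    findings = []
    for r in resources:
        if not r.holds_sensitive:
            continue  # exposure of non-sensitive data is out of scope here
        if r.public:
            findings.append((r.name, "publicly accessible"))
        elif not r.requires_auth:
            findings.append((r.name, "no authentication required"))
    return findings


inventory = [
    Resource("customer-exports", "bucket", public=True, requires_auth=False, holds_sensitive=True),
    Resource("dev-orders-db", "database", public=False, requires_auth=False, holds_sensitive=True),
    Resource("marketing-assets", "bucket", public=True, requires_auth=False, holds_sensitive=False),
    Resource("billing-db", "database", public=False, requires_auth=True, holds_sensitive=True),
]

findings = exposure_findings(inventory)
for name, reason in findings:
    print(f"{name}: {reason}")
```

Run on a schedule against a fresh inventory, a check like this flags a misconfiguration when it appears, not at the next audit.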
The cost structure of a breach
The IBM Cost of a Data Breach Report 2025 puts the global average breach cost at $4.44 million. That number is the aggregate of several distinct cost components, and understanding the breakdown matters for prioritising where security investment reduces actual exposure.
Detection and escalation costs cover the work of identifying and confirming that a breach occurred. These costs are primarily labour: analyst time, investigation tooling, forensic investigation, and legal counsel beginning to evaluate notification obligations.
Notification costs cover the actual execution of regulatory and individual notification requirements: legal review of what must be disclosed, technical preparation of notifications, and communications teams managing the public-facing response.
Post-breach response covers credit monitoring for affected individuals, regulatory cooperation and potential fines, litigation defence, and system remediation.
Lost business is the hardest to quantify but often the largest component: customer churn, increased customer acquisition costs, reputational damage affecting contract renewal rates, and partner relationship impacts.
The IBM data also shows that organisations using AI and automation in their security operations see materially lower average breach costs, driven primarily by faster identification and containment times rather than prevention. The gap between detecting a breach in the first week versus detecting it after 200 days is significant: longer dwell time means more data propagated, more systems involved, and a larger, more expensive scope to contain and disclose.
That's the operational argument for detection speed. Not just that it's better to find breaches quickly, but that every day of undetected breach activity expands the scope that regulators will eventually ask about.
The question regulators actually ask
When a regulator, auditor, or board asks about a breach, the question isn't "did this happen?" By the time they're asking, they already know it happened.
The question is: what data was involved, how sensitive was it, how many individuals are affected, and where did it go?
Those four questions define the breach scope. They determine whether notification is legally required under a given framework, how many individuals must be notified, what the regulatory exposure is, and what containment actions are sufficient.
Most organisations discover during their first significant breach investigation that they can answer the first question roughly, can't answer the third with confidence, and can't answer the fourth at all without days of manual log correlation across multiple systems.
The data was accessed from a database. That database contained customer records. How many? Approximately 400,000, based on the table size, but some of those records may be duplicates or test entries. Where did the data go after the initial access? It was exported. Where did the export go? The analyst is checking email gateway logs and DLP records. Nothing yet. The investigation is on day three of a 72-hour regulatory clock.
That's the breach response problem that data lineage tracking exists to solve. When every step of data movement is continuously mapped, blast radius is a query, not a reconstruction. The investigator opens the incident, pulls the lineage graph for the affected dataset, and sees every downstream system, derivative file, and external destination the data touched. The scope question is answered in minutes, not days.
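To make "a query, not a reconstruction" concrete, here is a minimal sketch that treats lineage as a directed graph and walks it breadth-first. The graph contents and the `blast_radius` name are hypothetical; a real lineage store would hold far more nodes and metadata, but the traversal is the same idea.

```python
from collections import deque

# Hypothetical lineage edges: source -> places the data was copied or sent.
LINEAGE = {
    "db.customers": ["export.csv", "warehouse.customers_daily"],
    "export.csv": ["laptop-114:/tmp/export.zip"],
    "laptop-114:/tmp/export.zip": ["cloud-storage:/shared/export.zip"],
    "warehouse.customers_daily": ["dashboard.churn"],
}


def blast_radius(graph, start):
    """Breadth-first walk returning every downstream node reachable from start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # guard against cycles and repeated paths
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = blast_radius(LINEAGE, "db.customers")
for location in affected:
    print(location)
```

The traversal is cheap. The expensive part is having the edges recorded before the incident, which is what continuous mapping provides.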
The anatomy of a modern breach
Not every breach looks like an intrusion. The Verizon 2025 DBIR reports that ransomware was present in 44% of breaches reviewed, but many of the most expensive data incidents don't involve ransomware or obvious malware at all. They involve authorised access used in ways that diverge from business intent.
A legitimate employee queries a dataset they have permission to access. The query is larger than their historical norm but within policy. The results are exported locally, compressed, and uploaded to a cloud storage service IT has already approved for business use. Two weeks later, a regulator asks whether personal data was exposed during that period.
At that moment, the organisation needs to answer: was that specific upload a breach? What data was in it, at the semantic level? Where did the file go? Was it downloaded? Did it propagate further?
Those questions are not answerable from database logs alone. They require the chain: access event from DAM, export operation from endpoint telemetry, upload from network logs or cloud storage access records, all correlated against the classification of the data involved. Building that chain manually under investigation time pressure is the dominant cost in data breach response for organisations without continuous lineage tracking.
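The chain described above can be sketched as a join across three event sources. Everything here is an assumption made for illustration: the event shapes, the field names, the 30-minute window, and the use of a file hash as the link between export and upload. Real telemetry is rarely this clean, which is why doing the join by hand is slow.

```python
from datetime import datetime, timedelta

# Hypothetical, already-normalised events from three separate systems.
access_events = [  # database activity monitoring
    {"user": "jdoe", "dataset": "db.customers", "at": datetime(2025, 3, 3, 10, 0)},
]
export_events = [  # endpoint telemetry
    {"user": "jdoe", "file_hash": "abc123", "at": datetime(2025, 3, 3, 10, 5)},
    {"user": "asmith", "file_hash": "fff999", "at": datetime(2025, 3, 3, 10, 6)},
]
upload_events = [  # network logs or cloud storage access records
    {"file_hash": "abc123", "destination": "cloud-storage:/shared",
     "at": datetime(2025, 3, 3, 10, 20)},
]
# Classification of each dataset (assumed to exist before the incident).
classification = {"db.customers": ["name", "email", "payment_id"]}


def build_chains(accesses, exports, uploads, labels, window=timedelta(minutes=30)):
    """Link access -> export -> upload for the same user and the same file."""
    chains = []
    for a in accesses:
        for e in exports:
            same_user = e["user"] == a["user"]
            soon_after = timedelta(0) <= e["at"] - a["at"] <= window
            if not (same_user and soon_after):
                continue
            for u in uploads:
                if u["file_hash"] == e["file_hash"] and u["at"] >= e["at"]:
                    chains.append({
                        "user": a["user"],
                        "dataset": a["dataset"],
                        "data_types": labels.get(a["dataset"], []),
                        "file_hash": e["file_hash"],
                        "destination": u["destination"],
                    })
    return chains


chains = build_chains(access_events, export_events, upload_events, classification)
print(chains)
```

A time window and a shared user are weak links on their own; the file hash is what ties the upload to a specific export in this sketch.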
What effective breach response requires
Three things determine whether a breach response is defensible or improvised.
Scope accuracy
Knowing what data was actually involved, not what table it came from. A table-level answer tells a regulator "customer data was accessed." A semantic-level answer tells them "the accessed data contained 847,000 records including names, email addresses, and payment identifiers for customers in three jurisdictions." The second answer is what notification decisions require. It comes from classification, not from table metadata.
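As a rough sketch of the difference, the snippet below summarises scope from classified records instead of from a row count. The record layout, the `is_test` flag, and the sample data are hypothetical; the assumption is that classification has already labelled each record with the data types it contains.

```python
from collections import Counter

# Hypothetical classified records from the affected table.
records = [
    {"id": 1, "jurisdiction": "EU", "fields": {"name", "email"}, "is_test": False},
    {"id": 2, "jurisdiction": "IN", "fields": {"name", "email", "payment_id"}, "is_test": False},
    {"id": 2, "jurisdiction": "IN", "fields": {"name", "email", "payment_id"}, "is_test": False},  # duplicate row
    {"id": 3, "jurisdiction": "EU", "fields": {"email"}, "is_test": True},  # test entry
]


def scope_summary(rows):
    """Count distinct real individuals, the data types involved, and jurisdictions."""
    seen_ids = set()
    data_types = set()
    per_jurisdiction = Counter()
    for row in rows:
        if row["is_test"] or row["id"] in seen_ids:
            continue  # skip test entries and duplicate rows
        seen_ids.add(row["id"])
        data_types |= row["fields"]
        per_jurisdiction[row["jurisdiction"]] += 1
    return {
        "individuals": len(seen_ids),
        "data_types": sorted(data_types),
        "per_jurisdiction": dict(per_jurisdiction),
    }


summary = scope_summary(records)
print(summary)
```

A raw row count here would report four records; the classified summary reports two individuals across two jurisdictions, which is the figure a notification decision needs.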
Propagation tracking
Knowing where the data went after access. Data rarely stays in one place after an incident. It was exported, emailed, shared, or uploaded somewhere. Blast radius analysis, built from continuous data lineage, maps those downstream locations before the investigation starts. Without it, "containment" means containing the source without knowing how far the data has already spread.
Evidence integrity
The records used to scope, contain, and notify must be tamper-resistant and contemporaneous. Evidence assembled under investigation pressure from logs that were never designed to be correlated carries significantly less regulatory weight than immutable audit trails produced as a continuous byproduct of normal monitoring operations.
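One common way to make an audit trail tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is a minimal illustration under that assumption, not a description of any specific product; the entry layout and function names are hypothetical, and a production system would also need protected storage and trusted timestamps.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_entry(log, event):
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log):
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"user": "jdoe", "action": "query", "dataset": "db.customers"})
append_entry(log, {"user": "jdoe", "action": "export", "file_hash": "abc123"})

intact = verify(log)                      # chain is consistent as written
log[0]["event"]["user"] = "someone-else"  # simulate after-the-fact editing
tamper_detected = not verify(log)
print(intact, tamper_detected)
```

The chain does not stop someone from editing a record. It makes the edit detectable, which is the property an investigator or regulator can check.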
Frequently asked questions
What is a data breach?
A data breach is any incident in which sensitive, confidential, or protected data is accessed, exposed, stolen, or transmitted without authorisation, or by an authorised party acting outside legitimate business purpose. This includes external attacks, insider incidents, accidental exposure, and misconfiguration that makes data accessible to unintended parties.
What is the average cost of a data breach?
According to the IBM Cost of a Data Breach Report 2025, the global average cost of a data breach is $4.44 million. Costs vary significantly by industry, geography, and response speed. Organisations using AI and automation in security operations see materially lower average costs, driven primarily by faster identification and containment of the incident.
What is the difference between a data breach and a data leak?
A data leak typically refers to accidental or negligent exposure of data, often through misconfiguration, oversharing, or careless handling, without necessarily involving any unauthorised access. A data breach implies some form of unauthorised access or action, though the boundary is often blurred in regulatory frameworks. Both can trigger notification obligations depending on what data was exposed and to whom.
What triggers a data breach notification requirement?
Under GDPR, notification to the supervisory authority is required without undue delay and, where feasible, within 72 hours of the data controller becoming aware that a personal data breach has occurred, unless the breach is unlikely to result in a risk to individuals' rights and freedoms. Under India's DPDP Act and similar frameworks, comparable timelines apply. The notification clock starts when the organisation reasonably believes a breach has occurred, not when the investigation is complete.
What causes most data breaches?
External credential compromise, including phishing, credential stuffing, and third-party breach exposure, is the most publicised cause. By frequency, negligent insider incidents, including accidental sharing, misconfiguration, and data sent to wrong recipients, cause more incidents. By cost per incident, malicious insider action and targeted external attacks produce the highest average costs. Misconfiguration creating unintended public data exposure is systematic and persistent across cloud-heavy environments.
How long does it take to detect a data breach?
The IBM 2025 data reports that average breach identification and containment time remains in the hundreds of days for organisations without AI and automation in their security operations. Faster detection requires behavioural analytics to identify anomalous access patterns, data lineage to track propagation without manual correlation, and continuous evidence generation so investigation starts with existing records rather than building them from scratch.
