Unstructured Data Security

Unstructured data security protects sensitive content in documents, emails, and files where most organisations have the largest visibility gaps and weakest coverage.

What is Unstructured Data Security?

Unstructured data security is the discipline of discovering, classifying, monitoring, and protecting sensitive information stored in formats that don't conform to predefined data models: documents, spreadsheets, PDFs, emails, presentation files, collaboration platform messages, images, logs, and any content where the structure isn't enforced at the storage layer.

It's the harder half of data security. Structured database security (protecting rows, columns, and tables in defined schemas) is a mature discipline with well-established tooling. Unstructured data security is where most organisations have the largest visibility gaps, the weakest classification coverage, and the highest concentration of actual data exposure.

That gap exists for a simple reason: most security tooling was built for structured environments. It assumes data has known fields, known formats, and known locations. Unstructured data has none of those properties. It sits in file systems, collaboration platforms, email systems, and endpoints in formats that require reading and interpretation to understand, not just querying.

Why unstructured data is where most sensitive information lives

By file count, the vast majority of an enterprise's data estate is unstructured. Consider what most employees actually produce and store: Word documents, Excel spreadsheets, PowerPoint presentations, PDFs, email threads, Slack conversations, Google Docs, Notion pages, shared drive folders. Every one of those is unstructured or semi-structured content.

And that's precisely where sensitive information appears most naturally. A contract negotiation lives in a Word document. An employee's salary review lives in a spreadsheet. Customer contact details appear in a sales team's Excel export. Legal advice lives in email threads. M&A strategy lives in presentation decks. Source code with embedded credentials lives in text files.

None of that data starts its life in a database with a defined schema. It starts as human-generated content in formats that make sense to the people creating them. The sensitive information is embedded in that content, not stored in a structured field that a classification rule can find by column name.

Why traditional security controls struggle with unstructured data

Three characteristics of unstructured data make conventional security controls less effective.

No schema to query against. Database security tools read column names, data types, and values from known structures. They can be told "classify all data in the ssn column" or "monitor all queries against the customers table." Unstructured data has no equivalent schema. A Word document containing a customer's social security number doesn't announce that fact through its file structure. The only way to know the document contains an SSN is to read and understand its content.

Sensitive information appears in context, not in isolation. In a structured database, a PII field is a PII field. In an unstructured document, PII appears embedded in natural language: "Please process the refund for John Smith at 42 Maple Street, whose account number is 4891-2234-5566." Classifying that sentence correctly requires understanding it, not just matching it against a format rule. A regex such as \d{4}-\d{4}-\d{4} would match the account number, but without understanding the surrounding text it couldn't distinguish that number from a product reference code or a tracking number in the same format.
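The limitation is easy to reproduce. The following minimal sketch applies the format rule from the paragraph above to the refund sentence and to a second, made-up sentence about a parcel (the tracking number and the helper name find_matches are illustrative assumptions, not taken from any product).

```python
import re

# The format rule discussed in the text: three groups of four digits.
ACCOUNT_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}\b")

samples = [
    "Please process the refund for John Smith at 42 Maple Street, "
    "whose account number is 4891-2234-5566.",
    # Hypothetical non-sensitive sentence that happens to use the same format.
    "Your parcel with tracking number 7730-1182-9045 ships on Monday.",
]


def find_matches(text: str) -> list[str]:
    """Return every substring that fits the format rule."""
    return ACCOUNT_PATTERN.findall(text)


results = [find_matches(sentence) for sentence in samples]
for sentence, hits in zip(samples, results):
    print(hits, "<-", sentence)
```

Both sentences produce exactly one match. The pattern has no way to tell which of the two numbers is an account number, because that information is carried by the surrounding words rather than by the digits.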

Data transforms continuously. A customer record in a structured database stays in the database. Unstructured data proliferates: it gets attached to emails, downloaded to endpoints, shared as links, copied into new documents, exported to different file formats, and uploaded to collaboration platforms. Each transformation potentially moves the sensitive information to a new location without any classification label following it. Suppose a document starts on SharePoint with a sensitivity label. It gets downloaded to a laptop, modified, and reuploaded. The modified version inherits no classification from the original. The same sensitive content now exists in two places, one of which is unclassified.

What unstructured data security requires

Effective unstructured data security needs four capabilities working together, each addressing a specific characteristic of the problem.

Content-aware discovery. Finding unstructured data assets isn't just enumerating files and folders. It's understanding what those files contain. A discovery programme for unstructured data needs to read document content, extract meaning, and identify sensitive information embedded in natural language and mixed formats. That means coverage across the full range of formats in use: .docx, .xlsx, .pdf, .pptx, .txt, .csv, .json, image files containing text, email messages, collaboration platform content. Missing any format category creates a blind spot.
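One way to make format blind spots visible is to compare the file types actually present in an inventory against the types a scanner can read. The sketch below is illustrative only: the SUPPORTED set, the coverage_report helper, and the sample paths are assumptions, and a real discovery tool would also have to open each file rather than trust its extension.

```python
from collections import Counter
from pathlib import PurePath

# Assumption: extensions for which a content extractor has been configured.
SUPPORTED = {".docx", ".xlsx", ".pdf", ".pptx", ".txt", ".csv", ".json"}


def coverage_report(paths: list[str]) -> tuple[Counter, Counter]:
    """Count files per extension, split into readable and unreadable formats."""
    covered, blind = Counter(), Counter()
    for path in paths:
        ext = PurePath(path).suffix.lower() or "(none)"
        if ext in SUPPORTED:
            covered[ext] += 1
        else:
            blind[ext] += 1
    return covered, blind


# Hypothetical inventory of discovered files.
inventory = [
    "hr/salary_review.xlsx",
    "legal/contract_draft.docx",
    "sales/export.csv",
    "scans/passport.PNG",
    "mail/thread.eml",
    "scans/id_card.png",
]

covered, blind = coverage_report(inventory)
print("covered:", dict(covered))
print("blind spots:", dict(blind))
```

In this made-up inventory the image scans and the email message fall outside the supported set, which is the kind of gap the paragraph above warns about.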

Semantic classification rather than pattern matching. The classification technique matters more for unstructured data than for structured data. Pattern matching works against known formats in known fields. Unstructured data constantly defeats it: a regex for credit card numbers fires on any 16-digit string, producing false positives from order numbers, tracking codes, and phone numbers; a keyword rule for "salary" fires on any document mentioning salaries, including published industry reports; a rule for email addresses fires on any text containing the @ symbol with surrounding characters. Semantic classification — using NLP models to understand what content means in context — produces the accuracy that unstructured environments require.
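To illustrate why context changes the outcome, the sketch below labels each 16-digit match by the words around it. It is a deliberately crude stand-in for semantic classification: the cue-word lists, the 40-character window, and the classify_candidates helper are assumptions made for this example, and a production classifier would use a trained NLP model rather than keyword sets.

```python
import re

CARD_LIKE = re.compile(r"\b\d{16}\b")

# Assumption: tiny hand-picked cue lists standing in for a trained model.
PAYMENT_CUES = {"card", "visa", "mastercard", "cvv", "expiry", "payment"}
BENIGN_CUES = {"order", "tracking", "shipment", "invoice", "reference"}


def classify_candidates(text: str, window: int = 40) -> list[tuple[str, str]]:
    """Label each 16-digit match using the words that surround it."""
    labelled = []
    for match in CARD_LIKE.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window].lower()
        words = set(re.findall(r"[a-z]+", context))
        if words & BENIGN_CUES:
            label = "likely not a card number"
        elif words & PAYMENT_CUES:
            label = "likely card number"
        else:
            label = "needs review"
        labelled.append((match.group(), label))
    return labelled


# Hypothetical sentences; the first uses a widely published test card number.
card_text = "Customer paid with Visa card 4111111111111111, expiry 09/27."
order_text = "Order 2024051800012345 left the warehouse today."
bare_text = "Use 1234567812345678 when you call back."

for text in (card_text, order_text, bare_text):
    print(classify_candidates(text))
```

A bare regex would flag all three numbers identically. Even this simple look at neighbouring words separates the payment sentence from the order sentence and routes the ambiguous one to review, which is the behaviour semantic classification aims for at much higher accuracy.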

Persistent classification labels that survive format changes. When a classified document is exported from SharePoint, the classification should travel with the file. When a document is modified and reuploaded, the classification should be reassessed. When content from a classified document is copied into a new document, the new document should be evaluated for inherited sensitivity. Persistent labelling that survives document transformations is what makes classification meaningful in the collaborative, constantly moving environment where unstructured data lives.
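One possible way to check a new document for inherited sensitivity is to fingerprint the paragraphs of classified documents and look for the same fingerprints elsewhere. The sketch below assumes exact paragraph reuse after case and whitespace normalisation; the LabelRegistry class and the sample texts are hypothetical, and real copies that have been reworded would need fuzzy or semantic matching instead.

```python
import hashlib


def fingerprints(text: str) -> set[str]:
    """Hash each non-empty paragraph after normalising case and whitespace."""
    prints = set()
    for paragraph in text.split("\n\n"):
        normalised = " ".join(paragraph.lower().split())
        if normalised:
            prints.add(hashlib.sha256(normalised.encode("utf-8")).hexdigest())
    return prints


class LabelRegistry:
    """Hypothetical registry mapping paragraph fingerprints to labels."""

    def __init__(self) -> None:
        self._labels: dict[str, str] = {}

    def register(self, text: str, label: str) -> None:
        """Record the label for every paragraph of a classified document."""
        for fp in fingerprints(text):
            self._labels[fp] = label

    def inherited_labels(self, text: str) -> set[str]:
        """Return labels of known classified paragraphs reused in this text."""
        return {self._labels[fp] for fp in fingerprints(text) if fp in self._labels}


registry = LabelRegistry()
original = "Q3 acquisition target: Acme Ltd.\n\nOffer price: 12.50 per share."
registry.register(original, "confidential")

# A new document that pastes one paragraph with different spacing and case.
copied = "Notes from Monday.\n\noffer price:   12.50 per share."
unrelated = "Lunch menu for Friday."

print(registry.inherited_labels(copied))
print(registry.inherited_labels(unrelated))
```

The copied document picks up the "confidential" label because one of its paragraphs matches a registered fingerprint, while the unrelated document inherits nothing.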

DLP coverage at all channels where unstructured data moves. Unstructured content moves through email, web upload, cloud sync, messaging platforms, USB devices, and print. DLP coverage for unstructured data needs to reach all of those channels with content inspection that works against the file types employees actually use, not just the structured exports that rule-based DLP was originally designed for.

Where unstructured data security breaks down in practice

The failure mode security teams encounter consistently is false confidence. An organisation deploys a DLP tool, configures policies for PCI and PII, and believes its data protection programme covers the organisation's sensitive content. Then an incident reveals customer data in a spreadsheet attached to an email that the DLP policy never examined, because the tool's content inspection was configured for structured database exports and didn't apply to .xlsx attachments in the email workflow.

Not because the tool is broken. Because nobody configured it to look at that format in that channel.

Unstructured data security requires explicit decisions about every format type and every channel. The default posture of most DLP tools is to apply structured data policies and let everything else through. Building a comprehensive programme means inventorying every place unstructured sensitive content is created, stored, and transmitted, and ensuring classification and enforcement coverage reaches each of those locations.

That inventory is usually larger and more complex than organisations expect when they start. Collaboration platforms alone (Microsoft 365, Google Workspace, Slack, Notion, Confluence) each contain unstructured content, each require different API integrations to scan, and each have different sharing models that affect how content moves and who can access it.
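The "explicit decisions about every format type and every channel" can be tracked as a simple matrix. The sketch below enumerates every format and channel pair and reports the ones with no content inspection configured. The format list, channel list, configured set, and coverage_gaps helper are all assumptions chosen for illustration, not the configuration model of any particular DLP tool.

```python
from itertools import product

# Assumption: a small sample of formats and channels to audit.
FORMATS = [".docx", ".xlsx", ".pdf"]
CHANNELS = ["email", "web upload", "cloud sync", "usb"]

# Assumption: pairs for which content inspection was explicitly enabled.
configured = {
    (".docx", "email"),
    (".pdf", "email"),
    (".pdf", "web upload"),
    (".xlsx", "cloud sync"),
}


def coverage_gaps(formats, channels, enabled):
    """Return every (format, channel) pair that has no inspection policy."""
    return [pair for pair in product(formats, channels) if pair not in enabled]


gaps = coverage_gaps(FORMATS, CHANNELS, configured)
for file_format, channel in gaps:
    print(f"no inspection: {file_format} via {channel}")
```

In this made-up configuration, .xlsx attachments sent by email appear among the gaps, mirroring the incident described earlier in this section. Listing the pairs explicitly turns an assumed coverage into one that can be reviewed.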

Unstructured data security in regulated environments

For regulated industries, unstructured data security carries direct compliance implications that structured database protection alone doesn't address.

Healthcare organisations have PHI in clinical notes, discharge summaries, and patient letters — all unstructured documents. HIPAA's minimum necessary standard and breach notification obligations apply to that content regardless of the format it's stored in.

Financial services firms have MNPI in analyst reports, deal memos, and board presentation decks. The insider trading and market abuse implications of that information leaking don't depend on whether it's stored in a structured system or a Word document.

Law firms have privileged communications and client confidential information in email threads and document drafts. Legal professional privilege doesn't help if the content is accessible to people who shouldn't have access to it.

In each case, the regulatory obligation applies to the content, not the format. But enforcement is only possible when the security programme actually reaches the unstructured environments where that content lives.

Frequently asked questions

What is unstructured data security?

Unstructured data security is the discipline of discovering, classifying, monitoring, and protecting sensitive information in formats without predefined schemas: documents, spreadsheets, emails, PDFs, presentation files, collaboration platform content, and other file-based data. It requires content-aware classification techniques that understand what unstructured content means, rather than querying known fields against known formats.

What is the difference between structured and unstructured data security?

Structured data security protects information in defined schemas (database tables, rows, and columns) where the location and format of sensitive data are predictable and queryable. Unstructured data security protects content where sensitive information is embedded in human-generated text, images, and documents without predictable location or format. Structured security relies on schema-based querying; unstructured security requires content reading and semantic interpretation.

Why is unstructured data harder to secure than structured data?

Unstructured data has no schema to query, meaning sensitive content can only be identified by reading and understanding the content itself. It transforms and proliferates constantly across email, collaboration platforms, endpoints, and cloud storage. Classification labels don't automatically persist through format changes. And most security tooling was designed for structured database environments, requiring explicit extension to cover the file types and channels where unstructured content lives.

What types of files contain unstructured sensitive data?

Common file types containing sensitive unstructured data include: Word documents (.docx) with contract terms and customer correspondence; Excel spreadsheets (.xlsx) with customer lists and financial data; PDFs of invoices, reports, and contracts; presentation files (.pptx) with strategy and financial projections; text files and JSON files with embedded credentials; email messages with customer PII; and collaboration platform content including Slack messages and shared documents.

Published May 1, 2026