Databricks, October 28, 02:49
Databricks Launches AI-Driven Data Classification to Simplify Sensitive Data Management

 

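As data platforms scale, sensitive information is often overlooked. Databricks has launched the Public Preview of Data Classification on AWS, Azure Databricks, and GCP, which uses an agentic AI system to automatically discover and tag sensitive data across all catalogs. The tool provides continuous visibility into where personally identifiable information (PII) resides, helping organizations stay compliant, automate protection, and share data across teams with confidence. Through efficient, intelligent scanning and precise classification, Databricks Data Classification significantly improves the accuracy of sensitive data discovery while lowering scanning costs, turning manual audits into continuous visibility and building trust for data teams.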

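🤖 **Agentic AI for precise classification**: Databricks Data Classification uses an agentic AI system that combines pattern recognition, metadata, and large language models to achieve up to 60% higher accuracy than traditional approaches such as regular expressions. Data stays in the user's environment throughout, in line with Databricks AI security controls.

⚡ **Efficient, intelligent scanning at enterprise scale**: The system scans the entire catalog once, then rescans only new or changed tables and columns. Unity Catalog lineage ensures critical datasets are scanned incrementally, so new PII is caught as it appears. Since the Beta launch, detection speed has improved significantly and scanning costs have dropped by up to 75%.

🔒 **Automated sensitive data protection and access control**: With automated classification, organizations can turn manual audits into continuous visibility, gaining audit-readiness, full data lineage, and efficient handling of data deletion requests. Combined with ABAC (Attribute-Based Access Control) policies, sensitive columns can be masked or encrypted automatically and opened only to specific teams, giving secure access that scales.

📊 **Stronger data governance and trust**: The tool automates PII detection and remediation workflows to help keep data accurate and compliant. It helps organizations build a trustworthy data foundation that teams can use with confidence, freeing people from tedious data management to focus on higher-value business initiatives.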

Why Sensitive Data Gets Missed

As organizations scale their data platforms, sensitive information often hides in plain sight. New tables land every day, regulatory landscapes are becoming increasingly complex, and the stakes are higher than ever. According to the GDPR Enforcement Tracker Report, GDPR fines alone exceeded €5.6 billion in 2025, a growth of €1.17 billion since 2024.

Manual discovery methods simply don’t scale. What worked for hundreds of tables fails at thousands. The result? Compliance blind spots, costly audits, and stalled democratization of data. The fundamental problem is that you can’t protect what you can’t find.

Introducing Agentic Data Classification

Today, we’re excited to announce the Public Preview of Databricks Data Classification on AWS, Azure Databricks, and GCP.

Data Classification uses an agentic AI system to automatically discover and tag sensitive data across all your catalogs. It provides continuous visibility into where personally identifiable information (PII) resides, enabling you to stay compliant, automate protection, and confidently share data across teams, even as your data grows. 

Data Classification delivers comprehensive, automated PII detection across our expanding data environment, ensuring sensitive information is clearly identified and enabling consistent protection. This approach not only helps secure sensitive assets but also reduces manual workloads. As we're rolling this out more broadly, we're looking forward to freeing up our teams for higher-value initiatives. —  Gregg Rinsler, Sr. Director of Data Governance, FanDuel

Turn manual audits into continuous visibility

With automated classification in place, your teams can shift from manual classification to strategic governance:

  • Audit-readiness: Pull complete logs to show where PII resides and exactly which users and groups have access to it.
  • Full lineage: Trace exactly when PII exists and where it flows downstream. Don’t risk missing spots where PII accidentally got copied into downstream datasets.
  • Data deletion requests: Locate and clean up all instances of user data across all your tables.
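
To make the lineage point concrete, here is a minimal Python sketch of tracing where PII can flow downstream. It is an illustration only, not how Unity Catalog implements lineage: the `LINEAGE` map, the table names, and the `downstream_tables` helper are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables derived from it.
LINEAGE = {
    "raw.customers": ["silver.customers_clean", "silver.orders_enriched"],
    "silver.customers_clean": ["gold.customer_360"],
    "silver.orders_enriched": ["gold.revenue_daily"],
    "gold.customer_360": [],
    "gold.revenue_daily": [],
}


def downstream_tables(source, lineage):
    """Return every table reachable from `source`, in breadth-first order."""
    seen = {source}  # guards against cycles that lead back to the source
    order = []
    queue = deque(lineage.get(source, []))
    while queue:
        table = queue.popleft()
        if table in seen:
            continue
        seen.add(table)
        order.append(table)
        queue.extend(lineage.get(table, []))
    return order


# Tables that may hold copies of PII originating in raw.customers.
affected = downstream_tables("raw.customers", LINEAGE)
```

The same traversal would serve a deletion request: every table in `affected` is a candidate to check for the user's data.
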

Every data team's currency is trust, which is "consistency over time". Data Classification helps deliver that trust by scanning our data estate for PII and automating remediation workflows. The result is verified, compliant data that teams can confidently rely upon. — Sam Shah, VP of Engineering, Databricks Data Team

How Data Classification works

Data Classification is designed to bring automated, agentic classification to all your data. Here’s how it works:

Agentic AI for precise classification: Combines proven pattern recognition, metadata, and large language models, with up to 60% higher accuracy than regex-only tools. Your data never leaves your environment, in line with Databricks AI security controls (AWS | Azure | GCP).
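
As a rough illustration of combining signals, the sketch below scores a column using value patterns and column-name metadata, with a placeholder hook where a language model could weigh in on ambiguous cases. It is not Databricks' implementation: the patterns, hint lists, weights, and the `classify_column` and `llm_judge` names are assumptions made for this example. Only the `class.*` tag names come from this post.

```python
import re

# Illustrative patterns only; real detectors are far more thorough.
VALUE_PATTERNS = {
    "class.email_address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "class.phone_number": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}
# Hypothetical column-name hints standing in for richer metadata.
NAME_HINTS = {
    "class.email_address": ("email", "e_mail"),
    "class.phone_number": ("phone", "mobile"),
    "class.name": ("first_name", "last_name", "full_name", "customer_name"),
}


def classify_column(column_name, samples, llm_judge=None):
    """Return (best_tag, score) from value patterns and column-name hints."""
    name = column_name.lower()
    values = [s for s in samples if s]
    best_tag, best_score = None, 0.0
    for tag, hints in NAME_HINTS.items():
        pattern = VALUE_PATTERNS.get(tag)
        match_rate = 0.0
        if pattern and values:
            match_rate = sum(bool(pattern.match(v)) for v in values) / len(values)
        name_hit = any(hint in name for hint in hints)
        # Assumed weighting: sample values count more than the column name.
        score = 0.6 * match_rate + 0.4 * (1.0 if name_hit else 0.0)
        if score > best_score:
            best_tag, best_score = tag, score
    # Ambiguous cases could be escalated to a language model; stubbed here.
    if llm_judge is not None and 0.0 < best_score < 0.6:
        best_tag = llm_judge(column_name, values) or best_tag
    return best_tag, round(best_score, 2)


email_result = classify_column("contact_email", ["a@b.com", "c@d.org"])
name_result = classify_column("customer_name", ["Ada Lovelace"])
```

A regex-only tool would miss `customer_name` entirely, since free-text names match no pattern. The metadata signal, and in a real system the language model, is what closes that gap.
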

Efficient and intelligent scanning for enterprise scale: Scans your entire catalog once, then only rescans new or changed tables and columns. Unity Catalog lineage ensures critical datasets are incrementally scanned, so PII is caught as it appears. Since our initial Beta launch, we’ve significantly improved detection speed and reduced scanning costs by up to 75%. This system is battle-tested to ensure high performance as your data platform grows.
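
The scan-once, then rescan-only-what-changed behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the product's scanner: it assumes each table exposes an integer version that increases on change, and `tables_to_scan`, `run_scan`, and the table names are hypothetical.

```python
def tables_to_scan(catalog, last_scanned):
    """Pick tables that are new or whose version changed since the last scan.

    catalog:      {table_name: current_version}
    last_scanned: {table_name: version_at_last_scan}
    """
    return sorted(
        name
        for name, version in catalog.items()
        if last_scanned.get(name) != version
    )


def run_scan(catalog, last_scanned, scan_fn):
    """Scan only the pending tables, then record the versions just scanned."""
    pending = tables_to_scan(catalog, last_scanned)
    for name in pending:
        scan_fn(name)
        last_scanned[name] = catalog[name]
    return pending


catalog = {"sales.orders": 3, "sales.customers": 7}
state = {}
scanned = []

first_pass = run_scan(catalog, state, scanned.append)   # full scan
catalog["sales.orders"] = 4       # an existing table changes
catalog["hr.employees"] = 1       # a new table lands
second_pass = run_scan(catalog, state, scanned.append)  # incremental scan
```

The second pass touches two tables rather than three. That saving is the idea behind incremental scanning, and it grows with the size of the catalog.
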

Review and validation: Get complete visibility of the columns containing PII, and who currently has access to this data. Our focused review UI surfaces high-confidence detections with sample data, letting you easily bulk-apply tags. Full results are stored in system tables for custom reporting or tagging. 

Data Classification is transforming our compliance approach by automating PII detection. We use classification results along with an authorization workflow via Databricks Apps to enable Just-In-Time access controls. This allows us to keep sensitive data accessible only when needed. We eliminated the manual efforts towards this, and instead have created automated detection and protection across our entire data residing in the Databricks Platform. — Abhijit Joshi, Staff Data Engineer, Oportun

Build Scalable Access Control 

Once you know where sensitive data lives, it’s easier to protect, and access can scale safely.

  • Automate sensitivity tiers: Automate existing access request workflows where users are approved based on dataset sensitivity. For example, use Data Classification tags to automatically categorize tables by your organization’s sensitivity levels (e.g., confidential, restricted, internal, or public). 
  • Scale governance with ABAC policies: Attribute-Based Access Control (ABAC) policies automatically mask or encrypt sensitive columns. For example, set up a policy that masks all columns tagged as [class.name], [class.email_address], and [class.phone_number] for everyone except your security team. Once configured, this policy automatically applies to data tagged as sensitive, ensuring consistent data protection that scales with your business.

  • Use ABAC to securely open up access: Consider a customer transactions table that contains both sensitive columns (e.g., customer_name, email, phone) and non-sensitive columns (e.g., transaction_id or customer_id). ABAC policies mask only the sensitive columns while leaving non-sensitive fields open. There is no need to block entire tables or maintain complex view logic.
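
To show the effect of such a policy, here is a small Python simulation of tag-based column masking. It models the behavior only and is not Databricks ABAC syntax: `apply_policy`, the `security-team` group name, the `****` mask value, and the sample row are hypothetical, while the `class.*` tags are the ones named above.

```python
SENSITIVE_TAGS = {"class.name", "class.email_address", "class.phone_number"}
EXEMPT_GROUPS = {"security-team"}  # hypothetical group name


def apply_policy(row, column_tags, user_groups):
    """Return a copy of `row` with tagged columns masked for non-exempt users."""
    if EXEMPT_GROUPS & set(user_groups):
        return dict(row)
    return {
        column: "****" if column_tags.get(column) in SENSITIVE_TAGS else value
        for column, value in row.items()
    }


# Tags as a classifier might have applied them; untagged columns are absent.
column_tags = {
    "customer_name": "class.name",
    "email": "class.email_address",
    "phone": "class.phone_number",
}
row = {
    "transaction_id": "t-1001",
    "customer_id": 42,
    "customer_name": "Ada Lovelace",
    "email": "ada@example.com",
    "phone": "+1 415 555 0100",
}

analyst_view = apply_policy(row, column_tags, ["analysts"])
security_view = apply_policy(row, column_tags, ["security-team"])
```

Because the policy keys off tags rather than column names, a newly classified column is masked without any change to the policy itself.
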

What’s next?

Here's what's on our roadmap in the coming months:

  • API and Terraform support (coming to Public Preview soon)
  • Built-in Regional and Domain-Specific Classifiers like PHI and PCI (coming to Public Preview soon)
  • Custom Classification Rules for business-specific data patterns. We’re using agentic AI systems to develop patterns specific to your company's data (in Private Preview)

Get Started with Public Preview Today

Ready to transform manual processes into automated Data Classification? Get started with our resources below: 

  • Read our product documentation (AWS | Azure | GCP)
  • The product is HIPAA compliant and follows trust and safety standards of Databricks AI features. Read more in our security FAQs here (AWS | Azure | GCP).
  • Reach out to your account representative to sign up for our custom classifiers Private Preview
  • Get started today and enable Data Classification from any Catalog Details tab
