MarkTechPost@AI 2024年09月15日
Piiranha-v1 Released: A 280M Small Encoder Open Model for PII Detection with 98.27% Token Detection Accuracy, Supporting 6 Languages and 17 PII Types, Released Under MIT License
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

Piiranha-v1是一款用于检测和保护个人信息的模型,具有多种优势,在数据隐私保护方面意义重大。

🎯Piiranha-v1是一款轻量级的280M编码器模型,在MIT许可下发布,可检测多种语言文本中的个人可识别信息(PII),检测率高达98.27%,整体分类准确率为99.44%,能识别17种PII类型。

💪Piiranha-v1基于强大的DeBERTa-v3架构,支持六种语言,在识别电子邮件、电话号码、密码和用户名等方面表现出色,在数字通信和交易环境中能有效保护敏感信息。

🌟该模型具有灵活性,能在数据类别可能误分类的情况下仍正确识别PII信息,适用于现实世界中数据不一致的情况,是一个强大的工具。

🤝Piiranha-v1由Internet Integrity Initiative Team与HuggingFace和Akash Network等合作训练,使用了包含超过40万个掩码PII记录的综合数据集,训练过程精心规划,确保了模型的高精度和效率。

The Internet Integrity Initiative Team has made a significant stride in data privacy by releasing Piiranha-v1, a model specifically designed to detect and protect personal information. This tool is built to identify personally identifiable information (PII) across a wide variety of textual data, providing an essential service at a time when digital privacy concerns are paramount.

Piiranha-v1, a lightweight 280M encoder model for PII detection, has been released under the MIT license, offering advanced capabilities in detecting personal identifiable information. Supporting six languages, English, Spanish, French, German, Italian, and Dutch, Piiranha-v1 achieves near-perfect detection, with an impressive 98.27% PII token detection rate and a 99.44% overall classification accuracy. It excels in identifying 17 types of PII, with 100% accuracy for emails and near-perfect precision for passwords. Piiranha-v1 is based on the powerful DeBERTa-v3 architecture. This makes it a versatile tool suitable for global data protection efforts.

The model’s performance in detecting various PII types is particularly noteworthy. For example, it has near-perfect accuracy in identifying email addresses and telephone numbers, with an F1 score of 1.0 and 0.99, respectively. Piiranha-v1 is extremely effective at recognizing passwords and usernames, with an accuracy of nearly 100% in these areas. These metrics indicate its utility in safeguarding sensitive information in digital communication and transaction environments.

One of Piiranha-v1’s key advantages is its ability to flag PII even when the specific data category may be misclassified. For instance, the model may occasionally confuse first names with last names, but it still correctly identifies the information as PII. This flexibility makes Piiranha-v1 a robust tool for real-world applications where data inconsistencies often occur. Such misclassifications, while technically errors, do not compromise the model’s primary goal of identifying and protecting sensitive data.

In collaboration with partners like Hugging Face and Akash Network, the Internet Integrity Initiative Team trained Piiranha-v1 using a comprehensive dataset comprising over 400,000 records of masked PII. This extensive training has resulted in a model that boasts high accuracy and demonstrates resilience in varied linguistic and contextual scenarios. The use of H100 GPUs during training allowed the model to reach high levels of efficiency, ensuring rapid identification of PII in real-time applications.

Despite its high accuracy, the developers of Piiranha-v1 emphasize that it should be used with caution. While the model is highly reliable, the team does not assume responsibility for any incorrect predictions it may produce. This advisory serves as a reminder of the limitations inherent in any machine learning model, particularly one tasked with something as complex as PII detection across multiple languages and data formats.

The training process for Piiranha-v1 was meticulously planned to optimize its performance. The model was trained for five epochs using a batch size of 128. It leveraged mixed-precision training with Native AMP to ensure speed and accuracy during the learning process. The result is a highly refined model capable of recognizing subtle variations in PII tokens, which is particularly important for identifying information that might be obscured or presented in non-standard formats.

The model’s evaluation results further highlight its impressive capabilities. Piiranha-v1 achieves an F1-score of 93.12% when tested on a dataset containing approximately 73,000 sentences. Its precision and recall metrics are also strong, at 93.16% and 93.08%, respectively. These figures, while slightly lower than the overall accuracy due to the model’s multi-class classification task, still represent a high level of competence in PII detection.

In practical terms, Piiranha-v1 can be used in various applications. It is particularly well-suited for organizations that handle large volumes of personal data, such as financial institutions, healthcare providers, and tech companies. By integrating Piiranha-v1 into their data processing pipelines, these businesses and organizations can ensure that sensitive information is automatically flagged and redacted, reducing the risk of data breaches & ensuring compliance with privacy regulations like the GDPR and CCPA.

The Piiranha-v1 model is also available for deployment through Hugging Face’s platform, where it can be easily integrated into existing workflows. The model is under the Creative Commons BY-NC-ND 4.0, which allows for broad usage within the confines of non-commercial applications. This open-access approach further reinforces the Internet Integrity Initiative Team’s commitment to improving data privacy on a global scale.

In conclusion, Piiranha-v1 represents a significant advancement in PII detection. Its high accuracy, multi-language support, and flexible application possibilities make it a valuable tool for any organization looking to enhance its data privacy efforts. The Internet Integrity Initiative Team has delivered a model that meets the technical challenges of PII detection and reflects the growing importance of safeguarding personal information in today’s digital world. As concerns over data privacy continue to escalate, tools like Piiranha-v1 will undoubtedly play a crucial role in protecting individuals’ sensitive information from exposure and misuse.


Check out the Model Card and Colab Notebook. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our Newsletter..

Don’t Forget to join our 50k+ ML SubReddit

FREE AI WEBINAR: ‘SAM 2 for Video: How to Fine-tune On Your Data’ (Wed, Sep 25, 4:00 AM – 4:45 AM EST)

The post Piiranha-v1 Released: A 280M Small Encoder Open Model for PII Detection with 98.27% Token Detection Accuracy, Supporting 6 Languages and 17 PII Types, Released Under MIT License appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Piiranha-v1 数据隐私 PII检测 模型训练
相关文章