Second Brain: Crafted, Curated, Connected, Compounded on October 2, 20:53
Introduction to the Nimble File Format

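Nimble is a new columnar file format created by Meta, intended to replace formats such as Apache Parquet and ORC. It is suited to wide workloads with thousands of columns, such as feature engineering and machine learning training tables. Nimble's design principles are wide, extensible, parallel, and unified; it supports many encodings and lets users extend them and apply them recursively. Its features include a lighter metadata organization, Flatbuffers for efficiency, block encoding for predictable memory usage, pluggable encoding selection policies, and extensibility APIs.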

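💡 Nimble is a new columnar file format created by Meta as a replacement for established formats such as Apache Parquet and ORC. It is particularly suited to wide workloads with thousands of columns, such as feature engineering and machine learning training tables.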

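🌐 Nimble's design follows four core principles: wide, extensible, parallel, and unified. Wide makes it well suited to tables with very many columns; extensible lets users extend encodings and apply them recursively; parallel aims to fully leverage highly parallel hardware; unified avoids environmental fragmentation by providing a single unified Nimble library.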

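🔧 Nimble's features include a lighter metadata organization, with Flatbuffers for more efficient metadata access; block encoding for predictable memory usage while decoding/reading; many built-in encodings, with additional encodings added as needed; cascading (recursive/composite) encoding of streams; pluggable encoding selection policies; and extensibility APIs that let users extend encodings and other aspects of the file.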

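📊 Nimble also provides a clear separation between logical and physical encoded types, along with further capabilities such as extensible encodings and file layout, which add to its flexibility and efficiency in large-scale data processing.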

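🚀 Nimble's future directions include making deeper use of parallel hardware by exposing metadata that helps developers plan decoding trees and schedule kernels, and continuing to expand its encoding library and features to meet changing data processing needs.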

New File Format for storage of large columnar datasets.

Nimble (formerly known as “Alpha”) is a new columnar file format for large datasets created by Meta. Nimble is meant to be a replacement for file formats such as Apache Parquet and ORC. 

Watch this talk to learn more about Nimble’s internals.

# Design Principles

Nimble has the following design principles:

    Wide: Nimble is better suited for workloads that are wide in nature, such as tables with thousands of columns (or streams), which are commonly found in feature engineering workloads and training tables for machine learning.
    Extensible: Since the state of the art in data encoding evolves faster than the file layout itself, Nimble decouples stream encoding from the underlying physical layout. Nimble allows encodings to be extended by library users and recursively applied (cascading).
    Parallel: Nimble is meant to fully leverage highly parallel hardware by providing encodings which are SIMD and GPU friendly. Although this is not implemented yet, we intend to expose metadata to allow developers to better plan decoding trees and schedule kernels without requiring the data streams themselves.
    Unified: More than a specification, Nimble is a product. We strongly discourage developers from (re-)implementing Nimble’s spec, to prevent the environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and create high-quality bindings to other languages as needed.
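To make "recursively applied (cascading)" concrete, here is a minimal Python sketch of the general idea, not Nimble's actual API or on-disk layout. The function names (`dictionary_encode`, `rle_encode`, `cascade_encode`) are hypothetical. It dictionary-encodes a column of strings and then run-length encodes the resulting index stream, so the output of one encoding becomes the input of another.

```python
from typing import List, Tuple

def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value by its index in an alphabet of distinct values."""
    alphabet: List[str] = []
    index_of = {}
    indices: List[int] = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(alphabet)
            alphabet.append(v)
        indices.append(index_of[v])
    return alphabet, indices

def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    out: List[int] = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def cascade_encode(values: List[str]):
    """Cascade: dictionary encoding first, then RLE on the index stream."""
    alphabet, indices = dictionary_encode(values)
    return alphabet, rle_encode(indices)

def cascade_decode(alphabet: List[str], runs: List[Tuple[int, int]]) -> List[str]:
    """Undo the cascade in reverse order: RLE first, then the dictionary."""
    return [alphabet[i] for i in rle_decode(runs)]

column = ["us", "us", "us", "de", "de", "us"]
alphabet, runs = cascade_encode(column)
assert cascade_decode(alphabet, runs) == column
```

Because the second encoding only sees a stream of integers, it does not need to know that a dictionary encoding produced them. That independence is what lets encodings be composed and extended without touching the surrounding file layout.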

# Features

Nimble has the following features:

    Lighter metadata organization to efficiently support thousands to tens of thousands of columns and streams.
    Use Flatbuffers instead of thrift/protobuf to more efficiently access large metadata sections.
    Use block encoding instead of stream encoding to provide predictable memory usage while decoding/reading.
    Supports many encodings out-of-the-box, and additional encodings can be added as needed.
    Supports cascading (recursive/composite) encoding of streams.
    Supports pluggable encoding selection policies.
    Provide extensibility APIs where encodings and other aspects of the file can be extended.
    Clear separation between logical and physical encoded types.
    And more.
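The block-encoding and pluggable-policy items can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Nimble's implementation: `write_stream`, `read_stream`, `run_count_policy`, and the two toy encodings are hypothetical names. The stream is cut into fixed-size blocks, a caller-supplied policy picks an encoding per block, and the reader decodes one block at a time so its working memory is bounded by the block size rather than the stream length.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Block = Tuple[str, list]            # (encoding name, encoded payload)
Policy = Callable[[List[int]], str]  # inspects a block, returns an encoding name

def encode_rle(values: List[int]) -> list:
    """Run-length encode into (value, run_length) pairs."""
    runs: list = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def decode_rle(payload: list) -> List[int]:
    out: List[int] = []
    for v, n in payload:
        out.extend([v] * n)
    return out

# Registry of available encodings; new entries can be added by callers.
ENCODERS: Dict[str, Callable[[List[int]], list]] = {
    "trivial": lambda values: list(values),
    "rle": encode_rle,
}
DECODERS: Dict[str, Callable[[list], List[int]]] = {
    "trivial": lambda payload: list(payload),
    "rle": decode_rle,
}

def run_count_policy(values: List[int]) -> str:
    """Toy policy: use RLE only when it at least halves the item count."""
    if not values:
        return "trivial"
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    return "rle" if 2 * runs < len(values) else "trivial"

def write_stream(values: List[int], block_size: int, policy: Policy) -> List[Block]:
    """Split the stream into blocks and encode each one independently."""
    blocks: List[Block] = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        name = policy(chunk)
        blocks.append((name, ENCODERS[name](chunk)))
    return blocks

def read_stream(blocks: List[Block]) -> Iterator[int]:
    """Decode block by block; only one decoded block is held at a time."""
    for name, payload in blocks:
        yield from DECODERS[name](payload)

values = [7] * 6 + [1, 2, 3, 4]
blocks = write_stream(values, block_size=5, policy=run_count_policy)
assert list(read_stream(blocks)) == values
```

Swapping in a different policy only means passing another function to `write_stream`; the reader is unaffected because each block records which encoding was used.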

More on GitHub


Origin: Nimble and Lance: The Parquet Killers - by Chris Riccomini
Created 2024-05-21
