ByteByteGo · October 27, 23:39
Nubank Builds Its Own Logging Platform to Cut Costs and Improve Efficiency

As one of the world's largest digital banks, Nubank faced high costs, management difficulties, and limited visibility with its existing logging platform as it scaled rapidly. To address these challenges, the Nubank engineering team decided to build its own logging platform in two phases. The Observability Stream handles log ingestion and processing, combining Fluent Bit, an in-house data buffering service, and a filtering and processing service to ensure reliable, flexible data handling. The Query and Storage Platform handles log querying and storage, using Trino as the query engine, AWS S3 as the storage layer, and the Parquet data format for efficient, low-cost log management. The new platform not only cut operating costs significantly but also improved system scalability and engineers' troubleshooting efficiency, laying a solid foundation for Nubank's future growth.

💡 **Cost and visibility drove the build decision**: Nubank initially relied on a third-party logging service, but as the business grew, costs rose sharply, management became complex, and the team lacked deep visibility into how logs were collected and stored, seriously hampering troubleshooting and system stability. This pushed Nubank to build a logging platform it could fully control at a better cost.

🔄 **Two-phase platform strategy**: Nubank split the project into two key phases. The first, the "Observability Stream", focuses on log ingestion and processing to ensure efficient, reliable data collection; the second, the "Query and Storage Platform", focuses on querying and storing logs to support fast retrieval at massive scale. This step-by-step approach reduced project risk and let the team optimize and validate each component incrementally.

🚀 **Technology choices and architectural advantages**: On the ingestion side, Nubank combined the lightweight Fluent Bit with in-house services to buffer, filter, and enrich logs, improving the stability and flexibility of data processing. On the query and storage side, Trino provides powerful distributed SQL querying, AWS S3 offers massive, low-cost storage, and the Parquet format delivers efficient compression and fast scans. This design decouples ingestion from querying, improving the overall system's scalability and efficiency.

📈 **Significant gains in scale and efficiency**: Since launch, the platform ingests 1 trillion logs and processes 1 PB of data per day, with 45 PB of data in storage. Engineers run roughly 15,000 queries daily, scanning 150 PB of data, while overall costs dropped by 50% compared with the previous setup. The new logging platform has proven highly successful at handling massive data volumes, sustaining query performance, and controlling costs, giving the company strong support for its rapid growth.



Disclaimer: The details in this post have been derived from the details shared online by the Nubank Engineering Team. All credit for the technical details goes to the Nubank Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When a company grows rapidly, its technical systems are often pushed to their limits.

This is exactly what happened with Nubank, one of the largest digital banks in the world. As the company scaled its operations, the logging platform it relied on began to show serious cracks. Logging may sound like a small part of a tech stack, but it plays a critical role. Every time a service runs, it generates logs that help engineers understand what is happening behind the scenes. These logs are essential for monitoring systems, fixing problems quickly, and keeping the platform stable.

Nubank was using an external vendor to handle its log ingestion and storage. Over time, this setup became both costly and difficult to manage. The engineering team had very limited visibility into how logs were collected and stored, which made troubleshooting production issues harder. If something went wrong, they could not always trust that the logs they needed would be available.

Costs were also rising fast, with no clear way to predict future spending. The only way to handle more load was to pay more money. On top of that, important tools like alerts and dashboards were tied directly to the vendor’s platform, making it difficult to change providers. Finally, whenever log ingestion spiked, it would slow down the querying process, affecting incident response.

These challenges pushed Nubank to build its own in-house logging platform to take back control, reduce costs, and ensure a stable foundation for its fast-growing systems.

In this article, we will look at how Nubank built its in-house logging platform and the challenges the team faced.

The Initial Logging Architecture

Before building its own logging platform, Nubank relied entirely on a third-party vendor.

The setup was straightforward: every application inside the company sent its logs directly to the vendor's forwarder or API. This worked well in the early days because the number of services was smaller, the log volume was manageable, and the system was simple to maintain.

However, as the company grew, this basic setup began to fall apart: costs climbed with no way to predict them, the team had little visibility into how logs were collected and stored, and spikes in ingestion slowed down querying.

Two-Phase Platform Strategy

To solve its growing challenges, the Nubank engineering team decided not to build everything at once. Instead, they split the project into two clear phases.

The first phase was called the Observability Stream. This part focused on ingesting and processing logs. In other words, this involved collecting data from thousands of applications and preparing it for storage. The team wanted a platform that could handle large amounts of incoming data reliably and efficiently. It also needed to be flexible enough to apply filters, process different data types, and generate useful metrics.

The second phase was the Query and Storage Platform. Once logs were collected and processed, they needed to be stored in a way that made them easy to search. Engineers use logs daily to investigate incidents, debug issues, and understand system behavior. This platform had to support fast queries, store data at low cost, and remain scalable as Nubank’s infrastructure kept expanding.

This two-phased approach made the migration smoother and allowed them to test and improve each part before moving on. The goal was to create a complete, in-house log management system that could grow with Nubank’s massive scale.

Both phases were designed with three key goals in mind: predictable costs, high scalability, and deep operational visibility.

With these guiding principles, Nubank began building the new logging platform from the ground up.

Ingestion Pipeline

The Nubank engineering team began by focusing on the first phase of their new platform: the ingestion pipeline, also known as the Observability Stream.

This part of the system is responsible for collecting logs from many different applications, processing them in a structured way, and preparing them for storage. In their old setup, ingestion and querying were tightly connected, which meant problems in one area could affect the other. By separating the two, Nubank gained the ability to scale and manage each part independently.

To build this pipeline, the team combined reliable open-source tools with in-house services: Fluent Bit as a lightweight log forwarder, an in-house data buffering service to absorb traffic spikes, and a filtering and processing service that enriches logs and generates operational metrics.

By introducing these components, Nubank improved the reliability of its ingestion pipeline.

Logs are now buffered and processed in a controlled manner, which makes the system more stable during sudden traffic increases. Ingestion is fully decoupled from querying, so problems in one layer do not bring down the other. The team also gained operational telemetry in the form of insightful metrics that help make better decisions about storage design and system scaling.
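The buffering idea described above can be sketched in a few lines. This is a minimal illustration, not Nubank's actual service: the `LogBuffer` class and its parameters are hypothetical, and a real buffer would persist records rather than hold them in memory. The key property shown is backpressure: when the buffer fills, producers block instead of logs being dropped, and the downstream processor drains records in controlled batches.

```python
import queue
import threading

class LogBuffer:
    """Bounded buffer decoupling log producers from the downstream
    processor. When full, put() blocks, applying backpressure
    instead of silently dropping logs."""

    def __init__(self, max_size=10_000, batch_size=500):
        self._q = queue.Queue(maxsize=max_size)
        self._batch_size = batch_size

    def put(self, record: dict) -> None:
        self._q.put(record)  # blocks when the buffer is full

    def drain_batch(self) -> list:
        """Pull up to batch_size records; stop early if the queue empties."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch

# Usage: a producer thread enqueues records, the main thread drains batches.
buf = LogBuffer(max_size=1000, batch_size=100)
producer = threading.Thread(target=lambda: [buf.put({"i": i}) for i in range(250)])
producer.start()
producer.join()

seen = 0
while True:
    batch = buf.drain_batch()
    if not batch:
        break
    seen += len(batch)
```

Because the buffer sits between ingestion and everything downstream, a slow consumer degrades into delayed batches rather than lost logs, which is exactly the stability property the paragraph above describes.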

This solid foundation paved the way for the next phase of the project.

Query and Storage Platform

Once the ingestion pipeline was stable, the Nubank engineering team turned its attention to the second major part of the platform: the query and storage layer.

This layer is what allows engineers to search through massive amounts of log data quickly and reliably. Since Nubank ingests trillions of logs every day, this part of the system needed to be both powerful and cost-efficient. It had to store data at a petabyte scale while still delivering fast query performance for daily debugging and monitoring.

Query Engine: Trino

For the query engine, the team chose Trino.

Trino is a distributed SQL engine that can query data stored in different locations. One of the main reasons for this choice was its partitioning capabilities.

Partitioning breaks data into smaller, structured pieces so that a query only scans the parts it needs. This improves performance and reduces the amount of computing resources required.
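To make the benefit of partitioning concrete, here is a small sketch of partition pruning, the mechanism a partition-aware engine like Trino uses to skip irrelevant data. The layout (`date=YYYY-MM-DD` prefixes under an S3 bucket) is a common convention, not Nubank's published schema, and the bucket name is made up.

```python
from datetime import date

# Hypothetical layout: one S3 prefix per day of logs, e.g.
#   s3://logs/date=2024-06-01/part-*.parquet
partitions = {
    date(2024, 6, d): f"s3://logs/date=2024-06-{d:02d}/" for d in range(1, 31)
}

def prune(partitions, start, end):
    """Return only the partition prefixes a query over [start, end]
    must scan; everything else is skipped without reading a byte."""
    return [p for d, p in sorted(partitions.items()) if start <= d <= end]

# A query over a 3-day window touches 3 of 30 day-partitions.
to_scan = prune(partitions, date(2024, 6, 10), date(2024, 6, 12))
```

In a real engine the pruning happens inside the query planner from a `WHERE` predicate on the partition column, but the effect is the same: scanned data shrinks in proportion to how selective the time filter is.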

Trino also integrates well with multiple backends, which gives the platform flexibility to evolve in the future if needed.

Storage Layer: AWS S3

For storage, the team selected AWS S3.

This decision was driven by its high durability, strong availability guarantees, and practically infinite scalability. S3 can easily handle petabytes of data without the team needing to manage complex infrastructure. It is also cost-effective for long-term storage, which is essential when dealing with logs at this scale.
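On S3, "partitioning" is simply a naming convention for object keys. The sketch below builds a date- and hour-partitioned key; the `service=` and `logs/` path segments are illustrative assumptions, since Nubank has not published its actual layout.

```python
from datetime import datetime

def s3_log_key(service: str, ts: datetime, part: int) -> str:
    """Build a Hive-style partitioned object key (illustrative layout).
    Query engines can map the date= and hour= segments to partition
    columns and prune objects by key prefix alone."""
    return (
        f"logs/service={service}/date={ts:%Y-%m-%d}/hour={ts:%H}/"
        f"part-{part:05d}.parquet"
    )

key = s3_log_key("payments", datetime(2024, 6, 15, 9, 30), 7)
```

Because the partition values live in the key itself, growing to new days, hours, or services requires no schema migration; new prefixes simply appear as data arrives.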

As Nubank continues to grow, the storage layer can expand smoothly without major architectural changes.

Data Format: Parquet

The logs are stored in Parquet, a columnar data format designed for efficient querying.

Parquet provides an average compression rate of about 95 percent, which drastically reduces how much space the data takes up. It also offers excellent scan performance, making analytical queries much faster. In addition, Parquet uses efficient compression and serialization methods that help lower storage and compute costs.
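The reason columnar formats compress logs so well is that log records are highly repetitive, and storing each field contiguously puts the repetition next to itself. The toy sketch below demonstrates the principle with only the standard library (JSON plus zlib standing in for Parquet plus its codecs); the exact ratio depends on the data, so treat the numbers as illustrative.

```python
import json
import zlib

# 1,000 similar log records: service and level repeat constantly,
# only the message varies slightly.
rows = [
    {"service": "payments", "level": "INFO", "msg": f"request {i} ok"}
    for i in range(1000)
]

# Row-oriented baseline: one JSON object per line, uncompressed.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# Column-oriented: each field stored contiguously, as Parquet does,
# then compressed. Repeated values sit side by side and compress well.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes = zlib.compress(json.dumps(columns).encode(), level=9)

ratio = 1 - len(col_bytes) / len(row_bytes)  # fraction of space saved
```

Parquet goes further than this sketch, with per-column encodings (dictionary, run-length) before general-purpose compression, which is how savings in the 90-percent range become plausible for repetitive log data.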

Parquet Generator (In-house)

To convert the ingested log batches into Parquet format, the team built a high-throughput Parquet generator.

This is an in-house service designed to efficiently handle large volumes of data. Since it is internally controlled, the team can optimize it for cost, add custom features, and adapt it to new workloads over time. It is both scalable and extensible, ensuring the system can support future growth.
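The core loop of such a generator is batch accumulation: collect ingested records until a threshold, then emit them as one file. The sketch below shows that shape only; `BatchWriter` and its parameters are hypothetical, and `write_fn` stands in for the step that actually serializes a batch to Parquet and uploads it.

```python
class BatchWriter:
    """Illustrative sketch of a batch converter: accumulate ingested
    log records and flush them as one output file per batch. The real
    service writes Parquet; write_fn stands in for that step."""

    def __init__(self, write_fn, max_records=1000):
        self._write_fn = write_fn
        self._max = max_records
        self._pending = []
        self.files_written = 0

    def add(self, record: dict) -> None:
        self._pending.append(record)
        if len(self._pending) >= self._max:
            self.flush()

    def flush(self) -> None:
        """Emit all pending records as one batch, if any."""
        if self._pending:
            self._write_fn(self._pending)
            self.files_written += 1
            self._pending = []

# Usage: 250 records with a batch size of 100 produce 3 files
# (100, 100, and a final partial batch of 50).
written = []
w = BatchWriter(write_fn=written.append, max_records=100)
for i in range(250):
    w.add({"i": i})
w.flush()  # flush the trailing partial batch
```

Batching matters for Parquet specifically: larger files amortize per-file metadata and give the columnar encodings more repetition to work with, so the batch size is a natural tuning knob for a throughput-versus-latency trade-off.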

Scalability and Performance Metrics

By mid-2024, the new logging platform at Nubank had reached an impressive level of scale and efficiency. What started as a response to rising costs and limited visibility had grown into a powerful, cloud-native observability system. The team designed it to handle massive amounts of data without losing speed, reliability, or control.

Here are some key numbers that show how far the platform has come: about 1 trillion logs ingested and 1 PB of data processed per day, 45 PB of data held in storage, roughly 15,000 queries run by engineers daily scanning 150 PB of data, and overall costs about 50 percent lower than with the previous vendor.

Conclusion

The new logging platform built by the Nubank engineering team shows how a company can rethink its infrastructure to match the pace of its growth. By moving away from a costly and inflexible third-party solution, Nubank created a modern, cloud-native observability stack tailored to its own needs.

The success of this platform comes down to a few key design choices: decoupling ingestion from querying, buffering and filtering logs before they reach storage, and pairing Trino with S3 and Parquet for fast, low-cost queries at petabyte scale.

The result is a platform that provides predictable costs, high scalability, and deep operational visibility. It enables engineers to troubleshoot issues faster, build new features with confidence, and prepare the company for future growth.

References:


