Temporal Blog, September 30
A State Management Solution for Microservices

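Microservice architectures simplify development, but they also bring complex state management challenges. The article examines state management for microservices in distributed systems and points out that the lack of transactionality forces developers to handle state themselves, which leads to failures and performance problems. The solution is a dedicated state management platform, such as Cadence or Temporal, that gives microservices a transactional abstraction, ensures the integrity and reliability of business processes, and improves developer productivity.

🔍 Microservice architectures shorten development cycles, but in production their distributed nature complicates state management. Developers have to handle state themselves, which makes systems more brittle and more likely to fail.

🔄 The article proposes the Yoda Principle, borrowed from database transaction management: ideally the platform abstracts state management so that developers can focus on business logic, much as the atomicity of database transactions guarantees that operations complete as a whole.

📈 Hand-written state management does not scale. As the number of microservices grows (Uber had 4,000 services), the time spent writing state management code far exceeds the time spent on business logic and becomes a heavy burden for developers.

🛠️ Existing approaches such as Kafka, databases plus timers, or monitoring tools (Prometheus and Grafana) provide only partial observability. They do not solve the underlying state management problem, and developers still have to write large amounts of management code.

🚀 A dedicated state management platform (such as Cadence or Temporal) provides a transactional abstraction that ensures Workflows execute atomically and recover from failures automatically, so developers get both the flexibility of microservice development and the reliability of database transactions.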

A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions.

Microservices were meant to be a blessing, but for many, they’re a burden. Some developers have even moved away from them after negative experiences. Operational complexity becomes a headache for this distributed, granular software model in production. Is it possible to solve microservices’ problems while retaining their advantages?

Microservices shorten development cycles. Changing a monolithic code base is a complex affair that risks unexpected ramifications. It’s like unraveling a sweater so that you can change its design. Breaking that monolith down into lots of smaller services managed by two-pizza teams can make software easier to develop, update, and fix. It’s what helped Amazon grow from a small e-commerce outfit to the beast it is today.

Microservices also introduce new challenges. Their distributed nature exposes developers to complex state management issues.

The Yoda Principle

Ideally, developers shouldn’t deal with state management at all. Instead, the platform should handle it as a core abstraction. Database transaction management is a good example; many database platforms support atomic transactions, which divide a single transaction into a set of smaller operations and ensure that either all of them happen or none of them do. To achieve this behavior, the database uses transaction isolation, which restricts the visibility of each operation in a transaction until the entire transaction completes. If an operation fails, the application using the database sees only the pre-transaction state, as though none of the operations happened.

This transactionality enables the developer to concentrate on their business logic while the database platform handles the underlying state. A database transaction doesn’t fail half complete and then leave the developer to sort out what happened. An account won’t be debited without the corresponding party’s account being credited, for example. As Yoda said: “Do or do not. There is no try.” Appreciate ACID SQL databases, he would have.
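To make the debit-and-credit guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the account names, and the `transfer` helper are assumptions made up for illustration; the point is only that a failure after the debit leaves no trace of the debit.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two illustrative accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)",
            [("alice", 100), ("bob", 50)],
        )
    return conn


def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one transaction; return True if committed."""
    try:
        # The connection context manager commits on success and
        # rolls back every statement in the block if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            credited = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if credited.rowcount != 1:
                # The debit above already ran; raising here undoes it.
                raise LookupError(f"no such account: {dst}")
        return True
    except (sqlite3.IntegrityError, LookupError):
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "carol", 30)` debits alice, fails to find carol, and rolls back, so `balances(conn)` shows the same values as before the call. The application never has to inspect a half-finished transfer.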

“Phew,” you think. “Thank goodness I don’t have to write code to unravel half-completed operations just to work out the transaction state.” Unfortunately, microservices developers are still living in that era. This is why Yoda never used Kubernetes.

I’ve Got a Bad Feeling about This

In microservice architectures, a single business process interacts with multiple services, each of which operates and fails autonomously. There is no single monolithic engine to manage and maintain state in the event of a failure.

This lack of transactionality between independent services leaves developers holding the bag. Instead of just focusing on their own applications’ functionality, they must also handle application resilience by managing what happens when things go wrong. What was once abstracted is now their problem.
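As a rough illustration of what "holding the bag" looks like, the sketch below moves money between two independent services and undoes the debit by hand when the credit fails. This is a hand-written compensation step (often called the saga pattern), not something the article prescribes, and `AccountService`, `ServiceError`, and `transfer` are hypothetical in-memory stand-ins for real remote calls.

```python
class ServiceError(Exception):
    """Raised when a (hypothetical) remote service call fails."""


class AccountService:
    """In-memory stand-in for an independently deployed account service."""

    def __init__(self, balance, healthy=True):
        self.balance = balance
        self.healthy = healthy

    def debit(self, amount):
        if not self.healthy:
            raise ServiceError("debit unavailable")
        self.balance -= amount

    def credit(self, amount):
        if not self.healthy:
            raise ServiceError("credit unavailable")
        self.balance += amount


def transfer(source, target, amount):
    """Debit source, credit target, and compensate manually on failure."""
    source.debit(amount)
    try:
        target.credit(amount)
    except ServiceError:
        # No shared transaction exists, so the developer must undo the
        # debit explicitly. If this call fails too, or the process dies
        # before reaching it, the money is simply gone until someone
        # reconciles the two services by hand.
        source.credit(amount)
        return "compensated"
    return "completed"
```

Even this small example leaves gaps: nothing records that a transfer was in flight, so a crash between the debit and the credit is invisible to both services.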

In practice, things can go wrong quickly in microservice architectures, with cascading failures that cause performance and reliability problems. For example, a service that one development team updates with new error types can cause other services to fail if they haven’t also been updated to handle those new errors.

The brittle complexity of microservices is a challenge, in part because of the weakest link effect. An application’s overall reliability is only as good as its least reliable microservice. Asynchronous primitives make this harder still, because state management is more difficult when a microservice’s response time is uncertain.
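A quick back-of-the-envelope calculation shows how strict the weakest-link bound is. The sketch assumes, purely for illustration, that a request must pass through every service in a chain and that services fail independently; the availability figures are made up.

```python
import math


def chain_availability(availabilities):
    """Probability that every service in the chain succeeds.

    Assumes independent failures and that a request needs all services.
    """
    return math.prod(availabilities)


# Nine services at 99.9% and one at 99% (illustrative numbers only).
services = [0.999] * 9 + [0.99]
overall = chain_availability(services)
weakest = min(services)
```

Under these assumptions `overall` comes out at about 0.981, which is below the 0.99 of the weakest service. The least reliable microservice sets a ceiling, and every additional dependency pulls the result further beneath it.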

Look at the Size of That Thing

Another aspect of this problem is that managing state on your own doesn’t scale well. The more microservices an organization runs, the more time-consuming managing their state becomes. Companies often have thousands of microservices in production, outnumbering their developers. This is what we noticed as early developers at Uber, which had 4,000 microservices even back in 2018. In this environment, we spent most of our time writing code to manage microservice state.

Developers have taken several approaches to homegrown state management. Some use Kafka event streams hidden behind an API to queue messages between microservices, but the lack of diagnostics makes root cause analysis a nightmare. Others use databases and timers to keep track of the system state.
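The database-and-timer approach tends to look something like the sketch below: a table records each outstanding step with a deadline, and a periodic sweep (the "timer") looks for steps whose reply is overdue. The `steps` schema and the function names are assumptions for illustration, and the clock is passed in as `now` to keep the example deterministic.

```python
import sqlite3


def open_store():
    """Create the hand-rolled state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE steps ("
        "id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "deadline REAL NOT NULL, "
        "retries INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def start_step(conn, step_id, now, timeout):
    """Record that a request was sent and when its reply is due."""
    with conn:
        conn.execute(
            "INSERT INTO steps (id, status, deadline) VALUES (?, 'pending', ?)",
            (step_id, now + timeout),
        )


def complete_step(conn, step_id):
    """Record that the reply arrived."""
    with conn:
        conn.execute("UPDATE steps SET status = 'done' WHERE id = ?", (step_id,))


def sweep(conn, now, timeout, max_attempts=3):
    """Timer callback: fail exhausted steps, return the ids to retry."""
    with conn:
        # One initial attempt plus `retries` retries have been made so far.
        conn.execute(
            "UPDATE steps SET status = 'failed' "
            "WHERE status = 'pending' AND deadline <= ? AND retries + 1 >= ?",
            (now, max_attempts),
        )
        to_retry = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM steps "
                "WHERE status = 'pending' AND deadline <= ? ORDER BY id",
                (now,),
            )
        ]
        conn.execute(
            "UPDATE steps SET retries = retries + 1, deadline = ? "
            "WHERE status = 'pending' AND deadline <= ?",
            (now + timeout, now),
        )
    return to_retry


def status(conn, step_id):
    """Look up the current status of one step."""
    return conn.execute(
        "SELECT status FROM steps WHERE id = ?", (step_id,)
    ).fetchone()[0]
```

None of this is business logic. It is bookkeeping that every team ends up rewriting, and it still says nothing about what to do once a step is marked as failed.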

Monitoring and tracing can help, but only up to a point. Monitoring tools oversee platform services and infrastructure health, while tracing makes it easier to troubleshoot bottlenecks and unexpected anomalies. There are many on offer. For example, Prometheus offers open-source monitoring that developers can query, while Grafana, which is often paired with it, adds visualization capabilities for following system behavior.

These solutions can be useful, providing at least some observability into microservices-based systems. However, monitoring tools don’t help with the task of state management, leaving that burden with the developer. That’s why developers spend way too much time writing state management code instead of highly differentiated business logic. In an ideal world, something else would abstract state management for them.

Use the Microservices State Management Platform, Luke

The answer to simplifying state management in microservices is to offer it as a core abstraction for distributed systems.

We worked on a state management platform after spending far too much time manually managing microservices state at Uber. We wanted a product that would enable us to define workflows that make calls to different microservices (in the language of the developer’s choosing), and then execute them without worrying about it afterwards.

In our solution, which we originally called Cadence, a workflow (a function that defines your high-level business logic in code) automatically maintains state while waiting for potentially long-running microservices to respond. Its concurrent nature also enables the workflow to continue with other non-dependent operations in the meantime.
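The shape of such a workflow can be sketched with plain asyncio. This only illustrates the concurrency described above: plain asyncio persists nothing, so it is not a stand-in for Cadence or Temporal. `call_service`, `order_workflow`, and the service names are hypothetical.

```python
import asyncio


async def call_service(name, delay, log):
    """Hypothetical stand-in for a microservice call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    log.append(name)
    return f"{name}:ok"


async def order_workflow(log):
    """Business logic written as one ordinary function."""
    # Kick off the slow call without blocking the rest of the workflow.
    shipping = asyncio.create_task(call_service("shipping", 0.2, log))

    # These operations do not depend on shipping, so they proceed
    # while the shipping call is still outstanding.
    payment = await call_service("payment", 0.01, log)
    email = await call_service("email", 0.01, log)

    # Only now does the workflow wait for the long-running result.
    return [payment, email, await shipping]
```

Running `asyncio.run(order_workflow(log))` completes the payment and email calls before the shipping call returns. The local variables hold the workflow's state across each wait, which is the part a state management platform has to make durable.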

The system manages disruption in state without requiring developer intervention. For example, in the event of a hardware failure, the state management platform will continue running a Workflow on another machine in the same state without the developer needing to do anything.
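One general way to resume a function "in the same state" on another machine is to journal the result of every completed step and replay that journal when the function is restarted. The toy below sketches that idea under heavy simplification; it is not a description of Cadence's or Temporal's internals, and `Journal`, `Runner`, `Crash`, and `transfer_workflow` are names invented for this example.

```python
import json


class Crash(Exception):
    """Simulates the machine dying in the middle of a workflow."""


class Journal:
    """Append-only list of [step_name, result] pairs, serializable to JSON."""

    def __init__(self, entries=None):
        self.entries = list(entries or [])

    def dumps(self):
        return json.dumps(self.entries)

    @classmethod
    def loads(cls, text):
        return cls(json.loads(text))


class Runner:
    """Executes steps, replaying recorded results instead of redoing work."""

    def __init__(self, journal):
        self.journal = journal
        self.position = 0
        self.executed = []  # steps this runner actually performed

    def step(self, name, fn):
        if self.position < len(self.journal.entries):
            # Replay: this step already completed somewhere else.
            recorded_name, result = self.journal.entries[self.position]
            if recorded_name != name:
                raise RuntimeError("workflow code must be deterministic")
        else:
            # First execution: do the work and record the outcome.
            # A real system would write this record durably right here.
            result = fn()
            self.journal.entries.append([name, result])
            self.executed.append(name)
        self.position += 1
        return result


def transfer_workflow(runner, crash_after_debit=False):
    """Two dependent steps; the second uses the result of the first."""
    debit_id = runner.step("debit", lambda: "debit-1")
    if crash_after_debit:
        raise Crash()
    credit_id = runner.step("credit", lambda: "credit-for-" + debit_id)
    return [debit_id, credit_id]
```

If one runner crashes after the debit and a second runner is built from `Journal.loads(first.journal.dumps())`, rerunning `transfer_workflow` on the second runner skips the debit, reuses its recorded result, and performs only the credit.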

Do. Don’t Not Do.

A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions. Developers can be certain that a Workflow will run once, to completion. Temporal takes care of any failures and restarts under the hood. Now, microservices-based applications can guarantee that a debit from one account will always credit the other in just a couple of lines of code. Developers get the best of both worlds: the shorter development cycles of microservices with the exactly-once execution of database transactions.

This fixes a long-standing problem with microservices and supercharges developer productivity, especially now that developers are typically responsible for operating their applications as well as developing them. Finally, developers who want the benefits of microservices can enjoy them without having to go to the dark side.
