Temporal Blog, September 30
From SWF to Temporal: The Evolution of Building Reliable Distributed Systems

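The author looks back on a path that ran from Microsoft's Project Oslo to the AWS Simple Workflow (SWF) team, then to Azure Service Bus and Uber, and finally to founding Temporal Technologies. The article explores how tooling can raise developer productivity and solve reliability problems when building distributed systems, and how open-sourcing Temporal helps more companies handle the challenges of critical workloads.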

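🔧 Starting from Project Oslo, the author set two career goals: iterating with developers and focusing on the cloud. These led to joining the AWS SWF team, which was led by Maxim Fateev, an experienced distributed systems engineer.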

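🚀 AWS SWF offered a higher-level workflow service that significantly reduced the complexity of building fault-tolerant distributed systems. It was widely adopted inside AWS and was launched publicly in 2012.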

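🔄 At Azure, the author saw that with the rise of microservices, developers still depended on low-level building blocks such as message queues and faced challenges similar to those seen at AWS. That prompted another attempt at the problem: the Azure Durable Task Framework, built on the async/await model in C#.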

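⚡ At Uber, the author and Maxim Fateev created the Cherami project to address the operational issues Uber was hitting with Kafka at its scale. Even so, developers were still spending much of their time on basic reliability problems, which prompted the creation of Cadence.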

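🌐 Cadence, a multi-tenant workflow service inside Uber, let developers build stateful applications in Go. Recognizing that the same pain points were common across the industry, the author left Uber to found Temporal Technologies and promote the technology as open source.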

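🔑 Temporal Technologies open-sourced the core ideas behind Cadence under the MIT license. By providing a unified runtime for workflows and tasks, it gives developers a standard way to handle fault tolerance and state management in distributed systems, and it is already used for critical workloads by companies such as HashiCorp and DoorDash.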

Back in 2010 I was contemplating my next move after working on Microsoft's Project Oslo. Oslo was an effort to deliver 10X productivity to developers, and it inspired me to set the following criteria for my next job:

    Iterate with Developers: I want to work in a place where we can start small, get something in the hands of developers and then iterate over the product to solve real problems.
    Cloud: So far I had been focused on building platforms using bare-metal products. But I was starting to see the kind of requirements modern applications have around scale and resiliency. I knew Cloud had to be a huge part of the experience to deliver on those requirements.

Start of Journey

The Oslo framework led me to join the AWS Simple Workflow (SWF) team. The team was led by Maxim Fateev, who came from a strong messaging background. Maxim was one of the most brilliant software engineers I had ever worked with, especially when it came to designing large-scale distributed systems. Little did I know that I was about to embark on a long journey which is still being written to this day.

When I joined, the team was wrapping up a second version of the service, which was already seeing decent usage within AWS. Even at the time, we could clearly see a pattern emerging. Developers were spending significant amounts of time building resiliency into applications, using low-level primitives like queues, databases, retry mechanisms, durable timers, etc. Those same developers were able to produce higher quality systems with far less effort when using SWF instead of implementing resiliency themselves. Considering how useful the service was within AWS, the next natural step was to offer SWF publicly. I was part of the core team that worked on the public version of SWF, which launched in early 2012.
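To give a feel for the kind of plumbing this refers to, here is a minimal sketch of a hand-rolled retry helper in Python. It is an illustration only, not SWF's API: `call_with_retries`, `flaky_call`, and their parameters are hypothetical names invented for this example. Note that the retry state lives in process memory, so a crash loses it. Making that state durable is exactly the extra work (queues, databases, durable timers) described above.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration. Waits base_delay, then 2x, 4x, ...
    between attempts (exponential backoff). Nothing here is persisted, so a
    process crash forgets how many attempts were already made.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = []

    def flaky_call():
        # Stand-in for a remote call that fails twice before succeeding.
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retries(flaky_call, max_attempts=5))
```

Every service that needs this behavior tends to grow its own variant of such a helper, plus the storage needed to survive restarts.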

Durable Task Framework (DTFx)

After shipping the public SWF service, I took an opportunity at Microsoft Azure and ended up joining the Azure Service Bus team that owns the messaging stack for Azure. Cloud was steadily gaining momentum, and as more and more workloads were migrated, teams like Azure Service Bus became a focal point. As application developers increasingly adopted microservices architecture to keep up with the scale and availability requirements of modern applications, services like Azure Service Bus became the backbone to orchestrate calls across microservices. To keep up with the explosive growth, I worked as part of the team focused on large-scale ingestion through Azure EventHubs. This solved the scalability and reliability issues at the messaging layer, but developers still had to work with very low-level primitives whenever they needed to reliably orchestrate calls across microservices. The result was complex architectures which were expensive to build, hard to operate, and still came with reliability challenges, stemming from all sorts of failure cases which needed to be handled due to the distributed nature of the application.

I could clearly see that the developers building applications on Microsoft Azure were facing eerily similar challenges to what I had seen back at AWS, the same challenges we had tried to address with SWF. So I used one of the internal team hackathons as an opportunity to pair up with Affan Dar and take another stab at solving the problem. Affan had a very deep understanding of Azure Service Bus, so he was the perfect person to build the backend for the stateful C# experience I had in mind. Microsoft had recently added async/await capabilities to C#, and it turned out to be an amazing fit for writing stateful applications which need to orchestrate calls among microservices. Since Java lacked an async/await-like primitive, we had to rely on a Promise-based async approach when building SWF. But with C#, we were able to deliver a much cleaner, synchronous-looking programming model using async/await. This hackathon project resulted in Azure Durable Task Framework, an OSS client SDK which uses Azure Service Bus as the backend to provide a stateful workflow-as-code experience for applications. I'm so glad to see Microsoft has continued investing in the experience, with Azure Durable Functions as the latest reincarnation of the effort that started with that hackathon project.
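The difference between the two programming models can be sketched in a few lines. The example below uses Python's asyncio as a stand-in, so it is neither the SWF Java API nor the Durable Task Framework C# API, and every function name in it (`reserve_inventory`, `charge_payment`, `process_order_callbacks`, `process_order_await`) is hypothetical. It runs the same two-step orchestration twice: once by chaining callbacks on futures, roughly the shape a Promise-based approach takes, and once with async/await.

```python
import asyncio


async def reserve_inventory(order_id):
    await asyncio.sleep(0)  # stand-in for a call to another service
    return f"reserved:{order_id}"


async def charge_payment(reservation):
    await asyncio.sleep(0)  # stand-in for a call to another service
    return f"charged:{reservation}"


def process_order_callbacks(order_id):
    """Promise-style: each step is wired to the next through a callback."""
    loop = asyncio.get_running_loop()
    result = loop.create_future()

    def on_charged(task):
        if task.exception() is not None:
            result.set_exception(task.exception())
        else:
            result.set_result(task.result())

    def on_reserved(task):
        if task.exception() is not None:
            result.set_exception(task.exception())
            return
        loop.create_task(charge_payment(task.result())).add_done_callback(on_charged)

    loop.create_task(reserve_inventory(order_id)).add_done_callback(on_reserved)
    return result


async def process_order_await(order_id):
    """async/await: the same orchestration reads like sequential code."""
    reservation = await reserve_inventory(order_id)
    return await charge_payment(reservation)


async def main():
    first = await process_order_callbacks("order-1")
    second = await process_order_await("order-1")
    return first, second


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both versions produce the same result, but the second keeps control flow and error handling in ordinary language constructs. This sketch keeps all state in memory; a workflow framework additionally persists progress so the orchestration can resume after a failure.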

Transportation as Reliable as Running Water

In 2015, Uber opened a development center in Seattle and I decided to take the leap and join the engineering team. Coincidentally, Maxim Fateev ended up joining the Uber team in Seattle only a month after I did. At the time, Uber was running on Kafka 7 as the messaging backbone. At the scale they were running, they were encountering some serious operational issues. Considering Maxim and I had more than a decade's worth of experience building messaging systems similar to Kafka, we decided to create the OSS project Cherami to address this Uber-sized problem. After a year of working on the project, we were observing a very similar pattern to the one that we tried to solve with SWF and Durable Task Framework. When engineers needed to build with raw infrastructure primitives like queues and databases, they were spending 80% of their time building resiliency into the application. This was clearly not sustainable for Uber, which was growing at an amazing pace and building a brand of "Transportation as Reliable as Running Water". This need to increase developer productivity without compromising on the reliability of the system was the motivation for Maxim and me to create the OSS project Cadence. Within a very short period of time, we built a multi-tenant service hosted by our team. Cadence provided a great developer experience by enabling developers to use Golang to build and run stateful applications with very little operational overhead. Cadence grew organically within Uber and quickly became popular among developers. It slowly but surely began to emerge as the standard way to build stateful applications when reliability cannot be compromised.

Magic of Open Source

Today, more businesses are turning to software for running mission-critical parts of the system, and software is becoming a key part of the end-user experience. The problems faced by engineers at places like AWS, Microsoft Azure and Uber have become more and more common across the industry. This belief was validated in early 2019. Developers from companies like HashiCorp, Box, DoorDash, Checkr and dozens of other places organically discovered the Temporal technology and immediately started using it for their mission-critical workloads.

We have a strong belief that an infrastructure technology of this magnitude needs to be built as an open-source project. This led both Maxim and me to quit our jobs at Uber and launch Temporal Technologies in October 2019. Over the last year we made huge advances with our developer experience and released Temporal as open-source software under the MIT license. We recently launched our first production release, Temporal v1.0.0, which is already being used by numerous companies for critical workloads.

https://temporal.io

https://github.com/temporalio/temporal
