The Pragmatic Engineer, 5 hours ago
Software engineering lessons: the essence and practice of architecture

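Through the perspective of software engineering veteran Matthew Hawthorne, this article takes a close look at the essence and practice of software architecture. It argues that architecture is fundamentally about purposefully trading today’s “bad problems” for “good problems” tomorrow, rather than merely rearranging an existing system. It points out that a dedicated Architect role is not the only way to solve architectural problems: Netflix’s practice shows that the architectural decisions engineers make in their day-to-day work matter just as much. The article also covers the tradeoffs behind major technical decisions such as migrating to AWS, the characteristics of good architecture, and how to develop architecture skills. It stresses the role of architecture in unifying teams around shared goals and balancing the practical with the aspirational, and argues that architectural ability is independent of coding ability.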

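🏗️ **Architecture is about tradeoffs and evolution**: The article’s core claim is that good architecture work means deliberately trading the problems you have today for problems that are easier to manage, or more valuable to have, tomorrow. This is dynamic, forward-looking thinking, not just optimizing existing code or systems. A system that keeps going in circles, fixing small immediate annoyances without delivering any fundamental improvement for the future, is probably just “rearranging the furniture” rather than undergoing a real architectural upgrade.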

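🎯 **Architectural decisions should come from hands-on practice**: Drawing on his time at Netflix, the author stresses how effective it was for engineers to make architectural decisions in their day-to-day work, with no Architect role. In his view, the engineers who write the code and work in production know the system’s pain points and room for improvement best. Combining architectural design with actual development and operations work keeps proposals grounded and executable, and avoids the risk of armchair theorizing.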

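⚖️ **Embracing the tradeoffs and challenges of cloud migration**: Using Netflix’s migration to AWS as an example, the article lays out the complex tradeoffs that come with a major technical decision. The cloud offered elastic scaling, but it also introduced new challenges, such as unreliable networks and hardware. To cope, engineering teams had to invest heavily in stronger resilience, finer-grained monitoring, and smarter automation. These new problems were thorny, but compared with the old fixed-capacity limits, they were the better problems to have.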

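🤝 **Architecture is the link between people and systems**: The article points out that good architecture is not only about integrating technical systems, but also about whether it can unify a team’s understanding and goals. The author shares his experience building an analytics system at a social media company to show that, when stakeholders are numerous and boundaries are blurry, a clear vision and communication are essential. A successful architecture proposal needs to build consensus and guide the team in a common direction, even when a technically more “perfect” alternative may exist.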

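🧠 **Architecture skill and coding skill are independent**: The author stresses that an engineer who writes high-quality code is not necessarily a strong architect, and vice versa. Good architects tend to focus on big-picture concerns such as system design, data flow, latency, and fault tolerance, and care less about code-level details such as class names and directory structure. This skill is independent of programming detail; what it requires is a deep understanding of the whole system and forward-looking judgment.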

To try and answer this seemingly basic question, I turned to software engineering veteran Matthew Hawthorne. He’s worked as a software engineer for more than 25 years, and is the author of the upcoming book Push to Prod or Die Trying, which shares lessons from the trenches on software, architecture, and pushing to production. It is in early release now and due to be published next year.

Matt has worked at Comcast, Twitter, and Netflix, where he stayed for 6 years during the 2010s. During his time at Netflix, every engineer made architectural decisions day-to-day and shipped features frequently, which rolled out to tens of millions of users – all without a single mention of the job title “Architect”.

In this issue, Matt covers:

    Architects are not the solution to architectural problems. For a long time, there were no Architects at Netflix, yet the foundations of the architecture were still sound.

    Trading off today’s problems for tomorrow’s: migrating to AWS. Moving to the AWS cloud created plenty of problems for Netflix which took serious effort to resolve, down the line. But it also fixed existing pain points.

    Good characteristics of architecture. These balance practical and aspirational concerns, unify people as well as systems – and are unrelated to good code.

    Bad architecture is a lot of work that doesn’t change much. It’s like rearranging the furniture in a house that should be demolished and replaced by an improved layout.

    Good architectural trades in Netflix projects. Making unusual tradeoffs, building tooling to reduce operational work, trading off one set of limitations for another, and upgrading systems to work better in the future.

    How to improve your architecture skills. Design systems based on what could break, know your audience, focus on the right details, and make yourself valuable across several roles.

If you’d like to keep up with Matt’s writing, subscribe to his newsletter, Push to Prod. You can also purchase his work-in-progress book Push to Prod or Die Trying, which is currently 40% complete.

With that, it’s over to Matt:


At Netflix, we preferred building prototypes to writing formal architecture proposals, while at other companies, I’ve seen impractical architecture proposals fail, overly practical architectural efforts succeed – but then deliver limited impact – and other proposals gather dust due to the absence of the conversations and alignment work necessary for creating a shared plan.

I’ve always found the work defined by the phrase software architecture to be vague, and the value delivered by capital “A” Architects to be debatable. I speak from personal experience here after my own brief, unimpressive tenure as a “domain architect” – an experience from which I learned much, but delivered very little. Over time, if there’s one thing I’ve learned about architecture, it is that:

Good architecture work is about purposefully trading problems you have today, for better problems tomorrow. Basically, if you’re not upgrading your problems, you’re just rearranging furniture.

1. Architects are not the solution to architectural problems

I worked at Netflix for 6 years and during that time, there were hundreds of things that made the company unique. One was the lack of software architects, in title at least. In fact, we didn’t have engineering levels at all; every engineer was a Senior Software Engineer.

I spent my first 4 years there on the Edge team, which was around 20 engineers building and operating a software layer that sat between client devices and backend services. “Edge” in this case was the edge of Netflix’s server infrastructure, which carried a heavy operational burden.

It felt like the first step in every production incident was for us to prove we weren’t to blame for it. This was annoying but effective; colleagues assumed that if every client request goes through our system, then we must have logs or metrics captured by our system that identify and clarify problems. We often did, and when we didn’t, we worked to close the gaps. Of course, sometimes we did cause incidents, and on occasion, “we” was “me”.

Over a few years, we solved the most pressing problems by increasing resiliency, building flexible edge routing, and enabling predictive autoscaling. If I reflect on these projects and others from my time at the video streaming giant, there wasn’t architecture in the sense that I’d seen before. Those who proposed ideas also did the work, which avoided the awkward white-collar vs. blue-collar split that often arises between architects and engineers in tech workplaces. Most technical ideas originated in the “trenches”, resulting in deliberate architectural choices with obvious value. Over time, we entered a new era that involved extensive conversations about creating massive new systems to solve imaginary future problems, instead of real, present ones.

A new behavioral archetype formed at the company – the “Architect”. I am talking about engineers who hovered around the work, pushed new ideas, and didn’t make material contributions to day-to-day struggles. They weren’t on our on-call rotations, ostensibly because their managers wanted to give them more time to brainstorm new ideas, so they had ample time for conversations and intellectual enquiry, while the rest of us did grimy, operational work. Over lunch one day, a colleague and I discussed the situation. “I’m not sure what the solution is”, I said. “The solution”, he replied, “is to not have architects.” I laughed, then realized he wasn’t joking.

I thought Netflix didn’t need architects, and considered this more broadly: does any company really need them? Let’s examine why the role of Architect exists. I believe it’s an artifact of the old-school waterfall fantasy:

    Design: Architects think big thoughts and draw diagrams

    Build: Engineers turn diagrams into code

    Test: QA finds bugs

    Operate: Ops keeps the systems running smoothly

These four phases are inescapable, and every company approaches them in its own unique way. Imagine an axis: on the left, there’s a distinct role for each phase, and on the right, there’s a single role performing them. Every company lives somewhere on this axis.

For example, the shift towards agile at big companies made “big design, up front” less popular, and often combined Design and Build into a single, continuous phase. And the shift towards cloud, DevOps, and the “you write it, you run it” model often blends Build, Test, and Operate into a single phase.

So, why did Netflix introduce architects? My perception is that this was related to scope: our management chain believed we needed bigger, bolder ideas to redefine how our systems worked, and that those tasked with achieving this needed distance from day-to-day work.

It sounds perfectly logical, but I’ve never seen it work. I didn’t try to fix it – I switched teams and moved to the Personalization Infrastructure group, a domain I’d always been interested in. This team was newer and they were solving concrete problems. As I understand, the Edge team continued on the path of re-visualizing their platform with great success.

On a long enough timeline, success breeds specialization. You solve the hardest problems and management asks, “What’s our 3-year vision?”. Distinct groups of “thinkers” and “builders” emerge, and many companies formalize this with Principal or Staff Engineers who function as architects in all but name. From what I’ve seen, it is hard for people in these roles to avoid becoming disconnected from reality: their minds are powerful, but structure and incentives make it difficult to find appropriate targets.

2. Trading off today’s problems for tomorrow’s: migrating to AWS

The most significant tradeoff I saw at Netflix involved a decision that predated my time there: moving the company’s entire infrastructure to AWS. “So what?” you may think; lots of companies run everything in AWS.

Sure, lots of companies do so today, but none of significant scale did in 2010. I recall reading a few blog posts about Netflix’s AWS migration before I joined, and thinking, “they’re insane.” But after I joined and saw the situation up close, it made a lot of sense.

What are the repercussions of running in a datacenter, on-prem, non-cloud environment?

What are the consequences of running in AWS or other cloud environments?

If that was everything, then the business case for elastic capacity was strong enough to declare the AWS migration a solid trade. But from the perspective of the Edge team, this trade sent us down a path of many other problems.

For example:

These were problems we wouldn’t have had if we’d stayed in the data center. But they were also exponentially better than having fixed-capacity software and constrained customer growth. This was not only a good tradeoff for the business – it also catalyzed a planet-scale improvement of our engineering capabilities.

3. Good architecture characteristics

Let’s discuss a few characteristics of good and bad architecture work.

Good architecture is unrelated to good code

Very early in my career, I sat in a meeting and watched a newly hired, experienced engineer draw manically on a whiteboard for 30 minutes. I had no idea what he was talking about, but he was highly respected by senior teammates, so I paid close attention.

A few weeks later, I had some free time and we paired up to prototype some of his ideas. His code was atrocious and he had a hard time getting anything to work. But he said something which struck my young mind as profound:

“We’ve got SQL queries hardcoded into the UI”, he said. “We should change that so that the UI calls a service which abstracts the query behind a concept, like ‘get all items for the current user’”.
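To make the suggestion concrete, here is a minimal Python sketch of the idea, not the code from that project. The table layout, the `ItemService` class, and the `render_item_list` function are all hypothetical names chosen for illustration. The point is only that the UI asks for a concept (“items for this user”) and the query lives in one place behind it.

```python
import sqlite3
from typing import List


class ItemService:
    """Hypothetical service that hides the SQL query behind a concept."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_items_for_user(self, user_id: int) -> List[str]:
        # The only place that knows the schema; the storage or query
        # could change here without touching any UI code.
        rows = self._conn.execute(
            "SELECT name FROM items WHERE user_id = ? ORDER BY name",
            (user_id,),
        ).fetchall()
        return [name for (name,) in rows]


def render_item_list(service: ItemService, user_id: int) -> str:
    # UI code: depends on the concept, contains no SQL.
    return "\n".join(f"- {name}" for name in service.get_items_for_user(user_id))


def demo_connection() -> sqlite3.Connection:
    # In-memory database with made-up sample rows, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (user_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(1, "pen"), (1, "book"), (2, "lamp")],
    )
    return conn
```

Calling `render_item_list(ItemService(demo_connection()), 1)` renders the two sample items for user 1 without the UI function ever seeing a query string.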

That’s a great idea, I thought. Over the next few years, I learned a ton from him about testing, separation of concerns, hot and cold storage, compression, data formats and schemas, and more. But I never reconciled how a person so bad at writing code could have such solid ideas about system design. A larger point emerged:

Someone’s ability to write high-quality code is entirely independent of their ability to create or recognize high-quality architecture.

I’ve worked for engineering leaders who hadn’t written code in years, yet had a brilliant sense of technical fundamentals. When necessary, they grabbed the wheel and steered their teams away from expensive mistakes and towards bountiful opportunities. They didn’t have great coding skills, but possessed impeccable design sense, instincts, and taste.

I’ve also worked with engineers who were experts in various programming minutiae, yet had no clear ability to build systems that could work together coherently and functionally. Their ability to optimally implement classes or functions did not extend beyond the source files they edited.

Architecture connects systems built with code, but is a distinct discipline from coding with limited overlap. You cannot LeetCode your way into a cohesive, opportunistic architecture that serves the business.

Why not? Let’s explore the “good at architecture, bad at code” archetype. I’ve found these individuals are drawn to the most essential details. When designing a new system, they ask questions like:

They are generally less concerned with details like class hierarchies, method names, and directory structures, which are trivial in the broader business and distributed systems contexts.

But code can’t be completely irrelevant, right? Let’s examine the “good at code, bad at architecture” archetype, which I think occurs when individuals haven’t had a role in which things like network latency, fallbacks, or failure modes are first-class citizens.

Note, I use the word “architecture” to mean “distributed systems architecture”, which is common and also biased towards my personal experience.

At Netflix, we’d often find that backend services had slow memory leaks, which took a long time to discover and fix because instances rarely lived longer than 48 hours, due to autoscaling policies. If we had chosen to focus on memory leaks instead of autoscaling, Netflix would have been unable to scale to meet demand, and would’ve been a much smaller business.

Good architecture unifies people, not only systems

A few years later, as a Staff Engineer at a large social media company, I led a project to build analytics for our real-time content personalization system.

The existing design had evolved from a prototype built before I joined. We generated events from our runtime systems that briefly sat in a queue before being exported into an external datastore.
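The shape of that flow (runtime code emits events, a queue buffers them briefly, an exporter moves them to an external store) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the system described here: `EventPipeline`, the batch size, and the plain list standing in for the external datastore are all hypothetical.

```python
import queue
from typing import Any, Dict, List

Event = Dict[str, Any]


class EventPipeline:
    """Hypothetical stand-in for the runtime -> queue -> datastore flow."""

    def __init__(self, batch_size: int = 2) -> None:
        self._queue: "queue.Queue[Event]" = queue.Queue()
        self._batch_size = batch_size

    def emit(self, event: Event) -> None:
        # Called from runtime code; the queue is unbounded, so this
        # returns immediately and does not block request serving.
        self._queue.put(event)

    def export(self, datastore: List[List[Event]]) -> int:
        """Drain everything currently queued into the datastore, in batches."""
        exported = 0
        batch: List[Event] = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
            if len(batch) == self._batch_size:
                datastore.append(batch)
                exported += len(batch)
                batch = []
        if batch:  # flush the final partial batch
            datastore.append(batch)
            exported += len(batch)
        return exported
```

In a real deployment the exporter would run on its own schedule and write to a durable store; the sketch only shows why the queue decouples event producers from the export step.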

These events captured information such as:

This data enabled us to:

From a “social” point of view, there was much to consider:

We needed a cohesive vision to get everyone on the same wavelength, so I wrote a design document to achieve this which described the problem we were solving – both business-wise and scale-wise – and added a bunch of diagrams. I presented a few options for refining the project and described what I felt was the best path, and why. Then I shared the document.

After a few weeks of conversations with the team and stakeholders, nothing improved and the situation had become notably worse. The team was more fractured than ever, with multiple sub-teams moving in different directions. The notion of a comprehensive analytics solution felt like a distant dream.

I left the company shortly after this, as I no longer felt it was the right fit. But I asked myself for months afterwards what I could have done differently to deliver a better result. I always come back to one answer:

I was too focused on solving a technical problem when the primary issue was a social one.

My job had not been to just make optimal technical decisions. It was to bring everyone together with a unifying vision, whether it was technically optimal or not.

In the video ranking case described in the section “Good architecture balances the practical with the aspirational”, I lobbied for that project intermittently for over a year before work started: I presented slides and sent them to anyone I could find, gathered a bunch of data, and made more slides. Things went quiet for a while. And then, finally, the stars aligned and we started moving.

In the corporate world, objectively good ideas are rare. Ideas don’t have to be good per se; they just have to move people forward. Of course, good ideas are more likely to move people than bad ones, but in workplaces, something is “good” when the right people agree it is.

Good architecture balances the practical with the aspirational

Around the year 2020, I was working at a large US TV operator as a Principal Engineer. My role was focused on making broad improvements to personalized video ranking, and I partnered with an Architect to address the problem of making video ranking more flexible. Historically, most content was served by the product either via search, or a variety of human-curated collections. Fully personalizing our content required connecting multiple functions, systems, and teams in a new way.

Let’s talk about those teams briefly. Our primary partner was the Search team, a group of 20 or so engineers who operated a high-volume search index, along with a system to access all editorially curated content collections. Their priority was serving relevant search results in a stable, highly available way. A change to their system usually took at least a month to land in production.

We were the Personalization team, a mixture of around 10 Machine Learning Researchers and Software Engineers. We owned several services that provided personalized content to users at runtime, along with numerous data pipelines and random batch jobs. Our priority was increasing user engagement by maximizing the amount of personalized content served. If you asked us to make a change to our system, we could likely get it in production in two business days.

Our goal was to create a proposal with a high chance of winning buy-in from partner teams. One of my self-imposed constraints was that the proposal to increase personalization had to be achievable within our current responsibilities and organizational structure. We needed to increase the speed at which we could put new features and experiments into production, so we had to be mindful and realistic about the demands placed upon the slowest-moving teams.

My architect partner disagreed, saying:

“We should propose the best technical solution. And if that requires organizational changes, let the managers figure it out.”

It was an interesting idea, but at the time, a proposal requiring a reorg went nowhere fast at that company. It would also limit my opportunities to write future proposals.

In this case, the Search team moved slowly for legitimate reasons. The Personalization team moved quickly, also for legitimate reasons. One of the goals of our proposed architectural changes was to increase the speed of testing new ML models in production. If adding a new model or configuring an existing model involved manual changes and testing by the Search team, then we would continue to move slowly at their pace. To ignore this constraint would be to provide a proposal of limited value.

A common stereotype of Architects is that they propose things that don’t work in the real world. Their solutions require a new, perfect world to be constructed beforehand. In our case, a perfect world could be created by a reorg that breaks apart the Search team, or maybe by rewriting all services from scratch.

But conceiving new worlds isn’t the problem; the problem is the inability to intersect a perfect world with the one that actually exists. Sure, maybe a reorg or rewrite would be ideal, but in the meantime, can we deliver something valuable within current constraints to reveal opportunities being missed in the current state?

We were strategically exposed, and mitigated this issue via prototyping. We couldn’t win buy-in for a broad proposal, so we proposed three different architectures and ran them in parallel with a small amount of real customer traffic. The option with the highest demand on the Search team performed significantly worse than the others, partially due to some architectural flaws, but mostly because they moved more slowly than the Personalization team. Production traffic is the ultimate practical differentiator.
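The text does not say how the traffic was divided, so the following is only one common way to run several candidates side by side on a small slice of real users: deterministic, hash-based bucketing. The variant names, the 3% default, and the `assign_variant` function are assumptions made for this sketch.

```python
import hashlib

# Hypothetical labels for the three competing architectures.
CANDIDATES = ("candidate_a", "candidate_b", "candidate_c")


def assign_variant(user_id: str, experiment_percent: float = 3.0) -> str:
    """Route a small, stable slice of users to one of the candidates.

    The same user always lands in the same bucket, so their experience
    stays consistent and results can be compared per variant.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    threshold = round(experiment_percent * 100)  # 3.0% -> 300 buckets
    if bucket >= threshold:
        return "baseline"  # everyone else stays on the existing system
    # Spread the experimental slice evenly across the candidates.
    return CANDIDATES[bucket % len(CANDIDATES)]
```

Because assignment depends only on the user ID, no shared state is needed between servers, and raising `experiment_percent` only adds users to the experiment without reshuffling those already in it.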

4. Bad architecture is a lot of work that doesn’t change much

Let’s discuss the Principal Engineer role from the “Good architecture balances the practical with the aspirational” section above. I was asked to review a proposal for an “improving personalization architecture” initiative led by an engineer on a team I worked with. It was filled with clearly-stated decisions, such as:
