The Pragmatic Engineer · October 2, 20:54
Inside Google’s engineering stack
This article takes a deep look at Google’s unique internal engineering stack: its planet-scale infrastructure, its monorepo, and its custom-built technologies and developer tools. It examines why Google chose not to build its core services on Google Cloud Platform (GCP), relying on the in-house PROD stack instead, how the monorepo works, and which programming languages and tools the stack supports. It also covers Google’s approach to systems design, including its service-oriented architecture and the Stubby communication mechanism, and closes with a look at Google’s investment in AI and the state of its internal GenAI projects.

Google’s internal infrastructure is entirely custom-built, from the infrastructure itself down to the developer tools, making the company something of a “tech island”. Unlike products built on Google Cloud Platform (GCP), Google’s core products such as Search, YouTube, and Gmail do not run on GCP infrastructure; they rely on the in-house PROD stack. That is because the PROD stack offers a better developer experience, higher performance, and “planet-scale” expansion, which GCP lacks.

Google stores most of its source code in a single monorepo known as “Google3”. This repository is enormous, containing billions of lines of code across millions of files. Engineers practice trunk-based development: everyone works on the same main branch, making changes in short-lived branches that are merged back into trunk. Google has found this development model improves engineering productivity.

Google’s tech stack officially supports programming languages including C++, Kotlin, Java, Python, Go, and TypeScript, with heavy use of Protobuf and Stubby. Google maintains style guides for most languages, and these are almost always enforced. The company has also built a large number of in-house tools, such as Piper, Fig, Critique, Blaze, Cider, Tricorder, and Rosie, to support development work.

For systems design, Google uses a service-oriented architecture in which services communicate via Stubby, the default inter-service communication mechanism; gRPC is rarely used internally and is mostly reserved for external-facing services. Google also runs self-developed compute and storage systems such as Borg, Omega, and Kubernetes, alongside database systems like BigQuery, Bigtable, and Spanner.

“What’s it really like, working at Google?” is the question this mini series looks into. To get the details, we’ve talked with 25 current and former software engineers and engineering leaders between levels 4 and 8. We also spent the past year researching: crawling through papers and books discussing these systems. The process amassed a wealth of information and anecdotes that are combined in this article (and mini-series). We hope it adds up to an unmatched trove of detail compared to what’s currently available online.

In Part 1, we covered Google’s engineering and manager levels, compensation philosophy, hiring processes, and touched on what makes the company special. Today, we dig into the tech stack because one element that undoubtedly makes the company stand out in the industry is that Google is a tech island with its own custom engineering stack.

We cover:

    Planet-scale infra. Google’s internal infrastructure was built for ‘planet-scale’ by default, but Google Cloud does not support this out of the box; hence, most engineering teams build on Google’s PROD stack, not GCP.

    Monorepo. Also known as “Google3,” 95% of all Google’s code is stored in one giant repository that has billions of lines. Trunk-based development is the norm. Also, the monorepo doesn’t mean Google has a monolithic codebase.

    Tech stack. C++, Kotlin, Java, Python, Go, and TypeScript are officially supported, with heavy use of Protobuf and Stubby. Google has language style guides for most languages that are almost always enforced.

    Dev tooling. A different dev tool stack from any other workplace. Goodbye GitHub, Jenkins, VS Code, and other well-known tools: hello Piper, Fig, Critique, Blaze, Cider, Tricorder, Rosie, and more.

    Compute and storage. Borg, Omega, Kubernetes, BNS, Borgmon, Monarch, Viceroy, Analog, Sigma, BigQuery, Bigtable, Spanner, Vitess, Dremel, F1, Mesa, GTape, and many other custom systems Google runs on. This infra stack is unlike anywhere else’s.

    AI. Gemini is integrated inside developer tools and most internal tools – and Google is heavily incentivizing teams to build AI whenever possible. Teams can request GPU resources for fine tuning models, and there’s a pile of internal GenAI projects.

1. Planet-scale infra

Google’s infrastructure is distinct from every other tech company because it’s all completely custom: not just the infra, but also the dev tools. Google is a tech island, and engineers joining the tech giant can forget about tools they’re used to – GitHub, VS Code, Kubernetes, etc. Instead, it’s necessary to use Google’s own version of the tool when there’s an equivalent one.

Planet-scale vs GCP

Internally, Google engineers use “planet scale” to describe the company’s capacity to serve every human on Earth. All its tooling operates at global scale. That’s in stark contrast to Google Cloud Platform (GCP), which has no such “planet-scale” deployment options built in – it’s possible to build applications that can scale that big, but it takes a lot of extra work. Large GCP customers that have managed to scale GCP infrastructure to planetary proportions include Snap, which uses GCP and AWS as its cloud backend, and Uber, which uses GCP and Oracle, as detailed in Inside Uber’s move to the cloud.

Google doesn’t only run the “big stuff” like Search and YouTube on planet-scale infrastructure; lots of greenfield projects are built and deployed on this stack, called PROD.

As an aside, the roots of the database company PlanetScale (whose database Cursor currently runs on) trace back to Google and its “planet-scale” systems. Before co-founding PlanetScale, Sugu Sougoumarane worked at Google on YouTube, where he created Vitess, an open source database system for scaling MySQL. Sugu now works on Multigres, an adaptation of Vitess for Postgres. I asked him where the name PlanetScale comes from. He said:

“The first time I heard the term ‘planet-scale’ was at Google. I chuckled a bit when I heard it because it’s not possible to build a globally-distributed ACID database without trade-offs. But then, Vitess was already running at “planet-scale” at YouTube, with data centers in every part of the world.

So, when we decided to name it PlanetScale, it was a bold claim, but we knew Vitess could uphold it.”

PlanetScale originally launched with a cloud-hosted instance of Vitess and gained popularity thanks to its ability to support large-scale databases. It’s interesting to see Google’s ‘planet-scale’ ambition injected into a database startup, co-founded by a Google alumnus!

PROD stack

“PROD” is the name for Google’s internal tech stack, and by default, everything is built on PROD: both greenfield and existing projects. There are a few exceptions for things built on GCP; but being on PROD is the norm.

According to a current Staff software engineer, some Googlers think PROD should not be the default. They told us:

“A common rant that I hear from many Googlers – and a belief I also share – is that very few services need to actually be ‘planet-scale’ on day 1! But the complexity of building a planet-scale service even on top of PROD, actually hurts productivity and go-to-market time for new projects.

Launching a new service takes days, if not weeks. If we used a simpler stack, the setup would take seconds, and that’s how long it ought to take for new projects that might not ever need to scale! Once a project gets traction, there should be enough time to add planet-scale support or move over to infra that supports this.”

Building on GCP can be painful for internal-facing products. A software engineer gave us an example:

“There are a few examples of internal versions of products built on GCP that did have very different features or experiences.

For example, the internal version of GCP’s Pub/Sub is called GOOPS (Google Pub/Sub). To configure GOOPS, you could not use the nice GCP UI: you needed to use a config file. Basically, external customers of GCP Pub/Sub have a much better developer experience than internal users.”

It makes no sense to use a public GCP service when there’s one already on PROD. Another Google engineer told us the internal version of Spanner (a distributed database) is much easier to set up and monitor. The internal tool to manage Spanner is called Spanbob, and there’s also an internal, enhanced version of SpannerSQL.

Google released Spanner on GCP as a public-facing service. But if an internal Google team used GCP Spanner, they could not use Spanbob – and would have to do a lot more work just to set up the service! – and could not use the internal, enhanced SpannerSQL. So, it’s understandable that virtually all Google teams choose tools from the PROD stack, not the GCP one.

The only Big Tech not using its own cloud for new products

Google is in a position where none of its “core” products run on GCP infrastructure: not Search, not YouTube, not Gmail, not Google Docs, nor Google Calendar. New projects are built on PROD by default, not GCP.

Contrast this with Amazon and Microsoft, which do the opposite:

Why do Google’s engineering teams resist GCP?

A current Google software engineer summed it up:

“The internal infra is world class and probably the best in the industry. I do think more Google engineers would love to use GCP, but the internal infra is purpose-built, whereas GCP is more generic to target a wider audience.”

Another software engineer at the company said:

“In the end, PROD is just so good that GCP is a step down in comparison. This is the case for:

    Security – comes out of the box. GCP needs additional considerations and work

    Performance – it’s easy to get good performance out of the internal stack

    Simplicity – trivial to integrate internally, whereas GCP is much more work

A big reason to use GCP is for dogfooding, but doing so comes with a lot of downsides, so teams looking for the best tool just use PROD”.

The absence of a top-down mandate is likely another reason. Moving from your own infra onto the company’s cloud is hard! When I worked at Skype as part of Microsoft in 2012, we were given a top-down directive to move Skype fully over to Azure. The Skype Data team, which sat next to me, did that work, and it was a grueling, difficult process because Azure just didn’t have good-enough support or reliability at the time. But as it was a top-down order, it eventually happened anyway! The Azure team prioritized the needs of Skype and made the necessary improvements, and the Skype team made compromises. Without pressure from above, the move would never have happened, since Skype had a laundry list of reasons why Azure was suboptimal infrastructure, compared to the status quo.

Google truly is a unique company, with internal infrastructure that engineers consider much better than its public cloud, GCP. Perhaps this approach also explains why GCP is the #3 cloud provider and shows few signs of catching AWS and Azure. After all, Google is not giving its own cloud a vote of confidence – never mind a top-down adoption mandate! – as Amazon and Microsoft did with theirs.

2. Monorepo

Google stores all code in one repository called the monorepo – also referred to as “Google3”. The size of the repo is staggering – here are stats from 2016:

Today, the scale of Google’s monorepo has surely increased several times over.

The monorepo stores most source code. Notable exceptions are open-sourced projects:

As a fun fact, these open source projects were long hosted on an internal Git host called “git-on-borg” for easy internal access (we’ll cover more on Borg in the Compute and Storage section). This internal repo was then mirrored externally.

Trunk-based development is the norm. All engineers work in the same main branch, and this branch is the source of truth (the “trunk”). Devs create short-lived branches to make a change, then merge back to trunk. Google’s engineering team has found that the practice of having long-lived development branches harms engineering productivity. The book Software Engineering at Google explains:

“When we include the idea of pending work as akin to a dev branch, this further reinforces that work should be done in small increments against trunk, committed regularly.”

In 2016, Google already had more than 1,000 engineering teams working in the monorepo, and only a handful used long-lived development branches. In all cases, using a long-lived branch boiled down to having unusual requirements, with supporting multiple API versions a common reason.

Google’s trunk-based development approach is interesting because Google is probably the single largest engineering organization in the world, and it takes large platform teams to support the monorepo tooling and build systems that make trunk-based development possible. In the outside world, trunk-based development has become the norm across most startups and scaleups: tools that support stacked diffs are a big help.

Documentation often lives in the monorepo, and this can create problems. All public documentation for APIs on Android and Google Cloud is checked into the monorepo, which means documentation files are subject to the same readability rules as Google code. Google has strict readability constraints on all source code files (covered below). However, code samples meant for external readers deliberately don’t follow the internal readability guidelines!

For this reason, it has become best practice to keep code samples outside the monorepo in a separate GitHub repository, or otherwise exempt them from the readability review (for example, by naming an example file quickstart.java.txt).

For example, here’s an older documentation example where the source code is in a separate GitHub repository file to avoid Google’s readability review. For newer examples like this one, code is written directly into the documentation file which is set up to not trigger a readability review.

Not all of the monorepo is accessible to every engineer. The majority of the codebase is, but some parts are restricted:

Architecture and systems design

In 2016, Google engineering manager Rachel Potvin explained that despite the monorepo, Google’s codebase was not monolithic. We asked current engineers there if this is still true, and were told there’s been no change:

“I honestly notice very little difference from orgs that use separate repos, like at AWS. In either case, we had tools to search across all code you have permissions for.” – L6 Engineering Manager at Google.

“The software engineering process is distributed, not centralized. The build system is like Bazel (internally, it’s called Blaze). Individual teams can have their own build target(s) that goes through a separate CI/CD.” – L6 Staff SWE at Google.
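As a rough illustration of those per-team build targets, here is what a BUILD file looks like in open-source Bazel, which shares its Starlark syntax with Blaze. The package and target names below are invented for the example:

```starlark
# Hypothetical BUILD file for one team's directory in the monorepo.
# Each team owns its own targets, which other teams depend on by label.

py_library(
    name = "payments",
    srcs = ["payments.py"],
    # Depend on another team's target elsewhere in the repo by its label:
    deps = ["//billing/api:client"],
)

py_binary(
    name = "payments_server",
    srcs = ["payments_server.py"],
    deps = [":payments"],
)
```

Targets like these can be built and tested independently (e.g. `bazel build //payments:payments_server`), which is how separate per-team CI/CD can work on top of a single repository.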

Another Big Tech that has been using a monorepo since day one is Meta. We covered more on its monorepo in Inside Meta’s engineering culture.

Each team at Google chooses its own approach to system design, which means products are often designed differently! Similarities lie in the infra and dev tooling all systems use, and in low-level components like Protobuf and Stubby, the internal equivalent of gRPC. Below are a few common themes from talking with 20+ Googlers:

A current Google engineer summarized the place from an architectural perspective:

“Google feels like many small cathedrals contained in one large bazaar.”

Chaotic bazaar of neatly-built cathedrals. Image by Gemini

This metaphor derives from the book The Cathedral and the Bazaar, where the cathedral refers to closed-source development (organized, top-down), and the bazaar to open-source development (less organized, bottom-up).

A few interesting details about large Google services:

3. Tech stack

Officially-supported programming languages

Internally, Google officially supports the programming languages below – meaning there are dedicated tooling and platform teams for them:

Engineers can use other languages, but they just won’t have dedicated support from developer platform teams.

TypeScript is replacing JavaScript in Google, several engineers told us. The company no longer allows new JavaScript files to be added, but existing ones can be modified.

Kotlin is becoming very popular, not just on mobile, but on the backend. New services are written almost exclusively using Kotlin or Go, and Java feels “deprecated”. The push towards Kotlin from Java is driven by software engineers, most of whom find Kotlin more pleasant to work with.

For mobile, these languages are used:

Language style guides are a thing at Google and each language has its own style guides. Some examples:

Interoperability and remote procedure calls

Protobuf is Google’s approach to interoperability, which is about working across programming languages. Protobuf is short for “Protocol buffers”: a language-neutral way to serialize structured data. Here’s an example protobuf definition:

edition = "2024";

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

This can be used to pass the Person object across different programming languages; for example, between a Kotlin app and a C++ one.
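To make the “language-neutral” point concrete, here is a hand-rolled sketch of how that Person message is laid out in protobuf’s wire format. Real code would use protoc-generated classes rather than manual encoding; this just shows that the bytes themselves carry no language-specific structure:

```python
# Sketch: manually encoding the Person message in protobuf wire format.
# Each field is a tag byte (field_number << 3 | wire_type) followed by
# the payload; wire type 0 = varint, wire type 2 = length-delimited.

def _varint(value: int) -> bytes:
    """Encode a non-negative int as a protobuf varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def _len_delimited(field_no: int, data: bytes) -> bytes:
    """Encode a length-delimited field (strings, bytes, sub-messages)."""
    return _varint((field_no << 3) | 2) + _varint(len(data)) + data

def encode_person(name: str, id_: int, email: str) -> bytes:
    return (
        _len_delimited(1, name.encode("utf-8"))       # string name = 1;
        + _varint((2 << 3) | 0) + _varint(id_)        # int32 id = 2;
        + _len_delimited(3, email.encode("utf-8"))    # string email = 3;
    )

wire = encode_person("Ada", 1, "ada@example.com")
# Any language's protobuf runtime can decode these same bytes
# back into its own Person object.
```

The fact that a Kotlin client and a C++ server only have to agree on these bytes, not on any in-memory representation, is what makes protobuf work across languages.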

An interesting detail about Google’s own APIs – whether gRPC, Stubby, or REST – is that they are all defined using protobuf. The definition then generates API clients for all languages, so internally it’s easy to use these clients and call an API without worrying about the underlying protocol.

gRPC is a modern, open source, high-performance remote procedure call (RPC) framework for communication between services. Google open sourced and popularized this communication protocol, which is now a popular alternative to REST. The biggest difference between REST and gRPC is that REST uses a human-readable format (typically JSON over HTTP), while gRPC uses a binary format and outperforms REST with smaller payloads and less serialization and deserialization overhead. Internally, Google services tend to communicate using Stubby, the “Google-internal gRPC implementation”, and not REST.
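To make the payload-size point concrete, here is a toy comparison (not protobuf’s actual encoding) between a JSON payload and a naive binary layout of the same record:

```python
# Illustrative only: JSON carries field names and punctuation on the
# wire, while a binary layout carries only lengths and raw values --
# the core reason binary protocols like gRPC/protobuf send less data.
import json
import struct

record = {"name": "Ada", "id": 1, "email": "ada@example.com"}

json_payload = json.dumps(record).encode("utf-8")

# Naive binary layout: length-prefixed strings plus a 4-byte int,
# with no field names on the wire ("<" = no alignment padding).
name = record["name"].encode("utf-8")
email = record["email"].encode("utf-8")
binary_payload = struct.pack(
    f"<B{len(name)}siB{len(email)}s",
    len(name), name, record["id"], len(email), email,
)

print(len(json_payload), len(binary_payload))  # binary is smaller
```

Real gRPC adds framing and HTTP/2 overhead on top, but the per-record saving of a schema-driven binary encoding is the same in spirit.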

Stubby is the internal version of gRPC and the precursor to it. Almost all service-to-service communication runs over Stubby; in fact, each Google service exposes a Stubby API. gRPC is only used for external-facing communication, such as making gRPC calls to and from external services.

The name “Stubby” comes from the fact that protobufs can contain service definitions, from which stubs can be generated in each language. And from “stub” comes “Stubby.”
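As a sketch of what such generated stubs look like (toy code, not Stubby’s or gRPC’s real API): a service definition in the .proto file, such as a hypothetical `service PersonService { rpc GetPerson(...) returns (Person); }`, yields a per-language client class whose methods hide serialization and transport:

```python
# Toy illustration of a generated RPC client stub. A real stub would
# serialize protobuf messages and send them over a network channel;
# here a dict-backed in-process "channel" stands in for both.

class Channel:
    """Stand-in for a network channel: routes a method name + request."""
    def __init__(self, handlers):
        self._handlers = handlers

    def call(self, method: str, request: dict) -> dict:
        return self._handlers[method](request)

class PersonServiceStub:
    """What generated code provides: one method per rpc in the .proto."""
    def __init__(self, channel: Channel):
        self._channel = channel

    def GetPerson(self, request: dict) -> dict:
        # The generated method makes a remote call look like a local one.
        return self._channel.call("GetPerson", request)

# "Server" side registers a handler; callers just invoke stub methods.
channel = Channel({"GetPerson": lambda req: {"id": req["id"], "name": "Ada"}})
stub = PersonServiceStub(channel)
print(stub.GetPerson({"id": 7}))
```

The caller never touches wire formats or sockets, which is exactly the convenience the “stub” in Stubby refers to.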

4. Dev tooling

In some ways, Google’s day-to-day tooling for developers most clearly illustrates how different the place is from other businesses:

Dev tools at most companies, vs at Google

Let’s go through these tools and how they work at the tech giant:

Read more
