https://eugeneyan.com/rss Sep 30
The Evolution and Differentiation of Data Scientist Roles

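In 2012, the data scientist was named the sexiest job of the 21st century. Today, this catch-all role has been split into several positions, including data scientist, applied scientist, research scientist, and machine learning engineer. This article explains the goals, skills, and deliverables of each role to help job seekers and hiring managers better understand how they differ. It also looks at how data science has specialized and rebranded itself over the past decade, and what those changes mean for teams and practitioners. It covers the shift from data analyst to data scientist and what each role uses and delivers: data scientists drive decisions through analysis, applied scientists use machine learning to improve business outcomes, research scientists develop new methods and techniques, and machine learning engineers build infrastructure and platforms to scale machine learning capabilities. Finally, it discusses what these roles mean for the field and whether teams should adopt these distinct roles and titles.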

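🔍 Data scientists drive decisions via analysis: Data scientists help to size problems and opportunities, understand customers and the business, interpret A/B tests with mixed results, and so on. They have the most contact with business and product leaders and are often viewed as trusted advisors. Their work involves data analysis, visualization, and weaving a coherent narrative. They may also write SQL/Hive/Spark queries to extract and clean data, and Python/R to run analyses and create visualizations. They also have strong non-technical skills, such as distilling and framing problem statements, explaining complex findings and recommendations, and educating the organization on the nuances of statistics and data. Their deliverables include documents, visualizations, and dashboards that provide insight and guide decisions.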

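📊 Applied scientists use machine learning to improve business outcomes: Applied scientists use machine learning to improve business outcomes (e.g., revenue, cost, customer experience). The systems they build may be internal (e.g., product classification, fraud detection) or customer-facing (e.g., search, recommendations). Beyond use-case driven applications, they may also develop internal datasets, tooling, and methodology (e.g., feature stores, package/container templates, model testing and release checks). In addition to tools for extracting and cleaning data, applied scientists use machine learning and deep learning libraries. They may also use tools for containerization (e.g., Docker), orchestration (e.g., Airflow), CI/CD (e.g., Jenkins), and prototyping (e.g., FastAPI). Similar to data scientists, applied scientists convert business problems (e.g., increasing revenue) into solutions (e.g., increased customer acquisition? improved search or recommendations? pricing models?). To go from problem to production, they need know-how on building data pipelines, experimentation and prototyping, training and deploying machine learning models, and basic software engineering and DevOps. Their deliverables include code for machine learning systems and documents on their design, methodology, and experiments.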

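🔬 Research scientists develop new methods and techniques: We have research scientists (and academics) to thank for the rapid advances in fields such as deep learning, computer vision, and natural language processing. They tend to investigate more fundamental problems (e.g., model compression, image segmentation, speech-to-text) with a longer time horizon. Outside of the tech giants and research labs, few organizations have the budget and appetite to fund such roles. Research scientists often work on publicly available datasets and benchmarks, such as ImageNet for computer vision, General Language Understanding Evaluation (GLUE) for natural language understanding, and other open-source datasets (e.g., recommendations datasets, Kaggle). Most research scientists use deep learning libraries (other approaches have fallen out of favor) and have deep expertise in their niche. They also excel at literature research, reproducible experiments, and publishing papers at conferences and in journals. Their deliverables include papers and code to replicate their models and results.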

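🏗️ Machine learning engineers build infrastructure and platforms to grow capabilities: Machine learning engineers (MLEs) work on infrastructure and platforms that make it easier to build, deploy, and monitor machine learning models. Examples include Uber's Michelangelo, Netflix's Metaflow, and Amazon's SageMaker. They may also work directly on application-specific machine learning systems in a capacity similar to an applied scientist, or help with deploying and scaling machine learning models for production. Most MLEs have prior experience as software engineers and have stronger backgrounds in software development, DevOps, and engineering best practices. Compared to scientists who write Python, MLEs tend to build systems in more enterprise and performant languages (e.g., Java/Scala, C++, Go). For MLEs, there is greater emphasis on technical design and architecture, infrastructure, scalability, security, and so on. Relative to scientists who are mainly users of tools (e.g., Docker, Airflow, Jenkins), MLEs are the ones who set up the infrastructure and processes around them; they are essentially software engineers with a machine learning tilt.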

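🔄 Role evolution and specialization: The data engineer is a relatively new role that combines backend engineer, database administrator, and SQL developer. The increase in data collection and storage, and the growth of specialized data infrastructure (e.g., Hadoop, Spark, Kafka), created the need for this role. We also see the traditional catch-all software engineer split into frontend, backend, mobile, cloud, DevOps, quality assurance, and so on. Before data scientists, there were statisticians. Google's Chief Economist, Hal Varian, explained why statisticians would be the sexiest job of the 2000s. That was in September 2009, barely three years before Harvard Business Review named data scientist the sexiest job of the 21st century.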

In 2012, the data scientist was named the sexiest job of the 21st century. Now in 2020, this catch-all role is more often split into multiple roles such as data scientist, applied scientist, research scientist, and machine learning engineer.

I used to get questions like “What does a data scientist do?” Now, I get questions such as “What does a data/applied/research scientist do? What is a machine learning engineer? How are they different from each other?”

Here’s my attempt to explain the goals, skills, and deliverables of each role. If you’re trying to enter or transition within the field of data science, I hope this will help you get the right role—one that matches your interests and skills. We’ll also take a look at how data science has specialized and rebranded itself in the past decade, and what those changes mean for teams and practitioners.

Disclaimer: This is my personal take based on chats I’ve had with folks in various organizations and roles. It does not represent the views of my employer. Also, not all companies adopt these titles, so don’t take it too seriously if you feel like you have a title mismatch. In addition, role distinctions might blur depending on project phases.

How did data scientists “become” data analysts?

Recently, I was amused to learn that the data scientist title has gained a bad rap, with some claiming that they’re "just glorified analysts". I think this view is misguided and unfair. The data scientists that I know work on increasingly sophisticated A/B testing and inference techniques that are high-impact. Nonetheless, it’s interesting to learn how it came about.

In August 2017, the Reddit community noticed that data scientists at Facebook were mostly doing data analyst work. A commenter shared that the generic data science role at Facebook was a “product analytics data scientist”, focused on data extraction, analysis, and statistics. (Another commenter also shared how Nielsen’s Measurement Science team became the Data Science team overnight.) To distinguish between product/analysis-focused and research-focused data scientists, Facebook created a Core Data Science group focused on research.

Then in April 2018, Lyft rebranded their data analysts as data scientists, and data scientists as research scientists. Here’s how they drew the line: data scientists extract insights from data, track business metrics, and drive better decision making; research scientists build math models and algorithms that power the product. Why rebrand? Lyft shared an example of losing data analytics candidates to competitors offering the data scientist title.

Data analytics and data science talent is hard to come by. If offering candidates a seemingly more prestigious title gives a competitive advantage in hiring (and at zero cost), why not? It’s likely many other teams followed suit. As a result, the data science title is now synonymous with data analytics.

What the different roles use and deliver

Data scientists drive better decisions via analysis

Data scientists help to size problems and opportunities, understand customers and the business, interpret A/B tests with mixed results, and so on. They have the most contact with business and product leaders and are often viewed as trusted advisors. Their work involves data analysis, visualization, and weaving a coherent narrative. They might also be called decision scientists, with Cassie Kozyrkov being a famous example.

In God we trust, all others bring data. — W Edwards Deming

Data scientists often write SQL/Hive/Spark queries to extract and clean data, and Python/R to run analyses and create visualizations. They might also build data pipelines and dashboards for recurring analyses.
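To make "Python to run analyses" a bit more concrete, here's a minimal sketch of one common analysis: checking whether the difference in conversion rate between two A/B test arms is larger than chance would explain. It uses a pooled two-proportion z-test, which is my choice for illustration rather than something prescribed by any team mentioned here. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control converts 200/10,000, treatment 250/10,000.
z, p = two_proportion_ztest(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

In practice the harder part is usually the framing around the number: which metric matters, whether the test ran long enough, and how to explain a result that moves one metric up and another down.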

They also have strong non-technical skills to: distill and frame problem statements, explain complex findings and recommendations, and educate the organization on the nuances of statistics and data. Their deliverables include documents, visualizations, and dashboards to provide insight and guide decisions.

Applied scientists build ML systems to improve business outcomes

Applied scientists use ML to improve business outcomes (e.g., revenue, cost, customer experience). The systems they build may be internal (e.g., product classification, fraud detection) or customer-facing (e.g., search, recommendations). Outside of use-case driven applications, they might also develop internal datasets, tooling, and methodology (e.g., feature stores, package/docker templates, model testing & release checks).

In addition to tools for extracting and cleaning data, applied scientists also use machine learning and deep learning libraries. They may also use tools for containerization (e.g., Docker), orchestration (e.g., Airflow), CI/CD (e.g., Jenkins), and prototyping (e.g., FastAPI).

Similar to data scientists, applied scientists convert business problems (e.g., increasing revenue) into solutions (e.g., increased customer acquisition? improved search or recommendations? pricing models?) To go from problem to production, they need know-how on: building data pipelines, experimentation and prototyping, training and deploying ML models, and basic software engineering and devops. Their deliverables include code for ML systems and documents on their design, methodology, and experiments.
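As an illustration of the "model testing & release checks" mentioned above, here's a rough sketch of a gate that compares a candidate model's offline metrics against the current baseline before release. Everything in it (the `MetricCheck` class, the metric names, the thresholds, and the numbers) is hypothetical; real checks depend on the system and the team.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricCheck:
    name: str                # metric key, e.g. "auc"
    higher_is_better: bool
    max_regression: float    # tolerated worsening versus the baseline


def release_check(baseline: Dict[str, float], candidate: Dict[str, float],
                  checks: List[MetricCheck]) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for check in checks:
        base, cand = baseline[check.name], candidate[check.name]
        # Express the change so that a positive value always means "worse".
        regression = (base - cand) if check.higher_is_better else (cand - base)
        if regression > check.max_regression:
            failures.append(f"{check.name}: {base:.3f} -> {cand:.3f}")
    return failures


# Hypothetical offline metrics for a fraud detection model.
CHECKS = [
    MetricCheck("auc", higher_is_better=True, max_regression=0.005),
    MetricCheck("p99_latency_ms", higher_is_better=False, max_regression=5.0),
]
baseline = {"auc": 0.912, "p99_latency_ms": 41.0}
candidate = {"auc": 0.921, "p99_latency_ms": 52.0}
print(release_check(baseline, candidate, CHECKS))
```

Here the candidate improves AUC but is 11 ms slower at p99, so the latency check fails even though the model is "better". A gate like this would typically run in CI alongside unit tests.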

Research scientists develop new methodology and techniques

We have research scientists (and academics) to thank for the brisk advances in fields such as deep learning, computer vision, and natural language processing. They tend to investigate problems that are more fundamental (e.g., model compression, image segmentation, speech-to-text) with a longer time horizon. Outside of the tech giants and research labs, few organizations have the budget and appetite to fund such roles.

Research is creating new knowledge. — Neil Armstrong

Research scientists often work on publicly available datasets and benchmarks, such as ImageNet for computer vision, General Language Understanding Evaluation (GLUE) for natural language understanding, and other open-source datasets (e.g., recommendations datasets, Kaggle).

Most will use deep learning libraries—other approaches have fallen out of favor—and have deep and specialized knowledge in their niche. They also excel in literature research, reproducible experiments, and publishing papers at conferences and journals. Their deliverables include papers and code to replicate their models and results.
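On reproducible experiments: one small habit that helps is fixing random seeds and reporting results across several of them, instead of reporting a single lucky run. Below is a toy sketch of that idea. `run_experiment` is a hypothetical stand-in for a real training run, and the 80% accuracy is made up.

```python
import random
import statistics
from typing import List


def run_experiment(seed: int, n_samples: int = 1_000) -> float:
    """Stand-in for one training run; returns a noisy 'accuracy'."""
    rng = random.Random(seed)  # local RNG, so no hidden global state
    # Hypothetical model that is right about 80% of the time.
    correct = sum(rng.random() < 0.8 for _ in range(n_samples))
    return correct / n_samples


def run_seeds(seeds: List[int]) -> List[float]:
    """Run the experiment once per seed and collect the scores."""
    return [run_experiment(seed) for seed in seeds]


SEEDS = [0, 1, 2, 3, 4]
scores = run_seeds(SEEDS)
print(f"accuracy = {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```

Because each run owns its RNG, rerunning with the same seeds gives identical scores, which is what lets someone else replicate the reported numbers. With real deep learning libraries there are more sources of nondeterminism to pin down (framework seeds, GPU kernels, data loading order), so treat this as the idea rather than a recipe.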

Machine learning engineers build infra and platforms to grow capabilities

Machine learning engineers (MLEs) work on infrastructure and platforms that make it easier to build, deploy, and monitor machine learning models. Examples include Michelangelo (Uber), Metaflow (Netflix), and SageMaker (Amazon). They might also work directly on application-specific ML systems in a capacity similar to an applied scientist, or help with deploying and scaling ML models for production.

Most MLEs have prior experience as a software engineer and have stronger backgrounds in software development, devops, and engineering best practices. Compared to Python-slinging scientists, MLEs tend to build systems on more enterprise and performant languages (e.g., Java/Scala, C++, Go).

For MLEs, there’s greater emphasis on technical design and architecture, infra, scalability, security, etc. Relative to scientists who are mainly users of tools (e.g., Docker, Airflow, Jenkins), MLEs tend to be the ones who set up the infra and processes around those tools; they’re essentially software engineers with an ML tilt.
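To give a flavor of the "monitor" part of an ML platform, here's a small sketch of a drift check that compares the distribution of live model scores against a reference window. It uses the Population Stability Index (PSI), a metric I'm bringing in for illustration; the platforms named above may do this quite differently. The function name, bin count, and sample data are all assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float], n_bins: int = 10) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so that a score of exactly 1.0 falls in the last bin.
            counts[min(int(v * n_bins), n_bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_fractions(reference), bin_fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical data: uniform reference scores, live scores shifted upward.
reference_scores = [i / 100 for i in range(100)]
live_scores = [min(s + 0.3, 0.99) for s in reference_scores]

print(f"PSI vs. itself:  {psi(reference_scores, reference_scores):.3f}")
print(f"PSI vs. shifted: {psi(reference_scores, live_scores):.3f}")
```

An MLE would usually wrap a check like this in a scheduled job, with alerting and dashboards, so that scientists get the signal without having to build the plumbing themselves.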

Whew, that was a lot of text; here’s a table summarizing the above.

| | Data Scientist | Applied Scientist | Research Scientist | ML Engineer |
| --- | --- | --- | --- | --- |
| Goal | Perform analysis to guide better decisions | Build ML systems to improve business outcomes | Develop new methodology and techniques | Build infra and platforms for ML capabilities |
| Tools | SQL, Hive, Python/R, dashboards | SQL, Hive, Python, ML libraries, Docker, FastAPI, etc. | Python, deep learning libraries, LaTeX | Python, Java/Scala, C, Go, Docker, Jenkins, etc. |
| Skills | Statistical analysis, visualization, dashboarding, storytelling, A/B testing | Data pipelines, machine/deep learning, experimentation and prototyping, software engineering, devops | Research, experiments on industry/academic benchmarks, publishing papers | Software development, devops, scalability, security, etc. |
| Deliverables | Documents for insight and decisions, dashboards | Code for ML systems, documents on design, methodology, and experiments | Papers and code to demonstrate findings | Code for infra and platforms, documentation |

Summary table of the various data & ML roles

What the various roles mean for the field

While having more titles can be confusing, the specialization should help clarify the goals and skills required for each role. Back in South-East Asia, all roles had the title of data scientist. The job scope was akin to rolling a die; it might entirely involve analysis and dashboarding, or building ML systems, or client sales. Having more fine-grained titles can help job seekers and hiring managers achieve better job-fit.

Breaking down the DS role also makes it less overwhelming for newcomers. There’s an unhealthy expectation to be proficient at 23 skills/techniques and 8 tools to be a (unicorn) data scientist. Separate roles make clear it’s not necessary. Data scientists don’t have to write production code; MLEs don’t need to know the nuances of research and publishing. Specialized roles allow people to focus on their strengths and interests.

119 easy steps to becoming a data scientist (source)

Nonetheless, specialization takes us further from end-to-end ownership and delivery. With MLEs, one anti-pattern is scientists writing POC code in Jupyter notebooks or R and handing it off to MLEs to convert and productionize. The context and methodology are often lost in translation, and diffusion of responsibility seeps in. It also slows down iteration and increases operations and maintenance costs.

Another concern is title inflation and loss of hiring competitiveness. As Lyft shared, the possibility of losing data analytics candidates to competitors offering “data scientist” titles is real. Will this escalate with companies losing machine learning candidates if they offer the data scientist title, relative to competitors offering the research scientist or machine learning engineer title? If so, we might see the field come up with more titles (e.g., deep learning scientist, AI engineer) that might obfuscate the real work being done.

Counterpoint: New roles emerge and rebranding occurs all the time in tech

The data engineer is a recent role that’s a combination of backend engineer, database administrator, and SQL developer. The increase in data collection and storage, and the growth in specialized data infra (e.g., Hadoop, Spark, Kafka, etc.), created the need for this role. We also see the traditional catch-all software engineer split into frontend, backend, mobile, cloud, devops, quality assurance, etc.

And before data scientists, there were statisticians.

Here’s Google’s Chief Economist, Hal Varian, explaining why statisticians will be the sexiest job of the 2000s. This was in Sep 2009, barely three years before the HBR article on data scientists being the sexiest role of the 21st century.

And in early 2014, some also viewed data scientists as statisticians, but living in San Francisco and using a Mac.

Should our team adopt these distinct roles and titles?

It depends.

AFAIK, these roles are mostly seen in relatively larger tech organizations (i.e., data science teams of >50). At that scale, there’s sufficient volume of work for each specialized role. However, in smaller companies, there might not be the need (or luxury) to split the data scientist and applied scientist roles, or the budget for research-focused roles with multi-year horizons. (Nonetheless, MLEs to help with production are increasingly common.)

And it appears to be a trend specific to the US. In South-East Asia, most of these roles are grouped under the title of data scientist. Nonetheless, regional tech unicorns such as Grab, Shopee, and Traveloka have large enough teams that may require such specialized roles. Chinese giants (e.g., Alibaba, Baidu, Tencent) will also have these different roles.

If you’re thinking of splitting the data scientist role into distinct specializations, please also consider the benefits of having data scientists be more end-to-end.

Conclusion: Skills + Deliverables > Titles

What if we’re performing the role of a <more prestigious title> but have a <less prestigious title>? Before we change jobs to upgrade titles, or accept a poorer offer with a better title, we should remind ourselves that it’s just a title. In the long run, focusing on deliverables expected and skills required will lead to better fit and job satisfaction.

P.S., But what if the role doesn’t match our skills and aspirations, like this guy? We’ll discuss that here.

Thanks to Yang Xinyi, Alexandra Macqueen, and Michell C. Clark for reading drafts of this.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Nov 2020). Applied / Research Scientist, ML Engineer: What’s the Difference?. eugeneyan.com. https://eugeneyan.com/writing/data-science-roles/.

or

@article{yan2020roles,
  title   = {Applied / Research Scientist, ML Engineer: What’s the Difference?},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2020},
  month   = {Nov},
  url     = {https://eugeneyan.com/writing/data-science-roles/}
}