How to get started in Data Science

This article shares the tools and skills needed to get started in data science, including tools such as SQL, Python/R, and Spark, and key skills such as probability and statistics, machine learning, and communication. It recommends building these skills through project work, volunteering, and public speaking and writing, and stresses the importance of practice. The content is based on the author's experience at Lazada and applies to most data science roles, though these are minimum thresholds and do not guarantee success.

💡 SQL is a core data science tool for querying, extracting, and analyzing data, and it is flexible and easy to share. Mastering the basic SQL commands covers roughly 80% of day-to-day needs; the rest can be looked up when required.

🐍 Python and R offer rich libraries such as pandas, dplyr, bokeh, and ggplot2 that simplify statistical analysis and visualization, plus machine learning support (e.g., scikit-learn and caret), making them indispensable languages for data science.

🔥 Spark is suited to processing large datasets; understanding its distributed storage and compute concepts (such as map-reduce and shuffle optimization) matters more than memorizing code and makes data processing more efficient.

📊 Probability and statistics are the foundation of data science, helping to separate signal from noise, for example by validating hypotheses through experimental design (such as A/B testing) and statistical methods to solve real problems.

🤖 Machine learning takes up only about 20% of a data scientist's time, yet it enables automated, scalable problem solving, such as categorizing products automatically and saving manual effort.

🗣️ Communication is crucial for data scientists: complex technical results must be translated into simple, clear language so that non-technical stakeholders see their value and get an answer to "How does this help me?"

🛠️ Personal projects (such as building a data product API), volunteering (such as DataKind data dives), and public speaking and writing (such as blogging or presenting at conferences) are effective ways to build skills; practice consolidates what you learn.

🚀 Practice is the key to improvement: going from completing MOOCs to real application, repeated practice strengthens the brain's neural connections and builds muscle memory, eventually leading to mastery.

More than a handful of times have I been asked how to get into the field of data science. This includes SMU's Master of IT in Business classes, regular meet-ups (e.g., DataScience SG), and requests via email/LinkedIn. Though the conversations that follow differ depending on the person's background, a significant portion is applicable to most people.

I’m no data science rockstar. Neither am I an instructor that teaches how to get into data science. Nonetheless, here’s some previously shared advice on “How to get started in Data Science”, documented here so it can be shared in a more scalable manner.

What this post will (not) cover

This post will focus on the tools and skills (I find) essential in data science, and how to practice them. Every organization has different needs, and what’s listed is largely based on Lazada’s data science stack and process. Nonetheless, they should be applicable to most data science positions. These should be viewed as minimum thresholds, and they do not necessarily predict success in data science. They are:

    Tools: SQL, Python and/or R, Spark
    Skills: Probability and Statistics, Machine Learning, Communication
    Practice: Projects, Volunteering, Speaking and Writing

This post will not cover character traits, personalities, habits, etc. While there are some traits I find strongly correlated with success in data science (e.g., curiosity, humility, grit), we will not discuss them here. In some sense, these traits lead to success in all roles/life—not just data science.

SQL

This is the bread and butter of every data (science) person, and will probably be for a long time. While there are GUI tools that allow querying and extracting data without writing SQL, they are often not as powerful and flexible. Writing SQL allows you to join/filter/aggregate data as you wish, and the query can be easily shared with reproducible results. Learning 20% of what SQL can do should cover 80% of what you’ll need in day-to-day tasks—the rest can be googled when needed.
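To make the join/filter/aggregate point concrete, here is a minimal sketch using Python's built-in sqlite3 module; the tables, columns, and values are invented purely for illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables: customers and orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'SG'), (2, 'MY'), (3, 'SG');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0), (4, 3, 100.0);
""")

# Join, filter, and aggregate in one query: order count and total value per
# country, keeping only countries with more than one order.
query = """
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
    HAVING COUNT(*) > 1
    ORDER BY total_amount DESC;
"""
for row in conn.execute(query):
    print(row)  # ('SG', 3, 155.0)
```

The same query would run largely unchanged against most SQL databases, which is part of what makes SQL so easy to share and reproduce.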

Some sites provide an in-browser SQL engine, giving immediate feedback and making learning easier. Here are some I found useful:

    w3schools (step-by-step tutorial on basic SQL commands)
    sqlzoo (more practice and assessments on SQL)

Python / R

While SQL is sufficient for basic analysis (using filters, aggregates, etc.), you'll soon need to do more, such as statistical analyses, visualisations, and machine learning. Python and R have rich libraries that simplify these complex tasks. Some example libraries, for Python and R respectively, include:

    Data processing (pandas, dplyr)
    Visualization (bokeh, ggplot2)
    Machine Learning (scikit-learn, caret)
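As a small taste of how these libraries cut down on boilerplate, here is a minimal pandas sketch; the orders table is made up for illustration.

```python
import pandas as pd

# A tiny, made-up table of orders; in practice this would come from
# pd.read_csv() or a SQL query.
orders = pd.DataFrame({
    "category": ["fashion", "fashion", "electronics", "electronics", "fashion"],
    "amount": [20.0, 35.0, 120.0, 80.0, 5.0],
})

# Filter, group, and aggregate in a few lines.
summary = (
    orders[orders["amount"] > 10]
    .groupby("category")["amount"]
    .agg(["count", "sum", "mean"])
)
print(summary)
```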

Here are some MOOCs to pick up basic Python:

Here are some MOOCs to pick up basic R:

Spark

Eventually, you'll work with data that doesn't fit on your local machine, whether in RAM or on disk. To process this data efficiently, we turn to distributed processing frameworks such as MapReduce. One popular open-source data processing engine is Apache Spark.

Technologies come and go, and Hadoop/Spark is no exception (though they’ll probably be around for a while). With regard to learning Spark, I find it more useful to understand the concepts of distributed storage and processing. Learn how map and reduce tasks work across distributed nodes, which actions are parallelizable and which are not. Learn about shuffle and when it occurs, and how to minimise shuffle to make jobs more efficient.
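To make the shuffle point concrete, here is a minimal PySpark word-count-style sketch (assuming a local Spark installation); the data is made up. reduceByKey combines values within each partition before the shuffle, so far less data crosses the network than a naive groupByKey would move.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-demo")

# A tiny, made-up dataset of (word, count) pairs spread across 3 partitions.
pairs = sc.parallelize(
    [("shirt", 1), ("dress", 1), ("shirt", 1), ("shoes", 1), ("dress", 1)],
    numSlices=3,
)

# reduceByKey aggregates within each partition first (map-side combine),
# so only partial sums are shuffled across the network.
counts = pairs.reduceByKey(lambda a, b: a + b)

# groupByKey would instead shuffle every individual record before aggregating,
# which is typically much more expensive on large data:
# counts = pairs.groupByKey().mapValues(sum)

print(counts.collect())  # e.g., [('shirt', 2), ('dress', 2), ('shoes', 1)]
sc.stop()
```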

Here are some MOOCs to pick up basic Spark:

Skills and how to pick them up

Probability, statistics, and experimental design

Part of a data scientist’s role is to distinguish signal from noise and provide data-driven insights to solve problems. Probability and statistics help you find trends and insights that are significant (e.g., fashion items shown on white background, as opposed to a lifestyle background, have higher conversion).

These insights may lead to website/app changes (e.g., converting all fashion items to have a white background). As far as possible, these changes should be A/B tested by showing the different versions to similar users and measuring metrics such as click-through rate and conversion. For this, you'll need valid experimental design and the right statistical analyses.
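For example, here is a minimal sketch of checking whether a difference in conversion rates between two variants is statistically significant, using a chi-square test from scipy; the counts are invented for illustration.

```python
from scipy.stats import chi2_contingency

# Hypothetical A/B test results: [converted, did not convert] per variant.
control = [200, 9800]    # lifestyle background
treatment = [260, 9740]  # white background

chi2, p_value, dof, expected = chi2_contingency([control, treatment])
print(f"p-value: {p_value:.4f}")

# With a pre-chosen significance level (e.g., 0.05), a small p-value suggests
# the difference in conversion is unlikely to be noise alone.
if p_value < 0.05:
    print("Difference is statistically significant.")
else:
    print("No significant difference detected.")
```

In practice, you would also fix the sample size and significance level before running the experiment, rather than peeking at results as they trickle in.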

Here are some MOOCs to pick up basic probability, statistics, and experimental design:

Machine learning

Most people think data scientists spend 80% of their time on machine learning. My experience is the opposite: 80% of the time is spent engaging with stakeholders, acquiring/preparing/exploring data, and putting models into production, while only 20% is spent on machine learning.

Nonetheless, though machine learning is only 20% of time spent, it enables data scientists to solve problems in an automated and scalable way. For example, instead of having people manually categorize products, we can build a machine learning classifier to do it automatically, saving time, effort, and cost.
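As a rough illustration of that kind of automation, here is a minimal scikit-learn sketch that classifies product titles into categories; the titles and labels are invented, and a real system would need far more data, evaluation, and iteration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, invented training set of product titles and their categories.
titles = [
    "mens cotton polo shirt blue",
    "wireless bluetooth headphones",
    "womens running shoes size 7",
    "usb-c fast charging cable 1m",
    "floral summer dress",
    "gaming mouse with rgb lighting",
]
categories = ["fashion", "electronics", "fashion", "electronics", "fashion", "electronics"]

# TF-IDF text features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, categories)

# Categorize new products automatically instead of labelling them by hand.
print(model.predict(["floral maxi dress", "hdmi cable 2m"]))
```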

Here are some MOOCs to pick up machine learning:

    Machine Learning by Coursera & Stanford (Andrew Ng's famed course; one of the few that touch on gradient descent)
    Statistical Learning by Stanford Online (machine learning from a statistical perspective)

Communication (speaking and writing)

For data scientists, it is important to communicate findings and data products in a simple and clear manner. Most of the time, stakeholders will not have the statistical and machine learning know-how that you do—they require your help to simplify and reframe the outcomes.

Their main question will be: “How will this help me?”—you should answer this simply. While it’s amazing that you’ve created a model with excellent AUC / logloss (which they will not understand), they will not use it if you cannot demonstrate how it helps them.

My approach to improving communication has been to practice, practice, practice—which brings us to the next point.

Practice

After completing a few MOOCs, you may find yourself yearning to practice your skills in a real-world environment. Practice makes permanent: your brain's neurons become more strongly linked through repeated use, so practicing and applying what you've learned helps consolidate it. Here are some suggested avenues for practice.

Start your own project

My mission is to use data to create positive impact and improve lives. To gain practice building data products end-to-end, I built a product classification API using data scraped from Amazon's website, deployed it on AWS, and developed a frontend accessible via datagene.io.

Update: API discontinued to save on cloud cost.

Throughout the process, I learnt a lot beyond what is normally taught in schools/MOOCs. For example, I learnt how to set up and deploy an API on AWS EC2, build a simple web app using Flask, and develop basic, user-friendly HTML pages. Specific to data, I learnt how to work with large datasets efficiently (the data was 12GB and I restricted myself to using only Python as a challenge) and how to clean real-world dirty data (yes, even Amazon has dirty data).
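For a sense of what serving such a model can look like, here is a minimal Flask sketch; the endpoint name, request format, and predict_category helper are assumptions for illustration, not the actual datagene.io implementation.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_category(title: str) -> str:
    # Placeholder for a trained classifier (e.g., the scikit-learn pipeline above).
    return "fashion" if "shirt" in title.lower() else "unknown"

@app.route("/categorize", methods=["POST"])
def categorize():
    # Expect a JSON body like {"title": "mens cotton polo shirt"}.
    payload = request.get_json(force=True)
    title = payload.get("title", "")
    return jsonify({"title": title, "category": predict_category(title)})

if __name__ == "__main__":
    # On AWS EC2 you would typically run this behind a production WSGI server
    # (e.g., gunicorn) rather than Flask's built-in development server.
    app.run(host="0.0.0.0", port=5000)
```

You could then exercise it locally with something like: curl -X POST -H "Content-Type: application/json" -d '{"title": "mens cotton polo shirt"}' http://localhost:5000/categorize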

Eventually, I also learnt how to apply deep learning to images for classification and search.

As a bonus, I ended up with a simple portfolio to demonstrate what I’ve built.

Volunteer with NGOs

DataKind is an NGO that helps other NGOs use data more effectively. I had the opportunity to volunteer with DataKind SG on one of its DataDives (similar to a 2-day hackathon).

The DataDive involved helping HOME (Humanitarian Organization for Migration Economics). HOME works for the well-being and empowerment of migrant workers in Singapore. During the DataDive, we worked on anonymising the data, creating our own data dictionaries, and data cleaning and visualisation, with the aim of answering the questions HOME had of its data.

At the end of day two, HOME had a better understanding of the migrant worker situation in Singapore, and how they could improve their operations. We also built a simple self-service dashboard so they could discover insights themselves, making the benefit from the DataDive a sustainable one.

The experience was a mini end-to-end project: you get to work with stakeholders and real-world messy data, create social impact, and learn a lot from fellow do-gooders.

Speak and write

In Singapore (and most countries), there are meet-ups organised regularly where you can share and gain experience in public speaking. Similarly, there are plenty of blogging platforms, such as WordPress and Medium, where you can write articles and gain practice. As you work on projects, maintain a site journaling your progress and results, as well as a git repo that people can refer to.

Volunteer to write your organization's data newsletter, or to speak at a meetup or conference. I don't think there's a shortcut to this, but if you know of any, please let me know! From my experience, the way to get better is to practice.

Conclusion

And there you have it. Simple guidelines on how to get started in data science: Three tools, three skills, and three avenues for practice.

Please let me know if you found this guide useful, and if you have any suggestions for improvement. Feedback on my writing and content is also most welcome.

P.S., here’s what I shared at SMU’s MITB last year.

If you found this useful, please cite this write-up as:

Yan, Ziyou. (Jun 2017). How to get started in Data Science. eugeneyan.com. https://eugeneyan.com/writing/how-to-get-started-in-data-science/.

or

@article{yan2017start,
  title   = {How to get started in Data Science},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2017},
  month   = {Jun},
  url     = {https://eugeneyan.com/writing/how-to-get-started-in-data-science/}
}