AWS Machine Learning Blog, October 16, 00:38
Integrating a Scala development environment into SageMaker Studio

 

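This article explains in detail how to integrate a Scala development environment into Amazon SageMaker Studio, addressing the lack of Scala support in existing Python-centric data science workflows. By installing the Almond kernel, users can carry out exploratory analysis and development in Scala directly within SageMaker Studio, which is especially useful for Spark and big data processing. The article walks through creating a Conda environment, installing OpenJDK, configuring Coursier, and installing and fixing the Almond kernel. It also stresses JVM version compatibility, environment isolation, and user-managed maintenance to keep the Scala development workflow stable and efficient.

📦 **Almond kernel integration**: The article presents a way to enable a Scala development environment in Amazon SageMaker Studio. Installing the Almond kernel makes up for the lack of native Scala support in SageMaker Studio, so Scala users can do data analysis and machine learning work directly in the familiar JupyterLab environment.

⚙️ **Detailed installation and configuration steps**: A complete walkthrough covers creating an isolated Conda environment, installing OpenJDK, downloading and setting up Coursier (Scala's application installer and dependency manager), and finally installing the Almond kernel and fixing the kernel.json file so that it points to the correct Java path, so users can reproduce and deploy the setup.

🧠 **Technical considerations and best practices**: The article highlights the key technical points to watch when using Scala in SageMaker Studio: making sure the JVM version is compatible with Spark, keeping the custom Conda environment isolated to avoid conflicts, and recognizing that users are responsible for maintaining the custom environment, including security updates and library management, to keep the workflow stable and secure.

💰 **Cost and cleanup**: The solution uses only open source tools and incurs no additional AWS charges; users pay only for their use of SageMaker Studio itself. The article also provides detailed cleanup steps for shutting down kernels and deleting the Conda environment and related SageMaker resources, which keeps the environment tidy and cost-effective.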

Scala stands out as a versatile programming language that combines object-oriented and functional programming approaches. By running on the Java Virtual Machine (JVM), it maintains seamless compatibility with Java libraries while offering a concise and scalable development experience. The language has distinguished itself in the realm of distributed computing and big data processing, with the Apache Spark framework, built using Scala, serving as a prime example of its capabilities. Though Amazon SageMaker Studio provides comprehensive support for Python-based data science and machine learning (ML) workflows, it doesn’t include built-in support for Scala development.

Integrating the Almond kernel into SageMaker Studio is particularly valuable for those working with Spark or engaged in complex data processing tasks, because it supports seamless Scala-based exploratory analysis and development alongside Python-centric tools in Amazon SageMaker. The addition of the Almond kernel expands the versatility of SageMaker Studio, so teams can maintain their preferred Scala workflows while taking advantage of the service’s ML and cloud computing capabilities.

Organizations and teams working in mixed-language environments, particularly those heavily invested in Scala and Spark-based data processing workflows, face challenges when using SageMaker Studio because it doesn’t have built-in Scala support. The current process requires developers to maintain separate environments or use workarounds, disrupting workflows and reducing productivity. Data scientists and engineers who prefer Scala’s strong typing and functional programming must adapt to Python or switch platforms, increasing development overhead and risking inconsistencies in production pipelines. Furthermore, teams that have built extensive Scala code bases for their big data processing face additional complexity when trying to integrate their existing work with the ML capabilities of SageMaker, which slows down the adoption of advanced ML features or requires additional engineering effort to align their Scala-based data processing and Python-based ML workflows.

This post provides a comprehensive guide on integrating the Almond kernel into SageMaker Studio, offering a solution for Scala development within the platform.

Solution overview

The Almond kernel is an open source project that brings Scala support to Jupyter notebooks, effectively integrating Scala and interactive data analysis environments. The installation of the Almond kernel uses Coursier, a widely recommended Scala application installer and artifact manager. Coursier simplifies and automates the process of downloading, managing, and installing Scala libraries and dependencies. Its dependency resolution mechanism makes sure users have consistent and compatible library versions, significantly reducing potential conflicts and installation complexities. The installation steps using Coursier are executed within a custom Conda environment, maintaining a clear separation from the base SageMaker Studio setup.

By walking through the installation and configuration process, developers and data scientists can use Scala’s robust features directly within SageMaker Studio. The following sections provide a step-by-step process to set up Scala development in SageMaker Studio using the Almond kernel.

Prerequisites

To begin working with this project, you must have access to JupyterLab (version 2.4.1 or later) in SageMaker Studio. This requires an active AWS account with a SageMaker Studio domain configured and a user profile set up. To set up your domain, refer to Guide to getting set up with Amazon SageMaker AI. Familiarity with the Jupyter notebooks environment is beneficial, because you will be working extensively within this interface.

By default, SageMaker Studio provides a network interface that allows communication with the internet through a virtual private cloud (VPC) managed by Amazon SageMaker AI. Internet egress from SageMaker Studio is required to download packages and access external resources. Be aware that corporate firewalls or network restrictions might interfere with your ability to download required packages. If you have deployed SageMaker Studio in a private subnet, refer to Connect Amazon SageMaker Studio in a VPC to External Resources for instructions on enabling egress access to the internet.

Appropriate AWS Identity and Access Management (IAM) permissions are essential for launching and modifying SageMaker Studio environments. For SageMaker Studio setup, admin access is initially required to configure the environment; however, in production scenarios, it’s crucial to follow the principle of least privilege by granting only the minimum necessary IAM permissions to users and roles for their specific tasks within SageMaker Studio.

Create a JupyterLab space in SageMaker Studio

For instructions to create a JupyterLab space, refer to Create a space. You can choose your preferred supported version of SageMaker distribution for the JupyterLab space. Run and open the JupyterLab space after you create it.

Create and activate custom Conda environment

With a custom Conda environment, you can maintain an isolated, reproducible development environment with specific package versions. Open the terminal in the JupyterLab space and run the following commands to create and activate the Conda environment:

conda create -n myenv python=3.10 -y
conda init bash
source ~/.bashrc
conda activate myenv
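If you rerun the setup, creating the environment a second time is unnecessary. The following is a minimal sketch of a guard around the create step, assuming the usual `conda env list` layout in which the first column is the environment name. The helper names `env_in_listing` and `ensure_env` are hypothetical and not part of Conda.

```shell
# Succeed when an environment named $1 appears in `conda env list`-style output
# read from stdin. The first column holds the environment name; header lines
# start with '#', so they never match a real name.
env_in_listing() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Create the environment only when it is missing (hypothetical helper).
ensure_env() {
  if conda env list | env_in_listing "$1"; then
    echo "Conda environment '$1' already exists; skipping creation"
  else
    conda create -n "$1" python=3.10 -y
  fi
}
```

With these helpers defined, `ensure_env myenv` could take the place of the `conda create` line, followed by `conda activate myenv` as before.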

Install OpenJDK

Java must be installed inside the Conda environment because Scala needs it. Check if Java is already installed by running the following command:

java --version

If Java is not found, install OpenJDK 11, which is compatible with Spark 3.3.2:

conda install -c conda-forge openjdk=11 -y

Verify that the Java installation is successful:

java --version
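Because this post pairs OpenJDK 11 with Spark 3.3.2, it can help to check the major version explicitly rather than reading the output by eye. The following sketch assumes the common `version "x.y.z"` format printed by `java -version`. The helper name `java_major` is hypothetical.

```shell
# Extract the major version from a Java version string.
# Handles the modern scheme ("11.0.20") and the legacy "1.x" scheme ("1.8.0_292").
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Only query the JVM when one is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr; the version is the first quoted token,
  # for example: openjdk version "11.0.20" 2023-07-18
  installed=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  if [ "$(java_major "$installed")" = "11" ]; then
    echo "Java 11 detected ($installed)"
  else
    echo "Expected Java 11, found: ${installed:-unknown}" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```

The sketch uses `java -version` (single dash) because that form is accepted by older JVMs as well as newer ones.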

Set JAVA_HOME

Confirm that java resolves to the Conda environment, then set JAVA_HOME to that environment by running the following commands:

which java
export JAVA_HOME=/home/sagemaker-user/.conda/envs/myenv
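If your environment name or home directory differs from the example, the hardcoded path will not match. As a rough sketch, JAVA_HOME can instead be derived from whichever java binary is first on the PATH by stripping the trailing `/bin/java`. The helper name `java_home_from_bin` is hypothetical, and the sketch does not resolve symbolic links, so treat the result as a starting point to verify.

```shell
# Derive a JAVA_HOME candidate from the path of a java binary
# by removing the trailing /bin/java component.
java_home_from_bin() {
  dirname "$(dirname "$1")"
}

# Only set JAVA_HOME when a java binary is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  JAVA_HOME=$(java_home_from_bin "$(command -v java)")
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
fi
```

For the path used in this post, `/home/sagemaker-user/.conda/envs/myenv/bin/java`, the helper yields `/home/sagemaker-user/.conda/envs/myenv`.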

Download and set up Coursier

Install the Coursier artifact manager using the following command:

curl -Lo coursier https://git.io/coursier-cli
chmod +x coursier
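A download that fails behind a firewall can leave a missing or empty file behind. The following sketch performs a basic sanity check on the launcher before you run it. The helper name `launcher_ready` is hypothetical, and the check only covers existence, size, and the executable bit; it does not validate the file contents.

```shell
# Succeed only if $1 is a regular, non-empty, executable file.
launcher_ready() {
  [ -f "$1" ] && [ -s "$1" ] && [ -x "$1" ]
}

if launcher_ready ./coursier; then
  echo "coursier launcher looks usable"
else
  echo "coursier launcher is missing, empty, or not executable" >&2
fi
```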

Install the Scala (Almond) kernel

Run the following command to install the Almond kernel:

./coursier launch almond -- --install

Fix kernel Java path

JupyterLab’s Scala kernel has difficulty locating the correct Java installation by default. This occurs because the kernel specification file (kernel.json) initially uses a generic Java path reference, which might not point to the actual Java installation on your SageMaker Studio instance. You must modify the Scala kernel’s configuration file (kernel.json) to explicitly specify the correct Java installation path.

Edit the kernel configuration file at the following location:

~/.local/share/jupyter/kernels/scala/kernel.json

Update the java path to the absolute path returned by which java:

{
  "argv": [
    "/home/sagemaker-user/.conda/envs/myenv/bin/java",
    "-jar",
    "/home/sagemaker-user/.local/share/jupyter/kernels/scala/launcher.jar",
    "--connection-file",
    "{connection_file}"
  ],
  "display_name": "Scala",
  "language": "scala"
}
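If you prefer not to edit the file by hand, the change can be scripted. The following is a minimal sketch that assumes the generated kernel.json contains a bare `"java"` entry as the first element of `argv`; inspect your file first, because that layout is an assumption here. The helper name `patch_kernel_java` is hypothetical.

```shell
# Rewrite the bare "java" argv entry in a kernel spec to an absolute path.
# $1 = path to kernel.json, $2 = absolute path to the java binary.
# Assumes the java path contains no '|' or '&', which sed treats specially.
patch_kernel_java() {
  tmp_file="$1.tmp"
  sed "s|\"java\"|\"$2\"|" "$1" > "$tmp_file" && mv "$tmp_file" "$1"
}

kernel_json="$HOME/.local/share/jupyter/kernels/scala/kernel.json"
if [ -f "$kernel_json" ] && command -v java >/dev/null 2>&1; then
  # Keep a backup so the original spec can be restored.
  cp "$kernel_json" "$kernel_json.bak"
  patch_kernel_java "$kernel_json" "$(command -v java)"
fi
```

Because the pattern matches only the exact token `"java"`, running the sketch again on an already patched file leaves it unchanged.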

Launch the kernel

From the JupyterLab space launcher in SageMaker Studio, open a new notebook using the Scala kernel.

Test Spark integration

Use the following sample code to verify if the Scala kernel is functioning:

println(s"Scala: ${scala.util.Properties.versionNumberString}")
println(s"Java : ${System.getProperty("java.version")}")

The following is an example of the expected output:

Scala: 2.13.x
Java : 11.0.x

Technical considerations during and after deployment

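In this section, we discuss some of the key considerations when working with the Scala kernel on SageMaker Studio:

JVM compatibility: Make sure the JVM version in the Conda environment is compatible with the Spark version you use (for example, OpenJDK 11 with Spark 3.3.2).
Environment isolation: Keep the custom Conda environment separate from the base SageMaker Studio setup to avoid conflicts.
User-managed maintenance: You are responsible for maintaining the custom environment, including security updates and library management.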

By proactively addressing these technical considerations, teams can effectively integrate Scala workflows within SageMaker Studio, creating robust and reliable data science environments that complement the existing Python-centric tools. These settings remain intact even when SageMaker Studio is restarted.

Cost considerations

This solution uses open source tools and doesn’t incur additional AWS charges beyond the use of the underlying SageMaker Studio environment. Review the SageMaker Pricing page for additional pricing information.

Clean up

After successfully setting up and using your new Scala environment in SageMaker Studio, it’s important to clean up to maintain efficiency and cost-effectiveness. This step not only frees up space but also keeps your SageMaker Studio environment tidy and organized. You can always recreate the environment later if needed, following the steps outlined in this post. By maintaining good housekeeping practices, your SageMaker Studio remains optimized for your current projects and ready for future explorations.

    When you’re finished with your Scala work, shut down the SageMaker Studio kernel. This helps prevent unnecessary resource usage and potential charges. Additionally, if you no longer need the custom Conda environment you created for Scala, you can delete it entirely:
conda deactivate
conda remove -n myenv --all -y
    Clean up associated SageMaker resources:
      Stop and delete running applications such as SageMaker Studio applications and notebooks within user profiles. Delete user profiles and shared spaces within the domain.
    Delete the SageMaker domain:
      On the SageMaker console, choose Domains in the navigation pane. Select the domain you want to delete. On the Actions menu, choose Delete.

Conclusion

By following this post, you can use Scala within SageMaker Studio, taking advantage of the powerful capabilities of Spark and Scala-based data engineering and analytics workflows. This setup is ideal for data scientists and engineers who rely on Scala’s concise syntax and functional programming constructs, especially when working with Spark-based pipelines.

The following are additional resources that can help you further explore the Almond kernel, Coursier, and Scala:

Start exploring these resources today to enhance your Scala development journey and streamline your ML workflows on SageMaker.


About the authors

Varun Rajan is a Senior Solutions Architect supporting Strategic Industries at Amazon Web Services. Varun has over two decades of experience in designing, building, and optimizing cloud-based solutions for a diverse range of clients and specializes in translating complex business challenges into scalable, secure solutions that deliver measurable business value.

Aakash Aggarwal is a Technical Account Manager at Amazon Web Services (AWS), based in the San Francisco Bay Area. He has over a decade of experience in the development and management of cloud-based workloads, and specializes in helping strategic AWS customers accelerate their cloud adoption. His focus areas include AI/ML, containerization, and observability on AWS.
