Kubernetes: The Cornerstone of Container Orchestration

Kubernetes is a platform for running and orchestrating container workloads. It has become the de facto standard for (auto-)scaling cloud-native apps and for deploying your open-source zoo quickly and independently of any cloud provider, so there is no lock-in. You could also use OpenShift or OKD. Their latest version added the OperatorHub, from which you can install 182 items (as of today) with just a few clicks. Also, check out Managed Data Stacks, which were created to take the setup effort off your hands.

Another reason for Kubernetes is the move from infrastructure as code towards infrastructure as data, specifically as YAML. All Kubernetes resources, including Pods, Configurations, Deployments, Volumes, etc., can be expressed in a YAML file. Developers can quickly write applications that run across multiple operating environments. Costs can be reduced by scaling down (even to zero with, e.g., [Knative][63]) and by using plain Python or other programming languages instead of paying for a service on Azure, AWS, or Google Cloud. Its modularity and abstraction make management easier, and with containers (Docker or [Rocket][65]) you can monitor all your applications in one place.
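To make "infrastructure as data" concrete, here is a minimal sketch of a Deployment manifest. The name `hello-web`, the labels, and the `nginx:1.27` image are placeholders chosen for illustration, not something from my setup.

```yaml
# Hypothetical example: name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
  labels:
    app: hello-web
spec:
  replicas: 2                 # desired number of Pods
  selector:
    matchLabels:
      app: hello-web          # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
```

Saved as, say, `deployment.yaml`, it could be applied with `kubectl apply -f deployment.yaml`. The file only describes the state you want; Kubernetes works out how to get there.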

To get hands-on with Kubernetes, you can install Docker Desktop with Kubernetes included. All of my examples are built on top of it and run on any cloud as well as locally. For a more sophisticated set-up for Apache Spark, I suggest reading the blog post from Data Mechanics about Setting up, Managing & Monitoring Spark on Kubernetes. If you prefer video, An introduction to Apache Spark on Kubernetes covers the same content and adds more on top of it.

As said above, if setting up Kubernetes is too hard, there are Managed Data Stacks, where you can pick from existing open-source tools.

Security: separation of concerns, for example with different namespaces.

# Kubernetes Orchestration

Continuously working towards a desired state.

Pod

# Kubernetes Architecture

etcd

current known state

desired state

```mermaid
graph LR
  subgraph node
    kubelet["kubelet & kube-proxy"]
    containerd
    container
  end
  subgraph control_plane
    subgraph etcd
      kubernetes_resource
    end
    controllers
    kube-api
    scheduler[Default Scheduler]
  end
  subgraph yaml_file
    resource_configurations
  end
  resource_configurations --> kubectl
  kubectl --> kube-api
  controllers -->|adapts| kube-api
  scheduler -->|adapts| kube-api
  kube-api -->|informs| scheduler
  kube-api -->|informs| controllers
  kube-api -->|manages| kubernetes_resource["kubernetes resource:
- current known state
- desired state"]
  kube-api -->|informs| kubelet
  kubelet -->|updates state| kube-api
  kubelet -->|manages| containerd
  containerd --> container
```

Kubernetes architecture diagram

# Workload Resources

```mermaid
graph TD
  subgraph Workload Resources
    deployment-->replicaset-->pod
    statefulset-->pod
    daemonset-->pod
    cronjob-->job-->pod
    pod[Pod]-.->container
    container[Container]
    style container stroke-dasharray: 5 5
  end
```

Pods

Deployment

StatefulSet

DaemonSet

CronJob

Job
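As a sketch of how these resources nest, a CronJob wraps a Job template, which in turn wraps a Pod template. The name `nightly-report`, the schedule, and the `busybox:1.36` image are assumptions made up for this example.

```yaml
# Hypothetical example: name, schedule, and image are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"       # every day at 02:00
  jobTemplate:                # the Job created on each run
    spec:
      template:               # the Pod created by that Job
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: busybox:1.36
              command: ["/bin/sh", "-c", "date; echo generating report"]
```

Reading the indentation from the outside in gives the same chain as the diagram: CronJob, Job, Pod, container.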

# Deployment Patterns

# Container deployments

When to use multiple containers inside a deployment?

In Kubernetes, it’s common to run multiple containers within a single Pod when the containers are tightly coupled application components that need to operate together. Outside of that case, using multiple containers inside the same Pod is an anti-pattern, except for the patterns below such as Sidecar, Ambassador, etc. Usually you would use a separate Pod deployment for a DB or another important service.

Shared Storage

Inter-process Communication

localhost

Sidecar Pattern

Adapter Pattern

Ambassador Pattern
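A minimal sketch of the Sidecar pattern with shared storage might look like the Pod below. The container names, paths, and the `busybox:1.36` image are hypothetical; the point is only that both containers mount the same `emptyDir` volume.

```yaml
# Hypothetical example: names, paths, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
    - name: logs
      emptyDir: {}            # shared scratch volume, lives as long as the Pod
  containers:
    - name: app               # main container writes a log file
      image: busybox:1.36
      command: ["/bin/sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 5; done"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-tailer        # sidecar reads the same file via the shared volume
      image: busybox:1.36
      command: ["/bin/sh", "-c", "touch /logs/app.log; tail -f /logs/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /logs
```

Because the two containers share the Pod's network namespace as well, they could also reach each other over `localhost`.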

An init container is another kind of container; init containers are specified in a separate part of the deployment, under `initContainers`.

Here is an example:

```yaml
...
  initContainers:
  - name: copy-airflow-dag-to-airflow-bd
    image: my-image:0.1.0-a.2
    command: ["/bin/sh","-c"]
    args: [
      "mkdir -p /storage/backup/dags-$(date +%Y%m%d-%H%M%S) && cp -a /storage/dags/. /storage/backup/dags-$(date +%Y%m%d-%H%M%S)/ &&
      rm -rf /storage/dags/* && cp -a /opt/airflow/airflow_home/dags/. /storage/dags/"
    ]
    volumeMounts:
    - name: storage
      mountPath: /storage
```

# Services (Network)

Kubernetes provides several types of Services to expose your application inside or outside of a cluster. Let’s break them down:

ClusterIP

Scope

Purpose

Use-case

NodePort

Scope

<NodeIP>:<NodePort>

Purpose

Use-case

LoadBalancer

Scope

Purpose

Use-case

ExternalName

Scope

Purpose

Use-case

Headless Service

Scope

Purpose

Use-case
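To illustrate two of the Service types above, here is a rough sketch with a NodePort Service and a headless Service in one file. The names, the `app: hello-web` selector, and port `30080` are assumptions for this example; the node port only has to fall inside the cluster's configured range (30000-32767 by default).

```yaml
# Hypothetical example: names, selector, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: hello-web-nodeport
spec:
  type: NodePort
  selector:
    app: hello-web            # routes to Pods carrying this label
  ports:
    - port: 80                # port of the Service inside the cluster
      targetPort: 80          # port on the Pod
      nodePort: 30080         # reachable at <NodeIP>:30080
---
apiVersion: v1
kind: Service
metadata:
  name: hello-web-headless
spec:
  clusterIP: None             # "None" is what makes the Service headless
  selector:
    app: hello-web
  ports:
    - port: 80
      targetPort: 80
```

Leaving out `type` and `clusterIP` altogether gives you a plain ClusterIP Service, which is the default.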

Ingress: Ingress is not a service type, but a separate Kubernetes resource designed for HTTP and HTTPS routing to services.

Scope

Purpose

nginx

Use-case
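A minimal Ingress sketch could look like this. It assumes an NGINX ingress controller is installed and registered under the class name `nginx`; the host `hello.example.com` and the backend Service `hello-web` are made-up placeholders.

```yaml
# Hypothetical example: class name, host, and backend Service are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-web
spec:
  ingressClassName: nginx     # assumes an NGINX ingress controller in the cluster
  rules:
    - host: hello.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello-web
                port:
                  number: 80
```

The Ingress itself only holds routing rules; without a controller running in the cluster, nothing acts on them.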

Decision Points:

ClusterIP

NodePort

LoadBalancer

ExternalName

Headless Service

Ingress

As Kubernetes continues to evolve, there might be additional service types or routing mechanisms in the future. Always refer to the official Kubernetes documentation for the most up-to-date information.

# Pod Types

# Evicted

Evicted pods in Kubernetes are pods that have been terminated and removed from nodes due to various reasons, such as:

- Node pressure: When a node is under resource pressure (e.g., low on memory or disk space), Kubernetes may evict pods to free up resources.
- Quality of Service (QoS): Lower priority pods might be evicted to make room for higher priority pods.
- Node maintenance: Pods may be evicted when a node is being drained for maintenance.

Evicted pods remain in the cluster’s API server but are not running on any node. They stay in the “Evicted” state until they are manually deleted or automatically cleaned up by the cluster (depending on your cluster’s configuration).

To delete all evicted pods in a specific namespace, you can use the following kubectl command:

```shell
kubectl get pods -n <namespace> | grep Evicted | awk '{print $1}' | xargs kubectl delete pod -n <namespace>
```

# Kinds

# DaemonSets

The desiredNumberScheduled in a DaemonSet is not typically set directly. Instead, it’s determined by the number of nodes in your cluster that match the DaemonSet’s node selection criteria. This is why you don’t see a direct option to set this number in the Helm chart.
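As a sketch of what "node selection criteria" can mean, the DaemonSet below restricts itself to Linux nodes through a `nodeSelector`. The name `node-agent` and the `busybox:1.36` image are placeholders; `kubernetes.io/os` is a well-known node label. There is no replica count anywhere in the spec.

```yaml
# Hypothetical example: name and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      nodeSelector:
        kubernetes.io/os: linux   # only nodes with this label get a Pod
      containers:
        - name: agent
          image: busybox:1.36
          command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
```

With this manifest, `desiredNumberScheduled` in the DaemonSet status would equal the number of matching nodes the Pod is allowed to run on, so taints and tolerations can change it too.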

Here’s how it works:

By default, a DaemonSet will try to schedule a pod on every node in the cluster.