Learn K8S with Me, Code: Kubernetes StatefulSet Code Analysis and the Unknown Status

tolerious, published 2019-07-01 17:18 / 3795 views

Pod status after a node goes offline

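When running Kubernetes, depending on how the cluster is configured, a node often becomes NotReady for one or more of the following reasons:

kubelet process stopped

apiserver process stopped

etcd process stopped

Kubernetes management network down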

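When any of these happens, the node becomes NotReady. Once it has stayed NotReady for longer than the value of --pod-eviction-timeout in kube-controller-manager (5 minutes by default), pod eviction is triggered.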

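Pods are handled differently for different workload types, because each controller in controller-manager implements its own logic. In summary:

deployment: after node NotReady triggers eviction, the pod is recreated on a new node (it stays Pending if a nodeSelector or affinity requirement cannot be satisfied). The pod on the failed node is kept and shown as Unknown, so at this point you see more pods than the replica count.

statefulset: node NotReady also triggers eviction for StatefulSet pods, but the pod the user sees stays in Unknown indefinitely, with no change.

daemonSet: node NotReady has no effect on a DaemonSet; its pod is reported as NodeLost and stays that way.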

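As noted above, for deployment and statefulSet resources the pod status shown after the node becomes NotReady is Unknown. The status actually stored in etcd is NodeLost; the display layer rewrites it, which is what distinguishes these pods from daemonSet pods. The corresponding code is: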

### node controller

// Node eviction calls DeletePods. The delete is a graceful delete:
// the apiserver REST handler sets DeletionTimestamp on the Pod object.
func DeletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
...
        for _, pod := range pods.Items {
...
        // Set reason and message in the pod object.
        if _, err = SetPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
            if apierrors.IsConflict(err) {
                updateErrList = append(updateErrList,
                    fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
                continue
            }
        }
        // if the pod has already been marked for deletion, we still return true that there are remaining pods.
        if pod.DeletionGracePeriodSeconds != nil {
            remaining = true
            continue
        }
        // if the pod is managed by a daemonset, ignore it
        _, err := daemonStore.GetPodDaemonSets(&pod)
        if err == nil { // No error means at least one daemonset was found
            continue
        }

        glog.V(2).Infof("Starting deletion of pod %v/%v", pod.Namespace, pod.Name)
        recorder.Eventf(&pod, v1.EventTypeNormal, "NodeControllerEviction", "Marking for deletion Pod %s from Node %s", pod.Name, nodeName)
        if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
            return false, err
        }
        remaining = true
    }
...
}

### staging apiserver REST interface

// For a graceful delete, processing stops here: the object is not removed from storage.
// The kubelet watches the change and completes the delete.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
            err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
        }
    // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
    if !deleteImmediately || err != nil {
        return out, false, err
    }
...
}
        
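// Called by the REST handler in staging/apiserver; sets DeletionTimestamp and DeletionGracePeriodSeconds.
// Note: the three-value returns in this excerpt do not match the five named results of the signature;
// the elided body appears to inline the rest.BeforeDelete helper (graceful, gracefulPending, err).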
func (e *Store) updateForGracefulDeletionAndFinalizers(ctx genericapirequest.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {
...

        if options.GracePeriodSeconds != nil {
            period := int64(*options.GracePeriodSeconds)
            if period >= *objectMeta.GetDeletionGracePeriodSeconds() {
                return false, true, nil
            }
            newDeletionTimestamp := metav1.NewTime(
                objectMeta.GetDeletionTimestamp().Add(-time.Second * time.Duration(*objectMeta.GetDeletionGracePeriodSeconds())).
                    Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
            objectMeta.SetDeletionTimestamp(&newDeletionTimestamp)
            objectMeta.SetDeletionGracePeriodSeconds(&period)
            return true, false, nil
        }
...
}

### node controller 

// SetPodTerminationReason tries to set the status reason and message on the Pod object
func SetPodTerminationReason(kubeClient clientset.Interface, pod *v1.Pod, nodeName string) (*v1.Pod, error) {
    if pod.Status.Reason == nodepkg.NodeUnreachablePodReason {
        return pod, nil
    }

    pod.Status.Reason = nodepkg.NodeUnreachablePodReason
    pod.Status.Message = fmt.Sprintf(nodepkg.NodeUnreachablePodMessage, nodeName, pod.Name)

    var updatedPod *v1.Pod
    var err error
    if updatedPod, err = kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
        return nil, err
    }
    return updatedPod, nil
}

### Command-line output

// Status shown when printing: if DeletionTimestamp is set and the pod's status reason is NodeLost,
// the displayed status is Unknown.
func printPod(pod *api.Pod, options printers.PrintOptions) ([]metav1alpha1.TableRow, error) {
...
    if pod.DeletionTimestamp != nil && pod.Status.Reason == node.NodeUnreachablePodReason {
        reason = "Unknown"
    } else if pod.DeletionTimestamp != nil {
        reason = "Terminating"
    }
...
}

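Pod status after the node returns to Ready

After the node recovers, the pod status again changes differently for each workload type.

deployment: as described in the previous section, a healthy replacement pod is already running on another node. Once the failed node recovers, kubelet performs the graceful deletion and removes the old Pod object.

statefulset: the pod goes from Unknown to Terminating, the graceful deletion runs, the PV is detached, and the pod is then rescheduled and recreated.

daemonset: the pod goes straight from NodeLost to Running; no recreation is involved.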

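Why is the StatefulSet pod not recreated, and how can a single replica stay highly available?

Two questions come up often: why is the statefulset pod not recreated, and how can a single-replica statefulset stay highly available?

Why there is no recreation

First, a brief look at the statefulset controller's logic.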

The StatefulSet controller coordinates two modules, StatefulSetControl and StatefulPodControl, to handle status management (StatefulSetStatusUpdater) and scaling (StatefulPodControl) for statefulSet workloads. In practice, StatefulSetControl creates, deletes and updates Pods by calling StatefulPodControl.

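When podManagementPolicy is the default OrderedReady, a StatefulSet creates its Pods one at a time in monotonically increasing ordinal order. With Parallel, Pods still get integer ordinals, but they are scheduled and created at the same time.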
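The difference between the two policies can be illustrated with a toy reconcile loop. This is a sketch under simplifying assumptions: `replica` and `reconcile` are hypothetical names, readiness is flipped by hand, and the only idea borrowed from the controller is that the ordered (monotonic) mode stops a pass early.

```go
package main

import "fmt"

// replica is a hypothetical, minimal stand-in for one StatefulSet pod slot.
type replica struct {
	created bool
	ready   bool
}

// reconcile performs one pass over the replicas in ordinal order and returns
// the ordinals it created. In monotonic (OrderedReady) mode it stops after
// creating one pod, or when it meets a pod that is not ready yet.
func reconcile(replicas []replica, monotonic bool) []int {
	var created []int
	for i := range replicas {
		if !replicas[i].created {
			replicas[i].created = true
			created = append(created, i)
			if monotonic {
				return created
			}
			continue
		}
		if !replicas[i].ready && monotonic {
			return created
		}
	}
	return created
}

func main() {
	ordered := make([]replica, 3)
	for pass := 1; pass <= 3; pass++ {
		fmt.Printf("OrderedReady pass %d created %v\n", pass, reconcile(ordered, true))
		// Simulate the kubelet: every created pod becomes ready before the next pass.
		for i := range ordered {
			ordered[i].ready = ordered[i].created
		}
	}

	parallel := make([]replica, 3)
	fmt.Printf("Parallel pass 1 created %v\n", reconcile(parallel, false))
}
```

In the ordered run each pass creates exactly one ordinal, whereas the parallel run creates all three in a single pass.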

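The specific logic lives in the core method UpdateStatefulSet.

The Stateful Pod stays in Unknown because this controller stops acting on it. As described in the first section, the NodeController's pod eviction has already marked the Pod for deletion, so the Pod object carries a DeletionTimestamp. The StatefulSet controller's isTerminating check matches, and the method returns immediately.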

// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
    ...
    for i := range replicas {
        ...
        // If we find a Pod that is currently terminating, we must wait until graceful deletion
        // completes before we continue to make progress.
        if isTerminating(replicas[i]) && monotonic {
            glog.V(4).Infof(
                "StatefulSet %s/%s is waiting for Pod %s to Terminate",
                set.Namespace,
                set.Name,
                replicas[i].Name)
            return &status, nil
        }
    ...
    }
}


// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
    return pod.DeletionTimestamp != nil
}

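So how do we keep a single replica highly available?

Applications often contain pods that cannot run as multiple replicas, yet the cluster is still expected to self-heal. A node going down or a network card failing then has a large impact. How can self-healing be achieved?

A Stateful Pod stuck in Unknown can be removed with a force delete. The community does not recommend force delete, because it can break the guarantee behind the pod's unique identity (the monotonically increasing ordinal). If that happens it is fatal for a StatefulSet and may cause data loss (through split-brain in the application cluster, or through multiple writers on the same PV).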

kubectl delete pods <pod> --grace-period=0 --force

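Even so, a force delete needs some safeguards. Take the Ceph RBD storage plugin as an example: experience suggests that before running the force delete, the user should first set a ceph osd blacklist entry, so that if the network recovers during the migration, the old container cannot keep writing to the PV and corrupt the filesystem. Force delete removes the Pod object directly from etcd, so the StatefulSet controller creates a new Pod on another node, but the kubelet on the failed node needs time to clean up the old container. During that window two containers inevitably have the same PV mounted (the container of the pod on the failed node and the container of the newly migrated pod). If the network recovers at that point, both containers may write at the same time, and the consequences are severe.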

