Cluster and Native Service Management

System DesignDistributed SystemsReliability & Scaling

Materials — open to everyone, no sign-in

Topic: Cluster and Native Service Management

Presenter: 马可

Interviewer: system

Additional Resources:


System Design Event Sign Up Form:

Job Referral

Candidates: https://commitway.com/job-refer

Search for “Databricks” for job opening on Michael’s team

Hiring managers/team members: https://commitway.com/job-open

QRCode

Cloud computing

Commodity hardware

Low cost, and reliable

failure is NORM

Heterogeneous hardware

Supports different workload

Computing and storage decouple

Massive scale

Resiliency

Cost efficiency

Multi geo-location

Resiliency - can failover across different regions

Locality - can deploy service close to customers

Data center topology

Rack

Switches (top-of-rack router)

There are usually multiple level of routers

Ethernet is the most popular protocol (more popular than ATM)

Why is data center management different?

Because we need to manage machines

What is a cluster?

A collection of machines

Can be bare metal machine or VM

Cluster management system

Control plane

Data plane

Functionalities

Service discovery

Inventory management

Allocation

Failure detection

Healing

Deployment

Policy management

Node lifecycle management

Workload lifecycle management

How is cluster management different?

Why not use a typical architecture? FE + BE + database?

Highly available?

Scalable?

Fault tolerance

Distributed

Entities to management - some part of the control plane needs to co-locate with the services

Control plane may be a cache of the data plane

What is native service?

This can be compared to 3-tier or single machine traditional software

Fault tolerance - machines can fail

Multi-tenancy - need to share resource with other processes

Usually container based

Declare resource demand

Cluster and native service management system

Borg

Google’s home grown

Job and service management

Autopilot

Microsoft home grown

Service management

Kubernetes

Invented by google

OSS

Concepts

| | Borg | Autopilot | Kubernetes |

| Management scope | cell | cluster | cluster |

| Control plane | borgmaster | autopilot | kubemaster |

| Data plane | borglet | shared | kubelet |

| Workload | job | Machine type | replicaset |

| Workload instance | Tasks (single container) | Machine | Pod (multi-container) |

| Node | Node | Physical machine | Node |

| Tenant | N/A | Environment | Namespace |

Microsoft Autopilot vs service fabric

Abstraction of services

Service fabric: control and data plane are co-hosted. Helps scaling out

Service fabric supports multiple-tenancy from the beginning (using job object/container). Autopilot - all instances map to a physical machine.

Service fabric supports service store and partition (helps scaling out)

Interface

They all use declarative interface.

Job description

Borg - BCL

Autopilot *.init file with workload information

K8S - YAML

Workload lifecycle

Autopilot

Has complex handling for failures. Introduced probation state to avoid cascading failures.

It tailors to long-running service

Availability constraints

Borg

Number of task disruptions

Autopilot

Failing limit

Kubernetes

Pod disruption budget

Autopilot needs to deal with potential data corruption. Borg jobs don’t have such need due to other storage infrastructures such as GFS.

Priority and quota

Priority

Borg

A positive integer

Bands: monitoring/production/batch/best-effort

No in-band preemption

Autopiot

K8S

Quota

Resource reservation

Part of admission control

Service discovery

Borg

Borg name service - a default name is created

Cell name/job name/task number

Host name/port in chubby

Autopilot

DNS

Host file

Local discovery file

KBS

Service type

clusterIP

nodePort

Loadbalancer

ExternalName

Discovering

Environment variable

DNS

Headless service

DNS

Workload failure detection

Can deadlock, or fail in other ways

Borg

HTTP based health check URL

Autopilot

Watchdog

OK/Error/Warning

KV based health store

K8S

Liveness probe

Demand may be predicted using ML, but it’s not always predicted correctly.

Node failure detection

borg

Borglet poll

Autopilot

Watchdog

K8S

Node problem detector

Run as daemonset

Monitor

System logs

System stats

Custom plugin

Health checker

Node controller

Probe node

Healing

Autopilot

History and error based

Actions

Reboot - may solve memory leak

Reimage - may solve disk out of space

Log rotation

Declare space needs

Replace - disk bisector, memory corruption

Migration

  • can solve problem similar to reboot, reimage, replace

Can solve problems of noisy neighbor

Monitoring

Borg

Infrastore

Autopilot

Collection service

Cockpit https endpoint on public IP. can use access internal information

K8S

Borg

Borgmaster

5 replicas

Paxos based repliacation/persistence

Leader election

Communication with borglet

Scheduler

Feasibility checking

Scoring

Best fit vs worst fit

How to avoid single point of failure?

Try to minimize / maximize sqr root of node number square

Minimum 5 replicas

1 failure for system

1 failure for system upgrade

Where are 5 replicas?

Usually at the same data center or same site (physical location)

Kubernetes in theory we can run at different data centers

Latency impacts the performance of paxos algorithm

Usually on different racks

Software may also be failure

Borgmaster software may contain bugs

Blast radius

Try to limit lower tier’s blast radius

High-level can have larger blast radius

Autopilot

Device manager

Strong consistency

Replicated using paxos

Pull vs push

Satellite service

Poll state from device manager

Reliable design - will eventually get the latest update

Simple - push model still requires pulling as backup

Deployment service

Watchdog service

Repair service

Provision service

Collection service

cockpit

K8S

Api service

Cluster management service

Replication controller

Node controller

CIDR assignment

Inventory reconciliation

Node health monitoring

ETCD - uses raft protocol. Distributed key-value store

Key - object path

Value - binary representation of object

Architecture - control plane scalability

Omega

Basic idea: sharding

optimistic locking

Borg

Decouple functionalities

Separate scheduler based on snapshot state

Workload specific scheduling

Score caching

Equivalent class

Random selection

Borglet probe

Read-only API

Sharded across replicas

Autopilot

Complex scale out design

Sharding

How to avoid conflict

Lock global resource (may make it bottleneck). Assuming infrequent access

Asynchronous protocol for update/state modification. Assumes it may fail. Optimistic locking. 2-phase commit.

Kubernetes

Claim resources

Complexity is high

Borg

Borglet

Borgmaster polls borglet

Link shard for partition/aggregation

Autopilot

File sync

Application manager (local)

Local watchdog

K8S

Kubelet

Job opening

Resource provisioning across multiple clouds

Challenges: unified interface

How to scale up service stack on different clouds