Topic: AI Automation Framework

Interviewer: Li

Level: L6 (Staff)

Additional Resources:

Live Stream

System Design Interview - Distributed System

8/21/2024

YouTube for the event:

https://www.youtube.com/watch?v=-7hkkRGsCa0

Coach Ken LinkedIn: https://commitway.com/linkedin

职场提升俱乐部 (Career Advancement Club)

Requirement

[41:39]

[connect to external services, auth]

[38]

[36]

[34]

[32]

High level design

[30]

[28:45]

API

[27:29]

DB schema of workflows

[26]

May not fit into one database instance

NoSQL database

Partition key: workflow ID

Sort key: creation_timestamp
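A minimal in-memory sketch of the key layout noted above, assuming one record per workflow version. The names `WorkflowRecord`, `WorkflowTable`, and the `definition` field are hypothetical and only illustrate how a partition key plus sort key would be queried; a real NoSQL store would handle the partitioning itself.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(order=True)
class WorkflowRecord:
    creation_timestamp: int                       # sort key (only field used for ordering)
    workflow_id: str = field(compare=False)       # partition key
    definition: str = field(compare=False, default="")  # hypothetical payload


class WorkflowTable:
    """Toy stand-in for a NoSQL table: one sorted list per partition key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, record):
        # Keep each partition ordered by the sort key.
        insort(self._partitions[record.workflow_id], record)

    def query(self, workflow_id, since=0):
        # A query touches a single partition and filters on the sort key.
        return [r for r in self._partitions[workflow_id]
                if r.creation_timestamp >= since]
```

Because every query names a workflow ID, it reads one partition only, which is what lets the table spread across many database instances.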

[timeout?]

[workflow run in DB?]

[24]

[20]

Workflow service to schedule first run

Worker to schedule new runs

[crash recovery of workers?]

[19]

Workflow scheduler maintains heartbeats with workers

[how to scale worker pool?]

Pull model or push model?

[how to scale worker pool?]

Pull model, because keeping track of worker status is messy with push

[16]

Add message queue

Pull model: easier to maintain

Heartbeat fails 3 times, then assume worker is dead
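The three-missed-heartbeats rule can be sketched as below. The 10-second interval, the class name `HeartbeatMonitor`, and its methods are assumptions for illustration; the notes only fix the threshold of 3.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and flags workers that missed too many."""

    def __init__(self, interval_s=10.0, max_missed=3):
        self.interval_s = interval_s    # assumed heartbeat period
        self.max_missed = max_missed    # from the notes: 3 failures means dead
        self._last_seen = {}

    def record_heartbeat(self, worker_id, now):
        self._last_seen[worker_id] = now

    def missed(self, worker_id, now):
        # Number of whole heartbeat intervals elapsed since the last heartbeat.
        return int((now - self._last_seen[worker_id]) // self.interval_s)

    def dead_workers(self, now):
        return sorted(w for w in self._last_seen
                      if self.missed(w, now) >= self.max_missed)
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.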

[pull model?]

[13:35]

Select pull

Each worker runs multiple Docker containers, which gives some level of security

How do we know the status?

Worker will update status
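A rough sketch of the pull model with worker-side status updates, assuming an in-process `queue.Queue` stands in for the message queue and a plain dict stands in for the workflow run DB. `worker_loop`, `handler`, and the status strings are hypothetical names.

```python
import queue


def worker_loop(worker_id, jobs, status, handler):
    """Pull run IDs until the queue is empty, recording status for each run."""
    while True:
        try:
            run_id = jobs.get_nowait()   # the worker pulls; nothing is pushed to it
        except queue.Empty:
            return                       # a real worker would block or poll again
        status[run_id] = ("RUNNING", worker_id)
        try:
            handler(run_id)
            status[run_id] = ("SUCCEEDED", worker_id)
        except Exception:
            status[run_id] = ("FAILED", worker_id)
```

The worker owns its status updates, so the scheduler does not need to track which worker received which job.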

[11]

Relational or non-relational?

Non-relational: scale is very large

Still need strong consistency

Risk: a workflow run could be scheduled twice, on two separate machines

WorkerID acts as a lock

Conditional update
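The "WorkerID as a lock" idea can be sketched as a compare-and-set on the run record. This is an in-memory illustration with hypothetical names (`WorkflowRunStore`, `try_claim`); in a real store the same effect would come from a conditional write, for example a DynamoDB `ConditionExpression` or a SQL `UPDATE ... WHERE worker_id IS NULL`.

```python
import threading


class WorkflowRunStore:
    """Toy run table where the worker_id column doubles as the lock."""

    def __init__(self):
        self._owner = {}                 # run_id -> worker_id, or None if unclaimed
        self._mutex = threading.Lock()   # stands in for the store's atomic write

    def add_run(self, run_id):
        self._owner[run_id] = None

    def try_claim(self, run_id, worker_id):
        # Conditional update: succeed only if the run exists and nobody owns it yet.
        with self._mutex:
            if run_id not in self._owner or self._owner[run_id] is not None:
                return False
            self._owner[run_id] = worker_id
            return True

    def owner(self, run_id):
        return self._owner.get(run_id)
```

Only one of two competing workers sees `True`, so a run delivered twice is still executed once.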

[9:23]

DynamoDB or sharded MySQL

To ensure strong consistency

Partition key?

Workflow run vs scheduled time

Workflow run ID: easy to look up a given run, but finding due runs requires a full table scan

Scheduled time: easy to find runs that are due, but hard to query by run

Scanning for due runs is more frequent, so optimize for the scan: partition by scheduled time

Secondary index. Sharded SQL or DynamoDB, strong consistency

Transactions: cannot commit more than 100 records at once
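A sketch of the trade-off above, assuming scheduled times are grouped into one-minute buckets that act as the partition key, with a secondary index for lookups by run. `BUCKET_SECONDS`, `RunIndex`, and the method names are illustrative assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width for the partition key


def bucket_of(ts):
    """Round a timestamp down to the start of its bucket."""
    return ts - ts % BUCKET_SECONDS


class RunIndex:
    def __init__(self):
        self._by_bucket = defaultdict(list)  # primary: bucket -> [(scheduled_time, run_id)]
        self._by_run = {}                    # secondary index: run_id -> scheduled_time

    def schedule(self, run_id, scheduled_time):
        self._by_bucket[bucket_of(scheduled_time)].append((scheduled_time, run_id))
        self._by_run[run_id] = scheduled_time

    def due_runs(self, now):
        # Reads only the current bucket; assumes earlier buckets were already drained.
        bucket = self._by_bucket.get(bucket_of(now), [])
        return sorted(run_id for ts, run_id in bucket if ts <= now)

    def lookup(self, run_id):
        # Served by the secondary index, so no scan across buckets is needed.
        return self._by_run.get(run_id)
```

The frequent operation (finding due runs) reads one partition, while the rarer lookup by run goes through the secondary index.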

[5]

How can worker retry the job?

If a worker has failed, its runs need to be rescheduled. The management service should update the workflow run DB.
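One way the management service might update the run table when a worker is declared dead, as a sketch. The record fields (`worker_id`, `status`, `attempts`), the status strings, and the `max_attempts` cap are assumptions, not something the notes specify.

```python
def reschedule_runs_of(dead_worker_id, runs, max_attempts=3):
    """Release runs held by a dead worker; return the IDs put back to PENDING."""
    rescheduled = []
    for run_id, run in runs.items():
        if run["worker_id"] != dead_worker_id or run["status"] != "RUNNING":
            continue
        run["worker_id"] = None              # release the lock held by the dead worker
        if run["attempts"] >= max_attempts:
            run["status"] = "FAILED"         # give up after the assumed retry cap
        else:
            run["status"] = "PENDING"        # eligible to be picked up again
            run["attempts"] += 1
            rescheduled.append(run_id)
    return rescheduled
```

In this sketch a rescheduled run restarts from the beginning, since no intermediate state is stored.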

[ missing intermediate result ]

[2:48]

Worker mgmt service is a single point of failure

Good monitoring system. We may need manual investigation to reload the job

Or automatic failover

A little risky. May prefer to have an engineer investigate.

[ time is up ]

Worker mgmt service.

317 jobs/second

7:09-

100k

Concurrent runs
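A quick back-of-the-envelope check on these figures, assuming 317 jobs/second is the steady-state start rate and 100k is the number of runs in flight at once. Under those assumptions, Little's law (concurrency = rate × average duration) gives the implied average run time; the variable names are illustrative.

```python
JOBS_PER_SECOND = 317        # from the notes
CONCURRENT_RUNS = 100_000    # from the notes, read as runs in flight
SECONDS_PER_DAY = 24 * 60 * 60

# Daily volume implied by the start rate.
jobs_per_day = JOBS_PER_SECOND * SECONDS_PER_DAY

# Little's law: L = lambda * W, so W = L / lambda.
avg_run_seconds = CONCURRENT_RUNS / JOBS_PER_SECOND

print(f"{jobs_per_day:,} jobs/day, ~{avg_run_seconds:.0f} s per run on average")
```

That works out to roughly 27 million runs per day and an average run of a little over five minutes, if the two numbers are meant to be consistent with each other.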

7:09-7:11

Not too sure about today’s performance

The system is complex

Availability

Data model

Not very confident about the design

Could retries be part of the workflow definition?
