Design Slack
Topic: Design Slack
Interviewer: Anna
Interviewee: Feng
Level: L5 (Senior)
Mock System Design Interview Summary
Interview Overview
Date: 3/13/2022
Target level: L5
Duration: 45 minutes
Topic covered: Design Slack
Requirements
Functional requirements
Design Slack, a chat application
Chat 1-1, and group chat
User
Channel
Workspace - team
Interviewer: we can focus on user and channel
User-to-user messages
User-to-channel messages
Add/remove/modify channels
Message archiving: offline messages
Search keywords
Public vs private channels
Multiple devices [mobile, app, web]
Non-functional requirements
High availability
Low latency
Scalable
Priority:
1-1 -> channel -> a
DAU: 10^6
Workspaces: 10^5
channels: 10^6
total users: 10^7
Target # of users per workspace: 10^4
Average messages per user per day: 10, avg message length: 10^2 bytes
Storage for 3 years (about 10^3 days): 10^6 * 10 * 10^2 * 10^3 = 10^12 bytes (1 TB)
Storage is not a big issue. Even if it grows by two orders of magnitude, it is still not very big
Increasing to 10^2 messages per user per day => 10 TB
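The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch using the figures from the notes; the function name storage_bytes is made up for illustration, and 3 years is rounded to 10^3 days as in the notes.

```python
# Back-of-envelope storage estimate using the figures from the notes.
DAU = 10**6                  # daily active users
MSGS_PER_USER_PER_DAY = 10   # average messages per user per day
AVG_MSG_BYTES = 10**2        # average message length in bytes
DAYS = 10**3                 # 3 years rounded to ~1000 days, as in the notes

def storage_bytes(dau, msgs_per_user, msg_bytes, days):
    """Total raw message bytes over the retention period (hypothetical helper)."""
    return dau * msgs_per_user * msg_bytes * days

base = storage_bytes(DAU, MSGS_PER_USER_PER_DAY, AVG_MSG_BYTES, DAYS)
heavy = storage_bytes(DAU, 10**2, AVG_MSG_BYTES, DAYS)  # 10^2 messages per user per day

TB = 10**12
print(base / TB, "TB")   # 1.0 TB
print(heavy / TB, "TB")  # 10.0 TB
```

Attachments are not included; the notes do not give numbers for them.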
For attachment
System Design
External APIs
login(uid, credential)
send_to_user(from_uid, to_uid, wid, msg)
send_to_channel(from_uid, to_cid, wid, msg)
retrieve_u2u_messages(to_uid, wid, start_time, end_time)
retrieve_u2c_messages(to_cid, wid, start_time, end_time)
search(uid, cid, wid, keyword)
add_channel / remove_channel / modify_channel
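To make the API list concrete, here is a minimal in-memory sketch of two of the calls, send_to_user and retrieve_u2u_messages. The class InMemoryChat, the Message record, and the injectable clock are assumptions for illustration only; a real backend would persist messages rather than keep them in a list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Message:
    ts: float       # server-side timestamp
    from_uid: str
    to_uid: str
    wid: str        # workspace id
    msg: str

@dataclass
class InMemoryChat:
    """Toy stand-in for the backend (hypothetical, not a real storage layer)."""
    clock: Callable[[], float] = time.time
    u2u: List[Message] = field(default_factory=list)

    def send_to_user(self, from_uid, to_uid, wid, msg):
        # Stamp the message on arrival and append it to the store.
        m = Message(self.clock(), from_uid, to_uid, wid, msg)
        self.u2u.append(m)
        return m

    def retrieve_u2u_messages(self, to_uid, wid, start_time, end_time):
        # Linear scan; a real store would index by recipient and time.
        return [
            m for m in self.u2u
            if m.to_uid == to_uid and m.wid == wid
            and start_time <= m.ts <= end_time
        ]
```

Usage would look like chat.send_to_user("alice", "bob", "w1", "hi") followed by chat.retrieve_u2u_messages("bob", "w1", start, end).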
High level system design
Websocket between the client and the web app layer
Where and why?
Real-time sending and receiving of messages.
Receiving messages: pull or push. Push is more efficient
User-to-user chat. Let’s say both users are online.
How do you send and how do you receive?
Send: the sender issues an HTTP request
Receive:
The recipient is online. Read the message from the database and send it through the websocket
Design: we can route the traffic for the same workspace to one server.
Assume there is one big workspace globally
There can be multiple workers within the same workspace
We need some way for workers to communicate with each other.
Do the sender and receiver always talk to the same worker?
If not, how do workers talk to each other?
User->user messages are pushed through the worker to the database
2 types of entities
The worker is responsible for retrieving the messages from the database
Interviewer:
Receive message from sender, write to database
Retrieve message from database, send it to the receiver
How do senders talk to the worker?
There is a websocket to the LB, and the LB talks to the worker
Q: Why do we need a load balancer?
A: because we have a large number of clients
Q: How does load balancer select worker?
A: select the worker with the lowest workload
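A minimal sketch of the "lowest workload" selection from the answer above. Using the active connection count as the load metric is an assumption; the notes do not say how workload is measured, and the Worker class and pick_least_loaded are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    active_connections: int = 0  # assumed load metric

def pick_least_loaded(workers):
    """Return the worker with the fewest active connections."""
    return min(workers, key=lambda w: w.active_connections)

workers = [Worker("w1", 12), Worker("w2", 3), Worker("w3", 7)]
chosen = pick_least_loaded(workers)
chosen.active_connections += 1  # account for the newly assigned client
print(chosen.name)  # w2
```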
Q: Anything to add for 1-1 chat
A: I’m fine
Q: Let’s jump to channel chat
A: same logic, the worker will save messages to database
Worker can retrieve the messages from the database
Some workers handle user-to-user
Some workers handle user-channel communication
Database design
Metadata (relational DB, read heavy)
Table of users
Table of channels
Table of channel-to-users mapping
Message database (NoSQL DB, no join operations)
Use a log-based structure for messages
User-to-user communication
Two types of logs: channel log, user inbox
LSM tree, e.g. LevelDB
Message database
Use the database to implement a queue
Q: schema
A: timestamp, message, sender_id, recipient_id
Put the message into the log for the channel, or put the message into the user inbox
Each user has a log
U2U message: pushed directly to the user’s inbox
U2C message: first pushed to the channel log, then fanned out to the users’ inboxes
Split into:
Meta database
Recent message DB
Archive message DB
Recent messages: 2-3 databases
Q: Why do you need to separate the recent message DB from the archive message DB?
You can put it in a cache
The reason to split is that recent messages are accessed more frequently
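One way to picture the recent/archive split is a small router that decides which store a time-range query has to touch. This is a sketch under assumptions: the 30-day cutoff and the function name stores_for_range are invented for illustration, since the notes give no retention boundary.

```python
RECENT_WINDOW_SECONDS = 30 * 24 * 3600  # assumed 30-day cutoff; not from the notes

def stores_for_range(start_time, end_time, now):
    """Return which stores a [start_time, end_time] query must read from."""
    cutoff = now - RECENT_WINDOW_SECONDS
    stores = []
    if start_time < cutoff:
        stores.append("archive")  # part of the range is older than the cutoff
    if end_time >= cutoff:
        stores.append("recent")   # part of the range falls inside the recent window
    return stores
```

Most reads ask for the latest messages, so they would only hit the smaller recent store (or a cache in front of it).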
How do you fan out the channel log to the user inboxes?
We have a service that monitors the channel log and sends new messages to the user inboxes
Add a fanout service
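The fanout service described above could be sketched as follows, with in-memory dictionaries standing in for the channel log, the user inboxes, and the channel-to-users mapping. The per-channel offset and the choice to skip the sender's own inbox are assumptions, not something the notes specify.

```python
from collections import defaultdict

channel_members = {"c1": ["alice", "bob", "carol"]}  # channel-to-users mapping
channel_log = defaultdict(list)    # cid -> append-only list of messages
user_inbox = defaultdict(list)     # uid -> append-only list of messages
fanout_offset = defaultdict(int)   # cid -> index of the next log entry to fan out

def send_to_channel(from_uid, to_cid, msg):
    """Write path: the worker only appends to the channel log."""
    channel_log[to_cid].append(
        {"from_uid": from_uid, "to_cid": to_cid, "message": msg}
    )

def run_fanout(cid):
    """Copy not-yet-delivered log entries into each member's inbox."""
    log = channel_log[cid]
    start = fanout_offset[cid]
    for entry in log[start:]:
        for uid in channel_members.get(cid, []):
            if uid != entry["from_uid"]:  # assumption: sender needs no inbox copy
                user_inbox[uid].append(entry)
    fanout_offset[cid] = len(log)
    return len(log) - start  # number of log entries processed in this pass

send_to_channel("alice", "c1", "hello")
delivered = run_fanout("c1")
```

Tracking an offset per channel means a second pass over the same log delivers nothing twice.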
Q: Why do we need a channel log?
A: can be used for fan-out
Q: who saves to user inbox?
A: the worker can write directly to user inbox
Q: what is a channel log?
A: all messages to a channel
User inbox:
Timestamp, message, from_uid, to_uid, msg_type
Channel log:
Timestamp, message, from_uid, to_cid, msg_type
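The two record layouts above map naturally onto two small record types. This is a sketch only: the field types and the helper to_inbox_entry are assumptions, and the msg_type values are left as free-form strings because the notes do not enumerate them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InboxEntry:
    timestamp: float
    message: str
    from_uid: str
    to_uid: str
    msg_type: str

@dataclass(frozen=True)
class ChannelLogEntry:
    timestamp: float
    message: str
    from_uid: str
    to_cid: str
    msg_type: str

def to_inbox_entry(entry: ChannelLogEntry, to_uid: str) -> InboxEntry:
    """Project a channel log entry into one recipient's inbox record."""
    return InboxEntry(
        entry.timestamp, entry.message, entry.from_uid, to_uid, entry.msg_type
    )
```

The only difference between the two records is the recipient field, which is what makes fanout a simple per-member copy.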
Channel metadata includes the location of the channel DB for each channel
Add a cache for the metadata database
Interviewer and Audience Feedback
Feedback from interviewer
Case is pretty complete
I wanted to simplify the model. Wanted to remove the workflow
“Worker” usually refers to a component that runs scheduled jobs
What was described here is closer to a service
Suggest to read some solution of
LSM appears to be out of scope
Can directly name the type of database
E.g. MongoDB
Review from interviewee
Terminology: Worker => service
I don’t have experience with real products
==
Soft skill
Should listen to the interviewer
E.g. the interviewer did not want to discuss workspaces
Scalability
DAU: try to determine whether the load fits a single machine or needs multiple machines
Audience
Did we finish high level design?
It’s hard to know the focus of the interviewer
Interviewer:
Should cover all components
Service, data storage.
Cover important user scenarios
Make sure all scenarios are covered
Then this component
There may be lots of points to design
How do we cover the points of design?
As a senior engineer, you should identify the important scenarios
First cover the most important scenarios
We usually cover an MVP
How to decide which one is most important?
Based on knowledge and experience
===
Hard skill
A few points:
How to guarantee the ordering?
Inbox: should we create one inbox for each user? Is it a heavy use of resources? How can the design save resources?
Ordering: try to use incoming time
Make sure local: use local time
Local and server time can be synced
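A minimal sketch of ordering by incoming (server arrival) time. The per-channel sequence number used as a tie-breaker is an assumption added for illustration, since two messages can arrive with the same timestamp; ChannelSequencer and in_order are hypothetical names.

```python
import itertools

class ChannelSequencer:
    """Stamps each message with (server_time, seq) as it arrives at the server."""

    def __init__(self):
        self._seq = itertools.count()  # monotonically increasing tie-breaker

    def stamp(self, message, server_time):
        return {
            "server_time": server_time,
            "seq": next(self._seq),
            "message": message,
        }

def in_order(stamped):
    """Sort by arrival time first, then by arrival sequence."""
    return sorted(stamped, key=lambda m: (m["server_time"], m["seq"]))
```

Clients would display messages in this server-assigned order, independent of their own local clocks.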
A, B, C group chat
A’s message fails to reach B’s inbox
From C’s perspective, A may appear to be talking to himself
How to guarantee causality?
We need to make sure reads and writes go to the same node
If we use Kafka, we can use the partition key to make sure messages for the same session go to the same partition, which preserves their order
But for the inbox, the fanout may fail
Group messages do not need to be fanned out to inboxes
Enterprise WeChat: clients can read directly from the group messages
Partial failure may be hard to handle
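The partition-key idea mentioned above rests on a deterministic mapping from key to partition, so every message with the same key lands in the same ordered log. The sketch below illustrates that mapping only. It uses MD5 as a stand-in hash; Kafka's default partitioner uses a murmur2 hash of the key, so the actual partition numbers would differ. The function name partition_for is hypothetical.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always yields the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()  # stand-in hash, not murmur2
    return int.from_bytes(digest[:4], "big") % num_partitions

# Using the channel id as the key keeps one channel's messages in one partition.
p1 = partition_for("channel-42", 8)
p2 = partition_for("channel-42", 8)
```

Note that the mapping changes if the number of partitions changes, which is one reason ordering guarantees are usually stated per partition.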
Client to server: does traffic always need to go through the LB?
Middle tier: proxy
The client opens a websocket to the proxy, and the proxy connects to the backend
The client talks to the LB, the LB talks to a “redirector”, and the “redirector” talks to the backend
Proxy - handle service discovery
Service discovery -> may find a backend
https://www.youtube.com/watch?v=WE9c9AZe-DY
Sticky session in LB, can redirect
LB and websocket
HTTP services all go through the LB
There is a middle tier to maintain the websocket
The client connects directly to the middle tier, and the middle tier goes to the backend
The Netflix video did not clarify whether websockets are used
What happens after the client has connected to the server: does subsequent traffic go through the LB or not?
Load balancer: HTTP connection -> upgrade to socket through a different channel -> subsequent traffic still goes through the LB
LB: terminates the encrypted connection. The LB decrypts the connection and connects to the internal backend
F5 load balancer, has its own CPU, very powerful
LB: HAProxy: redirect, rewrite URL
Load balancer, software LB
F5 hardware LB. active standby, active-active
F5: still goes to HAProxy. L3, L4, L7
Terminates the encrypted connection. Keeps a connection pool to the microservices (backend); selection can be random, round-robin, weighted, or session-sticky
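The backend selection strategies listed above (round-robin, weighted, session sticky) can each be sketched in a few lines. The backend names, weights, and the CRC32-based sticky mapping are assumptions for illustration; real load balancers implement these policies with health checks and more careful hashing.

```python
import itertools
import random
import zlib

backends = ["b1", "b2", "b3"]  # hypothetical backend pool

_rr = itertools.cycle(backends)

def round_robin():
    """Hand out backends in a fixed rotating order."""
    return next(_rr)

def weighted(weights, rng=random):
    """Pick a backend at random, proportionally to its weight."""
    return rng.choices(backends, weights=weights, k=1)[0]

def sticky(session_id):
    """Always map the same session id to the same backend."""
    return backends[zlib.crc32(session_id.encode("utf-8")) % len(backends)]
```

Session stickiness matters for the websocket case in these notes, because a long-lived connection has to keep reaching the same backend.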
Websocket: starts as an HTTP request that is then upgraded
What about subsequent data transmission? Does that still go through load balancer?
If the connection is not broken, it sends data through LB
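The HTTP-to-websocket upgrade discussed above is a normal HTTP request carrying a Sec-WebSocket-Key header; the server answers with a Sec-WebSocket-Accept value derived from that key, and the same TCP connection (through the LB, if one is in the path) then carries websocket frames. The sketch below computes the accept value as defined in RFC 6455, using the sample key from the RFC; the function name websocket_accept is made up for illustration.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a client key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample key from RFC 6455.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```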
LB must shield the IP address of the backend, so traffic must go through the LB
LB vs proxy. LB is more powerful than other proxies
Is LB single point of failure?
The LB can convert IPv6 to IPv4, can rewrite URLs, and can detect signals inside the traffic
The LB must read the headers, and it will need to read all the layers
Multimedia protocol: does it go through the load balancer?
Multimedia: first there is a link and icon, then after clicking we will
L4 LB: handles all traffic (100k concurrent connections)
L7 LB: can handle much less traffic (50k concurrent connections)
The request is small, but the response is large
If the response goes back through the original LB, bandwidth may become a problem
The return traffic can go through a different hub
VIP (virtual IP): multiple RIPs (real IPs) are mapped to one VIP
Interviewer’s design