Escalation and Notification System
Topic: Escalation and Notification System
Interviewer: Eric
Interviewee: jyuan
Level: L5 (Senior)
Design interview:
Public speaking / Workplace communication & behavior interview
Mock System Design Interview Summary
Interview Overview
Date: 5/29/2022
Target level: L5 senior engineer
Duration: 45 minutes
Topic covered: Escalation and notification system
Drawing tool used: excalidraw
Requirements
Functional requirements
Library for different companies or a single company? Pager system for multiple companies
Companies should onboard with SSO service
User can define the rules
Group escalation. Teams are customizable
Tickets of multiple severity. Customize.
Different company/user/group can generate different rules
Users or services (generated by other monitoring systems) can trigger the notification
Non functional requirements
One company: 100,000 teams, 1 million employees, 10 tickets
100 companies: 1,000 TPS
High scalability
High availability
Latency -> a few seconds of delay is fine
Accuracy -> escalate at least once
System Design
External APIs
/api/v1/createTickets(ticketId, group)
Response: 200
createTime, group
/api/v1/updateTickets(ticketId, status)
NonRead, under investigation, pending deployment/fix, resolved
/api/v1/createRules(ruleName, group, ruleInformation)
/api/v1/updateRules
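The ticket endpoints above could be sketched as follows. This is a minimal in-memory illustration, not the design drawn in the interview: the `TicketStore` class, the 409 on a duplicate ticket, and the 400/404 error codes are assumptions. Create returns 201, following the interviewer's feedback later in these notes.

```python
from dataclasses import dataclass, field
from http import HTTPStatus
import time

# Ticket statuses as listed in the notes.
STATUSES = ("NonRead", "under investigation", "pending deployment/fix", "resolved")


@dataclass
class TicketStore:
    """Hypothetical in-memory stand-in for the ticket database."""
    tickets: dict = field(default_factory=dict)

    def create_ticket(self, ticket_id, group):
        # Assumption: creating an existing ticket is rejected with 409.
        if ticket_id in self.tickets:
            return HTTPStatus.CONFLICT, None
        record = {
            "ticketId": ticket_id,
            "group": group,
            "status": "NonRead",
            "createTime": time.time(),
        }
        self.tickets[ticket_id] = record
        # 201 Created rather than 200, per the interviewer feedback.
        return HTTPStatus.CREATED, {"createTime": record["createTime"], "group": group}

    def update_ticket(self, ticket_id, status):
        if status not in STATUSES:
            return HTTPStatus.BAD_REQUEST, None
        if ticket_id not in self.tickets:
            return HTTPStatus.NOT_FOUND, None
        self.tickets[ticket_id]["status"] = status
        return HTTPStatus.OK, {"ticketId": ticket_id, "status": status}
```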
System design
Choosing no-sql database for scalability
Now add message queue / worker / different types of notifications
Q: what do you store in the metadata?
A: the alert rules, and group information
Q: how do you differentiate data from different companies?
A: user has signed in with SSO; frontend can resolve user’s SSO identity to company’s identity
Q: work through workflow
A: create ticket -> fetch rules from the metadata service -> create a task that resolves to the next escalation point
E.g. next 5 minutes
Ticket ID, next escalation time
Send the task to the queue; the worker handles it after the delay
Worker reads the escalation task, double-checks with the metadata service, and either works on the task or drops it (depending on whether the ticket has been resolved)
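The worker step above might look roughly like this. It is a sketch under assumptions: `EscalationTask`, `handle_task`, the `chain` list of escalation points, and the fixed `interval_s` are hypothetical names, and the `tickets` dict stands in for a call to the metadata service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationTask:
    ticket_id: str
    level: int      # index into the escalation chain (assumed representation)
    due_at: float   # next escalation time, epoch seconds


def handle_task(task, tickets, chain, interval_s, notify):
    """Process one due task; return the follow-up task, or None to stop."""
    ticket = tickets.get(task.ticket_id)
    # Double-check the current state: drop the task if the ticket is gone
    # or has already been resolved.
    if ticket is None or ticket["status"] == "resolved":
        return None
    notify(chain[task.level], task.ticket_id)
    next_level = task.level + 1
    if next_level >= len(chain):
        return None  # no further escalation point
    # Schedule the next escalation point, e.g. 5 minutes later.
    return EscalationTask(task.ticket_id, next_level, task.due_at + interval_s)
```

Because the worker re-reads the ticket before acting, a stale task for a resolved ticket is simply dropped rather than needing to be cancelled in the queue.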
For different tickets, we can hash based on ticketIDs, then hand tasks to different workers
If worker gets a task that the worker does not own:
It can put the task back to the queue
Or push the task to the right worker
We can use zookeeper to handle sharding
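Hashing on ticket ID to pick a worker can be sketched as below. The function names are hypothetical, and simple modulo hashing is an assumption (the notes do not say which scheme was intended); the ZooKeeper coordination mentioned above is not shown.

```python
import hashlib


def owner_of(ticket_id: str, num_workers: int) -> int:
    """Map a ticket ID to a worker index with a stable hash."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would disagree between workers.
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def route(ticket_id: str, my_index: int, num_workers: int) -> str:
    """Decide whether this worker processes the task or hands it on."""
    owner = owner_of(ticket_id, num_workers)
    if owner == my_index:
        return "process"
    # Either push to the right worker or put the task back on the queue.
    return f"forward:{owner}"
```

One caveat of modulo hashing: changing the number of workers remaps most tickets, so resizing the pool would need extra care.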
What happens if the metadata changed?
Worker can pull the metadata
What priority queue is used?
Store object, and next escalation time
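A priority queue ordered by next escalation time could be sketched with Python's `heapq`. The `EscalationQueue` class is a hypothetical illustration; the notes do not name a specific implementation.

```python
import heapq
import itertools


class EscalationQueue:
    """Min-heap of (next escalation time, ticket ID)."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties so equal due times pop in insertion order.
        self._seq = itertools.count()

    def push(self, due_at, ticket_id):
        heapq.heappush(self._heap, (due_at, next(self._seq), ticket_id))

    def pop_due(self, now):
        """Remove and return every (ticket_id, due_at) whose time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due_at, _, ticket_id = heapq.heappop(self._heap)
            due.append((ticket_id, due_at))
        return due

    def __len__(self):
        return len(self._heap)
```

Insertion and removal are O(log N), and the worker only needs to look at the head of the heap to know whether anything is due.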
Q: If the user received the email, why is it necessary for the worker to add the message back to the queue?
A: if the user does not respond, then it may be necessary for the message to be sent to the next escalation point
Q: What happens if the worker goes down?
A: Create a disaster recovery service for tasks that have not been processed by the worker. The disaster recovery service will call the frontend service.
For each worker, we can have a replica
Disaster recovery to read the metadata service; find messages that are not resolved; compare the escalation time vs next escalation time; resend it back to the queue
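The recovery scan might be sketched like this. It assumes each ticket record stores a `next_escalation_at` field and reads "compare the escalation time" as comparing the current time against that field; both are assumptions, as is the function name.

```python
def find_tasks_to_requeue(tickets, now):
    """Return (next_escalation_at, ticket_id) for unresolved, overdue tickets.

    `tickets` stands in for a scan of the metadata service / database.
    """
    overdue = []
    for ticket_id, ticket in tickets.items():
        if ticket["status"] == "resolved":
            continue  # nothing left to escalate
        if ticket["next_escalation_at"] <= now:
            # The escalation should already have fired; resend it to the queue.
            overdue.append((ticket["next_escalation_at"], ticket_id))
    return sorted(overdue)  # oldest first
```

Since the database stays the source of truth, the queue contents can be rebuilt from this scan after a worker is lost.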
Status of the messages:
Succeed: add back to priority queue
Failed, e.g. invalid email, or 3rd party tool failed
If it’s not retriable, then drop the user
If it’s retriable, then put back to the queue
If the user presses “acknowledge” on the phone, then we can mark the message as “in progress”. We can drop future tasks.
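The message status handling above can be summarized as a small decision function. The `Outcome` enum and the action strings are hypothetical labels chosen for this sketch.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED_RETRIABLE = "failed_retriable"   # e.g. 3rd party tool failed
    FAILED_PERMANENT = "failed_permanent"   # e.g. invalid email


def next_action(outcome: Outcome, acknowledged: bool) -> str:
    """Decide what the worker does after a notification attempt."""
    if acknowledged:
        # Ticket is "in progress": no further escalation needed.
        return "drop_future_tasks"
    if outcome is Outcome.SUCCEEDED:
        # Delivered but not acknowledged: schedule the next escalation point.
        return "requeue_for_next_escalation"
    if outcome is Outcome.FAILED_RETRIABLE:
        return "requeue_retry"
    return "drop"
```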
Metrics: scaling up the workers
Q: How do we know the system is working?
A: depends on the metrics
Q: What if the frontend is down?
A: Track the service. Data announce. Canary to verify the end-to-end flow.
What if the system is comp
Interviewer and Audience Feedback
Interviewer:
Good candidate for L4
L5: borderline
Requirement gathering
Design was a bit confusing
Whether the worker is stateful or stateless (e.g. queue)
Discovery system. Worker goes down. Can be recovered better
Metadata service does too much
We may split into two services.
====
Interviewee:
May not have gathered enough requirements
Spoke too quickly
Disaster recovery - can have more improvements. Scan database
Worker: data is in priority queue, but source of truth is in database. So worker is still stateless
====
Audience
Will it be different if we share this across many companies vs. use it internally within a single company?
A: much smaller scale for internal systems
A few minutes of drawing. He was silent. Does the interviewer need an explanation?
Interviewee: Drop architecture design. I can confirm with the interviewer after each stage
Interviewer: I usually don’t interrupt the interviewee
==
Audience
Interviewer said “Do you have something more to add?” What’s expected?
Interviewer:
No huge expectations. Monitoring by another party
My main concern was the design itself. There were small issues, so I was on borderline for L5
I felt the metadata service was too monolithic. I was hoping for some refactoring into different services
API response should not be 200
===
Interviewee: what did you mean that I was missing a component?
Interviewer:
the metadata service is taking on too many responsibilities. Hoping to have more services
I was confused about the frontend.
Interviewee: I should improve names of the services
Interviewer: you can correct the name in the middle of the interview.
===
Interviewee: why not return 200?
Interviewer: GET should return 200. Create should return 201 or 202
Not a big deal
Not big issue overall, but we may not go to L5
===
Audience: suggestion: what happens if the worker fails?
You can proactively work on fault tolerance. (similar to running test case)
===
Audience: how to acknowledge
A: We can use the metadata service to update the database
Escalation. It’s part of the business logic. We may not need to go into a lot of details.
===
Audience: there is a state machine. Continuously evaluate. We may need to dedupe.
Worker should be similar to a cronjob. Continuously look at unhandled task
A: Priority queue. At the point of escalation, we can drop the task if it’s already handled. It’s more lightweight
Audience: Noisy neighbor. One service may flood other services
Need to guarantee everybody gets served
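One common way to keep a noisy neighbor from starving others is to dequeue round-robin across companies. This was not part of the interview design; the `FairQueue` class below is a hypothetical sketch of that idea.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin across companies so one flood cannot block the rest."""

    def __init__(self):
        # company -> pending tasks; dict order is the service order.
        self._per_company = OrderedDict()

    def push(self, company, task):
        self._per_company.setdefault(company, deque()).append(task)

    def pop(self):
        """Return (company, task) from the next company in turn, or None."""
        if not self._per_company:
            return None
        company, pending = next(iter(self._per_company.items()))
        task = pending.popleft()
        del self._per_company[company]
        if pending:
            # Move the company to the back so the others get served first.
            self._per_company[company] = pending
        return company, task
```

With company A holding three tasks and company B holding one, the service order is A, B, A, A, so B waits for at most one of A's tasks.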
===
Audience: why was there no database design?
Interviewee: we should consider it. It feels the time was too tight
===
Audience: Schedule state machine. 2 scenarios.
Email does not succeed. Should we put the message back to the queue?
A: put it to the priority queue. Invalid email, just drop the email. Retriable: put it back to the priority queue (not message queue)
If email succeeds. Should we wait for the acknowledgement?
Email cannot return read or no read.
User should update the ticket
Worker may be down or may not be able to handle the request
Didn’t consider (200 threads, 3rd party may be down. Threads may be all blocked) we probably need another component.
====
Audience: why do we use priority queue?
1M paging. 99% can be put into the queue. Some may not be handled, it may be a waste of resource
Why do we use priority queue but not a scheduling service?
A: not familiar with scheduling services. The priority queue keeps tasks in order based on time. A cronjob may require lots of resources to scan the database
Q: different ticket may have different time to handle. Some items may take 7 days. When we add a new item log(N) for 15 minutes. We may fill up the system.
A: can consider data management service.
Any suggestions:
Split metadata service into metadata and runtime data.
Priority queue: we may be able to throw message back to message queue. No need for disaster recovery
Message queue: is in-order
Confused about priority queue; priority queue adds status to the worker
Every time worker is down, we need recovery
Audience: we use dynamodb, eventual consistency. Time sensitive. if the message sending failed, we rescan the table. There may be some delay.
Q: easy to extend. No relations.
Should everything go through message queue? Should we scan again?
A: Create and update should both go through message queue. There may be duplicate task. Use metadata service to dedupe
Q: ticket update is async. It may not be friendly to user.
A: database update is synchronous. Notification should be asynchronous.
Q: if nobody takes care of the ticket, should we have a timedb or re-pickup?
A: we can throw into priority, but it becomes