Ads Targeting System
Topic: Ads Targeting System
Interviewer: LW(gulingwei852)
Interviewee: 刘君洁
Level: L4 (Experienced Individual Contributor)
Mock System Design Interview Summary
Interview Overview
Date: 8/7/2022
Target level: L4
Duration: 45 minutes
Topic covered: Ads targeting system
Drawing tool used: ?
Requirements
Functional requirements
Targeting users
Users need to be tagged; users with different tags get different behavior
Hundreds of tags for users
Millions of users
APIs:
AddTag(tag_info, user_id)
tag_infos get(user_id)
user_ids get(tag_info)
Attach tags to users
For a specific user: return the list of tags and the properties of each tag
Q: Do we need anything else on the user?
A: Only the user ID and the tags on the user. Nothing else.
Q: What is on the tag?
A: Different tags carry different metadata, with no overlap.
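The three APIs above can be sketched as a toy in-memory store that keeps both lookup directions in sync. This is only an illustration: the class name TagStore, the method names, and the dict-based metadata are assumptions, not something agreed on in the interview.

```python
from collections import defaultdict
from typing import Any, Dict, Set


class TagStore:
    """Toy in-memory store (hypothetical) that indexes tags in both directions."""

    def __init__(self) -> None:
        self._tag_meta: Dict[str, Dict[str, Any]] = {}
        self._tags_by_user: Dict[str, Set[str]] = defaultdict(set)
        self._users_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def add_tag(self, tag_id: str, tag_meta: Dict[str, Any], user_id: str) -> None:
        # AddTag(tag_info, user_id): attach one tag to one user.
        self._tag_meta[tag_id] = tag_meta
        self._tags_by_user[user_id].add(tag_id)
        self._users_by_tag[tag_id].add(user_id)

    def get_tags(self, user_id: str) -> Dict[str, Dict[str, Any]]:
        # get(user_id): tags on the user, each with its own metadata.
        return {t: self._tag_meta[t] for t in self._tags_by_user.get(user_id, set())}

    def get_users(self, tag_id: str) -> Set[str]:
        # get(tag_info): all users carrying the tag (returned as a copy).
        return set(self._users_by_tag.get(tag_id, set()))
```

A real service would replace the dictionaries with a database and a cache; the point here is only that every write has to update both directions.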
Non-functional requirements
Upstream services have different latency requirements:
Some need millisecond-level responses: e.g. subscription for a paid service
Some do not need a fast response.
Reads should all be very fast
Writes don’t need a very fast response
QPS: for add and get
For reads: offpeak 5k QPS. peak hours 25k QPS
For writes: 2K QPS. 1 request may contain millions of users
Minimum consistency is enough
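A back-of-the-envelope estimate from these numbers could look like the sketch below. Only the 25k peak read QPS comes from the requirements; the user count, tags per user, bytes per pair, and per-node read capacity are assumed placeholder values chosen for illustration.

```python
def estimate(users: int, tags_per_user: int, bytes_per_pair: int,
             peak_read_qps: int, reads_per_node: int):
    """Rough sizing: mapping rows, raw storage in GB, and read nodes at peak."""
    pairs = users * tags_per_user                 # (user_id, tag_id) rows
    storage_gb = pairs * bytes_per_pair / 1e9     # raw size, no indexes or replicas
    nodes = -(-peak_read_qps // reads_per_node)   # ceiling division
    return pairs, storage_gb, nodes


# Assumed inputs: 10M users, 20 tags per user, 16 bytes per pair,
# and 10k reads per second per node. Peak read QPS is from the notes.
pairs, storage_gb, nodes = estimate(
    users=10_000_000,
    tags_per_user=20,
    bytes_per_pair=16,
    peak_read_qps=25_000,
    reads_per_node=10_000,
)
print(pairs, storage_gb, nodes)
```

Under these assumptions the mapping is a few gigabytes before indexes and replication, which is small enough to shape the storage discussion early.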
Add MQ and cache
When adding tag, we need to invalidate cache
Interviewer: MQ is still needed because there may be spikes
Interviewer: what’s in the cache?
A: user_id -> tag_ids. Tag_id -> user_ids
Q: If we need a very fast response, do we need to go through the queue?
A: We need to fanout; if we don’t go through the queue, there may be timeouts
Q: some requests may be smaller and need faster response. Do we have the same message queue for smaller and larger requests?
A: we can have 2 message queues.
We can use relational database
Q: there may be spikes in the writes
A: The message queue can buffer the spikes
Interviewer: examples:
A user subscribes to a service; a request may just contain one user with one tag
A campaign may tag millions of users with one tag
Interviewee: are there multiple tags for a campaign?
Interviewer: one tag
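The two-queue idea for these two kinds of requests can be sketched as follows. The threshold, the chunk size, and the in-process deques standing in for real message queues are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Iterator, List, Tuple

SMALL_REQUEST_LIMIT = 100   # assumed cutoff between "small" and "large" requests
CHUNK_SIZE = 1000           # assumed fan-out chunk size for large requests

Message = Tuple[str, List[str]]      # (tag_id, user_ids)
fast_queue: Deque[Message] = deque()  # stand-in for a low-latency queue
bulk_queue: Deque[Message] = deque()  # stand-in for a high-throughput queue


def chunks(user_ids: List[str], size: int) -> Iterator[List[str]]:
    """Split a large user list into fixed-size pieces."""
    for start in range(0, len(user_ids), size):
        yield user_ids[start:start + size]


def enqueue_add_tag(tag_id: str, user_ids: List[str]) -> str:
    """Route a write: small requests stay whole, large ones fan out in chunks."""
    if len(user_ids) <= SMALL_REQUEST_LIMIT:
        fast_queue.append((tag_id, user_ids))
        return "fast"
    for chunk in chunks(user_ids, CHUNK_SIZE):
        bulk_queue.append((tag_id, chunk))
    return "bulk"
```

With this split, a single-user subscription write never waits behind a campaign that tags millions of users.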
Interviewer: why not NoSQL?
Interviewee: the options are SQL, a blob store, or NoSQL
Interviewer: what is a blob store?
Interviewee: it can store very large objects, on the order of gigabytes
Interviewer: what type of NoSQL? What schema to use?
Interviewee: key value store
Key: Tag_name/User_id
You can shard by the hash of the tag name or user ID.
Value: can be an array of IDs, or a string which points to a blob store
Q: why do you still use Blob store?
A: one tag may have millions of users
Tag_id, tag_info
We still need to retrieve users by tag. In this case we need to have a large value that contains
Interviewee: Key value store: I only know react database
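The key-value schema discussed above, where a value is either an inline array of IDs or a pointer into a blob store, might look roughly like this. The two dictionaries stand in for the real stores, and the inline limit and key prefixes are assumptions.

```python
import json
from typing import Dict, List

INLINE_LIMIT = 1000  # assumed: larger ID lists are moved to the blob store

kv_store: Dict[str, str] = {}      # stand-in for the key-value store
blob_store: Dict[str, bytes] = {}  # stand-in for the blob store


def put_ids(key: str, ids: List[str]) -> None:
    """Store a list of IDs inline, or as a pointer to a blob if it is large."""
    if len(ids) <= INLINE_LIMIT:
        kv_store[key] = json.dumps({"inline": ids})
    else:
        blob_key = f"blob/{key}"
        blob_store[blob_key] = json.dumps(ids).encode("utf-8")
        kv_store[key] = json.dumps({"blob": blob_key})


def get_ids(key: str) -> List[str]:
    """Read the ID list, following the blob pointer when there is one."""
    record = json.loads(kv_store[key])
    if "inline" in record:
        return record["inline"]
    return json.loads(blob_store[record["blob"]].decode("utf-8"))
```

One drawback of the blob variant is that adding a single user to a huge tag rewrites the whole blob, which is part of why the later discussion leans toward one row per (tag, user) pair.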
Interviewer: How do we implement consistency
Interviewee: Reads are more frequent than writes. After an update succeeds, we invalidate the cache, so the next read reloads the most up-to-date data
If we have a very large request, the system can be slow.
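The invalidate-after-write approach described in this answer can be sketched as a cache-aside reader. The class name and the dictionaries standing in for the database and the cache are assumptions; the sketch ignores concurrency and cache expiry.

```python
from typing import Dict, FrozenSet, Set


class CachedTagReader:
    """Cache-aside sketch: reads fill the cache, writes invalidate it."""

    def __init__(self) -> None:
        self.db: Dict[str, Set[str]] = {}           # source of truth
        self.cache: Dict[str, FrozenSet[str]] = {}  # user_id -> cached tag_ids
        self.db_reads = 0                           # counts cache misses

    def get_tags(self, user_id: str) -> FrozenSet[str]:
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        tags = frozenset(self.db.get(user_id, set()))
        self.cache[user_id] = tags
        return tags

    def add_tag(self, user_id: str, tag_id: str) -> None:
        # Write to the database first, then drop the stale cache entry.
        self.db.setdefault(user_id, set()).add(tag_id)
        self.cache.pop(user_id, None)
```

For a campaign that tags millions of users, this means millions of invalidations, which is one reason a very large request can be slow.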
Interviewer and Audience Feedback
Audience:
User goes to website
Javascript -> give me an ad
It is the most commonly used API
Feedback
Interviewer:
Reasonable flow
Needed a lot of hints
I asked about QPS and traffic for databases. Interviewee was distracted.
Interviewee didn’t estimate the traffic
Interviewee
I didn’t understand the requirements. Spent a lot of time on requirement clarification
If there were a set of clear APIs, it could be smoother
After more than 10 minutes of
===
Interviewer:
Ideal design
Some requests need a fast response; these can go directly to the backend
Some requests don’t, but they are very large.
NoSQL probably makes more sense than SQL
Wanted to discuss more
NoSQL database:
Tag to user
User to tag
From tag to user: the key is the tag, the value is a huge list of users?
Interviewee:
It feels like a SQL database can handle XXk QPS
Audience
NoSQL, SQL are both fine
SQL: the tag and user ID can be combined into one composite key
SQL can support prefix scans
Partition:
You can scan by prefix within a partition
If we partition by tag, then once we find the partition we find the tag together with all of its users
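The prefix scan over a composite (tag, user ID) key can be illustrated with a sorted list of tuples standing in for one partition's ordered index. The class and method names are hypothetical.

```python
import bisect
from typing import List, Tuple


class SortedPairIndex:
    """Sorted (tag_id, user_id) keys, standing in for an ordered index."""

    def __init__(self) -> None:
        self._keys: List[Tuple[str, str]] = []

    def add(self, tag_id: str, user_id: str) -> None:
        key = (tag_id, user_id)
        pos = bisect.bisect_left(self._keys, key)
        # Insert in sorted position, skipping exact duplicates.
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def users_for_tag(self, tag_id: str) -> List[str]:
        # Seek to the first key with this tag prefix, then scan until it changes.
        start = bisect.bisect_left(self._keys, (tag_id, ""))
        users = []
        for tag, user in self._keys[start:]:
            if tag != tag_id:
                break
            users.append(user)
        return users
```

Because all keys for one tag sit next to each other, the scan touches only that tag's rows.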
===
In real-world use, getUserByTag is rarely called
getTagsByUser is called much more often
SQL: read-heavy
Some SQL databases use SSTables instead of B+ trees
1 million records take very little storage
SQL:
User_id, tag_id
Sharded SQL is fine.
Don’t use NoSQL.
2 tables
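One possible reading of the two-table SQL design is a tag metadata table plus a (user_id, tag_id) mapping table. The sketch below uses Python's built-in sqlite3 module on a single node; the table and column names are assumptions, and sharding is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (
    tag_id   TEXT PRIMARY KEY,
    tag_info TEXT NOT NULL
);
CREATE TABLE user_tag (
    user_id TEXT NOT NULL,
    tag_id  TEXT NOT NULL REFERENCES tag(tag_id),
    PRIMARY KEY (user_id, tag_id)               -- serves "tags by user"
);
CREATE INDEX user_tag_by_tag ON user_tag (tag_id, user_id);  -- serves "users by tag"
""")


def add_tag(tag_id: str, tag_info: str, user_id: str) -> None:
    with conn:  # both inserts commit or roll back together
        conn.execute("INSERT OR IGNORE INTO tag VALUES (?, ?)", (tag_id, tag_info))
        conn.execute("INSERT OR IGNORE INTO user_tag VALUES (?, ?)", (user_id, tag_id))


def tags_for_user(user_id: str):
    rows = conn.execute(
        "SELECT t.tag_id, t.tag_info FROM user_tag ut "
        "JOIN tag t ON t.tag_id = ut.tag_id "
        "WHERE ut.user_id = ? ORDER BY t.tag_id", (user_id,))
    return rows.fetchall()


def users_for_tag(tag_id: str):
    rows = conn.execute(
        "SELECT user_id FROM user_tag WHERE tag_id = ? ORDER BY user_id", (tag_id,))
    return [row[0] for row in rows]
```

On a single node the transaction keeps both tables consistent; once the tables are sharded on different keys, that guarantee no longer comes for free.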
===
Audience:
If the interviewee is sure SQL is sufficient, why not state it to the interviewer with confidence? What are the tradeoffs?
Audience:
You can define a schema. This can clarify.
Interviewee:
Many-to-many table
SQL and NoSQL are both fine. I was more confident in SQL.
Interviewer:
NoSQL: if write traffic is heavier, NoSQL can handle it more easily
I agree that a many-to-many relationship is better handled by SQL
===
Design tradeoff:
Tag meta
User meta
Tag_id, user_id
Difficult to maintain consistency across 2 tables
Cassandra?
Redis is fine since it is single-threaded; a Lua script can make the update atomic
NoSQL: cannot handle this user scenario. Hard to handle many-to-many
Can use Redis.
NoSQL:
tag_id + user_id can become a key
Ads: tag may include “sports”, “basketball”
Key = sports, value = basketball
How do we find tag by users? And user by tag?
Then we need 2 tables.
It is hard to handle consistency with NoSQL
We can handle this using the application.
It’s all code. Move it to application.
Database cannot ensure commit across partitions or machines.
We can solve with application
Before you write, we need to write the state in a different database.
Application can implement 2PC.
The same approach works across microservices.
Let’s say there is a tag service, user service.
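An application-level two-phase commit across a tag service and a user service could be sketched as below. The Participant class, the decision log dictionary, and the function name are hypothetical, and the sketch leaves out timeouts and crash recovery, which a real implementation would need.

```python
from typing import Dict, Set, Tuple


class Participant:
    """A service that can stage a write, then commit or abort it."""

    def __init__(self, name: str, fail_on_prepare: bool = False) -> None:
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.staged: Dict[str, Tuple[str, str]] = {}
        self.committed: Dict[str, Set[str]] = {}

    def prepare(self, txn_id: str, key: str, value: str) -> bool:
        if self.fail_on_prepare:
            return False  # vote "no"
        self.staged[txn_id] = (key, value)
        return True       # vote "yes"

    def commit(self, txn_id: str) -> None:
        key, value = self.staged.pop(txn_id)
        self.committed.setdefault(key, set()).add(value)

    def abort(self, txn_id: str) -> None:
        self.staged.pop(txn_id, None)


def two_phase_add(txn_id: str, tag_service: Participant, user_service: Participant,
                  tag_id: str, user_id: str, decision_log: Dict[str, str]) -> bool:
    writes = [(tag_service, tag_id, user_id), (user_service, user_id, tag_id)]
    prepared = []
    # Phase 1: ask every participant to stage its write.
    for participant, key, value in writes:
        if not participant.prepare(txn_id, key, value):
            break
        prepared.append(participant)
    ok = len(prepared) == len(writes)
    # Record the decision before acting on it (the separate state store).
    decision_log[txn_id] = "commit" if ok else "abort"
    # Phase 2: commit everywhere, or abort whatever was staged.
    for participant in prepared:
        if ok:
            participant.commit(txn_id)
        else:
            participant.abort(txn_id)
    return ok
```

The decision log plays the role of the state written to a different database before the write, so an interrupted transaction can later be finished or rolled back.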
Audience:
Using SQL is the simplest. NoSQL is much harder
===
Audience:
NoSQL: all NoSQL databases are easy to partition
Document DB, graph
Audience:
LSM trees don’t serve reads very well, since a read may need to scan through several segments. Cassandra may have some examples
Optimize for write
Audience:
Read-heavy, B-tree + RDBMS = PostgreSQL
Write-heavy, LSM tree + RDBMS = MySQL + MyRocks
Read-heavy, B-tree + NoSQL = MongoDB
Write-heavy, LSM tree + NoSQL = Cassandra
Audience
Cassandra and DynamoDB are both based on LSM trees. Reads are slow.
Audience:
To partition a SQL database, people need to shard it manually.
NoSQL handles partitioning natively.