Auto Complete for Search
Topic: Auto Complete for Search
Interviewer: pxl
Interviewee: Li Ning
Level: L5 (Senior)
Topic
Mock System Design Interview Summary
Interview Overview
Date: 6/5/2022
Target level: L5
Duration: 45 minutes
Topic covered: Auto complete for search
Drawing tool used: ?
Key problem vote: ?
Requirements
Functional requirements
User type prefix, should provide suggestions. Top 10 suggestions. Sorted based on popularity
English only, lower or upper
Build out history data
Prefix len should be less than 20
Non functional requirements
Real time response
Based on one week of history data
Scalable: 100M
HA
No strong consistency
System Design
External APIs
100M DAU
5 queries per day
Write QPS: 100M * 5 / (24 * 3600) = 100 * 10 ^ 6 / 8*10^4 = 10,000 QPS
Read QPS: 10 ^ 4 * 20 = 2 * 10^5
Storage: 20 * 100M * 5 * 20% (20% searches are new) = 2G per day
System design
15:00
A basic design with RDBMS to satisfy functional requirements
Database schema
Get top 10 using a prefix, and sort by frequency
The basic service is not scalable. I will scale it up.
API:
1 GET searchText=xxxx
2 GET prefix=a
Response {words: [apple, act, action]}
For 100M users, we cannot use RDBMS
There are 2 parts in the system.
Collect data
Search results for autocomplete
We can add queue to buffer the search request
We need to count the frequency for each search words for the past 7 days
We can add an aggregation service
We can use batch job in the background
We can use trie structure to save the work.
Save space
O(N) complexity for search
Example of “apple”
Why we build trie, we can save the frequency in each node
Using trie, we can only save in memory. We have to write our own data structure
This may not be good, since we cannot take advantage of existing solutions such as cache.
We can use key-value store, and use prefixes as keys
We can use redis as cache.
We can use kafka as message queue
Top-K generation service: we can have more than one generation. For each letter A,B,C, we can generate from English letter to each letter.
Q: all data is cached in the cach system?
A: All data is saved in the database. Cache only stores top 10
We can also save cache data into a data storage
Q: What if there is a cache miss?
A: every 7 days we will generate once, it’s not real time. It may return the empty string if there are no top ten for a prefix (say AAA).
We can update every hour
Interviewer and Audience Feedback
In general
Soft skill -
Covered requirement thoroughly
Hard skill
Functional system
Real time
Scale
Trie data structure
Used cache for speeding up
Spent too much time, search service
Caching service. How to aggregate, not sufficient coverage
Overall, good performance.
=
Interviewee
I was a bit too nervous
The design is different from the simple design. Spent too much time in the simple design
Have not expressed very clearly
Aggregation: was planning to use batch job/map reduce to do aggregation.
What I spoke is a bit different form what I thought
=
Audience
We had 2 step solution. Do we need to do 2 steps? One with basic, one with advanced solution.
Interviewer
Step 1 implemented functional requirement
It reduced time in step 2.
It may be fine with some interviewer, but the tradeoff is not having enough time.
Interviewee
If we hit a new question, it may be hard to design a scaled up solution immediately.
It can be easier to have a basic solution first.
Audience
What are the key points from the interviewer? I am guessing it’s trie and keep things fresh.
Interviewer:
Yes. Interviewee lost time to solve the key difficulties in the interview.
==
Audience
Are we just using log?
What is in the object store? Why do we need to do aggregation?
Interviewee
There are many users who search for Apple.
We store in the object store to understand how many users within the time interval searched “Apple”.
It’s not for prefix, but for the final search query and the count.
Audience:
Different users come at different times.
Interviewee: search service
Do we store raw data?
Interviewer:
Usually we can store in log, then we aggregate data in the log.
Audience:
Why do we store in object store, not object cache directly?
Interviewee:
We can remove the object store
==
Audience
If we test Google L6, are there additional requirements?
Interviewee
Personalized, English, capital vs small letter. Other languages
Audience
Trie vs key-value pair. Which is your final solution?
Interviewee
Only use key-value pair.
Trie: there are no existing solutions. You can build the trie within the service.
Object service, new trie
We can read from key value store. No need to rebuild trie
Key value store key, what are the keys?
Interviewee: every level of prefix tree is stored as keys
We are using space to trade faster time.
What’s in the primary object store
Why do we need to store in object store? Why not just in RDBMS?
Interviewer/interviewee:
We can use RDBMS as well
Interviewer
We should use no-sql to support high throughput
Audience
We should emphasize the need for low latency
Audience
How do we actually calculate topK?
Interviewee
We have to calculate based on all possible prefixes
We can use map reduce as well.
Audience
How frequently should we update?
Interviewee
We can periodically update, can be 1 hour, 2 hour, 1 day.
Audience
What happens for sudden news events
Interviewee
We have to manually intervene to update the index
Audience
We can combine a batch job and a streaming job
Audience
Autocomplete is based on index
Crawler also requires to speed up for sudden news
We can monitor current hot keywords and trigger crawler
Audience
Can split into 2 parts, a slow and a fast
Then we can combine them
Interviewee
Yes we can split into 2 parts, streaming and batch
Audience
How do you update to new index? AB system?
Interviewee
We can lock the record in redis
Audience
Redis is single thread, so there is no need to lock
Audience
Why not put the result of streaming and batch service into the same store
Audience
It’s possible. Lambda architecture
Audience
It may be better not to update the cache in real time. This will increase the pressure on the cache (written too frequently)