Auto Complete for Search

Topic: Auto Complete for Search

Interviewer: pxl

Interviewee: Li Ning

Level: L5 (Senior)

Topic

Mock System Design Interview Summary

Interview Overview

Date: 6/5/2022

Target level: L5

Duration: 45 minutes

Topic covered: Auto complete for search

Drawing tool used: ?

Key problem vote: ?

Requirements

Functional requirements

User type prefix, should provide suggestions. Top 10 suggestions. Sorted based on popularity

English only, lower or upper

Build out history data

Prefix len should be less than 20

Non functional requirements

Real time response

Based on one week of history data

Scalable: 100M

No strong consistency

System Design

External APIs

100M DAU

5 queries per day

Write QPS: 100M * 5 / (24 * 3600) = 100 * 10 ^ 6 / 8*10^4 = 10,000 QPS

Read QPS: 10 ^ 4 * 20 = 2 * 10^5

Storage: 20 * 100M * 5 * 20% (20% searches are new) = 2G per day

System design

15:00

A basic design with RDBMS to satisfy functional requirements

Database schema

Get top 10 using a prefix, and sort by frequency

The basic service is not scalable. I will scale it up.

API:

1 GET searchText=xxxx

2 GET prefix=a

Response {words: [apple, act, action]}

For 100M users, we cannot use RDBMS

There are 2 parts in the system.

Collect data

Search results for autocomplete

We can add queue to buffer the search request

We need to count the frequency for each search words for the past 7 days

We can add an aggregation service

We can use batch job in the background

We can use trie structure to save the work.

Save space

O(N) complexity for search

Example of “apple”

Why we build trie, we can save the frequency in each node

Using trie, we can only save in memory. We have to write our own data structure

This may not be good, since we cannot take advantage of existing solutions such as cache.

We can use key-value store, and use prefixes as keys

We can use redis as cache.

We can use kafka as message queue

Top-K generation service: we can have more than one generation. For each letter A,B,C, we can generate from English letter to each letter.

Q: all data is cached in the cach system?

A: All data is saved in the database. Cache only stores top 10

We can also save cache data into a data storage

Q: What if there is a cache miss?

A: every 7 days we will generate once, it’s not real time. It may return the empty string if there are no top ten for a prefix (say AAA).

We can update every hour

Interviewer and Audience Feedback

In general

Soft skill -

Covered requirement thoroughly

Hard skill

Functional system

Real time

Scale

Trie data structure

Used cache for speeding up

Spent too much time, search service

Caching service. How to aggregate, not sufficient coverage

Overall, good performance.

Interviewee

I was a bit too nervous

The design is different from the simple design. Spent too much time in the simple design

Have not expressed very clearly

Aggregation: was planning to use batch job/map reduce to do aggregation.

What I spoke is a bit different form what I thought

Audience

We had 2 step solution. Do we need to do 2 steps? One with basic, one with advanced solution.

Interviewer

Step 1 implemented functional requirement

It reduced time in step 2.

It may be fine with some interviewer, but the tradeoff is not having enough time.

Interviewee

If we hit a new question, it may be hard to design a scaled up solution immediately.

It can be easier to have a basic solution first.

Audience

What are the key points from the interviewer? I am guessing it’s trie and keep things fresh.

Interviewer:

Yes. Interviewee lost time to solve the key difficulties in the interview.

Audience

Are we just using log?

What is in the object store? Why do we need to do aggregation?

Interviewee

There are many users who search for Apple.

We store in the object store to understand how many users within the time interval searched “Apple”.

It’s not for prefix, but for the final search query and the count.

Audience:

Different users come at different times.

Interviewee: search service

Do we store raw data?

Interviewer:

Usually we can store in log, then we aggregate data in the log.

Audience:

Why do we store in object store, not object cache directly?

Interviewee:

We can remove the object store

Audience

If we test Google L6, are there additional requirements?

Interviewee

Personalized, English, capital vs small letter. Other languages

Audience

Trie vs key-value pair. Which is your final solution?

Interviewee

Only use key-value pair.

Trie: there are no existing solutions. You can build the trie within the service.

Object service, new trie

We can read from key value store. No need to rebuild trie

Key value store key, what are the keys?

Interviewee: every level of prefix tree is stored as keys

We are using space to trade faster time.

What’s in the primary object store

Why do we need to store in object store? Why not just in RDBMS?

Interviewer/interviewee:

We can use RDBMS as well

Interviewer

We should use no-sql to support high throughput

Audience

We should emphasize the need for low latency

Audience

How do we actually calculate topK?

Interviewee

We have to calculate based on all possible prefixes

We can use map reduce as well.

Audience

How frequently should we update?

Interviewee

We can periodically update, can be 1 hour, 2 hour, 1 day.

Audience

What happens for sudden news events

Interviewee

We have to manually intervene to update the index

Audience

We can combine a batch job and a streaming job

Audience

Autocomplete is based on index

Crawler also requires to speed up for sudden news

We can monitor current hot keywords and trigger crawler

Audience

Can split into 2 parts, a slow and a fast

Then we can combine them

Interviewee

Yes we can split into 2 parts, streaming and batch

Audience

How do you update to new index? AB system?

Interviewee

We can lock the record in redis

Audience

Redis is single thread, so there is no need to lock

Audience

Why not put the result of streaming and batch service into the same store

Audience

It’s possible. Lambda architecture

Audience

It may be better not to update the cache in real time. This will increase the pressure on the cache (written too frequently)