Ads Logging
Topic: Ads Logging
Interviewer: Harry
Interviewee: pinglu
Level: L5 (Senior)
Mock System Design Interview Summary
Interview Overview
Date: 1/23/2022
Target level: L5
Duration: 45 minutes
Topic covered: Ads Logging
Drawing tool used: whimsical
Requirements
Functional requirements
Generic ads logging system
Advertiser - provide ads - want users to convert
Publisher - display ads
Influence ranking of the ads
Log publisher page information, end user information
Event: display, click, conversion
Non functional requirements
Scalability
Lower latency
High availability
Latency requirement varies
System Design
Interviewer: Confused about the flow
Interviewee: client loads app / web page -> goes to ad selector to inject the ad
Interviewer: is it a push or pull model?
Interviewee: client
Get the page first
Get the ads next
Display the page
Client then sends page information and end user information
Interviewer: How do you find out the user is the same user
Interviewee: if the user is authenticated, the system has the device ID and email.
Interviewer: the ad platform may be a 3rd party platform, separate from the publisher’s platform. E.g. you can see an Amazon ad on the Wall Street Journal.
The Wall Street Journal needs to call the ads platform to get relevant ads
The ads platform may not know who you are. It may serve multiple publishers
Let’s say the publisher is Twitter.
Twitter needs to call an ads platform that is outside of Twitter
Another publisher is the Wall Street Journal
The Wall Street Journal can go to the same ads platform
The user logs into Twitter. Then the user logs into the Wall Street Journal, which may have a different idea of the user.
A: we can ask the publisher to supply the user email to the ads platform. Then ads platform knows who the user is.
Use a Kafka message queue
Q: Do you persist log data?
A: yes we persist for analysis based on hourly, daily, monthly, etc
Q: how do we get data in real time? E.g. for specific ad, want to send how many times the ad has been displayed? Database may become a bottleneck
Q: Let’s say there are 100k QPS for the ad impression. How do we make the system scalable?
A: the message queue can make it scalable.
More design:
Add data warehouse and cache
Q: do we first send count to cache or the database?
A: first put in cache, then flush cache into database.
Q: what if the cache crashes? Then you will lose some data
A: we can recover from the Kafka message queue by replaying from a timestamp
Q: if the stream processor crashes, we may lose data
A: we can retrieve the data in kafka
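The cache-then-flush answer and the replay-on-crash answer can be sketched together. This is a minimal illustration, not the design from the session: a Python list stands in for the Kafka topic, a dict stands in for the database, the replay position is a log offset rather than a timestamp, and the names `CountCache`, `consume` and `flush` are hypothetical.

```python
from collections import Counter


class CountCache:
    """In-memory counts that are periodically flushed to a durable store."""

    def __init__(self, store):
        self.store = store                  # stand-in for the database
        self.pending = Counter()            # counts not yet flushed
        self.next_offset = store["offset"]  # resume where the store left off

    def consume(self, log, upto):
        """Count log entries from next_offset up to (not including) upto."""
        upto = min(upto, len(log))
        for ad_id in log[self.next_offset:upto]:
            self.pending[ad_id] += 1
        self.next_offset = max(self.next_offset, upto)

    def flush(self):
        """Write pending counts and the covered offset to the store."""
        counts = self.store["counts"]
        for ad_id, n in self.pending.items():
            counts[ad_id] = counts.get(ad_id, 0) + n
        self.store["offset"] = self.next_offset
        self.pending.clear()


log = ["ad1", "ad2", "ad1", "ad1", "ad2"]   # stand-in for the Kafka topic
store = {"counts": {}, "offset": 0}

cache = CountCache(store)
cache.consume(log, 3)
cache.flush()            # store now covers offsets 0..2
cache.consume(log, 5)    # two more events counted, never flushed
del cache                # simulated crash: pending counts are lost

recovered = CountCache(store)
recovered.consume(log, len(log))   # replay from the stored offset
recovered.flush()
```

In a real system the count update and the offset update in `flush` would need to be atomic (or the update idempotent), otherwise a crash between the two writes could double-count on replay.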
Q: Give me an example of what you store in the stream processor
What’s the schema for the event DB? How do we make event processor more scalable? What’s the sharding key
A: event db, event type, page ID, ads id, count
Q: how do we shard the event to make it more scalable? What’s the sharding key? What data do you store in the event processor in memory?
A: sharding key can be the ads ID
Event ID, event Type (click, display), pageID, adsID
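The fields named in the answers above could be written down as a small record type. A sketch only: the class and field names are assumptions, the event types come from the functional requirements, and `counter_key` is a hypothetical helper that builds the key of an aggregated row (event type, page ID, ads ID) whose value would be the count.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    DISPLAY = "display"
    CLICK = "click"
    CONVERSION = "conversion"


@dataclass(frozen=True)
class AdEvent:
    """One raw logging event as held by the event processor."""
    event_id: str
    event_type: EventType
    page_id: str
    ad_id: str


def counter_key(event):
    """Key of the aggregated row this event contributes to."""
    return (event.event_type.value, event.page_id, event.ad_id)


example = AdEvent("e1", EventType.CLICK, "page7", "ad42")
```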
Q: How do you guarantee the sequence of the events?
A: the MQ guarantees the sequence
Ads ranking system wants to understand the sequence of the events
There may be 5 different processors to process the event. Then
Interviewer and Audience Feedback
Interviewer:
Provided a workable solution
Want to hear how lead the interviewee
Why use stream processor. First persist data, or first send data to ads ranking
Key problems that interviewer tries to address
How to track user
3rd party cookie can track user
On mobile, you may not have the user
User -> mobile client and web page
Stream vs batch process
30 seconds -> return logging
Some have longer
Stream processing: there are some accuracy problems
How to do stream processing
Where to store
How to handle high throughput
Scalable
Soft skill:
Plural, singular
Sequence: the interviewee tried to argue with the interviewer (that sequence is not important). Try to assume this is a requirement
Interviewee can clarify why we need sequence
Interviewer and interviewee may not be on the same page
If the interviewee does not agree, they should clarify the requirement.
Depends on the experience, then it
When the interviewer provides a hint, they are helping you. Try to accept the solution
If the interviewer makes the same point twice, it is most likely a hint: they want to guide you
At the beginning, the interviewer gives 3 sentences. Interviewee should ask questions instead of directly moving forward. In real situations, you probably don’t know the
Even if you know the system, you should clarify the requirements, which requirement is the highest priority. They may ask something you know, but it may not have the same focus. End-to-end solution - not sufficient in time. Every module may be challenged.
Interviewer may ask how to trace the user. Give a use-case (e.g. Google search). Twitter or Facebook require login. The user data may be different. May want to focus on one use-case.
Ideally should clarify. Different ads platforms use different tracking methods. Google: ads ID, user ID. 3rd party ads: other ways to track the user.
Improve English. Hard to understand for unexpected questions.
Interviewee is brave, more practice
Can continuously double check. The interviewee should try to rephrase the question back to the interviewer and continuously confirm with them.
===
Hard skill
How to identify the user
Types of user
Generate page
Try to find ads, ads ranking - relevant, ingest the page
Web, come from ad or display
Logging information
Page information, user, ad
Need load balancer
We have lots of clients
Validate and filter. Put the logging events into message queue
Kafka, high throughput, real time classification, can ensure correctness
Consumer - put into different database
Real time feedback to ads ranking - ads ranking can change the ads. Can remove the ad from the page.
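The validate-and-filter step in front of the message queue might look like the following. This is a sketch under assumptions: the required field names are illustrative, `is_valid` and `enqueue_valid` are hypothetical helpers, and a `deque` stands in for the message queue.

```python
from collections import deque

REQUIRED_FIELDS = ("event_type", "page_id", "ad_id")   # assumed minimum
KNOWN_TYPES = {"display", "click", "conversion"}


def is_valid(raw):
    """True if the raw event has every required field and a known type."""
    if not all(raw.get(field) for field in REQUIRED_FIELDS):
        return False
    return raw["event_type"] in KNOWN_TYPES


def enqueue_valid(raw_events, queue):
    """Append valid events to the queue; return how many were dropped."""
    dropped = 0
    for raw in raw_events:
        if is_valid(raw):
            queue.append(raw)
        else:
            dropped += 1
    return dropped


queue = deque()   # stand-in for the message queue
dropped = enqueue_valid(
    [
        {"event_type": "click", "page_id": "p1", "ad_id": "a1"},
        {"event_type": "hover", "page_id": "p1", "ad_id": "a1"},  # unknown type
        {"event_type": "display", "page_id": "p2"},               # missing ad_id
    ],
    queue,
)
```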
Interviewer: Why ask about tracking? The publisher and the advertiser may be different companies
Ads logging system. Can I assume the
Creator: Placement. Any publisher can get any placement
Ads logging. Main concern is the logging
Ads delivery and ads logging can be separated
Facebook and Google - they do ads ranking and placement. 3rd party supplied ranking
Logging system - what’s the difference between
What is the information
Key point is how to log user’s behavior
We need to know how the ad is delivered. Need to study
Stream processing vs batch processing
Want to clarify the environment where the ads are used
Publisher and advertiser are different companies
3rd party platform
Wall Street Journal includes scripts from 3rd party
3rd party can record user’s behavior
3rd party platform cannot track the user on the app
User is easy to get
On browser: you can use 3rd party cookie
On mobile: use the device ID. But it is not accurate. A person can change the device ID - try to identify the user
If the user creates an event
How does the event get the
How to track user on ID:
IDFA for Apple, now going obsolete
AdID for Android
Event: pull or push
When the event happens, it is sent to the server
Frontend is the same
Workable solution of the frontend
Ads impression
Lambda architecture
Stream processor: low latency. If we need to return data within 30 seconds, then we need to use in-memory processing
To survive failures, also write asynchronously to fast storage
Batch processor - guarantee correctness, but higher latency
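One way to picture how the two paths combine at query time: the batch path owns everything before some cutoff, the stream path covers events since then, and a query adds the two. A minimal sketch, assuming events are `(timestamp, ad_id)` pairs and a single cutoff; `count_in_range` and `impressions` are hypothetical names, and both views are computed from the same list here only to keep the example self-contained.

```python
def count_in_range(events, start, end):
    """Count events per ad with start <= timestamp < end."""
    counts = {}
    for ts, ad_id in events:
        if start <= ts < end:
            counts[ad_id] = counts.get(ad_id, 0) + 1
    return counts


def impressions(ad_id, batch_view, speed_view):
    """Serving-side answer: exact batch count plus recent stream count."""
    return batch_view.get(ad_id, 0) + speed_view.get(ad_id, 0)


events = [(10, "ad1"), (20, "ad1"), (35, "ad2"), (40, "ad1")]
cutoff = 30   # everything before this has been recomputed by the batch job

batch_view = count_in_range(events, 0, cutoff)              # slow, correct
speed_view = count_in_range(events, cutoff, float("inf"))   # fast, recent
```

When the next batch run finishes, the cutoff moves forward and the stream counts before it can be discarded, which is how the batch path corrects any stream inaccuracy.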
Lambda architecture
Streaming processor: cannot guarantee the sequence of the events.
A group of stream processors. Lots of stream processors. What to use as the sharding key.
Why do we need to guarantee the sequence of the events? Logging is not time sensitive
1-2 seconds latency may not be important?
Sometimes the event sequence matters. First the user clicks on the ad, and then they purchase.
Do we use the client or the server clock? Clocks are not reliable.
On server, there is a clock in the message queue.
Every message has a timestamp.
Ads service, logging service may have timestamp.
Want to honor my timestamp during batch processing
If the user has a bunch of events, the timestamp is generated by the server. May not be the timestamp of the message queue.
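To illustrate why the batch path wants a trustworthy timestamp: if events arrive out of order, sorting by the server-assigned timestamp restores the click-then-purchase sequence. A sketch only, assuming each event is a dict with `user_id`, `ad_id`, `type` and `server_ts` fields (all illustrative names) and a deliberately simple rule that a conversion counts if any earlier click exists for the same user and ad.

```python
def attributed_conversions(events):
    """Return (user_id, ad_id) pairs whose conversion follows a click."""
    clicked = set()
    attributed = []
    # Order by the server-assigned timestamp, not by arrival order.
    for ev in sorted(events, key=lambda e: e["server_ts"]):
        key = (ev["user_id"], ev["ad_id"])
        if ev["type"] == "click":
            clicked.add(key)
        elif ev["type"] == "conversion" and key in clicked:
            attributed.append(key)
    return attributed


arrived = [
    {"user_id": "u1", "ad_id": "a1", "type": "conversion", "server_ts": 20},
    {"user_id": "u1", "ad_id": "a1", "type": "click", "server_ts": 10},
    {"user_id": "u2", "ad_id": "a1", "type": "conversion", "server_ts": 5},
]
result = attributed_conversions(arrived)
```

Here `u1`'s conversion arrived before the click but is still attributed, while `u2`'s conversion has no preceding click and is not.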
Impression calculated by the stream processor
Impression +1 if the user
Sharding key based on ads ID, one stream processor can handle all events for the same ads ID
Stream processor. Shards by ads ID or shard by advertiser ID
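Routing every event for one ads ID to the same stream processor needs a hash that is stable across processes. A minimal sketch, assuming a fixed number of shards; `shard_for` is a hypothetical helper and is not how any particular message queue assigns partitions. Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib


def shard_for(ad_id, num_shards):
    """Map an ads ID to a shard index in [0, num_shards), stably."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(ad_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shards = [shard_for(ad, 8) for ad in ("ad-1", "ad-2", "ad-1")]
```

Sharding by advertiser ID instead would only change the string passed in. The tradeoff is that one very popular ad or advertiser lands entirely on a single shard.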
Router: decouple message queue w/ rest of the log processing, so business components don’t need to directly talk to the MQ
Router may not be needed
Streaming processor and batch processor: do they have the same input?
Possible to be same or different
Different services may be interested in different topics
API design. Ads service, logging service.
1M QPS
Usually - mock: the number is provided by the interviewer
How to design the database?
Cassandra: append only (append my events). Helps when trying to recover.
Just need high performance. No need for relational database
Ads platform has lots of ads and budget limits
Ads platform relies on ads service
Search/ads ranking
Relevance: television -> recall, ads relevance check, then business metrics: impression count, CTR, CVR.
Relevance check may rely on the business
Also targeting requirement (age, geography)
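The business metrics the ranking side reads from the logging system are simple ratios over the logged counts. A small sketch, assuming the common definitions CTR = clicks / impressions and CVR = conversions / clicks (some teams define CVR over impressions instead); `ad_metrics` is a hypothetical helper.

```python
def rate(numerator, denominator):
    """Ratio that returns 0.0 instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0


def ad_metrics(impressions, clicks, conversions):
    """CTR and CVR for one ad from its logged event counts."""
    return {
        "ctr": rate(clicks, impressions),
        "cvr": rate(conversions, clicks),
    }


metrics = ad_metrics(impressions=1000, clicks=50, conversions=5)
```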
Logging system: if the user clicks on the ad, then it will be sent to LB -> ads / logging service
DFS: distributed file system
Log all data
MapReduce job
Primarily for data correction and monitoring
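The data-correction role of the batch job can be sketched as a recount over the full log followed by a comparison with what the stream path reported. This is an illustration in plain Python rather than a real MapReduce framework, and it assumes log lines of the form `event_type,ad_id`; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Emit (ad_id, 1) for every display event in the raw log."""
    for line in log_lines:
        event_type, ad_id = line.split(",")
        if event_type == "display":
            yield ad_id, 1


def reduce_phase(pairs):
    """Sum the emitted values per ad."""
    totals = defaultdict(int)
    for ad_id, value in pairs:
        totals[ad_id] += value
    return dict(totals)


def corrections(batch_counts, stream_counts):
    """Ads whose stream count disagrees with the batch recount."""
    return {
        ad_id: count
        for ad_id, count in batch_counts.items()
        if stream_counts.get(ad_id) != count
    }


log_lines = ["display,ad1", "click,ad1", "display,ad1", "display,ad2"]
batch_counts = reduce_phase(map_phase(log_lines))
fixes = corrections(batch_counts, {"ad1": 2, "ad2": 3})
```

The size of `fixes` over time is also a natural monitoring signal for how far the stream path drifts.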
Tradeoff of sharding key: ads ID vs commercial or other ID?
If the business is to calculate ads metrics directly, no need to use aggregator
Frontend - ads platform,
==
The ads platform and the publisher will log the same set of data
Need to reconcile the two logs
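Reconciling the two logs can be as simple as a set comparison on event IDs. A minimal sketch, assuming both sides record a shared event ID for each event (an assumption, since the notes do not say how records are matched); `reconcile` is a hypothetical helper.

```python
def reconcile(platform_ids, publisher_ids):
    """Report event IDs that appear in only one of the two logs."""
    platform, publisher = set(platform_ids), set(publisher_ids)
    return {
        "missing_at_publisher": sorted(platform - publisher),
        "missing_at_platform": sorted(publisher - platform),
    }


report = reconcile(["e1", "e2", "e3"], ["e2", "e3", "e4"])
```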