Ads Logging · commitway

Topic: Ads Logging

Interviewer: Harry

Interviewee: pinglu

Level: L5 (Senior)

Additional Resources:

Ads impression logging study notes

Topic

Mock System Design Interview Summary

Interview Overview

Date: 1/23/2022

Target level: L5

Duration: 45 minutes

Topic covered: Ads Logging

Drawing tool used: Whimsical

Requirements

Functional requirements

Generic ads logging system

Advertiser - provide ads - want users to convert

Publisher - display ads

Influence ranking of the ads

Log publisher page information, end user information

Event: display, click, conversion

Non-functional requirements

Scalability

Lower latency

High availability

Latency requirement varies

System Design

System design

Interviewer: Confused about the flow

Interviewee: client loads app / web page -> goes to ad selector to inject the ad

Interviewer: is it a push or pull model?

Interviewee: client

Get the page first

Get the ads next

Display the page

Client then sends page information and end user information

Interviewer: How do you find out the user is the same user

Interviewee: if the user is authenticated, the system has the device ID and email.

Interviewer: the ad platform may be a 3rd party platform, separate from the publisher’s platform. E.g. you can see an Amazon ad on the Wall Street Journal.

The Wall Street Journal needs to call the ads platform to get relevant ads

The ads platform may not know who you are. It may serve multiple publishers

Let’s say the publisher is Twitter.

Twitter needs to call an ads platform that is outside of Twitter

Another publisher is the Wall Street Journal

The Wall Street Journal can go to the same ads platform

The user logs into Twitter, then into the Wall Street Journal, which may have a different idea of the user.

A: we can ask the publisher to supply the user email to the ads platform. Then the ads platform knows who the user is.

Use Kafka as the message queue

Q: Do you persist log data?

A: yes, we persist it for hourly, daily, monthly, etc. analysis

Q: how do we get data in real time? E.g. for a specific ad, we want to send how many times the ad has been displayed. The database may become a bottleneck

Q: Let’s say there are 100k QPS of ad impressions. How do we make the system scalable?

A: the message queue can make it scalable.

More design:

Add data warehouse and cache

Q: do we first send count to cache or the database?

A: first put in cache, then flush cache into database.

Q: what if the cache crashes? Then you will lose some data

A: we can recover from the Kafka message queue based on timestamp

Q: if the stream processor crashes, we may lose data

A: we can retrieve the data from Kafka
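
The cache-first flow and the crash recovery discussed above can be sketched as follows. This is a minimal illustration, not the design drawn in the interview: a plain list stands in for the Kafka topic, a dict stands in for the database, and the names `CountingCache`, `replay`, and `checkpoint` are hypothetical. It assumes timestamps are unique and increasing, and that a flush and its checkpoint update succeed together.

```python
from collections import Counter

# Stand-in for a durable, replayable queue: (timestamp, ads_id) records.
log = [(1, "ad-1"), (2, "ad-1"), (3, "ad-2"), (4, "ad-1"), (5, "ad-2")]


class CountingCache:
    """In-memory counts that are periodically flushed to a durable store."""

    def __init__(self, db, checkpoint):
        self.db = db                  # durable counts: ads_id -> int
        self.checkpoint = checkpoint  # durable record of the last flushed timestamp
        self.pending = Counter()      # counts not yet flushed (lost on crash)
        self.last_seen_ts = checkpoint["ts"]

    def consume(self, ts, ads_id):
        self.pending[ads_id] += 1
        self.last_seen_ts = ts

    def flush(self):
        for ads_id, n in self.pending.items():
            self.db[ads_id] = self.db.get(ads_id, 0) + n
        self.checkpoint["ts"] = self.last_seen_ts
        self.pending.clear()


def replay(log, cache):
    """Re-read every record newer than the last flushed timestamp."""
    resume_after = cache.checkpoint["ts"]
    for ts, ads_id in log:
        if ts > resume_after:
            cache.consume(ts, ads_id)


db, checkpoint = {}, {"ts": 0}
cache = CountingCache(db, checkpoint)
for ts, ads_id in log[:3]:
    cache.consume(ts, ads_id)
cache.flush()                          # events 1..3 are now durable
cache.consume(*log[3])                 # event 4 exists only in memory...
cache = CountingCache(db, checkpoint)  # ...and is lost when the cache crashes
replay(log, cache)                     # events 4 and 5 are read again from the log
cache.flush()
```

With a real Kafka consumer, the checkpoint would more commonly be a per-partition offset than a timestamp, since timestamps may collide or arrive out of order.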

Q: Give me an example of what you store in the stream processor

What’s the schema for the event DB? How do we make the event processor more scalable? What’s the sharding key?

A: event DB: event type, page ID, ads ID, count

Q: how do we shard the event to make it more scalable? What’s the sharding key? What data do you store in the event processor in memory?

A: sharding key can be the ads ID

Event ID, event type (click, display), page ID, ads ID
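
To make the two shapes concrete, here is a small sketch of the in-memory event record and of the aggregated event DB row (event type, page ID, ads ID, count). The field names follow the notes; the `AdEvent` class, the `aggregate` helper, and the de-duplication by event ID are assumptions added for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AdEvent:
    event_id: str
    event_type: str  # "display", "click" or "conversion"
    page_id: str
    ads_id: str


def aggregate(events):
    """Collapse raw events into (event_type, page_id, ads_id) -> count rows."""
    counts = Counter()
    seen = set()
    for e in events:
        if e.event_id in seen:  # assumed: a redelivered event is counted once
            continue
        seen.add(e.event_id)
        counts[(e.event_type, e.page_id, e.ads_id)] += 1
    return counts


events = [
    AdEvent("e1", "display", "page-1", "ad-42"),
    AdEvent("e2", "display", "page-1", "ad-42"),
    AdEvent("e2", "display", "page-1", "ad-42"),  # duplicate delivery
    AdEvent("e3", "click", "page-1", "ad-42"),
]
rows = aggregate(events)
```

Each key in `rows` corresponds to one row of the event DB, with the counter value as the count column.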

Q: How do you guarantee the sequence of the events?

A: the MQ guarantees the sequence

Ads ranking system wants to understand the sequence of the events

There may be 5 different processors to process the event. Then

Interviewer and Audience Feedback

Interviewer:

Provided a workable solution

Want to hear how the interviewee leads

Why use a stream processor? Persist the data first, or send the data to ads ranking first?

Key problems that interviewer tries to address

How to track user

3rd party cookie can track user

On mobile, you may not have the user

User -> mobile client and web page

Stream vs batch process

30 seconds -> return logging

Some have longer

Stream processing: there are some accuracy problems

How to do stream processing

Where to store

How to handle high throughput

Scalable

Soft skill:

Plural, singular

Sequence - the interviewee tried to fight the interviewer (arguing that sequence is not important); better to assume this is a requirement

Interviewee can clarify why we need sequence

Interviewer and interviewee may not be on the same page

If the interviewee does not agree, should they clarify the requirement?

Depends on the experience, then it

When the interviewer provides a hint, they are helping you. Try to accept the solution

If the interviewer makes the same point twice, it is most likely a hint. They want to guide you

At the beginning, the interviewer gives 3 sentences. Interviewee should ask questions instead of directly moving forward. In real situations, you probably don’t know the

Even if you know the system, you should clarify the requirements, which requirement is the highest priority. They may ask something you know, but it may not have the same focus. End-to-end solution - not sufficient in time. Every module may be challenged.

Interviewer may ask how to trace the user. Give a use-case (e.g. Google search). Twitter or Facebook require login. The user data may be different. May want to focus on one use-case.

Ideally should clarify. Different ads platforms use different tracking methods. Google: ads ID, user ID. 3rd party ads: other ways to track the user.

Improve English. Hard to understand for unexpected questions.

Interviewee is brave; needs more practice

Can continuously double check. Interviewee should try to rephrase with the interviewer. Continuously confirm with the interviewer.

===

Hard skill

How to identify the user

Types of user

Generate page

Try to find ads, ads ranking - relevant, inject into the page

Web, come from ad or display

Logging information

Page information, user, ad

Need load balancer

We have lots of clients

Validate and filter. Put the logging events into the message queue

Kafka, high throughput, real time classification, can ensure correctness

Consumer - put into different database

Real time feedback to ads ranking - ads ranking can change the ads. Can remove the ad from the page.
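
The validate, queue, and consume steps above can be sketched as follows. This is only an illustration under stated assumptions: `queue.Queue` stands in for Kafka, the required fields and valid event types are taken from the notes, and routing by event type into separate stores is a hypothetical choice (the notes only say the consumer writes to different databases).

```python
import queue

REQUIRED_FIELDS = {"event_type", "page_id", "ads_id"}
VALID_TYPES = {"display", "click", "conversion"}


def is_valid(event):
    """Accept only events with all required fields and a known type."""
    return REQUIRED_FIELDS.issubset(event) and event["event_type"] in VALID_TYPES


def ingest(raw_events, mq):
    """Validate and filter, then enqueue. Returns the number accepted."""
    accepted = 0
    for event in raw_events:
        if is_valid(event):
            mq.put(event)
            accepted += 1
    return accepted


def consume(mq, stores):
    """Drain the queue and route each event to a store keyed by event type."""
    while not mq.empty():
        event = mq.get()
        stores.setdefault(event["event_type"], []).append(event)


raw = [
    {"event_type": "display", "page_id": "p1", "ads_id": "a1"},
    {"event_type": "click", "page_id": "p1", "ads_id": "a1"},
    {"event_type": "display", "page_id": "p2"},                # missing ads_id
    {"event_type": "hover", "page_id": "p1", "ads_id": "a1"},  # unknown type
]
mq = queue.Queue()
accepted = ingest(raw, mq)
stores = {}
consume(mq, stores)
```

Filtering before the queue keeps malformed events out of every downstream consumer, at the cost of putting validation logic on the hot path.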

Interviewer: Why ask about tracking? Publishers and advertisers may be different companies

Ads logging system. Can I assume the

Creator: Placement. Any publisher can get any placement

Ads logging. Main concern is the logging

Ads delivery and ads logging can be separated

Facebook and Google - they do ads ranking and placement. 3rd party supplied ranking

Logging system - what’s the difference between

What is the information

Key point is how to log user’s behavior

We need to know how the ad is delivered. Need to study

Stream processing vs batch processing

Want to clarify the environment where the ads are used

Publisher and advertiser are different companies

3rd party platform

The Wall Street Journal includes scripts from the 3rd party

3rd party can record user’s behavior

3rd party platform cannot track the user on the app

User is easy to get

On browser: you can use 3rd party cookie

On mobile: use the ID of the device. But it is not accurate. A person can change the device ID - try to identify the user

If the user creates an event

How does the event get the

How to track user on ID:

IDFA for Apple, now going obsolete

AdID for Android

Event: pull or push

When the event happens, it’s sent to the server

Frontend is the same

Workable solution of the frontend

Ads impression

Lambda architecture

Stream processor low latency - if we need to return data within 30 seconds, then we need to use in-memory

To survive failures, also write asynchronously to fast storage

Batch processor - guarantees correctness, but higher latency
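
A minimal sketch of how the two layers can be combined when serving a count, assuming the batch layer periodically recomputes exact counts up to a cutoff timestamp and the stream processor only covers events after that cutoff. The names `batch_view`, `SpeedLayer`, and `serve` are hypothetical.

```python
from collections import Counter


def batch_view(master_log, up_to_ts):
    """Batch layer: recompute exact counts from the full log up to a cutoff."""
    return Counter(ads_id for ts, ads_id in master_log if ts <= up_to_ts)


class SpeedLayer:
    """Speed layer: incremental in-memory counts for events after the cutoff."""

    def __init__(self, cutoff_ts):
        self.cutoff_ts = cutoff_ts
        self.counts = Counter()

    def on_event(self, ts, ads_id):
        if ts > self.cutoff_ts:  # older events are already in the batch view
            self.counts[ads_id] += 1


def serve(batch, speed, ads_id):
    """Serving layer: exact historical count plus the recent increment."""
    return batch[ads_id] + speed.counts[ads_id]


master_log = [(1, "ad-1"), (2, "ad-1"), (3, "ad-2")]
batch = batch_view(master_log, up_to_ts=3)

speed = SpeedLayer(cutoff_ts=3)
speed.on_event(4, "ad-1")
speed.on_event(5, "ad-2")
speed.on_event(2, "ad-1")  # late arrival from before the cutoff, ignored here

total_ad1 = serve(batch, speed, "ad-1")
total_ad2 = serve(batch, speed, "ad-2")
```

When the next batch run finishes, the cutoff moves forward and the speed layer's counts are discarded, so any stream-side inaccuracy is bounded to the most recent window.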

Lambda architecture

Streaming processor: cannot guarantee the sequence of the events.

A group of stream processors. Lots of stream processors. What to use as the sharding key?

Why do we need to guarantee the sequence of the events? Logging is not time sensitive

1-2 seconds latency may not be important?

Sometimes the event sequence matters. First the user clicks on the ad, and then they purchase.

Do we use the client or server clock? Clocks are not reliable.

On server, there is a clock in the message queue.

Every message has a timestamp.

Ads service, logging service may have timestamp.

Want to honor my timestamp during batch processing

If the user has a bunch of events, the timestamp is generated by the server. May not be the timestamp of the message queue.
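
The difference between the two timestamps matters for click-then-purchase attribution. The sketch below assumes each event carries both a timestamp assigned by the logging service (`event_ts`) and one assigned when it entered the message queue (`queue_ts`); both field names and the `attributed_conversions` helper are hypothetical. The batch job orders by `event_ts`, not by arrival order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_ts: float  # assigned by the logging service (assumed)
    queue_ts: float  # assigned when the message entered the queue (assumed)
    user_id: str
    ads_id: str
    event_type: str  # "click" or "conversion"


def attributed_conversions(events):
    """Return conversions preceded by a click from the same user on the same ad.

    Events are ordered by event_ts, so arrival order does not matter.
    """
    clicked = set()
    result = []
    for e in sorted(events, key=lambda e: e.event_ts):
        key = (e.user_id, e.ads_id)
        if e.event_type == "click":
            clicked.add(key)
        elif e.event_type == "conversion" and key in clicked:
            result.append(e)
    return result


# The conversion reached the queue before the click that caused it.
arrived = [
    Event(event_ts=10.5, queue_ts=11.0, user_id="u1", ads_id="ad-1", event_type="conversion"),
    Event(event_ts=10.0, queue_ts=11.2, user_id="u1", ads_id="ad-1", event_type="click"),
]
attributed = attributed_conversions(arrived)
```

Sorting the same two events by `queue_ts` would place the conversion first and miss the attribution.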

Impression calculated by the stream processor

Impression +1 if the user

Sharding key based on ads ID, one stream processor can handle all events for the same ads ID

Stream processor. Shards by ads ID or shard by advertiser ID
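
A small sketch of routing by sharding key, so that one stream processor sees every event for a given ads ID. The `StreamProcessor` class and `pick_processor` helper are hypothetical; the count of 5 processors echoes the number mentioned earlier in the discussion. A digest is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib
from collections import Counter


class StreamProcessor:
    """Keeps in-memory impression counts for the ads routed to it."""

    def __init__(self):
        self.impressions = Counter()

    def handle(self, event_type, ads_id):
        if event_type == "display":
            self.impressions[ads_id] += 1


def pick_processor(shard_key, num_processors):
    """Map a sharding key to a processor index with a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processors


processors = [StreamProcessor() for _ in range(5)]
stream = [
    ("display", "ad-1"),
    ("click", "ad-1"),
    ("display", "ad-2"),
    ("display", "ad-1"),
]
for event_type, ads_id in stream:
    index = pick_processor(ads_id, len(processors))
    processors[index].handle(event_type, ads_id)

# Exactly one processor owns the counts for ad-1.
owners = [i for i, p in enumerate(processors) if "ad-1" in p.impressions]
```

Passing an advertiser ID as `shard_key` instead would keep all of an advertiser's ads on one processor, which may simplify budget checks but risks a hot shard for large advertisers. Keyed routing also matters for ordering, since Kafka preserves order only within a partition.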

Router: decouple message queue w/ rest of the log processing, so business components don’t need to directly talk to the MQ

Router may not be needed

Streaming processor and batch processor: do they have the same input?

Possible to be same or different

Different services may be interested in different topics

API design. Ads service, logging service.

1M QPS

Usually - mock: the number is provided by the interviewer

How to design the database?

Cassandra: append only (append my events). When try to recover.

Just need high performance. No need for relational database

Ads platform has lots of ads and budget limits

Ads platform relies on ads service

Search/ads ranking

Relevance: television -> recall, ads relevance check, then business metrics, impression count, CTR, CVR. impression count

Relevance check may rely on the business

Also targeting requirement (age, geography)
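
For reference, the business metrics named above can be derived from the logged counts. The sketch uses one common definition (CTR as clicks per impression, CVR as conversions per click); other definitions of CVR exist, and the function names are hypothetical.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks per impression (0.0 if no impressions)."""
    return clicks / impressions if impressions else 0.0


def cvr(conversions, clicks):
    """Conversion rate: conversions per click (0.0 if no clicks)."""
    return conversions / clicks if clicks else 0.0


ad_ctr = ctr(clicks=5, impressions=100)
ad_cvr = cvr(conversions=1, clicks=5)
```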

Logging system: if the user clicks on the ad, then it will be sent to LB -> ads / logging service

DFS: distributed file system

Log all data

MapReduce job

Primarily for data correction and monitoring

Tradeoff of sharding key: ads ID vs commercial or other ID?

If the business is to calculate ads metrics directly, no need to use aggregator

Frontend - ads platform,

Ads platform, publisher will log the same set of data

Need to reconcile the two
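
A reconciliation pass between the two logs could look like the sketch below. It assumes each side can produce a list of ads IDs, one entry per logged impression; the `reconcile` helper and the sample data are hypothetical.

```python
from collections import Counter


def reconcile(publisher_events, platform_events):
    """Compare per-ad impression counts from two independent logs.

    Returns {ads_id: (publisher_count, platform_count)} for mismatches only.
    """
    pub = Counter(publisher_events)
    plat = Counter(platform_events)
    return {
        ads_id: (pub[ads_id], plat[ads_id])
        for ads_id in sorted(pub.keys() | plat.keys())
        if pub[ads_id] != plat[ads_id]
    }


publisher_log = ["ad-1", "ad-1", "ad-2", "ad-3"]
platform_log = ["ad-1", "ad-1", "ad-2", "ad-2"]
diff = reconcile(publisher_log, platform_log)
```

A job like this fits the batch side of the design, where the full logs are available and latency is not a concern.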
