Spam and Abuse Machine Learning System

System Design, Machine Learning

Topic: Spam and Abuse Machine Learning System

Interviewer: Cathy

Interviewee: Estella

Level: L5 (Senior)

Mock System Design Interview Summary

Interview Overview

Date: 2/20/2022

Target level: L5

Duration: 45 minutes

Topic covered: Spam and Abuse ML System

Drawing tool used: ?

Key problem vote: ?

Requirements

Functional requirements

Text-based ML algorithm to detect spam and abuse

Focus on text

Action: block people from seeing the tweet (abuse and spam content); label/tag it (abuse), from scratch

Ask users to report bad tweets

Social network

Scope: binary classification (abuse or not)

Do we have professional annotators?

Quick solution for the first iteration: 1-2 annotators

Objective: ML system to detect abusive tweets and prevent people from seeing them.

Non-functional requirements

10,000 reports regarding abuse

100k tweets per month

System Design

Data collection:

10,000 abuse reports, used as the negative class

2 ways to get more data

  1. Reports + annotators (10k true negative class + 10k positive class, a 50/50 split) => 20k

How to get the positive class? Ask annotators to annotate good tweets

Traditional ML such as logistic regression / SVM / tree-based model

Logistic regression:

Easy to scale

Explainable

Outputs probability

Run model for 6 months => 600k

  2. Reports + crowdsourcing + annotators => true positive class and true negative class

Crowdsourcing quality is harder to manage; recommend option 1

Feature selection:

User-profile-based features: has this account been reported in the past? Number of followers; number of accounts followed; number of favorites; account age on Twitter; previous retweet-to-tweet ratio; number of comments

Tweet-content-based features (text): number of words; number of digits; number of special/illegal characters; the content of the tweet (normalization: lowercase; remove punctuation; digit => word; embedding)

Feature selection on non-text features: use tree models to generate feature importances; use stepwise selection to pick the important features
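
The tweet content features listed above could be computed roughly as follows. This is a minimal sketch, not the candidate's implementation: the function names and the definition of a "special" character (anything that is neither alphanumeric nor whitespace) are assumptions, and the digit => word and embedding steps are left out.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the tweet and strip ASCII punctuation."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

def content_features(text: str) -> dict:
    """Count-based features computed on the raw (un-normalized) tweet text."""
    return {
        "num_words": len(text.split()),
        "num_digits": sum(ch.isdigit() for ch in text),
        # Assumption: "special" = not alphanumeric and not whitespace.
        "num_special": sum(not ch.isalnum() and not ch.isspace() for ch in text),
    }

tweet = "WIN $1000 now!!! Click here"
features = content_features(tweet)
normalized = normalize(tweet)
```

The counts are taken before normalization on purpose, since stripping punctuation would erase the special-character signal.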

Interviewer: Non numeric features?

Logistic regression

Take the 1,000 most common words as the vocabulary and encode the tweet “this is a good movie” with one-hot encoding

Word indices: [9, 34, 55, 199, 40]

A one-hot vector looks like [0, 0, 0, 1, 0], but it may be too sparse

500 OOV (out-of-vocabulary) buckets
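
One way to read the vocabulary-plus-OOV-buckets idea is sketched below. The toy vocabulary reuses the indices from the example above; hashing unknown words with `zlib.crc32` and placing the buckets right after the vocabulary range are assumptions made for illustration.

```python
import zlib

VOCAB_SIZE = 1000       # most common words, from the notes
NUM_OOV_BUCKETS = 500   # out-of-vocabulary buckets, from the notes

# Hypothetical toy vocabulary; a real one would hold the 1,000 most common words.
vocab = {"this": 9, "is": 34, "a": 55, "good": 199, "movie": 40}

def word_to_id(word: str) -> int:
    """Return the vocabulary index, or a hashed OOV bucket id past the vocabulary."""
    if word in vocab:
        return vocab[word]
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return VOCAB_SIZE + zlib.crc32(word.encode("utf-8")) % NUM_OOV_BUCKETS

def encode(text: str) -> list:
    """Map a tweet to a list of integer ids, one per whitespace-separated word."""
    return [word_to_id(w) for w in text.lower().split()]

ids = encode("this is a good movie")
oov_id = word_to_id("zzzspamword")
```

Integer ids like these can feed an embedding lookup directly, which avoids materializing sparse one-hot vectors.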

Algorithm

Supervised learning. Use logistic regression.

Cross entropy loss function

Macro F1 score to select top 20 models

Precision of the negative class to select the best model
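
To make the logistic regression and cross-entropy choice concrete, here is a small pure-Python sketch trained with stochastic gradient descent. The two features and the toy data are hypothetical, and in this sketch the label 1 simply marks the reported (abusive) tweets.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of label 1 under the logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def train(X, y, lr=0.5, epochs=200):
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # For sigmoid + cross-entropy, the gradient w.r.t. the logit is (p - y).
            error = predict_proba(weights, bias, xi) - yi
            weights = [w - lr * error * v for w, v in zip(weights, xi)]
            bias -= lr * error
    return weights, bias

# Hypothetical toy data: [has_prior_report, share_of_special_characters]
X = [[1, 0.8], [1, 0.6], [0, 0.1], [0, 0.0]]
y = [1, 1, 0, 0]
weights, bias = train(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
loss = cross_entropy(y, probs)
```

The learned weights are directly inspectable, which is the explainability benefit mentioned earlier in the notes.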

Training and testing: 70% / 30% split

How to split training and testing data?

Random split, time split, or based on category
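
Two of the split strategies mentioned above, random and time-based, might look like this. It is a sketch under assumptions: rows are dictionaries, and the `created_at` field name is hypothetical.

```python
import random

def random_split(rows, train_frac=0.7, seed=0):
    """Shuffle with a fixed seed, then cut at the training fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def time_split(rows, train_frac=0.7, key="created_at"):
    """Train on the oldest rows and test on the newest, to mimic deployment."""
    rows = sorted(rows, key=lambda r: r[key])
    cut = round(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

rows = [{"id": i, "created_at": i} for i in range(10)]
train_random, test_random = random_split(rows)
train_time, test_time = time_split(rows)
```

A time split avoids leaking future tweets into training, at the cost of a test set that may differ in distribution from the training set.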

Logistic regression can handle smaller amounts of data

If data size increases, how to scale?

600k labeled data points

People generally use neural models

CNN / RNN / Transformer

CNN: fast and can run in parallel. The kernel size models the relationship between adjacent words, e.g. a kernel size of 2 captures the relation between 2 words. 140 words. CNN is not recommended.

RNN: processes tokens in sequence, so it cannot run in parallel; use LSTM / GRU for long dependencies.

Transformer: does not process tokens sequentially, so training can run in parallel on different GPUs.

We need low latency and high accuracy, so it’s important to run in parallel.

Positional embedding + attention handle long dependencies

BERT model:

Pre-trained BERT, encoder

Fine-tuning: use the output from the pre-trained part to build a classifier

It is bidirectional: it reads context from both the left and right side of each token

Load a pre-trained BERT + [CLS] and [PAD] tokens => token outputs from the BERT model => concatenation layer to add non-text features => ReLU layers => sigmoid output layer
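
The classifier head in that pipeline (concatenate, ReLU, sigmoid) can be sketched without any deep learning library. Everything here is illustrative: the [CLS] vector is random stand-in data rather than real BERT output, the weights are untrained, and the tiny dimensions are chosen for readability (BERT-base uses 768-dimensional hidden states).

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(cls_embedding, non_text_features, params):
    x = cls_embedding + non_text_features            # concatenation layer
    h = relu(dense(x, params["w1"], params["b1"]))   # ReLU layer
    logit = dense(h, params["w2"], params["b2"])[0]  # single output unit
    return sigmoid(logit)                            # probability-like score

rng = random.Random(0)
EMB_DIM, NUM_FEATS, HIDDEN = 8, 3, 4  # toy sizes, assumptions for this sketch
params = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(EMB_DIM + NUM_FEATS)] for _ in range(HIDDEN)],
    "b1": [0.0] * HIDDEN,
    "w2": [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]],
    "b2": [0.0],
}
cls_embedding = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]  # stand-in for BERT output
non_text = [1.0, 0.25, 0.5]  # hypothetical scaled account features
score = classify(cls_embedding, non_text, params)
```

In practice the non-text features would be scaled to a range comparable to the embedding values before concatenation.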

Model evaluation:

Offline metrics: macro F1 score (the average of the F1 scores across all classes); precision of the negative class.
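
Both offline metrics can be computed from per-class counts, as in this sketch. The labels are toy data, and following the notes, "abuse" is treated as the negative class.

```python
def precision_recall_f1(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, treating it as the target class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)

y_true = ["abuse", "abuse", "abuse", "ok", "ok", "ok", "ok", "ok"]
y_pred = ["abuse", "abuse", "ok", "ok", "ok", "ok", "ok", "abuse"]

abuse_precision = precision_recall_f1(y_true, y_pred, "abuse")[0]
score = macro_f1(y_true, y_pred)
```

Macro averaging weights both classes equally, so a model cannot score well by only getting the majority class right.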

Online metrics: people can report tweets. The goal is to retain all customers. Overall metric: customer retention rate. Submetrics:

Report rate

Session time

Counter metric

Average session time. More customers and less time

Model deployment:

Model: optimization (quantization), so computation can be done in integers instead of floats. Parallelize on GPU (because we are using BERT). Multi-stage models.

How often will we update the model?

batching/online learning

Batching: we run training when we have enough data, then update our parameters. It is easier to find a “global” / better solution, but we need to tolerate some delay and may need to keep using the old model.

Online learning: update the model continuously from streaming data. The system is very complex to set up.

It depends on the number of tweets and the required detection accuracy, and on business goals.

First we will use batch learning, which is easy to set up. Use logistic regression for the first 6 months.

Test: data test to detect data drift; A/B testing to monitor online metrics
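
The notes do not say which drift test was meant. One common choice is the population stability index (PSI), sketched here for a single numeric feature; the bin count, the epsilon floor, and the tweet-length example are assumptions.

```python
import math

def psi(expected, actual, bins=5, eps=1e-6):
    """Population stability index between a reference and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_lengths = [5, 8, 10, 12, 15, 18, 20, 22, 25, 30]  # hypothetical tweet lengths
same = psi(train_lengths, train_lengths)
shifted = psi(train_lengths, [v + 15 for v in train_lengths])
```

A PSI near zero means the new data is distributed like the training data; larger values suggest the feature has drifted and retraining may be needed.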

Interviewer and Audience Feedback

Interviewer:

Technical content is good

Narrowed down the project at the beginning

In the middle it’s hard for the interviewer to give more feedback

I planned to ask about

embedding

Offline metrics: why is precision important?

CNN/RNN/Transformer pros and cons. But didn’t ask whether these tradeoffs are important

Can give a high level overview - then ask for which part to dive in

Adjusting

Data collection -> algorithm -> metrics -> feature selection

Feature selection usually costs a lot of time

Interviewee:

Can pause in the middle

Feature selection may not be the right sequencing

Non text layer

Interviewer:

Algorithm may not be as detailed. BERT is very common. Then we can do the metrics and feature selection

Don’t need to worry about the concatenation

Interviewee:

What is usually asked in feature selection?

Can give a high level summary first: user profile, content, cross-content

Then drill down

Interviewer:

Drilled down into too much detail: BERT

The key problem is that you don’t know which key points are being tested

You can say the best model first, then add other options. Then we can dive deep into which one.

We rarely see SVM.

===

Audience:

We did not go deep into algorithm

ML engineer

Try to do a bit more high level.

Last year interviewed ML system design

The process is very similar

Got a passing score

Close to case study

More high level

More about experience in this area

Lots of details

Writing takes a lot of time

Can borrow from system design

A workflow diagram

½ online: tweet -> application server filter -> label it as pass or not pass

How does an application server do this?

Take feature

Feature store: Latency, scalable

How to build feature store, online and offline

After you discuss online, you can map steps to offline
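
The online flow described above (tweet -> application server filter -> pass / not pass) could be sketched like this. The in-memory dictionary stands in for an online feature store, and the scoring rule and threshold are placeholders, not a trained model.

```python
# Hypothetical in-memory stand-in for an online feature store keyed by user id.
FEATURE_STORE = {
    "u1": {"prior_reports": 3},
    "u2": {"prior_reports": 0},
}

BLOCK_THRESHOLD = 0.5  # assumed cut-off; in practice tuned on offline metrics

def score(tweet_text, user_features):
    """Placeholder scoring rule standing in for the deployed model."""
    s = 0.2 * user_features.get("prior_reports", 0)
    s += 0.3 * tweet_text.count("!")
    return min(s, 1.0)

def filter_tweet(user_id, tweet_text):
    """Application-server step: fetch features, score the tweet, label it."""
    user_features = FEATURE_STORE.get(user_id, {})  # low-latency lookup
    if score(tweet_text, user_features) >= BLOCK_THRESHOLD:
        return "not pass"
    return "pass"

blocked = filter_tweet("u1", "buy now!!!")
allowed = filter_tweet("u2", "nice movie")
unknown_user = filter_tweet("u404", "hello")
```

The offline path would compute the same features in batch so that training and serving stay consistent.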

Another flow: a loop

Other points worth discussion

data consistency, data shift

Data collection bias. We have met similar problems in our work experience

Single point of failure.

Data overfitting.

What is the backup plan? If BERT fails, is there a backup solution?
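
A backup plan can be as simple as falling back to a cheaper model when the primary one fails. In this sketch both models are hypothetical callables, and the failure is simulated with a timeout error.

```python
def predict_with_fallback(primary, backup, tweet_text):
    """Try the primary model; on any failure, serve the backup model instead."""
    try:
        return primary(tweet_text), "primary"
    except Exception:
        # Real systems would also log the failure and raise an alert here.
        return backup(tweet_text), "backup"

def broken_bert(tweet_text):
    """Hypothetical primary model that is currently unavailable."""
    raise TimeoutError("model server did not respond")

def keyword_baseline(tweet_text):
    """Hypothetical backup: a trivial keyword rule."""
    return 1.0 if "spam" in tweet_text.lower() else 0.0

result, served_by = predict_with_fallback(broken_bert, keyword_baseline, "Cheap SPAM offer")
ok_result, ok_served_by = predict_with_fallback(keyword_baseline, keyword_baseline, "hello")
```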

We probably should spend more time on high level

recall/precision.

Recall high -> lots of false positives

Precision high -> lots of lawsuits

You may need to swap in a different model when encountering lawsuits

Deploy:

Centralized, or deploy to client?

Centralized: can update quickly. If the user is offline, they cannot use your service

===

Audience

Are these points for senior level?

Audience answer:

E4/L4: tested on key knowledge and tradeoffs

E5: more system level

Difficult to type everything

===

High level design

Can draw and discuss at the same time

===

audience:

Interviewer may want to control the pacing

Feature selection

Interviewer should decide the priority of the answers

===

BERT feature model:

What features do we use?

We have numeric features and language features

Concatenate BERT with the model of the numeric features

Do we treat BERT as part of the big model, or freeze BERT and train the rest?

NLP: look at word relations

Combination is too big when the length

We can add a few more layers to handle more words

===

Audience

CNN - is it similar to N-gram?

Yes. 2 word kernel, then next layer x2

2x2x2x2 can expand to many words
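
The doubling described above holds when each layer also has stride 2 (or an equivalent dilation), which is an assumption about what was meant. The receptive field, meaning how many input words one output position sees, can be computed layer by layer:

```python
def receptive_field(num_layers, kernel_size=2, stride=2):
    """Number of input tokens seen by one output position after stacked 1-D convolutions."""
    rf, jump = 1, 1  # jump = distance in input tokens between adjacent outputs
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

doubling = [receptive_field(n, kernel_size=2, stride=2) for n in range(1, 5)]
linear = [receptive_field(n, kernel_size=2, stride=1) for n in range(1, 5)]
```

With stride 2 the span grows 2, 4, 8, 16 over four layers; with stride 1 it only grows by one word per layer, so many more layers are needed to cover a whole tweet.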

Deployment

Why do we need parallel on GPU?

BERT - attention time. Parallel can speed up the computation

Do we need GPU for inference?

CPU - sequence model

BERT - can put the whole model into BERT

Non text concatenate

Naive Bayes as the baseline

Offline: first build a Naive Bayes model

If recall is important, e.g. 60%

Online inference - GPU is important

CPU serving is sufficient

GPU can be used for model training

During inference, we don’t need GPU any more

Can GPU significantly reduce latency? Yes. But too resource intensive. In industry we usually use CPU inference.

Depends on complexity and how big the data, e.g. real time processing for image

GPU:

Complex model, more data,

How to do it fast:

TF-IDF, then word2vec

If we train embedding through BERT, then online we need to use BERT again.

Quantization:

Why?

Bitwise: 32 bits -> change to 8 bits

Float -> integer

You can speed up, but you sacrifice precision
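
The float -> 8-bit integer idea can be illustrated with simple affine quantization of a list of weights. This is a sketch of the general technique, not any specific framework's scheme; the weight values are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using a scale and a zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # float step per integer level
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    q = [min(max(round(v / scale) + zero_point, qmin), qmax) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.5, 0.0, 0.37, 1.0]  # hypothetical float32 weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip is lossy: each weight comes back within about one quantization step of its original value, which is the precision sacrificed for smaller, faster integer arithmetic.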

Optimization:

can do it during training

can also do it during serving

Some possible options: mixed precision (e.g. 16-bit and 32-bit floats)

Reduce latency

Whatever precision we use during training, we use the same precision during deployment

During deployment we can compress the model

Why don’t we recommend LSTM?

Because it’s hard to run in parallel

The main issue is that it doesn’t solve long-distance relations

Model deployment

Online, offline, late processing

Device processing

Where do I learn it from?

We may get these from work experience. If we hit this situation, we can use a backup model

AB testing has a backup plan

When we draw the diagram, we can talk about the backup plan

Machine learning system design

Online training vs AB testing

How do I do AB testing when I do online training?

AB testing is not for updating model

We usually have a backup plan

When we add a feature, we will need to do AB testing

Data shift, quality.

Most of the time people don’t use online training.

Sometimes online training is for model requiring complex/slow computation