Spam and Abuse Machine Learning System
Topic: Spam and Abuse Machine Learning System
Interviewer: Cathy
Interviewee: Estella
Level: L5 (Senior)
Mock System Design Interview Summary
Interview Overview
Date: 2/20/2022
Target level: L5
Duration: 45 minutes
Topic covered: Spam and Abuse ML System
Drawing tool used: ?
Key problem vote: ?
Requirements
Functional requirements
Text-based ML algorithm to detect spam and abuse
Focus on text
Action: block people from seeing the tweet (abuse and spam content); label it with a tag (abuse); build from scratch
Ask users to report bad tweets
Social network
Scope: binary classification (abuse vs. not abuse)
Do we have professional annotators?
Quick solution for the first iteration: 1-2 annotators
Objective: ML system to detect abuse tweet to prevent people from seeing it.
Non functional requirements
10,000 reports regarding abuse
100k tweets per month
System Design
Data collection:
10,000 reports regarding abuse => negative class
2 ways to get more data
- Reports + annotators (10k true negative class + 10k positive class, 50% each) => 20k
How to get the positive class? Ask annotators to label good tweets
Traditional ML such as logistic regression / SVM / tree-based model
Logistic regression:
Easy to scale
Explainable
Outputs probability
Run the model for 6 months => 600k tweets (100k per month)
- Reports + crowdsourcing + annotators => true positive class and true negative class
Crowdsourcing quality is harder to manage, so the first option (reports + annotators) is recommended
Feature selection:
User profile based features: has this account been reported in the past? Number of followers; number of accounts followed; number of favorites; account age on Twitter; previous retweet-to-tweet ratio; number of comments
Tweet content based features (text): number of words; number of digits; number of special/illegal characters; the content of the tweet (normalization: lowercase; remove punctuation; digit => word; embedding)
Feature selection on non-text features: use tree models to generate feature importance; use stepwise selection to pick the important features
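The content-based features listed above can be sketched in a few lines. This is a minimal illustration, not the interviewee's implementation: the function names and the definition of "special character" (anything that is not alphanumeric or whitespace) are assumptions, and the digit => word step is left out.

```python
import string

def normalize(text):
    """Lowercase and strip ASCII punctuation (digit => word mapping omitted)."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

def content_features(text):
    """Count-based features computed on the raw tweet text."""
    return {
        "num_words": len(text.split()),
        "num_digits": sum(ch.isdigit() for ch in text),
        # Assumption: "special" means not alphanumeric and not whitespace.
        "num_special": sum(not (ch.isalnum() or ch.isspace()) for ch in text),
    }

tweet = "WIN $1000 now!!! Click here"
print(content_features(tweet))  # {'num_words': 5, 'num_digits': 4, 'num_special': 4}
print(normalize(tweet))         # win 1000 now click here
```

The counts are taken on the raw text because normalization removes exactly the characters these features are trying to measure.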
Interviewer: Non numeric features?
…
Logistic regression
Vocabulary: the 1,000 most common words. The tweet “this is a good movie” is mapped to word indices before one-hot encoding:
[9, 34, 55, 199, 40]
A one-hot vector looks like this: [0, 0, 0, 1, 0], but it may be too sparse
500 OOV (out-of-vocabulary) buckets
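A small sketch of the encoding discussed above, assuming a toy vocabulary whose indices match the example list and a hash function (crc32, chosen here only because it is deterministic) to assign unknown words to the 500 OOV buckets. The names token_id, encode and multi_hot are illustrative.

```python
import zlib

VOCAB_SIZE = 1000      # most common words, per the notes
NUM_OOV_BUCKETS = 500  # out-of-vocabulary buckets, per the notes

# Toy vocabulary (assumed mapping); a real one holds the 1,000 most frequent words.
vocab = {"this": 9, "is": 34, "a": 55, "good": 199, "movie": 40}

def token_id(word):
    """Return the vocabulary index, or a hashed OOV bucket index."""
    if word in vocab:
        return vocab[word]
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return VOCAB_SIZE + zlib.crc32(word.encode("utf-8")) % NUM_OOV_BUCKETS

def encode(text):
    return [token_id(w) for w in text.lower().split()]

def multi_hot(ids):
    """Bag-of-words vector of length VOCAB_SIZE + NUM_OOV_BUCKETS."""
    vec = [0] * (VOCAB_SIZE + NUM_OOV_BUCKETS)
    for i in ids:
        vec[i] = 1
    return vec

ids = encode("this is a good movie")
print(ids)                  # [9, 34, 55, 199, 40]
print(sum(multi_hot(ids)))  # 5 non-zero entries out of 1,500
```

Only 5 of 1,500 positions are set, which is the sparsity concern raised in the notes and one reason to move to dense embeddings.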
Algorithm
Supervised learning. Use logistic regression.
Cross entropy loss function
Macro F1 score to select top 20 models
Precision of the negative class to select the best model
Training / testing split: 70% / 30%
How to split training and testing data?
Random split, time split, or based on category
Logistic regression can handle a smaller amount of data
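To make the logistic regression plus cross-entropy choice concrete, here is a minimal from-scratch sketch trained with stochastic gradient descent. The two features, the toy data, the learning rate and the label encoding (1 marks the abusive tweets, which the notes call the negative class) are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def train(X, y, lr=0.5, epochs=200):
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label  # gradient of the loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Toy rows: [reported_before, special_char_ratio]; label 1 = abusive (assumed encoding).
X = [[1, 0.9], [1, 0.7], [0, 0.1], [0, 0.2]]
y = [1, 1, 0, 0]
weights, bias = train(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
loss = cross_entropy(y, probs)
print(weights, bias, loss)
```

The learned weights can be read directly as feature contributions, which is the explainability advantage mentioned earlier in the notes.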
If data size increases, how to scale?
600k data points labeled
People generally use neural model
CNN / RNN / Transformer
CNN: fast and can run in parallel. The kernel size models the relationship between neighboring words, e.g. a kernel size of 2 relates 2 words. Input length: 140 words. CNN is not recommended
RNN: processes tokens in sequence, so it cannot run in parallel; LSTM / GRU for long dependencies
Transformer: no sequential processing, so we can run the training in parallel on different GPUs.
We need low latency and high accuracy, so it’s important to run in parallel.
Positional embedding + attention handle long dependencies
BERT model:
Pre-training: BERT encoder
Fine-tuning: use the output from the pre-training part to build a classifier
It is bidirectional: it reads context from both the left and the right side of each token
Load a pre-trained BERT + [CLS] and [PAD] tokens => token outputs from the BERT model => concatenation layer to add non-text features => ReLU layers => sigmoid output layer
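The classification head described above (concatenation => ReLU => sigmoid) can be sketched without any deep learning library. Everything here is a stand-in: the random vector replaces the real BERT output, the dimensions are shrunk for readability (BERT-base produces 768-dimensional vectors), and the weights are random rather than trained.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def head_forward(cls_vec, non_text, W1, b1, w_out, b_out):
    joined = list(cls_vec) + list(non_text)  # concatenation layer
    hidden = relu(dense(joined, W1, b1))     # ReLU layer
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

random.seed(0)
CLS_DIM, NUM_FEATS, HIDDEN = 8, 3, 4  # toy sizes, assumed for illustration
cls_vec = [random.uniform(-1, 1) for _ in range(CLS_DIM)]  # stand-in for the BERT output
non_text = [0.2, 0.05, 1.0]  # hypothetical scaled profile features
W1 = [[random.uniform(-0.5, 0.5) for _ in range(CLS_DIM + NUM_FEATS)]
      for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
score = head_forward(cls_vec, non_text, W1, b1, w_out, 0.0)
print(score)  # a probability-like value between 0 and 1
```

Non-text features should be scaled to a range comparable to the text vector before concatenation, otherwise large raw counts such as follower numbers can dominate the first dense layer.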
Model evaluation:
Offline metrics: macro F1 score (the average of the F1 scores across all classes); precision of the negative class.
Online metrics: people can report tweets. The goal is to keep customers returning. Overall metric: customer retention rate. Submetrics:
Report rate
Session time
Counter metric
Average session time. More customers and less time
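The two offline metrics named above, macro F1 and per-class precision, can be computed as follows. The labels and predictions are made-up toy data; "abuse" plays the role of what the notes call the negative class.

```python
def precision_recall_f1(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)

y_true = ["abuse", "abuse", "abuse", "ok", "ok", "ok", "ok", "ok"]
y_pred = ["abuse", "abuse", "ok", "ok", "ok", "ok", "ok", "abuse"]

abuse_precision = precision_recall_f1(y_true, y_pred, "abuse")[0]
print(abuse_precision)           # 2 of 3 tweets flagged as abuse are truly abusive
print(macro_f1(y_true, y_pred))  # mean of F1(abuse) = 2/3 and F1(ok) = 4/5
```

Macro F1 weights both classes equally regardless of their size, which matters here because abusive tweets are a minority of the traffic.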
Model deployment:
Model: optimization (quantization). Computation can be done in integers instead of floats. Parallelize on GPU (because we are using BERT). Multi-stage models.
How often will we update the model?
Batch learning / online learning
Batch learning: we run training when we have enough data, then update our parameters. Easier to find a “global” / better solution, but we need to tolerate some delay and may need to keep using the old model in the meantime
Online learning: update the model continuously from a stream of data. The system is very complex to set up.
The choice depends on tweet volume, the required detection accuracy, and business goals.
First we will use batch learning, which is easy to set up. Use logistic regression for the first 6 months.
Testing: data tests to detect data drift; A/B testing to monitor online metrics
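The notes do not say which data drift test was meant. One common option is the Population Stability Index (PSI), sketched here as an assumption; the feature (words per tweet), the bin edges and the sample values are invented for illustration.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_lengths = [5, 8, 12, 15, 20, 22, 25, 30]      # words per tweet at training time
same_lengths = list(train_lengths)                  # serving data with no drift
shifted_lengths = [25, 28, 30, 32, 35, 38, 40, 45]  # serving data after a shift
edges = [10, 20, 30]

print(psi(train_lengths, same_lengths, edges))     # 0.0
print(psi(train_lengths, shifted_lengths, edges))  # clearly above 0
```

A PSI of zero means identical binned distributions; how large a value should trigger retraining is a judgment call that depends on the feature and the business goal.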
Interviewer and Audience Feedback
Interviewer:
Technical content is good
Narrowed down the project at the beginning
In the middle it’s hard for the interviewer to give more feedback
I planned to ask about
embedding
Offline metrics: why precision is important
CNN/RNN/Transformer pros and cons. But didn’t ask whether these tradeoffs are important
Can give a high level overview - then ask for which part to dive in
Adjusting
Data collection -> algorithm -> metrics -> feature selection
Feature selection usually costs a lot of time
Interviewee:
Can pause in the middle
Feature selection may not be the right sequencing
Non text layer
Interviewer:
Algorithm may not be as detailed. BERT is very common. Then we can do the metrics and feature selection
Don’t need to worry about the concatenation layer
Interviewee:
What is usually asked about feature selection?
Can give a high level summary first: user profile, content, cross-content
Then drill down
Interviewer:
Drilled down into too much detail on BERT
The key problem is that you don’t know which points the interviewer is testing
You can say the best model first, then add other options. Then we can dive deep into which one.
We rarely see SVM.
===
Audience:
We did not go deep into algorithm
ML engineer
Try to do a bit more high level.
Interviewed for ML system design last year
The process is very similar
Got a passing score
Close to case study
More high level
More about experience in this area
Lots of details
Writing takes a lot of time
Can borrow from system design
A workflow diagram
½ online, tweet -> application server filter -> label it can pass or not pass
How does an application server do this?
Take feature
Feature store: latency, scalability
How to build feature store, online and offline
After you discuss online, you can map steps to offline
Another flow: a loop
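The online path described above (tweet -> application server filter -> pass or not pass, with features taken from a feature store) can be sketched as below. The in-memory dictionary, the rule-based scorer and the 0.5 threshold are hypothetical placeholders for a real feature store, a trained model and a tuned cut-off.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for an online feature store keyed by user ID.
FEATURE_STORE = {
    "user_1": {"reported_before": 1, "followers": 3},
    "user_2": {"reported_before": 0, "followers": 1200},
}
DEFAULT_FEATURES = {"reported_before": 0, "followers": 0}  # fallback for unseen users

@dataclass
class Decision:
    tweet_id: str
    score: float
    passed: bool

def score_tweet(text, user_features):
    """Placeholder scorer; a trained model would replace these rules."""
    score = 0.1
    if user_features["reported_before"]:
        score += 0.5
    if "!!!" in text:
        score += 0.3
    return min(score, 1.0)

def filter_tweet(tweet_id, user_id, text, threshold=0.5):
    features = FEATURE_STORE.get(user_id, DEFAULT_FEATURES)  # low-latency lookup
    score = score_tweet(text, features)
    return Decision(tweet_id, score, passed=score < threshold)

blocked = filter_tweet("t1", "user_1", "free money!!!")
allowed = filter_tweet("t2", "user_2", "nice weather today")
print(blocked.passed, allowed.passed)  # False True
```

The offline flow mirrors the same steps: the same feature definitions are computed in batch for training, which is why feature consistency between the online and offline stores is worth discussing.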
Other points worth discussion
data consistency, data shift
data collection bias. We have met similar problems in practice
Single point of failure.
Data overfitting.
What is the backup plan? If BERT fails, is there a backup solution?
We probably should spend more time on high level
recall/precision.
Recall high -> lots of false positives
Precision high -> lots of lawsuits
You may need to swap in a different model when encountering lawsuits
Deploy:
Centralized, or deploy to the client?
Centralized: can update quickly, but if the user is offline they cannot use the service
===
Audience
Are these points for senior level?
Audience answer:
E4/L4: tests key knowledge and tradeoffs
E5: more system level
Difficult to type everything
===
High level design
Can draw and discuss at the same time
===
audience:
Interviewer may want to control the pacing
Feature selection
Interviewer should decide the priority of the answers
===
BERT feature model:
What features do we use?
We have numeric features and language features
The BERT output is concatenated with the output of the model for the numeric features
Do we treat BERT as part of the big model (fine-tune end to end)? Or freeze BERT and train the rest?
NLP: look at word relations
Combination is too big when the length
We can add a few more layers to handle more words
===
Audience
CNN - is it similar to N-gram?
Yes. 2 word kernel, then next layer x2
2x2x2x2 can expand to many words
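A quick check of how far stacked kernel-2 layers can see. The doubling described above holds when each layer also doubles its dilation (or stride); with plain stride-1 layers the span grows by only one word per layer. The function below uses the standard receptive-field formula for stride-1 1-D convolutions; the layer counts are illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Words covered by one output position after stacking stride-1 1-D conv layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

plain = receptive_field(2, [1, 1, 1, 1])    # four ordinary kernel-2 layers
dilated = receptive_field(2, [1, 2, 4, 8])  # dilation doubles at every layer
print(plain, dilated)  # 5 16
```

With doubling dilation, four layers cover 2x2x2x2 = 16 words, so a handful of extra layers is enough to span a whole tweet.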
Deployment
Why do we need parallel on GPU?
BERT: attention takes time. Parallelism can speed up the computation
Do we need GPU for inference?
CPU - sequence model
BERT - can put the whole model into BERT
Non text concatenate
Naive Bayes as the baseline
Offline: first build a Naive Bayes model
If recall is important, e.g. 60%
Online inference - GPU is important
CPU serving is sufficient
GPU can be used for model training
During inference, we don’t need GPU any more
Can GPU significantly reduce latency? Yes. But too resource intensive. In industry we usually use CPU inference.
Depends on model complexity and data size, e.g. real-time processing for images
GPU:
Complex model, more data,
How to do it fast:
TF-IDF, then word2vec
If we train embedding through BERT, then online we need to use BERT again.
Quantization:
Why?
Bit width: 32 bits -> 8 bits
Float -> integer
You can speed up, but you sacrifice precision
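The float -> integer step and the precision it costs can be shown with a simple affine (min/max) quantization scheme. This is one common scheme chosen for illustration, not necessarily what a given serving framework does; the weight values are made up.

```python
def quantize(values, num_bits=8):
    """Affine quantization of floats to unsigned integers in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [qi * scale + lo for qi in q]

weights = [-0.82, -0.10, 0.0, 0.37, 1.25]  # toy float weights
q, scale, offset = quantize(weights)
restored = dequantize(q, scale, offset)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)          # integers between 0 and 255
print(max_error)  # at most about half of one quantization step (scale / 2)
```

Each weight now fits in 8 bits instead of 32, and the rounding error per weight is bounded by half a quantization step, which is the precision being traded for speed and size.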
Optimization:
can do it during training
can also do it during serving
Some possible approaches: mixed precision (16 bits, 32 bits)
Reduce latency
Whatever precision we use during training, we use the same precision during deployment
During deployment we can compress the model
Why don’t we recommend LSTM?
Because it’s hard to run in parallel
The main issue is that it doesn’t solve long-distance dependencies
Model deployment
Online, offline, late processing
Device processing
Where do I learn it from?
We may get these from work experience. If we hit this situation, we can use a backup model
AB testing has a backup plan
When we draw the diagram, we can talk about the backup plan
Machine learning system design
Online training vs AB testing
How do I do AB testing when I do online training?
AB testing is not for updating model
We usually have a backup plan
When we add a feature, we will need to do AB testing
Data shift, quality.
Most of the time people don’t use online training.
Sometimes online training is for models requiring complex/slow computation