Machine Learning System Architecture and Development Cycle

Machine Learning System Design

Topic: Machine Learning System Architecture and Development Cycle

Presenter: Junzhi He

Job Referral

Candidates: https://commitway.com/job-refer

Hiring managers/team members: https://commitway.com/job-open

Meeting Summary

Machine Learning Architecture and Development Cycle

Why a system for machine learning is developed

Tools, workflow, hardware, architecture

We will explain key concepts

We will help attendees bridge software engineering and machine learning engineering

We will cover recruiting talent, motivating teams, and managing up as a machine learning manager

Outline:

What is machine learning

Basic ML components and architecture

Model development cycle vs SWE development cycle

The future of machine learning and AI

How do you learn a new topic? Distributed computing and machine learning

Misconceptions

Common issues:

Initially I didn’t have an overall knowledge of architecture.

I knew individual pieces, but it’s hard to know the overall picture

Machine learning is the same

When we learn through books, it’s dry.

Key issue is the knowledge is distant from day-to-day work

Then we doubt ourselves

The Foolish Old Man Who Moved the Mountains (愚公移山)

Goal is good

But the process is not advisable.

What problem did it solve?

What challenges does it bring?

Key reasons

What does a distributed system solve?

Cost vs product quality tradeoff

Data size increases, requiring systems to process lots of data

2015: 6ZB

2020: 44ZB

2030: 2500ZB

Value of individual records decreases

24/7 service is required due to competition

Need to support a company of thousands to tens of thousands of people

Single machine system

Frontend

Java web app

MySQL

Now the system is a lot more complex.

All nodes are replicated

Application server is split into a web layer and an application layer

Lots of problems from increasing scale

Where to find other servers?

How to distribute load?

Need to manage system architecture with software, not humans

Different servers serve the same user

What if a server crashes? Will we have an avalanche (cascading failure)?

How to monitor and maintain servers?
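
The questions above (finding servers, distributing load, surviving a crash) can be illustrated with a toy round-robin balancer. This is only a minimal sketch under stated assumptions: the `RoundRobinBalancer` class, its `mark_down`/`mark_up` methods, and the server names are hypothetical and introduced here for illustration. A real system would learn about crashes from health checks rather than manual calls.

```python
from typing import List, Set


class RoundRobinBalancer:
    """Toy registry + load balancer (hypothetical, for illustration only)."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)   # known instances ("where to find servers")
        self.down: Set[str] = set()    # instances reported as crashed
        self._next = 0                 # round-robin cursor

    def mark_down(self, server: str) -> None:
        self.down.add(server)

    def mark_up(self, server: str) -> None:
        self.down.discard(server)

    def pick(self) -> str:
        """Return the next healthy server, skipping crashed ones."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [balancer.pick() for _ in range(3)]
balancer.mark_down("app-2")            # simulate a crash
after_crash = [balancer.pick() for _ in range(4)]
```

After the simulated crash, requests keep flowing to the two remaining servers. Note that each survivor now carries more load, which is how an avalanche can start if capacity is tight.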

Lots of problems with lots of data

Partition of data

Data replication
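
Partitioning and replication can be sketched together in a few lines. The example below is an illustration, not a recommendation: the `owners` function, the node names, and the "primary plus next neighbors" replica placement are assumptions made for this sketch.

```python
import hashlib
from typing import List


def owners(key: str, nodes: List[str], replicas: int = 2) -> List[str]:
    """Return the nodes that store `key`: a primary chosen by hash,
    plus the next `replicas - 1` nodes in list order."""
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and len(nodes)")
    # hashlib gives a hash that is stable across processes
    # (unlike the built-in hash() for strings).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]


nodes = ["db-0", "db-1", "db-2", "db-3"]
placement = {k: owners(k, nodes) for k in ["user:1", "user:2", "order:17"]}
```

One known limitation of plain modulo hashing is that changing the number of nodes moves most keys to a different node; consistent hashing is the usual way to reduce that movement.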

Before you study, write down problems and classify

Division of work

Find service

Find instance

Data partition

Avalanche

Consensus

Replication

Distributed application logic

Distributed lock

Operation

Configuration center

Monitor and observability

We need to have an overall system to learn

Machine learning:

Most important point: Complex patterns: there are patterns to learn, they are complex, there are historical

Learn: system has capacity to learn

Existing data: data is available

etc

Try to discover patterns

Which one is easiest for an ML system to solve?

A released prisoner: will they reoffend?

Buying pattern: will the buyer break the contract in the future?

Watching behavior: what video will they watch in the future?

Using a questionnaire: estimate the support for each candidate

Answer: the ones that use past patterns to predict future patterns

Difficult:

Support rate: not easy for an ML system

Reoffending: raises ethical issues

Easier:

Buying behavior to predict the risk of the buyer

Watching behavior to predict what they will watch

5 more examples

Chess

Mahjong

Chinese chess

Texas hold 'em poker

Go

The current state of the game + previous behavior: can decide your next best action.

Q: Previous behavior?

A: The psychology of the opponent

Experiments can be repeated an unlimited number of times

It is harder than behavior prediction

Machine learning applications

Recommender system

Acquiring new customers

Increasing customer satisfaction

Increasing long term customer engagement

Generating customer intelligence

Predicting number of visitors

Reducing cost

Increasing customer satisfaction

Predicting demand fluctuations

Fraud detection

NLP and CV: interaction with customers

Basic steps for ML system development

Model: pattern for prediction

Deploy the model

Use the model

ML system development cycle

Explore and process:

collect historical data

clean and explore data

very important

Discover relationships between signals and results

prepare/transform.

Modeling:

Develop and train model

Validate/ evaluate model

Deployment

Deploy to production

Monitor and update model & data
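
The cycle above (explore and process, modeling, deployment) can be walked through end to end on a tiny example. This is a sketch under stated assumptions: the data is made up, the model is a one-variable least-squares line, and the function names and the deployment threshold of 1.0 are hypothetical choices for illustration.

```python
from statistics import mean
from typing import List, Optional, Tuple

Record = Tuple[Optional[float], Optional[float]]  # (signal, result)


def clean(records: List[Record]) -> List[Tuple[float, float]]:
    """Explore and process: drop records with missing values."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def train(data: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Modeling: fit y = a*x + b by ordinary least squares."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    sxx = sum((x - x_bar) ** 2 for x, _ in data)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    a = sxy / sxx
    return a, y_bar - a * x_bar


def evaluate(model: Tuple[float, float], data: List[Tuple[float, float]]) -> float:
    """Validation: mean absolute error on held-out data."""
    a, b = model
    return mean(abs(y - (a * x + b)) for x, y in data)


# Hypothetical historical data: result is roughly 2 * signal + 1.
history: List[Record] = [(1, 3.1), (2, 4.9), (None, 7.0), (3, 7.2),
                         (4, 8.8), (5, None), (6, 13.1), (7, 15.0)]
rows = clean(history)
train_rows, valid_rows = rows[:4], rows[4:]
model = train(train_rows)
error = evaluate(model, valid_rows)
deploy = error < 1.0   # deployment gate; the threshold is an assumption
```

The monitoring step is not shown here; in practice the same `evaluate` call would be repeated on fresh data after deployment.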

Challenges from ML systems:

Part 1 challenges

Different data sources need to be easy to access

Data needs to be easily queryable: an infrastructure requirement, as in the three Google papers (GFS, MapReduce, Bigtable)

We must have data to train

Some requirements are vague:

Competitor can customize prices without losing satisfaction

You need to make some assumptions.

Lots of failures are due to assumptions

Computational complexity: the speed is too slow

What is a model?

2a + b = 5

a + b = 4

Linear model

Minimize abs(5 - 2a - b) + abs(4 - a - b) over a and b; the minimizer gives the best a and b
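
For this two-point example the minimization can be checked by hand. The short sketch below solves the two equations directly and evaluates the absolute-error loss from the notes; the function name `loss` and the comparison point are illustrative choices.

```python
def loss(a: float, b: float) -> float:
    """Sum of absolute errors from the notes, for points (2, 5) and (1, 4)."""
    return abs(5 - 2 * a - b) + abs(4 - a - b)


# Two points and two unknowns, so the line passes through both exactly:
# subtracting the second equation from the first gives a = 5 - 4.
a = (5 - 4) / (2 - 1)
b = 4 - a * 1
best = loss(a, b)

# Any other candidate has a strictly positive loss, for example:
worse = loss(1.5, 2.5)
```

Here `a` is 1, `b` is 3, and the loss is zero. With more data points than parameters an exact fit is usually impossible, which is why the problem is stated as a minimization.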

If I have too much data, how do I do distributed optimization of the model?
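
One common answer is data-parallel gradient descent: each worker computes a gradient on its own shard of the data and a coordinator combines them. The sketch below simulates this in a single process. It swaps in squared error for the absolute error used above, because its gradient is simpler; the shard layout, learning rate, and step count are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def shard_gradient(shard: List[Point], a: float, b: float) -> Tuple[float, float, int]:
    """Work done on one worker: summed squared-error gradient over its shard."""
    grad_a = sum(2 * (a * x + b - y) * x for x, y in shard)
    grad_b = sum(2 * (a * x + b - y) for x, y in shard)
    return grad_a, grad_b, len(shard)


def train(shards: List[List[Point]], lr: float = 0.05, steps: int = 5000) -> Tuple[float, float]:
    """Coordinator: combine per-shard gradients, then update the shared model."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        # In a real cluster each call would run on a different machine.
        results = [shard_gradient(shard, a, b) for shard in shards]
        n = sum(count for _, _, count in results)
        a -= lr * sum(ga for ga, _, _ in results) / n
        b -= lr * sum(gb for _, gb, _ in results) / n
    return a, b


# Points on the line y = 1*x + 3, split across two hypothetical workers.
shards = [[(1.0, 4.0), (2.0, 5.0)], [(3.0, 6.0), (4.0, 7.0)]]
a, b = train(shards)
```

Only the small gradient summaries travel between workers and coordinator, not the data itself, which is what makes the approach workable when the dataset does not fit on one machine.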

Part 2 challenges

After you deploy a model, performance is worse than in experiments

After you deploy a model, performance worsens over time

Are you 100% sure the logic is correctly implemented?

What if engineers implemented different code than data scientists?

What if data scientists make a mistake?

How to version the model?

If the model has an error or there is a bias in data, how do we troubleshoot?
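
When performance degrades after deployment, one simple first check is whether the live input data still looks like the training data. The sketch below is an assumption-laden illustration: the `drift_score` function, the sample values, and the alert threshold of 2.0 are all hypothetical, and real monitoring would use richer statistics than a mean shift.

```python
from statistics import mean, stdev
from typing import List


def drift_score(training: List[float], live: List[float]) -> float:
    """How far the live mean has moved, in units of training standard deviation."""
    spread = stdev(training)
    if spread == 0:
        raise ValueError("training values have no spread")
    return abs(mean(live) - mean(training)) / spread


# Made-up values of one feature (e.g. order amount).
training_values = [10.0, 12.0, 11.0, 9.0, 13.0, 10.5, 11.5]
live_similar = [10.2, 11.8, 10.9, 11.1]
live_shifted = [18.0, 19.5, 17.2, 20.1]

ALERT_THRESHOLD = 2.0   # assumption: tune per feature
similar_alert = drift_score(training_values, live_similar) > ALERT_THRESHOLD
shifted_alert = drift_score(training_values, live_shifted) > ALERT_THRESHOLD
```

A check like this does not explain why the data changed, but it narrows troubleshooting to the data rather than the code.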

Part 3 challenges

What if one team’s values don’t align with those of other teams?

ML engineers

Sales teams - they want engagement. They want to recommend whatever generates the highest revenue

Product team - they want a good user experience

ML platform team: their biggest ask is not to change the platform

Managers - they want people to use the new ML model

Needs lots of negotiation with other teams

Solution to ML system challenges

“Hidden Technical Debt in Machine Learning Systems”

ML challenges

ML System challenges by root cause

1. Correctness

2. Data access

3. Cannot train practically

4. Prediction speed is low

5. Change in underlying data

6. Workflow for development

7. Balance of team needs

Infrastructure problems: 2, 3

Architecture problem: 2, 3, 4

Operation and tool problems: 1, 5, 6

Management and collaboration: how to make tradeoffs, how to set up a data-driven decision culture. Data can help. Involvement of domain experts: 1, 7

ML’s own problem: 3, 4

Biggest problem is business impact

Revenue

Cost

Customer satisfaction

Computation speed and QPS

Usually:

REST API

parse HTTP request

prepare feature

load model

pass feature to model

Calculate result

Send back HTTP response
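
The steps above can be sketched as one request handler. This is a minimal illustration, not a real service: there is no HTTP server, the model is a made-up set of linear weights, and the names `load_model`, `prepare_features`, `predict`, and `handle_request` are hypothetical.

```python
import json
from functools import lru_cache
from typing import Dict, Tuple


@lru_cache(maxsize=1)
def load_model() -> Dict[str, float]:
    """Load the model once and keep it in memory (weights are made up)."""
    return {"bias": 0.5, "age": 0.01, "visits": 0.1}


def prepare_features(payload: dict) -> Dict[str, float]:
    """Turn the parsed request into model inputs; missing fields default to 0."""
    return {"age": float(payload.get("age", 0)),
            "visits": float(payload.get("visits", 0))}


def predict(model: Dict[str, float], features: Dict[str, float]) -> float:
    """Calculate the result: a plain linear score."""
    return model["bias"] + sum(model[name] * value for name, value in features.items())


def handle_request(body: bytes) -> Tuple[int, bytes]:
    """Parse request -> prepare features -> load model -> predict -> respond."""
    try:
        payload = json.loads(body)
    except ValueError:                       # malformed JSON or bad encoding
        return 400, b'{"error": "invalid JSON"}'
    if not isinstance(payload, dict):
        return 400, b'{"error": "expected a JSON object"}'
    features = prepare_features(payload)
    score = predict(load_model(), features)
    return 200, json.dumps({"score": score}).encode("utf-8")


status, response = handle_request(b'{"age": 30, "visits": 4}')
```

Caching the loaded model with `lru_cache` keeps the "load model" step off the per-request path, which leaves feature preparation and inference as the steps to watch.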

What if the feature preparation or result calculation step is too slow?

Things to improve:

Network protocol

Data format

Model I/O

Model inference speed

Hardware improvement

Easy way to improve

Cache the model in GPU memory

Model compression:

Low-rank factorization. Reduce or eliminate some layers

Knowledge distillation

Pruning

Quantization - 32-bit float, 16-bit float, 8-bit int. If lower precision doesn’t degrade the result, then it works
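
To make the quantization item concrete, here is a sketch of symmetric 8-bit quantization with a single shared scale. It is one simple scheme among several, chosen here for brevity; the function names and the example weights are assumptions.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [value * scale for value in codes]


weights = [0.82, -1.27, 0.003, 0.5, -0.11]   # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half the scale. Whether that is acceptable is an empirical question: the check is whether model quality on held-out data stays the same after quantization.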

Reference Uber ML system

Pre-compute result

Replication and partition. Scalable

Hardware: TPU (tensor)

GPU is better for 1 dimensional calculation

Model optimization

The most expensive. Need to optimize based on hardware and ML code

Matrix calculation optimization

Parallel computing

For loop vectorization

Assembly language optimization

https://lmax-exchange.github.io/disruptor/

Compiler optimization?

Usually optimize based on single thread (Single thread vs multi-thread)

Model serving

Batch inference or online inference?

Batch cannot handle latest behaviors

Serving speed vs model freshness is a tradeoff
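
The batch versus online tradeoff can be sketched as a lookup with a fallback. Everything named here is hypothetical: `batch_predictions` stands in for the output of a nightly job, and `online_predict` for a real-time model.

```python
from typing import Dict, List

# Output of a hypothetical nightly batch job: user -> recommended videos.
batch_predictions: Dict[str, List[str]] = {
    "alice": ["cooking-101", "knife-skills"],
    "bob": ["go-openings"],
}


def online_predict(recent_views: List[str]) -> List[str]:
    """Stand-in for a real-time model: suggest a follow-up to the latest view."""
    return [f"more-like-{recent_views[-1]}"]


def recommend(user: str, recent_views: List[str]) -> List[str]:
    """Use fresh behavior when there is any; otherwise serve the cached batch result."""
    if recent_views:
        return online_predict(recent_views)    # slower, but up to date
    return batch_predictions.get(user, [])     # fast lookup, possibly stale


stale = recommend("alice", [])
fresh = recommend("alice", ["chess-endgames"])
```

The batch path is a dictionary lookup and therefore fast, but it cannot know about the chess video watched a minute ago; the online path can, at the cost of running a model per request.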

Data infrastructure

Data lake, data warehouse

Key questions to ask:

Is rawer data better, or smaller binary data?

Do you read all data or part of data

Do you need to read often or write often?

Is some data loss acceptable?

Row or column based?

What format? Readable or machine optimized

Data lake: row-based, raw

Data warehouse: column-based, formatted
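
The row versus column question is easier to see with the same records stored both ways. The sketch below uses plain Python lists and dictionaries as stand-ins for storage formats; the records are made up.

```python
from typing import Any, Dict, List

# Row-based layout: each record is stored together (cheap to append one record).
rows: List[Dict[str, Any]] = [
    {"user": "alice", "country": "US", "amount": 12.0},
    {"user": "bob", "country": "DE", "amount": 7.5},
    {"user": "carol", "country": "US", "amount": 3.0},
]

# Column-based layout: each field is stored together (cheap to scan one field).
columns: Dict[str, List[Any]] = {
    name: [row[name] for row in rows] for name in rows[0]
}

# An aggregate over one field touches every record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)
# ...but only a single list in the column layout.
total_from_columns = sum(columns["amount"])
```

This mirrors the "read all data or part of data" and "read often or write often" questions: appending a record favors rows, while analytics over a few fields favors columns.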

Data governance: complex

Data management like code management

Review

Versioning

Refer to Esensoft

What if a column is deleted?

We need to know the baseline

ML system architecture 3

Consistent environment leads to consistent output

Development environment vs production environment

Feature engineering

Workflow management: Airflow, Prefect (they are schedulers)

Resource management: Kubernetes (K8s), Kubeflow

Development lifecycle, CD for ML
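
At their core, workflow managers run tasks in dependency order. The sketch below shows that idea with the Python standard library only; it does not use the Airflow or Prefect APIs, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
pipeline = {
    "collect_data": set(),
    "clean_data": {"collect_data"},
    "compute_stats": {"clean_data"},
    "build_features": {"clean_data"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model", "compute_stats"},
    "deploy_model": {"evaluate_model"},
}

run_log = []


def run(task: str) -> None:
    """Stand-in for real work; a scheduler would launch a job here."""
    run_log.append(task)


for task in TopologicalSorter(pipeline).static_order():
    run(task)
```

Real schedulers add what this sketch leaves out: time-based triggers, retries, and running independent tasks (here `compute_stats` and `build_features`) in parallel.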

Other topics

Model training

Model monitor

Continual learning

ML architecture 5.1

Most important is how to put together ML system with other systems

What can be slow?

What must be fast? E.g. serving.

Delay tolerance to return prediction:

Online: 80ms

Nearline: 800ms

Offline: One week

Chip Huyen: Designing Machine Learning Systems

How do you unify online and offline?

We should understand why there needs to be a workflow

The main reason

CI/CD: standard issue for development environments

Software: code and dependencies

Hardware: environment

Accessory: environment

CI/CD: reduces errors in testing and verification

Developer

Code

How to test an ML model? Offline testing is easy. But how to test on prod? Answer: test in production

How to test research? Experiment tracking.

How to control the version of an ML model? How do we roll back? Answer: versioning. Tradeoff of model serving.

Code splits into data, model
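
Versioning and rollback can be sketched with a small in-memory registry. This is an illustration under stated assumptions, not a reference to any particular tool: the `ModelRegistry` class and its methods are hypothetical, and a real registry would persist model artifacts together with the data and code versions that produced them.

```python
from typing import Any, Dict, List


class ModelRegistry:
    """Toy in-memory registry; a real one would persist artifacts and metadata."""

    def __init__(self) -> None:
        self._models: Dict[int, Any] = {}
        self._history: List[int] = []    # versions served, oldest first

    def register(self, model: Any) -> int:
        """Store a model under the next version number without serving it."""
        version = len(self._models) + 1
        self._models[version] = model
        return version

    def promote(self, version: int) -> None:
        """Start serving a registered version."""
        if version not in self._models:
            raise KeyError(f"unknown model version {version}")
        self._history.append(version)

    def rollback(self) -> int:
        """Go back to the previously served version and return its number."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    def current(self) -> Any:
        """Return the model that is being served right now."""
        if not self._history:
            raise RuntimeError("no model has been promoted yet")
        return self._models[self._history[-1]]


registry = ModelRegistry()
v1 = registry.register({"weights": [0.1, 0.2]})
v2 = registry.register({"weights": [0.3, 0.1]})
registry.promote(v1)
registry.promote(v2)
restored = registry.rollback()     # suppose v2 misbehaves in production
```

Keeping registration separate from promotion means a new model can be stored and tested without affecting what is served, and rollback becomes a pointer change rather than a redeploy.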

Build:

Experiment, model, code together

Automated tests

Release

Model, entire ML pipeline, image, code as artifact

CD for ML is more complex

Scheduler and lower layer orchestrators

ML jobs

Researchers, analysts, modelers, and engineers

Org:

Collaboration

Culture: data- and result-driven decisions.

Make ML result a reference

Build goodwill widely (广结善缘); act swiftly and decisively (雷厉风行)

Is it a good time to join ML?

Now ML is mature; growth is slower compared to the initial stage

Longer term there will be more growth

MLE:

Closer to model

Closer to engineering

Choose based on your background

Should ML change direction?

Depends on your business impact

ML vs SDE? ML is more specific

Is it time to enter high tech or ML?

Specialization - needs people with lots of experience

Get ready

Technical trend - easier to guess based on experience

Business trend - hard to capture

ML is not a bubble

Picking a company: the business can change; the core is the technical barrier

Books

Full Stack Deep Learning

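Chip Huyen: Designing Machine Learning Systems

Udacity: Machine Learning Engineer Nanodegree - pretty good as a training class

Hung-yi Lee (李宏毅): YouTube - all concepts well explained

Wang Zhe (王喆): books and Geek Time (极客时间) column

Wang Shusen (王树森): YouTube

Mu Li (李沐): Bilibili and books - paper

StatQuest - statistical book

ritvikmath

The Beauty of Mathematics (数学之美)

Whiteboard Derivations (白板推导)

Older Google paper: Hidden Technical Debt in Machine Learning Systems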

https://netflixtechblog.com/system-architectures-for-personalization-and-recommendation-e081aa94b5d8