Google File System, a Planet Scale Distributed File System

System DesignDatabases & StorageDistributed Systems

Materials — open to everyone, no sign-in

Topic: Google File System, a Planet Scale Distributed File System

Presenter: Longwei

Additional Resources:


Sign up for future events:

System Design Presentation Summary

Date: 9/11/2022

Topic: distributed file system

Presenter: Leo

https://docs.google.com/presentation/d/18u9NAe2pZamC7LBNLazVlPECIIrCpYb9HrMgl8Zjwvc/edit#slide=id.g10dd36c8b08_0_31

Top 4 layers are conceptual

The master is a single machine (single point of failure)

Client only upload data to one replica (the closest)

Network is the most expensive resource

Naively we can assume each set of operation is performed on each chunk

Chunk location is not persisted

Chunk server will sync with master during boot

Summary review:

Master supports metadata

Real data is stored in the chunk server

Meta information is very small X kB information

10 GB memory - 100T data

Q: how to handle master crash

A: master has write-ahead-log. Master can be restarted within 1 minute

Audience: there is a shadow master. Can recover from log

May get to a state where we fail to write to some replica.

Big idea

Client sends data to the memory of 3 replicas

Client asks 3 replicas to save from memory to disk

Lease: a node is a master of a version only for 1 minute

When appending more data than a chunk, GFS will start a new chunk.

Shadow master: similar to slaves in master/slave architecture where master can be used for write and read, and slaves can be used for read.

Are operation log sent to shadow master

Trade-offs

Chunk size = 64MB

Larger chunk size: client has less communication with master; master requires less memory

Not good for small sizes.

Not all master data is strongly consistent to gain performance.

At least once vs exactly once

Using at least once; may have duplicate; faster write, slower read

Offset decided by GFS

Cannot support fast random access

Two primary on network partitions?

Yes. resolved by lease + version

Master fails?

Human intervention

Stale replica

Audience Discussion

If we modify 1 chunk, will it bump up a version?

Snapshotting is related to file copying. When copying a file to another file, the other file is a shallow copy, until a modification happens.

Why can we write at most once during one minute?

Lease has TTL of 1 minute.

What if the access is longer than 1 minute?

Because we are sending a trunk, which should be fast.

Every request will get a lease from the master.

Master may change from one version to another version.

It seems the version number can solve the problem of ordering.