Crawler

System DesignSearch & Indexing

Topic: Crawler

Interviewer: Pinglu

Interviewee: Harry

Level: L5 (Senior)


Topic

Mock System Design Interview Summary

Interview Overview

Date: 1/2/2022

Target level: L5

Duration: 1 hour

Topic covered: Crawler

Drawing tool used: Whimsical

Requirements

Functional requirements

Distributed crawler

Non functional requirements

Latency requirements?

How many websites? When do we need to crawl?

Latency is flexible. Depends on how often the content changes

1 trillion URLs

Approach: start from a workable solution. Then make it scalable.

System Design

System design diagram

URL retriever

URL downloader

Content parser

Indexer

Interviewee: How to improve:

Control the speed of crawling -

Kafka to avoid loss

Interviewer: What’s metadata:

Metadata of the website.

It says “how often your client should access my website”

How to handle duplicate URL?

In metadata database, we have URL, and update time. Crawler can look at the URL.

To avoid the duplicate content, we can use hash to see if the content has changed or not.

We can adjust the scheduler if the content doesn’t change very often.

Database schema

ID,link, status, updatetime(i.e. Last crawl time)

Interviewer: How do you shard?

Interviewee: Calculate:

1 trillion websites

11k-12k per second if we crawl once a week

Interviewer:

Update frequency:

Most frequent: 5 minutes

Least frequent 1 month

For retriever or downloader

Need 2000 machines

1k websites/second

1 / s

1k

100 process for each machine

Need at least 10 machines

1000 machines

We can shard the URL based on the URL ID

Shard by ID, spread it to 10 machines

Revised ID:

We can shard by location

then shard by URL ID

Based on that hand to different Crawlers and Downloaders

How do you handle failures?

Failure 1: When website is down, we can add retry. Downloader can add the entry back to the queue

Another failure: message queue is down

Interviewee: clarify the failures

Interviewer: retriever and downloader

Interviewee:

We can assume the process is stateless

We can create a write-ahead log

Interviewer: How to detect failure of retriever and scheduler?

Interviewee:

Master can use heart beat, and restart if the process is down

We can also use zoo keeper

Interviewer: How to avoid infinite loop? Circular URLs

Interviewee:

We can use BFS to process the files

Mark all the URL that has been crawled

Avoid duplicate URL if the URL has been crawled

Interviewer: How do we extend the system?

Interviewee: Save in DFS. We can read the content differently.

Interviewer and Audience Feedback after the Interview

Clear design

Explain each detail components

Did well to ask for feedback

Answered the questions well

Strong technical knowledge

Scalability

URL size can be divided into multiple database

Biggest components

Retriever

Downloader

Content parser

Starts from retriever

Retrieve URL from database

Retriever put URL in queue for downloader

Downloader will get the URL from queue

Get IP from DNS server/cache

Downloaded URL and content goes to a new queue

Content parser processes content

HTML and raw data is saved in DFS

Content parser can get some new URLs, and put it into the database

Recurring job: for optimizing retriever and downloader

Robot.txt decides highest frequency you can retrieve

Recurring job saves it into DB

Question from the audience

Why retriever and downloader are different?

To handle different speed between retriever and downloader

Can also handle the speed for download

Do we need to have a full table scan?

We may use SQL

Look at the status and update time

Why not push the URL directly from parser directly to queue?

If there is no database, then you may not do dedupe

URL may have priority

Can set different priority in queue

Update time, schedule time

Schedule time can be computed

Exponential backoff

If there is no change

Interviewee:

Scheduler: is for speed control

Why do we use kafka?

How do you know the message can fit into queue?

You can first write into file system. Then you can send file location into the queue

Reason to split crawler and parser?

Audience

Ask for next steps

Should estimate the machines

Latency doesn’t apply to this case

How to partition, what’s partition key

Location and ID partition

Audience

Component is not clearly described?

Sharding

Don’t wait for the interviewer?

What needs sharding?

DFS and Database sharding

Sharding is usually based on visit patterns

Related website

Related time

Audience:

Please write a SQL statement for database retrieval

How to analyze?

NLP processing, keyword, elestatic search

Need to have different queue using different priorities

Details about fast changing URL vs slow changing URL

Basic

Bonus requirement: tiered access

Feature is not well defined

Usually uses Bloomfilter

Put seed into the database

Seed in database

Check bloomfilter

How to update bloomfilter

===

What happens if we are stuck?

Come back to the high level requirements

Latency

Scalability

Durability

Reliability - fault tolerant

There may be 3 points for reliability

User journey: MVP

Scale from small number of users -> high level of users

===

From simple to complex

Increase data size

Optimize: queue

Security

Monitoring

Fast and slow changing websites?

There are 2 types of interviewers

Some have strong idea

Some don’t have strong idea

Calibration:

What are the key points?

Completeness?

Or deep dive?

Fundamental

What are the product requirements

API, queue, database, cache

Should become familiar with basic components

Clarify requirement is the most important

==

Interviewer:

Requirement: all URLs, duplicate

Audience:

Should we ask interviewer

Audience

Missing the iterations

Step-by-step narrow down

Difference between L4 and L5?

L5:

requires trim down

Duplicate

Functional, non functional, user journey

Database, message queue

L5 - how to identify key feature, bottleneck, big problems

Infrastructure: machine, round trip, milliseconds

L5:

Multiple solutions, tradeoffs

Try to find the points

Different databases (sql/nosql)

Interviewers are emotional. Try to find one impressive point.

Deep dive, how to drill down?

Usually by the need of real work of the interviewer.

In real life, there is no well defined problems

Need a workable framework

Main bottleneck is the sites does not allow you to parse very quickly

The same domain may need to distribute to different crawler machines

==

Interviewer may deep dive based on the flow of the answer, and based on your expertise.

Should write down the requirements clearly before doing the design.