System Design
Availability Patterns
Fail over
- active/passive failover: when the active server heartbeat is lost, the passive server takes over as actice and resumes service
- active/active failover: both servers are managing traffic and balancing load between them, if one goes down either the DNS or the application would know to direct traffic to the only active server
Replication
Replication
What is replication?
- replication means duplicating our data, having copies on the same data stored across multiple nodes
- replication is a way to make our systems more robust, fault tolerant and potentially more performant
- example -- if a node goes down, or we want to bring data physically closer to our users
Things to consider
- replication means we have to think about consistency
- how do we ensure consistency of data across the various nodes
- how do we handle reads and writes when there are multiple copies of the data?
- how do we decide what node accepts a read, and which node accepts a write?
Consistency Patterns
https://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Weak Consistency
- after a write, reads may or may not see it. but its NOT guaranteed
- mostly used in scenarios where data is not persisted
Eventual Consistency
- writes are asynchronous... reads will eventually see it
Strong Consistency
- writes are synchronous..... reads will see it right away
How do we arrive at these consistency Patterns?
- Leaders and Followers
Leaders and Followers
Catchup recovery
failover
Multi Leader
Leaderless
to read --- https://lethain.com/introduction-to-architecting-systems-for-scale/
Gateway
- communicates with user service to determine if this is an authenticated request and if yes directs it to the correct service
Grokking The Systems Design Interview
Load Balancing
- we can add LBs at 3 places
- between user and web server
- between web server and internal layer (application servers/cache servers)
- between internal layer and database
- Load Balancers themselves can be a single point of failure, to avoid this introudce redundant load balancers
- ie cluster of load balancers.
Load Balancing Algorithms:
- all algorithms involve health checks to make sure a particular server is up.
Least Connection Method
- directs traffic to server with the fewest active connections
Least Response Time Method
- directs to server with fewest activce connections and lowest average response time
Least Bandwidth Method
- directs to server that current has the least amount of traffic (measured in megabits per second -- Mbps)
Round Robin Method
- cycles through a list of server and sends each new request to the next server
- most useful when the servers are of equal specification and there are not many persistent connections
Weighted Round Robin
- better handles servers with different processing capacities
- eacher server is assigned a weight based on processing capacity
- servers with higher weights recieve new connections and more connections over those with lower weights
IP Hash
- a hash of the client's IP is calculated to redirect the request
Caching
Application Server Cache
- a cache placed directly on a request layer node enables response data to be stored locally
- each time a request is made, node quickly returns locally cached data if available, else it will go to disk
- what happens when you have multiple nodes serving requests?
- each node can host its own cache, however if the same request is randomly distributed to different nodes, cache misses will increase
- potential solutions: global caches and distributed caches
Content Delivery Network (CDN)
- a type of cache that is useful for sites that serve large amount of static media
- a request will first ask CDN for some static media
- if locally available, the CDN will serve the content
- else, it will query the backend server, cache it locally, and serve it to the requesting user.
Cache Invalidation
How do we ensure that when data is modified in the database, it is then removed from the cache to prevent inconsistencies? This process is called cache invalidation
Write-through cache
- we write to the database and to the cache simultaneously
- creates high latency for writes since each write must be done twice
Write-around
- delete from cache and write to database only ??
Write-back cache
- data is written to cache alone and persisted to perminent storage after some specified time interval
Cache Eviction Policies
- FIFO -- without any regard to access patterns
- LIFO -- without any regard to access patterns
- Least Recently Used (LRU)
- Most Recently Used (MRU)
- Least Frequently Used (LFU) -- counts how often items are needed
- Random Replacement - random eviction when space is necessary
Data Partitioning
Partitioning Methods
Horizontal partitioning
- different rows on different servers
- also called sharding
- also called range based partitioning
- con -- if the range isnt chosen carefully, could lead to some servers recieving higher load
Vertical Partitioning
- divide data to store tables related to a specific feature in their own server
Directory Based Partitioning
???
Partitioning Criteria
key / hash based partitioning
List Partitioning
round robin paritioning
composite partitioning
Indexes
- improve read performance but decrease writes
- its an additional data structure that takes a column and maps its value to its location in a table
- duplicates are allowed on secondary indexes
Proxies
Redundancy and Replication
SQL v NoSQL
CAP Theorem
Consistent Hashing
Long-Polling v WebSockets v Server-Sent Events
Grokking the Advanced System Design Interview
Bloom Filter
- tells whether an element may be in a set or definitely is not in a set
- the only possible errors are false positives
- bloom filters are created for SSTables
- in bigtable and cassandra, any read operation has to read from all sstables. if these tables are not in memory, you may end up needing many disk accesse
- they help reduce the number of disk accesses by predicting if an SSTable may contain data for a given row or column pair
- in some cases, a small amount of server memory used for storing bloom filters drastically reduces the number of disk seeks, thereby improving read performance
Consistent Hashing
( a solution for partitioning ) 2 Challenges when we distribute and partition data--
- How do we know on which node a particular piece of data will be stored?
- When we add or remove nodes, how do we know what data will be moved where?
- How can we minimize data movement when nodes join or leave?
Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are addor removed
- the node is like a ring, where each node gets a range of keys,
- when a node is added or removed from the ring, only the next node is affected
- virtual nodes help make sure the nodes stay balanced
![https://www.educative.io/courses/grokking-adv-system-design-intvw/3Yw5qVDnq9R]
Quorum
( a solution for replication ) How do we make sure that all replicas are consistent?
- a quorum is the minimum number of servers on which a distributed operation needs to be performed succesfully before declaring the operations overall success
- aka the minimum number of servers that need to respond and agree
- Quorum is achieved when nodes follow the below protocol:
- R + W > N
- N = nodes in the quorum group
- W = minimum write nodes
- R = minimum read nodes
- ... skipped some details for now
Leader and Follower
Allow a single server to be responsibel fro data replication and coordinating work
- users only write to the leader, and the leader will then write to the followers
- followers can serve read requests for load balancing
- failover -- the process of assigning a new leader in case the leader fails
- in kafka - each partition has a designated leader
Write ahead Log
Machines can fail or restart at any time. How can the program know what it was doing before a system crash?
- WAL: each modification to the system is first written an append only log on the disk. If the machine crashes the system will be able to recover and reapply the operation if necessary.
- results in a reduced number of disk writes
- each node would maintain its own WAL
Segmented Log
A single log can become difficult to manage, and can become a performance bottleneck.
- a segmented log is a WAL that has been broken down into smaller segments.
High Water Mark
When the leader fails and a new leader is to be selected, there may be some data from the old leader that was not yet replicated to the new leader. How can the system know which writes have been succesfully replicated onto the followers?
- A high Water Mark is the last log entry on the leader that has been succesfully replicated to a quorum of followers.
- the client can read data only up to the high water mark index.
Heartbeat
- each server periodically sends a heartbeat message to a central monitoring server or other servers in the system to show it is still alive and functioning
Gossip Protocol
In the absence of a central monitoring server, how do we efficiently communicate to other servers in the cluster that other servers are up and running?
- Gossip Protocol: each node keeps track of a specified number of other nodes in the cluster.
- they gossip (share) this information to one other random node every second
- eventually, each node will know the state of every other node in the cluster
- used by dynamo and cassandra
Hinted Handoff
When a node that was previously down comes back online again, how should we write data to it?
- for nodes that are down, the system keeps notes (hints) of all the write requests they have missed.
- one the failing nodes recover, the write requests are forwarded to them based on the stored hints
- the node coordinating the operation to the failed node will write a hint in text file on its local disk
- when the coordinating node discovers that the failed node has recovered, it forwards the write requests to the target
- used by cassandra and dynamo
Read Repair
How do we repair stale data on nodes that have recovered from a failure?
- read repair: we repair stale data during the read operation
- at that time we can read data from multiple nodes, perform a comparison, and update stale data.
- used by cassandra and dynamo