Designing Data-Intensive Applications

February 9, 2021 (last updated March 29, 2026)

cover

description

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications. Peer under the hood of the systems you already use, and learn how to use and operate them more effectively Make informed decisions by identifying the strengths and weaknesses of different tools Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity Understand the distributed systems research upon which modern databases are built Peek behind the scenes of major online services, and learn from their architectures--- tags: software/design category:

system design

Design Data Intensive

Applications

1. Reliable, Scalable, + Maintainable Applications

key terms:
- databases
- caches
- search indexes
- stream processing
- batch processing
the fundamentals of what we are trying to achieve --- reliable, scalable, maintainable

Reliability

Different kind of faults to protect against:

Hardware faults
Software errors/Systematic faults
Human Error RAID configuration

Scalability

scalability is the term we use to describe a systems ability to cope with increased load
Questions to consider:
- if the system grows in a certain way, what are our options for coping with the growth
- how can we add computing resources to handle the additional load

Describing load:

to answer questions about scalability, we need to be able to describe the current load on the system
examples:
requests per second
ratio of reads to writes in a db
of simultaneously active users in a chat
hit rate on a cache

Describing Performance

in batch processing we usually care about throughput
in online (streaming) response time is usually the most important
throughput:
- the number of records we can process per second
- the total time it takes to run a job of a certain size
response time:
- the time between the client sending their request and recieving response.
Latency vs response time
- response time is what the client sees
- latency:
  - the amount of time that a request is waiting to be handled
response time should be thought of as a distribution and not a single number
- the same request can have multiple different response times.
tail latencies
- high percentiles of response times
- important as they directly affect the users experience of the service
tail latency amplification
- when multiple calls are made in parallel, the end user needs to wait for the slowest call to complete the request. the chance of getting a slow call increases when an end user request requires multiple backend calls. thus a small percentage of slow calls can lead to a slow overall experience for the user

Approaches for Coping with Load

Shared nothing architecture
- distributing the load accross multiple machines
- aka horizontal scaling
elastic system
- can automatically scale out when they detect an increase in load
the load parameters for a system will determine the architecture for that system
- an architecture that scales well for a particular system is built around assumptions of which operations will be common and which operations will be rare.

Twitter Example:

main operations:

post tweet (write)
read timeline (read)

twitters scaling problem

fan out : ie many to many mapping. each user follows many people and is followed by many people

main ways of implementing two main operations

approach 1: using a relational database to store tweets users and followers
- a new tweet goes into the global collection of tweets
- home timeline is constructed at request, fetching the tweets from the people they follow and merge them sorted by time
approach 2: use a cache for each users home timeline
- when a user posts a tweet, update the caches of the people who follow that user
- reads of home timeline are cheap because it is precomputed
- posting tweets (writes) are now expensive
- TODO: INSERT DIAGRAMS

trade offs and best compromise

approach 1
approach 2
- when a user with large number of followers tweets, this operation is very expensive

2. Data Models & Query Languages

object relational mismatch / impedimence mismatch
- OOP data structures in code and relational data model in db dont match up well
- translating the relational model to the application code is awkward
document model
- ex-- json
- appropriate when the data for the data structure is a self contained document
- better locality (all the relevant info of a document is stored in one place)
- good representation for one to many relationship (one user has many fields ex resume)
its best to normalize databases by storing information based on id than by test string (example have an id for the country instead of the country name)
- the id never needs to change, anything that is understood/meaningful to humans might one day change.

Relational vs Document databases today

document data model main benefits
- schema flexibility
- better performance (due to locality)
- no impedimence mismatch (some cases)
relational model benefits:
- better support for joins
- better handling of many to one and many to many relationships (ie allows for data normalization)

Schema flexibiliy in the document model

schema on read
- structure of the data is interpreted by the application only once it is read
schema on write
- structure is explicit, all data written to the database must conform to the schema

3. Storage and Retrieval

index

an additional structure derived from the primary data
many dbs allow you to add and remove indexes, doesnt affect the content of the database
every index slows down writes, but this is an important tradeoff
well chosen indexes speed up read queires (retrievals)

Hash Indexes

an in memory hash map where every key is mapped to a byte offset in the data file (location where value can be found)
when you want to look up a value, use the hash map to find the offset, seek to that location, and read.

Segmentation, Compaction, & Merge

to avoid running out of disk space when appending to a log, we do log segmentation
segmentation: break the log into segments of a certain size by closing a segment file when it reaches a certain size. then making subsequent writes to a new segment file
we then perform compaction on the segments
compaction: discard duplicate keys in the log and keeping only the most recent update for a key in memory
then we can merge the smaller segments together as they are now smaller files.
each segment now has its own in memory hash table
- to find a value for a key, we first check the most recent segment hash map,
- if they key is not present, we keep checking each subsequent segment's hash map
- merging process keeps thenumber of segments small so we dont have to check too many hash maps

Some implementation issues to consider

File Format
Deleting Records
- append special deletion record to data file (called a tombstone)
Crash Recivery
- if database is restarted, the in memory hash maps are lost
- can rebuild each segments hash map or store a snapshot of each hash map on disk and load into memory quickly on restart
Partially written records
Concurrency Control
- common choice is to have only one writer thread (but many reads) to avoid conflicting writes

Advantages of an append only design

sequential writes faster than random writes
concurrency and crash recovery much easier to handle

Disadvantages

hash table must fit in memory
- difficult to make an on disk hash map perform well
range queries not efficient