Designing Data-Intensive Applications Reading Notes 1: Introduction & Resources
Author: Internet
Book download link:
Table of Contents
Part I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems
Reliability
Hardware Faults
Software Errors
Human Errors
How Important Is Reliability?
Scalability
Describing Load
Describing Performance
Approaches for Coping with Load
Maintainability
Operability: Making Life Easy for Operations
Simplicity: Managing Complexity
Evolvability: Making Change Easy
Summary
2. Data Models and Query Languages
Relational Model Versus Document Model
The Birth of NoSQL
The Object-Relational Mismatch
Many-to-One and Many-to-Many Relationships
Are Document Databases Repeating History?
Relational Versus Document Databases Today
Query Languages for Data
Declarative Queries on the Web
MapReduce Querying
Graph-Like Data Models
Property Graphs
The Cypher Query Language
Graph Queries in SQL
Triple-Stores and SPARQL
The Foundation: Datalog
Summary
3. Storage and Retrieval
Data Structures That Power Your Database
Hash Indexes
SSTables and LSM-Trees
B-Trees
Comparing B-Trees and LSM-Trees
Other Indexing Structures
Transaction Processing or Analytics?
Data Warehousing
Stars and Snowflakes: Schemas for Analytics
Column-Oriented Storage
Column Compression
Sort Order in Column Storage
Writing to Column-Oriented Storage
Aggregation: Data Cubes and Materialized Views
Summary
4. Encoding and Evolution
Formats for Encoding Data
Language-Specific Formats
JSON, XML, and Binary Variants
Thrift and Protocol Buffers
Avro
The Merits of Schemas
Modes of Dataflow
Dataflow Through Databases
Dataflow Through Services: REST and RPC
Message-Passing Dataflow
Summary
In Part I, we discuss the fundamental ideas that underpin the design of
data-intensive applications. We start in Chapter 1 by discussing what we’re actually
trying to achieve: reliability, scalability, and maintainability; how we need to
think about them; and how we can achieve them. In Chapter 2 we compare several
different data models and query languages, and see how they are appropriate
to different situations. In Chapter 3 we talk about storage engines: how databases
arrange data on disk so that we can find it again efficiently. Chapter 4 turns to
formats for data encoding (serialization) and evolution of schemas over time.
Part II. Distributed Data
5. Replication
Leaders and Followers
Synchronous Versus Asynchronous Replication
Setting Up New Followers
Handling Node Outages
Implementation of Replication Logs
Problems with Replication Lag
Reading Your Own Writes
Monotonic Reads
Consistent Prefix Reads
Solutions for Replication Lag
Multi-Leader Replication
Use Cases for Multi-Leader Replication
Handling Write Conflicts
Multi-Leader Replication Topologies
Leaderless Replication
Writing to the Database When a Node Is Down
Limitations of Quorum Consistency
Sloppy Quorums and Hinted Handoff
Detecting Concurrent Writes
Summary
6. Partitioning
Partitioning and Replication
Partitioning of Key-Value Data
Partitioning by Key Range
Partitioning by Hash of Key
Skewed Workloads and Relieving Hot Spots
Partitioning and Secondary Indexes
Partitioning Secondary Indexes by Document
Partitioning Secondary Indexes by Term
Rebalancing Partitions
Strategies for Rebalancing
Operations: Automatic or Manual Rebalancing
Request Routing
Parallel Query Execution
Summary
7. Transactions
The Slippery Concept of a Transaction
The Meaning of ACID
Single-Object and Multi-Object Operations
Weak Isolation Levels
Read Committed
Snapshot Isolation and Repeatable Read
Preventing Lost Updates
Write Skew and Phantoms
Serializability
Actual Serial Execution
Two-Phase Locking (2PL)
Serializable Snapshot Isolation (SSI)
Summary
8. The Trouble with Distributed Systems
Faults and Partial Failures
Cloud Computing and Supercomputing
Unreliable Networks
Network Faults in Practice
Detecting Faults
Timeouts and Unbounded Delays
Synchronous Versus Asynchronous Networks
Unreliable Clocks
Monotonic Versus Time-of-Day Clocks
Clock Synchronization and Accuracy
Relying on Synchronized Clocks
Process Pauses
Knowledge, Truth, and Lies
The Truth Is Defined by the Majority
Byzantine Faults
System Model and Reality
Summary
9. Consistency and Consensus
Consistency Guarantees
Linearizability
What Makes a System Linearizable?
Relying on Linearizability
Implementing Linearizable Systems
The Cost of Linearizability
Ordering Guarantees
Ordering and Causality
Sequence Number Ordering
Total Order Broadcast
Distributed Transactions and Consensus
Atomic Commit and Two-Phase Commit (2PC)
Distributed Transactions in Practice
Fault-Tolerant Consensus
Membership and Coordination Services
Summary
In Part II, we move from data stored on one machine to data that is distributed
across multiple machines. This is often necessary for scalability, but brings with
it a variety of unique challenges. We first discuss replication (Chapter 5),
partitioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into
more detail on the problems with distributed systems (Chapter 8) and what it
means to achieve consistency and consensus in a distributed system (Chapter 9).
Part III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools
Simple Log Analysis
The Unix Philosophy
MapReduce and Distributed Filesystems
MapReduce Job Execution
Reduce-Side Joins and Grouping
Map-Side Joins
The Output of Batch Workflows
Comparing Hadoop to Distributed Databases
Beyond MapReduce
Materialization of Intermediate State
Graphs and Iterative Processing
High-Level APIs and Languages
Summary
11. Stream Processing
Transmitting Event Streams
Messaging Systems
Partitioned Logs
Databases and Streams
Keeping Systems in Sync
Change Data Capture
Event Sourcing
State, Streams, and Immutability
Processing Streams
Uses of Stream Processing
Reasoning About Time
Stream Joins
Fault Tolerance
Summary
12. The Future of Data Systems
Data Integration
Combining Specialized Tools by Deriving Data
Batch and Stream Processing
Unbundling Databases
Composing Data Storage Technologies
Designing Applications Around Dataflow
Observing Derived State
Aiming for Correctness
The End-to-End Argument for Databases
Enforcing Constraints
Timeliness and Integrity
Trust, but Verify
Doing the Right Thing
Predictive Analytics
Privacy and Tracking
Summary
Glossary
Index
In Part III, we discuss systems that derive some datasets from other datasets.
Derived data often occurs in heterogeneous systems: when there is no one database
that can do everything well, applications need to integrate several different
databases, caches, indexes, and so on. In Chapter 10 we start with a batch processing
approach to derived data, and we build upon it with stream processing in
Chapter 11. Finally, in Chapter 12 we put everything together and discuss
approaches for building reliable, scalable, and maintainable applications in the
future.
When do you need a scalable data system (vs. a relational database)? First weigh the pros and cons
Sometimes, when discussing scalable data systems, people make comments along the
lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a
relational database.” There is truth in that statement: building for scale that you don’t
need is wasted effort and may lock you into an inflexible design. In effect, it is a form
of premature optimization. However, it’s also important to choose the right tool for
the job, and different technologies each have their own strengths and weaknesses. As
we shall see, relational databases are important but not the final word on dealing with
data.
How to find the references
For academic papers, you can search for the title in Google Scholar to find
open-access PDF files. Alternatively, you can find all of the references at
https://github.com/ept/ddia-references, where we maintain up-to-date links.
The references at the end of each chapter are a great resource if
you want to explore an area in more depth, and most of them are freely available
online.
This book has over 800 references to articles, blog posts, talks, documentation,
and more, and they have been an invaluable learning resource for me. I am very
grateful to the authors of this material for sharing their knowledge.
Resources
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/designing-data-intensive-apps.
For more information about our books, courses, conferences, and news, see our
website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Open source vs. closed source
This book has a bias toward free and open source software (FOSS), because reading,
modifying, and executing source code is a great way to understand how something
works in detail. Open platforms also reduce the risk of vendor lock-in. However,
where appropriate, we also discuss proprietary software (closed-source software,
software as a service, or companies’ in-house software that is only described in literature
but not released publicly).
Tags: Chapter, Replication, data, Processing, Intensive, Application, reading notes, Databases, Data. Source: https://www.cnblogs.com/panpanwelcome/p/15854672.html