📚 Study

Designing Data-Intensive Applications - (1) Trade-offs in Data Systems Architecture

status

Public

date

Feb 9, 2026

slug

designing_data-intensive_applications-1

summary

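This post covers designing data-intensive applications: it explains the difference between operational (OLTP) and analytical (OLAP) systems and introduces the concepts of the data warehouse and the data lake. It compares the pros and cons of cloud services and self-hosted systems, and discusses when to move from a single-node system to a distributed one. It also touches on the importance of balancing data systems with law and society.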

type

Post

Preface

You will develop a strong intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.

Chapter 1. Trade-offs in Data Systems Architecture

However, as the data volume or the rate of queries grows, it needs to be distributed across multiple machines, which introduces many challenges.

Operational Versus Analytical Systems

: the difference between operational and analytical systems

Characterizing Transaction Processing and Analytics

transaction: a group of reads and writes that form a logical unit.

There are two ways of looking at data:

Property	Operational systems (OLTP)	Analytical systems (OLAP)
Main read pattern	Point queries (fetch individual records by key)	Aggregate over large number of records
Main write pattern	Create, update, and delete individual records	Bulk import (ETL) or event stream
Human user example	End user of web/mobile application	Internal analyst, for decision support
Machine use example	Checking if an action is authorized	Detecting fraud/abuse patterns
Type of queries	Fixed set of queries, predefined by application	Analyst can make arbitrary queries
Data represents	Latest state of data (current point in time)	History of events that happened over time
Dataset size	Gigabytes to terabytes	Terabytes to petabytes

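OLTP systems have a fixed set of queries that are already baked into the application code. Custom queries are run only occasionally, for administration or troubleshooting.

The reason is that a hand-written SQL query can end up modifying data, can be expensive to execute, and can degrade database performance for everyone else using it.

OLAP systems, by contrast, leave analysts free to write arbitrary SQL and to shape the data for visualizations or dashboards.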
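The two access patterns in the table above can be sketched with a small in-memory SQLite database. This is only an illustration: the `orders` table, its columns, and the helper names `fetch_order` and `revenue_by_customer` are made up for this note and do not come from the book.

```python
import sqlite3

# Hypothetical table used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],
)


def fetch_order(order_id):
    """OLTP-style point query: fetch one record by its key."""
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()


def revenue_by_customer():
    """OLAP-style query: aggregate over every record in the table."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)
```

The point query touches a single row through the primary key, while the aggregate has to read the whole table, which is why the second kind gets expensive as the dataset grows.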

Data Warehousing

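At first, the same database was used for both transaction processing and analytical queries. → Over time, building a separate data warehouse for analytics became the trend.

Business analysts and data scientists stopped querying OLTP systems directly for the following reasons:

The data is spread across many operational systems, so a single query cannot combine those datasets → the data silo problem

+) Data silos

Data is locked inside a particular department or system of the organization and is not compatible or shared with the rest.

Schemas and data layouts that suit OLTP are poorly suited to analytics.

Analytical queries are fairly expensive, and running them on an OLTP database can cause performance problems for other users.

For security or compliance reasons, the OLTP systems may sit in a separate network that analysts cannot access directly.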

A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations

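A data warehouse contains read-only copies of the data from the company's various OLTP databases.

The data is extracted either as a periodic data dump or as a continuous stream of updates.

→ This process of getting data into the data warehouse is called Extract-Transform-Load (ETL).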

Simplified outline of ETL into a data warehouse
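A minimal sketch of the three ETL steps, assuming two in-memory SQLite databases stand in for an OLTP system and a warehouse. The `sales` and `fact_sales` tables, their columns, and the cleanup rules in `transform` are hypothetical choices for illustration, not the book's schema.

```python
import sqlite3


def extract(oltp):
    """Extract: read the raw rows out of the operational database."""
    return oltp.execute(
        "SELECT id, customer, amount_cents, created_at FROM sales"
    ).fetchall()


def transform(rows):
    """Transform: normalize names, convert cents to units, keep only the date."""
    return [
        (sale_id, customer.strip().lower(), cents / 100, created_at[:10])
        for sale_id, customer, cents, created_at in rows
    ]


def load(warehouse, rows):
    """Load: write the analysis-friendly rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?)", rows
    )
    warehouse.commit()


# Stand-in operational database with two sample rows.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE sales "
    "(id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER, created_at TEXT)"
)
oltp.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, " Alice ", 1250, "2026-02-01T09:30:00"), (2, "BOB", 400, "2026-02-01T17:05:00")],
)

# Stand-in warehouse; analysts would query this copy, not the OLTP database.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(oltp)))
```

Because `load` uses `INSERT OR REPLACE`, running the pipeline again after a periodic dump overwrites rows with the same key instead of duplicating them.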

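When data is accessible only through a software vendor's API, an ETL process can pull it from the external system into your own data warehouse, where it can then be analyzed.

More recently, hybrid transactional/analytical processing (HTAP) systems offer both OLTP and analytics in one system without an ETL step. Even so, transactional and analytical systems are usually kept separate because they have different goals and requirements.

On the operational side, each service has its own database, or the data is otherwise spread across systems,

whereas on the analytics side, a single query has to cover data from many systems, so there is typically one data warehouse.

→ HTAP does not replace the data warehouse; rather, it fits applications that need to both perform analytics queries that scan a large number of rows and read and update individual records with low latency.

There have been various efforts to use SQL for machine learning, but data scientists tend not to prefer a relational database as their data warehouse.

→ Organizations therefore build a data lake to give data scientists data in a form that suits them.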

a data lake: a centralized data repository that holds a copy of any data that might be useful for analysis, obtained from operational systems via ETL processes.

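Unlike a data warehouse, a data lake holds all kinds of data without imposing any file format, data model, or schema.

Because a data lake keeps the data in raw form, before it is converted to a relational format, data scientists can shape it in whatever way best serves their goal.

File-based data analysis accumulates data and processes it all at once (batch), whereas stream processing handles each event in real time as it occurs.

How important stream processing is depends on the application and how time-sensitive it is.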
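The difference between batch and stream processing can be sketched with a simple event counter. The event shape, `batch_count`, and `StreamingCounter` are hypothetical names chosen for this note; a real system would read from files or a message broker rather than a Python list.

```python
from collections import Counter

# Hypothetical events; in practice these would arrive from logs or a broker.
EVENTS = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "purchase"},
    {"user": "alice", "action": "purchase"},
]


def batch_count(events):
    """Batch: wait until all events are collected, then process them at once."""
    return Counter(event["action"] for event in events)


class StreamingCounter:
    """Stream: update the result incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["action"]] += 1
        # An up-to-date snapshot is available after every single event.
        return dict(self.counts)


stream = StreamingCounter()
snapshots = [stream.on_event(event) for event in EVENTS]
```

Both approaches end with the same totals; the streaming version simply has a usable answer after every event, which is what matters for time-sensitive applications.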

Systems of Record and Derived Data

Systems of record

source of truth

The original data

Derived data systems

the result of taking some existing data from another system and transforming or processing it in some way

Can be recreated from the original source if lost

Derived data is redundant in a sense, but it is needed to improve read performance.
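A small sketch of the relationship, assuming a plain dictionary plays the system of record and a lookup index is the derived data. The `users` records and `build_city_index` are hypothetical and exist only for this illustration.

```python
# System of record: the authoritative copy of each user (hypothetical data).
users = {
    1: {"name": "Alice", "city": "Seoul"},
    2: {"name": "Bob", "city": "Busan"},
    3: {"name": "Carol", "city": "Seoul"},
}


def build_city_index(records):
    """Derive a city -> user ids index from the system of record."""
    index = {}
    for user_id, record in records.items():
        index.setdefault(record["city"], []).append(user_id)
    return index


# Derived data: redundant, but it turns a full scan into a single lookup.
city_index = build_city_index(users)

# If the derived data is lost, it can be rebuilt from the source of truth.
city_index = None
city_index = build_city_index(users)
```

Looking up `city_index["Seoul"]` avoids scanning every user, which is the read-performance benefit that justifies keeping the redundant copy.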

Cloud versus Self-Hosting

: pros and cons of cloud services and self-hosted systems

Pros and Cons of Cloud Services

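Pros:

Frees you from operational work (responsibilities are divided with the vendor)

Sometimes it is also cheaper

Cons:

You have no control over the service

If a feature you need is missing, you have to wait for the vendor to add it rather than building it yourself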

If the service goes down, all you can do is to wait for it to recover.

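When there is a bug or a performance problem, it can be hard to diagnose the issue precisely. Access to the service internals is restricted, so you often can do little more than look at logs.

If the service is shut down or becomes unreasonably expensive, continuing to run an old version of the software is not an option, and you are forced to migrate to an alternative service. Vendors may provide APIs for migration, but there is no standard, and migration always comes at a cost.

You have to trust the provider to keep your data secure.

Even though the list of cons looks long, more and more organizations are adopting cloud services or moving to a hybrid setup.

Still, cloud services cannot replace every in-house data system:

old systems (legacy): many systems predate the cloud by a long time and are not easy to move

specialist requirements: special requirements that the general-purpose features of cloud services cannot meet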

Cloud-Native System Architecture

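Existing software can be run in the cloud, but systems designed for the cloud from the start, known as cloud-native systems, have come to offer much stronger advantages.

In recent years many cloud-native systems have emerged, that is, cloud-based services that can stand in for self-hosted systems.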

Category	Self-hosted systems	Cloud-native systems
Operational/OLTP	MySQL, PostgreSQL, MongoDB	AWS Aurora, Azure SQL DB Hyperscale, Google Cloud Spanner
Analytical/OLAP	Teradata, ClickHouse, Spark	Snowflake, Google BigQuery, Azure Synapse Analytics

cloud-native system: storage (disk) and computation (CPU and RAM) are separated or disaggregated

Distributed versus Single-Node Systems

: when to move from single-node systems to distributed systems

There are many reasons to use a distributed system.

services that you write yourself (application code)
services consisting of off-the-shelf software (such as databases)

Because communication has to go over the network, not every node responds quickly.

Microservice

allowing different teams to make progress independently without having to coordinate with each other

Serverless

each serverless function execution still runs on a server, but subsequent executions might run on a different one

Data Systems, Law, and Society

: balancing the needs of the business and the rights of the user

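GDPR (General Data Protection Regulation), CCPA (California Consumer Privacy Act) → as regulations, including AI-related ones, have grown, the processing of personal data has become subject to many restrictions

Simply storing everything is not the answer.

If data is leaked or compromised, the legal costs and fines can outweigh the benefit of having stored it.

Storing personal data because it might be useful in the future runs against the principle of data minimization > under GDPR, personal data that can identify an individual may not be kept for an unspecified later purpose.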

but it’s advisable not to rush into making a system distributed if it’s possible to keep it on a single machine

Series: Designing Data-Intensive Applications

1. Designing Data-Intensive Applications - (1) Trade-offs in Data Systems Architecture
2. Designing Data-Intensive Applications - (2) Defining NonFunctional Requirements
