1.7 System Design
What is System Design?
System design is the process of defining the architecture, modules, connections, and data for a system to solve a problem. It is a blueprint for the system and the development team to follow.
Architecture Decision Record (ADR)
An architecture decision record (ADR) is a document that captures a significant decision made during the design process. It documents the rationale behind the design choice and the trade-offs involved. The collection of ADRs forms a living record that evolves as the system evolves.
How to Design
Most of the systems that you’re going to build will have the following components:
Frontend: The user interface that interacts with the user. It can be a web application, mobile application, or desktop application.
Backend: The server-side logic that processes requests from the frontend and interacts with the database.
Database: The storage layer that stores the data used by the system.
Network: The communication layer that allows the different components of the system to talk to each other.
Note
The simplest application you can build is a web page or web API running behind a web server or reverse proxy.
After that you need to think about the following:
Scalability: How will the system handle increased load? Will it be able to scale horizontally or vertically?
Availability: How will the system handle failures? Will it be able to recover from failures automatically?
Security: How will the system protect sensitive data? Will it be able to detect and prevent security breaches?
Performance: How will the system handle large amounts of data? Will it be able to process requests quickly and efficiently?
Step 1: Start With a Simple Design
An MVP (Minimum Viable Product), a POC (Proof of Concept), or that hackathon app you built in 24 hours.
Focus on solving the problem with the simplest design possible.
Begin with a single instance architecture: a simple web server (could be a Node.js/Express, Python/Flask, Java/Spring, etc.) connected to a single database (e.g., MySQL/MariaDB or PostgreSQL).
Run behind a basic reverse proxy or web server (like Nginx or Apache) to handle HTTP requests.
This delivers core functionality quickly, helps you validate requirements, and lets you gather feedback early.
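As a sketch of Step 1, a single-instance service can be as small as one process answering HTTP. The example below uses only Python's standard library (wsgiref) instead of a full framework like Flask, and the /health route is an illustrative choice, not something the text prescribes:

```python
from wsgiref.simple_server import make_server
import json

def app(environ, start_response):
    """A minimal WSGI application: one process, no external dependencies."""
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        # A health endpoint like this is what a load balancer would probe later.
        body = json.dumps({"status": "ok"}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
    else:
        body = b"Hello from a single-instance server"
        start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# To run it locally (behind Nginx/Apache in a real deployment):
#   with make_server("127.0.0.1", 8000, app) as server:
#       server.serve_forever()
```

In production this process would sit behind the reverse proxy mentioned above, which handles TLS, compression, and static files.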
Identify Bottlenecks Early
Monitor where the system slows down under load: CPU spikes, DB connections, memory usage.
This insight guides scaling, caching, or load-balancing decisions.
Step 2: Consider Scalability
Scalability is the ability of your system to handle increasing load without sacrificing performance. Two primary ways to scale:
Vertical Scaling (Scale Up)
Add more CPU, memory, or disk resources to a single machine.
Easiest to implement initially (just upgrade the server specs), but limited by the maximum capacity of a single box.
Horizontal Scaling (Scale Out)
Run multiple servers (or instances) in parallel and distribute incoming requests among them.
Requires stateless services when possible (so any instance can handle a request).
Often uses load balancers (e.g., AWS ELB, Nginx, HAProxy) to distribute traffic.
Potentially unlimited growth, but more complex to manage (e.g., data consistency, orchestration).
Common Strategies to Support Horizontal Scaling
Load Balancing:
A load balancer (e.g., HAProxy, Nginx, AWS ALB, Azure Load Balancer) sits between clients and server instances.
Routes requests based on algorithms (round-robin, least connections, etc.).
Ensures no single instance is overwhelmed.
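To make the two routing algorithms above concrete, here is a simplified in-process sketch (a real load balancer also does health checks, connection pooling, and TLS termination; the backend names are placeholders):

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backends in order, so each gets an equal share of requests."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Picks the backend currently handling the fewest in-flight requests."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}  # backend -> open connections

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Called when a request finishes on that backend.
        self.active[backend] -= 1
```

Round-robin is simplest; least-connections adapts better when requests have uneven durations.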
Caching:
Local Caching or CDN (Content Delivery Network) for static assets (images, CSS, JavaScript).
Application Cache (e.g., Redis, Memcached) to store frequently accessed data (session data, query results).
Database Caching (caching query results, indexes) to reduce expensive queries.
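The application-cache idea above is usually implemented as the cache-aside pattern: check the cache, and only on a miss hit the slow backend. A tiny in-memory stand-in for what Redis or Memcached would do over the network (the TTL value and loader function are illustrative):

```python
import time

class TTLCache:
    """Cache-aside helper: entries expire ttl_seconds after being stored."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]              # cache hit: skip the expensive call
        value = loader(key)              # cache miss: hit the slow backend
        self.store[key] = (value, now + self.ttl)
        return value
```

The same pattern applies whether the backend is a database query, an RPC, or a computation; the TTL bounds how stale served data can be.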
Database Sharding or Replication:
Primary-Replica Replication (historically called master-slave) for read scalability; reads go to the replicas, writes go to the primary.
Sharding (horizontal partitioning) splits data across multiple database servers to handle large datasets or extremely high throughput.
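The core of sharding is a deterministic routing function from key to shard. A minimal sketch (note that Python's built-in hash() is salted per process, so a stable hash like SHA-256 is used instead; the key format is just an example):

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard deterministically, so every node agrees on placement."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# e.g. route "user:42" to one of 4 database servers:
shard = shard_for("user:42", 4)
```

One caveat this sketch glosses over: changing num_shards remaps almost every key, which is why production systems often use consistent hashing instead of a plain modulo.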
Step 3: Ensure High Availability
Availability means the system remains accessible and functional even when components fail.
Redundancy
Run multiple instances of each service, typically across different availability zones or data centers.
If one instance fails, traffic automatically fails over to others.
Active-Active vs. Active-Passive
Active-Active: All instances handle requests concurrently. If one fails, the others take over.
Active-Passive: A primary instance handles all traffic, and a secondary instance is ready to take over if needed.
Failover Mechanisms
Health Checks: Periodically test each instance. If an instance is down, remove it from the load balancer rotation.
Automatic or Manual Failover: Detect a failed database server and promote a replica to primary if needed.
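The health-check loop above can be sketched as follows; the probe itself (an HTTP GET to /health, a TCP connect, etc.) is abstracted into a check function, and the instance names are placeholders:

```python
class HealthCheckedPool:
    """Keeps only instances that pass their health check in rotation."""
    def __init__(self, instances, check):
        self.instances = instances
        self.check = check               # probe: returns True if healthy
        self.healthy = set(instances)

    def run_health_checks(self):
        for inst in self.instances:
            if self.check(inst):
                self.healthy.add(inst)       # recovered: back in rotation
            else:
                self.healthy.discard(inst)   # failed: removed from rotation

    def in_rotation(self):
        return sorted(self.healthy)
```

Real load balancers add hysteresis (e.g., require N consecutive failures before eviction) so a single slow response doesn't flap an instance in and out.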
Resiliency Patterns
Circuit Breaker: If a downstream service is failing, the circuit breaker trips to prevent cascading failures.
Bulkhead Pattern: Partition resources (threads, connections) so a failure in one service doesn’t bring down the entire system.
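A minimal sketch of the circuit breaker pattern described above (the thresholds are illustrative; production implementations also distinguish a half-open state with limited trial traffic):

```python
import time

class CircuitBreaker:
    """Trips open after max_failures consecutive errors; while open, calls
    fail fast instead of hammering the broken downstream service."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # timeout elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                # success resets the failure count
        return result
```

The fast failure is the point: callers get an immediate error they can handle (fallback, cached value) instead of queuing up behind a dead dependency.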
Step 4: Incorporate Security Best Practices
Security should be baked in from the start, not bolted on later. Consider:
Authentication and Authorization
Use proven protocols and standards (OAuth 2.0, SAML, JWTs) to ensure only authorized users can access resources.
Employ role-based access control (RBAC) or attribute-based access control (ABAC) for fine-grained permissions.
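The RBAC check itself is simple once roles map to permission sets; the roles and permissions below are hypothetical, for illustration only:

```python
# Hypothetical role-to-permission mapping (not from any specific framework).
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin":  {"read", "write", "delete"},
}

def is_allowed(user_roles, permission):
    """RBAC check: allow if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)
```

ABAC generalizes this by evaluating attributes (resource owner, time of day, request origin) rather than a fixed role table.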
Transport Layer Security (TLS)
Encrypt data in transit (HTTPS) to prevent eavesdropping or tampering.
Use robust certificates, ensure TLS configurations are up to date (disable weak ciphers, etc.).
Data Protection
Encrypt data at rest (disk-level, database-level encryption).
Use secure storage for sensitive data (e.g., secrets management with Vault, AWS KMS).
Sanitize inputs to avoid SQL injection, cross-site scripting (XSS), cross-site request forgery (CSRF), etc.
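For SQL injection specifically, the standard defense is parameterized queries: the driver treats user input as data, never as SQL. A self-contained sketch using Python's built-in sqlite3 (the table and rows are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user(name):
    # Parameterized query: `?` is bound by the driver, so input like
    # "' OR '1'='1" is matched literally as a name, not executed as SQL.
    cur = conn.execute("SELECT email FROM users WHERE name = ?", (name,))
    return cur.fetchall()

# Never build the query by string concatenation:
#   conn.execute(f"SELECT email FROM users WHERE name = '{name}'")  # injectable
```

The same placeholder mechanism exists in every mainstream driver and ORM; XSS and CSRF need their own defenses (output encoding, anti-CSRF tokens).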
Infrastructure Security
Network Segmentation: Isolate critical services in private subnets.
Firewalls and Security Groups: Restrict inbound and outbound traffic to necessary ports.
Monitoring and Auditing: Track system logs, user actions, anomalies (e.g., suspicious login attempts).
Step 5: Optimize Performance
Performance revolves around latency (response time) and throughput (requests handled per second).
Caching Strategy
Application-Level Cache: Use Redis or Memcached to cache the results of expensive queries or computations.
Database Query Optimization: Proper indexing, avoiding full table scans, denormalizing if needed.
Local Cache or Content Delivery Network: Edge caching for static content and possibly dynamic content with advanced rules.
Asynchronous Processing
Offload long-running tasks to background jobs (e.g., using a message queue like RabbitMQ, Kafka, or AWS SQS).
Frees up the main request cycle for quick responses.
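The producer/consumer shape of this is easy to show in-process with Python's queue module standing in for RabbitMQ, Kafka, or SQS (the "send email" job is a placeholder task):

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for an external message broker
results = []

def worker():
    """Background consumer: pulls jobs so the request path stays fast."""
    while True:
        job = tasks.get()
        if job is None:            # sentinel value: shut the worker down
            break
        results.append(f"sent email to {job}")
        tasks.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# The request handler just enqueues and returns immediately.
tasks.put("alice@example.com")
tasks.put("bob@example.com")

tasks.join()                       # demo only: wait for the queue to drain
tasks.put(None)
t.join()
```

With a real broker the producer and consumer run in different processes or machines, and the broker adds durability and retries that an in-memory queue lacks.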
Efficient Data Modeling
Choose SQL vs. NoSQL based on query patterns, relationships, and scale.
For very high throughput or flexible schemas, NoSQL (e.g., Cassandra, DynamoDB, MongoDB) might be more suitable.
For strong consistency and complex joins, a relational database is often best.
Load Testing and Profiling
Use tools like JMeter, Gatling, or Locust to simulate load.
Identify response-time degradation and concurrency issues.
Profile code to pinpoint CPU-intensive or memory-heavy operations.
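Profiling the backend code itself can be done with Python's built-in cProfile; the expensive function here is a placeholder for whatever CPU-heavy operation you suspect:

```python
import cProfile
import io
import pstats

def expensive(n):
    """Placeholder for a CPU-heavy operation worth profiling."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = expensive(100_000)
profiler.disable()

# Print the five hottest call paths by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

Load-testing tools like JMeter, Gatling, or Locust answer "how does the system behave under N concurrent users"; a profiler answers "where inside one request does the time go" — you typically need both.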
Step 6: Put It All Together
Design for Each Layer
Frontend: Optimize user experience (caching, CDNs) and handle authentication tokens securely.
Backend: Keep services stateless where possible; use message queues or streaming platforms for asynchronous flows.
Database: Plan for read replicas or partitioning; define a robust backup and disaster recovery strategy.
Network: Use load balancers, private subnets, firewalls, and TLS to secure communication.
Iterate and Evolve
Start simple, then incorporate more sophisticated patterns (microservices, advanced caching, sharding) as needed.
Maintain observability: Collect logs, metrics, and traces (e.g., using ELK stack, Prometheus, Grafana).
Continuously review performance and reliability metrics; refactor or re-architect as system demands grow.
Evaluate Trade-Offs
Understand CAP Theorem (Consistency, Availability, Partition Tolerance) for distributed systems.
Sometimes you’ll favor availability over consistency (and vice versa), depending on the use case (e.g., a social feed vs. a financial transaction system).
Example Architecture (Simplified)
Client → CDN → Load Balancer → Web/Backend Servers → Database
Caching Layer (e.g., Redis) alongside the backend to speed up frequent queries.
Worker Services pulling from a Message Queue for asynchronous tasks (notifications, data processing).
Monitoring (Prometheus, Grafana) + Logging (ELK stack) for real-time insights and alerting.
Automated Deployments (CI/CD pipelines) with rollbacks in case of failed releases.
Final Thoughts
Building robust, scalable systems involves iterative design, continuous monitoring, and a willingness to refactor. The initial goal is to satisfy core business requirements with a straightforward solution. As traffic grows or new features emerge, you incrementally introduce load balancing, caching, partitioning, and microservices.
Always keep in mind the key pillars:
Scalability: Horizontal vs. vertical scaling, load balancers, caching, partitioning, sharding.
Availability: Redundancy, failover, replication, resilient design.
Security: Defense in depth—secure data in transit and at rest, robust authentication, intrusion detection.
Performance: Profiling, caching, asynchronous jobs, efficient data modeling.