1.7 System Design
What is System Design?
System design is the process of defining the architecture, modules, connections, and data for a system to solve a problem. It is a blueprint for the system and the development team to follow.
Architecture Decision Record (ADR)
An architecture decision record (ADR) is a document that captures a significant decision made during the design process. It documents the rationale behind the design choice and the trade-offs involved. The collection of ADRs forms a living record that evolves as the system evolves.
How to Design
Most of the systems that you’re going to build will have the following components:
Frontend: The user interface that interacts with the user. It can be a web application, mobile application, or desktop application.
Backend: The server-side logic that processes requests from the frontend and interacts with the database.
Database: The storage layer that stores the data used by the system.
Network: The communication layer that allows the different components of the system to talk to each other.
Note
The simplest application you can build is a web page or web API running behind a web server or reverse proxy.
After that you need to think about the following:
Scalability: How will the system handle increased load? Will it be able to scale horizontally or vertically?
Availability: How will the system handle failures? Will it be able to recover from failures automatically?
Security: How will the system protect sensitive data? Will it be able to detect and prevent security breaches?
Performance: How will the system handle large amounts of data? Will it be able to process requests quickly and efficiently?
Step 1: Start With a Simple Design
An MVP (Minimum Viable Product), a POC (Proof of Concept), or that hackathon app you built in 24 hours.
Focus on solving the problem with the simplest design possible.
Begin with a single instance architecture: a simple web server (could be a Node.js/Express, Python/Flask, Java/Spring, etc.) connected to a single database (e.g., MySQL/MariaDB or PostgreSQL).
Run behind a basic reverse proxy or web server (like Nginx or Apache) to handle HTTP requests.
This delivers core functionality quickly, helps you validate requirements, and lets you gather feedback early.
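As a sketch of Step 1, a single-instance service can be as small as one process answering HTTP. The example below uses only Python's standard library (wsgiref) instead of a full framework like Flask, and the /health route is an illustrative choice, not something the text prescribes:

```python
from wsgiref.simple_server import make_server
import json

def app(environ, start_response):
    """A minimal WSGI application: one process, no external dependencies."""
    path = environ.get("PATH_INFO", "/")
    if path == "/health":
        # A health endpoint like this is what a load balancer would probe later.
        body = json.dumps({"status": "ok"}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
    else:
        body = b"Hello from a single-instance server"
        start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# To run it locally (behind Nginx/Apache in a real deployment):
#   with make_server("127.0.0.1", 8000, app) as server:
#       server.serve_forever()
```

In production this process would sit behind the reverse proxy mentioned above, which handles TLS, compression, and static files.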
Identify Bottlenecks Early
Monitor where the system slows down under load: CPU spikes, DB connections, memory usage.
This insight guides scaling, caching, or load-balancing decisions.
Step 2: Consider Scalability
Scalability is the ability of your system to handle increasing load without sacrificing performance. Two primary ways to scale:
Vertical Scaling (Scale Up)
Add more CPU, memory, or disk resources to a single machine.
Easiest to implement initially (just upgrade the server specs), but limited by the maximum capacity of a single box.
Horizontal Scaling (Scale Out)
Run multiple servers (or instances) in parallel and distribute incoming requests among them.
Requires stateless services when possible (so any instance can handle a request).
Often uses load balancers (e.g., AWS ELB, Nginx, HAProxy) to distribute traffic.
Potentially unlimited growth, but more complex to manage (e.g., data consistency, orchestration).
Common Strategies to Support Horizontal Scaling
Load Balancing:
A load balancer (e.g., HAProxy, Nginx, AWS ALB, Azure Load Balancer) sits between clients and server instances.
Routes requests based on algorithms (round-robin, least connections, etc.).
Ensures no single instance is overwhelmed.
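To make the two routing algorithms above concrete, here is a simplified in-process sketch (a real load balancer also does health checks, connection pooling, and TLS termination; the backend names are placeholders):

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backends in order, so each gets an equal share of requests."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Picks the backend currently handling the fewest in-flight requests."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}  # backend -> open connections

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Called when a request finishes on that backend.
        self.active[backend] -= 1
```

Round-robin is simplest; least-connections adapts better when requests have uneven durations.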
Caching:
Local Caching or CDN (Content Delivery Network) for static assets (images, CSS, JavaScript).
Application Cache (e.g., Redis, Memcached) to store frequently accessed data (session data, query results).
Database Caching (caching query results, indexes) to reduce expensive queries.
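The application-cache idea above is usually implemented as the cache-aside pattern: check the cache, and only on a miss hit the slow backend. A tiny in-memory stand-in for what Redis or Memcached would do over the network (the TTL value and loader function are illustrative):

```python
import time

class TTLCache:
    """Cache-aside helper: entries expire ttl_seconds after being stored."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]              # cache hit: skip the expensive call
        value = loader(key)              # cache miss: hit the slow backend
        self.store[key] = (value, now + self.ttl)
        return value
```

The same pattern applies whether the backend is a database query, an RPC, or a computation; the TTL bounds how stale served data can be.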
Database Sharding or Replication:
Primary-Replica Replication (historically called master-slave) for read scalability; reads go to the replicas, writes go to the primary.
Sharding (horizontal partitioning) splits data across multiple database servers to handle large datasets or extremely high throughput.
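The core of sharding is a deterministic routing function from key to shard. A minimal sketch (note that Python's built-in hash() is salted per process, so a stable hash like SHA-256 is used instead; the key format is just an example):

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard deterministically, so every node agrees on placement."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# e.g. route "user:42" to one of 4 database servers:
shard = shard_for("user:42", 4)
```

One caveat this sketch glosses over: changing num_shards remaps almost every key, which is why production systems often use consistent hashing instead of a plain modulo.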
Step 3: Ensure High Availability
Availability means the system remains accessible and functional even when components fail.
Redundancy
Run multiple instances of each service, typically across different availability zones or data centers.
If one instance fails, traffic automatically fails over to others.
Active-Active vs. Active-Passive
Active-Active: All instances handle requests concurrently. If one fails, the others take over.
Active-Passive: A primary instance handles all traffic, and a secondary instance is ready to take over if needed.
Failover Mechanisms
Health Checks: Periodically test each instance. If an instance is down, remove it from the load balancer rotation.
Automatic or Manual Failover: Detect a failed database server and promote a replica to primary if needed.
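The health-check loop above can be sketched as follows; the probe itself (an HTTP GET to /health, a TCP connect, etc.) is abstracted into a check function, and the instance names are placeholders:

```python
class HealthCheckedPool:
    """Keeps only instances that pass their health check in rotation."""
    def __init__(self, instances, check):
        self.instances = instances
        self.check = check               # probe: returns True if healthy
        self.healthy = set(instances)

    def run_health_checks(self):
        for inst in self.instances:
            if self.check(inst):
                self.healthy.add(inst)       # recovered: back in rotation
            else:
                self.healthy.discard(inst)   # failed: removed from rotation

    def in_rotation(self):
        return sorted(self.healthy)
```

Real load balancers add hysteresis (e.g., require N consecutive failures before eviction) so a single slow response doesn't flap an instance in and out.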
Resiliency Patterns
Circuit Breaker: If a downstream service is failing, the circuit breaker trips to prevent cascading failures.
Bulkhead Pattern: Partition resources (threads, connections) so a failure in one service doesn’t bring down the entire system.
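A minimal sketch of the circuit breaker pattern described above (the thresholds are illustrative; production implementations also distinguish a half-open state with limited trial traffic):

```python
import time

class CircuitBreaker:
    """Trips open after max_failures consecutive errors; while open, calls
    fail fast instead of hammering the broken downstream service."""
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # timeout elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                # success resets the failure count
        return result
```

The fast failure is the point: callers get an immediate error they can handle (fallback, cached value) instead of queuing up behind a dead dependency.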
Step 4: Incorporate Security Best Practices
Security should be baked in from the start, not bolted on later. Consider:
Authentication and Authorization
Use proven protocols and standards (OAuth 2.0, SAML, JWTs) to ensure only authorized users can access resources.
Employ role-based access control (RBAC) or attribute-based access control (ABAC) for fine-grained permissions.
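The RBAC check itself is simple once roles map to permission sets; the roles and permissions below are hypothetical, for illustration only:

```python
# Hypothetical role-to-permission mapping (not from any specific framework).
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin":  {"read", "write", "delete"},
}

def is_allowed(user_roles, permission):
    """RBAC check: allow if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)
```

ABAC generalizes this by evaluating attributes (resource owner, time of day, request origin) rather than a fixed role table.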
Transport Layer Security (TLS)
Encrypt data in transit (HTTPS) to prevent eavesdropping or tampering.
Use robust certificates, ensure TLS configurations are up to date (disable weak ciphers, etc.).
Data Protection
Encrypt data at rest (disk-level, database-level encryption).
Use secure storage for sensitive data (e.g., secrets management with Vault, AWS KMS).
Sanitize inputs to avoid SQL injection, cross-site scripting (XSS), cross-site request forgery (CSRF), etc.
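For SQL injection specifically, the standard defense is parameterized queries: the driver treats user input as data, never as SQL. A self-contained sketch using Python's built-in sqlite3 (the table and rows are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user(name):
    # Parameterized query: `?` is bound by the driver, so input like
    # "' OR '1'='1" is matched literally as a name, not executed as SQL.
    cur = conn.execute("SELECT email FROM users WHERE name = ?", (name,))
    return cur.fetchall()

# Never build the query by string concatenation:
#   conn.execute(f"SELECT email FROM users WHERE name = '{name}'")  # injectable
```

The same placeholder mechanism exists in every mainstream driver and ORM; XSS and CSRF need their own defenses (output encoding, anti-CSRF tokens).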
Infrastructure Security
Network Segmentation: Isolate critical services in private subnets.
Firewalls and Security Groups: Restrict inbound and outbound traffic to necessary ports.
Monitoring and Auditing: Track system logs, user actions, anomalies (e.g., suspicious login attempts).
Step 5: Optimize Performance
Performance revolves around latency (response time) and throughput (requests handled per second).
Caching Strategy
Application-Level Cache: Use Redis or Memcached to cache the results of expensive queries or computations.
Database Query Optimization: Proper indexing, avoiding full table scans, denormalizing if needed.
Local Cache or Content Delivery Network: Edge caching for static content and possibly dynamic content with advanced rules.
Asynchronous Processing
Offload long-running tasks to background jobs (e.g., using a message queue like RabbitMQ, Kafka, or AWS SQS).
Frees up the main request cycle for quick responses.
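The producer/consumer shape of this is easy to show in-process with Python's queue module standing in for RabbitMQ, Kafka, or SQS (the "send email" job is a placeholder task):

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for an external message broker
results = []

def worker():
    """Background consumer: pulls jobs so the request path stays fast."""
    while True:
        job = tasks.get()
        if job is None:            # sentinel value: shut the worker down
            break
        results.append(f"sent email to {job}")
        tasks.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# The request handler just enqueues and returns immediately.
tasks.put("alice@example.com")
tasks.put("bob@example.com")

tasks.join()                       # demo only: wait for the queue to drain
tasks.put(None)
t.join()
```

With a real broker the producer and consumer run in different processes or machines, and the broker adds durability and retries that an in-memory queue lacks.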
Efficient Data Modeling
Choose SQL vs. NoSQL based on query patterns, relationships, and scale.
For very high throughput or flexible schemas, NoSQL (e.g., Cassandra, DynamoDB, MongoDB) might be more suitable.
For strong consistency and complex joins, a relational database is often best.
Load Testing and Profiling
Use tools like JMeter, Gatling, or Locust to simulate load.
Identify response-time degradation and concurrency issues.
Profile code to pinpoint CPU-intensive or memory-heavy operations.
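Profiling the backend code itself can be done with Python's built-in cProfile; the expensive function here is a placeholder for whatever CPU-heavy operation you suspect:

```python
import cProfile
import io
import pstats

def expensive(n):
    """Placeholder for a CPU-heavy operation worth profiling."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = expensive(100_000)
profiler.disable()

# Print the five hottest call paths by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
```

Load-testing tools like JMeter, Gatling, or Locust answer "how does the system behave under N concurrent users"; a profiler answers "where inside one request does the time go" — you typically need both.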
Step 6: Put It All Together
Design for Each Layer
Frontend: Optimize user experience (caching, CDNs) and handle authentication tokens securely.
Backend: Keep services stateless where possible; use message queues or streaming platforms for asynchronous flows.
Database: Plan for read replicas or partitioning; define a robust backup and disaster recovery strategy.
Network: Use load balancers, private subnets, firewalls, and TLS to secure communication.
Iterate and Evolve
Start simple, then incorporate more sophisticated patterns (microservices, advanced caching, sharding) as needed.
Maintain observability: Collect logs, metrics, and traces (e.g., using ELK stack, Prometheus, Grafana).
Continuously review performance and reliability metrics; refactor or re-architect as system demands grow.
Evaluate Trade-Offs
Understand CAP Theorem (Consistency, Availability, Partition Tolerance) for distributed systems.
Sometimes you’ll favor availability over consistency (and vice versa), depending on the use case (e.g., a social feed vs. a financial transaction system).
Example Architecture (Simplified)
Client → CDN → Load Balancer → Web/Backend Servers → Database
Caching Layer (e.g., Redis) alongside the backend to speed up frequent queries.
Worker Services pulling from a Message Queue for asynchronous tasks (notifications, data processing).
Monitoring (Prometheus, Grafana) + Logging (ELK stack) for real-time insights and alerting.
Automated Deployments (CI/CD pipelines) with rollbacks in case of failed releases.
Final Thoughts
Building robust, scalable systems involves iterative design, continuous monitoring, and a willingness to refactor. The initial goal is to satisfy core business requirements with a straightforward solution. As traffic grows or new features emerge, you incrementally introduce load balancing, caching, partitioning, and microservices.
Always keep in mind the key pillars:
Scalability: Horizontal vs. vertical scaling, load balancers, caching, partitioning, sharding.
Availability: Redundancy, failover, replication, resilient design.
Security: Defense in depth—secure data in transit and at rest, robust authentication, intrusion detection.
Performance: Profiling, caching, asynchronous jobs, efficient data modeling.