1.7 System Design

What is System Design?

System design is the process of defining the architecture, modules, connections, and data for a system to solve a problem. It is a blueprint for the system and the development team to follow.

Architecture Decision Record (ADR)

An architecture decision record (ADR) is a document that captures a significant decision made during the design process, along with its context and consequences. It is a way to document the rationale behind design choices and the trade-offs made. The collection of ADRs grows alongside the system, forming a log of why it looks the way it does.

How to Design

Most of the systems that you’re going to build will have the following components:

  1. Frontend: The user interface that interacts with the user. It can be a web application, mobile application, or desktop application.

  2. Backend: The server-side logic that processes requests from the frontend and interacts with the database.

  3. Database: The storage layer that stores the data used by the system.

  4. Network: The communication layer that allows the different components of the system to talk to each other.

Note

The simplest application you can build is a web page or web API served behind a web server or reverse proxy.

After that you need to think about the following:

  • Scalability: How will the system handle increased load? Will it be able to scale horizontally or vertically?

  • Availability: How will the system handle failures? Will it be able to recover from failures automatically?

  • Security: How will the system protect sensitive data? Will it be able to detect and prevent security breaches?

  • Performance: How will the system handle large amounts of data? Will it be able to process requests quickly and efficiently?

Step 1: Start With a Simple Design

  1. MVP (Minimum Viable Product) or POC (Proof of Concept) or that Hackathon App that you made in 24 hours.

    • Focus on solving the problem with the simplest design possible.

    • Begin with a single-instance architecture: a simple web server (Node.js/Express, Python/Flask, Java/Spring, etc.) connected to a single database (e.g., MySQL/MariaDB or PostgreSQL).

    • Run behind a basic reverse proxy or web server (like Nginx or Apache) to handle HTTP requests.

    • This delivers core functionality quickly and helps you validate requirements and gather feedback early.

  2. Identify Bottlenecks Early

    • Monitor where the system slows down under load: CPU spikes, DB connections, memory usage.

    • This insight guides scaling, caching, or load-balancing decisions.
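
The single-instance setup above can be sketched with Python's standard library alone (standing in for Flask/Express; the endpoint and schema are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import sqlite3

# One web server, one database: the whole MVP in a single process.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO items (name) VALUES ('first')")
db.commit()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A single read endpoint: query the database, return JSON.
        rows = db.execute("SELECT id, name FROM items").fetchall()
        body = json.dumps([{"id": r[0], "name": r[1]} for r in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

# To run: HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

In production you would place Nginx or another reverse proxy in front of this process to terminate TLS and serve static assets.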

Step 2: Consider Scalability

Scalability is the ability of your system to handle increasing load without sacrificing performance. Two primary ways to scale:

  1. Vertical Scaling (Scale Up)

    • Add more CPU, memory, or disk resources to a single machine.

    • Easiest to implement initially (just upgrade the server specs), but limited by the maximum capacity of a single box.

  2. Horizontal Scaling (Scale Out)

    • Run multiple servers (or instances) in parallel and distribute incoming requests among them.

    • Requires stateless services when possible (so any instance can handle a request).

    • Often uses load balancers (e.g., AWS ELB, Nginx, HAProxy) to distribute traffic.

    • Potentially unlimited growth, but more complex to manage (e.g., data consistency, orchestration).
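
The "stateless services" requirement is easiest to see with session data. If sessions live in a shared external store, any instance can handle any request; a minimal sketch (a dict stands in for Redis, and the instance names are illustrative):

```python
# Shared session store, external to every app instance (stands in for Redis).
shared_sessions = {}

class Instance:
    """One stateless app server: it keeps no per-user state in memory."""
    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store

    def login(self, user):
        self.sessions[user] = {"user": user, "cart": []}

    def add_to_cart(self, user, item):
        # Works no matter which instance handled the original login.
        self.sessions[user]["cart"].append(item)
        return self.sessions[user]["cart"]

a = Instance("app-1", shared_sessions)
b = Instance("app-2", shared_sessions)
a.login("alice")                        # first request lands on app-1
print(b.add_to_cart("alice", "book"))   # a later request lands on app-2
```

If each instance instead kept sessions in its own memory, the second request would fail whenever the load balancer routed it to a different machine.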

Common Strategies to Support Horizontal Scaling

  1. Load Balancing:

    • A load balancer (e.g., HAProxy, Nginx, AWS ALB, Azure Load Balancer) sits between clients and server instances.

    • Routes requests based on algorithms (round-robin, least connections, etc.).

    • Ensures no single instance is overwhelmed.

  2. Caching:

    • Local Caching or CDN (Content Delivery Network) for static assets (images, CSS, JavaScript).

    • Application Cache (e.g., Redis, Memcached) to store frequently accessed data (session data, query results).

    • Database Caching (caching query results, indexes) to reduce expensive queries.

  3. Database Sharding or Replication:

    • Primary-Replica Replication (also called master-slave replication) for read scalability; reads go to the replicas, writes go to the primary.

    • Sharding (horizontal partitioning) splits data across multiple database servers to handle large datasets or extremely high throughput.
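
The core of sharding is a deterministic mapping from a key to a shard. A minimal hash-based router might look like this (shard names are illustrative):

```python
import hashlib

# Illustrative shard names; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(key: str) -> str:
    # A stable hash guarantees the same key always maps to the same shard.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # every lookup for user:42 hits the same shard
```

Note the trade-off: with plain modulo hashing, adding or removing a shard remaps most keys; consistent hashing is the usual fix when the shard count changes over time.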

Step 3: Ensure High Availability

Availability means the system remains accessible and functional even when components fail.

  1. Redundancy

    • Run multiple instances of each service, typically across different availability zones or data centers.

    • If one instance fails, traffic automatically fails over to others.

  2. Active-Active vs. Active-Passive

    • Active-Active: All instances handle requests concurrently. If one fails, the others take over.

    • Active-Passive: A primary instance handles all traffic, and a secondary instance is ready to take over if needed.

  3. Failover Mechanisms

    • Health Checks: Periodically test each instance. If an instance is down, remove it from the load balancer rotation.

    • Automatic or Manual Failover: Detect a failed database server and promote a replica to primary if needed.

  4. Resiliency Patterns

    • Circuit Breaker: If a downstream service is failing, the circuit breaker trips to prevent cascading failures.

    • Bulkhead Pattern: Partition resources (threads, connections) so a failure in one service doesn’t bring down the entire system.
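
The circuit breaker pattern mentioned above can be sketched in a few lines: after enough consecutive failures the breaker "opens" and fails fast, then allows a trial call after a cooldown (thresholds and timings are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N failures, retry after a cooldown."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a failing downstream service.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: go half-open and allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Production implementations (e.g., resilience4j, Polly) add per-endpoint state, metrics, and concurrency safety on top of this same idea.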

Step 4: Incorporate Security Best Practices

Security should be baked in from the start, not bolted on later. Consider:

  1. Authentication and Authorization

    • Use secure protocols and token formats (OAuth, SAML, JWT) to ensure only authorized users can access resources.

    • Employ role-based access control (RBAC) or attribute-based access control (ABAC) for fine-grained permissions.

  2. Transport Layer Security (TLS)

    • Encrypt data in transit (HTTPS) to prevent eavesdropping or tampering.

    • Use robust certificates, ensure TLS configurations are up to date (disable weak ciphers, etc.).

  3. Data Protection

    • Encrypt data at rest (disk-level, database-level encryption).

    • Use secure storage for sensitive data (e.g., secrets management with Vault, AWS KMS).

    • Sanitize inputs to avoid SQL injection, cross-site scripting (XSS), cross-site request forgery (CSRF), etc.

  4. Infrastructure Security

    • Network Segmentation: Isolate critical services in private subnets.

    • Firewalls and Security Groups: Restrict inbound and outbound traffic to necessary ports.

    • Monitoring and Auditing: Track system logs, user actions, anomalies (e.g., suspicious login attempts).
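
To make the SQL injection point concrete, here is the difference between interpolating user input into a query and using parameterized placeholders (sqlite3 shown for brevity; the same rule applies to every database driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # a classic injection attempt

# UNSAFE: string interpolation lets the attacker rewrite the query.
unsafe = conn.execute(
    f"SELECT name FROM users WHERE name = '{user_input}'"
).fetchall()

# SAFE: the placeholder makes the driver treat input as data, not SQL.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(unsafe)  # [('alice',)] -- the injected OR clause matched every row
print(safe)    # [] -- no user is literally named "alice' OR '1'='1"
```

Parameterized queries (or an ORM that uses them internally) should be the default everywhere user input touches SQL.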

Step 5: Optimize Performance

Performance revolves around latency (response time) and throughput (requests handled per second).

  1. Caching Strategy

    • Application-Level Cache: Use Redis or Memcached to cache the results of expensive queries or computations.

    • Database Query Optimization: Proper indexing, avoiding full table scans, denormalizing if needed.

    • Local Cache or Content Delivery Network: Edge caching for static content and possibly dynamic content with advanced rules.

  2. Asynchronous Processing

    • Offload long-running tasks to background jobs (e.g., using a message queue like RabbitMQ, Kafka, or AWS SQS).

    • Frees up the main request cycle for quick responses.

  3. Efficient Data Modeling

    • Choose SQL vs. NoSQL based on query patterns, relationships, and scale.

    • For very high throughput or flexible schemas, NoSQL (e.g., Cassandra, DynamoDB, MongoDB) might be more suitable.

    • For strong consistency and complex joins, a relational database is often best.

  4. Load Testing and Profiling

    • Use tools like JMeter, Gatling, or Locust to simulate load.

    • Identify response time degradations, concurrency issues.

    • Profile code to pinpoint CPU-intensive or memory-heavy operations.

Step 6: Put It All Together

  1. Design for Each Layer

    • Frontend: Optimize user experience (caching, CDNs) and handle authentication tokens securely.

    • Backend: Keep services stateless where possible; use message queues or streaming platforms for asynchronous flows.

    • Database: Plan for read replicas or partitioning; define a robust backup and disaster recovery strategy.

    • Network: Use load balancers, private subnets, firewalls, and TLS to secure communication.

  2. Iterate and Evolve

    • Start simple, then incorporate more sophisticated patterns (microservices, advanced caching, sharding) as needed.

    • Maintain observability: Collect logs, metrics, and traces (e.g., using ELK stack, Prometheus, Grafana).

    • Continuously review performance and reliability metrics; refactor or re-architect as system demands grow.

  3. Evaluate Trade-Offs

    • Understand CAP Theorem (Consistency, Availability, Partition Tolerance) for distributed systems.

    • Sometimes you’ll favor availability over consistency (and vice versa), depending on the use case (e.g., a social feed vs. a financial transaction system).

Example Architecture (Simplified)

  • Client → CDN → Load Balancer → Web/Backend Servers → Database

  • Caching Layer (e.g., Redis) alongside the backend to speed up frequent queries.

  • Worker Services pulling from a Message Queue for asynchronous tasks (notifications, data processing).

  • Monitoring (Prometheus, Grafana) + Logging (ELK stack) for real-time insights and alerting.

  • Automated Deployments (CI/CD pipelines) with rollbacks in case of failed releases.
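
The load balancer at the center of this architecture can be sketched as round-robin routing that skips instances failing their health checks (backend names and the in-process model are illustrative; real balancers probe over the network):

```python
import itertools

class Backend:
    """One app server instance behind the load balancer."""
    def __init__(self, name):
        self.name = name
        self.healthy = True  # toggled by health checks

    def handle(self, request):
        return f"{self.name} handled {request}"

class LoadBalancer:
    def __init__(self, backends):
        self.backends = backends
        self._cycle = itertools.cycle(backends)

    def route(self, request):
        # Try each backend at most once per request; skip unhealthy ones.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend.healthy:
                return backend.handle(request)
        raise RuntimeError("no healthy backends")

backends = [Backend("web-1"), Backend("web-2"), Backend("web-3")]
lb = LoadBalancer(backends)
backends[1].healthy = False  # a health check marks web-2 as down
results = [lb.route(f"req-{i}") for i in range(3)]
print(results)  # web-2 never appears: traffic fails over to web-1 and web-3
```

This is the failover behavior from Step 3 in miniature: a failed instance is simply removed from rotation while the rest absorb its traffic.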

Final Thoughts

Building robust, scalable systems involves iterative design, continuous monitoring, and a willingness to refactor. The initial goal is to satisfy core business requirements with a straightforward solution. As traffic grows or new features emerge, you incrementally introduce load balancing, caching, partitioning, and microservices.

Always keep in mind the key pillars:

  • Scalability: Horizontal vs. vertical scaling, load balancers, caching, partitioning, sharding.

  • Availability: Redundancy, failover, replication, resilient design.

  • Security: Defense in depth—secure data in transit and at rest, robust authentication, intrusion detection.

  • Performance: Profiling, caching, asynchronous jobs, efficient data modeling.