Why a service mesh manages inter-service traffic in microservices - Design It to Understand It

Design: Service Mesh for Microservices Inter-Service Traffic Management
Focus on managing inter-service traffic within a microservices architecture using a service mesh. Out of scope: service development, database design, external client communication.
Functional Requirements
FR1: Manage communication between multiple microservices securely and reliably
FR2: Provide observability for inter-service calls (metrics, tracing, logging)
FR3: Enable traffic control features like load balancing, retries, and circuit breaking
FR4: Support secure communication with mutual TLS authentication
FR5: Allow dynamic routing and version-based traffic splitting for deployments
Non-Functional Requirements
NFR1: Handle up to 10,000 inter-service requests per second
NFR2: Ensure p99 latency for inter-service calls under 100ms
NFR3: Achieve 99.9% availability for service communication
NFR4: Minimal impact on existing microservices code (no code changes preferred)
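The targets above translate into concrete budgets. The following is a rough back-of-the-envelope sketch, not part of the original requirements; the 30-day month and the function names are illustrative assumptions.

```python
# Back-of-the-envelope budgets derived from NFR1 and NFR3.
# Assumption: a 30-day month; function names are illustrative only.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400 seconds


def downtime_budget_minutes(availability: float, period_minutes: int) -> float:
    """Minutes of allowed unavailability for a given availability target."""
    return (1.0 - availability) * period_minutes


def requests_per_day(requests_per_second: int) -> int:
    """Daily request volume at a sustained request rate."""
    return requests_per_second * SECONDS_PER_DAY


monthly_downtime = downtime_budget_minutes(0.999, MINUTES_PER_30_DAY_MONTH)
daily_requests = requests_per_day(10_000)

print(f"99.9% availability allows about {monthly_downtime:.1f} minutes of downtime per 30 days")
print(f"10,000 requests/second is {daily_requests:,} requests per day")
```

At this volume, even a small per-request telemetry record adds up quickly, which is why sampling comes up later in the scaling discussion.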
Think Before You Design
Key Components
Sidecar proxies deployed alongside each microservice
Control plane to manage configuration and policies
Certificate authority for mutual TLS
Telemetry collection and visualization tools
Service registry for service discovery
Design Patterns
Sidecar proxy pattern
Circuit breaker and retry patterns
Mutual TLS for secure communication
Canary deployment and traffic splitting
Observability with distributed tracing
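To make the canary and traffic-splitting pattern concrete, here is a minimal sketch of weighted version selection. It illustrates the idea only and is not how any particular mesh implements it; the `pick_version` function, the version labels, and the 90/10 split are hypothetical.

```python
import random


def pick_version(weights: dict, rng: random.Random) -> str:
    """Choose a service version in proportion to its configured weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


# Hypothetical canary rollout: 90% of calls go to v1, 10% to v2.
split = {"v1": 90, "v2": 10}
rng = random.Random(42)  # fixed seed so the simulation is repeatable

counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[pick_version(split, rng)] += 1

print(counts)  # roughly 9,000 calls to v1 and 1,000 to v2
```

In a mesh, the weights live in configuration pushed by the control plane, so shifting traffic from 10% to 50% requires no change to service code.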
Reference Architecture
                 Service Mesh Control Plane
                   |                     |
                (config)              (config)
                   v                     v
Service A <---> Sidecar Proxy A <---> Sidecar Proxy B <---> Service B

- Each microservice runs with a sidecar proxy.
- Sidecars handle all incoming and outgoing traffic.
- Control plane configures proxies with routing, security, and telemetry rules.
Components
Sidecar Proxy
Envoy Proxy
Intercepts and manages all network traffic for a microservice, enabling features like load balancing, retries, and security.
Control Plane
Istio Control Plane
Manages configuration, policies, and certificates for sidecar proxies dynamically.
Certificate Authority
Istio CA or external PKI
Issues and rotates certificates for mutual TLS to secure inter-service communication.
Telemetry System
Prometheus, Grafana, Jaeger
Collects metrics, logs, and traces from sidecars for observability.
Service Registry
Kubernetes API Server or Consul
Keeps track of available services and their endpoints for discovery.
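The mutual TLS requirement handled by the sidecar proxies and the certificate authority can be illustrated with Python's standard `ssl` module. This is a minimal sketch of what "mutual" means at the TLS layer, not a description of how Envoy or Istio configure it; the function names are hypothetical and no certificates are loaded here.

```python
import ssl


def server_side_mtls_context() -> ssl.SSLContext:
    """TLS context for the receiving proxy: clients must present a certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # This line is what makes the TLS "mutual": without it, only the
    # server proves its identity.
    ctx.verify_mode = ssl.CERT_REQUIRED
    # A real deployment would also call ctx.load_cert_chain(...) with the
    # workload certificate and ctx.load_verify_locations(...) with the CA
    # bundle issued by the mesh's certificate authority.
    return ctx


def client_side_mtls_context() -> ssl.SSLContext:
    """TLS context for the calling proxy: verifies the server's certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # CERT_REQUIRED by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


server_ctx = server_side_mtls_context()
client_ctx = client_side_mtls_context()
print(server_ctx.verify_mode, client_ctx.verify_mode)
```

Because the sidecars own these contexts and the certificate authority rotates the certificates, the services themselves can keep speaking plain HTTP to their local proxy.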
Request Flow
1. Microservice A sends a request to Microservice B.
2. The request is intercepted by Sidecar Proxy A.
3. Sidecar Proxy A applies routing rules and security policies.
4. Sidecar Proxy A establishes a mutual TLS connection to Sidecar Proxy B.
5. Sidecar Proxy B receives the request and forwards it to Microservice B.
6. Microservice B processes the request and sends the response back through Sidecar Proxy B.
7. Sidecar Proxy B applies response policies and sends it securely to Sidecar Proxy A.
8. Sidecar Proxy A forwards the response to Microservice A.
9. Throughout this flow, telemetry data is collected and sent to the telemetry system.
10. The control plane continuously updates sidecar proxies with configuration changes.
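The flow above can be sketched as a toy in-process simulation. This is only an illustration of who talks to whom; the `Sidecar` class, its method names, and the `order-42` payload are hypothetical, and real proxies intercept network traffic rather than function calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sidecar:
    """Toy stand-in for a sidecar proxy that records telemetry events."""
    name: str
    telemetry: List[str] = field(default_factory=list)

    def outbound(self, peer: "Sidecar", app: Callable[[str], str], payload: str) -> str:
        # Step 2: the caller's proxy intercepts the outgoing request.
        self.telemetry.append(f"{self.name}: outbound request intercepted")
        # Steps 3-4: routing rules, security policy, and mTLS would apply here.
        response = peer.inbound(app, payload)
        # Step 8: the response is handed back to the calling service.
        self.telemetry.append(f"{self.name}: response returned to caller")
        return response

    def inbound(self, app: Callable[[str], str], payload: str) -> str:
        # Step 5: the receiving proxy forwards the request to its service.
        self.telemetry.append(f"{self.name}: inbound request forwarded to app")
        response = app(payload)  # Step 6: the service processes the request.
        # Step 7: the response goes back through the receiving proxy.
        self.telemetry.append(f"{self.name}: response sent back")
        return response


def service_b(payload: str) -> str:
    """Hypothetical business logic of Microservice B."""
    return f"B handled {payload}"


proxy_a = Sidecar("sidecar-a")
proxy_b = Sidecar("sidecar-b")
result = proxy_a.outbound(proxy_b, service_b, "order-42")

print(result)
for event in proxy_a.telemetry + proxy_b.telemetry:
    print(event)
```

Note that `service_b` contains no retry, TLS, or telemetry code: everything cross-cutting lives in the proxies, which is the point of NFR4.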
Database Schema
Not applicable: a service mesh manages runtime traffic and configuration, not persistent data storage.
Scaling Discussion
Bottlenecks
Sidecar proxies becoming CPU or memory bottlenecks under high traffic
Control plane overwhelmed by frequent configuration updates
Certificate authority latency during certificate issuance or rotation
Telemetry system storage and query performance with large volumes of data
Solutions
Sidecars scale with their services: add service replicas (each replica gets its own proxy) and spread them across nodes; tune proxy CPU and memory limits
Implement control plane horizontal scaling and caching of configurations
Use efficient certificate rotation strategies and caching to reduce latency
Use scalable telemetry backends and sampling to reduce data volume
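The sampling idea in the last solution can be sketched as a deterministic, hash-based decision per trace ID, so that every proxy on a request path makes the same keep-or-drop choice. This is one possible approach under stated assumptions, not the method of any specific tracing system; `keep_trace`, the ID format, and the 1% rate are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, consistently for the same trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Simulate 100,000 traces at a hypothetical 1% sample rate.
kept = sum(keep_trace(f"trace-{i}", 0.01) for i in range(100_000))
print(f"kept {kept} of 100000 traces")
```

Hashing the trace ID, rather than rolling a random number at each hop, avoids traces that are recorded by one proxy and dropped by the next.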
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing the architecture and explaining components, 10 minutes discussing scaling and trade-offs, 5 minutes for questions.
Explain why sidecar proxies are used to manage traffic without changing microservice code
Describe how mutual TLS secures communication between services
Highlight observability benefits from telemetry collected by the mesh
Discuss traffic control features like retries, circuit breakers, and routing
Address scaling challenges and how to mitigate bottlenecks

Practice

1. Why does a service mesh manage inter-service traffic in a microservices architecture?
easy
A. To improve security, reliability, and observability between services
B. To replace the need for a database in microservices
C. To write the business logic inside each service
D. To increase the size of each service for better performance

Solution

  1. Step 1: Understand the role of service mesh

    A service mesh controls how services communicate, focusing on security, reliability, and monitoring.
  2. Step 2: Identify what service mesh does not do

    It does not replace databases or add business logic; it manages traffic between services.
  3. Final Answer:

    To improve security, reliability, and observability between services -> Option A
  4. Quick Check:

    Service mesh manages traffic for security and reliability = A [OK]
Hint: Service mesh controls communication, not business logic or storage [OK]
Common Mistakes:
  • Thinking service mesh replaces databases
  • Confusing service mesh with application code
  • Assuming service mesh increases service size
2. Which flow correctly describes how a service mesh uses sidecar proxies?
easy
A. database -> service -> sidecar proxy
B. service -> sidecar proxy -> other service
C. sidecar proxy -> service -> database
D. service <- database <- sidecar proxy

Solution

  1. Step 1: Understand sidecar proxy role

    Sidecar proxies sit alongside services to intercept and manage traffic between services.
  2. Step 2: Identify correct traffic flow

    Traffic flows from the service through its sidecar proxy to the other service.
  3. Final Answer:

    service -> sidecar proxy -> other service -> Option B
  4. Quick Check:

    Sidecar proxies manage traffic between services = B [OK]
Hint: Sidecar proxies sit next to services and intercept their incoming and outgoing traffic [OK]
Common Mistakes:
  • Confusing database direction with sidecar proxy
  • Reversing traffic flow arrows
  • Mixing service and database roles
3. Given this simplified service mesh setup, what is the expected behavior when Service A calls Service B and Service B is temporarily down?
Service A -> Sidecar Proxy A -> Sidecar Proxy B -> Service B
medium
A. The call fails immediately with no retries
B. Service A handles retries without sidecar involvement
C. Sidecar Proxy A retries the call automatically before failing
D. Sidecar Proxy B forwards the call to a database instead

Solution

  1. Step 1: Recognize retry feature in service mesh

    Service mesh sidecar proxies can automatically retry failed calls to improve reliability.
  2. Step 2: Identify which proxy handles retries

    Sidecar Proxy A, managing outgoing traffic from Service A, retries the call before reporting failure.
  3. Final Answer:

    Sidecar Proxy A retries the call automatically before failing -> Option C
  4. Quick Check:

    Sidecar proxies handle retries to improve reliability = C [OK]
Hint: Sidecar proxies retry failed calls automatically [OK]
Common Mistakes:
  • Assuming no retries happen
  • Thinking service code retries instead
  • Confusing proxy roles with database
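The retry behavior from question 3 can be sketched as a small wrapper with exponential backoff. This is a generic illustration of the pattern, not a proxy's actual implementation; `call_with_retries`, `UpstreamUnavailable`, and the default values are hypothetical, and automatic retries are generally only safe for idempotent requests.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class UpstreamUnavailable(Exception):
    """Raised when the destination service cannot be reached."""


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Try `call` up to `attempts` times, doubling the delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except UpstreamUnavailable:
            if attempt == attempts:
                raise  # out of attempts: report the failure to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical Service B that is down for the first two calls.
calls = {"n": 0}


def flaky_service_b() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise UpstreamUnavailable("service B is down")
    return "ok"


delays: List[float] = []  # record delays instead of actually sleeping
result = call_with_retries(flaky_service_b, attempts=3, sleep=delays.append)
print(result, calls["n"], delays)
```

Service A never sees the first two failures; it only sees an error if every attempt fails.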
4. You configured a service mesh but notice that traffic between services is not encrypted. What is the most likely cause?
medium
A. Service mesh does not support encryption
B. Services are using HTTP instead of HTTPS internally
C. The database connection is not encrypted
D. Sidecar proxies are not enabled to handle TLS encryption

Solution

  1. Step 1: Understand encryption in service mesh

    Service mesh uses sidecar proxies to encrypt traffic between services using TLS.
  2. Step 2: Identify common misconfiguration

    If sidecar proxies are not configured or enabled for TLS, traffic remains unencrypted.
  3. Final Answer:

    Sidecar proxies are not enabled to handle TLS encryption -> Option D
  4. Quick Check:

    Encryption depends on sidecar proxy TLS setup = D [OK]
Hint: Check sidecar proxy TLS settings for encryption issues [OK]
Common Mistakes:
  • Blaming service internal HTTP usage
  • Confusing database encryption with service traffic
  • Assuming service mesh lacks encryption feature
5. In a microservices system using a service mesh, how does the mesh help when one service experiences intermittent failures?
hard
A. It automatically retries requests, routes around failures, and collects metrics for monitoring
B. It stops all traffic to the failing service until manually restarted
C. It merges the failing service into other services to avoid downtime
D. It disables sidecar proxies to reduce overhead during failures

Solution

  1. Step 1: Identify service mesh features for failure handling

    Service mesh retries requests, performs circuit breaking (routing around failures), and gathers metrics.
  2. Step 2: Understand what service mesh does not do

    It does not stop all traffic, merge services, or disable proxies during failures.
  3. Final Answer:

    It automatically retries requests, routes around failures, and collects metrics for monitoring -> Option A
  4. Quick Check:

    Service mesh improves reliability with retries and monitoring = A [OK]
Hint: Service mesh retries and monitors to handle failures smoothly [OK]
Common Mistakes:
  • Thinking mesh stops traffic completely
  • Believing mesh merges services automatically
  • Assuming proxies are disabled on failure
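The circuit breaking mentioned in question 5 can be sketched as a small state machine: closed while calls succeed, open (failing fast) after repeated failures, and half-open after a timeout to test recovery. This is the generic pattern only, not how any particular proxy implements it; the class name, thresholds, and injectable clock are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open protects the struggling service from extra load and gives callers an immediate error instead of a slow timeout.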