Dashboards (Grafana) in Microservices - System Design Exercise

Design: Microservices Monitoring Dashboard with Grafana
In scope: Metrics collection, storage, visualization, alerting, and access control. Out of scope: Microservices implementation, detailed alert notification channels.
Functional Requirements
FR1: Display real-time metrics from multiple microservices
FR2: Support customizable dashboards for different teams
FR3: Visualize key performance indicators (KPIs) such as latency, error rates, and throughput
FR4: Allow alerting based on threshold breaches
FR5: Handle up to 100 microservices emitting a combined 10,000 metric samples per second
FR6: Provide historical data for at least 30 days
FR7: Secure access with role-based permissions
Non-Functional Requirements
NFR1: API response latency for dashboard queries should be under 500ms (p99)
NFR2: System availability must be at least 99.9%
NFR3: Data retention for 30 days with efficient storage
NFR4: Support concurrent access by 500 users
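A quick sizing estimate helps turn FR5, FR6, and NFR3 into a storage number. The sketch below reads FR5 as 10,000 samples per second and assumes roughly 2 bytes per compressed sample, which is in the range Prometheus documents for its TSDB; the real figure depends on label cardinality and scrape interval, so treat the result as an order-of-magnitude guide only.

```python
# Back-of-the-envelope sizing for FR5/FR6. BYTES_PER_SAMPLE is an assumption.
SAMPLES_PER_SECOND = 10_000   # FR5, read as samples per second
RETENTION_DAYS = 30           # FR6
BYTES_PER_SAMPLE = 2          # assumed average size of a compressed sample


def retained_samples(rate_per_second: int, days: int) -> int:
    """Number of samples held at steady state for the retention window."""
    return rate_per_second * 86_400 * days


def storage_gib(samples: int, bytes_per_sample: float) -> float:
    """Approximate on-disk size in GiB, ignoring indexes and replication."""
    return samples * bytes_per_sample / 2**30


total = retained_samples(SAMPLES_PER_SECOND, RETENTION_DAYS)
print(f"{total:,} samples retained")
print(f"about {storage_gib(total, BYTES_PER_SAMPLE):.0f} GiB before replication")
```

At this scale a single well-provisioned node can hold the raw data, so the case for a distributed TSDB rests mainly on availability (NFR2) rather than on capacity.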
Think Before You Design
Key Components
Metrics collection agents (e.g., Prometheus exporters)
Time-series database for storing metrics
Grafana for dashboard visualization
Authentication and authorization service
Alert manager for threshold-based alerts
Design Patterns
Pull vs push metrics collection
Caching for dashboard queries
Role-based access control (RBAC)
Data retention and downsampling
High availability and failover
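The downsampling pattern listed above can be illustrated with a minimal sketch: raw samples are averaged into fixed-width time buckets so that older data takes less space and long-range queries scan fewer points. The function name and the choice of the mean as the aggregate are assumptions for illustration; production systems often keep min, max, and count alongside the average.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Six samples scraped every 15 seconds, reduced to one point per minute.
raw = [(0, 100.0), (15, 120.0), (30, 110.0), (45, 130.0), (60, 200.0), (75, 220.0)]
print(downsample(raw, 60))  # [(0, 115.0), (60, 210.0)]
```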
Reference Architecture
                    +---------------------+
                    |    User Browsers    |
                    +----------+----------+
                               |
                               v
                    +----------+----------+
                    |     Grafana UI      |
                    +----------+----------+
                               |
                               v
                    +----------+----------+
                    | Authentication &    |
                    | Authorization (RBAC)|
                    +----------+----------+
                               |
                               v
                    +----------+----------+
                    |    Query Engine     |
                    +----------+----------+
                               |
                               v
          +--------------------+--------------------+
          |                                         |
+---------+---------+                     +---------+---------+
| Time-Series DB    |                     | Alert Manager     |
| (e.g., Prometheus |                     |                   |
| TSDB)             |                     +-------------------+
+---------+---------+
          |
          v
+---------+---------+
| Metrics Exporters |
| (on microservices)|
+-------------------+
Components
Metrics Exporters (Prometheus exporters): Collect metrics from each microservice and expose them for scraping
Time-Series Database (Prometheus TSDB or Cortex): Store collected metrics efficiently with timestamps
Grafana UI (Grafana): Visualize metrics in customizable dashboards
Authentication & Authorization (OAuth2 / LDAP / RBAC system): Secure dashboard access and enforce user permissions
Alert Manager (Prometheus Alertmanager): Send alerts when metrics cross defined thresholds
Query Engine (PromQL or equivalent): Process user queries to fetch metrics from TSDB
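To make the exporter's role concrete, here is a minimal sketch of what "exposing metrics for scraping" produces: plain text in the Prometheus exposition format, served over HTTP. The helper name and the `http_requests_total` metric with its `service` and `status` labels are hypothetical, and the sketch skips label-value escaping; in practice a client library would generate this output.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here, which a real exporter must do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",          # hypothetical metric name
    "counter",
    "Total HTTP requests.",
    [({"service": "orders", "status": "500"}, 3)],
)
print(body)
```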
Request Flow
1. Metrics exporters on each microservice collect metrics and expose them on an HTTP endpoint.
2. The Prometheus server scrapes metrics from the exporters at regular intervals (e.g., every 15 seconds).
3. Scraped metrics are stored in the time-series database.
4. Users access the Grafana UI to view dashboards.
5. Grafana authenticates users and checks permissions via the auth service.
6. Grafana queries the time-series database through the query engine to fetch the requested metrics.
7. Metrics data is visualized on dashboards with graphs and charts.
8. Alerting rules are evaluated against the stored metrics (in a Prometheus setup, by the Prometheus server), and breaches are forwarded to the alert manager.
9. The alert manager sends alerts to users via configured notification channels (email, Slack, etc.).
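The threshold-based alerting in the last steps of the flow usually includes a hold period, so that a brief spike does not page anyone. The sketch below shows one way such a check could work: an alert fires only if the value has stayed above the threshold continuously for a given duration. The function name and sample layout are assumptions for illustration, not the behavior of any particular alerting engine.

```python
def is_firing(samples, threshold, for_seconds):
    """Return True if the value has stayed above threshold for for_seconds.

    samples: (timestamp, value) pairs sorted by timestamp, newest last.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts      # start of the current breach
        else:
            breach_start = None        # any recovery resets the hold period
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Error ratio sampled every 15 seconds; alert if above 5% for one minute.
error_ratio = [(0, 0.01), (15, 0.06), (30, 0.07), (45, 0.08), (60, 0.09), (75, 0.10)]
print(is_firing(error_ratio, threshold=0.05, for_seconds=60))
```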
Database Schema
Entities:
- Metric: {metric_id, name, labels (key-value), timestamp, value}
- Dashboard: {dashboard_id, name, owner_user_id, configuration_json}
- User: {user_id, username, roles}
- AlertRule: {alert_id, metric_name, threshold, duration, severity, notification_channels}
Relationships:
- User owns multiple Dashboards (1:N)
- AlertRules linked to Metrics by metric_name
- Roles define access permissions for Dashboards and Alerts
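As a rough illustration of how the User and Dashboard entities could back role-based access, the sketch below models both as dataclasses and adds a permission check. The role names, the permission strings, and the rule that an owner may always edit their own dashboard are assumptions; the schema above only states that roles define access.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping, not taken from the schema above.
ROLE_PERMISSIONS = {
    "viewer": {"dashboard:read"},
    "editor": {"dashboard:read", "dashboard:write"},
    "admin": {"dashboard:read", "dashboard:write", "alert:write"},
}


@dataclass
class User:
    user_id: int
    username: str
    roles: list[str] = field(default_factory=list)


@dataclass
class Dashboard:
    dashboard_id: int
    name: str
    owner_user_id: int
    configuration_json: str = "{}"


def can(user: User, permission: str) -> bool:
    """True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def can_edit(user: User, dashboard: Dashboard) -> bool:
    """Assumed rule: owners may always edit; others need dashboard:write."""
    return dashboard.owner_user_id == user.user_id or can(user, "dashboard:write")


alice = User(1, "alice", ["viewer"])
bob = User(2, "bob", ["editor"])
board = Dashboard(10, "Checkout latency", owner_user_id=1)
print(can_edit(alice, board), can_edit(bob, board), can(alice, "alert:write"))
```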
Scaling Discussion
Bottlenecks
High ingestion rate of metrics causing storage and processing overload
Slow query response times due to large data volume
Authentication service becoming a single point of failure
Alert manager overwhelmed by frequent alerts
Dashboard UI performance degradation with many concurrent users
Solutions
Use a horizontally scalable TSDB like Cortex or Thanos to distribute storage and ingestion load
Implement query caching and downsampling of older metrics to speed up queries
Deploy authentication service in a highly available cluster with load balancing
Rate-limit alerts and use deduplication in alert manager to reduce noise
Use Grafana’s built-in caching and optimize dashboard queries; scale Grafana instances behind a load balancer
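Query caching, mentioned in the solutions above, can be sketched as a small time-to-live cache placed in front of the TSDB: identical dashboard queries issued within a short window reuse the stored result instead of hitting storage again. The class and method names are hypothetical, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class TTLCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # still fresh: skip the backend call
        value = compute()
        self._entries[key] = (now, value)
        return value


cache = TTLCache(ttl_seconds=15)
result = cache.get_or_compute(
    ("p99_latency", "last_1h"),        # hypothetical cache key
    lambda: [(0, 0.21), (60, 0.25)],   # stands in for a TSDB query
)
print(result)
```

A TTL close to the scrape interval is a reasonable starting point, since results cannot change faster than new samples arrive.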
Interview Tips
Time: Spend 10 minutes understanding requirements and clarifying scope, 20 minutes designing architecture and data flow, 10 minutes discussing scaling and trade-offs, 5 minutes summarizing and answering questions.
Explain the choice of Prometheus and Grafana as industry standards for metrics and dashboards
Discuss pull-based metrics collection for reliability and scalability
Highlight security with RBAC and authentication integration
Describe how alerting integrates with monitoring for proactive issue detection
Address scaling challenges with distributed TSDB and caching
Mention data retention and downsampling strategies for storage efficiency

Practice

1. What is the main purpose of a Grafana dashboard in microservices monitoring?
easy
A. To visually display system data for easy monitoring
B. To write code for microservices
C. To store microservice source files
D. To deploy microservices automatically

Solution

  1. Step 1: Understand Grafana's role

    Grafana is a tool used to create dashboards that show data visually.
  2. Step 2: Connect purpose to microservices

    Dashboards help monitor microservices by showing their data clearly.
  3. Final Answer:

    To visually display system data for easy monitoring -> Option A
  4. Quick Check:

    Grafana dashboards = Visual monitoring [OK]
Hint: Dashboards show data visually to monitor systems fast [OK]
Common Mistakes:
  • Confusing dashboards with code editors
  • Thinking dashboards deploy services
  • Assuming dashboards store source code
2. Which of the following is the correct way to add a new panel in a Grafana dashboard?
easy
A. Write a new SQL query in the dashboard settings
B. Click the '+' icon and select 'Add Panel'
C. Restart the Grafana server
D. Edit the microservice code

Solution

  1. Step 1: Identify how to add panels in Grafana

    Grafana uses a '+' icon to add new panels visually.
  2. Step 2: Eliminate unrelated actions

    Writing a SQL query or restarting the server does not add a panel by itself.
  3. Final Answer:

    Click the '+' icon and select 'Add Panel' -> Option B
  4. Quick Check:

    Add panel = '+' icon click [OK]
Hint: Use '+' icon to add panels quickly [OK]
Common Mistakes:
  • Trying to add panels by restarting Grafana
  • Confusing panel addition with code editing
  • Assuming SQL query alone adds panels
3. Given this Grafana query panel configuration:
SELECT mean("response_time") FROM "service_metrics" WHERE $timeFilter GROUP BY time($__interval) fill(null)
What will this panel display?
medium
A. List of all service names
B. Total number of requests received
C. Current CPU usage of the server
D. Average response time over time intervals

Solution

  1. Step 1: Analyze the query

    The query, written in SQL-like InfluxQL syntax, calculates the mean (average) of "response_time" from "service_metrics", grouped by time intervals.
  2. Step 2: Understand the output meaning

    This means the panel shows average response time over time, not counts or other metrics.
  3. Final Answer:

    Average response time over time intervals -> Option D
  4. Quick Check:

    mean(response_time) = average response time [OK]
Hint: mean() shows average values in Grafana queries [OK]
Common Mistakes:
  • Confusing mean with total count
  • Assuming query lists service names
  • Thinking it shows CPU usage
4. You created a Grafana dashboard but the panels show 'No data'. What is the most likely cause?
medium
A. The data source is not connected or misconfigured
B. The dashboard theme is set to dark mode
C. The Grafana server needs a restart
D. The microservice code has a syntax error

Solution

  1. Step 1: Identify common reasons for 'No data'

    Panels show 'No data' usually when the data source is missing or wrong.
  2. Step 2: Exclude unrelated causes

    A theme change or a server restart rarely causes 'No data', and a syntax error in microservice code does not directly affect Grafana's data source connection.
  3. Final Answer:

    The data source is not connected or misconfigured -> Option A
  4. Quick Check:

    No data = data source issue [OK]
Hint: Check data source connection first if no data appears [OK]
Common Mistakes:
  • Restarting server unnecessarily
  • Changing theme expecting data fix
  • Blaming microservice code syntax
5. You want to create a Grafana dashboard that shows error rates for multiple microservices over the last 24 hours. Which steps should you follow?
hard
A. Use Grafana to deploy microservices and monitor logs
B. Write microservice code to log errors, then restart Grafana server
C. Connect data source, create a dashboard, add panels with queries filtering errors by service and time
D. Install Grafana plugins, then export dashboard JSON without queries

Solution

  1. Step 1: Connect the correct data source

    Grafana needs a data source with microservice metrics to query error rates.
  2. Step 2: Create dashboard and add panels with queries

    Panels should query error counts filtered by service name and last 24 hours.
  3. Step 3: Customize time range and filters

    Set time filter to last 24 hours and group by service for clear visualization.
  4. Final Answer:

    Connect data source, create a dashboard, add panels with queries filtering errors by service and time -> Option C
  5. Quick Check:

    Data source + queries + filters = dashboard [OK]
Hint: Always start with data source, then build queries in panels [OK]
Common Mistakes:
  • Skipping data source connection
  • Trying to deploy microservices via Grafana
  • Exporting dashboards without queries
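The steps in question 5 can also be expressed as a dashboard definition built in code. The sketch below assembles a dictionary that follows the general shape of Grafana's dashboard JSON (a `time` range plus a list of `panels`, each with query `targets`); field names should be checked against the Grafana version in use. The metric `http_requests_total`, its `service` and `status` labels, and the data source UID are assumptions.

```python
import json


def error_rate_dashboard(services, datasource_uid):
    """Build one error-rate panel per service over the last 24 hours."""
    panels = []
    for i, service in enumerate(services):
        # PromQL: share of requests with a 5xx status over a 5-minute window.
        expr = (
            f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))'
            f' / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
        )
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": f"Error rate: {service}",
            "datasource": {"type": "prometheus", "uid": datasource_uid},
            "gridPos": {"h": 8, "w": 12, "x": 12 * (i % 2), "y": 8 * (i // 2)},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return {
        "title": "Service error rates",
        "time": {"from": "now-24h", "to": "now"},
        "panels": panels,
    }


dashboard = error_rate_dashboard(["orders", "payments"], "example-prometheus-uid")
print(json.dumps(dashboard, indent=2))
```

Generating the definition from a service list keeps panels consistent across teams and lets the dashboard live in version control.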