Nov 17, 2024

Grafana and Prometheus Tutorial: Create a COMPLETE Dashboard

In this tutorial, you will learn how to transform raw metric data into a Grafana dashboard with insightful visualizations. I will provide you with a pre-configured environment so we can focus entirely on building each dashboard panel step by step.

Monitoring Methodologies

The two methodologies that will guide the creation of our dashboard are RED and USE.

The RED methodology focuses on metrics that affect user experience: Rate, Errors, and Duration.
The USE methodology focuses on system-level metrics: Resource Utilization, Saturation, and Errors.

By creating a dashboard that combines the RED and USE methodologies, we ensure comprehensive monitoring of both user experience and system performance.

Setting Up the Environment

Before starting, make sure you have the same environment as me by following this tutorial: kubernetestraining.io/blog/prometheus-docker-compose-setup-with-grafana. It gives you a working environment with Prometheus scraping metrics from a Python web application and Grafana connected to Prometheus, ready for us to create our dashboard.

Creating Dashboard Rows

To organize our dashboard effectively and align with the RED and USE methodologies, we'll create rows for each key area. This helps in grouping related panels and improves the readability of the dashboard.

Create a New Dashboard
- In Grafana, click on the Plus Icon (Create) > Dashboard.
Add Rows for Each Section
- Click on the dropdown arrow next to Add panel.
- Select Add a new row.
- Title: Enter Quick Indicators.
Repeat the process to add more rows:
- Row 2: Title Rate (RED)
- Row 3: Title Errors (RED)
- Row 4: Title Duration (RED)
- Row 5: Title System Metrics (USE)
- Row 6: Title Garbage Collection (USE)
Save the Dashboard
- Click on the Save Dashboard icon (diskette) at the top.
- Title: Enter Comprehensive Monitoring Dashboard or any name you prefer.
- Folder: Choose or create a folder if desired.
- Click Save.

With the rows in place, we're now ready to start adding panels to each section.

Panel 1: Total HTTP Requests

In this section, we'll start constructing our dashboard by creating panels that will display key metrics. We'll begin with basic queries and progressively enhance them by introducing variables and adjusting time ranges.

Step 1: Create a Basic Panel

Add a New Panel
- Open the dashboard you created in the previous section.
- Click on Add panel > Add new panel.
Configure the Basic Query
- Data source: Ensure Prometheus is selected.
- Query Editor: Enter the following PromQL query:
  sum(increase(http_request_total[5m]))
- Explanation:
  - This query adds up how much the http_request_total counter increased over the last 5 minutes, across every series Prometheus has scraped.
Visualize the Data
- Visualization: Select Stat.
- Panel Title: Enter Total Requests.
Apply the Panel
- Click Apply to add the panel to the dashboard.
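To make the query less of a black box, here is a minimal Python sketch of the idea behind increase() followed by sum(). It is a simplification: real Prometheus also extrapolates to the edges of the window, so its results can be fractional. The sample values and instance names are made up for illustration.

```python
def simple_increase(samples):
    """Approximate a counter's increase over a window.

    `samples` holds raw counter values in time order. A drop between two
    consecutive samples is treated as a counter reset (for example a process
    restart), so the value after the reset counts as new growth.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # counter restarted from zero
    return total


# Two hypothetical series scraped over the same 5-minute window.
series = {
    "instance-a": [100, 120, 150, 180],
    "instance-b": [40, 45, 5, 20],  # reset between the 2nd and 3rd sample
}

# sum(increase(...)): add up the per-series increases.
total_requests = sum(simple_increase(s) for s in series.values())
print(total_requests)
```

The Stat panel shows a single number for the same reason this sketch does: sum() collapses all series into one value.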

Step 2: Analyze the Output

Review the Panel
- The panel displays the total number of requests in the last 5 minutes.
- However, this includes all jobs and instances scraped by Prometheus, which may not be specific to your application.
Identify the Need to Filter Data
- We need to filter the data to show only the requests related to our specific application.

Step 3: Filter by Job

Edit the Panel
- Hover over the panel and click on the Panel Title > Edit.
Update the Query to Filter by Job
- Modify the query to:
  sum(increase(http_request_total{job="my-app"}[5m]))
- Explanation:
  - {job="my-app"} filters the metric to include only data from the my-app job.
  - Replace "my-app" with the actual job name of your application.
Apply the Panel
- Click Apply.
Review the Updated Panel
- The panel now displays the total requests specific to your application.
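The effect of the label matcher can be sketched in a few lines of Python. This only illustrates equality matching on labels; the series, label values, and numbers below are hypothetical and not taken from the tutorial environment.

```python
def matches(labels, matchers):
    """Return True when every matcher label has exactly the requested value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def sum_matching(series, matchers):
    """Sum the values of all series whose labels satisfy the matchers."""
    return sum(value for labels, value in series if matches(labels, matchers))


# (labels, increase over the window) for three made-up series.
series = [
    ({"job": "my-app", "status": "200"}, 42.0),
    ({"job": "my-app", "status": "500"}, 3.0),
    ({"job": "prometheus", "status": "200"}, 17.0),
]

print(sum_matching(series, {}))                 # no filter: every job
print(sum_matching(series, {"job": "my-app"}))  # only the my-app job
```

Without a matcher, requests from other scraped jobs leak into the total, which is the problem Step 2 pointed out.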

Step 4: Avoid Hardcoding the Job Name

Problem with Hardcoding
- Hardcoding job="my-app" means the dashboard is not flexible if the job name changes or if you have multiple jobs.
Solution: Use a Variable
- We'll create a variable for the job name so the dashboard can dynamically adjust based on the selected job.

Creating Dashboard Variables

To make our dashboard dynamic and avoid hardcoding values, we'll create variables that can be used in our queries.

Access Dashboard Settings
- Click on the Gear Icon (Dashboard settings) at the top right.
Add a Variable
- In the left menu, select Variables.
- Click on the Add variable button.
Configure the job Variable
- Name: job
- Type: Query
- Data source: Select your Prometheus data source.
- Query: label_values(job)
  - label_values() is a Grafana query function for Prometheus data sources (not PromQL itself). It fetches all unique job label values from your metrics.
- Regex: Leave blank.
- Refresh: On Dashboard Load
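- Include All Option: Enable (allows selecting all jobs). Note that All expands to a regular expression, so a query needs the regex matcher job=~"$job" for the All selection to return data.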
- Multi-value: Disable (since we'll usually select one job).
- Sort: Disabled
Save the Variable
- Click on the Update button at the bottom.
Use the Variable in the Query
- Return to your panel.
- Edit the Panel:
  - Hover over the panel and click on the Panel Title > Edit.
- Update the Query:
  sum(increase(http_request_total{job="$job"}[5m]))
- Explanation:
  - {job="$job"} uses the variable we just created.
  - Now, the query dynamically filters based on the selected job.
Apply the Panel
- Click Apply.
Test the Variable
- At the top of the dashboard, you should see a dropdown for job.
- Select different jobs (if available) to see how the panel updates.
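Conceptually, the variable does two things: it collects the distinct values of a label for the dropdown, and it substitutes the selected value into the query text before the query is sent. The Python sketch below mimics both steps under simplifying assumptions (single-value selection, plain text substitution); it is not how Grafana is implemented, and the function names are hypothetical.

```python
import re


def label_values(series_labels, label):
    """Collect the sorted, distinct values of one label across all series."""
    return sorted({labels[label] for labels in series_labels if label in labels})


def interpolate(query, variables):
    """Replace $name placeholders with the selected variable values.

    Placeholders without a known value are left untouched.
    """
    def replace(match):
        return variables.get(match.group(1), match.group(0))

    return re.sub(r"\$(\w+)", replace, query)


scraped = [{"job": "my-app"}, {"job": "prometheus"}, {"job": "my-app"}]
options = label_values(scraped, "job")  # values offered in the dropdown

template = 'sum(increase(http_request_total{job="$job"}[5m]))'
print(interpolate(template, {"job": options[0]}))
```

Changing the dropdown selection only changes the text that replaces $job; the rest of the query stays the same.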

Refining the Panels with Variables and Time Ranges

Now that we've introduced variables, we'll refine our panels further to make them responsive to the dashboard's time range.

Step 1: Adjusting the Time Range Variable

Understanding the Issue
- The [5m] in our query is a static time range (5 minutes).
- When we change the dashboard's time range (e.g., Last 1 hour), the panel still calculates over the last 5 minutes.
Use the Dashboard Time Range Variable
- Grafana provides a built-in variable $__range, which represents the current dashboard time range.
Update the Query
- Edit the Panel:
  - Hover over the panel and click on the Panel Title > Edit.
- Modify the Query:
  sum(increase(http_request_total{job="$job"}[$__range]))
- Explanation:
  - [$__range] uses the dashboard's current time range.
  - Now, the panel dynamically adjusts based on the selected time interval.
Apply the Panel
- Click Apply.
Test the Time Range Adjustment
- Change the dashboard time range (e.g., Last 15 minutes, Last 1 hour) and observe how the panel updates accordingly.
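The sketch below shows roughly what happens to the query when the time picker changes. It assumes $__range is rendered as the length of the dashboard window in whole seconds with an s suffix; treat that formatting as an assumption and check the expanded query in Grafana's query inspector if it matters to you.

```python
from datetime import datetime, timedelta


def dashboard_range(time_from, time_to):
    """Render the dashboard window as a PromQL duration in seconds."""
    seconds = round((time_to - time_from).total_seconds())
    return f"{seconds}s"


now = datetime(2024, 11, 17, 12, 0, 0)
query_template = 'sum(increase(http_request_total{job="my-app"}[$__range]))'

# Simulate "Last 15 minutes" and "Last 1 hour" in the time picker.
for window in (timedelta(minutes=15), timedelta(hours=1)):
    rng = dashboard_range(now - window, now)
    print(query_template.replace("$__range", rng))
```

With a fixed [5m] the window in the query never changes; with $__range the window always matches what the time picker shows.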

Step 2: Proceeding with Other Panels

Now that we've established how to use variables and time ranges, we'll proceed to create other panels using similar methods.

Panel 2: Error Rate

Add a New Panel
- Click on Add panel > Add new panel.

Configure the Query

Query Editor:

(
  sum(increase(http_request_total{job="$job", status=~"4.."}[$__range])) +
  sum(increase(http_request_total{job="$job", status=~"5.."}[$__range]))
)
/
sum(increase(http_request_total{job="$job"}[$__range])) * 100

Explanation:
- Calculates the percentage of requests resulting in client (4xx) or server (5xx) errors.
- Uses the $job variable and $__range for dynamic filtering and time range adjustment.
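The arithmetic behind this panel is small enough to check by hand. Here is a Python sketch of it, assuming you already have the number of requests per status code for the selected time range (the counts below are invented). Like PromQL regex matchers, the pattern must match the whole status string.

```python
import re


def error_rate_percent(requests_by_status):
    """Percentage of requests whose status code is 4xx or 5xx.

    Returns None when there were no requests at all, which mirrors a panel
    showing "No data" instead of dividing by zero.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return None
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if re.fullmatch(r"[45]..", status)
    )
    return 100 * errors / total


counts = {"200": 950, "404": 30, "500": 20}
print(error_rate_percent(counts))
```

In this made-up example 50 of 1000 requests failed, so the panel would show 5 percent.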

Set the Panel Title
- Title: Enter Error Rate (%).
Select Visualization
- Choose Stat.
Customize the Visualization
- Field Options:
  - Unit: Select percent.
  - Decimal Places: Adjust to show at least two decimal places.
  - Thresholds:
    - Set thresholds at:
      - 1% (Yellow)
      - 5% (Red)
  - Color Mode: Choose Background.
Apply the Panel
- Click Apply.

Panel 3: Requests in Progress

Add a New Panel
- Click on Add panel > Add new panel.
Configure the Query
- Query Editor:
- Explanation:
  - Displays the current number of in-progress requests.
  - Uses the $job variable for dynamic filtering.
Set the Panel Title
- Title: Enter Requests in Progress.
Select Visualization
- Choose Stat.
Customize the Visualization
- Field Options:
  - Unit: Select none.
  - Color Mode: Choose Background.
  - Thresholds: Optional—set if desired.
Apply the Panel
- Click Apply.

Panel 4: Process Uptime

Add a New Panel
- Click on Add panel > Add new panel.
Configure the Query
- Query Editor:
  time() - process_start_time_seconds{job="$job"}
- Explanation:
  - Calculates the process uptime by subtracting the process start time from the current time.
  - Uses the $job variable for dynamic filtering.
Set the Panel Title
- Title: Enter Process Uptime.
Select Visualization
- Choose Stat.
Customize the Visualization
- Field Options:
  - Unit: Select dthms (days, hours, minutes, seconds).
  - Color Mode: Choose Background.
- Value Mappings:
  - Map No Value to display text DOWN with color Red.
- Thresholds:
  - Set a threshold to display uptime in Green if the value is above 0.
Apply the Panel
- Click Apply.
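The uptime calculation described above is a plain subtraction of two Unix timestamps. The Python sketch below shows it together with a simple days/hours/minutes/seconds formatter. The formatter is only similar in spirit to the dthms unit; the exact text Grafana renders may differ, and the function and parameter names are hypothetical.

```python
import time


def uptime_seconds(start_time_seconds, now=None):
    """Seconds elapsed since the process start timestamp (Unix time)."""
    if now is None:
        now = time.time()
    return now - start_time_seconds


def format_dthms(seconds):
    """Format a duration in seconds as days, hours, minutes and seconds."""
    seconds = int(seconds)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# A process that started 93784 seconds before "now".
started = 1_700_000_000
print(format_dthms(uptime_seconds(started, now=started + 93784)))
```

If the process is down there is no start timestamp to subtract, so the query returns no value, which is what the DOWN value mapping catches.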

At this point, you've successfully created four essential panels for your Grafana dashboard:

Total Requests: Displays the total number of requests over the selected time range.
Error Rate (%): Shows the percentage of requests resulting in errors.
Requests in Progress: Indicates the current number of in-progress requests.
Process Uptime: Displays how long the application process has been running.

These panels are dynamic and adjust based on the selected job and time range, providing valuable insights into your application's performance.

With the foundational panels in place, we'll now proceed to build additional panels that provide deeper insights into our application's performance and system health. These panels align with both the RED and USE methodologies, covering aspects such as throughput, latency, system resource utilization, and more.

Panel 5: API Throughput

Add a New Panel to the "Rate" Row
- In the Rate row, click on Add panel > Add new panel.
Configure the Query
- Data source: Ensure Prometheus is selected.
- Query Editor: Enter the following PromQL query:
  sum(rate(http_request_total{job="$job"}[1m])) by (method, path)
  Explanation:
  - http_request_total: The metric tracking total HTTP requests.
  - {job="$job"}: Filters the metric by the selected job variable.
  - rate(...[1m]): Calculates the per-second rate over the last 1 minute.
  - sum(...) by (method, path): Aggregates the rate, grouping by HTTP method and path.
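As a rough illustration of what rate() and sum by (...) do together, the Python sketch below computes a per-second rate for each series and then adds up the series that share the same method and path. It ignores counter resets and extrapolation, and the sample data (including the instance label) is hypothetical.

```python
from collections import defaultdict


def per_second_rate(first, last, window_seconds):
    """Average per-second growth of a counter across the window."""
    return (last - first) / window_seconds


def sum_rate_by(samples, group_labels, window_seconds=60):
    """Group series by the given labels and sum their per-second rates."""
    grouped = defaultdict(float)
    for labels, first, last in samples:
        key = tuple(labels[name] for name in group_labels)
        grouped[key] += per_second_rate(first, last, window_seconds)
    return dict(grouped)


# (labels, counter at window start, counter at window end)
samples = [
    ({"method": "GET", "path": "/", "instance": "a"}, 100, 160),
    ({"method": "GET", "path": "/", "instance": "b"}, 200, 230),
    ({"method": "POST", "path": "/items", "instance": "a"}, 10, 40),
]

throughput = sum_rate_by(samples, ("method", "path"))
for (method, path), rate in throughput.items():
    print(f"{method} {path}: {rate} req/s")
```

Each key in the result corresponds to one line in the Time series panel, which is why the legend template uses the same two labels.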
Set the Panel Title
- Title: Enter API Throughput.
Select Visualization
- Visualization: Select Time series.
Customize the Visualization
- Legend:
  - In the Legend field in the query editor, enter {{method}} {{path}}.
  - Enable Show legend and set Placement to Bottom.
- Axes:
  - Unit: Select reqps (requests per second).
Apply the Panel
- Click Apply to add the panel to the dashboard.

Analyze and Refine the Panel

Review the Panel
- The panel displays the rate of requests per second for each API endpoint, differentiated by method and path.
Handle Excessive Lines
- If you have many endpoints, the graph may become cluttered.
- You can filter specific endpoints by modifying the query or using additional variables (e.g., a path variable).

Panel 6: Requests by Status Code

Add a New Panel to the "Errors" Row
- In the Errors row, click on Add panel > Add new panel.
Configure the Query
- Query Editor: Enter:
  sum(rate(http_request_total{job="$job"}[1m])) by (status)
  Explanation:
  - Aggregates the rate of requests by their HTTP status codes.
Set the Panel Title
- Title: Enter Requests by Status Code.
Select Visualization
- Visualization: Select Pie chart.
  Note: If the Pie Chart visualization is not available, you may need to install the Pie Chart plugin from the Grafana Plugins page.
Customize the Visualization
- Pie Chart Options:
  - Set Pie Type to Donut.
  - Enable Labels to display the status codes.
  - Display: Set to Percentage.
- Legend:
  - Enable Show legend and set Placement to Bottom.
Apply the Panel
- Click Apply.

Analyze and Refine the Panel

Review the Panel
- The pie chart displays the distribution of requests by status code, helping you identify the proportion of successful vs. error responses.
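The percentages shown on the chart are each slice's share of the total. A small Python sketch of that calculation, using invented request rates per status code:

```python
def percentage_share(rate_by_status):
    """Convert per-status request rates into percentage shares of the total."""
    total = sum(rate_by_status.values())
    if total == 0:
        return {status: 0.0 for status in rate_by_status}
    return {status: 100 * rate / total for status, rate in rate_by_status.items()}


rates = {"200": 9.0, "404": 0.5, "500": 0.5}  # requests per second
shares = percentage_share(rates)
for status, share in shares.items():
    print(f"{status}: {share:.1f}%")
```

A healthy service usually shows one dominant 2xx slice; growing 4xx or 5xx slices are the signal to look for.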

Panel 7: API Latency Percentiles

Add a New Panel to the "Duration" Row
- In the Duration row, click on Add panel > Add new panel.
Configure Multiple Queries for Percentiles
- Query A (p50):
  histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))
  - Legend: Set to p50.
- Query B (p90):
  histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))
  - Legend: Set to p90.
- Query C (p95):
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))
  - Legend: Set to p95.
- Query D (p99):
  histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))
  - Legend: Set to p99.
  Explanation:
  - histogram_quantile(φ, ...): Calculates the φ-quantile (e.g., 0.5 for the 50th percentile) from histogram data.
  - rate(...[1m]): Computes the per-second average rate over the last 1 minute.
  - sum(...) by (le): Aggregates the data across all instances, grouping by the upper bound of histogram buckets (le).
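To see why these values are estimates rather than exact measurements, here is a simplified Python sketch of quantile estimation from cumulative buckets. It follows the general approach of linear interpolation inside the bucket that contains the requested rank; it does not reproduce every edge case of Prometheus's own implementation, and the bucket data is made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total.
    Assumes 0 < q <= 1 and non-negative bucket bounds.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report its bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts: 50 requests took <= 0.1s, 90 took <= 0.5s, and so on.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.9, 0.95, 0.99):
    print(f"p{round(q * 100)}: {histogram_quantile(q, buckets):.3f}s")
```

Because the result is interpolated between bucket bounds, the accuracy of each percentile depends on how finely the buckets are spaced around it.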
Set the Panel Title
- Title: Enter API Latency Percentiles.
Select Visualization
- Visualization: Select Time series.
Customize the Visualization
- Axes:
  - Unit: Select s (seconds).
- Legend:
  - Ensure the legend displays the percentile labels (p50, p90, etc.).
  - Place the legend at the bottom for better readability.
- Graph Styles:
  - Assign different colors to each percentile for distinction.
  - Adjust line styles (solid, dashed) if desired.
Apply the Panel
- Click Apply.

Analyze and Refine the Panel

Review the Panel
- The panel shows how different latency percentiles vary over time, helping you understand the distribution of response times.
Investigate High Latencies
- If higher percentiles (e.g., p99) show spikes, it may indicate performance issues that need investigation.

Panel 8: Request Duration Heatmap

Add a New Panel to the "Duration" Row
- Click on Add panel > Add new panel.
Configure the Heatmap Query
- Query Editor:
  sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le)
  Explanation:
  - Aggregates the rate of requests into histogram buckets based on their durations (le labels).
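Prometheus histogram buckets are cumulative: each le bucket also counts everything in the smaller buckets. A heatmap needs the count that falls inside each individual bucket, so the cumulative values have to be turned into per-bucket values first. The Python sketch below shows that step on invented data; it illustrates the idea only and makes no claim about how Grafana does it internally.

```python
import math


def per_bucket_counts(cumulative):
    """Turn cumulative (upper_bound, count) pairs into per-bucket counts.

    Returns a list of ((lower_bound, upper_bound), count) cells.
    """
    cells = []
    prev_bound, prev_count = 0.0, 0
    for bound, count in sorted(cumulative):
        cells.append(((prev_bound, bound), count - prev_count))
        prev_bound, prev_count = bound, count
    return cells


cumulative = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for (low, high), count in per_bucket_counts(cumulative):
    print(f"{low}s to {high}s: {count}")
```

Each cell corresponds to one colored rectangle in a heatmap column; the darker cells are the buckets where most requests landed.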
Set the Panel Title
- Title: Enter Request Duration Heatmap.
Select Visualization
- Visualization: Select Heatmap.
Customize the Visualization
- Heatmap Options:
  - Data Format: Ensure it's set to Time series buckets.
- Axes:
  - Y-Axis Unit: Select s (seconds).
- Color Scheme:
  - Choose a color scheme that highlights intensity, such as Spectral or Viridis.
- Y-Axis Settings:
  - Set the scale to Logarithmic if the data spans multiple orders of magnitude.
  - Enable Show grid lines for better readability.
Apply the Panel
- Click Apply.

Analyze and Refine the Panel

Review the Panel
- The heatmap displays the distribution of request durations over time, allowing you to spot patterns and anomalies.
Adjust Bucketing (If Necessary)
- If the heatmap is not displaying data effectively, you may need to adjust the bucket boundaries. These are defined where the histogram is declared in your application's instrumentation, not in the Prometheus configuration.

Panel 9: CPU Usage (%)

Add a New Panel to the "System Metrics" Row
- In the System Metrics row, click on Add panel > Add new panel.
Configure the CPU Usage Query
- Query Editor:
  rate(process_cpu_seconds_total{job="$job"}[1m]) * 100
  Explanation:
  - process_cpu_seconds_total: Total user and system CPU time spent in seconds.
  - rate(...[1m]): Calculates the per-second rate over the last 1 minute, that is, CPU seconds consumed per second of wall-clock time.
  - * 100: Multiplies the rate by 100 to express it as a percentage of one CPU core.
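The same calculation written out in Python, as a sketch with invented numbers. One consequence worth knowing: because the value is relative to a single core, a process using several cores at once can report more than 100.

```python
def cpu_percent(cpu_seconds_start, cpu_seconds_end, window_seconds):
    """CPU time consumed during the window, as a percentage of one core."""
    return (cpu_seconds_end - cpu_seconds_start) / window_seconds * 100


# 15 CPU-seconds used during a 60-second window: a quarter of one core.
print(cpu_percent(12.0, 27.0, 60))

# 90 CPU-seconds in 60 seconds is only possible with more than one core busy.
print(cpu_percent(0.0, 90.0, 60))
```

Keep this in mind when choosing thresholds for a multi-threaded or multi-process application.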
Set the Panel Title
- Title: Enter CPU Usage (%).
Select Visualization
- Visualization: Select Stat.
Customize the Visualization
- Field Options:
  - Unit: Select percent (0-100).
  - Thresholds:
    - Set thresholds at:
      - 70% (Yellow)
      - 90% (Red)
  - Color Mode: Choose Background.
- Value Options:
  - Calculation: Select Last (not null).
Apply the Panel
- Click Apply.

Analyze and Refine the Panel

Review the Panel
- The panel displays the current CPU usage percentage.
Verify the Calculation
- Ensure that the CPU usage aligns with expectations. If the value seems incorrect, double-check the query and time window.

Panel 10: Memory Usage

Add a New Panel to the "System Metrics" Row
- Click on Add panel > Add new panel.
Configure Memory Usage Queries
- Query A (Resident Memory):
  process_resident_memory_bytes{job="$job"}
  - Legend: Set to Resident Memory.
- Query B (Virtual Memory):
  process_virtual_memory_bytes{job="$job"}
  - Legend: Set to Virtual Memory.
  Explanation:
  - process_resident_memory_bytes: The amount of memory the process has in RAM.
  - process_virtual_memory_bytes: The amount of virtual memory used by the process.
Set the Panel Title
- Title: Enter Memory Usage.
Select Visualization
- Visualization: Select Time series.
Customize the Visualization
- Axes:
  - Unit: Select bytes (IEC).
- Legend:
  - Enable Show legend and set Placement to Bottom.
- Graph Styles:
  - Use different colors for each memory type.
  - Adjust line styles if desired.
Apply the Panel
- Click Apply.
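The bytes (IEC) unit scales raw byte counts by powers of 1024 (KiB, MiB, GiB). The Python sketch below shows that conversion so the axis labels are easier to interpret; the rounding and spacing are my own choices and may not match Grafana's output exactly.

```python
def format_bytes_iec(num_bytes):
    """Format a byte count using binary (IEC) prefixes."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


print(format_bytes_iec(52_428_800))     # a 50 MiB resident set
print(format_bytes_iec(3_221_225_472))  # 3 GiB of virtual memory
```

Virtual memory is normally much larger than resident memory, so expect the two lines to sit far apart on the same axis.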

Analyze and Refine the Panel

Review the Panel
- The panel shows how memory usage changes over time.
Investigate Memory Patterns
- Look for trends such as steady increases in memory usage, which may indicate memory leaks.

Panel 11: Open File Descriptors (%)

Add a New Panel to the "System Metrics" Row
- Click on Add panel > Add new panel.
Configure the Open File Descriptors Query
- Query Editor:
  process_open_fds{job="$job"} / process_max_fds{job="$job"} * 100
  Explanation:
  - process_open_fds: Number of open file descriptors.
  - process_max_fds: Maximum number of file descriptors that can be opened.
  - Dividing and multiplying by 100 gives the percentage of used file descriptors.
Set the Panel Title
- Title: Enter Open File Descriptors (%).
Select Visualization
- Visualization: Select Stat.
Customize the Visualization
- Field Options:
  - Unit: Select percent (0-100).
  - Thresholds:
    - Set thresholds at:
      - 70% (Yellow)
      - 90% (Red)
  - Color Mode: Choose Background.
- Value Options:
  - Calculation: Select Last (not null).
Apply the Panel
- Click Apply.
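Here is a small Python sketch of the percentage calculation and of how the thresholds translate into a background color. It assumes the base color below the first threshold is green and that a value takes the color of the highest threshold it has reached; both are stated as assumptions, and the numbers are illustrative.

```python
def fd_usage_percent(open_fds, max_fds):
    """Share of the file descriptor limit currently in use, in percent."""
    if max_fds <= 0:
        return None
    return open_fds / max_fds * 100


def threshold_color(value, steps, base="green"):
    """Pick the color of the highest threshold the value has reached."""
    color = base
    for limit, step_color in sorted(steps):
        if value >= limit:
            color = step_color
    return color


steps = [(70, "yellow"), (90, "red")]
for open_fds in (256, 768, 990):
    usage = fd_usage_percent(open_fds, 1024)
    print(f"{usage:.1f}% -> {threshold_color(usage, steps)}")
```

The same threshold logic applies to the CPU Usage panel, which uses the same 70 and 90 limits.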

Analyze and Refine the Panel

Review the Panel
- The panel shows the percentage of open file descriptors in use.
Monitor Resource Usage
- High usage may indicate resource leaks or the need to increase the file descriptor limit.

Panel 12: GC Objects Collected

Add a New Panel to the "Garbage Collection" Row
- In the Garbage Collection row, click on Add panel > Add new panel.
Configure the Garbage Collection Query
- Query Editor:
  sum(rate(python_gc_objects_collected_total{job="$job"}[1m])) by (generation)
  Explanation:
  - python_gc_objects_collected_total: Total number of objects collected by Python's garbage collector.
  - rate(...[1m]): Computes the per-second rate over the last 1 minute.
  - sum(...) by (generation): Aggregates by GC generation (0, 1, 2).
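If you want to see where numbers like these come from, Python exposes per-generation garbage collection counters through gc.get_stats(). The sketch below reads them and computes a per-second rate between two snapshots. That the exported metric mirrors the collected field is my assumption here, and the before/after values are invented.

```python
import gc


def collected_by_generation():
    """Snapshot of objects collected so far, keyed by GC generation."""
    return {str(gen): stats["collected"] for gen, stats in enumerate(gc.get_stats())}


def rate_by_generation(before, after, window_seconds):
    """Per-second collection rate for each generation between two snapshots."""
    return {gen: (after[gen] - before[gen]) / window_seconds for gen in before}


# Live counters from the current interpreter.
snapshot = collected_by_generation()
print(snapshot)

# Two hypothetical snapshots taken 60 seconds apart.
before = {"0": 100, "1": 10, "2": 0}
after = {"0": 700, "1": 40, "2": 0}
print(rate_by_generation(before, after, 60))
```

Generation 0 is collected most often, so it is normal for its line to sit well above the other two.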
Set the Panel Title
- Title: Enter GC Objects Collected.
Select Visualization
- Visualization: Select Time series.
Customize the Visualization
- Axes:
  - Unit: Select short, which formats large numbers with suffixes (K, M).
- Legend:
  - Enable Show legend and set Placement to Bottom.
  - In the Legend field in the query editor, enter Generation {{generation}}.
- Graph Styles:
  - Use distinct colors for each generation.
  - Adjust line width and style as desired.
Apply the Panel
- Click Apply.

Analyze and Refine the Panel

Review the Panel
- The panel shows how many objects are collected in each GC generation over time.
Interpret Garbage Collection Activity
- High rates of object collection may be normal, but sudden spikes could indicate increased memory allocation and deallocation.

Organizing the Dashboard

To enhance readability and structure, let's ensure that all panels are properly arranged within their respective rows and adjust the layout as needed.

Move Panels into Rows
- Drag and drop each panel into its corresponding row if you haven't already.
- For example:
  - Rate: API Throughput.
  - Errors: Requests by Status Code.
  - Duration: API Latency Percentiles, Request Duration Heatmap.
  - System Metrics: CPU Usage, Memory Usage, Open File Descriptors.
  - Garbage Collection: GC Objects Collected.
Adjust Panel Sizes and Layout
- Resize panels within each row for optimal display.
- Arrange panels side by side if appropriate.
Save the Dashboard
- Click on the Save Dashboard icon (diskette) at the top.
- Title: Ensure your dashboard has a meaningful name.
- Folder: Choose or create a folder if desired.
- Click Save.

At this point, you've successfully constructed a comprehensive Grafana dashboard that aligns with both the RED and USE methodologies. Each panel was built step by step, focusing on refining queries, utilizing variables, and customizing visualizations to make your dashboard both informative and beautiful.

By monitoring both application-level metrics (RED) and system-level metrics (USE), you're equipped to gain deep insights into your application's performance and the underlying system resources.
