Nov 17, 2024
Grafana and Prometheus Tutorial: Create a COMPLETE Dashboard
In this tutorial, you will master the process of transforming raw metric data into a Grafana dashboard with insightful visualizations. I will provide you with a pre-configured environment so we can focus entirely on building each dashboard panel step by step.
Monitoring Methodologies
The two methodologies that will guide the creation of our dashboard are RED and USE.
The RED methodology focuses on metrics that affect user experience: Rate, Errors, and Duration.
The USE methodology focuses on system-level metrics: Resource Utilization, Saturation, and Errors.
By creating a dashboard that combines the RED and USE methodologies, we ensure comprehensive monitoring of both user experience and system performance.
Setting Up the Environment
Before starting, make sure you have the same environment as mine by following this tutorial: kubernetestraining.io/blog/prometheus-docker-compose-setup-with-grafana. It will give you a working environment with Prometheus scraping metrics from a Python web application and Grafana connected to Prometheus, ready for us to create our dashboard.
Creating Dashboard Rows
To organize our dashboard effectively and align with the RED and USE methodologies, we'll create rows for each key area. This helps in grouping related panels and improves the readability of the dashboard.
Create a New Dashboard
In Grafana, click on the Plus Icon (Create) > Dashboard.
Add Rows for Each Section
Click on the dropdown arrow next to Add panel.
Select Add a new row.
Title: Enter Quick Indicators.
Repeat the process to add more rows:
Row 2: Title Rate (RED)
Row 3: Title Errors (RED)
Row 4: Title Duration (RED)
Row 5: Title System Metrics (USE)
Row 6: Title Garbage Collection (USE)
Save the Dashboard
Click on the Save Dashboard icon (diskette) at the top.
Title: Enter Comprehensive Monitoring Dashboard or any name you prefer.
Folder: Choose or create a folder if desired.
Click Save.
With the rows in place, we're now ready to start adding panels to each section.
Panel 1: Total HTTP Requests
In this section, we'll start constructing our dashboard by creating panels that will display key metrics. We'll begin with basic queries and progressively enhance them by introducing variables and adjusting time ranges.
Step 1: Create a Basic Panel
Open the Dashboard
Open the dashboard you saved in the previous section.
Click on Add panel > Add new panel.
Configure the Basic Query
Data source: Ensure Prometheus is selected.
Query Editor: Enter the following PromQL query:
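A query consistent with the explanation below might look like this. It assumes the request counter is named http_request_total, the name used later in the API Throughput panel; adjust it to whatever your application exports.

```promql
# Total requests across all scraped targets in the last 5 minutes
sum(increase(http_request_total[5m]))
```

increase() returns how much each counter series grew within the window, and sum() adds those results into a single value.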
Explanation:
This query sums the total number of HTTP requests over the last 5 minutes.
Visualize the Data
Visualization: Select Stat.
Panel Title: Enter Total Requests.
Apply the Panel
Click Apply to add the panel to the dashboard.
Step 2: Analyze the Output
Review the Panel
The panel displays the total number of requests in the last 5 minutes.
However, this includes all jobs and instances scraped by Prometheus, which may not be specific to your application.
Identify the Need to Filter Data
We need to filter the data to show only the requests related to our specific application.
Step 3: Filter by Job
Edit the Panel
Hover over the panel and click on the Panel Title > Edit.
Update the Query to Filter by Job
Modify the query to:
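Assuming the same http_request_total counter as before, the filtered query could be written as:

```promql
# Only count requests reported by the my-app job
sum(increase(http_request_total{job="my-app"}[5m]))
```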
Explanation:
{job="my-app"} filters the metric to include only data from the my-app job.
Replace "my-app" with the actual job name of your application.
Apply the Panel
Click Apply.
Review the Updated Panel
The panel now displays the total requests specific to your application.
Step 4: Avoid Hardcoding the Job Name
Problem with Hardcoding
Hardcoding job="my-app" means the dashboard is not flexible if the job name changes or if you have multiple jobs.
Solution: Use a Variable
We'll create a variable for the job name so the dashboard can dynamically adjust based on the selected job.
Creating Dashboard Variables
To make our dashboard dynamic and avoid hardcoding values, we'll create variables that can be used in our queries.
Access Dashboard Settings
Click on the Gear Icon (Dashboard settings) at the top right.
Add a Variable
In the left menu, select Variables.
Click on the Add variable button.
Configure the job Variable
Name: job
Type: Query
Data source: Select your Prometheus data source.
Query: label_values(job)
label_values() is a query function of Grafana's Prometheus data source rather than PromQL itself. Here it fetches all unique job label values from your metrics.
Regex: Leave blank.
Refresh: On Dashboard Load
Include All Option: Enable (allows selecting all jobs).
Multi-value: Disable (since we'll usually select one job).
Sort: Disabled
Save the Variable
Click on the Update button at the bottom.
Use the Variable in the Query
Return to your panel.
Edit the Panel:
Hover over the panel and click on the Panel Title > Edit.
Update the Query:
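A sketch of the updated query, again assuming the http_request_total counter name:

```promql
# $job is replaced by the value selected in the dashboard dropdown
sum(increase(http_request_total{job="$job"}[5m]))
```

One caveat: when All is selected, Grafana expands $job into a regular expression, so an exact match with = will generally return no data. Switching the matcher to job=~"$job" handles both a single job and All.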
Explanation:
{job="$job"} uses the variable we just created.
Now, the query dynamically filters based on the selected job.
Apply the Panel
Click Apply.
Test the Variable
At the top of the dashboard, you should see a dropdown for job.
Select different jobs (if available) to see how the panel updates.
Refining the Panels with Variables and Time Ranges
Now that we've introduced variables, we'll refine our panels further to make them responsive to the dashboard's time range.
Step 1: Adjusting the Time Range Variable
Understanding the Issue
The [5m] in our query is a static time range (5 minutes).
When we change the dashboard's time range (e.g., Last 1 hour), the panel still calculates over the last 5 minutes.
Use the Dashboard Time Range Variable
Grafana provides a built-in variable, $__range, which represents the current dashboard time range.
Update the Query
Edit the Panel:
Hover over the panel and click on the Panel Title > Edit.
Modify the Query:
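With the same assumed counter name, the query with the dashboard time range might read:

```promql
# $__range expands to the duration of the dashboard's selected time range
sum(increase(http_request_total{job="$job"}[$__range]))
```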
Explanation:
[$__range] uses the dashboard's current time range.
Now, the panel dynamically adjusts based on the selected time interval.
Apply the Panel
Click Apply.
Test the Time Range Adjustment
Change the dashboard time range (e.g., Last 15 minutes, Last 1 hour) and observe how the panel updates accordingly.
Step 2: Proceeding with Other Panels
Now that we've established how to use variables and time ranges, we'll proceed to create other panels using similar methods.
Panel 2: Error Rate
Add a New Panel
Click on Add panel > Add new panel.
Configure the Query
Query Editor:
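One way to express the calculation described below. The counter name http_request_total is taken from the API Throughput panel; the status label is an assumption, so substitute whichever label your application uses for the HTTP status code.

```promql
# Share of requests with a 4xx or 5xx status, as a percentage
sum(increase(http_request_total{job="$job", status=~"4..|5.."}[$__range]))
/
sum(increase(http_request_total{job="$job"}[$__range]))
* 100
```

The division and multiplication are evaluated left to right, so the ratio is computed first and then scaled to a percentage.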
Explanation:
Calculates the percentage of requests resulting in client (4xx) or server (5xx) errors.
Uses the $job variable and $__range for dynamic filtering and time range adjustment.
Set the Panel Title
Title: Enter Error Rate (%).
Select Visualization
Choose Stat.
Customize the Visualization
Field Options:
Unit: Select percent.
Decimal Places: Adjust to show at least two decimal places.
Thresholds:
Set thresholds at:
1% (Yellow)
5% (Red)
Color Mode: Choose Background.
Apply the Panel
Click Apply.
Panel 3: Requests in Progress
Add a New Panel
Click on Add panel > Add new panel.
Configure the Query
Query Editor:
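A possible query for this panel. The gauge name http_requests_in_progress is hypothetical; check your application's /metrics output for the actual name of its in-progress gauge.

```promql
# Requests currently being handled, summed across all instances of the job
# (http_requests_in_progress is an assumed metric name)
sum(http_requests_in_progress{job="$job"})
```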
Explanation:
Displays the current number of in-progress requests.
Uses the $job variable for dynamic filtering.
Set the Panel Title
Title: Enter Requests in Progress.
Select Visualization
Choose Stat.
Customize the Visualization
Field Options:
Unit: Select none.
Color Mode: Choose Background.
Thresholds: Optional; set if desired.
Apply the Panel
Click Apply.
Panel 4: Process Uptime
Add a New Panel
Click on Add panel > Add new panel.
Configure the Query
Query Editor:
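The subtraction described below can be sketched as follows, assuming the application exposes process_start_time_seconds, the start-time gauge published by the standard process collector in Prometheus client libraries:

```promql
# Seconds elapsed since the process started
time() - process_start_time_seconds{job="$job"}
```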
Explanation:
Calculates the process uptime by subtracting the process start time from the current time.
Uses the $job variable for dynamic filtering.
Set the Panel Title
Title: Enter Process Uptime.
Select Visualization
Choose Stat.
Customize the Visualization
Field Options:
Unit: Select dthms (days, hours, minutes, seconds).
Color Mode: Choose Background.
Value Mappings:
Map No Value to display text DOWN with color Red.
Thresholds:
Set a threshold to display uptime in Green if the value is above 0.
Apply the Panel
Click Apply.
At this point, you've successfully created four essential panels for your Grafana dashboard:
Total Requests: Displays the total number of requests over the selected time range.
Error Rate (%): Shows the percentage of requests resulting in errors.
Requests in Progress: Indicates the current number of in-progress requests.
Process Uptime: Displays how long the application process has been running.
These panels are dynamic and adjust based on the selected job and time range, providing valuable insights into your application's performance.
With the foundational panels in place, we'll now proceed to build additional panels that provide deeper insights into our application's performance and system health. These panels align with both the RED and USE methodologies, covering aspects such as throughput, latency, system resource utilization, and more.
Panel 5: API Throughput
Add a New Panel to the "Rate" Row
In the Rate row, click on Add panel > Add new panel.
Configure the Query
Data source: Ensure Prometheus is selected.
Query Editor: Enter the following PromQL query:
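Assembling the pieces listed in the explanation below gives a query along these lines:

```promql
# Requests per second, one series per HTTP method and path
sum(rate(http_request_total{job="$job"}[1m])) by (method, path)
```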
Explanation:
http_request_total: The metric tracking total HTTP requests.
{job="$job"}: Filters the metric by the selected job variable.
rate(...[1m]): Calculates the per-second rate over the last 1 minute.
sum(...) by (method, path): Aggregates the rate, grouping by HTTP method and path.
Set the Panel Title
Title: Enter API Throughput.
Select Visualization
Visualization: Select Time series.
Customize the Visualization
Legend:
In the Legend field in the query editor, enter {{method}} {{path}}.
Enable Show legend and set Placement to Bottom.
Axes:
Unit: Select reqps (requests per second).
Apply the Panel
Click Apply to add the panel to the dashboard.
Analyze and Refine the Panel
Review the Panel
The panel displays the rate of requests per second for each API endpoint, differentiated by method and path.
Handle Excessive Lines
If you have many endpoints, the graph may become cluttered.
You can filter specific endpoints by modifying the query or using additional variables (e.g., a path variable).
Panel 6: Requests by Status Code
Add a New Panel to the "Errors" Row
In the Errors row, click on Add panel > Add new panel.
Configure the Query
Query Editor: Enter:
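A sketch of such a query. The status label name and the 1-minute window are assumptions; the counter name follows the API Throughput panel.

```promql
# Requests per second, one series per HTTP status code
sum(rate(http_request_total{job="$job"}[1m])) by (status)
```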
Explanation:
Aggregates the rate of requests by their HTTP status codes.
Set the Panel Title
Title: Enter Requests by Status Code.
Select Visualization
Visualization: Select Pie chart.
Note: If the Pie Chart visualization is not available, you may need to install the Pie Chart plugin from the Grafana Plugins page.
Customize the Visualization
Pie Chart Options:
Set Pie Type to Donut.
Enable Labels to display the status codes.
Display: Set to Percentage.
Legend:
Enable Show legend and set Placement to Bottom.
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The pie chart displays the distribution of requests by status code, helping you identify the proportion of successful vs. error responses.
Panel 7: API Latency Percentiles
Add a New Panel to the "Duration" Row
In the Duration row, click on Add panel > Add new panel.
Configure Multiple Queries for Percentiles
Query A (p50):
Legend: Set to p50.
Query B (p90):
Legend: Set to p90.
Query C (p95):
Legend: Set to p95.
Query D (p99):
Legend: Set to p99.
Explanation:
histogram_quantile(φ, ...): Calculates the φ-quantile (e.g., 0.5 for the 50th percentile) from histogram data.
rate(...[1m]): Computes the per-second average rate over the last 1 minute.
sum(...) by (le): Aggregates the data across all instances, grouping by the upper bound of histogram buckets (le).
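Queries matching the description above could be written as follows, with each expression entered in its own query row. The histogram name http_request_duration_seconds_bucket is hypothetical; use the _bucket series of whichever duration histogram your application exports.

```promql
# Query A (p50)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))

# Query B (p90)
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))

# Query C (p95)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))

# Query D (p99)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le))
```

The le label must survive the aggregation, because histogram_quantile() needs the bucket boundaries to interpolate the quantile.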
Set the Panel Title
Title: Enter API Latency Percentiles.
Select Visualization
Visualization: Select Time series.
Customize the Visualization
Axes:
Unit: Select s (seconds).
Legend:
Ensure the legend displays the percentile labels (p50, p90, etc.).
Place the legend at the bottom for better readability.
Graph Styles:
Assign different colors to each percentile for distinction.
Adjust line styles (solid, dashed) if desired.
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The panel shows how different latency percentiles vary over time, helping you understand the distribution of response times.
Investigate High Latencies
If higher percentiles (e.g., p99) show spikes, it may indicate performance issues that need investigation.
Panel 8: Request Duration Heatmap
Add a New Panel to the "Duration" Row
Click on Add panel > Add new panel.
Configure the Heatmap Query
Query Editor:
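A query in line with the explanation below, assuming a duration histogram named http_request_duration_seconds_bucket (a hypothetical name) and a 1-minute rate window:

```promql
# Per-second rate of observations falling into each histogram bucket
sum(rate(http_request_duration_seconds_bucket{job="$job"}[1m])) by (le)
```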
Explanation:
Aggregates the rate of requests into histogram buckets based on their durations (le labels).
Set the Panel Title
Title: Enter Request Duration Heatmap.
Select Visualization
Visualization: Select Heatmap.
Customize the Visualization
Heatmap Options:
Data Format: Ensure it's set to Time series buckets.
Axes:
Y-Axis Unit: Select s (seconds).
Color Scheme:
Choose a color scheme that highlights intensity, such as Spectral or Viridis.
Y-Axis Settings:
Set the scale to Logarithmic if the data spans multiple orders of magnitude.
Enable Show grid lines for better readability.
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The heatmap displays the distribution of request durations over time, allowing you to spot patterns and anomalies.
Adjust Bucketing (If Necessary)
If the heatmap is not displaying data effectively, you may need to adjust the bucket boundaries. These are defined where the histogram is declared in your application's instrumentation, not in the Prometheus configuration.
Panel 9: CPU Usage (%)
Add a New Panel to the "System Metrics" Row
In the System Metrics row, click on Add panel > Add new panel.
Configure the CPU Usage Query
Query Editor:
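Based on the explanation below, the query might look like this; the $job filter is an assumption carried over from the earlier panels.

```promql
# CPU seconds consumed per second of wall-clock time, as a percentage
rate(process_cpu_seconds_total{job="$job"}[1m]) * 100
```

A value of 100 corresponds to one fully used CPU core, so a process running on several cores can report more than 100.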
Explanation:
process_cpu_seconds_total: Total user and system CPU time spent in seconds.
rate(...[1m]): Calculates the per-second rate over the last 1 minute.
Multiplying by 100 expresses the result as a percentage.
Set the Panel Title
Title: Enter CPU Usage (%).
Select Visualization
Visualization: Select Stat.
Customize the Visualization
Field Options:
Unit: Select percent (0-100).
Thresholds:
Set thresholds at:
70% (Yellow)
90% (Red)
Color Mode: Choose Background.
Value Options:
Calculation: Select Last (not null).
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The panel displays the current CPU usage percentage.
Verify the Calculation
Ensure that the CPU usage aligns with expectations. If the value seems incorrect, double-check the query and time window.
Panel 10: Memory Usage
Add a New Panel to the "System Metrics" Row
Click on Add panel > Add new panel.
Configure Memory Usage Queries
Query A (Resident Memory):
Legend: Set to Resident Memory.
Query B (Virtual Memory):
Legend: Set to Virtual Memory.
Explanation:
process_resident_memory_bytes: The amount of memory the process has in RAM.
process_virtual_memory_bytes: The amount of virtual memory used by the process.
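The two queries could be as simple as the following, each entered in its own query row. Filtering by $job is an assumption that mirrors the earlier panels.

```promql
# Query A (Resident Memory)
process_resident_memory_bytes{job="$job"}

# Query B (Virtual Memory)
process_virtual_memory_bytes{job="$job"}
```

Both metrics are gauges, so no rate() is needed.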
Set the Panel Title
Title: Enter Memory Usage.
Select Visualization
Visualization: Select Time series.
Customize the Visualization
Axes:
Unit: Select bytes (IEC).
Legend:
Enable Show legend and set Placement to Bottom.
Graph Styles:
Use different colors for each memory type.
Adjust line styles if desired.
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The panel shows how memory usage changes over time.
Investigate Memory Patterns
Look for trends such as steady increases in memory usage, which may indicate memory leaks.
Panel 11: Open File Descriptors (%)
Add a New Panel to the "System Metrics" Row
Click on Add panel > Add new panel.
Configure the Open File Descriptors Query
Query Editor:
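The ratio described below can be sketched like this, with the $job filter assumed as in the other panels:

```promql
# Open file descriptors as a percentage of the process limit
process_open_fds{job="$job"} / process_max_fds{job="$job"} * 100
```

Prometheus pairs the two metrics by their matching labels, so each instance gets its own percentage.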
Explanation:
process_open_fds: Number of open file descriptors.
process_max_fds: Maximum number of file descriptors that can be opened.
Dividing and multiplying by 100 gives the percentage of used file descriptors.
Set the Panel Title
Title: Enter Open File Descriptors (%).
Select Visualization
Visualization: Select Stat.
Customize the Visualization
Field Options:
Unit: Select percent (0-100).
Thresholds:
Set thresholds at:
70% (Yellow)
90% (Red)
Color Mode: Choose Background.
Value Options:
Calculation: Select Last (not null).
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The panel shows the percentage of open file descriptors in use.
Monitor Resource Usage
High usage may indicate resource leaks or the need to increase the file descriptor limit.
Panel 12: GC Objects Collected
Add a New Panel to the "Garbage Collection" Row
In the Garbage Collection row, click on Add panel > Add new panel.
Configure the Garbage Collection Query
Query Editor:
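Putting together the parts named in the explanation below, the query might read as follows; the $job filter is assumed.

```promql
# Objects collected per second, one series per GC generation
sum(rate(python_gc_objects_collected_total{job="$job"}[1m])) by (generation)
```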
Explanation:
python_gc_objects_collected_total: Total number of objects collected by Python's garbage collector.
rate(...[1m]): Computes the per-second rate over the last 1 minute.
sum(...) by (generation): Aggregates by GC generation (0, 1, 2).
Set the Panel Title
Title: Enter GC Objects Collected.
Select Visualization
Visualization: Select Time series.
Customize the Visualization
Axes:
Unit: Select short, which formats large numbers with suffixes (K, M).
Legend:
Enable Show legend and set Placement to Bottom.
In the Legend field in the query editor, enter Generation {{generation}}.
Graph Styles:
Use distinct colors for each generation.
Adjust line width and style as desired.
Apply the Panel
Click Apply.
Analyze and Refine the Panel
Review the Panel
The panel shows how many objects are collected in each GC generation over time.
Interpret Garbage Collection Activity
High rates of object collection may be normal, but sudden spikes could indicate increased memory allocation and deallocation.
Organizing the Dashboard
To enhance readability and structure, let's ensure that all panels are properly arranged within their respective rows and adjust the layout as needed.
Move Panels into Rows
Drag and drop each panel into its corresponding row if you haven't already.
For example:
Rate: API Throughput.
Errors: Requests by Status Code.
Duration: API Latency Percentiles, Request Duration Heatmap.
System Metrics: CPU Usage, Memory Usage, Open File Descriptors.
Garbage Collection: GC Objects Collected.
Adjust Panel Sizes and Layout
Resize panels within each row for optimal display.
Arrange panels side by side if appropriate.
Save the Dashboard
Click on the Save Dashboard icon (diskette) at the top.
Title: Ensure your dashboard has a meaningful name.
Folder: Choose or create a folder if desired.
Click Save.
At this point, you've successfully constructed a comprehensive Grafana dashboard that aligns with both the RED and USE methodologies. Each panel was built step by step, focusing on refining queries, utilizing variables, and customizing visualizations to make your dashboard both informative and beautiful.
By monitoring both application-level metrics (RED) and system-level metrics (USE), you're equipped to gain deep insights into your application's performance and the underlying system resources.