Oct 9, 2024

Prometheus Query Language (PromQL) Tutorial

The Prometheus Query Language (PromQL) is used to query data stored in the Prometheus time-series database. This tutorial shows you how to query various metric types (counters, gauges, histograms) with PromQL and how to use PromQL to build a Grafana dashboard.

Table of Contents

  1. Setting Up the Environment

  2. Metric Types

    • 2.1 Counter

    • 2.2 Gauge

    • 2.3 Histogram

  3. Accessing Prometheus

  4. The Prometheus Time-Series Database

  5. Labels

  6. PromQL

    • 6.1 Range Vector

    • 6.2 Instant Vector

    • 6.3 Label Matching

    • 6.4 Aggregation

    • 6.5 Grouping and Aggregation

    • 6.6 Aggregation over Time

  7. Setting Up Grafana

    • Setting up the Data Source

    • Import a Dashboard

  8. Analyzing each PromQL Query

  9. Conclusion

Setting Up the Environment

Before we dive into PromQL, let's set up our Prometheus environment.

  1. Clone the repository:

    git clone https://github.com/rslim087a/prometheus-docker-compose
    cd prometheus-docker-compose

    
  2. Start the environment:

    docker compose up

This command will start Prometheus, Grafana, and a sample application that exposes metrics. For more details on how this setup works, refer to our previous article: Prometheus Docker Compose Setup with Grafana.

Metric Types

Before we start querying with PromQL, it's crucial to understand the different types of metrics our application exposes.

First, make a GET request to localhost:8000 to confirm the app is running. Then, navigate to localhost:8000/metrics to view all the metrics.

  1. Counter
# HELP http_request_total Total HTTP Requests
# TYPE http_request_total counter
http_request_total{method="GET",path="/metrics",status="200"} 1.0
http_request_total{method="GET",path="/",status="200"} 1.0

# HELP provides a description of the metric, and # TYPE indicates the metric type. http_request_total is a counter because its value only ever increases.

# HELP http_request_total Total HTTP Requests
# TYPE http_request_total counter

The metric name is followed by labels (in curly braces) that specify attributes like method, path, and status, each with a corresponding value. This structure allows the metric to be split across multiple time series, each representing a unique combination of label values.

http_request_total{method="GET",path="/metrics",status="200"} 1.0
http_request_total{method="GET",path="/",status="200"} 1.0

For instance, if you make a request to http://localhost:8000/, the counter for that specific path will increase, while other paths remain unaffected. Refreshing the metrics page (/metrics) will increment its respective counter. For example, after these actions, you might see:

http_request_total{method="GET",path="/metrics",status="200"} 2.0
http_request_total{method="GET",path="/",status="200"} 2.0

This granularity enables precise tracking of requests across different endpoints and HTTP methods.
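To make the label mechanics concrete, here is a minimal, hypothetical Python sketch (not the sample app's actual instrumentation, which would normally use a client library such as prometheus_client) showing how a labeled counter yields one time series per unique label combination:

```python
# Hypothetical sketch of a labeled counter: each unique label
# combination gets its own independent running total.
from collections import defaultdict

class LabeledCounter:
    def __init__(self, name):
        self.name = name
        self.series = defaultdict(float)  # label tuple -> current value

    def inc(self, method, path, status):
        self.series[(method, path, status)] += 1.0

    def render(self):
        """Render each series in the exposition format shown above."""
        lines = []
        for (method, path, status), value in sorted(self.series.items()):
            lines.append(
                f'{self.name}{{method="{method}",path="{path}",status="{status}"}} {value}'
            )
        return lines

requests = LabeledCounter("http_request_total")
requests.inc("GET", "/metrics", "200")
requests.inc("GET", "/", "200")
requests.inc("GET", "/metrics", "200")  # refreshing /metrics only bumps its own series
print("\n".join(requests.render()))
```

Incrementing one label combination leaves every other series untouched, which is exactly why requests to "/" and "/metrics" are tracked independently.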

  2. Gauge
# HELP process_cpu_usage Current CPU usage in percent
# TYPE process_cpu_usage gauge

Gauge metrics represent values that can both increase and decrease. They're typically used for measuring current states that fluctuate, such as CPU usage, memory consumption, or concurrent requests. In this example, process_cpu_usage shows the current CPU usage percentage; if you refresh the metrics page, you may see this value change.
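A tiny sketch of gauge semantics (illustrative only; a real CPU gauge is set from process statistics, not by hand):

```python
# Sketch of gauge semantics: unlike a counter, a gauge can move in
# both directions, because it records a current state rather than a
# running total.
class Gauge:
    def __init__(self):
        self.value = 0.0

    def set(self, v):  # gauges are typically set to the latest reading
        self.value = v

g = Gauge()
g.set(0.42)  # CPU usage rises...
g.set(0.17)  # ...and can fall again -- a counter could never do this
print(g.value)
```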

  3. Histogram
# HELP http_request_duration_seconds HTTP Request Duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005",method="GET",path="/",status="200"} 1.0
http_request_duration_seconds_bucket{le="0.01",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.025",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.05",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.075",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.1",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.25",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.75",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="1.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="2.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="5.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="7.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="10.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_count{method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_sum{method="GET",path="/",status="200"} 0.012

Histogram metrics group observed values into predefined ranges (buckets). For request durations, for example, the histogram counts how many requests fall into each duration range (e.g., 0-5 ms, 5-10 ms, 10-25 ms). This grouping helps us quickly see patterns (e.g., most requests take 5-10 ms) and spot outliers (e.g., a few requests taking >100 ms).

In this example, each line represents a bucket with an upper bound (le = less than or equal to):

http_request_duration_seconds_bucket{le="0.005",method="GET",path="/",status="200"} 1.0
http_request_duration_seconds_bucket{le="0.01",method="GET",path="/",status="200"} 2.0

Here, 1 request took ≤0.005 seconds, and 2 requests took ≤0.01 seconds. This means the second request took between 0.005 and 0.01 seconds.

The value of each bucket is cumulative, so all subsequent buckets also show 2.

http_request_duration_seconds_bucket{le="0.005",method="GET",path="/",status="200"} 1.0
http_request_duration_seconds_bucket{le="0.01",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.025",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.05",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.075",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.1",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.25",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="0.75",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="1.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="2.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="5.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="7.5",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="10.0",method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/",status="200"} 2.0

The bottom of the histogram exposes two summary series: _count, the total number of requests observed, and _sum, the combined duration of all requests.

http_request_duration_seconds_count{method="GET",path="/",status="200"} 2.0
http_request_duration_seconds_sum{method="GET",path="/",status="200"} 0.012
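The cumulative-bucket behavior can be sketched in a few lines of Python. The boundaries below mirror the le labels in the output above; the two sample durations are assumed values chosen to reproduce the sample counts:

```python
# Sketch of how a histogram turns raw observations into cumulative
# buckets. Boundaries match the `le` labels shown above.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
          0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, float("inf")]

def bucketize(observations):
    """Return {le: count}; cumulative: each bucket counts all values <= le."""
    return {le: sum(1 for v in observations if v <= le) for le in BOUNDS}

# Two requests: one at 4 ms, one at 8 ms (assumed, to match the sample output).
durations = [0.004, 0.008]
buckets = bucketize(durations)
print(buckets[0.005])               # 1 request took <= 0.005 s
print(buckets[0.01])                # 2 requests took <= 0.01 s
print(buckets[float("inf")])        # the +Inf bucket always equals _count
print(round(sum(durations), 3))     # _sum: combined duration of all requests
```

Because every observation that fits a bucket also fits all larger buckets, the counts can only stay flat or grow as le increases.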

Accessing Prometheus

Now that we've examined the metrics exposed by our application, let's confirm that Prometheus is successfully scraping these metrics.

  1. In your web browser, navigate to http://localhost:9090 to access the Prometheus UI.

  2. In the query bar at the top of the page, enter the following PromQL query:

http_request_total

  3. Click the "Execute" button or press Enter.

You should see results similar to this:

http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/", status="200"} 1
http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/metrics", status="200"} 2

If you see these results, it means Prometheus is correctly scraping and storing the metrics from our application, and we're ready to start exploring more complex PromQL queries.

The Prometheus Time-Series Database

Prometheus scrapes data at regular intervals and stores it in a time-series database. A time series is a sequence of data points, where each data point is a timestamp associated with a value.

<timestamp> <value>
<timestamp> <value>
<timestamp> <value>

Let's assume Prometheus is configured to scrape the Python application every 20 seconds, and that over the course of 200 seconds (10 scrapes), 5 requests were made to the root path:

http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/", status="200"} 5

The data stored inside of Prometheus might look something like this:

{__name__="http_request_total", method="GET", status="200", path="/"}:

1623701320 1.0
1623701340 1.0
1623701360 2.0
1623701380 2.0
1623701400 3.0
1623701420 3.0
1623701440 4.0
1623701460 5.0
1623701480 5.0
1623701500 5.0

This time series applies to the metric http_request_total with labels:

method="GET", status="200", path="/"

The first number in each pair is a Unix timestamp (seconds since January 1, 1970), and the second number is the value of the counter at that time. Note how:

  1. The value starts at 1.0 and never decreases (characteristic of a counter)

  2. It doesn't change every scrape, reflecting periods where no new requests were made

  3. By the end, it reaches 5.0, representing the initial request plus the 4 additional requests made during this period
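The three observations above can be checked directly against the sample series with a small Python sketch:

```python
# The scraped series above as (timestamp, value) pairs, verifying the
# counter properties just described: the value never decreases, and the
# total growth over the window is last - first.
samples = [
    (1623701320, 1.0), (1623701340, 1.0), (1623701360, 2.0),
    (1623701380, 2.0), (1623701400, 3.0), (1623701420, 3.0),
    (1623701440, 4.0), (1623701460, 5.0), (1623701480, 5.0),
    (1623701500, 5.0),
]

values = [v for _, v in samples]
# Monotonic: a counter never drops (until the process restarts).
assert all(b >= a for a, b in zip(values, values[1:]))

increase = values[-1] - values[0]
print(increase)  # 4.0 -> the 4 additional requests made after the first one
```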

Labels

Labels split metrics into multiple time series, allowing us to track different aspects of the same metric. For example:

{__name__="http_request_total", method="GET", status="200", path="/"}:

1623701320 1.0
1623701340 1.0
1623701360 2.0
...
{__name__="http_request_total", method="GET", status="200", path="/items"}:

1623701320 1.0
1623701340 1.0
1623701360 1.0
1623701380 2.0

Here, the path label differentiates between requests made to "/" and "/items".

PromQL

Let's begin by examining how we can visualize time series data using PromQL. We'll use the `http_request_total` metric for our examples.

Range Vector

http_request_total[5m]

This query returns a range vector: every data point recorded for each matching series over the last 5 minutes. You'll see output similar to:

http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/", status="200"}
1683000060 3.0
1683000075 3.0
1683000090 4.0
1683000105 5.0
1683000120 5.0
1683000135 5.0

This output reflects how Prometheus stores data points at regular intervals, allowing us to see how the metric changes over time.

Instant Vector

To get the most recent value for each time series, we can simply omit the time range:

http_request_total

This query returns the latest value that Prometheus has scraped for each time series of the http_request_total metric.

http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/", status="200"} 5
http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/metrics", status="200"} 8
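Besides the expression browser, Prometheus serves instant queries over its HTTP API at /api/v1/query. A small sketch that builds such a request URL (assuming the default localhost:9090 address from this setup):

```python
# Sketch: the same instant-vector query issued through Prometheus's
# HTTP API (GET /api/v1/query). Only the URL is built here; actually
# fetching it requires the Prometheus container to be running.
from urllib.parse import urlencode

def instant_query_url(base, promql):
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

url = instant_query_url("http://localhost:9090", "http_request_total")
print(url)

# With the stack running, you could fetch the JSON result, e.g.:
#   import urllib.request, json
#   data = json.load(urllib.request.urlopen(url))
#   print(data["data"]["result"])
```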

Label Matching

We can use label matching to focus on specific metrics. Let's look at requests to the root path:

http_request_total{path="/"}

By adding the path="/" label matcher, we've refined our query to track only the requests made to the root path.

http_request_total{instance="fastapi-app:8000", job="fastapi-app", method="GET", path="/", status="200"} 5

Aggregation

Aggregation functions take an instant vector (which contains the most recent value of every matching time series) as input and collapse it into a single value. Here are some key functions:

  1. sum(v instant-vector): Calculates the sum of all values in the vector

sum(http_request_total)

This output shows the total number of HTTP requests across all paths and methods.

1080


  2. avg(v instant-vector): Calculates the average of all values in the vector

avg(process_cpu_usage)

This result indicates an average CPU usage of 65% across all instances.

0.65


  3. max(v instant-vector): Returns the maximum value from all values in the vector

max(http_response_time_seconds)

This shows the highest response time across all endpoints.

1.23


  4. min(v instant-vector): Returns the minimum value from all values in the vector

min(process_memory_bytes)

This displays the lowest memory usage among all monitored processes.

52428800
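Conceptually, each of these functions reduces the set of latest values to a single number. A Python sketch with made-up sample values:

```python
# Sketch of aggregation: take the most recent value of every matching
# series (an instant vector) and reduce them to one number.
instant_vector = {
    ('method="GET"', 'path="/"'): 800.0,
    ('method="GET"', 'path="/metrics"'): 200.0,
    ('method="POST"', 'path="/items"'): 80.0,
}

values = list(instant_vector.values())
print(sum(values))                # sum():  1080.0
print(sum(values) / len(values))  # avg():  360.0
print(max(values))                # max():  800.0
print(min(values))                # min():  80.0
```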

Grouping and Aggregation

PromQL also lets you group the elements of an instant vector before aggregating them, producing a smaller vector with one element per group. Here are some examples:

  1. sum by (label)(v instant-vector): groups values based on a label and aggregates each group into a sum. For example:

sum by (method)(http_request_total)

This groups HTTP requests by their method label and sums the totals within each group:

{method="GET"} 800
{method="POST"} 200
{method="PUT"} 80


  2. avg by (label)(v instant-vector): groups values based on a label and aggregates each group into an average. For example:

avg by (pod)(process_cpu_usage)

This shows the average CPU usage for each pod:

{pod="app-pod-1"} 0.75
{pod="app-pod-2"} 0.55
{pod="app-pod-3"} 0.60
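A Python sketch of what sum by (method) does conceptually, using made-up values consistent with the output above:

```python
# Sketch of `sum by (method)`: partition series by one label, then sum
# within each partition, yielding one element per group. Labels not
# named in the `by` clause (here, path) are dropped.
from collections import defaultdict

series = {
    ("GET", "/"): 600.0,
    ("GET", "/metrics"): 200.0,
    ("POST", "/items"): 200.0,
    ("PUT", "/items"): 80.0,
}

def sum_by_method(series):
    grouped = defaultdict(float)
    for (method, _path), value in series.items():
        grouped[method] += value  # path is discarded by the grouping
    return dict(grouped)

print(sum_by_method(series))  # {'GET': 800.0, 'POST': 200.0, 'PUT': 80.0}
```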

Aggregation over Time

Aggregation over time functions take a range vector as input and produce an instant vector. Here are some key functions:

  1. sum_over_time(v range-vector): Calculates the sum of all values in the specified range vector for each time series

sum_over_time(http_request_total[1h])

This query adds up all samples recorded over the last hour for each time series. The output is an instant vector with one element per unique combination of labels:

{instance="fastapi-app:8000", method="GET", path="/"} 720
{instance="fastapi-app:8000", method="GET", path="/metrics"} 360


  1. avg_over_time(v range-vector): Calculates the average of all values in the specified range vector for each time series

avg_over_time(process_cpu_usage[1h])

This query calculates the average CPU usage over the last hour for each time series. The output is an instant vector with a single element because this metric carries only one set of labels:

{instance="fastapi-app:8000"} 0.65


  3. rate(v range-vector): Calculates the per-second average rate of increase of the time series in the specified range vector.

rate(http_request_total[1m])

This query calculates the average per-second rate of HTTP requests over the last minute for every time series.

The output displays an instant vector with two elements, showing the per-second rates based on the last minute of data:

  • 0.2 requests per second for the root path ("/")

  • 0.1 requests per second for the "/metrics" path:

{instance="fastapi-app:8000", method="GET", path="/"} 0.2
{instance="fastapi-app:8000", method="GET", path="/metrics"} 0.1


  4. increase(v range-vector): Calculates the increase in the value of the time series over the specified range vector.

increase(http_request_total[1h])

This query shows the total increase in HTTP requests over the last hour for every time series:

The output displays an instant vector with two elements, showing the total increase over the last hour:

  • 720 additional requests for the root path ("/")

  • 360 additional requests for the "/metrics" path:

{instance="fastapi-app:8000", method="GET", path="/"} 720
{instance="fastapi-app:8000", method="GET", path="/metrics"} 360
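A simplified Python sketch of increase and rate with assumed sample data (real PromQL additionally handles counter resets and extrapolates to the exact window boundaries):

```python
# Sketch of increase() and rate() over a range vector: increase is the
# counter's growth across the window; rate is that growth per second.
samples = [  # assumed (timestamp, value) pairs for one series over 60 s
    (1683000060, 3.0), (1683000075, 6.0), (1683000090, 9.0),
    (1683000105, 12.0), (1683000120, 15.0),
]

def increase(samples):
    return samples[-1][1] - samples[0][1]

def rate(samples):
    span = samples[-1][0] - samples[0][0]  # seconds covered by the window
    return increase(samples) / span

print(increase(samples))  # 12.0 -> 12 new requests over the window
print(rate(samples))      # 0.2  -> 0.2 requests per second
```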

Setting Up Grafana

Access Grafana at http://localhost:3000. The default login is admin/admin.

Setting up the Data Source

In order for Grafana to query the Prometheus data, we need to set up Prometheus as a data source:

  1. Click on Settings (Gear Icon)

  2. Go to Configuration > Data Sources.

  3. Click "Add data source" and select Prometheus.

  4. Set the URL to http://prometheus:9090.

    • We use prometheus:9090 instead of localhost:9090 because Grafana and Prometheus are on the same Docker network, and prometheus resolves to the Prometheus container's IP.

  5. Click "Save & Test" to ensure the connection is working.

Import a Dashboard

Go to Dashboards > Import and paste the JSON from grafana-dashboard.json:

Each panel in the dashboard uses a PromQL query to visualize metrics from your FastAPI application.

Analyzing each PromQL Query

Now, let's break down the PromQL queries used in each panel of our Grafana dashboard:

1. Request Rate Panel

rate(http_request_total[1m])

This query calculates the per-second average rate of HTTP requests over the last minute for every time series.

2. Average Response Time Panel

rate(http_request_duration_seconds_sum[1m]) / rate(http_request_duration_seconds_count[1m])

This query calculates the average response time by dividing the rate of increase of the summed request durations by the rate of increase of the request count.
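A quick worked example of this division, with assumed numbers:

```python
# If the duration sum grew by 1.2 s and the request count by 12 over
# the same 60 s window, the per-second rates divide out to the average
# latency per request.
rate_of_sum = 1.2 / 60    # seconds of request time accumulated per second
rate_of_count = 12 / 60   # requests completed per second

avg_response_time = rate_of_sum / rate_of_count
print(avg_response_time)  # -> 0.1 s (100 ms) average per request
```

Note that the window length cancels out: the result depends only on total duration divided by total requests over the same range.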

3. Memory Usage Panel

process_resident_memory_bytes

This query directly uses the process_resident_memory_bytes gauge metric to display current memory usage.

4. CPU Usage Panel

process_cpu_usage

This query uses the process_cpu_usage gauge metric to show current CPU usage.

Each of these queries utilizes concepts we've discussed earlier in this tutorial, demonstrating how PromQL can be used to create insightful visualizations of your application's performance.

Conclusion

This tutorial covered PromQL basics, from simple queries to complex aggregations and time-based operations. We've explored how to use PromQL to extract insights from Prometheus metrics and create Grafana visualizations. With this knowledge, you can now effectively monitor and analyze your applications using Prometheus and Grafana. Remember, mastering PromQL comes with practice. Experiment with different queries to gain valuable insights into your system's performance.

