Prometheus is an open-source monitoring system; since its inception in 2012 it has grown into a standalone open-source project. Software developer Jakub Dziworski found it very useful but had a few questions when it came to the bigger picture: what to monitor? To help answer these questions for others, Jakub has written this article on how to create a monitoring solution using Prometheus.
'When I first started using Prometheus, I found the documentation very useful for describing the features of the library. I was, however, a little bit lost when it came to the bigger picture. What should be monitored? What is the best method to monitor specific cases? Back then, I would have really appreciated a common set of metrics for the typical systems most of us develop on a daily basis.
In this post I would like to present how to create such a monitoring solution.
What to monitor?
The very first thing to do when building a monitoring system is to decide what should actually be monitored.
Over time and a few projects, I have learned that the items worth monitoring are:
- Inputs to the system
- Outputs from the system
- Important/time-consuming operations within the system
- Resources
The most common inputs to a system are:
- HTTP endpoints
- Message broker consumers (e.g. Kafka Consumer)
- Reading from databases/files/services
The most common outputs from the system are:
- Saving to databases/files
- Calling external services
- Producing to a message broker
Examples of important/time-consuming operations within a system are:
- One cycle of batch processing (e.g. once a day you check all users' transactions and generate a report)
- Creating and signing a bitcoin transaction
Subject of monitoring
With this brief introduction, let's take a look at how to create a monitoring system for a social media app. This app has the following components:
- HTTP endpoints 'POST /users' and 'GET /users/{id}'. Used for registering a new user and getting their summary.
- PostgreSQL 'users' table. Used for saving and reading user data.
- Auth service. Called when any endpoint is triggered to validate permissions.
- Kafka topic 'posts'. Used for displaying posts created by friends a user is connected with.
- A time-consuming process of generating a video for friends who have known each other for a couple of years. Triggered every night.
Incoming HTTP requests
There are many libraries for integrating HTTP server-side frameworks with Prometheus. At SoftwareMill we mainly use Akka HTTP, so we go with Prometheus Akka HTTP.
Such libraries usually expose a Prometheus histogram and allow you to add labels with the names of the endpoints.
pathPrefix("users") {
recordResponseTime("post_user_endpoint_label") {
(post & pathEndOrSingleSlash) { /*….*/ }
} ~
recordResponseTime("get_user_endpoint_label") {
(get & path(JavaUUID) ) { /* …. */ }
}
}
A Prometheus histogram exposes, among others, two metrics: a count of observations and a sum of their durations.
Having such data, we can plot requests per second and average request duration.
- Requests per second (all endpoints combined, all labels aggregated with sum):
'sum(rate(http_request_duration_count[1m]))'
- Average request duration (all endpoints combined, all labels aggregated with avg):
'avg(rate(http_request_duration_sum[1m])/rate(http_request_duration_count[1m]))'
- Requests per second (a separate graph for each endpoint label):
'rate(http_request_duration_count[1m])'
- Average request duration (a separate graph for each endpoint label):
'rate(http_request_duration_sum[1m])/rate(http_request_duration_count[1m])'
Database
There are Prometheus metrics exporters for databases. Often, however, things can get stuck on the client side, so I find measuring queries from the client very handy.
There are many different databases and libraries used to interact with them. Most of them provide some kind of generic callback that gives you information about queries and their duration. You can start a timer there and observe the duration. I also recommend adding a label to this metric with the query statement itself. This will let you find out which kinds of queries tend to be the most time-consuming. If you use this approach, remember to strip the query parameters so that you get one label per query type instead of millions of useless labels.
In our case we use PostgreSQL and interact with it using ScalikeJDBC. This is how you can hook it up using this library:
val sqlHistogram = Histogram
  .build()
  .name("sql_duration")
  .help("Sql statement duration in seconds.")
  .labelNames("statement")
  .register()

GlobalSettings.queryCompletionListener = (sql: String, params: Seq[Any], millis: Long) => {
  // Strip parameter placeholders so there is one label per query type,
  // not one label per parameter combination.
  val sqlWithoutParamList = sql
    .replaceAll(", \\?", "")
    .replaceAll("JOIN \\(VALUES .* vals", "JOIN VALUES (?) vals")
  sqlHistogram
    .labels(sqlWithoutParamList)
    .observe(millis / 1000.0)
}
Exactly the same queries we used for HTTP requests can be used to measure query count and average duration. Just use 'sql_duration' instead of 'http_request_duration'.
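For example, the analogous queries would be:
- SQL queries per second (all statements combined):
'sum(rate(sql_duration_count[1m]))'
- Average SQL query duration:
'avg(rate(sql_duration_sum[1m])/rate(sql_duration_count[1m]))'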
Kafka
When it comes to measuring a Kafka consumer, it's simply a matter of starting a timer before handling a message and closing it afterwards. Prometheus will measure the time automatically and expose the metrics. It's a good idea to add a label for the topic. Currently we have just one topic, but more might be introduced in the future.
val kafkaHistogram = Histogram
  .build("kafka_message_duration", "Kafka message handling duration")
  .labelNames("topic")
  .register()

def handleMessage(message: Message) = {
  val topic = message.record.topic
  // Time how long it takes to handle a single message, labelled by topic.
  val timer = kafkaHistogram.labels(topic).startTimer()
  handle(message)
  timer.close()
}
Querying Kafka messages received per second and average handling duration is the same as shown previously, just with different metric names.
Calls to an external service
There are libraries for integrating an HTTP client, but it can also be achieved manually using a timer. We use STTP, which has a Prometheus plugin that exposes a histogram. Having the histogram data, we can apply the same monitoring techniques for measuring average request time and request count.
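If your HTTP client has no ready-made integration, a minimal manual sketch could look like the following (the 'authClient' call and the metric name are hypothetical, not part of the original code):

import io.prometheus.client.Histogram

val externalCallHistogram = Histogram
  .build("external_call_duration", "External service call duration in seconds.")
  .labelNames("service")
  .register()

// Hypothetical wrapper around the auth service client used by the endpoints.
def checkPermissions(token: String): Boolean = {
  val timer = externalCallHistogram.labels("auth").startTimer()
  try authClient.checkPermissions(token) // placeholder for the actual client call
  finally timer.close()                  // the duration is observed even if the call fails
}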
Generating video
For the daily video generation we can use a similar strategy. For such a task we need to manually start and stop a timer, similarly to the Kafka scenario:
val videoHistogram = Histogram
  .build("friends_video", "Friends video generation duration")
  .register()

def createVideos = {
  val friends = db.fetchUsersBeingFriendsOverFiveYears()
  friends.foreach { friendsPair =>
    // Time the generation of each individual video.
    val timer = videoHistogram.startTimer()
    videoGenerator.generate(friendsPair)
    timer.close()
  }
}
Since video generation takes some time, it's pointless to measure "videos per second". Instead, I would suggest plotting just the number of videos generated over time. This way we can see how many videos were generated in one batch and how long the batch took. It's as simple as querying the 'friends_video_count' metric.
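For instance, to see how many videos each nightly batch produced, a query along these lines should work:
'increase(friends_video_count[1d])'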
Resources
Resources are the easiest part. By using Node Exporter we get all the resource-related metrics. There are also predefined Grafana dashboards, which I highly recommend for finding out which metrics to use and what queries to perform.
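As an example, CPU utilization per instance can be plotted with a query along these lines (the exact metric name depends on your Node Exporter version; newer versions expose 'node_cpu_seconds_total'):
'100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'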
Visualizing
For visualizing I recommend using Grafana. It's especially useful when you have complicated metrics with lots of labels. You can create drop-down menus and choose which label values you want to select. The best way to learn it is to find a good dashboard and see how it's done under the hood. Go to https://grafana.com/dashboards, choose Prometheus as the data source, browse the available dashboards, and you will learn how to use it to its full potential in just a few minutes.
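For example, a drop-down listing all endpoints can be defined as a Grafana template variable backed by a 'label_values' query; assuming the endpoint label on the HTTP histogram is called 'endpoint', it would look like this:
'label_values(http_request_duration_count, endpoint)'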
Summary
In this post I presented how to build a Prometheus monitoring solution for the components used in an ordinary system. There are, however, lots of different ways to approach such a task. If you know any of them, we would love to hear about them.'
This article was written by Jakub Dziworski and posted originally on blog.softwaremill.com