Monitoring and alerting
Monitoring
The Prometheus platform is typically used to monitor Stellio in production, together with Grafana for the visualisation of the different metrics gathered from the platform.
Setting up Prometheus is beyond the scope of this documentation; the Prometheus site provides good documentation for it.
However, it is recommended to monitor the VMs with node_exporter and the Docker containers with cAdvisor.
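As a minimal sketch of that recommendation, the two exporters could be scraped with jobs like the ones below. The host names (stellio-host, cadvisor) and the 'Stellio VM' job name are assumptions made for illustration; 9100 and 8080 are the default ports of node_exporter and cAdvisor. The 'Stellio Docker' job name is chosen to match the job label used by the container alert in the Alerting section.

```yaml
scrape_configs:
  # Host-level metrics (CPU, memory, disk, network) exposed by node_exporter.
  - job_name: 'Stellio VM'
    scrape_interval: 30s
    static_configs:
      - targets: ['stellio-host:9100']   # assumed host name, default node_exporter port

  # Per-container metrics exposed by cAdvisor.
  - job_name: 'Stellio Docker'
    scrape_interval: 30s
    static_configs:
      - targets: ['cadvisor:8080']       # assumed container name, default cAdvisor port
```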
The Stellio services can also be configured to expose a health endpoint and Prometheus metrics (in parentheses, the name of the environment variable to use when injecting the value into a Docker container):
- management.endpoint.prometheus.enabled (MANAGEMENT_ENDPOINT_PROMETHEUS_ENABLED): true
- management.endpoints.web.exposure.include (MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE): health,prometheus
- management.metrics.pf.tag (MANAGEMENT_METRICS_PF_TAG): used to compose a specific application tag that is used for easier querying in Prometheus
In a docker-compose or Docker Swarm based deployment, the environment variables can be declared by adding the following to the environment section of the service:
search-service:
  container_name: search-service
  image: stellio/stellio-search-service:latest-dev
  environment:
    - MANAGEMENT_ENDPOINT_PROMETHEUS_ENABLED=${MANAGEMENT_ENDPOINT_PROMETHEUS_ENABLED}
    - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=${MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE}
    - MANAGEMENT_METRICS_TAGS_APPLICATION=Search Service - ${MANAGEMENT_METRICS_PF_TAG}
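To make the composition of the application tag more concrete, here is the same fragment with literal values instead of variable substitution. The platform tag value "Stellio Dev" is only an assumption for this sketch. With such a configuration, the metrics scraped from the service would be expected to carry a label like application="Search Service - Stellio Dev", which can then be used as a filter in Prometheus queries.

```yaml
search-service:
  container_name: search-service
  image: stellio/stellio-search-service:latest-dev
  environment:
    # Enable the Prometheus endpoint and expose it together with the health endpoint.
    - MANAGEMENT_ENDPOINT_PROMETHEUS_ENABLED=true
    - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=health,prometheus
    # "Stellio Dev" stands in for the value of MANAGEMENT_METRICS_PF_TAG (assumed value).
    - MANAGEMENT_METRICS_TAGS_APPLICATION=Search Service - Stellio Dev
```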
Example Prometheus configurations
An example Prometheus configuration to get health information from Stellio services (using Blackbox exporter):
- job_name: 'Stellio services - health'
  metrics_path: /probe
  scrape_interval: 1m
  params:
    module: [http_200_stellio]
  static_configs:
    - targets: ['http://stellio-host:8080/actuator/health']
      labels:
        name: 'API Gateway'
    - targets: ['http://stellio-host:8083/actuator/health']
      labels:
        name: 'Search Service'
    - targets: ['http://stellio-host:8084/actuator/health']
      labels:
        name: 'Subscription Service'
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox_exporter:9115
Where the http_200_stellio module is configured in the Blackbox exporter as follows:
http_200_stellio:
  prober: http
  timeout: 5s
  http:
    preferred_ip_protocol: "ip4"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    method: GET
    fail_if_body_not_matches_regexp:
      - "UP"
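The relabelling rules above send every probe to blackbox_exporter:9115, so a Blackbox exporter has to be reachable under that name. The following docker-compose fragment is a sketch of one way to run it, not a reference setup: the local file name blackbox.yml is an assumption, and that file is expected to contain the http_200_stellio module under the top-level modules key of the Blackbox exporter configuration.

```yaml
blackbox_exporter:
  # The service name matches the address used in the relabel_configs replacement.
  container_name: blackbox_exporter
  image: prom/blackbox-exporter:latest
  command:
    - '--config.file=/etc/blackbox_exporter/config.yml'
  volumes:
    # blackbox.yml is a hypothetical local file holding the module definitions.
    - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro
```

Prometheus must be attached to the same Docker network for the blackbox_exporter host name to resolve.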
Health information can then be monitored, for instance, with the following community Grafana dashboard: https://grafana.com/grafana/dashboards/12275.
An example Prometheus configuration to get metrics information from Stellio services:
- job_name: 'Stellio services - metrics'
  metrics_path: '/actuator/prometheus'
  scrape_interval: 30s
  static_configs:
    - targets: ['stellio-host:8083'] # 8083 : Search service
    - targets: ['stellio-host:8084'] # 8084 : Subscription service
Metrics can then be viewed, for instance, with the following community Grafana dashboard: https://grafana.com/grafana/dashboards/4701.
Alerting
The Prometheus Alertmanager can be used to send alerts when something goes wrong on the monitored platform.
A good place to find some example alerts is the Awesome Prometheus alerts site.
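For orientation, a minimal Alertmanager configuration could look like the sketch below. The receiver name and the webhook URL are placeholders invented for this example; an email, Slack or other receiver type could be used instead.

```yaml
route:
  # Default receiver for every alert that reaches Alertmanager.
  receiver: stellio-ops
  # Group notifications by alert name and by the "name" label set on services and containers.
  group_by: ['alertname', 'name']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: stellio-ops
    webhook_configs:
      # Hypothetical endpoint receiving the alert notifications.
      - url: 'http://alert-receiver.example.org/hooks/stellio'
```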
Some classic alerts that are generally recommended:
- Alert when a service is down
  - name: service
    rules:
      - alert: service_down
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.name }} is down"
          description: "Service {{ $labels.name }} is down ({{ $labels.instance }})"
- Alert when a container is down
  - alert: container_down_stellio
    expr: |
      absent(container_memory_usage_bytes{name="api-gateway",job="Stellio Docker"}) or
      absent(container_memory_usage_bytes{name="postgres",job="Stellio Docker"}) or
      absent(container_memory_usage_bytes{name="kafka",job="Stellio Docker"}) or
      absent(container_memory_usage_bytes{name="subscription-service",job="Stellio Docker"}) or
      absent(container_memory_usage_bytes{name="search-service",job="Stellio Docker"})
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} is down on Stellio Dev"
      description: "Container {{ $labels.name }} is down for more than 30 seconds on Stellio Dev"
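For such rules to be evaluated and delivered, Prometheus has to load the rule file and know where Alertmanager is running. The fragment below sketches this wiring in the Prometheus configuration; the rule file path and the alertmanager host name are assumptions, while 9093 is the default Alertmanager port.

```yaml
# Rule files containing the alert groups (hypothetical path).
rule_files:
  - /etc/prometheus/rules/stellio.rules.yml

# Alertmanager instances that receive the firing alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed host name, default port
```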
Logging
The logs produced by the Stellio services can be sent to Graylog or any other GELF-compatible logging platform.
Setting up Graylog is beyond the scope of this documentation; the Graylog site provides good documentation for it.
To send the logs of the Stellio services to the logging platform, the following three variables have to be configured (in parentheses, the name of the environment variable to use when injecting the value into a Docker container):
- application.graylog.host (APPLICATION_GRAYLOG_HOST): host where Graylog is installed (e.g., localhost)
- application.graylog.port (APPLICATION_GRAYLOG_PORT): port where Graylog is listening (e.g., 12201)
- application.graylog.source (APPLICATION_GRAYLOG_SOURCE): sent as platform key to Graylog (it can later be used to create streams specific to the originating platform)
In a docker-compose or Docker Swarm based deployment, the environment variables can be declared by adding the following to the environment section of the service:
entity-service:
  container_name: entity-service
  image: stellio/stellio-entity-service:latest
  environment:
    - APPLICATION_GRAYLOG_HOST=${APPLICATION_GRAYLOG_HOST}
    - APPLICATION_GRAYLOG_PORT=${APPLICATION_GRAYLOG_PORT}
    - APPLICATION_GRAYLOG_SOURCE=${APPLICATION_GRAYLOG_SOURCE}