Gridmatic: Efficiently Monitoring GKE Batch Jobs
Streamlining GKE Batch Job Monitoring: Zencore's Solution for Gridmatic's Cloud Monitoring and Alerting.
Project Location:
Santa Barbara, CA
Industry:
Energy, Financial services
Use Case:
Cloud Native Monitoring
Website:
Gridmatic
Zencore helped Gridmatic design a way to monitor GKE batch jobs that fits their existing use of Cloud Monitoring and Alerting.
Project Challenges
Gridmatic had GKE batch jobs that could keep running after they should have finished, creating additional pods that consumed resources in the GKE cluster. Google Cloud does not provide metrics for Kubernetes jobs out of the box (at the time of writing, they are available only on Anthos-enabled clusters).
“The quality of support and the guidance we have received from Zencore since day 1 has helped us immensely while growing on Google Cloud.”
Solution
Zencore built a solution using kube-state-metrics and Managed Service for Prometheus. kube-state-metrics listens to the Kubernetes API server and generates metrics about object state. Prometheus scrapes the kube-state-metrics service and sends those metrics to Cloud Monitoring. A custom query was developed to send alerts when jobs run beyond a specified threshold and to visualize the results in a dashboard.
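The custom query itself is not published here, so the following is only a minimal sketch of what such a rule might look like, not Zencore's actual implementation. It assumes the standard kube-state-metrics series kube_job_status_start_time and kube_job_status_active; the function names, the JobSample type, and the one-hour threshold are hypothetical and chosen for illustration. The first function builds a PromQL expression that could back an alert, and the second applies the same rule to in-memory samples so the logic can be checked without a cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobSample:
    """One job's state as kube-state-metrics might report it (illustrative type)."""
    name: str
    start_time: float  # value of kube_job_status_start_time, in Unix seconds
    active_pods: int   # value of kube_job_status_active


def long_running_query(threshold_seconds: int) -> str:
    """Build a PromQL expression matching jobs that are still active
    and started more than threshold_seconds ago (assumed rule)."""
    return (
        f"(time() - kube_job_status_start_time) > {threshold_seconds} "
        "and on(namespace, job_name) kube_job_status_active > 0"
    )


def overdue_jobs(samples: List[JobSample], now: float,
                 threshold_seconds: int) -> List[str]:
    """Apply the same rule locally: return names of jobs that still have
    active pods and have been running longer than the threshold."""
    return [
        s.name
        for s in samples
        if s.active_pods > 0 and now - s.start_time > threshold_seconds
    ]


if __name__ == "__main__":
    # Hypothetical one-hour threshold; a real value depends on the workload.
    print(long_running_query(3600))
```

An expression like this could be used as the condition of an alerting policy or as the query behind a dashboard chart; the right threshold depends on how long each batch job is expected to run.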
Expertise
- Cloud Native Monitoring
- GKE
- Compute
- High Performance Computing
- Machine Learning
Results
Gridmatic implemented the solution in their environment and now receives notifications when batch jobs run too long in their cluster. The solution reduced the operational burden on the infrastructure team, which can focus on other engineering work and only needs to act when alerted.
About Gridmatic
Gridmatic works on enabling the clean energy transition, leveraging artificial intelligence and cloud technology for electricity markets.