Description :
- The primary focus here is the Monitoring & Tools Engineer role, so the requirements below matter most :
- Hands-on experience with monitoring / alerting topics : SIEM, Syslog, NetFlow, and a good understanding of environmental log structure / syntax and concepts.
- Knowledge of implementing and improving monitoring / alerting with Grafana, Kibana, RegEx, and Elasticsearch queries.
- Designing and building custom dashboards in Grafana, configuring alerts, and setting up reports to monitor application performance and infrastructure health.
Tools in detail :
Grafana :
- Proficiency in setting up and configuring Grafana dashboards to visualize data.
- Experience integrating Grafana with various data sources such as Prometheus, InfluxDB, or Graphite.
- Skills in creating custom queries and alerts within Grafana.
ELK Stack :
- Deep understanding of Elasticsearch for storing and searching log data.
- Experience with Logstash for data collection, transformation, and shipping.
- Skills in configuring Kibana for visualizing data and creating interactive dashboards.
Data Analysis and Visualization :
- Ability to analyze and interpret monitoring data to identify trends and anomalies.
- Skills in creating meaningful visualizations that aid in decision-making.
Alerting :
- Experience in setting up alerts and notifications based on monitoring data.
- Ability to implement automated responses to alerts to maintain system reliability.
Performance Tuning :
- Skills in optimizing the performance of monitoring tools and ensuring they scale with system growth.
- Understanding of metrics and logs that are critical for system reliability.
Scripting and Automation :
- Proficiency in scripting languages like Python, Bash, or Ruby for automating monitoring tasks.
- Experience in using tools like Cron or Jenkins for scheduling automated tasks.
Security and Compliance :
- Awareness of security best practices in monitoring and logging systems.
- Experience ensuring compliance with industry standards.
- Regular networking knowledge.
Secondary skillset :
System Reliability Engineering (SRE) Principles :
- Knowledge of SRE practices to enhance system reliability, availability, and performance.
- Experience with incident management and root cause analysis.
(ref : hirist.tech)