Observability

Your teams and projects produce a lot of operational data, and you can take advantage of AI tools such as Amazon GuardDuty and New Relic's AI alerts to reduce MTTR (mean time to repair) significantly. However, the noise surrounding custom metrics alerts and CloudWatch alerts can be significant. And although Splunk collects all application and system generated logs, the enterprise Splunk tool is not designed and built for AI. As a result, none of the current toolsets had insight into application-generated error, audit, and debug logs. Our federal customer's application logs are a gold mine when it comes to detecting anomalies and providing additional context around alerts triggered for memory usage, CPU usage, database CPU, and slow response times.

To use artificial intelligence to minimize MTTR, one of the first tasks was to convert the current project's application and audit logging to JSON format. The team used AWS Glue jobs to convert the JSON logs to Parquet format for AI/ML ingestion. Using the enterprise Databricks platform and MLflow, the team successfully operationalized models built with an unsupervised methodology to detect anomalies in near real time. These anomaly detections helped the team dramatically reduce mean time between failures (MTBF), and also made it possible to identify and fix data quality issues and code issues that caused CPU spikes and high response times. The additional contextual intelligence significantly reduced the time it took the DevSecOps teams to complete root cause analysis.
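The conversion step can be illustrated with a short PySpark sketch. This is a minimal example under assumptions: the S3 paths and the log_date field are hypothetical placeholders, and a real AWS Glue job would typically also use the Glue job framework (GlueContext, job bookmarks), which is omitted here for brevity.

```python
# Minimal sketch: read newline-delimited JSON application logs and write
# them out as Parquet for AI/ML ingestion. Paths and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-logs-to-parquet").getOrCreate()

# Read JSON log records (error, audit, and debug logs).
logs = spark.read.json("s3://example-bucket/app-logs/json/")

# Write columnar Parquet, partitioned by date so downstream jobs can
# process incrementally. Assumes the JSON records carry a log_date field.
(logs
    .write
    .mode("append")
    .partitionBy("log_date")
    .parquet("s3://example-bucket/app-logs/parquet/"))
```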
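For the modeling step, the specific unsupervised algorithm isn't named here, so the sketch below uses scikit-learn's IsolationForest as a stand-in and logs the fitted model with MLflow, one common way to operationalize a model on Databricks. The feature table and column names are illustrative assumptions, not the team's actual features.

```python
# Minimal sketch: train an unsupervised anomaly detector on log-derived
# features and log it with MLflow. IsolationForest is a stand-in for the
# unsupervised method actually used; feature names are assumptions.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical feature table derived from the Parquet logs
# (e.g., per-minute error counts, CPU usage, response times).
features = pd.read_parquet("features.parquet")
feature_cols = ["error_count", "cpu_pct", "db_cpu_pct", "p95_response_ms"]

with mlflow.start_run(run_name="log-anomaly-detector"):
    model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
    model.fit(features[feature_cols])

    # Score the records: -1 indicates an anomaly, 1 indicates normal.
    features["anomaly"] = model.predict(features[feature_cols])

    mlflow.log_param("contamination", 0.01)
    mlflow.log_metric("anomaly_rate", float((features["anomaly"] == -1).mean()))
    mlflow.sklearn.log_model(model, "model")
```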


Currently, the team is working on two additional AI/ML use cases:

  • New Relic alerts surrounding custom Insights events do not exploit AI, which creates a lot of noise from alerts based on custom events and CloudWatch alerts. The goal is to build models for the New Relic custom Insights events and CloudWatch events and rely on homegrown AI/ML alerts for these events.

  • Build models that use the main project's application and system audit logs to detect insider threats. The goal is to detect suspicious access patterns and alert the SOC team to investigate further (a rough sketch of this idea follows the list).
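Since the insider-threat work is still in progress, the following is only a rough sketch of the kind of access-pattern signal a model might score, assuming hypothetical audit-log fields (user, resource, ts); it is not the team's actual implementation.

```python
# Rough sketch: flag users whose hourly access volume deviates sharply from
# their own baseline in the audit logs. Field names (user, ts) and the file
# path are hypothetical; the real models and features are still being designed.
import pandas as pd

audit = pd.read_parquet("audit_logs.parquet")  # hypothetical audit-log extract
audit["hour"] = pd.to_datetime(audit["ts"]).dt.floor("h")

# Accesses per user per hour.
hourly = audit.groupby(["user", "hour"]).size().rename("accesses").reset_index()

# Per-user baseline: mean and standard deviation of hourly access counts.
stats = hourly.groupby("user")["accesses"].agg(["mean", "std"]).reset_index()
hourly = hourly.merge(stats, on="user")

# Simple z-score rule: far more activity than usual is worth a SOC review.
hourly["zscore"] = (hourly["accesses"] - hourly["mean"]) / hourly["std"].fillna(1.0)
suspicious = hourly[hourly["zscore"] > 4]
print(suspicious[["user", "hour", "accesses", "zscore"]])
```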

We’ll write a follow-on post to report how these two use cases develop.