VentureBeat November 5, 23:15
Elastic Streams: A New AI-Powered Tool for Diagnosing Network Incidents

 

Modern IT environments face data overload, making network incident diagnosis increasingly difficult. Streams, Elastic's new Observability feature, uses AI to automatically parse logs, extract key information, and turn vast amounts of unstructured data into actionable patterns and context, so that logs become the primary signal in network incident investigations. Streams automatically identifies critical errors and anomalies, gives SRE teams early warnings, and ultimately points toward remediation steps, greatly improving the speed and accuracy of troubleshooting. It addresses the pain point of traditional manual log analysis and, through LLM-driven automated remediation, may help ease the IT talent shortage.

💡 AI-powered log analysis: Elastic Streams uses AI to automatically parse and partition raw log data and extract key fields, greatly reducing the log-processing burden on SRE teams. Logs are no longer a tool of last resort but become the primary source of information for quickly locating the root cause of a problem.

🎯 Automated anomaly detection and early warning: Streams automatically identifies critical errors and anomalous events from context-rich logs and issues timely alerts. This lets SRE teams spot potential problems earlier, maintain a clear understanding of their workloads, and investigate and resolve issues faster.

🚀 Streamlined troubleshooting: In the traditional workflow, SREs hop across multiple tools, manually analyzing metrics, traces, and logs to diagnose a problem. By consolidating information and automating the workflow, Streams frees SREs from tedious manual work so they can focus directly on problem-solving, and even enables automated remediation.

🧠 LLM-driven future of observability: The article notes that large language models (LLMs) excel at recognizing patterns in vast quantities of data and will be key to the future of observability. LLMs can generate automated remediation plans and runbooks that SREs then verify and implement, improving efficiency and compensating for skill shortages.

🤝 Easing the IT skills shortage: Augmenting SREs with AI tools such as Streams and LLMs can help address the talent shortage in IT infrastructure management. AI can help junior practitioners grow quickly into experts, raising a team's overall expertise and its ability to handle complex problems.

Presented by Elastic


Logs set to become the primary tool for finding the “why” in diagnosing network incidents

Modern IT environments have a data problem: there’s too much of it. Organizations that need to manage a company’s environment are increasingly challenged to detect and diagnose issues in real-time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets.

The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns and figure out what’s happening across the network, and diagnose why an issue or incident occurred. The problem is that the process creates information overload: A Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes.

"It’s so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching."

An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.

Elastic, the Search AI Company, recently released a new feature for observability called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning.

Streams uses AI to automatically partition and parse raw logs to extract relevant fields, and greatly reduce the effort required of SREs to make logs usable. Streams also automatically surfaces significant events such as critical errors and anomalies from context-rich logs, giving SREs early warnings and a clear understanding of their workloads, enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps.

"From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."

A broken workflow

Streams upends an observability process that some say is broken. Typically, SREs set up metrics, logs and traces. Then they set up alerts, and service level objectives (SLOs) — often hard-coded rules to show where a service or process has gone beyond a threshold, or a specific pattern has been detected.

When an alert is triggered, it points to the metric that's showing an anomaly. From there, SREs look at a metrics dashboard, where they can visualize the issue and compare the alert to other metrics, or CPU to memory to I/O, and start looking for patterns.
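A hard-coded rule of the kind described here can be sketched in a few lines. This is a generic, hypothetical example of a threshold alert, not any vendor's implementation; the metric values and window size are invented.

```python
def check_threshold(samples: list[float], threshold: float, window: int = 5) -> bool:
    """Fire an alert when every sample in the trailing window exceeds the threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(v > threshold for v in recent)

cpu = [62, 71, 93, 95, 96, 97, 99]  # hypothetical CPU utilization samples (%)
if check_threshold(cpu, threshold=90):
    print("ALERT: CPU above 90% for 5 consecutive samples")
```

Rules like this only say *that* something crossed a line; the "why" still has to be dug out of traces and logs, which is the gap the article argues AI should close.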

They may then need to look at a trace, and examine upstream and downstream dependencies across the application to dig into the root cause of the issue. Once they figure out what's causing the trouble, they jump into the logs for that database or service to try and debug the issue.

Some companies simply seek to add more tools when current ones prove ineffective. That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications.

"You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is," Exner says. "But AI automates that workflow away."

With AI-powered Streams, logs are not just used reactively to resolve issues. They also proactively surface potential problems and create information-rich alerts that help teams jump straight to problem-solving, offering a remediation suggestion or even fixing the issue entirely before automatically notifying the team that it's been taken care of.

"I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."

Observability’s future

Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential.
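The underlying intuition, that repetitive log data makes rare patterns stand out, can be shown without an LLM at all. Below is a deliberately simplistic frequency-based sketch (an assumption of this article's editor, not a described Elastic or LLM technique): lines are collapsed into templates, and templates that occur rarely are surfaced as candidate anomalies.

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Collapse numbers and IDs so lines with the same shape share one template."""
    return re.sub(r"\d+", "<N>", line)

def rare_lines(lines: list[str], max_count: int = 1) -> list[str]:
    """Surface log lines whose template appears at most max_count times."""
    counts = Counter(template_of(l) for l in lines)
    return [l for l in lines if counts[template_of(l)] <= max_count]

logs = [
    "GET /api/items 200 12ms",
    "GET /api/items 200 9ms",
    "GET /api/items 200 11ms",
    "OutOfMemoryError in worker 7",  # the odd one out
]
print(rare_lines(logs))  # ['OutOfMemoryError in worker 7']
```

An LLM-based approach goes further by attaching meaning and suggested fixes to such outliers, but the pattern-frequency idea is the same.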

Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.

Addressing skill shortages

Going all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says.

"We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert."

Streams in Elastic Observability is available now. Get started by reading more about Streams.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
