VentureBeat November 5, 23:15
Elastic Streams: A New AI-Powered Tool for Diagnosing Network Incidents

 

Modern IT environments face data overload, making network incident diagnosis increasingly difficult. Streams, Elastic's new Observability feature, uses AI to automatically parse logs, extract key information, and turn vast amounts of unstructured data into actionable patterns and context, so that logs become the primary signal in network incident investigations. Streams automatically identifies critical errors and anomalies, gives SRE teams early warnings, and ultimately points toward remediation steps, greatly improving the speed and accuracy of troubleshooting. It addresses the pain point of traditional manual log analysis and, through LLM-driven automated remediation, may help ease the IT talent shortage.

💡 AI-powered log analysis: Elastic Streams uses AI to automatically parse and partition raw log data and extract key fields, greatly reducing the log-processing burden on SRE teams. Logs are no longer a tool of last resort but become the primary source of information for quickly locating the root cause of a problem.

🎯 Automated anomaly detection and early warning: Streams automatically identifies critical errors and anomalous events from context-rich logs and issues timely alerts. This lets SRE teams spot potential problems earlier, maintain a clear understanding of their workloads, and investigate and resolve issues faster.

🚀 Streamlined troubleshooting: In the traditional workflow, SREs hop across multiple tools, manually analyzing metrics, traces, and logs to diagnose a problem. By consolidating information and automating the workflow, Streams frees SREs from tedious manual work so they can focus directly on problem-solving, and even enables automated remediation.

🧠 LLM-driven future of observability: The article notes that large language models (LLMs) excel at recognizing patterns in vast quantities of data and will be key to the future of observability. LLMs can generate automated remediation plans and runbooks that SREs then verify and implement, improving efficiency and compensating for skill shortages.

🤝 Easing the IT skills shortage: Augmenting SREs with AI tools such as Streams and LLMs can help address the talent shortage in IT infrastructure management. AI can help junior practitioners grow quickly into experts, raising a team's overall expertise and its ability to handle complex problems.

Presented by Elastic


Logs set to become the primary tool for finding the “why” in diagnosing network incidents

Modern IT environments have a data problem: there’s too much of it. Organizations that need to manage a company’s environment are increasingly challenged to detect and diagnose issues in real-time, optimize performance, improve reliability, and ensure security and compliance — all within constrained budgets.

The modern observability landscape has many tools that offer a solution. Most revolve around DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to uncover patterns and figure out what’s happening across the network, and diagnose why an issue or incident occurred. The problem is that the process creates information overload: A Kubernetes cluster alone can emit 30 to 50 gigabytes of logs a day, and suspicious behavior patterns can sneak past human eyes.

"It’s so anachronistic now, in the world of AI, to think about humans alone observing infrastructure," says Ken Exner, chief product officer at Elastic. "I hate to break it to you, but machines are better than human beings at pattern matching."

An industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in logs, but because they contain massive volumes of unstructured data, the industry tends to use them as a tool of last resort. This has forced teams into costly tradeoffs: either spend countless hours building complex data pipelines, drop valuable log data and risk critical visibility gaps, or log and forget.

Elastic, the Search AI Company, recently released a new feature for observability called Streams, which aims to become the primary signal for investigations by taking noisy logs and turning them into patterns, context and meaning.

Streams uses AI to automatically partition and parse raw logs to extract relevant fields, and greatly reduce the effort required of SREs to make logs usable. Streams also automatically surfaces significant events such as critical errors and anomalies from context-rich logs, giving SREs early warnings and a clear understanding of their workloads, enabling them to investigate and resolve issues faster. The ultimate goal is to show remediation steps.

"From raw, voluminous, messy data, Streams automatically creates structure, putting it into a form that is usable, automatically alerts you to issues and helps you remediate them," Exner says. "That is the magic of Streams."

A broken workflow

Streams upends an observability process that some say is broken. Typically, SREs set up metrics, logs and traces. Then they set up alerts, and service level objectives (SLOs) — often hard-coded rules to show where a service or process has gone beyond a threshold, or a specific pattern has been detected.

When an alert is triggered, it points to the metric that's showing an anomaly. From there, SREs look at a metrics dashboard, where they can visualize the issue and compare the alert to other metrics, or CPU to memory to I/O, and start looking for patterns.
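A hard-coded rule of the kind described here can be sketched in a few lines. This is a generic, hypothetical example of a threshold alert, not any vendor's implementation; the metric values and window size are invented.

```python
def check_threshold(samples: list[float], threshold: float, window: int = 5) -> bool:
    """Fire an alert when every sample in the trailing window exceeds the threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(v > threshold for v in recent)

cpu = [62, 71, 93, 95, 96, 97, 99]  # hypothetical CPU utilization samples (%)
if check_threshold(cpu, threshold=90):
    print("ALERT: CPU above 90% for 5 consecutive samples")
```

Rules like this only say *that* something crossed a line; the "why" still has to be dug out of traces and logs, which is the gap the article argues AI should close.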

They may then need to look at a trace, and examine upstream and downstream dependencies across the application to dig into the root cause of the issue. Once they figure out what's causing the trouble, they jump into the logs for that database or service to try and debug the issue.

Some companies simply seek to add more tools when current ones prove ineffective. That means SREs are hopping from tool to tool to keep on top of monitoring and troubleshooting across their infrastructure and applications.

"You’re hopping across different tools. You’re relying on a human to interpret these things, visually look at the relationship between systems in a service map, visually look at graphs on a metrics dashboard, to figure out what and where the issue is," Exner says. "But AI automates that workflow away."

With AI-powered Streams, logs are not just used reactively to resolve issues. They also proactively surface potential problems and create information-rich alerts that help teams jump straight to problem-solving, offering a remediation suggestion or even fixing the issue entirely before automatically notifying the team that it's been taken care of.

"I believe that logs, the richest set of information, the original signal type, will start driving a lot of the automation that a service reliability engineer typically does today, and does very manually," he adds. "A human should not be in that process, where they are doing this by digging into themselves, trying to figure out what is going on, where and what the issue is, and then once they find the root cause, they’re trying to figure out how to debug it."

Observability’s future

Large language models (LLMs) could be a key player in the future of observability. LLMs excel at recognizing patterns in vast quantities of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. And today’s LLMs can be trained for specific IT processes. With automation tooling, the LLM has the information and tools it needs to resolve database errors or Java heap issues, and more. Incorporating those into platforms that bring context and relevance will be essential.
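The underlying intuition, that repetitive log data makes rare patterns stand out, can be shown without an LLM at all. Below is a deliberately simplistic frequency-based sketch (an assumption of this article's editor, not a described Elastic or LLM technique): lines are collapsed into templates, and templates that occur rarely are surfaced as candidate anomalies.

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Collapse numbers and IDs so lines with the same shape share one template."""
    return re.sub(r"\d+", "<N>", line)

def rare_lines(lines: list[str], max_count: int = 1) -> list[str]:
    """Surface log lines whose template appears at most max_count times."""
    counts = Counter(template_of(l) for l in lines)
    return [l for l in lines if counts[template_of(l)] <= max_count]

logs = [
    "GET /api/items 200 12ms",
    "GET /api/items 200 9ms",
    "GET /api/items 200 11ms",
    "OutOfMemoryError in worker 7",  # the odd one out
]
print(rare_lines(logs))  # ['OutOfMemoryError in worker 7']
```

An LLM-based approach goes further by attaching meaning and suggested fixes to such outliers, but the pattern-frequency idea is the same.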

Automated remediation will still take some time, Exner says, but automated runbooks and playbooks generated by LLMs will become standard practice within the next couple of years. In other words, remediation steps will be driven by LLMs. The LLM will offer up fixes, and the human will verify and implement them, rather than calling in an expert.

Addressing skill shortages

Going all in on AI for observability would help address a major shortage in the talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with a great deal of experience and understanding of potential issues, and how to resolve them fast. That experience can come from an LLM that is contextually grounded, Exner says.

"We can help deal with the skill shortage by augmenting people with LLMs that make them all instantly experts," he explains. "I think this is going to make it much easier for us to take novice practitioners and make them expert practitioners in both security and observability, and it’s going to make it possible for a more novice practitioner to act like an expert."

Streams in Elastic Observability is available now. Get started by reading more about Streams.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
