Optimizing NSX Performance: Workload-Based Considerations

This article looks at how to optimize VMware NSX performance based on application traffic characteristics and business criticality. It distinguishes two traffic types, elephant flows (large packets, long-lived connections, latency-insensitive) and mice flows (small packets, short-lived connections, latency-sensitive), and notes that the two can affect each other. NSX offers three datapath options (Standard, Enhanced, and SmartNICs) to suit different application needs. The article focuses on tuning the Standard Datapath, covering Geneve Offload, MTU settings, Geneve Rx/Tx Filters and RSS, as well as queue and buffer configuration at the vNIC and pNIC/ESXi layers. It also highlights scaling out packet-processing capacity by adding pNICs (for example, a 4x pNIC design) to fully utilize multi-core CPUs and reduce the server footprint.

🐘 **Traffic types and their interaction**: The article distinguishes latency-insensitive elephant flows (e.g., backups, logs) from latency-sensitive mice flows (e.g., in-memory databases, message queues). Understanding these traffic patterns is essential for identifying potential performance bottlenecks; for example, elephant flows can increase the latency experienced by mice flows.

🚀 **NSX datapath options**: NSX provides the Standard Datapath (interrupt-driven, suited to bandwidth-intensive applications), the Enhanced Datapath (a DPDK-like poll mode, suited to latency-sensitive applications), and SmartNICs (offloading to a DPU to maximize the CPU available to applications, especially suited to latency-sensitive and CPU-heavy workloads).

⚙️ **Standard Datapath performance tuning**: For the bandwidth-oriented Standard Datapath, traffic handling can be optimized through Geneve Offload (TSO/LRO), a larger MTU (e.g., 9000), and Geneve Rx/Tx Filters or RSS. Adjusting queue and buffer sizes at the vNIC and pNIC/ESXi layers (e.g., `ethernetX.ctxPerDev`, `ethernetX.pnicFeatures`, and `esxcli network nic ring current set`) can also significantly improve performance.

📈 **Scalability and multi-pNIC design**: To fully utilize the multi-core CPUs of modern servers and keep the pNICs from becoming a bottleneck, a 4x pNIC design is recommended. Compared with 2x pNIC, a 4x pNIC design doubles packet-processing capacity, allowing more workloads per host, reducing the number of servers required, and improving CPU utilization and overall efficiency.

Optimizing NSX Performance Based on Workload

Overview

Performance tuning, in general, requires a holistic view of the application traffic profiles, the features in use, and the performance criteria from the application's perspective. In this blog, we will take a look at some of the factors to consider when optimizing NSX for performance.

Applications

In a typical data center, applications may have different requirements based on their traffic profile. Some applications, such as backup services, log transfers, and certain types of web traffic, may be able to use all the available bandwidth. These long-lived traffic flows with large packets are called elephant flows. Applications with elephant flows are, in general, not sensitive to latency.

In contrast, in-memory databases, message queuing services such as Kafka, and certain Telco applications may be sensitive to latency. These short-lived traffic flows with smaller packets are generally called mice flows. Applications with mice flows are generally not bandwidth hungry.

While virtual datacenters generally run a mixed set of workloads that perform well as is, without much tuning, there are cases where tuning is needed to optimize performance for specific applications. For example, applications with elephant flows often impact the latency experienced by applications with mice flows, and this is true for both physical and virtual infrastructure. For business-critical applications, traffic may need to be steered so that it stays separate on all components, virtual and physical, to avoid any impact on performance. Understanding the application traffic profile and business criticality therefore helps in tuning for optimal performance based on application requirements.

Datapath Options

NSX provides three datapath options: the Standard Datapath (interrupt-driven, well suited to bandwidth-intensive applications), the Enhanced Datapath (a DPDK-style poll-mode datapath aimed at latency-sensitive applications), and SmartNICs/DPUs (which offload the datapath to a DPU, maximizing the CPU left for applications).

In this blog, we will focus on tuning the Standard Datapath for optimal performance.

Tuning for Optimal Performance

The Standard Datapath is tuned by default to maximize bandwidth usage, so throughput-hungry applications benefit from the optimizations included in this mode out of the box. The following are some of those optimizations; note that some are enabled by default, while others need to be configured explicitly:

Geneve Offload and MTU

Geneve offload is essentially TSO (and LRO) for Geneve-encapsulated traffic. TSO lets larger segments move through the TCP stack on the transmit side; these larger segments are broken down into MTU-compliant packets either by a NIC that supports Geneve offload or, as a last step, in software if the NIC does not support the feature. LRO is the equivalent feature on the receive side. While most NICs support TSO, LRO support is less prevalent, so LRO is often done in software.

Geneve offload is essential for applications with elephant flows. Apart from Geneve offload, which is enabled by default when the pNIC supports it, another way to optimize for applications with elephant flows is to enable jumbo MTU (9000).
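
As a rough sketch of how these settings can be verified on a host (vSwitch0, vmk10, and the remote TEP address are placeholders; in an NSX deployment the MTU is normally set through the uplink profile and on the physical fabric, so the vSwitch line is shown only for illustration):

    # Check whether hardware TSO is enabled on the host (Int Value 1 = enabled)
    esxcli system settings advanced list -o /Net/UseHwTSO

    # Raise the MTU of a standard vSwitch to jumbo frames
    esxcli network vswitch standard set -v vSwitch0 -m 9000

    # Verify jumbo frames pass end to end between TEPs without fragmentation
    # (vmk10 and the vxlan netstack are typical for NSX TEP interfaces;
    #  8972 assumes MTU 9000 minus 28 bytes of IP/ICMP headers)
    vmkping ++netstack=vxlan -I vmk10 -d -s 8972 <remote-tep-ip>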

Geneve Rx / Tx Filters and RSS

Geneve Rx / Tx Filters are a smarter version of RSS that provides queueing based on need. While RSS works at the hardware level and queues flows based on the outer headers, Geneve Rx / Tx Filters queue flows based on insights into the inner traffic flows. Queueing simply provides multiple lanes for traffic: just as multiple highway lanes ease congestion and maximize traffic flow, queueing does the same for application traffic. In general, performance increases almost linearly with the number of available queues, as long as the applications are able to leverage them.

Either Geneve Rx / Tx Filters or RSS is essential for all applications to improve performance.
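
As a hedged sketch, the following host commands can help confirm what the pNIC driver offers before tuning; the driver module name and the RSS parameter are placeholders, since RSS knobs are driver specific:

    # List the receive-filter classes each pNIC advertises
    # (NICs that support Geneve Rx filters report an encap/Geneve filter class)
    esxcli network nic queue filterclass list

    # RSS, where needed, is usually enabled through driver module parameters;
    # check what the driver exposes before setting anything
    esxcli system module parameters list -m <nic-driver-module>
    esxcli system module parameters set -m <nic-driver-module> -p "<rss-parameter>=<value>"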

Queuing needs to happen not only at the ESXi layer but also at the VM layer. When enabling multiple queues, the vCPU count should also be considered to avoid CPU-related bottlenecks. The following image highlights all the tuning parameters related to queuing and how they relate to the entire stack, from the pNIC to the VM.

For easier consumption, repeating the tuning commands in text below:

Queuing and Buffers at vNIC Layer
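
The original post shows the exact commands in the image above; as a minimal sketch, typical vNIC-layer settings look like the following, assuming a vmxnet3 adapter and a Linux guest (ethernetX, eth0, and the queue and ring values are placeholders to be sized against the VM's vCPU count):

    # VM .vmx settings, per vNIC
    ethernetX.ctxPerDev = "1"        # dedicated Tx thread(s) for this vNIC
    ethernetX.pnicFeatures = "4"     # allow this vNIC to leverage RSS on the pNIC

    # Inside a Linux guest running vmxnet3
    ethtool -G eth0 rx 4096 tx 4096  # larger vNIC ring buffers
    ethtool -L eth0 combined 8       # multiple queues, where the guest driver supports it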

Queuing and Buffers at pNIC / ESXi Stack Layer
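
Similarly, a minimal sketch of typical pNIC / ESXi-layer tuning (vmnic0 and the ring sizes are placeholders, bounded by what the NIC reports as its preset maximums):

    # Check the maximum and current ring sizes for the pNIC
    esxcli network nic ring preset get -n vmnic0
    esxcli network nic ring current get -n vmnic0

    # Increase the Rx/Tx ring buffers
    esxcli network nic ring current set -n vmnic0 -r 4096 -t 4096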

Ensure VM is not moved out from queueing

Scaling out 

Adding additional pNICs helps scale out the packet processing capacity of a system. 

Core considerations for queueing

In general, every queue will potentially consume a thread, but only when needed; when not processing packets, the threads are available for other tasks. The threads for the pNIC queues are allocated from the host.

Threads for the vNIC queues, in contrast, are allocated from the vCPUs assigned to the VM. Given that, the VM's vCPU count should be considered to ensure the CPU does not become a bottleneck.

2 x pNIC vs 4 x pNIC Design

Current dual-socket servers can support over 120 cores / 240 threads on a single host. Often, pNIC capacity is reached before all the available cores are fully leveraged. Following is an example with one NSX X-Large Edge on a dual-socket host with a modest 96 cores, where the pNICs are configured with 8 Rx queues and 2 Tx queues:

To leverage all the available cores on a system and to avoid pNIC bottlenecks, consider a 4 x pNIC design. With a 4 x pNIC design, the same host can address twice the workload capacity, which also halves the number of hosts needed for the workload. Following is an example with 2 x NSX X-Large Edges on a dual-socket host. Note: the system in this illustration still has capacity to host more Edge VMs.
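
As a rough back-of-the-envelope sketch, assuming the per-pNIC configuration from the example above (8 Rx + 2 Tx queues per pNIC) and roughly one potential thread per active queue: a 2 x pNIC host exposes about 2 x (8 + 2) = 20 hardware queues, while a 4 x pNIC host exposes about 4 x (8 + 2) = 40, which is where the roughly doubled packet-processing capacity per host comes from.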

Following is an illustration of the benefit of leveraging a 4 x pNIC design, compared with a 2 x pNIC design.

Conclusion

Performance tuning must consider the application traffic patterns and requirements. While most general purpose datacenter workloads should perform well with the default settings, some applications may require special handling. Queuing, buffering, separation of workloads and datapath selection are some of the key factors that help optimize performance for applications. Considering the large number of cores available today, a 4 x pNIC design would help not only with optimizing performance but also in optimizing CPU usage and reducing the server footprint.


The post Optimizing NSX Performance Based on Workload and ROI appeared first on Network and Security Virtualization.
