Trending
Articles related to "inference latency"
Benchmarking On-Device Machine Learning on Apple Silicon with MLX
cs.AI updates on arXiv.org 2025-10-23T04:13:45.000000Z
ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
cs.AI updates on arXiv.org 2025-10-21T04:27:57.000000Z
BeLLMan: Controlling LLM Congestion
cs.AI updates on arXiv.org 2025-10-20T04:13:29.000000Z
BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation
cs.AI updates on arXiv.org 2025-10-14T04:17:55.000000Z
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning
cs.AI updates on arXiv.org 2025-10-14T04:08:53.000000Z
Taming Latency-Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offloading
cs.AI updates on arXiv.org 2025-10-07T04:19:02.000000Z
Elastic On-Device LLM Service
cs.AI updates on arXiv.org 2025-10-07T04:18:48.000000Z
Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving
cs.AI updates on arXiv.org 2025-10-03T04:17:59.000000Z
From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
cs.AI updates on arXiv.org 2025-09-29T04:15:44.000000Z
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
cs.AI updates on arXiv.org 2025-09-23T06:12:43.000000Z
Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
cs.AI updates on arXiv.org 2025-09-18T04:47:55.000000Z
Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
cs.AI updates on arXiv.org 2025-09-11T15:51:36.000000Z
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
cs.AI updates on arXiv.org 2025-07-14T04:08:35.000000Z