An AI assistant helps get DeepSeek-OCR running on an NVIDIA Spark

 

This article describes how the author used the AI assistant Claude Code, through iterative prompting and a sandboxed environment, to deploy and run the DeepSeek-OCR model on NVIDIA Spark hardware. Along the way, Claude Code overcame technical hurdles including PyTorch/hardware compatibility and CUDA environment configuration, ultimately performing OCR on an image. The author stresses the importance of providing a clear goal, a sandboxed environment, and applying one's own experience, and shares practical tips for using VS Code to connect remotely and monitor the Docker container.

🚀 **AI-assisted deployment and execution**: Using the AI assistant Claude Code, the author got the DeepSeek-OCR model deployed and running on the NVIDIA Spark in under 40 minutes with just four prompts. The process was highly automated and needed only minimal intervention, highlighting how capable AI can be at complex technical tasks.

🔧 **Overcoming technical challenges**: During deployment, Claude Code identified and resolved a compatibility problem between the PyTorch version and the NVIDIA Spark hardware (GB10 GPU, compute capability sm_121) by finding and installing a specific PyTorch release (2.9.0) with ARM64 wheels, and by dealing with the model's hardcoded `.cuda()` calls, ultimately getting the model running on the target hardware.

📝 **Detailed notes and learning**: While working through the task, Claude Code produced a thorough `notes.md` file recording each step, the problems it found and their solutions, and created a `PROMPTS_GUIDE.md` file comparing the performance and best use cases of the different OCR prompts. These materials are valuable for later reproduction and deeper understanding.

💡 **Key success factors**: The article attributes the project's success to: 1) giving the AI a clear goal, a Docker environment on the target hardware, and links to the resources it needed; 2) using a Docker sandbox so the AI could run with full permissions, avoiding tedious manual confirmations; 3) the author applying their own experience to give appropriate guidance when the AI got stuck, helping it find a solution.

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

20th October 2025

DeepSeek released a new model yesterday: DeepSeek-OCR, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running on the NVIDIA Spark by having Claude Code effectively brute force the challenge of getting it working on that particular hardware.

This small project (40 minutes this morning, most of which was Claude Code churning away while I had breakfast and did some other things) ties together a bunch of different concepts I’ve been exploring recently. I designed an agentic loop for the problem, gave Claude full permissions inside a Docker sandbox, embraced the parallel agents lifestyle and reused my notes on the NVIDIA Spark from last week.

I knew getting a PyTorch CUDA model running on the Spark was going to be a little frustrating, so I decided to outsource the entire process to Claude Code to see what would happen.

TLDR: It worked. It took four prompts (one long, three very short) to have Claude Code figure out everything necessary to run the new DeepSeek model on the NVIDIA Spark, OCR a document for me and produce copious notes about the process.

The setup

I connected to the Spark from my Mac via SSH and started a new Docker container there:

docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 \
  bash

Then I installed npm and used that to install Claude Code:

apt-get update
DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get install -y npm
npm install -g @anthropic-ai/claude-code

Then started Claude Code, telling it that it’s OK that it’s running as root because it’s in a sandbox:

IS_SANDBOX=1 claude --dangerously-skip-permissions

It provided me a URL to click on to authenticate with my Anthropic account.

The initial prompts

I kicked things off with this prompt:

Create a folder deepseek-ocr and do everything else in that folder

Then I ran the following, providing links to both the GitHub repository and the Hugging Face model, providing a clue about NVIDIA ARM and giving it an image (this one, see previous post) that I wanted it to run OCR on.

Your task is to get this working: https://github.com/deepseek-ai/DeepSeek-OCR — it uses Hugging Face Transformers and models from https://huggingface.co/deepseek-ai/DeepSeek-OCR — you are running in a Docker container on an NVIDIA ARM device, first research your environment to check it will be possible to do this. Git clone both of those — you may need to enable LFS first. Make extensive notes as you go in a notes.md file which you should only ever append new notes to. At the end of this you should have scripts in the folder which can be run to both setup and then run the model against an example — use https://static.simonwillison.net/static/2025/ft.jpeg as the example image that you OCR. As a final step create a detailed README.md file explaining the project and what you learned and how to run your code.

Note how I also told it to “make extensive notes as you go in a notes.md file which you should only ever append new notes to” so I’d have some useful notes at the end.

The follow-ups

It churned away for 13 minutes, and made a sequence of discoveries about the platform and the model and the difficulties of running PyTorch on it.

And then... it gave up:

The NVIDIA GB10 GPU has compute capability sm_121, which is too new for PyTorch 2.5.1

    PyTorch 2.5.1 supports: sm_50, sm_80, sm_86, sm_89, sm_90, sm_90a
    GB10 requires: sm_121
    Error: CUDA error: no kernel image is available for execution on the device

Additionally, the model has hardcoded .cuda() calls, preventing CPU fallback without code modifications.
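As an aside, here's a quick way to reproduce that diagnosis yourself from inside the container. This snippet is mine rather than anything from Claude's transcript, and it assumes PyTorch is importable:

import torch

# What does the GPU report, and which architectures was this PyTorch build
# compiled for? On the Spark's GB10 this should show capability (12, 1)
# against an arch list that stops at sm_90 for PyTorch 2.5.1.
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))
print("Built for:", torch.cuda.get_arch_list())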

So I gave it some encouragement:

Any other options that might help get CUDA working like using different versions of PyTorch that have available CUDA wheels for this platform? Have you checked to see what PyTorch versions already have these ARM CUDA wheels that would work here?

It ran this command to grep for all available PyTorch wheels on download.pytorch.org that match the ARM architecture:

curl -s https://download.pytorch.org/whl/torch/ | grep -o 'torch-[0-9.]*.*aarch64.whl' | sort -V | tail -20

... and that unblocked it.

Excellent! I can see that PyTorch 2.9.0 has ARM64 wheels for CUDA 12.8, 12.9, and CUDA 13.0 (cu130)! This is exactly what we need!

It installed the wheels it found and reported:

Good news! PyTorch 2.9.0 can detect the GB10 GPU with CUDA capability 12.1 (sm_121), but it says the maximum supported is 12.0. However, it’s just a warning now instead of a hard error! Let’s try running inference: [...]

Then kept on churning for a bit, ignoring warnings, and managed to run the model! It loaded the image and ran this prompt:

<|grounding|>OCR this image.

And got back output that started like this:

<|ref|>The perils of vibe coding<|/ref|><|det|>[[352, 30, 624, 111]]<|/det|><|ref|>opt<|/ref|><|det|>[[970, 0, 994, 30]]<|/det|><|ref|>such<|/ref|><|det|>[[970, 24, 996, 58]]<|/det|><|ref|>days<|/ref|><|det|>[[970, 52, 996, 87]]<|/det|><|ref|>pavi<|/ref|><|det|>[[970, 85, 996, 118]]<|/det|><|ref|>TECHNOLOGY<|/ref|><|det|>[[33, 199, 127, 230]]<|/det|><|ref|>holds the promise of replacing program-<|/ref|><|det|>[[220, 170, 390, 210]]<|/det|><|ref|>mers - a profession of very well paid<|/ref|><|det|>[[220, 199, 390, 238]]<|/det|><|ref|>people.For another, the work can be<|/ref|><|det|>[[220, 228, 390, 267]]<|/det|>
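Those <|ref|>/<|det|> tags interleave the recognized text with bounding box coordinates. Here's a tiny parsing sketch of my own (it's not part of the DeepSeek-OCR repo, and it assumes one box per text span, as in the sample above):

import re

# Each grounded span looks like:
#   <|ref|>some text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>\[\[(\d+), (\d+), (\d+), (\d+)\]\]<\|/det\|>"
)

def parse_grounding(output):
    """Return a list of (text, (x1, y1, x2, y2)) tuples."""
    return [
        (text, tuple(int(n) for n in box))
        for text, *box in GROUNDING_RE.findall(output)
    ]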

Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.

I said:

That result.mmd file looks to be exclusively whitespace to me, try and get a better result there—the bounding box image was good though

Which caused Claude to reconsider the prompt it used:

Maybe we need to use a different prompt. Looking at the README from DeepSeek-OCR, there are different prompts:

    "<image>\n<|grounding|>Convert the document to markdown." — for documents
    "<image>\n<|grounding|>OCR this image." — general OCR
    "<image>\nFree OCR." — without layouts

The “grounding” mode seems to be focused on bounding boxes, which is why we got good bounding box results but the text output might be in a different format.

Let me try running it again with a different prompt that’s more focused on text extraction rather than grounding.

It tried all of the prompt variants, made a note about them and created a file called PROMPTS_GUIDE.md with a bunch more details, including this comparison table:

| Prompt | Speed | Text Quality | Structure | Coordinates | Best Use Case |
|--------|-------|--------------|-----------|-------------|---------------|
| Free OCR | ⚡⚡⚡ Fast | ⭐⭐⭐ Excellent | ⭐ Basic | ❌ No | General OCR |
| Markdown | ⚡⚡ Medium | ⭐⭐⭐ Excellent | ⭐⭐⭐ Full | ⭐⭐ Partial | Documents |
| Grounding | ⚡ Slow | ⭐⭐ Good | ⭐ Basic | ⭐⭐⭐ Full | Annotations |
| Detailed | ⚡⚡⚡ Fastest | ⭐ N/A | ❌ N/A | ❌ No | Image analysis |

And this table under the heading “Performance Benchmarks”:

Test image: 3503×1668 pixels (Financial Times article)

| Prompt | Time | Output Size | Tokens |
|--------|------|-------------|--------|
| Free OCR | 24s | Clean text | 2257 |
| Markdown | 39s | Formatted MD | 2257 + structure |
| Grounding | 58s | Text + coords | 2257 + boxes |
| Detailed | 9s | Description | ~300 |
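If you want to poke at these prompts yourself, the model's Python interface (as documented on the Hugging Face model card) looks roughly like this. Treat it as a sketch: the exact infer() keyword arguments are my reading of the repo's README, not something I have verified line by line on the Spark:

from transformers import AutoModel, AutoTokenizer
import torch

# Load DeepSeek-OCR with its custom modeling code from the Hugging Face Hub.
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Swap this for "<image>\n<|grounding|>OCR this image." or "<image>\nFree OCR."
# to reproduce the comparison above.
prompt = "<image>\n<|grounding|>Convert the document to markdown."

model.infer(
    tokenizer,
    prompt=prompt,
    image_file="test_image.jpeg",
    output_path="output/",
    save_results=True,  # writes result.mmd and result_with_boxes.jpg into output_path
)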

My final prompt was this, to gather everything together into a zip file I could extract from the Docker container:

Create a zip file with the output and output_text and all of the scripts and notes—but leave out the github repo and the huggingface repo directories

I added the contents of that zip file to my new simonw/research GitHub repo in the deepseek-ocr-nvidia-spark folder.

Claude really likes writing notes! Here’s the directory listing of that finished folder:

  |-- download_test_image.sh
  |-- FINAL_SUMMARY.md
  |-- notes.md
  |-- output
  |   |-- images
  |   |-- result_with_boxes.jpg
  |   `-- result.mmd
  |-- output_text
  |   |-- detailed
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   |-- free_ocr
  |   |   |-- images
  |   |   |-- result_with_boxes.jpg
  |   |   `-- result.mmd
  |   `-- markdown
  |       |-- images
  |       |   `-- 0.jpg
  |       |-- result_with_boxes.jpg
  |       `-- result.mmd
  |-- PROMPTS_GUIDE.md
  |-- README_SUCCESS.md
  |-- README.md
  |-- run_ocr_best.py
  |-- run_ocr_cpu_nocuda.py
  |-- run_ocr_cpu.py
  |-- run_ocr_text_focused.py
  |-- run_ocr.py
  |-- run_ocr.sh
  |-- setup.sh
  |-- SOLUTION.md
  |-- test_image.jpeg
  |-- TEXT_OUTPUT_SUMMARY.md
  `-- UPDATE_PYTORCH.md

Takeaways

My first prompt was at 15:31:07 (UTC). The final message from Claude Code came in at 16:10:03. That means it took less than 40 minutes start to finish, and I was only actively involved for about 5-10 minutes of that time. The rest of the time I was having breakfast and doing other things.

Having tried and failed to get PyTorch stuff working in the past, I count this as a huge win. I’ll be using this process a whole lot more in the future.

How good were the actual results? There’s honestly so much material in the resulting notes created by Claude that I haven’t reviewed all of it. There may well be all sorts of errors in there, but it’s indisputable that it managed to run the model and made notes on how it did that such that I’ll be able to do the same thing in the future.

I think the key factors in executing this project successfully were the following:

1. I gave it exactly what it needed: a Docker environment on the target hardware, instructions on where to get what it needed (the code and the model) and a clear goal for it to pursue. This is a great example of the pattern I described in designing agentic loops.
2. Running it in a Docker sandbox meant I could use claude --dangerously-skip-permissions and leave it running on its own. If I’d had to approve every command it wanted to run I would have got frustrated and quit the project after just a few minutes.
3. I applied my own knowledge and experience when it got stuck. I was confident (based on previous experiments with the Spark) that a CUDA wheel for ARM64 existed that was likely to work, so when it gave up I prompted it to try again, leading to success.

Oh, and it looks like DeepSeek OCR is a pretty good model if you spend the time experimenting with different ways to run it.

Bonus: Using VS Code to monitor the container

A small TIL from today: I had kicked off the job running in the Docker container via SSH to the Spark when I realized it would be neat if I could easily monitor the files it was creating while it was running.

I asked Claude.ai:

I am running a Docker container on a remote machine, which I started over SSH

How can I have my local VS Code on MacOS show me the filesystem in that docker container inside that remote machine, without restarting anything?

It gave me a set of steps that solved this exact problem:

1. Install the VS Code “Remote SSH” and “Dev Containers” extensions
2. Use “Remote-SSH: Connect to Host” to connect to the remote machine (on my Tailscale network that’s spark@100.113.1.114)
3. In the window for that remote SSH session, run “Dev Containers: Attach to Running Container”—this shows a list of containers and you can select the one you want to attach to

... and that’s it! VS Code opens a new window providing full access to all of the files in that container. I opened up notes.md and watched it as Claude Code appended to it in real time.

At the end when I told Claude to create a zip file of the results I could select that in the VS Code file explorer and use the “Download” menu item to download it to my Mac.
