23 October 2025 · Andrew Kent
Java developers are building LLM applications across banking, healthcare, and enterprise software, but most AI observability and evaluation tools only target Python or TypeScript. The few JVM options either lack AI-specific features or require rebuilding your monitoring stack from scratch.
We built the Braintrust Java SDK to fix this. It's an open-source SDK for AI observability and evaluation that runs on Java 17+, built on OpenTelemetry so it fits into existing infrastructure.
If you're building LLM features in Java, you've likely run into these issues:
- Tracking LLM calls in production requires custom instrumentation to capture inputs, outputs, latency, token usage, and costs per request
- Testing prompt changes or model swaps means either manual QA or writing custom test harnesses that don't integrate with existing eval tools
- A/B testing prompts requires building feature flags, routing logic, and result tracking from scratch
- Most AI observability tools target Python/TypeScript and don't provide Java clients or JVM-compatible instrumentation
The SDK provides AI observability and evaluation for Java applications. It requires Java 17 or higher and uses modern language features (records, pattern matching, and more) where appropriate.
What's included:
- OpenTelemetry-based tracing built on OpenTelemetry spans and traces, not proprietary instrumentation. Export traces to Braintrust, Datadog, Honeycomb, or any OTLP-compatible backend. Because the SDK uses standard OTLP conventions for semantic attributes, it fits alongside existing OpenTelemetry setups without conflicts (see the sketch after this list).
- Wrappers for OpenAI and Anthropic clients that automatically instrument LLM calls. Instrumentation is opt-in per client, so your existing Java services don't change unless you explicitly wrap AI clients.
- An evaluation framework that runs in CI/CD with support for custom scorers
- Support for fetching prompts from Braintrust, managing datasets, and viewing traces in the UI
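If you already run the OpenTelemetry SDK, you can point the wrapped clients at your own pipeline instead of the instance from `braintrust.openTelemetryCreate()`. Here's a minimal sketch using the standard opentelemetry-java API; the endpoint is a placeholder, and wiring your own `OpenTelemetry` instance into the wrapper is an assumption about your setup, not a required configuration:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Export spans to any OTLP-compatible backend; the endpoint is a placeholder.
var exporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("https://collector.example.com:4317")
    .build();

var tracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
    .build();

// This instance can coexist with the rest of your OpenTelemetry instrumentation.
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .build();
```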
Track LLM calls in production: Every instrumented call captures input/output, latency, token counts, and costs. When debugging production issues, you can filter traces by metadata, search through prompts and responses, and see exactly what the model received and returned.
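Because these are ordinary OpenTelemetry spans, you can attach your own attributes and filter on them later. A quick sketch; the attribute keys are illustrative, not a Braintrust convention:

```java
import io.opentelemetry.api.trace.Span;

// Tag the current span with request-level metadata to filter traces by later.
String userId = "user-123"; // e.g., pulled from your request context
Span span = Span.current();
span.setAttribute("app.user_id", userId);
span.setAttribute("app.prompt_version", "2025-10-20");
```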
Run evals in CI/CD: Write test cases with expected outputs, define custom scoring functions, and run them on every commit. When you change a prompt or switch models, the eval framework shows which test cases passed, which failed, and aggregate scores across your test suite.
Fetch prompts from Braintrust: Instead of hardcoding prompts in your application, store them in Braintrust and fetch them at runtime. This lets you iterate on prompts without redeploying code and makes A/B testing different versions straightforward.
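As an illustration only, fetching a stored prompt might look like the sketch below. The `prompts().get(...)` and `render(...)` calls are hypothetical names, not the SDK's confirmed API; see the README for the real methods:

```java
import java.util.Map;

// Hypothetical sketch: prompts().get(...) and render(...) are illustrative
// names, not confirmed SDK methods. Check the README for the actual API.
var braintrust = Braintrust.get();
var prompt = braintrust.prompts().get("support-triage-prompt");
String rendered = prompt.render(Map.of("ticket", "My order never arrived"));
```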
Here's how to instrument an OpenAI client:
```java
Braintrust braintrust = Braintrust.get();
OpenTelemetry openTelemetry = braintrust.openTelemetryCreate();
OpenAIClient oaiClient = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

// Use the client as normal
var response = oaiClient.chat().completions().create(
    ChatCompletionCreateParams.builder()
        .model(ChatModel.GPT_4O_MINI)
        .addUserMessage("Explain quantum computing")
        .build());
```

Every OpenAI call now flows through OpenTelemetry instrumentation, capturing inputs, outputs, latency, token usage, and costs.
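Wrapping an Anthropic client follows the same pattern. In the sketch below, the wrapper class name `BraintrustAnthropic` simply mirrors `BraintrustOpenAI` and is an assumption; verify the exact class in the README:

```java
// Assumed wrapper class name, mirroring BraintrustOpenAI; check the README.
AnthropicClient anthropicClient =
    BraintrustAnthropic.wrapAnthropic(openTelemetry, AnthropicOkHttpClient.fromEnv());
```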
To run an evaluation, define your task, test cases, and scoring functions:
```java
var braintrust = Braintrust.get();
var openTelemetry = braintrust.openTelemetryCreate();
var openAIClient = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

// Define your task
Function<String, String> getFoodType = (String food) -> {
    var request = ChatCompletionCreateParams.builder()
        .model(ChatModel.GPT_4O_MINI)
        .addSystemMessage("Return a one word answer")
        .addUserMessage("What kind of food is " + food + "?")
        .maxTokens(50L)
        .temperature(0.0)
        .build();
    var response = openAIClient.chat().completions().create(request);
    return response.choices().get(0).message().content().orElse("").toLowerCase();
};

// Define your eval
var eval = braintrust.<String, String>evalBuilder()
    .name("food-classification-eval")
    .cases(
        EvalCase.of("asparagus", "vegetable"),
        EvalCase.of("banana", "fruit"),
        EvalCase.of("chicken", "protein"))
    .task(getFoodType)
    .scorers(
        Scorer.of("fruit_scorer", result -> "fruit".equals(result) ? 1.0 : 0.0),
        Scorer.of("vegetable_scorer", result -> "vegetable".equals(result) ? 1.0 : 0.0))
    .build();

// Run it
var result = eval.run();
System.out.println(result.createReportString());
```

This produces a detailed report showing per-case scores, aggregate metrics, and links to the Braintrust UI where you can drill into individual traces. Run this in CI/CD to catch regressions.
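One way to gate a CI job on those scores is to fail the build when the aggregate drops below a threshold. The `scores()` accessor below is a stand-in for whatever the result type actually exposes; adapt it to the real API:

```java
// Hypothetical CI gate: result.scores() is a stand-in accessor, not a
// confirmed SDK method; adapt to the real result type.
double average = result.scores().values().stream()
    .mapToDouble(Double::doubleValue)
    .average()
    .orElse(0.0);
if (average < 0.9) {
    System.err.println("Eval regression detected:\n" + result.createReportString());
    System.exit(1); // non-zero exit fails the CI job
}
```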
We're excited to support the AI developers building with Java.
For more examples, check out the README. The artifact is available on Maven Central. If you run into issues or have questions, please let us know on Discord.
