AWS Machine Learning Blog

Fine-tune Amazon Nova Lite to optimize document processing

This post details how to use multimodal fine-tuning of Amazon Nova Lite to improve the accuracy and efficiency of document processing. It focuses on the challenges of processing specialized documents such as invoices, purchase orders, and forms, and provides a complete hands-on guide from data preparation to model deployment. Through fine-tuning, the model better understands complex document layouts and handles variations in data quality and language, significantly improving its ability to extract structured information. The post also compares approaches such as zero-shot prompting, few-shot prompting, and fine-tuning, and shows that the fine-tuned model achieves substantial performance gains when extracting key information (such as employee and employer information, earnings, and benefits), particularly in precision and recall. In addition, it covers best practices for dataset preparation, model training configuration, and inference options, and analyzes cost-effectiveness, emphasizing that fine-tuning Amazon Nova Lite delivers high-accuracy document processing while remaining cost-effective and scalable.

🎯 **Document processing challenges and why fine-tuning matters**: General-purpose large language models (LLMs) often underperform on specialized documents such as invoices, tax forms, and loan applications because of complex layouts, diverse formats, varying data quality, and strict accuracy requirements. Fine-tuning addresses this by training the model on a task-specific dataset, so it learns document-specific layouts and field relationships, adapts to data variability, and produces consistent structured output, significantly improving performance on document processing tasks.

🔧 **Two main techniques for customizing Amazon Nova Lite**: The post covers two customization techniques for Amazon Nova models. The first is task-specific fine-tuning, which adjusts model weights through supervised fine-tuning (SFT), with a choice of parameter-efficient fine-tuning (PEFT) or full-model fine-tuning. The second is knowledge distillation, which transfers the capabilities of a larger model into a smaller, more efficient one. Both techniques are straightforward to run on Amazon Bedrock without complex infrastructure management.

📈 **Significant performance gains from fine-tuning**: On a W2 tax form data extraction task, fine-tuned Amazon Nova Lite substantially improved accuracy, precision, and F1 scores across key field categories, including employee information, employer information, earnings, and benefits. For example, accuracy on employer information rose from 58.67% to 92.67%, and precision on multi-state employment information improved by nearly 40% while maintaining 100% recall, demonstrating the effectiveness of fine-tuning for document information extraction.

📊 **Key steps in data preparation and model deployment**: Successful fine-tuning depends on high-quality training data. The post highlights the importance of dataset analysis, baseline model evaluation, prompt optimization, and data formatting (JSONL). For deployment, it describes two inference options: on-demand model inference (ODI), which is flexible and cost-effective for workloads with fluctuating usage, and Provisioned Throughput endpoints, which suit steady, high-volume traffic. Cleaning up unused models and deployment resources to avoid unnecessary costs is also an important final step.

Multimodal fine-tuning represents a powerful approach for customizing vision large language models (LLMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business.

A common use case is document processing: extracting structured information from complex layouts such as invoices, purchase orders, forms, tables, or technical diagrams. Although off-the-shelf LLMs often struggle with specialized documents like tax forms, invoices, and loan applications, fine-tuned models can learn from high variation in the data and deliver significantly higher accuracy while reducing processing costs.

This post provides a comprehensive hands-on guide to fine-tuning Amazon Nova Lite for document processing tasks, with a focus on tax form data extraction. Using our open-source GitHub repository code sample, we demonstrate the complete workflow from data preparation to model deployment. Because Amazon Bedrock provides on-demand inference with pay-per-token pricing for Amazon Nova, we benefit from the accuracy improvements of model customization while maintaining a pay-as-you-go cost structure.

The document processing challenge

Given a single or multi-page document, the goal is to extract or derive specific structured information from the document so that it can be used for downstream systems or additional insights. The following diagram shows how a vision LLM can be used to derive the structured information based on a combination of text and vision capabilities.

The key challenges for enterprises in workflow automation when processing documents, like invoices or W2 tax forms, are the following:

    Complex and varied layouts: key fields appear in different positions and formats across document templates.
    Varying data quality: scanned documents can be noisy, skewed, or low resolution.
    Language and terminology differences across documents.
    Strict accuracy requirements, because the extracted values feed downstream systems.

Approaches for intelligent document processing that use LLMs or vision LLMs fall into three main categories:

    Zero-shot prompting: instruct the base model to extract the target fields without examples.
    Few-shot prompting: include a small number of annotated example documents in the prompt.
    Fine-tuning: train the model on an annotated dataset so it adapts to your documents.

For the first two approaches, refer to the amazon-nova-samples repository, which contains sample code showing how to use the Amazon Bedrock Converse API for structured output with tool calling.
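
The following is a minimal sketch of that pattern, assuming a boto3 environment with Amazon Bedrock access; the tool name, schema fields, file name, and model ID are illustrative, not the exact values from the repository:

import boto3

# Runtime client for model invocation
client = boto3.client("bedrock-runtime")

# Describe the target JSON structure as a tool, so the model must respond
# with arguments that match the schema (illustrative W2 fields)
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "extract_w2_fields",
            "description": "Return structured fields extracted from a W2 form.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "employee_name": {"type": "string"},
                    "employer_name": {"type": "string"},
                    "wages": {"type": "number"},
                },
                "required": ["employee_name", "employer_name", "wages"],
            }},
        }
    }],
    # Force a call to the extraction tool instead of free-form text
    "toolChoice": {"tool": {"name": "extract_w2_fields"}},
}

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the W2 fields from this form."},
        ],
    }],
    toolConfig=tool_config,
)

# The structured output arrives as the tool call's input arguments
tool_use = next(
    block["toolUse"] for block in response["output"]["message"]["content"]
    if "toolUse" in block
)
print(tool_use["input"])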

Off-the-shelf LLMs excel at general document understanding, but they might not optimally handle domain-specific challenges. A fine-tuned Nova model can enhance performance by:

    Learning document-specific layouts and field relationships
    Adapting to variability in data quality and formats
    Providing consistent structured output that downstream systems can consume

Creating the annotated dataset and selecting the customization technique

While there are various methods for customizing Amazon Nova models, the most relevant for document processing are the following:

    Task-specific fine-tuning: supervised fine-tuning (SFT) that adjusts model weights, with a choice of parameter-efficient fine-tuning (PEFT) or full-model fine-tuning
    Knowledge distillation: transferring the capabilities of a larger teacher model into a smaller, more efficient student model

To learn from previous examples, you need either an annotated dataset from which the model can learn, or a model that is already good enough at your task to serve as a teacher. There are three options for obtaining such a dataset:

    Automated dataset annotation with historic data from Enterprise Resource Planning (ERP) systems, such as SAP: Many customers already have historic documents that were manually processed and consumed by downstream systems, like ERP or customer relationship management (CRM) systems. Explore existing downstream systems like SAP and the data they contain. This data can often be mapped back to the original source document it was derived from, and it helps you bootstrap an annotated dataset very quickly.
    Manual dataset annotation: Identify the most relevant documents and formats, and annotate them using human annotators, so that you have document/JSON pairs where the JSON contains the target information that you want to extract or derive from your source documents.
    Annotate with the teacher model: Explore whether a larger model like Nova Premier can provide accurate enough results using prompt engineering. If so, you can also use distillation.

For the first and second options, we recommend supervised model fine-tuning. For the third, model distillation is the right approach.

Amazon Bedrock currently provides both fine-tuning and distillation techniques, so anyone with a basic data science skillset can easily submit jobs. The jobs run on compute fully managed by Amazon, so you don't have to worry about instance sizes or capacity limits.

Nova customization is also available in Amazon SageMaker, with more options and controls. For example, if you have sufficient high-quality labeled data and want deeper customization for your use case, full-rank fine-tuning might produce higher accuracy. Full-rank fine-tuning is supported with SageMaker training jobs and SageMaker HyperPod.

Data preparation best practices

The quality and structure of your training data fundamentally determine the success of fine-tuning. Here are key steps and considerations for preparing effective multimodal datasets and configuring your fine-tuning job:

Dataset analysis and base model evaluation

Our demonstration uses a synthetic dataset of W2 tax forms: the Fake W-2 (US Tax Form) Dataset. This public dataset comprises simulated US tax returns (W-2 statements for years 2016-19), including noisy images that mimic low-quality scanned W2 tax forms.

Before fine-tuning, it’s crucial to:

    Analyze dataset characteristics (image quality, field completeness, class distribution), define use-case-specific evaluation metrics, and establish baseline model performance.
    Compare each predicted field value against the ground truth, calculating precision, recall, and F1 scores for individual fields and overall performance (a minimal sketch of this evaluation follows the list).
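
The following is a minimal sketch of such a field-level evaluation, under the assumption that each prediction and ground truth is a flat dict mapping field names to string values; the matching convention (exact match after whitespace stripping) is illustrative:

def field_metrics(predictions, ground_truths):
    """Compute micro precision/recall/F1 over a list of document examples."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, ground_truths):
        for field, true_value in truth.items():
            pred_value = pred.get(field)
            if pred_value is None or pred_value == "":
                fn += 1  # expected value is missing from the prediction
            elif str(pred_value).strip() == str(true_value).strip():
                tp += 1  # exact match with ground truth
            else:
                fp += 1  # a value was extracted, but it is wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: one document with one wrong field value
print(field_metrics(
    [{"employee_name": "JANE DOE", "wages": "52000.00"}],
    [{"employee_name": "JANE DOE", "wages": "52100.00"}],
))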

Prompt optimization

Crafting an effective prompt is essential for aligning the model with task requirements. Our system comprises two key components:

    System prompt: Defines the task, provides detailed instructions for each field to be extracted, and specifies the output format.
    User prompt: Follows Nova vision understanding best practices, using the {media_file}-then-{text} structure as outlined in the Amazon Nova model user guide.

Iterate on your prompts using the base model to optimize performance before fine-tuning.
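
As a minimal sketch, the two components might be assembled as follows; the instruction text and file name are illustrative, not the exact prompts from the repository:

system = [{"text": (
    "You extract fields from W2 tax forms. Follow the per-field instructions "
    "and return a single JSON object, using null for fields that are absent."
)}]

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

# Nova vision best practice: the media block comes before the text block
user_message = {
    "role": "user",
    "content": [
        {"image": {"format": "png", "source": {"bytes": image_bytes}}},  # {media_file}
        {"text": "Extract all W2 fields from the form above as JSON."},  # {text}
    ],
}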

Dataset preparation

Prepare your dataset in JSONL format and split it into training, validation, and test sets (a formatting and splitting sketch follows the list):

    Training set: 70-80% of data
    Validation set: 10-20% of data
    Test set: 10-20% of data
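
The following is a minimal formatting and splitting sketch, assuming the Amazon Bedrock conversation schema (schemaVersion bedrock-conversation-2024) used for Nova customization datasets; the S3 URI, account ID, label fields, and split ratios are illustrative:

import json
import random

def make_record(image_s3_uri, bucket_owner, label, system_text, user_text):
    """One training example in the conversation schema."""
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": system_text}],
        "messages": [
            {"role": "user", "content": [
                # Image first, then text, matching the prompt structure above
                {"image": {"format": "png", "source": {
                    "s3Location": {"uri": image_s3_uri, "bucketOwner": bucket_owner}}}},
                {"text": user_text},
            ]},
            # Ground-truth JSON the model should learn to produce
            {"role": "assistant", "content": [{"text": json.dumps(label)}]},
        ],
    }

# Illustrative: build one record per (image, label) pair in your annotated dataset
records = [
    make_record(
        "s3://my-bucket/images/w2_0001.png", "123456789012",
        {"employee_name": "JANE DOE", "wages": "52000.00"},
        "You extract fields from W2 tax forms and return a single JSON object.",
        "Extract all W2 fields from the form above as JSON.",
    ),
]

random.seed(42)
random.shuffle(records)
n = len(records)
splits = {
    "train.jsonl": records[: int(0.8 * n)],                    # 80% training
    "validation.jsonl": records[int(0.8 * n): int(0.9 * n)],   # 10% validation
    "test.jsonl": records[int(0.9 * n):],                      # 10% test
}
for name, subset in splits.items():
    with open(name, "w") as f:
        for rec in subset:
            f.write(json.dumps(rec) + "\n")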

Fine-tuning job configuration and monitoring

Once the dataset is prepared and uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, you can configure and submit the fine-tuning job on Amazon Bedrock. Key parameters include the following (a job submission sketch follows the list):

    Epochs: Number of complete passes through the training dataset. Determines how many times the model sees the entire dataset during training.
    Learning rate: Step size for gradient descent optimization. Controls how much model weights are adjusted in response to estimated error.
    Learning rate warmup steps: Number of steps to gradually increase the learning rate. Prevents instability by slowly ramping up the learning rate from a small value to the target rate.
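
As a minimal sketch, a job submission might look like the following; the role ARN, S3 URIs, base model identifier, and hyperparameter names and values are placeholders to adapt to your account and the current Amazon Bedrock documentation:

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="nova-lite-w2-ft",
    customModelName="nova-lite-w2-extractor",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.nova-lite-v1:0:300k",  # placeholder base model ID
    customizationType="FINE_TUNING",
    hyperParameters={
        "epochCount": "2",                # passes through the training set
        "learningRate": "0.00001",        # gradient descent step size
        "learningRateWarmupSteps": "10",  # gradual ramp-up to the target rate
    },
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    validationDataConfig={"validators": [{"s3Uri": "s3://my-bucket/validation.jsonl"}]},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
)
print(response["jobArn"])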

Amazon Bedrock customization provides validation loss metrics throughout the training process. Monitor these metrics to confirm that the model is converging and to catch overfitting or underfitting early.

The following graph shows an example metric analysis:

When analyzing the training and validation loss curves, the relative behavior between these metrics provides crucial insights into the model's learning dynamics. In an optimal learning pattern, both losses decrease steadily and converge, with the validation loss closely tracking the training loss; a validation loss that flattens or rises while the training loss keeps falling indicates overfitting.

Model inference options for customized models

Once your custom model has been created in Amazon Bedrock, there are two main ways to run inference against it: on-demand custom model inference (ODI) deployments, or Provisioned Throughput endpoints. Let's talk about why and when to choose one over the other.

On-demand custom model deployments provide a flexible and cost-effective way to leverage your custom Bedrock models. With on-demand deployments, you only pay for the compute resources you use, based on the number of tokens processed during inference. This makes on-demand a great choice for workloads with variable or unpredictable usage patterns, where you want to avoid over-provisioning resources. The on-demand approach also offers automatic scaling, so you don’t have to worry about managing infrastructure capacity. Bedrock will automatically provision the necessary compute power to handle your requests in near real time. This self-service, serverless experience can simplify your operations and deployment workflows.

Alternatively, Provisioned Throughput endpoints are recommended for workloads with steady traffic patterns and consistent high-volume requirements, offering predictable performance and cost benefits over on-demand scaling.

This example uses the ODI option to take advantage of per-token pricing. The following code snippet shows how you can create an ODI deployment for your custom model:

import time

import boto3

# Amazon Bedrock control-plane client used to manage custom model deployments
bedrock = boto3.client("bedrock")

# Function to create on-demand inferencing deployment for custom model
def create_model_deployment(custom_model_arn):
    """
    Create an on-demand inferencing deployment for the custom model

    Parameters:
    -----------
    custom_model_arn : str
        ARN of the custom model to deploy

    Returns:
    --------
    deployment_arn : str
        ARN of the created deployment
    """
    try:
        print(f"Creating on-demand inferencing deployment for model: {custom_model_arn}")

        # Generate a unique name for the deployment
        deployment_name = f"nova-ocr-deployment-{time.strftime('%Y%m%d-%H%M%S')}"

        # Create the deployment
        response = bedrock.create_custom_model_deployment(
            modelArn=custom_model_arn,
            modelDeploymentName=deployment_name,
            description=f"on-demand inferencing deployment for model: {custom_model_arn}",
        )

        # Get the deployment ARN
        deployment_arn = response.get('customModelDeploymentArn')

        print(f"Deployment request submitted. Deployment ARN: {deployment_arn}")
        return deployment_arn

    except Exception as e:
        print(f"Error creating deployment: {e}")
        return None
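
Once the deployment is active, a minimal invocation sketch looks like the following: with on-demand deployments, the deployment ARN is passed as the model ID (the ARN value, file name, and prompt are illustrative):

import boto3

runtime = boto3.client("bedrock-runtime")

# ARN returned by create_model_deployment above (placeholder value)
deployment_arn = "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/example"

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

response = runtime.converse(
    modelId=deployment_arn,  # the deployment ARN acts as the model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract all W2 fields from the form above as JSON."},
        ],
    }],
)
print(response["output"]["message"]["content"][0]["text"])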

Evaluation: Accuracy improvement with fine-tuning

Our evaluation of the base model and the fine-tuned Nova model shows significant improvements across all field categories. Let’s break down the performance gains:

Field category          Metric      Base model   Fine-tuned model   Improvement
Employee information    Accuracy    58%          82.33%             24.33%
Employee information    Precision   57.05%       82.33%             25.28%
Employee information    Recall      100%         100%               0%
Employee information    F1 score    72.65%       90.31%             17.66%
Employer information    Accuracy    58.67%       92.67%             34%
Employer information    Precision   53.66%       92.67%             39.01%
Employer information    Recall      100%         100%               0%
Employer information    F1 score    69.84%       96.19%             26.35%
Earnings                Accuracy    62.71%       85.57%             22.86%
Earnings                Precision   60.97%       85.57%             24.60%
Earnings                Recall      99.55%       100%               0.45%
Earnings                F1 score    75.62%       92.22%             16.60%
Benefits                Accuracy    45.50%       60%                14.50%
Benefits                Precision   45.50%       60%                14.50%
Benefits                Recall      93.81%       100%               6.19%
Benefits                F1 score    61.28%       75%                13.72%
Multi-state employment  Accuracy    58.29%       94.19%             35.90%
Multi-state employment  Precision   52.14%       91.83%             39.69%
Multi-state employment  Recall      99.42%       100%               0.58%
Multi-state employment  F1 score    68.41%       95.74%             27.33%

The following graphic shows a bar chart comparing the F1 scores of the base model and fine-tuned model for each field category, with the improvement percentage shown in the previous table:

Key observations:

    Precision gains are largest in the categories where the base model was weakest: multi-state employment (+39.69%) and employer information (+39.01%).
    Recall is 100% across all field categories after fine-tuning, so the model does not miss ground-truth values.
    Benefits remains the most difficult category, improving from a 61.28% to a 75% F1 score but still trailing the other fields.

Clean up

To avoid incurring unnecessary costs when you're no longer using your custom model, it's important to properly clean up the resources. Follow these steps to remove both the deployment and the custom model (a sketch of the corresponding API calls follows the list):

    Delete the custom model deployment.
    Delete the custom model.
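
As a minimal sketch, the two calls might look like the following, where the ARN values are placeholders for the ones returned by the earlier deployment and customization steps:

import boto3

bedrock = boto3.client("bedrock")

# Placeholders: use the ARNs from your own deployment and customization steps
deployment_arn = "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/example"
custom_model_arn = "arn:aws:bedrock:us-east-1:123456789012:custom-model/example"

# 1. Delete the on-demand deployment first, because it references the model
bedrock.delete_custom_model_deployment(
    customModelDeploymentIdentifier=deployment_arn
)

# 2. Then delete the custom model itself
bedrock.delete_custom_model(modelIdentifier=custom_model_arn)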

Cost analysis

In our example, we chose an Amazon Bedrock fine-tuning job, which uses PEFT and supports ODI. PEFT fine-tuning of Nova Lite paired with on-demand inference offers a cost-effective and scalable solution for enhanced document processing. The cost structure is straightforward and transparent:

One-time cost:

    The fine-tuning job itself, billed for the tokens processed during training

Ongoing costs:

    On-demand inference, billed per input and output token processed
    Monthly storage for the custom model

On-demand inference allows you to run your custom Nova models without maintaining provisioned endpoints, enabling pay-as-you-go pricing based on actual token usage. This approach eliminates the need for capacity planning while ensuring cost-efficient scaling.

Conclusion

In this post, we’ve demonstrated how fine-tuning Amazon Nova Lite can transform document processing accuracy while maintaining cost efficiency. Our evaluation shows significant performance gains, with up to 39% improvement in precision for critical fields and perfect recall across key document categories. While our implementation did not require constrained decoding, tool calling with Nova can provide additional reliability for more complex structured outputs, especially when working with intricate JSON schemas. Please refer to the resource on structured output with tool calling for further information.

The flexible deployment options, including on-demand inference with pay-per-use pricing, eliminate infrastructure overhead while maintaining the same inference costs as the base model. With the dataset we used for this example, the runtime inference cost per page was $0.00021, making it a cost-effective solution. Through practical examples and step-by-step guides, we've shown how to prepare training data, fine-tune models, and evaluate performance with clear metrics.

To get started with your own implementation, visit our GitHub repository for complete code samples and detailed documentation.


About the authors

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and generative AI for Europe Central, based in the AWS Zurich office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (graph drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include machine learning, generative AI, and in particular agentic systems with multimodal LLMs for document processing and structured insights.

Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications, from prompt optimization to fine-tuning vision language models for document processing. The most recent example: working in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that's optimizing a customer's cloud deployment or preparing for the next hike in the clouds.
