A Guide to Using the Vertex AI Evaluation Service

 

The Vertex AI Evaluation Service lets users evaluate the performance of LLMs or applications against existing or custom criteria. It supports academic metrics such as BLEU and ROUGE, as well as Pointwise and Pairwise metrics based on G-Eval Coherence. This post deploys a Llama 3.1 8B model and evaluates news-summary generation with different prompt templates on the Coherence metric, showing how the Gen AI Evaluation Service can be used to optimize model outputs.

📦 The article explains how to set up and configure the Gen AI Evaluation Service in Vertex AI, including installing the gcloud CLI and the google-cloud-aiplatform SDK, setting environment variables, and enabling the required GCP service APIs.

🤖 It walks through deploying the Llama 3.1 8B model on Vertex AI, including logging in to the Hugging Face Hub to obtain an access token, registering and deploying the model with the google-cloud-aiplatform SDK, and configuring inference parameters.

📊 It shows how to use the Gen AI Evaluation Service to compare different prompt templates on the Coherence metric, including defining a Pointwise metric based on G-Eval Coherence, selecting the argilla/news-summary dataset, and comparing the evaluation results of the different prompts.

🔍 Finally, it demonstrates how to analyze the evaluation results and identify the best prompt template, and it discusses limitations such as score fluctuations caused by the small sample size and how the prompts could be further refined to improve model performance.

The Gen AI Evaluation Service in Vertex AI lets us evaluate LLMs or applications using existing or your own evaluation criteria. It supports academic metrics like BLEU and ROUGE, LLM-as-a-Judge with Pointwise and Pairwise metrics, or custom metrics you can define yourself. By default, Gemini 1.5 Pro is used as the LLM judge.
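As a quick illustration of the computation-based metrics (which we won't use in the rest of this post), an EvalTask can also score pre-generated responses against references. The snippet below is a minimal sketch; the string metric names and the response/reference column names are assumptions based on the SDK's prebuilt metrics, so double-check them against the documentation:

import pandas as pd
from vertexai.evaluation import EvalTask

# Minimal sketch: score pre-generated responses against references with
# computation-based metrics. "bleu" and "rouge_l_sum" are assumed prebuilt
# metric names; "response" and "reference" are the expected column names.
# Assumes aiplatform/vertexai has been initialized for your project (see below).
eval_df = pd.DataFrame({
    "response": ["Berlin is the capital of Germany."],
    "reference": ["The capital of Germany is Berlin."],
})
bleu_rouge_task = EvalTask(dataset=eval_df, metrics=["bleu", "rouge_l_sum"])
print(bleu_rouge_task.evaluate().summary_metrics)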

We can use the Gen AI Evaluation Service to evaluate the performance of open models and fine-tuned models using Vertex AI Endpoints and compute resources. In this example we will evaluate summaries of news articles generated by meta-llama/Meta-Llama-3.1-8B-Instruct, using a Pointwise metric based on the G-Eval Coherence metric.

We will cover the following topics:

    Setup / Configuration
    Deploy Llama 3.1 8B on Vertex AI
    Evaluate Llama 3.1 8B using different prompts on Coherence
    Interpret the results
    Clean up resources

Setup / Configuration

First, you need to install gcloud, the command-line tool for Google Cloud, on your local machine, following the instructions at Cloud SDK Documentation - Install the gcloud CLI.

Then, you also need to install the google-cloud-aiplatform Python SDK, required to programmatically create the Vertex AI model, register it, create the endpoint, and deploy it on Vertex AI.

!pip install --upgrade --quiet "google-cloud-aiplatform[evaluation]"  huggingface_hub transformers datasets

For ease of use we define the following environment variables for GCP.

Note 1: Make sure to adapt the project ID to your GCP project.
Note 2: The Gen AI Evaluation Service is not available in all regions. If you want to use it, you need to select a region that supports it. us-central1 is currently supported.

%env PROJECT_ID=gcp-partnership-412108
%env LOCATION=us-central1
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310

Then you need to login into your GCP account and set the project ID to the one you want to use to register and deploy the models on Vertex AI.

!gcloud auth login
!gcloud auth application-default login  # For local development
!gcloud config set project $PROJECT_ID

Once you are logged in, you need to enable the necessary service APIs in GCP, such as the Vertex AI API, the Compute Engine API, and Google Container Registry related APIs.

!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

Deploy Llama 3.1 8B on Vertex AI

Once everything is set up, we can deploy the Llama 3.1 8B model on Vertex AI. We will use the google-cloud-aiplatform Python SDK to do so. meta-llama/Meta-Llama-3.1-8B-Instruct is a gated model, so you need to log in to your Hugging Face Hub account with a read-access token, either a fine-grained token with access to the gated model or an overall read-access token for your account. More information on how to generate a read-only access token for the Hugging Face Hub can be found in the instructions at Hugging Face Hub Security Tokens.

from huggingface_hub import interpreter_login

interpreter_login()

After we are logged in, we can "upload" the model, i.e. register it on Vertex AI. If you want to learn more about the arguments you can pass to the upload method, check out Deploy Gemma 7B with TGI on Vertex AI.

import os
from google.cloud import aiplatform

aiplatform.init(
    project=os.getenv("PROJECT_ID"),
    location=os.getenv("LOCATION"),
)

We will deploy the meta-llama/Meta-Llama-3.1-8B-Instruct to 1x NVIDIA L4 accelerator with 24GB memory. We set TGI parameters to allow for a maximum of 8000 input tokens, 8192 maximum total tokens, and 8192 maximum batch prefill tokens.

from huggingface_hub import get_token

vertex_model_name = "llama-3-1-8b-instruct"

model = aiplatform.Model.upload(
    display_name=vertex_model_name,
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "MAX_INPUT_TOKENS": "8000",
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_BATCH_PREFILL_TOKENS": "8192",
        "HUGGING_FACE_HUB_TOKEN": get_token(),
    },
    serving_container_ports=[8080],
)
model.wait()  # wait for the model to be registered

# create endpoint
endpoint = aiplatform.Endpoint.create(display_name=f"{vertex_model_name}-endpoint")

# deploy model to 1x NVIDIA L4
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

WARNING: The Vertex AI endpoint deployment via the deploy method may take from 15 to 25 minutes.
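If you re-run the notebook later and the endpoint from a previous run still exists, you can re-attach to it instead of deploying again. This is a small sketch assuming the endpoint display name created above; deployed_model stays an aiplatform.Endpoint either way:

# Optional: re-attach to an endpoint from a previous run instead of redeploying.
# The display name matches the endpoint created above.
existing = aiplatform.Endpoint.list(filter='display_name="llama-3-1-8b-instruct-endpoint"')
if existing:
    deployed_model = existing[0]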

After the model is deployed, we can test our endpoint. We define a helper generate function to send requests to the deployed model. This will later be used to send requests to the deployed model and collect the outputs for evaluation.

import re
from transformers import AutoTokenizer

# grep the model id from the container spec environment variables
model_id = next((re.search(r'value: "(.+)"', str(item)).group(1) for item in list(model.container_spec.env) if 'MODEL_ID' in str(item)), None)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

generation_config = {
  "max_new_tokens": 256,
  "do_sample": True,
  "top_p": 0.2,
  "temperature": 0.2,
}

def generate(prompt, generation_config=generation_config):
  formatted_prompt = tokenizer.apply_chat_template(
      [
        {"role": "user", "content": prompt},
      ],
      tokenize=False,
      add_generation_prompt=True,
  )
  payload = {
    "inputs": formatted_prompt,
    "parameters": generation_config
  }
  output = deployed_model.predict(instances=[payload])
  generated_text = output.predictions[0]
  return generated_text

generate("How many people live in Berlin?", generation_config)
# 'The population of Berlin is approximately 6.578 million as of my cut off data. However, considering it provides real-time updates, the current population might be slightly higher'

Evaluate Llama 3.1 8B using different prompts on Coherence

We will evaluate the Llama 3.1 8B model using different prompts on Coherence. Coherence measures how well the individual sentences within a summarized news article connect together to form a unified and easily understandable narrative.

We are going to use the new Generative AI Evaluation Service. The Gen AI Evaluation Service can be used to:

    Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.
    Generation settings: Tweak model parameters (like temperature) to optimize output for your needs.
    Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.
    Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.
    RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.
    Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.

In our case, we will use it to evaluate different prompt templates to achieve the most coherent summaries using Llama 3.1 8B Instruct.

We are going to use a reference-free Pointwise metric based on the G-Eval Coherence metric.

The first step is to define our prompt template and create our PointwiseMetric. Vertex AI returns the model's output in the response field; our news article will be made available in the text field.

from vertexai.evaluation import EvalTask, PointwiseMetric

g_eval_coherence = """You are an expert evaluator. You will be given one summary written for a news article.
Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic."

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

Example:

Source Text:

{text}

Summary:

{response}

Evaluation Form (scores ONLY):

- Coherence:"""

metric = PointwiseMetric(
    metric="g-eval-coherence",
    metric_prompt_template=g_eval_coherence,
)
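Before running a full evaluation, it can help to sanity-check that the template renders as expected. The snippet below is plain Python string formatting with a made-up article/summary pair, nothing Vertex-specific:

# Sanity check: render the metric prompt with a toy article/summary pair to
# verify that the {text} and {response} placeholders resolve as expected.
print(g_eval_coherence.format(
    text="Reuters reports that the central bank raised interest rates by 0.25%.",
    response="The central bank raised rates by a quarter point, Reuters reports.",
))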

We are going to use the argilla/news-summary dataset, consisting of news articles from Reuters. We are going to use a random subset of 15 articles to keep the evaluation fast. Feel free to change the dataset and the number of articles to evaluate the model with more data and different topics.

from datasets import load_dataset

subset_size = 15
dataset = load_dataset("argilla/news-summary", split="train").shuffle(seed=42).select(range(subset_size))

# print first 150 characters of the first article
print(dataset[0]["text"][:150])

Before we can run the evaluation, we need to convert our dataset into a pandas dataframe.

# remove all columns except for "text"
to_remove = [col for col in dataset.features.keys() if col != "text"]
dataset = dataset.remove_columns(to_remove)
df = dataset.to_pandas()
df.head()

Awesome! We are almost ready. The last step is to define the different summarization prompts we want to use for evaluation.

summarization_prompts = {
  "simple": "Summarize the following news article: {text}",
  "eli5": "Summarize the following news article in a way a 5 year old would understand: {text}",
  "detailed": """Summarize the given news article, text, including all key points and supporting details? The summary should be comprehensive and accurately reflect the main message and arguments presented in the original text, while also being concise and easy to understand. To ensure accuracy, please read the text carefully and pay attention to any nuances or complexities in the language.

Article:
{text}"""
}
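As a quick spot check before kicking off the full evaluation loop, you can generate a single summary with one of the prompts and eyeball the output (reusing the generate helper defined above):

# Spot check: generate one summary with the "simple" prompt before running the
# full evaluation loop.
sample_prompt = summarization_prompts["simple"].format(text=df["text"].iloc[0])
print(generate(sample_prompt))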

Now we can iterate over our prompts, create an evaluation task for each, use our coherence metric to evaluate the summaries, and collect the results.

import uuid

results = {}
for prompt_name, prompt in summarization_prompts.items():
    prompt = summarization_prompts[prompt_name]

    # 1. add new prompt column
    df["prompt"] = df["text"].apply(lambda x: prompt.format(text=x))

    # 2. create eval task
    eval_task = EvalTask(
        dataset=df,
        metrics=[metric],
        experiment="llama-3-1-8b-instruct",
    )
    # 3. run eval task
    # Note: If the last iteration takes > 1 minute you might need to retry the evaluation
    exp_results = eval_task.evaluate(model=generate, experiment_run_name=f"prompt-{prompt_name}-{str(uuid.uuid4())[:8]}")
    print(f"{prompt_name}: {exp_results.summary_metrics['g-eval-coherence/mean']}")
    results[prompt_name] = exp_results.summary_metrics["g-eval-coherence/mean"]

for prompt_name, score in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{prompt_name}: {score}")

Nice, it looks like on our limited test the "simple" prompt yields the best results. We can inspect and compare the results in the GCP Console at Vertex AI > Model Development > Experiments.

The overview allows us to compare the results across different experiments and to inspect the individual evaluations. Here we can see that the standard deviation of the detailed prompt is quite high. This could be because of the low sample size, or it could mean that we need to improve the prompt further.
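To dig into the per-sample scores behind those summary numbers without leaving the notebook, the EvalResult object also exposes a row-level table. The attribute and column names below are assumptions based on the SDK's metrics_table output, so verify them against your SDK version:

# Inspect per-row scores of the last evaluation run locally. `metrics_table`
# is assumed to be a pandas DataFrame with a "<metric>/score" column.
row_scores = exp_results.metrics_table["g-eval-coherence/score"]
print(row_scores.describe())  # mean, std, min/max across the 15 sampled articles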

You can find more examples of how to use the Gen AI Evaluation Service in the Vertex AI Generative AI documentation.

Resource clean-up

Finally, you can release the resources that you've created as follows, to avoid unnecessary costs:

    deployed_model.undeploy_all to undeploy the model from all the endpoints.
    deployed_model.delete to delete the endpoint/s where the model was deployed gracefully, after the undeploy_all method.
    model.delete to delete the model from the registry.
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
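If you also want to remove the experiment and its runs created by the EvalTask calls above, something along these lines should work, assuming the aiplatform.Experiment resource wrapper; double-check the API before deleting anything:

# Optional: delete the experiment created for the evaluation runs above.
# Assumes the aiplatform.Experiment resource API; verify before running.
aiplatform.Experiment("llama-3-1-8b-instruct").delete()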
