MarkTechPost@AI October 14, 17:25
LLM Generation Parameters: Understanding and Tuning

Output generation in large language models (LLMs) is fundamentally a decoding problem that can be shaped with sampling controls such as max tokens, temperature, top-p/nucleus, top-k, frequency penalty, and presence penalty. Max tokens sets an upper bound on response length, which affects latency and cost. Temperature controls output randomness: low values bias toward determinism, high values toward randomness. Top-p and top-k truncate the candidate token set by cumulative probability or rank, reducing repetition and improving novelty. Frequency and presence penalties reduce repetition or encourage the model to explore new topics. Stop sequences provide hard termination boundaries. These parameters interact and jointly shape the quality, length, and diversity of LLM output.

🎯 **Max Tokens**: A hard upper limit on how many tokens the model may generate in a response. It directly affects generation speed and cost, and serves as an extra guardrail against runaway output when stop sequences cannot be fully relied on.

🌡️ **Temperature**: Controls the randomness of generated text by scaling the logits before the softmax. Lower temperatures (near 0) sharpen the probability distribution and produce more deterministic text; higher temperatures (near 2) flatten it and produce more random, diverse text. A common recommendation is to tune only one of temperature or top-p/top-k, to avoid coupled sources of randomness.

🌟 **Top-p (Nucleus Sampling) and Top-k Sampling**: Top-p samples from the smallest token set whose cumulative probability reaches p, effectively truncating the low-probability tail and reducing off-topic rambling. Top-k restricts sampling to the k highest-probability tokens at each step. Both aim to balance diversity and coherence; top-p is typically tuned in the 0.9–0.95 range, while typical top-k values are smaller (5–50).

🔄 **Frequency Penalty and Presence Penalty**: The frequency penalty lowers a token's probability in proportion to how often it has already appeared in the generated text, reducing verbatim repetition. The presence penalty penalizes any token that has appeared at least once, encouraging the model to introduce new tokens and topics. Both typically range from -2.0 to +2.0; positive values reduce repetition or increase novelty, negative values do the opposite.

🛑 **Stop Sequences**: User-defined strings that immediately halt output when the model generates them. They are useful for producing structured data (such as JSON objects) or keeping output within specific boundaries, and are usually combined with max_tokens as a double safeguard.

Tuning LLM outputs is largely a decoding problem: you shape the model’s next-token distribution with a handful of sampling controls—max tokens (caps response length under the model’s context limit), temperature (logit scaling for more/less randomness), top-p/nucleus and top-k (truncate the candidate set by probability mass or rank), frequency and presence penalties (discourage repetition or encourage novelty), and stop sequences (hard termination on delimiters). These seven parameters interact: temperature widens the tail that top-p/top-k then crop; penalties mitigate degeneration during long generations; stop plus max tokens provides deterministic bounds. The sections below define each parameter precisely and summarize vendor-documented ranges and behaviors grounded in the decoding literature.
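To make the parameter surface concrete, here is a minimal sketch of one request that sets these controls through the OpenAI Python SDK. The model name, prompt, and numeric values are placeholders chosen for illustration; top_k is shown only as a comment because OpenAI's API does not expose it, while stacks such as Anthropic's API, Hugging Face transformers, and llama.cpp do.

```python
# Minimal sketch: one chat completion request with the sampling controls
# discussed below. Model name, prompt, and values are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",    # placeholder model name
    messages=[{"role": "user", "content": "Summarize nucleus sampling in two sentences."}],
    max_tokens=256,         # hard cap on generated tokens
    temperature=0.7,        # logit scaling before softmax
    top_p=0.95,             # nucleus truncation by cumulative probability
    frequency_penalty=0.3,  # down-weight tokens by how often they already appeared
    presence_penalty=0.2,   # down-weight any token that appeared at least once
    stop=["\n\n###"],       # hard termination on this delimiter
    # top_k is not available on this API; set it on stacks that expose it
    # (e.g., transformers' generate(), llama.cpp, Anthropic's Messages API).
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "stop" vs. "length" (hit max_tokens)
```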

1) Max tokens (a.k.a. max_tokens, max_output_tokens, max_new_tokens)

What it is: A hard upper bound on how many tokens the model may generate in this response. It doesn’t expand the context window; the sum of input tokens and output tokens must still fit within the model’s context length. If the limit hits first, the API marks the response “incomplete/length.”

When to tune: Lower the cap when latency and cost are the priority; raise it if responses come back marked "incomplete/length"; and keep it in place as a guardrail against runaway output when stop sequences cannot be fully relied on (a budget sketch follows below).
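As a quick numeric illustration of the budget arithmetic (all numbers below are hypothetical), the effective output cap is whatever remains of the context window after the prompt:

```python
# Hypothetical numbers illustrating the context-budget arithmetic:
# max_tokens caps the output, but input + output must still fit the context window.
context_window = 128_000   # model's total context length (assumption)
input_tokens = 3_500       # measured with the model's tokenizer
max_tokens = 1_024         # requested output cap

effective_output_cap = min(max_tokens, context_window - input_tokens)
print(effective_output_cap)  # 1024 here; shrinks if the prompt grows

# If generation stops because the cap was hit, APIs typically report
# finish_reason == "length" rather than "stop".
```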

2) Temperature (temperature)

What it is: A scalar applied to logits before softmax:

$$\mathrm{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

Lower T sharpens the distribution (more deterministic); higher T flattens it (more random). Typical public APIs expose a range near [0, 2]. Use low T for analytical tasks and higher T for creative expansion.
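A small, self-contained sketch of the formula above on toy logits shows how T reshapes the distribution:

```python
# Temperature as logit scaling: softmax(z / T) on illustrative logits.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])

for T in (0.2, 1.0, 2.0):
    probs = softmax(logits / T)
    print(f"T={T}: {np.round(probs, 3)}")
# Low T concentrates mass on the top token (near-deterministic);
# high T flattens the distribution (more random samples).
```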

3) Nucleus sampling (top_p)

What it is: Sample only from the smallest set of tokens whose cumulative probability mass ≥ p. This truncates the long low-probability tail that drives classic “degeneration” (rambling, repetition). Introduced as nucleus sampling by Holtzman et al. (2019).

Practical notes: Typical starting points are top_p around 0.9–0.95; tune either temperature or top_p, not both, to avoid coupled randomness (a truncation sketch follows below).
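A toy implementation of the truncation step (illustrative probabilities, not from a real model) makes the "smallest set with mass ≥ p" rule concrete:

```python
# Nucleus (top-p) truncation on a toy distribution: keep the smallest set of
# tokens whose cumulative probability reaches p, renormalize, then sample.
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    order = np.argsort(probs)[::-1]                    # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormalized))

rng = np.random.default_rng(0)
probs = np.array([0.55, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01])
print(nucleus_sample(probs, p=0.9, rng=rng))  # samples only from the high-mass head
```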

4) Top-k sampling (top_k)

What it is: At each step, restrict candidates to the k highest-probability tokens, renormalize, then sample. Earlier work (Fan, Lewis, Dauphin, 2018) used this to improve novelty vs. beam search. In modern toolchains it’s often combined with temperature or nucleus sampling.

Practical notes: Typical top_k values are small (roughly 5–50); in modern stacks top_k is usually combined with temperature or top_p rather than used alone (see the sketch below).
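The same idea with a rank cutoff instead of a mass cutoff, again on toy values:

```python
# Top-k truncation: keep only the k most probable tokens, renormalize, then sample.
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    top = np.argsort(probs)[::-1][:k]        # indices of the k largest probabilities
    renormalized = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=renormalized))

rng = np.random.default_rng(0)
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(top_k_sample(probs, k=3, rng=rng))     # only indices 0, 1, 2 can be drawn
```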

5) Frequency penalty (frequency_penalty)

What it is: Decreases the probability of tokens in proportion to how often they have already appeared in the generated context, reducing verbatim repetition. The Azure/OpenAI reference specifies the range −2.0 to +2.0 and defines the effect precisely. Positive values reduce repetition; negative values encourage it.

When to use: Long generations where the model loops or echoes phrasing (e.g., bullet lists, poetry, code comments).
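One common way to realize this effect is count-proportional logit subtraction; the sketch below is illustrative and the exact vendor implementation may differ:

```python
# Count-proportional frequency penalty applied to logits before sampling:
# tokens that already appeared often are pushed down in proportion to their count.
import numpy as np
from collections import Counter

def apply_frequency_penalty(logits: np.ndarray, generated: list[int], alpha: float) -> np.ndarray:
    counts = Counter(generated)
    penalized = logits.copy()
    for token_id, count in counts.items():
        penalized[token_id] -= alpha * count   # repeated tokens lose more logit mass
    return penalized

logits = np.array([3.0, 2.5, 1.0, 0.5])
generated = [0, 0, 0, 1]                       # token 0 emitted three times, token 1 once
print(apply_frequency_penalty(logits, generated, alpha=0.8))
# token 0 drops by 2.4, token 1 by 0.8; unseen tokens are untouched
```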

6) Presence penalty (presence_penalty)

What it is: Penalizes tokens that have appeared at least once so far, encouraging the model to introduce new tokens/topics. The Azure/OpenAI reference documents the same −2.0 to +2.0 range. Positive values push toward novelty; negative values condense around seen topics.

Tuning heuristic: Start at 0; nudge presence_penalty upward if the model stays too “on-rails” and won’t explore alternatives.
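A sketch combining both penalty terms in the style of the documented adjustment (a count-scaled frequency term plus a flat presence term); exact vendor behavior may differ:

```python
# Presence penalty: a flat, one-time reduction for any token that has appeared
# at least once, combined with the count-scaled frequency term from above.
import numpy as np
from collections import Counter

def apply_penalties(logits: np.ndarray, generated: list[int],
                    freq_penalty: float, presence_penalty: float) -> np.ndarray:
    counts = Counter(generated)
    penalized = logits.copy()
    for token_id, count in counts.items():
        penalized[token_id] -= count * freq_penalty   # scales with repetition
        penalized[token_id] -= presence_penalty       # flat "has appeared" term
    return penalized

logits = np.array([3.0, 2.5, 1.0, 0.5])
print(apply_penalties(logits, [0, 0, 0, 1], freq_penalty=0.5, presence_penalty=0.6))
```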

7) Stop sequences (stop, stop_sequences)

What it is: Strings that force the decoder to halt exactly when they appear, without emitting the stop text. Useful for bounding structured outputs (e.g., end of JSON object or section). Many APIs allow multiple stop strings.

Design tips: Pick unambiguous delimiters unlikely to occur in normal text (e.g., "<|end|>", "\n\n###"), and pair with max_tokens as a belt-and-suspenders control.
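A client-side sketch of the halt-and-trim behavior, useful when post-processing raw or streamed text; the delimiter and sample string below are made up:

```python
# Stop-sequence handling: halt at the earliest stop string and drop the stop
# text itself, with max_tokens as the backstop if no stop string ever appears.
def truncate_at_stop(text: str, stop_sequences: list[str]) -> tuple[str, bool]:
    """Return (text up to the earliest stop sequence, whether a stop was hit)."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut], cut < len(text)

generated = '{"name": "nucleus", "p": 0.95}\n\n### trailing commentary'
body, stopped = truncate_at_stop(generated, ["\n\n###"])
print(body)     # just the JSON object
print(stopped)  # True: the delimiter was found and removed
```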

Interactions that matter

Temperature widens the probability tail that top-p/top-k then crop, so tune one source of randomness at a time. Frequency and presence penalties counteract degeneration during long generations. Stop sequences plus max_tokens give deterministic bounds on where, and for how long, a generation can run.
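A single toy decoding step that chains the controls in one plausible order (penalties, temperature, top-k, then top-p); real inference engines may order or implement these differently:

```python
# One toy decoding step combining the controls discussed above.
import numpy as np
from collections import Counter

def decode_step(logits, generated, T=0.8, k=50, p=0.95,
                freq_penalty=0.3, presence_penalty=0.2, rng=None):
    rng = rng or np.random.default_rng()
    adjusted = logits.astype(float).copy()
    for token_id, count in Counter(generated).items():
        adjusted[token_id] -= count * freq_penalty + presence_penalty  # repetition penalties
    adjusted /= T                                      # temperature scaling
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:k]                # top-k crop
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]                           # top-p crop of the survivors
    final = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=final))

logits = np.array([2.0, 1.8, 1.0, 0.3, -0.5, -1.0])
print(decode_step(logits, generated=[0, 0, 1], rng=np.random.default_rng(0)))
```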


References:

Fan, A., Lewis, M., and Dauphin, Y. (2018). Hierarchical Neural Story Generation. arXiv:1805.04833.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2019). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.

The post 7 LLM Generation Parameters—What They Do and How to Tune Them? appeared first on MarkTechPost.
