API deployment checklist | OpenAI API

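Contents	Expected impact
Use the Responses API	Quality, cost, latency, reliability
Set `reasoning.effort`	Quality, cost, latency
Set `text.verbosity`	Quality, cost, latency
Set the assistant `phase` parameter	Quality, cost
Use `tool_search`	Cost, latency
Leverage built-in tools	Quality
Leverage compaction	Cost
Use `prompt_cache_key`	Latency, cost
Use `reasoning.encrypted_content`	Quality, latency
Use `background=True`	Resumability
Use WebSocket mode	Latency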

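Use the Responses API

Always start with the Responses API. It is OpenAI's flagship API and the best surface for accessing the latest model behavior, built-in tools, stateful workflows, and agentic features.

Set `reasoning.effort`

Use reasoning.effort to control how much thinking the model does before it answers.

For gpt-5.5, the supported values are none, low, medium, high, and xhigh; the default is medium. Lower effort is faster and uses fewer reasoning tokens. Higher effort gives the model more room to plan, debug, synthesize, and weigh multi-step trade-offs. The right value depends on the task, not just the model.

Use low when the task is mostly extraction, routing, classification, or simple rewriting. Use medium or high when the model needs to diagnose a problem, compare options, make a plan, or reason about code. Reserve xhigh for cases where your evals show the extra latency is worth it.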
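The guidance above can be turned into a small lookup so that every call site picks its effort the same way. This is a minimal sketch: the task category names and the split between medium and high within the "needs reasoning" group are assumptions made for illustration, not part of the API.

```python
# Hypothetical task categories mapped to reasoning.effort values.
# The low group follows the text; the medium/high split is an assumption.
EFFORT_BY_TASK = {
    "extraction": "low",
    "routing": "low",
    "classification": "low",
    "rewrite": "low",
    "comparison": "medium",
    "planning": "medium",
    "diagnosis": "high",
    "code_reasoning": "high",
}


def pick_reasoning_effort(task_type: str, default: str = "medium") -> str:
    """Return a reasoning.effort value for a task category.

    Unknown categories fall back to the documented default, medium.
    xhigh is deliberately absent: opt into it only after evals justify it.
    """
    return EFFORT_BY_TASK.get(task_type, default)


def reasoning_param(task_type: str) -> dict:
    """Build the value to pass as `reasoning=` in responses.create()."""
    return {"effort": pick_reasoning_effort(task_type)}
```

A call site would then pass `reasoning=reasoning_param("diagnosis")` to `client.responses.create(...)`.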

Match reasoning effort to the task

javascript

import OpenAI from "openai";

const openai = new OpenAI();

const prompt = [
  "Our CI job started failing after a dependency bump.",
  "",
  "Error:",
  "TypeError: Timeout.__init__() got an unexpected keyword argument 'connect'",
  "",
  "Identify the likeliest root cause and the smallest safe fix.",
].join("\n");

const response = await openai.responses.create({
  model: "gpt-5.5",
  reasoning: { effort: "high" },
  input: prompt,
});

console.log(response.output_text);

python

from openai import OpenAI

client = OpenAI()

prompt = """
Our CI job started failing after a dependency bump.

Error:
TypeError: Timeout.__init__() got an unexpected keyword argument 'connect'

Identify the likeliest root cause and the smallest safe fix.
"""

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    input=prompt,
)

print(response.output_text)

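Set `text.verbosity`

text.verbosity is the main control for balancing concision against completeness. Use lower verbosity when the product needs fast, short answers; use higher verbosity when answers need fuller explanation, clearer structure, or complete context. Lower verbosity means fewer output tokens, so the model generates less and returns output sooner.

For coding, medium and high tend to produce longer, more organized, and more clearly structured output. low makes answers more compact.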

Set low verbosity for concise output

javascript

import OpenAI from "openai";

const openai = new OpenAI();

const incident = [
  "Summarize this incident for the next on-call engineer.",
  "- checkout latency spiked from 220 ms to 4.8 s",
  "- only us-east-1 was affected",
  "- rollback is complete",
  "- likely trigger: cache stampede after deploy",
].join("\n");

const response = await openai.responses.create({
  model: "gpt-5.5",
  text: { verbosity: "low" },
  input: incident,
});

console.log(response.output_text);

python

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    text={"verbosity": "low"},
    input="""
    Summarize this incident for the next on-call engineer.
    - checkout latency spiked from 220 ms to 4.8 s
    - only us-east-1 was affected
    - rollback is complete
    - likely trigger: cache stampede after deploy
    """,
)

print(response.output_text)

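Set the assistant `phase` parameter

phase is a label on assistant messages in the conversation history. It tells the model whether an earlier assistant message was intermediate working commentary or the final answer. Use phase: "commentary" for progress updates, notes before tool calls, and other intermediate messages. Use phase: "final_answer" for the completed response.

The assistant might say something like:

Assistant commentary message
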
{
  "role": "assistant",
  "phase": "commentary",
  "content": "I'm checking the logs and comparing them to the last successful deploy."
}

This is a progress note, not the final answer. Later, the assistant might say:

Assistant final answer message

{
  "role": "assistant",
  "phase": "final_answer",
  "content": "The deploy failed because the migration referenced a column that does not exist in production."
}

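This is useful in long-running or tool-heavy workflows, where the assistant may produce visible progress updates before it finishes. When you send that history back to the model, keep the phase on each assistant message so the model can tell progress updates from final results.

Preserve and resend phase on assistant messages in follow-up requests for gpt-5.3-codex and newer models. This helps prevent early stopping, so the agent keeps running until it reaches a final answer.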
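One way to avoid dropping phase by accident is to rebuild the input array through a single helper. The sketch below assumes your application stores history as plain dicts shaped like the JSON messages above; the function names are hypothetical.

```python
def to_history_item(message: dict) -> dict:
    """Copy a stored message for replay, keeping `phase` when it is present."""
    item = {"role": message["role"], "content": message["content"]}
    if message.get("phase") is not None:
        # Preserve the label so the model can tell commentary from final answers.
        item["phase"] = message["phase"]
    return item


def build_input(history: list, next_user_text: str) -> list:
    """Return the input array for the next request: prior turns plus the new user turn."""
    items = [to_history_item(message) for message in history]
    items.append({"role": "user", "content": next_user_text})
    return items
```

The returned list can be passed as `input=` to `client.responses.create(...)`. User messages carry no phase, so the helper leaves them untouched.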

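Use `tool_search`

Instead of loading the full tool catalog on every request, add {"type": "tool_search"} and mark expensive tool definitions with defer_loading: true. The model can then load just the subset it needs at runtime. At the start of the request, the model sees only the search tool's name and description. If the model decides it needs a deferred tool, it runs a tool search, and only then are the deferred definitions loaded into context. After that, the model can call them. This saves tokens and preserves cache performance.

There are two modes:

Hosted tool search is the simpler option. Use it when you already know which tools a request might need.
Client-executed tool search is for cases where your application must decide which tools are available itself, for example based on the user's tenant, project, permissions, or an internal registry.

Start with hosted tool search unless your application really needs to control tool discovery itself.

Group tools by user intent. Use namespaces or MCP servers where possible. The model finds it easier to choose among a few clear groups than from a long flat list of functions. We recommend keeping each namespace to about 10 functions or fewer for the best token efficiency and model performance.

Keep namespace descriptions short and distinctive. Put detailed instructions in the deferred tool definitions. Avoid one giant namespace for everything.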
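The grouping advice can be checked mechanically before a tool catalog ships. The sketch below assumes, for illustration only, that a tool's namespace is the part of its name before the first dot (as in billing.lookup_invoice); the helper names are hypothetical and no API call is made.

```python
from collections import defaultdict

MAX_FUNCTIONS_PER_NAMESPACE = 10  # guideline from the text above


def group_by_namespace(tools: list) -> dict:
    """Group tool names by the prefix before the first dot (assumed convention)."""
    groups = defaultdict(list)
    for tool in tools:
        namespace, _, _ = tool["name"].partition(".")
        groups[namespace].append(tool["name"])
    return dict(groups)


def oversized_namespaces(tools: list, limit: int = MAX_FUNCTIONS_PER_NAMESPACE) -> list:
    """Return the namespaces that hold more functions than the guideline allows."""
    groups = group_by_namespace(tools)
    return sorted(name for name, members in groups.items() if len(members) > limit)


def defer_all(tools: list) -> list:
    """Return copies of the tool definitions with defer_loading switched on."""
    return [{**tool, "defer_loading": True} for tool in tools]
```

Running `oversized_namespaces(catalog)` in a unit test gives an early warning when one group grows into the "giant namespace" the text warns against.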

Use hosted tool search with deferred tools

javascript

import OpenAI from "openai";

const openai = new OpenAI();

const billingLookupInvoice = {
  type: "function",
  name: "billing.lookup_invoice",
  description: "Look up invoice state, taxes, credits, and payment attempts.",
  parameters: {
    type: "object",
    properties: {
      invoice_id: { type: "string" },
    },
    required: ["invoice_id"],
    additionalProperties: false,
  },
  strict: true,
  defer_loading: true,
};

const crmGetAccount = {
  type: "function",
  name: "crm.get_account",
  description: "Fetch account owner, plan, health, and payment history.",
  parameters: {
    type: "object",
    properties: {
      account_id: { type: "string" },
    },
    required: ["account_id"],
    additionalProperties: false,
  },
  strict: true,
  defer_loading: true,
};

const response = await openai.responses.create({
  model: "gpt-5.5",
  input:
    "Find the right billing tool and explain why invoice INV-1043 still " +
    "shows overdue after a payment yesterday.",
  tools: [
    { type: "tool_search" },
    billingLookupInvoice,
    crmGetAccount,
  ],
});

console.log(response.output_text);

python

from openai import OpenAI

client = OpenAI()

billing_lookup_invoice = {
    "type": "function",
    "name": "billing.lookup_invoice",
    "description": "Look up invoice state, taxes, credits, and payment attempts.",
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
        },
        "required": ["invoice_id"],
        "additionalProperties": False,
    },
    "strict": True,
    "defer_loading": True,
}

crm_get_account = {
    "type": "function",
    "name": "crm.get_account",
    "description": "Fetch account owner, plan, health, and payment history.",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string"},
        },
        "required": ["account_id"],
        "additionalProperties": False,
    },
    "strict": True,
    "defer_loading": True,
}

response = client.responses.create(
    model="gpt-5.5",
    input=(
        "Find the right billing tool and explain why invoice INV-1043 still "
        "shows overdue after a payment yesterday."
    ),
    tools=[
        {"type": "tool_search"},
        billing_lookup_invoice,
        crm_get_account,
    ],
)

print(response.output_text)

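Leverage built-in tools

Built-in tools are native to the API. Instead of building every tool yourself, you can give the model direct access to tools that already work well in the Responses API. The model decides on its own when to use them.

OpenAI keeps adding native tools, so prefer built-in tools whenever they fit your workflow. Build custom tools when a native tool cannot meet the task's needs. The current built-in tools and related tool options include:

Web search: search the web for up-to-date information
File search: search uploaded files or vector stores
Code interpreter: run Python for data analysis, math, charting, and file processing
Shell: run shell commands in a hosted container or your own runtime
Computer use: operate a UI through screenshots, clicks, typing, and scrolling
Image generation: generate or edit images
MCP/connectors: connect the model to external services and tools
Skills: attach reusable instruction bundles and workflow files
Apply patch: make structured code edits

There is also a model-quality reason to prefer built-in tools. Built-in tools are in-distribution for our post-training, which means the model is trained and evaluated around the shape, behavior, and output of these tools. With built-in tools, OpenAI models select tools better, execute them more effectively, and fail less often than with brand-new tools.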

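Leverage compaction

Compaction is a context-engineering tool: it determines what the model carries forward across turns. In a long-running agent, the question is not just "will we hit the context limit?" but also whether old messages, tool logs, retries, and stale details are crowding out the state the model actually needs.

Compaction gives you a controlled way to shrink the context while keeping the state that later turns need. After a meaningful milestone, such as finishing a debugging phase or narrowing down a root cause, you can compact the earlier window and continue from the compacted output. This keeps the model sharp, because the next turn is built around the state that matters rather than every intermediate reasoning step, failed command, and abandoned line of reasoning.

There are two ways to use compaction:

Let the server handle it: if you use previous_response_id, turn on context_management with a compact_threshold. The server compacts automatically when the conversation grows too large. You just keep sending the latest user message.
Handle it yourself: if you manage the full input array yourself, call client.responses.compact(). It returns a smaller context window. Use the returned output directly in your next responses.create() call.

Do not edit the compacted output. It is not a human-readable summary; it is machine state that helps the model continue. Pass it through unchanged, then append the next user message.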
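For the server-managed option, a request might be assembled as in the sketch below. The text names context_management and compact_threshold but does not show their exact shape, so the list-of-entries form with a "compaction" type used here is an assumption to verify against the API reference; the helper name and the threshold value are also hypothetical.

```python
def server_compaction_kwargs(
    previous_response_id: str,
    user_text: str,
    compact_threshold: int,
    model: str = "gpt-5.5",
) -> dict:
    """Build keyword arguments for a follow-up request with server-side compaction."""
    if compact_threshold <= 0:
        raise ValueError("compact_threshold must be a positive token count")
    return {
        "model": model,
        "previous_response_id": previous_response_id,
        # Only the newest user message is sent; the server holds earlier state.
        "input": [{"role": "user", "content": user_text}],
        # ASSUMPTION: entry shape is not confirmed by the text above.
        "context_management": [
            {"type": "compaction", "compact_threshold": compact_threshold}
        ],
    }
```

The result would be passed as `client.responses.create(**kwargs)`. Keeping the request construction in one function makes it easy to adjust if the real parameter shape differs.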

Continue from compacted response state

javascript

import OpenAI from "openai";

const openai = new OpenAI();

// Full window collected from a long debugging session:
// user messages, assistant outputs, tool calls, and tool outputs.
const longWindow = sessionItems;

const compacted = await openai.responses.compact({
  model: "gpt-5.5",
  input: longWindow,
});

const nextResponse = await openai.responses.create({
  model: "gpt-5.5",
  store: false,
  input: [
    ...compacted.output, // Use compact output as-is.
    {
      type: "message",
      role: "user",
      content:
        "We found the bad cache invalidation path. Write the fix plan " +
        "and the verification checklist.",
    },
  ],
});

console.log(nextResponse.output_text);

python

from openai import OpenAI

client = OpenAI()

# Full window collected from a long debugging session:
# user messages, assistant outputs, tool calls, and tool outputs.
long_window = session_items

compacted = client.responses.compact(
    model="gpt-5.5",
    input=long_window,
)

next_response = client.responses.create(
    model="gpt-5.5",
    store=False,
    input=[
        *compacted.output,  # Use compact output as-is.
        {
            "type": "message",
            "role": "user",
            "content": (
                "We found the bad cache invalidation path. Write the fix plan "
                "and the verification checklist."
            ),
        },
    ],
)

print(next_response.output_text)

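Use `prompt_cache_key`

Prompt caching automatically reduces latency and cost when requests reuse the same long prefix. For high-volume workflows, set prompt_cache_key consistently across requests that share the same stable prefix.

The cache key is combined with the prompt prefix hash, so it helps route similar requests to the same cache without changing the model input. Keep the key stable for prefixes that are truly shared, and choose a granularity that avoids sending too much traffic to a single prefix-key pair. If one combination of prefix and prompt_cache_key exceeds roughly 15 requests per minute, requests may overflow to other machines, which reduces cache effectiveness.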
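One possible way to pick a granularity is to split a hot key into a fixed number of shards once expected traffic passes the rate mentioned above. This is a sketch of that idea, not a documented technique: the sharding scheme, the helper names, and the use of a per-user routing ID are all assumptions.

```python
import hashlib
import math

MAX_RPM_PER_KEY = 15  # approximate per prefix-and-key rate from the text above


def shard_count(expected_rpm: float, max_rpm_per_key: int = MAX_RPM_PER_KEY) -> int:
    """Number of key shards needed to keep each one near the per-key rate."""
    return max(1, math.ceil(expected_rpm / max_rpm_per_key))


def prompt_cache_key(base_key: str, routing_id: str, expected_rpm: float) -> str:
    """Return a stable cache key, sharded only when traffic is high.

    routing_id (for example a user or session ID) always maps to the same
    shard, so repeat requests from one caller keep hitting the same cache.
    """
    shards = shard_count(expected_rpm)
    if shards == 1:
        return base_key
    digest = hashlib.sha256(routing_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shards
    return f"{base_key}-{shard}"
```

A SHA-256 digest is used instead of Python's built-in hash() because the built-in is randomized per process, which would scatter one caller across shards after every restart.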

Route related requests to the same prompt cache

javascript

import OpenAI from "openai";

const openai = new OpenAI();

const instructions = [
  "You are the support agent for Acme.",
  "Follow the Acme support policy and escalation rubric.",
  "Use the same tone, safety rules, and tool plan for each ticket.",
].join("\n");

const response = await openai.responses.create({
  model: "gpt-5.5",
  prompt_cache_key: "tenant-acme-support-agent",
  instructions,
  input: "Summarize the current escalation for the on-call lead.",
});

console.log(response.output_text);

python

from openai import OpenAI

client = OpenAI()

instructions = """
You are the support agent for Acme.
Follow the Acme support policy and escalation rubric.
Use the same tone, safety rules, and tool plan for each ticket.
"""

response = client.responses.create(
    model="gpt-5.5",
    prompt_cache_key="tenant-acme-support-agent",
    instructions=instructions,
    input="Summarize the current escalation for the on-call lead.",
)

print(response.output_text)

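Use `reasoning.encrypted_content`

Always round-trip reasoning items. This helps the model because it can continue from its earlier reasoning. reasoning.encrypted_content matters most when your zero data retention (ZDR) compliance requirements do not allow response data to be stored: it gives you a stateless handoff.

Add reasoning.encrypted_content to include, and the reasoning items in the response output will carry encrypted reasoning content that you can pass back unchanged in the next request. Your application does not need to understand the value; it only needs to keep the reasoning items intact and send them back on the next turn so the model can use them to continue the workflow.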

Pass encrypted reasoning content between stateless turns

javascript

import OpenAI from "openai";

const openai = new OpenAI();

const first = await openai.responses.create({
  model: "gpt-5.5",
  store: false,
  reasoning: { effort: "medium" },
  include: ["reasoning.encrypted_content"],
  input: "Investigate why invoice INV-1043 has mismatched tax totals.",
});

const second = await openai.responses.create({
  model: "gpt-5.5",
  store: false,
  reasoning: { effort: "medium" },
  include: ["reasoning.encrypted_content"],
  input: [
    ...first.output,
    {
      role: "user",
      content: "Now write the customer-facing explanation in plain English.",
    },
  ],
});

console.log(second.output_text);

python

from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-5.5",
    store=False,
    reasoning={"effort": "medium"},
    include=["reasoning.encrypted_content"],
    input="Investigate why invoice INV-1043 has mismatched tax totals.",
)

second = client.responses.create(
    model="gpt-5.5",
    store=False,
    reasoning={"effort": "medium"},
    include=["reasoning.encrypted_content"],
    input=[
        *first.output,
        {
            "role": "user",
            "content": "Now write the customer-facing explanation in plain English.",
        },
    ],
)

print(second.output_text)

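Use `background=True`

Use background=True for requests that may take a long time. Instead of holding a client connection open, the API starts a job and returns an ID. Your application can poll the job until it completes, fails, or is cancelled. This suits large analyses, long tool runs, and tasks that need status tracking and retries.

background=True requires store=True.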

Run and poll a background response

javascript

import OpenAI from "openai";

const openai = new OpenAI();

let job = await openai.responses.create({
  model: "gpt-5.5",
  background: true,
  store: true,
  input: "Analyze this large log bundle and cluster the primary failure modes.",
  tools: [
    {
      type: "code_interpreter",
      container: {
        type: "auto",
        file_ids: [logBundleFileId],
      },
    },
  ],
});

while (["queued", "in_progress"].includes(job.status)) {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  job = await openai.responses.retrieve(job.id);
}

console.log(job.output_text);

python

from openai import OpenAI
import time

client = OpenAI()

job = client.responses.create(
    model="gpt-5.5",
    background=True,
    store=True,
    input="Analyze this large log bundle and cluster the primary failure modes.",
    tools=[
        {
            "type": "code_interpreter",
            "container": {
                "type": "auto",
                "file_ids": [log_bundle_file_id],
            },
        }
    ],
)

while job.status in {"queued", "in_progress"}:
    time.sleep(2)
    job = client.responses.retrieve(job.id)

print(job.output_text)

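Background mode can be combined with stream=True to receive progress events, but the time to first event may be longer than for a normal request.

From a UI perspective, background mode says: "This is running; here is the current status; the result will appear here when it is ready."

Note: background=True is not compatible with zero data retention.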
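In production, a polling loop usually needs a timeout and explicit handling of jobs that end without completing. The sketch below wraps the loop in a helper that takes the retrieve function as an argument, so it can be exercised with a fake. The helper name, the default interval and timeout, and the choice of exceptions are assumptions.

```python
import time

PENDING_STATUSES = {"queued", "in_progress"}


def wait_for_response(
    retrieve,
    response_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll retrieve(response_id) until the job leaves the pending states.

    Returns the finished job when its status is "completed". Raises
    TimeoutError if it is still pending after timeout_s, and RuntimeError
    for any other terminal status (for example failed or cancelled).
    """
    deadline = clock() + timeout_s
    job = retrieve(response_id)
    while job.status in PENDING_STATUSES:
        if clock() >= deadline:
            raise TimeoutError(
                f"response {response_id} still {job.status} after {timeout_s}s"
            )
        sleep(interval_s)
        job = retrieve(response_id)
    if job.status != "completed":
        raise RuntimeError(f"response {response_id} ended with status {job.status!r}")
    return job
```

With the SDK this would be called as `wait_for_response(client.responses.retrieve, job.id)`. Injecting sleep and clock keeps tests fast and deterministic.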

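Use WebSocket mode

WebSocket mode is built for long-running, tool-heavy workflows. In these workflows you keep a persistent connection open and send only the new input items together with previous_response_id. For rollouts with 20 or more tool calls, this is roughly 40% faster end to end.

How it works: the first message looks like a normal Responses request: model, instructions, tools, and the user input. The server streams events back. If the model asks for a tool, your application runs it. Then, instead of sending a new HTTP request, you send another response.create event on the same socket with the previous_response_id and the new items. That is where the latency advantage comes from. Over plain HTTP, every follow-up is a new request. In WebSocket mode, the connection stays open and the most recent response state stays live in that connection's memory. When the next turn continues from that response, the backend has less setup work to do.

If your workflow is a single question and answer, stay on HTTP. If it behaves like a long-running agent, try WebSocket mode.

A single WebSocket connection handles only one in-flight response at a time, so parallel work needs multiple connections. Connections are currently capped at 60 minutes. Follow-ups use the same previous_response_id semantics as HTTP mode, backed by a connection-local cache of the most recent response.

Note: WebSocket mode supports ZDR because your data is held only in memory and is not written to disk.

The Python example uses websocket-client (pip install websocket-client). The JavaScript example uses ws (npm install ws).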

Start a Responses API WebSocket session

javascript

import OpenAI from "openai";
import WebSocket from "ws";

const openai = new OpenAI();

const ws = new WebSocket("wss://api.openai.com/v1/responses", {
  headers: {
    Authorization: "Bearer " + openai.apiKey,
  },
});

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "response.create",
      model: "gpt-5.5",
      store: false,
      input: [
        {
          type: "message",
          role: "user",
          content: [
            {
              type: "input_text",
              text:
                "Find the flaky test in this run, call the tools you need, " +
                "and keep going until you can explain the root cause.",
            },
          ],
        },
      ],
      tools: [testLogTool, codeSearchTool],
    })
  );
});

ws.on("message", (data) => {
  const firstEvent = JSON.parse(data.toString());
  console.log(firstEvent.type);
});

python

from openai import OpenAI
from websocket import create_connection
import json

client = OpenAI()

ws = create_connection(
    "wss://api.openai.com/v1/responses",
    header=[f"Authorization: Bearer {client.api_key}"],
)

# Same request body you would send to client.responses.create(...).
ws.send(
    json.dumps(
        {
            "type": "response.create",
            "model": "gpt-5.5",
            "store": False,
            "input": [
                {
                    "type": "message",
                    "role": "user",
                    "content": [
                        {
                            "type": "input_text",
                            "text": (
                                "Find the flaky test in this run, call the tools "
                                "you need, and keep going until you can explain "
                                "the root cause."
                            ),
                        }
                    ],
                }
            ],
            "tools": [test_log_tool, code_search_tool],
        }
    )
)

first_event = json.loads(ws.recv())
print(first_event["type"])
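The session examples stop after the first event. The follow-up step described earlier, sending another response.create on the same socket with previous_response_id and the new items, could be sketched as below. The function_call_output item shape follows the Responses API's function-calling format; which top-level fields must be repeated on a follow-up event is not stated in the text, so repeating model and store here is an assumption, and the helper name is hypothetical.

```python
import json


def follow_up_event(previous_response_id: str, tool_outputs: list, model: str = "gpt-5.5") -> str:
    """Serialize a follow-up response.create event carrying tool results.

    tool_outputs is a list of (call_id, output) pairs, one per tool call
    the model requested in the previous response.
    """
    return json.dumps(
        {
            "type": "response.create",
            "model": model,  # ASSUMPTION: repeated from the first message
            "store": False,  # ASSUMPTION: repeated from the first message
            "previous_response_id": previous_response_id,
            # Only the new items are sent; earlier state stays on the connection.
            "input": [
                {"type": "function_call_output", "call_id": call_id, "output": output}
                for call_id, output in tool_outputs
            ],
        }
    )
```

After running the requested tools, the application would call `ws.send(follow_up_event(response_id, results))` on the socket that is already open.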

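Final takeaways

The Responses API is the foundation for building smarter, more capable OpenAI applications. Its real advantage is that it lets developers move from one-off prompts to durable, tool-using, context-aware workflows that can adapt to complex tasks. Follow this guide and you will see better performance in real deployments.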
