# Qwen3

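Visit our Hugging Face or ModelScope organization (click the links above), search for checkpoints with names starting with `Qwen3-`, or visit the [Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f), and you will find everything you need. Enjoy!

To learn more about Qwen3, feel free to read our documentation \[[EN](https://qwen.readthedocs.io/en/latest/)|[ZH](https://qwen.readthedocs.io/zh-cn/latest/)\]. Our documentation consists of the following sections:

- Quickstart: basic usage and demonstrations;
- Inference: guidance for inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: instructions for running LLMs locally on CPU and GPU with frameworks such as llama.cpp and Ollama;
- Deployment: demonstrations of how to deploy Qwen for large-scale inference with frameworks such as SGLang, vLLM, and TGI;
- Quantization: the practice of quantizing LLMs with GPTQ and AWQ, as well as guidance on how to make high-quality quantized GGUF files;
- Training: instructions for post-training, including SFT and RLHF (TODO) with frameworks such as Axolotl and LLaMA-Factory;
- Framework: usage of Qwen with application frameworks, e.g., RAG, Agent, etc.

## Introduction

### Qwen3-2507

Over the past three months, we have continued to explore the potential of the Qwen3 family, and we are excited to introduce the updated **Qwen3-2507** in two variants, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, and three sizes, 235B-A22B, 30B-A3B, and 4B.

**Qwen3-Instruct-2507** is the updated version of the previous Qwen3 non-thinking mode, featuring the following key enhancements:

- **Significant improvements in general capabilities**, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains in long-tail knowledge coverage** across multiple languages.
- **Markedly better alignment with user preferences in subjective and open-ended tasks**, enabling more helpful responses and higher-quality text generation.
- **Enhanced capabilities in 256K-token long-context understanding**, extendable to **1 million tokens**.

**Qwen3-Thinking-2507** is the continuation of the Qwen3 thinking model, with improved quality and depth of reasoning, featuring the following key enhancements:

- **Significantly improved performance on reasoning tasks**, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving **state-of-the-art results among open-source thinking models**.
- **Markedly better general capabilities**, such as instruction following, tool usage, text generation, and alignment with human preferences.
- **Enhanced 256K long-context understanding** capabilities, extendable to **1 million tokens**.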

**Previous Qwen3 Release**

### Qwen3 (aka Qwen3-2504)

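We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, improving on our experience in building QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.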

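The highlights of Qwen3 include:

- **Dense and Mixture-of-Experts (MoE) models of various sizes**, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat), ensuring optimal performance across various scenarios.
- **Significantly enhanced reasoning capabilities**, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support for 100+ languages and dialects**, with strong capabilities for multilingual instruction following and translation.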

## News

- 2025.08.08: You can now use Qwen3-2507 to handle ultra-long inputs of **1 million tokens**! See the updated model cards ([235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507), [235B-A22B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507), [30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507), [30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507)) for how to enable this feature.
- 2025.08.06: The final open release of Qwen3-2507, [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), is out!
- 2025.07.31: Qwen3-30B-A3B-Thinking-2507 is released. Check out the [model card](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) for more details!
- 2025.07.30: Qwen3-30B-A3B-Instruct-2507 is released. Check out the [model card](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) for more details!
- 2025.07.25: We released the updated version of Qwen3-235B-A22B thinking mode, named Qwen3-235B-A22B-Thinking-2507. Check out the [model card](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) for more details!
- 2025.07.21: We released the updated version of Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring significant enhancements over the previous version and supporting 256K-token long-context understanding. Check out our [model card](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for more details!
- 2025.04.29: We released the Qwen3 series. Check out our [blog](https://qwenlm.github.io/blog/qwen3) for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B, for more possibilities. Check out our [blog](https://qwenlm.github.io/blog/qwen2.5) for more!
- 2024.06.06: We released the Qwen2 series. Check out our [blog](https://qwenlm.github.io/blog/qwen2/)!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! For now, only HF transformers and vLLM support the model. We will soon add support for llama.cpp, mlx-lm, etc. Check out our [blog](https://qwenlm.github.io/blog/qwen-moe/) for more information!
- 2024.02.05: We released the Qwen1.5 series.

## Performance

Detailed evaluation results are reported in this [📑 blog (Qwen3-2504)](https://qwenlm.github.io/blog/qwen3/) and this 📑 blog (Qwen3-2507) \[coming soon\].

For requirements on GPU memory and the respective throughput, see the results [here](https://qwen.readthedocs.io/en/latest/getting_started/speed_benchmark.html).

## Run Qwen3

### 🤗 Transformers

Transformers is a library of pretrained natural language processing models for inference and training. The latest version of `transformers` is recommended, and `transformers>=4.51.0` is required.

#### Qwen3-Instruct-2507

The following code snippet shows how to use Qwen3-30B-A3B-Instruct-2507 to generate content from given inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

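> [!Note]
> Qwen3-Instruct-2507 supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

#### Qwen3-Thinking-2507

The following code snippet shows how to use Qwen3-30B-A3B-Thinking-2507 to generate content from given inputs.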

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex: find the last occurrence of 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

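> [!Note]
> Qwen3-Thinking-2507 supports only thinking mode.
> Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.
>
> Qwen3-Thinking-2507 also features an increased thinking length. We strongly recommend using it for highly complex reasoning tasks, with an adequate maximum generation length.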
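When only decoded text is available (for example, text returned by a server rather than token IDs), the same split can be done on the string. The sketch below is a minimal illustration: `split_thinking` is a hypothetical helper, and it assumes the closing `</think>` tag survives decoding as literal text.

```python
def split_thinking(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (thinking, answer).

    Mirrors the token-ID logic above: everything before the last closing
    tag is thinking content; if no tag is found, it is all answer.
    """
    head, sep, tail = text.rpartition(end_tag)
    if not sep:  # no closing tag: treat the whole text as the answer
        return "", text.strip("\n")
    # Drop an opening tag if one happens to be present.
    thinking = head.replace("<think>", "", 1).strip("\n")
    return thinking, tail.strip("\n")


thinking, answer = split_thinking("step 1\nstep 2\n</think>\n\nFinal answer.")
print(repr(thinking))
print(repr(answer))
```

Using the last occurrence of the tag keeps the behavior aligned with the `rindex` lookup in the snippet above.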

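**Switching Thinking/Non-thinking Modes for Previous Qwen3 Models**

By default, Qwen3 models think before responding. This can be controlled by:

- `enable_thinking=False`: passing `enable_thinking=False` to `tokenizer.apply_chat_template` strictly prevents the model from generating thinking content.
- `/think` and `/no_think` instructions: use these words in the system or user message to signal whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.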
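Both switches can be sketched in a few lines. In the example below, `render_prompt` and `with_soft_switch` are hypothetical helper names; `tokenizer` is assumed to be a tokenizer for one of the previous Qwen3 models, loaded as in the snippets above, and appending the instruction to the latest user message is one possible placement rather than a requirement.

```python
def render_prompt(tokenizer, messages, enable_thinking=True):
    """Hard switch: forward enable_thinking to the chat template."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )


def with_soft_switch(messages, think):
    """Soft switch: append /think or /no_think to the latest user message."""
    tag = "/think" if think else "/no_think"
    updated = [dict(m) for m in messages]  # copy so the caller's list is untouched
    for message in reversed(updated):
        if message["role"] == "user":
            message["content"] = f'{message["content"]} {tag}'
            break
    return updated


messages = [{"role": "user", "content": "How many r's are in strawberry?"}]
print(with_soft_switch(messages, think=False))
```

The hard switch is set once per rendered prompt, while the soft switch travels inside the conversation, so it can change from turn to turn.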

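### ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. ModelScope adopts a Python API similar to that of Transformers. The CLI tool `modelscope download` can help you solve issues with downloading checkpoints. For vLLM and SGLang, the environment variables `VLLM_USE_MODELSCOPE=true` and `SGLANG_USE_MODELSCOPE=true` can be used respectively.

### llama.cpp

[`llama.cpp`](https://github.com/ggml-org/llama.cpp) enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware. `llama.cpp>=b5401` is recommended for full support of Qwen3.

To use the CLI, run the following in a terminal: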

```shell
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```

To use the API server, run the following in a terminal:

```shell
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```
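Once the server is up, it can be queried from any HTTP client. The sketch below only builds the request with the standard library and does not send it. `build_chat_request` is a hypothetical helper; the base URL assumes the API is served under `/v1` on the port chosen above, the `/chat/completions` path follows the usual OpenAI-compatible convention, and the `model` value is a placeholder that a given server may ignore or require to match its own naming.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **params):
    """Build (but do not send) a POST request for a chat completion."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080/v1",
    model="qwen3-8b",  # placeholder name (assumption)
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,  # same values as the --temp / --top-p flags above
    top_p=0.95,
)
print(request.full_url)

# To actually send it while the server is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```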

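A simple web front end will be at `http://localhost:8080` and an OpenAI-compatible API will be at `http://localhost:8080/v1`.

For additional guides, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html).

> [!Note]
> llama.cpp adopts "rotating context management", in which infinite generation is made possible by evicting earlier tokens.
> It can be configured by parameters, and the commands above effectively disable it.
> For more details, please refer to [our documentation](https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html#llama-cli).

### Ollama

After [installing Ollama](https://ollama.com/), you can start the Ollama service with the following command (Ollama v0.9.0 or higher is recommended):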

```shell
ollama serve
# You need to keep this service running whenever you are using ollama
```

To pull a model checkpoint and run the model, use the `ollama run` command. You can specify a model size by adding a suffix to `qwen3`, such as `:8b` or `:30b-a3b`:

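```shell
ollama run qwen3:8b
# To set parameters, type "/set parameter num_ctx 40960" and "/set parameter num_predict 32768"
# To exit, type "/bye" and press ENTER
# For Qwen3-2504 models,
# - To enable thinking, which is the default, type "/set think"
# - To disable thinking, type "/set nothink"
```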
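The `num_ctx` and `num_predict` values above matter because a fixed-size context that keeps generating must eventually drop earlier tokens. The toy model below illustrates the effect with a bounded queue. It is an illustration only, not how Ollama or llama.cpp actually implement context management, and `generate_with_rotation` is a hypothetical name.

```python
from collections import deque


def generate_with_rotation(prompt_tokens, n_new, num_ctx):
    """Simulate generation in a window that evicts the oldest token when full."""
    window = deque(prompt_tokens, maxlen=num_ctx)
    for step in range(n_new):
        window.append(f"gen{step}")  # stand-in for a sampled token
    return list(window)


prompt = ["p0", "p1", "p2"]
# Generation fits in the window: the prompt is still visible.
print(generate_with_rotation(prompt, n_new=2, num_ctx=8))
# Generation overruns the window: the prompt has been evicted.
print(generate_with_rotation(prompt, n_new=10, num_ctx=4))
```

With a small window and unbounded generation, the prompt is the first thing to disappear, which is why a `num_ctx` large enough for prompt plus output is worth setting.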

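You can also access the Ollama service via its OpenAI-compatible API. Please note that you need to (1) keep `ollama serve` running while using the API, and (2) execute `ollama run qwen3:8b` before using this API to ensure that the model checkpoint is prepared. The API is at `http://localhost:11434/v1/` by default.

For additional details, please visit [ollama.ai](https://ollama.com/).

> [!Note]
> Ollama's naming may not be consistent with Qwen's original naming.
> For example, as of August 2025, `qwen3:30b-a3b` in Ollama points to `qwen3:30b-a3b-thinking-2507-q4_K_M`.
> Please check before use.

> [!Note]
> Ollama adopts the same "rotating context management" as llama.cpp.
> However, its default settings (`num_ctx` 2048 and `num_predict` -1) amount to infinite generation within a 2048-token context,
> which can cause trouble for Qwen3 models.
> We recommend setting `num_ctx` and `num_predict` properly.

### LMStudio

Qwen3 is supported by [lmstudio.ai](https://lmstudio.ai/). You can use LMStudio directly with our GGUF files.

### ExecuTorch

To export and run on ExecuTorch (iOS, Android, Mac, Linux, and more), please follow this [example](https://github.com/pytorch/executorch/blob/main/examples/models/qwen3/README.md).

### MNN

To export and run on MNN, which supports Qwen3 on mobile devices, please visit [Alibaba MNN](https://github.com/alibaba/MNN).

### MLX LM

If you are running on Apple Silicon, [`mlx-lm`](https://github.com/ml-explore/mlx-lm) also supports Qwen3 (`mlx-lm>=0.24.0`). Look for models ending with MLX on Hugging Face Hub.

### OpenVINO

If you are running on an Intel CPU or GPU, the [OpenVINO toolkit](https://github.com/openvinotoolkit) supports Qwen3. You can follow this [chatbot example](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/llm-chatbot/llm-chatbot.ipynb).

## Deploy Qwen3

Qwen3 is supported by multiple inference frameworks. Here we demonstrate the usage of `SGLang`, `vLLM`, and `TensorRT-LLM`. You can also find Qwen3 models from various inference providers, e.g., [Alibaba Cloud Model Studio](https://www.alibabacloud.com/en/product/modelstudio).

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models. SGLang can be used to launch a server with an OpenAI-compatible API service. `sglang>=0.4.6.post1` is required.

For Qwen3-Instruct-2507,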

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --port 30000 --context-length 262144
```

For Qwen3-Thinking-2507,

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --port 30000 --context-length 262144 --reasoning-parser deepseek-r1
```

For Qwen3, the command is:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --context-length 131072 --reasoning-parser qwen3
```

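An OpenAI-compatible API will be available at `http://localhost:30000/v1`.

> [!Note]
> Because SGLang preprocesses API requests and drops all `reasoning_content` fields, the quality of **multi-step tool use with Qwen3 thinking models** may be suboptimal, since it requires the related thinking content to be present. While the fixes are being worked on, as a workaround, we recommend passing the content as is, without extracting thinking content; the chat template will handle it correctly.

### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs. `vllm>=0.9.0` is recommended.

For Qwen3-Instruct-2507,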

```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --port 8000 --max-model-len 262144
```

For Qwen3-Thinking-2507,

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --port 8000 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

For Qwen3, the command is:

```shell
vllm serve Qwen/Qwen3-8B --port 8000 --max-model-len 131072 --enable-reasoning --reasoning-parser qwen3
```
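The note on dropped `reasoning_content` fields recommends sending assistant turns back with their thinking left in the content. If a client has already received the two parts separately, one way to approximate that is to fold them back together before appending the turn to the history. This is a sketch under assumptions: `to_history_message` is a hypothetical helper, the response message is assumed to carry `reasoning_content` and `content` keys, the `<think>...</think>` layout is an assumed serialization that should be checked against the model's chat template, and `get_weather` is a made-up tool name.

```python
def to_history_message(message):
    """Return an assistant turn whose content keeps the thinking inline."""
    reasoning = (message.get("reasoning_content") or "").strip("\n")
    content = message.get("content") or ""
    if reasoning:
        # Assumed layout: thinking wrapped in tags, followed by the answer.
        content = f"<think>\n{reasoning}\n</think>\n\n{content}"
    turn = {"role": "assistant", "content": content}
    if message.get("tool_calls"):  # keep tool calls so the next step can refer to them
        turn["tool_calls"] = message["tool_calls"]
    return turn


parsed = {
    "role": "assistant",
    "reasoning_content": "The user wants the weather, so call the tool.",
    "content": "",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}
print(to_history_message(parsed)["content"])
```

Simply not extracting the thinking content on the client side in the first place, as the note suggests, avoids the need for this step.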

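An OpenAI-compatible API will be available at `http://localhost:8000/v1`.

> [!Note]
> Because vLLM preprocesses API requests and drops all `reasoning_content` fields, the quality of **multi-step tool use with Qwen3 thinking models** may be suboptimal, since it requires the related thinking content to be present. While the fixes are being worked on, as a workaround, we recommend passing the content as is, without extracting thinking content; the chat template will handle it correctly.

### TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is an open-source LLM inference engine from NVIDIA, which provides optimizations for NVIDIA GPUs including custom attention kernels, quantization, and more. Qwen3 is supported in its re-architected [PyTorch backend](https://nvidia.github.io/TensorRT-LLM/torch.html). `tensorrt_llm>=0.20.0rc3` is recommended. Please refer to the [README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/qwen/README.md#qwen3) page for more details.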

```shell
trtllm-serve Qwen/Qwen3-8B --host localhost --port 8000 --backend pytorch
```

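An OpenAI-compatible API will be available at `http://localhost:8000/v1`.

### MindIE

For deployment on Ascend NPUs, please visit [Modelers](https://modelers.cn/) and search for Qwen3.

## Build with Qwen3

### Tool Use

For tool use capabilities, we recommend taking a look at [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent), which provides a wrapper around these APIs to support tool use or function calling, with MCP support. Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc. Follow the guides in our documentation to see how to enable the support.

### Finetuning

We advise you to use training frameworks, including [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), [UnSloth](https://github.com/unslothai/unsloth), [Swift](https://github.com/modelscope/swift), [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory), etc., to finetune your models with SFT, DPO, GRPO, etc.

## License Agreement

All our open-weight models are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories.

## Citation

If you find our work helpful, feel free to cite us.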

```bibtex
@article{qwen3,
    title   = {Qwen3 Technical Report},
    author  = {An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and Chujie Zheng and Dayiheng Liu and Fan Zhou and Fei Huang and Feng Hu and Hao Ge and Haoran Wei and Huan Lin and Jialong Tang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jing Zhou and Jingren Zhou and Junyang Lin and Kai Dang and Keqin Bao and Kexin Yang and Le Yu and Lianghao Deng and Mei Li and Mingfeng Xue and Mingze Li and Pei Zhang and Peng Wang and Qin Zhu and Rui Men and Ruize Gao and Shixuan Liu and Shuang Luo and Tianhao Li and Tianyi Tang and Wenbiao Yin and Xingzhang Ren and Xinyu Wang and Xinyu Zhang and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yinger Zhang and Yu Wan and Yuqiong Liu and Zekun Wang and Zeyu Cui and Zhenru Zhang and Zhipeng Zhou and Zihan Qiu},
    journal = {arXiv preprint arXiv:2505.09388},
    year    = {2025}
}

@article{qwen2.5,
    title   = {Qwen2.5 Technical Report},
    author  = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
    journal = {arXiv preprint arXiv:2412.15115},
    year    = {2024}
}

@article{qwen2,
    title   = {Qwen2 Technical Report},
    author  = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
    journal = {arXiv preprint arXiv:2407.10671},
    year    = {2024}
}
```

## Contact Us

If you are interested in leaving a message for either our research team or our product team, join our [Discord](https://discord.gg/z3GAxXZ9Ce) or [WeChat groups](assets/wechat.png)!

Original source: https://github.com/QwenLM/Qwen3