llama.cpp

文章目录

- 一、关于 llama.cpp
- - 支持的模型：
  - Multimodal models:
  - Bindings:
  - UI:
  - Tools:
- 二、Demo
- - 1、Typical run using LLaMA v2 13B on M2 Ultra
  - 2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook
- 三、用法
- - 1、基本用法
  - 2、对话模式
  - 3、网络服务
  - 4、交互模式
  - 5、持久互动
  - 6、语法约束输出
- 四、构建
- 五、支持的后端
- 六、工具
- - 1、准备和量化
  - 2、困惑（测量模型质量）
- 七、其他文件

一、关于 llama.cpp

github ： https://github.com/ggerganov/llama.cpp
Roadmap / Project status / Manifesto / ggml

llama.cpp的主要目标是使LLM推理具有最少的设置和最先进的性能，在各种硬件–本地和云端。

没有任何依赖关系的普通C/C++实现
Apple silicon 是一等公民 – 通过ARM NEON、Accelerate和Metal框架进行优化
AVX、AVX2和AVX512支持x86架构
1.5位、2位、3位、4位、5位、6位和8位整数量化，用于更快的推理和减少内存使用
用于在NVIDIA GPU上运行LLM的自定义CUDA内核（通过HIP支持AMD GPU）
Vulkan和SYCL后端支持
CPU+GPU 混合推断部分加速模型大于总VRAM容量

自启动以来，由于许多 contributions，该项目有了显著改善。
它是为ggml库开发新功能的主要场所。

支持的模型：

通常也支持下面基本模型的细调。

LLaMA 🦙
LLaMA 2 🦙🦙
LLaMA 3 🦙🦙🦙
Mistral 7B
Mixtral MoE
DBRX
Falcon
Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
Vigogne (French)
BERT
Koala
Baichuan 1 & 2 + derivations
Aquila 1 & 2
Starcoder models
Refact
MPT
Bloom
Yi models
StableLM models
Deepseek models
Qwen models
PLaMo-13B
Phi models
GPT-2
Orion 14B
InternLM2
CodeShell
Gemma
Mamba
Grok-1
Xverse
Command-R models
SEA-LION
GritLM-7B + GritLM-8x7B
OLMo
GPT-NeoX + Pythia
ChatGLM3-6b + ChatGLM4-9b

(instructions for supporting more models: HOWTO-add-model.md)

Multimodal models:

LLaVA 1.5 models, LLaVA 1.6 models
BakLLaVA
Obsidian
ShareGPT4V
MobileVLM 1.7B/3B models
Yi-VL
Mini CPM
Moondream
Bunny

Bindings:

Python: abetlen/llama-cpp-python
Go: go-skynet/go-llama.cpp
Node.js: withcatai/node-llama-cpp
JS/TS (llama.cpp server client): lgrammel/modelfusion
JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
Typescript/Wasm (nicer API, available on npm): ngxson/wllama
Ruby: yoshoku/llama_cpp.rb
Rust (more features): edgenai/llama_cpp-rs
Rust (nicer API): mdrokz/rust-llama.cpp
Rust (more direct bindings): utilityai/llama-cpp-rs
C#/.NET: SciSharp/LLamaSharp
Scala 3: donderom/llm4s
Clojure: phronmophobic/llama.clj
React Native: mybigday/llama.rn
Java: kherud/java-llama.cpp
Zig: deins/llama.cpp.zig
Flutter/Dart: netdur/llama_cpp_dart
PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
Guile Scheme: guile_llama_cpp

UI:

除非另有说明，否则这些项目是具有许可的开源项目：

iohub/collama
janhq/jan (AGPL)
nat/openplayground
Faraday (proprietary)
LMStudio (proprietary)
Layla (proprietary)
LocalAI (MIT)
LostRuins/koboldcpp (AGPL)
Mozilla-Ocho/llamafile
nomic-ai/gpt4all
ollama/ollama
oobabooga/text-generation-webui (AGPL)
psugihara/FreeChat
cztomsik/ava (MIT)
ptsochantaris/emeltal
pythops/tenere (AGPL)
RAGNA Desktop (proprietary)
RecurseChat (proprietary)
semperai/amica
withcatai/catai
Mobile-Artificial-Intelligence/maid (MIT)
Msty (proprietary)
LLMFarm (MIT)
KanTV(Apachev2.0 or later)
Dot (GPL)
MindMac (proprietary)
KodiBot (GPL)
eva (MIT)
AI Sublime Text plugin (MIT)
AIKit (MIT)
LARS - The LLM & Advanced Referencing Solution (AGPL)

(要在此处列出一个项目，它应该明确说明它依赖于 llama.cpp)

Tools:

akx/ggify – 从 HuggingFace Hub 下载 PyTorch 模型，然后转化他们到 GGML
crashr/gppm – 使用 NVIDIA Tesla P40 或 P100 GPU 加载 llama.cpp实例，降低空闲功耗

Infrastructure:

Paddler - 为llama.cpp定制的状态负载均衡器

二、Demo

1、Typical run using LLaMA v2 13B on M2 Ultra

$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MBsystem_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms

2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook

这是运行LLaMA-7B和 whisper.cpp 的另一个演示，在一台M1 Pro MacBook上：

https://private-user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4

三、用法

以下是大多数受支持模型的端到端二进制构建和模型转换步骤。

1、基本用法

首先，您需要获取二进制文件。您可以遵循不同的方法：

方法一：克隆此仓库并本地构建，看如何构建
方法二：如果你使用的是MacOS或Linux，你可以通过brew、flox或nix 安装 llama. cpp
方法3：使用Docker镜像，见为Docker留档
方法4：从 releases 下载预构建二进制

您可以使用以下命令运行基本完成：

llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.

有关参数的完整列表，请参阅此页面。

2、对话模式

如果您想要更ChatGPT的体验，可以通过将-cnv作为参数传递来在对话模式下运行：

llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

默认情况下，聊天模板将取自输入模型。如果要使用另一个聊天模板，传递--chat-template NAME参数。查看支持模板

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml

您还可以通过前缀、后缀和反向提示参数使用自己的模板：

./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

3、网络服务

llama. cpp Web服务是一个轻量级的OpenAI API兼容HTTP服务器，可用于服务本地模型并轻松将它们连接到现有客户端。

示例用法：

./llama-server -m your_model.gguf --port 8080# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions

4、交互模式

注：如果您更喜欢基本用法，请考虑使用对话模式而不是交互模式

在这种模式下，您始终可以通过按Ctrl+C，并输入一行或多行文本来中断生成，这些文本将被转换为 tokens 并附加到当前上下文中。

您还可以使用参数-r "reverse prompt string"指定反向提示。这将导致每当在生成中遇到反向提示字符串的确切 tokens 时都会提示用户输入。

一个典型的用途是使用一个提示，使LLaMA模拟多个用户之间的聊天，比如说Alice和Bob，并传递-r "Alice:"。

这是一个使用命令调用的 few-shot 交互示例

# default arguments using a 7B model
./examples/chat.sh# advanced chat with a 13B model
./examples/chat-13B.sh# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

注意使用--color来区分用户输入和生成的文本。

llama-cli示例程序的README中更详细地解释了其他参数。

在这里插入图片描述

5、持久互动

提示符、用户输入和模型代可以通过调用./llama-cli来保存和恢复--prompt-cache和--prompt-cache-all。./examples/chat-persistent.sh脚本演示了这一点，支持长时间运行的、可恢复的聊天会话。

要使用此示例，您必须提供一个文件来缓存初始聊天提示符和一个目录来保存聊天会话，并且可以选择提供与chat-13B.sh相同的提示缓存可以重复用于新的聊天会话。

请注意，提示缓存和聊天目录都绑定到初始提示符（PROMPT_TEMPLATE）和模型文件。

# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh

6、语法约束输出

llama.cpp支持约束模型输出的语法。例如，您可以强制模型仅输出JSON：

./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

grammars/文件夹包含一些示例语法。要编写自己的语法，请查看GBNF指南。

要编写更复杂的JSON语法，您还可以查看https://grammar.intrinsiclabs.ai/，这是一个浏览器应用程序，可让您编写TypeScript接口，并将其编译为GBNF语法，您可以保存以供本地使用。

请注意，该应用程序是由社区成员构建和维护的，请在其存储库上提交任何问题或FR，而不是这个。

四、构建

请参考本地Build llama. cpp

五、支持的后端

Backend	Target devices
Metal	Apple Silicon
BLAS	All
BLIS	All
SYCL	Intel and Nvidia GPU
CUDA	Nvidia GPU
hipBLAS	AMD GPU
Vulkan	GPU

六、工具

1、准备和量化

注：你可以使用Hugging Face 上的 GGUF-my-repo空间量化你的模型权重，不需要任何设置。它每6小时从llama.cpp同步一次。

要获得官方的LLaMA 2权重，请参阅 Obtaining and using the Facebook LLaMA 2 model 部分。

Hugging Face 上还有大量预量化gguf模型可供选择。

注意：convert.py已移至examples/convert_legacy_llama.py，不应用于Llama/Llama2/Mistral模型及其衍生产品以外的任何内容。它不支持 LLaMA 3，您可以使用convert_hf_to_gguf.py从 Hugging Face 下载 LLaMA 3。

要了解有关量化模型的更多信息，请阅读此留档

2、困惑（测量模型质量）

您可以使用perplexity示例来测量给定提示的困惑（困惑程度越低越好）。

有关详细信息，请参见https://huggingface.co/docs/transformers/perplexity。

要了解更多如何使用llama. cpp测量困惑，请阅读此留档

七、其他文件

main (cli)
server
jeopardy
GBNF grammars

开发文件

如何建造
在Docker上运行
基于Android构建
性能故障排除
GGML提示和技巧

关于模型的开创性论文和背景

如果您的问题是模型生成质量，那么请至少扫描以下链接和论文以了解LLaMA模型的局限性。在选择适当的模型尺寸并欣赏LLaMA模型和ChatGPT之间的显着和细微差异时，这一点尤其重要：

LLaMA
- 介绍LLaMA：一个基础的、65-billion-parameter的大型语言模型
- LLaMA：开放高效的基础语言模型
GPT-3
- 语言模型是很少学习者
GPT-3.5 / InstructGPT / ChatGPT:
- 调整语言模型以遵循说明
- 训练语言模型以遵循人类反馈的说明

2024-07