Run large language model inference from Dart and Flutter, using llama.cpp as a backend. The API is designed to be human-friendly and to follow the Dart design guidelines.
Add the following to your pubspec.yaml:
dependencies:
  inference:
    git:
      url: https://github.com/breitburg/inference-for-dart
Run flutter pub get.
Compile the llama.cpp backend for your target platform and link it to your native project, following the official build instructions for your platform. The library must be built for the same architecture as your Dart/Flutter app.
For example, on iOS you need to open the Xcode project and add llama.xcframework to the 'Frameworks, Libraries, and Embedded Content' section; on Linux you need to compile libllama.so and link it to your project; and so on.
You can run inference from any .gguf model, downloaded at runtime or embedded within the app. Search for 'GGUF' on Hugging Face to find and download a model file.
For up-to-date information on which models are supported, see llama.cpp's README.
Running inference with large language models requires sufficient memory resources. The model must be fully loaded into RAM (or VRAM for GPU acceleration) before any inference can begin.
For example, the OLMo 2 (7B) Instruct model with Q4_K_S quantization is 4.25 GB.
To estimate the minimum RAM required for inference, use this formula:
Total RAM ≈ Model Weights + KV Cache + Inference Overhead
Model Weights: 4.25 GB × 1.1 ≈ 4.7 GB
KV Cache: ~100-200 MB (with default 1024 token context)
Inference Overhead: ~100-200 MB
Total Estimated RAM: ~4.9-5.1 GB minimum
Insufficient RAM will result in an application crash during model initialization.
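As a rough illustration of the formula, here is a hypothetical Dart helper (not part of the library) that derives the estimate from a model file's size on disk, using the worst-case figures above:
import 'dart:io';

// Rough minimum-RAM estimate (in GB) for running a GGUF model:
// weights (~1.1x the file size) plus KV cache and inference overhead.
double estimateMinimumRamGb(String modelPath) {
  final weightsGb = File(modelPath).lengthSync() / (1024 * 1024 * 1024);
  const kvCacheGb = 0.2; // ~100-200 MB with the default 1024-token context
  const overheadGb = 0.2; // ~100-200 MB of inference overhead
  return weightsGb * 1.1 + kvCacheGb + overheadGb;
}

final ramGb = estimateMinimumRamGb('path/to/model.gguf');
print('Estimated minimum RAM: ${ramGb.toStringAsFixed(1)} GB');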
Before loading the full model weights into memory, you can inspect the model's metadata to fetch its name, authors, and license, and to understand its capabilities and requirements.
// Create a model instance from a file path (not loaded yet)
final model = InferenceModel(path: 'path/to/model.gguf');
// Retrieve and display model metadata
final metadata = model.fetchMetadata();
print(
  "${metadata['general.name']} by ${metadata['general.organization']} under ${metadata['general.license']}",
);
The InferenceEngine manages the model’s lifecycle, including initialization and cleanup.
// Create an inference engine for the model
final engine = InferenceEngine(model);
// Initialize the engine (loads model into memory, prepares context, etc.)
engine.initialize();
// Dispose of the engine when done to free resources
engine.dispose();
Tip: Always dispose of resources (such as the inference engine) when they are no longer needed to avoid memory leaks.
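A common pattern, shown here as a minimal sketch using the same API as above, is to pair initialization with disposal in a try/finally block so the engine is released even if inference throws:
final engine = InferenceEngine(model);
engine.initialize();
try {
  // Run chat, embeddings, or tokenization here
} finally {
  // Release native resources even if an error occurred
  engine.dispose();
}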
Interact with the model using structured chat messages for conversational AI scenarios.
// Prepare a list of chat messages with roles
final messages = [
  ChatMessage.system('You are an AI running on ${Platform.operatingSystem}.'),
  ChatMessage.human('Why is the sky blue?'),
];
// Run inference and handle the output
engine.chat(
  messages,
  onResult: (result) => stdout.write(result.message?.content ?? ''),
);
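If you need the complete response as a single string rather than streaming it to stdout, you can accumulate the chunks passed to onResult. This is a minimal sketch built on the same call shape as above; if chat returns a Future in your version of the library, await it before reading the buffer.
// Collect the streamed chunks into a single string
final buffer = StringBuffer();
engine.chat(
  messages,
  onResult: (result) => buffer.write(result.message?.content ?? ''),
);
print(buffer.toString());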
To compute an embedding for a given text, you must use an embedding model.
// Embed 'Hello' into the latent space
final vector = engine.embed('Hello');
print(vector); // [-0.0596, 0.0614, ...]
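Embeddings are typically compared with cosine similarity. The helper below is a plain-Dart sketch (not part of the library), assuming embed returns a list of doubles as shown above:
import 'dart:math';

// Cosine similarity between two embedding vectors of equal length;
// values closer to 1.0 mean the texts are more similar
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

final similarity = cosineSimilarity(engine.embed('Hello'), engine.embed('Hi'));
print(similarity);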
You can convert text to tokens and tokens back to text as needed for your application and evaluations.
// Tokenize input text
List<int> tokens = engine.tokenize("Hello, world!");
print("Tokens: $tokens"); // Example output: [1, 15043, 29892, 0]
// Detokenize tokens back to text
String text = engine.detokenize([1, 15043, 29892, 0]);
print("Text: $text"); // Output: "Hello, world!"
Customize library behavior, such as specifying a dynamic library path or handling logs.
// Set a custom dynamic library and log callback
lowLevelInference
  ..dynamicLibrary = DynamicLibrary.open('path/to/libllama.so') // Use .dylib for macOS, .dll for Windows
  ..logCallback = (String message) => print('[llama.cpp] $message');
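For example, you can pick the dynamic library per platform at startup. This is a sketch only; the file names and locations depend on how you build and bundle llama.cpp:
import 'dart:ffi';
import 'dart:io';

// Hypothetical library locations; adjust to wherever your build places them
final libraryPath = Platform.isWindows
    ? 'llama.dll'
    : Platform.isMacOS
        ? 'libllama.dylib'
        : 'libllama.so';

lowLevelInference.dynamicLibrary = DynamicLibrary.open(libraryPath);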