# Ollama
Ollama provides an easy way to run large language models locally and exposes an OpenAI-compatible API, so existing clients and tools can connect without changes.
## Features
- Simple model management
- OpenAI-compatible API
- Support for popular open models
- GPU acceleration support
- Low resource overhead
## Supported Models
| Model | Parameters | RAM Required | Use Case |
|---|---|---|---|
| Llama 3.2 | 1B-3B | 4-6 GB | Fast responses |
| Llama 3.1 | 8B | 8-12 GB | General purpose |
| Llama 3.1 | 70B | 48+ GB | High quality |
| Mistral | 7B | 8 GB | Efficient |
| CodeLlama | 7B-34B | 8-32 GB | Code generation |
| Phi-3 | 3.8B | 4-6 GB | Compact |
## Prerequisites
- Foundation complete
- Sufficient RAM for chosen models
- GPU recommended for performance
## System Requirements
| Configuration | RAM | GPU | Models |
|---|---|---|---|
| Minimal | 8 GB | None | 7B quantized |
| Standard | 16 GB | Optional | 7B-13B |
| Performance | 32+ GB | Recommended | 13B-70B |
## Installation
### Deploy VM
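A minimal sketch, assuming the VM itself is provisioned through your usual Proxmox workflow and runs a Debian/Ubuntu guest; Ollama's official install script sets up the `ollama` user and systemd service:

```bash
# Install Ollama via the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the service came up
systemctl status ollama
```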
### GPU Passthrough (Optional)
For GPU acceleration, configure Proxmox GPU passthrough:
- Enable IOMMU in BIOS
- Configure Proxmox for PCI passthrough
- Add GPU to the Ollama VM
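A rough sketch of those steps on the Proxmox host, assuming an Intel CPU and VM ID `110` (both illustrative; adjust for AMD hosts and your actual VM and PCI address):

```bash
# 1. Enable IOMMU: add "intel_iommu=on iommu=pt" to GRUB_CMDLINE_LINUX_DEFAULT
#    in /etc/default/grub (use amd_iommu=on on AMD hosts), then apply:
update-grub && reboot

# 2. Load the VFIO modules by adding them to /etc/modules:
#    vfio
#    vfio_iommu_type1
#    vfio_pci

# 3. Find the GPU's PCI address and attach it to the Ollama VM
#    (01:00 and VM ID 110 are examples)
lspci -nn | grep -i nvidia
qm set 110 -hostpci0 01:00,pcie=1
```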
## Configuration
### Pull Models
Download models using the Ollama CLI:

```bash
# SSH to the Ollama VM
ssh [email protected]

# Pull models
ollama pull llama3.1
ollama pull mistral
ollama pull codellama
```
### List Models
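List the models currently stored on the VM:

```bash
ollama list
```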
### Remove Models
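Delete a model to free disk space (the model name here is just an example):

```bash
ollama rm mistral
```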
## API Access
Ollama exposes an OpenAI-compatible API on port 11434:

```python
import openai

client = openai.OpenAI(
    base_url="http://ollama.mgmt.internal:11434/v1",
    api_key="ollama",  # Any string works; Ollama does not validate the key
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
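Ollama also serves its own native REST API on the same port, which is handy for quick checks from the shell:

```bash
curl http://ollama.mgmt.internal:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello!",
  "stream": false
}'
```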
## Integration
### With OpenWebUI
Configure OpenWebUI to use Ollama:
- Navigate to Settings → Connections
- Add an Ollama connection with the URL `http://ollama.mgmt.internal:11434`
### With LiteLLM
Add the Ollama models to the LiteLLM configuration:

```yaml
model_list:
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama.mgmt.internal:11434
  - model_name: local-code
    litellm_params:
      model: ollama/codellama
      api_base: http://ollama.mgmt.internal:11434
```
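After reloading the proxy, the models are reachable through LiteLLM's OpenAI-compatible endpoint. A sketch assuming the proxy listens at `litellm.mgmt.internal:4000` (the hostname is illustrative, 4000 is LiteLLM's default port, and an `Authorization` header is needed if a master key is configured):

```bash
curl http://litellm.mgmt.internal:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-llama", "messages": [{"role": "user", "content": "Hello!"}]}'
```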
## Model Management
### Custom Models
Create custom models with specific system prompts:

```bash
# Create a Modelfile
cat > Modelfile << EOF
FROM llama3.1
SYSTEM You are a helpful coding assistant.
PARAMETER temperature 0.7
EOF

# Create the custom model
ollama create coding-assistant -f Modelfile
```
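Run the custom model like any other:

```bash
ollama run coding-assistant
```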
### Model Updates
Keep models updated:
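Pulling a model that is already installed fetches the latest version of that tag, so updating is just a re-pull:

```bash
ollama pull llama3.1
ollama pull mistral
```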
## Performance Tuning
### Memory Settings
Ollama has no hard memory cap, but memory use can be bounded by limiting how many models stay loaded and for how long, via environment variables on the `ollama` systemd service.
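A sketch using a systemd drop-in; the variables are real Ollama settings, but the values below are illustrative starting points rather than tuned recommendations:

```bash
# Create a drop-in override for the Ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
# Keep at most one model resident in memory
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Limit concurrent requests handled per model
Environment="OLLAMA_NUM_PARALLEL=2"
# Unload models after 5 minutes of inactivity
Environment="OLLAMA_KEEP_ALIVE=5m"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
```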
### GPU Configuration
For NVIDIA GPUs:
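A sketch for an Ubuntu guest with the GPU passed through; the driver package name is an example and varies by distribution and GPU generation. Ollama detects CUDA automatically once `nvidia-smi` works:

```bash
# Install the NVIDIA driver (package name is illustrative)
sudo apt install -y nvidia-driver-535
sudo reboot

# Verify the GPU is visible, then restart Ollama so it picks up CUDA
nvidia-smi
sudo systemctl restart ollama
```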
## Monitoring
### Check Status

```bash
# Service status
systemctl status ollama

# Running models
ollama ps

# Resource usage
htop
nvidia-smi  # If a GPU is available
```
### Logs
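The install script registers Ollama as a systemd service, so logs are available through the journal:

```bash
journalctl -u ollama -f
```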
## Troubleshooting
### Model Download Fails
- Check disk space: `df -h`
- Verify network connectivity
- Try a smaller model first
### Slow Inference
- Check available RAM: `free -h`
- Verify no other models are loaded: `ollama ps`
- Consider GPU acceleration
- Use quantized models (e.g. Q4_K_M)
### Out of Memory
- Stop other running models: `ollama stop <model>`
- Use smaller models
- Increase the VM's RAM allocation
## Backup
Model files are stored in `/usr/share/ollama/.ollama/models`; include this directory in the VM backup.
### Export Models

```bash
# Models are automatically backed up with the VM
# For a manual export:
tar -czf ollama-models.tar.gz /usr/share/ollama/.ollama/models
```