This guide to local LLMs was created entirely by AI. Please adjust it based on your actual usage and requirements.

Why use local models?

  • Privacy: your data never leaves your computer
  • No API costs: one-time setup, unlimited usage
  • Offline access: work without an internet connection
  • Full control: customize model parameters

Requirements

Local models require significant hardware resources. Recommended minimum:
  • RAM: 16GB (32GB+ for larger models)
  • Storage: 10-50GB per model
  • GPU: Optional but highly recommended (NVIDIA with 8GB+ VRAM)
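
If you want a rough self-check against these numbers before downloading anything, the sketch below reads total RAM, free disk space, and (optionally) GPU VRAM. It assumes the third-party psutil package (pip install psutil); the VRAM check only runs if PyTorch happens to be installed.

```python
# Rough hardware self-check before downloading a model.
# Assumes the third-party `psutil` package; the GPU check is skipped
# if PyTorch is not installed.
import shutil
import psutil

ram_gb = psutil.virtual_memory().total / 1e9
free_disk_gb = shutil.disk_usage("/").free / 1e9
print(f"RAM: {ram_gb:.0f} GB (16 GB minimum, 32 GB+ for larger models)")
print(f"Free disk: {free_disk_gb:.0f} GB (10-50 GB per model)")

try:
    import torch  # optional: only used for the GPU/VRAM check
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU VRAM: {vram_gb:.0f} GB (8 GB+ recommended)")
    else:
        print("No CUDA GPU detected; CPU inference will be slower.")
except ImportError:
    print("PyTorch not installed; skipping GPU check.")
```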

Ollama

Easy-to-use local model runner with a simple CLI. Pros:
  • Simple installation
  • Automatic model management
  • Active community
  • Optimized for Apple Silicon
Best models:
  • Llama 3.1 (8B, 70B)
  • Qwen 2.5
  • DeepSeek Coder

Ollama setup guide

Download and installation instructions
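
If you prefer to script against Ollama directly rather than through SoloEnt, a minimal sketch is shown below. It assumes the optional ollama Python package (pip install ollama), a running Ollama server, and that the model has already been pulled with ollama pull llama3.1.

```python
# Minimal sketch using the optional `ollama` Python package.
# Assumes the Ollama server is running locally and the model has
# already been pulled with `ollama pull llama3.1`.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize why local LLMs help with privacy."}],
)
print(response["message"]["content"])
```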

LM Studio

Desktop application with GUI for running local models. Pros:
  • User-friendly interface
  • Model discovery and download
  • Cross-platform (Mac, Windows, Linux)
  • Built-in chat interface
Best for:
  • Users who prefer GUI over CLI
  • Testing multiple models easily
  • Quick model comparison

Download LM Studio

vLLM (Advanced)

High-performance inference engine for production deployments. Pros:
  • Fastest inference speed
  • GPU optimization
  • Production-ready
  • API server included
Best for:
  • Technical users
  • High-throughput needs
  • Custom deployments
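
For completeness, here is a minimal offline-inference sketch with vLLM's Python API. The model ID, GPU availability, and any Hugging Face access requirements are assumptions; for connecting to SoloEnt you would normally run vLLM's OpenAI-compatible server (default port 8000) instead and treat it like any other local endpoint.

```python
# Minimal offline-inference sketch with vLLM (pip install vllm).
# Assumes a CUDA GPU with enough VRAM and access to the example model below;
# substitute any model you have locally or on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)
outputs = llm.generate(["Write one sentence about local inference."], params)
print(outputs[0].outputs[0].text)
```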

Connecting to SoloEnt

All local solutions expose an OpenAI-compatible API:
1. Start local server
   Launch your chosen solution (Ollama, LM Studio, etc.)

2. Configure in SoloEnt
   Use an OpenAI-compatible configuration:
   Base URL: http://localhost:11434/v1  (Ollama default)
   API Key: ollama  (or leave blank)
   Model ID: llama3.1  (your model name)

3. Test connection
   Send a test message to verify the setup; a minimal client-side check is sketched below.
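
Because every option above speaks the OpenAI API, you can also sanity-check the endpoint outside SoloEnt with any OpenAI-compatible client. The sketch below assumes the official openai Python package and the Ollama defaults from step 2; swap the base URL, API key, and model ID for LM Studio or vLLM.

```python
# Minimal sketch using the `openai` Python package (pip install openai)
# against Ollama's OpenAI-compatible endpoint; adjust base_url, api_key,
# and model for LM Studio or vLLM as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```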

Recommended models

Writing & storytelling

Model          | Size | RAM Required | Quality
Llama 3.1 70B  | 40GB | 64GB+        | Excellent
Qwen 2.5 32B   | 20GB | 32GB+        | Very good
Llama 3.1 8B   | 5GB  | 16GB+        | Good

Chinese content

Model         | Size | RAM Required | Quality
Qwen 2.5 72B  | 42GB | 64GB+        | Excellent
GLM-4 9B      | 6GB  | 16GB+        | Very good
DeepSeek 67B  | 38GB | 64GB+        | Excellent

Code & technical

Model              | Size | RAM Required | Quality
DeepSeek Coder V2  | 16GB | 32GB+        | Excellent
CodeLlama 34B      | 20GB | 32GB+        | Very good
Qwen 2.5 Coder 7B  | 4GB  | 8GB+         | Good

Performance optimization

  • NVIDIA GPUs dramatically improve inference speed; make sure CUDA is properly installed.
  • Larger models are not always better: 7B-13B models often offer the best speed/quality balance.
  • Use Q4 or Q5 quantized models to reduce memory usage with minimal quality loss (a sketch for requesting a quantized build follows this list).
  • Shorter context windows (4K-8K) run faster than long contexts (32K+).
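
As a concrete example of the quantization and context-window tips, the hedged sketch below requests a Q4-quantized build and a smaller context through the ollama Python package. The exact tag name (llama3.1:8b-instruct-q4_K_M) is an assumption; check the tag list for your model in the Ollama library before relying on it.

```python
# Sketch: request a Q4-quantized build and a smaller context window via the
# `ollama` Python package. The exact tag name and availability vary by model,
# so check the model's tag list on the Ollama library first.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",  # example quantized tag (assumption)
    messages=[{"role": "user", "content": "Outline a short story in three bullet points."}],
    options={"num_ctx": 8192},  # smaller context window for faster generation
)
print(response["message"]["content"])
```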

Common issues

Slow generation:
  • Switch to a smaller model (8B instead of 70B)
  • Use a quantized version (Q4_K_M)
  • Enable GPU acceleration
  • Reduce the context window size

Out-of-memory errors:
  • Choose a smaller model
  • Close other applications
  • Upgrade RAM
  • Use more aggressive quantization (Q3, Q4)

Connection problems (a connectivity check is sketched after this list):
  • Verify the local server is running
  • Check the Base URL and port number
  • Ensure no firewall is blocking the port
  • Try http://127.0.0.1 instead of localhost

Poor output quality:
  • Try a different prompt format
  • Adjust temperature/top_p settings
  • Switch to a larger or different model
  • Check that the model is appropriate for your language
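
For the connection problems above, a quick programmatic check of the server's model list tells you whether the server is reachable at all or whether the SoloEnt configuration is at fault. The sketch below assumes Ollama's default port and the third-party requests package.

```python
# Quick connectivity check against an OpenAI-compatible local server.
# Assumes Ollama's default port (11434); try 1234 for LM Studio or 8000 for
# a default vLLM server. Requires the third-party `requests` package.
import requests

base_url = "http://127.0.0.1:11434/v1"  # 127.0.0.1 avoids some localhost resolution issues
try:
    resp = requests.get(f"{base_url}/models", timeout=5)
    resp.raise_for_status()
    names = [m.get("id") for m in resp.json().get("data", [])]
    print("Server is up. Available models:", names)
except requests.RequestException as exc:
    print("Could not reach the local server:", exc)
```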

Next steps