This guide to local LLMs was created entirely by AI. Please adjust it based on your actual usage and requirements.

Why use local models?

  • Privacy: your data never leaves your computer
  • No API costs: one-time setup, unlimited usage
  • Offline access: work without an internet connection
  • Full control: customize model parameters

Requirements

Local models require significant hardware resources. Recommended minimum:
  • RAM: 16GB (32GB+ for larger models)
  • Storage: 10-50GB per model
  • GPU: Optional but highly recommended (NVIDIA with 8GB+ VRAM)
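
If you want a rough self-check against these numbers before downloading anything, the sketch below reads total RAM, free disk space, and (optionally) GPU VRAM. It assumes the third-party psutil package (pip install psutil); the VRAM check only runs if PyTorch happens to be installed.

```python
# Rough hardware self-check before downloading a model.
# Assumes the third-party `psutil` package; the GPU check is skipped
# if PyTorch is not installed.
import shutil
import psutil

ram_gb = psutil.virtual_memory().total / 1e9
free_disk_gb = shutil.disk_usage("/").free / 1e9
print(f"RAM: {ram_gb:.0f} GB (16 GB minimum, 32 GB+ for larger models)")
print(f"Free disk: {free_disk_gb:.0f} GB (10-50 GB per model)")

try:
    import torch  # optional: only used for the GPU/VRAM check
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU VRAM: {vram_gb:.0f} GB (8 GB+ recommended)")
    else:
        print("No CUDA GPU detected; CPU inference will be slower.")
except ImportError:
    print("PyTorch not installed; skipping GPU check.")
```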

Ollama

Easy-to-use local model runner with a simple CLI. Pros:
  • Simple installation
  • Automatic model management
  • Active community
  • Optimized for Apple Silicon
Best models:
  • Llama 3.1 (8B, 70B)
  • Qwen 2.5
  • DeepSeek Coder

Ollama setup guide

Download and installation instructions
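
If you prefer to script against Ollama directly rather than through SoloEnt, a minimal sketch is shown below. It assumes the optional ollama Python package (pip install ollama), a running Ollama server, and that the model has already been pulled with ollama pull llama3.1.

```python
# Minimal sketch using the optional `ollama` Python package.
# Assumes the Ollama server is running locally and the model has
# already been pulled with `ollama pull llama3.1`.
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize why local LLMs help with privacy."}],
)
print(response["message"]["content"])
```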

LM Studio

Desktop application with GUI for running local models. Pros:
  • User-friendly interface
  • Model discovery and download
  • Cross-platform (Mac, Windows, Linux)
  • Built-in chat interface
Best for:
  • Users who prefer GUI over CLI
  • Testing multiple models easily
  • Quick model comparison

Download LM Studio

vLLM (Advanced)

High-performance inference engine for production deployments. Pros:
  • Fastest inference speed
  • GPU optimization
  • Production-ready
  • API server included
Best for:
  • Technical users
  • High-throughput needs
  • Custom deployments
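
For completeness, here is a minimal offline-inference sketch with vLLM's Python API. The model ID, GPU availability, and any Hugging Face access requirements are assumptions; for connecting to SoloEnt you would normally run vLLM's OpenAI-compatible server (default port 8000) instead and treat it like any other local endpoint.

```python
# Minimal offline-inference sketch with vLLM (pip install vllm).
# Assumes a CUDA GPU with enough VRAM and access to the example model below;
# substitute any model you have locally or on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)
outputs = llm.generate(["Write one sentence about local inference."], params)
print(outputs[0].outputs[0].text)
```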

Connecting to SoloEnt

All local solutions expose an OpenAI-compatible API:
1. Start local server
   Launch your chosen solution (Ollama, LM Studio, etc.)

2. Configure in SoloEnt
   Use an OpenAI-compatible configuration:
   Base URL: http://localhost:11434/v1  (Ollama default)
   API Key: ollama  (or leave blank)
   Model ID: llama3.1  (your model name)

3. Test connection
   Send a test message to verify the setup; a minimal client-side check is sketched below.
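
Because every option above speaks the OpenAI API, you can also sanity-check the endpoint outside SoloEnt with any OpenAI-compatible client. The sketch below assumes the official openai Python package and the Ollama defaults from step 2; swap the base URL, API key, and model ID for LM Studio or vLLM.

```python
# Minimal sketch using the `openai` Python package (pip install openai)
# against Ollama's OpenAI-compatible endpoint; adjust base_url, api_key,
# and model for LM Studio or vLLM as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```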

Recommended models

Writing & storytelling

Model          | Size | RAM Required | Quality
Llama 3.1 70B  | 40GB | 64GB+        | Excellent
Qwen 2.5 32B   | 20GB | 32GB+        | Very good
Llama 3.1 8B   | 5GB  | 16GB+        | Good

Chinese content

Model         | Size | RAM Required | Quality
Qwen 2.5 72B  | 42GB | 64GB+        | Excellent
GLM-4 9B      | 6GB  | 16GB+        | Very good
DeepSeek 67B  | 38GB | 64GB+        | Excellent

Code & technical

Model              | Size | RAM Required | Quality
DeepSeek Coder V2  | 16GB | 32GB+        | Excellent
CodeLlama 34B      | 20GB | 32GB+        | Very good
Qwen 2.5 Coder 7B  | 4GB  | 8GB+         | Good

Performance optimization

  • NVIDIA GPUs dramatically improve inference speed; make sure CUDA is properly installed.
  • Larger models are not always better: 7B-13B models often offer the best speed/quality balance.
  • Use Q4 or Q5 quantized models to reduce memory usage with minimal quality loss (a sketch for requesting a quantized build follows this list).
  • Shorter context windows (4K-8K) run faster than long contexts (32K+).
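
As a concrete example of the quantization and context-window tips, the hedged sketch below requests a Q4-quantized build and a smaller context through the ollama Python package. The exact tag name (llama3.1:8b-instruct-q4_K_M) is an assumption; check the tag list for your model in the Ollama library before relying on it.

```python
# Sketch: request a Q4-quantized build and a smaller context window via the
# `ollama` Python package. The exact tag name and availability vary by model,
# so check the model's tag list on the Ollama library first.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",  # example quantized tag (assumption)
    messages=[{"role": "user", "content": "Outline a short story in three bullet points."}],
    options={"num_ctx": 8192},  # smaller context window for faster generation
)
print(response["message"]["content"])
```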

Common issues

Slow generation:
  • Switch to a smaller model (8B instead of 70B)
  • Use a quantized version (Q4_K_M)
  • Enable GPU acceleration
  • Reduce the context window size

Out-of-memory errors:
  • Choose a smaller model
  • Close other applications
  • Upgrade RAM
  • Use more aggressive quantization (Q3, Q4)

Connection problems (a connectivity check is sketched after this list):
  • Verify the local server is running
  • Check the Base URL and port number
  • Ensure no firewall is blocking the port
  • Try http://127.0.0.1 instead of localhost

Poor output quality:
  • Try a different prompt format
  • Adjust temperature/top_p settings
  • Switch to a larger or different model
  • Check that the model is appropriate for your language
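
For the connection problems above, a quick programmatic check of the server's model list tells you whether the server is reachable at all or whether the SoloEnt configuration is at fault. The sketch below assumes Ollama's default port and the third-party requests package.

```python
# Quick connectivity check against an OpenAI-compatible local server.
# Assumes Ollama's default port (11434); try 1234 for LM Studio or 8000 for
# a default vLLM server. Requires the third-party `requests` package.
import requests

base_url = "http://127.0.0.1:11434/v1"  # 127.0.0.1 avoids some localhost resolution issues
try:
    resp = requests.get(f"{base_url}/models", timeout=5)
    resp.raise_for_status()
    names = [m.get("id") for m in resp.json().get("data", [])]
    print("Server is up. Available models:", names)
except requests.RequestException as exc:
    print("Could not reach the local server:", exc)
```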

Next steps