Running large language models locally has never been easier. Ollama is an open-source tool that wraps model downloads, GPU acceleration, API serving, and prompt formatting into a single CLI — letting you run models like Llama 3, Mistral, and DeepSeek entirely on your own hardware. Combined with WordStructor, this gives you a fully private, offline-capable book generation pipeline with no recurring API costs.
Why go local? Three reasons stand out. First, privacy — your manuscript data never leaves your machine. No cloud provider logs your prompts or stores your content. Second, cost — after the one-time hardware investment, you can generate unlimited books without per-token fees. Third, offline availability — Ollama works entirely without internet, ideal for travel or air-gapped environments.
Installing Ollama
Ollama supports Windows, Linux, and macOS. On Windows, download the installer from ollama.com/download and run OllamaSetup.exe. It installs as a system service and adds itself to PATH. On Linux, use the one-liner curl -fsSL https://ollama.com/install.sh | sh — it sets up APT/RPM repos, installs the binary, and configures a systemd service. On macOS, download the .dmg or use brew install ollama.
Verify your installation by running ollama --version in a terminal. The Ollama server starts automatically as a background service and listens on http://localhost:11434.
Downloading and Running Your First Model
To pull and run a model, use the ollama run command. For example, to start with Llama 3.1 8B (a well-rounded 4.7 GB model):
ollama run llama3.1:8b
Ollama downloads the model on first invocation, then opens an interactive chat session. Type /exit to quit. For non-interactive use, pass a prompt directly:
ollama run mistral:7b "Summarize the key benefits of local AI."
Popular models include llama3.1:8b (best all-rounder), mistral:7b (fast, great for code and structured tasks), mixtral:8x7b (high quality via mixture-of-experts), deepseek-coder-v2 (specialized for code generation), and gemma2:9b (strong reasoning from Google). For most book generation tasks, start with mistral:7b or llama3.1:8b and upgrade to mixtral:8x7b for higher consistency on longer chapters.
Connecting Ollama to WordStructor
WordStructor supports any OpenAI-compatible API endpoint, which means Ollama integrates seamlessly. In the WordStructor settings, select AI Model → Custom Provider and enter http://localhost:11434/v1 as the API URL. Choose your preferred model from the dropdown and save.
Alternatively, configure it through WordStructor's .env file:
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
Once connected, WordStructor routes all AI requests through your local Ollama instance. Chapters, outlines, character profiles, and research summaries are all generated on your machine — no data ever touches an external API.
Open WebUI — A Graphical Interface
Open WebUI is a ChatGPT-like web frontend for Ollama that adds chat history, RAG (document upload), multi-user support, and model switching. Install it alongside Ollama via Docker:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui --restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, create an account, and connect it to your Ollama instance. From there you can experiment with different models, upload PDFs for RAG-based research, and test prompts before using them in WordStructor.
Performance Tuning and Best Practices
Ollama's performance depends heavily on your hardware. For 7-8B models, aim for 16 GB of system RAM and 6-8 GB of VRAM (NVIDIA GPU with CUDA, AMD with ROCm, or Apple Silicon with Metal). For 70B models, you'll need 32-64 GB of RAM or 24-48 GB of VRAM.
Key environment variables for tuning:
- OLLAMA_NUM_PARALLEL — number of concurrent requests (default: 1). Increase for higher throughput when batching book chapters.
- OLLAMA_KEEP_ALIVE — how long a model stays loaded after the last request (default: 300s). Set to
0to free RAM immediately. - OLLAMA_MAX_LOADED_MODELS — how many models can stay in memory simultaneously.
- OLLAMA_HOST — bind address. Set to
0.0.0.0to expose Ollama on your local network (use a reverse proxy with TLS for security).
Most models in Ollama are already quantized (Q4_K_M by default), which cuts memory usage roughly in half compared to full FP16 with minimal quality loss. For more VRAM headroom, use Q3_K or Q2_K variants.
Why Local AI Matters for Book Authors
Running Ollama with WordStructor transforms your writing workflow. You maintain complete control over your intellectual property — manuscripts, research, and character notes stay private. There are no rate limits, no surprise bills, and no dependency on third-party API uptime. Whether you're drafting a novel, compiling technical documentation, or generating marketing copy, local AI gives you the freedom to iterate as much as you need without constraint.
WordStructor's modular architecture lets you switch between local Ollama models and cloud providers at any time, so you can use local inference for drafts (cost-effective) and premium cloud models for final polish — all from the same interface.