Step-by-Step Guide to Running Ollama Locally

Prerequisites

  • Hardware: A computer with at least 8GB RAM (16GB+ recommended for larger models). A GPU (NVIDIA or AMD) is optional but recommended for faster inference. CPU-only setups work but are slower.
  • Operating System: macOS, Linux, or Windows (Windows support via WSL2 or native installation).
  • Internet Connection: Required to download Ollama and model files.
  • Disk Space: Small models such as Llama 3.2 (3B parameters) need roughly 2 GB of storage; mid-size 7B-13B models typically need 4-8 GB or more.

Step 1: Install Ollama

  1. Download Ollama:
  • Visit the official Ollama website: https://ollama.com/download.
  • Choose the installer for your operating system:
  • macOS: Download the .dmg file and follow the installation prompts.
  • Windows: Download the OllamaSetup.exe file and run it as an administrator. Alternatively, use WSL2 for a Linux-like setup on Windows.
  • Linux: Run the following command in your terminal to install Ollama:

    ```bash
    curl -fsSL https://ollama.com/install.sh | sh
    ```

  • This script installs the Ollama binary, checks for GPU drivers (NVIDIA CUDA or AMD ROCm), and sets up a systemd service for the Ollama server.
  2. Verify Installation:
  • Open a terminal (Command Prompt or PowerShell on Windows, Terminal on macOS/Linux).
  • Run:

    ```bash
    ollama --version
    ```

  • If a version number appears (e.g., 0.1.32), Ollama is installed correctly. (A quick check of the local API is also sketched at the end of this step.)
  3. Optional: Configure Ollama Storage:
  • By default, Ollama stores models in ~/.ollama/models. To change this (e.g., to a drive with more space), set the OLLAMA_MODELS environment variable.
  • On Linux/macOS:

    ```bash
    export OLLAMA_MODELS=/path/to/your/model/directory
    ```

  • On Windows, set it in System Properties > Environment Variables (example path: H:\Ollama\Models).
  4. Optional: Disable Auto-Start (if desired):
  • Ollama may auto-start on system boot. To disable this:
  • Windows: Open Task Manager (Ctrl+Shift+Esc), go to Startup Apps, find Ollama, right-click, and select Disable.
  • Linux: Disable the systemd service:

    ```bash
    sudo systemctl disable ollama
    ```
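
Beyond the CLI check above, the Ollama server also exposes a local HTTP API (by default on port 11434). The snippet below is a minimal sketch, using only the Python standard library and assuming the default port, that confirms the server itself is up and responding:

```python
# check_ollama.py - confirm the local Ollama server is reachable.
# Assumes the default address http://localhost:11434; adjust if you changed OLLAMA_HOST.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/version", timeout=5) as resp:
        info = json.load(resp)
    print("Ollama server is up, version:", info.get("version", "unknown"))
except OSError as exc:
    print("Could not reach the Ollama server:", exc)
```

If this fails on Linux, start the service with sudo systemctl start ollama (or run ollama serve manually) and try again.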

Step 2: Pull and Run a Model

  1. Choose a Model:
  • Visit https://ollama.com/library to browse available models (e.g., Llama 3.2, Gemma 2, Mistral).
  • For beginners, start with a smaller model like gemma2:2b (~1.6 GB) or llama3.2:3b (~2 GB) to minimize resource requirements.
  2. Pull a Model:
  • In your terminal, run:

    ```bash
    ollama pull llama3.2
    ```

  • This downloads the llama3.2 model to your local machine. The download time depends on your internet speed and model size (e.g., ~2 GB for the default 3B llama3.2 tag).
  3. Run the Model:
  • Start an interactive session with the model:

    ```bash
    ollama run llama3.2
    ```

  • This opens a Read-Eval-Print Loop (REPL) where you can type prompts and receive responses. Example:

    ```text
    >>> What is the capital of France?
    The capital of France is Paris.
    >>> /bye
    ```

  • Use /bye to exit the REPL.
  4. Run a One-Off Prompt:
  • To get a response without entering the REPL (a REST-based equivalent is sketched after this list):

    ```bash
    ollama run llama3.2 "Explain the basics of machine learning."
    ```

  • The model will output a response directly to the terminal.
  5. List Downloaded Models:
  • To see all models on your system:

    ```bash
    ollama list
    ```

  • Example output:

    ```text
    NAME               ID              SIZE      MODIFIED
    llama3.2:latest    1234567890ab    2.1 GB    5 minutes ago
    ```
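
The same server behind these commands also accepts HTTP requests on port 11434, which is convenient for scripting one-off prompts (this mirrors the one-off ollama run shown above). A minimal sketch using only the Python standard library, assuming the default port and that llama3.2 has already been pulled:

```python
# one_off_prompt.py - send a single prompt to the local Ollama server over HTTP.
# Assumes Ollama is listening on the default http://localhost:11434
# and that the llama3.2 model has already been pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2",
    "prompt": "Explain the basics of machine learning.",
    "stream": False,  # return one complete response instead of a token stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["response"])
```

Step 5 covers the official Python client, which wraps this same API.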

Step 3: Customize a Model (Optional)

  1. Create a Modelfile:
  • A Modelfile defines a custom model’s behavior. Example: create a model that responds like Mario from Super Mario Bros.
  • Create a file named MarioModelfile with the following content:

    ```plaintext
    FROM llama3.2
    PARAMETER temperature 1
    SYSTEM "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."
    ```

  • Save the file (e.g., as ~/MarioModelfile). If you prefer to set the system prompt per request instead of baking it into a model, see the Python sketch after this step.
  2. Create and Run the Custom Model:
  • Create the model:

    ```bash
    ollama create mario -f MarioModelfile
    ```

  • Run it:

    ```bash
    ollama run mario
    ```

  • Example interaction:

    ```text
    >>> What's your favorite activity?
    It's-a me, Mario! I love jumpin’ on Goombas and savin’ Princess Peach! Wahoo!
    ```
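
A Modelfile is not the only way to apply a persona. As a request-time alternative, the official Python client (installed in Step 5 with pip install ollama) accepts a system message per request. A minimal sketch, assuming the base llama3.2 model is already pulled and the server is running:

```python
# mario_chat.py - apply a system prompt at request time instead of via a Modelfile.
# Assumes the Ollama server is running locally and llama3.2 has been pulled.
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."},
        {"role": "user", "content": "What's your favorite activity?"},
    ],
)
print(response["message"]["content"])
```

Baking the persona into a Modelfile is still handy when you want to reuse it from the CLI or a web UI without repeating the system prompt.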

Step 4: Use Ollama with a Web UI (Optional)

  1. Install Open WebUI:
  • For a ChatGPT-like interface, use Open WebUI with Ollama.
  • Install Docker if not already installed (see https://www.docker.com/).
  • Run the Open WebUI Docker container:

    ```bash
    docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
    ```

  • This maps Open WebUI to http://localhost:3000.
  2. Access Open WebUI:
  • Open your browser and go to http://localhost:3000.
  • Sign up for an account (first-time setup).
  • In the Model Selector, type the model name (e.g., llama3.2) to download or select it.
  • Start chatting with the model via the web interface.
  3. Manage Models:
  • In Open WebUI, go to Admin Settings > Connections > Ollama > Manage to download or configure models.
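
Open WebUI discovers models by querying Ollama's HTTP API. If the model selector looks empty, it can help to check what Ollama itself reports. A minimal sketch using only the Python standard library, assuming the server is reachable at the default http://localhost:11434 on the host:

```python
# list_models_api.py - ask the Ollama API which models are available locally.
# Assumes the Ollama server is reachable at the default http://localhost:11434.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    data = json.load(resp)

for model in data.get("models", []):
    print(model.get("name"))
```

If this lists your models but Open WebUI does not, check the container's Ollama connection settings; the docker run command above reaches the host through host.docker.internal.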

Step 5: Integrate with Python (Optional)

  1. Install the Ollama Python Library:
  • In your terminal, install the library:

    ```bash
    pip install ollama
    ```
  2. Run a Simple Python Script:
  • Create a Python file (e.g., ollama_test.py) with the following:

    ```python
    import ollama

    # Ask the locally running llama3.2 model a question and print its answer.
    response = ollama.generate(model='llama3.2', prompt='What is a qubit?')
    print(response['response'])
    ```

  • Run the script:

    ```bash
    python ollama_test.py
    ```

  • This queries the llama3.2 model and prints the response. (A streaming variant is sketched after this list.)
  3. Alternative: Use LangChain:
  • Install LangChain:

    ```bash
    pip install langchain-community
    ```

  • Example script:

    ```python
    from langchain_community.llms import Ollama

    # Wrap the local llama3.2 model as a LangChain LLM and invoke it once.
    llm = Ollama(model="llama3.2")
    response = llm.invoke("What is an LLM?")
    print(response)
    ```
  • This integrates Ollama with LangChain for more advanced workflows.
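
For longer answers it is often nicer to stream tokens as they are generated instead of waiting for the full response. A minimal sketch using the official Python client's streaming mode, assuming llama3.2 is pulled and the server is running:

```python
# ollama_stream.py - print a response incrementally as it is generated.
# Assumes the Ollama server is running locally with llama3.2 available.
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain the basics of machine learning."}],
    stream=True,  # yield partial responses instead of one final object
)

for chunk in stream:
    # Each chunk carries a fragment of the assistant's message.
    print(chunk["message"]["content"], end="", flush=True)
print()
```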

Step 6: Verify GPU Usage (if applicable)

  • Ollama automatically detects NVIDIA/AMD GPUs if drivers are installed:
  • NVIDIA: Ensure CUDA drivers (version 452.39 or newer) are installed from https://www.nvidia.com/.
  • AMD: Install ROCm drivers from https://www.amd.com/.
  • To check whether Ollama is running a model on the GPU:

    ```bash
    ollama ps
    ```

  • If the output shows the model running on CPU instead of GPU, check the server logs for errors:

    ```bash
    OLLAMA_DEBUG=1 ollama serve
    ```

  • Ensure drivers are up to date and reboot the system after driver installation.
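
The information behind ollama ps is also available from the local API, which is handy if you want to script the check. A minimal sketch, assuming the default port; the exact JSON fields (such as a VRAM size for GPU-loaded models) may vary by Ollama version:

```python
# gpu_check.py - inspect which models are currently loaded by the server.
# Assumes the Ollama server is reachable at the default http://localhost:11434.
# Run a model first (e.g., `ollama run llama3.2`) so something is loaded.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps", timeout=5) as resp:
    data = json.load(resp)

# Print the raw JSON; GPU-loaded models typically report memory held in VRAM.
print(json.dumps(data, indent=2))
```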

Step 7: Manage Models

  • Remove a Model:

    ```bash
    ollama rm llama3.2
    ```

  • Get Model Info:

    ```bash
    ollama show llama3.2
    ```
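
The Python client installed in Step 5 offers the same management operations, which is useful for scripted cleanup. A minimal sketch; the delete call is commented out because it actually removes the model:

```python
# manage_models.py - inspect and (optionally) remove models from Python.
# Mirrors `ollama show` and `ollama rm`; assumes llama3.2 is present locally.
import ollama

# Print the model's metadata (roughly what `ollama show llama3.2` reports).
print(ollama.show("llama3.2"))

# Uncomment to delete the model, equivalent to `ollama rm llama3.2`:
# ollama.delete("llama3.2")
```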

Troubleshooting

  • Model Download Fails: Check your internet connection or try a smaller model (e.g., gemma2:2b).
  • GPU Not Detected: Update GPU drivers and reboot. Check logs with OLLAMA_DEBUG=1.
  • Port Conflict: If http://localhost:11434 is in use, start the server on a different port (client code must then point at the new address, as sketched after this list):

    ```bash
    export OLLAMA_HOST=0.0.0.0:11435
    ollama serve
    ```
  • Insufficient Memory: Use smaller models or increase system RAM/VRAM.
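
If you move the server off the default port as in the Port Conflict item, point your tools at the new address: the CLI reads the OLLAMA_HOST environment variable, and the Python client accepts an explicit host. A minimal sketch, assuming the server was started on port 11435 as shown above:

```python
# custom_port_client.py - talk to an Ollama server on a non-default port.
# Assumes the server was started with OLLAMA_HOST=0.0.0.0:11435 (see above).
from ollama import Client

client = Client(host="http://localhost:11435")
response = client.generate(model="llama3.2", prompt="Say hello.")
print(response["response"])
```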

Relating to Run:AI

Run:AI is typically used for orchestrating AI workloads across clusters, often with Kubernetes, and is not a local tool like Ollama. However, you can use Ollama locally to develop and test LLMs before deploying them in a Run:AI-managed environment. Here’s how they can connect:

  • Local Development with Ollama: Use Ollama to prototype and test models like Llama 3.2 or Gemma on your local machine, ensuring privacy and quick iteration.
  • Scaling with Run:AI: Once tested, you can containerize your Ollama setup (e.g., using Docker) and deploy it on a Run:AI cluster for distributed training or inference. Run:AI can manage GPU resources for larger models or heavier workloads.
  • Steps to Integrate:
  1. Package your Ollama setup and model in a Docker container (see https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image).
  2. Use Run:AI to deploy the container on a Kubernetes cluster, allocating GPU resources as needed.
  3. Configure Run:AI to handle model inference or training jobs, leveraging its workload orchestration capabilities.

If you’re specifically looking to use Run:AI locally, note that Run:AI is designed for cluster environments. For a local equivalent, you’d rely on tools like Ollama or direct model frameworks (e.g., Hugging Face Transformers). If you want a detailed Run:AI setup guide, please provide more details about your environment (e.g., Kubernetes setup, cloud vs. on-premises).

Notes

  • Sources: This guide is based on information from the Ollama official website, GitHub, and tutorials from MachineLearningPlus, KDnuggets, and Medium articles.
  • Hardware Considerations: For larger models (e.g., deepseek-r1:32b), ensure you have a powerful GPU (e.g., NVIDIA RTX 4090) and sufficient VRAM (24GB+). Start with smaller models if your hardware is limited.
  • Privacy: Running LLMs locally with Ollama ensures your data stays on your machine, which is ideal for sensitive applications.
  • Further Customization: Explore fine-tuning models or using RAG (Retrieval-Augmented Generation) with Ollama for specific use cases.

If you need clarification on any step, specific model recommendations, or integration with Run:AI, let me know!