Step-by-Step Guide to Running Ollama Locally
Prerequisites
- Hardware: A computer with at least 8GB RAM (16GB+ recommended for larger models). A GPU (NVIDIA or AMD) is optional but recommended for faster inference. CPU-only setups work but are slower.
- Operating System: macOS, Linux, or Windows (Windows support via WSL2 or native installation).
- Internet Connection: Required to download Ollama and model files.
- Disk Space: The default Llama 3.2 model (3B parameters) needs roughly 2GB of storage; 7B-13B models typically need 4-8GB.
Step 1: Install Ollama
- Download Ollama:
- Visit the official Ollama website: https://ollama.com/download.
- Choose the installer for your operating system:
- macOS: Download the .dmg file and follow the installation prompts.
- Windows: Download the OllamaSetup.exe file and run it as an administrator. Alternatively, use WSL2 for Linux-like setup on Windows.
- Linux: Run the following command in your terminal to install Ollama:
- bash
curl -fsSL https://ollama.com/install.sh | sh
- This script installs the Ollama binary, checks for GPU drivers (NVIDIA CUDA or AMD ROCm), and sets up a systemd service for the Ollama server.
- Verify Installation:
- Open a terminal (Command Prompt, PowerShell, or Terminal on macOS/Linux).
- Run:
- bash
ollama --version
- If a version number appears (e.g., 0.1.32), Ollama is installed correctly. A quick Python check of the background server is sketched at the end of this step.
- Optional: Configure Ollama Storage:
- By default, Ollama stores models in ~/.ollama/models. To change this (e.g., to a drive with more space):
- On Linux or macOS, set the OLLAMA_MODELS environment variable in your shell:
- bash
export OLLAMA_MODELS=/path/to/your/model/directory
- On Windows, you can set this in System Properties > Environment Variables.
- Example path: H:\Ollama\Models.
- Optional: Disable Auto-Start (if desired):
- Ollama may auto-start on system boot. To disable:
- Windows: Open Task Manager (Ctrl+Shift+Esc), go to the Startup apps tab, find Ollama, right-click, and select Disable.
- Linux: Disable the systemd service:
- bash
sudo systemctl disable ollama
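- Optional: Verify the Server Is Reachable:
- In addition to the ollama --version check above, you can confirm that the background server is responding. The following is a minimal Python sketch assuming the server runs on the default address http://localhost:11434 (its root path returns a short status string); adjust the URL if you changed OLLAMA_HOST.
- python
from urllib.request import urlopen
from urllib.error import URLError

try:
    with urlopen("http://localhost:11434", timeout=5) as resp:
        # The root path typically answers with a short status string
        # such as "Ollama is running".
        print(resp.read().decode("utf-8"))
except URLError as err:
    print(f"Ollama server not reachable: {err}")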
Step 2: Pull and Run a Model
- Choose a Model:
- Visit https://ollama.com/library to browse available models (e.g., Llama 3.2, Gemma 2, Mistral).
- For beginners, start with a smaller model such as gemma:2b (1.6GB) or llama3.2:3b (~2GB) to minimize resource requirements.
- Pull a Model:
- In your terminal, run:
- bash
ollama pull llama3.2
- This downloads the default llama3.2 model (3B parameters, ~2GB) to your local machine. Download time depends on your internet speed and the model size.
- Run the Model:
- Start an interactive session with the model:
- bash
ollama run llama3.2
- This opens a Read-Eval-Print Loop (REPL) where you can type prompts and receive responses. Example:
- text
>>> What is the capital of France?
The capital of France is Paris.
>>> /bye
- Use /bye to exit the REPL.
- Run a One-Off Prompt:
- To get a response without entering the REPL:
- bash
ollama run llama3.2 "Explain the basics of machine learning."
- The model prints its response directly to the terminal and exits. (An equivalent call against the local HTTP API is sketched at the end of this step.)
- List Downloaded Models:
- To see all models on your system:
- bash
ollama list
- Example output:
- text
NAME               ID              SIZE      MODIFIED
llama3.2:latest    1234567890ab    2.1 GB    5 minutes ago
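- Optional: Call the Local API from Python:
- The Ollama server also exposes an HTTP API (by default on http://localhost:11434), so the one-off prompt above can be issued programmatically. This is a sketch assuming the /api/generate endpoint and its "response" field behave as documented for your Ollama version; it uses only the Python standard library.
- python
import json
from urllib.request import Request, urlopen

# Build a non-streaming generate request against the local Ollama API.
payload = json.dumps({
    "model": "llama3.2",
    "prompt": "Explain the basics of machine learning.",
    "stream": False,  # return one JSON object instead of a stream of chunks
}).encode("utf-8")

req = Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urlopen(req, timeout=120) as resp:
    body = json.loads(resp.read().decode("utf-8"))

# The generated text is expected under the "response" key.
print(body.get("response"))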
Step 3: Customize a Model (Optional)
- Create a Modelfile:
- A Modelfile defines a custom model’s behavior. Example: Create a model that responds like Mario from Super Mario Bros.
- Create a file named MarioModelfile with the following content:
- plaintext
FROM llama3.2
PARAMETER temperature 1
SYSTEM "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."
- Save the file (e.g., in ~/MarioModelfile).
- Create and Run the Custom Model:
- Create the model:
- bash
ollama create mario -f MarioModelfile
- Run it:
- bash
ollama run mario
- Example interaction:
- text
>>> What's your favorite activity?
It's-a me, Mario! I love jumpin’ on Goombas and savin’ Princess Peach! Wahoo!
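- Optional: Script the Custom-Model Workflow:
- The Modelfile steps above can also be automated. The sketch below simply wraps the same ollama create and ollama run commands with Python's subprocess module; the Modelfile content and the mario model name mirror the example above, and the one-off prompt form is the same as in Step 2.
- python
import pathlib
import subprocess

# Write the Modelfile shown above to disk.
modelfile = pathlib.Path("MarioModelfile")
modelfile.write_text(
    'FROM llama3.2\n'
    'PARAMETER temperature 1\n'
    'SYSTEM "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."\n'
)

# Create the custom model from the Modelfile.
subprocess.run(["ollama", "create", "mario", "-f", str(modelfile)], check=True)

# Ask the custom model a one-off question and print its reply.
result = subprocess.run(
    ["ollama", "run", "mario", "What's your favorite activity?"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)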
Step 4: Use Ollama with a Web UI (Optional)
- Install Open WebUI:
- For a ChatGPT-like interface, use Open WebUI with Ollama.
- Install Docker if not already installed (see https://www.docker.com/).
- Run the Open WebUI Docker container:
- bash
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
- This maps Open WebUI to http://localhost:3000.
- Access Open WebUI:
- Open your browser and go to http://localhost:3000.
- Sign up for an account (first-time setup).
- In the Model Selector, type the model name (e.g., llama3.2) to download or select it.
- Start chatting with the model via the web interface.
- Manage Models:
- In Open WebUI, go to Admin Settings > Connections > Ollama > Manage to download or configure models.
Step 5: Integrate with Python (Optional)
- Install the Ollama Python Library:
- In your terminal, install the library:
- bash
pip install ollama
- Run a Simple Python Script:
- Create a Python file (e.g., ollama_test.py) with the following:
- python
import ollama

# Send one prompt to the local llama3.2 model and print the generated text.
response = ollama.generate(model='llama3.2', prompt='What is a qubit?')
print(response['response'])
- Run the script:
- bash
python ollama_test.py
- This queries the llama3.2 model and prints the full response once generation finishes; a streaming variant is sketched at the end of this step.
- Alternative: Use LangChain:
- Install LangChain:
- bash
pip install langchain-community
- Example script:
- python
from langchain_community.llms import Ollama

# Wrap the local llama3.2 model as a LangChain LLM and invoke it once.
llm = Ollama(model="llama3.2")
response = llm.invoke("What is an LLM?")
print(response)
- This integrates Ollama with LangChain for more advanced workflows.
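- Stream Responses with the Ollama Library:
- For a chat-style call that prints tokens as they arrive (instead of waiting for the full response as in the script above), the ollama package also provides a chat function with a stream option. This is a sketch based on the library's documented interface; the exact response fields may vary between package versions.
- python
import ollama

# Ask a question via the chat interface and stream the reply token by token.
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What is a qubit?'}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a partial message; print without a newline
    # so the reply appears incrementally.
    print(chunk['message']['content'], end='', flush=True)
print()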
Step 6: Verify GPU Usage (if applicable)
- Ollama automatically detects NVIDIA/AMD GPUs if drivers are installed:
- NVIDIA: Ensure NVIDIA GPU drivers (version 452.39 or newer) are installed from https://www.nvidia.com/.
- AMD: Install ROCm drivers from https://www.amd.com/.
- To check if Ollama is using the GPU:
- bash
ollama ps
- If the output shows CPU instead of GPU, check the server logs for errors:
- bash
OLLAMA_DEBUG=1 ollama serve
- Ensure drivers are up-to-date and reboot the system after driver installation.
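- Optional: Check Processor Usage from a Script:
- If you want to confirm GPU offload programmatically, one lightweight approach is to wrap the same ollama ps command and scan its output. This is only a sketch: the exact column layout of ollama ps can differ between versions, so it just looks for "GPU"/"CPU" in the text. Note that ollama ps lists running models, so load one first (e.g., ollama run llama3.2).
- python
import subprocess

# Capture the output of `ollama ps` (the command shown in the step above).
output = subprocess.run(
    ["ollama", "ps"], capture_output=True, text=True, check=True
).stdout
print(output)

# Crude heuristic: flag whether the processor column mentions GPU or CPU.
if "gpu" in output.lower():
    print("At least one loaded model appears to be using the GPU.")
elif "cpu" in output.lower():
    print("Loaded models appear to be running on the CPU; check drivers and logs.")
else:
    print("No models are currently loaded; run a model first, then re-check.")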
Step 7: Manage Models
- Remove a Model:
- bash
ollama rm llama3.2
- Get Model Info:
- bash
ollama show llama3.2
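- Optional: List Models Programmatically:
- The same information as ollama list is also available over the local HTTP API. The sketch below queries the /api/tags endpoint (field names are assumed from the API documentation and may vary by version) and prints each installed model.
- python
import json
from urllib.request import urlopen

# Query the local Ollama API for installed models (default port 11434).
with urlopen("http://localhost:11434/api/tags", timeout=10) as resp:
    data = json.loads(resp.read().decode("utf-8"))

for model in data.get("models") or []:
    # "name" and "size" (bytes) are the commonly documented fields;
    # .get() keeps the sketch tolerant of version differences.
    name = model.get("name")
    size_gb = (model.get("size") or 0) / 1e9
    print(f"{name}\t{size_gb:.1f} GB")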
Troubleshooting
- Model Download Fails: Check your internet connection or try a smaller model (e.g., gemma:2b).
- GPU Not Detected: Update GPU drivers and reboot. Check logs with OLLAMA_DEBUG=1.
- Port Conflict: If http://localhost:11434 is in use, change the port:
- bash
export OLLAMA_HOST=0.0.0.0:11435
ollama serve
- Insufficient Memory: Use smaller models or increase system RAM/VRAM.
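- Pointing Clients at a Non-Default Port:
- If you move the server to another port as shown above (e.g., 11435), clients have to be told about it. The sketch below shows one way to do this with the ollama Python package, assuming its Client class accepts a host argument as documented; the CLI likewise respects the OLLAMA_HOST environment variable.
- python
import ollama

# Create a client bound to the non-default address used in the example above.
client = ollama.Client(host="http://localhost:11435")

# Requests go through the client instead of the module-level functions.
response = client.generate(model="llama3.2", prompt="What is a qubit?")
print(response["response"])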
Relating to Run:AI
Run:AI is an orchestration platform for AI workloads that runs on Kubernetes clusters; it is not a local tool like Ollama. However, you can use Ollama locally to develop and test LLMs before deploying them in a Run:AI-managed environment. Here’s how the two can connect:
- Local Development with Ollama: Use Ollama to prototype and test models like Llama 3.2 or Gemma on your local machine, ensuring privacy and quick iteration.
- Scaling with Run:AI: Once tested, you can containerize your Ollama setup (e.g., using Docker) and deploy it on a Run:AI cluster for distributed training or inference. Run:AI can manage GPU resources for larger models or higher workloads.
- Steps to Integrate:
- Package your Ollama setup and model in a Docker container (see https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image).
- Use Run:AI to deploy the container on a Kubernetes cluster, allocating GPU resources as needed.
- Configure Run:AI to handle model inference or training jobs, leveraging its workload orchestration capabilities.
If you’re specifically looking to use Run:AI locally, note that Run:AI is designed for cluster environments. For a local equivalent, you’d rely on tools like Ollama or direct model frameworks (e.g., Hugging Face Transformers). If you want a detailed Run:AI setup guide, please provide more details about your environment (e.g., Kubernetes setup, cloud vs. on-premises).
Notes
- Sources: This guide is based on information from the Ollama official website, GitHub, and tutorials from MachineLearningPlus, KDnuggets, and Medium articles.
- Hardware Considerations: For larger models (e.g., deepseek-r1:32b), ensure you have a powerful GPU (e.g., NVIDIA RTX 4090) and sufficient VRAM (24GB+). Start with smaller models if your hardware is limited.
- Privacy: Running LLMs locally with Ollama ensures your data stays on your machine, which is ideal for sensitive applications.
- Further Customization: Explore fine-tuning models or using RAG (Retrieval-Augmented Generation) with Ollama for specific use cases.
If you need clarification on any step, specific model recommendations, or integration with Run:AI, let me know!