Step-by-Step Guide to Running Ollama Locally

Prerequisites

  • Hardware: A computer with at least 8GB RAM (16GB+ recommended for larger models). A GPU (NVIDIA or AMD) is optional but recommended for faster inference. CPU-only setups work but are slower.
  • Operating System: macOS, Linux, or Windows (Windows support via WSL2 or native installation).
  • Internet Connection: Required to download Ollama and model files.
  • Disk Space: Small models such as Llama 3.2 (3B parameters) need roughly 2 GB of storage; mid-size 7B-13B models typically need 4-8 GB or more.

Step 1: Install Ollama

  1. Download Ollama:
  • Visit the official Ollama website: https://ollama.com/download.
  • Choose the installer for your operating system:
  • macOS: Download the .dmg file and follow the installation prompts.
  • Windows: Download the OllamaSetup.exe file and run it as an administrator. Alternatively, use WSL2 for a Linux-like setup on Windows.
  • Linux: Run the following command in your terminal to install Ollama:

    ```bash
    curl -fsSL https://ollama.com/install.sh | sh
    ```

  • This script installs the Ollama binary, checks for GPU drivers (NVIDIA CUDA or AMD ROCm), and sets up a systemd service for the Ollama server.
  2. Verify Installation:
  • Open a terminal (Command Prompt or PowerShell on Windows, Terminal on macOS/Linux).
  • Run:

    ```bash
    ollama --version
    ```

  • If a version number appears (e.g., 0.1.32), Ollama is installed correctly. (A quick check of the local API is also sketched at the end of this step.)
  3. Optional: Configure Ollama Storage:
  • By default, Ollama stores models in ~/.ollama/models. To change this (e.g., to a drive with more space), set the OLLAMA_MODELS environment variable.
  • On Linux/macOS:

    ```bash
    export OLLAMA_MODELS=/path/to/your/model/directory
    ```

  • On Windows, set it in System Properties > Environment Variables (example path: H:\Ollama\Models).
  4. Optional: Disable Auto-Start (if desired):
  • Ollama may auto-start on system boot. To disable this:
  • Windows: Open Task Manager (Ctrl+Shift+Esc), go to Startup Apps, find Ollama, right-click, and select Disable.
  • Linux: Disable the systemd service:

    ```bash
    sudo systemctl disable ollama
    ```
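
Beyond the CLI check above, the Ollama server also exposes a local HTTP API (by default on port 11434). The snippet below is a minimal sketch, using only the Python standard library and assuming the default port, that confirms the server itself is up and responding:

```python
# check_ollama.py - confirm the local Ollama server is reachable.
# Assumes the default address http://localhost:11434; adjust if you changed OLLAMA_HOST.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/version", timeout=5) as resp:
        info = json.load(resp)
    print("Ollama server is up, version:", info.get("version", "unknown"))
except OSError as exc:
    print("Could not reach the Ollama server:", exc)
```

If this fails on Linux, start the service with sudo systemctl start ollama (or run ollama serve manually) and try again.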

Step 2: Pull and Run a Model

  1. Choose a Model:
  • Visit https://ollama.com/library to browse available models (e.g., Llama 3.2, Gemma 2, Mistral).
  • For beginners, start with a smaller model like gemma2:2b (~1.6 GB) or llama3.2:3b (~2 GB) to minimize resource requirements.
  2. Pull a Model:
  • In your terminal, run:

    ```bash
    ollama pull llama3.2
    ```

  • This downloads the llama3.2 model to your local machine. The download time depends on your internet speed and model size (e.g., ~2 GB for the default 3B llama3.2 tag).
  3. Run the Model:
  • Start an interactive session with the model:

    ```bash
    ollama run llama3.2
    ```

  • This opens a Read-Eval-Print Loop (REPL) where you can type prompts and receive responses. Example:

    ```text
    >>> What is the capital of France?
    The capital of France is Paris.
    >>> /bye
    ```

  • Use /bye to exit the REPL.
  4. Run a One-Off Prompt:
  • To get a response without entering the REPL (a REST-based equivalent is sketched after this list):

    ```bash
    ollama run llama3.2 "Explain the basics of machine learning."
    ```

  • The model will output a response directly to the terminal.
  5. List Downloaded Models:
  • To see all models on your system:

    ```bash
    ollama list
    ```

  • Example output:

    ```text
    NAME               ID              SIZE      MODIFIED
    llama3.2:latest    1234567890ab    2.1 GB    5 minutes ago
    ```
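
The same server behind these commands also accepts HTTP requests on port 11434, which is convenient for scripting one-off prompts (this mirrors the one-off ollama run shown above). A minimal sketch using only the Python standard library, assuming the default port and that llama3.2 has already been pulled:

```python
# one_off_prompt.py - send a single prompt to the local Ollama server over HTTP.
# Assumes Ollama is listening on the default http://localhost:11434
# and that the llama3.2 model has already been pulled.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2",
    "prompt": "Explain the basics of machine learning.",
    "stream": False,  # return one complete response instead of a token stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["response"])
```

Step 5 covers the official Python client, which wraps this same API.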

Step 3: Customize a Model (Optional)

  1. Create a Modelfile:
  • A Modelfile defines a custom model’s behavior. Example: create a model that responds like Mario from Super Mario Bros.
  • Create a file named MarioModelfile with the following content:

    ```plaintext
    FROM llama3.2
    PARAMETER temperature 1
    SYSTEM "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."
    ```

  • Save the file (e.g., as ~/MarioModelfile). If you prefer to set the system prompt per request instead of baking it into a model, see the Python sketch after this step.
  2. Create and Run the Custom Model:
  • Create the model:

    ```bash
    ollama create mario -f MarioModelfile
    ```

  • Run it:

    ```bash
    ollama run mario
    ```

  • Example interaction:

    ```text
    >>> What's your favorite activity?
    It's-a me, Mario! I love jumpin’ on Goombas and savin’ Princess Peach! Wahoo!
    ```
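
A Modelfile is not the only way to apply a persona. As a request-time alternative, the official Python client (installed in Step 5 with pip install ollama) accepts a system message per request. A minimal sketch, assuming the base llama3.2 model is already pulled and the server is running:

```python
# mario_chat.py - apply a system prompt at request time instead of via a Modelfile.
# Assumes the Ollama server is running locally and llama3.2 has been pulled.
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are Mario from Super Mario Bros. Answer as Mario, the assistant, only."},
        {"role": "user", "content": "What's your favorite activity?"},
    ],
)
print(response["message"]["content"])
```

Baking the persona into a Modelfile is still handy when you want to reuse it from the CLI or a web UI without repeating the system prompt.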

Step 4: Use Ollama with a Web UI (Optional)

  1. Install Open WebUI:
  • For a ChatGPT-like interface, use Open WebUI with Ollama.
  • Install Docker if not already installed (see https://www.docker.com/).
  • Run the Open WebUI Docker container:

    ```bash
    docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
    ```

  • This maps Open WebUI to http://localhost:3000.
  2. Access Open WebUI:
  • Open your browser and go to http://localhost:3000.
  • Sign up for an account (first-time setup).
  • In the Model Selector, type the model name (e.g., llama3.2) to download or select it.
  • Start chatting with the model via the web interface.
  3. Manage Models:
  • In Open WebUI, go to Admin Settings > Connections > Ollama > Manage to download or configure models.
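
Open WebUI discovers models by querying Ollama's HTTP API. If the model selector looks empty, it can help to check what Ollama itself reports. A minimal sketch using only the Python standard library, assuming the server is reachable at the default http://localhost:11434 on the host:

```python
# list_models_api.py - ask the Ollama API which models are available locally.
# Assumes the Ollama server is reachable at the default http://localhost:11434.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    data = json.load(resp)

for model in data.get("models", []):
    print(model.get("name"))
```

If this lists your models but Open WebUI does not, check the container's Ollama connection settings; the docker run command above reaches the host through host.docker.internal.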

Step 5: Integrate with Python (Optional)

  1. Install the Ollama Python Library:
  • In your terminal, install the library:

    ```bash
    pip install ollama
    ```
  2. Run a Simple Python Script:
  • Create a Python file (e.g., ollama_test.py) with the following:

    ```python
    import ollama

    # Ask the locally running llama3.2 model a question and print its answer.
    response = ollama.generate(model='llama3.2', prompt='What is a qubit?')
    print(response['response'])
    ```

  • Run the script:

    ```bash
    python ollama_test.py
    ```

  • This queries the llama3.2 model and prints the response. (A streaming variant is sketched after this list.)
  3. Alternative: Use LangChain:
  • Install LangChain:

    ```bash
    pip install langchain-community
    ```

  • Example script:

    ```python
    from langchain_community.llms import Ollama

    # Wrap the local llama3.2 model as a LangChain LLM and invoke it once.
    llm = Ollama(model="llama3.2")
    response = llm.invoke("What is an LLM?")
    print(response)
    ```
  • This integrates Ollama with LangChain for more advanced workflows.
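
For longer answers it is often nicer to stream tokens as they are generated instead of waiting for the full response. A minimal sketch using the official Python client's streaming mode, assuming llama3.2 is pulled and the server is running:

```python
# ollama_stream.py - print a response incrementally as it is generated.
# Assumes the Ollama server is running locally with llama3.2 available.
import ollama

stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain the basics of machine learning."}],
    stream=True,  # yield partial responses instead of one final object
)

for chunk in stream:
    # Each chunk carries a fragment of the assistant's message.
    print(chunk["message"]["content"], end="", flush=True)
print()
```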

Step 6: Verify GPU Usage (if applicable)

  • Ollama automatically detects NVIDIA/AMD GPUs if drivers are installed:
  • NVIDIA: Ensure CUDA drivers (version 452.39 or newer) are installed from https://www.nvidia.com/.
  • AMD: Install ROCm drivers from https://www.amd.com/.
  • To check whether Ollama is running a model on the GPU:

    ```bash
    ollama ps
    ```

  • If the output shows the model running on CPU instead of GPU, check the server logs for errors:

    ```bash
    OLLAMA_DEBUG=1 ollama serve
    ```

  • Ensure drivers are up to date and reboot the system after driver installation.
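
The information behind ollama ps is also available from the local API, which is handy if you want to script the check. A minimal sketch, assuming the default port; the exact JSON fields (such as a VRAM size for GPU-loaded models) may vary by Ollama version:

```python
# gpu_check.py - inspect which models are currently loaded by the server.
# Assumes the Ollama server is reachable at the default http://localhost:11434.
# Run a model first (e.g., `ollama run llama3.2`) so something is loaded.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps", timeout=5) as resp:
    data = json.load(resp)

# Print the raw JSON; GPU-loaded models typically report memory held in VRAM.
print(json.dumps(data, indent=2))
```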

Step 7: Manage Models

  • Remove a Model:

    ```bash
    ollama rm llama3.2
    ```

  • Get Model Info:

    ```bash
    ollama show llama3.2
    ```
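
The Python client installed in Step 5 offers the same management operations, which is useful for scripted cleanup. A minimal sketch; the delete call is commented out because it actually removes the model:

```python
# manage_models.py - inspect and (optionally) remove models from Python.
# Mirrors `ollama show` and `ollama rm`; assumes llama3.2 is present locally.
import ollama

# Print the model's metadata (roughly what `ollama show llama3.2` reports).
print(ollama.show("llama3.2"))

# Uncomment to delete the model, equivalent to `ollama rm llama3.2`:
# ollama.delete("llama3.2")
```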

Troubleshooting

  • Model Download Fails: Check your internet connection or try a smaller model (e.g., gemma2:2b).
  • GPU Not Detected: Update GPU drivers and reboot. Check logs with OLLAMA_DEBUG=1.
  • Port Conflict: If http://localhost:11434 is in use, start the server on a different port (client code must then point at the new address, as sketched after this list):

    ```bash
    export OLLAMA_HOST=0.0.0.0:11435
    ollama serve
    ```
  • Insufficient Memory: Use smaller models or increase system RAM/VRAM.
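
If you move the server off the default port as in the Port Conflict item, point your tools at the new address: the CLI reads the OLLAMA_HOST environment variable, and the Python client accepts an explicit host. A minimal sketch, assuming the server was started on port 11435 as shown above:

```python
# custom_port_client.py - talk to an Ollama server on a non-default port.
# Assumes the server was started with OLLAMA_HOST=0.0.0.0:11435 (see above).
from ollama import Client

client = Client(host="http://localhost:11435")
response = client.generate(model="llama3.2", prompt="Say hello.")
print(response["response"])
```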

Relating to Run:AI

Run:AI is typically used for orchestrating AI workloads across clusters, often with Kubernetes, and is not a local tool like Ollama. However, you can use Ollama locally to develop and test LLMs before deploying them in a Run:AI-managed environment. Here’s how they can connect:

  • Local Development with Ollama: Use Ollama to prototype and test models like Llama 3.2 or Gemma on your local machine, ensuring privacy and quick iteration.
  • Scaling with Run:AI: Once tested, you can containerize your Ollama setup (e.g., using Docker) and deploy it on a Run:AI cluster for distributed training or inference. Run:AI can manage GPU resources for larger models or heavier workloads.
  • Steps to Integrate:
  1. Package your Ollama setup and model in a Docker container (see https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image).
  2. Use Run:AI to deploy the container on a Kubernetes cluster, allocating GPU resources as needed.
  3. Configure Run:AI to handle model inference or training jobs, leveraging its workload orchestration capabilities.

If you’re specifically looking to use Run:AI locally, note that Run:AI is designed for cluster environments. For a local equivalent, you’d rely on tools like Ollama or direct model frameworks (e.g., Hugging Face Transformers). If you want a detailed Run:AI setup guide, please provide more details about your environment (e.g., Kubernetes setup, cloud vs. on-premises).

Notes

  • Sources: This guide is based on information from the Ollama official website, GitHub, and tutorials from MachineLearningPlus, KDnuggets, and Medium articles.
  • Hardware Considerations: For larger models (e.g., deepseek-r1:32b), ensure you have a powerful GPU (e.g., NVIDIA RTX 4090) and sufficient VRAM (24GB+). Start with smaller models if your hardware is limited.
  • Privacy: Running LLMs locally with Ollama ensures your data stays on your machine, which is ideal for sensitive applications.
  • Further Customization: Explore fine-tuning models or using RAG (Retrieval-Augmented Generation) with Ollama for specific use cases.

If you need clarification on any step, specific model recommendations, or integration with Run:AI, let me know!