# OpenAI-Compatible Custom Endpoint Support in Graphiti
## Overview
This document analyzes how Graphiti handles OpenAI-compatible custom endpoints (such as OpenRouter, NagaAI, and Together.ai) and provides recommendations for improving support.
## Current Architecture
Graphiti has **three main OpenAI-compatible client implementations**:
### 1. OpenAIClient (Default)
**File**: `graphiti_core/llm_client/openai_client.py`
- Extends `BaseOpenAIClient`
- Uses the **new OpenAI Responses API** (`/v1/responses` endpoint)
- Uses `client.responses.parse()` for structured outputs (OpenAI SDK v1.91+)
- This is the **default client** exported in the public API
```python
response = await self.client.responses.parse(
    model=model,
    input=messages,
    temperature=temperature,
    max_output_tokens=max_tokens,
    text_format=response_model,
    reasoning={'effort': reasoning},
    text={'verbosity': verbosity},
)
```
### 2. OpenAIGenericClient (Legacy)
**File**: `graphiti_core/llm_client/openai_generic_client.py`
- Uses the **standard Chat Completions API** (`/v1/chat/completions`)
- Uses `client.chat.completions.create()`
- **Only supports unstructured JSON responses** (not Pydantic schemas)
- Currently **not exported** in `__init__.py` (hidden from public API)
```python
response = await self.client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens,
    response_format={'type': 'json_object'},
)
```
### 3. AzureOpenAILLMClient
**File**: `graphiti_core/llm_client/azure_openai_client.py`
- Azure-specific implementation
- Also uses `responses.parse()` like `OpenAIClient`
- Handles Azure-specific authentication and endpoints (see the configuration sketch below)
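For orientation, a minimal configuration sketch. The `AsyncAzureOpenAI` constructor below is standard OpenAI SDK usage; the `AzureOpenAILLMClient` arguments (`azure_client`, `config`) are assumptions here, so check the class for its actual signature:
```python
from openai import AsyncAzureOpenAI

from graphiti_core.llm_client import LLMConfig
from graphiti_core.llm_client.azure_openai_client import AzureOpenAILLMClient

# Standard OpenAI SDK client for Azure; endpoint, key, and API version are placeholders.
azure_client = AsyncAzureOpenAI(
    api_key="your-azure-api-key",
    api_version="2024-08-01-preview",
    azure_endpoint="https://your-resource.openai.azure.com",
)

# Assumption: AzureOpenAILLMClient wraps a pre-built Azure client plus an LLMConfig.
llm_client = AzureOpenAILLMClient(
    azure_client=azure_client,
    config=LLMConfig(model="gpt-4o-mini"),
)
```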
## The Root Problem
### Issue Description
When users configure Graphiti with custom OpenAI-compatible endpoints, they encounter errors because:
1. **`OpenAIClient` uses the new `/v1/responses` endpoint** via `client.responses.parse()`
   - This is a **new OpenAI API** (introduced in OpenAI SDK v1.91.0) for structured outputs
   - This endpoint is **proprietary to OpenAI** and **not part of the standard OpenAI-compatible API specification**
2. **Most OpenAI-compatible services** (OpenRouter, NagaAI, Ollama, Together.ai, etc.) **only implement** the standard `/v1/chat/completions` endpoint
   - They do **NOT** implement `/v1/responses`
3. When you configure a `base_url` pointing to these services, Graphiti tries to call:
   ```
   https://your-custom-endpoint.com/v1/responses
   ```
   instead of the expected:
   ```
   https://your-custom-endpoint.com/v1/chat/completions
   ```
### Example Error Scenario
```python
from graphiti_core import Graphiti
from graphiti_core.llm_client import OpenAIClient, LLMConfig
config = LLMConfig(
    api_key="sk-or-v1-...",
    model="meta-llama/llama-3-8b-instruct",
    base_url="https://openrouter.ai/api/v1"
)
llm_client = OpenAIClient(config=config)
graphiti = Graphiti(uri, user, password, llm_client=llm_client)
# This will fail because OpenRouter doesn't have /v1/responses endpoint
# Error: 404 Not Found - https://openrouter.ai/api/v1/responses
```
## Current Workaround (Documented)
The README documents using `OpenAIGenericClient` with Ollama:
```python
from graphiti_core.llm_client.openai_generic_client import OpenAIGenericClient
from graphiti_core.llm_client.config import LLMConfig
llm_config = LLMConfig(
    api_key="ollama",
    model="deepseek-r1:7b",
    base_url="http://localhost:11434/v1"
)
llm_client = OpenAIGenericClient(config=llm_config)
```
### Limitations of Current Workaround
- `OpenAIGenericClient` **doesn't support structured outputs with Pydantic models**
- It only returns raw JSON and manually validates schemas (see the validation sketch below)
- It's not the recommended/default client
- It's **not exported** in the public API (`graphiti_core.llm_client`)
- Users must know to import from the internal module path
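If you need typed results with the current workaround, one option is to validate the generic client's raw JSON yourself. A minimal sketch, assuming the client hands back a plain `dict`; the `ExtractedEntity` model is illustrative and not part of Graphiti:
```python
from pydantic import BaseModel, ValidationError


class ExtractedEntity(BaseModel):
    """Illustrative schema; not one of Graphiti's prompt models."""

    name: str
    entity_type: str


def validate_llm_output(raw: dict) -> ExtractedEntity | None:
    """Turn loose JSON from OpenAIGenericClient into a typed object, or None on mismatch."""
    try:
        return ExtractedEntity.model_validate(raw)
    except ValidationError as e:
        # Caller can retry the LLM call or fall back to a default here.
        print(f'LLM output did not match the schema: {e}')
        return None
```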
## Recommended Solutions
### Priority 1: Quick Wins (High Priority)
#### 1.1 Export `OpenAIGenericClient` in Public API
**File**: `graphiti_core/llm_client/__init__.py`
**Current**:
```python
from .client import LLMClient
from .config import LLMConfig
from .errors import RateLimitError
from .openai_client import OpenAIClient
__all__ = ['LLMClient', 'OpenAIClient', 'LLMConfig', 'RateLimitError']
```
**Proposed**:
```python
from .client import LLMClient
from .config import LLMConfig
from .errors import RateLimitError
from .openai_client import OpenAIClient
from .openai_generic_client import OpenAIGenericClient
__all__ = ['LLMClient', 'OpenAIClient', 'OpenAIGenericClient', 'LLMConfig', 'RateLimitError']
```
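Once exported, the client could be imported from the public package path rather than the internal module (hypothetical until the change above is merged):
```python
# Only works after the export above ships; today the internal module path is required.
from graphiti_core.llm_client import LLMConfig, OpenAIGenericClient

llm_client = OpenAIGenericClient(config=LLMConfig(base_url="https://openrouter.ai/api/v1"))
```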
#### 1.2 Add Clear Documentation
**File**: `README.md`
Add a dedicated section:
````markdown
### Using OpenAI-Compatible Endpoints (OpenRouter, NagaAI, Together.ai, etc.)
Most OpenAI-compatible services only support the standard Chat Completions API,
not OpenAI's newer Responses API. Use `OpenAIGenericClient` for these services:
**OpenRouter Example**:
```python
from graphiti_core import Graphiti
from graphiti_core.llm_client import OpenAIGenericClient, LLMConfig
config = LLMConfig(
    api_key="sk-or-v1-...",
    model="meta-llama/llama-3-8b-instruct",
    base_url="https://openrouter.ai/api/v1"
)
llm_client = OpenAIGenericClient(config=config)
graphiti = Graphiti(uri, user, password, llm_client=llm_client)
```
**Together.ai Example**:
```python
config = LLMConfig(
    api_key="your-together-api-key",
    model="meta-llama/Llama-3-70b-chat-hf",
    base_url="https://api.together.xyz/v1"
)
llm_client = OpenAIGenericClient(config=config)
```
**Note**: `OpenAIGenericClient` has limited structured output support compared to
the default `OpenAIClient`. It uses JSON mode instead of Pydantic schema validation.
````
#### 1.3 Add Better Error Messages
**File**: `graphiti_core/llm_client/openai_client.py`
Add error handling that detects the issue:
```python
async def _create_structured_completion(self, ...):
    try:
        response = await self.client.responses.parse(...)
        return response
    except openai.NotFoundError as e:
        if self.config.base_url and "api.openai.com" not in self.config.base_url:
            raise Exception(
                f"The OpenAI Responses API (/v1/responses) is not available at {self.config.base_url}. "
                f"Most OpenAI-compatible services only support /v1/chat/completions. "
                f"Please use OpenAIGenericClient instead of OpenAIClient for custom endpoints. "
                f"See: https://help.getzep.com/graphiti/guides/custom-endpoints"
            ) from e
        raise
```
### Priority 2: Better UX (Medium Priority)
#### 2.1 Add Auto-Detection Logic
**File**: `graphiti_core/llm_client/config.py`
```python
class LLMConfig:
    def __init__(
        self,
        api_key: str | None = None,
        model: str | None = None,
        base_url: str | None = None,
        temperature: float = DEFAULT_TEMPERATURE,
        max_tokens: int = DEFAULT_MAX_TOKENS,
        small_model: str | None = None,
        use_responses_api: bool | None = None,  # NEW: Auto-detect if None
    ):
        self.base_url = base_url
        self.api_key = api_key
        self.model = model
        self.small_model = small_model
        self.temperature = temperature
        self.max_tokens = max_tokens

        # Auto-detect API style based on base_url
        if use_responses_api is None:
            self.use_responses_api = self._should_use_responses_api()
        else:
            self.use_responses_api = use_responses_api

    def _should_use_responses_api(self) -> bool:
        """Determine if we should use the Responses API based on base_url."""
        if self.base_url is None:
            return True  # Default OpenAI

        # Known services that support the Responses API
        supported_services = ["api.openai.com", "azure.com"]
        return any(service in self.base_url for service in supported_services)
```
#### 2.2 Create a Unified Smart Client
**Option A**: Modify `OpenAIClient` to Fall Back
```python
class OpenAIClient(BaseOpenAIClient):
    def __init__(self, config: LLMConfig | None = None, ...):
        if config is None:
            config = LLMConfig()
        super().__init__(config, ...)
        self.use_responses_api = config.use_responses_api
        self.client = AsyncOpenAI(api_key=config.api_key, base_url=config.base_url)

    async def _create_structured_completion(self, ...):
        if self.use_responses_api:
            # Use responses.parse() for OpenAI native
            return await self.client.responses.parse(...)
        else:
            # Fall back to chat.completions with a JSON schema for compatibility
            return await self.client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
                response_format={
                    "type": "json_schema",
                    "json_schema": {
                        "name": response_model.__name__,
                        "schema": response_model.model_json_schema(),
                        "strict": False,
                    },
                },
            )
```
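With Option A in place, a single client class could cover both cases; a hypothetical usage sketch, where `use_responses_api` is the proposed flag from section 2.1 rather than an existing parameter:
```python
# Hypothetical usage once Option A and the LLMConfig flag from 2.1 exist.
config = LLMConfig(
    api_key="sk-or-v1-...",
    model="meta-llama/llama-3-8b-instruct",
    base_url="https://openrouter.ai/api/v1",
    use_responses_api=False,  # force the chat.completions fallback
)
llm_client = OpenAIClient(config=config)
```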
**Option B**: Create a Factory Function
```python
# graphiti_core/llm_client/__init__.py
def create_openai_client(
    config: LLMConfig | None = None,
    cache: bool = False,
    **kwargs,
) -> LLMClient:
    """
    Factory to create the appropriate OpenAI-compatible client.

    Automatically selects between OpenAIClient (for native OpenAI)
    and OpenAIGenericClient (for OpenAI-compatible services).

    Args:
        config: LLM configuration including base_url
        cache: Whether to enable caching
        **kwargs: Additional arguments passed to the client

    Returns:
        LLMClient: Either OpenAIClient or OpenAIGenericClient

    Example:
        >>> # Automatically uses OpenAIGenericClient for OpenRouter
        >>> config = LLMConfig(
        ...     api_key="sk-or-v1-...",
        ...     model="meta-llama/llama-3-8b-instruct",
        ...     base_url="https://openrouter.ai/api/v1",
        ... )
        >>> client = create_openai_client(config)
    """
    if config is None:
        config = LLMConfig()

    # Auto-detect based on base_url
    if config.base_url is None or "api.openai.com" in config.base_url:
        return OpenAIClient(config, cache, **kwargs)
    else:
        return OpenAIGenericClient(config, cache, **kwargs)
```
#### 2.3 Enhance `OpenAIGenericClient` with Better Structured Output Support
**File**: `graphiti_core/llm_client/openai_generic_client.py`
```python
async def _generate_response(
    self,
    messages: list[Message],
    response_model: type[BaseModel] | None = None,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    model_size: ModelSize = ModelSize.medium,
) -> dict[str, typing.Any]:
    openai_messages: list[ChatCompletionMessageParam] = []
    for m in messages:
        m.content = self._clean_input(m.content)
        if m.role == 'user':
            openai_messages.append({'role': 'user', 'content': m.content})
        elif m.role == 'system':
            openai_messages.append({'role': 'system', 'content': m.content})

    try:
        # Try to use the json_schema format (supported by more providers)
        if response_model:
            response = await self.client.chat.completions.create(
                model=self.model or DEFAULT_MODEL,
                messages=openai_messages,
                temperature=self.temperature,
                max_tokens=max_tokens or self.max_tokens,
                response_format={
                    "type": "json_schema",
                    "json_schema": {
                        "name": response_model.__name__,
                        "schema": response_model.model_json_schema(),
                        "strict": False,  # Most providers don't support strict mode
                    },
                },
            )
        else:
            response = await self.client.chat.completions.create(
                model=self.model or DEFAULT_MODEL,
                messages=openai_messages,
                temperature=self.temperature,
                max_tokens=max_tokens or self.max_tokens,
                response_format={'type': 'json_object'},
            )
        result = response.choices[0].message.content or '{}'
        return json.loads(result)
    except Exception as e:
        logger.error(f'Error in generating LLM response: {e}')
        raise
```
### Priority 3: Nice to Have (Low Priority)
#### 3.1 Provider-Specific Clients
Create convenience clients for popular providers:
```python
# graphiti_core/llm_client/openrouter_client.py
class OpenRouterClient(OpenAIGenericClient):
    """Pre-configured client for OpenRouter.

    Example:
        >>> client = OpenRouterClient(
        ...     api_key="sk-or-v1-...",
        ...     model="meta-llama/llama-3-8b-instruct",
        ... )
    """

    def __init__(
        self,
        api_key: str,
        model: str,
        temperature: float = DEFAULT_TEMPERATURE,
        max_tokens: int = DEFAULT_MAX_TOKENS,
        **kwargs,
    ):
        config = LLMConfig(
            api_key=api_key,
            model=model,
            base_url="https://openrouter.ai/api/v1",
            temperature=temperature,
            max_tokens=max_tokens,
        )
        super().__init__(config=config, **kwargs)
```
```python
# graphiti_core/llm_client/together_client.py
class TogetherClient(OpenAIGenericClient):
    """Pre-configured client for Together.ai.

    Example:
        >>> client = TogetherClient(
        ...     api_key="your-together-key",
        ...     model="meta-llama/Llama-3-70b-chat-hf",
        ... )
    """

    def __init__(
        self,
        api_key: str,
        model: str,
        temperature: float = DEFAULT_TEMPERATURE,
        max_tokens: int = DEFAULT_MAX_TOKENS,
        **kwargs,
    ):
        config = LLMConfig(
            api_key=api_key,
            model=model,
            base_url="https://api.together.xyz/v1",
            temperature=temperature,
            max_tokens=max_tokens,
        )
        super().__init__(config=config, **kwargs)
```
#### 3.2 Provider Compatibility Matrix
Add to documentation:
| Provider | Standard Client | Generic Client | Structured Outputs | Notes |
|----------|----------------|----------------|-------------------|-------|
| OpenAI | ✅ `OpenAIClient` | ✅ | ✅ Full (Responses API) | Recommended: Use `OpenAIClient` |
| Azure OpenAI | ✅ `AzureOpenAILLMClient` | ✅ | ✅ Full (Responses API) | Requires API version 2024-08-01-preview+ |
| OpenRouter | ❌ | ✅ `OpenAIGenericClient` | ⚠️ Limited (JSON Schema) | Use `OpenAIGenericClient` |
| Together.ai | ❌ | ✅ `OpenAIGenericClient` | ⚠️ Limited (JSON Schema) | Use `OpenAIGenericClient` |
| Ollama | ❌ | ✅ `OpenAIGenericClient` | ⚠️ Limited (JSON mode) | Local deployment |
| Groq | ❌ | ✅ `OpenAIGenericClient` | ⚠️ Limited (JSON Schema) | Very fast inference |
| Perplexity | ❌ | ✅ `OpenAIGenericClient` | ⚠️ Limited (JSON mode) | Primarily for search |
## Testing Recommendations
### Unit Tests
1. **Endpoint detection logic**
```python
def test_should_use_responses_api():
    # OpenAI native should use the Responses API
    config = LLMConfig(base_url="https://api.openai.com/v1")
    assert config.use_responses_api is True

    # Custom endpoints should not
    config = LLMConfig(base_url="https://openrouter.ai/api/v1")
    assert config.use_responses_api is False
```
2. **Client selection**
```python
def test_create_openai_client_auto_selection():
    # Should return OpenAIClient for OpenAI
    config = LLMConfig(api_key="test")
    client = create_openai_client(config)
    assert isinstance(client, OpenAIClient)

    # Should return OpenAIGenericClient for others
    config = LLMConfig(api_key="test", base_url="https://openrouter.ai/api/v1")
    client = create_openai_client(config)
    assert isinstance(client, OpenAIGenericClient)
```
### Integration Tests
1. **Mock server tests** with responses for both endpoints (see the sketch after this list)
2. **Real provider tests** (optional, may require API keys):
- OpenRouter
- Together.ai
- Ollama (local)
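A minimal sketch of such a mock-server fixture, assuming `respx` and `pytest-asyncio` are available (neither is asserted to be a current test dependency). It only simulates a provider that serves `/chat/completions` but not `/responses`; wiring Graphiti's clients through the mock is left to the real test:
```python
import httpx
import pytest
import respx

BASE_URL = 'https://openrouter.ai/api/v1'  # any OpenAI-compatible provider


@pytest.mark.asyncio
async def test_provider_without_responses_endpoint():
    """Simulate a provider that implements /chat/completions but not /responses."""
    with respx.mock(base_url=BASE_URL) as router:
        router.post('/responses').mock(return_value=httpx.Response(404))
        router.post('/chat/completions').mock(
            return_value=httpx.Response(
                200, json={'choices': [{'message': {'content': '{}'}}]}
            )
        )

        async with httpx.AsyncClient(base_url=BASE_URL) as http:
            assert (await http.post('/responses', json={})).status_code == 404
            assert (await http.post('/chat/completions', json={})).status_code == 200
```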
### Manual Testing Checklist
- [ ] OpenRouter with Llama models
- [ ] Together.ai with various models
- [ ] Ollama with local models
- [ ] Groq with fast models
- [ ] Verify error messages are helpful
- [ ] Test both structured and unstructured outputs
## Summary of Issues
| Issue | Current State | Impact | Priority |
|-------|---------------|--------|----------|
| `/v1/responses` endpoint usage | Used by default `OpenAIClient` | **BREAKS** all non-OpenAI providers | High |
| `OpenAIGenericClient` not exported | Hidden from public API | Users can't easily use it | High |
| Poor error messages | Generic 404 errors | Confusing for users | High |
| No auto-detection | Must manually choose client | Poor DX | Medium |
| Limited docs | Only Ollama example | Users don't know how to configure other providers | High |
| No structured output in Generic client | Only supports loose JSON | Reduced type safety for custom endpoints | Medium |
| No provider-specific helpers | Generic configuration only | More setup required | Low |
## Implementation Roadmap
### Phase 1: Quick Fixes (1-2 days)
1. Export `OpenAIGenericClient` in public API
2. Add documentation section for custom endpoints
3. Improve error messages in `OpenAIClient`
4. Add examples for OpenRouter, Together.ai
### Phase 2: Enhanced Support (3-5 days)
1. Add auto-detection logic to `LLMConfig`
2. Create factory function for client selection
3. Enhance `OpenAIGenericClient` with better JSON schema support
4. Add comprehensive tests
### Phase 3: Polish (2-3 days)
1. Create provider-specific client classes
2. Build compatibility matrix documentation
3. Add integration tests with real providers
4. Update all examples and guides
## References
- OpenAI SDK v1.91.0+ Responses API: https://platform.openai.com/docs/api-reference/responses
- OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/chat
- OpenRouter API: https://openrouter.ai/docs
- Together.ai API: https://docs.together.ai/docs/openai-api-compatibility
- Ollama OpenAI compatibility: https://github.com/ollama/ollama/blob/main/docs/openai.md
## Contributing
If you're implementing these changes, please ensure:
1. All changes follow the repository guidelines in `AGENTS.md`
2. Run `make format` before committing
3. Run `make lint` and `make test` to verify changes
4. Update documentation for any new public APIs
5. Add examples demonstrating the new functionality
## Questions or Issues?
- Open an issue: https://github.com/getzep/graphiti/issues
- Discussion: https://github.com/getzep/graphiti/discussions
- Documentation: https://help.getzep.com/graphiti