
HTTP API Reference

This reference documents all HTTP endpoints exposed by the SIE server.

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /v1/encode/:model | POST | Generate embeddings |
| /v1/score/:model | POST | Rerank items |
| /v1/extract/:model | POST | Extract entities and structured data |
| /v1/models | GET | List available models |
| /v1/models/:model | GET | Get model details |
| /v1/embeddings | POST | OpenAI-compatible embeddings |
| /healthz | GET | Liveness probe |
| /readyz | GET | Readiness probe |
| /metrics | GET | Prometheus metrics |
| /ws/status | WebSocket | Real-time worker status |

SIE defaults to msgpack for efficient binary serialization. This preserves numpy arrays natively and produces ~37% smaller payloads than JSON.

Content negotiation:

  • Content-Type: application/msgpack for requests
  • Accept: application/msgpack for responses (default)
  • Accept: application/json falls back to JSON

When JSON is requested, numpy arrays are serialized as plain JSON lists.
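For example, a Python client can speak msgpack on both sides of the request. This is a minimal sketch; the requests and msgpack packages, and any numpy ext-type handling on decode, are assumptions not specified by this reference:

import msgpack
import requests

# Sketch: msgpack request body, msgpack response requested via Accept.
# Depending on how the server packs arrays, decoding them may additionally
# require msgpack-numpy (an assumption); plain dicts and strings decode as shown.
payload = {"items": [{"text": "Hello, world!"}]}
resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    data=msgpack.packb(payload),
    headers={
        "Content-Type": "application/msgpack",
        "Accept": "application/msgpack",
    },
)
result = msgpack.unpackb(resp.content)
print(result["model"])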


POST /v1/encode/:model

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

class EncodeRequest(TypedDict, total=False):
    items: list[Item]          # Required: items to encode
    params: EncodeParams       # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]    # 'dense', 'sparse', 'multivector'
    instruction: str           # Task instruction for query encoding
    output_dtype: str          # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]    # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str                    # Client-provided ID (echoed back)
    text: str                  # Text content
    images: list[ImageInput]   # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes                # Image bytes
    format: str                # 'jpeg', 'png', 'webp'

class EncodeResponse(TypedDict, total=False):
    model: str                 # Model name used
    items: list[EncodeResult]  # One result per input item
    timing: TimingInfo         # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str                    # Echoed item ID
    dense: DenseVector         # Dense embedding
    sparse: SparseVector       # Sparse embedding
    multivector: MultiVector   # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int                  # Vector dimensionality
    dtype: str                 # 'float32', 'float16', 'int8', 'binary'
    values: list[float]        # Vector values

class SparseVector(TypedDict, total=False):
    dims: int                  # Vocabulary size
    dtype: str                 # Data type
    indices: list[int]         # Non-zero dimension indices
    values: list[float]        # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int            # Per-token embedding dimension
    num_tokens: int            # Number of tokens
    dtype: str                 # Data type
    values: list[list[float]]  # Token embeddings
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| items | list[Item] | Required | Items to encode |
| params.output_types | list[str] | ["dense"] | Output types to return |
| params.instruction | str | None | Instruction prefix for query encoding |
| params.output_dtype | str | "float32" | Output precision |
| params.options | dict | None | Runtime options (profile, lora, etc.) |

Basic encoding:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'

Multiple output types:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'

Response:

{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {
        "dims": 1024,
        "dtype": "float32",
        "values": [0.0234, -0.0891, 0.1234, ...]
      },
      "sparse": {
        "dims": 250002,
        "dtype": "float32",
        "indices": [101, 2023, 5789, ...],
        "values": [0.45, 0.32, 0.28, ...]
      }
    }
  ]
}
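
The indices/values layout of SparseVector makes lexical-style similarity a plain sparse dot product. A small sketch of consuming two results from this endpoint; the function and variable names are illustrative, not part of the API:

def sparse_dot(a: dict, b: dict) -> float:
    # Dot product over the dimensions the two sparse vectors share.
    weights = dict(zip(a["indices"], a["values"]))
    return sum(weights.get(i, 0.0) * v for i, v in zip(b["indices"], b["values"]))

# score = sparse_dot(query_result["sparse"], document_result["sparse"])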

POST /v1/score/:model

Rerank items against a query using a cross-encoder model.

class ScoreRequest(TypedDict, total=False):
    query: Item                # Required: query to score against
    items: list[Item]          # Required: items to score
    instruction: str           # Optional instruction
    options: dict[str, Any]    # Runtime options

class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None       # Echoed query ID
    scores: list[ScoreEntry]   # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None        # Echoed item ID
    score: float               # Relevance score
    rank: int                  # Position (0 = most relevant)

curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'

Response:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "scores": [
    {"item_id": "doc-1", "score": 0.891, "rank": 0},
    {"item_id": "doc-2", "score": 0.023, "rank": 1}
  ]
}
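
Because scores are returned in relevance order and echo each item_id, restoring the ranked order of the original candidates is a dictionary lookup. An illustrative sketch (the helper name is ours; it assumes each item was submitted with an id):

def apply_ranking(candidates: list[dict], score_response: dict, top_k: int = 10) -> list[dict]:
    # candidates are the Item dicts sent in the request, each carrying an "id".
    by_id = {c["id"]: c for c in candidates}
    return [by_id[s["item_id"]] for s in score_response["scores"][:top_k]]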

POST /v1/extract/:model

Extract structured data from items: entities, relations, classifications, or vision outputs.

class ExtractRequest(TypedDict, total=False):
    items: list[Item]          # Required: items to extract from
    params: ExtractParams      # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]          # Entity types for NER
    output_schema: dict        # JSON schema for structured extraction
    instruction: str           # Task instruction
    options: dict[str, Any]    # Runtime options

class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]     # NER results
    relations: list[Relation]  # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]  # Object detection
    data: dict[str, Any]       # Structured extraction results

class Entity(TypedDict, total=False):
    text: str                  # Extracted span
    label: str                 # Entity type
    score: float               # Confidence (0-1)
    start: int                 # Start character offset
    end: int                   # End character offset
    bbox: list[int]            # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str                  # Source entity
    tail: str                  # Target entity
    relation: str              # Relation type
    score: float               # Confidence

class Classification(TypedDict):
    label: str                 # Class label
    score: float               # Probability

curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'

Response:

{
  "model": "urchade/gliner_multi-v2.1",
  "items": [
    {
      "id": "item-0",
      "entities": [
        {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
        {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
        {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32}
      ]
    }
  ]
}
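
start and end are character offsets into the submitted text, so extracted spans can be verified or highlighted directly. A minimal sketch (the annotate helper is ours, not part of the API):

def annotate(text: str, entities: list[dict]) -> str:
    # Wrap each extracted span in [label: ...] markers using its character offsets.
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        out.append(text[cursor:ent["start"]])
        out.append(f'[{ent["label"]}: {text[ent["start"]:ent["end"]]}]')
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)

# For the response above:
# "[person: Tim Cook] is the [role: CEO] of [organization: Apple Inc]."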

GET /v1/models

List all available models with their capabilities.

class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str                          # Model name
    inputs: list[str]                  # Supported inputs: text, image
    outputs: list[str]                 # Supported outputs: dense, sparse, multivector
    dims: dict[str, int]               # Dimensions per output type
    loaded: bool                       # Whether model is in GPU memory
    max_sequence_length: int           # Maximum tokens
    profiles: dict[str, ProfileInfo]   # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool                   # Whether this is the default profile
    output_types: list[str]            # Output types enabled by this profile
    output_similarity: dict[str, str]  # Similarity metrics per output type

curl -H "Accept: application/json" http://localhost:8080/v1/models

Response:

{
  "models": [
    {
      "name": "BAAI/bge-m3",
      "inputs": ["text"],
      "outputs": ["dense", "sparse", "multivector"],
      "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024},
      "loaded": true,
      "max_sequence_length": 8192,
      "profiles": {}
    },
    {
      "name": "BAAI/bge-reranker-v2-m3",
      "inputs": ["text"],
      "outputs": ["score"],
      "dims": {},
      "loaded": false,
      "max_sequence_length": 8192,
      "profiles": {}
    }
  ]
}
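
A client might use the loaded flag to prefer models already resident in GPU memory; for example (a sketch assuming JSON responses and the requests package):

import requests

resp = requests.get("http://localhost:8080/v1/models", headers={"Accept": "application/json"})
loaded = [m["name"] for m in resp.json()["models"] if m["loaded"]]
print(loaded)  # e.g. ["BAAI/bge-m3"]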

POST /v1/embeddings

Drop-in replacement for OpenAI’s embeddings API.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'

Response:

{
  "object": "list",
  "model": "BAAI/bge-m3",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0891, ...]
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}

Works with OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
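
For example, the official OpenAI Python SDK only needs its base_url pointed at the server; the api_key value below is a placeholder, since authentication is not described in this reference:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused",  # placeholder; the SDK requires some value
)
resp = client.embeddings.create(model="BAAI/bge-m3", input=["Hello, world!"])
print(len(resp.data[0].embedding))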


GET /healthz

Liveness probe. Returns 200 if the server process is running.

curl http://localhost:8080/healthz
# "ok"

GET /readyz

Readiness probe. Returns 200 if the server is ready to accept traffic.

curl http://localhost:8080/readyz
# "ok"

GET /metrics

Prometheus metrics endpoint.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sie_requests_total | Counter | model, endpoint, status | Total request count |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Latency by phase |
| sie_batch_size | Histogram | model | Batch size distribution |
| sie_tokens_processed_total | Counter | model | Total tokens processed |
| sie_queue_depth | Gauge | model | Pending items per model |
| sie_model_loaded | Gauge | model, device | Model load status (1/0) |
| sie_model_memory_bytes | Gauge | model, device | GPU memory per model |

WebSocket /ws/status

Real-time worker status stream. Sends updates every 200ms.

{
  "timestamp": float,              # Unix timestamp
  "gpu": str,                      # GPU type (e.g., "l4", "a100-80gb")
  "loaded_models": list[str],      # Currently loaded models
  "server": {
    "version": str,
    "uptime_seconds": int,
    "user": str,
    "working_dir": str,
    "pid": int
  },
  "gpus": [                        # Per-GPU metrics
    {
      "index": int,
      "name": str,
      "gpu_type": str,             # Normalized type (e.g., "l4", "a100-80gb")
      "utilization_percent": float,
      "memory_used_bytes": int,
      "memory_total_bytes": int,
      "memory_threshold_pct": float,
      "temperature_c": int
    }
  ],
  "models": [                      # Per-model status
    {
      "name": str,
      "state": str,                # "loaded", "loading", "unloading", "available"
      "device": str | None,
      "memory_bytes": int,
      "queue_depth": int,
      "queue_pending_items": int,
      "config": {...}              # Model configuration
    }
  ],
  "counters": {...},               # Prometheus counter metrics
  "histograms": {...}              # Prometheus histogram metrics
}

const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
  const status = JSON.parse(event.data);
  console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
};

All endpoints return consistent error responses:

{
  "detail": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'unknown-model' not found"
  }
}

| Code | HTTP Status | Description |
| --- | --- | --- |
| MODEL_NOT_FOUND | 404 | Requested model doesn’t exist |
| INVALID_INPUT | 400 | Invalid request format |
| MODEL_NOT_LOADED | 503 | Model is not loaded or still loading |
| LORA_LOADING | 503 | LoRA adapter is loading (retry with Retry-After header) |
| QUEUE_FULL | 503 | Server overloaded, request queue is full |
| DEPENDENCY_CONFLICT | 409 | Model requires different bundle/dependencies |
| INFERENCE_ERROR | 500 | Error during model inference |
| INTERNAL_ERROR | 500 | Unexpected server error |
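
The 503 codes are transient, and LORA_LOADING in particular includes a Retry-After header. A minimal retry sketch using the requests package (the helper name and backoff values are illustrative):

import time
import requests

def post_with_retry(url: str, json_body: dict, max_attempts: int = 5) -> requests.Response:
    # Retry 503 responses, honoring Retry-After when the server provides it.
    for attempt in range(max_attempts):
        resp = requests.post(url, json=json_body, headers={"Accept": "application/json"})
        if resp.status_code != 503:
            return resp
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return resp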

Timing and tracing information is included in response headers:

| Header | Description |
| --- | --- |
| X-Total-Time | Total request time (ms) |
| X-Queue-Time | Time waiting in queue (ms) |
| X-Tokenization-Time | Preprocessing time (ms) |
| X-Inference-Time | GPU inference time (ms) |
| X-Postprocessing-Time | Postprocessing time (ms), only if > 0 |
| X-Trace-ID | OpenTelemetry trace ID for distributed tracing |
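
For example, the headers can be read from any response object (a sketch using the requests package):

import requests

resp = requests.post(
    "http://localhost:8080/v1/encode/BAAI/bge-m3",
    json={"items": [{"text": "Hello, world!"}]},
    headers={"Accept": "application/json"},
)
print("total:", resp.headers.get("X-Total-Time"), "ms")
print("inference:", resp.headers.get("X-Inference-Time"), "ms")
print("trace id:", resp.headers.get("X-Trace-ID"))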