TypeScript SDK Reference

The TypeScript SDK provides an async client for interacting with the SIE server from Node.js and browser environments.

pnpm add @sie/sdk

Or with npm:

npm install @sie/sdk

SIEClient is the async client for the SIE server. All methods return Promises.

import { SIEClient } from "@sie/sdk";

const client = new SIEClient(
  baseUrl: string,              // Server URL (e.g., "http://localhost:8080")
  options?: {
    timeout?: number,           // Request timeout in milliseconds (default: 30000)
    apiKey?: string,            // API key for authentication
    gpu?: string,               // Default GPU type for routing
    pool?: PoolSpec,            // Resource pool configuration
    waitForCapacity?: boolean,  // Auto-retry on 202 (default: false)
    provisionTimeout?: number,  // Max wait for provisioning in ms (default: 300000)
  }
);
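For example, a client configured with authentication and automatic capacity handling (the URL and option values here are illustrative):

import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://router.example.com", {
  timeout: 60000,                  // allow slower requests
  apiKey: process.env.SIE_API_KEY, // if your server requires auth
  gpu: "l4",                       // default GPU type for routing
  waitForCapacity: true,           // retry on 202 while the cluster scales up
  provisionTimeout: 120000,        // give up after 2 minutes of provisioning
});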

Generate embeddings.

async encode(
  model: string,                // Model name
  items: Item | Item[],         // Items to encode
  options?: {
    outputTypes?: OutputType[], // ["dense", "sparse", "multivector"]
    instruction?: string,       // Task instruction for instruction-tuned models
    outputDtype?: DType,        // "float32", "float16", "int8", "binary"
    isQuery?: boolean,          // Query vs document encoding
    gpu?: string,               // GPU routing
    waitForCapacity?: boolean,  // Wait for scale-up
  }
): Promise<EncodeResult | EncodeResult[]>

Returns: A single EncodeResult if a single item is passed, otherwise an array.

Example:

// Single item
const result = await client.encode("BAAI/bge-m3", { text: "Hello" });
console.log(result.dense?.slice(0, 5)); // Float32Array
// Batch
const results = await client.encode("BAAI/bge-m3", [
  { text: "First" },
  { text: "Second" },
]);
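Additional representations can be requested with outputTypes. A sketch using BAAI/bge-m3, which supports both dense and sparse outputs:

// Request dense and sparse embeddings together
const mixed = await client.encode(
  "BAAI/bge-m3",
  { text: "Hello" },
  { outputTypes: ["dense", "sparse"] }
);

// Sparse output is a pair of parallel arrays (see SparseResult below)
if (mixed.sparse) {
  console.log(mixed.sparse.indices[0], mixed.sparse.values[0]);
}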

Rerank items against a query using a cross-encoder or late interaction model. Returns items sorted by relevance score (highest first).

async score(
  model: string, // Model name (e.g., "BAAI/bge-reranker-v2-m3")
  query: Item,   // Query item with text or multivector
  items: Item[], // Items to score against query
  options?: {
    topK?: number, // Return only top K results
    gpu?: string,
    waitForCapacity?: boolean,
  }
): Promise<ScoreResult>

Example:

const result = await client.score(
  "BAAI/bge-reranker-v2-m3",
  { text: "What is Python?" },
  [{ text: "Python is..." }, { text: "Java is..." }]
);

// Scores are sorted by relevance (rank 0 = most relevant)
for (const entry of result.scores) {
  console.log(`Rank ${entry.rank}: ${entry.score.toFixed(3)}`);
}

Note: For ColBERT-style models, you can pass pre-computed multivectors to score client-side without a server round-trip. See the Scoring Utilities section.

Extract entities or structured data from text. Supports Named Entity Recognition (NER) models like GLiNER.

async extract(
  model: string,        // Model name (e.g., "urchade/gliner_multi-v2.1")
  items: Item | Item[], // Items to extract from
  options: {
    labels: string[],   // Entity types to extract (e.g., ["person", "org"])
    threshold?: number, // Minimum confidence (0-1)
    gpu?: string,
    waitForCapacity?: boolean,
  }
): Promise<ExtractResult | ExtractResult[]>

Returns: A single ExtractResult if a single item is passed, otherwise an array.

Example:

const result = await client.extract(
  "urchade/gliner_multi-v2.1",
  { text: "Tim Cook leads Apple." },
  { labels: ["person", "organization"] }
);

for (const entity of result.entities) {
  console.log(`${entity.label}: ${entity.text} (score: ${entity.score.toFixed(2)})`);
}
// Output:
// person: Tim Cook (score: 0.95)
// organization: Apple (score: 0.92)

Get available models.

async listModels(): Promise<ModelInfo[]>

Example:

const models = await client.listModels();
for (const model of models) {
  console.log(`${model.name}: ${model.outputs.join(", ")}`);
}

Get cluster capacity information.

async getCapacity(gpu?: string): Promise<CapacityInfo>

Example:

const capacity = await client.getCapacity();
console.log(`Workers: ${capacity.workerCount}, GPUs: ${capacity.liveGpuTypes}`);

// Check if L4 GPUs are available
const l4Capacity = await client.getCapacity("l4");
if (l4Capacity.workerCount > 0) {
  console.log("L4 workers available");
}

Wait for GPU capacity to become available. This is useful for pre-warming the cluster before running benchmarks.

async waitForCapacity(
  gpu: string,
  options?: {
    model?: string,        // If provided, sends a warmup encode request
    timeout?: number,      // Default: 300000ms
    pollInterval?: number, // Default: 5000ms
  }
): Promise<CapacityInfo>

Example:

// Wait for L4 capacity before running benchmarks
const capacity = await client.waitForCapacity("l4", { timeout: 300000 });
console.log(`Ready with ${capacity.workerCount} L4 workers`);
// Wait and pre-load a model
const capacityWithModel = await client.waitForCapacity("l4", { model: "BAAI/bge-m3" });

Close the client and clean up resources.

async close(): Promise<void>
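A common pattern is to pair the client with a try/finally block so resources are released even when a request throws:

const client = new SIEClient("http://localhost:8080");
try {
  const result = await client.encode("BAAI/bge-m3", { text: "Hello" });
  console.log(result.dense?.length);
} finally {
  await client.close();
}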

Input item for encode, score, and extract operations.

interface Item {
  id?: string;                        // Client-provided ID (echoed in response)
  text?: string;                      // Text content
  images?: Uint8Array[];              // Image data as byte arrays (for multimodal models)
  multivector?: Float32Array[];       // Pre-computed vectors (for client-side MaxSim)
  metadata?: Record<string, unknown>; // Custom metadata
}

Common patterns:

// Simple text
{ text: "Hello world" }

// With ID for tracking
{ id: "doc-1", text: "Document text" }

// Multimodal (for CLIP, ColPali, etc.)
{ text: "Description", images: [imageBytes] }

interface EncodeResult {
  id?: string;                  // Echoed item ID
  dense?: Float32Array;         // Dense embedding
  sparse?: SparseResult;        // Sparse embedding
  multivector?: Float32Array[]; // Per-token embeddings
  timing?: TimingInfo;          // Timing breakdown
}

interface SparseResult {
  indices: Int32Array;  // Token IDs
  values: Float32Array; // Token weights
}

interface ScoreResult {
  model?: string;       // Model used for scoring
  queryId?: string;     // Query ID (if provided in request)
  scores: ScoreEntry[]; // Sorted by score descending
}

interface ScoreEntry {
  itemId: string; // ID of the item
  score: number;  // Relevance score
  rank: number;   // Position (0 = most relevant)
}

interface ExtractResult {
  id?: string;        // Echoed item ID
  entities: Entity[]; // Extracted entities
}

interface Entity {
  text: string;    // Extracted span
  label: string;   // Entity type
  score: number;   // Confidence (0-1)
  start?: number;  // Start character offset
  end?: number;    // End character offset
  bbox?: number[]; // Bounding box [x, y, width, height] for vision models
}

interface ModelInfo {
  name: string;               // Model name/identifier
  loaded: boolean;            // Whether model weights are in memory
  inputs: string[];           // Input types: ["text"], ["text", "image"], etc.
  outputs: string[];          // Output types: ["dense"], ["dense", "sparse"], etc.
  dims?: ModelDims;           // Dimension info for each output type
  maxSequenceLength?: number; // Maximum input sequence length
}

interface CapacityInfo {
  status: string;               // "healthy", "degraded", "no_workers"
  workerCount: number;          // Number of healthy workers
  gpuCount: number;             // Number of GPUs available
  modelsLoaded: number;         // Unique models loaded across workers
  configuredGpuTypes: string[]; // GPU types configured in cluster
  liveGpuTypes: string[];       // GPU types currently running
  workers: WorkerInfo[];        // Worker details
}

interface TimingInfo {
  totalMs?: number;        // Total request time
  queueMs?: number;        // Time waiting in queue
  tokenizationMs?: number; // Tokenization time
  inferenceMs?: number;    // Model inference time
}
type OutputType = "dense" | "sparse" | "multivector";
type DType = "float32" | "float16" | "bfloat16" | "int8" | "uint8" | "binary" | "ubinary";

// Convert typed arrays to regular number arrays (for JSON serialization)
function toNumberArray(arr: Float32Array | Int32Array): number[];

// Convert number array to Float32Array
function toFloat32Array(arr: number[]): Float32Array;
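A sketch of round-tripping an embedding through JSON with these helpers (result is assumed to be an EncodeResult from an earlier encode() call):

import { toNumberArray, toFloat32Array } from "@sie/sdk";

// Typed arrays do not serialize cleanly; convert to number[] first
const plain = toNumberArray(result.dense!);
const json = JSON.stringify(plain);

// Restore a Float32Array after parsing
const restored = toFloat32Array(JSON.parse(json));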

Client-side scoring for multi-vector embeddings.

Compute MaxSim scores for ColBERT-style retrieval. MaxSim finds the maximum similarity between each query token and any document token, then sums these maximums.

function maxsim(
  query: Float32Array[],   // [numQueryTokens][dim]
  document: Float32Array[] // [numDocTokens][dim]
): number
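For intuition, a naive equivalent using dot-product similarity (illustrative only; the SDK's implementation may differ in details such as vectorization):

// Not the SDK's code: a readable re-statement of the MaxSim definition
function naiveMaxsim(query: Float32Array[], document: Float32Array[]): number {
  let total = 0;
  for (const q of query) {
    // Maximum similarity between this query token and any document token
    let best = -Infinity;
    for (const d of document) {
      let dot = 0;
      for (let i = 0; i < q.length; i++) dot += q[i] * d[i];
      if (dot > best) best = dot;
    }
    total += best; // sum the per-token maximums
  }
  return total;
}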

Example:

import { SIEClient, maxsim } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

// Encode query with isQuery=true for ColBERT models
const queryResult = await client.encode(
  "jinaai/jina-colbert-v2",
  { text: "What is ColBERT?" },
  { outputTypes: ["multivector"], isQuery: true }
);

// Encode documents (no isQuery needed for documents)
const docResults = await client.encode(
  "jinaai/jina-colbert-v2",
  documents,
  { outputTypes: ["multivector"] }
);

// Compute MaxSim scores client-side
const queryMv = queryResult.multivector!;
const scores = docResults.map((r) => maxsim(queryMv, r.multivector!));

// Rank by score (higher is more relevant)
const ranked = scores
  .map((score, idx) => ({ score, idx }))
  .sort((a, b) => b.score - a.score);

Score a query against multiple documents.

function maxsimDocuments(
  query: Float32Array[],
  documents: Float32Array[][]
): number[]
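Reusing queryMv and docResults from the example above:

// One call scores the query against every document
const docScores = maxsimDocuments(
  queryMv,
  docResults.map((r) => r.multivector!)
);
// docScores[i] is the score for documents[i]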

Batch version for multiple queries against multiple documents.

function maxsimBatch(
  queries: Float32Array[][],
  documents: Float32Array[][]
): Float32Array // Flattened [numQueries * numDocuments]
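To read an individual score out of the flattened array (assuming query-major layout, consistent with the [numQueries * numDocuments] shape; queryMvs and docMvs are hypothetical arrays of multivectors):

const flat = maxsimBatch(queryMvs, docMvs);
const numDocs = docMvs.length;

// Score of query q against document d
const scoreAt = (q: number, d: number) => flat[q * numDocs + d];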

Exception hierarchy for SDK errors.

Base class for all SDK errors.

class SIEError extends Error {
  name: "SIEError";
}

Cannot connect to server.

class SIEConnectionError extends SIEError {
  name: "SIEConnectionError";
}

Invalid request (4xx responses).

class RequestError extends SIEError {
  name: "RequestError";
  code?: string;
  statusCode?: number;
}

Server error (5xx responses).

class ServerError extends SIEError {
  name: "ServerError";
  code?: string;
  statusCode?: number;
}

No capacity available or timeout waiting for scale-up.

class ProvisioningError extends SIEError {
  name: "ProvisioningError";
  gpu?: string;
  retryAfter?: number;
}

Resource pool operation failed.

class PoolError extends SIEError {
  name: "PoolError";
  poolName?: string;
  state?: string;
}

LoRA adapter loading timeout.

class LoraLoadingError extends SIEError {
  name: "LoraLoadingError";
  lora?: string;
  model?: string;
}
Example:

import { SIEClient, RequestError, ProvisioningError } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

try {
  const result = await client.encode("unknown-model", { text: "test" });
} catch (error) {
  if (error instanceof RequestError) {
    console.log(`Invalid request: ${error.code} (${error.statusCode})`);
  } else if (error instanceof ProvisioningError) {
    console.log(`No capacity for GPU ${error.gpu}, retry after ${error.retryAfter}ms`);
  }
}

For cluster deployments with multiple GPU types, specify the target GPU:

// Per-request GPU selection
const result = await client.encode(
"BAAI/bge-m3",
items,
{ gpu: "a100-80gb" }
);
// Default GPU for all requests
const client = new SIEClient("http://router.example.com", {
gpu: "l4"
});

Available GPU types depend on your cluster configuration.
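You can discover what the cluster exposes at runtime via getCapacity():

const capacity = await client.getCapacity();
console.log(`Configured: ${capacity.configuredGpuTypes}`);
console.log(`Live: ${capacity.liveGpuTypes}`);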


Create isolated worker sets for testing or tenant isolation:

import { SIEClient } from "@sie/sdk";
const client = new SIEClient("http://router.example.com");
await client.createPool("my-test-pool", { l4: 2, "a100-40gb": 1 });
// Route requests to the pool
const result = await client.encode(
"BAAI/bge-m3",
items,
{ gpu: "my-test-pool/l4" }
);
// Check pool status
const pool = await client.getPool("my-test-pool");
console.log(`Pool state: ${pool?.status.state}`);
console.log(`Workers: ${pool?.status.assignedWorkers.length}`);
// Clean up
await client.deletePool("my-test-pool");
await client.close();

A complete example combining embedding, reranking, and extraction:

import { SIEClient } from "@sie/sdk";

// Initialize client
const client = new SIEClient("http://localhost:8080", { timeout: 60000 });

// Dense embeddings
const documents = [
  "Machine learning is a subset of artificial intelligence.",
  "Python is a popular programming language.",
  "Neural networks are inspired by the human brain.",
];
const embeddings = await client.encode(
  "BAAI/bge-m3",
  documents.map((text, i) => ({ id: `doc-${i}`, text }))
);

// Store in vector database
for (const result of embeddings) {
  if (result.dense) {
    // vectorDb.insert(result.id, result.dense);
    console.log(`Stored ${result.id}: ${result.dense.length} dimensions`);
  }
}

// Query with reranking
const query = { text: "What is machine learning?" };

// Stage 1: Vector search
const queryEmb = await client.encode("BAAI/bge-m3", query, { isQuery: true });
// const candidates = await vectorDb.search(queryEmb.dense, { topK: 100 });

// Stage 2: Rerank (using documents directly for this example)
const rerankResult = await client.score(
  "BAAI/bge-reranker-v2-m3",
  query,
  documents.map((text, i) => ({ id: `doc-${i}`, text }))
);

// Top results
console.log("\nTop results:");
for (const entry of rerankResult.scores.slice(0, 3)) {
  console.log(`  ${entry.rank + 1}. ${entry.itemId} (score: ${entry.score.toFixed(3)})`);
}

// Entity extraction
const extractResult = await client.extract(
  "urchade/gliner_multi-v2.1",
  { text: "Elon Musk founded SpaceX and leads Tesla." },
  { labels: ["person", "organization"] }
);
console.log("\nExtracted entities:");
for (const entity of extractResult.entities) {
  console.log(`  ${entity.label}: ${entity.text}`);
}

// Clean up
await client.close();