Vision Tasks

Florence-2 and Donut models extract structured data from images, including captions, OCR text, object detections, and document understanding output (parsed fields, classifications, and answers).

Image Captioning

Python

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),  # image_bytes: raw JPEG bytes
    options={"task": "<CAPTION>"}
)

for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."

TypeScript

import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [imageBytes] },  // Uint8Array of JPEG/PNG data
  { options: { task: "<CAPTION>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();
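
The Python example above assumes `image_bytes` already holds the raw image data. A minimal sketch of building the image entry from a file on disk might look like the following. `load_image` is a hypothetical helper, not part of `sie_sdk`, and the suffix-to-format mapping is an assumption based only on the `"jpeg"` and `"png"` values used on this page.

```python
from pathlib import Path

# ASSUMPTION: only the "jpeg" and "png" format values shown on this page are used.
_FORMATS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}


def load_image(path):
    """Read an image file and return the {"data", "format"} dict passed in Item(images=[...])."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported image suffix: {suffix!r}")
    return {"data": p.read_bytes(), "format": _FORMATS[suffix]}
```

With this helper, the call would become `Item(images=[load_image("photo.jpg")])`, where `photo.jpg` is a placeholder path.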

OCR (Text from Images)

Python

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),  # document_image: raw PNG bytes
    options={"task": "<OCR>"}
)

for entity in result["entities"]:
    print(entity["text"])
# Extracted text from the document image

TypeScript

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] },  // Uint8Array of PNG data
  { options: { task: "<OCR>" } }
);

for (const entity of result.entities) {
  console.log(entity.text);
}

OCR with Regions

To get text with bounding box positions (the default task):

Python

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR_WITH_REGION>"}
)

for entity in result["entities"]:
    print(f"{entity['text']} at {entity['bbox']}")

TypeScript

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] },
  { options: { task: "<OCR_WITH_REGION>" } }
);

for (const entity of result.entities) {
  console.log(`${entity.text} at ${JSON.stringify(entity.bbox)}`);
}
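
Region output is often more useful once the text is put back into reading order. The sketch below sorts entities top-to-bottom, then left-to-right. It assumes each `bbox` is `[x1, y1, x2, y2]` in pixels, which this page does not confirm, so check the actual shape returned by your server before relying on it. `reading_order` and `line_tolerance` are hypothetical names introduced for illustration.

```python
def reading_order(entities, line_tolerance=10):
    """Sort OCR entities top-to-bottom, then left-to-right.

    ASSUMPTION: entity["bbox"] is [x1, y1, x2, y2] in pixels.
    """
    def sort_key(entity):
        x1, y1, _x2, _y2 = entity["bbox"]
        # Bucket the top edge so boxes on roughly the same line sort by x.
        return (y1 // line_tolerance, x1)

    return sorted(entities, key=sort_key)
```

For example, `" ".join(e["text"] for e in reading_order(result["entities"]))` would join the recognized text in that order.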

Document Understanding

For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):

Python

result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?"
)

for entity in result["entities"]:
    print(entity["text"])
# "$42.50"

TypeScript

const result = await client.extract(
  "naver-clova-ix/donut-base-finetuned-docvqa",
  { images: [receiptImage] },
  { instruction: "What is the total amount?" }
);

for (const entity of result.entities) {
  console.log(entity.text);
}
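
Because each question is sent as a separate `instruction`, asking several questions about one document means one `extract` call per question. A minimal sketch, assuming the call shape and `result["entities"]` structure shown in the Python example above; `ask_document` is a hypothetical helper, not an SDK function.

```python
def ask_document(client, model, item, questions):
    """Return {question: [answer texts]} using one client.extract call per question."""
    answers = {}
    for question in questions:
        # Same call shape as the Python example above: model, item, instruction=...
        result = client.extract(model, item, instruction=question)
        answers[question] = [entity["text"] for entity in result["entities"]]
    return answers
```

A call might look like `ask_document(client, "naver-clova-ix/donut-base-finetuned-docvqa", item, ["What is the total amount?", "What is the date?"])`, where `item` is the `Item` built from the receipt image.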

Florence-2 Task Prompts

Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>.

Task	Task Token	Output
OCR	`<OCR>`	Extracted text
OCR with regions	`<OCR_WITH_REGION>`	Text with bounding boxes (default)
Caption	`<CAPTION>`	Image description
Detailed caption	`<DETAILED_CAPTION>`	Extended description
Object detection	`<OD>`	Bounding boxes and labels
Dense region caption	`<DENSE_REGION_CAPTION>`	Region descriptions
Phrase grounding	`<CAPTION_TO_PHRASE_GROUNDING>`	Bounding boxes for phrases in a supplied caption
Document QA	`<DocVQA>`	Answer to question
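
To avoid typos in task tokens, the table above can be captured as a lookup. This is only a sketch: the short names (`"ocr"`, `"caption"`, and so on) and the `florence_options` helper are hypothetical, while the tokens and the default are copied from the table.

```python
# Task tokens copied from the table above; the dictionary keys are hypothetical short names.
FLORENCE_TASKS = {
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "document_qa": "<DocVQA>",
}


def florence_options(task="ocr_with_region"):
    """Build the options dict for a Florence-2 task, defaulting to <OCR_WITH_REGION>."""
    try:
        return {"task": FLORENCE_TASKS[task]}
    except KeyError:
        known = ", ".join(sorted(FLORENCE_TASKS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```

The result can be passed straight through, for example `client.extract(model, item, options=florence_options("caption"))`.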

Donut Models

Donut models parse structured documents without OCR pre-processing:

naver-clova-ix/donut-base-finetuned-cord-v2 — Receipt parsing with key-value extraction (totals, line items, dates)
naver-clova-ix/donut-base-finetuned-rvlcdip — Document classification into document types (letter, invoice, memo, etc.)
naver-clova-ix/donut-base-finetuned-docvqa — Document question answering (ask natural language questions about a document image)

Vision Models

Model	Tasks
`microsoft/Florence-2-base`	Caption, OCR, detection
`microsoft/Florence-2-large`	Same tasks as Florence-2-base, higher quality
`naver-clova-ix/donut-base-finetuned-docvqa`	Document question answering
`naver-clova-ix/donut-base-finetuned-cord-v2`	Receipt parsing

See the Full model catalog for the complete list.

What’s Next

NER & Entity Extraction — named entity recognition
Relations & Classification — relation extraction and text classification
Full model catalog — all supported models