# Vision Tasks
Florence-2 and Donut models extract structured data from images — captions, OCR text, object detection, and document understanding.
## Image Captioning

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)

for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."
```

```typescript
import { SIEClient } from "@sie/sdk";

const client = new SIEClient("http://localhost:8080");

const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [imageBytes] }, // Uint8Array of JPEG/PNG data
  { options: { task: "<CAPTION>" } },
);

for (const entity of result.entities) {
  console.log(entity.text);
}

await client.close();
```

## OCR (Text from Images)
```python
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR>"},
)

for entity in result["entities"]:
    print(entity["text"])
# Extracted text from the document image
```

```typescript
const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] }, // Uint8Array of PNG data
  { options: { task: "<OCR>" } },
);

for (const entity of result.entities) {
  console.log(entity.text);
}
```

## OCR with Regions
To get text with bounding box positions (the default task):

```python
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR_WITH_REGION>"},
)

for entity in result["entities"]:
    print(f"{entity['text']} at {entity['bbox']}")
```

```typescript
const result = await client.extract(
  "microsoft/Florence-2-base",
  { images: [documentImage] },
  { options: { task: "<OCR_WITH_REGION>" } },
);

for (const entity of result.entities) {
  console.log(`${entity.text} at ${JSON.stringify(entity.bbox)}`);
}
```

## Document Understanding
For Donut models, the question is passed via the `instruction` parameter (free text appended to the task prompt):

```python
result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?",
)

for entity in result["entities"]:
    print(entity["text"])
# "$42.50"
```

```typescript
const result = await client.extract(
  "naver-clova-ix/donut-base-finetuned-docvqa",
  { images: [receiptImage] },
  { instruction: "What is the total amount?" },
);

for (const entity of result.entities) {
  console.log(entity.text);
}
```

## Florence-2 Task Prompts
Florence-2 tasks are selected via `options={"task": "<TASK_TOKEN>"}`. The default task is `<OCR_WITH_REGION>`.
| Task | Task Token | Output |
|---|---|---|
| OCR | `<OCR>` | Extracted text |
| OCR with regions | `<OCR_WITH_REGION>` | Text with bounding boxes (default) |
| Caption | `<CAPTION>` | Image description |
| Detailed caption | `<DETAILED_CAPTION>` | Extended description |
| Object detection | `<OD>` | Bounding boxes and labels |
| Dense region caption | `<DENSE_REGION_CAPTION>` | Region descriptions |
| Phrase grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | Match labels to regions |
| Document QA | `<DocVQA>` | Answer to question |
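Region-based tasks like `<OCR_WITH_REGION>` return entities with bounding boxes but no guaranteed ordering. The sketch below groups entities into reading order; it is illustrative, not part of the SDK, and assumes each entity is a dict with a `bbox` of `[x1, y1, x2, y2]` pixel coordinates (verify against your server's actual response shape):

```python
# Sketch: arrange <OCR_WITH_REGION> entities into reading order.
# Assumes entities look like {"text": ..., "bbox": [x1, y1, x2, y2]}.

def reading_order(entities, line_tolerance=10):
    """Group entities into lines by vertical position, then sort left-to-right."""
    def top(e):
        return e["bbox"][1]

    def left(e):
        return e["bbox"][0]

    ordered = sorted(entities, key=lambda e: (top(e), left(e)))
    lines, current = [], []
    for e in ordered:
        # Start a new line when the vertical gap exceeds the tolerance.
        if current and abs(top(e) - top(current[-1])) > line_tolerance:
            lines.append(sorted(current, key=left))
            current = []
        current.append(e)
    if current:
        lines.append(sorted(current, key=left))
    return [" ".join(e["text"] for e in line) for line in lines]
```

Call it as `reading_order(result["entities"])` to get one string per text line; tune `line_tolerance` to your image resolution.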
## Donut Models

Donut models parse structured documents without OCR pre-processing:

- `naver-clova-ix/donut-base-finetuned-cord-v2` — Receipt parsing with key-value extraction (totals, line items, dates)
- `naver-clova-ix/donut-base-finetuned-rvlcdip` — Document classification into document types (letter, invoice, memo, etc.)
- `naver-clova-ix/donut-base-finetuned-docvqa` — Document question answering (ask natural-language questions about a document image)
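DocVQA answers come back as free text such as `"$42.50"`. A small helper can normalize monetary answers for downstream use; this is an illustrative sketch (`parse_amount` is not part of the SDK) that assumes answers use a plain `$1,234.56`-style format:

```python
import re
from decimal import Decimal
from typing import Optional

def parse_amount(answer: str) -> Optional[Decimal]:
    """Extract the first currency-like number from a free-text answer."""
    match = re.search(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?", answer)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))
```

For example, `parse_amount("$42.50")` yields `Decimal("42.50")`, and inputs with no number yield `None`.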
## Vision Models

| Model | Tasks |
|---|---|
| `microsoft/Florence-2-base` | Caption, OCR, detection |
| `microsoft/Florence-2-large` | Same tasks as Florence-2-base, higher quality |
| `naver-clova-ix/donut-base-finetuned-docvqa` | Document question answering |
| `naver-clova-ix/donut-base-finetuned-cord-v2` | Receipt parsing |
See Full model catalog for the complete list.
## What’s Next

- NER & Entity Extraction — named entity recognition
- Relations & Classification — relation extraction and text classification
- Full model catalog — all supported models