
Vision Tasks

Florence-2 and Donut models extract structured data from images: captions, OCR text, detected objects, and answers to questions about documents.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Caption an image. image_bytes is assumed to hold the raw bytes of a JPEG file.
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": image_bytes, "format": "jpeg"}]),
    options={"task": "<CAPTION>"},
)
for entity in result["entities"]:
    print(entity["text"])
# "A golden retriever playing fetch in a park on a sunny day."

# Run OCR on a document. document_image is assumed to hold the raw bytes of a PNG file.
result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR>"},
)
for entity in result["entities"]:
    print(entity["text"])
# Extracted text from the document image
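
The snippets above use `image_bytes` and `document_image` without showing where they come from. A minimal sketch of loading them from disk follows. The helper name `image_payload` and the extension mapping are hypothetical and not part of the SDK; only the `jpeg` and `png` format names are taken from the examples above.

```python
from pathlib import Path

# Hypothetical extension-to-format mapping. Only "jpeg" and "png" appear in
# the examples above, so other formats are deliberately left out.
_FORMATS = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}


def image_payload(path):
    """Read an image file and return the dict shape used in Item(images=[...])."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in _FORMATS:
        raise ValueError(f"unsupported image extension: {suffix!r}")
    return {"data": path.read_bytes(), "format": _FORMATS[suffix]}
```

With this helper, a request item could be built as `Item(images=[image_payload("receipt.jpg")])`, where `receipt.jpg` stands in for any local file.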

To get text with bounding box positions (the default task):

result = client.extract(
    "microsoft/Florence-2-base",
    Item(images=[{"data": document_image, "format": "png"}]),
    options={"task": "<OCR_WITH_REGION>"},
)
for entity in result["entities"]:
    print(f"{entity['text']} at {entity['bbox']}")
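
Region results are often easier to use once they are sorted into reading order. The sketch below groups entities into lines and sorts each line left to right. It assumes each `bbox` is `[x_min, y_min, x_max, y_max]` in pixels, which this page does not specify, so check the actual response shape before relying on it. The function name `reading_order` and the `line_tolerance` parameter are illustrative.

```python
def reading_order(entities, line_tolerance=10):
    """Sort OCR entities top-to-bottom, then left-to-right within a line.

    ASSUMPTION: entity["bbox"] is [x_min, y_min, x_max, y_max] in pixels.
    An entity joins the current line when its top edge is within
    line_tolerance pixels of the top edge of that line's first entity.
    """
    ordered = sorted(entities, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    lines = []
    for entity in ordered:
        top = entity["bbox"][1]
        if lines and abs(top - lines[-1][0]["bbox"][1]) < line_tolerance:
            lines[-1].append(entity)
        else:
            lines.append([entity])
    return [e for line in lines for e in sorted(line, key=lambda e: e["bbox"][0])]


# Sample data shaped like result["entities"]; the values are made up.
sample = [
    {"text": "Total", "bbox": [10, 102, 60, 120]},
    {"text": "Invoice", "bbox": [10, 10, 90, 30]},
    {"text": "$42.50", "bbox": [200, 100, 260, 120]},
]
print(" ".join(e["text"] for e in reading_order(sample)))
# prints: Invoice Total $42.50
```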

For Donut models, the question is passed via the instruction parameter (free text appended to the task prompt):

result = client.extract(
    "naver-clova-ix/donut-base-finetuned-docvqa",
    Item(images=[{"data": receipt_image, "format": "jpeg"}]),
    instruction="What is the total amount?",
)
for entity in result["entities"]:
    print(entity["text"])
# "$42.50"
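
Because each call carries a single question, asking several questions about the same document means one request per question. A minimal sketch of that loop follows. The helper `ask_questions` is hypothetical. It assumes `client.extract` accepts the `instruction` keyword and returns a dict with an `entities` list, as in the example above, and it takes the first entity as the answer.

```python
DOCVQA_MODEL = "naver-clova-ix/donut-base-finetuned-docvqa"


def ask_questions(client, item, questions, model=DOCVQA_MODEL):
    """Ask several questions about one document image.

    Returns {question: answer text, or None if no entity came back}.
    ASSUMPTION: result["entities"][0]["text"] holds the answer, mirroring
    the single-question example above.
    """
    answers = {}
    for question in questions:
        result = client.extract(model, item, instruction=question)
        entities = result["entities"]
        answers[question] = entities[0]["text"] if entities else None
    return answers
```

Here `client` would be the `SIEClient` instance and `item` the `Item` built from the receipt image in the example above.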

Florence-2 tasks are selected via options={"task": "<TASK_TOKEN>"}. The default task is <OCR_WITH_REGION>.

| Task | Task Token | Output |
| --- | --- | --- |
| OCR | <OCR> | Extracted text |
| OCR with regions | <OCR_WITH_REGION> | Text with bounding boxes (default) |
| Caption | <CAPTION> | Image description |
| Detailed caption | <DETAILED_CAPTION> | Extended description |
| Object detection | <OD> | Bounding boxes and labels |
| Dense region caption | <DENSE_REGION_CAPTION> | Region descriptions |
| Phrase grounding | <CAPTION_TO_PHRASE_GROUNDING> | Match labels to regions |
| Document QA | <DocVQA> | Answer to question |
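
Task tokens are case-sensitive strings with angle brackets, so a typo is easy to make. The sketch below checks a token against the table above before building the `options` dict. The names `FLORENCE_TASKS` and `florence_options` are hypothetical, and the check is a client-side convenience only; how the server handles an unknown token is not covered here.

```python
# Task tokens copied from the table above.
FLORENCE_TASKS = {
    "<OCR>",
    "<OCR_WITH_REGION>",
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<OD>",
    "<DENSE_REGION_CAPTION>",
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<DocVQA>",
}
DEFAULT_TASK = "<OCR_WITH_REGION>"


def florence_options(task=DEFAULT_TASK):
    """Return an options dict for client.extract, rejecting unknown tokens."""
    if task not in FLORENCE_TASKS:
        raise ValueError(f"unknown Florence-2 task token: {task!r}")
    return {"task": task}


print(florence_options("<OD>"))
# prints: {'task': '<OD>'}
```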

Donut models parse structured documents without OCR pre-processing:

  • naver-clova-ix/donut-base-finetuned-cord-v2 — Receipt parsing with key-value extraction (totals, line items, dates)
  • naver-clova-ix/donut-base-finetuned-rvlcdip — Document classification into document types (letter, invoice, memo, etc.)
  • naver-clova-ix/donut-base-finetuned-docvqa — Document question answering (ask natural language questions about a document image)

| Model | Tasks |
| --- | --- |
| microsoft/Florence-2-base | Caption, OCR, detection |
| microsoft/Florence-2-large | Higher quality Florence-2 |
| naver-clova-ix/donut-base-finetuned-docvqa | Document question answering |
| naver-clova-ix/donut-base-finetuned-cord-v2 | Receipt parsing |

See the full model catalog for the complete list.