Publications

Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail.

Published in Emergent Visual Abilities and Limits of Foundation Models Workshop, ECCV 2024, Milan, 2024

Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps. However, recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. In conclusion, the VQA task, which aims to predict specific product information from images, yields satisfying results, but the models perform less well at identifying specific product features, possibly due to a lack of domain-specific knowledge.
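
To make the contrast between the two paradigms concrete, the sketch below juxtaposes a classic two-step OCR pipeline against a single-step VLM query. It is a minimal illustration, not the paper's actual production system: the choice of model, the price-extraction helper, and the file path are assumptions made for the example.

```python
# Minimal sketch contrasting the two paradigms discussed in the paper.
# Not the paper's production pipeline; models and helpers are illustrative.
import re

import pytesseract  # classic OCR engine (Tesseract bindings)
from PIL import Image
from transformers import pipeline


def ocr_pipeline_answer(image_path: str) -> str | None:
    """Multi-step approach: OCR first, then rule-based extraction."""
    text = pytesseract.image_to_string(Image.open(image_path))
    # Hypothetical post-processing step: pull a price out of the raw OCR text.
    match = re.search(r"\d+[.,]\d{2}", text)
    return match.group(0) if match else None


def vlm_answer(image_path: str, question: str) -> str:
    """Single-step approach: ask a pre-trained VLM directly."""
    vqa = pipeline("visual-question-answering",
                   model="dandelin/vilt-b32-finetuned-vqa")  # example model
    result = vqa(image=Image.open(image_path), question=question)
    return result[0]["answer"]


if __name__ == "__main__":
    img = "leaflet_product.jpg"  # placeholder path
    print("OCR pipeline:", ocr_pipeline_answer(img))
    print("VLM, single step:", vlm_answer(img, "What is the price of the product?"))
```

The pipeline version encodes domain knowledge explicitly (here, a price regex), while the VLM version delegates everything to the pre-trained model, which is exactly the trade-off the paper examines.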

Download here

Retail-786k: a Large-Scale Dataset for Visual Entity Matching

Published in Data-centric Machine Learning Research (DMLR) Workshop, ICLR 2024, Vienna, 2024

We introduce the first publicly available large-scale dataset for “visual entity matching”, based on a production-level use case in the retail domain. Using scanned advertisement leaflets, collected over several years from different European retailers, we provide a total of ~786k manually annotated, high-resolution product images containing ~18k different individual retail products which are grouped into ~3k entities.
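
As a quick orientation for working with data organized this way, the following sketch builds the entity-level grouping described above. The annotation file name and column layout are assumptions made for illustration; consult the actual dataset release for the real format.

```python
# Sketch: grouping product images by their annotated entity.
# The annotation file name and columns below are assumed for illustration;
# the actual Retail-786k release may use a different layout.
import csv
from collections import defaultdict

entities: dict[str, list[str]] = defaultdict(list)

with open("annotations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Assumed columns: image_path, product_id, entity_id
        entities[row["entity_id"]].append(row["image_path"])

print(f"{len(entities)} entities")  # ~3k entities expected per the paper
sizes = sorted((len(v) for v in entities.values()), reverse=True)
print("largest entity holds", sizes[0], "images")
```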

Download here

Fine-Grained Product Classification on Leaflet Advertisements

Published in FGVC10: 10th Workshop on Fine-grained Visual Categorization, CVPR 2023, Vancouver, 2023

In this paper, we describe the first publicly available fine-grained product recognition dataset based on leaflet images. We provide a total of 41.6k manually annotated product images in 832 classes. Further, we investigate three different approaches for this fine-grained product classification task.
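
For readers who want a starting point, here is a generic fine-tuning baseline for an 832-class classifier. This is a sketch under standard torchvision assumptions (ImageFolder-style directory layout, ResNet-50 backbone) and not necessarily one of the three approaches evaluated in the paper.

```python
# Generic fine-grained classification baseline: fine-tune a pre-trained
# ResNet-50 on the 832 leaflet product classes. A sketch, not the paper's
# exact setup; the dataset directory layout is assumed to be ImageFolder-style.
import torch
from torch import nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 832

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the head

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumed layout: train/<class_name>/<image>.jpg
train_set = datasets.ImageFolder("train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```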

Download here