Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail.
Published in Emergent Visual Abilities and Limits of Foundation Models Workshop, ECCV 2024, Milan
Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps. However, recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs on VQA and OCR [5, 9, 12] tasks in a production-level scenario. We conclude that VLM performance on the VQA task, which aims to predict specific product information from images, is satisfactory, but the models perform less well at identifying specific product features, possibly due to a lack of domain-specific knowledge.
Download here