Scientific papers combine figures, equations and prose in ways that defeat current vision-language models. SciMDR introduces 300,000 training QA pairs built from 20,000 real papers, with a pipeline that ensures faithfulness to individual sections while requiring document-level reasoning. Models fine-tuned on SciMDR show strong gains on science-focused multimodal benchmarks.

Comments on "SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning"