Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee · 2023

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · baseline 1

citation-polarity summary

background 2 · baseline 1

representative citing papers

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

cs.CV · 2026-05-18 · unverdicted · novelty 8.0

CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

cs.RO · 2026-03-31 · unverdicted · novelty 6.0

DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.

CoRoVA: Compressed Representations for Vector-Augmented Code Completion

cs.CL · 2025-10-22 · unverdicted · novelty 6.0

CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

cs.AI · 2024-07-06 · conditional · novelty 6.0

LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.

Long Context Transfer from Language to Vision

cs.CV · 2024-06-24 · unverdicted · novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

cs.CL · 2023-06-05 · conditional · novelty 6.0

A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.

Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

cs.CL · 2026-03-15 · unverdicted · novelty 5.0

A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

cs.CL · 2025-09-22 · unverdicted · novelty 5.0

QWHA proposes Walsh-Hadamard Transform adapters with adaptive initialization for quantization-aware PEFT, claiming better low-bit accuracy and faster training than low-rank or other FT-based baselines.

Common Inpainted Objects In-N-Out of Context

cs.CV · 2025-05-31 · unverdicted · novelty 5.0

COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.

Advancing AI Research Assistants with Expert-Involved Learning

cs.AI · 2025-05-03 · unverdicted · novelty 5.0

ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 13 of 13 citing papers.

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models
cs.CV · 2026-05-18 · unverdicted · none · ref 30

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
cs.CV · 2026-05-21 · unverdicted · none · ref 1

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
cs.RO · 2026-03-31 · unverdicted · none · ref 17

CoRoVA: Compressed Representations for Vector-Augmented Code Completion
cs.CL · 2025-10-22 · unverdicted · none · ref 3

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
cs.AI · 2024-07-06 · conditional · none · ref 38

Long Context Transfer from Language to Vision
cs.CV · 2024-06-24 · unverdicted · none · ref 49

Orca: Progressive Learning from Complex Explanation Traces of GPT-4
cs.CL · 2023-06-05 · conditional · none · ref 24

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
cs.CV · 2026-05-18 · unverdicted · none · ref 14

Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation
cs.CL · 2026-03-15 · unverdicted · none · ref 5

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
cs.CL · 2025-09-22 · unverdicted · none · ref 34

Common Inpainted Objects In-N-Out of Context
cs.CV · 2025-05-31 · unverdicted · none · ref 18

Advancing AI Research Assistants with Expert-Involved Learning
cs.AI · 2025-05-03 · unverdicted · none · ref 56

Seed1.5-VL Technical Report
cs.CV · 2025-05-11 · unverdicted · none · ref 79
