CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.
hub
Visual instruction tuning
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
QWHA proposes Walsh-Hadamard Transform adapters with adaptive initialization for quantization-aware PEFT, claiming better low-bit accuracy and faster training than low-rank or other FT-based baselines.
COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.
ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
citing papers explorer
-
Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models
CiF is a large new civil infrastructure segmentation dataset that shows zero-shot foundation models and domain-supervised models plateau at roughly 25% mAP, establishing infrastructure inspection as an open challenge for current visual AI.
-
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.
-
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
-
CoRoVA: Compressed Representations for Vector-Augmented Code Completion
CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
-
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.
-
OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
OmniSelect is a training-free, modality-adaptive token pruning framework that dynamically selects Audio-Centric, Video-Centric, or Uniform compression regimes using AudioCLIP cross-modal relevance scores and then applies adaptive fine-grained pruning within temporal groups.
-
Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
-
QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
QWHA proposes Walsh-Hadamard Transform adapters with adaptive initialization for quantization-aware PEFT, claiming better low-bit accuracy and faster training than low-rank or other FT-based baselines.
-
Common Inpainted Objects In-N-Out of Context
COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.
-
Advancing AI Research Assistants with Expert-Involved Learning
ARIEL evaluates LLMs and LMMs on full-length biomedical summarization and figure interpretation with blinded expert review, identifies limitations, and demonstrates gains from prompt engineering, fine-tuning, and an integrated agent for hypothesis generation.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.