METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

· 2005

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

cs.CV · 2026-07-02 · conditional · novelty 7.0

Proposes WUICC task and WUICC-bench dataset, then evaluates 11 image difference captioning methods plus 2 LLMs on web UI changes.

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

LLM-ReSum uses LLM self-evaluation in a closed feedback loop to refine summaries, improving factual accuracy by up to 33% and coverage by 39% with 89% human preference.

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 5.0

Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable corrections on BDD-X subsets.

citing papers explorer

Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing cs.CV · 2026-07-02 · conditional · none · ref 43
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation cs.CL · 2026-04-28 · unverdicted · none · ref 47
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 20
