Proposes WUICC task and WUICC-bench dataset, then evaluates 11 image difference captioning methods plus 2 LLMs on web UI changes.
Meteor: An automatic metric for mt evalua- tion with improved correlation with human judgments
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
LLM-ReSum uses LLM self-evaluation in a closed feedback loop to refine summaries, improving factual accuracy by up to 33% and coverage by 39% with 89% human preference.
Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable corrections on BDD-X subsets.
citing papers explorer
-
Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
Proposes WUICC task and WUICC-bench dataset, then evaluates 11 image difference captioning methods plus 2 LLMs on web UI changes.
-
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
LLM-ReSum uses LLM self-evaluation in a closed feedback loop to refine summaries, improving factual accuracy by up to 33% and coverage by 39% with 89% human preference.
-
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable corrections on BDD-X subsets.