News from arXiv cs.CL

arXiv cs.CL Papers 16 hr ago

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 Announce Type: cross Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy…

arXiv cs.CL Papers 16 hr ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm…

arXiv cs.CL Papers 16 hr ago

Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

arXiv:2606.23915v1 Announce Type: new Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers -- lexical, embedding,…

arXiv cs.CL Papers 16 hr ago

Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization

arXiv:2606.23989v1 Announce Type: new Abstract: End-to-end large language models (LLMs) produce fluent multi-document summaries but remain prone to hallucination, and the attributions they offer are typically…

arXiv cs.CL Papers 16 hr ago

Sexualised synthetic personas encode and amplify gendered power asymmetries through voice

arXiv:2606.21366v2 Announce Type: replace-cross Abstract: This work examines sexualised AI-generated English-speaking voices offered by a popular commercial platform. New technologies may enable sexual empowerment and greater…

arXiv cs.CL Papers 16 hr ago

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

arXiv:2505.18542v4 Announce Type: replace Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has…

arXiv cs.CL Papers 16 hr ago

Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

arXiv:2606.23881v1 Announce Type: new Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries to external knowledge beyond directly observable content in images. While recent multi…

arXiv cs.CL Papers 16 hr ago

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

arXiv:2606.23992v1 Announce Type: new Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently…

arXiv cs.CL Papers 16 hr ago

One Year Later…The Harms Persist, But So Do We!

arXiv:2606.23884v1 Announce Type: new Abstract: General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical…

arXiv cs.CL Papers 16 hr ago

Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability

arXiv:2606.23701v1 Announce Type: new Abstract: Qualitative product feedback can reveal nuanced user experiences, but its implicit sentiment is difficult to measure. This paper presents a scalable…
