Awesome Long-Form Factuality: A Curated List of Papers and Resources on Enhancing Factuality in Long-Form LLM Generations

🧠 Our Recent Work:
Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
[🧾 Paper] ｜ [💻 Code] ｜ [🤗 Models]

This paper proposes KLCF, a novel RL framework that mitigates hallucinations by explicitly aligning a policy model's expressed knowledge with its pretrained parametric knowledge to achieve knowledge-level consistency. Through its Dual-Fact Alignment mechanism, KLCF jointly optimizes factual recall and precision to balance comprehensiveness with truthfulness. Extensive experiments show it significantly outperforms existing baselines. Moreover, the efficient reward design enables scalable online RL training without costly external knowledge retrieval.

Overview

Hallucinations in large language models (LLMs) refer to the generation of fluent yet factually incorrect or unverifiable content, which fundamentally limits the trustworthiness of their outputs. This challenge is especially severe in long-form generation—such as essays, reports, and detailed reasoning—where small factual errors can accumulate through a snowball effect, resulting in cascading inaccuracies. In contrast, factuality measures how faithfully the generated text aligns with verifiable truth, emphasizing both precision (avoiding false statements) and recall (capturing relevant factual details comprehensively).

This repository provides a curated roadmap of research on enhancing long-form factuality in LLMs. It traces the field’s development across several key dimensions:

Some Insights — Foundational observations and reflections that contextualize the factuality problem.
Evaluation Methods for Factuality — Early efforts to decompose, verify, and quantify factual correctness.
Prompt-Based Methods — Prompt engineering strategies that steer models toward more truthful generations.
SFT and DPO Methods — Supervised and preference-based fine-tuning techniques for factual alignment.
RL Methods — Reinforcement learning frameworks, including RLHF variants, that dynamically optimize factuality.
Other Methods — Complementary and emerging strategies beyond the above paradigms.
Benchmarks — Datasets and evaluation frameworks that define and measure factual progress.
Surveys — Integrative overviews summarizing advances, limitations, and open questions.

Together, these sections aim to offer a comprehensive and evolving perspective on how the community is addressing hallucinations and advancing factual alignment in long-form generation. We warmly welcome contributions—feel free to open a PR to suggest new papers or resources!

Some Insights

Language Models (Mostly) Know What They Know (arXiv 2022-07-11). [🧾 Paper]
A Long Way to Go: Investigating Length Correlations in RLHF (COLM 2024). [🧾 Paper]
How Does Response Length Affect Long-Form Factuality (ACL 2025). [🧾 Paper] | [💻 Code]
Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts (arXiv 2025-05-23). [🧾 Paper]
Why Language Models Hallucinate (arXiv 2025-09-04). [🧾 Paper]

Evaluation Methods for Factuality

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators (EMNLP 2023). [🧾 Paper] | [💻 Code]
WiCE: Real-World Entailment for Claims in Wikipedia (EMNLP 2023). [🧾 Paper] | [💻 Code]
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (EMNLP 2023). [🧾 Paper] | [💻 Code]
Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations (ACL 2024). [🧾 Paper] | [💻 Code]
Complex Claim Verification with Evidence Retrieved in the Wild (NAACL 2024). [🧾 Paper] | [💻 Code]
Long-form factuality in large language models (NeurIPS 2024). [🧾 Paper] | [💻 Code]
AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators (ACL 2024). [🧾 Paper] | [💻 Code] | [🤗 Models]
A Closer Look at Claim Decomposition (*SEM 2024). [🧾 Paper]
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers (EMNLP 2024). [🧾 Paper] | [💻 Code]
VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation (EMNLP 2024). [🧾 Paper] | [💻 Code]
OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs (EMNLP 2024). [🧾 Paper] | [💻 Code]
LongDocFACTScore: Evaluating the factuality of long document abstractive summarisation (LREC-COLING 2024). [🧾 Paper] | [💻 Code]
Dndscore: Decontextualization and decomposition for factuality verification in long-form text generation (arXiv 2024-12-17). [🧾 Paper]
Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation (ACL 2025). [🧾 Paper] | [💻 Code]
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios (COLM 2025). [🧾 Paper] | [💻 Code]
Towards Effective Extraction and Evaluation of Factual Claims (arXiv 2025-02-15). [🧾 Paper]
VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts (arXiv 2025-05-14). [🧾 Paper]
Long-Form Information Alignment Evaluation Beyond Atomic Facts (arXiv 2025-05-21). [🧾 Paper] [💻 Code]
VeriFastScore: Speeding up long-form factuality evaluation (arXiv 2025-05-22). [🧾 Paper] | [💻 Code]
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses (arXiv 2025-09-19). [🧾 Paper]

Prompt-Based Methods

Self-Refine: Iterative Refinement with Self-Feedback (NeurIPS 2023). [🧾 Paper] | [💻 Code]
Chain-of-Verification Reduces Hallucination in Large Language Models (ACL 2024). [🧾 Paper] | [💻 Code]

SFT and DPO Methods

FLAME: Factuality-Aware Alignment for Large Language Models (NeurIPS 2024). [🧾 Paper]
FactAlign: Long-form Factuality Alignment of Large Language Models (EMNLP 2024). [🧾 Paper] | [💻 Code] |[🤗 Models]
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation (ACL 2024). [🧾 Paper] | [💻 Code]
Fine-Tuning Language Models for Factuality (ICLR 2024). [🧾 Paper] | [💻 Code]
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (ICLR 2025). [🧾 Paper] | [💻 Code]
Improving Model Factuality with Fine-grained Critique-based Evaluator (ACL 2025). [🧾 Paper]
Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering (COLING 2025). [🧾 Paper] | [💻 Code]
LongReward: Improving Long-context Large Language Models with AI Feedback (ACL 2025). [🧾 Paper] | [💻 Code]
RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models (arXiv 2025-06-03). [🧾 Paper]
Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety (arXiv 2025-09-16). [🧾 Paper]

RL Methods

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering (KDD 2024). [🧾 Paper]
Unfamiliar Finetuning Examples Control How Language Models Hallucinate (NAACL 2025). [🧾 Paper] | [💻 Code]
KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality (arXiv 2025-06-24). [🧾 Paper] | [💻 Code]
Learning to Reason for Factuality (arXiv 2025-08-07). [🧾 Paper]
BLEUBERI: BLEU is a surprisingly effective reward for instruction following (arXiv 2025-05-16). [🧾 Paper] | [💻 Code]
Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation (arXiv 2025-05-27). [🧾 Paper] | [💻 Code]
Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation (arXiv 2025-05-29). [🧾 Paper]
The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models (arXiv 2025-05-30). [🧾 Paper] | [💻 Code]
Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality (arXiv 2025-09-28). [🧾 Paper] | [💻 Code] | [🤗 Models]
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning (arXiv 2025-09-30). [🧾 Paper]
ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards (arXiv 2025-10-01). [🧾 Paper] | [💻 Code] | [🤗 Models]

Other Methods

Fact-Level Confidence Calibration and Self-Correction (arXiv 2024-11-20). [🧾 Paper] | [💻 Code]
FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models (arXiv 2025-02-25). [🧾 Paper] | [💻 Code]

Benchmarks

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (EMNLP 2023). [🧾 Paper] | [💻 Code]
ANAH: Analytical Annotation of Hallucinations in Large Language Models (ACL 2024). [🧾 Paper] | [💻 Code]
Long-form factuality in large language models (NeurIPS 2024). [🧾 Paper] | [💻 Code]
WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries (arXiv 2024-06-24). [🧾 Paper]
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis (arXiv 2024-11-29). [🧾 Paper] | [💻 Code]
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation (ACL 2025). [🧾 Paper] | [💻 Code]
HalluLens: LLM Hallucination Benchmark (ACL 2025). [🧾 Paper] | [💻 Code]
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering (EMNLP 2025). [🧾 Paper] | [💻 Code]
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them (ACL 2025). [🧾 Paper] | [💻 Code]
Precise Information Control in Long-Form Text Generation (NeurIPS 2025). [🧾 Paper] | [💻 Code]
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality (arXiv 2025-07-31). [🧾 Paper]
KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues (arXiv 2025-09-26). [🧾 Paper] | [💻 Code]

Surveys

A Survey of Hallucination in Large Foundation Models (arXiv 2023-09-12). [🧾 Paper] | [💻 Code]
Factuality of Large Language Models: A Survey (EMNLP 2024). [🧾 Paper]
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models (EMNLP 2024). [🧾 Paper]
Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study (NAACL 2024). [🧾 Paper]
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity (ACM Computing Surveys 2025). [🧾 Paper] | [💻 Code]
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions (TOIS 2025). [🧾 Paper]

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Awesome Long-Form Factuality: A Curated List of Papers and Resources on Enhancing Factuality in Long-Form LLM Generations

Overview

Table of Contents

Some Insights

Evaluation Methods for Factuality

Prompt-Based Methods

SFT and DPO Methods

RL Methods

Other Methods

Benchmarks

Surveys

Star History

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Awesome Long-Form Factuality: A Curated List of Papers and Resources on Enhancing Factuality in Long-Form LLM Generations

Overview

Table of Contents

Some Insights

Evaluation Methods for Factuality

Prompt-Based Methods

SFT and DPO Methods

RL Methods

Other Methods

Benchmarks

Surveys

Star History

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages