Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

This is the official code repository for the paper "Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review", presented at the ACL 2026 Main Conference. It contains the scripts for author response generation and evaluation described in the paper.

The paper is available at https://arxiv.org/abs/2602.11173. Star the repository to stay updated with the latest information.

In case of questions, please contact Qian Ruan.

Abstract

Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. In practice, authors possess domain expertise, author-only information, and response strategies, all concrete forms of author expertise and intent, and seek NLP assistance that integrates these signals into author response generation (ARG). Yet this author-in-the-loop paradigm lacks a formal NLP formulation and systematic study: no dataset provides fine-grained author signals, existing ARG work lacks author inputs and controls, and no evaluation measures how well responses reflect author signals or how effectively they address reviewer concerns. To fill these gaps, we introduce Re³Align, the first large-scale dataset of aligned review–response–revision triplets, where revisions proxy author signals; REspGen, an author-in-the-loop ARG framework supporting flexible author input, multi-attribute control, and evaluation-guided refinement; and REspEval, a comprehensive evaluation suite with 20+ metrics spanning input utilization, controllability, response quality, and discourse. Experiments with SOTA LLMs demonstrate the benefits of author input and evaluation-guided refinement, the impact of input richness on quality, and controllability–quality trade-offs. We release our dataset and our generation and evaluation tools.

Frameworks and Dataset

Figure 1. In this work, we contribute (1) REspGen, an author-in-the-loop ARG framework that integrates explicit author input (d), controllable planning and length (b–c), and additional paper context (e); (2) Re³Align, the first large-scale review–response–revision triplets dataset for modeling author signals; and (3) REspEval, a comprehensive response evaluation framework with over 20 metrics spanning four dimensions.

Quickstart

Clone the project from GitHub.

git clone https://github.com/UKPLab/acl2026-respgen-respeval

Set up the environment

python -m venv .acl2026-respgen-respeval
source ./.acl2026-respgen-respeval/bin/activate
pip install -r requirements.txt

Data

Download the Re³Align dataset from [1] and extract the subfolders to ./data_triplets and ./tasks_data.

[1]. https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4982
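
After extraction, a quick check such as the following can confirm that both folders are in place before running the pipeline. This is a minimal sketch: the helper name `missing_data_dirs` is hypothetical and not part of the repository, and only the two folder names come from the instructions above.

```python
from pathlib import Path

# Folder names taken from the download instructions above.
EXPECTED_DIRS = ("data_triplets", "tasks_data")


def missing_data_dirs(project_root="."):
    """Return the expected data folders that are absent or empty (hypothetical helper)."""
    root = Path(project_root)
    missing = []
    for name in EXPECTED_DIRS:
        folder = root / name
        # A folder counts as missing if it does not exist or contains nothing.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing or empty data folders:", ", ".join(missing))
    else:
        print("All data folders found.")
```

Run it from the repository root; an empty result means both folders exist and contain files.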

Author Response Generation and Evaluation

Check the 'generate_evaluate_author_response.py' script for the complete pipeline and alternative settings. You can customize the arguments within <settings> and </settings>.

For example, author response generation with GPT-4o under REspGen-Setting 6 (see the paper):

Basic Settings

# basic settings
# <settings>
task_name = 'author_response_generation'
method = 'inference_llm'
data_root_path = 'tasks_data'  # root path of the triplet linking data
train_type = None  # name of the training data in data/{task_name}; None for inference mode
val_type = None  # name of the validation data in data/{task_name}; None for inference mode
test_type = 'selected_samples'  # name of the test data in data/{task_name}
# </settings>

Load Data

from tasks.task_data_loader import TaskDataLoader
task_data_loader = TaskDataLoader(data_root=data_root_path, task_name=task_name, test_type=test_type, train_type=train_type, val_type=val_type)
train_ds, val_ds, test_ds = task_data_loader.load_data()

Specify Model and Experimental Settings

# <settings>
llm_model_name = 'gpt-4o-2024-11-20'
# paths to the saved API keys
api_key_path_dict = {'deepseek-r1': '.keys/deepseek_key.txt',
                     'gpt-4o-2024-11-20': '.keys/azure_key.txt'}
# generation settings
input_type = 'inst_nl_icl0'  # natural language instructions with 0 in-context learning examples
# an example setting (see Setting 6 in the paper)
system_prompt = 'ARR-wAIx'  # with author input
style_prompt = 'style-PH'  # insert a placeholder if author-only information is needed
sample_AIx = 'S+SecT+P+v1'  # author input as edit strings + section title + paragraph context + v1 retrieval
itemizing = 'item'  # itemize the review
planning = 'author-plan'  # author control over the response plan
length_control = 'dyn-upper-n+50'  # author control over the response length
refining = {}  # not a refining step
refining_text = ''  # not a refining step
# </settings>
inst_settings = {'system_prompt': system_prompt,
                 'style_prompt': style_prompt,
                 'sample_AIx': sample_AIx,
                 'itemizing': itemizing,
                 'planning': planning,
                 'length_control': length_control,
                 'refining': refining}
inst_settings_texts = {'system_prompt': system_prompt,
                       'style_prompt': style_prompt,
                       'sample_AIx': sample_AIx,
                       'itemizing': itemizing,
                       'planning': planning,
                       'length_control': length_control,
                       'refining': refining_text}
inst_type = [v for k, v in inst_settings_texts.items() if v != '']
inst_type = '_'.join(inst_type)
inst_type = '@'+ inst_type
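
To make the identifier construction concrete, the following self-contained sketch repeats the logic above for the Setting 6 values and for a setting without author input. The function name `build_inst_type` is an assumption introduced for illustration; the script builds the string inline. The second result matches the suffix of the gold-evaluation folder name used later in this README.

```python
def build_inst_type(settings_texts):
    """Join the non-empty setting values with '_' and prefix the result with '@'."""
    parts = [value for value in settings_texts.values() if value != '']
    return '@' + '_'.join(parts)


# Setting 6 values as listed above; 'refining' is empty because this is not a refining step.
setting_6 = {
    'system_prompt': 'ARR-wAIx',
    'style_prompt': 'style-PH',
    'sample_AIx': 'S+SecT+P+v1',
    'itemizing': 'item',
    'planning': 'author-plan',
    'length_control': 'dyn-upper-n+50',
    'refining': '',
}

# A setting without author input or controls: every optional field is left empty.
no_author_input = {
    'system_prompt': 'ARR-noAIx',
    'style_prompt': 'style-PH',
    'sample_AIx': '',
    'itemizing': '',
    'planning': '',
    'length_control': '',
    'refining': '',
}

print(build_inst_type(setting_6))
# @ARR-wAIx_style-PH_S+SecT+P+v1_item_author-plan_dyn-upper-n+50
print(build_inst_type(no_author_input))
# @ARR-noAIx_style-PH
```

Because empty values are skipped, the identifier only grows when a control is actually switched on, which keeps the names of result folders short for simple settings.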

Preprocess Data

from tasks.task_data_preprocessor import TaskDataPreprocessor
data_preprocessor = TaskDataPreprocessor(task_name=task_name, method=method).data_preprocessor
test_ds = data_preprocessor.preprocess_data(test_ds,
                                            input_type=input_type,
                                            inst_settings=inst_settings)

Create a model folder to save generations

# create output dir
# <settings>
recreate_dir = True  # True: recreate the model directory and rerun generation and evaluation if it already exists; False: keep the existing directory
# </settings>
# create model dir under ./results to save the generated outputs
output_dir = create_model_dir(task_name, method, llm_model_name, train_type, test_type, input_type, inst_type,  recreate_dir=recreate_dir)
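
`create_model_dir` is defined in the repository and its implementation is not shown in this README. Judging only from the gold-evaluation folder name that appears later (`gpt-4o-2024-11-20_None_selected_samples_inst_nl_icl0_@ARR-noAIx_style-PH`), the unique identifier appears to join the model name, training data name, test data name, input type, and instruction type with underscores. The sketch below reproduces that naming pattern only; `sketch_run_identifier` is a hypothetical stand-in, not the repository's function, and the real directory layout may differ.

```python
def sketch_run_identifier(llm_model_name, train_type, test_type, input_type, inst_type):
    """Join the run settings with '_' (assumed pattern; None is rendered as 'None')."""
    parts = (llm_model_name, train_type, test_type, input_type, inst_type)
    return '_'.join(str(part) for part in parts)


# Reproduces the gold-evaluation folder name quoted in this README.
print(sketch_run_identifier('gpt-4o-2024-11-20', None, 'selected_samples',
                            'inst_nl_icl0', '@ARR-noAIx_style-PH'))
# gpt-4o-2024-11-20_None_selected_samples_inst_nl_icl0_@ARR-noAIx_style-PH
```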

Generate and Evaluate Author Response

# specify the local model name and path if using local models like LLaMA, Qwen, etc.
local_model_name_path_dict = {
    'llama-3.3-70b-inst': '',
    'qwen3-32b': '',
    'phi-4-reasoning': '',
}
local_model_path = local_model_name_path_dict.get(llm_model_name)

# update api_settings if not using a local model
if local_model_path is None:
    assert llm_model_name in api_key_path_dict, f'Please provide the key path for the model {llm_model_name} in api_key_path_dict'
    key_path = api_key_path_dict[llm_model_name]
    # read the API version, base, and key from the key file (one name=value entry per line)
    with open(key_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # split on the first '=' only, so values that contain '=' stay intact
    api_version = lines[0].strip().split('=', 1)[1].strip()
    api_base = lines[1].strip().split('=', 1)[1].strip()
    api_key = lines[2].strip().split('=', 1)[1].strip()

    api_settings = {'api_version': api_version,  # empty string '' if not needed
                    'api_base': api_base,
                    'api_key': api_key,
                    'api_model_id': llm_model_name}
else:
    api_settings = None
############################################################################
# create generator and evaluator
from tasks.task_evaluater import TaskEvaluater
do_predict = True  # generate responses
eval_gold = False  # do not evaluate the human gold responses; set this to True once at the beginning to obtain the gold evaluation results
eval_gold_model_name = 'gpt-4o-2024-11-20_None_selected_samples_inst_nl_icl0_@ARR-noAIx_style-PH'  # model name under which the gold responses were evaluated; needed if eval_gold is False
eval_pred = True  # evaluate the model-generated responses
evaluater = TaskEvaluater(task_name=task_name, method=method).evaluater
# define the metrics to evaluate
if inst_settings['length_control'].strip() != '':
    respeval_eval_types = ['meta', 'TSP_flow', 'factuality', 'conv_spec_direct', 'len_control']
else:
    respeval_eval_types = ['meta', 'TSP_flow', 'factuality', 'conv_spec_direct']
if inst_settings['planning'].strip() != '':
    respeval_eval_types.append('plan')
if itemizing == '' and planning == '' and length_control == '' and refining == {}:
    respeval_eval_types.append('ICR')

evaluater.evaluate(test_ds,
                   output_dir=output_dir,
                   model_path=local_model_path,
                   api_settings=api_settings,
                   do_predict=do_predict,
                   eval_gold=eval_gold,
                   eval_gold_model_name=eval_gold_model_name,
                   eval_pred=eval_pred,
                   respeval_eval_types=respeval_eval_types)
# The evaluation reports are saved as JSON files in the model folder under ./results, including basic similarity-based, politeness,  and REspEval scores
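
The key-file parsing above implies a three-line file in which each line has the form `name=value`, in the order API version, API base, API key. The sketch below writes a dummy key file and parses it the same way. The names on the left of `=` (`api_version`, `api_base`, `api_key`), the placeholder values, and the helper `read_api_settings` are assumptions for illustration, since the script relies only on line order.

```python
import os
import tempfile


def read_api_settings(key_path, model_id):
    """Parse a three-line name=value key file into an api_settings dict (hypothetical helper)."""
    with open(key_path, 'r', encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) < 3:
        raise ValueError(f'{key_path} must contain three name=value lines')
    values = []
    for line in lines[:3]:
        # Split on the first '=' only, so values that contain '=' are preserved.
        _, sep, value = line.partition('=')
        if not sep:
            raise ValueError(f'Expected name=value, got: {line!r}')
        values.append(value.strip())
    api_version, api_base, api_key = values
    return {'api_version': api_version,
            'api_base': api_base,
            'api_key': api_key,
            'api_model_id': model_id}


# Dummy key file with placeholder values (not real credentials).
demo_content = ('api_version=2024-01-01\n'
                'api_base=https://example.invalid/\n'
                'api_key=abc==\n')
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_key.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(demo_content)
    settings = read_api_settings(demo_path, 'gpt-4o-2024-11-20')

print(settings['api_key'])
# abc==
```

If a provider does not need an API version, the first line can carry an empty value (for example `api_version=`), which yields the empty string that the script's comment mentions.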

Citation

Please use the following citation:

Ruan, Q., & Gurevych, I. (2026). Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review. ArXiv. https://arxiv.org/abs/2602.11173

@misc{ruan2026authorintheloopresponsegenerationevaluation,
      title={Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review}, 
      author={Qian Ruan and Iryna Gurevych},
      year={2026},
      eprint={2602.11173},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.11173}, 
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

https://intertext.ukp-lab.de/

https://www.ukp.tu-darmstadt.de

https://www.tu-darmstadt.de
