Question about Reproducing FactScore for Llama-3-8b-Instruct #1

@LuckyyySTA

Hi, thanks for your inspiring work!

I would like to ask whether the Llama-3-8b-Chat in Table 1 refers to the original "meta-llama/Meta-Llama-3-8B-Instruct" model. When I tried to reproduce the results in Table 1, I obtained a FactScore of 35.53 for Llama-3-8b-Instruct and 37.62 for Llama-3-8b-Instruct + FactAlign. Both differ substantially from the values reported in Table 1 (Llama-3-8b-Chat = 54.96, Llama-3-8b-Chat + FactAlign = 62.84).

To compute the FactScore, I used the evaluation script provided by [1]. To save costs, I evaluated with the "retrieval+llama+npm" setting rather than your "retrieval+ChatGPT" setting; based on the FActScore authors' results, the difference between the two should not be too large. I therefore suspect the gap comes from the decoding parameters. I used the default sampling decoding with temperature=1.0. What decoding strategy did you use? What other factors do you think could lead to this discrepancy?
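
To make the decoding question concrete, here is a minimal sketch of how temperature reshapes a next-token distribution compared with greedy decoding. The logits are made-up illustrative values, not taken from any model, and the helper names `softmax_with_temperature` and `greedy_choice` are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the logits at a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits):
    """Greedy decoding: always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for four candidate tokens (illustrative only).
logits = [2.0, 1.0, 0.5, 0.0]

probs_t1 = softmax_with_temperature(logits, 1.0)   # sampling at temperature=1.0
probs_t03 = softmax_with_temperature(logits, 0.3)  # a sharper, lower temperature
top = greedy_choice(logits)                        # deterministic choice

print("T=1.0:", [round(p, 3) for p in probs_t1])
print("T=0.3:", [round(p, 3) for p in probs_t03])
print("greedy index:", top)
```

At temperature 1.0 the lower-ranked tokens keep a noticeable share of the probability mass, so sampled generations can differ from greedy ones. Whether this accounts for the gap described here is only a guess.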

Thank you!

[1] https://github.com/shmsw25/FActScore
