Hi, thanks for your inspiring work!
I would like to ask whether the Llama-3-8b-Chat in Table 1 refers to the original "meta-llama/Meta-Llama-3-8B-Instruct" model. When I attempted to reproduce the results from Table 1, I obtained a FActScore of 35.53 for Llama-3-8b-instruct and 37.62 for Llama-3-8b-instruct + FactAlign. This is a significant discrepancy compared to the values in Table 1 (Llama-3-8b-chat = 54.96, Llama-3-8b-chat + FactAlign = 62.84).
For calculating the FActScore, I used the evaluation script provided by [1], and to save costs I evaluated with the "retrieval+llama+npm" estimator. Although this differs from your "retrieval+ChatGPT" setup, the FActScore authors' results suggest the gap between the two estimators shouldn't be this large. I therefore suspect the decoding parameters: I used the default sampling decoding with temperature=1.0. What decoding strategy did you use, and what other factors do you think could explain this discrepancy?
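For reference, here is a minimal sketch of the generation setup I used, in case it helps pinpoint the difference. The model ID and temperature are as stated above; the prompt placeholder, max_new_tokens, and other generation arguments are illustrative assumptions rather than values I'm claiming matter:

```python
# Sketch of my reproduction setup: sampling decoding with temperature=1.0.
# max_new_tokens and the biography-style prompt are placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# FActScore-style biography prompt (entity name substituted per example).
messages = [{"role": "user", "content": "Tell me a bio of <entity>."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Default sampling decoding with temperature=1.0, as described above.
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```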
Thank you!
[1] https://github.com/shmsw25/FActScore