# Evaluation

## Post-processing the prediction results

### BM25/DPR

After running inference with BM25 or DPR, we have a ranked list of retrieved facts with scores for each question. To produce the final output, a ranked list of answer concepts, we post-process the retrieved results:

```bash
python evaluation/process_ret_results.py \
  --drfact_format_gkb_file [drfact_data/knowledge_corpus/gkb_best.drfact_format.jsonl] \
  --pred_result_file [results/ARC/dev_prediction.BM25.jsonl]
# Bracketed paths are examples; adjust them for your target files.
```

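To illustrate what this post-processing does, the sketch below aggregates fact-level retrieval scores into concept-level scores by max-pooling over the facts that mention each concept. The field names (`concepts`, `score`) and the max-pooling choice are assumptions for illustration, not the exact logic of `process_ret_results.py`.

```python
from collections import defaultdict

def facts_to_concept_ranking(retrieved_facts):
    """Turn a ranked list of retrieved facts into a ranked list of concepts.

    `retrieved_facts` is assumed to be a list of dicts such as
    {"concepts": ["tree", "plant"], "score": 12.3}.
    """
    concept_scores = defaultdict(float)
    for fact in retrieved_facts:
        for concept in fact["concepts"]:
            # Keep the best fact score seen for each concept (max-pooling).
            concept_scores[concept] = max(concept_scores[concept], fact["score"])
    # Sort concepts by score (descending) to form the final weighted answer list.
    return sorted(concept_scores.items(), key=lambda item: -item[1])
```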

### DrKIT/DrFact

To have a unified prediction-result format, we reformat the result files of DrKIT and DrFact as follows:

```bash
input_file=[/path/to/best_predictions.json]
output_file=[results/ARC/dev_prediction.drfact.jsonl]
python evaluation/process_drx_results.py ${input_file} ${output_file}
# Bracketed paths are examples; adjust them for your target files.
```
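
For reference, the snippet below sketches one plausible shape for a line in the unified `*.jsonl` prediction file: a question identifier plus a weight-sorted list of (concept, score) pairs. The field names here are assumptions; check `evaluation/process_drx_results.py` for the exact schema.

```python
import json

# Hypothetical example of one line in the unified prediction file.
example_line = {
    "question_id": "ARC-dev-0001",                      # assumed field name
    "predictions": [["tree", 0.93], ["plant", 0.87]],   # (concept, weight), sorted by weight
}
print(json.dumps(example_line))
```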


### X + Concept Re-Ranker

For the re-ranked variants, this post-processing is already done in Step 7 of the re-ranking pipeline; no separate step is needed here.

## Metrics

### Hit@K acc and Rec@K acc

Recall that, given a question $$q$$, the final output of every method is a weighted set of concepts $$A=\{(a_1, w_1), \dots \}$$. We denote the set of true answer concepts, as defined above, by $$A^*=\{a_1^*, a_2^*, \dots \}$$. Hit@K accuracy is the fraction of questions for which at least one correct answer concept $$a_i^*\in A^*$$ appears among the top-$$K$$ concepts of $$A$$ (sorted in descending order of weight). Because a question can have multiple correct answers, recall is also an important aspect of evaluating OpenCSR, so we additionally report Rec@K, the average recall of the true answer concepts among the top-$$K$$ proposed answers.
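
A minimal sketch of the two metrics, assuming `predictions` maps each question id to its weighted concept list and `answers` maps it to the set of true answer concepts (these names are illustrative, not the repo's API):

```python
def top_k_concepts(ranked, k):
    """Top-K concepts from a list of (concept, weight) pairs, by descending weight."""
    return {concept for concept, _ in sorted(ranked, key=lambda pair: -pair[1])[:k]}

def hit_at_k(predictions, answers, k):
    """Fraction of questions whose top-K concepts contain at least one correct answer."""
    hits = sum(1 for qid, ranked in predictions.items()
               if top_k_concepts(ranked, k) & answers[qid])
    return hits / len(predictions)

def rec_at_k(predictions, answers, k):
    """Average recall of the true answer concepts among the top-K predictions."""
    recalls = [len(top_k_concepts(ranked, k) & answers[qid]) / len(answers[qid])
               for qid, ranked in predictions.items()]
    return sum(recalls) / len(recalls)
```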

### Run

```bash
python evaluation/eval_metrics.py \
  --pred_result_file [results/ARC/dev_prediction.BM25.jsonl] \
```