BM25 (Elasticsearch) for OpenCSR

Installation
Preprocessing
- Step 1. Split the corpus into multiple bulks
- Step 2. Indexing
Inference

Link to the code for the experiment: OpenCSR/baseline_methods/BM25/

Installation

# Install Elasticsearch
mkdir ~/elasticsearch
cd ~/elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.0-linux-x86_64.tar.gz
tar -xvf elasticsearch-7.9.0-linux-x86_64.tar.gz

# Install the Python client (pinned to 7.x here, on the assumption that it should match the 7.9.0 server)
pip install "elasticsearch>=7,<8"

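# Start the service (it runs in the foreground, so use a second terminal for the commands below, or pass -d to run it as a daemon)
~/elasticsearch/elasticsearch-7.9.0/bin/elasticsearch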

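# Check that the service is up
curl -X GET http://localhost:9200/

# Disable the disk-usage threshold check so that shards are still allocated on a nearly full disk
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'

# Clear the read-only block on all indices (Elasticsearch sets it when the disk flood-stage watermark is exceeded)
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'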
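
The health check can also be scripted. The following is a minimal sketch, assuming the service listens on http://localhost:9200 as above. The helper names fetch_info and server_version are illustrative and are not part of the OpenCSR code. It uses only the standard library and reads the version.number field that Elasticsearch returns from its root endpoint.

```python
import json
from urllib.request import urlopen


def fetch_info(url="http://localhost:9200/", timeout=5):
    """Return the JSON document served by the Elasticsearch root endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def server_version(info):
    """Parse the "version.number" field (e.g. "7.9.0") into a tuple of ints."""
    # Drop any build qualifier such as "-SNAPSHOT" before splitting on dots.
    number = info["version"]["number"].split("-")[0]
    return tuple(int(part) for part in number.split("."))
```

With the installation above, server_version(fetch_info()) should give (7, 9, 0). A connection error from fetch_info usually means the service has not finished starting.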

Preprocessing

Step 1. Split the corpus into multiple bulks

# Each bulk holds 50k sentences (one JSON line each); the bulks are written to /tmp/gkb_best_00, /tmp/gkb_best_01, ...
split -l 50000 ${CORPUS_PATH}/gkb_best.drfact_format.jsonl -d -a 2 /tmp/gkb_best_
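
Where GNU split is not available, the same chunking can be approximated in Python. This is a sketch under stated assumptions rather than part of the OpenCSR code: the function name split_jsonl is hypothetical, and it mirrors only the -l, -d and -a 2 behaviour used above (numeric two-digit suffixes, a fixed number of lines per bulk).

```python
from itertools import islice


def split_jsonl(corpus_path, prefix, lines_per_bulk=50000):
    """Write consecutive bulks of `lines_per_bulk` lines to prefix00, prefix01, ...

    Returns the list of bulk file paths that were written.
    """
    written = []
    with open(corpus_path, encoding="utf-8") as corpus:
        while True:
            # islice consumes the next block of lines without loading the whole file.
            bulk = list(islice(corpus, lines_per_bulk))
            if not bulk:
                break
            path = f"{prefix}{len(written):02d}"
            with open(path, "w", encoding="utf-8") as out:
                out.writelines(bulk)
            written.append(path)
    return written
```

One difference from the split command: with -a 2, split stops with an error once the two-digit suffixes run out, whereas this sketch simply moves on to three-digit suffixes.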

Step 2. Indexing

# The argument is the shared prefix of the bulk files produced in Step 1
python baseline_methods/BM25/index.py /tmp/gkb_best_
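
For readers who want to see what indexing one bulk file can look like with the Python client, here is a minimal sketch. It is not the content of index.py. The function names and the index_name parameter are hypothetical, and each JSON line is stored as-is because the field layout of the corpus is not described here. It relies on elasticsearch.helpers.bulk, which accepts an iterable of action dictionaries.

```python
import json


def bulk_actions(lines, index_name):
    """Yield one bulk-index action per non-empty JSONL line.

    `index_name` is a hypothetical index name; the one used by index.py may differ.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield {"_index": index_name, "_source": json.loads(line)}


def index_bulk_file(client, path, index_name):
    """Send every document in one bulk file to Elasticsearch; return the success count."""
    # Imported lazily so that bulk_actions can be used without the client installed.
    from elasticsearch import helpers

    with open(path, encoding="utf-8") as bulk_file:
        successes, _errors = helpers.bulk(client, bulk_actions(bulk_file, index_name))
    return successes
```

A possible call, assuming the service from the Installation section is running and using "gkb_best" as a made-up index name: index_bulk_file(Elasticsearch("http://localhost:9200"), "/tmp/gkb_best_00", "gkb_best"), where Elasticsearch is imported from the elasticsearch package.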

Inference

python baseline_methods/BM25/search.py \
  --linked_qa_file <path to linked_train.jsonl, linked_dev.jsonl, or linked_test.jsonl>
# For example, passing "drfact_data/datasets/ARC/linked_dev.jsonl"
# generates "drfact_data/datasets/ARC/linked_dev.BM25.jsonl".
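
The two pieces of this step that are visible from the command, the output file naming and a BM25 lookup, can be sketched as follows. This is an illustration under assumptions, not the content of search.py: the function names are hypothetical, and the field to search is passed in as a parameter because the index mapping is not described here. A plain match query is enough for BM25 ranking, since BM25 is the default similarity in Elasticsearch 7.x.

```python
import os


def bm25_output_path(linked_qa_file):
    """Map ".../linked_dev.jsonl" to ".../linked_dev.BM25.jsonl"."""
    root, ext = os.path.splitext(linked_qa_file)
    return f"{root}.BM25{ext}"


def match_query(question, field, size=10):
    """Build a full-text `match` query body that returns the top `size` hits."""
    # `field` is whichever text field the index stores; it is an assumption of this sketch.
    return {"size": size, "query": {"match": {field: question}}}
```

With a 7.x client, the query body could be sent as client.search(index="gkb_best", body=match_query(question, "text")), where both the index name "gkb_best" and the field name "text" are placeholders.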