BM25 (Elasticsearch) for OpenCSR
Link to the code for the experiment: OpenCSR/baseline_methods/BM25/
Installation
# Install elasticsearch
mkdir ~/elasticsearch
cd ~/elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.0-linux-x86_64.tar.gz
tar -xvf elasticsearch-7.9.0-linux-x86_64.tar.gz
# Install the Python API (pin the client to 7.x so that it matches the 7.9.0 server)
pip install "elasticsearch>=7.0.0,<8.0.0"
# Start the service (it runs in the foreground; use a separate terminal, or add -d to run it as a daemon)
~/elasticsearch/elasticsearch-7.9.0/bin/elasticsearch
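# Check if the service is okay
curl -X GET http://localhost:9200/
# Disable the disk-usage threshold checks so that a nearly full disk does not block shard allocation
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
# Clear the read-only block that Elasticsearch puts on indices when the disk fills up
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'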
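The health check can also be scripted, for instance before starting a long indexing run. The following is a minimal Python sketch, assuming the service listens on the default http://localhost:9200/ as above. The helper names `fetch_banner` and `is_expected_version` are made up for illustration and are not part of the OpenCSR code.

```python
import json
import urllib.request


def fetch_banner(url="http://localhost:9200/", timeout=5.0):
    """Return the JSON document that Elasticsearch serves at its root URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def is_expected_version(banner, expected="7.9.0"):
    """Compare the reported server version with the one installed above."""
    return banner.get("version", {}).get("number") == expected
```

With the service running, `is_expected_version(fetch_banner())` should be true for the 7.9.0 install described here; a connection error from `fetch_banner` means the service is not reachable yet.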
Preprocessing
Step 1. Split the corpus into multiple bulks
# Each bulk holds 50,000 lines (one sentence per line); the files are named /tmp/gkb_best_00, /tmp/gkb_best_01, ...
split -l 50000 ${CORPUS_PATH}/gkb_best.drfact_format.jsonl -d -a 2 /tmp/gkb_best_
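Before indexing, it may be worth confirming that the split produced the expected files. The sketch below is a small Python check, not part of the OpenCSR scripts; `chunk_line_counts` is a hypothetical helper name.

```python
import glob


def chunk_line_counts(prefix):
    """Map each file whose path starts with `prefix` to its line count."""
    counts = {}
    # glob.escape keeps special characters in the prefix from acting as wildcards.
    for path in sorted(glob.glob(glob.escape(prefix) + "*")):
        with open(path, "rb") as handle:
            counts[path] = sum(1 for _ in handle)
    return counts
```

Calling `chunk_line_counts("/tmp/gkb_best_")` after the split should report 50,000 lines for every file except possibly the last one.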
Step 2. Indexing
python baseline_methods/BM25/index.py /tmp/gkb_best_
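For orientation, here is a rough Python sketch of how one bulk file could be sent to Elasticsearch with the client's bulk helper. It is not the content of index.py. The index name `gkb_best` and the function names are assumptions made for illustration, and each JSONL record is stored unchanged as the document source because the record layout is not described here.

```python
import json


def iter_actions(lines, index_name):
    """Yield one bulk action per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield {"_index": index_name, "_source": json.loads(line)}


def index_file(path, index_name="gkb_best", url="http://localhost:9200"):
    """Send every record of one bulk file to Elasticsearch."""
    # Imported lazily so that iter_actions works without the client installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(url)
    with open(path, encoding="utf-8") as handle:
        # helpers.bulk returns (number of successful actions, list of errors).
        return helpers.bulk(client, iter_actions(handle, index_name))
```

Streaming the actions from a generator keeps memory use flat, since only the current batch of records is held at a time.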
Inference
python baseline_methods/BM25/search.py \
--linked_qa_file [path to linked_{train,dev,test}.jsonl]
# For example, "drfact_data/datasets/ARC/linked_dev.jsonl"
# generates "drfact_data/datasets/ARC/linked_dev.BM25.jsonl".
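To illustrate the retrieval step, here is a minimal Python sketch of a BM25 lookup against the index. It is not the logic of search.py. Elasticsearch scores `match` queries with BM25 by default, so no extra similarity setting is needed. The field name `text`, the index name `gkb_best`, and the function names are assumptions; adjust them to whatever the indexed records actually contain.

```python
def build_bm25_query(question, field="text", size=10):
    """Build a search body; a `match` query is scored with BM25 by default."""
    return {"size": size, "query": {"match": {field: question}}}


def retrieve(client, question, index_name="gkb_best", field="text", size=10):
    """Return (score, source) pairs for the top hits of one question."""
    body = build_bm25_query(question, field, size)
    # `body=` is the request style of the 7.x Python client.
    response = client.search(index=index_name, body=body)
    return [(hit["_score"], hit["_source"]) for hit in response["hits"]["hits"]]
```

With a 7.x client, `retrieve(Elasticsearch("http://localhost:9200"), "what do birds eat")` would return up to ten scored records, best match first.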