First, we need to obtain the annotated dataset from the official repository:
# Create a new directory for the dataset
mkdir -p data/rag-qa-benchmarking
# Get the annotated questions
curl https://raw.githubusercontent.com/awslabs/rag-qa-arena/refs/heads/main/data/annotations_science_with_citation.jsonl -o data/rag-qa-benchmarking/annotations_science_with_citation.jsonl
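To sanity-check the download, note that each line of a JSONL file is a standalone JSON object. A minimal parsing sketch (using a synthetic two-record string, since the exact annotation fields may vary):

```python
import json

# Synthetic stand-in for the downloaded JSONL; the real annotation
# records have more fields than shown here.
jsonl_text = (
    '{"qid": 1, "question": "What is RAG?"}\n'
    '{"qid": 2, "question": "What is LFRQA?"}'
)

# Parse one JSON object per non-empty line
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(len(records))  # number of annotated records parsed
```

The same list comprehension works on the real file by iterating over `open(...)` instead of `jsonl_text.splitlines()`.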
Download the Robust-QA Documents
LFRQA is built upon Robust-QA, so we must download the relevant documents:
# Download the Lotte dataset, which includes the required documents
curl https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz --output lotte.tar.gz
# Extract the dataset
tar -xvzf lotte.tar.gz
# Move the science test collection to our dataset folder
cp lotte/science/test/collection.tsv ./data/rag-qa-benchmarking/science_test_collection.tsv
# Clean up unnecessary files
rm lotte.tar.gz
rm -rf lotte
We now load the documents into a pandas DataFrame:
import os
import pandas as pd
# Path to the benchmark data directory
rag_qa_benchmarking_dir = os.path.join("data", "rag-qa-benchmarking")
# Load documents dataset
lfrqa_docs_df = pd.read_csv(
    os.path.join(rag_qa_benchmarking_dir, "science_test_collection.tsv"),
    sep="\t",
    names=["doc_id", "doc_text"],
)
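Passing `sep="\t"` together with explicit `names` tells pandas the TSV is headerless and assigns the column labels directly. A small self-contained illustration of the same call on in-memory data:

```python
import io

import pandas as pd

# In-memory stand-in for science_test_collection.tsv (doc_id <TAB> doc_text)
tsv_text = (
    "0\tPhotosynthesis converts light into chemical energy.\n"
    "1\tEntropy never decreases in an isolated system."
)

# Headerless TSV: sep picks the delimiter, names supplies the column labels
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", names=["doc_id", "doc_text"])
print(df.shape)  # (2, 2): two documents, two columns
```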
Select the Documents to Use
RobustQA consists of 1.7M documents, so building the full index takes around 3 hours.
If you only want to run a quick test, you can use a fraction of the dataset together with the subset of questions that can be answered from those documents alone.
proportion_to_use = 1 / 100
amount_of_docs_to_use = int(len(lfrqa_docs_df) * proportion_to_use)
print(f"Using {amount_of_docs_to_use} out of {len(lfrqa_docs_df)} documents")
Prepare the Document Files
We now create the document directory and store each document as a separate text file, so that paperqa can build the index.
partial_docs = lfrqa_docs_df.head(amount_of_docs_to_use)
lfrqa_directory = os.path.join(rag_qa_benchmarking_dir, "lfrqa")
os.makedirs(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "files"), exist_ok=True
)
# Report progress every ~5%; max(1, ...) avoids a modulo-by-zero error
# when the subset is smaller than 20 documents
progress_step = max(1, int(len(partial_docs) * 0.05))
for i, row in partial_docs.iterrows():
    doc_id = row["doc_id"]
    doc_text = row["doc_text"]
    with open(
        os.path.join(
            lfrqa_directory, "science_docs_for_paperqa", "files", f"{doc_id}.txt"
        ),
        "w",
        encoding="utf-8",
    ) as f:
        f.write(doc_text)
    if i % progress_step == 0:
        progress = (i + 1) / len(partial_docs)
        print(f"Progress: {progress:.2%}")
Create the Manifest File
The manifest file tracks document metadata for the dataset. We pre-fill some fields so that paperqa doesn't try to fetch metadata with LLM calls, which makes the indexing process faster.
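As a sketch, manifest rows can be assembled from the same dataframe. The column names below (`file_location`, `doi`, `title`) are an assumption about paperqa's manifest format; verify them against the documentation for your paperqa version:

```python
import pandas as pd

# Tiny stand-in dataframe; in the tutorial this would be partial_docs.
# The manifest columns used here are assumptions, not a confirmed schema.
docs = pd.DataFrame({"doc_id": [0, 1], "doc_text": ["first doc", "second doc"]})

manifest = pd.DataFrame(
    {
        # Each document was written out as <doc_id>.txt above
        "file_location": docs["doc_id"].map(lambda d: f"{d}.txt"),
        "doi": "",  # pre-filled so paperqa skips metadata lookups
        "title": docs["doc_id"].map(lambda d: f"{d}.txt"),
    }
)
print(manifest.columns.tolist())
```

Writing this out with `manifest.to_csv(..., index=False)` next to the document files gives paperqa the metadata it needs without any LLM calls.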