The LFRQA dataset was introduced in the RAG-QA Arena paper. It features 1,404 science questions (along with questions in other domains) that have been human-annotated with long-form answers. This tutorial walks through setting up the dataset and benchmarking against it.
Download the Annotations
First, we need to obtain the annotated dataset from the official repository:
# Create a new directory for the dataset
!mkdir -p data/rag-qa-benchmarking
# Get the annotated questions
!curl https://raw.githubusercontent.com/awslabs/rag-qa-arena/refs/heads/main/data/\
annotations_science_with_citation.jsonl \
-o data/rag-qa-benchmarking/annotations_science_with_citation.jsonl
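Optionally, a quick line count confirms the download succeeded; it should match the 1,404 annotated science questions mentioned above:
# Optional check: one JSON record per line, so the line count equals the number of questions
with open("data/rag-qa-benchmarking/annotations_science_with_citation.jsonl", encoding="utf-8") as f:
    print(sum(1 for _ in f))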
Download the Robust-QA Documents
LFRQA is built on top of RobustQA, so we also need to download its documents:
# Download the Lotte dataset, which includes the required documents
!curl https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz --output lotte.tar.gz
# Extract the dataset
!tar -xvzf lotte.tar.gz
# Move the science test collection to our dataset folder
!cp lotte/science/test/collection.tsv ./data/rag-qa-benchmarking/science_test_collection.tsv
# Clean up unnecessary files
!rm lotte.tar.gz
!rm -rf lotte
Load the Data
We now load the documents into a pandas dataframe:
import os
import pandas as pd
# Directory holding the benchmarking data
rag_qa_benchmarking_dir = os.path.join("data", "rag-qa-benchmarking")
# Load documents dataset
lfrqa_docs_df = pd.read_csv(
    os.path.join(rag_qa_benchmarking_dir, "science_test_collection.tsv"),
    sep="\t",
    names=["doc_id", "doc_text"],
)
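The annotations downloaded earlier are line-delimited JSON, so they can be loaded the same way. Printing the columns is a quick way to confirm the field names before relying on them (the filtering sketch further below assumes a gold_doc_ids column):
# Load the annotated questions and answers (one JSON object per line)
lfrqa_annotations_df = pd.read_json(
    os.path.join(rag_qa_benchmarking_dir, "annotations_science_with_citation.jsonl"),
    lines=True,
)
print(lfrqa_annotations_df.columns.tolist())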
Select the Documents to Use
RobustQA contains about 1.7M documents, so building the full index takes around 3 hours.
For a quick test, we can use 1% of the dataset by selecting the first 1% of the documents and keeping only the questions that refer to those documents (sketched below).
proportion_to_use = 1 / 100
amount_of_docs_to_use = int(len(lfrqa_docs_df) * proportion_to_use)
print(f"Using {amount_of_docs_to_use} out of {len(lfrqa_docs_df)} documents")
Prepare the Document Files
We now create the document directory and store each document as a separate text file, so that paperqa can build the index.
partial_docs = lfrqa_docs_df.head(amount_of_docs_to_use)
lfrqa_directory = os.path.join(rag_qa_benchmarking_dir, "lfrqa")
os.makedirs(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "files"), exist_ok=True
)

for i, row in partial_docs.iterrows():
    doc_id = row["doc_id"]
    doc_text = row["doc_text"]
    with open(
        os.path.join(
            lfrqa_directory, "science_docs_for_paperqa", "files", f"{doc_id}.txt"
        ),
        "w",
        encoding="utf-8",
    ) as f:
        f.write(doc_text)
    # Report progress roughly every 5% of the selected documents
    if i % max(1, int(len(partial_docs) * 0.05)) == 0:
        progress = (i + 1) / len(partial_docs)
        print(f"Progress: {progress:.2%}")
Create the Manifest File
The manifest file keeps track of document metadata for the dataset. We pre-fill some fields so that paperqa doesn't try to fetch metadata with LLM calls, which makes the indexing process faster.
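The snippet below is a minimal sketch of such a manifest, assuming paperqa accepts a manifest.csv with file_location, doi, and title columns and that file_location is relative to the directory being indexed; check the paperqa documentation for the exact schema expected by your version.
# Build a manifest with pre-filled metadata so paperqa skips metadata lookups
manifest = pd.DataFrame(
    {
        # Paths are relative to the science_docs_for_paperqa directory
        "file_location": "files/" + partial_docs["doc_id"].astype(str) + ".txt",
        "doi": "",
        "title": partial_docs["doc_id"].astype(str),
    }
)
manifest.to_csv(
    os.path.join(lfrqa_directory, "science_docs_for_paperqa", "manifest.csv"),
    index=False,
)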
From now on, we will be using the paperqa library, so we need to install it:
!pip install paper-qa
Index the Documents
Now we will build an index for the LFRQA documents. The index is a Tantivy index, which is a fast, full-text search engine library written in Rust. Tantivy is designed to handle large datasets efficiently, making it ideal for searching through a vast collection of papers or documents.
Feel free to adjust the concurrency settings as you like. Because we defined a manifest, no API keys are needed to build the index (paperqa doesn't have to infer any citation metadata), but you do need LLM API keys to answer questions.
Remember that this process is quick for small portions of the dataset, but can take around 3 hours for the whole dataset.
import nest_asyncio
nest_asyncio.apply()
We add the line above to handle async code within a notebook.
However, for better compatibility and a faster indexing process, we strongly recommend running the following code in a separate .py file.