You can use papers from https://openreview.net/ as your database! Here's a helper that fetches a list of all papers from a selected conference (like ICLR, ICML, or NeurIPS), queries that list with an LLM to find relevant papers, and downloads the relevant papers to a local directory that can then be used with paper-qa. Install `openreview-py` and get your username and password from the website. You can put them into a `.env` file under the `OPENREVIEW_USERNAME` and `OPENREVIEW_PASSWORD` variables, or pass them in the code directly.
It's been a while since we've tested this - so let us know if it runs into issues!
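A rough sketch of the setup (this assumes the OpenReview v2 API client; the venue ID below is only an example):

```python
import os

from dotenv import load_dotenv
from openreview.api import OpenReviewClient

load_dotenv()  # pulls OPENREVIEW_USERNAME / OPENREVIEW_PASSWORD from your .env file

client = OpenReviewClient(
    baseurl="https://api2.openreview.net",
    username=os.environ["OPENREVIEW_USERNAME"],
    password=os.environ["OPENREVIEW_PASSWORD"],
)

# fetch every submission for a venue; the venue ID here is just an example
notes = client.get_all_notes(content={"venueid": "ICLR.cc/2025/Conference"})
print(f"Fetched {len(notes)} papers")
```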
If you use Zotero to organize your personal bibliography, you can use `paperqa.contrib.ZoteroDB` to query papers from your library, which relies on `pyzotero`. Install `pyzotero` via the `zotero` extra for this feature:
First, note that PaperQA2 parses the PDFs of papers to store in the database, so all relevant papers should have PDFs stored inside your database. You can get Zotero to automatically do this by highlighting the references you wish to retrieve, right clicking, and selecting "Find Available PDFs". You can also manually drag-and-drop PDFs onto each reference.
To download papers, you need to get an API key for your account.

1. Get your library ID and set it as the environment variable `ZOTERO_USER_ID`.
   - For personal libraries, this ID is given here at the part "Your userID for use in API calls is XXXXXX".
   - For group libraries, go to your group page https://www.zotero.org/groups/groupname, and hover over the settings link. The ID is the integer after /groups/. (h/t pyzotero!)
2. Create a new API key here and set it as the environment variable `ZOTERO_API_KEY`. The key will need read access to the library.
With this, we can download papers from our library and add them to PaperQA2:
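A sketch of what that can look like (this follows the `ZoteroDB` interface described above; `library_type` and the 20-item limit are the knobs to adjust):

```python
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # use "group" for a group library

# walk the first 20 items in the library and add each attached PDF
for item in zotero.iterate(limit=20):
    docs.add(item.pdf, docname=item.key)
```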
which will download the first 20 papers in your Zotero database and add them to the `Docs` object.
We can also do specific queries of our Zotero library and iterate over the results:
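For example (the keyword arguments below follow pyzotero's search parameters and are meant as an illustrative sketch):

```python
for item in zotero.iterate(
    q="large language models",
    qmode="everything",
    sort="date",
    direction="desc",
    limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)
```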
You can read more about the search syntax by typing `zotero.iterate?` in IPython.
If you want to search for papers outside of your own collection, I've found an unrelated project called paper-scraper that looks like it might help. But beware, this project looks like it uses some scraping tools that may violate publishers' rights or be in a gray area of legality.
The LFRQA dataset was introduced in the paper RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering. It features 1,404 science questions (along with other categories) that have been human-annotated with answers. This tutorial walks through the process of setting up the dataset for use and benchmarking.
First, we need to obtain the annotated dataset from the official repository:
LFRQA is built upon Robust-QA, so we must download the relevant documents:
For more details, refer to the original paper: RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering.
We now load the documents into a pandas dataframe:
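A minimal sketch of this step; the file path and column names (`id`, `text`) are assumptions, so adjust them to match the files you actually downloaded:

```python
import pandas as pd

# hypothetical path: one JSON record per line with an "id" and a "text" field
docs_df = pd.read_json(
    "data/rag-qa-benchmarking/science_documents.jsonl", lines=True
)
print(f"Loaded {len(docs_df)} documents")
```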
RobustQA consists of 1.7M documents, so building the whole index will take around 3 hours. If you want to run a quick test, you can use a portion of the dataset and only the questions that can be answered from those documents.
We now create the document directory and store each document as a separate text file, so that paperqa can build the index.
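Continuing the sketch above (the commented `head()` call is how you would keep only a small portion for a quick test):

```python
import os

paper_directory = "data/rag-qa-benchmarking/papers"
os.makedirs(paper_directory, exist_ok=True)

# docs_df = docs_df.head(10_000)  # optional: index only a subset for a quick test

# write one plain-text file per document so paperqa can index them
for row in docs_df.itertuples():
    with open(os.path.join(paper_directory, f"{row.id}.txt"), "w") as f:
        f.write(row.text)
```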
The manifest file keeps track of document metadata for the dataset. We need to fill in some fields so that paperqa doesn't try to obtain the metadata with LLM calls, which makes the indexing process faster.
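A sketch of building that manifest as a CSV; the exact column set paperqa expects is documented with its indexing settings, and the columns used here (`file_location`, `doi`, `title`) are our assumption:

```python
manifest = pd.DataFrame(
    {
        "file_location": [f"{doc_id}.txt" for doc_id in docs_df.id],
        "doi": "",  # RobustQA documents have no DOIs
        "title": [str(doc_id) for doc_id in docs_df.id],
    }
)
manifest.to_csv("data/rag-qa-benchmarking/manifest.csv", index=False)
```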
Finally, we load the questions and filter them to ensure we only include questions that reference the selected documents:
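A sketch of the filtering step, again with assumed file and field names (`gold_doc_ids` stands in for whatever field links a question to its source documents):

```python
questions_df = pd.read_json(
    "data/rag-qa-benchmarking/annotations_science_with_citation.jsonl", lines=True
)

# keep a question only if every document it cites is in the subset we wrote out
available_ids = set(docs_df.id)
questions_df = questions_df[
    questions_df.gold_doc_ids.apply(lambda ids: all(i in available_ids for i in ids))
]
print(f"Kept {len(questions_df)} questions")
```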
From now on, we will be using the paperqa library, so we need to install it:
Copy the following to a file and run it. Feel free to adjust the concurrency as you like. You don't need any API keys to build this index, because no citation metadata is inferred (it comes from the manifest), but you do need LLM API keys to answer questions.
Remember that this process is quick for small portions of the dataset, but can take around 3 hours for the whole dataset.
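A sketch of such a script, reusing the hypothetical paths from above; `get_directory_index` and the `IndexSettings` fields follow paperqa's indexing API, but double-check the names against the version you have installed:

```python
import asyncio

from paperqa import Settings, ask
from paperqa.agents.search import get_directory_index
from paperqa.settings import AgentSettings, IndexSettings

settings = Settings(
    paper_directory="data/rag-qa-benchmarking/papers",
    agent=AgentSettings(
        index=IndexSettings(
            name="lfrqa_science_index",
            manifest_file="data/rag-qa-benchmarking/manifest.csv",
            concurrency=10,  # adjust to your machine
        )
    ),
)


async def build_index() -> None:
    # builds (or re-syncs) the full-text index over the paper directory
    await get_directory_index(settings=settings)


if __name__ == "__main__":
    asyncio.run(build_index())
    # a quick sanity-check question; this is the part that needs LLM API keys
    print(ask("How do apple trees grow?", settings=settings).session.answer)
```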
After this runs, you will get an answer!
After you have built the index, you are ready to run the benchmark. Copy the following into a file and run it. To run this, you will need to have the `ldp` and `fhaviary[lfrqa]` packages installed.
After running this, you can find the results in the `data/rag-qa-benchmarking/results` folder. Here is an example of how to read them:
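For instance, assuming the benchmark wrote one JSON file per question (the layout is an assumption; adapt it to whatever your run produced):

```python
import json
from pathlib import Path

results_dir = Path("data/rag-qa-benchmarking/results")
results = [json.loads(p.read_text()) for p in results_dir.glob("*.json")]
print(f"Loaded {len(results)} results")
```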
PaperQA2 now natively supports querying clinical trials in addition to any documents supplied by the user, via a new tool aptly named `clinical_trials_search`. Users don't have to provide any clinical trials themselves; the tool uses the clinicaltrials.gov API to retrieve them on the fly. As of January 2025, the tool is not enabled by default, but it's easy to configure. Here's an example where we query only clinical trials, without using any documents:
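A sketch using the named settings bundle discussed just below (the question is only a placeholder):

```python
from paperqa import Settings, ask

answer_response = ask(
    "What drugs have been found to effectively treat ulcerative colitis?",
    settings=Settings.from_name("search_only_clinical_trials"),
)
print(answer_response.session.answer)
```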
You can see the in-line citations for each clinical trial used in the response. If you'd like to see more data on the specific contexts that were used to answer the query:
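For example, each context carries the supporting snippet and a relevance score:

```python
# inspect the evidence behind the answer
for context in answer_response.session.contexts:
    print(context.score, context.context)
```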
Using `Settings.from_name('search_only_clinical_trials')` is a shortcut, but note that you can easily add `clinical_trials_search` into any custom `Settings` by just explicitly naming it as a tool:
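For example (the other tool names listed alongside it are paperqa's standard paper tools; treat the exact set as an assumption and check `AgentSettings` for the current defaults):

```python
from paperqa import Settings, ask
from paperqa.settings import AgentSettings

answer_response = ask(
    "What drugs have been found to effectively treat ulcerative colitis?",
    settings=Settings(
        agent=AgentSettings(
            tool_names={
                "clinical_trials_search",
                "paper_search",
                "gather_evidence",
                "gen_answer",
                "complete",
            }
        )
    ),
)
```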
We now see both papers and clinical trials cited in our response. For convenience, we have a `Settings.from_name` that works as well:
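Something like the following; the bundle name here is our assumption, so check the names shipped in paperqa's configs directory:

```python
from paperqa import Settings, ask

answer_response = ask(
    "What drugs have been found to effectively treat ulcerative colitis?",
    settings=Settings.from_name("clinical_trials"),  # assumed name of the combined bundle
)
```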
And this works with the `pqa` CLI as well:
This tutorial is available as a Jupyter notebook here.
This tutorial aims to show how to use the `Settings` class to configure PaperQA. Firstly, we will be using OpenAI and Anthropic models, so we need to set the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. We will use both providers to make it clear when the paperqa agent is using one or the other. We use `python-dotenv` to load the environment variables from a `.env` file. Hence, our first step is to create a `.env` file and install the required packages.
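A sketch of that first step, assuming the two keys live in a local `.env` file:

```python
from dotenv import load_dotenv

# expects a .env file next to the notebook containing:
# OPENAI_API_KEY=...
# ANTHROPIC_API_KEY=...
load_dotenv()
```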
We will use the `lmi` package to get the model names and the `.papers` directory to save the documents we will use.
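A sketch of that setup; `CommonLLMNames` and its members are assumptions about the `lmi` package, so substitute plain model name strings if they differ in your version:

```python
import os

from lmi import CommonLLMNames  # assumed helper exposing common model names

llm_openai = CommonLLMNames.OPENAI_TEST.value        # assumed small OpenAI model
llm_anthropic = CommonLLMNames.ANTHROPIC_TEST.value  # assumed small Anthropic model

os.makedirs(".papers", exist_ok=True)  # directory where we will keep the documents
```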
The `Settings` class is used to configure the PaperQA settings. Official documentation can be found here, and the open-source code can be found here.
Here is a basic example of how to use the `Settings` class. We will be unnecessarily verbose for the sake of clarity. Please note that most of the settings are optional and the defaults are good for most cases. Refer to the description of each setting for more information.
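A pared-down sketch of such a configuration (the original notebook example is more verbose), reusing the model names defined above:

```python
from paperqa import Settings
from paperqa.settings import AgentSettings

settings = Settings(
    llm=llm_openai,                      # main LLM
    summary_llm=llm_openai,              # LLM used to summarize evidence
    embedding="text-embedding-3-small",  # embedding model (the OpenAI default)
    temperature=0.5,
    paper_directory=".papers",           # where paperqa looks for documents
    agent=AgentSettings(agent_llm=llm_openai),  # LLM used by the agent
    verbosity=1,
)
```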
Within this `Settings` object, I'd like to discuss specifically how the LLMs are configured and how paperqa looks for papers.
A common source of confusion is that multiple LLMs are used in paperqa. We have `llm`, `summary_llm`, `agent_llm`, and `embedding`. Hence, if `llm` is set to an Anthropic model, `summary_llm` and `agent_llm` will still require an `OPENAI_API_KEY`, since OpenAI models are the default.
Among the objects that use LLMs in paperqa, we have `llm`, `summary_llm`, `agent_llm`, and `embedding`:

- `llm`: Main LLM used by the agent to reason about the question, extract metadata from documents, etc.
- `summary_llm`: LLM used to summarize the papers.
- `agent_llm`: LLM used to answer questions and select tools.
- `embedding`: Embedding model used to embed the papers.
Let's see some examples around this concept. First, we define the settings with `llm` set to an OpenAI model. Please notice this is not a complete list of settings, but take your time to read through the `Settings` class and all the customization that can be done.
As is evident, paperqa is highly customizable. We reiterate that despite this fine-grained customization, the defaults are good for most cases; still, you are welcome to explore the settings and customize paperqa to your needs.
We also set settings.verbosity to 1, which will print the agent configuration. Feel free to set it to 0 to silence the logging after your first run.
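A sketch of the run itself; the question is a placeholder, so ask about whatever you placed in `.papers`:

```python
from paperqa import ask

response = ask(
    "What manufacturing challenges are unique to bispecific antibodies?",  # placeholder
    settings=settings,
)
```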
That probably worked fine. Let's now try removing `OPENAI_API_KEY` and running the same question with the same settings.
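For example, dropping the key from the environment and re-asking:

```python
import os

# remove the key only to demonstrate the failure mode
os.environ.pop("OPENAI_API_KEY", None)

response = ask(
    "What manufacturing challenges are unique to bispecific antibodies?",
    settings=settings,
)  # expected to raise an authentication error
```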
It obviously fails. We don't have a valid `OPENAI_API_KEY`, so the agent will not be able to use OpenAI models. Let's change it to an Anthropic model and see if it works.
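A sketch of the switched configuration; the Anthropic model name comes from the `lmi` sketch above, and the embedding string is our assumption for a local (non-OpenAI) sentence-transformers model:

```python
settings = Settings(
    llm=llm_anthropic,                         # main LLM: an Anthropic model
    summary_llm=llm_anthropic,
    embedding="st-multi-qa-MiniLM-L6-cos-v1",  # assumed local sentence-transformers embedding
    paper_directory=".papers",
    agent=AgentSettings(agent_llm=llm_anthropic),
    verbosity=1,
)

response = ask(
    "What manufacturing challenges are unique to bispecific antibodies?",
    settings=settings,
)
```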
Now the agent is able to use only Anthropic models, and although we don't have a valid `OPENAI_API_KEY`, the question is answered because the agent never calls OpenAI models. Notice that we also changed the `embedding`, because it was `text-embedding-3-small` by default, which is an OpenAI model. Paperqa implements a few embedding models; please refer to the documentation for more information.
Notice the `settings.agent.paper_directory` and `settings.agent.index` settings: paperqa actually uses the settings from `settings.agent`. However, for convenience, aliases are implemented at `settings.paper_directory` and `settings.index_directory`.
Paperqa returns a `PQASession` object, which contains not only the answer but also all the information gathered to answer the question. We recommend printing the `PQASession` object (`print(response.session)`) to understand the information it contains. Let's check the `PQASession` object:
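```python
# the session bundles the answer, the formatted references, and the contexts behind them
print(response.session)
```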
In addition to the answer, the `PQASession` object contains all the references and contexts used to generate the answer. Because paperqa splits the documents into chunks, each chunk is a valid reference. You can see that it also references the page where the context was found.
Lastly, `response.session.contexts` contains the contexts used to generate the answer. Each context has a score, which is the similarity between the question and the context. Paperqa uses this score to choose which contexts are most relevant for answering the question.
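A quick way to inspect them (the `text.name` attribute is our assumption about the context objects; `score` and `context` are described above):

```python
for context in response.session.contexts:
    print(f"{context.text.name} (score={context.score}): {context.context[:120]}...")
```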