HotPotQA environment implemented with aviary, allowing agents to perform multi-hop question answering on the HotPotQA dataset.
[1] Yang et al. HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering. EMNLP, 2018.
[2] Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023.
To install the HotPotQA environment, run the following command:
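```sh
# Assumed PyPI extra, following aviary's fhaviary packaging convention
pip install 'fhaviary[hotpotqa]'
```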
LitQA environment implemented with aviary, allowing agents to perform question answering on the LitQA2 dataset.
LitQA (now legacy) is a dataset of 50 multiple-choice questions drawn from recent scientific literature. It is designed to test an LLM's ability to retrieve information that lies outside its pre-training corpus. To ensure the questions are absent from that corpus, they were collected from scientific papers published after September 2021, the training data cut-off date of GPT-4.
LitQA2 is part of the LAB-Bench dataset. It contains 248 multiple-choice questions from the literature, created to ensure that the questions cannot be answered by recall from the pre-training corpus alone: only scientific papers published within 36 months of the dataset's publication date were considered. LitQA2 is therefore considered a scientific RAG dataset.
To install the LitQA environment, run:
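```sh
# Assumed PyPI extra, following aviary's fhaviary packaging convention
pip install 'fhaviary[litqa]'
```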
In `litqa/env.py`, you will find:

- `GradablePaperQAEnvironment`: an environment that can grade answers given an evaluation function.

And in `litqa/task.py`, you will find:

- `LitQAv2TaskDataset`: a task dataset designed to pull LitQA v2 from Hugging Face and create one `GradablePaperQAEnvironment` per question.
Here is a minimal sketch of how to use them together. The import path and the zero-argument constructor below are assumptions; consult the package source for exact signatures:
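```python
import asyncio

from aviary.envs.litqa import LitQAv2TaskDataset  # assumed import path


async def main() -> None:
    # Assumption: the default constructor pulls LitQA v2 from Hugging Face.
    dataset = LitQAv2TaskDataset()

    # Each index yields one GradablePaperQAEnvironment.
    env = dataset.get_new_env_by_idx(0)

    # Standard aviary environment loop: reset returns the initial
    # observations and the tools available to the agent.
    obs, tools = await env.reset()
    print(obs)


asyncio.run(main())
```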
[1] Lála et al. PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. ArXiv:2312.07559, 2023.
[2] Skarlinski et al. Language agents achieve superhuman synthesis of scientific knowledge. ArXiv:2409.13740, 2024.
[3] Laurent et al. LAB-Bench: Measuring Capabilities of Language Models for Biology Research. ArXiv:2407.10362, 2024.
An environment designed to use PaperQA for answering questions from the `LFRQATaskDataset`. Long-form RobustQA (LFRQA) is a human-annotated dataset introduced in RAG-QA Arena, featuring over 1,400 questions across various domains, including science.
To install the LFRQA environment, run:
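```sh
# Assumed PyPI extra, following aviary's fhaviary packaging convention
pip install 'fhaviary[lfrqa]'
```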
Refer to this tutorial for instructions on how to run the environment.
[1] Han et al. RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering. ArXiv:2407.13998, 2024.
GSM8k environment implemented with aviary, allowing agents to solve math word problems from the GSM8k dataset.
The citation for GSM8k is given below:
[1] Cobbe et al. Training Verifiers to Solve Math Word Problems. ArXiv:2110.14168, 2021.
To install the GSM8k environment, run the following command:
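```sh
# Assumed PyPI extra, following aviary's fhaviary packaging convention
pip install 'fhaviary[gsm8k]'
```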