DuReader-retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, Haifeng Wang

October 2022

Abstract

In this paper, we present DuReader-retrieval, a large-scale Chinese dataset for passage retrieval. DuReader-retrieval contains more than 90K queries and over 8M unique passages from Baidu search. To ensure the quality of our benchmark and address the shortcomings of existing datasets, we (1) reduce the false negatives in the development and test sets by pooling the results from multiple retrievers and annotating them manually, and (2) remove training queries that are semantically similar to the development and test queries. Additionally, we provide two out-of-domain test sets for cross-domain evaluation, as well as a manually translated set for cross-lingual retrieval. The experiments demonstrate that DuReader-retrieval is challenging and leaves plenty of room for improvement, for example in handling salient-phrase and syntax mismatches between query and paragraph. The results also show that the dense retriever does not generalize well across domains and that cross-lingual retrieval is inherently challenging. DuReader-retrieval is publicly available.
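The pooling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the retriever names, the pooling depth, and the passage ids below are all hypothetical, and the sketch only shows the general idea of merging the top-ranked passages from several retrievers into one candidate set that human annotators would then judge.

```python
from typing import Dict, List


def pool_candidates(runs: Dict[str, List[str]], depth: int) -> List[str]:
    """Merge the top-`depth` passage ids of each retriever into one pool.

    `runs` maps a (hypothetical) retriever name to its ranked passage ids
    for a single query. Duplicates are dropped; first-seen order is kept.
    """
    pooled: List[str] = []
    seen = set()
    for ranked in runs.values():
        for pid in ranked[:depth]:
            if pid not in seen:
                seen.add(pid)
                pooled.append(pid)
    return pooled


# Hypothetical rankings for one query from two retrievers.
runs = {
    "sparse": ["p1", "p2", "p3"],
    "dense": ["p2", "p4", "p1"],
}
pool = pool_candidates(runs, depth=2)
print(pool)  # candidates that would be sent for human annotation
```

Passages found by any one retriever enter the pool, so a relevant passage missed by the original labeling has a chance to be annotated rather than counted as a false negative.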

Type

Conference paper

Publication

In The 2022 Conference on Empirical Methods in Natural Language Processing

Open-domain Question Answering