publications | TIME Group

2025

Annotation-efficient universal honesty alignment

Shiyu Ni , Keping Bi , Jiafeng Guo , Minghao Tang , Jingtong Wu , Zengxin Han , and Xueqi Cheng

arXiv preprint arXiv:2510.17509, 2025

Honesty alignment, the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence, is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals. Experiments show that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) and better alignment performance on unseen MMLU tasks than the calibration-only baseline, offering a scalable solution toward universal honesty alignment in LLMs.
@article{ni2025annotation, title = {Annotation-efficient universal honesty alignment}, author = {Ni, Shiyu and Bi, Keping and Guo, Jiafeng and Tang, Minghao and Wu, Jingtong and Han, Zengxin and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2510.17509}, year = {2025}, }
ICLR
How Do LLM-Generated Texts Impact Term-Based Retrieval Models?

Wei Huang , Keping Bi , Yinqiong Cai , Wei Chen , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2508.17715, 2025

As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
@article{huang2025llm, title = {How Do LLM-Generated Texts Impact Term-Based Retrieval Models?}, author = {Huang, Wei and Bi, Keping and Cai, Yinqiong and Chen, Wei and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2508.17715}, year = {2025}, }
WSDM
Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation

Minghao Tang , Shiyu Ni , Jiafeng Guo , and Keping Bi

arXiv preprint arXiv:2507.19333, 2025

Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs’ robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs’ reasoning process, aiming to enhance the model’s ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs’ reasoning process is a promising direction for building more robust RAG systems.
@article{tang2025injecting, title = {Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation}, author = {Tang, Minghao and Ni, Shiyu and Guo, Jiafeng and Bi, Keping}, journal = {arXiv preprint arXiv:2507.19333}, year = {2025}, }
SIGIR-AP
Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

Hengran Zhang , Keping Bi , Jiafeng Guo , Jiaming Zhang , Shuaiqiang Wang , Dawei Yin , and Xueqi Cheng

arXiv preprint arXiv:2507.19102, 2025

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. Standard retrieval prioritizes relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
@article{zhang2025distilling, title = {Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation}, author = {Zhang, Hengran and Bi, Keping and Guo, Jiafeng and Zhang, Jiaming and Wang, Shuaiqiang and Yin, Dawei and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2507.19102}, year = {2025}, }
SIGIR-AP
A Comparative Study of Specialized LLMs as Dense Retrievers

Hengran Zhang , Keping Bi , and Jiafeng Guo

arXiv preprint arXiv:2507.03958, 2025

While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.
@article{zhang2025comparative, title = {A Comparative Study of Specialized LLMs as Dense Retrievers}, author = {Zhang, Hengran and Bi, Keping and Guo, Jiafeng}, journal = {arXiv preprint arXiv:2507.03958}, year = {2025}, }
CCIR
LifeIR at the NTCIR-18 Lifelog-6 Task

Jiahan Chen , Da Li , and Keping Bi

arXiv preprint arXiv:2505.20987, 2025

In recent years, sharing lifelogs recorded through wearable devices such as sports watches and GoPros has gained significant popularity. Lifelogs involve various types of information, including images, videos, and GPS data, revealing users’ lifestyles, dietary patterns, and physical activities. The Lifelog Semantic Access Task (LSAT) in the NTCIR-18 Lifelog-6 Challenge focuses on retrieving relevant images from users’ large-scale lifelogs based on textual queries describing an action or event. It serves users’ need to find images about a scenario in the historical moments of their lifelogs. We propose a multi-stage pipeline for this text-to-image search task, addressing various challenges in lifelog retrieval. Our pipeline includes: filtering blurred images, rewriting queries to make intents clearer, extending the candidate set based on events to include images with temporal connections, and reranking results using a multimodal large language model (MLLM) with stronger relevance judgment capabilities. The evaluation results of our submissions show the effectiveness of each stage and the entire pipeline.
@article{chen2025lifeir, title = {LifeIR at the NTCIR-18 Lifelog-6 Task}, author = {Chen, Jiahan and Li, Da and Bi, Keping}, journal = {arXiv preprint arXiv:2505.20987}, year = {2025}, }
NTCIR
Bridging Queries and Tables through Entities in Table Retrieval

Da Li , Keping Bi , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2504.06551, 2025

Table retrieval is essential for accessing information stored in structured tabular formats; however, it remains less explored than text retrieval. The content of the table primarily consists of phrases and words, which include a large number of entities, such as time, locations, persons, and organizations. Entities are well-studied in the context of text retrieval, but there is a noticeable lack of research on their applications in table retrieval. In this work, we explore how to leverage entities in tables to improve retrieval performance. First, we investigate the important role of entities in table retrieval from a statistical perspective and propose an entity-enhanced training framework. Subsequently, we use the type of entities to highlight entities instead of introducing an external knowledge base. Moreover, we design an interaction paradigm based on entity representations. Our proposed framework is plug-and-play and flexible, making it easy to integrate into existing table retriever training processes. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that our proposed framework is both simple and effective in enhancing existing retrievers. We also conduct extensive analyses to confirm the efficacy of different components. Overall, our work provides a promising direction for elevating table retrieval, enlightening future research in this area.
@article{li2025bridging, title = {Bridging Queries and Tables through Entities in Table Retrieval}, author = {Li, Da and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2504.06551}, year = {2025}, }
CIKM
Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective

Da Li , Keping Bi , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2503.02251, 2025

Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges. For example, different table fields have varying matching preferences: cells may favor finer-grained (word/phrase level) matching over broader (sentence/passage level) matching due to their fragmented and detailed nature, unlike titles. This necessitates a table-specific retriever to accommodate the various matching needs of each table field. Therefore, we introduce a Table-tailored HYbrid Matching rEtriever (THYME), which approaches table retrieval from a field-aware hybrid matching perspective. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that THYME significantly outperforms state-of-the-art baselines. Comprehensive analyses confirm the differing matching preferences across table fields and validate the design of THYME.
@article{li2025tailoring, title = {Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective}, author = {Li, Da and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2503.02251}, year = {2025}, }
EMNLP
Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG

Hengran Zhang , Minghao Tang , Keping Bi , Jiafeng Guo , Shihao Liu , Daiting Shi , Dawei Yin , and Xueqi Cheng

arXiv preprint arXiv:2504.05220, 2025

This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems, aiming to reduce dependence on costly human annotations. We address the gap between retrieval relevance and generative utility by employing LLMs to annotate document utility. To effectively utilize multiple positive samples per query, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics. Furthermore, combining LLM annotations with just 20% of human labels achieves performance comparable to using full human annotations. Our study offers a comprehensive approach to utilizing LLM annotations for initializing QA systems on new corpora.
@article{zhang2025leveraging, title = {Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG}, author = {Zhang, Hengran and Tang, Minghao and Bi, Keping and Guo, Jiafeng and Liu, Shihao and Shi, Daiting and Yin, Dawei and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2504.05220}, year = {2025}, }
EMNLP
Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs

Zhikai Ding , Shiyu Ni , and Keping Bi

arXiv preprint arXiv:2508.19111, 2025

Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries, knowing what it knows and what it does not. This paper investigates LVLMs’ perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs’ perception, we adapt several established confidence calibration methods from large language models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
@article{ding2025lvlms, title = {Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs}, author = {Ding, Zhikai and Ni, Shiyu and Bi, Keping}, journal = {arXiv preprint arXiv:2508.19111}, year = {2025}, }
EMNLP (Findings)
Unbiased Learning to Rank with Query-Level Click Propensity Estimation: Beyond Pointwise Observation and Relevance

Lulu Yu , Keping Bi , Jiafeng Guo , Shihao Liu , Dawei Yin , and Xueqi Cheng

In Companion Proceedings of the ACM on Web Conference 2025, 2025

Most existing unbiased learning-to-rank (ULTR) approaches are based on the user examination hypothesis, which assumes that users will click a result only if it is both relevant and observed (typically modeled by position). However, in real-world scenarios, users often click only one or two results after examining multiple relevant options, due to limited patience or because their information needs have already been satisfied. Motivated by this, we propose a query-level click propensity model to capture the probability that users will click on different result lists, allowing for non-zero probabilities that users may not click on an observed relevant result. We hypothesize that this propensity increases when more potentially relevant results are present, and refer to this user behavior as relevance saturation bias. Our method introduces a Dual Inverse Propensity Weighting (DualIPW) mechanism, which combines query-level and position-level IPW, to address both relevance saturation and position bias. Through theoretical derivation, we prove that DualIPW can learn an unbiased ranking model. Experiments on the real-world Baidu-ULTR dataset demonstrate that our approach significantly outperforms state-of-the-art ULTR baselines.
@inproceedings{yu2025unbiased, title = {Unbiased Learning to Rank with Query-Level Click Propensity Estimation: Beyond Pointwise Observation and Relevance}, author = {Yu, Lulu and Bi, Keping and Guo, Jiafeng and Liu, Shihao and Yin, Dawei and Cheng, Xueqi}, booktitle = {Companion Proceedings of the ACM on Web Conference 2025}, year = {2025}, }
WebConf (Short)
CLIPure: Purification in latent space via CLIP for adversarially robust zero-shot classification

Mingkun Zhang , Keping Bi , Wei Chen , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2502.18176, 2025

In this paper, we aim to build an adversarially robust zero-shot image classifier. We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts “a photo of a class-name.” Purification is the path we choose since it does not require adversarial training on specific attack types and thus can cope with any unforeseen attacks. We then formulate purification risk as the KL divergence between the joint distributions of the purification process of denoising the adversarial samples and the attack process of adding perturbations to benign samples, through bidirectional Stochastic Differential Equations (SDEs). The final derived results inspire us to explore purification in the multi-modal latent space of CLIP. We propose two variants for our CLIPure approach: CLIPure-Diff, which models the likelihood of images’ latent vectors with the DiffusionPrior module in DALL-E 2 (modeling the generation process of CLIP’s latent vectors), and CLIPure-Cos, which models the likelihood with the cosine similarity between the embeddings of an image and “a photo of a.” As far as we know, CLIPure is the first purification method in multi-modal latent space and CLIPure-Cos is the first purification method that is not based on generative models, which substantially improves defense efficiency. We conducted extensive experiments on CIFAR-10, ImageNet, and 13 datasets that previous CLIP-based defense methods used for evaluating zero-shot classification robustness. Results show that CLIPure boosts the SOTA robustness by a large margin, e.g., from 71.7% to 91.1% on CIFAR-10, from 59.6% to 72.6% on ImageNet, and a 108% relative improvement in average robustness on the 13 datasets over the previous SOTA.
@article{zhang2025clipure, title = {{CLIPure}: Purification in latent space via {CLIP} for adversarially robust zero-shot classification}, author = {Zhang, Mingkun and Bi, Keping and Chen, Wei and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2502.18176}, year = {2025}, }
ICLR
Towards fully exploiting LLM internal states to enhance knowledge boundary perception

Shiyu Ni , Keping Bi , Jiafeng Guo , Lulu Yu , Baolong Bi , and Xueqi Cheng

arXiv preprint arXiv:2502.11677, 2025

Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs’ internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Consistency-based Confidence Calibration (C3), which assesses confidence consistency through question reformulation. C3 significantly improves LLMs’ ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while C3 effectively controls output risks, advancing the reliability of LLMs in practical applications.
@article{ni2025towards, title = {Towards fully exploiting {LLM} internal states to enhance knowledge boundary perception}, author = {Ni, Shiyu and Bi, Keping and Guo, Jiafeng and Yu, Lulu and Bi, Baolong and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2502.11677}, year = {2025}, }
ACL
Evaluating implicit bias in large language models by attacking from a psychometric perspective

Yuchen Wen , Keping Bi , Wei Chen , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2406.14023, 2025

As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs’ implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs’ inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development.
@article{wen2024evaluating, title = {Evaluating implicit bias in large language models by attacking from a psychometric perspective}, author = {Wen, Yuchen and Bi, Keping and Chen, Wei and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2406.14023}, year = {2025}, }
ACL (Findings)
CAME: Competitively learning a mixture-of-experts model for first-stage retrieval

Jiafeng Guo , Yinqiong Cai , Keping Bi , Yixing Fan , Wei Chen , Ruqing Zhang , and Xueqi Cheng

ACM Transactions on Information Systems, 2025

The first-stage retrieval aims to retrieve a subset of candidate documents from a huge collection both effectively and efficiently. Since various matching patterns can exist between queries and relevant documents, previous work tries to combine multiple retrieval models to find as many relevant results as possible. The constructed ensembles, whether learned independently or jointly, do not care which component model is more suitable to an instance during training. Thus, they cannot fully exploit the capabilities of different types of retrieval models in identifying diverse relevance patterns. Motivated by this observation, in this article, we propose a Mixture-of-Experts (MoE) model consisting of representative matching experts and a novel competitive learning mechanism to let the experts develop and enhance their expertise during training. Specifically, our MoE model shares the bottom layers to learn common semantic representations and uses differently structured upper layers to represent various types of retrieval experts. Our competitive learning mechanism has two stages: (1) a standardized learning stage to train the experts equally to develop their capabilities to conduct relevance matching; (2) a specialized learning stage where the experts compete with each other on every training instance and get rewards and updates according to their performance to enhance their expertise on certain types of samples. Experimental results on retrieval benchmark datasets show that our method significantly outperforms the state-of-the-art baselines in the in-domain and out-of-domain settings.
@article{guo2025came, title = {{CAME}: Competitively learning a mixture-of-experts model for first-stage retrieval}, author = {Guo, Jiafeng and Cai, Yinqiong and Bi, Keping and Fan, Yixing and Chen, Wei and Zhang, Ruqing and Cheng, Xueqi}, journal = {ACM Transactions on Information Systems}, year = {2025}, }
TOIS

2024

CausalDiff: Causality-inspired disentanglement via diffusion model for adversarial defense

Mingkun Zhang , Keping Bi , Wei Chen , Quanrun Chen , Jiafeng Guo , and Xueqi Cheng

Advances in Neural Information Processing Systems, 2024

Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are hard to deceive with subtle manipulations, since we make judgments based only on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions based only on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning towards a novel causal information bottleneck objective. Empirically, CausalDiff has significantly outperformed state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).
@article{zhang2024causaldiff, title = {{CausalDiff}: Causality-inspired disentanglement via diffusion model for adversarial defense}, author = {Zhang, Mingkun and Bi, Keping and Chen, Wei and Chen, Quanrun and Guo, Jiafeng and Cheng, Xueqi}, journal = {Advances in Neural Information Processing Systems}, year = {2024}, }
NeurIPS
Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Shiyu Ni , Keping Bi , Lulu Yu , and Jiafeng Guo

In China Conference on Information Retrieval, 2024

Large language models (LLMs) have been found to produce hallucinations when the question exceeds their internal knowledge boundaries. A reliable model should have a clear perception of its knowledge boundaries, providing correct answers within its scope and refusing to answer when it lacks knowledge. Existing research on LLMs’ perception of their knowledge boundaries typically uses either the probability of the generated tokens or the verbalized confidence as the model’s confidence in its response. However, these studies overlook the differences and connections between the two. In this paper, we conduct a comprehensive analysis and comparison of LLMs’ probabilistic perception and verbalized perception of their factual knowledge boundaries. First, we investigate the pros and cons of these two perceptions. Then, we study how they change under questions of varying frequencies. Finally, we measure the correlation between LLMs’ probabilistic confidence and verbalized confidence. Experimental results show that 1) LLMs’ probabilistic perception is generally more accurate than verbalized perception but requires an in-domain validation set to adjust the confidence threshold. 2) Both perceptions perform better on less frequent questions. 3) It is challenging for LLMs to accurately express their internal confidence in natural language.
@inproceedings{ni2024large, title = {Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?}, author = {Ni, Shiyu and Bi, Keping and Yu, Lulu and Guo, Jiafeng}, booktitle = {China Conference on Information Retrieval}, year = {2024}, }
CCIR
LINKAGE: Listwise ranking among varied-quality references for non-factoid QA evaluation via LLMs

Sihui Yang , Keping Bi , Wanqing Cui , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2409.14744, 2024

Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. The commonly used automatic evaluation metrics like ROUGE or BERTScore cannot accurately measure semantic similarities or answers from different perspectives. Recently, Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparisons between answers. Inspired by the evolution from pointwise to pairwise to listwise in learning-to-rank methods, we propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality. Moreover, for NF questions that do not have multi-grade or any golden answers, we leverage LLMs to generate the reference answer list of various quality to facilitate the listwise evaluation. Extensive experimental results on three NFQA datasets, i.e., ANTIQUE, the TREC-DL-NF, and WebGLM show that our method has significantly higher correlations with human annotations compared to automatic scores and common pointwise and pairwise approaches.
@article{yang2024linkage, title = {Linkage: Listwise ranking among varied-quality references for non-factoid QA evaluation via LLMs}, author = {Yang, Sihui and Bi, Keping and Cui, Wanqing and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2409.14744}, year = {2024}, }
EMNLP (Findings)
Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning

Keping Bi , Xiaojie Sun , Jiafeng Guo , and Xueqi Cheng

In European Conference on Information Retrieval, 2024

Multi-aspect dense retrieval aims to incorporate aspect information (e.g., brand and category) into dual encoders to facilitate relevance matching. As an early and representative multi-aspect dense retriever, MADRAL learns several extra aspect embeddings and fuses the explicit aspects with an implicit aspect "OTHER" for final representation. MADRAL was evaluated on proprietary data and its code was not released, making it challenging to validate its effectiveness on other datasets. We failed to reproduce its effectiveness on the public MA-Amazon data, motivating us to probe the reasons and re-examine its components. We propose several component alternatives for comparisons, including replacing "OTHER" with "CLS" and representing aspects with the first several content tokens. Through extensive experiments, we confirm that learning "OTHER" from scratch in aspect fusion is harmful. In contrast, our proposed variants can greatly enhance the retrieval performance. Our research not only sheds light on the limitations of MADRAL but also provides valuable insights for future studies on more powerful multi-aspect dense retrieval models.
@inproceedings{bi2024reproducibility, title = {Reproducibility Analysis and Enhancements for Multi-aspect Dense Retriever with Aspect Learning}, author = {Bi, Keping and Sun, Xiaojie and Guo, Jiafeng and Cheng, Xueqi}, booktitle = {European Conference on Information Retrieval}, year = {2024}, }
ECIR
A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval

Xiaojie Sun , Keping Bi , Jiafeng Guo , Sihui Yang , Qishen Zhang , Zhongyi Liu , Guannan Zhang , and Xueqi Cheng

In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 2024

Dense retrieval methods have been mostly focused on unstructured text and less attention has been drawn to structured data with various aspects, e.g., products with aspects such as category and brand. Recent work has proposed two approaches to incorporate the aspect information into item representations for effective retrieval by predicting the values associated with the item aspects. Despite their efficacy, they treat the values as isolated classes (e.g., "Smart Homes", "Home, Garden & Tools", and "Beauty & Health") and ignore their fine-grained semantic relation. Furthermore, they either enforce the learning of aspects into the CLS token, which could confuse it from its designated use for representing the entire content semantics, or learn extra aspect embeddings only with the value prediction objective, which could be insufficient especially when there are no annotated values for an item aspect. Aware of these limitations, we propose a MUlti-granulaRity-aware Aspect Learning model (MURAL) for multi-aspect dense retrieval. It leverages aspect information across various granularities to capture both coarse and fine-grained semantic relations between values. Moreover, MURAL incorporates separate aspect embeddings as input to transformer encoders so that the masked language model objective can assist implicit aspect learning even without aspect-value annotations. Extensive experiments on two real-world datasets of products and mini-programs show that MURAL outperforms state-of-the-art baselines significantly.
@inproceedings{sun2024multi, title = {A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval}, author = {Sun, Xiaojie and Bi, Keping and Guo, Jiafeng and Yang, Sihui and Zhang, Qishen and Liu, Zhongyi and Zhang, Guannan and Cheng, Xueqi}, booktitle = {Proceedings of the 17th ACM International Conference on Web Search and Data Mining}, pages = {674--682}, year = {2024}, }
WSDM
MORE: Multi-mOdal REtrieval augmented generative commonsense reasoning

Wanqing Cui , Keping Bi , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2402.13625, 2024

Since commonsense information is recorded significantly less often than it occurs, language models pre-trained by text generation have difficulty learning sufficient commonsense knowledge. Several studies have leveraged text retrieval to augment the models’ commonsense ability. Unlike text, images capture commonsense information inherently, but little effort has been made to utilize them effectively. In this work, we propose a novel Multi-mOdal REtrieval (MORE) augmentation framework to leverage both text and images to enhance the commonsense ability of language models. Extensive experiments on the Common-Gen task have demonstrated the efficacy of MORE based on the pre-trained models of both single and multiple modalities.
@article{cui2024more, title = {MORE: Multi-mOdal REtrieval augmented generative commonsense reasoning}, author = {Cui, Wanqing and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2402.13625}, year = {2024}, }
ACL (Findings)
When Do LLMs Need Retrieval Augmentation? Mitigating LLMs’ Overconfidence Helps Retrieval Augmentation

Shiyu Ni , Keping Bi , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2402.11457, 2024

Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs’ hallucinations. However, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct RA all the time. A straightforward idea is to only conduct retrieval when LLMs are uncertain about a question. This motivates us to enhance the LLMs’ ability to perceive their knowledge boundaries to help RA. In this paper, we first quantitatively measure this ability of LLMs and confirm their overconfidence. Then, we study how LLMs’ certainty about a question correlates with their dependence on external retrieved information. We propose several methods to enhance LLMs’ perception of knowledge boundaries and show that they are effective in reducing overconfidence. Additionally, equipped with these methods, LLMs can achieve comparable or even better RA performance with far fewer retrieval calls.
@article{ni2024llms, title = {When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation}, author = {Ni, Shiyu and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2402.11457}, year = {2024}, }
ACL (Findings)

2023

A comparative study of training objectives for clarification facet generation

Shiyu Ni , Keping Bi , Jiafeng Guo , and Xueqi Cheng

In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2023

Due to the ambiguity and vagueness of a user query, it is essential to identify the query facets for the clarification of user intents. Existing work on query facet generation has achieved compelling performance by sequentially predicting the next facet given previously generated facets based on pre-trained language generation models such as BART. Given a query, there are mainly two types of training objectives to guide the facet generation models. One is to generate the default sequence of ground-truth facets, and the other is to enumerate all the permutations of ground-truth facets and use the sequence that has the minimum loss for model updates. The second is permutation-invariant while the first is not. In this paper, we conduct a systematic comparative study of various types of training objectives, which differ not only in whether they are permutation-invariant but also in whether they conduct sequential prediction and whether they can control the count of output facets. To this end, we propose three additional training objectives with different combinations of these properties. For comprehensive comparisons, besides the commonly used evaluation that measures the matching with ground-truth facets, we also introduce two diversity metrics to measure the diversity of the generated facets. Based on an open-domain query facet dataset, i.e., MIMICS, we conduct extensive analyses and show the pros and cons of each method, which could shed light on model training for clarification facet generation.
@inproceedings{ni2023comparative, title = {A comparative study of training objectives for clarification facet generation}, author = {Ni, Shiyu and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, booktitle = {Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region}, year = {2023}, }
SIGIR-AP
Pre-training with aspect-content text mutual prediction for multi-aspect dense retrieval

Xiaojie Sun , Keping Bi , Jiafeng Guo , Xinyu Ma , Yixing Fan , Hongyu Shan , Qishen Zhang , and Zhongyi Liu

In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023

Grounded on pre-trained language models (PLMs), dense retrieval has been studied extensively on plain text. In contrast, there has been little research on retrieving data with multiple aspects using dense models. In scenarios such as product search, aspect information plays an essential role in relevance matching, e.g., category: Electronics, Computers, and Pet Supplies. A common way of leveraging aspect information for multi-aspect retrieval is to introduce an auxiliary classification objective, i.e., using item contents to predict the annotated value IDs of item aspects. However, by learning the value embeddings from scratch, this approach may not capture the various semantic similarities between the values sufficiently. To address this limitation, we leverage the aspect information as text strings rather than class IDs during pre-training so that their semantic similarities can be naturally captured in the PLMs. To facilitate effective retrieval with the aspect strings, we propose mutual prediction objectives between the text of the item aspect and content. In this way, our model makes fuller use of aspect information than conducting undifferentiated masked language modeling (MLM) on the concatenated text of aspects and content. Extensive experiments on two real-world datasets (product and mini-program search) show that our approach can outperform competitive baselines both treating aspect values as classes and conducting the same MLM for aspect and content strings.
@inproceedings{sun2023pre, title = {Pre-training with aspect-content text mutual prediction for multi-aspect dense retrieval}, author = {Sun, Xiaojie and Bi, Keping and Guo, Jiafeng and Ma, Xinyu and Fan, Yixing and Shan, Hongyu and Zhang, Qishen and Liu, Zhongyi}, booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management}, year = {2023}, }
CIKM
L2R: Lifelong learning for first-stage retrieval with backward-compatible representations

Yinqiong Cai , Keping Bi , Yixing Fan , Jiafeng Guo , Wei Chen , and Xueqi Cheng

In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023

First-stage retrieval is a critical task that aims to retrieve relevant document candidates from a large-scale collection. While existing retrieval models have achieved impressive performance, they are mostly studied on static data sets, ignoring that in the real world, the data on the Web is continuously growing with potential distribution drift. Consequently, retrievers trained on static old data may not suit new-coming data well and inevitably produce sub-optimal results. In this work, we study lifelong learning for first-stage retrieval, especially focusing on the setting where the emerging documents are unlabeled since relevance annotation is expensive and may not keep up with data emergence. Under this setting, we aim to develop model updating with two goals: (1) to effectively adapt to the evolving distribution with the unlabeled new-coming data, and (2) to avoid re-inferring all embeddings of old documents to efficiently update the index each time the model is updated. We first formalize the task and then propose a novel Lifelong Learning method for the first-stage Retrieval, namely L^2R. L^2R adopts the typical memory mechanism for lifelong learning, and incorporates two crucial components: (1) selecting diverse support negatives for model training and memory updating for effective model adaptation, and (2) a ranking alignment objective to ensure the backward-compatibility of representations to save the cost of index rebuilding without hurting the model performance. For evaluation, we construct two new benchmarks from LoTTE and Multi-CPR datasets to simulate the document distribution drift in realistic retrieval scenarios. Extensive experiments show that L^2R significantly outperforms competitive lifelong learning baselines.
@inproceedings{cai2023l2r, title = {L2r: Lifelong learning for first-stage retrieval with backward-compatible representations}, author = {Cai, Yinqiong and Bi, Keping and Fan, Yixing and Guo, Jiafeng and Chen, Wei and Cheng, Xueqi}, booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management}, year = {2023}, }
CIKM
CIR at the NTCIR-17 ULTRE-2 task

Lulu Yu , Keping Bi , Jiafeng Guo , and Xueqi Cheng

arXiv preprint arXiv:2310.11852, 2023

The Chinese Academy of Sciences Information Retrieval team (CIR) has participated in the NTCIR-17 ULTRE-2 task. This paper describes our approaches and reports our results on the ULTRE-2 task. We recognize that the issue of false negatives in the Baidu search data in this competition is very severe, much more severe than position bias. Hence, we adopt the Dual Learning Algorithm (DLA) to address the position bias and use it as an auxiliary model to study how to alleviate the false negative issue. We approach the problem from two perspectives: 1) correcting the labels for non-clicked items with a relevance judgment model trained from DLA, and learning a new ranker that is initialized from DLA; 2) including random documents as true negatives and documents that have partial matching as hard negatives. Both methods can enhance the model performance and our best method has achieved nDCG@10 of 0.5355, which is 2.66% better than the best score from the organizer.
@article{yu2023cir, title = {CIR at the NTCIR-17 ULTRE-2 task}, author = {Yu, Lulu and Bi, Keping and Guo, Jiafeng and Cheng, Xueqi}, journal = {arXiv preprint arXiv:2310.11852}, year = {2023}, }
NTCIR
Ensemble Ranking Model with Multiple Pretraining Strategies for Web Search

Xiaojie Sun , Lulu Yu , Yiting Wang , Keping Bi , and Jiafeng Guo

arXiv preprint arXiv:2302.09340, 2023

An effective ranking model usually requires a large amount of training data to learn the relevance between documents and queries. User clicks are often used as training data since they can indicate relevance and are cheap to collect, but they contain substantial bias and noise. There has been some work on mitigating various types of bias in simulated user clicks to train effective learning-to-rank models based on multiple features. However, how to effectively use such methods on large-scale pre-trained models with real-world click data is unknown. To alleviate the data bias in the real world, we incorporate heuristic-based features, refine the ranking objective, add random negatives, and calibrate the propensity calculation in the pre-training stage. Then we fine-tune several pre-trained models and train an ensemble model to aggregate all the predictions from various pre-trained models with human-annotation data in the fine-tuning stage. Our approaches won 3rd place in the "Pre-training for Web Search" task in WSDM Cup 2023 and are 22.6% better than the 4th-ranked team.
@article{sun2023ensemble, title = {Ensemble Ranking Model with Multiple Pretraining Strategies for Web Search}, author = {Sun, Xiaojie and Yu, Lulu and Wang, Yiting and Bi, Keping and Guo, Jiafeng}, journal = {arXiv preprint arXiv:2302.09340}, year = {2023}, }
WSDM
Feature-enhanced network with hybrid debiasing strategies for unbiased learning to rank

Lulu Yu , Yiting Wang , Xiaojie Sun , Keping Bi , and Jiafeng Guo

arXiv preprint arXiv:2302.07530, 2023

Unbiased learning to rank (ULTR) aims to mitigate various biases existing in user clicks, such as position bias, trust bias, and presentation bias, and to learn an effective ranker. In this paper, we introduce our winning approach for the "Unbiased Learning to Rank" task in WSDM Cup 2023. We find that the provided data is severely biased, so neural models trained directly with the top 10 results with click information are unsatisfactory. We therefore extract multiple heuristic-based features for multi-fields of the results, adjust the click labels, add true negatives, and re-weight the samples during model training. Since the propensities learned by existing ULTR methods are not decreasing w.r.t. positions, we also calibrate the propensities according to the click ratios and ensemble the models trained in two different ways. Our method won the 3rd prize with a DCG@10 score of 9.80, which is 1.1% worse than the 2nd and 25.3% higher than the 4th.
@article{yu2023feature, title = {Feature-enhanced network with hybrid debiasing strategies for unbiased learning to rank}, author = {Yu, Lulu and Wang, Yiting and Sun, Xiaojie and Bi, Keping and Guo, Jiafeng}, journal = {arXiv preprint arXiv:2302.07530}, year = {2023}, }
WSDM