Yutao Mou (牟宇滔)

I am a first-year PhD student at KCL Lab, Peking University, supervised by Professors Wei Ye and Shikun Zhang. I received my Master's and B.S. degrees from Beijing University of Posts and Telecommunications in 2024 and 2021, respectively.

I have been working on building safe, reliable and scalable artificial intelligence systems. Currently, my research interests lie in two main areas:

(1) Safety Evaluation and Red Teaming of Large Language Models (LLMs):
We focus on building comprehensive safety evaluation benchmarks across various scenarios (e.g., text generation, code agents, multimodal tasks), designing jailbreak and automated red-teaming methods, and studying the interpretability of model vulnerabilities.

(2) Post-Training and Safety Alignment of LLMs:
Our goal is to incorporate safety-related data, reward signals, and training objectives into the post-training stage of LLMs to enhance safety without compromising general capabilities. We are particularly interested in safety alignment of large reasoning models, safety-oriented reward models, and the automated synthesis and data augmentation of safety fine-tuning data. In the future, we also plan to explore multimodal safety alignment.

Feel free to contact me to discuss research or potential collaboration.

Email / Scholar / Github

Publications

SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Preprint*
Code / Paper
In this study, we address the prevalent issue of shallow alignment in current LLM safety alignment methods by proposing the Safety-oriented Reasoning Optimization Framework (SaRO). SaRO introduces a System 2-style alignment to internalize a deliberative, slow-thinking reasoning approach, enabling models to reflect on safety policies more effectively. This framework enhances safety performance without compromising the general capabilities of the model.

Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective
Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
ACL*, 2025 (CCF-A)
Code / Paper
In this paper, we construct a benchmark dataset named CoV-Eval, which encompasses various task types and covers a wide range of code security vulnerabilities. We evaluate LLMs and various coding assistants from a code security perspective, assessing the security vulnerabilities in the code they generate as well as their capabilities in vulnerability detection and repair.

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou, Shikun Zhang, Wei Ye
NeurIPS*, 2024 (CCF-A)
Code / Paper
This paper introduces SG-Bench, a new benchmark for evaluating LLM safety, which pioneers the concept of safety generalization: assessing how variations in task and prompt types affect model safety. SG-Bench combines both generative and discriminative tasks and expands the test data to evaluate the impact of prompt engineering and jailbreak attacks.

UEGP: Unified Expert-Guided Pre-training for Knowledge Rekindle
Yutao Mou, Kexiang Wang, Jianhe Lin, Dehong Ma, Jun Fan, Daiting Shi, Zhicong Cheng, Gu Simiu, Dawei Yin, Weiran Xu
NAACL*, 2024 (Findings)
Code / Paper
This paper first proposes a new paradigm, knowledge rekindle, which aims to re-incorporate the fine-tuned expert model into the training cycle and break through the performance upper bounds of experts without introducing additional annotated data. We then propose a unified expert-guided pre-training (UEGP) framework for knowledge rekindle.

Decoupling Pseudo Label Disambiguation and Representation Learning for Generalized Intent Discovery
Yutao Mou, Xiaoshuai Song, Keqing He, Chen Zeng, Pei Wang, Jingang Wang, Yunsen Xian, Weiran Xu
ACL*, 2023 (CCF-A)
Code / Paper
This paper focuses on the generalized intent discovery task and proposes a decoupled prototype learning framework (DPL) to decouple pseudo label disambiguation and representation learning.

Watch the Neighbors: A Unified K-Nearest Neighbor Contrastive Learning Framework for OOD Intent Discovery
Yutao Mou, Keqing He, Pei Wang, Yanan Wu, Jingang Wang, Wei Wu, Weiran Xu
EMNLP, 2022 (CCF-B)
Code / Paper
This paper focuses on the new intent discovery and clustering task and proposes a unified K-nearest neighbor contrastive learning framework to discover OOD intents. Specifically, we design a novel K-nearest neighbor contrastive learning objective (KCL) for in-domain pre-training, and a hard negative mining strategy for self-supervised representation learning on unlabeled out-of-domain data.

UniNL: Aligning Representation Learning with Scoring Function for OOD Detection via Unified Neighborhood Learning
Yutao Mou, Pei Wang, Keqing He, Yanan Wu, Jingang Wang, Wei Wu, Weiran Xu
EMNLP*, 2022 (short paper)
Code / Paper
This paper focuses on the out-of-domain (OOD) intent detection task and proposes a unified neighborhood learning framework (UniNL) to detect OOD intents. Specifically, we design a K-nearest neighbor contrastive learning objective for training and introduce a KNN-based scoring function for confidence estimation. We aim to align the training objective with the confidence function used at the inference stage.

Generalized Intent Discovery: Learning from Open World Dialogue System
Yutao Mou, Keqing He, Yanan Wu, Pei Wang, Jingang Wang, Wei Wu, Yi Huang, Junlan Feng, Weiran Xu
COLING, 2022 (CCF-B)
Code / Paper
This paper defines a new task, Generalized Intent Discovery (GID), which aims to extend an in-domain (IND) intent classifier to an open-world intent set including IND and OOD intents. We hope to simultaneously classify a set of labeled IND intent classes while discovering and recognizing new unlabeled OOD types incrementally. We construct three public datasets for different application scenarios and propose two kinds of frameworks, pipeline-based and end-to-end, for future work.

Disentangled Knowledge Transfer for OOD Intent Discovery with Unified Contrastive Learning
Yutao Mou, Keqing He, Yanan Wu, Zhiyuan Zeng, Hong Xu, Huixing Jiang, Wei Wu, Weiran Xu
ACL*, 2022 (short paper)
Code / Paper
Discovering Out-of-Domain (OOD) intents is essential for developing new skills in a task-oriented dialogue system. The key challenge is how to transfer prior in-domain (IND) knowledge to OOD clustering. This paper proposes a decoupled knowledge transfer framework for new intent discovery and clustering, which unifies the two-stage learning process of in-domain and out-of-domain data into instance discrimination and clustering discrimination tasks and bridges the gap between in-domain and out-of-domain data.

Internships

SenseTime Research, Beijing, China. June 2023 - December 2023.
Research intern, working on large language models and hallucination correction.
Baidu Inc., Beijing, China. December 2022 - May 2023.
Research intern, focusing on search re-ranking and pre-training.

Selected Honors

China National Scholarship. Ministry of Education of P.R. China. 2023.
Excellent Graduate. Beijing University of Posts and Telecommunications. 2023.
Schlumberger Enterprise Scholarship. Beijing University of Posts and Telecommunications. 2022.
First Prize in the SereTOD Challenge 2022, Track 2. EMNLP 2022.
Excellent Graduate. Beijing University of Posts and Telecommunications. 2022.
China National Scholarship. Ministry of Education of P.R. China. 2020.
First Prize in National College Student Mathematics Competition. Chinese Mathematics League. 2018.

Service

Reviewer: EMNLP 2022, ACL 2023, EMNLP 2023, ARR

Design and source code from Jon Barron's website