Enhancing Multimodal Product Summarization through Claim-Based Evaluation and Preference Optimization

Song Xuemeng, Li Zhimo, Hou Bohan, et al.

doi:10.12141/j.issn.1000-565X.250375

Journal of South China University of Technology (Natural Science Edition)



Computer Science and Technology


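1. Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, Guangdong, China;

2. School of Computer Science, Peking University, Beijing 100871, China;

3. School of Computer Science and Technology, Shandong University, Qingdao 266237, Shandong, China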

Published online: 2026-01-23



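Abstract

Multimodal product summarization aims to generate concise, accurate summaries that highlight a product's core selling points from its textual and visual information. Existing approaches face two major challenges. First, lexical-overlap metrics such as ROUGE cannot reliably measure how well a summary conveys key product information. Second, the mainstream supervised fine-tuning paradigm struggles to capture users' implicit preferences about which claims (key product elements) a summary should emphasize, so the generated content drifts from actual needs. To address these issues, this paper proposes a claim-based summarization evaluation metric (CSE), which assesses how well a summary expresses key information along two dimensions: claim hit accuracy (CHA) and claim quantity ratio (CQR). It further introduces PAMPS, a preference-aligned multimodal product summarization model that aligns the model with preferences over product claims through four stages: supervised fine-tuning (SFT), summary resampling, CSE-driven preference pair construction, and direct preference optimization (DPO). Experiments on the large-scale Chinese e-commerce dataset CEPSUM show that PAMPS achieves clear overall gains on ROUGE: compared with SFT, DPO-ROUGE improves ROUGE-1, ROUGE-2, and ROUGE-L by 0.25, 0.44, and 1.21 on average, respectively, indicating higher overall generation quality. Under the CSE framework, DPO-CSE shows an especially pronounced overall gain in CHA, with an average increase of more than 4%, indicating that claim-oriented preference optimization effectively strengthens the model's ability to capture and express core product elements. These results validate the effectiveness and practical value of the proposed method for improving multimodal product summarization quality.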
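The abstract names the CSE dimensions and the preference-pair and DPO stages without giving formulas, so the following is only a minimal sketch under stated assumptions. It assumes CHA is the share of reference claims that the summary also contains, CQR is the number of claims in the summary divided by the number in the reference, and claims are detected by simple phrase matching against a hypothetical vocabulary. The paper's actual definitions and claim extractor may differ. The helper names `extract_claims`, `cse_scores`, and `build_preference_pair` are illustrative, and `dpo_loss` is the standard per-example DPO objective rather than a confirmed detail of PAMPS.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def extract_claims(text: str, vocabulary: Sequence[str]) -> Set[str]:
    """Hypothetical extractor: a claim is present if its phrase occurs in the text."""
    lowered = text.lower()
    return {claim for claim in vocabulary if claim.lower() in lowered}


def cse_scores(summary: str, reference: str,
               vocabulary: Sequence[str]) -> Tuple[float, float]:
    """Return (CHA, CQR) under the assumed definitions described above."""
    ref_claims = extract_claims(reference, vocabulary)
    gen_claims = extract_claims(summary, vocabulary)
    if not ref_claims:
        return 0.0, 0.0
    cha = len(gen_claims & ref_claims) / len(ref_claims)  # assumed: hit share
    cqr = len(gen_claims) / len(ref_claims)               # assumed: count ratio
    return cha, cqr


def build_preference_pair(candidates: Sequence[str], reference: str,
                          vocabulary: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Pick (chosen, rejected) from resampled summaries by assumed CSE ranking."""
    def key(summary: str) -> Tuple[float, float]:
        cha, cqr = cse_scores(summary, reference, vocabulary)
        # Rank by CHA first, then by how close CQR is to 1 (an assumption).
        return (cha, -abs(cqr - 1.0))

    ranked = sorted(candidates, key=key)
    if len(ranked) < 2 or key(ranked[0]) == key(ranked[-1]):
        return None  # no usable preference signal
    return ranked[-1], ranked[0]


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


vocabulary = ["cotton", "slim fit", "waterproof", "zipper"]
reference = "A cotton slim fit jacket with a waterproof shell."
candidates = [
    "A nice jacket with a zipper.",
    "A cotton jacket, waterproof and slim fit.",
    "A cotton jacket.",
]
pair = build_preference_pair(candidates, reference, vocabulary)
```

In this toy run the second candidate covers every reference claim and becomes the chosen summary, while the first covers none and becomes the rejected one. A real pipeline would replace the phrase matcher with a proper claim extractor and feed the resulting pairs to a DPO trainer.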

Keywords: multimodal large models; summarization evaluation; preference optimization

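Cite this article

Song Xuemeng, Li Zhimo, Hou Bohan, et al. Enhancing Multimodal Product Summarization through Claim-Based Evaluation and Preference Optimization[J]. Journal of South China University of Technology (Natural Science Edition), 0 : 1. DOI: 10.12141/j.issn.1000-565X.250375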

