semantic_indexing
Semantic
Time: 2023-09-11 14:14:30
from paddlenlp.data import Pad, Tuple

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # query_input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # query_segment
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # title_input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # title_segment
): [data for data in fn(samples)]
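This simply packs the data into tuples: each Pad is applied to the matching field of the samples, and the padded fields are returned together.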
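To make the collation step concrete, here is a minimal sketch in plain Python rather than with PaddleNLP's Pad and Tuple. The helper names pad_field and batchify, the pad value 0, and the toy token ids are assumptions for illustration only; the real classes return arrays and offer more options.

```python
def pad_field(sequences, pad_val):
    """Right-pad every sequence in the list to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def batchify(samples, pad_vals):
    """Regroup per-sample tuples into per-field padded batches."""
    fields = list(zip(*samples))  # transpose: one tuple per field
    return [pad_field(field, pad_val) for field, pad_val in zip(fields, pad_vals)]


# Two toy samples, each (query_input, query_segment, title_input, title_segment).
samples = [
    ([1, 5, 6, 2], [0, 0, 0, 0], [1, 7, 2], [0, 0, 0]),
    ([1, 8, 2], [0, 0, 0], [1, 9, 10, 11, 2], [0, 0, 0, 0, 0]),
]
query_input, query_segment, title_input, title_segment = batchify(
    samples, pad_vals=[0, 0, 0, 0])
print(query_input)
print(title_input)
```

Each field is padded only to the longest sequence of that field in the batch, so queries and titles can end up with different lengths.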
import paddle
import paddle.nn.functional as F

# Body of the model's forward method.
query_cls_embedding = self.get_pooled_embedding(
    query_input_ids, query_token_type_ids, query_position_ids,
    query_attention_mask)
title_cls_embedding = self.get_pooled_embedding(
    title_input_ids, title_token_type_ids, title_position_ids,
    title_attention_mask)
# Pairwise similarity of every query with every title in the batch.
cosine_sim = paddle.matmul(
    query_cls_embedding, title_cls_embedding, transpose_y=True)
# Subtract the margin from the cosine similarity of every positive pair
# (the diagonal of cosine_sim).
margin_diag = paddle.full(
    shape=[query_cls_embedding.shape[0]],
    fill_value=self.margin,
    dtype=paddle.get_default_dtype())
cosine_sim = cosine_sim - paddle.diag(margin_diag)
# Scale the cosine similarities to ease training convergence.
cosine_sim *= self.scale
# The positive for query i is title i, so the labels are 0..batch_size-1.
labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
labels = paddle.reshape(labels, shape=[-1, 1])
loss = F.cross_entropy(input=cosine_sim, label=labels)
return loss
Similarity is computed pairwise between every query and every title in the same batch.
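The loss above can be traced by hand on a small similarity matrix. The following is a sketch in plain Python, not the Paddle implementation: the function name in_batch_negative_loss, the margin of 0.2, the scale of 30, and the similarity values are all assumed for illustration.

```python
import math


def in_batch_negative_loss(sim, margin=0.2, scale=30.0):
    """Mean cross entropy over the rows of a square similarity matrix.

    Row i holds the similarities of query i to every title in the batch;
    the diagonal entry sim[i][i] is the positive pair.
    """
    n = len(sim)
    losses = []
    for i in range(n):
        # Subtract the margin from the positive only, then scale all logits.
        logits = [
            (sim[i][j] - (margin if i == j else 0.0)) * scale for j in range(n)
        ]
        # Numerically stable negative log-softmax of the positive (label = i).
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / n


# Hypothetical cosine similarities for a batch of 3 query/title pairs.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.1],
    [0.3, 0.2, 0.7],
]
print(in_batch_negative_loss(sim))
```

Because the margin lowers only the diagonal logits, the positive pair has to beat every in-batch negative by at least that margin before its loss term becomes small.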
The core idea of In-batch negatives is to maximize the number of negative samples, which works as a form of data augmentation.
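Core idea of In-batch negatives
The training data for the In-batch negatives strategy consists of semantically similar pairs. Below is a sample of training data with batch size = 4; each line holds a Source Text followed by its similar Target Text: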
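I lost my phone and want to replace it	I want to buy a new phone, any recommendations?
Looking for the complete 秋色之空 manga collection	Looking for the 秋色之空 complete manga collection
Japanese-learning software for phones	Phone software for learning Japanese
How to modify cars in Grand Theft Auto: Vice City	How do you modify cars in Grand Theft Auto: Vice City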
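The core of the In-batch negatives strategy is to update gradients against N negatives at once within a single batch: for each Source Text, the Target Texts paired with all the other Source Texts in the batch serve as negatives. In the example above, the first Source Text ("I lost my phone and want to replace it") has 1 positive, its own Target Text ("I want to buy a new phone, any recommendations?"), and 3 negatives, namely the Target Texts of the second, third, and fourth pairs.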
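The pairing rule can be sketched in a few lines of Python. The function name in_batch_examples and the placeholder labels S1..S4 and T1..T4 (standing for the Source and Target Texts of the four example pairs, in order) are assumptions made for this illustration.

```python
def in_batch_examples(pairs):
    """For each source text, return its positive and its in-batch negatives."""
    targets = [target for _, target in pairs]
    examples = []
    for i, (source, positive) in enumerate(pairs):
        # Every other pair's target text acts as a negative for this source.
        negatives = [t for j, t in enumerate(targets) if j != i]
        examples.append((source, positive, negatives))
    return examples


# Placeholders for the four pairs above: S1/T1 is the first pair, and so on.
pairs = [("S1", "T1"), ("S2", "T2"), ("S3", "T3"), ("S4", "T4")]
for source, positive, negatives in in_batch_examples(pairs):
    print(source, "positive:", positive, "negatives:", negatives)
```

With a batch of N pairs, every source text gets N - 1 negatives without encoding any extra text, because all N target embeddings are already computed for the batch.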
Core idea of HardestNeg
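The core of the HardestNeg strategy is to first mine the hardest-to-distinguish negative among all the negatives in a batch, and then update gradients based on that hardest negative alone. In the example above, the first Source Text ("I lost my phone and want to replace it") has 3 negatives, the Target Texts of the second, third, and fourth pairs. The hardest of these to distinguish is the third pair's Target Text ("Phone software for learning Japanese"), which also mentions phones. During training the model keeps mining hardest negatives like this one and updates gradients based on them.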
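The mining step can be illustrated on a small similarity matrix. This is a sketch in plain Python with made-up similarity values; the function name hardest_negatives is hypothetical. It skips the diagonal directly, whereas the Paddle code in this post reaches the same result by subtracting a large constant from the diagonal before taking the row maximum.

```python
def hardest_negatives(sim):
    """Return, per row, the column index and score of the hardest negative."""
    result = []
    for i, row in enumerate(sim):
        # Exclude the diagonal entry, which is the positive pair.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        result.append((j, score))
    return result


# Hypothetical similarities for a batch of 4; row i is source i against
# every target, and the diagonal holds the positive pairs.
sim = [
    [0.92, 0.05, 0.41, 0.10],
    [0.03, 0.95, 0.08, 0.12],
    [0.38, 0.06, 0.90, 0.09],
    [0.11, 0.14, 0.07, 0.97],
]
print(hardest_negatives(sim))
```

In this toy matrix the hardest negative for the first source is the third target, mirroring the phone example in the text.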
import paddle
import paddle.nn.functional as F

# Body of the model's forward method.
query_cls_embedding = self.get_pooled_embedding(
    query_input_ids, query_token_type_ids, query_position_ids,
    query_attention_mask)
title_cls_embedding = self.get_pooled_embedding(
    title_input_ids, title_token_type_ids, title_position_ids,
    title_attention_mask)
cosine_sim = paddle.matmul(
    query_cls_embedding, title_cls_embedding, transpose_y=True)
# Note: this takes the row maximum as the positive score, which equals the
# diagonal entry only when the positive pair is the most similar in its row.
pos_sim = paddle.max(cosine_sim, axis=-1)
# Subtract 10000 from all diagonal elements of cosine_sim so that the
# positive pair can never be picked as the hardest negative.
mask_score = paddle.full(
    shape=[query_cls_embedding.shape[0]],
    fill_value=10000,
    dtype=paddle.get_default_dtype())
tmp_cosine_sim = cosine_sim - paddle.diag(mask_score)
hardest_negative_sim = paddle.max(tmp_cosine_sim, axis=-1)
# A label of 1 means pos_sim should rank higher than hardest_negative_sim.
labels = paddle.full(
    shape=[query_cls_embedding.shape[0]],
    fill_value=1.0,
    dtype='float32')
loss = F.margin_ranking_loss(
    pos_sim, hardest_negative_sim, labels, margin=self.margin)
return loss
The loss is computed with margin_ranking_loss.
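As a rough guide to what this loss computes, the sketch below reimplements the formula max(0, -label * (input - other) + margin), averaged over the batch, in plain Python. It is meant to mirror the documented behaviour of Paddle's margin_ranking_loss with mean reduction, not to replace it; the similarity values and margins are made up.

```python
def margin_ranking_loss(pos_sim, neg_sim, labels, margin=0.0):
    """Mean of max(0, -label * (pos - neg) + margin) over the batch."""
    terms = [
        max(0.0, -y * (p - n) + margin)
        for p, n, y in zip(pos_sim, neg_sim, labels)
    ]
    return sum(terms) / len(terms)


# Hypothetical positive and hardest-negative similarities for a batch of 4.
pos_sim = [0.92, 0.95, 0.90, 0.97]
hardest_negative_sim = [0.41, 0.12, 0.38, 0.14]
labels = [1.0, 1.0, 1.0, 1.0]  # 1 means the positive should rank higher

print(margin_ranking_loss(pos_sim, hardest_negative_sim, labels, margin=0.2))
print(margin_ranking_loss(pos_sim, hardest_negative_sim, labels, margin=0.6))
```

With the smaller margin every positive already beats its hardest negative by more than the margin, so the loss is zero; with the larger margin the first and third rows fall short and contribute to the loss.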
This is metric learning; reference: https://blog.csdn.net/qq_15821487/article/details/119865995