计算机视觉与模式识别学术速递[12.16]

cs.CV 方向,今日共计57篇

Transformer(1篇)

【1】 Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos 标题:基于视觉变换的视频散列检索追查假视频来源 链接:https://arxiv.org/abs/2112.08117

作者:Pengfei Pei,Xianfeng Zhao,Jinchuan Li,Yun Cao,Xiaowei Yi 摘要:传统的假视频检测方法输出篡改图像的可能性值或可疑掩码。然而,这些无法解释的结果不能作为令人信服的证据。所以最好是追踪假视频的来源。传统的散列方法用于检索语义相似的图像,不能区分图像的细微差别。具体来说,溯源任务不同于传统的视频检索:要从相似的源视频中找出真实的那一个是一个挑战。我们设计了一种新颖的损失函数,即散列三元组损失(Hash Triplet Loss),以解决人物视频高度相似的问题:同一场景的不同角度,以及相似场景中的同一人物。我们提出了基于视觉Transformer的视频溯源与篡改定位(VTL)模型。在第一阶段中,我们使用ViTHash(VTL-T)来训练散列中心。然后,一个假视频被输入到ViTHash,ViTHash输出一个散列码。散列码用于从散列中心检索源视频。在第二阶段,将源视频和伪视频输入到生成器(VTL-L)。然后,屏蔽可疑区域以提供辅助信息。此外,我们构建了两个数据集:DFTL和DAVIS2016-TL。DFTL上的实验清楚地表明了我们的框架在类似视频源跟踪方面的优越性。特别是,VTL在DAVIS2016-TL上的性能与最先进的方法相当。我们的源代码和数据集已在GitHub上发布:https://github.com/lajlksdf/vtl 。 摘要:Conventional fake video detection methods output a possibility value or a suspected mask of tampering images. However, such unexplainable results cannot be used as convincing evidence. So it is better to trace the sources of fake videos. The traditional hashing methods are used to retrieve semantic-similar images, which can't discriminate the nuances of the image. Specifically, source tracing differs from traditional video retrieval: it is a challenge to find the real one from similar source videos. We designed a novel loss, Hash Triplet Loss, to solve the problem that the videos of people are very similar: the same scene with different angles, similar scenes with the same person. We propose Vision Transformer based models named Video Tracing and Tampering Localization (VTL). In the first stage, we train the hash centers by ViTHash (VTL-T). Then, a fake video is inputted to ViTHash, which outputs a hash code. The hash code is used to retrieve the source video from hash centers. In the second stage, the source video and fake video are inputted to generator (VTL-L). Then, the suspect regions are masked to provide auxiliary information. Moreover, we constructed two datasets: DFTL and DAVIS2016-TL. Experiments on DFTL clearly show the superiority of our framework in sources tracing of similar videos. In particular, the VTL also achieved comparable performance with state-of-the-art methods on DAVIS2016-TL. Our source code and datasets have been released on GitHub: https://github.com/lajlksdf/vtl.
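
代码示意:下面给出一个极简的三元组式哈希损失代码草图,仅用于说明"用三元组约束把同源视频的哈希码拉近、不同源的推远"这一思路;其中的码长、margin 取值和编码器结构均为演示用的假设,并非论文官方的 Hash Triplet Loss 实现。

```python
import torch
import torch.nn.functional as F

def hash_triplet_loss(anchor, positive, negative, margin=0.5):
    """Illustrative triplet-style loss on soft hash codes (NOT the paper's official loss).

    anchor/positive/negative: (B, K) real-valued codes in [-1, 1],
    e.g. tanh outputs of a ViT-based hashing encoder (assumed setup).
    """
    d_pos = (anchor - positive).pow(2).mean(dim=1)   # pull codes of the same source video together
    d_neg = (anchor - negative).pow(2).mean(dim=1)   # push codes of different source videos apart
    return F.relu(d_pos - d_neg + margin).mean()

# toy usage with hypothetical 128-bit codes, 4 videos per role
codes = torch.tanh(torch.randn(3, 4, 128))
loss = hash_triplet_loss(codes[0], codes[1], codes[2])
print(loss.item())
```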

检测相关(7篇)

【1】 Reliable Multi-Object Tracking in the Presence of Unreliable Detections 标题:存在不可靠检测情况下的可靠多目标跟踪 链接:https://arxiv.org/abs/2112.08345

作者:Travis Mandel,Mark Jimenez,Emily Risley,Taishi Nammoto,Rebekka Williams,Max Panoff,Meynard Ballesteros,Bobbie Suarez 备注:12 pages, 5 figures, 6 tables 摘要:最近的多目标跟踪(MOT)系统利用了高精度的目标检测器;然而,训练这样的探测器需要大量的标记数据。尽管这类数据广泛适用于人类和交通工具,但对于其他动物物种而言,这类数据明显更为稀缺。我们提出了鲁棒置信跟踪(RCT),一种设计用于在检测质量较差时保持鲁棒性能的算法。与先前丢弃检测置信信息的方法相比,RCT采用了一种根本不同的方法,依靠精确的检测置信值来初始化轨迹、扩展轨迹和过滤轨迹。特别是,RCT能够通过有效地使用低置信度检测(以及单个对象跟踪器)来保持对对象的连续跟踪,从而最小化身份切换。为了在存在不可靠检测的情况下评估跟踪器,我们提出了一个具有挑战性的真实世界水下鱼类跟踪数据集FISHTRAC。在对FISHTRAC和UA-DETRAC数据集的评估中,我们发现RCT在检测不完善时优于其他算法,包括最先进的深度单目标和多目标跟踪器以及更经典的方法。具体而言,RCT在成功返回所有序列结果的方法中具有最佳的平均HOTA,并且与其他方法相比,具有显著更少的标识切换。 摘要:Recent multi-object tracking (MOT) systems have leveraged highly accurate object detectors; however, training such detectors requires large amounts of labeled data. Although such data is widely available for humans and vehicles, it is significantly more scarce for other animal species. We present Robust Confidence Tracking (RCT), an algorithm designed to maintain robust performance even when detection quality is poor. In contrast to prior methods which discard detection confidence information, RCT takes a fundamentally different approach, relying on the exact detection confidence values to initialize tracks, extend tracks, and filter tracks. In particular, RCT is able to minimize identity switches by efficiently using low-confidence detections (along with a single object tracker) to keep continuous track of objects. To evaluate trackers in the presence of unreliable detections, we present a challenging real-world underwater fish tracking dataset, FISHTRAC. In an evaluation on FISHTRAC as well as the UA-DETRAC dataset, we find that RCT outperforms other algorithms when provided with imperfect detections, including state-of-the-art deep single and multi-object trackers as well as more classic approaches. Specifically, RCT has the best average HOTA across methods that successfully return results for all sequences, and has significantly less identity switches than other methods.
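
代码示意:为说明"直接利用检测置信度来初始化、扩展和过滤轨迹"的思路,下面给出一个高度简化的流程草图;其中的 IoU 匹配方式和各个阈值均为演示用的假设,并不代表 RCT 的真实实现。

```python
# Minimal sketch of confidence-driven track management (not the official RCT code).
# The IoU matcher and all thresholds below are assumptions for illustration only.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def update_tracks(tracks, detections, init_conf=0.7, extend_conf=0.2, iou_thr=0.3):
    """tracks: list of dicts {'box', 'score'}; detections: list of (box, confidence)."""
    for box, conf in detections:
        best = max(tracks, key=lambda t: iou(t['box'], box), default=None)
        if best is not None and iou(best['box'], box) > iou_thr and conf >= extend_conf:
            best['box'], best['score'] = box, conf          # extend with low-confidence detections too
        elif conf >= init_conf:
            tracks.append({'box': box, 'score': conf})      # only high-confidence detections start tracks
    # filter: drop tracks whose latest confidence is very low
    return [t for t in tracks if t['score'] >= extend_conf]

tracks = update_tracks([], [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.15)])
print(tracks)
```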

【2】 Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental Study 标题:检测物体状态与检测物体:一种新的数据集和定量实验研究 链接:https://arxiv.org/abs/2112.08281

作者:Filippos Gouidis,Theodoris Patkos,Antonis Argyros,Dimitris Plexousakis 摘要:图像中目标状态的检测(State detection-SD)是一个具有理论和实际意义的问题,它与其他重要的计算机视觉问题紧密交织在一起,如动作识别和启示检测。它还与任何需要在动态领域进行推理和行动的实体高度相关,例如机器人系统和智能代理。尽管这个问题很重要,但到目前为止,对这个问题的研究还很有限。在本文中,我们试图对SD问题进行系统的研究。首先,我们介绍了对象状态检测数据集(OSDD),这是一个新的公开数据集,包含18个对象类别和9个状态类的19000多个注释。第二,使用用于对象检测(OD)的标准深度学习框架,我们进行了一些适当设计的实验,以深入研究SD问题的行为。这项研究能够建立SD性能的基线,以及在各种情况下与OD相比的相对性能。总的来说,实验结果证实SD比OD更难,需要开发定制的SD方法来有效解决这一重大问题。 摘要:The detection of object states in images (State Detection - SD) is a problem of both theoretical and practical importance and it is tightly interwoven with other important computer vision problems, such as action recognition and affordance detection. It is also highly relevant to any entity that needs to reason and act in dynamic domains, such as robotic systems and intelligent agents. Despite its importance, up to now, the research on this problem has been limited. In this paper, we attempt a systematic study of the SD problem. First, we introduce the Object State Detection Dataset (OSDD), a new publicly available dataset consisting of more than 19,000 annotations for 18 object categories and 9 state classes. Second, using a standard deep learning framework used for Object Detection (OD), we conduct a number of appropriately designed experiments, towards an in-depth study of the behavior of the SD problem. This study enables the setup of a baseline on the performance of SD, as well as its relative performance in comparison to OD, in a variety of scenarios. Overall, the experimental outcomes confirm that SD is harder than OD and that tailored SD methods need to be developed for addressing effectively this significant problem.

【3】 Interpretable Feature Learning Framework for Smoking Behavior Detection 标题:用于吸烟行为检测的可解释特征学习框架 链接:https://arxiv.org/abs/2112.08178

作者:Nakayiza Hellen,Ggaliwango Marvin 备注:15 pages 摘要:事实证明,在公共场所吸烟对不吸烟者的危害更大,这使其成为一个巨大的公共卫生问题,迫切需要当局采取积极措施并予以关注。随着世界迈向第四次工业革命,有必要对智能城市内外的这种有害的醉人行为采取可靠的环保检测措施。我们开发了一个用于吸烟行为检测的可解释特征学习框架,该框架利用深度学习VGG-16预训练网络预测和分类输入图像类别,并利用分层相关传播(LRP)解释基于最相关学习特征的网络检测或吸烟行为预测像素或神经元。网络的分类决策主要基于口部的特征,尤其是烟雾对网络决策的重要性。烟雾的轮廓突出显示为相应类别的证据。一些元素被视为对烟雾神经元有负面影响,因此会以不同方式突出显示。有趣的是,网络根据图像区域区分重要和不重要的特征。该技术还可以检测其他可吸烟药物,如大麻、什叶草、大麻等。该框架允许根据政府的监管健康政策,在学校、购物中心、公交车站、铁路车厢或其他违规吸烟场所等不安全区域可靠地识别基于行动的吸烟者。在吸烟区安装明确的装置后,这项技术可以检测到范围之外的吸烟者。 摘要:Smoking in public has been proven to be more harmful to nonsmokers, making it a huge public health concern with urgent need for proactive measures and attention by authorities. With the world moving towards the 4th Industrial Revolution, there is a need for reliable eco-friendly detective measures towards this harmful intoxicating behavior to public health in and out of smart cities. We developed an Interpretable feature learning framework for smoking behavior detection which utilizes a Deep Learning VGG-16 pretrained network to predict and classify the input Image class and a Layer-wise Relevance Propagation (LRP) to explain the network detection or prediction of smoking behavior based on the most relevant learned features or pixels or neurons. The network's classification decision is based mainly on features located at the mouth especially the smoke seems to be of high importance to the network's decision. The outline of the smoke is highlighted as evidence for the corresponding class. Some elements are seen as having a negative effect on the smoke neuron and are consequently highlighted differently. It is interesting to see that the network distinguishes important from unimportant features based on the image regions. The technology can also detect other smokeable drugs like weed, shisha, marijuana etc. The framework allows for reliable identification of action-based smokers in unsafe zones like schools, shopping malls, bus stops, railway compartments or other violated places for smoking as per the government's regulatory health policies. With installation clearly defined in smoking zones, this technology can detect smokers out of range.

【4】 Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions 标题:图像自适应YOLO在恶劣天气条件下的目标检测 链接:https://arxiv.org/abs/2112.08088

作者:Wenyu Liu,Gaofeng Ren,Runsheng Yu,Shi Guo,Jianke Zhu,Lei Zhang 备注:Accepted by AAAI 2022, Preprint version with Appendix 摘要:尽管基于深度学习的目标检测方法在传统数据集上取得了令人满意的结果,但从恶劣天气条件下拍摄的低质量图像中定位目标仍然具有挑战性。现有的方法要么难以平衡图像增强和目标检测的任务,要么常常忽略有利于检测的潜在信息。为了缓解这一问题,我们提出了一种新的图像自适应YOLO(IA-YOLO)框架,其中每个图像都可以自适应增强以获得更好的检测性能。具体而言,提出了一种可微图像处理(DIP)模块,以考虑YOLO探测器的不利天气条件,其参数由小型卷积神经网络(CNN-PP)预测。我们以端到端的方式联合学习CNN-PP和YOLOv3,这确保了CNN-PP可以学习适当的DIP来增强图像,以便以弱监督的方式进行检测。我们提出的IA-YOLO方法可以在正常和恶劣天气条件下自适应处理图像。实验结果非常令人鼓舞,证明了我们提出的IA-YOLO方法在雾天和弱光情况下的有效性。 摘要:Though deep learning-based object detection methods have achieved promising results on the conventional datasets, it is still challenging to locate objects from the low-quality images captured in adverse weather conditions. The existing methods either have difficulties in balancing the tasks of image enhancement and object detection, or often ignore the latent information beneficial for detection. To alleviate this problem, we propose a novel Image-Adaptive YOLO (IA-YOLO) framework, where each image can be adaptively enhanced for better detection performance. Specifically, a differentiable image processing (DIP) module is presented to take into account the adverse weather conditions for YOLO detector, whose parameters are predicted by a small convolutional neural network (CNN-PP). We learn CNN-PP and YOLOv3 jointly in an end-to-end fashion, which ensures that CNN-PP can learn an appropriate DIP to enhance the image for detection in a weakly supervised manner. Our proposed IA-YOLO approach can adaptively process images in both normal and adverse weather conditions. The experimental results are very encouraging, demonstrating the effectiveness of our proposed IA-YOLO method in both foggy and low-light scenarios.
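
代码示意:下面用一个很小的例子说明"由小型 CNN 预测可微图像处理参数"的做法;这里的 CNN-PP 只有两层卷积,且只预测伽马与增益两个参数,仅为示意,并非 IA-YOLO 实际的 DIP 模块。

```python
import torch
import torch.nn as nn

class TinyCNNPP(nn.Module):
    """Toy parameter predictor: outputs per-image gamma and gain (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        p = self.head(self.features(x).flatten(1))
        gamma = torch.sigmoid(p[:, 0]) * 2 + 0.5     # gamma in (0.5, 2.5)
        gain = torch.sigmoid(p[:, 1]) * 1.5 + 0.5    # gain in (0.5, 2.0)
        return gamma, gain

def dip_enhance(img, gamma, gain):
    """Differentiable enhancement: gain * img ** gamma, applied per image."""
    gamma = gamma.view(-1, 1, 1, 1)
    gain = gain.view(-1, 1, 1, 1)
    return (gain * img.clamp(min=1e-6) ** gamma).clamp(0, 1)

imgs = torch.rand(2, 3, 64, 64)                      # pretend low-light inputs in [0, 1]
gamma, gain = TinyCNNPP()(imgs)
enhanced = dip_enhance(imgs, gamma, gain)            # would then be fed to the detector
print(enhanced.shape)
```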

【5】 MissMarple : A Novel Socio-inspired Feature-transfer Learning Deep Network for Image Splicing Detection 标题:MissMarple:一种新颖的基于社会启发的特征迁移学习深度网络图像拼接检测方法 链接:https://arxiv.org/abs/2112.08018

作者:Angelina L. Gokhale,Dhanya Pramod,Sudeep D. Thepade,Ravi Kulkarni 备注:27 pages, 6 figures and 15 tables 摘要:在本文中,我们提出了一种新的社会启发卷积神经网络(CNN)图像拼接检测深度学习模型。基于从粗拼接图像区域的检测中学习可以提高视觉上不可察觉的精细拼接图像伪造的检测的前提,所提出的模型称为MissMarple,是一个包含特征转移学习的双CNN网络。使用基准数据集(如Columbia Splicting、WildWeb、DSO1)和一个名为AbhAS的拟议数据集(包含真实拼接伪造)对拟议模型进行训练和测试,结果表明,与现有深度学习模型相比,检测精度有所提高。 摘要:In this paper we propose a novel socio-inspired convolutional neural network (CNN) deep learning model for image splicing detection. Based on the premise that learning from the detection of coarsely spliced image regions can improve the detection of visually imperceptible finely spliced image forgeries, the proposed model referred to as, MissMarple, is a twin CNN network involving feature-transfer learning. Results obtained from training and testing the proposed model using the benchmark datasets like Columbia splicing, WildWeb, DSO1 and a proposed dataset titled AbhAS consisting of realistic splicing forgeries revealed improvement in detection accuracy over the existing deep learning models.

【6】 A Comparative Analysis of Machine Learning Approaches for Automated Face Mask Detection During COVID-19 标题:用于冠状病毒自动口罩检测的机器学习方法的比较分析 链接:https://arxiv.org/abs/2112.07913

作者:Junaed Younus Khan,Md Abdullah Al Alamin 摘要:世界卫生组织(WHO)建议戴口罩作为防止COVID-19传播的最有效措施之一。在许多国家,现在必须戴口罩,特别是在公共场所。由于口罩佩戴的手动监控在人群中间常常是不可行的,因此自动检测是有益的。为了促进这一点,我们探索了许多用于口罩检测的深度学习模型(即VGG1、VGG19、ResNet50),并在两个基准数据集上对其进行了评估。在此背景下,我们还评估了迁移学习(即VGG19、在ImageNet上预先训练的ResNet50)。我们发现,虽然所有模型的性能都很好,但迁移学习模型的性能最好。迁移学习可将性能提高0.10%-0.40%,且训练时间减少30%。我们的实验还表明,对于测试数据集来自不同分布的真实情况,这些高性能模型并不十分健壮。如果不进行任何微调,这些模型在跨域设置下的性能将下降47%。 摘要:The World Health Organization (WHO) has recommended wearing face masks as one of the most effective measures to prevent COVID-19 transmission. In many countries, it is now mandatory to wear face masks, specially in public places. Since manual monitoring of face masks is often infeasible in the middle of the crowd, automatic detection can be beneficial. To facilitate that, we explored a number of deep learning models (i.e., VGG1, VGG19, ResNet50) for face-mask detection and evaluated them on two benchmark datasets. We also evaluated transfer learning (i.e., VGG19, ResNet50 pre-trained on ImageNet) in this context. We find that while the performances of all the models are quite good, transfer learning models achieve the best performance. Transfer learning improves the performance by 0.10%-0.40% with 30% less training time. Our experiment also shows these high-performing models are not quite robust for real-world cases where the test dataset comes from a different distribution. Without any fine-tuning, the performance of these models drops by 47% in cross-domain settings.
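
代码示意:下面是一个迁移学习微调的最小示例(以 torchvision 的 ImageNet 预训练 ResNet-50 为例):冻结主干、替换分类头做"戴口罩/未戴口罩"二分类。数据与超参数均为假设,并非论文的实验配置。

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained ResNet-50 and freeze the backbone (feature extractor).
model = models.resnet50(pretrained=True)
for p in model.parameters():
    p.requires_grad = False

# Replace the classifier head for 2 classes: mask / no-mask.
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch standing in for a real face-mask dataset loader (assumption).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))

logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(loss.item())
```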

【7】 Revisiting 3D Object Detection From an Egocentric Perspective 标题:从自我中心的视角重新审视三维目标检测 链接:https://arxiv.org/abs/2112.07787

作者:Boyang Deng,Charles R. Qi,Mahyar Najibi,Thomas Funkhouser,Yin Zhou,Dragomir Anguelov 备注:Published in NeurIPS 2021 摘要:三维目标检测是自动驾驶等安全关键机器人应用的关键模块。对于这些应用程序,我们最关心的是检测如何影响自我代理的行为和安全(以自我为中心的观点)。直观地说,当物体的几何结构更容易干扰自我主体的运动轨迹时,我们会寻求更精确的描述。然而,当前的检测指标基于联合上的盒交叉(IoU),是以对象为中心的,不能捕获对象和自我代理之间的时空关系。为了解决这个问题,我们提出了一种新的以自我为中心的三维目标检测方法,即支持距离误差(SDE)。基于SDE的分析表明,以自我为中心的检测质量受包围盒的粗糙几何结构的限制。鉴于SDE将受益于更精确的几何描述,我们建议将对象表示为amodal轮廓,特别是amodal星形多边形,并设计一个简单的模型StarPoly来预测此类轮廓。我们在大规模Waymo开放数据集上的实验表明,与IoU相比,SDE更好地反映了检测质量对ego代理安全性的影响;与最近的3D目标检测器相比,StarPoly估计的轮廓持续改善了以自我为中心的检测质量。 摘要:3D object detection is a key module for safety-critical robotics applications such as autonomous driving. For these applications, we care most about how the detections affect the ego-agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of object geometry when it's more likely to interfere with the ego-agent's motion trajectory. However, current detection metrics, based on box Intersection-over-Union (IoU), are object-centric and aren't designed to capture the spatio-temporal relationship between objects and the ego-agent. To address this issue, we propose a new egocentric measure to evaluate 3D object detection, namely Support Distance Error (SDE). Our analysis based on SDE reveals that the egocentric detection quality is bounded by the coarse geometry of the bounding boxes. Given the insight that SDE would benefit from more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego-agent's safety compared to IoU; and the estimated contours from StarPoly consistently improve the egocentric detection quality over recent 3D object detectors.

分类|识别相关(6篇)

【1】 A Factorization Approach for Motor Imagery Classification 标题:一种用于运动想象分类的因子分解方法 链接:https://arxiv.org/abs/2112.08175

作者:Byeong-Hoo Lee,Jeong-Hyun Cho,Byung-Hee Kwon 备注:4 pages 摘要:脑-机接口使用大脑信号与外部设备进行通信,无需实际控制。基于机器学习的运动想象分类已经有很多研究。然而,对具有稀疏空间特征的想象数据(如单臂运动想象)进行分类仍然是一个挑战。在本文中,我们提出了一种方法,将脑电信号分解为两组,以便在空间特征稀疏的情况下对运动想象进行分类。在对抗学习的基础上,我们着重提取脑电信号中对噪声鲁棒的共同特征,并只提取信号本身的特征。此外,还提取了用于分类的特定于类的特征。最后,该方法通过将两个组的特征表示为一个嵌入空间来对类进行分类。通过实验,我们证实了将特征分为两组对包含稀疏空间特征的数据集有利的可行性。 摘要:Brain-computer interface uses brain signals to communicate with external devices without actual control. Many studies have been conducted to classify motor imagery based on machine learning. However, classifying imagery data with sparse spatial characteristics, such as single-arm motor imagery, remains a challenge. In this paper, we proposed a method to factorize EEG signals into two groups to classify motor imagery even if spatial features are sparse. Based on adversarial learning, we focused on extracting common features of EEG signals which are robust to noise and extracting only signal features. In addition, class-specific features were extracted which are specialized for class classification. Finally, the proposed method classifies the classes by representing the features of the two groups as one embedding space. Through experiments, we confirmed the feasibility that extracting features into two groups is advantageous for datasets that contain sparse spatial features.

【2】 A learning-based approach to feature recognition of Engineering shapes 标题:一种基于学习的工程形状特征识别方法 链接:https://arxiv.org/abs/2112.07962

作者:Lakshmi Priya Muraleedharan,Ramanathan Muthuganapathy 摘要:在本文中,我们提出了一种机器学习方法来识别CAD网格模型中的孔、槽等工程形状特征。随着数字存档、3D打印、零部件扫描和逆向工程等新制造技术的出现,CAD数据以网格模型表示的形式大量增加。随着网格模型中节点和边的数量增加以及存在噪声的可能性,基于图的方法的直接应用不仅成本高昂,而且难以针对噪声数据进行调整。因此,这就需要为以网格形式表示的CAD模型设计新的特征识别方法。这里,我们展示了一个离散版本的高斯映射可以作为特征学习的特征。我们表明,这种方法不仅需要更少的内存需求,而且训练时间也非常少。由于不涉及网络体系结构,超参数的数量要少得多,并且可以在更快的时间内进行调整。识别精度也非常类似于使用三维卷积神经网络(CNN)获得的识别精度,但运行时间和存储要求要少得多。与其他非网络机器学习方法进行了比较,结果表明我们的方法具有最高的准确性。我们还展示了从公共基准测试中获得的具有多个特征以及复杂/交互特征的CAD模型的识别结果。处理噪声数据的能力也得到了证明。 摘要:In this paper, we propose a machine learning approach to recognise engineering shape features such as holes, slots, etc. in a CAD mesh model. With the advent of digital archiving, newer manufacturing techniques such as 3D printing, scanning of components and reverse engineering, CAD data is proliferated in the form of mesh model representation. As the number of nodes and edges become larger in a mesh model as well as the possibility of presence of noise, direct application of graph-based approaches would not only be expensive but also difficult to be tuned for noisy data. Hence, this calls for newer approaches to be devised for feature recognition for CAD models represented in the form of mesh. Here, we show that a discrete version of Gauss map can be used as a signature for a feature learning. We show that this approach not only requires fewer memory requirements but also the training time is quite less. As no network architecture is involved, the number of hyperparameters are much lesser and can be tuned in a much faster time. The recognition accuracy is also very similar to that of the one obtained using 3D convolutional neural networks (CNN) but in much lesser running time and storage requirements. A comparison has been done with other non-network based machine learning approaches to show that our approach has the highest accuracy. We also show the recognition results for CAD models having multiple features as well as complex/interacting features obtained from public benchmarks. The ability to handle noisy data has also been demonstrated.
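
代码示意:下面给出"离散高斯映射作为签名"思路的一个简化示例:计算三角网格每个面的单位法向量,并在球面角度网格上做面积加权直方图,得到可直接喂给分类器的固定长度特征。网格表示方式与直方图分辨率均为假设,并非论文的具体实现。

```python
import numpy as np

def gauss_map_signature(vertices, faces, bins=(8, 16)):
    """Area-weighted histogram of face normals on the unit sphere (illustrative signature).

    vertices: (N, 3) float array; faces: (M, 3) int array of vertex indices.
    """
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    cross = np.cross(v1 - v0, v2 - v0)
    area = 0.5 * np.linalg.norm(cross, axis=1)
    normals = cross / (np.linalg.norm(cross, axis=1, keepdims=True) + 1e-12)

    theta = np.arccos(np.clip(normals[:, 2], -1.0, 1.0))        # polar angle in [0, pi]
    phi = np.arctan2(normals[:, 1], normals[:, 0])              # azimuth in [-pi, pi]
    hist, _, _ = np.histogram2d(theta, phi, bins=bins,
                                range=[[0, np.pi], [-np.pi, np.pi]], weights=area)
    return (hist / (hist.sum() + 1e-12)).ravel()                # normalized fixed-length feature

# toy mesh: a unit square split into two triangles
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
print(gauss_map_signature(verts, faces).shape)   # (128,) with the assumed 8x16 bins
```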

【3】 From Noise to Feature: Exploiting Intensity Distribution as a Novel Soft Biometric Trait for Finger Vein Recognition 标题:从噪声到特征:利用强度分布作为一种新的软生物特征进行手指静脉识别 链接:https://arxiv.org/abs/2112.07931

作者:Wenxiong Kang,Yuting Lu,Dejian Li,Wei Jia 备注:None 摘要:尽管同时忽略了手指组织形成的强度分布,并且在某些情况下将其作为背景噪声处理,但大多数手指静脉特征提取算法由于其纹理表示能力而获得了令人满意的性能。在本文中,我们利用这种噪声作为一种新的软生物特征来实现更好的手指静脉识别性能。首先,对手指静脉成像原理和图像特征进行了详细分析,表明可以将手指组织在背景中形成的强度分布提取为软生物特征进行识别。然后,提出了两种手指静脉背景层提取算法和三种软生物特征提取算法,用于强度分布特征提取。最后,提出了一种混合匹配策略,解决了原始生物特征和软生物特征在分数水平上的维度差异问题。在三个openaccess数据库上进行的一系列严格对比实验表明,本文提出的方法是可行和有效的。 摘要:Most finger vein feature extraction algorithms achieve satisfactory performance due to their texture representation abilities, despite simultaneously ignoring the intensity distribution that is formed by the finger tissue, and in some cases, processing it as background noise. In this paper, we exploit this kind of noise as a novel soft biometric trait for achieving better finger vein recognition performance. First, a detailed analysis of the finger vein imaging principle and the characteristics of the image are presented to show that the intensity distribution that is formed by the finger tissue in the background can be extracted as a soft biometric trait for recognition. Then, two finger vein background layer extraction algorithms and three soft biometric trait extraction algorithms are proposed for intensity distribution feature extraction. Finally, a hybrid matching strategy is proposed to solve the issue of dimension difference between the primary and soft biometric traits on the score level. A series of rigorous contrast experiments on three open-access databases demonstrates that our proposed method is feasible and effective for finger vein recognition.

【4】 Imagine by Reasoning: A Reasoning-Based Implicit Semantic Data Augmentation for Long-Tailed Classification 标题:推理想象:一种基于推理的长尾分类隐式语义数据增强 链接:https://arxiv.org/abs/2112.07928

作者:Xiaohua Chen,Yucan Zhou,Dayan Wu,Wanqian Zhang,Yu Zhou,Bo Li,Weiping Wang 备注:9 pages, 5 figures 摘要:现实世界中的数据通常遵循长尾分布,这使得现有分类算法的性能严重下降。一个关键问题是尾部类别中的样本无法描述其类内多样性。人类可以用他们的先验知识想象新姿势、场景和视角的样本,即使这是第一次看到这个类别。受此启发,我们提出了一种新的基于推理的隐式语义数据扩充方法,以借用其他类的转换方向。由于每个类别的协方差矩阵表示特征变换方向,因此我们可以从相似类别中采样新方向以生成完全不同的实例。具体来说,首先采用长尾分布数据来训练主干和分类器。然后,估计每个类别的协方差矩阵,并构造知识图来存储任意两个类别的关系。最后,通过传播知识图中所有相似类别的信息,自适应地增强尾部样本。在CIFAR-100-LT、ImageNet LT和iNaturalist 2018上的实验结果表明,与最先进的方法相比,我们提出的方法是有效的。 摘要:Real-world data often follows a long-tailed distribution, which makes the performance of existing classification algorithms degrade heavily. A key issue is that samples in tail categories fail to depict their intra-class diversity. Humans can imagine a sample in new poses, scenes, and view angles with their prior knowledge even if it is the first time to see this category. Inspired by this, we propose a novel reasoning-based implicit semantic data augmentation method to borrow transformation directions from other classes. Since the covariance matrix of each category represents the feature transformation directions, we can sample new directions from similar categories to generate definitely different instances. Specifically, the long-tailed distributed data is first adopted to train a backbone and a classifier. Then, a covariance matrix for each category is estimated, and a knowledge graph is constructed to store the relations of any two categories. Finally, tail samples are adaptively enhanced via propagating information from all the similar categories in the knowledge graph. Experimental results on CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 have demonstrated the effectiveness of our proposed method compared with the state-of-the-art methods.
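
代码示意:下面几行代码示意"从相似(头部)类别的协方差中采样变换方向来增强尾部类特征"的核心想法;相似类别的选取(论文中来自知识图谱)与采样强度在这里均为简化假设。

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_tail_features(tail_feats, similar_class_feats, n_new=5, strength=1.0):
    """Sample transformation directions from a similar (head) class covariance.

    tail_feats: (n, d) features of a tail class; similar_class_feats: (m, d) features
    of a semantically similar head class, assumed to be picked e.g. via a knowledge graph.
    """
    cov = np.cov(similar_class_feats, rowvar=False)              # directions of feature variation
    anchors = tail_feats[rng.integers(0, len(tail_feats), n_new)]
    directions = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_new)
    return anchors + strength * directions                       # new "imagined" tail samples

head = rng.normal(size=(200, 16))          # plenty of head-class features
tail = rng.normal(size=(5, 16))            # very few tail-class features
print(augment_tail_features(tail, head).shape)                   # (5, 16)
```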

【5】 Temporal Shuffling for Defending Deep Action Recognition Models against Adversarial Attacks 标题:用于防御敌方攻击的深层动作识别模型的时间洗牌 链接:https://arxiv.org/abs/2112.07921

作者:Jaehui Hwang,Huan Zhang,Jun-Ho Choi,Cho-Jui Hsieh,Jong-Seok Lee 摘要:近年来,基于视频的卷积神经网络动作识别方法取得了显著的识别效果。然而,人们对动作识别模型的泛化机制还缺乏认识。在本文中,我们建议动作识别模型对运动信息的依赖程度低于预期,因此它们对帧顺序的随机化具有鲁棒性。基于这一观察,我们开发了一种新的防御方法,使用输入视频的时间洗牌来对抗动作识别模型中的敌对攻击。另一个支持我们的防御方法的观察结果是,视频上的敌对干扰对时间破坏很敏感。据我们所知,这是第一次尝试设计一种针对基于视频的动作识别模型的防御方法。 摘要:Recently, video-based action recognition methods using convolutional neural networks (CNNs) achieve remarkable recognition performance. However, there is still lack of understanding about the generalization mechanism of action recognition models. In this paper, we suggest that action recognition models rely on the motion information less than expected, and thus they are robust to randomization of frame orders. Based on this observation, we develop a novel defense method using temporal shuffling of input videos against adversarial attacks for action recognition models. Another observation enabling our defense method is that adversarial perturbations on videos are sensitive to temporal destruction. To the best of our knowledge, this is the first attempt to design a defense method specific to video-based action recognition models.
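
代码示意:"时间洗牌"防御的核心操作非常直接,即在送入动作识别模型前随机打乱帧顺序。下面是一个最小示例,输入张量布局 (B, T, C, H, W) 为假设。

```python
import torch

def temporal_shuffle(video):
    """Randomly permute the frame order of a (B, T, C, H, W) video tensor.

    The defended action model then runs on the shuffled clip; adversarial
    perturbations crafted for the original frame order tend to lose effect.
    """
    perm = torch.randperm(video.shape[1])
    return video[:, perm]

clip = torch.rand(2, 16, 3, 112, 112)      # hypothetical batch of 16-frame clips
print(temporal_shuffle(clip).shape)
```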

【6】 Weed Recognition using Deep Learning Techniques on Class-imbalanced Imagery 标题:基于深度学习技术的类不平衡图像杂草识别 链接:https://arxiv.org/abs/2112.07819

作者:A S M Mahmudul Hasan,Ferdous Sohel,Dean Diepeveen,Hamid Laga,Michael G. K. Jones 备注:The paper is accepted by Crop and Pasture Science journal (this https URL) 摘要:大多数杂草物种通过争夺高价值作物所需的养分,对农业生产力产生不利影响。对于大面积种植区域,人工除草不切实际。已经进行了许多研究,以开发农作物杂草自动管理系统。在此过程中,主要任务之一是从图像中识别杂草。然而,杂草识别是一项具有挑战性的任务。这是因为杂草和农作物在颜色、质地和形状上可能相似,而在记录图像时,成像条件、地理或天气条件会进一步加剧这种情况。先进的机器学习技术可用于从图像中识别杂草。在本文中,我们研究了五种最先进的深度神经网络,即VGG16、ResNet-50、Inception-V3、Inception-ResNet-v2和MobileNet-v2,并评估了它们的杂草识别性能。我们使用了几个实验设置和多个数据集组合。特别是,我们通过组合几个较小的数据集构建了一个大型杂草作物数据集,通过数据扩充缓解了类别不平衡,并使用该数据集对深层神经网络进行基准测试。我们研究了转移学习技术的使用,通过保留预先训练的权重来提取特征,并使用作物和杂草数据集的图像对其进行微调。我们发现VGG16在小规模数据集上的性能优于其他网络,而ResNet-50在大型组合数据集上的性能优于其他深度网络。 摘要:Most weed species can adversely impact agricultural productivity by competing for nutrients required by high-value crops. Manual weeding is not practical for large cropping areas. Many studies have been undertaken to develop automatic weed management systems for agricultural crops. In this process, one of the major tasks is to recognise the weeds from images. However, weed recognition is a challenging task. It is because weed and crop plants can be similar in colour, texture and shape which can be exacerbated further by the imaging conditions, geographic or weather conditions when the images are recorded. Advanced machine learning techniques can be used to recognise weeds from imagery. In this paper, we have investigated five state-of-the-art deep neural networks, namely VGG16, ResNet-50, Inception-V3, Inception-ResNet-v2 and MobileNetV2, and evaluated their performance for weed recognition. We have used several experimental settings and multiple dataset combinations. In particular, we constructed a large weed-crop dataset by combining several smaller datasets, mitigating class imbalance by data augmentation, and using this dataset in benchmarking the deep neural networks. We investigated the use of transfer learning techniques by preserving the pre-trained weights for extracting the features and fine-tuning them using the images of crop and weed datasets. We found that VGG16 performed better than others on small-scale datasets, while ResNet-50 performed better than other deep networks on the large combined dataset.

分割|语义相关(10篇)

【1】 SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation 标题:SeqFormer:一种简单得令人沮丧的视频实例分割模型 链接:https://arxiv.org/abs/2112.08275

作者:Junfeng Wu,Yi Jiang,Wenqing Zhang,Xiang Bai,Song Bai 摘要:在这项工作中,我们提出了SeqFormer,一个令人沮丧的简单视频实例分割模型。SeqFormer遵循vision transformer的原理,为视频帧之间的实例关系建模。然而,我们观察到一个独立的实例查询足以捕获视频中实例的时间序列,但是注意机制应该独立于每一帧进行。为了实现这一点,SeqFormer在每一帧中定位一个实例并聚合时间信息以学习视频级实例的强大表示,该表示用于动态预测每一帧上的掩码序列。实例跟踪自然实现,无需跟踪分支或后处理。在YouTube-VIS数据集上,SeqFormer在不使用任何花哨技巧的情况下,使用ResNet-50主干网实现了47.4 AP,使用ResNet-101主干网实现了49.0 AP。这一成绩分别比之前的最先进水平高出4.6和4.4(AP)。此外,与最近提出的Swin Transformer相结合,SeqFormer实现了59.3的更高AP。我们希望SeqFormer能够成为一个强大的基线,促进视频实例分割的未来研究,同时,用一个更健壮、准确、整洁的模型推进这一领域。代码和预先训练的模型可在 https://github.com/wjf5203/SeqFormer 获取。 摘要:In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model. The code and the pre-trained models are publicly available at https://github.com/wjf5203/SeqFormer.

【2】 Interactive Visualization and Representation Analysis Applied to Glacier Segmentation 标题:交互式可视化与表示分析在冰川分割中的应用 链接:https://arxiv.org/abs/2112.08184

作者:Minxing Zheng,Xinran Miao,Kris Sankaran 备注:14 pages, 10 figures 摘要:在地球观测问题中,可解释性越来越受到关注。我们应用交互式可视化和表示分析来指导冰川分割模型的解释。我们将U-Net的激活可视化,以了解和评估模型性能。我们使用Shiny R软件包构建了一个在线界面,以提供预测的全面错误分析。用户可以与面板交互并发现模型故障模式。此外,我们还讨论了可视化如何在数据预处理和模型训练期间提供健全性检查。 摘要:Interpretability has attracted increasing attention in earth observation problems. We apply interactive visualization and representation analysis to guide interpretation of glacier segmentation models. We visualize the activations from a U-Net to understand and evaluate the model performance. We build an online interface using the Shiny R package to provide comprehensive error analysis of the predictions. Users can interact with the panels and discover model failure modes. Further, we discuss how visualization can provide sanity checks during data preprocessing and model training.
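
代码示意:论文用 Shiny R 在线界面查看 U-Net 激活;下面给出一个与之无关的通用小例子,演示如何用 PyTorch forward hook 抓取中间层激活以便可视化,模型结构与层名均为假设。

```python
import torch
import torch.nn as nn

# A stand-in two-layer "encoder" used purely for demonstration (not the paper's U-Net).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # store feature maps for later plotting
    return hook

model[0].register_forward_hook(save_activation("conv1"))
model[2].register_forward_hook(save_activation("conv2"))

with torch.no_grad():
    model(torch.rand(1, 3, 64, 64))           # pretend satellite tile of a glacier

for name, act in activations.items():
    print(name, act.shape, act.mean().item()) # per-channel maps could now be rendered
```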

【3】 Segmentation-Reconstruction-Guided Facial Image De-occlusion 标题:基于分割重建的人脸图像去遮挡 链接:https://arxiv.org/abs/2112.08022

作者:Xiangnan Yin,Di Huang,Zehua Fu,Yunhong Wang,Liming Chen 摘要:遮挡在野外人脸图像中非常常见,导致人脸相关任务的性能下降。尽管人们在去除人脸图像中的遮挡方面做了大量的工作,但是遮挡的形状和纹理的变化仍然对当前方法的鲁棒性提出了挑战。因此,当前的方法要么依赖于手动遮挡遮罩,要么仅适用于特定遮挡。提出了一种新的基于人脸分割和三维人脸重建的人脸去遮挡模型,该模型能够自动去除所有边界模糊的人脸遮挡,例如头发。该模型由三维人脸重建模块、人脸分割模块和图像生成模块组成。图像生成模块利用前两者分别预测的人脸先验信息和遮挡掩模信息,能够忠实地恢复缺失的人脸纹理。为了监督训练,我们进一步构建了一个大型的遮挡数据集,包括手动标记和合成遮挡。定性和定量结果证明了该方法的有效性和鲁棒性。 摘要:Occlusions are very common in face images in the wild, leading to the degraded performance of face-related tasks. Although much effort has been devoted to removing occlusions from face images, the varying shapes and textures of occlusions still challenge the robustness of current methods. As a result, current methods either rely on manual occlusion masks or only apply to specific occlusions. This paper proposes a novel face de-occlusion model based on face segmentation and 3D face reconstruction, which automatically removes all kinds of face occlusions with even blurred boundaries,e.g., hairs. The proposed model consists of a 3D face reconstruction module, a face segmentation module, and an image generation module. With the face prior and the occlusion mask predicted by the first two, respectively, the image generation module can faithfully recover the missing facial textures. To supervise the training, we further build a large occlusion dataset, with both manually labeled and synthetic occlusions. Qualitative and quantitative results demonstrate the effectiveness and robustness of the proposed method.

【4】 Autoencoder-based background reconstruction and foreground segmentation with background noise estimation 标题:基于自动编码器的背景重建和带背景噪声估计的前景分割 链接:https://arxiv.org/abs/2112.08001

作者:Bruno Sauvalle,Arnaud de La Fortelle 摘要:即使经过几十年的研究,动态场景背景重建和前景对象分割仍然被认为是一个有待解决的问题,因为存在各种挑战,例如照明变化、相机移动或由空气湍流或移动树木引起的背景噪声。本文提出使用自动编码器将视频序列的背景建模为低维流形,并将该自动编码器提供的重建背景与原始图像进行比较,以计算前景/背景分割模板。该模型的主要新颖之处在于,自动编码器也经过训练以预测背景噪声,这允许为每一帧计算一个像素相关阈值以执行背景/前景分割。尽管提议的模型不使用任何时间或运动信息,但它超过了CDnet 2014和LASIESTA数据集上无监督背景减法的最新技术,在摄像机移动的视频上有了显著的改进。 摘要:Even after decades of research, dynamic scene background reconstruction and foreground object segmentation are still considered as open problems due various challenges such as illumination changes, camera movements, or background noise caused by air turbulence or moving trees. We propose in this paper to model the background of a video sequence as a low dimensional manifold using an autoencoder and to compare the reconstructed background provided by this autoencoder with the original image to compute the foreground/background segmentation masks. The main novelty of the proposed model is that the autoencoder is also trained to predict the background noise, which allows to compute for each frame a pixel-dependent threshold to perform the background/foreground segmentation. Although the proposed model does not use any temporal or motion information, it exceeds the state of the art for unsupervised background subtraction on the CDnet 2014 and LASIESTA datasets, with a significant improvement on videos where the camera is moving.
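
代码示意:下面的草图说明"自编码器同时重建背景并预测背景噪声,再据此得到逐像素阈值"的分割流程;网络结构与阈值系数 k 均为假设,仅表达摘要所描述的思路,并非论文模型。

```python
import torch
import torch.nn as nn

class BGAutoencoder(nn.Module):
    """Toy autoencoder with two heads: reconstructed background and per-pixel noise."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 16, 2, stride=2), nn.ReLU())
        self.bg_head = nn.Conv2d(16, 3, 3, padding=1)
        self.noise_head = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, x):
        h = self.dec(self.enc(x))
        background = torch.sigmoid(self.bg_head(h))
        noise = nn.functional.softplus(self.noise_head(h))   # predicted background noise level
        return background, noise

def foreground_mask(frame, model, k=3.0):
    """Pixel is foreground if its reconstruction error exceeds k times the predicted noise."""
    background, noise = model(frame)
    error = (frame - background).abs()
    return (error > k * noise).any(dim=1, keepdim=True).float()

frame = torch.rand(1, 3, 64, 64)
mask = foreground_mask(frame, BGAutoencoder())
print(mask.shape)   # (1, 1, 64, 64) binary foreground/background mask
```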

【5】 Self-Ensembling GAN for Cross-Domain Semantic Segmentation 标题:用于跨域语义分割的自集成GAN 链接:https://arxiv.org/abs/2112.07999

作者:Yonghao Xu,Fengxiang He,Bo Du,Liangpei Zhang,Dacheng Tao 摘要:深度神经网络(DNN)极大地提高了语义分割的性能。然而,训练DNN通常需要大量像素级标记的数据,这在实践中是昂贵且耗时的。为了减轻注释负担,本文提出了一种利用跨域数据进行语义分割的自集成生成对抗网络(SE-GAN)。在SE-GAN中,教师网络和学生网络构成了生成语义分割图的自集成模型,该模型与鉴别器一起构成了一个GAN。尽管其简单,我们发现SE-GAN可以显著提高对抗性训练的性能并增强模型的稳定性,后者是大多数基于对抗性训练的方法的共同障碍。我们从理论上分析SE-GAN,并提供$\mathcal{O}(1/\sqrt{N})$泛化界($N$是训练样本量),这表明应控制鉴别器的假设复杂度以增强泛化能力。因此,我们选择一个简单的网络作为鉴别器。在两个标准设置下进行的广泛和系统的实验表明,所提出的方法明显优于当前最先进的方法。我们模型的源代码将很快提供。 摘要:Deep neural networks (DNNs) have greatly contributed to the performance gains in semantic segmentation. Nevertheless, training DNNs generally requires large amounts of pixel-level labeled data, which is expensive and time-consuming to collect in practice. To mitigate the annotation burden, this paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation. In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which together with a discriminator, forms a GAN. Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model, the latter of which is a common barrier shared by most adversarial training-based methods. We theoretically analyze SE-GAN and provide an $\mathcal{O}(1/\sqrt{N})$ generalization bound ($N$ is the training sample size), which suggests controlling the discriminator's hypothesis complexity to enhance the generalizability. Accordingly, we choose a simple network as the discriminator. Extensive and systematic experiments in two standard settings demonstrate that the proposed method significantly outperforms current state-of-the-art approaches. The source code of our model will be available soon.
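
代码示意:自集成(teacher-student)方法常用学生权重的指数滑动平均(EMA)来更新教师(如 Mean Teacher);摘要未给出 SE-GAN 的具体更新规则,下面仅示意这一常见做法,动量取值为假设。

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights follow an exponential moving average of the student weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = nn.Conv2d(3, 8, 3)              # stands in for the segmentation student network
teacher = copy.deepcopy(student)          # teacher starts as a copy and is never backpropagated

# ... after each optimizer step on the student ...
ema_update(teacher, student)
print(torch.allclose(teacher.weight, student.weight))   # True right after the initial copy
```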

【6】 M-FasterSeg: An Efficient Semantic Segmentation Network Based on Neural Architecture Search 标题:M-FasterSeg:一种基于神经结构搜索的高效语义分割网络 链接:https://arxiv.org/abs/2112.07918

作者:Huiyu Kuang 摘要:图像语义分割技术是智能系统理解自然场景的关键技术之一。作为视觉智能领域的重要研究方向之一,该技术在移动机器人、无人机、智能驾驶、智能安全等领域有着广泛的应用前景。然而,在移动机器人的实际应用中,可能会出现分割语义标签预测不准确、分割对象和背景边缘信息丢失等问题。提出了一种基于深度学习网络的语义分割网络的改进结构,该网络结合了自注意神经网络和神经网络结构搜索方法。首先,使用神经网络搜索方法NAS(neural Architecture search)寻找具有多个分辨率分支的语义分割网络。在搜索过程中,结合自注意网络结构模块对搜索到的神经网络结构进行调整,然后将不同分支搜索到的语义分割网络组合成快速语义分割网络结构,并将图片输入到网络结构中,得到最终的预测结果。在Cityscapes数据集上的实验结果表明,该算法的准确率为69.8%,分割速度为48/s,在实时性和准确性之间达到了很好的平衡,能够优化边缘分割,在复杂场景中具有较好的性能。良好的鲁棒性适合于实际应用。 摘要:Image semantic segmentation technology is one of the key technologies for intelligent systems to understand natural scenes. As one of the important research directions in the field of visual intelligence, this technology has broad application scenarios in the fields of mobile robots, drones, smart driving, and smart security. However, in the actual application of mobile robots, problems such as inaccurate segmentation semantic label prediction and loss of edge information of segmented objects and background may occur. This paper proposes an improved structure of a semantic segmentation network based on a deep learning network that combines self-attention neural network and neural network architecture search methods. First, a neural network search method NAS (Neural Architecture Search) is used to find a semantic segmentation network with multiple resolution branches. In the search process, combine the self-attention network structure module to adjust the searched neural network structure, and then combine the semantic segmentation network searched by different branches to form a fast semantic segmentation network structure, and input the picture into the network structure to get the final forecast result. The experimental results on the Cityscapes dataset show that the accuracy of the algorithm is 69.8%, and the segmentation speed is 48/s. It achieves a good balance between real-time and accuracy, can optimize edge segmentation, and has a better performance in complex scenes. Good robustness is suitable for practical application.

【7】 Decoupling Zero-Shot Semantic Segmentation 标题:解耦零样本语义分割 链接:https://arxiv.org/abs/2112.07910

作者:Jian Ding,Nan Xue,Gui-Song Xia,Dengxin Dai 备注:14 pages, 8 figures 摘要:零样本语义分割(Zero-Shot Semantic Segmentation,ZS3)旨在分割训练中未见过的新类别。现有的工作将ZS3描述为一个像素级的零样本分类问题,并借助于仅使用文本预训练的语言模型将语义知识从可见类转移到不可见类。虽然简单,但像素级ZS3公式显示,集成视觉语言模型的能力有限,这些模型通常通过图像-文本对进行预训练,目前显示出视觉任务的巨大潜力。受人类经常执行段级语义标记这一观察结果的启发,我们建议将ZS3解耦为两个子任务:1)类无关分组任务,将像素分组为段;2)分段上的零样本分类任务。前一个子任务不涉及类别信息,可以直接迁移到为看不见的类别分组像素。后一个子任务在段级别执行,并提供了一种自然的方式来利用大规模视觉语言模型,该模型使用ZS3的图像-文本对(例如CLIP)进行预训练。基于解耦公式,我们提出了一个简单而有效的零样本语义分割模型ZegFormer,该模型在ZS3标准基准上的性能大大优于以前的方法,例如,对于看不见的类的mIoU,在PASCAL VOC上高出35个点,在COCO-Stuff上高出3个点。代码将发布于 https://github.com/dingjiansw101/ZegFormer 。 摘要:Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem, and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only with texts. While simple, the pixel-level ZS3 formulation shows the limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple the ZS3 into two sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments. 2) a zero-shot classification task on segments. The former sub-task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter subtask performs at segment-level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by large margins, e.g., 35 points on the PASCAL VOC and 3 points on the COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.

【8】 Gaze Estimation with Eye Region Segmentation and Self-Supervised Multistream Learning 标题:基于眼区分割和自监督多数据流学习的视线估计 链接:https://arxiv.org/abs/2112.07878

作者:Zunayed Mahmud,Paul Hungler,Ali Etemad 备注:5 pages, 1 figure, 3 tables, Accepted in AAAI-22 Workshop on Human-Centric Self-Supervised Learning 摘要:我们提出了一种新的多流网络学习鲁棒眼睛表征凝视估计。我们首先使用模拟器创建一个合成数据集,其中包含详细描述可见眼球和虹膜的眼睛区域遮罩。然后,我们使用U-Net类型的模型执行眼睛区域分割,我们随后使用该模型为真实世界的眼睛图像生成眼睛区域遮罩。接下来,我们使用自监督对比学习在真实域中预训练眼睛图像编码器,以学习广义眼睛表征。最后,在我们的多流框架中,这个预训练的眼睛编码器,以及另外两个用于可见眼球区域和虹膜的编码器被并行使用,以从真实世界的图像中提取用于凝视估计的显著特征。我们在两种不同的评估设置下在Eyedip数据集上演示了我们的方法的性能,并实现了最先进的结果,优于该数据集上的所有现有基准。我们还进行了额外的实验,以验证我们的自监督网络对于用于训练的不同数量的标记数据的鲁棒性。 摘要:We present a novel multistream network that learns robust eye representations for gaze estimation. We first create a synthetic dataset containing eye region masks detailing the visible eyeball and iris using a simulator. We then perform eye region segmentation with a U-Net type model which we later use to generate eye region masks for real-world eye images. Next, we pretrain an eye image encoder in the real domain with self-supervised contrastive learning to learn generalized eye representations. Finally, this pretrained eye encoder, along with two additional encoders for visible eyeball region and iris, are used in parallel in our multistream framework to extract salient features for gaze estimation from real-world images. We demonstrate the performance of our method on the EYEDIAP dataset in two different evaluation settings and achieve state-of-the-art results, outperforming all the existing benchmarks on this dataset. We also conduct additional experiments to validate the robustness of our self-supervised network with respect to different amounts of labeled data used for training.

【9】 Image Segmentation with Homotopy Warping 标题:基于同伦扭曲的图像分割 链接:https://arxiv.org/abs/2112.07812

作者:Xiaoling Hu,Chao Chen 备注:13 pages, 11 figures 摘要:除了每像素精度外,拓扑正确性对于具有精细尺度结构的图像分割也至关重要,例如卫星图像和生物医学图像。在本文中,通过利用数字拓扑理论,我们确定图像中对拓扑至关重要的位置。通过关注这些关键位置,我们提出了一种新的同伦扭曲损失来训练深度图像分割网络,以获得更好的拓扑精度。为了有效地识别这些拓扑关键位置,我们提出了一种利用距离变换的新算法。所提出的算法,以及损失函数,自然地推广到二维和三维环境中的不同拓扑结构。所提出的损失函数有助于deep-nets在拓扑感知度量方面获得更好的性能,优于最先进的拓扑保持分割方法。 摘要:Besides per-pixel accuracy, topological correctness is also crucial for the segmentation of images with fine-scale structures, e.g., satellite images and biomedical images. In this paper, by leveraging the theory of digital topology, we identify locations in an image that are critical for topology. By focusing on these critical locations, we propose a new homotopy warping loss to train deep image segmentation networks for better topological accuracy. To efficiently identity these topologically critical locations, we propose a new algorithm exploiting the distance transform. The proposed algorithm, as well as the loss function, naturally generalize to different topological structures in both 2D and 3D settings. The proposed loss function helps deep nets achieve better performance in terms of topology-aware metrics, outperforming state-of-the-art topology-preserving segmentation methods.

【10】 RA V-Net: Deep learning network for automated liver segmentation 标题:RAV-Net:用于肝脏自动分割的深度学习网络 链接:https://arxiv.org/abs/2112.08232

作者:Zhiqi Lee,Sumin Qi,Chongchong Fan,Ziwei Xie 摘要:肝脏的精确分割是疾病诊断的先决条件。自动分割是计算机辅助肝脏疾病检测和诊断的一个重要应用。近年来,医学图像的自动化处理取得了突破性进展。然而,腹部扫描CT图像的低对比度和肝脏形态的复杂性使得精确的自动分割具有挑战性。本文提出了一种改进的基于U-Net的医学图像自动分割模型RA V-Net。它有以下三个主要创新。提出了CofRes模块(复合原始特征残差模块),利用更复杂的卷积层和跳跃连接,使其获得更高级别的图像特征提取能力,并防止梯度消失或爆炸。为了减少模型的计算量,提出了AR模块(注意恢复模块)。此外,通过调整通道和LSTM卷积来感知编码和解码模块的数据像素之间的空间特征,最终有效地保留了图像特征。引入了CA模块(Channel Attention Module,通道注意模块),该模块用于提取具有依赖关系的相关通道,并通过矩阵点积对其进行增强,同时削弱不具有依赖关系的无关通道,从而达到通道注意的目的。LSTM卷积和CA模块提供的注意机制是神经网络性能的有力保证。U-Net网络的准确度为0.9862,精确率为0.9118,DSC为0.8547,JSC为0.82。RA V-Net的评估指标为:准确度0.9968,精确率0.9597,DSC 0.9654,JSC 0.9414。最能代表分割效果的指标是DSC,它比U-Net提高了0.1107,JSC提高了0.1214。 摘要:Accurate segmentation of the liver is a prerequisite for the diagnosis of disease. Automated segmentation is an important application of computer-aided detection and diagnosis of liver disease. In recent years, automated processing of medical images has gained breakthroughs. However, the low contrast of abdominal scan CT images and the complexity of liver morphology make accurate automatic segmentation challenging. In this paper, we propose RA V-Net, which is an improved medical image automatic segmentation model based on U-Net. It has the following three main innovations. CofRes Module (Composite Original Feature Residual Module) is proposed. With more complex convolution layers and skip connections to make it obtain a higher level of image feature extraction capability and prevent gradient disappearance or explosion. AR Module (Attention Recovery Module) is proposed to reduce the computational effort of the model. In addition, the spatial features between the data pixels of the encoding and decoding modules are sensed by adjusting the channels and LSTM convolution. Finally, the image features are effectively retained. CA Module (Channel Attention Module) is introduced, which used to extract relevant channels with dependencies and strengthen them by matrix dot product, while weakening irrelevant channels without dependencies. The purpose of channel attention is achieved. The attention mechanism provided by LSTM convolution and CA Module are strong guarantees for the performance of the neural network. The accuracy of U-Net network: 0.9862, precision: 0.9118, DSC: 0.8547, JSC: 0.82. The evaluation metrics of RA V-Net, accuracy: 0.9968, precision: 0.9597, DSC: 0.9654, JSC: 0.9414. The most representative metric for the segmentation effect is DSC, which improves 0.1107 over U-Net, and JSC improves 0.1214.
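
代码示意:文中反复使用 DSC(Dice)与 JSC(Jaccard)作为分割评价指标,下面给出这两个指标在二值掩码上的通用计算示例,与论文网络本身无关。

```python
import numpy as np

def dice_and_jaccard(pred, target, eps=1e-7):
    """DSC = 2|A∩B| / (|A|+|B|);  JSC = |A∩B| / |A∪B|  for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dsc = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    jsc = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return dsc, jsc

pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt   = np.zeros((64, 64), dtype=np.uint8); gt[15:45, 15:45] = 1
print(dice_and_jaccard(pred, gt))
```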

Zero/Few Shot|迁移|域适配|自适应(1篇)

【1】 Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval 标题:基于Zero-Shot草图图像检索的模态感知三元组硬挖掘 链接:https://arxiv.org/abs/2112.07966

作者:Zongheng Huang,YiFan Sun,Chuchu Han,Changxin Gao,Nong Sang 备注:13 pages, 7 figures 摘要:本文结合深度度量学习的最新良好实践,从跨模态度量学习的角度解决了基于零样本草图的图像检索(ZS-SBIR)问题。该任务有两个特点:1)零样本设置需要具有良好类内紧凑性和类间差异的度量空间来识别新类;2)草图查询和照片库处于不同的模态。度量学习观点从两个方面有利于ZS-SBIR。首先,它促进了借助深度度量学习(DML)最新良好实践的改进。通过结合DML中的两种基本学习方法,即分类训练和成对训练,我们为ZS-SBIR建立了一个强大的基线。在不使用任何花哨技巧的情况下,该基线可以实现具有竞争力的检索精度。其次,它提供了一种见解,即正确地抑制模态间隙是至关重要的。为此,我们设计了一种新的方法,称为模态感知三元组硬挖掘(MATHM)。MATHM通过三种类型的成对学习来增强基线,即跨模态样本对、模态内样本对及其组合。我们还设计了一种自适应加权方法,在训练过程中动态平衡这三个分量。实验结果证实,MATHM在强基线的基础上带来了另一轮显著的改进,并建立了新的最先进的性能。例如,在TU-Berlin数据集上,我们实现了47.88+2.94% mAP@all和58.28+2.34% Prec@100。代码将公开提供。 摘要:This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) problem from the viewpoint of cross-modality metric learning, with recent good practices in deep metric learning. This task has two characteristics: 1) the zero-shot setting requires a metric space with good within-class compactness and the between-class discrepancy for recognizing the novel classes and 2) the sketch query and the photo gallery are in different modalities. The metric learning viewpoint benefits ZS-SBIR from two aspects. First, it facilitates improvement through recent good practices in deep metric learning (DML). By combining two fundamental learning approaches in DML, e.g., classification training and pairwise training, we set up a strong baseline for ZS-SBIR. Without bells and whistles, this baseline achieves competitive retrieval accuracy. Second, it provides an insight that properly suppressing the modality gap is critical. To this end, we design a novel method named Modality-Aware Triplet Hard Mining (MATHM). MATHM enhances the baseline with three types of pairwise learning, e.g., a cross-modality sample pair, a within-modality sample pair, and their combination. We also design an adaptive weighting method to balance these three components during training dynamically. Experimental results confirm that MATHM brings another round of significant improvement based on the strong baseline and sets up new state-of-the-art performance. For example, on the TU-Berlin dataset, we achieve 47.88+2.94% mAP@all and 58.28+2.34% Prec@100. Code will be publicly available.

半弱无监督|主动学习|不确定性(5篇)

【1】 Improving Self-supervised Learning with Automated Unsupervised Outlier Arbitration 标题:利用自动无监督孤立点仲裁改进自监督学习 链接:https://arxiv.org/abs/2112.08132

作者:Yu Wang,Jingyang Lin,Jingjing Zou,Yingwei Pan,Ting Yao,Tao Mei 备注:NeurIPS 2021; Code is publicly available at: this https URL 摘要:我们的工作揭示了现有主流自监督学习方法的结构性缺陷。虽然自监督学习框架通常认为流行的完美实例级不变性假设是理所当然的,但我们仔细研究了其背后的陷阱。特别是,我们认为,现有的用于生成多个积极观点的增强管道自然会引入分布外(OOD)样本,从而破坏下游任务的学习。对输入产生不同的积极增强并不总是有利于下游任务。为了克服这个固有的缺陷,我们引入了一个轻量级的潜在变量模型UOTA,针对自监督学习的视图采样问题。UOTA自适应搜索最重要的采样区域以生成视图,并为离群点鲁棒自监督学习方法提供可行的选择。我们的方法直接推广到许多主流的自监督学习方法,不管损失的性质是否是对比的。我们的经验表明,UOTA比最先进的自我监督范式具有明显的优势,这很好地证明了现有方法中嵌入的OOD样本问题的存在。特别是,我们从理论上证明了该方案的优点归结为保证估计方差和偏差减少。代码可从以下网址获取:https://github.com/ssl-codelab/uota. 摘要:Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods. Whereas self-supervised learning frameworks usually take the prevailing perfect instance level invariance hypothesis for granted, we carefully investigate the pitfalls behind. Particularly, we argue that the existing augmentation pipeline for generating multiple positive views naturally introduces out-of-distribution (OOD) samples that undermine the learning of the downstream tasks. Generating diverse positive augmentations on the input does not always pay off in benefiting downstream tasks. To overcome this inherent deficiency, we introduce a lightweight latent variable model UOTA, targeting the view sampling issue for self-supervised learning. UOTA adaptively searches for the most important sampling region to produce views, and provides viable choice for outlier-robust self-supervised learning approaches. Our method directly generalizes to many mainstream self-supervised learning approaches, regardless of the loss's nature contrastive or not. We empirically show UOTA's advantage over the state-of-the-art self-supervised paradigms with evident margin, which well justifies the existence of the OOD sample issue embedded in the existing approaches. Especially, we theoretically prove that the merits of the proposal boil down to guaranteed estimator variance and bias reduction. Code is available: at https://github.com/ssl-codelab/uota.

【2】 Self-Supervised Monocular Depth and Ego-Motion Estimation in Endoscopy: Appearance Flow to the Rescue 标题:内窥镜中的自监督单目深度与自我运动估计:外观流来救场 链接:https://arxiv.org/abs/2112.08122

作者:Shuwei Shao,Zhongcai Pei,Weihai Chen,Wentao Zhu,Xingming Wu,Dianmin Sun,Baochang Zhang 备注:Accepted by Medical Image Analysis 摘要:最近,自监督学习技术被应用于从单目视频中计算深度和自我运动,在自主驾驶场景中取得了显著的性能。深度和自我运动自监督学习的一个广泛采用的假设是,图像亮度在相邻帧内保持不变。不幸的是,内窥镜场景不符合这一假设,因为在数据采集过程中存在由照明变化、非朗伯反射和互反射引起的严重亮度波动,并且这些亮度波动不可避免地会降低深度和运动估计精度。在这项工作中,我们引入了一个称为外观流的新概念来解决亮度不一致的问题。外观流考虑了亮度模式的任何变化,使我们能够开发一个广义动态图像约束。此外,我们还构建了一个统一的自监督框架来同时估计内窥镜场景中的单目深度和自我运动,该框架包括结构模块、运动模块、外观模块和对应模块,用于精确地重建外观和校准图像亮度。在SCARED数据集和EndoSLAM数据集上进行了大量实验,所提出的统一框架大大超过了其他自监督方法。为了验证我们的框架在不同患者和摄像机上的泛化能力,我们在SCARED上训练了我们的模型,但在SERV-CT和Hamlyn数据集上进行了测试,没有任何微调,优异的结果显示了其强大的泛化能力。代码将在以下位置提供:https://github.com/ShuweiShao/AF-SfMLearner 。 摘要:Recently, self-supervised learning technology has been applied to calculate depth and ego-motion from monocular videos, achieving remarkable performance in autonomous driving scenarios. One widely adopted assumption of depth and ego-motion self-supervised learning is that the image brightness remains constant within nearby frames. Unfortunately, the endoscopic scene does not meet this assumption because there are severe brightness fluctuations induced by illumination variations, non-Lambertian reflections and interreflections during data collection, and these brightness fluctuations inevitably deteriorate the depth and ego-motion estimation accuracy. In this work, we introduce a novel concept referred to as appearance flow to address the brightness inconsistency problem. The appearance flow takes into consideration any variations in the brightness pattern and enables us to develop a generalized dynamic image constraint. Furthermore, we build a unified self-supervised framework to estimate monocular depth and ego-motion simultaneously in endoscopic scenes, which comprises a structure module, a motion module, an appearance module and a correspondence module, to accurately reconstruct the appearance and calibrate the image brightness. Extensive experiments are conducted on the SCARED dataset and EndoSLAM dataset, and the proposed unified framework exceeds other self-supervised approaches by a large margin. To validate our framework's generalization ability on different patients and cameras, we train our model on SCARED but test it on the SERV-CT and Hamlyn datasets without any fine-tuning, and the superior results reveal its strong generalization ability. Code will be available at: https://github.com/ShuweiShao/AF-SfMLearner.

【3】 Towards General and Efficient Active Learning 标题:走向全面有效的主动学习 链接:https://arxiv.org/abs/2112.07963

作者:Yichen Xie,Masayoshi Tomizuka,Wei Zhan 摘要:主动学习旨在选择信息量最大的样本,以利用有限的注释预算。大多数现有工作都遵循一条繁琐的管道,分别在每个数据集上重复耗时的模型训练和批处理数据选择多次。本文提出了一种新的通用有效的主动学习(GEAL)方法,以挑战这一现状。利用一个在大数据集上预先训练的公共可用模型,我们的方法可以在不同数据集上进行数据选择过程,对同一模型进行单次推理。为了捕获图像内部的细微局部信息,我们提出了一种可以从预先训练的网络的中间特征中轻松提取的知识簇。与麻烦的批量选择策略不同,所有数据样本都是通过在细粒度知识集群级别执行K-中心贪婪一次性选择的。整个过程只需要单通道模型推理,无需训练或监督,使我们的方法在时间复杂度方面明显优于现有技术数百倍。大量的实验广泛地证明了我们的方法在目标检测、语义分割、深度估计和图像分类方面的良好性能。 摘要:Active learning aims to select the most informative samples to exploit limited annotation budgets. Most existing work follows a cumbersome pipeline by repeating the time-consuming model training and batch data selection multiple times on each dataset separately. We challenge this status quo by proposing a novel general and efficient active learning (GEAL) method in this paper. Utilizing a publicly available model pre-trained on a large dataset, our method can conduct data selection processes on different datasets with a single-pass inference of the same model. To capture the subtle local information inside images, we propose knowledge clusters that are easily extracted from the intermediate features of the pre-trained network. Instead of the troublesome batch selection strategy, all data samples are selected in one go by performing K-Center-Greedy in the fine-grained knowledge cluster level. The entire procedure only requires single-pass model inference without training or supervision, making our method notably superior to prior arts in terms of time complexity by up to hundreds of times. Extensive experiments widely demonstrate the promising performance of our method on object detection, semantic segmentation, depth estimation, and image classification.
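
代码示意:GEAL 在预训练特征(知识簇)上用 K-Center-Greedy 一次性选样;下面给出 K-Center-Greedy 本身的一个通用实现草图,特征来源与预算大小均为假设。

```python
import numpy as np

def k_center_greedy(features, budget, seed=0):
    """Greedy k-center selection: each step picks the point farthest from the chosen set."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]                       # arbitrary first center
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(np.argmax(min_dist))                      # farthest-first traversal
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[idx], axis=1))
    return selected

feats = np.random.rand(1000, 128)     # e.g. pooled features from a pretrained backbone
print(sorted(k_center_greedy(feats, budget=10)))
```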

【4】 Robust Depth Completion with Uncertainty-Driven Loss Functions 标题:具有不确定性驱动损失函数的鲁棒深度补全 链接:https://arxiv.org/abs/2112.07895

作者:Yufan Zhu,Weisheng Dong,Leida Li,Jinjian Wu,Xin Li,Guangming Shi 备注:accepted by AAAI2022 摘要:从稀疏激光雷达扫描中恢复密集深度图像是一项具有挑战性的任务。尽管颜色引导的稀疏到稠密深度补全方法非常流行,但它们在优化过程中平等地对待像素,忽略了稀疏深度图中的不均匀分布特征和合成真值中累积的异常值。在这项工作中,我们引入了不确定性驱动的损失函数,以提高深度补全的鲁棒性,并处理深度补全中的不确定性。具体来说,我们提出了一种基于Jeffrey先验的显式不确定性公式,用于鲁棒深度补全。我们引入了一种参数化的不确定性驱动损失,并将其转化为对噪声或缺失数据鲁棒的新损失函数。同时,我们提出了一个多尺度联合预测模型,可以同时预测深度图和不确定性图。估计的不确定性图还用于对具有高不确定性的像素执行自适应预测,从而产生用于细化补全结果的残差图。我们的方法已经在KITTI深度补全基准上进行了测试,并在MAE、IMAE和IRMSE度量方面实现了最先进的鲁棒性性能。 摘要:Recovering a dense depth image from sparse LiDAR scans is a challenging task. Despite the popularity of color-guided methods for sparse-to-dense depth completion, they treated pixels equally during optimization, ignoring the uneven distribution characteristics in the sparse depth map and the accumulated outliers in the synthesized ground truth. In this work, we introduce uncertainty-driven loss functions to improve the robustness of depth completion and handle the uncertainty in depth completion. Specifically, we propose an explicit uncertainty formulation for robust depth completion with Jeffrey's prior. A parametric uncertain-driven loss is introduced and translated to new loss functions that are robust to noisy or missing data. Meanwhile, we propose a multiscale joint prediction model that can simultaneously predict depth and uncertainty maps. The estimated uncertainty map is also used to perform adaptive prediction on the pixels with high uncertainty, leading to a residual map for refining the completion results. Our method has been tested on KITTI Depth Completion Benchmark and achieved the state-of-the-art robustness performance in terms of MAE, IMAE, and IRMSE metrics.

【5】 Mining Minority-class Examples With Uncertainty Estimates 标题:挖掘具有不确定性估计的少数类实例 链接:https://arxiv.org/abs/2112.07835

作者:Gursimran Singh,Lingyang Chu,Lanjun Wang,Jian Pei,Qi Tian,Yong Zhang 摘要:在现实世界中,对象的出现频率自然会发生倾斜,形成长尾类分布,这导致统计上罕见的类的性能较差。一个很有希望的解决方案是挖掘尾部类示例以平衡训练数据集。然而,挖掘尾部类示例是一项非常具有挑战性的任务。例如,大多数成功的基于不确定性的挖掘方法都由于数据中的偏斜导致的类概率失真而难以实现。在这项工作中,我们提出了一种有效但简单的方法来克服这些挑战。我们的框架增强了被抑制的尾部类激活,然后,使用以一类数据为中心的方法来有效地识别尾部类示例。我们在跨越两个计算机视觉任务的三个数据集上对我们的框架进行了详尽的评估。少数类挖掘的实质性改进和微调模型的性能有力地证实了我们提出的解决方案的价值。 摘要:In the real world, the frequency of occurrence of objects is naturally skewed forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the training dataset. However, mining tail-class examples is a very challenging task. For instance, most of the otherwise successful uncertainty-based mining approaches struggle due to distortion of class probabilities resulting from skewness in data. In this work, we propose an effective, yet simple, approach to overcome these challenges. Our framework enhances the subdued tail-class activations and, thereafter, uses a one-class data-centric approach to effectively identify tail-class examples. We carry out an exhaustive evaluation of our framework on three datasets spanning over two computer vision tasks. Substantial improvements in the minority-class mining and fine-tuned model's performance strongly corroborate the value of our proposed solution.

时序|行为识别|姿态|视频|运动估计(3篇)

【1】 ST-MTL: Spatio-Temporal Multitask Learning Model to Predict Scanpath While Tracking Instruments in Robotic Surgery 标题:ST-MTL:机器人手术器械跟踪时预测扫描路径的时空多任务学习模型 链接:https://arxiv.org/abs/2112.08189

作者:Mobarakol Islam,Vibashan VS,Chwee Ming Lim,Hongliang Ren 备注:12 pages 摘要:在图像引导的机器人手术中,跟踪仪器在任务导向注意力表征学习中具有巨大的潜力。结合认知能力,使摄像机控制自动化,使外科医生能够更加专注于处理手术器械。目的是缩短手术时间,方便外科医生和患者进行手术。我们提出了一种端到端可训练的时空多任务学习(ST-MTL)模型,该模型具有共享编码器和时空解码器,用于实时手术器械分割和面向任务的显著性检测。在共享参数的MTL模型中,将多个损失函数优化为一个收敛点仍然是一个开放的挑战。我们采用一种新的异步时空优化(ASTO)技术,通过计算每个解码器的独立梯度来解决这个问题。我们还设计了一个竞争性的挤压和激励单元,通过铸造一个保持弱特征、激励强特征并执行动态空间和通道特征重新校准的跳过连接。为了捕获更好的长期时空依赖性,我们通过连接连续帧的高级编码器特征来增强长短时记忆(LSTM)模块。我们还引入了Sinkhorn正则化损失,通过保持计算效率来增强面向任务的显著性检测。我们在MICCAI 2017机器人仪器分割挑战的数据集上生成任务感知显著性图和仪器的扫描路径。与最先进的分割和显著性方法相比,我们的模型优于大多数评估指标,并在挑战中产生了出色的性能。 摘要:Representation learning of the task-oriented attention while tracking instrument holds vast potential in image-guided robotic surgery. Incorporating cognitive ability to automate the camera control enables the surgeon to concentrate more on dealing with surgical instruments. The objective is to reduce the operation time and facilitate the surgery for both surgeons and patients. We propose an end-to-end trainable Spatio-Temporal Multi-Task Learning (ST-MTL) model with a shared encoder and spatio-temporal decoders for the real-time surgical instrument segmentation and task-oriented saliency detection. In the MTL model of shared parameters, optimizing multiple loss functions into a convergence point is still an open challenge. We tackle the problem with a novel asynchronous spatio-temporal optimization (ASTO) technique by calculating independent gradients for each decoder. We also design a competitive squeeze and excitation unit by casting a skip connection that retains weak features, excites strong features, and performs dynamic spatial and channel-wise feature recalibration. To capture better long term spatio-temporal dependencies, we enhance the long-short term memory (LSTM) module by concatenating high-level encoder features of consecutive frames. We also introduce Sinkhorn regularized loss to enhance task-oriented saliency detection by preserving computational efficiency. We generate the task-aware saliency maps and scanpath of the instruments on the dataset of the MICCAI 2017 robotic instrument segmentation challenge. Compared to the state-of-the-art segmentation and saliency methods, our model outperforms most of the evaluation metrics and produces an outstanding performance in the challenge.

【2】 Temporal Action Proposal Generation with Background Constraint 标题:具有背景约束的时态行动建议生成 链接:https://arxiv.org/abs/2112.07984

作者:Haosen Yang,Wenhao Wu,Lining Wang,Sheng Jin,Boyang Xia,Hongxun Yao,Hujie Huang 备注:Accepted by AAAI2022. arXiv admin note: text overlap with arXiv:2105.12043 摘要:时间动作建议生成(TAPG)是一项具有挑战性的任务,其目的是在具有时间边界的未剪辑视频中定位动作实例。为了评估提案的可信度,现有的工作通常预测提案的行动分数,这些分数由提案和地面真相之间的联合时间交叉(tIoU)进行监督。在本文中,我们创新性地提出了一种通用的辅助背景约束思想,通过利用背景预测分数来限制提案的可信度,从而进一步抑制低质量提案。通过这种方式,背景约束概念可以很容易地插入现有的TAPG方法(例如,BMN、GTAD)中。从这个角度出发,我们提出了背景约束网络(BCNet),以进一步利用丰富的动作和背景信息。具体来说,我们引入了一个用于可靠置信度评估的动作-背景交互模块,该模块通过在帧和剪辑级别的注意机制对动作和背景之间的不一致性进行建模。在两个流行的基准上进行了广泛的实验,即ActivityNet-1.3和THUMOS14。结果表明,我们的方法优于最先进的方法。利用现有的动作分类器,我们的方法在时间动作定位任务上也取得了显著的效果。 摘要:Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances in untrimmed videos with temporal boundaries. To evaluate the confidence of proposals, the existing works typically predict action score of proposals that are supervised by the temporal Intersection-over-Union (tIoU) between proposal and the ground-truth. In this paper, we innovatively propose a general auxiliary Background Constraint idea to further suppress low-quality proposals, by utilizing the background prediction score to restrict the confidence of proposals. In this way, the Background Constraint concept can be easily plug-and-played into existing TAPG methods (e.g., BMN, GTAD). From this perspective, we propose the Background Constraint Network (BCNet) to further take advantage of the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background by attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, i.e., ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with the existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
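摘要中的"背景约束"思想可以用一个简单的置信度组合来直观说明（以下只是一种假设性的组合方式，具体公式以论文为准）：

```python
def constrained_confidence(action_score, background_score):
    """用背景预测分数抑制提议置信度：背景分越高，最终置信度越低。"""
    return action_score * (1.0 - background_score)

# 示例：动作分 0.9 但背景分 0.8 的低质量提议会被压到 0.18
print(constrained_confidence(0.9, 0.8))
```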

【3】 Transcoded Video Restoration by Temporal Spatial Auxiliary Network 标题:基于时间空间辅助网络的转码视频恢复 链接:https://arxiv.org/abs/2112.07948

作者:Li Xu,Gang He,Jinjia Zhou,Jie Lei,Weiying Xie,Yunsong Li,Yu-Wing Tai 备注:Accepted by AAAI2022 摘要:在大多数视频平台（如Youtube和TikTok）中，所播放的视频通常经历了多次视频编码，例如录制设备的硬件编码、视频编辑应用的软件编码，以及视频应用服务器的单次/多次视频转码。以往的压缩视频恢复工作通常假定压缩伪影是由一次性编码引起的，因此得到的解决方案在实践中通常效果不佳。在本文中，我们提出了一种用于转码视频恢复的新方法：时空辅助网络（TSAN）。我们的方法考虑了视频编码与转码之间的独特特性，将初始的浅层编码视频作为中间标签，以帮助网络进行自监督的注意力训练。此外，我们利用相邻多帧信息，提出了用于转码视频恢复的时间可变形对齐与金字塔式空间融合。实验结果表明，所提方法的性能优于以往技术。代码可在 https://github.com/icecherylXuli/TSAN 获取。 摘要:In most video platforms, such as Youtube, and TikTok, the played videos usually have undergone multiple video encodings such as hardware encoding by recording devices, software encoding by video editing apps, and single/multiple video transcoding by video application servers. Previous works in compressed video restoration typically assume the compression artifacts are caused by one-time encoding. Thus, the derived solution usually does not work very well in practice. In this paper, we propose a new method, temporal spatial auxiliary network (TSAN), for transcoded video restoration. Our method considers the unique traits between video encoding and transcoding, and we consider the initial shallow encoded videos as the intermediate labels to assist the network to conduct self-supervised attention training. In addition, we employ adjacent multi-frame information and propose the temporal deformable alignment and pyramidal spatial fusion for transcoded video restoration. The experimental results demonstrate that the performance of the proposed method is superior to that of the previous techniques. The code is available at https://github.com/icecherylXuli/TSAN.

医学相关(1篇)

【1】 Quantitative analysis of visual representation of sign elements in COVID-19 context 标题:冠状病毒背景下标志元素视觉表征的定量分析 链接:https://arxiv.org/abs/2112.08219

作者:María Jesús Cano-Martínez,Miguel Carrasco,Joaquín Sandoval,César González-Martín 摘要:表象是人类重新呈现现实的方式,无论是外在的还是内在的。因此,视觉表现作为一种交流手段,使用元素构建叙事,就像口头和书面语言一样。我们建议使用计算机分析,对与疫情相关的视觉创作中使用的元素进行定量分析,使用新冠病毒艺术博物馆Instagram账户中编辑的图像,分析用于代表全球事件主观体验的不同元素。这一过程采用了基于机器学习的技术来检测图像中的对象,因此该算法能够学习和检测每个研究图像中包含的对象。这项研究揭示了为创造叙事而在图像中重复的元素以及在样本中建立的联想关系,得出结论,尽管所有创作都带有主观性,在选择要包含在视觉表示中的对象时,存在共享和简化决策的某些参数 摘要:Representation is the way in which human beings re-present the reality of what is happening, both externally and internally. Thus, visual representation as a means of communication uses elements to build a narrative, just as spoken and written language do. We propose using computer analysis to perform a quantitative analysis of the elements used in the visual creations that have been produced in reference to the epidemic, using the images compiled in The Covid Art Museum's Instagram account to analyze the different elements used to represent subjective experiences with regard to a global event. This process has been carried out with techniques based on machine learning to detect objects in the images so that the algorithm can be capable of learning and detecting the objects contained in each study image. This research reveals that the elements that are repeated in images to create narratives and the relations of association that are established in the sample, concluding that, despite the subjectivity that all creation entails, there are certain parameters of shared and reduced decisions when it comes to selecting objects to be included in visual representations

GAN|对抗|攻击|生成相关(4篇)

【1】 Leveraging Image-based Generative Adversarial Networks for Time Series Generation 标题:利用基于图像的生成性对抗性网络生成时间序列 链接:https://arxiv.org/abs/2112.08060

作者:Justin Hellermann,Stefan Lessmann 摘要:生成模型在采样质量、多样性和特征分离方面取得了巨大成功。时间序列的生成模型缺乏这些优点,因为缺少表示,它捕获时间动态并允许采样反转。本文提出了跨期返回图(IRP)表示,以便于使用基于图像的生成对抗网络生成时间序列。事实证明,该表示法在捕捉时间序列特征方面是有效的,与其他表示法相比,它具有可逆性和尺度不变性。经验基准证实了这些特征,并证明IRP使现成的带有梯度惩罚的Wasserstein GAN能够对真实时间序列进行采样,其性能优于基于RNN的专门GAN,同时降低了模型复杂性。 摘要:Generative models synthesize image data with great success regarding sampling quality, diversity and feature disentanglement. Generative models for time series lack these benefits due to a missing representation, which captures temporal dynamics and allows inversion for sampling. The paper proposes the intertemporal return plot (IRP) representation to facilitate the use of image-based generative adversarial networks for time series generation. The representation proves effective in capturing time series characteristics and, compared to alternative representations, benefits from invertibility and scale-invariance. Empirical benchmarks confirm these features and demonstrate that the IRP enables an off-the-shelf Wasserstein GAN with gradient penalty to sample realistic time series, which outperform a specialized RNN-based GAN, while simultaneously reducing model complexity.
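作为示意，下面按"任意两个时刻之间的收益率"构造一个跨期收益矩阵，把一维序列转成可供图像GAN使用的二维表示（IRP的确切定义请以论文为准，此处的构造方式只是一种常见假设）：

```python
import numpy as np

def intertemporal_return_plot(prices):
    """R[i, j] = p_j / p_i - 1：序列中任意两个时刻之间的收益率矩阵。"""
    p = np.asarray(prices, dtype=float)
    return p[None, :] / p[:, None] - 1.0

irp = intertemporal_return_plot([100, 102, 99, 105])   # 形状为 (4, 4) 的"图像"
```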

【2】 Exploring the Asynchronous of the Frequency Spectra of GAN-generated Facial Images 标题:GAN生成人脸图像频谱异步性的研究 链接:https://arxiv.org/abs/2112.08050

作者:Binh M. Le,Simon S. Woo 备注:International Workshop on Safety and Security of Deep Learning IJCAI, 2021 摘要:生成性对抗网络(GAN)的迅速发展引起了人们对其恶意滥用的关注,尤其是在创建虚假人脸图像时。尽管提出的许多方法成功地检测到了基于GAN的合成图像,但它们仍然受到对大量训练伪图像数据集的需求以及检测器对未知人脸图像通用性的挑战的限制。在本文中,我们提出了一种探索颜色通道异步频谱的新方法,该方法简单但有效,可用于训练无监督和有监督学习模型来区分基于GAN的合成图像。我们进一步研究了一个训练模型的可转移性,该模型在一个源域中从我们建议的特征中学习,并在另一个目标域中使用特征分布的先验知识进行验证。我们的实验结果表明,频域中的光谱差异是一种实用的伪影,可以有效地检测各种类型的GAN基生成图像。 摘要:The rapid progression of Generative Adversarial Networks (GANs) has raised a concern of their misuse for malicious purposes, especially in creating fake face images. Although many proposed methods succeed in detecting GAN-based synthetic images, they are still limited by the need for large quantities of the training fake image dataset and challenges for the detector's generalizability to unknown facial images. In this paper, we propose a new approach that explores the asynchronous frequency spectra of color channels, which is simple but effective for training both unsupervised and supervised learning models to distinguish GAN-based synthetic images. We further investigate the transferability of a training model that learns from our suggested features in one source domain and validates on another target domains with prior knowledge of the features' distribution. Our experimental results show that the discrepancy of spectra in the frequency domain is a practical artifact to effectively detect various types of GAN-based generated images.

【3】 Object Pursuit: Building a Space of Objects via Discriminative Weight Generation 标题:对象追寻:通过判别性权重生成构建对象空间 链接:https://arxiv.org/abs/2112.07954

作者:Chuanyu Pan,Yanchao Yang,Kaichun Mo,Yueqi Duan,Leonidas Guibas 摘要:我们提出了一个持续学习以对象为中心的表示的框架,用于视觉学习和理解。现有的以对象为中心的表示要么依赖于对场景中的对象进行个性化的监控,要么执行难以处理现实世界中复杂场景的无监控解纠缠。为了减轻注释负担并放松对数据统计复杂性的限制,我们的方法利用交互有效地采样对象的各种变化和相应的训练信号,同时学习以对象为中心的表示。在整个学习过程中,对象以未知身份的随机顺序一个接一个地流化,并与潜在代码关联,这些代码可以通过卷积超网络为每个对象合成鉴别权重。此外,通过对学习对象的重新识别和遗忘预防,使学习过程高效、稳健。我们对所提出的框架的关键特性进行了广泛的研究,并分析了所学表示的特征。此外,我们还证明了该框架在学习表征方面的能力,可以提高下游任务中的标记效率。我们的代码和经过训练的模型将公开提供。 摘要:We propose a framework to continuously learn object-centric representations for visual learning and understanding. Existing object-centric representations either rely on supervisions that individualize objects in the scene, or perform unsupervised disentanglement that can hardly deal with complex scenes in the real world. To mitigate the annotation burden and relax the constraints on the statistical complexity of the data, our method leverages interactions to effectively sample diverse variations of an object and the corresponding training signals while learning the object-centric representations. Throughout learning, objects are streamed one by one in random order with unknown identities, and are associated with latent codes that can synthesize discriminative weights for each object through a convolutional hypernetwork. Moreover, re-identification of learned objects and forgetting prevention are employed to make the learning process efficient and robust. We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations. Furthermore, we demonstrate the capability of the proposed framework in learning representations that can improve label efficiency in downstream tasks. Our code and trained models will be made publicly available.

【4】 Efficient Geometry-aware 3D Generative Adversarial Networks 标题:一种高效的几何感知3D生成性对抗网络 链接:https://arxiv.org/abs/2112.07945

作者:Eric R. Chan,Connor Z. Lin,Matthew A. Chan,Koki Nagano,Boxiao Pan,Shalini De Mello,Orazio Gallo,Leonidas Guibas,Jonathan Tremblay,Sameh Khamis,Tero Karras,Gordon Wetzstein 备注:Project page: this https URL 摘要:仅使用单视图2D照片集无监督生成高质量多视图一致图像和3D形状一直是一个长期的挑战。现有的3D GAN要么是计算密集型的,要么是不一致的近似值;前者限制了生成图像的质量和分辨率,后者对多视图一致性和形状质量产生不利影响。在这项工作中,我们在不过度依赖这些近似的情况下提高了3D GANs的计算效率和图像质量。为此,我们引入了一种表现型混合显式-隐式网络体系结构,它与其他设计选择一起,不仅实时合成高分辨率多视图一致性图像,而且还生成高质量的三维几何图形。通过解耦特征生成和神经渲染,我们的框架能够利用最先进的2D CNN生成器,如StyleGAN2,并继承它们的效率和表达能力。我们用FFHQ和AFHQ猫演示了最先进的3D感知合成,以及其他实验。 摘要:Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. For this purpose, we introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.

OCR|文本相关(2篇)

【1】 Text Gestalt: Stroke-Aware Scene Text Image Super-Resolution 标题:文本格式塔:笔画感知场景文本图像超分辨率 链接:https://arxiv.org/abs/2112.08171

作者:Jingye Chen,Haiyang Yu,Jianqi Ma,Bin Li,Xiangyang Xue 备注:Accepted to AAAI2022. Code is available at this https URL 摘要:近十年来,随着深度学习的蓬勃发展,场景文本识别技术得到了飞速发展。然而,低分辨率场景文本图像的识别仍然是一个挑战。尽管已经提出了一些超分辨率方法来解决这个问题,但它们通常将文本图像视为一般图像,而忽略了笔画(文本的原子单位)的视觉质量对文本识别起着至关重要的作用这一事实。格式塔心理学认为,在先验知识的指导下,人类能够将部分细节组合成最相似的对象。同样,当人类观察低分辨率文本图像时,他们会固有地使用部分笔划级别的细节来恢复整体角色的外观。受格式塔心理学的启发,我们提出了一种包含笔划聚焦模块(SFM)的笔划感知场景文本图像超分辨率方法,以关注文本图像中字符的笔划级内部结构。具体地说,我们试图设计规则,在笔划级别分解英文字符和数字,然后预先训练文本识别器,以提供笔划级别的注意图作为位置线索,目的是控制生成的超分辨率图像和高分辨率地面真相之间的一致性。大量的实验结果验证了该方法确实能够在TextZoom和人工构建的汉字数据集degrade-IC13上生成更清晰的图像。此外,由于建议的SFM仅用于在训练时提供冲程水平指导,因此在测试阶段不会带来任何时间开销。代码可在https://github.com/FudanVI/FudanOCR/tree/main/text-gestalt. 摘要:In the last decade, the blossom of deep learning has witnessed the rapid development of scene text recognition. However, the recognition of low-resolution scene text images remains a challenge. Even though some super-resolution methods have been proposed to tackle this problem, they usually treat text images as general images while ignoring the fact that the visual quality of strokes (the atomic unit of text) plays an essential role for text recognition. According to Gestalt Psychology, humans are capable of composing parts of details into the most similar objects guided by prior knowledge. Likewise, when humans observe a low-resolution text image, they will inherently use partial stroke-level details to recover the appearance of holistic characters. Inspired by Gestalt Psychology, we put forward a Stroke-Aware Scene Text Image Super-Resolution method containing a Stroke-Focused Module (SFM) to concentrate on stroke-level internal structures of characters in text images. Specifically, we attempt to design rules for decomposing English characters and digits at stroke-level, then pre-train a text recognizer to provide stroke-level attention maps as positional clues with the purpose of controlling the consistency between the generated super-resolution image and high-resolution ground truth. The extensive experimental results validate that the proposed method can indeed generate more distinguishable images on TextZoom and manually constructed Chinese character dataset Degraded-IC13. Furthermore, since the proposed SFM is only used to provide stroke-level guidance when training, it will not bring any time overhead during the test phase. Code is available at https://github.com/FudanVI/FudanOCR/tree/main/text-gestalt.

【2】 SPTS: Single-Point Text Spotting 标题:SPTS:单点文本定位 链接:https://arxiv.org/abs/2112.07917

作者:Dezhi Peng,Xinyu Wang,Yuliang Liu,Jiaxin Zhang,Mingxin Huang,Songxuan Lai,Shenggao Zhu,Jing Li,Dahua Lin,Chunhua Shen,Lianwen Jin 摘要:几乎所有场景文本定位(检测和识别)方法都依赖于昂贵的框注释(例如,文本行框、单词级框和字符级框)。第一次,我们证明了训练场景文本定位模型可以通过对每个实例的单个点进行极低成本的注释来实现。我们提出了一种端到端的场景文本定位方法,该方法将场景文本定位作为一项序列预测任务,如语言建模。给定一幅图像作为输入,我们将期望的检测和识别结果表示为一个离散标记序列,并使用自回归变换器预测该序列。我们在几个水平、多方向和任意形状的场景文本基准上取得了令人满意的结果。最重要的是,我们表明,性能对点注释的位置并不十分敏感,这意味着它比需要精确位置的边界框更容易注释和自动生成。我们相信,这样一次开创性的尝试为场景文本识别应用提供了一个比以前更大规模的重要机会。 摘要:Almost all scene text spotting (detection and recognition) methods rely on costly box annotation (e.g., text-line box, word-level box, and character-level box). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single-point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task, like language modeling. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive transformer to predict the sequence. We achieve promising results on several horizontal, multi-oriented, and arbitrarily shaped scene text benchmarks. Most significantly, we show that the performance is not very sensitive to the positions of the point annotation, meaning that it can be much easier to be annotated and automatically generated than the bounding box that requires precise positions. We believe that such a pioneer attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible.

Attention注意力(1篇)

【1】 Consistent Depth Prediction under Various Illuminations using Dilated Cross Attention 标题:利用扩展交叉注意进行不同光照下的一致深度预测 链接:https://arxiv.org/abs/2112.08006

作者:Zitian Zhang,Chuhua Xian 备注:14 pages 摘要:在本文中,我们的目标是解决在不同光照条件下复杂场景中的一致深度预测问题。基于RGB-D传感器或虚拟渲染的现有室内数据集有两个关键限制-稀疏深度贴图(NYU depth V2)和非真实照明(SUN CG、SceneNet RGB-D)。我们建议使用internet 3D室内场景并手动调整其照明,以渲染照片逼真的RGB照片及其相应的深度和BRDF贴图,从而获得一个称为Vari数据集的新室内深度数据集。我们提出了一个简单的卷积块DCA,通过对编码特征应用深度可分离的扩展卷积来处理全局信息和减少参数。我们对这些扩展的特征进行交叉关注,以保持不同照明条件下深度预测的一致性。通过在Vari数据集上与当前最先进的方法进行比较,对我们的方法进行了评估,并在我们的实验中观察到了显著的改进。我们还进行了消融研究,对纽约大学深度V2的模型进行了微调,并对真实数据进行了评估,以进一步验证我们的DCA块的有效性。代码、预先训练的权重和Vari数据集都是开源的。 摘要:In this paper, we aim to solve the problem of consistent depth prediction in complex scenes under various illumination conditions. The existing indoor datasets based on RGB-D sensors or virtual rendering have two critical limitations - sparse depth maps (NYU Depth V2) and non-realistic illumination (SUN CG, SceneNet RGB-D). We propose to use internet 3D indoor scenes and manually tune their illuminations to render photo-realistic RGB photos and their corresponding depth and BRDF maps, obtaining a new indoor depth dataset called Vari dataset. We propose a simple convolutional block named DCA by applying depthwise separable dilated convolution on encoded features to process global information and reduce parameters. We perform cross attention on these dilated features to retain the consistency of depth prediction under different illuminations. Our method is evaluated by comparing it with current state-of-the-art methods on Vari dataset and a significant improvement is observed in our experiments. We also conduct the ablation study, finetune our model on NYU Depth V2 and also evaluate on real-world data to further validate the effectiveness of our DCA block. The code, pre-trained weights and Vari dataset are open-sourced.
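摘要中提到的"对编码特征应用深度可分离的空洞（扩张）卷积"可以用如下PyTorch代码块示意（模块结构与超参数均为假设，并非论文DCA模块的官方实现）：

```python
import torch.nn as nn

class DepthwiseSeparableDilatedConv(nn.Module):
    """逐通道空洞卷积 + 1x1 逐点卷积：在扩大感受野的同时大幅减少参数量。"""
    def __init__(self, in_channels, out_channels, dilation=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=in_channels, bias=False)   # 逐通道空洞卷积
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # 1x1 融合通道

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```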

人脸|人群计数(3篇)

【1】 ForgeryNet -- Face Forgery Analysis Challenge 2021: Methods and Results 标题:ForgeryNet--2021年人脸伪造分析挑战赛:方法和结果 链接:https://arxiv.org/abs/2112.08325

作者:Yinan He,Lu Sheng,Jing Shao,Ziwei Liu,Zhaofan Zou,Zhizhi Guo,Shan Jiang,Curitis Sun,Guosheng Zhang,Keyao Wang,Haixiao Yue,Zhibin Hong,Wanguo Wang,Zhenyu Li,Qi Wang,Zhenli Wang,Ronghao Xu,Mingwen Zhang,Zhiheng Wang,Zhenhang Huang,Tianming Zhang,Ningning Zhao 备注:Technical report. Challenge website: this https URL 摘要:真实感合成技术的快速发展已经达到了一个临界点，真实图像和操纵图像之间的边界开始模糊。最近，一个由290万张图像和221247段视频组成的超大规模深度面部伪造数据集ForgeryNet已经发布。就数据规模、操纵方法（7种图像级方法、8种视频级方法）、扰动（36种独立扰动及更多混合扰动）和注释（630万个分类标签、290万个操纵区域注释和221247个时序伪造片段标签）而言，它是迄今为止规模最大的公开可用数据集。本文报告了ForgeryNet人脸伪造分析挑战2021（Face Forgery Analysis Challenge 2021）的方法和结果，该挑战采用ForgeryNet基准。模型评估在私有测试集上离线进行。共有186名参赛者报名参加比赛，11支参赛队伍提交了有效的参赛作品。我们将分析排名靠前的解决方案，并对未来的工作方向进行讨论。 摘要:The rapid progress of photorealistic synthesis techniques has reached a critical point where the boundary between real and manipulated images starts to blur. Recently, a mega-scale deep face forgery dataset, ForgeryNet which comprised of 2.9 million images and 221,247 videos has been released. It is by far the largest publicly available in terms of data-scale, manipulations (7 image-level approaches, 8 video-level approaches), perturbations (36 independent and more mixed perturbations), and annotations (6.3 million classification labels, 2.9 million manipulated area annotations, and 221,247 temporal forgery segment labels). This paper reports methods and results in the ForgeryNet - Face Forgery Analysis Challenge 2021, which employs the ForgeryNet benchmark. The model evaluation is conducted offline on the private test set. A total of 186 participants registered for the competition, and 11 teams made valid submissions. We will analyze the top-ranked solutions and present some discussion on future work directions.

【2】 LookinGood^π: Real-time Person-independent Neural Re-rendering for High-quality Human Performance Capture 标题:LookinGood^π:面向高质量人体表演捕捉的与个人无关的实时神经重渲染 链接:https://arxiv.org/abs/2112.08037

作者:Xiqi Yang,Kewei Yang,Kang Chen,Weidong Zhang,Weiwei Xu 摘要:我们提出LookinGood^π，一种新颖的神经重渲染方法，旨在（1）实时提升人体表演捕捉系统低质量重建结果的渲染质量；（2）提高神经渲染网络对未见过人物的泛化能力。我们的关键思想是利用重建几何体的渲染图像作为引导，帮助从少量参考图像中预测特定于人的细节，从而增强重渲染的结果。鉴于此，我们设计了一个双分支网络。粗略分支用于修复某些瑕疵（如空洞、噪声）并获得渲染输入的粗略版本，而细节分支则用于从经过变形的参考图像中预测"正确"的细节。在细节分支的训练中，通过有效融合两个分支的特征来实现渲染图像的引导，同时提高变形精度和细节保真度。我们证明，在为未见过的人物生成高保真图像方面，我们的方法优于最先进的方法。 摘要:We propose LookinGood^{pi}, a novel neural re-rendering approach that is aimed to (1) improve the rendering quality of the low-quality reconstructed results from human performance capture system in real-time; (2) improve the generalization ability of the neural rendering network on unseen people. Our key idea is to utilize the rendered image of reconstructed geometry as the guidance to assist the prediction of person-specific details from few reference images, thus enhancing the re-rendered result. In light of this, we design a two-branch network. A coarse branch is designed to fix some artifacts (i.e. holes, noise) and obtain a coarse version of the rendered input, while a detail branch is designed to predict "correct" details from the warped references. The guidance of the rendered image is realized by blending features from two branches effectively in the training of the detail branch, which improves both the warping accuracy and the details' fidelity. We demonstrate that our method outperforms state-of-the-art methods at producing high-fidelity images on unseen people.

【3】 Does a Face Mask Protect my Privacy?: Deep Learning to Predict Protected Attributes from Masked Face Images 标题:口罩能保护我的隐私吗?:深度学习从戴口罩的人脸图像中预测受保护属性 链接:https://arxiv.org/abs/2112.07879

作者:Sachith Seneviratne,Nuran Kasthuriarachchi,Sanka Rasnayaka,Danula Hettiachchi,Ridwan Shariffdeen 备注:Accepted to AJCAI 2021 - 34th Australasian Joint Conference on Artificial Intelligence, Feb 2022, Sydney, Australia 摘要:为在抗击新冠肺炎（COVID-19）疫情中倡导预防性措施，非接触式的高效系统得到了快速部署。尽管此类系统有积极的益处，但也存在通过侵犯用户隐私而被滥用的可能。在这项工作中，我们通过使用戴口罩的人脸图像预测隐私敏感的软生物特征，来分析人脸生物特征识别系统的隐私侵入性。我们用20,003幅合成的戴口罩图像训练并应用了一个基于ResNet-50架构的CNN，并测量其隐私侵入性。尽管人们普遍认为戴口罩有利于隐私，但我们发现戴口罩对隐私侵入性没有显著影响。在我们的实验中，我们能够从戴口罩的人脸图像中准确预测性别（94.7%）、种族（83.1%）和年龄（MAE 6.21和RMSE 8.33）。我们提出的方法可以作为评估利用隐私敏感信息的人工智能系统隐私侵入性的基线工具。我们将所有贡献开源，以便研究社区复现并更广泛地使用。 摘要:Contactless and efficient systems are implemented rapidly to advocate preventive methods in the fight against the COVID-19 pandemic. Despite the positive benefits of such systems, there is potential for exploitation by invading user privacy. In this work, we analyse the privacy invasiveness of face biometric systems by predicting privacy-sensitive soft-biometrics using masked face images. We train and apply a CNN based on the ResNet-50 architecture with 20,003 synthetic masked images and measure the privacy invasiveness. Despite the popular belief of the privacy benefits of wearing a mask among people, we show that there is no significant difference to privacy invasiveness when a mask is worn. In our experiments we were able to accurately predict sex (94.7%), race (83.1%) and age (MAE 6.21 and RMSE 8.33) from masked face images. Our proposed approach can serve as a baseline utility to evaluate the privacy-invasiveness of artificial intelligence systems that make use of privacy-sensitive information. We open-source all contributions for re-producibility and broader use by the research community.

跟踪(2篇)

【1】 FEAR: Fast, Efficient, Accurate and Robust Visual Tracker 标题:FEAR:快速、高效、准确且鲁棒的视觉跟踪器 链接:https://arxiv.org/abs/2112.07957

作者:Vasyl Borsuk,Roman Vei,Orest Kupyn,Tetiana Martyniuk,Igor Krashenyi,Jiři Matas 摘要:我们介绍了FEAR，一种新颖、快速、高效、准确且鲁棒的孪生（Siamese）视觉跟踪器。我们引入了一个用于对象模型自适应的结构模块，称为双模板表示，以及一个像素级融合模块，以获得模型额外的灵活性和效率。与标准相关模块相比，双模板模块仅用单个可学习参数来融合时间信息，而像素级融合模块用更少的参数编码更具判别力的特征。通过将这些新模块插入复杂的主干网络，FEAR-M和FEAR-L跟踪器在多个学术基准上的准确率和效率均超过了大多数孪生跟踪器。采用轻量级主干网络的优化版本FEAR-XS的跟踪速度比当前孪生跟踪器快10倍以上，同时保持接近最先进的结果。FEAR-XS跟踪器比LightTrack[62]小2.4倍、快4.3倍，且精度更高。此外，我们通过引入能耗和执行速度基准，扩展了模型效率的定义。源代码、预训练模型和评估协议将应要求提供。 摘要:We present FEAR, a novel, fast, efficient, accurate, and robust Siamese visual tracker. We introduce an architecture block for object model adaption, called dual-template representation, and a pixel-wise fusion block to achieve extra flexibility and efficiency of the model. The dual-template module incorporates temporal information with only a single learnable parameter, while the pixel-wise fusion block encodes more discriminative features with fewer parameters compared to standard correlation modules. By plugging-in sophisticated backbones with the novel modules, FEAR-M and FEAR-L trackers surpass most Siamese trackers on several academic benchmarks in both accuracy and efficiencies. Employed with the lightweight backbone, the optimized version FEAR-XS offers more than 10 times faster tracking than current Siamese trackers while maintaining near state-of-the-art results. FEAR-XS tracker is 2.4x smaller and 4.3x faster than LightTrack [62] with superior accuracy. In addition, we expand the definition of the model efficiency by introducing a benchmark on energy consumption and execution speed. Source code, pre-trained models, and evaluation protocol will be made available upon request

【2】 Homography Decomposition Networks for Planar Object Tracking 标题:单应分解网络在平面目标跟踪中的应用 链接:https://arxiv.org/abs/2112.07909

作者:Xinrui Zhan,Yueran Liu,Jianke Zhu,Yang Li 备注:Accepted at AAAI 2022, preprint version 摘要:平面目标跟踪在人工智能应用中起着重要作用,如机器人技术、视觉伺服和视觉SLAM。尽管以前的平面跟踪器在大多数情况下都能很好地工作,但由于两个连续帧之间的快速运动和大变换,它仍然是一项具有挑战性的任务。这个问题背后的根本原因是当单应参数空间的搜索范围变大时,这样一个非线性系统的条件数不稳定地变化。为此,我们提出了一种新的单应分解网络(HDN)方法,通过将单应变换分解为两组,大大减少并稳定了条件数。具体地说,设计了一个相似变换估计器,通过深度卷积等变网络对第一组进行稳健预测。利用高置信度的尺度和旋转估计,通过简单的回归模型估计残差变换。此外,所提出的端到端网络以半监督方式进行训练。大量实验表明,在具有挑战性的POT、UCSB和POIC数据集上,我们提出的方法在很大程度上优于最新的平面跟踪方法。 摘要:Planar object tracking plays an important role in AI applications, such as robotics, visual servoing, and visual SLAM. Although the previous planar trackers work well in most scenarios, it is still a challenging task due to the rapid motion and large transformation between two consecutive frames. The essential reason behind this problem is that the condition number of such a non-linear system changes unstably when the searching range of the homography parameter space becomes larger. To this end, we propose a novel Homography Decomposition Networks~(HDN) approach that drastically reduces and stabilizes the condition number by decomposing the homography transformation into two groups. Specifically, a similarity transformation estimator is designed to predict the first group robustly by a deep convolution equivariant network. By taking advantage of the scale and rotation estimation with high confidence, a residual transformation is estimated by a simple regression model. Furthermore, the proposed end-to-end network is trained in a semi-supervised fashion. Extensive experiments show that our proposed approach outperforms the state-of-the-art planar tracking methods at a large margin on the challenging POT, UCSB and POIC datasets.
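摘要中"将单应变换分解为两组"的思路可以写成下面的分解形式（符号为说明性假设，具体参数化以论文为准）：

```latex
H = H_{s}\,H_{r},\qquad
H_{s}=\begin{pmatrix}
 s\cos\theta & -s\sin\theta & t_x\\
 s\sin\theta & s\cos\theta & t_y\\
 0 & 0 & 1
\end{pmatrix},\qquad
H_{r}=H_{s}^{-1}H .
```

其中 H_s 是由尺度 s、旋转 θ 和平移 (t_x, t_y) 组成的相似变换，先由等变网络稳健估计；剩余的 H_r 再由简单的回归模型估计，从而缩小每一步的搜索范围并稳定条件数。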

图像视频检索|Re-id相关(1篇)

【1】 Value Retrieval with Arbitrary Queries for Form-like Documents 标题:具有任意查询的类表单文档的值检索 链接:https://arxiv.org/abs/2112.07820

作者:Mingfei Gao,Le Xue,Chetan Ramaiah,Chen Xing,Ran Xu,Caiming Xiong 摘要:我们提出了对表单类文档进行任意查询的值检索，以减少人工处理表单的工作量。与以前只处理固定字段项集的方法不同，我们的方法基于对表单布局和语义的理解来预测任意查询的目标值。为了进一步提高模型性能，我们提出了一种简单文档语言建模(simpleDLM)策略，以改进大规模模型预训练中的文档理解。实验结果表明，我们的方法显著优于基线；与最先进的预训练方法相比，simpleDLM将值检索性能进一步提升了约17%的F1分数。代码将公开提供。 摘要:We propose value retrieval with arbitrary queries for form-like documents to reduce human effort of processing forms. Unlike previous methods that only address a fixed set of field items, our method predicts target value for an arbitrary query based on the understanding of layout and semantics of a form. To further boost model performance, we propose a simple document language modeling (simpleDLM) strategy to improve document understanding on large-scale model pre-training. Experimental results show that our method outperforms our baselines significantly and the simpleDLM further improves our performance on value retrieval by around 17% F1 score compared with the state-of-the-art pre-training method. Code will be made publicly available.

裁剪|量化|加速|压缩相关(1篇)

【1】 An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network 标题:预训练对卷积神经网络修剪影响的实验研究 链接:https://arxiv.org/abs/2112.08227

作者:Nathan Hubens,Matei Mancas,Bernard Gosselin,Marius Preda,Titus Zaharia 备注:7 pages, published at APPIS 2020 摘要:近年来，深度神经网络在各个应用领域都取得了广泛的成功。然而，它们需要大量的计算和内存资源，这严重阻碍了其部署，尤其是在移动设备或实时应用上。神经网络通常包含大量参数，即网络的权重。这些通过训练过程获得的参数决定了网络的性能，但它们也是高度冗余的。剪枝方法正是试图通过识别并去除不相关的权重来缩小参数集的规模。在本文中，我们考察了训练策略对剪枝效率的影响。我们考虑并比较了两种训练方式：(1)微调和(2)从头训练。在四个数据集（CIFAR10、CIFAR100、SVHN和Caltech101）和两种不同CNN（VGG16和MobileNet）上获得的实验结果表明，先在大型语料库（如ImageNet）上预训练、再在特定数据集上微调的网络，比同一网络从零开始训练时能够被更有效地修剪（参数削减高达80%）。 摘要:In recent years, deep neural networks have known a wide success in various application domains. However, they require important computational and memory resources, which severely hinders their deployment, notably on mobile devices or for real-time applications. Neural networks usually involve a large number of parameters, which correspond to the weights of the network. Such parameters, obtained with the help of a training process, are determinant for the performance of the network. However, they are also highly redundant. The pruning methods notably attempt to reduce the size of the parameter set, by identifying and removing the irrelevant weights. In this paper, we examine the impact of the training strategy on the pruning efficiency. Two training modalities are considered and compared: (1) fine-tuned and (2) from scratch. The experimental results obtained on four datasets (CIFAR10, CIFAR100, SVHN and Caltech101) and for two different CNNs (VGG16 and MobileNet) demonstrate that a network that has been pre-trained on a large corpus (e.g. ImageNet) and then fine-tuned on a particular dataset can be pruned much more efficiently (up to 80% of parameter reduction) than the same network trained from scratch.
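作为背景，下面给出按L1范数删除整只卷积滤波器这一结构化剪枝常见做法的示意（PyTorch；与论文中具体使用的剪枝流程未必一致，仅用于理解"识别并去除不相关权重"这一步骤）：

```python
import torch

def prune_filters_by_l1(conv_weight, prune_ratio=0.5):
    """按每个输出滤波器的 L1 范数排序，保留范数最大的 (1 - prune_ratio) 部分。"""
    norms = conv_weight.abs().sum(dim=(1, 2, 3))          # 形状 [out_channels]
    n_keep = max(1, int(conv_weight.size(0) * (1 - prune_ratio)))
    keep_idx = torch.argsort(norms, descending=True)[:n_keep]
    return conv_weight[keep_idx], keep_idx                # 保留的滤波器及其索引
```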

点云|SLAM|雷达|激光|深度RGBD相关(3篇)

【1】 Putting People in their Place: Monocular Regression of 3D People in Depth 标题:把人放回原位:3D人深度的单目回归 链接:https://arxiv.org/abs/2112.08274

作者:Yu Sun,Wu Liu,Qian Bao,Yili Fu,Tao Mei,Michael J. Black 备注:Code will be available at this https URL 摘要:给定一个多人的图像,我们的目标是直接回归所有人的姿势和形状以及他们的相对深度。然而,在不知道一个人的身高的情况下,推断一个人在图像中的深度从根本上来说是不明确的。当场景中包含大小非常不同的人(例如从婴儿到成人)时,这一问题尤其严重。要解决这个问题,我们需要做几件事。首先,我们开发了一种新的方法来推断多人在一张图像中的姿势和深度。以前的工作是通过在图像平面上进行推理来估计多人,而我们的方法称为BEV,它增加了一个额外的想象鸟瞰图表示,以明确说明深度的原因。BEV同时对图像和深度中的身体中心进行推理,并通过组合这些信息来估计3D身体位置。与以前的工作不同,BEV是一种端到端可微的单发方法。第二,身高随年龄而变化,如果不估计图像中人物的年龄,就无法分辨深度。为此,我们开发了一个3D人体模型空间,让BEV从婴儿到成人推断形状。第三,为了训练BEV,我们需要一个新的数据集。具体来说,我们创建了一个“相对人”(RH)数据集,其中包括年龄标签和图像中人与人之间的相对深度关系。在RH和AGORA上的大量实验证明了该模型和训练方案的有效性。BEV在深度推理、子形状估计和遮挡鲁棒性方面优于现有方法。代码和数据集将发布用于研究目的。 摘要:Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. To solve this, we need several things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an additional imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combing these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset will be released for research purposes.

【2】 Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry 标题:单视深度概率与多视点几何融合的多视点深度估计 链接:https://arxiv.org/abs/2112.08177

作者:Gwangbin Bae,Ignas Budvytis,Roberto Cipolla 摘要:多视图深度估计方法通常需要计算多视图代价量,这会导致巨大的内存消耗和缓慢的推理。此外,对于无纹理曲面、反射曲面和移动对象,多视图匹配可能会失败。对于此类故障模式,单视图深度估计方法通常更可靠。为此,我们提出了一种融合单视图深度概率和多视图几何的新框架MaGNet,以提高多视图深度估计的准确性、鲁棒性和效率。对于每一帧,MaGNet估计单个视图深度概率分布,参数化为像素高斯分布。然后使用为参考帧估计的分布对每像素深度候选进行采样。这种概率抽样使网络能够在评估较少深度候选的同时获得更高的精度。我们还提出了多视图匹配分数的深度一致性加权,以确保多视图深度与单视图预测一致。该方法在ScanNet、7场景和KITTI上实现了最先进的性能。定性评估表明,我们的方法对具有挑战性的伪影(如无纹理/反射表面和移动对象)更具鲁棒性。 摘要:Multi-view depth estimation methods typically require the computation of a multi-view cost-volume, which leads to huge memory consumption and slow inference. Furthermore, multi-view matching can fail for texture-less surfaces, reflective surfaces and moving objects. For such failure modes, single-view depth estimation methods are often more reliable. To this end, we propose MaGNet, a novel framework for fusing single-view depth probability with multi-view geometry, to improve the accuracy, robustness and efficiency of multi-view depth estimation. For each frame, MaGNet estimates a single-view depth probability distribution, parameterized as a pixel-wise Gaussian. The distribution estimated for the reference frame is then used to sample per-pixel depth candidates. Such probabilistic sampling enables the network to achieve higher accuracy while evaluating fewer depth candidates. We also propose depth consistency weighting for the multi-view matching score, to ensure that the multi-view depth is consistent with the single-view predictions. The proposed method achieves state-of-the-art performance on ScanNet, 7-Scenes and KITTI. Qualitative evaluation demonstrates that our method is more robust against challenging artifacts such as texture-less/reflective surfaces and moving objects.
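摘要中"按单目预测的逐像素高斯分布采样深度候选"可以用如下代码示意（候选的具体生成方式为假设，以论文为准）：

```python
import torch

def sample_depth_candidates(mu, sigma, k=5):
    """以单目预测均值 mu 为中心、按标准差 sigma 等间隔取 k 个深度候选。
    mu、sigma 形状为 [B, 1, H, W]，返回 [B, k, H, W]。"""
    offsets = torch.linspace(-1.5, 1.5, k, device=mu.device).view(1, k, 1, 1)
    return mu + offsets * sigma
```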

【3】 Depth Refinement for Improved Stereo Reconstruction 标题:一种改进的立体重建深度细化算法 链接:https://arxiv.org/abs/2112.08070

作者:Amit Bracha,Noam Rotstein,David Bensaïd,Ron Slossberg,Ron Kimmel 摘要:深度估计是大量需要对环境进行3D评估的应用的基础,例如机器人技术、增强现实技术和自动驾驶等。深度估计的一个突出技术是立体匹配,它有几个优点:它被认为比其他深度传感技术更容易获得,可以实时产生密集的深度估计,并且近年来从深度学习的进步中受益匪浅。然而,当前用于从立体图像估计深度的技术仍然存在固有缺陷。为了重建深度,立体匹配算法首先估计左右图像之间的视差图,然后再应用几何三角剖分。一个简单的分析表明,深度误差与物体的距离成二次比例。因此,对于远离摄影机的对象,恒定的视差误差会转化为较大的深度误差。为了缓解这种二次关系,我们提出了一种简单但有效的方法,使用细化网络进行深度估计。我们的分析和实证结果表明,建议的学习过程减少了这种二次关系。我们在著名的基准和数据集(如Sceneflow和KITTI数据集)上评估了建议的细化过程,并证明了深度精度度量的显著改进。 摘要:Depth estimation is a cornerstone of a vast number of applications requiring 3D assessment of the environment, such as robotics, augmented reality, and autonomous driving to name a few. One prominent technique for depth estimation is stereo matching which has several advantages: it is considered more accessible than other depth-sensing technologies, can produce dense depth estimates in real-time, and has benefited greatly from the advances of deep learning in recent years. However, current techniques for depth estimation from stereoscopic images still suffer from a built-in drawback. To reconstruct depth, a stereo matching algorithm first estimates the disparity map between the left and right images before applying a geometric triangulation. A simple analysis reveals that the depth error is quadratically proportional to the object's distance. Therefore, constant disparity errors are translated to large depth errors for objects far from the camera. To mitigate this quadratic relation, we propose a simple but effective method that uses a refinement network for depth estimation. We show analytical and empirical results suggesting that the proposed learning procedure reduces this quadratic relation. We evaluate the proposed refinement procedure on well-known benchmarks and datasets, like Sceneflow and KITTI datasets, and demonstrate significant improvements in the depth accuracy metric.
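摘要中"深度误差与物体距离成二次关系"可由标准的立体三角化公式直接推出（f为焦距，B为基线，d为视差）：

```latex
Z=\frac{fB}{d}
\;\Longrightarrow\;
\left|\frac{\partial Z}{\partial d}\right|=\frac{fB}{d^{2}}
\;\Longrightarrow\;
|\Delta Z|\approx\frac{fB}{d^{2}}\,|\Delta d|=\frac{Z^{2}}{fB}\,|\Delta d| .
```

即恒定的视差误差 Δd 会被放大为与深度平方成正比的深度误差，这正是远处物体的深度估计明显变差的原因，也是本文引入细化网络试图缓解的关系。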

3D|3D重建等相关(1篇)

【1】 3D Question Answering 标题:3D问答 链接:https://arxiv.org/abs/2112.08359

作者:Shuquan Ye,Dongdong Chen,Songfang Han,Jing Liao 摘要:近年来，视觉问答（VQA）技术取得了巨大的进步。然而，大多数工作只关注二维图像上的问答任务。在本文中，我们首次尝试将VQA扩展到3D领域，这可以促进人工智能对3D真实场景的感知。与基于图像的VQA不同，3D问答（3DQA）以彩色点云为输入，需要同时具备外观和3D几何理解能力才能回答与3D相关的问题。为此，我们提出了一种新的基于Transformer的3DQA框架"3DQA-TR"，它由两个编码器组成，分别用于利用外观信息和几何信息。外观、几何和语言问题的多模态信息最终通过一个3D-Linguistic BERT相互交互，以预测目标答案。为了验证我们提出的3DQA框架的有效性，我们进一步构建了第一个3DQA数据集"ScanQA"，它建立在ScanNet数据集之上，覆盖806个场景，包含约6K个问题和约30K个答案。在该数据集上的大量实验表明，我们提出的3DQA框架明显优于现有的VQA框架，也验证了我们主要设计的有效性。我们的代码和数据集将公开，以促进这一方向的研究。 摘要:Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most efforts only focus on the 2D image question answering tasks. In this paper, we present the first attempt at extending VQA to the 3D domain, which can facilitate artificial intelligence's perception of 3D real-world scenarios. Different from image based VQA, 3D Question Answering (3DQA) takes the color point cloud as input and requires both appearance and 3D geometry comprehension ability to answer the 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework \textbf{``3DQA-TR"}, which consists of two encoders for exploiting the appearance and geometry information, respectively. The multi-modal information of appearance, geometry, and the linguistic question can finally attend to each other via a 3D-Linguistic Bert to predict the target answers. To verify the effectiveness of our proposed 3DQA framework, we further develop the first 3DQA dataset \textbf{``ScanQA"}, which builds on the ScanNet dataset and contains $\sim$6K questions, $\sim$30K answers for $806$ scenes. Extensive experiments on this dataset demonstrate the obvious superiority of our proposed 3DQA framework over existing VQA frameworks, and the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate the research in this direction.

其他神经网络|深度学习|模型|建模(1篇)

【1】 Single Image Automatic Radial Distortion Compensation Using Deep Convolutional Network 标题:基于深卷积网络的单幅图像径向畸变自动补偿 链接:https://arxiv.org/abs/2112.08198

作者:Igor Janos,Wanda Benesova 摘要:在许多计算机视觉领域,输入图像必须符合针孔相机模型,即现实世界中的直线作为图像中的直线投影。在体育直播画面上执行计算机视觉任务带来了挑战性的要求,算法不能依赖于特定的校准模式,必须能够处理未知和未校准的摄像机、源自复杂电视镜头的径向失真、通过以下方式补偿失真的少量视觉线索:,以及实时性能的必要性。提出了一种基于深度卷积神经网络的单图像镜头畸变自动补偿方法,该方法利用多项式畸变模型的两个最高阶系数,在体育广播应用领域具有实时性和准确性。关键词:深卷积神经网络、径向畸变、单幅图像校正 摘要:In many computer vision domains, the input images must conform with the pinhole camera model, where straight lines in the real world are projected as straight lines in the image. Performing computer vision tasks on live sports broadcast footage imposes challenging requirements where the algorithms cannot rely on a specific calibration pattern must be able to cope with unknown and uncalibrated cameras, radial distortion originating from complex television lenses, few visual clues to compensate distortion by, and the necessity for real-time performance. We present a novel method for single-image automatic lens distortion compensation based on deep convolutional neural networks, capable of real-time performance and accuracy using two highest-order coefficients of the polynomial distortion model operating in the application domain of sports broadcast. Keywords: Deep Convolutional Neural Network, Radial Distortion, Single Image Rectification

其他(4篇)

【1】 Detail-aware Deep Clothing Animations Infused with Multi-source Attributes 标题:融合多源属性的细节感知深度服装动画 链接:https://arxiv.org/abs/2112.07974

作者:Tianxing Li,Rui Shi,Takashi Kanai 备注:14 pages, 12 figures 摘要:该文提出了一种新的基于学习的服装变形方法,为各种形状的人体在各种动画中穿着的服装生成丰富合理的细节变形。与现有的基于学习的方法不同,该方法需要针对不同的服装拓扑或姿势建立大量训练模型,并且无法轻松实现丰富的细节,我们使用统一的框架来高效、轻松地生成高保真变形。为了解决预测受多源属性影响的变形的挑战性问题,我们从新的角度提出了三种策略。具体来说,我们首先发现衣服和身体之间的配合对褶皱程度有重要影响。然后,我们设计了一个属性解析器来生成细节感知编码,并将其注入到图形神经网络中,从而增强了不同属性下细节的识别能力。此外,为了获得更好的收敛性和避免过度平滑的变形,我们提出了输出重构来减轻学习任务的复杂性。实验结果表明,本文提出的变形方法在泛化能力和细节质量方面均优于现有方法。 摘要:This paper presents a novel learning-based clothing deformation method to generate rich and reasonable detailed deformations for garments worn by bodies of various shapes in various animations. In contrast to existing learning-based methods, which require numerous trained models for different garment topologies or poses and are unable to easily realize rich details, we use a unified framework to produce high fidelity deformations efficiently and easily. To address the challenging issue of predicting deformations influenced by multi-source attributes, we propose three strategies from novel perspectives. Specifically, we first found that the fit between the garment and the body has an important impact on the degree of folds. We then designed an attribute parser to generate detail-aware encodings and infused them into the graph neural network, therefore enhancing the discrimination of details under diverse attributes. Furthermore, to achieve better convergence and avoid overly smooth deformations, we proposed output reconstruction to mitigate the complexity of the learning task. Experiment results show that our proposed deformation method achieves better performance over existing methods in terms of generalization ability and quality of details.

【2】 Predicting Media Memorability: Comparing Visual, Textual and Auditory Features 标题:预测媒体记忆力:比较视觉、文本和听觉特征 链接:https://arxiv.org/abs/2112.07969

作者:Lorin Sweeney,Graham Healy,Alan F. Smeaton 备注:3 pages 摘要:本文介绍了我们在MediaEval 2021"媒体记忆性预测"任务中的方法，该任务旨在通过设置自动预测视频记忆性的任务来研究媒体记忆性问题。今年我们从比较的角度处理这项任务，希望对所探索的三种模态分别获得更深入的了解，并以去年（2020年）提交的结果作为参考点。与去年一样，我们在TRECVid2019数据集上测试的最佳短期记忆性模型（0.132）是一个未在任何TRECVid数据上训练过的基于帧的CNN；而在Memento10k数据集上测试的最佳短期记忆性模型（0.524）则是一个拟合DenseNet121视觉特征的贝叶斯岭回归器。 摘要:This paper describes our approach to the Predicting Media Memorability task in MediaEval 2021, which aims to address the question of media memorability by setting the task of automatically predicting video memorability. This year we tackle the task from a comparative standpoint, looking to gain deeper insights into each of three explored modalities, and using our results from last year's submission (2020) as a point of reference. Our best performing short-term memorability model (0.132) tested on the TRECVid2019 dataset -- just like last year -- was a frame based CNN that was not trained on any TRECVid data, and our best short-term memorability model (0.524) tested on the Memento10k dataset, was a Bayesian Ride Regressor fit with DenseNet121 visual features.

【3】 Autonomous Navigation System from Simultaneous Localization and Mapping 标题:基于同步定位与建图的自主导航系统 链接:https://arxiv.org/abs/2112.07723

作者:Micheal Caracciolo,Owen Casciotti,Christopher Lloyd,Ernesto Sola-Thomas,Matthew Weaver,Kyle Bielby,Md Abdul Baset Sarker,Masudul H. Imtiaz 摘要:本文介绍了一个基于同步定位与建图（SLAM）的自主导航系统的开发。这项研究的动机是寻找一种在室内空间自主导航的解决方案。室内导航具有挑战性，因为室内环境可能不断变化。解决这一问题对于清洁、医疗卫生和制造业等众多服务行业来说都是必要的。本文的重点是描述为该自主系统开发的基于SLAM的软件架构，并评估了该系统面向智能轮椅的一个潜在应用。目前的室内导航解决方案需要某种引导线，比如地板上的黑线；而采用本文提出的方案，室内无需为此进行改造。该应用程序的源代码已经开源，因此可以被复用于类似的应用；也期望广大开源社区能在当前版本的基础上继续改进这一开源项目。 摘要:This paper presents the development of a Simultaneous Localization and Mapping (SLAM) based Autonomous Navigation system. The motivation for this study was to find a solution for navigating interior spaces autonomously. Interior navigation is challenging as it can be forever evolving. Solving this issue is necessary for multitude of services, like cleaning, the health industry, and in manufacturing industries. The focus of this paper is the description of the SLAM-based software architecture developed for this proposed autonomous system. A potential application of this system, oriented to a smart wheelchair, was evaluated. Current interior navigation solutions require some sort of guiding line, like a black line on the floor. With this proposed solution, interiors do not require renovation to accommodate this solution. The source code of this application has been made open source so that it could be re-purposed for a similar application. Also, this open-source project is envisioned to be improved by the broad open-source community upon past its current state.

【4】 Identifying Class Specific Filters with L1 Norm Frequency Histograms in Deep CNNs 标题:利用L1范数频率直方图识别深层CNN中的类特定滤波器 链接:https://arxiv.org/abs/2112.07719

作者:Akshay Badola,Cherian Roy,Vineet Padmanabhan,Rajendra Lal 备注:19 pages, 5 figures, github repo: this https URL 摘要:深层神经网络的可解释性已成为一个主要的探索领域。尽管这些网络在许多任务中都达到了最先进的准确性,但要解释和解释他们的决定是极其困难的。在这项工作中,我们分析了深卷积网络的最后一层和倒数第二层,并提供了一种有效的方法,用于识别对网络的类决策贡献最大的特征子集。我们证明,与最后一层的维数相比,每个类别的此类特征数量要低得多,因此,深层CNN的决策面位于低维流形上,并且与网络深度成正比。我们的方法允许将最后一层分解为单独的子空间,与整个网络的最后一层相比,该子空间更易于解释,并且具有更低的计算成本。 摘要:Interpretability of Deep Neural Networks has become a major area of exploration. Although these networks have achieved state of the art accuracy in many tasks, it is extremely difficult to interpret and explain their decisions. In this work we analyze the final and penultimate layers of Deep Convolutional Networks and provide an efficient method for identifying subsets of features that contribute most towards the network's decision for a class. We demonstrate that the number of such features per class is much lower in comparison to the dimension of the final layer and therefore the decision surface of Deep CNNs lies on a low dimensional manifold and is proportional to the network depth. Our methods allow to decompose the final layer into separate subspaces which is far more interpretable and has a lower computational cost as compared to the final layer of the full network.
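下面用一个小函数示意"为每个类别找出倒数第二层中贡献最大的特征子集"这一思路（按各类样本特征绝对值的均值排序，变量与做法均为说明性假设，并非论文的原始算法）：

```python
import numpy as np

def top_features_per_class(penultimate_feats, labels, k=10):
    """penultimate_feats: [N, D] 的倒数第二层特征；返回每个类响应最强的前 k 个特征下标。"""
    result = {}
    for c in np.unique(labels):
        mean_abs = np.abs(penultimate_feats[labels == c]).mean(axis=0)   # 类内平均 L1 贡献
        result[c] = np.argsort(mean_abs)[::-1][:k].tolist()
    return result
```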