Computer Vision and Pattern Recognition arXiv Digest [11.12]

cs.CV: 43 papers in total today

Transformer (2 papers)

【1】 A Survey of Visual Transformers
Link: https://arxiv.org/abs/2111.06091

Authors: Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He
Abstract: Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently been done on adapting Transformer-like architectures to Computer Vision (CV) fields, which have demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO, and ADE20k as compared with modern Convolutional Neural Networks (CNNs). In this paper, we have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), where a taxonomy is proposed to organize these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and oriented tasks, we have also evaluated these methods on different configurations for easy and intuitive comparison instead of only on various benchmarks. Furthermore, we have revealed a series of essential but unexploited aspects that may empower Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investigation.

【2】 Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture
Link: https://arxiv.org/abs/2111.06075

Authors: Michael Yang, Aditya Anantharaman, Zachary Kitowski, Derik Clive Robert
Affiliation: Carnegie Mellon University, Pittsburgh, USA
Note: Presented as a poster at the CVPR 2021 Visual Question Answering Workshop
Abstract: Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually-impaired people. TextVQA is a VQA dataset geared towards this problem, where the questions require answering systems to read and reason about visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of 'edge features', that is, information about the relationship between each pair of objects. Some current TextVQA models address this problem but either only use categories of relations (rather than edge feature vectors) or do not use edge features within the Transformer architecture. In order to overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the accuracy of the M4C baseline model by 0.65% on the val set and 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability to M4C.
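
As a rough, hedged illustration of the edge-feature idea described in this abstract (the projection matrices, dimensions, and the exact way GRT injects edge information into the attention score are assumptions, not the authors' implementation), a minimal NumPy sketch of edge-aware graph attention:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_aware_attention(nodes, edges, d_k=64, rng=np.random.default_rng(0)):
    """Toy single-head attention where pairwise edge features modulate the scores.

    nodes: (N, D) object/text features; edges: (N, N, E) pairwise features
    (e.g. relative box geometry).  The projection matrices are random stand-ins
    for learned weights.
    """
    N, D = nodes.shape
    E = edges.shape[-1]
    Wq = rng.normal(size=(D, d_k)); Wk = rng.normal(size=(D, d_k))
    We = rng.normal(size=(E, d_k)); Wv = rng.normal(size=(D, d_k))
    q, k, v = nodes @ Wq, nodes @ Wk, nodes @ Wv          # (N, d_k) each
    e = edges @ We                                         # (N, N, d_k)
    # score(i, j) uses the key of node j *plus* the embedding of edge (i, j)
    scores = np.einsum('id,ijd->ij', q, k[None, :, :] + e) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v                    # (N, d_k)

out = edge_aware_attention(np.random.rand(5, 32), np.random.rand(5, 5, 8))
print(out.shape)  # (5, 64)
```

The only change relative to standard dot-product attention is that the key of node j is shifted by an embedding of the pairwise feature before scoring, which is one simple way pairwise information can enter the attention computation.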

Detection (3 papers)

【1】 csBoundary: City-scale Road-boundary Detection in Aerial Images for High-definition Maps
Link: https://arxiv.org/abs/2111.06020

Authors: Zhenhua Xu, Yuxuan Liu, Lu Gan, Xiangcheng Hu, Yuxiang Sun, Lujia Wang, Ming Liu
Abstract: High-Definition (HD) maps can provide precise geometric and semantic information of static traffic environments for autonomous driving. The road boundary is one of the most important pieces of information contained in HD maps, since it distinguishes between road areas and off-road areas and can guide vehicles to drive within road areas. But it is labor-intensive to annotate road boundaries for HD maps at the city scale. To enable automatic HD map annotation, current work uses semantic segmentation or iterative graph growing for road-boundary detection. However, the former cannot ensure topological correctness since it works at the pixel level, while the latter suffers from inefficiency and drifting issues. To provide a solution to the aforementioned problems, in this letter we propose a novel system termed csBoundary to automatically detect road boundaries at the city scale for HD map annotation. Our network takes as input an aerial image patch and directly infers the continuous road-boundary graph (i.e., vertices and edges) from this image. To generate the city-scale road-boundary graph, we stitch the obtained graphs from all the image patches. Our csBoundary is evaluated and compared on a public benchmark dataset. The results demonstrate our superiority. The accompanying demonstration video is available at our project page: https://sites.google.com/view/csboundary/.

【2】 Detecting COVID-19 from Chest Computed Tomography Scans using AI-Driven Android Application
Link: https://arxiv.org/abs/2111.06254

Authors: Aryan Verma, Sagar B. Amin, Muhammad Naeem, Monjoy Saha
Abstract: The COVID-19 (coronavirus disease 2019) pandemic affected more than 186 million people with over 4 million deaths worldwide by June 2021; its magnitude has strained global healthcare systems. Chest Computed Tomography (CT) scans have a potential role in the diagnosis and prognostication of COVID-19. Designing a diagnostic system which is cost-efficient and convenient to operate on resource-constrained devices like mobile phones would enhance the clinical usage of chest CT scans and provide swift, mobile, and accessible diagnostic capabilities. This work proposes developing a novel Android application that detects COVID-19 infection from chest CT scans using a highly efficient and accurate deep learning algorithm. It further creates an attention heatmap, augmented on the segmented lung parenchyma region in the CT scans through an algorithm developed as a part of this work, which shows the regions of infection in the lungs. We propose a selection approach combined with multi-threading for faster generation of heatmaps on an Android device, which reduces the processing time by about 93%. The neural network trained to detect COVID-19 in this work achieves an F1 score and accuracy of 99.58% each and a sensitivity of 99.69%, which is better than most of the results in the domain of COVID diagnosis from CT scans. This work will be beneficial in high-volume practices and will help doctors triage patients for early diagnosis of COVID-19 quickly and efficiently.

【3】 Advancing Brain Metastases Detection in T1-Weighted Contrast-Enhanced 3D MRI using Noisy Student-based Training
Link: https://arxiv.org/abs/2111.05959

Authors: Engin Dikici, Xuan V. Nguyen, Matthew Bigelow, John L. Ryu, Luciano M. Prevedello
Abstract: The detection of brain metastases (BM) in their early stages could have a positive impact on the outcome of cancer patients. We previously developed a framework for detecting small BM (with diameters of less than 15mm) in T1-weighted Contrast-Enhanced 3D Magnetic Resonance images (T1c) to assist medical experts in this time-sensitive and high-stakes task. The framework utilizes a dedicated convolutional neural network (CNN) trained using labeled T1c data, where the ground-truth BM segmentations were provided by a radiologist. This study aims to advance the framework with a noisy-student-based self-training strategy to make use of a large corpus of unlabeled T1c data (i.e., data without BM segmentations or detections). Accordingly, the work (1) describes the student and teacher CNN architectures, (2) presents data and model noising mechanisms, and (3) introduces a novel pseudo-labeling strategy factoring in the learned BM detection sensitivity of the framework. Finally, it describes a semi-supervised learning strategy utilizing these components. We performed the validation using 217 labeled and 1247 unlabeled T1c exams via 2-fold cross-validation. The framework utilizing only the labeled exams produced 9.23 false positives at 90% BM detection sensitivity, whereas the framework using the introduced learning strategy led to a ~9% reduction in false detections (i.e., 8.44) at the same sensitivity level. Furthermore, while experiments utilizing 75% and 50% of the labeled datasets resulted in algorithm performance degradation (12.19 and 13.89 false positives respectively), the impact was less pronounced with the noisy-student-based training strategy (10.79 and 12.37 false positives respectively).
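
How exactly the pseudo-labeling strategy "factors in" the learned detection sensitivity is not spelled out in the abstract; the toy sketch below shows only one plausible reading (choosing a confidence threshold that preserves the teacher's validated sensitivity), with all names, data, and the thresholding rule being assumptions rather than the paper's method:

```python
import numpy as np

def select_pseudo_labels(candidates, scores, val_scores, target_sensitivity=0.9):
    """Toy sensitivity-aware pseudo-labelling: pick the score threshold at which
    the teacher reaches `target_sensitivity` on validated true detections, then
    keep only unlabeled-exam candidates above that threshold.

    `val_scores` are the teacher's scores on ground-truth lesions from the
    labeled validation split; `candidates`/`scores` come from unlabeled exams.
    """
    # threshold = the (1 - sensitivity) quantile of scores on real lesions,
    # so roughly 90% of true lesions would pass it
    thr = np.quantile(val_scores, 1.0 - target_sensitivity)
    keep = scores >= thr
    return [c for c, k in zip(candidates, keep) if k], thr

cands = ['lesion_a', 'lesion_b', 'lesion_c']
kept, thr = select_pseudo_labels(cands, np.array([0.95, 0.40, 0.75]),
                                 val_scores=np.random.beta(5, 2, size=200))
print(kept, round(thr, 3))
```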

Classification | Recognition (7 papers)

【1】 Towards Domain-Independent and Real-Time Gesture Recognition Using mmWave Signal
Link: https://arxiv.org/abs/2111.06195

Authors: Yadong Li, Dongheng Zhang, Jinbo Chen, Jinwei Wan, Dong Zhang, Yang Hu, Qibin Sun, Yan Chen
Note: Submitted to IEEE Transactions on Mobile Computing; still under review
Abstract: Human gesture recognition using millimeter wave (mmWave) signals provides attractive applications including smart home and in-car interfaces. While existing works achieve promising performance under controlled settings, practical applications are still limited due to the need for intensive data collection, extra training efforts when adapting to new domains (i.e., environments, persons and locations) and poor performance for real-time recognition. In this paper, we propose DI-Gesture, a domain-independent and real-time mmWave gesture recognition system. Specifically, we first derive the signal variation corresponding to human gestures with spatial-temporal processing. To enhance the robustness of the system and reduce data collection efforts, we design a data augmentation framework based on the correlation between signal patterns and gesture variations. Furthermore, we propose a dynamic window mechanism to perform gesture segmentation automatically and accurately, thus enabling real-time recognition. Finally, we build a lightweight neural network to extract spatial-temporal information from the data for gesture classification. Extensive experimental results show DI-Gesture achieves an average accuracy of 97.92%, 99.18% and 98.76% for new users, environments and locations, respectively. In real-time scenarios, the accuracy of DI-Gesture reaches over 97% with an average inference time of 2.87ms, which demonstrates the superior robustness and effectiveness of our system.

【2】 A Novel Approach for Deterioration and Damage Identification in Building Structures Based on Stockwell-Transform and Deep Convolutional Neural Network
Link: https://arxiv.org/abs/2111.06155

Authors: Vahid Reza Gharehbaghi, Hashem Kalbkhani, Ehsan Noroozinejad Farsangi, T. Y. Yang, Andy Nguyene, Seyedali Mirjalili, C. Málaga-Chuquitaype
Affiliation: Kharazmi University, Tehran, Iran; Urmia University of Technology, Urmia, Iran; Graduate University of Advanced Technology
Note: 11 figures and 11 tables; accepted in the Journal of Structural Integrity and Maintenance
Abstract: In this paper, a novel deterioration and damage identification procedure (DIP) is presented and applied to building models. The challenge associated with applications on these types of structures is related to the strong correlation of responses, which gets further complicated when coping with real ambient vibrations with high levels of noise. Thus, a DIP is designed utilizing low-cost ambient vibrations to analyze the acceleration responses using the Stockwell transform (ST) to generate spectrograms. Subsequently, the ST outputs become the input of two series of Convolutional Neural Networks (CNNs) established for identifying deterioration and damage to the building models. To the best of our knowledge, this is the first time that both damage and deterioration are evaluated on building models through a combination of ST and CNN with high accuracy.
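
For readers unfamiliar with the Stockwell transform, a minimal FFT-based sketch of how an acceleration response could be turned into the time-frequency spectrograms fed to the CNNs is given below; this is the textbook discrete S-transform, not the paper's exact preprocessing, and the toy signal and sampling rate are made up:

```python
import numpy as np

def stockwell_transform(x):
    """Minimal FFT-based discrete Stockwell transform (time-frequency representation).

    Returns an (N//2 + 1, N) complex array: rows are frequencies, columns time.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    Xp = np.concatenate([X, X])              # periodic extension for index shifts
    S = np.zeros((N // 2 + 1, N), dtype=complex)
    S[0, :] = x.mean()                       # zero-frequency row: signal mean
    k = np.arange(N)
    m = np.where(k <= N // 2, k, k - N)      # signed frequency offsets
    for n in range(1, N // 2 + 1):
        gaussian = np.exp(-2 * np.pi ** 2 * m ** 2 / n ** 2)   # frequency-domain window
        S[n, :] = np.fft.ifft(Xp[n:n + N] * gaussian)
    return S

fs = 200.0
t = np.arange(0, 2, 1 / fs)
accel = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)  # toy acceleration response
spec = np.abs(stockwell_transform(accel))   # magnitude spectrogram that a CNN could consume
print(spec.shape)
```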

【3】 Open surgery tool classification and hand utilization using a multi-camera system
Link: https://arxiv.org/abs/2111.06098

Authors: Kristina Basiev, Adam Goldbraikh, Carla M. Pugh, Shlomi Laufer
Affiliation: Technion – Israel Institute of Technology, Haifa, Israel; Stanford University School of Medicine, Stanford, California
Note: 12 pages, 3 figures, submitted to IPCAI 2022
Abstract: Purpose: The goal of this work is to use multi-camera video to classify open surgery tools as well as identify which tool is held in each hand. Multi-camera systems help prevent occlusions in open surgery video data. Furthermore, combining multiple views, such as a top-view camera covering the full operative field and a close-up camera focusing on hand motion and anatomy, may provide a more comprehensive view of the surgical workflow. However, multi-camera data fusion poses a new challenge: a tool may be visible in one camera and not the other. Thus, we defined the global ground truth as the tools being used regardless of their visibility. Therefore, tools that are out of the image should be remembered for extensive periods of time while the system responds quickly to changes visible in the video. Methods: Participants (n=48) performed a simulated open bowel repair. A top-view and a close-up camera were used. YOLOv5 was used for tool and hand detection. A high-frequency LSTM with a 1-second window at 30 frames per second (fps) and a low-frequency LSTM with a 40-second window at 3 fps were used for spatial, temporal, and multi-camera integration. Results: The accuracy and F1 of the six systems were: top-view (0.88/0.88), close-up (0.81/0.83), both cameras (0.9/0.9), high-fps LSTM (0.92/0.93), low-fps LSTM (0.9/0.91), and our final architecture, the multi-camera classifier (0.93/0.94). Conclusion: By combining a high-fps and a low-fps system over the multiple-camera array, we improved classification with respect to the global ground truth.

【4】 Synthetic Document Generator for Annotation-free Layout Recognition
Link: https://arxiv.org/abs/2111.06016

Authors: Natraj Raman, Sameena Shah, Manuela Veloso
Affiliation: JPMorgan AI Research, London, UK and New York, USA
Abstract: Analyzing the layout of a document to identify headers, sections, tables, figures etc. is critical to understanding its content. Deep learning based approaches for detecting the layout structure of document images have been promising. However, these methods require a large number of annotated examples during training, which are both expensive and time consuming to obtain. We describe here a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of the layout elements. The proposed generative process treats every physical component of a document as a random variable and models their intrinsic dependencies using a Bayesian Network graph. Our hierarchical formulation using stochastic templates allows parameter sharing between documents for retaining broad themes, and yet the distributional characteristics produce visually unique samples, thereby capturing complex and diverse layouts. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.

【5】 Feature Generation for Long-tail Classification
Link: https://arxiv.org/abs/2111.05956

Authors: Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi
Affiliation: Indian Institute of Technology, Hyderabad; NVIDIA; IIIT Hyderabad, India
Note: Accepted at ICVGIP'21. Code available at this https URL
Abstract: The visual world naturally exhibits an imbalance in the number of object or scene instances resulting in a long-tailed distribution. This imbalance poses significant challenges for classification models based on deep learning. Oversampling instances of the tail classes attempts to solve this imbalance. However, the limited visual diversity results in a network with poor representation ability. A simple counter to this is decoupling the representation and classifier networks and using oversampling only to train the classifier. In this paper, instead of repeatedly re-sampling the same image (and thereby features), we explore a direction that attempts to generate meaningful features by estimating the tail category's distribution. Inspired by ideas from recent work on few-shot learning, we create calibrated distributions to sample additional features that are subsequently used to train the classifier. Through several experiments on the CIFAR-100-LT (long-tail) dataset with varying imbalance factors and on mini-ImageNet-LT (long-tail), we show the efficacy of our approach and establish a new state-of-the-art. We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions. Our code is available at https://github.com/rahulvigneswaran/TailCalibX.
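
A hedged sketch of the general "calibrate a tail class distribution from its nearest head classes, then sample extra features" recipe mentioned above; the exact calibration rule, the hyperparameters (k, alpha), and how the sampled features are mixed into classifier training are assumptions for illustration, not taken from the paper's code:

```python
import numpy as np

def calibrate_and_sample(tail_feats, head_stats, k=2, alpha=0.1, n_new=64,
                         rng=np.random.default_rng(0)):
    """Toy distribution calibration in the spirit of few-shot 'distribution
    calibration' methods: borrow statistics from the k head classes whose means
    are closest to the tail class, then sample synthetic tail features.

    tail_feats: (n_tail, D) features of one tail class
    head_stats: list of (mean, cov) tuples, one per head class
    """
    mu_t = tail_feats.mean(axis=0)
    dists = [np.linalg.norm(mu_t - mu) for mu, _ in head_stats]
    nearest = np.argsort(dists)[:k]
    mu = (mu_t + sum(head_stats[i][0] for i in nearest)) / (k + 1)
    cov = sum(head_stats[i][1] for i in nearest) / k + alpha * np.eye(len(mu_t))
    return rng.multivariate_normal(mu, cov, size=n_new)    # synthetic tail features

D = 16
heads = [(np.random.rand(D), np.eye(D)) for _ in range(5)]
extra = calibrate_and_sample(np.random.rand(3, D), heads)
print(extra.shape)  # (64, 16)
```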

【6】 An Extensive Study of User Identification via Eye Movements across Multiple Datasets
Link: https://arxiv.org/abs/2111.05901

Authors: Sahar Mahdie Klim Al Zaidawi, Martin H. U. Prinzler, Jonas Lührs, Sebastian Maneth
Affiliation: Database Lab, University of Bremen, Germany
Note: 11 pages, 5 figures, submitted to Signal Processing: Image Communication
Abstract: Several studies have reported that biometric identification based on eye movement characteristics can be used for authentication. This paper provides an extensive study of user identification via eye movements across multiple datasets based on an improved version of the method originally proposed by George and Routray. We analyzed our method with respect to several factors that affect the identification accuracy, such as the type of stimulus, the IVT parameters (used for segmenting the trajectories into fixations and saccades), adding new features such as higher-order derivatives of eye movements, the inclusion of blink information, template aging, age and gender. We find that three methods, namely selecting optimal IVT parameters, adding higher-order derivative features and including an additional blink classifier, have a positive impact on the identification accuracy. The improvements range from a few percentage points up to an impressive 9% increase on one of the datasets.
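
The IVT (velocity-threshold identification) step referred to above is the standard way of splitting a gaze trajectory into fixations and saccades; a minimal sketch follows, where the 50 deg/s threshold and the toy 250 Hz signal are illustrative values rather than the paper's tuned parameters:

```python
import numpy as np

def ivt_segment(gaze_xy, timestamps, velocity_threshold=50.0):
    """Minimal I-VT segmentation: samples whose point-to-point velocity is below
    the threshold are labeled fixation, the rest saccade.  Positions are in
    degrees of visual angle, timestamps in seconds.
    """
    dt = np.diff(timestamps)
    disp = np.linalg.norm(np.diff(gaze_xy, axis=0), axis=1)
    velocity = disp / np.maximum(dt, 1e-9)            # deg/s between samples
    labels = np.where(velocity < velocity_threshold, 'fixation', 'saccade')
    return velocity, labels

t = np.arange(0, 1.0, 0.004)                          # 250 Hz recording
xy = np.cumsum(np.random.normal(scale=0.05, size=(len(t), 2)), axis=0)
xy[120:130] += np.linspace(0, 8, 10)[:, None]         # inject a saccade-like jump
vel, lab = ivt_segment(xy, t)
print((lab == 'saccade').sum(), 'saccade samples')
```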

【7】 Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention
Link: https://arxiv.org/abs/2111.05890

Authors: Lev Evtodienko
Affiliation: Higher School of Economics
Abstract: Classifying group-level emotions is a challenging task due to the complexity of video, in which not only visual but also audio information should be taken into consideration. Existing works on multimodal emotion recognition use a bulky approach, where pretrained neural networks serve as feature extractors and the extracted features are then fused. However, this approach does not consider the attributes of multimodal data, and the feature extractors cannot be fine-tuned for the specific task, which can be disadvantageous for overall model accuracy. To this end, our contribution is twofold: (i) we train the model end-to-end, which allows the early layers of the neural network to adapt while taking into account the later fusion layers of the two modalities; (ii) all layers of our model are fine-tuned for the downstream task of emotion recognition, so there is no need to train neural networks from scratch. Our model achieves a best validation accuracy of 60.37%, which is approximately 8.5% higher than the VGAF dataset baseline and competitive with existing works using audio and video modalities.

Segmentation | Semantics (4 papers)

【1】 The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos
Link: https://arxiv.org/abs/2111.06394

Authors: Runtao Liu, Zhirong Wu, Stella X. Yu, Stephen Lin
Affiliation: Microsoft Research Asia; Johns Hopkins University; UC Berkeley ICSI
Note: This paper has been accepted to NeurIPS 2021
Abstract: Humans can easily segment moving objects without knowing what they are. That objectness could emerge from continuous visual observations motivates us to model grouping and movement concurrently from unlabeled videos. Our premise is that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis which can be checked from the data itself without any external supervision. Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images. It then binds them in a conjoint representation called segment flow that pools flow offsets over each region and provides a gross characterization of moving regions for the entire scene. By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively. Our model demonstrates the surprising emergence of objectness in the appearance pathway, surpassing prior works on zero-shot object segmentation from an image, moving object segmentation from a video with unsupervised test-time adaptation, and semantic image segmentation by supervised fine-tuning. Our work is the first truly end-to-end zero-shot object segmentation from videos. It not only develops generic objectness for segmentation and tracking, but also outperforms prevalent image-based contrastive learning methods without augmentation engineering.

【2】 Dense Unsupervised Learning for Video Segmentation
Link: https://arxiv.org/abs/2111.06265

Authors: Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth
Affiliation: Department of Computer Science, TU Darmstadt; hessian.AI
Note: To appear at NeurIPS 2021. Code: this https URL
Abstract: We present a novel approach to unsupervised learning for video object segmentation (VOS). Unlike previous work, our formulation allows learning dense feature representations directly in a fully convolutional regime. We rely on uniform grid sampling to extract a set of anchors and train our model to disambiguate between them on both inter- and intra-video levels. However, a naive scheme to train such a model results in a degenerate solution. We propose to prevent this with a simple regularisation scheme, accommodating the equivariance property of the segmentation task to similarity transformations. Our training objective admits efficient implementation and exhibits fast training convergence. On established VOS benchmarks, our approach exceeds the segmentation accuracy of previous work despite using significantly less training data and compute power.

【3】 Semantic-aware Representation Learning Via Probability Contrastive Loss
Link: https://arxiv.org/abs/2111.06021

Authors: Junjie Li, Yixin Zhang, Zilei Wang, Keyu Tu
Affiliation: University of Science and Technology of China
Note: 15 pages, 3 figures
Abstract: Recent feature contrastive learning (FCL) has shown promising performance in unsupervised representation learning. For close-set representation learning, where labeled data and unlabeled data belong to the same semantic space, however, FCL cannot show overwhelming gains because it does not involve the class semantics during optimization. Consequently, the produced features are not guaranteed to be easily classified by the class weights learned from labeled data, although they are information-rich. To tackle this issue, we propose a novel probability contrastive learning (PCL) in this paper, which not only produces rich features but also enforces them to be distributed around the class prototypes. Specifically, we propose to use the output probabilities after softmax to perform contrastive learning instead of the extracted features in FCL. Evidently, such a way can exploit the class semantics during optimization. Moreover, we propose to remove the $\ell_{2}$ normalization in the traditional FCL and directly use the $\ell_{1}$-normalized probability for contrastive learning. Our proposed PCL is simple and effective. We conduct extensive experiments on three close-set image classification tasks, i.e., unsupervised domain adaptation, semi-supervised learning, and semi-supervised domain adaptation. The results on multiple datasets demonstrate that our PCL consistently obtains considerable gains and achieves state-of-the-art performance for all three tasks.
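
As a hedged illustration of contrasting softmax probabilities instead of $\ell_{2}$-normalized features, here is a toy InfoNCE-style loss computed on probability vectors; the pairing of the two views, the temperature, and this particular loss form are assumptions for illustration rather than the paper's exact PCL objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def probability_contrastive_loss(logits_a, logits_b, temperature=0.1):
    """Toy contrastive loss on softmax probabilities (already l1-normalized)
    rather than on l2-normalized feature vectors."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)     # (B, C) class probabilities
    sim = p_a @ p_b.T / temperature                     # (B, B) similarity of probability vectors
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # positives are matching indices

B, C = 8, 10
la, lb = np.random.randn(B, C), np.random.randn(B, C)
print(probability_contrastive_loss(la, lb))
```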

【4】 Trustworthy Medical Segmentation with Uncertainty Estimation
Link: https://arxiv.org/abs/2111.05978

Authors: Giuseppina Carannante, Dimah Dera, Nidhal C. Bouaynaya, Rasool Ghulam, Hassan M. Fathallah-Shaykh
Affiliation: University of Texas Rio Grande Valley
Abstract: Deep Learning (DL) holds great promise in reshaping healthcare systems given its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is hindering their deployment in the clinic. Most systems produce point estimates without further information about model uncertainty or confidence. This paper introduces a new Bayesian deep learning framework for uncertainty quantification in segmentation neural networks, specifically encoder-decoder architectures. The proposed framework uses the first-order Taylor series approximation to propagate and learn the first two moments (mean and covariance) of the distribution of the model parameters given the training data by maximizing the evidence lower bound. The output consists of two maps: the segmented image and the uncertainty map of the segmentation. The uncertainty in the segmentation decisions is captured by the covariance matrix of the predictive distribution. We evaluate the proposed framework on medical image segmentation data from Magnetic Resonance Imaging and Computed Tomography scans. Our experiments on multiple benchmark datasets demonstrate that the proposed framework is more robust to noise and adversarial attacks compared to state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed framework associates low confidence (or equivalently high uncertainty) with patches in the test input images that are corrupted with noise, artifacts or adversarial attacks. Thus, the model can self-assess its segmentation decisions when it makes an erroneous prediction or misses part of the segmentation structures, e.g., tumor, by presenting higher values in the uncertainty map.
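
To make the "first-order Taylor propagation of mean and covariance" concrete, here is a toy sketch of moment propagation through a single linear layer followed by ReLU; it only illustrates the kind of bookkeeping involved and is not the authors' variational formulation (the ReLU linearization and the shapes are assumptions):

```python
import numpy as np

def propagate_moments_linear_relu(mu, sigma, W, b):
    """Toy first-order (Taylor) moment propagation through one Linear+ReLU layer:
    given the input mean and covariance, return the output mean and covariance."""
    mu_lin = W @ mu + b
    sigma_lin = W @ sigma @ W.T
    # first-order Taylor around the mean: ReLU acts as a 0/1 gate (its Jacobian)
    gate = (mu_lin > 0).astype(float)
    mu_out = gate * mu_lin
    sigma_out = (gate[:, None] * gate[None, :]) * sigma_lin
    return mu_out, sigma_out

mu, sigma = np.zeros(4), np.eye(4)
W, b = np.random.randn(3, 4), np.zeros(3)
m, S = propagate_moments_linear_relu(mu, sigma, W, b)
print(m.shape, S.shape)   # (3,) (3, 3)
```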

Semi-/Weakly-/Un-supervised | Active Learning | Uncertainty (4 papers)

【1】 Unsupervised Part Discovery from Contrastive Reconstruction
Link: https://arxiv.org/abs/2111.06349

Authors: Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Affiliation: Visual Geometry Group, University of Oxford, Oxford, UK
Note: To appear in NeurIPS 2021. Project page: this https URL
Abstract: The goal of self-supervised visual representation learning is to learn strong, transferable image representations, with the majority of research focusing on the object or scene level. On the other hand, representation learning at the part level has received significantly less attention. In this paper, we propose an unsupervised approach to object part discovery and segmentation and make three contributions. First, we construct a proxy task through a set of objectives that encourages the model to learn a meaningful decomposition of the image into its parts. Secondly, prior work argues for reconstructing or clustering pre-computed features as a proxy to parts; we show empirically that this alone is unlikely to find meaningful parts, mainly because of their low resolution and the tendency of classification networks to spatially smear out information. We suggest that image reconstruction at the level of pixels can alleviate this problem, acting as a complementary cue. Lastly, we show that the standard evaluation based on keypoint regression does not correlate well with segmentation quality and thus introduce different metrics, NMI and ARI, that better characterize the decomposition of objects into parts. Our method yields semantic parts which are consistent across fine-grained but visually distinct categories, outperforming the state of the art on three benchmark datasets. Code is available at the project page: https://www.robots.ox.ac.uk/~vgg/research/unsup-parts/.

【2】 Self-Supervised Real-time Video Stabilization
Link: https://arxiv.org/abs/2111.05980

Authors: Jinsoo Choi, Jaesik Park, In So Kweon
Affiliation: KAIST, Republic of Korea; POSTECH
Note: BMVC 2021
Abstract: Videos are a popular media form, and online video streaming has recently gathered much popularity. In this work, we propose a novel method of real-time video stabilization: transforming a shaky video into a stabilized video as if it were stabilized via gimbals in real time. Our framework is trainable in a self-supervised manner, which does not require data captured with special hardware setups (i.e., two cameras on a stereo rig or additional motion sensors). Our framework consists of a transformation estimator between given frames for global stability adjustments, followed by a scene parallax reduction module via spatially smoothed optical flow for further stability. Then, a margin inpainting module fills in the missing margin regions created during stabilization to reduce the amount of post-cropping. These sequential steps reduce distortion and margin cropping to a minimum while enhancing stability. Hence, our approach outperforms state-of-the-art real-time video stabilization methods as well as offline methods that require camera trajectory optimization. Our method takes approximately 24.3 ms per frame, yielding 41 fps regardless of resolution (e.g., 480p or 1080p).

【3】 Self-Supervised Multi-Object Tracking with Cross-Input Consistency
Link: https://arxiv.org/abs/2111.05943

Authors: Favyen Bastani, Songtao He, Sam Madden
Affiliation: MIT CSAIL
Note: NeurIPS 2021
Abstract: In this paper, we propose a self-supervised learning procedure for training a robust multi-object tracking (MOT) model given only unlabeled video. While several self-supervisory learning signals have been proposed in prior work on single-object tracking, such as color propagation and cycle-consistency, these signals cannot be directly applied for training RNN models, which are needed to achieve accurate MOT: they yield degenerate models that, for instance, always match new detections to tracks with the closest initial detections. We propose a novel self-supervisory signal that we call cross-input consistency: we construct two distinct inputs for the same sequence of video, by hiding different information about the sequence in each input. We then compute tracks in that sequence by applying an RNN model independently on each input, and train the model to produce consistent tracks across the two inputs. We evaluate our unsupervised method on MOT17 and KITTI -- remarkably, we find that, despite training only on unlabeled video, our unsupervised approach outperforms four supervised methods published in the last 1-2 years, including Tracktor++, FAMNet, GSM, and mmMOT.

【4】 A Histopathology Study Comparing Contrastive Semi-Supervised and Fully Supervised Learning
Link: https://arxiv.org/abs/2111.05882

Authors: Lantian Zhang, Mohamed Amgad, Lee A. D. Cooper
Affiliation: North Shore Country Day, Winnetka, IL, USA; Department of Pathology, Northwestern University, Chicago, IL, USA
Note: 7 pages, 4 figures, 4 tables
Abstract: Data labeling is often the most challenging task when developing computational pathology models. Pathologist participation is necessary to generate accurate labels, and the limitations on pathologist time and the demand for large, labeled datasets have led to research in areas including weakly supervised learning using patient-level labels, machine-assisted annotation and active learning. In this paper we explore self-supervised learning to reduce labeling burdens in computational pathology. We explore this in the context of classification of breast cancer tissue using the Barlow Twins approach, and we compare self-supervision with alternatives like pre-trained networks in low-data scenarios. For the task explored in this paper, we find that ImageNet pre-trained networks largely outperform the self-supervised representations obtained using Barlow Twins.
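
For context, the Barlow Twins objective used here pushes the cross-correlation matrix of two augmented views' embeddings towards the identity. A minimal sketch of that published loss is below; the toy batch is made up, while the loss form and the default lambda follow the original Barlow Twins recipe:

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Minimal Barlow Twins objective: decorrelate embedding dimensions while
    making the two views of each sample agree (cross-correlation -> identity)."""
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-9)   # normalize per dimension
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-9)
    N, D = z_a.shape
    c = z_a.T @ z_b / N                                          # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

za, zb = np.random.randn(32, 128), np.random.randn(32, 128)     # two augmented views' embeddings
print(barlow_twins_loss(za, zb))
```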

Temporal | Action Recognition | Pose | Video | Motion Estimation (3 papers)

【1】 6D Pose Estimation with Combined Deep Learning and 3D Vision Techniques for a Fast and Accurate Object Grasping
Link: https://arxiv.org/abs/2111.06276

Authors: Tuan-Tang Le, Trung-Son Le, Yu-Ru Chen, Joel Vidal, Chyi-Yeu Lin
Abstract: Real-time robotic grasping, supporting a subsequent precise object-in-hand operation task, is a priority target for highly advanced autonomous systems. However, an algorithm that can perform sufficiently accurate grasping with time efficiency is yet to be found. This paper proposes a novel method with a 2-stage approach that combines fast 2D object recognition using a deep neural network with a subsequent accurate and fast 6D pose estimation based on the Point Pair Feature framework, forming a real-time 3D object recognition and grasping solution capable of handling multi-object-class scenes. The proposed solution has the potential to perform robustly in real-time applications, which require both efficiency and accuracy. To validate our method, we conducted extensive and thorough experiments, involving laborious preparation of our own dataset. The experimental results show that the proposed method scores 97.37% accuracy on the 5cm5deg metric and 99.37% on the Average Distance metric. The results show an overall 62% relative improvement (5cm5deg metric) and 52.48% (Average Distance metric) by using the proposed method. Moreover, the pose estimation execution also showed an average improvement of 47.6% in running time. Finally, to illustrate the overall efficiency of the system in real-time operations, a pick-and-place robotic experiment was conducted and showed a convincing success rate of 90%. A video of this experiment is available at https://sites.google.com/view/dl-ppf6dpose/.
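
As background, the Point Pair Feature framework mentioned above hashes a classic 4-dimensional descriptor of oriented point pairs; the sketch below is the textbook definition of that feature (the sample points and normals are made-up values), included only to make the term concrete:

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4-D point pair feature F = (||d||, angle(n1, d), angle(n2, d),
    angle(n1, n2)) used by PPF-based 6D pose pipelines."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    def angle(a, b):
        a = a / (np.linalg.norm(a) + 1e-12)
        b = b / (np.linalg.norm(b) + 1e-12)
        return np.arccos(np.clip(a @ b, -1.0, 1.0))
    return np.array([dist, angle(n1, d), angle(n2, d), angle(n1, n2)])

f = point_pair_feature(np.array([0., 0., 0.]), np.array([0., 0., 1.]),
                       np.array([0.05, 0., 0.02]), np.array([0., 1., 0.]))
print(f)
```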

【2】 Towards Live Video Analytics with On-Drone Deeper-yet-Compatible Compression
Link: https://arxiv.org/abs/2111.06263

Authors: Junpeng Guo, Chunyi Peng
Affiliation: Purdue University
Abstract: In this work, we present DCC (Deeper-yet-Compatible Compression), an enabling technique for real-time drone-sourced edge-assisted video analytics built on top of existing codecs. DCC tackles an important technical problem: compressing video streamed from the drone to the edge without sacrificing the accuracy and timeliness of the video analytics tasks performed at the edge. DCC is inspired by the fact that not every bit in streamed video is equally valuable to video analytics, which opens new compression room over the conventional analytics-oblivious video codec technology. We exploit drone-specific context and intermediate hints from object detection to pursue the adaptive fidelity needed to retain analytical quality. We have prototyped DCC in one showcase application of vehicle detection and validated its efficiency in representative scenarios. DCC has reduced the transmission volume by 9.5-fold over the baseline approach and by 19-683% over the state of the art with comparable detection accuracy.

【3】 Spatio-Temporal Scene-Graph Embedding for Autonomous Vehicle Collision Prediction
Link: https://arxiv.org/abs/2111.06123

Authors: Arnav V. Malawade, Shih-Yuan Yu, Brandon Hsu, Deepan Muthirayan, Pramod P. Khargonekar, Mohammad A. Al Faruque
Affiliation: Department of Electrical Engineering & Computer Science, University of California, Irvine, CA
Abstract: In autonomous vehicles (AVs), early warning systems rely on collision prediction to ensure occupant safety. However, state-of-the-art methods using deep convolutional networks either fail at modeling collisions or are too expensive/slow, making them less suitable for deployment on AV edge hardware. To address these limitations, we propose sg2vec, a spatio-temporal scene-graph embedding methodology that uses Graph Neural Network (GNN) and Long Short-Term Memory (LSTM) layers to predict future collisions via visual scene perception. We demonstrate that sg2vec predicts collisions 8.11% more accurately and 39.07% earlier than the state-of-the-art method on synthesized datasets, and 29.47% more accurately on a challenging real-world collision dataset. We also show that sg2vec is better than the state-of-the-art at transferring knowledge from synthetic datasets to real-world driving datasets. Finally, we demonstrate that sg2vec performs inference 9.3x faster with an 88.0% smaller model, 32.4% less power, and 92.8% less energy than the state-of-the-art method on the industry-standard Nvidia DRIVE PX 2 platform, making it more suitable for implementation on the edge.

Autonomous Driving | Vehicles | Lane Detection, etc. (2 papers)

【1】 Automatically identifying a mobile phone user's position within a vehicle
Link: https://arxiv.org/abs/2111.06306

Authors: Matt Knutson, Kevin Kramer, Sara Seifert, Ryan Chamberlain
Affiliation: Minnesota HealthSolutions, Minneapolis, MN
Note: 4 pages, 1 figure
Abstract: Traffic-related injuries and fatalities are major health risks in the United States. Mobile phone use while driving quadruples the risk of a motor vehicle crash. This work demonstrates the feasibility of using the mobile phone camera to passively detect the location of the phone's user within a vehicle. In a large, varied dataset we were able to correctly identify whether the user was in the driver's seat or one of the passenger seats with 94.9% accuracy. This model could be used by application developers to selectively change or lock functionality while a user is driving, but not if the user is a passenger in a moving vehicle.

【2】 Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet
Link: https://arxiv.org/abs/2111.05990

Authors: Bo Wang, Reza Mohajerpoor, Chen Cai, Inhi Kim, Hai L. Vu
Affiliation: Institute of Transport Studies; CSIRO's Data, NSW, Australia; Civil and Environmental Engineering Department, Kongju National University, South Korea
Abstract: The IARAI competition Traffic4cast 2021 aims to predict short-term city-wide high-resolution traffic states given the static and dynamic traffic information obtained previously. The aim is to build a machine learning model for predicting the normalized average traffic speed and flow of the subregions of multiple large-scale cities using historical data points. The model is supposed to be generic, in such a way that it can be applied to new cities. By considering spatiotemporal feature learning and modeling efficiency, we explore 3DResNet and Sparse-UNet approaches for the tasks in this competition. The 3DResNet-based models use 3D convolution to learn the spatiotemporal features and apply sequential convolutional layers to enhance the temporal relationship of the outputs. The Sparse-UNet model uses sparse convolutions as the backbone for spatiotemporal feature learning. Since the latter algorithm mainly focuses on the non-zero data points of the inputs, it dramatically reduces the computation time while maintaining competitive accuracy. Our results show that both of the proposed models achieve much better performance than the baseline algorithms. The codes and pretrained models are available at https://github.com/resuly/Traffic4Cast-2021.

Face | Crowd Counting (2 papers)

【1】 Clicking Matters: Towards Interactive Human Parsing
Link: https://arxiv.org/abs/2111.06162

Authors: Yutong Gao, Liqian Liang, Congyan Lang, Songhe Feng, Yidong Li, Yunchao Wei
Note: Keywords: human parsing, interactive segmentation, semantic segmentation
Abstract: In this work, we focus on Interactive Human Parsing (IHP), which aims to segment a human image into multiple human body parts with guidance from users' interactions. This new task inherits the class-aware property of human parsing, which cannot be well solved by traditional interactive image segmentation approaches that are generally class-agnostic. To tackle this new task, we first exploit user clicks to identify different human parts in the given image. These clicks are subsequently transformed into semantic-aware localization maps, which are concatenated with the RGB image to form the input of the segmentation network and generate the initial parsing result. To enable the network to better perceive the user's purpose during the correction process, we investigate several principal ways for the refinement, and reveal that random-sampling-based click augmentation is the best way for promoting the correction effectiveness. Furthermore, we also propose a semantic-perceiving loss (SP-loss) to augment the training, which can effectively exploit the semantic relationships of clicks for better optimization. To the best of our knowledge, this work is the first attempt to tackle the human parsing task under the interactive setting. Our IHP solution achieves 85% mIoU on the benchmark LIP, 80% mIoU on PASCAL-Person-Part and CIHP, and 75% mIoU on Helen, with only 1.95, 3.02, 2.84 and 1.09 clicks per class respectively. These results demonstrate that high-quality human parsing masks can be acquired with only a little human effort. We hope this work can motivate more researchers to develop data-efficient solutions to IHP in the future.
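
A hedged sketch of the "clicks to semantic-aware localization maps concatenated with the RGB image" step described above; the Gaussian stamping, the channel-per-part layout and the sigma are assumptions for illustration, not the paper's exact encoding:

```python
import numpy as np

def clicks_to_localization_maps(clicks, hw, num_parts, sigma=10.0):
    """Toy conversion of user clicks into per-part localization maps: each click
    stamps a Gaussian at its position in the channel of the clicked part."""
    H, W = hw
    maps = np.zeros((num_parts, H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for (y, x, part) in clicks:
        g = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
        maps[part] = np.maximum(maps[part], g)
    return maps

rgb = np.random.rand(3, 256, 192).astype(np.float32)              # stand-in image
loc = clicks_to_localization_maps([(60, 90, 0), (200, 100, 3)], (256, 192), num_parts=6)
net_input = np.concatenate([rgb, loc], axis=0)                    # (3 + num_parts, H, W)
print(net_input.shape)
```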

【2】 Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis
Link: https://arxiv.org/abs/2111.05916

Authors: Tuanfeng Y. Wang, Duygu Ceylan, Krishna Kumar Singh, Niloy J. Mitra
Affiliation: Adobe Research; University College London
Abstract: Synthesizing dynamic appearances of humans in motion plays a central role in applications such as AR/VR and video editing. While many recent methods have been proposed to tackle this problem, handling loose garments with complex textures and highly dynamic motion still remains challenging. In this paper, we propose a video-based appearance synthesis method that tackles such challenges and demonstrates high-quality results for in-the-wild videos that have not been shown before. Specifically, we adapt a StyleGAN-based architecture to the task of person-specific video-based motion retargeting. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes, as well as to regularize the single-frame-based pose estimates to improve temporal coherency. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.

Distillation | Knowledge Extraction (1 paper)

【1】 Keys to Accurate Feature Extraction Using Residual Spiking Neural Networks
Link: https://arxiv.org/abs/2111.05955

Authors: Alex Vicente-Sola, Davide L. Manna, Paul Kirkland, Gaetano Di Caterina, Trevor Bihl
Affiliation: University of Strathclyde
Note: 13 pages, 5 figures, 14 tables
Abstract: Spiking neural networks (SNNs) have become an interesting alternative to conventional artificial neural networks (ANNs) thanks to their temporal processing capabilities and their low-SWaP (Size, Weight, and Power) and energy-efficient implementations in neuromorphic hardware. However, the challenges involved in training SNNs have limited their performance in terms of accuracy and thus their applications. Improving learning algorithms and neural architectures for more accurate feature extraction is therefore one of the current priorities in SNN research. In this paper we present a study of the key components of modern spiking architectures. We empirically compare different techniques on image classification datasets taken from the best performing networks. We design a spiking version of the successful residual network (ResNet) architecture and test different components and training strategies on it. Our results provide a state-of-the-art guide to SNN design, which allows one to make informed choices when trying to build the optimal visual feature extractor. Finally, our network outperforms previous SNN architectures on the CIFAR-10 (94.1%) and CIFAR-100 (74.5%) datasets and matches the state of the art on DVS-CIFAR10 (71.3%), with fewer parameters than the previous state of the art and without the need for ANN-SNN conversion. Code is available at https://github.com/VicenteAlex/Spiking_ResNet.

Other: Neural Networks | Deep Learning | Models | Modeling (9 papers)

【1】 Full-Body Visual Self-Modeling of Robot Morphologies
Link: https://arxiv.org/abs/2111.06389

Authors: Boyuan Chen, Robert Kwiatkowski, Carl Vondrick, Hod Lipson
Affiliation: Columbia University
Note: Project website: this https URL
Abstract: Internal computational models of physical bodies are fundamental to the ability of robots and animals alike to plan and control their actions. These "self-models" allow robots to consider outcomes of multiple possible future actions, without trying them out in physical reality. Recent progress in fully data-driven self-modeling has enabled machines to learn their own forward kinematics directly from task-agnostic interaction data. However, forward-kinematics models can only predict limited aspects of the morphology, such as the position of end effectors or the velocity of joints and masses. A key challenge is to model the entire morphology and kinematics without prior knowledge of what aspects of the morphology will be relevant to future tasks. Here, we propose that instead of directly modeling forward kinematics, a more useful form of self-modeling is one that could answer space occupancy queries, conditioned on the robot's state. Such query-driven self-models are continuous in the spatial domain, memory efficient, fully differentiable and kinematics-aware. In physical experiments, we demonstrate how a visual self-model is accurate to about one percent of the workspace, enabling the robot to perform various motion planning and control tasks. Visual self-modeling can also allow the robot to detect, localize and recover from real-world damage, leading to improved machine resiliency. Our project website is at: https://robot-morphology.cs.columbia.edu/

【2】 Learning Signal-Agnostic Manifolds of Neural Fields 标题:学习信号不可知的神经场流形 链接:https://arxiv.org/abs/2111.06387

作者:Yilun Du,Katherine M. Collins,Joshua B. Tenenbaum,Vincent Sitzmann 机构:Katherine Collins, MIT CSAIL, MIT BCS, MIT CBMM 备注:NeurIPS 2021, additional results and code at this https URL 摘要:深度神经网络已被广泛用于跨图像、形状和音频信号等模态学习数据集的潜在结构。然而，现有的模型通常依赖于模态，需要定制的体系结构和目标来处理不同类别的信号。我们利用神经场以模态独立的方式捕获图像、形状、音频和跨模态视听域中的底层结构。我们将该任务表述为流形学习，目标是推断数据所在的低维局部线性子空间。通过强制满足流形覆盖、局部线性和局部等距，我们的模型（称为GEM）学会捕获跨模态数据集的底层结构。然后，我们可以沿着流形的线性区域移动，以获得样本之间感知上一致的插值，并可以进一步使用GEM恢复流形上的点，不仅获得输入图像的多样化补全结果，还获得音频或图像信号的跨模态幻觉。最后，我们展示了通过遍历GEM的底层流形，我们可以在信号域中生成新样本。有关代码和其他结果，请访问 https://yilundu.github.io/gem/ 。 摘要:Deep neural networks have been used widely to learn the latent structure of datasets, across modalities such as images, shapes, and audio signals. However, existing models are generally modality-dependent, requiring custom architectures and objectives to process different classes of signals. We leverage neural fields to capture the underlying structure in image, shape, audio and cross-modal audiovisual domains in a modality-independent manner. We cast our task as one of learning a manifold, where we aim to infer a low-dimensional, locally linear subspace in which our data resides. By enforcing coverage of the manifold, local linearity, and local isometry, our model -- dubbed GEM -- learns to capture the underlying structure of datasets across modalities. We can then travel along linear regions of our manifold to obtain perceptually consistent interpolations between samples, and can further use GEM to recover points on our manifold and glean not only diverse completions of input images, but cross-modal hallucinations of audio or image signals. Finally, we show that by walking across the underlying manifold of GEM, we may generate new samples in our signal domains. Code and additional results are available at https://yilundu.github.io/gem/.

【3】 Discovering and Explaining the Representation Bottleneck of DNNs 标题:发现并解释DNNs的表示瓶颈 链接:https://arxiv.org/abs/2111.06236

作者:Huiqi Deng,Qihan Ren,Xu Chen,Hao Zhang,Jie Ren,Quanshi Zhang 机构:Shanghai Jiao Tong University 摘要:本文从深度神经网络（DNN）所编码的输入变量间交互的复杂性出发，探讨了DNN特征表示的瓶颈。为此，我们关注输入变量之间的多阶交互，其中阶表示交互的复杂度。我们发现DNN更可能编码过于简单的交互和过于复杂的交互，但通常无法学习中等复杂度的交互。对于不同的任务，不同的DNN普遍存在这种现象。这一现象表明DNN和人类之间存在认知鸿沟，我们称之为表征瓶颈。我们从理论上证明了表征瓶颈的根本原因。此外，我们提出了一个损失函数来鼓励/惩罚特定复杂度交互的学习，并分析了不同复杂度交互的表征能力。 摘要:This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
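关于“多阶交互”，下面给出相关文献中常见的一种定义作为示意（符号与细节未必与本文完全一致，仅帮助理解“阶”如何刻画交互复杂度）：

% 设 N 为全部输入变量集合，f(\cdot) 为模型在给定变量子集下的输出，
% 则变量 i、j 之间的 m 阶交互可定义为（相关文献中的常见形式，仅作示意）：
I^{(m)}(i,j) \;=\; \mathbb{E}_{\,S \subseteq N\setminus\{i,j\},\ |S| = m}
\Big[\, f\big(S\cup\{i,j\}\big) - f\big(S\cup\{i\}\big) - f\big(S\cup\{j\}\big) + f\big(S\big) \,\Big]
% 其中阶 m（上下文 S 的大小）刻画了交互的复杂度：m 小对应简单交互，m 大对应复杂交互。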

【4】 Towards Axiomatic, Hierarchical, and Symbolic Explanation for Deep Models 标题:深层模型的公理化、层次化和符号化解释 链接:https://arxiv.org/abs/2111.06206

作者:Jie Ren,Mingjie Li,Qihan Ren,Huiqi Deng,Quanshi Zhang 机构:Shanghai Jiao Tong University 摘要:本文提出了一种层次化的符号与或图（AOG），用于客观地解释训练良好的深度模型为推理所编码的内部逻辑。我们首先在博弈论框架下定义了解释器模型的客观性，并给出了深度模型所编码的与或（And-Or）逻辑的严格表示。AOG解释器的客观性和可信度在理论上得到了保证，在实验上也得到了验证。此外，我们还提出了一些技巧来提高解释的简洁性。 摘要:This paper proposes a hierarchical and symbolic And-Or graph (AOG) to objectively explain the internal logic encoded by a well-trained deep model for inference. We first define the objectiveness of an explainer model in game theory, and we develop a rigorous representation of the And-Or logic encoded by the deep model. The objectiveness and trustworthiness of the AOG explainer are both theoretically guaranteed and experimentally verified. Furthermore, we propose several techniques to boost the conciseness of the explanation.

【5】 Fine-Grained Image Analysis with Deep Learning: A Survey 标题:基于深度学习的细粒度图像分析研究综述 链接:https://arxiv.org/abs/2111.06119

作者:Xiu-Shen Wei,Yi-Zhe Song,Oisin Mac Aodha,Jianxin Wu,Yuxin Peng,Jinhui Tang,Jian Yang,Serge Belongie 机构:Song is with University of Surrey, Mac Aodha is with the University of Edinburgh, Nanjing University, Peng is with Peking University 备注:Accepted by IEEE TPAMI 摘要:细粒度图像分析（FGIA）是计算机视觉和模式识别中一个长期存在的基本问题，是各种实际应用的基础。FGIA的任务是分析来自从属类别的视觉对象，例如鸟类物种或汽车模型。细粒度图像分析固有的小类间和大类内变化使其成为一个具有挑战性的问题。利用深度学习的进步，近年来我们见证了以深度学习为动力的FGIA的显著进步。在本文中，我们对这些进展进行了系统的综述，试图通过整合两个基本的细粒度研究领域——细粒度图像识别和细粒度图像检索，重新定义和拓宽FGIA领域。此外，我们还回顾了FGIA的其他关键问题，如公开的基准数据集和相关的领域特定应用程序。最后，我们强调了一些需要社区进一步探索的研究方向和开放性问题。 摘要:Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it a challenging problem. Capitalizing on advances in deep learning, in recent years we have witnessed remarkable progress in deep learning powered FGIA. In this paper we present a systematic survey of these advances, where we attempt to re-define and broaden the field of FGIA by consolidating two fundamental fine-grained research areas -- fine-grained image recognition and fine-grained image retrieval. In addition, we also review other key issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. We conclude by highlighting several research directions and open problems which need further exploration from the community.

【6】 FINO: Flow-based Joint Image and Noise Model 标题:FINO:基于流的图像和噪声联合模型 链接:https://arxiv.org/abs/2111.06031

作者:Lanqing Guo,Siyu Huang,Haosen Liu,Bihan Wen 机构:School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, SmartMore Corporation, China 摘要:图像恢复中的一个基本挑战是去噪，其目标是从噪声测量值中估计干净的图像。为了解决这样一个不适定反问题，现有的去噪方法主要集中在利用有效的自然图像先验知识。尽管噪声模型可以为去噪算法提供补充信息，但噪声模型的使用和分析往往被忽略。在本文中，我们提出了一种新的基于流的联合图像和噪声模型（FINO），该模型在潜在空间中将图像和噪声清晰地解耦，并通过一系列可逆变换进行无损重建。我们进一步提出了一种变量交换策略来对齐图像中的结构信息，并提出了一种基于空间最小化相关信息的噪声相关矩阵来约束噪声。实验结果表明，FINO具有去除合成加性高斯白噪声（AWGN）和真实噪声的能力。此外，FINO在去除空间变化噪声以及估计不准确的噪声方面的泛化能力，大幅超过了流行的和最先进的方法。 摘要:One of the fundamental challenges in image restoration is denoising, where the objective is to estimate the clean image from its noisy measurements. To tackle such an ill-posed inverse problem, the existing denoising approaches generally focus on exploiting effective natural image priors. The utilization and analysis of the noise model are often ignored, although the noise model can provide complementary information to the denoising algorithms. In this paper, we propose a novel Flow-based joint Image and NOise model (FINO) that distinctly decouples the image and noise in the latent space and losslessly reconstructs them via a series of invertible transformations. We further present a variable swapping strategy to align structural information in images and a noise correlation matrix to constrain the noise based on spatially minimized correlation information. Experimental results demonstrate FINO's capacity to remove both synthetic additive white Gaussian noise (AWGN) and real noise. Furthermore, the generalization of FINO to the removal of spatially variant noise and noise with inaccurate estimation surpasses that of the popular and state-of-the-art methods by large margins.
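为说明“通过一系列可逆变换在潜空间中解耦图像与噪声”的基本思路，下面给出一个假设性的最小示意：一个加性耦合层（可逆变换的常见构件），以及把潜变量拆分为图像分量与噪声分量的做法。代码为Python（PyTorch），结构与维度均为示意性假设，与FINO的具体网络设计和解耦约束无关。

import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    # 简化的加性耦合层：常见的可逆变换构件，仅作示意
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, dim // 2))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.net(x1)], dim=-1)   # 可精确求逆

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.net(y1)], dim=-1)

flow = AdditiveCoupling(dim=64)
noisy = torch.randn(8, 64)                    # 示意：展平后的带噪观测
z = flow(noisy)                               # 映射到潜空间
z_image, z_noise = z.chunk(2, dim=-1)         # 示意性地拆成图像/噪声分量
denoised = flow.inverse(torch.cat([z_image, torch.zeros_like(z_noise)], dim=-1))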

【7】 Robust Learning via Ensemble Density Propagation in Deep Neural Networks 标题:基于集成密度传播的深度神经网络鲁棒学习 链接:https://arxiv.org/abs/2111.05953

作者:Giuseppina Carannante,Dimah Dera,Ghulam Rasool,Nidhal C. Bouaynaya,Lyudmila Mihaylova 机构:⋆ Rowan University, Department of Electrical and Computer Engineering, Glassboro, NJ, † University of Sheffield, Department of Automatic Control and Systems Engineering, United Kingdom 备注:submitted to 2020 IEEE International Workshop on Machine Learning for Signal Processing 摘要:对于深度神经网络(DNN),在不确定、嘈杂或敌对环境中学习是一项具有挑战性的任务。我们提出了一种新的基于贝叶斯估计和变分推理的鲁棒学习方法。我们提出了密度通过DNN层传播的问题,并使用系综密度传播(EnDP)方案进行了求解。EnDP方法允许我们将变分概率分布的矩传播到贝叶斯DNN的各个层,从而能够估计模型输出处预测分布的平均值和协方差。我们使用MNIST和CIFAR-10数据集进行的实验表明,经过训练的模型对随机噪声和敌对攻击的鲁棒性有显著提高。 摘要:Learning in uncertain, noisy, or adversarial environments is a challenging task for deep neural networks (DNNs). We propose a new theoretically grounded and efficient approach for robust learning that builds upon Bayesian estimation and Variational Inference. We formulate the problem of density propagation through layers of a DNN and solve it using an Ensemble Density Propagation (EnDP) scheme. The EnDP approach allows us to propagate moments of the variational probability distribution across the layers of a Bayesian DNN, enabling the estimation of the mean and covariance of the predictive distribution at the output of the model. Our experiments using MNIST and CIFAR-10 datasets show a significant improvement in the robustness of the trained models to random noise and adversarial attacks.
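EnDP的核心是把变分分布的矩逐层传播。下面给出一个最小示意：一阶矩（均值）与二阶矩（协方差）经过线性层的标准传播公式，代码为Python（NumPy）。这只是线性层的精确结果；原文的EnDP还需处理非线性层并借助集成（ensemble）近似，此处不涉及，变量名均为示意。

import numpy as np

def propagate_moments_linear(mu_in, cov_in, W, b):
    # 线性层 y = W x + b 下均值与协方差的精确传播（标准结果）
    mu_out = W @ mu_in + b
    cov_out = W @ cov_in @ W.T
    return mu_out, cov_out

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = np.zeros(4)
mu_out, cov_out = propagate_moments_linear(np.zeros(3), np.eye(3), W, b)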

【8】 Self-Compression in Bayesian Neural Networks 标题:贝叶斯神经网络中的自压缩 链接:https://arxiv.org/abs/2111.05950

作者:Giuseppina Carannante,Dimah Dera,Ghulam Rasool,Nidhal C. Bouaynaya 机构:Rowan University, Department of Electrical and Computer Engineering, Glassboro, NJ 备注:submitted to 2020 IEEE International Workshop on Machine Learning for Signal Processing 摘要:机器学习模型已经在各种任务上实现了人类水平的性能。这一成功的代价是高昂的计算和存储开销,这使得机器学习算法难以部署在边缘设备上。通常,为了提高性能,必须部分牺牲准确性,以减少内存使用和能耗。现有方法通过降低参数精度或消除冗余参数来压缩网络。在本文中,我们通过贝叶斯框架对网络压缩提出了新的见解。我们表明,贝叶斯神经网络自动发现模型参数中的冗余,从而实现自压缩,这与不确定性通过网络层的传播有关。我们的实验结果表明,通过删除网络本身识别的参数,可以成功地压缩网络结构,同时保持相同的精度水平。 摘要:Machine learning models have achieved human-level performance on various tasks. This success comes at a high cost of computation and storage overhead, which makes machine learning algorithms difficult to deploy on edge devices. Typically, one has to partially sacrifice accuracy in favor of an increased performance quantified in terms of reduced memory usage and energy consumption. Current methods compress the networks by reducing the precision of the parameters or by eliminating redundant ones. In this paper, we propose a new insight into network compression through the Bayesian framework. We show that Bayesian neural networks automatically discover redundancy in model parameters, thus enabling self-compression, which is linked to the propagation of uncertainty through the layers of the network. Our experimental results show that the network architecture can be successfully compressed by deleting parameters identified by the network itself while retaining the same level of accuracy.
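下面给出一个假设性的最小示意，说明“利用贝叶斯网络自身识别的冗余进行自压缩”的一种常见做法：按权重后验的信噪比 |mu|/sigma 剪除低信噪比参数。判据与阈值均为示意性假设，未必与原文的具体准则一致。代码为Python（NumPy）。

import numpy as np

def prune_by_snr(mu, sigma, snr_threshold=1.0):
    # 按后验信噪比 |mu|/sigma 剪除“冗余”权重（常见判据之一，仅为示意）
    snr = np.abs(mu) / (sigma + 1e-12)
    mask = snr >= snr_threshold
    return mu * mask, mask

rng = np.random.default_rng(0)
mu = rng.standard_normal(1000)                 # 权重后验均值（示意数据）
sigma = np.abs(rng.standard_normal(1000))      # 权重后验标准差（示意数据）
pruned_mu, keep = prune_by_snr(mu, sigma)
print("保留参数比例:", keep.mean())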

【9】 On the Equivalence between Neural Network and Support Vector Machine 标题:论神经网络与支持向量机的等价性 链接:https://arxiv.org/abs/2111.06063

作者:Yilan Chen,Wei Huang,Lam M. Nguyen,Tsui-Wei Weng 机构:Computer Science and Engineering, University of California San Diego, La Jolla, CA, Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia, IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 备注:35th Conference on Neural Information Processing Systems (NeurIPS 2021) 摘要:最近的研究表明，通过梯度下降训练的无限宽神经网络（NN）的动力学可以用神经切线核（NTK）来刻画。在平方损失下，以无限小学习率通过梯度下降训练的无限宽NN等价于以NTK为核的核回归。然而，目前已知的等价性仅适用于岭回归，而NN与其他核机器（KM），例如支持向量机（SVM）之间的等价性仍然未知。因此，在这项工作中，我们建立了NN与SVM之间的等价性，具体而言，即通过软间隔损失训练的无限宽NN与以NTK为核、通过次梯度下降训练的标准软间隔SVM之间的等价性。我们的主要理论结果包括：建立了NN与一大类$\ell_2$正则化KM之间的等价性并给出有限宽度下的界，这是以前的工作无法处理的；同时表明由此类正则化损失函数训练的每个有限宽度NN都近似于一个KM。此外，我们证明了我们的理论可支持三个实际应用：(i) 通过对应的KM得到NN的非空洞（non-vacuous）泛化界；(ii) 为无限宽NN给出非平凡的鲁棒性证书（而现有的鲁棒性验证方法只能给出空洞的界）；(iii) 得到比以往核回归本质上更鲁棒的无限宽NN。我们的实验代码可在 https://github.com/leslie-CH/equiv-nn-svm 获取。 摘要:Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by Neural Tangent Kernel (NTK) \citep{jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK \citep{arora2019exact}. However, the equivalence is only known for ridge regression currently \citep{arora2019harnessing}, while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NN and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate our theory can enable three practical applications, including (i) \textit{non-vacuous} generalization bound of NN via the corresponding KM; (ii) \textit{non-trivial} robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments are available at \url{https://github.com/leslie-CH/equiv-nn-svm}.
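作为背景，下面用LaTeX给出两个标准形式作为示意（符号为常见写法，细节以原文为准）：平方损失下无限宽NN对应的（无岭）NTK核回归解，以及以NTK为核的标准软间隔SVM原始问题。

% 无岭情形下的 NTK 核回归（平方损失、无限宽 NN 的已知等价形式，见文中引用）：
f(x) \;=\; k_{\mathrm{NTK}}(x, X)\, K_{\mathrm{NTK}}^{-1}\, y
% 以 NTK 为核的标准软间隔 SVM（原始问题，省略偏置项）：
\min_{w,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i}\xi_{i}
\quad \text{s.t.}\quad y_{i}\, w^{\top}\phi(x_{i}) \ge 1 - \xi_{i},\ \ \xi_{i}\ge 0,
% 其中核函数 k_{\mathrm{NTK}}(x, x') = \phi(x)^{\top}\phi(x')。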

其他(6篇)

【1】 Masked Autoencoders Are Scalable Vision Learners 标题:蒙版自动编码器是可扩展视觉学习器 链接:https://arxiv.org/abs/2111.06377

作者:Kaiming He,Xinlei Chen,Saining Xie,Yanghao Li,Piotr Dollár,Ross Girshick 机构:Piotr Dollár, ∗equal technical contribution, †project lead, Facebook AI Research (FAIR) 备注:Tech report 摘要:本文证明了蒙版自动编码器（MAE）是一种可扩展的计算机视觉自监督学习器。我们的MAE方法很简单：随机遮蔽输入图像中的部分图像块并重建缺失的像素。它基于两个核心设计。首先，我们开发了一个非对称编码器-解码器体系结构，其中编码器仅在可见图像块子集上运行（不处理掩码标记），同时还有一个轻量级解码器，用于从潜在表示和掩码标记重建原始图像。其次，我们发现遮蔽高比例的输入图像（例如75%）会产生一项不平凡且有意义的自监督任务。将这两种设计结合起来，使我们能够高效地训练大型模型：我们加快训练速度（3倍或更多）并提高准确性。我们的可扩展方法允许学习具有良好泛化能力的高容量模型：例如，在仅使用ImageNet-1K数据的方法中，vanilla ViT-Huge模型取得了最高精度（87.8%）。下游任务中的迁移性能优于有监督的预训练，并表现出良好的扩展（scaling）行为。 摘要:This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
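下面给出MAE中“随机遮蔽高比例图像块、仅将可见块送入编码器”这一步的最小示意。代码为Python（PyTorch），函数名与张量形状均为示意性假设，并非官方实现。

import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D)，每幅图像 N 个已展平的图像块；按比例随机遮蔽
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # 每个块一个随机分数
    ids_shuffle = noise.argsort(dim=1)             # 随机打乱块的顺序
    ids_keep = ids_shuffle[:, :num_keep]           # 保留的可见块索引
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = 可见，1 = 被遮蔽
    return visible, mask, ids_keep

x = torch.randn(2, 196, 768)                       # 假设 14x14 个块、维度 768
visible, mask, ids_keep = random_masking(x)
# 编码器仅处理 visible；轻量解码器再结合掩码标记重建全部像素（见原文）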

【2】 Indian Licence Plate Dataset in the wild 标题:野外印度车牌数据集 链接:https://arxiv.org/abs/2111.06054

作者:Sanchit Tanwar,Ayush Tiwari,Ritesh Chowdhry 备注:5 pages, 4 figures, 3 tables 摘要:印度车牌检测是一个尚未在开源层面得到充分探索的问题。虽然有专有的解决方案可用，但没有可供进行实验和测试不同方法的大型开源数据集。大多数可用的大型数据集面向中国、巴西等国家，但在这些数据集上训练的模型在印度车牌上表现不佳，因为所用的字体样式和车牌设计因国家而异。本文介绍了一个印度车牌数据集，该数据集包含16192张图像和21683个车牌，每个车牌及其中的每个字符均以4个点进行标注。我们提出了一个基准模型，该模型使用语义分割来解决车牌检测问题。我们提出了一种两阶段的方法：第一阶段定位车牌，第二阶段读取裁剪后的车牌图像中的文本。我们测试了基准目标检测和语义分割模型，第二阶段使用基于LPRNet的OCR。 摘要:Indian Licence Plate Detection is a problem that has not been explored much at an open-source level. There are proprietary solutions available for it, but there is no big open-source dataset that can be used to perform experiments and test different approaches. Most of the large datasets available are for countries like China, Brazil, but the model trained on these datasets does not perform well on Indian plates because the font styles and plate designs used vary significantly from country to country. This paper introduces an Indian license plate dataset with 16192 images and 21683 plates annotated with 4 points for each plate and each character in the corresponding plate. We present a benchmark model that uses semantic segmentation to solve number plate detection. We propose a two-stage approach in which the first stage is for localizing the plate, and the second stage is to read the text in the cropped plate image. We tested benchmark object detection and semantic segmentation models; for the second stage, we used an LPRNet-based OCR.

【3】 Hybrid Saturation Restoration for LDR Images of HDR Scenes 标题:HDR场景LDR图像的混合饱和恢复 链接:https://arxiv.org/abs/2111.06038

作者:Chaobing Zheng,Zhengguo Li,Shiqian Wu 机构:Chaobing Zheng and Shiqian Wu are with the Institute of Robotics and Intelligent Systems, School of Information Science and Engineering, Wuhan University of Science and Technology 备注:arXiv admin note: text overlap with arXiv:2007.02042 摘要:从高动态范围（HDR）场景捕获的低动态范围（LDR）图像中存在阴影和高光区域。恢复LDR图像的饱和区域是一个不适定问题。本文通过融合基于模型和数据驱动的方法，对LDR图像的饱和区域进行恢复。在这种神经增强框架中，首先通过基于模型的方法从底层LDR图像生成两幅合成LDR图像：一幅比输入图像亮，用于恢复阴影区域；另一幅比输入图像暗，用于恢复高光区域。然后通过一种新的曝光感知饱和恢复网络（EASRN）对两幅合成图像进行细化。最后，通过HDR合成算法或多尺度曝光融合算法将两幅合成图像和输入图像组合在一起。该算法可以嵌入任何智能手机或数码相机中，生成信息丰富的LDR图像。 摘要:There are shadow and highlight regions in a low dynamic range (LDR) image which is captured from a high dynamic range (HDR) scene. It is an ill-posed problem to restore the saturated regions of the LDR image. In this paper, the saturated regions of the LDR image are restored by fusing model-based and data-driven approaches. With such a neural augmentation, two synthetic LDR images are first generated from the underlying LDR image via the model-based approach. One is brighter than the input image to restore the shadow regions and the other is darker than the input image to restore the high-light regions. Both synthetic images are then refined via a novel exposedness aware saturation restoration network (EASRN). Finally, the two synthetic images and the input image are combined together via an HDR synthesis algorithm or a multi-scale exposure fusion algorithm. The proposed algorithm can be embedded in any smart phones or digital cameras to produce an information-enriched LDR image.
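下面给出“基于模型的方法生成一亮一暗两幅合成LDR图像”这一步骤的最小示意：这里用简单的曝光增益近似，增益值为示意性假设；原文的具体合成方式以及后续EASRN细化与融合请参见论文。代码为Python（NumPy）。

import numpy as np

def synthesize_exposures(ldr, gain_bright=2.0, gain_dark=0.5):
    # ldr: 归一化到 [0, 1] 的输入图像；用简单曝光增益生成一亮一暗两幅合成图
    brighter = np.clip(ldr * gain_bright, 0.0, 1.0)   # 用于恢复阴影区域
    darker = np.clip(ldr * gain_dark, 0.0, 1.0)       # 用于恢复高光区域
    return brighter, darker

img = np.random.rand(64, 64, 3)
bright, dark = synthesize_exposures(img)
# 两幅合成图再经网络细化，并与输入图像一起融合（见原文）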

【4】 A soft thumb-sized vision-based sensor with accurate all-round force perception 标题:一种具有准确全方位力感知的软拇指大小视觉传感器 链接:https://arxiv.org/abs/2111.05934

作者:Huanbo Sun,Katherine J. Kuchenbecker,Georg Martius 机构:Autonomous Learning Group, Max Planck Institute for Intelligent Systems, Tübingen, Germany., Haptic Intelligence Department, Max Planck Institute for Intelligent Systems, Stuttgart, Germany. 备注:1 table, 5 figures, 24 pages for the main manuscript. 5 tables, 12 figures, 27 pages for the supplementary material. 8 supplementary videos 摘要:基于视觉的触觉传感器由于价格合理的高分辨率摄像机和成功的计算机视觉技术,已成为机器人触摸的一种很有前途的方法。然而,它们的物理设计和提供的信息还不能满足实际应用的要求。我们提出了一种健壮、柔软、低成本、基于视觉、拇指大小的三维触觉传感器Insight:它在整个锥形传感表面上持续提供方向力分布图。该传感器围绕内置单目摄像头构建,只有一层弹性体模压在刚性框架上,以保证灵敏度、鲁棒性和软接触。此外,Insight是第一个使用准直器将光度立体光和结构光结合起来检测其易于更换的柔性外壳的三维变形的系统。力信息由深度神经网络推断,该网络将图像映射到三维接触力(法向和剪切)的空间分布。Insight的总体空间分辨率为0.4 mm,力大小精度约为0.03 N,力方向精度约为5度,范围为0.03-2 N,适用于具有不同接触面积的多个不同触点。所提出的硬件和软件设计概念可应用于各种机器人部件。 摘要:Vision-based haptic sensors have emerged as a promising approach to robotic touch due to affordable high-resolution cameras and successful computer-vision techniques. However, their physical design and the information they provide do not yet meet the requirements of real applications. We present a robust, soft, low-cost, vision-based, thumb-sized 3D haptic sensor named Insight: it continually provides a directional force-distribution map over its entire conical sensing surface. Constructed around an internal monocular camera, the sensor has only a single layer of elastomer over-molded on a stiff frame to guarantee sensitivity, robustness, and soft contact. Furthermore, Insight is the first system to combine photometric stereo and structured light using a collimator to detect the 3D deformation of its easily replaceable flexible outer shell. The force information is inferred by a deep neural network that maps images to the spatial distribution of 3D contact force (normal and shear). Insight has an overall spatial resolution of 0.4 mm, force magnitude accuracy around 0.03 N, and force direction accuracy around 5 degrees over a range of 0.03--2 N for numerous distinct contacts with varying contact area. The presented hardware and software design concepts can be transferred to a wide variety of robot parts.
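下面给出一个假设性的最小示意，对应“由深度神经网络把相机图像映射为三维接触力的空间分布”这一思路：一个小型CNN将图像回归为每个空间位置的三维力分量（法向加两个剪切方向）。网络结构与分辨率均为示意性假设，并非论文原模型。代码为Python（PyTorch）。

import torch
import torch.nn as nn

class ForceMapNet(nn.Module):
    # 假设性的小型 CNN：相机图像 -> 每个空间位置的三维接触力（法向 + 两个剪切）
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 3, 1)

    def forward(self, image):
        return self.head(self.encoder(image))

net = ForceMapNet()
force_map = net(torch.randn(1, 3, 128, 128))   # 输出形状 (1, 3, 32, 32) 的力分布图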

【5】 Related Work on Image Quality Assessment 标题:图像质量评价的相关工作 链接:https://arxiv.org/abs/2111.06291

作者:Dongxu Wang 机构:Southwest Jiaotong University, Chengdu, China 备注:5 pages 摘要:由于在视觉信号采集、压缩、传输和显示的各个阶段都会出现质量下降，因此图像质量评价（IQA）在基于图像的应用中起着至关重要的作用。根据参考图像是否完整和可用，图像质量评价可分为三类：全参考（FR）、半参考（RR）和无参考（NR）。本文将回顾最新的图像质量评价算法。 摘要:Due to the existence of quality degradations introduced in various stages of visual signal acquisition, compression, transmission and display, image quality assessment (IQA) plays a vital role in image-based applications. According to whether the reference image is complete and available, image quality evaluation can be divided into three categories: Full-Reference (FR), Reduced-Reference (RR), and Non-Reference (NR). This article will review the state-of-the-art image quality assessment algorithms.

【6】 CodEx: A Modular Framework for Joint Temporal De-blurring and Tomographic Reconstruction 标题:CODEX:一种联合时域去模糊和层析重建的模块化框架 链接:https://arxiv.org/abs/2111.06069

作者:Soumendu Majee,Selin Aslan,Charles A. Bouman,Doga Gursoy 机构：Doğa Gürsoy, Member, IEEE 摘要:在许多计算机断层扫描（CT）成像应用中，快速收集随时间移动或变化的对象的数据非常重要。层析成像采集通常被假定为步进-拍摄（step-and-shoot）方式，即对象旋转到每个所需角度后拍摄一个视图。但是，步进-拍摄采集速度较慢，且可能浪费光子，因此在实践中通常采用飞行扫描（fly-scanning），即在采集数据的同时让对象连续旋转。然而，这可能导致运动模糊的视图，进而在重建中产生严重的运动伪影。在本文中，我们介绍了CodEx，一个用于联合去模糊和层析重建的模块化框架，它可以有效地反演飞行扫描中引入的运动模糊。该方法是一种新的采集方法与一种新的非凸贝叶斯重建算法的协同组合。CodEx的工作原理是使用已知的二进制编码对采集过程进行编码，随后由重建算法将其反演。使用精心选择的二进制编码对测量值进行编码可以提高反演过程的精度。CodEx重建方法使用交替方向乘子法（ADMM）将反问题分解为迭代的去模糊和重建子问题，使重建切实可行。我们给出了模拟和实验数据的重建结果，以证明我们的方法的有效性。 摘要:In many computed tomography (CT) imaging applications, it is important to rapidly collect data from an object that is moving or changing with time. Tomographic acquisition is generally assumed to be step-and-shoot, where the object is rotated to each desired angle, and a view is taken. However, step-and-shoot acquisition is slow and can waste photons, so in practice fly-scanning is done where the object is continuously rotated while collecting data. However, this can result in motion-blurred views and consequently reconstructions with severe motion artifacts. In this paper, we introduce CodEx, a modular framework for joint de-blurring and tomographic reconstruction that can effectively invert the motion blur introduced in fly-scanning. The method is a synergistic combination of a novel acquisition method with a novel non-convex Bayesian reconstruction algorithm. CodEx works by encoding the acquisition with a known binary code that the reconstruction algorithm then inverts. Using a well chosen binary code to encode the measurements can improve the accuracy of the inversion process. The CodEx reconstruction method uses the alternating direction method of multipliers (ADMM) to split the inverse problem into iterative deblurring and reconstruction sub-problems, making reconstruction practical to implement. We present reconstruction results on both simulated and experimental data to demonstrate the effectiveness of our method.
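为说明“用ADMM把联合问题拆分为去模糊与重建两个子问题”的思路，下面给出一个示意性的LaTeX写法；前向模型与符号均为假设，仅表达“编码曝光 + 变量分裂”的一般形式，细节以原文为准。

% 符号为假设：y 为编码后的飞扫测量，C 为已知二进制编码算子，A 为层析投影算子，
% R(x) 为图像先验，\beta 为正则化权重。联合问题的一种变量分裂写法：
\min_{x,\,v}\ \tfrac{1}{2}\,\|\,y - A\,C\,v\,\|_{2}^{2} \;+\; \beta\, R(x)
\qquad \text{s.t.}\quad x = v
% ADMM 交替求解：v-步对应“编码去模糊/数据拟合”子问题，x-步对应带先验的重建子问题，
% 随后更新对偶变量，迭代直至收敛。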