zl程序教程


Statistics Academic Digest [11.11]

Posted: 2023-03-14 22:52:08

stat (Statistics): 29 papers in total.

【1】 A proposal to integrate Data Envelopment Analysis and Le Chatelier principle
Link: https://arxiv.org/abs/2111.05763

Authors: Filippo Elba
Affiliation: Italy
Abstract: The article aims to estimate the short- and long-run efficient production frontiers for firms operating in an industry. The Le Chatelier principle provides the theoretical framework, while the tool used to find the frontiers is the non-parametric Data Envelopment Analysis technique. The proposal tries to overcome the main limitations that other efforts in this context appear to have.
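
The paper's DEA machinery is not spelled out in the abstract, but the textbook input-oriented CCR model it builds on is one small linear program per decision-making unit (DMU). A minimal SciPy sketch (the data and function name are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def dea_ccr_efficiency(X, Y):
    """Input-oriented CCR DEA efficiency score for each DMU.

    X: (m inputs, n DMUs); Y: (s outputs, n DMUs).
    For DMU o: minimise theta subject to X @ lam <= theta * X[:, o],
    Y @ lam >= Y[:, o], lam >= 0.
    """
    m, n = X.shape
    s = Y.shape[0]
    scores = np.empty(n)
    for o in range(n):
        c = np.zeros(1 + n)
        c[0] = 1.0                        # variables: [theta, lam_1..lam_n]
        A_ub = np.zeros((m + s, 1 + n))
        b_ub = np.zeros(m + s)
        A_ub[:m, 0] = -X[:, o]            # X @ lam - theta * x_o <= 0
        A_ub[:m, 1:] = X
        A_ub[m:, 1:] = -Y                 # -Y @ lam <= -y_o
        b_ub[m:] = -Y[:, o]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (1 + n))
        scores[o] = res.x[0]
    return scores

# toy industry: the second DMU produces the same output from half the input
X = np.array([[2.0, 1.0]])   # one input, two DMUs
Y = np.array([[1.0, 1.0]])   # one output
eff = dea_ccr_efficiency(X, Y)
```

Here the dominated unit scores 0.5 and the efficient one 1.0; the paper's short-run versus long-run frontiers would correspond to solving such programs under different constraint sets.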

【2】 A Probabilistic Domain-knowledge Framework for Nosocomial Infection Risk Estimation of Communicable Viral Diseases in Healthcare Personnel: A Case Study for COVID-19
Link: https://arxiv.org/abs/2111.05761

Authors: Phat K. Huynh, Arveity R. Setty, Om P. Yadav, Trung Q. Le
Affiliation: Sanford, ND, USA
Note: 10 pages, 4 figures, Journal of Biomedical and Health Informatics
Abstract: Hospital-acquired infections of communicable viral diseases (CVDs) pose a tremendous challenge to healthcare workers globally. Healthcare personnel (HCP) face a persistent risk of hospital-acquired infections and, subsequently, higher rates of morbidity and mortality. We propose a domain-knowledge-driven infection risk model to quantify both the individual HCP risk and the population-level healthcare-facility risk. For individual-level risk estimation, a time-variant infection risk model is proposed to capture the transmission dynamics of CVDs. At the population level, the infection risk is estimated using a Bayesian network model constructed from three feature sets: individual-level factors, engineering control factors, and administrative control factors. Sensitivity analyses indicated that the uncertainty in the individual infection risk can be attributed to two variables: the number of close contacts and the viral transmission probability. Model validation was carried out for the transmission probability model, the individual-level risk model, and the population-level risk model using a COVID-19 case study. For the first, multivariate logistic regression was applied to cross-sectional data from the UK, with an AIC value of 7317.70 and a 10-fold cross-validation accuracy of 78.23%. For the second model, we collected laboratory-confirmed COVID-19 cases of HCP in different occupations. The occupation-specific risk evaluation suggested that the highest-risk occupations were registered nurses, medical assistants, and respiratory therapists, with estimated risks of 0.0189, 0.0188, and 0.0176, respectively. To validate the population-level risk model, the infection risk in Texas and California was estimated. The proposed model will significantly influence PPE allocation and safety plans for HCP.
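
The first validation step, multivariate logistic regression judged by 10-fold cross-validation accuracy, is easy to reproduce in miniature; the data below are simulated, not the UK dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                         # three synthetic risk factors
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 500)
y = (latent > 0).astype(int)                          # binary infection outcome

# mean accuracy over 10 stratified folds, analogous to the paper's 78.23%
acc = cross_val_score(LogisticRegression(), X, y, cv=10).mean()
```

With this signal-to-noise ratio the cross-validated accuracy lands in the mid-80% range; the point is only the mechanics of the validation, not the reported number.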

【3】 A K-function for inhomogeneous random measures with geometric features
Link: https://arxiv.org/abs/2111.05735

Authors: Anne Marie Svane, Hans Jacob Teglbjærg Stephensen, Rasmus Waagepetersen
Affiliations: Department of Mathematical Sciences, Aalborg University, Denmark; Department of Computer Science, University of Copenhagen, Denmark
Abstract: This paper introduces a $K$-function for assessing second-order properties of inhomogeneous random measures generated by marked point processes. The marks can be geometric objects such as fibers or sets of positive volume, and the presented $K$-function takes into account geometric features of the marks, such as tangent directions of fibers. The $K$-function requires an estimate of the inhomogeneous density function of the random measure. We introduce parametric estimates for the density function based on parametric models that represent large-scale features of the inhomogeneous random measure. The proposed methodology is applied to simulated fiber patterns as well as a three-dimensional data set of steel fibers in concrete.
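
Stripped of edge correction and geometric marks, an inhomogeneous $K$-function estimate is just a pair count weighted by the intensity at each point; a naive NumPy version of that core computation (this simplification is ours, not the paper's estimator):

```python
import numpy as np

def k_inhom(points, intensity, r, window_area=1.0):
    """Naive inhomogeneous K-function estimate (no edge correction).

    points: (n, 2) coordinates; intensity: length-n array with lambda(x_i)
    evaluated at each point; r: distance threshold.
    """
    pts = np.asarray(points, dtype=float)
    lam = np.asarray(intensity, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    close = (d <= r) & ~np.eye(len(pts), dtype=bool)   # exclude i == j
    weights = 1.0 / np.outer(lam, lam)                 # 1 / (lambda_i * lambda_j)
    return float((close * weights).sum() / window_area)

# three points in the unit square: only the first pair lies within r = 0.2
pts = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0]])
k = k_inhom(pts, intensity=[3.0, 3.0, 3.0], r=0.2)
```

With constant intensity 3 the estimate is 2/9: two ordered close pairs, each weighted by 1/9.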

【4】 Diversity of symptom phenotypes in SARS-CoV-2 community infections observed in multiple large datasets
Link: https://arxiv.org/abs/2111.05728

Authors: Martyn Fyles, Karina-Doris Vihta, Carole H Sudre, Harry Long, Rajenki Das, Caroline Jay, Tom Wingfield, Fergus Cumming, William Green, Pantelis Hadjipantelis, Joni Kirk, Claire J Steves, Sebastien Ourselin, Graham Medley, Elizabeth Fearon, Thomas House
Affiliations: Department of Mathematics, University of Manchester, UK; The Alan Turing Institute for Data Science and Artificial Intelligence, London; Nuffield Department of Medicine, University of Oxford, UK
Note: 43 pages; 25 figures
Abstract: Understanding variability in clinical symptoms of SARS-CoV-2 community infections is key in management of the ongoing COVID-19 pandemic. Here we bring together four large and diverse datasets deriving from routine testing, a population-representative household survey and participatory mobile surveillance in the United Kingdom, and use cutting-edge unsupervised classification techniques from statistics and machine learning to characterise symptom phenotypes among symptomatic SARS-CoV-2 PCR-positive community cases. We explore commonalities across datasets and by age bands. While we observe separation due to the total number of symptoms experienced by cases, we also see a separation of symptoms into gastrointestinal, respiratory and other types, and different symptom co-occurrence patterns at the extremes of age. This is expected to have implications for identification and management of community SARS-CoV-2 cases.
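
As a stand-in for the unsupervised classification techniques the authors use, plain k-means on binary symptom indicators already separates a respiratory-type from a gastrointestinal-type phenotype in synthetic data (symptom names and reporting rates are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
symptoms = ["cough", "fever", "nausea", "diarrhoea"]
# respiratory-type cases mostly report the first two symptoms, GI-type the last two
resp = (rng.random((30, 4)) < [0.9, 0.8, 0.1, 0.1]).astype(int)
gi = (rng.random((30, 4)) < [0.1, 0.1, 0.9, 0.8]).astype(int)
X = np.vstack([resp, gi])                      # one row per case, 0/1 per symptom
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The two simulated phenotypes end up with different majority cluster labels; the paper's analysis of course uses richer models and real symptom data.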

【5】 Spatial statistics and stochastic partial differential equations: a mechanistic viewpoint
Link: https://arxiv.org/abs/2111.05724

Authors: Lionel Roques, Denis Allard, Samuel Soubeyrand
Affiliation: Biostatistics and Spatial Processes (BioSP), INRAE, Avignon, France
Abstract: The Stochastic Partial Differential Equation (SPDE) approach, now commonly used in spatial statistics to construct Gaussian random fields, is revisited from a mechanistic perspective based on the movement of microscopic particles, thereby relating pseudo-differential operators to dispersal kernels. We first establish a connection between Lévy flights and PDEs involving the Fractional Laplacian (FL) operator. The corresponding Fokker-Planck PDEs serve as a basis for proposing new generalisations, by considering a general form of SPDE with terms accounting for dispersal, drift and reaction. We detail the difference between the FL operator (with or without linear reaction term) associated with a fat-tailed dispersal kernel, and therefore describing long-distance dependencies, and the damped FL operator associated with a thin-tailed kernel, corresponding to short-distance dependencies. Then, SPDE-based random fields with a non-stationary, spatially and temporally varying external force are illustrated, and nonlinear bistable reaction terms are introduced. The physical meaning of the latter and possible applications are discussed. Returning to the particulate interpretation of the above-mentioned equations, we describe, in a relatively simple case, their links with point processes. We unravel the nature of the point processes they generate and show how such mechanistic models, associated with a probabilistic observation model, can be used in a hierarchical setting to estimate the parameters of the particle dynamics.
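
The Lévy-flight connection rests on heavy-tailed jump lengths: unlike Gaussian increments, a single jump can dominate the whole path. A quick numeric contrast, using Pareto-tailed steps as a crude stand-in for a stable law (our illustration, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gauss_steps = rng.normal(0.0, 1.0, n)                            # Brownian-like increments
heavy_steps = rng.pareto(1.5, n) * rng.choice([-1.0, 1.0], n)    # fat-tailed jumps

# the largest heavy-tailed jump dwarfs the largest Gaussian one
ratio = np.abs(heavy_steps).max() / np.abs(gauss_steps).max()
```

For a tail index of 1.5, the maximum jump grows polynomially in the sample size, which is the path-level signature of the fat-tailed dispersal kernels discussed above.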

【6】 Variable selection and missing data imputation in categorical genomic data analysis by integrated ridge regression and random forest
Link: https://arxiv.org/abs/2111.05714

Authors: Siru Wang, Guoqi Qian
Affiliation: School of Mathematics and Statistics, The University of Melbourne
Abstract: Genomic data arising from a genome-wide association study (GWAS) are often not only large-scale but also incomplete. A specific form of their incompleteness is missing values with a non-ignorable missingness mechanism. The intrinsic complications of genomic data present significant challenges in developing an unbiased and informative procedure of phenotype-genotype association analysis by a statistical variable selection approach. In this paper we develop a coherent procedure of categorical phenotype-genotype association analysis, in the presence of missing values with a non-ignorable missingness mechanism in GWAS data, by integrating state-of-the-art methods: random forest for variable selection, weighted ridge regression with an EM algorithm for missing data imputation, and linear statistical hypothesis testing for determining the missingness mechanism. Two simulated GWAS are used to validate the performance of the proposed procedure. The procedure is then applied to analyze a real data set from a breast cancer GWAS.
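
The two main ingredients, random-forest variable selection and ridge-based imputation, can be sketched mechanically on a toy genotype matrix (the data, 0/1/2 coding, and variable names are invented; the paper's EM-weighted ridge is more involved):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# toy genotype matrix: 200 samples, 5 SNPs coded 0/1/2
G = rng.integers(0, 3, size=(200, 5)).astype(float)
pheno = (G[:, 0] >= 1).astype(int)            # phenotype driven by SNP 0 only

# variable selection: random-forest importances should single out SNP 0
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(G, pheno)
imp = rf.feature_importances_

# imputation mechanics: ridge-regress an incompletely observed SNP on the rest
miss = rng.random(200) < 0.1                  # simulated missingness indicator
ridge = Ridge(alpha=1.0).fit(G[~miss][:, 1:], G[~miss][:, 0])
G_imputed = G.copy()
G_imputed[miss, 0] = ridge.predict(G[miss][:, 1:])
```

The ridge step here only illustrates the plumbing; in the paper the ridge weights and the imputations are updated jointly within an EM loop under the estimated missingness mechanism.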

【7】 Comparing dominance of tennis' big three via multiple-output Bayesian quantile regression models
Link: https://arxiv.org/abs/2111.05631

Authors: Bruno Santos
Affiliation: University of Kent
Abstract: Tennis has seen a myriad of great male players throughout its history, and we are often interested in discussing who is or was the greatest player of all time. While we do not try to answer this question here, we delve into comparing some key statistics related to dominance over opponents for the male players with the most Grand Slam titles, currently Djokovic, Federer and Nadal, in alphabetical order. Here we consider the minutes played and the relative points won in each of their completed matches as measures of dominance against other players. We consider important covariates such as surface, win or loss, type of tournament, and whether the opponent was a top-20 ranked player in the world, to create a more complete comparison of their performance. We use a Bayesian quantile regression model for multiple-output response variables to take into account the dependence between minutes and relative points won. This approach is compelling since we do not need to choose a probability distribution for the joint distribution of the response variable. Our results agree with the common intuition of Nadal's superiority on clay courts, Federer's on grass courts and Djokovic's on hard courts, given their success on each of these surfaces, though Nadal's dominance in clay-court games is unique. Federer shows his dominance in minutes spent on court in wins, while Djokovic takes the edge on the dimension of relative points won in most of the comparisons. While minutes can be directly connected to style of play, the relative-points dimension more directly expresses different levels of advantage over an opponent, and on it Djokovic seems to be the overall leader in this analysis.
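
The multiple-output Bayesian model is beyond a short sketch, but the single-output quantile regression it generalises has a classical linear-programming form (the Koenker-Bassett formulation, not the author's Bayesian approach), which SciPy can solve directly:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau):
    """Linear quantile regression via its LP formulation.

    Minimise sum_i rho_tau(y_i - x_i'b), with rho_tau the pinball loss,
    rewritten as: min tau*1'u + (1-tau)*1'v  s.t.  X b + u - v = y, u, v >= 0.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[:p]

# noiseless line: every quantile fit coincides with the line itself
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x
beta = quantile_regression(X, y, tau=0.5)
```

Varying `tau` traces out conditional quantiles rather than the conditional mean; the paper's model does this jointly for the (minutes, relative points) pair.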

【8】 Tracking multiple spawning targets using Poisson multi-Bernoulli mixtures on sets of tree trajectories
Link: https://arxiv.org/abs/2111.05620

Authors: Ángel F. García-Fernández, Lennart Svensson
Affiliation: Universidad Antonio de Nebrija
Abstract: This paper proposes a Poisson multi-Bernoulli mixture (PMBM) filter on the space of sets of tree trajectories for multiple target tracking with spawning targets. A tree trajectory contains all trajectory information of a target and its descendants, which appear due to the spawning process. Each tree contains a set of branches, where each branch has trajectory information of a target or one of the descendants and its genealogy. For the standard dynamic and measurement models with multi-Bernoulli spawning, the posterior is a PMBM density, with each Bernoulli having information on a potential tree trajectory. To enable a computationally efficient implementation, we derive an approximate PMBM filter in which each Bernoulli tree trajectory has multi-Bernoulli branches obtained by minimising the Kullback-Leibler divergence. The resulting filter improves tracking performance of state-of-the-art algorithms in a simulated scenario.

【9】 Learning Graphs from Smooth and Graph-Stationary Signals with Hidden Variables
Link: https://arxiv.org/abs/2111.05588

Authors: Andrei Buciulea, Samuel Rey, Antonio G. Marques
Abstract: Network-topology inference from (vertex) signal observations is a prominent problem across data-science and engineering disciplines. Most existing schemes assume that observations from all nodes are available, but in many practical environments, only a subset of nodes is accessible. A natural (and sometimes effective) approach is to disregard the role of unobserved nodes, but this ignores latent network effects, deteriorating the quality of the estimated graph. Differently, this paper investigates the problem of inferring the topology of a network from nodal observations while taking into account the presence of hidden (latent) variables. Our schemes assume the number of observed nodes is considerably larger than the number of hidden variables and build on recent graph signal processing models to relate the signals and the underlying graph. Specifically, we go beyond classical correlation and partial correlation approaches and assume that the signals are smooth and/or stationary in the sought graph. The assumptions are codified into different constrained optimization problems, with the presence of hidden variables being explicitly taken into account. Since the resulting problems are ill-conditioned and non-convex, the block matrix structure of the proposed formulations is leveraged and suitable convex-regularized relaxations are presented. Numerical experiments over synthetic and real-world datasets showcase the performance of the developed methods and compare them with existing alternatives.
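
For contrast with the paper's smoothness- and stationarity-based formulations, the classical partial-correlation baseline it mentions can be sketched in a few lines: invert the sample covariance and threshold the off-diagonal entries (ground-truth graph and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# ground-truth chain graph 0-1-2 encoded in a precision matrix
Theta = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Theta), size=20_000)

P = np.linalg.inv(np.cov(X.T))                       # empirical precision matrix
adj = (np.abs(P) > 0.3) & ~np.eye(3, dtype=bool)     # threshold off-diagonal entries
```

The estimated adjacency recovers the chain 0-1-2 without a spurious 0-2 edge; the paper's point is that such baselines break down under hidden variables, which its formulations model explicitly.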

【10】 Clustering of longitudinal data: A tutorial on a variety of approaches
Link: https://arxiv.org/abs/2111.05469

Authors: Niek Den Teuling, Steffen Pauws, Edwin van den Heuvel
Affiliations: Eindhoven University of Technology; Tilburg University
Note: 37 pages, 12 figures
Abstract: During the past two decades, methods for identifying groups with different trends in longitudinal data have become of increasing interest across many areas of research. To support researchers, we summarize the guidance from the literature regarding longitudinal clustering. Moreover, we present a selection of methods for longitudinal clustering, including group-based trajectory modeling (GBTM), growth mixture modeling (GMM), and longitudinal k-means (KML). The methods are introduced at a basic level, and strengths, limitations, and model extensions are listed. Following the recent developments in data collection, attention is given to the applicability of these methods to intensive longitudinal data (ILD). We demonstrate the application of the methods on a synthetic dataset using packages available in R.
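
Longitudinal k-means (KML) is the simplest of the three approaches: treat each subject's trajectory, observed at common time points, as a vector and run k-means on those vectors (the tutorial uses R packages; this is an equivalent Python sketch on synthetic trajectories):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 10)                 # common observation times
# two latent trend groups: rising and falling trajectories plus noise
rising = 2.0 * t + rng.normal(0.0, 0.05, size=(20, t.size))
falling = 1.0 - t + rng.normal(0.0, 0.05, size=(20, t.size))
X = np.vstack([rising, falling])              # one row per subject
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Each group is recovered as one cluster; GBTM and GMM instead fit explicit trajectory models per group, which handles irregular observation times and within-group variability.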

【11】 Which priors matter? Benchmarking models for learning latent dynamics
Link: https://arxiv.org/abs/2111.05458

Authors: Aleksandar Botev, Andrew Jaegle, Peter Wirnsberger, Daniel Hennes, Irina Higgins
Affiliation: DeepMind, London
Abstract: Learning dynamics is at the heart of many important applications of machine learning (ML), such as robotics and autonomous driving. In these settings, ML algorithms typically need to reason about a physical system using high-dimensional observations, such as images, without access to the underlying state. Recently, several methods have proposed to integrate priors from classical mechanics into ML models to address the challenge of physical reasoning from images. In this work, we take a sober look at the current capabilities of these models. To this end, we introduce a suite consisting of 17 datasets with visual observations based on physical systems exhibiting a wide range of dynamics. We conduct a thorough and detailed comparison of the major classes of physically inspired methods alongside several strong baselines. While models that incorporate physical priors can often learn latent spaces with desirable properties, our results demonstrate that these methods fail to significantly improve upon standard techniques. Nonetheless, we find that the use of continuous and time-reversible dynamics benefits models of all classes.
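
The closing observation about time-reversible dynamics can be illustrated with the leapfrog integrator, which is exactly reversible up to floating-point error: integrate forward, flip the momentum, integrate again, and the initial state returns (a generic illustration, not one of the benchmarked models):

```python
def leapfrog(q, p, force, steps, dt):
    """Symplectic, time-reversible leapfrog integration of dq/dt = p, dp/dt = force(q)."""
    for _ in range(steps):
        p += 0.5 * dt * force(q)
        q += dt * p
        p += 0.5 * dt * force(q)
    return q, p

# harmonic oscillator: force(q) = -q
q1, p1 = leapfrog(1.0, 0.0, lambda q: -q, steps=1000, dt=0.01)
# reverse: flip momentum and integrate the same number of steps forward
q0, p0 = leapfrog(q1, -p1, lambda q: -q, steps=1000, dt=0.01)
```

After the round trip, `q0` matches the initial position to machine precision and `p0` is the negated initial momentum, which is what makes such integrators attractive as built-in priors for learned dynamics.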

【12】 Function-on-function linear quantile regression
Link: https://arxiv.org/abs/2111.05374

Authors: Ufuk Beyaztas, Han Lin Shang
Affiliations: Department of Statistics, Marmara University; Department of Actuarial Studies and Business Analytics, Macquarie University
Note: 24 pages, 8 figures; to appear in Mathematical Modelling and Analysis
Abstract: In this study, we propose a function-on-function linear quantile regression model that allows for more than one functional predictor, to establish a more flexible and robust approach. The proposed model is first transformed into a finite-dimensional space via the functional principal component analysis paradigm in the estimation phase. It is then approximated using the estimated functional principal component functions, and the estimated parameter of the quantile regression model is constructed based on the principal component scores. In addition, we propose a Bayesian information criterion to determine the optimum number of truncation constants used in the functional principal component decomposition. Moreover, a stepwise forward procedure and the Bayesian information criterion are used to determine the significant predictors to include in the model. We employ a nonparametric bootstrap procedure to construct prediction intervals for the response functions. The finite-sample performance of the proposed method is evaluated via several Monte Carlo experiments and an empirical data example, and the results produced by the proposed method are compared with those from existing models.
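
The estimation phase rests on functional principal component analysis; on a common grid this reduces to an SVD of the centered curve matrix. A sketch with two known underlying shapes (simulated curves, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
# 100 curves built from two "principal" shapes with random scores plus small noise
scores = rng.normal(size=(100, 2))
basis = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
curves = scores @ basis + rng.normal(0.0, 0.01, size=(100, t.size))

centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()          # variance explained per component
```

The first two components explain essentially all the variance here; in the paper, the number of retained components (the truncation constant) is chosen by a Bayesian information criterion, and the resulting scores feed the quantile regression.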

【13】 Graph Matching via Optimal Transport
Link: https://arxiv.org/abs/2111.05366

Authors: Ali Saad-Eldin, Benjamin D. Pedigo, Carey E. Priebe, Joshua T. Vogelstein
Affiliations: Department of Biomedical Engineering; Department of Applied Mathematics and Statistics; Institute for Computational Medicine; Kavli Neuroscience Discovery Institute, Johns Hopkins University
Abstract: The graph matching problem seeks to find an alignment between the nodes of two graphs that minimizes the number of adjacency disagreements. Solving graph matching is increasingly important due to its applications in operations research, computer vision, neuroscience, and more. However, current state-of-the-art algorithms are inefficient in matching very large graphs, though they produce good accuracy. The main computational bottleneck of these algorithms is the linear assignment problem, which must be solved at each iteration. In this paper, we leverage recent advances in the field of optimal transport to replace the linear assignment algorithm. We present GOAT, a modification to the state-of-the-art graph matching approximation algorithm "FAQ" (Vogelstein, 2015), replacing its linear sum assignment step with the "Lightspeed Optimal Transport" method of Cuturi (2013). The modification provides improvements to both speed and empirical matching accuracy. The effectiveness of the approach is demonstrated in matching graphs in simulated and real data examples.
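
The "Lightspeed Optimal Transport" step swapped in by GOAT is Cuturi's Sinkhorn iteration, which is only a few lines: alternately rescale the rows and columns of a Gibbs kernel until both marginals match (a minimal sketch of the method, not GOAT itself):

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iter=500):
    """Entropy-regularised optimal transport (Cuturi, 2013): returns the plan P."""
    K = np.exp(-C / reg)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    return u[:, None] * K * v[None, :]

# uniform marginals over 4 points on a line, squared-distance cost
a = b = np.full(4, 0.25)
C = (np.arange(4.0)[:, None] - np.arange(4.0)[None, :]) ** 2
P = sinkhorn(a, b, C)
```

The returned plan has the prescribed row and column sums; inside FAQ-style matching, this doubly-stochastic relaxation replaces the exact (and slower) linear sum assignment solve at each iteration.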

【14】 Simulating cloud-aerosol interactions made by ship emissions
Link: https://arxiv.org/abs/2111.05356

Authors: Lekha Patel, Lyndsay Shand
Abstract: Satellite imagery can detect temporary cloud trails, or ship tracks, formed from aerosols emitted by large ships traversing our oceans, a phenomenon that global climate models cannot directly reproduce. Ship tracks are observable examples of marine cloud brightening, a potential solar climate intervention that shows promise in helping combat climate change. Whether or not a ship's emission path visibly impacts the clouds above, and how long a ship track visibly persists, largely depends on the exhaust type and the properties of the boundary layer with which it mixes. In order to statistically infer the longevity of ship-emitted aerosols and characterize the atmospheric conditions under which they form, a first step is to simulate, with a mathematical surrogate model rather than an expensive physical model, the path of these cloud-aerosol interactions with parameters that are inferable from imagery. This allows us to compare when and where we would expect ship tracks to be visible, independent of atmospheric conditions, with what is actually observed from satellite imagery, so as to infer under what atmospheric conditions ship tracks form. In this paper, we discuss an approach to stochastically simulate the behavior of ship-induced aerosol parcels within naturally generated clouds. Our method can use wind fields and potentially relevant atmospheric variables to determine the approximate movement and behavior of the cloud-aerosol tracks, and uses a stochastic differential equation (SDE) to model the persistence behavior of cloud-aerosol paths. This SDE incorporates a drift and a diffusion term, which describe the movement of aerosol parcels via wind and their diffusivity through the atmosphere, respectively. We successfully demonstrate our proposed approach with an example using simulated wind fields and ship paths.
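
A drift-plus-diffusion SDE of the kind described above is commonly simulated with the Euler-Maruyama scheme; a one-dimensional sketch in which the drift and diffusion functions are placeholders for wind advection and atmospheric diffusivity:

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, t_max, dt, rng):
    """Simulate dX_t = drift(X_t) dt + diffusion(X_t) dW_t by Euler-Maruyama."""
    n = int(t_max / dt)
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        dw = rng.normal(0.0, np.sqrt(dt))          # Brownian increment
        x[k + 1] = x[k] + drift(x[k]) * dt + diffusion(x[k]) * dw
    return x

# sanity check: with zero diffusion the scheme reduces to the ODE dx/dt = -x
path = euler_maruyama(lambda x: -x, lambda x: 0.0, x0=1.0,
                      t_max=1.0, dt=1e-3, rng=np.random.default_rng(0))
```

With diffusion switched off the endpoint approaches e^(-1), confirming the integrator; the paper's model then drives the drift with wind-field data and keeps the diffusion term to capture spreading through the atmosphere.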

【15】 Evaluation of a meta-analysis of the association between red and processed meat and selected human health effects
Link: https://arxiv.org/abs/2111.05337

Authors: S. Stanley Young, Warren Kindzierski
Note: 20 pages, 1 figure, 7 tables
Abstract: Background: Risk ratios or p-values from multiple independent studies, observational or randomized, can be computationally combined in a meta-analysis to provide an overall assessment of a research question. However, an irreproducibility crisis currently afflicts a wide range of scientific disciplines, including nutritional epidemiology. An evaluation was undertaken to assess the reliability of a meta-analysis examining the association between red and processed meat and selected human health effects (all-cause mortality, cardiovascular mortality, overall cancer mortality, breast cancer incidence, colorectal cancer incidence, type 2 diabetes incidence). Methods: The numbers of statistical tests and models were counted in 15 randomly selected base papers (14%) of the 105 used in the meta-analysis. Relative risks with 95% confidence limits for 125 risk results were converted to p-values, and p-value plots were constructed to evaluate the effect heterogeneity of the p-values. Results: The number of statistical tests possible in the 15 randomly selected base papers was large, median = 20,736 (interquartile range = 1,728 to 331,776). Each p-value plot for the six selected health effects showed either a random pattern (p-values > 0.05) or a two-component mixture, with small p-values < 0.001 while the other p-values appeared random. Given the potentially large number of statistical tests conducted in the 15 selected base papers, questionable research practices cannot be ruled out as explanations for the small p-values. Conclusions: This independent analysis, which complements the findings of the original meta-analysis, finds that the base papers used in the red and processed meat meta-analysis do not provide evidence for the claimed health effects.
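
The conversion the authors perform, from a relative risk and its 95% confidence limits to a p-value, assumes the interval was formed on the log scale; the standard error and z-score then follow directly:

```python
import numpy as np
from scipy.stats import norm

def rr_ci_to_pvalue(rr, lo, hi):
    """Two-sided p-value recovered from a risk ratio and its 95% CI.

    Assumes the CI was computed on the log scale, so that
    SE = (ln hi - ln lo) / (2 * 1.96) and z = ln(rr) / SE.
    """
    se = (np.log(hi) - np.log(lo)) / (2 * 1.96)
    z = np.log(rr) / se
    return 2 * norm.sf(abs(z))

p_null = rr_ci_to_pvalue(1.0, 0.5, 2.0)   # point estimate at the null
p_edge = rr_ci_to_pvalue(2.0, 1.0, 4.0)   # lower CI limit touching 1
```

A risk ratio of 1 gives p = 1, and a CI whose lower limit just touches 1 gives p of about 0.05, matching the usual duality between confidence intervals and tests; applied to all 125 reported risk results, this is what populates the p-value plots.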

【16】 Searching in the Forest for Local Bayesian Optimization
Link: https://arxiv.org/abs/2111.05834

Authors: Difan Deng, Marius Lindauer
Affiliation: Leibniz University Hannover
Abstract: Because of its sample efficiency, Bayesian optimization (BO) has become a popular approach for dealing with expensive black-box optimization problems, such as hyperparameter optimization (HPO). Recent empirical experiments showed that the loss landscapes of HPO problems tend to be more benign than previously assumed, i.e. in the best case uni-modal and convex, such that a BO framework could be more efficient if it can focus on those promising local regions. In this paper, we propose BOinG, a two-stage approach that is tailored toward mid-sized configuration spaces, as one encounters in many HPO problems. In the first stage, we build a scalable global surrogate model with a random forest to describe the overall landscape structure. Further, we choose a promising subregion via a bottom-up approach on the upper-level tree structure. In the second stage, a local model in this subregion is utilized to suggest the point to be evaluated next. Empirical experiments show that BOinG is able to exploit the structure of typical HPO problems and performs particularly well on mid-sized problems from synthetic functions and HPO.
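
BOinG's global stage, a random-forest surrogate over the whole space, can be caricatured with a lower-confidence-bound loop that uses the spread of per-tree predictions as an uncertainty proxy (this toy loop is our sketch, not the BOinG algorithm):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_bo_minimise(f, n_init=5, n_iter=15, seed=0):
    """Toy Bayesian optimisation of f on [0, 1] with a random-forest surrogate."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    for _ in range(n_iter):
        rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        cand = rng.uniform(0.0, 1.0, size=(200, 1))
        per_tree = np.stack([t.predict(cand) for t in rf.estimators_])
        lcb = per_tree.mean(0) - per_tree.std(0)     # exploit minus explore
        x_next = cand[np.argmin(lcb)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y), 0], y.min()

x_best, y_best = rf_bo_minimise(lambda x: (x - 0.3) ** 2)
```

BOinG additionally selects a promising subregion from the forest's tree structure and hands it to a local model; the sketch above stops at the global surrogate.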

【17】 Gradients are Not All You Need
Link: https://arxiv.org/abs/2111.05803

Authors: Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, Tal Kachman
Affiliations: Google Research, Brain Team; Radboud University, Donders Institute for Brain, Cognition and Behaviour
Abstract: Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos-based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation-based optimization algorithms.
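
The failure mode is easy to reproduce without any ML machinery: for a chaotic system, the gradient of the final state with respect to the initial state is a product of per-step Jacobians whose magnitude grows like exp(Lyapunov exponent times steps). The logistic map at r = 4, whose Lyapunov exponent is ln 2, shows this directly:

```python
# Gradient of x_N w.r.t. x_0 through N iterations of the logistic map,
# accumulated by the chain rule: each factor is the per-step Jacobian.
def rollout_gradient(x0, r=4.0, steps=500):
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= r * (1.0 - 2.0 * x)   # d x_{k+1} / d x_k
        x = r * x * (1.0 - x)
    return x, grad

x_final, g = rollout_gradient(0.2)
```

After 500 steps the gradient magnitude is astronomically large even though the state itself stays bounded in [0, 1]; this is the Jacobian-spectrum criterion of the report in its simplest form.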

【18】 Distribution-Invariant Differential Privacy
Link: https://arxiv.org/abs/2111.05791

Authors: Xuan Bi, Xiaotong Shen
Affiliations: Carlson School of Management and School of Statistics, University of Minnesota
Abstract: Differential privacy is becoming a gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusions of the original data analysis, as privatization often changes the sample distribution. This phenomenon is known as the trade-off between privacy protection and statistical accuracy. In this work, we break this trade-off by developing a distribution-invariant privatization (DIP) method to reconcile both high statistical accuracy and strict differential privacy. As a result, any downstream statistical or machine learning task yields essentially the same conclusion as if one had used the original data. Numerically, under the same strictness of privacy protection, DIP achieves superior statistical accuracy in two simulations and on three real-world benchmarks.
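
The standard way to guarantee epsilon-differential privacy for a numeric query, and the baseline DIP is reacting to, is the Laplace mechanism; note that the added noise perturbs the sample distribution, which is exactly the trade-off the paper targets:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value plus Laplace(sensitivity / epsilon) noise: epsilon-DP."""
    return value + rng.laplace(0.0, sensitivity / epsilon)

# repeated private releases of the same statistic (true value 10.0)
rng = np.random.default_rng(0)
releases = np.array([laplace_mechanism(10.0, 1.0, 1.0, rng) for _ in range(5000)])
```

The releases are unbiased on average, but each one is drawn from a Laplace-perturbed distribution rather than the original one; DIP's contribution is a privatization whose output preserves the original sample distribution.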

【19】 Data Compression: Multi-Term Approach 标题:数据压缩:多期限方法 链接:https://arxiv.org/abs/2111.05775

作者:Pablo Soto-Quiros,Anatoli Torokhti 机构:Centre for Industrial and Applied Mathematics, University of South Australia, Adelaide, SA, Australia; Escuela de Matemáticas, Instituto Tecnológico de Costa Rica, Cartago, Costa Rica 摘要:在信号样本方面,我们提出并证明了一种新的降秩多项变换,简称MTT,在一定条件下,它可以提供比已知最佳降秩变换更好的关联精度。其基本思想是构造比已知最优变换中的参数更多的要优化的变换。这是通过将已知的变换结构扩展到包含附加项的形式来实现的-MTT有四个矩阵以最小化成本。MTT结构还有一个特殊的变换,可以减少数值负载。因此,MTT组件的变化提高了MTT性能。 摘要:In terms of signal samples, we propose and justify a new rank reduced multi-term transform, abbreviated as MTT, which, under certain conditions, may provide better-associated accuracy than that of known optimal rank reduced transforms. The basic idea is to construct the transform with more parameters to optimize than those in the known optimal transforms. This is realized by the extension of the known transform structures to the form that includes additional terms - the MTT has four matrices to minimize the cost. The MTT structure has also a special transformation that decreases the numerical load. As a result, the MTT performance is improved by the variation of the MTT components.
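【19】条摘要中作为对照的"已知最优降秩变换",其经典单项形式就是截断 SVD(Eckart–Young);下面给出该基线的极简示意(随机数据为演示假设,并非论文的多项 MTT 构造):

```python
# 极简示意:经典单项最优降秩变换(截断 SVD, Eckart-Young 定理),
# 即 MTT 通过增加可优化项数试图改进的基线;并非论文的多项构造本身。
import numpy as np
rng = np.random.default_rng(0)

X = rng.standard_normal((8, 100))        # 8 维信号的 100 个样本
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
T = U[:, :k] @ U[:, :k].T                # 秩 k 的最优线性重构变换(投影)
err_k = np.linalg.norm(X - T @ X) ** 2   # Frobenius 误差 = 被舍弃奇异值的平方和
print(err_k, np.sum(s[k:] ** 2))         # 两者相等,即 Eckart-Young 最优性
```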

【20】 The mathematics of contagious diseases and their limitations in forecasting 标题:传染病的数学及其在预测中的局限性 链接:https://arxiv.org/abs/2111.05727

作者:C. O. S. Sorzano 摘要:本文探讨了理解传染病演变的数学模型。最广为人知的一组模型是基于一组微分方程的分区模型。但这些并不是唯一的模式。本综述访问了许多不同的模型族。此外,我们展示这些族,不是作为不相关的实体,而是遵循一个共同的线索,其中一个模型的问题或假设由另一个模型解决或概括。通过这种方式,我们可以了解它们之间的关系、假设、简化,以及最终的局限性。受当前新冠疫情的影响,我们特别关注传播预测。我们举例说明了进行现实预测所遇到的困难。总的来说,它们只是一个生物和社会复杂性要大得多的现实的近似。特别麻烦的是潜在的巨大可变性、问题的时变性以及难以估计可靠模型所需的参数。此外,我们还将看到,这些模型具有乘法性质,这意味着系统参数中的小误差会导致预测中的巨大不确定性。随机或基于agent的模型可以克服基于微分或随机方程的系统建模问题。它们的主要困难在于,它们与可用的数据一样准确和真实,可以用来估计它们的详细参数化,而且这些详细数据通常不由建模者处理。尽管预测传染病演变的数学模型的预测能力非常有限,但这些模型对于规划干预措施仍然非常有用,因为如果所有其他参数保持不变,它们可以计算其影响。它们对于理解复杂系统中疾病传播的特性也非常有用。 摘要:This article explores mathematical models for understanding the evolution of contagious diseases. The most widely known set of models are the compartmental ones, which are based on a set of differential equations. But these are not the only models. This review visits many different families of models. Additionally, we show these families, not as unrelated entities, but following a common thread in which the problems or assumptions of a model are solved or generalized by another model. In this way, we can understand their relationships, assumptions, simplifications, and, ultimately, limitations. Prompted by the current Covid19 pandemic, we have a special focus on spread forecasting. We illustrate the difficulties encountered to do realistic predictions. In general, they are only approximations to a reality whose biological and societal complexity is much larger. Particularly troublesome are the large underlying variability, the problem's time-varying nature, and the difficulty to estimate the required parameters for a faithful model. Additionally, we will also see that these models have a multiplicative nature implying that small errors in the system parameters cause a huge uncertainty in the prediction. Stochastic or agent-based models can overcome some of the modeling problems of systems based on differential or stochastic equations. 
Their main difficulty is that they are as accurate and realistic as the data available to estimate their detailed parametrization, and very often this detailed data is not at the modeller's disposal. Although the predictive power of mathematical models to forecast the evolution of a contagious disease is very limited, these models are still very useful to plan interventions as they can calculate their impact if all other parameters stay fixed. They are also very useful to understand the properties of disease propagation in complex systems.
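【20】条综述的出发点是分室微分方程模型;最简单的 SIS 分室模型可以几行代码积分出来(极简示意;参数与欧拉步长均为演示假设):

```python
# 极简示意:SIS 分室模型 dI/dt = beta*S*I/N - gamma*I 的欧拉积分;
# 参数仅为演示假设,与综述中的任何具体疫情无关。
def simulate_sis(beta=0.3, gamma=0.1, N=1000.0, I0=1.0, days=200, dt=0.1):
    S, I = N - I0, I0
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N     # 新增感染
        rec = gamma * I                # 康复后重新易感
        S += dt * (-new_inf + rec)
        I += dt * (new_inf - rec)
    return I

# R0 = beta/gamma = 3 > 1:趋于地方病平衡 I* = N*(1 - 1/R0) 约 666.7
print(simulate_sis())
print(simulate_sis(beta=0.05))   # R0 < 1:流行消亡,I -> 0
```

摘要强调的"乘法性质"在此也可见:beta 或 gamma 的微小误差经指数增长阶段放大,导致预测的巨大不确定性。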

【21】 STNN-DDI: A Substructure-aware Tensor Neural Network to Predict Drug-Drug Interactions 标题:STNN-DDI:一种预测药物相互作用的子结构感知张量神经网络 链接:https://arxiv.org/abs/2111.05708

作者:Hui Yu,ShiYu Zhao,JianYu Shi 机构:School of Computer Science, Northwestern Polytechnical University, Xi’an, China, School of Life Sciences, Northwestern Polytechnical University, Xi’an, China. 摘要:动机:多类型药物相互作用(DDI)的计算预测有助于减少多药治疗中意外的副作用。尽管现有的计算方法取得了令人鼓舞的结果,但它们忽略了药物的作用主要是由其化学亚结构引起的。此外,它们的可解释性仍然很弱。结果:本文假设两种药物之间的相互作用是由它们的局部化学结构(子结构)引起的,它们的DDI类型是由不同子结构集之间的联系决定的,我们设计了一种新的子结构感知张量神经网络DDI预测模型(STNN-DDI)。该模型学习(子结构,交互类型,子结构)三元组的三维张量,表征子结构-子结构相互作用(SSI)空间。根据具有特定化学意义的预定义子结构列表,将药物映射到此SSI空间使STNN-DDI能够以统一的形式以可解释的方式在转导和归纳情景中执行多类型DDI预测。与基于深度学习的最新基线的比较表明,STNN-DDI在AUC、AUPR、准确度和精密度方面都有显著提高,具有优越性。更重要的是,案例研究通过揭示药物间关于DDI感兴趣类型的关键子结构对和揭示给定DDI中相互作用类型特异性子结构对来说明其可解释性。总之,STNN-DDI为预测DDI以及解释药物间的相互作用机制提供了一种有效的方法。 摘要:Motivation: Computational prediction of multiple-type drug-drug interaction (DDI) helps reduce unexpected side effects in poly-drug treatments. Although existing computational approaches achieve inspiring results, they ignore that the action of a drug is mainly caused by its chemical substructures. In addition, their interpretability is still weak. Results: In this paper, by supposing that the interactions between two given drugs are caused by their local chemical structures (sub-structures) and their DDI types are determined by the linkages between different substructure sets, we design a novel Substructure-aware Tensor Neural Network model for DDI prediction (STNN-DDI). The proposed model learns a 3-D tensor of (substructure, interaction type, substructure) triplets, which characterizes a substructure-substructure interaction (SSI) space. According to a list of predefined substructures with specific chemical meanings, the mapping of drugs into this SSI space enables STNN-DDI to perform the multiple-type DDI prediction in both transductive and inductive scenarios in a unified form with an explicable manner. The comparison with deep learning-based state-of-the-art baselines demonstrates the superiority of STNN-DDI with the significant improvement of AUC, AUPR, Accuracy, and Precision. 
More importantly, case studies illustrate its interpretability by both revealing a crucial substructure pair across drugs regarding a DDI type of interest and uncovering interaction type-specific substructure pairs in a given DDI. In summary, STNN-DDI provides an effective approach to predicting DDIs as well as explaining the interaction mechanisms among drugs.
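【21】条摘要中"(子结构, 交互类型, 子结构)三维张量打分"的骨架,可以用一个双线性张量收缩来示意(极简示意;指纹与张量均为随机演示数据,并非论文的训练模型):

```python
# 极简示意:用 (子结构, 交互类型, 子结构) 三维张量给药物对打分的骨架;
# 子结构指纹与张量均为随机演示数据,并非 STNN-DDI 的实际参数化。
import numpy as np
rng = np.random.default_rng(0)

n_sub, n_types = 16, 4
T = rng.standard_normal((n_sub, n_types, n_sub))   # SSI 空间的三维张量

def ddi_scores(fp1, fp2, T):
    """fp: 药物的子结构指纹(0/1 向量);返回每种交互类型的双线性打分。"""
    return np.einsum("i,itj,j->t", fp1, T, fp2)

drug_a = rng.integers(0, 2, n_sub).astype(float)
drug_b = rng.integers(0, 2, n_sub).astype(float)
scores = ddi_scores(drug_a, drug_b, T)
print(scores.shape, int(np.argmax(scores)))   # 每个类型一个分数;取最大者作预测
```

可解释性的来源也一目了然:每个分数是对应类型切片 T[:, t, :] 上被两药共有子结构对选中的元素之和。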

【22】 Understanding the Generalization Benefit of Model Invariance from a Data Perspective 标题:从数据角度理解模型不变性的泛化效益 链接:https://arxiv.org/abs/2111.05529

作者:Sicheng Zhu,Bang An,Furong Huang 机构:Department of Computer Science, University of Maryland, College Park 备注:Accepted to NeurIPS 2021 摘要:机器学习模型在某些类型的数据转换下具有不变性,在实践中表现出更好的泛化能力。然而,对不变性为什么有利于推广的原则性理解是有限的。给定一个数据集,通常没有原则性的方法来选择“合适的”数据转换,在这种转换下,模型不变性可以保证更好的泛化。本文通过引入由变换引起的样本覆盖,即数据集的一个代表子集,可以通过变换近似地恢复整个数据集,研究了模型不变性的泛化优势。对于任何数据转换,我们基于样本覆盖率为不变模型提供精确的泛化边界。我们还通过转换所诱导的样本覆盖数,即其诱导样本覆盖的最小大小,来描述一组数据转换的“适用性”。我们表明,对于样本覆盖数较小的“合适”变换,我们可以收紧推广边界。此外,我们提出的样本覆盖数可以进行经验评估,从而为选择变换以发展模型不变性以更好地推广提供了指导。在多个数据集上的实验中,我们评估了一些常用变换的样本覆盖数,并表明一组变换(例如,3D视图变换)的较小样本覆盖数表明不变模型的测试和训练误差之间的差距较小,这验证了我们的命题。 摘要:Machine learning models that are developed to be invariant under certain types of data transformations have shown improved generalization in practice. However, a principled understanding of why invariance benefits generalization is limited. Given a dataset, there is often no principled way to select "suitable" data transformations under which model invariance guarantees better generalization. This paper studies the generalization benefit of model invariance by introducing the sample cover induced by transformations, i.e., a representative subset of a dataset that can approximately recover the whole dataset using transformations. For any data transformations, we provide refined generalization bounds for invariant models based on the sample cover. We also characterize the "suitability" of a set of data transformations by the sample covering number induced by transformations, i.e., the smallest size of its induced sample covers. We show that we may tighten the generalization bounds for "suitable" transformations that have a small sample covering number. In addition, our proposed sample covering number can be empirically evaluated and thus provides a guide for selecting transformations to develop model invariance for better generalization. 
In experiments on multiple datasets, we evaluate sample covering numbers for some commonly used transformations and show that the smaller sample covering number for a set of transformations (e.g., the 3D-view transformation) indicates a smaller gap between the test and training error for invariant models, which verifies our propositions.
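【22】条摘要定义的"变换诱导的样本覆盖"可以用一个贪心集合覆盖直接估计(极简示意;一维数据与取负变换群均为演示假设,得到的是覆盖数的上界):

```python
# 极简示意:在给定变换群下贪心构造"样本覆盖"并数其大小(覆盖数的上界)。
# 假设:数据为一维点,变换群为 {恒等, 取负},阈值 eps 下近似恢复整个数据集。
def greedy_sample_cover(data, transforms, eps=0.0):
    data = list(data)
    uncovered = set(range(len(data)))
    cover = []
    while uncovered:
        best, best_hits = None, set()
        for i in range(len(data)):   # 选能覆盖最多未覆盖点的样本
            hits = {j for j in uncovered
                    if any(abs(t(data[i]) - data[j]) <= eps for t in transforms)}
            if len(hits) > len(best_hits):
                best, best_hits = i, hits
        cover.append(data[best])
        uncovered -= best_hits
    return cover

transforms = [lambda x: x, lambda x: -x]   # 对称变换群
data = [-3, -2, -1, 1, 2, 3]
cover = greedy_sample_cover(data, transforms)
print(len(cover))   # 3:每对 {x, -x} 只需一个代表元
```

"合适"的变换使覆盖数远小于数据集规模(此处 3 对 6),对应摘要中更紧的泛化界。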

【23】 ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees 标题:ResNEst和DenseNEst:具有改进表示保证的基于块的DNN模型 链接:https://arxiv.org/abs/2111.05496

作者:Kuan-Lin Chen,Ching-Hua Lee,Harinath Garudadri,Bhaskar D. Rao 机构:Department of Electrical and Computer Engineering,Qualcomm Institute, University of California, San Diego, La Jolla, CA , USA 备注:24 pages. Accepted by NeurIPS 2021 摘要:最近文献中使用的证明残差网络(resnet)优于线性预测的模型实际上不同于计算机视觉中广泛使用的标准resnet。除了标量值输出或单个残差块等假设外,这些模型在馈入最终仿射层的最终残差表示处没有非线性。为了对这种非线性差异进行编码并揭示线性估计特性,我们定义了resnest,即残差非线性估计,方法是简单地从标准resnet的最后一个残差表示处删除非线性。我们表明,具有瓶颈块的宽重嵌套始终可以保证标准重嵌套所要达到的非常理想的训练特性,即,在给定相同的基本元素集的情况下,添加更多的块不会降低性能。为了证明这一点,我们首先认识到resnest是基础函数模型,它受到基础学习和线性预测耦合问题的限制。然后,为了将预测权重与基础学习解耦,我们构建了一种称为增强ResNEst(a-ResNEst)的特殊体系结构,该体系结构始终保证在添加块的情况下不会产生更差的性能。因此,这样的a-ResNEst使用相应的基础为ResNEst建立了经验风险下界。我们的结果表明,重新嵌套确实存在减少特征重用的问题;然而,可以通过充分扩展或加宽输入空间来避免这种情况,从而实现上述理想特性。受已证明优于resnet的densenet的启发,我们还提出了相应的新模型,称为稠密连接非线性估计器(DenseNEst)。我们证明了任何密集度都可以表示为带有瓶颈块的宽重嵌套。与重新设计不同,密集型建筑在不进行任何特殊的建筑重新设计的情况下展现出理想的性能。 摘要:Models recently used in the literature proving residual networks (ResNets) are better than linear predictors are actually different from standard ResNets that have been widely used in computer vision. In addition to the assumptions such as scalar-valued output or single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify such a difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping nonlinearities at the last residual representation from standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove that, we first recognize ResNEsts are basis function models that are limited by a coupling problem in basis learning and linear prediction. 
Then, to decouple prediction weights from basis learning, we construct a special architecture termed augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using corresponding bases. Our results demonstrate ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by the DenseNets that have been shown to outperform ResNets, we also propose a corresponding new model called Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
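【23】条摘要对 ResNEst 的定义(去掉最后一个残差表示处的非线性)与"加块不降性能"的直观来源,可以用一个前向示意验证(极简示意;结构与随机初始化均为演示假设):

```python
# 极简示意:ResNEst = 最终残差表示不经过非线性、直接送入仿射层的残差网络,
# 即"基函数 + 线性预测";结构与初始化均为演示假设。
import numpy as np
rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def resnest_forward(x, blocks, W_out):
    """blocks: [(W1, W2), ...];块内仍有非线性,但最终表示 h 直接线性输出。"""
    h = x
    for W1, W2 in blocks:
        h = h + W2 @ relu(W1 @ h)
    return W_out @ h                   # 最终仿射层前没有非线性

d, m = 4, 8
x = rng.standard_normal(d)
blocks = [(rng.standard_normal((m, d)) * 0.1, rng.standard_normal((d, m)) * 0.1)]
W_out = rng.standard_normal((1, d))

y1 = resnest_forward(x, blocks, W_out)
# "加块不降性能"的直观来源:新增一个零初始化块完全不改变输出
zero_block = (np.zeros((m, d)), np.zeros((d, m)))
y2 = resnest_forward(x, blocks + [zero_block], W_out)
print(np.allclose(y1, y2))
```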

【24】 DP-REC: Private & Communication-Efficient Federated Learning 标题:DP-REC:私有且通信高效的联邦学习 链接:https://arxiv.org/abs/2111.05454

作者:Aleksei Triastcyn,Matthias Reisser,Christos Louizos 机构:Qualcomm AI Research∗ 摘要:隐私和通信效率是神经网络联合训练的重要挑战,将它们结合起来仍然是一个开放的问题。在这项工作中,我们开发了一种将高度压缩通信和差异隐私(DP)相结合的方法。我们将基于相对熵编码(REC)的压缩技术引入到联邦设置中。通过对REC的微小修改,我们得到了一个可证明的差异私有学习算法DP-REC,并展示了如何计算其隐私保证。我们的实验表明,DP-REC可大幅降低通信成本,同时提供与最新技术相当的隐私保障。 摘要:Privacy and communication efficiency are important challenges in federated training of neural networks, and combining them is still an open problem. In this work, we develop a method that unifies highly compressed communication and differential privacy (DP). We introduce a compression technique based on Relative Entropy Coding (REC) to the federated setting. With a minor modification to REC, we obtain a provably differentially private learning algorithm, DP-REC, and show how to compute its privacy guarantees. Our experiments demonstrate that DP-REC drastically reduces communication costs while providing privacy guarantees comparable to the state-of-the-art.

【25】 Constrained Instance and Class Reweighting for Robust Learning under Label Noise 标题:标签噪声下鲁棒学习的约束实例和类加权 链接:https://arxiv.org/abs/2111.05428

作者:Abhishek Kumar,Ehsan Amid 机构:Google Research, Brain Team 备注:27 pages, including Appendix 摘要:深度神经网络在监督学习中表现出了令人印象深刻的性能,这是由于它们能够很好地适应所提供的训练数据。然而,它们的性能在很大程度上取决于训练数据的质量,并且在存在噪声的情况下往往会下降。我们提出了一种处理标签噪声的原则性方法,目的是为单个实例和类标签分配重要性权重。我们的方法通过构造一类约束优化问题来工作,这些约束优化问题为这些重要权重生成简单的闭式更新。所提出的优化问题在每个小批量中解决,从而避免了在整个数据集上存储和更新权重的需要。我们的优化框架还为解决标签噪声(如标签引导)的现有标签平滑启发式算法提供了理论视角。我们在几个基准数据集上对我们的方法进行了评估,并观察到在存在标签噪声的情况下有相当大的性能提升。 摘要:Deep neural networks have shown impressive performance in supervised learning, enabled by their ability to fit well to the provided training data. However, their performance is largely dependent on the quality of the training data and often degrades in the presence of noise. We propose a principled approach for tackling label noise with the aim of assigning importance weights to individual instances and class labels. Our method works by formulating a class of constrained optimization problems that yield simple closed form updates for these importance weights. The proposed optimization problems are solved per mini-batch which obviates the need of storing and updating the weights over the full dataset. Our optimization framework also provides a theoretical perspective on existing label smoothing heuristics for addressing label noise (such as label bootstrapping). We evaluate our method on several benchmark datasets and observe considerable performance gains in the presence of label noise.
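【25】条摘要提到其优化框架可以解释"标签自举"(label bootstrapping)这类标签平滑启发式;该启发式本身只有几行(极简示意;数值为演示假设,并非论文的约束优化更新式):

```python
# 极简示意:标签自举(label bootstrapping)启发式——
# 用 beta*y + (1-beta)*p 作软目标计算交叉熵;数值仅为演示假设。
import math

def bootstrap_soft_loss(probs, onehot, beta=0.8):
    """probs: 模型预测分布;onehot: 可能带噪的标签;返回软自举交叉熵。"""
    eps = 1e-12
    return -sum((beta * y + (1.0 - beta) * p) * math.log(p + eps)
                for p, y in zip(probs, onehot))

probs = [0.7, 0.2, 0.1]
noisy_label = [0.0, 1.0, 0.0]     # 与模型预测矛盾的(可能错误的)标签
hard = bootstrap_soft_loss(probs, noisy_label, beta=1.0)   # 纯交叉熵
soft = bootstrap_soft_loss(probs, noisy_label, beta=0.8)   # 自举后损失更低
print(hard, soft)
```

当标签与置信预测矛盾时,自举目标降低该样本对损失的贡献,效果上等价于给它一个较小的重要性权重,这正是论文统一框架所刻画的机制。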

【26】 Statistical Perspectives on Reliability of Artificial Intelligence Systems 标题:人工智能系统可靠性的统计透视 链接:https://arxiv.org/abs/2111.05391

作者:Yili Hong,Jiayi Lian,Li Xu,Jie Min,Yueyao Wang,Laura J. Freeman,Xinwei Deng 机构:Department of Statistics, Virginia Tech, Blacksburg, VA 备注:40 pages 摘要:人工智能(AI)系统在许多领域变得越来越流行。然而,人工智能技术仍处于发展阶段,许多问题需要解决。其中,需要证明人工智能系统的可靠性,以便公众能够放心地使用人工智能系统。在本文中,我们提供了人工智能系统可靠性的统计观点。与其他考虑不同,人工智能系统的可靠性主要集中在时间维度上。也就是说,系统可以在预期的时间段内执行其设计的功能。我们介绍了人工智能可靠性研究的智能统计框架,该框架包括五个部分:系统结构、可靠性指标、故障原因分析、可靠性评估和测试计划。我们回顾了可靠性数据分析和软件可靠性中的传统方法,并讨论了如何将这些现有方法转化为人工智能系统的可靠性建模和评估。我们还描述了人工智能可靠性建模和分析的最新发展,概述了该领域的统计研究挑战,包括分布外检测、训练集的影响、对抗性攻击、模型准确性和不确定性量化,并讨论了这些主题如何与人工智能可靠性相关,举例说明。最后,我们讨论了AI可靠性评估的数据收集和测试计划,以及如何改进系统设计以获得更高的AI可靠性。论文最后作了一些总结。 摘要:Artificial intelligence (AI) systems have become increasingly popular in many areas. Nevertheless, AI technologies are still in their developing stages, and many issues need to be addressed. Among those, the reliability of AI systems needs to be demonstrated so that the AI systems can be used with confidence by the general public. In this paper, we provide statistical perspectives on the reliability of AI systems. Different from other considerations, the reliability of AI systems focuses on the time dimension. That is, the system can perform its designed functionality for the intended period. We introduce a so-called SMART statistical framework for AI reliability research, which includes five components: Structure of the system, Metrics of reliability, Analysis of failure causes, Reliability assessment, and Test planning. We review traditional methods in reliability data analysis and software reliability, and discuss how those existing methods can be transformed for reliability modeling and assessment of AI systems. We also describe recent developments in modeling and analysis of AI reliability and outline statistical research challenges in this area, including out-of-distribution detection, the effect of the training set, adversarial attacks, model accuracy, and uncertainty quantification, and discuss how those topics can be related to AI reliability, with illustrative examples. 
Finally, we discuss data collection and test planning for AI reliability assessment and how to improve system designs for higher AI reliability. The paper closes with some concluding remarks.
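【26】条摘要回顾的"可靠性数据分析中的传统方法",最基础的一环是指数寿命模型;其 MLE 与可靠度函数可几行算出(极简示意;失效时间为假想演示数据):

```python
# 极简示意:指数寿命模型的 MLE 与可靠度函数 R(t) = exp(-lambda*t);
# 失效时间为假想演示数据,仅说明传统可靠性数据分析的最简情形。
import math

failure_times = [120.0, 85.0, 200.0, 150.0, 95.0]   # 假想失效时间(小时)
lam = len(failure_times) / sum(failure_times)        # MLE: 失效数 / 总运行时间
mttf = 1.0 / lam                                     # 平均失效前时间
R = lambda t: math.exp(-lam * t)                     # t 时刻仍正常工作的概率
print(mttf, R(100.0))
```

摘要的要点正在于:AI 系统的可靠性同样关注这一时间维度,但需要把此类经典模型改造以适应训练集漂移、分布外输入等 AI 特有的失效原因。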

【27】 Probabilistic predictions of SIS epidemics on networks based on population-level observations 标题:基于人口水平观测的网络上SIS疫情概率预测 链接:https://arxiv.org/abs/2111.05369

作者:Tanja Zerenner,Francesco Di Lauro,Masoumeh Dashti,Luc Berthouze,Istvan Z. Kiss 机构:Department of Mathematics, University of Sussex, UK; Big Data Institute, Nuffield Department of Medicine, University of Oxford, UK; Department of Informatics 摘要:我们预测了在常规、Erdős–Rényi和Barabási–Albert网络上进行中的易感-感染-易感(SIS)流行病的未来进程。众所周知,接触网影响流行病在人群中的传播。因此,在这种情况下,在人口一级对流行病的观察包含了有关潜在网络的信息。这些信息反过来有助于预测正在发生的流行病的未来进程。为了在预测框架中利用这一点,网络上SIS流行病的精确高维随机模型近似于低维代理模型。代理模型基于出生和死亡过程;底层网络的影响由出生率的参数模型描述。我们从经验上证明,替代模型捕捉到了流行病一旦达到某一点就不会消失的内在随机性。贝叶斯参数推断允许将模型参数和基础网络类别的不确定性直接纳入概率预测中。 摘要:We predict the future course of ongoing susceptible-infected-susceptible (SIS) epidemics on regular, Erdős–Rényi and Barabási–Albert networks. It is known that the contact network influences the spread of an epidemic within a population. Therefore, observations of an epidemic, in this case at the population-level, contain information about the underlying network. This information, in turn, is useful for predicting the future course of an ongoing epidemic. To exploit this in a prediction framework, the exact high-dimensional stochastic model of an SIS epidemic on a network is approximated by a lower-dimensional surrogate model. The surrogate model is based on a birth-and-death process; the effect of the underlying network is described by a parametric model for the birth rates. We demonstrate empirically that the surrogate model captures the intrinsic stochasticity of the epidemic once it reaches a point from which it will not die out. Bayesian parameter inference allows for uncertainty about the model parameters and the class of the underlying network to be incorporated directly into probabilistic predictions. 
An evaluation of a number of scenarios shows that in most cases the resulting prediction intervals adequately quantify the prediction uncertainty. As long as the population-level data is available over a long-enough period, even if not sampled frequently, the model leads to excellent predictions where the underlying network is correctly identified and prediction uncertainty mainly reflects the intrinsic stochasticity of the spreading epidemic. For predictions inferred from shorter observational periods, uncertainty about parameters and network class dominate prediction uncertainty. The proposed method relies on minimal data and is numerically efficient, which makes it attractive either as a standalone inference and prediction scheme or in conjunction with other methods.
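【27】条摘要所用替代模型的骨架是人口级 SIS 的生灭过程;可用 Gillespie 算法模拟(极简示意;出生率的网络参数化此处简化为完全混合,参数为演示假设):

```python
# 极简示意:人口级 SIS 的生灭过程(Gillespie 模拟);
# 摘要中出生率的网络参数模型这里简化为完全混合,仅作骨架演示。
import random
random.seed(1)

def gillespie_sis(N=200, I0=5, beta=0.3, gamma=0.1, t_max=200.0):
    t, I = 0.0, I0
    while t < t_max and I > 0:
        birth = beta * I * (N - I) / N     # 感染("出生")率
        death = gamma * I                  # 康复("死亡")率
        rate = birth + death
        t += random.expovariate(rate)      # 下一事件的等待时间
        I += 1 if random.random() < birth / rate else -1
    return I

final = [gillespie_sis() for _ in range(20)]
print(final)   # 多数轨迹在地方病水平 N*(1 - gamma/beta) 约 133 附近波动
```

重复模拟得到的末态分布正体现了摘要所说"达到某一点后不会消亡"的内在随机性,这也是预测区间要量化的对象。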

【28】 The Jacobi Theta Distribution 标题:Jacobi Theta分布 链接:https://arxiv.org/abs/2111.05336

作者:Caleb Deen Bastian,Grzegorz Rempala,Herschel Rabitz 机构:Program in Applied and Computational Mathematics, Princeton University; Division of Biostatistics, The Ohio State University, Columbus, OH; Department of Mathematics, The Ohio State University, Columbus, OH, USA 备注:16 pages, 8 figures 摘要:我们通过指数随机变量在无限反平方律曲面上的离散积分形成雅可比θ分布。它是连续的,支撑在正实数上,有一个正参数,是单峰的、正偏斜的、尖峰的(leptokurtic)。其累积分布和密度函数用雅可比θ函数表示。我们描述了渐近和对数正态近似、推断以及这种分布在建模中的一些应用。 摘要:We form the Jacobi theta distribution through discrete integration of exponential random variables over an infinite inverse square law surface. It is continuous, supported on the positive reals, has a single positive parameter, is unimodal, positively skewed, and leptokurtic. Its cumulative distribution and density functions are expressed in terms of the Jacobi theta function. We describe asymptotic and log-normal approximations, inference, and a few applications of such distributions to modeling.
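【28】条摘要中出现的雅可比θ函数本身容易数值计算;下面用截断级数求值并以经典的 Jacobi 恒等式自检(极简示意;具体分布的 CDF 形式见论文,此处不重构):

```python
# 极简示意:截断级数计算雅可比 theta 函数 theta(t) = sum_{n in Z} exp(-pi*n^2*t),
# 并用经典恒等式 theta(1/t) = sqrt(t)*theta(t) 自检;并非论文分布的 CDF 实现。
import math

def jacobi_theta(t, terms=50):
    """theta(t) = 1 + 2 * sum_{n>=1} exp(-pi*n^2*t),对 t > 0 指数级快速收敛。"""
    return 1.0 + 2.0 * sum(math.exp(-math.pi * n * n * t) for n in range(1, terms))

print(jacobi_theta(1.0))                                      # 约 1.08643
print(jacobi_theta(0.5), math.sqrt(2.0) * jacobi_theta(2.0))  # 两者相等(模变换)
```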

【29】 Active Sampling for Linear Regression Beyond the ℓ_2 Norm 标题:超越ℓ_2范数的线性回归主动抽样 链接:https://arxiv.org/abs/2111.04888

作者:Cameron Musco,Christopher Musco,David P. Woodruff,Taisuke Yasuda 机构:UMass Amherst, NYU, CMU 备注:Abstract shortened to meet arXiv limits 摘要:我们研究了线性回归的主动抽样算法,其目的是只查询目标向量$b\in\mathbb{R}^n$的少量条目,并输出$\min_{x\in\mathbb{R}^d}\|Ax-b\|$的近似极小值点,其中$A\in\mathbb{R}^{n\times d}$是设计矩阵,$\|\cdot\|$是某种损失函数。对于任意$0<p<\infty$的$\ell_p$范数回归,我们给出了一种基于Lewis权重抽样的算法,只需对$b$进行$\tilde{O}(d^{\max(1,p/2)}/\mathrm{poly}(\epsilon))$次查询即可输出$(1+\epsilon)$近似解。我们表明,这种对$d$的依赖是最优的(至多相差对数因子)。我们的结果解决了Chen和Dereziński最近提出的一个公开问题,他们给出了$\ell_1$范数的近似最优界,以及$p\in(1,2)$时$\ell_p$回归的次优界。我们还给出了对至多$p$次多项式增长的损失函数的第一个总灵敏度上界$O(d^{\max\{1,p/2\}}\log^2 n)$。这改进了Tukan、Maalouf和Feldman最近的结果。将其与我们的$\ell_p$回归技术相结合,我们得到了一个只需$\tilde{O}(d^{1+\max\{1,p/2\}}/\mathrm{poly}(\epsilon))$次查询的主动回归算法,回答了Chen和Dereziński的另一个公开问题。对于Huber损失这一重要特例,我们进一步将主动样本复杂度改进为$\tilde{O}(d^{(1+\sqrt{2})/2}/\epsilon^c)$,非主动样本复杂度改进为$\tilde{O}(d^{4-2\sqrt{2}}/\epsilon^c)$,改进了Clarkson和Woodruff此前针对Huber回归给出的$d^4$界。我们的灵敏度界还有进一步的意义,改进了此前使用灵敏度抽样得到的多种结果,包括Orlicz范数子空间嵌入和鲁棒子空间逼近。最后,我们的主动抽样结果给出了在每个$\ell_p$范数下Kronecker积回归的第一个次线性时间算法。 摘要:We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector $b\in\mathbb{R}^n$ and output a near minimizer to $\min_{x\in\mathbb{R}^d}\|Ax-b\|$, where $A\in\mathbb{R}^{n\times d}$ is a design matrix and $\|\cdot\|$ is some loss function. For $\ell_p$ norm regression for any $0<p<\infty$, we give an algorithm based on Lewis weight sampling that outputs a $(1+\epsilon)$ approximate solution using just $\tilde{O}(d^{\max(1,p/2)}/\mathrm{poly}(\epsilon))$ queries to $b$. We show that this dependence on $d$ is optimal, up to logarithmic factors. Our result resolves a recent open question of Chen and Dereziński, who gave near optimal bounds for the $\ell_1$ norm, and suboptimal bounds for $\ell_p$ regression with $p\in(1,2)$. We also provide the first total sensitivity upper bound of $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree $p$ polynomial growth. 
This improves a recent result of Tukan, Maalouf, and Feldman. By combining this with our techniques for the $\ell_p$ regression result, we obtain an active regression algorithm making $\tilde{O}(d^{1+\max\{1,p/2\}}/\mathrm{poly}(\epsilon))$ queries, answering another open question of Chen and Dereziński. For the important special case of the Huber loss, we further improve our bound to an active sample complexity of $\tilde{O}(d^{(1+\sqrt{2})/2}/\epsilon^c)$ and a non-active sample complexity of $\tilde{O}(d^{4-2\sqrt{2}}/\epsilon^c)$, improving a previous $d^4$ bound for Huber regression due to Clarkson and Woodruff. Our sensitivity bounds have further implications, improving a variety of previous results using sensitivity sampling, including Orlicz norm subspace embeddings and robust subspace approximation. Finally, our active sampling results give the first sublinear time algorithms for Kronecker product regression under every $\ell_p$ norm.
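【29】条摘要所用的 Lewis 权重抽样在 p=2 时退化为经典的杠杆分数(leverage score)抽样;下面示意这一特例(极简示意;采样规模与数据均为演示假设,并非论文的 ℓ_p/主动查询完整算法):

```python
# 极简示意:p=2 时 Lewis 权重 = 杠杆分数;按杠杆分数行采样 + 重加权最小二乘,
# 只"查询" b 的少量条目即可逼近全量解。并非论文的 l_p / 主动回归完整算法。
import numpy as np
rng = np.random.default_rng(0)

n, d = 5000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# 杠杆分数 tau_i = A 列空间正交基 Q 的第 i 行的范数平方
Q, _ = np.linalg.qr(A)
tau = np.sum(Q * Q, axis=1)
probs = tau / tau.sum()

m = 200                                        # 只查询 b 的 200 个条目
idx = rng.choice(n, size=m, p=probs)
w = 1.0 / np.sqrt(m * probs[idx])              # 无偏重加权
x_sampled, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sampled - x_full))      # 很小:采样解接近全量解
```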