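Finance / Speech / Audio Processing arXiv Digest [12.15]

Time: 2023-04-18 15:34:12

q-fin (Finance): 12 papers in total

cs.SD (Speech): 15 papers in total

eess.AS (Audio Processing): 15 papers in total

1. q-fin (Finance):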

【1】 The Oracle estimator is suboptimal for global minimum variance portfolio optimisation Link: https://arxiv.org/abs/2112.07521

Authors: Christian Bongiorno, Damien Challet
Affiliations: Université Paris-Saclay, CentraleSupélec, Laboratoire de Mathématiques et Informatique pour la Complexité et les Systèmes, Gif-sur-Yvette, France
Abstract: A common misconception is that the Oracle eigenvalue estimator of the covariance matrix yields the best realized portfolio performance. In reality, the Oracle estimator simply modifies the empirical covariance matrix eigenvalues so as to minimize the Frobenius distance between the filtered and the realized covariance matrices. This leads to the best portfolios only when the in-sample eigenvectors coincide with the out-of-sample ones. In all other cases, the optimal eigenvalue correction can be obtained from the solution of a quadratic programming problem. Solving it shows that the Oracle estimator only yields the best portfolios in the limit of infinite data points per asset and only in stationary systems.
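The Oracle correction described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the usual definition in which each Oracle eigenvalue is the out-of-sample variance along an in-sample eigenvector, and the function names, the toy covariance and the sample sizes are all hypothetical.

```python
import numpy as np

def oracle_eigenvalues(cov_in, cov_out):
    """Score each in-sample eigenvector on the out-of-sample covariance."""
    _, vecs = np.linalg.eigh(cov_in)  # columns are in-sample eigenvectors
    # lam[i] = v_i^T cov_out v_i for every eigenvector v_i
    lam = np.einsum("ji,jk,ki->i", vecs, cov_out, vecs)
    return lam, vecs

def gmv_weights(cov):
    """Global minimum variance weights: cov^{-1} 1, normalised to sum to 1."""
    raw = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return raw / raw.sum()

# Toy data (assumption): 5 assets, 60 observations per window.
rng = np.random.default_rng(0)
n_assets, n_obs = 5, 60
true_cov = np.diag(np.linspace(1.0, 3.0, n_assets))
x_in = rng.multivariate_normal(np.zeros(n_assets), true_cov, size=n_obs)
x_out = rng.multivariate_normal(np.zeros(n_assets), true_cov, size=n_obs)
cov_in = np.cov(x_in, rowvar=False)
cov_out = np.cov(x_out, rowvar=False)

lam, vecs = oracle_eigenvalues(cov_in, cov_out)
cov_oracle = vecs @ np.diag(lam) @ vecs.T  # filtered covariance matrix
w = gmv_weights(cov_oracle)
```

Among all matrices that keep the in-sample eigenvectors, `cov_oracle` is the closest to `cov_out` in Frobenius norm. As the abstract points out, that does not by itself make `w` the best realized minimum variance portfolio.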

【2】 On The Quality Of Cryptocurrency Markets: Centralized Versus Decentralized Exchanges Link: https://arxiv.org/abs/2112.07386

Authors: Andrea Barbon, Angelo Ranaldo
Affiliations: Assistant Professor of Finance at the University of St…; Full Professor of Finance and Systemic Risk at the University of St…
Abstract: Despite the growing adoption of decentralized exchanges, not much is yet known about their market quality. To shed light on this issue, we compare decentralized blockchain-based venues (DEX) to centralized crypto exchanges (CEX) by assessing two key aspects of market quality: price efficiency and market liquidity. Using a novel and comprehensive data set, we find that overall CEX provide better market quality but DEX become competitive for transactions exceeding $100,000. Further, the main determinant of the lower price efficiency of DEX is the high gas price stemming from proof-of-work blockchains. We propose and empirically validate a stylized theory of DEX liquidity provision, which links trading volumes, protocol fees, and liquidity in equilibrium. Our model identifies quantitative conditions for DEX to overtake CEX in the future.

【3】 Deep Partial Hedging Link: https://arxiv.org/abs/2112.07335

Authors: Songyan Hou, Thomas Krabichler, Marcus Wunsch
Abstract: Using techniques from deep learning (cf. [Büh+19]), we show that neural networks can be trained successfully to replicate the modified payoff functions that were first derived in the context of partial hedging by [FL00]. Not only does this approach better accommodate the realistic setting of hedging in discrete time, it also allows for the inclusion of transaction costs as well as general market dynamics.

【4】 The road to safety- Examining the nexus between road infrastructure and crime in rural India Link: https://arxiv.org/abs/2112.07314

Authors: Ritika Jain, Shreya Biswas
Affiliations: Centre for Development Studies, Trivandrum, India; Assistant Professor, Department of Economics and Finance, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, India
Abstract: This study examines the relationship between road infrastructure and crime rate in rural India using a nationally representative survey. On the one hand, building roads in villages may increase connectivity, boost employment, and lead to better living standards, reducing criminal activities. On the other hand, if the benefits of roads are non-uniformly distributed among villagers, it may lead to higher inequality and possibly higher crime. We empirically test the relationship using the two waves of the Indian Human Development Survey. We use an instrumental variable estimation strategy and observe that building roads in rural parts of India has reduced crime. The findings are robust to relaxing the strict instrument exogeneity condition and using alternate measures. On exploring the pathways, we find that improved street lighting, better public bus services and higher employment are a few of the direct potential channels through which road infrastructure impedes crime. We also find a negative association between villages with roads and various types of inequality measures, confirming the broad economic benefits of roads. Our study also highlights that the negative impact of roads on crime is more pronounced in states with weaker institutions and higher income inequality.

【5】 Compensatory model for quantile estimation and application to VaR Link: https://arxiv.org/abs/2112.07278

Authors: Shuzhen Yang
Affiliations: Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University
Comments: 23 pages, 6 figures
Abstract: In contrast to the usual procedure of estimating the distribution of a time series and then obtaining the quantile from the distribution, we develop a compensatory model to improve the quantile estimation under a given distribution estimation. A novel penalty term is introduced in the compensatory model. We prove that the penalty term can control the convergence error of the quantile estimation of a given time series, and obtain an adaptive adjusted quantile estimation. Simulation and empirical analysis indicate that the compensatory model can significantly improve the performance of the value at risk (VaR) under a given distribution estimation.

【6】 Modal equilibrium of a tradable credit scheme with a trip-based MFD and logit-based decision-making Link: https://arxiv.org/abs/2112.07277

Authors: Louis Balzer, Ludovic Leclercq
Affiliations: Univ Gustave Eiffel, Univ Lyon, ENTPE, LICIT, F-, Lyon, France
Comments: Submitted to TR Part C
Abstract: The literature about tradable credit schemes (TCS) as a demand management system alleviating congestion has flourished in the past decade. Most proposed formulations are based on static models and thus do not account for the congestion dynamics. This paper considers elastic demand and implements a TCS to foster modal shift by restricting the number of cars allowed in the network over the day. A trip-based Macroscopic Fundamental Diagram (MFD) model represents the traffic dynamics at the whole urban scale. We assume the users have different OD pairs and choose between driving their car or riding transit following a logit model. We aim to compute the modal shares and credit price at equilibrium under TCS. The travel times are linearized with respect to the modal shares to improve the convergence. We then present a method to find the credit charge minimizing the total travel time alone or combined with the carbon emission. The proposed methodology is illustrated with a typical demand profile from 7:00 to 10:00 for Lyon Metropolis. We show that traffic dynamics and trip heterogeneity matter when deriving the modal equilibrium under a TCS. A method is described to compute the linearization of the travel times and compared against a classical descent method (MSA). The proposed linearization is a promising tool to circumvent the complexity of the implicit formulation of the trip-based MFD. Under an optimized TCS, the total travel time decreases by 17% and the carbon emission by 45% by increasing the PT share by 24 points.
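A toy sketch of the logit mode choice and the MSA baseline mentioned in this abstract. The linear congestion function, every parameter value and all function names are illustrative assumptions. The paper's trip-based MFD and its linearization are not reproduced here.

```python
import math

def car_share_logit(cost_car, cost_pt, theta=0.1):
    """Binary logit: probability of choosing the car given the two costs."""
    return 1.0 / (1.0 + math.exp(theta * (cost_car - cost_pt)))

def car_time(share, free_flow=20.0, slope=40.0):
    """Hypothetical congestion law: car travel time grows with the car share."""
    return free_flow + slope * share

def msa_equilibrium(credit_charge, pt_time=35.0, iters=500):
    """Method of successive averages on the car share (fixed-point search)."""
    share = 0.5
    for k in range(1, iters + 1):
        # Share the logit model would produce under the current congestion.
        target = car_share_logit(car_time(share) + credit_charge, pt_time)
        # Average the current share with the target, step size 1/k.
        share += (target - share) / k
    return share

share_no_charge = msa_equilibrium(0.0)
share_with_charge = msa_equilibrium(10.0)
```

In this toy setting a positive credit charge on car trips lowers the equilibrium car share, which is the modal shift the scheme aims for.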

【7】 Urban Housing Prices and Migration's Fertility Intentions: Based on the 2018 China Migrants' Dynamic Survey Link: https://arxiv.org/abs/2112.07273

Authors: Jingwen Tan, Shixi Kang
Affiliations: School of Economics, Henan University, Jinming Avenue, Kaifeng, China
Comments: 11 pages, 0 figures
Abstract: While the size of China's mobile population continues to expand, the fertility rate is significantly lower than the stable generation replacement level of the population, and the structural imbalance of human resource supply has attracted widespread attention. This paper uses LPM and Probit models to estimate the impact of house prices on the fertility intentions of the mobile population based on data from the 2018 National Mobile Population Dynamics Monitoring Survey. The lagged land sales price is used as an instrumental variable for house price to mitigate the potential endogeneity problem. The results show that for every 100% increase in the ratio of house price to household income of the mobile population, the fertility intention of the female mobile population of working age at the inflow location will decrease by 4.42%, and the marginal effect of relative house price on labor force fertility intention is EXP(-0.222); the sensitivity of mobile population fertility intention to house price is affected by the moderating effect of infrastructure construction at the inflow location. The willingness to have children in the inflow area is higher for female migrants of working age with lower age, smaller family size and higher education. Based on the above findings, the study attempts to provide a new practical perspective for the mainline institutional change and balanced economic development in China's economic transition phase.

【8】 Finding the instrumental variables of household registration: A discussion of the impact of China's household registration system on the citizenship of the migrant population Link: https://arxiv.org/abs/2112.07268

Authors: Jingwen Tan, Shixi Kang
Affiliations: School of Economics, Henan University, Jinming Avenue, Kaifeng, China
Comments: 12 pages, 2 figures
Abstract: Due to the specificity of China's dualistic household registration system and the differences in the rights and interests attached to it, household registration is prevalent as a control variable in the empirical evidence. In the context of family planning policies, this paper proposes to use family size and number of children as instrumental variables for household registration, discusses them qualitatively and verifies their relevance and exogeneity statistically, while empirically analyzing the impact of the household registration system on the citizenship of the mobile population. After controlling for city, individual control variables and fixed effects, the following conclusions are drawn: family size and number of children pass the over-identification test when used as instrumental variables for household registration; non-agricultural households have about 20.2% lower settlement intentions and 7.28% lower employment levels in inflow cities than agricultural households; the mechanism of the effect of the nature of household registration on employment still holds for the non-mobile population group.

【9】 How much flexibility is available for a just energy transition in Europe? Link: https://arxiv.org/abs/2112.07247

Authors: Tim T. Pedersen, Mikael Skou Andersen, Marta Victoria, Gorm B. Andresen
Highlights: • Monte Carlo study of uncertain national emission reduction targets in EU • Moderate increases in total system cost reveal a large range of alternatives to EU ETS • Strong anticorrelation of national emissions indicates carbon leakage given…
Abstract: The transition of Europe's energy supply towards carbon neutrality should be efficient, fair, and fast. In principle, the efficiency of the transition is ensured by the European Emissions Trading System (ETS), creating a common emissions market. Fairness is aimed for with the Effort Sharing Regulation, calibrated for the economic capacity of member states. These two pieces of legislation are aiming for a trade-off between efficiency and fairness. A Monte Carlo simulation with 30,000 samples of national reduction target configurations has been performed using an advanced energy system optimization model of electricity supply as of 2030. Results reveal a group of countries where emissions reductions beyond the national targets are, in most scenarios, economically favorable. Conversely, for some countries large abatement costs are unavoidable. Compared to the most cost-effective CO2 allocation, accepting a moderate increase in cost enables alternative CO2 emissions allocations that incorporate alternative justice-based distribution criteria.

【10】 Efficient differentiable quadratic programming layers: an ADMM approach Link: https://arxiv.org/abs/2112.07464

Authors: Andrew Butler, Roy Kwon
Affiliations: University of Toronto, Department of Mechanical and Industrial Engineering
Abstract: Recent advances in neural-network architecture allow for seamless integration of convex optimization problems as differentiable layers in an end-to-end trainable neural network. Integrating medium and large scale quadratic programs into a deep neural network architecture, however, is challenging as solving quadratic programs exactly by interior-point methods has worst-case cubic complexity in the number of variables. In this paper, we present an alternative network layer architecture based on the alternating direction method of multipliers (ADMM) that is capable of scaling to problems with a moderately large number of variables. Backward differentiation is performed by implicit differentiation of the residual map of a modified fixed-point iteration. Simulated results demonstrate the computational advantage of the ADMM layer, which for medium scaled problems is approximately an order of magnitude faster than the OptNet quadratic programming layer. Furthermore, our novel backward-pass routine is efficient, from both a memory and computation standpoint, in comparison to the standard approach based on unrolled differentiation or implicit differentiation of the KKT optimality conditions. We conclude with examples from portfolio optimization in the integrated prediction and optimization paradigm.
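To make the forward pass concrete, here is a minimal sketch of textbook ADMM applied to a box-constrained quadratic program. It is not the layer proposed in the paper: the backward pass (implicit differentiation of the fixed-point residual) is omitted, and the splitting, the penalty `rho`, the function name and the test problem are assumptions chosen for illustration.

```python
import numpy as np

def admm_box_qp(Q, q, lb, ub, rho=1.0, iters=500):
    """Minimise 0.5 x'Qx + q'x subject to lb <= x <= ub, using ADMM.

    Splitting: x carries the quadratic objective, z carries the box
    constraint, and the scaled dual variable u enforces x == z.
    """
    n = len(q)
    z = np.zeros(n)
    u = np.zeros(n)
    K = Q + rho * np.eye(n)  # system matrix of the x-update, fixed across iterations
    for _ in range(iters):
        x = np.linalg.solve(K, rho * (z - u) - q)  # unconstrained quadratic step
        z = np.clip(x + u, lb, ub)                 # projection onto the box
        u = u + x - z                              # dual update
    return z

Q = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([-1.0, -1.0])
x_free = admm_box_qp(Q, q, 0.0, 1.0)  # box is inactive at the optimum
x_box = admm_box_qp(Q, q, 0.0, 0.5)   # upper bound binds on the second variable
```

Because `K` does not change between iterations, a real implementation would factor it once and reuse the factorization. That per-iteration cheapness is the general motivation for choosing ADMM over interior-point solves at larger scale.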

【11】 Ride-Sourcing Platforms with Mixed Autonomy: How will Autonomous Vehicles Affect Others on a Ride-Sourcing Network? Link: https://arxiv.org/abs/2112.07218

Authors: Di Ao, Zhijie Lai, Sen Li
Affiliations: Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology
Abstract: This paper investigates how autonomous vehicles (AV) affect passengers, for-hire human drivers, and the platform on a ride-sourcing network. We consider a ride-sourcing market where a mixture of AVs and human drivers is deployed by the platform to provide mobility-on-demand services under labor regulations. In this market, the ride-sourcing platform determines the spatial prices, fleet size, human driver payments, and vehicle relocation strategies to maximize its profit, while individual passengers choose between different transport modes to minimize their travel costs. A market equilibrium model is proposed to capture the interactions among passengers, human drivers, AVs, and the ride-sourcing platform over the network. The overall problem is formulated as a non-convex program with network constraints, and an algorithm is developed to derive its approximate solution with a performance guarantee. Our study shows that the ride-sourcing platform prioritizes AV deployment in high-demand areas to make a higher profit. As AVs flood into these high-demand areas, they compete with human drivers in the urban core and push them to relocate to suburbs. This leads to reduced earning opportunities for human drivers and increased spatial inequity for passengers. We also show that placing a wage floor may protect drivers from the negative impact of AVs, while the total vehicle supply and passenger demand are almost unaffected. However, there exists a threshold beyond which the minimum wage will trigger a paradigm shift of labor supply where the platform will completely replace all human drivers with AVs. This indicates that the minimum wage should be carefully designed in the mixed environment to avoid the loss of job opportunities for human drivers. These results are validated with realistic case studies for San Francisco.

【12】 Data-driven integration of regularized mean-variance portfolios Link: https://arxiv.org/abs/2112.07016

Authors: Andrew Butler, Roy H. Kwon
Affiliations: University of Toronto, Department of Mechanical and Industrial Engineering
Abstract: Mean-variance optimization (MVO) is known to be highly sensitive to estimation error in its inputs. Recently, norm penalization of MVO programs has proven to be an effective regularization technique that can help mitigate the adverse effects of estimation error. In this paper, we augment the standard MVO program with a convex combination of parameterized $L_1$ and $L_2$ norm penalty functions. The resulting program is a parameterized penalized quadratic program (PPQP) whose primal and dual form are shown to be constrained quadratic programs (QPs). We make use of recent advances in neural-network architecture for differentiable QPs and present a novel, data-driven stochastic optimization framework for optimizing parameterized regularization structures in the context of the final decision-based MVO problem. The framework is highly flexible and capable of jointly optimizing both prediction and regularization model parameters in a fully integrated manner. We provide several historical simulations using global futures data and highlight the benefits and flexibility of the stochastic optimization approach.
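A rough sketch of what a norm-penalized MVO program can look like, solved here with plain proximal gradient descent. The elastic-net form of the penalty, the unconstrained setting, the solver choice and all names are assumptions made for illustration. The paper's PPQP formulation and its differentiable-QP training loop are not reproduced.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net_mvo(cov, mu, lam, alpha, iters=2000):
    """Minimise 0.5 w'Cw - mu'w + lam * (alpha*||w||_1 + 0.5*(1-alpha)*||w||_2^2)."""
    ridge = lam * (1.0 - alpha)
    # Step size 1/L, with L the Lipschitz constant of the smooth part's gradient.
    step = 1.0 / (np.linalg.eigvalsh(cov)[-1] + ridge)
    w = np.zeros(len(mu))
    for _ in range(iters):
        grad = cov @ w - mu + ridge * w  # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam * alpha)
    return w

cov = np.array([[0.04, 0.01], [0.01, 0.09]])
mu = np.array([0.05, 0.02])
w_ridge = elastic_net_mvo(cov, mu, lam=0.1, alpha=0.0)  # pure L2 penalty
w_lasso = elastic_net_mvo(cov, mu, lam=0.1, alpha=1.0)  # pure L1 penalty
```

With `alpha=0.0` the result matches the ridge closed form `(cov + lam*I)^{-1} mu`. With `alpha=1.0` and `lam` at least as large as every entry of `|mu|`, all weights are shrunk to zero. In the paper's setting `lam` and `alpha` would be learned from data instead of fixed by hand.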

2. cs.SD (Speech):

【1】 On the Use of External Data for Spoken Named Entity Recognition Link: https://arxiv.org/abs/2112.07648

Authors: Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han
Affiliations: ASAPP, Toyota Technological Institute at Chicago
Abstract: Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline (speech recognition followed by text NER model) approaches. We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing for example that end-to-end models are able to focus on the more NER-specific words.

【2】 End-to-end speaker diarization with transformer Link: https://arxiv.org/abs/2112.07463

Authors: Yongquan Lai, Xin Tang, Yuanyuan Fu, Rui Fang
Affiliations: Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
Comments: submitted to icassp2022
Abstract: Speaker diarization is connected to semantic segmentation in computer vision. Inspired by MaskFormer [cheng2021per], which treats semantic segmentation as a set-prediction problem, we propose an end-to-end approach to predict a set of targets consisting of binary masks, vocal activities and speaker vectors. Our model, which we coin DiFormer, is mainly based on a speaker encoder and a feature pyramid network (FPN) module to extract multi-scale speaker features, which are then fed into a transformer encoder-decoder to predict a set of diarization targets from learned query embeddings. To account for the temporal characteristics of the speech signal, bidirectional LSTMs are inserted into the mask prediction module to improve temporal consistency. Our model handles an unknown number of speakers, speech overlaps, as well as vocal activity detection in a unified way. Experiments on multimedia and meeting datasets demonstrate the effectiveness of our approach.

【3】 Supervised Learning for Multi Zone Sound Field Reproduction under Harsh Environmental Conditions Link: https://arxiv.org/abs/2112.07349

Authors: Henry Sallandt, Philipp Krah, Mathias Lemke
Affiliations: Institute of Fluid Mechanics and Engineering Acoustics, Technical University Berlin, Müller-Breslau-Str. , Berlin, Germany; Institute of Mathematics, Straße des ,. Juni , Berlin, Germany
Comments: Preprint submitted for publication
Abstract: This manuscript presents an approach for multi zone sound field reproduction using supervised learning. Traditional multi zone sound field reproduction methods assume constant speed of sound, neglecting nonlinear effects like wind and temperature stratification. We show how to overcome these restrictions using supervised learning of transfer functions. The quality of the solution is measured by the acoustic contrast and the reproduction error. Our results show that for the chosen setup, even with relatively small wind speeds, the acoustic contrast and reproduction error can be improved by up to 16 dB when wind is considered in the trained model.
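The abstract names two quality metrics without defining them. The sketch below uses definitions that are common in the sound-zone literature, taken here as assumptions since the paper's exact normalization is not given: contrast as the ratio of mean squared pressure between the bright and dark zones, and reproduction error as the normalized squared deviation from a target field. Function names and sample values are hypothetical.

```python
import math

def acoustic_contrast_db(p_bright, p_dark):
    """Mean squared pressure in the bright zone over the dark zone, in dB."""
    e_bright = sum(abs(p) ** 2 for p in p_bright) / len(p_bright)
    e_dark = sum(abs(p) ** 2 for p in p_dark) / len(p_dark)
    return 10.0 * math.log10(e_bright / e_dark)

def reproduction_error_db(p_actual, p_target):
    """Squared deviation from the target field, normalised by target energy, in dB."""
    err = sum(abs(a - t) ** 2 for a, t in zip(p_actual, p_target))
    ref = sum(abs(t) ** 2 for t in p_target)
    return 10.0 * math.log10(err / ref)

# Complex pressures at two control points per zone (made-up values).
contrast = acoustic_contrast_db([1 + 0j, 1j], [0.1 + 0j, 0.1j])
error = reproduction_error_db([1.1 + 0j], [1.0 + 0j])
```

A higher contrast and a lower (more negative) reproduction error both indicate better zone separation and fidelity.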

【4】 Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: cough, breath, and voice Link: https://arxiv.org/abs/2112.07285

Authors: Kranthi Kumar Lella, Alphonse Pja
Affiliations: Department of Computer Applications, NIT Tiruchirappalli, Tamil Nadu, India
Abstract: Respiratory sound classification has attracted considerable attention from clinical scientists and medical researchers over the last year as a means of diagnosing COVID-19. To date, various Artificial Intelligence (AI) models have been deployed in the real world to detect COVID-19 from human-generated sounds such as voice/speech, cough, and breath. Convolutional Neural Network (CNN) models are used to solve many real-world AI problems. In this context, a one-dimensional (1D) CNN is proposed and implemented to diagnose COVID-19 respiratory disease from human respiratory sounds such as voice, cough, and breath. An augmentation-based mechanism is applied to improve the preprocessing performance of the COVID-19 sounds dataset and to automate COVID-19 disease diagnosis using the 1D convolutional network. Furthermore, a DDAE (Data De-noising Auto Encoder) technique is used to generate deep sound features as the input to the 1D CNN instead of the standard MFCC (Mel-frequency cepstral coefficient) input, and it achieves better accuracy and performance than previous models.

【5】 Noise Reduction and Driving Event Extraction Method for Performance Improvement on Driving Noise-based Surface Anomaly Detection Link: https://arxiv.org/abs/2112.07214

Authors: YeongHyeon Park, JoonSung Lee, Myung Jin Kim, Wonseok Park
Affiliations: SK Planet Co., Ltd.
Comments: 3 pages, 3 figures, 2 tables
Abstract: Foreign substances on the road surface, such as rainwater or black ice, reduce the friction between the tire and the surface. This reduces braking performance and makes it difficult to control the vehicle body posture. In that case, there is at least a possibility of property damage; in the worst case, personal injury will occur. To avoid this problem, a road anomaly detection model based on vehicle driving noise has been proposed. However, the prior proposal does not consider extra noise mixed with the driving noise, nor does it skip calculations for moments without vehicle driving. In this paper, we propose a simple driving event extraction method and a noise reduction method for improving computational efficiency and anomaly detection performance.

【6】 Cross-modal Music Emotion Recognition Using Composite Loss-based Embeddings Link: https://arxiv.org/abs/2112.07192

Authors: Naoki Takashima, Frédéric Li, Marcin Grzegorzek, Kimiaki Shirahama
Affiliations: Shirahama is member of the Department of Informatics
Comments: 12 pages, 5 figures
Abstract: Most music emotion recognition approaches use one-way classification or regression that estimates a general emotion from a distribution of music samples, but without considering emotional variations (e.g., happiness can be further categorised into much, moderate or little happiness). We propose a cross-modal music emotion recognition approach that associates music samples with emotions in a common space by considering both their general and specific characteristics. Since the association of music samples with emotions is uncertain due to subjective human perceptions, we compute composite loss-based embeddings obtained to maximise two statistical characteristics, one being the correlation between music samples and emotions based on canonical correlation analysis, and the other being a probabilistic similarity between a music sample and an emotion with KL-divergence. Experiments on two benchmark datasets demonstrate the superiority of our approach over one-way baselines. In addition, detailed analyses show that our approach can accomplish robust cross-modal music emotion recognition that not only identifies music samples matching a specific emotion but also detects emotions expressed in a certain music sample.

【7】 Explore Long-Range Context feature for Speaker Verification Link: https://arxiv.org/abs/2112.07134

Authors: Zhuo Li
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese, University of Chinese Academy of Sciences, Beijing, China
Comments: rejected by interspeech2021
Abstract: Capturing long-range dependency and modeling long temporal contexts has been shown to benefit speaker verification tasks. In this paper, we propose the combination of the Hierarchical-Split block (HS-block) and the Depthwise Separable Self-Attention (DSSA) module to capture richer multi-range context speaker features from a local and a global perspective, respectively. Specifically, the HS-block splits the feature map and filters into several groups and stacks them in one block, which enlarges the receptive fields (RFs) locally. The DSSA module improves the multi-head self-attention mechanism with a depthwise-separable strategy and an explicit sparse attention strategy to model the pairwise relations globally and capture effective long-range dependencies in each channel. Experiments are conducted on Voxceleb and SITW. Our best system achieves 1.27% EER on the Voxceleb1 test set and 1.56% on SITW by applying the combination of the HS-block and the DSSA module.

【8】 Real-Time Neural Voice Camouflage Link: https://arxiv.org/abs/2112.07076

Authors: Mia Chiquier, Chengzhi Mao, Carl Vondrick Affiliations: Department of Computer Science, Columbia University, New York, NY Comments: 14 pages Abstract: Automatic speech recognition systems have created exciting possibilities for applications; however, they also enable opportunities for systematic eavesdropping. We propose a method to camouflage a person's voice over-the-air from these systems without inconveniencing the conversation between people in the room. Standard adversarial attacks are not effective in real-time streaming situations because the characteristics of the signal will have changed by the time the attack is executed. We introduce predictive attacks, which achieve real-time performance by forecasting the attack that will be the most effective in the future. Under real-time constraints, our method jams the established speech recognition system DeepSpeech 4.17x more than baselines as measured through word error rate, and 7.27x more as measured through character error rate. We furthermore demonstrate that our approach is practically effective in realistic environments over physical distances.
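The jamming results above are measured with word error rate (WER) and character error rate (CER). As a small sketch of how these metrics are usually defined (edit distance divided by reference length), the helpers below use a plain Levenshtein distance; the function names are illustrative and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, transcribing "turn the lights on" as "turn the light on" is one substituted word out of four, so the WER is 0.25, while the CER is much lower because only one character is missing.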

【9】 Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition Link: https://arxiv.org/abs/2112.07011

Authors: Marco Rasetto, Juan P. Dominguez-Morales, Angel Jimenez-Fernandez, Ryad Benosman Affiliations: Department of Bioengineering and Center for Neural Basis of Cognition, University of Pittsburgh, Carnegie Mellon University, Pittsburgh, USA, Robotics and Technology of Computers Lab, Universidad de Sevilla, Seville, Spain Comments: 10 pages, 7 figures Abstract: In recent years, tremendous efforts have been made to advance the state of the art in Natural Language Processing (NLP) and audio recognition. However, these efforts have often translated into increased power consumption and memory requirements for bigger and more complex models. These solutions fall short of the constraints of IoT devices, which need low-power, memory-efficient computation, and therefore fail to meet the growing demand for efficient edge computing. Neuromorphic systems have proved to be excellent candidates for low-power, low-latency computation in a multitude of applications. For this reason, we present a neuromorphic architecture capable of unsupervised auditory feature recognition. We then validate the network on a subset of Google's Speech Commands dataset.

【10】 Decoding High-level Imagined Speech using Attention-based Deep Neural Networks Link: https://arxiv.org/abs/2112.06922

Authors: Dae-Hyeok Lee, Sung-Jin Kim, Keon-Woo Lee Affiliations: Dept. Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea, Dept. Artificial Intelligence Comments: 4 pages, 2 figures Abstract: Brain-computer interface (BCI) is a technology that enables communication between humans and devices by reflecting the status and intentions of humans. When conducting imagined speech, users imagine the pronunciation as if actually speaking. Decoding imagined speech-based EEG signals allows complex tasks to be conducted more intuitively, but decoding performance is lower than that of other BCI paradigms. We modified our previous model for decoding imagined speech-based EEG signals. Ten subjects participated in the experiment. The average accuracy of our proposed method was 0.5648 for classifying four words. In other words, our proposed method has significant strength in learning local features. Hence, we demonstrated the feasibility of decoding imagined speech-based EEG signals with robust performance.

【11】 Visualizing Ensemble Predictions of Music Mood Link: https://arxiv.org/abs/2112.07627

Authors: Zelin Ye, Min Chen Affiliations: University of Oxford, UK Comments: 10 pages, 7 figures, submitted to EuroVis 2022 Abstract: Music mood classification has been a challenging problem in comparison with some other classification problems (e.g., genre, composer, or period). One solution for addressing this challenge is to use an ensemble of machine learning models. In this paper, we show that visualization techniques can effectively convey the popular prediction as well as the uncertainty at different music sections along the temporal axis, while enabling the analysis of individual ML models in conjunction with their application to different musical data. In addition to traditional visual designs, such as the stacked line graph, ThemeRiver, and pixel-based visualization, we introduce a new variant of ThemeRiver, called "dual-flux ThemeRiver", which allows viewers to observe and measure the most popular prediction more easily than the stacked line graph and ThemeRiver. Testing indicates that visualizing ensemble predictions is helpful both in model-development workflows and for annotating music using model predictions.

【12】 Robustifying automatic speech recognition by extracting slowly varying features Link: https://arxiv.org/abs/2112.07400

Authors: Matias Pizarro, Dorothea Kolossa, Asja Fischer Affiliations: Ruhr University Bochum, Germany Abstract: In the past few years, it has been shown that deep learning systems are highly vulnerable to attacks with adversarial examples. Neural-network-based automatic speech recognition (ASR) systems are no exception. Targeted and untargeted attacks can modify an audio input signal in such a way that humans still recognise the same words, while ASR systems are steered to predict a different transcription. In this paper, we propose a defense mechanism against targeted adversarial attacks that consists of removing fast-changing features from the audio signals, either by applying slow feature analysis, a low-pass filter, or both, before feeding the input to the ASR system. We perform an empirical analysis of hybrid ASR models trained on data pre-processed in this way. While the resulting models perform quite well on benign data, they are significantly more robust against targeted adversarial attacks: our final proposed model shows performance on clean data similar to the baseline model, while being more than four times more robust.

【13】 Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model Link: https://arxiv.org/abs/2112.07254

Authors: Keqi Deng, Songjun Cao, Yike Zhang, Long Ma Affiliations: Tencent Cloud Xiaowei, Beijing, China, Institute of Acoustics, Chinese Academy of Sciences, China, University of Chinese Academy of Sciences, China Comments: ASRU2021 Abstract: Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully utilize self-supervised pretraining methods because its decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus and achieve a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, our proposed CTC/attention-based Preformer yields a 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and LM in an S2S ASR system.
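For readers unfamiliar with hybrid CTC/attention training, the objective is commonly written as an interpolation of the two losses, and a relative CER reduction is a ratio against the baseline. The formulas below follow the usual formulation in the hybrid CTC/attention literature and are not quoted from this abstract; the weight $\lambda$ and the symbol names are assumptions for illustration.

```latex
\mathcal{L}_{\text{joint}} = \lambda\,\mathcal{L}_{\text{CTC}} + (1-\lambda)\,\mathcal{L}_{\text{att}},
\qquad 0 \le \lambda \le 1

\text{relative CER reduction} =
\frac{\text{CER}_{\text{baseline}} - \text{CER}_{\text{proposed}}}{\text{CER}_{\text{baseline}}}
```

Using CTC "during training and inference", as the abstract puts it, typically means the same kind of weighted combination is also applied to the CTC and decoder scores when ranking hypotheses during decoding.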

【14】 Spatiogram: A phase based directional angular measure and perceptual weighting for ensemble source width Link: https://arxiv.org/abs/2112.07216

Authors: Arthi S, Sreenivas T V Affiliations: Indian Institute of Science, Bangalore-, India Comments: 12 pages, 11 figures Abstract: In concert hall studies, inter-aural cross-correlation (IACC), which is signal dependent, is used as a measure of perceptual source width. The same measure is also used for perceptual source width in the case of distributed sources. In this work, we examine the validity of IACC for both cases and develop an improved measure for ensemble-like distributed sources. We decompose the new objective measure for perceptual ensemble source width (ESW) into two components: (i) a phase-based directional angular measure, which is timbre independent (spatial measure), and (ii) mean time-bandwidth energy (MTBE), a perceptual weight (timbre measure). This combination of spatial and timbral measures can be extended as an alternate measure for determining auditory source width (ASW) and listener envelopment (LEV) of arbitrary signals in concert-hall and room acoustics.
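The IACC baseline that this abstract examines is conventionally the peak of the normalised cross-correlation between the two ear signals within about ±1 ms of lag. The sketch below implements that conventional definition in plain Python; it is not the proposed spatiogram measure, and the function name and default lag window are assumptions.

```python
import math

def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Peak of the normalised interaural cross-correlation within +/- max_lag_ms."""
    if len(left) != len(right):
        raise ValueError("both ear signals must have the same length")
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    if norm == 0:
        return 0.0
    max_lag = int(round(sample_rate * max_lag_ms / 1000))
    n = len(left)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[t] with right[t + lag] over the overlapping samples.
        acc = sum(left[t] * right[t + lag]
                  for t in range(max(0, -lag), min(n, n - lag)))
        best = max(best, abs(acc) / norm)
    return best

# Identical ear signals (a hypothetical 440 Hz tone sampled at 8 kHz) give IACC near 1.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
coherent = iacc(tone, tone, sample_rate=8000)
```

Values near 1 indicate highly similar ear signals, which is usually associated with a narrow perceived source, while lower values are associated with a wider one. Because the result depends on the signal content, it illustrates the signal dependence that the abstract points out.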

【15】 ImportantAug: a data augmentation agent for speech Link: https://arxiv.org/abs/2112.07156

Authors: Viet Anh Trinh, Hassan Salami Kavaki, Michael I Mandel Affiliations: The Graduate Center, CUNY, New York, USA, Brooklyn College, CUNY, New York, USA Comments: Submitted to ICASSP 2022 Abstract: We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation, which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added.
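The core idea described above, adding noise only where the speech is unimportant, can be illustrated with a hand-written importance mask. In the sketch below the mask is supplied by the caller, whereas the paper predicts it with a trained agent; the per-sample mask layout, the Gaussian noise, and all names are assumptions made for illustration.

```python
import random

def add_masked_noise(speech, importance, noise_scale=0.1, seed=0):
    """Add Gaussian noise scaled by (1 - importance) to each sample."""
    if len(speech) != len(importance):
        raise ValueError("speech and importance must have the same length")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [s + (1.0 - m) * rng.gauss(0.0, noise_scale)
            for s, m in zip(speech, importance)]

# Hypothetical signal: the first half is marked important (1.0), the rest is not (0.0).
speech = [0.5] * 6
importance = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
augmented = add_masked_noise(speech, importance)
```

Samples with importance 1.0 pass through unchanged and samples with importance 0.0 receive the full noise, so intermediate values would interpolate between the two.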

3. eess.AS Audio Processing:

【1】 Visualizing Ensemble Predictions of Music Mood Link: https://arxiv.org/abs/2112.07627

Authors: Zelin Ye, Min Chen Affiliations: University of Oxford, UK Comments: 10 pages, 7 figures, submitted to EuroVis 2022 Abstract: Music mood classification has been a challenging problem in comparison with some other classification problems (e.g., genre, composer, or period). One solution for addressing this challenge is to use an ensemble of machine learning models. In this paper, we show that visualization techniques can effectively convey the popular prediction as well as the uncertainty at different music sections along the temporal axis, while enabling the analysis of individual ML models in conjunction with their application to different musical data. In addition to traditional visual designs, such as the stacked line graph, ThemeRiver, and pixel-based visualization, we introduce a new variant of ThemeRiver, called "dual-flux ThemeRiver", which allows viewers to observe and measure the most popular prediction more easily than the stacked line graph and ThemeRiver. Testing indicates that visualizing ensemble predictions is helpful both in model-development workflows and for annotating music using model predictions.

【2】 Robustifying automatic speech recognition by extracting slowly varying features Link: https://arxiv.org/abs/2112.07400

Authors: Matias Pizarro, Dorothea Kolossa, Asja Fischer Affiliations: Ruhr University Bochum, Germany Abstract: In the past few years, it has been shown that deep learning systems are highly vulnerable to attacks with adversarial examples. Neural-network-based automatic speech recognition (ASR) systems are no exception. Targeted and untargeted attacks can modify an audio input signal in such a way that humans still recognise the same words, while ASR systems are steered to predict a different transcription. In this paper, we propose a defense mechanism against targeted adversarial attacks that consists of removing fast-changing features from the audio signals, either by applying slow feature analysis, a low-pass filter, or both, before feeding the input to the ASR system. We perform an empirical analysis of hybrid ASR models trained on data pre-processed in this way. While the resulting models perform quite well on benign data, they are significantly more robust against targeted adversarial attacks: our final proposed model shows performance on clean data similar to the baseline model, while being more than four times more robust.

【3】 Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model Link: https://arxiv.org/abs/2112.07254

Authors: Keqi Deng, Songjun Cao, Yike Zhang, Long Ma Affiliations: Tencent Cloud Xiaowei, Beijing, China, Institute of Acoustics, Chinese Academy of Sciences, China, University of Chinese Academy of Sciences, China Comments: ASRU2021 Abstract: Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully utilize self-supervised pretraining methods because its decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus and achieve a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, our proposed CTC/attention-based Preformer yields a 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and LM in an S2S ASR system.

【4】 Spatiogram: A phase based directional angular measure and perceptual weighting for ensemble source width Link: https://arxiv.org/abs/2112.07216

Authors: Arthi S, Sreenivas T V Affiliations: Indian Institute of Science, Bangalore-, India Comments: 12 pages, 11 figures Abstract: In concert hall studies, inter-aural cross-correlation (IACC), which is signal dependent, is used as a measure of perceptual source width. The same measure is also used for perceptual source width in the case of distributed sources. In this work, we examine the validity of IACC for both cases and develop an improved measure for ensemble-like distributed sources. We decompose the new objective measure for perceptual ensemble source width (ESW) into two components: (i) a phase-based directional angular measure, which is timbre independent (spatial measure), and (ii) mean time-bandwidth energy (MTBE), a perceptual weight (timbre measure). This combination of spatial and timbral measures can be extended as an alternate measure for determining auditory source width (ASW) and listener envelopment (LEV) of arbitrary signals in concert-hall and room acoustics.

【5】 ImportantAug: a data augmentation agent for speech Link: https://arxiv.org/abs/2112.07156

Authors: Viet Anh Trinh, Hassan Salami Kavaki, Michael I Mandel Affiliations: The Graduate Center, CUNY, New York, USA, Brooklyn College, CUNY, New York, USA Comments: Submitted to ICASSP 2022 Abstract: We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation, which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added.

【6】 On the Use of External Data for Spoken Named Entity Recognition Link: https://arxiv.org/abs/2112.07648

Authors: Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han Affiliations: ASAPP, Toyota Technological Institute at Chicago Abstract: Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work, we focus on low-resource spoken named entity recognition (NER) and address the question: beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches (speech recognition followed by a text NER model). We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations alone. Compared to prior work, we find improved F1 scores of up to 16%. While the best baseline model is a pipeline approach, the best performance when using external data is ultimately achieved by an end-to-end model. We provide detailed comparisons and analyses, showing for example that end-to-end models are able to focus on the more NER-specific words.

【7】 End-to-end speaker diarization with transformer Link: https://arxiv.org/abs/2112.07463

Authors: Yongquan Lai, Xin Tang, Yuanyuan Fu, Rui Fang Affiliations: Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China Comments: submitted to ICASSP 2022 Abstract: Speaker diarization is connected to semantic segmentation in computer vision. Inspired by MaskFormer (Cheng et al., 2021), which treats semantic segmentation as a set-prediction problem, we propose an end-to-end approach to predict a set of targets consisting of binary masks, vocal activities and speaker vectors. Our model, which we coin DiFormer, is mainly based on a speaker encoder and a feature pyramid network (FPN) module to extract multi-scale speaker features, which are then fed into a transformer encoder-decoder to predict a set of diarization targets from learned query embeddings. To account for the temporal characteristics of the speech signal, bidirectional LSTMs are inserted into the mask prediction module to improve temporal consistency. Our model handles an unknown number of speakers, speech overlaps, and vocal activity detection in a unified way. Experiments on multimedia and meeting datasets demonstrate the effectiveness of our approach.

【8】 Supervised Learning for Multi Zone Sound Field Reproduction under Harsh Environmental Conditions Link: https://arxiv.org/abs/2112.07349

Authors: Henry Sallandt, Philipp Krah, Mathias Lemke Affiliations: Institute of Fluid Mechanics and Engineering Acoustics, Technical University Berlin, Müller-Breslau-Str. , Berlin, Germany, Institute of Mathematics, Straße des ,. Juni , Berlin, Germany Comments: Preprint submitted for publication Abstract: This manuscript presents an approach for multi zone sound field reproduction using supervised learning. Traditional multi zone sound field reproduction methods assume a constant speed of sound, neglecting nonlinear effects like wind and temperature stratification. We show how to overcome these restrictions using supervised learning of transfer functions. The quality of the solution is measured by the acoustic contrast and the reproduction error. Our results show that, for the chosen setup and even with relatively small wind speeds, the acoustic contrast and reproduction error can be improved by up to 16 dB when wind is considered in the trained model.
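Acoustic contrast, one of the two quality measures named above, is commonly defined as the ratio of mean squared sound pressure in the bright (listening) zone to that in the dark (quiet) zone, expressed in dB. The sketch below follows that common definition; the abstract does not state the paper's exact formula, so the function and the sample pressures are assumptions.

```python
import math

def acoustic_contrast_db(bright_pressures, dark_pressures):
    """Ratio of mean squared pressure in the bright zone to the dark zone, in dB."""
    # abs() lets the same code accept real or complex pressure values.
    bright = sum(abs(p) ** 2 for p in bright_pressures) / len(bright_pressures)
    dark = sum(abs(p) ** 2 for p in dark_pressures) / len(dark_pressures)
    return 10.0 * math.log10(bright / dark)

# Hypothetical pressures at two control points per zone.
contrast = acoustic_contrast_db([1.0, 1.0], [0.1, 0.1])
```

With these values the bright zone carries 100 times the energy of the dark zone, which corresponds to about 20 dB. On this scale, the 16 dB improvement quoted in the abstract would correspond to roughly a 40-fold change in the energy ratio.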

【9】 Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: cough, breath, and voice Link: https://arxiv.org/abs/2112.07285

Authors: Kranthi Kumar Lella, Alphonse Pja Affiliations: Department of Computer Applications, NIT Tiruchirappalli, Tamil Nadu, India Abstract: Respiratory sound classification has attracted considerable attention from clinical scientists and medical researchers over the last year as a means of diagnosing COVID-19. To date, various Artificial Intelligence (AI) models have entered the real world to detect COVID-19 from human-generated sounds such as voice/speech, cough, and breath. The Convolutional Neural Network (CNN) model is used to solve many real-world problems on AI-based machines. In this context, a one-dimensional (1D) CNN is proposed and implemented to diagnose COVID-19 respiratory disease from human respiratory sounds such as voice, cough, and breath. An augmentation-based mechanism is applied to improve the preprocessing performance of the COVID-19 sounds dataset and to automate COVID-19 disease diagnosis using the 1D convolutional network. Furthermore, a DDAE (Data De-noising Auto Encoder) technique is used to generate deep sound features as the input to the 1D CNN instead of adopting the standard MFCC (Mel-frequency cepstral coefficient) input, and it achieves better accuracy and performance than previous models.

【10】 Noise Reduction and Driving Event Extraction Method for Performance Improvement on Driving Noise-based Surface Anomaly Detection Link: https://arxiv.org/abs/2112.07214

Authors: YeongHyeon Park, JoonSung Lee, Myung Jin Kim, Wonseok Park Affiliations: SK Planet Co., Ltd. Comments: 3 pages, 3 figures, 2 tables Abstract: Foreign substances on the road surface, such as rainwater or black ice, reduce the friction between the tire and the surface. This reduces braking performance and makes it difficult to control the vehicle body posture. In that case, property damage is possible at the least; in the worst case, personal injury will occur. To avoid this problem, a road anomaly detection model based on vehicle driving noise has been proposed. However, the prior proposal considers neither the extra noise mixed with the driving noise nor skipping calculations for moments without vehicle driving. In this paper, we propose a simple driving event extraction method and a noise reduction method for improving computational efficiency and anomaly detection performance.

【11】 Cross-modal Music Emotion Recognition Using Composite Loss-based Embeddings Link: https://arxiv.org/abs/2112.07192

Authors: Naoki Takashima, Frédéric Li, Marcin Grzegorzek, Kimiaki Shirahama Affiliations: Shirahama is a member of the Department of Informatics Comments: 12 pages, 5 figures Abstract: Most music emotion recognition approaches use one-way classification or regression that estimates a general emotion from a distribution of music samples, without considering emotional variations (e.g., happiness can be further categorised into much, moderate or little happiness). We propose a cross-modal music emotion recognition approach that associates music samples with emotions in a common space by considering both their general and specific characteristics. Since the association of music samples with emotions is uncertain due to subjective human perception, we compute composite loss-based embeddings that maximise two statistical characteristics: the correlation between music samples and emotions based on canonical correlation analysis, and a probabilistic similarity between a music sample and an emotion measured with KL-divergence. Experiments on two benchmark datasets demonstrate the superiority of our approach over one-way baselines. In addition, detailed analysis shows that our approach can accomplish robust cross-modal music emotion recognition that not only identifies music samples matching a specific emotion but also detects emotions expressed in a given music sample.

【12】 Explore Long-Range Context feature for Speaker Verification Link: https://arxiv.org/abs/2112.07134

Authors: Zhuo Li Affiliations: Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese, University of Chinese Academy of Sciences, Beijing, China Comments: rejected by Interspeech 2021 Abstract: Capturing long-range dependencies and modeling long temporal contexts have been shown to benefit speaker verification. In this paper, we propose combining the Hierarchical-Split block (HS-block) and the Depthwise Separable Self-Attention (DSSA) module to capture richer multi-range context speaker features from a local and a global perspective, respectively. Specifically, the HS-block splits the feature map and filters into several groups and stacks them in one block, which enlarges the receptive fields (RFs) locally. The DSSA module improves the multi-head self-attention mechanism with a depthwise-separable strategy and an explicit sparse attention strategy to model pairwise relations globally and capture effective long-range dependencies in each channel. Experiments are conducted on VoxCeleb and SITW. Our best system achieves 1.27% EER on the VoxCeleb1 test set and 1.56% on SITW by applying the combination of the HS-block and the DSSA module.

【13】 Real-Time Neural Voice Camouflage Link: https://arxiv.org/abs/2112.07076

Authors: Mia Chiquier, Chengzhi Mao, Carl Vondrick Affiliations: Department of Computer Science, Columbia University, New York, NY Comments: 14 pages Abstract: Automatic speech recognition systems have created exciting possibilities for applications; however, they also enable opportunities for systematic eavesdropping. We propose a method to camouflage a person's voice over-the-air from these systems without inconveniencing the conversation between people in the room. Standard adversarial attacks are not effective in real-time streaming situations because the characteristics of the signal will have changed by the time the attack is executed. We introduce predictive attacks, which achieve real-time performance by forecasting the attack that will be the most effective in the future. Under real-time constraints, our method jams the established speech recognition system DeepSpeech 4.17x more than baselines as measured through word error rate, and 7.27x more as measured through character error rate. We furthermore demonstrate our approach is practically effective in realistic environments over physical distances.
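The two metrics named in this abstract, word error rate (WER) and character error rate (CER), are both edit distances normalised by the length of the reference transcript. The following is a minimal sketch of the standard definitions in plain Python, not the authors' evaluation pipeline. The function names and the example sentences are illustrative assumptions, and the sketch applies no text normalisation such as lower-casing or punctuation stripping.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words, divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Edit distance over characters (spaces included), divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up transcripts: a clean reference and a recogniser output.
reference = "turn on the kitchen lights"
hypothesis = "turn of the chicken lights"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
print(f"CER = {char_error_rate(reference, hypothesis):.2f}")
```

A higher WER or CER on the recogniser's output means the transcription is further from what was actually said, which is the sense in which the abstract measures how strongly a method jams the recogniser.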

【14】 Event Based Time-Vectors for auditory features extraction: a neuromorphic approach for low power audio recognition Link: https://arxiv.org/abs/2112.07011

Authors: Marco Rasetto, Juan P. Dominguez-Morales, Angel Jimenez-Fernandez, Ryad Benosman Affiliations: Department of Bioengineering and Center for Neural Basis of Cognition, University of Pittsburgh, Carnegie Mellon University, Pittsburgh, USA, Robotics and Technology of Computers Lab, Universidad de Sevilla, Seville, Spain Comments: 10 pages, 7 figures Abstract: In recent years tremendous efforts have been made to advance the state of the art for Natural Language Processing (NLP) and audio recognition. However, these efforts often translated into increased power consumption and memory requirements for bigger and more complex models. These solutions fall short of the constraints of IoT devices, which need low-power, low-memory efficient computation, and therefore they fail to meet the growing demand for efficient edge computing. Neuromorphic systems have proved to be excellent candidates for low-power low-latency computation in a multitude of applications. For this reason we present a neuromorphic architecture capable of unsupervised auditory feature recognition. We then validate the network on a subset of Google's Speech Commands dataset.

【15】 Decoding High-level Imagined Speech using Attention-based Deep Neural Networks Link: https://arxiv.org/abs/2112.06922

Authors: Dae-Hyeok Lee, Sung-Jin Kim, Keon-Woo Lee Affiliations: Dept. Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea, Dept. Artificial Intelligence Comments: 4 pages, 2 figures Abstract: Brain-computer interface (BCI) is the technology that enables communication between humans and devices by reflecting the status and intentions of humans. When conducting imagined speech, the users imagine the pronunciation as if actually speaking. In the case of decoding imagined speech-based EEG signals, complex tasks can be conducted more intuitively, but decoding performance is lower than that of other BCI paradigms. We modified our previous model for decoding imagined speech-based EEG signals. Ten subjects participated in the experiment. The average accuracy of our proposed method was 0.5648 for classifying four words. In other words, our proposed method has significant strength in learning local features. Hence, we demonstrated the feasibility of decoding imagined speech-based EEG signals with robust performance.