Finance / Speech / Audio Processing Academic Digest [11.10]

q-fin (Quantitative Finance): 5 papers

cs.SD (Sound / Speech): 10 papers

eess.AS (Audio and Speech Processing): 10 papers

1. q-fin (Quantitative Finance):

【1】 Do Firearm Markets Comply with Firearm Restrictions? How the Massachusetts Assault Weapons Ban Enforcement Notice Changed Firearm Sales
Link: https://arxiv.org/abs/2111.05272

Authors: Meenakshi Balakrishna, Kenneth C. Wilbur
Affiliations: University of California, San Diego
Note: arXiv admin note: substantial text overlap with arXiv:2102.02884
Abstract: How well do firearm markets comply with firearm restrictions? The Massachusetts Attorney General issued an Enforcement Notice in 2016 to announce a new interpretation of the key phrase "copies and duplicates" in the state's assault weapons ban. The Enforcement Notice increased assault rifle sales by 1,349 (+560%) within five days, followed by a reduction of 211 (-58%) over the next three weeks. Assault rifle sales were 64-66% lower in 2017 than in comparable earlier periods, suggesting that the Enforcement Notice reduced assault weapon sales but also that many banned weapons continued to be sold. Overall, the results show how quickly policy can affect the firearm market and provide an upper bound on firearm market compliance with assault weapon restrictions.

【2】 FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance
Link: https://arxiv.org/abs/2111.05188

Authors: Zechu Li, Xiao-Yang Liu, Jiahao Zheng, Zhaoran Wang, Anwar Walid, Jian Guo
Affiliations: Shenzhen Inst. of Advanced Tech., Northwestern University, Amazon & Columbia University, IDEA Research
Abstract: Machine learning techniques are playing more and more important roles in financial market investment. However, finance quantitative modeling with conventional supervised learning approaches has a number of limitations. The development of deep reinforcement learning techniques is partially addressing these issues. Unfortunately, the steep learning curve and the difficulty in quick modeling and agile development are impeding finance researchers from using deep reinforcement learning in quantitative trading. In this paper, we propose an RLOps in finance paradigm and present a FinRL-Podracer framework to accelerate the development pipeline of deep reinforcement learning (DRL)-driven trading strategies and to improve both trading performance and training efficiency. FinRL-Podracer is a cloud solution that features high performance and high scalability and promises continuous training, continuous integration, and continuous delivery of DRL-driven trading strategies, facilitating a rapid transformation from algorithmic innovations into a profitable trading strategy. First, we propose a generational evolution mechanism with an ensemble strategy to improve the trading performance of a DRL agent, and schedule the training of a DRL algorithm onto a GPU cloud via multi-level mapping. Then, we carry out the training of DRL components with high-performance optimizations on GPUs. Finally, we evaluate the FinRL-Podracer framework for a stock trend prediction task on an NVIDIA DGX SuperPOD cloud. FinRL-Podracer outperforms three popular DRL libraries (Ray RLlib, Stable Baselines3, and FinRL), with 12%~35% improvements in annual return, 0.1~0.6 improvements in Sharpe ratio, and 3x~7x speed-ups in training time. We demonstrate high scalability by training a trading agent in 10 minutes with 80 A100 GPUs on NASDAQ-100 constituent stocks with minute-level data over 10 years.
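
The generational evolution idea can be illustrated with a toy sketch (this is not the authors' implementation): a population of agents is trained in parallel, scored by a backtest, and the best performers are ensembled to seed the next generation. The `train_agent` and `evaluate_agent` functions below are hypothetical placeholders for a DRL training run and a Sharpe-style backtest score.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_agent(params):
    """Hypothetical placeholder for one DRL training run; here it just nudges the weights."""
    return params + 0.01 * rng.standard_normal(params.shape)

def evaluate_agent(params):
    """Hypothetical placeholder for a backtest score (e.g., Sharpe ratio); a toy objective."""
    return -float(np.sum((params - 1.0) ** 2))

population = [rng.standard_normal(8) for _ in range(16)]       # 16 agents, 8 "weights" each
for generation in range(5):
    population = [train_agent(p) for p in population]          # train agents independently (parallelizable)
    scores = np.array([evaluate_agent(p) for p in population])
    elite = [population[i] for i in np.argsort(scores)[-4:]]   # keep the 4 best agents
    ensemble = np.mean(elite, axis=0)                          # ensemble strategy: average the elite
    population = [ensemble + 0.05 * rng.standard_normal(8) for _ in range(16)]  # respawn next generation
    print(f"generation {generation}: best score {scores.max():.3f}")
```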

【3】 The Evolving Causal Structure of Equity Risk Factors
Link: https://arxiv.org/abs/2111.05072

Authors: Gabriele D'Acunto, Paolo Bajardi, Francesco Bonchi, Gianmarco De Francisci Morales
Affiliations: ISI Foundation, Turin, Italy; Eurecat, Barcelona, Spain
Abstract: In recent years, multi-factor strategies have gained increasing popularity in the financial industry, as they allow investors to have a better understanding of the risk drivers underlying their portfolios. Moreover, such strategies promise to promote diversification and thus limit losses in times of financial turmoil. However, recent studies have reported a significant level of redundancy between these factors, which might enhance risk contagion among multi-factor portfolios during financial crises. Therefore, it is of fundamental importance to better understand the relationships among factors. Empowered by recent advances in causal structure learning methods, this paper presents a study of the causal structure of financial risk factors and its evolution over time. In particular, the data we analyze covers 11 risk factors concerning the US equity market, spanning a period of 29 years at daily frequency. Our results show a statistically significant sparsifying trend of the underlying causal structure. However, this trend breaks down during periods of financial stress, in which we can observe a densification of the causal network driven by a growth of the out-degree of the market factor node. Finally, we present a comparison with the analysis of factors cross-correlations, which further confirms the importance of causal analysis for gaining deeper insights into the dynamics of the factor system, particularly during economic downturns. Our findings are especially significant from a risk-management perspective. They link the evolution of the causal structure of equity risk factors with market volatility and a worsening macroeconomic environment, and show that, in times of financial crisis, exposure to different factors boils down to exposure to the market risk factor.
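
As a rough illustration of the kind of quantity tracked in the paper, the sketch below estimates a directed graph over factors in rolling windows and reports the market factor's out-degree and overall edge density. It uses a crude lagged-correlation threshold as a stand-in for a proper causal structure learning method, and the data are synthetic, so only the rolling-window bookkeeping is meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 2000, 11                                  # ~8 years of daily observations, 11 risk factors
market = rng.standard_normal(T)
factors = 0.4 * market[:, None] + rng.standard_normal((T, K))
factors[:, 0] = market                           # column 0 plays the role of the market factor

def lagged_edge(x, y, thresh=0.1):
    """Crude stand-in for a causal edge x -> y: |corr(x_{t-1}, y_t)| above a threshold."""
    return abs(np.corrcoef(x[:-1], y[1:])[0, 1]) > thresh

def graph_stats(window):
    """Out-degree of the market factor and overall edge density within one window."""
    n = window.shape[1]
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i, j] = lagged_edge(window[:, i], window[:, j])
    return adj[0].sum(), adj.mean()

for start in range(0, T - 500, 500):             # non-overlapping 500-day windows
    out_deg, density = graph_stats(factors[start:start + 500])
    print(f"window {start}-{start + 500}: market out-degree={out_deg}, edge density={density:.2f}")
```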

【4】 Analysis of Sectoral Profitability of the Indian Stock Market Using an LSTM Regression Model
Link: https://arxiv.org/abs/2111.04976

Authors: Jaydip Sen, Saikat Mondal, Sidra Mehtab
Affiliations: Department of Data Science, Praxis Business School, Kolkata, India
Note: Accepted for oral presentation and publication in the proceedings of the Deep Learning Developers' Conference (DLDC'2021), organized online September 23-24, 2021 by Analytics India Magazine, India. The paper is 8 pages long and contains 15 figures and 14 tables.
Abstract: Predictive model design for accurately predicting future stock prices has always been considered an interesting and challenging research problem. The task becomes complex due to the volatile and stochastic nature of stock prices in the real world, which are affected by numerous controllable and uncontrollable variables. This paper presents an optimized predictive model built on a long-and-short-term memory (LSTM) architecture that automatically extracts past stock prices from the web over a specified time interval and predicts their future prices over a specified forecast horizon. The model is deployed for making buy and sell transactions based on its predicted results for 70 important stocks from seven different sectors listed on the National Stock Exchange (NSE) of India. The profitability of each sector is derived based on the total profit yielded by the stocks in that sector over the period from Jan 1, 2010 to Aug 26, 2021. The sectors are compared based on their profitability values. The prediction accuracy of the model is also evaluated for each sector. The results indicate that the model is highly accurate in predicting future stock prices.
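
A minimal sliding-window LSTM regressor of the kind described can be sketched in PyTorch as below; the price series is synthetic and the layer sizes, lookback, and training schedule are arbitrary choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
prices = torch.cumsum(torch.randn(1200), dim=0) + 100.0      # synthetic daily close prices

def make_windows(series, lookback=30):
    X = torch.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X.unsqueeze(-1), series[lookback:]                 # (N, lookback, 1), (N,)

class LSTMRegressor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)              # next-step price prediction

X, y = make_windows(prices)
model = LSTMRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(5):                                        # full-batch training, toy setting
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: MSE {loss.item():.2f}")
```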

【5】 American Hate Crime Trends Prediction with Event Extraction
Link: https://arxiv.org/abs/2111.04951

Authors: Songqiao Han, Hailiang Huang, Jiangwei Liu, Shengsheng Xiao
Affiliations: School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China
Note: 12 pages, 5 figures, 4 tables
Abstract: Social media platforms may provide potential space for discourses that contain hate speech, and even worse, can act as a propagation mechanism for hate crimes. The FBI's Uniform Crime Reporting (UCR) Program collects hate crime data and releases a statistical report yearly. These statistics provide information for determining national hate crime trends. The statistics can also provide valuable holistic and strategic insight for law enforcement agencies or give lawmakers justification for specific legislation. However, the reports are mostly released the following year and lag behind many immediate needs. Recent research mainly focuses on hate speech detection in social media text or empirical studies on the impact of a confirmed crime. This paper proposes a framework that first utilizes text mining techniques to extract hate crime events from New York Times news, then uses the results to facilitate predicting American national-level and state-level hate crime trends. Experimental results show that our method can significantly enhance the prediction performance compared with time series or regression methods without event-related factors. Our framework broadens the methodology for national-level and state-level hate crime trend prediction.
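
The core comparison (a trend model with versus without event-derived features) can be sketched with a plain linear regression on synthetic monthly counts; the feature construction and data here are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
months = 120
events = rng.poisson(3.0, months)                        # extracted hate-crime events per month (synthetic)
crimes = 50 + 4.0 * events + rng.normal(0, 3, months)    # reported incidents, partly driven by events

lag1 = np.roll(crimes, 1)[1:]                            # simple autoregressive feature
y = crimes[1:]
X_base = lag1.reshape(-1, 1)                             # time-series-only baseline
X_event = np.column_stack([lag1, events[1:]])            # baseline + event-related factor

split = 90
for name, X in [("time-series only", X_base), ("with event factor", X_event)]:
    model = LinearRegression().fit(X[:split], y[:split])
    mae = mean_absolute_error(y[split:], model.predict(X[split:]))
    print(f"{name}: test MAE = {mae:.2f}")
```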

2. cs.SD (Sound / Speech):

【1】 Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
Link: https://arxiv.org/abs/2111.05222

Authors: Gnana Praveen R, Eric Granger, Patrick Cardinal
Affiliations: Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA), École de technologie supérieure, Montreal, Canada
Note: Accepted at FG2021
Abstract: Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most of the existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. Our new cross-attentional A-V fusion model efficiently leverages the inter-modal relationships. In particular, it computes cross-attention weights to focus on the more contributive features across individual modalities, and thereby combines contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) data-sets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available at https://github.com/praveena2j/Cross-Attentional-AV-Fusion
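
A minimal sketch of cross-attentional fusion, assuming pre-extracted audio and visual feature sequences of a common dimension: each modality queries the other, the attended context vectors are pooled and concatenated, and a fully connected head predicts valence and arousal. This is an illustrative simplification, not the released implementation linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.q_a, self.k_v, self.v_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.q_v, self.k_a, self.v_a = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.head = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, audio, visual):                          # (B, Ta, d), (B, Tv, d)
        scale = audio.size(-1) ** 0.5
        # each modality attends over the other one
        att_av = F.softmax(self.q_a(audio) @ self.k_v(visual).transpose(1, 2) / scale, dim=-1)
        att_va = F.softmax(self.q_v(visual) @ self.k_a(audio).transpose(1, 2) / scale, dim=-1)
        audio_ctx = (att_av @ self.v_v(visual)).mean(dim=1)    # visual context for audio queries
        visual_ctx = (att_va @ self.v_a(audio)).mean(dim=1)    # audio context for visual queries
        return self.head(torch.cat([audio_ctx, visual_ctx], dim=-1))  # (B, 2): valence, arousal

model = CrossAttentionFusion()
out = model(torch.randn(4, 50, 128), torch.randn(4, 16, 128))
print(out.shape)                                               # torch.Size([4, 2])
```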

【2】 CAESynth: Real-Time Timbre Interpolation and Pitch Control with Conditional Autoencoders
Link: https://arxiv.org/abs/2111.05174

Authors: Aaron Valero Puche, Sukhan Lee
Affiliations: Artificial Intelligence School, Sungkyunkwan University, Suwon-si, Republic of Korea
Note: MLSP 2021
Abstract: In this paper, we present a novel audio synthesizer, CAESynth, based on a conditional autoencoder. CAESynth synthesizes timbre in real-time by interpolating reference sounds in their shared latent feature space, while controlling pitch independently. We show that training a conditional autoencoder based on accuracy in timbre classification, together with adversarial regularization of pitch content, allows the timbre distribution in latent space to be more effective and stable for timbre interpolation and pitch conditioning. The proposed method is applicable not only to the creation of musical cues but also to the exploration of audio affordance in mixed reality based on novel timbre mixtures with environmental sounds. We demonstrate by experiments that CAESynth achieves smooth and high-fidelity audio synthesis in real-time through timbre interpolation and independent yet accurate pitch control, for musical cues as well as for audio affordance with environmental sound. A Python implementation along with some generated samples is shared online.
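
A schematic of the conditional-autoencoder idea: the encoder produces a timbre latent, the decoder is conditioned on a pitch one-hot, and timbre interpolation is a convex mix of two latents decoded at a chosen pitch. Dimensions and layers below are arbitrary and the model is untrained; it only shows the data flow, not the published architecture.

```python
import torch
import torch.nn as nn

class ConditionalAE(nn.Module):
    def __init__(self, n_bins=256, latent=32, n_pitches=88):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, 128), nn.ReLU(), nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent + n_pitches, 128), nn.ReLU(), nn.Linear(128, n_bins))

    def forward(self, frame, pitch_onehot):
        z = self.encoder(frame)                              # timbre latent (ideally pitch-free)
        return self.decoder(torch.cat([z, pitch_onehot], dim=-1)), z

model = ConditionalAE()
a, b = torch.rand(1, 256), torch.rand(1, 256)                # spectrogram frames of two reference sounds
pitch = torch.zeros(1, 88); pitch[0, 60] = 1.0               # target pitch as a one-hot vector
_, za = model(a, pitch)
_, zb = model(b, pitch)
z_mix = 0.5 * za + 0.5 * zb                                  # timbre interpolation in latent space
new_frame = model.decoder(torch.cat([z_mix, pitch], dim=-1))
print(new_frame.shape)                                       # torch.Size([1, 256])
```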

【3】 Losses, Dissonances, and Distortions
Link: https://arxiv.org/abs/2111.05128

Authors: Pablo Samuel Castro
Affiliations: Google Research, Brain Team
Note: In the 5th Machine Learning for Creativity and Design Workshop at NeurIPS 2021
Abstract: In this paper I present a study in using the losses and gradients obtained during the training of a simple function approximator as a mechanism for creating musical dissonance and visual distortion in a solo piano performance setting. These dissonances and distortions become part of an artistic performance not just by affecting the visualizations, but also by affecting the artistic musical performance. The system is designed such that the performer can in turn affect the training process itself, thereby creating a closed feedback loop between two processes: the training of a machine learning model and the performance of an improvised piano piece.

【4】 Membership Inference Attacks Against Self-supervised Speech Models
Link: https://arxiv.org/abs/2111.05113

Authors: Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Note: Submitted to ICASSP 2022. Source code available at this https URL
Abstract: Recently, adapting the idea of self-supervised learning (SSL) to continuous speech has started gaining attention. SSL models pre-trained on a huge amount of unlabeled audio can generate general-purpose representations that benefit a wide variety of speech processing tasks. Despite their ubiquitous deployment, however, the potential privacy risks of these models have not been well investigated. In this paper, we present the first privacy analysis of several SSL speech models using Membership Inference Attacks (MIA) under black-box access. The experimental results show that these pre-trained models are vulnerable to MIA and prone to membership information leakage, with high adversarial advantage scores at both the utterance level and the speaker level. Furthermore, we also conduct several ablation studies to understand the factors that contribute to the success of MIA.
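
A score-threshold membership inference attack can be summarized in a few lines: given a per-utterance score from black-box access (e.g., a similarity between representations of an utterance and a perturbed copy), the attacker sweeps a threshold and reports AUC and the adversarial advantage max(TPR - FPR). The scores below are synthetic stand-ins, not outputs of any actual SSL model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic per-utterance scores: members (seen during pre-training) tend to score higher.
member_scores = rng.normal(0.80, 0.05, 500)
nonmember_scores = rng.normal(0.70, 0.05, 500)

scores = np.concatenate([member_scores, nonmember_scores])
labels = np.concatenate([np.ones(500), np.zeros(500)])

fpr, tpr, _ = roc_curve(labels, scores)
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
print(f"adversarial advantage = {np.max(tpr - fpr):.3f}")    # best TPR-FPR gap over thresholds
```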

【5】 Speaker Generation
Link: https://arxiv.org/abs/2111.05095

Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao
Affiliations: Google Research, USA
Note: 12 pages, 3 figures, 4 tables, appendix with 2 tables
Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.
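
The central idea (fit a distribution over speaker embeddings, then sample an embedding for a speaker who does not exist) can be sketched with a Gaussian mixture over stand-in embeddings; TacoSpawn learns its speaker prior jointly with the TTS model, so this is only an illustration of the sampling step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for learned embeddings of 200 training speakers (dimension 64).
speaker_embeddings = np.concatenate([
    rng.normal(-1.0, 0.3, (100, 64)),     # one cluster of speakers
    rng.normal(+1.0, 0.3, (100, 64)),     # another cluster
])

prior = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
prior.fit(speaker_embeddings)

novel_speakers, _ = prior.sample(5)        # embeddings of 5 speakers that do not exist
print(novel_speakers.shape)                # (5, 64); each row would condition the TTS decoder
```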

【6】 RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
Link: https://arxiv.org/abs/2111.05011

Authors: Antoine Caillon, Philippe Esling
Affiliations: IRCAM - Sorbonne Université, CNRS UMR, place Igor Stravinsky, Paris, France
Abstract: Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control, or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.
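
The two-stage procedure can be pictured as a training schedule: a warm-up phase that optimizes a reconstruction (representation learning) objective only, followed by a phase that adds an adversarial term and trains a discriminator alongside. The toy below uses a tiny dense autoencoder on fake waveform frames purely to show the schedule; it is not the RAVE architecture, which, among other things, operates on a multi-band decomposition of raw audio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
ae = nn.Sequential(nn.Linear(512, 64), nn.Tanh(), nn.Linear(64, 512))    # toy encoder-decoder
disc = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 1))    # toy discriminator
opt_ae = torch.optim.Adam(ae.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

def batch():
    """Fake 'waveform frames': a sine template plus noise, 16 frames of 512 samples."""
    return torch.sin(torch.linspace(0, 50, 512)) + 0.1 * torch.randn(16, 512)

WARMUP = 200                                       # stage 1 length (representation learning only)
for step in range(400):
    x = batch()
    recon = ae(x)
    loss = F.mse_loss(recon, x)                    # stage 1: reconstruction objective
    if step >= WARMUP:                             # stage 2: add adversarial fine-tuning
        loss = loss + F.softplus(-disc(recon)).mean()          # try to fool the discriminator
    opt_ae.zero_grad(); loss.backward(); opt_ae.step()
    if step >= WARMUP:                             # discriminator is trained alongside in stage 2
        d_loss = F.softplus(disc(recon.detach())).mean() + F.softplus(-disc(x)).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```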

【7】 Ultra-Low Power Keyword Spotting at the Edge
Link: https://arxiv.org/abs/2111.04988

Authors: Mehmet Gorkem Ulkar, Osman Erman Okman
Affiliations: Analog Devices Inc., Istanbul, Turkey
Note: 5 pages, 5 figures
Abstract: Keyword spotting (KWS) has become an indispensable part of many intelligent devices surrounding us, as audio is one of the most efficient ways of interacting with these devices. The accuracy and performance of KWS solutions have been the main focus of the researchers, and thanks to deep learning, substantial progress has been made in this domain. However, as the use of KWS spreads into IoT devices, energy efficiency becomes a very critical requirement besides the performance. We believe KWS solutions that seek power optimization in both the hardware and the neural network (NN) model architecture are advantageous over many solutions in the literature, where mostly the architecture side of the problem is considered. In this work, we designed an optimized KWS CNN model by considering end-to-end energy efficiency for deployment on MAX78000, an ultra-low-power CNN accelerator. With the combined hardware and model optimization approach, we achieve 96.3% accuracy for 12 classes while only consuming 251 uJ per inference. We compare our results with other small-footprint neural network-based KWS solutions in the literature. Additionally, we share the energy consumption of our model on a power-optimized ARM Cortex-M4F to depict the effectiveness of the chosen hardware for the sake of clarity.
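
The reported 251 uJ per inference translates directly into an average-power and battery-life estimate; the inference rate and battery capacity below are assumptions for illustration, not figures from the paper.

```python
ENERGY_PER_INFERENCE_J = 251e-6     # 251 uJ per inference, as reported
INFERENCES_PER_SECOND = 1.0         # assumption: always-on listening, one inference per second
BATTERY_WH = 1.0                    # assumption: a small ~1 Wh battery budget

avg_power_w = ENERGY_PER_INFERENCE_J * INFERENCES_PER_SECOND      # 251 uW average
battery_seconds = BATTERY_WH * 3600 / avg_power_w
print(f"average power: {avg_power_w * 1e6:.0f} uW")
print(f"ideal battery life (inference only): {battery_seconds / 86400:.0f} days")   # ~166 days
```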

【8】 Cascaded Multilingual Audio-Visual Learning from Videos
Link: https://arxiv.org/abs/2111.04823

Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Affiliations: MIT CSAIL, USA; UT Austin, USA; IBM Research AI, USA; Columbia University, USA; NYU, USA
Note: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset
Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training on the Japanese videos solely. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.

【9】 Joint AEC and Beamforming with Double-Talk Detection using RNN-Transformer
Link: https://arxiv.org/abs/2111.04904

Authors: Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu
Affiliations: Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, TX, USA; Tencent AI Lab, Bellevue, WA, USA
Note: Submitted to ICASSP 2022
Abstract: Acoustic echo cancellation (AEC) is a technique used in full-duplex communication systems to eliminate acoustic feedback of far-end speech. However, AEC performance degrades in naturalistic environments due to nonlinear distortions introduced by the speaker, as well as background noise, reverberation, and double-talk scenarios. To address nonlinear distortions and co-existing background noise, several deep neural network (DNN)-based joint AEC and denoising systems were developed. These systems are based on either purely "black-box" neural networks or "hybrid" systems that combine traditional AEC algorithms with neural networks. We propose an all-deep-learning framework that combines multi-channel AEC and our recently proposed self-attentive recurrent neural network (RNN) beamformer. Furthermore, we propose a double-talk detection transformer (DTDT) module based on the multi-head attention transformer structure that computes attention over time by leveraging frame-wise double-talk predictions. Experiments show that our proposed method outperforms other approaches in terms of improving speech quality and the speech recognition rate of an ASR system.

【10】 Emotional Prosody Control for Speech Generation
Link: https://arxiv.org/abs/2111.04730

Authors: Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
Affiliations: CVIT, KCIS, IIIT Hyderabad; TCS Research, Pune
Abstract: Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, average variation learned from prosody sequences in training data, or emotion transferred from a source style. We propose a text-to-speech (TTS) system where a user can choose the emotion of generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker's style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a speech sample. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech.

3. eess.AS (Audio and Speech Processing):

【1】 Joint AEC and Beamforming with Double-Talk Detection using RNN-Transformer
Link: https://arxiv.org/abs/2111.04904

Authors: Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu
Affiliations: Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, TX, USA; Tencent AI Lab, Bellevue, WA, USA
Note: Submitted to ICASSP 2022
Abstract: Acoustic echo cancellation (AEC) is a technique used in full-duplex communication systems to eliminate acoustic feedback of far-end speech. However, AEC performance degrades in naturalistic environments due to nonlinear distortions introduced by the speaker, as well as background noise, reverberation, and double-talk scenarios. To address nonlinear distortions and co-existing background noise, several deep neural network (DNN)-based joint AEC and denoising systems were developed. These systems are based on either purely "black-box" neural networks or "hybrid" systems that combine traditional AEC algorithms with neural networks. We propose an all-deep-learning framework that combines multi-channel AEC and our recently proposed self-attentive recurrent neural network (RNN) beamformer. Furthermore, we propose a double-talk detection transformer (DTDT) module based on the multi-head attention transformer structure that computes attention over time by leveraging frame-wise double-talk predictions. Experiments show that our proposed method outperforms other approaches in terms of improving speech quality and the speech recognition rate of an ASR system.

【2】 Emotional Prosody Control for Speech Generation
Link: https://arxiv.org/abs/2111.04730

Authors: Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
Affiliations: CVIT, KCIS, IIIT Hyderabad; TCS Research, Pune
Abstract: Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, an emotion selected from a predefined set, average variation learned from prosody sequences in training data, or emotion transferred from a source style. We propose a text-to-speech (TTS) system where a user can choose the emotion of generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from text in any speaker's style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a speech sample. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech.

【3】 Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
Link: https://arxiv.org/abs/2111.05222

Authors: Gnana Praveen R, Eric Granger, Patrick Cardinal
Affiliations: Laboratoire d'imagerie, de vision et d'intelligence artificielle (LIVIA), École de technologie supérieure, Montreal, Canada
Note: Accepted at FG2021
Abstract: Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition efficiently leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most of the existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. Our new cross-attentional A-V fusion model efficiently leverages the inter-modal relationships. In particular, it computes cross-attention weights to focus on the more contributive features across individual modalities, and thereby combines contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) data-sets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available at https://github.com/praveena2j/Cross-Attentional-AV-Fusion

【4】 CAESynth: Real-Time Timbre Interpolation and Pitch Control with Conditional Autoencoders
Link: https://arxiv.org/abs/2111.05174

Authors: Aaron Valero Puche, Sukhan Lee
Affiliations: Artificial Intelligence School, Sungkyunkwan University, Suwon-si, Republic of Korea
Note: MLSP 2021
Abstract: In this paper, we present a novel audio synthesizer, CAESynth, based on a conditional autoencoder. CAESynth synthesizes timbre in real-time by interpolating reference sounds in their shared latent feature space, while controlling pitch independently. We show that training a conditional autoencoder based on accuracy in timbre classification, together with adversarial regularization of pitch content, allows the timbre distribution in latent space to be more effective and stable for timbre interpolation and pitch conditioning. The proposed method is applicable not only to the creation of musical cues but also to the exploration of audio affordance in mixed reality based on novel timbre mixtures with environmental sounds. We demonstrate by experiments that CAESynth achieves smooth and high-fidelity audio synthesis in real-time through timbre interpolation and independent yet accurate pitch control, for musical cues as well as for audio affordance with environmental sound. A Python implementation along with some generated samples is shared online.

【5】 Losses, Dissonances, and Distortions
Link: https://arxiv.org/abs/2111.05128

Authors: Pablo Samuel Castro
Affiliations: Google Research, Brain Team
Note: In the 5th Machine Learning for Creativity and Design Workshop at NeurIPS 2021
Abstract: In this paper I present a study in using the losses and gradients obtained during the training of a simple function approximator as a mechanism for creating musical dissonance and visual distortion in a solo piano performance setting. These dissonances and distortions become part of an artistic performance not just by affecting the visualizations, but also by affecting the artistic musical performance. The system is designed such that the performer can in turn affect the training process itself, thereby creating a closed feedback loop between two processes: the training of a machine learning model and the performance of an improvised piano piece.

【6】 Membership Inference Attacks Against Self-supervised Speech Models
Link: https://arxiv.org/abs/2111.05113

Authors: Wei-Cheng Tseng, Wei-Tsung Kao, Hung-yi Lee
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Note: Submitted to ICASSP 2022. Source code available at this https URL
Abstract: Recently, adapting the idea of self-supervised learning (SSL) to continuous speech has started gaining attention. SSL models pre-trained on a huge amount of unlabeled audio can generate general-purpose representations that benefit a wide variety of speech processing tasks. Despite their ubiquitous deployment, however, the potential privacy risks of these models have not been well investigated. In this paper, we present the first privacy analysis of several SSL speech models using Membership Inference Attacks (MIA) under black-box access. The experimental results show that these pre-trained models are vulnerable to MIA and prone to membership information leakage, with high adversarial advantage scores at both the utterance level and the speaker level. Furthermore, we also conduct several ablation studies to understand the factors that contribute to the success of MIA.

【7】 Speaker Generation
Link: https://arxiv.org/abs/2111.05095

Authors: Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao
Affiliations: Google Research, USA
Note: 12 pages, 3 figures, 4 tables, appendix with 2 tables
Abstract: This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page.

【8】 RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
Link: https://arxiv.org/abs/2111.05011

Authors: Antoine Caillon, Philippe Esling
Affiliations: IRCAM - Sorbonne Université, CNRS UMR, place Igor Stravinsky, Paris, France
Abstract: Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control, or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.

【9】 Ultra-Low Power Keyword Spotting at the Edge
Link: https://arxiv.org/abs/2111.04988

Authors: Mehmet Gorkem Ulkar, Osman Erman Okman
Affiliations: Analog Devices Inc., Istanbul, Turkey
Note: 5 pages, 5 figures
Abstract: Keyword spotting (KWS) has become an indispensable part of many intelligent devices surrounding us, as audio is one of the most efficient ways of interacting with these devices. The accuracy and performance of KWS solutions have been the main focus of the researchers, and thanks to deep learning, substantial progress has been made in this domain. However, as the use of KWS spreads into IoT devices, energy efficiency becomes a very critical requirement besides the performance. We believe KWS solutions that seek power optimization in both the hardware and the neural network (NN) model architecture are advantageous over many solutions in the literature, where mostly the architecture side of the problem is considered. In this work, we designed an optimized KWS CNN model by considering end-to-end energy efficiency for deployment on MAX78000, an ultra-low-power CNN accelerator. With the combined hardware and model optimization approach, we achieve 96.3% accuracy for 12 classes while only consuming 251 uJ per inference. We compare our results with other small-footprint neural network-based KWS solutions in the literature. Additionally, we share the energy consumption of our model on a power-optimized ARM Cortex-M4F to depict the effectiveness of the chosen hardware for the sake of clarity.

【10】 Cascaded Multilingual Audio-Visual Learning from Videos
Link: https://arxiv.org/abs/2111.04823

Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass
Affiliations: MIT CSAIL, USA; UT Austin, USA; IBM Research AI, USA; Columbia University, USA; NYU, USA
Note: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset
Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training on the Japanese videos solely. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.