Finance / Speech / Audio Processing Academic Digest [11.15]

Date: 2023-03-14 22:52:55

q-fin (Finance): 5 papers

cs.SD (Speech): 6 papers

eess.AS (Audio Processing): 7 papers

1. q-fin (Finance):

【1】 Can Air Pollution Save Lives? Air Quality and Risky Behaviors on Roads Link: https://arxiv.org/abs/2111.06837

Authors: Wen Hsu, Bing-Fang Hwang, Chau-Ren Jung, Yau-Huo Shr Abstract: Air pollution has been linked to elevated levels of risk aversion. This paper provides the first evidence that this effect reduces life-threatening risky behaviors. We study the impact of air pollution on traffic accidents caused by risky driving behaviors, using the universe of accident records and high-resolution air quality data for Taiwan from 2009 to 2015. We find that air pollution significantly decreases accidents caused by driver violations, and that this effect is nonlinear. In addition, our results suggest that air pollution primarily reduces road users' risky behaviors through visual channels rather than through the respiratory system.

【2】 The cavity method for minority games between arbitrageurs on financial markets Link: https://arxiv.org/abs/2111.06663

Authors: Tim Ritmeester, Hildegard Meyer-Ortmanns Comments: 35 pages, 7 figures Abstract: We use the cavity method from statistical physics to analyze the transient and stationary dynamics of a minority game played by agents performing market arbitrage. At the level of linear response, the method allows us to include the reaction of the market to individual actions of the agents as well as the reaction of the agents to individual information items of the market, from which we derive a self-consistent solution to the minority game. In particular, we analyze the impact of general nonlinear price functions on the amount of arbitrage when noise from external fluctuations is present. We identify the conditions under which arbitrage is reduced due to the presence of noise. When the cavity method is extended to a time-dependent response of the market price to previous actions of the agents, the individual contributions of noise from external fluctuations in price and information, and of noise due to the choice of strategies, can be followed through the transient dynamics until a stationary state is reached. This explains the time evolution of the scores of the agents' strategies: it changes from an initial random walk, to bounded excursions on an intermediate time scale, to effectively random switching between strategies. In contrast to a Curie-Weiss-level mean-field approach, the market response included by the cavity method captures the realistic feature that agents have a preference for a certain choice of strategies without getting stuck on a single choice.

【3】 Profit warnings and stock returns: Evidence from Moroccan stock exchange Link: https://arxiv.org/abs/2111.06655

Authors: Ilyas El Ghordaf, Abdelbari El Khamlichi Affiliations: University Mohammed First, Oujda, Morocco; LERSEM, ENCG, Chouaib Doukkali University, El Jadida, Morocco Abstract: There is an important literature on profit warnings and their impact on stock returns. We provide evidence from the Moroccan stock market, which aims to become an African financial hub. Despite this practical improvement, academic research focused on this market is scarce, and our study is a first investigation in this context. Using the event study methodology and a sample of companies listed on the Casablanca Stock Exchange over the period 2009 to 2016, we examine whether the effect of qualitative warnings is more negative than that of quantitative warnings in a short event window. Our empirical findings show that the average abnormal return on the announcement date is negative and statistically significant. The magnitude of this negative abnormal return is greater for qualitative warnings than for quantitative ones.
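
The abnormal return mentioned in the abstract above is the realised return minus a benchmark "expected" return. The sketch below is only an illustration and assumes a market-model benchmark fitted by ordinary least squares; the abstract does not say which benchmark the authors use, and the function names and return series here are hypothetical.

```python
from statistics import mean


def fit_market_model(stock, market):
    """OLS fit of stock = alpha + beta * market over an estimation window."""
    m_bar, s_bar = mean(market), mean(stock)
    cov = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock))
    var = sum((m - m_bar) ** 2 for m in market)
    beta = cov / var
    alpha = s_bar - beta * m_bar
    return alpha, beta


def abnormal_return(stock_ret, market_ret, alpha, beta):
    """Realised return minus the return predicted by the market model."""
    return stock_ret - (alpha + beta * market_ret)


# Toy estimation window (made-up numbers, not data from the paper).
est_market = [0.010, -0.005, 0.002, 0.007, -0.003]
est_stock = [0.012, -0.004, 0.003, 0.009, -0.002]
alpha, beta = fit_market_model(est_stock, est_market)

# Hypothetical announcement day: the stock drops while the market is flat.
ar = abnormal_return(-0.030, 0.001, alpha, beta)
print(f"alpha={alpha:.4f} beta={beta:.3f} abnormal return={ar:.4f}")
```

Averaging such abnormal returns across all warning announcements, separately for qualitative and quantitative warnings, gives the kind of comparison the abstract reports.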

【4】 Can hesitancy be mitigated by free choice across COVID-19 vaccine types? Link: https://arxiv.org/abs/2111.06462

Authors: Kristóf Kutasi, Júlia Koltai, Ágnes Szabó-Morvai, Gergely Röst, Márton Karsai, Péter Bíró, Balázs Lengyel Affiliations: Rice University; Debrecen University, Department of Economics, Hungary; University of Szeged, Bolyai Institute, Hungary; Alfréd Rényi Institute of Mathematics Comments: 29 pages, 9 figures Abstract: Many countries have secured larger quantities of COVID-19 vaccines than their populace is willing to take. This abundance and variety of vaccines created a historical moment to understand vaccine hesitancy better. Never before were more types of vaccines available for an illness, and the intensity of vaccine-related public discourse is unprecedented. Yet the heterogeneity of hesitancy by vaccine type has been neglected so far, even though factual or believed vaccine characteristics and patient attributes are known to influence acceptance. We address this problem by analysing acceptance and assessment of five vaccine types, using information collected with a nationally representative survey at the end of the third wave of the COVID-19 pandemic in Hungary, where a unique portfolio of vaccines was available to the public in large quantities. Our special case enables us to quantify revealed preferences across vaccine types, since one could rate a vaccine as unacceptable and could even reject an assigned vaccine to wait for another type. We find that the source of information that respondents trust characterizes their attitudes towards vaccine types differently and leads to divergent vaccine hesitancy. Believers in conspiracy theories were significantly more likely to rate the mRNA vaccines (Pfizer and Moderna) as unacceptable, while those who follow the advice of politicians rate vector-based (AstraZeneca and Sputnik) or whole-virus vaccines (Sinopharm) as acceptable with higher likelihood. We illustrate that the rejection of non-desired vaccines and the re-selection of preferred ones fragments the population along the line of mRNA versus other types of vaccines, while it generally improves the assessment of the received vaccine. These results highlight that greater variance of available vaccine types and individual free choice are desirable conditions that can widen the acceptance of vaccines in societies.

【5】 Joint Models for Cause-of-Death Mortality in Multiple Populations Link: https://arxiv.org/abs/2111.06631

Authors: Nhan Huynh, Mike Ludkovski Comments: 27 pages, 14 figures Abstract: We investigate jointly modeling age-specific rates of various causes of death in a multinational setting. We apply Multi-Output Gaussian Processes (MOGP), a spatial machine learning method, to smooth and extrapolate multiple cause-of-death mortality rates across several countries and both genders. To maintain flexibility and scalability, we investigate MOGPs with Kronecker-structured kernels and latent factors. In particular, we develop a custom multi-level MOGP that leverages the gridded structure of mortality tables to efficiently capture heterogeneity and dependence across different factor inputs. Results are illustrated with datasets from the Human Cause-of-Death Database (HCD). We discuss a case study involving cancer variations in three European nations, and a US-based study that considers eight top-level causes and includes a comparison to all-cause analysis. Our models provide insights into the commonality of cause-specific mortality trends and demonstrate the opportunities for respective data fusion.
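
A Kronecker-structured kernel, as mentioned in the abstract above, builds the covariance over a full grid (for example age by year) from one small kernel matrix per axis. The sketch below shows the idea only. The squared-exponential kernel, the lengthscales and the tiny grid are assumptions chosen for illustration, not the authors' model.

```python
import math


def rbf_kernel(xs, lengthscale):
    """Squared-exponential kernel matrix over a list of scalar inputs."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in xs] for a in xs]


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


# Hypothetical mortality-table grid: 3 ages x 2 calendar years.
ages = [50, 60, 70]
years = [2000, 2001]
k_age = rbf_kernel(ages, lengthscale=10.0)
k_year = rbf_kernel(years, lengthscale=2.0)

# Covariance over all 6 (age, year) cells, with rows ordered age-major.
k_grid = kron(k_age, k_year)
print(len(k_grid), "x", len(k_grid[0]))
```

The benefit of this structure is that only the small per-axis matrices need to be stored and factorised, which is what makes gridded mortality tables tractable for Gaussian processes.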

2. cs.SD (Speech):

【1】 Fully Automatic Page Turning on Real Scores Link: https://arxiv.org/abs/2111.06643

Authors: Florian Henkel, Stephanie Schwaiger, Gerhard Widmer Affiliations: Institute of Computational Perception, Johannes Kepler University, Linz, Austria; LIT Artificial Intelligence Lab, Linz Institute of Technology, Austria Comments: ISMIR 2021 Late Breaking/Demo Abstract: We present a prototype of an automatic page turning system that works directly on real scores, i.e., sheet images, without any symbolic representation. Our system is based on a multi-modal neural network architecture that observes a complete sheet image page as input, listens to an incoming musical performance, and predicts the corresponding position in the image. Using the position estimate of our system, we apply a simple heuristic to trigger a page turning event once a certain location within the sheet image is reached. As a proof of concept, we further combine our system with an actual machine that physically turns the page on command.
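
The triggering heuristic described in the abstract above can be pictured as a threshold on the tracked score position. The following is a hypothetical sketch, not the authors' implementation: it reduces the predicted position to a single vertical pixel coordinate, and the 90% trigger line, the re-arming rule and the sample positions are all assumptions.

```python
def page_turn_events(positions, page_height, trigger_fraction=0.9):
    """Yield the frame index at which each page turn would be triggered.

    positions: predicted vertical positions in pixels from the top of the page.
    After a turn, tracking is assumed to restart near the top of the next page,
    so the trigger re-arms once the position drops back above the trigger line.
    """
    trigger_y = trigger_fraction * page_height
    armed = True
    for i, y in enumerate(positions):
        if armed and y >= trigger_y:
            armed = False  # fire once per page
            yield i
        elif not armed and y < trigger_y:
            armed = True


# Made-up tracker output for two pages of 1000 px height.
predicted_y = [100, 400, 700, 950, 980, 120, 500, 930]
turns = list(page_turn_events(predicted_y, page_height=1000))
print(turns)  # [3, 7]
```

Firing only once per page matters in practice, because a physical page turner should not receive repeated commands while the tracker lingers near the bottom of the page.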

【2】 A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal Link: https://arxiv.org/abs/2111.06625

Authors: Ovishake Sen, Al-Mahmud, Pias Roy Affiliations: Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh Comments: 4 pages, 5 figures, 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), 14 to 16 September 2021, Khulna, Bangladesh Abstract: Speech recognition is a technique that converts human speech signals into text or words, or into any form that can be easily understood by computers or other machines. There have been a few studies on Bangla digit recognition systems, the majority of which used small datasets with few variations in gender, age, dialect, and other variables. In this study, audio recordings of Bangladeshi people of various genders, ages, and dialects were used to create a large speech dataset of spoken '0-9' Bangla digits. To create the dataset, 400 noisy and noise-free samples per digit were recorded. Mel Frequency Cepstrum Coefficients (MFCCs) were used to extract meaningful features from the raw speech data. Convolutional Neural Networks (CNNs) were then used to detect Bangla numeral digits. The suggested technique recognizes '0-9' Bangla spoken digits with 97.1% accuracy over the whole dataset. The efficiency of the model was also assessed using 10-fold cross-validation, which yielded 96.7% accuracy.

【3】 Domain Generalization on Efficient Acoustic Scene Classification using Residual Normalization Link: https://arxiv.org/abs/2111.06531

Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Simyung Chang Affiliations: Qualcomm AI Research, Qualcomm Korea YH, Seoul, Republic of Korea; Seoul National University, Seoul, Republic of Korea Comments: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021) Abstract: How to handle multi-device audio inputs with a single, efficiently designed acoustic scene classification system is a practical research topic. In this work, we propose Residual Normalization, a novel feature normalization method that uses frequency-wise normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Moreover, we introduce an efficient architecture, BC-ResNet-ASC, a modified version of the baseline architecture with a limited receptive field. BC-ResNet-ASC outperforms the baseline architecture even though it contains a small number of parameters. Through three model compression schemes (pruning, quantization, and knowledge distillation), we can reduce model complexity further while mitigating the performance degradation. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. The proposed method won 1st place in the DCASE 2021 challenge, TASK1A.
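
The abstract above describes Residual Normalization as frequency-wise normalization combined with a shortcut path. One plausible reading, sketched below, is `output = lam * x + FreqNorm(x)`, where `FreqNorm` standardises each frequency bin over time. That formula, the weight `lam`, and the single-channel nested-list input are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def freq_instance_norm(spec, eps=1e-5):
    """Standardise each frequency bin (row) of one spectrogram over time."""
    out = []
    for row in spec:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out


def residual_norm(spec, lam=0.1, eps=1e-5):
    """Shortcut path (lam * input) plus the frequency-wise normalised input."""
    normed = freq_instance_norm(spec, eps)
    return [
        [lam * v + n for v, n in zip(row, nrow)]
        for row, nrow in zip(spec, normed)
    ]


# Toy example: the same scene "recorded" by two devices whose frequency
# responses differ by a per-bin gain (made-up numbers).
device_a = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
device_b = [[2.0, 4.0, 6.0], [2.0, 3.0, 4.0]]

norm_a = freq_instance_norm(device_a)
norm_b = freq_instance_norm(device_b)
print(norm_a[0], norm_b[0])  # nearly identical after normalisation
print(residual_norm(device_a)[0])
```

The normalised branch removes the per-bin gain that distinguishes the two devices, while the shortcut term keeps a scaled copy of the raw input so that information useful for classification is not discarded entirely.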

【4】 AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion Link: https://arxiv.org/abs/2111.06601

Authors: Damien Ronssin, Milos Cernak Affiliations: Logitech Europe S.A., Lausanne, Switzerland; École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland Comments: ASRU 2021 Abstract: This paper presents AC-VC (Almost Causal Voice Conversion), a phonetic-posteriorgram-based voice conversion system that can perform any-to-many voice conversion with only 57.5 ms of future look-ahead. The complete system is composed of three neural networks trained separately with non-parallel data. While most current voice conversion systems focus primarily on quality irrespective of algorithmic latency, this work elaborates on designing a method that uses a minimal amount of future context, thus allowing a future real-time implementation. According to a subjective listening test organized in this work, the proposed AC-VC system achieves parity with the non-causal ASR-TTS baseline of the Voice Conversion Challenge 2020 in naturalness, with a MOS of 3.5. In contrast, the results indicate that missing future context impacts speaker similarity. The obtained similarity percentage of 65% is lower than that of the current best voice conversion systems.

【5】 Disentangling Physical Parameters for Anomalous Sound Detection Under Domain Shifts Link: https://arxiv.org/abs/2111.06539

Authors: Kota Dohi, Takashi Endo, Yohei Kawaguchi Affiliations: Research and Development Group, Hitachi, Ltd., Higashi-koigakubo, Kokubunji-shi, Tokyo, Japan Comments: 4 pages, 4 figures Abstract: To develop a sound-monitoring system for machines, a method for detecting anomalous sound under domain shifts is proposed. A domain shift occurs when a machine's physical parameters change. Because a domain shift changes the distribution of normal sound data, conventional unsupervised anomaly detection methods can output false positives. To solve this problem, the proposed method constrains some latent variables of a normalizing flows (NF) model to represent physical parameters, which enables disentanglement of the factors of domain shifts and learning of a latent space that is invariant with respect to these domain shifts. Anomaly scores calculated from this domain-shift-invariant latent space are unaffected by such shifts, which reduces false positives and improves detection performance. Experiments were conducted with sound data from a slide rail under different operation velocities. The results show that the proposed method disentangled the velocity to obtain a latent space that was invariant with respect to domain shifts, which improved the AUC by 13.2% for Glow with a single block and 2.6% for Glow with multiple blocks.
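
The AUC figure quoted in the abstract above measures how well anomaly scores separate normal from anomalous clips. As a rough illustration, the sketch below computes AUC as the fraction of (anomalous, normal) pairs ranked correctly and shows how shifted normal data that receives inflated scores drags it down. All scores are invented for the example and the function name is hypothetical.

```python
def roc_auc(scores_normal, scores_anomalous):
    """Probability that a random anomalous clip outscores a random normal clip."""
    wins = 0.0
    for a in scores_anomalous:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5  # ties count as half
    return wins / (len(scores_anomalous) * len(scores_normal))


anomalous = [0.8, 0.9, 0.7]

# Normal clips from the training domain get low anomaly scores.
normal_source = [0.1, 0.2, 0.3]
# Hypothetical normal clips after a domain shift (e.g. a new operating
# velocity): a detector that is not shift-invariant scores them too high.
normal_shifted = [0.75, 0.85, 0.3]

print(roc_auc(normal_source, anomalous))   # 1.0
print(roc_auc(normal_shifted, anomalous))  # lower, due to false positives
```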

【6】 MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification Link: https://arxiv.org/abs/2111.06458

Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký Affiliations: Brno University of Technology Comments: Submitted to ICASSP 2022 Abstract: Motivated by the unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can also be readily used for experiments with dereverberation, denoising, and speech enhancement. We tackle the ever-present lack of multi-channel training data by applying data simulation on top of clean parts of the VoxCeleb dataset. The development and evaluation trials are based on a retransmitted Voices Obscured in Complex Environmental Settings (VOiCES) corpus, which we modified to provide multi-channel trials. We publish full recipes that create the dataset from public sources as the MultiSV corpus, and we provide results for two of our multi-channel speaker verification systems with neural network-based beamforming, based either on predicting ideal binary masks or on the more recent Conv-TasNet.

3. eess.AS (Audio Processing):

【1】 HLT-NUS Submission for 2020 NIST Conversational Telephone Speech SRE Link: https://arxiv.org/abs/2111.06671

Authors: Rohan Kumar Das, Ruijie Tao, Haizhou Li Affiliations: Department of Electrical and Computer Engineering, National University of Singapore, Singapore Comments: 3 pages Abstract: This work provides a brief description of the system submitted by the Human Language Technology (HLT) Laboratory, National University of Singapore (NUS), to the 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE). The challenge focuses on evaluation under CTS data containing multilingual speech. The systems developed at HLT-NUS are time-delay neural network (TDNN) x-vector and ECAPA-TDNN systems. We also apply domain adaptation of the probabilistic linear discriminant analysis (PLDA) model and adaptive s-norm to our systems. Score-level fusion of the TDNN x-vector and ECAPA-TDNN systems is carried out, which improves the final performance of our submission to the 2020 NIST CTS SRE.
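
Adaptive s-norm and score-level fusion, both named in the abstract above, are commonly implemented along the following lines. This is a generic sketch of the usual symmetric normalization, not the HLT-NUS configuration: the cohort scores, the `top_n` value and the equal fusion weights are hypothetical.

```python
from statistics import mean, pstdev


def adaptive_s_norm(score, enroll_cohort, test_cohort, top_n=2):
    """Symmetric score normalisation using only the top-N cohort scores.

    enroll_cohort / test_cohort: scores of the enrolment and test utterances
    against a set of impostor (cohort) speakers. The top-N scores must not
    all be equal, otherwise the standard deviation is zero.
    """
    def top_stats(cohort):
        top = sorted(cohort, reverse=True)[:top_n]
        return mean(top), pstdev(top)

    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)


def fuse(scores, weights):
    """Score-level fusion as a weighted sum of per-system scores."""
    return sum(w * s for w, s in zip(weights, scores))


# Made-up raw trial score and cohort scores for one system.
normed = adaptive_s_norm(
    0.9,
    enroll_cohort=[0.1, 0.3, 0.5, 0.2],
    test_cohort=[0.0, 0.4, 0.2, 0.6],
)
# Fuse with a second system's (hypothetical) normalised score.
fused = fuse([normed, 3.5], weights=[0.5, 0.5])
print(round(normed, 3), round(fused, 3))
```

In practice the fusion weights are usually learned on a development set rather than fixed by hand.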

【2】 AC-VC: Non-parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion Link: https://arxiv.org/abs/2111.06601

Authors: Damien Ronssin, Milos Cernak Affiliations: Logitech Europe S.A., Lausanne, Switzerland; École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland Comments: ASRU 2021 Abstract: Cross-listed with cs.SD; the abstract is identical to cs.SD entry 【4】 above.

【3】 Disentangling Physical Parameters for Anomalous Sound Detection Under Domain Shifts Link: https://arxiv.org/abs/2111.06539

Authors: Kota Dohi, Takashi Endo, Yohei Kawaguchi Affiliations: Research and Development Group, Hitachi, Ltd., Higashi-koigakubo, Kokubunji-shi, Tokyo, Japan Comments: 4 pages, 4 figures Abstract: Cross-listed with cs.SD; the abstract is identical to cs.SD entry 【5】 above.

【4】 MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification Link: https://arxiv.org/abs/2111.06458

Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký Affiliations: Brno University of Technology Comments: Submitted to ICASSP 2022 Abstract: Cross-listed with cs.SD; the abstract is identical to cs.SD entry 【6】 above.

【5】 Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR Link: https://arxiv.org/abs/2111.06799

Authors: Ondrej Klejch, Electra Wallington, Peter Bell Affiliations: Centre for Speech Technology Research, University of Edinburgh, United Kingdom Abstract: We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, which we follow with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR that does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When used to generate pseudo-labels for semi-supervised training, we obtain WERs that range from 25% to just 5% absolute worse than the equivalent fully supervised models trained on the same data.
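
The results in the abstract above are stated as absolute differences in word error rate (WER). As a reminder of what that means, here is a minimal sketch of WER as word-level edit distance divided by reference length, followed by an absolute gap between two systems. The sentences and the two system WERs are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    The reference must contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")  # one deletion out of six reference words

# "5% absolute worse" compares WERs by subtraction, not by ratio.
supervised_wer, semi_supervised_wer = 0.20, 0.25  # hypothetical values
absolute_gap = semi_supervised_wer - supervised_wer
print(f"absolute gap = {absolute_gap:.2f}")
```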

【6】 Fully Automatic Page Turning on Real Scores Link: https://arxiv.org/abs/2111.06643

Authors: Florian Henkel, Stephanie Schwaiger, Gerhard Widmer Affiliations: Institute of Computational Perception, Johannes Kepler University, Linz, Austria; LIT Artificial Intelligence Lab, Linz Institute of Technology, Austria Comments: ISMIR 2021 Late Breaking/Demo Abstract: Cross-listed with cs.SD; the abstract is identical to cs.SD entry 【1】 above.

【7】 Domain Generalization on Efficient Acoustic Scene Classification using Residual Normalization Link: https://arxiv.org/abs/2111.06531

Authors: Byeonggeun Kim, Seunghan Yang, Jangho Kim, Simyung Chang Affiliations: Qualcomm AI Research, Qualcomm Korea YH, Seoul, Republic of Korea; Seoul National University, Seoul, Republic of Korea Comments: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021) Abstract: Cross-listed with cs.SD; the abstract is identical to cs.SD entry 【3】 above.