Introduction to CELP Coding

Date: 2023-09-11 14:17:11

Speex is based on CELP, which stands for Code Excited Linear Prediction. This section attempts to introduce the principles behind CELP, so if you are already familiar with CELP, you can safely skip to section 7. The CELP technique is based on three ideas:

The use of a linear prediction (LP) model to model the vocal tract
The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
The search performed in closed-loop in a "perceptually weighted domain"

This section describes the basic ideas behind CELP. Note that it's still incomplete.

Linear Prediction (LPC)

Linear prediction is at the base of many speech coding techniques, including CELP. The idea behind it is to predict the signal using a linear combination of its past samples:

$\begin{displaymath} y[n]=\sum_{i=1}^{N}a_{i}x[n-i]\end{displaymath}$

where $y[n]$ is the linear prediction of $x[n]$. The prediction error is thus given by:

$\begin{displaymath} e[n]=x[n]-y[n]=x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\end{displaymath}$

The goal of the LPC analysis is to find the best prediction coefficients $a_{i}$ which minimize the quadratic error function:

$\begin{displaymath} E=\sum_{n=0}^{L-1}\left[e[n]\right]^{2}=\sum_{n=0}^{L-1}\left[x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\right]^{2}\end{displaymath}$

That can be done by making all derivatives $\frac{\partial E}{\partial a_{i}}$ equal to zero:

$\begin{displaymath} \frac{\partial E}{\partial a_{i}}=\frac{\partial}{\partial a_{i}}\sum_{n=0}^{L-1}\left[x[n]-\sum_{i=1}^{N}a_{i}x[n-i]\right]^{2}=0\end{displaymath}$
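The zero-derivative condition can be expanded one step further. As a sketch, use $j$ for the inner summation index to keep it distinct from the differentiation index $i$, write $R(m)=\sum_{n}x[n]x[n-m]$ for the autocorrelation at lag $m$, and assume the signal is treated as zero outside the analysis frame so that the sums can run over all $n$:

$\begin{displaymath} \frac{\partial E}{\partial a_{i}}=-2\sum_{n}x[n-i]\left[x[n]-\sum_{j=1}^{N}a_{j}x[n-j]\right]=0\quad\Rightarrow\quad\sum_{j=1}^{N}a_{j}R(|i-j|)=R(i),\quad i=1,\ldots,N\end{displaymath}$

Under that assumption, this yields one linear equation per coefficient, $N$ equations in the $N$ unknowns $a_{1},\ldots,a_{N}$.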

The $a_{i}$ filter coefficients are computed using the Levinson-Durbin algorithm, which starts from the auto-correlation $R(m)$ of the signal $x[n]$.

$\begin{displaymath} R(m)=\sum_{i=0}^{L-1}x[i]x[i-m]\end{displaymath}$

For an order $N$ filter, we have:

$\begin{displaymath} \mathbf{R}=\left[\begin{array}{cccc} R(0) & R(1) & \cdots & R(N-1)\\ R(1) & R(0) & \cdots & R(N-2)\\ \vdots & \vdots & \ddots & \vdots\\ R(N-1) & R(N-2) & \cdots & R(0)\end{array}\right]\end{displaymath}$

$\begin{displaymath} \mathbf{r}=\left[\begin{array}{c} R(1)\\ R(2)\\ \vdots\\ R(N)\end{array}\right]\end{displaymath}$

The filter coefficients $a_{i}$ are found by solving the system $\mathbf{Ra}=\mathbf{r}$. The Levinson-Durbin algorithm reduces the cost of solving this system from $\mathcal{O}\left(N^{3}\right)$ to $\mathcal{O}\left(N^{2}\right)$ by exploiting the fact that the matrix $\mathbf{R}$ is Toeplitz and Hermitian. Also, it can be proven that all the roots of $A(z)$ are within the unit circle, which means that $1/A(z)$ is always stable. This is in theory; in practice, because of finite precision, two techniques are commonly used to make sure we have a stable filter. First, we multiply $R(0)$ by a number slightly above one (such as 1.0001), which is equivalent to adding noise to the signal. Second, we can apply a window to the auto-correlation, which is equivalent to filtering in the frequency domain and reduces sharp resonances.
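The recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Speex implementation: the function names `autocorrelation` and `levinson_durbin` are hypothetical, samples outside the frame are treated as zero, and the sign convention follows the prediction formula above ($x[n]\approx\sum a_{i}x[n-i]$).

```python
def autocorrelation(x, order):
    """R(m) = sum_i x[i] * x[i-m] for m = 0..order (samples outside the frame are zero)."""
    return [sum(x[i] * x[i - m] for i in range(m, len(x))) for m in range(order + 1)]


def levinson_durbin(R, order):
    """Solve R a = r for a_1..a_N in O(N^2). Returns (a, final prediction error)."""
    a = [0.0] * order
    err = R[0]
    for i in range(order):
        if err <= 0.0:
            break  # degenerate frame (e.g. all zeros): leave remaining coefficients at 0
        # Reflection coefficient for model order i + 1
        acc = R[i + 1] - sum(a[j] * R[i - j] for j in range(i))
        k = acc / err
        # Update the lower-order coefficients, then append the new one
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k
    return a, err


# Autocorrelation of a first-order process with coefficient 0.5
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, err)
```

The noise-floor trick described above would amount to scaling `R[0]` by 1.0001 before calling `levinson_durbin`.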

The linear prediction model represents each speech sample as a linear combination of past samples, plus an error signal called the excitation (or residual).

$\begin{displaymath} x[n]=\sum_{i=1}^{N}a_{i}x[n-i]+e[n]\end{displaymath}$

In the z-domain, this can be expressed as

$\begin{displaymath} x(z)=\frac{1}{A(z)}\: e(z)\end{displaymath}$

where $A(z)$ is defined as

$\begin{displaymath} A(z)=1-\sum_{i=1}^{N}a_{i}z^{-i}\end{displaymath}$

We usually refer to $A(z)$ as the analysis filter and $1/A(z)$ as the synthesis filter. The whole process is called short-term prediction as it predicts the signal $x[n]$ using only the $N$ past samples, where $N$ is usually around 10.
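The relationship between the two filters can be illustrated with a small round trip: filtering a signal with $A(z)$ gives the excitation, and filtering that excitation with $1/A(z)$ gives the signal back. The sketch below uses hypothetical function names and assumes all samples before the start of the signal are zero.

```python
def analysis_filter(x, a):
    """A(z): e[n] = x[n] - sum_{i=1..N} a[i-1] * x[n-i]."""
    e = []
    for n in range(len(x)):
        pred = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        e.append(x[n] - pred)
    return e


def synthesis_filter(e, a):
    """1/A(z): x[n] = e[n] + sum_{i=1..N} a[i-1] * x[n-i]."""
    x = []
    for n in range(len(e)):
        pred = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        x.append(e[n] + pred)
    return x


signal = [1.0, -0.5, 0.25, 2.0]
coeffs = [0.5, -0.25]  # illustrative values, not taken from real speech
excitation = analysis_filter(signal, coeffs)
restored = synthesis_filter(excitation, coeffs)
print(excitation, restored)
```

Note that the analysis filter only looks at past input samples, while the synthesis filter feeds back its own past output.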

Because LPC coefficients have very little robustness to quantization, they are converted to Line Spectral Pair (LSP) coefficients, which behave much better under quantization. One of their advantages is that it is easy to keep the filter stable.

Pitch Prediction

During voiced segments, the speech signal is nearly periodic, so it is possible to take advantage of that property by approximating the excitation signal by a gain times the past of the excitation:

$\begin{displaymath} e[n]\simeq p[n]=\beta e[n-T]\end{displaymath}$

where $T$ is the pitch period and $\beta$ is the pitch gain. We call that long-term prediction since the excitation is predicted from $e[n-T]$ with $T\gg N$.
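One simple way to pick $T$ and $\beta$ is an exhaustive search over candidate lags, choosing the lag whose optimal gain removes the most energy from the excitation. The sketch below is an assumption made for illustration (a single-tap, open-loop search on the excitation itself with a hypothetical function name), not a description of how any particular codec performs its search.

```python
def pitch_search(e, t_min, t_max):
    """Return (T, beta) minimizing sum_n (e[n] - beta * e[n-T])^2 for n >= t_max."""
    best_T, best_beta, best_score = t_min, 0.0, 0.0
    for T in range(t_min, t_max + 1):
        corr = sum(e[n] * e[n - T] for n in range(t_max, len(e)))
        energy = sum(e[n - T] ** 2 for n in range(t_max, len(e)))
        if energy > 0.0 and corr > 0.0:
            # With the optimal gain corr/energy, the error drops by corr^2/energy
            score = corr * corr / energy
            if score > best_score:
                best_T, best_beta, best_score = T, corr / energy, score
    return best_T, best_beta


# Pulse train with period 5 whose amplitude halves every period
e = [0.5 ** (n // 5) if n % 5 == 0 else 0.0 for n in range(30)]
print(pitch_search(e, 2, 8))
```

For a fixed lag $T$, the optimal gain follows from setting the derivative of the squared error with respect to $\beta$ to zero, which gives the ratio of the cross-correlation to the energy of the delayed signal.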

Innovation Codebook

The final excitation will be the sum of the pitch prediction and an innovation signal taken from a fixed codebook, hence the name Code Excited Linear Prediction. The final excitation is given by:

$\begin{displaymath} e[n]=p[n]+c[n]=\beta e[n-T]+c[n]\end{displaymath}$

The quantization of $c[n]$ is where most of the bits in a CELP codec are allocated. It represents the information that couldn't be obtained either from linear prediction or pitch prediction. In the z-domain we can represent the final signal as

$\begin{displaymath} X(z)=\frac{C(z)}{A(z)\left(1-\beta z^{-T}\right)}\end{displaymath}$
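The pitch part of this expression, $1/\left(1-\beta z^{-T}\right)$, is a recursive filter: each new excitation sample reuses the excitation from $T$ samples earlier, including samples produced earlier in the same block. A minimal sketch, assuming a hypothetical helper name and a history buffer holding at least $T$ past excitation samples:

```python
def build_excitation(past_e, c, T, beta):
    """e[n] = beta * e[n-T] + c[n], where past_e holds at least T previous samples."""
    if len(past_e) < T:
        raise ValueError("need at least T past excitation samples")
    buf = list(past_e)
    start = len(buf)
    for n in range(len(c)):
        # buf[start + n - T] is e[n-T]; it may come from the history or from this block
        buf.append(beta * buf[start + n - T] + c[n])
    return buf[start:]


# A single past pulse keeps echoing every T = 2 samples, halving each time
print(build_excitation([1.0, 0.0], [0.0, 0.0, 0.0, 0.0], 2, 0.5))
```

Passing the result through the synthesis filter $1/A(z)$ would then give the decoded signal.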

Analysis-by-Synthesis and Error Weighting

Most (if not all) modern audio codecs attempt to "shape" the noise so that it appears mostly in the frequency regions where the ear cannot detect it. For example, the ear is more tolerant of noise in parts of the spectrum that are louder, and less tolerant of it in quieter parts. That's why instead of minimizing the simple quadratic error

$\begin{displaymath} E=\sum_{n}\left(x[n]-\overline{x}[n]\right)^{2}\end{displaymath}$

where $\overline{x}[n]$ is the encoded signal, we minimize the error for the perceptually weighted signal

$\begin{displaymath} X_{w}(z)=W(z)X(z)\end{displaymath}$

where $W(z)$ is the weighting filter, usually of the form

$\begin{displaymath} W(z)=\frac{A\left(\frac{z}{\gamma_{1}}\right)}{A\left(\frac{z}{\gamma_{2}}\right)} \end{displaymath}$

(1)

with control parameters $\gamma_{1}>\gamma_{2}$ . If the noise is white in the perceptually weighted domain, then in the signal domain its spectral shape will be of the form

$\begin{displaymath} A_{noise}(z)=\frac{1}{W(z)}=\frac{A\left(\frac{z}{\gamma_{2}}\right)}{A\left(\frac{z}{\gamma_{1}}\right)}\end{displaymath}$

If a filter $A(z)$ has (complex) roots at $p_{i}$ in the $z$-plane, the filter $A(z/\gamma)$ will have its roots at $p'_{i}=\gamma p_{i}$, making it a flatter version of $A(z)$.
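In terms of coefficients, substituting $z/\gamma$ into $A(z)=1-\sum a_{i}z^{-i}$ simply scales each coefficient: $a_{i}$ becomes $a_{i}\gamma^{i}$. The sketch below applies $W(z)$ this way; the function names are hypothetical and samples before the start of the signal are assumed to be zero.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma^i for i = 1..N."""
    return [a_i * gamma ** (i + 1) for i, a_i in enumerate(a)]


def weighting_filter(x, a, gamma1, gamma2):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2) to the signal x."""
    num = bandwidth_expand(a, gamma1)
    den = bandwidth_expand(a, gamma2)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= num[i - 1] * x[n - i]  # numerator A(z/gamma1): past inputs
                acc += den[i - 1] * y[n - i]  # denominator A(z/gamma2): past outputs
        y.append(acc)
    return y


print(bandwidth_expand([0.5, 0.25], 0.5))
```

When `gamma1` equals `gamma2`, the numerator and denominator cancel and the filter leaves the signal unchanged, which is a convenient sanity check.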

Analysis-by-synthesis refers to the fact that when trying to find the best pitch parameters ($T$, $\beta$) and innovation signal $c[n]$, we do not work by making the excitation as close as possible to the original one (which would be simpler), but apply the synthesis (and weighting) filter and try making $X_{w}(z)$ as close to the original as possible.
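A closed-loop codebook search can be sketched as follows: each candidate vector is passed through the synthesis filter, and the candidate whose filtered output best matches the target (with its optimal gain) is kept. This is a simplified illustration under stated assumptions: the names are hypothetical, the filter starts from zero state, and the weighting filter is left out for brevity (in practice the coefficients `a` would describe the combined weighted synthesis filter).

```python
def synthesize(e, a):
    """Zero-state synthesis filter 1/A(z): x[n] = e[n] + sum_i a[i-1] * x[n-i]."""
    x = []
    for n in range(len(e)):
        pred = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        x.append(e[n] + pred)
    return x


def search_codebook(target, codebook, a):
    """Return (index, gain) of the entry whose synthesized output best fits target."""
    best_k, best_gain, best_score = -1, 0.0, 0.0
    for k, c in enumerate(codebook):
        y = synthesize(c, a)
        corr = sum(t * v for t, v in zip(target, y))
        energy = sum(v * v for v in y)
        if energy > 0.0:
            # Error reduction obtained with the optimal gain corr/energy
            score = corr * corr / energy
            if score > best_score:
                best_k, best_gain, best_score = k, corr / energy, score
    return best_k, best_gain


a = [0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
target = [2.0 * v for v in synthesize(codebook[1], a)]
print(search_codebook(target, codebook, a))
```

The comparison happens on the synthesized signals rather than on the excitation vectors themselves, which is the point of the analysis-by-synthesis approach.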
