Audio data modeling, an end-to-end code example: predicting age from a speaker's voice (2)

Posted by: 數據派THU   Date: 2022-03-13   Source: Engineer


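Feature extraction

The data is clean, so it is time to look at the audio-specific features that can be extracted.
1. Onset detection
By looking at the waveform of a signal, librosa can do a reasonably good job of identifying the onset of a new spoken word.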

# Import librosa and the plotting helpers
import librosa
import librosa.display
from matplotlib import pyplot as plt

# Load the mp3 file with a specific sampling rate, here 16 kHz
y, sr = librosa.load("c4_sample-1.mp3", sr=16_000)

# Plot the signal stored in 'y'
plt.figure(figsize=(12, 3))
plt.title("Audio signal as waveform")
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(y, sr=sr)
plt.show()

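2. Length of the recording

Closely related to this is the length of the recording. The longer the recording, the more words can be spoken. So let's compute the length of the recording and the rate at which words are spoken.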

# number_of_words is assumed to hold the number of word onsets detected in step 1
duration = len(y) / sr
words_per_second = number_of_words / duration
print(f"""The audio signal is {duration:.2f} seconds long,
with an average of {words_per_second:.2f} words per second.""")

>>> The audio signal is 1.70 seconds long,
>>> with an average of 4.13 words per second.

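3. Tempo
Speech is a very melodic signal, and everyone has their own way and speed of speaking. Another feature that can be extracted is therefore the tempo of the speech, i.e. the number of beats that can be detected in the audio signal.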


# Computes the tempo of an audio recording
# (recent librosa versions require y and sr as keyword arguments)
tempo = librosa.beat.tempo(y=y, sr=sr, start_bpm=10)[0]
print(f"The audio signal has a speed of {tempo:.2f} bpm.")

>>> The audio signal has a speed of 42.61 bpm.

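4. Fundamental frequency
The fundamental frequency is the lowest frequency at which a periodic sound occurs. In music it is also known as pitch. In the spectrogram plots seen earlier, the fundamental frequency (also called f0) is the lowest bright horizontal band in the image. The repetitions of this band pattern above the fundamental are called harmonics.
To better illustrate what exactly is meant, let's extract the fundamental frequency and plot it on top of the spectrogram.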


import numpy as np

# Extract fundamental frequency using a probabilistic approach
f0, _, _ = librosa.pyin(y, sr=sr, fmin=10, fmax=8000, frame_length=1024)

# Establish timepoint of f0 signal
timepoints = np.linspace(0, duration, num=len(f0), endpoint=False)

# Plot fundamental frequency in spectrogram plot
plt.figure(figsize=(8, 3))
x_stft = np.abs(librosa.stft(y))
x_stft = librosa.amplitude_to_db(x_stft, ref=np.max)
librosa.display.specshow(x_stft, sr=sr, x_axis="time", y_axis="log")
plt.plot(timepoints, f0, color="cyan", linewidth=4)
plt.show()

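The line visible at around 100 Hz (drawn in cyan by the code above) is the fundamental frequency. But how can it be used for feature engineering? One option is to compute specific summary statistics of this f0.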

# Computes mean, median, std, 5%- and 95%-percentile value of fundamental frequency
f0_values = [
    np.nanmean(f0),
    np.nanmedian(f0),
    np.nanstd(f0),
    np.nanpercentile(f0, 5),
    np.nanpercentile(f0, 95),
]
print("""This audio signal has a mean of {:.2f}, a median of {:.2f}, a
std of {:.2f}, a 5-percentile at {:.2f} and a 95-percentile at {:.2f}.""".format(*f0_values))
>>> This audio signal has a mean of 81.98, a median of 80.46, a
>>> std of 4.42, a 5-percentile at 76.57 and a 95-percentile at 90.64.

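Beyond the techniques covered above, there are many more audio feature extraction techniques to explore, which are not covered in detail here.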

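Exploratory data analysis (EDA) of an audio dataset

Now that we know what audio data looks like and how to process it, let's perform a proper EDA on it. First, download a dataset: Common Voice from Kaggle. This 14 GB dataset is only a small snapshot of the 70+ GB dataset from Mozilla. For the examples in this article, only a subsample of about 9,000 audio files from this dataset is used.
Let's look at this dataset and some of the features that have already been extracted.

1. Investigating the feature distributions

Class distribution of the target classes age and gender.

The target class distributions are imbalanced.

As a next step, let's take a closer look at the value distributions of the extracted features.

Except for words_per_second, most of these feature distributions are right-skewed and could therefore benefit from a log transformation.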

import numpy as np
# Applies log1p on features that are not age, gender, filename or words_per_second
df = df.apply(
    lambda x: np.log1p(x)
    if x.name not in ["age", "gender", "filename", "words_per_second"]
    else x)
# Let's look at the distribution once more
df.drop(columns=["age", "gender", "filename"]).hist(
  bins=100, figsize=(14, 10))
plt.show();

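Much better, but interestingly the f0 features all seem to have a bimodal distribution. Let's plot the same thing as before, but this time separated by gender.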
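
One way to check the gender split numerically, alongside the plots, is to group an f0 feature by gender. The following is a minimal sketch with pandas on a tiny made-up DataFrame; the column name `f0_mean` and all values are hypothetical stand-ins for the extracted features.

```python
import pandas as pd

# Hypothetical toy data; the real df holds the extracted features per recording
df_demo = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female", "female"],
    "f0_mean": [4.4, 4.5, 4.6, 5.2, 5.3, 5.4],  # log-transformed values
})

# Per-gender median of the f0 feature: two well-separated values
# are what produces the bimodal shape in the pooled histogram
medians = df_demo.groupby("gender")["f0_mean"].median()
print(medians)

# Histograms split by gender (one panel per group) could then be drawn with:
# df_demo.hist(column="f0_mean", by="gender", bins=20)
```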

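As suspected, there seems to be a gender effect here! But we can also see that some f0 values (here especially for male speakers) are much lower or higher than they should be. These may be outliers caused by poor feature extraction. Let's take a closer look at all data points in the following plot.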

# Plot sample points for each feature individually
df.plot(lw=0, marker=".", subplots=True, layout=(-1, 3),
        figsize=(15, 7.5), markersize=2)
plt.tight_layout()
plt.show()

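Given the small number of features, and the fact that their distributions are fairly clean with pronounced tails, we can go through each of them and determine an outlier cut-off threshold feature by feature.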
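
A minimal sketch of such a feature-by-feature cleanup, assuming the thresholds are chosen by eye from the plots. The column names and cut-off values below are hypothetical and would need to be set for the actual dataset.

```python
import pandas as pd

# Hypothetical toy data with one obviously implausible value per column
df_demo = pd.DataFrame({
    "f0_mean": [4.4, 4.5, 1.0, 5.3, 5.2],
    "duration": [1.2, 1.4, 1.3, 9.9, 1.1],
})

# Hand-picked (lower, upper) bounds per feature; None means "no bound"
thresholds = {
    "f0_mean": (3.0, None),
    "duration": (None, 5.0),
}

# Start with all rows kept, then drop rows violating any bound
mask = pd.Series(True, index=df_demo.index)
for col, (low, high) in thresholds.items():
    if low is not None:
        mask &= df_demo[col] >= low
    if high is not None:
        mask &= df_demo[col] <= high

df_clean = df_demo[mask]
print(f"Kept {len(df_clean)} of {len(df_demo)} rows")
```

Keeping the thresholds in one dictionary makes it easy to review and adjust each cut-off without touching the filtering logic.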

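2. Feature correlations
As a next step, let's look at the correlations between all features. Before doing so, the non-numerical target features need to be encoded. scikit-learn's OrdinalEncoder could be used for this, but it might destroy the correct order in the age feature. So the mapping is done manually here.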
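
A minimal sketch of such a manual mapping with pandas. The category labels below (including their spelling) are assumptions about how age and gender are stored in the dataset and should be checked against the actual column values.

```python
import pandas as pd

# Hypothetical toy data; labels are assumed, not taken from the dataset
df_demo = pd.DataFrame({
    "age": ["twenties", "teens", "sixties", "thirties"],
    "gender": ["male", "female", "female", "male"],
})

# Ordered age categories map to increasing integers, preserving their order
age_order = ["teens", "twenties", "thirties", "forties", "fifties", "sixties"]
age_map = {label: i for i, label in enumerate(age_order)}
gender_map = {"male": 0, "female": 1}

df_demo["age"] = df_demo["age"].map(age_map)
df_demo["gender"] = df_demo["gender"].map(gender_map)

# Labels missing from a mapping become NaN, so check for that explicitly
assert not df_demo.isna().any().any(), "unmapped label found"
print(df_demo)
```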


Now pandas' .corr() function and seaborn's heatmap() can be used to gain more insight into the feature correlations.


import seaborn as sns

plt.figure(figsize=(8, 8))
# numeric_only=True skips non-numeric columns such as filename (pandas >= 1.5)
df_corr = df.corr(numeric_only=True) * 100
sns.heatmap(df_corr, square=True, annot=True, fmt=".0f",
            mask=np.eye(len(df_corr)), center=0)
plt.show()

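Very interesting! The extracted f0 features seem to have a fairly strong relationship with the gender target, while age does not seem to correlate much with any of the other features.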

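3. Spectrogram features
So far we have not looked at the actual recordings. As seen earlier, there are many options (i.e. the waveform, or STFT, mel or MFCC spectrograms).
The audio samples all have different lengths, which means the spectrograms will have different lengths as well. To standardize all recordings, they are first cut to a length of exactly 3 seconds: samples that are too short are padded, and samples that are too long are truncated.
Once all these spectrograms are computed, we can go ahead and perform some EDA on them! And because "gender" seems to have a special relationship with the recordings, the average mel spectrogram is visualized for both genders separately, along with their difference.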
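
The padding and truncation step can be sketched as follows. This is a minimal NumPy version, assuming zero-padding at the end of the signal (where the padding goes is not stated in the article); the helper name `fix_length` and the `_demo` variables are illustrative. librosa also provides `librosa.util.fix_length` for the same purpose.

```python
import numpy as np


def fix_length(y, sr, seconds=3.0):
    """Zero-pad or truncate a 1-D signal to exactly `seconds` seconds."""
    target = int(round(seconds * sr))
    if len(y) >= target:
        return y[:target]
    return np.pad(y, (0, target - len(y)), mode="constant")


sr_demo = 16_000
short = np.ones(1 * sr_demo)   # 1 s, gets padded with zeros at the end
long_ = np.ones(5 * sr_demo)   # 5 s, gets cut after 3 s

print(fix_length(short, sr_demo).shape, fix_length(long_, sr_demo).shape)
# → (48000,) (48000,)
```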

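On average, male speakers have a lower voice than female speakers. This can be seen in the difference plot as greater intensity at the lower frequencies (the red horizontal region).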
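
Once all spectrograms share the same shape, the per-gender averages and their difference reduce to a few array operations. The following is a minimal sketch with NumPy, using random arrays in place of real mel spectrograms; the array shape and the 0/1 gender encoding are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a stack of mel spectrograms in dB: (n_samples, n_mels, n_frames)
specs = rng.normal(size=(10, 64, 94))
gender = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 1])  # assumed: 0 = male, 1 = female

# Average over the sample axis separately for each gender
mean_male = specs[gender == 0].mean(axis=0)
mean_female = specs[gender == 1].mean(axis=0)

# Positive values mark time-frequency bins with more energy for male speakers
diff = mean_male - mean_female
print(diff.shape)
# → (64, 94)
```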

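Model selection

We are now ready to start modeling, and there are multiple options. Regarding the model, we could:

Train a classical (i.e. shallow) machine learning model, such as LogisticRegression or SVC.
Train a deep learning model, i.e. a deep neural network.
Use a pretrained neural network from TensorflowHub for feature extraction, and then train a shallow or deep model on these high-level features.

And regarding the data to train on, we could use:

The data from the CSV file, combined with the "mel strength" features from the spectrograms, treated as a tabular dataset.
The mel spectrograms alone, treated as an image dataset.
The high-level features extracted with an existing TensorflowHub model, combined with the other tabular data and treated as a tabular dataset.

Of course, there are many other approaches and ways to create the dataset for the modeling part. Because we are not using the full dataset, this article sticks to the simplest machine learning model.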

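Classical (i.e. shallow) machine learning models

Here we take the data from the EDA and combine it with a simple LogisticRegression model to see how well the speaker's age can be predicted. In addition, GridSearchCV is used to explore different hyperparameter combinations and to perform cross-validation.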
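
The grid search below expects training and test arrays (`x_tr`, `y_tr`, `x_te`, `y_te`) that are not defined in this section. The following is a minimal sketch of how they might be created with scikit-learn's `train_test_split`, assuming a stratified split on the age label. The random toy data, the split ratio and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in: 200 samples, 5 features, 4 age classes
x = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

# Stratify so that each age class keeps its share in both subsets,
# which matters because the target classes are imbalanced
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)

print(x_tr.shape, x_te.shape)
# → (160, 5) (40, 5)
```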


import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler, PowerTransformer, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create pipeline
pipe = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("pca", PCA()),
        ("logreg", LogisticRegression(class_weight="balanced")),
    ])

# Create grid
grid = {
    "scaler": [RobustScaler(), PowerTransformer(), QuantileTransformer()],
    "pca": [None, PCA(0.99)],
    "logreg__C": np.logspace(-3, 2, num=16),
}

# Create GridSearchCV
grid_cv = GridSearchCV(pipe, grid, cv=4, return_train_score=True, verbose=1)

# Train GridSearchCV
model = grid_cv.fit(x_tr, y_tr)

# Collect results in a DataFrame
cv_results = pd.DataFrame(grid_cv.cv_results_)

# Select the columns we are interested in
col_of_interest = [
    "param_scaler",
    "param_pca",
    "param_logreg__C",
    "mean_test_score",
    "mean_train_score",
    "std_test_score",
    "std_train_score",
]
cv_results = cv_results[col_of_interest]

# Show the dataframe sorted according to our performance metric
cv_results.sort_values("mean_test_score", ascending=False)

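As a complement to the DataFrame output above, the performance scores can also be plotted as a function of the explored hyperparameters. But because multiple scalers and PCA options were used, a separate plot is needed for each hyperparameter combination.

In the plots, we can see that overall the models perform about equally well. Some show a quicker "drop-off" as the value of C decreases, while others show a wider gap between train and test (here actually validation) scores, especially when PCA is not used.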

Now let's take the best_estimator_ model and see how it performs on the withheld test set.

# Compute score of the best model on the withheld test set
best_clf = model.best_estimator_
best_clf.score(x_te, y_te)
>>> 0.4354094579008074

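For a multi-class problem, an accuracy of about 44% is already a good score. But to better understand how well the classification model performs, we can print the corresponding confusion matrix.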
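
A minimal sketch of computing both a raw and a row-normalized confusion matrix with scikit-learn's `confusion_matrix`. The toy labels below are made up for illustration; for the model in this article the inputs would be `y_te` and `best_clf.predict(x_te)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for three classes (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 2, 1, 1])

# Raw counts: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Row-normalized: the diagonal holds the per-class recall
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")

print(cm)
print(cm_norm.round(2))
```

The raw counts show which classes attract the most predictions, while the normalized version makes classes of different sizes comparable.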

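Although the model detects more samples from the twenties class than from any other class (left confusion matrix), overall it is actually better at classifying entries from the teens and sixties classes (e.g. with an accuracy of 59% and 55% respectively).

Summary

In this article, we first saw what audio data looks like, then which different forms it can be transformed into, how to clean and explore it, and finally how it can be used to train some machine learning models. If you have any questions, feel free to leave a comment.
The source code for this article can be downloaded here: https://github.com/miykael/miykael.github.io/blob/master/assets/nb/04_audio_data_analysis/nb_audio_eda_and_modeling.ipynb
Author: Michael Notter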
