출처

Nguyen, Bac, and Fabien Cardinaux. “Nvc-net: End-to-end adversarial voice conversion.” ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.APA

0 Abstract

voice conversion의 idea는 voice의 idnetity를 한 사람에서 다른 사람으로 변환하는 것임.
다양한 voice conversion 접근은 vocoder에서 acoustic feature을 speech으로 reconstruct하게 하는 것이었기 때문에 speech 의 quality 가 vocoder에 엄청나게 의존되어 왔음
본 연구에서는 end to end advarsarial network인 NVC-Net을 제안함
- 이는 raw audio waveform에 직접적으로 voice coversion을 적용 가능함
- speech content으로부터 speaker identity를 구분함으로서 NVC NEt은 non parallel traditional many to manyvoic conversion 뿐만 아니라 unseen target speaker으로의 짧은 발화를 zero shot으로 conversion할 수 있음
- 특히 중요한건 NVC Net은 non auto regressive하고 fully convolutional하기 때문에 빠른 inference 가 가능함

1 Introduction

voice conversion은 linguistic info는 유지한 채, target speaker처럼 source speaker 가 발화하게 하는 것
초기의 voice conversion 방법은 parallel training data를 요했는데, 이는 다른 speaker가 같은 linguistic content를 발화한 데이터였음
- 이러한 방법의 단점은 source와 target 발화의 mis alignment에 대한 민감성이었음
  - source 와 target발화의 alignment가 일치해야지 학습이 유효하다는 것
- 그리고 이러한 parallel training data를 모은다는 것은 시간이 너무많이 걸리고 다양한 situation에서는 적용도 안되니까 비효율적임
- 따라서 non parallel training data를 이용한 GANs 가 voice conversion에 사용되었음
  - GAN 기반의 방법은 parallel 방법보다 좋은 성능을 내지 못한다는 한계가 존재함
다른 common 문제는 voice conversion이 raw audio waveform을 모델링한다는 것임
- 좋은 모델은 audio signal에 있는 short/long temporal dependency를 capture해야하는데 이게 high temporal resoultion을 가진 발화에서는 쉽지가 않음
  - 아주 간단한 해결책은 temproal resoultion을 lower dimensional represetation(acoustic feature)으로 reduce하는 것이지만 이러한 경우에는 vocoder으로 꼭 waveform을 reconstruct해줘야하는 번거로움이 잇음
이러한 문제를 overcome 하기 위해서 좀더 정교한 neural vocoder이 제안되었지만, auto regressive model은 엄청 느린 inference를 보이고 non auto regressive 모델과 같은 경우에는 계산량이나 메모리사용량이 너무 큼
- 계산량이나 메모리 이슈를 해결한 모델이 있지만, training과 inference간의 feature가 불일치한다는 문제가 있음
- feature mismatch 문제는 vocoder으로부터 하여금 nosiy하거나 collapsed speech를 산출하게 함 (특히 training data가 부족할때)
zero shot voice conversion에 대한 연구는 아직 부족함
본 연구에서는 raw audio waveform에 voice conversion을 적용할 때 발생하는 문제점을 해결하고자 하였으며 main contribution들은 아래와 같음
1. NCV-Net을 제안함
  1. adversarially하게 training된 many to many voice conversion 방법으로 parallel data를 요하지 않음
  2. speaker의 identity를 speech content으로부터 disentangle하고자 함
2. 다른 GAN 기반의 voice conversion방식과는 다르게 NVC-Net은 새로운 raw audio를 추가적인 vocoder의 학습 없이도 생성할 수 있음
  1. 따라서 independent vocoder을 사용했을 때 발생하는 mismatch 문제를 해소할 수 있음
  2. 또한 vocoder을 사용하지 않음으로서 빠른 inference time을 보여줌
3. NVC-Net은 zero shot voice coversion 문제를 해결하기 위해서 speaker의 representation을 constraining하였음
  1. speaker의 latent space에 대해서 Kullback-Leibler regulation을 적용해서 speaker의 representation이 input의 small variation에도 견고할 수 있도록 했음

introduction부분에서 설명된 내용들은 제외함

GAN은 data의 distribution에 대한 explicit한 가정을 하지 않기 때문에, GAN은 VAE보다 더 smoothed audio를 산출해 낼 수 있음
voice coversion의 관점에서는 speaker identity와 linguistic한 content의 compositon를고려해야함.
- speaker identity를 linguistic content으로부터 구분하는 것은 speaker identity를 독립적으로 바꿀 수 있음을 의미함
ASR 모델을 통해서 source speech으로부터 linguistic feature을 추출하고자 했고, 추출된 linguistic feature을 기반으로 speaker dependent model을 사용해서 target speaker으로 변환하고자 하는 시도가 있었음
- 이런모델들은 non parallel voice conversion을 수행할 수 있지만 해당 모델의 성능이 ASR model에 크게 의존한다는 단점이 있음

3 NVC-NET

본 연구에서의 제안은 아래와 같은 목표를 지님
1. latent embedding으로부터 highly perceptually similar audio waveform을 reconstruct하자
2. conversion 동안에 speaker invariant info는 보존하자
3. target speaker 에 대한 high-fidelity 즉 target speaker와 아주 유사한 audio를 생성해 내자
NVC Net은 content Encoder Ec, speaker encoder ES, generator G, 3개의 discriminator D(k) 로 구성되어 있음
- D(k)는 input의 temporal resoultion에 따라서 다르게 적용됨 (k=1,2,3)
전체적인 구조는 아래와 같음
본 연구에서의 가정은 아래와 같음
1. 발화 x가 speaker identity, speech content 즉 두개의 latent embedding으로부터 생성됨
  1. content는 speaker 간의 불변하는 정보임. (ex . phonetic 과 다른 prosodic간의 정보)
2. y 발화자로부터 나온 x 발화를 y ~ 발화자로부터 나온 x~ 발화로 변환하기 위해서는 x를 content encoder c = Ec(x)를 통해서 content embedding 시켜줌
3. 추가적으로 target speaker embedding z~는 speaker Encoder Es(X~) 으로부터의 output distribution 에서 샘플링 된것
4. target speaker embeding에 conditioned 된 content embedding 으로부터 새로운 raw audio를 생성함

3.1 Training objectives

3.1.1 Adversarial Loss

합성음성과 원본 음성을 구분할 수 없도록 합성음성을 생성하기 위해서 discriminator Ladv(D(k))와 generation Ladv(Ec,Es,G)에 대한 adversarial loss는 아래와 같이 정의됨
- encoder들과 generator가 discriminator을 속이기 위해서 훈련되고, discriminator들은 multiple binary classification task를 simultaneously하게 해결할 수 있도록 훈려됨
- 각각의 discriminator들은 multiple output branch를 가지고 있는데, 각각의 branch 는 하나의 task에 해당함.
- 각 task는 input 발화가 generator으로부터 생성된 것인지 아닌지를 구별해 내는 것을 목표로 함
- branch의 개수는 speaker의 개수와 동일함
- Discriminator D(k)를 class y 의 발화 x을 통해서 updating할 때, y번째 branch output D(k)(x)[y] 가 틀렸을 때만 penalize하고 다른 output 은 untouched 됨
- 본 연구에서는 source atterance x 를 위한 reference utterance x~ 는 같은 minibatch안에서 랜덤하게 샘플링됨
- 위의 식을 간단히 설명하자면
  - Ladv(D(k))
    - 식의 앞부분은 Discriminator가 Dk(x)[y]에 대한 정답을 틀렸을 때 패널티를 받게 하는 부분임
    - 식의 뒷부분은 Generator을 통해서 생성된 발화를 discriminator에 입력했을때 이를 구별해 낼 수 있는가에 대한 부분임
  - Ladv(Ec,Es,G)
    - Generator에 관한 adversrial loss으로, generator가 생성한 utterance를 입력으로 햇을 때 discriminator가 얼마나 잘 판별하는지를 나타냄. log안의 부분이 작아질 수 있도록 학습됨

3.1.2 Reconstruction Loss

audio 를 생성할 때 content와 speaker embedding으로부터의 input을 reconstructing해서 generator G가 speaker embedding을 사용할 수 있도록 함
이를 위해서 원본과 생성된 audio간의 차이를 point wise으로 계산할 수 있지만, point wise loss는 두 오디오간의 차이를 정확하게 capture해 내지 못함
- 이는 두개의 perceptually identical audio가 같은 audio waveform을 가지고 있지 않을수도 있기 때문임

따라서 본 연구에서는 feature matching loss를 아래와 같이 사용함

D(k)i 는 D(k)의 i번째 레이어의 Ni unit로부터 출력된 feature map을 의미하고,

는 l1 정규화를 의미, L은 layer의 개수를 의미함

speech의 특성을 보다 반영해주기 위해서 spectral loss도 추가로 사용함
- 세타()는 w의 FFT size를 기반으로 mel spectrogram의 log-magnitude를 계산하는 함수
- ~ 2는 l2 정규화를 의미
- spectral domain에서 정답 시그널과 생성된 시그널간의 차이를 게산하는 것은 위상(phase) 가 불변하게 된다는 장점을 가짐
- 선행 연구에 따르면 spectral loss는 spatial temporal resolution 에 따라서 FFT size를 다르게 하여 계산함. 따라서 전체적인 reconstruction loss는 모든 spectral loss들과 feature matching loss를 다 더한 값임

3.1.3 Content preservation Loss

converted 된 발화가 speaker의 고유한 특성을 보존하기 위해서 아래 loss를 minimize하는 쪽으로 학습함
- 본 연구에서는 content Encoder Ec가 input x에 대한 essential 한 특성을 보존함과 동시에 speaker identity z~를 변경할 수 있도록 함
- 위의 loss를 추가함으로서 얻게되는 잠재적인 이점은 아래와 같음
  1. cycle conversion을 가능하게 함. (speaker A→B, speakerB→A)
  2. speech content으로부터 speaker identity를 부분할 수 있도록 함
    - 만약 다른 speaker들에 해당하는 발화들의 content embedding이 같다면, speaker information이 content embedding에 embedding되지 못함.
    - 연구진들은 위의 식에서의 l2 norm 값이 Ec의 출력값의 크기에 영향을 받는 다는 것을 확인함
      - 즉, content embedding의 크기가 상대적으로 작으면, 두개의 content embedding 간의 거리를 의미 없게 만듦
      - 이러한 상황을 방지하기 위해서 content embedding이 spatial 차원에서의 unit l2 정규화를 적용함

3.1.4 Kullback-Leibler Loss

speaker의 latent space에 대해서 보다 정교한 sampling을 진행하기 위해서 speaker output distribution에 대해서 deviation을 penalize했음
- DKL은 KL divergence를 의미하고, p(z x)는 Es(x)에 대한 output 분포를 의미함 (ES: Speaker encoder)
- speaker 의 latent space를 constrain하는 것은 inference에서 speaker의 embedding을 샘플링하는데에 두가지 간단한 방법을 제공함
  1. sample from prior distribution
  2. sample from p(z X) → 즉 speaker encoder의 output 분포에서의 sample
- 더 나아가서 해당 식은 speaker embdding이 smoother하게 하도록 도와주어서 unseen sample들에 대해서 좀더 쉽게 일반화될 수 있도록 했음

3.2 Model Architectures

3.2.1 Content encoder

content encoder은 fully conv neural network임
waw audio waveform을 encoded content representation으로 mapping함
content embedding은 input에 비하여 temoral resolution이 256배 작음
residual block에 conditional input이 추가적으로 입력되지 않음
각각의 residual block에 대해서 dilation을 높이면서, 장기적인 temporal dependency를 capture하고자 했음

3.2.2 Speaker encoder

speaker encoder은 발화의 encodede 된 speaker representation을 encoding함
p(z x)가 조건부로 가우시안 분포에 독립적이라고 가정함
speaker의embedding은 output distribution에서 sampling하여 얻어짐
- reparametrized trick 사용
audio signal 으로부터 mel spectrogram을 추출해서 speaker encoder의 input으로 사용함
- average pooling을 사용해서 temporal한 차언을 없애주었음

3.2.3 Generator

speaker embedding을 raw audio로 mapping 해줌
본 연구에서는 MelGAN을 inherit함
- 본 연구에서는 미리 계산된 mel spectrogram을 upsampling하는 것 대신, content encoder으로부터의 input을 받아옴
- upsampling은 transpose con layer을 통해서 수행
- 본연구에서는 gated tanh 비선형 함수를 res block에 사용함
  - speaker embedding은 1x1 conv layer 을 통해서 차원이 축소되어서 dilation layer 에 사용된 feature map의 크기와 크기가 같아짐
  - 이후 time dimension에 대해서 broadcast하게 계산이 되고
  - conv layer와 tanh 을 적용해서 output audio waveform을 얻게됨

3.2.4 Discriminator

Discriminator또한 MelGAN에서 사용된 것과 아주 유사함
- audio resolution에 따라서 3개의 discriminator중 하나가 적용됨
- 본 연구에서는 downsampling을 위해서 strided average pooling을 사용함
- MelGan과 다르게 본 연구에서의 discriminator은 multi task discriminator임
  - multi task discriminator은 여러개의 리니어 출력 브랜치를 가지고 있음
    - num of output branches = num of speakers
  - 각각의 branch는 binary classification task를 수행함 (이게 진짜인지 가짜인지)
- window based discriminator을 사용해서 local한 audio chunk가 진자인지 가짜인지를 구별해 낼 수 있음

3.3 Data Augmentation

GANs를 학습시킬 때 데이터가 부족하면 over fitting을 겪을 수 있음
input signal을 memorize 하는대신에 semantic information을 학습하기 위해서 data augmentation을 진행함
Human auditory perception은 phase 가 180도로 shift 되었을 때에는 큰 영향을 받지 않음. 따라서 본 연구에서는 input에 -1을 곱해서 새로운 input을 생성해 냄
random amplitude scaling이 적용됨
- random factor[0.25~1] 을 통해서 input의 amplitude가 조절됨
[-30~30]의 범위 내에서 temporal jitter이 적용되었음
speaker encoder network에 random shuffle 전략이 적용되었음
- input signal을 0.35~0.45초의 길이로 uniform하게 분리됨
- 분리된 결과인 segment들을 shuffle해서 새로운 input을 형성함
- 해당 기법은 content information이 speaker embedding에 leak되는 것을 막아서 더 좋은 disentanglement를 보이게 했음

4 Experiments

spoofing : convert된 utterance들이 target speaker으로 분류되는 비율
- 높을수록 좋은 지표임

4.2 Traditional voice conversion

결과

4.3 Zero shot voice conversion

결과
- Auto VC가 large corpus에 대해서 학습되었다는 것을 고려한다면, NVC-Net이 더 많은 데이터로 학습된다면 더 좋은 성능을 낼 것이라고 기대할 수 있음

4.4 Further studies

4.4.1 Disentanglement analysis

content 와 speaker encoder으로부터 인코딩된 latent representation에 대한 disentanglement를 확인하고자 했음
disentanglement 측정 방법
- speaker identification classifier 이 latent representation에대해서 trained 되고 validated 된 accuracy 로 측정
- 높은 value는 latent represetation이 speaker information을 더 많이 포함하고 있다는 것
- 결과

4.4.2 Inference speed and model size

pass

4.4.3 Latent embedding visualization

Barnes Hut t-SNE 시각화를 통해서 content 와 speaker embedding을 2D로 시각화 함
결과
- speaker embedding에 기반해서 보면, 같은 speaker에 대한 발화는 같이 clustering되었고 다른 speaker과는 구분되었음. (c, d)
- content embedding에 기반해서 보면, 발화가 전체 space에 대해서 scatter되어 있음. 즉 content representation에 대해서 speaker 정보가 embedding되지 않음을 확인 가능함(a,b)

5 Conclusion

NVC Net은 adversarial feedback과 reconstruction loss를 통해서 좋은 internal representation을 얻음
NVC-Net은 speaker의 identity와 linguistic content를 disentangle함
Zero shot과 traditional setting모두에서 사용가능

0 Abstract

1 Introduction

2 Related Works

3 NVC-NET

3.1 Training objectives

3.1.1 Adversarial Loss

3.1.2 Reconstruction Loss

3.1.3 Content preservation Loss

3.1.4 Kullback-Leibler Loss

3.2 Model Architectures

3.2.1 Content encoder

3.2.2 Speaker encoder

3.2.3 Generator

3.2.4 Discriminator

3.3 Data Augmentation

4 Experiments

4.2 Traditional voice conversion

4.3 Zero shot voice conversion

4.4 Further studies

4.4.1 Disentanglement analysis

4.4.2 Inference speed and model size

4.4.3 Latent embedding visualization

5 Conclusion