Speech and Voice AI · Ph.D. in Engineering, Sogang University

Ui-Hyeop Shin

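I am a speech researcher whose work spans speech processing, including separation and enhancement, and speech and speaker recognition.

I build the statistical structure of signals into deep learning architecture design, and have published first-author papers at NeurIPS, ICML, and major IEEE conferences and journals.

Samsung HumanTech Paper Award, Bronze Prize · Qualcomm Best Paper Award · First author on every paper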

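Selected Publications

First-author research built around a single theme.

From the spatial correlation of microphone arrays to time–frequency asymmetry and representation consistency: every paper below is first-author work.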

2026

ICML 2026

Query-Based Asymmetric Modeling with Decoupled Input-Output Rates for Speech Restoration

Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong and Hyung-Min Park

One model restores speech from any input quality to any target quality — query-based analysis of only the present signal with decoupled input/output rates.

Abstract

Speech restoration aims to recover clean speech from degraded recordings affected by noise, reverberation, bandwidth reduction, or other distortions, where input and output sampling rates may differ. Existing approaches typically assume matched input–output rates and apply redundant resampling, limiting native multi-rate processing. We formulate this gap as the extended sampling-frequency-independent (xSFI) setting, where a model must operate under decoupled input–output rates, and propose TF-Restormer, a query-based xSFI modeling framework. The model encodes only the observed input band and synthesizes the unobserved high-frequency band through extension queries with band-partitioned cross-attention, yielding an asymmetric encoder–decoder that allocates capacity to analysis while keeping synthesis lightweight. Trained with a perceptual loss, a scaled log-spectral loss, and adversarial supervision via an SFI-STFT discriminator, TF-Restormer attains balanced fidelity–perceptual quality as a single unified model, without redundant resampling across denoising, dereverberation, bandwidth extension, and combined-distortion benchmarks under multiple sampling rates.

Interspeech 2026

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik*, Ui-Hyeop Shin*, Jiwon Lee, Woocheol Jeong and Hyung-Min Park (* equal contribution)

Dual-view self-distillation across multiple encoder layers learns clean–noisy consistency for robust ASR without sacrificing clean accuracy.

Abstract

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness–clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean–noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy, achieved by a label-free pre-training stage with minimal additional overhead (about 4% of fine-tuning time) beyond standard fine-tuning.
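The clean–noisy consistency term described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine-similarity assignment, the `temperature` value, the random `prototypes`, and the direction of the KL term are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_assignments(h, prototypes, temperature=0.1):
    """Soft-assign each frame of h (T, D) to K prototypes (K, D) by cosine similarity."""
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return softmax(h @ p.T / temperature)

def kl_consistency(p_clean, p_noisy, eps=1e-8):
    """Mean KL(p_clean || p_noisy) over frames (direction is an assumption)."""
    log_ratio = np.log(p_clean + eps) - np.log(p_noisy + eps)
    return float(np.mean(np.sum(p_clean * log_ratio, axis=-1)))

# Hypothetical hidden representations of paired clean and noisy views.
rng = np.random.default_rng(0)
prototypes = rng.standard_normal((8, 16))
h_clean = rng.standard_normal((50, 16))
h_noisy = h_clean + 0.3 * rng.standard_normal((50, 16))

p_clean = prototype_assignments(h_clean, prototypes)
p_noisy = prototype_assignments(h_noisy, prototypes)
loss_same = kl_consistency(p_clean, p_clean)
loss_noisy = kl_consistency(p_clean, p_noisy)
```

In a multi-layer setting, a term like this would be computed per encoder layer and summed; how the layers are weighted is not stated in the abstract.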

2025

Interspeech 2025

Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement

Jangyeon Kim*, Ui-Hyeop Shin*, Jaehyun Ko and Hyung-Min Park (* equal contribution)

Reuses one block progressively — more repetition, fewer parameters, better enhancement.

Abstract

This paper presents an efficient speech enhancement (SE) approach that reuses a processing block repeatedly instead of conventional stacking. Rather than increasing the number of blocks for learning deep latent representations, repeating a single block leads to progressive refinement while reducing parameter redundancy. We also minimize domain transformation by keeping encoder and decoder shallow and reusing a single sequence modeling block. Experimental results show that the number of processing stages is more critical to performance than the number of blocks with different weights. Also, we observed that the proposed method gradually refines a noisy input within a single block. Furthermore, with the block reuse method, we demonstrate that deepening the encoder and decoder can be redundant for learning deep complex representation. Therefore, the experimental results confirm that the proposed block reusing enables progressive learning and provides an efficient alternative for SE.
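The contrast between stacking and reusing blocks can be sketched as follows. The residual `tanh` block here is a toy stand-in chosen for brevity, not the sequence modeling block used in the paper; only the parameter-sharing pattern is the point.

```python
import numpy as np

def block(x, w, b):
    """One toy residual processing block: x + tanh(x W + b)."""
    return x + np.tanh(x @ w + b)

def stacked(x, params):
    """Conventional stacking: every stage has its own weights."""
    for w, b in params:
        x = block(x, w, b)
    return x

def reused(x, w, b, repeats):
    """Block reuse: the same weights are applied at every stage."""
    for _ in range(repeats):
        x = block(x, w, b)
    return x

def n_params(params):
    return sum(w.size + b.size for w, b in params)

rng = np.random.default_rng(0)
dim, stages = 64, 4
x = rng.standard_normal((10, dim))
stack_params = [(0.1 * rng.standard_normal((dim, dim)), np.zeros(dim))
                for _ in range(stages)]
shared = stack_params[0]

y_stack = stacked(x, stack_params)
y_reuse = reused(x, *shared, repeats=stages)
print(n_params(stack_params), n_params([shared]))
```

Both paths run the same number of processing stages, but the reused variant stores the weights of a single block.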

SPL

TF-CorrNet: Leveraging Spatial Correlation for Continuous Speech Separation

Ui-Hyeop Shin, Bon Hyeok Ku and Hyung-Min Park

Uses time–frequency spatial correlation for continuous speech separation.

Abstract

In general, multi-channel source separation has utilized inter-microphone phase differences (IPDs) concatenated with magnitude information in time-frequency domain, or real and imaginary components stacked along the channel axis. However, the spatial information of a sound source is fundamentally contained in the differences between microphones, specifically in the correlation between them, while the power of each microphone also provides valuable information about the source spectrum, which is why the magnitude is also included. Therefore, we propose a network that directly leverages a correlation input with phase transform (PHAT)-beta to estimate the separation filter. In addition, the proposed TF-CorrNet processes the features alternately across time and frequency axes as a dual-path strategy in terms of spatial information. Furthermore, we add a spectral module to model source-related direct time-frequency patterns for improved speech separation. Experimental results demonstrate that the proposed TF-CorrNet effectively separates the speech sounds, showing high performance with a low computational cost in the LibriCSS dataset.
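As a rough illustration of a correlation input with PHAT-beta weighting, the sketch below computes per-bin inter-microphone correlations and divides each by its magnitude raised to `beta`. The instantaneous (unaveraged) correlation, the tensor layout, and the default `beta` are assumptions; the paper's exact feature definition may differ.

```python
import numpy as np

def phat_beta_correlation(X, beta=0.5, eps=1e-8):
    """X: complex multichannel STFT of shape (C, F, T).

    Returns a (C, C, F, T) tensor with entries X_i * conj(X_j) / |X_i * conj(X_j)|**beta.
    beta = 0 keeps the raw correlation; beta = 1 keeps only the phase.
    """
    corr = X[:, None] * np.conj(X[None, :])
    return corr / (np.abs(corr) + eps) ** beta

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5, 7)) + 1j * rng.standard_normal((3, 5, 7))
R_raw = phat_beta_correlation(X, beta=0.0)
R_phase = phat_beta_correlation(X, beta=1.0)
```

Intermediate values of `beta` trade off between keeping the source power (on the diagonal and in the magnitudes) and emphasizing the inter-microphone phase.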

2024

NeurIPS 2024 · HumanTech Bronze

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim and Hyung-Min Park

An asymmetric encoder–decoder that separates early and reconstructs lightly — SOTA separation with a minimal repeated block.

Abstract

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage of the network. Instead, we propose a more intuitive strategy that separates features earlier by expanding the feature sequence to the number of speakers as an extra dimension. To achieve this, an asymmetric strategy is presented in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, which also performs cross-speaker processing. Without relying on speaker information, the weight-shared network in the decoder directly learns to discriminate features using a separation objective. In addition, to improve performance, traditional methods have extended the sequence length, leading to the adoption of dual-path models, which handle the much longer sequence effectively by segmenting it into chunks. To address this, we introduce global and local Transformer blocks that can directly handle long sequences more efficiently without chunking and dual-path processing. The experimental results demonstrated that this asymmetric structure is effective and that the combination of proposed global and local Transformer can sufficiently replace the role of inter- and intra-chunk processing in dual-path structure. Finally, the presented model combining both of these achieved state-of-the-art performance with much less computation in various benchmark datasets.
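The early-split idea, expanding the encoder feature into one stream per speaker and then decoding every stream with shared weights, can be sketched in a few lines. The linear split, the toy residual decoder, and all shapes are hypothetical, and the cross-speaker processing mentioned in the abstract is omitted here.

```python
import numpy as np

def early_split(z, w_split, n_spk):
    """z: (T, D) encoder feature; w_split: (D, n_spk * D). Returns (n_spk, T, D)."""
    T, D = z.shape
    return (z @ w_split).reshape(T, n_spk, D).transpose(1, 0, 2)

def shared_decoder(s, w_dec):
    """Apply the same decoder weights to every speaker stream."""
    return s + np.tanh(s @ w_dec)

rng = np.random.default_rng(0)
T, D, n_spk = 20, 8, 2
z = rng.standard_normal((T, D))
w_split = 0.1 * rng.standard_normal((D, n_spk * D))
w_dec = 0.1 * rng.standard_normal((D, D))

streams = early_split(z, w_split, n_spk)   # speaker axis added right after the encoder
out = shared_decoder(streams, w_dec)       # one set of weights reconstructs all streams
```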

ICASSP 2024

NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

Hyun-Jun Heo*, Ui-Hyeop Shin*, Ran Lee, YoungJu Cheon and Hyung-Min Park (* equal contribution)

A modernized multi-scale temporal-convolution backbone for speaker verification.

Abstract

In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing one-dimensional(1D) Res2Net block and squeeze-and-excitation(SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed using two separated sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows for flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN) for the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that NeXt-TDNN, with a modernized backbone block, significantly improved performance in speaker verification tasks while reducing parameter size and inference time. We have released our code for future studies.
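For readers unfamiliar with global response normalization, here is a minimal sketch of a 1D variant, following the formulation used in ConvNeXt V2 and adapted to a (frames, channels) layout. How GRN is placed inside the TS-ConvNeXt FFN is not shown, and the layout is an assumption.

```python
import numpy as np

def grn_1d(x, gamma, beta, eps=1e-6):
    """Global response normalization for x of shape (T, C).

    Each channel is rescaled by its L2 norm over time, relative to the
    mean norm across channels, with a residual connection.
    """
    gx = np.linalg.norm(x, axis=0, keepdims=True)       # (1, C) per-channel energy
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)   # relative importance
    return gamma * (x * nx) + beta + x

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 4)) * np.array([0.5, 1.0, 2.0, 4.0])
y_init = grn_1d(x, gamma=np.zeros(4), beta=np.zeros(4))  # zero init acts as identity
y = grn_1d(x, gamma=np.ones(4), beta=np.zeros(4))
gain = (y / x).mean(axis=0)                              # per-channel gain, 1 + nx
```

Channels with more energy receive a larger gain, which is the sense in which the propagation becomes more selective.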

T-ASLP · Qualcomm Best Paper

Statistical Beamformer Exploiting Non-Stationarity and Sparsity With Spatially Constrained ICA for Robust Speech Recognition

Ui-Hyeop Shin and Hyung-Min Park

A statistical beamformer exploiting non-stationarity and sparsity via spatially constrained ICA.

Abstract

In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech as a non-stationary Laplacian distribution, a mask-based statistical beamforming algorithm is proposed to exploit both its output and masked input variance for robust estimation of the beamformer. In addition, we also present a method for steering vector estimation (SVE) based on a noise power ratio obtained from the target and noise outputs in independent component analysis (ICA). To update the beamformer in the same ICA framework, we derive ICA with distortionless and null constraints on target speech, which yields beamformed speech at the target output and noises at the other outputs, respectively. The demixing weights for the target output result in a statistical beamformer with the weighted spatial covariance matrix (wSCM) using a weighting function characterized by a source model. To enhance the SVE, the strict null constraints imposed by the Lagrange multiplier methods are relaxed by generalized penalties with weight parameters, while the strict distortionless constraints are maintained. Furthermore, we derive an online algorithm based on an optimization technique of recursive least squares (RLS) for practical applications. Experimental results on various environments using CHiME-4 and LibriCSS datasets demonstrate the effectiveness of the presented algorithm compared to conventional beamforming and blind source extraction (BSE) based on ICA on both batch and online processing.
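A beamformer built from a weighted spatial covariance matrix under a distortionless constraint can be sketched generically as below. This is the standard closed form w = V^{-1} h / (h^H V^{-1} h), not the paper's ICA-based derivation; the weighting function, the steering vector `h`, and the diagonal loading are placeholders chosen for illustration.

```python
import numpy as np

def weighted_scm(X, weights):
    """X: (M, T) complex frames at one frequency; weights: (T,) real. Returns (M, M)."""
    return (weights * X) @ X.conj().T / X.shape[1]

def distortionless_beamformer(V, h, diag_load=1e-6):
    """Return w = V^{-1} h / (h^H V^{-1} h), which satisfies w^H h = 1."""
    M = V.shape[0]
    V = V + diag_load * np.trace(V).real / M * np.eye(M)  # small loading for stability
    vinv_h = np.linalg.solve(V, h)
    return vinv_h / (h.conj() @ vinv_h)

rng = np.random.default_rng(0)
M, T = 4, 200
h = np.exp(-1j * np.pi * 0.3 * np.arange(M))             # hypothetical steering vector
s = rng.laplace(size=T) * np.exp(1j * rng.uniform(0, 2 * np.pi, T))
noise = 0.5 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
X = h[:, None] * s[None, :] + noise

# Placeholder weighting: down-weight frames with large magnitude.
weights = 1.0 / np.maximum(np.abs(X[0]), 1e-3)
V = weighted_scm(X, weights)
w = distortionless_beamformer(V, h)
y = w.conj() @ X                                          # beamformed output, shape (T,)
```

In the paper the weights come from the source model and the steering vector is estimated within the ICA framework; here both are fixed by hand.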

2022

ICA 2022

Statistical Beamforming based on AuxIVA with Distortionless and Null Constraints for Robust Speech Recognition

Ui-Hyeop Shin and Hyung-Min Park

Statistical beamforming on AuxIVA with distortionless and null constraints for robust ASR.

Abstract

An early conference-stage formulation of statistical beamforming that combines AuxIVA-style demixing updates with distortionless and null constraints on the target speech, aimed at improving robustness for automatic speech recognition. This line of work was later extended into the journal treatment published in IEEE/ACM T-ASLP (2024).

2020

IEEE Access · Qualcomm Best Paper

Auxiliary-Function-Based Independent Vector Analysis Using Generalized Inter-Clique Dependence Source Models With Clique Variance Estimation

Ui-Hyeop Shin and Hyung-Min Park

A generalized inter-clique dependence model for AuxIVA that mitigates the permutation problem in blind speech separation.

Abstract

By introducing a frequency dependence source prior including full-band and clique models, independent vector analysis (IVA) has been successfully used for convolutive blind source separation (BSS). In addition, independent low-rank matrix analysis (ILRMA) learns a low-rank approximation of the time-frequency structure of source signals. This paper presents IVA using a clique-based frequency dependence model with time-varying clique variances to combine advantages of both ILRMA and clique-model-based IVA for BSS of speech signals. Although conventional clique models are effective in separating sources with specific spectral structures, the dependency among the cliques is considered by overlaps between cliques or a global clique of all frequency bins if there is. To avoid the permutation problem by strengthening the dependency among the cliques, we develop a generalized probability-density-function (pdf) model imposing a variable exponent on the summed cliques with overlaps and time-varying clique variances, which may include most conventional source models as particular cases. In addition, update rules of the clique variances and demixing matrices are derived by minimization of the cost function of BSS as well as non-negative matrix factorization (NMF) and auxiliary function techniques for fast and robust convergence, respectively. Through experiments on BSS of speech mixtures with various mixing conditions, the proposed IVA showed improved separation performance than the conventional methods. Experimental results consistently demonstrated that the performance of a method could be determined in general by the trade-off between the degree of freedom of source models and the vulnerability to the permutation problem.

Under Review

2026

Interspeech · Under Review

Deep Filter Estimation from Inter-Frame Correlations for Monaural Speech Dereverberation

Ui-Hyeop Shin, Jun Hyung Kim, Jangyeon Kim, Wooseok Kim and Hyung-Min Park

Estimates multi-frame deep filters from inter-frame STFT correlation for robust monaural dereverberation.

Abstract

Speech dereverberation in distant-microphone scenarios remains challenging due to the high correlation between reverberation and target signals, often leading to poor generalization in real-world environments. We propose IF-CorrNet, a correlation-to-filter architecture designed for robustness against acoustic variability. Unlike conventional black-box mapping methods that directly estimate complex spectra, IF-CorrNet explicitly exploits inter-frame STFT correlations to estimate multi-frame deep filters for each time-frequency bin. By shifting the learning objective from direct mapping to filter estimation, the network effectively constrains the solution space, which simplifies the training process and mitigates overfitting to synthetic data. Experimental results on the REVERB Challenge dataset demonstrate that IF-CorrNet achieves a substantial gain in the SRMR metric on RealData, confirming its robustness in suppressing reverberation and noise in practical, non-synthetic environments.
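The correlation-to-filter idea can be made concrete with a small sketch: inter-frame correlations are computed per time-frequency bin as input features, and a multi-frame filter is applied to the current and past STFT frames. The number of taps, the conjugation convention, and the causal zero-padded delay are assumptions; the network that maps correlations to filters is not shown.

```python
import numpy as np

def delay(X, k):
    """Shift an (F, T) STFT by k frames along time, zero-padding the start."""
    out = np.zeros_like(X)
    out[:, k:] = X[:, : X.shape[1] - k]
    return out

def inter_frame_correlation(X, K):
    """Return (K, F, T) features X[f, t] * conj(X[f, t - k]) for k = 0..K-1."""
    return np.stack([X * np.conj(delay(X, k)) for k in range(K)])

def apply_multiframe_filter(X, H):
    """Y[f, t] = sum_k conj(H[k, f, t]) * X[f, t - k], with H of shape (K, F, T)."""
    return sum(np.conj(H[k]) * delay(X, k) for k in range(H.shape[0]))

rng = np.random.default_rng(0)
F, T, K = 4, 10, 3
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
C = inter_frame_correlation(X, K)

# A filter with a single unit tap at k = 1 simply delays the input by one frame.
H = np.zeros((K, F, T), dtype=complex)
H[1] = 1.0
Y = apply_multiframe_filter(X, H)
```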

T-ASLP · Under Review

Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

Ui-Hyeop Shin and Hyung-Min Park

An asymmetric TF encoder–decoder that estimates deep filters from spatio-spectro-temporal correlation for speech separation.

Abstract

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-{2,3,4,5}Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.

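Audio Demos

Hear the models for yourself.

Demos cover speech separation and general-purpose speech restoration. Full audio comparisons are available on each project's demo page.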

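Awards and Research Projects

Awards

Photo: Ui-Hyeop Shin receiving the Bronze Prize at the 31st Samsung HumanTech Paper Awards (Signal Processing category), 2025.

Samsung HumanTech Paper Award, Bronze Prize (31st, Signal Processing category) · 2025
Awarded for SepReformer (NeurIPS 2024).

Outstanding Graduate Student, Sogang University Ricci Engineering Research Award (리치 공학연구상) · 2024
Sole recipient in the College of Engineering.

Qualcomm Best Paper Award · 2024
Excellence Prize, for the T-ASLP 2024 paper on statistical beamforming.

Qualcomm Best Paper Award · 2020
Excellence Prize, for the IEEE Access 2020 paper on AuxIVA with generalized inter-clique dependence.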

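Research Projects

Joint training of VAD-based preprocessing and wake-word models
LG Electronics · 2025

(Subproject 3) Development of AI-based dialogue modeling technology for simultaneous multi-speaker processing
Institute of Information & Communications Technology Planning & Evaluation (IITP) · 2021–2025

Speaker recognition for voice-based driver profiling
Hyundai Motor Group (NGV) · 2022–2023

Speech restoration in extreme acoustic environments
Government agency · 2024–2025

Audio signal processing for identifying smartphone-recorded speech
Supreme Prosecutors' Office · 2023–2024

R&D on intelligent interaction technology for understanding user intent and context
National Research Foundation of Korea (NRF) · 2019–2021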

About

Research built independently, from the fundamentals up.

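I start by examining the statistical structure of speech signals, such as correlation, non-stationarity, and sparsity, and then reflect that structure in the design of deep learning architectures. That is how the progression from spatial correlation to time–frequency modeling and on to representation consistency took shape.

Every paper to date is either sole first-author or co-first-author (equal contribution). Three of them, DASH, NeXt-TDNN, and Stack Less, Repeat More, are co-first-author; the rest are sole first-author.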

Contact

Let's talk.

I am looking for research and engineering positions in speech and voice AI. Email is the fastest way to reach me, and the CV below has my full record.

dmlguq456@naver.com

View CV · Download PDF
