Ui-Hyeop Shin

Speech & Voice AI · Ph.D., Sogang University

Profile

I build deep-learning models for speech separation, enhancement, restoration, and recognition, as well as speaker recognition, rooted in signal processing. I am among the few researchers in Korea publishing first-author work at NeurIPS, ICML, and IEEE flagship venues.

Selected Highlights

  • 2 flagship ML venues (ICML 2026 · NeurIPS 2024 — first author)
  • 4 flagship speech venues (T-ASLP · SPL · ICASSP · Interspeech)
  • Samsung HumanTech Paper Awards — Bronze (Signal Processing)

Education

  • Ph.D., Electronic Engineering

    Sogang University · 2021.9 – 2026.2

    Thesis: "Asymmetric encoder–decoder leveraging correlation for universal speech enhancement and separation"

  • M.S., Electronic Engineering

    Sogang University · 2019.3 – 2021.2

    Thesis: "Target speech extraction based on AuxIVA…"

  • B.S.

    Sogang University · 2015.3 – 2019.2

Honors & Awards

  • Samsung HumanTech Paper Award — Bronze (31st, Signal Processing) · 2025

    Bronze Prize in the Signal Processing division, awarded for SepReformer (NeurIPS 2024).

  • Most Outstanding Graduate Student — Sogang Ricci Engineering Research Award · 2024

    Single awardee, College of Engineering.

  • Qualcomm Best Paper Award · 2024

    Outstanding Award — for the T-ASLP 2024 paper on statistical beamforming.

  • Qualcomm Best Paper Award · 2020

    Outstanding Award — for the IEEE Access 2020 paper on AuxIVA (generalized inter-clique dependence).

Selected Publications

Interspeech 2026

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik*, Ui-Hyeop Shin*, Jiwon Lee, Woocheol Jeong and Hyung-Min Park

Dual-view self-distillation across multiple encoder layers learns clean–noisy consistency for robust ASR without sacrificing clean accuracy.

Abstract

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness–clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean–noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy, achieved by a label-free pre-training stage with minimal additional overhead (about 4% of fine-tuning time) beyond standard fine-tuning.
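
As a rough illustration of the clean–noisy consistency term mentioned in the abstract, the sketch below computes a KL divergence between the prototype assignment distributions of a clean and a noisy view. The dot-product similarity, the temperature value, the KL direction (clean view as target), and every name in the code are assumptions made for illustration; they are not details taken from the paper.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def prototype_assignment(feature, prototypes, temperature=0.1):
    """Soft assignment of one feature vector to each prototype.

    Dot-product similarity is an assumption of this sketch.
    """
    scores = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax(scores, temperature)

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (q + eps)) for t, q in zip(target, predicted))

# Hypothetical toy data: 3 prototypes in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
clean_feature = [0.9, 0.1]   # hidden representation of the clean view
noisy_feature = [0.6, 0.4]   # hidden representation of the noisy view

clean_assign = prototype_assignment(clean_feature, prototypes)
noisy_assign = prototype_assignment(noisy_feature, prototypes)

# Consistency loss: pull the noisy assignment toward the clean one.
consistency_loss = kl_divergence(clean_assign, noisy_assign)
print(f"consistency loss: {consistency_loss:.4f}")
```

In a multi-layer setting, a term of this kind would presumably be computed per selected encoder layer and summed; how the layers are weighted is not specified here.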

ICML 2026

Query-Based Asymmetric Modeling with Decoupled Input-Output Rates for Speech Restoration

Ui-Hyeop Shin, Jaehyun Ko, Woocheol Jeong and Hyung-Min Park

One model restores speech from any input quality to any target quality — query-based analysis of only the present signal with decoupled input/output rates.

Abstract

Speech restoration aims to recover clean speech from degraded recordings affected by noise, reverberation, bandwidth reduction, or other distortions, where input and output sampling rates may differ. Existing approaches typically assume matched input–output rates and apply redundant resampling, limiting native multi-rate processing. We formulate this gap as the extended sampling-frequency-independent (xSFI) setting, where a model must operate under decoupled input–output rates, and propose TF-Restormer, a query-based xSFI modeling framework. The model encodes only the observed input band and synthesizes the unobserved high-frequency band through extension queries with band-partitioned cross-attention, yielding an asymmetric encoder–decoder that allocates capacity to analysis while keeping synthesis lightweight. Trained with a perceptual loss, a scaled log-spectral loss, and adversarial supervision via an SFI-STFT discriminator, TF-Restormer attains balanced fidelity–perceptual quality as a single unified model, without redundant resampling across denoising, dereverberation, bandwidth extension, and combined-distortion benchmarks under multiple sampling rates.

Interspeech 2025

Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement

Jangyeon Kim*, Ui-Hyeop Shin*, Jaehyun Ko and Hyung-Min Park

Reuses one block progressively — more repetition, fewer parameters, better enhancement.

Abstract

This paper presents an efficient speech enhancement (SE) approach that reuses a processing block repeatedly instead of conventional stacking. Rather than increasing the number of blocks for learning deep latent representations, repeating a single block leads to progressive refinement while reducing parameter redundancy. We also minimize domain transformation by keeping encoder and decoder shallow and reusing a single sequence modeling block. Experimental results show that the number of processing stages is more critical to performance than the number of blocks with different weights. Also, we observed that the proposed method gradually refines a noisy input within a single block. Furthermore, with the block reuse method, we demonstrate that deepening the encoder and decoder can be redundant for learning deep complex representation. Therefore, the experimental results confirm that the proposed block reusing enables progressive learning and provides an efficient alternative for SE.
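
The difference between stacking blocks and reusing one block can be sketched in a few lines. The toy `Block` below is a hypothetical stand-in for a sequence-modeling block, and the residual update rule is an assumption chosen only to make the example runnable; the point is that both variants run the same number of processing stages while the reused variant holds a single set of weights.

```python
import math
import random

class Block:
    """Toy processing block with its own parameters (hypothetical stand-in)."""

    def __init__(self, dim, rng):
        self.weight = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        self.bias = [0.0] * dim

    def num_params(self):
        return len(self.weight) + len(self.bias)

    def __call__(self, x):
        # Residual update: each call refines the current estimate a little.
        return [xi + w * math.tanh(xi) + b
                for xi, w, b in zip(x, self.weight, self.bias)]

def run(stages, x):
    """Apply the processing stages in order."""
    for block in stages:
        x = block(x)
    return x

def count_params(stages):
    """Count parameters, counting a shared block only once."""
    unique = {id(block): block for block in stages}
    return sum(block.num_params() for block in unique.values())

rng = random.Random(0)
dim, num_stages = 8, 4

stacked = [Block(dim, rng) for _ in range(num_stages)]  # distinct weights per stage
shared = Block(dim, rng)
reused = [shared] * num_stages                          # same weights at every stage

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y_stacked = run(stacked, x)
y_reused = run(reused, x)

print("stacked:", count_params(stacked), "params,", len(stacked), "stages")
print("reused: ", count_params(reused), "params,", len(reused), "stages")
```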

SPL

TF-CorrNet: Leveraging Spatial Correlation for Continuous Speech Separation

Ui-Hyeop Shin, Bon Hyeok Ku and Hyung-Min Park

Uses time–frequency spatial correlation for continuous speech separation.

Abstract

In general, multi-channel source separation has utilized inter-microphone phase differences (IPDs) concatenated with magnitude information in time-frequency domain, or real and imaginary components stacked along the channel axis. However, the spatial information of a sound source is fundamentally contained in the differences between microphones, specifically in the correlation between them, while the power of each microphone also provides valuable information about the source spectrum, which is why the magnitude is also included. Therefore, we propose a network that directly leverages a correlation input with phase transform (PHAT)-beta to estimate the separation filter. In addition, the proposed TF-CorrNet processes the features alternately across time and frequency axes as a dual-path strategy in terms of spatial information. Furthermore, we add a spectral module to model source-related direct time-frequency patterns for improved speech separation. Experimental results demonstrate that the proposed TF-CorrNet effectively separates the speech sounds, showing high performance with a low computational cost in the LibriCSS dataset.
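
For readers unfamiliar with PHAT-beta weighting, the following minimal sketch computes an inter-microphone correlation matrix for a single time-frequency bin. It assumes the common definition in which the cross-spectrum is divided by its magnitude raised to the power beta (beta = 0 keeps the plain correlation, beta = 1 keeps only the phase). The function names, the default beta, and the toy input are illustrative assumptions, and the exact way the paper forms its input features may differ.

```python
def phat_beta_correlation(x_i, x_j, beta=0.5, eps=1e-8):
    """Cross-correlation of two STFT coefficients with PHAT-beta normalization.

    x_i, x_j: complex STFT values of two microphones at one TF bin.
    """
    cross = x_i * x_j.conjugate()
    return cross / (abs(cross) + eps) ** beta

def correlation_features(mic_bins, beta=0.5):
    """All-pairs PHAT-beta correlation matrix across microphones for one TF bin."""
    return [[phat_beta_correlation(a, b, beta) for b in mic_bins] for a in mic_bins]

# Hypothetical STFT values of three microphones at one TF bin.
mic_bins = [1.0 + 1.0j, 0.5 - 2.0j, -1.5 + 0.5j]

plain = correlation_features(mic_bins, beta=0.0)       # magnitude fully kept
phase_only = correlation_features(mic_bins, beta=1.0)  # magnitude normalized away
partial = correlation_features(mic_bins, beta=0.5)     # compromise between the two

for row in partial:
    print(["{:.3f}".format(c) for c in row])
```

The off-diagonal entries carry the inter-microphone phase differences, while the retained magnitude (for beta below 1) still reflects the source power.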

ICASSP 2024

NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

Hyun-Jun Heo*, Ui-Hyeop Shin*, Ran Lee, YoungJu Cheon and Hyung-Min Park

A modernized multi-scale temporal-convolution backbone for speaker verification.

Abstract

In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed using two separated sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows for flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN) for the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that NeXt-TDNN, with a modernized backbone block, significantly improved performance in speaker verification tasks while reducing parameter size and inference time. We have released our code for future studies.

NeurIPS 2024 · HumanTech Bronze

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim and Hyung-Min Park

An asymmetric encoder–decoder that separates early and reconstructs lightly — SOTA separation with a minimal repeated block.

Abstract

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage of the network. Instead, we propose a more intuitive strategy that separates features earlier by expanding the feature sequence to the number of speakers as an extra dimension. To achieve this, an asymmetric strategy is presented in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, which also performs cross-speaker processing. Without relying on speaker information, the weight-shared network in the decoder directly learns to discriminate features using a separation objective. In addition, to improve performance, traditional methods have extended the sequence length, leading to the adoption of dual-path models, which handle the much longer sequence effectively by segmenting it into chunks. To address this, we introduce global and local Transformer blocks that can directly handle long sequences more efficiently without chunking and dual-path processing. The experimental results demonstrated that this asymmetric structure is effective and that the combination of proposed global and local Transformer can sufficiently replace the role of inter- and intra-chunk processing in dual-path structure. Finally, the presented model combining both of these achieved state-of-the-art performance with much less computation in various benchmark datasets.

T-ASLP 2024 · Qualcomm Best Paper

Statistical Beamformer Exploiting Non-Stationarity and Sparsity With Spatially Constrained ICA for Robust Speech Recognition

Ui-Hyeop Shin and Hyung-Min Park

A statistical beamformer exploiting non-stationarity and sparsity via spatially constrained ICA.

Abstract

In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech as a non-stationary Laplacian distribution, a mask-based statistical beamforming algorithm is proposed to exploit both its output and masked input variance for robust estimation of the beamformer. In addition, we also present a method for steering vector estimation (SVE) based on a noise power ratio obtained from the target and noise outputs in independent component analysis (ICA). To update the beamformer in the same ICA framework, we derive ICA with distortionless and null constraints on target speech, which yields beamformed speech at the target output and noises at the other outputs, respectively. The demixing weights for the target output result in a statistical beamformer with the weighted spatial covariance matrix (wSCM) using a weighting function characterized by a source model. To enhance the SVE, the strict null constraints imposed by the Lagrange multiplier methods are relaxed by generalized penalties with weight parameters, while the strict distortionless constraints are maintained. Furthermore, we derive an online algorithm based on an optimization technique of recursive least squares (RLS) for practical applications. Experimental results on various environments using CHiME-4 and LibriCSS datasets demonstrate the effectiveness of the presented algorithm compared to conventional beamforming and blind source extraction (BSE) based on ICA on both batch and online processing.

ICA 2022

Statistical Beamforming based on AuxIVA with Distortionless and Null Constraints for Robust Speech Recognition

Ui-Hyeop Shin and Hyung-Min Park

Statistical beamforming on AuxIVA with distortionless and null constraints for robust ASR.

Abstract

An early conference-stage formulation of statistical beamforming that combines AuxIVA-style demixing updates with distortionless and null constraints on the target speech, aimed at improving robustness for automatic speech recognition. This line of work was later extended into the journal treatment published in IEEE/ACM T-ASLP (2024).

IEEE Access 2020 · Qualcomm Best Paper

Auxiliary-Function-Based Independent Vector Analysis Using Generalized Inter-Clique Dependence Source Models With Clique Variance Estimation

Ui-Hyeop Shin and Hyung-Min Park

A generalized inter-clique dependence model for AuxIVA that mitigates the permutation problem in blind speech separation.

Abstract

By introducing a frequency dependence source prior including full-band and clique models, independent vector analysis (IVA) has been successfully used for convolutive blind source separation (BSS). In addition, independent low-rank matrix analysis (ILRMA) learns a low-rank approximation of the time-frequency structure of source signals. This paper presents IVA using a clique-based frequency dependence model with time-varying clique variances to combine advantages of both ILRMA and clique-model-based IVA for BSS of speech signals. Although conventional clique models are effective in separating sources with specific spectral structures, the dependency among the cliques is considered through overlaps between cliques or a global clique of all frequency bins, if one exists. To avoid the permutation problem by strengthening the dependency among the cliques, we develop a generalized probability-density-function (pdf) model imposing a variable exponent on the summed cliques with overlaps and time-varying clique variances, which may include most conventional source models as particular cases. In addition, update rules of the clique variances and demixing matrices are derived by minimization of the cost function of BSS as well as non-negative matrix factorization (NMF) and auxiliary function techniques for fast and robust convergence, respectively. Through experiments on BSS of speech mixtures with various mixing conditions, the proposed IVA showed better separation performance than the conventional methods. Experimental results consistently demonstrated that the performance of a method could be determined in general by the trade-off between the degree of freedom of source models and the vulnerability to the permutation problem.

Under Review

Interspeech · Under Review

Deep Filter Estimation from Inter-Frame Correlations for Monaural Speech Dereverberation

Ui-Hyeop Shin, Jun Hyung Kim, Jangyeon Kim, Wooseok Kim and Hyung-Min Park

Estimates multi-frame deep filters from inter-frame STFT correlation for robust monaural dereverberation.

Abstract

Speech dereverberation in distant-microphone scenarios remains challenging due to the high correlation between reverberation and target signals, often leading to poor generalization in real-world environments. We propose IF-CorrNet, a correlation-to-filter architecture designed for robustness against acoustic variability. Unlike conventional black-box mapping methods that directly estimate complex spectra, IF-CorrNet explicitly exploits inter-frame STFT correlations to estimate multi-frame deep filters for each time-frequency bin. By shifting the learning objective from direct mapping to filter estimation, the network effectively constrains the solution space, which simplifies the training process and mitigates overfitting to synthetic data. Experimental results on the REVERB Challenge dataset demonstrate that IF-CorrNet achieves a substantial gain in the SRMR metric on RealData, confirming its robustness in suppressing reverberation and noise in practical, non-synthetic environments.
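
To make the idea of a multi-frame deep filter concrete, the sketch below applies a set of per-bin filter taps to the current and past STFT frames. Only the filtering step is shown: the taps are set by hand here, whereas in the work described above they would be estimated by the network from inter-frame correlations. The tap layout, the conjugation convention, the zero-padding at the start of the signal, and all names are assumptions of this sketch.

```python
def apply_deep_filter(stft, filters):
    """Apply multi-frame filters to an STFT.

    stft[t][f]         : complex STFT coefficient at frame t, frequency bin f.
    filters[t][f][tau] : complex tap applied to frame t - tau (tau = 0 is the current frame).
    Frames before the start of the signal are treated as zero.
    """
    output = []
    for t, frame in enumerate(stft):
        row = []
        for f in range(len(frame)):
            acc = 0j
            for tau, tap in enumerate(filters[t][f]):
                if t - tau >= 0:
                    # Conjugating the tap is a convention assumed here.
                    acc += tap.conjugate() * stft[t - tau][f]
            row.append(acc)
        output.append(row)
    return output

# Hypothetical toy STFT: 3 frames, 2 frequency bins.
stft = [
    [1.0 + 0.0j, 0.0 + 1.0j],
    [0.5 + 0.5j, 1.0 - 1.0j],
    [2.0 + 0.0j, 0.0 - 0.5j],
]

num_taps = 2
# Identity filter: keep the current frame, ignore the past frame.
identity = [[[1.0 + 0j] + [0j] * (num_taps - 1) for _ in frame] for frame in stft]
# One-frame delay: ignore the current frame, pass the previous frame.
delay = [[[0j, 1.0 + 0j] for _ in frame] for frame in stft]

same = apply_deep_filter(stft, identity)
shifted = apply_deep_filter(stft, delay)
print(shifted)
```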

T-ASLP Under Review

Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

Ui-Hyeop Shin and Hyung-Min Park

An asymmetric TF encoder–decoder that estimates deep filters from spatio-spectro-temporal correlation for speech separation.

Abstract

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-{2,3,4,5}Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.

Funded Projects

  • Integrated training of pre-processing & wake-word models with VAD

    LG Electronics · 2025

  • (Sub 3) Development of AI-Based Dialogue Modeling Technology for Simultaneous Multi-Speaker Processing

    IITP · 2021–2025

  • Speaker recognition for voice-based driver profiling

    Hyundai Motor Group (NGV) · 2022–2023

  • Speech restoration in extreme acoustic environments

    Government Agency · 2024–2025

  • Audio signal processing for identifying smartphone-recorded speech

    Supreme Prosecutors' Office, Republic of Korea · 2023–2024

  • Research and Development of Intelligent Interaction Technology for Understanding User Intent and Context

    NRF · 2019–2021

Skills

  • Lang: Python, C/C++, MATLAB, Shell
  • ML: PyTorch, Lightning, Hugging Face, ONNX, TensorRT
  • Audio: torchaudio, librosa, Kaldi, ESPnet, Asteroid
  • Methods: ICA/IVA, Beamforming, DSP, Bayesian inference
  • Infra: Distributed training (DDP), Linux/Slurm, Git, Docker