FreeSonic

Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Anonymous Authors
Anonymous Institution
TL;DR: A training-free audio editing framework built on the Rectified Flow-based TangoFlux model. We localize target segments from text-audio attention maps, then apply scheduled Key/Value decoupling and task-oriented noise injection to confine edits to the intended region while leaving the rest of the audio untouched.
Figure 1. Overview of the FreeSonic pipeline. (a) The editing workflow involves inversion and denoising, where Scheduled Attention Decoupling and Task-Oriented Noise Injection are applied for localized modification. (b) The temporal mask is extracted from text-audio attention maps during the first five inversion steps. By aggregating interaction scores in the double blocks, target segments are localized to guide editing while preserving the source background.
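The mask extraction in panel (b) can be pictured with a small NumPy sketch. It is illustrative only and is not the authors' implementation: the tensor layout `(steps, blocks, heads, audio_frames, text_tokens)`, the function name `temporal_mask`, the mean aggregation, the min-max normalization, and the `threshold` value are all assumptions. Only the use of the first five inversion steps comes from the caption.

```python
import numpy as np

def temporal_mask(attn, target_token_ids, num_steps=5, threshold=0.5):
    """Return a boolean mask over audio frames (True = target segment).

    attn: array of shape (steps, blocks, heads, audio_frames, text_tokens)
          holding text-audio attention weights (layout is an assumption).
    target_token_ids: indices of the text tokens that name the target sound.
    """
    attn = np.asarray(attn, dtype=np.float64)
    # Only the first `num_steps` inversion steps are used (five in the caption).
    early = attn[:num_steps]
    # Aggregate over steps, blocks, and heads -> (audio_frames, text_tokens).
    scores = early.mean(axis=(0, 1, 2))
    # Sum the attention each frame pays to the target tokens.
    per_frame = scores[:, target_token_ids].sum(axis=1)
    # Min-max normalize to [0, 1]; this normalization is an assumption.
    lo, hi = per_frame.min(), per_frame.max()
    if hi - lo < 1e-12:
        # Flat response: no frame stands out, so nothing is selected.
        return np.zeros(per_frame.shape, dtype=bool)
    return (per_frame - lo) / (hi - lo) >= threshold

# Toy input: 10 frames, 4 text tokens; frames 3..6 attend strongly to token 2.
toy = np.full((8, 2, 1, 10, 4), 0.1)
toy[..., 3:7, 2] = 0.9
print(temporal_mask(toy, [2]).astype(int))
```

In this toy case the mask selects exactly the frames that respond to the target token. A real system would likely also smooth the mask or merge short gaps, which the sketch leaves out.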

Abstract

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. Existing methods often struggle to balance temporal consistency with background preservation, frequently failing to isolate modifications from the original acoustic context.

We propose FreeSonic, a training-free audio editing framework built on the Rectified Flow-based TangoFlux model. FreeSonic combines an optimized inversion-reverse process with text-audio attention maps that localize target segments in time. For content modification, it fuses Key and Value (KV) features inside the model's attention layers according to a schedule, which confines edits to the intended regions and leaves the rest of the audio undisturbed. A task-oriented noise injection strategy extends the framework to diverse editing objectives, such as sound removal and semantic substitution.
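As a rough illustration of the scheduled Key/Value fusion described above, the sketch below mixes cached source-branch features with edit-branch features. It is a guess at the general shape of such a scheme, not the authors' method: the function name `fuse_kv`, the two-branch setup, the per-frame `(frames, dim)` layout, the hard switch at `fuse_steps`, and the choice to take source K/V outside the mask are all assumptions.

```python
import numpy as np

def fuse_kv(k_src, v_src, k_edit, v_edit, mask, step, fuse_steps):
    """Fuse Key/Value features of a source branch and an edit branch.

    k_*, v_*: arrays of shape (frames, dim).
    mask: boolean array of shape (frames,), True = region to edit.
    step: index of the current denoising step (0-based).
    fuse_steps: number of initial steps during which fusion is active
                (hypothetical schedule parameter).
    """
    if step >= fuse_steps:
        # After the scheduled window the edit branch runs on its own features.
        return k_edit, v_edit
    mask = np.asarray(mask, dtype=bool)
    keep_src = ~mask[:, None]  # broadcast over the feature dimension
    # Outside the mask: source features (background preservation).
    # Inside the mask: edit-branch features (new content).
    k = np.where(keep_src, k_src, k_edit)
    v = np.where(keep_src, v_src, v_edit)
    return k, v

# Toy usage: 4 frames, 2 feature dimensions, edit region = frames 1 and 2.
k_src = v_src = np.zeros((4, 2))
k_edit = v_edit = np.ones((4, 2))
region = np.array([False, True, True, False])
k, v = fuse_kv(k_src, v_src, k_edit, v_edit, region, step=0, fuse_steps=6)
print(k[:, 0])
```

A soft schedule (for example, a blend weight that decays over steps) would be an equally plausible reading of "schedule-based fusion"; the hard cutoff here is chosen only for brevity.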

Experimental results show that FreeSonic preserves the original acoustic content while producing faithful edits, and that it remains flexible enough for complex, real-world audio scenarios.

Key Contributions

1. Training-free editing on Rectified Flow. Built on TangoFlux with an optimized inversion-reverse process, FreeSonic edits audio without any task-specific training or fine-tuning.

2. Attention-based temporal localization. Target segments are localized from text-audio attention maps during the first five inversion steps, guiding edits to the intended time interval.

3. Scheduled Attention Decoupling. A schedule-based fusion of Key and Value features inside the attention layers confines modifications to the target region while preserving the background.

4. Task-oriented noise injection. A unified strategy that adapts to diverse objectives (addition, removal, and semantic substitution) within a single framework.
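One way to picture task-oriented noise injection is a mask-restricted blend between the inverted latent and fresh Gaussian noise, with a strength that depends on the editing task. The sketch below is an assumption-laden illustration, not the paper's procedure: the function `inject_noise`, the linear blend, and especially the values in `TASK_STRENGTH` are placeholders invented for the example and do not come from the source.

```python
import numpy as np

# Placeholder strengths per task; these numbers are NOT from the paper.
TASK_STRENGTH = {"add": 0.5, "remove": 0.8, "replace": 1.0}

def inject_noise(latent, mask, task, rng, strengths=None):
    """Blend fresh noise into the masked frames of an inverted latent.

    latent: array of shape (frames, dim), the latent obtained by inversion.
    mask: boolean array of shape (frames,), True = region to edit.
    task: one of the keys of `strengths` ("add", "remove", "replace").
    rng: a numpy.random.Generator, passed in for reproducibility.
    """
    strengths = TASK_STRENGTH if strengths is None else strengths
    s = strengths[task]  # raises KeyError for an unknown task
    mask = np.asarray(mask, dtype=bool)
    noise = rng.standard_normal(latent.shape)
    # Linear blend (assumed form): s = 0 keeps the latent, s = 1 is pure noise.
    blended = (1.0 - s) * latent + s * noise
    # Frames outside the mask keep the inverted latent untouched.
    return np.where(mask[:, None], blended, latent)

# Toy usage: 6 frames, 3 latent channels, edit frames 2..3.
rng = np.random.default_rng(0)
latent = np.zeros((6, 3))
region = np.array([False, False, True, True, False, False])
out = inject_noise(latent, region, "replace", rng)
print(np.abs(out).sum(axis=1) > 0)
```

The design point the sketch tries to convey is that a single routine can serve all three tasks by changing only the strength, while the mask guarantees that frames outside the target interval start denoising from the unmodified inverted latent.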

Demos

Mel-spectrograms and audio samples from FreeSonic, shown alongside the source audio, prior baselines (AudioEditor, ZETA, SAO-Instruct), and the ground truth. The FreeSonic column is highlighted.

1. Add

Adding a target sound within a designated time interval of the source audio. FreeSonic injects the new event while keeping the original context intact.

Instruction: Add Music.
Source: A clock ticking.
Target: A clock ticking, as music.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Add glass breaking.
Source: A group of people laughing followed by a person farting.
Target: A group of people laughing followed by a person farting, as glass breaking.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Add rooster.
Source: People are speaking, chewing, breathing, and laughing in a busy environment.
Target: People are speaking, chewing, breathing, and laughing in a busy environment, as rooster.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Add washing machine.
Source: Women cook, speak, and stir amidst sizzling and clanking.
Target: Women cook, speak, and stir amidst sizzling and clanking, as washing machine.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Add glass breaking.
Source: Humans walk and speak while various sounds are heard in the background such as wind and field recordings.
Target: Humans walk and speak while various sounds are heard in the background such as wind and field recordings, as glass breaking.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

2. Remove

Removing a target sound from a designated time interval. FreeSonic suppresses the event while preserving everything else.

Instruction: Drop toilet flush.
Source: A baby laughs while a man and a woman speaks and laughs as well, as toilet flush.
Target: A baby laughs while a man and a woman speaks and laughs as well.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Drop brushing teeth.
Source: Reversing beeps, engine noise, radio, and male speech can be heard, as brushing teeth.
Target: Reversing beeps, engine noise, radio, and male speech can be heard.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Drop Music.
Source: Women are speaking, laughing, walking, running, and shouting, as Music.
Target: Women are speaking, laughing, walking, running, and shouting.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Drop vacuum cleaner.
Source: Crushing, breathing, and conversation sounds are present, as vacuum cleaner.
Target: Crushing, breathing, and conversation sounds are present.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Drop chirping birds.
Source: A young girl talking as a woman is talking, as chirping birds.
Target: A young girl talking as a woman is talking.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

3. Replace

Replacing a target sound with another within a designated time interval, leaving the rest of the audio undisturbed.

Instruction: Replace sea waves to breathing.
Source: Pigeons are making grunting sounds and snapping beaks, as sea waves.
Target: Pigeons are making grunting sounds and snapping beaks, as breathing.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Replace pouring water to toilet flush.
Source: Computer keyboards click while males speak, with the occasional camera click and human sounds, as pouring water.
Target: Computer keyboards click while males speak, with the occasional camera click and human sounds, as toilet flush.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Replace frog to chirping birds.
Source: Rain and thunder, as frog.
Target: Rain and thunder, as chirping birds.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Replace mouse click to chirping birds.
Source: A female speaking, as mouse click.
Target: A female speaking, as chirping birds.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.

Instruction: Replace hand saw to rain.
Source: A woman speaking, as hand saw.
Target: A woman speaking, as rain.

Compared outputs (mel-spectrogram and audio for each): Source, AudioEditor, ZETA, SAO-Instruct, FreeSonic (Ours), Ground Truth.