
Woosh — Sound Effect Generative Models

Inference code and open weights for sound effect generative models developed at Sony AI.




Models

| Model | Task | Steps | CFG scale | Description |
|---|---|---|---|---|
| Woosh-Flow | Text-to-Audio | 50 | 4.5 | Base model, best quality |
| Woosh-DFlow | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
| Woosh-VFlow | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
| Woosh-DVFlow | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
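The Steps and CFG columns can be read through a generic flow-matching sampler. The sketch below assumes a plain Euler ODE solver and the standard classifier-free guidance (CFG) combination; neither is confirmed by this card, so check the official repository for the actual sampler. The notation is ours: $z_t$ is the latent at time $t$ (noise at $t = 0$, data at $t = 1$), $v_\theta$ the predicted velocity, $c$ the text or video condition, $\varnothing$ the null condition, $w$ the CFG scale and $N$ the number of steps.

```latex
\begin{aligned}
\tilde{v}_\theta(z_t, t, c) &= v_\theta(z_t, t, \varnothing)
  + w\,\bigl(v_\theta(z_t, t, c) - v_\theta(z_t, t, \varnothing)\bigr) \\
z_{t + \Delta t} &= z_t + \Delta t\,\tilde{v}_\theta(z_t, t, c),
  \qquad \Delta t = 1/N,\quad t = 0, \Delta t, \dots, 1 - \Delta t \\
w = 1 \;&\Longrightarrow\; \tilde{v}_\theta(z_t, t, c) = v_\theta(z_t, t, c)
\end{aligned}
```

With $w = 1.0$ the unconditional term cancels, so each step needs only one conditional network evaluation: 4 evaluations for a distilled model, versus 2 × 50 = 100 for a base model at $w = 4.5$ when the conditional and unconditional passes run separately. Because the citation below mentions FlowMap distillation, the distilled models' 4 steps are likely learned jumps rather than Euler steps; the CFG = 1.0 argument holds either way.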

Components

  • Woosh-AE: high-quality audio latent encoder/decoder. It encodes audio into the latents used for generative modeling and decodes generated latents back into audio.
  • Woosh-CLAP (TextConditionerA/V): multimodal text-audio alignment model that provides token-level latents for conditioning the diffusion models. TextConditionerA is used for T2A (text-to-audio), TextConditionerV for V2A (video-to-audio).
  • Woosh-Flow / Woosh-DFlow: base and distilled latent diffusion models (LDMs) for text-to-audio generation.
  • Woosh-VFlow / Woosh-DVFlow: base and distilled multimodal LDMs that generate audio from video, with optional text prompts.
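Read together, these components form a pipeline. As a hedged sketch of text-to-audio generation, assuming sampling starts from Gaussian noise in the Woosh-AE latent space (an assumption, not stated on this card), with $y$ the text prompt and notation of our own:

```latex
z_0 \sim \mathcal{N}(0, I), \qquad
c = \mathrm{CLAP}_{A}(y), \qquad
z_1 = \mathrm{Sample}_{\text{Flow}}(z_0, c), \qquad
\hat{a} = \mathrm{Dec}_{\text{AE}}(z_1)
```

Here $\mathrm{CLAP}_{A}$ stands for TextConditionerA, $\mathrm{Sample}_{\text{Flow}}$ for running Woosh-Flow or Woosh-DFlow with the step count and CFG scale from the Models table, and $\mathrm{Dec}_{\text{AE}}$ for the Woosh-AE decoder. For video-to-audio, VFlow/DVFlow are additionally conditioned on the video and use TextConditionerV for the optional prompt; how the video features are extracted is not described here.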

ComfyUI Nodes

Use these models in ComfyUI with ComfyUI-Woosh:

```shell
# Option 1: via ComfyUI Manager, search for "Woosh" and click Install.

# Option 2: install manually
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
pip install -r ComfyUI-Woosh/requirements.txt
```

Place downloaded model folders in ComfyUI/models/woosh/. See the ComfyUI-Woosh README for full setup and workflow examples.
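A minimal shell sketch of the placement step. It assumes ComfyUI is checked out in ./ComfyUI and that the model folders were downloaded into a hypothetical ./downloads directory; both paths are placeholders you can override with the COMFY_DIR and DOWNLOAD_DIR environment variables.

```shell
#!/usr/bin/env bash
# Sketch: move downloaded Woosh model folders to where ComfyUI-Woosh looks for them.
# COMFY_DIR and DOWNLOAD_DIR are hypothetical defaults; adjust them to your setup.
set -euo pipefail

COMFY_DIR="${COMFY_DIR:-ComfyUI}"
DOWNLOAD_DIR="${DOWNLOAD_DIR:-downloads}"
TARGET="$COMFY_DIR/models/woosh"

mkdir -p "$TARGET"

# Move every sub-folder of DOWNLOAD_DIR; an unmatched glob stays literal and is skipped.
for d in "$DOWNLOAD_DIR"/*/; do
  [ -d "$d" ] || continue
  name=$(basename "$d")
  if [ -e "$TARGET/$name" ]; then
    echo "skipping $name (already present in $TARGET)"
    continue
  fi
  mv "$d" "$TARGET/"
done

ls "$TARGET"
```

Skipping folders that already exist keeps the script safe to re-run after a partial download.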

Note: Set the Woosh TextConditioning node to T2A for Flow/DFlow models and V2A for VFlow/DVFlow models.

Inference

See the official Woosh repository for standalone inference code and training details.

VRAM Requirements

| Configuration | VRAM (approx.) |
|---|---|
| Flow / VFlow | ~8-12 GB |
| DFlow / DVFlow | ~4-6 GB |
| With CPU offload | ~2-4 GB |
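To see which row of the table fits a given machine, this shell sketch compares free GPU memory against it. It assumes an NVIDIA GPU with nvidia-smi on PATH. The helper name woosh_tier is hypothetical, and its thresholds (the upper end of each approximate range) are our own conservative reading of the table, not official limits.

```shell
#!/usr/bin/env bash
# Sketch: map free GPU memory (in MiB) to a row of the VRAM table.
# Thresholds use the upper end of each approximate range; they are not official limits.
woosh_tier() {
  local free_mib=$1
  if [ "$free_mib" -ge 12288 ]; then
    echo "Flow / VFlow"
  elif [ "$free_mib" -ge 6144 ]; then
    echo "DFlow / DVFlow"
  elif [ "$free_mib" -ge 4096 ]; then
    echo "With CPU offload"
  else
    echo "below the listed minimum"
  fi
}

# Query the first GPU's free memory if nvidia-smi is available (NVIDIA GPUs only).
if command -v nvidia-smi >/dev/null 2>&1; then
  free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
  case $free in
    ''|*[!0-9]*) echo "could not read free GPU memory" ;;
    *) echo "Free VRAM: ${free} MiB -> suggested: $(woosh_tier "$free")" ;;
  esac
else
  echo "nvidia-smi not found; call woosh_tier <free MiB> manually"
fi
```

Actual peak usage can differ from these approximations, so treat the suggestion as a starting point.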

Citation

```bibtex
@article{saghibakshi2025woosh,
  title   = {Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
  author  = {Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and Kawakami, Kazuhiro and Gu, Yuxuan},
  journal = {arXiv preprint arXiv:2502.07359},
  year    = {2025}
}
```

License
