Scholars at Carnegie Mellon University recently offered what they call a “High-Modality Multimodal Transformer,” which combines not just text, image, video, and speech but also database table information and time series data. Lead author Paul Pu Liang and colleagues report that they observed “a crucial scaling behavior” in the 10-modality neural network: “Performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks.”
Scholars Yiyuan Zhang and colleagues at the Multimedia Lab of The Chinese University of Hong Kong boosted the number of modalities to a dozen in their Meta-Transformer. Among those modalities, point clouds represent 3D vision, while hyper-spectral sensing data captures electromagnetic energy reflected back from the ground in fly-over images of landscapes.
The immediate payoff of multi-modality will simply be to enrich the output of a program such as ChatGPT in ways that go far beyond the “demo” mode. One immediate example is a children’s storybook, in which text passages are combined with pictures illustrating the text. Combining the language and image attributes allows the kinds of pictures created by the diffusion process to be controlled more subtly from picture to picture.