Generative AI: Perpetual View/Video Generation (PVG)
Perpetual View/Video Generation (PVG). AI models generate open-ended videos from a single input image - at least that's the newest approach from Google Research [1]. These videos "fly" through the scene depicted in the input image. The range of potential applications is endless. Further, the tech behind it is incredibly smart, and it marks the next cornerstone of generative AI.
Glossary to understand AI better 📙
AGI = Artificial general intelligence
HGNG = Hybrid Generative Neural Graphics
PVG’s complex tech simplified 🧰
Generating a perpetual video is not trivial at all. Reading Google's paper, this becomes quite obvious. However, I broke it down into 3+1 main ingredients:
1) In-painting: As the virtual camera continues its movement, the space behind objects like trees and mountains opens up and needs to be filled. For more info, see episode 4 of this newsletter [2].
2) Out-painting: When the camera continues to move beyond what the input image captured, the AI needs to fill the space with newly generated content. See again episode 4 [2].
3) Super-resolution: As the camera moves into the scene, we effectively zoom into a subset of the pixels, which lowers the image resolution. This is tackled by upscaling the image with super-resolution techniques, e.g. TecoGAN [3a]. In addition, take a look at Saharia's work in this space - a talented Research Engineer at Google Brain worth following [3b].
+1) The dataset: For good PVG, the AI models are data-hungry. The Google Research team put a lot of work into synthesizing a labeled dataset of around 10 million videos, each of which includes a depth profile of the objects in it. Labeling at that scale is impossible to do by hand. This brings us to a new forefront of creating/enhancing datasets synthetically - an evolving field that enables new AI heights and will get more and more attention.
The tech research firm Gartner likewise forecasts that synthetic data will become the main form of data used in AI training [4].
Sorry for that quick digression, back to PVG. 🫠
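To make steps 1-3 above concrete, here is a minimal, purely illustrative sketch of the render-refine-repeat loop in Python. Everything in it (the warping, the mean-fill "in-painting", the nearest-neighbour "super-resolution") is a hypothetical stand-in for the neural networks the paper actually uses - it just shows how the pieces chain together, frame after frame:

```python
import numpy as np

def warp_forward(frame, shift=4):
    """Crudely simulate camera motion: shift the frame sideways and mark
    the uncovered region as holes (NaN). A real system would do a
    depth-based 3D re-projection instead."""
    h, w = frame.shape
    warped = np.full((h, w), np.nan)
    warped[:, :w - shift] = frame[:, shift:]
    return warped

def inpaint(frame):
    """Steps 1+2 (in-/out-painting): fill the holes. Here we just use
    the mean of the visible pixels; the real model is a neural net."""
    filled = frame.copy()
    filled[np.isnan(filled)] = np.nanmean(frame)
    return filled

def super_resolve(frame, scale=2):
    """Step 3: upscale by nearest-neighbour repetition as a stand-in
    for a learned super-resolution model like TecoGAN."""
    return np.repeat(np.repeat(frame, scale, axis=0), scale, axis=1)

# Perpetual loop: warp -> in-paint -> upscale, feeding each frame back in.
frame = np.random.rand(8, 8)           # stand-in for the input image
video = [frame]
for _ in range(3):
    nxt = inpaint(warp_forward(video[-1]))
    nxt = super_resolve(nxt)[:8, :8]   # crop back to a fixed resolution
    video.append(nxt)
```

The key design point this mimics: each generated frame becomes the input for the next one, which is exactly why errors compound and why the in-painting and super-resolution stages matter so much for long videos.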
Google's new approach pushes the boundaries here. The videos are significantly longer than those of other methods while keeping good, photorealistic quality. Further, it supports both linear and non-linear camera movements.
There is room for improvement, e.g. around video consistency, resolution, etc. But that is A-OK because this field is just starting to open up. For instance, look at DALL-E's evolution: at first, DALL-E 1 produced low-quality images [5], and now, just 1 year later, the whole community is stoked about DALL-E 2's stunning results [6]. I am positive that we will see a similar progression with PVG.
A picture is worth a thousand words. Please, visit [1] to see an example video. Anyways, here is a screenshot:
And, what could this mean? 💡
I always think about how this could evolve 1-to-5 papers further down the research line, and how it could be used in the industry, at home, or for good. Starting with the obvious one, these are my thoughts:
Video generation for various purposes: PVG is an important piece in the evolution of video generation, like HGNG from episode 5 of this newsletter [7].
Models are becoming more complex: AI models are getting bigger, and an increasing number of distinct AI tasks are being integrated into single, larger models. To me, this indicates progression towards AGI. By the way, I do recommend listening to Lex Fridman's interview with John Carmack [8] on this topic.
Merge with Google Maps: Now imagine they combined this technology with Google Maps. Recurring images, i.e. users' snapshots of places and Google Street View footage, would serve the AI as reference points for frame recalibration. The generated "flying" video could maintain high resolution and smooth transitions while perpetually generating video material.
My mind goes in all kinds of directions here. What about 360-degree views? What could be the impact on the gaming industry? And mostly, what do YOU think is an interesting angle to this unfolding? Let us know. 🧡
Finally, The Top 3 GAI Gems 💎
Play around with (stable diffusion) AI image generation.
An awesome, trippy AI-generated music video. Try [9].
Old but gold: OpenAI’s Jukebox generates music including singing.