How We Use AI & ML to Enhance the Visual Appeal of Any Video

Zoe Liu
President & Co-Founder, Visionular
Stories are what bind us, and it has been this way throughout human history. Through stories, people share their society’s values, the knowledge of past generations and their dreams for the future. And there has never been a more powerful, immersive and beautiful way to tell those stories than through film. Enabling that film to look as good as it can is what drives our team of more than 50 video codec, algorithm, and signal processing engineers.
The devices we stream video on today are unforgiving. Every flaw in the original source, every bit of noise, and every artifact is under the microscope, as jaw-droppingly good modern screens amplify impairments that were never visible before.
For this reason, in addition to our industry-renowned codec engineering team, we’ve recruited some of the foremost experts in video processing. Together, they deliver visually stunning video encodes, even when the source video is less than perfect. Here’s a look at the AI and Machine Learning (ML) technology that our team has developed.
Image Quality Repair and Enhancement.
With improper handling and careless storage, countless films, including landmarks in the history of cinema, have been marred by scratches, blurriness, color fading and image distortion that was never part of the original filmmaker’s vision.
Fortunately, technology can restore the original artistic intent, often improving upon it. Using this marvelous technology, we infuse black and white films with lifelike colors, remove noise, and can even add new frames to the video where the original doesn’t contain a sufficient level of visual information to look great when transferred to digital. At Visionular, we are leading the way in film restoration.
Our AI- and ML-assisted video encoding technology breathes new life into films that have been degraded by time and neglect. In addition, our technology adds extra fidelity to the filmmaker’s vision, preserving the Director’s and Cinematographer’s aesthetic decisions and respecting the limits imposed by the technology they were working with at the time.
For example, there is a distinct difference between a film produced with the latest digital technology in high-contrast black and white (B&W) to reflect the stark moral choices its characters face, and a film shot in B&W simply because that was all that was available at the time of production. The latter is a prime candidate for colorization; for the former, adding color would clearly conflict with the filmmaker’s artistic intent and should be avoided.
Our specially developed technology and tools perform de-interlacing, denoising, sharpening, and other processing functions to enhance texture, sharpen details, and improve the overall clarity of the source film using convolutional neural networks (CNN).
To improve how well the CNN model performs on real-world video restoration, we optimize our training data set: high-definition source material is degraded with a cascade of varied blur kernels, random downsampling, and coding compression, so that the training data more realistically simulates the degradation found in real content.
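To make the idea concrete, here is a minimal sketch of this kind of degradation cascade, written with OpenCV. It is illustrative only; the function name, parameter ranges, and the use of a JPEG round-trip as a stand-in for video compression are assumptions, not our production pipeline.

```python
import cv2
import numpy as np

def degrade_frame(hd_frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a blur -> downsample -> compress cascade to one HD frame.

    Training a restoration CNN on (degraded, pristine) pairs built this way
    helps it generalize to the impairments found in real-world footage.
    """
    h, w = hd_frame.shape[:2]

    # 1. Random blur kernel: vary the Gaussian kernel size and sigma.
    ksize = int(rng.choice([3, 5, 7]))
    sigma = float(rng.uniform(0.5, 3.0))
    frame = cv2.GaussianBlur(hd_frame, (ksize, ksize), sigma)

    # 2. Random downsampling with a randomly chosen interpolation method.
    scale = float(rng.uniform(0.25, 0.5))
    interp = int(rng.choice([cv2.INTER_AREA, cv2.INTER_LINEAR, cv2.INTER_NEAREST]))
    frame = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=interp)

    # 3. Coding compression: a JPEG round-trip at a random quality stands in
    #    for the blocking and ringing introduced by video codecs.
    quality = int(rng.integers(20, 60))
    _, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
    frame = cv2.imdecode(buf, cv2.IMREAD_COLOR)

    # Upscale back to the original resolution so the pair is spatially aligned.
    return cv2.resize(frame, (w, h), interpolation=cv2.INTER_LINEAR)
```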
Here’s an example from the animated film Journey to the West that shows how much perceived image quality can be improved:
In the image on the left, it’s clear that lines are blurred, and the frame is littered with noise. But, after undergoing repair and enhancement, the noise has been intelligently adjusted according to its intensity. As is clearly visible, the resulting picture quality is dramatically improved.
Resolution Enhancement.
Poor image quality isn’t always intrinsic to the source film; sometimes it is a result of today’s screens. Older movies have relatively low resolution, meaning that if you watch them on a 2K or 4K screen, issues like film grain, mosaics, and jaggies may take center stage. A further issue is that one television may have an inherently better upscaling function than another, which leads to inconsistent perceptual quality from device to device, something that is not good for the viewer.
Our Intelligent Super-Resolution technology evaluates the content and transforms it to be ultra-high-resolution, making it possible for 21st-century audiences to enjoy 20th-century classic movies on today’s high resolution screens.
Intelligent Super-Resolution technology uses a generative adversarial network to upscale the video, adding additional lines of detail (resolution) to the picture. By optimizing a perceptual loss function, we bring the network’s analysis of video content closer to the human visual system (HVS), thereby improving the subjective experience.
The training of the generative adversarial networks is based on millions of high-definition images and videos and their extended datasets. These training datasets cover a variety of content types and resolutions, allowing them to meet the super-resolution requirements of a wide range of scenarios. A traditional generative adversarial network structure is complex and computationally intensive, which greatly limits processing speed. Our Intelligent Super-Resolution technology uses pruning and distillation to optimize and slim down the traditional generative adversarial network, greatly improving computational efficiency.
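As an illustration of the perceptual-loss idea, here is a minimal PyTorch sketch of a VGG-feature loss of the kind used in ESRGAN-style super-resolution training. It is a generic example; the layer index and choice of L1 distance are assumptions, not our actual training setup.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    """Compare super-resolved and ground-truth frames in VGG-19 feature space.

    Minimizing this distance (instead of raw pixel error alone) pushes the
    generator toward textures that look natural to the human visual system.
    """
    def __init__(self, layer_index: int = 35):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.DEFAULT).features[:layer_index]
        for p in features.parameters():
            p.requires_grad = False          # frozen feature extractor
        self.features = features.eval()
        self.criterion = nn.L1Loss()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        # sr and hr are (N, 3, H, W) tensors in the range expected by VGG.
        return self.criterion(self.features(sr), self.features(hr))
```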
As usual, words don’t do it justice; only a picture can. See Intelligent Super-Resolution in action:
Intelligent Frame Rate Up-Conversion.
Another issue that results from the incompatibility between old and new technology has to do with the fast refresh rates of today’s TVs and mobile devices. A film shot at 24 frames per second cannot match the fidelity of the HVS, especially with higher-resolution video, where scenes with a lot of high-speed motion can show annoying jerkiness, choppiness, or other visual artifacts that impact quality.
We address these problems using our Intelligent Frame Rate Up-Conversion technology, which analyzes motion in the scene to determine the path of objects based on their location in two consecutive frames. It applies that information to construct an entirely new frame, which is inserted to increase the frame rate to a level that ensures a smooth and clear viewing experience.
Among the current video frame interpolation methods, the method based on deep learning optical flow estimation performs the best overall. Two classic network models are:
- FlowNet network – see A. Dosovitskiy et al., “FlowNet: Learning Optical Flow with Convolutional Networks,” 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 2758-2766.
- FlowNet 2.0 network – see E. Ilg et al., “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017.
Leveraging the deep learning-based framework, our frame interpolation engine takes advantage of frame feature residual information to restore image details and effectively improve the accuracy of the interpolated frames. Frame rate up-conversion methods often suffer from non-ideal interpolated frames caused by inaccurate optical flow estimation. Our Intelligent Frame Rate Interpolation Engine can determine whether an interpolated frame is appropriate using the bidirectional optical flow results and, if needed, modify the optical flow values to obtain a visually better interpolated frame.
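To show the basic mechanics of flow-based interpolation, here is a minimal sketch that synthesizes a midpoint frame from two consecutive frames. It uses OpenCV’s Farneback optical flow as a stand-in for a learned estimator such as FlowNet, plus a simple forward/backward consistency check; the function name, parameters, and fallback blending strategy are illustrative assumptions rather than our production engine.

```python
import cv2
import numpy as np

def interpolate_midframe(frame0: np.ndarray, frame1: np.ndarray,
                         consistency_thresh: float = 2.0) -> np.ndarray:
    """Synthesize the frame halfway between frame0 and frame1."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)

    # Dense optical flow in both directions (0 -> 1 and 1 -> 0).
    flow_fw = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_bw = cv2.calcOpticalFlowFarneback(g1, g0, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = g0.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))

    # Backward-warp each source frame halfway along its motion,
    # assuming roughly linear motion between the two frames.
    map0_x = (grid_x - 0.5 * flow_fw[..., 0]).astype(np.float32)
    map0_y = (grid_y - 0.5 * flow_fw[..., 1]).astype(np.float32)
    map1_x = (grid_x - 0.5 * flow_bw[..., 0]).astype(np.float32)
    map1_y = (grid_y - 0.5 * flow_bw[..., 1]).astype(np.float32)
    warp0 = cv2.remap(frame0, map0_x, map0_y, cv2.INTER_LINEAR)
    warp1 = cv2.remap(frame1, map1_x, map1_y, cv2.INTER_LINEAR)

    # Forward/backward consistency: where the two flows disagree, the motion
    # is unreliable, so fall back to a plain blend of the original frames.
    disagreement = np.linalg.norm(flow_fw + flow_bw, axis=2)
    reliable = (disagreement < consistency_thresh)[..., None]
    blended_warp = (warp0.astype(np.float32) + warp1.astype(np.float32)) / 2
    blended_orig = (frame0.astype(np.float32) + frame1.astype(np.float32)) / 2
    mid = np.where(reliable, blended_warp, blended_orig)
    return mid.astype(np.uint8)
```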
See Intelligent Frame Interpolation in action:
Pseudo HDR using High Dynamic Range Frame Beautification.
With audiences accustomed to the vibrant, true-to-life colors of HDR photos and videos, they have little tolerance for flat or poor color reproduction. Because most videos were originally captured in standard dynamic range (SDR), we developed our Pseudo HDR solution using our HDR Frame Beautification (HDR-FB) technology.
After analyzing a video’s color range and brightness, HDR-FB converts standard dynamic range video to a simulated high dynamic range, with deep, balanced colors across the spectrum for a beautifully natural look. We call the end result “Pseudo HDR.”
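To give a feel for what dynamic range expansion involves, here is a naive Python sketch of inverse tone mapping. It is an illustration only; HDR-FB’s analysis of color range and brightness is far more sophisticated, and the expansion curve, peak brightness, and function name here are assumptions.

```python
import numpy as np

def pseudo_hdr_expand(sdr_frame: np.ndarray, gamma: float = 2.2,
                      sdr_white_nits: float = 100.0,
                      peak_nits: float = 600.0) -> np.ndarray:
    """Naive SDR -> pseudo-HDR brightness expansion.

    The 8-bit SDR signal is linearized, then bright regions are boosted
    progressively more than shadows and mid-tones. The result is linear
    light in nits (it would still need PQ or HLG encoding for delivery).
    """
    x = sdr_frame.astype(np.float32) / 255.0        # normalize 8-bit input
    linear = np.power(x, gamma)                     # undo the SDR gamma

    # Expansion curve: the boost grows with brightness, reaching the HDR
    # peak only for pixels that were already at SDR reference white.
    boost = 1.0 + (peak_nits / sdr_white_nits - 1.0) * np.power(linear, 2.0)
    return linear * sdr_white_nits * boost
```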
Here’s a visually stunning side-by-side comparison of SDR and Pseudo HDR:
Smart Tone Mapping.
For devices that do not support HDR video decoding or where the connected display cannot correctly represent high dynamic range video, Smart Tone Mapping converts HDR video to SDR video.
HDR images and videos have higher brightness, deeper bit depth, and a wider color gamut, meaning that in many cases they cannot be displayed correctly on non-HDR displays. To remain compatible with SDR-only displays and devices, it is necessary to apply tone mapping technology that maps HDR images and videos into the SDR range.
Because the brightness range of SDR display devices is much smaller than that of HDR images, and because the human eye’s perception of brightness is non-linear with respect to the actual light intensity, non-linear processing is required when designing tone mapping algorithms. Different tone mapping algorithms use different processing methods, but the traditional methods can be described by formulas of the following general form:
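f(I), I ∈ ℝ^(w×h×c)

c = (C / L)^s · T

(The exact operator varies from method to method; the second equation is the standard saturation-preserving color correction applied once the luminance has been mapped.)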
Where ‘f’ represents the tone mapping operator, ‘I’ represents the image to be operated on, ‘w’ and ‘h’ represent the width and height of the image, and ‘c’ represents the number of channels of the image, usually three (RGB).
Here ‘C’ and ‘c’ are the colors before and after tone mapping, respectively. ‘L’ represents the brightness of the HDR image, while ‘T’ is the corresponding brightness value after tone mapping. ‘s’ is the parameter used to adjust saturation, typically with s < 1.
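To make the formulas concrete, here is a minimal Python sketch of a traditional global operator of this kind, using the well-known Reinhard curve T = L / (1 + L) as the luminance mapping and the color correction c = (C / L)^s · T described above. It is an illustrative example, not our Smart Tone Mapping implementation; the Rec. 709 luminance weights and the value of s are assumptions.

```python
import numpy as np

def global_tone_map(hdr_rgb: np.ndarray, s: float = 0.7) -> np.ndarray:
    """Map linear-light HDR RGB values into the SDR range [0, 1].

    Luminance is compressed with the Reinhard curve T = L / (1 + L), then
    each channel is rescaled via c = (C / L)**s * T, where s < 1 slightly
    desaturates highlights so they do not clip to a single primary color.
    """
    # Relative luminance of the HDR input (Rec. 709 weights assumed).
    L = (0.2126 * hdr_rgb[..., 0]
         + 0.7152 * hdr_rgb[..., 1]
         + 0.0722 * hdr_rgb[..., 2])
    L = np.maximum(L, 1e-6)                  # avoid division by zero

    T = L / (1.0 + L)                        # global luminance compression

    # Saturation-adjusted color correction, applied per channel.
    ratio = hdr_rgb / L[..., None]
    sdr = np.power(ratio, s) * T[..., None]
    return np.clip(sdr, 0.0, 1.0)
```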
Although universal algorithms can enable HDR pictures and videos to be displayed on SDR monitors, every display model has different attributes, which yields quality inconsistencies. We therefore conduct extensive tone mapping experiments: we uniformly sample specific colors in the RGB space, encode them as Rec. 709 video, and use a spectrophotometer to identify the closest color in the BT.2020 space. Finally, a high-precision tone-mapping algorithm is fitted to the resulting data pairs, and an adaptive tone-mapping model is established.
Example of Smart Tone Mapping:
At Visionular, we have tremendous respect for the craft and artistic passion that go into filmmaking. We are committed to making sure that every video, no matter when it was made or what equipment it was made on, can be viewed and enjoyed to its fullest extent. While we are extremely proud of our technology, we constantly pursue better and more effective ways to refine and improve the state of the art in video enhancement and restoration, so that every Director’s and DP’s artistic vision can be preserved for generations.