Visionular ML Models & Image Processing Technology
Read part one of this white paper for an explanation of Visionular’s Intelligent Optimization technology.
Video Noise Estimation
Machine learning models analyze and estimate the noise model and the noise strength inherent in a video; the estimates then guide denoising, quality enhancement, and encoding optimization.
Video Noise Estimation effectively estimates the strength of both compression noise and Gaussian noise. We pair these two noise strengths into an indicator for Image Quality Assessment (IQA) benchmarking: IQA = (compression noise strength, Gaussian noise strength).
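Visionular’s estimators are proprietary, but the Gaussian half of such an indicator can be illustrated with Immerkær’s classic fast noise-variance method, which convolves the image with a Laplacian-difference kernel that suppresses image structure while passing noise. The function name and the pure-NumPy convolution below are our own sketch, not Visionular’s implementation:

```python
import numpy as np

def estimate_gaussian_noise_sigma(img: np.ndarray) -> float:
    """Immerkaer's fast method: estimate the Gaussian noise standard
    deviation of a grayscale image via a Laplacian-difference kernel."""
    h, w = img.shape
    kernel = np.array([[ 1, -2,  1],
                       [-2,  4, -2],
                       [ 1, -2,  1]], dtype=np.float64)
    # Valid-mode 2D convolution via shifted sums (no SciPy needed).
    acc = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            acc += kernel[dy, dx] * img[dy:h - 2 + dy, dx:w - 2 + dx]
    # For zero-mean Gaussian noise, E|filtered| = 6 * sigma * sqrt(2/pi),
    # so invert that relation and average over all valid pixels.
    return float(np.sum(np.abs(acc)) * np.sqrt(np.pi / 2)
                 / (6.0 * (w - 2) * (h - 2)))
```

On a flat image with synthetic Gaussian noise of known sigma, this estimator recovers sigma closely, which is the basic property any noise-strength indicator needs.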
As shown in the following figure, the left image has an IQA score of (2.34, 0.0), indicating strong compression noise but very weak Gaussian noise; a dedicated denoising algorithm is therefore needed to remove the compression noise effectively. The right image has an IQA score of (0.0, 2.85), indicating strong Gaussian noise but weak compression noise, so a Gaussian noise removal algorithm should be applied.
Such an IQA can effectively guide encoding optimization. When a video source has little noise and high quality, over-compression should be avoided in order to preserve that quality. If the source has strong inherent noise, however, a higher compression ratio can be used without introducing noticeable quality degradation. IQA-guided encoding is an essential part of our CAE technology.
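As a minimal sketch of how such an indicator might steer a rate-control knob, consider nudging a CRF-style quality target based on total noise strength. The thresholds, offsets, and function name below are illustrative assumptions, not Visionular’s actual mapping:

```python
def crf_for_source(base_crf: int,
                   compression_noise: float,
                   gaussian_noise: float) -> int:
    """Toy IQA-guided rate control: noisy sources tolerate heavier
    compression (raise CRF); clean sources get a lower CRF so they
    are not over-compressed. Thresholds are illustrative."""
    noise = compression_noise + gaussian_noise
    if noise < 0.5:            # clean, high-quality source
        return max(base_crf - 2, 0)
    if noise > 2.0:            # strongly noisy source
        return min(base_crf + 3, 51)
    return base_crf            # moderate noise: leave target alone
```

For the two example sources above, the noisy ones would be compressed harder while a clean source would be protected from over-compression.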
Our proprietary degrain algorithms run 9.8x faster than FFmpeg’s DCT-based degrain filtering. At the same time, our degrain results are more natural, with clearer edges and without the artifacts created by the FFmpeg degrain algorithms.
Based on the distortion level present in the original video, denoising strength is adjusted adaptively to achieve a good tradeoff between denoised visual quality and the algorithm’s computational complexity. We are always looking for the optimal tradeoff between performance and complexity, which is why our CAE technology works across a wide range of applications, from premium VOD to live streaming to RTC ultra-low-latency video conferencing and screen sharing.
Visionular’s spikiness removal adaptively and accurately adjusts the denoising strength based on the spikiness noise level. As shown in the following figure, Figure 1(a) is the original and Figure 1(b) is the image after denoising. Visionular’s denoised image clearly presents superior visual quality, especially with respect to spikiness removal. Using such pre-processing algorithms, a bitrate saving of as much as 84% can be achieved at the same visual quality.
Our demosaicing algorithm effectively removes compression noise at various noise levels. As shown in the following figure, Figure 1(a) is the original and Figure 1(b) is the result after demosaicing. The image after demosaicing is visibly cleaner and preserves more detail.
The CNN-based demosaicing algorithm leverages deep learning and usually performs better, with fewer blocky compression artifacts and better edge preservation.
Image Scaling Algorithm
Compared with the FFmpeg bicubic scale filter, Visionular’s image scaling algorithm preserves more detail and is especially effective for scenes containing light textures. Compared to bicubic filtering in OpenCV, our scaling algorithm incurs fewer halo artifacts and is smoother, with fewer zigzag-like artifacts.
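For reference, the “bicubic” baseline both comparisons refer to is typically Keys’ cubic convolution kernel with a = -0.5. A minimal 1D sketch (the function names and the simple border clamping are our own, not any library’s API) looks like:

```python
import math

def keys_cubic(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel; a = -0.5 is the common 'bicubic'."""
    x = abs(x)
    if x < 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def resample_1d(samples, t: float) -> float:
    """Interpolate at fractional position t from the 4 nearest samples,
    clamping indices at the borders."""
    i = math.floor(t)
    out = 0.0
    for k in range(i - 1, i + 3):
        kk = min(max(k, 0), len(samples) - 1)  # clamp at borders
        out += samples[kk] * keys_cubic(t - k)
    return out
```

A 2D bicubic scaler applies this kernel separably along rows and columns; halo and zigzag artifacts come from the kernel’s negative lobes interacting with edges.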
Video Enhancement Algorithm
Our Video Enhancement algorithm plays two roles. The first is to enhance overall visual quality, resulting in a cleaner image, as shown in the following image, where the left side is the source and the right side is the output after enhancement. The enhanced side (right) shows a noticeable improvement in quality.
Importantly, the enhancement algorithm has been carefully designed with context adaptation in mind, so that compared to transcoding without enhancement, there is virtually no bitrate penalty after transcoding.
As illustrated in Table 1, using the same set of transcoding parameters, adding enhancement increases the bitrate by just 5%. This small increase delivers a large improvement in perceived image quality, reflected in a 3-point VMAF score improvement. To achieve the same subjective quality, extensive testing shows that enhancement can yield bitrate savings of as much as 30%.
The second role of the Video Enhancement algorithm is to implement detail enhancement only to the regions that are significant to the human visual system. As shown in the following image, after detail enhancement, the face has more visible detail added, making the video more appealing.
Since detail enhancement is applied only in visually significant local regions, the bitrate is largely unchanged. Experimental testing reveals that in certain cases where detailed texture is added to local regions, the bitrate can even drop. As shown in Table 2, a bitrate reduction of 1.6% was achieved while the VMAF score increased slightly, by 0.37. This improvement in subjective quality is fairly noticeable and a positive result of the detail enhancement algorithm at work.
Low Light Enhancement
Adaptive low-light enhancement increases detail in low-light regions while reducing the chance of overexposure in bright regions. In the following image, the right side, produced by our low-light enhancement algorithm, shows more detail in the low-light regions with no over-exposure in the area of the sun.
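A heavily simplified sketch of the adaptive idea, assuming a gamma lift blended in proportionally to darkness, so that bright regions such as the sun are left nearly untouched. All names and constants are illustrative, not Visionular’s algorithm:

```python
import numpy as np

def enhance_low_light(img: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Toy adaptive low-light boost for a float image in [0, 1]:
    apply a gamma lift, but blend it in only where pixels are dark,
    so bright regions are not pushed toward overexposure."""
    lifted = np.power(img, gamma)   # strongly brightens shadows
    weight = 1.0 - img              # dark pixels get more of the lift
    return weight * lifted + (1.0 - weight) * img
```

Dark pixels get a large boost while pixels near full brightness barely move, which is the essential behavior the comparison image illustrates.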
Region of Interest and Bitrate Adaptation
Our region of interest (ROI) identification algorithm and the associated bitrate adaptation algorithm ensure that visually significant areas are encoded at sufficient quality. We mainly focus on the following aspects:
- Faces and the field of view most noticeable to the HVS.
- Visually significant foreground regions.
- Block-level subjective sensitivity.
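A toy sketch of how such ROI categories could be turned into per-block QP offsets, with faces receiving the finest quantization. Block granularity, offset magnitudes, and names here are illustrative assumptions, not Visionular’s encoder interface:

```python
import numpy as np

def roi_qp_offsets(num_rows: int, num_cols: int,
                   face_blocks, foreground_blocks) -> np.ndarray:
    """Build a per-block QP offset map: negative offsets mean finer
    quantization. Faces get the largest bonus, visually significant
    foreground a smaller one, everything else stays at 0."""
    offsets = np.zeros((num_rows, num_cols), dtype=np.int32)
    for r, c in foreground_blocks:
        offsets[r, c] = -2
    for r, c in face_blocks:        # faces override foreground
        offsets[r, c] = -4
    return offsets
```

An encoder would add these offsets to each block’s base QP, spending the saved bits from the background on the regions the HVS actually watches.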
Faces are subjectively sensitive regions for many viewers, so it is critical to detect them accurately in order to optimize quality and adjust encoder parameters specifically for these regions. Visionular’s Intelligent Transcoder provides a standard and a very-low-complexity version of face detection. The standard version processes 1080p video at <3 ms per frame, while the very-low-complexity version achieves <1 ms per frame.
Our solutions adapt to various scenarios where faces are frequently present, including talk shows and newscasts, sports, movies, and video conferencing. Even in complicated scenes with many faces, face occlusion, side silhouettes, or small faces, reasonably good detection results are achieved, as the following image demonstrates.
Ocular Focus Area
The standard face detection and very low complexity face detection versions include ocular focus area detection.
Standard Version for Ocular Focus Region Detection
Our standard version of Ocular Focus Region detection leverages eye-tracker devices to obtain training samples. These samples are used to identify the regions where human eyes focus, and the resulting model is applied across many scenarios.
Very Low Complexity Version for Ocular Focus Region Detection
We also provide a very low complexity version of ocular focus region detection. Using this version, the processing time for 1080p videos using a single CPU core is negligible at just 1ms.
Block Subjective Sensitivity
In x264, the default setup uses adaptive quantization (AQ), where the variance of each block is the sole reference used to determine that block’s quantization step: the larger the variance, the larger the assigned quantization step.
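In simplified form, such a variance-driven rule might look like the following sketch. The pivot and scaling constants are illustrative, not x264’s exact values:

```python
import math

def aq_qp_offset(block_variance: float, strength: float = 1.0) -> float:
    """Variance-driven adaptive quantization in the spirit of x264's AQ:
    high-variance blocks receive a positive (coarser) QP offset, flat
    blocks a negative (finer) one. PIVOT and the log scale are
    illustrative, not x264's internal constants."""
    PIVOT = 14.0   # log2(variance) treated as "average" complexity
    energy = max(block_variance, 1e-6)   # guard against log(0)
    return strength * (math.log2(energy) - PIVOT)
```

A block at the pivot complexity gets no offset; blocks four octaves flatter get a strong negative offset, which is how AQ protects smooth areas from blocking.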
However, variance lacks robustness as a measure of perceptual quality. Ideally it would correlate closely with perceived smoothness, but in practice it does not. For instance, for the one-dimensional signals shown in the following figure, the signal on the right has a larger variance than the one on the left, yet the right-hand curve is clearly the smoother of the two.
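The figure’s point is easy to reproduce numerically: a signal can have the larger variance and still be the smoother one, as a simple mean-absolute-difference roughness proxy (our own illustrative metric) shows:

```python
import numpy as np

def variance(sig) -> float:
    return float(np.var(sig))

def roughness(sig) -> float:
    """Mean absolute first difference: a crude proxy for perceived
    roughness, unlike variance it ignores the signal's overall range."""
    return float(np.mean(np.abs(np.diff(sig))))

# A smooth ramp spanning a wide range vs. a small-range jagged signal.
ramp = 10.0 * np.linspace(0.0, 1.0, 200)                 # smooth, wide range
jagged = np.where(np.arange(200) % 2 == 0, -1.0, 1.0)    # rough, narrow range

print(variance(ramp), roughness(ramp))      # ~8.4, ~0.05
print(variance(jagged), roughness(jagged))  # 1.0, 2.0
```

The ramp’s variance is more than eight times the jagged signal’s, yet the jagged signal is far rougher, so a variance-only AQ rule would quantize the smooth ramp more coarsely than the visibly noisy signal.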
Consider the widely used test clip RaceHorses. In the first row of macroblocks, the green meadow in the background has a smaller variance, while the blocks covering the cap, face, and eyes have a larger variance, so a larger QP is assigned to the face and eye regions.
Yet the face and eyes are regions that are highly sensitive subjectively and should have been assigned smaller QP values. In order to preserve more detail, Visionular’s Intelligent Optimization technology distinguishes visually significant regular textures from other high-variance content. As a result, subjectively sensitive regions are preserved, yielding superior subjective quality.