Bi-Prediction Based Video Quality Enhancement via Learning
This paper, “Bi-Prediction Based Video Quality Enhancement via Learning”, by Dandan Ding, Wenyu Wang, Junchao Tong, Xinbo Gao, Zoe Liu, and Yong Fang, appeared in IEEE Transactions on Cybernetics, June 17, 2020.
Convolutional neural network (CNN)-based video quality enhancement generally employs optical flow for pixel-wise motion estimation and compensation, and then uses the motion-compensated frames to jointly explore the spatiotemporal correlation across frames. This method, called the optical-flow-based method (OPT), usually achieves high accuracy at the expense of high computational complexity. In this article, we develop a new framework, referred to as bi-prediction-based multi-frame video enhancement (PMVE), to achieve a one-pass enhancement procedure.
In this article, we target the potential inherent in the temporal domain. A two-step framework is adopted where the first step is preprocessing and the second is frame fusion. For accurate modeling, both stages within the framework are developed through learning technology.
In preprocessing, a method is expected to efficiently extract and collect the temporal information. Previous work generally resorts to optical flow, which provides pixel-wise motion information, and neighboring frames are then compensated on the basis of that motion information. In such scenarios, the performance largely depends on the quality of the optical flow.
For accurate modeling, a complicated estimation operation is usually conducted, which requires substantial computational resources. On the other hand, several neighboring frames are usually involved in video enhancement to achieve better gains. These neighboring frames are similar, so the temporal information they provide is redundant, especially for compressed videos, where frames have high reference dependencies.
Therefore, high performance can be achieved and the complexity maintained at a reasonable level if we can extract and utilize the information across multiple frames in an efficient manner. Inspired by the techniques behind frame interpolation and extrapolation, we propose to extract the temporal information in a bi-prediction manner: we predict the current frame by learning from its preceding and following neighboring frames, without using the current frame itself.
The predicted frame, referred to as the virtual frame (VF), is essentially an inference of the current frame and contains abundant temporal information that is helpful for enhancing it. Relative to the optical-flow-based method, the bi-prediction scheme can involve more neighboring frames for enhancement without increasing the computational complexity.
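To make the input/output contract of bi-prediction concrete, here is a minimal sketch. It assumes grayscale frames stored as nested lists, and the function name `predict_virtual_frame` is hypothetical. Pixelwise averaging is only a placeholder for the learned prediction network described in the paper; the point is that the predicted frame is built from the two neighbors alone, and the current frame is never read.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def predict_virtual_frame(prev_frame: Frame, next_frame: Frame) -> Frame:
    """Placeholder for a learned predictor: blend the two neighbors.

    ASSUMPTION: the paper uses a trained CNN here. Averaging is a
    stand-in that only illustrates which frames go in and what comes out.
    """
    if len(prev_frame) != len(next_frame):
        raise ValueError("frames must have the same height")
    virtual: Frame = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        if len(row_prev) != len(row_next):
            raise ValueError("frames must have the same width")
        virtual.append([(p + n) / 2.0 for p, n in zip(row_prev, row_next)])
    return virtual


# Toy 2x2 frames at times t-1 and t+1; the frame at time t is not an input.
frame_before = [[10.0, 20.0], [30.0, 40.0]]
frame_after = [[14.0, 24.0], [34.0, 44.0]]
virtual_frame = predict_virtual_frame(frame_before, frame_after)
```

A learned predictor would replace the averaging step, but the signature (two neighbors in, one predicted frame out) would stay the same.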
In frame fusion, it is critical to develop a CNN structure that takes full advantage of the obtained temporal information. On the other hand, because available memory is restricted, both the depth of the CNN and the number of network parameters are limited. A large number of parameters or an unreasonable network structure is also likely to cause overfitting and thus degrade the performance. Hence, a CNN design is needed that balances network depth against the total number of parameters.
A high-level overview of how PMVE works:
PMVE develops a new multi-frame approach for compressed video enhancement, aiming to achieve a balanced tradeoff between enhancement performance and computational complexity.
PMVE designs two networks, namely the prediction network (Pred-net) and the frame-fusion network (FF-net), to implement the two steps of synthesis and fusion, respectively.
Specifically, the Pred-net leverages frame pairs to synthesize the so-called virtual frames (VFs) for the low-quality frames (LFs) through bi-prediction. Afterward, the slowly fused FF-net takes the VFs as input and extracts the correlation between the VFs and the related LFs to obtain an enhanced version of those LFs. Such a framework allows PMVE to leverage the cross-correlation between successive frames for enhancement and hence to achieve high accuracy.
Meanwhile, PMVE avoids explicit motion estimation and compensation, which greatly reduces the complexity compared to OPT.
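The two-step flow described in this overview (synthesize VFs from frame pairs, then fuse them with the LF) can be sketched as follows. Everything named here is an assumption for illustration: `toy_pred_net` and `toy_ff_net` are averaging stand-ins for the trained Pred-net and FF-net, and `enhance_frame` and the hand-picked pair list are hypothetical. The sketch shows only the data flow, with no optical flow or motion compensation anywhere.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[List[float]]
PredNet = Callable[[Frame, Frame], Frame]
FusionNet = Callable[[Frame, List[Frame]], Frame]


def average_frames(frames: Sequence[Frame]) -> Frame:
    """Pixelwise mean of equally sized frames (shared by both stand-ins)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[y][x] for f in frames) / len(frames) for x in range(width)]
        for y in range(height)
    ]


def toy_pred_net(before: Frame, after: Frame) -> Frame:
    # ASSUMPTION: stand-in for the learned Pred-net.
    return average_frames([before, after])


def toy_ff_net(low_quality: Frame, virtual_frames: List[Frame]) -> Frame:
    # ASSUMPTION: stand-in for the learned FF-net.
    return average_frames([low_quality] + virtual_frames)


def enhance_frame(
    video: Sequence[Frame],
    index: int,
    pairs: Sequence[Tuple[int, int]],
    pred_net: PredNet,
    ff_net: FusionNet,
) -> Frame:
    """Step 1: one VF per (before, after) pair. Step 2: fuse VFs with the LF."""
    virtual_frames = [pred_net(video[b], video[a]) for b, a in pairs]
    return ff_net(video[index], virtual_frames)


def constant_frame(value: float) -> Frame:
    return [[value, value], [value, value]]


# Five toy frames; frame 2 plays the role of the degraded LF.
video = [constant_frame(v) for v in (0.0, 1.0, 5.0, 3.0, 4.0)]
enhanced = enhance_frame(video, 2, [(1, 3), (0, 4)], toy_pred_net, toy_ff_net)
```

Passing the two networks in as callables keeps the orchestration separate from the models, so a trained network could be swapped in without touching `enhance_frame`.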
The experimental results demonstrate that the peak signal-to-noise ratio (PSNR) performance of PMVE is on par with that of OPT, while its computational complexity is only 1% of that of OPT. Compared with other state-of-the-art methods in the literature, PMVE also achieves superior objective and visual quality at a reasonable complexity level. For instance, PMVE surpasses its best counterpart method by up to 0.42 dB in PSNR.
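For readers less familiar with the metric, the sketch below computes PSNR using its standard definition, 10·log10(peak² / MSE). It assumes 8-bit content (peak value 255) and frames stored as nested lists; whether the paper measures luma only or averages per frame is not stated here, so treat those details as open. The helper `mse_ratio_for_gain` is a hypothetical name that converts a PSNR gain in dB into the corresponding reduction factor in mean squared error.

```python
import math
from typing import List

Frame = List[List[float]]


def mean_squared_error(reference: Frame, test: Frame) -> float:
    """Average squared pixel difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_ref, row_test in zip(reference, test):
        for r, t in zip(row_ref, row_test):
            total += (r - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count


def psnr(reference: Frame, test: Frame, peak: float = 255.0) -> float:
    """PSNR in dB; returns infinity when the frames are identical."""
    mse = mean_squared_error(reference, test)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks for a PSNR gain of gain_db."""
    return 10.0 ** (-gain_db / 10.0)


original = [[0.0, 0.0], [0.0, 0.0]]
decoded = [[0.0, 0.0], [0.0, 16.0]]
quality_db = psnr(original, decoded)
ratio = mse_ratio_for_gain(0.42)
```

Under this definition, a 0.42 dB gain corresponds to a mean squared error that is roughly 9% lower.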
Conclusion and future directions:
In this article, we presented a new approach, PMVE, to leverage the joint spatiotemporal correlation across frames for the enhancement of compressed videos. For any LF to be enhanced, we proposed a bi-prediction based scheme, where VFs are first created from the respective neighboring frame pairs through the Pred-net. The tradeoff between frame quality and frame distance was considered and the frame pairs identified accordingly.
Conventional pixelwise motion estimation and compensation processes are thus avoided and a large complexity reduction achieved. Afterward, the VFs are fed into the FF-net for frame fusion, in conjunction with the original LFs to finally reconstruct the enhanced version.
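The summary above does not spell out how the tradeoff between frame quality and frame distance is resolved, so the following is only one possible reading, not the paper's rule. It assumes a per-frame quality score is available (for example a PSNR estimate), considers symmetric pairs around the LF, and subtracts a linear distance penalty. The function `select_frame_pairs`, the scoring formula, and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_frame_pairs(
    quality: Sequence[float],
    index: int,
    max_distance: int,
    num_pairs: int,
    distance_penalty: float,
) -> List[Tuple[int, int]]:
    """Rank symmetric (before, after) pairs around `index`.

    ASSUMPTION: score = mean quality of the pair - distance_penalty * distance.
    This is an illustrative heuristic, not the selection rule from the paper.
    """
    scored = []
    for distance in range(1, max_distance + 1):
        before, after = index - distance, index + distance
        if before < 0 or after >= len(quality):
            break  # larger distances would fall outside the sequence too
        mean_quality = (quality[before] + quality[after]) / 2.0
        scored.append((mean_quality - distance_penalty * distance, before, after))
    # Highest score first; the sort is stable, so ties keep the nearer pair.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(before, after) for _, before, after in scored[:num_pairs]]


# Toy quality scores: frames 0 and 6 are high quality, frame 3 is the LF.
frame_quality = [38.0, 34.0, 35.0, 33.0, 35.0, 34.0, 38.0]
chosen_pairs = select_frame_pairs(
    frame_quality, index=3, max_distance=3, num_pairs=2, distance_penalty=0.5
)
```

In this toy sequence the distant high-quality pair outranks the adjacent pair, and raising `distance_penalty` would shift the preference back toward nearer frames.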
The experimental results confirm the effectiveness of our PMVE design, as it obtains consistently superior results in PSNR and visual quality over other approaches. Currently, we mainly apply PMVE at the decoder in the postprocessing stage, but we are also exploring its use at the encoder side. For example, high-quality reference frames may be produced through PMVE and then involved in motion estimation for further coding efficiency improvement. We plan to further investigate computational complexity reduction to improve overall performance.