Cascaded Temporal and Spatial Attention Network for solar adaptive optics image restoration

Context. Atmospheric turbulence severely degrades the quality of images observed through a ground-based telescope. An adaptive optics (AO) system only partially improves the image quality because it corrects the wavefront only to a certain degree, making post-facto image processing necessary. Several deep learning-based methods have recently been applied to solar AO image post-processing. However, further research is still needed to obtain better images while enhancing model robustness and making better use of inter-frame and intra-frame information. Aims. We propose an end-to-end network that can better handle anisoplanatism in solar AO images by leveraging attention mechanisms, pixel-wise filters, and a cascaded architecture. Methods. We developed a cascaded attention-based deep neural network named Cascaded Temporal and Spatial Attention Network (CTSAN) for solar AO image restoration. CTSAN consists of four modules: the optical flow estimation network PWC-Net for explicit inter-frame alignment, temporal and spatial attention for dynamic feature fusion, a temporal sharpness prior for sharp feature extraction, and an encoder-decoder architecture for feature reconstruction. We also used a hard example mining strategy to build a loss function that focuses on the regions that are difficult to restore, and a cascaded architecture to further improve model stability. Results. CTSAN and two other state-of-the-art (SOTA) supervised learning methods for solar AO image restoration are trained on real 705 nm photospheric and 656 nm chromospheric AO images supervised by the corresponding Speckle images. All the methods are then quantitatively and qualitatively evaluated on five real testing sets. Compared to the two other SOTA methods, CTSAN restores clearer solar images and shows better stability and generalization performance when restoring the lowest contrast AO image.

A&A 674, A126 (2023) and its wavefront information simultaneously (Carbillet et al. 2006; Blanc et al. 2003). One of the successful PD algorithms is MOMFBD (van Noort et al. 2006). The main shortcomings of PD are the extra hardware it requires and its large computational cost (Asensio Ramos & Olspert 2021). Deconvolution methods can be divided into deconvolution with wavefront sensing (DWFS, Restaino 1992; Rao & Jiang 2002) and blind deconvolution (BD, Tian et al. 2008; Asensio Ramos et al. 2018). DWFS algorithms combine wavefront information detected by wavefront sensors with simultaneously recorded short-exposure images (Welsh & Roggemann 1995). However, DWFS does not run in real time and is highly susceptible to noise. BD algorithms use only the degraded images to estimate the target and the point spread function representing the wavefront disturbance, but solving BD directly is seriously ill-posed. BD algorithms therefore focus on finding proper constraints to obtain better solutions and convergence behavior (Wang et al. 2023).
With the improvement of computing power and innovations in algorithm theory, deep learning has made significant advances in various fields, including computer vision (Voulodimos et al. 2018), signal processing, natural language processing (Vaswani et al. 2017), network science, life science (Zhang et al. 2021), and clinical detection (Jiang et al. 2022). In high-quality image and video reconstruction, deep learning has also been successfully used for defect correction (Ratnasingam 2019), video inpainting (Zou et al. 2021), deblurring, and denoising (Tian et al. 2020). Wang et al. (2019) employed a deformable convolution module with a pyramid architecture for frame alignment and a temporal and spatial attention fusion module followed by a reconstruction module to restore high-resolution videos (Zhang et al. 2022). Pan et al. (2020) constructed a deep network, named cascaded deep video deblurring using a temporal sharpness prior, for deblurring challenging real videos; it first aligns adjacent frames, then computes the sharpness prior, and finally feeds both the blurry images and the prior into an encoder-decoder network.
Deep learning-based methods are also used in solar image restoration and AO image post-processing (Guo et al. 2022; Jia et al. 2019). Baso et al. (2019) designed a neural network with an autoencoder architecture to denoise solar images corrupted by complicated noise such as instrumental artifacts and nonlinear post-processing. Fei et al. (2017) constructed an end-to-end network to learn the mapping between blurred and sharp AO retinal images, which was subsequently applied to restore real solar AO images. Asensio Ramos et al. (2018) proposed two different deep convolutional neural networks for multi-frame blind deconvolution of solar images. The first is an encoder-decoder deconvolution neural network (EDDNN) for restoring images from a fixed number of frames (Asensio Ramos et al. 2018). The second is a recurrent network with an encoder-decoder network that uses an arbitrary number of frames, which is more challenging to train than EDDNN because of its recurrent connections (Asensio Ramos et al. 2018). Asensio Ramos & Olspert (2021) used an unsupervised neural network to restore the solar image and estimate the wavefronts at the same time. Based on adaptive filters, a blind restoration network (CSSTN) was proposed for solar AO image post-processing; it constructs two sophisticated spatially variant adaptive filters for deblurring and aligning. CSSTN outperforms other deep learning-based solar image restoration methods on four real solar AO closed-loop datasets.
Although these deep learning models perform well on their own datasets, model robustness should be further considered in real solar AO imaging systems with unpredictable imaging conditions and complex observation scenes. Under these circumstances, the properties of images degraded by atmospheric turbulence should also be considered. Further research is still needed to enhance model robustness and to make better use of inter-frame and intra-frame information.
Aligning multiple frames using inter-frame information is essential for restoring frames degraded by large motions, atmospheric disturbances, and large FoV blurs (Wang et al. 2019). Existing alignment methods can be divided into explicit and implicit branches (Chan et al. 2021). Implicit alignment has been applied in solar AO restoration, but its effectiveness still needs to be verified. Some researchers have validated the better robustness of explicit alignment models in various denoising and deblurring tasks (Otte & Nagel 1994; Ehret et al. 2019; Xiang et al. 2020). However, adjacent frames are usually hard to align fully and precisely, and they are not equally informative in the case of large blurring (Wang et al. 2019). The attention mechanism emphasizes spatially and temporally important features to prevent the adverse effects of misalignment and unalignment arising from the preceding alignment stage; it has achieved great success and has been validated in various visual tasks, including image reconstruction and video deblurring (Zamir et al. 2021).
We propose the Cascaded Temporal and Spatial Attention Network (CTSAN) based on the explicit alignment and attention mechanism. CTSAN is constructed with pixel-wise filters and attention mechanisms to better handle anisoplanatism in solar AO images (Wang et al. 2019;Pan et al. 2020). Our main contributions are: (1) an explicit alignment module to align adjacent frames at pixel-level with pre-training and fine-tuning strategy to increase training efficiency and model accuracy; (2) a temporal and spatial attention (TSA) module to construct pixel-wise filters and emphasize features temporally and spatially; (3) two modules to address uneven blurring; the temporal sharpness prior (TSP) module is introduced to increase the weights of clearer patches, while the hard example mining strategy (HEM) module lets the loss function concentrate on severely blurred regions to lead the model to focus on learning to produce better results; and (4) cascaded two-stage architectures to improve performance further and ensure model robustness with the same number of training parameters as one-stage architecture.
This article is divided into five sections. Section 1 has introduced the background, research status, and motivation. Section 2 details the principle and architecture of our proposed network. Section 3 describes the parameters of the five real datasets, the network training details, and the hyperparameter settings. Section 4 shows the qualitative and quantitative evaluation results of all networks trained with the two-band and single-band training sets, together with the ablation experiments for our model. Section 5 presents the discussion and our conclusions.

Network structure
Our proposed CTSAN is built from multiple Temporal and Spatial Attention Network (TSAN) units, each consisting of four parts: PWC-Net, TSA, TSP, and the feature reconstruction module, as shown in Fig. 1a. Each TSAN unit takes three consecutive frames as input and outputs one restored image, as shown in Fig. 1b. CTSAN is built in a cascaded two-stage manner and takes five consecutive frames as input to restore one clear image, as shown in Fig. 1c. The architecture (including kernel size and depth) and the input and output shapes of the reconstruction network are shown in Table 1. The properties of the superblocks, numbered $C_{i,j}$ in Fig. 1a, are also listed in Table 1, where $i$ and $j$ refer to the superblock and to the block inside the superblock, respectively (Asensio Ramos et al. 2018).

Table columns: Layer, Type, Kernel size (a), Stride, Input shape (b), Output shape (b). Notes. (a) Kernel spatial size and depth. (b) Image spatial size and the number of channels.

Dense optical flow provides motion information by estimating the displacement field between two consecutive frames, which helps to generate higher resolution results (Fortun et al. 2015; Pan et al. 2020; Zou et al. 2021). PWC-Net is a compact and effective CNN model for dense optical flow estimation. Unlike traditional optical flow estimation approaches, PWC-Net employs a multi-scale feature pyramid, warping, and cost volumes, and performs better on real images with large motions and image edges corrupted by motion blur, atmospheric change, and noise (Sun et al. 2018; Gu et al. 2019). We use PWC-Net in our proposed network. Given three consecutive frames $I_{i-1}$, $I_i$, and $I_{i+1}$, the optical flow used for alignment to frame $I_i$ is computed with PWC-Net as follows:

$$\mathbf{u}_{i+1\to i} = \mathcal{N}_f(I_{i+1}, I_i), \qquad \mathbf{u}_{i-1\to i} = \mathcal{N}_f(I_{i-1}, I_i), \tag{1}$$

where $\mathcal{N}_f$ denotes PWC-Net; $\mathbf{u}_{i+1\to i} = (u_{i+1\to i}, v_{i+1\to i})$ and $\mathbf{u}_{i-1\to i} = (u_{i-1\to i}, v_{i-1\to i})$ denote the backward flow and forward flow at frames $I_{i+1}$ and $I_{i-1}$, respectively; and $u$, $v$ are the velocity components of the optical flow along the X-axis and Y-axis, respectively (Hyun Kim & Mu Lee 2015; Meister et al. 2018; Maurer & Bruhn 2018). According to the obtained optical flows $\mathbf{u}_{i+1\to i}$ and $\mathbf{u}_{i-1\to i}$, we employ a warping operation to align the corresponding adjacent frames to the reference frame $I_i$ (Lai et al. 2019):

$$I_{i+1}(\mathbf{x}) \mapsto I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i}), \qquad I_{i-1}(\mathbf{x}) \mapsto I_{i-1}(\mathbf{x} + \mathbf{u}_{i-1\to i}), \tag{2}$$

where $\mathbf{x}$ denotes every pixel location. We note that we estimate the optical flow from frames $I_{i+1}$ and $I_{i-1}$ to frame $I_i$, instead of from frame $I_i$ to frames $I_{i-1}$ and $I_{i+1}$, because warping with the optical flows $\mathbf{u}_{i\to i+1}$ and $\mathbf{u}_{i\to i-1}$ leads to blurred images.
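The warping step can be illustrated with a minimal sketch. The pure-Python function below samples an adjacent frame at x + u with bilinear interpolation and border clamping. The function name `warp_to_reference` and the list-of-lists image layout are illustrative assumptions, not the CTSAN implementation; in the network the flow is predicted by PWC-Net, whereas here it is supplied by hand.

```python
import math

def warp_to_reference(frame, flow_u, flow_v):
    """Backward-warp `frame` so that out[y][x] = frame sampled at (x + u, y + v)."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Sampling position in the adjacent frame, clamped to the image border.
            sx = min(max(x + flow_u[y][x], 0.0), w - 1.0)
            sy = min(max(y + flow_v[y][x], 0.0), h - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            ax, ay = sx - x0, sy - y0
            # Bilinear interpolation between the four neighbouring pixels.
            top = (1 - ax) * frame[y0][x0] + ax * frame[y0][x1]
            bottom = (1 - ax) * frame[y1][x0] + ax * frame[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bottom
    return out

# Toy example: the adjacent frame is the reference shifted one pixel to the right,
# so a uniform flow of u = +1 maps it back onto the reference grid.
reference = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
adjacent = [[row[0]] + row[:-1] for row in reference]
u = [[1.0] * 4 for _ in range(3)]
v = [[0.0] * 4 for _ in range(3)]
aligned = warp_to_reference(adjacent, u, v)
```

In this toy case the aligned frame matches the reference everywhere except in the last column, where the sampling position falls outside the image and is clamped to the border.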
The network architecture of the TSA fusion module is shown in Fig. 2. Temporal attention in TSA is designed to give more attention to the adjacent frame that is better aligned to $I_i$. We first compute the similarity distance of the aligned frames $I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i})$ and $I_{i-1}(\mathbf{x} + \mathbf{u}_{i-1\to i})$ to the reference frame $I_i$. The similarity distance (the matrix of the temporal attention map) of the aligned frame $I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i})$ is

$$h_{i+1} = \mathrm{Sigmoid}\left(\theta\left(I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i})\right) \otimes \phi\left(I_i(\mathbf{x})\right)\right), \tag{3}$$

where $\theta(I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i}))$ and $\phi(I_i(\mathbf{x}))$ are two embeddings computed by convolution filters, and $\otimes$ is the dot product, as shown in Fig. 2 (left). The Sigmoid function in Eq. (3) limits the output value of $h$ to [0, 1], which can stabilize gradient back-propagation (Wang et al. 2019); $h_{i-1}$ is computed in a similar way. The temporal attention map is then multiplied with the original aligned frame:

$$\tilde{I}_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i}) = I_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i}) \odot h_{i+1}, \tag{4}$$

where $\odot$ is pixel-wise multiplication. After computing $\tilde{I}_{i-1}(\mathbf{x} + \mathbf{u}_{i-1\to i})$ in the same way, the fusion convolution layer is used to fuse these features:

$$f_{i,\mathrm{conv}} = \mathrm{Conv}\left(\left[\tilde{I}_{i+1}(\mathbf{x} + \mathbf{u}_{i+1\to i}),\ \tilde{I}_{i-1}(\mathbf{x} + \mathbf{u}_{i-1\to i})\right]\right), \tag{5}$$

where $[\,,]$ denotes the concatenation operation and Conv represents the convolution filter corresponding to the "Fusion Conv" block in Fig. 2. Then $f_{i,\mathrm{conv}}$ is fed into the spatial attention part, which adaptively modulates spatial pixels by means of a spatial affine transformation (Wang et al. 2019). Assuming $f_{i,\mathrm{TSA}}$ is the final output of the TSA module, the spatial attention operation is expressed as follows:

$$f_{i,\mathrm{TSA}} = f_{i,\mathrm{conv}} \odot \gamma_i \oplus \beta_i, \tag{6}$$

where $\oplus$ is element-wise addition. The masks $\gamma_i$ and $\beta_i$, which have the same dimension as $f_{i,\mathrm{conv}}$ and respectively correspond to the input of $L_{5,5}$ and $L_{5,6}$ in Table 2, are generated by the spatial attention module with a three-layer feature pyramid architecture that increases the attention receptive field (Wang et al. 2018, 2019). The layers inside the spatial attention operation in TSA, denoted $L_{i,j}$, are shown in Table 2, where $i$ refers to the layer number of the pyramid architecture, corresponding to the number in the spatial attention part of Fig. 2, and $j$ refers to the inner operation used to obtain the corresponding $i$th layer. The yellow blocks in Fig. 2 are the output features of each layer.

We use TSP to reconstruct the clear latent image by adding sharpness weights to the feature map (Cho et al. 2012). The sharpness in TSP is formulated as

$$S_i(\mathbf{x}) = \exp\left(-\frac{1}{2} \sum_{j} \sum_{\mathbf{y} \in w(\mathbf{x})} \left\| I_{i+j}(\mathbf{y} + \mathbf{u}_{i+j\to i}) - I_i(\mathbf{y}) \right\|^2\right), \tag{7}$$

where the parameter $j$ is set to 1 and −1, and $w(\mathbf{x})$ is an image patch centered at pixel $\mathbf{x}$.
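The temporal attention weight and the sharpness prior can be sketched in a few lines. This is a simplified, pure-Python illustration under stated assumptions: the embeddings are passed in as ready-made channel vectors (in the network they come from learned convolution filters), and the sharpness prior is assumed to take the exponential patch-difference form of Pan et al. (2020). The function names and the `half` patch half-width parameter are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def temporal_attention(emb_adjacent, emb_reference):
    """Per-pixel attention h = Sigmoid(theta . phi).

    Each argument is a list over pixels of channel vectors (the embeddings)."""
    return [sigmoid(sum(a * b for a, b in zip(pa, pr)))
            for pa, pr in zip(emb_adjacent, emb_reference)]

def sharpness_prior(aligned_frames, reference, x, y, half):
    """Assumed form: S(x) = exp(-0.5 * sum_j sum_{p in w(x)} (I_j(p) - I_ref(p))^2)."""
    h, w = len(reference), len(reference[0])
    total = 0.0
    for frame in aligned_frames:  # the aligned frames for j = +1 and j = -1
        for py in range(max(0, y - half), min(h, y + half + 1)):
            for px in range(max(0, x - half), min(w, x + half + 1)):
                total += (frame[py][px] - reference[py][px]) ** 2
    return math.exp(-0.5 * total)

# Orthogonal embeddings give a neutral weight of 0.5; similar ones approach 1.
weights = temporal_attention([[0.0, 0.0], [2.0, 2.0]], [[1.0, 2.0], [2.0, 2.0]])

# A pixel whose neighbourhood agrees across aligned frames is treated as sharp.
ref = [[0.5] * 3 for _ in range(3)]
s_same = sharpness_prior([ref, ref], ref, 1, 1, 1)
```

Pixels whose aligned neighbours agree with the reference receive a prior close to 1, while pixels that change strongly between frames are down-weighted.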
The outputs of the TSP and TSA modules are finally concatenated and fed into the feature reconstruction network:

$$I_{i,r} = N_R\left(\left[f_{i,\mathrm{TSP}},\ f_{i,\mathrm{TSA}}\right]\right), \tag{8}$$

where $f_{i,\mathrm{TSP}}$ denotes the output of the TSP module; $N_R(\cdot)$ represents the feature reconstruction network, a standard encoder-decoder architecture with skip connections (Tao et al. 2018); and $I_{i,r}$ is the final restored image of each TSAN unit, which corresponds to $C_{6,4}$ in Table 1.

Loss function
The common optimization goal of a restoration network is to minimize the $L_1$ norm between the restored latent image and the reference supervised image. However, using the $L_1$ norm alone leads to blurred edges and boundaries, producing artifacts in the subsequent latent image restoration (Xu et al. 2019). We deal with this problem by adding a HEM module to the loss function, which pays attention to severely blurred regions and thereby leads the model to focus on learning to produce better results (Xu et al. 2019). Accordingly, the TSAN unit loss of frame $I_i$ can be expressed as

$$L_{i,T} = \left\| I_{i,r} - I_{i,c} \right\|_1 + \lambda \left\| M_i \odot \left(I_{i,r} - I_{i,c}\right) \right\|_1, \tag{9}$$

where $T$ represents the TSAN unit and $I_{i,c}$ is the $i$th clear frame reconstructed by the Speckle method. $M_i$ is the HEM mask, computed as follows:

$$M_i = M_{i,h} \cup M_{i,r}. \tag{10}$$

Here $M_{i,h}$ is the hard sample mask, which marks the top $a\%$ ($a \in [0, 100]$) of the pixels of the current restored image when they are sorted by $L_1$ loss value in descending order in the training iteration. The mask $M_{i,r}$, which is used to enhance model robustness, marks a randomly selected $b\%$ ($b \in [0, 100]$) of the pixels of the current restored image in the training iteration. The $\lambda$ in Eq. (9) is the weight parameter of the HEM module.

CTSAN is constructed using a cascaded multi-stage approach, which can improve model stability when the input image has complicated blurring (Zhang et al. 2022). To balance accuracy and speed, we empirically set the number of stages to 2, as shown in Fig. 1c. During each iteration of CTSAN training, the overall loss is calculated from the forward propagation results of the four TSAN units shown in Fig. 1c, and is formulated as follows:

$$L_i = L_{i-1,T_1} + L_{i,T_1} + L_{i+1,T_1} + L_{i,T_2}. \tag{11}$$

The symbols $T_1$ and $T_2$ represent the TSAN units used in the first and second stages. In the first stage of CTSAN, $L_{i-1,T_1}$, $L_{i,T_1}$, and $L_{i+1,T_1}$ are calculated from $I_{i-1,r}$, $I_{i,r}$, and $I_{i+1,r}$, which are restored from the raw AO images, as shown in Col. 1 of Fig. 1c.
In the second stage of CTSAN, $L_{i,T_2}$ is calculated from $I_{i,r}$, which is restored from the three outputs of the first stage, as shown in Col. 2 of Fig. 1c. Hence, the overall loss $L_i$ ensures that CTSAN is robust to different types of inputs. In the backward propagation of each CTSAN iteration, a single TSAN parameter set, belonging to one TSAN unit, is updated for faster convergence and lower computational cost. In the forward propagation of the next iteration, this updated parameter set is used four times to construct the cascaded architecture, so that the first and second stages of CTSAN share the same TSAN parameters, as shown in Fig. 1c.
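A minimal sketch of the HEM-weighted loss and the cascaded sum is given below. It assumes that the hard and random masks are combined as a union, that both loss terms are averaged over pixels, and that the four unit losses are summed with equal weight; images are flattened to plain lists, and all function names are hypothetical rather than taken from the CTSAN source code.

```python
import random

def hem_mask(abs_err, a_percent, b_percent, rng):
    """Union of the top-a% hardest pixels and a random b% of all pixels."""
    n = len(abs_err)
    n_hard = int(n * a_percent / 100)
    n_rand = int(n * b_percent / 100)
    # Pixel indices sorted by absolute error, largest first.
    order = sorted(range(n), key=lambda k: abs_err[k], reverse=True)
    mask = [0] * n
    for k in order[:n_hard]:
        mask[k] = 1
    for k in rng.sample(range(n), n_rand):
        mask[k] = 1
    return mask

def tsan_loss(restored, target, lam=2.0, a_percent=50, b_percent=10, rng=None):
    """L1 term plus lam times the L1 term restricted to the HEM mask."""
    rng = rng or random.Random(0)
    abs_err = [abs(r - t) for r, t in zip(restored, target)]
    mask = hem_mask(abs_err, a_percent, b_percent, rng)
    l1 = sum(abs_err) / len(abs_err)
    hem = sum(m * e for m, e in zip(mask, abs_err)) / len(abs_err)
    return l1 + lam * hem

def cascaded_loss(stage1_losses, stage2_loss):
    """Overall loss: three first-stage unit losses plus one second-stage loss."""
    return sum(stage1_losses) + stage2_loss

# Toy example with four pixels and no random pixels (b = 0) for determinism.
loss = tsan_loss([0.0, 0.2, 0.5, 1.0], [0.0] * 4, lam=2.0, a_percent=50, b_percent=0)
```

The defaults `lam=2.0`, `a_percent=50`, and `b_percent=10` mirror the hyperparameter values reported in the training section.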

Dataset
Five real short-exposure solar image datasets with AO closed-loop corrections are employed in our experiment, including three datasets of the photospheric 705 nm band and two datasets of the chromospheric 656 nm band. These solar images were observed by the New Vacuum Solar Telescope (NVST), a one-meter vacuum solar telescope located near Fuxian Lake, with a ground layer adaptive optics (GLAO) system or a classical adaptive optics (CAO) system (Rao et al. 2015, 2016b; Kong et al. 2016, 2017). The dataset parameters are listed in Table 3, and the first three datasets are the same as the first three datasets used in CSSTN. The Speckle method uses 100 images to restore one Speckle image (Zhong et al. 2014). We define an AO closed-loop source image and its corresponding Speckle image as a pair. The first photosphere dataset is divided into 5200, 500, and 1000 pairs as the first training, validation, and testing sets. The first chromosphere dataset is divided into 70, 10, and 20 pairs as the second training, validation, and testing sets.
To balance the two training sets, we duplicate the second training set 80 times and obtain 5600 pairs of images for the second training set, as shown in Table 3. The third, fourth, and fifth datasets are only used as testing sets.
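The balancing arithmetic can be checked with a short sketch. The variable names are illustrative, and the combined dual-band share is computed here only to show the effect of the duplication; it is not a figure quoted from the paper.

```python
# Pair counts taken from the dataset description above.
photo_train = 5200      # first (photospheric) training set
chromo_train = 70       # second (chromospheric) training set before duplication
repeats = 80            # duplication factor

chromo_balanced = chromo_train * repeats
chromo_share = chromo_balanced / (photo_train + chromo_balanced)

print(chromo_balanced)          # number of chromospheric pairs after duplication
print(round(chromo_share, 3))   # chromospheric fraction of the dual-band training data
```

After duplication the two bands contribute almost equally to the dual-band training data, although the chromospheric pairs are repeats of the same 70 originals.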

Network training
Some studies, such as Asensio Ramos et al. (2018), have verified that Recurrent-DNN has nearly the same performance as EDDNN, which has an architecture similar to our encoder-decoder backbone structure. CSSTN is the latest solar AO image blind deconvolution method based on two adaptive filters, outperforming several previous supervised learning methods and TV-based blind deconvolution methods. Therefore, EDDNN and CSSTN are selected as comparison methods in this research. For all methods, a single-band trained model is used to restore all TiO (705 nm) testing sets in order to restore better photospheric images, and a dual-band trained model is used to restore all chromospheric testing sets in order to prevent overfitting on our small chromospheric training sets. CSSTN is trained on a computer with an Intel i7-8700K CPU with 16 GB memory and an NVIDIA 12 GB Titan Xp GPU, while EDDNN and CTSAN are trained on a server with two Intel Xeon 4210R CPUs, 128 GB RAM, and an NVIDIA 24 GB RTX3090 GPU running Ubuntu 20.04. Data augmentation is employed in data preprocessing: random 90-degree rotations and horizontal and vertical flips are used. Then, randomly cropped patches of 88 × 88 (Asensio Ramos et al. 2018), 256 × 256, and 256 × 256 pixels are used to train EDDNN, CSSTN, and CTSAN, respectively. CTSAN is implemented with CUDA 11.7 and PyTorch 1.13.0. The PWC-Net in CTSAN is initialized with parameters pre-trained on the KITTI and Sintel benchmark datasets and then fine-tuned on our training sets with a learning rate of $1 \times 10^{-6}$ (Sun et al. 2018; Menze & Geiger 2015; Butler et al. 2012). KITTI contains static and dynamic scenes, large motion, severe illumination changes, and occlusions, while Sintel contains various strong atmospheric effects, defocus blur, and camera noise, among others (Sun et al. 2018; Menze & Geiger 2015; Butler et al. 2012).
This pre-training strategy can efficiently accelerate training convergence and improve the quality of the restored images (Lim et al. 2017; Kaufman & Fattal 2020).
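The augmentation described above can be sketched as follows. This simplified illustration operates on a single AO/Speckle pair stored as lists of lists, whereas the actual training uses multi-frame inputs; the function names, the flip probability of 0.5, and the choice of a uniformly random number of quarter turns are assumptions.

```python
import random

def rot90(img):
    """Rotate a 2D list-of-lists image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def augment(ao, speckle, patch, rng):
    """Apply the same random rotation, flips, and crop to an AO/Speckle pair."""
    for _ in range(rng.randrange(4)):          # 0 to 3 quarter turns
        ao, speckle = rot90(ao), rot90(speckle)
    if rng.random() < 0.5:                     # horizontal flip
        ao = [row[::-1] for row in ao]
        speckle = [row[::-1] for row in speckle]
    if rng.random() < 0.5:                     # vertical flip
        ao, speckle = ao[::-1], speckle[::-1]
    h, w = len(ao), len(ao[0])
    top = rng.randrange(h - patch + 1)         # random crop origin
    left = rng.randrange(w - patch + 1)

    def crop(im):
        return [row[left:left + patch] for row in im[top:top + patch]]

    return crop(ao), crop(speckle)

# Toy example: an 8 x 8 image paired with itself, cropped to 4 x 4.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
ao_patch, speckle_patch = augment(image, image, 4, random.Random(1))
```

Applying identical geometric transforms to the input and its Speckle reference keeps the pair pixel-aligned, which the supervised loss depends on.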
All trainable parameters of CTSAN, including those of the PWC-Net, TSA, and feature reconstruction modules, are jointly trained in an end-to-end manner. The learning rate is initialized to $1 \times 10^{-4}$ and halved every 80 epochs. The parameters $a$ and $b$ in Eq. (10) are set to 50 and 10, respectively. The patch size of $w(\mathbf{x})$ in Eq. (7) is set to 20 in our experiment. The weight hyperparameter $\lambda$ in the loss function is empirically set to 2. During training, the batch size is set to 8, and the Adam optimizer is adopted with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. At the end of each epoch, the latest model parameters are tested on the validation set, and the best model is preserved. EDDNN and CSSTN converged and stopped training after 200 and 300 epochs, respectively. Our network converged and stopped training after 200 epochs, which took about 100 h on a single RTX3090 GPU in the server. The source code is publicly available on GitHub.
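The step decay of the learning rate can be written as a one-line function. The sketch below is a framework-independent illustration with a hypothetical function name; in PyTorch the same schedule would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)`, although the paper does not state which scheduler class was used.

```python
def learning_rate(epoch, base_lr=1e-4, step=80, factor=0.5):
    """Step decay: start at base_lr and multiply by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Learning rate at the start of each decay interval over the 200 training epochs.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 80, 160)}
```

Over 200 epochs the rate is halved twice, so training ends at a quarter of the initial learning rate.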

Results
Four evaluation metrics are used in our experiment: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), the azimuthally averaged power spectrum curve, and the granulation subregion contrast (CON) together with its relative improvement. The contrast of a granulation subregion $I$ is defined as

$$\mathrm{CON} = \frac{\mathrm{std}(I)}{\mu(I)},$$

where std and $\mu$ denote the standard deviation and the average value, respectively. The relative improvement of contrast is defined as

$$\frac{c_r - c_b}{c_b} \times 100\%,$$

where $c_r$ and $c_b$ represent the granulation subregion contrast of the restored and source blurred images, respectively (Zhong et al. 2014).
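The scalar metrics are simple to compute, as the following sketch shows. It assumes the population standard deviation for the contrast, a percentage form for the relative improvement, and images normalized to a peak value of 1 for PSNR; the function names are illustrative, and SSIM and the power spectrum are omitted because they need more machinery.

```python
import math
from statistics import mean, pstdev

def contrast(region):
    """Granulation contrast: standard deviation divided by mean intensity."""
    flat = [value for row in region for value in row]
    return pstdev(flat) / mean(flat)

def relative_improvement(c_restored, c_blurred):
    """Relative contrast improvement of the restored image, in percent."""
    return (c_restored - c_blurred) / c_blurred * 100.0

def psnr(restored, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = mean([(r - t) ** 2 for r, t in zip(restored, reference)])
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy values: a 2 x 2 region and two hypothetical contrast measurements.
c = contrast([[1, 3], [1, 3]])
gain = relative_improvement(0.06, 0.04)
quality = psnr([0.5, 0.5], [0.4, 0.6])
```

In the paper the contrast is measured on 32 × 32 pixel granulation subregions, so `region` would be such a crop of the restored or source frame.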

Comparative results
Extensive evaluation experiments are carried out on five testing sets, which are different from the sets used for training and validation, to evaluate the performance of all models. The average PSNR of our network is higher than that of EDDNN and CSSTN on all five testing sets, as shown in Table 4. On the whole first testing set, the average PSNR of the images restored by CTSAN is 0.28 dB and 2.88 dB higher than that of CSSTN and EDDNN, respectively, as shown in Table 4. One of the AO closed-loop source images in the first testing set is shown in Fig. 3a, and the corresponding images restored by EDDNN, CSSTN, and CTSAN are shown in Figs. 3b-d. It can be seen that all deep learning-based methods restore more detailed structure than the AO closed-loop image. Figure 3f shows the power spectrum curves of Figs. 3a-e. The horizontal coordinates of the power spectrum are normalized to the theoretical spatial cutoff point (Zhong et al. 2020). Among the three deep learning methods, the power spectrum curve of CTSAN is the closest to that of the reference Speckle image, especially at middle to high frequencies.
Four randomly selected patches (256 × 256 pixels) corresponding to the rectangular regions in Fig. 3a are used to show small-scale substructures. These typical solar subregions in Fig. 3a, including granulation, penumbra, and sunspot areas, are marked by the red, purple, and blue rectangles, respectively, and are shown in Fig. 4. It can be seen that EDDNN produces the blurriest structures among all the methods, as shown in Fig. 4b. Compared with CSSTN, CTSAN restores clearer textures and fewer noisy points in all regions, especially in granulation and penumbral filaments, as shown in the first, second, and fourth lines of Fig. 4. Among all methods, CTSAN is the closest to Speckle in both subjective appearance and objective measures on the first TiO testing set. However, the CTSAN results show fewer texture details than Speckle in sunspots, umbral dots, light bridges, elongated structures, and magnetic bright points (MBPs), as shown in Figs. 4d,e. The relative improvements of the granulation subregion contrast for all methods on the whole first testing set are shown in Table 5. The upper left 32 × 32 pixel subregions (Zhong et al. 2014) of the two granulation areas marked with red rectangles in Fig. 3a are selected to represent the granulation contrast and relative improvements. The contrast and relative improvement values of the images restored by all methods for the source AO closed-loop frames with the lowest and highest contrasts are also shown in Table 5. CSSTN yields the lowest improvement of granulation contrast among all methods for the AO closed-loop frames with the lowest granulation contrast, and EDDNN yields the lowest improvement for the AO closed-loop frames with the highest granulation contrast. In contrast, CTSAN performs best on the source AO closed-loop frames with both the lowest and the highest contrast.
The average relative contrast improvements of CTSAN are 42.26% and 28.57% higher than those of EDDNN and CSSTN, respectively, indicating that CTSAN has the best performance among the three deep learning methods in improving the granulation contrast over the whole first testing set.
In the second testing set, the average PSNR of the images restored by CTSAN is 12.88 dB and 4.99 dB higher than that of EDDNN and CSSTN, respectively, as shown in Table 4. The results restored by all learning methods from the same AO closed-loop source image in the second testing set (Fig. 5a) are shown in Figs. 5b-d. The power spectrum curves of the second testing set in Fig. 5f show that CTSAN's curve is closer to that of Speckle than those of the other methods at all frequencies, including low, middle, and high frequencies.
To quantitatively and qualitatively compare the reconstruction precision and generalization capability of the models, we test all methods on the third, fourth, and fifth testing sets, which none of the networks saw during training. As shown in Table 4, the average PSNR of CTSAN over the whole third testing set is 6.58 dB and 1.89 dB higher than that of EDDNN and CSSTN, respectively. One source AO closed-loop image and the corresponding results restored by all methods are shown in Fig. 6. The EDDNN result in Fig. 6b shows prominent blurring and noise, indicating that EDDNN has poor generalization performance on the third testing set. CSSTN achieves better quality than EDDNN, but some noise is still apparent, as shown in Fig. 6c. The results of CSSTN and CTSAN are very similar, but CTSAN produces less blurring, more visible boundaries, and clearer stripes than CSSTN, as shown in Fig. 6d. CTSAN is subjectively closer in appearance to Speckle than EDDNN and CSSTN, although the edges restored by CTSAN are smoother than the sharp edges restored by Speckle. The corresponding power spectrum curves of the third testing set in Fig. 6f show that CTSAN is closer to Speckle than the other methods at low to middle frequencies and in part of the high-frequency range.
The fourth testing set is a quiet-Sun area. The average PSNR of our CTSAN is 6.14 dB and 4.82 dB higher than that of EDDNN and CSSTN, respectively, over the whole fourth testing set, as shown in Table 4. One of the AO closed-loop source images in the fourth testing set is shown in Fig. 7a, and the corresponding results restored by all methods are shown in Figs. 7b-d. All restored individual frames are improved in quality compared to the raw AO closed-loop images. The corresponding power spectrum curves of the fourth testing set in Fig. 7f show that CTSAN is closer to Speckle than EDDNN and CSSTN in the low- and high-frequency ranges of about [0, 0.15] and [0.4, 1].
Four randomly selected granulation patches of 256 × 256 pixels, marked by red rectangles in Fig. 7a, are chosen to compare the texture details of the different methods and are shown in Fig. 8. The EDDNN restored image is of lower quality, with fewer details, more significant outliers, and saturated pixels, as shown in Fig. 8b. The CSSTN restored result in Fig. 8c is similar in appearance to that of EDDNN but restores more details and granulation textures. The granulation areas restored by CTSAN have richer texture details and finer structures than those of EDDNN and CSSTN, without significant outliers or saturated pixels, as shown in Fig. 8d. Compared to Speckle in Fig. 8e, CTSAN's granulation edges have weaker signals and a more fractal appearance, as shown in Fig. 8d.

Fig. 5. One source image in the second testing set and its corresponding restored images by all methods. Panel a is an AO closed-loop image; panels b to d are the corresponding images restored by EDDNN, CSSTN, and CTSAN; panel e is the corresponding Speckle image (512 × 512 pixels); and panel f shows the corresponding power spectrum curves of (a)-(e).
The upper left corners (32 × 32 pixels) of the four red rectangles in Fig. 7a are selected to represent the contrast and relative improvements. The corresponding relative contrast improvements of all methods over the whole fourth testing set are shown in Table 6. EDDNN yields the lowest improvements for the source AO closed-loop frames with the lowest and highest contrasts, while CTSAN yields the highest contrast improvements for both. The average relative contrast improvements of CTSAN are 57.72% and 23.56% higher than those of EDDNN and CSSTN, respectively, indicating that CTSAN performs better than EDDNN and CSSTN in improving the granulation contrast over the whole fourth testing set, as shown in the last line of Table 6.
In the fifth testing set, the average PSNR of our CTSAN is 4.20 dB and 0.93 dB higher than that of EDDNN and CSSTN, respectively, as shown in Table 4. The relative contrast improvements for the source AO closed-loop frames with the lowest and highest contrast, and the average relative contrast improvements obtained by all methods over the whole fifth testing set, are shown in Table 7. The lowest and highest contrast frames and the regions used for the computation are selected in the same way as for the first testing set. CSSTN yields the lowest improvement of granulation contrast among all methods for the AO closed-loop frames with the lowest granulation contrast, and EDDNN yields the lowest improvement for the AO closed-loop frames with the highest granulation contrast, as shown in Table 7. CTSAN achieves higher contrast improvements than EDDNN and CSSTN for the AO closed-loop source images with the lowest and highest contrasts in the whole fifth testing set. The average relative contrast improvement in Table 7 shows that CTSAN is 127.39% and 59.56% higher than EDDNN and CSSTN, respectively, on the fifth testing set, indicating that CTSAN achieves the best contrast improvement with a clear appearance in these areas.

Table 6. Lowest, highest, and average contrast and relative improvement of granulation region on the fourth testing set.
The AO closed-loop image with the lowest granulation contrast in the whole fifth testing set is shown in Fig. 9a, and the corresponding images restored by all methods are shown in Figs. 9b-d. Both EDDNN and CSSTN perform poorly in restoring this AO closed-loop frame with the lowest granulation contrast, especially for the blurred outlines and low-contrast features in the granulation area shown in Figs. 9b,c. CTSAN performs best on the frame with the lowest granulation contrast, as shown in Fig. 9d. The power spectrum curves in Fig. 9f also show that the curve of CTSAN is closer to the Speckle curve than those of EDDNN and CSSTN, especially at low to middle frequencies and in part of the high-frequency range.
Four randomly selected subregions (256 × 256 pixels) marked by red rectangles in Fig. 9a, including two granulation and two penumbra subregions, are shown in Fig. 10 to allow an easy comparison of the details and structural differences of the images restored by the different methods. Both EDDNN and CSSTN clearly restore blurred granulation areas, as shown in the first and second lines of Fig. 10. In contrast, CTSAN performs better than EDDNN and CSSTN in restoring the granulation subregions of the frames with the lowest granulation contrast. The structural textures and details in the four subregions restored by CTSAN are closer to Speckle than those of EDDNN and CSSTN, although apparent gaps to Speckle remain in all small-scale substructures, such as the MBPs and filamentary areas shown in Figs. 10d,e.

Ablation study
To better understand and verify the effectiveness of each module in CTSAN, including the cascaded two-stage architecture, PWC-Net, TSP, TSA, and HEM, we perform further analysis through a series of ablation studies, summarized in Table 8. We first train these ablation models on the single-band training set and evaluate them on the whole first testing set (Table 8). The effectiveness of PWC-Net is verified by comparing the third and fourth rows in Table 8, which show that the average PSNR over the whole first testing set of the model with PWC-Net is 0.13 dB higher than that of the ablation model without the PWC-Net module. The SSIM$_{\mathrm{single}}$ results follow a similar trend to the PSNR$_{\mathrm{single}}$ results, highlighting the effectiveness of these modules in improving network performance. We also show the results of the ablation models trained on the dual-band training dataset, which follow a similar trend to the single-band evaluation results in verifying the effectiveness of each module, as shown by the PSNR$_{\mathrm{dual}}$ and SSIM$_{\mathrm{dual}}$ results in Table 8. The PSNR iteration curves on the validation set for the dual-band PWC-Net and TSP ablation models and for CTSAN are shown in Fig. 11, which verifies their effectiveness from another perspective. It can be seen that the PSNR iteration curve of the ablation model without PWC-Net has a large variance and fluctuation, while that of the model with PWC-Net tends to be more stable and smooth, as shown in Figs. 11a,b. It is also evident that the PSNR iteration curve of the model with TSP converges to a higher PSNR value on the validation set than that of the ablation model without TSP, as shown in Figs. 11b,c.

Discussion and conclusion
This research proposed a cascaded temporal and spatial attention network named CTSAN for real solar AO closed-loop image post-processing. Four modules inside CTSAN, namely PWC-Net, TSA, TSP, and HEM, were used for inter-frame explicit alignment, feature fusion, sharp feature extraction, and hard sample mining, respectively. We compared the performance of CTSAN with EDDNN and CSSTN on our five real AO testing sets both subjectively and objectively. The results show that CTSAN has better precision and generalization abilities and is more robust in restoring the short-exposure AO frames with the lowest granulation subregion contrast. Finally, ablation studies were carried out on our first testing set to preliminarily verify the modules' effectiveness.
CTSAN and the other two supervised networks were evaluated on five real testing sets using PSNR, power spectrum, granulation contrast, and relative improvement as quantitative metrics. In terms of the power spectrum, CTSAN's curve is closer to Speckle's curve than those of EDDNN and CSSTN, especially in the mid-
Fig. 10. Four subregion images corresponding to the rectangular patches in Fig. 9a (256 × 256 pixels).
Table 8. Ablation studies using single-band and dual-band trained models to restore the whole first testing set to verify the effectiveness of the cascaded two-stage manner, PWC-Net, TSP, TSA, and HEM modules of our proposed network.
When restoring multi-frame solar AO sequences, taking advantage of information transferred from adjacent frames and previous stages is essential. The cascaded architecture transfers information differently from EDDNN and CSSTN. CTSAN depends only on the input of the current five consecutive frames, as shown in Fig. 1c, and does not rely on previously restored images. This design prevents side effects of earlier restorations from affecting current and later image reconstruction. Conversely, CSSTN uses the previously restored image to transfer information. We found that the results of CSSTN differ little in appearance from those of CTSAN on the normal source AO image, but differ noticeably on the lowest granulation contrast images, as shown in Fig. 9c. These phenomena indicate that CTSAN is more stable than CSSTN. As a simple and real-time network with an encoder-decoder architecture, EDDNN employs seven frames of the burst to restore one solar AO image. However, its performance falls behind that of CSSTN and CTSAN on all five real testing sets. This indicates that more than a single encoder-decoder architecture is needed for an end-to-end restoration network.
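Two of the quantitative metrics named in this section, granulation contrast and the power spectrum, can be sketched as follows. This is a minimal illustration under stated assumptions: contrast is taken as the common rms definition (standard deviation over mean intensity), and the power spectrum is azimuthally averaged over integer-radius rings. The exact definitions used for the reported numbers may differ, and the function names are hypothetical.

```python
import numpy as np


def rms_contrast(image: np.ndarray) -> float:
    """RMS contrast: standard deviation of the intensity divided by its mean."""
    image = np.asarray(image, dtype=np.float64)
    return float(image.std() / image.mean())


def radial_power_spectrum(image: np.ndarray) -> np.ndarray:
    """Azimuthally averaged 2-D power spectrum, indexed by integer radial frequency."""
    image = np.asarray(image, dtype=np.float64)
    # Remove the mean so the zero-frequency term does not dominate the curve.
    spectrum = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(spectrum) ** 2
    ny, nx = image.shape
    y, x = np.indices((ny, nx))
    # After fftshift, the zero frequency sits at (ny // 2, nx // 2).
    radius = np.hypot(y - ny // 2, x - nx // 2).astype(int)
    # Sum the power in each integer-radius ring, then divide by the ring size.
    ring_sum = np.bincount(radius.ravel(), weights=power.ravel())
    ring_count = np.bincount(radius.ravel())
    return ring_sum / np.maximum(ring_count, 1)
```

Plotting `radial_power_spectrum` for a restored frame and for the corresponding Speckle image on the same axes gives the kind of curve comparison discussed above, while `rms_contrast` applied to a granulation subregion gives a single contrast value per frame.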
Among all typical solar regions, CTSAN's advantage over EDDNN and CSSTN is most obvious in restoring granulation. PWC-Net may help correct low-order biases, leading to better results in the granulation area. Generally speaking, the AO system can better restore the middle region of a FoV (Kong et al. 2017; Scharwaechter et al. 2019). In our three 705 nm photosphere testing sets, the granulation regions are farther from the center and therefore contain more low-order biases. The better results than CSSTN also indicate that our explicit alignment is better than the implicit alignment in CSSTN. We validate the effectiveness of PWC-Net in inter-frame alignment by ablation experiments, as shown in Table 8 and Fig. 11. Our inter-frame alignment in PWC-Net is a pixel-level explicit alignment, which is more precise (Sun et al. 2018; Gu et al. 2019). In the future, we will consider other registration methods that may yield better aligned features in regions with geometric variation, such as deformable alignment (Chan et al. 2021).
However, our proposed network still has some limitations. One practical limitation of CTSAN is that it requires more time for both training and inference. Although CTSAN has lower memory allocation requirements than CSSTN for inference on TiO high-resolution images, it takes about 10 s to restore a 1792 × 1792 pixel image with a single RTX3090 in our server. Further effort is therefore needed to apply this method in real-time astronomical observation. Employing knowledge distillation to obtain a compact model with less inference time and fewer model parameters is a possible and practical solution (Suin et al. 2021; Wang & Yoon 2021). Applied scientists can also train with the newest PyTorch 2.0+ version, which has a far higher training speed than the version we used, PyTorch 1.13.0, or can choose a parallel multi-GPU training strategy to accelerate training. Moreover, model performance may be further improved by designing cascaded networks with three or more stages while keeping the number of trainable parameters constant.
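As a rough illustration of the knowledge distillation idea mentioned above, the sketch below blends a supervised loss with a term that pulls a compact student's output toward that of a larger teacher. It is a generic output-level formulation with an L1 distance and a hypothetical weight `alpha`, not the specific schemes of the cited works and not something evaluated in this paper.

```python
import numpy as np


def distillation_loss(student_out, teacher_out, target, alpha: float = 0.5) -> float:
    """Blend an L1 loss to the target with an L1 loss to the teacher's output.

    alpha = 1 trains on the supervision target only; alpha = 0 only mimics the teacher.
    """
    student_out = np.asarray(student_out, dtype=np.float64)
    teacher_out = np.asarray(teacher_out, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Supervised term: distance between the student output and the reference image.
    supervised = np.mean(np.abs(student_out - target))
    # Mimic term: distance between the student output and the teacher output.
    mimic = np.mean(np.abs(student_out - teacher_out))
    return float(alpha * supervised + (1.0 - alpha) * mimic)
```

In such a setup the teacher would be the trained full network and the target the Speckle reconstruction; in practice the same expression would be written with the training framework's tensors so that gradients flow to the student only.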
In addition to the common data enhancement methods for small training samples used in this article, one of the classical methods in astronomical image restoration is using synthetic data to expand training sets, thus preventing overfitting on small training samples (Löfdahl & Hillberg 2022). We also find that the 705 nm training set helps prevent overfitting when training the 656 nm restoration model with a small 656 nm training set. Another limitation of our algorithm for solar AO image post-processing is its limited interpretability. Other sophisticated and more interpretable processes, such as self-attention and diffusion probabilistic models, may be worth exploring further (Deng et al. 2021; Song et al. 2022). Meanwhile, compared with non-deep learning methods of restoring solar AO images (Wang et al. 2022), a systematic theory of deep learning is urgently needed to support its wide application in actual astronomical imaging and observation scenarios requiring high reliability and interpretability.