A&A
Volume 640, August 2020
Article Number A41
Number of page(s) 9
Section Numerical methods and codes
DOI https://doi.org/10.1051/0004-6361/201937059
Published online 07 August 2020

© ESO 2020

1. Introduction

In the past two decades, over 3500 planets around distant stars (exoplanets) have been detected and confirmed. Most of the known exoplanets have been detected using the transit method, in which space telescopes observe a star for a period of time, generating a photometric light curve (Rodenbeck et al. 2018). If an exoplanet passes in front of the star during its orbit, light is blocked and the brightness decreases. Planetary transits repeat every orbital period. The shape of a transit light curve, in combination with additional knowledge about the host star, is used to constrain the characteristics of the planet. If an exoplanet has a moon, the moon is also expected to leave a signature in the transit light curve. Detecting exomoons is difficult because a stellar variability component as well as photon noise are present (Rodenbeck et al. 2018).

Several methods have been proposed to detect exomoons around exoplanets in light curves, using individual transits or averages over multiple transits (Kipping 2009; Heller 2014). A planet-moon transit light curve is modeled simply as the sum of two transits: one for the planet, and a shallower transit for the moon. To model multiple planet-moon transits, more detailed modeling is required in principle, to take into account the orbital motion of the moon and of the planet around their common center of mass. This more detailed modeling is only applied when a promising target has been identified. For previous attempts at detecting exomoons, we refer to Szabó et al. (2013) and Hippke (2015). The most recent attempt has been reported by Teachey et al. (2018), who stacked the phase-folded transits of 284 Kepler exoplanets in search of exomoon candidates.

Unfortunately, no exomoon has been unambiguously detected around any exoplanet so far, although some exomoon candidates have been reported. The most promising such candidate orbits the exoplanet Kepler-1625 b. This candidate has a mass of 10–40 Earth masses (M) and a radius of $4.90^{+0.79}_{-0.72}$ Earth radii (R), which are larger than the masses and radii of many confirmed exoplanets. In total, four transits of Kepler-1625 b have been observed: three by the Kepler space telescope between 2009 and 2012, and one by the Hubble Space Telescope in October 2017 (Teachey et al. 2018; Teachey & Kipping 2018). Kepler-1625 b orbits its star every 287 days; the exomoon candidate orbits the planet with a period between 13 and 39 days (see Fig. 1 for a diagram of the orbital configuration).

Fig. 1.

Sketch of the orbital configuration for the numerical simulations of transit light curves. The planet and the moon orbit their common center of mass C.

We propose a solution based on a convolutional neural network (CNN) to detect exomoons in photometric light curves. To test the method, we generate synthetic transit light curves with and without an exomoon. The light curves are similar to those of the Kepler-1625 b observations. Figure 2 presents examples of such simulated light curves. The left panel shows a planetary transit including an exomoon at a low noise level; the right panel shows a transit containing an exomoon at a higher noise level. The exomoon transit is always close in time to the planetary transit because the exomoon is on a close-in orbit around the planet. Depending on the orbital configuration, the exomoon transit can occur before or after the planetary transit.

Fig. 2.

Examples of simulated transit light curves. Left panel: the blue curve shows the transit of a planet with radius Rp = 12 R (similar to that of Kepler-1625 b) for a low noise level (100 ppm). The red curve shows a similar planetary transit with an exomoon transit superimposed (Rm = 4.75 R). Right panel: simulations for a higher noise level (400 ppm) and a much smaller moon (Rm = 1 R). The distinction between the planet-only (blue) and planet-moon (red) cases is not visible by eye in most cases.

Recently, deep CNN architectures have achieved impressive performance in data classification. They are quickly becoming prominent in astrophysical applications because they can effectively encode spectral and spatial information from the input data without preprocessing. A typical network consists of multiple interconnected layers and learns a hierarchical feature representation from raw data (Liu & Deng 2015). The network optimization is usually performed with an L2-norm cost function, which neither accounts for the noise properties nor exploits neighboring information in the data (Janocha & Czarnecki 2017). We here propose a 1D CNN to classify a transit signal as either containing or excluding an exomoon candidate. The architecture consists of five 1D convolution layers, followed by one fully connected layer. A Gaussian-smoothed total variation term is used to penalize the L1-norm error. The goal of this smoothing is to remove noise or unwanted variations in the data (e.g., from stellar activity, or from systematic or instrumental effects) while conserving the neighborhood information.

Deep learning has led to significant advances in the field of astrophysics. Most studies that incorporate this method were based on well-known network configurations, such as AlexNet (Krizhevsky et al. 2012) and VGG16 (Liu & Deng 2015), which are typically tuned using a mean squared error cost function. We here propose an improved cost function that includes a total variation loss component to regularize the network. Total variation loss has been used extensively in other fields, for example, in computer vision. Mahendran & Vedaldi (2015) used the total variation as a regularization procedure for reconstructing natural images from their representations based on image priors. Javanmardi et al. (2016) used the total variation as a structured loss by applying a Sobel edge-detection technique on the output probability map to train a deep network, especially when there are not enough labeled data. We apply similar ideas to improve the classification of exoplanet light curves.

The remainder of this paper is organized as follows. In Sect. 2 we describe the data sets we used in the analysis. An overview of our proposed method is given in Sect. 3. Section 4 compares the performance of the proposed method with the performances of several other techniques in both generalization and convergence. Section 5 summarizes the most important findings.

2. Synthetic data

We used three sets of light curves (Table 1). Each light curve contained four transits. For each transit we had 50 days of data, with 25 days on each side of a transit. The light curves had a cadence of 29.4 min, which is Kepler’s long cadence. The simulated transit light curves were generated using the model described in detail by Rodenbeck et al. (2018), with the additional improvement that the occultation of the moon by the planet was taken into account.
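As a quick consistency check (not part of the simulation code of Rodenbeck et al. 2018), the short Python sketch below verifies that four 50-day windows sampled at Kepler's 29.4-min long cadence contain the N = 9796 data points quoted in Sect. 3, and indicates how a white-noise realization of a given amplitude would be added to a model light curve; transit_model and noise_ppm are placeholders, not functions from the published code.

```python
import numpy as np

# Four transit windows of 50 days each, sampled at Kepler's long cadence
# of 29.4 min, give the N = 9796 samples used in Sect. 3.
cadence_days = 29.4 / (24.0 * 60.0)
n_samples = int(round(4 * 50 / cadence_days))
print(n_samples)  # -> 9796

# Sketch of one noise realization; transit_model stands in for the
# light-curve model of Rodenbeck et al. (2018), which is not reproduced here.
rng = np.random.default_rng()
t = np.arange(n_samples) * cadence_days
noise_ppm = 100.0
# flux = transit_model(t, star_params, planet_params, moon_params)
# flux = flux + rng.normal(0.0, noise_ppm * 1e-6, size=n_samples)
```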

Table 1.

Parameters for the three data sets of simulated light curves.

The first data set consisted of 1 052 400 simulated light curves: 526 200 with exomoons and 526 200 without exomoons. The noise level was varied between 25 ppm and 500 ppm, and the exomoon radius between 0.25 R and 5 R. The exomoon phase was randomized between 0 and 1 for each light curve that contained a moon.

The second data set consisted of ten subsets. Each subset had 200 000 curves: 100 000 with exomoons and 100 000 without exomoons. The noise was kept constant at 100 ppm. The exomoon radius was varied from 0.25 R to 4.75 R from one subset to another. The exomoon phases were also randomized.

The third data set also contained ten subsets, each with 200 000 curves. There was either no moon (100 000) or a moon (100 000) with a radius of 1 R. The noise was varied from 25 ppm to 475 ppm. The exomoon phases were again randomized.

Table 2 summarizes the choice of astrophysical parameters for the simulations, that is, the stellar, planetary, and exomoon parameters. The stellar parameters were fixed; the star had the same radius and mass as the Sun. We used a quadratic limb-darkening law, which describes the dimming of a star’s brightness from the center of the stellar disk toward the limb (see Claret & Bloemen 2011 for more detail), with roughly solar-like values. The exoplanet radius Rp and mass Mp are roughly Jupiter-like, and the orbital distance was set to 0.85 astronomical units. The exoplanet orbits its star in 287 days, in accordance with Kepler’s third law, which relates the distance of a planet from its star, the orbital period of the planet, and the combined mass of the star, the planet, and the moon (if present). The exomoon semimajor axis around its planet, am, was set to 20 Jupiter radii. The planet mass and the exomoon semimajor axis determine the orbital period of the moon around its planet. We varied the exomoon radius between 0.25 and 5 R while keeping its mass density constant and equal to the Earth’s density. Because the density was kept constant, only one exomoon parameter was varied. We also added noise of various amplitudes between 25 ppm (parts per million) and 500 ppm to the synthetic light curves, instead of using a more realistic noise model. The initial exomoon phase φ0 = φ(t = 0) (see Fig. 1) was randomized for each realization, where a realization means a new generation of the light curve with the appropriate parameters and newly generated noise added to the model. The limb-darkening coefficients q1 and q2 determine the brightness of the star as a function of center-to-limb distance. Limb darkening affects the shape of the transit: the bottom of the transit is rounded rather than flat (see Fig. 2).
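As an illustration of how these orbital periods follow from Kepler's third law, the minimal Python sketch below recovers the quoted 287-day planetary period from a = 0.85 AU and roughly solar and Jupiter-like masses, and computes the moon period implied by am = 20 Jupiter radii. The physical constants are standard values and are not taken from the paper.

```python
import numpy as np

G = 6.674e-11       # gravitational constant [m^3 kg^-1 s^-2]
M_sun = 1.989e30    # stellar mass [kg], solar value
M_jup = 1.898e27    # planetary mass [kg], roughly Jupiter-like
R_jup = 7.149e7     # Jupiter radius [m]
au = 1.496e11       # astronomical unit [m]

def kepler_period(a, total_mass):
    # Kepler's third law: orbital period from semimajor axis and total mass.
    return 2.0 * np.pi * np.sqrt(a**3 / (G * total_mass))

P_planet = kepler_period(0.85 * au, M_sun + M_jup)
print(P_planet / 86400.0)   # ~287 days, as quoted in the text

# Orbital period of the moon implied by a_m = 20 R_Jup and the planet mass
# (the moon mass is negligible here).
P_moon = kepler_period(20.0 * R_jup, M_jup)
print(P_moon / 86400.0)     # ~11 days
```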

Table 2.

Stellar, planetary, and exomoon parameters for the transit light-curve simulations.

3. Method

In this section we describe the proposed framework for detecting a moon-like signal in a light curve. The method does not require a preprocessing step. The 1D CNN architecture is summarized in Sect. 3.1 and described in full in Appendix A; the loss function is defined in Sect. 3.2.

The idea of detecting a moon in light curves is similar to the idea of detecting an edge in images, where the brightness (pixel value) changes sharply. The convolution stage in CNNs involves the sharing of connection weights, which allows for robust position-invariant feature detection: time variations are detected regardless of where they occur within the window of interest.

The input of the CNN is a simulated light curve, and the output is a binary value indicating the presence of a moon. We denote the mth simulated light curve by Xm = {Xm(0),Xm(1)⋯Xm(N − 1)}, where 0 ≤ m ≤ M − 1. Each light curve contains four transits. The total number of time samples is N = 9796, and the number of simulated light curves, M, is given in Table 1. For each input simulated light curve, the output of the network is denoted by C(Xm), equal to 0 (no moon) or 1 (moon). The true classification of the light curve is cm = 1 (moon) or cm = 0 (no moon). We note that convolutional networks are also appropriate because they are computationally efficient; the dimension of each input is reduced by convolution and pooling stages.

3.1. Convolutional neural network architecture

Our network takes Xm as input and classifies it into a binary value. As presented in Fig. 3, the architecture of the network consists of five convolution layers, followed by one fully connected layer and one output layer. Each convolution layer consists of a 1D convolution, batch normalization, leaky ReLU activation, maximum pooling, and Gaussian dropout. The network architecture was determined experimentally.

Fig. 3.

CNN architecture. As explained in the text, it consists of five convolution layers composed of 64, 128, 128, 128, and 64 channels, including convolution, batch normalization, leaky ReLU activation (first three columns in light blue), maximum pooling, and Gaussian dropout (last two columns in dark blue). The last layer is followed by a fully connected layer (purple) and a classification layer returning a binary answer (1 for moon, 0 otherwise).
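The following Keras sketch illustrates one possible implementation of this architecture. The channel counts, kernel width, leaky ReLU slope, batch-normalization constant, and the 2048-unit dense layer follow the text and Appendix A; the pooling windows and the Gaussian-dropout rate are not stated explicitly, so the values below are our assumptions (chosen so that the last feature map has 102 time samples, as stated in Sect. 3.2).

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, channels, pool_size):
    # One convolution layer as in Sect. 3.1 / Appendix A: 1D convolution
    # (kernel width 3), batch normalization, leaky ReLU, maximum pooling,
    # and Gaussian dropout.
    x = layers.Conv1D(channels, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization(epsilon=0.002)(x)
    x = layers.LeakyReLU(alpha=0.02)(x)
    x = layers.MaxPooling1D(pool_size=pool_size)(x)
    # Keras GaussianDropout(rate) multiplies by noise of standard deviation
    # sqrt(rate / (1 - rate)); rate = 0.3 roughly matches the quoted 0.65.
    x = layers.GaussianDropout(0.3)(x)
    return x

def build_model(n_samples=9796, pool_sizes=(2, 2, 2, 3, 4)):
    # pool_sizes are illustrative; they reduce 9796 time samples to 102.
    inputs = keras.Input(shape=(n_samples, 1))
    x = inputs
    for channels, pool in zip((64, 128, 128, 128, 64), pool_sizes):
        x = conv_block(x, channels, pool)
    x = layers.Flatten()(x)             # 64 x 102 = 6528 features
    x = layers.Dense(2048)(x)
    x = layers.LeakyReLU(alpha=0.02)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)
```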

3.2. Loss function

The training process of the CNN is accomplished through the minimization of a loss function that measures the error between correct and predicted values. In classification problems, many loss functions can be used, for example, binary cross-entropy, hinge, Kullback–Leibler (KL) divergence, and Wasserstein (Janocha & Czarnecki 2017).

Noise-smoothing operators have been used extensively as a preprocessing step in signal processing to suppress noise while preserving changes in the signal (Simonoff 1998). The contribution of our work is the use of the smoothing operator directly during learning. The loss function is constructed using spatial (i.e., neighborhood) information of the data features; as a result, the back-propagation errors of the proposed loss function differ from those of standard losses. In this section, we define the proposed loss function, which combines a mean error and a total variation error.

We consider a binary classification, where the goal is the assignment of either no-moon (0) or moon (1) to each moon-like signal in the training set. The input data are a set of training samples indexed by 0 ≤ m ≤ M − 1. The loss function is a combination of two functions. The first function is the mean error,

$$ \begin{aligned} L_{\rm err}(\Theta )= \frac{1}{M}\sum _{m=0}^{M-1} | C(X_m; \Theta ) - c_m |, \end{aligned} $$(1)

where each term in the sum can only take the value 0 or 1. This function is to be minimized to obtain all the parameters Θ of the CNN (all weights and biases). However, if the number of parameters Θ is large and the amount of data is small, the network might overfit. Regularization by penalizing the loss function is a common solution to this problem.

The output of the last convolution layer (ℓ = 4) is a matrix of dimension 64 (number of feature maps) times 102 (number of time samples). This layer is followed by a fully connected layer Y(p) activated by a leaky ReLU (see Appendix A). We use the total variation of Y as a penalty to regularize the classification. First, each vector Ym is smoothed to remove noise while preserving the original signal. This is done by convolution with a centered Gaussian kernel G with standard deviation σ = 3 and width 9,

$$ \begin{aligned} S_m(p) = \sum _{j=0}^{8} Y_m(p-j)G(j). \end{aligned} $$(2)

The total variation loss is then obtained by computing the mean unsigned difference between two consecutive points, and averaging over all the light curves,

$$ \begin{aligned} L_{\rm tv}(\Theta ) = \frac{1}{M} \sum _{m=0}^{M-1} \frac{1}{N_p-1} \sum _{p=0}^{N_p-2} \left| S_m(p+1) - S_m(p) \right|. \end{aligned} $$(3)

The final loss function is a linear combination of the mean error between the correct and predicted values and the average unsigned gradient of the input features,

$$ \begin{aligned} L(\Theta ) = L_{\rm err}(\Theta ) + \gamma L_{\rm tv}(\Theta ), \end{aligned} $$(4)

where γ is a regularization parameter. We here give equal weight to the two terms, that is, γ = 1.
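A minimal TensorFlow sketch of Eqs. (2) and (3) is given below. It assumes that `features` is the output of the fully connected layer (shape batch × Np); the Gaussian kernel is normalized to unit sum, which is our choice rather than something stated in the text. In practice the resulting penalty would be added to the mean-error term of Eq. (1) with weight γ, for example via model.add_loss.

```python
import numpy as np
import tensorflow as tf

def gaussian_kernel(width=9, sigma=3.0):
    # Centered 1D Gaussian kernel G of Eq. (2), normalized to unit sum.
    x = np.arange(width) - (width - 1) / 2.0
    g = np.exp(-0.5 * (x / sigma) ** 2)
    return (g / g.sum()).astype("float32")

def total_variation_loss(features, width=9, sigma=3.0):
    # Smooth each feature vector by convolution with the Gaussian kernel
    # (Eq. 2), then average the unsigned difference between consecutive
    # points over all samples in the batch (Eq. 3).
    kernel = tf.constant(gaussian_kernel(width, sigma).reshape(width, 1, 1))
    y = tf.expand_dims(features, axis=-1)             # (batch, N_p, 1)
    s = tf.nn.conv1d(y, kernel, stride=1, padding="SAME")
    return tf.reduce_mean(tf.abs(s[:, 1:, :] - s[:, :-1, :]))

# Total loss of Eq. (4), with gamma = 1 and the mean absolute error of Eq. (1):
# L = keras.losses.MeanAbsoluteError()(c_true, c_pred) + 1.0 * total_variation_loss(Y)
```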

3.3. Optimization

In order to obtain the network parameters Θ* that minimize L(Θ), we use the AdaMax algorithm to gradually update the weights and biases when searching for the optimal solution (see Appendix B).

4. Performance

We present evaluation metrics and setup (Sect. 4.1) to assess the performance of the CNN in comparison to the classical method for the detection of exomoons (Sect. 4.2). In Sect. 4.3 the performance of the proposed method is further discussed by varying the parameters of the moon, the planet, and the host star.

4.1. Performance metrics and setup

The most common metrics for the evaluation of a binary classification are sensitivity, specificity, and accuracy. Sensitivity, also called the true-positive rate, measures the proportion of actual positive samples that are correctly classified as positive samples. Specificity, also called true-negative rate, is the proportion of actual negative samples that are correctly classified as negative examples (Hand et al. 2001). Accuracy measures the proportion of correctly classified light curves. These metrics are defined as follows:

$$ \begin{aligned} \mathrm{Sen}&= \frac{TP}{TP+FN} , \qquad \mathrm{(sensitivity)} \end{aligned} $$(5)

$$ \begin{aligned} \mathrm{Spe}&= \frac{TN}{TN+FP} , \qquad \mathrm{(specificity)} \end{aligned} $$(6)

$$ \begin{aligned} \mathrm{Acc}&= \frac{TP+TN}{TP+FP+FN+TN} , \qquad \mathrm{(accuracy)} \end{aligned} $$(7)

where TP is the number of true positives (exomoons correctly detected), FP is the number of false positives, TN is the number of true negatives (absence of moons correctly classified), and FN is the number of false negatives.
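For reference, these metrics can be computed from binary predictions as in the short sketch below, which is a straightforward transcription of Eqs. (5)–(7) rather than the published evaluation code.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    # Sensitivity, specificity, and accuracy (Eqs. 5-7) for 0/1 labels.
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)       # moons correctly detected
    tn = np.sum(~y_true & ~y_pred)     # absence of moons correctly classified
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sen, spe, acc
```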

We used 70% of each data set in Table 1 to train and validate the CNN and the remaining 30% to evaluate the CNN performance using the above metrics. All our CNN experiments were run on Nvidia Tesla V100 GPUs with Keras 2.2.2 and TensorFlow 1.9.1. Each CNN experiment with the same configuration was run 25 times, and the results we present here were obtained by averaging over these 25 trials. The code of the network can be found online1.

4.2. Classical analysis

For comparison with the CNN algorithm, we also classified the light curves using the classical method described in Rodenbeck et al. (2018). For each light curve, we considered two models that might explain the data: a light-curve model containing the transits of a single planet without a moon (which we call no-moon model), and a model containing the transits of both a planet and a moon (which we call one-moon model). To estimate the likelihood of each case, we used the Bayesian information criterion (BIC, Schwarz 1978) to determine whether a moon is present. The BIC of a model ℳ(θ), which depends on model parameters θ = {θ1, θ2θk}, is given by

$$ \begin{aligned} \mathrm{BIC}(\mathcal{M})=-2\max _\theta {\mathcal{L} (\theta |X)}+k\ln {N}, \end{aligned} $$(8)

where ℒ(θ|X) is the likelihood of θ given the data X, k is the number of model parameters (k = 14 for the one-moon model and k = 7 for the no-moon model), and N = 9796 is the length of X, as described in Sect. 3. We calculated the difference ΔBIC = BIC (one moon) − BIC (no moon) between the two BICs and classified light curves with ΔBIC >  0 as containing no moon and ΔBIC <  0 as containing a moon. The ΔBIC between two competing models compares their maximum likelihoods while penalizing the model with more parameters. In other words, the model with more parameters has to explain the data sufficiently better to justify the use of the additional free parameters. The one-moon model contains more parameters than the no-moon model, which describe the moon size and orbital configuration (e.g., the ratio of the moon-to-star radius, the planet-moon distance, the moon period, the moon phase, and the ratio of moon-to-planet mass). We found the maximum likelihood of the models with and without a moon using the Markov chain Monte Carlo (MCMC) sampler emcee, which we used to approximate the parameter posterior distribution.
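The classification rule can be summarized by the following sketch, where the maximum log-likelihoods are assumed to come from the emcee fits of the two models; the parameter counts and N follow the text.

```python
import numpy as np

def bic(max_log_likelihood, k, n):
    # Bayesian information criterion (Eq. 8): -2 max ln L + k ln N,
    # where max_log_likelihood is the maximum of ln L(theta|X) over theta.
    return -2.0 * max_log_likelihood + k * np.log(n)

def classify_delta_bic(logl_one_moon, logl_no_moon, n=9796):
    # Delta BIC = BIC(one moon) - BIC(no moon); a negative value favors
    # the one-moon model (k = 14) over the no-moon model (k = 7).
    delta = bic(logl_one_moon, k=14, n=n) - bic(logl_no_moon, k=7, n=n)
    return 1 if delta < 0 else 0
```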

4.3. Results and analysis

4.3.1. Effect of the choice of loss function

In order to illustrate the effectiveness of the total variation loss function L defined in Sect. 3.2, we considered other loss functions: binary cross-entropy, hinge, KL divergence, Wasserstein, and mean square error. Tables 3 and 4 show that L converges faster and performs better than all other choices of loss functions.

Table 3.

Number of epochs needed to train the model and computing time in seconds required to process the test data for different choices of loss functions.

Table 4.

Comparison between loss functions based on performance metrics: sensitivity, specificity, and accuracy.

4.3.2. Effect of smoothing as preprocessing

The main purpose of a preprocessing step is to produce more effective features by standardizing the dynamic range of the raw data or by removing unwanted variations in the raw data. Smoothing is one preprocessing step that might be used to remove unwanted variations. However, this approach could introduce new systematic variability or remove actual signal from the raw data. In Table 5 we compare smoothing used as a preprocessing step for a 1D CNN trained with Lerr against the method we propose here, which uses Ltv during learning.

Table 5.

Comparison between smoothing techniques.

4.3.3. Dependence on exomoon radius

Figure 4 shows the results we obtained by increasing the exomoon radius. Sen and Spe are significantly lower when the exomoon radius is smaller than the Earth radius, although the convolution, pooling, and smoothing windows are small enough to capture small differences in data distribution. When the exomoon radius is equal to or greater than the Earth radius, the performance reaches 100%. The total variation is commonly used as a regularization factor to find continuity in the signals (Houhou et al. 2009) and detect a change between neighboring values, but the variations due to small moons are very difficult to identify.

Fig. 4.

Performance as a function of exomoon radius at fixed noise level (100 ppm). Data set 2 was used.

The addition of a small moon alters the light curve only slightly, which also means that the maximum likelihoods of the no-moon and one-moon models are very similar. In this case, the ΔBIC method decides in favor of the model with fewer parameters, and Sen goes to zero.

4.4. Effect of noise on the light curves

Figure 5 shows the results for detecting a 1 R moon when the noise level is varied from 25 ppm to 475 ppm in steps of 50 ppm. We find that when the noise is lower than 175 ppm, the performance of the network is perfect, and it is very good (above 80%) when the noise is below 375 ppm. The performance decreases to Acc = 75% for the highest noise level. The sensitivity is higher for the CNN than for the ΔBIC method for two reasons. First, the total variation (Beck & Teboulle 2009) enforces spatial constraints, which helps the learning process to define spatial continuity in the signal and to distinguish between a moon-like signal and noise. Second, adding Gaussian noise with dropout also helps to detect changes in the signal (see, e.g., Srivastava et al. 2014).

Fig. 5.

Performance as a function of noise level between 25 ppm and 475 ppm. The exomoon radius is 1 R. Data set 3 was used.

On the other hand, for high noise levels, the difference in the maximum likelihood between the one-moon model and the no-moon model is too small to counteract the penalty given by the difference in the number of parameters between the two models.

4.5. Effect of training size

Figure 6 presents the performance metrics as functions of the number of samples in the training set, from 300 000 to 700 000. The number of samples in the testing set was kept constant at 315 720. As expected, the CNN is more robust when larger training sets are used. The results show that increasing the training set beyond 650 000 samples brings only a small improvement; for the three largest training sets shown in Fig. 6, the CNN performance appears to have converged.

Fig. 6.

Results as a function of training size. The training set includes all noise levels and exomoon sizes. Data set 1 was used.

4.6. Comparison with the ΔBIC method and other CNN architectures

Table 6 compares the method we propose here, the classical approach for detecting moon-like signals (ΔBIC), and other well-known CNN architectures (VGG16 1D, Liu & Deng 2015; AlexNet 1D, Krizhevsky et al. 2012; ResNet 1D, He et al. 2016; and DenseNet 1D, Huang et al. 2017). The differences between our approach and the other CNN approaches lie in the size of the kernel filters, the number of filter units, the number of convolution layers, and the order of the convolution layers. Our proposed CNN with total variation loss outperforms all other approaches, and the classical approach (ΔBIC) performs more poorly than all CNNs.

Table 6.

Performance comparison between the proposed CNN and other 1D CNN architectures.

The ΔBIC method requires around 12–18 h per light curve on one CPU with 32 GB of memory per core, which corresponds to up to about 2 × 10¹⁰ s to analyze the 315 720 light curves of the test data. The CNN, on the other hand, requires less than one second per light curve on one GPU with 32 GB per core, which corresponds to only about 200 s for the entire test data set.

4.7. Training the CNN using a more general data set

The synthetic data from the previous sections were all generated using only three free parameters (exomoon radius, noise level, and exomoon phase), while the other parameters of the star and planet were kept constant. In order to test how the CNN method behaves when it is trained on a more general data set, we generated new synthetic light curves for which we varied all 14 free parameters (Table 7). This new data set includes 200 000 light curves.

Table 7.

Stellar, planetary, and exomoon parameters for the transit light-curve simulations with 14 free parameters.

Figure 7 shows the sensitivity of both ΔBIC and the CNN versus noise level at a fixed exomoon radius of 1 R. When the CNN is trained using the data set with 14 free parameters, its performance decreases by up to 20% compared to its performance when the more restricted training data set is used. It is important to note that the CNN still performs well at high noise levels. For noise levels above about 200 ppm, the performance of the ΔBIC method is far worse than that of the CNN, independently of the data set it is applied to.

Fig. 7.

Sensitivity as a function of noise level using input data generated with 3 free parameters (exomoon radii, exomoon orbital phases, and noise levels, see Table 2; red curves) and with 14 free parameters (see Table 7; blue curves). The two methods are compared at a fixed exomoon radius of 1 R: ConvNet (solid curves) vs. ΔBIC (dashed curves). The sensitivity is plotted in bins of 50 ppm noise levels.

5. Conclusion

We proposed a regularized 1D CNN architecture to detect exomoon candidates in transit light curves. The regularized loss function brings a significant improvement over traditional loss functions in terms of convergence, sensitivity, and specificity. Furthermore, the method we propose performs relatively well with high levels of noise and small moon sizes.

We compared the network results with the ΔBIC method described by Rodenbeck et al. (2018). The ΔBIC method is more conservative: it has a higher true-negative rate, but also a much lower true-positive rate. For low noise levels (100 ppm), both methods are able to reliably detect exomoons larger than 1 R in our simulation setup. The CNN shows its strength at higher noise levels. For the data set with only 3 free parameters, Sen and Spe stay above 90% up to a noise level of 325 ppm and remain around 80% for even higher noise levels. For the data set with 14 free parameters, Sen remains above 65% at 500 ppm. The ΔBIC method is completely insensitive to the presence of an exomoon at noise levels of 275 ppm and higher. The CNN is more effective in detecting a moon-like signal, and it is faster. It is a promising technique for applications to real astronomical data sets.


Acknowledgments

We thank Chris Hanson for constructive comments. This work was supported in part by NYUAD Institute grant G1502 “Center for Space Science”. RA acknowledges support from the NYUAD Kawader Research Program. Computational resources were provided by the NYUAD Institute through Muataz Al Barwani and the HPC Center. The simulated light curves were produced at the DLR-supported PLATO Data Center at the MPI for Solar System Research. Author Contributions: LG proposed the research idea. RA proposed the Machine Learning algorithm and analyzed the data and the results. KR prepared the simulated data and performed the BIC analysis. KR is a member of the International Max Planck Research School for Solar System Science at the University of Göttingen. All authors contributed to the final manuscript.

References

  1. Beck, A., & Teboulle, M. 2009, IEEE Trans. Image Proc., 18, 2419 [NASA ADS] [CrossRef] [Google Scholar]
  2. Claret, A., & Bloemen, S. 2011, A&A, 529, A75 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  3. Hand, D. J., Smyth, P., & Mannila, H. 2001, Principles of Data Mining (MIT Press) [Google Scholar]
  4. He, K., Zhang, X., Ren, S., & Sun, J. 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770 [Google Scholar]
  5. Heller, R. 2014, ApJ, 787, 14 [NASA ADS] [CrossRef] [Google Scholar]
  6. Hippke, M. 2015, ApJ, 806, 51 [NASA ADS] [CrossRef] [Google Scholar]
  7. Houhou, N., Bresson, X., Szlam, A., Chan, T. F., & Thiran, J.-P. 2009, in Scale Space and Variational Methods in Computer Vision, eds. X.-C. Tai, K. Mørken, M. Lysaker, & K.-A. Lie, 112 [CrossRef] [Google Scholar]
  8. Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4700 [Google Scholar]
  9. Ioffe, S., & Szegedy, C. 2015, Proceedings of the 32nd International Conference on Machine Learning (Lille, France: PMLR), eds. F. Bach, & D. Blei, Proc. Mach. Learn. Res., 37, 448 [Google Scholar]
  10. Janocha, K., & Czarnecki, W. M. 2017, Theor. Found Mach. Learn., 25, 49 [Google Scholar]
  11. Javanmardi, M., Sajjadi, M., Liu, T., & Tasdizen, T. 2016, IEEE Conf. Comput. Vision Pattern Recogn. [Google Scholar]
  12. Kingma, D. P., & Ba, J. 2015, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, eds. Y. Bengio, & Y. LeCun [Google Scholar]
  13. Kipping, D. M. 2009, MNRAS, 392, 181 [NASA ADS] [CrossRef] [Google Scholar]
  14. Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012, Adv. Neural Inf. Proc. Syst., 25, 1097 [Google Scholar]
  15. Liu, S., & Deng, W. 2015, 3rd IAPR Asian Conference on Pattern Recognition, 730 [Google Scholar]
  16. Maas, A. L., Hannun, A. Y., & Ng, A. Y. 2013, ICML 2013 Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, USA, June 16, 2013 [Google Scholar]
  17. Mahendran, A., & Vedaldi, A. 2015, Int. J. Comput. Vision, 120, 233 [CrossRef] [Google Scholar]
  18. Pedamonti, D. 2018, IEEE Conf. Comput. Vision Pattern Recognit. [Google Scholar]
  19. Rodenbeck, K., Heller, R., Hippke, M., & Gizon, L. 2018, A&A, 617, A49 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  20. Schwarz, G. 1978, Ann. Stat., 6, 461 [Google Scholar]
  21. Simonoff, J. S. 1998, Smoothing Methods in Statistics (Springer) [Google Scholar]
  22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. 2014, J. Mach. Learn. Res., 15, 1929 [Google Scholar]
  23. Szabó, R., Szabó, G. M., Dálya, G., et al. 2013, A&A, 553, A17 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
  24. Teachey, A., & Kipping, D. M. 2018, Sci. Adv., 4 [CrossRef] [Google Scholar]
  25. Teachey, A., Kipping, D. M., & Schmitt, A. R. 2018, AJ, 155, 36 [NASA ADS] [CrossRef] [Google Scholar]

Appendix A: Network architecture

The network consists of the following: convolution blocks (convolution, batch normalization, leaky ReLU activation, maximum pooling, and Gaussian dropout), a fully connected block, and a classification block.

First convolution layer (ℓ = 0). The input of the first convolution layer, ℓ = 0, is a vector Xm of dimension N. The index m of the light curve is fixed throughout this section, therefore we drop it to simplify the notation. To construct feature maps, the input data were convolved 64 times with different kernels Fk,0 of width $n^0_f=3$ and stride sf = 1. The kernel values are to be determined later by training the network. The stride is the distance used to shift the convolution kernel. We obtained 64 feature maps of length $N_f^0=N$ (with zero padding); see, for example, Krizhevsky et al. (2012),

$$ \begin{aligned} Y^{k,0}_{\rm conv}(i) = \sum _{j=0}^{n^0_f-1} X(i-j) F_{k, 0}(j), \quad 0\le k < K_0=64 . \end{aligned} $$(A.1)

Batch normalization. The convolution operator is followed by a batch normalization function. Batch normalization is applied to reduce internal covariate shift (Ioffe & Szegedy 2015). It is simply a linear transformation of the data with a scaling λ0 and a shift β0,

$$ \begin{aligned} Y_{\rm norm}^{k,0}(i) = \lambda _0 \frac{Y^{k,0}_{\rm conv}(i)-\mu _0}{\sqrt{v_0+\epsilon }} + \beta _0, \end{aligned} $$(A.2)

where μ0 and v0 are the average and variance of the input data over all k at fixed ℓ. The initial values of the parameters were λ0 = 1 and β0 = 0. The constant ϵ = 0.002 was added to avoid division by very small values.

Leaky ReLU activation. The batch normalization is followed by applying a leaky rectified linear unit (ReLU) activation function as follows (Maas et al. 2013):

$$ \begin{aligned} Y_{\rm lactiv}^{k,0} (i)&=\mathrm{LReLU}\left( Y_{\rm norm}^{k,0}(i) ; { w}_k, b_k \right)\\&= {\left\{ \begin{array}{ll} Y_{\rm norm}^{k,0}(i) { w}_k+b_k&\mathrm{if\ } Y^{k,0}_{\rm norm}(i) \geqslant 0, \\ {\rho } {Y}^{k,0}_{\rm norm}(i) { w}_k+b_k&\mathrm{else} . \end{array}\right.} \end{aligned} $$(A.3)

Here w and b are values to be determined. Many activation functions are possible, such as tanh, sigmoid, hyperbolic, or rectified functions. The leaky ReLU function allows for a small gradient ρ when the units are not active. Here, we set ρ = 0.02, and refer to Maas et al. (2013) and Pedamonti (2018) for an in-depth discussion.

Maximum pooling. A maximum-pooling operator performs a spatial subsampling in time by taking the maximum value over a pooling window of length np every sp points (with np = sp),

$$ \begin{aligned} Y^{k,0}_{\rm mpool}(i) = \max \{ Y_{\rm lactiv}^{k,0}(j) \}_{ i s_{\rm p} \le j \le i s_{\rm p} +n_{\rm p} -1} , \end{aligned} $$(A.4)

for $i=0,1,\ldots,N_{\rm f}^0/n_{\rm p}-1$.

Gaussian dropout. In order to provide some regularization, a Gaussian dropout procedure is applied after the maximum pooling (Srivastava et al. 2014). It involves multiplying weights by a Gaussian random variable with mean 1 and standard deviation σ = 0.65. Symbolically, we write

$$ \begin{aligned} Y^{k,\ell }_{\rm dout} = \mathrm{Gaussian\_dropout} \big ( Y_{\rm mpool}^{k,\ell } \big ) . \end{aligned} $$(A.5)

We found through experimentation that a Gaussian dropout gives better results than a standard dropout.
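To make the sequence of operations concrete, the following numpy sketch evaluates one convolution block (Eqs. (A.1)–(A.5)) for a single light curve. The parameters are random placeholders rather than trained values, and the pooling window of 2 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_block_forward(x, kernels, lam, beta, w, b,
                       rho=0.02, eps=0.002, n_pool=2, sigma_drop=0.65):
    # Forward pass of one convolution block, Eqs. (A.1)-(A.5), for one input x.
    K, n_f = kernels.shape
    # Eq. (A.1): 1D convolution with stride 1 and zero padding.
    x_pad = np.concatenate([np.zeros(n_f - 1), x])
    y_conv = np.array([[x_pad[i:i + n_f][::-1] @ kernels[k]
                        for i in range(len(x))] for k in range(K)])
    # Eq. (A.2): batch normalization with scaling lam and shift beta.
    y_norm = lam * (y_conv - y_conv.mean()) / np.sqrt(y_conv.var() + eps) + beta
    # Eq. (A.3): leaky ReLU activation with slope rho for negative inputs.
    y_act = np.where(y_norm >= 0, y_norm * w + b, rho * y_norm * w + b)
    # Eq. (A.4): maximum pooling over non-overlapping windows of length n_pool.
    n_out = y_act.shape[1] // n_pool
    y_pool = y_act[:, :n_out * n_pool].reshape(K, n_out, n_pool).max(axis=2)
    # Eq. (A.5): Gaussian dropout, multiplicative noise with mean 1.
    return y_pool * rng.normal(1.0, sigma_drop, size=y_pool.shape)

# Example with random placeholder parameters (K_0 = 64 filters of width 3):
x = rng.normal(1.0, 1e-4, size=9796)                 # one simulated light curve
out = conv_block_forward(x, kernels=rng.normal(size=(64, 3)),
                         lam=1.0, beta=0.0, w=0.1, b=0.0)
print(out.shape)                                      # (64, 4898)
```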

Next convolution layers. Another four convolution layers (ℓ = 1, 2, 3, and 4) follow the first convolution layer (ℓ = 0). For each layer ℓ ≥ 1, we computed the feature maps,

$$ \begin{aligned} Y_{\rm conv}^{k,\ell }(i) = \sum _{j=0}^{n_f-1} Y_{\mathrm{dout}}^{k,\ell -1}(i-j) F^{k,\ell }(j), \quad 0\le k < K_\ell , \end{aligned} $$(A.6)

followed by batch normalization, leaky ReLU activation, maximum pooling, and dropout, as defined above. The number of feature maps Kℓ is 128 for the intermediate convolution layers ℓ = 1, 2, 3, and K4 = 64 for the last layer.

Fully connected layer. The output of the last convolution layer ℓ = 4 is a matrix of dimension 64 (number of feature maps) times 102 (number of time samples). This layer is followed by a fully connected layer Y(p) activated by a leaky ReLU (see Eq. (A.3)),

$$ \begin{aligned} Y(p) = \sum _{k=0}^{63} \sum _{i=0}^{101} \mathrm{LReLU}\left( Y_{\mathrm{dout}}^{k,4}(i) ; { w}_{k,i}(p), b(p) \right), \end{aligned} $$(A.7)

for 0 ≤ p ≤ Np − 1, where Np = 64 × 102 = 6528 is the number of neurons; this is linearly converted to 2048. The fully connected layer is the final vector output before the data are classified.

Output Layer. Sigmoid activation and classification. This layer is activated by a sigmoid function, which produces a probability in the range between 0 and 1, as follows:

$$ \begin{aligned} \hat{{ y}} = (1+e^{-{ y}})^{-1}, \end{aligned} $$(A.8)

with

$$ \begin{aligned} { y} = \sum _{p=0}^{N_{\rm p}-1} { w}_{p} Y(p) + b . \end{aligned} $$(A.9)

The final classification of the data is

$$ \begin{aligned} C(X) = \left\{ \begin{array}{ll} 1&\mathrm{if }\,\hat{{ y}} \ge 0.5, \\ 0&\mathrm{else} . \end{array}\right. \end{aligned} $$(A.10)

Appendix B: Optimization

We used a batch approach, whereby the input samples are passed forward and backward through the network in batches of 32 samples in each epoch. We refer to Kingma & Ba (2015) for additional details about this algorithm (AdaMax). Denoting by Θq − 1 the parameters of the network at iteration q − 1, the updated values at iteration q are given by

$$ \begin{aligned} \Theta _{q} = \Theta _{q-1} - \left(\frac{\alpha }{1-\beta _{1}}\right) \frac{m_q}{\mu _q}, \end{aligned} $$(B.1)

with

$$ \begin{aligned} m_{q} = \beta _{1} m_{q-1} + (1-\beta _{1})\ \nabla L(\Theta _{q-1}), \end{aligned} $$(B.2)

and

$$ \begin{aligned} \mu _{q} = \max (\beta _{2}\mu _{q-1},\Vert \nabla L(\Theta _{q-1})\Vert ) . \end{aligned} $$(B.3)

The quantity α is the step size. η = α/(1 − β1) is the learning rate, and the parameters β1 = 0.9 and β2 = 0.999 were fixed.

Initially, we set α = 0.002. At q = 0, we initialized the weights of the network in each layer with a random number drawn from a zero-mean Gaussian distribution with standard deviation 0.1. The biases were initially set to zero.

A stopping criterion was used to reduce overfitting of the network. Training stopped when the validation error did not decrease after 20 epochs. If the loss function did not decrease, the parameter α was multiplied by a factor of 0.1 after the first 10 epochs, and the operation was repeated every 10 epochs.
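This training setup can be reproduced with standard Keras components, as in the sketch below, which reuses the build_model() sketch from Sect. 3.1; the data arrays here are random placeholders, the loss is reduced to the mean-error term of Eq. (1), and the TV penalty of Sect. 3.2 would be added separately.

```python
import numpy as np
from tensorflow import keras

# Sketch of the training setup described above. The Gaussian weight
# initialization (std 0.1, zero biases) would be passed to each Conv1D/Dense
# layer via kernel_initializer when building the model.
model = build_model()
model.compile(
    optimizer=keras.optimizers.Adamax(learning_rate=0.002,      # alpha
                                      beta_1=0.9, beta_2=0.999),
    loss="mae",   # L_err of Eq. (1); the TV penalty is added separately
)

callbacks = [
    # Stop when the validation loss has not decreased for 20 epochs.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=20),
    # Multiply alpha by 0.1 after 10 epochs without improvement, repeatedly.
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                      factor=0.1, patience=10),
]

# Placeholder data with the correct shapes (batch of light curves, 0/1 labels).
x = np.random.normal(1.0, 1e-4, size=(256, 9796, 1)).astype("float32")
c = np.random.randint(0, 2, size=(256, 1)).astype("float32")

model.fit(x, c, validation_split=0.3, batch_size=32, epochs=5,
          callbacks=callbacks)
```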

Figure B.1 compares the evolution of the loss function versus the number of epochs used to train the network. The regularized loss function performs the best.

Fig. B.1.

Validation loss as a function of the number of epochs used to train the network. The network was trained in batches of 32 light curves. The regularized loss function Lerr + Ltv decreases faster than Lerr alone.

