Issue | A&A, Volume 689, September 2024 | |
---|---|---|
Article Number | A66 | |
Number of page(s) | 19 | |
Section | Cosmology (including clusters of galaxies) | |
DOI | https://doi.org/10.1051/0004-6361/202449597 | |
Published online | 02 September 2024 |
SHAMe-SF: Predicting the clustering of star-forming galaxies with an enhanced abundance matching model
1 Donostia International Physics Center, Manuel Lardizabal Ibilbidea, 4, 20018 Donostia, Gipuzkoa, Spain
2 University of the Basque Country UPV/EHU, Department of Theoretical Physics, Bilbao E-48080, Spain
3 IKERBASQUE, Basque Foundation for Science, 48013 Bilbao, Spain
Received: 13 February 2024
Accepted: 13 May 2024
Context. With the advent of several galaxy surveys targeting star-forming galaxies, it is important to have models capable of interpreting their spatial distribution in terms of astrophysical and cosmological parameters.
Aims. We introduce SHAMe-SF, an extension of the subhalo abundance matching (SHAM) technique designed specifically for analysing the redshift-space clustering of star-forming galaxies.
Methods. Our model directly links a galaxy’s star-formation rate to the properties of its host dark matter subhalo, with further modulations based on effective models of feedback and gas stripping. To quantify the accuracy of our model, we show that it simultaneously reproduces key clustering statistics such as the projected correlation function, monopole, and quadrupole of star-forming galaxy samples at various redshifts and number densities. These tests were conducted over a wide range of scales, $[0.6, 30]\,h^{-1}$ Mpc, using samples from both the TNG300 magneto-hydrodynamic simulation and a semi-analytical model.
Results. SHAMe-SF can reproduce the clustering of simulated galaxies selected by star-formation rate as well as galaxies that fall within the colour selection criteria employed by DESI for emission line galaxies.
Conclusions. Our model exhibits several potential applications, including the generation of covariance matrices, exploration of galaxy formation processes, and even placing constraints on the cosmological parameters of the Universe.
Key words: galaxies: formation / galaxies: statistics / large-scale structure of Universe
© The Authors 2024
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe to Open model.
1. Introduction
A new generation of galaxy redshift surveys (e.g., DESI, Euclid, J-PAS, 4MOST) will soon measure the positions of hundreds of millions of galaxies. An important aspect of properly analysing these surveys is the creation of realistic mock catalogues that resemble the spatial distribution of galaxies in the Universe. These synthetic catalogues make it possible to test analysis pipelines, quantify statistical and systematic uncertainties, and explore the impact of various cosmological ingredients.
The nature of upcoming surveys will pose several challenges in constructing such mock catalogues. Firstly, the unprecedented amount of data will require extremely precise mock catalogues over large cosmological volumes. Secondly, as these surveys will mostly focus on emission line galaxies (ELGs), the mock catalogues require realistic models for star-formation rates (SFRs) and metallicity in galaxies. Thirdly, future observations will provide a complementary view of the cosmic large-scale structure by delivering overlapping gravitational lensing and galaxy maps. Consequently, this creates the need for a new generation of mock catalogues with accurate predictions for ELGs and their connection with the underlying dark matter structures. The SHAMe-SF model presented in this paper aims to achieve this objective.
One of the most widely employed approaches to model the spatial distribution of galaxies is the halo occupation distribution (HOD) model (Jing et al. 1998; Benson et al. 2000; Peacock & Smith 2000; Berlind et al. 2003; Zheng et al. 2005, 2007; Guo et al. 2015). Basic HOD models appear to be sufficient to analyse the clustering of stellar-mass or luminosity-selected galaxies (Zehavi et al. 2005; Coil et al. 2006; Zheng et al. 2007). However, the modelling of star-forming galaxies is more complex (Geach et al. 2012; Contreras et al. 2013, 2017; Gonzalez-Perez et al. 2020; Avila et al. 2020; Alam et al. 2020) since, for instance, these galaxies typically do not follow the same phase-space distribution as dark matter, and their abundance does not scale monotonically with halo mass (Orsi & Angulo 2018; Avila et al. 2020). Additionally, HODs make several simplifications, such as a lack of assembly bias (e.g., Sheth & Tormen 2004; Gao et al. 2005), which can lead to incorrect cosmological inferences (Cuesta-Lazaro et al. 2023; Chaves-Montero et al. 2023). Recently, some of the limitations have been addressed by extending HODs with environmental dependencies and/or secondary halo properties in addition to mass (Hearin et al. 2016; Zehavi et al. 2018; Xu et al. 2021; Hadzhiyska et al. 2022a,b). However, the greater number of parameters reduces the predictive power of the model, potentially compromising its ability to accurately predict statistical quantities beyond those explicitly tested.
It is possible to obtain a more realistic galaxy distribution by directly assuming a connection between galaxies and subhalos. Specifically, in subhalo abundance matching (SHAM) models, the most massive (luminous) galaxies are assumed to be hosted by the largest subhalos (Vale & Ostriker 2006; Shankar et al. 2006). These models are notably successful at reproducing the spatial distribution of galaxies selected based on stellar mass in both observational datasets and hydrodynamical simulations (e.g., Conroy et al. 2006; Chaves-Montero et al. 2016). In the case of SFR-selected galaxies, SHAM is relatively successful at high-z (z ∼ 2), where the fraction of quenched galaxies is small (Simha et al. 2012). However, its accuracy decreases at low redshift, where subhalo mass is a poor predictor for star formation (see e.g., Contreras et al. 2015). A potential approach to address this issue is to begin with a standard SHAM and then incorporate an additional subhalo property to model the relationship between SFR and stellar mass (Tinker et al. 2018; Favole et al. 2022).
Hydrodynamical simulations and semi-analytical models (SAMs) provide a more sophisticated way to construct mock surveys. These approaches attempt to directly model relevant physical processes, such as star formation, feedback from active galactic nuclei (AGNs) and supernovae, and metal enrichment, whose free parameters are calibrated by reproducing a set of predefined observations (see Baugh 2006; Vogelsberger et al. 2020, for reviews). However, it is still not clear how well hydrodynamical simulations and SAMs reproduce the observed SFR, especially at higher redshifts, where observations are scarce (e.g., McCarthy et al. 2017; Davé et al. 2019 for simulations and Wang et al. 2018 for SAMs). Additionally, the large computational cost of these models usually means that only a handful of variations of the underlying physical recipes and free parameters are available. Moreover, the simulated volumes are usually smaller than those available to simpler models.
Ultimately, the physics that regulates star formation and emission lines in galaxies is still uncertain, and its implementation in numerical simulations is almost certainly too simplistic. This motivates the development of ‘empirical assembly models’, which describe the relation between galaxies and their host dark matter subhalos using parametric equations that aim to be an ‘effective’ description of the relevant physics (Moster et al. 2018; Behroozi et al. 2019). For instance, Emerge (Moster et al. 2018) assumes that the SFR of a galaxy is given by the dark matter accretion rate times a baryon conversion efficiency, without reference to the specific physics responsible. The philosophy of these models is that the effective descriptions can be constrained through observations, which would then help identify the actual physical processes governing galaxy formation.
If one aims to only reproduce galaxy clustering, it is possible to further reduce the number of assumptions. An example of this is shown in the SHAMe model presented in Contreras et al. (2021a). The SFR prescription in SHAMe combines the recipes implemented in Emerge with the rank ordering philosophy of the SHAM models. This approach has several advantages. For instance, it can achieve realistic models with only a few free parameters while also being fast and computationally cheap. This allows for the scanning of millions of different subhalo-galaxy connections and varying cosmology.
The SHAMe model is a significant improvement compared to other prescriptions. Importantly, it delivers accurate predictions for the joint distribution of clustering and lensing statistics (Contreras et al. 2023a,b). However, the predictions for SFR-selected galaxies were much less accurate than for stellar-mass selections. In this work, we present SHAMe-SF, a new galaxy model that improves the SFR prescription of Contreras et al. (2021a) in order to better model the spatial distribution and velocities of the star-forming galaxies that will be detected by upcoming large-scale structure surveys.
To achieve our goal of improving on the prediction of the SHAMe model for star-forming galaxies, we first applied machine learning techniques to a hydrodynamical simulation in order to identify the most important subhalo properties for predicting the SFR of its hosted galaxy. Using this information, we then built a model that can be applied to subhalos in a gravity-only simulation. The new model populates a simulation box of side $205\,h^{-1}$ Mpc in just a few seconds. Additionally, it is flexible enough to reproduce the clustering of samples from both a hydrodynamical simulation and a semi-analytical model. The flexibility and computational efficiency of SHAMe-SF could help build realistic mocks for star-forming galaxies that span our uncertainty regarding SFR physics and for many different cosmological models. Ultimately, these features will allow SHAMe-SF to be used directly in the interpretation of upcoming galaxy redshift surveys and thus in placing constraints on the astrophysics and cosmology of such observational data.
This paper is organised as follows. In Section 2, we present the simulations and models used in this work. In Section 3, we employ a random forest (RF) algorithm to identify the best combination of subhalo properties to predict the spatial distribution of galaxies selected by SFR at z = 0 and 1. In Section 4, we use those properties to build a model that can be applied to gravity-only simulations. In Section 5, we validate our model by presenting the best fits to redshift-space clustering, whereas in Section 6, we focus on additional tests. We summarise our findings in Section 7.
2. Simulations and galaxy population models
In this section, we provide an overview of the hydrodynamical and gravity-only simulations we employed in this work, along with a description of the samples we used to develop and validate SHAMe-SF. Additionally, we introduce the Semi-Analytical Model (SAM) employed to assess SHAMe-SF’s capability to replicate the clustering of samples based on an alternate star-formation rate (SFR) prescription.
2.1. TNG300
2.1.1. Simulations
To analyse the connection between subhalo properties and SFR as well as test the performance of SHAMe-SF, we used the magneto-hydrodynamic simulation Illustris-TNG300 (Nelson et al. 2018; Springel et al. 2018; Marinacci et al. 2018; Pillepich et al. 2018; Naiman et al. 2018).
The TNG300 simulation is part of “The Next Generation” Illustris simulation suite, and it is one of the largest publicly available high-resolution hydrodynamical simulations. We use TNG300-1 (TNG300 hereafter), the run with the highest resolution for the largest box, which jointly evolves $2500^3$ dark matter particles and gas cells in a periodic box of side $205\,h^{-1}$ Mpc (∼300 Mpc). The masses of these elements are $3.98 \times 10^{7}\,h^{-1}\,M_\odot$ and $7.44 \times 10^{6}\,h^{-1}\,M_\odot$, respectively. The simulation was run using AREPO (Springel 2010) with cosmological parameters from Planck Collaboration XIII (2016). In addition, we use the TNG300-1-Dark simulation, the gravity-only counterpart of the TNG300, which has the same number of dark matter particles (each of mass $4.72 \times 10^{7}\,h^{-1}\,M_\odot$), volume, and initial conditions as TNG300.
To match subhalos in the TNG300-1-Dark to those in the TNG300, we used one of the catalogues provided by the TNG collaboration (Nelson et al. 2015; Springel et al. 2005), which was created using a bidirectional matching (subhalos are only linked if they share most of their particles between the gravity-only and baryonic runs). At z = 1, the completeness is almost 100% for central subhalos and ∼60% for satellites when considering subhalos with $M_{\rm peak} > 3 \times 10^{10}\,h^{-1}\,M_\odot$, the mass of a subhalo resolved with 10 particles in the lowest resolution run (the values are similar at z = 0).
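The bidirectional criterion can be illustrated with a minimal sketch. It assumes that a best-match candidate has already been found in each direction (for example, from shared particle IDs); the function name and the two input dictionaries are hypothetical and do not correspond to the format of the TNG catalogues.

```python
def bidirectional_matches(dark_to_hydro, hydro_to_dark):
    """Keep only pairs in which each subhalo is the other's best match.

    Both arguments map a subhalo ID in one run to the ID of its best
    candidate in the other run (hypothetical inputs; how the best
    candidate is found is not shown here).
    """
    return {
        dark_id: hydro_id
        for dark_id, hydro_id in dark_to_hydro.items()
        if hydro_to_dark.get(hydro_id) == dark_id
    }


# Toy example: dark subhalo 2 points to 11, but 11 points back to 1,
# so the pair (2, 11) is rejected.
dark_to_hydro = {0: 10, 1: 11, 2: 11}
hydro_to_dark = {10: 0, 11: 1, 12: 2}
matched = bidirectional_matches(dark_to_hydro, hydro_to_dark)
completeness = len(matched) / len(dark_to_hydro)
```

In this toy case two of the three gravity-only subhalos are matched; the completeness values quoted above correspond to the same kind of ratio, computed separately for centrals and satellites.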
As shown in Springel et al. (2018), TNG300 agrees well with observational estimates of the clustering of stellar-mass selected galaxies in SDSS. Additionally, TNG300 qualitatively reproduces the main sequence and scatter of the stellar mass-SFR relation (Donnari et al. 2019). TNG300 also reproduces the general trends in the fraction of quenched galaxies. However, it cannot replicate a strong bimodality on SFRs (Donnari et al. 2019; Zhao et al. 2020). Regarding colours, TNG300 can reproduce the shape of the red sequence and the blue cloud (Nelson et al. 2018) when compared to SDSS data, as well as the bimodality in colour.
Therefore, TNG300 provides us with a reasonably realistic population of star-forming galaxies for our work. However, we emphasise that our goal was not to perfectly reproduce this simulation. Instead, we aimed to use it to extract general physical relations between SFRs and subhalo properties. We used these relations to develop a general empirical model that could be used to describe galaxies in TNG300 but also in other physical galaxy formation models, and ultimately in observations of our Universe.
2.1.2. Number density-selected samples
We considered four different samples of galaxies at z = 0 and 1. These samples correspond to number-density selections in SFR: $n = 10^{-2}$, $10^{-2.5}$, $10^{-3}$, and $10^{-3.5}\,h^{3}\,\mathrm{Mpc}^{-3}$. The number of objects, as well as the minimum SFR value (in $M_\odot$/yr), are provided in Table 1. We note that we defined our samples in terms of number densities since SHAMe-SF predicts the rank order of star-forming galaxies and not the actual value of their SFR.
Table 1. SFR-selected samples constructed from the TNG300 hydrodynamical simulation at z = 0 and z = 1 for different number densities.
The first three samples match the number densities of those in Contreras et al. (2021a), allowing us to directly compare the performance of our model with its predecessor. The fourth sample roughly matches the abundance expected in DESI (e.g., DESI Collaboration 2016; Yuan et al. 2022); thus, we used it to test the model performance for such a survey.
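A number-density selection of this kind amounts to ranking galaxies by SFR and keeping the top N = nV objects, where V is the box volume. The following is a minimal sketch under that assumption; the function name and the toy catalogue are illustrative and are not taken from the pipeline used in this work.

```python
def select_by_number_density(sfr, number_density, box_side=205.0):
    """Return (indices, sfr_min) for the N = n * V objects with the highest SFR.

    box_side is in h^-1 Mpc and number_density in h^3 Mpc^-3; the default
    box side is the TNG300 value quoted in the text.
    """
    n_target = round(number_density * box_side**3)
    if not 0 < n_target <= len(sfr):
        raise ValueError("catalogue cannot supply the requested number density")
    # Rank objects from highest to lowest SFR and keep the first n_target.
    ranked = sorted(range(len(sfr)), key=lambda i: sfr[i], reverse=True)
    selected = ranked[:n_target]
    return selected, sfr[selected[-1]]


# Toy catalogue in a box of side 2 (volume 8): n = 0.25 selects 2 objects.
toy_sfr = [0.5, 3.0, 1.0, 2.0]
indices, sfr_min = select_by_number_density(toy_sfr, number_density=0.25, box_side=2.0)
```

Because only the ranking enters the selection, any monotonic rescaling of the SFR values leaves the selected sample unchanged, which is why a model that predicts rank order is sufficient.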
2.1.3. Colour-selected samples
As mentioned above, one of the targets of upcoming surveys will be emission line galaxies (ELGs). ELGs are detected by measuring the flux of specific lines, such as the [OII] doublet or Hα, associated with newly formed stars and their interaction with the surrounding gaseous medium.
Since the TNG300 does not provide predictions for specific emission lines, we tested the performance of our model on a sample with similar properties to DESI ELG selection. Following Hadzhiyska et al. (2021), we adopted various magnitude and colour selection criteria (DESI Collaboration 2016; Raichoor et al. 2020, see Table 2). We used synthetic SDSS magnitudes accounting for dust obscuration calculated for TNG300 galaxies by Nelson et al. (2018). This catalogue is available on the TNG database, and the dust model is outlined in the aforementioned work.
Table 2. Selection criteria used to construct a sample of DESI-like ELGs in the TNG300 hydrodynamical simulation.
Imposing these selection criteria on TNG300 galaxies, we obtained a number density of $n \approx 0.002\,h^{3}\,\mathrm{Mpc}^{-3}$ at z = 1 (we note that this lies between the two intermediate samples described in the previous subsection).
2.2. TNG300-mimic
In addition to TNG300-1-Dark, we used a gravity-only simulation with lower resolution, referred to as TNG300-mimic. This allowed us to test whether SHAMe-SF can be used with a resolution achievable in large-volume simulations.
The TNG300-mimic simulation has the same volume and initial conditions as TNG300-1-Dark, but employs only $625^3$ particles (equivalent to TNG300-3-Dark). It was carried out with an updated version of L-Gadget3 (Angulo et al. 2012; Springel et al. 2005), employed for the Millennium XXL and BACCO simulations (Angulo et al. 2021). It includes an on-the-fly identification of halos and subhalos through a Friends-of-Friends algorithm (FOF; Davis et al. 1985), and a modified version of SUBFIND (Springel et al. 2001) able to follow the history of subhalos and compute properties which are non-local in time (peak masses and velocities, accretion rates, etc.). Numerical accuracy parameters (e.g., time-integration step, softening length, mass resolution) match those employed in the BACCO simulations (see Angulo et al. 2021).
Given the identical initial conditions of TNG300-mimic and TNG300 simulations, the comparison between SHAMe and TNG samples is less susceptible to the effects of cosmic variance.
2.3. Semi-analytical model
Using TNG300 as a reference assumes a specific prescription for modelling baryonic physics. However, our goal was to develop a model capable of capturing various plausible galaxy formation scenarios rather than being confined to a single simulation. Because of this, we employed a semi-analytical model (SAM) to further validate SHAMe-SF. In this work, we employ the public version of L-Galaxies, developed by the “Munich Group” (White & Frenk 1991; Kauffmann et al. 1993, 1999; Springel et al. 2001; De Lucia et al. 2004; Croton et al. 2006; De Lucia & Blaizot 2007; Guo et al. 2011, 2013; Henriques et al. 2013, 2020), as implemented by Henriques et al. (2015). The model was executed with its default parameter set on TNG300-mimic, as in the fiducial SAM model in Contreras et al. (2023a). Subsequently, we generated galaxy catalogues analogous to those constructed from TNG300 (cf. Table 1).
3. The relation between star-formation rates and dark matter structures
Before attempting to build a physically motivated model for star-forming galaxies, we first sought to understand the relation between the properties of simulated galaxies and those of their host DM subhalos. This correlation might not be trivial due to the diverse array of physical processes at play, including gas cooling and accretion, feedback mechanisms, ram-pressure stripping, and galaxy interactions. In fact, in the literature the SFR has been linked to various DM properties such as host halo mass, halo mass accretion rate, concentration, age, and environment (Wang et al. 2013; Behroozi et al. 2013; Hahn et al. 2019; Blank et al. 2020; Zjupa et al. 2020).
Here, we employed the TNG300 catalogues together with a machine learning algorithm, specifically an RF, to discern the subset of subhalo properties that best predict the spatial distribution of star-forming galaxies. The following subsections provide details of our approach and demonstrate that the clustering of SFR-selected samples from TNG300 can be accurately predicted using only three variables: the maximum over redshift of the subhalo circular velocity (Vpeak), a concentration proxy, and the host halo mass.
3.1. Random forest
Random forest algorithms are designed to predict a set of values given a set of input variables. During the training phase, the algorithm builds several decision trees, each using a random subsample of the input data (Breiman et al. 1984). At each level of a tree, the algorithm splits the data using the input variable that minimises a given loss function (e.g., squared error). The number of splits is typically limited to maintain a minimum number of instances per leaf. Since each tree is built with a random subset of the training data, and the trees are averaged, an RF is less prone to overfitting than a single decision tree. The final prediction is obtained by passing the input through every tree and averaging the values of the training data in the leaves it reaches (Breiman 2001).
Tree-based algorithms have been used to predict (s)SFR based on subhalo properties (Kamdar et al. 2016; Agarwal et al. 2018; de Santi et al. 2022), yielding reasonably accurate predictions (albeit less precise than for stellar mass). Moreover, since RF trees are built by making divisions using one of the input variables at each step, they are well-suited for exploring the influence of input properties on predictions. Indeed, RFs have been used, for instance, to disentangle the role of secondary variables in the galaxy-halo connection. Here, we employ RFs to identify the DM variables that best predict the SFR of TNG300 galaxies (see Moster et al. 2021 for a similar approach).
3.1.1. Training
Our basic idea was to build a suite of RFs that consider different subsets of DM properties. In particular, we considered those properties listed in Table 3, containing properties computed from halo and subhalo catalogues (e.g., Mhalo, Vmax), derived from subhalo merger trees (e.g., Vpeak, halo age), as well as measured from the large-scale environment (e.g., linear bias, αpeak).
Table 3. Definitions of the subhalo properties considered for the RF.
In principle, we would like to consider all possible combinations of N subhalo properties. However, this quickly becomes computationally infeasible. As a pragmatic approach, we only focused on the best-performing set of N variables as a basis for building RFs with N + 1 properties. Concretely, we started by considering all subhalo properties when building RFs with one variable, but only kept the best-performing properties as a basis to build the two-variable RFs, and so forth.
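The incremental search described above can be sketched as a greedy forward selection. The sketch below is only illustrative: the function name, the number of subsets kept at each step (`n_keep`), and the toy scoring function are assumptions, with the toy score standing in for "train an RF on this subset and measure the χ2 of its clustering prediction".

```python
def greedy_forward_selection(properties, score, max_size=3, n_keep=2):
    """Grow property subsets one variable at a time, keeping the n_keep best.

    `score` maps a tuple of property names to a number (lower is better,
    e.g. a chi^2). Returns the best subset and its score at every size.
    """
    kept = [()]
    best_per_size = []
    for _ in range(max_size):
        candidates = {
            tuple(sorted(base + (prop,)))
            for base in kept
            for prop in properties
            if prop not in base
        }
        # Sort by score, breaking ties by name so the result is deterministic.
        ranked = sorted(candidates, key=lambda subset: (score(subset), subset))
        kept = ranked[:n_keep]
        best_per_size.append((kept[0], score(kept[0])))
    return best_per_size


# Toy stand-in for the real score: each property removes a fixed,
# made-up amount from a baseline value.
toy_gain = {"Vpeak": 5.0, "Mhalo": 3.0, "Vpeak/Vvir": 1.0, "Age": 0.5}


def toy_score(subset):
    return 10.0 - sum(toy_gain[name] for name in subset)


history = greedy_forward_selection(list(toy_gain), toy_score)
```

With N candidate properties, this evaluates of order `n_keep` × N subsets per step instead of all possible combinations, at the price of possibly missing a good combination whose individual members perform poorly on their own.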
In all cases, we employed the cross-matched catalogues between TNG300-1-Dark and TNG300-1 (cf. Section 2.1). We applied a threshold $M_{\rm peak} = 3 \times 10^{10}\,h^{-1}\,M_\odot$, corresponding to objects resolved with at least 10 particles in TNG300-3-Dark and TNG300-mimic. To build the RFs, we used the publicly available code Scikit-Learn (Pedregosa et al. 2012). On a single CPU, the training time varied between 1 and 10 minutes per RF, depending on the number of variables considered.
3.1.2. Performance metric
We split the simulation volume in two halves: one half was used for training the RFs while the other was used for assessing its performance.
Specifically, for each RF, we computed the predicted SFR in subhalos within the validation half of the TNG300-mimic. We then computed the resulting monopole and quadrupole of the redshift-space correlation function. These measurements, jointly denoted as $V_{\rm RF}$, are compared with those obtained from the TNG300, $V_{\rm TNG300}$, via

$$\chi^2 = \left(V_{\rm RF} - V_{\rm TNG300}\right)^{T} C_v^{-1} \left(V_{\rm RF} - V_{\rm TNG300}\right), \tag{1}$$

where $C_v$ is the covariance matrix of the TNG300 measurements, estimated with 27 jackknife samples (Zehavi et al. 2002; Norberg et al. 2009), with 5% of the signal added to the diagonal to account for other sources of systematic error. To build the jackknife resamplings, we computed the cross-correlation between each division and the full sample (to avoid boundary effects). We used 27 jackknife samples (removing sub-boxes of ∼70 Mpc/h side) since this is the maximum number of divisions that keeps the size of each sub-volume above the scales we measure. We computed the $\chi^2$ associated with a given RF as the average value over two samples, $n = 10^{-2}$ and $10^{-2.5}\,h^{3}\,\mathrm{Mpc}^{-3}$, at both z = 0 and z = 1.
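As an illustration of this metric, the sketch below computes a delete-one jackknife covariance and the corresponding $\chi^2$ with NumPy. It rests on two assumptions that are not spelled out in the text: the standard jackknife normalisation $(N-1)/N$, and the interpretation of "5% of the signal" as a term $(0.05\,V_{\rm TNG300})^2$ added to the diagonal of the covariance. The function names are hypothetical.

```python
import numpy as np


def jackknife_covariance(realisations):
    """Covariance from delete-one jackknife realisations (one per row)."""
    realisations = np.asarray(realisations, dtype=float)
    n_jk = realisations.shape[0]
    deviations = realisations - realisations.mean(axis=0)
    # Standard jackknife normalisation (assumed here): (N - 1) / N.
    return (n_jk - 1) / n_jk * deviations.T @ deviations


def chi2(v_model, v_ref, cov, frac_sys=0.05):
    """chi^2 with an extra (frac_sys * signal)^2 term on the diagonal.

    Treating the 5% term as a variance is an assumption of this sketch.
    """
    v_model = np.asarray(v_model, dtype=float)
    v_ref = np.asarray(v_ref, dtype=float)
    cov_total = np.asarray(cov, dtype=float) + np.diag((frac_sys * v_ref) ** 2)
    residual = v_model - v_ref
    # Solve C x = r instead of inverting C explicitly.
    return float(residual @ np.linalg.solve(cov_total, residual))


# Toy check: with no statistical covariance, a 5% offset in one bin
# corresponds to chi^2 = 1.
toy_chi2 = chi2([2.1, 1.0], [2.0, 1.0], np.zeros((2, 2)))
toy_cov = jackknife_covariance([[1.0, 2.0], [3.0, 4.0]])
```

In practice, the data vector would concatenate the monopole and quadrupole bins, and the final figure of merit would be the average of this quantity over the two number densities and two redshifts.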
3.2. Key subhalo properties
When considering more than one property, the best combinations of two properties were found to be [Vpeak, Mhalo] followed by [Mpeak, Mhalo]. It is interesting to note that several previously mentioned models assumed Ṁ to be a good proxy for SFR. This was also the case in our analysis, but only when considering a single subhalo property – Ṁ was not among the best models when the number of subhalo properties was increased.
When incorporating a third variable, Vpeak/Vvir, Vmax/V200, tinfall (concentration proxies), , and Age emerged as the most informative additional properties when combined with [Vpeak, Mhalo]. Here we highlight that the combination [
] is used in UniverseMachine (Behroozi et al. 2019). Quantitatively, including the third subhalo property produced only minor improvements in the predicted clustering when compared to models with two properties. Similarly, when including four or more subhalo properties, we find almost no improvement in the clustering predictions.
3.2.1. Performance of the algorithm
We now illustrate the performance of our RFs at predicting the SFR of individual subhalos and the clustering of SFR-selected samples. Specifically, we will display the predictions of our best-performing RF with three properties (Vpeak, Mhalo and Vpeak/Vvir) and its “root” models with one and two fewer properties.
Fig. 1 displays the SFR for central (left) and satellite (right) subhalos, as measured in the validation set of TNG300 and as predicted by our RFs. Filled symbols with error bars show the median and standard deviations in bins of TNG300 SFRs. Additionally, in the legend we quote the mean logarithmic difference and the mean logarithmic error. Vertical blue lines indicate the two number density cuts used to quantify the performance of each RF.
Fig. 1. Comparison between the true SFR values (TNG300, x-axis) and the ones predicted by the RF (y-axis) for the validation sample at z = 1 using one of the best three models with three subhalo properties (bottom) and the model with only Vpeak (top). Blue lines show the SFR values that define the two SFR-selected number-density cuts used for validation.
We can see that the active subhalos are fairly well predicted by our 1-variable RF using Vpeak, performing somewhat better for central compared to satellite subhalos. Adding additional properties reduces the scatter, by about 25% for centrals and by 35% for satellites, and halves the bias in the mean predicted SFR. This is an important validation that our RFs are learning the connection between SFR and subhalo properties, also highlighting that the increased number of variables adds non-redundant information.
We note that both RFs underpredict the SFR of the most star-forming galaxies. This might not be surprising considering the influence of the distribution of training values on the final model, where these galaxies are under-represented. RF and other mean-based predictors tend to mispredict the tails of the target distribution (as discussed in de Santi et al. 2022, which includes an analysis of different machine learning methods and the impact of modifications to the input data). To achieve a better modelling of extreme star-forming galaxies, alternative methods capable of capturing the tails of distributions would be necessary. For instance, one possibility is modelling the probability distribution for each value of the input properties, instead of only returning the mean value (e.g., Rodrigues et al. 2023).
We further explore the accuracy of the RF predictions in Fig. 2, where we show the clustering at z = 0 and z = 1 for both of our validation samples, $n = 10^{-2}$ and $10^{-2.5}\,h^{3}\,\mathrm{Mpc}^{-3}$. In addition to the previous one- and three-variable RFs, we include a third RF with two variables for comparison. In Appendix B we expand on the comparison between these and other RFs in terms of their respective $\chi^2$ values (cf. Eq. 1).
Fig. 2. Difference between the predicted and TNG monopole (top block) and quadrupole (bottom block) of the redshift-space cross-correlation function between the test half of the box and the whole box. We show three models with different numbers of subhalo properties. We tested two redshifts, z = 0 (left) and z = 1 (right), and two number densities for each statistic ($n = 10^{-2}\,h^{3}\,\mathrm{Mpc}^{-3}$, upper panel of each block, and $n = 10^{-2.5}\,h^{3}\,\mathrm{Mpc}^{-3}$, lower panel). The grey-shaded intervals mark the 1σ regions. The yellow-shaded region ($r < 0.316\,h^{-1}$ Mpc) is not used to compute the $\chi^2$ (see Appendix B).
We can see a significant improvement in the predicted clustering of the three-variable RF when compared to that built solely on Vpeak. As we will discuss later, we argue that the improvement reflects the RF learning a quenching mechanism and the nature of the scatter for active galaxies. This interpretation is in agreement with the two-variable RF being statistically consistent with the three-variable RF. Although not shown here, the three best three-variable RFs all perform at a similar level of accuracy.
4. The SHAMe-SF model
The previous section showed that the clustering of SFR-selected samples from TNG galaxies appears to be mostly determined by three subhalo properties. We now discuss how these properties can be used to build a flexible model that predicts the clustering of star-forming galaxies.
The first aspect to consider is that the relation between DM properties and SFR can be highly nonlinear and difficult to disentangle. For this reason, we start by examining and modelling the SFR for central galaxies, and later explore enhancements required for satellite galaxies. Even if we focus on the subhalo properties selected by the RF, we are open to including additional properties for implementation purposes to avoid complex analytical dependencies.
4.1. Star-formation rate predictions for central galaxies
We start by noting that RFs containing Vpeak performed slightly better than those using Mpeak as the primary property. This might be because Vpeak is more sensitive to inner regions of DM haloes compared to Mpeak. Consequently, we will select Vpeak to build the SHAMe-SF model. Nonetheless, we expect that a model based on Mpeak would lead to similar results.
The middle panel of Fig. 3 displays the distribution of [Vpeak, SFR] values for central subhalos with SFR > $10^{-4}\,M_\odot$/yr in the TNG300 at z = 1. The total number of subhalos as a function of Vpeak in various samples, as indicated by the legend, is shown in the top panel.
Fig. 3. Vpeak-dependence of the star-formation rate. Top: Vpeak distribution for the different number densities analysed in this work: $n = 10^{-2}\,h^{3}\,\mathrm{Mpc}^{-3}$ (orange, solid), $n = 10^{-2.5}\,h^{3}\,\mathrm{Mpc}^{-3}$ (green, dashed), $n = 10^{-3}\,h^{3}\,\mathrm{Mpc}^{-3}$ (blue, dash-dotted), and $n = 10^{-3.5}\,h^{3}\,\mathrm{Mpc}^{-3}$ (pink, dotted). The distribution for all the subhalos is shown in grey. Centre and bottom: Vpeak versus SFR distribution of TNG300 (centre) and the SHAMe-SF model (bottom) for central subhalos at z = 1. The colour-coding represents the median of Vpeak/Vvir-defined intervals. The solid, dashed, dash-dotted, and dotted lines mark the values of the SFR that define each number density in the top panel.
Firstly, we note that the mean Vpeak–SFR relation, indicated by a thick black line, is non-monotonic. Qualitatively, this relation reflects that SFR increases with the size of the DM host due to the increase in available cold gas. This relation breaks at the point where the central supermassive black hole and its accretion rate are large enough for AGN feedback to become effective, which quenches the galaxy and decreases its SFR. For even larger host haloes, feedback in TNG300 seems to become incapable of counteracting the effect of star formation, and the SFR increases again for subhalos with log(Vpeak [km/s]) > 2.5. Neglecting the SFR increase at high Vpeak, the overall Vpeak–SFR relation can be described by a broken power law (Eq. 2), where β and γ are the slopes on either side of V1, the value of Vpeak at the turnover of the relation.
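To make the role of β, γ, and V1 concrete, the sketch below evaluates one common smooth double power-law form, SFR ∝ 2 / [(Vpeak/V1)^(−β) + (Vpeak/V1)^γ]. This specific functional form and its normalisation are assumptions chosen for illustration and are not necessarily the expression adopted in Eq. (2); the parameter values in the example are arbitrary.

```python
import math


def broken_power_law(v_peak, v1, beta, gamma):
    """Smooth double power law, normalised to 1 at v_peak = v1.

    Rises as v_peak**beta well below v1 and falls as v_peak**(-gamma)
    well above it. The functional form is an assumption of this sketch.
    """
    x = v_peak / v1
    return 2.0 / (x ** (-beta) + x ** gamma)


# Arbitrary illustrative parameters: turnover at 100 km/s, slopes 2 and 1.
at_turnover = broken_power_law(100.0, v1=100.0, beta=2.0, gamma=1.0)
below = broken_power_law(50.0, v1=100.0, beta=2.0, gamma=1.0)
above = broken_power_law(1000.0, v1=100.0, beta=2.0, gamma=1.0)

# Logarithmic slope far below the turnover; it should approach beta.
low_slope = math.log(
    broken_power_law(0.2, 100.0, 2.0, 1.0) / broken_power_law(0.1, 100.0, 2.0, 1.0)
) / math.log(2.0)
```

Since the model is later used only through the rank ordering of SFR values, the overall normalisation of such a relation does not matter; only its shape, set by β, γ, and V1, does.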
Secondly, we find a large scatter (∼0.5 dex) in the Vpeak–SFR distribution. After considering all the properties identified by the RF algorithm in addition to Vpeak, we find that Vpeak/Vvir strongly correlates with deviations from the median Vpeak–SFR relation. We note that other properties, such as subhalo age, could also parametrise the deviations from the main relation, but their effect was subtler. We performed the same tests for the Mpeak–SFR relation, but the combination of Vpeak and Vpeak/Vvir returned the best description of the SFR distribution.
In Fig. 3 we also show the mean relations in bins of Vpeak/Vvir, as labelled by the vertical colour bar. In the case of NFW profiles, Vpeak/Vvir can be directly linked to the concentration parameter. We therefore see that less concentrated subhalos host galaxies with higher SFRs. We interpret this as follows: galaxies populating more concentrated subhalos tend to form earlier, run out of gas, and thereby exhibit lower SFRs (Lin et al. 2020).
To implement the previously discussed dependencies into our empirical model, we proceeded as follows. First, we computed the SFR as a function of Vpeak using three free parameters: β, γ, V1 (cf. Eq. 2). Next, we added a random scatter given by a Gaussian with width σ (another free parameter) constant in log(SFR). Finally, for a fixed value of Vpeak, we semi-sorted the scatter using Vpeak/Vvir following the method described in Contreras et al. (2021b), which yields a distribution ordered with some random noise (given by the free parameter fk). Central and satellite subhalos were sorted separately to keep the satellite fraction constant. We refer to this as (SFR|(Vpeak, Vvir)).
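The scatter and semi-sorting steps can be illustrated with the sketch below. It is not the algorithm of Contreras et al. (2021b): the blending of true and random ranks, and the convention that the sign of `f_k` sets the direction of the correlation, are assumptions chosen to show the idea of an ordering with some random noise.

```python
import numpy as np

def semi_sorted_scatter(secondary, sigma, f_k, rng):
    """Gaussian scatter in log(SFR), partially rank-ordered by `secondary`.

    Hypothetical convention: f_k = +1 (-1) gives scatter perfectly
    correlated (anti-correlated) with `secondary`; f_k = 0 is fully random.
    """
    n = len(secondary)
    scatter = np.sort(rng.normal(0.0, sigma, size=n))          # ascending values
    rank = np.argsort(np.argsort(secondary)) / max(n - 1, 1)   # ranks in [0, 1]
    # Blend the true rank with a random one to obtain a noisy ordering.
    noisy = abs(f_k) * rank + (1.0 - abs(f_k)) * rng.random(n)
    order = np.argsort(np.argsort(noisy))                      # noisy integer ranks
    if f_k < 0:
        order = n - 1 - order                                  # reverse the ordering
    return scatter[order]

rng = np.random.default_rng(0)
vpeak_over_vvir = rng.uniform(1.0, 1.6, size=1000)  # mock values in one Vpeak bin
delta_log_sfr = semi_sorted_scatter(vpeak_over_vvir, sigma=0.5, f_k=-1.0, rng=rng)
```

In a full implementation this function would be called at fixed Vpeak (for instance in narrow Vpeak bins) and separately for central and satellite subhalos, with the returned values added to the logarithm of the mean relation.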
We show the predicted distribution of SFRs and their dependence on Vpeak/Vvir in the lower panel of Fig. 3 (using a preliminary set of parameters). We see a remarkable similarity with the results from TNG300, with the possible exception of subhalos with log Vpeak > 2.6. As mentioned earlier, this feature might originate from inefficient AGN feedback in TNG300. We note, however, that this feature is absent in other models (see Contreras et al. 2015 for some SAM examples). These differences also appear when comparing models to observations, and arise from the implementation of quenching processes and environmental effects (Popesso et al. 2015). However, those high-Vpeak subhalos are very rare and thus subdominant for clustering predictions. Therefore, we have not adopted a more complex model by default, although SHAMe-SF can be easily extended in the future to incorporate such a feature.
4.2. Star-formation rate predictions for satellite galaxies
As discussed earlier, we expect the modelling of SFR in satellite galaxies to be more complex than that of centrals. When a galaxy becomes a satellite, additional physics is at play: tidal forces start stripping its cold gas, which finally quenches the galaxy (generally after the first pericentre passage, Orsi & Angulo 2018).
To model this additional complexity, we first explore if the difference in SFR between our model so far and TNG300 correlates with the host halo mass, the second property selected by the RF. This is shown in the right panel of Fig. 4 where we display the average ratio between the SFR in our model and in TNG300. We show results for satellite subhalos in bins of Mhalo, as labelled in the top colour bar. We note that we plot results for a fixed range in Vpeak and Vpeak/Vvir to avoid dependencies on the relation already modelled.
Fig. 4. Right: For satellites, mean difference (in log scale) between the TNG300 values and those predicted using only Eq. (2) and the semi-sorted scatter, for a subsample of galaxies in a fixed [Vpeak, Vpeak/Vvir] interval and for different Mhalo values (colour coding). Left: The same, but plotted as a function of aMpeak (in scale factor units), maintaining the Mhalo intervals. The black thick line shows the behaviour when the whole sample is considered. The circle represents the mean value for central galaxies with aMpeak ∼ 1.
We clearly see that our model roughly predicts the correct SFR of satellites hosted by halos of 1011 h−1 M⊙ but significantly overpredicts it in larger haloes. For instance, for halo masses of 1014 h−1 M⊙, SFR values in TNG300 are about a factor of 10 smaller than in our model.
This is not surprising since more massive haloes quench galaxies faster, mainly because their intra-cluster medium (ICM) strips the cold gas from which galaxies form stars (Chaves-Montero et al. 2016), a process absent from our model so far.
To further explore the modelling of quenching, we plot in the left panel of Fig. 4 the ratio of the model and TNG300 SFR as a function of aMpeak (the time, in units of the scale factor, when the subhalo reached its peak mass). We show results in bins of host halo mass at z = 0, where quenching processes are more statistically significant (Donnari et al. 2020). However, similar trends appear at z = 1.
We can clearly see that galaxies with aMpeak ∼ 1 (i.e. those recently accreted) have SFRs similar to those of centrals. Smaller values of aMpeak typically correspond to more strongly quenched galaxies. This is consistent with environmental processes in clusters, where a higher accretion redshift results in more efficient stripping. Although not shown here, other galaxy samples and bins in Vpeak/Vvir display consistent results.
We found that we could parametrise the decrement of SFR as a power-law with a different slope, depending on the host halo mass:
where az is the value of the expansion factor and αMhalo is given by
where α0, αexp, and Mcrit are free parameters. We forced αMhalo > 0 to avoid enhancing the SFR of satellites in lower-mass halos. We performed the same test for central subhalos, finding the same behaviour for subhalos whose peak time was in the past (between 5% and 15% of the central subhalos, depending on the Vpeak interval considered).
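The structure of this suppression can be sketched as below. The power law in aMpeak/az and the floor of the slope at zero follow the description in the text, but the dependence of the slope on host halo mass used here (linear in log Mhalo) is purely an assumption for illustration and should not be read as Eq. (4).

```python
import numpy as np

def satellite_quenching_factor(a_mpeak, a_z, mhalo, alpha0, alpha_exp, mcrit):
    """Multiplicative SFR suppression for subhalos past their peak mass.

    ASSUMED form: (a_mpeak / a_z) ** alpha, with
    alpha = max(0, alpha0 + alpha_exp * log10(mhalo / mcrit)).
    """
    alpha = alpha0 + alpha_exp * np.log10(np.asarray(mhalo, dtype=float) / mcrit)
    alpha = np.clip(alpha, 0.0, None)   # slope floored at zero: never enhance the SFR
    return (np.asarray(a_mpeak, dtype=float) / a_z) ** alpha

# Illustrative numbers only: three satellites at z = 0 (a_z = 1).
factor = satellite_quenching_factor(
    a_mpeak=np.array([1.0, 0.5, 0.5]),
    a_z=1.0,
    mhalo=np.array([1e14, 1e11, 1e14]),   # h^-1 Msun
    alpha0=0.0, alpha_exp=1.0, mcrit=1e12,
)
```

With these mock inputs, a recently accreted satellite (aMpeak = az) and a satellite in a low-mass host are left unchanged, while an early-accreted satellite in a massive host has its SFR reduced.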
We highlight that aMpeak was not on the list of best properties suggested by our RF analyses. However, we recall that our performance metric was given by general clustering statistics, for which the mass-dependent quenching of satellite galaxies might have a minor effect, especially when considering the statistical noise in the TNG300 catalogues (we compare an RF that includes this property in Appendix B). Given the clear trends shown in Fig. 4, we decided to include it in the model, foreseeing additional applications to observables more sensitive to intra-halo quenching.
Summarising, SHAMe-SF models the SFR of galaxies as a function of Vpeak (Eq. 2). Deviations from this mean relation are assumed to partially correlate with Vpeak/Vvir. The SFR in satellites is further modulated as a function of the host halo mass and time since accretion (Eqs. 3 and 4).
5. Redshift space clustering
In this section, we examine the ability of SHAMe-SF to describe the projected and redshift-space correlation function of SFR-selected samples. First, we present an emulator for SHAMe-SF. Then, we describe our fitting procedure and show the results for z = 1 for TNG300, SAM, and DESI-like samples. We finalise with a discussion on the impact of the free parameters of our model.
5.1. SHAMe-SF emulator
Creating one SHAMe-SF mock on the TNG300-mimic (Section 2.2) takes around 1 CPU second, needing an additional 2 seconds to compute the clustering statistics (for details see Appendix A). While reasonably fast, this becomes computationally expensive for Monte-Carlo analyses.
Following Aricò et al. (2019, 2021a), Angulo et al. (2021), Zennaro et al. (2021), Pellejero Ibañez et al. (2023), and Contreras et al. (2023a), we sped up the evaluation of the model by constructing neural network emulators. Specifically, we used the same architecture as Contreras et al. (2023a): two (three) fully connected hidden layers with 200 neurons each. Alternative configurations yielded comparable performance. To build the emulators we used the Keras front-end of the TensorFlow library. We chose the Adam optimisation algorithm with a learning rate of 0.001 and a mean squared error loss function. We left 10% of the data set for validation.
We built 18 different emulators, one for each clustering statistic (projected correlation function, monopole, and quadrupole), number density (n = {10−2, 10−2.5, 10−3} h3 Mpc−3), and redshift (z = 0 and 1). We trained our emulators by measuring the clustering for 50 000 sets of SHAMe-SF parameters generated by a Latin-Hypercube algorithm over the range
For each parameter set, we averaged the predictions of four mocks generated by different random realisations of the semi-sorting step. We refer the reader to Section 4 for further details on this and the rest of the model parameters.
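For readers unfamiliar with Latin-hypercube designs, a minimal NumPy sketch is given below. The function name, the two parameter ranges, and the sample size are hypothetical; the actual training ranges of the emulator are not reproduced here.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng):
    """Latin-hypercube sample: one point per stratum along every dimension."""
    bounds = np.asarray(bounds, dtype=float)          # shape (n_dim, 2)
    n_dim = bounds.shape[0]
    # Stratum index of each sample, shuffled independently in every dimension.
    strata = np.stack([rng.permutation(n_samples) for _ in range(n_dim)], axis=1)
    unit = (strata + rng.random((n_samples, n_dim))) / n_samples   # in [0, 1)
    return bounds[:, 0] + unit * (bounds[:, 1] - bounds[:, 0])

# Hypothetical ranges for two parameters (not the ranges used for the emulator).
rng = np.random.default_rng(2)
bounds = [(0.0, 10.0), (-1.0, 1.0)]
samples = latin_hypercube(50, bounds, rng)
```

Each of the 50 equal-width intervals along each axis contains exactly one sample, which covers the parameter space more evenly than independent uniform draws of the same size.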
The accuracy of our emulators is shown in Fig. 5, which compares their predictions with measurements in the validation set for the densest galaxy sample at z = 1. We display the difference in units of the statistical uncertainty, σjn, as estimated from 27 jackknife samples of the TNG300 plus 5% of the signal in the diagonal (shown by the orange line). We renormalise σjn by the ratio between the SHAMe-SF and the TNG300 clustering to account for the differences between the amplitudes of the predictions. The emulator uncertainty is typically smaller than the statistical precision of TNG300 in all cases, which also holds for all the other samples not shown.
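A delete-one jackknife uncertainty with an additional fraction of the signal can be sketched as follows. The jackknife variance is the standard estimator; adding the 5% term in quadrature is an assumption of this sketch, and the mock arrays are invented for illustration.

```python
import numpy as np

def jackknife_sigma(stat_jk, stat_full, frac=0.05):
    """Delete-one jackknife error with an extra fraction of the signal.

    stat_jk  : (n_jk, n_bins) statistic measured with each subvolume removed
    stat_full: (n_bins,) statistic measured in the full volume
    ASSUMPTION: the extra `frac` of the signal is added in quadrature.
    """
    stat_jk = np.asarray(stat_jk, dtype=float)
    n_jk = stat_jk.shape[0]
    dev = stat_jk - stat_jk.mean(axis=0)
    var_jk = (n_jk - 1) / n_jk * np.sum(dev ** 2, axis=0)
    return np.sqrt(var_jk + (frac * np.asarray(stat_full, dtype=float)) ** 2)

rng = np.random.default_rng(1)
xi_full = np.array([10.0, 3.0, 1.0])                      # mock signal
xi_jk = xi_full + rng.normal(0.0, 0.01, size=(27, 3))     # 27 mock jackknife samples
sigma_jn = jackknife_sigma(xi_jk, xi_full)
```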
Fig. 5. Difference between the emulator prediction and the real mock weighted by the scaled TNG300 error for the highest number density.
5.2. Parameter fitting
Once we had built the emulators, we employed a Markov Chain Monte-Carlo (MCMC) algorithm to find the SHAMe-SF parameters that best describe the clustering of various samples. We jointly considered the projected correlation function, and the monopole and quadrupole of the redshift-space correlation function, over the range r = [0.6, 30] h−1 Mpc. We assumed the likelihood of the model parameters to be given by a multivariate Gaussian, logℒ = −χ2/2, where χ2 is computed using Eq. (1). For simplicity, we employed the covariance matrix described previously, discarding off-diagonal terms. We checked that our results were insensitive to using the full covariance matrix.
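With off-diagonal terms discarded, the likelihood above reduces to a sum of squared, error-normalised residuals over the concatenated data vector. A minimal sketch, assuming this is how χ2 is evaluated in the diagonal case and using invented numbers, is:

```python
import numpy as np

def log_likelihood(model, data, sigma):
    """Gaussian log-likelihood (up to a constant) with a diagonal covariance."""
    model, data, sigma = (np.asarray(a, dtype=float) for a in (model, data, sigma))
    chi2 = np.sum(((model - data) / sigma) ** 2)
    return -0.5 * chi2

# Mock data vector standing in for the concatenated wp, monopole, and quadrupole bins.
data = np.array([100.0, 20.0, 5.0, -2.0])
sigma = np.array([10.0, 2.0, 1.0, 1.0])
model = np.array([110.0, 20.0, 4.0, -2.0])
logl = log_likelihood(model, data, sigma)   # chi2 = 2, so logl = -1
```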
We sampled the posterior distribution function using an ensemble MCMC algorithm as implemented in emcee (Foreman-Mackey et al. 2013). We used 1000 chains with 40 000 steps each, discarding the first 10 000 as a burn-in phase. This somewhat unusual configuration is efficient for our emulator since the computational cost of evaluating a large number of points simultaneously is low (100 000 samples are evaluated in 2 seconds). We ran an MCMC for each number density, which takes 90 CPU minutes, and utilised flat priors on the model parameters across the full range where the emulator was trained.
Given the small volume of TNG300, the measured clustering of sparse samples has considerable stochastic noise, which complicates the emulator training. For this reason, in these cases we employed a particle swarm optimisation (PSO, Kennedy & Eberhart 1995) algorithm. We used the implementation from Aricò et al. (2021b) to find the best-fitting SHAMe-SF parameters for the sample with n = 10−3.5 h3 Mpc−3, for samples at z = 1.5, and for our DESI-like ELG sample.
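To illustrate the idea behind PSO, a generic textbook variant is sketched below. It is not the implementation of Aricò et al. (2021b); the function name, the hyperparameters (`inertia`, `c_personal`, `c_global`), and the toy χ2 surface are all assumptions made for this example.

```python
import numpy as np

def pso_minimise(func, bounds, n_particles=30, n_iter=200, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimal particle swarm optimiser (generic textbook variant)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    pos = rng.uniform(lo, hi, size=(n_particles, lo.size))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                              # best point seen by each particle
    best_val = np.array([func(p) for p in pos])
    g = best_pos[np.argmin(best_val)].copy()           # best point seen by the swarm
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = (inertia * vel + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g - pos))
        pos = np.clip(pos + vel, lo, hi)               # keep particles inside the bounds
        val = np.array([func(p) for p in pos])
        improved = val < best_val
        best_pos[improved], best_val[improved] = pos[improved], val[improved]
        g = best_pos[np.argmin(best_val)].copy()
    return g, best_val.min()

def toy_chi2(p):
    """Toy surface with its minimum at (1, -2); a stand-in for the model chi2."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

best, chi2_min = pso_minimise(toy_chi2, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Because each particle only needs function evaluations and no gradients, this kind of optimiser can be run directly on noisy mock measurements without training an emulator first.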
5.3. Redshift space galaxy clustering predictions
We now present the SHAMe-SF models that best-fit the clustering of our mock galaxy samples.
5.3.1. TNG300
Results for z = 1 are shown in Fig. 6 for the projected correlation function (top), monopole (middle), and quadrupole (bottom). We display the results for different number densities using line styles as indicated by the legend. Higher number densities have been displaced on the y-axis for visualisation purposes. The black lines and grey-shaded regions represent the TNG300 measurements and their uncertainties, respectively. The lower panel shows the difference between the TNG300 prediction and our best fit, in units of the uncertainty.
Fig. 6. Projected correlation function (wp), monopole, and quadrupole of the galaxies of the TNG300 simulation (black, 1σ interval in grey) and the model (red) for an SFR-selected subsample with number densities n = 10−2 (solid), 10−2.5 (dashed), 10−3 (dash-dotted), and n = 10−3.5 (dotted) h3 Mpc−3 at z = 1. The three lower number densities have been shifted on the upper panel along the y-axis for visualisation purposes. The lower panel shows the difference between the TNG300 measurements and the model fits, normalised by the uncertainty of the TNG300 measurements, with the grey-shaded region indicating 1σ. The green-shaded region (r < 0.6 h−1 Mpc) indicates the region not considered in the parameter fitting.
Our model can describe the TNG300 clustering prediction within 1σ for all statistics and number densities. This is remarkable, considering the difference in resolution between TNG300 and TNG300-mimic, and that the latter is a gravity-only simulation. We note that the accuracy is better for higher number densities where cosmic variance (which is suppressed as TNG300 and its TNG300-mimic have identical initial conditions) dominates the error budget compared to sparser samples where Poisson noise would be more important. It is interesting to note that we find fits of similar quality as those using RF predictions (cf. Fig. 2), despite modelling simplifications in SHAMe-SF. This implies that our model is effectively capturing all the information about SFR contained in the properties of DM structures.
As mentioned earlier, we only fit scales above 0.6 h−1 Mpc (smaller scales are marked by the green-shaded region). This was motivated by Contreras et al. (2021a), who found that SHAMe performed accurately and robustly above those separations. In our case, we see that the best-fit model is still a good description of the data down to 300 h−1 kpc, which offers the possibility of describing even smaller scales than in the original SHAMe model. This is also the case for other redshifts and samples (see Appendix C).
Although we do not show a direct comparison here, we note that, in general, SHAMe-SF performs better than the original SHAMe, delivering more accurate fits, especially for number densities below 10−2.5 h3 Mpc−3.
5.3.2. SAM
Before carrying out MCMC fits with SAM samples, we confirmed that the basic relations between subhalo properties and SFR discussed in Section 4 follow similar trends as in the TNG300. We discuss these relations in further detail in Appendix D. The main difference between SAM and TNG300 predictions occurs in subhalos with large Vpeak values, where quenching is not as evident as in TNG300: SAM galaxies show SFRs that, on average, monotonically increase with Vpeak. We expect, however, that our model is flexible enough to capture this feature (by considering negative values of γ, the second slope of the SFR–Vpeak relation).
Therefore, we expected SHAMe-SF to provide a reasonable description of the clustering in our SAM samples. This is indeed the case, as shown in Fig. 7, which compares the best-fit SHAMe-SF with the clustering of samples at z = 1. As in the case of TNG300, we find that our model is statistically consistent with the data over the full range of statistics and scales included in the fit.
Fig. 7. Same as Figure 6 but for galaxies from the LGalaxies semi-analytical model (black, 1σ interval in grey) instead of the TNG300 simulation. Mock fit is shown in blue.
5.3.3. DESI-like ELG galaxies
Since the goal of our model is to reproduce the clustering of ELG-like samples, we also fit the clustering of a sample with similar selection criteria to DESI (see Section 2.1.3). We present our results in Fig. 8.
Fig. 8. Same as Figure 6 but for galaxies applying the colour cuts from DESI (see Table 2) to TNG300 galaxies at z = 1.
As in previous cases, we can reproduce the clustering measurements within statistical uncertainties. This opens up the remarkable possibility of using SHAMe-SF to analyse the clustering of observational samples. Our model would be capable of employing most of the scale range in the data, providing constraints on model parameters that could inform the galaxy-subhalo connection and constrain star formation physics. Additionally, it could place constraints on cosmological parameters in the highly nonlinear regime.
As a future challenge, we aim to test whether the model is flexible enough to replicate the clustering of subsamples selected directly based on the intensity of emission lines.
5.4. About model freedom
SHAMe-SF has 8 free parameters that control the relationship between halo and subhalo properties and the expected SFR of galaxies. Next, we discuss how the number of model parameters impacts the clustering predictions. More concretely, we investigate the performance of SHAMe-SF when one parameter is fixed.
Using the SHAMe-SF emulator, we repeated our MCMC analysis fixing each parameter to four values within the range of our emulator. We did this for z = 0 and z = 1 and the two highest number densities. To quantify the performance of each case, we compared the respective χ2 value with that of the best-fitting model when all parameters are varied simultaneously.
We found that SHAMe-SF performs similarly when fixing the parameters that control the SFR–Vpeak relation (β, γ, σ), or those that regulate the quenching of satellite galaxies (α0 or αexp). This holds regardless of the specific values at which each parameter is fixed. We can understand this in terms of internal degeneracies. For instance, a large scatter in the SFR–Vpeak relation would be equivalent to weak slopes. Additionally, we expect that the details of satellite quenching are not important, given that its dependence on subhalo properties was not detected in our RF analysis (cf. Section 3.2).
On the contrary, we found that freedom in the position of the peak of the SFR–Vpeak relation (V1) was crucial to obtain accurate predictions. This is consistent with our RF analysis (and our model implementation), which determined that an SFR selection can be regarded as a Vpeak selection at first order. Similarly, we find that the halo mass where the quenching starts acting (Mcrit) is important in delivering good fits, especially at small scales. This is because quenching processes mostly affect the abundance and distribution of satellites.
Finally, the semi-sorting parameter has a strong effect on the clustering when fixed to random values. Small variations on this parameter can even double the value of the χ2. The effect of fk is partly compensated by other parameters, mainly reducing the distribution width to minimise its importance, especially for higher number densities. Lower values of σ allow for more freedom in fk since it reduces its effect on the final distribution.
It is important to take into account that we only analysed two number densities, two redshifts, and the specific case of TNG300: even if we can obtain a good clustering with fewer parameters, this does not mean that the model will perform equally well in every situation, or when the SFR physics deviates strongly from that in TNG. Thus, we recommend using the full model (with 8 parameters) as a default and evaluating possible restrictions based on the specific dataset to analyse.
6. Other statistics
To further validate the physical soundness of our best-fitting models, we explored whether they correctly predict other statistics not directly included in the fitting procedure. Namely, we investigated the halo occupation number and magnitude of the assembly bias. This tests whether SHAMe-SF is selecting subhalos at the correct host halo mass and large-scale environments.
6.1. HOD
In Fig. 9, we present the average number of galaxies as a function of host halo mass. We display the total and central galaxies as solid and dashed lines, respectively. We show results for samples at z = 1 as measured in TNG300 and in the corresponding best-fitting SHAMe-SF model. To illustrate the impact of parameter uncertainties in the best-fitting SHAMe-SF model, shaded regions indicate the range that encompasses the results for 200 parameter sets randomly selected within a 1σ interval in our MCMC chains.
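The mean occupation described above can be measured with a few lines of NumPy. The sketch below uses a hypothetical four-halo catalogue and bin edges chosen only for illustration; the inputs are the host halo masses and the number of selected galaxies in each halo.

```python
import numpy as np

def mean_occupation(halo_mass, n_gal_per_halo, bin_edges):
    """Mean number of galaxies per halo in bins of log10 halo mass."""
    log_m = np.log10(halo_mass)
    n_halo, _ = np.histogram(log_m, bins=bin_edges)
    n_gal, _ = np.histogram(log_m, bins=bin_edges,
                            weights=np.asarray(n_gal_per_halo, dtype=float))
    # Leave empty mass bins as NaN instead of dividing by zero.
    return np.divide(n_gal, n_halo, out=np.full(len(n_halo), np.nan),
                     where=n_halo > 0)

# Mock catalogue: four haloes and the number of selected galaxies in each.
halo_mass = np.array([2e11, 5e11, 3e12, 6e12])    # h^-1 Msun
n_gal = np.array([0, 1, 1, 3])
hod = mean_occupation(halo_mass, n_gal, bin_edges=[11.0, 12.0, 13.0, 14.0])
```

Calling the same function with only the central galaxies counted in `n_gal_per_halo` would give the central occupation, so the total and central curves can be obtained from two calls.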
Fig. 9. Halo occupation distribution for galaxies in TNG300 (black) and the SHAMe-SF mocks (red) for z = 1 and number densities n = 10−2, 10−2.5, and 10−3 h3 Mpc−3. The SHAMe-SF parameters were fitted to reproduce the clustering predictions. The shaded region represents the 1σ confidence interval (solid for all galaxies, i.e. centrals and satellites; circle-texture for centrals only).
In all cases, SHAMe-SF predicts a distribution of host halo masses, for both centrals and satellites, that is very similar to that in the TNG300. In particular, satellite fractions, shown in Table 4, are also remarkably similar. In Appendix C, we show that SHAMe-SF yields a similar level of agreement at z = 0 and when fitting the SAM catalogue.
Table 4. Satellite fractions for the three highest number density samples in TNG300 at z = 1 and the respective SHAMe-SF predictions.
Perhaps an exception is the abundance of central galaxies in halos above ∼ 1011 h−1 M⊙, where SHAMe-SF predicts almost no star-forming galaxies. This is a consequence of the SFR decreasing monotonically at large Vpeak in our model; in contrast, TNG300 shows a “rejuvenation” of massive galaxies (cf. Fig. 3). To investigate this further, we tested an extension of SHAMe-SF with a second turning point, allowing high-Vpeak subhalos to host star-forming galaxies even if intermediate-Vpeak subhalos are quenched. Including this extra freedom allowed a higher number of centrals at high halo masses, but this regime remained largely unconstrained due to its minor impact on clustering statistics. However, it is worth considering this extension for specific statistics that would be sensitive to that regime.
6.2. Assembly bias
The final statistic we explore is the magnitude of the so-called “assembly bias”. Assembly bias quantifies the dependency of large-scale clustering on halo properties other than mass (Sheth & Tormen 2004; Gao et al. 2005; Gao & White 2007; Wechsler et al. 2006; Faltenbacher & White 2010; Angulo et al. 2009; Mao et al. 2018). The galaxy-halo connection implies that this effect can be generalised to galaxies, defining a “galaxy assembly bias”.
To measure the degree of assembly bias in our samples, we use the technique proposed by Croton et al. (2007). In this approach, the positions of halos (including all their satellites) are shuffled among halos within bins of 0.1 dex in halo mass. Then, assembly bias is estimated by comparing the shuffled and original correlation functions:
For SFR-selected samples, assembly bias decreases for lower number densities at larger r, even becoming ‘negative’ (lower than the shuffled sample, Contreras et al. 2019). However, we note that the amount of galaxy assembly bias differs for different models and schemes (Croton et al. 2007; Chaves-Montero et al. 2016), owing to the assumed impact of secondary halo properties on a galaxy SFR.
Results for the two highest number density samples at z = 0 and z = 1 from the TNG300 and the corresponding SHAMe-SF are shown in Fig. 10. We show the average over 10 random shuffles to reduce statistical noise. As in the previous sections, we estimate the role of uncertainty in the best fits by considering 200 parameter sets within a 1σ region in our chains.
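The shuffling step described above can be sketched as follows. This is a simplified illustration with a hypothetical six-halo catalogue: it ignores periodic boundary conditions, and the function and variable names are not taken from any existing code.

```python
import numpy as np

def shuffle_haloes(halo_pos, halo_mass, gal_pos, gal_halo_idx, rng, dlogm=0.1):
    """Move each halo's galaxies, as a rigid group, to another halo of similar mass."""
    mass_bin = np.floor(np.log10(halo_mass) / dlogm).astype(int)
    target = np.arange(len(halo_mass))
    for b in np.unique(mass_bin):
        members = np.flatnonzero(mass_bin == b)
        target[members] = rng.permutation(members)    # swap only within the mass bin
    # Keep each galaxy's offset from its original halo centre.
    offset = gal_pos - halo_pos[gal_halo_idx]
    return halo_pos[target[gal_halo_idx]] + offset, target

rng = np.random.default_rng(3)
halo_pos = rng.uniform(0.0, 100.0, size=(6, 3))            # mock positions
halo_mass = np.array([1.05e12, 1.1e12, 1.2e12, 1.05e13, 1.1e13, 1.2e13])
gal_halo_idx = np.array([0, 0, 3, 5])                      # host halo of each galaxy
gal_pos = halo_pos[gal_halo_idx] + rng.normal(0.0, 0.1, size=(4, 3))
new_pos, target = shuffle_haloes(halo_pos, halo_mass, gal_pos, gal_halo_idx, rng)
```

Repeating the call with different random seeds and averaging the resulting correlation functions would correspond to the average over several shuffles mentioned in the text.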
Fig. 10. Galaxy assembly bias for galaxies in TNG300 (black) and the SHAMe-SF mocks for z = 0 (left) and z = 1 (right) and number densities n = 10−2 (top) and 10−2.5 h3 Mpc−3 (bottom). The SHAMe-SF parameters were fitted to reproduce the clustering predictions (red). The shaded region represents the 1σ confidence interval.
For the sparser sample, we obtain a behaviour statistically consistent with TNG300 at both z = 0 and 1. Specifically, at z = 1 the amount of assembly bias changes the clustering by approximately 10% in both TNG300 and SHAMe-SF. This agreement holds even though we did not include any parameter to explicitly model assembly bias (see Contreras et al. 2021a). However, for the denser sample, we slightly overestimate the amount of assembly bias. It is interesting to note that both samples at z = 0 display a scale-dependent assembly bias, which tends to reduce the clustering amplitude, as also previously reported by Contreras et al. (2019) (see also Jiménez et al. 2021). By applying SHAMe-SF to larger simulations, we will be able to explore this with better statistical precision.
7. Summary
In this study, we have introduced SHAMe-SF, a physically motivated model designed to populate gravity-only simulations. This model is capable of predicting the clustering of SFR-selected samples across various number densities and redshifts.
Our initial step in developing the SHAMe-SF model involved utilizing an RF algorithm to explore the connection between dark matter structures and SFRs in the TNG300 simulation. Although star formation is a highly complex and potentially stochastic process, we showed that the clustering of SFR-selected samples can be predicted using only dark matter properties.
After identifying the most relevant subhalo properties, we constructed SHAMe-SF. Our model involves the following ingredients: modeling the SFR–Vpeak relationship as a broken power law with three free parameters (Eq. 2); introducing two additional parameters to represent deviations from the mean relation while controlling the magnitude of the scatter and its correlation with Vpeak/Vvir; and modulating the SFR of satellite galaxies with three additional parameters that reduce the SFR associated with a subhalo based on the time elapsed since reaching its peak mass and the host halo’s mass.
Subsequently, we trained a neural network that emulates the SHAMe-SF clustering predictions in less than 1 second of CPU time. With this emulator, we showed that SHAMe-SF reproduces the clustering of various mock galaxy samples. Remarkably, this was achieved using a gravity-only simulation with a 64-times lower resolution than TNG300 and over a broad range of scales r ∈ [0.6, 30] h−1 Mpc. Specifically, the key milestones can be summarised as follows:
- SHAMe-SF accurately describes the projected correlation function, monopole, and quadrupole of the redshift-space correlation function of TNG300 samples for three number densities (n = {10−2, 10−2.5, 10−3} h3 Mpc−3) and two redshifts (z = 0 and z = 1) (Fig. 6).
- We demonstrated the flexibility of SHAMe-SF by fitting the clustering of samples predicted by LGalaxies, a semi-analytical model that adopts galaxy and star formation prescriptions different from those of TNG300 (Fig. 7).
- Additionally, SHAMe-SF was also able to describe the clustering of mock ELG galaxies at z = 1, which were selected using photometric criteria analogous to those employed by the ongoing DESI survey (Fig. 8).
- We validated the physical realism of our best-fitting SHAMe-SF models by comparing the mean occupation number (Fig. 9) and the magnitude of the so-called assembly bias (Fig. 10). SHAMe-SF retrieves trends in qualitative agreement with those in the mock samples.
Recently, various models have been developed specifically for ELGs, using approaches such as HODs and SHAM extensions (e.g., Lin et al. 2023, [OII] in eBOSS; Favole et al. 2017, [OII] in SDSS; Yu et al. 2023; Prada et al. 2023, ELGs in DESI). Although our work is primarily focused on SFR selections, the flexibility of SHAMe-SF, coupled with the strong correlation between SFR and emission lines, suggests its applicability to ELGs. We will explore this in the future. This prospect becomes particularly interesting when combined with suites of N-body simulations or re-scaled simulations (Zennaro et al. 2019; Contreras et al. 2020; Angulo et al. 2021), which would offer opportunities to constrain cosmological parameters with future galaxy surveys from highly nonlinear scales.
These statistics are computed by measuring the cross-correlation between the validation sample (half of the box) and the full galaxy sample. This selection of samples and clustering function allows us to measure a clustering signal dominated by the validation sample (i.e. galaxies not used in the training process) while reducing the noise on the clustering statistic. This method is repeated twice, changing the side of the box used as the training set. Details about the calculation of the correlation functions can be found in Appendix A.
Acknowledgments
We would like to thank Jonás Chaves-Montero, Idit Zehavi and Violeta González-Pérez for useful comments on the manuscript and the project. We thank the anonymous referee for their insight and feedback on the paper. The authors acknowledge support by the project PID2021-128338NB-I00 from the Spanish Ministry of Science. SOM is funded by the Spanish Ministry of Science and Innovation under grant number PRE2020-095788. SC acknowledges the support of the “Juan de la Cierva Incorporación” fellowship (IJC2020-045705-I). The authors also acknowledge the computer resources at MareNostrum and the technical support provided by Barcelona Supercomputing Center (RES-AECT-2019-2-0012 & RES-AECT-2020-3-0014). The first Random Forest calculations followed the example provided by Saurabh Kumar in Xu et al. (2021) (https://saurabhkumar3400.com/assemblybias.html). We further thank the developers of the open-source tools used in this work: matplotlib (Hunter 2007), numpy (Walt et al. 2011), scipy (Jones et al. 2001), scikit-learn (Pedregosa et al. 2011), pandas (McKinney 2010) and emcee (Foreman-Mackey et al. 2013).
References
- Agarwal, S., Davé, R., & Bassett, B. A. 2018, MNRAS, 478, 3410
- Alam, S., Peacock, J. A., Kraljic, K., Ross, A. J., & Comparat, J. 2020, MNRAS, 497, 581
- Angulo, R. E., Lacey, C. G., Baugh, C. M., & Frenk, C. S. 2009, MNRAS, 399, 983
- Angulo, R. E., Springel, V., White, S. D. M., et al. 2012, MNRAS, 426, 2046
- Angulo, R. E., Zennaro, M., Contreras, S., et al. 2021, MNRAS, 507, 5869
- Aricò, G., Angulo, R. E., Hernández-Monteagudo, C., et al. 2019, ArXiv e-prints [arXiv:1911.08471]
- Aricò, G., Angulo, R. E., Contreras, S., et al. 2021a, MNRAS, 506, 4070
- Aricò, G., Angulo, R. E., Hernández-Monteagudo, C., Contreras, S., & Zennaro, M. 2021b, MNRAS, 503, 3596
- Avila, S., Gonzalez-Perez, V., Mohammad, F. G., et al. 2020, MNRAS, 499, 5486
- Baugh, C. M. 2006, Rep. Progr. Phys., 69, 3101
- Behroozi, P. S., Wechsler, R. H., & Conroy, C. 2013, ApJ, 770, 57
- Behroozi, P., Wechsler, R. H., Hearin, A. P., & Conroy, C. 2019, MNRAS, 488, 3143
- Benson, A. J., Cole, S., Frenk, C. S., Baugh, C. M., & Lacey, C. G. 2000, MNRAS, 311, 793
- Berlind, A. A., Weinberg, D. H., Benson, A. J., et al. 2003, ApJ, 593, 1
- Blank, M., Meier, L. E., Macciò, A. V., et al. 2020, MNRAS, 500, 1414
- Breiman, L. 2001, Mach. Learn., 45, 5
- Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. 1984, Classification and Regression Trees (Chapman and Hall/CRC)
- Chaves-Montero, J., Angulo, R. E., Schaye, J., et al. 2016, MNRAS, 460, 3100
- Chaves-Montero, J., Angulo, R. E., & Contreras, S. 2023, MNRAS, 521, 937
- Coil, A. L., Newman, J. A., Cooper, M. C., et al. 2006, ApJ, 644, 671
- Conroy, C., Wechsler, R. H., & Kravtsov, A. V. 2006, ApJ, 647, 201
- Contreras, S., Baugh, C. M., Norberg, P., & Padilla, N. 2013, MNRAS, 432, 2717
- Contreras, S., Baugh, C. M., Norberg, P., & Padilla, N. 2015, MNRAS, 452, 1861
- Contreras, S., Zehavi, I., Baugh, C. M., Padilla, N., & Norberg, P. 2017, MNRAS, 465, 2833
- Contreras, S., Zehavi, I., Padilla, N., et al. 2019, MNRAS, 484, 1133
- Contreras, S., Angulo, R. E., Zennaro, M., Aricò, G., & Pellejero-Ibañez, M. 2020, MNRAS, 499, 4905
- Contreras, S., Angulo, R. E., & Zennaro, M. 2021a, MNRAS, 508, 175
- Contreras, S., Angulo, R. E., & Zennaro, M. 2021b, MNRAS, 504, 5205
- Contreras, S., Angulo, R. E., Chaves-Montero, J., White, S. D. M., & Aricò, G. 2023a, MNRAS, 520, 489
- Contreras, S., Chaves-Montero, J., & Angulo, R. E. 2023b, ArXiv e-prints [arXiv:2305.09637]
- Croton, D. J., Springel, V., White, S. D. M., et al. 2006, MNRAS, 365, 11
- Croton, D. J., Gao, L., & White, S. D. M. 2007, MNRAS, 374, 1303
- Cuesta-Lazaro, C., Nishimichi, T., Kobayashi, Y., et al. 2023, MNRAS, 523, 3219
- Davé, R., Anglés-Alcázar, D., Narayanan, D., et al. 2019, MNRAS, 486, 2827
- Davis, M., Efstathiou, G., Frenk, C. S., & White, S. D. M. 1985, ApJ, 292, 371
- De Lucia, G., & Blaizot, J. 2007, MNRAS, 375, 2
- De Lucia, G., Kauffmann, G., & White, S. D. M. 2004, MNRAS, 349, 1101
- de Santi, N. S. M., Rodrigues, N. V. N., Montero-Dorta, A. D., et al. 2022, MNRAS, 514, 2463
- DESI Collaboration (Aghamousa, A., et al.) 2016, ArXiv e-prints [arXiv:1611.00036]
- Donnari, M., Pillepich, A., Nelson, D., et al. 2019, MNRAS, 485, 4817
- Donnari, M., Pillepich, A., Joshi, G. D., et al. 2020, MNRAS, 500, 4004
- Faltenbacher, A., & White, S. D. M. 2010, ApJ, 708, 469
- Favole, G., Rodríguez-Torres, S. A., Comparat, J., et al. 2017, MNRAS, 472, 550
- Favole, G., Montero-Dorta, A. D., Artale, M. C., et al. 2022, MNRAS, 509, 1614 [Google Scholar]
- Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, PASP, 125, 306 [Google Scholar]
- Gao, L., & White, S. D. M. 2007, MNRAS, 377, L5 [NASA ADS] [CrossRef] [Google Scholar]
- Gao, L., Springel, V., & White, S. D. M. 2005, MNRAS, 363, L66 [NASA ADS] [CrossRef] [Google Scholar]
- Geach, J. E., Sobral, D., Hickox, R. C., et al. 2012, MNRAS, 426, 679 [NASA ADS] [CrossRef] [Google Scholar]
- Gonzalez-Perez, V., Cui, W., Contreras, S., et al. 2020, MNRAS, 498, 1852 [NASA ADS] [CrossRef] [Google Scholar]
- Guo, Q., White, S., Boylan-Kolchin, M., et al. 2011, MNRAS, 413, 101 [Google Scholar]
- Guo, Q., White, S., Angulo, R. E., et al. 2013, MNRAS, 428, 1351 [NASA ADS] [CrossRef] [Google Scholar]
- Guo, H., Zheng, Z., Zehavi, I., et al. 2015, MNRAS, 446, 578 [NASA ADS] [CrossRef] [Google Scholar]
- Hadzhiyska, B., Tacchella, S., Bose, S., & Eisenstein, D. J. 2021, MNRAS, 502, 3599 [NASA ADS] [CrossRef] [Google Scholar]
- Hadzhiyska, B., Hernquist, L., Eisenstein, D., et al. 2022a, ArXiv e-prints [arXiv:2210.10068] [Google Scholar]
- Hadzhiyska, B., Eisenstein, D., Hernquist, L., et al. 2022b, ArXiv e-prints [arXiv:2210.10072] [Google Scholar]
- Hahn, C., Tinker, J. L., & Wetzel, A. 2019, ArXiv e-prints [arXiv:1910.01644] [Google Scholar]
- Hearin, A. P., Zentner, A. R., van den Bosch, F. C., Campbell, D., & Tollerud, E. 2016, arXiv e-prints [arXiv:1512.03050] [Google Scholar]
- Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2013, MNRAS, 431, 3373 [NASA ADS] [CrossRef] [Google Scholar]
- Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2015, MNRAS, 451, 2663 [Google Scholar]
- Henriques, B. M. B., Yates, R. M., Fu, J., et al. 2020, MNRAS, 491, 5795 [NASA ADS] [CrossRef] [Google Scholar]
- Hunter, J. D. 2007, Comput. Sci. Eng., 9, 90 [Google Scholar]
- Jiménez, E., Padilla, N., Contreras, S., et al. 2021, MNRAS, 506, 3155 [CrossRef] [Google Scholar]
- Jing, Y. P., Mo, H. J., & Börner, G. 1998, ApJ, 494, 1 [NASA ADS] [CrossRef] [Google Scholar]
- Jones, E., Oliphant, T., Peterson, P., et al. 2001, SciPy: Open Source Scientific Tools for Python, https://scipy.org/ [Google Scholar]
- Kamdar, H. M., Turk, M. J., & Brunner, R. J. 2016, MNRAS, 457, 1162 [NASA ADS] [CrossRef] [Google Scholar]
- Kauffmann, G., White, S. D. M., & Guiderdoni, B. 1993, MNRAS, 264, 201 [Google Scholar]
- Kauffmann, G., Colberg, J. M., Diaferio, A., & White, S. D. M. 1999, MNRAS, 303, 188 [NASA ADS] [CrossRef] [Google Scholar]
- Kennedy, J., & Eberhart, R. 1995, IEEE International Conference on Neural Networks – Conference Proceedings, 4, 1942 [Google Scholar]
- Lin, L., Faber, S. M., Koo, D. C., et al. 2020, ApJ, 899, 93 [NASA ADS] [CrossRef] [Google Scholar]
- Lin, S., Tinker, J. L., Blanton, M. R., et al. 2023, MNRAS, 519, 4253 [CrossRef] [Google Scholar]
- Mao, Y.-Y., Zentner, A. R., & Wechsler, R. H. 2018, MNRAS, 474, 5143 [NASA ADS] [CrossRef] [Google Scholar]
- Marinacci, F., Vogelsberger, M., Pakmor, R., et al. 2018, MNRAS, 480, 5113 [NASA ADS] [Google Scholar]
- McCarthy, I. G., Schaye, J., Bird, S., & Le Brun, A. M. C. 2017, MNRAS, 465, 2936 [Google Scholar]
- McKinney, W. 2010, in Proceedings of the 9th Python in Science Conference, eds. S. van der Walt, & J. Millman, 56 [Google Scholar]
- Moster, B. P., Naab, T., & White, S. D. M. 2018, MNRAS, 477, 1822 [Google Scholar]
- Moster, B. P., Naab, T., Lindström, M., & O’Leary, J. A. 2021, MNRAS, 507, 2115 [CrossRef] [Google Scholar]
- Naiman, J. P., Pillepich, A., Springel, V., et al. 2018, MNRAS, 477, 1206 [Google Scholar]
- Nelson, D., Pillepich, A., Genel, S., et al. 2015, Astron. Comput., 13, 12 [Google Scholar]
- Nelson, D., Pillepich, A., Springel, V., et al. 2018, MNRAS, 475, 624 [Google Scholar]
- Norberg, P., Baugh, C. M., Gaztanaga, E., & Croton, D. J. 2009, MNRAS, 396, 19 [NASA ADS] [CrossRef] [Google Scholar]
- Orsi, Á. A., & Angulo, R. E. 2018, MNRAS, 475, 2530 [NASA ADS] [CrossRef] [Google Scholar]
- Paranjape, A., & Alam, S. 2020, ArXiv e-prints [arXiv:2001.08760] [Google Scholar]
- Paranjape, A., Hahn, O., & Sheth, R. K. 2018, MNRAS, 476, 3631 [NASA ADS] [CrossRef] [Google Scholar]
- Peacock, J. A., & Smith, R. E. 2000, MNRAS, 318, 1144 [Google Scholar]
- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825 [Google Scholar]
- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2012, ArXiv e-prints [arXiv:1201.0490] [Google Scholar]
- Pellejero Ibañez, M., Angulo, R. E., Zennaro, M., et al. 2023, MNRAS, 520, 3725 [Google Scholar]
- Pillepich, A., Nelson, D., Hernquist, L., et al. 2018, MNRAS, 475, 648 [Google Scholar]
- Planck Collaboration XIII. 2016, A&A, 594, A13 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Popesso, P., Biviano, A., Finoguenov, A., et al. 2015, A&A, 579, A132 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
- Prada, F., Ereza, J., Smith, A., et al. 2023, ArXiv e-prints [arXiv:2306.06315] [Google Scholar]
- Raichoor, A., Eisenstein, D. J., Karim, T., et al. 2020, Res. Notes Am. Astron. Soc., 4, 180 [Google Scholar]
- Ramakrishnan, S., Paranjape, A., Hahn, O., & Sheth, R. K. 2019, MNRAS, 489, 2977 [NASA ADS] [CrossRef] [Google Scholar]
- Rodrigues, N. V. N., de Santi, N. S. M., Montero-Dorta, A. D., & Abramo, L. R. 2023, MNRAS, 522, 3236 [NASA ADS] [CrossRef] [Google Scholar]
- Shankar, F., Lapi, A., Salucci, P., De Zotti, G., & Danese, L. 2006, ApJ, 643, 14 [NASA ADS] [CrossRef] [Google Scholar]
- Sheth, R. K., & Tormen, G. 2004, MNRAS, 350, 1385 [NASA ADS] [CrossRef] [Google Scholar]
- Simha, V., Weinberg, D. H., Davé, R., et al. 2012, MNRAS, 423, 3458 [NASA ADS] [CrossRef] [Google Scholar]
- Sinha, M. 2016, https://doi.org/10.5281/zenodo.55161 [Google Scholar]
- Sinha, M., & Garrison, L. 2017, Astrophysics Source Code Library [record ascl:1703.003] [Google Scholar]
- Springel, V. 2010, MNRAS, 401, 791 [Google Scholar]
- Springel, V., White, S. D. M., Tormen, G., & Kauffmann, G. 2001, MNRAS, 328, 726 [Google Scholar]
- Springel, V., White, S. D. M., Jenkins, A., et al. 2005, Nature, 435, 629 [Google Scholar]
- Springel, V., Pakmor, R., Pillepich, A., et al. 2018, MNRAS, 475, 676 [Google Scholar]
- Tinker, J. L., Hahn, C., Mao, Y.-Y., & Wetzel, A. R. 2018, MNRAS, 478, 4487 [CrossRef] [Google Scholar]
- Vale, A., & Ostriker, J. P. 2006, MNRAS, 371, 1173 [NASA ADS] [CrossRef] [Google Scholar]
- Vogelsberger, M., Marinacci, F., Torrey, P., & Puchwein, E. 2020, Nat. Rev. Phys., 2, 42 [Google Scholar]
- Walt, S. V. D., Colbert, S. C., & Varoquaux, G. 2011, Comput. Sci. Eng., 13, 22 [Google Scholar]
- Wang, L., Farrah, D., Oliver, S. J., et al. 2013, MNRAS, 431, 648 [Google Scholar]
- Wang, E., Wang, H., Mo, H., et al. 2018, ApJ, 864, 51 [NASA ADS] [CrossRef] [Google Scholar]
- Wechsler, R. H., Zentner, A. R., Bullock, J. S., Kravtsov, A. V., & Allgood, B. 2006, ApJ, 652, 71 [NASA ADS] [CrossRef] [Google Scholar]
- White, S. D. M., & Frenk, C. S. 1991, ApJ, 379, 52 [Google Scholar]
- Xu, X., Kumar, S., Zehavi, I., & Contreras, S. 2021, MNRAS, 507, 4879 [CrossRef] [Google Scholar]
- Yu, J., Zhao, C., Gonzalez-Perez, V., et al. 2023, ArXiv e-prints [arXiv:2306.06313] [Google Scholar]
- Yuan, S., Hadzhiyska, B., Bose, S., & Eisenstein, D. J. 2022, MNRAS, 512, 5793 [NASA ADS] [CrossRef] [Google Scholar]
- Zehavi, I., Blanton, M. R., Frieman, J. A., et al. 2002, ApJ, 571, 172 [NASA ADS] [CrossRef] [Google Scholar]
- Zehavi, I., Zheng, Z., Weinberg, D. H., et al. 2005, ApJ, 630, 1 [Google Scholar]
- Zehavi, I., Contreras, S., Padilla, N., et al. 2018, ApJ, 853, 84 [NASA ADS] [CrossRef] [Google Scholar]
- Zennaro, M., Angulo, R. E., Aricò, G., Contreras, S., & Pellejero-Ibáñez, M. 2019, MNRAS, 489, 5938 [NASA ADS] [CrossRef] [Google Scholar]
- Zennaro, M., Angulo, R. E., Pellejero-Ibáñez, M., et al. 2021, ArXiv e-prints [arXiv:2101.12187] [Google Scholar]
- Zhao, P., Xu, H., Katsianis, A., & Yang, X.-H. 2020, Res. Astron. Astrophys., 20, 195 [Google Scholar]
- Zheng, Z., Berlind, A. A., Weinberg, D. H., et al. 2005, ApJ, 633, 791 [NASA ADS] [CrossRef] [Google Scholar]
- Zheng, Z., Coil, A. L., & Zehavi, I. 2007, ApJ, 667, 760 [Google Scholar]
- Zjupa, J., Paranjape, A., Hahn, O., & Pakmor, R. 2020, ArXiv e-prints [arXiv:2009.03329] [Google Scholar]
Appendix A: Computing clustering statistics
The main galaxy clustering statistics computed in this work are the projected correlation function and two multipoles (monopole and quadrupole) of the redshift-space correlation function. When we analyse the clustering predicted by the RF models, we use the cross-correlation function between the test half of the box and the full volume. When analysing the predictions of the SHAMe-SF model, we use the auto-correlation function.
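To compute wp, we use CORRFUNC (Sinha 2016; Sinha & Garrison 2017), integrating the auto-correlation function along the line of sight following
$$w_{\rm p}(r_{\rm p}) = 2 \int_0^{\pi_{\rm max}} \xi(r_{\rm p}, \pi)\, \mathrm{d}\pi,$$
where ξ(rp, π) is the two-point correlation function. We choose πmax = 30 h−1 Mpc as the maximum depth used in the pair counting due to the small size of the TNG300 simulation box.
To obtain the multipoles, we first compute the correlation function in terms of s and μ bins defined as
$$s = \sqrt{r_{\rm p}^2 + \pi^2} \quad \text{and} \quad \mu = \pi / s,$$
where μ is the cosine of the angle between the pair separation and the line of sight. Then, we define the multipoles as
$$\xi_\ell(s) = \frac{2\ell + 1}{2} \int_{-1}^{1} \xi(s, \mu)\, P_\ell(\mu)\, \mathrm{d}\mu,$$
where Pℓ is the ℓ-th order Legendre polynomial. We also use CORRFUNC to compute these statistics.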
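As an illustration of the two integrals described in this appendix, the sketch below evaluates them with a simple midpoint sum on already-binned correlation functions. It is not the CORRFUNC implementation used in the paper: the function names (`legendre`, `multipole`, `projected_wp`), the equal-width binning, and the toy input are assumptions made here for illustration, and the standard conventions wp = 2 ∫ ξ dπ over [0, πmax] and ξℓ = (2ℓ + 1)/2 ∫ ξ Pℓ dμ over [−1, 1] are assumed.

```python
def legendre(ell, mu):
    """Legendre polynomial P_ell(mu) for ell in {0, 2, 4}."""
    if ell == 0:
        return 1.0
    if ell == 2:
        return 0.5 * (3.0 * mu**2 - 1.0)
    if ell == 4:
        return (35.0 * mu**4 - 30.0 * mu**2 + 3.0) / 8.0
    raise ValueError("only ell = 0, 2, 4 are implemented")


def multipole(xi_s_mu, ell):
    """Multipole xi_ell for one s bin.

    xi_s_mu: xi(s, mu) sampled on equal-width mu bins spanning [-1, 1]
    (hypothetical layout; each value is taken at the bin centre).
    """
    n_mu = len(xi_s_mu)
    dmu = 2.0 / n_mu
    total = 0.0
    for i, xi in enumerate(xi_s_mu):
        mu = -1.0 + (i + 0.5) * dmu  # bin centre
        total += xi * legendre(ell, mu) * dmu
    return 0.5 * (2 * ell + 1) * total


def projected_wp(xi_rp_pi, pi_max):
    """w_p for one r_p bin from xi(r_p, pi) on equal-width pi bins in [0, pi_max]."""
    dpi = pi_max / len(xi_rp_pi)
    return 2.0 * sum(xi_rp_pi) * dpi


# Toy check: xi(s, mu) = 1.5 * P_0(mu) - 0.4 * P_2(mu) should give back
# a monopole close to 1.5 and a quadrupole close to -0.4.
n_mu = 400
mus = [-1.0 + (i + 0.5) * 2.0 / n_mu for i in range(n_mu)]
toy_xi = [1.5 - 0.4 * legendre(2, mu) for mu in mus]
xi0 = multipole(toy_xi, 0)
xi2 = multipole(toy_xi, 2)

# A constant xi = 0.1 integrated out to pi_max = 30 gives 2 * 0.1 * 30.
wp_const = projected_wp([0.1] * 30, pi_max=30.0)
```

If the pair counts are binned in |μ| over [0, 1] only, the symmetry of ξ(s, μ) in μ allows the same sum to be taken over that half range and doubled.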
Appendix B: Random forest models χ2
Here, we show the χ2 of some of the models trained with RF. Details on the samples and the calculation were discussed in Section 3.1.2. We compare the measurements of the cross-correlation between the validation sample (half of the box) and the complete sample for TNG300 and the RF models. This choice reduces the statistical noise while reducing the impact of the training set. However, to compute the contribution of the jackknife samples to the covariance matrix (Cv), we use the auto-correlation function of the full galaxy sample. This choice will not affect the model comparison since we use the same covariance matrix, but we expect the values of the χ2 to be higher due to the size of the errors.
In Fig. B.1, we show the χ2 values of the monopole and the quadrupole for two number densities, two redshifts, and five models trained using different properties. We show the models from Fig. 2, the final property choice of SHAMe-SF (Vpeak, Mhalo, Vpeak/Vvir, aMpeak), and the model trained with Vpeak and Mpeak (as an example of another model with two properties). As discussed in Section 3.2, the most significant improvement appears when adding the second property. Adding a third (or fourth) property can improve the prediction for some statistics (see the quadrupole for the highest number density at both redshifts), while producing small noise-driven fluctuations in the χ2 of other statistics for the same property choice.
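For orientation, a χ2 of this kind is conventionally written as below. This is the textbook form, given here as an assumption; the estimator actually adopted is the one defined in Section 3.1.2. The symbols d (TNG300 measurement), m (RF model prediction), x_k (measurement in the k-th jackknife resampling), and N_jk (number of jackknife regions) are labels introduced only for this illustration, and C_jk denotes the jackknife contribution to Cv.

$$\chi^2 = (\mathbf{d} - \mathbf{m})^{\mathsf{T}}\, C_v^{-1}\, (\mathbf{d} - \mathbf{m}), \qquad C_{\rm jk} = \frac{N_{\rm jk} - 1}{N_{\rm jk}} \sum_{k=1}^{N_{\rm jk}} (\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^{\mathsf{T}}$$

Because the same Cv enters every model's χ2, an overall rescaling of the covariance changes the absolute χ2 values but not the ranking of the models, which is the point made in the text above.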
Fig. B.1. χ2 values of three different models computed using RF with a different number of input properties (colour coding). We show the results for the monopole and quadrupole for number densities,
Appendix C: TNG300-SHAMe-SF clustering at z = 0
Although upcoming surveys will target star-forming galaxies at higher redshifts (z ∼ 1), we also test the clustering predictions of SHAMe-SF at z = 0. Fig. C.1 shows the results for the projected correlation function, monopole, and quadrupole (upper panel) and the predictions of the HOD (lower panel). The performance for the clustering statistics used in the fit is similar to that at z = 1, discussed in Section 5.3. The main differences appear in the HOD, especially for high-mass central halos. As discussed in Section 6.1, the model cannot reproduce two turnarounds in the SFR function. This is reflected in the satellite fractions (Table C.1), which tend to be overestimated, mostly due to the lack of central subhalos.
Table C.1. Satellite fractions for the two higher number densities for TNG300 and the prediction of the SHAMe-SF model for z = 0.
Appendix D: L-Galaxies
To test SHAMe-SF with an SFR prescription different from that of TNG300, we use the SAM described in Section 2.3. Before fitting the clustering, we verify that the main parametrisation of the model (Section 4) can also describe the relation between Vpeak and SFR. We replicate the upper and centre panels of Fig. 3 in Fig. D.1 for z = 0 (left) and z = 1 (right). In the SAM, the quenching mechanisms for high-mass halos are not as efficient as in TNG300, especially at z = 1. In this case, the most star-forming galaxies are hosted mostly by the subhalos with higher Vpeak (as can be seen in the histograms for the higher number densities). As discussed in Section 5.3.2, SHAMe-SF can describe this behaviour by setting γ, the second slope in Eq. (2), to negative values. We note that we find the same behaviour with Vpeak/Vvir, where less concentrated subhalos host galaxies with higher star-formation rates at a fixed value of Vpeak.
Fig. D.1. Same as the top and middle panel of Figure 3 but for SAM galaxies at z = 0 (left) and 1 (right).
In addition to the clustering predictions for z = 1 shown in Fig. 7, we show the results for z = 0 in Fig. D.2. The HODs for z = 0 (upper panel) and z = 1 (lower panel) are shown in Fig. D.3.
Fig. D.3. Halo occupation distribution for galaxies in TNG300 (black) and the SHAMe-SF mocks (blue) for z = 0 (top) and z = 1 (bottom) and number densities n = 10−2, 10−2.5, and 10−3 h3 Mpc−3. The SHAMe-SF parameters were fitted to reproduce the clustering predictions. The shaded region represents the 1σ confidence interval (circle-textured for centrals).
All Tables
Star-formation-rate-selected samples constructed from the TNG300 hydrodynamical simulation at z = 0 and z = 1 for different number densities.
Selection criteria used to construct a sample of DESI-like ELGs in the TNG300 hydrodynamical simulation.
Satellite fractions for the three highest number density samples in TNG300 at z = 1 and the respective SHAMe-SF predictions.
Satellite fractions for the two higher number densities for TNG300 and the prediction of the SHAMe-SF model for z = 0.
All Figures
Fig. 1. Comparison between the real SFR values (TNG300, x-axis) and the ones predicted by the RF (y-axis) for the validation sample at z = 1 using one of the best three models with three subhalo properties (bottom) and the model with only Vpeak (top). Blue lines show the SFR values that define SFR-selected number densities of
Fig. 2. Difference between the predicted and TNG monopole (top block) and quadrupole (bottom block) of the redshift space cross-correlation function between the test half of the box and the whole box. We show three models with different numbers of subhalo properties. We tested two redshifts, z = 0 (left) and z = 1 (right), and two number densities for each statistic (n = 10−2 h3 Mpc−3, upper panel of each block, and n = 10−2.5 h3 Mpc−3 lower panel). The grey-shaded intervals mark the 1σ regions. The yellow-shaded region (r < 0.316 h−1 Mpc) is not used to compute the χ2 (see Appendix B).
Fig. 3. Vpeak-dependence of the star-formation rate. Top: Vpeak distribution for the different number densities analysed in this work: n = 10−2 h3 Mpc−3 (orange, solid), n = 10−2.5 h3 Mpc−3 (green, dashed), n = 10−3 h3 Mpc−3 (blue, dash-dotted), and n = 10−3.5 h3 Mpc−3 (pink, dotted). The distribution for all the subhalos is shown in grey. Centre and bottom: Vpeak versus SFR distribution of TNG300 (centre) and the SHAMe-SF model (bottom) for central subhalos at z = 1. The colour-coding represents the median of Vpeak/Vvir-defined intervals. The solid, dashed, dash-dotted, and dotted lines mark the values of the SFR that define each number density in the top panel.
Fig. 4. For satellites, mean difference (in log scale) between the TNG300 values and the predicted ones using only Eq. (2) and the semi-sorted scatter for a subsample of galaxies on a fixed [Vpeak, Vpeak/Vvir] interval for different Mhalo values (colour coding) (right). Left: The same, but plotted as a function of aMpeak (in scale factor units) maintaining the Mhalo intervals. The black thick line shows the behaviour when the whole sample is considered. The circle represents the mean value for central galaxies with aMpeak ∼ 1.
Fig. 5. Difference between the emulator prediction and the real mock weighted by the scaled TNG300 error for the highest number density (
Fig. 6. Projected correlation function (wp), monopole, and quadrupole of the galaxies of the TNG300 simulation (black, 1σ interval in grey) and the model (red) for an SFR-selected subsample with number densities n = 10−2 (solid), 10−2.5 (dashed), 10−3 (dash-dotted), and 10−3.5 (dotted) h3 Mpc−3 at z = 1. The three lower number densities have been shifted on the upper panel along the y-axis for visualisation purposes. The lower panel shows the relative difference between the fits and the model normalised by the uncertainty on the measurement on TNG300, with the grey-shaded region indicating 1σ. The green-shaded region (r < 0.6 h−1 Mpc) indicates the region not considered in the parameter fitting.
Fig. 7. Same as Figure 6 but for galaxies from the L-Galaxies semi-analytical model (black, 1σ interval in grey) instead of the TNG300 simulation. The mock fit is shown in blue.
Fig. 8. Same as Figure 6 but for galaxies applying the colour cuts from DESI (see Table 2) to TNG300 galaxies at z = 1 (
Fig. 9. Halo occupation distribution for galaxies in TNG300 (black) and the SHAMe-SF mocks (red) for z = 0 and number densities n = 10−2, 10−2.5, and 10−3 h3 Mpc−3. The SHAMe-SF parameters were fitted to reproduce the clustering predictions. The shaded region represents the 1σ confidence interval (solid for centrals and galaxies; circle-texture for centrals).
Fig. 10. Galaxy assembly bias for galaxies in TNG300 (black) and the SHAMe-SF mocks for z = 0 (left) and z = 1 (right) and number densities n = 10−2 (top) and 10−2.5 h3 Mpc−3 (bottom). The SHAMe-SF parameters were fitted to reproduce the clustering predictions (red). The shaded region represents the 1σ confidence interval.
Fig. B.1. χ2 values of three different models computed using RF with a different number of input properties (colour coding). We show the results for the monopole and quadrupole for number densities,
Fig. C.1. Same as Figure 6 but for galaxies (TNG300 and mock sample) at z = 0.
Fig. D.1. Same as the top and middle panel of Figure 3 but for SAM galaxies at z = 0 (left) and 1 (right).
Fig. D.2. Same as Figure 6 but for SAM galaxies at z = 0.
Fig. D.3. Halo occupation distribution for galaxies in TNG300 (black) and the SHAMe-SF mocks (blue) for z = 0 (top) and z = 1 (bottom) and number densities n = 10−2, 10−2.5, and 10−3 h3 Mpc−3. The SHAMe-SF parameters were fitted to reproduce the clustering predictions. The shaded region represents the 1σ confidence interval (circle-textured for centrals).