Multicomponent, multiwavelength benchmarks for source- and filament-extraction methods

Modern multiwavelength observations of star-forming regions that reveal highly structured molecular clouds require adequate extraction methods that provide both detection of the structures and their accurate measurements. The omnipresence of filamentary structures and their physical connection to prestellar cores demand methods that are able to disentangle and extract both sources and filaments. It is fundamentally important to test all extraction methods to compare their detection and measurement qualities and fully understand their capabilities before their scientific applications. A recent publication described getsf, the new method for source and filament extraction that employs the separation of the structural components, a successor to getsources, getfilaments, and getimages (collectively referred to as getold). This new paper describes a detailed benchmarking of both getsf and getold using two multicomponent, multiwavelength benchmarks resembling the Herschel observations of the nearby star-forming regions. Each benchmark consists of simulated images at six Herschel wavelengths and one additional surface density image with a 13 arcsec resolution. The structural components of the benchmarks include a background cloud, a dense filament, hundreds of starless and protostellar cores, and instrumental noise. Five variants of benchmark images of different complexity are used to perform the source and filament extractions with getsf and getold. A formalism for evaluating source detection and measurement qualities is presented, allowing quantitative comparisons of extraction methods in terms of their completeness, reliability, and goodness, as well as the detection and measurement accuracies and the overall quality. A detailed analysis shows that getsf has better qualities than getold and that the best choice for source detection is the high-resolution surface density.


Introduction
Extraction methods are critically important research tools, interfacing astronomical imaging observations with their analyses and physical interpretations. Many different methods have been applied in various studies of star formation in recent decades to extract sources and filaments and derive their physical properties. The launch of the Herschel Space Observatory stimulated the development of a number of new source-extraction methods, for example, cutex (Molinari et al. 2011), getsources (Men'shchikov et al. 2012), csar (Kirk et al. 2013), and fellwalker (Berry 2015). Ubiquitous filamentary structures observed with Herschel prompted the creation of several filament-extraction methods, for example, disperse (Sousbie 2011), getfilaments (Men'shchikov 2013), a Hessian matrix-based method (Schisano et al. 2014), rht (Clark et al. 2014), filfinder (Koch & Rosolowsky 2015), and tm (Juvela 2016). Most of the methods provide solutions to the problem of detecting sources or filaments, whereas a complete extraction entails their accurate measurements, for which knowledge of their backgrounds is necessary. However, the backgrounds of sources and filaments embedded in the complex, filamentary molecular clouds that strongly fluctuate on all spatial scales are highly uncertain, which induces increasingly larger measurement errors for fainter structures.
The methods employ very different approaches, and it is quite reasonable to expect the qualities of their results obtained for the same observed image to be dissimilar. Experience shows that various methods perform differently on increasingly complex images, although they tend to show more comparable results when tested on the simplest images. It seems unlikely that various independent tools would provide the same or consistent results in terms of detection completeness, number of false positive (spurious) detections, and measurement accuracy. The various uncalibrated tools applied in different studies have the potential to bring about contradictory results and wrong conclusions and to create serious long-term confusion in our understanding of the observed astrophysical reality.
It is highly important to benchmark the extraction methods before their astrophysical applications. Although new extraction methods are usually validated before publication on either observed or simulated images, the test images are different for each method, have dissimilar components and complexity levels, and are not always available for independent evaluations and future comparisons with other tools. The validation images, used to test older methods at the time of their publication, are unlikely to resemble the higher complexity observed with new telescopes that have improved angular resolution, sensitivity, and dynamic range. Before the use of the older tools for such improved generations of images, their performance must be reevaluated and compared with other methods on newer images that resemble the new observations. New methods must also be tested on the same set of images to demonstrate their advantages over the older methods.
Comparisons of extraction methods using observed images cannot be conclusive. Only proper benchmarks would be able to reveal the true qualities and capabilities of the extraction tools. In this paper, the term "benchmark" refers to a standard multiwavelength set of simulated images with fully known properties of all their components, resembling a certain type of observed image in their components and complexity. To benchmark extraction methods means to run them on the simulated images without any knowledge of the model parameters, as if such images were the true observed images. Subsequent comparisons of the resulting extraction catalogs with the truth catalogs, using a reasonable set of quality estimators, would determine their detection and measurement qualities, inaccuracies, and biases. It would be highly desirable for various studies to use the extraction tool that shows the best performance in benchmarks, to exclude any discrepancies caused by different methods. Notwithstanding that such an approach is sometimes practiced within research consortia, it does not solve the problem entirely, because the results and conclusions derived for the same images by independent groups with completely different tools would still likely be incompatible.
Systematic benchmarking of different extraction methods to guide researchers in their selection of the most appropriate tool for their star-formation studies is hard to find in the literature. A quantitative benchmarking of eight source-extraction methods, referred to by Men'shchikov et al. (2012), was instrumental in the selection of the best tool to apply for the Herschel Gould Belt Survey (HGBS, André et al. 2010) and Herschel Imaging Survey of OB Young Stellar Objects (HOBYS, Motte et al. 2010), but that work remains unpublished. It would not make sense to publish the old results now, because some of the methods have been improved over the years, while others have become outdated and are no longer used for modern, complex images. Any publication of benchmarking results for a selection of extraction tools might quickly lose its value, because it cannot include any improved and newly developed methods. In this work, a completely different approach was taken.
A recent publication (Men'shchikov 2021, hereafter referred to as Paper I) presented a multicomponent, multiwavelength benchmark resembling the images observed by Herschel in starforming regions. The benchmark images contain a realistic filamentary cloud and hundreds of starless and protostellar cores computed by radiative transfer modeling. Fully known properties of all components allow conclusive comparisons of different methods by evaluating their extraction completeness, reliability, and goodness, along with the detection and measurement accuracies. The benchmark images, together with the truth catalogs, are made publicly available and proposed as the standard benchmark for existing and future extraction methods.
Besides the benchmark, Paper I presented getsf, the multiscale, multiwavelength source- and filament-extraction method, replacing the older getsources, getfilaments, and getimages algorithms (Men'shchikov et al. 2012; Men'shchikov 2013, 2017); throughout this paper, the three predecessors of getsf are collectively named getold. The new method handles both sources and filaments consistently, separating the structural components from each other and from their backgrounds, thereby facilitating their extraction problem. The method produces flattened detection images with uniform levels of the residual background and noise fluctuations, which allows the use of global thresholds for detecting the structures. Independent information contained in the multiwaveband images is combined in the detection images, preserving the higher angular resolutions. Properties of the detected sources and filaments are measured in their background-subtracted images and cataloged. This paper presents benchmarking results for source and filament extraction with getsf and for source extraction with getold, using the new benchmark from Paper I and the old benchmark from Men'shchikov et al. (2012). Instead of describing benchmarking results for an arbitrary selection of existing source-extraction tools, this paper provides researchers in star formation with an extraction quality evaluation system and the source-extraction results obtained with getsf and getold for five variants of the benchmarks with increasing complexity levels. Such an approach enables researchers to benchmark any number of source-extraction tools of their choice and evaluate improved or newly developed methods in the future. It is not unusual that researchers prefer to conduct their own benchmarking and analysis, which often is more convincing.
Extraction of filaments is more problematic than extraction of sources. Filaments are observed as two-dimensional projections that are hard to decipher and relate to their complex three-dimensional structure. Their appearance, identification, and measurements depend on the spatial scales of interest (cf. Sect. 3.4.5 in Paper I), and they usually contain sources that are either formed within the filaments or appear on them in projection. They are often heavily curved and blended, but no filament deblending algorithm is available, and their physically meaningful lengths and masses are hard to determine. Deferring these difficult problems to future dedicated studies, this paper presents the benchmark filament extraction with getsf. No such results are presented for getold, because this method was unable to reconstruct the filament with any acceptable level of accuracy.
Section 2 summarizes all properties of the old and new multiwavelength benchmarks. Section 3 introduces a system of quantities for evaluating performances of source-extraction methods. Section 4 presents the benchmarking results for several variants of the benchmark. Section 6 concludes this work.
Following Paper I, images are represented by capital calligraphic characters (e.g., A, B, C) and software names and numerical methods are typeset slanted (e.g., getsf) to distinguish them from other emphasized words. The curly brackets {} are used to collectively refer to either of the characters, separated by vertical lines. For example, {a|b} refers to a or b, and {A|B}_{a|b}c expands to A_{a|b}c or B_{a|b}c, as well as to A_ac, A_bc, B_ac, or B_bc.

Benchmark A
The multicomponent, multiwavelength benchmark, described by Men'shchikov et al. (2012), was constructed in 2009, before the launch of Herschel, at slightly nonstandard wavelengths (λ of 75, 110, 170, 250, 350, and 500 µm). The images on a 1800 × 1800 grid of 2″ pixels cover 1° × 1° or 2.4 pc at a distance D = 140 pc. They include three independent structural components: the background B_λ, sources S_λ, and small-scale instrumental noise N_λ.
The backgrounds B_λ were computed from a synthetic scale-free image D_B. The image was scaled at each wavelength to the typical intensities of molecular clouds in the nearby star-forming regions, adopting a planar image of dust temperatures decreasing from 20 to 15 K between the upper-left and lower-right corners, with a constant value of 17.5 K along the other diagonal.
The component S_λ of sources was computed from the radiative transfer models of starless and protostellar cores with a range of masses from 0.01 to 6 M_⊙ and half-maximum sizes from ∼0.001 to 0.1 pc. The individual model images of 360 starless and 107 protostellar cores were distributed quasi-randomly, preferentially in the brighter areas of the background B_λ, allowing them to overlap without any restrictions. A broken power-law function with the slopes dN/dlog10M of 0.3 for M ≤ 0.08 M_⊙, −0.3 for 0.08 < M ≤ 0.5 M_⊙, and −1.3 for M > 0.5 M_⊙ was used to determine the numbers of models per mass bin δlog10M ≈ 0.1.
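The quoted broken power law fully determines the relative numbers of model cores per mass bin, up to a normalization. As a rough illustration only (not the actual benchmark generator; the function names, normalization, and bin edges are assumptions), the per-bin counts could be computed as:

```python
import numpy as np

def dn_dlogm(mass, breaks=(0.08, 0.5), slopes=(0.3, -0.3, -1.3)):
    """Piecewise power-law mass function dN/dlog10(M) with the slopes
    quoted in the text, made continuous at the break masses (in Msun)."""
    m1, m2 = breaks
    s1, s2, s3 = slopes
    c2 = m1 ** (s1 - s2)            # continuity factor at the first break
    c3 = c2 * m2 ** (s2 - s3)       # continuity factor at the second break
    m = np.asarray(mass, dtype=float)
    return np.where(m <= m1, m ** s1,
           np.where(m <= m2, c2 * m ** s2, c3 * m ** s3))

def counts_per_bin(n_total, m_min=0.01, m_max=6.0, dlogm=0.1):
    """Integer numbers of model cores per mass bin of width dlog10(M) ~ 0.1,
    normalized so that the counts sum to approximately n_total."""
    edges = 10.0 ** np.arange(np.log10(m_min), np.log10(m_max) + dlogm, dlogm)
    centers = np.sqrt(edges[:-1] * edges[1:])     # geometric bin centers
    weights = dn_dlogm(centers)
    counts = np.round(n_total * weights / weights.sum()).astype(int)
    return centers, counts
```

Continuity at the break masses is enforced by the factors c2 and c3, so the mass function has no jumps where the slope changes.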
The final benchmark images I_λ were obtained by adding different realizations of the random Gaussian noise N_λ at 75, 110, 170, 250, 350, and 500 µm and convolving them to the slightly nonstandard Herschel resolutions O_λ of 5″, 7″, 11″, 17″, 24″, and 35″. In this paper, the set of benchmark images is extended with an additional image I ≡ D_11 of surface density at a high angular resolution O_H = 11″, derived from I_λ at 170−500 µm using the algorithm hires described in Sect. 3.1.2 of Paper I.

Benchmark B
The multicomponent, multiwavelength benchmark from Paper I is based on images of a simulated star-forming region at a distance D = 140 pc. The images in all Herschel wavebands (λ of 70, 100, 160, 250, 350, and 500 µm) on a 2690 × 2690 grid of 2″ pixels cover 1.5° × 1.5° or 3.7 pc. They include emission of four independent structural components: the background cloud B_λ, long filament F_λ, round sources S_λ, and small-scale instrumental noise N_λ. A sum of the first two components, C_λ, represents the emission of the filamentary background.
The benchmark images were computed from the adopted surface densities and dust temperatures of the structural components (Figs. 2−4 of Paper I). The background cloud D_B from Benchmark A was scaled to produce surface densities N_H2 from 1.5 × 10^21 to 4.8 × 10^22 cm^-2 and fluctuation levels differing by two orders of magnitude in its diffuse and dense areas. The spiral filament D_F has a crest density of N_0 = 10^23 cm^-2, a full width of W = 0.1 pc (150″) at half maximum (FWHM), and a power-law profile N_H2(θ) ∝ θ^-3 at large distances θ from the crest. The filament is self-touching, because the two sides of the tightly curved spiral touch each other (Fig. 13), but the filament is not self-blending: there is no additive mutual contribution of the two sides. This allows the benchmark filament to have unaltered radial profiles on both sides, to test the ability of extraction methods to reproduce the profiles without any filament deblending algorithm. The filament mass M_F = 3.04 × 10^3 M_⊙ and length L_F = 10.5 pc correspond to the linear density Λ_F = 290 M_⊙ pc^-1.
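A Moffat (Plummer-like) function is one convenient form that reproduces all the quoted filament properties at once (the crest density, the FWHM, and the θ^-3 wings); whether Paper I uses exactly this parameterization is an assumption of this sketch:

```python
import numpy as np

# Assumed Moffat/Plummer-like radial profile with the quoted properties:
# crest density N0 = 1e23 cm^-2, FWHM W = 0.1 pc, theta^-3 wings.
N0 = 1.0e23          # cm^-2, crest column density
W = 0.1              # pc, full width at half maximum
ZETA = 1.5           # 2*ZETA = 3 gives the theta^-3 asymptotic slope

def filament_profile(theta_pc):
    """Column density (cm^-2) at distance theta (pc) from the filament crest.
    The (2^(1/ZETA) - 1) factor places the half-maximum exactly at W/2."""
    x = 2.0 * np.asarray(theta_pc, dtype=float) / W   # theta in HWHM units
    return N0 * (1.0 + (2.0 ** (1.0 / ZETA) - 1.0) * x ** 2) ** (-ZETA)
```

With ζ = 3/2 the profile falls off as θ^-3 far from the crest and equals N0/2 at θ = W/2 by construction.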
The resulting surface densities D_C = D_B + D_F of the filamentary cloud are in the range from 1.7 × 10^21 to 1.4 × 10^23 cm^-2. The dust temperatures T_C have values from 15 K in the densest central areas of the filamentary cloud to 20 K in its diffuse outer parts. The surface densities D_C and temperatures T_C were used to compute the cloud images C_λ in all Herschel wavebands, assuming optically thin dust emission.
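For reference, optically thin dust intensities of the kind used to convert D_C and T_C into the images C_λ amount to a modified blackbody per pixel; the dust opacity law and the gas mass per H2 below are illustrative assumptions, not the values adopted in Paper I:

```python
import numpy as np

# CGS constants: Planck, Boltzmann, speed of light
H = 6.626e-27
K = 1.381e-16
C = 2.998e10
MU_H2 = 2.8 * 1.673e-24   # g, gas mass per H2 molecule (assumed mu = 2.8 m_H)

def planck_nu(nu, temp):
    """Planck function B_nu(T) in erg s^-1 cm^-2 Hz^-1 sr^-1."""
    return 2.0 * H * nu ** 3 / C ** 2 / np.expm1(H * nu / (K * temp))

def thin_intensity(n_h2, temp, wavelength_um,
                   kappa0=0.1, lam0_um=300.0, beta=2.0):
    """Optically thin intensity I_nu = B_nu(T) * kappa_nu * Sigma for a pixel
    with column density n_h2 (cm^-2).  The opacity law kappa_nu =
    kappa0 * (lam0/lam)^beta (cm^2 per gram of gas) is an assumed sketch."""
    nu = C / (wavelength_um * 1.0e-4)
    kappa = kappa0 * (lam0_um / wavelength_um) ** beta
    return planck_nu(nu, temp) * kappa * MU_H2 * n_h2
```

The intensity scales linearly with the column density in the optically thin limit, which is what makes the multiband fitting of pixel spectra (used later for the hires surface densities) possible.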
The component D_S of sources was computed from radiative transfer models of starless and protostellar cores, very similar to those in Benchmark A, in a wide range of masses (from 0.05 to 2 M_⊙) and half-maximum sizes (from ∼0.001 to 0.1 pc). Individual surface density images of the models of 828 starless and 91 protostellar cores were distributed in the dense areas (N_H2 ≥ 5 × 10^21 cm^-2) of the filamentary cloud D_C. They were added quasi-randomly, without overlapping, at positions where their peak density exceeded the local N_H2 value of the cloud. A power-law function with a slope dN/dlog10M of −0.7 was used to define the numbers of models per mass bin δlog10M ≈ 0.1.
This resulted in the surface densities D_S, the intensities S_λ of sources, and the emission C_λ + S_λ of the simulated star-forming region. The complete benchmark images I_λ were obtained by adding different realizations of the random Gaussian noise N_λ at 70, 100, 160, 250, 350, and 500 µm and convolving the images to the Herschel angular resolutions O_λ of 8.4″, 9.4″, 13.5″, 18.2″, 24.9″, and 36.3″, respectively. The set of benchmark images is extended with an additional image I ≡ D_13 of surface density at a high angular resolution O_H = 13.5″, derived from I_λ at 160−500 µm using the algorithm hires described in Sect. 3.1.2 of Paper I.

Quality evaluation system for source extractions
For comparisons of different source-extraction methods using benchmarks, it is necessary to define several quantities that would evaluate an extraction quality by comparing the positions of detected sources and their measured properties with the true values. Such a formalism was developed by the author in collaboration with Ph. André a decade ago (2010, unpublished) and used to compare getsources with seven other methods (listed in Sect. 1.1 of Men'shchikov et al. 2012). That quality evaluation system has been slightly improved and is now described below and applied to assess performances of getsf and getold in the benchmark extractions. Source extraction methods can be quantitatively compared with each other, using the definitions below and the truth catalogs of the benchmarks.
It is convenient to denote N_T the true number of sources in a benchmark, N_Dλ the number of detected sources (acceptable at wavelength λ) whose peak coordinates match those of the model sources from the truth catalog, N_Gλ the number of sources among N_Dλ that have good measurements, and N_Sλ the number of spurious sources, that is, the acceptable detections that do not have any positional match in the truth catalog. A measurement is considered good if the measured quantities (fluxes, sizes) are within a factor of 2^{1/2} of their true model values; otherwise, the measurement is regarded as bad and the corresponding number of bad sources is N_Bλ = N_Dλ − N_Gλ.
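The bookkeeping of N_Dλ, N_Gλ, N_Bλ, and N_Sλ reduces to a simple classification of the catalog entries; a minimal sketch, in which the data layout and function names are assumed:

```python
import math

GOOD_FACTOR = math.sqrt(2.0)   # measurements within 2^(1/2) of truth are "good"

def classify(matched_ratios, n_unmatched):
    """matched_ratios: one dict per positionally matched detection, holding
    measured/true ratios of its quantities (fluxes, sizes).  n_unmatched:
    detections with no positional match in the truth catalog (spurious).
    Returns (N_D, N_G, N_B, N_S) for one wavelength."""
    n_d = len(matched_ratios)
    n_g = sum(1 for r in matched_ratios
              if all(1.0 / GOOD_FACTOR <= v <= GOOD_FACTOR for v in r.values()))
    return n_d, n_g, n_d - n_g, n_unmatched
```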
In the multiwavelength extraction catalogs, sources can be prominent in one waveband and completely undetectable or not measurable in another one. In the above definitions, a source n is deemed acceptable at wavelength λ if it satisfies the set of conditions in Eq. (1), where Ξ_λn is the source detection significance, Γ_λn is the source goodness, Ω_λn and Ψ_λn are the signal-to-noise ratios related to the peak intensity F_Pλn and total flux F_Tλn, respectively (cf. Eqs. (41) and (42) of Paper I), {A|B}_λn are the source FWHM sizes, and A_Fλn is the major diameter of the source footprint. The last inequality discards sources with unrealistically small ratios A_Fλn/A_λn of their footprint and half-maximum sizes. The empirical set of conditions in Eq. (1) ensures that the selected subset of sources is reliable (not contaminated by significant numbers of spurious sources) and that the selected sources have acceptably accurate measurements.
With the above definitions of N_T and N_{D|G|S}λ, it makes sense to define the source extraction completeness C_λ = N_Dλ/N_T, goodness G_λ = N_Gλ/N_Dλ, and reliability R_λ (Eq. (2)), where R_λ has been updated with respect to the original version of the system, in which it was defined as 1/N_Sλ. The newly defined reliability is the Moffat (Plummer) profile (cf. Eq. (2) in Paper I)

R_λ = [1 + (2^{1/ζ} − 1) (N_Sλ/Θ)^2]^{−ζ},

with Θ = 0.05 N_Dλ and ζ = 1/2. It has a Gaussian-like peak at N_Sλ = 0, slowly descends to 0.5 when 5% of N_Dλ are spurious sources, and decreases as 1/N_Sλ for N_Sλ ≫ 0.05 N_Dλ.

It is useful to compute the ratios of the measured quantities to their true model values for each acceptable source and evaluate their mean values among the N_Gλ sources with good measurements (Eq. (3)), where E_λ evaluates the accuracy of the source area. The mean ratios with their standard deviations σ_{P|T|A|B}λ can be used to define the qualities of the measured source parameters (Eq. (4)). Denoting δ_Dλ the mean distance of the well-measurable sources from the true model peaks and σ_Dλ the corresponding standard deviation, the positional quality is defined by Eq. (5). It is convenient to define the detection quality Q_CRλ and the measurement quality Q_PTEλ, combining the qualities related to the independent source detection and measurement steps, as well as the overall quality Q_λ of a source extraction (Eq. (6)).

The quantities defined by Eqs. (2)−(6) have values in the range [0, 1] that become unity for an imaginary perfect extraction tool that would extract all simulated sources and measure their parameters with no deviations from the true model values. The absolute values of the quantities are arbitrary and meaningless for a single extraction with a single method. The values become quite useful, however, when comparing the relative extraction qualities of two or more methods or of several extractions with a single method (with different parameters). The quality evaluation system represented by Eqs. (2)−(6) is not unique, and other formalisms might be devised and applied to the benchmark truth catalogs and the getsf and getold extraction catalogs found on the benchmarking page of the getsf website.
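Under the stated properties of the updated reliability (unity at N_Sλ = 0, a value of 0.5 when N_Sλ = 0.05 N_Dλ, and a 1/N_Sλ decline beyond that), together with the natural definitions of completeness and goodness, the counting-based estimators can be sketched as follows; this is an assumed reading of Eq. (2), not a verbatim transcription:

```python
def completeness(n_d, n_t):
    """C = N_D / N_T: fraction of true sources that were detected."""
    return n_d / n_t

def goodness(n_g, n_d):
    """G = N_G / N_D: fraction of detected sources with good measurements."""
    return n_g / n_d if n_d else 0.0

def reliability(n_s, n_d, zeta=0.5):
    """Moffat (Plummer) reliability: 1 at N_S = 0, exactly 0.5 at
    N_S = 0.05 * N_D (the half-maximum point Theta), and ~1/N_S for
    N_S >> Theta when zeta = 1/2."""
    theta = 0.05 * n_d
    if n_s == 0:
        return 1.0
    return (1.0 + (2.0 ** (1.0 / zeta) - 1.0) * (n_s / theta) ** 2) ** (-zeta)
```

The (2^{1/ζ} − 1) factor is what pins the half-maximum of the Moffat profile exactly at Θ, matching the quoted 0.5 value at 5% spurious sources.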

Benchmarking
The benchmark names with subscripts are used to indicate the number of structural components. For example, Benchmark A_3 contains three components (background, sources, and noise) and Benchmark B_4 has four components (background, filament, sources, and noise). There are also three simpler variants of the benchmarks: B_3 has no filament, and A_2 and B_2 have no background. Below, the source extractions in {A, B}_2, {A, B}_3, and B_4 are presented in a sequence of their increasing complexity, followed by the filament extraction in B_4.
The simplest benchmarks {A, B}_2 (Figs. 1 and 2) contain only two components, the model cores and noise. Most sources are clearly visible against the noise, and therefore they must be uncomplicated to detect for a variety of extraction methods. The model sources have a wide range of FWHM sizes, from the angular resolution O_λ up to A_λn ≈ 200″; therefore, methods that limit the largest sizes of extractable sources to only a few beams are expected to miss many larger sources. The resolved models and real objects have non-Gaussian intensity distributions; therefore, methods that assume all sources have Gaussian shapes are expected to produce less accurate measurements.
The benchmark variants {A, B}_3 (Figs. 3 and 4) contain three components (background, sources, and noise), adding fluctuating backgrounds to the sources and uniform noise of {A, B}_2. The background fluctuations in A_3 are similar in both diffuse and dense areas, whereas in B_3 they progressively increase in the denser areas. In the presence of the background clouds, more of the sources are expected to remain undetected and possibly more spurious sources to become cataloged. Extraction methods may perform well in A_3 with its relatively simple background, but some of them would experience greater problems in B_3. The benchmarks could present serious problems to those extraction tools that are not designed to handle complex backgrounds.
The most complex variant B_4 (Fig. 5) contains four components (background, filament, sources, and noise), adding the dense spiral filament to the structural components of B_3. The filamentary background of the sources becomes much denser and acquires markedly different anisotropic properties (e.g., along the filament crest and in the orthogonal directions), in addition to the strong and nonuniform background fluctuations of B_3. Better resembling the complexity of the interstellar clouds revealed by the Herschel observations, it further complicates the source extraction problem. Among all benchmark variants, the largest numbers of model sources are expected to vanish in the filamentary background cloud of B_4.
In extractions with getsf and getold, it is necessary to determine the structures of interest to extract and specify their maximum sizes {X|Y}_λ for each waveband: 25″, 30″, 150″, 150″, and 150″ for the Herschel wavebands and 150″ for the surface density image. In Benchmark B, the maximum size Y_λ for filaments was 350″ for all images. The getold extractions followed an improved scheme (Men'shchikov 2017): all benchmark images were first processed by getimages (using the above maximum sizes), which subtracted their large-scale backgrounds and flattened the residual background and noise fluctuations. The background-subtracted and flattened images were then used in the getsources extractions.

Source extractions in Benchmarks A and B
In the standard approach to the multiwavelength benchmarking adopted in this paper, sources are detected in the wavelength-independent images combined from all (seven) wavelengths. Effects of different combinations of images for source detection on extraction qualities are discussed in Sect. 4.1.4. In the analysis of the extractions, all acceptable sources from the catalogs were positionally matched with the truth catalogs using stilts (Taylor 2006). The matching radius was essentially a quadratic mean of the angular resolution O_λ and the true FWHM size of the model core. The extracted sources with positions within the matching circles were considered the matches to the true model cores. Only those of them with measurement errors within a factor of 2^{1/2} were evaluated in Tables C.1−C.2 according to the system outlined in Sect. 3. For plotting the ratios of the measured and true parameters (cf. Figs. 6−10), the sources with measurement errors within a factor of 10 were used.

The sources are relatively isolated at the higher resolutions of λ < 160 µm and blended with the other sources (A_2 and A_3) or backgrounds ({A, B}_3 and B_4) at the lower resolutions of the longer wavelengths. The starless cores with temperatures T_D ≲ 10 K produce little emission at λ ≲ 100 µm; therefore, only the protostellar cores are extractable at the short wavelengths. Although the starless sources appear stronger at λ > 170 µm, progressively lower resolutions spread their emission over larger footprints, which makes interpolation of the fluctuating background less accurate. The strongly overlapping sources in the crowded areas of Benchmark A become more heavily blended with each other and with their background, which makes their deblending less accurate. The backgrounds and true extents of the footprints of such blended sources are difficult, if not impossible, to determine reliably. Their footprints often become overestimated and the backgrounds underestimated, which leads to excessively large measured fluxes.
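The positional matching step described above can be sketched without stilts as follows; taking the matching radius literally as the quadratic mean of the beam and the true core FWHM is an assumption, and the greedy nearest-neighbour pairing below is only one simple strategy:

```python
import math

def match_radius(o_lambda, a_true):
    """Quadratic mean of the angular resolution and the true FWHM size
    (assumed form of the text's 'essentially a quadratic mean')."""
    return math.sqrt(0.5 * (o_lambda ** 2 + a_true ** 2))

def match(detections, truth, o_lambda):
    """Greedy one-to-one match of detections (x, y) against truth entries
    (x, y, fwhm); a pair is accepted only within the per-source radius.
    Returns a list of (truth_index, distance) for matched detections."""
    matched, used = [], set()
    for dx, dy in detections:
        best, best_d = None, float("inf")
        for i, (tx, ty, fwhm) in enumerate(truth):
            if i in used:
                continue
            d = math.hypot(dx - tx, dy - ty)
            if d < best_d and d <= match_radius(o_lambda, fwhm):
                best, best_d = i, d
        if best is not None:
            used.add(best)
            matched.append((best, best_d))
    return matched
```

Detections left unmatched by this procedure would be counted as spurious (N_Sλ), and the matched ones feed the good/bad measurement statistics.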
For other sources that are largely isolated, the measurements of fluxes, sizes, and positions are usually more accurate. Increasing numbers of overlapping sources at the lower resolutions of longer wavelengths degrade the quality of their backgrounds further, because much more distant source-free pixels have to be used in the background interpolation. Figures 1 and 3 demonstrate several difficult cases, when one or more narrow sources appear on top of a much wider, well-resolved starless source, referred to as a sub-structured source. If the narrow source is located close to the peak of the wide source, it is practically impossible for an automated extraction method to distinguish these two sources. Depending on the intensity distributions, the wide source may be regarded as the background of the narrow source, remaining unextracted, or it may be considered as belonging to the power-law outskirts of the narrow source. More often, such narrow sources are located off-peak of the wide source, hence they can be detected as separate sources. In both cases, however, the benchmarks reveal that their measurements are inaccurate, because of the incorrectly determined individual backgrounds of each source and the approximate nature of their deblending. Backgrounds of sources are highly uncertain (cf. Appendix A) and it is not surprising that they are even less accurate for the blended sources.
An inspection of Figs. 1−5 reveals several spurious sources, those that do not exist in the benchmarks. The spurious detections are partially or completely discarded from the final catalogs during measurements by the acceptability criteria in Eq. (1). Some spurious sources are found on the well-resolved starless sources, whose large-scale intensity peak enhances the small-scale background and noise fluctuations, making them appear as real sources. When a source extraction aims at the highest possible completeness, at finding the faintest sources, it is normal that some peaks, produced by the background and noise fluctuations, are mistakenly identified as genuine sources. A good source extraction method must, however, guarantee that the number N_Sλ of spurious sources in the final catalog remains below a few percent of the number of real sources N_Dλ. For some studies, it may be beneficial to require that a valid source must be detected and acceptable in at least two wavebands. This strategy potentially removes most of the spurious sources, unfortunately together with some real sources. It is better not to apply such a condition when benchmarking source extraction methods, because practical applications often require extractions in a single image.
In the surface density images D_{11|13} in Figs. 1 and 2, the footprints of several unresolved peaks of quite extended protostellar cores appear too small. They correspond to just the unresolved central peaks and not to the entire large cores with their power-law profiles. The same sources have large extended footprints at {160|170} and 500 µm and in the benchmarks with background. This abnormality is caused by the derivation algorithm of the images D_{11|13}, which employs fitting of the spectral shapes Π_λ of the pixels. The surface densities are known to be quite inaccurate in the pixels with strong temperature gradients along the lines of sight (e.g., Appendix A of Paper I). Such fitting problems lead to overestimated temperatures and underestimated surface densities around the unresolved peaks. The resulting strong depressions (local minima) around the peaks of several protostellar cores in D_{11|13} prevent the extraction methods from finding the correct footprint sizes. This happens only in the simplest benchmarks {A, B}_2 with just two components (sources and noise), because the bright emission of the background and filament dilutes the temperature effect along the lines of sight within the cores.

Measurement accuracies
Figures 6−10 display the measurement accuracies of the peak intensity F_Pλn, integrated flux F_Tλn, and sizes {A, B}_λn for each acceptable source n, represented by the ratios of their measured and true values, as functions of their S/N ratios Ω_λn and true FWHM sizes A_λnT. The accuracy plots are not shown for λ < 160 µm, because only the bright protostellar cores are extractable in those images and their measurements are quite precise, with errors well below 1%. The measurement results in the derived D_{11|13} are not shown either, because they are known to be inaccurate (e.g., Appendix A of Paper I). Some of the starless cores become measurable at {160|170} µm as faint sources with Ω_λn ≲ 10, values well below the S/N of the bright protostellar cores. The faintness of the starless cores with respect to the background and noise fluctuations makes measurements of some of them inaccurate, with a large spread of errors in total fluxes, exceeding a factor of 2^{1/2}. Toward the longer wavelengths (250−500 µm), the starless cores become brighter, whereas the protostellar cores become fainter, making their Ω_λn ranges overlap for the two populations of sources.
Figures 6 -10 reveal that getold systematically underestimates the FWHM sizes {A, B} λn of sources by ∼ 20%. The problem is most clearly visible for the well-resolved sources, because for the slightly resolved or unresolved sources getold adjusts the underestimated values {A, B} λn < O λ by setting them to the angular resolution O λ . The main reason for the systematic deficiency is the size estimation algorithm that uses the source intensity moments, which can only be accurate for Gaussian sources. In most practical applications, however, there are no Gaussian-shaped sources, for several reasons. Firstly, the point-spread functions (PSFs, beams) of telescopes are often non-Gaussian in their lower parts, which affects the shapes of mostly the unresolved sources. Secondly, the radiative transfer models of the starless and protostellar cores (Sect. 2) suggest that the real physical cores produce non-Gaussian intensity profiles, thereby affecting the shapes of mostly the resolved sources. Finally, the backgrounds of sources in bright fluctuating molecular clouds cannot be determined accurately, hence non-negligible over- or under-subtraction of the background of even the Gaussian sources would create non-Gaussian shapes, in both resolved and unresolved cases. The background of extracted sources is often overestimated, hence the intensity moments of the background-subtracted source would underestimate {A, B} λn . For the protostellar cores that have power-law intensity profiles at large radii, the intensity moments algorithm leads to strongly overestimated half-maximum sizes, whereas for the starless cores with flat-topped shapes, the intensity moments could significantly underestimate the half-maximum sizes (cf. Sect. 3.4.6 of Paper I).
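The biases of moment-based sizes for non-Gaussian shapes can be demonstrated with a short numerical sketch. The 1D profiles below are purely illustrative (they are not the radiative transfer models of Sect. 2, nor the getold implementation); the moment estimator converts the second intensity moment into a Gaussian-equivalent FWHM:

```python
import numpy as np

def fwhm_from_moments(x, I):
    """Gaussian-equivalent FWHM from the intensity second moment
    (exact only for a Gaussian profile)."""
    m0 = I.sum()
    mu = (x * I).sum() / m0
    var = ((x - mu) ** 2 * I).sum() / m0
    return np.sqrt(8.0 * np.log(2.0) * var)

def fwhm_direct(x, I):
    """FWHM measured directly at the half-maximum intensity level."""
    inside = x[I >= I.max() / 2.0]
    return inside.max() - inside.min()

# Illustrative 1D profiles within a footprint of +-50 (arbitrary units):
x = np.linspace(-50.0, 50.0, 100001)
gauss = np.exp(-4.0 * np.log(2.0) * (x / 10.0) ** 2)  # true FWHM = 10
flat_top = 1.0 / (1.0 + (x / 7.0) ** 8)               # flat-topped shape
power_law = 1.0 / (1.0 + (x / 5.0) ** 2)              # power-law wings

for name, I in [("Gaussian", gauss), ("flat-top", flat_top), ("power-law", power_law)]:
    print(f"{name:10s} moments: {fwhm_from_moments(x, I):6.1f}  direct: {fwhm_direct(x, I):6.1f}")
```

For the Gaussian, the two estimates agree; for the flat-topped profile, the moments underestimate the half-maximum size, and for the power-law wings they strongly overestimate it, in line with the behavior discussed above.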
Figures 6 -10 demonstrate that getsf does not have such systematic problems with the FWHM sizes {A, B} λn of sources. This is because getsf evaluates them directly at the half-maximum intensity (Sect. 3.4.6 of Paper I), unlike getold, which employs the source intensity moments. Direct measurements are much less affected by the background inaccuracies, but over-subtracted backgrounds of the (almost) unresolved sources could also lead to unrealistically small {A, B} λn < O λ and underestimated fluxes F Pλn and F Tλn . The sizes and fluxes of such unresolved or slightly resolved sources are rectified by getsf using the correction factors (Appendix B) derived for an unresolved Gaussian source, assuming that it is the background over-subtraction that makes the source have the sub-resolution sizes {A, B} λn < O λ . The Gaussian model is used only to obtain the correction factors, not the measurements themselves. Unfortunately, similar corrections cannot be derived for the well-resolved sources, nor for the sources with underestimated backgrounds and overestimated sizes and fluxes. Figures 6 -10 also show the expected general trend that the numbers of acceptable sources in the accuracy plots become lower for the backgrounds with increasing complexity, in the sequence from {A, B} 2 to {A, B} 3 and to B 4 . This is caused by the much stronger variations in the immediate surroundings of the sources, especially those located on the densest parts of the background cloud, which strongly reduce the S/N ratio Ω λn of the extracted sources. As a result, some of the sources that were acceptable in the simpler variants of the benchmarks are pushed out of the acceptability domain by their lower values Ω λn < 2. In all benchmarks, the measurement errors significantly increase for the faint sources, because their estimated individual backgrounds become more strongly affected by the fluctuations of the filamentary cloud and noise.
The resulting over- or underestimation of the backgrounds depends on whether the sources happen to be located on a hollow- or hill-like fluctuation, respectively, as well as on the other types of background inaccuracies (cf. Appendices A and B).

Extraction qualities

Figure 11 presents an overview of the extraction qualities of getold and getsf, displaying Q CRλ , Q PTEλ , and Q λ from Tables C.1 and C.2. The first two qualities conveniently evaluate the extraction methods at their independent detection and measurement steps, whereas the third one combines the two in the overall extraction quality. To facilitate their analysis, the plots also display the global qualities Q CR , Q PTE , and Q, the geometric mean values over the wavelengths, for each benchmark. All features of the plots in Fig. 11 can be readily understood by comparing the tabulated qualities (Tables C.1 and C.2). The source-detection quality Q CRλ is the product of the extraction completeness C λ and reliability R λ . As expected, the global detection quality Q CR of both methods decreases from A 2 to B 4 , toward the more complex benchmarks (Fig. 11), demonstrating better results for getsf in all benchmarks, except A 3 . In A 3 , getsf has a 13% lower quality, because of several spurious (very noisy) sources extracted at 110 and 170 µm with very low significance levels, within just a few percent above the cleaning threshold λS j = 5σ λS j (Sect. 3.4.2 of Paper I). At {70|75} and {100|110} µm, Q CRλ shows lower values, because only the protostellar cores are detectable, whereas at {160|170} µm, some of the starless cores appear as faint detectable sources, hence the quality becomes higher. For some benchmarks, Q CRλ becomes significantly lower, which usually indicates that more spurious sources were extracted, hence the lower reliability R λ .
The measurement quality Q PTEλ is a product of the respective qualities Q Pλ , Q Tλ , and Q Eλ of peak intensity, integrated flux, and source area. The global measurement quality Q PTE for getsf is better by 20% than that for getold, across all benchmarks (Fig. 11). At {70|75} and {100|110} µm, Q PTEλ is within ∼2% of unity, because the protostellar cores are bright, hence they can be accurately measured (cf. Figs. 6 -10). The faint starless cores at {160|170} µm are poorly measurable; therefore, Q PTEλ becomes lower. The getsf measurement quality is substantially higher at the SPIRE wavelengths, partly because getold systematically underestimates source sizes (Sect. 4.1.2).
The overall quality Q λ is the product of Q CRλ and Q PTEλ , as well as of the positional quality Q Dλ and goodness G λ . In line with the expectations, the global quality Q of both methods decreases toward the more complex benchmarks (Fig. 11). For the simpler Benchmark A, getsf has a small 10% edge over getold, whereas for Benchmark B, the getsf quality reaches values higher by a factor of 2^(1/2). The quality evaluation system (Sect. 3) encapsulates all aspects of source extraction; therefore, the plots in Fig. 11, based on Tables C.1 and C.2, justify the conclusion that getsf is superior to getold in both Benchmarks A and B.
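The algebra of the quality system can be summarized in a few lines. The per-wavelength values below are made up for illustration and are not those of Tables C.1 and C.2; only the structure (products of per-wavelength factors, and geometric means over the wavelengths for the global qualities) follows the definitions above:

```python
import numpy as np

# Hypothetical per-wavelength quality components for one benchmark
# (illustrative values only, not those of Tables C.1 and C.2):
C   = np.array([0.55, 0.60, 0.85, 0.90, 0.88, 0.86])  # completeness C_lambda
R   = np.array([0.95, 0.93, 0.90, 0.92, 0.91, 0.90])  # reliability R_lambda
Q_P = np.array([0.99, 0.98, 0.92, 0.90, 0.88, 0.86])  # peak-intensity quality
Q_T = np.array([0.98, 0.97, 0.90, 0.88, 0.86, 0.84])  # integrated-flux quality
Q_E = np.array([0.97, 0.96, 0.91, 0.89, 0.87, 0.85])  # source-area quality
Q_D = np.array([0.99, 0.99, 0.98, 0.97, 0.96, 0.95])  # positional quality
G   = np.array([0.96, 0.95, 0.94, 0.93, 0.92, 0.91])  # goodness

Q_CR  = C * R                      # detection quality per wavelength
Q_PTE = Q_P * Q_T * Q_E            # measurement quality per wavelength
Q_lam = Q_CR * Q_PTE * Q_D * G     # overall quality per wavelength

def gmean(q):
    """Global quality: the geometric mean over the wavelengths."""
    return float(np.exp(np.log(q).mean()))

print(gmean(Q_CR), gmean(Q_PTE), gmean(Q_lam))
```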

Dependence on the images used for detection
In the present multiwavelength benchmarking, all seven images were combined in the wavelength-independent images for detecting sources (Sect. 3.4.3 of Paper I). To some extent, however, the source extraction results must depend on the images used for source detection. Both getsf and getold combine images and detect sources with almost the same algorithms; therefore, getsf alone may be used to evaluate the dependence of the extraction qualities on the subsets of images. Only the realistic benchmark variants with backgrounds ({A, B} 3 and B 4 ) are used in these tests, to keep the amount of results within reasonable limits. Figure 12 presents an overview of the overall quality Q λ and its global counterpart Q for source extractions with getsf in A 3 , B 3 , and B 4 using six subsets of images combined for detection. The full set of seven images (PDS) was discussed above (Sects. 4.1.1 -4.1.3) and is shown again for completeness. The subset of six images (PS) tests the case when the surface density image D {11|13} is not used. The subset of four images (P 3 S) examines the absence of two PACS images (at {70|75} and {100|110} µm). The subset of three images (S) clarifies the effects of the source detection with only the SPIRE images. The three single-image subsets (S 1 , S 3 , and D) explore the source extractions with the 250 µm image, the 500 µm image, and the surface density image D {11|13} , respectively. The subsets with only the PACS images are not considered, because no starless cores appear in the images at λ < 160 µm.
The results (Fig. 12) for the seven different cases are sorted from left to right in the order of decreasing global quality Q. In all three benchmarks, the best extraction quality is found for the subset D, when the surface density D {11|13} is the single image used to detect sources. It is obvious from the original images (e.g., Figs. 3 -5) that the surface density image must be beneficial for source extractions, because the sources are visible there most clearly. However, this result suggests that the high-resolution D {11|13} may also be used alone to detect sources, with better results than in a combination with the Herschel images. In the benchmarks {A, B} 3 , the second best global quality Q is shown by the complete set (PDS), when all seven images are used to detect sources. In B 4 , however, the extraction quality with this subset of images is only the fourth best, caused by a few more spurious sources extracted at 70 and 160 µm. The spurious peaks are clearly identifiable with the background and noise fluctuations in those images that happened to be slightly brighter than the cleaning threshold λS j = 5σ λS j (Sect. 3.4.2 of Paper I). Without the spurious sources, the PDS set would have the second best Q value in all three benchmarks.
When the subset S of only the three SPIRE images is used for source detection, the global quality Q becomes the fourth, the third, and the second best in the benchmarks A 3 , B 3 , and B 4 , respectively (Fig. 12). Adding the PACS 160 µm image to the SPIRE images in P 3 S leads to the fifth, the fourth, and the third best Q values among all subsets, always just below the global quality for the subset S. The slightly lower (by 5−10%) qualities in P 3 S can be traced to the chance extraction of a few spurious sources at the longest PACS wavelength. Using the subset S 1 with the single 250 µm image for source detection makes the global quality in A 3 the third best, whereas in B 3 and B 4 it becomes only the sixth and fifth, respectively. The absence of the high-resolution D {11|13} in the subset PS of the six Herschel images makes the extraction one of the two worst ones. However, the differences between the Q values outside the three best-quality results are very small, at the level of a few percent. An exception is the worst extraction for S 3 , whose quality is well below all others in A 3 and B 4 , because of the lowest angular resolution of the detection image.
Taking all the benchmarking results formally, it is possible to rank the getsf source extraction qualities by summing up their places in the three benchmarks shown in Fig. 12. The two best subsets of the Herschel images to be used for source detection are D (D {11|13} ) and PDS (D {11|13} together with all PACS and SPIRE images), and the next best subset is S (SPIRE images at 250−500 µm). The three worst subsets appear to be S 1 (250 µm image), PS (all Herschel images), and S 3 (500 µm image). It must be emphasized that the actual choices in real-life applications depend on the research interests. For example, if the goal is to study the protostellar cores, then the shortest PACS wavelength, where they are the brightest and observed with the highest resolution, is the best choice for their detection. However, if the aim is to study the starless cores that are the strongest at the SPIRE wavelengths, then the high-resolution surface density D {11|13} (possibly together with the 250−500 µm images) is likely the best choice for the source detection with getsf. This is an important decision to make when preparing for source extractions.
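The rank-sum ordering described above can be sketched as follows; the individual place numbers are hypothetical, loosely reconstructed from the discussion rather than read off Fig. 12:

```python
# Hypothetical places (1 = best) of each detection-image subset in the three
# benchmarks A3, B3, B4, following the ordering described in the text
# (the individual numbers are illustrative):
places = {
    "D":   [1, 1, 1],   # surface density alone
    "PDS": [2, 2, 4],   # all seven images
    "S":   [4, 3, 2],   # SPIRE images only
    "P3S": [5, 4, 3],   # PACS 160 um + SPIRE
    "S1":  [3, 6, 5],   # 250 um image alone
    "PS":  [6, 5, 6],   # six Herschel images, no surface density
    "S3":  [7, 7, 7],   # 500 um image alone
}

# Rank the subsets by the sum of their places over the three benchmarks:
ranking = sorted(places, key=lambda subset: sum(places[subset]))
print(ranking)
```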

Filament extraction in Benchmark B 4
Filaments are separated from both backgrounds B λY and sources S λ and detected as skeletons in their own flattened, wavelength-combined component F D jC (Sects. 3.2 -3.4 in Paper I). The separation allows the filament crests to be traced more precisely, reducing the interference from the sources that could significantly affect the results. In the standard approach to the multiwavelength benchmarking adopted in this paper, the filament detection image is combined from six wavelengths, excluding the 70 and 100 µm images, because the simulated filament is very faint and noisy at those wavelengths. Figure 13 shows the skeletons and footprints of eight filaments detected in B 4 at the significance level ξ = 4 (Sect. 3.4.5 in Paper I). All but one of them are spurious, short background fluctuations that happened to be elongated and slightly denser than the filament detection threshold λF j = 2σ λF j (Sect. 3.4.2 in Paper I). The rate of spurious filaments can be reduced using a higher value of the skeleton significance ξ when detecting filaments. Spurious filaments usually have lengths shorter than their widths, hence they can be discarded from further analysis after a visual inspection. This is not done by getsf automatically, because real filaments are also often split into relatively short segments by the sources, other intersecting filaments, or background fluctuations.
The sides of a filament are known to getsf as left (α) or right (β) with respect to the path from the first pixel of the skeleton to its last pixel. The first pixel of the spiral skeleton is in the center, hence the left normals to the skeleton point inside the loops and the right normals point outward (Fig. 13). Although the one-sided footprints and normals touch each other, which indicates overlapping of the two sides, the model filament is not affected by self-blending (Sect. 2.2), hence the orthogonal profiles of each loop must follow the true model profile, unaltered by the blending that would complicate the measurements of the observed filaments. The central loops of the filament are, however, blended with the dense background cloud (Fig. 13), which makes the separated background of the filament less accurate, underestimated in the central area (Fig. 8 in Paper I). The outermost loop of the spiral filament has a more accurate background and is filament-free along the right normals. Therefore, the right-sided measurements of the filament along the outermost loop may be expected to produce more accurate results than those over the inner parts of the spiral filament that have a contribution from the strongly fluctuating background. Figure 14 presents the filament radial profiles D {α|β} (r) along the skeleton normals, median-averaged over the filament length. The standard deviations ς {α|β}± (r) about the median profiles are computed separately for the positive and negative differences. The true filament profiles correspond to the model surface densities (Eq. (2) in Paper I) and display practically no differences between the filament sides. The slopes γ {α|β} (r) accurately represent the true model values, increasing from γ(r) ≈ 1 at the half-maximum radius of 0.05 pc to γ(r) ≈ 3 at much larger distances (r ≳ 0.3 pc).
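The behavior of the logarithmic slope γ(r) = −d ln D/d ln r can be reproduced with a generic Plummer-like profile. The profile below, with a power-law exponent of 3 and a scale radius chosen so that the half-maximum radius is about 0.05 pc, is an assumption for illustration; the actual model profile is Eq. (2) in Paper I:

```python
import numpy as np

def column_density(r, D0=1.0, R0=0.065):
    """Assumed Plummer-like profile, D(r) = D0 / (1 + (r/R0)^2)^(3/2);
    R0 is chosen so that the half-maximum radius is about 0.05 pc."""
    return D0 / (1.0 + (r / R0) ** 2) ** 1.5

r = np.logspace(-3.0, 0.0, 400)   # radii in pc, log-spaced
gamma = -np.gradient(np.log(column_density(r)), np.log(r))

print(np.interp(0.05, r, gamma))  # slope near the half-maximum radius (~1)
print(np.interp(0.50, r, gamma))  # slope far from the crest (~3)
```

For this profile, γ(r) = 3 (r/R0)² / (1 + (r/R0)²), which rises from about unity near the half-maximum radius toward the asymptotic value of 3 at large radii, matching the trend quoted above.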
For the filament extracted with getsf in B 4 , the radial profiles obtained from the entire filament are less accurate, with significantly larger dispersions (Fig. 14). This is caused by the underestimated fluctuating background in the central area (Fig. 8 in Paper I) that in effect makes a substantial contribution to the background-subtracted filament F Y . For comparison, the profiles D {α|β} (r) obtained over only the outer filament loop, where its background is more accurate (Fig. 14), reproduce the true model surface density distribution much better, with much smaller dispersions of their values along the segment. Even these profiles, however, deviate from the true model filament (Sect. 2.2). The discrepancies are caused by the residual contribution of the incompletely subtracted background, underestimated by up to ∼ 50% in the center of the filamentary cloud (Fig. 8 in Paper I). The filament footprint covers the entire cloud (Fig. 13), which makes it especially difficult to separate the filament from its blended background.
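The median-averaging with one-sided standard deviations can be sketched on synthetic data. This is a simplified stand-in, not the getsf implementation; the profile shape and the lognormal fluctuation model are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stack of orthogonal cuts: an assumed Plummer-like true profile
# multiplied by lognormal background/noise fluctuations, one row per
# skeleton pixel (all numbers illustrative):
r = np.linspace(0.0, 0.4, 81)                 # offsets from the crest, pc
true = 1.0 / (1.0 + (r / 0.05) ** 2) ** 1.5
cuts = true[None, :] * rng.lognormal(0.0, 0.2, size=(500, r.size))

median = np.median(cuts, axis=0)              # median-averaged radial profile

def one_sided_std(residuals):
    """Standard deviation computed from one-signed residuals only."""
    return np.sqrt((residuals ** 2).mean()) if residuals.size else 0.0

# Separate dispersions for the positive and negative deviations:
sigma_plus  = np.array([one_sided_std(c[c > m] - m) for c, m in zip(cuts.T, median)])
sigma_minus = np.array([one_sided_std(m - c[c < m]) for c, m in zip(cuts.T, median)])
```

The median is robust against the strongly skewed outliers produced by background fluctuations, while the separate positive and negative dispersions preserve the asymmetry of the scatter about the median profile.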
Benchmark B 4 provides a good test for filament extraction methods. The source extraction in B 4 with getold, described in Sect. 4.1, also executed getfilaments (Paper II), an integral part of getsources. Although the method passed simpler filament extraction tests (Sect. 3 in Paper II), getfilaments was unable to properly reconstruct the filament in B 4 . The crest values of the filament were underestimated by a factor of ∼ 5 in the central area of the dense background cloud, whereas the values were either correct or overestimated within ∼ 40% in some segments of the outermost loop of the filament. Even though the one-sided filament widths were determined fairly accurately (within 10−20%), the distant fainter areas of the filament profile (beyond a radius of 0.1 pc) were completely missing. Therefore, the mass and linear density of the filament were also strongly underestimated (by factors of ∼ 3).
The dense spiral filament in B 4 represents just the simplest benchmark. The filament crest should not create problems for any skeletonization algorithm; its detection is not the main goal of this benchmark. The simulated filament was created primarily to test the accuracy of various methods in measuring the filament profile and physical properties. Observed filaments display a variety of masses, densities, lengths, widths, curvatures, and signal-to-noise ratios. The filaments imaged with Herschel are embedded in strongly fluctuating backgrounds and arranged in complex networks with hundreds of interconnected segments. A proper benchmarking would require simulated images that resemble the observations, as well as a quality evaluation system, similar to that used in this paper for testing the source-extraction methods. Realistic and rigorous benchmarking of filament-extraction methods is the subject of a future work.

Discussion
Astronomical images are known to be very dissimilar across the electromagnetic spectrum (e.g., Figs. 16 -23 in Paper I). Therefore, the source-or filament-extraction methods, developed for different research areas and types of observed images, have heterogeneous properties and qualities. Benchmarking of the extraction tools must also depend on the research project, and the simulated images must resemble the complexity and structural components of the typical observed images. For example, if the sources of interest are all unresolved and there is no strong fluctuating background in the observed images, then the benchmark images must also contain just the unresolved sources (with a similar spatial distribution) and faint background. In this simple case, it may well be that a simple source-extraction tool employing a PSF-fitting algorithm could give more accurate results than a more general method designed to work for both unresolved and resolved sources on complex filamentary backgrounds.
The benchmarks described and applied in this study were designed to resemble the mid- to far-infrared (submm) imaging observations obtained with Herschel for the nearby star-forming regions. The simulated images contain a bright, fluctuating filamentary background cloud and starless and protostellar cores with a wide range of sizes, from unresolved to strongly resolved. By construction, these benchmark images are most suitable for testing the source- and filament-extraction methods to be applied in the studies of star formation. If observed images are significantly different, the benchmarks explored in this paper may not be directly applicable for testing extraction methods. For example, the substantial differences between the ALMA interferometric images of distant star-forming regions (e.g., Fig. 23 in Paper I) and the Herschel images of the nearby star-forming clouds required the creation of dedicated benchmarks with unresolved sources and a background from MHD simulations (Pouteau et al., in prep.). To make the benchmark images better resemble the real interferometric observations, they were also processed with the ALMA observations simulator.
For testing the source-extraction methods, it is the model sources that are the most important (primary) component of the benchmarks, and they must resemble the sources in real observations as closely as possible. Similarly, for testing the filament-extraction methods, it is the model filaments that are the main component, with all their parameters tabulated in a truth catalog. The other components of the benchmark images (e.g., fluctuating background, instrumental noise) just complicate the extraction of the primary component. They may be scaled up or down to create variants of the same benchmark with diverse contributions of the secondary components. For example, this paper employed several benchmark variants ({A|B} 2 , {A|B} 3 , and B 4 ) of different complexity, expanding the applicability of the two benchmarks to other types of images of the nearby star-forming regions.
When benchmarking source- or filament-extraction methods, it is important to ensure that the simulated images contain sufficiently realistic models of the sources or filaments that are expected to be extracted in the real-life observations. This may be a potential problem, because it requires advance knowledge of the physical reality being observed. In practice, there usually exists a good deal of previous studies that allow the creation of suitable primary and secondary components of the benchmarks. However, if an application of the extraction tools to the observed images reveals component properties that are significantly different from the ones simulated for the benchmarks, the latter may need to be adjusted and the testing of the methods repeated.

Conclusions
This paper described detailed benchmarking of two multiwavelength source and filament extraction methods, getsf and getold, to quantitatively evaluate their performance in Benchmarks A and B. In total, the two methods of source extraction were tested and compared using five variants of the simulated multiwavelength images of different complexity. Although the benchmarks were designed to resemble the Herschel observations of star-forming regions, the images are suitable for evaluating extraction methods for various astronomical projects and applications.
Benchmark B includes the complex fluctuating background cloud, the long dense filament, and the multitude of sources (starless and protostellar cores) with wide ranges of sizes, masses, and intensity profiles, computed with a radiative transfer code. In Benchmark A with similar properties of the structural components (no filaments), the sources are allowed to arbitrarily overlap with each other. The benchmarks enable conclusive comparisons between different methods and allow a quantitative comparison of their qualities, using the formalism given in this paper, in terms of the extraction completeness, reliability, and goodness, as well as the detection and measurement accuracies and the overall quality. All benchmark images, the truth catalogs containing the model parameters, and the reference extraction catalogs produced by the author are available for download on the getsf website 3 .
The quantitative analysis of the benchmark source extractions showed that the getsf method has superior qualities in comparison with getold. The benchmark filament extraction with getsf recovered parameters of the model filament, in contrast to the extraction with getold that was unable to properly reconstruct the filament to an acceptable accuracy. An investigation of the dependence of the source extraction results on different sets of images used to detect sources suggested that the best choice for source detection with getsf is the high-resolution surface density, either alone or together with other Herschel images. The worst choice for source detection would be the lowest-resolution observed images.
The benchmarks explored in this paper are proposed as the standard benchmarks for calibrating existing and future source and filament extraction methods before any astrophysical applications of the methods. It is critically important to use only the best calibrated tools with known properties that are fully understood on the basis of the standard benchmarking. Applications of various uncalibrated extraction tools with unknown qualities, which have never been quantitatively compared, could lead to a proliferation of incompatible results and severe long-term problems in the understanding of astrophysical reality.

Appendix A: Fluctuating backgrounds and the measurement accuracy for faint sources
Exact shapes of the molecular clouds under faint sources are practically impossible to separate from the observed emission peaks with any acceptable accuracy. The observed backgrounds of sources fluctuate on all spatial scales. Instrumental noise further complicates the source backgrounds by adding random fluctuations on scales of the angular resolution O λ . The background and noise fluctuations are totally blended with the sources, and no source extraction method is able to precisely deblend the components. This makes the measured sizes and fluxes of faint sources uncertain, often significantly over- or underestimated, depending on the unknown shapes of the fluctuations within the source footprints. Naturally, the background inaccuracies become relatively less important for increasingly stronger sources. Figure A.1 illustrates the problem using a simple Gaussian source G with a FWHM size of 10″ and several differently shaped backgrounds (flat and hill- or hollow-like). To simplify matters, the source may be considered unresolved, although extended sources are also affected by the same problem. In the simplest (unrealistic) case, the source could be observed against a constant background (B 1 = 1). The fluctuating backgrounds were modeled by adding positive or negative 15″ (FWHM) Gaussians with peak values of 0.5 and 0.9 to the flat background. For simplicity, the source position is assumed to be aligned with the background extrema, which is sufficient to illustrate the roots of the problem.
The flat background would normally present no difficulties for accurate source measurements. However, the strongly fluctuating backgrounds pose severe problems for source extraction methods. Measurements of the faint sources could be quite different from the true values, depending on the sign and magnitude of the background fluctuation within their footprints (Fig. A.1). When the source is blended with a hill-like background, its shape remains very similar to a Gaussian source and contains no information that the background is not flat. The source footprint usually widens and the hill-like background contributes to the overestimated width and fluxes of the source. On the other hand, when the source is blended with a hollow-like background, its apparent footprint shrinks to the area limited by the intensity minimum that appears around the peak. As a consequence, the sizes and fluxes of such sources become underestimated, sometimes quite strongly (Fig. A.1).
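A minimal 1D version of this experiment uses a 10″ FWHM source on flat, hill-like, and hollow-like backgrounds, with a naive flat-background subtraction. The footprint and background estimates below are simplistic assumptions for illustration, not any method's actual algorithm:

```python
import numpy as np

def gaussian(x, peak, fwhm):
    return peak * np.exp(-4.0 * np.log(2.0) * (x / fwhm) ** 2)

def measure(x, I):
    """Subtract a flat background estimated outside an assumed footprint,
    then measure the FWHM, peak, and total flux of what remains."""
    bg = I[np.abs(x) > 20.0].min()        # naive flat-background estimate
    S = np.clip(I - bg, 0.0, None)
    inside = x[S >= S.max() / 2.0]
    return inside.max() - inside.min(), S.max(), S.sum() * (x[1] - x[0])

x = np.linspace(-40.0, 40.0, 8001)        # angular offsets in arcsec
src = gaussian(x, 1.0, 10.0)              # faint source, FWHM = 10"
flat = np.full_like(x, 1.0)
hill = flat + gaussian(x, 0.5, 15.0)      # hill-like fluctuation
hollow = flat - gaussian(x, 0.5, 15.0)    # hollow-like fluctuation

for name, bg in [("flat", flat), ("hill", hill), ("hollow", hollow)]:
    A, FP, FT = measure(x, src + bg)
    print(f"{name:7s} A = {A:5.2f}  FP = {FP:5.2f}  FT = {FT:6.2f}")
```

The flat case recovers the true parameters; the hill-like background inflates the size and fluxes, while the hollow-like background shrinks the apparent footprint and depresses all three measured quantities, as described above.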
It is clear that the backgrounds of sources in the benchmark simulations and real observations are much more complex than the above simple model. However, the model illustrates the fundamental reasons behind the increasingly larger inaccuracies for the sources with low S/N ratios in Benchmarks A and B (Figs. 6 -10). In general, the measurement accuracy for such sources cannot be improved, because the necessary information is practically lost once the source peak is blended with the background and noise fluctuations. Fortunately, the unresolved or slightly resolved sources are the exception, for which it is possible to (approximately) correct the underestimated sizes and fluxes.

Appendix B: Corrections for the measurements of unresolved or slightly resolved sources
The PSFs (the telescope beams) set a natural lower limit to the source sizes {A, B} λn : their values must be larger than or equal to the angular resolution O λ . However, the benchmarking discussed in this paper has revealed numerous examples of sources with sizes {A, B} λn < O λ and underestimated peak intensities F Pλn and integrated fluxes F Tλn .

[Fig. A.1 caption: The same source G is also added to the nonuniform (hill- and hollow-like) backgrounds B 1.9 , B 1.5 , B 0.5 , and B 0.1 (blue and green lines). The fluctuating backgrounds were obtained from the flat background B 1 by adding or subtracting Gaussians with a size of 15″ (FWHM) and peak values of 0.5 and 0.9. Extraction methods would not be able to recognize that the real backgrounds are hill- or hollow-like, hence they would instead subtract flat backgrounds, based on the intensities just outside the apparent source footprints. Therefore, the source G would be extracted with over- or underestimated FWHM sizes A, peak intensities F P , and total fluxes F T (the middle, left, and right columns of numbers, respectively).]

An analysis of the results showed that the underestimated parameters are related to the overestimated backgrounds of the faint sources. When the FWHM sizes are directly measured at half-maximum intensities, as in getsf (cf. Sect. 3.4.6 of Paper I), the measurements can be improved, as shown below. However, such corrections are not feasible for getold, because the sizes obtained with intensity moments often correspond to uncertain levels, significantly deviating from the half-maximum intensity. Figure B.1 illustrates the Gaussian model adopted by getsf to correct the underestimated sizes and fluxes of faint sources, when their measured sizes are smaller than the beam size. The model assumes that the unresolved or slightly resolved faint sources have Gaussian shapes, which is an appropriate assumption, because most telescopes have Gaussian beams in their central (upper) parts. Various deviations and artifacts that often appear in the PSFs at larger angular distances from their peaks are invisible for the faint sources. The model also supposes that the actual source background is flat, which is the only reasonable assumption that can be made. Although there are many possible shapes of the fluctuating background within a source footprint, they cannot be accurately recovered from the blended source intensity distribution.
With the above two assumptions, Fig. B.1 demonstrates how the measured properties of a Gaussian source G would be affected by the increasingly overestimated backgrounds B 0.7 , B 0.8 , B 0.9 , B 1.0 , and B 1.1 . In the simplest case of an intrinsically flat background, the background could be progressively overestimated for stronger instrumental noise, which would effectively represent a fluctuating background of the source.

[Fig. B.1 caption: When such backgrounds are subtracted (lower colored curves), the source FWHM sizes A, peak intensities F P , and total fluxes F T become increasingly underestimated. The measured A values (given in the plot) get progressively smaller than the angular resolution of 10″, which clearly indicates an increasing inaccuracy of the background. Requiring that a source cannot be narrower than the telescope beam, it is possible to improve the measurements, substantially reducing their errors.]

For the blended sources or those in crowded areas, the absence of source-free pixels in their immediate environments is often the reason for the background to be overestimated. Whatever the actual cause, an over-subtraction of the increasingly inaccurate background leads to progressively underestimated FWHM sizes, peak intensities, and total fluxes. For the Gaussian G in Fig. B.1, it is possible to determine the correction factors that would recover the true properties of the source from their underestimated values.
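A closed-form sketch of such a correction can be derived under the same two assumptions. If a flat background is oversubtracted by b, the background-subtracted profile is S(θ) = F P exp(−4 ln 2 θ²/O λ ²) − b, the measured peak is F P − b, and the half-maximum condition at the measured size A m yields the relative oversubtraction x = b/F P = 2^(1−(A m /O λ )²) − 1; integrating the positive part of the 2D profile then gives the flux correction. This is only an illustration of the idea, not the actual factors of Eq. (B.1):

```python
import numpy as np

def correction_factors(A_m, O):
    """For a measured FWHM A_m < O (the telescope beam), infer the relative
    background oversubtraction x = b / F_P of a model Gaussian source and
    return multiplicative corrections (f_P, f_T) for the peak intensity and
    the total flux (a sketch of the idea, not the actual getsf Eq. (B.1))."""
    if A_m >= O:
        return 1.0, 1.0                          # no correction needed
    x = 2.0 ** (1.0 - (A_m / O) ** 2) - 1.0      # from the half-maximum condition
    f_P = 1.0 / (1.0 - x)                        # true peak = f_P * measured peak
    f_T = 1.0 / (1.0 - x + x * np.log(x))        # 2D Gaussian, clipped at S > 0
    return f_P, f_T

# Example: a source measured with an 8" FWHM in a 10" beam; its corrected
# size would be reset to the 10" beam size:
f_P, f_T = correction_factors(8.0, 10.0)
print(f_P, f_T)
```

As the measured size approaches the beam size, x tends to zero and both factors tend to unity, so the correction vanishes smoothly for properly measured sources.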
The multiplicative correction factors f_Sλn, f_Pλn, and f_Tλn for the sizes {A, B}_λn, peak intensity F_Pλn, and total flux F_Tλn, respectively, are obtained empirically by approximating the results for the Gaussian model (Fig. B.1) with different overestimated backgrounds; the factors differ from unity only when (A_λn B_λn)^1/2 < O_λ. They are applied when creating the final catalog at the end of the measurement iterations. The factors are implemented in getsf, hence the benchmark extraction catalogs discussed in this paper contain improved measurements for the faint unresolved or slightly resolved sources. By their definition, the factors from Eq. (B.1) provide precise results only for Gaussian sources on flat backgrounds. In most cases, however, the real backgrounds of sources have more complex shapes, in which case the formulas from Eq. (B.1) provide less accurate corrections to the measured quantities. Despite being approximate, the corrections are nevertheless very useful, because they significantly improve the measurements. For example, the hollow-like backgrounds B_0.5 and B_0.1 of a Gaussian source G from Fig. A.1
[Table caption:] The qualities, defined in Eqs. (2)-(6), are evaluated only for acceptable sources, cf. Eq. (1), with errors in measurements within a factor of √2. The numbers of model sources are N_T = 459 in Benchmark A and N_T = 919 in Benchmark B. Source measurements in the image of derived surface densities are known to be inaccurate (e.g., Appendix A of Paper I), hence these data are not presented.
[Table caption:] The qualities, defined in Eqs. (2)-(6), are evaluated only for acceptable sources, cf. Eq. (1), with errors in measurements within a factor of √2. The number of model sources is N_T = 919. Source measurements in the image of derived surface densities are known to be inaccurate (e.g., Appendix A of Paper I), hence these data are not presented.
A. Men'shchikov: Benchmarks for source- and filament-extraction methods
Table C.4.
Benchmark B_3 with getsf using different subsets of images for the combination over wavelengths and detection. The qualities, defined in Eqs. (2)-(6), are evaluated only for acceptable sources, cf. Eq. (1), with errors in measurements within a factor of √2. The number of model sources is N_T = 919. Source measurements in the image of derived surface densities D_13 (at a fictitious wavelength λ = 165 µm) are known to be inaccurate (e.g., Appendix A of Paper I), hence these data are not presented. The extractions are sorted, from top to bottom, by their global qualities Q of 0.042, 0.030, 0.030, 0.028, 0.027, 0.027, 0.025.
A&A proofs: manuscript no. bench
Table C.5. Benchmark B_4 with getsf using different subsets of images for the combination over wavelengths and detection. The qualities, defined in Eqs.
(2)-(6), are evaluated only for acceptable sources, cf. Eq. (1), with errors in measurements within a factor of √2. The number of model sources is N_T = 919. Source measurements in the image of derived surface densities D_13 (at a fictitious wavelength λ = 165 µm) are known to be inaccurate (e.g., Appendix A of Paper I), hence these data are not presented. The extractions are sorted, from top to bottom, by their global qualities Q of 0.011, 0.0096, 0.0089, 0.0084, 0.0080, 0.0078, 0.0055.