Data collection
In this paper, we combine first-digit NBL and KLD to evaluate the reliability of COVID-19 records in all 20 Latin American countries, using country-level aggregate case data from Our World in Data [25]. By reliability, we mean “the extent to which an experiment, test or any measuring procedure yields the same results on repeated trials” [26].
Statistical analysis
Initially proposed by Newcomb [27] and popularized by Benford [28], NBL states that some leading digits appear more frequently than others: 1 is the most common first digit, leading 30.10% of the time, and 9 is the least common, with an expected frequency of 4.58% [29]. Scholars compare the observed data distribution with the theoretical expectation that the “occurrence of numbers is such that all mantissa of their logarithms are equally probable” [27]. Therefore, for the first digit,
$$P(d)=\log_{10}\left(\frac{d+1}{d}\right)\quad \text{for}\ d\in \left\{1,\dots,9\right\}$$
(1)
where P(d) gives the probability of a given number occurring as the first digit. According to Hill [30], “this law implies that a number has leading significant digit 1 with probability log10 2 ≅ .301, leading significant digit 2 with probability log10 (3/2) ≅ .176 and so on monotonically down to probability .046 for leading digit 9”. NBL has been used as a forensic tool to detect data irregularities in several fields, such as religious activity [31], scientific data [32], socio-economic datasets [33], electoral processes [34], international trade [35], and academic misconduct [36]. In epidemiological data, deviations from NBL may be associated with inadequate capacity in surveillance systems or with intentional fraud [13].
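As a minimal sketch of Eq. (1), the expected first-digit proportions can be computed directly in R and set against the observed leading digits of a count series; the vector `cases` below is purely illustrative and is not part of our dataset:

```r
# Expected first-digit proportions under NBL (Eq. 1)
d <- 1:9
benford_expected <- log10((d + 1) / d)
round(benford_expected, 4)
#> 0.3010 0.1761 0.1249 0.0969 0.0792 0.0669 0.0580 0.0512 0.0458

# Observed first-digit proportions for an illustrative vector of daily case counts
cases <- c(12, 170, 345, 1021, 2200, 4890, 9134, 15320, 28760)
first_digit <- floor(cases / 10^floor(log10(cases)))
observed <- prop.table(table(factor(first_digit, levels = 1:9)))
```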
According to Nigrini [13], for the digit pattern of a dataset to conform to NBL, the data must form a geometric sequence or a combination of geometric sequences. In the context of COVID-19 data, the exponential growth of SARS-CoV-2 infections meets this assumption [37].
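One way to see why geometric growth satisfies this assumption is to tabulate the leading digits of a simulated exponential series; the 8% growth rate and 300 steps below are arbitrary choices for illustration:

```r
# Simulated exponential growth, e.g., cumulative counts growing 8% per step
t <- 0:299
series <- 100 * 1.08^t
lead <- floor(series / 10^floor(log10(series)))
prop.table(table(lead))  # proportions closely track log10((d + 1) / d)
```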
To ensure more reliable findings, we employ three goodness-of-fit tests (Pearson chi-square, the Kolmogorov-Smirnov D statistic, and the Chebyshev distance m statistic) and three conformity estimates (average mantissa, mean absolute deviation, and distortion factor). In this manner, we reduce the likelihood that our results are driven by any single statistical technique.
The chi-square test assesses the adherence of a dataset to Benford’s Law by comparing the actual and expected counts for all digits. The Kolmogorov-Smirnov (KS) test also evaluates conformity by taking into account all digits and their actual and expected counts, and it is strongly influenced by the first and second digits of the numbers [13]. According to Druica, Oancea and Vâlsan [38], the Chebyshev distance reports the absolute size of the difference between two distributions and accommodates both ordinal and quantitative variables; it is related to the Euclidean distance and is also known as the maximum value distance [38]. Regarding conformity estimates, the NBL theoretical distribution expects the average mantissa to be .5, with variance 1/12 and skewness close to zero. The mean absolute deviation (MAD) is the average absolute deviation of the actual proportions from the Benford proportions for each digit [13]; because it is an average, it is not influenced by the number of records. According to Nigrini [13], MAD values above .015 indicate nonconformity to NBL for the first-digit test. Finally, the distortion factor (DF) model indicates whether data are likely to be over- or underestimated [13].
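Assuming `observed`, `benford_expected`, and `first_digit` from the sketch above, the chi-square statistic, the Chebyshev (maximum) distance, and the MAD can be computed in a few lines of base R; the KS D statistic and the distortion factor follow the formulas in Nigrini [13] and are omitted here for brevity:

```r
n <- length(first_digit)            # number of records

# Pearson chi-square statistic against the Benford expected counts (df = 8)
chi_sq  <- sum((n * observed - n * benford_expected)^2 / (n * benford_expected))
p_value <- pchisq(chi_sq, df = 8, lower.tail = FALSE)

# Chebyshev (maximum) distance between observed and expected proportions
m_stat <- max(abs(observed - benford_expected))

# Mean absolute deviation; values above .015 suggest first-digit nonconformity [13]
mad_stat <- mean(abs(observed - benford_expected))
```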
We complete the analysis using KLD, a well-established measure of directed divergence in information theory [17]. Also known as relative entropy, KLD estimates how much additional information would be needed to encode samples from a target distribution P using a code optimized for a given distribution Q. By estimating the directed divergence between two distributions, it is possible to discriminate their information content and measure how similar they are. For a continuous distribution, KLD is given by:
$$KLD\left(P\parallel Q\right)=\int P(x)\log \frac{P(x)}{Q(x)}\,dx$$
(2)
where P(x) typically represents the true distribution of the data and Q(x) represents a theoretical or reference distribution from the same group. Originating in information theory [17], KLD measures the expected number of extra bits required to code samples from P(x) when using a code based on Q(x), rather than a code based on P(x) [39]. KLD is always a non-negative number and has no maximum value [40]. If P(x) equals Q(x), the measure is 0, corresponding to identical distributions [41]. Figure 1 shows two pairs of distributions with different levels of divergence as measured by KLD.
Figure 1A shows two probability distributions with low divergence (KLD = .02), meaning that little information change would be required to encode P(x1) as P(x2). Figure 1B shows two distributions with higher divergence (KLD = .21); approximating these two distributions would therefore entail more information change. In addition to comparing data from the same group, KLD also applies to the estimation of pairwise divergences. KLD has been used to study outlier detection [42], sample similarity [43], SAR images [44], copying on educational tests [45], and fake news recognition [46]. Given that the number of new COVID-19 cases is a count variable, we estimate KLD using its discrete form:
$$KLD\left(P\parallel Q\right)=\sum_{i}P(i)\log \frac{P(i)}{Q(i)}$$
(3)
where P and Q are two probability distributions of a discrete random variable x. Mathematically, both P and Q sum to 1, with P(x) > 0 and Q(x) > 0 for any x in X [40]. Unlike NBL, which compares the data distribution with a theoretical model, KLD does not need a priori information on distributions; it measures the directed divergence between data from similar events [14].
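A direct implementation of Eq. (3) in base R is sketched below; `p` and `q` stand for any two discrete probability vectors of equal length with strictly positive entries, and the example vectors are hypothetical:

```r
# Discrete Kullback-Leibler divergence, Eq. (3), in natural-log units (nats)
kld <- function(p, q) sum(p * log(p / q))

p1 <- c(.20, .30, .50)
q1 <- c(.22, .28, .50)   # close to p1, so the divergence is near 0
q2 <- c(.45, .35, .20)   # farther from p1, so the divergence is larger
kld(p1, q1)   # ~0.002
kld(p1, q2)   # ~0.25
```

Because the measure is directed, kld(p1, q2) and kld(q2, p1) need not coincide.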
The rationale for combining NBL and KLD is to strengthen the methodological rigor of our research design. While NBL is a popular tool for detecting potentially fraudulent activity, KLD has been used in empirical research to compare datasets, identify discrepancies between models, and measure the relative entropy between two distributions. The joint application of NBL and KLD has appeared in other research areas, such as image processing [47], electrical engineering [48], and electronics [49].
Computational tools
To estimate NBL functions, we used the benford.analysis package developed by Cinelli [50] and the BenfordTests package developed by Joenssen and Muellerleile [51]; to compute KLD, we used the philentropy package designed by Drost [52]. Statistical analyses were performed in R version 4.0.4, and all significance tests were two-sided at conventional levels (p < .05).
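A sketch of how these packages fit together for a single country is shown below; `daily_cases` stands for a hypothetical vector of new daily cases and `dist_a`, `dist_b` for two first-digit proportion vectors being compared, so the calls reflect the package interfaces we are aware of rather than our exact analysis script:

```r
library(benford.analysis)   # Cinelli [50]
library(BenfordTests)       # Joenssen and Muellerleile [51]
library(philentropy)        # Drost [52]

# First-digit Benford analysis: digit frequencies, chi-square, MAD, distortion factor
bfd <- benford(daily_cases, number.of.digits = 1)
bfd          # prints the main statistics
plot(bfd)    # observed digit frequencies against the NBL expectation

# Additional goodness-of-fit tests on the first digits
chisq.benftest(daily_cases, digits = 1)
ks.benftest(daily_cases, digits = 1)

# KLD between two first-digit distributions; each row must be a probability vector
KL(rbind(dist_a, dist_b), unit = "log")
```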