4
$\begingroup$

I know that sample variance has the formula

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

I also know that sample variance has the formula "Mean of the squares minus the mean squared".

While calculating the sample variance of a given sample, I used both formulas and realised that they give two different answers. When should I use the first one, and when should I use the second?

$\endgroup$
4
  • $\begingroup$ As someone wrote in an answer, the difference in the answers has to do with whether you divide by $n$ or by $n-1$. However, practically speaking, this does not matter, as the two numbers you will get will be almost indistinguishable from one another. The emphasis on whether to use $n$ or $n-1$ is more important for learning purposes. $\endgroup$ Commented yesterday
  • $\begingroup$ outside the classroom the difference never matters $\endgroup$
    – Aksakal
    Commented 10 hours ago
  • $\begingroup$ @NicolasBourbaki The difference matters for very small samples, and there are many small samples in some areas of science and so your comment is incorrect. $\endgroup$ Commented 7 hours ago
  • $\begingroup$ @MichaelLew I do not take very small sample studies seriously. If a study has 14 observations, and dividing by 14 versus 13 moves the result across an arbitrarily set p-value threshold so that the paper is accepted or rejected, then I do not take that paper seriously. It is just a publication game at that point. $\endgroup$ Commented 6 hours ago
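
To put rough numbers on the disagreement in these comments: the two formulas differ by exactly the factor $n/(n-1)$, so the size of the gap depends only on the sample size. A minimal Python sketch follows; the sample sizes are illustrative choices, with $n = 14$ taken from the comment above.

```python
from fractions import Fraction

# Ratio of the (n-1)-denominator variance to the n-denominator variance.
# The sample sizes below are illustrative choices.
sample_sizes = [2, 5, 14, 100, 1000]
ratios = {n: Fraction(n, n - 1) for n in sample_sizes}

for n, r in ratios.items():
    # Percentage by which dividing by n-1 inflates the divide-by-n value.
    print(f"n = {n:4d}: ratio = {float(r):.4f} ({float(r - 1) * 100:.2f}% larger)")
```

At $n = 14$ the two variances differ by about 7.7%; at $n = 1000$ the difference is about 0.1%.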

4 Answers

8
$\begingroup$

Dividing by $n-1$ is designed to give an unbiased estimator of the population variance: you do not know the population mean $\mu$, and using $\bar x$ in its place while still dividing by $n$ gives an estimate which is on average too low.
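
A short derivation of why this works, as a sketch assuming the $x_i$ are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$. Splitting each deviation about $\mu$ gives

$$\sum (x_i - \mu)^2 = \sum (x_i - \bar x)^2 + n(\bar x - \mu)^2$$

and taking expectations, with $E\left[(\bar x - \mu)^2\right] = \sigma^2/n$,

$$n\sigma^2 = E\left[\sum (x_i - \bar x)^2\right] + \sigma^2 \quad\Longrightarrow\quad E\left[\sum (x_i - \bar x)^2\right] = (n-1)\sigma^2$$

So dividing the sum of squared deviations about $\bar x$ by $n-1$ has expectation exactly $\sigma^2$, while dividing by $n$ has expectation $\frac{n-1}{n}\sigma^2$.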

The following are both correct, though you rarely see the second line written out:

$$\frac{\sum (x_i - \bar x)^2}{n} = \frac{\sum x_i^2}{n} - \bar x^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n} \right)^2$$

$$\frac{\sum (x_i - \bar x)^2}{n-1} = \frac{\sum x_i^2}{n-1} - \frac{n}{n-1}\bar x^2 = \frac{\sum x_i^2}{n-1} - \frac{\left(\sum x_i\right)^2}{n(n-1)} $$
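
These identities can be checked numerically. The Python sketch below uses exact rational arithmetic so that rounding does not blur the comparison; the data set is an illustrative choice, not taken from the question.

```python
from fractions import Fraction

# Illustrative data set (not from the original post).
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

sum_x = sum(x)
sum_x2 = sum(v * v for v in x)
xbar = Fraction(sum_x, n)
ss = sum((v - xbar) ** 2 for v in x)  # sum of squared deviations about the mean

# Divide-by-n version: the definition and its two shortcut forms.
pop_def = ss / n
pop_short1 = Fraction(sum_x2, n) - xbar ** 2
pop_short2 = Fraction(sum_x2, n) - Fraction(sum_x, n) ** 2

# Divide-by-(n-1) version: the definition and its two shortcut forms.
samp_def = ss / (n - 1)
samp_short1 = Fraction(sum_x2, n - 1) - Fraction(n, n - 1) * xbar ** 2
samp_short2 = Fraction(sum_x2, n - 1) - Fraction(sum_x ** 2, n * (n - 1))

print(pop_def, pop_short1, pop_short2)
print(samp_def, samp_short1, samp_short2)
```

For this data all three divide-by-$n$ expressions equal $4$ and all three divide-by-$(n-1)$ expressions equal $32/7$.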

$\endgroup$
1
  • 1
    $\begingroup$ And, just for fun, let me add yet a 3rd way of computing the variance; $\sigma^2=\dfrac {\displaystyle\sum_i \displaystyle\sum_j (x_i-x_j)^2} {2n(n-1)}$. And if you do not want Bessel's correction, just replace the $(n-1)$ in the denominator by $n$. Also, while using $(n-1)$ gives you an unbiased estimator of the variance, it does not give you an unbiased estimator of the standard deviation (but it is less biased than using $n$). $\endgroup$
    – jginestet
    Commented 19 hours ago
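
The pairwise formula in the comment above can be checked the same way. This is a minimal Python sketch on an illustrative data set; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

x = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
n = len(x)

# Sum of squared differences over all ordered pairs (i, j), including i == j.
pair_ss = sum((a - b) ** 2 for a, b in product(x, repeat=2))

var_bessel = Fraction(pair_ss, 2 * n * (n - 1))  # with Bessel's correction
var_plain = Fraction(pair_ss, 2 * n * n)         # (n-1) replaced by n

# Reference values from the usual definitions about the sample mean.
xbar = Fraction(sum(x), n)
ss = sum((v - xbar) ** 2 for v in x)

print(var_bessel == ss / (n - 1), var_plain == ss / n)
```

Note that this form never computes $\bar x$, at the cost of a double sum over all pairs.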
4
$\begingroup$

Dividing by $n-1$ rather than by $n$ is done ONLY when $s^2$ is used as an estimate of the variance of a population from which one has only a random sample of size $n.$ If one knew the value of $\mu,$ the population mean, rather than only of $\overline x,$ the sample mean, then one would use $\mu$ rather than $\overline x$ and one would divide by $n$ rather than by $n-1.$
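
Both claims can be illustrated exactly, without simulation, by averaging each estimator over every possible sample. The sketch below assumes a hypothetical population (the faces of a fair die) sampled independently with replacement; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: the six faces of a fair die, sampled with replacement.
population = [1, 2, 3, 4, 5, 6]
n = 3  # sample size (illustrative)

mu = Fraction(sum(population), len(population))
sigma2 = sum((v - mu) ** 2 for v in population) / len(population)

def div_n_about_xbar(sample):
    xbar = Fraction(sum(sample), n)
    return sum((v - xbar) ** 2 for v in sample) / n

def div_n_minus_1_about_xbar(sample):
    xbar = Fraction(sum(sample), n)
    return sum((v - xbar) ** 2 for v in sample) / (n - 1)

def div_n_about_mu(sample):
    return sum((v - mu) ** 2 for v in sample) / n

def expected_value(estimator):
    # Exact average over all 6**n equally likely ordered samples.
    samples = list(product(population, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)

e_biased = expected_value(div_n_about_xbar)
e_bessel = expected_value(div_n_minus_1_about_xbar)
e_mu = expected_value(div_n_about_mu)

print("population variance       :", sigma2)
print("divide by n,   about xbar :", e_biased)
print("divide by n-1, about xbar :", e_bessel)
print("divide by n,   about mu   :", e_mu)
```

Here the population variance is $35/12$. Dividing by $n$ about $\overline x$ averages to $35/18$, which is too low by the factor $(n-1)/n = 2/3$, while both dividing by $n-1$ about $\overline x$ and dividing by $n$ about the known $\mu$ average to exactly $35/12$.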

$\endgroup$
0
3
$\begingroup$

When you use the first, you are computing an unbiased estimator of the population variance from a data sample. You can also compute the second from a sample, but as an estimator of the population variance it will be biased.

When you use the second, you are computing the variance of a whole population (a descriptive statistic), for which the first is not the right expression.

$\endgroup$
1
$\begingroup$

> I also know that sample variance has the formula "Mean of the squares minus the mean squared".

No. A phrase like "mean of squares minus the square of means" is a description of a formula for the population variance — not the sample variance. (E.g., you can see it suggested here, with a recommended acronym of "MOSSOM"). We see:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - \bar{x}^2$$

The mnemonic known to the OP describes the rightmost expression here. This is called the "calculating formula" because it takes fewer operations to produce, and it was an important technique when all such calculations were done by hand (not so now with available technology). To be clear, both of the expressions above produce the exact same number — if you were to double-check with both, and get different results, then that indicates an error in some hand calculation.

However, this reflects the definition for population variance (it has $n$ in the denominator), so it will of course produce a different value than the formula for sample variance noted by the OP (which has $n - 1$ in the denominator, that is, Bessel's correction).

Note that sample variance also has an analogous calculating formula, but it cannot be described by the same MOSSOM mnemonic:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - (\sum x_i)^2/n}{n - 1}$$
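
As a quick check with software, Python's standard `statistics` module provides both quantities: `pvariance` divides by $n$ and `variance` divides by $n - 1$. The sketch below compares them with the two calculating formulas on an illustrative data set (not the OP's data).

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
n = len(x)

sum_x = sum(x)
sum_x2 = sum(v * v for v in x)

# "Mean of squares minus square of mean": the population variance.
mossom = sum_x2 / n - (sum_x / n) ** 2

# Calculating formula for the sample variance (Bessel's correction).
calc_s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

pop_var = statistics.pvariance(x)  # divides by n
samp_var = statistics.variance(x)  # divides by n - 1

print(mossom, pop_var)
print(calc_s2, samp_var)
print(samp_var / pop_var, n / (n - 1))  # the two differ by the factor n / (n - 1)
```

If a hand calculation gives two different numbers for the same data, checking which of these two functions each number matches shows which denominator was used.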

$\endgroup$
1
  • $\begingroup$ Yeah I realised that MOSSOM is in fact for the population variance, and not the sample variance. Thanks $\endgroup$ Commented 3 hours ago
