Statistics theory

Statistics refers primarily to a branch of mathematics that specializes in enumerative, or counted, data and their relation to measured data. It may also refer to a fact of classification, which is the chief source of all statistics, and it has a relationship to psychometric applications in the social sciences.

An individual statistic refers to a derived numerical value, such as a mean, a coefficient of correlation, or some other single concept of descriptive statistics. It may also refer to a value such as a median or standard deviation, or to any value computed from a set of data.

More precisely, in mathematical statistics, and in general usage, a statistic is defined as any measurable function of a data sample. A data sample is described by instances of a random variable of interest, such as height, weight, polling results, or test performance, obtained by random sampling of a population.

Simple illustration
Suppose one wishes to embark on a quantitative study of the height of adult males in some country C. How should one go about doing this, and how can the data be summarized? In statistics, the approach taken is to model the quantity of interest, i.e., "height of adult men from the country C", as a random variable X, say, taking values in [0,5] (measured in metres) and distributed according to some unknown probability distribution F on [0,5]. One important theme studied in statistics is the development of theoretically sound methods (firmly grounded in probability theory) for learning something about the postulated random variable X and its distribution F by collecting samples, in this particular example, of the heights of a number of men randomly drawn from the adult male population of C.

Suppose that N men labeled $$\scriptstyle M_1,M_2,\ldots,M_N$$ have been drawn by simple random sampling (meaning that each man in the population is equally likely to be selected), and that their heights are $$\scriptstyle x_1,x_2,\ldots,x_N$$, respectively. An important yet subtle point to note here is that, due to random sampling, the data sample $$\scriptstyle x_1,x_2,\ldots,x_N$$ obtained is actually an instance or realization of a sequence of independent random variables $$\scriptstyle X_1,X_2,\ldots,X_N$$, with each random variable $$\scriptstyle X_i$$ distributed identically according to the distribution of $$X$$ (that is, each $$\scriptstyle X_i$$ has the distribution F). Such a sequence $$\scriptstyle X_1,X_2,\ldots,X_N$$ is referred to in statistics as a sequence of independent and identically distributed (i.i.d.) random variables. To further clarify this point, suppose that two other investigators, Tim and Allen, are interested in the same quantitative study, and that they in turn each randomly sample N adult males from the population of C. Let Tim's height data sample be $$\scriptstyle y_1,y_2,\ldots,y_N$$ and Allen's be $$\scriptstyle z_1,z_2,\ldots,z_N$$; then both samples are also realizations of the i.i.d. sequence $$\scriptstyle X_1,X_2,\ldots,X_N$$, just as the first sample $$\scriptstyle x_1,x_2,\ldots,x_N$$ was.
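The sampling setup above can be sketched in Python. The normal model with mean 1.75 m and standard deviation 0.07 m is purely a hypothetical stand-in for the unknown distribution F:

```python
import random

random.seed(42)

def draw_heights(n, mu=1.75, sigma=0.07):
    """Draw n i.i.d. heights (in metres) from a hypothetical normal stand-in for F."""
    return [random.gauss(mu, sigma) for _ in range(n)]

N = 10
x = draw_heights(N)  # our sample
y = draw_heights(N)  # Tim's sample
z = draw_heights(N)  # Allen's sample
# x, y, and z differ numerically, yet all three are realizations
# of the same i.i.d. sequence X_1, ..., X_N.
```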

From a data sample $$\scriptstyle x_1,x_2,\ldots,x_N$$ one may define a statistic T as $$\scriptstyle T=f(x_1,x_2,\ldots,x_N)$$ for some real-valued function f which is measurable (here with respect to the Borel sets of $$\scriptstyle \mathbb{R}^N$$). Two examples of commonly used statistics are:


 * 1) $$ T\,=\,\bar{x}\,=\,\frac{x_1+x_2+\ldots+x_N}{N}$$. This statistic is known as the sample mean.
 * 2) $$T\,=\, \sum_{i=1}^{N} (x_i-\bar{x})^2/N $$. This statistic is known as the sample variance. Often the alternative definition $$ T\,=\, \frac{1}{N-1}\sum_{i=1}^{N} (x_i-\bar{x})^2 $$ of sample variance is preferred because it is an unbiased estimator of the variance of X, while the former is a biased estimator.
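As a minimal sketch, both statistics above can be computed directly from a sample; the data values here are made up for illustration:

```python
def sample_mean(xs):
    """Sample mean: sum of the observations divided by N."""
    return sum(xs) / len(xs)

def sample_variance(xs, unbiased=True):
    """Sample variance: divide the sum of squared deviations by N-1
    (unbiased estimator) or by N (biased estimator)."""
    n = len(xs)
    m = sample_mean(xs)
    ss = sum((x - m) ** 2 for x in xs)
    return ss / (n - 1) if unbiased else ss / n

data = [1.72, 1.80, 1.65, 1.78, 1.70]  # hypothetical heights in metres
mean = sample_mean(data)
var_biased = sample_variance(data, unbiased=False)  # divides by N
var_unbiased = sample_variance(data)                # divides by N-1
```

Note that the unbiased estimator is always slightly larger than the biased one, by the factor N/(N-1).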

Summary statistics

 * Descriptive statistics

Measurements of central tendency

 * Mean
 * Median

Measurements of variation

 * Standard deviation (SD) is a measure of variation or scatter. Unlike the standard error of the mean, the standard deviation does not systematically decrease as the sample size increases.
 * Variance is the square of the standard deviation, written $$s^2$$.


 * Standard error of the mean (SEM) measures how precisely the sample mean estimates the population mean and is always smaller than the SD. The SEM becomes smaller as the sample size increases. The sample standard deviation (s) and SEM are related by:
 * $$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$


 * The 95% confidence interval is the mean ± 1.96 × the standard error.
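A short sketch tying the three quantities together; the sample values are illustrative:

```python
import math

def summary(xs):
    """Return mean, SD, SEM, and an approximate 95% confidence interval."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    sd = math.sqrt(var)
    sem = sd / math.sqrt(n)                      # SEM shrinks as n grows; SD does not
    ci = (mean - 1.96 * sem, mean + 1.96 * sem)  # approximate 95% CI
    return mean, sd, sem, ci

mean, sd, sem, ci = summary([1.70, 1.72, 1.74, 1.76, 1.78])
```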

Inferential statistics and hypothesis testing
The null hypothesis is that there is no difference between two samples with regard to the factor being studied. Two errors can occur in assessing the probability that the null hypothesis is true:
 * Type I error, also called alpha error, is the rejection of a correct null hypothesis. The probability of this error is usually expressed by the p-value. Usually the null hypothesis is rejected if the p-value, or the chance of a type I error, is less than 5%. However, this threshold may be adjusted when multiple hypotheses are tested.
 * Type II error, also called beta error, is the acceptance of an incorrect null hypothesis. This error may occur when the sample size is too small to provide the power needed to detect a statistically significant difference.
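The 5% threshold can be illustrated by simulation: when the null hypothesis is true by construction, a test that rejects whenever |z| > 1.96 commits a type I error in roughly 5% of trials. This sketch uses a simple two-sample z statistic and made-up normal data:

```python
import math
import random

random.seed(0)

def z_stat(a, b):
    """Two-sample z statistic using sample variance estimates."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Both groups are drawn from the same distribution, so the null hypothesis
# is true; every rejection below is a type I (alpha) error.
trials = 2000
rejections = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if abs(z_stat(a, b)) > 1.96:
        rejections += 1
type_i_rate = rejections / trials  # should be near 0.05
```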

Frequentist method
This approach uses mathematical formulas to calculate deductive probabilities (p-values) for an experimental result. This approach can generate confidence intervals.

A problem with the frequentist analyses of p-values is that they may overstate "statistical significance". See Bayes factor for details.

Likelihood or Bayesian method
Some argue that the p-value should be interpreted in light of how plausible the hypothesis is, based on the totality of prior research and physiologic knowledge. This approach can generate Bayesian 95% credibility intervals.

Classification

 * Discriminant analysis
 * Factor analysis
 * Cluster analysis
 * Propensity score
 * Recursive partitioning

Problems in reporting of statistics
In medicine, common problems in the reporting and usage of statistics have been inventoried. These problems tend to exaggerate treatment differences.