Statistics: Ten facts about the chi-square distribution

Tutoring statistics, distributions are of constant interest. The tutor brings up ten points about the chi-square distribution.

The chi-square distribution may not be discussed much in a first-level stats course. It’s used to estimate or evaluate variance, rather than central tendency. Here are ten facts about the chi-square distribution:

  1. Typically, χ²ᵥ is used to denote a chi-square variable or distribution with v degrees of freedom.
  2. Its parameter used for finding its values in tables is the degrees of freedom, typically referred to as v or just n-1, where n is the number of values in the sample.
  3. Its expected value is v.
  4. Its variance is 2v.
  5. It’s not symmetrical, but skewed right.
  6. Since a chi-square random variable is calculated from summing squares, it can’t be negative.
  7. For a sample from a normal population, (n-1)s²/σ² has a chi-square distribution with n-1 degrees of freedom. In this context, s² is the sample variance, while σ² is the true population variance.
  8. Σ(observed-predicted)²/predicted, for predicted values using a model, approximately follows a chi-square distribution when the predicted values are not too small.
  9. The chi-square distribution is used to estimate or test population variance.
  10. The chi-square distribution is used to test goodness-of-fit between a model and a sample.
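
Several of these facts can be checked numerically. Here is a minimal Python sketch (not from Harnett and Murphy). It assumes a chi-square variable with v degrees of freedom can be simulated as the sum of v squared standard normal draws; the helper name chi_square_draw, the seed, v = 5, n = 6, and the number of repetitions are all arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def chi_square_draw(v, rng):
    """One simulated chi-square value: the sum of v squared standard normal draws."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(v))


v = 5
draws = [chi_square_draw(v, rng) for _ in range(20000)]

print("mean (fact 3, expect about v):", statistics.fmean(draws))
print("variance (fact 4, expect about 2v):", statistics.variance(draws))
print("median below mean, so skewed right (fact 5):",
      statistics.median(draws) < statistics.fmean(draws))
print("smallest draw (fact 6, never negative):", min(draws))

# Fact 7: (n-1)s^2/sigma^2 for samples of size n from a normal population.
n, sigma = 6, 2.0
scaled = []
for _ in range(10000):
    sample = [rng.gauss(10, sigma) for _ in range(n)]
    scaled.append((n - 1) * statistics.variance(sample) / sigma ** 2)

print("mean of (n-1)s^2/sigma^2 (expect about n-1):", statistics.fmean(scaled))
```

Being a simulation, the printed values will only be close to v, 2v, and n-1, not exactly equal.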

I’ll be talking more about the uses of the chi-square distribution.

Source:

Harnett, Donald L. and James L. Murphy. Statistical Analysis for Business and Economics, third edition. Don Mills: Addison-Wesley, 1986.

Jack of Oracle Tutoring by Jack and Diane, Campbell River, BC.

Statistics: an assumption of the linear regression model

Tutoring statistics, linear regression is perennial. The tutor mentions an assumption it includes.

When appropriate, linear regression models data by the equation

y = a + bx + e,

e being an error term due to variability.

An inherent assumption of linear regression modelling is that the error term, e, does not depend on the actual data value, x.

In many lab environments, the assumption that error magnitude does not depend on the measurement’s magnitude makes sense. For instance, measuring with a ruler, the error is often set to ± 0.5mm, regardless of the length measured.

For some types of data, however, the measurement’s magnitude seems to impact its error magnitude. An example might be inventory counting. One imagines that, counting only three items, the error would likely be 0. Counting a thousand, however, would more likely yield an observation a few off from the real number present, and so on.

Perhaps the point is that the data has to be measured or observed, which itself brings error, perhaps dependent on the size of the measurement itself.
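
To make the contrast concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: the error sizes, the range of x, and the helper spread_ratio are not from the source. It draws one set of errors whose size ignores x (like the ruler) and one whose size grows with x (like the inventory count), then compares the spread of the errors for large x against the spread for small x.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

xs = [float(x) for x in range(1, 201)]

# Case 1: error size does not depend on x (what linear regression assumes).
constant_errors = [rng.gauss(0, 1.0) for _ in xs]

# Case 2: error size grows in proportion to x (hypothetical 5% of x).
growing_errors = [rng.gauss(0, 0.05 * x) for x in xs]


def spread_ratio(errors):
    """Standard deviation of errors at large x divided by that at small x."""
    half = len(errors) // 2
    return statistics.pstdev(errors[half:]) / statistics.pstdev(errors[:half])


print("constant-error case, ratio near 1:", spread_ratio(constant_errors))
print("growing-error case, ratio well above 1:", spread_ratio(growing_errors))
```

A ratio near 1 is consistent with the assumption; a ratio well above 1 suggests the error depends on x.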

Source:

Harnett, Donald L. and James L. Murphy. Statistical Analysis for Business and Economics. Don Mills: Addison-Wesley, 1986.

Estimation of expected value: why sample mean is usually preferred over median

Tutoring statistics, mean is used more than median. The tutor points out a reason why.

In my post from November 30, 2014, I write about why the median might be preferred to the mean in some cases. For instance, the median tends to be less sensitive to outliers. Why, then, is the mean typically preferred in academic statistics?

The sample median, it turns out, is a higher-variance estimator than the sample mean. In particular, for large samples from a normal population,

var(sample median) ≈ (π/2)*var(sample mean)

Therefore, if you want to estimate the true expected value of the population, a sample size of 157 is needed to estimate it as precisely from the median as you can from a sample size of 100 using the mean.
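
The π/2 factor can be checked by simulation. The Python sketch below is an illustration only: the normal population, the sample size of 101, the number of repetitions, and the seed are all assumptions, not values from the source. It also works out the sample size of 157 quoted above.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

n = 101      # values per sample (odd, so the median is a single value)
reps = 4000  # number of simulated samples

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(0, 1) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

ratio = statistics.variance(medians) / statistics.variance(means)
print("simulated var(median)/var(mean):", ratio)
print("pi/2:", math.pi / 2)

# Sample size for the median to match the precision of the mean at n = 100.
equivalent_n = round(100 * math.pi / 2)
print("equivalent sample size:", equivalent_n)  # 157
```

The simulated ratio should land near π/2, though not exactly on it, since the relation is a large-sample one.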

HTH:)

Source:

Harnett, Donald L. and James L. Murphy. Statistical Analysis for Business and Economics. Don Mills: Addison-Wesley, 1986.

Statistics: When can you use the normal approximation to the binomial distribution?

Tutoring statistics, rules of usage are key. The tutor shares one about when the normal approximation to the binomial distribution can be used.

The binomial distribution imagines a series of n trials, each with probability p of success and q=(1-p) of failure. The number of successes in n trials is the random variable x.

Apparently, the binomial random variable x can be approximated by a normal distribution (with mean np and variance npq) if np≥5 and also nq≥5.
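
Here is a short Python sketch of the rule. The function names are made up for illustration, and the continuity correction (adding 0.5 before using the normal curve) is a common refinement that I am assuming here; it is not part of the rule as stated above.

```python
import math
from statistics import NormalDist


def normal_approx_ok(n, p):
    """Rule of thumb from the post: np >= 5 and nq >= 5."""
    q = 1 - p
    return n * p >= 5 and n * q >= 5


def binomial_cdf(k, n, p):
    """Exact P(x <= k) for a binomial with n trials and success probability p."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(x <= k), with a continuity correction (assumed)."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


print(normal_approx_ok(35, 0.44))  # True: np = 15.4, nq = 19.6
print(normal_approx_ok(10, 0.1))   # False: np = 1
print("exact: ", binomial_cdf(20, 35, 0.44))
print("approx:", normal_approx_cdf(20, 35, 0.44))
```

When the rule is satisfied, the exact and approximate probabilities should be close.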

HTH:)

Source:

PennState Eberly College of Science

Statistics: distribution shape: skew

Tutoring statistics, the concept of skew arises. The tutor gives some brief ideas about skew.

The normal distribution isn’t skewed, because its left and right tails are simply mirror images of each other.

When a distribution’s left tail is longer than its right, we say that distribution is negatively skewed. In such a case you get a long left tail that ramps up to a clump of values on the right.

The opposite is true for a positively skewed distribution; most of the population is clumped to the left, but a long tail extends to the right.

Therefore, skew describes the location of the distribution’s longer tail: negative skew means long tail to the left (the negative) side, whereas positive skew means long tail to the right (positive) side.
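
The sign of the skew can also be computed. The Python sketch below uses one common moment-based skewness coefficient (third central moment divided by the 1.5 power of the second); that formula and the small data sets are my own assumptions for illustration, not taken from Harnett and Murphy.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]     # clump on the left, long tail to the right
left_tailed = [-x for x in right_tailed]     # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", sample_skewness(right_tailed))  # positive
print("left-tailed: ", sample_skewness(left_tailed))   # negative
print("symmetric:   ", sample_skewness(symmetric))     # zero
```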

HTH:)

Source:

Harnett, Donald L. and James L. Murphy. Statistical Analysis for Business and Economics, 3rd ed. Don Mills: Addison-Wesley, 1986.

Statistics: what is multiple linear regression?

Tutoring statistics, regression is an important topic. The tutor defines multiple linear regression.

Simple linear regression models the dependent variable by a linear connection to the independent:

y = a + bx

Multiple linear regression models the dependent variable by a linear connection across several independent variables:

y = a + b₁x₁ + b₂x₂ + b₃x₃ + …
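
As a small illustration, here is how a prediction from such a model might be computed in Python. The function name predict and the coefficient values are hypothetical, chosen only to show the arithmetic; fitting the coefficients from data is a separate step not covered here.

```python
def predict(a, coefficients, xs):
    """Return a + b1*x1 + b2*x2 + ... for one observation."""
    if len(coefficients) != len(xs):
        raise ValueError("need exactly one coefficient per independent variable")
    return a + sum(b * x for b, x in zip(coefficients, xs))


# Simple linear regression is the special case with one independent variable.
print(predict(1.0, [2.0], [4.0]))            # 1 + 2*4 = 9.0

# Two independent variables (hypothetical coefficients).
print(predict(1.0, [2.0, 3.0], [4.0, 5.0]))  # 1 + 2*4 + 3*5 = 24.0
```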

Source:

Harnett, Donald L. and James L. Murphy. Statistical Analysis for Business and Economics. Don Mills: Addison-Wesley, 1986.

Calculator usage, statistics: the amazing Casio fx-991ES PLUS C

Tutoring math, you see all kinds of scientific calculators. The tutor continues to praise the Casio fx-991ES PLUS C.

If you were to be marooned on an island with a single calculator…which one would you prefer? The question is absurd, of course. Yet, I believe I know my answer: the Casio fx-991ES PLUS C. It has seemingly endless functionality for a little scientific calculator.

Today I noticed that it calculates binomial pdf.

Example: Using the Casio fx-991ES PLUS C, find the probability of 20 successes in 35 trials when p=0.44.

Solution:

  1. Press MODE, then arrow down
  2. Select 3
  3. Select 4
  4. For a single query, select 2
  5. It will ask for x (the number of successes). Key in 20 =
  6. Next, it will ask for N (the number of trials). Key in 35 =
  7. Now it will ask for p (the probability of success each time). Key in .44 =
  8. Hopefully you receive the answer 0.0401
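
For a cross-check away from the calculator, the same probability can be computed from the binomial formula C(n,x)·p^x·(1-p)^(n-x). Here is a minimal Python sketch; the function name binomial_pmf is my own.

```python
import math


def binomial_pmf(x, n, p):
    """P(exactly x successes in n trials), with success probability p each trial."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


probability = binomial_pmf(20, 35, 0.44)
print(round(probability, 4))  # 0.0401, matching the calculator
```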

HTH:)

Source:

Casio fx-991ES PLUS C User’s Guide

Calculator usage: linear regression on the Casio fx-991ES PLUS C

Tutoring statistics, you cover linear regression. The tutor shows how to get a best-fit line on the Casio fx-991ES PLUS C.

Let’s imagine you have the following data

x y
10.1 14.2
17.3 19.5
25.4 22.9
40.0 31.8

Furthermore, you’d like to find a line of the form y=A+Bx that fits the data. Here’s how you might do it using the Casio fx-991ES PLUS C:

  1. Press Mode then 3 for Stat mode.
  2. Press 2 for y=A+Bx
  3. In the table that appears, enter the x and y values.
  4. After all the x and y values have been entered, press AC.
  5. Now, press Shift then 1.
  6. Press 5
  7. You’ll see choices for A, B, and other stats. Select the one you want, then press Enter.
  8. If, for example, you select A first and get its value, press Shift then 1 then 5 to return to the other choices. You can then choose B.
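
To check the calculator's A and B, the least-squares line can be computed directly. Here is a minimal Python sketch using the usual formulas B = Sxy/Sxx and A = ȳ - B·x̄; the function name fit_line is my own, and the calculator's internal method is assumed, not confirmed, to be ordinary least squares.

```python
xs = [10.1, 17.3, 25.4, 40.0]
ys = [14.2, 19.5, 22.9, 31.8]


def fit_line(xs, ys):
    """Least-squares intercept A and slope B for y = A + Bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


A, B = fit_line(xs, ys)
print(f"A = {A:.3f}, B = {B:.4f}")  # A = 8.770, B = 0.5746
```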

Source:

Casio fx-991ES PLUS C User’s Guide.

Statistics: how to calculate standard deviation with the Casio fx-991ES PLUS C

Tutoring statistics, you see lots of great calculators. The tutor shows how to calculate one-var standard deviation on the Casio fx-991ES PLUS C.

Let’s imagine you want the standard deviation of the short list of numbers 23, -12, 0, 15, 71. Here’s how to calculate it using the Casio fx-991ES PLUS C:

  1. Key in MODE then 3 then 1
  2. Key 23 = -12 = 0 = 15 = 71 =
  3. Perhaps surprisingly, press AC.
  4. Now, Shift then 1 then 4
  5. Hopefully, you see a menu offering the standard deviation, mean, and so on. Choose the one you want, then press Enter.
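
The same list can be checked in Python with the standard statistics module. Note that there are two standard deviations: the sample version (dividing by n-1) and the population version (dividing by n). Both are shown so that either calculator result can be matched.

```python
import statistics

data = [23, -12, 0, 15, 71]

print("mean:", statistics.fmean(data))                      # 19.4
print("sample sd:", round(statistics.stdev(data), 3))       # 31.848
print("population sd:", round(statistics.pstdev(data), 3))  # 28.486
```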

Source:

Casio fx-991ES PLUS C User’s Guide.

Statistics, spreadsheets: confidence interval for the mean, population standard deviation unknown: CONFIDENCE.T() function on LibreOffice Calc

Tutoring statistics, the tutor is happy to share the CONFIDENCE.T() function from LibreOffice Calc.

In my last couple of posts, I’ve talked about confidence intervals for the mean. Yesterday I mentioned finding one using Excel or LibreOffice Calc’s CONFIDENCE() function.

While the CONFIDENCE() function assumes the population standard deviation is known, I pointed out that, with sample size n≥31, the t-distribution approximates the normal closely enough that the sample standard deviation can be used. Today, I’ll make a direct comparison.

Yesterday’s post considered a sample mean of 67.3, known population standard deviation of 12.4, and sample size 42. The input

=confidence(0.05,12.4,42)

gave the result 3.75, meaning a confidence interval of 67.3±3.75, or 63.55 to 71.05.

LibreOffice Calc’s CONFIDENCE.T() function has the following format:

=confidence.t(1-confidence_level, sample_standard_deviation, sample_size)

Since it uses the sample standard deviation, CONFIDENCE.T() calculates the confidence interval from the t-distribution. By contrast, CONFIDENCE() takes the population standard deviation, so uses the normal distribution to calculate the confidence interval.

The following input

=confidence.t(0.05, 12.4, 42)

gives the result 3.864, implying a confidence interval of 67.3±3.864 or 63.44 to 71.16. Obviously this is not much different from the confidence interval 63.55 to 71.05 gotten using =confidence(0.05,12.4,42).
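
Both margins can be reproduced outside the spreadsheet. The Python sketch below is only an illustration of the underlying formulas (critical value × standard deviation / √n), not a description of how LibreOffice computes them. The standard library has no t-distribution, so the helpers t_pdf, t_cdf, and t_critical are my own: they integrate the t density numerically and find the critical value by bisection.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(p, df):
    """Value t with P(T <= t) = p (for p > 0.5), found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha, sd, n = 0.05, 12.4, 42
standard_error = sd / math.sqrt(n)

margin_z = NormalDist().inv_cdf(1 - alpha / 2) * standard_error
margin_t = t_critical(1 - alpha / 2, n - 1) * standard_error

print("normal-based margin:", round(margin_z, 2))  # about 3.75
print("t-based margin:", round(margin_t, 3))       # about 3.864
```

The t-based margin is slightly wider because the t critical value for 41 degrees of freedom is a little larger than the normal one.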

So, the CONFIDENCE.T() function seems to demonstrate that, for a sample size n≥31, the t-distribution approximates the normal distribution closely enough that the sample standard deviation can be used when the population standard deviation is unavailable.

HTH:)
