A relatively good introductory text on statistics starts without either calculus or linear algebra and makes nice progress into experimental design and analysis of variance, that is, multivariate statistics with discrete data. It is from some serious experts in that field, at Iowa State University, with some serious reasons: how to maximize corn yields considering soil chemistry, water, seed variety, plowing techniques, fertilizer, etc. Long commonly used for undergraduates in the social sciences, e.g., educational statistics, agriculture, etc., it is (with my mark-up for TeX):
George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}
A good first, non-toy text on regression, with minimal prerequisites:
N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis.\/}
Some good books on regression and its usual generalizations, e.g., IIRC, factor analysis, i.e., principal components, discriminant analysis, etc.:
Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research.\/}
Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition.\/}
William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis.\/}
A mathematically relatively serious text on regression:
C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition.\/}
For multivariate statistics with discrete data, consider
George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}
Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data.\/}
Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice.\/}
Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 1, Introductory Topics.\/}
Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 2, New Developments.\/}
The classic on the math of analysis of variance:
Henry Scheff\'e, {\it Analysis of Variance.\/}
Broadly, in all of this, we are trying to analyze data on several variables, make predictions, etc.
In all of this, and as in the title
{\it The Analysis of Cross-Classified Data.\/}
above, suppose we have in mind random variables Y and X.
What is a random variable? Go outside. Measure something. Call that the value of random variable Y. What you measured was one value of possibly many that you might have measured. Considering all those possible values, there is a cumulative distribution function F_Y such that for any real number y, F_Y(y) is the probability that random variable Y is <= real number y:
P(Y <= y) = F_Y(y).
So, F_Y(y) is defined for all real numbers y, goes to 0 in the limit as y goes to minus infinity and to 1 in the limit as y goes to plus infinity. So, as we move real number y from left to right, F_Y(y) never decreases; it is monotone non-decreasing. On a nice day, function F_Y is differentiable, and then the derivative from calculus
f_Y(y) = d/dy F_Y(y)
exists and is the probability density of real random variable Y.
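To make the CDF/density relation concrete, here is a minimal sketch in Python for the standard normal distribution (my choice of example, not from the text above): numerically differentiating the cumulative distribution F_Y recovers the density f_Y.

```python
import math

# Standard normal cumulative distribution, written via the error function.
def F(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

# Standard normal probability density.
def f(y):
    return math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi)

# A central-difference numerical derivative of F approximates f,
# illustrating f_Y(y) = d/dy F_Y(y).
h = 1e-5
for y in (-1.0, 0.0, 1.5):
    deriv = (F(y + h) - F(y - h)) / (2.0 * h)
    assert abs(deriv - f(y)) < 1e-6
```

The same check works for any distribution whose F_Y is differentiable at the points tested.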
Here's the standard way to discover something about Y, in particular about its cumulative distribution F_Y:
We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent of Y and that have the same cumulative distribution as Y. Then for positive integer n, and for real number y, by the law of large numbers (the weak version has an easy proof), in the limit as n grows large, as accurately as we please, the fraction of the values
Y_1, Y_2, ..., Y_n
that are <= y
is F_Y(y). So, via such simple random sampling, for any real number y we can estimate F_Y(y), the cumulative distribution of Y.
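Here is a minimal sketch of that estimation in Python, under an assumed example distribution (my choice, not from the text): Y exponential with rate 1, so the true CDF is F_Y(y) = 1 - exp(-y) for y >= 0. The fraction of sampled values <= y converges to F_Y(y).

```python
import math
import random

random.seed(0)

# Draw n independent copies of Y, all with the same distribution as Y.
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]

def ecdf(y):
    # Fraction of the sampled values that are <= y: the empirical CDF.
    return sum(1 for s in samples if s <= y) / n

# By the law of large numbers, the empirical CDF is close to the true CDF.
for y in (0.5, 1.0, 2.0):
    true_F = 1.0 - math.exp(-y)
    assert abs(ecdf(y) - true_F) < 0.01
```

With n = 100,000 the typical error at each y is on the order of 1/sqrt(n), a fraction of a percent.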
For a little more, under meager assumptions that hold nearly universally in practice, if we take the ordinary grade school average of
Y_1, Y_2, ..., Y_n
as n increases we will approximate the average or expected value of Y denoted by E[Y].
To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along.
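A minimal sketch of that convergence in Python, again with an assumed example distribution (my choice): Y exponential with rate 1, so E[Y] = 1. The ordinary average of the samples approaches E[Y] as n grows.

```python
import random

random.seed(1)

# Y exponential with rate 1, so the expected value E[Y] = 1.
n = 200_000
samples = [random.expovariate(1.0) for _ in range(n)]

# The grade school average of the samples approximates E[Y].
mean = sum(samples) / n
assert abs(mean - 1.0) < 0.02
```
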
Now suppose we are also given random variable X. Maybe each value of X is just a real number, some 10 real numbers, 20 values from the set {1, 2, 3}, the last three weeks, 100 times a second, of the NYSE price of Microsoft, or full details on the atmosphere of the earth every microsecond for the past 5 billion years. That is, for the values of X we can accept a lot of generality. Still more generality is possible, but that would take us on a detour for a while.
For our point here, let's assume that X takes on just discrete values or we have just rounded off the values and forced them to be discrete. In practice we will have only finitely many discrete values.
Now we want to use X to predict Y.
So, sure, much of machine learning is to construct a model, maybe with regression trees or neural networks, to make this prediction, but here we will show a simpler way that, whenever we have enough data, is the most accurate possible in the sense of least mean squared error. How 'bout that!
This simpler way is just old cross tabulation: for each discrete value x of X, average the observed values of Y in the cell X = x. That cell average estimates the conditional expected value E[Y | X = x], which is the predictor of Y from X with the least mean squared error.
Net, over a wide range of real cases of trying to predict Y from X, we should just use cross tabulation unless we don't have enough data. Put another way, the main reason for empirical curve fitting with regression linear models or neural network continuous models is that we don't have enough data to use just cross tabulation.
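A minimal sketch of the cross-tabulation predictor in Python, on simulated data of my own choosing (X discrete in {1, 2, 3}, Y equal to a cell mean plus noise): with plenty of data per cell, the cell averages land close to the true conditional means E[Y | X = x].

```python
import random
from collections import defaultdict

random.seed(2)

# Simulated data: discrete X, and Y depending on X nonlinearly plus noise.
true_mean = {1: 2.0, 2: -1.0, 3: 5.0}
data = []
for _ in range(30_000):
    x = random.choice([1, 2, 3])
    y = true_mean[x] + random.gauss(0.0, 1.0)
    data.append((x, y))

# Cross tabulation: for each cell (value of X), average the observed Y's.
sums = defaultdict(float)
counts = defaultdict(int)
for x, y in data:
    sums[x] += y
    counts[x] += 1
predict = {x: sums[x] / counts[x] for x in counts}

# With enough data per cell, the cell average approaches E[Y | X = x],
# the predictor of Y from X with the least mean squared error.
for x in (1, 2, 3):
    assert abs(predict[x] - true_mean[x]) < 0.1
```

No functional form is assumed anywhere; the cells do all the work, which is exactly why the method needs enough data in every cell.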
For a preview of a coming attraction: we will notice in nearly all of regression and neural networks big concerns about overfitting. Well, with enough data in each cell, cross tabulation doesn't have that problem. How 'bout that!