In my opinion one of the best starting points is "Information Theory, Inference and Learning Algorithms" by David MacKay. It's a bit long in the tooth now, but it is still one of the most approachable and well-written books in the field.
Another old book that stands up very well is "Probability Theory: the Logic of Science" by E. T. Jaynes.
"Elements of Statistical Learning" by Tibshirani is also good.
"Bayesian Data Analysis" by Andrew Gelman is another great read.
"Deep Learning" by Ian Goodfellow and Yoshua Bengio is useful for getting caught up with recent advances in that field.
"Information Theory, Inference and Learning Algorithms" by David MacKaye
http://www.inference.org.uk/itprnn/book.pdf
"Probability Theory: the Logic of Science" by E. T. Jaynes
http://www.med.mcgill.ca/epidemiology/hanley/bios601/Gaussia...
"Elements of Statistical Learning" by Tibshirani
https://web.stanford.edu/~hastie/Papers/ESLII.pdf
"Bayesian Data Analysis" by Andrew Gelman
http://hbanaszak.mjr.uw.edu.pl/TempTxt/(Chapman%20&%20Hall_C...
edit: Goodfellow/Bengio/Courville, not mentioned in the previous comment, is also available online: http://www.deeplearningbook.org
So beyond just saying that you'd need grounding in multivariable calculus to do serious ML work, I would be super interested in hearing more about why that is and what kinds of problems crop up in ML that demand it.
A system which is at an optimum will, at that exact point, be no longer increasing or decreasing: a metal sheet balanced at the peak of a hill rests flat.
Many problems in ML are optimization problems: given some set of constraints, what choice of unknown parameters minimizes error? This can be very hard (NP-hard) in general, but if you design your situation to be "smooth" then you can use calculus and its very nice set of algebraic solutions.
You also need multivariate calculus because typically while you're only trying to minimize "error", you do so by changing many, many parameters at once. This means that you've got to talk about smooth changes in a high-dimensional space.
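To make that concrete, here's a minimal sketch (the data, step size, and iteration count are all made up for illustration) of minimizing a squared error over several parameters at once by following the gradient:

    import numpy as np

    # A minimal sketch: minimize squared error over 3 parameters at once by
    # following the gradient in a 3-dimensional parameter space.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))          # 100 samples, 3 unknown parameters
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    w = np.zeros(3)
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # partial derivatives w.r.t. all parameters
        w -= 0.1 * grad                        # step downhill
    print(w)                                   # close to true_w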
--
The other side of calculus is integration which talks about "measuring" how big things are. Most of probability is discussing very generalized ratios: of the total, "how big is this piece" is analogous to "what are the odds this will happen".
The general discussion of measure is complex and essentially the only tool to tackle it involves gigantic (infinite, really) sums of small, well-behaved pieces to form a complex whole.
It just happens to turn out (and this is the big secret of calculus) that this machinery (integration) is dual to the study of smooth changes and you can knock them both out together.
--
So ultimately, ML hinges upon being able to measure things (integration) and talk about how they change (differentiation). Those two happen to be the same concept in a way and they are essentially what you study in calculus.
Additionally, if you want to calculate a probability given a density function, or evaluate an expectation, you need to calculate several integrals. This arises quite often in the theoretical sections of ML papers/textbooks.
The use of calculus in ML is probably similar to the use of number theory in crypto: you can do applied work fine without it, but you understand the work a lot better by knowing the math, and are less likely to make dumb mistakes.
If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
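For intuition, here's a hedged sketch of that integral done numerically on a grid, for a made-up example: a coin's bias after 7 heads in 10 flips, with a uniform prior:

    import numpy as np

    # Bayes' law: posterior = likelihood * prior / integral(likelihood * prior).
    # Grid approximation for a coin's bias p after 7 heads in 10 flips.
    p = np.linspace(0, 1, 1001)
    dp = p[1] - p[0]
    prior = np.ones_like(p)                   # uniform prior
    likelihood = p**7 * (1 - p)**3            # binomial likelihood, up to a constant
    unnorm = likelihood * prior
    posterior = unnorm / (unnorm.sum() * dp)  # the denominator is the integral
    print((p * posterior).sum() * dp)         # posterior mean, ~0.667 (= 8/12)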
For ML you just need Calculus 1 and 2. Curl/div and Stokes' theorem are Calculus 3, which is more of a physics thing. You don't need that for ML.
You may need the basics of functional analysis in certain areas of ML, which is arguably Calculus 4.
Beyond that, for iterative methods, convergence is a matter of limits. This again is calculus. Formulating iteration as repeatedly applying a function, we converge (locally) to a fixed point of that function when the magnitude of the derivative at that fixed point is less than 1. Again derivatives come in.
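A tiny illustration of that derivative condition, iterating cos (whose derivative has magnitude below 1 near its fixed point):

    import math

    # Iterating x -> cos(x) converges because |d/dx cos(x)| < 1 near the
    # fixed point; an iteration like x -> 3x would diverge instead.
    x = 1.0
    for _ in range(50):
        x = math.cos(x)
    print(x)                  # ~0.739, the fixed point of cos
    print(abs(-math.sin(x)))  # |derivative| at the fixed point, ~0.67 < 1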
Finally, for error estimation, Taylor expansions are often useful. Again, the topic here is calculus. Notably, all I can think of regards limits and derivatives, not integrals. That might just be due to my hatred of integrals though.
[0] https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver...
The reason why this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that's getting revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture (another option is to treat it as an adversarial game with nature). In the probabilistic approach, we make the assumption that the function being revealed to us is in some probabilistic proximity of the true function and the sample is closing in on it slowly. We have to be careful not to be too eager to model the revealed function; our goal is to optimize the function where these revealed functions are ultimately headed.
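A rough sketch of that "revealed one datapoint at a time" picture is stochastic gradient descent; the data stream below is invented for illustration:

    import numpy as np

    # Stochastic gradient descent on a stream: we never see the true function,
    # only one noisy (x, y) sample at a time, yet the parameters drift toward it.
    rng = np.random.default_rng(1)
    w, b = 0.0, 0.0
    for _ in range(5000):
        x = rng.uniform(-1, 1)                  # one new datapoint is revealed
        y = 2.0 * x + 0.5 + 0.1 * rng.normal()  # hidden truth: y = 2x + 0.5, plus noise
        err = (w * x + b) - y
        w -= 0.05 * err * x                     # gradient step on this point alone
        b -= 0.05 * err
    print(w, b)                                 # near (2.0, 0.5)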
Those things aside, if you have to choose just one prereq, I think it has to be linear algebra and you already have that in your bag. Without it, a lot of multivariate calculus will not make much sense anyway. Then one can push things a little bit and go for the linear algebra where your vectors have infinite dimension. This becomes important because often your data will have far too much information to encode in a finite dimensional vector. Thankfully a lot of intuition carries over to infinite dimension (except when it does not). This goes by the name functional analysis. Not absolutely essential, but a lack of intuition here can rein you in from doing certain kinds of work. You will just get a better (at times spatial or geometric) understanding of the picture, etc.
Other than their motivating narratives, there is not much difference between probability/stats and information theory. There is a one-to-one mapping between many if not all of their core problems. A lot of this applies to signal processing too. Many of the problems that we are stuck at in these domains are the same. Sometimes a problem seems better motivated in one narrative over the other. Some will call it finding the best code for the source, others will call it parameter estimation, yet others will call it learning.
Or if I may paraphrase for the CS audience, blame the reals \mathbb{R}. Otherwise it would have been the problem of reverse engineering a noisy Turing machine that we can access only through its input and output. Pretty damn hard even if we don't get into reals. In those situations you could potentially get by without calculus; algebra by itself should go a long way, but as I said it gets frigging hard. Learning even the lowly regular expression from examples is hard. Calculus would still be helpful because many combinatorial / counting problems that come up can be dealt with via generating function techniques, where you would run into integral calculus with complex numbers.
If you want to read that book you need real analysis, more specifically measure theory (unless that subject is part of probability theory for you). You cannot get into the last few chapters without it. Dirichlet Processes are described using measures.
I don't believe you need multivar calc or info theory. Info theory stuff is used, but not that often. I believe you're slanted toward a researcher/PhD position. Gini index, entropy, and such are taken as given when needed.
http://andrewgelman.com/2017/08/02/seemingly-intuitive-low-m...
I disagree on multivar calc. Statistics often makes use of matrix derivatives. I have found it helpful to know.
KL(P||Q) = E_P[length_Q(X)] - E_P[length_P(X)]
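Reading that formula with ideal code lengths length_Q(x) = -log2 Q(x), the KL divergence is the expected extra bits paid for coding draws from P with a code built for Q. A small numerical check (the two distributions are made up):

    import numpy as np

    # KL(P||Q) as a difference of expected code lengths, assuming ideal
    # lengths length_Q(x) = -log2 Q(x).
    P = np.array([0.5, 0.25, 0.25])
    Q = np.array([0.25, 0.25, 0.5])
    kl = np.sum(P * -np.log2(Q)) - np.sum(P * -np.log2(P))
    print(kl)                          # 0.25 bits
    print(np.sum(P * np.log2(P / Q)))  # same number, the usual definition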
This is like saying, "You don't need to really know calculus, just integrals."
I think a solid background in linear algebra, multivariate calculus, and convex optimization will take you really far.
The linear algebra and probability theory are most important imho. I'd also distinguish between probability theory and statistics. Both are important, but they are distinct disciplines.
Thanks for the list! The only roadblock I've run into getting into many of these topics is book prices :O usually they are pretty steep
If you want to develop new techniques and algorithms, then the sky's the limit; you'll of course want stats too though.
[Foundations of Data Analysis](https://courses.edx.org/courses/course-v1:UTAustinX+UT.7.11x...)
Note: In this course, Dr. Michael J. Mahometa uses R. But I'd recommend you not to focus on R vs Python debates; the goal of this course is to learn about Statistics & Data Analysis in real-world scenarios. With that in mind, even just going through the reading material and lecture videos will be valuable enough if you're starting from scratch (but I'd recommend you to take the extra step and complete the Labs too).
For what's up in Data Science I like datatau.com, and there are some great podcasts too, like Data Science at Home and Partially Derivative (there are lists).
It is important to note that just because you can do all the stuff a PhD Scientist might regularly do, doesn't mean that someone will hire you for it. In that case you might need to have a PhD in mathematics, computer science or a related field. But that is more a consequence of competition and long term talent investment, than the practice of ML/AI itself.
In a way, you don't really need to know much more because there is a lot of good software out there.
If you want to learn more math, learn Linear Regression, Logistic Regression, p-values, probability density functions, cumulative distribution functions, the Central Limit Theorem, Gaussian Distributions, Exponential Distributions, the Binomial Distribution, and (maybe) the Student-t distribution.
If you want to learn even more, first learn matrices (adding, multiplying, inverting, rank, span, matrix decomposition (SVD, and eigendecomposition are the most important)).
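As a taste of what those decompositions look like in practice, a short numpy sketch (the matrix is arbitrary):

    import numpy as np

    # SVD and eigendecomposition of a small symmetric matrix.
    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
    U, s, Vt = np.linalg.svd(A)      # A = U @ np.diag(s) @ Vt
    evals, evecs = np.linalg.eig(A)  # A @ evecs = evecs @ np.diag(evals)
    print(s)                         # [4. 2.]
    print(evals)                     # also 4 and 2 (up to ordering), since A is
                                     # symmetric positive definite
    print(np.linalg.matrix_rank(A))  # 2: full rank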
If you want to learn even more, it's time to learn calculus. Integral calculus is needed for continuous probability distributions and information theory. Differential calculus is needed to understand back propagation.
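For instance, the backprop step for a single sigmoid neuron is just the chain rule; here's a sketch checked against a finite difference:

    import math

    # Chain rule through one sigmoid neuron: d/dw sigmoid(w*x),
    # checked against a numerical derivative.
    def forward(w, x):
        return 1.0 / (1.0 + math.exp(-w * x))

    w, x = 0.7, 2.0
    y = forward(w, x)
    grad_w = y * (1.0 - y) * x   # sigmoid'(w*x) * d(w*x)/dw

    eps = 1e-6
    numeric = (forward(w + eps, x) - forward(w - eps, x)) / (2 * eps)
    print(grad_w, numeric)       # the two agree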
There are a lot of other good suggestions written by the other commentators.
- Core statistics. You need to be familiar with how statisticians treat data, because it comes up a lot.
- Calculus. You do not need to be a wizard at working the numbers but you do need to understand how to describe the process of differentiation and integration over multiple variables comfortably.
- Linear algebra. It's essentially the basis for everything, even more than statistics.
- Numerical methods for computing. I constantly have to refer to references to understand why people make the choices they do.
- Theory of computation and the research clustered around it. Familiarity here helps a lot. Sometimes I even catch errors or am able to recognize improvements available. Also there is a lot of crossover, as one would expect. An example: everyone is remembering how good automatic differentiation is! And given that properly combined differentiable equations are also differentiable, AD lets you optimize over your optimization process. It's differentiable turtles all the way down.
My next big challenge is nonparametric statistics. Many researchers tell me that this is a very fruitful place to be and many methods there are increasingly making improvements in ML.
Tbh, I'm not where I want to be with them. So maybe next year I can talk about 2017 and my math odyssey.
If you're interested in finding more "freely available online" maths references, check out:
http://people.math.gatech.edu/~cain/textbooks/onlinebooks.ht...
http://www.openculture.com/free-math-textbooks
https://open.umn.edu/opentextbooks/SearchResults.aspx?subjec...
https://ocw.mit.edu/courses/online-textbooks/#mathematics
https://aimath.org/textbooks/approved-textbooks/
There's also a TON of high-quality maths instructional content on Youtube, Videolectures.net, etc. For example, there's some really good stuff by David MacKay (also mentioned in CuriouslyC's post) here:
http://videolectures.net/david_mackay/
Be sure to check out Professor Leonard:
https://www.youtube.com/user/professorleonard57
Gilbert Strang:
https://www.youtube.com/results?search_query=gilbert+strang
and 3blue1brown:
https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw
as well.
Another nice YT channel about math and physics.
I also recommend Siraj Raval's Youtube course the Math of Intelligence: https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2...
Multivariable calc you either "absolutely" need or don't really need. You should be well versed in graph theory, or don't need it much.
Surely some of the contradiction is caused by different assumptions of what the goal is. But some of it's hard to relate to as a reader. For example, I haven't been in the field but have tried to read enough to understand the concepts, and having studied graph theory I don't see how it's a top-5 recommendation.
I don't doubt anyone's experience, would just be nice to know which assumption is behind a suggestion.
On the other hand, if the assumption is that your particular problem is not solvable easily and reliably with the current approaches, then quite a lot of the math background helps - if you want to improve on the current results, or debug/understand why your solution doesn't work as intended, or why the conceptual solution can't work on your problem because of incompatible assumptions, then these areas of math are useful. If you want to use a new bleeding-edge construct, or a rare niche construct that's not yet implemented in the framework of your choice, then you're going to need to write it yourself, and then you need to understand how it works.
There's a large distance between using and applying ML techniques and researching and improving ML techniques; it's a continuum, but there's space for many people standing purely in the applied end.
Just having things be less opaque reduces cognitive load, makes more room for creative solutions.
But a good number of people that are doing work haven't taken real analysis, or it's been a while, so you should be current on multivariable and vector calculus. Calculus of variations shows up from time to time.
For math reviews, look at the following (there's others if you want more refs, ping me):
http://www.deeplearningbook.org/
https://metacademy.org/roadmaps/
http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...
The former needs the mathematical background mentioned here to develop groundbreaking algorithms or improve on existing ones, while the latter merely implements them and requires a much smaller mathematical toolset.
I think cars are a good analogy. In the early days of automobiles, you needed to be something just short of a mechanical engineer to keep one going for any length of time, and it was routine to need to carry around tools and spare parts to perform significant repairs. You really needed to know a pretty good bit about how the car worked to use it effectively. But over time cars developed better abstractions and became more dependable, and it became possible to operate a car without caring one lick about how it works, beyond knowing that it needs gas (or electricity!) and taking it in for the occasional tuneup / tire change / alignment / etc.
I wouldn't say we're at the point yet where ML affords one the opportunity to be completely divorced from caring about the underlying details, but I think we are at a point where you can legitimately get useful stuff done without needing to be able to, say, derive the equations for backprop by hand.
So for example, start with some source in ML/AI you'd like to read. If you get stuck, ask somewhere (possibly an online forum like this) what field you're having trouble with and how to get started there.
So, what do you mean by "pursuing"?
But even still, I would caution against trying to upload a bunch of new math concepts into your brain without first understanding the ML/AI context.
I would say go through both of Andrew Ng's ML and DL courses on Coursera.
Then, pick a domain/ problem that you're interested in.
Then, read papers about how ML/AI is applied in that domain.
Then, try to reproduce a paper that you understand and are interested in.
[Added] Of course writing webpages pays well enough, but I still can't shake off this feeling that I am missing something by not jumping on the AI/ML train though.
In-depth Introduction to Machine Learning: http://www.r-bloggers.com/in-depth-introduction-to-machine-l... and Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/ (by Trevor Hastie and Rob Tibshirani; free, I guess). For more depth, Elements of Statistical Learning by the same authors.
Linear Algebra (the linear algebra section in Andrew Ng's Introduction to Machine Learning is short and crisp)
If you're not scared by derivatives, you can check them out. But you can easily survive and even excel as a data scientist or ML practitioner with these.
Since you would rely on frameworks like Tensorflow to handle figuring out the derivatives for you, you don't really need to know much calculus. Just read up on what derivative of a function at a particular point signifies. This should give you enough intuition to understand things initially.
A skill that would really come in useful would be the ability to look at a function and think how increasing/decreasing one of the variables would affect its value. This would help develop intuition around a lot of concepts used in Deep Learning topologies.
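One cheap way to practice that skill is to nudge a variable and watch the output; the function below is made up for illustration:

    # Nudge one variable at a time and watch the output move: a numerical
    # stand-in for partial derivatives.
    def f(a, b):
        return a * a + 3 * b

    eps = 1e-6
    a, b = 2.0, 1.0
    print((f(a + eps, b) - f(a, b)) / eps)  # ~4.0: sensitivity to a grows with a
    print((f(a, b + eps) - f(a, b)) / eps)  # ~3.0: sensitivity to b is constant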
Watch the course.fast.ai lectures quickly, just to see a lot of practical ML/AI applications. You'll see how effective you can be just by knowing the tools with very little math background.
Next I'd look at the NEW Andrew Ng introduction on Coursera. It is much more approachable than his first course. You might still feel a little overwhelmed by a few equations, but then you'll implement them yourself in numpy. (And the ipython/jupyter notebooks are really well written, walking you through every step.)
Why do you want answers only from people doing deep learning? Deep learning is just a subset of the overall field (albeit an incredibly popular and useful one).
Anyway, the simple solution is just to use some simple machine learning of your own to analyze the data set which these threads constitute. Look for patterns... are certain answers being repeated over and over again, by different posters? Then I'd argue that your Bayesian posterior for "this is legitimately important" should go up.
Take Linear Algebra for example... given the sheer number of people saying "linear algebra" in their answers, it seems a reasonable bet to me that LA is really, truly useful. Either that or there's some really freaking group-think shit going on. :-)
I have attempted to read the Statistical Learning book, and it's so daunting because the book expects a lot of background knowledge, and it takes a while to really wrap your head around these concepts. I think people should learn from a lighter book before diving into these books if you are lacking the background.
My current approach to pursuing a career in DL and ML is going to graduate school, taking a graduate ML course, and trying to apply my knowledge to different problems I am interested in.
I am reading the Bishop book Pattern Recognition now. I think from the perspective of having to re-learn a lot of calculus and probability, that book is more approachable than Statistical learning.
My advice (which I am attempting now) to dive deep into ML is as follows:
1. Take a Bayesian ML class (at Cornell)
2. Read/study Pattern Recognition by Bishop, for 5 hrs/day
3. Try exercises; if I fail, review solutions
4. If lost (which is usual), review missing concepts from MIT OCW Scholar courses
There simply aren't very many people in those roles because the number of ML/AI/DL jobs out there are still limited, I think.
2. Regarding books I second the late David MacKay's "Information Theory, Inference and Learning Algorithms" and the second edition of "Elements of Statistical Learning" by Tibshirani et al. (there's also a more accessible version of a subset of the material targeting MBA students called James et al., An Introduction to Statistical Learning). Duda/Hart/Stork's Pattern Classification (2nd ed.) is also great. The self-published volume by Abu-Mostafa/Magdon-Ismail/Lin, Learning from Data: A Short Course is impressive, short and useful for self-study.
3. Wikipedia is surprisingly good at providing help, and so is Stack Exchange, which has a statistics sub-forum, and of course there are many online MOOC courses on statistics/probability and more specialized ones on machine learning.
4. After that you will want to consult conference papers and online tutorials on particular models (k-means, Ward/HAC, HMM, SVM, perceptron, MLP, linear and logistic regression, kNN, multinomial naive Bayes, ...).
People like to make this look harder than it is.
Here is a short tutorial on linear algebra: https://minireference.com/static/tutorials/linear_algebra_in... and a preview of the full book: https://minireference.com/static/excerpts/noBSguide2LA_previ...
Provides a very good idea of the courses required and their time frame. I roughly followed along this path but took "Analytics Edge" https://www.edx.org/course/analytics-edge-mitx-15-071x-3 for introduction into ML algorithms.
Doing any kind of ML means questioning all the assumptions that go into your results and understanding how those assumptions could affect the outcome. That process starts in stats.
I would recommend reading Toby Segaran's Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do
Russell and Norvig have a good book at http://aima.cs.berkeley.edu that covers many different topics in AI, although it is definitely not comprehensive. I would say that whatever you learn in an undergraduate CS degree would give you a good starting point for learning any particular AI topics.
- you are starting with the equivalent of a high school level of maths
- you want to take a ML course or read an ML book without feeling totally lost
As some commenters have said, Calculus, Probability and Linear Algebra will be very helpful.
Some people like to recommend the "best" or "most important" books which you "should" read, but there is a strong chance these will end up sitting on a bookshelf, barely touched. So I will recommend some books which are perhaps more accessible.
- Calculus by Gilbert Strang
- Linear Algebra by Gilbert Strang
For Probability: I don't have any favourites, sorry.
I would recommend probability theory and statistics as well.
* Linear algebra
* Optimisation
* Probability
Various universities have very good course content freely available online, often including textbook recommendations, course notes, exercises, sample exams, and video lectures. Realistically it is probably going to be quite difficult to learn this on your own.
1) First read all the prerequisites and then work on a problem
2) Start working on a problem and learn all the math ML/AI as you need
The second option works best.
(1) Calculus
Generally should have college freshman and sophomore calculus.
(1.1) Functions
So, there you can understand better what a function is. E.g., function
f(x) = 3x^2 + 1.
(1.2) Derivatives
Then you will learn how to find the slope of the graph of a function. That is the derivative of the function. E.g., for function f with f(x) = 3x + 2, as in high school algebra, the slope is 3. Then for each x, the derivative of f at x is just 3.
The derivative of function f is denoted by either of
f'(x) = d/dx f(x)
E.g., for function f(x) = 3x^2 + 1 it turns out that f'(x) = 6x.
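If you want to check such a derivative numerically, a few lines of Python suffice (a finite difference, just for illustration):

    # Checking f'(x) = 6x for f(x) = 3x^2 + 1 with a small finite difference.
    f = lambda x: 3 * x**2 + 1
    eps = 1e-6
    for x in (0.0, 1.0, 2.0):
        print((f(x + eps) - f(x - eps)) / (2 * eps), 6 * x)  # pairs agree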
(1.3) Integration
For function
g(x) = 6x
maybe we want to know what function f(x) will give us f'(x) = g(x).
Finding such a function f is anti-differentiation, that is, it undoes differentiation. So, sure, f(x) = 3x^2 + C for any constant C.
Such anti-differentiation is also the way to find the area under a curve. So, we can use that to find the area of a circle, volume of a cylinder, etc. Done that way, the anti-differentiation is integration.
The fundamental theorem of calculus shows how differentiation and integration are related.
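A quick numerical illustration of that theorem (a crude Riemann sum, just a sketch):

    import numpy as np

    # Integrating g(x) = 6x from 0 to 2 recovers f(2) - f(0) = 13 - 1 = 12
    # for the antiderivative f(x) = 3x^2 + 1.
    x = np.linspace(0.0, 2.0, 100001)
    integral = np.sum(6 * x[:-1]) * (x[1] - x[0])  # left Riemann sum
    print(integral)                                # ~12.0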
(1.4) Analytic Geometry
Commonly taught at the beginning of a calculus course is analytic geometry.
So, take a cone and cut it. Then the cut surface will be one of a circle, an ellipse, a parabola, a hyperbola, or just two crossed straight lines. So, those curves are from a cone and are the conic sections.
There is some simple associated algebra.
Conic sections are important off and on; e.g., applied math is awash in circles; the planets move in ellipses; a baseball moves in a parabola or nearly so; an electron moving toward a negative charge will turn away from that charge in a hyperbola.
It turns out that in linear algebra (below) circles and ellipses are important.
(1.5) Role of Calculus
Calculus was invented by Newton as part of working with force and acceleration for understanding the motion of the planets.
E.g., if at time t function d(t) gives distance traveled, then function v(t) = d'(t) is the velocity at time t and function a(t) = v'(t) is the acceleration at time t.
Then Newton's second law is
F(t) = m a(t)
where F(t) is the force at time t applied to mass m.
Calculus is the first approach to the analysis of continuous change and is a pillar of civilization.
Knowledge of calculus will commonly be assumed in work in ML/AI, data science, statistics, optimization, applied math, engineering, etc.
E.g., a lot in ML, AI, and data science is getting best fits to data; best fitting is to minimize errors in the fit; such minimization is mostly a calculus problem; one of the main steps in ML is steepest descent, and that is from a derivative.
Probability theory (e.g., evaluating coin tossing, poker hands, accuracy in ML) will be important in ML/AI, etc.; two of the basic notions in probability are cumulative distributions and density distributions; the cumulative is from an integration, and the density is from a differentiation.
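That cumulative/density relationship is easy to see numerically, e.g., for a standard Gaussian with scipy:

    from scipy.stats import norm

    # The density is the derivative of the cumulative: check at one point
    # with a finite difference.
    x, eps = 0.5, 1e-6
    d_cdf = (norm.cdf(x + eps) - norm.cdf(x - eps)) / (2 * eps)
    print(d_cdf, norm.pdf(x))  # the two agree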
(2) Linear Algebra
(2.1) Linear Equations
The start of linear algebra was seen in high school algebra, solving systems of linear equations.
E.g., we seek numerical values of x and y so that
3 x - 2 y = 7
-x + 2 y = 8
So, that is two equations in the two unknowns x and y.
Well, for positive integers m and n, we can have m linear (linear is as in the above example, omitting here a careful definition) equations in n unknowns.
Then depending on the constants, there will be none, one, or infinitely many solutions.
E.g., likely the central technique of ML and data science is fitting a linear equation to data. There the central idea is the set of normal equations which are linear (and, crucially, symmetric and non-negative semi-definite as covered carefully in linear algebra).
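As an aside, here is a minimal numpy sketch of such a fit via the normal equations (the data is invented for illustration):

    import numpy as np

    # Least-squares line fit via the normal equations (A^T A) w = A^T b.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    b = np.array([1.1, 2.9, 5.2, 7.1])         # made-up data near y = 2x + 1
    A = np.column_stack([x, np.ones_like(x)])  # columns: slope, intercept
    w = np.linalg.solve(A.T @ A, A.T @ b)
    print(w)                                   # ~[2.03, 1.03]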
(2.2) Gauss Elimination
The first technique for attacking linear equations is Gauss elimination. There can determine if there are none, one, or infinitely many solutions. For one solution, can find it. For infinitely many solutions can find one solution and for the rest characterize them as from arbitrary values of several of the variables.
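In code, Gauss elimination (via an LU factorization) is essentially what a library routine like numpy's solve performs; here it is on the example system from above:

    import numpy as np

    # Solving the 2 x 2 system from above: 3x - 2y = 7, -x + 2y = 8.
    A = np.array([[3.0, -2.0],
                  [-1.0, 2.0]])
    b = np.array([7.0, 8.0])
    print(np.linalg.solve(A, b))  # [7.5, 7.75], the unique solution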
(2.3) Vectors and Matrices
A nice step forward in working with systems of linear equations is the subject of vectors and matrices.
A good start is just
3 x - 2 y = 7
-x + 2 y = 8
we saw above. What we do is just rip out the x and y, call that pair a vector, leave the constants on the left as a matrix, and regard the constants on the right side as another vector. Then the left side becomes the matrix theory product of the matrix of the constants and the vector of the unknowns x and y.
The matrix will have two rows and two columns, written roughly as
/        \
|  3  -2 |
|        |
| -1   2 |
\        /
So, this matrix is said to be 2 x 2 (2 by 2).
Sure, for positive integers m and n, we can have a matrix that is m x n (m by n), which means m rows and n columns.
The vector of the unknowns x and y is 2 x 1 and is written
/   \
| x |
|   |
| y |
\   /
So, we can say that the matrix is A; the unknowns are the components of vector v; the right side is vector b; and the system of equations is
Av = b
where Av is the matrix product of A and v. How is this product defined? It is defined to give us just what we had with the equations we started with -- here omitting a careful definition.
So, we use a matrix and two vectors as new notation to write our system of linear equations. That's the start of matrix theory.
It turns out that our new notation is another pillar of civilization.
Given a m x n matrix A and an n x p matrix B, we can form the m x p matrix product AB. Amazingly, this product is associative. That is, if we have p x q matrix C then we can form m x q product
ABC = (AB)C = A(BC)
It turns out this fact is profound and powerful.
The proof is based on interchanging the order of two summation signs, and that fact generalizes.
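Associativity is easy to check numerically (shapes chosen to be compatible, values random):

    import numpy as np

    # (AB)C equals A(BC) for any compatible shapes m x n, n x p, p x q.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(2, 3))
    B = rng.normal(size=(3, 4))
    C = rng.normal(size=(4, 5))
    print(np.allclose((A @ B) @ C, A @ (B @ C)))  # True, up to float rounding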
Matrix product is the first good example of a linear operator in a linear system. The world is awash in linear systems. There is a lot on linear operators, e.g., Dunford and Schwartz, Linear Operators. Electronic engineering, acoustics, and quantum mechanics are awash in linear operators.
To build a model of the real world, for ML, AI, data science, ..., etc., the obvious first cut is to build a linear system.
And if one linear system does not fit very well, then we can use several in patches of some kind.
(2.4) Vector Spaces
For the set of real numbers R and a positive integer n, consider the set V of all n x 1 vectors of real numbers. Then V is a vector space. We can write out the definition of a vector space and see that the set V does satisfy that definition. That's the first vector space we get to consider.
But we encounter lots more vector spaces; e.g., in 3 dimensions, a 2 dimensional plane through the origin is also a vector space.
Gee, I mentioned dimension; we need a good definition and a lot of associated theorems. Linear algebra has those.
So, for matrix A, vector x, and vector of zeros 0, the set of all solutions x to
Ax = 0
is a vector space, and it and its dimension are central in what we get in many applications, e.g., at the end of Gauss elimination, fitting linear equations to data, etc.
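A small sketch of computing that solution space with scipy (the matrix is made up; its rank is 1, so the null space has dimension 3 - 1 = 2):

    import numpy as np
    from scipy.linalg import null_space

    # The set of solutions to Ax = 0 for a rank-1, 2 x 3 matrix.
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])  # second row = 2 * first row, so rank 1
    N = null_space(A)
    print(N.shape)                   # (3, 2): a 2-dimensional space of solutions
    print(np.allclose(A @ N, 0))     # True: each basis vector solves Ax = 0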
(2.5) Eigen Values, Vectors
Eigen in German translates to English roughly as own, inherent, or characteristic.
Well, for a n x n matrix A, we might have that
Ax = lx
for number l. In this case what matrix A does to vector x is just change its length by l and keep its direction the same. So, l and x are quite special. Then l is an eigenvalue of A, and x is a corresponding eigenvector of A.
These eigen quantities are central to the crucial singular value decomposition, the polar decomposition, principal components, etc.
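In numpy that looks like the following (a small diagonal example for clarity):

    import numpy as np

    # Eigenpairs satisfy A x = l x: A scales x without changing its direction.
    A = np.array([[2.0, 0.0],
                  [0.0, 3.0]])
    evals, evecs = np.linalg.eig(A)
    for l, x in zip(evals, evecs.T):         # eigenvectors are the columns
        print(l, np.allclose(A @ x, l * x))  # 2.0 True, then 3.0 True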
(2.6) Texts
A good, now quite old, intermediate text in linear algebra is by Hoffman and Kunze, IIRC now available for free as PDF on the Internet.
A special, advanced linear algebra text is P. Halmos, Finite Dimensional Vector Spaces written in 1942 when Halmos was an assistant to John von Neumann at the Institute for Advanced Study. The text is an elegant finite dimensional introduction to infinite dimensional Hilbert space.
At
http://www.american.com/archive/2008/march-april-magazine-co...
is an entertaining article about Harvard's course Math 55. At one time that course used that book by Halmos and also, see below, Baby Rudin.
For more there is
Richard Bellman, Introduction to Matrix Analysis.
Horn and Johnson, Matrix Analysis.
There is much more, e.g., on numerical methods. There a good start is LINPACK, the software, associated documentation, and references.
(3) More
The next two topics would be probability theory and statistics.
For a first text in either of these two, I'd suggest you find several leading research universities, call their math departments, and find what texts they are using for their first courses in probability and statistics. I'd suggest you get the three most recommended texts, carefully study the most recommended one, and use the other two for reference.
Similarly for calculus and linear algebra.
For more, that would take us into an undergrad math major. Again, make some phone calls for a list of recommended texts. One of those might be
W. Rudin, Principles of Mathematical Analysis.
aka, "Baby Rudin". It's highly precise and challenging.
For more,
H. Royden, Real Analysis
W. Rudin, Real and Complex Analysis
L. Breiman, Probability
M. Loeve, Probability
J. Neveu, Mathematical Foundations of the Calculus of Probability
The last two are challenging.
For Bayesian, that's conditional expectation from the Radon-Nikodym theorem with a nice proof by John von Neumann in Rudin's Real and Complex Analysis.
After those texts, often can derive the main results of statistics on your own or just use Wikipedia a little. E.g., for the Neyman-Pearson result in statistical hypothesis testing, there is a nice proof from the Hahn decomposition from the Radon-Nikodym theorem.
I'm no expert but does anyone think these apply?
It depends on what you're doing. I was literally just watching a video on Generative Adversarial Networks this morning, and game theory did come up there, at least in passing. If one sat down and started reading the papers on this subject and trying to implement / improve stuff in this area, I suspect game theory would be at least moderately important.
There is also the field of Competitive Learning where game theory has some application. See, for example:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71....
None.
> Are there any seminal texts/courses/content which should be consumed before starting?
No.
You don't need to know binary to start being a programmer/developer either. Just start already. As long as you are not in charge of a medical diagnosis or financial model, you don't get any drawback in experimenting (and failing miserably).
Assuming applied ML, the most difficult part will be the human-political business element of it: People not understanding your model or using its output correctly, bias, feedback loops, acquiring enough resources, etc. The more you can explain to them, without resorting to heavy maths, the better communicator you are.
That said, it can't hurt to do Ng's Coursera course (a lot of top performers started out with this course). Learning from Data by Caltech's Abu-Mostafa goes very wide on machine learning. "Programming Collective Intelligence" is a, somewhat dated, good book.
As for seminal texts, the field is too wide for this. A better bet is: Find a professor in the field you are interested in. Say "Deep Learning", you could have a look at LeCun, Hinton, Schmidhuber, Bengio, ... Now look at their PhD-students, their papers, their courses, their conference talks, their software, their current research. Basically become a student under the most authoritative professor in the subfield you can find and resonate with, without ever paying any university tuition or them knowing you exist. This is very possible these days.
But by all means: Just start out. Machine learning is fun. Learning about dry 100-year-old maths, not so much. Make mistakes. Learn to detect and avoid overfitting. Find out if you are passionate and curious about parts of the field; then the theory will come eventually. A lot of the time these questions seem to demand answers like "You need a PhD-level understanding of mathematics", just so your brain can go: "I am not good enough for this, so let's look at something easier". Don't use this as an excuse. Start making intelligent stuff. There are 16-year-olds on Kaggle routinely beating maths PhDs.
Also remember that, despite the current trend of calling everything "AI", that AI is a very wide field, of which mathematics is only a small part. There is philosophy, linguistics, cognitive science, physics, neuroscience, psychology, computer science, robotics, logic, ... all these parts vary wildly in their prerequisite maths knowledge.