Math for Data Science covers the elements of linear algebra, probability, statistics, and calculus most relevant to data science. These mathematical tools are applied to key topics such as dimensionality reduction, machine learning, and optimization techniques, including neural network training, stochastic gradient descent, linear and logistic regression, and accelerated methods.
Throughout, examples are accompanied by Python code, available as Jupyter notebooks here. The book also includes over 400 exercises and nine appendices that provide background material and additional context.
A neural network is a function defined by parameters, or weights. Given a large dataset fed into the network, the goal is to train the network: to adjust the weights so that the network's outputs closely match the dataset targets. This is achieved with gradient descent, which navigates the error landscape in weight space to minimize the error between outputs and targets.
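A minimal sketch of this idea for the simplest possible "network", a single linear layer trained by full-batch gradient descent on a synthetic dataset (the data, learning rate, and variable names here are illustrative assumptions, not taken from the book):

```python
import numpy as np

# Synthetic dataset: inputs x and targets y (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                  # 100 samples, 2 features
true_w = np.array([1.5, -2.0])
y = x @ true_w + 0.1 * rng.normal(size=100)    # targets with a little noise

# "Network": a single linear layer with weight vector w.
w = np.zeros(2)
lr = 0.1                                       # learning rate (step size)

for step in range(200):
    outputs = x @ w                            # network outputs for all samples
    error = outputs - y                        # difference between outputs and targets
    grad = 2 * x.T @ error / len(y)            # gradient of the mean squared error w.r.t. w
    w -= lr * grad                             # gradient descent step in weight space

print(w)   # converges close to true_w
```

Each iteration moves the weights a small step downhill on the error surface; with many weights and many samples, this full-gradient step is exactly what becomes expensive at scale.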
Historically, training neural networks at scale was impractical due to the large number of weights involved. A breakthrough came with stochastic gradient descent (SGD), first introduced in the 1950s and widely applied to neural networks in the 1980s. SGD can converge to a minimum of the error by following approximations of the true gradient, even when those approximations are noisy.
While computing the full gradient requires summing over every sample in the dataset, SGD estimates the gradient from small subsets of the data, known as minibatches. Each step is far cheaper, and convergence is preserved, albeit typically over a larger number of steps. Despite this trade-off, SGD has made large-scale neural network training feasible, paving the way for deep learning and AI.
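A minimal sketch of minibatch SGD on the same kind of synthetic linear model as above (the batch size, learning rate, and number of epochs are arbitrary illustrative choices):

```python
import numpy as np

# Synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
true_w = np.array([1.5, -2.0])
y = x @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
lr = 0.05
batch_size = 32

for epoch in range(20):
    perm = rng.permutation(len(y))               # shuffle the data once per epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]     # indices of one minibatch
        xb, yb = x[idx], y[idx]
        error = xb @ w - yb
        grad = 2 * xb.T @ error / len(yb)        # noisy estimate of the full gradient
        w -= lr * grad                           # cheap step using only the minibatch

print(w)   # still converges close to true_w
```

Each update touches only 32 samples instead of all 1000, so the per-step cost is a small fraction of a full gradient computation, while the noisy steps still drift toward the minimum on average.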