diff --git a/_book/convnets.md b/_book/convnets.md
index 6be038e..3bd8638 100644
--- a/_book/convnets.md
+++ b/_book/convnets.md
@@ -112,7 +112,7 @@ Prior to this chapter, we've just looked at _fully-connected layers_, in which e
 
 {% include figure_multi.md path1="/images/figures/weights_analogy_2.png" caption1="Weights analogy" %}
 
-We can interpret the set of weights as a _feature detector_ which is trying to detect the presence of a particular feature. We can visualize these feature detectors, as we did previously for MNIST and CIFAR. In a 1-layer fully-connected layer, the "features" are simply the the image classes themselves, and thus the weights appear as templates for the entire classes.
+We can interpret the set of weights as a _feature detector_ which is trying to detect the presence of a particular feature. We can visualize these feature detectors, as we did previously for MNIST and CIFAR. In a 1-layer fully-connected layer, the "features" are simply the image classes themselves, and thus the weights appear as templates for the entire classes.
 
 In convolutional layers, we instead have a collection of smaller feature detectors called _convolutional filters_ which we individually slide along the entire image and perform the same weighted sum operation as before, on each subregion of the image. Essentially, for each of these small filters, we generate a map of responses--called an _activation map_--which indicate the presence of that feature across the image.
 
@@ -216,7 +216,7 @@ Generative applications of convnets, including those in the image domain and ass
 
 {% include further_reading.md title="Conv Nets: A Modular Perspective" author="Chris Olah" link="https://colah.github.io/posts/2014-07-Conv-Nets-Modular/" %}
 
-{% include further_reading.md title="Visualizing what ConvNets learn (Stanford CS231n" author="Andrej Karpathy" link="https://cs231n.github.io/understanding-cnn/" %}
+{% include further_reading.md title="Visualizing what ConvNets learn (Stanford CS231n)" author="Andrej Karpathy" link="https://cs231n.github.io/understanding-cnn/" %}
 
 {% include further_reading.md title="How do Convolutional Neural Networks work?" author="Brandon Rohrer" link="https://brohrer.github.io/how_convolutional_neural_networks_work.html" %}
 
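The first hunk above describes sliding a small filter across the image and taking the same weighted sum at each subregion to build an activation map. A minimal numpy sketch of that operation, for readers who want to see it concretely; the function name, image size, and filter values are illustrative assumptions, not code from the book:

```python
import numpy as np

def activation_map(image, filt):
    """Slide a single filter over a 2D image and record the weighted-sum
    response at every position (a 'valid' convolution, no padding)."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # the same weighted sum as a fully-connected neuron, but applied
            # to one small subregion of the image at a time
            out[y, x] = np.sum(image[y:y+fh, x:x+fw] * filt)
    return out

# toy example: a 3x3 vertical-edge detector applied to a random 8x8 image
image = np.random.rand(8, 8)
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
print(activation_map(image, vertical_edge).shape)  # (6, 6)
```

Each output value is one entry of the activation map for that filter; in a convnet the filter weights are learned rather than hand-designed.
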
diff --git a/_book/how_neural_networks_are_trained.md b/_book/how_neural_networks_are_trained.md
index cbdfbc4..2dae9c4 100644
--- a/_book/how_neural_networks_are_trained.md
+++ b/_book/how_neural_networks_are_trained.md
@@ -167,7 +167,7 @@ So how do we actually calculate where that point at the bottom is exactly? There
 
 ## The curse of nonlinearity
 
-Alas, ordinary least squares cannot be used to optimize neural networks however, and so solving the above linear regression will be left as an exercise left to the reader. The reason we cannot use linear regression is that neural networks are nonlinear; Recall the essential difference between the linear equations we posed and a neural network is the presence of the activation function (e.g. sigmoid, tanh, ReLU, or others). Thus, whereas the linear equation above is simply $$y = b + W^\top X$$, a 1-layer neural network with a sigmoid activation function would be $$f(x) = \sigma (b + W^\top X) $$.
+Alas, ordinary least squares cannot be used to optimize neural networks, however, and so solving the above linear regression will be left as an exercise to the reader. The reason we cannot use linear regression is that neural networks are nonlinear; recall that the essential difference between the linear equations we posed and a neural network is the presence of the activation function (e.g. sigmoid, tanh, ReLU, or others). Thus, whereas the linear equation above is simply $$y = b + W^\top X$$, a 1-layer neural network with a sigmoid activation function would be $$f(x) = \sigma (b + W^\top X) $$.
 
 This nonlinearity means that the parameters do not act independently of each other in influencing the shape of the loss function. Rather than having a bowl shape, the loss function of a neural network is more complicated. It is bumpy and full of hills and troughs. The property of being "bowl-shaped" is called [convexity](https://en.wikipedia.org/wiki/Convex_function), and it is a highly prized convenience in multi-parameter optimization. A convex loss function ensures we have a global minimum (the bottom of the bowl), and that all roads downhill lead to it.
 
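The contrast in the rewritten paragraph between $$y = b + W^\top X$$ and $$f(x) = \sigma (b + W^\top X)$$ can be made concrete in a few lines of numpy; the variable sizes and values below are illustrative assumptions, not code from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(3,))   # one input with 3 features
W = rng.normal(size=(3,))   # one weight per feature
b = 0.5                     # bias

linear_output = b + W @ X              # y = b + W^T X
neuron_output = sigmoid(b + W @ X)     # f(x) = sigma(b + W^T X)

# The sigmoid squashes the weighted sum into (0, 1); this nonlinearity is
# what keeps the closed-form ordinary least squares solution from applying.
print(linear_output, neuron_output)
```

Because the weights only influence the output through the nonlinear squashing function, the resulting loss surface is no longer the convex bowl of linear regression, which is exactly the point the following paragraph makes.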