The recent success in deep learning is largely driven by the application of large neural networks. In this course, we will cover the foundations of the techniques that are commonly used to train those large neural networks, namely, the foundations of the first-order optimization methods. Those include gradient descent and its variants. This course will cover both theoretical analysis and empirical implementation.

WARNING: This course is very mathy but does involve a good amount of coding.


  • Instructor: Shangtong Zhang
  • Location: Olsson Hall 120
  • Time: Tuesday & Thursday, 14:00 - 15:15
  • TA: Kallie Whritenour
  • Office Hour:
    • Shangtong: Tuesday & Thursday 15:30 - 16:30 (Rice Hall 422)
    • Kallie: Monday & Wednesday 15:00 - 16:00 (Rice Hall 442)
  • UVACanvas: 23F CS4501 Optimization
  • Prerequisite:
    • APAM 2120 or equivalent: C- or better, we will use calculus
    • CS 2150 or CS 3140 or equivalent: C- or better, we will use Python
    • Math 3350 or APMA 3080 or equivalent: recommended but not necessary, we will heavily use linear algebra and I will cover some related topics at the first few lectures
    • APMA 3100 or APMA 3110 or MATH 3100 or equivalent: not required, we will not involve probability
  • Teaching: We will use the textbook First-Order Methods in Optimization (FOMO) by Amir Beck. We will also use Algebra, Topology, Differential Calculus, and Optimization Theory for Computer Science and Machine Learning (ATDO) by Jean Gallier and Jocelyn Quaintance as reference. FOMO is not free but ATDO is free. There will be no slides and the lectures are mostly whiteboards. To encourage attendance, I will NOT provide detailed notes. I will, however, provide a roadmap.pdf which is used to remind myself of what to discuss next. As a courtesy, I will try my best to record each lecture (not guaranteed) and post the recording in Canvas.


We will cover three algorithms: projected gradient descent, mirror descent, and proximal gradient descent. Chapters 1 - 7 provide background for the whole textbook. We will only cover the background related to the four algorithms. Chapters 8 - 10 analyze the four algorithms in depth. We will only cover their basic forms. The entire book rests on the idea of subgradient. We will cover only its special case, gradient, which simplifies everything. See roadmap.pdf for more details.

All the written assignments are expected to be PDF files generated by LaTeX. Here are some tips for LaTeX. The final letter grade is based on curved scores.

Date Comments
08/22 Assignment 1 released.
08/24 Reading: Section 3.1 - 3.6 of ATDO
08/29 Reading: Matrix Calculus
09/19 Assignment 1 due. Assignment 2 released.
10/03 Reading Days, no lecture. Assignment 2 due. Assignment 3 released.
10/17 Assignment 3 due.
10/24 Assignment 4 released.
11/07 Election Day, no lecture
11/09 Assignment 4 due. Assignment 5 released.
11/16 Assignments 7 released.
11/23 Thanksgiving recess, no lecture
11/28 Assignment 5 due. Assignment 6 (conjugate gradient descent) is skipped due to time limit.
12/05 Last lecture.
12/10 Assignment 7 due.


  • Late Policy: If you submit within 8 hours (a grace period) after the due date, there is no penalty. If you submit within 24 hours after the due date, you lose 33% scores. If you submit within 48 hours after the due date, you lose 66% scores. If you are not able to submit within 48 hours after the due date, you lose all scores. No exception will be made unless a doctor’s note is provided.
  • Regrading Policy: For every assignment, one regrading request is allowed. I will regrade the entire assignment and there is no guarantee that the score will not decrease.