Lecture Information
Wed 10:50-12:15 at Tian Jiabing Building 128
Grading
Your grade will be determined by assignments (50%), a course project (40%), and class participation (10%).
Teaching Staff
Instructor: Cheng Chen, 304 Math Building, chchen@sei.ecnu.edu.cn
Teaching Assistant: Ziqi Yao, 51265902073@stu.ecnu.edu.cn
Schedule
Date | Topic | Material | Homework |
9.20. | Introduction; Basic mathematics | slides | hw, sol |
9.27. | Convex sets; Convex functions | slides, notes | hw, sol |
10.11. | Smoothness & strong convexity; Gradient descent | slides | hw, sol |
10.18. | More on gradient descent | slides | hw, sol |
11.1. | Projected gradient descent; Frank-Wolfe algorithm | slides | hw, sol |
11.8. | Subgradient; Subgradient descent | slides | hw, sol |
11.15. | Proximal operator; Proximal gradient descent | slides | hw, sol |
11.22. | Accelerated gradient methods; Newton and quasi-Newton methods | slides | reading material |
11.29. | Stochastic gradient descent; Variance reduction | slides | hw, sol |
12.6. | Nonconvex optimization; Minimax optimization | slides | - |
Project
The course project can either be a practical implementation or a literature review:
Practical implementation: You are encouraged to investigate an optimization algorithm for a machine learning application and gain insight into its behavior. You should provide empirical evidence for the behavior of your chosen optimization algorithm or modification. The algorithm can be anything of your choice. You need to submit your code and a report of up to 3 pages (with unlimited pages for references and appendices). Your report should look like the “Experiments” section of a machine learning paper.
Literature review: You are encouraged to review a topic related to optimization. The review should give in-depth summaries of the theoretical results and provide an empirical comparison of representative algorithms. If you choose this option, you may work either individually or in a group of up to 3 students. The report is up to 6 pages (with unlimited pages for references and appendices).
Requirement
You need to submit an English report and your code. The submission deadline is January 7th, 2024.
You are required to write the report in LaTeX. We suggest using the ICML template.
Plagiarism is strictly prohibited, and it is unacceptable to duplicate entire sentences from other sources.
Topic examples for practical implementation
These areas are mentioned only for illustration, and you are welcome to choose topics not on the list; a minimal illustrative sketch follows the list.
Local minima for deep learning: Can you find differences between the 'shape' of local minima that SGD finds vs., e.g., AdaGrad or full gradient descent?
Meta-Learning: Can you learn the learning rate? See AutoML.
Zero-order optimization methods for ML applications.
Performance of quantized SGD: Is it different for deep learning as compared to linear ML models?
How do different optimization variants affect generalization (test error)?
Practical second-order ideas in deep learning: e.g., preconditioning as in Shampoo, or AdaHessian.
Distributed optimization algorithms.
...
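To make the expected workflow concrete, here is a minimal sketch of the kind of empirical comparison a practical-implementation project usually starts from. It is written in NumPy against a toy least-squares objective; the problem, batch size, step sizes, and iteration count are illustrative assumptions, not requirements of the course.

```python
import numpy as np

# Toy least-squares problem: f(w) = (1/2n) * ||X w - y||^2  (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def minibatch_grad(w, batch):
    # Stochastic gradient of the least-squares loss on a mini-batch.
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ w - yb) / len(batch)

def run(optimizer, steps=2000, batch_size=10, lr=0.1):
    w = np.zeros(d)
    acc = np.zeros(d)                       # AdaGrad's squared-gradient accumulator
    for _ in range(steps):
        batch = rng.choice(n, size=batch_size, replace=False)
        g = minibatch_grad(w, batch)
        if optimizer == "sgd":
            w -= lr * g
        elif optimizer == "adagrad":
            acc += g ** 2
            w -= lr * g / (np.sqrt(acc) + 1e-8)
    return 0.5 * np.mean((X @ w - y) ** 2)  # final full training loss

for opt in ["sgd", "adagrad"]:
    print(opt, run(opt))
```

An actual report would go well beyond this single number: run several random seeds, track test error as well as training loss, and present loss curves as plots with a discussion of the observed differences.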
Topic examples for literature review
You may freely choose topics other than those below, as long as they concern optimization for machine learning.
Escape from saddle points in nonconvex optimization, e.g., “Sharp analysis for nonconvex SGD escaping from saddle points”, C. Fang, Z. Lin, and T. Zhang, 2019.
Minimax optimization, e.g., “On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems”, T. Lin, C. Jin, and M. Jordan, ICML 2020.
Acceleration for variance reduction, e.g., “Katyusha: The First Direct Acceleration of Stochastic Gradient Methods”, Z. Allen-Zhu, STOC 2017.
Variance reduction for nonconvex optimization, e.g., “Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator”, C. Fang, C. Li, Z. Lin, T. Zhang, NeurIPS 2018.
Adam-like algorithms, e.g., “On the convergence of Adam and beyond”, S. Reddi, S. Kale, S. Kumar, ICLR 2018.
Federated learning, e.g., “On the convergence of fedavg on non-iid data”, X. Li, K. Huang, W. Yang, S. Wang, Z. Zhang, ICLR 2020.
Bilevel optimization, e.g., “A Fully First-Order Method for Stochastic Bilevel Optimization”, J. Kwon, D. Kwon, S. Wright, R. Nowak, ICML 2023.
Explicit superlinear convergence of quasi-Newton methods, e.g., “Greedy quasi-Newton methods with explicit superlinear convergence”, A. Rodomanov, Y. Nesterov, SIAM Journal on Optimization, 31(1):785–811, 2021.
Zeroth-order optimization, e.g., “Random gradient-free minimization of convex functions”, Y. Nesterov, V. Spokoiny, Foundations of Computational Mathematics, 17(2):527–566, 2017 (see the sketch after this list for the basic two-point estimator).
Nonsmooth nonconvex optimization, e.g., “Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions”, J. Zhang, H. Lin, S. Jegelka, S. Sra, A. Jadbabaie, ICML 2020.
...
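Several of these topics also lend themselves to quick numerical experiments. For example, the zeroth-order line of work above builds on gradient estimators that use only function values; below is a minimal sketch of the standard two-point Gaussian-smoothing estimator, roughly g_mu(x) = ((f(x + mu*u) - f(x)) / mu) * u with u drawn from a standard normal distribution. The quadratic test function, smoothing parameter, and step size are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-4):
    # Two-point Gaussian-smoothing estimate of the gradient of f at x,
    # computed from only two function evaluations.
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

# Illustrative use: drive a simple quadratic toward its minimum with
# "gradient descent" that never touches the true gradient.
f = lambda x: 0.5 * np.sum(x ** 2)
x = np.ones(10)
for _ in range(5000):
    x = x - 0.01 * zo_gradient(f, x)
print(f"final objective value: {f(x):.2e}")  # should end up close to the minimum value 0
```

A review using such a snippet would compare the estimator's variance and iteration cost against the theory summarized in the paper, rather than reporting a single run.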