Problem Statement
- Mechanistic view
- DNN model that can read in an entire dataset and make predictions for new data points
- Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
- Probabilistic view
- Extract prior information from a set of tasks that allows efficient learning of new tasks
- Learning a new task combines this prior with a (small) training set to infer the most likely posterior parameters
- supervised learning
- $\arg\max\limits_{\phi}p(\phi|\mathcal{D})=\arg\max\limits_{\phi} \log p(\mathcal{D}|\phi)+\log p(\phi)$
- requires large amounts of labeled data
Description
- meta-parameters $\theta$: learned from the meta-training data via $p(\theta|\mathcal{D}_{\text{meta-train}})$
- $\phi\perp\!\!\!\perp\mathcal{D}_{\text{meta-train}}\mid\theta$: given $\theta$, the task-specific parameters $\phi$ are conditionally independent of the meta-training data
- $\theta^*=\arg\max\limits_\theta\log p(\theta|\mathcal{D}_{\text{meta-train}})=\arg\max\limits_\theta\sum_{i=1}^n\log p(\phi_i|\mathcal{D}_i^{\text{ts}}),\quad\phi_i=f_\theta(\mathcal{D}_i^\text{tr})$
- $\arg\max\limits_\phi\log p(\phi|\mathcal{D},\mathcal{D}_{\text{meta-train}})$
- $\log p(\phi|\mathcal{D},\mathcal{D}_{\text{meta-train}})=\log \int_\Theta p(\phi|\mathcal{D},\theta)p(\theta|\mathcal{D}_{\text{meta-train}})d\theta\approx \log p(\phi|\mathcal{D},\theta^*)+\log p(\theta^*|\mathcal{D}_{\text{meta-train}})$
- $\arg\max\limits_\phi\log p(\phi|\mathcal{D},\mathcal{D}_{\text{meta-train}})\approx \arg\max\limits_\phi\log p(\phi|\mathcal{D},\theta^*)$
- Dataset
- $\mathcal{D}_{\text{meta-train}}=\{(\mathcal{D}_1^{\text{tr}},\mathcal{D}_1^{\text{ts}}),\cdots,(\mathcal{D}_n^{\text{tr}},\mathcal{D}_n^{\text{ts}})\}$
- $\mathcal{D}_i^{\text{tr}}=\{(x_1^i,y_1^i),\cdots,(x_k^i,y_k^i)\}$ = support set
- $k$-shot: $k$ instances per class
- $t$-way: $t$ classes
- $\mathcal{D}_i^{\text{ts}}=\{(x_1^i,y_1^i),\cdots,(x_l^i,y_l^i)\}$ = query set
- $\mathcal{D}_i=\mathcal{D}_i^{\text{tr}}\cup\mathcal{D}_i^{\text{ts}}$
- episode/task: $\mathcal{T}_i=\mathcal{D}_i^{\text{tr}}\cup\mathcal{D}_i^{\text{ts}}$ (a sampling sketch follows this list)
- $\mathcal{D}_{\text{meta-validation}}$: held-out tasks for choosing meta-hyperparameters
- $\mathcal{D}_{\text{meta-test}}$: held-out tasks for final evaluation
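A minimal sketch of sampling one $t$-way, $k$-shot episode from a pool of labeled classes; `data_by_class`, the sizes, and the toy pool are illustrative assumptions, not any benchmark's actual loader:

```python
# Sample one t-way, k-shot episode (task T_i) from a pool of labeled examples.
# `data_by_class` is a hypothetical dict {class_label: [examples]}.
import random

def sample_episode(data_by_class, t_way=5, k_shot=1, l_query=15):
    classes = random.sample(list(data_by_class), t_way)   # pick t classes
    support, query = [], []                                # D_i^tr and D_i^ts
    for new_label, c in enumerate(classes):                # relabel classes 0..t-1
        examples = random.sample(data_by_class[c], k_shot + l_query)
        support += [(x, new_label) for x in examples[:k_shot]]   # k per class
        query   += [(x, new_label) for x in examples[k_shot:]]   # l per class
    return support, query                                  # disjoint by construction

# Toy pool: 20 classes, 30 instances each.
pool = {c: [f"img_{c}_{j}" for j in range(30)] for c in range(20)}
D_tr, D_ts = sample_episode(pool, t_way=5, k_shot=1)
print(len(D_tr), len(D_ts))   # 5 and 75
```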
Related Problems
- multi-task learning
- special case of meta-learning where $\phi_i=\theta$
- hyperparameter optimization
- $\theta$ = hyperparameters, $\phi$ = network weights
- AutoML (neural architecture search)
- $\theta$ = architecture, $\phi$ = network weights
Datasets (few-shot image recognition)
- Omniglot dataset
- 1623 characters from 50 different alphabets
- 20 instances of each character
- MiniImageNet
- CIFAR
- CUB
- CelebA
Meta-learning algorithms
Black-box adaptation
Train a neural network to represent $p(\phi_i|\mathcal{D}_i^{\text{tr}},\theta)$
- For now: use deterministic $\phi_i=f_\theta(\mathcal{D}_i^\text{tr})$
- $y^\text{ts}=f_\theta(\mathcal{D}_i^{\text{tr}}, x^\text{ts})$
- Form of $f_\theta$
- RNN
- LSTM
- NTM (Neural Turing Machine)
- Self-attention
- 1D convolutions
- feedforward + average
- Loss Function: $\mathcal{L}(\phi_i,\mathcal{D}_i^\text{ts})=-\sum_{(x,y)\sim \mathcal{D}_i^{\text{ts}}}\log g_{\phi_i}(y|x)$
- supervised learning: $\min\limits_\theta\sum_{\mathcal{T}_i}\mathcal{L}(f_\theta(\mathcal{D}_i^\text{tr}),\mathcal{D}_i^\text{ts})$
- Algorithm: for each iteration (see the training-loop sketch at the end of this subsection)
Sample task $\mathcal{T}_i$
Sample disjoint datasets $\mathcal{D}_i^\text{tr},\mathcal{D}_i^\text{ts}$ from $\mathcal{D}_i$
Compute $\phi_i\leftarrow f_\theta(\mathcal{D}_i^\text{tr})$
Update $\theta$ using $\nabla_\theta\mathcal{L}(\phi_i,\mathcal{D}_i^{\text{ts}})$
- Challenge & solution
- Outputting all neural-net parameters does not scale; instead output only sufficient statistics (e.g., a low-dimensional task representation $h_i$ that parameterizes the prediction head)
- Papers
- Optimization as a model for few-shot learning
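A minimal numpy sketch of the black-box training loop above, assuming toy 1-D linear-regression tasks $y=ax+b$ and a "feedforward + average" $f_\theta$ that outputs the two parameters of $g_{\phi_i}$; the names, task distribution, and hand-written gradient are illustrative assumptions (a real implementation would use an RNN/transformer for $f_\theta$ and autodiff):

```python
# Black-box adaptation on toy 1-D linear-regression tasks y = a*x + b.
# f_theta reads the support set (as an averaged feature vector) and outputs
# phi_i = (slope, intercept) for the predictor g_{phi_i}.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 4))      # meta-parameters theta = (W, c)
c = np.zeros(2)
alpha = 0.01                                # meta learning rate

def features(xs, ys):                       # permutation-invariant summary of D_i^tr
    return np.stack([xs, ys, xs * ys, xs ** 2], 1).mean(0)

for it in range(2000):
    a, b = rng.uniform(-2, 2, 2)            # sample task T_i
    x_tr, x_ts = rng.uniform(-1, 1, 5), rng.uniform(-1, 1, 10)
    y_tr, y_ts = a * x_tr + b, a * x_ts + b      # disjoint D_i^tr, D_i^ts
    s = features(x_tr, y_tr)
    phi = W @ s + c                          # phi_i = f_theta(D_i^tr)
    pred = phi[0] * x_ts + phi[1]            # g_{phi_i}(x^ts)
    err = pred - y_ts                        # squared-error loss L(phi_i, D_i^ts)
    d_phi = 2 * np.array([(err * x_ts).mean(), err.mean()])   # dL/dphi
    W -= alpha * np.outer(d_phi, s)          # gradient step on theta
    c -= alpha * d_phi

print("final task MSE:", (err ** 2).mean())
```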
Optimization-based inference (Model-agnostic meta-learning)
Acquire $\phi_i$ through optimization
- $y^\text{ts}=f_\text{MAML}(\mathcal{D}_i^\text{tr},x^\text{ts})=f_{\phi_i}(x^\text{ts})$
- $\phi_i=\theta-\alpha\nabla_\theta\mathcal{L}(\theta,\mathcal{D}_i^\text{tr})$
- $\max\limits_{\phi_i}\log p(\mathcal{D}_i^\text{tr}|\phi_i)+\log p(\phi_i|\theta)$
- Meta-parameters $\theta$ serve as a prior
- Initialization
- Fine-tuning
- Loss Function: $\min\limits_\theta\sum_{i}\mathcal{L}(\theta-\alpha\nabla_\theta\mathcal{L}(\theta,\mathcal{D}_i^\text{tr}), \mathcal{D}_i^\text{ts})$
- Algorithm: for each iteration (see the first-order MAML sketch at the end of this subsection)
Sample task $\mathcal{T}_i$
Sample disjoint datasets $\mathcal{D}_i^\text{tr},\mathcal{D}_i^\text{ts}$ from $\mathcal{D}_i$
Optimize $\phi_i\leftarrow\theta-\alpha\nabla_\theta\mathcal{L}(\theta,\mathcal{D}_i^\text{tr})$
Update $\theta$ using $\nabla_\theta\mathcal{L}(\phi_i,\mathcal{D}_i^{\text{ts}})$
- Papers
- Model-agnostic meta-learning
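A minimal numpy sketch of the MAML loop above on the same toy 1-D linear-regression tasks; to stay framework-free it uses the first-order MAML approximation (the second-order term of the outer gradient is dropped), and all names and hyperparameters are illustrative assumptions:

```python
# First-order MAML on toy tasks y = a*x + b with a linear model y_hat = theta[0]*x + theta[1].
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # meta-learned initialization
alpha, beta = 0.1, 0.01                      # inner and outer learning rates

def grad(params, x, y):                      # gradient of MSE loss w.r.t. (slope, bias)
    err = params[0] * x + params[1] - y
    return 2 * np.array([(err * x).mean(), err.mean()])

for it in range(3000):
    a, b = rng.uniform(-2, 2), rng.uniform(-1, 1)        # sample task T_i
    x_tr, x_ts = rng.uniform(-1, 1, 5), rng.uniform(-1, 1, 10)
    y_tr, y_ts = a * x_tr + b, a * x_ts + b              # disjoint D_i^tr, D_i^ts
    phi = theta - alpha * grad(theta, x_tr, y_tr)        # inner step: phi_i
    theta -= beta * grad(phi, x_ts, y_ts)                # outer step (first-order approx.)

# Meta-test time: adapt from theta with one gradient step on a new support set.
a, b = 1.5, -0.5
x_new = rng.uniform(-1, 1, 5); y_new = a * x_new + b
phi_new = theta - alpha * grad(theta, x_new, y_new)
print("adapted parameters:", phi_new)
```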
Non-parametric methods
use parametric meta-learners that produce effective non-parametric learners
- Key idea: use a non-parametric learner at test time (compare the test input to the training examples in a learned embedding space)
- Siamese network: two networks share weights and measure how similar the two inputs are
- Contrastive Loss
- Cosine Loss
- pseudo-siamese network: same structure, but the two branches do not share weights
- $y^{\text{ts}}=f_{\text{PN}}(\mathcal{D}_i^{\text{tr}},x^{\text{ts}})=\text{softmax}_k(-d(f_\theta(x^{\text{ts}}),c_k))$ (see the sketch at the end of this subsection)
- Prototype: $c_k=\frac{1}{|\mathcal{D}_{i,k}^{\text{tr}}|}\sum_{(x,y)\in \mathcal{D}_{i,k}^{\text{tr}}}f_\theta(x)$, where $\mathcal{D}_{i,k}^{\text{tr}}=\{(x,y)\in\mathcal{D}_i^{\text{tr}}:y=k\}$
- Loss function: $J(\theta)=-\log p_{\theta}(y=k|x^{\text{ts}})$ for the true class $k$
- learn embedding $f_\theta:\mathbb{R}^D\rightarrow\mathbb{R}^M$
- Papers
- Prototypical Networks for Few-shot Learning
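A minimal numpy sketch of the prototypical-network classification rule above: prototypes $c_k$ are mean support embeddings and the query is classified by a softmax over negative squared distances. Here $f_\theta$ is stubbed as a fixed random linear embedding and all sizes are illustrative; in practice $f_\theta$ is a CNN trained episodically with the loss $J(\theta)$:

```python
# Prototypical-network classification step (no training shown).
import numpy as np

rng = np.random.default_rng(0)
D, M = 10, 4
W_embed = rng.normal(size=(M, D))

def f_theta(x):                               # stub embedding R^D -> R^M
    return W_embed @ x

def predict(support_x, support_y, query_x, n_classes):
    protos = np.stack([
        np.mean([f_theta(x) for x, y in zip(support_x, support_y) if y == k], 0)
        for k in range(n_classes)             # c_k: mean embedding of class k
    ])
    z = f_theta(query_x)
    d = ((protos - z) ** 2).sum(1)            # squared Euclidean distance to each c_k
    logits = -d
    p = np.exp(logits - logits.max()); p /= p.sum()   # softmax(-d) over classes
    return p

# Toy 3-way, 2-shot episode with random support points.
sx = [rng.normal(size=D) for _ in range(6)]
sy = [0, 0, 1, 1, 2, 2]
print(predict(sx, sy, sx[2], n_classes=3))    # query taken from class 1's support
```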