[MAE] Masked Autoencoders Are Scalable Vision Learners

2024/11/19 10:09:51 Source: https://blog.csdn.net/sinat_30618203/article/details/141069729

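1. Purpose

Self-supervised pretraining has been very successful in NLP, and computer vision can borrow its masked autoencoding approach. The main challenges are:

1）CNNs operate on regular grids and cannot directly use mask tokens or positional embeddings -> Vision Transformers (ViT)

2）Language is human-made and highly dense in semantics and information, so a model learns something useful even when it is trained only to predict a few missing words in a sentence. Images, by contrast, have heavy spatial redundancy: a missing patch can be recovered from neighboring patches without any high-level semantic understanding -> randomly mask a large fraction of the patches

3）The decoder largely determines the semantic level of the learned latent representation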

asymmetric encoder-decoder

1）masking

a）non-overlapping patches

b）random sampling (uniform distribution)

c）high masking ratio (75%; this reduces redundancy and keeps the content from being inferred from visible neighboring patches)
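
The three masking steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function name `random_masking`, the `seed` argument, and the count of 196 patches (a 224x224 image cut into 16x16 patches) are assumptions made for the example.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Pick visible patches uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    # A uniformly random permutation of the patch indices.
    ids_shuffle = list(range(num_patches))
    rng.shuffle(ids_shuffle)
    # The first (1 - mask_ratio) share of the permutation stays visible.
    len_keep = int(num_patches * (1 - mask_ratio))
    ids_keep = ids_shuffle[:len_keep]
    ids_mask = ids_shuffle[len_keep:]
    # Binary mask in the original patch order: 1 = masked, 0 = visible.
    mask = [1] * num_patches
    for idx in ids_keep:
        mask[idx] = 0
    return ids_keep, ids_mask, mask


# Assumed setup: 14 x 14 = 196 non-overlapping patches.
ids_keep, ids_mask, mask = random_masking(196, mask_ratio=0.75, seed=0)
```

Sampling through a shuffled permutation guarantees that every patch is either kept or masked exactly once, with no overlap between the two sets.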

2）encoder

a）ViT (transformer blocks + positional embedding)

b）Only the visible patches are fed to the encoder, with no mask tokens. This greatly speeds up pretraining (about 3x) and reduces memory consumption
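
A rough back-of-the-envelope calculation shows why dropping the masked patches makes the encoder so much cheaper. The image size of 224 and patch size of 16 are assumptions for illustration, the class token is ignored, and the function name is hypothetical. The numbers below describe sequence length only, not measured training time.

```python
def encoder_token_counts(image_size=224, patch_size=16, mask_ratio=0.75):
    """Count total patches and the visible patches seen by the encoder."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    num_visible = int(num_patches * (1 - mask_ratio))
    return num_patches, num_visible


num_patches, num_visible = encoder_token_counts()

# Share of tokens the encoder actually processes.
token_fraction = num_visible / num_patches
# Self-attention compares every token with every other token, so the
# number of token pairs shrinks with the square of the sequence length.
attention_pair_fraction = token_fraction ** 2
```

Under these assumptions the encoder sees 49 of 196 tokens, so per-token layers do a quarter of the work and the attention maps hold one sixteenth of the token pairs.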

3）decoder

a）lightweight; Transformer blocks + positional embeddings

b）Takes both the latent representation and the mask tokens as input
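
How the decoder input is assembled can be sketched as follows. This is a simplified illustration with plain lists: the helper name `build_decoder_input` and the toy 2-dimensional tokens are assumptions, and in a real model the mask token would be a single shared learned vector rather than fixed zeros.

```python
def build_decoder_input(latent_tokens, ids_keep, num_patches, mask_token):
    """Place encoded visible tokens back at their original positions and
    fill every masked position with a copy of the mask token."""
    # Start with the mask token at every position.
    sequence = [list(mask_token) for _ in range(num_patches)]
    # Overwrite the visible positions with the encoder outputs.
    for token, idx in zip(latent_tokens, ids_keep):
        sequence[idx] = list(token)
    # Positional embeddings would then be added to every position so that
    # the mask tokens carry location information (omitted in this sketch).
    return sequence


# Toy example: 4 patches, patches 2 and 0 were visible (in that order).
latent = [[1.0, 2.0], [3.0, 4.0]]
decoder_input = build_decoder_input(
    latent, ids_keep=[2, 0], num_patches=4, mask_token=[0.0, 0.0]
)
```

Because the mask tokens only enter at this stage, the expensive encoder never processes them; only the lightweight decoder works on the full-length sequence.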

4）reconstruction target

a）The number of output channels of the decoder's last layer = the number of pixel values in a patch

each element in the output is a vector of pixel values representing a patch

b）The MSE loss is computed only on the masked patches

c）Normalizing the target pixels within each patch improves representation quality
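
The reconstruction target can be sketched as a small loss function. This is a minimal sketch under stated assumptions, not the reference implementation: the names `normalize_patch` and `masked_mse`, the `eps` value, and the use of population variance are choices made for the example. Each patch is a flat list of pixel values (for instance 16 x 16 x 3 = 768 values for an assumed 16x16 RGB patch), which matches the decoder output described in a).

```python
def normalize_patch(pixels, eps=1e-6):
    """Normalize one patch by its own mean and standard deviation."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [(p - mean) / (var + eps) ** 0.5 for p in pixels]


def masked_mse(pred, target, mask, norm_pix=True):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each a flat list of pixel values.
    mask: 1 for masked patches (contribute to the loss), 0 for visible.
    """
    total, num_masked = 0.0, 0
    for pred_patch, target_patch, m in zip(pred, target, mask):
        if m == 0:
            continue  # visible patches do not contribute to the loss
        if norm_pix:
            target_patch = normalize_patch(target_patch)
        sq_err = sum((a - b) ** 2 for a, b in zip(pred_patch, target_patch))
        total += sq_err / len(target_patch)
        num_masked += 1
    if num_masked == 0:
        raise ValueError("mask selects no patches")
    return total / num_masked


# Toy example with two 2-pixel patches.
target = [[0.0, 2.0], [5.0, 5.0]]
pred = [[-1.0, 1.0], [9.0, 9.0]]
loss_norm = masked_mse(pred, target, mask=[1, 0])                 # first patch only
loss_raw = masked_mse(pred, target, mask=[0, 1], norm_pix=False)  # second patch only
```

In the toy example the first prediction almost exactly matches the normalized first patch, so `loss_norm` is close to zero, while the second call compares raw pixel values on the other patch.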