2D Rotary Position Embedding
- Rotary Position Embedding for Vision Transformer https://arxiv.org/abs/2403.13298
- https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
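- Transformer Upgrade Path: 4. Rotary Position Embedding for 2D Positions https://kexue.fm/archives/8397
- RoPE: An Explanation of Relative Position Encoding and a Study of Its Extrapolation https://blog.csdn.net/weixin_43378396/article/details/138977299
- Su Jianlin's posts on Transformers and position encoding https://kexue.fm/search/Transformer%E5%8D%87%E7%BA%A7%E4%B9%8B%E8%B7%AF/
Development path: sinusoidal position encoding -> rotary position embedding (RoPE) -> 2D rotary position embedding -> some simple thoughts on multimodal position encoding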
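A minimal sketch of the two rotary steps in this path, assuming the axial scheme in which the first half of the features rotates with the x coordinate and the second half with the y coordinate. The names `rope_1d` and `rope_2d`, the base of 10000, and the half/half feature split are illustrative assumptions and are not taken from any of the linked implementations.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (shape [..., d], d even) by pos * theta_i."""
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(0, d, 2) / d)        # (d/2,) per-pair frequencies
    angles = np.asarray(pos)[..., None] * theta      # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin      # 2x2 rotation applied to each pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_2d(x, pos_x, pos_y):
    """Assumed axial 2D variant: first half encodes x, second half encodes y."""
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    assert d % 4 == 0
    half = d // 2
    return np.concatenate(
        [rope_1d(x[..., :half], pos_x), rope_1d(x[..., half:], pos_y)], axis=-1
    )

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same relative offset (dx=2, dy=-1) at two different absolute locations.
score_a = rope_2d(q, 3, 5) @ rope_2d(k, 1, 6)
score_b = rope_2d(q, 10, 2) @ rope_2d(k, 8, 3)
print(np.isclose(score_a, score_b))
```

The two scores agree because each pair is rotated by an angle proportional to its position, so the dot product depends only on the coordinate differences. That is the relative-position property that motivates moving from absolute sinusoidal encodings to rotary ones.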
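2D learnable absolute position embeddings
NaViT uses learnable absolute position embeddings for the x and y coordinates. From the paper: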
Factorized & fractional positional embeddings. To handle arbitrary resolutions and aspect ratios, we revisit the position embeddings. Given square images of resolution R×R, a vanilla ViT with patch size P learns 1-D positional embeddings of length (R/P)^2 (Dosovitskiy et al., 2021). Linearly interpolating these embeddings is necessary to train or evaluate at higher resolution R. Pix2struct (Lee et al., 2022) introduces learned 2D absolute positional embeddings, whereby positional embeddings of size [maxLen, maxLen] are learned, and indexed with (x, y) coordinates of each patch. This enables variable aspect ratios, with resolutions of up to R = P · maxLen. However, every combination of (x, y) coordinates must be seen during training. To support variable aspect ratios and readily extrapolate to unseen resolutions, …
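The contrast in the excerpt can be sketched as follows: a joint table indexed by (x, y) versus separate per-axis tables. This is only an illustration. The table sizes, the random arrays standing in for learned parameters, the helper names, and the choice to combine the x and y embeddings by addition are all assumptions made here, not details confirmed by the quoted text.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, dim = 16, 32  # hypothetical sizes

# Joint 2D table: one vector per (x, y) pair, so max_len * max_len entries.
# Random values stand in for learned parameters.
table_2d = rng.normal(size=(max_len, max_len, dim))

# Factorized tables: one vector per x and one per y, so 2 * max_len entries.
table_x = rng.normal(size=(max_len, dim))
table_y = rng.normal(size=(max_len, dim))

def patch_coords(h, w):
    """Return flattened (x, y) integer coordinates for an h-by-w patch grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return xs.reshape(-1), ys.reshape(-1)

def joint_pos_embed(h, w):
    xs, ys = patch_coords(h, w)
    return table_2d[xs, ys]              # (h*w, dim); each (x, y) cell has its own vector

def factorized_pos_embed(h, w):
    xs, ys = patch_coords(h, w)
    return table_x[xs] + table_y[ys]     # (h*w, dim); combination by sum is an assumption

wide = factorized_pos_embed(3, 10)       # 3 rows, 10 columns of patches
tall = joint_pos_embed(10, 3)            # 10 rows, 3 columns of patches
print(wide.shape, tall.shape)
```

With the joint table, a cell such as (x=9, y=9) only gets a useful vector if that exact pair appeared in training. With per-axis tables, x=9 and y=9 can each be learned from different images and then combined.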
2D sine-cosine absolute position encoding
When pos_embed_type is 'sin_cos_2d', open_clip uses a 2D sine-cosine absolute position encoding:
elif pos_embed_type == 'sin_cos_2d':
The 2D sine-cosine position encoding:
# --------------------------------------------------------
# 2D sine-cosine position embedding
# References:
# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
# MoCo v3: https://github.com/facebookresearch/moco-v3
# --------------------------------------------------------
import numpy as np

def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
    """
    grid_size: int of the grid height and width
    return:
    pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
    """
    grid_h = np.arange(grid_size, dtype=np.float32)
    grid_w = np.arange(grid_size, dtype=np.float32)
    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
    grid = np.stack(grid, axis=0)

    grid = grid.reshape([2, 1, grid_size, grid_size])
    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
    if cls_token:
        # prepend an all-zero row for the class token
        pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
    return pos_embed
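The helper `get_2d_sincos_pos_embed_from_grid` called above is not shown in these notes. The sketch below is one plausible version, assuming it follows the usual pattern: each of the two grid axes gets half of the channels, and each half is a 1D sine-cosine encoding with base 10000. The 1D helper's name and the channel layout (sines first, then cosines) are assumptions here and should be checked against the actual source.

```python
import numpy as np

def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
    """Encode positions pos (any shape, flattened to M) into an (M, embed_dim) array."""
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.0
    omega = 1.0 / 10000 ** omega                 # (embed_dim/2,) frequencies
    pos = pos.reshape(-1)                        # (M,)
    out = np.outer(pos, omega)                   # (M, embed_dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
    """Assumed layout: first half of channels from grid[0], second half from grid[1]."""
    assert embed_dim % 2 == 0
    emb_0 = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])
    emb_1 = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])
    return np.concatenate([emb_0, emb_1], axis=1)  # (H*W, embed_dim)

# Build the same kind of grid as get_2d_sincos_pos_embed does.
grid_size, embed_dim = 4, 16
axis = np.arange(grid_size, dtype=np.float32)
grid = np.stack(np.meshgrid(axis, axis), axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])

pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
print(pos_embed.shape)
```

Unlike the learned tables, this encoding has no trainable parameters, so it can be regenerated for any grid size without interpolation.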
Multimodal position encoding
- Transformer Upgrade Path: 17. Some Simple Thoughts on Multimodal Position Encoding https://kexue.fm/archives/10040