
Positional Encoding (3): 2D Positional Encoding (2D Rotary Position Embedding, etc.)

2024/12/24 21:36:26  Source: https://blog.csdn.net/duoyasong5907/article/details/139271883

2D Rotary Position Embedding

  • Rotary Position Embedding for Vision Transformer https://arxiv.org/abs/2403.13298
  • https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
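  • The Road to Upgrading Transformers: 4. Rotary Position Embedding for 2D Positions https://kexue.fm/archives/8397
  • RoPE: An Explanation of Relative Position Encoding and a Study of Its Extrapolation https://blog.csdn.net/weixin_43378396/article/details/138977299
  • Su Jianlin's posts on Transformers and positional encoding https://kexue.fm/search/Transformer%E5%8D%87%E7%BA%A7%E4%B9%8B%E8%B7%AF/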

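Development path: sinusoidal position encoding -> rotary position embedding (RoPE) -> 2D rotary position embedding -> some simple thoughts on multimodal position encoding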
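
To make the 2D rotary step in this path concrete, the following is a minimal sketch of an axial 2D rotary embedding: the channel dimension is split in half, one half is rotated by the x coordinate and the other half by the y coordinate. The function names rope_1d and rope_2d_axial, the base of 10000, and the pairing of adjacent channels are assumptions made for illustration; they are not taken from the references above, which also discuss other variants.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate adjacent channel pairs of x by angles pos * theta_i.

    x:   [n, d] array with d even
    pos: [n] array of scalar positions
    (hypothetical helper; base=10000 is an assumed default)
    """
    n, d = x.shape
    assert d % 2 == 0
    # One frequency per channel pair.
    theta = base ** (-np.arange(0, d, 2, dtype=np.float64) / d)  # [d/2]
    ang = pos[:, None] * theta[None, :]                           # [n, d/2]
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def rope_2d_axial(x, pos_x, pos_y, base=10000.0):
    """Axial 2D rotary embedding (illustrative sketch).

    The first half of the channels is rotated by the x coordinate,
    the second half by the y coordinate.
    """
    n, d = x.shape
    assert d % 4 == 0
    half = d // 2
    return np.concatenate(
        [rope_1d(x[:, :half], pos_x, base), rope_1d(x[:, half:], pos_y, base)],
        axis=1,
    )
```

Because each half is an ordinary 1D rotation, the dot product between a rotated query and a rotated key depends only on the offset (dx, dy) between their positions, which is the relative-position property that RoPE is built around.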

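2D Learnable Absolute Position Embedding

NaViT uses learnable absolute position embeddings for the x and y coordinates. The paper describes them as follows: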

Factorized & fractional positional embeddings. To handle arbitrary resolutions and aspect ratios, we revisit the position embeddings. Given square images of resolution R×R, a vanilla ViT with patch size P learns 1-D positional embeddings of length (R/P)^2 (Dosovitskiy et al., 2021). Linearly interpolating these embeddings is necessary to train or evaluate at higher resolution R. Pix2struct (Lee et al., 2022) introduces learned 2D absolute positional embeddings, whereby positional embeddings of size [maxLen, maxLen] are learned, and indexed with (x, y) coordinates of each patch. This enables variable aspect ratios, with resolutions of up to R = P · maxLen. However, every combination of (x, y) coordinates must be seen during training. To support variable aspect ratios and readily extrapolate to unseen resolutions, …
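
The contrast drawn in this passage can be sketched roughly as follows. A joint table stores one vector per (x, y) pair, so every pair has to be seen during training, while a factorized scheme keeps one table per axis and combines the two lookups. The class names, the random initialization that stands in for trained parameters, and in particular the use of a sum to combine the x and y vectors are assumptions for illustration, since the quoted passage is cut off before describing the combination.

```python
import numpy as np


class JointPosEmbed2D:
    """One vector per (x, y) pair: a [max_len, max_len, embed_dim] table.

    Hypothetical class; random values stand in for learned parameters.
    """

    def __init__(self, max_len, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 0.02, size=(max_len, max_len, embed_dim))

    def __call__(self, xs, ys):
        # xs, ys: integer arrays of patch coordinates, shape [n]
        return self.table[xs, ys]


class FactorizedPosEmbed2D:
    """Separate tables for x and y, combined by summation (an assumption).

    Hypothetical class; random values stand in for learned parameters.
    """

    def __init__(self, max_len, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.emb_x = rng.normal(0.0, 0.02, size=(max_len, embed_dim))
        self.emb_y = rng.normal(0.0, 0.02, size=(max_len, embed_dim))

    def __call__(self, xs, ys):
        return self.emb_x[xs] + self.emb_y[ys]


# Patch coordinates of a grid that is 3 patches wide and 2 patches tall.
xs = np.array([0, 1, 2, 0, 1, 2])
ys = np.array([0, 0, 0, 1, 1, 1])

joint = JointPosEmbed2D(max_len=16, embed_dim=8)
factorized = FactorizedPosEmbed2D(max_len=16, embed_dim=8)
pos_joint = joint(xs, ys)            # [6, 8]
pos_factorized = factorized(xs, ys)  # [6, 8]
```

In this sketch the joint table holds max_len * max_len * embed_dim values, whereas the factorized version holds 2 * max_len * embed_dim, and a coordinate pair that never appeared together can still be embedded as long as each coordinate was seen on its own axis.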

2D Sine-Cosine Absolute Position Embedding

open_clip uses a 2D sine-cosine absolute position embedding when pos_embed_type is 'sin_cos_2d':

 elif pos_embed_type == 'sin_cos_2d':

The 2D sine-cosine position embedding:

# --------------------------------------------------------
# 2D sine-cosine position embedding
# References:
# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
# MoCo v3: https://github.com/facebookresearch/moco-v3
# --------------------------------------------------------
import numpy as np


def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
    """
    grid_size: int of the grid height and width
    return:
    pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
    """
    grid_h = np.arange(grid_size, dtype=np.float32)
    grid_w = np.arange(grid_size, dtype=np.float32)
    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
    grid = np.stack(grid, axis=0)

    grid = grid.reshape([2, 1, grid_size, grid_size])
    # get_2d_sincos_pos_embed_from_grid is a helper that is not shown in this excerpt
    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
    if cls_token:
        # the class token gets an all-zero position embedding, prepended as row 0
        pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
    return pos_embed
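
The function above calls get_2d_sincos_pos_embed_from_grid, which is not included in this excerpt. The following is a sketch of how such a helper is commonly written: each axis gets a 1D sine-cosine embedding of half the width, and the two halves are concatenated. The 1D helper, the base of 10000, and the sin-then-cos channel layout are assumptions here and may differ from the code actually used in open_clip.

```python
import numpy as np


def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
    """1D sine-cosine embedding (assumed layout: all sines, then all cosines).

    embed_dim: output dimension per position, must be even
    pos:       array of positions, any shape; flattened to [M]
    returns:   [M, embed_dim]
    """
    assert embed_dim % 2 == 0
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.0
    omega = 1.0 / 10000 ** omega            # [embed_dim/2] frequencies
    pos = pos.reshape(-1)                   # [M]
    out = np.einsum('m,d->md', pos, omega)  # outer product, [M, embed_dim/2]
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)


def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
    """2D sine-cosine embedding from a [2, 1, H, W] coordinate grid.

    The first half of the channels encodes grid[0], the second half grid[1].
    """
    assert embed_dim % 2 == 0
    emb_0 = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])
    emb_1 = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])
    return np.concatenate([emb_0, emb_1], axis=1)  # [H*W, embed_dim]


# Usage: build the same kind of grid as in the function above for a 3x3 layout.
grid_size, embed_dim = 3, 8
coords = np.arange(grid_size, dtype=np.float32)
grid = np.stack(np.meshgrid(coords, coords), axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])
pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)  # [9, 8]
```

Unlike the learnable tables in the previous section, this embedding has no trained parameters, so it can be regenerated for any grid size at inference time.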

Multimodal Position Encoding

  • The Road to Upgrading Transformers: 17. Some Simple Thoughts on Multimodal Position Encoding https://kexue.fm/archives/10040
