Criteria:
Code must be available (and not too difficult to follow).
At least 30 stars on Papers with Code.
Avoid, as far as possible, papers that require doing the pretraining ourselves (CNN-RNN architecture).
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
https://paperswithcode.com/paper/show-attend-and-tell-neural-image-caption

DenseCap: Fully Convolutional Localization Networks for Dense Captioning
https://arxiv.org/abs/1511.07571

Attention on Attention for Image Captioning
