HAN (Korean Heritage Augmented Narrative Visual-Language Description Dataset) is a new large-scale visual-language dataset focused on Korea's rich cultural heritage. This dataset comprises images paired with detailed, narrative captions in both Korean and English.
The HAN dataset aims to support research objectives such as:
- Developing visual-language models with a deep understanding of Korean cultural heritage.
- Enhancing the ability to generate narrative and contextual descriptions beyond simple image tagging.
- Advancing multilingual (Korean/English) visual-language processing capabilities.
- Promoting research in culturally specialized AI models.
Dataset Statistics:
- Total Images: 41,000
- Total Captions: 410,000 (10 captions per image)
- Korean Captions: 205,000
- English Captions: 205,000
- Main Content: Images and narrative descriptions related to Korean architecture, artifacts, artworks, traditional events, natural landscapes, and other aspects of cultural heritage.
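The statistics above imply a fixed number of captions per image, and a short Python sketch can make that arithmetic explicit. The counts are taken from the list above; everything else is an assumption for illustration. In particular, `HanRecord` and its field names are hypothetical and are not the official HAN schema, and the even split of 5 Korean and 5 English captions per image holds only if the captions are distributed uniformly across images, which the figures suggest but do not state outright.

```python
from dataclasses import dataclass, field

# Figures stated in the dataset statistics.
TOTAL_IMAGES = 41_000
TOTAL_CAPTIONS = 410_000
KOREAN_CAPTIONS = 205_000
ENGLISH_CAPTIONS = 205_000


@dataclass
class HanRecord:
    """Hypothetical in-memory record; NOT the official HAN schema."""

    image_path: str
    captions_ko: list[str] = field(default_factory=list)
    captions_en: list[str] = field(default_factory=list)

    def caption_count(self) -> int:
        """Total number of captions attached to this image."""
        return len(self.captions_ko) + len(self.captions_en)


def captions_per_image(total_captions: int, total_images: int) -> int:
    """Return the per-image caption count, requiring an exact division."""
    per_image, remainder = divmod(total_captions, total_images)
    if remainder:
        raise ValueError("caption total is not a multiple of the image total")
    return per_image


if __name__ == "__main__":
    # The two language totals should add up to the overall total.
    assert KOREAN_CAPTIONS + ENGLISH_CAPTIONS == TOTAL_CAPTIONS

    print(captions_per_image(TOTAL_CAPTIONS, TOTAL_IMAGES))    # 10
    # Assumes a uniform split across images (not stated in the source).
    print(captions_per_image(KOREAN_CAPTIONS, TOTAL_IMAGES))   # 5
    print(captions_per_image(ENGLISH_CAPTIONS, TOTAL_IMAGES))  # 5
```

A check like `record.caption_count() == 10` could then serve as a sanity test after loading, provided the released files really follow this per-image layout.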
Thank you for your interest in our HAN dataset.
- Full Dataset Release: Following the paper notification, the full HAN dataset, including all 41,000 images and 410,000 Korean/English captions, has been officially released.
→ https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71866
This dataset is available under the terms of the KOGL (Korea Open Government License) Type 2 (Attribution + Non-Commercial).
(KOGL Type 2 is similar to the Creative Commons BY-NC license.)
Key Conditions:
- Attribution (BY): You must give appropriate credit by indicating the source of the work (e.g., the author or public institution). This is the base condition inherited from Type 1.
- Non-Commercial (NC): You may not use the material for commercial purposes.
- Freedom to Use and Adapt: As long as you comply with the two conditions above, you are free to use the public work without separate permission and to adapt it to create derivative works.
KOGL Type 2 Description:
Type 2: Type 1 + Commercial Use Prohibition
Users may use the public work free of charge and may modify it to create derivative works, but commercial use is not permitted.
This means that in addition to the KOGL Type 1 (Attribution) condition, commercial use is prohibited.
For more details, please refer to the links below:
- Detailed View of KOGL Type 2 License (Official Site - in Korean)
- KOGL License Type Guide (Ministry of Culture, Sports and Tourism - in Korean)
If you use the HAN dataset in your research, please cite our paper (details will be updated upon publication):
@inproceedings{moon2025han,
author = {SungHyun Moon and Aidyn Zhakatayev and SeungJae Lee},
title = {HAN: Korean Heritage Augmented Narrative Visual-Language Description Dataset},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
year = {2025},
month = oct,
location = {Dublin, Ireland},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
% pages = {XX--XX},
numpages = {8},
doi = {10.1145/3746027.3758229},
isbn = {979-8-4007-2035-2},
url = {https://doi.org/10.1145/3746027.3758229},
series = {MM '25}
}