
HAN is a large-scale Korean–English visual-language dataset of Korean cultural heritage: 41k images with rich narrative captions. Built for retrieval, captioning, and cross-lingual VLM research.


dnotitia/HAN

HAN: Korean Heritage Augmented Narrative Visual-Language Description Dataset

KOGL Type 2: Source Indication + Commercial Use Prohibition


🌟 About HAN

HAN (Korean Heritage Augmented Narrative Visual-Language Description Dataset) is a large-scale visual-language dataset focused on Korea's cultural heritage. It comprises images paired with detailed narrative captions in both Korean and English.

The HAN dataset aims to support research objectives such as:

  • Developing visual-language models with a deep understanding of Korean cultural heritage.
  • Enhancing the ability to generate narrative and contextual descriptions beyond simple image tagging.
  • Advancing multilingual (Korean/English) visual-language processing capabilities.
  • Promoting research in culturally specialized AI models.

📊 Dataset Overview

  • Total Images: 41,000
  • Total Captions: 410,000 (10 captions per image)
    • Korean Captions: 205,000
    • English Captions: 205,000
  • Main Content: Images and narrative descriptions related to Korean architecture, artifacts, artworks, traditional events, natural landscapes, and other aspects of cultural heritage.
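The figures above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the official release: the `HeritageRecord` class and its field names are hypothetical (the README does not specify the on-disk schema), and the per-language values are averages derived from the totals rather than a documented guarantee for every image.

```python
from dataclasses import dataclass, field

# Figures quoted in the Dataset Overview.
TOTAL_IMAGES = 41_000
CAPTIONS_PER_IMAGE = 10
KOREAN_CAPTIONS = 205_000
ENGLISH_CAPTIONS = 205_000


@dataclass
class HeritageRecord:
    """Hypothetical in-memory record; NOT the official HAN schema."""

    image_path: str
    captions_ko: list[str] = field(default_factory=list)
    captions_en: list[str] = field(default_factory=list)

    def caption_count(self) -> int:
        # Total captions attached to this image across both languages.
        return len(self.captions_ko) + len(self.captions_en)


def check_overview_totals() -> dict:
    """Verify that the published totals are mutually consistent."""
    total = TOTAL_IMAGES * CAPTIONS_PER_IMAGE
    # The per-language counts should add up to the overall total.
    assert total == KOREAN_CAPTIONS + ENGLISH_CAPTIONS
    return {
        "total_captions": total,
        "avg_korean_per_image": KOREAN_CAPTIONS / TOTAL_IMAGES,
        "avg_english_per_image": ENGLISH_CAPTIONS / TOTAL_IMAGES,
    }


if __name__ == "__main__":
    print(check_overview_totals())
```

Once the real files are downloaded, a loader could populate records of this shape and compare `caption_count()` against the expected 10 captions per image.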

📢 Important Notice: Data Release

Thank you for your interest in our HAN dataset.

  • Full Dataset Release:
    • Following the paper notification, the full HAN dataset, including all 41,000 images and 410,000 Korean/English captions, has now been officially released.

💾 Full dataset access (official): Download from AI Hub

https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=71866


📜 License

This dataset is available under the terms of the KOGL (Korea Open Government License) Type 2 (Attribution + Non-Commercial).

(KOGL Type 2 is similar to the Creative Commons BY-NC license.)

Key Conditions:

  • Attribution (BY): You must give appropriate credit by indicating the source of the work (e.g., author or public institution). This is the base condition inherited from Type 1.
  • Non-Commercial (NC): You may not use the material for commercial purposes.
  • Freedom to Use and Adapt: As long as you comply with the two conditions above, you are free to use the public work without separate permission and to adapt it to create derivative works.

KOGL Type 2 Description:

Type 2: Type 1 + Commercial Use Prohibition

The user can freely use the public work without a fee and can modify it to create secondary works, but use for commercial purposes is not permitted.

This means that in addition to the KOGL Type 1 (Attribution) condition, commercial use is prohibited.

For more details, please refer to the links below:


📜 Citation

If you use the HAN dataset in your research, please cite our paper (details will be updated upon publication):

@inproceedings{moon2025han,
  author    = {SungHyun Moon and Aidyn Zhakatayev and SeungJae Lee},
  title     = {HAN: Korean Heritage Augmented Narrative Visual-Language Description Dataset},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  month     = oct,
  location  = {Dublin, Ireland},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  numpages  = {8},
  doi       = {10.1145/3746027.3758229},
  isbn      = {979-8-4007-2035-2},
  url       = {https://doi.org/10.1145/3746027.3758229},
  series    = {MM '25}
}
