Hello, author! I noticed that some models, such as BLIP-2, can generate captions for the source videos. However, the generated captions are usually quite long, whereas the editing prompts used in this paper are typically short and focus only on the subject and its motion. How are these concise editing prompts generated?
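
For reference, here is roughly how I currently caption a frame with BLIP-2 (a minimal sketch using the HuggingFace `transformers` API; the checkpoint name, file path, and generation settings are my own choices, not from the paper):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a BLIP-2 checkpoint (this particular checkpoint is just my choice).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Caption a single frame extracted from the source video.
frame = Image.open("first_frame.png")  # placeholder path from my own setup
inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)

# Even with max_new_tokens capped, the output tends to be a full descriptive
# sentence rather than a short "subject + motion" phrase like the paper's prompts.
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```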