Prompt for the source video: A cat is walking on the floor at a room
Prompt for each reference image: a photo of <cat>
Prompt for the synthesized video: A <cat> is walking on the floor at a room