Generating a Short Video of the Brief History of Humankind with Stable Diffusion

Desktop Operations 2023-07-12 张二河

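Besides text-to-image, Stable Diffusion has another capability: generating video. It works by interpolating between two images to fill in the frames in between. In the previous article I showed how to do text-to-image with a pipeline. A quick recap: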

#install the diffusers package
!pip install --upgrade diffusers transformers scipy accelerate

#load the model from stable-diffusion model card 
import torch 
from diffusers import StableDiffusionPipeline 

model_id = "CompVis/stable-diffusion-v1-4" 
device = "cuda" 
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) 
pipe = pipe.to(device)

images = pipe("a cat").images  
images 

#show a single result 
images[0]

Generating a video works in much the same way:

!pip install -U stable_diffusion_videos

#Making Videos 
from stable_diffusion_videos import StableDiffusionWalkPipeline 
import torch 
#"CompVis/stable-diffusion-v1-4" for 1.4 

pipeline = StableDiffusionWalkPipeline.from_pretrained( 
    "runwayml/stable-diffusion-v1-5", 
    torch_dtype=torch.float16, 
    revision="fp16", 
).to("cuda") 

#Generate the video Prompts 1
# The same prompt is used for every keyframe; only the seed changes.
prompt = (
    'environment living room interior, mid century modern, indoor garden with fountain, '
    'retro, vintage, designer furniture made of wood and plastic, concrete table, '
    'wood walls, indoor potted tree, large window, outdoor forest landscape, '
    'beautiful sunset, cinematic, concept art, sustainable architecture, octane render, '
    'utopia, ethereal, cinematic light, –ar 16:9 –stylize 45000'
)
seeds = [42, 333, 444, 555]
video_path = pipeline.walk(
    prompts=[prompt] * len(seeds),  # one prompt per seed
    seeds=seeds,
    num_interpolation_steps=25, # number of interpolated frames per image
    #height=1280,  # use multiples of 64 if > 512. Multiples of 8 if < 512.
    output_dir='dreams',        # Where images/videos will be saved
    name='imagine',        # Subdirectory of output_dir where images/videos will be saved
    guidance_scale=8.5,         # Higher adheres to prompt more, lower lets model take the wheel
    num_inference_steps=50,     # Number of diffusion steps per image generated. 50 is good default
)

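Key points:

prompts: the list of prompts; a list of images is generated from the prompts given
seeds: the random seed for each generated image, one per prompt. Any values work; a fixed seed yields the same image on every run, which makes results reproducible
num_interpolation_steps: the number of interpolated frames generated for each image. With 25 frames, that is roughly one second of video
output_dir and name: joined together, they give the directory where images and videos are written: output_dir/name_%5d
num_inference_steps: the number of Stable Diffusion inference steps per image, which determines the quality of the generated images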
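To make the interpolation idea concrete, here is a minimal sketch of spherical linear interpolation (slerp), a technique commonly used to blend the starting noise of two diffusion images. Whether stable_diffusion_videos does exactly this internally is an assumption on my part; the function names `slerp` and `interpolation_frames` are hypothetical, and plain Python lists stand in for the latent tensors.

```python
import math

def slerp(t, v0, v1, eps=1e-7):
    """Spherically interpolate between vectors v0 and v1 for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

def interpolation_frames(latent_a, latent_b, num_interpolation_steps):
    """Return num_interpolation_steps vectors going from latent_a to latent_b."""
    if num_interpolation_steps < 2:
        raise ValueError("need at least 2 steps to include both endpoints")
    last = num_interpolation_steps - 1
    return [slerp(i / last, latent_a, latent_b)
            for i in range(num_interpolation_steps)]

# Toy example: 25 in-between vectors for two 2-dimensional "latents".
frames = interpolation_frames([1.0, 0.0], [0.0, 1.0], 25)
print(len(frames), frames[12])
```

In a real pipeline each interpolated latent would then be denoised into one video frame, which is why a larger num_interpolation_steps gives a smoother but slower result.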

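Building on this basic video-generation capability, let's make a short video of the brief history of humankind. The key is how the prompts are written: each prompt must describe a scene so that SD can render something concrete, and consecutive scenes should be related so that the interpolated video looks coherent.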

For example:

prompt=HD photo of a large amount of spiral galaxies

[Image: Generating a short video of the brief history of humankind with Stable Diffusion]

How do we come up with the major historical events along the timeline? ChatGPT can generate them. A rough outline:

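About 13.8 billion years ago: the Big Bang
About 4.6 billion years ago: the Earth forms
About 4.1 billion years ago: the Earth cools and a solid core forms
About 3.5 billion years ago: the earliest life forms appear on Earth
About 2.3 billion years ago: oxygen begins to appear on Earth
About 250 million years ago: the earliest mammal groups appear on Earth
About 200 million years ago: dinosaurs begin to flourish
About 65 million years ago: the dinosaurs go extinct
About 6.5 million years ago: Australopithecus appears
About 4 million years ago: the first early human, "Herbert ape-man" (赫伯特猿人), appears
About 3 million years ago: the human ancestor Lucy appears
About 2 million years ago: early humans (such as Homo erectus and archaic humans) begin living in Africa
About 700,000 years ago: humans in Africa first begin to use fire
About 500,000 to 35,000 years ago: later humans (such as Neanderthals and Homo sapiens) begin spreading across the world
4000 BC: humans enter the age of agricultural civilization
3000 BC: ancient Egypt, ancient India, ancient China and other civilizations begin to develop
753 BC: the city of Rome is founded, beginning the history of ancient Rome
1st century AD: Christianity is born in ancient Palestine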
7th century AD: Islam is born in Arabia
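1096: the First Crusade begins
1455: the printing press is invented in Europe
1492: Columbus reaches the Americas
1517: Martin Luther launches the Protestant Reformation
1789: the French Revolution breaks out
1861: the American Civil War breaks out
1869: the Suez Canal opens to navigation
1898: the Spanish-American War breaks out
1914: the First World War breaks out
1917: the October Revolution in Russia; the Soviet regime is established
1929: the global economic crash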
1939: the Second World War breaks out worldwide
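1945: the Second World War ends
1949: the People's Republic of China is founded
1957: the Soviet Union launches the world's first satellite
1961: Yuri Gagarin becomes the first person to travel into space
1969: Apollo 11 completes the first crewed Moon landing
1989: the Berlin Wall falls and the Cold War ends
2001: the 9/11 terrorist attacks
2003: the United States begins the war in Iraq
2020: the COVID-19 outbreak becomes a global pandemic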

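This outline is still too coarse. It works as an overall plan for the output, but each important moment needs to be described again as a concrete scene. This one, for example, came out quite well: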

[Image: Generating a short video of the brief history of humankind with Stable Diffusion]
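One way to keep neighbouring scenes visually related is to append the same style description to every scene and to derive the seeds from a single master seed, so the whole run can be reproduced. The sketch below is only an illustration of that idea: `STYLE_SUFFIX`, `build_prompts` and `make_seeds` are hypothetical names, and the suffix text is an assumption, not something used for the video in this article.

```python
import random

# Hypothetical shared style; any consistent description would do.
STYLE_SUFFIX = "highly detailed, cinematic lighting"

def build_prompts(scenes, style_suffix=STYLE_SUFFIX):
    """Turn short scene descriptions into prompts that share one style."""
    return [f"{scene.strip()}, {style_suffix}" for scene in scenes]

def make_seeds(count, master_seed=0, upper=10_000):
    """Derive `count` reproducible seeds from a single master seed."""
    rng = random.Random(master_seed)
    return [rng.randrange(upper) for _ in range(count)]

scenes = [
    "HD photo of a large amount of spiral galaxies",
    "hydrothermal vents at the bottom of the ocean",
    "bacteria under a microscope",
]
prompts = build_prompts(scenes)
seeds = make_seeds(len(prompts))
for prompt, seed in zip(prompts, seeds):
    print(seed, prompt)
```

Because the seeds come from a seeded generator, running the cell again yields the same list, which matches the reproducibility point made about seeds earlier.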

Flesh out the outline above a little more and you can produce your own video:

!pip install diffusers transformers scipy accelerate
!pip install stable_diffusion_videos

#from huggingface_hub import notebook_login
#notebook_login() 

import torch
from stable_diffusion_videos import StableDiffusionWalkPipeline

# Load the pipeline once; "CompVis/stable-diffusion-v1-4" can be used instead for v1.4
pipeline = StableDiffusionWalkPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    revision="fp16",
).to("cuda")

prompts=["the beginning of the universe,nothing but darkness",
 "special effects render of The Big Bang,rapidly expand",
 "HD photo of a large amount of spiral galaxies",
 "solar system,earth around the sun,sun around the galaxy",
 "the Hadean earth was bombarded with asteroids and massive volcanic eruptions",
 "panoramic view of earth with ocean surrounding newly formed land and volcanos",
 "hydrothermal vents at the bottom of the ocean",
 "bacteria under a microscope",
 "ammonites floating in the ocean",
 "the first reptile to leave the ocean and crawl onto the land",
 # ......
]

seeds=[3764,1537,6573,1791,9973,736,3639,3559,4724,3359,
       # ......
]

video_path = pipeline.walk(
    prompts=prompts,
    seeds=seeds,
    num_interpolation_steps=25,
    height=512,  # use multiples of 64 if > 512. Multiples of 8 if < 512.
    output_dir='dreams',        # Where images/videos will be saved
    name='human_history',        # Subdirectory of output_dir where images/videos will be saved
    guidance_scale=8.5,         # Higher adheres to prompt more, lower lets model take the wheel
    num_inference_steps=50,     # Number of diffusion steps per image generated. 50 is good default
)
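With dozens of prompts it is easy for the two lists to drift out of sync, and a mismatch only shows up after the model has been loaded. A small check before calling walk can catch this early. This is a sketch under the assumption that walk expects exactly one seed per prompt; `check_walk_inputs` is a hypothetical helper, not part of stable_diffusion_videos.

```python
def check_walk_inputs(prompts, seeds):
    """Raise ValueError if the prompt and seed lists cannot be paired up."""
    if len(prompts) < 2:
        raise ValueError("need at least two prompts to interpolate between")
    if len(prompts) != len(seeds):
        raise ValueError(
            f"{len(prompts)} prompts but {len(seeds)} seeds; "
            "each prompt needs exactly one seed"
        )
    for index, prompt in enumerate(prompts):
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"prompt at index {index} is empty or not a string")
    for index, seed in enumerate(seeds):
        if not isinstance(seed, int):
            raise ValueError(f"seed at index {index} is not an integer")
    return len(prompts)

# Example with two short placeholder lists.
count = check_walk_inputs(
    ["bacteria under a microscope", "ammonites floating in the ocean"],
    [4724, 3359],
)
print(f"{count} prompt/seed pairs look consistent")
```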

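This produces a lot of files and directories, videos in particular: in my case, 40 prompts produced 40*25=1000 images and 40 one-second clips. These small clips now need to be merged into the final video. I can do this with ffmpeg-based tooling; first install the dependencies: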

!pip install "moviepy<2"  # the code below uses the moviepy.editor API from moviepy 1.x
!pip install imageio-ffmpeg

Then pick out the mp4 video files in the directory and join them in the order in which they were generated:

# moviepy is the main library needed here
from moviepy.editor import VideoFileClip, concatenate_videoclips
import os

# Lists for the video paths and the loaded clips
videoFiles = []
videos = []

# Walk the video folder (assuming all videos are stored under it)
for root, dirs, files in os.walk("/content/drive/MyDrive/gpt/human/"):
    # Sort by file name
    files.sort()
    # Visit every file
    for file in files:
        # Keep files whose extension is .mp4
        if os.path.splitext(file)[1] == '.mp4':
            # Build the full path
            filePath = os.path.join(root, file)
            videoFiles.append(filePath)

# Sort the full paths so the clips are joined in order
videoFiles.sort()
print(",\n".join(str(x) for x in videoFiles))
for videoFile in videoFiles:
    # Load the clip
    video = VideoFileClip(videoFile)
    # Add it to the list
    videos.append(video)

# Concatenate the clips
final_clip = concatenate_videoclips(videos)

# Write the final video file
final_clip.write_videofile("./target.mp4", fps=25, remove_temp=False)
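The merge relies on a plain string sort of the paths, which only matches generation order when the numbers in the names are zero-padded. If they are not, a name ending in 10 sorts before one ending in 2. A natural-sort key avoids that. This is a minimal sketch: `natural_key` is a hypothetical helper, and the file names in the example are made up for illustration.

```python
import re

def natural_key(path):
    """Split a path into text and number chunks so 2 sorts before 10."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path)]

# Made-up file names, only to show the difference in ordering.
names = ["clip_10.mp4", "clip_2.mp4", "clip_1.mp4"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric-aware order
```

With this helper, `videoFiles.sort(key=natural_key)` would replace the plain `videoFiles.sort()` call.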

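Export the video, then polish it and add sound in a video editing tool. This part could also be done in code, but the music has to be prepared in advance and the work is done in a video editor, so I won't cover it here.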

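Here is the result; it is passable at best. There is still plenty of room for improvement: the quality of the generated images is not yet good enough, and in particular the model has seen too little China-related training data, so images of classic Chinese scenes are hard to get right.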