Video LLM Summary
Summarizing video transcripts with an LLM
Tools Used:

- `ffmpeg`
- `uv` / `uvx`
- `mlx_whisper`
- `llm`
- `pdftotext`
- `shot-scraper`
- `bat`
*Big thanks to @simonw for an unending flow of useful tools that make life easier every day.*
Today I had the task of reviewing a series of video files and comparing them to a legal filing (which came as one large PDF). The videos were longform interviews, so to speed up the review I wanted each one available in whichever format suited the moment: video, audio, and text.
First, in the directory with the videos, I converted the `.webm` files to `.mp3` with the following:

```bash
for file in *.webm; do
    ffmpeg -i "$file" "${file%.webm}.mp3"
done
```
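(Side note, and an assumption about the source files: `.webm` downloads usually carry Opus audio, so if you'd rather skip the lossy re-encode to mp3 you can copy the audio stream out directly; most Whisper front ends load audio through ffmpeg anyway, so they generally accept it.)

```bash
# Copy the existing audio stream without re-encoding, assuming it's Opus
for file in *.webm; do
    ffmpeg -i "$file" -vn -c:a copy "${file%.webm}.opus"
done
```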
I then wanted the transcript of each video so I could pipe the results into an LLM for analysis using @simonw’s `llm` tool. Sure, you could use `yt-dlp` or similar to grab the YouTube-generated transcript, but those usually aren’t great. Plus, I wanted to try out some of the work from the MLX Community: `mlx_whisper` does a really fast job of this on my MacBook Pro. Since Simon is always beaming about `uv` and its companion `uvx`, I gave it a go to get whisper into my CLI.
I ran the following for each `.mp3` file in the directory:

```bash
uvx --from mlx-whisper mlx_whisper video1.mp3
```
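Rather than typing that per file, a simple loop covers the whole directory (same command, just wrapped; output lands next to the inputs as before):

```bash
# Transcribe every mp3 in the directory with mlx_whisper via uvx
for file in *.mp3; do
    uvx --from mlx-whisper mlx_whisper "$file"
done
```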
This left me with a bunch of `.txt` files at the same path as the `.mp3`s. Each video now existed in three formats:

```
video1.webm
video1.mp3
video1.txt
```
This was a quick way to get things going, but ultimately I wanted to use an LLM for the analysis, and I also had the legal filing to compare everything against. To get the PDF into a workable format I did a simple:

```bash
pdftotext legal_filing.pdf
```

That gave me a `legal_filing.txt` file to work with.
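(If a filing’s columns or line numbering come out scrambled, `pdftotext` also has a `-layout` flag that preserves the physical page layout; whether that helps depends on the PDF.)

```bash
# Preserve the original page layout during extraction
pdftotext -layout legal_filing.pdf legal_filing.txt
```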
Now I could pipe these into `llm` however I needed. Ultimately I went with something like the following (note this uses fish syntax):

```fish
echo -en "$(cat legal_filing.txt) \n\n##### START OF TRANSCRIPTS ####\n\n $(cat video*.txt)" \
| llm -s "You will be provided the text of a legal document, as well as a series of transcripts from related interviews. Provide an analysis and comparison." \
| tee analysis_all.md \
| bat -l markdown
```
The `-s` flag specifies the system prompt for the model (in this case, a very generic one); the ‘user’ message is the `legal_filing.txt` document, followed by a custom delimiter I added, then the entire contents of all the video `.txt` transcripts in the directory. I `tee` it so I can review the results as they’re generated while also saving them to a file. `bat` is a nice bonus for some aesthetic formatting in the terminal.
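If you want a per-interview breakdown instead of one combined analysis, the same pieces work in a loop (a bash sketch, assuming the filenames above; the prompt wording is just an example):

```bash
# Compare the filing against each transcript individually
for t in video*.txt; do
    echo -en "$(cat legal_filing.txt) \n\n##### START OF TRANSCRIPT ####\n\n $(cat "$t")" \
      | llm -s "You will be provided the text of a legal document and one interview transcript. Provide an analysis and comparison." \
      | tee "${t%.txt}_analysis.md"
done
```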
Finally, I also needed to crawl a related website for background information. I used Claude to whip up the following script:
```python
# spider.py
import sys

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque


def spider_website(start_url):
    # Parse the domain from the start URL
    domain = urlparse(start_url).netloc

    # Initialize our queue and set of discovered URLs
    queue = deque([start_url])
    discovered_urls = {start_url}

    while queue:
        current_url = queue.popleft()
        # Progress goes to stderr so stdout stays a clean list of URLs for piping
        print(f"Crawling: {current_url}", file=sys.stderr)

        try:
            # Get the webpage content
            response = requests.get(current_url, timeout=5, verify=False)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all links on the page
            for link in soup.find_all('a'):
                href = link.get('href')
                if not href:
                    continue

                # Convert relative URLs to absolute URLs
                full_url = urljoin(current_url, href)

                # Only process URLs from the same domain that we haven't seen before
                if (urlparse(full_url).netloc == domain and
                        full_url not in discovered_urls):
                    queue.append(full_url)
                    discovered_urls.add(full_url)
        except Exception as e:
            print(f"Error crawling {current_url}: {e}", file=sys.stderr)

    return discovered_urls


# Usage
urls = spider_website("https://website.com")
unique_urls = set(x.split("#")[0] for x in urls)  # Remove anchors
print("\n".join(unique_urls))
```
This gave me a nice list of unique URLs to download. I used `shot-scraper` to do just that and save each page as its own PDF:

```bash
python spider.py | xargs -I{} shot-scraper pdf {}
```
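From there, the scraped pages can be fed through the same `pdftotext` → `llm` path as the filing (a sketch; the exact filenames depend on what `shot-scraper` writes out):

```bash
# Convert every scraped PDF to text for the LLM step
for f in *.pdf; do
    pdftotext "$f"
done
```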