Semantic Aware Video Filter / SEE – Significant Embedding Explanation
CLIP allows us to encode a picture into a dense, information-rich vector. With a relatively stable video feed, such as a live feed from a stationary camera, we can compute an average CLIP-space embedding of the scene. Then, over time, we can track how this embedding changes. A change in CLIP space represents a change in semantic meaning: perhaps a bird has flown into the scene, or perhaps the weather has suddenly changed.
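As a rough sketch of that drift check (assuming the Hugging Face transformers API and an arbitrary drift threshold that would need per-stream tuning), the running average and cosine-distance comparison might look like:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frame(frame: Image.Image) -> torch.Tensor:
    """Encode one video frame into a normalized CLIP image embedding."""
    inputs = processor(images=frame, return_tensors="pt").to(device)
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features[0] / features[0].norm()

class SceneDriftDetector:
    """Tracks a running mean CLIP embedding and flags semantic drift."""

    def __init__(self, threshold: float = 0.15):  # threshold is an assumed value
        self.threshold = threshold
        self.mean = None
        self.count = 0

    def update(self, frame: Image.Image) -> bool:
        """Return True when the frame drifts far from the scene's average embedding."""
        emb = embed_frame(frame)
        if self.mean is None:
            self.mean, self.count = emb, 1
            return False
        drift = 1.0 - torch.dot(emb, self.mean / self.mean.norm()).item()
        self.count += 1
        self.mean = self.mean + (emb - self.mean) / self.count  # incremental mean update
        return drift > self.threshold
```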
Further, if we have a few examples of semantically interesting video frames, we can compare the current frame's CLIP embedding to clusters of those prior examples. This way, we can detect what exactly is happening in the video, differentiating, for instance, whether a bird, a squirrel, or a weather change is causing the semantic shift.
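A minimal sketch of this cluster comparison, reusing embed_frame from above and assuming a handful of hand-picked example frames per label:

```python
import torch

def build_clusters(examples: dict[str, list]) -> dict[str, torch.Tensor]:
    """examples maps a label ("nothing", "squirrel", "bird") to a list of PIL frames.
    Returns one normalized mean embedding per label."""
    clusters = {}
    for label, frames in examples.items():
        embeddings = torch.stack([embed_frame(f) for f in frames])
        mean = embeddings.mean(dim=0)
        clusters[label] = mean / mean.norm()
    return clusters

def classify_frame(frame, clusters: dict[str, torch.Tensor]) -> tuple[str, float]:
    """Return the label whose cluster centroid has the highest cosine similarity."""
    emb = embed_frame(frame)
    scores = {label: torch.dot(emb, centroid).item() for label, centroid in clusters.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```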
Finally, as a reach goal, if the user provides a text description of what they are interested in, that text can also be encoded into CLIP space, and the Semantic Aware Video Filter can detect whether the current frame is similar to the provided description.
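The reach goal could reuse the same model's text tower. Here is a hedged sketch in which the prompt string and match threshold are placeholders:

```python
def embed_text(prompt: str) -> torch.Tensor:
    """Encode a user-provided description into a normalized CLIP text embedding."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0] / features[0].norm()

# Hypothetical usage: flag frames that score above an (assumed) similarity threshold.
query = embed_text("a red cardinal at a bird feeder")

def matches_query(frame, threshold: float = 0.25) -> bool:
    return torch.dot(embed_frame(frame), query).item() > threshold
```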
In addition to calculating the nearest CLIP-space neighbors of each frame in a video stream, we can attempt to build a GRAD-CAM-style pixel highlight. GRAD-CAM computes a heatmap of each pixel's relevance for a given class in a class-prediction model, and I believe a similar technique can be used to compute the pixels most relevant to a CLIP-space embedding.
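Proper GRAD-CAM hooks into convolutional feature maps, which does not map cleanly onto CLIP's ViT backbone, so as a first approximation this sketch uses a plain gradient-saliency map: backpropagate the cosine similarity between the frame and a reference embedding down to the input pixels. It is a stand-in for the GRAD-CAM idea, not the final technique:

```python
import torch

def saliency_heatmap(frame, reference: torch.Tensor) -> torch.Tensor:
    """Gradient-saliency stand-in for GRAD-CAM: how much does each input pixel
    affect the cosine similarity between this frame and a reference embedding?
    `reference` is a normalized, detached embedding (e.g. a cluster centroid)."""
    inputs = processor(images=frame, return_tensors="pt").to(device)
    pixel_values = inputs["pixel_values"].requires_grad_(True)
    features = model.get_image_features(pixel_values=pixel_values)
    emb = features[0] / features[0].norm()
    similarity = torch.dot(emb, reference)
    similarity.backward()
    # Collapse the channel dimension; absolute gradient magnitude per pixel
    # (at the preprocessed 224x224 resolution, not the original frame size).
    heatmap = pixel_values.grad[0].abs().sum(dim=0)
    return heatmap / heatmap.max()
```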
Development will require the following steps:
Developing code that can pull the current frame from a live stream (see the sketch after this list).
Passing the frame through a CLIP vision model to generate the image embedding.
Using the image embedding, we can compare it to: the running average embedding of the scene, clusters of known interesting frames, and (as a reach goal) a CLIP text embedding of the user's description.
Additionally, we can take the gradients computed by the model and attempt to produce a GRAD-CAM-style heatmap of the frame.
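For the frame-pulling step referenced above, one possible approach (a sketch, not tested against every stream format, and assuming yt-dlp and OpenCV are acceptable dependencies) is to resolve the stream URL with yt-dlp and read frames with OpenCV:

```python
import cv2
import yt_dlp
from PIL import Image

def open_live_stream(youtube_url: str) -> cv2.VideoCapture:
    """Resolve a YouTube live stream to a direct media URL and open it with OpenCV."""
    with yt_dlp.YoutubeDL({"format": "best", "quiet": True}) as ydl:
        info = ydl.extract_info(youtube_url, download=False)
    return cv2.VideoCapture(info["url"])

def grab_frame(capture: cv2.VideoCapture) -> Image.Image | None:
    """Pull the next available frame and convert it to a PIL image for CLIP."""
    ok, frame_bgr = capture.read()
    if not ok:
        return None
    return Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

# Hypothetical usage with the bird feeder stream listed below:
# cap = open_live_stream("https://www.youtube.com/watch?v=8Zsc_2mGpOg")
# frame = grab_frame(cap)
```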
I have found a few live streams on YouTube that seem relevant:
FL Birds Live Cam: https://www.youtube.com/watch?v=8Zsc_2mGpOg
A stationary camera watching a well-maintained bird feeder. Squirrels and birds make regular appearances, making this a good candidate for a nearest-neighbor check to determine whether a frame contains "nothing", a "squirrel", or a "bird".
Night Walk in Tokyo: https://www.youtube.com/watch?v=0nTO4zSEpOs
This is a video, not a live stream, from a backpack-mounted camera walking through Tokyo. While the other streams feature largely the same background, this feed shows radically different views of back alleys, well-lit malls, downtown parks, etc. Twelve distinct locations are visited, and it would be fun to compare the average CLIP embedding of each of the twelve video segments against the others (see the sketch after this list).
EVNautilus: https://www.youtube.com/@EVNautilus
An underwater research stream, outputting many frames of the sea floor, with the occasional sea creature, piece of trash, or geologic formation as the main focus for several minutes. This camera is mobile, so a traditional motion-detection approach would not be effective at identifying exciting frames.
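For the Tokyo walk comparison mentioned above, a sketch could average the frame embeddings inside each hand-marked segment and build a pairwise cosine-similarity matrix (segment boundaries would come from manually noted timestamps):

```python
import torch

def segment_means(segments: dict[str, list]) -> dict[str, torch.Tensor]:
    """segments maps a location name to its list of sampled PIL frames."""
    return {name: torch.stack([embed_frame(f) for f in frames]).mean(dim=0)
            for name, frames in segments.items()}

def similarity_matrix(means: dict[str, torch.Tensor]) -> torch.Tensor:
    """Pairwise cosine similarity between the average embeddings of each segment."""
    names = list(means)
    stacked = torch.stack([means[n] / means[n].norm() for n in names])
    return stacked @ stacked.T  # entry (i, j) compares segment i with segment j
```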
I will be using pretrained CLIP models from Hugging Face. Initial prototypes currently use CLIPModel.from_pretrained("openai/clip-vit-base-patch32") and clip.load("ViT-B/32", device=device).
I will determine whether the semantic filters can successfully identify or label the content in a video. By providing human-labeled clips with known entities in them, I can evaluate the accuracy of the system.
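A sketch of that evaluation, assuming each labeled clip is reduced to (frame, true_label) pairs and reusing classify_frame from earlier:

```python
def evaluate(labeled_clips: list[tuple], clusters) -> float:
    """labeled_clips is a list of (frame, true_label) pairs from human annotation.
    Returns simple top-1 accuracy of the nearest-cluster classifier."""
    correct = 0
    for frame, true_label in labeled_clips:
        predicted, _ = classify_frame(frame, clusters)
        correct += int(predicted == true_label)
    return correct / len(labeled_clips)
```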
This could be useful for researchers who collect long streams of video data, helping them filter and identify clips that are semantically relevant.
Further, the GRAD-CAM-style heatmap could help AI researchers understand how a computer vision model's understanding of a scene evolves over time.