Vision Prompt Injection is a brand new problem. The situation is made even more difficult by the fact that GPT-4 Vision is not open-source and we don't quite know how text and vision input affect each other. I tried techniques based on adding additional instructions in the text part and ordering the LLM to ignore potential instructions contained in the image. It seems to improve the model's behavior, at least to some extent.
Glasp is a social web highlighter that people can highlight and organize quotes and thoughts from the web, and access other like-minded people’s learning.