On Giving AI Eyes and Ears


Jobs that require visual or audio interaction once seemed insulated from AI. Now they are not – for better and for worse.

Multimodal AI lets the model “see” images and “understand” what it is looking at. I had some fun with this by uploading a recent meme (one that would not be in the AI’s training data), with a little image manipulation to make sure Bing couldn’t find it by searching. The AI not only accurately described the scene in the meme but read the text in the image, and even explained why the meme was funny. I also challenged it to come up with creative recipe ideas from a photo of the food in my fridge (it identified the items correctly, though the suggestions were mildly horrifying). However, as amazing as it is, the system still suffers from hallucinations and mistakes. In my third experiment, it correctly identified that I was playing poker and even figured out my hand, but it misidentified other cards and fell into a typical case of AI confusion, forgetting important details until reminded. So multimodal AI feels a lot like text-based AI: miraculously cool things mixed with errors.

But multimodal AI also lets us do things that were previously impossible. Because the AI can now “see,” it can interact with the world in entirely new ways, with some big implications.
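To make the fridge experiment concrete, here is a minimal sketch of how one might send an image to a multimodal model programmatically. This is illustrative only: the post does not include any code, and the OpenAI Python SDK, the gpt-4o model name, and the fridge.jpg filename are my assumptions, not the author’s setup.

```python
# Minimal sketch: ask a vision-capable model about a photo of a fridge.
# Assumes the OpenAI Python SDK (pip install openai), an OPENAI_API_KEY
# in the environment, and a local photo named fridge.jpg (hypothetical).
import base64

from openai import OpenAI

client = OpenAI()

# Encode the local image as a base64 data URL so it can be sent inline.
with open("fridge.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; the name depends on your access
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the foods you can see, then suggest a creative recipe using only those items."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The same pattern, a text prompt and an image sent together in one message, would cover the meme and poker experiments described above.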

Read the rest at One Useful Thing