New functionalities in AI, including multimodal support

Posted by Todd Bryant in Uncategorized

In the last post, we introduced NotebookLM as a way to run AI-focused queries on a user’s own selection of texts for education and research. Since then, there have been other improvements in the capabilities of AI models, particularly in the area of multimodal AI, which allows for input and output in media other than text. Here we’ll look at some examples you may find useful for teaching and digital storytelling projects.

Image Generation

Admittedly, image generation from text prompts was already widespread last year with MidJourney and others. However, I do want to highlight Firefly from Adobe, which we have access to as part of our Adobe license at Dickinson and which is trained on licensed images from Adobe’s library. The beta version of Photoshop also uses AI for its Generative Fill tool.

Image creation prompts work best if they include the item, the context, and the style for the image.

  • Firefly from Adobe
  • Copilot in Microsoft’s Edge browser, using OpenAI’s DALL·E 3
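As a rough illustration of the item/context/style pattern described above, here is a small Python sketch. The helper function and the example wording are my own, not part of Firefly’s or Copilot’s interface; in practice you would simply type the combined sentence into the tool’s prompt box.

```python
def build_image_prompt(item: str, context: str, style: str) -> str:
    """Combine the three recommended elements into one prompt string."""
    return f"{item}, {context}, in the style of {style}"

# Example: a prompt assembled from the three parts
prompt = build_image_prompt(
    item="a red fox",
    context="crossing a snowy college quad at dusk",
    style="a watercolor illustration",
)
print(prompt)
# -> a red fox, crossing a snowy college quad at dusk, in the style of a watercolor illustration
```

Keeping the three parts separate like this makes it easy to vary one element (say, the style) while holding the others fixed, which is a quick way to compare outputs across tools.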

Image and File Inputs for Chat Models

The free versions of major chat models now have other options for input besides text prompts. Those I mention below can also understand scientific diagrams and charts within articles. For the most part it’s quite impressive, but the most difficult part can sometimes be getting the model to read the correct figure. At times you may be better off only uploading the pages of the article that contain the chart and related information.

  • ChatGPT from OpenAI
    • Accepts files uploaded directly or via OneDrive or Google Drive.
    • Can understand images including many infographics and charts
  • Gemini from Google
    • Accepts only images
  • NotebookLM, also from Google, for querying your own documents.
  • Copilot in Microsoft’s Edge browser, using ChatGPT from Microsoft’s partner OpenAI
    • Accepts files including images and screenshots uploaded directly.
    • Can understand images including many infographics and charts
  • Claude (Anthropic)
    • Accepts file uploads. I had trouble testing this due to size limitations.

Audio Generation and Editing

Many of the major audio generators are focused on music generation. For education, we still recommend using Creative Commons licensed music. We have, however, found AI useful for some editing tasks as well as for sound effects and background noise.

Adobe Podcast – The Enhance Speech function does very well at removing unwanted static, background noise, etc.

Stable Audio – Note that it assumes you want to generate music, but try entering a prompt such as “Background and noises of a metro station”.

Video Generation

Video generation is still very much a work in progress, and video models are also much more resource intensive than text models. As a result, there aren’t many truly free versions available, but a future where students can use AI to create B-roll footage as part of a class project may not be far off.

Sora – OpenAI appeared to be a leader in the space with their announcement of Sora. Human movement was markedly improved, though you’ll still notice glitches if you watch the woman’s feet in the first example. It still isn’t widely available for testing, however.

Veed – The best “free” option I could find. The free version adds a watermark and limits exports to 720p. See my New York skyline video. By default it adds audio and subtitles, which I removed in its basic editor. As with all freemium products, they may remove features once they have established a user base.

 
