
Produce better content with artificial intelligence

Photo credit: Morris Mac Matzen

Eugene L. Gross

"My wish for the future of Artificial Intelligence is that everyone understands that it is not science fiction, but a Swiss Army knife that everyone should use."

Eugen Gross has many years of experience in many areas of TV and moving-image production, on both the creative and the technical side. He trained as a camera assistant in Vienna, worked as a cameraman mainly on shows and entertainment formats, was an SNG operator and OB truck manager, and also produced and directed his own work.

Due to changes in the market and the digitization of the media, he saw the need for a professional change. In addition to many smaller workshops, he trained as a producer in Cologne and then completed the "Executive MBA in Media Management" certificate course at the Hamburg Media School, where he was able to deepen his professional experience through practical studies. aiconix emerged from his master's thesis.



Gunnar Brune / AI.Hamburg: Mr. Gross, you were a cameraman and now you use artificial intelligence to analyze videos. What do you do and how did it come about?

Eugene L. Gross: Strangely enough, even as a cameraman I was always interested in data. I think I was the only one who kept a pivot table to see how many days I worked, how many days I had to travel, and so on. That always interested me. I was always technically savvy; at times I was an OB truck driver and even a bit of a satellite technician. I don't come from "film" — in fact, I come from television, and I got into it very early. I'm the classic TV guy. My world was the classic entertainment show: 8:15 p.m., curtain up for the Red Hot Chili Peppers, and also Thomas Gottschalk and Anne Will.

Gunnar Brune / AI.Hamburg: Did you mainly do live TV?

Eugene L. Gross: Yes, of course I also did documentaries. For example, I had a production company for a long-term documentary in Kiel. But my daily business has been entertainment for many, many years. I've done 80 concerts from Tokio Hotel to Helene Fischer, from Marianne and Michael to the New York Philharmonic. I did all the talks from Harald Schmidt to Anne Will and Beckmann. I also co-founded a professional association for television cameramen.

Gunnar Brune / AI.Hamburg: And now you no longer work with the camera, but with artificial intelligence. How did you get there?

Eugene L. Gross: Through my MBA in Media Management, I came to the question of how to make better use of data. In my opinion, data is currently used too much for sales and too little to develop better content and better products. You have to bring all the factors together, and that is only possible with artificial intelligence. With this idea I founded aiconix.

Gunnar Brune / AI.Hamburg: What is your most exciting project right now?

Eugene L. Gross: The most exciting is always the latest. Speech recognition, i.e. speech-to-text, is what almost everything revolves around at the moment. A lot starts with speech-to-text; on the sales side, customers start with it. The content of a video is also recognized today based on the text, regardless of what the image looks like. That's the way it is today — I can tell you more about that later. In short formats in particular, it is the language that conveys the information.

If I can convert speech into text, I can analyze it semantically and extract topics. For that you need speech-to-text, and we have solutions in all directions. We have a front end and offer an API, i.e. an interface. We also have a Slackbot. And here we are pioneers: we also do it live. We are already doing this for the Hessian state parliament, and we are currently applying for a project in the German Bundestag. We have inquiries from public prosecutors, the police, and the federal press conference — the latter about live subtitles, for example. Speech-to-text is therefore particularly important to us from a business point of view.

We are cloud-agnostic and therefore broadly based. We are in the Oracle startup program, have just entered into a partnership with Microsoft, we use Amazon, Google, many small providers in parallel and are open to everything.

Gunnar Brune / AI.Hamburg: So you record the sound, capture the text, and then semantic analyses run very, very quickly?

Eugene L. Gross: Almost — speech-to-text runs very, very quickly. If you want semantics, that's an extra call. First you get the text, then you send it to another provider, which extracts the topics. It is extremely interesting to create tables of contents with time information for longer videos or podcasts. I am amazed that, as far as I know, nobody has done this before! A podcast is a good use case. With a podcast that lasts 60 minutes, I don't want to go through all 60 minutes to find the one point that interests me. I would like to know that the first ten minutes are an intro, the second ten minutes are — for example — a job description, and the third ten minutes are about AI. I might want to listen only to that third stretch on the topic of AI. I also don't want to listen to all 20 episodes of a podcast if only episode 14 covers my topic of interest, AI. Spotify has just announced a large project that developers can participate in, about the question of what options there are to get into individual episodes thematically.
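A crude version of such a table of contents can be sketched in a few lines: group a timestamped transcript into ten-minute chunks and label each chunk with its most frequent terms. The transcript format, stopword list, and labels below are illustrative assumptions, not aiconix's actual pipeline.

```python
from collections import Counter

# Hypothetical timestamped transcript: (start_second, text) pairs,
# as a speech-to-text service might return them.
transcript = [
    (30, "welcome to the show today we talk about careers and ai"),
    (650, "my job as a camera operator meant long days on set"),
    (1300, "artificial intelligence now helps us analyze every video"),
]

STOPWORDS = {"the", "to", "and", "we", "as", "a", "my", "on", "now", "us",
             "every", "about", "today", "talk"}

def table_of_contents(segments, chunk_seconds=600, top_n=2):
    """Group transcript lines into fixed-length chunks and pick the
    most frequent non-stopword terms as a crude chapter label."""
    chunks = {}
    for start, text in segments:
        chunks.setdefault(start // chunk_seconds, []).append(text)
    toc = []
    for index in sorted(chunks):
        words = [w for t in chunks[index] for w in t.split()
                 if w not in STOPWORDS]
        label = ", ".join(w for w, _ in Counter(words).most_common(top_n))
        toc.append((index * chunk_seconds, label))
    return toc

for start, label in table_of_contents(transcript):
    print(f"{start // 60:02d}:{start % 60:02d}  {label}")
```

In practice, keyword counting would be replaced by the semantic topic-extraction call described above; the chunking and time-labeling skeleton stays the same.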

Let's say you attended a long webinar or a conference that lasted eight hours, and a video is made available at the end. You remember a particularly interesting talk, or that something exciting was said, but you can't press Ctrl-F to activate a search function and jump to that keyword, and you usually have no overview of the topics. There is still a lot of potential in these applications, and it all starts with speech-to-text.

Gunnar Brune / AI.Hamburg: Sounds great, but how do you solve the problem in detail? How is the language recognized?

Eugene L. Gross: That would be nice, but no speech-to-text AI can currently recognize the language at the moment of speaking. Some providers test in the first seven to ten seconds which language is involved; there is no other solution yet. You mentioned noise — that is a huge topic. If you have music in the background, you can filter it out. If you have street noise, you can filter it out. But we have not yet found a general solution for extracting the speech itself in order to filter out all noise, something humans do so well. As a person, I can recognize speech amid all these noises; we have not yet found a model that can do this in general, only for individual sounds or groups of sounds. What does work is enhancing the speech so that it can be recognized better. The overall quality is good by now. A speaker who speaks clearly — in political speeches, for example — is our ideal case, because these are mostly experienced speakers with good microphones. But at the moment, and we currently have a few cases like this, when someone on a reality TV show says "Hey Dicker, dude", it is a disaster.

Digital State Minister Dorothee Bär was our guest at a trade fair. She asked: "Can it speak Franconian?" The answer was of course "No, it doesn't speak Franconian". I'm Austrian — as soon as you get into dialect, the quality drops significantly. This is because the German language is not that important globally. For English there is a much larger database available for training, so artificial intelligence can understand an Indian or an Australian speaking English relatively well. For German, the quality goes down in Munich or Vienna, because the models have been trained on High German, and the dialects are a huge topic. Dialects are a matter close to my heart. I already have a solution in my head and as an architecture, but unfortunately no budget to implement it.

It is currently the case that we use several artificial intelligences to achieve the optimum for speech-to-text. If they do not agree on which term is the right one, it is underlined and you are offered the various options. So you don't have to review the whole text, only the words marked in blue, where the AIs disagreed. This is very easy to see in the text: everything that is black is a 100 percent match. Green marks words where the providers differed only in capitalization — we have optimized the process so that this is checked first. It gets interesting with blue, where our own algorithm had to weigh several sources with different results to reach its decision. Numbers, for example, are difficult: some providers write numbers out as words, others give them as digits. If a text has many blue markings, our algorithm had to work very hard, because there were immense differences between the individual providers. Artificial intelligence is a framework, and ours acts as a meta-level that returns the most probable speech-to-text result. But I would recommend to everyone, especially with sensitive data that must not be wrong, to do a final review.
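The underlining logic described here can be sketched as a word-level majority vote. This is a minimal illustration that assumes the provider outputs are already aligned word by word (a real system needs sequence alignment to handle insertions and deletions); the provider strings are invented.

```python
from collections import Counter

def merge_transcripts(candidates):
    """Word-level majority vote over several speech-to-text results.
    Assumes the candidates are aligned word by word. Returns
    (word, flag) pairs: 'ok' for unanimous words, 'check' where the
    providers disagreed and a reviewer should take a look."""
    merged = []
    for words in zip(*(c.split() for c in candidates)):
        counts = Counter(words)
        best, votes = counts.most_common(1)[0]
        flag = "ok" if votes == len(words) else "check"
        merged.append((best, flag))
    return merged

# Three hypothetical providers disagreeing on a number, the weak
# spot mentioned in the interview.
providers = [
    "the budget is thirty eight million",
    "the budget is twenty four million",
    "the budget is thirty eight million",
]
for word, flag in merge_transcripts(providers):
    marker = "" if flag == "ok" else "  <-- review"
    print(word + marker)
```

A front end would render the "check" words in blue and offer the losing candidates as alternatives, so the reviewer only touches the flagged positions.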

What does that mean in practice? Take a 4-minute video. Typing it up probably takes about 20 to 30 minutes of work. Creating a subtitle file with timecodes usually takes ten times the video length, i.e. forty minutes for the 4-minute video. With AI support, it costs only a few cents plus the final review. That means the workflow is immensely cheaper and faster.

In summary: with our speech-to-text service, we are on the one hand a meta-provider, i.e. we have an algorithm that optimizes the data of other providers so that the highest probabilities are achieved. We also train our own AI — not as our own speech recognition, but as a supporting technology. For the state parliament mandate, we had to be better than Google at transcription. Our meta-algorithm always works very well, of course. But it was very clear: when a member of parliament goes on stage and the speaker announces "Here is the Member of Parliament Vranitzky", the name has to be recognized correctly, even if it is not a dictionary name like Meier or Müller. This is exactly what Google cannot do, because their technology is of course much broader. We train these special words for our customers' tasks; for a medical conference, for example, medical vocabulary is trained as well. We recently had an issue because numbers were badly recognized in a live transcription. We were familiar with this problem in principle because, as I said, numbers are often a weakness of the models. In this particular case, however, there were absurd errors. Phonetically, a million is sometimes mistaken for a billion, but here 38 became 24, which have nothing to do with each other phonetically. When we discover something like that, we train very specifically.

Gunnar Brune / AI.Hamburg: The current benefit is clear. I can read what is being said, and I can get it relatively quickly and with less effort — including fewer stenographers. Where is the journey going? How will this benefit us in the future?

Eugene L. Gross: A lot will go much faster. I made my last film for Red Bull in Mexico, and I then waited two days for the transcript and translation before I could continue editing. With AI this is already much faster and better. I recently spoke to a journalist who is a department head at a business magazine. When he does an interview, he records it with his iPhone. He earns quite well; nevertheless, he has to transcribe it himself. He sits there three hours a day with the recordings of his own interviews and types them up. It doesn't have to be that way. AI can do this much faster, better, and more cost-effectively. He should focus on writing the article, because that is his core competency.

Gunnar Brune / AI.Hamburg: How do you get your algorithms? Do you use catalogs or do you work with mathematicians who write algorithms for you?

Eugene L. Gross: Both, actually. The first step on our platform was to look at the AI offerings from Google, Amazon and Microsoft. What can they do? And how can we, as a meta-level, generate a benefit by curating? Which algorithm is the best for which use case?

Gunnar Brune / AI.Hamburg: Don't get me wrong — you are a cameraman, how did you judge that?

Eugene L. Gross: (laughs) Well, quite simply, by throwing a voice recording into the machine and manually creating a transcript at the same time. Let's say Microsoft made 60 mistakes, Amazon made 20 mistakes, and Google maybe only ten mistakes, then we knew that Google had the best results.

Gunnar Brune / AI.Hamburg: So you just tried it? You didn't analyze the math formulas, but said we'll just try it out and what's better is better?

Eugene L. Gross: Exactly. We made a manual transcript of the speech and benchmarked the AI transcripts against it under ideal conditions. And we found: the results differ by language. One provider is better in German, another in English or Russian. If someone wants to transcribe Russian and chooses us, they get a different provider — that is our promise. With us you can also use the cheapest provider, which has only 80 percent of the accuracy of the best one. If that is enough for you — say, for editing a long-format documentary, where you just need to roughly understand what it is about — then fine. But if you are a news agency, working under pressure and needing subtitles, better quality is required, and then you get the best results with us. That is our promise.
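The benchmarking described here — counting mistakes against a manual reference transcript — is usually formalized as the word error rate (WER): edit distance over words, divided by the reference length. A minimal sketch, with invented provider outputs:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: minimum number of substitutions, insertions,
    and deletions to turn the hypothesis into the reference, divided
    by the reference length. Standard Levenshtein over word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical manual reference and two providers' outputs.
reference = "we compare the providers against a manual transcript"
results = {
    "provider_a": "we compare the providers against manual transcript",
    "provider_b": "we compare providers again a manual transcript",
}
best = min(results, key=lambda name: word_error_rate(reference, results[name]))
```

The provider with the lowest WER per language wins — which matches the simple "count the mistakes" comparison described in the answer above.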

Gunnar Brune / AI.Hamburg: So your model is to curate artificial intelligence for your customers' tasks. One can run just a single inexpensive algorithm, or use the ten best and apply the meta-evaluation?

Eugene L. Gross: Correct. But that's just the beginning for us. We built the algorithm I described earlier, which compares the data from the various sources and offers the answer with the highest probability. In the beginning it was a smart model because it saves customers from having to integrate the different providers themselves — we have already integrated them all. But the goal very quickly became not only to sell data from others, but also to offer our own models, and we started training them straight away. We designed our own face recognition. We built shot sizes into a custom model — for me as a cameraman, that was very important for our vision. This model detects whether a person is in a close-up or an American shot. Is it "head to toe"? How many people are in the picture? This is a separate model and an exciting application for the future.

Gunnar Brune / AI.Hamburg: So we have come to image recognition. Before we talked a lot about language, how far is the analysis of the images with AI?

Eugene L. Gross: I have a case like this: an advertising agency approached us to find out why some social media postings for a customer, a car manufacturer, perform well while others do not. We started from the working hypothesis that the perspective from which a car is photographed and the colors in which it is depicted influence the success of the postings with the target group. We differentiate, for example, between brown, warmer colors in a rural environment and steel-blue colors in an urban environment with high-rise buildings. The idea is that as an urban person you might be more interested in an SUV in an urban environment — or maybe exactly the opposite, because you drive the SUV around the city but see yourself as a country person. This is easy to determine with AI, because there are use cases and, above all, providers that extract color spectra from images and recognize vehicles. They can also detect from which perspective the vehicle was photographed, i.e. from the side, the front or the rear, and what percentage of the image is covered by the vehicle. This is an approach in the direction of visual storytelling, and it is about answering the question of which images are well received by the audience.
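Extracting a rough color spectrum of the kind described can be sketched with simple quantization: count coarse RGB buckets over a frame's pixels and report the dominant ones. The frame data below is synthetic, and a real pipeline would decode actual video frames and likely call a vision provider's API rather than this hand-rolled version.

```python
from collections import Counter

def dominant_palette(pixels, bucket=64, top_n=3):
    """Quantize RGB pixels into coarse buckets and return the most
    frequent buckets with their share of the image — a rough
    'color spectrum' of a frame."""
    def quantize(channel):
        return (channel // bucket) * bucket
    counts = Counter((quantize(r), quantize(g), quantize(b))
                     for r, g, b in pixels)
    total = len(pixels)
    return [(color, count / total)
            for color, count in counts.most_common(top_n)]

# Hypothetical frame: mostly steel-blue with some warm brown pixels,
# the urban-vs-rural distinction mentioned above.
frame = [(70, 100, 180)] * 70 + [(150, 90, 40)] * 30
for color, share in dominant_palette(frame):
    print(color, f"{share:.0%}")
```

Correlating such palette shares (plus vehicle perspective and coverage) with engagement metrics per posting is then a straightforward tabular analysis.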

The reason I founded this startup is precisely this: I believe that too much data is used for marketing and sales and too little for the creative side. Data can be very useful in helping creatives understand why the target group switched off at second 50. What happened at second 50? I don't mean at the level of the individual person, but the entire target group. If I lose 20 percent of this target group, then as a creative I have to look at what went wrong in the storytelling. That is the big goal, and probably where we have the most potential worldwide. And that's what we're working on right now.

We have several approaches here, with three different data sources. One is the medium itself — in the current case, a short online format. The second source is the data we can extract from it: who is talking about which topic? What is the cutting rhythm? We can of course also look at the colors and extract the subject. All of this data is available, and with it we can determine everything that can still be changed at the editing suite. You can determine the background music, at least by and large. Maybe also match people and topics and analyze whether a person is credible on a topic for the target group.

As a third data source, we have the target group itself. For this purpose, we have set up a neuro-laboratory in which we measure the EEG data of test subjects. If the analysis data shows that the target group switches off at second 50, the problem did not arise in that second; the problem is the sum of all the information we presented before. Maybe we should have built the film a little differently dramaturgically. Maybe we should simply have rearranged the scenes, told the story differently, and not lost the audience. To measure the main reasons why the audience tunes out, you have to look inside their heads, and at the moment you can only do that with an EEG device. We did it at a very high level, in a laboratory with a 40,000-euro device — this is not a toy, because we have to collect a lot of information. From the data, we were able to see how attention evolved. We cannot yet say whether a moment is exciting or not; there is still a lot to learn here.

Online videos have the advantage that I have a lot of data available very quickly. I quickly have 10,000 people who all switched off in the same second. That said, it's about how the story is told, it's about the storytelling. And that's where we want to go: We want to use artificial intelligence to help produce good content.

Gunnar Brune / AI.Hamburg: Where is the journey going?

Eugene L. Gross: My wish for the future of artificial intelligence is that everyone understands that it is not science fiction, but a Swiss Army knife that everyone should use.

We currently have the information from the language and from what is happening in the picture. Now we want to take care of the storytelling. Even with a 5-minute cooking video, I want to understand whether it has a three-act structure, where the intro is and where the main part is. Where is the plot point? In the cooking video, the plot point is when the dish is ready. If I understand how a story works, then based on the data from the last thousand videos a filmmaker made, I can tell him what worked better and what didn't — what was understood well and what wasn't. This enables us to give recommendations on how to improve the film or optimize it dramaturgically so that it does not lose the audience. And I can say to the advertiser: dear advertiser, please do not book the mid-roll at second 30 — that is in the middle of a sentence, in the middle of a scene. I can tell you where the chances of losing your audience are much lower: right before the plot point, just before the dish is ready. If I can determine the plot point automatically, then I can determine the right time for the commercial break. If you as a viewer have already made it to second 70 because you really care what the cake looks like, it is now much easier to play the commercials without losing you than at second 30, when you have only just gotten into the video. That is, we want to go into the story. We don't want to look at the product or video as a whole and say this one has a hundred thousand clicks and that one only fifty thousand. Instead, we want to use artificial intelligence to get into the product, into the storytelling, and identify the dramaturgy and the arc of suspense. In this we clearly differentiate ourselves from all other startups and companies that deal with video.
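The ad-break logic described here — pick the slot where the audience is least likely to drop off, ideally in the committed stretch just before the payoff — can be sketched against a per-second drop-off curve from analytics. The curve shape and window bounds below are invented for illustration; this is not aiconix's actual model.

```python
def choose_midroll(dropoff, earliest=20, latest=80):
    """Given per-second audience drop-off counts from analytics,
    return the second in the allowed window where viewers are least
    likely to leave — a stand-in for placing the break just before
    the plot point, where audience commitment is highest."""
    window = range(earliest, min(latest, len(dropoff)))
    return min(window, key=lambda t: dropoff[t])

# Synthetic drop-off curve: heavy churn early in the video, almost
# none in the stretch just before the dish is ready (~second 70).
dropoff = [25 if t < 40 else (10 if t < 65 else 2) for t in range(100)]
slot = choose_midroll(dropoff)
```

On this curve the chosen slot lands in the low-churn stretch near the payoff rather than at second 30, which is exactly the recommendation made to the advertiser above.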

Gunnar Brune / AI.Hamburg: Do you want to automatically identify the cliffhanger?

Eugene L. Gross: Yes. That would be much better, but there is still a very long way to go. We are on the way, though. With artificial intelligence, we now all have the tools in hand, and it is relatively easy to use the technology. You can spin up 100 servers in the cloud and calculate something within hours that used to take years. However, we still have a long way to go. In any case, I believe our approach to optimizing films is correct. Artificial intelligence will create better products, and not only better-sold products.

A look into the artificial intelligence workshop: the aiconix algorithm gives a text recommendation for the results of various speech-to-text services running in parallel.


The interview is conducted by Gunnar Brune from AI.Hamburg

Gunnar Brune

Gunnar Brune is a marketing evangelist and a strategy and storytelling expert. He is a management consultant with Tricolore Marketing, a partner of the NEPTUN Crossmedia Award, an author, and a multiple jury member for awards in the areas of marketing, communication and storytelling. Furthermore, Gunnar Brune is associated with the Enable2Grow network and is involved in AI.Hamburg to convey the possibilities and promote the use of artificial intelligence.

Gunnar Brune is the author of the marketing book “Frischer! Fruity! More natural!” and the illustrated book “Roadside”. He is co-author of the books “DIE ZEIT explains the economy” and “Viral Communication”, and he has been writing regularly for specialist magazines for many years. His articles can be found in Advertising Age (the US advertising trade magazine), Horizont, Fischer's Archiv and RUNDSCHAU for the food trade.

Contact information:

Gunnar Brune, [email protected], 0176 5756 7777