Defining Sound Bite Edges for Collaborative Editing
Submitted by kentbye on Wed, 2005-12-14 20:03. Development | Editing | playlist | Theory
As I've described before in my collaborative editing schema, my plan is to use the playlist mechanism to have users help filter through the interview audio content and identify good sound bites. But who takes the first cut at defining the boundaries of the "sound bites"? Well, I'll be taking a first crack at defining the sound bite edges, but the volunteers will be able to redefine the "In" and "Out" points.
I was chatting with Kevin Marks yesterday -- who is a former Apple employee, podcasting technology pioneer and Technorati engineer -- and Marks made the great point that there are three stages to editing: "shot-logging, sequencing and polishing."
Later in the conversation, Marks split the first stage, shot-logging, into two separate steps, and this is how he described the post-production process of editing a massive data set (with spelling errors corrected):
one is labeling times of interest, which is naturally sloppy -- let people tag as it goes by
second is making good sound bite clips, or chunks of meaning -- which involves a bit of effort to pick good in and out points -- and appeals to a subset of people
3rd is sequencing clips, which once you have good standalone ones defined by meaningful chunk rather than time, is easy -- and gives you a much better granularity to tag, annotate, vote on and so on
the difficulty is bridging the stages
Indeed, there is a lot of difficulty and complexity in bridging these stages. And so that's why I've planned on doing the shot-logging and clean-up process myself so that I can distribute the sequencing portion.
I'll be doing the shot-logging offline using the Final Cut Pro blade tool to mark each sound bite's IN and OUT times in the timeline, and then export these IN and OUT times via Final Cut Pro XML. Then the text will be aligned with timecode data for each sound bite and uploaded as a Drupal node with a unique URL, which will allow volunteers to annotate the sound bites with tags and comments. Volunteers will also be able to shorten or lengthen each sound bite, and so there may need to be ways to account for the metadata associated with sound bites that have a high variance of IN/OUT edges. Or perhaps the variance will be negligible considering that the context and meaning of the sound bite will be relatively the same.
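One way to think about the "variance of IN/OUT edges" question is to measure how much a volunteer's revised sound bite still overlaps the original one. The following is only a rough sketch: the `SoundBite` record, the node number, the frame values, and the 30 frames-per-second timebase are all hypothetical, and the overlap score is just one possible measure, not something taken from Final Cut Pro or Drupal.

```python
from dataclasses import dataclass

FPS = 30  # assumed timebase (frames per second); adjust to the real sequence


@dataclass(frozen=True)
class SoundBite:
    node_id: int    # hypothetical Drupal node number
    in_frame: int   # IN point, in frames
    out_frame: int  # OUT point, in frames

    def seconds(self):
        """Return the (IN, OUT) points converted from frames to seconds."""
        return (self.in_frame / FPS, self.out_frame / FPS)


def edge_overlap(original, revised):
    """Shared span divided by the span from the earliest IN to the latest OUT.

    1.0 means identical edges; 0.0 means the two versions do not overlap.
    """
    shared = min(original.out_frame, revised.out_frame) - max(
        original.in_frame, revised.in_frame
    )
    combined = max(original.out_frame, revised.out_frame) - min(
        original.in_frame, revised.in_frame
    )
    if combined <= 0:
        return 0.0
    return max(shared, 0) / combined


original = SoundBite(node_id=1001, in_frame=300, out_frame=750)
revised = SoundBite(node_id=1001, in_frame=270, out_frame=780)  # one second longer on each side

print(original.seconds())                         # (10.0, 25.0)
print(round(edge_overlap(original, revised), 2))  # 0.88
```

If the score stays close to 1.0, tags and comments attached to the original sound bite could reasonably carry over to the revised one. A low score would suggest treating the revision as a new sound bite.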
Then volunteers help with the sequencing stage through web browser-based "editing" using the playlist mechanism. Marks makes a distinction between "editing" and "sequencing":
The key is to distinguish editing and sequencing -- editing needs sample accuracy -- and you aren't going to get that with XML and intermediates without a lot of pain.
sequencing of self-contained chunks without attempting laps or dissolves
So these "sequenced" sound bites will be done within playlists by volunteers, and then I'll be exporting the timecode data from this "edited" sequence back into Final Cut Pro so that I can "polish the edits" offline.
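To make the sequencing step concrete, here is a minimal sketch of how a playlist of sound bites could be turned into timecode data for the edited sequence. It assumes each sound bite is a self-contained chunk laid end to end with no overlaps or dissolves, as Marks describes. The bite names, frame numbers, and the 30 fps non-drop-frame timebase are made up for illustration, and the output is a plain list rather than real Final Cut Pro XML.

```python
FPS = 30  # assumed non-drop-frame timebase


def sequence_timeline(playlist, bites):
    """Lay the sound bites end to end in playlist order.

    Returns one tuple per bite, all values in frames:
    (bite_id, source_in, source_out, record_in, record_out).
    """
    timeline = []
    cursor = 0  # current position in the edited sequence
    for bite_id in playlist:
        source_in, source_out = bites[bite_id]
        length = source_out - source_in
        timeline.append((bite_id, source_in, source_out, cursor, cursor + length))
        cursor += length
    return timeline


def timecode(frames, fps=FPS):
    """Format a frame count as non-drop-frame HH:MM:SS:FF."""
    seconds, ff = divmod(frames, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


# Hypothetical sound bites: id -> (IN frame, OUT frame) in the source audio.
bites = {"bite-a": (300, 750), "bite-b": (1200, 1500), "bite-c": (0, 90)}
# A volunteer's playlist is just an ordered list of sound bite ids.
playlist = ["bite-c", "bite-a", "bite-b"]

for bite_id, src_in, src_out, rec_in, rec_out in sequence_timeline(playlist, bites):
    print(bite_id, timecode(src_in), timecode(src_out), timecode(rec_in), timecode(rec_out))
```

Because the playlist only stores an ordered list of ids, the source IN/OUT times stay with each sound bite, and the sequence timecode can be recomputed whenever a volunteer reorders the list.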
In an ideal situation, I would distribute the first editing phase of sound bite parsing to a large set of eager volunteers who would listen to over 45 hours of footage. They would highlight an interesting audio segment, and then tag it and annotate it on the fly as Marks suggests.
The mechanism to do this online could be accomplished with something like the BBC's Annotatable Audio Project as described by Tom Coates -- but it's still behind the firewall of the BBC. I plan on talking with Coates about it in more detail soon, and maybe he'll give me some more insights into how the BBC will be normalizing or making sense of this fuzzy data set. But either way, I probably won't have the resources or technological mechanisms to be able to effectively distribute this task.
So I'll be the one taking a first cut at determining the "In" and "Out" points of the sound bites. This is certainly a huge bottleneck that could eventually be overcome by integrating something like the Annotatable Audio tool into the workflow. But doing it myself is a satisfactory workaround for the moment, considering that volunteers will still be able to either shorten or lengthen the edges of the sound bites, and that I'm ultimately interested in the distributed sequencing portion of the editing process.
I'm still on the lookout for PHP coding help in making this happen, so please e-mail me at firstname.lastname@example.org if you're interested in helping out.
Below is the full transcript of the IRC chat that I had yesterday with Kevin Marks, with more commentary on this issue.
I was specifically searching for a way to automatically create smaller sections of MP3s by entering in a set of IN and OUT times. Marks says that it's theoretically possible to automate this task in QuickTime or Final Cut Pro as well as with iTunes, but it's way too complex for me to figure out, and so I'm sticking with the SMIL solution for the moment...
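For uncompressed audio, the "enter IN and OUT times, get smaller files" step can be sketched with nothing more than Python's standard `wave` module. This is an illustration under stated assumptions, not the workflow used for this project: it only handles WAV files, the file names in the usage comment are hypothetical, and turning the resulting clips into MP3s would still need a separate encoder (with the generation loss Marks mentions in the chat below).

```python
import wave


def slice_wav(src_path, dst_path, in_seconds, out_seconds):
    """Copy the audio between in_seconds and out_seconds into a new WAV file.

    Cut points are rounded to the nearest sample frame and clamped to the
    length of the source. Returns the number of sample frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        first = max(0, min(total, round(in_seconds * rate)))
        last = max(first, min(total, round(out_seconds * rate)))
        src.setpos(first)
        frames = src.readframes(last - first)
        with wave.open(dst_path, "wb") as dst:
            # Keep the same channel count, sample width and sample rate.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return last - first


def slice_all(src_path, cuts):
    """Write one WAV file per (name, in_seconds, out_seconds) entry."""
    return [
        slice_wav(src_path, f"{name}.wav", start, end)
        for name, start, end in cuts
    ]


# Hypothetical usage:
# slice_all("interview.wav", [("bite-001", 10.0, 25.0), ("bite-002", 61.5, 74.2)])
```

Working on the raw samples keeps the cut accurate to a single sample, which is the reason Marks gives for editing in an uncompressed format first and converting afterwards.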
KentBye: Hey Kevin, I saw you in the global voices IRC over the weekend -- and Chris Messina suggested I ask you this question: Can you suggest any programs where you input an MP3 with specific IN & OUT times, and then it creates "sound bite" MP3s? It's for a collaborative film editing application http://www.echochamberproject.com/collaborativefilmmaking
KevinMarks: how are you going to generate the in and out times?
KentBye: Final Cut Pro
KentBye: FCP XML
KevinMarks: basically, MP3 is not really designed for editing
KevinMarks: why in an mp3 then?
KevinMarks: FCP works in raw, so edit in that, then convert afterwards
KentBye: Right. I want to create smaller MP3 files so that they can be dynamically strung together
KentBye: Using a modified playlist within Drupal
KentBye: Assign each sound bite to a URL (i.e. drupal node).
KentBye: Then put a list of nodes, and then have it string together the edited sequence
KentBye: Then port back to offline editing via FCP XML
KentBye: MP3 is a dummy file for timecode continuity
KentBye: I need a way to take a five-minute MP3, and enter in IN/OUT times, and then output a bunch of smaller MP3 files
KevinMarks: well if you're sure your chunks are properly self contained, and don't start getting clever and trying to do laps, sequencing is easy
KentBye: Right. It's the creating the smaller and granular-sized sound bites that is the current bottleneck. I've been looking into SMIL, but it has to load in the entire 5-minute sequence before playing. Smaller files = quicker load times
KentBye: And Doug Kaye apparently puts together his show on the fly gluing together a bunch of MP3 files
KevinMarks: why do you want to do this server side?
KentBye: I want to upload a bunch of small MP3 files, so that they can be pointed to and edited together on the fly. And SMIL can do that.
KentBye: It's the best I could come up with so far for my collaborative editing flowchart, but I'm certainly open to whatever the easiest / best solution is.
KevinMarks: QT can do it too
KevinMarks: The key is to distinguish editing and sequencing
KevinMarks: editing needs sample accuracy
KevinMarks: and you aren't going to get that with XML and intermediates without a lot of pain
KentBye: you mean with SMIL and QT?
KevinMarks: sequencing of self-contained chunks without attempting laps or dissolves
KentBye: It doesn't have to be perfect edit. It just needs to get in the ball park
KentBye: The idea is to distribute the editing process of 45+ hours of interviews into a film.
KentBye: It uses the playlist mechanism to discover popular sound bites
KentBye: A post-filtering application for a ton of interview footage, I've gathered. It's not the final product
KevinMarks: Editing has three phases
KevinMarks: shot-logging, sequencing and polishing
KentBye: I'm doing shot logging offline
KentBye: The sequencing is done by volunteers
KentBye: And then the clean-up is done offline
KevinMarks: offline in the net sense
KentBye: And then the audio/video is released for others to do their own clean-up versions, etc
KentBye: After I release my film.
KevinMarks: well, my point is, that identifying a good clip really needs sample accuracy
KevinMarks: if you are making clips, rather than just making shot-logs
KentBye: The issue is that if I create small MP3 files with a program, then I can use SMIL for additional IN/OUT times.
KentBye: In other words, once I create the hard MP3 file and do simple sequence, then I'd have master say with what a "sound bite" is and isn't. clip begin and clip end with SMIL could allow the user additional control with what that soundbite would be
KevinMarks: it could, but in practice it isn't accurate enough
KentBye: What do you mean? What isn't accurate enough?
KentBye: I can get down to 1/30 of a second
KevinMarks: between rounding errors in seconds and MP3 chunk sizes, you get word fragments
KevinMarks: or annoyingly inconsistent playback
KevinMarks: QT will decode the MP3 and do it strictly by time
KentBye: I did a demo here: http://www.echochamberproject.com/node/692
KevinMarks: most others will approximate by stream slicing
KentBye: There are five minute MP3s
KentBye: And I created a SMIL file that edits them together using FCP XML data
KentBye: Direct link to demo: http://www.echochamberproject.com/files/demo.mov
KentBye: And I even did a demo with text streaming: http://www.echochamberproject.com/files/timecode_demo.mov
KentBye: It's actually pretty good. And it's certainly good enough for what I want to do with it.
KevinMarks: I lost the 'a' of about on nolan
KentBye: I'm not as worried about that because I can clean it up in the third phase. The idea is to piece together concepts
KentBye: Big concepts and allow users to help filter through the content and identify the popular and interesting sound bites
KevinMarks: did you see tom coates' demo?
KentBye: Yeah, I did.
KentBye: It's in Flash right?
KentBye: You're talking about the audio annotation, correct? I did see it, and I wish I had it. :)
KentBye: My ultimate question is if you know of any software that can take [MP3/*.wav file] + [IN/OUT times] and then spit out smaller *.MP3 files. Or if you have any other completely different ideas for what I'm trying to do.
KevinMarks: You could do the editing with QT, but you'd need something else to re-encode to mp3
KevinMarks: you will likely get a generation loss through recompression
KentBye: With SMIL it just excerpts the sound bites and pieces them together -- it doesn't have to recreate a file and suffer from another layer of generation loss
KentBye: The issue is that SMIL loads the *entire* file in before playing.
KentBye: In other words, it loads in the entire 5-minutes -- even if you're calling the first 15 seconds as a sound bite.
KentBye: So I need an easy way to dump in the IN / OUT times and generate a lot of smaller MP3 files.
KentBye: Are you saying that QT is scriptable in that it can generate smaller *.wav/*.aiff files? It doesn't sound like you've heard of a scriptable software solution for doing this with MP3s.
KevinMarks: QTPlayer is Applescriptable
KevinMarks: QT core is programmable in various languages to edit
KevinMarks: and smart enough to just extract the bits it needs
KevinMarks: but it won't keep them in mp3 form
KentBye: That's great info to know.
KentBye: I can input *.wav files via Final Cut Pro
KevinMarks: FCP is scriptable too, and calls QT underneath
KentBye: Hmmm.... That's an interesting thought. Have you -- or anyone you know -- ever done anything with it?
KevinMarks: well, I wrote the video capture stuff FCP calls, but I haven't looked at the top level stuff in 2 years
KentBye: Damn. That's cool. Saw that you were at Apple, but didn't realize you were doing anything related to FCP. Thinking....
KentBye: I guess the ultimate solution to what I'd want is to use the Blade tool to parse sound bites in a sequence within FCP -- and then have a script save each of those audio sections into a unique file. And then use some other program to convert those *.wav sound bite files to MP3 files. Do you know if this is theoretically possible with the scripts?
KevinMarks: you can script iTunes to make mp3's though it is a bit Heath Robinson
KentBye: Ah. That's interesting. Hmm... Well you've been a huge help. Thanks. I'll look into the FCP and itunes scripting. Is there an IRC channel where applescript experts hang out?
KevinMarks: not sure
KevinMarks: there is an Apple mailing list
KevinMarks: my take on this is a little different
KentBye: I'd love to hear any feedback / suggestions / whatever...
KevinMarks: I think that you should consider 3 modes
KevinMarks: one is labelling times of interest, which is naturally sloppy
KevinMarks: let people tag as it goes by
KentBye: That's where I'd love to have the BBC tool in hands...
KevinMarks: second is making good sound bite clips, or chunks of meaning
KevinMarks: which involves a bit of effort to pick good in and out points
KevinMarks: and appeals to a subset of people
KevinMarks: 3rd is sequencing clips, which once you have good standalone ones defined by meaningful chunk rather than time, is easy
KevinMarks: and gives you a much better granularity to tag, annotate, vote on and so on
KentBye: Right. Well in my case the first step, I don't have the $ or resources to implement
KevinMarks: the difficulty is bridging the stages
KentBye: Yeah, bridging. I'm trying to get to stage 3 by doing the first two stages myself
KentBye: But with SMIL, there is some flexibility to have others change the IN/OUT times for what the sound bite is. Lengthen / Shorten, Etc.
KevinMarks: yes, but it is mostly crap
KentBye: What do you mean?
KevinMarks: hard to edit properly, fiddly and inconsistent in playback between implementations
KevinMarks: an entirely other way to think about this is to use GarageBand
KentBye: Yeah. There isn't an optimal solution at the moment. I'm trying to do the best I can with the tools that are out there. Stage 3 is what I'm most interested in getting to -- since that is ultimately what I'm going to need to actually do collaborative film editing.
KevinMarks: hang on, i thought this was audio?
KentBye: It is audio.
KentBye: But I have video too.
KevinMarks: I know lots of people who use Garageband for podcasts
KentBye: EDLs are to film as playlists are to audio
KevinMarks: that is false
KevinMarks: playlists are simple sequences
KentBye: There's an extra dimension and complexity of course
KentBye: My point is that for a documentary film
KevinMarks: EDLs are hideously complex frame accurate things
KevinMarks: I worked on OMF import into QT, trust me on this
KevinMarks: playlists are like shotlists for film
KentBye: I guess I'm using it to filter through the *big concepts* for how the media failed leading up to the war in Iraq.
KentBye: Integrating b-roll and the visual dimension is a problem for down the road.
KentBye: But at the moment, I want help with filtering through the good and bad sound bites for the 76 interviews I've done.
KentBye: There is a way to use the playlist mechanism to simulate b-roll cuts. I've done a very simple implementation in this demo: http://www.echochamberproject.com/files/timecode_demo.mov
KevinMarks: well, the chaps from medialab who are now at Yahoo research have been chasing this for years
KentBye: I met a lot of Yahoo folks at the open media developers summit -- ryan shaw and others...
KevinMarks: my basic contention is that sequencing and editing are distinct activities
KevinMarks: is another thought on this
KentBye: I agree. Sequencing is the first step which is why I've been focusing in on audio. I'd need an AJAX implementation of browser-based editing in order to maintain the proportional timecode lengths to do proper video editing on top of audio. But the audio is the foundation, which is why I'm starting with it.
KentBye: Cool. I'll have to look at it.
KentBye: BTW, I interviewed and discussed using playlists for collaborative editing with Playlist Maven Lucas Gonze here: http://www.echochamberproject.com/node/689
KentBye: Cool. That corante article reminds me of an App that the NYU ITP department did which correlated the chat logs to the time codes of the video presentations. I noticed that you were looking for timecoded chat logs from GV summit to do something similar.
KevinMarks: I think that is a good way to do the stage 1 idea above
KentBye: Definitely. That's a great thought.
KentBye: See where chat spikes up.
KentBye: I'll have to see if I can dig up that program -- I think they've released the code.
KevinMarks: well, chat spikes and dips have multiple meanings
KentBye: Yeah, they're very noisy. That ITP program was created by Dan Shiffman -- still looking...
KentBye: Found a screenshot: http://www.shiffman.net/2005/11/15/temporal-comments/
KentBye: Here's the homepage: http://www.shiffman.net/itp/omds/videocomments/