I tossed this out on Twitter recently: "I am now imagining a text-conversation game in which you don't choose what to say -- but you have a sarcasm dial that you can turn up and down."
That thought was inspired by a card game that I found on the Web. Relationship (by Zach Weiner) has a satire-romance theme. Each card has a numeric value, and some cliched romantic sentiment ("8: you complete me", etc). Then there's a "Sarcasm" card you can play, which negates the value of the card (8 to -8).
This amused me, naturally, but then the idea got mixed in with game conversation engines. We've seen games where your choices are limited to "friendly" and "hostile", or "positive" and "negative", or some such. But of course the game then spits out a complete response on your behalf. You don't have any control of what positive or negative thing you say.
Sarcasm is a nifty compromise. Imagine the conversation is running along in real time, but you can see your upcoming line displayed as a subtitle. You can slide your controller anywhere from "sincere" to "brutal sneering sarcasm". As your lines come out, the words are predetermined, but the tone shifts.
(Tweetfriends immediately commented "That's just like screenwriting!" and "That's just like business meetings!" Message received: it's just like life.)
Of course, I then immediately started thinking about technology. This idea is most cool if you can adjust the tone dynamically -- word by word, or moment by moment, even. (The responses of the other character depend on your tone, so it's a branching tree structure like most game dialogues. But it feels different to the extent that your control is continuous, rather than an occasional isolated decision.)
This requires a lot of dialogue-writing, which is why I jumped over to think about the technology, of course. But it shouldn't be too difficult to write the text for a short example. You'd write it in outline form; at any point there might be two or three possible responses. The branching could be based on the average sarcasm of the player's previous line, or you could break it up into phrases and cue on the sarcasm of a single phrase. Most of the player's input information, the fine adjustments, would be ignored by this scheme -- but so what? The varying intonation would be an interesting output all by itself. You'd feel like you had fine control over your character. You would have fine control, at one level.
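To make the branching concrete, here is a minimal Python sketch of the "average sarcasm of the previous line" scheme. Everything in it is an assumption for illustration: the SCENE outline, the sample dialogue, the 0.33/0.66 thresholds, and the pick_branch name are all made up, and the dial is assumed to be sampled as numbers from 0.0 (sincere) to 1.0 (brutal sneering sarcasm) while the line plays.

```python
from statistics import mean

# Hypothetical outline: each node holds the player's fixed line plus
# (upper sarcasm bound, other character's reply, next node id) branches,
# sorted by bound. The text and thresholds are invented for illustration.
SCENE = {
    "start": {
        "line": "I just love what you've done with the place.",
        "branches": [
            (0.33, "Thank you! It took all summer.", "warm"),
            (0.66, "Hm. Thanks, I think.", "wary"),
            (1.00, "Nobody asked you.", "hostile"),
        ],
    },
}

def pick_branch(node, dial_samples):
    """Choose a reply from the average dial reading over one player line.

    dial_samples: slider readings in [0.0, 1.0] taken while the line played.
    All the fine wiggling is thrown away; only the mean matters here.
    """
    avg = mean(dial_samples) if dial_samples else 0.0
    for upper, reply, next_id in node["branches"]:
        if avg <= upper:
            return reply, next_id
    # Readings above the last bound fall through to the final branch.
    _, reply, next_id = node["branches"][-1]
    return reply, next_id
```

Cueing on a single phrase instead of the whole line would just mean calling pick_branch with the samples from that phrase only.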
I'm sure the walkthroughs would be written in terms of "slider to max, wait one sentence, slider to min, wait two sentences..." but again, so what.
Can software adjust the emotional lading of spoken text in real time? Software can do all sorts of crazy crap these days, so I'm sure the answer is "yes". How would I put it together out of off-the-shelf parts, though?
I suppose I'd record each line three times -- sincere, mild irony, biting sarcasm -- and then apply some sort of audio processing to interpolate between them. I'd think that pitch and pacing should be sufficient. That is, you just (I say "just" not knowing whether it's hard or trivial) have to match up phonemes between the three versions, and then derive a pitch curve (high/low) and a duration curve (faster/slower). Interpolate on the fly and you've got it.
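The interpolation step, at least, is easy to sketch. The Python below assumes the hard part is already done: the three takes have been aligned phoneme by phoneme, so each take is a same-length list of numbers (pitch in Hz, or duration in milliseconds, per phoneme). The function names and the choice to put "mild irony" at the dial's midpoint are my assumptions, and plain linear blending is only the simplest thing that could work, not a claim about what sounds good.

```python
def lerp(a, b, t):
    """Linear interpolation from a (t=0) to b (t=1)."""
    return a + (b - a) * t

def blend_curves(sincere, ironic, sarcastic, dial):
    """Blend three aligned per-phoneme curves for a dial value in [0, 1].

    Assumed mapping: 0.0 = sincere take, 0.5 = mild irony, 1.0 = biting
    sarcasm. The same function serves for a pitch curve or a duration curve.
    """
    dial = min(1.0, max(0.0, dial))  # clamp stray controller values
    if dial <= 0.5:
        low, high, t = sincere, ironic, dial / 0.5
    else:
        low, high, t = ironic, sarcastic, (dial - 0.5) / 0.5
    return [lerp(a, b, t) for a, b in zip(low, high)]
```

For example, with pitch curves [100, 200], [110, 180] and [120, 160], a dial at 0.25 lands halfway between the sincere and ironic takes and gives [105.0, 190.0]. Calling this once per phoneme with the current dial reading is what "interpolate on the fly" would amount to; actually resynthesizing audio from the blended curves is a separate problem that this sketch does not touch.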
This doesn't require full speech recognition. It's just finding correlations between spoken takes of the same line, two at a time, and I'm sure that's much easier. I suspect (again, without any research) that you'd map each line into intervals of silence, high frequency ("s", "t") and low frequency (vowels), and then fudge those until the two lines match.
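Here is a rough Python sketch of that guess, using only the standard library. It labels fixed-size frames as silence, high frequency, or low frequency, collapses them into intervals, and lines up two takes by their label sequences. The specific tricks are my stand-ins, not anything researched: frame energy for silence, zero-crossing rate as a cheap high/low test, difflib.SequenceMatcher for the "fudging", and thresholds picked out of the air. Samples are assumed to be floats in roughly [-1, 1].

```python
import math
from difflib import SequenceMatcher
from itertools import groupby

def classify_frame(frame, silence_rms=0.01, zcr_split=0.25):
    """Label one frame: '_' silence, 'H' hissy ("s", "t"), 'L' vowel-like."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms < silence_rms:
        return "_"
    # Fraction of adjacent sample pairs where the signal changes sign.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return "H" if crossings / (len(frame) - 1) > zcr_split else "L"

def segment(samples, frame_len=160):
    """Collapse per-frame labels into (label, length in frames) intervals."""
    labels = [classify_frame(samples[i:i + frame_len])
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [(label, len(list(run))) for label, run in groupby(labels)]

def align(segs_a, segs_b):
    """Pair up matching intervals: (label, frames in take A, frames in take B)."""
    a = "".join(label for label, _ in segs_a)
    b = "".join(label for label, _ in segs_b)
    pairs = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for i, j, n in matcher.get_matching_blocks():
        for k in range(n):
            pairs.append((segs_a[i + k][0], segs_a[i + k][1], segs_b[j + k][1]))
    return pairs
```

The paired lengths are the raw material for a duration curve: if a vowel runs 3 frames in one take and 5 in another, that interval is being drawn out. Real speech will be far messier than this assumes, so treat it as a starting point for experiments.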
Does Echo Nest do this already? Somebody run with it.