For 1,000 movies, this study compared lines included on IMDb's memorable-quotes pages with lines from the same movies that were not. People who hadn't seen the movies were able to pick the memorable one 78% of the time, although, caveat lector, that's with only n = 68.
What features allow this above-chance classification? The authors suggest 1) distinctiveness (i.e., a lower likelihood of coming from samples of standard English text), 2) generality (fewer personal pronouns, more present tense), and 3) complexity (words with more syllables and fewer coordinating conjunctions like "for" and "and").
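For the curious, the generality and complexity features above are simple enough to sketch in a few lines. This is a rough illustration, not the authors' actual pipeline: the word lists and the vowel-group syllable heuristic are my own assumptions, and distinctiveness is omitted because it requires a language model of standard English.

```python
import re

# Assumed word lists -- illustrative, not the paper's exact feature sets.
PERSONAL_PRONOUNS = {"i", "me", "my", "you", "your", "he", "him", "his",
                     "she", "her", "we", "us", "our", "they", "them", "their"}
COORDINATORS = {"for", "and", "nor", "but", "or", "yet", "so"}

def count_syllables(word):
    # Crude heuristic: count runs of vowels; real syllabification is harder.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def quote_features(quote):
    """Per-word rates for generality (pronouns) and complexity features."""
    words = re.findall(r"[a-z']+", quote.lower())
    n = len(words) or 1
    return {
        "personal_pronoun_rate": sum(w in PERSONAL_PRONOUNS for w in words) / n,
        "coordinator_rate": sum(w in COORDINATORS for w in words) / n,
        "mean_syllables": sum(count_syllables(w) for w in words) / n,
    }
```

On this sketch, a memorable quote would tend to score low on pronoun and coordinator rates and high on mean syllables relative to ordinary dialogue.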
Interestingly, their best support vector machine classified examples correctly only 64% of the time, so either the human data are somehow biased, or there are plenty more subtleties for machines to learn before they can best us at recognizing literary wit.