On writing fiction with voice-recognition software

Justine Larbalestier, a very good novelist with very bad RSI, has written a great post called "Why I Cannot Write a Novel With Voice Recognition Software." In it, she explains why machine-based speech-to-text software isn't sufficient for fiction. I think that if I absolutely lost the use of my hands and had no other choice, I'd probably dive into speech-to-text and gut it out, but nothing short of absolute necessity would get me to write fiction with machine-based speech-recognition.

Most of my first drafts are written in a gush of words as the characters and story come flowing out of me. Having to start and stop as I correct the VRS errors, and try to get it to write what I want it to write, interrupts my flow, throws me out of the story I’m trying to write, and makes me forget the gorgeously crafted sentence that was in my head ten seconds ago.

Now, yes, when I’m typing that gorgeously crafted sentence in my head, it frequently turns out to not be so gorgeously crafted, but, hey, that’s what rewriting is for. And when I’m typing, the sentence always bears a resemblance to its platonic ideal. With VRS, if I don’t check after every clause appears, I wind up with sentences like this:

Warm artichoke had an is at orange night light raining when come lit.

Rather than

When Angel was able to emerge into the orange night Liam’s reign was complete.

Which is a terrible sentence, but I can see what I was going for and I’ll be able to fix it. But that first sentence? Leave it for a few minutes and I’ll have no clue what I was trying to say.

However, checking what the VRS has produced after Every Single Clause slows me down and ruins the flow.

Why I Cannot Write a Novel With Voice Recognition Software

(Image: Arthritis, a Creative Commons Attribution (2.0) image from interactivestrategy's photostream)


  1. I think David Weber does all his writing with speech recognition software.  And there is a lot of it.

  2. As a programmer, server admin, and networking guy, I wish voice recognition were even that useful for me.

    For the writing problem, though… maybe it would be possible to record what you wanted on a cheap audio device, and then play the result into VRS?  Then you could go through afterwards and correct things.  That would also allow you to read and hear the writing at the same time while doing corrections, which might make it easier to identify places that need heavy reworking.

    Neither way is ideal, but knowing how bad humans can be with accent differences doesn’t give me much hope of software outperforming us on dictation any time soon.

    1.  An expensive, high-fidelity recording device might work, but a cheap one won’t. VRS is incredibly picky: even positioning your mic slightly differently than usual can change the results.

      On another note: I read an article about what it physically takes to write with a goose quill as a pen, and what it must have been like for Shakespeare to get the words down on the page. A writer would have to stop and either dip the quill in the inkwell or sharpen the quill several times per page — talk about flow interruption. I think the moral of the story is that when you change one aspect of how you physically write (VRS vs. typing it in, writing with quill & ink vs. ballpoint), you have to change all of it.

      1.  True.  If you did it in a quiet room and were able to feed the sound directly into the computer from the device, though, it might not be too bad.  If a decent, low-cost device wouldn’t work, then I would have to wonder whether a better microphone for the computer itself would help her.

        Of course, even if it worked it wouldn’t address the third complaint she has: for most types of text, it’s just slower to speak than to type, even if you can expect the listener to understand you.  Oh, for the day when we can plug our brains directly into the computer and write at the speed of thought…

          1.  Hmm, good point.  Though I have to wonder about “normal speech” being consistently that high.  The Wikipedia page mentions that rate as recommended for audiobooks, which probably takes less thought than normal speech (and a rehearsed presentation is noted as slower).

            I suppose I was probably thinking of how much slower speech is when on the receiving end.  But in the context of trying to write a book, I wouldn’t be surprised if the speed difference ended up weighted more towards typing.

          2. For voice recognition I always have to slow my speech down by about a third (or it gets really messy). And even then the software sometimes takes ages to “write”, in which time you could have typed another sentence.

            So in the end it is about the same speed, if
            – you don’t have direct speech, because the quotation marks take time
            – you don’t use uncommon words or names
            – you don’t care too much about spelling at the time of writing, but correct it later on

            and it is much more fun to type.

    2. What would be really cool is VRS that records an audio track while transcribing your speech, allowing you to scrub back through the audio track while at the same time highlighting the text on screen.

      1. I have to wonder whether something like this has been created for medical or court records, since it seems like an obvious improvement over plain paper transcripts created from dictation.  Storage might be an issue, but you could probably compress the audio quite a bit once the initial transcription is done.

      2. I believe Dragon DOES that and has for ages. I don’t remember how long the different versions (Pro, etc.) keep the audio (it might be just for a page or two), but it’s totally useful when you’re trying to identify a mistranscription later, and I think the medical/pro versions are designed for people who use that functionality, i.e. professionals or doctors whose secretaries or transcriptionists need to find errors in the transcript after the fact.

  3. Am I the only one who wants to go read some classical literature into voice-recognition software, just to see what it spits out?

    Edit: Yeah, other people probably have lives.

  4. Imagine trying to write software like this.


    You’d probably end up with this:

    pound include < String >

    1.  Sadly, since I broke my hand recently, I don’t currently have to imagine.  Even with all the tricks I already had set up to deal with my RSI, it took about ten minutes to give up, call the boss, and arrange to be in meetings for the next six weeks instead.

      Technical documentation is also bad, sometimes worse, because the output can look right at a casual glance.

  5. My wife does medical transcription for a living, and most medical transcription services offer non-medical transcription for other industries (she’s done subtitling for exercise videos, insurance claims, NCAA political videos).  I can’t imagine that sci-fi would be harder than medical transcription, especially for somebody like my wife who is already a sci-fi geek.  I don’t know how the rates compare, but they usually charge by word or page count, so it would be easy enough to extrapolate.

  6. Given an inability to type, I’d go for dictation onto a recording medium that could then be typed out. Asimov was said to have used this tech, and I know of a couple of master’s theses successfully done this way.

  7. “Warm artichoke had an is at orange night light raining when come lit.”

    VRS: bad for writing fiction, great for writing Dadaist poetry.

  8. I tried this recently, and it didn’t go well. “Lose his inheritance” came out as “lose his hair density.” Which may also have happened, but it wasn’t what I was after. I write longhand and then type it in (I can then edit as I do so), but I have learned to not go so long without typing stuff up. It took so much longer to fix the dictation software mistakes than it would have to just have typed it up normally.

  9. My voice recognition software is far better than that. I can’t write fiction in any medium, so I won’t opine about different tools for that task. If I have more than a paragraph or so in my head, I can get more of it out before forgetting, and do so faster, with VRS. The microphone adjustment is not super critical, but being in a quiet place is. My hopes of dictating in the car were thoroughly quashed!

  10. Here’s an essay by Richard Powers on how he does all his writing by voice:
    “… For one, I can write lying down. I can forget the machine is even there. I can live above the level of the phrase, thinking in full paragraphs and capturing the rhythmic arcs before they fade. I don’t have to queue, stop, batch dispatch and queue up again. I spend less mental overhead on orthography and finger mechanics and more on hearing my characters speak themselves into existence. Mostly, I’m just a little closer to what my cadences might mean, when replayed in the subvocal voices of some other auditioner.”

  11. I admit I’m puzzled by the article.  I use Dragon Dictate and don’t experience anything like the error rate the author seems to.  Perhaps it’s a question of which software package is being used…?  Also, the article makes an issue out of the occasional mismatched word as though that derails the whole process.  But why?  The trick with VRS, for me at least, has been to use it to generate an initial draft, without pausing at the end of every sentence to correct errors.  That’s what the second draft is for.  In my experience, VRS has been a huge productivity boost.

  12. I bought Dragon to try this out for NaNoWriMo and I just couldn’t get past the small errors. For instance, if I said “select blah” it would select the word blah. If I said “strike that”, which the tutorial said was the command to delete the current selection, Dragon would deselect blah, jump to the next instance of “that”, and delete it. *headdesk*

  13. I love doing stream-of-consciousness writing with voice recognition software. From the errors you’re describing, I’m guessing that your speech pattern is difficult for it to understand (the software should be trainable, but some accents and speech artifacts are more difficult, I think), or that your microphone is a problem.

    For technical stuff or work that involves a lot of formatting, it isn’t as useful to me, but Dragon was good enough at picking up my speech that it worked almost exactly as advertised for blog posts, articles, and other stream-of-consciousness writing.

    It took me some $150 worth of headset experiments to figure this out, though: the microphone made a very, very large difference in how clearly it picked up my ramblings.

    Also, it took a huge mental shift for me to be able to compose that way. I used to think that typing was the only thought-to-paper kind of communication with no secondary thinking, but I eventually got to the point where dictating to the program was effortless as well.

    And yes: writing lying down. Or in a hammock on a sunny day. Or anywhere outside on a bright, sunny day, where screen visibility stopped mattering (at least for the first draft).

    1. This has been my experience too! I can’t use Dragon to compose *formal* prose, because for that I need to edit the words of a sentence even as I’m composing it, something that is extremely hard to do with dictation software.

      However, Dragon works incredibly well for generating *informal* prose: email, chat, (some) blog posts, brainstormed ideas, discussion-board posts, etc. And as it turns out, the volume of informal prose I generate is vastly larger than the volume of formal prose.

      For me, the biggest problem with dictation software is that it suffers from the autocorrect problem. If you make a typo while keyboarding, you can generally spot it pretty easily when you rescan the text for errors. But Dragon doesn’t make typos — it makes substitution errors, just like autocorrect on phones. Substitution errors are much harder to spot when you’re copy-editing or re-scanning a text, so the likelihood of writing an email with a few completely berserk substitutions is high.

  14. I am a closed captioner for the hearing impaired and use voice recognition software to do my job every day.  I do live news programming like CSPAN, Bloomberg, BBC and a ton of local news affiliates.  Our terms with the stations guarantee a 97% accuracy rate, and we get live on-air quarterly reviews to make sure we are consistent.  My accuracy is rarely below 98%.

    Our voice recognition system (the company’s proprietary software) is far from perfect.  Too many words sound alike for it to work perfectly, but it is pretty remarkable just how well it does work.  That said, I can’t imagine writing with it.  I think it would be distracting and make for more work in the editing process.  

    Talking is also much more physically taxing than writing.  You know how your hands cramp up after a while at the keyboard?  Imagine that happening to your voice.  I can’t do my job when my voice burns out or I have a sore throat.  But I can use a keyboard.

    Next time you watch CSPAN, turn on the captions.  Now imagine there’s someone in a studio somewhere repeating everything the speaker is saying into VRS.  Notice the captions appear nearly instantaneously with what is being said.  It’s pretty fucking rad.  I think in 10 years, voice recognition will be so good and so omnipresent, we’ll all take it for granted.  

  15. Another annoying example of Boing Boing using a TLA (three-letter acronym) without explaining it.  I had to Google RSI.  C’mon, boingsters: better writing, please.  I know I’m talking to the wind here.

    1.  Maybe I’m old and cranky, but the idea that somebody can use computers and the net regularly  enough to follow Boing Boing but not know what RSIs are surprises me. Didn’t keyboard jockeys practically “invent” RSIs back in the ’90s? (Not really, of course, but the incidence of RSIs went through the roof.)

  16. It sounds to me as if the writer needs to get a better mic. That simple. I use a wireless broadcast lapel microphone and it works so much better than even my Rode NT1000 studio mic. Dragon screams along with it and over the last year the error rate has plummeted.

  17. So, as a professor of computing and interaction design, I have been telling my students for some time now to prepare and design on the assumption that many of the human-computer interfaces now prevalent should collapse. This is because, quite simply, in most cases they get in the way of what we are trying to do. Computing enables huge efficiency and an increased scope of ability, of course, but the points at which we meet computers still cause much friction, and so they must improve incrementally until this is no longer the case. This will increasingly mean that existing barriers to entry into all manner of disciplines requiring any form of professional expertise will disappear, in much the same way as we have seen barriers to entry into professional news media production systematically removed. It is fun to read you here, Mr. Doctorow, evangelist for this very same progressive shift in the structure of these hierarchies, dismissing with such confidence the use of VRS (voice recognition software) as a writing tool. In its current form, perhaps, but let’s not forget what this is actually about at its core, Cory. The keyboard, the pen, writing itself: these remain the barriers to entry into the (your) realm of professional storytelling.

    Marshall McLuhan
    – “A goose’s quill put an end to talk, abolished mystery, gave architecture and towns, brought roads and armies, bureaucracies. It was the basic metaphor with which the cycle of civilization began, the step from the dark into the light of the mind. The hand that filled a paper built a city.” – Counterblast, 1954

    Talk. Storytelling. Word of mouth. We have been writing en masse for only a short span of time compared with how long we have been speaking. The technologies and strictures of the alphabet, the word, and written language are huge obstacles (technologies themselves) to be overcome in childhood, if we are lucky. Clearly they are worthwhile obstacles in terms of communication, but for the folk art of storytelling, the idea that you need to sit quietly tapping at a selection of buttons for a month, like some lab rat, is the very opposite of the spirit of the storyteller. Having seen some of your talks, I can tell that you are fortunate enough to be gifted as a storyteller both in person, by mouth, and through your writing. But we all understand, surely, that the very best storytellers, gifted spinners of yarns, are very often simply unable, whether through lack of time or of inclination, to write their tales down. Spoken-word stories, as we know, were the only means by which most cultures’ mythologies and allegories (their spirits) were preserved through generations. The keyboard is a blip in the narrative of storytelling. There will soon be a time when a full text of all the conversations we have throughout our whole lives, should we so choose, will be easily available as text: searchable, printable, ready for distribution, recall, etc. Who would not like to pick out an ancestor, a favourite celebrated figure, or a folk hero of any description and pick through their day-to-day life, finding those gems where they hold forth with some great tale, were it possible? Many other applications spring to mind, but we are talking stories here.

    Please take this in the spirit in which it was intended. You write and deliver some fantastic and inspirational ideas, but language and stories are central to so much that will save us, and so dismissing the potential (inevitable?) emancipation of stories from the professional practice of writing seems cruel.

  18. Dragon VRS learns from you and loads a different profile for each user. It’s pretty good right out of the box, but if you’re patient over a few hours and put some work into going back and correcting its mistakes, the software learns your quirks and preferences and vocal inflections. I use it for translation (which obviously has to be very precise) and it flies. My issue with VRS for fiction would be that it doesn’t give me a chance to pause and arrange my thoughts, but that’s a matter of process rather than tech.

  19. Why not just leave out the machine step? Dictate your writing, hire a typist, then hunt-and-peck your way through edits and corrections (or dictate them while looking over someone’s shoulder).

    Not endorsing or condemning the work he produces (I’ve only read a few of his books), but  I believe Kevin J. Anderson does all of his writing onto a digital recorder while hiking around the Rockies.

  20. Blind and beloved humorist James Thurber wrote using the oldest speech recognition program of all time: a stenographer.

  21. Advice from someone who uses S2T all the time: “Diction is done with the tip of the tongue on the teeth.”

  22. Pratchett did an interview on NPR a while back where he discussed, among other things, having to use speech-to-text software to write because he can no longer keep his focus to type.  His major complaint was that the software was American and didn’t recognize British English: “You don’t have words like ‘arsehole’.”

    The interviewer laughed and said “We have that one, we just pronounce it differently.”

  23. If it’s so important that you can’t be bothered to type it yourself, or you physically can’t type, then you could record audio and have it transcribed, or just hire a typist.
