BlindConfidential: My Theory on Simulated Motion and Audio User Interfaces (Geeky)

The other day, I included an off-hand comment about Apple listening to my ideas on 3D audio as a way to improve efficiency in screen readers by adding dimensional information to the steady stream of text tokens that the uni-dimensional screen readers provide today.  A couple of people asked me why I chose to take credit for something Apple had decided to include in its new screen reader.  My answer is fairly simple: as VP/Software Engineering at Freedom Scientific, I attended a ton of industry meetings about the future of access technology and, quite vociferously, argued, pounded tables, shouted, begged and cajoled.  In offline hallway conversations and in private sessions, I also discussed the ideas behind “rich contextual information” being a requirement for audio interfaces of the future.  Mary Beth Jaynes and Travis Roth of Apple were in attendance at many of these meetings and, if I remember correctly, I explained to them, as well as to people from Sun, Microsoft and lots of other companies, why such techniques were important to the future of computer users with vision impairment.

In addition to speaking on the record about the need for rich contextual information and 3D audio in meetings of ATIA technical groups, Accessibility Forum, various RERC presentations and elsewhere, I have published a number of articles and a chapter in a book on this topic.  To my knowledge, which is pretty well informed as I have performed a ton of literature searches on this subject, I pretty well stand alone in describing the potential efficiency gains available through multi-dimensional, context rich audio information from a future generation of screen readers.  If the rumors about Apple adding 3D audio to VoiceOver are true, I will be ecstatic at actually seeing some of the ideas I have professed for many years now making it to the point of a shipping screen reader.  I hope JAWS, Window-Eyes, System Access, Narrator and others follow soon.

My ideas have grown over time.  I last addressed a non-academic audience about these concepts in February 2006, when I published an article in Access World on what screen readers can learn from audio games, and my last article was published in April 2006 at CHI 2006 in the Next Generation workshop.  So, I thought I would write an egghead piece for BC readers to update you folks on where my thoughts have led in the past six months or so.

A few years ago, when I first started my campaign for greater contextual richness in screen reader information, I talked mostly about presenting things on a flat audio plane, something akin to the display sighted people view but described in audio.  I suggested ideas like taking a Visio diagram and, by using 3D audio techniques like those in audio and video games, providing the user with positional information: simulated spatial data that gives a sense of up, down, left and right to the information displayed on the screen.  I used a lot of examples involving Visio, Visual Studio and Excel as they pose problems for screen reader users in that their layout, designed for efficiency by a sighted user, does not conform well to a long stream of syllables and pauses as provided by a speech synthesizer based screen reader.
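For the geeks who want something concrete, here is a minimal sketch of that flat audio plane idea in Python.  It is not code from any shipping screen reader.  The function names, the 45 and 30 degree ranges and the constant-power panning are all my assumptions for illustration; a real implementation would hand the angles to a 3D audio engine rather than compute stereo gains by hand.

```python
import math

def screen_to_audio_position(x, y, width, height,
                             max_azimuth=45.0, max_elevation=30.0):
    """Map a screen pixel to (azimuth, elevation) in degrees.

    The angle ranges are illustrative assumptions, not measured values.
    """
    nx = (2.0 * x / width) - 1.0    # -1 at the left edge, +1 at the right edge
    ny = 1.0 - (2.0 * y / height)   # screen y grows downward; audio "up" is positive
    return nx * max_azimuth, ny * max_elevation

def stereo_gains(azimuth_deg, max_azimuth=45.0):
    """Constant-power pan: turn an azimuth into (left, right) gains."""
    pan = (azimuth_deg / max_azimuth + 1.0) / 2.0   # 0 = hard left, 1 = hard right
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

# A cell near the top-left corner of an 800x600 window
azimuth, elevation = screen_to_audio_position(100, 50, 800, 600)
left, right = stereo_gains(azimuth)
```

With a mapping like this, an object near the top-left of the window is heard up and to the left, and the listener gets the "where" along with the "what."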

I remain confident that these techniques, when added to screen readers, will create the true next generation of usability for computer users with vision impairment.  If one reads up on the GNOME accessibility API, discussed in various articles in BC but most extensively in articles on the Sun Microsystems web site, one can see how an API can expose contextual information.  I haven’t played much with either Orca or the IBM GNOME screen reader in Open Office or elsewhere so I can’t speak to how well this API works in a real world situation or if the accessibility hackers actually chose to provide said information to their users.

UIA, as I understand it, also provides a lot of ways a screen reader can provide useful contextual information and it now seems as though Apple will be using aspects of its own API to provide greater context and, as a result, greater efficiency to its users.

I describe first generation screen readers as those that work in a text based environment like DOS or a console based GNU/Linux environment (whether they actually take data from a screen buffer or hack the kernel to gather the information is inconsequential).  Second generation screen readers, like virtually everything available for Windows, Macintosh or GNOME today, provide information from graphical interfaces but use a single stream of speech information, a seemingly endless stream of syllables and pauses, to provide information to users.  Now, if rumors are correct, Apple will become the first to break the barrier into G3 by adding additional dimensions.  I will state that JAWS, under my leadership, started in this direction by becoming the first screen reader to add a lot of supplemental information and to use sounds in a semantically interesting way through its speech and sounds manager.  If the people behind the Windows screen readers decide to do some cool things with DirectX, they can follow Apple and likely jump back into the innovation leadership position by extracting information from MS application object models and/or UIA in the coming year.

Thus, I continue applauding innovation (YAY Apple!) and encouraging it across the industry but what have I been thinking up lately?  When I spoke, argued and published about adding dimensional and contextual richness to screen readers, those ideas were brand new to me and to the biz.  Now, through my advocacy and that of Will Pearson and others, we’re starting to see the concepts show up in product.  Have I been sitting still?  Intellectually, no, I have taken my theoretical work a few steps further.  Thus, what follows is the result of my current research into multi-dimensional audio information.  This research has resulted in the creation of a few demonstration bits of software that describe three dimensional geometric primitives in audio.  The few blind people who have heard these sounds have been able to identify the shapes pretty quickly but the research has been informal so I cannot publish anything approaching empirical results yet.  I will, however, describe how the theory works and some of the supporting literature.

I read a lot about semantic transformations and the psychology of attention in the past year.  Also, for a chapter in Professor Helal’s book on Access Technology, I studied audio games fairly extensively.  Combining the work that David Greenwood did in Shades of Doom and his other really cool games with the linguistic and psychological theory I read about led me to one of those “eureka” moments.

When studying the psychology of attention, I learned that human eyes move constantly.  They move so constantly and so quickly, in fact, that people who can see don’t notice the motion.  This constant movement in the human eye continually refreshes short term memory, one of the most fickle components of the human system.  Like high speed RAM in computers, short term memory in humans needs to be powered and can be erased very quickly.

Humans have two kinds of attention: focal and peripheral.  Focal attention is that to which we pay attention.  If, for example, one is standing on a busy street corner, one is being pounded by tens of thousands of bits of information through virtually all of the senses.  We, as humans, can, however, carry on a conversation without being distracted by the constant sensory bombardment that surrounds us.  This is due to our focal attention, our area of focus.

Meanwhile, while standing on that same corner talking to the same friend, we might suddenly raise a hand to protect our face from a flying object.  We may have already batted away the flying object before we even discover what it is or even realize that it had been flying toward us.  This is peripheral attention at work.  It keeps us safe from things on which we are not focusing and it also keeps us aware of the context in which we are immersed.  Without peripheral attention, we may forget we are standing on a street corner and fall entirely into the information in focus and, hence, constantly be lost.

Thus, motion keeps short term memory, where attention resides, refreshed and focal and peripheral attention let us inspect a single point of interest while remaining aware of our context.  But, how does this apply to an audio user interface?

The first problem we have with audio is that, unlike eyes, ears do not constantly move and, therefore, audio information is refreshed less often than visual.  Thus, short term memory forgets information taken in through the ear faster than it does through the eye.  Audio information certainly informs our peripheral attention as everyone who can hear can describe distractions that come from misplaced sounds.  Nonetheless, our short term memory loses track of audio information far more quickly than visual information.

How then can we improve understanding of semantic information through non-visual stimuli?

Obviously, we cannot move the inner ear as it wasn’t designed to perform such a function.  How then can we provide focal and peripheral information to let a listener inspect an item without losing context?

As I contemplated this problem, I started playing Shades of Doom.  I started asking myself: how can Greenwood provide 32 simultaneous audio streams, and let me understand which is the monster, which is the good guy, which is the wind and which are the echoes of my footsteps against the wall, while a screen reader can only provide a single syllable at a time?  The answer came as I learned more about attention and how the eye works.  The “eureka” moment came when it occurred to me that Greenwood uses virtually no “static” sounds but, rather, virtually everything sounds like it is moving.  By simulating motion, Greenwood can deliver profoundly more semantic information than can JAWS, whose speech synthesizer sits still in audio space.

The answer to the three dimensional audio object question is to simulate motion in the source and it will have the same effect as the constant motion of the eye.  Thus, if an audio sphere can sound as if it contains motion, our ears will cause our peripheral attention to refresh and we can inspect a piece of the sphere while remembering that it is indeed a sphere we are inspecting.

In practical terms, imagine a wireframe sphere.  Imagine that the vertical circles each have a dot on them that spins around the circles in swooshing loops of sound.  Now, imagine the same kind of point on the horizontal lines making a swishing sound as it spins around the circle.  Finally, imagine an audio “twinkle” that represents the intersections of the lines.  As everything in our sphere is swirling, swooshing or twinkling, it is all moving and, typically, in a very short amount of time (seconds) the blind people who have listened to my audio wireframe image have identified it as a sphere.  Also, within seconds they identify cubes, pyramids, cones, cylinders and other three dimensional primitives.
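For readers who want to play along, the geometry of that audio wireframe can be sketched in a few lines of Python.  This is not my demonstration software; it is a minimal illustration that only computes where each moving dot and each "twinkle" sits in space at a given moment.  The function names, the two second loop period and the unit radius are assumptions, and actually rendering the positions as sound is left to whatever 3D audio engine you prefer.

```python
import math

def meridian_dot(longitude_deg, t, period=2.0, radius=1.0):
    """Position at time t of the dot looping around one vertical circle."""
    phase = 2.0 * math.pi * (t / period)
    lon = math.radians(longitude_deg)
    horizontal = radius * math.cos(phase)   # distance out from the vertical axis
    return (horizontal * math.cos(lon),
            radius * math.sin(phase),       # y is "up"
            horizontal * math.sin(lon))

def parallel_dot(latitude_deg, t, period=2.0, radius=1.0):
    """Position at time t of the dot looping around one horizontal circle."""
    phase = 2.0 * math.pi * (t / period)
    lat = math.radians(latitude_deg)
    ring = radius * math.cos(lat)           # radius of this horizontal circle
    return (ring * math.cos(phase),
            radius * math.sin(lat),
            ring * math.sin(phase))

def twinkle_points(longitudes_deg, latitudes_deg, radius=1.0):
    """Fixed points where the vertical and horizontal circles cross."""
    points = []
    for lon_deg in longitudes_deg:
        for lat_deg in latitudes_deg:
            lon, lat = math.radians(lon_deg), math.radians(lat_deg)
            points.append((radius * math.cos(lat) * math.cos(lon),
                           radius * math.sin(lat),
                           radius * math.cos(lat) * math.sin(lon)))
    return points

# One animation frame: where every sound source sits at t = 0.5 seconds
frame = ([meridian_dot(lon, 0.5) for lon in (0, 45, 90, 135)]
         + [parallel_dot(lat, 0.5) for lat in (-45, 0, 45)])
```

Feed each position to a 3D audio source on every frame and the swooshes travel around the circles while the twinkles stay put at the crossings.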

Isn’t this a lot of sound to represent a fairly primitive object?

Yes.  In order to cut down on the auditory overload, I started cutting down on how many lines and points would sound at once and discovered that with a fairly small portion of the object sounding at random intervals comprehension remained fairly constant but “noise” dropped significantly.
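The thinning out I describe can be sketched like this in Python.  Again, this is an illustration and not my experimental code: the function names are made up and the 25 percent fraction is an assumption, since I have not published the proportions I actually used.

```python
import random

def pick_sounding_elements(elements, fraction=0.25, rng=None):
    """Choose a random subset of wireframe elements to sound right now."""
    if not elements:
        return []
    rng = rng or random.Random()
    count = max(1, round(len(elements) * fraction))
    return rng.sample(elements, count)

def sounding_schedule(elements, steps, fraction=0.25, seed=None):
    """Build one random subset per time step."""
    rng = random.Random(seed)
    return [pick_sounding_elements(elements, fraction, rng)
            for _ in range(steps)]

# Twelve lines in the wireframe, only about a quarter audible at any moment
lines = [f"line-{i}" for i in range(12)]
schedule = sounding_schedule(lines, steps=8, seed=1)
```

Because the subset changes at every step, the whole object gets covered over time even though only a few elements sound at once.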

My next step was to provide a manner in which an individual could inspect a single polygon made by the wireframe without losing context.  For this, I added words to certain polygons on the sphere, terms like “Florida,” “Georgia,” etc.  The user could zoom in from “Earth,” which named the entire sphere, into a single polygon, named Florida for instance, and, as the volume reduced for the rest of the sphere, they could inspect details like “Tampa” and “St. Petersburg.”  This part of the experiment is in its very early stages and I expect we’ll find ways to tweak it to improve on efficiency without losing context.
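The zoom idea boils down to giving the polygon in focus full volume while the rest of the sphere drops back as quiet context.  Here is a minimal Python sketch under some loud assumptions: the nested dictionary layout, the function name and the 0.3 context gain are all hypothetical, and the sketch assumes every label in the tree is unique.

```python
def label_gains(tree, focus, context_gain=0.3):
    """Return a volume for every label: 1.0 inside the focus, quieter outside."""
    gains = {}

    def walk(name, children, in_focus):
        in_focus = in_focus or name == focus
        gains[name] = 1.0 if in_focus else context_gain
        for child_name, grandchildren in children.items():
            walk(child_name, grandchildren, in_focus)

    for name, children in tree.items():
        walk(name, children, False)
    return gains

# Hypothetical labels matching the example in the text
earth = {"Earth": {"Florida": {"Tampa": {}, "St. Petersburg": {}},
                   "Georgia": {}}}
gains = label_gains(earth, focus="Florida")
```

Zooming in on Florida leaves Tampa and St. Petersburg at full volume while Earth and Georgia stay audible, but quietly, as a reminder of where you are.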

So, in short, I have discovered that simulating motion can provide an incredible amount of semantically rich contextual information without overloading the user with noise.  Actually, all I did was provide a theoretical framework for the work David Greenwood and the other game hackers understood by instinct.  Nonetheless, the wireframe experiments and the theoretical framework are my own work, which has been greatly informed by the game dudes.

I hope to write an entire book on human understanding of complex semantic information through non-visual stimuli in the near future.  Unlike other theorists, though, I don’t like hoarding my ideas until they are perfected or disproven.  I want to hear other people playing with these concepts so, therefore, I write a sketch in my blog in the hope that other people will, like I did, download Visual Studio Express and start playing around with DirectX to see what they can do with three dimensional audio.  My work has shown that blind people can understand swirling, swooshing and twinkling audio images of geometric primitives (a subject on which I will publish formal results in some geeky academic rag in the future).  I would love to see if someone could take my wireframe concept and apply it to an irregular object.  What if one took a laser range finding scanner and created a detailed wireframe model of Michelangelo’s David?  Could my swooshes, swirls and twinkles give a listener an idea of the statue’s form and beauty?  Using this technique, could the polygons that form the wireframe be filled with finer wireframes, and so on deeper, until a user could get down to a square millimeter while being reminded of the context at a lower volume but at a much larger scale?  What if someone adds a pair of Falcon haptic game controllers, with their .5 mm resolution, so one can feel the wireframe in detail while being reminded of the context in audio?

So, people, I will continue my experiments, articles and advocacy for innovation in the audio user interface paradigm.  I ask that you do the same as I am just one guy and will definitely not have all of the ideas.  In fact, I hope that I don’t have most of the ideas as I’m not really that smart and there are lots of other people who, if given the time and a little money, can come up with lots more really cool ideas and the world of audio and non-visual interfaces can explode.  Ask your screen reader vendor what they plan to invest in innovation and suggest cool ideas so, like Apple, they might risk a few development dollars on greatly advancing your productivity tools.

--End

3 Comments:

Anonymous said...: You know, I'm glad you've done as much work as you have on this. Though, I know for a fact that you are not the only person who has been talking to people at Microsoft and/or Apple about such ideas. Because I have done so myself, quite independently (and mostly unaware) of anything you've said.

(I was an employee at one and was a serious candidate to work on the screenreader offering of the other, and have repeatedly brought up these ideas in various contexts.); 1:48 PM

Anonymous said...: Chris, your Michelangelo analogy got me thinking. If we could get this kind of three-dimensional audio imaging going, couldn't we attach this to some kind of scanning software that might allow me to get an idea of what my wife and daughters whom I've never seen actually look like?

If this is actually possible, there would be no limit to what field a blind person could choose. Anything from architecture to graphics design and stuff I can't even imagine would be possible for blind people.; 4:44 PM

Anonymous said...: Hi Chris,

Can you put some of your audio samples online? I'd like to compare it to The vOICe approach for conveying arbitrary visual views in sound.

Best regards,

Peter Meijer

Seeing with Sound - The vOICe; 3:28 AM


Sunday, October 29, 2006
