I like targeting atypical I/O mediums to get my creative juices flowing, improve accessibility, & build something more distinctive with clear benefits. For this I assemble my own platforms, so today my thanks go out to the projects I assembled for voice I/O!
eSpeak NG does sound a bit robotic, but I appreciate the hard work behind its extensive internationalization, its clarity at fast speeds, & all the knobs which allow me to take full advantage of the auditory medium!
On the input side my gratitude goes out to CMU Sphinx, Kaldi, Mozilla DeepSpeech, & Julius! As well as to Voice2JSON/Rhasspy for providing a high-level wrapper around your choice of these!
For privacy's sake I believe it is vital to run these voice-transcription AIs (which the named projects implement in a myriad of ways) locally. That isn't the typical approach, but it is how Voice2JSON works.
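For anyone curious what that high-level wrapper looks like: Voice2JSON is configured with a sentences.ini file of intent templates (the same format Rhasspy uses). A rough sketch — the intent & entity names here are made up for illustration:

```ini
[ChangeLightState]
light_name = (living room lamp | kitchen light){name}
turn (on | off){state} [the] <light_name>
```

As I understand it, after editing templates like these you retrain with `voice2json train-profile`, then pipe audio through `voice2json transcribe-wav` & `voice2json recognize-intent` to get structured JSON out.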
Please package Voice2JSON; it could be a great component for a free desktop!
P.S. Speech-to-text engines aren't nearly as well internationalized as eSpeak NG's text-to-speech, so that's one of many reasons why I need to allow for keyboard input. Internationalizing speech-to-text is a huge challenge since, put simply, it requires retraining the whole system!
P.P.S. Feel free to research how to generate more natural-sounding voices without sacrificing eSpeak's benefits. I'd love to see what you come up with!
* Flite or festival-lite, a small fast portable speech synthesis system.
* Bard Storyteller: an ebook reader that uses Flite.
Both are based on the CMU Wilderness Multilingual Speech Dataset, a speech dataset of aligned sentences and audio for ~700 different languages; however, the dataset is unlicensed and served from a non-HTTPS site: http://festvox.org/
It provides a demo page for the speech synthesis system and a #Talkback plugin. The last build has 13 English voices, plus 4 Indic ones: Hindi, Tamil, Telugu and Marathi.
This list could be extended to the full dataset, but it needs some love; currently there are no students on the project.
@alcinnz @devinprater, BTW, I came across this while trying to see how we could improve the accessibility of CLIs, but at the same time also address the lack of it during startup, in both BIOSes and bootloaders.
And we need a solid foundation for accessibility, similar to the one video has in the framebuffer: screen readers, braille, etc. For it, we could build a hardware co-processor that would accompany or replace the GPU; no need to waste CPU cycles on GUIs that are not used.
Certainly freeing up cycles from graphical UIs would more than make up for the cycles used by text-to-speech.
Not that I haven't toyed with the concept of what this might look like...
I've got a more or less complete picture; I just need to publish some posts and see what's feasible and what's plain stupid. I think it is doable, and it would be a huge boon for accessibility, mainly because the security people might also be interested: separation of concerns and a reduced attack surface.
@alcinnz, I've been researching this for some time now. I even looked at the reasoning and history behind Linux code changes to pipes, and why we only have 3 standard file descriptors: stdin, stdout and stderr. The goal is adding extra standard I/O FDs while keeping backward compatibility with POSIX pipes and not affecting performance; I plan to respond to a related security thread about this. I've already got about 2000 words, but I need to trim it down for Mastodon, or maybe… a blog? hint:ELF
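To make the extra-FD idea concrete, here's a minimal Python sketch (stdlib only; the channel contents are just illustrative) of handing a child process one descriptor beyond the standard three, so "structured" output can travel on its own channel next to stdout:

```python
import os
import subprocess
import sys

# Alongside stdin/stdout/stderr (fds 0, 1, 2), give the child one
# extra pipe as a separate output channel.
read_end, write_end = os.pipe()

# pass_fds keeps the descriptor number identical in the child,
# so we can tell the child which fd to write to.
child = subprocess.Popen(
    [sys.executable, "-c",
     f"import os; os.write({write_end}, b'metadata channel')"],
    pass_fds=(write_end,),
)
child.wait()
os.close(write_end)  # close our copy so reads can hit EOF

msg = os.read(read_end, 64).decode()
print(msg)  # → metadata channel
```

The same trick underlies things like systemd's socket activation: the parent decides which fds exist, the child just has to know their numbers.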
@alcinnz, I think that I even found a decent binary format for an initial test. I now need to find a decent replacement for printf(): a very light templating system that works well with buffered streams and doesn't reduce performance for the users who don't benefit much from this separation of concerns. Also, binary encoding and fancy pipes mean that we can ditch libvte and "-czf". Then we can have fun building interactive sci-fi GUIs, without POSIX code pages.
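I don't know which binary format walter has in mind, but as a toy illustration of why framed binary records beat printf-style text on a pipe, here's a minimal length-prefixed codec (pure Python; all names hypothetical):

```python
import struct

# Hypothetical framing: each record is a 4-byte big-endian length
# followed by a UTF-8 payload, so the receiver never has to guess
# where one message ends and the next begins.
def pack_record(payload: str) -> bytes:
    data = payload.encode("utf-8")
    return struct.pack(">I", len(data)) + data

def unpack_records(stream: bytes) -> list:
    out, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        out.append(stream[offset:offset + length].decode("utf-8"))
        offset += length
    return out

framed = pack_record("status: ok") + pack_record("files: 3")
print(unpack_records(framed))  # → ['status: ok', 'files: 3']
```

With framing like this, a consumer can parse reliably without scraping terminal escape sequences — which is what makes ditching libvte thinkable.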
"Fancy interactive Sci-Fi GUIs for improved security and accessibility"
My previous working title put security and accessibility next to something about the economics that led to the staggered keyboard, and having to leave that legacy behind.
@walter @devinprater What I can't stop thinking while reading this thread is that this is precisely what I like about eSpeak NG! It's less of a chorus & more of a (relatively) talented narrator. And this very much describes what I'm exploring with Rhapsode, though that operates upon HTML markup rather than punctuation.
It's why I'm so reluctant to give up eSpeak's strengths, or, as with Gemini, inline markup!
Still not clear it needs a coprocessor...