I like targeting atypical I/O mediums to get my creative juices flowing, improve accessibility, & build something more distinctive whose benefits are clear. For this I assemble my own platforms, so today my thanks go out to the projects I assembled for voice I/O!

eSpeak NG does sound a bit robotic, but I appreciate their hard work for their extensive internationalization, clarity at fast speeds, & all their knobs which allow me to take full advantage of the auditory medium!
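To make those knobs concrete, here is a minimal sketch of driving the `espeak-ng` command line from Python. The helper names `espeak_command` and `speak` are hypothetical, invented for this illustration; the flags are the standard `espeak-ng` options for voice (`-v`), speed in words per minute (`-s`), pitch (`-p`), amplitude (`-a`) and word gap (`-g`). It assumes `espeak-ng` is installed and on the `PATH`, and quietly does nothing otherwise.

```python
import shutil
import subprocess


def espeak_command(text, voice="en", speed_wpm=175, pitch=50,
                   amplitude=100, word_gap=0):
    """Build the argument list for one espeak-ng utterance.

    Hypothetical helper for illustration; only the flags are espeak-ng's own.
    """
    return [
        "espeak-ng",
        "-v", voice,            # voice / language, e.g. "en", "fr", "hi"
        "-s", str(speed_wpm),   # speaking rate in words per minute
        "-p", str(pitch),       # pitch, 0-99
        "-a", str(amplitude),   # volume, 0-200
        "-g", str(word_gap),    # extra pause between words, in 10 ms units
        text,
    ]


def speak(text, **knobs):
    """Speak the text if espeak-ng is available; return whether it ran."""
    if shutil.which("espeak-ng") is None:
        return False
    subprocess.run(espeak_command(text, **knobs), check=True)
    return True
```

For example, `speak("Bonjour tout le monde", voice="fr", speed_wpm=300)` would read a French phrase at roughly twice the default rate.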


On the input side my gratitude goes out to CMU Sphinx, Kaldi, Mozilla DeepSpeech, & Julius! As well as to Voice2JSON/Rhasspy for providing a high-level wrapper around your choice of these!

I believe that for privacy's sake it is vital to run these voice transcription AIs (which the named projects implement in a myriad of ways) locally. That isn't the typical approach, but it is how Voice2JSON works.

Please package Voice2JSON, it could be a great component for a freedesktop!

P.S. Speech-to-text engines aren't nearly as well internationalized as eSpeak NG's text-to-speech, so that's one of many reasons why I need to allow for keyboard input. Internationalizing speech-to-text is a huge challenge since, put simply, it requires retraining the whole system!

P.P.S. Feel free to research how to generate more natural sounding voices without sacrificing eSpeak's benefits. I'd love to see what you come up with!


@alcinnz, cc: @devinprater, for an alternative, check CMU's . They've got several OSS projects, all with better sounding voices:

* Flite or festival-lite, a small fast portable speech synthesis system.
* Bard Storyteller: an ebook reader that uses Flite.

Both based on the CMU Wilderness Multilingual Speech Dataset, a speech dataset of aligned sentences and audio for ~700 different languages, but it's not licensed and served from a non-HTTPS site: festvox.org/


@alcinnz @devinprater, the speech synthesis system is also available as an app in .

It provides a demo page for the speech synthesis system and a plugin. The last build has 13 English voices, plus 4 Indic ones: Hindi, Tamil, Telugu and Marathi.

This list could be extended to the full dataset, but it needs some love: currently there are no students on the project.

Available on a non-HTTPS site: cmuflite.org or github.com/festvox/flite


@walter @devinprater Does Flite provide the internationalization & tone-of-voice controls I love about eSpeak NG?

@alcinnz @walter Mmm, kinda voice controls, not really internationalization.

@alcinnz, I'm not sure how they've "created" the voices, but of the 13 English voices they provide, some are "pure" English voices and others have a German, Canadian or Indian accent, and then they provide those Indian (Indic) voices. So… probably not… maybe?

I don't know how they are created, but for a demo, you can install the FDroid build: f-droid.org/en/packages/edu.cm

cc: @devinprater


@alcinnz @devinprater, BTW, I came across this while trying to see how we could improve the accessibility of CLIs, but at the same time also address the lack of it during startup, in both BIOSes and bootloaders.

And we need a solid foundation, similar to the one for video, the frame-buffer, but one for accessibility: screen readers, braille, etc. And for it, we could build a hardware co-processor that would accompany or replace the GPU, no need to waste CPU cycles for GUIs that are not used.

@walter @devinprater The other consideration is "Does text-to-speech actually need a co-processor? Is it expensive enough for that?"

Certainly freeing up cycles from graphical UIs would more than make up for the cycles used by text-to-speech.

Not that I haven't toyed with the concept of what this might look like...

@alcinnz @devinprater,
Yes, it sounds like a lot, but the hardware is optional, and we could use existing screen readers, which will receive structured data, not formatted text like now.

I've got a more or less complete picture, I just need to publish some posts and see what's feasible and what's plain stupid. I think that it is doable and this would be a huge boon for accessibility, mainly because the security people might also be interested: separation of concerns and reduced attack surface.

@walter @devinprater Well, I for one am curious to see what you build!

Could be a nice complement to my "Rhapsode" auditory browser...

@alcinnz, I've been researching this for some time now. I even looked at the reasoning and history of Linux code changes for pipes and why we only have 3 standard file descriptors: stdin, stdout and stderr. The idea is to add extra standard I/O FDs while keeping backward compatibility with POSIX pipes and not affecting performance; I plan to cover this in a response to a related security thread. I've already got about 2000 words, but I need to trim it down for Mastodon, or maybe… a blog? hint:ELF
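For readers unfamiliar with the idea of an extra I/O descriptor alongside stdin, stdout and stderr, here is a minimal sketch of how a parent process can already hand a child one additional pipe on POSIX systems. It is not the design being discussed in this thread: the function name `run_with_extra_fd`, the child script, and the JSON record are all hypothetical, and JSON merely stands in for whatever binary format ends up being chosen.

```python
import os
import subprocess
import sys

# Hypothetical child program: prints human-oriented text on stdout and
# writes a structured record to an extra descriptor named on its command line.
CHILD = r"""
import os, sys
extra_fd = int(sys.argv[1])
print("formatted text for humans")
os.write(extra_fd, b'{"role": "status", "text": "ready"}\n')
"""


def run_with_extra_fd():
    """Run the child with one extra pipe and return (stdout, side channel)."""
    read_end, write_end = os.pipe()
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, str(write_end)],
            pass_fds=(write_end,),   # keep this descriptor open in the child
            stdout=subprocess.PIPE,
            check=True,
        )
    finally:
        # Close the parent's copy so reading the pipe reaches end-of-file.
        os.close(write_end)
    with os.fdopen(read_end, "rb") as side_channel:
        structured = side_channel.read()
    return proc.stdout, structured
```

A program that ignores the extra descriptor still behaves as an ordinary stdout-only tool, which is the backward-compatibility property the post is concerned with.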


@alcinnz, I think that I even found a decent binary format for an initial test. I now need to find a decent replacement for printf(), a very light templating system that works well with buffered streams and doesn't reduce performance for the users that don't benefit much from this separation of concerns. Also, binary encoding and fancy pipes mean that we can ditch libvte and "-czf". Then we can have fun building interactive sci-fi GUIs, without POSIX code pages.


@alcinnz @devinprater, thanks, now I've got more naming ideas for the first blog post, or maybe the series:

"Fancy interactive Sci-Fi GUIs for improved security and accessibility"

My previous working title contained security and accessibility next to something about the economics that led to the staggered keyboard and having to leave that legacy behind.

@alcinnz @devinprater, and this thread explains one of the reasons I would like to have a hardware accessibility co-processor that replaces the GPU. But read it in full, because the topic kind of changes in the end.


@walter @devinprater What I can't stop thinking while reading this thread is that this is precisely what I like about eSpeak NG! It's less of a chorus & more of a (relatively) talented narrator. But this very much describes what I'm exploring with Rhapsode, though that operates upon HTML markup rather than punctuation.

That's why I'm so reluctant to give up eSpeak's strengths, or, as with Gemini, inline markup!

Still not clear it needs a coprocessor...
