Archive-name: comp-speech-faq
Last-modified: 1993/11/11
comp.speech
Frequently Asked Questions
==========================
This document is an attempt to answer commonly asked questions and to
reduce the bandwidth taken up by these posts and their associated replies.
If you have a question, please check this file before you post.
The FAQ is not meant to discuss any topic exhaustively. It will hopefully
provide readers with pointers on where to find useful information. It also
tries to list useful material available elsewhere on the net.
This FAQ is posted monthly to comp.speech, comp.answers and news.answers.
It is also available for anonymous ftp from the comp.speech archive site
svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ
It is also available from the news.answers ftp site (and its mirrors) as
rtfm.mit.edu:/pub/usenet/news.answers/comp-speech-faq
If you have not already read the Usenet introductory material posted to
"news.announce.newusers", please do. For help with FTP (file transfer
protocol) look for a regular posting of "Anonymous FTP List - FAQ" in
comp.misc, comp.archives.admin and news.answers amongst others.
Admin
-----
There are several new product entries in this release plus updates
on quite a few entries.
I have introduced Question 1.6 on the use of speech technology as aids
for the handicapped. The first information is on a speech therapy aid.
Can people with experience in this area provide details of aids for
the blind, deaf, speech impaired, RSI, physically impaired and others?
If there is sufficient information it can form its own section.
My email address has changed to andrewh@speech.su.oz.au.
The old one will work for some time still.
Cheers,
Andrew Hunt
Speech Technology Research Group email: andrewh@speech.su.oz.au
Department of Electrical Engineering Ph: 61-2-692 4509
University of Sydney, NSW, Australia. Fax: 61-2-692 3847
========================== Acknowledgements ===========================
Thanks to the following for their significant comments and contributions.
Barry Arons <barons@media-lab.mit.edu>
Joe Campbell <jpcampb@afterlife.ncsc.mil>
Oliver Jakobs <jakobs@ldv01.Uni-Trier.de>
Sonja Kowalewski <kowa@uniko.uni-koblenz.de>
Tony Robinson <ajr@eng.cam.ac.uk>
Mike <mike%jim.uucp@wupost.wustl.edu>
Many others have provided useful information. Thanks to all.
============================ Contents =================================
PART 1 - General
Q1.1: What is comp.speech?
Q1.2: Where are the comp.speech archives?
Q1.3: Common abbreviations and jargon.
Q1.4: What are related newsgroups and mailing lists?
Q1.5: What are related journals and conferences?
Q1.6: What resources are available as handicap aids?
Q1.7: What speech data is available?
Q1.8: Speech File Formats, Conversion and Playing.
Q1.9: What "Speech Laboratory Environments" are available?
PART 2 - Signal Processing for Speech
Q2.1: What speech sampling and signal processing hardware can I use?
Q2.2: What signal processing techniques are for speech technology?
Q2.3: How do I find the pitch of a speech signal?
Q2.4: How do I find the start and end points of a speech signal?
Q2.5: Where can I find FFT software?
Q2.6: How do I convert to/from mu-law format?
PART 3 - Speech Coding and Compression
Q3.1: Speech compression techniques.
Q3.2: What are some good references/books on coding/compression?
Q3.3: What software is available?
PART 4 - Speech Synthesis
Q4.1: What is speech synthesis?
Q4.2: How can speech synthesis be performed?
Q4.3: What are some good references/books on synthesis?
Q4.4: What software/hardware is available?
PART 5 - Speech Recognition
Q5.1: What is speech recognition?
Q5.2: How can I build a very simple speech recogniser?
Q5.2: What does speaker dependent/adaptive/independent mean?
Q5.3: What does small/medium/large/very-large vocabulary mean?
Q5.4: What does continuous speech or isolated-word mean?
Q5.5: How is speech recognition done?
Q5.6: What are some good references/books on recognition?
Q5.7: What speech recognition packages are available?
PART 6 - Natural Language Processing
Q6.1: What are some good references/books on NLP?
Q6.2: What NLP software is available?
=======================================================================
PART 1 - General
Q1.1: What is comp.speech?
comp.speech is a newsgroup for discussion of speech technology and
speech science. It covers a wide range of issues from application of
speech technology, to research, to products and lots more. By nature
speech technology is an inter-disciplinary field and the newsgroup reflects
this. However, computer application is the basic theme of the group.
The following is a list of topics but does not cover all matters related
to the field - no order of importance is implied.
[1] Speech Recognition - discussion of methodologies, training, techniques,
results and applications. This should cover the application of techniques
including HMMs, neural-nets and so on to the field.
[2] Speech Synthesis - discussion concerning theoretical and practical
issues associated with the design of speech synthesis systems.
[3] Speech Coding and Compression - both research and application matters.
[4] Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues
which are relevant to speech technology applications. Could cover parsing,
natural language processing, phonology and prosodic work.
[5] Speech System Design - issues relating to the application of speech
technology to real-world problems. Includes the design of user interfaces,
the building of real-time systems and so on.
[6] Other matters - relevant conferences, books, public domain software,
hardware and related products.
------------------------------------------------------------------------
Q1.2: Where are the comp.speech archives?
comp.speech is being archived for anonymous ftp.
ftp site: svr-ftp.eng.cam.ac.uk (or 129.169.24.20).
directory: comp.speech/archive
comp.speech/archive contains the articles as they arrive. Batches of 100
articles are grouped into a shar file, along with an associated file of
Subject lines.
Other useful information is also available in comp.speech/info.
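Many locations in this FAQ are written as host:/path (for example,
svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ). As an illustration only, the
following Python sketch splits that notation into host, directory and file
name, and shows how a file could then be fetched by anonymous ftp using the
standard ftplib module. The names split_ftp_spec and fetch, and the example
e-mail address, are made up for this example, and the sites listed in this
FAQ may no longer be reachable.

```python
from ftplib import FTP
import posixpath


def split_ftp_spec(spec):
    """Split 'host:/dir/file' into (host, directory, filename)."""
    host, _, path = spec.partition(":")
    if not host or not path.startswith("/"):
        raise ValueError("expected host:/path, got %r" % spec)
    directory, filename = posixpath.split(path)
    return host, directory, filename


def fetch(spec, local_name, email="user@example.org"):
    """Download one file by anonymous ftp (hypothetical helper, needs network)."""
    host, directory, filename = split_ftp_spec(spec)
    with FTP(host) as ftp:
        # Anonymous ftp: user "anonymous", e-mail address as the password.
        ftp.login("anonymous", email)
        ftp.cwd(directory)
        with open(local_name, "wb") as out:
            ftp.retrbinary("RETR " + filename, out.write)


print(split_ftp_spec("svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ"))
```

A spec that ends in "/" names a directory rather than a file, so the
returned file name is empty and fetch would need a file name added first.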
------------------------------------------------------------------------
Q1.3: Common abbreviations and jargon.
ANN - Artificial Neural Network.
ASR - Automatic Speech Recognition.
ASSP - Acoustics Speech and Signal Processing
AVIOS - American Voice I/O Society
CELP - Code-book excited linear prediction.
COLING - Computational Linguistics
DTW - Dynamic time warping.
FAQ - Frequently asked questions.
HMM - Hidden Markov model.
IEEE - Institute of Electrical and Electronics Engineers
JASA - Journal of the Acoustical Society of America
LPC - Linear predictive coding.
LVQ - Learning vector quantisation.
NLP - Natural Language Processing.
NN - Neural Network.
TI - Texas Instruments.
TIMIT - A big speech database from TI and MIT - see Q1.7
TTS - Text-To-Speech (i.e. synthesis).
VQ - Vector Quantisation.
------------------------------------------------------------------------
Q1.4: What are related newsgroups and mailing lists?
NEWSGROUPS
comp.ai - Artificial Intelligence newsgroup.
Postings on general AI issues, language processing and AI techniques.
Has a good FAQ including NLP, NN and other AI information.
comp.ai.nat-lang - Natural Language Processing Group
Postings regarding Natural Language Processing. Set up to cover
a broad range of related issues and different viewpoints.
comp.ai.nlang-know-rep - Natural Language Knowledge Representation
Moderated group covering Natural Language.
comp.ai.neural-nets - discussion of Neural Networks and related issues.
There are often postings on speech-related matters - phonetic recognition,
connectionist grammars and so on.
comp.compression - occasional articles on compression of speech.
FAQ for comp.compression has some info on audio compression standards.
comp.dcom.telecom - Telecommunications newsgroup.
Has occasional articles on voice products.
comp.dsp - discussion of signal processing - hardware and algorithms and more.
Has a good FAQ posting.
Has a regular posting of a comprehensive list of Audio File Formats.
comp.multimedia - Multi-Media discussion group.
Has occasional articles on voice I/O.
sci.lang - Language.
Discussion about phonetics, phonology, grammar, etymology and lots more.
alt.sci.physics.acoustics - some discussion of speech production & perception.
alt.binaries.sounds.misc - posting of various sound samples
alt.binaries.sounds.d - discussion about sound samples, recording and playback.
MAILING LISTS
ECTL - Electronic Communal Temporal Lobe
Founder & Moderator: David Leip
Moderated mailing list for researchers with interests in computer speech
interfaces. This list serves a broad community including persons from
signal processing, AI, linguistics and human factors.
To subscribe, send the following information to:
ectl-request@snowhite.cis.uoguelph.ca
name, institute, department, daytime phone & e-mail address
To access the archive, ftp snowhite.cis.uoguelph.ca, login as anonymous,
and supply your local userid as a password. All the ECTL things can be
found in pub/ectl.
Prosody Mailing List
Unmoderated mailing list for discussion of prosody. The aim is
to facilitate the spread of information relating to the research
of prosody by creating a network of researchers in the field.
If you want to participate, send the following one-line
message to "listserv@msu.edu" :-
subscribe prosody Your Name
foNETiks
A monthly newsletter distributed by e-mail. It carries job
advertisements, notices of conferences, and other news of
general interest to phoneticians, speech scientists and others.
The current editors are Linda Shockey and Gerry Docherty.
[The email address seems to have changed - does anyone know
the current subscription details?]
Digital Mobile Radio
Covers many areas, including some speech topics such as speech
coding and speech compression.
Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.
------------------------------------------------------------------------
Q1.5: What are related journals and conferences?
Try the following commercially oriented magazines:-
Speech Technology - no longer published
Voice Technology News
Try the following technical journals (some contact addresses below):-
IEEE Transactions on Speech and Audio Processing (from Jan 93)
IEEE Transactions on Acoustics, Speech, and Signal Processing
(ASSP) - now obsolete.
Computational Linguistics (COLING)
Computer Speech and Language
Journal of the Acoustical Society of America (JASA)
Transactions of IEEE ASSP
AVIOS Journal
ASR News
Try the following conferences:-
ICASSP Intl. Conference on Acoustics Speech and Signal Processing (IEEE)
ICSLP Intl. Conference on Spoken Language Processing
EUROSPEECH European Conference on Speech Communication and Technology
AVIOS American Voice I/O Society Conference
SST Australian Speech Science and Technology Conference
SpeechTech
Here are a few contact addresses:-
Publications: IEEE Transactions on Speech and Audio Processing (from Jan 93)
IEEE Transactions on Acoustics, Speech, and Signal Processing
(ASSP) - now obsolete.
Organization: Institute of Electrical and Electronics Engineers (IEEE)
Address: IEEE Service Center
445 Hoes Lane
PO Box 1331
Piscataway, NJ 08855, USA
Phone number: 1-800-678-IEEE
(201)981-0060
Publications: Computer Speech and Language
Organization: Academic Press, Ltd.
Address: 24-28 Oval Rd
London NW1
England
Price: $136 (Institutions), $58 (Individuals)
Publications: Computational Linguistics
Organization: Association for Computational Linguistics
Address: MIT Press Journals
55 Hayward St
Cambridge, MA 02142
Phone number: (617)253-2889
------------------------------------------------------------------------
Q1.6: What resources are available as handicap aids?
Can anyone provide information on speech technology aids for the deaf,
blind, speech impaired, physically impaired and other groups who may
benefit from speech technology?
Product Name: SpeechViewer II
Platform: IBM Machines from Mod 25 on.
Description: SpeechViewer II is a speech therapy tool. It provides
graphical feedback of various speech features so that speech
impaired individuals can improve their speech. It works with an
audio bandwidth of 7.3 kHz and thus allows the therapist to work
with sustained vowels and fricatives. A wide range of graphics
is used to provide adequate variability to hold client interest.
An extensive set of statistics is gathered, which allows a therapist
to do research or keep therapy records.
The speech therapy modules are:
o Awareness - Sound, Loudness, Pitch, Voicing Onset, Voicing
o Skill Building - Pitch, Voicing, Phonology
o Patterning - Pitch & Loudness - Waveform & Spectrogram, Spectra
o Clinical Management - Profiles, Models, Client Data
Hardware: Requires an IBM M-ACPA (Multimedia-Audio Capture Playback
Adapter). It has a TI TMS320C25 DSP chip. The input sampling
rate is 44.1 kHz stereo, 88.2 kHz mono. This is a 16 bit card.
It has the following jacks: mic in, stereo line in, stereo line
out, speaker out. Note: This card is being replaced by Mwave
technology. For more info on Mwave contact Texas Instruments.
Price: The software is $2130 list, $1491 educational, part number 92F2066.
The M-ACPA is $370 list, $222 educational, part number 92F3378.
The MicroChannel adapter part number is 92F3379 (same price).
Contact: The Psychological Corporation (TPC) [IBM Authorized Remarketer]
Phone: 1-800-228-0752
Or contact IBM on 1-800-426-4832.
------------------------------------------------------------------------
Q1.7: What speech data is available?
A wide range of speech databases have been collected. These databases
are primarily for the development of speech synthesis/recognition and for
linguistic research.
Some databases are free but most appear to be available for a small cost.
The databases normally require lots of storage space - do not expect to be
able to ftp all the data you want.
[There are too many to list here in detail - perhaps someone would like to
set up a special posting on speech databases?]
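To give a feel for why speech databases need so much storage, here is a
rough Python sketch of the arithmetic for uncompressed PCM audio. The
function name pcm_bytes and the one-hour duration are assumptions chosen for
illustration; the 16 kHz, 16-bit format matches the recording quality quoted
for some of the databases below, and file headers and labels are ignored.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, seconds, channels=1):
    """Bytes needed for uncompressed PCM audio, ignoring file headers."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


# One hour of 16 kHz, 16-bit, single-channel speech.
one_hour = pcm_bytes(16000, 16, 3600)
print(one_hour / 1e6, "MB")   # 115.2 MB

# The same hour at 8 kHz with 8-bit samples (telephone quality).
print(pcm_bytes(8000, 8, 3600) / 1e6, "MB")   # 28.8 MB
```

At these rates even a few hours of speech runs to hundreds of megabytes,
which is why most of the larger collections are distributed on CD-ROM
rather than by ftp.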
PHONEMIC SAMPLES
================
First, some basic data. The following sites have samples of English phonemes
(American accent I believe) in Sun audio format files. See Question 1.8
for information on audio file formats.
sounds.sdsu.edu:/.1/phonemes
phloem.uoregon.edu:/pub/Sun4/lib/phonemes
sunsite.unc.edu:/pub/multimedia/sun-sounds/phonemes
HOMOPHONE LIST
==============
A list of homophones in General American English is available by anonymous
FTP from the comp.speech archive site:
machine name: svr-ftp.eng.cam.ac.uk
directory: comp.speech/data
file name: homophones-1.01.txt
LINGUISTIC DATA CONSORTIUM (LDC)
================================
Information about the Linguistic Data Consortium is available via
anonymous ftp from: ftp.cis.upenn.edu (130.91.6.8)
in the directory: /pub/ldc
Here are some excerpts from the files in that directory:
Briefly stated, the LDC has been established to broaden the collection
and distribution of speech and natural language data bases for the
purposes of research and technology development in automatic speech
recognition, natural language processing and other areas where large
amounts of linguistic data are needed.
Here is the brief list of corpora:
* The TIMIT and NTIMIT speech corpora
* The Resource Management speech corpus (RM1, RM2)
* The Air Travel Information System (ATIS0) speech corpus
* The Association for Computational Linguistics - Data Collection
Initiative text corpus (ACL-DCI)
* The TI Connected Digits speech corpus (TIDIGITS)
* The TI 46-word Isolated Word speech corpus (TI-46)
* The Road Rally conversational speech corpora (including "Stonehenge"
and "Waterloo" corpora)
* The Tipster Information Retrieval Test Collection
* The Switchboard speech corpus ("Credit Card" excerpts and portions
of the complete Switchboard collection)
Further resources to be made available within the first year (or two):
* The Machine-Readable Spoken English speech corpus (MARSEC)
* The Edinburgh Map Task speech corpus
* The Message Understanding Conference (MUC) text corpus of FBI
terrorist reports
* The Continuous Speech Recognition - Wall Street Journal speech
corpus (WSJ-CSR)
* The Penn Treebank parsed/tagged text corpus
* The Multi-site ATIS speech corpus (ATIS2)
* The Air Traffic Control (ATC) speech corpus
* The Hansard English/French parallel text corpus
* The European Corpus Initiative multi-language text corpus (ECI)
* The Int'l Labor Organization/Int'l Trade Union multi-language
text corpus (ILO/ITU)
* Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)
The files in the directory include more detailed information on the
individual databases. For further information contact
Linguistic Data Consortium
441 Williams Hall
University of Pennsylvania
Philadelphia, PA 19104-6305
Phone: +1 (215) 898-0464
Fax: +1 (215) 573-2175
e-mail: ldc@unagi.cis.upenn.edu
Center for Spoken Language Understanding (CSLU)
===============================================
1. The ISOLET speech database of spoken letters of the English alphabet.
The speech is high quality (16 kHz with a noise cancelling microphone).
Each of 150 speakers said the 26 letters of the English alphabet twice, in
random order.
The "ISOLET" data base can be purchased for $100 by sending an email request
to vincew@cse.ogi.edu. (This covers handling, shipping and medium costs).
The data base comes with a technical report describing the data.
2. CSLU has a telephone speech corpus of 1000 recitations of the English
alphabet. Callers recite the alphabet with brief pauses between letters. This database is
available to not-for-profit institutions for $100. The data base is described
in the proceedings of the International Conference on Spoken Language
Processing. Contact vincew@cse.ogi.edu if interested.
PhonDat - A Large Database of Spoken German
===========================================
The PhonDat continuous speech corpora are now available on
CD-ROM media (ISO 9660 format).
PhonDat I (Diphone Corpus) : 6 CDs (1140.- DM)
PhonDat II (Train Enquiries Corpus): 1 CD ( 190.- DM)
PhonDat I comprises approx. 20,000, PhonDat II approx. 1500
signal files in high quality 16-bit 16 kHz recording. The
corpora come with documentation containing the orthographic
transcription and a citation form of the utterances, as well as a
detailed file format description. A narrow phonetic transcription
is available for selected files from corpus I and II.
For information and orders contact
Barbara Eisen
Institut fuer Phonetik
Schellingstr. 3 / II
D 80799 Munich 40
Tel: +49 / 89 / 2180 -2454 or -2758
Fax: +49 / 89 / 280 03 62
------------------------------------------------------------------------
Q1.8: Speech File Formats, Conversion and Playing.
Section 2 of this FAQ has information on mu-law coding.
A very good and very comprehensive list of audio file formats is prepared
by Guido van Rossum. The list is posted regularly to comp.dsp and
alt.binaries.sounds.misc, amongst others. It includes information on
sampling rates, hardware, compression techniques, file format definitions,
format conversion, standards, programming hints and lots more. It is much
too long to include within this posting.
It is also available by ftp
from: ftp.cwi.nl
directory: /pub
file: AudioFormats<version>
------------------------------------------------------------------------
Q1.9: What "Speech Laboratory Environments" are available?
First, what is a Speech Laboratory Environment? A speech lab is a
software package which provides the capability of recording, playing,
analysing, processing, displaying and storing speech. Your computer
will require audio input/output capability. The different packages
vary greatly in features and capability - best to know what you want
before you start looking around.
Most general purpose audio processing packages will be able to process speech
but do not necessarily have specialised capabilities for speech (e.g.
formant analysis).
The following article provides a good survey.
Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation"
Journal of Speech and Hearing Research, pp 314-332, April 1992.
Package: Entropic Signal Processing System (ESPS) and Waves
Platform: Range of Unix platforms.
Description: ESPS is a very comprehensive set of speech analysis/processing
tools for the UNIX environment. The package includes UNIX commands,
and a comprehensive C library (which can be accessed from other
languages). Waves is a graphical front-end for speech processing.
Speech waveforms, spectrograms, pitch traces etc can be displayed,
edited and processed in X windows and Openwindows (versions 2 & 3).
The HTK (Hidden Markov Model Toolkit) is now available from Entropic.
HTK is described in some detail in Section 5 of this FAQ - the
section on Speech Recognition.
Cost: On request.
Contact: Entropic Research Laboratory, Washington Research Laboratory,
600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
(202) 547-1420. email - info@wrl.epi.com
Package: CSRE: Canadian Speech Research Environment
Platform: IBM/AT-compatibles
Description: CSRE is a comprehensive, microcomputer-based system designed
to support speech research. CSRE provides a powerful, low-cost
facility in support of speech research, using mass-produced and
widely-available hardware. The project is non-profit, and relies
on the cooperation of researchers at a number of institutions and
fees generated when the software is distributed. Functions
include speech capture, editing, and replay; several alternative
spectral analysis procedures, with color and surface/3D displays;
parameter extraction/tracking and tools to automate measurement
and support data logging; alternative pitch-extraction systems;
parametric speech (KLATT80) and non-speech acoustic synthesis,
with a variety of supporting productivity tools; and a
comprehensive experiment generator, to support behavioral testing
using a variety of common testing protocols.
A paper about the whole package can be found in:
Jamieson D.G. et al, "CSRE: A Speech Research Environment",
Proc. of the Second Intl. Conf. on Spoken Language Processing,
Edmonton: University of Alberta, pp. 1127-1130.
Hardware: Can use a range of data acquisition/DSP
Cost: Distributed on a cost recovery basis.
Availability: For more information on availability
contact Krystyna Marciniak - email march@uwovax.uwo.ca
Tel (519) 661-3901 Fax (519) 661-3805.
For technical information - email ramji@uwovax.uwo.ca
Note: Also included in Q4.4 on speech synthesis packages.
Package: OGI Speech Tools from the Center for Spoken Language
Understanding (CSLU) at the Oregon Graduate Institute of Science
and Technology (Portland Oregon)
Platform: Unix????
Description: The OGI Speech tools include :-
1. An X windows display tool (LYRE) for displaying data in a time
synchronous fashion for a. the speech signal b. spectrograms
c. phoneme labels, and other information.
2. A Neural Network (NOPT) training package.
3. A set of C library routines (LIBNSPEECH) for the manipulation
of speech data, including: a. PLP Analysis, b. Rasta PLP
Analysis, c. Linear Predictive Coding, d. Mel Cepstrum Coding,
e. Fast Fourier Transform
4. A set of utilities for converting file formats such as ADC, NIST,
mu-law, binary files, and ascii. Includes filtering.
5. A database utility (find_phone) to automate speech database
related enquiries. It allows the user to specify a particular
label or set of labels in a given context, display all occurrences
of the label, and relabel the occurrences if desired.
6. A Vector-Quantizer based on the Linde Buzo and Gray (LBG)
algorithm.
7. A set of Perl scripts which have been used mainly to automate
the use of the OGI Speech Tools.
8. Man pages for all routines and programs developed, as well as
a user manual in both PostScript and TeX format.
Misc: Software is written in ANSI C.
Availability: By anonymous ftp from
speech.cse.ogi.edu:/pub/tools/
Contact: Try tools@cse.ogi.edu
Package: Signalyze 2.4x from InfoSignal
Platform: Macintosh
Description: Signalyze's basic conception revolves around up to 100
signals, displayed synchronously in HyperCard fashion on "cards".
The program offers a complement of signal editing features,
quite a few spectral analysis tools, manual scoring tools, pitch
extraction routines, a good set of signal manipulation tools, and
extensive input-output capacity.
Handles multiple file formats: Signalyze, MacSpeech Lab, AudioMedia,
SoundDesigner II, SoundEdit/MacRecorder, SoundWave, three sound
resource formats, and ASCII-text.
Sound I/O: Direct sound input from MacRecorder and similar devices,
AudioMedia, AudioMedia II and AD IN, some MacADIOS boards and devices,
Apple sound input (built-in microphone). Sound output via Macintosh
internal sound, some MacADIOS boards and devices as well as via the
Digidesign 16-bit boards.
Compatibility: MacPlus and higher (including II, IIx, IIcx, IIci, IIfx,
IIvx, IIvi, Portable, all PowerBooks, Centris and Quadras). Takes
advantage of large and multiple screens and 16/256 color/grayscales.
System 7.0 compatible. Runs in background with adjustable priority.
Misc: A demo available upon request.
Manuals and tutorial included.
It is available in English, French, and German.
An UPDATER to version 2.48 is now available in:
- The UNIL Gopher server (see last page of InfoSignal News 8)
- The LAIP FTP server. Address: MACFL4082.unil.ch, machine no.
130.223.104.31, login: anonymous, password: your email
Also available are a demo program, and current questions and answers.
Cost: Individual licence US$350, site licence US$500, plus shipping.
Contact: North America - Network Technology Corporation
91 Baldwin St., Charlestown MA 02129
Fax: 617-241-5064 Phone: 617-241-9205
Elsewhere - InfoSignal Inc.
C.P. 73, 1015 LAUSANNE, Switzerland,
FAX: +41 21 691-1372,
Email: 76357.1213@COMPUSERVE.COM.
Package: Kay Elemetrics CSL (Computer Speech Lab) 4300
Platform: Minimum IBM PC-AT compatible with extended memory (min 2MB)
with at least VGA graphics. Optimal would be 386 or 486 machine
with more RAM for handling larger amounts of data.
Description: Speech analysis package, with optional separate LPC program
for analysis/synthesis. Uses its own file format for data, but has
some ability to export data as ascii. The main editing/analysis program
(but not the LPC part) has its own macro language, making it easy to
perform repetitive tasks. Probably not much use without the extra
LPC program, which also allows manipulation of pitch, formant and
bandwidth parameters.
Hardware includes an internal DSP board for the PC (requires ISA
slot), and an external module containing signal processing chips
which does A/D and D/A conversion.
A speaker and microphone are supplied.
Misc: A programmers kit is available for programming signal processing
chips (experts only).
Manuals included.
Cost: Recently approx 6000 pounds sterling. (Less in USA?)
Availability: UK distributors are Wessex Electronics,
114-116 North Street, Downend, Bristol, B16 5SE
Tel: 0272 571404.
In USA: Kay Elemetrics Corp,
12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798
Tel:(201) 227-7760
Package: MacSpeech Lab II (MSL II)
Platform: Macintosh
Description: A sound analysis and acquisition package for Macs. MSL II delivers
the most common functions for speech analysis (FFTs, LPCs, f0
extraction, etc.) & produces grayscale spectrographic displays.
Can be used for various speech technology and phonetic training
tasks. The software can trade off accuracy and speed.
Hardware: requires MacADIOS ("Macintosh Analog/Digital Input/Output
System") hardware for speech I/O at 12/16 bits.
Misc: Software no longer updated by GW Instruments; MSL soft/hardware will
not perform input/output on Quadras, for example, though analysis
seems fine. Known to operate properly on systems as high as IIcx &
IIfx.
Cost: $4990 (in May '92 price list; no MSL soft/hardware package
listed in January '93).
Contact: GW Instruments
35 Medford Street, Somerville, MA 02143
Phone: (617) 625-4096 Fax: (617) 625-1322
Package: Ptolemy
Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
Description: Ptolemy provides a highly flexible foundation for the
specification, simulation, and rapid prototyping of systems.
It is an object oriented framework within which diverse models
of computation can co-exist and interact. Ptolemy can be used
to model entire systems.
Ptolemy has been used for a broad range of applications including
signal processing, telecommunications, parallel processing, wireless
communications, network design, radio astronomy, real time systems,
and hardware/software co-design. Ptolemy has also been used as a lab
for signal processing and communications courses.
Ptolemy has been developed at UC Berkeley over the past 3 years.
Further information, including papers and the complete release
notes, is available from the FTP site.
Cost: Free
Availability: The source code, binaries, and documentation are available
by anonymous ftp from "ptolemy.berkeley.edu" - see the README file -
ptolemy.berkeley.edu:/pub/README
Package: Khoros
Description: Public domain image processing package with a basic DSP
library. Not particularly applicable to speech, but not bad
for the price.
Cost: FREE
Availability: By anonymous ftp from pprg.eece.unm.edu
Package: SpeechViewer II
Description: Speech Therapy Tool
See the detailed description in the handicap section (Q1.6).
Can anyone provide information on capability and availability of the
following package?
ILS ("Interactive Laboratory System")
=======================================================================
PART 2 - Signal Processing for Speech
Q2.1: What speech sampling and signal processing hardware can I use?
In addition to the following information, have a look at the Audio File
Formats document prepared by Guido van Rossum (see details in Q1.8).
Product: Sun standard audio port (SPARC 1 & 2)
Input: 1 channel, 8 bit mu-law encoded (telephone quality)
Output: 1 channel, 8 bit mu-law encoded (telephone quality)
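Each 8-bit mu-law sample from a port like this has to be expanded to a
linear value before most signal processing can be applied. The following
Python sketch follows the usual G.711 mu-law bit layout (one sign bit,
three segment bits, four step bits, all stored inverted). The function
names are illustrative only, and this is not taken from Sun's own software.

```python
def ulaw_to_linear(code):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(data):
    """Decode a bytes object of mu-law samples into a list of integers."""
    return [ulaw_to_linear(b) for b in data]


print(decode(b"\xff\x80\x00"))   # [0, 32124, -32124]
```

The decoded values fit in a 16-bit signed integer, with small amplitudes
resolved more finely than large ones.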
Product: Ariel
Platform: Sun + others?
Input: 2 channels, 16bit linear, sample rate 8-96kHz (inc 32, 44.1, 48kHz).
Output: 2 channels, 16bit linear, sample rate 8-50kHz (inc 32, 44.1, 48kHz).
Contact: Ariel Corp., 433 River Road,
Highland Park, NJ 08904.
Ph: 908-249-2900 Fax: 908-249-2123 DSP BBS: 908-249-2124
Product: IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
Description: The card supports PCM, Mu-Law, A-Law and ADPCM at 44.1kHz
(& 22.05, 11.025, 8kHz) with 16-bits of resolution in stereo.
The card has a built-in DSP (don't know which one). The device
also supports various formats for the output data, like big-endian,
twos complement, etc. Good noise immunity.
The card is used for IBM's VoiceServer (they use the DSP for
speech recognition). Apparently, the IBM voiceserver has a
speaker-independent vocabulary of over 20,000 words and each
ACPA can support two independent sessions at once.
Cost: $US495
Contact: ?
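The output formats mentioned for the ACPA (big-endian, twos complement and
so on) are different byte-level layouts of the same 16-bit sample. As a
hedged illustration using Python's standard struct module (nothing here is
specific to the ACPA or its driver, and offset binary is added only as one
more common layout for comparison), the same two bytes give quite different
values depending on which layout is assumed:

```python
import struct

raw = b"\xff\xfe"   # two bytes as they might appear in a sample file

# Signed twos complement, most significant byte first.
big_endian = struct.unpack(">h", raw)[0]

# Signed twos complement, least significant byte first.
little_endian = struct.unpack("<h", raw)[0]

# Unsigned "offset binary", where 32768 represents zero amplitude.
offset_binary = struct.unpack(">H", raw)[0] - 32768

print(big_endian, little_endian, offset_binary)   # -2 -257 32766
```

Reading a file with the wrong assumption usually produces loud noise rather
than speech, so it is worth checking the format before analysing the data.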
Product: Sound Galaxy NX, Aztech Systems
Platform: PC - DOS, Windows 3.1
Cost: ??
Input: 8bit linear, 4-22 kHz.
Output: 8bit linear, 4-44.1 kHz
Misc: 11-voice FM Music Synthesizer YM3812; Built-in power amplifier;
DSP signal processing support - ST70019SB
Hardware ADPCM decompression (2:1,3:1,4:1)
Full "AdLib" and "Sound Blaster" compatibility.
Software includes a simple Text-to-Speech program "Monologue".
Product: Sound Galaxy NX PRO, Aztech Systems
Platform: PC - DOS, Windows 3.1
Cost: ??
Input: 2 * 8bit linear, 4-22.05 kHz (stereo), 4-44.1 kHz (mono).
Output: 2 * 8bit linear, 4-44.1 kHz (stereo/mono)
Misc: 20-voice FM Music Synthesizer; Built-in power amplifier;
Stereo Digital/Analog Mixer; Configuration in EEPROM.
Hardware ADPCM decompression (2:1,3:1,4:1).
Includes DSP signal processing support
Full "AdLib" and "Sound Blaster Pro II" compatibility.
Software includes a simple Text-to-Speech program "Monologue"
and Sampling laboratory for Windows 3.1: WinDAT.
Contact: USA (510) 623-8988
Product Name: ATI Stereo F/X Sound Board
Platform: PC XT or AT - DOS, Windows 3.0, 3.1
Cost: $120 Canadian
Description:
Input - 8 bit ADC, 44.1 kHz mono, 22.05 kHz Stereo.
Output - Dynamic range = 48 dB, 32 anti-aliasing filters
Adds Stereo effect to existing mono Adlib or Sound Blaster apps.
11-voice YAMAHA FM Music Synthesizer
Built-in 8 watt power amplifier, 4 watts per channel.
Volume ctrl on rear.
2 Joystick input, software setup (no switches), software included.
"AdLib" and "Sound Blaster" compatibility.
DMA support for high speed digital audio.
ADPCM decomp @ 4:1, 3:1, 2:1. Will play .WAV files.
Optional MIDI I/O port $79. (MIDI IN, OUT, THRU, and sequencer).
Contact: ATI Technologies Inc.
3761 Victoria Park Avenue
Scarborough, Ontario
CANADA, M1W 3S2
Ph: (416) 756-0711 Fax: (416) 756-0720
BBS: (416) 764-9404 (9600 baud N.8.1)
Other PC Sound Cards
============================================================================
sound stereo/mono compatible included voices
card & sample rate with ports
============================================================================
Adlib Gold stereo: 8-bit 44.1khz Adlib ? audio 20 (opl3)
1000 16-bit 44.1khz in/out, +2 digital
mono: 8-bit 44.1khz mic in, channels
16-bit 44.1khz joystick,
MIDI
Sound Blaster mono: 8-bit 22.1khz Adlib audio 11 synth.
FM synth with in/out,
2 operators joystick,
Sound Blaster stereo: 8-bit 22.05khz Adlib audio 22
Pro Basic mono: 8-bit 44.1khz Sound Blaster in/out,
joystick,
Sound Blaster stereo: 8-bit 22.05khz Adlib audio 11
Pro mono: 8-bit 44.1khz Sound Blaster in/out
joystick,
MIDI, SCSI
Sound Blaster stereo: 8-bit 4-44.1khz Sound Blaster audio 20
16 ASP stereo: 16-bit 4-44.1khz in/out,
joystick,
MIDI
Audio Port mono: 8-bit 22.05khz Adlib audio 11
Sound Blaster in/out,
joystick
Pro Audio stereo: 8-bit 44.1khz Adlib audio, 20
Spectrum + Pro Audio in/out,
Spectrum joystick
Pro Audio stereo: 16-bit 44.1khz Adlib audio 20
Spectrum 16 Pro Audio in/out,
Spectrum joystick,
Sound Blaster MIDI, SCSI
Thunder Board stereo: 8-bit 22khz Adlib audio 11
Sound Blaster in/out,
joystick
Gravis stereo: 8-bit 44.1khz Adlib, audio line 32 sampled
Ultrasound mono: 8-bit 44.1khz Sound Blaster in/out, 32 synth.
amplified
out,
(w/16-bit daughtercard) mic in, CD
stereo: 16-bit 44.1khz audio in,
mono: 16-bit 44.1khz daughterboard
ports (for
SCSI and
16-bit)
MultiSound stereo: 16-bit 44.1kHz Nothing audio 32 sampled
64x oversampling in/out,
joystick,
MIDI
=============================================================================
Can anyone provide information on Mac, NeXT and other hardware?
Product: xxx
Platform: PC, Mac, Sun, ...
Rough Cost (pref $US):
Input: e.g. 16bit linear, 8,10,16,32kHz.
Output: e.g. 16bit linear, 8,10,16,32kHz.
DSP: signal processing support
Other:
Contact:
------------------------------------------------------------------------
Q2.2: What signal processing techniques are used for speech technology?
This question is far too big to be answered in a FAQ posting. Fortunately
there are many good books which answer the question!
Some good introductory books include
Digital processing of speech signals; L. R. Rabiner, R. W. Schafer.
Englewood Cliffs; London: Prentice-Hall, 1978
Voice and Speech Processing; T. W. Parsons.
New York; McGraw Hill 1986
Computer Speech Processing; ed Frank Fallside, William A. Woods
Englewood Cliffs: Prentice-Hall, c1985
Digital speech processing : speech coding, synthesis, and recognition
edited by A. Nejat Ince; Kluwer Academic Publishers, Boston, c1992
Speech science and technology; edited by Shuzo Saito
pub. Ohmsha, Tokyo, c1992
Speech analysis; edited by Ronald W. Schafer, John D. Markel
New York, IEEE Press, c1979
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal Processing,
1987.
------------------------------------------------------------------------
Q2.3: How do I find the pitch of a speech signal?
This topic comes up regularly in the comp.dsp newsgroup. Question 2.5
of the FAQ posting for comp.dsp gives a comprehensive list of references
on the definition, perception and processing of pitch.
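
As a rough illustration only (it is not taken from the comp.dsp FAQ), one
common textbook approach is autocorrelation: a voiced frame resembles itself
when shifted by one pitch period, so the lag with the largest autocorrelation
is taken as the period. The function name pitch_period and its parameters are
invented for this sketch, and the lag search range is an assumption that you
must choose to suit your sample rate.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate the pitch period of one frame, in samples, by picking the lag
 * in [min_lag, max_lag] with the largest autocorrelation.
 * Returns 0 if the frame is silent or no lag gives a positive correlation.
 * (Hypothetical helper for illustration; not from any library.)
 */
size_t pitch_period(const double *x, size_t n, size_t min_lag, size_t max_lag)
{
    assert(min_lag >= 1 && min_lag <= max_lag && max_lag < n);

    double energy = 0.0;
    for (size_t i = 0; i < n; i++)
        energy += x[i] * x[i];
    if (energy == 0.0)
        return 0;

    size_t best_lag = 0;
    double best = 0.0;
    for (size_t lag = min_lag; lag <= max_lag; lag++) {
        double r = 0.0;
        for (size_t i = 0; i + lag < n; i++)
            r += x[i] * x[i + lag];
        /* normalise by the number of terms so long lags are not penalised */
        r /= (double)(n - lag);
        if (r > best) {          /* strict test keeps the shortest of equal peaks */
            best = r;
            best_lag = lag;
        }
    }
    return best_lag;
}
```

For example, at a sample rate of 8 kHz a search range of 20 to 160 samples
covers roughly 50 to 400 Hz, and a returned lag of 40 corresponds to
8000/40 = 200 Hz. Plain autocorrelation can lock on to a multiple of the true
period on real speech, so treat the result as a starting point and consult the
references in the comp.dsp FAQ for more robust methods.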
------------------------------------------------------------------------
Q2.4: How do I find the start and end points of a speech signal?
A large number of papers have been presented on this task. Try the
following papers:-
Rabiner LR, Sambur MR, "An Algorithm for Determining the Endpoints
of Isolated Utterances", Bell System Technical Journal, Vol 54,
No. 2, pp 297-315, 1975.
Drago, P.G. et al. "Digital Dynamic Speech Detectors." IEEE Trans on
Communications, Vol 26, No 1, Jan 78, pp. 140-145.
Newman, W.C. "Detecting Speech with an Adaptive Neural Network."
Electronic Design. 22 March 1990.
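
To give a feel for the task, here is a minimal sketch that marks the first
and last frames whose mean energy exceeds a fixed threshold. It is NOT the
algorithm from any of the papers above (Rabiner and Sambur, for instance,
combine energy with zero-crossing rate and set their thresholds from the
background noise). The function name find_endpoints, the frame length and the
threshold are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scan the signal in non-overlapping frames of frame_len samples.
 * If any frame has mean energy above `threshold`, store the sample index
 * of the first such frame in *start (inclusive) and the index just past
 * the last such frame in *end (exclusive), and return 1.
 * Return 0 if no frame exceeds the threshold.
 * (Hypothetical helper for illustration only.)
 */
int find_endpoints(const short *x, size_t n, size_t frame_len,
                   double threshold, size_t *start, size_t *end)
{
    assert(frame_len > 0);

    int found = 0;
    for (size_t pos = 0; pos + frame_len <= n; pos += frame_len) {
        double energy = 0.0;
        for (size_t i = 0; i < frame_len; i++)
            energy += (double)x[pos + i] * (double)x[pos + i];
        energy /= (double)frame_len;

        if (energy > threshold) {
            if (!found) {
                *start = pos;
                found = 1;
            }
            *end = pos + frame_len;
        }
    }
    return found;
}
```

With 8 kHz speech, a frame_len of 80 gives 10 ms frames. A fixed threshold
will miss weak fricatives and trigger on noise bursts, which is exactly the
problem the papers listed above set out to solve.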
------------------------------------------------------------------------
Q2.5: Where can I find FFT software?
Try the following file - available by anonymous ftp :-
usc.edu:/pub/C-numanal/fft-stuff.tar.gz
It contains a series of optimised fft routines, including mixed-radix
algorithms. Note that the .gz suffix indicates GNU zip format.
------------------------------------------------------------------------
Q2.6: How do I convert to/from mu-law format?
Mu-law coding is a form of compression for audio signals including speech.
It is widely used in the telecommunications field because it improves the
signal-to-noise ratio without increasing the amount of data. Typically,
mu-law compressed speech is carried in 8-bit samples. It is a companding
technique, which means that it carries more information about the smaller signals
than about larger signals. Mu-law coding is provided as standard for the
audio input and output of the SUN Sparc stations 1&2 (Sparc 10's are linear).
On SUN Sparc systems have a look in the directory /usr/demo/SOUND. Included
are table lookup macros for ulaw conversions. [Note however that not all
systems will have /usr/demo/SOUND installed as it is optional - see your
system admin if it is missing.]
OR, here is some sample conversion code in C.
# include <stdio.h>
unsigned char linear2ulaw(/* int */);
int ulaw2linear(/* unsigned char */);
/*
** This routine converts from linear to ulaw.
**
** Craig Reese: IDA/Supercomputing Research Center
** Joe Campbell: Department of Defense
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711 (very difficult to follow)
** 2) "A New Digital Technique for Implementation of Any
** Continuous PCM Companding Law," Villeret, Michel,
** et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
** 1973, pg. 11.12-11.17
** 3) MIL-STD-188-113,"Interoperability and Performance Standards
** for Analog-to_Digital Conversion Techniques,"
** 17 February 1987
**
** Input: Signed 16 bit linear sample
** Output: 8 bit ulaw sample
*/
#define ZEROTRAP /* turn on the trap as per the MIL-STD */
#undef ZEROTRAP
#define BIAS 0x84 /* define the add-in bias for 16 bit samples */
#define CLIP 32635
unsigned char linear2ulaw(sample) int sample; {
static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
int sign, exponent, mantissa;
unsigned char ulawbyte;
/* Get the sample into sign-magnitude. */
sign = (sample >> 8) & 0x80; /* set aside the sign */
if(sign != 0) sample = -sample; /* get magnitude */
if(sample > CLIP) sample = CLIP; /* clip the magnitude */
/* Convert from 16 bit linear to ulaw. */
sample = sample + BIAS;
exponent = exp_lut[( sample >> 7 ) & 0xFF];
mantissa = (sample >> (exponent + 3)) & 0x0F;
ulawbyte = ~(sign | (exponent << 4) | mantissa);
#ifdef ZEROTRAP
if (ulawbyte == 0) ulawbyte = 0x02; /* optional CCITT trap */
#endif
return(ulawbyte);
}
/*
** This routine converts from ulaw to 16 bit linear.
**
** Craig Reese: IDA/Supercomputing Research Center
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711 (very difficult to follow)
** 2) MIL-STD-188-113,"Interoperability and Performance Standards
** for Analog-to_Digital Conversion Techniques,"
** 17 February 1987
**
** Input: 8 bit ulaw sample
** Output: signed 16 bit linear sample
*/
int ulaw2linear(ulawbyte) unsigned char ulawbyte; {
static int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
int sign, exponent, mantissa, sample;
ulawbyte = ~ulawbyte;
sign = (ulawbyte & 0x80);
exponent = (ulawbyte >> 4) & 0x07;
mantissa = ulawbyte & 0x0F;
sample = exp_lut[exponent] + (mantissa << (exponent + 3));
if(sign != 0) sample = -sample;
return(sample);
}
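
The companding behaviour can be read directly off the decoder above: each
mu-law byte holds a 3-bit exponent (segment) and a 4-bit mantissa, and the
spacing between adjacent output levels doubles from one segment to the next.
The following sketch restates the decoder's table lookup as arithmetic to
show this. The names ulaw_magnitude and ulaw_step are invented for this
example; only the bias value 0x84 comes from the code above.

```c
#include <assert.h>

#define ULAW_BIAS 0x84   /* same bias as BIAS in the conversion code above */

/*
 * Magnitude (on the 16-bit linear scale) that a given exponent/mantissa
 * pair decodes to. Equivalent to exp_lut[exponent] + (mantissa << (exponent + 3))
 * in ulaw2linear(). Hypothetical helper for illustration.
 */
int ulaw_magnitude(int exponent, int mantissa)
{
    assert(exponent >= 0 && exponent <= 7);
    assert(mantissa >= 0 && mantissa <= 15);
    return (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS;
}

/* Spacing between adjacent decoded levels inside one segment. */
int ulaw_step(int exponent)
{
    return ulaw_magnitude(exponent, 1) - ulaw_magnitude(exponent, 0);
}
```

The step is 8 in the lowest segment and 1024 in the highest, so quiet
samples are represented 128 times more finely than the loudest ones.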
=======================================================================
PART 3 - Speech Coding and Compression
Q3.1: Speech compression techniques.
Can anyone provide a 1-2 page summary on speech compression? Topics to
cover might include common techniques, where speech compression might be
used and perhaps something on why speech is difficult to compress.
[The FAQ for comp.compression includes a few questions and answers
on the compression of speech.]
------------------------------------------------------------------------
Q3.2: What are some good references/books on coding/compression?
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal
Processing, 1987.
Bishnu Atal in ed. Fallside, F. and W. Woods, ed. Computer Speech
Processing. London: Prentice/Hall International, 1985.
Makhoul, J. "Linear Prediction: A Tutorial Review." Proc. of the
IEEE 63 (1975): 561 - 580.
------------------------------------------------------------------------
Q3.3: What software is available?
Note: there are two types of speech compression technique referred to below.
Lossless techniques preserve the speech exactly through a compression-decompression
cycle. Lossy techniques do not preserve the speech perfectly. As a general
rule, the more you compress speech, the more the quality degrades.
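
The difference can be illustrated with a toy example. Below, delta_encode
replaces each sample by its difference from the previous sample; because
neighbouring speech samples are usually similar, the differences tend to be
small numbers that a later entropy coder could store in fewer bits, and
delta_decode recovers the input exactly (lossless). By contrast, quantise
throws away low-order bits and cannot be undone (lossy). This is only a
sketch of the general idea with invented function names; it is not the
algorithm used by any package listed below.

```c
#include <assert.h>
#include <stddef.h>

/* Lossless: store each sample as its difference from the previous one. */
void delta_encode(const int *x, int *residual, size_t n)
{
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        residual[i] = x[i] - prev;
        prev = x[i];
    }
}

/* Exact inverse of delta_encode. */
void delta_decode(const int *residual, int *x, size_t n)
{
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        x[i] = residual[i] + prev;
        prev = x[i];
    }
}

/* Lossy, for comparison: discard the `bits` low-order bits of a sample. */
int quantise(int sample, int bits)
{
    assert(bits >= 0 && bits < 15);
    return (sample / (1 << bits)) * (1 << bits);
}
```

For the samples 1000, 1010, 1025, 1030, 1020 the residuals are
1000, 10, 15, 5, -10, and decoding them gives back the original five values.
Quantising 1025 with bits = 4 gives 1024, and the lost 1 cannot be recovered.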
Package: shorten - a lossless compressor for speech signals
Platform: UNIX/DOS
Description: A lossless compressor for speech signals. It will compile and
run on UNIX workstations and will cope with a wide variety of
formats. Compression is typically 50% for 16bit clean speech
sampled at 16kHz.
Availability: Anonymous ftp svr-ftp.eng.cam.ac.uk: /misc/shorten-1.09.tar.Z
Package: CELP 3.2a & LPC
Platform: Sun (the makefiles & source can be modified for other platforms)
Description: CELP is a lossy compression technique.
The U.S. DoD's Federal-Standard-1016 based 4800 bps code excited
linear prediction voice coder version 3.2a (CELP 3.2a) Fortran and
C simulation source codes. Available for worldwide distribution
(on DOS diskettes, but configured to compile on Sun SPARC stations)
from NTIS and DTIC. Example input and processed speech files are
included. A Technical Information Bulletin (TIB), "Details to Assist
in Implementation of Federal Standard 1016 CELP," and the official
standard, "Federal Standard 1016, Telecommunications: Analog to
Digital Conversion of Radio Voice by 4,800 bit/second Code Excited
Linear Prediction (CELP)," are also available.
Availability 1: Through the National Technical Information Service:
NTIS
U.S. Department of Commerce
5285 Port Royal Road,
Springfield, VA 22161, USA
The "AD" ordering number for the CELP software is AD M000 118
(US$ 90.00) and for the TIB it's AD A256 629 (US$ 17.50).
The LPC-10 standard, described below, is FIPS Pub 137 (US$ 12.50).
There is a $3.00 shipping charge on all U.S. orders. The telephone
number for their automated system is 703-487-4650, or 703-487-4600
if you'd prefer to talk with a real person.
(U.S. DoD personnel and contractors can receive the package from the
Defense Technical Information Center: DTIC, Building 5, Cameron
Station, Alexandria, VA 22304-6145. Their telephone number is
703-274-7633.)
Availability 2: By anonymous ftp from:
super.org (192.31.192.1):/pub/celp_3.2a.tar.Z
OR
svr-ftp.eng.cam.ac.uk:comp.speech/sources/celp_3.2a.tar.Z
Misc: The following articles describe the Federal-Standard-1016 4.8-kbps
CELP coder (it's unnecessary to read more than one):
Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch,
"The Federal Standard 1016 4800 bps CELP Voice Coder," Digital Signal
Processing, Academic Press, 1991, Vol. 1, No. 3, p. 145-155.
Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch,
"The DoD 4.8 kbps Standard (Proposed Federal Standard 1016),"
in Advances in Speech Coding, ed. Atal, Cuperman and Gersho,
Kluwer Academic Publishers, 1991, Chapter 12, p. 121-133.
Campbell, Joseph P. Jr., Thomas E. Tremain and Vanoy C. Welch, "The
Proposed Federal Standard 1016 4800 bps Voice Coder: CELP," Speech
Technology Magazine, April/May 1990, p. 58-64.
* The U.S. DoD's Federal-Standard-1015/NATO-STANAG-4198 based 2400
bps linear prediction coder (LPC-10) was republished as a Federal
Information Processing Standards Publication 137 (FIPS Pub 137).
It is described in:
Thomas E. Tremain, "The Government Standard Linear Predictive Coding
Algorithm: LPC-10," Speech Technology Magazine, April 1982, p. 40-49.
There is also a section about FS-1015 in the book:
Panos E. Papamichalis, Practical Approaches to Speech Coding,
Prentice-Hall, 1987.
* The voicing classifier used in the enhanced LPC-10 (LPC-10e) is
described in: Campbell, Joseph P., Jr. and T. E. Tremain, "Voiced/
Unvoiced Classification of Speech with Applications to the U.S.
Government LPC-10E Algorithm," Proceedings of the IEEE International
Conf. on Acoustics, Speech, and Signal Processing, 1986, p. 473-6.
* Copies of the official standard, "Federal Standard 1016, Tele-
communications: Analog to Digital Conversion of Radio Voice by 4,800
bit/second Code Excited Linear Prediction (CELP)" are available for
US$ 5.00 each from:
GSA Federal Supply Service Bureau
Specification Section, Suite 8100
470 E. L'Enfant Place, S.W.
Washington, DC 20407
(202)755-0325
* Realtime DSP code for FS-1015 and FS-1016 is sold by:
John DellaMorte, DSP Software Engineering
165 Middlesex Tpk, Suite 206
Bedford, MA 01730, USA
Ph: 1-617-275-3733 Fax: 1-617-275-4323
dspse.bedford@channel1.com
* DSP Software Engineering's FS-1016 code can run on a DSP Research's
Tiger 30 (a PC board with a TMS320C3x and analog interface suited
to development work).
DSP Research
1095 E. Duane Ave.
Sunnyvale, CA 94086, USA
Ph: (408)773-1042 Fax: (408)736-3451
Package: 32 kbps ADPCM
Platform: SGI and Sun Sparcs
Description: 32 kbps ADPCM C-source code (G.721 compatibility is uncertain)
Contact: Jack Jansen
Availability: Anonymous ftp to ftp.cwi.nl: pub/adpcm.shar
Package: GSM 06.10 Compression
Platform: Runs faster than real time on most Sun SPARCstations
Description: GSM 06.10 is a lossy compression technique.
European GSM 06.10 provisional standard for full-rate speech
transcoding, prI-ETS 300 036, which uses RPE/LTP (regular
pulse excitation/long term prediction) coding at 13 kbit/s.
Contact: Carsten Bormann <cabo@cs.tu-berlin.de>
Availability: An implementation can be ftp'ed from:
tub.cs.tu-berlin.de: /pub/tubmik/gsm-1.0.tar.Z
+/pub/tubmik/gsm-1.0-patch1
or as a faster but not always up-to-date alternative:
liasun3.epfl.ch: /pub/audio/gsm-1.0pl1.tar.Z
Package: G.721/722/723 Compression
Description: ?
Availability: By email to teledoc@itu.arcom.ch, with
GET ITU-3022
as the *only* line in the body of the message.
This is also available by anonymous ftp from:
svr-ftp.eng.cam.ac.uk:comp.speech/sources/G711_G722_G723.tar.Z
Package: U.S.F.S. 1016 CELP vocoder for DSP56001
Platform: DSP56001
Description: Real-time U.S.F.S. 1016 CELP vocoder that runs on a single
27MHz Motorola DSP56001. Free demo software available from PC-56
and PC-56D. Source and object code available for a one-time
license fee.
Contact: Cole Erskine
Analogical Systems
2916 Ramona St.
Palo Alto, CA 94306, USA
Tel:(415) 323-3232 FAX:(415) 323-4222
Internet: cole@analogical.com
Product: 8 Kbit/s CELP on the TMS320C5x family of DSP chips.
Description: For low bandwidth transmission of voice, compact voice storage
for archival purposes, low-cost digital answering machines and
efficient storage for voice mail. Features :-
- near toll quality at 8 Kb/s.
- Variable rate option with 1 Kb/s silence encoding
- Implemented on a fixed-point processor for lower system cost.
- Attractive licensing scheme.
- Future availability of 4 Kb/s.
- Custom rates possible.
Capacity :-
- Two half-duplex or one full duplex channels on the 20 MIPS 'C5x
(at 95% and 55% CPU utilization respectively).
- Two full duplex channels on the 28.6 MIPS 'C5x
(at 77% CPU utilization).
- Requires 9 K-words program memory and 3 K-words data memory.
- Decoding in real-time on a 486 class CPU.
Contact: CVI Inc.
443 Vienna Cres. North Vancouver, BC, Canada V7N 3B3
Tel: (604) 987 1719 Fax: (604) 986 8139
Email: cvi@extropia.wimsey.com
=======================================================================
PART 4 - Speech Synthesis
Q4.1: What is speech synthesis?
Speech synthesis is the task of transforming written input to spoken output.
The input can either be provided in a graphemic/orthographic or a phonemic
script, depending on its source.
------------------------------------------------------------------------
Q4.2: How can speech synthesis be performed?
There are several algorithms. The choice depends on the task they're used
for. The easiest way is to just record the voice of a person speaking the
desired phrases. This is useful if only a restricted volume of phrases and
sentences is used, e.g. messages in a train station, or schedule information
via phone. The quality depends on the way recording is done.
More sophisticated, but worse in quality, are algorithms which split the
speech into smaller pieces. The smaller those units are, the fewer of them
are needed, but the quality also decreases. An often used unit is the phoneme,
the smallest linguistic unit. Depending on the language used there are about
35-50 phonemes in western European languages, i.e. there are 35-50 single
recordings. The problem is combining them, as fluent speech requires fluent
transitions between the elements. The intelligibility is therefore lower, but
the memory required is small.
A solution to this dilemma is using diphones. Instead of splitting at the
transitions, the cut is done at the center of the phonemes, leaving the
transitions themselves intact. This gives about 400 elements (20*20) and
the quality increases.
The longer the units become, the more elements there are, but the quality
increases along with the memory required. Other units which are widely used
are half-syllables, syllables, words, or combinations of them, e.g. word stems
and inflectional endings.
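
The inventory sizes quoted above follow from simple counting: if any phoneme
may be followed by any phoneme, an inventory of n phonemes gives at most n*n
diphones. The small function below (its name is invented for this sketch)
makes the arithmetic explicit. The 20*20 = 400 figure corresponds to n = 20;
for the 35-50 phonemes mentioned earlier the same upper bound is 1225-2500,
although it is assumed here, not stated in this FAQ, that a real language
uses fewer because some combinations never occur.

```c
#include <assert.h>

/*
 * Upper bound on the number of diphones for a given phoneme inventory,
 * assuming every ordered pair of phonemes can occur.
 * (Hypothetical helper for illustration.)
 */
unsigned long diphone_count(unsigned long phonemes)
{
    assert(phonemes <= 65535UL);   /* keep the product within 32 bits */
    return phonemes * phonemes;
}
```

So diphone_count(20) is 400, diphone_count(35) is 1225 and diphone_count(50)
is 2500, which shows how quickly the memory requirement grows as the units
get longer.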
------------------------------------------------------------------------
Q4.3: What are some good references/books on synthesis?
The following are good introductory books/articles.
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal Processing,
1987.
D. H. Klatt, "Review of Text-To-Speech Conversion for English", Jnl. of
the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.
I. H. Witten. Principles of Computer Speech.
(London: Academic Press, Inc., 1982).
John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech:
The MITalk System", Cambridge University Press, 1987.
------------------------------------------------------------------------
Q4.4: What software/hardware is available?
There appears to be very little Public Domain or Shareware speech synthesis
related software available for FTP. However, the following are available.
Strictly speaking, not all the following sources are speech synthesis - all
are speech output systems. They are in no particular order.
SIMTEL-20
The following is a list of speech related software available from SIMTEL-20
and its mirror sites for PCs.
The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20].
Try looking at your nearest archive site first.
Directory PD1:<MSDOS.VOICE>
Filename Type Length Date Description
==============================================
AUTOTALK.ARC B 23618 881216 Digitized speech for the PC
CVOICE.ARC B 21335 891113 Tells time via voice response on PC
HEARTYPE.ARC B 10112 880422 Hear what you are typing, crude voice synth.
HELPME2.ARC B 8031 871130 Voice cries out 'Help Me!' from PC speaker
SAY.ARC B 20224 860330 Computer Speech - using phonemes
SPEECH98.ZIP B 41003 910628 Build speech (voice) on PC using 98 phonemes
TALK.ARC B 8576 861109 BASIC program to demo talking on a PC speaker
TRAN.ARC B 39766 890715 Repeats typed text in digital voice
VDIGIT.ZIP B 196284 901223 Toolkit: Add digitized voice to your programs
VGREET.ARC B 45281 900117 Voice says good morning/afternoon/evening
Package: ORATOR Text-to-Speech Synthesizer
Platform: SUN SPARC, Decstation 5000. Portable to other UNIX platforms.
Description: Sophisticated speech synthesis package. Has text preprocessing
(for abbreviations, numbers), acronym citation rules, and human-like
spelling routines. High accuracy for pronunciation of names of
people, places and businesses in America, text-to-speech translation
for common words; rules for stress and intonation marking, based on
natural-sounding demisyllable synthesis; various methods of user
control and customization at most stages of processing. Currently,
ORATOR is most appropriate for applications containing a large
component of names in the text, and requires some amount of user-
specified text-preprocessing to produce good quality speech for
general text.
Hardware: Standard audio output of SPARC, or Decstation audio hardware.
At least 16M of memory recommended.
Cost: Binary License: $5,000.
Source license for porting or commercial use: $30,000.
Availability: Contact Bellcore's Licensing Office (1-800-527-1080)
or email: jzilg@cc.bellcore.com (John Zilg)
Package: Text to phoneme program (1)
Platform: unknown
Description: Text to phoneme program. Based on Naval Research Lab's
set of text to phoneme rules.
Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
/pub/src/phon.tar.Z
Package: Text to phoneme program (2)
Platform: unknown
Description: Text to phoneme program.
Availability: By FTP from "wuarchive.wustl.edu" in the file
/mirrors/unix-c/utils/phoneme.c
Package: Text to phoneme program (3)
Description: A public domain version of the same Naval Research Lab
text to phoneme rules.
Availability: By anonymous ftp from
svr-ftp.eng.cam.ac.uk:comp.speech/sources/english2phoneme.shar
Package: Text to speech program
Description: An implementation of the Klatt phoneme to waveform speech
synthesiser.
Availability: By anonymous ftp from
svr-ftp.eng.cam.ac.uk:comp.speech/sources/klatt-0.02.tar.Z
Package: "Speak" - a Text to Speech Program
Platform: Sun SPARC
Description: Text to speech program based on concatenation of pre-recorded
speech segments. A function library can be used to integrate
speech output into other code.
Hardware: SPARC audio I/O
Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
Package: TheBigMouth - a Text to Speech Program
Platform: NeXT
Description: Text to speech program based on concatenation of pre-recorded
speech segments. NeXT equivalent of "Speak" for Suns.
Availability: try NeXT archive sites such as sonata.cc.purdue.edu.
Package: TextToSpeech Kit
Platform: NeXT Computers
Description: The TextToSpeech Kit does unrestricted conversion of English
text to synthesized speech in real-time. The user has control over
speaking rate, median pitch, stereo balance, volume, and intonation
type. Text of any length can be spoken, and messages can be queued
up, from multiple applications if desired. Real-time controls such
as pause, continue, and erase are included. Pronunciations are
derived primarily by dictionary look-up. The Main Dictionary has
nearly 100,000 hand-edited pronunciations which can be supplemented
or overridden with the User and Application dictionaries. A number
parser handles numbers in any form. A letter-to-sound knowledge base
provides pronunciations for words not in the Main or customized
dictionaries. Dictionary search order is under user control.
Special modes of text input are available for spelling and emphasis
of words or phrases. The actual conversion of text to speech is done
by the TextToSpeech Server. The Server runs as an independent task
in the background, and can handle up to 50 client connections.
Misc: The TextToSpeech Kit comes in two packages: the Developer Kit and the
User Kit. The Developer Kit enables developers to build and test
applications which incorporate text-to-speech. It includes the
TextToSpeech Server, the TextToSpeech Object, the pronunciation
editor PrEditor, several example applications, phonetic fonts,
example source code, and developer documentation. The User Kit
provides support for applications which incorporate text-to-speech.
It is a subset of the Developer Kit.
Hardware: Uses standard NeXT Computer hardware.
Cost: TextToSpeech User Kit: $175 CDN ($145 US)
TextToSpeech Developer Kit: $350 CDN ($290 US)
Upgrade from User to Developer Kit: $175 CDN ($145 US)
Availability: Trillium Sound Research
1500, 112 - 4th Ave. S.W., Calgary, Alberta, Canada, T2P 0H3
Tel: (403) 284-9278 Fax: (403) 282-6778
Order Desk: 1-800-L-ORATOR (US and Canada only)
Email: manzara@cpsc.UCalgary.CA
Package: SENSYN speech synthesizer
Platform: PC, Mac, Sun, and NeXT
Rough Cost: $300
Description: This formant synthesizer produces speech waveform files
based on the (Klatt) KLSYN88 synthesizer. It is intended
for laboratory and research use. Note that this is NOT a
text-to-speech synthesizer, but creates speech sounds based
upon a large number of input variables (formant frequencies,
bandwidths, glottal pulse characteristics, etc.) and would
be used as part of a TTS system. Includes full source code.
Availability: Sensimetrics Corporation, 64 Sidney Street, Cambridge MA 02139.
Fax: (617) 225-0470; Tel: (617) 225-2442.
Email: sensimetrics@sens.com
Package: SPCHSYN.EXE
Platform: PC?
Availability: By anonymous ftp from evans.ee.adfa.oz.au (131.236.30.24)
in /mirrors/tibbs/Applications/SPCHSYN.EXE
It is a self extracting DOS archive.
Requirements: May require special TI product(s), but all source is there.
Package: CSRE: Canadian Speech Research Environment
Platform: PC
Cost: Distributed on a cost recovery basis
Description: CSRE is a software system which includes in addition to the
Klatt speech synthesizer, SPEECH ANALYSIS and EXPERIMENT CONTROL
SYSTEM. A paper about the whole package can be found in:
Jamieson D.G. et al, "CSRE: A Speech Research Environment", Proc.
of the Second Intl. Conf. on Spoken Language Processing, Edmonton:
University of Alberta, pp. 1127-1130.
Hardware: Can use a range of data acquisition/DSP hardware
Availability: For more information about the availability of this software
contact Krystyna Marciniak - email march@uwovax.uwo.ca
Tel (519) 661-3901 Fax (519) 661-3805.
For technical information email ramji@uwovax.uwo.ca
Note: A more detailed description is given in Q1.8 on speech environments.
Package: JSRU
Platform: UNIX and PC
Cost: 100 pounds sterling (from academic institutions and industry)
Description: A C version of the JSRU system, Version 2.3 is available.
It's written in Turbo C but runs on most Unix systems with very
little modification. A Form of Agreement must be signed to say
that the software is required for research and development only.
Contact: Dr. E.Lewis (eric.lewis@uk.ac.bristol)
Package: Klatt-style synthesiser
Platform: Unix
Cost: Free
Description: Software posted to comp.speech in late 1992.
Availability: By anonymous ftp from the comp.speech archives as
svr-ftp.eng.cam.ac.uk:/comp.speech/sources/klatt-0.02.tar.Z
Package: Speech Manager and PlainTalk
Platform: Macintosh
Cost: Free
Description: Apple's new text-to-speech system extension(s) that enable
applications (listed below) to perform text-to-speech
conversion. The Speech Manager runs on most Macs, but PlainTalk
(and the high quality voices) requires a 68020 Mac or better.
Availability: By anonymous ftp from:
ftp.apple.com:/dts/mac/sys.soft/speech
There are 3 files in this directory:
6273632 Aug 14 22:51 macintalk-pro.hqx
PlainTalk Text-To-Speech 1.0 speech synthesizer
extension (includes Female Voice, Compressed);
TTS Female Voice; TTS Male Voice; and
TTS Male Voice, Compressed. Requires 68020 or better!
370108 Aug 13 04:30 speech-manager-docs.hqx
Apple DocViewer format (Inside Macintosh style,
no installation instructions - just drag everything
onto your closed System Folder).
262569 Aug 7 07:01 speech-manager.hqx
Speech Manager 1.1.1 (includes Marvin's voice) and
MacInTalk Voices 1.1.1 (9 more voices). Runs on most Macs.
Package: Various Mac Speech Output Applications
Platform: Macintosh
Cost: Free (except for At Ease)
Description: Some of the Speech Manager aware text-to-speech (TTS)
applications, etc. are listed below (there are more on the
Apple Developer CD-ROMs).
Application, etc. Source Comments
_________________ ________ _________________________________________________
AddressSpeech info-mac 4D talking address book (from Speech Pack 2.0)
At Ease 2.0 MacWarehouse Friendly desktop that speaks file names
At Ease 2.0 WG MacWarehouse Friendly desktop that speaks file names
Eliza 3.1 AOL Talking Eliza (Rogerian psych therapist)
FB speech Inside Basic Mag, volume 3, no. 6. FutureBasic demo
FB Speech demo Inside Basic Mag, volume 3, no. 7. FutureBasic demo
Fortune 1.1 info-mac Like a talking UNIX fortune command - slick
Homer 0.92d9 zaphod.ee.pitt.edu GUI IRC client, assign nicks voices - slick
MacMessage 1.0 FirstClassBBS Share talking messages/customizable startup
Say info-mac MPW Tool which converts standard input to speech
ScriptTools 1.2 info-mac Write AppleScript scripts to say text messages
Siege Watch 1.01f info-mac Wryly political speaking clock
SoToSpeak1.0.0b10 info-mac Two voice conversation (also see Fortune's About)
Speak It! info-mac Type in a message and have it spoken
Speaker 1.11 info-mac Simple text file editor, speaks on <CR>, macros
Speecher 1.2.1 info-mac Customizable word pronunciation/substitution
SpeechManagerdemo info-mac Command line interface, C source, aka -explorer
Speech Pack 2.0 info-mac 4th Dimension external, add speech to database
SpeechUnitEx info-mac Pascal source code for speech in Lab 7
speek-02b info-mac Speech XCMD for HyperCard
TalkingClockPro2.0 info-mac AppleScriptable talking clock extension (2.0b0)
TeachText 7.2 AV Mac Apple's talking TeachText (simple editor w/QT)
Tex-Edit 1.9 AOL Talking word processor, McSink like, modeming
VoiceDemo 1.0.1 info-mac Bare bones phrase talker
Welcome!v1.3.1 info-mac A talking Welcome to Macintosh startup
? ? Talking Plug-In-Module for MS Word 5,
experimental, unsupported, buggy, beware!
Speech Rhythms AOL A cool text file for one of the above apps
_____
Sources:
AOL = America Online
info-mac = {ftp sumex-aim.stanford.edu, ftp wuarchive.wustl.edu, et al.}
MacWarehouse = (800) 255-6227
Apple's work in spoken language technologies and systems is described in:
Lee, Kai-Fu. "The Conversational Computer: An Apple Perspective."
(Keynote Speech) In Proc. Eurospeech in Berlin, ESCA, September, 1993.
Package: MacinTalk
Platform: Macintosh
Cost: Free
Description: Formant based speech synthesis.
There is also a program called "tex-edit" which apparently
can pronounce English sentences reasonably using Macintalk.
Note: MacinTalk doesn't run reliably on Macintoshes with new
sound hardware under the latest OS (System 7.1 w/HUD 2.0).
More recent software is listed above.
Availability: By anonymous ftp from many archive sites (have a look on
archie if you can). tex-edit is on many of the same sites. Try
wuarchive.wustl.edu:/mirrors2/info-mac/Old/card/macintalk.hqx[.Z]
/macintalk-stack.hqx[.Z]
wuarchive.wustl.edu:/mirrors2/info-mac/app/tex-edit-15.hqx
Package: Tinytalk
Platform: PC
Description: A shareware speech 'screen reader' which is used
by many blind users.
Availability: By anonymous ftp from handicap.shel.isc-br.com.
Get the files /speech/ttexe145.zip & /speech/ttdoc145.zip.
Package: Narrator - narrator.device
Platform: Amiga
Description: Formant based speech synthesis. Includes an English-to-phoneme
translation library, and a SPEAK: pseudo-device for speech
output.
Hardware: Standard Amiga hardware
Availability: Part of AmigaOS
Package: Bliss
Contact: Dr. John Mertus (Brown University) Mertus@browncog.bitnet
Package: xxx
Platform: (PC, Mac, Sun, NeXT etc)
Rough Cost: (if appropriate)
Description: (keep it brief)
Hardware: (requirement list)
Availability: (ftp info, email contact or company contact)
Can anyone provide information on the following:
INFOVOX (apparently multi-lingual)
MultiVoice
Monolog
TrueSpeech from DSP Group Inc.
Please email or post suitable information for this list. Commercial,
public domain and research packages are all appropriate.
[Perhaps someone would like to start a separate posting on this area.]
=======================================================================
PART 5 - Speech Recognition
Q5.1: What is speech recognition?
Automatic speech recognition is the process by which a computer maps an
acoustic speech signal to text.
Automatic speech understanding is the process by which a computer maps an
acoustic speech signal to some form of abstract meaning of the speech.
------------------------------------------------------------------------
Q5.2: How can I build a very simple speech recogniser?
Doug Danforth provides a detailed account in article 253 in the comp.speech
archives - also available as file info/DIY_Speech_Recognition.
The first part is reproduced here.
QUICKY RECOGNIZER sketch:
Here is a simple recognizer that should give you 85%+ recognition
accuracy. The accuracy is a function of WHAT words you have in
your vocabulary. Long distinct words are easy. Short similar
words are hard. You can get 98+% on the digits with this recognizer.
Overview:
(1) Find the beginning and end of the utterance.
(2) Filter the raw signal into frequency bands.
(3) Cut the utterance into a fixed number of segments.
(4) Average data for each band in each segment.
(5) Store this pattern with its name.
(6) Collect training set of about 3 repetitions of each pattern (word).
(7) Recognize unknown by comparing its pattern against all patterns
in the training set and returning the name of the pattern closest
to the unknown.
Many variations upon the theme can be made to improve the performance.
Try different filtering of the raw signal and different processing methods.
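The seven steps above can be sketched in Python roughly as follows. This is
a minimal illustration under stated assumptions, not Doug Danforth's code:
the function names, the 8 kHz sample rate, the four band centre frequencies,
the energy threshold and the use of the Goertzel algorithm in place of a
real filter bank are all choices made here for brevity.

```python
import math

def find_endpoints(samples, frame=80, threshold=0.01):
    """Step 1: crude endpointing by mean frame energy (threshold is a guess)."""
    energies = [sum(x * x for x in samples[i:i + frame]) / frame
                for i in range(0, len(samples) - frame + 1, frame)]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return 0, len(samples)      # nothing above threshold: keep everything
    return active[0] * frame, (active[-1] + 1) * frame

def band_power(block, freq, rate):
    """Step 2: power near one frequency via the Goertzel algorithm,
    standing in for a band-pass filter (an assumption of this sketch)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in block:
        s1, s2 = x + coeff * s1 - s2, s1
    n = max(1, len(block))
    return (s1 * s1 + s2 * s2 - coeff * s1 * s2) / (n * n)

def make_pattern(samples, rate=8000, bands=(300, 600, 1200, 2400), segments=8):
    """Steps 3-4: cut into a fixed number of segments and keep one
    log band power per band per segment."""
    start, end = find_endpoints(samples)
    utterance = samples[start:end]
    seg_len = max(1, len(utterance) // segments)
    pattern = []
    for s in range(segments):
        block = utterance[s * seg_len:(s + 1) * seg_len]
        pattern.extend(math.log(band_power(block, f, rate) + 1e-6)
                       for f in bands)
    return pattern

def train(examples):
    """Steps 5-6: examples is a list of (name, samples) pairs;
    every repetition is stored as its own pattern."""
    return [(name, make_pattern(samples)) for name, samples in examples]

def recognise(samples, templates):
    """Step 7: return the name of the closest stored pattern
    (squared Euclidean distance)."""
    unknown = make_pattern(samples)
    def distance(template):
        return sum((a - b) ** 2 for a, b in zip(unknown, template[1]))
    return min(templates, key=distance)[0]
```

Typical use would be to call train() on about three recordings of each word
and then recognise() on a new recording. Samples are assumed to be floats in
the range -1.0 to 1.0; the threshold needs retuning for other scalings.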
------------------------------------------------------------------------
Q5.2: What does speaker dependent/adaptive/independent mean?
A speaker dependent system is developed (trained) to operate for a single
speaker. These systems are usually easier to develop, cheaper to buy and
more accurate, but are not as flexible as speaker adaptive or speaker
independent systems.
A speaker independent system is developed (trained) to operate for any
speaker or speakers of a particular type (e.g. male/female, American/English).
These systems are the most difficult to develop and the most expensive, and
their accuracy is currently lower. They are, however, the most flexible.
A speaker adaptive system is developed to adapt its operation for new
speakers that it encounters, usually based on a general model of speaker
characteristics. It lies somewhere between speaker independent and speaker
dependent systems.
Each type of system is suited to different applications and domains.
------------------------------------------------------------------------
Q5.3: What does small/medium/large/very-large vocabulary mean?
The size of vocabulary of a speech recognition system affects the complexity,
processing requirements and the accuracy of the system. Some applications
only require a few words (e.g. numbers only), others require very large
dictionaries (e.g. dictation machines).
There are no established definitions but the following may be a helpful guide.
small vocabulary - tens of words
medium vocabulary - hundreds of words
large vocabulary - thousands of words
very-large vocabulary - tens of thousands of words.
------------------------------------------------------------------------
Q5.4: What does continuous speech or isolated-word mean?
An isolated-word system operates on single words at a time - requiring a
pause between saying each word. This is the simplest form of recognition
to perform, because the pronunciation of the words tends not to affect each
other. Because the occurrences of each particular word are similar they are
easier to recognise.
A continuous speech system operates on speech in which words are connected
together, i.e. not separated by pauses. Continuous speech is more difficult
to handle because of a variety of effects. First, it is difficult to find
the start and end points of words. Another problem is "coarticulation".
The production of each phoneme is affected by the production of surrounding
phonemes, and similarly the start and end of words are affected by the
preceding and following words. The recognition of continuous speech is also
affected by the rate of speech (fast speech tends to be harder).
------------------------------------------------------------------------
Q5.5: How is speech recognition done?
A wide variety of techniques are used to perform speech recognition.
There are many types of speech recognition. There are many levels of
speech recognition/processing/understanding.
Typically speech recognition starts with the digital sampling of speech.
The next stage would be acoustic signal processing. Common techniques
include a variety of spectral analyses, LPC analysis, the cepstral transform,
cochlea modelling and many, many more.
The next stage will typically try to recognise phonemes, groups of phonemes
or words. This stage can be achieved by many processes such as DTW (Dynamic
Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
sometimes expert systems. In crude terms, all these processes try to recognise
the patterns of speech. The most advanced systems are statistically
motivated.
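As an illustration of one of the techniques named above, here is a minimal
sketch of DTW in Python. It assumes each utterance has already been turned
into a sequence of feature vectors by the signal processing stage; the
function names, the Euclidean frame distance and the unconstrained step
pattern are choices made for this sketch, and practical systems usually add
slope constraints and length normalisation.

```python
import math

def frame_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b):
    """Cost of the cheapest monotonic alignment between two non-empty
    sequences of feature vectors."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j],      # seq_a advances, seq_b frame repeats
                       cost[i][j - 1],      # seq_b advances, seq_a frame repeats
                       cost[i - 1][j - 1])  # both advance
            cost[i][j] = frame_distance(seq_a[i - 1], seq_b[j - 1]) + step
    return cost[n][m]
```

An unknown word is then labelled with the name of the stored template that
gives the smallest dtw_distance. Because frames may repeat, a word spoken
slowly can still align well with a template spoken quickly.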
Some systems utilise knowledge of grammar to help with the recognition
process.
Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
process the speech input.
Some systems try to "understand" speech. That is, they try to convert the
words into a representation of what the speaker intended to mean or achieve
by what they said.
------------------------------------------------------------------------
Q5.6: What are some good references/books on recognition?
Some general introduction books on speech recognition:
Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang
Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993
ISBN 0-13-015157-2
Speech recognition by machine; W.A. Ainsworth
London: Peregrinus for the Institution of Electrical Engineers, c1988
Speech synthesis and recognition; J.N. Holmes
Wokingham: Van Nostrand Reinhold, c1988
Douglas O'Shaughnessy -- Speech Communication: Human and Machine
Addison Wesley series in Electrical Engineering: Digital Signal Processing,
1987.
Electronic speech recognition: techniques, technology and applications
edited by Geoff Bristow, London: Collins, 1986
Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
San Mateo: Morgan Kaufmann, c1990
More specific books/articles:
Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A.
Jack.
Edinburgh: Edinburgh University Press, c1990
Automatic speech recognition: the development of the SPHINX system;
by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989
Prosody and speech recognition; Alex Waibel
(Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988
S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the
Application of the Theory of Probabilistic Functions of a Markov Process
to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4),
pp1035--1074, April 1983
R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in
Neural Computation, v1(1), pp 1-38, 1989.
------------------------------------------------------------------------
Q5.7: What speech recognition packages are available?
Information is included below on the following packages:-
Voice Blaster Ver. 4.0
Votan
HTK (HMM Toolkit)
DragonDictate
VoiceServer for Windows
IN3 Voice Command for Windows
IN3 Voice Command
SayIt
Recnet
Voice Command Line Interface
DATAVOX
Package Name: Voice Blaster Ver. 4.0
Platform: IBM AT or higher, DOS or Windows 3.1
Description: Uses a Sound Blaster or compatible board. Contains a
microphone headset and a connector for LPT1:. A printer can
still be used on LPT1:. Will recognize 1024 words that are
trained by the operator. Each word activates a macro that can
enter an ascii word on the screen or into a word processor or
invoke a batch file. An optional footswitch may be installed.
Software to run under DOS or Windows 3.1 is included.
Cost: Around $150 Canadian.
Contact: COVOX Inc.
675 Conger Street
Eugene, Oregon
97402
Ph: (503) 342-1271 Fax: (503) 342-1283
BBS: (503) 342-4135
Package Name: Votan
Platform: MS-DOS, SCO UNIX
Description: Isolated word and continuous speech modes, speaker dependent
and (limited) speaker independent. Vocab size is 255 words or up to a
fixed memory limit - but it is possible to dynamically load different
words for effectively unlimited number of words.
Rough Cost: Approx US $1,000-$1,500
Requirements: Cost includes one Votan Voice Recognition ISA-bus board
for 386/486-based machines. A software development system is also
available for DOS and Unix.
Misc: Up to 8 Votan boards may co-exist for 8 simultaneous voice users.
A telephone interface is also available. There is also a 4GL and a
software development system.
Apparently there is more than one version - more info required.
Contact: 800-877-4756, 510-426-5600
Package Name: HTK (HMM Toolkit) - From Entropic
Platform: Range of Unix platforms.
Description: HTK is a software toolkit for building continuous density HMM
based speech recognisers. It consists of a number of library
modules and a number of tools. Functions include speech analysis,
training tools, recognition tools, results analysis, and an
interactive tool for speech labelling. Many standard forms of
continuous density HMM are possible. Can perform isolated word or
connected word speech recognition. It can model whole words or sub-
word units. Can perform speaker verification and other pattern
recognition work using HMMs. HTK is now integrated with the
ESPS/Waves speech research environment which is described in
Section 1.8 of this posting.
Misc: The availability of HTK changed in early 1993 when Entropic obtained
exclusive marketing rights to HTK from the developers at Cambridge.
Cost: On request.
Contact: Entropic Research Laboratory, Washington Research Laboratory,
600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
(202) 547-1420. email - info@wrl.epi.com
Package Name: DragonDictate-30K
Platform: PC
Description: Speaker dependent/adaptive system requiring words to be
separated by short pauses. Vocabulary of 25,000 words including
a "custom" word set.
Rough Cost: $5000
Requirements: Minimum of 20 Mhz 386 with 8M memory and 10M disk space
Contact: Dragon Systems Inc.
90 Bridge Street, Newton MA 02158
Tel: 1-617-965-5200, Fax: 1-617-527-0372
Package Name: VoiceServer for Windows
Platform: PC
Description: Speaker dependent, each with an independent directory.
Isolated word. Up to 1000 words/user, 300 words/window.
1 word occupies 2Kb on hard disk.
Can be used to control Windows applications by issuing
voice commands instead of menu selection.
Rough Cost: 292 Pounds(UK)
Requirements: None
Misc: Price includes a half-sized AT voice card (including a
DSP), software, documentation & a microphone (attachable to
keyboard or speaker). A light-weight high-spec headset is an
optional extra.
Contact: Mark Redwood
Applied Voice Technologies
26 Danbury Street, Islington,
London, UK, N1 8JU
Ph: + 44 71 454 1224 : Fax: + 44 71 454 1225
Package Name: IN3 Voice Command for Windows
Platform: PC with Windows 3.1
Description: IN3 is now available for MS-Windows. Users can call
applications to the foreground with voice commands. Once the
application is called, the user may enter commands and data with
voice commands. Voice macros can reduce the strain of repetitive
stress injuries (RSI) such as Carpal Tunnel Syndrome (CTS) by
replacing heavy repetitive keyboard hammering with simple voice
operations. Voice macros take complex operations and reduce them
to simple verbal commands. Voice input can provide new facilities
for tasks which could not easily have been otherwise performed
without the multiple axis of input. IN3 is hardware-independent:
users with any Windows-compatible audio board can add speech recognition to
the desktop. IN3 works with either 8 bit or 16 bit Windows audio
boards. IN3 is based on continuous word-spotting technology. A
developer API is also available for creating voice-enabled
applications.
Price: $179 U.S.
Requirements: PC with 80386 processor or better, Microsoft Windows 3.1, and
Windows compatible audio system with microphone.
Misc: Fully functional demos are available on Compuserve in various
Multimedia and CAD forums. Demos are also available from "America
on Line", the comp.binaries.ms-windows archive sites, and various
BBS systems.
An equivalent Sun product is described below.
Contact: Brantley Kelly
Email: cbk@gacc.atl.ga.us CIS: 75120,431
FAX: 1-404-925-7924 Phone: 1-404-925-7950
Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
Package Name: IN3 Voice Command
Platform: Sun SPARCstation
Description: IN3 provides a secure, robust, word spotting, continuous
speech recognition facility for the Sun OS or Solaris operating
systems. The recognition system is a secure operating system
facility capable of working with various interfaces, microphones,
and devices. The operating system interface works with native UNIX
outside of X Windows as well as provides enhanced X Windows facilities
including named window support. The user interface provides a
means to quickly create commands on the fly for replacing long strings
and complex operations with voice macros. [Voice macros can reduce
the strain of repetitive stress injuries (RSI) such as Carpal Tunnel
Syndrome (CTS) by replacing heavy repetitive keyboard hammering with
simple voice operations. ]
The IN3 user interface works with generic X servers and window
managers. A developer API is also available for creating voice-
enabled applications, interfacing with other audio sources, and
providing extensive application control over the recognition facility.
Availability: SunSite archive at SunSITE.unc.edu as well as on Catalyst
CDware as both a runnable demo and unlockable software.
Hardware Required: Sun SPARCstation with audio input.
Noise canceling microphone recommended but not required.
Software Required: Sun OS 4.1.2 with OpenWindows 3.0 or
Sun OS 4.1.3 or
Solaris 2.1 or Solaris 2.2
Misc: An equivalent MS-Windows product is described above.
Price: $495 U.S.
Contact: Brantley Kelly
Email: cbk@gacc.atl.ga.us CIS: 75120,431
FAX: 1-404-925-7924 Phone: 1-404-925-7950
Command Corp. Inc, 3675 Crestwood Parkway, Duluth GA 30136, USA
Package Name: SayIt
Platform: Sun SPARCstation
Description: Voice recognition and macro building package for Suns
in the Openwindows 3.0 environment. Speaker dependent discrete speech
recognition. Vocabularies can be associated to applications and the
active vocabulary follows the application that has input focus.
Macros can include mouse commands, keystrokes, Unix commands,
sound, Openwindow actions and more.
An evaluation copy is available by email.
Hardware: Microphone required (SunMicrophone is fine).
Cost: $US295
Contact: Phone: 1-800-245-UNIX or 1-415-572-0200
Fax: 1-415-572-1300
Email: info@qualix.com
Package Name: recnet
Platform: UNIX
Description: Speech recognition for the speaker independent TIMIT and
Resource Management tasks. It uses recurrent networks to estimate
phone probabilities and Markov models to find the most probable
sequence of phones or words. The system is a snapshot of evolving
research code. There is no documentation other than published
research papers. The components are:
1. A preprocessor which implements many standard and many non-
standard front end processing techniques.
2. A recurrent net recogniser and parameter files
3. Two Markov model based recognisers, one for phone recognition
and one for word recognition
4. A dynamic programming scoring package
The complete system performs competitively.
Cost: Free
Requirements: TIMIT and Resource Management databases
Contact: ajr@eng.cam.ac.uk (Tony Robinson)
Availability: by FTP from "svr-ftp.eng.cam.ac.uk" as /misc/recnet-1.3.tar.Z
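The general idea in the recnet description, a network supplying per-frame
phone probabilities and a Markov model picking the most probable sequence,
can be illustrated with a generic Viterbi search. The Python below is only a
sketch of that idea and is not taken from recnet; the function name, the
argument layout and the toy probabilities in the usage note are assumptions
made here.

```python
import math

def viterbi(frame_probs, trans, initial):
    """Most probable state sequence.

    frame_probs[t][s]: probability of state s at frame t (e.g. a net output)
    trans[r][s]:       probability of moving from state r to state s
    initial[s]:        probability of starting in state s
    """
    def safe_log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(initial)
    score = [safe_log(initial[s]) + safe_log(frame_probs[0][s])
             for s in range(n_states)]
    back = []                       # back[t][s] = best predecessor of s
    for probs in frame_probs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + safe_log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + safe_log(trans[best][s])
                         + safe_log(probs[s]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With two states, "sticky" transitions [[0.9, 0.1], [0.1, 0.9]], initial
probabilities [0.6, 0.4] and frame probabilities [0.9, 0.1], [0.4, 0.6],
[0.9, 0.1], [0.2, 0.8], [0.2, 0.8], the search smooths over the doubtful
second frame that a frame-by-frame choice would assign to the other state.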
Package Name: Voice Command Line Interface
Platform: Amiga
Description: VCLI will execute CLI commands, ARexx commands, or ARexx
scripts by voice command through your audio digitizer. VCLI allows
you to launch multiple applications or control any program with an
ARexx capability entirely by spoken voice command. VCLI is fully
multitasking and will run in the background, continuously listening
for your voice commands even while other programs are running.
Documentation is provided in AmigaGuide format.
VCLI 6.0 runs under either Amiga DOS 2.0 or 3.0.
Cost: Free?
Requirements: Supports the DSS8, PerfectSound 3, Sound Master, Sound Magic,
and Generic audio digitizers.
Availability: by ftp from wuarchive.wustl.edu in the file
systems/amiga/incoming/audio/VCLI60.lha and from
amiga.physik.unizh.ch as the file pub/aminet/util/misc/VCLI60.lha
Contact: Author's email is RHorne@cup.portal.com
Package Name: DATAVOX - French
Platform: PC
Description: Continuous speech - speaker independent or dependent.
Rough Cost: ?
Requirements: 2 PC format boards (RdF1000 and TdS 96/25) and an
A/D - D/A module (ASA116)
Misc: Application software may dialog with DATAVOX through 2 types
of interfaces :
1) Keyboard overlay
The application software may be used with any PC compatible
package. No specific adaptation is necessary, you only need
to define your configuration with the application software.
2) C library
Allows a user-written program to drive the recognition system.
DATAVOX is based on the AMADEUS speech recognition software
developed at LIMSI. It provides
- Continuous speech recognition with
* speaker dependent : 500 words
* speaker independent : 50 words (custom-made vocabulary).
- Grammar of the application language (syntax acquisition,
verification and simplification software).
- Large vocabulary : DATAVOX can recognize vocabularies of several
thousand words as long as there are no more than 500 words in the
active vocabulary at any given node. It takes less than 1 second
to change syntax and vocabulary.
- Training controlled by the system (use of co-articulation models).
- Response time less than 500 ms for any phrase length.
- Synthesis (ADPCM) can be heard simultaneously while recognition
is being carried out.
Contact: VECSYS, Le Chene rond, 91570 Bievres, France
Fax: 33 1 69 41 24 30
Voice: 33 1 69 41 15 04
Package Name: xxx
Platform: PC, Mac, UNIX, Amiga ....
Description: (e.g. isolated word, speaker independent...)
Rough Cost: (if applicable)
Requirements: (hardware/software needs - if applicable)
Misc:
Contact: (email, ftp or address)
Can anyone provide info on
Verbex Listen for Windows
Voice Navigator (from Articulate Systems)
SRI Recognisers
BBN Recognisers
Can you provide information on any other software/hardware/packages?
Commercial, public domain and research packages are all appropriate.
=======================================================================
PART 6 - Natural Language Processing
There is now a newsgroup specifically for Natural Language Processing.
It is called comp.ai.nat-lang.
There is also a lot of useful information on Natural Language Processing
in the FAQ for comp.ai. That FAQ lists available software and useful
references. It includes a substantial list of software, documentation
and other info available by ftp.
------------------------------------------------------------------------
Q6.1: What are some good references/books on NLP?
Take a look at the FAQ for the "comp.ai" newsgroup as it also includes some
useful references.
James Allen: Natural Language Understanding. (Benjamin/Cummings Series in
Computer Science) Menlo Park: Benjamin/Cummings Publishing Company, 1987.
This book consists of four parts: syntactic processing, semantic
interpretation, context and world knowledge, and response generation.
G. Gazdar and C. Mellish, Natural Language Processing in {Prolog/Lisp/Pop11},
Addison Wesley, 1989
Emphasis on parsing, especially unification-based parsing, lots of
details on the lexicon, feature propagation, etc. Fair coverage of
semantic interpretation, inference in natural language processing,
and pragmatics; much less extensive than in Allen's book, but more
formal. There are three versions, one for each programming language
listed above, with complete code.
Shapiro, Stuart C.: Encyclopedia of Artificial Intelligence Vol.1 and 2.
New York: John Wiley & Sons, 1990.
There are articles on the different areas of natural language
processing which also give additional references.
Paris, Ce'cile L.; Swartout, William R.; Mann, William C.: Natural Language
Generation in Artificial Intelligence and Computational Linguistics. Boston:
Kluwer Academic Publishers, 1991.
The book describes the most current research developments in natural
language generation and all aspects of the generation process are
discussed. The book is comprised of three sections: one on text
planning, one on lexical choice, and one on grammar.
Readings in Natural Language Processing, ed by B. Grosz, K. Sparck Jones
and B. Webber, Morgan Kaufmann, 1986
A collection of classic papers on Natural Language Processing.
Fairly complete at the time the book came out (1986) but now
seriously out of date. Still useful for ATN's, etc.
Klaus K. Obermeier, Natural Language Processing Technologies
in Artificial Intelligence: The Science and Industry Perspective,
Ellis Horwood Ltd, John Wiley & Sons, Chichester, England, 1989.
The major journals of the field are "Computational Linguistics" and
"Cognitive Science" for the artificial intelligence aspects, "Cognition"
for the psychological aspects, "Language", "Linguistics and Philosophy" and
"Linguistic Inquiry" for the linguistic aspects. "Artificial Intelligence"
occasionally has papers on natural language processing.
The major conferences are ACL (held every year) and COLING (held every two
years). Most AI conferences have a NLP track; AAAI, ECAI, IJCAI and the
Cognitive Science Society conferences usually are the most interesting for
NLP. CUNY is an important psycholinguistic conference. There are lots of
linguistic conferences: the most important seem to be NELS, the conference
of the Chicago Linguistic Society (CLS), WCCFL, LSA, the Amsterdam Colloquium,
and SALT.
------------------------------------------------------------------------
Q6.2: What NLP software is available?
The FAQ for the "comp.ai" newsgroup lists a variety of language processing
software that is available. That FAQ is posted monthly.
Natural Language Software Registry (NLSR)
=========================================
The Natural Language Software Registry is available from the German Research
Institute for Artificial Intelligence (DFKI) in Saarbrucken. Its purpose
is to facilitate the exchange and evaluation of natural language processing
software within the research community. To this end, the NLSR is
cataloging natural language software projects, both commercial and non-
commercial. The new updated and enlarged version contains more than 100
descriptions of natural language processing software. Registry listings include:
+ speech signal processors, such as the Computerized Speech Lab
(Kay Electronics)
+ morphological analyzers, such as PC-KIMMO
(Summer Institute for Linguistics)
+ parsers, such as Alveytools (University of Edinburgh)
+ semantic and pragmatic analyzers, such as NLL
(University of the Saarland, Germany)
+ generation programs, such as FUF
(Ben Gurion University of the Negev)
+ knowledge representation systems, such as Rhet
(University of Rochester)
+ multicomponent systems, such as ELU (ISSCO), PENMAN (ISI),
Pundit (UNISYS), SNePS (SUNY Buffalo)
+ NLP-Tools, such as GULP (University of Georgia) or Linguist
(Kansai Research Laboratory)
+ applications programs (misc.)
If you have developed a piece of software for natural language
processing that other researchers might find useful, you can include
it by returning the questionnaire available from the sources below.
ftp: Germany: ftp.dfki.uni-sb.de (134.96.188.252)
(directory: pub/registry, password:anonymous)
e-mail: registry@dfki.uni-sb.de
post: Natural Language Software Registry
Deutsches Forschungsinstitut fuer Kuenstliche Intelligenz (DFKI)
Stuhlsatzenhausweg 3
D-66123 Saarbruecken
Germany
Other ftp sites are
crlftp.nmsu.edu (128.123.1.33)
The directory is pub/non-lexical/NL_Software_Registy
dri.cornell.edu (128.84.180.39)
The directory is /pub/Natural_Language_Software_Registry
or /pub/NLSR
Andrew Hunt
Speech Technology Research Group Ph: 61-2-692 4509
Dept. of Electrical Engineering Fax: 61-2-692 3847
University of Sydney, NSW, 2006, Australia email: andrewh@speech.su.oz.au
******************************************************************************