
Simple audio recognition: Recognizing keywords

This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic automatic speech recognition (ASR) model for recognizing eight different words. You will use a portion of the Speech Commands dataset (Warden, 2018), which contains short (one-second or less) audio clips of commands, such as "down", "go", "left", "no", "right", "stop", "up" and "yes".

Real-world speech and audio recognition systems are complex. But, like image classification with the MNIST dataset, this tutorial should give you a basic understanding of the techniques involved.

Import necessary modules and dependencies. You'll be using tf.keras.utils.audio_dataset_from_directory (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of .wav files. You'll also need seaborn for visualization in this tutorial.

Import the mini Speech Commands dataset

To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The original dataset consists of over 105,000 audio files in the WAV (Waveform) audio file format of people saying 35 different words. This data was collected by Google and released under a CC BY license.

Download and extract the mini_speech_commands.zip file containing the smaller Speech Commands dataset with tf.keras.utils.get_file:

The dataset's audio clips are stored in eight folders corresponding to each speech command: no, yes, down, go, left, up, right, and stop:

Divided into directories this way, you can easily load the data using keras.utils.audio_dataset_from_directory.

The audio clips are 1 second or less at 16kHz. The output_sequence_length=16000 pads the short ones to exactly 1 second (and would trim longer ones) so that they can be easily batched.
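To make the padding and trimming behaviour concrete, here is a minimal sketch in plain Python. It is not the Keras implementation; the helper name `pad_or_trim` and the example clip lengths are assumptions made for illustration only.

```python
def pad_or_trim(samples, target_length=16000):
    """Return a copy of `samples` that is exactly `target_length` long."""
    if len(samples) >= target_length:
        # Clips that are too long are cut off at the target length.
        return list(samples[:target_length])
    # Clips that are too short are extended with trailing zeros (silence).
    padding = [0.0] * (target_length - len(samples))
    return list(samples) + padding

short_clip = [0.1] * 12000   # 0.75 s at 16 kHz
long_clip = [0.1] * 20000    # 1.25 s at 16 kHz
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))
```

Because every clip ends up with the same number of samples, the clips can be stacked into a single batch tensor.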

The dataset now contains batches of audio clips and integer labels. The audio clips have a shape of (batch, samples, channels) .

This dataset only contains single channel audio, so use the tf.squeeze function to drop the extra axis:

The utils.audio_dataset_from_directory function only returns up to two splits. It's a good idea to keep a test set separate from your validation set. Ideally you'd keep it in a separate directory, but in this case you can use Dataset.shard to split the validation set into two halves. Note that iterating over any shard will load all the data, and only keep its fraction.
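The sharding idea can be pictured with a small pure-Python sketch. It only mimics the selection rule (keep every element whose position modulo the number of shards equals the shard index); the helper `shard` and the integer stand-ins for batches are hypothetical.

```python
def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`."""
    # Every element is visited, even though only a fraction is kept.
    return [x for i, x in enumerate(elements) if i % num_shards == index]

val_batches = list(range(10))                        # stand-ins for validation batches
test_split = shard(val_batches, num_shards=2, index=0)
val_split = shard(val_batches, num_shards=2, index=1)
print(test_split, val_split)
```

The two halves are disjoint and together cover the original validation data, which is why the full data is read for each shard.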

Let's plot a few audio waveforms:


Convert waveforms to spectrograms

The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from time-domain signals into time-frequency-domain signals by computing the short-time Fourier transform (STFT), which converts the waveforms to spectrograms. Spectrograms show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.

A Fourier transform ( tf.signal.fft ) converts a signal to its component frequencies, but loses all time information. In comparison, STFT ( tf.signal.stft ) splits the signal into windows of time and runs a Fourier transform on each window, preserving some time information, and returning a 2D tensor that you can run standard convolutions on.

Create a utility function for converting waveforms to spectrograms:

  • The waveforms need to be of the same length, so that when you convert them to spectrograms, the results have similar dimensions. This can be done by simply zero-padding the audio clips that are shorter than one second (using tf.zeros).
  • When calling tf.signal.stft, choose the frame_length and frame_step parameters such that the generated spectrogram "image" is almost square. For more information on choosing the STFT parameters, refer to this Coursera video on audio signal processing and STFT.
  • The STFT produces an array of complex numbers representing magnitude and phase. However, in this tutorial you'll only use the magnitude, which you can derive by applying tf.abs on the output of tf.signal.stft.
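The points above can be sketched with NumPy to show what a magnitude STFT computes. This is an illustration, not the tutorial's own function: the values frame_length=255 and frame_step=128, the periodic Hann window, and the power-of-two FFT length are assumptions chosen to give an almost square output for a one-second clip at 16 kHz.

```python
import numpy as np

def magnitude_spectrogram(waveform, frame_length=255, frame_step=128):
    """Magnitude STFT of a 1-D signal, shape (num_frames, num_frequency_bins)."""
    waveform = np.asarray(waveform, dtype=np.float64)
    fft_length = 1 << (frame_length - 1).bit_length()   # next power of two (256 here)
    num_frames = 1 + (len(waveform) - frame_length) // frame_step
    # Periodic Hann window (assumed here as the analysis window).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(frame_length) / frame_length)
    frames = np.stack([
        waveform[i * frame_step : i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # rfft keeps the non-negative frequencies; abs() discards the phase.
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)      # 1 kHz test tone, one second long
spectrogram = magnitude_spectrogram(tone)
print(spectrogram.shape)
```

Each row is one time window and each column one frequency bin, so a steady tone shows up as a single bright column.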

Next, start exploring the data. Print the shapes of one example's tensorized waveform and the corresponding spectrogram, and play the original audio:


Now, define a function for displaying a spectrogram:

Plot the example's waveform over time and the corresponding spectrogram (frequencies over time):


Now, create spectrogram datasets from the audio datasets:

Examine the spectrograms for different examples of the dataset:


Build and train the model

Add Dataset.cache and Dataset.prefetch operations to reduce read latency while training the model:

For the model, you'll use a simple convolutional neural network (CNN), since you have transformed the audio files into spectrogram images.

Your tf.keras.Sequential model will use the following Keras preprocessing layers:

  • tf.keras.layers.Resizing : to downsample the input to enable the model to train faster.
  • tf.keras.layers.Normalization : to normalize each pixel in the image based on its mean and standard deviation.

For the Normalization layer, its adapt method would first need to be called on the training data in order to compute aggregate statistics (that is, the mean and the standard deviation).
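As a rough picture of what adapting and then normalizing means, the following plain-Python sketch computes the statistics once on training values and reuses them afterwards. The function names and the tiny data list are hypothetical, and the small epsilon guard is an assumption rather than the layer's exact formula.

```python
import math

def adapt(values):
    """Compute the mean and (population) variance of the training values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(values, mean, variance, epsilon=1e-7):
    """Shift and scale values using statistics computed beforehand."""
    std = max(math.sqrt(variance), epsilon)   # guard against division by zero
    return [(v - mean) / std for v in values]

train_pixels = [1.0, 2.0, 3.0, 4.0]
mean, variance = adapt(train_pixels)
normalized = normalize(train_pixels, mean, variance)
print(mean, variance, normalized)
```

The statistics come from the training data only, so validation and test inputs are scaled with the same mean and standard deviation.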

Configure the Keras model with the Adam optimizer and the cross-entropy loss:

Train the model over 10 epochs for demonstration purposes:

Let's plot the training and validation loss curves to check how your model has improved during training:


Evaluate the model performance

Run the model on the test set and check the model's performance:

Display a confusion matrix

Use a confusion matrix to check how well the model did classifying each of the commands in the test set:
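A confusion matrix simply counts, for each true class, how often each class was predicted. Here is a minimal sketch in plain Python with made-up labels (rows are true classes, columns are predicted classes); it does not use the tutorial's model or data.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] = number of examples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical integer labels
y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical model predictions
cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(cm, accuracy)
```

Correct predictions lie on the diagonal, so off-diagonal cells show which commands are confused with each other.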


Run inference on an audio file

Finally, verify the model's prediction output using an input audio file of someone saying "no". How well does your model perform?


As the output suggests, your model should have recognized the audio command as "no".

Export the model with preprocessing

The model's not very easy to use if you have to apply those preprocessing steps before passing data to the model for inference. So build an end-to-end version:

Test run the "export" model:

Save and reload the model. The reloaded model gives identical output:

This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources:

  • The Sound classification with YAMNet tutorial shows how to use transfer learning for audio classification.
  • The notebooks from Kaggle's TensorFlow speech recognition challenge .
  • The TensorFlow.js - Audio recognition using transfer learning codelab teaches how to build your own interactive web app for audio classification.
  • A tutorial on deep learning for music information retrieval (Choi et al., 2017) on arXiv.
  • TensorFlow also has additional support for audio data preparation and augmentation to help with your own audio-based projects.
  • Consider using the librosa library for music and audio analysis.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-03-23 UTC.


Automatic Speech Recognition with Transformer

Author: Apoorv Nandan Date created: 2021/01/13 Last modified: 2021/01/13 Description: Training a sequence-to-sequence Transformer for automatic speech recognition.


Introduction

Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens.

For this demonstration, we will use the LJSpeech dataset from the LibriVox project. It consists of short audio clips of a single speaker reading passages from 7 non-fiction books. Our model will be similar to the original Transformer (both encoder and decoder) as proposed in the paper, "Attention is All You Need".

References:

  • Attention is All You Need
  • Very Deep Self-Attention Networks for End-to-End Speech Recognition
  • Speech Transformers
  • LJSpeech Dataset

Define the Transformer Input Layer

When processing past target tokens for the decoder, we compute the sum of position embeddings and token embeddings.

When processing audio features, we apply convolutional layers to downsample them (via convolution strides) and process local relationships.

Transformer Encoder Layer

Transformer decoder layer, complete the transformer model.

Our model takes audio spectrograms as inputs and predicts a sequence of characters. During training, we give the decoder the target character sequence shifted to the left as input. During inference, the decoder uses its own past predictions to predict the next token.
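The relationship between the decoder input and the sequence it must predict during training can be sketched as follows. The start and end markers "<" and ">" and the helper name are assumptions for illustration; the point is only that the two sequences are offset by one position.

```python
def make_decoder_pair(target_tokens):
    """Split a target sequence into decoder input and prediction target."""
    decoder_input = target_tokens[:-1]    # everything except the last token
    decoder_target = target_tokens[1:]    # everything except the first token
    return decoder_input, decoder_target

tokens = ["<", "h", "i", ">"]             # hypothetical character tokens
dec_in, dec_out = make_decoder_pair(tokens)
for seen, expected in zip(dec_in, dec_out):
    print(f"after seeing {seen!r} the decoder should predict {expected!r}")
```

At inference time there is no target sequence, so the decoder input is built up one token at a time from the model's own predictions.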

Download the dataset

Note: This requires ~3.6 GB of disk space and takes ~5 minutes for the extraction of files.

Preprocess the dataset

Callbacks to display predictions, learning rate schedule, create & train the end-to-end model.

In practice, you should train for around 100 epochs or more.

Some of the predicted text at or around epoch 35 may look as follows:

AI Summer


Speech Recognition: a review of the different deep learning approaches


Humans communicate preferably through speech using the same language. Speech recognition can be defined as the ability to understand the spoken words of the person speaking.

Automatic speech recognition (ASR) refers to the task of recognizing human speech and translating it into text. This research field has gained a lot of focus over the last decades. It is an important research area for human-to-machine communication. Early methods focused on manual feature extraction and conventional techniques such as Gaussian Mixture Models (GMM), the Dynamic Time Warping (DTW) algorithm and Hidden Markov Models (HMM).

More recently, neural networks such as recurrent neural networks (RNNs), convolutional neural networks (CNNs) and, in recent years, Transformers have been applied to ASR and have achieved great performance.

  • How to formulate Automatic Speech Recognition (ASR)?

The overall flow of ASR can be represented as shown below:

[Figure: overall flow of an ASR system]

The main goal of an ASR system is to transform an audio input signal $\mathbf{x} = (x_1, x_2, \dots, x_T)$ with a specific length $T$ into a sequence of words or characters (i.e., labels) $\mathbf{y} = (y_1, y_2, \dots, y_N)$, $y_n \in \mathbf{V}$, where $\mathbf{V}$ is the vocabulary. The labels might be character-level labels (i.e., letters) or word-level labels (i.e., words).

The most probable output sequence is given by:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathbf{V}^*} P(\mathbf{y} \mid \mathbf{x})$$

A typical ASR system has the following processing steps:

Pre-processing

Feature extraction

Classification

Language modeling.

The pre-processing step aims to improve the audio signal by increasing the signal-to-noise ratio, that is, by reducing the noise and filtering the signal.

In general, the features used for ASR are extracted as a fixed number of values or coefficients, which are generated by applying various methods to the input. This step must be robust to various quality factors, such as noise or echo.

The majority of the ASR methods adopt the following feature extraction techniques:

Mel-frequency cepstral coefficients (MFCCs)

Discrete Wavelet Transform (DWT).

The classification model aims to find the spoken text contained in the input signal. It takes the features produced by the feature extraction step and generates the output text.

The language model (LM) is an important module as it captures the grammatical rules or the semantic information of a language. Language models are important in order to recognize the output token from the classification model as well as to make corrections on the output text.

  • Datasets for ASR

Various databases with text from audiobooks, conversations, and talks have been recorded.

  • The CallHome English, Spanish and German databases ( Post et al. 1 ) contain conversational data with a high number of words, which are not in the vocabulary. They are challenging databases with foreign words and telephone channel distortion. The English CallHome database has 120 spontaneous English telephone conversations between native English people. The training set has 80 conversations of about 15 hours of speech, while the test and development sets contain 20 conversations, where each set has 1.8 hours of audio files.

Moreover, the CallHome Spanish database consists of 120 telephone conversations between native speakers. The training part has 16 hours of speech and its test set has 20 conversations with 2 hours of speech. Finally, the CallHome German database consists of 100 telephone conversations between native German speakers with 15 hours of speech in the training set and 3.7 hours of speech in the test set.

  • TIMIT 2 is a large dataset with broadband recordings from American English, where each speaker reads 10 grammatically rich sentences. TIMIT contains audio signals, which have been time-aligned, corrected and can be used for character or word recognition. The audio files are encoded in 16 bits. The training set contains a large number of audios from 462 speakers in total, while the validation set has audios from 50 speakers and the test set audios from 24 speakers.
  • Feature extraction for ASR

Mel-frequency cepstral coefficients (MFCCs) are the most common method for extracting speech features. The human ear is a nonlinear system concerning how it perceives the audio signal. To account for this, the mel scale was developed so that equal distances on the scale correspond to equal perceived differences in pitch. The scale is approximately linear for frequencies below about 1 kHz and logarithmic for higher frequencies. The mel-scale frequency is computed as:

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$

where $f_{Hz}$ is the frequency of the original signal.
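As a small worked example, the sketch below converts between Hz and mel. It assumes the widely used formula $f_{mel} = 2595 \log_{10}(1 + f_{Hz}/700)$; other variants of the mel scale exist, so the constants should be read as an assumption.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mel (assumed 2595 / 700 variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (250.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.1f} Hz -> {hz_to_mel(f):7.1f} mel")
```

With these constants 1000 Hz maps to roughly 1000 mel, and doubling the frequency above that point adds progressively fewer mels, which reflects the logarithmic behaviour at high frequencies.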

The MFCC feature extraction technique basically includes the following steps:

Window the signal

Apply Discrete Fourier Transform

Logarithm of the magnitude

Convert to a Mel scale

Apply inverse discrete cosine transform (DCT)

Deep Neural Networks for ASR

In the deep learning era, neural networks have shown significant improvement in the speech recognition task. Various methods have been applied such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), while recently Transformer networks have achieved great performance.

  • Recurrent Neural Networks

RNNs perform computations on the time sequence since their current hidden state is dependent on all the previous hidden states. More specifically, they are designed to model time-series signals as well as capture long-term and short-term dependencies between different time-steps of the input.

Concerning speech recognition applications, the input signal $\mathbf{x} = (x_1, x_2, \dots, x_T)$ is passed through the RNN to compute the hidden sequences $\mathbf{h} = (h_1, h_2, \dots, h_N)$ and the output sequences $\mathbf{y} = (y_1, y_2, \dots, y_N)$, respectively. One major drawback of the simple form of RNNs is that it generates the next output based only on the previous context.

[Figure: bidirectional RNN]

RNNs compute the sequence of hidden vectors $\mathbf{h}$ as:

$$h_t = H\left(\mathbf{W}_{xh} x_t + \mathbf{W}_{hh} h_{t-1} + \mathbf{b}_h\right)$$

where $\mathbf{W}$ are the weights, $\mathbf{b}$ are the bias vectors and $H$ is the nonlinear function.
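To illustrate the recurrence, here is a minimal pure-Python sketch of a unidirectional RNN. It assumes the common form in which the new hidden state is a nonlinearity applied to a weighted sum of the current input and the previous hidden state, with tanh as the nonlinear function; the weight values are arbitrary.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Return the list of hidden states, one per input time-step."""
    hidden_size = len(b_h)
    hidden = [0.0] * hidden_size              # initial hidden state
    states = []
    for x_t in inputs:
        new_hidden = []
        for j in range(hidden_size):
            total = b_h[j]
            total += sum(w * x for w, x in zip(w_xh[j], x_t))      # input term
            total += sum(w * h for w, h in zip(w_hh[j], hidden))   # recurrent term
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
        states.append(hidden)
    return states

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three time-steps, two features
w_xh = [[0.5, -0.5], [0.3, 0.8]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
states = rnn_forward(inputs, w_xh, w_hh, b_h)
print(states)
```

Changing a later input never changes an earlier hidden state, which is exactly the limitation that motivates processing the sequence in both directions.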

  • RNNs limitations and solutions

However, in speech recognition, usually the information of the future context is equally significant as the past context (Graves et al. 3 ). That’s why instead of using a unidirectional RNN, bidirectional RNNs (BiRNNs) are commonly selected in order to address this shortcoming. BiRNNs process the input vectors in both directions i.e., forward and backward, and keep the hidden state vectors for each direction as shown in the above figure.

Neural networks, both feed-forward and recurrent, can be only used for frame-wise classification of the input audio.

This problem can be addressed using:

Hidden Markov Models (HMMs) to get the alignment between the input audio and its transcribed output.

Connectionist Temporal Classification (CTC) loss, which is the most common technique.

CTC is an objective function that computes the alignment between the input speech signal and the output sequence of the words. CTC uses a blank label that represents the silence time-step, i.e., the person doesn't speak, or represents the transition between words or phonemes. Given the input $\mathbf{x}$ and the output probability sequence of words or characters $\mathbf{y}$, the probability of an alignment path $\boldsymbol{\alpha}$ is calculated as:

$$P(\boldsymbol{\alpha} \mid \mathbf{x}) = \prod_{t=1}^{T} P(\alpha_t \mid \mathbf{x})$$

where $\alpha_t$ is a single alignment at time-step $t$.

For a given transcription sequence, there are several possible alignments since labels can be separated from blanks in different ways. For example, the alignments $(a, -, b, c, -, -)$ and $(-, -, a, -, b, c)$, where $-$ is the blank symbol, both correspond to the character sequence $(a, b, c)$.
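The mapping from an alignment to its transcription can be sketched in a few lines of Python. This follows the usual CTC convention (first merge repeated symbols, then drop blanks), which is stated here as an assumption since the text above only shows examples without repeats.

```python
from itertools import groupby

BLANK = "-"

def collapse(alignment):
    """Map a frame-level alignment to its label sequence."""
    merged = [symbol for symbol, _ in groupby(alignment)]   # merge repeats
    return [s for s in merged if s != BLANK]                # drop blanks

print(collapse(list("a-bc--")))
print(collapse(list("--a-bc")))
print(collapse(list("aa-ab")))
```

Note that a blank between two identical labels keeps them separate, which is how a word with a doubled letter can be represented.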

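Finally, the total probability of all paths is calculated as:

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\boldsymbol{\alpha}} P(\boldsymbol{\alpha} \mid \mathbf{x})$$

where the sum runs over all alignment paths $\boldsymbol{\alpha}$ that correspond to $\mathbf{y}$.

CTC aims to maximize the total probability of the correct alignments in order to get the correct output word sequence. One main benefit of CTC is that it doesn't require prior segmentation or alignment of the data. DNNs can be used directly to model the features and achieve great performance in speech recognition tasks.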
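For a tiny example, the total probability can be computed by brute force, enumerating every alignment, and cross-checked against the standard dynamic-programming (forward) recursion. The sketch below assumes the usual CTC collapsing rule (merge repeats, then remove blanks), uses index 0 as the blank, and works on a made-up probability table.

```python
from itertools import groupby, product

BLANK = 0

def collapse(path):
    merged = [s for s, _ in groupby(path)]
    return tuple(s for s in merged if s != BLANK)

def ctc_probability_brute_force(target, probs):
    """Sum the probability of every alignment that collapses to `target`."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, symbol in enumerate(path):
                p *= probs[t][symbol]      # frames are treated as independent
            total += p
    return total

def ctc_probability_forward(target, probs):
    """Same quantity via the forward recursion over the blank-extended target."""
    extended = [BLANK]
    for s in target:
        extended += [s, BLANK]
    alpha = [0.0] * len(extended)
    alpha[0] = probs[0][BLANK]
    if len(extended) > 1:
        alpha[1] = probs[0][extended[1]]
    for t in range(1, len(probs)):
        new_alpha = [0.0] * len(extended)
        for s, symbol in enumerate(extended):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and symbol != BLANK and symbol != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][symbol]
        alpha = new_alpha
    return alpha[-1] + (alpha[-2] if len(extended) > 1 else 0.0)

# Hypothetical per-frame distributions over (blank, a, b) for T = 4 frames.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
print(ctc_probability_brute_force([1, 2], probs), ctc_probability_forward([1, 2], probs))
```

The brute-force version grows exponentially with the number of frames, so practical CTC implementations rely on the recursive form.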

The decoding process is used to generate predictions from a trained model using CTC. There are several decoding algorithms. The most common is the best-path decoding algorithm, where the maximum probabilities are used at every time-step. Since the model assumes that the latent symbols are independent given the network outputs in the frame-wise case, the output with the highest probability is obtained at each time-step as:

$$\hat{\alpha}_t = \arg\max_{\alpha_t} P(\alpha_t \mid \mathbf{x})$$

Beam search has also been adopted for CTC decoding. The most likely transcription is searched using left-to-right time-steps and a small number $B$ of partial hypotheses is maintained. Each hypothesis is actually a prefix of the output sequence, while at each time-step it is extended in the beam with every possible word in the vocabulary.
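Of the two decoding strategies, best-path decoding is the simpler one and can be sketched as follows: take the most probable symbol at every time-step, then collapse the result. The label set and the probability table are made up, and the collapsing rule (merge repeats, remove blanks) is the usual CTC convention assumed here.

```python
from itertools import groupby

def best_path_decode(probs, labels, blank_index=0):
    """Greedy CTC decoding: per-frame argmax followed by collapsing."""
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    merged = [i for i, _ in groupby(best)]                 # merge repeats
    return "".join(labels[i] for i in merged if i != blank_index)

labels = ["-", "a", "b"]
probs = [
    [0.10, 0.80, 0.10],   # most likely: a
    [0.20, 0.70, 0.10],   # most likely: a (repeat, merged)
    [0.90, 0.05, 0.05],   # most likely: blank
    [0.10, 0.20, 0.70],   # most likely: b
]
decoded = best_path_decode(probs, labels)
print(decoded)
```

Best-path decoding is fast but only follows a single alignment, whereas beam search keeps several prefixes alive and can recover transcriptions whose probability is spread over many alignments.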

RNN-Transducer

In other works (e.g., Rao et al. 4), an architecture commonly known as RNN-Transducer has also been employed for ASR. This method combines an RNN with CTC and a separate RNN that predicts the next output given the previous one. It determines a separate probability distribution $P(y_k \mid t, u)$ for every time-step $t$ of the input and time-step $u$ of the output for the $k$-th element of the output $\mathbf{y}$.

An encoder network converts the acoustic feature $x_t$ at time-step $t$ to a representation $he_t = f_{enc}(x_t)$. Furthermore, a prediction network takes the previous label $y_{u-1}$ and generates a new representation $hp_u = f_{p}(y_{u-1})$. The joint network is a fully-connected layer that combines the two representations and generates the posterior probability $P(\mathbf{y} \mid t, u) = f_{joint}(he_t : hp_u)$. In this way the RNN-Transducer can generate the next symbols or words by using information both from the encoder and the prediction network, depending on whether the predicted label is a blank or a non-blank label. The inference procedure stops when a blank label is emitted at the last time-step.

[Figure: RNN-Transducer architecture]

Graves et al. 3 tested regular RNNs with CTC and RNN-Transducers in TIMIT 2 database using different numbers of layers and hidden states.

The feature extraction is performed with a Fourier transform filter-bank method of 40 coefficients that are distributed on a logarithmic mel-scale concatenated with the first and second temporal derivatives.

In the table below, it is shown that RNN-T with 3 layers of 250 hidden states each has the best performance of 17.7% phoneme error rate (PER), while simple RNN-CTC models perform worse with PER > 18.4%.

[Table: TIMIT results for RNN-CTC and RNN-Transducer models]

  • End-to-end ASR with RNN-Transducer (RNN-T)

Rao et al. 4 proposed an encoder-decoder RNN. The proposed method adopts an encoder network consisting of several blocks of LSTM layers, which are pre-trained with CTC using phonemes, graphemes, and words as output. In addition, a 1D-CNN reduces the length $T$ of the time sequence by a factor of 3 using specific kernel strides and sizes.

The decoder network is an RNN-T model trained along with an LSTM language model that also predicts words. The target of the network is the next label in the sequence and is used in the cross-entropy loss to optimize the network. Concerning feature extraction, 80-dimensional mel-scale features are computed every 10 msec and stacked every 30 msec to a single 240-dimensional acoustic feature vector.
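The frame stacking described above can be pictured with a short sketch. It assumes non-overlapping groups of three consecutive 10 ms frames, which is one plausible reading of the description; the helper name and the dummy feature values are hypothetical.

```python
def stack_frames(frames, stack_size=3):
    """Concatenate each group of `stack_size` consecutive frames into one vector."""
    stacked = []
    for start in range(0, len(frames) - stack_size + 1, stack_size):
        combined = []
        for frame in frames[start:start + stack_size]:
            combined.extend(frame)
        stacked.append(combined)
    return stacked

frames = [[float(i)] * 80 for i in range(9)]   # nine 10 ms frames, 80 features each
stacked = stack_frames(frames)
print(len(stacked), len(stacked[0]))
```

Stacking shortens the sequence by a factor of three while keeping all the information, which reduces the number of time-steps the recurrent layers have to process.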

[Figure: encoder-decoder RNN-T architecture]

The method is trained on a set of 22 million hand-transcribed audio recordings extracted from Google US English voice traffic, which corresponds to 18,000 hours of training data. These include voice-search as well as voice-dictation utterances. The language model was pretrained on text sentences obtained from the dataset. The method was tested with different configurations. It achieves 5.2% WER on this large dataset when the encoder contains 12 layers of 700 hidden units and the decoder 2 layers of 1000 hidden units each.
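Word error rate (WER), the metric quoted above, is the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch with a made-up sentence pair:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))       # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

print(word_error_rate("turn the lights off", "turn lights of"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.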

[Table: RNN-T results for different encoder and decoder configurations]

  • Streaming end-to-end speech recognition for mobile devices

RNN-Transducers have also been adopted for real-time speech recognition (He et al. 5). In this work, the model consists of 8 layers of uni-directional LSTM cells, while a time-reduction layer is used in the encoder to speed up training and inference. Memory caching techniques are also used to avoid redundant computation for identical prediction histories. This saves about 50–60% of the prediction network computations. In addition, different threads are used for the encoder and the prediction network to enable pipelining and save a significant amount of time.

The encoder inference procedure is split over two threads corresponding to the components before and after the time-reduction layer, which balances the computation between the two encoder components and the prediction network, and has a speedup of 28% compared against single-threaded execution. Furthermore, parameters are quantized from 32-bit floating-point precision into 8-bit to reduce memory consumption, both on disk and at run-time, and to optimize the model's execution in real-time.
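The effect of 8-bit quantization can be illustrated with a simple symmetric scheme: one scale factor maps the largest absolute weight to 127, and every weight is rounded to the nearest integer step. This is a generic sketch under that assumption, not the scheme used in the cited work, and the weight values are made up.

```python
def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-q_max, q_max] with a single scale."""
    q_max = 2 ** (num_bits - 1) - 1                  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    quantized = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half a quantization step.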

The algorithm was trained on a dataset that consists of 35 million English utterances with a size of 27,500 hours. The training utterances are hand-transcribed and obtained from Google's voice search and dictation traffic, and the training data was created by artificially corrupting clean utterances using a room simulator. The reported results are evaluated on 14,800 voice search (VS) samples extracted from Google traffic assistant, as well as 15,700 voice command samples, denoted as the IME test set. The feature extraction step creates 80-dimensional mel-scale features computed every 25 msec. The results are reported in inference speed divided by audio duration (RT90) and WER. The RNN-T model with symmetric quantization achieves WERs of 7.3% on the voice search set and 4.2% on the IME set.

[Table: streaming RNN-T results (WER and RT90)]

  • Fast and Accurate Recurrent Neural Network Acoustic Models for ASR

Sak et al. 6 adopt long short-term memory (LSTM) networks for large vocabulary speech recognition. Their method extracts high-dimensional features with mel filter banks and a sliding window technique. In addition, they incorporate context-dependent states and further improve the performance of the model. The method is evaluated on hand-transcribed audio recordings from real Google voice search traffic. The training set has 3 million utterances with an average duration of 4 seconds. The results are shown in the tables below:

[Table: results of Sak et al.]

  • Attention-based models

Other works have adopted the attention encoder-decoder structure of the RNN that directly computes the conditional probability of the output sequence given the input sequence without assuming a fixed alignment. The encoder-decoder method uses an attention mechanism, which does not require pre-segment alignment of data. An attention-based model uses a single decoder to produce a distribution over the labels conditioned on the full sequence of previous predictions and the input audio. With attention, it can implicitly learn the soft alignment between input and output sequences, which solves a big problem for speech recognition.

The model can still perform well on long input sequences, so such models can also handle speech input of various lengths. More specifically, the model computes the output probability density $P(\mathbf{y} \mid \mathbf{x})$, where the lengths of the input and output are different. The encoder maps the input to the context vector $\mathbf{c}_i$ for each output $y_i$. The decoder computes:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{I} P(y_i \mid y_1, \dots, y_{i-1}, \mathbf{c}_i)$$

where each of the $I$ outputs is conditioned on the previous outputs and the context $\mathbf{c}_i$.

The posterior probability of symbol $y_i$ is calculated as:

$$P(y_i \mid y_1, \dots, y_{i-1}, \mathbf{c}_i) = g(y_{i-1}, s_i, \mathbf{c}_i), \qquad s_i = f(y_{i-1}, s_{i-1}, \mathbf{c}_i)$$

where $s_i$ is the output of the recurrent layer $f$ and $g$ is the softmax function.

The context is obtained from the weighted average of the hidden states of all time-steps as:

$$\mathbf{c}_i = \sum_{t=1}^{T} a_{i,t} h_t$$

where $a_{i,t} \in [0,1]$, $\sum_{t=1}^{T} a_{i,t} = 1$.

The attention mechanism selects the temporal locations over the input sequence that should be used to update the hidden state of the RNN and to predict the next output value. It assigns the attention weights $a_{i,t}$, which act as relevance scores between the input and the output.
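The weighted average behind the context vector can be sketched in plain Python. The raw relevance scores are turned into weights with a softmax so that they lie in [0, 1] and sum to 1; how the scores themselves are produced is left out, and the numbers below are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, hidden_states):
    """Weighted average of the encoder hidden states."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 encoder states
scores = [2.0, 0.0, 0.0]                               # hypothetical relevance scores
weights, context = context_vector(scores, hidden_states)
print(weights, context)
```

With equal scores the context is the plain mean of the hidden states; a high score pulls the context towards the corresponding time-step.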

  • Attention-based recurrent sequence generator

Chorowski et al. 7 adopt an attention-based recurrent sequence generator (ARSG) that generates the output word sequence from speech features $\mathbf{h} = (h_1, h_2, \dots, h_T)$ that can be modelled by any type of encoder. ARSG generates the output $y_i$ by focusing on the relevant features:

where $s_i$ is the $i$-th state of the RNN and $a_i$ are the attention weights.

A new state is generated as:

In more detail, the scoring mechanism works as:

ARSG is evaluated on the TIMIT dataset and achieves WERs of 15.8% and 17.6% on the validation and test sets, respectively.

  • Listen-Attend-Spell (LAS)

In Chan et al. 8 and Chiu et al. 9, the Listen-Attend-Spell (LAS) method was developed. The encoder (i.e., Listen) takes the input audio $\mathbf{x}$ and generates the representation $\mathbf{h}$. More specifically, it uses a bidirectional Long Short Term Memory (BLSTM) module with a pyramid structure, where in each layer the time resolution is reduced. The output at the $i$-th time step, from the $j$-th layer is computed as:

The decoder (i.e., Attend-Spell) is an attention-based module that attends the representation $\mathbf{h}$ and produces the output probability $P(\mathbf{y} \mid \mathbf{x})$. In more detail, an attention-based LSTM transducer produces the next character based on the previous outputs as:

where $s_i$, $c_i$ are the decoder state and the context vector, respectively.

LAS was evaluated on 3 million Google voice search utterances with 2000 hours of speech, where 10 hours of utterances were randomly selected for validation. Data augmentation was also performed on the training dataset using a room simulator as well as by adding other types of noise and reverberations. It was able to achieve great recognition rates with WERs of 10.3% and 12.0% on clean and noisy environments, respectively.

[Figure: Listen-Attend-Spell (LAS) architecture]

  • End-to-end Speech Recognition with Word-based RNN Language Models and Attention

Hori et al. 10 adopt a joint decoder using CTC, an attention decoder, and an RNN language model. A CNN encoder network takes the input audio $\mathbf{x}$ and outputs the hidden sequence $\mathbf{h}$ that is shared between the decoder modules. The decoder network iteratively predicts the label sequence $\mathbf{c}$ based on the hidden sequence. The joint decoder utilizes CTC, attention and the language model together to enforce better alignments between the input and the output and find a better output sequence. The network is trained to maximize the following joint function:

During inference, to find the most probable word sequence $\hat{\mathbf{c}}$, the decoder finds the most probable words as:

[Figure: architecture of the method of Hori et al.]

The method is evaluated on Wall Street Journal (WSJ) and LibriSpeech datasets.

WSJ 11 is a well-known English clean speech database including approximately 80 hours.

LibriSpeech is a large data set of reading speech from audiobooks and contains 1000 hours of audio and transcriptions 12 . The experimental results of the proposed method on WSJ and Librispeech are shown in the following table, respectively.

[Table: results of Hori et al. on WSJ and LibriSpeech]

Convolutional Models

Convolutional neural networks were initially implemented for computer vision (CV) tasks. In recent years, CNNs have also been widely applied in the field of natural language processing (NLP), due to their good generalization and discrimination capability.

A very typical CNN architecture is formed of several convolutional and pooling layers with fully connected layers for classification. A convolutional layer is composed of kernels that are convolved with the input. A convolutional kernel divides the input signal into smaller parts, namely the receptive field of the kernel. Furthermore, the convolution operation is performed by multiplying the kernel with the corresponding parts of the input that fall within the receptive field. Convolutional methods can be grouped into 1-dimensional and 2-dimensional networks.

2D-CNNs construct 2D feature maps from the acoustic signal. Similar to images, they organize acoustic features, e.g., MFCC features, in a 2-dimensional feature map, where one axis represents the frequency domain and the other represents the time domain. In contrast, 1D-CNNs accept acoustic features directly as input.

In 1D-CNN for speech recognition, every input feature map $X = (X_1, \dots, X_I)$ is connected to many feature maps $O = (O_1, \dots, O_J)$. The convolution operation can be written as:

where $\mathbf{w}$ is the local weight.

In 1D-CNNs, $\mathbf{w}$ and $\mathbf{O}$ are vectors.

In 2D-CNNs they are matrices.
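As a rough illustration of how $I$ input feature maps are combined into $J$ output maps, here is a minimal NumPy sketch. The function name `conv1d_layer`, the sigmoid activation, the bias term, and the "valid" boundary handling are assumptions made for the example, not details taken from the cited paper.

```python
import numpy as np

def conv1d_layer(X, w, b):
    """X: (I, n) input maps; w: (I, J, k) local weights; b: (J,) biases.
    Returns O: (J, n - k + 1) output maps."""
    I, n = X.shape
    _, J, k = w.shape
    O = np.zeros((J, n - k + 1))
    for j in range(J):
        for i in range(I):
            # np.correlate slides the kernel without flipping it, which is
            # what deep learning libraries usually call "convolution".
            O[j] += np.correlate(X[i], w[i, j], mode="valid")
        O[j] += b[j]
    return 1.0 / (1.0 + np.exp(-O))  # sigmoid non-linearity (assumed)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 40))     # I = 3 input maps of length 40
w = rng.normal(size=(3, 8, 5))   # J = 8 output maps, kernel size 5
b = np.zeros(8)
O = conv1d_layer(X, w, b)
print(O.shape)
```

Every output map sums the contributions of all input maps, which is why each $X_i$ is said to be connected to many $O_j$.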

Abdel-Hamid et al. 13 were the first to apply CNNs to speech recognition. Their method adopts two types of convolutional layers. The first one adopts full weight sharing (FWS), where weights are shared across all positions of the input. This technique is common in CNNs for image recognition, since the same characteristics may appear at any location in an image. However, in speech recognition, the signal varies across different frequencies and has distinct feature patterns in different filters. To tackle this, limited weight sharing (LWS) is used, where only the convolution filters that are attached to the same pooling filters share the same weights.

[Figure: cnn_2d_asr]

The speech input was analyzed with a 25-ms Hamming window with a fixed 10-ms frame rate. More specifically, feature vectors are generated by Fourier-transform-based filter-bank analysis, which includes 40 log energy coefficients distributed on a mel scale, along with their first and second temporal derivatives. All speech data were normalized so that each vector dimension has zero mean and unit variance.
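The derivative and normalization steps can be sketched as follows. This is only an illustration: the helper name `add_deltas_and_normalize` is hypothetical, the derivatives use a plain finite difference (`np.gradient`) rather than whatever regression window the authors used, and the statistics are computed over a single array instead of the whole corpus.

```python
import numpy as np

def add_deltas_and_normalize(feats):
    """feats: (T, D) log mel energies -> (T, 3 * D) normalized features."""
    delta = np.gradient(feats, axis=0)    # first temporal derivative
    delta2 = np.gradient(delta, axis=0)   # second temporal derivative
    full = np.concatenate([feats, delta, delta2], axis=1)
    mean = full.mean(axis=0)
    std = full.std(axis=0) + 1e-8         # avoid division by zero
    return (full - mean) / std            # zero mean, unit variance per dimension

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(100, 40))      # 100 frames, 40 coefficients (dummy data)
out = add_deltas_and_normalize(log_mel)
print(out.shape)
```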

The building block of their CNN architecture has convolutions and pooling layers. The input features are organized as several feature maps. The size (resolution) of feature maps gets smaller at upper layers as more convolution and pooling operations are applied, as shown in the figure below. Usually, one or more fully connected hidden layers are added on top of the final CNN layer to combine the features across all frequency bands before feeding to the output layer. They made a comprehensive study with different CNN configurations and achieved great results on TIMIT, which are shown in the table below. Their best model adopts only LWS layers and achieves a WER of 20.23%.

[Figure: cnn_1d_arch_2014]

  • Residual CNN

Wang et al. 14 adopted residual 2D-CNN (RCNN) with CTC loss for speech recognition. The residual block uses direct connections between the previous and the next layer as follows:
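In the usual residual formulation this reads as below. The notation $\mathbf{x}_l$ for the input of layer $l$ and $\mathbf{W}_l$ for its weights is assumed here for illustration.

```latex
\[
\mathbf{x}_{l+1} = f(\mathbf{x}_l, \mathbf{W}_l) + \mathbf{x}_l
\]
```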

where $f$ is a nonlinear function. This helps the network to converge faster without the use of extra parameters. The proposed architecture is depicted in the figure below. The Residual CNN-CTC method adopts 4 groups of residual blocks with small $3 \times 3$ filters. Each residual group has $N$ convolutional blocks with 2 layers. Each residual group also has different strides to reduce the computational cost and model temporal dependencies with different contexts. Batch normalization and ReLU activation are applied on each layer.

[Figure: res_cnn_ctc]

The RCNN is evaluated on WSJ with the standard configuration (si284 set for training, eval92 set for validation, and dev93 set for testing). Furthermore, it is evaluated on the Tencent Chat dataset, which contains about 1400 hours of speech data for training and an independent 2000 sentences for testing. The experimental results demonstrate the effectiveness of residual convolutional neural networks. RCNN achieves WERs of 4.29%/7.65% on the validation and test sets of WSJ and 13.33% on the Tencent Chat dataset.

Li et al. 15 implemented Jasper, a residual 1D-CNN with dense and residual blocks, as shown below. The network extracts mel-filter-bank features and uses residual blocks that contain batch normalization and dropout layers for faster convergence and better generalization. The input is constructed from mel-filter-bank features obtained using 20 msec windows with a 10 msec overlap. The network has been tested with different types of normalization and activation functions, while each block is optimized to fit on a single GPU kernel for faster inference. Jasper is evaluated on LibriSpeech with different configurations. The best model has 10 blocks of 4 layers with BatchNorm + ReLU and achieves validation WERs of 6.15% and 17.38% on the clean and noisy sets, respectively.

[Figure: jasper]

  • Fully Convolutional Network

Zeghidour et al. 16 implement a fully convolutional network (FCN) with 3 main modules. The convolutional front-end is a CNN with low-pass filters, convolutional filters similar to filter-banks, and a logarithmic function to extract features. The second module is a convolutional acoustic model with several convolutional layers, GELU activation function, dropout, and weight regularization, which predicts the letters from the input. Finally, there is a convolutional language model with 14 convolutional residual blocks and bottleneck layers.

This module is used to evaluate the candidate transcriptions of the acoustic model using a beam search decoder. FCN is evaluated on the WSJ and LibriSpeech datasets. Their best configuration adopts a trainable convolutional front-end with 80 filters and a convolutional language model. FCN achieves WERs of 6.8% on the validation set and 3.5% on the test set of WSJ, while on LibriSpeech it achieves validation WERs of 3.08%/9.94% on the clean and noisy sets and test WERs of 3.26%/10.47% on the clean and noisy sets, respectively.

[Figure: fcn]

  • Time-Depth Separable Convolutions (TDS)

Unlike other works, Hannun et al. 17 use time-separable convolutional networks with a limited number of parameters, because time-separable CNNs generalize better and are more efficient. The encoder uses 2D depth-wise convolutions along with layer normalization. The encoder outputs two vectors, the keys $\mathbf{k} = k_1, k_2, \dots, k_T$ and the values $\mathbf{v} = v_1, v_2, \dots, v_T$, from the input sequence $\mathbf{x}$ as:

As for the decoder, a simple RNN is used and outputs the next token $y_u$ as:

where $\mathbf{S}_u$ is a summary vector and $\mathbf{Q}_u$ is the query vector.

TDS is evaluated on LibriSpeech with different receptive fields and kernel sizes in order to find the best setting for the time-separable convolutional layers. The best option is 11 time-separable blocks, which achieve WERs of 5.04% and 14.46% on the dev clean and dev other sets, respectively.

[Figure: tsn]

ContextNet 18 is a fully convolutional network that feeds global context information into the layers with squeeze-and-excitation modules. The CNN has $K$ layers and generates the features as:

where $C$ is a convolutional block followed by batch normalization and activation functions. Furthermore, the squeeze-and-excitation block generates a global channel-wise weight $\theta$ with a global average pooling layer, which is multiplied by the input $\mathbf{x}$ as:
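A minimal NumPy sketch of a squeeze-and-excitation step is given below. The two-layer bottleneck with ReLU, the weight names `W1`, `b1`, `W2`, `b2`, and the reduction ratio `r` are assumptions for illustration and may differ from the ContextNet configuration.

```python
import numpy as np

def squeeze_excite(x, W1, b1, W2, b2):
    """x: (T, C) features. Returns x scaled by a global channel-wise weight."""
    pooled = x.mean(axis=0)                             # squeeze: average over time -> (C,)
    hidden = np.maximum(0.0, pooled @ W1 + b1)          # bottleneck layer with ReLU (assumed)
    theta = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # channel weights in (0, 1)
    return theta * x                                    # excite: broadcast over all time steps

rng = np.random.default_rng(0)
T, C, r = 50, 16, 4
x = rng.normal(size=(T, C))
W1, b1 = rng.normal(size=(C, C // r)), np.zeros(C // r)
W2, b2 = rng.normal(size=(C // r, C)), np.zeros(C)
y = squeeze_excite(x, W1, b1, W2, b2)
print(y.shape)
```

Because the weight is computed from an average over the whole utterance, every time step sees the same global context.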

ContextNet is validated on LibriSpeech in 3 different configurations, with or without a language model. The 3 configurations are ContextNet(Small), ContextNet(Medium), and ContextNet(Large), which contain different numbers of layers and filters.

[Figure: contextnet_results]

Transformers

Recently, with the introduction of Transformer networks 19, machine translation and speech recognition have seen significant improvements. Transformer models that are designed for speech recognition are usually based on the encoder-decoder architecture, similar to seq2seq models. In more detail, they are based on the self-attention mechanism instead of the recurrence adopted by RNNs. Self-attention can attend to different positions of a sequence and extract meaningful representations. The self-attention mechanism takes three inputs: queries, values, and keys.

Let us denote the queries as $\mathbf{Q}\in\mathrm{R}^{t_q\times d_q}$, the values as $\mathbf{V}\in\mathrm{R}^{t_v\times d_v}$, and the keys as $\mathbf{K}\in\mathrm{R}^{t_k\times d_k}$, where $t_{*}$ are the corresponding dimensions. The output of self-attention is calculated as:
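In the standard Transformer formulation (Vaswani et al. 19), this is the scaled dot-product attention, which assumes $d_q = d_k$ and $t_k = t_v$:

```latex
\[
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) =
\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}
\]
```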

where $\frac{1}{\sqrt{d_k}}$ is a scaling factor. However, the Transformer adopts multi-head attention, which calculates the self-attention $h$ times, once for each head $i$. In this way, each attention module focuses on different parts and learns different representations. The multi-head attention is computed as:
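In the standard formulation, with $\mathrm{Attention}$ denoting the self-attention operation described above, this is:

```latex
\[
\begin{aligned}
\mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &=
  \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^O \\
\mathrm{head}_i &=
  \mathrm{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)
\end{aligned}
\]
```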

where $\mathbf{W}_i^Q\in \mathrm{R}^{d_{model}\times d_q}$, $\mathbf{W}_i^K\in\mathrm{R}^{d_{model}\times d_k}$, $\mathbf{W}_i^V\in\mathrm{R}^{d_{model}\times d_v}$, $\mathbf{W}^O\in \mathrm{R}^{hd_v\times d_{model}}$, and $d_{model}$ is the dimensionality of the Transformer. Finally, a feed-forward network is used that contains two fully connected layers and a ReLU activation function as:
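In the standard Transformer formulation, with $\mathbf{x}$ denoting the input of the feed-forward network (a symbol assumed here), this is:

```latex
\[
\mathrm{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\,\mathbf{W}_2 + \mathbf{b}_2
\]
```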

where $\mathbf{W}_1\in \mathrm{R}^{d_{model}\times d_{ff}}$ and $\mathbf{W}_2\in \mathrm{R}^{d_{ff}\times d_{model}}$ are the weights and $\mathbf{b}_1\in \mathrm{R}^{d_{ff}}$ and $\mathbf{b}_2\in \mathrm{R}^{d_{model}}$ are the biases. In general, to enable the Transformer to attend to relative positions, we adopt a positional encoding which is added to the input. The most common technique is the sinusoidal encoding, described by:
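In the form given by Vaswani et al. 19, with even and odd dimensions using sine and cosine respectively:

```latex
\[
\begin{aligned}
\mathrm{PE}(j, 2i)   &= \sin\!\left(\frac{j}{10000^{2i/d_{model}}}\right) \\
\mathrm{PE}(j, 2i+1) &= \cos\!\left(\frac{j}{10000^{2i/d_{model}}}\right)
\end{aligned}
\]
```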

where $j$ and $i$ represent the position in the sequence and the $i$-th dimension, respectively. Finally, normalization layers and residual connections are used to speed up training.
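To make the mechanics concrete, here is a small NumPy sketch of single-head scaled dot-product attention together with the sinusoidal encoding. It is a simplified illustration: the learned projections, multiple heads, residual connections, and normalization are omitted, and the function names are chosen for this example only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (t_q, d_k), K: (t_k, d_k), V: (t_k, d_v) -> (t_q, d_v)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (t_q, t_k), each row sums to 1
    return weights @ V, weights

def sinusoidal_encoding(length, d_model):
    """Returns a (length, d_model) table; d_model is assumed to be even."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions
    pe[:, 1::2] = np.cos(angle)   # odd dimensions
    return pe

rng = np.random.default_rng(0)
pe = sinusoidal_encoding(6, 8)
x = rng.normal(size=(6, 8)) + pe                       # add positions to the input
out, weights = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape, weights.sum(axis=-1))
```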

  • Speech-Transformer

The Speech-Transformer 20 transforms the speech feature sequence to the corresponding character sequence. The feature sequence, which is longer than the output character sequence, is constructed from 2-dimensional spectrograms with time and frequency dimensions. More specifically, CNNs are used to exploit the structure locality of spectrograms and mitigate the length mismatch by striding along time.

[Figure: speech_transformer]

In the Speech-Transformer, 2D attention is used in order to attend to both the frequency and the time dimensions. The queries, keys, and values are extracted from convolutional neural networks and fed to the two self-attention modules. The Speech-Transformer is evaluated on the WSJ dataset and achieves competitive recognition results with a WER of 10.9%, while it needs about 80% less training time than conventional RNNs or CNNs.

  • Transformers with convolutional context

Mohamed et al. 21 adopt an encoder-decoder model formed by CNNs and a Transformer to learn local relationships and context of the speech signal. For the encoder, 2D convolutional modules with layer normalization and ReLU activation are used. In addition, each 2D convolutional module is formed by K K K convolutional layers with max-pooling. For the decoder, 1D convolutions are performed over embeddings of the past predicted words.

  • Transformer-Transducer

Similar to RNN-Transducer, a Transformer-Transducer 22 model has also been developed for speech recognition. Compared to RNN-T, this model's joint network combines the output of the audio encoder $\mathrm{AE}$ at time-step $t_i$ and the previously predicted label sequence $\mathbf{z}_0^{i-1} = (z_1, \dots, z_{i-1})$, which is produced from a feedforward network and a softmax layer, denoted as $\mathrm{LE}$.

The joint representation is produced as:

where $\mathbf{fc}$ is a fully connected layer.

Then, the distribution of the alignment at time-step $t_i$ is computed as:

The Conformer 24 is a variant of the original Transformer that combines CNNs and transformers in order to model both local and global speech dependencies by using a more efficient architecture and fewer parameters. The module of the Conformer contains two feedforward layers (FFN), one convolutional layer (CNN), and a multi-head attention module (MHA). The output of the Conformer is computed as:
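Following the Conformer paper 24, with $\mathbf{x}$ as the block input and $\mathbf{y}$ as its output (the intermediate names $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are introduced here only for readability), the block can be written as:

```latex
\[
\begin{aligned}
\mathbf{x}_1 &= \mathbf{x} + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}) \\
\mathbf{x}_2 &= \mathbf{x}_1 + \mathrm{MHA}(\mathbf{x}_1) \\
\mathbf{x}_3 &= \mathbf{x}_2 + \mathrm{CNN}(\mathbf{x}_2) \\
\mathbf{y}   &= \mathrm{LayerNorm}\!\left(\mathbf{x}_3 + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}_3)\right)
\end{aligned}
\]
```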

Here, the convolutional module adopts efficient pointwise and depthwise convolutions along with layer normalization.

[Figure: conformer]

CTC and language models have also been used with Transformer networks 23.

  • Semantic mask for transformer-based ASR

Wang et al. 25 utilized a semantic mask of the input speech according to corresponding output tokens in order to generate the next word based on the previous context. A VGG-like convolution layer is used in order to generate short-term dependent features from the input spectrogram, which are then modeled by a Transformer. On the decoder network, the position encoding is replaced by a 1D convolutional layer to extract local features.

  • Weak-attention suppression for transformer-based ASR

Shi et al. 26 propose a weak attention module to suppress non-informative parts of the speech signal, such as silence. The weak attention module sets attention probabilities smaller than a threshold to zero and renormalizes the remaining attention probabilities.

The threshold is determined based on the following:

Then, softmax is applied again on the new attention probabilities to generate the new attention matrix.
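The suppression step can be sketched as below. The threshold form used here (per-query mean minus `gamma` times the standard deviation of the attention probabilities) and the value `gamma=0.5` are assumptions for illustration. Dividing the surviving probabilities by their sum gives the same result as re-applying softmax with the suppressed positions masked out.

```python
import numpy as np

def weak_attention_suppression(probs, gamma=0.5):
    """probs: (t_q, t_k) attention probabilities, each row summing to 1."""
    mean = probs.mean(axis=-1, keepdims=True)
    std = probs.std(axis=-1, keepdims=True)
    threshold = mean - gamma * std                     # assumed form of the threshold
    kept = np.where(probs >= threshold, probs, 0.0)    # suppress weak attention
    return kept / kept.sum(axis=-1, keepdims=True)     # renormalize what is left

probs = np.array([[0.7, 0.2, 0.05, 0.05]])
suppressed = weak_attention_suppression(probs)
print(suppressed)
```

The largest probability in a row is never below the row mean, so at least one position always survives and the division is safe.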

[Figure: vgg_transformer]

It is evident that deep architectures have already had a significant impact on automatic speech recognition. Convolutional neural networks, recurrent neural networks, and transformers have all been utilized with great success. Today’s SOTA models are all based on some combination of the aforementioned techniques. You can find some benchmarks on the popular datasets on paperswithcode.

If you find this article useful, you might also be interested in a previous one where we review the best speech synthesis methods. And as always, feel free to share it with your friends.

  • Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch, and Sanjeev Khudanpur, “Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus,” in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, December 2013.
  • John S Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  • Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6645–6649.
  • Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, “Exploring architectures, data, and units for streaming end-to-end speech recognition with rnn-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 193–199.
  • Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
  • Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
  • Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015.
  • William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
  • Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
  • Takaaki Hori, Jaejin Cho, and Shinji Watanabe, “End-to-end speech recognition with word-based rnn language models,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 389–396.
  • Douglas B Paul and Janet Baker, “The design for the wall street journal-based csr corpus,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.
  • Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
  • Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang, “Residual convolutional ctc networks for automatic speech recognition,” arXiv preprint arXiv:1702.07793, 2017.
  • Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv preprint arXiv:1904.03288, 2019.
  • Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert, “Fully convolutional speech recognition,” arXiv e-prints, pp. arXiv–1812, 2018.
  • Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” arXiv preprint arXiv:1904.02619, 2019.
  • Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu, “Contextnet: Improving convolutional neural networks for automatic speech recognition with global context,” arXiv preprint arXiv:2005.03191, 2020.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
  • Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884–5888.
  • Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer, “Transformers with convolutional context for asr,” arXiv preprint arXiv:1904.11660, 2019.
  • Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833.
  • Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani, “Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” Proc. Interspeech 2019, pp. 1408–1412, 2019.
  • Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • Chengyi Wang, Yu Wu, Yujiao Du, Jinyu Li, Shujie Liu, Liang Lu, Shuo Ren, Guoli Ye, Sheng Zhao, and Ming Zhou, “Semantic mask for transformer-based end-to-end speech recognition,” arXiv preprint arXiv:1912.03010, 2019.
  • Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, and Michael L Seltzer, “Weak-attention suppression for transformer-based speech recognition,” arXiv preprint arXiv:2005.09137, 2020.

How Neural Networks Recognize Speech-to-Text

Ivan Ozhiganov


Founder & CEO at Azoft


Gartner experts say that by 2020, businesses will automate conversations with their customers. According to statistics, companies lose up to 30% of incoming calls because call center employees either miss calls or lack the competence to communicate effectively.

To quickly and efficiently process incoming requests, modern businesses use chatbots. Conversational AI assistants are replacing standard chatbots and IVR. They are especially in demand among B2C companies. They use websites and mobile apps to stay competitive. Convolutional neural networks are trained to recognize human speech and automate call processing. They help to keep in touch with customers 24/7 and simplify the typical request processing.

There is no doubt that in the future call centers will become independent from operator qualification. Speech synthesis and recognition technologies will be a reliable support for them.

Our R&D department is interested in these technologies and has conducted new research at the client’s request. They trained neural networks to recognize a set of 14 voice commands. The learned commands can be used for robocalls. Keep reading to learn about the results of the study and how they can help businesses.

Why Businesses Should Consider Speech-to-text Recognition

Speech recognition technologies are already used in mobile applications — for example, in Amazon Alexa or Google Now. Smart voice systems make apps more user-friendly as it takes less time to speak rather than type. Beyond that, voice input frees hands up.

Speech-to-text technologies solve many business issues. For instance, they can:

  • automate call processing when customers want to get a consultation, place or cancel an order, or participate in a survey,
  • support Smart Home system management interface, electronic robots and household device interfaces,
  • provide voice input in computer games and apps, as well as voice-controlled cars,
  • allow people with disabilities to access social services,
  • transfer money by voice commands.

Call centers have become the “ears” of business. To make these “ears” work automatically, R&D engineers train bots using machine learning.

Azoft’s R&D department has concrete and practical expertise in solving transfer learning tasks. We’ve written multiple articles on:

  • face recognition technology in photo and video,
  • object detection in images using convolutional neural networks.

This time, our R&D department trained a convolutional neural network to recognize speech commands and to study how neural networks can help in dealing with speech-to-text tasks.

How Neural Networks Recognize Audio Signals

The new project’s goal is to create a model that correctly identifies a word spoken by a human. To get the final model, we trained neural networks on a large body of data and then fine-tuned them on the target data. This method helps when you don’t have access to a large sample of target data.

As part of a study, we:

  • studied the features of signal processing by a neural network
  • preprocessed and identified the attributes that would help to recognize words from a voice recording (these attributes are on the input, and the word is on the output)
  • researched how to apply convolutional networks in a speech-to-text task
  • adapted the convolutional network to recognize speech
  • tested the model in streaming recognition

How We Taught Neural Networks to Recognize Incoming Audio Signals

For the research, we used audio signals in the WAV format, with 16-bit quantization at a sampling frequency of 16 kHz. We took one second as the standard duration. Each recording contained one word. We used 14 simple words: zero, one, two, three, four, five, six, seven, eight, nine, ten, yes, and no.

Attribute Extraction

The raw representation of the sound stream is hard to interpret, as it is just a sequence of numbers in time. This is why we used a spectral representation. It allowed us to decompose the sound into waves of different frequencies and find out which waves formed the original sound and what features they had. Taking into account the logarithmic dependence of how humans perceive frequencies, we used mel-frequency spectral coefficients.

The process of extracting spectral attributes


  • Pre-emphasis

Signals differ in volume level. To bring the audio to a uniform form, we normalized it and used a high-pass filter to reduce noise. Pre-emphasis is a filter commonly used in speech recognition tasks. It amplifies high frequencies, which increases noise resistance and provides more information to the acoustic model.
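A minimal sketch of these two steps is shown below. The first-order filter form, the coefficient `alpha=0.97`, and peak normalization are common choices assumed here for illustration; the article does not state which values or normalization were actually used.

```python
import numpy as np

def peak_normalize(signal):
    """Scale to a peak amplitude of 1 (one possible way to equalize volume)."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def pre_emphasis(signal, alpha=0.97):
    """y[t] = x[t] - alpha * x[t-1]; the first sample is passed through unchanged."""
    signal = np.asarray(signal, dtype=np.float64)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])   # slow-changing input is attenuated
print(constant)
```

A constant (zero-frequency) input is almost cancelled after the first sample, while fast changes between neighboring samples pass through, which is the high-frequency boost described above.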

The original signal is not stationary. It is divided into small segments (frames) that overlap each other and are considered stationary. We applied the Hann window function to taper the ends of the frames to zero. In our study, we used 30 ms frames with an overlap of 15 ms.
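With the numbers above and a 16 kHz sampling rate, a frame is 480 samples long and consecutive frames start 240 samples apart. A minimal NumPy sketch follows; the function name `frame_signal` is hypothetical, and trailing samples that do not fill a whole frame are simply dropped.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=30, hop_ms=15):
    frame_len = sample_rate * frame_ms // 1000   # 480 samples
    hop = sample_rate * hop_ms // 1000           # 240 samples, i.e. 15 ms overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row r holds the sample indices of frame r.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]                         # (n_frames, frame_len)
    return frames * np.hanning(frame_len)        # taper both ends of each frame to zero

rng = np.random.default_rng(0)
one_second = rng.normal(size=16000)
frames = frame_signal(one_second)
print(frames.shape)
```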

  • Short-time Discrete Fourier Transform

The Fourier transform allows you to decompose the original stationary signal into a set of harmonics of different frequencies and amplitudes. We apply this operation to the frame and get its frequency representation. Applying the Fourier transform to all the frames forms a spectral representation. Then, we calculate the spectrum power. The spectrum power is equal to half the square of the spectrum.
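A sketch of this step with NumPy's real FFT is given below. The FFT size of 512 is an assumption, and the factor of one half simply follows the description above; other scalings, such as dividing by the FFT size, are also common.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """frames: (n_frames, frame_len) -> (n_frames, n_fft // 2 + 1)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # complex frequency representation
    return 0.5 * np.abs(spectrum) ** 2                 # scaling follows the text above

sample_rate = 16000
t = np.arange(480) / sample_rate
frame = np.sin(2 * np.pi * 1000 * t)[None, :]          # one 30 ms frame of a 1 kHz tone
power = power_spectrum(frame)
peak_bin = int(np.argmax(power[0]))
print(power.shape, peak_bin * sample_rate / 512)       # frequency of the strongest bin
```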

  • Log mel filterbank

According to scientific studies, humans distinguish low frequencies better than high ones, and this perception is logarithmic. For this reason, we applied a bank of N triangular filters, each with a value of 1 at its center (Image 2). As the filter index increases, the center shifts up in frequency and the filter base widens logarithmically. This allowed us to capture more information in the lower frequencies and compress the representation of the high frequencies of the frame.
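The filter bank can be sketched as follows. The mel formula `2595 * log10(1 + f / 700)`, the 40 filters, and the FFT size of 512 are common defaults assumed for this example, not values confirmed by the article.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    # Filter edges are equally spaced on the mel scale, so they spread out in Hz.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)   # frequency of each FFT bin
    fbank = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))  # triangle with peak 1
    return fbank

def log_mel(power_spec, fbank, eps=1e-10):
    """power_spec: (n_frames, n_bins) -> (n_frames, n_filters) log mel energies."""
    return np.log(power_spec @ fbank.T + eps)

fbank = mel_filterbank()
rng = np.random.default_rng(0)
features = log_mel(rng.random(size=(65, 257)), fbank)   # dummy power spectrum
print(fbank.shape, features.shape)
```

The low-frequency filters cover only a few FFT bins while the high-frequency ones cover many, which is how the high end of the spectrum gets compressed.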

Mel-spectrogram

The Choice of Architecture

We used a convolutional neural network as a basic architecture. It was the most suitable model for this task. A CNN analyzes spatial dependencies in an image through a two-dimensional convolution operation. The neural network analyzes nonstationary signals and identifies important criteria in the time and frequency domains.

The input is a tensor of size n × k, where n is the number of frequencies and k is the number of time samples. Because n is usually not equal to k, we used rectangular filters.

The Model’s Architecture


We adapted the standard convolutional network architecture to process the spectral representation of the signal. In addition to two-dimensional filters on the first layer, which distinguished common time-frequency features, one-dimensional filters were used.

To bring this idea to fruition, we had to separate the processes of identifying frequency and time criteria. To accomplish this, the second and third layers were made to contain sets of one-dimensional filters in the frequency domain. The next layer extracted time attributes. Global Max Pooling allowed us to compress the resulting attributes map into a single attribute vector.

Model Preparation

We used transfer learning to improve the quality of the model. The chosen architecture was first trained on a large dataset from Google containing 65,000 one-second recordings of 30 commands in English.

Learning and Testing Results

The training sample included 137488 objects and the testing sample had 250 objects. For the testing sample, we took speakers’ recordings that were not included in the training sample. We trained the neural network using the Adam optimization method in three variations:

  • model training from scratch (Fresh)
  • convolutional layer freezing in a pre-trained model (Frozen)
  • retraining a pre-trained model without freezing (Pre-Trained)

“Fresh” was carried out in seven stages and “Frozen” and “Pre-trained” in three stages. Check out the results in the table below.

Table showing results

As a result, we chose to use a neural network pre-trained on a large dataset, with fine-tuning and without freezing the convolutional layers. This model adapts better to the new data.

Stream Test

The model was also tested live. The speaker pronounced words into the microphone, and the network produced the result. We didn’t use the speaker’s voice in the training sample. This allowed us to check the quality on unseen data. The sound was read every quarter second, the cached second was updated, and the model classified it. To avoid neural network mistakes, we used a confidence threshold.
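The streaming loop described above can be sketched as follows. The function names, the threshold of 0.8, and the stand-in classifier are all hypothetical; a real system would call the trained network in place of `dummy_classify`.

```python
import numpy as np
from collections import deque

SAMPLE_RATE = 16000
CHUNK = SAMPLE_RATE // 4          # 0.25 s of audio = 4000 samples

def stream_recognize(chunks, classify, labels, threshold=0.8):
    """Yields a label (or None) after every incoming quarter-second chunk."""
    window = deque(maxlen=4)      # the cached second = the last four chunks
    for chunk in chunks:
        window.append(chunk)
        if len(window) < 4:
            yield None            # not enough audio cached yet
            continue
        probs = classify(np.concatenate(window))
        best = int(np.argmax(probs))
        # Report a word only when the model is confident enough.
        yield labels[best] if probs[best] >= threshold else None

# Stand-in classifier for the demo (hypothetical): confident "yes" if the window is loud.
def dummy_classify(audio):
    loud = np.mean(np.abs(audio)) > 0.5
    return np.array([0.9, 0.1]) if loud else np.array([0.5, 0.5])

quiet = [np.zeros(CHUNK)] * 4
loud = [np.ones(CHUNK)] * 4
results = list(stream_recognize(quiet + loud, dummy_classify, ["yes", "no"]))
print(results)
```

Keeping only the last four chunks means each classification sees exactly one second of audio, matching the one-second clips the model was trained on.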

Test device characteristics:

  • CPU: Intel Core i7 7700HQ 2.8 GHz

Test characteristics:

  • Incoming audio stream refresh rate: 0.25 sec
  • Number of channels: 1
  • Sampling rate: 16 kHz
  • Incoming stream: 4000 samples of 16 bits

Recognition speed:

  • Silence: 0.23 sec
  • Active speech: 0.38 sec

Stream Test Results

The model recognized individual commands from the target dataset well but could give false answers to words that sound similar to the commands from the dataset. In continuous speech, consisting of many words, the quality of the processing of audio signals dropped significantly.

We examined the recognition of commands from the speech stream and revealed that:

  • Transfer learning can be very helpful when there is not a large body of data. Preprocessing and methods of representing audio signals are important in the recognition of commands.
  • Noise makes it difficult to recognize audio.
  • Similar speech recognition technology can be applied when the dictionary of commands is small and known in advance.
  • To train a neural network, quality data is needed.

Signal recognition by neural networks has already sparked great interest among businesses as a way to establish communication with Generation Z. This audience uses messages as the main means of communicating with friends, viewing content, and exploring products. The purchasing power of Generation Z is 200 billion dollars a year, a number that is expected to increase in the future.

Chatbot developers take into account the perspectives of millennials: today users easily make orders with instant messengers like Facebook, WhatsApp, Viber, Slack, and Telegram. According to researchers, 80% of companies will increase the number of their customer self-services in 2 years. Audio recognition systems will be a useful feature.

Our team will continue studying this topic. We will study new learning models that can improve speech-to-text recognition using neural networks.


Andrew Gibiansky

Speech Recognition with Neural Networks

Wednesday, April 23, 2014

We've previously talked about using recurrent neural networks for generating text, based on a similarly titled paper. Recently, recurrent neural networks have been successfully applied to the difficult problem of speech recognition. In this post, we'll look at the architecture that Graves et al. propose in that paper for their task.

Neural Network Architecture

We will begin by discussing the architecture of the neural network used by Graves et al. However, the architecture of the neural network is only the first of the major aspects of the paper; later, we discuss exactly how we use this architecture for speech recognition.

Standard Recurrent Neural Networks

Recall that a recurrent neural network is one in which each layer represents another step in time (or another step in some sequence), and that each time step gets one input and predicts one output. However, the network is constrained to use the same "transition function" for each time step, thus learning to predict the output sequence from the input sequence for sequences of any length.

For a standard recurrent neural network, we iterate the following equations in order to do prediction:

$$\begin{aligned} h_i &= \sigma(W_{hh} h_{i-1} + W_{hx} x_i + b_h) \\ \hat y_i &= W_{yh} h_i \end{aligned}$$

The hidden layer at step $i$ is given by $h_i$; similarly, $x_i$ is the input layer at timestep $i$, $\hat y_i$ is the output layer at timestep $i$, and the $W_*$ are the weight matrices (with biases $b_*$).

Note that this formulation of recurrent networks (RNNs) is equivalent to having a one-hidden-layer feed-forward network at each timestep (with layers $x_i$, $h_i$, and $y_i$). One can also consider $h_{i-1}$ to be part of the input layer at each timestep.
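As a rough illustration, the two equations above can be iterated in a few lines of NumPy. This is only a sketch: the function name `rnn_forward`, the zero initial hidden state, and the toy dimensions are assumptions made here, not something specified in the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hh, W_hx, b_h, W_yh):
    """Iterate h_i = sigmoid(W_hh h_{i-1} + W_hx x_i + b_h) and y_i = W_yh h_i."""
    h = np.zeros(W_hh.shape[0])  # assumption: the initial hidden state is zero
    ys = []
    for x in xs:
        # The same weights (the "transition function") are reused at every timestep.
        h = sigmoid(W_hh @ h + W_hx @ x + b_h)
        ys.append(W_yh @ h)
    return np.array(ys)

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5
xs = rng.normal(size=(T, n_in))
ys = rnn_forward(
    xs,
    rng.normal(size=(n_hidden, n_hidden)),
    rng.normal(size=(n_hidden, n_in)),
    np.zeros(n_hidden),
    rng.normal(size=(n_out, n_hidden)),
)
print(ys.shape)  # one output vector per input timestep
```

Because the loop body does not depend on the sequence length, the same function handles sequences of any length.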

Long Short-Term Memory

Although standard RNNs are powerful in theory, they can be very difficult to train. Techniques such as Hessian-free optimization have been applied to them in order to improve training capacity. However, in addition to modifying the training algorithm, we can modify the network architecture to make it easier to train.

One of the reasons training networks is difficult is that the errors computed in backpropagation are multiplied by each other once per timestep. If the errors are small, the error quickly dies out, becoming very small; if the errors are large, they quickly become very large due to repeated multiplication. An alternative architecture built with Long Short-Term Memory (LSTM) cells attempts to negate this issue.

A single LSTM unit is shown below.

The inputs $x_t$ dictate the behaviour of our LSTM cell. Note that each of the cells (circles) shown above may actually be vectors, and can store a large set of values.

The intuition behind this memory unit is that the cell $c_t$ stores a value over time. It feeds itself and can either remember or forget its value, depending on the activation of the forget gate $f_t$. The cell can optionally output its value, depending on the activation of the output gate $o_t$. Finally, the cell acquires new values if the input gate $i_t$ allows it to. Note that in the diagram above, the $\otimes$ symbols indicate these complex activation functions that allow these behaviours.

Next, we present the equations that implement the LSTM unit. First of all, the cell $c_t$ must forget its value if the forget gate $f_t$ is low and acquire a new value if the input gate $i_t$ is high. The value it acquires is dictated by the previous hidden layer and the current input. Thus, $c_t$ is determined via the following equation:

$$ c_t = f_t c_{t-1} + i_t \tanh(W_{hc} h_{t-1} + W_{xc} x_t + b_c) $$

Note that when $f_t = 0$, the cell completely forgets its old value, and stores a new one. When $i_t = 1$, the new value is given by the inputs and the previous hidden layer; if the input gate is off, though, the cell is either unchanged (for $f_t = 1$) or simply set to zero (for $f_t = 0$). In this manner, the cell implements the main memory storage function of the unit.

In order to let $f_t$ and $i_t$ range from zero to one, we let their activation functions be the standard sigmoid, a differentiable approximation to the step function. Both of these are fairly standard, and depend on the input and previous values of the cell and hidden layers:

$$\begin{align*} f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\ i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \end{align*}$$

The only caveat to these is that we enforce a constraint on the weight matrices from the cell to the gate. Recall that the gate is actually a vector of cells. We enforce the constraint that the $m$th gate element depends only on the $m$th cell element, so that each element of the LSTM unit acts independently. We encode this constraint by enforcing that $W_{c*}$ is a diagonal matrix.

In order to get the data out of the cell, we have a custom output gate. The output gate is computed just like the other two gates:

$$\begin{align*} o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o) \end{align*}$$

The output gate controls whether the hidden state comes from the cell:

$$h_t = o_t \tanh c_t$$

When the output gate is high, the hidden state is set directly from the cell; when the output gate is low, the hidden state is effectively set to zero.

Note that in this diagram, $h$ is the entire hidden state. However, in a full RNN architecture we may want to mix LSTM units with standard RNN units, in which case $h$ may contain other things as well.
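To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step that follows the equations exactly as written above. The function names, the parameter dictionary `p`, and the choice to store the diagonal matrices $W_{c*}$ as vectors (`w_cf`, `w_ci`, `w_co`) are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One timestep of the LSTM unit; p maps names from the equations to arrays."""
    # W_c* is diagonal, so it is stored as a vector and applied elementwise.
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    # The cell keeps f * (old value) and takes in i * (candidate value).
    c = f * c_prev + i * np.tanh(p["W_hc"] @ h_prev + p["W_xc"] @ x + p["b_c"])
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c_prev + p["b_o"])
    h = o * np.tanh(c)
    return h, c

def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero parameters, convenient for experiments."""
    p = {}
    for g in "fio":
        p[f"W_x{g}"] = np.zeros((n_hidden, n_in))
        p[f"W_h{g}"] = np.zeros((n_hidden, n_hidden))
        p[f"w_c{g}"] = np.zeros(n_hidden)
        p[f"b_{g}"] = np.zeros(n_hidden)
    p["W_xc"] = np.zeros((n_hidden, n_in))
    p["W_hc"] = np.zeros((n_hidden, n_hidden))
    p["b_c"] = np.zeros(n_hidden)
    return p

# Forget gate pushed high and input gate pushed low: the cell should hold its value.
p = zero_params(n_in=2, n_hidden=3)
p["b_f"] += 10.0
p["b_i"] -= 10.0
h, c = lstm_step(np.ones(2), np.zeros(3), np.ones(3), p)
print(c)  # stays close to the previous cell value of 1
```

Setting the biases by hand like this is only a way to see the "remember" behaviour in isolation; in a trained network the gates are driven by the learned weights.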

Bidirectional RNNs

In a standard RNN, the output at a given time $t$ depends exclusively on the inputs $x_0$ through $x_t$ (via the hidden layers $h_0$ through $h_{t-1}$). However, while this makes sense in some contexts, many sequences have information relevant to output $y_t$ both before timestep $t$ and after timestep $t$. In speech recognition, specifically, the sound before and after a given point gives information about the sound at a particular point in the sequence.

In order to utilize this information, we need a modified architecture. There are several possible approaches:

  • Windowed Feed-Forward Network: Instead of using an RNN, simply use a window around the output and use a standard feed forward network. This has a benefit of being easier to train; however, it limits applicability because we must have a window of exactly that size, and because we do not use information far away from the output (the size of the window is limiting).
  • RNN with delay: Instead of predicting timestep $t$ after seeing inputs 0 through $t$, predict timestep $t$ after seeing inputs 0 through $t + d$, where $d$ is some fixed delay. This is fairly close to a standard RNN, but also lets you look a few steps in the future for contextual information.
  • Bidirectional RNN: Add another set of hidden layers to your recurrent network going backwards in time. These two hidden layers are entirely separate and do not interact with each other, except for the fact that they are both used to compute the output. Given your weights, you need to run propagation forward in time (from time 0 to the end) to compute the forward hidden layers, and run it backward in time (from the end to time 0) to compute the backward hidden layers; finally, using the values at both of the hidden layers for a given timestep, compute the output at every timestep.

The paper that introduced bidirectional RNNs (by Schuster and Paliwal) has two graphics that are very helpful for understanding them and the differences from these other approaches. First of all, we can visualize what part of a sequence each type of network can utilize in order to predict a value at time $t_c$:

The windowed approach is labeled MLP (for multilayer perceptron). Standard RNNs are labeled RNN, and utilize information right up to $t_c$. Delayed RNNs (going forward and backward) can use all their history, with an extra window around $t_c$. Finally, bidirectional RNNs (BRNNs) can use the entire sequence for their prediction.

Graves et al. propose using LSTM units in a bidirectional RNN for speech recognition, so we focus on that approach. It can be trained similarly to a standard RNN; however, it looks slightly different when expanded in time (shown in the graphic below, also from Schuster and Paliwal).

Here we see the BRNN expanded in time, showing only the timesteps around timestep $t$. We see that in the middle we have two hidden states (gray), one propagating forwards and one propagating backwards in time. The input (striped) feeds to both of these, and both of them feed to the output of the RNN (black).
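The forward and backward passes described above can be sketched in NumPy as follows. For brevity this example uses plain sigmoid RNN layers rather than LSTM units, and the helper names (`run_rnn`, `brnn_forward`, `random_params`) and toy sizes are assumptions of the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(xs, W_hh, W_hx, b_h):
    """Return the hidden state at every timestep of a plain sigmoid RNN."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = sigmoid(W_hh @ h + W_hx @ x + b_h)
        hs.append(h)
    return np.array(hs)

def brnn_forward(xs, fwd_params, bwd_params, W_yf, W_yb):
    # Forward layer: runs from the first timestep to the last.
    h_fwd = run_rnn(xs, *fwd_params)
    # Backward layer: runs over the reversed sequence, then is flipped back
    # so that row t again lines up with timestep t.
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    # The two layers never interact except through the output.
    return h_fwd @ W_yf.T + h_bwd @ W_yb.T

def random_params(rng, n_in, n_hidden):
    return (
        rng.normal(size=(n_hidden, n_hidden)),
        rng.normal(size=(n_hidden, n_in)),
        np.zeros(n_hidden),
    )

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 4
fwd = random_params(rng, n_in, n_hidden)
bwd = random_params(rng, n_in, n_hidden)
W_yf = rng.normal(size=(n_out, n_hidden))
W_yb = rng.normal(size=(n_out, n_hidden))
xs = rng.normal(size=(T, n_in))
ys = brnn_forward(xs, fwd, bwd, W_yf, W_yb)
print(ys.shape)
```

Changing only the last input changes the output at the first timestep, which is exactly the access to future context that a forward-only RNN lacks.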

Final Architecture

The architecture ultimately proposed by Graves et al. in their paper utilizes both BRNNs and LSTM units. However, in addition, they extend the architecture by adding more hidden layers at each timestep. Instead of only having one hidden layer between the input and the output, the BRNN has $N$ hidden layers. In a normal deep RNN, each hidden layer $h^n_t$ (at time $t$) receives input from the previous hidden layer $h^{n-1}_t$ as well as from the same hidden layer from the previous time step $h^n_{t-1}$.

However, in a BRNN, each hidden layer has a direction associated with it; we can denote the two directions as $+$ and $-$, as in $h^{n,+}_t$ and $h^{n,-}_t$. Each hidden layer receives input not just from its previous timestep but also from the previous hidden layers of both directions. Thus, by utilizing a deep architecture, information from the end and beginning of a sequence can very effectively mix to form the prediction.

In addition, using LSTM units allows information to propagate for long distances both from the beginning and end of the sequence.

Training an Acoustic Model

The first goal for speech recognition is to build a classifier which can convert from a sequence of sounds into a sequence of letters or phonemes.

Suppose that we have an input sequence $x$ (sound data) and a desired output sequence $y$ (phonemes). However, even if our output sequence is short (say, two spoken words, maybe ten or twenty sounds), our input sequence will be much longer, as we will want to sample each sound many times to be able to distinguish them. Thus, $x$ and $y$ will be of different lengths, which poses a problem for our standard RNN architecture (in which we predict one output for one input).

We have several options for correcting this problem. The first option is to align the output sequence $y$ with the input sequence; each element $y_i$ of the output sequence is placed on some corresponding element $x_i$. Then, the network is trained to output $y_i$ at timestep $i$ (with input $x_i$) and output a "blank" element on timesteps for which there is no output. These sequences are said to be "aligned", since we've placed each output element $y_i$ in its proper temporal position.

Sadly, aligning the sequences is an onerous requirement. While unaligned data may be easy to come by (simply record sound and ask speakers to transcribe it), aligned data may be much harder to acquire; it may require careful aligning as well as understanding of the sounds being produced (and a sound understanding of phonology).

Instead of requiring aligned data, however, we can train our network directly on unaligned data. This requires some clever tricks, objective functions, and output decoding algorithms; collectively, this method is known as Connectionist Temporal Classification.

Connectionist Temporal Classification

For the purposes of Connectionist Temporal Classification (CTC), consider the entire neural network to be simply a function that takes in some input sequence $x$ (of length $T$) and outputs some output sequence $y$ (also of length $T$). As long as we have an objective function on the output sequence $y$, we can train our network to produce the desired output.

Suppose that for each input sequence $x$ (sound data) we have a label $\ell$. The label is a sequence of letters from some alphabet $L$, which is potentially shorter than the input sequence $x$; let $U$ be the length of the label. The key idea behind CTC is that instead of somehow generating the label as output from the neural network, we instead generate a probability distribution at every timestep (from $t = 1$ to $t = T$). We can then decode this probability distribution into a maximum likelihood label. Finally, we train our network by creating an objective function that coerces the maximum likelihood decoding for a given sequence $x$ to correspond to our desired label $\ell$.

There are several moving parts here, and we will talk about them in order.

Probability Distribution

Given an input sequence $x$ of length $T$, the network generates some output $y$ which parameterizes a probability distribution over the space of all possible labels. Let $L'$ be our alphabet $L$ with an extra symbol representing a "blank". The output layer of our network is required to be a softmax layer, which assigns a probability to each element of $L'$. Let $y_i(n)$ be the probability assigned by the network to seeing $n \in L'$ at time $t = i$.

The output generated by the network is known as a "path". The probability of a given path $\pi$ (given inputs $x$) can then be written as the product of all its constituent elements:

$$P(\pi | x) = \prod_{t=1}^T y_t(\pi_t), \text{ where $\pi_t$ is the $t^\text{th}$ element of the path $\pi$}$$

Note that this assumes that the outputs are all conditionally independent (given the internal state of the network); we ensure this by forbidding connections from the output layer to other output layers or to other hidden layers.

If we traverse the path by removing all blanks and duplicate letters, we get some label. Note that we remove duplicate letters in addition to blanks; effectively, this means we really care about transitions from blanks to letters or from a letter to another letter. Let label($\pi$) be the label corresponding to a path $\pi$. Thus, the probability of seeing a particular label $\ell$ given the input sequence $x$ can be written as the sum of all the path probabilities over the paths that get us that label:

$$P(\ell | x) = \sum_{\text{label}(\pi) = \ell} P(\pi | x) = \sum_{\text{label}(\pi) = \ell} \prod_{t=1}^T y_t(\pi_t)$$
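For very short sequences, the sum above can be evaluated literally by enumerating every path. The sketch below does this; the function names and the convention that index 0 is the blank are assumptions of the example, and the approach is exponential in $T$, so it is useful only as a reference check.

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """Map a path to its label: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def label_probability_bruteforce(y, label, blank=0):
    """Sum P(path | x) over all paths whose collapsed form equals the label.

    y has shape (T, K): row t holds the softmax output at timestep t.
    """
    T, K = y.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(label):
            total += float(np.prod([y[t, path[t]] for t in range(T)]))
    return total

# Two timesteps, alphabet {1} plus the blank 0, uniform outputs.
y = np.full((2, 2), 0.5)
print(label_probability_bruteforce(y, [1]))  # paths (1,0), (0,1), (1,1): 0.75
print(label_probability_bruteforce(y, []))   # only the all-blank path: 0.25
```

Note that repeats are merged before blanks are removed, so a path such as (1, 0, 1) collapses to the two-letter label (1, 1) rather than to (1).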

Output Decoding

Given the probability distribution $P(\ell | x)$, we can compute a label $\ell$ for an input sequence $x$ by taking the most likely label. Thus, given that $L^{\le T}$ is the set of sequences of length less than or equal to $T$ with letters drawn from the alphabet $L$, we can express our desired classifier $h(x)$ as follows:

$$h(x) = \arg \max_{\ell \in L^{\le T}} P(\ell | x)$$

Computing the most likely $\ell$ from the probability distribution $P(\ell | x)$ is known as decoding. However, given that the alphabet $L$ and the maximum sequence length $T$ may be quite large, it is computationally intractable to examine every possible $\ell \in L^{\le T}$. There is no known algorithm to efficiently compute this $h(x)$ precisely; however, there are several ways to approximate decoding which work well enough in practice.

Traditionally, decoding is done in one of two ways, which we discuss now. However, note that Graves et al. do not end up using either of these ways, as they augment the CTC-style network in several ways, and find that a different decoding strategy works better in that context.

Best Path Decoding

The first traditional decoding strategy is best path decoding, which assumes that the most likely path corresponds to the most likely label. This is not necessarily true: suppose we have one path with probability 0.1 corresponding to label $A$, and ten paths with probability 0.05 each corresponding to label $B$. Clearly, label $B$ is preferable overall, since it has an overall probability of 0.5; however, best path decoding would select label $A$, which has a higher probability than any path for label $B$.

Best path decoding is fairly simple to compute; simply look at the most active output at every timestep, concatenate them, and convert them to a label (via removing blanks and duplicates). Since at each step we choose the most active output, the resulting path is the most likely one.
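A minimal sketch of best path decoding, assuming the network outputs are given as a `(T, K)` NumPy array with the blank at index 0 (both the layout and the function name are assumptions of this example):

```python
import numpy as np

def best_path_decode(y, blank=0):
    """Take the most active output at each timestep, then collapse the path."""
    path = np.argmax(y, axis=1)
    label = []
    prev = None
    for s in path:
        s = int(s)
        # Keep a symbol only when it differs from the previous one and is not blank.
        if s != prev and s != blank:
            label.append(s)
        prev = s
    return label

y = np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
])
print(best_path_decode(y))  # path is 1, 1, blank, 2, so the label is [1, 2]

# A small case where the best path is not the best label.
y_bad = np.array([[0.6, 0.4], [0.6, 0.4]])
print(best_path_decode(y_bad))  # []
```

In the second case the all-blank path has probability 0.36, but the three paths that collapse to the label [1] have probabilities 0.24, 0.24, and 0.16, for a total of 0.64, so the empty label chosen by best path decoding is not the most likely one.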

Prefix Search Decoding

As an alternative to the naive best path decoding method, we can perform a search in the label space using heuristics to guide our search and decide when to stop. One particular set of heuristics yields an algorithm called prefix search decoding, which is somewhat inspired by the forward-backward algorithm for hidden Markov models.

The intuition behind prefix search decoding is that instead of searching among all labels, we can look at prefixes of strings. We continue growing the prefixes by appending the most probable element until it is more probable that the prefix ends (the string consists only of that prefix), at which point we stop.

The search proceeds as follows:

At each step, we maintain a list of growing prefixes. Initialise this list with a single element consisting of the empty prefix. Along with each prefix store its probability; we know that the empty prefix has probability one.

Find the most likely prefix. Consider each possible extension of the prefix, or consider terminating the prefix and ending the string.

Compute the probability of each of these options.

If terminating the prefix has a higher probability than extending this or any other prefix, terminate the prefix; we have found our decoding.

If extending the prefix has a higher probability than terminating it, extend the prefix and store it with the new probability (instead of the old, shorter prefix).

Iterate these steps until you have found your decoding.

Note that if given enough time, prefix search will find the true best decoding, and may thus require exponentially many prefixes. However, if the output distribution is concentrated around the best decoding, the search will finish significantly faster; also, heuristics may be used to speed it up. (For instance, Graves et al. cut the sequence into chunks which are likely to start and end with a blank by segmenting based on the probability of a blank, and then run prefix search on the small chunks.)

Graves et al. provide the following diagram to help understand this process in their paper describing CTC networks:

Note that at this point we have said nothing about how we can compute the probability of a prefix once we extend it, which is what we address next.

To efficiently compute the probability of a prefix, we define some extra values which we will compute incrementally in a dynamic programming algorithm. Let $\gamma_t(p_n)$ be the probability that the prefix $p$ is seen at time $t$ and that the last seen output is a non-blank ($n$ stands for non-blank). Similarly, let $\gamma_t(p_b)$ be the probability that the prefix $p$ is seen at time $t$ and the last seen output is a blank ($b$ stands for blank). Note that these are probabilities that the prefix and nothing else have been seen at time $t$. Then, the probability that a given prefix $p$ is the entire labeling is given by

$$P(p|x) = \gamma_T(p_n) + \gamma_T(p_b),$$

where $T$ is the length of our sequence. Also, let us define the harder-to-compute probability of seeing a prefix $p$ that has a non-empty string following it as

$$P(p\ldots|x) = \sum_{\ell \ne \varnothing} P(p + \ell | x)$$

With these in mind, we can proceed to implement the search algorithm described earlier. First, we must initialize all our values. We know that for any $t$, we can compute the probability of seeing nothing by that $t$:

$$\begin{align*} \gamma_t(\varnothing_n) &= 0 \\ \gamma_t(\varnothing_b) &= \prod_{i=1}^t y_i(b) \end{align*}$$

To understand the first of these, note that it's impossible that we see nothing when the path doesn't end in a blank (since then we'd have seen that non-blank). For the second, note that in order to see nothing and have it end in a blank, we simply multiply the probabilities of seeing blanks at every timestep up to $t$.

Next, initialize our set of prefixes to $P = \{\varnothing\}$, the set containing only the empty prefix. Let $\ell^*$ be the growing output labeling, and let $p^*$ be the current prefix we're looking at. Initialize both of these to the empty string as well.

We now begin iteratively growing our prefixes, extending them by one character at a time. If $p^*$ is the current best prefix, all the prefixes we wish to consider next are of the form $p' = p^* + k$, where $k$ is a character in our alphabet. For each, we wish to compute the probability that this prefix is the entire labeling, $P(p'|x) = P(p^*+k|x)$, as well as the probability that the best labeling starts with this prefix, $P(p'\ldots|x)$. We can compute these using a dynamic programming algorithm with $\gamma_t(x)$, by starting with initial values at $\gamma_1(x)$ and building up through time.

Thus, initialize $\gamma_1(p'_n)$ and $\gamma_1(p'_b)$. Intuitively, $\gamma_1(p'_b)$ must be zero, because $p'$ does not end in a blank (it ends in $k$) and if we output a blank at the first timestep we clearly could not have seen $p'$. On the other hand, $\gamma_1(p'_n)$ can be non-zero if $p'$ consists only of $k$, in which case $\gamma_1(p'_n) = y_1(k)$.

We now have values of $\gamma$ at $t = 1$. In addition, let us define the probability that the last character in $p'$ first appears at time $t$. Since the last character is $k$, this is the probability that seeing a $k$ at time $t$ would indeed be a new character:

$$\text{new}(t) = \gamma_{t-1}(p^*_b) + \begin{cases}0 & \text{if $p^*$ ends in $k$}\\ \gamma_{t-1}(p^*_n) & \text{otherwise}\end{cases}$$

This equation says that we have two ways in which seeing a $k$ can yield a new character: either the previous timestep we had a sequence which ended in a blank, or we had a sequence which ended in a non-$k$ character.

With this value, we can compute $\gamma$ for time $t$ if we have computed it for times less than $t$:

$$\begin{align*} \gamma_t(p'_n) &= y_t(k) \big(\text{new}(t) + \gamma_{t-1}(p'_n)\big) \\ \gamma_t(p'_b) &= y_t(b) \big(\gamma_{t-1}(p'_b) + \gamma_{t-1}(p'_n)\big) \end{align*}$$

These equations are fairly intuitive. For the first one, to end in a non-blank and generate $p'$, we must end with $k$ (thus the $y_t(k)$); in order to have generated $p'$, $k$ must either be a new label or we must have already generated all of $p'$ (with $k$ included) at the previous timestep (thus the $\gamma_{t-1}(p'_n)$). For the second equation, to generate $p'$ and end in a blank, we must clearly have a blank as the last character, and at the previous timestep we must have generated all of $p'$ (and ended with blank or non-blank).

With $\gamma$ fully computed up through the sequence length $T$, we can now compute the quantities we are interested in, namely $P(p'\ldots|x)$ and $P(p'|x)$. As we said before,

$$P(p'|x) = \gamma_T(p'_n) + \gamma_T(p'_b)$$

We now have values for $\gamma$, so we can compute this numerically.

In order to compute $P(p'\ldots|x)$, we simply compute the probability of seeing $p'$ as a prefix and subtract the probability of seeing it as the entire label. The probability of seeing it as the entire label is the value we just computed, $P(p'|x)$. The probability of seeing $p'$ as a prefix can be written as

$$P(p' \text{ is a prefix }|x) = \gamma_1(p'_n) + \sum_{t=2}^T y_t(k) \cdot \text{new}(t),$$

because we simply consider separately the possibilities that we see $p'$ as a prefix for the first time at every time $t$. (For $t=1$, that's just $\gamma_1(p'_n)$, since there's no way to have seen $p'$ as a prefix before; for other $t$, it's exactly the probability of seeing $k$ times the probability new$(t)$ that we're seeing a new character being generated, which we computed before.) Thus, we can write $P(p'\ldots|x)$ as

$$P(p'\ldots|x) = \gamma_1(p'_n) + \sum_{t=2}^T y_t(k) \cdot \text{new}(t) - P(p'|x).$$

Now we have the probability of each extended prefix and the probability of ending with each one. When you compute these for a prefix, do the following:

  • If our prefix is a better labeling than $\ell^*$ ($P(p'|x) > P(\ell^*|x)$), update our best labeling $\ell^*$.
  • If the probability of starting with $p'$ is higher than our labeling ($P(p'\ldots|x) > P(\ell^*|x)$), add it to the list of prefixes we're considering $P$.

After looking at each of the extensions of $p^*$:

  • Get rid of $p^*$ from the list of prefixes we're considering.
  • Update $p^*$ by choosing it to be the prefix that maximizes $P(p^*\ldots|x)$ (the prefix that the labeling is most likely to start with).

We wish to continue growing our prefix until the current estimate of the best labeling has higher probability than any of our other options from the best prefix. The probability of extending the best prefix is just $P(p^*\ldots|x)$, and the best labeling has probability $P(\ell^*|x)$, so we iterate until $P(p^*\ldots|x) < P(\ell^*|x)$. After each step, we have these values, so we can easily test for termination. Once we're done, $\ell^*$ will contain our best labeling, and decoding will be complete.
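Putting the pieces of this section together, the search can be sketched in Python roughly as follows. This is an illustrative implementation under several assumptions: the outputs are a `(T, K)` NumPy array with the blank at index 0, the $\gamma$ tables are stored per prefix in a dictionary, and none of the chunking heuristics are applied, so it is practical only for short sequences.

```python
import numpy as np

def prefix_search_decode(y, blank=0):
    T, K = y.shape
    letters = [k for k in range(K) if k != blank]
    # gamma[p] = (gamma_n, gamma_b); entry t-1 of each array holds gamma_t.
    empty_b = np.cumprod(y[:, blank])          # gamma_t(empty_b): blanks up to t
    gamma = {(): (np.zeros(T), empty_b)}       # gamma_t(empty_n) = 0
    p_full = {(): empty_b[-1]}                 # P(p | x)
    p_ext = {(): 1.0 - empty_b[-1]}            # P(p... | x)
    best, best_prob = (), p_full[()]           # l* and P(l* | x)
    candidates = {()}                          # the set P of live prefixes
    p_star = ()
    while p_ext[p_star] > best_prob:
        star_n, star_b = gamma[p_star]
        for k in letters:
            p = p_star + (k,)
            g_n, g_b = np.zeros(T), np.zeros(T)
            # gamma_1(p'_n) is non-zero only if p' consists of k alone.
            g_n[0] = y[0, k] if not p_star else 0.0
            prefix_prob = g_n[0]
            for t in range(1, T):
                # new(t): probability that a k at this step starts a new character.
                same = bool(p_star) and p_star[-1] == k
                new = star_b[t - 1] + (0.0 if same else star_n[t - 1])
                g_n[t] = y[t, k] * (new + g_n[t - 1])
                g_b[t] = y[t, blank] * (g_b[t - 1] + g_n[t - 1])
                prefix_prob += y[t, k] * new
            gamma[p] = (g_n, g_b)
            p_full[p] = g_n[-1] + g_b[-1]
            p_ext[p] = prefix_prob - p_full[p]
            if p_full[p] > best_prob:
                best, best_prob = p, p_full[p]
            if p_ext[p] > best_prob:
                candidates.add(p)
        candidates.discard(p_star)
        if not candidates:
            break
        # Next prefix to grow: the one the labeling is most likely to start with.
        p_star = max(candidates, key=p_ext.get)
    return list(best), float(best_prob)

y = np.array([[0.6, 0.4], [0.6, 0.4]])
print(prefix_search_decode(y))
```

On this two-step example the most active output at each timestep is the blank, yet the search returns the label [1] with probability about 0.64, since the three paths that collapse to [1] together outweigh the all-blank path (0.36).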

This entire algorithm is summarized in the following graphic, taken from Alex Graves' dissertation:

Formulating an Objective

Now that we have defined the probability distribution used in CTC networks as well as figured out how to decode the CTC network output, we are left with the question of how we can train our CTC networks. In order to train our network, we need an objective function, which we can then minimize via some standard minimization algorithm (such as gradient descent or Hessian-free optimization). In this section, we derive this objective function.

The objective function is based off maximum likelihood; minimizing the objective function maximizes the (log) likelihood of observing our desired label. We begin by deriving this likelihood function; namely, we wish to compute $P(\ell|x)$ where $\ell$ is the label and $x$ is the input sequence.

Naively computing this is computationally intractable, as demonstrated via the following equation (which we came up with above):

$$P(\ell | x) = \sum_{\large\text{label}(\pi) = \ell} P(\pi | x) = \sum_{\text{label}(\pi) = \ell} \prod_{t=1}^T y_t(\pi_t)$$

However, we can compute this efficiently via a dynamic programming algorithm similar to the one we used to do decoding. This algorithm, however, has a forward and backward pass. The forward pass computes probabilities of prefixes, and the backward pass computes probabilities of suffixes.

The maximum likelihood function works by probabilistically matching elements of the label sequence with elements of the output sequence. We know that the output sequence will have many blanks; in particular, we expect that there will very often be a blank between successive letters. To simplify our matching, we can account for this by adjusting our mental model of the label we're matching. Thus, instead of considering a label $\ell$, we consider a modified label $\ell'$, which is just $\ell$ with blanks inserted between all letters, as well as at the beginning and end. This way, if the network outputs blanks between its letters, they will correspond to existing blanks between the letters in the label. Since we have a blank between each pair of letters and at the beginning and end, the length of the new sequence is $|\ell'| = 2|\ell| + 1$ (if $|\ell|$ is the length of the original sequence).

Let $\ell'_{1:q}$ be the substring of $\ell'$ that starts at element 1 and ends at element $q$ (such that $\ell'_{1:0}$ is the empty string). Then, let $\alpha_t(s)$ be the probability that the prefix $\ell'_{1:s}$ is observed by time $t$. We can write this probability by summing over all paths $\pi$ that contain $\ell'_{1:s}$ in their first $t$ elements (label$(\pi_{1:t}) = \ell'_{1:s}$):

$$\alpha_t(s) = \sum_{\large\text{label}(\pi_{1:t}) = \ell'_{1:s}} P(\pi | x)= \sum_{\large\text{label}(\pi_{1:t}) = \ell'_{1:s}} \prod_{j=1}^t y_j(\pi_j)$$

Note that the probability of $\ell$ is the combined probability of $\ell'$ with and without the last blank:

$$P(\ell|x) = \alpha_T(|\ell'|) + \alpha_T(|\ell'|-1).$$

We compute $\alpha$ via the following dynamic programming algorithm. We start by initializing $\alpha$ for time $t = 0$:

$$\begin{align*} \alpha_0(0) &= 1 \\ \alpha_0(i) &= 0 \text{ (for $i > 0$)} \end{align*}$$

Before any data is presented we cannot predict any elements of the label, and thus the probability of the empty string must be one, and the probability of a non-empty string is zero.

This forms a base case. Next, we can compute any other $\alpha_t(s)$ via the following recursive relations (where $b$ is a blank):

$$\alpha_t(s) = \begin{cases} y_t(\ell'(s)) \cdot (\alpha_{t-1}(s) + \alpha_{t-1}(s-1)) & \text{if } \ell'(s) = b \text{ or } \ell'(s - 2) = \ell'(s)\\ y_t(\ell'(s)) \cdot (\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)) & \text{otherwise} \end{cases}$$

These relations may initially seem fairly cryptic, so let us look at each one in turn.

  • Suppose $\ell'(s) = b$: the last letter in our prefix is a blank. In this case, we have two ways in which we will have seen this prefix by time $t$. First, we could have seen the entire prefix by time $t-1$, followed by seeing a blank (which does nothing). This probability is just $y_t(\ell'(s)) \cdot \alpha_{t-1}(s)$, where $y_t(\ell'(s))$ corresponds to seeing the blank and $\alpha_{t-1}(s)$ corresponds to seeing the entire prefix by time $t-1$. The other way in which we can see this prefix by time $t$ is if we see everything but the last blank by time $t-1$ and we see a blank at time $t$; for this, the probability is $y_t(\ell'(s)) \alpha_{t-1}(s-1)$.
  • Suppose $\ell'(s) = \ell'(s-2)$: our original sequence $\ell$ has two repeated letters, and we stuck a blank in between them. Once more, we have two ways in which we can get our full prefix $\ell'_{1:s}$ by time $t$. First, if we have already seen the prefix by time $t-1$, we can just see another $\ell'(s)$ and the repeated $\ell'(s)$ will just be removed; the probability of seeing that repeated $\ell'(s)$ is $y_t(\ell'(s)) \alpha_{t-1}(s)$. Second, if we have seen everything but the last letter, then we must see the last letter at time $t$; this probability is $y_t(\ell'(s)) \alpha_{t-1}(s-1)$. Note that the sum of this case is identical to that of the previous case.
  • Finally, suppose we have non-blank $\ell'(s)$ which is distinct from the previous non-blank, $\ell'(s-2)$. In that case, we have the same options as before, except we have a third which corresponds to outputting $\ell'(s)$ immediately after $\ell'(s-2)$ (with no intervening blank). That will happen if we have seen $\ell'(s-2)$ by time $t-1$ and immediately see $\ell'(s)$ afterwards; this probability is $y_t(\ell'(s))\alpha_{t-1}(s-2)$, and forms the last term in the second case.

Note also that for any $s$ that is smaller than $|\ell'| - 2(T-t)-1$, we do not have enough time steps to complete the rest of the sequence, so $\alpha_t(s) = 0$. Thus, we can compute $\alpha_t(s)$ for any $t$, $s$, and $\ell'$; we can also thus compute $P(\ell|x)$ for any $\ell$ and $x$.
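The forward recursion translates fairly directly into code. The sketch below is one possible implementation, assuming the outputs are a `(T, K)` NumPy array with the blank at index 0; the function name is made up for this example. Index $s = 0$ of the table stands for the empty prefix, matching the base case at $t = 0$. The early-zero shortcut for small $s$ is omitted, which costs a little extra work but does not change the result, since those entries can never reach the last two positions by time $T$.

```python
import numpy as np

def ctc_label_probability(y, label, blank=0):
    """Compute P(label | x) with the forward (alpha) recursion."""
    T = y.shape[0]
    # Modified label l': blanks between letters and at both ends, length 2|l| + 1.
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)
    # alpha[t, s] for t = 0..T and s = 0..S, where s = 0 is the empty prefix.
    alpha = np.zeros((T + 1, S + 1))
    alpha[0, 0] = 1.0
    for t in range(1, T + 1):
        for s in range(1, S + 1):
            sym = ext[s - 1]                    # l'(s), using 1-based s
            total = alpha[t - 1, s] + alpha[t - 1, s - 1]
            # Third term only for a non-blank that differs from l'(s - 2).
            if sym != blank and s >= 2 and (s == 2 or ext[s - 3] != sym):
                total += alpha[t - 1, s - 2]
            alpha[t, s] = y[t - 1, sym] * total
    # With and without the trailing blank.
    return float(alpha[T, S] + alpha[T, S - 1])

y = np.array([
    [0.05, 0.90, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.05, 0.90],
])
print(ctc_label_probability(y, [1, 2]))
print(-np.log(ctc_label_probability(y, [1, 2])))  # this sample's negative log-likelihood
```

For this input, five paths collapse to the label [1, 2], with probabilities 0.729, 0.0405, 0.0405, 0.00225, and 0.00225, so the first printed value should be about 0.8145.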

We can now formulate our objective function. Given the dataset $S = \{(x, \ell)\}$ of training samples where $x$ is the input and $\ell$ is the desired output, we wish to maximize the log-likelihood (log probability) of each training sample, which corresponds to minimizing the following objective function:

$$\mathcal{O}(S) = -\sum_{(x, \ell) \in S} \ln P(\ell|x).$$

Training the Network

Now that we have an objective function, we can devise a training algorithm to minimize it. As we'll see, this is where the backwards pass of our forward-backward algorithm comes into play. We minimize it by taking the gradient with respect to the weights, at which point we can use gradient descent.

The difficulty in this minimization comes from the fact that we need to compute the derivative with respect to the neural network outputs $y_t$, since our objective is a fairly complicated function of these outputs. Once we have the derivatives with respect to the neural network outputs $y_t$, we can use standard neural network backpropagation to compute the derivatives with respect to the weights. Note also that since all training samples are independent, we will just compute our derivatives for a single training sample; simply sum over all training samples to deal with the entire dataset.

In order to compute our gradients, we are going to need our set of backwards variables. Thus, let $\beta_t(s)$ be the probability that $\ell'_{s:|\ell'|}$ is observed starting at time $t$; that is, the probability that the $|\ell'|-s$-length suffix of $\ell'$ is seen starting at time $t$. If $\pi$ are paths, we define these variables as the sum of the path probabilities over all paths with the desired suffix:

$$\beta_t(s) = \sum_{\large\text{label}(\pi_{t:T}) = \ell'_{s:|\ell'|}} \prod_{i=t}^T y_i(\pi(i))$$

As before, we have some fairly convenient initializations, this time at time $t = T$:

$$\begin{align*} \beta_T(|\ell'|) &= y_T(b) \\ \beta_T(|\ell'| - 1) &= y_T(\ell_{|\ell|}) \\ \end{align*}$$

The above equations simply state that the probability of seeing the last character as a suffix at time $T$ is the probability that the network outputs that character at time $T$ (with or without the blank trailing $\ell'$). We also know that it's impossible to see a two or more character suffix if we're only looking at the last time output:

$$\beta_T(s) = 0 \text{ for all } s < |\ell'| - 1.$$

Next, we define the recursive relations that allow us to compute $\beta_t(s)$ for any other $s$ and $t$. Unsurprisingly, they look like backwards versions of the $\alpha_t(s)$ relations:

$$\beta_t(s) = \begin{cases} y_t(\ell'(s))\cdot (\beta_{t+1}(s) + \beta_{t+1}(s+1)) & \text{ if $\ell'(s) = b$ or $\ell'(s) = \ell'(s+2)$} \\ y_t(\ell'(s))\cdot (\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)) & \text{ otherwise} \\ \end{cases}$$

The reasoning from before carries over as well, as long as you keep in mind that $\beta_t(s)$ is the probability of seeing the $|\ell'|-s$ length suffix of $\ell'$ starting at time $t$.
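The backward recursion mirrors the forward one and can be sketched the same way. As an illustration only: the function name, the blank index 0, and the 0-based array layout are assumptions introduced here, not part of the original derivation.

```python
import numpy as np

def ctc_backward(y, label, blank=0):
    """Return beta for network outputs y of shape (T, K).

    Arrays are 0-based, so beta[t, s] corresponds to beta_{t+1}(s+1).
    """
    # Build l' by placing a blank before, between, and after the labels.
    ext = [blank]
    for c in label:
        ext += [c, blank]
    T, S = y.shape[0], len(ext)
    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, ext[S - 1]]      # end on the trailing blank ...
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, ext[S - 2]]  # ... or on the last label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            total = beta[t + 1, s]                 # stay on the same symbol
            if s + 1 < S:
                total += beta[t + 1, s + 1]        # advance by one symbol
            # Skipping the next blank is allowed only from a non-blank symbol
            # that differs from the next non-blank symbol.
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                total += beta[t + 1, s + 2]
            beta[t, s] = y[t, ext[s]] * total
    return beta
```

Because every valid path starts on the leading blank or on the first label, `beta[0, 0] + beta[0, 1]` gives the same $P(\ell|x)$ that the forward pass produces, which is a convenient sanity check.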

Now, we have $\alpha_t(s)$, the probability of $s$-length prefix being seen at time $t$, as well as $\beta_t(s)$, the probability of the $|\ell'|-s$ length suffix being seen at time $t$. The next key insight is that if we have both $\beta_t(s)$ and $\alpha_t(s)$, we are observing symbol $s$ at time $t$, because if we were observing anything else, either the suffix or the prefix would not be observed exactly at time $t$. Thus, $\alpha_t(s)\beta_t(s)$ is the probability of all paths corresponding to $\ell$ that go through symbol $s$ at time $t$. Specifically, recall the definitions of $\alpha$ and $\beta$:

$$\begin{align*} \alpha_t(s) &= \sum_{\large\text{label}(\pi_{1:t}) = \ell'_{1:s}} \prod_{j=1}^t y_j(\pi_j)\\ \beta_t(s) &= \sum_{\large\text{label}(\pi_{t:T}) = \ell'_{s:|\ell'|}} \prod_{i=t}^T y_i(\pi(i)) \end{align*}$$

Now, consider the terms of the product $\alpha_t(s)\beta_t(s)$. Each of these is a term from the $\alpha_t(s)$ times a term from the $\beta_t(s)$ sum. Since each term is for a distinct prefix or suffix, the cross product of these two sets yields all possible prefixes and suffixes. The only constraint is that the prefixes and suffixes end and start with symbol $s$ at time $t$; that is, that the paths have $s$ emitted at time $t$. When we multiply two terms (both of which are products over the path) from the two sums, the resulting term is also just a product over the path. Since $\alpha_t(s)$ contributes a product over the prefix and $\beta_t(s)$ contributes a product over the suffix, the result is a product over the entire path. Note, however, that since both the suffix and prefix include $s$, we have to avoid double counting it, so we divide by $y_t(\ell'(s))$. This yields the equation

$$\frac{\alpha_t(s)\beta_t(s)}{y_t(\ell'(s))} = \sum_{\substack{\large\text{label}(\pi) = \ell\\\large\pi(t) = \ell'(s)}} \prod_{i=1}^T y_i(\pi(i))= \sum_{\substack{\large\text{label}(\pi) = \ell\\\large\pi(t) = \ell'(s)}} P(\pi|x)$$

In the equation above, we are summing over all paths that have symbol $s$ at time $t$. We know that some symbol from the path must exist at time $t$, though. Thus, the total probability of $\ell$ is the sum, over all $s$, of the probabilities that symbol $s$ appears at time $t$. Thus, we can write that

$$P(\ell|x) = \sum_{s=1}^{|\ell'|} \frac{\alpha_t(s)\beta_t(s)}{y_t(\ell'(s))}.$$

Note that this is valid for any $t$ from 1 to $T$. Thus, we can differentiate this probability with respect to $y_t(k)$ for any character $k$ in the alphabet (or blank) and time $t$. Note though that a sequence $\ell'$ may have many instances of $k$, so let $\text{loc}(\ell', k)$ be the set of locations $s$ such that $\ell'(s) = k$. Then, we can write the derivative as

$$\frac{\partial}{\partial y_t(k)} P(\ell|x) = \frac{1}{y_t(k)^2} \sum_{\large s \in \text{loc}(\ell', k)} \alpha_t(s)\beta_t(s)$$

Recall that the final objective function is actually the natural log of the probability. However, we know that

$$\frac{\partial}{\partial y_t(k)} \ln P(\ell|x) = \frac{1}{P(\ell|x)}\frac{\partial}{\partial y_t(k)} P(\ell|x)$$

and that the probability itself may be written as

$$P(\ell|x) = \alpha_T(|\ell'|)+\alpha_T(|\ell'|-1),$$

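which leads us to the final gradient of the log-likelihood for the CTC network (the gradient of the objective $\mathcal{O}$ is its negative, summed over training samples):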

$$\frac{\partial}{\partial y_t(k)} \ln P(\ell|x) = \frac{1}{y_t(k)^2\cdot(\alpha_T(|\ell'|)+\alpha_T(|\ell'|-1))} \sum_{\large s \in \text{loc}(\ell', k)} \alpha_t(s)\beta_t(s)$$
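The gradient formula can be checked numerically. The sketch below recomputes $\alpha$ and $\beta$ in one helper and then applies the formula for every $t$ and $k$. The function names, the blank index 0, and the 0-based layout are assumptions made for illustration; the derivative treats each $y_t(k)$ as an independent variable, as in the derivation above, and the label is assumed to be non-empty.

```python
import numpy as np

def ctc_alpha_beta(y, label, blank=0):
    """Return (l', alpha, beta) for outputs y of shape (T, K), non-empty label."""
    ext = [blank]
    for c in label:
        ext += [c, blank]
    T, S = y.shape[0], len(ext)

    def can_skip(a, b):
        # Moving from position a to b = a + 2 is allowed only when the symbol
        # at b is a non-blank that differs from the symbol at a.
        return ext[b] != blank and ext[b] != ext[a]

    alpha = np.zeros((T, S))
    alpha[0, 0], alpha[0, 1] = y[0, ext[0]], y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            if s >= 2 and can_skip(s - 2, s):
                total += alpha[t - 1, s - 2]
            alpha[t, s] = y[t, ext[s]] * total

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, ext[S - 1]]
    beta[T - 1, S - 2] = y[T - 1, ext[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            total = beta[t + 1, s]
            if s + 1 < S:
                total += beta[t + 1, s + 1]
            if s + 2 < S and can_skip(s, s + 2):
                total += beta[t + 1, s + 2]
            beta[t, s] = y[t, ext[s]] * total
    return ext, alpha, beta

def ctc_log_likelihood_grad(y, label, blank=0):
    """Return (P(l|x), d ln P(l|x) / d y) with the gradient shaped like y."""
    ext, alpha, beta = ctc_alpha_beta(y, label, blank)
    prob = alpha[-1, -1] + alpha[-1, -2]
    grad = np.zeros_like(y, dtype=float)
    # Accumulate alpha_t(s) * beta_t(s) over every position s in loc(l', k).
    for s, k in enumerate(ext):
        grad[:, k] += alpha[:, s] * beta[:, s]
    grad /= (y ** 2) * prob
    return prob, grad
```

Entries of the gradient for symbols that never occur in $\ell'$ stay at zero, since $\text{loc}(\ell', k)$ is empty for them. Comparing the result against central finite differences of $\ln P(\ell|x)$ is a simple way to validate an implementation like this.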

This concludes our analysis of connectionist temporal classification (CTC) networks; the details may be accessed in this paper and Alex Graves' dissertation, both of which address several other issues which arise in practice with CTC networks, and include experimental findings related to their use.

Training a Linguistic Model

The connectionist temporal classification model we described above does a good job as an acoustic model; that is, it can be trained to predict the output phonemes based on the input sound data. However, it does not account for the fact that the output is actually human language, and not just a stream of phonemes. We can augment the acoustic model with a "linguistic" model, one that depends solely on the character stream, and not on the sound data.

This second model is also an RNN; the combined system is known as an RNN transducer. A full account of it may be viewed in this paper.

Using the same architecture as we defined in the first section (before we looked into CTC networks), we train an RNN to do one-step prediction. Namely, if we have a data sequence $d = (g_1, g_2, \ldots, g_k)$ of characters, we train our neural network to predict $d$ from an input sequence $(b, g_1, g_2, \ldots, g_{k-1})$, where $b$ is a blank character (encoded the same as in our CTC networks).

Now we have two models - one RNN that does character-level prediction, and one that does sound-based prediction. If $f_t$ is the output of the acoustic model at time $t$ and $g_u$ is the output of the character-based model at character $u$, we can combine these into a single function $h(t, u)$:

$$h(t, u) = \exp(f_t + g_u).$$

Note that $f_t$ and $g_u$ are both vectors, and the exponentiation and addition are done elementwise. The length of the vector is dependent on the number of characters in the alphabet, with potentially an extra space for the blank.

From $h$, we can create a normalized probability distribution for observing character or blank $k$ at time $t$ and location $u$:

$$P(k | t, u) = \frac{h(t, u)_k}{\sum_{j=1}^K h(t, u)_j},$$ where $K$ is the number of elements in $h(t, u)$.
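As a small illustration, the normalized distribution is a softmax over the elementwise sum $f_t + g_u$. The function name below is hypothetical, and subtracting the maximum before exponentiating is a numerical-stability step added here; it does not change the normalized result.

```python
import numpy as np

def joint_distribution(f_t, g_u):
    """Return P(k | t, u) for acoustic output f_t and character output g_u."""
    logits = np.asarray(f_t, dtype=float) + np.asarray(g_u, dtype=float)
    # exp(f_t + g_u), shifted by the maximum for numerical stability.
    h = np.exp(logits - logits.max())
    return h / h.sum()
```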

Using this probability distribution, we can define the functions $\ell(t, u)$ and $b(t, u)$ as the probability of outputting the $u+1$th element of $\ell$ and the probability of outputting a blank:

$$\begin{align*} \ell(t, u) &= P(\ell(u+1)|t, u) \\ b(t, u) &= P(b|t, u) \end{align*}$$

Note that these functions are effectively predicting the next character emitted. With these, we can redefine our forward and backward variable $\alpha_t(u)$ and $\beta_t(u)$ relations as follows:

$$\begin{align*} \alpha_t(u) &= \alpha_{t-1}(u)b(t-1, u) + \alpha_t(u-1)\ell(t, u-1) \\ \beta_t(u) &= \beta_{t+1}(u)b(t, u) + \beta_t(u+1)\ell(t, u) \end{align*}$$
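These two recursions fill a $T \times (U+1)$ lattice, where $U = |\ell|$. The sketch below is illustrative only: the function name and array layout are assumptions, and the boundary conditions ($\alpha_1(0) = 1$ and $\beta_T(U) = b(T, U)$) are taken from the standard RNN transducer formulation rather than from the text above.

```python
import numpy as np

def transducer_forward_backward(label_prob, blank_prob):
    """Fill the transducer lattice.

    label_prob[t, u] holds l(t+1, u), shape (T, U).
    blank_prob[t, u] holds b(t+1, u), shape (T, U + 1).
    Returns (alpha, beta), each of shape (T, U + 1), with 0-based time.
    """
    T, U = blank_prob.shape[0], blank_prob.shape[1] - 1

    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0                      # assumed boundary condition
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            total = 0.0
            if t > 0:                      # arrive by emitting a blank
                total += alpha[t - 1, u] * blank_prob[t - 1, u]
            if u > 0:                      # arrive by emitting label u
                total += alpha[t, u - 1] * label_prob[t, u - 1]
            alpha[t, u] = total

    beta = np.zeros((T, U + 1))
    beta[T - 1, U] = blank_prob[T - 1, U]  # assumed boundary condition
    for t in range(T - 1, -1, -1):
        for u in range(U, -1, -1):
            if t == T - 1 and u == U:
                continue
            total = 0.0
            if t < T - 1:                  # leave by emitting a blank
                total += beta[t + 1, u] * blank_prob[t, u]
            if u < U:                      # leave by emitting label u + 1
                total += beta[t, u + 1] * label_prob[t, u]
            beta[t, u] = total
    return alpha, beta
```

Under these boundary conditions, the total sequence probability can be read off either end of the lattice: `alpha[T - 1, U] * blank_prob[T - 1, U]` and `beta[0, 0]` should agree.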

These have a similar justification as in the previous section. (Note that we did not use $\ell'$ in this explanation, though.) Next, we proceed through the rest of the CTC algorithm in a similarly motivated way. Decoding, however, must be done with a beam search, which again is documented in the original paper.

Minor Modifications

Finally, we have all the components we need to create our final network. Our final network greatly resembles the RNN transducer network we discussed above. While that is the standard formulation, Graves et al. propose a modification. Note that the function

$$h(t, u) = \exp(f_t + g_u)$$

effectively multiplies the softmax outputs of $f_t$ and $g_u$. Instead, Graves et al. propose simply feeding the hidden layers that feed into $f_t$ and $g_u$ to another single-hidden-layer neural network, which computes $h(t, u)$. They find that this decreases deletion errors during speech recognition.
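A rough sketch of such a joint network follows. Everything concrete in it is an assumption made for illustration: the function and parameter names, concatenating the two hidden vectors, the tanh activation, and the final softmax normalization are choices made here, not details confirmed by the text.

```python
import numpy as np

def joint_network(h_acoustic, h_linguistic, W1, b1, W2, b2):
    """Combine two hidden vectors through one hidden layer into P(k | t, u).

    W1 has shape (H, len(h_acoustic) + len(h_linguistic)); W2 has shape (K, H).
    """
    joint_input = np.concatenate([h_acoustic, h_linguistic])
    hidden = np.tanh(W1 @ joint_input + b1)   # the single hidden layer
    logits = W2 @ hidden + b2
    z = np.exp(logits - logits.max())         # stable softmax over K outputs
    return z / z.sum()
```

The difference from the additive formulation is that the interaction between the acoustic and linguistic representations is learned by the hidden layer instead of being fixed to a product of two softmax-like terms.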

In this article, we've gone over a ton of material. Motivated by the Graves et al. paper, we looked at many ways to augment standard recurrent neural networks and apply them to speech recognition. We used Long Short-Term Memory (LSTM) units in deep (multi-hidden-layer) bidirectional recurrent neural networks (BRNNs) as our base architecture. We worked through an explanation of connectionist temporal classification (CTC) networks, a method via which we can train RNNs to work with unaligned data sequences. We worked through two possible decoding algorithms for standard CTC networks and derived the objective function as well as the way in which we can compute the gradient in order to train our networks. We looked at RNN transducers, an approach used to augment the CTC network with a linguistic model (or any model that just models output-output relationships). (Note that we skipped over a number of things related to decoding data from the RNN transducer network.)

In summary, neural networks can be really darn complicated.

Neural Text to Speech (TTS): Making Voice Experiences More Human

If you’ve ever confused a bot’s speech with a human caller, you’ve probably experienced neural text to speech. But do you know how it can help your brand?

Text-to-speech (TTS) technology is changing the way we interact with our machines. It speaks to us from our smart speakers, virtual assistants, and voice bots. Combined with a range of smart technologies—automatic speech recognition (ASR), natural language understanding (NLU), dialog management, and natural language generation (NLG), at minimum—text to speech lets us issue commands and get responses entirely through speech. The resulting voice user interfaces turn computing into a more human experience.

But to really transform personal computers into personable computers, robotic TTS voices won’t do. Thankfully, artificial intelligence (AI) allows us to create synthetic speech that’s barely discernible from the real thing. This AI-powered TTS is called neural text to speech. It’s how the ReadSpeaker VoiceLab crafts custom synthetic voices for brands and creators. And thanks to AI, neural text to speech is more natural, expressive, and welcoming than ever.

If you’ve ever mistaken machine-generated speech for a human speaker, neural TTS is probably the reason why. Here’s what it is, what it can do, and why it’s important for your business.

What Makes Text to Speech “Neural?”

In a nutshell, neural text to speech is a form of machine speech built with neural networks. A neural network is a type of computer architecture modeled on the human brain. Your brain processes data through unbelievably complex webs of electrochemical connections between nerve cells, or neurons. As these connective pathways develop through repetition, they require less effort to activate. We call that “learning.”

Neural networks loosely mimic this action. They’re clusters of processing units—artificial neurons—that classify input data and transmit it to other artificial neurons. By setting parameters for desired results, then processing large datasets, neural networks learn to map optimal paths from neuron to neuron, input to output. Unlike traditional computing, you don’t write the rules for a neural network; there’s no “If A, then B.” Rather, the network derives the rules from the training data. It’s a form of machine learning that’s been applied to everything from image recognition to picking winning stocks.

But not all neural networks are deep neural networks (DNN), the technology ReadSpeaker’s VoiceLab uses to produce more lifelike machine speech. We call a neural network “deep” when it consists of three or more processing layers:

The input layer initially classifies data, passing it through one or more “hidden” layers. These hidden layers further refine the signal, sorting it into more and more complex classifications. Finally, the output layer produces the final result: Labeling an image correctly, for instance, or predicting a stock fluctuation—or producing an audio signal that sounds uncannily like human speech.

Neural Text to Speech Models: Duration, Pitch, and Acoustic Predictions

To create a neural TTS voice, we train DNN models on recordings of human speech. The resulting synthetic voice will sound like the input data—the source speaker—which is why we often call neural TTS voice cloning. But it takes multiple DNNs working in concert to pull off this imitation act. In fact, neural TTS voices require at least three distinct DNN models, which combine to create the overall voice reproduction:

  • The acoustic model reproduces the timbre of the speaker’s voice, the color or texture that listeners identify as belonging to that speaker.
  • The pitch model predicts the range of tones in the speech—not just how high or low the TTS voice will be, but also the variance in tone from one phoneme to the next.
  • The duration model predicts how long the voice should hold each phoneme. It helps the TTS engine pronounce the word “feet” rather than “fffeet,” for instance.

The pitch and duration models predict what are called prosodic parameters, because they determine prosody, or non-phonetic properties of speech like intonation, rhythm, and breaks. Meanwhile, the acoustic model predicts acoustic parameters that capture information about the speaker’s voice timbre and the phonetic properties of speech. Today, we can combine these models for increasingly lifelike TTS voices with faster production times, and that’s just one of the capabilities DNNs bring to the field of machine speech.

New Possibilities for Neural Text to Speech Technology

The most obvious advantage of neural TTS is that it sounds better. In a 2016 study, participants rated DNN-based TTS systems as more natural than other types of TTS. A 2019 review of DNN-based TTS says that deep learning makes “the quality of the synthesized speech better than the traditional methods.” And DNN technology has only improved since 2019. But neural text to speech is also leading to unexpected TTS-production techniques that simultaneously reduce costs and improve quality.

That’s important for brands. Text to speech allows you to engage consumers through voice-first channels like smart speakers, virtual assistants, and interactive voice response (IVR) systems. Here are a few ways DNN-based TTS makes these experiences better for brands and consumers alike.

Prosody Transfer

Say you like the sound of one TTS voice but the speaking style of another. Prosody transfer makes it possible to get the best of both. As long as the two voices are compatible—meaning they’re in the same language, and they aren’t too far apart in pitch range—we can combine the prosody from one voice with the sound of another. For brands, prosody transfer makes it possible to give a custom branded TTS voice more expressive range—without starting from scratch for each new speaking style.

Speaker-Adapted Models

An advanced machine learning technique called transfer learning reduces the amount of training data required to produce a new neural TTS voice. Large datasets from existing TTS voices fill in the learning gaps left by shorter new voice recordings. While a few hours of voice recordings are always ideal for training voice models, speaker-adaptation allows us to emulate a new voice even when only shorter recordings are available. In other words, we can train these multi-voice models faster, with less original training data, and still produce lifelike TTS voices. This will help drive down costs and expand access to original, branded text-to-speech personas.

“Emotional” Speaking Styles

Training data determines the sound of every TTS voice. If you record three hours of someone speaking angrily, with large pitch variances and high intensity, you’ll end up with an “angry” TTS voice. With traditional text to speech, you needed a good 25 hours of recorded data to produce a decent voice—and that voice had to be relatively neutral in expression. With DNN models, you can get terrific results by training models on just a few hours of recorded speech—and even less with speaker adaptation.

These advances allow the ReadSpeaker VoiceLab to record three to five hours of a single speaking style or affective mood, then another hour or so of the same speaker performing in styles that suggest different moods. (Lacking these recordings, you could always find a more expressive TTS voice and use prosody transfer to mimic the performance.) That allows us to create voices with emotional variation, adjustable at the point of production through our proprietary voice-text markup language (VTML). So you can produce an enthusiastic TTS message, and another apologetic statement, all with the same, recognizable TTS voice, and all through the same TTS production engine.

Combine this capability with conversational AI to create automated chatbots, IVR systems, and virtual assistants that adjust speaking tone to match the mood of the speaker. That, in turn, improves the customer experience through fully automated voice channels. Maybe that’s why, in 2021, an estimated 67% of companies were investing in conversational AI. With the wide 2023 release of generative AI chatbots like ChatGPT, those investments are likely to increase. After all, McKinsey says chatbots will soon assist with core tasks in marketing and sales—and with neural TTS, AI chatbots will speak as well as they write. By creating a more natural audio experience, neural text to speech is helping to power this shift toward voice marketing strategies.


Text-To-Speech Synthesis

95 papers with code • 6 benchmarks • 17 datasets

Text-To-Speech Synthesis is a machine learning task that involves converting written text into spoken words. The goal is to generate synthetic speech that sounds natural and resembles human speech as closely as possible.

Benchmarks

Best models on the listed benchmarks: NaturalSpeech, Token-Level Ensemble Distillation, Mia, Tacotron 2, Tacotron 2, Match-TTSG

Most implemented papers

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units.

FastSpeech: Fast, Robust and Controllable Text to Speech

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.

Efficient Neural Audio Synthesis

The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Clone a voice in 5 seconds to generate arbitrary speech in real-time

FastSpeech: Fast, Robust and Controllable Text-to-Speech

Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.

as-ideas/TransformerTTS

A Text-to-Speech Transformer in TensorFlow 2

Implementation of a non-autoregressive Transformer based neural network for Text-to-Speech (TTS). This repo is based, among others, on the following papers:

  • Neural Speech Synthesis with Transformer Network
  • FastSpeech: Fast, Robust and Controllable Text to Speech
  • FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
  • FastPitch: Parallel Text-to-speech with Pitch Prediction

Our pre-trained LJSpeech model is compatible with the pre-trained vocoders:

(older versions are also available for WaveRNN)

For quick inference with these vocoders, check out the Vocoding branch.

Non-Autoregressive

Being non-autoregressive, this Transformer model is:

  • Robust: No repeats and failed attention modes for challenging sentences.
  • Fast: With no autoregression, predictions take a fraction of the time.
  • Controllable: It is possible to control the speed and pitch of the generated utterance.

Samples can be found here.

These samples' spectrograms are converted using the pre-trained MelGAN vocoder.

Try it out on Colab:

Open In Colab

  • 06/20: Added normalisation and pre-trained models compatible with the faster MelGAN vocoder.
  • 11/20: Added pitch prediction. Autoregressive model is now specialized as an Aligner and Forward is now the only TTS model. Changed models architectures. Discontinued WaveRNN support. Improved duration extraction with Dijkstra algorithm.
  • 03/20: Vocoding branch.

Installation

Make sure you have:

  • Python >= 3.6

Install espeak as phonemizer backend (for macOS use brew):

Then install the rest with pip:

Read the individual scripts for more command line arguments.

Pre-Trained LJSpeech API

Use our pre-trained model (with Griffin-Lim) from command line with

Or in a python script

You can specify the model step with the --step flag (CL) or step parameter (script). Steps from 60000 to 100000 are available at a frequency of 5K steps (60000, 65000, ..., 95000, 100000).
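The set of available steps described above can be written down directly. The names below are hypothetical helpers for illustration and are not part of the repository's API.

```python
# Steps from 60000 to 100000 inclusive, every 5000 steps.
AVAILABLE_STEPS = range(60000, 100001, 5000)

def is_available_step(step):
    """Return True if a pre-trained checkpoint is listed for this step."""
    return step in AVAILABLE_STEPS
```

A check like this could be run before passing a value to the --step flag or the step parameter.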

IMPORTANT: make sure to checkout the correct repository version to use the API. Currently 493be6345341af0df3ae829de79c2793c9afd0ec

You can directly use LJSpeech to create the training dataset.

Configuration

  • swap the content of data_config_wavernn.yaml in config/training_config.yaml to create models compatible with WaveRNN
  • EDIT PATHS : in config/training_config.yaml edit the paths to point at your dataset and log folders

Custom dataset

Prepare a folder containing your metadata and wav files, for instance

If metadata.csv has the format wav_file_name|transcription, you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own in the same file.

Make sure that:

  • the metadata reader function name is the same as data_name field in training_config.yaml .
  • the metadata file (can be anything) is specified under metadata_path in training_config.yaml
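For a sense of what a custom reader might look like, here is a minimal sketch for the wav_file_name|transcription format. The function name and the returned dictionary shape are assumptions made for illustration; check the existing readers in data/metadata_readers.py for the interface the training scripts actually expect.

```python
def pipe_metadata_reader(metadata_path):
    """Hypothetical reader for lines of the form wav_file_name|transcription.

    Returns a dict mapping each wav file name to its transcription.
    """
    entries = {}
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip empty lines
            # Split on the first pipe only, so the text may contain pipes.
            name, text = line.split("|", 1)
            entries[name.strip()] = text.strip()
    return entries
```

If a reader like this were added, its function name would be the value to put in the data_name field, per the first point above.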

Change the --config argument based on the configuration of your choice.

Train Aligner Model

Create training dataset.

This will populate the training data directory (default transformer_tts_data.ljspeech ).

Train TTS Model

Compute alignment dataset.

First use the aligner model to create the durations dataset

This will add the durations.<session name> as well as the char-wise pitch folders to the training data directory.

Training & Model configuration

  • Training and model settings can be configured in training_config.yaml

Resume or restart training

  • To resume training simply use the same configuration files
  • To restart training, delete the weights and/or the logs from the logs folder with the training flag --reset_dir (both) or --reset_logs , --reset_weights

Monitor training

With model weights.

From command line with

Access the pre-trained models with the API call.

Old weights

Model URL Commit Vocoder Commit
0cd7d33 aca5990
1c1cb03 aca5990
1c1cb03 aca5990
1c1cb03 3595219
1c1cb03 3595219
d9ccee6 3595219
d9ccee6 3595219
2f3a1b5 3595219

Maintainers

  • Francesco Cardinale, github: cfrancesco

Special thanks

MelGAN and WaveRNN : data normalization and samples' vocoders are from these repos.

Erogol and the Mozilla TTS team for the lively exchange on the topic.

See LICENSE for details.

Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology

Product Manager, Speech

Many Google products (e.g., the Google Assistant, Search, Maps) come with built-in high-quality text-to-speech synthesis that produces natural sounding speech. Developers have been telling us they’d like to add text-to-speech to their own applications, so today we’re bringing this technology to Google Cloud Platform with Cloud Text-to-Speech .

You can use Cloud Text-to-Speech in a variety of ways, for example:

  • To power voice response systems for call centers (IVRs) and enable real-time natural language conversations
  • To enable IoT devices (e.g., TVs, cars, robots) to talk back to you
  • To convert text-based media (e.g., news articles, books) into spoken format (e.g., podcast or audiobook)

Rolling in the DeepMind

In late 2016, DeepMind introduced the first version of WaveNet — a neural network trained with a large volume of speech samples that's able to create raw audio waveforms from scratch. During training, the network extracts the underlying structure of the speech, for example which tones follow one another and what shape a realistic speech waveform should have. When given text input, the trained WaveNet model generates the corresponding speech waveforms, one sample at a time, achieving higher accuracy than alternative approaches.

Fast forward to today, and we're now using an updated version of WaveNet that runs on Google’s Cloud TPU infrastructure. The new, improved WaveNet model generates raw waveforms 1,000 times faster than the original model, and can generate one second of speech in just 50 milliseconds. In fact, the model is not just quicker, but also higher-fidelity, capable of creating waveforms with 24,000 samples a second. We’ve also increased the resolution of each sample from 8 bits to 16 bits, producing higher quality audio for a more human sound.
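The figures in this paragraph imply some simple arithmetic. The following is a minimal sketch that uses only the numbers quoted above (50 milliseconds per second of speech, 24,000 samples per second, 8-bit versus 16-bit samples); the variable names are illustrative and nothing here describes how WaveNet itself is implemented.

```python
# Figures quoted in the post; everything else is derived arithmetic.
SYNTHESIS_TIME_S = 0.050      # time to generate one second of speech
AUDIO_DURATION_S = 1.0        # length of the generated audio
SAMPLE_RATE_HZ = 24_000       # output samples per second

# How many times faster than real time the model runs.
speedup_vs_realtime = AUDIO_DURATION_S / SYNTHESIS_TIME_S

# Average time budget per generated sample, in microseconds.
time_per_sample_us = SYNTHESIS_TIME_S / SAMPLE_RATE_HZ * 1e6

# Distinct amplitude levels available at each sample resolution.
levels_8_bit = 2 ** 8
levels_16_bit = 2 ** 16

print(f"{speedup_vs_realtime:.0f}x faster than real time")
print(f"{time_per_sample_us:.2f} microseconds per sample on average")
print(f"{levels_8_bit} -> {levels_16_bit} amplitude levels per sample")
```

In other words, the quoted speed leaves an average budget of roughly two microseconds per output sample, and the move to 16 bits multiplies the number of representable amplitude levels by 256.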


With these adjustments, the new WaveNet model produces more natural sounding speech. In tests, people gave the new US English WaveNet voices an average mean-opinion-score (MOS) of 4.1 on a scale of 1-5 — over 20% better than for standard voices and reducing the gap with human speech by over 70%. As WaveNet voices also require less recorded audio input to produce high quality models, we expect to continue to improve both the variety as well as quality of the WaveNet voices available to Cloud customers in the coming months.


Cloud Text-to-Speech is already helping multiple customers deliver a better experience to their end users. Customers include Cisco and Dolphin ONE.

“As the leading provider of collaboration solutions, Cisco has a long history of bringing the latest technology advances into the enterprise. Google’s Cloud Text-to-Speech has enabled us to achieve the natural sound quality that our customers desire.”

Tim Tuttle, CTO of Cognitive Collaboration, Cisco

As the leading provider of collaboration solutions, Cisco has a long history of bringing the latest technology advances into the enterprise. Google’s Cloud Text-to-Speech has enabled us to achieve the natural sound quality that our customers desire.

Jason Berryman, Dolphin ONE


Best Speech-to-Text API Solutions in 2024


Speech-to-text APIs are revolutionizing the way we interact with technology.

By converting spoken language into written text, these APIs open new possibilities for accessibility, productivity, and user interaction across numerous platforms and devices. As we delve into the intricacies of speech-to-text technology, it’s essential to understand both the foundational components and the advanced mechanisms that drive these systems.

The purpose of this article is to delve into the best speech-to-text API solutions available in 2024 , focusing on their technical aspects, industry applications, and advantages.


What is Behind Speech-to-Text API Technology?

Speech-to-text APIs have become an integral part of modern technology, enabling a wide range of applications from automated transcriptions to voice-controlled interfaces. Understanding the underlying technology helps in appreciating the complexity and the advancements that make these APIs so powerful. Here’s a deep dive into the technical aspects of speech-to-text API technology:

Core Components of Speech-to-Text Technology

1. Automatic Speech Recognition (ASR):

  • Phoneme Recognition: Identifying the smallest units of sound in speech.
  • Feature Extraction: Converting raw audio signals into a format that the ASR system can process, typically involving the extraction of features like Mel-frequency cepstral coefficients (MFCCs).
  • N-gram Models: Probabilistic models that predict the next word in a sequence based on the previous ‘n’ words.
  • Neural Language Models: Use deep learning to predict word sequences with greater context and accuracy.
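The N-gram idea in the list above can be illustrated with the smallest useful case, a bigram model (n = 1 previous word). This is a minimal sketch on a made-up three-sentence corpus; the function names and the corpus are assumptions for illustration, and production ASR systems use far larger corpora plus smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            counts[previous][current] += 1
    return counts

def next_word_probabilities(counts, previous):
    """Return P(next word | previous word) as a dict (empty if unseen)."""
    followers = counts.get(previous, Counter())
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

# Toy corpus (hypothetical voice commands).
corpus = [
    "turn the lights on",
    "turn the volume up",
    "turn the lights off",
]
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")
print(probs)  # "lights" is twice as likely as "volume" after "the"
```

A recognizer can use such probabilities to prefer the transcription whose word sequence is more plausible when two candidates sound alike.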


2. Deep Learning and Neural Networks:

  • Recurrent Neural Networks (RNNs): Specialized for sequence data, RNNs are adept at processing sequences of audio signals. Variants like Long Short-Term Memory (LSTM) networks are particularly effective in handling long-range dependencies in speech.
  • Convolutional Neural Networks (CNNs): Primarily used for image processing, CNNs have found applications in speech recognition by helping to identify features in audio spectrograms.
  • Transformer Models: The latest advancement in deep learning, transformer models use attention mechanisms to focus on important parts of the input sequence, significantly improving the accuracy and efficiency of speech-to-text systems.
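The attention mechanism mentioned for transformer models can be sketched in a few lines. The example below implements scaled dot-product attention for a single query in plain Python; the toy vectors are invented for illustration, and real models operate on learned, high-dimensional representations with many attention heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Three toy "audio frames" (keys and values) and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 2.0]  # most similar to the third key

weights, output = attend(query, keys, values)
print(weights)  # the third frame receives the largest weight
print(output)   # weighted mix of the value vectors
```

The weights sum to one, so the output is a weighted average of the inputs in which the frames most relevant to the query dominate.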

3. Real-Time Processing:

  • Streaming APIs: Enable continuous transcription of audio in real-time, which is essential for applications like live captioning and interactive voice response systems.
  • On-Device Processing: Reduces latency and dependency on cloud services by performing speech recognition directly on the user’s device. This approach is particularly beneficial for applications requiring immediate response and enhanced privacy.

4. Post-Processing and Error Correction:

  • Text Normalization: Converts transcribed text into a more readable format by addressing issues like punctuation, capitalization, and spacing.
  • Contextual Understanding: Advanced speech-to-text systems incorporate contextual understanding to correct errors based on the surrounding text, improving the overall accuracy of the transcription.
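Text normalization of the kind described above can be approximated with a few rules. This is a minimal rule-based sketch, assuming English text and a raw transcript that already contains punctuation tokens; the function name is hypothetical, and commercial APIs typically use learned models rather than regular expressions.

```python
import re

def normalize_transcript(raw):
    """Tidy raw ASR output: spacing, sentence capitalization, final period."""
    text = re.sub(r"\s+", " ", raw).strip()       # collapse repeated whitespace
    text = re.sub(r"\s+([,.!?])", r"\1", text)    # no space before punctuation
    if text and text[-1] not in ".!?":
        text += "."                               # terminate the last sentence
    # Capitalize the first letter of the text and of each sentence.
    text = re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
    # Capitalize the standalone pronoun "i".
    text = re.sub(r"\bi\b", "I", text)
    return text

cleaned = normalize_transcript("hello   there . how are you ?  i am fine")
print(cleaned)  # Hello there. How are you? I am fine.
```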


Speech-to-Text APIs Industry Applications

Speech-to-text technology is utilized across various industries, each benefiting from its unique capabilities. Here is a table summarizing the applications in different industries:

Industry Speech-to-Text API Application
Automates the transcription of patient records.
Enables hands-free operation of medical devices.
Provides real-time transcription of customer interactions.
Enhances AI-powered customer service tools.
Automates the generation of captions for video content.
Assists in the transcription of interviews and podcasts.
Provides students with accurate transcriptions of lectures.
Enhances language learning apps with accurate feedback.

Advancements in Speech-to-Text Technology

Recent advancements have significantly improved the capabilities of speech-to-text APIs:

  • Multilingual Support: Modern APIs support a wide range of languages and dialects, making them accessible to a global audience.
  • Enhanced Accuracy: Continuous improvements in deep learning models and large-scale datasets have led to higher transcription accuracy.
  • Privacy and Security: On-device processing and encrypted data transmission ensure that user data remains secure, addressing privacy concerns.

Challenges and Future Directions

While speech-to-text technology has come a long way, it still faces several challenges:

  • Accurate Transcription in Noisy Environments: Background noise can significantly impact the accuracy of transcriptions. Advanced noise-cancellation algorithms and robust acoustic models are being developed to address this issue.
  • Dialect and Accent Variability: Ensuring accurate transcription across different dialects and accents remains a challenge. Ongoing research focuses on creating more inclusive models that can handle diverse speech patterns.
  • Real-Time Translation: Integrating speech-to-text with real-time translation presents both a challenge and an opportunity. Achieving seamless translation while maintaining accuracy is a key area of development.

Here are some of the top speech-to-text API solutions available in 2024, based on extensive research from reputable sources such as Deepgram, AssemblyAI, and others:

1. Assembly AI


Assembly AI is a leading provider of speech-to-text solutions, known for its high accuracy and advanced machine learning models. It supports multiple languages and dialects, making it a versatile choice for various industries.

Features:

  • High accuracy with advanced machine learning models.
  • Support for multiple languages and dialects.
  • Real-time and batch processing capabilities.

Pros:

  • Excellent accuracy for various accents and dialects.
  • Flexible integration options with APIs and SDKs.
  • Robust support and documentation.

Cons:

  • Requires significant computational resources for processing.
  • Limited offline capabilities.

Use Cases: Suitable for transcription services, call centers, and media industries.

2. Deepgram


Deepgram offers deep learning-based ASR with customizable models, providing high accuracy and fast processing speeds. It integrates seamlessly with various platforms, making it ideal for voice assistants and call analytics.

Features:

  • Deep learning-based ASR with customizable models.
  • High accuracy and fast processing speeds.
  • Integration with various platforms via APIs.

Pros:

  • Highly scalable for large-scale applications.
  • Offers real-time and batch processing options.
  • Supports multiple languages and dialects.

Cons:

  • Customization may require technical expertise.
  • Premium features can be costly.

Use Cases: Ideal for voice assistants, transcription, and call analytics.

3. Speechmatics


Speechmatics is renowned for its universal speech recognition technology, offering high accuracy across diverse accents and dialects. It is particularly useful for enterprise applications, providing scalable solutions for various industries.

Features:

  • Universal speech recognition with high accuracy.
  • Support for diverse accents and dialects.
  • Scalable solutions for enterprise applications.

Pros:

  • Highly accurate transcription across various dialects.
  • Strong enterprise support and scalability.
  • Continuous improvements and updates.

Cons:

  • Setup can be complex for new users.
  • Higher cost for extensive usage.

Use Cases: Useful for broadcast media, telecommunication, and transcription services.

4. Rev AI

Rev AI stands out with its industry-leading accuracy, offering human-reviewed options for even higher precision. It supports real-time and asynchronous transcription, making it perfect for media production and legal sectors.

Features:

  • Industry-leading accuracy with human-reviewed options.
  • Real-time and asynchronous transcription.
  • Easy integration with SDKs and APIs.

Pros:

  • Highly accurate transcriptions with human review.
  • Versatile integration options for various platforms.
  • Strong reputation in the industry.

Cons:

  • Human-reviewed transcriptions can be more expensive.
  • Limited free tier options.

Use Cases: Perfect for media production, legal, and education sectors.

5. Whisper

Whisper, developed by OpenAI, is a cutting-edge speech recognition technology offering high accuracy and robust performance. It supports multiple languages and is ideal for developers seeking open-source solutions.

Features:

  • OpenAI’s cutting-edge speech recognition technology.
  • High accuracy and robust performance.
  • Support for multiple languages.

Pros:

  • Open-source and customizable.
  • Strong performance across various languages.
  • Free to use with extensive documentation.

Cons:

  • May require fine-tuning for specific applications.
  • Limited support compared to commercial solutions.

Use Cases: Suitable for developers seeking open-source solutions for diverse applications.

6. Symbl

Symbl offers advanced conversational intelligence with contextual understanding, providing real-time transcription and analysis. It integrates well with communication platforms, making it ideal for customer service and team collaboration.

Features:

  • Conversational intelligence with contextual understanding.
  • Real-time transcription and analysis.
  • Integration with communication platforms.

Pros:

  • Advanced contextual understanding enhances transcription accuracy.
  • Seamless integration with various communication tools.
  • Offers real-time insights and analytics.

Cons:

  • Can be complex to integrate without technical expertise.
  • Some features are available only in premium plans.

Use Cases: Ideal for customer service, sales, and team collaboration tools.

Krisp: The Ultimate Transcription Solution for Call Centers

Krisp is a versatile and reliable transcription software designed to enhance call center operations and improve customer service.

Technical Advantages of Krisp for Enterprise Call Centers

Superior Transcription Accuracy

  • 96% Accuracy:  Leveraging cutting-edge AI, Krisp ensures high-quality transcriptions even in noisy environments, boasting a Word Error Rate (WER) of only 4%.
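The relationship between the two numbers above is accuracy = 1 - WER. As a rough illustration of how WER is usually computed (word-level edit distance divided by the number of reference words), here is a minimal sketch; the sentences are invented, and this is a generic calculation rather than Krisp's evaluation procedure.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)

reference = "thank you for calling how can I help you today"
hypothesis = "thank you for calling how can I held you today"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")  # one error in ten words
```

By the same formula, a WER of 4% corresponds to one word error for every 25 reference words on average.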

On-Device Processing

  • Enhanced Security:  Krisp’s desktop app processes transcriptions and noise cancellation directly on your device, ensuring sensitive information remains secure and compliant with stringent security standards.

Unmatched Privacy

  • Real-Time Redaction:  Ensures the utmost privacy by redacting Personally Identifiable Information (PII) and Payment Card Information (PCI) in real-time.
  • Private Cloud Storage:  Stores transcripts in a private cloud owned by customers, with write-only access, ensuring complete control over data.
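To make the idea of redacting payment card information concrete, the sketch below masks digit sequences that look like card numbers and pass the Luhn checksum. It is a simplified illustration under stated assumptions (card numbers appear as 13 to 19 digits, optionally separated by spaces or hyphens); the function names and placeholder are hypothetical, and it does not describe how Krisp performs redaction.

```python
import re

def luhn_valid(digits):
    """Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:      # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def redact_cards(text, placeholder="[REDACTED CARD]"):
    """Replace card-like numbers that pass the Luhn check."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(replace, text)

# 4111 1111 1111 1111 is a widely used test card number.
transcript = "my card is 4111 1111 1111 1111 and my order is 1234567890123"
redacted = redact_cards(transcript)
print(redacted)
```

The checksum keeps the order number in the example intact, which shows why a validity check is preferable to masking every long digit string.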

Centralized Solution Across All Platforms

  • Cost Optimization:  By centralizing call transcriptions across all platforms, Krisp CCT optimizes costs and simplifies data management.
  • Streamlined Operations:  Eliminates the need for multiple transcription services, making data handling more efficient.

No Additional Integrations Required

  • Effortless Integration:  Krisp’s plug-and-play setup integrates seamlessly with major Contact Center as a Service (CCaaS) and Unified Communications as a Service (UCaaS) platforms.
  • Operational Efficiency:  Requires no additional configurations, ensuring smooth and secure operations from the start.

Use Cases Enabled by Krisp Call Center Transcription

Use Case Description
Boost your BPO’s efficiency by ensuring quality control of customer interactions, enabling targeted training and coaching sessions, refining sales strategies, and improving call center metrics for an enhanced operation.
Maintain regulatory compliance and adhere to industry standards with Krisp CCT, which provides a searchable record of all customer interactions. This can support your compliance efforts and offer valuable information for dispute resolution.
Streamline customer research and analysis, identify actionable customer insights, and collect feature requests to better understand and serve your customers.
Identify fraudulent patterns in customer interactions, mitigate data breaches, and enhance fraud prevention strategies to protect your business and customers with Krisp CCT.


How to create text-to-speech with neural network

I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.

While implementing the Network, I was thinking the input should be the segmented characters of the word/phrase as the output pronunciation only depends on the characters that make up the word, unlike English where we have silent words and Part of Speech to consider. However, I do not know how I should train the output.

Since my Dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting these files to WAV using pydub for all audio files.

Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.
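The two preprocessing steps described above could look roughly like the following. This is a sketch rather than the asker's original code: it assumes 16-bit mono PCM WAV files, uses pydub only for the MP3 conversion (which also needs ffmpeg installed), and the function names are made up for illustration.

```python
import os
import struct
import tempfile
import wave

def mp3_to_wav(mp3_path, wav_path):
    """Convert MP3 to WAV with pydub (third-party; requires ffmpeg)."""
    from pydub import AudioSegment  # imported lazily so the rest runs without it
    AudioSegment.from_mp3(mp3_path).export(wav_path, format="wav")

def load_normalized_wav(path):
    """Read a 16-bit mono PCM WAV and scale its samples to the range [0, 1]."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2 or wav_file.getnchannels() != 1:
            raise ValueError("this sketch expects 16-bit mono PCM")
        sample_rate = wav_file.getframerate()
        raw = wav_file.readframes(wav_file.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)  # signed 16-bit ints
    # Map [-32768, 32767] onto [0, 1]; silence (0) lands near 0.5.
    normalized = [(s + 32768) / 65535 for s in samples]
    return sample_rate, normalized

# Demo on a tiny synthetic file (three samples: minimum, zero, maximum).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(1)
    demo.setsampwidth(2)
    demo.setframerate(16000)
    demo.writeframes(struct.pack("<3h", -32768, 0, 32767))

rate, normalized = load_normalized_wav(demo_path)
print(rate, normalized)
```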

How should I train this?

From here, I am not sure how to train this with the input text. From this, I would need a variable number of input and output neurons in the First and Last layers as the number of characters (1st layer) and the bytes of the corresponding wave (Last layer) change for every input.

Since RNNs deal with such variable data, I thought it would come in handy here.

Correct me if I am wrong, but the output of Neural Networks are actually probability values between 0 and 1. However, we are not dealing with a classification problem. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there will be around 40,000 of these with values between 0 and 255 (without the normalization step) for every word. How do I train this speech data? Any suggestions are appreciated.

EDIT 1 : In response to Aaron's comment

From what I understand, Phonemes are the basic sounds of the language. So, why do I need a neural network to map phoneme labels with speech? Can't I just say, "whenever you see this alphabet, pronounce it like this ". After all, this language, Kannada, is phonetic: There are no silent words. All words are pronounced the same way they are spelled. How would a Neural Network help here then?

On input of a new text, I just need to break it down to the corresponding alphabets (which are also the phonemes) and retrieve its file (converted from WAV to raw byte data). Now, merge the bytes together and convert it to a wav file.

Is this too simplistic? Am I missing something here? What would be the point of a Neural Network for this particular language (Kannada) ?
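The lookup-and-merge approach described in EDIT 1 can be sketched with the standard wave module. This is an illustration only: the unit names and the silent stand-in recordings are hypothetical, and all units are assumed to share one audio format. Plain concatenation like this ignores how neighbouring sounds influence each other and does not model intonation, so the joins are usually audible.

```python
import io
import wave

def make_silent_wav(num_frames, rate=16000):
    """Stand-in for a recorded unit: 16-bit mono silence of the given length."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as unit:
        unit.setnchannels(1)
        unit.setsampwidth(2)
        unit.setframerate(rate)
        unit.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()

def concatenate_wavs(wav_blobs):
    """Join WAV files (given as bytes) that share the same audio format."""
    if not wav_blobs:
        raise ValueError("need at least one unit")
    params = None
    frames = []
    for blob in wav_blobs:
        with wave.open(io.BytesIO(blob), "rb") as unit:
            fmt = (unit.getnchannels(), unit.getsampwidth(), unit.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError("all units must share channels, width and rate")
            frames.append(unit.readframes(unit.getnframes()))
    merged = io.BytesIO()
    with wave.open(merged, "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        out.writeframes(b"".join(frames))
    return merged.getvalue()

# Hypothetical unit inventory: one recording per letter/phoneme.
unit_audio = {
    "ka": make_silent_wav(800),
    "nna": make_silent_wav(1200),
    "da": make_silent_wav(1000),
}
word_units = ["ka", "nna", "da"]
merged_wav = concatenate_wavs([unit_audio[u] for u in word_units])

with wave.open(io.BytesIO(merged_wav), "rb") as check:
    merged_frames = check.getnframes()
print(merged_frames)  # total number of frames across the three units
```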

  • neural-network
  • speech-recognition
  • text-to-speech


  • In brief, you have to come up with a feature list. What goes into the list and how it is measured and represented, depends on the problem that you are trying to solve. –  DYZ Commented Mar 27, 2017 at 18:51
  • I already know the features of the input text. I'm not concerned about the input format. It's the output layer that I'm stumped on. How do I train the speech OUTPUT on the text input with the neural network? –  Ajay H Commented Mar 27, 2017 at 18:55
  • You need to first find a way to translate the text into phonemes. This is more typically done with curated databases than learning algorithms, but in your case is what you're really doing with the neural net. audio files of recorded phonemes are more or less appended together to form words. –  Aaron Commented Mar 27, 2017 at 21:25
  • I edited my answer in response to @Aaron 's comment. Please check EDIT 1 of my answer. –  Ajay H Commented Mar 28, 2017 at 13:06
  • @AjayH I just did some preliminary (Wikipedia) research on Kannada, and while in general it was stated that it is purely phonetic, it did also mention at least one instance where that rule is broken. Also mentioned was that there were as many as 20 dialects. While the trivial approach may be sufficient for many instances, you should still definitely include a framework to add overrides for larger (than single letter) patterns where exceptions need to be made. –  Aaron Commented Mar 28, 2017 at 13:36

It is not trivial and requires a special architecture. You can read descriptions of it in publications from DeepMind and Baidu.

You might also want to study an existing implementation of WaveNet training.

Overall, pure end-to-end speech synthesis is still not working. If you are serious about text-to-speech it is better to study conventional systems like merlin .


  • I edited my answer under EDIT 1 (in response to another comment) . Can you let me know your thoughts on that. –  Ajay H Commented Mar 28, 2017 at 13:08


Addressing Hallucinations in Speech Synthesis LLMs with the NVIDIA NeMo T5-TTS Model


NVIDIA NeMo has released the T5-TTS model , a significant advancement in text-to-speech (TTS) technology. Based on large language models (LLMs) , T5-TTS produces more accurate and natural-sounding speech. By improving alignment between text and audio, T5-TTS eliminates hallucinations such as repeated spoken words and skipped text. Additionally, T5-TTS makes up to 2x fewer word pronunciation errors compared to other open-source models such as Bark and SpeechT5 . 

Listen to T5-TTS model audio samples.

NVIDIA NeMo is an end-to-end platform for developing multimodal generative AI models at scale anywhere—on-premises and on any cloud.

The role of LLMs in speech synthesis

LLMs have revolutionized natural language processing (NLP) with their remarkable ability to understand and generate coherent text. Recently, LLMs have been widely adopted in the speech domain, using vast amounts of data to capture the nuances of human speech patterns and intonations. LLM-based speech synthesis models produce speech that is not only more natural, but also more expressive, opening up a world of possibilities for applications in various industries.

However, as in the text domain, speech LLMs face hallucination challenges, which can hinder their real-world deployment.

T5-TTS model overview

The T5-TTS model leverages an encoder-decoder transformer architecture for speech synthesis. The encoder processes text input, and the auto-regressive decoder takes a reference speech prompt from the target speaker. The auto-regressive decoder then generates speech tokens by attending to the encoder’s output through the transformer’s cross-attention heads. These cross-attention heads implicitly learn to align text and speech. However, their robustness can falter, especially when the input text contains repeated words.


Addressing the hallucination challenge

Hallucination in TTS occurs when the generated speech deviates from the intended text, causing errors ranging from minor mispronunciations to entirely incorrect words. These inaccuracies can compromise the reliability of TTS systems in critical applications like assistive technologies, customer service, and content creation.

The T5-TTS model addresses this issue by more efficiently aligning textual inputs with corresponding speech outputs, significantly reducing hallucinations. By applying a monotonic alignment prior and a connectionist temporal classification (CTC) loss, the generated speech closely matches the intended text, resulting in a more reliable and accurate TTS system. For word pronunciation, the T5-TTS model makes 2x fewer errors compared to Bark, 1.8x fewer errors compared to VALLE-X (open-source implementation), and 1.5x fewer errors compared to SpeechT5 (Figure 2).
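The intuition behind a monotonic alignment prior can be shown with a toy example: attention from each speech step is biased toward text positions near the diagonal, so the alignment moves forward through the text instead of jumping or repeating. The sketch below uses a simple Gaussian band as the prior; that choice, the function names, and the sizes are assumptions made for illustration, not the formulation used in NeMo, and the CTC loss is not shown.

```python
import math

def diagonal_prior(num_speech_steps, num_text_tokens, width=1.0):
    """Row-normalized prior that favors text positions near the diagonal."""
    prior = []
    for t in range(num_speech_steps):
        # Text position this step would attend to if the alignment were linear.
        expected = t * (num_text_tokens - 1) / max(num_speech_steps - 1, 1)
        row = [math.exp(-((n - expected) ** 2) / (2 * width ** 2))
               for n in range(num_text_tokens)]
        total = sum(row)
        prior.append([p / total for p in row])
    return prior

def apply_prior(attention, prior):
    """Multiply attention weights by the prior and renormalize each row."""
    biased = []
    for attention_row, prior_row in zip(attention, prior):
        row = [a * p for a, p in zip(attention_row, prior_row)]
        total = sum(row)
        biased.append([r / total for r in row])
    return biased

# Uniform attention over 4 text tokens for 8 speech steps (no alignment at all).
steps, tokens = 8, 4
uniform = [[1.0 / tokens] * tokens for _ in range(steps)]
biased = apply_prior(uniform, diagonal_prior(steps, tokens))

# Index of the most-attended text token at each speech step.
peaks = [row.index(max(row)) for row in biased]
print(peaks)  # non-decreasing: the alignment only moves forward
```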


Implications and future considerations for research

The release of the T5-TTS model by NVIDIA NeMo marks a significant advancement in TTS systems. By effectively addressing the hallucination problem, the model sets the stage for more reliable and high-quality speech synthesis, enhancing user experiences across a wide range of applications.

Looking forward, the NVIDIA NeMo team plans to further refine the T5-TTS model by expanding language support, improving its ability to capture diverse speech patterns, and integrating it into broader NLP frameworks.

Explore the NVIDIA NeMo T5-TTS model

The T5-TTS model represents a major breakthrough in achieving more accurate and natural text-to-speech synthesis. Its innovative approach to learning robust text and speech alignment sets a new benchmark in the field, promising to transform how we interact with and benefit from TTS technology. 

To access the T5-TTS model and start exploring its potential, visit NVIDIA/NeMo on GitHub. Whether you’re a researcher, developer, or enthusiast, this powerful tool offers countless possibilities for innovation and advancement in the realm of text-to-speech technology. To learn more, see Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment .

Acknowledgments

We extend our gratitude to all the model authors and collaborators who contributed to this work, including Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Boris Ginsburg, Rafael Valle, and Rohan Badlani.


SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network


SpeechGen.io Revolutionizes Audio Content Creation with Multi-Voice AI Text-to-Speech Technology

HONG KONG, CHINA / ACCESSWIRE / July 7, 2024 / SpeechGen.io is set to transform the landscape of audio content production in 2024 with its cutting-edge SpeechGen text-to-speech technology. By leveraging advanced neural networks, SpeechGen.io enables the creation of diverse and natural-sounding audio content, featuring multiple virtual speakers in a single audio file. This innovative technology not only simplifies the audio production process but also opens up new possibilities across various sectors, including education, entertainment, business, and healthcare.

Unlocking New Opportunities with Multi-Voice Narration

The multi-voice AI text-to-speech technology from SpeechGen.io allows users to generate dynamic dialogues and narratives in multiple languages using a single neural network. This capability significantly reduces the time and resources required for audio content creation, making it an ideal solution for businesses, educators, and content creators looking to produce high-quality audio efficiently.

Key Features and Benefits:

Natural Dialogues and Narratives: The technology simulates real-life conversations, making audio content more engaging and authentic.

Multi-Language Support: Create audio content in multiple languages using the same neural network, expanding your reach to a global audience.

Resource Efficiency: Save time and resources by streamlining the audio production process with multi-voice narration.

Versatile Applications: Ideal for various fields, including education, entertainment, business, healthcare, and more.

Innovative Use Cases

Education: Enhance learning experiences with interactive audio courses, history podcasts, and multi-voice lectures.

Foreign Language Learning: Develop immersive language courses and audio dictionaries that cater to diverse learning styles.

Culture and Tourism: Create engaging multilingual audio guides and virtual city tours for museums and travel agencies.

Literature and Entertainment: Produce audiobooks, interactive fairy tales, and audio versions of comic books with distinct character voices.

Business and Corporate Training: Facilitate effective corporate training and negotiation simulations with multi-voice audio content.

Marketing and Advertising: Craft compelling product commercials, interactive voice banners, and audio product catalogs.

Media and Journalism: Deliver comprehensive news digests, magazine audio versions, and investigative podcasts.

Healthcare: Provide accessible medical instructions, disease prevention guides, and first aid audio courses.

Simplifying Audio Content Creation

Using SpeechGen.io's multi-voice AI text-to-speech technology is straightforward. Users can select virtual speakers from available voice models, assign text to each speaker, and generate a cohesive audio file that mimics natural dialogues. This user-friendly process allows anyone to create professional-grade audio content without extensive technical knowledge.

About SpeechGen.io

SpeechGen.io is a leading innovator in the field of AI-powered text-to-speech technology. Our mission is to empower creators, educators, and businesses to produce high-quality audio content effortlessly. By combining state-of-the-art neural networks with user-friendly interfaces, we provide a versatile solution for a wide range of audio production needs.

Contact Information SpeechGen.io [email protected] Units A-C, 25/F., Seabright Plaza, No. 9-23 Shell Street, North Point, Hong Kong. https://speechgen.io/

SOURCE: SpeechGen.io


Deep correlation network for synthetic speech detection


Recommendations

FMFCC-A: A challenging Mandarin dataset for synthetic speech detection

As increasing development of text-to-speech (TTS) and voice conversion (VC) technologies, the detection of synthetic speech has been suffered dramatically. In order to promote the development of synthetic speech detection model against Mandarin ...

Synthetic speech detection using phase information

Phase information based synthetic speech detectors (RPS, MGD) are analyzed. Training using real attack samples and copy-synthesized material is evaluated. Evaluation of the detectors against unknown attacks, including channel effect. Detectors work well ...

Extracting Efficient Spectrograms From MP3 Compressed Speech Signals for Synthetic Speech Detection

Many speech signals are compressed with MP3 to reduce the data rate. In many synthetic speech detection methods the spectrogram of the speech signal is used. This usually requires the speech signal to be fully decompressed. We show that the design of MP3 ...

Published in

Elsevier Science Publishers B. V.

Netherlands

Author tags

  • Synthetic speech detection
  • Deep correlation network
  • Correlation learning network
  • Common embedding
  • Research-article


Graph neural networks for text classification: a survey

  • Open access
  • Published: 01 July 2024
  • Volume 57, article number 190 (2024)


  • Kunze Wang 1 ,
  • Yihao Ding 1 , 2 &
  • Soyeon Caren Han 1 , 2  


Text Classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models applied the sequential deep learning technique, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, dealing with the graph construction mechanisms and the graph-based learning process. As well as the technological survey, we look at issues behind and future directions addressed in text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design and present a summary of published performance on the publicly available benchmarks. Note that we present a comprehensive comparison between different techniques and identify the pros and cons of various evaluation metrics in this survey.


1 Introduction

Text classification aims to classify a given document into certain pre-defined classes, and is considered to be a fundamental task in natural language processing (NLP). It includes a large number of downstream tasks, such as topic classification (Zhang et al. 2015 ), and sentiment analysis (Tai et al. 2015 ). Traditional text classification methods build representation on the text using N-gram (Cavnar et al. 1994 ) or Term Frequency-Inverse Document Frequency (TF-IDF) (Hakim et al. 2014 ) and apply traditional machine learning models, such as SVM (Joachims 2005 ), to classify the documents. With the development of neural networks, more deep learning models have been applied to text classification, including convolutional neural networks (CNN) (Kim 2014 ), recurrent neural networks (RNN) (Tang et al. 2015 ) and attention-based (Vaswani et al. 2017 ) models and large language models (Devlin et al. 2018 ).

However, these methods are either unable to handle the complex relationships between words and documents (Yao et al. 2019) or cannot efficiently explore context-aware word relations (Zhang et al. 2020). Graph neural networks (GNNs) were introduced to resolve such obstacles. GNNs operate on graph-structured data, so a graph needs to be built for text classification. There are two main approaches to constructing graphs: corpus-level and document-level graphs. A dataset is built either into one or more corpus-level graphs representing the whole corpus, or into numerous document-level graphs, each of which represents a single document. The corpus-level graph can capture the global structural information of the entire corpus, while the document-level graph can explicitly capture the word-to-word relationships within a document. Both ways of applying graph neural networks to text classification achieve good performance.

This paper mainly focuses on GNN-based text classification techniques, datasets, and their performance. The graph construction approaches for both corpus-level and document-level graphs are addressed in detail. Papers on the following aspects will be reviewed:

GNNs-based text classification approaches. Papers that design GNN-based frameworks to enhance the feature representation or directly apply GNNs to conduct sequence text classification tasks will be summarized, described and discussed. GNNs applied for token-level classification (Natural Language Understanding) tasks, including NER, slot filling, etc, will not be discussed in this work.

Text classification benchmark datasets and their performance applied by GNN-based models. The text classification datasets with commonly used metrics used by GNNs-based text classification models will be summarized and categorized based on task types and the model performance on these datasets.

1.1 Related surveys and our contribution

Before 2019, text classification survey papers (Xing et al. 2010; Khan et al. 2010; Harish et al. 2010; Aggarwal and Zhai 2012; Vijayan et al. 2017) focused on traditional machine learning-based text classification models. More recently, with the rapid development of deep learning techniques, Minaee et al. (2021), Zulqarnain et al. (2020), Zhou (2020) and Li et al. (2022) reviewed the various deep learning-based approaches. In addition, some papers not only review the SoTA model architectures but also summarize the overall workflow (Jindal et al. 2015; Kadhim 2019; Mirończuk and Protasiewicz 2018; Kowsari et al. 2019; Bhavani and Kumar 2021) or specific techniques for text classification, including word embedding (Selva Birunda and Kanniga 2021), feature selection (Deng et al. 2019; Shah and Patel 2016; Pintas et al. 2021) and term weighting (Patra and Singh 2013; Alsaeedi 2020), among others. Meanwhile, some promising text classification architectures have been surveyed, such as CNNs (Yang et al. 2016) and attention mechanisms (Mariyam et al. 2021). Owing to their powerful ability to represent non-Euclidean relations, GNNs have been used and reviewed in multiple practical fields, e.g. financial applications (Wang et al. 2021), traffic prediction (Liu and Tan 2021), bio-informatics (Zhang et al. 2021), power systems (Liao et al. 2021) and recommendation systems (Gao et al. 2022; Liang et al. 2021; Yang et al. 2021). Moreover, Bronstein et al. (2017), Battaglia et al. (2018), Zhang et al. (2019), Zhou et al. (2020) and Wu et al. (2020) comprehensively review the general algorithms and applications of GNNs, while other surveys focus on specific perspectives, including graph construction (Skarding et al. 2021; Thomas et al. 2022), graph representation (Hamilton et al. 2017), training (Xie et al. 2022), pooling (Liu et al. 2022) and more. However, only Minaee et al. (2021) and Li et al. (2022) briefly introduce certain SoTA GNN-based text classification models. A recent short review paper (Malekzadeh et al. 2021) reviews the concept of GNNs and four SoTA GNN-based text classification models. In contrast, our study focuses explicitly on applying GNN techniques to text classification tasks. We delve into various GNN-related methodologies, including graph construction, node and edge representation, and training approaches commonly employed in text classification. Unlike Malekzadeh et al. (2021), which reviews a limited number of models, our survey encompasses around 30 models categorised into document-level and corpus-level classifications, enabling a comprehensive analysis for comparing and contrasting these approaches. Additionally, our study goes beyond merely examining models by providing an in-depth analysis of metrics and datasets commonly used in GNN-based text classification tasks, aiming to offer valuable insights for future research in similar areas.

The contribution of this survey includes:

This is the first survey focused only on graph neural networks for text classification with a comprehensive description and critical discussion on more than twenty GNN text classification models.

We categorize the existing GNN text classification models into two main categories with multiple sub-categories; the tree structure of all the models is shown in Fig. 1.

We compare these models in terms of graph construction, node embedding initialization, and graph learning methods. And we also compare the performance of these models on the benchmark datasets and discuss the key findings.

We discuss the existing challenges and some potential future work for GNN text classification models.

1.2 Text classification tasks

Text classification involves assigning a pre-defined label to a given text sequence. The process typically involves encoding pre-processed raw text into numerical representations and using classifiers to predict the corresponding categories. Typical sub-tasks include sentiment analysis, topic labelling, news categorization, and hate speech detection. Specific frameworks can be extended to advanced applications such as information retrieval, summarising, question answering, and natural language inference. This paper focuses specifically on GNN-based models used for typical text classification.

Sentiment analysis is a task that aims to identify the emotional states and subjective opinions expressed in the input text, such as reviews, micro-blogs, etc. This can be achieved through binary or multi-class classification. Effective sentiment analysis can aid in making informed business decisions based on user feedback.

Topic classification is a supervised deep learning task to automatically understand the text content and classify it into multiple domain-specific categories, typically more than two. The data sources may be gathered from different domains, including Wikipedia pages, newspapers, scientific papers, etc.

Junk information detection involves detecting inappropriate social media content. Social media providers commonly use approaches like hate speech, abusive language, advertising or spam detection to remove such content efficiently.

1.3 Text classification development

Many traditional machine learning methods and deep learning models are selected as baselines for comparison with the GNN-based text classifiers. We mainly summarized those baselines into three types:

Traditional machine learning : In earlier years, traditional methods such as Support Vector Machines (SVM) (Zhang et al. 2011 ) and Logistic Regression (Genkin et al. 2007 ) utilized sparse representations like Bag of Words (BoW) and TF-IDF. However, recent advancements (Lilleberg et al. 2015 ; Yin and Jin 2015 ; Ren et al. 2016 ) have focused on dense representations, such as Word2vec, GloVe, and Fasttext, to mitigate the limitations of sparse representations. These dense representations are also used as inputs for sophisticated methods, such as Deep Averaging Networks (DAN) (Iyyer et al. 2015 ) and Paragraph Vector (Doc2Vec) (Le and Mikolov 2014 ), to achieve new state-of-the-art results.

Sequential models : RNNs and CNNs have been utilized to capture local-level semantic and syntactic information of consecutive words from input text bodies. The upgraded models, such as LSTM (Graves 2012 ) and GRU (Cho et al. 2014 ), have been proposed to address the vanishing or exploding gradient problems caused by vanilla RNN. CNN-based structures have been applied to capture N-gram features by using one or more convolution and pooling layers, such as Dynamic CNN (Kalchbrenner et al. 2014 ) and TextCNN (Kim 2014 ). However, these models can only capture local dependencies of consecutive words. To capture longer-term or non-Euclidean relations, improved RNN structures, such as Tree-LSTM (Tai et al. 2015 ) and MT-LSTM (Liu et al. 2015 ), and global semantic information, like TopicRNN (Dieng et al. 2016 ), have been proposed. Additionally, graph (Peng et al. 2018 ) and tree structure (Mou et al. 2015 ) enhanced CNNs have been proposed to learn more about global and long-term dependencies.

Attentions and transformers : attention mechanisms (Bahdanau et al. 2014 ) have been widely adopted to capture long-range dependencies, such as hierarchical attention networks (Abreu et al. 2019 ) and attention-based hybrid models (Yang et al. 2016 ). More attention-based text classification frameworks are summarized by (Minaee et al. 2021 ). Self-attention-based transformer architectures have achieved state-of-the-art performance on many text classification benchmarks via unsupervised pre-training tasks to generate strong contextual word representations (Devlin et al. 2018 ; Liu et al. 2019 ). However, although those large-scale models implicitly store general domain knowledge and are widely used to generate more representative textual representations, they only focus on learning the relation between input text bodies and ignore the global and corpus-level information (Lu et al. 2020 ; Lin et al. 2021 ).

1.4 Outline

The outline of this survey is as follows:

Section  1 presents the research questions and provides an overview of applying Graph Neural Networks to text classification tasks, along with the scope and organization of this survey.

Section  2 provides background information on text classification and graph neural networks and introduces the key concepts of applying GNNs to text classification from a designer’s perspective.

Section  3 and Sect.  4 discuss previous work on Corpus-level Graph Neural Networks and Document-level Graph Neural Networks, respectively, and provide a comparative analysis of the strengths and weaknesses of these two approaches.

Section  5 introduces the commonly used datasets and evaluation metrics in GNN for text classification.

Section  6 reports the performance of various GNN models on a range of benchmark datasets for text classification and discusses the key findings.

The challenges for the existing methods and some potential future works are discussed in Sect.  7 .

In Sect.  8 , we present the conclusions of our survey on GNN for text classification and discuss potential directions for future work.

Fig. 1 Categorizing the graph neural network text classification models

2 Backgrounds of GNN

2.1 Definition of graph

A graph in this paper is represented as \(G = (V, E)\), where V and E represent the sets of nodes (vertices) and edges of G, respectively. A single node in the node set is represented as \(v_{i} \in V\), and \(e_{ij} = (v_{i},v_{j}) \in E\) denotes an edge between nodes \(v_{i}\) and \(v_{j}\). The adjacency matrix of graph G is represented as A, where \(A \in {\mathbb {R}}^{n \times n}\) and n is the number of nodes in graph G. If \(e_{ij} \in E\), \(A_{ij} = 1\); otherwise \(A_{ij} = 0\). In addition, we use \({{\varvec{X}}}\) and \({{\varvec{E}}}\) to represent the node and edge representations in graph G, where \({{\varvec{X}}} \in {\mathbb {R}}^{n \times m}\) and \({{\varvec{E}}} \in {\mathbb {R}}^{n \times c}\). \({{\varvec{x}}}_i \in {\mathbb {R}}^m\) represents the m-dimensional vector of node \(v_{i}\) and \({{\varvec{e}}}_{ij} \in {\mathbb {R}}^c\) represents the c-dimensional vector of edge \(e_{ij}\) (most recent studies set \(c=1\) to represent a weighting scalar). \({{\varvec{A}}}\) denotes the edge-feature-weighted adjacency matrix.
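The definition above can be made concrete with a small sketch. It builds the adjacency matrix of a toy graph from an edge list, with an optional scalar weight per edge (the \(c=1\) case). The function name `adjacency_matrix`, its parameters and the toy edge list are illustrative assumptions, not part of any surveyed model.

```python
def adjacency_matrix(n, edges, weights=None, undirected=True):
    """Build the n x n adjacency matrix of a graph G = (V, E).

    n       : number of nodes, indexed 0..n-1
    edges   : iterable of (i, j) index pairs, one per edge e_ij
    weights : optional dict {(i, j): scalar}; missing edges default to 1.0
    """
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 1.0 if weights is None else weights.get((i, j), 1.0)
        A[i][j] = w
        if undirected:
            # An undirected edge is stored symmetrically.
            A[j][i] = w
    return A


# Toy graph with n = 4 nodes and three edges (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
A_weighted = adjacency_matrix(4, edges, weights={(0, 1): 0.5})
```

With `weights=None` the result is the binary matrix in which \(A_{ij} = 1\) exactly when an edge exists; passing weights gives the weighted variant.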

2.2 Traditional graph-based algorithms

Before GNNs were broadly used for representing irregular relations, traditional graph-based algorithms were applied to model the non-Euclidean structures in text classification, e.g. Random Walk (Szummer and Jaakkola 2001; Zhou and Li 2005), Graph Matching (Schenker et al. 2004; Silva et al. 2014) and Graph Clustering (Matsuo et al. 2006), which are well summarized in Wu et al. (2021). Those traditional graph-based algorithms share three common limitations. Firstly, most of them focus on capturing graph-level structure information without considering the significance of node and edge features. For example, Random Walk-based approaches (Zhou and Li 2005; Szummer and Jaakkola 2001) mainly use the distance or angle between node vectors to calculate transition probabilities while ignoring the information represented by the node vectors themselves. Secondly, since the traditional graph-based algorithms are only suitable for specific tasks, there is no unified learning framework for addressing various practical tasks. For example, Kaur and Kumar (2018) propose a graph clustering method that requires a domain knowledge-based ontology graph. Lastly, the traditional graph-based methods are comparatively time-inefficient; for instance, Graph Edit Distance-based graph matching methods have exponential time complexity (Silva et al. 2014).

2.3 Foundations of GNN

To tackle the limitations of traditional graph-based algorithms and better represent non-Euclidean relations in practical applications, Graph Neural Networks were proposed by Scarselli et al. (2008). GNNs have a unified graph-based framework and simultaneously model the graph structure, node, and edge representations. This section provides the general mathematical definitions of Graph Neural Networks. The general forward process of a GNN can be summarised as follows:

\[ {{\varvec{H}}}^{(l)} = {\mathcal {F}}\left( {{\varvec{A}}}, {{\varvec{H}}}^{(l-1)}\right) \]

where \({{\varvec{A}}} \in {\mathbb {R}}^{n \times n}\) represents the weighted adjacency matrix and \({{\varvec{H}}}^{(l)} \in {\mathbb {R}}^{n \times d}\) is the updated node representation at the l-th GNN layer, obtained by feeding the \((l-1)\)-th layer node features \({{\varvec{H}}}^{(l-1)} \in {\mathbb {R}}^{n \times k}\) (k is the dimension of the previous layer's node representations) into the pre-defined graph filter \({\mathcal {F}}\).

The most commonly used graph filtering method is defined as follows:

\[ {{\varvec{H}}}^{(l)} = \phi \left( \tilde{{{\varvec{A}}}} {{\varvec{H}}}^{(l-1)} {{\varvec{W}}}\right) \]

where \(\tilde{{{\varvec{A}}}} = {{\varvec{D}}}^{-\frac{1}{2}}{{\varvec{A}}}{{\varvec{D}}}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix. \({{\varvec{A}}} \in {\mathbb {R}}^{n \times n}\) is the adjacency matrix of graph G and \({{\varvec{D}}}\) is the degree matrix of \({{\varvec{A}}}\), where \(D_{ii} = \Sigma _{j}A_{ij}\). \({{\varvec{W}}} \in {\mathbb {R}}^{k \times d}\) is the weight matrix and \(\phi \) is the activation function. Stacking two GNN layers based on the above filter yields a vanilla Graph Convolutional Network (GCN) (Welling and Kipf 2016) framework for text classification:

\[ {{\varvec{Y}}} = \textrm{softmax}\left( \tilde{{{\varvec{A}}}}\, \textrm{ReLU}\left( \tilde{{{\varvec{A}}}} {{\varvec{H}}} {{\varvec{W}}}^0\right) {{\varvec{W}}}^1\right) \]

where \({{\varvec{W}}}^0\) and \({{\varvec{W}}}^1\) represent the weight matrices of the two GCN layers and \({{\varvec{H}}}\) is the input node features. The ReLU function is used as the non-linearity and softmax is used to generate the predicted categories \({{\varvec{Y}}}\). The notation of GNN can be found in Table 1.
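As a minimal sketch of the two-layer GCN described above, the following pure-Python code computes softmax(Ã · ReLU(Ã · H · W0) · W1), with Ã = D^{-1/2} A D^{-1/2}. The toy adjacency matrix, one-hot features and weight values are made-up illustrative numbers, and the helper names are hypothetical. Self-loops are written directly into the toy matrix, which is a common choice but an assumption here; no training step is shown.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D_ii = sum_j A_ij."""
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def relu(X):
    return [[max(0.0, v) for v in row] for row in X]


def softmax_rows(X):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in X:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(A, H, W0, W1):
    """Two-layer GCN forward pass: softmax(A_norm ReLU(A_norm H W0) A_norm-free W1 step)."""
    A_norm = normalize_adjacency(A)
    hidden = relu(matmul(matmul(A_norm, H), W0))          # first layer
    return softmax_rows(matmul(matmul(A_norm, hidden), W1))  # second layer


# Toy example: 3 nodes, one-hot input features, 2 hidden units, 2 classes.
A = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W1 = [[1.0, -1.0], [0.5, 0.5]]
Y = gcn_forward(A, H, W0, W1)
```

Each row of `Y` is a probability distribution over the two classes for one node. In practice the weights would be learned by minimising a classification loss rather than fixed by hand.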

2.4 GNN for text classification

This paper mainly discusses how GNNs are applied in Text Classification tasks. Before we present the specific applications in this area, we first introduce the key concepts of applying GNNs to text classification from a designer’s view. We suppose for addressing a text classification task, we need to design a graph \(G = (V,E)\) . The general procedures include Graph Construction , Initial Node Representation , Edge Representations , and Training Setup .

2.4.1 Graph construction

Some applications come with explicit graph structures, such as constituency or dependency graphs (Tang et al. 2020), knowledge graphs (Ostendorff et al. 2019; Marin et al. 2014) and social networks (Dai et al. 2022), so there is no need to construct a graph structure or to define the corresponding nodes and edges. However, for text classification, the most common graph structures are implicit, which means we need to define a new graph structure for a specific task, such as designing a word-word or word-document co-occurrence graph. In addition, for text classification tasks, the graph structure can be generally classified into two types:

Corpus-level / document-level : Corpus-level graphs are constructed to represent the whole corpus, as in Yao et al. (2019), Liu et al. (2020), Lin et al. (2021) and Wu et al. (2019), while document-level graphs focus on representing the non-Euclidean relations within a single text body, as in Chen et al. (2020), Nikolentzos et al. (2020) and Zhang et al. (2020). Suppose a specific corpus \({\mathcal {C}}\) contains a set of documents (text bodies) \({\mathcal {C}} = \{D_1,D_2,...,D_j\}\) and each \(D_i\) contains a set of tokens \(D_i = \{t_{i_1},t_{i_2},...,t_{i_k}\}\). The vocabulary of \({\mathcal {C}}\) can be represented as \({\mathcal {D}} = \{t_1,t_2,...,t_l\}\), where l is the length of \({\mathcal {D}}\). For the most commonly adopted corpus-level graph \(G_{corpus} = (V_{corpus},E_{corpus})\), a node \(v_i\) in \(V_{corpus}\) follows \(v_i \in {\mathcal {C}} \cup {\mathcal {D}}\) and the edge \(e_{ij} \in E_{corpus}\) is one kind of relation between \(v_i\) and \(v_j\). Regarding the document-level graph \(G_{doc_i} = (V_{doc_i},E_{doc_i})\), a node \(v_{i_j}\) in \(V_{doc_i}\) follows \(v_{i_j} \in D_i\).
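The two graph scales can be illustrated with a short sketch on a toy corpus. The corpus-level node set is the union of document nodes and vocabulary words, and each document-level graph contains only that document's tokens. The edge choices (word-document membership at corpus level, sliding-window co-occurrence at document level), the `window` parameter and the function name `document_graph` are assumptions made for illustration; individual models define their edges differently.

```python
# Hypothetical two-document corpus, tokenised by whitespace.
corpus = [
    "graph neural networks classify text",
    "text classification with neural networks",
]
docs = [d.split() for d in corpus]

# Vocabulary of the corpus: the set of distinct tokens.
vocab = sorted({t for doc in docs for t in doc})

# Corpus-level graph: one node per document plus one node per vocabulary word.
doc_nodes = [f"doc_{i}" for i in range(len(docs))]
corpus_nodes = doc_nodes + vocab
# Assumed edge type: a document node is linked to every word it contains.
corpus_edges = {(f"doc_{i}", t) for i, doc in enumerate(docs) for t in doc}


def document_graph(tokens, window=2):
    """Document-level graph: nodes are the document's own tokens and edges
    link tokens that occur within `window` positions of each other."""
    nodes = sorted(set(tokens))
    edges = set()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window + 1]:
            if u != t:
                edges.add(tuple(sorted((t, u))))
    return nodes, edges


# One small graph per document.
doc_graphs = [document_graph(doc) for doc in docs]
```

The corpus-level graph is a single structure shared by all documents, whereas `doc_graphs` holds one independent graph per document, which is what makes the document-level setting naturally inductive.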

After designing the graph scale for the specific tasks, specifying the graph types is also important to determine the nodes and their relations. For text classification tasks, the commonly used graph construction ways can be summarized into:

Homogeneous / heterogeneous graphs : homogeneous graphs have a single node type and a single edge type, while heterogeneous graphs have various node and edge types. For a graph \(G = (V,E)\), we use \({\mathcal {N}}^v\) and \({\mathcal {N}}^e\) to represent the number of types of V and E. If \({\mathcal {N}}^v = {\mathcal {N}}^e = 1\), G is a homogeneous graph. If \({\mathcal {N}}^v >1 \) or \( {\mathcal {N}}^e > 1\), G is a heterogeneous graph.

Static / dynamic graphs : Static graphs use a graph structure constructed in advance from external or internal information to enhance the initial node representations, such as dependency or constituency graphs (Tang et al. 2020), co-occurrence between word nodes (Zhang et al. 2020), TF-IDF between word and document nodes (Yao et al. 2019; Wu et al. 2019; Lei et al. 2021) and so on. In a dynamic graph, by contrast, the initial representations or the graph topology change during training, without requiring specific domain knowledge or human effort. The feature representations or graph structure can be learned jointly with the downstream task so that they are optimised together. For example, Wang et al. (2020) proposed a novel topic-aware GNN text classification model with dynamically updated edges between topic nodes and other nodes (e.g. document, word). Piao et al. (2021) also designed a dynamic edge-based graph to update the contextual dependencies between nodes. Additionally, Chen et al. (2020) propose a dynamic GNN model that jointly updates the edge and node representations. We provide more details about the above-mentioned models in Sect. 3 and Sect. 4.

Another widely used pair of graph categories is directed versus undirected graphs, depending on whether the edges are bi-directional or not. For text classification, most of the GNN designs follow the unidirectional way. In addition, those graph-type pairs are not mutually exclusive, which means they can be combined.
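The homogeneous/heterogeneous criterion given earlier (count the distinct node types and edge types) can be sketched as a small check. The function name `graph_kind`, the type labels and both toy graphs are hypothetical.

```python
def graph_kind(node_types, edge_types):
    """Classify a graph from its type annotations.

    node_types : dict {node: type label}
    edge_types : dict {(node, node): type label}
    A graph is heterogeneous when it has more than one node type
    or more than one edge type, and homogeneous otherwise.
    """
    n_v = len(set(node_types.values()))
    n_e = len(set(edge_types.values()))
    return "heterogeneous" if n_v > 1 or n_e > 1 else "homogeneous"


# Word-only co-occurrence graph: one node type, one edge type.
word_graph = graph_kind(
    {"text": "word", "graph": "word"},
    {("text", "graph"): "co-occurrence"},
)

# Word-document graph: two node types and two edge types.
word_doc_graph = graph_kind(
    {"text": "word", "graph": "word", "doc_0": "document"},
    {("text", "graph"): "word-word", ("doc_0", "text"): "word-document"},
)
```

Because the categories combine freely, the same graph could additionally be static or dynamic, and directed or undirected.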

2.4.2 Initial node representation

Based on the pre-defined graph structure and specified graph type, selecting the appropriate initial node representations is the key procedure to ensure the proposed graph structure can effectively learn node representations. According to the node entity type, the existing node representation approaches for text classification can be generally summarized into:

Word-level representation : non-contextual word embedding methods such as GloVe (Pennington et al. 2014), Word2vec (Mikolov et al. 2013) and FastText (Bojanowski et al. 2017) are widely adopted by many GNN-based text classification frameworks to represent the node features numerically. However, those embedding methods are restricted to capturing only syntactic similarity and fail to represent the complex semantic relationships between words. They cannot capture the meaning of out-of-vocabulary (OOV) words, and their representations are fixed. Therefore, some recent studies select ELMo (Peters et al. 2018), BERT (Devlin et al. 2018) or GPT (Radford et al. 2018) to obtain contextual word-level node representations. Notably, even though one-hot encoding is the simplest word representation method, many GNN-based text classifiers use one-hot encoding and achieve state-of-the-art performance. A few frameworks use randomly initialised vectors to represent the word-level node features.

Document-level representation : similar to other NLP applications, document-level representations are normally acquired by aggregating the word-level representations via some deep learning framework. For example, some researchers extract the last hidden state of an LSTM or use the [CLS] token from BERT to represent the input text body numerically. Furthermore, TF-IDF-based document vectors are also a commonly used document-level node representation.

Most GNN-based text classification frameworks will compare the performance between different node representation methods to conduct quantitative analysis, as well as provide reasonable justifications for demonstrating the effectiveness of the selected initial node representation based on a defined graph structure.

2.4.3 Edge features

Well-defined edge features can effectively improve the efficiency and performance of graph representation learning by exploiting more explicit and implicit relations between nodes. Based on the predefined graph types, the edge feature types can be divided into structural features and non-structural features . The structural edge features are acquired from explicit relations between nodes, such as dependency or constituency relations between words, word-word adjacency relations, etc. Such relationships between nodes are explicitly defined and widely employed in other NLP applications. However, the more commonly used edge features are non-structural features, which exist implicitly between the nodes and are applied within specific graph-based frameworks. The typical non-structural edge features were first defined by Kim (2014) for GNN-based text classification tasks, including:

PMI (point-wise mutual information) measures the co-occurrence between two words in a sliding window W and is calculated as:

\[ \textrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}, \qquad p(i,j) = \frac{\#W(i,j)}{\#W}, \qquad p(i) = \frac{\#W(i)}{\#W} \]

where \(\#W\) is the number of windows in total, and \(\#W (i)\) and \(\#W (i,j)\) are the numbers of windows containing word i and both words i and j, respectively.

TF-IDF (term frequency-inverse document frequency) is the broadly used weight of the edges between document-level nodes and word-level nodes.

Except for those two widely used implicit edge features, some specific edge weighting methods are proposed to meet the demands of particular graph structures for exploiting more information of input text bodies.
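The two implicit edge weights can be sketched in a few lines of Python. PMI is computed here as log(p(i,j) / (p(i) p(j))), with probabilities estimated as the fraction of sliding windows containing a word or a word pair. TF-IDF is computed in the plain form tf × log(N / df). The source does not fix a particular TF-IDF variant, so that form, along with the function names, the `window_size` parameter and the toy documents, is an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(docs, window_size=2):
    """Word-word edge weights: PMI over fixed-size sliding windows."""
    windows = []
    for tokens in docs:
        if len(tokens) <= window_size:
            windows.append(tokens)  # short document: a single window
        else:
            windows.extend(
                tokens[k : k + window_size]
                for k in range(len(tokens) - window_size + 1)
            )
    n_windows = len(windows)
    single, pair = Counter(), Counter()
    for w in windows:
        uniq = sorted(set(w))                # count each word once per window
        single.update(uniq)
        pair.update(combinations(uniq, 2))   # unordered pairs, keyed as (i, j) with i < j
    # log( (#W(i,j)/#W) / ((#W(i)/#W) * (#W(j)/#W)) )
    return {
        (i, j): math.log(c * n_windows / (single[i] * single[j]))
        for (i, j), c in pair.items()
    }


def tfidf_edges(docs):
    """Document-word edge weights: term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    weights = {}
    for d, tokens in enumerate(docs):
        for t, count in Counter(tokens).items():
            weights[(d, t)] = (count / len(tokens)) * math.log(n_docs / df[t])
    return weights


# Hypothetical tokenised documents.
toy_docs = [["a", "b", "a", "b"], ["c", "d"], ["a", "c"]]
pmi = pmi_edges(toy_docs, window_size=2)
tfidf = tfidf_edges(toy_docs)
```

A positive PMI indicates that two words co-occur more often than chance would predict and a negative value indicates the opposite; a graph builder might keep only some of these edges, which is a design decision left open here.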

2.4.4 Training setup

After specifying the graph structure and types, the graph representation learning tasks and training settings also need to be determined to decide how to optimise the designed GNNs. Generally, the graph representation learning tasks can be categorized into three levels, including Node-level , Graph-level and Edge-level . Node-level and graph-level tasks involve node or graph classification, clustering, regression, etc., while edge-level tasks include link prediction or edge classification for predicting the existence of the relation between two nodes or the corresponding edge categories.

Similar to other deep learning models, GNNs can be trained under supervised , semi-supervised and unsupervised training settings . Supervised training relies on labelled training data, while unsupervised training utilises unlabeled data to train the GNNs. However, compared with supervised or unsupervised learning, semi-supervised learning methods are more broadly used by GNNs designed for text classification, and they can be classified into two types:

Inductive learning adjusts the weights of the proposed GNNs on a labelled training set to learn general statistics, yielding a trained model for subsequent use. The unlabeled set can then be fed into the trained GNNs to compute the expected outputs.

Transductive learning exploits the labelled and unlabeled sets simultaneously, leveraging the relations between different samples to improve the overall performance.

2.4.5 Evolution of GNNs for text classification

TextGCN (Yao et al. 2019 ) and Text-Level-GNN (Huang et al. 2019 ) were the first to frame a text classification task as a node or graph classification task, achieved by constructing graphs based on textual data. Following these works, the field witnessed a proliferation of methodologies, exploring various avenues: (1) advancements in graph learning models, (2) improved graph construction strategies, and (3) integration with state-of-the-art text classification methods like BERT (Devlin et al. 2018 ).

In terms of advancements in graph learning models, SGC (Wu et al. 2019 ) simplifies the Graph Convolutional Network (GCN) architecture, thereby conserving computational resources; S \(^2\) GC (Zhu and Koniusz 2020 ) and NMGC (Lei et al. 2021 ) mitigate over-smoothing by integrating skip-connection mechanisms; TensorGCN (Liu et al. 2020 ), TextGTL (Li et al. 2021 ) and ME-GCN (Wang et al. 2022 ) focus on acquiring enriched edge information; and T-VGAE (Xie et al. 2021 ) employs graph auto-encoders to enhance representation learning. HGAT (Linmei et al. 2019 ), ReGNN (Li et al. 2019 ), HyperGAT (Ding et al. 2020 ), MLGNN (Liao et al. 2021 ) and DADGNN (Liu et al. 2021 ) leverage attention mechanisms for model enhancement. A detailed exposition of these Graph Neural Network (GNN) models can be found in Sects.  3 and 4 .

3 Corpus-level GNN for text classification

We define a corpus-level Graph Neural Network as one “constructing a graph to represent the whole corpus”; thus, only one or several graphs will be built for the given corpus. We categorize corpus-level GNNs into four subcategories based on the types of nodes in the graph.

3.1 Document and word nodes as a graph

Most corpus-level graphs include word nodes and document nodes, connected by word-document edges and word-word edges. By applying a K -layer GNN (normally K = 2 or 3), word nodes serve as a bridge to propagate information from one document node to another.

3.1.1 PMI and TF-IDF as graph edges: TextGCN, SGC, S \(^2\) GC, NMGC, TG-Transformer, BertGCN

TextGCN (Yao et al. 2019 ) builds a corpus-level graph with training document nodes, test document nodes and word nodes. Before constructing the graph, a common preprocessing method (Kim 2014 ) is applied, and words that occur fewer than five times or appear in the NLTK (Bird et al. 2009 ) stopword list are removed. The edge value between a document node and a word node is TF-IDF, and that between two word nodes is PMI. The adjacency matrix of this graph is as follows:

\[A_{ij} = \begin{cases} \text{PMI}(i,j) & i, j \text{ are words and } \text{PMI}(i,j) > 0 \\ \text{TF-IDF}_{ij} & i \text{ is a document and } j \text{ is a word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases}\]

A two-layer GCN is applied to the graph, and the dimension of the second layer output equals the number of classes in the dataset. Formally, the forward propagation of TextGCN is as follows:

\[Z = \text{softmax}\left({\tilde{A}}\,\text{ReLU}\left({\tilde{A}} X W_0\right) W_1\right)\]

where \({\tilde{A}}\) is the normalized adjacency matrix of A and X is the one-hot embedding matrix. \(W_0\) and \(W_1\) are learnable parameters of the model. The representations of the training documents are used to calculate the loss, and those of the test documents are used for prediction. TextGCN is the first work that treats a text classification task as a node classification problem by constructing a corpus-level graph and has inspired many following works.
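The two-layer propagation described above can be sketched in plain Python. This is a minimal illustration, assuming the symmetric normalisation \(D^{-1/2} A D^{-1/2}\) on an adjacency matrix that already contains self-loops; the toy edge weights and the weight matrices `W0` and `W1` are made-up values, not learned parameters.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(a):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} (A is assumed to include self-loops)."""
    deg = [sum(row) for row in a]
    n = len(a)
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(a, x, w0, w1):
    """Two-layer GCN: softmax(A_norm ReLU(A_norm X W0) W1)."""
    a_norm = normalize_adjacency(a)
    hidden = relu(matmul(a_norm, matmul(x, w0)))
    return softmax_rows(matmul(a_norm, matmul(hidden, w1)))


# Toy graph: node 0 is a document, nodes 1 and 2 are words (weights are made up).
A = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot node inputs
W0 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # 3 inputs to 2 hidden units
W1 = [[0.3, -0.1], [0.2, 0.5]]               # 2 hidden units to 2 classes
Z = gcn_forward(A, X, W0, W1)
```

Each row of `Z` is a class distribution for one node; in a transductive setup only the rows belonging to labelled training documents would contribute to the loss.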

Based on TextGCN, several works follow the same graph construction method and node initialization but apply different graph propagation models.

SGC (Wu et al. 2019 ) To make GCN efficient, SGC (Simple Graph Convolution) removes the nonlinear activation function in GCN layers; therefore, the K -layer propagation of SGC is as follows:

\[Z = \text{softmax}\left({\tilde{A}} \cdots {\tilde{A}} X W_0 W_1 \cdots W_{K-1}\right)\]

which can be reparameterized into

\[Z = \text{softmax}\left({\tilde{A}}^K X W\right)\]

and K is 2 when applied to text classification tasks. With a smaller number of parameters and only one feedforward layer, SGC saves computation time and resources while improving performance.
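The effect of removing the nonlinearity can be checked numerically. The sketch below, with made-up matrices and hypothetical names (`sgc_features`, `w0`, `w1`), compares two stacked linear graph layers with K = 2 propagation hops followed by a single collapsed weight matrix; the final softmax is omitted because it is identical in both cases.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sgc_features(a_norm, x, k=2):
    """Pre-compute A_norm^k X; no activation is applied between hops."""
    h = x
    for _ in range(k):
        h = matmul(a_norm, h)
    return h


# Made-up normalized adjacency, inputs and weights for a two-node graph.
a_norm = [[0.5, 0.5], [0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 1.0]]
w0 = [[2.0, 0.0], [0.0, 1.0]]
w1 = [[1.0, 1.0], [0.0, 3.0]]

# Two linear graph layers applied one after the other ...
layered = matmul(matmul(a_norm, matmul(matmul(a_norm, x), w0)), w1)
# ... give the same logits as two hops followed by the single matrix W = w0 w1.
collapsed = matmul(sgc_features(a_norm, x, k=2), matmul(w0, w1))
```

Because `sgc_features` does not depend on any trainable parameter, it can be computed once before training, which is where the savings in computation come from.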

S \(^2\) GC (Zhu and Koniusz 2020 ) To solve the over-smoothing issue in GCN, Zhu and Koniusz ( 2020 ) propose Simple Spectral Graph Convolution (S \(^2\) GC), which includes self-loops using a Markov Diffusion Kernel. The output of S \(^2\) GC is calculated as:

And can be generalized into:

Similarly, K = 2 on text classification tasks and \(\alpha \) denotes the trade-off between self-information of the node and consecutive neighbourhood information. S \(^2\) GC can also be viewed as introducing skip connections into GCN.
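To make the role of \(\alpha \) concrete, the following sketch averages the K propagation hops and mixes each hop with the raw input features. It assumes the aggregation takes the form \(\frac{1}{K}\sum _{k=1}^{K}((1-\alpha ){\tilde{A}}^k X + \alpha X)\) before the linear classifier; this form, the function name `s2gc_features` and the toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def s2gc_features(a_norm, x, k=2, alpha=0.05):
    """Average over k hops of ((1 - alpha) * A_norm^hop X + alpha * X)."""
    n, d = len(x), len(x[0])
    acc = [[0.0] * d for _ in range(n)]
    h = x
    for _ in range(k):
        h = matmul(a_norm, h)  # features after one more hop
        for i in range(n):
            for j in range(d):
                # alpha keeps a share of the node's own input at every hop
                acc[i][j] += ((1 - alpha) * h[i][j] + alpha * x[i][j]) / k
    return acc


# Made-up two-node example.
a_norm = [[0.5, 0.5], [0.5, 0.5]]
x = [[1.0, 0.0], [0.0, 1.0]]

only_neighbours = s2gc_features(a_norm, x, k=2, alpha=0.0)
only_self = s2gc_features(a_norm, x, k=2, alpha=1.0)
```

With `alpha=0.0` the features are fully smoothed over the neighbourhood, while `alpha=1.0` returns the inputs unchanged; intermediate values interpolate between the two, which is the skip-connection view mentioned above.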

NMGC (Lei et al. 2021 ) Instead of summing the outputs of each GCN layer as in S \(^2\) GC, NMGC applies min pooling using the Multi-hop neighbour Information Fusion (MIF) operator to address the over-smoothing problem. A MIF function is defined as:

NMGC-K first applies a MIF ( K ) layer and then a GCN layer, where K is 2 or 3. For example, when K = 3, the output is:

NMGC can also be treated as a skip-connection in Graph Neural Networks, making the shallow layer of GNN directly contribute to the final representation.

TG-Transformer (Zhang and Zhang 2020 ) TextGCN treats the document nodes and word nodes as the same type of node during propagation. To introduce heterogeneity into the TextGCN graph, TG-Transformer (Text Graph Transformer) adopts two sets of weights for document nodes and word nodes, respectively. To cope with a large corpus graph, subgraphs are sampled from the TextGCN graph using the PageRank algorithm (Page et al. 1999 ). The input embedding of each node is the sum of three types of embedding: pretrained GloVe embedding, node type embedding, and Weisfeiler-Lehman structural encoding (Niepert et al. 2016 ). During propagation, self-attention (Vaswani et al. 2017 ) with graph residual (Zhang and Meng 2019 ) is applied.

BertGCN (Lin et al. 2021 ) To combine BERT (Devlin et al. 2018 ) and TextGCN, BertGCN enhances TextGCN by replacing the document node initialization with the BERT [CLS] output of each epoch and replacing the word input vector with zeros. BertGCN trains BERT and TextGCN jointly by interpolating the output of TextGCN and BERT:

\[Z = \lambda Z_{GCN} + (1 - \lambda ) Z_{BERT}\]

where \(\lambda \) is the trade-off factor. To optimize memory use during training, a memory bank is used to track the document inputs, and a smaller learning rate is set for the BERT module to maintain the consistency of the memory bank. BertGCN shows that with the help of TextGCN, BERT can achieve better performance.
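As a small numerical illustration of the trade-off factor, the sketch below mixes two sets of class probabilities as a convex combination weighted by \(\lambda \). The function name `interpolate` and the probability values are hypothetical; in the actual model both inputs would come from trained classifiers.

```python
def interpolate(z_gcn, z_bert, lam):
    """Convex combination lam * z_gcn + (1 - lam) * z_bert, row by row."""
    return [[lam * g + (1 - lam) * b for g, b in zip(row_g, row_b)]
            for row_g, row_b in zip(z_gcn, z_bert)]


# Made-up class probabilities for two documents and two classes.
z_gcn = [[0.8, 0.2], [0.4, 0.6]]
z_bert = [[0.6, 0.4], [0.1, 0.9]]

z = interpolate(z_gcn, z_bert, lam=0.7)
gcn_only = interpolate(z_gcn, z_bert, lam=1.0)
```

Setting `lam=1.0` recovers the graph branch alone and `lam=0.0` the BERT branch alone, so the factor directly controls how much each module contributes to the final prediction.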

3.1.2 Multi-graphs/multi-dimensional edges: TensorGCN, ME-GCN

TensorGCN (Liu et al. 2020 ) Instead of constructing a single corpus-level graph, TensorGCN builds three independent graphs, a semantic-based graph, a syntactic-based graph and a sequential-based graph, to incorporate semantic, syntactic and sequential information, respectively, and combines them into a tensor graph.

The three graphs share the same set of TF-IDF values for the word-document edges but use different values for the word-word edges. The semantic-based graph extracts semantic features from a trained Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997 ) model and connects words sharing high similarity. The syntactic-based graph uses the Stanford CoreNLP parser (Manning et al. 2014 ) and constructs edges between words that have a larger probability of being in a dependency relation. For the sequential-based graph, the PMI value is applied as in TextGCN.

The propagation includes intra-graph propagation and inter-graph propagation. The model first applies the GCN layer on three graphs separately as intra-graph propagation. Then, the same nodes on three graphs are treated as a virtual graph, and another GCN layer is applied as inter-graph propagation.

ME-GCN (Wang et al. 2022 ) To fully utilize the corpus information and analyze rich relational information of the graph, ME-GCN (Multi-dimensional Edge-Embedded GCN) builds a graph with multi-dimensional word-word, word-document and document-document edges. Word2vec and Doc2vec embeddings are first trained on the given corpus, and the similarity along each dimension of the trained embeddings is used to construct the multi-dimensional edges. The trained embeddings also serve as the input embeddings of the graph nodes. During propagation, GCN is first applied on each dimension, and the representations of the different dimensions are either concatenated or fed into a pooling method to get the final representations of each layer.

3.1.3 Making TextGCN inductive: HeteGCN, InducT-GCN, T-VGAE

HeteGCN (Ragesh et al. 2021 ) HeteGCN (Heterogeneous GCN) optimizes the TextGCN by decomposing the TextGCN undirected graph into several directed subgraphs. Several subgraphs from the TextGCN graph are combined sequentially as different layers: feature graph (word-word graph), feature-document graph (word-document graph), and document-feature graph (document-word graph). Different combinations were tested and the best model is shown as:

where \(\varvec{A}_{w-w}\) and \(\varvec{A}_{w-d}\) show the adjacency matrix for the word-word subgraph and word-document subgraph. Since the input of HeteGCN is the word node embeddings without using document nodes, it can also work in an inductive way while the previous corpus-level graph text classification models are all transductive models.

InducT-GCN (Wang et al. 2022 ) InducT-GCN (InducTive Text GCN) aims to extend the transductive TextGCN into an inductive model. Instead of using the whole corpus to build the graph, InducT-GCN builds a training corpus graph and makes the input embedding of the document the TF-IDF vectors, aligning with the one-hot word embeddings. The weights are learned following TextGCN but InducT-GCN builds virtual subgraphs for prediction on new test documents.

T-VGAE (Xie et al. 2021 ) T-VGAE (Topic Variational Graph Auto-Encoder) applies a Variational Graph Auto-Encoder on the latent topics of each document to make the model inductive. A vocabulary graph \(A_v\) , which connects the words using PMI values, is constructed, while each document is represented by its TF-IDF vector. All the document vectors are stacked into a matrix, which can also be treated as a bipartite graph \(A_d\) . Two graph auto-encoder models are applied on \(A_v\) and \(A_d\) , respectively. The overall workflow is as follows:

where \(X^v\) is an Identity Matrix. The \(\text {Encoder}_{GCN}\) and the decoders are applied following VGAE (Kipf and Welling 2016 ) while \(\text {Encoder}_{UDMP}\) is a unidirectional message passing variant of \(\text {Encoder}_{GCN}\) . The training objective is to minimise the reconstruction error, and \(Z_d\) is used for the classification task.

3.2 Document nodes as a graph

To show the global structure of the corpus directly, some models only adopt document nodes in the non-heterogeneous graph.

knn-GCN (Benamira et al. 2019 ) knn-GCN constructs a k-nearest-neighbours graph by connecting the documents with their K nearest neighbours using Euclidean distances of the embedding of each document. The embedding is generated in an unsupervised way: either using the mean of pretrained GloVe word vectors or applying LDA (Blei et al. 2003 ). Both GCN and Attention-based GNN (Thekumparampil et al. 2018 ) are used as the graph model.
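The graph construction step of such a document-only graph can be sketched as follows. This is a minimal illustration that assumes document embeddings are already available as plain lists of floats; the function name `knn_graph`, the toy embeddings and the choice to symmetrise the resulting adjacency matrix are assumptions of this example.

```python
import math


def knn_graph(embeddings, k):
    """Return a symmetric 0/1 adjacency matrix linking each row to its k nearest rows."""
    n = len(embeddings)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from document i to every other document.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # assumption: treat the neighbour relation as undirected
    return adj


# Hypothetical 2-d document embeddings forming two tight pairs.
doc_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
adjacency = knn_graph(doc_embeddings, k=1)
```

With `k=1`, documents 0 and 1 are linked to each other, as are documents 2 and 3, and no edge crosses between the two pairs.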

TextGTL (Li et al. 2021 ) Similar to TensorGCN, TextGTL (Text-oriented Graph-based Transductive Learning) constructs three different document graphs: Semantics Text Graph, Syntax Text Graph, and Context Text Graph, while all the graphs are non-heterogeneous. Semantics Text Graph uses Generalized Canonical Correlation Analysis (Bach and Jordan 2002 ) and trains a classifier to determine the edge values between two document nodes. Syntax Text Graph uses the Stanford CoreNLP dependency parser (Manning et al. 2014 ) to construct units and also trains a classifier. Context Text Graph defines the edge values by summing up the PMI values of the overlapping words in two documents. Two GCN layers are applied, and the output of each graph is mixed as the output of this layer and input for the next layer for all three graphs:

where \(H^{ (0)}\) is the TF-IDF vector of the documents. Data augmentation with super nodes is also applied in TextGTL to strengthen the information in graph models.

3.3 Word nodes as a graph

By neglecting the document nodes in the graph, a graph with only word nodes shows good performance in deriving the graph-based embedding and is used for downstream tasks. Since no document nodes are included, this method can be easily adapted as an inductive learning model.

VGCN-BERT (Lu et al. 2020 ) VGCN-BERT enhances the input embedding of BERT by concatenating it with the graph embedding. It first constructs a vocabulary graph and uses PMI as the edge value. A variant of the GCN layer called VGCN (Vocabulary GCN) is applied to derive the graph word embedding:

where the BERT embedding is used as the input. The graph word embeddings are concatenated with the BERT embeddings and fed into BERT as extra information.

3.4 Extra topic nodes in the graph

Topic information of each document can also provide extra information in corpus-level graph neural networks. Several models also include topic nodes in the graph.

3.4.1 Single layer topic nodes: HGAT, STGCN

HGAT (Linmei et al. 2019 ) HGAT (Heterogeneous GAT) applies LDA (Blei et al. 2003 ) to extract topic information for each document; the top P topics with the largest probabilities are connected to the document. Instead of using the words directly, HGAT utilizes external knowledge by applying the entity linking tool TAGME to identify the entities in the document and connect them. The semantic similarity between entities, computed with pretrained Word2vec and a threshold, defines the connectedness between entity nodes. Since the graph is heterogeneous, a HIN (heterogeneous information network) model is implemented, which propagates separately on each sub-graph depending on the type of node. An HGAT model is then applied with type-level attention and node-level attention. For a given node, the type-level attention learns the weights of different types of neighbouring nodes, while the node-level attention captures the importance of different neighbouring nodes regardless of type. By using this dual attention mechanism, HGAT can capture the information of type and node at the same time.

STGCN (Yan et al. 2013 ) In terms of short text classification, STGCN (Short-Text GCN) applies BTM to get topic information to avoid the data sparsity problem from LDA. The graph is constructed following TextGCN while extra topic nodes are included. Word-topic and document-topic edge values are from BTM, and a classical two-layer GCN is applied. The word embeddings learned from STGCN are concatenated with BERT embeddings and a bi-LSTM model is applied for final prediction.

3.4.2 Multi-layer topic nodes: DHTG

DHTG (Wang et al. 2020 ) To capture different levels of information, DHTG (Dynamic Hierarchical Topic Graph) introduces hierarchical topic-level nodes in the graph from fine-grain to coarse. Poisson gamma belief network (PGBN) (Zhou et al. 2015 ) is used as a probabilistic deep topic model. The first-layer topics are from the combination of words, while deeper layers are generated by previous layers’ topics with the weights of PGBN, and the weights serve as the edge values of each layer of topics. The cosine similarity is chosen as the edge value for the topics on the same layer. A two-layer GCN is applied, and the model is learned jointly with PGBN, which makes the edge of the topics dynamic.

3.5 Critical analysis

Compared with sequential models like CNN and LSTM, corpus-level GNNs are able to capture global corpus structure information, with word nodes acting as bridges between document nodes, and show strong performance without using external resources such as pre-trained embeddings or pre-trained models. However, the improvement in performance is marginal when pretrained embeddings are included. Another issue is that most corpus-level GNNs are transductive, which limits their applicability in real-world settings where unseen documents must be classified. Meanwhile, constructing the whole corpus into a graph requires large memory space, especially when the dataset is large.

A detailed comparison of corpus-level GNN is displayed in Table 2 .

4 Document-level GNN for text classification

By constructing the graph based on each document, a graph classification model can be used as a text classification model. Since each document is represented by one graph and new graphs can be built for test documents, the model can easily work in an inductive way.

4.1 Local word consecutive graph

The simplest way to convert a document into a graph with words as nodes is by connecting the consecutive words within a sliding window.
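A minimal sketch of this construction is given below. It treats each unique word as a node and links a word to the words that follow it within a fixed number of positions; the function name `window_graph`, the parameter `span` and the example sentence are assumptions for illustration, and individual models define the window slightly differently.

```python
def window_graph(tokens, span=1):
    """Nodes are unique words; an edge links two words at most `span` positions apart."""
    nodes = sorted(set(tokens))
    edges = set()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + span + 1, len(tokens))):
            if tokens[j] != word:  # skip self-loops for repeated words
                # Store each undirected edge once, in alphabetical order.
                edges.add(tuple(sorted((word, tokens[j]))))
    return nodes, edges


tokens = ["the", "cat", "sat", "on", "the", "mat"]
nodes, edges = window_graph(tokens, span=1)
```

Because repeated words share a node, "the" ends up connected to "cat", "on" and "mat", so even a small window links distant parts of the sentence through shared vocabulary.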

4.1.1 Simple consecutive graph models: Text-Level-GNN, MPAD, TextING

Text-Level-GNN (Huang et al. 2019 ) Text-Level-GNN applies a small sliding window and constructs the graph with a small number of nodes and edges in each graph, which saves memory and computation time. The edge value is trainable and shared across the graphs when connecting the same two words, which also brings global information.

Unlike corpus-level graph models, Text-Level-GNN applies a message passing mechanism (MPM) (Gilmer et al. 2017 ) instead of GCN for graph learning. For each node, the neighbour information is aggregated using max-pooling with trainable edge values as the AGGREGATE function, and then the weighted sum is used as the COMBINE function. Sum-pooling and an MLP classifier are applied as the READOUT function to get the representation of each graph. The propagation is as follows:

where \(\varvec{h}^{ (l)}_i\) is the representation of the i th word node at layer l , and \(e_{ni}\) is the edge weight from node n to node i . A two-layer MPM is applied, and the input of each graph is pretrained GloVe vectors.
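The AGGREGATE, COMBINE and READOUT steps described above can be sketched in a few lines. The example assumes that COMBINE is a convex mix of the aggregated message and the node's own vector controlled by a per-node scalar `eta`, which is how this sketch interprets the "weighted sum"; the names, the toy vectors and the fixed edge weights are hypothetical, and the MLP classifier after the readout is left out.

```python
def message_passing_step(h, neighbours, edge_weight, eta):
    """One layer: element-wise max over weighted neighbours, then mix with the node itself."""
    new_h = {}
    for node, vec in h.items():
        # Scale each neighbour vector by the weight of the edge pointing at `node`.
        msgs = [[edge_weight[(nb, node)] * v for v in h[nb]] for nb in neighbours[node]]
        if msgs:
            agg = [max(col) for col in zip(*msgs)]  # AGGREGATE: max-pooling
        else:
            agg = [0.0] * len(vec)
        # COMBINE: weighted sum of the aggregated message and the previous state.
        new_h[node] = [(1 - eta[node]) * a + eta[node] * v for a, v in zip(agg, vec)]
    return new_h


def readout(h):
    """READOUT before the classifier: sum-pool all node vectors."""
    return [sum(col) for col in zip(*h.values())]


# Toy three-word graph a - b - c with made-up 2-d word vectors.
h0 = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 1.0]}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
edge_weight = {("b", "a"): 1.0, ("a", "b"): 1.0, ("c", "b"): 1.0, ("b", "c"): 1.0}
eta = {"a": 0.5, "b": 0.5, "c": 0.5}

h1 = message_passing_step(h0, neighbours, edge_weight, eta)
graph_vector = readout(h1)
```

In a trained model, both the edge weights and the mixing scalars would be learned, and edge weights would be shared by every graph in which the same word pair appears.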

MPAD (Nikolentzos et al. 2020 ) MPAD (Message Passing Attention Networks) connects words within a sliding window of size 2 but also includes an additional master node connecting all nodes in the graph. The edge only shows the connectedness of each pair of word nodes and is fixed. A variant of Gated Graph Neural Networks is applied where the AGGREGATE function is the weighted sum and the COMBINE function is GRU (Chung et al. 2014 ). Self-attention is applied in the READOUT function.

To learn the high-level information, the master node is directly concatenated with the READOUT output, working as a skip connection mechanism. Each layer’s READOUT results are concatenated to capture multi-granularity information to get the final representation. Pretrained Word2vec is used as the initialization of word nodes input.

TextING (Zhang et al. 2020 ) To simplify MPAD, TextING ignores the master node in the document-level graphs, which makes the graph sparser. Compared with Text-Level-GNN, TextING has fixed edges. Similar AGGREGATE and COMBINE functions are applied under the concept of Gated Graph Neural Networks (GGNN) (Li et al. 2016 ), with the weighted sum and GRU. However, for the READOUT function, soft attention is used and both max-pooling and mean-pooling are applied to make sure that "every word plays a role in the text and the keywords should contribute more explicitly".

4.1.2 Advanced graph models: MLGNN, TextSSL, DADGNN

MLGNN (Liao et al. 2021 ) MLGNN (Multi-level GNN) builds the same graph as TextING but introduces three levels of MPM: bottom-level, middle-level and top-level. In the bottom-level MPM, the same method as Text-Level-GNN is applied with pretrained Word2vec as the input embedding, but the edges are non-trainable. A larger window size is adopted in the middle level, and Graph Attention Networks (GAT) (Veličković et al. 2018 ) are applied to learn distant word node information. In the top-level MPM, all word nodes are connected, and multi-head self-attention (Vaswani et al. 2017 ) is applied. By applying three different levels of MPM, MLGNN learns multi-granularity information well.

DADGNN (Liu et al. 2021 ) DADGNN (Deep Attention Diffusion GNN) constructs the same graph as TextING but uses attention diffusion to overcome the over-smoothing issue. Pretrained word embedding is used as the input of each node and an MLP layer is applied. Then, the graph attention matrix is calculated based on the attention to the hidden states of each node. The diffusion matrix is calculated as

where A is the graph attention matrix and \(\epsilon \) denotes the learnable coefficients. \(A^n\) plays the role of connecting n -hop neighbours, and Liu et al. ( 2021 ) use \(n \in [4,7]\) in practice. A multi-head diffusion matrix is applied for layer propagation.
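How powers of the attention matrix connect multi-hop neighbours can be seen in a small numerical sketch. It assumes the diffusion matrix is a weighted sum of powers of A starting from the identity and truncated after a few terms; the fixed coefficients, the function name `diffusion_matrix` and the toy attention matrix are assumptions, whereas in the model the coefficients are learned.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diffusion_matrix(attn, coeffs):
    """Truncated sum over n of coeffs[n] * attn^n, with attn^0 the identity."""
    size = len(attn)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    total = [[0.0] * size for _ in range(size)]
    for c in coeffs:
        for i in range(size):
            for j in range(size):
                total[i][j] += c * power[i][j]
        power = matmul(power, attn)  # move on to the next hop
    return total


# Made-up row-stochastic attention over a path graph 0 - 1 - 2.
attn = [[0.0, 1.0, 0.0],
        [0.5, 0.0, 0.5],
        [0.0, 1.0, 0.0]]
coeffs = [0.5, 0.25, 0.25]  # hypothetical weights for hops 0, 1 and 2

T = diffusion_matrix(attn, coeffs)
```

Nodes 0 and 2 have no direct attention in `attn`, yet `T[0][2]` is non-zero because the squared term routes information through node 1, which is the multi-hop effect the diffusion is meant to provide.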

TextSSL (Piao et al. 2021 ) To solve the word ambiguity problem and show the word synonymity and dynamic contextual dependency, TextSSL (Sparse Structure Learning) simultaneously learns the graph using intra-sentence and inter-sentence neighbours. The local syntactic neighbour is defined as the consecutive words, and trainable edges across graphs are also included by using Gumbel-softmax. By applying sparse structure learning, TextSSL manages to select edges with dynamic contextual dependencies.

4.2 Global word co-occurrence graph

Similar to the TextGCN graph, document-level graphs can also use PMI as the word-word edge values.

4.2.1 Only global word co-occurrence: DAGNN

DAGNN (Wu et al. 2019 ) To address the long-distance dependency, hierarchical information and cross-domain learning challenges in domain-adversarial text classification tasks, Wu et al. ( 2019 ) propose DAGNN (Domain-Adversarial Graph Neural Network). Each document is represented by a graph with content words as nodes and PMI values as edge values, which can capture long-distance dependency information. Pretrained FastText is chosen as the input word embedding to handle the out-of-vocabulary issue, and a GCN model with skip connections is used to address the over-smoothing problem. The propagation is formulated as:

To learn the hierarchical information of documents, DiffPool (Ying et al. 2018 ) is applied to assign each document into a set of clusters. Finally, adversarial training minimises the loss on source tasks and maximises the differentiation between source and target tasks.

4.2.2 Combine with extra edges: ReGNN, GFN

ReGNN (Li et al. 2019 ) ReGNN (Recursive Graphical Neural Network) uses PMI together with consecutive words as the word edges to capture global and local information. The graph propagation function is the same as GGNN while additive attention (Bahdanau et al. 2015 ) is applied in aggregation. Pretrained GloVe is the input embedding of each word node.

GFN (Dai et al. 2022 ) GFN (Graph Fusion Network) builds four types of graphs using the word co-occurrence statistics, PMI, the similarity of pretrained embedding and Euclidean distance of pretrained embedding. Although four corpus-level graphs are built, the graph learning happens on each document’s subgraphs, making the method a document-level GNN. For each subgraph, each type of graph is learned separately using the graph convolutional method, and then a fusion method of concatenation is used. After an MLP layer, average pooling is applied to get the document representation.

4.3 Other word graphs

Some other ways of connecting words in a document have been explored.

HyperGAT (Ding et al. 2020 ) Ding et al. ( 2020 ) propose HyperGAT (Hypergraph Attention Networks), which builds a hypergraph for each document to capture high-level interactions between words. Two types of hyperedges are included: sequential hyperedges connecting all words in a sentence and semantic hyperedges connecting the top-K words after getting the topic of each word using LDA. Like traditional hypergraph propagation, HyperGAT follows the same two updating steps but with an attention mechanism to highlight the key information: node-level attention is applied to learn hyperedge representations, and edge-level attention is used to update node representations.

IGCN (Tang et al. 2020 ) Contextual dependency helps in understanding a document, and the graph neural network is no exception. IGCN constructs the graph with the dependency graph to show the connectedness of each pair of words in a document. Then, the word representation learned from Bi-LSTM using POS embedding and word embedding is used to calculate the similarity between each pair of nodes. Attention is used for the output to find the important relevant semantic features.

GTNT (Mei et al. 2021 ) Words with higher TF-IDF values should connect to more word nodes. With this in mind, GTNT (Graph Transformer Networks based Text representation) uses sorted TF-IDF values to determine the degree of each node and applies the Havel-Hakimi algorithm (Hakimi 1962 ) to determine the edges between word nodes. A variant of GAT is applied during model learning. Although GAT’s attention score is mutual for two nodes, GTNT uses relevant importance to adjust the attention score from one node to another. Pretrained Word2vec is applied as the input of each node.
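For readers unfamiliar with the Havel-Hakimi procedure, the sketch below builds a simple graph from a target degree sequence by repeatedly connecting the node with the largest remaining degree to the nodes with the next-largest remaining degrees. It is a generic illustration of the algorithm, not the GTNT implementation: how TF-IDF ranks are mapped to target degrees is not shown, and the function name and example degrees are hypothetical.

```python
def havel_hakimi(degrees):
    """Return an edge list realising the degree sequence, or None if it is not graphical."""
    remaining = {node: d for node, d in enumerate(degrees)}
    edges = []
    while True:
        # Pick the node with the largest remaining degree (ties go to the lowest index).
        node = max(remaining, key=lambda n: (remaining[n], -n))
        need = remaining[node]
        if need == 0:
            return edges  # every node has reached its target degree
        remaining[node] = 0
        # Candidate partners, highest remaining degree first.
        candidates = sorted(
            (n for n in remaining if n != node and remaining[n] > 0),
            key=lambda n: (-remaining[n], n),
        )
        if need > len(candidates):
            return None  # the sequence cannot be realised as a simple graph
        for other in candidates[:need]:
            remaining[other] -= 1
            edges.append((node, other))


# Hypothetical target degrees for four word nodes, e.g. derived from TF-IDF ranks.
word_edges = havel_hakimi([3, 2, 2, 1])
impossible = havel_hakimi([3, 1])
```

Here node 0, which has the largest target degree, is connected to all other nodes first, mirroring the idea that the most informative words should have the most connections.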

4.4 Critical analysis

Most document-level GNNs connect consecutive words as edges in the graph and apply a graph neural network model, which makes them similar to CNN, where the receptive field enlarges when graph models go deeper. Also, the major differences among document-level GNNs are the details of graph models, e.g. different pooling methods and different attention calculations, which diminishes the impact of the contribution of these works. Compared with corpus-level GNN, document-level GNN adopts more complex graph models and also suffers from the out-of-memory issue when the number of words in a document is large. A detailed comparison of document-level GNN is displayed in Table  2 .

4.5 Comparison between corpus-level and document-level GNN

The comparison of the framework between Corpus-level and Document-level GNN’s learning is shown in Fig.  2 . A comprehensive comparison between corpus-level GNN and document-level GNN can be found in Table 3 .

Fig. 2 Corpus-level GNN usually builds a single graph per corpus and learns the node representation, while document-level GNN usually builds one graph per document and learns the graph representation

5 Datasets and metrics

5.1 Datasets

There are many popular text classification benchmark datasets, but this paper mainly focuses on the datasets used by GNN-based text classification applications. Based on the purpose of the applications, we divide the commonly adopted datasets into three types: Topic Classification , Sentiment Analysis and Other . Most of these text classification datasets contain a single target label for each text body. The key information of each dataset is listed in Table  4 .

5.1.1 Topic classification

Topic classification models aim to classify input text bodies from diverse sources into predefined categories. News categorization is a typical topic classification task to obtain key information from news and classify them into corresponding topics. The input text bodies normally are paragraphs or whole documents especially for news categorization, while there are still some short text classification datasets from certain domains such as micro-blogs, bibliography, etc. Some typical datasets are listed:

Ohsumed (Joachims 1998 ) is acquired from the MEDLINE database and further processed by Yao et al. ( 2019 ) via selecting certain documents (abstracts) and filtering out the documents belonging to multiple categories. Those documents are classified into 23 cardiovascular diseases. The statistics of the Ohsumed dataset as processed by Yao et al. ( 2019 ) are presented in Table  4 , and this version is directly employed by other related works.

R8 / R52 are two subsets of the Reuters-21578 dataset which contain 8 and 52 news topics from the Reuters financial news services, respectively.

20NG is another widely used news categorization dataset that contains 20 newsgroups. It was originally collected by Lang ( 1995 ), but the collection procedures are not explicitly described.

AG News (Zhang et al. 2015 ) is a large-scale news categorization dataset compared with other commonly used datasets, which is constructed by selecting the four largest categories from the AG corpus. Each news topic contains 30,000 samples for training and 1,900 samples for testing.

Database systems and logic programming (DBLP) is a topic classification dataset for classifying computer science paper titles into six different topics (Mei et al. 2021 ). Unlike paragraph- or document-based topic classification datasets, DBLP categorises scientific paper titles, so the average input length is much shorter than in the other datasets.

DBpedia (Lehmann et al. 2015 ) is a large-scale multilingual knowledge base that contains 14 non-overlapping categories. Each category contains 40,000 samples for training and 5,000 samples for testing.

WebKB (Craven et al. 1998 ) is a long corpus web page topic classification dataset.

TREC (Li and Roth 2002 ) is a question topic classification dataset to categorise one question sentence into 6 question categories.

5.1.2 Sentiment analysis

The purpose of sentiment analysis is to analyse and mine the opinion of the textual content which could be treated as a binary or multi-class classification problem. The sources of existing sentiment analysis tasks come from movie reviews, product reviews or user comments, social media posts, etc. Most sentiment analysis datasets aim to predict people’s opinions of one or two input sentences, of which the average length of each input text body is around 25 tokens.

Movie review (MR) (Pang and Lee 2005 ) is a binary sentiment classification dataset for movie reviews, which contains positive and negative data equally distributed. Each review only contains one sentence.

Stanford sentiment treebank (SST) (Socher et al. 2013 ) is an upgraded version of MR which contains two subsets SST-1 and SST-2. SST-1 provides five fine-grained labels, while SST-2 is a binary sentiment classification dataset.

Internet movie database (IMDB) (Maas et al. 2011 ) is also an equally distributed binary classification dataset for sentiment analysis. Different from other short text classification datasets, the average number of words in each review is around 221.

Yelp 2014 (Tang et al. 2015 ) is a large-scale binary category-based sentiment analysis dataset for longer user reviews collected from Yelp.com.

GNN-based text classifiers also use certain binary sentiment classification benchmark datasets. Most of them are gathered from shorter user reviews or comments (normally one or two sentences) from different websites including Amazon Alexa Reviews ( AAR ), Twitter US Airline ( TUA ), Youtube comments ( SenTube-A and SenTube-T ) (Uryupina et al. 2014 ).

5.1.3 Other datasets

There are some datasets targeting other tasks, including hate detection, grammaticality checking, etc. For example, ArangoHate (Arango et al. 2019) is a hate detection dataset, a sub-task of intent detection, which contains 2920 hateful documents and 4086 normal documents obtained by resampling the merged datasets from Davidson et al. (2017) and Waseem (2016). In addition, Founta et al. (2018) propose another large-scale hate language detection dataset, namely FountaHate, which classifies tweets into four categories, with 53,851, 14,030, 27,150, and 4,965 samples of normal, spam, hateful and abusive, respectively. Since there is no officially provided training and testing split ratio for the above datasets, the numbers presented in Table 4 follow the ratios (train/development/test is 85:5:10) defined by Lu et al. (2020).
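As a rough illustration of the 85:5:10 ratio, the following Python sketch computes split sizes for ArangoHate from its 2920 + 4086 documents. The function name `split_sizes` and the rounding rule (floor for the training and development sets, remainder to the test set) are assumptions made for this example; the counts reported in Table 4 may have been obtained with a different rounding.

```python
def split_sizes(total, ratios=(85, 5, 10)):
    """Return (train, dev, test) sizes for integer ratios.

    Assumption: train and dev are rounded down and the test set
    receives whatever remains, so the three sizes always sum to total.
    """
    ratio_sum = sum(ratios)
    train = total * ratios[0] // ratio_sum
    dev = total * ratios[1] // ratio_sum
    test = total - train - dev
    return train, dev, test


# ArangoHate: 2920 hateful + 4086 normal documents.
arango_total = 2920 + 4086
train_size, dev_size, test_size = split_sizes(arango_total)
print(train_size, dev_size, test_size)
```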

5.1.4 Dataset summary

Since an obvious limitation of corpus-level GNN models is high memory consumption (Zhang and Zhang 2020; Huang et al. 2019; Ding et al. 2020), datasets with a smaller number of documents and smaller vocabulary sizes, such as Ohsumed, R8/R52, 20NG or MR, are widely used so that corpus-level graphs can be feasibly built and evaluated. For document-level GNN-based models, some larger datasets like AG-News can be adopted without considering the memory consumption problem. From Table 4, we can see that most of the related works focus on GNNs applied to topic classification and sentiment analysis, which means the role of GNNs in other text classification tasks, such as spam detection, intent detection and abstractive question answering, needs to be further explored.

5.2 Evaluation methods

5.2.1 Performance metrics

In evaluating and comparing the performance of the proposed models with other baselines, accuracy and F1 are the most commonly used metrics to conduct overall performance analysis, ablation studies, and breakdown analysis. We use TP , FP , TN and FN to represent the number of true positive, false positive, true negative and false negative samples. N is the total number of samples.

Accuracy and error rate are basic evaluation metrics adopted by many GNN-based text classifiers such as Li et al. (2021); Liu et al. (2016); Wang et al. (2020); Yao et al. (2019); Zhang and Zhang (2020). Most of the related papers run all baselines and their own models five or ten times and report the mean ± standard deviation of accuracy to give more convincing results. The two metrics can be defined as:

\[ Accuracy = \frac{TP + TN}{N}, \qquad ErrorRate = 1 - Accuracy = \frac{FP + FN}{N} \]
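The reporting convention of repeated runs can be sketched in Python as follows. The accuracy values are made up for illustration, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the surveyed papers do not always state which estimator they report.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def summarise_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies from five runs with different random seeds.
run_accuracies = [0.86, 0.88, 0.87, 0.85, 0.89]
run_mean, run_std = summarise_runs(run_accuracies)
print(f"{run_mean:.4f} ± {run_std:.4f}")
```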

Precision, recall and F1 are metrics for measuring performance, especially on imbalanced datasets. Precision measures the relevancy of the results, while recall measures how many of the truly relevant results are retrieved. F1 is the harmonic mean of precision and recall. These three measurements can be defined as:

\[ Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \]

A few papers utilise only recall or precision to evaluate performance (Mei et al. 2021). However, precision and recall are more commonly used together with F1 or accuracy to evaluate and analyse performance from different perspectives, e.g. Li et al. (2019); Linmei et al. (2019); Lu et al. (2020); Xie et al. (2021). In addition, depending on the application scenario, these papers adopt different F1 averaging methods to measure the overall F1 score of multi-class classification tasks (with C classes), including:

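Macro-F1 applies the same weight to all categories and obtains the overall \(F1_{macro}\) by taking the arithmetic mean of the per-class scores, i.e. \(F1_{macro} = \frac{1}{C}\sum_{i=1}^{C} F1_i\).

Micro-F1 is calculated from the overall \(P_{micro}\) and \(R_{micro}\), which pool the counts of all classes: \(P_{micro} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}\) and \(R_{micro} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FN_i)}\). It can be defined as:

\[ F1_{micro} = \frac{2 \times P_{micro} \times R_{micro}}{P_{micro} + R_{micro}} \]

Weighted-F1 is the weighted mean of the F1 of each category, where the weight \(W_i\) is related to the number of occurrences of the corresponding i th class. It can be defined as:

\[ F1_{weighted} = \sum_{i=1}^{C} W_i \times F1_i \]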
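The three averaging schemes can be compared with a small, dependency-free Python sketch. The helper names, the toy labels, the convention of returning 0 when a denominator is zero, and the choice of \(W_i\) as the fraction of gold samples in class i are assumptions made for this example.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, labels):
    """Count (TP, FP, FN) for each class in a one-vs-rest fashion."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the gold label but was missed
    return {c: (tp[c], fp[c], fn[c]) for c in labels}


def prf(tp, fp, fn):
    """Precision, recall and F1; 0.0 is returned for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_averages(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for a multi-class task."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, labels)
    f1_per_class = {c: prf(*counts[c])[2] for c in labels}

    macro = sum(f1_per_class.values()) / len(labels)

    # Micro: pool the counts of all classes before computing P, R and F1.
    pooled = [sum(v[k] for v in counts.values()) for k in range(3)]
    micro = prf(*pooled)[2]

    # Weighted: W_i is the share of gold samples that belong to class i.
    support = Counter(y_true)
    weighted = sum(support[c] / len(y_true) * f1_per_class[c] for c in labels)
    return macro, micro, weighted


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
macro_f1, micro_f1, weighted_f1 = f1_averages(y_true, y_pred)
print(round(macro_f1, 4), round(micro_f1, 4), round(weighted_f1, 4))
```

On single-label data such as this toy example, Micro-F1 coincides with accuracy, which is one reason Macro-F1 is often preferred when minority classes matter.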

5.2.2 Other evaluation aspects

Since two limitations of GNN-based models are time and memory consumption, many related studies go beyond the commonly used performance comparison and also report the GPU or CPU memory consumption and the training time of the proposed models to demonstrate their practicality in real-world applications. In addition, depending on the novelty of each model, specific evaluation methods are used to demonstrate the proposed contributions.

Memory consumption (Ding et al. 2020; Huang et al. 2019; Liu et al. 2021) is reported for different models to evaluate the computational efficiency of the proposed models more comprehensively.

Time measurement (Ragesh et al. 2021; Pasa et al. 2021) compares the training time of the proposed models and the baselines on different benchmarks. Given the doubts about the efficiency of applying GNNs to text classification, it is an effective way to demonstrate that a model can balance performance and time efficiency.

Parameter sensitivity is commonly studied in GNN papers to investigate the effect of different hyperparameters of the proposed models, e.g. varying sliding window sizes or embedding dimensions, with the sensitivity usually shown in line charts, such as in Linmei et al. (2019); Ding et al. (2020); Liu et al. (2021).

Number of labelled documents is a widely adopted evaluation method for GNN-based text classification models (Li et al. 2021; Wang et al. 2020; Linmei et al. 2019; Mei et al. 2021; Yao et al. 2019; Ragesh et al. 2021; Ding et al. 2020). It analyses the performance trend under different proportions of training data to test whether the proposed model works well with limited labelled training data.

Vocabulary size is similar to the number of labelled documents, but it investigates the effect of using different vocabulary sizes during the GNN training stage, as adopted by Wang et al. (2020).
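The labelled-document experiment from the list above can be sketched as a simple loop over training proportions. Everything here is illustrative: `subsample`, `label_proportion_curve` and the proportions are hypothetical names and values, and `majority_baseline` is only a stand-in for training and evaluating a real GNN classifier.

```python
import random
from collections import Counter


def subsample(train_docs, proportion, seed=0):
    """Randomly keep the given proportion of labelled training documents."""
    rng = random.Random(seed)
    k = max(1, round(len(train_docs) * proportion))
    return rng.sample(train_docs, k)


def majority_baseline(train_subset, test_docs):
    """Stand-in for a real model: always predict the most frequent label."""
    majority = Counter(label for _, label in train_subset).most_common(1)[0][0]
    return sum(label == majority for _, label in test_docs) / len(test_docs)


def label_proportion_curve(train_docs, test_docs, train_and_evaluate,
                           proportions=(0.05, 0.1, 0.2, 0.5, 1.0)):
    """Map each training proportion to the resulting test accuracy."""
    return {p: train_and_evaluate(subsample(train_docs, p), test_docs)
            for p in proportions}


# Toy data: (document id, label) pairs with 30 negative and 60 positive documents.
train_docs = [(f"doc{i}", "pos" if i % 3 else "neg") for i in range(90)]
test_docs = [("t0", "pos"), ("t1", "neg"), ("t2", "pos"), ("t3", "pos")]
curve = label_proportion_curve(train_docs, test_docs, majority_baseline)
for proportion, acc in curve.items():
    print(f"{proportion:.2f} -> {acc:.2f}")
```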

5.2.3 Metrics summary

For general text classification tasks, accuracy, precision, recall, and the F1 variants are the commonly used evaluation metrics for comparison with other baselines. However, for GNN-based models, performance alone cannot capture all aspects of the proposed models. Many papers therefore conduct additional experiments to evaluate and analyse the GNN-based classifier from multiple views, including time and memory consumption, model sensitivity and dataset quantity.
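A minimal way to collect time and memory figures with the Python standard library is sketched below. It is an assumption-laden illustration: `tracemalloc` only tracks memory allocated by the Python interpreter on the CPU, so GPU memory (which the surveyed papers often report) would need framework-specific tooling, and `build_dense_adjacency` is a hypothetical stand-in for constructing a corpus-level graph.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if fn raises.
        tracemalloc.stop()
    return result, elapsed, peak


def build_dense_adjacency(n_nodes):
    """Hypothetical stand-in: a dense n x n adjacency matrix as nested lists."""
    return [[0.0] * n_nodes for _ in range(n_nodes)]


adjacency, seconds, peak_bytes = measure(build_dense_adjacency, 300)
print(f"time: {seconds:.4f} s, peak memory: {peak_bytes / 1e6:.2f} MB")
```

Because the dense matrix grows quadratically with the number of nodes, doubling the node count roughly quadruples the peak memory, which mirrors the memory limitation of corpus-level graphs discussed above.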

6 Performance

While different GNN text classification models may be evaluated on different datasets, there are some datasets that are commonly used across many of these models, including 20NG, R8, R52, Ohsumed and MR. The accuracy of various models on these five datasets is presented in Table 5. Some of the results are reported as the average accuracy over ten runs with standard deviation, while others only report the average accuracy. Several conclusions can be drawn:

Models that use external resources usually achieve better performance than those that do not, especially models with BERT and RoBERTa (Lin et al. 2021 ; Ye et al. 2020 ).

Under the same setting, such as using GloVe as the external resource, Corpus-level GNN models (e.g. TG-Transformer (Zhang and Zhang 2020 ), TensorGCN (Liu et al. 2020 )) typically outperform Document-level GNN models (e.g. TextING (Zhang et al. 2020 ), TextSSL (Piao et al. 2021 )). This is because Corpus-level GNN models can work in a transductive way and make use of the test input, whereas Document-level GNN models can only use the training data.

The advantage of Corpus-level GNN models over Document-level GNN models only applies to topic classification datasets and not to sentiment analysis datasets such as MR . This is because sentiment analysis involves analyzing the order of words in a text, which is something that most Corpus-level GNN models cannot do.
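The word-order point in the last conclusion can be made concrete with a short Python sketch. It assumes a simplified corpus-level construction in which a document node is connected to its word nodes with term-frequency weights (real models typically use TF-IDF and add word-word edges); the function name and the two example sentences are made up for illustration.

```python
from collections import Counter


def doc_word_edges(text):
    """Document-word edge weights as term frequencies (order is discarded)."""
    return dict(Counter(text.lower().split()))


positive = "the film is good not bad"
negative = "the film is bad not good"

edges_positive = doc_word_edges(positive)
edges_negative = doc_word_edges(negative)

# Same edges, although the token sequences (and sentiments) differ.
print(edges_positive == edges_negative)
print(positive.split() == negative.split())
```

Under this construction the two reviews are connected to exactly the same word nodes with the same weights, so a model that only sees these edges cannot tell them apart, whereas a document-level graph built from adjacent-word windows would differ.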

7 Challenges and future work

7.1 Model performance

Pre-trained models (Devlin et al. 2018; Liu et al. 2019) and prompt learning methods (Gao et al. 2021; Liu et al. 2021) achieve great performance on text classification. GNNs applied to text classification without this kind of pre-training are unlikely to achieve equally good performance. For both corpus-level and document-level GNN text classification models, researching how to combine GNN models with these pre-trained models to improve on the pre-trained model performance is a direction for future work. Meanwhile, more advanced graph models can be explored, e.g. more heterogeneous graph models on word and document graphs, to improve model performance.

7.2 Graph construction

Most GNN text classification methods use a single, static-value edge to construct graphs based on document statistics. This approach applies to both corpus-level GNN and document-level GNN. However, to better explore the complex relationship between words and documents, more dynamic hyperedges can be utilized. Dynamic edges in GNNs can be learned from various sources, such as the graph structure, document semantic information, or other models. Hyperedges can also be built for a more expressive representation of the complex relationships between nodes in the graph.
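To illustrate what a static edge based on document statistics looks like, the following Python sketch computes word-word edge weights with pointwise mutual information (PMI) over sliding windows, in the spirit of corpus-level models such as TextGCN (Yao et al. 2019). The function name, the whitespace tokenisation, the window size of 3 and the toy corpus are assumptions for this example; once computed, the weights never change during training, which is the limitation discussed above.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(docs, window_size=3):
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI."""
    # Collect all sliding windows over all documents.
    windows = []
    for doc in docs:
        tokens = doc.lower().split()
        if len(tokens) <= window_size:
            windows.append(tokens)
        else:
            for i in range(len(tokens) - window_size + 1):
                windows.append(tokens[i:i + window_size])

    # Count in how many windows each word and each word pair occurs.
    word_count, pair_count = Counter(), Counter()
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with window-level probabilities.
    n_windows = len(windows)
    edges = {}
    for (a, b), n_ab in pair_count.items():
        pmi = math.log(n_ab * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:  # keep only positively associated pairs
            edges[(a, b)] = pmi
    return edges


docs = [
    "graph neural networks classify text",
    "graph neural networks need memory",
    "text classification uses word graphs",
]
edges = pmi_edges(docs)
print(round(edges[("graph", "neural")], 4))
```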

7.3 Application

While corpus-level GNN text classification models have demonstrated good performance without using external resources, these models are mostly transductive. To apply them in real-world settings, an inductive learning approach should be explored. Although some inductive corpus-level GNNs have been introduced, the large amount of space required to construct the graph and the inconvenience of incremental training still present barriers to deployment. Improving the scalability of online training and testing for inductive corpus-level GNNs represents a promising area for future work.

8 Conclusion

This survey article introduces how Graph Neural Networks have been applied to text classification in two different ways: corpus-level GNN and document-level GNN, with a detailed structural figure. Details of these models have been introduced and discussed, along with the datasets commonly used by these methods. Compared with traditional machine learning and sequential deep learning models, graph neural networks can explore the relationship between words and documents in the global structure (corpus-level GNN) or the local document (document-level GNN), and they perform well. A detailed performance comparison is conducted to investigate the influence of external resources, model learning methods, and types of different datasets. Furthermore, we outline the challenges for GNN text classification models and potential future work.

Data availability

No datasets were generated or analysed during the current study.

https://sobigdata.d4science.org/group/tagme/ .

For the original Reuters-21578 dataset, please refer to this link: http://www.daviddlewis.com/resources/testcollections/reuters21578 .

Abreu J, Fred L, Macêdo D, Zanchettin C (2019) Hierarchical attentional hybrid neural networks for document classification. In: International Conference on Artificial Neural Networks, Springer, pp. 396–402

Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. Mining text data. Springer, Boston, pp 163–222

Alsaeedi A (2020) A survey of term weighting schemes for text classification. Int J Data Mining Model Manag 12 (2):237–254

Arango A, Pérez J, Poblete B (2019) Hate speech detection is not as easy as you may think: A closer look at model validation. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 45–54

Bach FR, Jordan MI (2002) Kernel independent component analysis. J Mach Learn Res 3:1–48

Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings

Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski M, Tacchetti A, Raposo D, Santoro A, Faulkner R, et al (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

Benamira A, Devillers B, Lesot E, Ray AK, Saadi M, Malliaros FD (2019) Semi-supervised learning and graph neural networks for fake news detection. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 568–569

Bhavani A, Kumar BS (2021) A review of state art of text classification algorithms. In: 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), pp. 1484–1490. IEEE

Bird S, Klein E, Loper E (2009) Natural language processing with python: analyzing text with the natural language toolkit. O’Reilly Media Inc., Sebastopol

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Proc Mag 34 (4):18–42

Cavnar WB, Trenkle JM, et al (1994) N-gram-based text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175. Las Vegas, NV

Chen Y, Wu L, Zaki M (2020) Iterative deep graph learning for graph neural networks: better and robust node embeddings. Adv Neural Inf Proc Syst 33:19314–19326

Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555

Craven M, McCallum A, DiPasquo D, Mitchell T, Freitag D (1998) Learning to extract symbolic knowledge from the world wide web. Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science

Dai Y, Shou L, Gong M, Xia X, Kang Z, Xu Z, Jiang D (2022) Graph fusion network for text classification. Knowl-Based Syst 236:107659

Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media, 11, 512–515

Deng X, Li Y, Weng J, Zhang J (2019) Feature selection for text classification: a review. Multimedia Tools Appl 78 (3):3797–3816

Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Dieng AB, Wang C, Gao J, Paisley J (2016) Topicrnn: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702

Ding K, Wang J, Li J, Li D, Liu H (2020) Be more with less: Hypergraph attention networks for inductive text classification. arXiv preprint arXiv:2011.00387

Founta AM, Djouvas C, Chatzakou D, Leontiadis I, Blackburn J, Stringhini G, Vakali A, Sirivianos M, Kourtellis N (2018) Large scale crowdsourcing and characterization of twitter abusive behavior. In: Twelfth International AAAI Conference on Web and Social Media

Gao T, Fisch A, Chen D (2021) Making pre-trained language models better few-shot learners. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816–3830

Gao C, Wang X, He X, Li Y (2022) Graph neural networks for recommender system. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1623–1625

Genkin A, Lewis DD, Madigan D (2007) Large-scale Bayesian logistic regression for text categorization. Technometrics 49 (3):291–304

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: International Conference on Machine Learning, pp. 1263–1272. PMLR

Graves A (2012) Long short-term memory. Superv Seq Label Recurr Neural Netw. https://doi.org/10.1007/978-3-642-24797-2

Hakimi S (1962) On the realizability of a set of integers as degrees of the vertices of a graph. SIAM J Appl Math 10:496–506

Hakim AA, Erwin A, Eng KI, Galinium M, Muliady W (2014) Automated document classification for news article in bahasa indonesia based on term frequency inverse document frequency (tf-idf) approach. In: 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 1–4. IEEE

Hamilton WL, Ying R, Leskovec J (2017) Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584

Harish BS, Guru DS, Manjunath S (2010) Representation and classification of text documents: a brief review. IJCA 110:119

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9 (8):1735–1780

Huang L, Ma D, Li S, Zhang X, Wang H (2019) Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356

Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (volume 1: Long Papers), pp. 1681–1691

Jindal R, Malhotra R, Jain A (2015) Techniques for text classification: literature review and current trends. Webology 12 (2):1–28

Joachims T (1998) Text categorization with support vector machines: Learning with many relevant features. In: European Conference on Machine Learning, pp. 137–142. Springer

Joachims T (2005) Text categorization with support vector machines: Learning with many relevant features. In: Machine Learning: ECML-98: 10th European Conference on Machine Learning Chemnitz, Germany, April 21–23, 1998 Proceedings, pp. 137–142. Springer

Kadhim AI (2019) Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52 (1):273–292

Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188

Kaur R, Kumar M (2018) Domain ontology graph approach using markov clustering algorithm for text classification. In: International Conference on Intelligent Computing and Applications, pp. 515–531. Springer

Khan A, Baharudin B, Lee LH, Khan K (2010) A review of machine learning algorithms for text-documents classification. J Adv Inf Technol 1 (1):4–20

Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1746–1751

Kipf TN, Welling M (2016) Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning

Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: a survey. Information 10 (4):150

Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, Van Kleef P, Auer S et al (2015) Dbpedia-a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2):167–195

Lei F, Liu X, Li Z, Dai Q, Wang S (2021) Multihop neighbor information fusion graph convolutional network for text classification. Math Probl Eng. https://doi.org/10.1155/2021/6665588

Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196. PMLR

Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu PS, He L (2022) A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol 13 (2):1–41

Liang Z, Ding H, Fu W (2021) A survey on graph neural networks for recommendation. In: 2021 International Conference on Culture-oriented Science & Technology (ICCST), pp. 383–386. IEEE

Liao W, Bak-Jensen B, Pillai JR, Wang Y, Wang Y (2021) A review of graph neural networks and their applications in power systems. J Modern Power Syst Clean Energy 10 (2):345–360

Liao W, Zeng B, Liu J, Wei P, Cheng X, Zhang W (2021) Multi-level graph neural network for text sentiment analysis. Comput Electr Eng 92:107096

Li W, Li S, Ma S, He Y, Chen D, Sun X (2019) Recursive graphical neural networks for text classification. arXiv preprint arXiv:1909.08166

Lilleberg J, Zhu Y, Zhang Y (2015) Support vector machines and word2vec for text classification with semantic features. In: 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI* CC), pp. 136–140. IEEE

Lin Y, Meng Y, Sun X, Han Q, Kuang K, Li J, Wu F (2021) Bertgcn: Transductive text classification by combining gnn and bert. Findings Assoc Comput Linguist 2021:1456–1462

Linmei H, Yang T, Shi C, Ji H, Li X (2019) Heterogeneous graph attention networks for semi-supervised short text classification. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4821–4830

Li C, Peng X, Peng H, Li J, Wang L (2021) Textgtl: Graph-based transductive learning for semi-supervised text classification via structure-sensitive interpolation. In: IJCAI

Li X, Roth D (2002) Learning question classifiers. In: COLING 2002: The 19th International Conference on Computational Linguistics

Li Y, Tarlow D, Brockschmidt M, Zemel RS (2016) Gated graph sequence neural networks. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings

Liu Z, Tan H (2021) Traffic prediction with graph neural network: a survey. CICTP 2021:467–474

Liu X, You X, Zhang X, Wu J, Lv P (2020) Tensor graph convolutional networks for text classification. Proc AAAI Conf Artif Intell 34:8409–8416

Liu Y, Guan R, Giunchiglia F, Liang Y, Feng X (2021) Deep attention diffusion graph neural networks for text classification. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8142–8152

Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

Liu P, Qiu X, Chen X, Wu S, Huang X-J (2015) Multi-timescale long short-term memory neural network for modelling sentences and documents. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2326–2335

Liu P, Qiu X, Huang X (2016) Recurrent neural network for text classification with multi-task learning. In: IJCAI

Liu C, Zhan Y, Li C, Du B, Wu J, Hu W, Liu T, Tao D (2022) Graph pooling for graph neural networks: Progress, challenges, and opportunities. arXiv preprint arXiv:2204.07321

Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, Tang J (2021) Gpt understands, too. arXiv:2103.10385

Lu Z, Du P, Nie J-Y (2020) Vgcn-bert: augmenting bert with graph embedding for text classification. In: European Conference on Information Retrieval, pp. 369–382. Springer

Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150

Malekzadeh M, Hajibabaee P, Heidari M, Zad S, Uzuner O, Jones JH (2021) Review of graph neural network in text classification. In: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), pp. 0084–0091. IEEE

Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D (2014) The stanford corenlp natural language processing toolkit. In: ACL (System Demonstrations), pp. 55–60. http://dblp.uni-trier.de/db/conf/acl/acl2014-d.html#ManningSBFBM14

Marin A, Holenstein R, Sarikaya R, Ostendorf M (2014) Learning phrase patterns for text classification using a knowledge graph and unlabeled data. In: Fifteenth Annual Conference of the International Speech Communication Association

Mariyam A, Basha SAH, Raju SV (2021) A literature survey on recurrent attention learning for text classification. In: IOP Conference Series: Materials Science and Engineering, 1042, 012030. IOP Publishing

Matsuo Y, Sakaki T, Uchiyama K, Ishizuka M (2006) Graph-based word clustering using a web search engine. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 542–550

Mei X, Cai X, Yang L, Wang N (2021) Graph transformer networks based text representation. Neurocomputing 463:91–100

Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J (2021) Deep learning-based text classification: a comprehensive review. ACM Comput Surv 54 (3):1–40

Mirończuk MM, Protasiewicz J (2018) A recent overview of the state-of-the-art elements of text classification. Exp Syst Appl 106:36–54

Mou L, Men R, Li G, Xu Y, Zhang L, Yan R, Jin Z (2015) Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422

Niepert M, Ahmed M, Kutzkov K (2016) Learning convolutional neural networks for graphs. In: International Conference on Machine Learning, pp. 2014–2023. PMLR

Nikolentzos G, Tixier A, Vazirgiannis M (2020) Message passing attention networks for document understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34, 8544–8551

Ostendorff M, Bourgonje P, Berger M, Moreno-Schneider J, Rehm G, Gipp B (2019) Enriching bert with knowledge graph embeddings for document classification. arXiv preprint arXiv:1909.08402

Page L, Brin S, Motwani R, Winograd T (1999) The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab

Pang B, Lee L (2005) Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv preprint cs/0506075

Pasa L, Navarin N, Erb W, Sperduti A (2021) Simple graph convolutional networks. https://doi.org/10.48550/ARXIV.2106.05809

Patra A, Singh D (2013) A survey report on text classification with different term weighing methods and comparison between classification algorithms. Int J Comput Appl 75 (7):14–18

Peng H, Li J, He Y, Liu Y, Bao M, Wang L, Song Y, Yang Q (2018) Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In: Proceedings of the 2018 World Wide Web Conference, pp. 1063–1072

Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543

Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana. https://doi.org/10.18653/v1/N18-1202. https://aclanthology.org/N18-1202

Piao Y, Lee S, Lee D, Kim S (2021) Sparse structure learning via graph neural networks for inductive document classification. arXiv preprint arXiv:2112.06386

Pintas JT, Fernandes LA, Garcia ACB (2021) Feature selection methods for text classification: a systematic literature review. Artif Intell Rev 54 (8):6149–6200

Radford A, Narasimhan K, Salimans T, Sutskever I, et al (2018) Improving language understanding by generative pre-training

Ragesh R, Sellamanickam S, Iyer A, Bairi R, Lingam V (2021) Hetegcn: Heterogeneous graph convolutional networks for text classification. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 860–868

Ren Y, Wang R, Ji D (2016) A topic-enhanced word embedding for twitter sentiment classification. Inf Sci 369:188–198

Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Trans Neural Netw 20 (1):61–80

Schenker A, Last M, Bunke H, Kandel A (2004) Classification of web documents using graph matching. Int J Pattern Recognit Artif Intel 18 (03):475–496

Selva Birunda S, Kanniga Devi R (2021) A review on word embedding techniques for text classification. Innov Data Commun Technol Appl. https://doi.org/10.1007/978-981-15-9651-3_23

Shah FP, Patel V (2016) A review on feature selection and feature extraction for text classification. In: 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 2264–2268. IEEE

Silva FB, Tabbone S, Torres RdS (2014) Bog: A new approach for graph matching. In: 2014 22nd International Conference on Pattern Recognition, pp. 82–87. IEEE

Skarding J, Gabrys B, Musial K (2021) Foundations and modeling of dynamic networks using dynamic graph neural networks: A survey. IEEE Access 9:79143–79168

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642

Szummer M, Jaakkola T (2001) Partially labeled classification with Markov random walks. Adv Neural Inf Proc Syst 14:838

Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566

Tang H, Mi Y, Xue F, Cao Y (2020) An integration model based on graph convolutional network for text classification. IEEE Access 8:148865–148876

Tang D, Qin B, Liu T (2015) Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422–1432

Thekumparampil KK, Wang C, Oh S, Li L-J (2018) Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735

Thomas JM, Moallemy-Oureh A, Beddar-Wiesing S, Holzhüter C (2022) Graph neural networks designed for different graph types: a survey. arXiv preprint arXiv:2204.03080

Uryupina O, Plank B, Severyn A, Rotondi A, Moschitti A (2014) Sentube: A corpus for sentiment analysis on youtube social media. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 4244–4249

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Proc Syst. https://doi.org/10.48550/arXiv.1706.03762

Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph Attention Networks. International Conference on Learning Representations

Vijayan VK, Bindu K, Parameswaran L (2017) A comprehensive study of text classification algorithms. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1109–1113. IEEE

Wang K, Han SC, Poon J (2022) Induct-gcn: Inductive graph convolutional networks for text classification. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1243–1249. IEEE

Wang K, Han C, Long S, Poon J (2022) Me-gcn: Multi-dimensional edge-embedded graph convolutional networks for semi-supervised text classification. In: ICLR 2022 Workshop on Deep Learning on Graphs for Natural Language Processing

Wang Z, Wang C, Zhang H, Duan Z, Zhou M, Chen B (2020) Learning dynamic hierarchical topic graph with graph convolutional network for document classification. In: International Conference on Artificial Intelligence and Statistics, pp. 3959–3969. PMLR

Wang J, Zhang S, Xiao Y, Song R (2021) A review on graph neural network methods in financial applications. arXiv preprint arXiv:2111.15367

Waseem Z (2016) Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In: Proceedings of the First Workshop on NLP and Computational Social Science, pp. 138–142

Welling M, Kipf TN (2016) Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR 2017)

Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1):4–24

Wu L, Chen Y, Shen K, Guo X, Gao H, Li S, Pei J, Long B (2021) Graph neural networks for natural language processing: A survey. arXiv preprint arXiv:2106.06090

Wu M, Pan S, Zhu X, Zhou C, Pan L (2019) Domain-adversarial graph neural networks for text classification. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 648–657. IEEE

Wu F, Souza A, Zhang T, Fifty C, Yu T, Weinberger K (2019) Simplifying graph convolutional networks. In: International Conference on Machine Learning, pp. 6861–6871. PMLR

Xie Y, Xu Z, Zhang J, Wang Z, Ji S (2022) Self-supervised learning of graph neural networks: a unified review. IEEE Trans Pattern Anal Mach Intell 45 (2):2412–2429

Xie Q, Huang J, Du P, Peng M, Nie J-Y (2021) Inductive topic variational graph auto-encoder for text classification. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4218–4227

Xing Z, Pei J, Keogh E (2010) A brief survey on sequence classification. ACM Sigkdd Exp Newslett 12 (1):40–48

Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456

Yang Y, Wei Y, Shen T (2021) A review of graph neural networks for recommender applications. In: 2021 IEEE International Conference on Unmanned Systems (ICUS), pp. 602–607. IEEE

Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489

Yao L, Mao C, Luo Y (2019) Graph convolutional networks for text classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, 33, 7370–7377

Ye Z, Jiang G, Liu Y, Li Z, Yuan J (2020) Document and word representations generated by graph convolutional network and BERT for short text classification. ECAI 2020:2275–2281

Ying Z, You J, Morris C, Ren X, Hamilton W, Leskovec J (2018) Hierarchical graph representation learning with differentiable pooling. Adv Neural Inf Proc Syst. https://doi.org/10.48550/arXiv.1806.08804

Yin Y, Jin Z (2015) Document sentiment classification based on the word embedding. In: 2015 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering, pp. 456–461. Atlantis Press

Zhang W, Yoshida T, Tang X (2011) A comparative study of TF*IDF, LSI and multi-words for text classification. Exp Syst Appl 38(3):2758–2765

Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. Adv Neural Inf Proc Syst. https://doi.org/10.48550/arXiv.1509.01626

Zhang S, Tong H, Xu J, Maciejewski R (2019) Graph convolutional networks: a comprehensive review. Comput Soc Netw 6 (1):1–23

Zhang X-M, Liang L, Liu L, Tang M-J (2021) Graph neural networks and their current applications in bioinformatics. Front Genet 12:690049

Zhang J, Meng L (2019) GResNet: Graph residual network for reviving deep GNNs from suspended animation. arXiv preprint arXiv:1909.05729

Zhang Y, Yu X, Cui Z, Wu S, Wen Z, Wang L (2020) Every document owns its structure: Inductive text classification via graph neural networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 334–339

Zhang H, Zhang J (2020) Text graph transformer for document classification. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8322–8327

Zhou Y (2020) A review of text classification based on deep learning. In: Proceedings of the 2020 3rd International Conference on Geoinformatics and Data Analysis, pp. 132–136

Zhou X, Li C (2005) Text classification by Markov random walks with reward. DMIN. Citeseer, Chicago, pp 275–278

Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M (2020) Graph neural networks: a review of methods and applications. AI Open 1:57–81

Zhou M, Cong Y, Chen B (2015) The Poisson gamma belief network. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 3043–3051

Zhu H, Koniusz P (2020) Simple spectral graph convolution. In: International Conference on Learning Representations

Zulqarnain M, Ghazali R, Hassim YMM, Rehan M (2020) A comparative review on deep learning models for text classification. Indones J Electr Eng Comput Sci 19 (1):325–335

Open Access funding enabled and organized by CAUL and its Member Institutions

Author information

Authors and affiliations.

School of Computer Science, The University of Sydney, City Rd, Sydney, NSW, 2006, Australia

Kunze Wang, Yihao Ding & Soyeon Caren Han

Faculty of Engineering and IT, The University of Melbourne, Southbank 234, Melbourne, VIC, 3052, Australia

Yihao Ding & Soyeon Caren Han

Contributions

Kunze Wang: conceived and designed the analysis and review, collected the data, contributed data or analysis, and wrote the paper.

Yihao Ding: collected the data, contributed data or analysis, and wrote the paper.

Soyeon Caren Han: conceived and designed the analysis and review, research supervision, and manuscript drafting.

Corresponding author

Correspondence to Soyeon Caren Han.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Wang, K., Ding, Y. & Han, S.C. Graph neural networks for text classification: a survey. Artif Intell Rev 57, 190 (2024). https://doi.org/10.1007/s10462-024-10808-0

Published: 01 July 2024

DOI: https://doi.org/10.1007/s10462-024-10808-0

  • Graph neural networks
  • Text classification
  • Representation learning

COMMENTS

  1. Simple audio recognition: Recognizing keywords

    This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources: The Sound classification with YAMNet tutorial shows how to use transfer learning for audio classification.

  2. Automatic Speech Recognition with Transformer

    Introduction. Automatic speech recognition (ASR) consists of transcribing audio speech segments into text. ASR can be treated as a sequence-to-sequence problem, where the audio can be represented as a sequence of feature vectors and the text as a sequence of characters, words, or subword tokens. For this demonstration, we will use the LJSpeech ...

  3. Speech Recognition: a review of the different deep learning approaches

    Automatic speech recognition (ASR) refers to the task of recognizing human speech and translating it into text. This research field has gained a lot of focus over the last decades. It is an important research area for human-to-machine communication. ... Neural networks, both feed-forward and recurrent, can be only used for frame-wise ...

  4. Speech to Text in Python with Deep Learning in 2 minutes

    This might take some time to download. Once done, you can record your voice and save the wav file just next to the file you are writing your code in. You can name your audio to "my-audio.wav". file_name = 'my-audio.wav'. Audio(file_name) With this code, you can play your audio in the Jupyter notebook.

  5. Machine Learning is Fun Part 6: How to do Speech Recognition ...

    A neural network can find patterns in this kind of data more easily than raw sound waves. So this is the data representation we'll actually feed into our neural network. ... Text to speech.

  6. speech-to-text · GitHub Topics · GitHub

    DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers. machine-learning embedded deep-learning offline tensorflow speech-recognition neural-networks speech-to-text deepspeech on-device.

  7. [2106.15561] A Survey on Neural Speech Synthesis

    Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent ...

  8. maneesh-chouksey/speech-to-text-deep-learning-models

    In this notebook, you will build a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline! We begin by investigating the LibriSpeech dataset that will be used to train and evaluate your models. Your algorithm will first convert any raw audio to feature representations that are commonly used for ASR.

  9. GitHub

    NeuralSpeech is a research project at Microsoft Research Asia, which focuses on neural network based speech processing, including automatic speech recognition (ASR), text-to-speech synthesis (TTS), spatial audio synthesis, video dubbing, etc. Currently this repo covers several research works: Automatic Speech Recognition. FastCorrect, NeurIPS 2021.

  10. [1702.07825] Deep Voice: Real-time Neural Text-to-Speech

    We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency ...

  11. Deep learning speech synthesis

    Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

  12. Comparing 4 Popular Open Source Speech To Text Neural Network ...

    I compared pre-trained models for Vosk, NeMo QuartzNet, wav2letter, and DeepSpeech2 for my summer internship. For my company's needs, I recommended NeMo QuartzNet model from NVIDIA. Speech-to-text ...

  13. Speech-To-Text Recognition Using Neural Networks

    researched on how to apply convolutional networks in a speech-to-text task; adapted the convolutional network to recognize speech; tested the model in streaming recognition; How We Taught Neural Networks to Recognize Incoming Audio Signals. For the research, we used an audio signal in the wav format, in 16-bit quantization at a sampling ...

  14. Speech Recognition with Neural Networks

    Speech Recognition with Neural Networks. Wednesday, April 23, 2014. We've previously talked about using recurrent neural networks for generating text, based on a similarly titled paper. Recently, recurrent neural networks have been successfully applied to the difficult problem of speech recognition. In this post, we'll look at the architecture ...

  15. Deep Voice: Real-time Neural Text-to-Speech

    Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features: first we convert text to phoneme and then use an audio synthesis model to convert linguistic features into speech (Taylor, 2009).

  16. Neural Text to Speech (TTS): Making Voice Experiences More Human

    In a nutshell, neural text to speech is a form of machine speech built with neural networks. A neural network is a type of computer architecture modeled on the human brain. Your brain processes data through unbelievably complex webs of electrochemical connections between nerve cells, or neurons. As these connective pathways develop through ...

  17. Text-to-Speech AI: Lifelike Speech Synthesis

    Convert text into natural-sounding speech using an API powered by the best of Google's AI technologies. New customers get up to $300 in free credits to try Text-to-Speech and other Google Cloud products. Try Text-to-Speech free Contact sales. Improve customer interactions with intelligent, lifelike responses.

  18. Text-To-Speech Synthesis

    Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. coqui-ai/TTS, 24 Oct 2017. This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units.

  19. Transformer TTS: Implementation of a non ...

    Implementation of a non-autoregressive Transformer based neural network for Text-to-Speech (TTS). This repo is based, among others, on the following papers: Neural Speech Synthesis with Transformer Network; FastSpeech: Fast, Robust and Controllable Text to Speech; FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

  20. Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology

    WaveNet synthesizes more natural-sounding speech and, on average, produces speech audio that people prefer over other text-to-speech technologies. In late 2016, DeepMind introduced the first version of WaveNet — a neural network trained with a large volume of speech samples that's able to create raw audio waveforms from scratch.

  21. Microsoft previews neural network text-to-speech

    Applying the latest in deep learning innovation, Speech Service, part of Azure Cognitive Services now offers a neural network-powered text-to-speech capability. Access the preview available today. Neural Text-to-Speech makes the voices of your apps nearly indistinguishable from the voices of people. Use it to make conversations with chatbots ...

  22. Best Speech-to-Text API Solutions in 2024

    Convolutional Neural Networks (CNNs): Primarily used for image processing, CNNs have found applications in speech recognition by helping to identify features in audio spectrograms. Transformer Models: The latest advancement in deep learning, transformer models use attention mechanisms to focus on important parts of the input sequence ...

  23. How to create text-to-speech with neural network

    I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.

  24. Addressing Hallucinations in Speech Synthesis LLMs with the NVIDIA NeMo

    LLM-based speech synthesis models produce speech that is not only more natural, but also more expressive, opening up a world of possibilities for applications in various industries. However, similar to their use in the text domain, speech LLMs face hallucination challenges, which can hinder their real-world deployment.

  25. SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural

    Brain-inspired Spiking Neural Network (SNN) has demonstrated its effectiveness and efficiency in vision, natural language, and speech understanding tasks, indicating their capacity to "see", "listen", and "read". In this paper, we design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to "speak".

  26. SpeechGen.io Revolutionizes Audio Content Creation with Multi-Voice AI

    The multi-voice AI text-to-speech technology from SpeechGen.io allows users to generate dynamic dialogues and narratives in multiple languages using a single neural network.

  27. Deep correlation network for synthetic speech detection

    Bi-parallel networks consist of different neural models to learn the middle-level representations from front-end acoustical features. The correlation learning network is the core part of the DCN and is proposed to explore the common information between the above middle-level features. ... As increasing development of text-to-speech (TTS) and ...

  28. Graph neural networks for text classification: a survey

    Text Classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models applied the sequential deep learning technique, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph ...