We need to have a push-style frontend since it’s important for
bindings.  Callback pop-style frontend would be a pain to support in
bindings (you need to  call managed/python code from callback then)

For effective noise cancellation we need a lookahead and corresponding
delay in processing

For effective CMN we need a delay in processing in the beginning too for 
parameters estimation

For effective VAD we need a delay in processing to catch the start of the 
speech properly

Our overall delay should be minimal to provide users a real-time
experience

We need to operate on streams of audio, not just on utterances, there
could be adaptation across the stream

The core concept of the live frontend must be FIFO. We push some data
into it  and can get some data back. Every chunk of data must have
stream stamp attached to it. It’s not  necessary that we  will get
same data back, many components are asynchronous - denoise, VAD, CMN 
they might introduce some delays in the beginning of the stream.

Sphinx4 list-based frontend must be a good idea to implement all the required 
queues. At the same time we'd like to keep simple allocation-free structures
like buffers.

Ideally VAD has to be GMM-based with long-context features too

In the future we might want to support long-context features (up to 30 frames)

Ability to handle frame drop is a nice feature to have (occurs in a bad 
channels)

We still need proper accurate timestamps for the results.

We need to support lookahead scoring (phone loop) and multipass scoring.

We need to support minibatch scoring for efficient computation.

We have two types of processing - batch and live? Is batch needed, probably 
not? Most practical applications are live.
