[Corpora-List] PhD position in Joint Embedded Speech Separation, Diarization and Recognition for the Automatic Generation of Meeting Minutes

Firas Hmida firas.hmida at gmail.com
Thu Jul 8 10:26:47 CEST 2021

Dear Colleagues,

Please find below the description of a PhD position in “Joint Embedded Speech Separation, Diarization and Recognition for the Automatic Generation of Meeting Minutes”.

Starting date: October 01, 2021

Deadline for Applications: July 16, 2021

All details are available at: https://recrutement.inria.fr/public/classic/en/offres/2021-03757

Keywords: diarization, speech separation, robust automatic speech recognition, transfer learning, deep learning


Founded in 2015 and awarded two CES Innovation Awards, Vivoka <https://vivoka.com/en/> has created and sells the Voice Development Kit (VDK), the very first solution allowing a company to design a voice interface in a simple, autonomous and quick way. Moreover, this interface is embedded: it can be deployed on devices without an Internet connection and fully preserves privacy. Accelerated by the COVID-19 health crisis and the need for "no-touch" interfaces, Vivoka is now optimizing this technology by developing its own speech and language processing solutions able to compete with the most efficient current technologies. This research project, which involves the entire Vivoka R&D team, is carried out within the framework of a long-standing partnership with Inria's Multispeech <https://team.inria.fr/multispeech/> team.

The hired PhD student will share his/her time between Vivoka's R&D team and Inria's Multispeech team. He/she will benefit from the startup spirit of Vivoka, where he/she will interact with other PhD students, interns and researchers hired as part of the partnership and the engineers responsible for integrating their results into the VDK. He/she will also benefit from the skills of the Multispeech team, the largest research team in the field of speech processing in France, and the overall Inria environment.


Conversational Automatic Speech Recognition (ASR) has seen tremendous progress over the past decade, with a word error rate now similar to that of humans for a single speaker speaking close to the microphone [1]. As soon as the speaker moves away from the microphone, the error rate increases due to reverberation, ambient noise, and overlapping speech from other speakers. The automatic generation of meeting minutes thus involves solving a set of tasks: i) segmenting the signal according to the number of speakers and who is speaking at each time (diarization) [2], ii) separating overlapping speech signals [3] and enhancing them with respect to ambient noise and reverberation, iii) ensuring the robustness of ASR with respect to diarization errors and signal distortions introduced by separation and enhancement [4], and iv) removing disfluencies from the word-for-word transcription in order to obtain readable minutes.

The objective of this PhD is to design a system which can jointly address the first three tasks given a single-channel or a multichannel signal and which can be embedded in a device with limited computing power (for example a mobile phone), while being able to compete with current Cloud-based technologies.

[1] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu and G. Zweig, "Toward human parity in conversational speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12): 2410-2423, 2017.

[2] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue and K. Nagamatsu, “End-to-end neural diarization: reformulating speaker diarization as simple multi-label classification,” arXiv preprint arXiv:2003.02966, 2020.

[3] M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis, J. Heitkaemper, M. Olvera, F.-R. Stöter, M. Hu, J. Martín-Doñas, D. Ditter, A. Frank, A. Deleforge and E. Vincent, “Asteroid: the PyTorch-based audio source separation toolkit for researchers,” in Interspeech, pp. 2637-2641, 2020.

[4] Z.-Q. Wang and D. Wang, “A joint training framework for robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4): 796-806, 2016.


Master 2 in computer science, data science or signal processing.

Programming experience in Python and in a deep learning framework.

Previous experience in the field of speech processing or computational footprint reduction is a plus.

Instructions for applying

Application deadline: July 16, 2021

Submit your complete application online at https://recrutement.inria.fr/public/classic/en/offres/2021-03757 and send a copy to recrutement at vivoka.com.

Applications will be considered on the fly. It is therefore advisable to apply as soon as possible.
