Automatic Sentence Segmentation of Speech Using Prosodic Features
Segmenting speech into sentences is an important first step in several speech processing fields. Automatic speech recognition (ASR) algorithms mostly produce just a stream of unstructured words, without detecting the hidden structure of spoken language. However, natural language processing systems often need sentence-like units to work properly. Moreover, labeling large amounts of speech data by hand is very time-consuming. It is therefore necessary to develop an algorithm that analyzes broadcast speech corpora (e.g., Aix-MARSEC) and outputs sentence boundaries using prosodic features.
The algorithm can be described as follows. First, an adaptive, energy-based voice activity detector (VAD) gathers all active regions and calculates pause lengths and intensity as first features. These active regions are then used as input to a pitch estimation algorithm. To assess tendencies at the region boundaries, an optimal (in the least-squares sense) piecewise polynomial approximation of the estimated pitch is computed, from which further prosodic features are derived (f0-rise/fall, f0-gradient, ...). Finally, the extracted features are combined in a decision tree to determine the sentence boundaries.
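The first stages of this pipeline might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described here: the adaptive threshold is assumed to be a low percentile of the frame energies plus a fixed margin, the piecewise polynomial approximation is simplified to a single least-squares line over the last few f0 values of a region, and all function names and parameters (`frame_energy_db`, `active_regions`, `pause_lengths`, `boundary_slope`, `margin_db`, `n_points`) are hypothetical. The test signal is synthetic.

```python
import numpy as np


def frame_energy_db(signal, frame_len, hop):
    """Short-time energy in dB for each full frame of the signal."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Small constant avoids log(0) on perfectly silent frames.
        energies[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return energies


def active_regions(energy_db, margin_db=15.0):
    """Return (start_frame, end_frame) pairs, end exclusive.

    Assumed adaptive rule: the noise floor is estimated as the 10th
    percentile of the frame energies, and a frame counts as active
    when it exceeds that floor by margin_db.
    """
    threshold = np.percentile(energy_db, 10) + margin_db
    active = energy_db > threshold
    # Pad with False so regions touching either end are closed properly.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[0::2], changes[1::2]))


def pause_lengths(regions, hop, sr):
    """Pause duration in seconds between consecutive active regions."""
    return [
        (regions[i + 1][0] - regions[i][1]) * hop / sr
        for i in range(len(regions) - 1)
    ]


def boundary_slope(times, f0, n_points=5):
    """Least-squares line through the last n_points f0 values (Hz/s).

    Simplification: one linear piece instead of an optimal piecewise
    polynomial approximation.
    """
    slope, _intercept = np.polyfit(times[-n_points:], f0[-n_points:], deg=1)
    return slope


# Synthetic example: 0.5 s tone, 0.3 s near-silence, 0.5 s tone.
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(int(0.5 * sr)) / sr
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)
silence = np.zeros(int(0.3 * sr))
signal = np.concatenate([tone, silence, tone])
signal = signal + 1e-3 * rng.standard_normal(len(signal))

frame_len, hop = 400, 160  # 25 ms frames, 10 ms hop at 16 kHz
energy = frame_energy_db(signal, frame_len, hop)
regions = active_regions(energy)
pauses = pause_lengths(regions, hop, sr)

# Synthetic falling f0 track: 200 Hz dropping at 50 Hz per second.
f0_times = np.arange(10) * 0.01
f0_track = 200.0 - 50.0 * f0_times
slope = boundary_slope(f0_times, f0_track)
```

The pause lengths and the boundary slope computed this way are examples of the feature values that would then be passed to a decision tree. Because frames overlap the region edges, the measured pause comes out slightly shorter than the true silent interval.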