To get the most accurate results, high-quality audio data is required. This means clear recordings with very little noise, together with a broad range of voice types and environmental sounds that properly represent real-world conditions. For example, the data should include both studio-quality audio recorded in a studio and audio captured in natural surroundings. Such a dataset can help in designing more robust AI systems.
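
One way to check that a collection covers the range of recording conditions and voice types is to count clips per combination and flag the gaps. The snippet below is a minimal sketch, not a prescribed method: the metadata fields `environment` and `voice_type`, the label values, the file paths, and the helper name `coverage_report` are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def coverage_report(clips, min_per_group=1):
    """Count clips per (environment, voice_type) pair and list under-covered pairs.

    `clips` is a list of dicts with hypothetical keys "environment" and
    "voice_type"; a real dataset may describe its recordings differently.
    """
    counts = Counter((c["environment"], c["voice_type"]) for c in clips)
    environments = sorted({c["environment"] for c in clips})
    voice_types = sorted({c["voice_type"] for c in clips})
    # A pair is under-covered when it has fewer clips than the threshold.
    missing = [
        (env, voice)
        for env in environments
        for voice in voice_types
        if counts[(env, voice)] < min_per_group
    ]
    return counts, missing


# Illustrative metadata only; the paths and labels are made up.
clips = [
    {"path": "a.wav", "environment": "studio", "voice_type": "adult_female"},
    {"path": "b.wav", "environment": "studio", "voice_type": "adult_male"},
    {"path": "c.wav", "environment": "natural", "voice_type": "adult_male"},
]

counts, missing = coverage_report(clips)
print(missing)
```

In this toy example the report would flag that no natural-surroundings recording exists for one of the voice types, which is the kind of gap that could leave a model less robust outside the studio. Noise level is not checked here; that would need a separate measurement on the audio itself.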