Video datasets for deep learning are usually limited in quality and quantity compared to image datasets such as ImageNet. Among available video datasets, the dance category is further limited in both the diversity of genres and the number of samples.

To help alleviate this problem, Dancelogue has made its annotated dataset public, covering a greater range of dance moves. The dataset includes genres such as hip hop, salsa, african (afro), house, jazz, ballet, dancehall and breakdancing.

Dataset Structure

The dataset can be downloaded as a JSON file and follows the convention set by the ActivityNet and Kinetics datasets. The format is as follows:

{
	"version": "1.0",
	"dataset": [{
		"url": "https://www.youtube.com/watch?v=Yqo_w0DWWwA",
		"annotations": [{
			"label": "afro",
			"segment": [160.9, 164.4]
		}]
	}]
}

The parent JSON object currently has two keys: the version key, used to track the dataset between releases, and the dataset key, an array of objects encoding the annotated segments.

Each object in the dataset array contains two main keys:

  • The url key, which points to the source video for the data.
  • The annotations key, an array of the labelled segments within the video. Each segment object contains a label key, the name of the dance genre, and a segment key, an array of two values: the start time, marking the beginning of the temporal boundary, and the end time, marking its termination.
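The structure above can be walked with a few lines of Python. Below is a minimal sketch that parses the example document and yields each annotated segment; in practice the downloaded file would be read with `json.load`.

```python
# Minimal sketch: parse the dataset JSON and iterate over its
# annotated segments. The structure mirrors the example above.
import json

def iter_segments(dataset):
    """Yield (url, label, start, end) for every annotated segment."""
    for video in dataset["dataset"]:
        for annotation in video["annotations"]:
            start, end = annotation["segment"]
            yield video["url"], annotation["label"], start, end

# The example document parsed inline for illustration.
sample = json.loads("""
{
  "version": "1.0",
  "dataset": [{
    "url": "https://www.youtube.com/watch?v=Yqo_w0DWWwA",
    "annotations": [{"label": "afro", "segment": [160.9, 164.4]}]
  }]
}
""")

segments = list(iter_segments(sample))
```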

The dataset can be downloaded from the following link:

How the dataset was built

Candidate YouTube videos containing dance moves were selected. The temporal boundaries of a single move in each video were annotated with start and end times, marking the temporal extent of that dance move. An example of such a move is the shoot from hip hop, shown below.

https://www.youtube.com/watch?v=7t8Gies2Zps

This means the clips have variable length, ranging from 1 to 5 seconds depending on the duration of the specific dance move. Note this differs from previously aggregated dance datasets, in which a single clip contains a range of movements from a particular dance genre, whereas each clip in the Dancelogue dataset contains a single move.

The temporal annotation was conducted using the Dancelogue annotation pipeline, which allows for the annotation of over 100 moves per hour. The moves were then tagged with the dance genre as well as the specific dance move. However, this initial release does not contain the specific dance move name (e.g. shoot from hip hop), only the dance genre to which the moves belong. The JSON structure was then generated, which allows researchers and enthusiasts to generate their own trimmed version of the videos for classification.
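Given the segment boundaries, trimmed clips can be cut with a tool such as ffmpeg. The sketch below only builds the command line; it assumes the source video has already been downloaded to a local path, and both the path and the output naming scheme are illustrative rather than part of the dataset specification.

```python
# Hedged sketch: build an ffmpeg command that trims one annotated
# segment out of a locally downloaded video. The input path and the
# output naming scheme are assumptions, not part of the dataset spec.

def trim_command(video_path, label, start, end, index):
    """Return an ffmpeg argv list that cuts [start, end] into a clip."""
    out = f"{label}_{index:04d}.mp4"
    return [
        "ffmpeg", "-i", video_path,
        "-ss", str(start), "-to", str(end),
        "-c", "copy",  # stream copy: fast, no re-encoding
        out,
    ]

cmd = trim_command("videos/Yqo_w0DWWwA.mp4", "afro", 160.9, 164.4, 0)
```

Running each command (e.g. via `subprocess.run(cmd, check=True)`) would produce one trimmed clip per annotated segment.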

Dataset Distribution

The dataset has 741 segmented instances across 271 videos, labelled with 8 different dance genres. This section details how the segment instances are distributed across the videos.

Segment Distribution

As mentioned earlier, multiple segments can belong to a single video; this is evident in the number of segments being significantly greater than the number of videos (741 segments vs 271 videos). The distribution of segments across videos is as follows.

Approximately 141 videos contain one segment per video, accounting for 19% of the total annotated segments. On the other end of the spectrum, one video has 70 segments. When many segments come from a single video there is a danger of insufficient variation between moves; to mitigate this, the segments were manually annotated so that, within a given genre, each segment contains a unique dance move.
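The segments-per-video distribution described above can be computed directly from the JSON. A small sketch over a toy dataset (the toy values are illustrative, not from the release):

```python
# Sketch: count how many segments each video contributes, then
# histogram those counts (segments-per-video distribution).
from collections import Counter

def segments_per_video(dataset):
    return Counter(len(video["annotations"]) for video in dataset["dataset"])

toy = {"dataset": [
    {"url": "a", "annotations": [{"label": "salsa", "segment": [0.0, 2.0]}]},
    {"url": "b", "annotations": [{"label": "house", "segment": [1.0, 3.0]},
                                 {"label": "house", "segment": [4.0, 6.0]}]},
]}
dist = segments_per_video(toy)  # one 1-segment video, one 2-segment video
```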

Label Distribution

The label distribution is as follows:

As can be seen, most dance genres have around 100 segment instances; however, house dancing and breakdance are lower, at about 70 and 71 respectively. This deficit will be addressed in future releases.
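The per-genre counts can be tallied from the annotations in the same way. A minimal sketch, again over illustrative toy data:

```python
# Sketch: tally annotated segments per dance genre (label distribution).
from collections import Counter

def label_counts(dataset):
    return Counter(
        annotation["label"]
        for video in dataset["dataset"]
        for annotation in video["annotations"]
    )

toy = {"dataset": [
    {"url": "a", "annotations": [{"label": "ballet", "segment": [0.0, 2.0]},
                                 {"label": "ballet", "segment": [3.0, 5.0]}]},
    {"url": "b", "annotations": [{"label": "house", "segment": [1.0, 4.0]}]},
]}
counts = label_counts(toy)
```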

Duration Distribution

The duration distribution is as follows:

From the pie chart above, it can be seen that the majority of segmented instances are between 2 and 5 seconds long, which corresponds to 60 - 150 frames assuming a video with a temporal resolution of 30 fps.
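A segment's frame count follows from its duration and an assumed frame rate. A small sketch of the conversion and the bucketing used above (the bucket boundaries are assumptions chosen to match the chart):

```python
# Sketch: convert a segment's duration to a frame count at an assumed
# 30 fps, and bucket durations as in the distribution above.

FPS = 30  # assumed temporal resolution

def segment_frames(segment, fps=FPS):
    start, end = segment
    return round((end - start) * fps)

def duration_bucket(seconds):
    # Illustrative bins matching the 1-5 s clip range described earlier.
    if seconds < 2:
        return "1-2 s"
    if seconds <= 5:
        return "2-5 s"
    return "> 5 s"

frames = segment_frames([160.9, 164.4])  # 3.5 s at 30 fps
bucket = duration_bucket(3.5)
```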

Conclusion

It was shown that the new Dancelogue dataset annotates videos in greater detail than previous datasets: individual dance moves were annotated so that each data point contains only a single move from a specific dance genre.

The dataset provided forms only the initial release and is limited in quantity. This deficit will be addressed at a future date, and datasets containing only dance moves from a specific genre will also be released to better understand and classify dance in videos.