Flintstones Dataset
The Flintstones dataset is composed of brief, densely annotated clips that describe the actions, characters, objects, and setting of a scene. Clip annotations include identification and localization of characters in keyframes, identification of the scene setting, scene captioning, object annotation, and entity tracking to provide annotations for all frames. The dataset contains segmentation masks for characters and objects, and additionally each clip's foreground characters and objects are excised to provide clean background frames. The dataset is divided roughly into visual and textual components.
-
Textual Annotations are stored in-
flintstones_annotations_v1-0.json
-
And visual data components are spread over several directories within-
flintstones_dataset.tgz
-
The train/val/test clip assignments used in 'Imagine This! Scripts to Compositions to Videos' are stored in-
train-val-test_split.json
Dataset file directories
Within flintstones_dataset.tgz you will find-
-
background_frames
The background directory contains the filled backgrounds with characters and objects removed. Each video has a corresponding background file, named as
video global ID_bg.npy.npz
These are compressed numpy arrays of dimension 75 x 128 x 128 x 3
No distinction is made between static and moving backgrounds in these files; a static background simply has a single frame repeated.
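Since a static background repeats one frame, it can be detected by comparing every frame to the first. A minimal sketch (the array below is a synthetic stand-in for a loaded background file; the 'arr_0' key is stated in this README only for the tracking files, so its use for backgrounds is an assumption):

```python
import numpy as np

# Stand-in for a background array loaded from <video global ID>_bg.npy.npz,
# e.g. np.load(path)["arr_0"] (the 'arr_0' key is assumed here).
bg = np.zeros((75, 128, 128, 3), dtype=np.uint8)

# A static background repeats one frame, so all frames equal the first.
is_static = bool((bg == bg[0]).all())
print(is_static)  # prints True for this all-zero stand-in
```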
-
entity_segmentation
The segmentation directory contains the entity segmentation masks. Each entity (characters and objects) in the dataset has a corresponding mask file, named as
entity global ID_segm.npy.npz
These are compressed numpy binary arrays of dimension 75 x 128 x 128
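Because the masks share the 75 x 128 x 128 frame grid of the videos, an entity's pixels can be isolated by broadcasting the per-frame mask over the RGB channels. A sketch with synthetic stand-in arrays (real data would come from the segmentation and video files described in this README):

```python
import numpy as np

# Stand-ins for a loaded entity mask (<entity global ID>_segm.npy.npz)
# and the corresponding clip frames.
mask = np.zeros((75, 128, 128), dtype=bool)
mask[:, 10:20, 10:20] = True  # pretend the entity occupies a small square
frames = np.ones((75, 128, 128, 3), dtype=np.uint8)

# Broadcast the per-frame binary mask over the 3 RGB channels.
entity_pixels = frames * mask[..., None]
print(entity_pixels.shape)  # prints (75, 128, 128, 3)
```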
-
entity_tracking
The tracking directory contains the entity bounding boxes for every video frame. Each entity (characters and objects) in the dataset has a corresponding tracking file, named as
entity global ID.npy
These are compressed numpy arrays of dimension 75 x 4, with the bounding boxes defined as (x1, y1, x2, y2). The origin is at the upper left corner of the image, the x-axis corresponds to width and the y-axis to height. After loading this file with numpy, this array can be accessed with the key 'arr_0'.
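Given the (x1, y1, x2, y2) convention with the origin at the upper left, a box for frame t can be turned into a crop by slicing rows with the y coordinates and columns with the x coordinates. A sketch with synthetic stand-in arrays:

```python
import numpy as np

# Stand-ins for a loaded 75 x 4 tracking array (np.load(path)["arr_0"])
# and the corresponding clip frames.
boxes = np.array([[10, 20, 50, 60]] * 75)             # (x1, y1, x2, y2) per frame
frames = np.zeros((75, 128, 128, 3), dtype=np.uint8)

t = 0
x1, y1, x2, y2 = boxes[t]
crop = frames[t, y1:y2, x1:x2]  # rows index the y-axis (height), columns the x-axis (width)
print(crop.shape)  # prints (40, 40, 3)
```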
-
video_frames
The video directory contains the clip videos. Each video has a corresponding video frame file, named as
video global ID.npy
These are compressed numpy arrays of dimension 75 x 128 x 128 x 3 (i.e. 75 frames x 128x128 pixel image x 3 rgb channels)
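A round-trip sketch of the compressed-array load pattern used throughout the dataset. The filename here is a stand-in (a dummy .npz archive is written only so the load call has something to read), and the 'arr_0' key is numpy's default for unnamed arrays:

```python
import numpy as np

# Write a dummy archive in the same format as the video-frame files:
# a 75 x 128 x 128 x 3 uint8 array (75 frames of 128x128 RGB).
np.savez_compressed("example_clip.npz", np.zeros((75, 128, 128, 3), dtype=np.uint8))

clip = np.load("example_clip.npz")["arr_0"]  # unnamed arrays are stored under 'arr_0'
n_frames, height, width, channels = clip.shape
print(n_frames, height, width, channels)  # prints 75 128 128 3
```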
Annotation Structure
Here is the structure of the textual annotations in flintstones_annotations_v1-0.json :
[ ### at the top level is a list of clips
{ ### each clip is a dictionary with several fields
"characters": [
{
"actions": [""], # list of verbs associated with character
"entityLabel: "", # label given to character
"entitySpan": [], # label's span in the clip description
"globalID": "", # character's global id string
"labelNPC": "" # noun phrase chunk associated with label
"rectangles": [[], [], []] # human annotated keyframe bounding boxes
}
.
.
.
],
"description": "", # A brief natural language description of the clip
"globalID": "", # clip global id string
"objects": [
{
"entityLabel": "", # label given to object
"entitySpan": [], # label's span in the clip description
"globalID": "", # character's global id string
"labelNPC": "", # noun phrase chunk associated with label
"rectangles": [[], [], []] # human annotated keyframe bounding boxes
},
.
.
.
],
"parse": {
"coref": {
"clusters": [[[], []], ... ], # Coref cluster spans
"named_clusters": [[[], []], ... ] # Coref clusters
},
"noun_phrase_chunks": {
"aligned_description": "", # description string reformed from token spans
"chunks": [[], ...], # chunk spans
"named_chunks": [[], ...], # noun phrase chunks
"token_spans": [[, , ], ...], # spans for all description tokens
},
"pos_tags": [[, ], ] # part of speach tags
},
"setting": "" # one to two words describing the clip's setting (e.g. room, movie theatre... )
},
{
next clip
},
.
.
.
]
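The structure above can be walked with Python's json module. A minimal sketch using an illustrative in-memory record (field names follow the schema shown; in practice the list would come from json.load on flintstones_annotations_v1-0.json):

```python
import json  # in practice: clips = json.load(open("flintstones_annotations_v1-0.json"))

# Illustrative sample record matching the schema above.
clips = [{
    "globalID": "clip_000",                      # hypothetical clip global ID
    "description": "Fred walks into the room.",
    "setting": "room",
    "characters": [{"entityLabel": "fred", "globalID": "clip_000_0",
                    "actions": ["walk"], "rectangles": [[1, 2, 3, 4]]}],
    "objects": [],
}]

for clip in clips:
    # Characters and objects share the same entity fields, so they can be pooled.
    entities = clip["characters"] + clip["objects"]
    print(clip["globalID"], clip["setting"], [e["entityLabel"] for e in entities])
```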