Flintstones Dataset

The Flintstones dataset is composed of brief, densely annotated clips that describe the actions, characters, objects, and setting of a scene. Clip annotations include identification and localization of characters in keyframes, identification of the scene setting, scene captioning, object annotation, and entity tracking to provide annotations for all frames. The dataset contains segmentation masks for characters and objects; additionally, each clip's foreground characters and objects are excised to provide clean background frames. The dataset is divided roughly into textual and visual components.

  • Textual annotations are stored in-

    flintstones_annotations_v1-0.json

  • Visual data components are spread over several directories within-

    flintstones_dataset.tgz

  • The train/val/test clip assignments used in 'Imagine This! Scripts to Compositions to Videos' are stored in-

    train-val-test_split.json
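Assuming the split file maps split names to lists of clip global IDs (the "train"/"val"/"test" key names and the ID strings below are illustrative guesses, not taken from the real file), it can be loaded like so:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for train-val-test_split.json; key names and globalID strings
# here are assumptions used only for illustration.
example_split = {
    "train": ["clip_0000", "clip_0001"],
    "val": ["clip_0002"],
    "test": ["clip_0003"],
}

with tempfile.TemporaryDirectory() as tmp:
    split_path = Path(tmp) / "train-val-test_split.json"
    split_path.write_text(json.dumps(example_split))

    # Loading pattern for the real file: one list of clip globalIDs per split.
    split = json.loads(split_path.read_text())
    train_ids = set(split["train"])
```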

Dataset file directories

Within flintstones_dataset.tgz you will find-

  • background_frames

    The background directory contains the filled-in backgrounds with characters and objects removed. Each video has a corresponding background file, named as

    video global ID_bg.npy.npz

    These are compressed numpy arrays of dimension 75 x 128 x 128 x 3

    There's no distinction made between static and moving backgrounds in these files, and static backgrounds will have a single frame repeated.

  • entity_segmentation

    The segmentation directory contains the entity segmentation masks. Each entity (characters and objects) in the dataset has a corresponding mask file, named as

    entity global ID_segm.npy.npz

    These are compressed numpy binary arrays of dimension 75 x 128 x 128

  • entity_tracking

    The tracking directory contains the entity bounding boxes for every video frame. Each entity (characters and objects) in the dataset has a corresponding tracking file, named as

    entity global ID.npy

    These are compressed numpy arrays of dimension 75 x 4, with one bounding box per frame defined as (x1, y1, x2, y2). The origin is at the upper left corner of the image; the x-axis corresponds to width and the y-axis to height. After loading this file with numpy, the array can be accessed under the key 'arr_0'.

  • video_frames

    The video directory contains the clip videos. Each video has a corresponding video frame file, named as

    video global ID.npy

    These are compressed numpy arrays of dimension 75 x 128 x 128 x 3 (i.e. 75 frames x 128x128 pixel image x 3 rgb channels)
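All four array formats above are compressed numpy archives and load the same way. A minimal sketch of the loading pattern (the IDs below are made up, and stand-in files are written first so the snippet is self-contained; real files come from flintstones_dataset.tgz):

```python
import tempfile
from pathlib import Path

import numpy as np

# Made-up IDs for illustration; real globalIDs come from the annotations JSON.
clip_id = "clip_0000"
entity_id = "clip_0000_char_0"

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-in files with the documented shapes, written as compressed .npz
    # archives (np.savez_compressed appends the .npz extension).
    np.savez_compressed(root / f"{clip_id}_frames",
                        np.zeros((75, 128, 128, 3), np.uint8))   # video frames
    np.savez_compressed(root / f"{entity_id}_track",
                        np.zeros((75, 4), np.int64))             # bounding boxes

    # Loading pattern: np.load on a compressed archive returns an NpzFile;
    # the single array inside is stored under the key 'arr_0'.
    frames = np.load(root / f"{clip_id}_frames.npz")["arr_0"]
    boxes = np.load(root / f"{entity_id}_track.npz")["arr_0"]

# frames.shape -> (75, 128, 128, 3): 75 frames of 128x128 RGB pixels
# boxes.shape  -> (75, 4): one (x1, y1, x2, y2) box per frame
```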

Annotation Structure

Here is the structure of the textual annotations in flintstones_annotations_v1-0.json:

    [  ### at the top level is a list of clips
        { ### each clip is a dictionary with several fields
            "characters": [
                            {
                                "actions": [""],        # list of verbs associated with character
                                "entityLabel: "",       # label given to character
                                "entitySpan": [],       #   label's span in the clip description
                                "globalID": "",         # character's global id string
                                "labelNPC": ""          # noun phrase chunk associated with label
                                "rectangles": [[], [], []]      # human annotated keyframe bounding boxes
                            }
                            .
                            .
                            .
                        ],
            "description": "",      # A brief natural language description of the clip
            "globalID": "",         # clip global id string
            "objects": [
                            {
                            "entityLabel": "",   # label given to object
                            "entitySpan": [],    # label's span in the clip description
                            "globalID": "",          # character's global id string
                            "labelNPC": "",          # noun phrase chunk associated with label
                            "rectangles": [[], [], []]     # human annotated keyframe bounding boxes

                            },
                                        .
                                        .
                                        .
                        ],
            "parse": {
                "coref": {
                    "clusters": [[[], []], ... ],   # Coref cluster spans
                    "named_clusters": [[[], []], ... ]    # Coref clusters
                },
                "noun_phrase_chunks": {
                    "aligned_description": "",   # description string reformed from token spans
                    "chunks": [[], ...],     # chunk spans
                    "named_chunks": [[], ...],     # noun phrase chunks
                    "token_spans": [[, , ], ...],     # spans for all description tokens
                },
                "pos_tags": [[, ], ]     # part of speach tags
                },
            "setting": ""       # one to two words describing the clip's setting (e.g. room, movie theatre... )
            },
            {
            next clip
            },
            .
            .
            .
        ]
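A hedged sketch of how a clip record following the schema above might be traversed in Python. The values in this record are illustrative placeholders, not taken from the real dataset:

```python
# A minimal clip record following the annotation schema; every value here
# is a made-up placeholder for illustration.
clip = {
    "characters": [
        {
            "actions": ["walk"],
            "entityLabel": "Fred",
            "entitySpan": [0, 1],
            "globalID": "clip_0000_char_0",
            "labelNPC": "Fred",
            # three human-annotated keyframe bounding boxes
            "rectangles": [[10, 20, 60, 120], [12, 20, 62, 120], [14, 20, 64, 120]],
        }
    ],
    "description": "Fred walks through the living room.",
    "globalID": "clip_0000",
    "objects": [],
    "setting": "living room",
}

# Typical access pattern: map each character label to its action verbs.
labels = {c["entityLabel"]: c["actions"] for c in clip["characters"]}
# labels -> {"Fred": ["walk"]}
```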