Data assets and file structure¶
We represent each scene with a visit_id (6-digit number) and each video sequence with a video_id (8-digit number). For each scene, we provide a high-resolution point cloud generated by combining multiple Faro laser scans of the scene. Additionally, each scene is accompanied by video sequences (three on average) recorded with a 2020 iPad Pro.
File structure¶
Data is organized per visit_id as follows:
```
📦 PATH/TO/DATA/DIR/
 ┣ 📂 <visit_id>/
 ┃ ┣ 📄 <visit_id>_laser_scan.ply: combined Faro laser scan with 5mm resolution
 ┃ ┣ 📄 <visit_id>_crop_mask.npy: binary mask to crop extraneous points from the combined Faro laser scan
 ┃ ┣ 📄 <visit_id>_annotations.json: functional interactive element annotations
 ┃ ┣ 📄 <visit_id>_descriptions.json: natural language task descriptions
 ┃ ┣ 📄 <visit_id>_motions.json: motion annotations
 ┃ ┗ 📂 <video_id>/
 ┃   ┣ 📂 lowres_wide/: low-resolution RGB frames of the wide camera (256x192) - 60 FPS
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.png: images (.png format) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📂 lowres_wide_intrinsics/: camera intrinsics for the low-resolution frames
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.pincam: intrinsics files (.pincam format) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📂 lowres_depth/: depth images acquired with the iPad LiDAR sensor (256x192) - 60 FPS
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.png: depth images (.png format, in millimeters) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📄 lowres_poses.traj: ARKit camera pose trajectory in the ARKit coordinate system
 ┃   ┣ 📂 hires_wide/: high-resolution RGB images of the wide camera (1920x1440 in landscape mode, 1440x1920 in portrait mode) - 10 FPS
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.jpg: images (.jpg format) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📂 hires_wide_intrinsics/: camera intrinsics for the high-resolution images
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.pincam: intrinsics files (.pincam format) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📂 hires_depth/: ground-truth depth images projected from the mesh generated by Faro's laser scans (1920x1440 in landscape mode, 1440x1920 in portrait mode) - 10 FPS
 ┃   ┃ ┣ 📄 <video_id>_<timestamp>.png: depth images (.png format, in millimeters) indexed by timestamp
 ┃   ┃ ┗ •••
 ┃   ┣ 📄 hires_poses.traj: camera poses estimated with COLMAP, in the Faro laser scan coordinate system
 ┃   ┣ 📄 <video_id>_transform.npy: 4x4 transformation matrix that registers the Faro laser scan to the ARKit coordinate system
 ┃   ┣ 📄 <video_id>_3dod_mesh.ply: ARKit 3D mesh reconstruction
 ┃   ┗ 📄 <video_id>.{mov,mp4}: video captured with ARKit (available in both .mov and .mp4 formats)
```
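Both the `lowres_depth` and `hires_depth` PNGs store depth in millimeters. A minimal sketch of converting such an image to float meters, with zero pixels treated as missing depth (the commented file name is an illustrative example, not a real asset):

```python
import numpy as np

def depth_mm_to_m(depth_mm: np.ndarray) -> np.ndarray:
    """Convert a 16-bit depth image in millimeters to float meters.

    Pixels with value 0 carry no depth measurement and are mapped to NaN.
    """
    depth_m = depth_mm.astype(np.float32) / 1000.0
    depth_m[depth_mm == 0] = np.nan
    return depth_m

# Hypothetical usage with any 16-bit-aware PNG reader, e.g. OpenCV:
# depth_mm = cv2.imread("<video_id>_<timestamp>.png", cv2.IMREAD_UNCHANGED)
# depth_m = depth_mm_to_m(depth_mm)
```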
Data download¶
To download the data, please refer to Toolkit > Data Downloader.
Annotation data¶
For each scene, the annotation data are provided in <visit_id>_annotations.json, <visit_id>_descriptions.json and <visit_id>_motions.json. For information about the data format, please refer to Dataset > Annotations.
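The three annotation files are plain JSON, so they can be loaded without the toolkit. A minimal sketch (the function name and the six-digit visit id in the usage comment are hypothetical; the internal structure of each file is described under Dataset > Annotations):

```python
import json
from pathlib import Path

def load_visit_annotations(data_dir: str, visit_id: str) -> dict:
    """Load the three per-visit annotation JSON files into one dict,
    keyed by file suffix ("annotations", "descriptions", "motions")."""
    visit_dir = Path(data_dir) / visit_id
    out = {}
    for suffix in ("annotations", "descriptions", "motions"):
        with open(visit_dir / f"{visit_id}_{suffix}.json") as f:
            out[suffix] = json.load(f)
    return out

# Hypothetical usage:
# data = load_visit_annotations("PATH/TO/DATA/DIR", "422203")
# data["motions"]  # contents of 422203_motions.json
```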
Laser scan and ARKit coordinate system¶
ARKit-related data assets are expressed in the ARKit coordinate system, while Faro-related data assets are expressed in the Faro laser scan coordinate system. Specifically, the ARKit-estimated camera poses and the ARKit 3D mesh reconstruction are expressed in the ARKit coordinate system, whereas the COLMAP-estimated camera poses and the Faro laser scan are expressed in the Faro coordinate system. Note that the ARKit coordinate systems of different video sequences of the same scene are not necessarily the same.
The transformation matrix <video_id>_transform.npy can be used to register the Faro laser scan to the ARKit coordinate system.
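Applying the 4x4 matrix is a standard homogeneous-coordinates operation. A minimal sketch (the helper name is illustrative; the commented file name is an example):

```python
import numpy as np

def transform_points(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transformation to an (N, 3) point array."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ T.T)[:, :3]

# Hypothetical usage: move laser-scan points into the ARKit frame.
# T = np.load("<video_id>_transform.npy")
# points_arkit = transform_points(laser_scan_points, T)
```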
Combined Faro laser scan¶
ARKitScenes provides multiple laser scans per scene (four on average), captured by placing a Faro Focus S70 laser scanner at different positions in the scene. We use the provided scanner poses for each scene to combine the laser scans into a single coordinate system, increasing scene coverage. We then downsample the combined laser scan with a voxel size of 5mm, which is sufficient to preserve the details of the scene's functional interactive elements (e.g., small buttons, knobs, and handles) while keeping the point cloud tractable for machine learning models. All annotations are performed on the combined laser scan. The resulting data asset is <visit_id>_laser_scan.ply.
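The voxel downsampling step can be reproduced with any point cloud library; a NumPy-only sketch of the idea (keeping one point per occupied 5mm voxel; this is an illustration of the technique, not the exact pipeline used to produce the asset):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.005) -> np.ndarray:
    """Keep one representative point per occupied voxel.

    Points are quantized to integer voxel indices; for each occupied
    voxel the first point encountered is retained.
    """
    voxel_keys = np.floor(points / voxel_size).astype(np.int64)
    _, first_idx = np.unique(voxel_keys, axis=0, return_index=True)
    return points[np.sort(first_idx)]
```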
As the laser scan might include extraneous points due to transparent surfaces (e.g., windows), we provide a binary mask (<visit_id>_crop_mask.npy) to crop them.
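Applying the mask is a single indexing operation, assuming the mask holds one boolean per laser-scan point (the helper and the PLY-loading step in the comment are illustrative; any PLY reader works):

```python
import numpy as np

def crop_laser_scan(points: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the laser-scan points flagged by the per-point crop mask."""
    return points[mask.astype(bool)]

# Hypothetical usage:
# mask = np.load("<visit_id>_crop_mask.npy")
# points = ...  # (N, 3) array read from <visit_id>_laser_scan.ply
# cropped = crop_laser_scan(points, mask)
```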
iPad video sequences¶
Each scene is accompanied by iPad video sequences (three on average) and their related data assets. As an improvement over the original ARKitScenes, we provide a larger number of registered high-resolution frames for each video sequence, along with accurate COLMAP-estimated poses and Faro-rendered depth maps.
Furthermore, the provided high-resolution video frames are rotated so that the sky direction is up. As a result, frames may be in landscape (1920x1440) or portrait (1440x1920) orientation.
ARKit versus COLMAP poses¶
ARKit camera poses (lowres_poses.traj) are derived from the iPad's on-device ARKit world tracking. The original ARKitScenes dataset provides high-resolution iPad frames that are not temporally synchronized with these ARKit poses, so rigid-body motion interpolation must be performed to estimate each frame's camera pose, which can introduce errors. When backprojecting from the iPad frames to the laser scan, this interpolation and ARKit's inherent inaccuracies can lead to misalignment errors of approximately 1-2 cm in some cases.
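For intuition, rigid-body motion interpolation between two timestamped poses typically combines linear interpolation of translation with spherical interpolation (slerp) of rotation. A sketch of that idea, assuming SciPy is available (function name and conventions are illustrative, not the exact procedure used by ARKitScenes):

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(t0: float, T0: np.ndarray,
                     t1: float, T1: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between two timestamped 4x4 rigid-body poses:
    slerp on the rotation part, linear blend on the translation part."""
    alpha = (t - t0) / (t1 - t0)
    key_rots = Rotation.from_matrix(np.stack([T0[:3, :3], T1[:3, :3]]))
    R = Slerp([t0, t1], key_rots)([t]).as_matrix()[0]
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = (1 - alpha) * T0[:3, 3] + alpha * T1[:3, 3]
    return T
```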
To address this, we employ SuperGlue and COLMAP to estimate accurate camera poses in the Faro laser scan coordinate system (hires_poses.traj). As input to this pipeline, we sample high-resolution frames from the video sequences at 10 FPS (hires_wide).
The camera poses are stored in .traj files, which contain one line per pose:

```
timestamp angle_axis_x angle_axis_y angle_axis_z translation_x translation_y translation_z
...
```
The .traj files can be easily parsed with our toolkit.
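If you prefer not to depend on the toolkit, a line can be parsed with a few lines of NumPy, converting the angle-axis vector to a rotation matrix via Rodrigues' formula (a sketch; whether the resulting matrix is camera-to-world or its inverse follows the toolkit's convention):

```python
import numpy as np

def parse_traj_line(line: str):
    """Parse one .traj line into (timestamp, 4x4 pose matrix)."""
    vals = [float(v) for v in line.split()]
    ts, rotvec, trans = vals[0], np.array(vals[1:4]), np.array(vals[4:7])
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
        k = rotvec / theta
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = trans
    return ts, T
```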
Camera intrinsics¶
For each image, the intrinsic matrix is stored in a .pincam file. This file contains a single space-delimited line of text with the following fields: