Retroactive S3 Specification

Heap is currently developing the next version (v2) of our Amazon S3 export offering to support retroactive data-out capabilities. This product will enable any downstream system (e.g. Hadoop, Snowflake, Fivetran) to access Heap data at scale while taking advantage of codeless event creation, retroactivity, and cross-device user identity. This specification represents the most current information and level of detail around this effort; details are subject to change based on customer feedback and internal requirements.

Process Overview

Heap will provide a periodic dump of data into S3 (nightly by default). Data will be delivered in the form of Avro-encoded files, each of which corresponds to one downstream table (though there can be multiple files per table). Dumps will be incremental, though an individual table dump can be a full resync, depending on whether the table was recently toggled or its event definition was modified.

We’ll include the following tables:

  • users
  • pageviews
  • sessions
  • toggled event tables (separate tables per event)
  • user_migrations (a fully materialized mapping of users merged as a result of heap.identify calls)

Each periodic dump will be accompanied by a dump manifest metadata file, which will describe the target schema and provide a full list of relevant data files for each table.

Metadata

For each dump, there will be a metadata file including the following information:

  • dump_id - a monotonically increasing sequence number for dumps.
  • tables - for each table synced:
    • name - the name of the table.
    • columns - an array of the columns contained in the table; this can be used to determine which columns need to be added or removed downstream. The same information can be derived from the Avro schema, but it is provided here for convenience.
    • files - an array of full s3 paths to the Avro-encoded files for the relevant table.
    • incremental - a boolean denoting whether the data for the table is incremental on top of previous dumps. A value of false means it is a full/fresh resync of this table, and previous data is invalid.
  • property_definitions - S3 path to the defined property definitions file.

An example of this metadata file can be found below:

{
  "dump_id": 1234,
  "tables": [
    {
      "name": "users",
      "files": [
        "s3://customer/sync_1234/users/a97432cba49732.avro",
        "s3://customer/sync_1234/users/584cdba3973c32.avro",
        "s3://customer/sync_1234/users/32917bc3297a3c.avro"
      ],
      "columns": [
        "user_id",
        "last_modified",
        ...
      ],
      "incremental": true
    },
    {
      "name": "user_migrations",
      "files": [
        "s3://customer/sync_1234/user_migrations/2a345bc452456c.avro",
        "s3://customer/sync_1234/user_migrations/4382abc432862c.avro"
      ],
      "columns": [
        "from_user_id",
        "to_user_id",
        ...
      ],
      "incremental": false  // Will always be false for migrations
    },
    {
      "name": "defined_event",
      "files": [
        "s3://customer/sync_1234/defined_event/2fa2dbe2456c.avro"
      ],
      "columns": [
        "user_id",
        "event_id",
        "time",
        "session_id",
        ...
      ],
      "incremental": true
    }
  ],
  "property_definitions": "s3://customer/sync_1234/property_definitions.json"
}
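
To make the manifest concrete, here is a minimal Python sketch of how a downstream consumer might turn one into a load plan. The function name and plan fields are illustrative, not part of the spec:

import json

def plan_load(manifest_text):
    manifest = json.loads(manifest_text)
    plan = []
    for table in manifest["tables"]:
        plan.append({
            "table": table["name"],
            # A full resync (incremental == false) invalidates previously
            # delivered data, so the downstream table should be truncated
            # before loading.
            "truncate_first": not table["incremental"],
            # There may be several files per table; all of them must be loaded.
            "files": table["files"],
            # Diff against the downstream schema to detect added/removed columns.
            "columns": table["columns"],
        })
    return manifest["dump_id"], plan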

Data Delivery

Data will sync directly to customers’ S3 buckets. Customers will create an IAM policy/role for Heap, and we’ll assume that role when dumping to S3. The target S3 bucket name must begin with the prefix heap-rs3- for Heap's systems to have access to it.

Granting Access

Add the following policy to the destination S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1441164338000",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::085120003701:root"
        ]
      }
    }
  ]
}

How Do I Add an S3 Bucket Policy? https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-bucket-policy.html
Bucket Owner Granting Cross-Account Bucket Permissions https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html (Heap is Account B in this scenario)

Completion of a dump is signaled by the delivery of a new manifest file. Clients should poll s3://<BUCKET>/heap_exports/manifests/* for new manifests; upon receipt of a new manifest, downstream ETL can proceed, as in the sketch below.
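
A minimal polling sketch using boto3; the bucket name, polling interval, and callback are illustrative, and pagination and error handling are omitted for brevity:

import time
import boto3

s3 = boto3.client("s3")
BUCKET = "heap-rs3-customer"            # hypothetical bucket name
PREFIX = "heap_exports/manifests/"

def poll_for_manifests(on_new_manifest, interval_seconds=300):
    seen = set()
    while True:
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
        for obj in resp.get("Contents", []):
            if obj["Key"] not in seen:
                seen.add(obj["Key"])
                # Hand the new manifest key to downstream ETL.
                on_new_manifest(obj["Key"])
        time.sleep(interval_seconds)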

Defined Properties JSON File

We will sync defined property definitions daily and provide a JSON file containing all defined properties and their definitions. Downstream consumers will be responsible for applying these definitions to generate the defined property values for each row. The JSON file format is described in detail in this doc.
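
As a purely hypothetical illustration (the real file format lives in the referenced doc), the downstream work might look like the following, where each definition is assumed to map an event name to a property value:

# Hypothetical sketch only: the definition shape below is assumed for
# illustration and is not the actual file format.
def apply_defined_properties(row, definitions):
    # definitions: {"prop_name": {"event_name_to_value": {"signup": "new", ...}}}
    for prop_name, rule in definitions.items():
        row[prop_name] = rule["event_name_to_value"].get(row.get("event_name"))
    return row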

ETL Considerations

  • Data across dumps/files is not guaranteed to be disjoint. As a result, downstream consumers are responsible for de-duplication.
  • Updated users (users whose properties have changed since the last sync) will reappear in the sync files, so every repeated occurrence of a user (keyed on user_id) should replace the old one to ensure property updates are picked up.
  • user_migrations is a fully materialized mapping of from_user_ids to to_user_ids. Downstream consumers are responsible for joining it with the events/users tables to determine which user each event belongs to (see the sketch after this list).
  • For v2, we only sync defined property definitions rather than the actual defined property values. Downstream consumers are responsible for applying these definitions to generate the defined property values for each row.
  • Schemas are expected to evolve over time (e.g. properties can be added to the users and events tables).
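
A minimal sketch of these rules, assuming rows have already been decoded from the Avro files into Python dicts (all function names here are ours, not part of the spec):

# De-duplicate users on user_id: the latest occurrence of a user replaces
# the old one, so property updates are picked up.
def upsert_users(users_by_id, incoming_rows):
    for row in incoming_rows:
        users_by_id[row["user_id"]] = row
    return users_by_id

# migrations is the fully materialized {from_user_id: to_user_id} mapping
# built from the user_migrations table. Events retain their original
# user_id, so this join happens downstream.
def resolve_user(event_user_id, migrations):
    return migrations.get(event_user_id, event_user_id)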