Data Specification
Sift Lab - Data Specification
A datamodel in Sift Lab is made up of three data tables. Each line in the interaction table is linked to a specific user and item in the item and user tables.
classDiagram
`Transaction data` --> Datamodel
`Product data` --> Datamodel
`User data` --> Datamodel
Data encoding
Data must be encoded as UTF-8 + RFC 4180 (see more info at https://tools.ietf.org/html/rfc4180). Below are some specifics regarding the cell and row formatting.
Case | Example |
---|---|
Line break | \n |
Quotes must be escaped | “14”” TV” |
Cell wrap with double if the value contains , or \n | “Hello, world” |
Number, avoid space and comma. | 12345678.9 |
Timestamp/date, use unix timestamp or formatted date string. | 1591718519 or “2020-03-17 15:01:02” or “2006-01-02T15:04:05-0700” |
A valid example
id,value,number
1,"Hello, world!",123.45
2,"14"" TV",123.45
Transaction data
Provide a file containing transactional data between your products and users. Usually represents order rows.
If necessary, you may provide an additional file with data that is associated with an order number. An example would be an order header containing shipping cost, payment fee etc. In this case, provide an order number in the transactional data, and use the order number as the unique identifier in the orders file.
Field | Role | Description |
---|---|---|
user | User (required) | Must uniquely identify a user, same as in User table. |
id | Category (required) | Must uniquely identify a transaction/order row. Can be created from combining other fields. |
item | Item (required) | Must uniquely identify an item, same as in Item table |
date_time | Timestamp (required) | Timestamp at which the interaction occurred. |
timezone | Timezone | Timezone at which the interaction occurred. format (+6) |
revenue | Revenue | Revenue of transaction/order row (This is ex VAT) |
quantity | Quantity | Number columns must be adjusted for quantity, ex the Revenue/Margin column must be equal to the Total Revenue/Margin for that interaction. |
order_id | Category | Must uniquely identify an order |
row_type | Category | specify what type of interaction it is Purchase, Return, etc. |
returned_quantity | Number | negative in the case of returns (placed as a new order with negative quantity) |
discount_rate | Number | Discount for all items in the row. Also called voucher/promo_code (25% discount = 0.25 NOT 25) |
tax_rate | Number | 25% discount = 0.25 NOT 25 |
payment_method | Category | |
country | Category | |
market | Category | |
store | Category | |
sales_channel | Category | ex. retail, ecomm, etc. |
currency | Category | SEK, USD, EUR |
exchange_rate | Number | Exchange rate for the currence at the time the purchase was made |
campaign | Category | |
utm_tag | Category | |
days_to_delivery | Number | Days from placed order to received shipment |
coupons | Category | |
order_status | Category | |
is_return | Category | Yes/No/Partially |
cancelled_quantity | Number | |
ship_to_country | Category | |
ship_to_city | Category | |
handling_fee | Number | cost of handling the order |
shipping_fee | Number | cost of shipping |
shipping_method | Category | Method used, such as: Home delivery, Click & Collect |
shipping_partner | Category | Transportation service provider |
full_price | Number | |
revenue_after_returns | Number | |
freight_cost | Number | |
margin_gm1 | Number | GM1 = Difference between frt rcvd and paid |
margin_gm2 | Number | GM2 = Gross margin after labour |
margin_gm3 | Number | GM3 = Gross margin after deductions |
margin_gm4 | Number | GM4 = Gross margin after commission |
margin_gm5 | Number | GM5 = Gross margin after interest |
margin_gm6 | Number | GM6 = Gross margin after insurance |
...< ANY > | Number, Category | Additional metadata may be added that you find interesting to examine in our analytics. |
Product data
Provide a file containing product (item) meta data. Make sure all product ids that are present in the transactions are also present in this file.
Categorical data in hierarchical structure should be added as separate columns: category_1
, category_2
, etc. If a product can have multiple categories but without hierarchy, provide them as a semi-colon separated string: “categoryA;categoryC:categoryE”
. An example would be genres for a movie, as a movie can be associated with multiple genres in no particular order.
Field | Role | Description |
---|---|---|
item | Item (required) | Must uniquely identify a product, same as in transactions. This is what the machine learning trains on and is what will be used in recommendations |
title | Format | The displayed name that will be associated with the product. |
image | Image | URL to product image. |
link | Category | URL to product page |
sales_price | Number | Current selling price including discounts |
full_price | Number | Price without any discounts |
availability | Category | Yes / No |
in_stock | Number | How many units are sellable in stock |
sku | Category | Stock keeping unit (e.g. specific shirt in a certain color and size) |
product_variant | Category | Variation specified product, usually product page (e.g. specific shirt in a certain color, any size) |
parent_product | Category | The named product (e.g. specific shirt in any color or size) |
category_level_1 | Category | The highest category level |
category_level_2 | Category | 2:nd highest category level |
category_level_3 | Category | 3:rd highest category level |
category_level_4 | Category | 4:th highest category level |
category_level_5,6,7... | Category | etc. |
size | Category | Standard size format (S/M/L, M6/M8/M10, 90/120/140) |
cogs | Number | Cost of goods |
description | Category | Description of item. Limited to 255 characters. |
color | Category | |
brand | Category | |
county_of_origin | Category | Where the product is produced |
condition | Category | Such as New or Used |
collection | Category | such as SS19, SS20, SS21 etc. |
supplier | Category | |
shipping_weight | Number | |
product_launch | Timestamp | Date for product release |
data_source | Category | In the case of multiple datasources, this identifies which source the data is coming from. |
ml_blocked | - | Blocks specific items from the machine learning. Not the same as excluding them by filtration. |
...< ANY > | Number, Category | Additional metadata may be added that you find interesting to examine in our analytics. |
User data
Provide a file containing user meta data. Make sure all user ids that are present in the transactions are also present in this file.
In order to have deletes propagate to our side, provide users with empty data on all fields except id or a “is_deleted”
field that we will use to clear all metadata.
Field | Data type | Note |
---|---|---|
user | User (required) | Must uniquely identify a user, same as in transactions. |
market | Category | |
country | Category | |
country_code | Category | |
region | Category | |
area | Category | |
city | Category | |
disctrict | Category | |
zip_code | Category | |
customer_store | Category | |
first_name | Category | |
last_name | Category | |
age | Number | |
birthdate | Timestamp | |
gender | Category | |
phone_number | Category | |
control_group | Number | Used during tests, can be imported or generated. |
mosaic_group | Category | |
nps | Category | |
accepts_communication | Category | Can be several fields depending on setup. Used to identify users that can be communicated with by Yes/No. |
subscribe_date | Timestamp | |
unsubscribe_date | Timestamp | |
discount_cohort | Category | |
returned_revenue_percent_cohort | Category | |
returned_revenue_cohort | Category | |
revenue_cohort | Category | |
returned_quantity_percentage_cohort | Category | |
quantity_sum | Number | |
quantity_sum_returned | Number | |
quantity_return_percentage | Number | |
total_value_paid | Number | |
total_value_returned | Number | |
total_value_return_percentage | Number | |
average_discount | Number | |
customer_net_value | Number | |
registration_date | Timestamp | |
...< ANY > | Number, Category | Additional metadata may be added that you find interesting to examine in our analytics. |
Marketing spend data (optional)
Provide data related to marketing activities, costs, and associated outcomes (like orders). This data is crucial for analyzing campaign performance, return on investment (ROI), cost per acquisition (CPA), and understanding the effectiveness of different marketing channels and sources. Typically, this involves two types of inputs: one detailing costs and performance metrics (clicks, impressions) and another linking specific orders back to marketing efforts.
The following fields are recommended, often sourced from platforms like Google Ads, Facebook Ads, or internal tracking systems. Ensure consistent naming for campaign_name and source across cost and order data for accurate linkage.
Order Attribution Data
This table details which orders are attributed to specific marketing activities. It links orders back to campaigns and sources, helping understand which efforts drive conversions. The order field should match an order_id in the main Transaction data table.
Field | Role | Description |
---|---|---|
date | Timestamp (required) | The date the order attributed to the marketing activity was placed. |
order | Category (required) | Identifier for the order. Should correspond to order_id in Transaction data. |
campaign_name | Category (required) | Name of the marketing campaign attributed to this order. |
source | Category (required) | The marketing source attributed to this order (e.g., 'google', 'facebook', 'newsletter'). |
is_paid | Category (required) | Indicates if the attributed source was paid ('paid'/'organic', 'true'/'false', '1'/'0'). |
business_area | Category | Business unit or area associated with the order or attributed campaign. |
website | Category | Specific website or domain where the order originated or related to the campaign. |
is_influencer | Category | Flag ('true'/'false', '1'/'0') indicating if the attribution is related to influencer marketing. |
...< ANY > | Category, Number | Additional metadata relevant to order attribution (e.g., specific UTM parameters, landing page). |
Cost Data
This table contains information about the costs incurred and performance metrics for marketing activities on a daily basis, broken down by campaign and source.
Field | Role | Description |
---|---|---|
date | Timestamp (required) | The date the marketing cost was incurred. Daily resolution is required. |
campaign_name | Category (required) | Name of the marketing campaign for which the cost was incurred. |
source | Category (required) | The marketing source associated with the cost (e.g., 'google', 'facebook', 'bing'). |
is_paid | Category (required) | Indicates if the source is paid ('paid'/'organic', 'true'/'false', '1'/'0'). Typically 'true'/'paid' for cost data. |
cost | Number (required) | The amount spent on this date for the specified campaign and source. |
clicks | Number | Number of clicks generated by the campaign/source on this date. |
impressions | Number | Number of impressions generated by the campaign/source on this date. |
channels | Category | Broader marketing channel (e.g., 'SEM', 'Social', 'Display', 'Email'). |
business_area | Category | Business unit or area the marketing spend is associated with. |
website | Category | Specific website or domain related to the campaign spend. |
is_influencer | Category | Flag ('true'/'false', '1'/'0') indicating if the cost is related to influencer marketing. |
...< ANY > | Category, Number | Additional cost or performance metadata (e.g., ad group ID, cost type, conversions reported by platform). |