Blocks

Aggregate Blocks

Aggregate

This Block allows you to apply aggregate functions such as average, sum, max, min, standard deviation, and count to the columns you specify.

Generate Aggregates

{
  "AVERAGE": {
    "columns": ["id"]
  },
  "SUM": {
    "columns": ["id"]
  },
  "MAX": {
    "columns": ["id"]
  },
  "MIN": {
    "columns": ["id"]
  },
  "STD_DEV": {
    "columns": ["id"]
  },
  "COUNT": {
    "columns": ["id"]
  }
}
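
For reference, each configured aggregate corresponds to a standard column statistic. A minimal pandas sketch of the same computations, assuming a hypothetical DataFrame with an id column (the platform computes these internally):

import pandas as pd

# Hypothetical input with an "id" column, mirroring the configuration above.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})

# Rough equivalents of the configured aggregate functions.
aggregates = {
    "AVERAGE": df["id"].mean(),
    "SUM": df["id"].sum(),
    "MAX": df["id"].max(),
    "MIN": df["id"].min(),
    "STD_DEV": df["id"].std(),
    "COUNT": df["id"].count(),
}
print(aggregates)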

Group By

This Block allows you to group your data by one or more columns and apply aggregate functions to the grouped rows.

GroupBy Configuration

{
  "GenerateGroupBy": {
    "columns": [
      "country",
      "population"
    ],
    "aggregate_on": {
      "AVERAGE": {
        "columns": ["id"]
      },
      "SUM": {
        "columns": ["id"]
      },
      "MAX": {
        "columns": ["id"]
      },
      "MIN": {
        "columns": ["id"]
      },
      "STD_DEV": {
        "columns": ["id"]
      },
      "COUNT": {
        "columns": ["id"]
      }
    }
  }
}
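
The output contains one row per group, with the requested aggregates computed over each group. A rough pandas sketch of the same idea, using hypothetical country, population, and id values matching the column names in the configuration:

import pandas as pd

# Hypothetical data matching the column names used in the configuration above.
df = pd.DataFrame({
    "country": ["IN", "IN", "US", "US"],
    "population": [100, 100, 200, 300],
    "id": [1, 2, 3, 4],
})

# Group on the configured columns and apply the configured aggregates to "id".
result = df.groupby(["country", "population"])["id"].agg(
    ["mean", "sum", "max", "min", "std", "count"]
)
print(result)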

Quartiles

Calculates the quartiles (Q1, median, Q3) and the interquartile range (IQR) of the specified columns for exploratory data analysis (EDA).

Quartile Columns

{
  "column_names": ["test"]
}
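
For reference, the quantities this Block reports can be sketched with NumPy, assuming a hypothetical numeric column named test as in the configuration:

import numpy as np

# Hypothetical values for the "test" column referenced in the configuration above.
values = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

q1 = np.percentile(values, 25)      # first quartile (Q1)
median = np.percentile(values, 50)  # second quartile (median)
q3 = np.percentile(values, 75)      # third quartile (Q3)
iqr = q3 - q1                       # interquartile range (IQR)

print(q1, median, q3, iqr)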

Binary Connectors

Binary Reader - HDFS

This Block reads a list of binary files based on the paths given in a column of a CSV file. The input is a data queue containing the file path column from a CSV file, and the output is a binary object pushed to the data queue for each file path.

File Parameters

{
  "base_path": "/image_base_path/",
  "max_records": 7,
  "outputTopicAliasName": "binary_reader_error",
  "defaultMimeType": "image/png"
}

Connection Parameters

{
  "port": "50070",
  "hostName": "localhost"
}

Column Details

{
  "columns": [
    {
      "pathColumn": "PATH",
      "outputColumn": "NEW_PATH",
      "onReadErrorOutputData": "binary_reader_err",
      "MIMEType": "image/png"
    }
  ]
}

Binary Writer - HDFS

This Block reads binary data from the data queue and writes a file to an HDFS location.

File Parameters

{
  "base_path": "/base_path/",
  "max_records": 7,
  "outputTopicAliasName": "binary_reader_error",
  "defaultMimeType": "image/png"
}

Connection Parameters

{
  "port": "50070",
  "hostName": "localhost"
}

Column Details

{
  "columns": [
    {
      "pathColumn": "PATH",
      "outputColumn": "NEW_PATH",
      "onReadErrorOutputData": "binary_reader_err",
      "MIMEType": "image/png"
    }
  ]
}

HDFS Binary Reader (Glob)

Reads binary files from a folder or a glob pattern.

File Parameters

{
  "folder_path": "/user/devansh/",
  "max_records": -1,
  "outputTopicAliasName": "Alias Name",
  "MIMEType": "image/png",
  "data_output_column": "COL",
  "file_path_output_column": ""
}

Image Connectors

Binary To Image Array

This Block converts binary data from the data queue into an image NumPy array and pushes it back to the data queue.

File Parameters

{
  "max_records": 7,
  "outputTopicAliasName": "binary_reader_error"
}

Column Details

{
  "columns": [
    {
      "imageColumn": "NEW_PATH",
      "outputColumn": "ACTUAL_IMAGE",
      "resizeDimension": [
        28,
        28
      ],
      "onReadErrorOutputData": "binary_reader_err"
    }
  ]
}
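
Conceptually this is an image decode followed by a resize to resizeDimension. A minimal OpenCV sketch of that idea, using hypothetical in-memory PNG bytes (the Block's internal decoding pipeline may differ):

import cv2
import numpy as np

# Hypothetical PNG bytes; in a pipeline these come from a binary column
# produced by a Binary Reader Block.
dummy = np.zeros((64, 64, 3), dtype=np.uint8)
ok, encoded = cv2.imencode(".png", dummy)
binary_data = encoded.tobytes()

# Decode the raw bytes into an image array, then resize to the configured dimension.
buffer = np.frombuffer(binary_data, dtype=np.uint8)
image = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
image = cv2.resize(image, (28, 28))
print(image.shape)  # (28, 28, 3)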

Image Array To Binary

This Block converts an image NumPy array from the data queue into binary data and pushes it back to the data queue.

File Parameters

{
  "max_records": 7,
  "outputTopicAliasName": "binary_reader_error"
}

Column Details

{
  "columns": [
    {
      "imageColumn": "ACTUAL_IMAGE",
      "outputColumn": "BINARY_IMAGE",
      "MIMEType": "image/png",
      "onReadErrorOutputData": "binary_reader_err"
    }
  ]
}

Image Transformations

OpenCV - Image Transformer

This Block allows you to use OpenCV functions to manipulate image NumPy arrays from the data queue and push them back to the data queue after the transformation.

Column Details

{
  "columns": [
    {
      "imageColumn": "NEW_PATH",
      "outputColumn": "ACTUAL_IMAGE",
      "onReadErrorOutputData": "image_conversion_err"
    }
  ]
}

File Parameters

{
  "max_records": 100,
  "outputTopicAliasName": "image_conversion_topic"
}

Image Operations

{
  "expression": "image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)"
}

Row Blocks

Arithmetic Operation

This Block allows you to run Python arithmetic expressions or functions, such as sin(), on your data.

Arithmetic Expression

[
  {
    "outputColumn": "ID_new",
    "outputType": "FLOAT",
    "onErrorDefaultValue": 0,
    "expression": "np.sin(column('ID'))"
  }
]

Constant Key

{
  "key1": "val1",
  "key2": "val2"
}
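
The expression is evaluated per row, with column('ID') standing in for that row's value of the ID column. A rough NumPy/pandas sketch of the configured np.sin example, under that assumption:

import numpy as np
import pandas as pd

# Hypothetical data with an "ID" column, as referenced by column('ID') above.
df = pd.DataFrame({"ID": [0, 1, 2, 3]})

# Equivalent of the configured expression, cast to the FLOAT output type.
df["ID_new"] = np.sin(df["ID"]).astype(float)
print(df)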

Data Join

This Block takes two data inputs and joins them on the given columns. The current version supports joining on a single column only. The join type can be inner, outer, left, or right.

Join Parameters

{
  "firstDatasourceColumnName": "id",
  "secondDatasourceColumnName": "id",
  "targetColumnName": "CREDIT_SCORE_JOINED",
  "joinType": "left"
}
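
A pandas sketch of the configured left join on a single id column, using hypothetical data (the platform performs the join internally):

import pandas as pd

# Hypothetical first and second data sources, both carrying an "id" column.
first = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
second = pd.DataFrame({"id": [1, 2], "credit_score": [700, 650]})

# Left join on "id", mirroring joinType: "left" in the configuration above.
joined = first.merge(second, on="id", how="left")
print(joined)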

Date Operation

This Block performs operations on date columns and pushes the results to the data queue.

Date Expression

[
  {
    "outputColumn": "Date_New",
    "outputType": "DATE",
    "onErrorDefaultValue": 0,
    "expression": "change_format('DATE1', '%y')"
  }
]
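
change_format is a platform helper; the underlying idea is to parse the date column and re-emit it in the requested format. A hedged Python sketch using strftime with the '%y' format from the configuration:

from datetime import datetime

# Hypothetical value taken from the DATE1 column.
date_value = "2023-05-17"

# Parse the original format, then re-emit it as a two-digit year ('%y').
parsed = datetime.strptime(date_value, "%Y-%m-%d")
reformatted = parsed.strftime("%y")
print(reformatted)  # "23"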

Drop Duplicates

This Block allows you to drop duplicate rows based on the specified columns.

Drop Parameters

{
  "columns": [
    "<column_name>"
  ]
}

Filter Records

This Block allows you to select only those rows from your data that match the expression you provide.

Filter Expression

[
  {
    "expression": "column('id') > 4",
    "outputTopicAliasName": "filter_op2"
  }
]

Constant Key

{
  "key1": "val1",
  "key2": "val2"
}

Impute Missing Values

This Block allows you to replace missing values such as null, ?, or blank, as well as any user-defined missing value (for example, <null>), with a custom value of your choice or with a value inferred from a previous Block.

Missing Value Parameters

[
  {
    "column": "ID",
    "replaceValue": "23"
  }
]

Constant Key

{
  "key1": "val1",
  "key2": "val2"
}
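
In effect, every missing value in the configured column is substituted with replaceValue. A minimal pandas sketch of the same substitution, assuming missing values are represented as NaN:

import numpy as np
import pandas as pd

# Hypothetical "ID" column containing missing values.
df = pd.DataFrame({"ID": [1, np.nan, 3, np.nan]})

# Replace missing values with the configured replaceValue of "23".
df["ID"] = df["ID"].fillna("23")
print(df)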

Merge

Reads data from multiple datastreams and writes it to a single datastream.

Data Source Parameters

[
  {
    "queueTopicName": "a3422ed4-85af-46d2-aa2f-851ff9c186a2"
  },
  {
    "queueTopicName": "2aa1f73d-c6f0-4810-89b4-d11030e53334"
  }
]

Merge Parameters

{
  "MergeTopicsInSequence": true
}

Normalize

This Block normalizes the column values using the type you choose. The type can be ZSCORE or MIN_MAX.

Normalization Parameters

{
  "type": "ZSCORE",
  "columns": [
    "MOVIEID",
    "TITLE"
  ]
}

Aggregates Parameters

{
  "AVERAGE": {
    "MOVIEID": {
      "result": 2
    },
    "TITLE": {
      "result": 2
    }
  },
  "STD_DEV": {
    "MOVIEID": {
      "result": 4.031128874149275
    },
    "TITLE": {
      "result": 2
    }
  }
}
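
ZSCORE uses the mean and standard deviation supplied in the aggregates (z = (x - mean) / std), while MIN_MAX rescales values into the [0, 1] range. A short NumPy sketch of both, treating the aggregate results above as hypothetical inputs:

import numpy as np

# Hypothetical MOVIEID values; mean and std taken from the aggregates above.
x = np.array([1.0, 2.0, 3.0, 7.0])
mean, std = 2, 4.031128874149275

# ZSCORE normalization: centre on the mean, scale by the standard deviation.
zscore = (x - mean) / std

# MIN_MAX normalization: rescale values into the [0, 1] range.
min_max = (x - x.min()) / (x.max() - x.min())

print(zscore, min_max)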

One Hot Encode

This Block one-hot encodes the list of columns you specify from your data.

Encode Details

{
  "columnList": [
    "gender"
  ],
  "stopOnLimit": 1000
}
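
One-hot encoding expands each distinct value of the configured column into its own 0/1 indicator column, up to stopOnLimit distinct values. A small pandas sketch of the idea:

import pandas as pd

# Hypothetical data with a "gender" column as in the configuration above.
df = pd.DataFrame({"gender": ["MALE", "FEMALE", "FEMALE", "MALE"]})

# Each distinct value becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded)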

Output Transformer

This Block allows you to transform the output data using the expression you provide.

Transformer Parameters

{
  "expression": "key('input[0]')"
}

Randomized Splits

This Block allows you to randomize the given data and divide it into the given number of splits.

Randomization Parameters

{
  "random_seed": null,
  "no_of_splits": 1
}
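
Rows are shuffled (optionally with a fixed random_seed for reproducibility) and divided into no_of_splits roughly equal parts. A brief NumPy sketch of that behaviour with a hypothetical two-way split:

import numpy as np

# Hypothetical row indices to shuffle and split.
rows = np.arange(10)

rng = np.random.default_rng(seed=None)  # pass an integer seed for reproducibility
shuffled = rng.permutation(rows)

# Divide the shuffled rows into the configured number of splits (here, 2).
splits = np.array_split(shuffled, 2)
print(splits)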

Schema Modifier

This Block allows you to modify the schema of your data with operations such as dropping columns, renaming columns, or changing column datatypes.

Schema

{
  "DROP_COLUMN": [
    "col1",
    "id"
  ],
  "SELECT": [
    "CREDIT_SCORE"
  ],
  "RENAME_COLUMN": [
    {
      "oldColName": "col1",
      "newColName": "hello"
    },
    {
      "oldColName": "name",
      "newColName": "world"
    }
  ],
  "UPDATE_DATATYPE": [
    {
      "colName": "ID",
      "dataType": "STRING"
    },
    {
      "colName": "gender",
      "dataType": "str"
    }
  ]
}

Text Operation

This Block allows you to apply Python string functions to a column in your data.

Text Expression

[
  {
    "outputColumn": "ID_new",
    "outputType": "STRING",
    "onErrorDefaultValue": 0,
    "expression": "str(column('input_column'))"
  }
]

Sampling Blocks

Sampling

Allows you to sample rows of data within the specified row range.

Sampling Configuration

{
  "Sampling": {
    "rows": {
      "start_row": 0,
      "end_row": 1000
    }
  }
}

Source Connectors

AWS S3 Reader

Reads data from Amazon S3 and writes it to a datastream.

Connection Parameters

{
  "accessKey": "ACCESS_KEY_HERE",
  "secretKey": "SECRET_KEY_HERE"
}

Data Source Parameters

{
  "fileWithFullPath": "s3://path/to/file.csv",
  "delimiter": ",",
  "ignoreAndContinue": true,
  "outputTopicAliasName": "topicName",
  "calculateColumnSize": true,
  "maxReadLines": 100,
  "hasHeader": true
}

Binary Folder - Google Drive

This source connector will read Binary files inside a Google Drive folder.

Data Source Parameters

{
  "folderId": "<folder_id>",
  "maxFilesToRead": 10,
  "fileNameRegex": ""
}

Output Parameters

{
  "dataOutputColumn": "file_data",
  "filePathColumn": "filePath",
  "outputTopicAliasName": "drive_data"
}

CSV - HDFS

This source connector will read a CSV file from an HDFS instance of your choice.

Data Source Parameters

{
  "fileWithFullPath": "/path/to/file.csv",
  "delimiter": ",",
  "ignoreAndContinue": true,
  "outputTopicAliasName": "topicName",
  "calculateColumnSize": true,
  "maxReadLines": 100,
  "hasHeader": true
}

Connection Parameters

{
  "port": "50070",
  "hostName": "localhost"
}

CSV - My Space

This source connector will read a CSV file from the My Space of your Workspace.

Data Source Parameters

{
  "fileName": "file_name.csv",
  "delimiter": ",",
  "ignoreAndContinue": true,
  "outputTopicAliasName": "topicName",
  "maxReadLines": 100,
  "calculateColumnSize": true,
  "hasHeader": true
}

CSV - Community Space

This source connector will read a CSV file made public in the Community of your Workspace.

User Details

{
  "emailId": "username@domain.com"
}

Data Source Parameters

{
  "fileName": "filename.csv",
  "maxReadLines": 100,
  "delimiter": ",",
  "ignoreAndContinue": true,
  "outputTopicAliasName": "topicName",
  "calculateColumnSize": true,
  "hasHeader": true
}

CSV/Sheet - Google Drive

This source connector will read a Google Spreadsheet or CSV file from Google Drive.

Connection Parameters

{
  "email": "username@domain.com"
}

Data Source Parameters

{
  "fileId": "<file_id>",
  "hasHeader": true,
  "outputTopicAliasName": "gSheetData",
  "ignoreAndContinue": true,
  "calculateColumnSize": true,
  "sheet": "sheet_number",
  "delimiter": ","
}

MySQL Reader

Reads data from MySQL and pushes it to a datastream.

Data Extraction Query Parameters

{
  "query": "select * from TestTable;",
  "maxReadLines": 100
}

Connection Parameters

{
  "host": "localhost",
  "port": "3306",
  "username": "root",
  "password": "root",
  "databaseName": "Test",
  "ssl": "false"
}

Structured Data Generator

Generates random data and pushes it to a datastream. The allowed_values field accepts only discrete values, specified as, for example, ["1-100"]. Expressions can be specified as in_range(low, high) for INTEGER and FLOAT columns.

Data Source Parameters

{
  "maxReadLines": 1000,
  "seed": 1234
}

Output Configuration

{
  "outputTopicAliasName": "topicName"
}

Column Configuration

{
  "columns": [
    {
      "name": "COLUMN_1",
      "type": "INTEGER",
      "allowed_values": "",
      "expression": "in_range(1, 25)"
    },
    {
      "name": "COLUMN_2",
      "type": "FLOAT",
      "allowed_values": [
        1.2,
        3.4
      ],
      "expression": "in_range(1.2, 3.4)"
    },
    {
      "name": "COLUMN_3",
      "type": "STRING",
      "allowed_values": [
        "MALE",
        "FEMALE"
      ]
    }
  ]
}
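
A hedged Python sketch of what the column configuration above asks for: uniformly random integers and floats within the in_range bounds, and strings drawn from allowed_values (the generator itself is internal to the platform):

import random

random.seed(1234)  # matches the configured seed

def generate_row():
    return {
        "COLUMN_1": random.randint(1, 25),              # in_range(1, 25) for INTEGER
        "COLUMN_2": random.uniform(1.2, 3.4),           # in_range(1.2, 3.4) for FLOAT
        "COLUMN_3": random.choice(["MALE", "FEMALE"]),  # allowed_values for STRING
    }

rows = [generate_row() for _ in range(5)]
print(rows)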

Target Connectors

AWS S3 Writer

Reads data from a datastream and writes to an Amazon S3 bucket.

Connection Parameters

{
  "accessKey": "AKIAIFC7KVDN3BTBVAIQ",
  "secretKey": "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"
}

Data Target

{
  "fileName": "s3://bb-test-for-amit/files/test.csv",
  "delimiter": ""
}

Binary - Google Drive

This Block reads binary data from the data queue and writes a file to a Google Drive folder.

Connection Parameters

{
  "email": "<email>"
}

Column Details

{
  "columns": [
    {
      "dataColumn": "FILE_DATA",
      "outputColumn": "FILEPATH",
      "fileName": "'A'+RANDOM()+'.jpg'",
      "onReadErrorOutputData": "Binary Writer Error"
    }
  ]
}

File Parameters

{
  "folderId": "<folder_id>",
  "outputTopicAliasName": "gdrive_binary_image_writer",
  "overwrite": true
}

CSV - My Space Writer

This target connector will write a CSV file to the My Space of your Workspace.

Data Target

{
  "fileName": "result11.csv",
  "delimiter": ",",
  "overwrite": true
}

CSV - HDFS Writer

This target connector will write a CSV file with data from the data queue to the HDFS instance specified in configuration.

Connection Parameters

{
  "hadoopIp": "192.168.60.53",
  "webUiPort": "50070"
}

Data Target

{
  "filePath": "/test/result.csv",
  "delimiter": ",",
  "overwrite": true
}

CSV/Sheet - Google Drive

This target connector will write a sheet or CSV file to Google Drive.

Data Target

{
  "fileName": "test_file",
  "sheetName": "test_sheet",
  "overwrite": true,
  "isPreview": false,
  "folderId": "<folder_id>",
  "batchSize": 100
}

Authentication Parameters

{
  "email": "<email>"
}

MySQL Writer

Reads data from a datastream and writes into a MySQL database.

Connection Parameters

{
  "hostName": "localhost",
  "port": "3306",
  "userName": "root",
  "password": "root"
}