Blocks
Aggregate Blocks
Aggregate
This Block allows you to apply aggregate functions such as average, sum, max, min, standard deviation, and count to the columns you specify.
Generate Aggregates
{
"AVERAGE": {
"columns": ["id"]
},
"SUM": {
"columns": ["id"]
},
"MAX": {
"columns": ["id"]
},
"MIN": {
"columns": ["id"]
},
"STD_DEV": {
"columns": ["id"]
},
"COUNT": {
"columns": ["id"]
}
}
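For intuition, here is a minimal pandas sketch of what this configuration computes (pandas is an assumption for illustration; the Block's actual execution engine is not specified):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4]})

# One scalar result per aggregate function, computed over each listed column.
aggregates = {
    "AVERAGE": df["id"].mean(),
    "SUM": df["id"].sum(),
    "MAX": df["id"].max(),
    "MIN": df["id"].min(),
    "STD_DEV": df["id"].std(),
    "COUNT": df["id"].count(),
}
print(aggregates)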
Group By
This Block allows you to group your data by multiple columns and apply the aggregate functions you specify to each group.
GroupBy Configuration
{
"GenerateGroupBy": {
"columns": [
"country",
"population"
],
"aggregate_on": {
"AVERAGE": {
"columns": ["id"]
},
"SUM": {
"columns": ["id"]
},
"MAX": {
"columns": ["id"]
},
"MIN": {
"columns": ["id"]
},
"STD_DEV": {
"columns": ["id"]
},
"COUNT": {
"columns": ["id"]
}
}
}
}
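For intuition, an equivalent pandas sketch (again an assumption, not the Block's actual implementation): rows are grouped by the listed columns, and every function under aggregate_on is applied to its columns within each group.

import pandas as pd

df = pd.DataFrame({
    "country": ["IN", "IN", "US"],
    "population": [10, 10, 20],
    "id": [1, 2, 3],
})

# Group on the configured columns, then aggregate "id" within each group.
result = df.groupby(["country", "population"])["id"].agg(
    ["mean", "sum", "max", "min", "std", "count"]
)
print(result)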
Quartiles
Calculates the quartiles Q1 and Q3, the median, and the interquartile range (IQR) for exploratory data analysis (EDA).
Quartile Columns
{
"column_names": ["test"]
}
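The quantities this Block reports can be reproduced with NumPy (a sketch for reference; the sample values are illustrative):

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = np.percentile(values, 25)      # first quartile
median = np.percentile(values, 50)  # second quartile (median)
q3 = np.percentile(values, 75)      # third quartile
iqr = q3 - q1                       # interquartile range
print(q1, median, q3, iqr)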
Binary Connectors
Binary Reader - HDFS
This Block reads a list of binary files based on paths stored in a column of a CSV file. The input is a data queue containing the file-path column from a CSV file; the output is a binary object pushed to the data queue for each file path.
File Parameters
{
"base_path": "/image_base_path/",
"max_records": 7,
"outputTopicAliasName": "binary_reader_error",
"defaultMimeType": "image/png"
}
Connection Parameters
{
"port": "50070",
"hostName": "localhost"
}
Column Details
{
"columns": [
{
"pathColumn": "PATH",
"outputColumn": "NEW_PATH",
"onReadErrorOutputData": "binary_reader_err",
"MIMEType": "image/png"
}
]
}
Binary Writer - HDFS
This Block reads binary data from the data queue and writes a file to an HDFS location.
File Parameters
{
"base_path": "/base_path/",
"max_records": 7,
"outputTopicAliasName": "binary_reader_error",
"defaultMimeType": "image/png"
}
Connection Parameters
{
"port": "50070",
"hostName": "localhost"
}
Column Details
{
"columns": [
{
"pathColumn": "PATH",
"outputColumn": "NEW_PATH",
"onReadErrorOutputData": "binary_reader_err",
"MIMEType": "image/png"
}
]
}
HDFS Binary Reader (Glob)
Reads binary files from a folder or a glob pattern.
File Parameters
{
"folder_path": "/user/devansh/",
"max_records": -1,
"outputTopicAliasName": "Alias Name",
"MIMEType": "image/png",
"data_output_column": "COL",
"file_path_output_column": ""
}
Image Connectors
Binary To Image Array
This Block converts binary data from the data queue into an image NumPy array and pushes it back to the data queue.
File Parameters
{
"max_records": 7,
"outputTopicAliasName": "binary_reader_error"
}
Column Details
{
"columns": [
{
"imageColumn": "NEW_PATH",
"outputColumn": "ACTUAL_IMAGE",
"resizeDimension": [
28,
28
],
"onReadErrorOutputData": "binary_reader_err"
}
]
}
Image Array To Binary
This Block converts an image NumPy array from the data queue into binary data and pushes it back to the data queue.
File Parameters
{
"max_records": 7,
"outputTopicAliasName": "binary_reader_error"
}
Column Details
{
"columns": [
{
"imageColumn": "ACTUAL_IMAGE",
"outputColumn": "BINARY_IMAGE",
"MIMEType": "image/png",
"onReadErrorOutputData": "binary_reader_err"
}
]
}
Image Transformations
OpenCV - Image Transformer
This Block allows you to use OpenCV functions to manipulate image NumPy arrays from the data queue; the transformed image is pushed back to the data queue.
Column Details
{
"columns": [
{
"imageColumn": "NEW_PATH",
"outputColumn": "ACTUAL_IMAGE",
"onReadErrorOutputData": "image_conversion_err"
}
]
}
File Parameters
{
"max_records": 100,
"outputTopicAliasName": "image_conversion_topic"
}
Image Operations
{
"expression": "image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)"
}
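A sketch of how such an expression behaves, assuming the Block evaluates it with the current array bound to the name image (the dummy array here is illustrative):

import cv2
import numpy as np

# A dummy 28x28 BGR image standing in for an array read from the data queue.
image = np.zeros((28, 28, 3), dtype=np.uint8)

# The configured expression, as it would run: convert BGR to grayscale.
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)  # (28, 28) -- the colour channel is gone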
Row Blocks
Arithmetic Operation
This Block allows you to run Python arithmetic expressions or functions, such as sin(), on your data.
Arithmetic Expression
[
{
"outputColumn": "ID_new",
"outputType": "FLOAT",
"onErrorDefaultValue": 0,
"expression": "np.sin(column('ID'))"
}
]
Constant Key
{
"key1": "val1",
"key2": "val2"
}
Data Join
This Block takes two data inputs and joins them on the given columns. The current version supports joining on a single column only. The join type can be inner, outer, left, or right.
Join Parameters
{
"firstDatasourceColumnName": "id",
"secondDatasourceColumnName": "id",
"targetColumnName": "CREDIT_SCORE_JOINED",
"joinType": "left"
}
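For reference, left-join semantics in a pandas sketch (the sample data is illustrative): every row of the first input is kept, matching rows of the second input are attached, and unmatched rows get missing values.

import pandas as pd

first = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
second = pd.DataFrame({"id": [2, 3], "credit_score": [700, 650]})

# Left join on "id": all rows of `first` survive; id 1 has no match, so NaN.
joined = first.merge(second, on="id", how="left")
print(joined)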
Date Operation
This Block performs operations on dates and pushes the results to the data queue.
Date Expression
[
{
"outputColumn": "Date_New",
"outputType": "DATE",
"onErrorDefaultValue": 0,
"expression": "change_format('DATE1', '%y')"
}
]
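The format string appears to follow Python's strftime directives (an assumption based on the '%y' example above); a sketch of what that directive yields:

from datetime import datetime

# '%y' is the strftime directive for the two-digit year.
d = datetime(2024, 1, 15)
print(d.strftime("%y"))  # "24"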
Drop Duplicates
This Block allows you to drop duplicate rows.
Drop Parameters
{
"columns": [
"<column_name>"
]
}
Filter Records
This Block allows you to select only a set of rows from your data, based on the matching expression you provide.
Filter Expression
[
{
"expression": "column('id') > 4",
"outputTopicAliasName": "filter_op2"
}
]
Constant Key
{
"key1": "val1",
"key2": "val2"
}
Impute Missing Values
This Block allows you to replace missing values such as null, ?, or blank, or any user-defined missing value (for example, <null>), with a custom value of your choice or with a value inferred from a previous Block.
Missing Value Parameters
[
{
"column": "ID",
"replaceValue": "23"
}
]
Constant Key
{
"key1": "val1",
"key2": "val2"
}
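A pandas sketch of the replacement (an assumption for illustration; the Block's own engine is unspecified): missing entries in the configured column are filled with replaceValue.

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, np.nan, 3]})

# Fill missing values in "ID" with the configured replaceValue.
df["ID"] = df["ID"].fillna(23)
print(df)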
Merge
Reads data from multiple datastreams and writes to a single datastream.
Data Source Parameters
[
{
"queueTopicName": "a3422ed4-85af-46d2-aa2f-851ff9c186a2"
},
{
"queueTopicName": "2aa1f73d-c6f0-4810-89b4-d11030e53334"
}
]
Merge Parameters
{
"MergeTopicsInSequence": true
}
Normalize
This Block will normalize the column values with the type you choose. The type can be ZSCORE or MIN_MAX.
Normalization Parameters
{
"type": "ZSCORE",
"columns": [
"MOVIEID",
"TITLE"
]
}
Aggregates Parameters
{
"AVERAGE": {
"MOVIEID": {
"result": 2
},
"TITLE": {
"result": 2
}
},
"STD_DEV": {
"MOVIEID": {
"result": 4.031128874149275
},
"TITLE": {
"result": 2
}
}
}
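For reference, the two normalization types compute the following (a NumPy sketch; in the Block, the mean and standard deviation come from the precomputed aggregates shown above):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# ZSCORE: subtract the mean, divide by the standard deviation.
zscore = (x - x.mean()) / x.std()

# MIN_MAX: rescale values linearly into the [0, 1] interval.
min_max = (x - x.min()) / (x.max() - x.min())
print(zscore, min_max)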
One Hot Encode
This Block will one-hot encode the list of columns you specify from your data.
Encode Details
{
"columnList": [
"gender"
],
"stopOnLimit": 1000
}
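One-hot encoding turns each distinct value of a column into its own 0/1 indicator column; stopOnLimit presumably caps how many distinct values are encoded. A pandas sketch:

import pandas as pd

df = pd.DataFrame({"gender": ["MALE", "FEMALE", "MALE"]})

# Each distinct value in "gender" becomes a separate indicator column.
encoded = pd.get_dummies(df, columns=["gender"])
print(encoded)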
Output Transformer
This Block allows you to transform the given data using an expression.
Transformer Parameters
{
"expression": "key('input[0]')"
}
Randomized Splits
This Block allows you to randomize the given data into the given number of splits.
Randomization Parameters
{
"random_seed": null,
"no_of_splits": 1
}
Schema Modifier
This Block allows you to modify the schema of your data with operations like drop, rename or datatype change of columns.
Schema
{
"DROP_COLUMN": [
"col1",
"id"
],
"SELECT": [
"CREDIT_SCORE"
],
"RENAME_COLUMN": [
{
"oldColName": "col1",
"newColName": "hello"
},
{
"oldColName": "name",
"newColName": "world"
}
],
"UPDATE_DATATYPE": [
{
"colName": "ID",
"dataType": "STRING"
},
{
"colName": "gender",
"dataType": "str"
}
]
}
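A pandas sketch of the four operations, for intuition (pandas is an assumption; note that SELECT, applied last here, keeps only the listed columns):

import pandas as pd

df = pd.DataFrame({"col1": [1], "id": [2], "name": ["x"],
                   "ID": [3], "gender": ["M"], "CREDIT_SCORE": [700]})

df = df.drop(columns=["col1", "id"])       # DROP_COLUMN
df = df.rename(columns={"name": "world"})  # RENAME_COLUMN
df["ID"] = df["ID"].astype(str)            # UPDATE_DATATYPE
df = df[["CREDIT_SCORE"]]                  # SELECT
print(df)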
Text Operation
This Block allows you to apply Python String functions on a column in your data.
Text Expression
[
{
"outputColumn": "ID_new",
"outputType": "STRING",
"onErrorDefaultValue": 0,
"expression": "str(column('input_column'))"
}
]
Sampling Blocks
Sampling
Allows the user to sample data within the specified row range.
Sampling Configuration
{
"Sampling": {
"rows": {
"start_row": 0,
"end_row": 1000
}
}
}
Source Connectors
AWS S3 Reader
Reads from S3 and writes to a datastream.
Connection Parameters
{
"accessKey": "ACCESS_KEY_HERE",
"secretKey": "SECRET_KEY_HERE"
}
Data Source Parameters
{
"fileWithFullPath": "s3://path/to/file.csv",
"delimiter": ",",
"ignoreAndContinue": true,
"outputTopicAliasName": "topicName",
"calculateColumnSize": true,
"maxReadLines": 100,
"hasHeader": true
}
Binary Folder - Google Drive
This source connector will read binary files inside a Google Drive folder.
Data Source Parameters
{
"folderId": "<folder_id>",
"maxFilesToRead": 10,
"fileNameRegex": ""
}
Output Parameters
{
"dataOutputColumn": "file_data",
"filePathColumn": "filePath",
"outputTopicAliasName": "drive_data"
}
CSV - HDFS
This source connector will read a CSV file from an HDFS instance of your choice.
Data Source Parameters
{
"fileWithFullPath": "/path/to/file.csv",
"delimiter": ",",
"ignoreAndContinue": true,
"outputTopicAliasName": "topicName",
"calculateColumnSize": true,
"maxReadLines": 100,
"hasHeader": true
}
Connection Parameters
{
"port": "50070",
"hostName": "localhost"
}
CSV - My Space
This source connector will read a CSV file from the My Space of your Workspace.
Data Source Parameters
{
"fileName": "file_name.csv",
"delimiter": ",",
"ignoreAndContinue": true,
"outputTopicAliasName": "topicName",
"maxReadLines": 100,
"calculateColumnSize": true,
"hasHeader": true
}
CSV - Community Space
This source connector will read a CSV file made public in the Community of your Workspace.
User Details
{
"emailId": "username@domain.com"
}
Data Source Parameters
{
"fileName": "filename.csv",
"maxReadLines": 100,
"delimiter": ",",
"ignoreAndContinue": true,
"outputTopicAliasName": "topicName",
"calculateColumnSize": true,
"hasHeader": true
}
CSV/Sheet - Google Drive
This source connector will read a Google Spreadsheet or a CSV file from Google Drive.
Connection Parameters
{
"email": "username@domain.com"
}
Data Source Parameters
{
"fileId": "<file_id>",
"hasHeader": true,
"outputTopicAliasName": "gSheetData",
"ignoreAndContinue": true,
"calculateColumnSize": true,
"sheet": "sheet_number",
"delimiter": ","
}
MySQL Reader
Reads data from MySQL and pushes it to a datastream.
Data Extraction Query Parameters
{
"query": "select * from TestTable;",
"maxReadLines": 100
}
Connection Parameters
{
"host": "localhost",
"port": "3306",
"username": "root",
"password": "root",
"databaseName": "Test",
"ssl": "false"
}
Structured Data Generator
Generates random data and pushes it to a datastream. allowed_values accepts only discrete values, for example ["1-100"]. Expressions can be specified as in_range(low, high) for INTEGER and FLOAT columns.
Data Source Parameters
{
"maxReadLines": 1000,
"seed": 1234
}
Output Configuration
{
"outputTopicAliasName": "topicName"
}
Column Configuration
{
"columns": [
{
"name": "COLUMN_1",
"type": "INTEGER",
"allowed_values": "",
"expression": "in_range(1, 25)"
},
{
"name": "COLUMN_2",
"type": "FLOAT",
"allowed_values": [
1.2,
3.4
],
"expression": "in_range(1.2, 3.4)"
},
{
"name": "COLUMN_3",
"type": "STRING",
"allowed_values": [
"MALE",
"FEMALE"
]
}
]
}
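A sketch of how such a configuration could be realized with Python's random module (illustrative only; the Block's actual generator is not specified):

import random

random.seed(1234)  # the configured seed makes runs reproducible

row = {
    "COLUMN_1": random.randint(1, 25),              # in_range(1, 25), INTEGER
    "COLUMN_2": random.uniform(1.2, 3.4),           # in_range(1.2, 3.4), FLOAT
    "COLUMN_3": random.choice(["MALE", "FEMALE"]),  # from allowed_values
}
print(row)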
Target Connectors
AWS S3 Writer
Reads data from a datastream and writes to an Amazon S3 bucket.
Connection Parameters
{
"accessKey": "AKIAIFC7KVDN3BTBVAIQ",
"secretKey": "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"
}
Data Target
{
"fileName": "s3://bb-test-for-amit/files/test.csv",
"delimiter": ""
}
Binary - Google Drive
This Block reads binary data from the data queue and writes a file to a Google Drive folder.
Connection Parameters
{
"email": "<email>"
}
Column Details
{
"columns": [
{
"dataColumn": "FILE_DATA",
"outputColumn": "FILEPATH",
"fileName": "'A'+RANDOM()+'.jpg'",
"onReadErrorOutputData": "Binary Writer Error"
}
]
}
File Parameters
{
"folderId": "<folder_id>",
"outputTopicAliasName": "gdrive_binary_image_writer",
"overwrite": true
}
CSV - My Space Writer
This target connector will write a CSV file to the My Space of your Workspace.
Data Target
{
"fileName": "result11.csv",
"delimiter": ",",
"overwrite": true
}
CSV - HDFS Writer
This target connector will write a CSV file with data from the data queue to the HDFS instance specified in configuration.
Connection Parameters
{
"hadoopIp": "192.168.60.53",
"webUiPort": "50070"
}
Data Target
{
"filePath": "/test/result.csv",
"delimiter": ",",
"overwrite": true
}
CSV/Sheet - Google Drive
This target connector will write a sheet or CSV file to Google Drive.
Data Target
{
"fileName": "test_file",
"sheetName": "test_sheet",
"overwrite": true,
"isPreview": false,
"folderId": "<folder_id>",
"batchSize": 100
}
Authentication Parameters
{
"email": "<email>"
}
MySQL Writer
Reads data from a datastream and writes it into a MySQL database.
Connection Parameters
{
"hostName": "localhost",
"port": "3306",
"userName": "root",
"password": "root"
}