
Data Analysis Nodes

A comprehensive toolkit for statistical analysis, outlier detection, correlation analysis, and data preprocessing in machine learning workflows.

Node Reference

Detailed documentation for each data analysis node available in Bioshift.

Statistical Summary

Generate comprehensive statistical summaries of numerical data

Type: stat_summary
Category: Statistics

Key Features

  • Mean, median, mode calculations
  • Standard deviation and variance
  • Min/max values and range
  • Skewness and kurtosis
  • Percentiles and quartiles

Input Ports

data (data)

DataFrame with numerical columns

Output Ports

summary_stats (data)

Statistical summary DataFrame

descriptive_stats (data)

Detailed descriptive statistics
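
The node's outputs can be approximated in a few lines of pandas. A minimal sketch, assuming a DataFrame input; the stat_summary function name and the sample data are illustrative, not Bioshift's actual implementation:

```python
import pandas as pd

def stat_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate the summary_stats output for all numerical columns."""
    num = df.select_dtypes("number")
    summary = num.describe().T                # count, mean, std, min, quartiles, max
    summary["median"] = num.median()
    summary["mode"] = num.mode().iloc[0]      # first mode of each column
    summary["variance"] = num.var()
    summary["range"] = num.max() - num.min()
    summary["skewness"] = num.skew()
    summary["kurtosis"] = num.kurt()          # excess kurtosis (0 for a normal distribution)
    return summary

df = pd.DataFrame({"a": [1, 2, 2, 3, 10], "b": [0.5, 0.7, 0.6, 0.9, 0.8]})
print(stat_summary(df))
```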

Correlation Analysis

Compute correlation matrices and analyze relationships between variables

Type: correlation_analysis
Category: Statistics

Key Features

  • Pearson correlation coefficient
  • Spearman rank correlation
  • Correlation significance testing
  • Heatmap visualization
  • Threshold-based filtering

Input Ports

data (data)

DataFrame with numerical columns

Output Ports

correlation_matrix (data)

Correlation coefficient matrix

correlation_plot (image)

Heatmap visualization

significant_pairs (data)

Highly correlated variable pairs
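
As a rough illustration of what the node computes, the sketch below builds the correlation matrix, runs a per-pair significance test, and filters pairs by threshold using pandas and SciPy. The function name and the 0.7 default threshold are assumptions:

```python
from itertools import combinations

import pandas as pd
from scipy import stats

def correlation_analysis(df, method="pearson", threshold=0.7):
    num = df.select_dtypes("number")
    corr = num.corr(method=method)             # "pearson" or "spearman"
    pairs = []
    for a, b in combinations(num.columns, 2):  # significance test for each pair
        test = stats.pearsonr if method == "pearson" else stats.spearmanr
        r, p = test(num[a], num[b])
        if abs(r) >= threshold:                # threshold-based filtering
            pairs.append({"var_a": a, "var_b": b, "r": r, "p_value": p})
    return corr, pd.DataFrame(pairs)
```

A heatmap like the correlation_plot output can then be rendered from corr, for example with matplotlib's imshow.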

Outlier Detection

Identify and handle outliers using multiple statistical methods

Type: outlier_detection
Category: Statistics

Key Features

  • Z-score method
  • IQR (Interquartile Range) method
  • Isolation Forest algorithm
  • Local Outlier Factor (LOF)
  • Multiple outlier handling options

Input Ports

data (data)

DataFrame with numerical columns

Output Ports

cleaned_data (data)

Data with outliers removed/replaced

outlier_indices (data)

Indices of detected outliers

outlier_report (data)

Detailed outlier analysis report
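
For illustration, the two classical methods fit in a few lines of pandas; the model-based methods map to scikit-learn's IsolationForest and LocalOutlierFactor estimators. The function and parameter names below are assumptions, not the node's actual code:

```python
import pandas as pd

def detect_outliers(df, column, method="iqr", z_thresh=3.0):
    x = df[column]
    if method == "zscore":                     # flag points with |z| above the threshold
        z = (x - x.mean()) / x.std()
        mask = z.abs() > z_thresh
    else:                                      # IQR rule: outside the 1.5*IQR fences
        q1, q3 = x.quantile(0.25), x.quantile(0.75)
        iqr = q3 - q1
        mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    outlier_indices = df.index[mask]
    cleaned = df.drop(outlier_indices)         # "remove" handling; clipping or replacing also works
    return cleaned, outlier_indices
```

The model-based methods follow the same pattern; for example, IsolationForest(contamination=0.05).fit_predict(...) labels outliers with -1.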

Data Normalization

Normalize and scale data using various transformation methods

Type: data_normalization
Category: Preprocessing

Key Features

  • Min-Max scaling (0-1 range)
  • Standard scaling (z-score)
  • Robust scaling (median/IQR based)
  • Log transformation
  • Box-Cox transformation

Input Ports

data (data)

DataFrame with numerical columns

Output Ports

normalized_data (data)

Normalized data

scaling_params (data)

Scaling parameters for inverse transform

normalization_report (data)

Normalization method report
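
A minimal min-max sketch that also shows why the scaling_params output exists: storing the column minima and maxima makes the transform invertible. Names are illustrative:

```python
import pandas as pd

def min_max_normalize(df):
    num = df.select_dtypes("number")
    params = {"min": num.min(), "max": num.max()}  # kept for the inverse transform
    normed = (num - params["min"]) / (params["max"] - params["min"])
    return normed, params

def inverse_min_max(normed, params):
    """Recover the original scale from the stored parameters."""
    return normed * (params["max"] - params["min"]) + params["min"]
```

Standard scaling stores the mean and standard deviation instead, and robust scaling the median and IQR; the inverse transform works the same way.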

Feature Engineering

Create new features through mathematical transformations

Type: feature_engineering
Category: Preprocessing

Key Features

  • Polynomial features
  • Interaction features
  • Mathematical transformations
  • Binning/discretization
  • Feature selection methods

Input Ports

data (data)

DataFrame with numerical columns

Output Ports

engineered_data (data)

Data with new features

feature_importance (data)

Feature importance scores

transformation_log (data)

Applied transformations log
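
As a sketch, polynomial and interaction features correspond to scikit-learn's PolynomialFeatures, and binning to pandas' cut; the wrapper below illustrates the idea and is not the node's actual code:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

def engineer_features(df, degree=2, bins=4):
    num = df.select_dtypes("number")
    # polynomial and interaction terms (e.g. a^2, a*b) up to the given degree
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    out = pd.DataFrame(poly.fit_transform(num),
                       columns=poly.get_feature_names_out(num.columns),
                       index=num.index)
    # equal-width binning / discretization of each original column
    for col in num.columns:
        out[f"{col}_bin"] = pd.cut(num[col], bins=bins, labels=False)
    return out
```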

Workflow Examples

Common data analysis workflows you can build with these nodes.

Data Quality Assessment

Complete workflow for analyzing data quality and preparing data for ML; a code sketch follows the steps below

  1. Load dataset using CSV Reader node
  2. Generate statistical summary
  3. Detect and handle outliers
  4. Check for missing values
  5. Analyze feature correlations
  6. Normalize numerical features
  7. Create derived features
  8. Export clean dataset
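
Outside Bioshift, the same pipeline looks roughly like this in pandas; the file names and the derived row_mean feature are placeholders:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")                       # 1. load (placeholder file name)
num_cols = df.select_dtypes("number").columns

print(df[num_cols].describe())                        # 2. statistical summary

q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1                                         # 3. clip outliers to the 1.5*IQR fences
df[num_cols] = df[num_cols].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)

print(df.isna().sum())                                # 4. missing values per column

print(df[num_cols].corr())                            # 5. feature correlations

df[num_cols] = (df[num_cols] - df[num_cols].min()) / (
    df[num_cols].max() - df[num_cols].min())          # 6. min-max normalization

df["row_mean"] = df[num_cols].mean(axis=1)            # 7. one derived feature

df.to_csv("clean_dataset.csv", index=False)           # 8. export the clean dataset
```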

Exploratory Data Analysis

Comprehensive EDA workflow for understanding dataset characteristics; a code sketch follows the steps below

  1. Import data from multiple sources
  2. Generate statistical summaries
  3. Create correlation heatmaps
  4. Plot distribution histograms
  5. Identify data patterns
  6. Detect anomalies and outliers
  7. Generate automated insights
  8. Create summary report
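
A compressed pandas/matplotlib version of the core steps (summaries, heatmap, histograms, z-score anomaly flags); pattern mining, automated insights, and report generation are omitted here. The input file name is a placeholder:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("dataset.csv")                # placeholder input
num = df.select_dtypes("number")

print(num.describe())                          # statistical summaries

fig, ax = plt.subplots()                       # correlation heatmap
im = ax.imshow(num.corr(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(num.columns)), num.columns, rotation=90)
ax.set_yticks(range(len(num.columns)), num.columns)
fig.colorbar(im)

num.hist(figsize=(10, 8))                      # distribution histograms

z = (num - num.mean()) / num.std()             # z-score anomaly flags
print((z.abs() > 3).sum())                     # outlier count per column
plt.show()
```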

Feature Selection Pipeline

Automated feature selection and engineering workflow; a code sketch follows the steps below

  1. Load training dataset
  2. Calculate feature correlations
  3. Remove highly correlated features
  4. Apply feature scaling
  5. Generate polynomial features
  6. Select top features
  7. Validate with cross-validation
  8. Export feature set
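
A scikit-learn sketch of the pipeline; a synthetic regression dataset stands in for the real training data, and the 0.9 correlation cutoff and k=10 are illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# 1. load a training set (synthetic stand-in)
Xa, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(8)])

# 2-3. drop one of every pair of features correlated above 0.9
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# 4. scale features
Xs = StandardScaler().fit_transform(X)

# 5. polynomial expansion
Xp = PolynomialFeatures(degree=2, include_bias=False).fit_transform(Xs)

# 6. keep the top k features by univariate F-score
Xk = SelectKBest(f_regression, k=10).fit_transform(Xp, y)

# 7. validate with 5-fold cross-validation
print(cross_val_score(Ridge(), Xk, y, cv=5).mean())

# 8. export the selected feature set
pd.DataFrame(Xk).to_csv("feature_set.csv", index=False)
```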