Dataset Analysis#
Before running training tasks, it is worth performing a comprehensive analysis of your dataset's features. This analysis lets you estimate training memory requirements from statistical characteristics (e.g., the edge distribution of individual graph structures), informs parameter configuration (including the irreducible representations used by the neural network), and helps you set realistic performance expectations and model evaluation baselines.
DFT Data Features#
Command Line Interface#
The dock analyze dataset features command provides comprehensive analysis of DFT dataset characteristics, enabling users to make informed decisions about model architecture and training strategy.
dock analyze dataset features [OPTIONS] DATA_PATH
Basic Usage#
# Analyze dataset with default settings
dock analyze dataset features /path/to/your/data
# Analyze dataset with specified orbital types
dock analyze dataset features . --common-orbital-types s3p2d1f1
# Analyze dataset considering parity and with parallel processing
dock analyze dataset features /path/to/data --consider-parity -p 4
Options Explained#
| Option | Alias | Description | Default |
|---|---|---|---|
| `--parallel-num` | `-p` | Number of parallel processes for analysis; `-1` uses all cores. | `-1` |
| `--tier-num` | `-t` | Directory tier level for DFT data organization: `-1` for `<data_path>/dft`, `0` for `<data_path>/dft/<data_dirs>`, `1` for `<data_path>/dft/<tier1>/<data_dirs>`, etc. | `0` |
| `--common-orbital-types` | | User-defined common orbital types specification (e.g., `s3p2d1f1`). | Auto-detected |
| `--consider-parity` | | Include parity consideration in irreps analysis. | `False` |
| `--help` | `-h` | Display help message. | N/A |
Output Interpretation#
The command generates a comprehensive analysis report with multiple sections:
Example Output for a Water (H₂O) Dataset#
📊 BASIC DATASET INFO
-----------------------
• Spinful: False
• Parity consideration: False
• Total data points: 4,999
🧪 ELEMENT & ORBITAL INFO
---------------------------
• Elements included: H, O (2 elements)
• Orbital source: auto_detected
• Common orbital types: s3p2d1f1
🎯 IRREPS INFORMATION
-----------------------
Irreps Type Irreps Dimension
.................... .................................................. ..........
Common orbital 15x0e+24x1e+22x2e+18x3e+8x4e+3x5e+1x6e 441
Suggested 16x0e+24x1e+24x2e+24x3e+8x4e+4x5e+2x6e 518
Exp2 32x0e+16x1e+8x2e+4x3e+2x4e+2x5e+2x6e 214
Trivial 32x0e+32x1e+32x2e+32x3e+32x4e+32x5e+32x6e 1568
💡 RECOMMENDATIONS
--------------------
1. Moderate dataset size - regular training recommended
2. High-dimensional irreps - consider dimensionality reduction techniques
Key Information Sections#
1. Basic Dataset Info
Spinful: Indicates if the system is spin-polarized (True) or spin-restricted (False)
Parity consideration: Whether spatial inversion symmetry is considered in irreps
Total data points: Number of DFT calculation samples available for training
2. Element & Orbital Info
Elements included: Chemical elements present in the dataset with count
Orbital source: Whether orbital types were auto-detected or user-specified
Common orbital types: Orbital basis specification in `sNpNdNfN` format. Example: `s3p2d1f1` means 3 s-orbitals, 2 p-orbitals, 1 d-orbital, and 1 f-orbital.
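If you need to handle this notation programmatically, here is a minimal standalone sketch (not a deeph-dock API) for parsing it:
import re

def parse_orbital_types(spec: str) -> dict:
    """Parse an sNpNdNfN-style string into {orbital letter: count}."""
    return {l: int(n) for l, n in re.findall(r"([spdf])(\d+)", spec)}

print(parse_orbital_types("s3p2d1f1"))  # {'s': 3, 'p': 2, 'd': 1, 'f': 1}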
3. Irreps Information
The irreps (irreducible representations) table shows four different symmetry-adapted representations:
| Irreps Type | Purpose | When to Use |
|---|---|---|
| Common orbital | Symmetry-adapted basis from detected orbital types | Default choice for most models |
| Suggested | Optimized representation based on symmetry analysis | For best performance with the current dataset |
| Exp2 | Reduced-dimensional representation | When computational efficiency is prioritized |
| Trivial | Complete (non-symmetry-adapted) representation | As a baseline or for debugging |
The notation `NxLe` indicates N copies of the irreducible representation with angular momentum L and parity `e` (even). For example, `15x0e` means 15 scalar (ℓ=0) even-parity representations. Each `NxLe` term contributes N × (2L + 1) to the total dimension.
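As a sanity check, the dimensions in the table above can be reproduced with a minimal standalone sketch (not a deeph-dock API):
def irreps_dim(irreps: str) -> int:
    """Total dimension of an irreps string: each NxLe term adds N * (2L + 1)."""
    total = 0
    for term in irreps.split("+"):
        mult, rest = term.split("x")   # e.g. "15" and "0e"
        l = int(rest.rstrip("eo"))     # drop the parity label
        total += int(mult) * (2 * l + 1)
    return total

print(irreps_dim("15x0e+24x1e+22x2e+18x3e+8x4e+3x5e+1x6e"))     # 441 (Common orbital)
print(irreps_dim("32x0e+32x1e+32x2e+32x3e+32x4e+32x5e+32x6e"))  # 1568 (Trivial)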
4. Dimension Analysis
Each irreps type has a total dimension indicating the feature vector size. This directly impacts:
Model parameter count
Memory requirements
Training time
Model capacity
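To get a rough feel for the parameter-count impact: an equivariant linear layer can only mix channels that share the same irrep, so its weight count scales like the sum of squared multiplicities. A back-of-envelope estimate (not an exact deeph-dock figure):
# Multiplicities of the common orbital irreps 15x0e+24x1e+...+1x6e
mults = [15, 24, 22, 18, 8, 3, 1]
# A square equivariant linear layer mixes only matching irreps,
# giving roughly sum(N_l * N_l) weights.
print(sum(n * n for n in mults))  # 1683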
5. Recommendations
Context-specific suggestions based on dataset characteristics:
Dataset size guidance: Small (<100 samples), Moderate (100-6000), Large (>6000)
Architecture suggestions: Based on irreps dimensionality and system properties
Training strategies: Recommendations for regularization, augmentation, etc.
Practical Usage Scenarios#
1. Initial Dataset Assessment#
# Quick initial assessment with defaults
dock analyze dataset features /your/data/inputs
Use this to get a baseline understanding of your dataset’s scale and complexity.
2. Orbital Basis Optimization#
# Compare different orbital basis choices
dock analyze dataset features ./inputs --common-orbital-types s10p5d3
dock analyze dataset features ./inputs --common-orbital-types s12p7d5f3
Compare the resulting dimensions and select the best balance between accuracy and efficiency.
3. Training Configuration Planning#
Based on the output:
Total data points: Determines train/validation/test split ratios
Irreps dimensions: Guides model architecture design and hidden layer sizes
Memory estimate: Informs GPU memory requirements and batch size selection
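For instance, combining the report numbers above with the default split ratios of dock analyze dataset split gives a quick planning sketch (all numbers illustrative):
# Split counts for the 4,999-point water dataset with the default 0.6/0.2/0.2 ratio
total_points = 4_999
n_train = int(total_points * 0.6)
n_val = int(total_points * 0.2)
n_test = total_points - n_train - n_val
print(n_train, n_val, n_test)  # 2999 999 1001

# Very rough per-sample node-feature memory for the suggested irreps (float32);
# real usage also includes edge features, activations, and gradients.
nodes_per_sample = 3               # H2O has 3 atoms; use your dataset's average
print(nodes_per_sample * 518 * 4)  # 6216 bytes per sample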
4. Performance Benchmarking#
# Analyze with parity consideration for comparison
dock analyze dataset features /path --consider-parity
Compare dimensions with and without parity to understand the impact of symmetry constraints.
Interpreting Dimension Values#
Common orbital dimension: 441
Represents the symmetry-adapted feature space size
Determines the input layer size of the neural network
Higher values capture more complex chemical environments but require more parameters
Suggested vs Common:
The Suggested Irreps are usually the recommended choice.
Suggested dimension (518) > Common (441): The algorithm suggests expanding the representation
This indicates potential benefits from a richer feature space
Exp2 dimension: 214
Approximately half of common orbital dimension
Useful for testing if reduced features still capture essential physics
Trivial dimension: 1568
Non-symmetry-adapted, complete basis
Much larger dimension highlights the compression achieved by symmetry adaptation
Advanced Analysis Tips#
Tier Number Selection: Use the -t option to match your directory structure:
# For data organized as:
# - data/dft/tier1/material1
# - data/dft/tier1/material2
# - data/dft/tier2/material3
# - ...
dock analyze dataset features data -t 1
Parallel Processing: For large datasets, increase the number of parallel processes:
# Use 8 cores for faster analysis
dock analyze dataset features /path -p 8
Orbital Type Notation: The orbital type string follows the `sNpNdNfN` format, where:
s3: 3 s-orbitals (typically 1s, 2s, 3s)
p2: 2 p-orbitals (typically 2p, 3p)
d1: 1 d-orbital (typically 3d)
f1: 1 f-orbital (typically 4f)
Common Output Patterns#
| Dataset Type | Expected Irreps Dimension | Typical Recommendations |
|---|---|---|
| Small molecule | 200-600+ | Data augmentation, transfer learning |
| Bulk material | 2000-5000 | High-capacity model, regularization |
| Spin-polarized | Approximately 2× larger | Ensure spin-symmetry handling |
| Multi-element | Higher dimension | Element-specific embeddings |
Next Steps After Analysis#
Model Architecture Design: Use the irreps dimensions to design your neural network layers
Training Configuration: Set batch size based on memory estimates
Performance Expectations: Use dataset size to estimate training convergence time
Validation Strategy: Split data appropriately based on total samples
This analysis provides the foundational understanding needed to configure effective machine learning models for materials science applications, ensuring appropriate model capacity for your specific dataset while maintaining computational efficiency.
Edge Number Statistics#
Users can use the terminal tool to analyze the number of edges in the structures of the dataset. The edge count directly affects the memory usage and speed of training and inference.
dock analyze dataset edge ./inputs -p 4
A description of the arguments can be shown with
dock analyze dataset edge -h
Usage: dock analyze dataset edge [OPTIONS] DATA_PATH
Statistic and show edge related information.
Options:
-p, --parallel-num INTEGER The parallel processing number, -1 for using all of the cores. [default: -1]
-t, --tier-num INTEGER The tier number of the dft data directory, -1 for <data_path>/dft, 0 for <data_path>/dft/<data_dirs>, 1 for <data_path>/dft/<tier1>/<data_dirs>, etc. [default: 0]
--edge-bins INTEGER The bins count width, default is auto decided.
--plot-dpi INTEGER The bin count plot DPI. [default: 300]
-h, --help Show this message and exit.
For advanced users, deeph-dock provides the DatasetAnalyzer class.
from deepx_dock.analyze.dataset.analyze_dataset import DatasetAnalyzer
data_path = "./inputs"
parallel_num = 4
tier_num = 0
edge_bins = None
plot_dpi = 300
inspector = DatasetAnalyzer(
data_path=data_path,
n_jobs=parallel_num,
n_tier=tier_num,
)
features = inspector.gen_dft_features()
inspector.statistic_edge_quantity(
features=features,
bins=edge_bins,
)
[do] Locate all DFT data directories ...
[rawdata] Processing DFT data in inputs/dft.
+-[search]: 13it [00:00, 10429.60it/s]
[rawdata] Found 13 structures in inputs/dft.
inspector.plot_edge_quantity(dpi=plot_dpi)
Dataset Split Generation#
Users can use the terminal tool to split the dataset into a training set, a validation set, and a test set. A JSON file containing the structure names of each split is written out.
dock analyze dataset split ./inputs -p 4
cat ./dataset_split.json
{"train": ["25001", "1001", "4992", "25000", "4990", "4991", "4995"], "validate": ["25003", "4994"], "test": ["1002", "4993", "25002", "1000"]}
A description of the arguments can be shown with
dock analyze dataset split -h
Usage: dock analyze dataset split [OPTIONS] DATA_PATH
Generate train, validate, and test data split json file.
Options:
-p, --parallel-num INTEGER The parallel processing number, -1 for using all of the cores. [default: -1]
-t, --tier-num INTEGER The tier number of the dft data directory, -1 for <data_path>/dft, 0 for <data_path>/dft/<data_dirs>, 1 for <data_path>/dft/<tier1>/<data_dirs>, etc. [default: 0]
--split-ratio <FLOAT FLOAT FLOAT>...
The train_ratio, validate_ratio and test_ratio of the split data. [default: 0.6, 0.2, 0.2]
--split-max-edge-num INTEGER The max edge number of the split data, -1 for no constraint. [default: -1]
--split-rng-seed INTEGER The random seed for processing dataset. [default: 137]
-h, --help Show this message and exit.
The class API is also provided.
from deepx_dock.analyze.dataset.analyze_dataset import DatasetAnalyzer
data_path = "./inputs"
parallel_num = 4
tier_num = 0
split_ratio = [0.6, 0.2, 0.2]
split_max_edge_num = -1
split_rng_seed = 137
inspector = DatasetAnalyzer(
data_path=data_path,
n_jobs=parallel_num,
n_tier=tier_num,
)
features = inspector.gen_dft_features()
inspector.generate_data_split_json(
features=features,
train_ratio=split_ratio[0],
val_ratio=split_ratio[1],
test_ratio=split_ratio[2],
max_edge_num=split_max_edge_num,
rng_seed=split_rng_seed,
)
[do] Locate all DFT data directories ...
[rawdata] Processing DFT data in `inputs/dft`.
+-[search]: 13it [00:00, 10233.85it/s]
[rawdata] Found `13` structures in `inputs/dft`.
Data Split: 100%|██████████| 13/13 [00:00<00:00, 27.47it/s]
[info] Total available data dirs: 13
[info] Data split json saved to ./dataset_split.json.
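The generated dataset_split.json can then be consumed directly when wiring up training; a minimal sketch (keys and counts match the example file above):
import json

# Read the split produced by `dock analyze dataset split`
with open("./dataset_split.json") as f:
    split = json.load(f)

print(len(split["train"]), len(split["validate"]), len(split["test"]))  # 7 2 4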