Cryo-EM Heterogeneous Dataset Simulation¶
You can simulate a heterogeneous dataset from multiple PDB files either through our API, as shown in our Tutorial, or from the command line with the simulate_data command.
Input Format¶
The simulate_data command uses a custom YAML input format. A sample config file is shown below:
# Parameters marked with (*) are optional
# All parameters that can be defined in a range ([min_value, max_value]) can also be
# defined as a single value (e.g., offset_x_in_angstroms: 0.0)
number_of_images: 20000
batch_size_for_generation: 1000 # images generated in parallel
images_per_file: 5000
data_sign: dark-on-light # Relion convention, dark particles on a light background
box_size: 128
pixel_size: 2.0
offset_x_in_angstroms:
- min_value
- max_value
offset_y_in_angstroms:
- min_value
- max_value
amplitude_contrast_ratio: 0.1
phase_shift: 0.0
spherical_aberration_in_mm: 2.7
voltage_in_kilovolts: 300.0
ctf_scale_factor: 1.0
astigmatism_angle_in_degrees:
- min_value
- max_value
astigmatism_in_angstroms: [min_value, max_value]
defocus_in_angstroms:
- min_value
- max_value
envelope_b_factor:
- min_value
- max_value
# Noise is added by normalizing the variance of the signal.
# The following sets the region in the image used to compute the variance.
# We recommend half the box size (may vary with pixel size and offset range).
mask_radius: 64
mask_rolloff_width: 1.0 # smooths the edge of the mask
noise_snr:
- min_value
- max_value
atomic_models_params:
  atom_selection: not element H
  atomic_models_probabilities:
  - 0.7
  - 0.3
  loads_b_factors: true
  path_to_atomic_models: # use of wildcard is permitted, e.g., `path/to/atomic_models/initial_model_*.pdb`
  - path/to/atomic_models/initial_model_0.pdb
  - path/to/atomic_models/initial_model_1.pdb
  - path/to/atomic_models/initial_model_2.pdb
  - ...
overwrite: true
path_to_relion_project: output/directory/for/mrcs/files/
path_to_starfile: path/to/starfile.star
rng_seed: 0 # seed for noise and parameter generation
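The noise comments in the config above can be read as follows: the signal variance is measured inside a circular region of radius mask_radius, and the noise variance is set to that value divided by the chosen SNR. The NumPy sketch below illustrates that reading only. The function name add_noise_at_snr, the hard-edged mask (the real mask is smoothed by mask_rolloff_width), and the toy Gaussian "particle" are all assumptions made for illustration, not the library's implementation.

```python
import numpy as np


def add_noise_at_snr(image, mask_radius, snr, rng):
    """Hypothetical helper: add white Gaussian noise to a square image so that
    (signal variance inside the mask) / (noise variance) equals snr."""
    n = image.shape[0]
    # Pixel coordinates relative to the box center.
    coords = np.arange(n) - n // 2
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    # Hard circular mask; a soft roll-off is omitted for simplicity.
    mask = (xx**2 + yy**2) <= mask_radius**2
    signal_variance = image[mask].var()
    noise_std = np.sqrt(signal_variance / snr)
    noise = rng.normal(0.0, noise_std, size=image.shape)
    return image + noise, noise_std


rng = np.random.default_rng(0)  # plays the role of rng_seed
box_size = 128

# Toy "particle": a Gaussian blob standing in for a projected image.
coords = np.arange(box_size) - box_size // 2
yy, xx = np.meshgrid(coords, coords, indexing="ij")
clean = np.exp(-(xx**2 + yy**2) / (2 * 10.0**2))

noisy, noise_std = add_noise_at_snr(clean, mask_radius=64, snr=0.1, rng=rng)
```

Because the variance is computed only inside the mask, a mask much larger than the particle would include mostly empty background and lower the measured signal variance, which is one reason the choice of mask_radius may depend on pixel size and offset range.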
Comments on the input parameters¶
When multiple models are provided, all structures are aligned to the first model in the list (or the first in alphabetical order if a wildcard is used). This ensures that no spurious heterogeneity is introduced by misaligned models when the true poses are used for reconstruction. All models must share the same topology. Parameters specified as a range are sampled uniformly over that range.
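As a rough illustration of the sampling rule above, the sketch below draws a value uniformly when a parameter is given as [min_value, max_value] and passes a single value through unchanged. It also draws a model index with the weights from atomic_models_probabilities, under the assumption that those probabilities select which model generates each image. The helper sample_parameter and the numeric ranges are hypothetical and use only the Python standard library.

```python
import random


def sample_parameter(value, rng):
    """Hypothetical helper: return a scalar unchanged, or draw uniformly
    from a [min_value, max_value] pair."""
    if isinstance(value, (list, tuple)):
        min_value, max_value = value
        return rng.uniform(min_value, max_value)
    return float(value)


rng = random.Random(0)  # plays the role of rng_seed

# Illustrative values only; they are not recommended settings.
config = {
    "defocus_in_angstroms": [10000.0, 30000.0],  # range: sampled uniformly
    "offset_x_in_angstroms": 0.0,                # single value: used as-is
}
model_probabilities = [0.7, 0.3]

# One set of per-image parameters.
params = {name: sample_parameter(value, rng) for name, value in config.items()}

# Assumed behavior: pick the atomic model for this image by its probability.
model_index = rng.choices(
    range(len(model_probabilities)), weights=model_probabilities, k=1
)[0]
```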
Outputs¶
- *.mrcs: particle images in .mrcs format.
- A STAR file containing pose and CTF information.
- A copy of the config file, written to the path_to_relion_project directory.
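For planning disk usage, the sample config values give a quick estimate of how many image files to expect. This is a back-of-the-envelope sketch that assumes each .mrcs file holds at most images_per_file images and that pixels are stored as 32-bit floats; neither assumption is confirmed by the text above.

```python
import math

number_of_images = 20000
images_per_file = 5000
box_size = 128
bytes_per_pixel = 4  # assumption: 32-bit floating-point pixels

# Number of .mrcs files if each holds at most images_per_file images.
n_files = math.ceil(number_of_images / images_per_file)

# Approximate size of the raw image data, ignoring file headers.
total_bytes = number_of_images * box_size * box_size * bytes_per_pixel

print(n_files)            # 4
print(total_bytes / 1e9)  # 1.31072 (GB)
```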
Loading the data¶
The simulated data can be loaded and inspected using the cryoJAX API:
from cryospax import (
    RelionParticleParameterFile,
    RelionParticleStackDataset,
)

relion_dataset = RelionParticleStackDataset(
    RelionParticleParameterFile(path_to_starfile=...),
    path_to_relion_project=...,
)

# Indexing with a slice loads the first 100 particles.
relion_dataset[0:100]
# Returns a dictionary of the form:
# {
#     "parameters": {"pose": ..., "transfer_theory": ..., "image_config": ...},
#     "images": ...,  # array with images
# }
The inputs for RelionParticleStackDataset should match the corresponding fields in the config file. For more details, see the cryoJAX tutorial on Loading Cryo-EM Images.