Creating Projects
Introduction
The MDV project structure contains all the information required to store data and views about the biological study in question. We will use the pbmc3k project as an example:
pbmc3k-mdv
├── datafile.h5
├── datasources.json
├── scanpy-pbmc3k.h5ad
├── state.json
├── tracks
└── views.json
Briefly:
- datafile.h5
  - HDF5 data file containing the actual dataset (in this case, single-cell RNA sequencing data)
  - Binary format for efficient storage and access of large scientific datasets
- datasources.json
  - Contains metadata and schema information for data columns/fields
  - Defines data types and statistical summaries (min/max, quantiles) for fields such as n_genes and percent_mito
  - Describes cell-level and gene-level data characteristics for visualization purposes
- scanpy-pbmc3k.h5ad
  - A Scanpy AnnData file containing single-cell data (in this example, PBMC = peripheral blood mononuclear cells, 3k = ~3000 cells)
  - Only needs to be included for ChatMDV projects, because ChatMDV uses Scanpy for certain operations
- state.json
  - Simple configuration file defining available views ("default") and user permissions ("edit")
  - Controls initial view settings and access levels
- tracks (only used in genomics-based projects)
  - Directory containing genomic track files (hg38.bed.gz and index file)
  - Template files for genome browser functionality
- views.json
  - Defines visualization view configurations with empty initial charts for cells and genes
  - Sets up the default view structure for the data visualization interface
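The layout above can be checked programmatically before a project is shared or merged. The following is a minimal sketch using only the Python standard library. The helper name `missing_core_files` is hypothetical, and treating these four files as the required core follows the list given under Merging Considerations.

```python
from pathlib import Path

# The four files that the Merging Considerations section lists as required.
CORE_FILES = ("datafile.h5", "datasources.json", "views.json", "state.json")


def missing_core_files(project_dir):
    """Return the core file names that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in CORE_FILES if not (root / name).is_file()]
```

An empty list means all four files are present. The check says nothing about whether their contents are valid.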
This project structure can be generated using mdvtools, a lightweight version of MDV without the catalogue interface and infrastructure that come with the full Docker installation. mdvtools is aimed at advanced users and data-processing workflows, and provides command-line tools for creating projects directly from common bioinformatics file formats.
Installation of mdvtools
To ensure a clean and isolated environment for your project, it's recommended to create a Python virtual environment. This helps manage dependencies and avoid conflicts with other projects. Follow these steps to set up a virtual environment and install mdvtools:
- Create a Virtual Environment: Navigate to your project directory and run the following command to create a virtual environment named `venv`: `python -m venv venv`
- Activate the Virtual Environment:
  - On Windows, use: `.\venv\Scripts\activate`
  - On macOS and Linux, use: `source venv/bin/activate`
- Install mdvtools: With the virtual environment activated, install `mdvtools` using pip: `pip install mdvtools`
By using a virtual environment, you ensure that mdvtools and its dependencies are contained within your project, reducing the risk of version conflicts.
Installing the package provides access to the mdvtools command-line interface.
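To confirm the setup from inside Python, one option is to check whether the interpreter is running in a virtual environment and whether a command is on the `PATH`. This is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the command-line entry point is called `mdvtools`, as in the commands on this page.

```python
import shutil
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def cli_on_path(name="mdvtools"):
    """True when an executable called `name` can be found on the PATH."""
    return shutil.which(name) is not None
```

If `in_virtualenv()` is true but `cli_on_path()` is false, the package was most likely installed into a different environment from the one that is currently active.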
Supported Input Formats
| Input Type | File Format | CLI Command | Supports Zip |
|---|---|---|---|
| Scanpy | .h5ad | convert-scanpy | ✅ |
| MuData | .h5mu | convert-mudata | ✅ |
| Spatial | .zarr | convert-spatial | ✅ |
| VCF | .vcf | convert-vcf | ✅ |
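In a scripted workflow, the table can be expressed as a lookup from file extension to subcommand. The sketch below is illustrative only. The names `CONVERTERS` and `subcommand_for` are hypothetical, and the mapping contains nothing beyond the four rows of the table.

```python
from pathlib import Path

# Extension -> CLI subcommand, copied from the table above.
CONVERTERS = {
    ".h5ad": "convert-scanpy",
    ".h5mu": "convert-mudata",
    ".zarr": "convert-spatial",
    ".vcf": "convert-vcf",
}


def subcommand_for(path):
    """Return the mdvtools subcommand matching the extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter listed for extension {suffix!r}")
    return CONVERTERS[suffix]
```

The spatial converter takes a folder containing SpatialData objects rather than a single `.zarr` path, so the lookup only identifies the subcommand, not the argument to pass.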
Converting Scanpy Files (.h5ad)
Convert a Scanpy .h5ad file to an MDV project:
mdvtools convert-scanpy output_folder example.h5ad \
--chunk_data \
--add_layer_data=False \
--max_dims=3 \
--delete_existing
With ChatMDV support (enables natural language interaction):
mdvtools convert-scanpy output_folder example.h5ad \
--chatmdv \
--zip
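When many files need converting, the same command can be assembled and launched from Python. The following is a minimal sketch, not part of mdvtools. The helper names `scanpy_command` and `run_conversion` are hypothetical, and only flags that appear in the commands above are used.

```python
import subprocess


def scanpy_command(output_folder, h5ad_path, chatmdv=False, zip_output=False):
    """Build the argument list for `mdvtools convert-scanpy`."""
    cmd = ["mdvtools", "convert-scanpy", str(output_folder), str(h5ad_path)]
    if chatmdv:
        cmd.append("--chatmdv")
    if zip_output:
        cmd.append("--zip")
    return cmd


def run_conversion(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.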
Converting MuData Files (.h5mu)
Convert multi-modal data to MDV format:
mdvtools convert-mudata output_folder example.h5mu \
--chunk_data --max_dims=3 --delete_existing
Converting SpatialData Objects
MDV can convert SpatialData objects (the standardized format for spatial omics data) directly into MDV projects. This conversion handles multiple SpatialData objects, merges their tables, resolves spatial regions, and transforms coordinates.
Command Line Usage
Convert SpatialData objects using the command line:
mdvtools convert-spatial <output_folder> <spatialdata_path>
Arguments:
- `output_folder`: Path where the MDV project will be created
- `spatialdata_path`: Path to the folder containing one or more SpatialData objects (typically `.zarr` files)
Example:
mdvtools convert-spatial mdvproject_lung ./spatialdata_lung_data
Options
- `--preserve-existing`: Preserve existing project data when converting
- `--link`: Create symbolic links to the original SpatialData objects instead of copying them
- `--output-geojson`: Output GeoJSON files for each region (deprecated in favor of spatialdata.js layers)
- `--serve`: Automatically serve the project after conversion
- `--point-transform`: Strategy for coordinate transformation (`auto`, `image`, `xenium`, `identity`, `annotated-element`). Default is `auto`.
Example with options:
mdvtools convert-spatial output_folder ./spatialdata --link --serve
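Because the converter takes a folder that may hold several SpatialData objects, it can be useful to list what that folder contains before converting. This is a minimal sketch with a hypothetical helper name. It assumes each object is stored as a directory whose name ends in `.zarr`, which is the usual on-disk layout but is not checked by opening the stores.

```python
from pathlib import Path


def find_zarr_stores(folder):
    """Return the sorted names of `.zarr` directories directly inside `folder`."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_dir() and entry.suffix == ".zarr"
    )
```

An empty result suggests that the path points at a single store, or at the wrong directory, rather than at a folder of stores.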
Converting VCF Files
Convert genomic variant data:
mdvtools convert-vcf output_folder variants.vcf
Creating Zip Archives
Add --zip to any conversion command to create a zip file for easy web upload:
mdvtools convert-scanpy output_folder example.h5ad --zip
Merging Projects
Merge one MDV project into another existing project. This is useful for:
- Integrating datasets from different batches or experiments
- Combining data from multiple samples or conditions
- Adding datasources from one project into a consolidated project
Command Line Usage
mdvtools merge-project <base_project> <extra_project>
Arguments:
- `base_project`: Path to the existing MDV project that will receive the merged data (modified in-place)
- `extra_project`: Path to the MDV project to merge into the base project
Example:
mdvtools merge-project ./project1 ./project2
This merges project2 into project1, with project1 being updated to include datasources and views from project2.
Options
- `--prefix`: Prefix applied to imported datasource names. If omitted, a prefix is automatically derived from the extra project path to avoid naming collisions.
- `--view-prefix`: Prefix applied to imported view names. Defaults to the datasource prefix (with trailing underscore removed).
Example with options:
mdvtools merge-project ./main_project ./batch2 --prefix "batch2_" --view-prefix "B2"
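The relationship between the two prefixes can be illustrated in a few lines. This sketch is not the mdvtools implementation. The function names are hypothetical, and it assumes that the datasource prefix is simply prepended to each imported name and that exactly one trailing underscore is dropped for the default view prefix. How a prefix is derived from the project path when `--prefix` is omitted is not covered here.

```python
def prefixed_name(prefix, name):
    """Prepend the datasource prefix to an imported datasource name (assumed rule)."""
    return prefix + name


def default_view_prefix(datasource_prefix):
    """Datasource prefix with one trailing underscore removed, if present."""
    if datasource_prefix.endswith("_"):
        return datasource_prefix[:-1]
    return datasource_prefix
```

Under these assumptions, running the merge with `--prefix "batch2_"` and no `--view-prefix` would import a datasource named `cells` as `batch2_cells` and use `batch2` as the view prefix.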
Merging Considerations
- Both projects must be valid MDV projects with required files (`datafile.h5`, `datasources.json`, `views.json`, `state.json`)
- Projects should have compatible data structures
- Column names and data types should align across projects
- The base project is modified in-place to include datasources and views from the extra project
- Datasource names are automatically prefixed to avoid collisions
- Original project metadata is preserved where possible
- If datasource name collisions occur even with prefixing, the merge will fail with an error
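Since the base project is modified in-place, it may be worth checking for name collisions before merging. The sketch below is a hedged illustration, not mdvtools code. The helper names are hypothetical. It assumes that `datasources.json` holds a JSON list of objects, each with a `name` key, and that the prefix is prepended to each imported name. Inspect your own project files to confirm this before relying on it.

```python
import json
from pathlib import Path


def datasource_names(project_dir):
    """Read datasource names, assuming a JSON list of objects with a 'name' key."""
    with open(Path(project_dir) / "datasources.json", encoding="utf-8") as fh:
        entries = json.load(fh)
    return [entry["name"] for entry in entries]


def find_collisions(base_project, extra_project, prefix):
    """Return prefixed names from extra_project that already exist in base_project."""
    existing = set(datasource_names(base_project))
    incoming = [prefix + name for name in datasource_names(extra_project)]
    return sorted(name for name in incoming if name in existing)
```

A non-empty result indicates that a different `--prefix` is needed. Working on a copy of the base project is a sensible extra precaution.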