API reference

This page documents the public Cell-GPS functions intended for direct use in analysis scripts. The preferred import path is cellgps; the legacy sfplot namespace remains available for backward compatibility.

Core COSTE and StructureMap functions

cellgps.compute_cophenetic_distances_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', output_dir: str | None = None, method: str = 'average', show_corr: bool = False) → Tuple[DataFrame, DataFrame]

Compute and return cophenetic distance matrices in both row and column dimensions, then apply linear normalization to [0, 1] for each separately.

If z_col is provided, uses (x, y, z) for distance computation; otherwise uses only (x, y).

Parameters:

df : pd.DataFrame

DataFrame containing cell data.

x_col, y_col, z_col : str, optional

Column names for spatial coordinates. z_col defaults to None.

celltype_col : str, optional

Column name for cell type.

output_dir : Optional[str]

Output file directory; if None, uses the current working directory.

method : str, optional

Linkage method for hierarchical clustering. Defaults to “average”.

show_corr : bool, optional

Whether to print the cophenetic correlation coefficient for rows and columns. Defaults to False.

Returns:

Tuple[pd.DataFrame, pd.DataFrame]

Row and column cophenetic distance matrices, both normalized to [0, 1].

cellgps.compute_cophenetic_distances_from_adata(adata: anndata.AnnData, cluster_col: str = 'Cluster', output_dir: str | None = None, method: str = 'average') → Tuple[DataFrame, DataFrame]

Compute and return cophenetic distance matrices in both row and column dimensions (using cophenet), then apply linear normalization to [0,1] for each separately.

Unlike the previous version, min and max values are computed independently for rows and columns.

cellgps.compute_searcher_findee_distance_matrix_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype') → DataFrame

Compute and return a directed inter-cluster average nearest-neighbor distance matrix. Row and column indices are the clusters (cell types) present in df; rows represent “Searcher” clusters, columns represent “Findee” clusters. Each element is the average nearest-neighbor distance from all cells in the row cluster to all cells in the column cluster. Clusters with no cells in the data will not appear in the result matrix.

Parameters:

df : pd.DataFrame

DataFrame containing cell coordinates and type data.

x_col, y_col : str, optional

Column names for cell x/y coordinates. Defaults to “x” and “y”.

z_col : Optional[str], optional

Column name for the z coordinate. If provided, distances are computed from (x, y, z); if None, only (x, y) are used.

celltype_col : str, optional

Column name for cell type / cluster labels. Defaults to “celltype”.

Returns:

pd.DataFrame

Distance matrix DataFrame with cluster names as index and columns. Shape is (n_clusters, n_clusters); values are the average nearest-neighbor distance between the corresponding cluster pairs. NaN if unavailable.
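The directed "Searcher" to "Findee" averaging described above can be illustrated with a small brute-force sketch. This is not the Cell-GPS implementation: the function name searcher_findee_matrix, the tuple-based input, and the rule that a cell is not its own neighbor within the same cluster are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def searcher_findee_matrix(cells):
    """Toy directed average nearest-neighbor matrix.

    cells: iterable of (x, y, celltype) tuples (hypothetical input format).
    Returns a nested dict: matrix[searcher][findee] -> mean NN distance.
    """
    by_type = defaultdict(list)
    for x, y, label in cells:
        by_type[label].append((x, y))

    labels = sorted(by_type)
    matrix = {s: {} for s in labels}
    for s in labels:
        for f in labels:
            nearest = []
            for i, p in enumerate(by_type[s]):
                # Assumption: within one cluster, a cell is not its own neighbor.
                candidates = [
                    q for j, q in enumerate(by_type[f]) if not (s == f and i == j)
                ]
                if candidates:
                    nearest.append(min(math.dist(p, q) for q in candidates))
            # NaN when no neighbor distance is available for this pair.
            matrix[s][f] = sum(nearest) / len(nearest) if nearest else float("nan")
    return matrix


# Two "A" cells and one "B" cell on a line.
cells = [(0, 0, "A"), (1, 0, "A"), (4, 0, "B")]
m = searcher_findee_matrix(cells)
```

In this toy input the matrix is not symmetric: averaging from the two "A" cells toward "B" gives a different value than averaging from the single "B" cell toward "A", which is why rows and columns are later clustered separately.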

cellgps.compute_cophenetic_from_distance_matrix(distance_matrix: DataFrame, method: str = 'average', show_corr: bool = False) → Tuple[DataFrame, DataFrame]

Perform hierarchical clustering in both row and column directions on the given inter-cluster distance matrix, and compute cophenetic distance matrices. Results are independently normalized to [0,1] for rows and columns.

Parameters:

distance_matrix : pd.DataFrame

Input distance matrix with source clusters as rows and target clusters as columns (e.g. output of compute_searcher_findee_distance_matrix_from_df).

method : str, optional

Linkage method for hierarchical clustering. Defaults to “average”.

show_corr : bool, optional

Whether to print the cophenetic correlation coefficient (printed separately for rows and columns). Defaults to False.

Returns:

Tuple[pd.DataFrame, pd.DataFrame]

(row_coph, col_coph). Cophenetic distance matrices (DataFrames) for row and column clusters, each independently normalized to [0,1].
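The row/column procedure can be sketched with SciPy's hierarchical clustering routines. This is a minimal sketch under stated assumptions, not the library's code: the helper name normalized_cophenetic is hypothetical, Euclidean distance between row (or column) profiles is assumed as the clustering input, and the min-max scaling is assumed to run over the full square matrix including its zero diagonal.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist, squareform


def normalized_cophenetic(profiles, method="average"):
    """Cluster the rows of `profiles` and return (scaled cophenetic matrix, corr)."""
    # Assumption: Euclidean distances between row profiles feed the linkage.
    condensed = pdist(profiles)
    tree = linkage(condensed, method=method)
    # cophenet returns the cophenetic correlation and the condensed distances.
    corr, coph = cophenet(tree, condensed)
    square = squareform(coph)
    span = square.max() - square.min()
    # Linear normalization to [0, 1]; a constant matrix maps to all zeros.
    scaled = (square - square.min()) / span if span > 0 else np.zeros_like(square)
    return scaled, corr


# Toy directed distance matrix (rows: searchers, columns: findees).
dist = np.array([
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 10.0],
    [10.0, 9.0, 0.0],
])

row_coph, row_corr = normalized_cophenetic(dist)    # cluster rows
col_coph, col_corr = normalized_cophenetic(dist.T)  # cluster columns
```

Because each call computes its own minimum and maximum, the row and column results are scaled independently, matching the behavior described for the returned pair.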

cellgps.compute_cophenetic_distances_from_df_memory_opt(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', method: str = 'average', show_corr: bool = False, batch_size: int | None = None) → Tuple[DataFrame, DataFrame]

Same functionality as the original compute_cophenetic_distances_from_df, but reduces memory usage via batched computation.

cellgps.pick_batch_size(n_cells: int, dims: int = 2, frac: float = 0.3, hard_min: int = 50000, hard_max: int | None = None, bytes_per_row: int | None = None, safety_gb: float = 8.0, env_override_var: str = 'BATCH_SIZE_OVERRIDE') → int

Pick a batch size that better utilizes RAM on big machines.

Key ideas:

  • Allow an env override (for quick experiments).

  • Subtract a fixed safety buffer (safety_gb) from available RAM.

  • Make bytes_per_row configurable; provide a conservative default.

  • Optional hard_max; if None, we don’t clamp by a hard cap.

Parameters:

n_cells : int

Total number of items to process.

dims : int

Dimensionality; may influence copies inside algorithms.

frac : float

Fraction of available RAM to budget.

hard_min : int

Lower bound for stability on small RAM.

hard_max : Optional[int]

Upper bound; set None to disable hard clamping.

bytes_per_row : Optional[int]

Estimated peak bytes per row for the step. If None, pick a conservative default.

safety_gb : float

Keep this amount of RAM free regardless (OS/page cache/etc.).

env_override_var : str

If set, this env var forces the batch size (int), bypassing heuristics.

Returns:

int

A batch size in [hard_min, n_cells] (and <= hard_max if provided).
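The heuristic can be approximated as follows. This is a sketch under stated assumptions rather than the actual implementation: pick_batch_size_sketch and its available_bytes argument are hypothetical (the real function presumably queries available RAM itself), the default of 64 * dims bytes per row is invented for illustration, and the order of operations (subtract the safety buffer, then apply frac, then clamp) is one plausible reading of the description.

```python
import os

GIB = 1024 ** 3


def pick_batch_size_sketch(n_cells, available_bytes, dims=2, frac=0.3,
                           hard_min=50_000, hard_max=None, bytes_per_row=None,
                           safety_gb=8.0, env_override_var="BATCH_SIZE_OVERRIDE"):
    """Toy version of the batch-size heuristic (not the Cell-GPS code)."""
    # 1. An environment override bypasses every heuristic.
    override = os.environ.get(env_override_var)
    if override:
        return int(override)

    # 2. Hypothetical conservative default for the per-row memory estimate.
    if bytes_per_row is None:
        bytes_per_row = 64 * dims

    # 3. Keep safety_gb free, then budget a fraction of what remains.
    budget = max(0.0, available_bytes - safety_gb * GIB) * frac
    batch = int(budget // bytes_per_row)

    # 4. Clamp: at least hard_min, at most hard_max (if given) and n_cells.
    batch = max(batch, hard_min)
    if hard_max is not None:
        batch = min(batch, hard_max)
    return min(batch, n_cells)


# Example: 10 million cells on a machine reporting 9 GiB of available RAM.
example_batch = pick_batch_size_sketch(n_cells=10_000_000, available_bytes=9 * GIB)
```

Setting the environment variable named by env_override_var (for example BATCH_SIZE_OVERRIDE=200000) before the call would return that value directly, which is convenient for quick experiments.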

Topology extensions

cellgps.compute_weighted_cophenetic_distances_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, group_col: str = 'celltype', weight_col: str | None = 'weight', min_weight: float = 0.0, method: str = 'average', show_corr: bool = False) → tuple[DataFrame, DataFrame]
cellgps.compute_weighted_searcher_findee_distance_matrix_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, group_col: str = 'celltype', weight_col: str | None = 'weight', min_weight: float = 0.0) → DataFrame

Compute a weighted directed searcher→findee average nearest-neighbor matrix.

The weighting scheme is intentionally conservative to preserve backward compatibility with the original t_and_c logic: the nearest-neighbor geometry is unchanged, while the row-wise aggregation becomes a weighted average over source/searcher points. When every point has unit weight, the result is exactly equivalent to compute_searcher_findee_distance_matrix_from_df.
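The weighted aggregation for a single searcher→findee pair can be sketched as below. The helper weighted_mean_nn_distance and its tuple inputs are hypothetical, and treating min_weight as a cutoff that drops points with weight at or below it is an assumption, not documented behavior.

```python
import math


def weighted_mean_nn_distance(searchers, findees, min_weight=0.0):
    """Weighted mean nearest-neighbor distance for one searcher/findee pair.

    searchers: list of (x, y, weight); findees: list of (x, y).
    """
    total = 0.0
    weight_sum = 0.0
    for x, y, w in searchers:
        # Assumption: points at or below min_weight do not contribute.
        if w <= min_weight:
            continue
        # The geometry is unchanged: plain nearest-neighbor distance.
        nearest = min(math.dist((x, y), f) for f in findees)
        # Only the aggregation over searcher points is weighted.
        total += w * nearest
        weight_sum += w
    return total / weight_sum if weight_sum else float("nan")


findees = [(4.0, 0.0)]
weighted = weighted_mean_nn_distance([(0, 0, 1.0), (1, 0, 3.0)], findees)
unit = weighted_mean_nn_distance([(0, 0, 1.0), (1, 0, 1.0)], findees)
```

With unit weights the weighted mean reduces to the ordinary mean of the nearest-neighbor distances, which mirrors the stated equivalence with the unweighted function.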

cellgps.build_entity_points_from_expression(reference_df: DataFrame, expression_df: DataFrame, *, entities: Iterable[str] | None = None, cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', min_weight: float = 0.0, entity_col: str = 'entity', weight_col: str = 'weight') → DataFrame
cellgps.compute_entity_structuremap(entity_points_df: DataFrame, *, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, entity_col: str = 'entity', weight_col: str = 'weight', min_weight: float = 0.0, method: str = 'average') → DataFrame
cellgps.compute_entity_to_cell_topology(reference_df: DataFrame, entity_points_df: DataFrame, *, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', entity_col: str = 'entity', weight_col: str = 'weight', min_weight: float = 0.0, method: str = 'average') → DataFrame

Generalize transcript-by-cell topology to arbitrary weighted entities.

reference_df contains the fixed cell-type template. entity_points_df contains an entity label plus spatial points and weights. For every entity we temporarily append its weighted point cloud to the reference template, compute a weighted StructureMap, and extract the entity→celltype row.

cellgps.compute_pathway_activity_matrix(expression_df: DataFrame, pathway_definitions: Mapping[str, Any] | DataFrame, *, method: str = 'rank_mean', normalize: bool = True) → DataFrame
cellgps.ligand_receptor_topology_analysis(*, reference_df: DataFrame | None = None, expression_df: DataFrame | None = None, lr_pairs: DataFrame, output_dir: str | PathLike[str] | None = None, adata: Any = None, entity_points_df: DataFrame | None = None, tbc_results: str | PathLike[str] | None = None, t_and_c_df: DataFrame | None = None, cluster_col: str = 'Cluster', cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', ligand_col: str = 'ligand', receptor_col: str = 'receptor', prior_col: str = 'evidence_weight', structure_map: DataFrame | None = None, structure_map_df: DataFrame | None = None, anchor_mode: str = 'precomputed', expression_support_mode: str = 'pseudobulk_detection', contact_mode: str = 'strength_coverage', entity_min_weight: float = 0.0, detection_threshold: float = 0.0, k_neighbors: int = 8, radius: float | None = None, topology_method: str = 'average', top_n_pairs: int = 12, hotspot_quantile: float = 0.9, min_cross_edges: int = 50, contact_expr_threshold: str | float = 'q75_nonzero', use_raw: bool = False) → dict[str, Any]
cellgps.ligand_receptor_target_consistency(lr_scores: DataFrame, receiver_signatures: Mapping[str, Any] | DataFrame, ligand_target_prior: DataFrame, *, ligand_col: str = 'ligand', receiver_col: str = 'receiver_celltype', target_col: str = 'target', prior_weight_col: str = 'weight', signature_gene_col: str = 'gene', signature_weight_col: str = 'score') → DataFrame

Compute a NicheNet-like downstream target consistency layer.

The default scoring is intentionally lightweight: for each ligand and receiver cell type we compute the weighted overlap between the ligand prior targets and the receiver signature genes. The output can be merged back onto the ligand_receptor_topology_analysis result table.
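One way to read "weighted overlap" is sketched below for a single ligand and receiver cell type. The exact formula used by Cell-GPS is not stated here, so everything in this block is an assumption: the helper name target_consistency, the dict inputs, the product of prior weight and signature score on shared genes, and the normalization by total prior weight. The gene names and numbers are made-up toy values.

```python
def target_consistency(prior_targets, signature):
    """Toy weighted overlap between ligand prior targets and a receiver signature.

    prior_targets: {target_gene: prior_weight} for one ligand.
    signature: {gene: score} for one receiver cell type.
    """
    total_prior = sum(prior_targets.values())
    if not total_prior:
        return 0.0
    # Only genes present in both the prior and the signature contribute.
    shared = prior_targets.keys() & signature.keys()
    overlap = sum(prior_targets[g] * signature[g] for g in shared)
    # Assumed normalization: fraction of prior weight supported by the signature.
    return overlap / total_prior


# Hypothetical ligand prior and receiver signature.
prior = {"G1": 0.5, "G2": 0.3, "G3": 0.2}
sig = {"G2": 1.0, "G3": 0.5, "G9": 0.9}
score = target_consistency(prior, sig)
```

A per-pair score of this kind is a single number per ligand and receiver, which is what makes it straightforward to merge back onto a ligand-receptor result table keyed by those two columns.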

cellgps.pathway_topology_analysis(*, pathway_definitions: Mapping[str, Any] | DataFrame, reference_df: DataFrame | None = None, expression_df: DataFrame | None = None, output_dir: str | PathLike[str] | None = None, adata: Any = None, tbc_results: str | PathLike[str] | None = None, t_and_c_df: DataFrame | None = None, cluster_col: str = 'Cluster', cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', scoring_method: str = 'weighted_sum', view: str = 'intrinsic', structure_map: DataFrame | None = None, structure_map_df: DataFrame | None = None, anchor_mode: str = 'precomputed', pathway_modes: Sequence[str] = ('gene_topology_aggregate', 'activity_point_cloud'), primary_pathway_mode: str = 'gene_topology_aggregate', pathway_aggregate: str = 'weighted_median', activity_threshold_schedule: Sequence[float] = (0.95, 0.9, 0.8, 0.7, 0.6, 0.5), min_activity_cells: int = 50, entity_min_weight: float = 0.0, k_neighbors: int = 8, radius: float | None = None, topology_method: str = 'average', hotspot_quantile: float = 0.9, use_raw: bool = False) → dict[str, Any]

Preprocessing and input helpers

cellgps.load_xenium_data(folder: str, normalize: bool = True)

Load and preprocess a Xenium run through pyXenium.io.read_xenium.

cellgps.load_xenium_table_bundle(folder: str | PathLike[str], *, cells_path: str | PathLike[str] | None = None, cell_groups_path: str | PathLike[str] | None = None, feature_matrix_path: str | PathLike[str] | None = None, normalize: bool = False, cluster_col: str = 'Clusters', cell_id_col: str = 'Barcode', x_col: str = 'x_centroid', y_col: str = 'y_centroid')

Load a Xenium table bundle through pyXenium.io.read_xenium.

The returned object keeps the requested cluster labels in adata.obs[cluster_col] and mirrors them into adata.obs["Cluster"] for backward compatibility with the existing Cell-GPS API.

cellgps.merge_xenium_clusters_into_adata(sdata, xenium_dir: str, table_key: str = 'table', clustering_root: str = 'analysis/clustering', barcode_col: str = 'Barcode', cluster_col: str = 'Cluster') → Tuple['anndata.AnnData', List[str], Dict[str, float]]

Auto-collect xenium_dir/analysis/clustering/**/clusters.csv and merge clustering columns into sdata.tables[table_key].obs. Prefers linking via obs['cell_id']; falls back to shapes index mapping if unavailable. Returns (adata, list of new column names, per-column non-NA hit rate report).

cellgps.read_visium_bin(base: Path, dataset_id: str, use_filtered: bool = True, keep_tmp: bool = False)

Adapter for spatialdata-io 0.3.0 that reads Visium HD output containing Parquet coordinates. Does not write any files to base.

Plotting

cellgps.plot_cophenetic_heatmap(matrix: DataFrame, matrix_name: str | None = None, output_dir: str | None = None, output_filename: str | None = None, figsize: Tuple[float, float] | None = None, cmap: str = 'RdBu', linewidths: float = 0.5, annot: bool = False, sample: str = 'Sample', xlabel: str | None = None, ylabel: str | None = None, show_dendrogram: bool = True, quiet: bool = True, return_figure: bool = False, return_image: bool = False, dpi: int = 300)
Draw a cophenetic heatmap (seaborn.clustermap), guaranteeing:
  • Text in PDF is editable

  • Legend position is auto-adjusted

  • figsize is dynamically adjusted

  • fontTools.subset & findfont logs are silenced

Parameters:

…existing parameters…

return_figure : bool

Whether to return the figure object instead of saving to file.

return_image : bool

Whether to return a high-resolution PIL image instead of the figure object.

dpi : int

Image DPI resolution; only effective when return_image=True.

Returns:

If return_figure=True, returns a seaborn.ClusterGrid object.

If return_image=True, returns a PIL.Image object.

Otherwise returns None.

cellgps.generate_cluster_distance_heatmap_from_adata(adata: anndata.AnnData, cluster_col: str = 'Cluster', output_dir: str | None = None, output_filename: str | None = None, figsize: tuple = (8, 8), cmap: str = 'RdBu', max_scale: float = 10, show_dendrogram: bool = True)

Generate and save a distance heatmap from each cell cluster to its nearest cluster center.

Parameters:

adata : anndata.AnnData

AnnData object containing preprocessed data.

cluster_col : str, optional

Column name in adata.obs containing cluster information. Defaults to “Cluster”.

output_dir : Optional[str]

Output directory for the PDF file. Defaults to current working directory.

output_filename : Optional[str]

Output file name. If not specified, uses “clustermap_output_{sample}.pdf”.

figsize : tuple, optional

Size of the heatmap. Defaults to (8, 8).

cmap : str, optional

Colormap for the heatmap. Defaults to “RdBu”.

max_scale : float, optional

max_value parameter for sc.pp.scale, used to clip Z-scores. Defaults to 10.

Returns:

None

cellgps.generate_cluster_distance_heatmap_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', sample: str = 'Sample', output_dir: str | None = None, output_filename: str | None = None, figsize: tuple = (8, 8), cmap: str = 'RdBu', show_dendrogram: bool = True)

Generate and save a distance heatmap from each cell cluster to its nearest cluster center.

Parameters:

df : pd.DataFrame

DataFrame containing cell data.

x_col : str, optional

Column name for x coordinates. Defaults to ‘x’.

y_col : str, optional

Column name for y coordinates. Defaults to ‘y’.

celltype_col : str, optional

Column name for cell type. Defaults to ‘celltype’.

output_dir : Optional[str]

Output directory for the PDF file. Defaults to current working directory.

output_filename : Optional[str]

Output file name. If not specified, uses “clustermap_output.pdf”.

figsize : tuple, optional

Size of the heatmap. Defaults to (8, 8).

cmap : str, optional

Colormap for the heatmap. Defaults to “RdBu”.

Returns:

None

cellgps.generate_cluster_distance_heatmap_from_path(base_path: str, sample: str, figsize: tuple = (8, 8), output_dir: str | None = None, show_dendrogram: bool = True)

Generate and save a distance heatmap from each cell cluster to its nearest cluster center.

Parameters:

base_path : str

Base path where data is stored.

sample : str

Sample name used to specify the data folder.

output_dir : Optional[str]

Output directory for the PDF file. Defaults to current working directory.

Returns:

None

cellgps.circle_heatmap(bg_df: DataFrame, circle_df: DataFrame, *, cmap: str = 'RdBu', size_exponent: float = 1.0, circle_fill: str = 'white', circle_edge: str = 'black', circle_edge_lw: float = 0.5, add_legend: bool = True, legend_title: str = 'Transcript Percentage (%)', figsize: tuple = (8, 6), ax: Axes = None)
Draw a combined heatmap and circles plot:
  • bg_df: scores between 0–1, represented with red-white-blue;

  • circle_df: percentages 0–100 (%), encoded as circle area;

  • 0% draws no circle, 100% maps exactly to a circle of cell diameter;

  • The legend only shows five percentages: [5, 25, 45, 65, 85].
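The area encoding described in the list above implies a square-root relationship between percentage and circle diameter. The sketch below illustrates that relationship only; circle_diameter is a hypothetical helper, and the way size_exponent enters (as a power applied to the area fraction) is an assumption rather than documented behavior.

```python
import math


def circle_diameter(pct, cell_size=1.0, size_exponent=1.0):
    """Diameter of the circle drawn for a percentage in [0, 100].

    cell_size is the width of one heatmap cell in plot units (hypothetical).
    """
    if pct <= 0:
        return 0.0  # 0% draws no circle
    frac = min(pct, 100.0) / 100.0
    # Assumption: size_exponent reshapes the area fraction before drawing.
    area_frac = frac ** size_exponent
    # Area scales with diameter squared, so take the square root.
    return cell_size * math.sqrt(area_frac)


# Diameters for the five legend percentages.
legend_diameters = {p: circle_diameter(p) for p in (5, 25, 45, 65, 85)}
```

Under this mapping a value of 100% yields a diameter equal to the cell size, and 25% yields half of it, since a quarter of the area corresponds to half the diameter.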