API reference
This page documents the public Cell-GPS functions intended for direct use in
analysis scripts. The preferred import path is cellgps; the legacy
sfplot namespace remains available for backward compatibility.
Core COSTE and StructureMap functions
- cellgps.compute_cophenetic_distances_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', output_dir: str | None = None, method: str = 'average', show_corr: bool = False) -> Tuple[DataFrame, DataFrame]
Compute and return cophenetic distance matrices in both row and column dimensions, then apply linear normalization to [0, 1] for each separately.
If z_col is provided, uses (x, y, z) for distance computation; otherwise uses only (x, y).
Parameters:
- df : pd.DataFrame
DataFrame containing cell data.
- x_col, y_col, z_col : str, optional
Column names for spatial coordinates. z_col defaults to None.
- celltype_col : str, optional
Column name for cell type.
- output_dir : Optional[str]
Output file directory; if None, uses the current working directory.
- method : str, optional
Linkage method for hierarchical clustering. Defaults to “average”.
- show_corr : bool, optional
Whether to print the cophenetic correlation coefficient for rows and columns. Defaults to False.
Returns:
- Tuple[pd.DataFrame, pd.DataFrame]
Row and column cophenetic distance matrices, both normalized to [0, 1].
- cellgps.compute_cophenetic_distances_from_adata(adata: anndata.AnnData, cluster_col: str = 'Cluster', output_dir: str | None = None, method: str = 'average') -> Tuple[DataFrame, DataFrame]
Compute and return cophenetic distance matrices in both row and column dimensions (using cophenet), then apply linear normalization to [0,1] for each separately.
Unlike the previous version, min and max values are computed independently for rows and columns.
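The independent normalization can be illustrated with a minimal plain-Python sketch. It is not the library's code: `minmax_normalize` is a hypothetical helper, the matrices are toy values, and including the diagonal zeros in the min/max is an assumption.

```python
def minmax_normalize(matrix):
    """Linearly rescale all entries of a nested-list matrix to [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: a constant matrix maps to all zeros (assumption).
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / (hi - lo) for v in row] for row in matrix]


# Toy row and column cophenetic matrices on different scales.
row_coph = [[0.0, 2.0], [2.0, 0.0]]
col_coph = [[0.0, 10.0, 40.0], [10.0, 0.0, 40.0], [40.0, 40.0, 0.0]]

# Each matrix uses its own min and max, so both end up spanning [0, 1].
row_norm = minmax_normalize(row_coph)
col_norm = minmax_normalize(col_coph)
```

Because the two matrices are rescaled separately, a large spread in one direction does not compress the values of the other.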
- cellgps.compute_searcher_findee_distance_matrix_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype') -> DataFrame
Compute and return a directed inter-cluster average nearest-neighbor distance matrix. Row and column indices are the clusters (cell types) present in df; rows represent “Searcher” clusters, columns represent “Findee” clusters. Each element is the average nearest-neighbor distance from all cells in the row cluster to all cells in the column cluster. Clusters with no cells in the data will not appear in the result matrix.
Parameters:
- df : pd.DataFrame
DataFrame containing cell coordinates and type data.
- x_col, y_col : str, optional
Column names for cell x/y coordinates. Defaults to “x” and “y”.
- z_col : Optional[str], optional
Column name for the z coordinate. If provided, distances are computed in 3D; if None, only (x, y) are used.
- celltype_col : str, optional
Column name for cell type / cluster labels. Defaults to “celltype”.
Returns:
- pd.DataFrame
Distance matrix DataFrame with cluster names as index and columns. Shape is (n_clusters, n_clusters); values are the average nearest-neighbor distance between the corresponding cluster pairs. NaN if unavailable.
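To make the directed searcher/findee definition concrete, here is a brute-force sketch in plain Python. It is illustrative only: `searcher_findee_matrix` is a hypothetical name, the input is a list of tuples rather than a DataFrame, and skipping the cell itself on the diagonal is an assumption that this page does not confirm.

```python
import math
from collections import defaultdict


def searcher_findee_matrix(cells):
    """cells: iterable of (x, y, celltype).

    Returns {searcher: {findee: mean nearest-neighbor distance}}.
    """
    groups = defaultdict(list)
    for x, y, celltype in cells:
        groups[celltype].append((x, y))

    result = {}
    for searcher, s_points in groups.items():
        result[searcher] = {}
        for findee, f_points in groups.items():
            nearest = []
            for i, p in enumerate(s_points):
                # On the diagonal, skip the cell itself (assumption of this sketch).
                candidates = [
                    q for j, q in enumerate(f_points)
                    if not (searcher == findee and i == j)
                ]
                if candidates:
                    nearest.append(min(math.dist(p, q) for q in candidates))
            # NaN when no nearest neighbor exists for any searcher cell.
            result[searcher][findee] = (
                sum(nearest) / len(nearest) if nearest else math.nan
            )
    return result


cells = [(0.0, 0.0, "A"), (1.0, 0.0, "A"), (0.0, 3.0, "B")]
matrix = searcher_findee_matrix(cells)
# The matrix is directed: matrix["A"]["B"] averages over the two A cells,
# while matrix["B"]["A"] only looks outward from the single B cell.
```

The quadratic loop is fine for a toy input; on real data a spatial index such as a k-d tree would be the usual choice.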
- cellgps.compute_cophenetic_from_distance_matrix(distance_matrix: DataFrame, method: str = 'average', show_corr: bool = False) -> Tuple[DataFrame, DataFrame]
Perform hierarchical clustering in both row and column directions on the given inter-cluster distance matrix, and compute cophenetic distance matrices. Results are independently normalized to [0,1] for rows and columns.
Parameters:
- distance_matrix : pd.DataFrame
Input distance matrix with source clusters as rows and target clusters as columns (e.g. output of compute_searcher_findee_distance_matrix_from_df).
- method : str, optional
Linkage method for hierarchical clustering. Defaults to “average”.
- show_corr : bool, optional
Whether to print the cophenetic correlation coefficient (printed separately for rows and columns). Defaults to False.
Returns:
- Tuple[pd.DataFrame, pd.DataFrame]
(row_coph, col_coph). Cophenetic distance matrices (DataFrames) for row and column clusters, each independently normalized to [0,1].
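As a rough illustration of what a cophenetic distance is, the sketch below hand-rolls average linkage in plain Python and records the merge height at which each pair of items first joins. Several things are assumptions here: the function name is hypothetical, each row (or column) is treated as an observation vector compared by Euclidean distance, and normalization is left out. In practice SciPy's `scipy.cluster.hierarchy.linkage` and `cophenet` cover this step.

```python
import math
from itertools import combinations


def cophenetic_average_linkage(vectors):
    """Cophenetic distances for average-linkage clustering of the given vectors."""
    n = len(vectors)
    # Pairwise Euclidean distances between the original observations.
    dist = [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    coph = [[0.0] * n for _ in range(n)]
    next_id = n
    while len(clusters) > 1:
        # Find the two clusters with the smallest mean inter-member distance.
        best = None
        for a, b in combinations(clusters, 2):
            members_a, members_b = clusters[a], clusters[b]
            avg = sum(dist[i][j] for i in members_a for j in members_b) / (
                len(members_a) * len(members_b)
            )
            if best is None or avg < best[0]:
                best = (avg, a, b)
        height, a, b = best
        # Every pair split across the two clusters first meets at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


# Toy inter-cluster distance matrix (rows: searchers, columns: findees).
distance_matrix = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 5.0],
    [5.0, 5.0, 0.0],
]
row_coph = cophenetic_average_linkage(distance_matrix)
columns = [list(col) for col in zip(*distance_matrix)]
col_coph = cophenetic_average_linkage(columns)
```

For a directed (asymmetric) input, the row and column results generally differ, which is why both are returned.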
- cellgps.compute_cophenetic_distances_from_df_memory_opt(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', method: str = 'average', show_corr: bool = False, batch_size: int | None = None) -> Tuple[DataFrame, DataFrame]
Same functionality as the original compute_cophenetic_distances_from_df, but reduces memory usage via batched computation.
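How the library batches internally is not described on this page. As a generic sketch of the idea, the following plain-Python example accumulates a running sum over chunks of searcher points, so only one chunk's nearest-neighbor distances are held at a time. The function name and the toy coordinates are hypothetical.

```python
import math


def mean_nn_distance_batched(searchers, findees, batch_size):
    """Mean nearest-neighbor distance from searchers to findees, one batch at a time."""
    total, count = 0.0, 0
    for start in range(0, len(searchers), batch_size):
        batch = searchers[start:start + batch_size]
        # Only this batch's distances are materialized at once.
        nearest = [min(math.dist(p, q) for q in findees) for p in batch]
        total += sum(nearest)
        count += len(nearest)
    return total / count if count else math.nan


searchers = [(float(i), 0.0) for i in range(10)]
findees = [(0.0, 1.0), (9.0, 1.0)]
full = mean_nn_distance_batched(searchers, findees, batch_size=len(searchers))
batched = mean_nn_distance_batched(searchers, findees, batch_size=3)
```

Up to floating-point rounding, the batched result matches the single-pass result; only peak memory changes.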
- cellgps.pick_batch_size(n_cells: int, dims: int = 2, frac: float = 0.3, hard_min: int = 50000, hard_max: int | None = None, bytes_per_row: int | None = None, safety_gb: float = 8.0, env_override_var: str = 'BATCH_SIZE_OVERRIDE') -> int
Pick a batch size that better utilizes RAM on big machines.
Key ideas:
- Allow an env override (for quick experiments).
- Subtract a fixed safety buffer (safety_gb) from available RAM.
- Make bytes_per_row configurable; provide a conservative default.
- Optional hard_max; if None, we don’t clamp by a hard cap.
Parameters
- n_cells : int
Total number of items to process.
- dims : int
Dimensionality; may influence copies inside algorithms.
- frac : float
Fraction of available RAM to budget.
- hard_min : int
Lower bound for stability on small RAM.
- hard_max : Optional[int]
Upper bound; set None to disable hard clamping.
- bytes_per_row : Optional[int]
Estimated peak bytes per row for the step. If None, pick a conservative default.
- safety_gb : float
Keep this amount of RAM free regardless (OS/page cache/etc.).
- env_override_var : str
If set, this env var forces the batch size (int), bypassing heuristics.
Returns
- int
A batch size in [hard_min, n_cells] (and <= hard_max if provided).
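The key ideas listed for pick_batch_size can be sketched as follows. This is an assumption-laden illustration rather than the library's implementation: `sketch_pick_batch_size` is a hypothetical function, available RAM is passed in explicitly as `available_bytes` instead of being queried from the system, the 4096-byte default for `bytes_per_row` is an arbitrary placeholder, and the exact order of clamping is a guess.

```python
import os

GIB = 1024 ** 3


def sketch_pick_batch_size(
    n_cells,
    available_bytes,
    frac=0.3,
    hard_min=50_000,
    hard_max=None,
    bytes_per_row=4096,  # placeholder default, not the library's value
    safety_gb=8.0,
    env_override_var="BATCH_SIZE_OVERRIDE",
):
    # 1. An environment override bypasses every heuristic.
    override = os.environ.get(env_override_var)
    if override:
        return int(override)
    # 2. Keep a fixed safety buffer free, then budget a fraction of the rest.
    usable = max(int(available_bytes - safety_gb * GIB), 0)
    budget_rows = int(usable * frac) // bytes_per_row
    # 3. Apply the lower bound, the optional upper bound, and the data size.
    size = max(hard_min, budget_rows)
    if hard_max is not None:
        size = min(size, hard_max)
    return max(1, min(size, n_cells))
```

With 40 GiB available and the defaults above, the budget is 30% of the 32 GiB left after the safety buffer; with less RAM than the buffer, the sketch falls back to hard_min.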
Topology extensions
- cellgps.compute_weighted_cophenetic_distances_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, group_col: str = 'celltype', weight_col: str | None = 'weight', min_weight: float = 0.0, method: str = 'average', show_corr: bool = False) -> tuple[DataFrame, DataFrame]
- cellgps.compute_weighted_searcher_findee_distance_matrix_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, group_col: str = 'celltype', weight_col: str | None = 'weight', min_weight: float = 0.0) -> DataFrame
Compute a weighted directed searcher→findee average nearest-neighbor matrix.
The weighting scheme is intentionally conservative to preserve backward compatibility with the original t_and_c logic: the nearest-neighbor geometry is unchanged, while the row-wise aggregation becomes a weighted average over source/searcher points. When every point has unit weight, the result is exactly equivalent to compute_searcher_findee_distance_matrix_from_df.
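The unit-weight equivalence can be checked with a small sketch of the aggregation step alone. The helper name is hypothetical, the nearest-neighbor distances are toy values, and dropping points whose weight is not strictly above min_weight is an assumption about how that parameter behaves.

```python
def weighted_mean_distance(nn_distances, weights=None, min_weight=0.0):
    """Weighted average of per-searcher nearest-neighbor distances."""
    if weights is None:
        weights = [1.0] * len(nn_distances)
    # Assumption: points with weight <= min_weight do not contribute.
    kept = [(d, w) for d, w in zip(nn_distances, weights) if w > min_weight]
    total_weight = sum(w for _, w in kept)
    if total_weight == 0:
        return float("nan")
    return sum(d * w for d, w in kept) / total_weight


# Nearest-neighbor distances for three searcher points (geometry is unchanged).
nn_distances = [1.0, 2.0, 6.0]

unit = weighted_mean_distance(nn_distances, [1.0, 1.0, 1.0])
skewed = weighted_mean_distance(nn_distances, [3.0, 1.0, 0.0])
```

With unit weights the result is the plain mean; with unequal weights, heavier searcher points pull the average toward their own nearest-neighbor distance.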
- cellgps.build_entity_points_from_expression(reference_df: DataFrame, expression_df: DataFrame, *, entities: Iterable[str] | None = None, cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', min_weight: float = 0.0, entity_col: str = 'entity', weight_col: str = 'weight') -> DataFrame
- cellgps.compute_entity_structuremap(entity_points_df: DataFrame, *, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, entity_col: str = 'entity', weight_col: str = 'weight', min_weight: float = 0.0, method: str = 'average') -> DataFrame
- cellgps.compute_entity_to_cell_topology(reference_df: DataFrame, entity_points_df: DataFrame, *, x_col: str = 'x', y_col: str = 'y', z_col: str | None = None, celltype_col: str = 'celltype', entity_col: str = 'entity', weight_col: str = 'weight', min_weight: float = 0.0, method: str = 'average') -> DataFrame
Generalize transcript-by-cell topology to arbitrary weighted entities.
reference_df contains the fixed cell-type template. entity_points_df contains an entity label plus spatial points and weights. For every entity we temporarily append its weighted point cloud to the reference template, compute a weighted StructureMap, and extract the entity→celltype row.
- cellgps.compute_pathway_activity_matrix(expression_df: DataFrame, pathway_definitions: Mapping[str, Any] | DataFrame, *, method: str = 'rank_mean', normalize: bool = True) -> DataFrame
- cellgps.ligand_receptor_topology_analysis(*, reference_df: DataFrame | None = None, expression_df: DataFrame | None = None, lr_pairs: DataFrame, output_dir: str | PathLike[str] | None = None, adata: Any = None, entity_points_df: DataFrame | None = None, tbc_results: str | PathLike[str] | None = None, t_and_c_df: DataFrame | None = None, cluster_col: str = 'Cluster', cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', ligand_col: str = 'ligand', receptor_col: str = 'receptor', prior_col: str = 'evidence_weight', structure_map: DataFrame | None = None, structure_map_df: DataFrame | None = None, anchor_mode: str = 'precomputed', expression_support_mode: str = 'pseudobulk_detection', contact_mode: str = 'strength_coverage', entity_min_weight: float = 0.0, detection_threshold: float = 0.0, k_neighbors: int = 8, radius: float | None = None, topology_method: str = 'average', top_n_pairs: int = 12, hotspot_quantile: float = 0.9, min_cross_edges: int = 50, contact_expr_threshold: str | float = 'q75_nonzero', use_raw: bool = False) -> dict[str, Any]
- cellgps.ligand_receptor_target_consistency(lr_scores: DataFrame, receiver_signatures: Mapping[str, Any] | DataFrame, ligand_target_prior: DataFrame, *, ligand_col: str = 'ligand', receiver_col: str = 'receiver_celltype', target_col: str = 'target', prior_weight_col: str = 'weight', signature_gene_col: str = 'gene', signature_weight_col: str = 'score') -> DataFrame
Compute a NicheNet-like downstream target consistency layer.
The default scoring is intentionally lightweight: for each ligand and receiver cell type we compute the weighted overlap between the ligand prior targets and the receiver signature genes. The output can be merged back onto the ligand_receptor_topology_analysis result table.
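One plausible reading of "weighted overlap" is a sum of weight products over the genes shared by the two sets. The sketch below illustrates that reading only; the exact formula used by the library is not stated on this page, and the function name, gene names, and weights are all hypothetical.

```python
def weighted_overlap(prior_targets, receiver_signature):
    """Both arguments map gene -> weight.

    Returns the sum of weight products over genes present in both mappings.
    """
    shared = prior_targets.keys() & receiver_signature.keys()
    return sum(prior_targets[g] * receiver_signature[g] for g in shared)


# Toy ligand -> target prior weights and receiver signature scores.
prior_targets = {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 1.0}
receiver_signature = {"gene_b": 2.0, "gene_c": 1.0, "gene_d": 4.0}

score = weighted_overlap(prior_targets, receiver_signature)
```

Genes that appear in only one of the two mappings contribute nothing, so the score rewards targets that are both strongly predicted and strongly present in the receiver signature.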
- cellgps.pathway_topology_analysis(*, pathway_definitions: Mapping[str, Any] | DataFrame, reference_df: DataFrame | None = None, expression_df: DataFrame | None = None, output_dir: str | PathLike[str] | None = None, adata: Any = None, tbc_results: str | PathLike[str] | None = None, t_and_c_df: DataFrame | None = None, cluster_col: str = 'Cluster', cell_id_col: str = 'cell_id', x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', scoring_method: str = 'weighted_sum', view: str = 'intrinsic', structure_map: DataFrame | None = None, structure_map_df: DataFrame | None = None, anchor_mode: str = 'precomputed', pathway_modes: Sequence[str] = ('gene_topology_aggregate', 'activity_point_cloud'), primary_pathway_mode: str = 'gene_topology_aggregate', pathway_aggregate: str = 'weighted_median', activity_threshold_schedule: Sequence[float] = (0.95, 0.9, 0.8, 0.7, 0.6, 0.5), min_activity_cells: int = 50, entity_min_weight: float = 0.0, k_neighbors: int = 8, radius: float | None = None, topology_method: str = 'average', hotspot_quantile: float = 0.9, use_raw: bool = False) -> dict[str, Any]
Preprocessing and input helpers
- cellgps.load_xenium_data(folder: str, normalize: bool = True)
Load and preprocess a Xenium run through
pyXenium.io.read_xenium.
- cellgps.load_xenium_table_bundle(folder: str | PathLike[str], *, cells_path: str | PathLike[str] | None = None, cell_groups_path: str | PathLike[str] | None = None, feature_matrix_path: str | PathLike[str] | None = None, normalize: bool = False, cluster_col: str = 'Clusters', cell_id_col: str = 'Barcode', x_col: str = 'x_centroid', y_col: str = 'y_centroid')
Load a Xenium table bundle through pyXenium.io.read_xenium. The returned object keeps the requested cluster labels in adata.obs[cluster_col] and mirrors them into adata.obs["Cluster"] for backward compatibility with the existing Cell-GPS API.
- cellgps.merge_xenium_clusters_into_adata(sdata, xenium_dir: str, table_key: str = 'table', clustering_root: str = 'analysis/clustering', barcode_col: str = 'Barcode', cluster_col: str = 'Cluster') -> Tuple['anndata.AnnData', List[str], Dict[str, float]]
Auto-collect xenium_dir/analysis/clustering/**/clusters.csv and merge clustering columns into sdata.tables[table_key].obs. Prefers linking via obs['cell_id']; falls back to shapes index mapping if unavailable. Returns (adata, list of new column names, per-column non-NA hit rate report).
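The "non-NA hit rate" in the report can be read as the fraction of cells that received a label from a given clustering file. A minimal sketch of that idea, using plain dictionaries in place of the real tables (the helper name and the toy identifiers are hypothetical):

```python
def attach_cluster_labels(cell_ids, barcode_to_cluster):
    """Look up a cluster label per cell id and report the fraction that matched."""
    labels = [barcode_to_cluster.get(cell_id) for cell_id in cell_ids]
    hits = sum(label is not None for label in labels)
    hit_rate = hits / len(labels) if labels else 0.0
    return labels, hit_rate


cell_ids = ["c1", "c2", "c3", "c4"]
# Toy stand-in for one parsed clusters.csv (Barcode -> Cluster).
barcode_to_cluster = {"c1": "1", "c2": "2", "c4": "1"}

labels, hit_rate = attach_cluster_labels(cell_ids, barcode_to_cluster)
```

A hit rate well below 1.0 for a column usually suggests that the identifiers used for linking do not line up and are worth inspecting.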
Plotting
- cellgps.plot_cophenetic_heatmap(matrix: DataFrame, matrix_name: str | None = None, output_dir: str | None = None, output_filename: str | None = None, figsize: Tuple[float, float] | None = None, cmap: str = 'RdBu', linewidths: float = 0.5, annot: bool = False, sample: str = 'Sample', xlabel: str | None = None, ylabel: str | None = None, show_dendrogram: bool = True, quiet: bool = True, return_figure: bool = False, return_image: bool = False, dpi: int = 300)
Draw a cophenetic heatmap (seaborn.clustermap), guaranteeing:
- Text in PDF is editable
- Legend position is auto-adjusted
- figsize is dynamically adjusted
- fontTools.subset & findfont logs are silenced
Parameters:
…existing parameters…
- return_figure: whether to return the figure object instead of saving to file
- return_image: whether to return a high-resolution PIL image instead of the figure object
- dpi: image DPI resolution, only effective when return_image=True
Returns:
- If return_figure=True, returns a seaborn.ClusterGrid object
- If return_image=True, returns a PIL.Image object
- Otherwise returns None
- cellgps.generate_cluster_distance_heatmap_from_adata(adata: anndata.AnnData, cluster_col: str = 'Cluster', output_dir: str | None = None, output_filename: str | None = None, figsize: tuple = (8, 8), cmap: str = 'RdBu', max_scale: float = 10, show_dendrogram: bool = True)
Generate and save a distance heatmap from each cell cluster to its nearest cluster center.
Parameters:
- adata : anndata.AnnData
AnnData object containing preprocessed data.
- cluster_col : str, optional
Column name in adata.obs containing cluster information. Defaults to “Cluster”.
- output_dir : Optional[str]
Output directory for the PDF file. Defaults to current working directory.
- output_filename : Optional[str]
Output file name. If not specified, uses “clustermap_output_{sample}.pdf”.
- figsize : tuple, optional
Size of the heatmap. Defaults to (8, 8).
- cmap : str, optional
Colormap for the heatmap. Defaults to “RdBu”.
- max_scale : float, optional
max_value parameter for sc.pp.scale, used to clip Z-scores. Defaults to 10.
Returns:
None
- cellgps.generate_cluster_distance_heatmap_from_df(df: DataFrame, x_col: str = 'x', y_col: str = 'y', celltype_col: str = 'celltype', sample: str = 'Sample', output_dir: str | None = None, output_filename: str | None = None, figsize: tuple = (8, 8), cmap: str = 'RdBu', show_dendrogram: bool = True)
Generate and save a distance heatmap from each cell cluster to its nearest cluster center.
Parameters:
- df : pd.DataFrame
DataFrame containing cell data.
- x_col : str, optional
Column name for x coordinates. Defaults to ‘x’.
- y_col : str, optional
Column name for y coordinates. Defaults to ‘y’.
- celltype_col : str, optional
Column name for cell type. Defaults to ‘celltype’.
- output_dir : Optional[str]
Output directory for the PDF file. Defaults to current working directory.
- output_filename : Optional[str]
Output file name. If not specified, uses “clustermap_output.pdf”.
- figsize : tuple, optional
Size of the heatmap. Defaults to (8, 8).
- cmap : str, optional
Colormap for the heatmap. Defaults to “RdBu”.
Returns:
None
- cellgps.generate_cluster_distance_heatmap_from_path(base_path: str, sample: str, figsize: tuple = (8, 8), output_dir: str | None = None, show_dendrogram: bool = True)
Generate and save a distance heatmap from each cell cluster to its nearest cluster center.
Parameters:
- base_path : str
Base path where data is stored.
- sample : str
Sample name used to specify the data folder.
- output_dir : Optional[str]
Output directory for the PDF file. Defaults to current working directory.
Returns:
None
- cellgps.circle_heatmap(bg_df: DataFrame, circle_df: DataFrame, *, cmap: str = 'RdBu', size_exponent: float = 1.0, circle_fill: str = 'white', circle_edge: str = 'black', circle_edge_lw: float = 0.5, add_legend: bool = True, legend_title: str = 'Transcript Percentage (%)', figsize: tuple = (8, 6), ax: Axes = None)
Draw a combined heatmap and circles plot:
- bg_df: scores between 0–1, represented with red-white-blue;
- circle_df: percentages 0–100 (%), encoded as circle area;
- 0% draws no circle, 100% maps exactly to a circle of cell diameter;
- The legend only shows five percentages: [5, 25, 45, 65, 85].
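Encoding a percentage as circle area means the diameter grows with the square root of the percentage. The sketch below shows that mapping for a unit-sized heatmap cell. It is an illustration, not the plotting code itself: `circle_diameter` is a hypothetical helper, and applying size_exponent to the area fraction is an assumption about what that parameter does.

```python
import math


def circle_diameter(percent, cell_size=1.0, size_exponent=1.0):
    """Diameter of the circle drawn for a percentage in [0, 100]."""
    if percent <= 0:
        return 0.0  # 0% draws no circle
    fraction = min(percent, 100.0) / 100.0
    # Area scales with fraction ** size_exponent, so the diameter scales
    # with its square root; 100% gives exactly the cell diameter.
    return cell_size * math.sqrt(fraction ** size_exponent)


legend_percentages = [5, 25, 45, 65, 85]
legend_diameters = [circle_diameter(p) for p in legend_percentages]
```

Under this mapping a 25% entry gets a circle half as wide as the cell, since a quarter of the area corresponds to half the diameter.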