Single Cells Are Biological Tokens: Towards Cell Language Models
The rapid advancement of single-cell technologies allows multiple molecular features to be measured simultaneously within individual cells, providing unprecedented multimodal data through single-cell multi-omics and spatial omics technologies. This dissertation addresses the challenges of modeling these multimodal interactions with deep learning. We present two series of studies. The first applies graph neural networks and graph transformers to model relations between multimodal features while incorporating external domain knowledge: we propose the Single-cell Multi-Omics GNN (scMoGNN) and the Single-cell Multi-Omics Transformer (scMoFormer); the latter extends the former and demonstrates the promise of transformers for single-cell multi-omics representation learning. The second applies transformers to spatial omics representation learning: we propose the Spatial Transformer (SpaFormer), a transformer-based masked-autoencoder framework for extracting cellular context information and imputing spatial transcriptomics data. Despite the effectiveness of these models, their knowledge transferability across tasks and datasets remains limited. To overcome this, we introduce a transformer-based foundation model, the Cell Pre-trained Language Model (CellPLM), which encodes inter-cellular relations and multimodal features, demonstrating the significant potential of foundation models for future research in single-cell biology.
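To make the "cells as tokens" idea concrete, the sketch below shows a minimal transformer masked autoencoder over cell-by-gene expression matrices, in the spirit of the masked-autoencoder pretraining described in the abstract. It is not taken from the dissertation: the class names, dimensions, and random masking scheme are illustrative assumptions, and a real spatial model would also inject spatial positional information for each cell.

```python
# Illustrative sketch (assumed, not the dissertation's code): each cell's
# expression vector is one token, so a tissue patch becomes a "sentence" of
# cells that attend to one another through a transformer encoder.
import torch
import torch.nn as nn

class CellMaskedAutoencoder(nn.Module):
    def __init__(self, n_genes: int, d_model: int = 128, n_layers: int = 4,
                 n_heads: int = 4, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(n_genes, d_model)        # cell -> token embedding
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Linear(d_model, n_genes)      # reconstruct expression

    def forward(self, x):                               # x: (batch, n_cells, n_genes)
        tokens = self.embed(x)
        # Randomly mask a fraction of cells and replace them with a learned token.
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        hidden = self.encoder(tokens)                   # cells attend to tissue context
        return self.decoder(hidden), mask

# Toy usage: 2 tissue patches, 64 cells each, 500 genes (all sizes arbitrary).
model = CellMaskedAutoencoder(n_genes=500)
x = torch.rand(2, 64, 500)
recon, mask = model(x)
loss = ((recon - x) ** 2)[mask].mean()                  # reconstruct only masked cells
```

Because the reconstruction loss is computed only on masked cells, the model must infer a cell's expression from the context provided by neighboring cells, which is the mechanism that supports context-aware imputation of spatial transcriptomics data.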
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Wen, Hongzhi
- Thesis Advisors: Tang, Jiliang
- Committee Members: Tu, Guan-Hua; Liu, Hui; Xie, Yuying
- Date Published: 2024
- Subjects: Bioinformatics; Artificial intelligence
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 127 pages
- Permalink: https://doi.org/doi:10.25335/xdxk-jd57