Authors: Julie Chabalier (corresponding author) [1]; Jean Mosser [2,3]; Anita Burgun [1]
Background
Interpreting data from high-throughput analyses is a challenge in bioinformatics. Two major information sources are usually used for this interpretation: expression data and biological annotations, mainly based on the Gene Ontology™ (GO) [1]. According to Eisen et al., expression data organize genes into functional categories [2]: genes that are expressed together share common functions. Therefore, microarray experimental data are usually interpreted through the following "standard" approach: 1) the genes are organized into clusters according to their differential expression pattern, and 2) for each cluster, the list of genes is translated into a functional profile able to offer insight into the cellular mechanisms relevant in the given condition [3]. Several tools have been proposed for ontological analysis of gene expression data (for a review, see [4]). Among them, following the standard approach, Gibbons and Roth proposed to judge the quality of expression-based clustering methods using GO terms [5]. However, as argued in [6, 7], complex biological functions emerge from interactions between gene products. Integrated systems, defined as the assembly of individual gene products into such complexes, can collaborate in broader biological processes. For example, in Bacillus subtilis, an ABC transporter and a two-component regulatory system, involved in transport and signal transduction respectively, collaborate in the same biological process: antibiotic resistance [8]. Therefore, if different functions can be involved in a common biological process, we can assume that the genes involved in such a process can be differentially expressed. Consequently, the standard approach makes it difficult to highlight functional relationships between gene products when they belong to different expression clusters.
Complementary to the standard approach, we define a transversal analysis that aims to predict functional networks of gene products based on the biological processes they belong to. Simultaneously, the genes involved in such networks are clustered according to their expression patterns. The combined visualization of functional networks and expression clusters is expected to offer new insight into the roles of the gene products. We propose to use ontology-based similarity to predict functional gene product networks. Building on GO term similarity, the semantic similarity between gene products is obtained by comparing the terms assigned to a pair of gene products. Typically, two approaches are used to compute term similarity in hierarchies. The path-based method relies on the edge-counting approach defined in [9]: the shorter the path from one node to the other, the more similar the nodes are. However, the semantic distances between any two adjacent nodes are not necessarily equal. Indeed, the distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details. The information content method is based on the Lin, Jiang and Resnik measures [10, 11, 12]. This approach relies on the frequency of a concept in a large corpus. Based on this approach, ongoing work aims to establish functional relationships between gene products [13, 14, 15, 16]. As discussed in [17], the information content approach tends to give better results for term similarity than the path-based method. However, when applied to gene similarity, it does not always estimate the similarity between genes meaningfully, because it does not take into account the hierarchy organizing the terms (e.g. [18]).
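To make the contrast between the two families of term similarity concrete, the following minimal sketch computes an edge-counting path length and a Resnik-style information content similarity on a toy hierarchy. The hierarchy, the term names and the annotation counts are hypothetical and are not taken from GO or from the data set studied here; the sketch only illustrates the general idea behind [9] and [10, 11, 12], not the method proposed in this paper.

```python
import math
from collections import deque

# Hypothetical toy hierarchy (child -> parents); these are NOT real GO terms.
PARENTS = {
    "root": [],
    "transport": ["root"],
    "signaling": ["root"],
    "ion_transport": ["transport"],
    "sugar_transport": ["transport"],
}

# Hypothetical number of gene products annotated directly with each term.
DIRECT_COUNTS = {
    "root": 0,
    "transport": 2,
    "signaling": 4,
    "ion_transport": 3,
    "sugar_transport": 1,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def path_length(a, b):
    """Edge counting: number of edges on the shortest path between a and b."""
    neighbours = {t: set(ps) for t, ps in PARENTS.items()}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbours[parent].add(child)  # treat edges as undirected
    queue = deque([(a, 0)])
    visited = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours[term]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path between the two terms


def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t and its descendants."""
    cumulative = {t: 0 for t in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {t: -math.log(c / total) for t, c in cumulative.items() if c > 0}


def resnik(a, b):
    """Resnik-style similarity: IC of the most informative common ancestor."""
    ic = information_content()
    common = ancestors(a) & ancestors(b)
    return max(ic[t] for t in common)
```

On this toy hierarchy the two sibling terms under "transport" are two edges apart and share an informative ancestor, whereas a "transport" term and "signaling" only share the root, whose information content is zero. Edge counting treats every edge as the same distance, which is the limitation noted above.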
The transversal approach presented in this paper computes the semantic similarity between gene products in a Vector Space Model [19]. Gene products are described as vectors of GO terms. The major contribution of this approach is the possibility of using a weighting scheme over the annotations. Comparing such annotation vectors yields a matrix of gene similarity. Combined with expression data, the matrix is displayed as a set of functional gene networks. Each gene-gene relation is associated with the shared annotations. Hierarchy issues are addressed by an 1)
This paper is organized as follows. First, biological results and their comparison with the KEGG pathways are presented and discussed, then the transversal analysis methodology is detailed.
Results
Overview of the transversal approach
The standard approach to interpreting transcriptomic data aims to retrieve the biological processes mainly involved in a specific condition (for example, mitosis, oncogenesis and proliferation processes are involved in cancer). For this purpose, a collection of differentially expressed genes (up-regulated or down-regulated genes) is characterized by a set of ranked GO terms. Complementary to this approach, the transversal analysis exploits GO term similarity to cluster the gene products. The behavior of the resulting networks is then analyzed with respect to gene expression. Briefly, our method proceeds as follows (see Figure 1):
- starting with a collection of gene products that have been clustered according to their expression with an expression-based clustering method;
- selection of GO terms associated with each gene product according to an
- construction of a weighted term vector for each gene product (Figure 1b);
- pairwise comparison of these vectors in a Vector Space Model. This comparison results in a half-matrix of gene product similarities (Figure 1c);
- selection of a similarity threshold to obtain the pairs of gene products that have a high degree of similarity;
- displaying the resulting pairs of gene products associated with their corresponding expression clusters (Figure 1d). A gene product pair is displayed as two nodes linked by an edge. This results in a set of "transversal networks". The most frequent terms are used to describe each network as a biological profile (Figure 1e).
At this step, the resulting networks are biologically interpreted. This analysis can be refined by performing several runs at various levels of abstraction (named
A detailed description of the methodology is provided in the Methods section.
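The vector comparison, thresholding and pairing steps outlined above can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is used as the vector comparison because it is a common choice in vector space models, and the gene names, term names, weights, cluster labels and threshold are all hypothetical. The actual weighting scheme and similarity measure are those described in the Methods section.

```python
import math
from itertools import combinations

# Hypothetical weighted term vectors (gene product -> {term: weight}).
VECTORS = {
    "geneA": {"term:transport": 1.0, "term:signaling": 0.5},
    "geneB": {"term:transport": 1.0},
    "geneC": {"term:metabolism": 1.0},
}

# Hypothetical expression cluster label for each gene product.
EXPRESSION_CLUSTER = {"geneA": 1, "geneB": 2, "geneC": 1}


def cosine(u, v):
    """Cosine similarity between two sparse weighted term vectors (assumed measure)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def half_matrix(vectors):
    """Pairwise similarities; each unordered pair appears once (half-matrix)."""
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


def network_edges(vectors, clusters, threshold):
    """Keep pairs at or above the threshold, with clusters and shared terms."""
    edges = []
    for (a, b), sim in half_matrix(vectors).items():
        if sim >= threshold:
            shared = sorted(set(vectors[a]) & set(vectors[b]))
            edges.append((a, clusters[a], b, clusters[b], sim, shared))
    return edges


# Hypothetical threshold; in practice it is selected by the analyst.
edges = network_edges(VECTORS, EXPRESSION_CLUSTER, threshold=0.8)
```

With these toy values, geneA and geneB are linked through their shared term even though they fall into different expression clusters, which is the kind of relation the transversal analysis is meant to reveal.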
Dataset
The transversal analysis was applied to a set of genes related to enterocyte differentiation. These genes were previously studied with a standard approach [20]. In this paper, we refer to this set of genes as the Bedrine-Ferran gene set (BF set). As CaCo-2 cells spontaneously differentiate into enterocytes, this cell line was used to characterize genes whose expression varies during differentiation by means of microarray experiments. The authors performed a clustering with Self-Organizing Maps (see Methods section), and the resulting expression clusters are used in our approach in combination with the transversal networks. These experiments led to the identification of 186 significant genes through the in vitro differentiation process: 50 were …
