What is the significance of the protein sequence of transmembrane proteins?

Membrane proteins act as gates that control a variety of vital cellular functions. They are often detected using transmembrane topology prediction tools; however, while such tools can detect integral membrane proteins, they do not address surface-bound proteins. In this study, we focused on finding the best techniques for distinguishing all types of membrane proteins.

Results

This research first demonstrates the shortcomings of merely using transmembrane topology prediction tools to detect all types of membrane proteins. Then, the performance of various feature extraction techniques in combination with different machine learning algorithms was explored. The experimental results obtained by cross-validation and independent testing suggest that an integrative approach that combines the results of transmembrane topology prediction with pseudo position-specific scoring matrix (Pse-PSSM) optimized evidence-theoretic k-nearest neighbor (OET-KNN) predictors yields the best performance.

Conclusion

The integrative approach outperforms the state-of-the-art methods in terms of accuracy and Matthews correlation coefficient (MCC): its accuracy reached 92.51% in independent testing, compared with the 89.53% and 79.42% accuracies achieved by the state-of-the-art methods.

Background

Membrane proteins play essential roles in transport, signaling, adhesion, and metabolism, which positions them as a leading drug target; over half of the current FDA-approved drugs target membrane proteins [1]. Membrane proteins are among the least characterized proteins in terms of their structure and function due to their hydrophobic surfaces and poor conformational stability. Distinguishing membrane proteins can help direct future experiments and provide clues regarding the functions of these proteins.

A major class of membrane proteins is the transmembrane proteins. These proteins have one or more transmembrane segments (TMSs) embedded in the lipid bilayer in addition to extramembranous hydrophilic segments that extend into the water-soluble domains on each side of the lipid bilayer. The embedded segments are distinguishable because they contain residues with hydrophobic properties that interact with the hydrophobic (nonpolar) tails of the membrane phospholipids. Other classes of membrane proteins include surface-bound proteins that do not extend into the hydrophobic interior of the lipid bilayer; they are typically bound to the lipid head groups at the membrane surface or attached to other transmembrane proteins. Unlike transmembrane proteins, surface-bound proteins such as peripheral and lipid-anchored proteins do not have TMSs and are therefore more difficult to distinguish from other globular proteins.

Two distinct approaches, namely, transmembrane topology prediction and membrane structure type prediction, are primarily used to detect membrane proteins. While transmembrane topology tools predict only a subset of membrane proteins (transmembrane proteins), they are applied more often than membrane structure type prediction tools due to the vast number of tools available and because transmembrane proteins constitute a major class of membrane proteins. However, overlooking the other classes of membrane proteins loses essential information. By contrast, membrane structure type prediction can be used to detect all classes of membrane proteins. In this work, we focus on detecting membrane proteins of all types, answering the question: given a protein sequence Q, is it a membrane protein?

The state-of-the-art tools that have achieved the highest overall performance in predicting all types of membrane proteins are MemType-2L [2] and iMem-2LSAAC [3]. While MemType-2L [2] has been in use for over a decade, it has maintained its popularity due to its simple yet effective methodology. MemType-2L incorporates evolutionary information by representing protein samples with pseudo position-specific scoring matrix (Pse-PSSM) vectors and combining the results obtained from individual optimized evidence-theoretic k-nearest neighbor (OET-KNN) classifiers. By contrast, iMem-2LSAAC uses the split amino acid composition (SAAC) to extract features from protein samples and then a support vector machine (SVM) to train the predictor.

MemType-2L is the only accessible tool for the prediction of all types of membrane proteins. When we tested it on a new set of membrane proteins, its accuracy reached only 80%, compared with the estimated accuracy of 92.7% in the original paper. This is because it was trained on the protein sequences available in 2006, and the protein sequence landscape has changed drastically since then, with a large surge in protein sequence entries. It is therefore essential to build a new accessible tool that accommodates all membrane data.

The main contributions of this work can be summarized as follows:

  • We establish a new benchmark dataset for membrane proteins (DS-M).

  • We evaluate the performances of traditional transmembrane topology prediction tools on DS-M to predict all types of membrane proteins.

  • We compare the performances of various machine learning techniques to detect membrane proteins; this comparison involves applying different feature extraction techniques to encode the protein sequences and choosing a proper machine learning algorithm to build a model from the extracted vectors.

  • We introduce a novel method, TooT-M, which integrates different techniques and achieves superior performance compared with all other methods, including the state-of-the-art methods.

Transmembrane topology prediction

Transmembrane topology prediction methods predict the number of TMSs and their respective positions in the primary protein sequence. Transmembrane proteins are integral membrane proteins (IMPs) that span the lipid bilayer and have exposed portions on both sides of the membrane. The portions that span the membrane are expected to contain hydrophobic (nonpolar) amino acids, while the portions on either side of the membrane consist mostly of hydrophilic (polar) amino acids. The TMSs can have either \(\alpha\)-helical or \(\beta\)-barrel structures, so prediction methods are classified into \(\alpha\)-helix prediction methods and \(\beta\)-barrel prediction methods.

Early prediction methods depended solely on simple measurements such as the hydrophobicity of the amino acids [4]. Major improvements were made after von Heijne introduced the “positive-inside rule” [5], which came from the observation that positively charged amino acids, such as arginine and lysine, tend to appear on the cytoplasmic side of the lipid bilayer. Current methods combine hydrophobicity analysis and the positive-inside rule together with machine learning techniques and evolutionary information.

For example, the membrane protein structure and topology support vector machine (MEMSAT-SVM) method [6], introduced in 2009, uses four support vector machines (SVMs) to predict transmembrane helices, inside and outside loops, re-entrant helices, and signal peptides. In addition, it includes evolutionary information on many homologous protein sequences in the form of a sequence profile. This method outputs predicted topologies ranked by overall likelihood and incorporates signal peptide and re-entrant helix prediction. The reported accuracy is 89% for the correct topology and location of TM helices and 95% for the correct number of TM helices. However, recent studies using experimental data report that MEMSAT-SVM does not perform as well when evaluated on different datasets [7, 8].

State-of-the-art methods use consensus algorithms that combine the outputs of different predictors. The consensus prediction of membrane protein topology (TOPCONS2) method [8] achieved the highest reported prediction accuracy on benchmark datasets [9]. It successfully distinguishes between globular and transmembrane proteins and between transmembrane regions and signal peptides. In addition, it is highly efficient, making it ideal for proteome-wide analyses. TOPCONS2 combines the outputs of several predictors that can also predict signal peptides (namely, Philius [10], PolyPhobius [11], OCTOPUS [12], signal peptide OCTOPUS (SPOCTOPUS) [13], and SCAMPI [14]) into a topology profile in which each residue is represented by one of four labels: signal peptide (S), membrane region (M), inside the membrane (I), or outside the membrane (O). A hidden Markov model then processes the resulting profile and predicts the final topology as the highest-scoring state path.
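The profile-combination step described above can be illustrated with a minimal per-residue majority vote. Note that this is a simplification: TOPCONS2 feeds the combined profile into a hidden Markov model rather than taking a plain vote, and the predictor outputs shown here are invented for illustration.

```python
from collections import Counter

def consensus_profile(predictions):
    """Combine per-residue topology strings from several predictors into a
    single profile by majority vote at each position. Each string uses the
    labels S (signal peptide), M (membrane), I (inside), O (outside)."""
    length = len(predictions[0])
    assert all(len(p) == length for p in predictions)
    profile = []
    for i in range(length):
        votes = Counter(p[i] for p in predictions)
        profile.append(votes.most_common(1)[0][0])  # most frequent label wins
    return "".join(profile)

# Three hypothetical predictor outputs for the same 12-residue sequence
preds = ["IIIMMMMMMOOO",
         "IIIMMMMMOOOO",
         "IIMMMMMMMOOO"]
print(consensus_profile(preds))  # IIIMMMMMMOOO
```

In TOPCONS2, the resulting S/M/I/O profile is then decoded with an HMM, which enforces a grammatically valid topology (e.g., membrane segments of plausible length) instead of accepting the raw vote.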

Regarding \(\beta\)-barrel membrane protein prediction, a variety of methods have been introduced, including methods based on statistical propensities [15], k-nearest neighbor (KNN) methods [16], neural networks [17, 18], hidden Markov models [19,20,21,22], SVMs [23], and amino acid compositions (AACs) [24, 25]. Approaches based on hidden Markov models have been found to achieve statistically significant performance gains compared with other types of machine learning techniques [26]. Major methods for detecting \(\beta\)-barrel outer membrane proteins are HHomp [27], \(\beta\)-barrel protein OCTOPUS (BOCTOPUS) [21], and PRED-TMBB2 [22], with reported MCCs of 0.98, 0.93, and 0.92, respectively, when applied to the same dataset. The BOCTOPUS and HHomp techniques are much slower than PRED-TMBB2 [22].

Prediction of the membrane protein structural type

Methods for predicting the membrane protein structural type can distinguish up to eight structural subtypes: single-pass types I, II, III, and IV; multipass transmembrane; glycophosphatidylinositol (GPI)-anchored; lipid-anchored; and peripheral membrane proteins. A comprehensive review by Butts et al. [28] elucidates these methods in detail. Generally, prediction is performed in two stages: the first stage identifies the protein sequence as membrane or nonmembrane, while the second stage differentiates among specific membrane protein subtypes. This research focuses on the first stage: detecting all membrane proteins, regardless of their type.

The MemType-2L [2] predictor was introduced in 2007 by Chou and Shen. It is a two-layer predictor whose first layer identifies a query protein as a membrane or nonmembrane protein. Then, if the protein is predicted to be a membrane protein, the second layer identifies its structural type from among the eight categories. MemType-2L incorporates evolutionary information by representing the protein samples with Pse-PSSM vectors and combining the results obtained by OET-KNN classifiers. It achieved an overall accuracy of 92.7% in the membrane detection layer; this reported first-layer performance was obtained by applying the jackknife test to the provided dataset.
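The Pse-PSSM representation can be sketched as follows. This is a sketch of one commonly cited formulation (20 column averages of the PSSM plus, for each lag up to a parameter \(\xi\), 20 averaged squared differences between rows that many positions apart), not the exact code of MemType-2L; the function name and the list-of-rows input format are illustrative assumptions.

```python
def pse_pssm(pssm, xi=1):
    """Sketch of a Pse-PSSM descriptor for an L x 20 PSSM given as a list
    of rows: the 20 column means, followed by, for each lag 1..xi, the 20
    per-column mean squared differences between rows `lag` apart.
    Returns a vector of 20 + 20*xi values."""
    L = len(pssm)
    feats = [sum(row[j] for row in pssm) / L for j in range(20)]  # column means
    for lag in range(1, xi + 1):
        for j in range(20):
            s = sum((pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(L - lag))
            feats.append(s / (L - lag))
    return feats

# Toy 3-residue PSSM with constant rows (illustration only)
demo = [[1.0] * 20, [3.0] * 20, [5.0] * 20]
print(len(pse_pssm(demo, xi=1)))  # 40
```

The lagged terms are what make the vector "pseudo": unlike a plain column average, they retain some information about the order of residues along the sequence.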

Butts et al. [29] introduced a tool that predicts all types of membrane proteins; it uses statistical moments to extract features from the protein samples and then trains a multilayer neural network with backpropagation to predict membrane proteins. This tool achieved an overall accuracy of 91.23% when applying the jackknife test to the dataset of Chou and Shen [2], slightly lower than the performance of the MemType-2L predictor.

iMem-2LSAAC was introduced in 2017 by Arif et al. [3]. It is a two-layer predictor whose first layer predicts whether a query protein is a membrane protein; for membrane proteins, the second layer then identifies the structural category. It utilizes the split amino acid composition (SAAC) to extract features from the protein samples and then applies an SVM to train the predictor. iMem-2LSAAC achieved an overall accuracy of 94.61% in the first layer when applying the jackknife estimator to its dataset.

Methods

Dataset

The latest publicly available benchmark dataset that contains both membrane and nonmembrane proteins was constructed by Chou and Shen [2] and was used to construct the MemType-2L predictor. Their dataset was collected from the Swiss-Prot database version 51.0, released on October 6, 2006. Furthermore, they eliminated proteins with 80% or more sequence similarity to reduce homology bias. Chou and Shen’s dataset contains a total of 15,547 proteins, of which 7,582 are membrane proteins and 7,965 are nonmembrane proteins.

Because of the rapidly increasing sizes of biological databases, we built a new updated dataset, DS-M. This dataset was collected from the Swiss-Prot database. The annotated membrane proteins were retrieved by extracting all of the proteins that are located in the membrane, using the following search query:

The remainder of the Swiss-Prot entries were designated as nonmembrane proteins.

The sequences in both classes were filtered by adhering to the following criteria:

  • Step 1: Protein sequences whose only evidence for protein existence is “inferred from homology” were removed.

  • Step 2: Protein sequences less than 50 amino acids long were removed, as they could be fragments.

  • Step 3: Protein sequences with no Gene Ontology molecular function (MF) annotation, or with annotation based only on computational evidence (inferred from electronic annotation, IEA), were excluded.

  • Step 4: Protein sequences with more than 60% pairwise sequence identity were removed using the CD-HIT [30] program to avoid homology bias.
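The first three filtering steps can be sketched as a predicate over protein records. The record layout (dictionary keys) below is hypothetical, standing in for the corresponding Swiss-Prot annotations; Step 4 is performed separately with the CD-HIT program on the surviving sequences.

```python
def passes_filters(record):
    """Apply Steps 1-3 to one protein record. `record` is a hypothetical
    dict with keys: 'sequence', 'existence' (protein-existence evidence),
    and 'go_mf_evidence' (evidence codes of Gene Ontology MF annotations)."""
    # Step 1: drop entries whose existence is only inferred from homology
    if record["existence"] == "inferred from homology":
        return False
    # Step 2: drop likely fragments (< 50 residues)
    if len(record["sequence"]) < 50:
        return False
    # Step 3: require at least one non-electronic (non-IEA) GO MF annotation
    if not any(ev != "IEA" for ev in record["go_mf_evidence"]):
        return False
    return True

records = [
    {"sequence": "M" * 120, "existence": "evidence at protein level",
     "go_mf_evidence": ["IDA"]},                    # kept
    {"sequence": "M" * 30, "existence": "evidence at protein level",
     "go_mf_evidence": ["IDA"]},                    # too short: fragment
    {"sequence": "M" * 120, "existence": "inferred from homology",
     "go_mf_evidence": ["IDA"]},                    # homology-only existence
    {"sequence": "M" * 120, "existence": "evidence at protein level",
     "go_mf_evidence": ["IEA"]},                    # only electronic annotation
]
print([passes_filters(r) for r in records])  # [True, False, False, False]
```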

All sequences from the membrane class and randomly selected sequences from the nonmembrane class were used to form the benchmark dataset. The data were randomly divided (stratified by class) into training (90%) and testing (10%) sets. To further limit homology bias between the training and testing sets, the sequences in the testing set were filtered such that no sequence has more than 30% pairwise identity to any sequence in the training set. The numbers of sequences in the training and testing datasets are shown in Table 1.

The dataset contains samples from different species, the most represented being Homo sapiens (18%), Arabidopsis thaliana (14%), Mus musculus (11%), Saccharomyces cerevisiae (8%), and Schizosaccharomyces pombe (6%).

Approximately 84% of the membrane data collected have a structural type annotation. Fig. 1 indicates that, of the annotated proteins, approximately 75% are transmembrane proteins (single-pass or multipass), while the remainder are peripheral, lipid-anchored, or GPI-anchored proteins.

Fig. 1 Membrane structural types

Fig. 2 Receiver operating characteristic analysis. Receiver operating characteristic (ROC) curves and the area-under-curve (AUC) scores for each model built using the a OET-KNN, b KNN, c SVM, d GBM, and e RF algorithms

Fig. 3 Choice of the optimal constituent classifiers among 50 classifiers. In the pair (x, y), x refers to the number of top-ranked components in the optimal feature set, and y refers to the accuracy achieved using those x components. The accuracy peaked when the number of top-ranked components was 3, 5, 15, 11, and 1 for the OET-KNN V50-, KNN V50-, SVM-, GBM-, and RF-based ensembles, respectively

Fig. 4 Choice of the optimal constituent classifiers among 500 classifiers. In the pair (x, y), x refers to the number of top-ranked components in the optimal feature set, and y refers to the accuracy achieved using those x components. The optimal numbers of features for the OET-KNN V500 and KNN V500 ensembles were 20 and 21, respectively. The performance started to deteriorate as more votes were accounted for. Overall, the results suggest that the selective voting approach outperforms the all voting approach

Fig. 5 Comparison with other state-of-the-art methods on the DS-M dataset

Fig. 6 Receiver operating characteristic analysis. ROC curves and the area-under-curve (AUC) scores for TooT-M and the state-of-the-art methods on the DS-M dataset

Table 1 Membrane dataset DS-M


Topology prediction tools

A protein is regarded as a membrane protein if at least one TMS is detected. For \(\alpha\)-helical transmembrane proteins, three tools were applied. The first, TOPCONS2 [8], is considered the state-of-the-art method and is known for its ability to distinguish signal peptides from transmembrane regions; its results were obtained through its available web server. The second, HMMTOP [31], is a highly efficient tool commonly used in the literature; its results were also obtained through its web server. The third, TMHMM [32], is also commonly applied in the literature, and its results were likewise obtained from its web server.
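The decision rule above (membrane if at least one TMS is detected) reduces to counting runs of membrane-labeled residues in a predicted topology string. The helper below is an illustrative sketch using TOPCONS-style single-letter labels, not the actual parser of any of the cited tools.

```python
def count_tms(topology):
    """Count transmembrane segments in a predicted topology string, where
    'M' marks membrane-embedded residues. Consecutive 'M' residues form
    one segment; any other label ends the current segment."""
    count = 0
    in_tms = False
    for label in topology:
        if label == "M" and not in_tms:
            count += 1       # a new membrane segment starts here
            in_tms = True
        elif label != "M":
            in_tms = False   # the segment (if any) has ended
    return count

print(count_tms("iiiiMMMMMMMMoooMMMMMMMMiiii"))  # 2 -> membrane protein
print(count_tms("oooooooo"))                     # 0 -> nonmembrane by this rule
```

Note that this rule alone is what makes topology predictors blind to peripheral and lipid-anchored proteins: their topology strings contain no 'M' runs, so they are indistinguishable from globular proteins here.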

Table 2 LOOCV performance of the individual models


Table 3 Performances of the all voting ensemble classifiers on the main dataset


Table 4 Performances of the selective voting ensemble classifiers on the main dataset


Regarding \(\beta\)-barrel transmembrane proteins, we applied PRED-TMBB2 [22], which shows performance comparable to that of the state-of-the-art \(\beta\)-barrel predictors but is much more efficient in terms of runtime [22]. The results of PRED-TMBB2 were obtained from its available web server.

Protein sequence encoding

After establishing the dataset, it is necessary to find the best representation of the protein sequences used to train the prediction engine. Generally, there are two options: sequential or discrete representations [2]. In a sequential representation, a sample protein is represented by its amino acid sequence, which can then be used in a similarity-search-based tool such as BLAST [33]. A major drawback of relying on similarity is that it fails when proteins with the same function share low sequence similarity. In a discrete representation, a sample protein is represented by a set of discrete numbers, usually the result of feature engineering. In this study, we encoded the protein sequences using the AAC, PAAC, and PseAAC baseline compositions. In addition, we applied the Pse-PSSM and SAAC encodings as described below.

Table 5 Transmembrane topology prediction performance on the training dataset


Table 6 TooT-M LOOCV performance


Table 7 Comparison with other state-of-the-art methods on the DS-M dataset


Table 8 Comparison with the iMem-2LSAAC predictor on the DS1 dataset


Table 9 Comparison with the MemType-2L predictor on the DS2 dataset


Amino acid composition (AAC)

The AAC is the normalized occurrence frequency of each amino acid. The fractions of all 20 natural amino acids are calculated by:

$$\begin{aligned} c_i = \frac{F_i}{L} \qquad i=1,2,3,\ldots ,20 \end{aligned}$$

(1)

where \(F_i\) is the frequency of the \(i{\mathrm{th}}\) amino acid and \(L\) is the length of the sequence. Each protein’s AAC is represented as a vector of size 20 as follows:

$$\begin{aligned} AAC(P) = \left[ c_{1} , c_{2} , c_{3} , \ldots , c_{20} \right] \end{aligned}$$

(2)

where \(c_{i}\) is the composition of the \(i{\mathrm{th}}\) amino acid.
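As a concrete illustration of the two formulas above, the AAC vector can be computed directly from a sequence; the fixed alphabetical residue ordering below is an implementation choice.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

def aac(sequence):
    """Normalized occurrence frequency of each amino acid: c_i = F_i / L,
    returned as a fixed-order vector of length 20."""
    counts = Counter(sequence)
    L = len(sequence)
    return [counts[a] / L for a in AMINO_ACIDS]

v = aac("ACCA")   # A and C each occur twice in a length-4 sequence
print(v[0], v[1])  # 0.5 0.5
```

By construction the 20 components sum to 1 for any sequence over the standard alphabet.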

Pair amino acid composition (PAAC)

The PAAC has an advantage over the AAC because it encapsulates information about the fraction of the amino acids as well as their order. It is used to quantify the preference of amino acid residue pairs in a sequence. The PAAC is calculated by:

$$\begin{aligned} d_{i,j} = \frac{F_{i,j}}{L-1} \qquad i,j=1,2,3,\ldots ,20 \end{aligned}$$

(3)

where \(F_{i,j}\) is the frequency of the pair (dipeptide) of the \(i{\mathrm{th}}\) and \(j{\mathrm{th}}\) amino acids and \(L\) is the length of the sequence. Similar to the AAC, the PAAC is represented as a vector of size 400 as follows:

$$\begin{aligned} PAAC(P) = \left[ d_{1,1} , d_{1,2}, d_{1,3} , \ldots , d_{20,20} \right] \end{aligned}$$

(4)

where \(d_{i,j}\) is the dipeptide composition of the \(i{\mathrm{th}}\) and \(j{\mathrm{th}}\) amino acids.
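The pair composition defined above can likewise be sketched in a few lines; the ordering of the 400 ordered pairs is an implementation choice.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

def paac(sequence):
    """Dipeptide (pair) composition d_{i,j} = F_{i,j} / (L - 1), returned
    as a vector of length 400 indexed over all ordered residue pairs."""
    pairs = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    index = {p: k for k, p in enumerate(pairs)}
    vec = [0.0] * 400
    L = len(sequence)
    for i in range(L - 1):              # slide a window over adjacent pairs
        vec[index[sequence[i:i + 2]]] += 1.0 / (L - 1)
    return vec
```

For example, `paac("ACA")` places weight 0.5 on the pair "AC" and 0.5 on "CA", capturing local residue order that the plain AAC discards.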

Pseudo-amino acid composition (PseAAC)

The PseAAC was proposed in 2001 by Chou [34] and showed a remarkable improvement in prediction quality compared with the conventional AAC. The PseAAC combines the 20 components of the conventional AAC with a set of sequence-order correlation factors that incorporate biochemical properties. Given a protein sequence of length \(L\),

$$\begin{aligned} R_1 R_2 R_3 R_4 \ldots R_L \end{aligned}$$

(5)

a set of descriptors called sequence-order-correlated factors is defined as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \theta _1 = \displaystyle \frac{1}{L-1} \sum _{i=1}^{L-1} \Theta (R_i,R_{i+1}) \\ \theta _2 = \displaystyle \frac{1}{L-2} \sum _{i=1}^{L-2} \Theta (R_i,R_{i+2}) \\ \theta _3 = \displaystyle \frac{1}{L-3} \sum _{i=1}^{L-3} \Theta (R_i,R_{i+3}) \\ \vdots \\ \theta _\lambda = \displaystyle \frac{1}{L-\lambda } \sum _{i=1}^{L-\lambda } \Theta (R_i,R_{i+\lambda }) \end{array} \right. \end{aligned}$$

(6)

The parameter \(\lambda\) is chosen such that \(\lambda < L\), so that every correlation factor averages over at least one residue pair.
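The sequence-order-correlated factors above can be sketched as follows. The correlation function \(\Theta\) is simplified here to a squared difference of a single residue property, whereas Chou's definition averages squared differences of hydrophobicity, hydrophilicity, and side-chain mass; the toy property values are invented for illustration.

```python
def theta_factors(sequence, lam, prop):
    """Sequence-order-correlated factors theta_1..theta_lam.
    `prop` maps a residue to a numeric property; Theta(R_i, R_j) is
    taken as the squared property difference (a simplified stand-in
    for Chou's combined three-property term). Requires lam < len(sequence)."""
    L = len(sequence)
    thetas = []
    for k in range(1, lam + 1):
        s = sum((prop(sequence[i]) - prop(sequence[i + k])) ** 2
                for i in range(L - k))
        thetas.append(s / (L - k))  # average over the L - k residue pairs
    return thetas

# Toy property scale (hypothetical values, for illustration only)
scale = {"A": 1.0, "L": 2.0, "K": -1.0}
print(theta_factors("ALKA", 2, scale.get))  # theta_1 ~ 4.67, theta_2 = 2.5
```

The requirement \(\lambda < L\) is visible in the code: each \(\theta_k\) divides by \(L - k\), which must stay positive.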
