Posted 10.10.2019 by admin

Molecular Descriptors For Cheminformatics Pdf Download


Molecular descriptors and fingerprints have been routinely used in QSAR/SAR analysis, virtual drug screening, compound search/ranking, drug ADME/T prediction and other drug discovery processes. Since the calculation of such quantitative representations of molecules may require substantial computational skills and efforts, several tools have been previously developed to make an attempt to ease.

This will bring up a context menu for that column which allows users to remove the column from the table (Remove Column), or to add a new column using data from corresponding Cytoscape attributes (Add New Column→Cytoscape attributes→) or calculated molecular descriptors (Add New Column→Molecular descriptors→). See the section below.

The MOLE db - Molecular Descriptors Data Base is a free on-line database constituted of 1124 molecular descriptors calculated on 234773 molecules. This data base is intended as a research and teaching tool and basically allows the researcher to: a) search for a specific group of molecules and analyse the corresponding values of molecular descriptors b) save in an output file the values of a.

Part of the book series (MIMB, volume 275). Abstract: Three sets of molecular descriptors that can be computed from a molecular connection table are defined. The descriptors are based on the subdivision and classification of the molecular surface area according to atomic properties (such as contribution to log P, molar refractivity, and partial charge). The resulting 32 descriptors are shown (a) to be weakly correlated with each other; (b) to encode many traditional molecular descriptors; and (c) to be useful for QSAR, QSPAR, and compound classification.

Citation: Dehmer M, Emmert-Streib F, Tripathi S (2013) Large-Scale Evaluation of Molecular Descriptors by Means of Clustering. PLoS ONE 8(12): e83956.

Editor: Danilo Roccatano, Jacobs University Bremen, Germany

Received: June 6, 2013; Accepted: November 9, 2013; Published: December 31, 2013

Copyright: © 2013 Dehmer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Matthias Dehmer and Shailesh Tripathi thank the Austrian Science Funds for supporting this work (project P22029-N13). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors declare that co-author Frank Emmert-Streib is a PLOS ONE Editorial Board member. This does not alter their adherence to all the PLOS ONE policies on sharing data and materials.

Introduction

Molecular descriptors map molecular structures to the reals by taking physical, chemical or structural information into account. A large number of descriptors have been developed to describe different properties of molecular graphs. These descriptors can be classified into different categories depending on what kind of information (e.g., physical, chemical or structural) is used to define such a measure.

The commercial software package Dragon (version 6.0.26) contains 4885 molecular descriptors, which are classified into 29 categories. The problem of analyzing molecular descriptors by applying clustering techniques has already been explored.

These studies are usually based on principal component analysis (PCA) and correlation-based methods for the identification of different descriptors. For example, Todeschini et al. and Basak et al. evaluated descriptors on a rather small collection of molecular graphs using PCA and ranked them based on their intercorrelation. In order to find similarities between molecular descriptors, Basak et al. used a PCA-based clustering technique on both a hydrocarbon dataset and mixed chemical compounds. Taraviras et al. performed a cluster analysis with 240 descriptors by using different clustering algorithms. The weak point of the approaches just sketched is that the corresponding studies have not been performed on a large scale (large data sets) and with distinct descriptors belonging to several categories. Also, the optimal number of different descriptors (dimension) has not been validated statistically.

In this paper, we overcome these problems. A thorough evaluation of the vast amount of developed descriptors is required to identify categories of descriptors which capture structural information differently.

In our analysis we evaluate 6 categories (see next section) of structural descriptors by means of clustering. The main contribution of this paper is to explore the dimension of the descriptor space, i.e., how many different descriptors exist among all those which have been introduced so far. Here, we put the emphasis on 919 structural descriptors from Dragon. In particular, we find that only very few descriptors are different. In this context, that means they are least correlated and, therefore, capture structural information differently.

Data

In order to evaluate the above-mentioned 6 categories of descriptors, we use 3 data sets: one contains (non-isomorphic) molecular structures (only skeletons, i.e., without vertex and edge labels) inferred from the NIST spectral database; one contains exhaustively generated (non-isomorphic) tree structures with 15 vertices each; and one contains exhaustively generated (non-isomorphic) graphs with 8 vertices each.

To perform our analysis, we calculate the descriptor values for these three data sets. We removed those descriptors which give constant or erroneous values on the three data sets. The erroneous values are produced by those descriptors for which we have not been able to calculate a descriptor value of a network without additional physical or chemical information. Finally, the above-mentioned six categories contain 24, 301, 57, 28, 40 and 469 descriptors, respectively.

Clustering Techniques

Clustering is an unsupervised learning technique which aims to find different groups or clusters of objects in data.

The groups are described as collections of objects which are closer to each other than to the rest of the objects. An example thereof is hierarchical clustering, in which groups of objects are arranged in a hierarchical order by a so-called dendrogram. The objects which are clustered in one group have a higher degree of similarity than the objects which are clustered in different groups. Thus, a resulting clustering solution allows one to determine clusters where each cluster shows a distinct property of the data. The similarity or dissimilarity between two objects is usually determined by using a similarity/distance function which measures the similarity/distance between data points of different objects.

Examples are the Euclidean distance, the Manhattan distance or the correlation-based distance. A dissimilarity can be described as follows:

Several algorithms have been developed for cluster analysis. These algorithms can be divided into several categories, namely partition-based clustering, hierarchical clustering, density-based clustering, grid-based clustering and fuzzy clustering.

Thus, k-means, soft k-means and k-medoids clustering are some examples of non-hierarchical clustering methods. Hierarchical clustering itself can be divided into two categories called agglomerative and divisive clustering. Several concrete methods thereof have been developed, such as single linkage, complete linkage and average linkage. In order to evaluate the descriptors, we perform hierarchical clustering (average linkage) by using the mentioned Dragon descriptors and a distance measure based on the Spearman rank correlation. Here, we denote the correlation matrix between descriptors as.

Then, the distance between a pair of descriptors is defined by (1). In order to choose a clustering method, we use the cophenetic correlation measure. A high correlation coefficient shows that the distances between the data points are well preserved by the dendrogram of the hierarchical clustering solution. We calculate the cophenetic correlation for seven hierarchical clustering algorithms: the Ward, single, complete, average, McQuitty, median and centroid methods. In our analysis, the cophenetic correlation coefficient is highest for the average-linkage solution for all three data sets. The cophenetic correlation coefficients for the average-linkage solutions for the three data sets are 0.84, 0.89 and 0.93.
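The chain described above (Spearman rank correlation, a correlation-based distance, average-linkage clustering, cophenetic correlation) can be illustrated with a minimal, dependency-free sketch. It is not the authors' code. Equation (1) did not survive in this text, so the distance `1 - |rho|` used below is an assumption, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def distance_matrix(columns):
    """ASSUMPTION: d_ij = 1 - |rho_ij|; the source's Eq. (1) is not recoverable."""
    p = len(columns)
    dist = [[0.0] * p for _ in range(p)]
    for i, j in combinations(range(p), 2):
        dist[i][j] = dist[j][i] = 1.0 - abs(spearman(columns[i], columns[j]))
    return dist


def average_linkage(dist):
    """Agglomerative average-linkage clustering; returns cophenetic distances."""
    p = len(dist)
    clusters = [[i] for i in range(p)]
    coph = [[0.0] * p for _ in range(p)]
    while len(clusters) > 1:
        best = None
        # Pick the two clusters with the smallest mean pairwise distance.
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # The merge height is the cophenetic distance of every cross pair.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = d
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: a < b
    return coph


def cophenetic_correlation(dist, coph):
    """Pearson correlation between original and cophenetic distances."""
    pairs = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in pairs],
                   [coph[i][j] for i, j in pairs])
```

Each entry of `columns` holds the values of one descriptor over all graphs of a data set. Recomputing cluster distances from the original matrix at every step keeps the sketch short but is slow for hundreds of descriptors; a library implementation would be the practical choice at that scale.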

Cluster Validity

Cluster validity is used to evaluate the quality of a clustering solution (obtained by using a certain clustering algorithm), e.g., the optimum number of clusters in the data, or whether the resulting cluster solution fits the data. Known clustering validation techniques are divided into three categories, namely internal, external and relative validity criteria.

External validation criteria evaluate clustering solutions against a predefined clustering structure. Internal validation criteria are used to find the optimal number of clusters based on the intrinsic knowledge of the data. Relative validation criteria are used to compare two different clustering solutions.

In order to perform our analyses, we use external and internal clustering validation criteria. For the external validation, we compared the clustering solution with a predefined group of clusters which serve as reference clusters. The external clustering validity of a clustering solution with respect to the given reference cluster is estimated by using the information-theoretic quantity (normalized mutual information), defined by (2), where (3) (4) (5). Hereby, we assume that we have two clustering solutions and which have and clusters. The overlap between these two clusters is shown in the contingency. We calculated for all three data sets with different numbers of clusters.

The Optimal Number of Clusters

The optimal number of clusters (internal cluster validity) is determined by consensus clustering, which has been performed here as follows.

Assume we evaluate descriptors on a dataset containing molecular graphs. Thus we get descriptor values for each descriptor. First, we resample the data of sample-size, times for descriptors to generate clustering solutions, for clusters, where. After that we calculate the consensus indices for each cluster, which is defined as follows: (6) As to the measure, we use the adjusted Rand index defined by (7). The number of clusters for which attains its maximum is chosen as the optimal number of clusters, namely.

Determining a Highly Correlated Subset of Descriptors

Let be a set of descriptors and is its cardinality. Let be a subset of. The selected descriptors can be reduced to a set of descriptors,. The remaining descriptors will have a significant correlation with at least one of the descriptors in the set and the descriptors in are not significantly correlated.

If two descriptors are showing a significant correlation with each other, then we conclude that they capture structural information similarly. In order to predict the significance of the correlation between two descriptors, we perform the following approach: Let be a dataset of descriptors and samples.

First, we generate bootstrap datasets, possessing sample size, where. Then, for each dataset, we perform a correlation test between each pair of descriptors and obtain a p-value for each pair. Thus, we test hypotheses for all pairs. In order to control the false positives in the multiple hypothesis testing problem, we use the Bonferroni correction method for multiple testing correction (MTC) and obtain adjusted p-values. For each pair these adjusted p-values are denoted. In order to decide whether the correlation between a pair is significant, we choose. After applying the correlation test and MTC, we obtain a binary matrix which is defined as follows: (9) Finally, we calculate a summary statistic, T(i,j), for each pair of descriptors by averaging the values, i.e., (10) In order to decide whether the correlation between two descriptors is strong, we choose a cut-off threshold.
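The bootstrap procedure above can be sketched as follows. This is a hedged illustration, not the authors' implementation. The number of bootstrap data sets, the sample size and the significance level are missing from this text, so `n_boot`, `sample_size` and `alpha` are hypothetical parameters. The text does not name the correlation test either, so the sketch assumes a Spearman correlation with a Fisher z approximation for the p-value.

```python
import random
from math import atanh, erfc, sqrt


def ranks(v):
    """1-based ranks; tied values share the mean rank."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r, i = [0.0] * len(v), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var) if var > 0 else 0.0  # constant column: no correlation


def p_value(rho, n):
    """Two-sided p-value via the Fisher z approximation (ASSUMPTION, needs n > 3)."""
    rho = max(min(rho, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
    z = atanh(rho) * sqrt((n - 3) / 1.06)
    return erfc(abs(z) / sqrt(2))


def summary_statistic(columns, n_boot=100, sample_size=None, alpha=0.05, seed=0):
    """T[i][j]: fraction of bootstrap data sets in which the pair (i, j) is
    significantly correlated after Bonferroni correction."""
    rng = random.Random(seed)
    p, n = len(columns), len(columns[0])
    m = sample_size or n
    n_tests = p * (p - 1) // 2  # one hypothesis per descriptor pair
    T = [[0.0] * p for _ in range(p)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(m)]  # resample graphs with replacement
        boot = [[col[i] for i in idx] for col in columns]
        for i in range(p):
            for j in range(i + 1, p):
                rho = spearman(boot[i], boot[j])
                adjusted = min(1.0, p_value(rho, m) * n_tests)  # Bonferroni
                if adjusted <= alpha:
                    T[i][j] += 1 / n_boot
                    T[j][i] = T[i][j]
    return T
```

Averaging the binary significance indicators over the bootstrap data sets gives a value between 0 and 1 for every pair, which is then compared with the cut-off threshold.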

If the summary statistic between two descriptors satisfies the inequality, then we define the two descriptors to be strongly correlated with each other. The descriptors in the set have been chosen as follows. Suppose a descriptor has the maximum number of summary statistics greater than or equal to (i.e., where ); then the descriptor is ranked first and is included in the subset. Then we remove the descriptor and the other descriptors with which has summary-statistic. Then, we apply the same procedure to the remaining descriptors for as long as some descriptor still has such summary statistics with the remaining descriptors. Note that some of the descriptors do not have any summary statistic greater than with any of the other descriptors.

These descriptors are described as lowly correlated descriptors, and they are also included in the subset. This procedure reduces descriptors to descriptors. That means, starting with a set of descriptors, we hypothesize that the set identifies structural properties of a graph class distinctly. The remaining descriptors show stronger similarity (correlation) with at least one of the descriptors of set.

Interpretation of the Results

The clustering of descriptors for three datasets is shown.

In this figure, the six categories of descriptors are shown in different colors. The figure indicates that the descriptors of each category have not been clustered correctly regarding their respective groups. For the external validity of the resulting clustering solution, we estimated (normalized mutual information) between reference cluster, (the descriptors of six categories, and are considered as the groups of the reference cluster) and the number of clusters of the clustering solution by cutting at different heights. The estimated normalized mutual information is calculated by sampling the data times. Results for the three data sets (average NMI) are shown in.

The plot of the average normalized mutual information between the reference cluster and the clusters created by average-linkage hierarchical clustering shows that the predicted clusters and the reference cluster are quite dissimilar. Also, the descriptors of different categories are strongly correlated with each other.
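To make the external validation measure concrete, here is a small sketch of the normalized mutual information between a reference labelling (the six categories) and a clustering solution. Equations (2) to (5) are missing from this text, so the normalization by the geometric mean of the two entropies is an assumption; other normalizations (for example the arithmetic mean or the maximum) are also in common use.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a labelling (natural logarithm)."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information computed from the contingency counts of u and v."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))  # contingency table as a sparse dict
    return sum(c / n * log(c * n / (count_u[a] * count_v[b]))
               for (a, b), c in joint.items())


def nmi(u, v):
    """ASSUMPTION: NMI = I(U, V) / sqrt(H(U) * H(V))."""
    hu, hv = entropy(u), entropy(v)
    if hu == 0 or hv == 0:
        return 0.0  # a single-cluster labelling carries no information
    return mutual_information(u, v) / sqrt(hu * hv)


# Usage: reference categories vs. cluster labels for the same descriptors.
reference = ["conn", "conn", "edge", "edge", "info", "info"]
clusters = [1, 1, 1, 1, 2, 2]
score = nmi(reference, clusters)  # between 0 (unrelated) and 1 (identical)
```

A value close to 0, as reported for the three data sets, means that knowing the cluster of a descriptor says little about its category.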

The normalized mutual information, between reference clusters, and the number of clusters, obtained by hierarchical clustering for three data-sets (left), (right) and (bottom). For each has been generated by sampling the data sets, where (data set ). The total number of descriptors equals 919.

They belong to 6 different categories, which are as follows: connectivity indices (24), edge adjacency indices (301), topological indices (57), walk path counts (28), information indices (40) and 2D Matrix-based (469). Next, we predict the optimal number of clusters by using the consensus index for different numbers of clusters generated by a clustering solution. The plots for the consensus indices for the three data sets are shown in.

The consensus indices are calculated for, clusters. For different numbers of clusters, the consensus index for the three data sets does not show an absolute maximum. Therefore, we selected the first local maximum, which gives the optimal number of clusters.
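The internal validation step can be sketched as below. The adjusted Rand index follows its standard pair-counting definition. Equation (6) is missing from this text, so the consensus index is assumed here to be the mean pairwise adjusted Rand index over the clusterings obtained from resampled data. The helper `first_local_maximum` and its tie handling are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(u, v):
    """Adjusted Rand index of two labellings of the same objects."""
    n = len(u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def consensus_index(solutions):
    """ASSUMED form of Eq. (6): mean ARI over all pairs of clustering
    solutions generated from resampled data for a fixed number of clusters."""
    pairs = list(combinations(solutions, 2))
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)


def first_local_maximum(ci_by_k):
    """Smallest k whose consensus index exceeds that of k-1 and is not
    exceeded by that of k+1; falls back to the global maximum."""
    ks = sorted(ci_by_k)
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        if ci_by_k[k] > ci_by_k[prev] and ci_by_k[k] >= ci_by_k[nxt]:
            return k
    return max(ks, key=ci_by_k.get)
```

In this setting every clustering solution labels the same descriptors, because the resampling is done over graphs, so the label vectors passed to `consensus_index` all have equal length.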

The optimal number of clusters is shown with a dotted red line in the. The consensus indices ( ) for the optimal number of clusters ( ) and the total number of descriptors (, where ) in each cluster for the three data sets, and are shown in. The optimal number of clusters is very small for all three data sets and for all data sets. The first cluster is the largest one, which contains more than of descriptors. The remaining clusters are smaller, as they contain far fewer descriptors. The largest cluster for all three datasets contains descriptors from all six categories, which means that most of the descriptors from different categories are strongly correlated and, therefore, measure structural information similarly. The optimal number of clusters for the three data sets obtained by using consensus indices (CI).

As a next step, we examine the so-called overlap between the optimal number of clusters shown in and the six categories of descriptors. That means we have to determine how the descriptors of each category are distributed over the different groups (belonging to the optimal number of clusters). This distribution over different clusters could indicate which category might capture structural information of the graphs more uniquely than others. The results are shown in, and we interpret them as follows. The intersection of the descriptors between the optimal clusters and the categories of descriptors shows that the edge adjacency indices are grouped into different clusters for all three data sets in comparison to the remaining categories. The 2D Matrix-based descriptors are grouped into different clusters by using and. The information indices are grouped into two different clusters by using all three data sets.

The measures from the categories walk path counts and topological indices are grouped into different clusters by using only. This shows that these descriptors behave differently on trees. The overlap indicates that the group of edge adjacency indices contains more descriptors which capture structural information of the graphs differently compared to other categories. The descriptors in predicted clusters (rows) overlapping with different categories of descriptors. Next, we find a subset of descriptors.

The main idea is to find a smaller set of descriptors which are only weakly correlated with each other and, hence, capture structural information uniquely. If they were strongly correlated, they would capture similar structural information of the graphs. Importantly, the remaining descriptors have a much stronger correlation with them. The procedure to obtain a subset of descriptors is described in the section 'Methods and Results'.
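The selection procedure can be sketched as a greedy loop over the matrix of summary statistics. This is an illustrative reading of the description, not the authors' code. The cut-off value is missing from this text, so `threshold` is a hypothetical parameter, and breaking ties by the lowest index is an assumption.

```python
def select_descriptors(T, threshold):
    """Greedy choice of a weakly correlated subset.

    T[i][j] is the summary statistic of descriptors i and j. In each round the
    descriptor with the most strongly correlated partners (T >= threshold)
    among the remaining descriptors is selected, and it is removed together
    with those partners. Descriptors without any strong partner end up in the
    subset as well.
    """
    remaining = set(range(len(T)))
    selected = []
    while remaining:
        partners = {
            i: {j for j in remaining if j != i and T[i][j] >= threshold}
            for i in remaining
        }
        # max() returns the first maximum, so ties go to the lowest index.
        best = max(sorted(remaining), key=lambda i: len(partners[i]))
        selected.append(best)
        remaining -= partners[best] | {best}
    return selected


# Usage: descriptor 0 is strongly tied to 1 and 2, 3 to 4, and 5 to nothing.
T = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
subset = select_descriptors(T, threshold=0.8)
```

By construction no two selected descriptors are strongly correlated with each other, and every removed descriptor is strongly correlated with at least one selected descriptor.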

We obtained for datasets shown in. The level plots of the correlation for the subsets of descriptors of the three data sets are shown in. For all three data sets, we can clearly see that the descriptors of these subsets are not strongly correlated. These subsets of descriptors for all three data sets might detect structural features of the molecular graphs uniquely. Given the subset, the remaining descriptors have at least one pair for which the summary statistic is greater than with descriptors. Moreover, we now examine for all data sets which descriptors from (shown in ) belong to which group out of the six categories of descriptors.

The results are summarized in. For each data-set, we start with a different number of descriptors for the different categories. The subset does not contain any descriptor from the connectivity indices for all three data-sets, however, only two descriptors from walk path counts are contained in by using.

Two, four and three descriptors from the category topological indices are contained in for all three data-sets. Three, two and three descriptors from the category information indices are in for three data-sets.

Seven, three and three descriptors from the category 2D Matrix-based are in for three data-sets. Seven, eleven and seven descriptors from the category edge adjacency indices are in for,. These are the maximal numbers of descriptors compared to other categories of descriptors.

The large occurrence of the descriptors from the category edge adjacency indices shows again that these descriptors quantify structural information more uniquely than others. The number of descriptors of which belong to six different categories by using three data sets.

Also, we examine the overlap between the descriptors from and the descriptors in the found clusters; the intersections between them are shown in. Interestingly, at least one descriptor (for all data sets) overlaps with the descriptors of each cluster, except for the ninth cluster by using. The overlap with the found clusters shows that the measures contained in (for three data sets) have the potential to quantify unique structural features of molecular graphs.

Summary and Conclusions

In this paper, we have evaluated Dragon descriptors to investigate to what extent these measures quantify structural information of molecular graphs uniquely.

From our analysis, it is clear that measures which are strongly correlated are not useful, as they capture structural information similarly. From this, the question of determining the usefulness or quality of topological indices arises. By calculating the information-theoretic quantity NMI, we found that the six categories of descriptors used are strongly correlated with other categories of descriptors. This indicates that, despite being categorized into different groups, these descriptors provide similar information. From this, one can conclude that many of them have been introduced without due consideration.

Again, the question of how useful such indices are seems to be quite important and deserves further attention. By using all three data sets, the most suitable descriptor subset contains those measures which have the largest number of significant correlations with the remaining descriptors but are not significantly correlated with each other. This subset forms a reduced set of descriptors (the original set contains descriptors), and its sizes are feasible approximations of the effective dimension of the descriptor space by using all three datasets. For each individual data set, we found the size of to be ( dataset), ( dataset) and ( dataset). Because most of the descriptors we have used are redundant, i.e., they are highly correlated, the estimation of the effective dimension is an intriguing problem.

In our context, the dimension is the number of different descriptors among all. By performing our analysis, we obtained a lower bound on the dimension of descriptors space regarding the different classes.

Note that these descriptors (the ones in ) depend on the used data set. By inspecting these subsets, we see that the majority thereof are from the category of the edge-adjacency indices. This implies that the edge-adjacency based descriptors can capture more structural diversity when quantifying structural properties of molecular graphs. As another result of this paper, we see that it would not be appropriate to select descriptors more or less randomly for QSAR problems. Neither the random selection nor using all available descriptors would be appropriate as demonstrated in our paper.

To tackle this problem, we suggested a statistical analysis evidenced by using clustering. Again, we note that our method applied to six categories of descriptors reduces the descriptor space for three datasets. In this paper we have presented a statistical approach by using correlation test to select a smaller subset of descriptors which captures information similarly. By employing bootstrapping and a probabilistic measure for the selection process, we have identified the most informative set of descriptors. As seen, a set of descriptors can cover a dataset best, but studying this important issue in depth might be future work.