Using natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters
Literature Information
Jiayun Pang, Alexander W. R. Pine, Abdulai Sulemana
Hansen solubility parameters (HSPs) comprise three components, δd, δp and δh, which account for the dispersion forces, polar forces and hydrogen bonding of a molecule, and were designed to clarify how molecular structure affects miscibility and solubility. HSPs are widely used throughout the pharmaceutical research pipeline, yet they have not been studied computationally as thoroughly as aqueous solubility. In the current study, we predict HSPs from the SMILES of molecules alone, using a molecular embedding approach inspired by Natural Language Processing (NLP). Two pre-trained deep learning models, Mol2Vec and ChemBERTa, were used to derive the embeddings. A dataset of ∼1200 organic molecules with experimentally determined HSPs served as the labelled dataset. Upon finetuning, the ChemBERTa model “learned” relevant molecular features and shifted attention to the functional groups that give rise to the relevant HSPs. The finetuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan fingerprint method, albeit not to a significant extent. Interestingly, the embedding models predict δd significantly better than δh and δp, and overall the accuracy of the predicted HSPs is lower than that achieved for the well-benchmarked ESOL aqueous solubility dataset. Our study indicates that the extent of transfer learning leveraged from the pre-trained models depends on the labelled molecular property. It also highlights that δp and δh may carry large intrinsic errors arising from the way they are defined, which introduces inherent limitations to their accurate prediction by machine learning models. These findings will help explore the potential of BERT-based models for molecular property prediction. They may also guide possible refinement of the Hansen solubility framework, which would have a wide impact across the pharmaceutical industry and research.
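To make the three components concrete, the sketch below shows how δd, δp and δh are conventionally combined in the Hansen framework: the total solubility parameter is the root sum of squares of the components, and the Hansen distance Ra between two substances weights the dispersion term by a factor of 4. This illustrates the standard framework only, not the prediction pipeline of the study. The names `HSP`, `total_parameter` and `hansen_distance`, and the numeric values, are hypothetical and chosen purely for illustration.

```python
import math
from typing import NamedTuple


class HSP(NamedTuple):
    """Hansen solubility parameters of one substance (same units for all three)."""
    d: float  # dispersion component, delta_d
    p: float  # polar component, delta_p
    h: float  # hydrogen-bonding component, delta_h


def total_parameter(hsp: HSP) -> float:
    """Total solubility parameter: sqrt(d^2 + p^2 + h^2)."""
    return math.sqrt(hsp.d ** 2 + hsp.p ** 2 + hsp.h ** 2)


def hansen_distance(a: HSP, b: HSP) -> float:
    """Hansen distance Ra, with the conventional factor of 4 on the dispersion term."""
    return math.sqrt(
        4.0 * (a.d - b.d) ** 2 + (a.p - b.p) ** 2 + (a.h - b.h) ** 2
    )


if __name__ == "__main__":
    # Illustrative placeholder values, NOT measured data.
    solvent = HSP(d=16.0, p=8.0, h=10.0)
    solute = HSP(d=18.0, p=6.0, h=7.0)
    print("total (solvent):", total_parameter(solvent))
    print("Ra (solvent, solute):", hansen_distance(solvent, solute))
```

A smaller Ra indicates more similar interaction profiles and therefore a higher likelihood of miscibility. Because Ra depends on all three components, errors in predicted δp and δh propagate directly into any miscibility estimate built on it.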