Using natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters
Literature Information
Jiayun Pang, Alexander W. R. Pine, Abdulai Sulemana
Hansen solubility parameters (HSPs) have three components, δd, δp and δh, which account for the dispersion forces, polar forces and hydrogen bonding of a molecule. They were designed to better explain how molecular structure affects miscibility and solubility. HSPs are widely used throughout the pharmaceutical research pipeline, yet they have not been studied computationally as thoroughly as aqueous solubility. In the current study, we predicted HSPs using only the SMILES of molecules, applying a molecular embedding approach inspired by Natural Language Processing (NLP). Two pre-trained deep learning models, Mol2Vec and ChemBERTa, were used to derive the embeddings. A set of ∼1200 organic molecules with experimentally determined HSPs served as the labelled dataset. Upon finetuning, the ChemBERTa model “learned” relevant molecular features and shifted its attention to the functional groups that give rise to the relevant HSPs. The finetuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan fingerprint method, albeit not to a significant extent. Interestingly, the embedding models predict δd significantly better than δh and δp, and overall the accuracy of the predicted HSPs is lower than that achieved for the well-benchmarked ESOL aqueous solubility dataset. Our study indicates that the extent of transfer learning leveraged from the pre-trained models depends on the labelled molecular property. It also highlights that δp and δh may carry large intrinsic errors because of the way they are defined, which introduces inherent limitations to their accurate prediction by machine learning models. These findings will help explore the potential of BERT-based models for molecular property prediction. They may also guide a possible refinement of the Hansen solubility framework, which would have a wide impact across the pharmaceutical industry and research.
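To make the role of the three components concrete, the sketch below shows how δd, δp and δh are conventionally combined into a total solubility parameter and into the Hansen distance Ra between two substances, where a smaller Ra suggests better miscibility. This is standard Hansen-framework arithmetic, not the prediction method of the study. The function names are hypothetical, and the HSP values are approximate, commonly quoted figures in MPa^0.5 used purely for illustration; they are not taken from the study's dataset.

```python
import math

# Approximate, commonly quoted HSP values in MPa^0.5, ordered as
# (dispersion δd, polar δp, hydrogen bonding δh).
# Assumption: illustrative only, not drawn from the study's dataset.
HSP = {
    "water": (15.5, 16.0, 42.3),
    "ethanol": (15.8, 8.8, 19.4),
    "toluene": (18.0, 1.4, 2.0),
}


def total_parameter(hsp):
    """Total solubility parameter: sqrt(δd² + δp² + δh²)."""
    return math.sqrt(sum(component ** 2 for component in hsp))


def hansen_distance(a, b):
    """Hansen distance Ra between two substances.

    Ra² = 4(δd1 - δd2)² + (δp1 - δp2)² + (δh1 - δh2)²
    The factor of 4 on the dispersion term is the conventional weighting.
    """
    dd, dp, dh = (x - y for x, y in zip(a, b))
    return math.sqrt(4 * dd ** 2 + dp ** 2 + dh ** 2)


if __name__ == "__main__":
    for solvent in ("ethanol", "toluene"):
        ra = hansen_distance(HSP[solvent], HSP["water"])
        print(f"Ra({solvent}, water) = {ra:.1f} MPa^0.5")
```

With these values, ethanol sits much closer to water than toluene does, and the gap comes almost entirely from the δp and δh terms. This shows why errors in those two components matter for any downstream miscibility estimate.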