of micropollutants in a drinking water treatment plant (DWTP) by machine learning, using three years of monitoring data
for 69 micropollutants at 18 DWTPs. The molecular structure, which contains physicochemical characteristics,
was embedded as a fixed-length vector, a form well suited to data-driven analysis and machine
learning. First, the molecular structure of the micropollutants was converted to a sequence of tokens using the
simplified molecular-input line-entry system (SMILES) pair encoding tokenizer, a frequency-based tokenization
method. The token sequence was then compressed into a fixed-length vector using an autoencoder trained on diverse molecular
structures from the Chemical Entities of Biological Interest (ChEBI) database. To validate the proposed approach, a machine
learning model was trained for binary classification of micropollutant treatability, using the embedded molecular structure
together with external features such as concentration, season, and the presence of specific drinking water
treatment processes. The accuracy of the developed model for the 69 micropollutants in this
study was 0.86, and the molecular structure was determined to be the most important feature. Furthermore, an
accuracy of 0.71 was obtained in external validation for pharmaceuticals and personal care products that were
not used for training. This shows that the proposed embedding vector generalizes to molecules unseen
during training, indicating that it reflects the characteristics of the molecular structures.
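
The frequency-based tokenization step can be illustrated with a minimal sketch of pair-merge learning on SMILES strings. This is not the tokenizer used in the study: the five-molecule corpus, the function names (learn_merges, apply_merge, tokenize), and the number of merges are all assumptions chosen for illustration, and the sketch starts from single characters for brevity, which would split multi-character atoms such as Cl.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Assumption: character-level starting tokens (a simplification).
    sequences = [list(smiles) for smiles in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once; nothing worth merging
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def tokenize(smiles, merges):
    """Tokenize a SMILES string by applying the learned merges in order."""
    seq = list(smiles)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical toy corpus: ethanol, ethylamine, chloroethane, phenol, acetic acid.
corpus = ["CCO", "CCN", "CCCl", "c1ccccc1O", "CC(=O)O"]
merges = learn_merges(corpus, num_merges=3)

for smiles in corpus:
    print(smiles, tokenize(smiles, merges))
```

Frequent substructures end up as single tokens, so common fragments are represented compactly before the sequence is passed to an encoder. In practice the merges would be learned from a large molecular corpus rather than a handful of strings.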