Battery Data is a research project that uses natural language processing (NLP) and text mining to build comprehensive databases of battery materials.
The project aims to advance battery technology by extracting and analyzing valuable information from scientific research papers in this field.
A battery materials property database was auto-generated using a bespoke battery version of ChemDataExtractor, a tool that uses a hybrid of rule-based and machine-learning algorithms to interpret the natural-language text that makes up the majority of scientific documents.
Database auto-generation techniques include web scraping, chemical named entity recognition (CNER), and relationship extraction. A desktop graphical user interface (GUI) is provided to make the database easier to use and to visualise.
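To make the idea of relationship extraction concrete, the following is a deliberately tiny, rule-based sketch. It is not ChemDataExtractor and does not use its API: the pattern, the PropertyRecord type and the extract_records function are hypothetical names invented for this illustration, and a real extractor handles far more sentence forms, units and chemical names.

```python
import re
from typing import List, NamedTuple


class PropertyRecord(NamedTuple):
    """One (material, property, value, unit) relation found in text."""
    material: str
    prop: str
    value: float
    unit: str


# Toy pattern (an assumption for illustration only): it matches sentences of
# the form "<property> of <material> is|was <number> <unit>".
PATTERN = re.compile(
    r"(?P<prop>capacity|voltage|conductivity)\s+of\s+"
    r"(?P<material>[A-Z][A-Za-z0-9()]*)\s+(?:is|was)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z]+(?:/[A-Za-z]+)?)"
)


def extract_records(text: str) -> List[PropertyRecord]:
    """Return every relation in the text that the toy pattern recognises."""
    return [
        PropertyRecord(m["material"], m["prop"], float(m["value"]), m["unit"])
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "The capacity of LiFePO4 is 170 mAh/g."
    for record in extract_records(sentence):
        print(record)
```

Each match becomes one structured record, which is the kind of row a property database stores. Sentences that do not fit the pattern are silently skipped, which is the main weakness of purely rule-based extraction.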
Large language models, including BatteryBERT, BatterySciBERT and BatteryOnlyBERT, are released for text-mining use. These models can be fine-tuned for downstream tasks. All the transformer models can be downloaded from the batterydata namespace on the Hugging Face Hub, which is accessible through the Python package transformers.
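As a minimal sketch of how such a model might be loaded with transformers: the batterydata namespace comes from the description above, but the model name batterybert-cased and the helper names hub_id and load_encoder are assumptions made for this example, so check the Hub for the exact identifiers. Running the final part also assumes PyTorch is installed.

```python
# The namespace is taken from the project description; the model name is an
# assumption for illustration and should be checked against the Hub.
HUB_NAMESPACE = "batterydata"
EXAMPLE_MODEL = "batterybert-cased"  # hypothetical model name


def hub_id(model_name: str, namespace: str = HUB_NAMESPACE) -> str:
    """Build the '<namespace>/<model>' identifier used by the Hub."""
    return f"{namespace}/{model_name}"


def load_encoder(model_name: str = EXAMPLE_MODEL):
    """Download (or load from the local cache) a tokenizer and encoder."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    repo = hub_id(model_name)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_encoder()
    inputs = tokenizer("LiFePO4 is a common cathode material.", return_tensors="pt")
    outputs = model(**inputs)
    # One hidden-state vector per token: (batch, tokens, hidden_size).
    print(outputs.last_hidden_state.shape)
```

The encoder returned here has no task head; for fine-tuning on a downstream task, the matching task class (for example AutoModelForSequenceClassification) would be loaded from the same identifier instead.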
The extractive Q&A system and search engine are built on BatteryBERT fine-tuned for question answering. They let users probe the database with dynamic questions and explore the knowledge about battery materials recorded in the literature.
The document classifier is built on BatteryBERT fine-tuned for sequence classification; given a paragraph from an abstract or full text, it labels the text as battery or non-battery.
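The following sketch shows one way an extractive Q&A model could be used to probe several passages and keep the most confident answer. It is an illustration under stated assumptions, not the project's own search engine: the model identifier is a guess that should be checked on the Hub, and build_qa_pipeline and best_answer are hypothetical helper names.

```python
from typing import Callable, Iterable, Optional

# Hypothetical identifier; check the batterydata namespace for the real name.
QA_MODEL = "batterydata/batterybert-cased-squad-v1"


def build_qa_pipeline(model_id: str = QA_MODEL):
    """Create a transformers question-answering pipeline (downloads weights)."""
    # Imported lazily so best_answer can be used and tested without transformers.
    from transformers import pipeline

    return pipeline("question-answering", model=model_id, tokenizer=model_id)


def best_answer(
    qa: Callable[..., dict], question: str, contexts: Iterable[str]
) -> Optional[dict]:
    """Ask the same question of every context and keep the top-scoring answer.

    qa is any callable accepting question= and context= keyword arguments and
    returning a dict with at least the keys "answer" and "score".
    """
    best = None
    for context in contexts:
        result = qa(question=question, context=context)
        if best is None or result["score"] > best["score"]:
            # Remember which passage the answer was extracted from.
            best = {**result, "context": context}
    return best


if __name__ == "__main__":
    qa = build_qa_pipeline()
    passages = [
        "The cathode was LiFePO4 and the anode was graphite.",
        "Cells were cycled at room temperature.",
    ]
    print(best_answer(qa, "What is the cathode?", passages))
```

Because the answer is a span copied from a passage, it can always be traced back to its source text, which is what distinguishes extractive Q&A from free-form generation.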
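A classifier of this kind could be used to filter a corpus down to battery-related paragraphs, as in the sketch below. The model identifier, the helper names and the 0.5 threshold are assumptions for illustration. The label string that means "battery" depends on the model configuration (it can be inspected through the pipeline's model.config.id2label), so it is passed in explicitly rather than guessed.

```python
from typing import Callable, Iterable, List

# Hypothetical identifier; check the batterydata namespace for the real name.
CLASSIFIER_MODEL = "batterydata/batterybert-cased-abstract"


def build_classifier(model_id: str = CLASSIFIER_MODEL):
    """Create a transformers text-classification pipeline (downloads weights)."""
    # Imported lazily so the filter below can be tested without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id, tokenizer=model_id)


def keep_battery_paragraphs(
    classifier: Callable[[str], List[dict]],
    paragraphs: Iterable[str],
    battery_label: str,
    threshold: float = 0.5,
) -> List[str]:
    """Return the paragraphs predicted as battery_label with enough confidence.

    classifier is any callable that maps one string to a list whose first item
    is a dict with the keys "label" and "score".
    """
    kept = []
    for text in paragraphs:
        prediction = classifier(text)[0]
        if prediction["label"] == battery_label and prediction["score"] >= threshold:
            kept.append(text)
    return kept


if __name__ == "__main__":
    clf = build_classifier()
    # Inspect the label names before choosing battery_label.
    print(clf.model.config.id2label)
```

Raising the threshold trades recall for precision: fewer non-battery paragraphs slip through, at the cost of discarding borderline battery text.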
BatteryDataExtractor is an open-source toolkit of transformer-based text-mining software that is specifically designed to extract and analyze battery-related text data. By enabling researchers to extract meaningful insights from large volumes of scientific research papers related to battery technology, BatteryDataExtractor has the potential to accelerate the pace of research and drive progress in this important field.