Package _global.tri.oxidationstates
Class Main
java.lang.Object
_global.tri.oxidationstates.Main
This is the main class containing the starting points for various oxidation analyzer routines.
-
Field Summary
Fields
Modifier and Type, Field - Description
static String CAMD_DIR - This directory contains the data set from "Novel inorganic crystal structures predicted using autonomous simulation agents" (https://doi.org/10.1038/s41597-022-01438-8)
static String GII_DIR - This subdirectory contains structures with assigned oxidation states, as assigned by different methods.
static String PARAMETER_DIR - This is where we store the model parameters (oxidation state boundaries)
static String ROOT_DIR - All input and output files should be contained under this directory
static String STRUCT_DIR - This contains all of the structures from the broad ICSD data set
static String TRAINING_DATA_DIR - This is where we store the training data sets
-
Constructor Summary
Constructors
Constructor - Description
Main()
-
Method Summary
Modifier and Type, Method - Description
static void addWebIonsToData(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType) - The "web ions" are the polyatomic ions used in the manuscript and web site.
static void assignBertosStates() - Reads the states calculated by BERTOS based on composition and assigns them to sites in a way that minimizes the GII.
static void calcGIIForPymatgenStructs() - Reads the states calculated by PyMatGen and assigns them to sites in a way that minimizes the GII.
static void cleanDataGII(_global.tri.oxidationstates.Main.DataType inType, _global.tri.oxidationstates.Main.DataType outType) - Looks through all entries of the data set given by inType and determines whether the GII is lower than the GII for the ICSD (calculated by assigning ICSD states to sites in a way that minimizes the GII).
static void compareAssignments(_global.tri.oxidationstates.Main.DataType dataType) - This method prints out summary statistics comparing sets of oxidation states assignments (e.g. from different methods for predicting oxidation states) for a given data set.
static void findIons(_global.tri.oxidationstates.Main.DataType dataType) - Looks through the training set for the given data type to extract polyatomic ions, as determined by the OxideIonFinder, ZintlIonFinder, or CompositionIonFinder.
static void fitParameters(OxidationStateData data, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads) - Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1.
static void fitParameters(OxidationStateData data, LikelihoodCalculator initCalculator, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads) - Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1.
static void getCAMDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean noPoly) - Assign oxidation states to the data set used in the CAMD search for new materials.
static void getCAMDDiscoveryCurve(_global.tri.oxidationstates.Main.DataType dataType) - This prints the data showing the percent of compositions with stable structures found vs the percent of total compositions evaluated using the CAMD data sorted by Likelihood score.
static void getCAMDHullHistogram(_global.tri.oxidationstates.Main.DataType dataType) - Get the histogram of what percentage of structures are on the hull as a function of likelihood score for the CAMD data.
static OxidationStateData getData(_global.tri.oxidationstates.Main.DataType dataType) - Return the data set for the given data type.
static String getDataFileName(_global.tri.oxidationstates.Main.DataType dataType) - Return the file name for the given data set.
getEnergiesAboveHull - Reads energies above the convex hull from a file extracted from the Materials Project and creates a Map (dictionary, for you Python folks) keyed by the icsd_id and with the energy above the hull as the value.
static void getICSDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean reassign) - Assigns oxidation states to all of the structures in the given dataset using oxidation states provided in the ICSD.
static void getInitialTrainingData() - Starting with a directory of CIF files extracted from the ICSD, select all charge-balanced structures and use them to construct the initial training data file.
static String getIonStructureDirName(_global.tri.oxidationstates.Main.DataType dataType) - Return the polyatomic ion structure directory name for the given data set.
static LikelihoodCalculator getLikelihoodCalculator(_global.tri.oxidationstates.Main.DataType dataType) - Return a likelihood calculator for the given data type.
static String getParamFileName(_global.tri.oxidationstates.Main.DataType dataType) - Returns the name of the fitted parameter file for the given data type.
static String getPolyatomicIonDirName(_global.tri.oxidationstates.Main.DataType dataType) - Return the name of the directory of found polyatomic ions for the given data set.
static void getTrainingAssigments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency) - Assigns oxidation states to all of the structures in the given dataset using the model trained on that dataset and writes out corresponding CIF files to a directory.
static void getValidationAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency) - Assigns oxidation states calculated using 10-fold cross validation to all of the structures in the given dataset and writes out corresponding CIF files to a directory.
static String getWebIonDirName() - Returns the directory name for the polyatomic ions used for the paper and web site.
static void groupIons(_global.tri.oxidationstates.Main.DataType dataType, int minOccurrences, int numSamples) - Places ions of the same composition into groups of structurally similar ions, where the atomic oxidation states of all ions in the set need to match.
static void groupIonsByOxidationState(double minAllowedFraction, boolean removeZeroes, int minAllowedOccurrences) - Place the mean structures found by groupIons(_global.tri.oxidationstates.Main.DataType,int,int) in groups by total oxidation state, calculated by adding the oxidation states of all atoms in the ion.
static void loadWebIons() - Loads the polyatomic ions used for the web site and paper to the IonFactory.
static void main(String[] args) - This is the main entry point for the program.
static void prepareDataFromICSD(_global.tri.oxidationstates.Main.DataType outDataType) - Reads a directory of CIF files exported from the ICSD and builds an initial training data file for all ordered, charge-balanced structures.
static void printAllKnownStates(_global.tri.oxidationstates.Main.DataType dataType) - Write all of the distinct oxidation states in a given data set.
static void printElectrochemicalSeries(_global.tri.oxidationstates.Main.DataType dataType) - Generate a table for the electrochemical series, consisting of boundaries for all redox pairs in the data set.
static void removeRareIons(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType) - Removes all entries with rare ions from a data set, where an ion is considered "rare" if it appears in fewer than 25 entries.
static void splitData(OxidationStateData data, int numSplits, String outDirName) - Randomly splits the data for leave-k-out cross validation in a way that ensures that no composition appears in both the test and training set for any split.
static OxidationStateSet stateSetFromBertosString(String bertosString) - Reads an oxidation state prediction as output by BERTOS and converts it into an OxidationStateSet for use with this code.
static void testWebAPI() - This method provides an example of how to call the API to replicate the table-generating functionality of the web app.
static void trainModel(String[] args) - This method is designed for use in a command-line interface to fit the model.
static void writeBoundaryJSON(_global.tri.oxidationstates.Main.DataType dataType) - Write the JSON file containing the oxidation state boundaries.
static void writeDataFilesForVisualization(_global.tri.oxidationstates.Main.DataType dataType, boolean smoothCutoff) - Generate the files we use to generate the wavy bar plots (or equivalent straight line plots).
static void writeXYZFiles() - Write files in xyz format for each of the polyatomic ions used for the web site and paper.
-
Field Details
-
ROOT_DIR
All input and output files should be contained under this directory -
STRUCT_DIR
This contains all of the structures from the broad ICSD data set -
TRAINING_DATA_DIR
This is where we store the training data sets -
PARAMETER_DIR
This is where we store the model parameters (oxidation state boundaries) -
GII_DIR
This subdirectory contains structures with assigned oxidation states, as assigned by different methods. It also should contain files for oxidation state assignments output from pymatgen and BERTOS. -
CAMD_DIR
This directory contains the data set from "Novel inorganic crystal structures predicted using autonomous simulation agents" (https://doi.org/10.1038/s41597-022-01438-8)
-
-
Constructor Details
-
Main
public Main()
-
-
Method Details
-
main
This is the main entry point for the program. Other routines are called from here.
Parameters:
args - Command line arguments
-
trainModel
This method is designed for use in a command-line interface to fit the model.
Parameters:
args - The arguments that can be passed to this method. They will be parsed as follows:
args[0]: The name of the file containing the training data.
args[1]: The name of the file to be written containing parameters of the trained model. The file name should start with "parameters". A corresponding file containing boundary values will also be written, where the name of the boundaries file will be the name of the parameters file with "parameters" replaced with "boundaries".
args[2]: The name of the directory containing the atomic structures of allowed polyatomic ions.
args[3] (optional): This many optimization steps will be run before updating the files "boundaries.txt" and "parameters.txt" and restarting the optimization from the written values. The default value is 1000.
args[4] (optional): The parameter to be multiplied by the sum of the spreads in the minimum and maximum boundary values for regularization. The default value is 0.
args[5] (optional): The number of threads to run simultaneously for the optimization. The default value is 1.
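The argument layout described for trainModel can be sketched as a small parser. This is a minimal illustration only: the class TrainModelArgs and its field names are hypothetical and are not part of the documented API, and the sketch assumes that only the leading "parameters" prefix of the file name is swapped for "boundaries".

```java
/**
 * Hypothetical holder for the trainModel arguments. The class and its field
 * names are illustrative and are not part of the documented API.
 */
final class TrainModelArgs {
    final String dataFileName;
    final String paramFileName;
    final String boundaryFileName;
    final String ionDirName;
    final int maxIterationsPerPass;
    final double regParameter;
    final int numThreads;

    private TrainModelArgs(String dataFileName, String paramFileName, String ionDirName,
                           int maxIterationsPerPass, double regParameter, int numThreads) {
        this.dataFileName = dataFileName;
        this.paramFileName = paramFileName;
        // Assumption: only the leading "parameters" prefix is replaced.
        this.boundaryFileName = "boundaries" + paramFileName.substring("parameters".length());
        this.ionDirName = ionDirName;
        this.maxIterationsPerPass = maxIterationsPerPass;
        this.regParameter = regParameter;
        this.numThreads = numThreads;
    }

    /** Parses args[0]..args[5], applying the documented defaults for the optional ones. */
    static TrainModelArgs parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 arguments, got " + args.length);
        }
        if (!args[1].startsWith("parameters")) {
            throw new IllegalArgumentException("Parameter file name should start with \"parameters\"");
        }
        int maxIterations = args.length > 3 ? Integer.parseInt(args[3]) : 1000; // default 1000
        double regParameter = args.length > 4 ? Double.parseDouble(args[4]) : 0.0; // default 0
        int numThreads = args.length > 5 ? Integer.parseInt(args[5]) : 1; // default 1
        return new TrainModelArgs(args[0], args[1], args[2], maxIterations, regParameter, numThreads);
    }
}
```

For example, parsing {"data.txt", "parameters_run1.txt", "ions"} yields the defaults 1000, 0 and 1, with a boundary file name of "boundaries_run1.txt".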
-
fitParameters
public static void fitParameters(OxidationStateData data, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1.
Parameters:
data - The training data
maxIterationsPerPass - After this many optimization steps have been performed, the code will write the current parameters to a file and restart optimization from those parameters.
regParameter - The regularization parameter to determine how much weight to give the regularization term.
paramFileName - The name of the output file containing the parameters. The file name should start with "parameters". A separate file containing the boundary values will also be written, where the name of the boundary file will be the same as the name of the parameters file, with the word "parameters" replaced by the word "boundaries".
numThreads - The number of threads to use when training the model using parallel processing.
-
fitParameters
public static void fitParameters(OxidationStateData data, LikelihoodCalculator initCalculator, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1.
Parameters:
data - The training data
initCalculator - The Likelihood calculator containing an initial version of the model
maxIterationsPerPass - After this many optimization steps have been performed, the code will write the current parameters to a file and restart optimization from those parameters.
regParameter - The regularization parameter to determine how much weight to give the regularization term.
paramFileName - The name of the output file containing the parameters. The file name should start with "parameters". A separate file containing the boundary values will also be written, where the name of the boundary file will be the same as the name of the parameters file, with the word "parameters" replaced by the word "boundaries".
numThreads - The number of threads to use when training the model using parallel processing.
-
prepareDataFromICSD
public static void prepareDataFromICSD(_global.tri.oxidationstates.Main.DataType outDataType)
Reads a directory of CIF files exported from the ICSD and builds an initial training data file for all ordered, charge-balanced structures. The CIF files should be in ICSD_CIFS/All CIFs/. POSCAR-formatted structures will be written to the Structures directory.
Parameters:
outDataType - The dataType for the generated training data
-
findIons
public static void findIons(_global.tri.oxidationstates.Main.DataType dataType)
Looks through the training set for the given data type to extract polyatomic ions, as determined by the OxideIonFinder, ZintlIonFinder, or CompositionIonFinder.
Parameters:
dataType - The data set to search through.
-
groupIons
public static void groupIons(_global.tri.oxidationstates.Main.DataType dataType, int minOccurrences, int numSamples)
Places ions of the same composition into groups of structurally similar ions, where the atomic oxidation states of all ions in the set need to match.
Parameters:
dataType - The data set for which we are grouping the ions
minOccurrences - The method will ignore any compositions for which the number of found ions is less than this value. This is useful for screening out rare, very large ions that take a long time to group.
numSamples - The number of structures to sample when calculating the mean structure for the group
-
groupIonsByOxidationState
public static void groupIonsByOxidationState(double minAllowedFraction, boolean removeZeroes, int minAllowedOccurrences)
Place the mean structures found by groupIons(_global.tri.oxidationstates.Main.DataType,int,int) in groups by total oxidation state, calculated by adding the oxidation states of all atoms in the ion. The atomic oxidation states do not need to match each other for ions to be placed in the same group. For each group, selects a representative structure that is structurally similar to all other ions in the group.
Parameters:
minAllowedFraction - A representative structure will only be generated if the ratio of the number of structurally similar ions (ignoring total oxidation states) to the total number of ions with the same composition is at least this.
removeZeroes - True if any ions with zero oxidation state should be removed
minAllowedOccurrences - A representative structure will only be generated if the ratio of the number of structurally similar ions (ignoring total oxidation states) to the total number of ions with the same composition is greater than this.
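Grouping by total oxidation state can be illustrated with a short sketch. It is not the library's implementation: the class IonGroupingSketch is hypothetical, ions are reduced to arrays of integer atomic oxidation states, and removeZeroes is assumed to refer to ions whose total oxidation state is zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; an ion is represented only by its atomic oxidation states. */
final class IonGroupingSketch {

    /** Total oxidation state of an ion: the sum of the oxidation states of its atoms. */
    static int totalOxidationState(int[] atomicStates) {
        int total = 0;
        for (int state : atomicStates) {
            total += state;
        }
        return total;
    }

    /** Groups ions by total oxidation state; the atomic states need not match within a group. */
    static Map<Integer, List<int[]>> groupByTotal(List<int[]> ions, boolean removeZeroes) {
        Map<Integer, List<int[]>> groups = new TreeMap<>();
        for (int[] ion : ions) {
            int total = totalOxidationState(ion);
            if (removeZeroes && total == 0) {
                continue; // Assumption: "zero oxidation state" means a zero total.
            }
            groups.computeIfAbsent(total, key -> new ArrayList<>()).add(ion);
        }
        return groups;
    }
}
```

Two assignments for an S2O3 ion, {6, -2, -2, -2, -2} and {5, -1, -2, -2, -2}, differ atom by atom but both sum to -2, so they land in the same group.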
-
getInitialTrainingData
public static void getInitialTrainingData()
Starting with a directory of CIF files extracted from the ICSD, select all charge-balanced structures and use them to construct the initial training data file.
-
getEnergiesAboveHull
Reads energies above the convex hull from a file extracted from the Materials Project and creates a Map (dictionary, for you Python folks) keyed by the icsd_id and with the energy above the hull as the value.
Returns:
A Map keyed by the icsd_id and with the energy above the hull as the value.
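A map of this shape could be built as in the sketch below. The input layout is an assumption made purely for illustration (one "icsd_id energy" pair per line, whitespace separated, with "#" comment lines); the actual file format is not described here. The class HullEnergySketch and the use of String keys are likewise hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; the real file layout and key type may differ. */
final class HullEnergySketch {

    /** Reads "icsd_id energy" lines into a map keyed by icsd_id. */
    static Map<String, Double> readEnergies(Reader source) {
        Map<String, Double> energiesById = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // Skip blank lines and comments (assumed convention).
                }
                String[] fields = line.split("\\s+");
                energiesById.put(fields[0], Double.parseDouble(fields[1]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return energiesById;
    }
}
```

Passing a java.io.FileReader for a real file, or a java.io.StringReader for a quick check, works the same way.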
-
addWebIonsToData
public static void addWebIonsToData(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
The "web ions" are the polyatomic ions used in the manuscript and web site. This method identifies all entries in the "inDataType" data set that contain these ions, creates a new version of that entry that contains the composition written in terms of polyatomic ions, and adds the new entry to the data set. The new data set is written to the file corresponding to "outDataType". TODO this whole method should probably be added to the OxidationStateData class.
Parameters:
inDataType - The data set to which entries with polyatomic ions should be added
outDataType - The name of the new data set
-
removeRareIons
public static void removeRareIons(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
Removes all entries with rare ions from a data set, where an ion is considered "rare" if it appears in fewer than 25 entries. The ion removal process is run repeatedly until no more ions are removed.
Parameters:
inDataType - The data set from which ions should be removed
outDataType - The new data set to be written.
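The removal has to be repeated because dropping an entry lowers the counts of every other ion in that entry, which can push a previously common ion below the threshold. A minimal sketch of such a fixed-point loop follows; the class RareIonFilterSketch is hypothetical, and entries are reduced to sets of ion labels rather than the library's own data types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; each entry is represented by the set of ion labels it contains. */
final class RareIonFilterSketch {

    /** Repeatedly drops entries containing an ion seen in fewer than minEntries entries. */
    static List<Set<String>> removeRareIons(List<Set<String>> entries, int minEntries) {
        List<Set<String>> current = new ArrayList<>(entries);
        boolean removedAny = true;
        while (removedAny) {
            // Count the number of remaining entries in which each ion appears.
            Map<String, Integer> counts = new HashMap<>();
            for (Set<String> entry : current) {
                for (String ion : entry) {
                    counts.merge(ion, 1, Integer::sum);
                }
            }
            // Keep only the entries whose ions are all common enough.
            List<Set<String>> kept = new ArrayList<>();
            for (Set<String> entry : current) {
                boolean hasRareIon = false;
                for (String ion : entry) {
                    if (counts.get(ion) < minEntries) {
                        hasRareIon = true;
                        break;
                    }
                }
                if (!hasRareIon) {
                    kept.add(entry);
                }
            }
            removedAny = kept.size() < current.size();
            current = kept;
        }
        return current;
    }
}
```

With a threshold of 2 and entries {A, B}, {A, B}, {B, C}, {C, D}, the first pass removes {C, D} because D appears once; that leaves C with a single occurrence, so a second pass removes {B, C}. The documented threshold would correspond to minEntries = 25.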
-
getValidationAssignments
public static void getValidationAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
Assigns oxidation states calculated using 10-fold cross validation to all of the structures in the given dataset and writes out corresponding CIF files to a directory. The description of each CIF file will be the GII for that assignment, and oxidation states will be assigned to sites in a way that minimizes the GII.
Parameters:
dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
frequency - True if the Frequency Score should be used to assign oxidation states, false if the Likelihood score should be used.
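For readers unfamiliar with the GII (global instability index): it is commonly defined as the root-mean-square deviation between each site's bond valence sum and the magnitude of its assigned oxidation state. The sketch below computes that quantity from precomputed bond valence sums. The class GiiSketch is hypothetical, and the sketch does not show how the library computes bond valences or searches over site assignments.

```java
/** Hypothetical sketch of the usual GII definition; inputs are per-site values. */
final class GiiSketch {

    /**
     * Root-mean-square deviation between bond valence sums (given as positive
     * magnitudes) and the magnitudes of the assigned oxidation states.
     */
    static double gii(double[] bondValenceSums, double[] oxidationStates) {
        if (bondValenceSums.length == 0 || bondValenceSums.length != oxidationStates.length) {
            throw new IllegalArgumentException("Need one bond valence sum per site");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < bondValenceSums.length; i++) {
            double deviation = bondValenceSums[i] - Math.abs(oxidationStates[i]);
            sumOfSquares += deviation * deviation;
        }
        return Math.sqrt(sumOfSquares / bondValenceSums.length);
    }
}
```

A lower value means the assigned states agree better with the local bonding environment, which is why assignments are chosen to minimize it.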
-
getTrainingAssigments
public static void getTrainingAssigments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
Assigns oxidation states to all of the structures in the given dataset using the model trained on that dataset and writes out corresponding CIF files to a directory. The description of each CIF file will be the GII for that assignment, and oxidation states will be assigned to sites in a way that minimizes the GII.
Parameters:
dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
frequency - True if the Frequency Score should be used to assign oxidation states, false if the Likelihood score should be used.
-
getICSDAssignments
public static void getICSDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean reassign)
Assigns oxidation states to all of the structures in the given dataset using oxidation states provided in the ICSD. The description of each CIF file will be the GII for that assignment.
Parameters:
dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
reassign - True if the oxidation states should be reassigned to sites in a way that minimizes the GII, which is usually (but not always) how they are assigned in the ICSD. False if we should just use the ICSD assignments.
-
assignBertosStates
public static void assignBertosStates()
Reads the states calculated by BERTOS based on composition and assigns them to sites in a way that minimizes the GII. The assigned states are used to calculate the GII.
-
stateSetFromBertosString
Reads an oxidation state prediction as output by BERTOS and converts it into an OxidationStateSet for use with this code.
Parameters:
bertosString - A set of oxidation states for a given composition as calculated using BERTOS
Returns:
An OxidationStateSet representing the assignment in the bertosString.
-
calcGIIForPymatgenStructs
public static void calcGIIForPymatgenStructs()
Reads the states calculated by PyMatGen and assigns them to sites in a way that minimizes the GII. The assigned states are used to calculate the GII.
-
cleanDataGII
public static void cleanDataGII(_global.tri.oxidationstates.Main.DataType inType, _global.tri.oxidationstates.Main.DataType outType)
Looks through all entries of the data set given by inType and determines whether the GII is lower than the GII for the ICSD (calculated by assigning ICSD states to sites in a way that minimizes the GII). A new data set, outType, is created with such entries removed. Rare ions are also removed. To facilitate calculations and analysis, additional files are written listing the "remainder" entries (those that were removed) based on GII and rare ions. Test-training splits for leave-10-out cross-validation are also generated.
Parameters:
inType - The data set that we are cleaning
outType - The cleaned data set
-
splitData
Randomly splits the data for leave-k-out cross validation in a way that ensures that no composition appears in both the test and training set for any split.
Parameters:
data - The data set to be split
numSplits - The number of splits to generate (What "k" is in leave-k-out cross validation).
outDirName - Where to write the splits being generated
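One common way to guarantee that no composition straddles a test/training boundary is to shuffle whole composition groups rather than individual entries. The sketch below takes that approach. It is an illustration, not the library's algorithm: the class CompositionSplitSketch, the explicit seed argument, and the round-robin assignment of groups to splits are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch; entries are identified by index and grouped by composition string. */
final class CompositionSplitSketch {

    /**
     * Returns one list of test-set entry indices per split. The training set of a
     * split is every entry that is not in that split's test set.
     */
    static List<List<Integer>> testSets(List<String> compositions, int numSplits, long seed) {
        // Collect the indices of all entries that share a composition.
        Map<String, List<Integer>> byComposition = new LinkedHashMap<>();
        for (int i = 0; i < compositions.size(); i++) {
            byComposition.computeIfAbsent(compositions.get(i), key -> new ArrayList<>()).add(i);
        }
        // Shuffle whole groups so that a composition is never divided.
        List<List<Integer>> groups = new ArrayList<>(byComposition.values());
        Collections.shuffle(groups, new Random(seed));

        List<List<Integer>> splits = new ArrayList<>();
        for (int s = 0; s < numSplits; s++) {
            splits.add(new ArrayList<>());
        }
        // Deal the shuffled groups out round-robin.
        for (int g = 0; g < groups.size(); g++) {
            splits.get(g % numSplits).addAll(groups.get(g));
        }
        return splits;
    }
}
```

Because every entry of a composition lands in exactly one test set, that composition is absent from the corresponding training set by construction.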
-
compareAssignments
public static void compareAssignments(_global.tri.oxidationstates.Main.DataType dataType)
This method prints out summary statistics comparing sets of oxidation states assignments (e.g. from different methods for predicting oxidation states) for a given data set.
Parameters:
dataType - The data set for which we are comparing assignments.
-
getCAMDAssignments
public static void getCAMDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean noPoly)
Assign oxidation states to the data set used in the CAMD search for new materials.
Parameters:
dataType - The data set that we will use to generate the assignments (we will use parameters fit to this data).
noPoly - True if we should write the CAMD compositions in terms of polyatomic ions, false otherwise.
-
getCAMDDiscoveryCurve
public static void getCAMDDiscoveryCurve(_global.tri.oxidationstates.Main.DataType dataType)
This prints the data showing the percent of compositions with stable structures found vs the percent of total compositions evaluated using the CAMD data sorted by Likelihood score.
Parameters:
dataType - The data set for which analysis should be performed. (The oxidation analyzer was trained using this data set.)
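A discovery curve of this kind can be computed by ranking compositions by score and accumulating the stable ones. The sketch below assumes that compositions are evaluated from highest to lowest likelihood score and that each composition carries a single stable/not-stable flag. The class DiscoveryCurveSketch is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch; one score and one stability flag per composition. */
final class DiscoveryCurveSketch {

    /**
     * Returns one point per composition evaluated: [0] is the percent of all
     * compositions evaluated so far, [1] is the percent of stable compositions found.
     */
    static double[][] curve(double[] scores, boolean[] stable) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        int totalStable = 0;
        for (int i = 0; i < n; i++) {
            order[i] = i;
            if (stable[i]) {
                totalStable++;
            }
        }
        // Assumption: the highest-scoring compositions are evaluated first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        double[][] points = new double[n][2];
        int found = 0;
        for (int rank = 0; rank < n; rank++) {
            if (stable[order[rank]]) {
                found++;
            }
            points[rank][0] = 100.0 * (rank + 1) / n;
            points[rank][1] = totalStable == 0 ? 0.0 : 100.0 * found / totalStable;
        }
        return points;
    }
}
```

If the score is informative, the curve rises well above the diagonal: most stable compositions are found after evaluating only a small share of the candidates.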
-
getCAMDHullHistogram
public static void getCAMDHullHistogram(_global.tri.oxidationstates.Main.DataType dataType)
Get the histogram of what percentage of structures are on the hull as a function of likelihood score for the CAMD data.
Parameters:
dataType - The data set used to parameterize the oxidation predictions.
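Such a histogram can be sketched by binning structures on their likelihood score and reporting the on-hull percentage per bin. The sketch assumes scores lie in the range 0 to 1 and uses equal-width bins; the class HullHistogramSketch and the choice of bin count are illustrative only.

```java
/** Hypothetical sketch; assumes likelihood scores fall in the range [0, 1]. */
final class HullHistogramSketch {

    /** Percent of structures on the hull in each equal-width score bin (NaN for empty bins). */
    static double[] percentOnHull(double[] scores, boolean[] onHull, int numBins) {
        int[] totals = new int[numBins];
        int[] hits = new int[numBins];
        for (int i = 0; i < scores.length; i++) {
            // A score of exactly 1.0 is placed in the last bin.
            int bin = Math.min((int) (scores[i] * numBins), numBins - 1);
            totals[bin]++;
            if (onHull[i]) {
                hits[bin]++;
            }
        }
        double[] percent = new double[numBins];
        for (int b = 0; b < numBins; b++) {
            percent[b] = totals[b] == 0 ? Double.NaN : 100.0 * hits[b] / totals[b];
        }
        return percent;
    }
}
```

Empty bins are reported as NaN rather than 0 so that "no structures in this score range" is not mistaken for "none of them are on the hull".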
-
writeBoundaryJSON
public static void writeBoundaryJSON(_global.tri.oxidationstates.Main.DataType dataType)
Write the JSON file containing the oxidation state boundaries.
Parameters:
dataType - The data set for which we will write the boundaries.
-
writeXYZFiles
public static void writeXYZFiles()
Write files in xyz format for each of the polyatomic ions used for the web site and paper.
-
loadWebIons
public static void loadWebIons()
Loads the polyatomic ions used for the web site and paper to the IonFactory. This should usually be called first, as the code frequently checks IonFactory to get the list of known polyatomic ions. It's OK to call this multiple times on the same directory of ions.
-
getWebIonDirName
Returns the directory name for the polyatomic ions used for the paper and web site.
Returns:
the directory name for the polyatomic ions used for the paper and web site
-
printAllKnownStates
public static void printAllKnownStates(_global.tri.oxidationstates.Main.DataType dataType)
Write all of the distinct oxidation states in a given data set.
Parameters:
dataType - The data set for which we should print out the oxidation states
-
printElectrochemicalSeries
public static void printElectrochemicalSeries(_global.tri.oxidationstates.Main.DataType dataType)
Generate a table for the electrochemical series, consisting of boundaries for all redox pairs in the data set. In general it will not be sorted, so you may want to sort the output.
Parameters:
dataType - The data set for which we will print out the electrochemical series.
-
writeDataFilesForVisualization
public static void writeDataFilesForVisualization(_global.tri.oxidationstates.Main.DataType dataType, boolean smoothCutoff)
Generate the files we use to generate the wavy bar plots (or equivalent straight line plots).
Parameters:
dataType - The data set for which we are generating the plots.
smoothCutoff - True if the plots should show the logistic function cutoff, and false if they should just show the mean boundary values.
-
getData
Return the data set for the given data type.
Parameters:
dataType - The type of data set we want
Returns:
The data set containing entries with compositions and oxidation states.
-
getDataFileName
Return the file name for the given data set.
Parameters:
dataType - The data set we are interested in
Returns:
The file name
-
getIonStructureDirName
Return the polyatomic ion structure directory name for the given data set.
Parameters:
dataType - The data set we are interested in
Returns:
The directory name
-
getPolyatomicIonDirName
Return the name of the directory of found polyatomic ions for the given data set.
Parameters:
dataType - The data set we are interested in
Returns:
The directory name
-
getLikelihoodCalculator
public static LikelihoodCalculator getLikelihoodCalculator(_global.tri.oxidationstates.Main.DataType dataType)
Return a likelihood calculator for the given data type.
Parameters:
dataType - The data set we are interested in.
Returns:
A likelihood calculator using parameters trained on the given data set.
-
getParamFileName
Returns the name of the fitted parameter file for the given data type. This method assumes a regularization parameter of 5E-6.
Parameters:
dataType - The given data type
Returns:
the name of the fitted parameter file for the given data type
-
testWebAPI
public static void testWebAPI()
This method provides an example of how to call the API to replicate the table-generating functionality of the web app.
-